Lecture 3

11 February, 2018

Clustering

Pick out 3 subgroups as suggested by the data.

Clustering

Basic idea: chunk together observations that are similar to each other.

First important question: what does similar mean?

(Dis)similarity, proximity and metrics

Hierarchy of axiom sets

Metric

\(d:X\times X\to\mathbb{R}_{\geq 0}\)
Symmetric: \(d(x,y) = d(y,x)\)
Identity of indiscernibles: \(d(x,y)=0\) if and only if \(x=y\)
Triangle inequality: \(d(x,y)+d(y,z)\geq d(x,z)\)

A pseudometric skips the identity of indiscernibles: distinct points may be at distance 0.

An ultrametric has the stronger triangle inequality \(\max(d(x,y),d(y,z))\geq d(x,z)\)
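On a finite set of points the axioms above can be checked by brute force. The following is a minimal sketch, not a definitive implementation: the helper name `check_metric_axioms`, the tolerance `tol`, and the four sample points are all assumptions made for illustration.

```python
from itertools import product

def check_metric_axioms(points, d, tol=1e-12):
    """Report which axioms hold for d on a finite set of points (brute force)."""
    pts = list(points)
    result = {"symmetric": True, "identity": True,
              "triangle": True, "ultrametric": True}
    for x, y in product(pts, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            result["symmetric"] = False
        # Distinct points at distance 0 violate the identity of indiscernibles.
        if x != y and d(x, y) <= tol:
            result["identity"] = False
    for x, y, z in product(pts, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            result["triangle"] = False
        if max(d(x, y), d(y, z)) < d(x, z) - tol:
            result["ultrametric"] = False
    return result

# Illustrative points in the plane with the Euclidean distance.
points = [(0, 0), (1, 0), (0, 1), (2, 2)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
report = check_metric_axioms(points, euclid)
```

For these points the Euclidean distance passes the three metric axioms but fails the stronger ultrametric inequality.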

Hierarchy of axiom sets

Positive definite kernels

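\(k:X\times X\to\mathbb{R}\)
Symmetric: \(k(x,y) = k(y,x)\)
Positive semidefinite: \(\sum_{i,j} c_i c_j k(x_i,x_j)\geq 0\) for all finite sets of points \(x_i\in X\) and coefficients \(c_i\in\mathbb{R}\)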

Help guarantee convergence in many different optimization scenarios.

Hierarchy of axiom sets

Dissimilarity measures

Retains only the non-negativity condition:

\(d:X\times X\to\mathbb{R}_{\geq 0}\)

Quasi-metric

Drops symmetry.

In both of these cases, symmetry can be imposed:

\(d'(x,y) = \frac{d(x,y) + d(y,x)}{2}\)
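The averaging formula above translates directly into a wrapper around any dissimilarity function. This is a small sketch; the asymmetric `climb_cost` (going up costs twice as much as going down) is a made-up example, not something from the lecture.

```python
def symmetrize(d):
    """Return d'(x, y) = (d(x, y) + d(y, x)) / 2."""
    def d_sym(x, y):
        return (d(x, y) + d(y, x)) / 2
    return d_sym

# Hypothetical asymmetric dissimilarity on heights: uphill costs double.
def climb_cost(x, y):
    return 2 * (y - x) if y > x else (x - y)

d_sym = symmetrize(climb_cost)
up, down = climb_cost(0, 3), climb_cost(3, 0)   # 6 and 3
both_ways = (d_sym(0, 3), d_sym(3, 0))          # equal after symmetrizing
```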

Example metrics: numeric data

\(L_p\): \(d_p(x,y) = \left(\sum|x_i-y_i|^p\right)^{1/p}\).

Example metrics: numeric data

Note \(L_0\) counts non-zero entries and \(L_\infty\) uses maximum value.

\(L_0\) counts the non-zero entries among the \(|x_i-y_i|\).
\(L_1\) sums up the \(|x_i-y_i|\). Taxicab metric.
\(L_2\) is the classical Euclidean metric.
\(L_\infty\) picks the maximum of the \(|x_i-y_i|\).
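The special cases above can be collected in one function. A minimal sketch in plain Python: the name `lp_distance` and the sample vectors are assumptions for illustration, and \(p=0\) and \(p=\infty\) are handled as separate branches rather than as limits.

```python
import math

def lp_distance(x, y, p):
    """L_p distance between equal-length numeric sequences x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == 0:
        return sum(1 for v in diffs if v != 0)   # count non-zero differences
    if math.isinf(p):
        return max(diffs)                        # largest absolute difference
    return sum(v ** p for v in diffs) ** (1 / p)

x = [1.0, 2.0, 3.0]
y = [4.0, 2.0, -1.0]                  # absolute differences: 3, 0, 4
d0 = lp_distance(x, y, 0)             # 2 entries differ
d1 = lp_distance(x, y, 1)             # 3 + 0 + 4
d2 = lp_distance(x, y, 2)             # sqrt(9 + 16)
dinf = lp_distance(x, y, math.inf)    # max(3, 0, 4)
```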

Notice: if observations are standardized: \[ X_i = \frac{x_i-\overline x}{\sigma_x} \qquad Y_i = \frac{y_i-\overline y}{\sigma_y} \] then \(d_2(X,Y)^2 = 2n(1-\rho_{x,y}) \propto 1-\rho_{x,y}\), where \(\rho_{x,y}\) is the Pearson correlation and \(n\) is the number of entries. This is the correlation distance.
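The link between Euclidean distance and correlation can be checked numerically. The sketch below assumes the population standard deviation (dividing by \(n\)), in which case the squared Euclidean distance between the standardized vectors equals \(2n(1-\rho_{x,y})\). The data values are arbitrary.

```python
import math

def standardize(v):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / sd for a in v]

def pearson(x, y):
    """Pearson correlation as the mean product of standardized entries."""
    X, Y = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(X, Y)) / len(x)

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
X, Y = standardize(x), standardize(y)
squared_distance = sum((a - b) ** 2 for a, b in zip(X, Y))
rho = pearson(x, y)
predicted = 2 * len(x) * (1 - rho)   # should match squared_distance
```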

Example metrics: numeric data

Variance normalized Euclidean distance:

\[ d(x,y) = \sqrt{\sum_i \frac{(x_i-y_i)^2}{\sigma_i^2}} \] where \(\sigma_i\) is the standard deviation of the \(i\)-th variable.

Mahalanobis distance:

\[ d(x,y) = \sqrt{(x-y)^T\Sigma^{-1}(x-y)} \] where \(\Sigma\) is the covariance matrix of the data.
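A small worked example of the Mahalanobis distance, using its standard definition with the inverse covariance matrix. This sketch is restricted to two dimensions so that the inverse can be written out by hand; the function name `mahalanobis_2d` and the covariance values are illustrative assumptions.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance in 2-D, inverting the 2x2 covariance explicitly."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    quad = sum(diff[i] * inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return math.sqrt(quad)

# Variance 4 along the first axis, variance 1 along the second (made up).
cov = [[4.0, 0.0], [0.0, 1.0]]
along_wide_axis = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)
along_narrow_axis = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov)
```

The same Euclidean displacement of 2 counts as one standard deviation along the high-variance axis and two along the low-variance axis. With a diagonal covariance, as here, the result coincides with the variance normalized Euclidean distance.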

Example metrics: ordinal data

Generic strategy for ordinal data:

Remap categories \(x_1 < x_2 < \dots < x_M\) to \(1/M, 2/M, \dots, (M-1)/M, 1\).

Then use numeric distances.
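The remapping strategy can be sketched in a few lines. The level names below are hypothetical; any ordered list of categories works the same way.

```python
def ordinal_scores(levels):
    """Map ordered categories x_1 < ... < x_M to 1/M, 2/M, ..., 1."""
    M = len(levels)
    return {level: (i + 1) / M for i, level in enumerate(levels)}

# Hypothetical ordered categories, listed from smallest to largest.
levels = ["low", "medium", "high", "very high"]
scores = ordinal_scores(levels)

# Any numeric distance can now be applied; here the absolute difference.
gap = abs(scores["low"] - scores["high"])
```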

Example metrics: categorical data

With categorical data, we need to explicitly choose a distance between each pair of labels.
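One way to store such hand-picked distances is a lookup table keyed by unordered label pairs. The labels and numbers below are invented purely for illustration; nothing in the data determines them.

```python
# Hypothetical labels and distances, chosen by hand for illustration only.
LABEL_DISTANCE = {
    frozenset(["cat", "dog"]): 1.0,
    frozenset(["cat", "fish"]): 3.0,
    frozenset(["dog", "fish"]): 3.0,
}

def label_distance(a, b):
    """Look up the chosen distance; identical labels are at distance 0."""
    if a == b:
        return 0.0
    # frozenset keys make the lookup symmetric in a and b.
    return LABEL_DISTANCE[frozenset([a, b])]

same = label_distance("cat", "cat")
close = label_distance("dog", "cat")
far = label_distance("cat", "fish")
```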

Example metrics: other cases

Discrete metric \[ d(x,y) = \begin{cases}0 & x=y \\ 1 & \text{otherwise}\end{cases} \]

The graph metric gives the distance between two vertices as the sum of edge lengths along a shortest path between them.
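Shortest-path distances can be computed with Dijkstra's algorithm when all edge lengths are non-negative. This is a sketch under that assumption; the adjacency-dict representation and the small example graph are made up for illustration.

```python
import heapq

def shortest_path_distance(graph, source, target):
    """Dijkstra's algorithm on a dict {vertex: {neighbor: edge_length}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target not reachable from source

# Hypothetical undirected graph with edge lengths.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0, "d": 5.0},
    "c": {"a": 4.0, "b": 2.0, "d": 1.0},
    "d": {"b": 5.0, "c": 1.0},
}
a_to_d = shortest_path_distance(graph, "a", "d")  # path a-b-c-d
```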

Edit distance counts the minimum number of edits needed to turn one object into the other.
On graphs: add/delete vertex/edge, merge/split vertices, contracting edges.
On strings: add/remove/substitute letters (Levenshtein distance).

Hamming distance counts the positions at which two equal-length sequences differ.
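Both string distances are short to write down. The sketch below uses the usual dynamic-programming recurrence for Levenshtein distance, keeping only one row of the table at a time; the function names are assumptions, not part of the lecture.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))       # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i]                       # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

h = hamming("karolin", "kathrin")
e = levenshtein("kitten", "sitting")
```

Hamming distance only allows substitutions, so for equal-length strings it is never smaller than the Levenshtein distance.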

Example metrics: probability

Densities are functions: use a function metric.

Each \(L_p\) produces a function distance \(d(f,g)=\left(\int|f(x)-g(x)|^pdx\right)^{1/p}\)
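For densities on a bounded interval the integral can be approximated numerically. This sketch uses the midpoint rule with an assumed number of subintervals `n`; the two densities on \([0,1]\) (uniform, and \(g(x)=2x\)) are chosen only because their distances are easy to compute by hand: \(1/2\) for \(p=1\) and \(\sqrt{1/3}\) for \(p=2\).

```python
import math

def function_lp_distance(f, g, p, a, b, n=1000):
    """Approximate (integral over [a, b] of |f - g|^p)^(1/p), midpoint rule."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h            # midpoint of the k-th subinterval
        total += abs(f(x) - g(x)) ** p * h
    return total ** (1 / p)

uniform = lambda x: 1.0                  # density of Uniform(0, 1)
triangular = lambda x: 2.0 * x           # density g(x) = 2x on [0, 1]
dist_l1 = function_lp_distance(uniform, triangular, 1, 0.0, 1.0)
dist_l2 = function_lp_distance(uniform, triangular, 2, 0.0, 1.0)
```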

Example metrics: probability

The Earth mover's distance (EMD): use a cost function \(c(x,x')\), the cost of moving a unit of mass from \(x\) to \(x'\). \[ d(\mu,\nu)=\inf_{\gamma\in\Gamma(\mu,\nu)}\iint c(x,x')\,d\gamma(x,x') \] where \(\Gamma(\mu,\nu)\) is the set of transport plans: \(\gamma(x,x')\) measures how much mass goes from \(x\) to \(x'\). A plan must satisfy \(\int\gamma(x,x')dx'=\mu(x)\) and \(\int\gamma(x,x')dx=\nu(x')\).

Transport plans can be seen as joint probability distributions that match \(\mu\) and \(\nu\) as marginal distributions.

Wasserstein distance: \(W_p(\mu,\nu)\) is the \(p\)-th root of the EMD with cost \(c(x,x')=d(x,x')^p\), for a ground metric \(d\) on the underlying space. \(W_1\) is the EMD with \(c=d\).
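In one dimension the optimal transport plan between two samples of equal size simply matches sorted values, so the Wasserstein distance needs no optimization. The sketch below assumes the standard definition \(W_p = \left(\inf_\gamma \int |x-x'|^p\,d\gamma\right)^{1/p}\), equally weighted samples, and equal sample sizes; the function name and data are illustrative.

```python
import math

def wasserstein_1d(xs, ys, p=1):
    """W_p between two equally weighted 1-D samples of the same size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal sample sizes")
    # In 1-D, matching the i-th smallest x to the i-th smallest y is optimal.
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return mean_cost ** (1 / p)

xs = [0.0, 1.0, 3.0]
ys = [8.0, 5.0, 6.0]          # xs shifted by 5, in shuffled order
w1 = wasserstein_1d(xs, ys, p=1)
w2 = wasserstein_1d(xs, ys, p=2)
```

Because the second sample is a pure shift of the first, every unit of mass moves the same distance and both \(W_1\) and \(W_2\) equal the shift.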