11 February, 2018
Pick out 3 subgroups as suggested by the data.
Basic idea: chunk together observations that are similar to each other.
First important question: what does similar mean?
Metric
A pseudometric skips the identity of indiscernibles: distinct points may be at distance zero.
An ultrametric has the stronger triangle inequality \(\max(d(x,y),d(y,z))\geq d(x,z)\)
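The axioms named above can be checked by brute force on a small finite set. The following is a minimal sketch, assuming the usual four metric axioms (non-negativity, identity of indiscernibles, symmetry, triangle inequality) plus the ultrametric inequality; the helper name `check_axioms` and the example points are illustrative, not part of the notes.

```python
from itertools import product

def check_axioms(points, d, tol=1e-12):
    """Report which distance axioms hold for d on a finite set of points."""
    pairs = list(product(points, repeat=2))
    triples = list(product(points, repeat=3))
    return {
        "non_negative": all(d(x, y) >= -tol for x, y in pairs),
        # Identity of indiscernibles: d(x, y) = 0 exactly when x = y.
        "identity": all((d(x, y) <= tol) == (x == y) for x, y in pairs),
        "symmetric": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        "triangle": all(d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples),
        # Stronger ultrametric inequality: d(x, z) <= max(d(x, y), d(y, z)).
        "ultrametric": all(d(x, z) <= max(d(x, y), d(y, z)) + tol for x, y, z in triples),
    }

def absolute(x, y):
    return abs(x - y)

def discrete(x, y):
    return 0.0 if x == y else 1.0

points = [0.0, 1.0, 2.5]
abs_report = check_axioms(points, absolute)
discrete_report = check_axioms(points, discrete)
print(abs_report)
print(discrete_report)
```

On these points the absolute difference is a metric but not an ultrametric, while the discrete metric satisfies every check, including the ultrametric inequality.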
Positive definite kernels
Help guarantee convergence in many different optimization scenarios.
Dissimilarity measures
Retain only the non-negativity condition: \(d(x,y)\geq 0\).
Quasi-metric
Drops symmetry.
In both of these cases, symmetry can be imposed:
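One common way to impose symmetry is to average the two directions, \(d'(x,y)=\tfrac12\left(d(x,y)+d(y,x)\right)\); taking the maximum or minimum of the two is another option. The sketch below assumes the averaging variant, and the asymmetric dissimilarity `uphill` is a made-up example.

```python
def symmetrize(d):
    """Return a symmetric version of d by averaging both directions."""
    def d_sym(x, y):
        return 0.5 * (d(x, y) + d(y, x))
    return d_sym

# Hypothetical asymmetric dissimilarity: moving "up" costs twice as much as moving "down".
def uphill(x, y):
    return 2.0 * (y - x) if y > x else float(x - y)

d_sym = symmetrize(uphill)
print(uphill(1, 3), uphill(3, 1))  # the two directions differ
print(d_sym(1, 3), d_sym(3, 1))    # the symmetrized version agrees in both directions
```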
Note: \(L_0\) counts the non-zero entries of \(x-y\), and \(L_\infty\) takes the largest absolute coordinate difference, \(\max_i|x_i-y_i|\).
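A small sketch of the \(L_p\) family, assuming the standard definition \(d_p(x,y)=\left(\sum_i|x_i-y_i|^p\right)^{1/p}\) with the two limiting cases handled separately. The function name `lp_distance` is illustrative.

```python
import math

def lp_distance(x, y, p):
    """L_p distance between two equal-length vectors; p may be 0 or math.inf."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == 0:
        # L_0: number of coordinates in which x and y differ.
        return float(sum(1 for v in diffs if v != 0))
    if math.isinf(p):
        # L_infinity: largest absolute coordinate difference.
        return max(diffs)
    return sum(v ** p for v in diffs) ** (1.0 / p)

x, y = (0.0, 3.0, 1.0), (4.0, 0.0, 1.0)
for p in (0, 1, 2, math.inf):
    print(p, lp_distance(x, y, p))
```

Here the coordinate differences are 4, 3 and 0, so the four distances are 2, 7, 5 and 4 respectively.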
Notice: if observations are standardized or whitened: \[ X_i = \frac{x_i-\overline x}{\sigma_x} \qquad Y_i = \frac{y_i-\overline y}{\sigma_y} \] then \(d_2(X,Y)^2\propto 1-\rho_{x,y}\), for Pearson correlation \(\rho_{x,y}\). This gives the correlation metric.
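The relation between Euclidean distance and correlation can be checked numerically. With population standard deviations (dividing by \(n\)), the squared Euclidean distance between the standardized vectors equals \(2n(1-\rho_{x,y})\). The sketch below uses made-up data and only verifies that identity.

```python
from statistics import pstdev

def standardize(values):
    """Center and scale by the population standard deviation (divides by n)."""
    mean = sum(values) / len(values)
    sd = pstdev(values)
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    X, Y = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(X, Y)) / len(x)

# Illustrative data, not from the notes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

X, Y = standardize(x), standardize(y)
squared_d2 = sum((a - b) ** 2 for a, b in zip(X, Y))
rho = pearson(x, y)
identity_rhs = 2 * len(x) * (1 - rho)
print(squared_d2, identity_rhs)  # the two values agree up to rounding
```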
Variance normalized Euclidean distance:
\[ d(x,y) = d_2(x/\sigma_x, y/\sigma_y) \]
Mahalanobis distance:
\[ d(x,y) = \sqrt{(x-y)^T\Sigma^{-1}(x-y)} \] where \(\Sigma\) is the covariance matrix of the data.
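A minimal sketch of the Mahalanobis distance, restricted to two dimensions so that the inverse covariance matrix can be written out by hand without a linear algebra library. The function name and the covariance matrix are illustrative assumptions.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance in 2D, using the explicit inverse of a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = ((d / det, -b / det), (-c / det, a / det))
    diff = (x[0] - y[0], x[1] - y[1])
    # Quadratic form diff^T * inv * diff.
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

# Hypothetical diagonal covariance: variance 4 in the first coordinate, 1 in the second.
cov = ((4.0, 0.0), (0.0, 1.0))
d_mahal = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)
d_plain = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
print(d_mahal, d_plain)
```

With a diagonal covariance the result coincides with the variance-normalized Euclidean distance, and with the identity matrix it reduces to the plain Euclidean distance.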
Generic strategy for ordinal data:
Remap categories \(x_1 < x_2 < \dots < x_M\) to \(1/M, 2/M, \dots, (M-1)/M, 1\).
Then use numeric distances.
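The remapping above can be sketched as follows; the category labels are hypothetical and the helper name `ordinal_scores` is not from the notes.

```python
def ordinal_scores(ordered_levels):
    """Map M ordered categories to 1/M, 2/M, ..., 1."""
    M = len(ordered_levels)
    return {level: (rank + 1) / M for rank, level in enumerate(ordered_levels)}

# Hypothetical ordinal scale, listed from smallest to largest.
scores = ordinal_scores(["low", "medium", "high", "extreme"])

# Any numeric distance now applies, for example the absolute difference.
d_low_high = abs(scores["low"] - scores["high"])
print(scores, d_low_high)
```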
With categorical data we need to specifically pick a distance between each pair of labels.
Discrete metric \[ d(x,y) = \begin{cases}0 & x=y \\ 1 & \text{otherwise}\end{cases} \]
The graph metric defines the distance between two vertices of a graph as the sum of edge lengths along the shortest path between them.
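Shortest-path distances can be computed with Dijkstra's algorithm when edge lengths are non-negative. The sketch below assumes an undirected graph stored as a dictionary of neighbor dictionaries; the graph itself is a made-up example.

```python
import heapq

def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm: graph distance from source to every reachable vertex."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, length in graph.get(u, {}).items():
            candidate = d + length
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

# Hypothetical undirected graph: each edge is listed in both directions.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 4.0, "b": 2.0, "d": 1.0},
    "d": {"c": 1.0},
}
dist = shortest_path_lengths(graph, "a")
print(dist)
```

The direct edge from `a` to `c` has length 4, but the path through `b` has length 3, so the graph distance between `a` and `c` is 3.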
Edit distance counts minimum edits between points.
On graphs: add/delete vertices or edges, merge/split vertices, contract edges.
On strings: add/remove/substitute letters (Levenshtein distance).
Hamming distance counts the positions at which two equal-length sequences differ.
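Both string distances fit in a few lines. This sketch uses the standard dynamic-programming recurrence for the Levenshtein distance with unit cost per insertion, deletion and substitution; the function names are illustrative.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, free if the letters match
            ))
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
```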
Densities are functions: use a function metric.
Each \(L_p\) produces a function distance \(d(f,g)=\left(\int|f(x)-g(x)|^pdx\right)^{1/p}\)
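The integral can be approximated numerically when both densities vanish outside a bounded interval. The sketch below uses a midpoint rule on \([a,b]\); the two uniform densities and the helper names are illustrative assumptions.

```python
def lp_function_distance(f, g, a, b, p, n=10_000):
    """Midpoint-rule approximation of the L_p distance between f and g on [a, b]."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        total += abs(f(x) - g(x)) ** p
    return (total * h) ** (1.0 / p)

def uniform_density(lo, hi):
    """Density of the uniform distribution on [lo, hi]."""
    return lambda x: 1.0 / (hi - lo) if lo <= x <= hi else 0.0

f = uniform_density(0.0, 1.0)
g = uniform_density(0.0, 2.0)
d1 = lp_function_distance(f, g, 0.0, 2.0, p=1)
d2 = lp_function_distance(f, g, 0.0, 2.0, p=2)
print(d1, d2)
```

Here \(|f-g|=1/2\) on all of \([0,2]\), so the exact values are \(1\) for \(p=1\) and \(\sqrt{1/2}\) for \(p=2\).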
The Earth mover's distance (EMD): use a cost function \(c(x,x')\), typically a distance between the points \(x\) and \(x'\), giving the cost of moving a unit of mass from \(x\) to \(x'\). \[ d(\mu,\nu)=\inf_{\gamma\in\Gamma(\mu,\nu)}\iint c(x,x')\,\gamma(x,x')\,dx\,dx' \] where \(\Gamma(\mu,\nu)\) is the set of transport plans: \(\gamma(x,x')\) measures how much mass goes from \(x\) to \(x'\). Requires \(\int\gamma(x,x')dx'=\mu(x)\) and \(\int\gamma(x,x')dx=\nu(x')\).
Transport plans can be seen as joint probability distributions that match \(\mu\) and \(\nu\) as marginal distributions.
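For discrete distributions a transport plan is a matrix whose row sums give \(\mu\) and whose column sums give \(\nu\). The sketch below, on a made-up two-point example, checks the marginal constraints and compares the cost of two plans; it does not solve the optimization problem itself.

```python
def is_transport_plan(gamma, mu, nu, tol=1e-12):
    """Check that gamma is non-negative with row sums mu and column sums nu."""
    rows = [sum(row) for row in gamma]
    cols = [sum(col) for col in zip(*gamma)]
    return (
        all(g >= -tol for row in gamma for g in row)
        and all(abs(r - m) <= tol for r, m in zip(rows, mu))
        and all(abs(c - n) <= tol for c, n in zip(cols, nu))
    )

def plan_cost(gamma, cost):
    """Total transport cost: sum of mass moved times cost per unit mass."""
    return sum(g * c for g_row, c_row in zip(gamma, cost) for g, c in zip(g_row, c_row))

# Hypothetical example: mu puts mass 1/2 on x = 0 and x = 2, nu on x' = 1 and x' = 3.
xs, xps = (0.0, 2.0), (1.0, 3.0)
mu, nu = (0.5, 0.5), (0.5, 0.5)
cost = [[abs(x - xp) for xp in xps] for x in xs]

plan_near = [[0.5, 0.0], [0.0, 0.5]]  # 0 -> 1 and 2 -> 3
plan_far = [[0.0, 0.5], [0.5, 0.0]]   # 0 -> 3 and 2 -> 1
cost_near = plan_cost(plan_near, cost)
cost_far = plan_cost(plan_far, cost)
print(is_transport_plan(plan_near, mu, nu), cost_near)
print(is_transport_plan(plan_far, mu, nu), cost_far)
```

Both matrices are valid plans, but sending each point to its nearer target is cheaper; the EMD is the infimum of this cost over all valid plans.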
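Wasserstein distance: \(W_p(\mu,\nu)\) uses the EMD with cost \(c(x,x')=d(x,x')^p\) for a distance \(d\), then takes the \(p\)-th root: \(W_p(\mu,\nu)=\left(\inf_{\gamma\in\Gamma(\mu,\nu)}\iint d(x,x')^p\,\gamma(x,x')\,dx\,dx'\right)^{1/p}\).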
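In one dimension, for two samples of equal size with equal weights and \(p\geq 1\), the optimal plan matches the sorted values in order, so no optimization is needed. The sketch below relies on that special case only, taking the \(p\)-th power of the absolute difference as cost and a \(p\)-th root at the end; the sample values are made up.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size, equally weighted 1D samples (p >= 1)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    xs, ys = sorted(xs), sorted(ys)
    # Monotone matching: the i-th smallest x is transported to the i-th smallest y.
    mean_cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return mean_cost ** (1.0 / p)

# Illustrative samples: ys is xs shifted by 5, so every point moves a distance of 5.
xs = [3.0, 0.0, 1.0]
ys = [5.0, 8.0, 6.0]
w1 = wasserstein_1d(xs, ys, p=1)
w2 = wasserstein_1d(xs, ys, p=2)
print(w1, w2)
```

For samples of unequal size or in higher dimensions, the transport problem has to be solved explicitly, for example as a linear program.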