6 March, 2018

Basic idea

  • k-Means: what if a cluster is a partition around a choice of prototypes
  • Hierarchical clustering: what if a cluster is the things that merge together more quickly than others?

Why do we need something different?

Encoding closeness: \(\epsilon\)-distance graph

  1. Pick a threshold \(\epsilon\)
  2. Nodes: data points
  3. Edges: \([i,j]\) if \(d(x_i,x_j)<\epsilon\)
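
A minimal Python sketch of this construction (my own, using scipy; not from the lecture):

from scipy.spatial.distance import pdist, squareform

def epsilon_graph(X, eps):
    """Adjacency matrix of the epsilon-distance graph on the rows of X."""
    D = squareform(pdist(X))    # pairwise Euclidean distances, zero diagonal
    return (D < eps) & (D > 0)  # edge [i, j] iff 0 < d(x_i, x_j) < eps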

Encoding closeness: simplicial complexes

Closely related to hypergraphs

Definition A simplicial complex on a set \(V\) of vertices is a set of subsets (simplices) \(\mathcal S\subseteq 2^V\) such that if \(\sigma\in\mathcal S\) and \(\tau\subset\sigma\) then \(\tau\in\mathcal S\).

Intuition Each simplex \(\sigma\) is a generalized triangle; the simplices are glued together along shared subsimplices.

Graphs are 1-dimensional simplicial complexes.
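
A quick sketch of the definition as a data structure check (the representation of simplices as tuples is my own choice):

from itertools import combinations

def is_simplicial_complex(S):
    """Check downward closure: every nonempty proper subset of a simplex
    is itself a simplex."""
    S = {frozenset(s) for s in S}
    return all(frozenset(t) in S
               for s in S
               for r in range(1, len(s))
               for t in combinations(s, r))

# the full triangle [a, b, c], its edges, and its vertices:
triangle = [("a",), ("b",), ("c",),
            ("a", "b"), ("a", "c"), ("b", "c"),
            ("a", "b", "c")]
assert is_simplicial_complex(triangle)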

Encoding closeness: simplicial complexes

Čech complex:

  1. Pick a threshold \(\epsilon\)
  2. Vertices: data points
  3. Simplices: \([x_0,\dots,x_d]\) if \(\bigcap_{j=0}^d B_\epsilon(x_j)\neq\emptyset\)

Vietoris-Rips complex:

  1. Pick a threshold \(\epsilon\)
  2. Generate the \(\epsilon\)-distance graph \(\Gamma_\epsilon\)
  3. Add simplex \([x_0,\dots,x_d]\) if each \([x_i,x_j]\) is an edge in \(\Gamma_\epsilon\)
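
The Vietoris-Rips complex depends only on pairwise distances, unlike the Čech complex, which needs ball intersections. That makes a brute-force sketch easy (my own, exponential in max_dim, for illustration only):

from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def vietoris_rips(X, eps, max_dim=2):
    """All simplices of dimension <= max_dim whose vertices are pairwise
    closer than eps, i.e. the cliques of the eps-distance graph."""
    D = squareform(pdist(X))
    n = len(X)
    simplices = [(i,) for i in range(n)]
    for d in range(1, max_dim + 1):
        for sigma in combinations(range(n), d + 1):
            if all(D[i, j] < eps for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices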

From encoding to knowledge

Agglomerative clustering:

  1. Start with each data point a cluster of its own
  2. Merge the closest clusters

Divisive clustering:

  1. Start with everything one cluster
  2. Split one of the clusters

Persistent homology:

  • Homology: use linear algebra to find gaps and holes
  • Persistent: sweep over \(\epsilon\) to glue gaps into persistent topological features

Visualizing

Dendrogram: dissimilarity on the vertical axis. Each cluster is a vertical line; a horizontal bar connects two clusters at the \(\epsilon\) where they merge (or split).

Agglomerative clustering

Linkages

When do we merge clusters?

Single linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\min_{g\in G,\,h\in H} d(x_g, x_h)<\epsilon\)

Average linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\text{mean}_{g\in G,\,h\in H}\, d(x_g,x_h)<\epsilon\)

Complete linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\max_{g\in G,\,h\in H} d(x_g,x_h)<\epsilon\)

Ward's method Merge \(G\) and \(H\) when \(G\) and \(H\) are the pair that minimize the increase in total within-cluster variance
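
To make the merge step concrete, here is a deliberately naive sketch of agglomerative clustering under single linkage (my own; the implementations in the Code section below are far more efficient):

from scipy.spatial.distance import pdist, squareform

def single_linkage(X, k):
    """Repeatedly merge the two closest clusters (single linkage)
    until only k clusters remain."""
    D = squareform(pdist(X))
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters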

Single linkage

Connected components of the \(\epsilon\)-distance graph.

Sensitive to chaining. Does well on the double spiral.

Complete linkage

Cliques (complete subgraphs) of the \(\epsilon\)-distance graph.

Handles chaining better, but can violate the closeness property: it can group points that are closer to members of other clusters than to their own.

Average linkage

Balances the closeness of single linkage against the chaining robustness of complete linkage.

Ward's Method

Cutting a dendrogram

Pick a cutoff level, or a number of clusters. Each vertical line crossing the cutoff level defines a cluster: all the points below it in the tree.

Produces a partition into clusters from the hierarchical structure.

Cophenetic distance

A dendrogram induces a metric on the data points: \(d_C(x,y)\) is the height at which the clusters containing \(x\) and \(y\) first merge.
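
scipy computes cophenetic distances directly; a sketch, assuming X is the data matrix:

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

Z = hierarchy.linkage(X, method="average")
# d_C: condensed matrix of cophenetic distances;
# c: cophenetic correlation between d_C and the original distances
c, d_C = hierarchy.cophenet(Z, pdist(X))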

Divisive clustering

Average dissimilarity split

  1. All observations in a single group \(G\)
  2. Pick the observation \(x\) whose average dissimilarity to all other points, \(\text{mean}[d(x,g) : g\in G]\), is largest. Put it in a new cluster \(H\).
  3. While \(G\) contains some \(y\) such that \(\text{mean}[d(y,g) : g\in G] - \text{mean}[d(y,h) : h\in H] > 0\), pick the maximizing \(y\) and move it to \(H\).
  4. Pick one of the existing clusters as the new \(G\) and go to 2.

One way to pick in step 4 is to split the cluster with the largest diameter \(\max_{x,y\in G}d(x,y)\).

Another is to split the cluster with the largest average within-group dissimilarity.

In R: use diana from the library cluster.

In Python: no easily accessible implementation is available; see the sketch below.
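
A minimal sketch of a single split (my own), assuming a precomputed distance matrix D, e.g. squareform(pdist(X)):

import numpy as np

def average_dissimilarity_split(D, G):
    """One split of cluster G (a list of indices into D), following
    the steps above."""
    G = list(G)
    # step 2: seed H with the observation of largest average dissimilarity
    avg = [np.mean([D[x, g] for g in G if g != x]) for x in G]
    H = [G.pop(int(np.argmax(avg)))]
    # step 3: move observations closer on average to H than to the rest of G
    while len(G) > 1:
        gain = [np.mean([D[y, g] for g in G if g != y])
                - np.mean([D[y, h] for h in H]) for y in G]
        i = int(np.argmax(gain))
        if gain[i] <= 0:
            break
        H.append(G.pop(i))
    return G, H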

Code

In R

library(ggdendro)                     # provides ggdendrogram
distances = dist(dataset)
h.clusters = hclust(distances)
ggdendrogram(h.clusters)              # or plot(h.clusters)
clusters = cutree(h.clusters, k=3)    # cut into 3 clusters
clusters = cutree(h.clusters, h=6.5)  # or cut at height 6.5

dist takes the option method with values "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski" (minkowski also takes the option p, and forms the \(L_p\) distance).

hclust takes the option method with values "single", "average", "complete", "ward.D", "ward.D2", "median", "centroid", "mcquitty".

In Python (scikit-learn)

from sklearn import cluster
model = cluster.AgglomerativeClustering(
    n_clusters=3, linkage="ward", affinity="euclidean")
clusters = model.fit_predict(X)  # no separate predict method

linkage can also be "average" or "complete". affinity can also be "l1", "l2", "manhattan", "cosine", or "precomputed". Both linkage and affinity can be omitted for Ward's method on Euclidean distance.

In Python (scipy)

from scipy.cluster import hierarchy
h_clusters = hierarchy.linkage(X, method="single", metric="euclidean")
hierarchy.dendrogram(h_clusters)
clusters = hierarchy.fcluster(h_clusters, t=6.5, criterion="distance")  # cut at height 6.5

method also takes values "complete", "average", "weighted", "centroid", "median", "ward". metric takes 22 different values, or an explicitly written distance function. See the docs for scipy.spatial.distance.pdist for details.
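
For example, a condensed distance matrix from pdist can be passed in place of the raw data (a sketch):

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

D = pdist(X, metric="cosine")  # condensed pairwise distances
h_clusters = hierarchy.linkage(D, method="average")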

TDA in R

library(TDA)
dgm = ripsDiag(circle, maxdimension=3, maxscale=0.5)
plot(dgm$diagram)                # persistence diagram
plot(dgm$diagram, barcode=TRUE)  # barcode

TDA in Python

Use Dionysus.

import dionysus
# Vietoris-Rips filtration on the point cloud xy, up to dimension 3,
# with distance threshold 0.5
cpx = dionysus.fill_rips(xy, 3, 0.5)
homology = dionysus.homology_persistence(cpx)
diagrams = dionysus.init_diagrams(homology, cpx)
dionysus.plot.plot_bars(diagrams[0])  # H0: connected components
dionysus.plot.plot_bars(diagrams[1])  # H1: loops

TDA my way

Use JavaPlex in MATLAB.

Examples

Iris

Not finished

MNIST

Collection of handwritten digits. Each a 28x28 greyscale image.

60k training images, 10k test images.

Published by Yann LeCun.

10-means on the first 1000 instances:
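
A sketch of how that experiment might be set up (the loader call is an assumption on my part; any array of the digit vectors works):

from sklearn import cluster, datasets

# load MNIST as a 70000 x 784 array (loader is an assumption)
X, _ = datasets.fetch_openml("mnist_784", return_X_y=True, as_frame=False)

model = cluster.KMeans(n_clusters=10)  # "10-means"
labels = model.fit_predict(X[:1000])   # first 1000 instances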