- k-Means: what if a cluster is a partition around a choice of prototypes?
6 March, 2018
Closely related to hypergraphs
Definition A simplicial complex on a set \(V\) of vertices is a set of subsets (simplices) \(\mathcal S\subseteq 2^V\) such that if \(\sigma\in\mathcal S\) and \(\tau\subset\sigma\) then \(\tau\in\mathcal S\).
Intuition Each simplex \(\sigma\) is a generalized triangle; the simplices are glued together along shared subsimplices.
Graphs are 1-dimensional simplicial complexes.
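The downward-closure condition in the definition is straightforward to check directly. A minimal Python sketch (the helper name and example complexes are made up for illustration):

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check that every nonempty proper subset of every simplex is in the collection."""
    S = {frozenset(s) for s in simplices}
    for sigma in S:
        for k in range(1, len(sigma)):
            for tau in combinations(sigma, k):
                if frozenset(tau) not in S:
                    return False
    return True

# A filled triangle {1,2,3} together with all its edges and vertices:
triangle = [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
print(is_simplicial_complex(triangle))                      # True
print(is_simplicial_complex([(1, 2, 3), (1,), (2,), (3,)])) # False: edges missing
```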
Čech complex: a simplex for each finite set of points whose \(\epsilon\)-balls have a common point.
Vietoris-Rips complex: a simplex for each finite set of points that are pairwise within distance \(\epsilon\) (the cliques of the \(\epsilon\)-distance graph).
Agglomerative clustering:
Divisive clustering:
Persistent homology:
Dendrogram: dissimilarity on the vertical axis. A horizontal bar at the \(\epsilon\) that merges (or splits) clusters; vertical lines represent each cluster.
When do we merge clusters?
Single linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\min d(x_g, x_h)<\epsilon\)
Average linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\text{mean}(d(x_g,x_h))<\epsilon\)
Complete linkage Merge \(G\) and \(H\) at \(\epsilon\) when \(\max d(x_g,x_h)<\epsilon\)
Ward's method Merge \(G\) and \(H\) when \(G\) and \(H\) are the pair that minimize the increase in total within-cluster variance
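The three dissimilarity-based criteria above differ only in how they aggregate pairwise distances between the two groups. A minimal sketch (function names are made up for illustration):

```python
def single_linkage(G, H, d):
    """Smallest pairwise distance between the groups."""
    return min(d(g, h) for g in G for h in H)

def average_linkage(G, H, d):
    """Mean of all pairwise distances between the groups."""
    return sum(d(g, h) for g in G for h in H) / (len(G) * len(H))

def complete_linkage(G, H, d):
    """Largest pairwise distance between the groups."""
    return max(d(g, h) for g in G for h in H)

# One-dimensional example with d(x, y) = |x - y|
d = lambda x, y: abs(x - y)
G, H = [0.0, 1.0], [3.0, 5.0]
print(single_linkage(G, H, d))    # 2.0: from 1 to 3
print(average_linkage(G, H, d))   # 3.5: (3 + 5 + 2 + 4) / 4
print(complete_linkage(G, H, d))  # 5.0: from 0 to 5
```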
Single linkage: connected components of the \(\epsilon\)-distance graph. Sensitive to chaining, but does well on the double spiral.
Complete linkage: cliques (complete subgraphs) of the \(\epsilon\)-distance graph. Handles chaining better, but can violate the closeness property: it can group points that are closer to members of other clusters than to their own cluster.
Average linkage: balances the closeness of single linkage against the chaining robustness of complete linkage.
Pick a cutoff level, or a number of clusters. For each vertical line at the cutoff level, assign a cluster to all points below that cutoff in the tree.
Produces a partition into clusters from the hierarchical structure.
A dendrogram induces a metric on the data points: \(d_C(x,y)\) is the height of the connection where clusters containing \(x\) and \(y\) separately first merge together.
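This induced metric is the cophenetic distance. Assuming `numpy` and `scipy` are available, it can be read off a linkage matrix (the example data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Two tight pairs far apart on the line: {0, 0.1} and {5, 5.1}
X = np.array([[0.0], [0.1], [5.0], [5.1]])
Z = linkage(X, method="single")

# cophenet(Z) returns the condensed matrix of merge heights d_C(x, y)
d_C = squareform(cophenet(Z))
print(d_C[0, 1])  # 0.1: points 0 and 1 merge at height 0.1
print(d_C[0, 2])  # 4.9: the two pairs merge at height 5.0 - 0.1 = 4.9
```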
One way to pick which cluster to split is to take the cluster with largest diameter \(\max_{x,y\in G}d(x,y)\).
Another is to split the cluster with largest average within-group dissimilarity \(\frac{1}{|G|^2}\sum_{x,y\in G}d(x,y)\).
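Both splitting criteria are direct to compute over a cluster given as a list of points. A minimal sketch (function names are made up for illustration):

```python
def diameter(G, d):
    """Largest pairwise dissimilarity within the cluster."""
    return max(d(x, y) for x in G for y in G)

def average_dissimilarity(G, d):
    """Mean pairwise dissimilarity within the cluster (including d(x, x) = 0)."""
    return sum(d(x, y) for x in G for y in G) / len(G) ** 2

d = lambda x, y: abs(x - y)
clusters = [[0.0, 1.0, 2.0], [10.0, 10.5]]

# Divisive step: split the cluster with the largest diameter
to_split = max(clusters, key=lambda G: diameter(G, d))
print(to_split)  # [0.0, 1.0, 2.0]: diameter 2.0 beats 0.5
```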
In R: use `diana` from the library `cluster`.
In Python: no easily accessible implementation available.
```r
library(ggdendro)  # for ggdendrogram
distances = dist(dataset)
h.clusters = hclust(distances)
ggdendrogram(h.clusters)  # or plot(h.clusters)
clusters = cutree(h.clusters, k=3)    # cut into 3 clusters
clusters = cutree(h.clusters, h=6.5)  # or cut at height 6.5
```
`dist` takes the option `method` with values `"euclidean"`, `"maximum"`, `"manhattan"`, `"canberra"`, `"binary"`, `"minkowski"` (`"minkowski"` also takes the option `p`, and forms the \(L_p\) distance).
`hclust` takes the option `method` with values `"single"`, `"average"`, `"complete"`, `"ward.D"`, `"ward.D2"`, `"median"`, `"centroid"`, `"mcquitty"`.
In Python (`scikit-learn`):

```python
from sklearn import cluster

model = cluster.AgglomerativeClustering(linkage="ward", affinity="euclidean")
model.fit(X)
clusters = model.labels_  # AgglomerativeClustering has no predict(); read labels_ after fit
```
`linkage` can also be `"average"` or `"complete"`. `affinity` can also be `"l1"`, `"l2"`, `"manhattan"`, `"cosine"`, or `"precomputed"`. Both `linkage` and `affinity` can be omitted for Ward's method on Euclidean distance.
In Python (`scipy`):

```python
from scipy.cluster import hierarchy

h_clusters = hierarchy.linkage(X, method="single", metric="euclidean")
hierarchy.dendrogram(h_clusters)
clusters = hierarchy.fcluster(h_clusters, t=6.5, criterion="distance")  # cut at height 6.5
```
`method` also takes values `"complete"`, `"average"`, `"weighted"`, `"centroid"`, `"median"`, `"ward"`. `metric` takes 22 different values, or an explicitly written distance function; see the docs for `scipy.spatial.distance.pdist` for details.
```r
library(TDA)  # provides ripsDiag
dgm = ripsDiag(circle, maxdimension=3, maxscale=0.5)
plot(dgm$diagram)             # persistence diagram
plot(dgm$diagram, barcode=T)  # barcode
```
Use `Dionysus` in Python:

```python
import dionysus

cpx = dionysus.fill_rips(xy, 3, 0.5)
homology = dionysus.homology_persistence(cpx)
diagrams = dionysus.init_diagrams(homology, cpx)
dionysus.plot.plot_bars(diagrams[0])  # bars in dimension 0
dionysus.plot.plot_bars(diagrams[1])  # bars in dimension 1
```
Use `JavaPlex` in Matlab.
Collection of handwritten digits. Each a 28x28 greyscale image.
60k training images, 10k test images.
Published by Yann LeCun.
10-means on the first 1000 instances:
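A sketch of the corresponding run with scikit-learn. Loading MNIST itself is elided here, so random data with MNIST's shape (1000 flattened 28x28 images) stands in for the real digits:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the first 1000 MNIST training images, flattened to 784-vectors
rng = np.random.default_rng(0)
X = rng.random((1000, 28 * 28))

# 10-means: one prototype (cluster center) per digit class, ideally
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_.shape)  # (10, 784): each center reshapes to a 28x28 image
print(np.bincount(model.labels_, minlength=10))  # cluster sizes
```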