27 February, 2018

Prototypes, cognition and semantics

How does meaning work? What do we mean when we say knife?

Semantic Features

A knife is [+SHARP EDGE] [+HAND SIZED HANDLE] [-LONG BLADE] [-DOUBLE EDGE] [-HEAVY EDGE]

Describing a concept as a combination of individual features is loosely analogous to linear regression techniques.

Prototype Theory

A knife is something sufficiently similar to a prototypical knife.

The greater the similarity to the prototype, the more "knifeness" an object has.

How to find prototypes

A candidate procedure: pick a number of prototypes to locate and then

  1. Pick random candidates
  2. Improve each candidate with respect to a quality measure

Iterative greedy descent

Quality measures: scatters

Within cluster point scatter: \[ W(C) = \frac12\sum_{k=1}^K\sum_{x\in C_k}\sum_{y\in C_k} d(x,y) \] Sum distances between points that occupy the same cluster.

Between cluster point scatter: \[ B(C) = \frac12\sum_{k=1}^K\sum_{x\in C_k}\sum_{y\not\in C_k} d(x,y) \] Sum distances between points that occupy different clusters.

Total scatter: \[ T = \frac12\sum_x\sum_y d(x,y) = W(C) + B(C) \]
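
As a quick sanity check of these definitions, here is a small NumPy sketch (made-up data and an arbitrary assignment into \(K=3\) clusters) that computes all three scatters from a pairwise squared-distance matrix and confirms \(T = W(C) + B(C)\):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # made-up data
labels = rng.integers(0, 3, size=30)               # arbitrary assignment into K = 3 clusters

D = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # pairwise squared Euclidean distances d(x, y)

W = 0.5 * sum(D[np.ix_(labels == k, labels == k)].sum() for k in range(3))
B = 0.5 * sum(D[np.ix_(labels == k, labels != k)].sum() for k in range(3))
T = 0.5 * D.sum()

print(np.isclose(T, W + B))                        # True: total scatter = within + between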

K-Means

Iterative greedy descent algorithm for clustering.

  • All variables quantitative
  • Squared Euclidean distance

Squared Euclidean scatter

Note that \[ \sum_{x,x'}(x-x')^2 = \sum_{x,x'}(x-\overline x + \overline x-x')^2 = \\ \sum_{x,x'}(x-\overline x)^2 + \sum_{x,x'}(x'-\overline x)^2 - 2\sum_{x,x'}(x-\overline x)(x'-\overline x) = \\ 2N\sum_{x}(x-\overline x)^2 -2\sum_{x'}(x'-\overline x)\sum_{x}(x-\overline x) \] where \(N\) is the number of points summed over.

And that \(\sum_{x}(x-\overline x) = \left(\sum x - N\overline x\right) = (N\overline x- N\overline x) = 0\).

Squared Euclidean scatter

Using \(\sum_{x,x'\in C_k}(x-x')^2 = 2|C_k|\sum_{x\in C_k}(x-\overline x_k)^2\) it follows that \[ W(C) = \frac12\sum_k\sum_{x\in C_k}\sum_{x'\in C_k}\|x-x'\|^2 = \sum_k|C_k|\sum_{x\in C_k}\|x-\overline x_k\|^2 \]

It follows that for a particular cluster assignment \(C\), the within-group scatter is completely determined by the distances from each point to its cluster mean (centroid).
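
A small numerical check of this identity on made-up data, comparing the pairwise form with the centroid form for a single cluster:

import numpy as np

rng = np.random.default_rng(1)
C_k = rng.normal(size=(20, 3))                     # made-up points of one cluster
xbar = C_k.mean(axis=0)                            # the cluster centroid

pairwise = 0.5 * ((C_k[:, None, :] - C_k[None, :, :])**2).sum()
centroid = len(C_k) * ((C_k - xbar)**2).sum()

print(np.isclose(pairwise, centroid))              # True: (1/2) sum_{x,x'} ||x-x'||^2 = |C_k| sum_x ||x - xbar_k||^2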

Iterative descent algorithm

  1. For each chosen mean \(x_k\), create the clusters \(C_k=\{x:d(x,x_k) \le d(x,x_{k'})\ \forall k'\}\) (ties broken arbitrarily).
  2. For each chosen cluster \(C_k\), calculate the centroids \(x_k=\frac{1}{|C_k|}\sum_{x\in C_k}x\).
  3. If no cluster assignment changes, the algorithm terminates.

Iterative descent algorithm

Once the means are fixed, minimizing the total distance from each point to its assigned mean is achieved by assigning every point to its closest mean.

Once the clusters are fixed, the within-group scatter is minimized by choosing each cluster's centroid as its mean.

Neither step can increase \(W(C)\), so the algorithm converges, typically to a local rather than global minimum.
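
A minimal from-scratch sketch of these alternating steps (squared Euclidean distance assumed; the function name is made up and empty clusters are not handled):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]          # random data points as starting means
    for _ in range(n_iter):
        d = ((X[:, None, :] - means[None, :, :])**2).sum(-1)      # squared distance to every mean
        labels = d.argmin(axis=1)                                  # 1. assign each point to its closest mean
        new_means = np.array([X[labels == j].mean(axis=0)
                              for j in range(k)])                  # 2. recompute the centroids
        if np.allclose(new_means, means):                          # 3. stop when nothing changes
            break
        means = new_means
    return labels, means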

Example: Iris

Example code - R

Data matrix is X. Number of clusters is k.

kct = kmeans(X,k)
kct$cluster # holds cluster index for points in X
kct$centers # holds coordinates for cluster centers
kct$tot.withinss # holds within-group scatter

Example code - Python / scikit-learn

Data matrix is X. Number of clusters is k.

from sklearn import cluster
model = cluster.KMeans(k).fit(X)
model.labels_          # holds cluster indices
model.cluster_centers_ # holds coordinates for cluster centers

Example code - Python / scikit-learn

Data matrix is X. Number of clusters is k.

from sklearn import cluster
model = cluster.KMeans(k)
labels = model.fit_predict(X) # holds cluster indices
model.cluster_centers_        # holds coordinates for cluster centers

Example code - Python / scikit-learn

Data matrix is X. Number of clusters is k.

from sklearn import cluster
model = cluster.KMeans(k)
clusterspace = model.fit_transform(X) # coordinates in cluster-distance space
model.labels_                         # holds cluster indices
model.cluster_centers_                # holds coordinates for cluster centers

Cluster space

The scikit-learn implementation additionally provides the possibility to change coordinates into cluster-distance space:

The point \(x\) is assigned the new coordinate \[ \hat x = (d(x,x_1), \dots, d(x,x_k)) \] consisting of distances to all cluster centers.
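
A small sketch on made-up data checking that the scikit-learn transform output is exactly these distances:

import numpy as np
from sklearn import cluster

X = np.random.default_rng(2).normal(size=(50, 4))        # made-up data
model = cluster.KMeans(n_clusters=3, n_init=10).fit(X)

# distances from every point to every cluster center, computed by hand
dists = np.sqrt(((X[:, None, :] - model.cluster_centers_[None, :, :])**2).sum(-1))

print(np.allclose(dists, model.transform(X)))             # True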

Cluster space - Iris

Soft clusters

Mixture models

One way to model multimodal distributions is to add two or more densities together, with non-negative weights that sum to 1. \[ p_{\mathcal M}(x) = \lambda_1p_1(x) + \dots + \lambda_kp_k(x) \]

Most common setup: mixed Gaussians \[ p_{\mathcal M}(x) = \lambda_1p_{\mathcal N(\mu_1,\sigma_1^2)}(x) + \dots + \lambda_kp_{\mathcal N(\mu_k,\sigma_k^2)}(x) \]
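
A sketch of such a density for two components (all parameter values are made up for illustration):

import numpy as np
from scipy.stats import norm

lam = 0.3                                          # mixture weights lam and 1 - lam
mu1, sigma1, mu2, sigma2 = -1.0, 0.5, 2.0, 1.0     # made-up component parameters

def p_mix(x):
    # weighted sum of the two component densities; the weights sum to 1
    return lam * norm.pdf(x, mu1, sigma1) + (1 - lam) * norm.pdf(x, mu2, sigma2)

xs = np.linspace(-4.0, 6.0, 1000)
print(p_mix(xs).sum() * (xs[1] - xs[0]))           # approximately 1: p_mix integrates to one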

Expectation Maximization

Fitting a mixture model directly is difficult. The standard solution is the EM algorithm.

Consider a two-component mixture model, determined by \(\lambda_1,\mu_1,\mu_2,\sigma_1^2,\sigma_2^2\). Posit latent variables \(\Delta_i\) that take the value \(1\) if \(X_i\) comes from model 1 and \(0\) otherwise.

Instead of \(\Delta_i\), which is inaccessible, we work with \[ \gamma_i(\theta) = \mathbb{E}(\Delta_i|\theta,X) = \mathbb{P}(\Delta_i=1|\theta,X) \]

We call \(\gamma_i\) the responsibility of model 1 for observation \(i\).

Expectation Maximization

The Expectation Maximization algorithm alternates two steps:

Expectation Calculate a soft assignment of responsibilities to the observations, using current estimates of all parameters.

Maximization Use the responsibility estimates for a maximum likelihood estimation of the parameters.

Expectation Maximization

  1. Guess at parameters: Means - pick observations at random; variances - overall sample variance; weights \(\hat\lambda_j\) - \(1/k\)
  2. Compute responsibilities \[ \hat\gamma^j_i = \frac {\hat\lambda_j\mathcal L_{\mathcal N(\hat\mu_j,\hat\sigma_j^2)}(x_i)} {\sum_k\hat\lambda_k\mathcal L_{\mathcal N(\hat\mu_k,\hat\sigma_k^2)}(x_i)} \]
  3. Recompute the weighted sample means, variances and weights \[ \hat\mu_j = \frac {\sum_i\hat\gamma^j_ix_i} {\sum_i\hat\gamma^j_i} \qquad \hat\sigma_j^2 = \frac {\sum_i\hat\gamma_i^j(x_i-\hat\mu_j)^2} {\sum_i\hat\gamma_i^j} \qquad \hat\lambda_j=\frac{\sum_i\hat\gamma_i^j}{n} \]

Steps 2 and 3 are repeated until the estimates stabilize; a code sketch of the procedure follows.
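
A compact sketch of these steps for a two-component, one-dimensional mixture (made-up data; variable names mirror the formulas above):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 0.5, 150), rng.normal(2, 1.0, 100)])   # made-up data

mu = rng.choice(x, size=2, replace=False)          # 1. means: observations picked at random
var = np.full(2, x.var())                          #    variances: overall sample variance
lam = np.full(2, 0.5)                              #    weights: 1/k

for _ in range(200):
    lik = lam * norm.pdf(x[:, None], mu, np.sqrt(var))      # component likelihoods, shape (n, 2)
    gamma = lik / lik.sum(axis=1, keepdims=True)            # 2. responsibilities gamma_i^j
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk              # 3. weighted means ...
    var = (gamma * (x[:, None] - mu)**2).sum(axis=0) / nk   #    ... variances ...
    lam = nk / len(x)                                       #    ... and weights

print(mu, np.sqrt(var), lam)                       # estimated component parameters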

k-Medoids

Relaxing requirements

We had two main requirements for \(k\)-means:

  1. Sum of squares dissimilarity measure.
  2. All continuous numeric variables

We can relax these by:

  1. Picking any dissimilarity measure, changing the task of minimizing \(W(C)\).
  2. Picking observations instead of synthesizing centroids.

k-Means to k-Medoids

Recall the \(k\)-means algorithm

  1. For each chosen mean \(x_k\), create the clusters \(C_k=\{x:d(x,x_k) \le d(x,x_{k'})\ \forall k'\}\).
  2. For each chosen cluster \(C_k\), calculate the centroids \(x_k=\frac{1}{|C_k|}\sum_{x\in C_k}x\).
  3. If no cluster assignment changes, the algorithm terminates.

k-Means to k-Medoids

To get \(k\)-medoids, replace the means with medoids:

  1. For each chosen medoid \(x_k\), create the clusters \(C_k=\{x:d(x,x_k) \le d(x,x_{k'})\ \forall k'\}\).
  2. For each chosen cluster \(C_k\), calculate the medoids \(x_k=\text{argmin}_{x\in C_k}\sum_{x'\in C_k}d(x,x')\).
  3. If no cluster assignment changes, the algorithm terminates.

Since the algorithm only ever uses the dissimilarities, never the values inside each \(x\), \(k\)-medoids can operate directly on a dissimilarity matrix, without access to the underlying data.
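
A minimal sketch of the alternating steps working purely on a precomputed dissimilarity matrix D (the function name is made up and empty clusters are not handled):

import numpy as np

def kmedoids_sketch(D, k, n_iter=100, seed=0):
    """D is an (n, n) dissimilarity matrix; the algorithm only ever reads D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)            # random observations as starting medoids
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)                      # 1. assign each point to its nearest medoid
        new_medoids = np.array([
            np.flatnonzero(labels == j)[
                D[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()
            ]                                                      # 2. point minimizing summed dissimilarity in cluster j
            for j in range(k)
        ])
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)): # 3. stop when the medoids no longer change
            break
        medoids = new_medoids
    return labels, medoids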

Complexity

The \(k\)-means algorithm is NP-hard if either the number of clusters or the dimensionality of the data are not fixed.

For fixed \(k\) and \(d\), \(k\)-means can be solved in \(O(n^{dk+1})\).

There are several improvements: Lloyd's algorithm runs in \(O(nkdi)\), where \(i\) is the number of iterations. We will not describe this algorithm here.

\(k\)-medoids increases the computational complexity of the optimization step from \(O(kn)\) to \(O(kn^2)\).

Greedy iterative k-medoids

One proposed strategy to improve \(k\)-medoids is to iteratively swap each medoid for another observation whenever the swap decreases \[ \min_{C, \{x_k\}}\sum_k\sum_{x\in C_k}d(x,x_k) \]

When no more improvement can be found, the algorithm terminates.

Vector Quantization

Codebooks and compression

\(k\)-means produces a compression scheme. We'll illustrate it on an image; a code sketch follows the steps below:

  1. Subdivide the image into \(d\times d\) pixel blocks.
  2. Create \(k\)-means clusters of the pixel blocks.
  3. Use the list of means together with cluster indices to produce a compressed image.
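
A sketch of this scheme with scikit-learn, assuming a grayscale image img stored as an array of shape (H, W) with both dimensions divisible by d (the function names are made up):

import numpy as np
from sklearn import cluster

def vq_compress(img, d=4, k=64):
    H, W = img.shape
    blocks = (img.reshape(H // d, d, W // d, d)        # 1. cut the image into d x d blocks ...
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, d * d))                  #    ... one row per block
    km = cluster.KMeans(n_clusters=k, n_init=10).fit(blocks)   # 2. cluster the blocks
    return km.cluster_centers_, km.labels_, (H, W, d)  # 3. codebook of means + one index per block

def vq_decompress(codebook, labels, shape):
    H, W, d = shape
    blocks = codebook[labels].reshape(H // d, W // d, d, d)     # look each block up in the codebook
    return blocks.transpose(0, 2, 1, 3).reshape(H, W)           # reassemble the image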

Example

Picking parameters

Cluster count

Usually chosen by examining \(W(C)\): balance the complexity of the model (high \(k\)) against its explanatory power (low \(W(C)\)).

  • Plot \(W\) for different choices of \(k\) and look for a sharp inflection point (an "elbow"), as sketched in code below.
    Splitting separated groups brings a larger drop in \(W(C)\) than splitting within a group.
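
A sketch of such a plot on the iris data, using scikit-learn's inertia_ (the summed squared distance from each point to its assigned center) as a practical stand-in for \(W(C)\):

import matplotlib.pyplot as plt
from sklearn import cluster, datasets

X = datasets.load_iris().data                      # the iris data from the earlier examples

ks = range(1, 11)
W = [cluster.KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, W, marker="o")                        # look for the sharp bend ("elbow")
plt.xlabel("k")
plt.ylabel("within-group scatter (inertia)")
plt.show()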

Picking parameters

  • The gap statistic compares the \(\log W(C)\) curve to the corresponding curve computed from uniform reference data and picks the point with the largest distance between the two curves (sketched in code below).
  • Pick the first value of \(k\) that is close enough (within the simulated standard deviation) to a local (or global) maximum of the gap statistic.
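
A sketch of the gap computation, drawing uniform reference data over the bounding box of X and again using inertia_ in place of \(W(C)\) (the helper name is made up):

import numpy as np
from sklearn import cluster

def gap_statistic(X, ks, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)           # bounding box for the uniform reference data

    def log_w(data, k):
        return np.log(cluster.KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

    gaps, sds = [], []
    for k in ks:
        ref = np.array([log_w(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_ref)])
        gaps.append(ref.mean() - log_w(X, k))           # gap(k) = E[log W_ref(k)] - log W(k)
        sds.append(ref.std() * np.sqrt(1 + 1 / n_ref))  # simulated standard deviation
    return np.array(gaps), np.array(sds)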

Starting points

  • Pick random data points
  • Pick a random \(x_1\), then pick each subsequent starting point among the data points so as to minimize \(\sum_k\sum_{x\in C_k} d(x,x_k)\) given the centers chosen so far (a code sketch follows)
  • Pick starting centers using domain knowledge
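
A sketch of the second strategy on made-up data (candidates are searched exhaustively, so this is only reasonable for small data sets; the function name is made up):

import numpy as np

def greedy_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    D = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(-1))    # pairwise distances d(x, y)
    centers = [int(rng.integers(len(X)))]                        # pick a random x_1
    while len(centers) < k:
        best = min(
            (i for i in range(len(X)) if i not in centers),
            key=lambda i: D[:, centers + [i]].min(axis=1).sum()  # scatter if point i were added as a center
        )
        centers.append(best)                                     # greedily add the best data point
    return X[centers]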