How does meaning work? What do we mean when we say knife?
27 February, 2018
Semantic Features
A knife is [+SHARP EDGE] [+HAND SIZED HANDLE] [-LONG BLADE] [-DOUBLE EDGE] [-HEAVY EDGE]
This is similar in spirit to linear regression techniques: the meaning is decomposed into a combination of simple features.
Prototype Theory
A knife is something similar to a prototypical knife.
More similarity imbues more "knifeness".
A candidate procedure: pick a number of prototypes to locate, and then assign each object to the prototype it most resembles.
Iterative greedy descent
Within cluster point scatter: \[ W(C) = \frac12\sum_{k=1}^K\sum_{x\in C_k}\sum_{y\in C_k} d(x,y) \] Sum distances between points that occupy the same cluster.
Between cluster point scatter: \[ B(C) = \frac12\sum_{k=1}^K\sum_{x\in C_k}\sum_{y\not\in C_k} d(x,y) \]
Total scatter: \[ T = \frac12\sum_x\sum_y d(x,y) = W(C) + B(C) \]
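As a sanity check on these definitions, here is a small sketch (assuming NumPy, squared Euclidean distance as the dissimilarity \(d\), and an arbitrary assignment into three clusters) that confirms \(T = W(C) + B(C)\):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))               # 60 points in 2 dimensions
labels = rng.integers(0, 3, size=60)       # an arbitrary assignment into K = 3 clusters

# pairwise squared Euclidean dissimilarities d(x, y)
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

same = labels[:, None] == labels[None, :]  # True where x and y share a cluster
W = D[same].sum() / 2                      # within-cluster point scatter
B = D[~same].sum() / 2                     # between-cluster point scatter
T = D.sum() / 2                            # total point scatter

print(np.isclose(T, W + B))                # True: T = W(C) + B(C)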
Iterative greedy descent algorithm for clustering.
Note that \[ \sum_{x,x'}(x-x')^2 = \sum_{x,x'}(x-\overline x + \overline x-x')^2 = \\ \sum_{x,x'}(x-\overline x)^2 + \sum_{x,x'}(x'-\overline x)^2 - 2\sum_{x,x'}(x-\overline x)(x'-\overline x) = \\ 2N\sum_{x}(x-\overline x)^2 -2\sum_{x'}(x'-\overline x)\sum_{x}(x-\overline x) \] where \(N\) is the number of points summed over.
And that \(\sum_{x}(x-\overline x) = \left(\sum x - N\overline x\right) = (N\overline x- N\overline x) = 0\).
Using \(\sum_{x,x'}(x-x')^2 = 2N\sum_x(x-\overline x)^2\) it follows that \[ W(C) = \frac12\sum_k\sum_{x\in C_k}\sum_{x'\in C_k}\|x-x'\|^2 = \sum_k|C_k|\sum_{x\in C_k}\|x-\overline x_k\|^2 \]
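A quick numerical check of this identity for a single cluster (a sketch assuming NumPy; the data is random and purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
C_k = rng.normal(size=(25, 3))             # points x in one cluster C_k
centroid = C_k.mean(axis=0)                # \bar x_k

lhs = ((C_k[:, None, :] - C_k[None, :, :]) ** 2).sum() / 2   # (1/2) sum_{x,x'} ||x - x'||^2
rhs = len(C_k) * ((C_k - centroid) ** 2).sum()               # |C_k| sum_x ||x - \bar x_k||^2

print(np.isclose(lhs, rhs))                # True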
It follows that for a particular cluster assignment \(C\), the within-group scatter is completely determined by the means (centroids) of the clusters: once the assignment is fixed, each cluster's mean is the point minimizing the summed squared distances within that cluster.
Once the means are fixed, minimizing each point's distance to its assigned mean corresponds to picking the closest mean for each point.
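Alternating these two steps gives the iterative greedy descent for \(k\)-means. A minimal sketch of that alternation (assuming NumPy, with centers initialized at \(k\) random data points):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # start from k random data points
    for _ in range(n_iter):
        # assignment step: each point is assigned to its closest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        # update step: each center moves to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: assignments will not change
            break
        centers = new_centers
    return labels, centers

Each pass can only decrease \(W(C)\), so the procedure stops at a local minimum; in practice one restarts from several initializations and keeps the best result.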
Data matrix is X. Number of clusters is k. In R:

kct = kmeans(X, k)
kct$cluster       # holds cluster index for points in X
kct$centers       # holds coordinates for cluster centers
kct$tot.withinss  # holds within-group scatter
The same in Python, with scikit-learn:

from sklearn import cluster
model = cluster.KMeans(k).fit(X)
model.labels_           # holds cluster indices
model.cluster_centers_  # holds coordinates for cluster centers
Equivalently, fitting and predicting in one step:

from sklearn import cluster
model = cluster.KMeans(k)
labels = model.fit_predict(X)  # holds cluster indices
model.cluster_centers_        # holds coordinates for cluster centers
The scikit-learn implementation additionally provides the possibility to change coordinates into cluster-distance space: the point \(x\) is assigned the new coordinate \[ \hat x = (d(x,x_1), \dots, d(x,x_k)) \] consisting of distances to all cluster centers.

from sklearn import cluster
model = cluster.KMeans(k)
clusterspace = model.fit_transform(X)  # data in cluster-distance space
model.labels_                          # holds cluster indices
model.cluster_centers_                 # holds coordinates for cluster centers
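As a quick illustration of the correspondence (a sketch assuming NumPy and scikit-learn; note that transform returns plain Euclidean distances to the centers):

import numpy as np
from sklearn import cluster

X = np.random.default_rng(0).normal(size=(100, 2))
model = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
clusterspace = model.fit_transform(X)  # shape (100, 3): one distance per cluster center

# the first row equals the Euclidean distances from X[0] to each center
manual = np.linalg.norm(X[0] - model.cluster_centers_, axis=1)
print(np.allclose(clusterspace[0], manual))  # True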
One way to model multimodal distributions is to add two or more densities together, weighted so that the weights sum to 1. \[ p_{\mathcal M}(x) = \lambda_1p_1(x) + \dots + \lambda_kp_k(x) \]
Most common setup: mixed Gaussians \[ p_{\mathcal M}(x) = \lambda_1p_{\mathcal N(\mu_1,\sigma_1^2)}(x) + \dots + \lambda_kp_{\mathcal N(\mu_k,\sigma_k^2)}(x) \]
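A minimal sketch of such a mixture density (assuming NumPy and SciPy; the weights and component parameters below are made up purely for illustration):

import numpy as np
from scipy import stats

def mixture_density(x, lambdas, mus, sigmas):
    # p_M(x) = sum_k lambda_k * N(x; mu_k, sigma_k^2)
    return sum(lam * stats.norm(mu, sigma).pdf(x)
               for lam, mu, sigma in zip(lambdas, mus, sigmas))

xs = np.linspace(-5, 10, 500)
p = mixture_density(xs, lambdas=[0.3, 0.7], mus=[0.0, 4.0], sigmas=[1.0, 2.0])
print(np.trapz(p, xs))  # close to 1: a valid density, since the weights sum to 1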
Fitting a mixture model by direct likelihood maximization is difficult. The standard solution is the EM algorithm.
Consider a two-component mixture model, determined by \(\lambda_1,\mu_1,\mu_2,\sigma_1^2,\sigma_2^2\). Posit latent variables \(\Delta_i\) that take the value \(1\) if \(X_i\) comes from model 2, and \(0\) otherwise.
Instead of \(\Delta_i\), which is inaccessible, we work with \[ \gamma_i(\theta) = \mathbb{E}(\Delta_i|\theta,X) = \mathbb{P}(\Delta_i=1|\theta,X) \]
We call \(\gamma_i\) the responsibility of model 2 for observation i.
The Expectation Maximization algorithm alternates two steps:
Expectation: calculate a soft assignment of responsibilities to the observations, using the current estimates of all parameters.
Maximization: use the responsibility estimates for a maximum likelihood estimation of the parameters.
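A compact sketch of these two steps for the two-component Gaussian mixture above (assuming NumPy and SciPy; the crude initialization and fixed iteration count are simplifications for illustration):

import numpy as np
from scipy import stats

def em_two_gaussians(x, n_iter=100):
    # crude initial guesses for theta = (lambda_1, mu_1, mu_2, sigma_1^2, sigma_2^2)
    lam1, mu1, mu2 = 0.5, np.min(x), np.max(x)
    var1 = var2 = np.var(x)
    for _ in range(n_iter):
        # Expectation step: gamma_i = P(Delta_i = 1 | theta, x_i), the responsibility of model 2
        p1 = stats.norm(mu1, np.sqrt(var1)).pdf(x)
        p2 = stats.norm(mu2, np.sqrt(var2)).pdf(x)
        gamma = (1 - lam1) * p2 / (lam1 * p1 + (1 - lam1) * p2)
        # Maximization step: responsibility-weighted maximum likelihood estimates
        lam1 = (1 - gamma).mean()
        mu1 = ((1 - gamma) * x).sum() / (1 - gamma).sum()
        mu2 = (gamma * x).sum() / gamma.sum()
        var1 = ((1 - gamma) * (x - mu1) ** 2).sum() / (1 - gamma).sum()
        var2 = (gamma * (x - mu2) ** 2).sum() / gamma.sum()
    return lam1, mu1, mu2, var1, var2

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 2, 100)])
print(em_two_gaussians(x))  # recovers roughly lambda_1 = 2/3, mu = 0 and 5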
We had two main requirements for \(k\)-means: the dissimilarity is squared Euclidean distance, and each cluster center is the mean of the points assigned to it, so the data must be numeric vectors that can be averaged.
We can relax these by allowing an arbitrary dissimilarity \(d\) and restricting each cluster center to be one of the observations in its cluster. The result is the \(k\)-medoids algorithm.
Recall the algorithm: alternate between choosing a representative point for each cluster and assigning each observation to its closest representative. For \(k\)-medoids the representative is the medoid, the observation in the cluster with the smallest total dissimilarity to the other members.
Since the data within each \(x\) is never referenced, only the dissimilarities, the \(k\)-medoids algorithm can operate directly on dissimilarity matrices without reference to the data.
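A minimal sketch of the alternation, working only from a precomputed dissimilarity matrix D (assuming NumPy; kmedoids_sketch, the initialization, and the tie handling are all illustrative choices):

import numpy as np

def kmedoids_sketch(D, k, n_iter=100, seed=0):
    # D is an (n, n) dissimilarity matrix; medoids are indices into the data
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        # assignment step: each observation joins its closest medoid
        labels = D[:, medoids].argmin(axis=1)
        # update step: within each cluster, the member with the smallest total
        # dissimilarity to the other members becomes the new medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids

Because only D is referenced, the same code works for any dissimilarity, not just Euclidean distance.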
The \(k\)-means problem is NP-hard if either the number of clusters or the dimensionality of the data is not fixed.
For fixed \(k\) and \(d\), \(k\)-means can be solved in \(O(n^{dk+1})\).
There are several improvements: Lloyd's algorithm, an iterative heuristic, runs in \(O(nkdi)\), where \(i\) is the number of iterations. We will not describe these improvements in detail here.
\(k\)-medoids increases the computational complexity of the optimization step from \(O(kn)\) to \(O(kn^2)\).
One proposed strategy to improve \(k\)-medoids is to iteratively swap each medoid for another observation whenever the swap decreases \[ \min_{C}\sum_k\sum_{x\in C_k}d(x,x_k) \]
When no further improvement can be found, the algorithm terminates.
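A sketch of this swap strategy (assuming NumPy; kmedoids_swap is an illustrative name, and the exhaustive search over candidates is kept deliberately simple):

import numpy as np

def kmedoids_swap(D, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    cost = lambda m: D[:, m].min(axis=1).sum()  # objective: sum of d(x, closest medoid)
    improved = True
    while improved:
        improved = False
        for j in range(k):                      # try swapping each medoid ...
            for cand in range(len(D)):          # ... for every non-medoid observation
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = cand
                if cost(trial) < cost(medoids): # keep the swap if it decreases the objective
                    medoids = trial
                    improved = True
    return medoids

Each accepted swap strictly decreases the objective, so the loop terminates exactly when no improving swap remains.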
\(k\)-means also produces a compression scheme. We'll illustrate with an image:
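The idea, as a sketch (assuming NumPy and scikit-learn; the random array below stands in for a real RGB image): cluster the pixel colors, then store only the \(k\) cluster centers plus one small index per pixel.

import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)  # stand-in for a real RGB image

pixels = image.reshape(-1, 3)  # one row per pixel, columns are R, G, B
model = cluster.KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)

# the compressed representation: 16 colors and one 4-bit index per pixel
palette = model.cluster_centers_
indices = model.labels_

# reconstruction: every pixel is replaced by its cluster center
compressed = palette[indices].reshape(image.shape)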
How do we choose \(k\)? Usually by examining \(W(C)\): balance the complexity of the model (high \(k\)) against its explanatory power (low \(W(C)\)).
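One common version of this is the elbow heuristic: compute the within-group scatter for a range of \(k\) and look for the point where increasing \(k\) stops paying off. A sketch (assuming NumPy and scikit-learn; inertia_ holds the summed squared distances to the assigned centers, the quantity that tracks \(W(C)\)):

import numpy as np
from sklearn import cluster

X = np.random.default_rng(0).normal(size=(300, 2))

for k in range(1, 11):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)  # scatter decreases as k grows; look for the "elbow"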