8 March, 2018

Dimensionality reduction and Manifold learning

Intrinsic and extrinsic dimension

(Tidy) data comes with an embedding, or extrinsic, dimensionality: it arrives as an \(N\times d\) matrix of \(N\) observations of \(d\) variables.

Data itself might be (much) more restricted.

Example

Reaven and Miller: 145 non-obese patients with or without diabetes. Patients were given glucose; their blood sugar and insulin were measured before and after their snack. The measured variables:

  • relwt Relative weight: actual weight / expected weight (given height)
  • glufast Fasting plasma glucose level
  • glutest Test plasma glucose level
  • instest Plasma insulin during test
  • sspg Steady state plasma glucose
  • group Grouping into Normal, Chemical_Diabetic and Overt_Diabetic.

Example

Extrinsic dimension of the predictors is 5.

Intrinsic dimension?

Dimensions and Manifolds

Dimensionality Reduction changes the data representation to more closely approximate its intrinsic dimension.

Many dimensionality reduction techniques are linear: find a good projection matrix.

Manifold Learning tries to fit data to a more complicated manifold representation, allowing curved shapes etc.

Self-organized Maps

Introduced by Kohonen et al (2000)

Pick a grid. Gradually adapt it to data in a decaying update process. Move entire neighborhood closer to data in each step.

Self-organized Maps

Input: Grid \(Q_1\times Q_2\), data \(X=[x_1,\dots]\subset\mathbb R^d\)

  • Initialize all \(m_{k_1,k_2}\in\mathbb R^d\)
  • For the next observation \(x\)
  1. Find closest \(m_{k_1,k_2}\) to \(x\)
  2. Find set of neighbors \(\{q_1,\dots,q_m\}\subset Q_1\times Q_2\) of \((k_1,k_2)\)
  3. Move all the \(m_{q_j}\) closer to \(x\): \[ m_{q_j} = m_{q_j} + \alpha(x-m_{q_j}) \]
  • Decay neighborhood size and \(\alpha\) and repeat the traversal

Self-organized Maps (alternative)

Input: Grid \(Q_1\times Q_2\), data \(X=[x_1,\dots]\subset\mathbb R^d\)

  • Initialize all \(m_{(k_1,k_2)}\in\mathbb R^d\)
  • For the next observation \(x\)
  1. Find closest \(m_{(k_1,k_2)}\) to \(x\)
  2. Find set of neighbors \(\{q_1,\dots,q_m\}\subset Q_1\times Q_2\) of \({(k_1,k_2)}\)
  3. Move all the \(m_{q_j}\) closer to \(x\): \[ m_{q_j} = m_{q_j} + \alpha\, h[d(q_j,(k_1,k_2))](x-m_{q_j}) \]
  • Decay neighborhood size and \(\alpha\) and repeat the traversal
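
As a rough illustration, here is a minimal numpy sketch of the neighborhood-kernel update above. The grid size, decay schedule and Gaussian neighborhood kernel are my own choices, not prescribed by the algorithm.

```python
import numpy as np

def som(X, Q1=10, Q2=10, n_epochs=20, alpha0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM sketch: prototypes m_(k1,k2) in R^d on a Q1 x Q2 grid,
    updated with a decaying step size alpha and a Gaussian neighborhood kernel."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    M = X[rng.integers(0, N, size=Q1 * Q2)].astype(float)      # prototypes, (Q1*Q2, d)
    grid = np.array([(k1, k2) for k1 in range(Q1) for k2 in range(Q2)], dtype=float)
    for epoch in range(n_epochs):
        alpha = alpha0 * (1 - epoch / n_epochs)                # decaying step size
        sigma = 0.5 + sigma0 * (1 - epoch / n_epochs)          # decaying neighborhood size
        for x in X[rng.permutation(N)]:                        # traverse the data
            k = np.argmin(((M - x) ** 2).sum(axis=1))          # closest prototype to x
            # neighborhood kernel h[d(q_j, (k1,k2))] on the grid
            h = np.exp(-((grid - grid[k]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            M += alpha * h[:, None] * (x - M)                  # move neighborhood toward x
    return M.reshape(Q1, Q2, d)
```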

Principal Component Analysis

Geometry of Covariance

Let \(X=x_1,\dots,x_N\in\mathbb R^p\) be 0-centered, ie \(\overline x=0\). The sample covariance is the matrix with \(ij\) entry

\[ cov(X)_{ij} = \frac{1}{N-1}\sum_{k=1}^N x_{ki}x_{kj} = \frac{1}{N-1} X_i^TX_j \\ cov(X) = \frac{1}{N-1}X^TX \]

Diagonal entries measure variance of each variable; off-diagonal entries measure covariance (and thus correlation) of pairs of variables.

If we could decorrelate the variables, we could isolate their effects.

Geometry of Covariance

\(X=x_1,\dots,x_N\in\mathbb R^p\) 0-centered. \(cov(X)=X^TX/(N-1)\).

Decorrelation is a diagonalization of the covariance matrix: find a linear transformation \(P:\mathbb R^p\to\mathbb R^p\) such that \(P^{-1}cov(X)P\) is diagonal.

Theorem If \(M\) is real and symmetric, then there is a matrix \(P\) with orthonormal columns such that \(P^TMP\) is diagonal. For such a \(P\), \(P^{-1}=P^T\).

\(X^TX\) is real and symmetric by construction, so the theorem applies to the covariance matrix.

Geometry of Covariance

\(X=x_1,\dots,x_N\in\mathbb R^p\) 0-centered. \(cov(X)=X^TX/(N-1)\).

By diagonalization, \[ D = P^T(X^TX)P = (XP)^T(XP) \]

Thus \(XP\) is a transformed decorrelated dataset, and \(P\) encodes a decorrelating data transformation.

By permuting entries in \(P\), we can ensure that \(D\) has diagonal values in decreasing order: \(d_1 \geq d_2 \geq \dots \geq d_p \geq 0\). The entries are non-negative since the diagonal encodes variances.

Geometry of Covariance

\(X=x_1,\dots,x_N\in\mathbb R^p\) 0-centered. \(cov(X)=X^TX/(N-1)\). \(D\) diagonal with entries \(d_1\geq d_2 \geq \dots \geq d_p \geq 0\). \(D=(XP)^T(XP)\).

With orthonormal \(P\) and up to rearranging the order of the variances \(d_j\), the variances are unique.

The columns \(y_j\) of \(XP\) have decreasing variances, with \(y_1\) maximizing the variance.

Hence \(y_1\) is the projection of the data onto a unit vector with the largest possible resulting variance.
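
A minimal numpy sketch of this decorrelation, on toy data (variable names are my own):

```python
import numpy as np

# Decorrelate a 0-centered data matrix X (N x p) by diagonalizing its covariance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
X = X - X.mean(axis=0)                                    # 0-center

C = X.T @ X / (X.shape[0] - 1)             # sample covariance cov(X) = X^T X / (N-1)
evals, P = np.linalg.eigh(C)               # orthonormal P with P^T C P diagonal
order = np.argsort(evals)[::-1]            # permute so variances come in decreasing order
P, evals = P[:, order], evals[order]

Y = X @ P                                  # decorrelated data: cov(Y) is (nearly) diagonal
print(np.round(np.cov(Y, rowvar=False), 6))
```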

Principal Components

\(X=x_1,\dots,x_N\in\mathbb R^p\).

Theorem A diagonalizing \(P\) for the matrix \(X^TX\) can be found through the singular value decomposition (SVD):

SVD produces \(X = UDV^T\), with \(U\) and \(V\) orthogonal matrices (\(U^{-1}=U^T\), \(V^{-1}=V^T\)).

Then \[ X^TX = (UDV^T)^TUDV^T = VDU^TUDV^T = VD^2V^T \\ V^TX^TXV = D^2 \]

Hence, the transformation \(X\mapsto XV = UDV^TV = UD\) produces the decorrelation: \(P=V\).
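
A short numpy sketch of the same computation through the SVD (toy data again; names are mine):

```python
import numpy as np

# Principal components via the SVD: X = U D V^T, so XV = UD and V^T X^T X V = D^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # toy data
X = X - X.mean(axis=0)                                    # 0-center

U, d, Vt = np.linalg.svd(X, full_matrices=False)          # d holds the diagonal of D
Z = X @ Vt.T                                              # principal components XV
print(np.allclose(Z, U * d))                              # XV = UD
print(np.allclose(Vt @ X.T @ X @ Vt.T, np.diag(d ** 2)))  # V^T X^T X V = D^2
```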

Principal Components

\(X=x_1,\dots,x_N\in\mathbb R^p\). \(X = UDV^T\) by singular value decomposition.

The principal components of \(X\) are the columns of \(XV=UD\).

The matrix \(V\) is often called the loadings of the principal components.

The importance of each component is often measured in terms of explained variance: in the SVD notation, the square \(d_j^2\) is the sum of squares of the \(j\)th transformed variable.

Hence \(\sum_j d_j^2 = \sum_i\sum_j y_{ij}^2\). The quantity \(\frac{d_j^2}{\sum_k d_k^2}\) measures the contribution to this sum of squares produced by the \(j\)th component.
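
A small sketch of these explained-variance fractions, computed from the singular values (toy data; names mine):

```python
import numpy as np

# Fraction of the total sum of squares explained by each principal component.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))   # toy data
X = X - X.mean(axis=0)

d = np.linalg.svd(X, compute_uv=False)       # singular values d_1 >= ... >= d_p
explained = d ** 2 / np.sum(d ** 2)          # d_j^2 / sum_k d_k^2
print(np.round(100 * explained, 2))          # in percent
```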

Principal Components

For an approach focused more on the linear algebra of projection matrices than on decorrelation, the book has a good exposition.

The basic idea is, for a rank \(q\) linear model \[ x = \mu + V_q\lambda \qquad V_q:\mathbb R^q\to\mathbb R^p \] to estimate \(\mu\), the \(\lambda_i\) and \(V_q\) with the lowest squared loss across the \(x_i\).

This leads us to consider the matrix \(V_qV_q^T\) as a projection matrix from \(\mathbb R^p\) onto the column space of \(V_q\). The SVD emerges for estimating \(V_q\): \(V_q\) is the first \(q\) columns of the \(V\) from the SVD.
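
A short sketch of the resulting rank-\(q\) reconstruction, with \(V_q\) taken from the SVD (toy data; names mine):

```python
import numpy as np

# Rank-q reconstruction x ~ mu + V_q V_q^T (x - mu), with V_q the first q columns of V.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # toy data
mu = X.mean(axis=0)

q = 2
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Vq = Vt[:q].T                                  # first q right singular vectors, p x q
X_hat = mu + (X - mu) @ Vq @ Vq.T              # project onto the column space of V_q
print(np.linalg.norm(X - X_hat))               # Frobenius norm of the reconstruction error
```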

Application: Interpretation

Just as with linear regression, the loadings can be taken as explanatory for the data itself:

Sepal.Length Sepal.Width Petal.Length Petal.Width sdev explained
0.3613866 -0.0845225 0.8566706 0.3582892 2.0494032 68.934513
-0.6565888 -0.7301614 0.1733727 0.0754810 0.4909714 16.514504
-0.5820299 0.5979108 0.0762361 0.5458314 0.2787259 9.375330
0.3154872 -0.3197231 -0.4798390 0.7536574 0.1538707 5.175654

Application: Visualization

My most common tool for visualizing high-dimensional datasets: PCA.

Application: Dimensionality Reduction

By picking only the first few principal components, the dataset can be (sometimes drastically) reduced in dimensionality.

Our book has an example on MNIST digits (Figure 14.23). This reduces dimensionality from 256 to 2.

Application: Dimensionality Reduction

How to tell what dimension to pick?

Similar to our previous use of diminishing returns for model errors, we can use explained variance here. A scree plot plots the number of components used against the fraction of explained variance.

Here, a dataset with handwritten digits.
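
A sketch of how such a scree plot could be produced; I use scikit-learn's 8x8 digits data as a stand-in for the 16x16 digits in the book:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits     # 8x8 handwritten digits, 64 dimensions

X = load_digits().data
X = X - X.mean(axis=0)

d = np.linalg.svd(X, compute_uv=False)
frac = np.cumsum(d ** 2) / np.sum(d ** 2)     # cumulative fraction of explained variance

plt.plot(np.arange(1, len(frac) + 1), frac, marker="o")
plt.xlabel("# components")
plt.ylabel("cumulative fraction of explained variance")
plt.show()
```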

Application: Procrustes Distance

A common task in geometric statistics (handwriting recognition, classification in archaeology, etc.) is to compare sets of landmarks with each other.

Orientation and translation may differ between two data collections: we look for the best possible fit of the two shapes to each other:

\[ \min_{\mu,R} \|X_2 - (X_1R + \mathbb 1\mu^T)\|_F \] with \(X_1, X_2\) the two sets of landmarks, \(R\) orthonormal and \(\mu\) a vector of location coordinates. The norm is the Frobenius norm, with \(\|X\|_F^2 = \text{trace}(X^TX) = \sum_i (X^TX)_{ii} = \sum_i\sum_j X_{ij}^2\).

Application: Procrustes Distance

\(X_1, X_2\) datasets, \(R\) orthonormal, \(\mu\) location. Frobenius norm \(\|X\|_F^2 = \text{trace}(X^TX)\). \[ \min_{\mu,R} \|X_2 - (X_1R + \mathbb 1\mu^T)\|_F \]

By the same approach as with principal components, with the centered data \(\hat X_j=X_j-\overline x_j\), the solution comes out to \[ \text{SVD}(\hat X_1^T\hat X_2) = UDV^T\\ \hat R = UV^T \qquad \hat\mu = \overline x_2-\hat R^T\overline x_1 \]
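
A minimal numpy sketch of this Procrustes fit (the function name is mine):

```python
import numpy as np

def procrustes_fit(X1, X2):
    """Align the landmark set X1 to X2 by an orthonormal R and a translation mu."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    X1c, X2c = X1 - xbar1, X2 - xbar2             # 0-center both landmark sets
    U, _, Vt = np.linalg.svd(X1c.T @ X2c)         # SVD(X1_hat^T X2_hat) = U D V^T
    R = U @ Vt                                    # R_hat = U V^T
    mu = xbar2 - R.T @ xbar1                      # mu_hat = xbar2 - R^T xbar1
    return R, mu

# usage: X1_aligned = X1 @ R + mu, then compare with X2 in Frobenius norm
```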

Application: Procrustes Distance

\(\hat X_1, \hat X_2\) 0-centered datasets, \(R\) orthonormal, \(\mu\) location. Frobenius norm \(\|X\|_F^2 = \text{trace}(X^TX)\). \[ \min_{\mu,R} \|X_2 - (X_1R + \mathbb 1\mu^T)\|_F \\ \text{SVD}(\hat X_1^T\hat X_2) = UDV^T\\ \hat R = UV^T \qquad \hat\mu = \overline x_2-\hat R^T\overline x_1 \]

The same basic approach can be done with a variety of choices of admissible transformations of the data to be aligned.

Repeatedly aligning and taking sample means produces the Procrustes average.

Application: Principal Curves

\(f:\mathbb R\to \mathbb R^p\) a smooth curve. \(X\in\mathbb R^p\) a random variable. Write \(\lambda_f(x)\) for the parameter of the point on the curve closest to \(x\): \(d(f(\lambda_f(x)),x) \leq d(f(\lambda),x)\) for all \(\lambda\).

\(f\) is a principal curve for \(X\) if it is self-consistent, ie \[ f(\lambda) = \mathbb E[X\ |\ \lambda_f(X)=\lambda] \] the curve is the average of the points that project to it.

Application: Principal Curves

Translate to data: now \(X\subset\mathbb R^p\). Alternate

  1. \(\hat f_j(\lambda) = \mathbb E[X_j\ |\ \hat\lambda_f(X)=\lambda]\) (in practice, a smoother of \(X_j\) against \(\hat\lambda_f(X)\))
  2. \(\hat\lambda_f(x) = \text{arg}\min_{\lambda'}\|x-\hat f(\lambda')\|^2\)

The setting generalizes transparently to fitting higher dimensional manifolds: let \(f:\mathbb R^d\to \mathbb R^p\) be parametrized by \(\lambda = \lambda_1,\dots,\lambda_d\).

Estimate \(\hat f_j\) using the coordinates \(\lambda_1,\dots,\lambda_d\) of the point nearest to \(X\), and estimate the coordinates the same way as above.
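
A rough numpy sketch of the one-dimensional alternation; I substitute a Gaussian kernel smoother on a \(\lambda\)-grid for the generic smoother, and the initialization, bandwidth and grid size are my own choices:

```python
import numpy as np

def principal_curve(X, n_iter=10, span=0.2, grid_size=100):
    """Alternate (1) smoothing each coordinate against lambda and
    (2) re-projecting each point onto the discretized curve."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                                    # initialize with the first PC scores
    for _ in range(n_iter):
        grid = np.linspace(lam.min(), lam.max(), grid_size)
        h = span * (lam.max() - lam.min())              # kernel bandwidth
        W = np.exp(-0.5 * ((grid[:, None] - lam[None, :]) / h) ** 2)
        W /= W.sum(axis=1, keepdims=True)
        curve = W @ X                                   # step 1: f_hat on the lambda grid
        d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
        lam = grid[np.argmin(d2, axis=1)]               # step 2: lambda_hat(x) by projection
    return curve, lam
```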

Application: Spectral Clustering

One approach comes from graph theory: build a weighted neighborhood graph on the data points using some similarity measure, including an edge only when the similarity is sufficiently high.

  • Similarity: many distance options. One example is \(w_{ij}=\exp[-d(x_i,x_j)^2/c]\) for scale parameter \(c\).
  • K-nearest neighbor graph: include \(e_{ij}\) if \(x_i\) is among \(x_j\)'s K nearest neighbors or vice versa.
  • Fully connected graph.

Application: Spectral Clustering

Edge weight matrix \(W\) contains an entry \(w_{ij}\) set to the similarity measure if \(e_{ij}\) is in the graph. Degree matrix \(G\) is diagonal with entries \(g_{ii}=\sum_j w_{ij}\).

We define the Graph Laplacian \[ L = G-W \]

The Graph Laplacian is a discrete version of the Laplace operator behind the heat equation. Steady states emerge as eigenvectors, with their eigenvalues measuring the energy of the state.

Lowest energy steady state of heat diffusion is constant: each vertex assigned the value 1.

Application: Spectral Clustering

Picking low energy (but non-trivial) eigenvectors of \(L\) represents the data in a way focusing on the graph structure.

Eigenvectors belonging to low (but nonzero) eigenvalues can be used for

  • Graph layout (spectral layout)
  • Clustering (spectral clustering)

Represent each data point by its coordinates in the few (15 or so) non-trivial eigenvectors with the smallest eigenvalues. Cluster the data with k-means in this "graph Laplacian space".
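
A compact sketch of this pipeline, using a fully connected radial-similarity graph; the scale \(c\), number of eigenvectors, and use of scikit-learn's KMeans are my own choices:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=3, c=1.0, n_eig=15):
    """Radial similarity, graph Laplacian, k-means on the low-energy eigenvectors."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / c)    # fully connected similarity graph
    G = np.diag(W.sum(axis=1))                     # degree matrix
    L = G - W                                      # graph Laplacian
    vals, vecs = np.linalg.eigh(L)                 # eigenvalues in increasing order
    Z = vecs[:, 1:n_eig + 1]                       # drop the trivial constant eigenvector
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
```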

Extension: Kernel PCA

For PCA, we can equivalently calculate an eigen-decomposition of the (double-centered) Gram matrix \[ \hat K = (X-\overline X)(X-\overline X)^T = (I-11^T/N)XX^T(I-11^T/N) = UD^2U^T \]

The principal components are \(Z=UD\).

Extension: Kernel PCA

Kernel methods (in general) rely on a perspective shift:

The Kernel Trick: Many non-linear tasks can be made linear by mapping data into a higher dimension. The kernel trick is based on an imagined mapping \(\Phi:\mathbb R^p\to\mathcal V\), where \(\mathcal V\) is high- (possibly infinitely-) dimensional.

Many, if not all, linear methods use the data only through dot products, so we would need to work with \(\Phi(x)\cdot\Phi(y)\).

By creating a function (the kernel) that calculates \(K(x,y)=\Phi(x)\cdot\Phi(y)\) without ever calculating \(\Phi\) explicitly, we can work as if the data had been mapped.

Extension: Kernel PCA

The Gram matrix \(XX^T\) encodes an inner product on \(\mathbb R^p\) by \((XX^T)_{ij} = \sum_{k=1}^p x_{ik}x_{jk} = \langle x_i, x_j\rangle\).

Following the kernel trick, we compute \(K(x_i,x_j)\) instead of \(\langle x_i,x_j\rangle\). This produces the kernel Gram matrix \(K\). Viewing this matrix, after the same double-centering, as our Gram matrix, we can calculate an eigenvalue decomposition \[ \hat K = UD^2U^T \] and use \(Z=UD\) as our kernel principal components.
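
A minimal numpy sketch of kernel PCA with the radial kernel (the scale \(c\) and function name are mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_pca(X, c=1.0, n_components=2):
    """Kernel PCA: double-center the kernel Gram matrix, eigen-decompose, return Z = UD."""
    N = X.shape[0]
    K = np.exp(-cdist(X, X, "sqeuclidean") / c)    # kernel Gram matrix, radial kernel
    J = np.eye(N) - np.ones((N, N)) / N            # centering matrix I - 11^T/N
    K_hat = J @ K @ J                              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(K_hat)             # eigenvalues in increasing order
    idx = np.argsort(vals)[::-1][:n_components]    # largest eigenvalues (the d_j^2)
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))   # Z = UD
```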

Extension: Kernel PCA

Using the radial kernel \[ K(x,y) = \exp[-\|x-y\|^2/c] \] the Gram matrix is the weight matrix \(W\) we introduced in spectral clustering earlier. Kernel PCA finds eigenvectors corresponding to the largest eigenvalues of \(K\). This is equivalent to finding the smallest eigenvalues of \(I-K\), which closely parallels the Graph Laplacian \(G-W\) used in spectral clustering (the two coincide when all degrees are equal).

Extension: Kernel PCA

Common kernels include

  • Radial \(K(x,y) = \exp[-\|x-y\|^2/c]\)
  • Polynomial \(K(x,y) = [a\langle x,y\rangle+b]^d\)
  • Linear \(K(x,y) = \langle x,y\rangle\)
  • tanh \(K(x,y) = \tanh(a\langle x,y\rangle + b)\)
  • Laplacian \(K(x,y) = \exp[-\|x-y\|/c]\)
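
For concreteness, a small sketch of these kernels written as plain Python functions (the parameter defaults are placeholders, not prescribed values):

```python
import numpy as np

def radial(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, a=1.0, b=1.0, d=2):
    return (a * np.dot(x, y) + b) ** d

def linear(x, y):
    return np.dot(x, y)

def tanh_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * np.dot(x, y) + b)

def laplacian(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) / c)
```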

One powerful feature is that kernels can be built for non-numeric data. One class is string kernels, which define an inner product for comparing pairs of strings.