25 March, 2018

Addendum to PCA

PCA in Python

With scikit-learn

from sklearn import decomposition # implements PCA, KernelPCA, FastICA, FactorAnalysis, ...
pca = decomposition.PCA(n_components=2)    # keep e.g. the first two principal components
scores = pca.fit_transform(X)              # or pca.fit(X) if not interested in the scores
loadings = pca.components_                 # one row per principal component
pca.explained_variance_, pca.explained_variance_ratio_   # variance explained, absolute and as a ratio

PCA in Python

With scikit-learn

  • decomposition.SparsePCA – sparse PCA: uses an optimization penalty to minimize the number of non-zero loadings; alpha controls the emphasis on sparsity.
  • decomposition.KernelPCA – kernel PCA: kernel takes the values "linear", "poly", "rbf", "sigmoid", "cosine" or "precomputed"; the remaining options differ per kernel – check the documentation.
  • decomposition.IncrementalPCA – memory-efficient PCA; same options as PCA.
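
As a rough illustration of these options (a minimal sketch, assuming numpy and scikit-learn; the data matrix and parameter values are placeholders):

import numpy as np
from sklearn import decomposition

X = np.random.default_rng(0).normal(size=(100, 5))             # stand-in data matrix

kpca = decomposition.KernelPCA(n_components=2, kernel="rbf")   # kernel-specific options, e.g. gamma, go here
scores_k = kpca.fit_transform(X)

spca = decomposition.SparsePCA(n_components=2, alpha=1.0)      # larger alpha -> sparser loadings
scores_s = spca.fit_transform(X)
loadings_s = spca.components_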

PCA in R

Two commands: princomp and prcomp.

  • princomp: uses eigen to calculate the principal components, relying on a full eigenvalue decomposition. Only handles feature extraction of variables and requires more observations than variables.
  • prcomp: preferred. Uses svd to calculate the principal components.

PCA in R

X.pca = prcomp(X)                                    # PCA via the SVD
explained.variance = X.pca$sdev^2                    # variances of the principal components
explained.variance.ratio = explained.variance/sum(explained.variance)
X.scores = X.pca$x                                   # scores: data in the principal component basis
X.loadings = X.pca$rotation                          # loadings: one column per component

Factor Analysis

Latent Variables

If our data were explained by the influence of a collection of unobserved variables – latent variables, or factors – one linear model would be

\[ X = \mu + LF + \epsilon \]

where

  • \(F\) measures the factors. A reasonable assumption is that they are uncorrelated with unit variance – i.e. \(F\) is orthonormal
  • \(L\) contains the loadings – how much influence each factor has on each variable
  • \(\epsilon\) is normal independent noise: \(\Sigma=Cov(\epsilon)\) is a diagonal matrix. We also assume \(\epsilon\) is independent of \(F\). (A short numpy simulation of this model is sketched below.)
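
As a hypothetical illustration (all sizes and values below are made up), the model is easy to simulate with numpy, and the identity \(Cov(X)=LL^T+\Sigma\) derived below can be checked empirically:

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 6, 2                               # observations, variables, factors (illustrative)
mu = np.zeros(p)
L = rng.normal(size=(p, k))                       # loadings
F = rng.normal(size=(k, n))                       # uncorrelated, unit-variance factors
Sigma_diag = rng.uniform(0.1, 0.5, size=p)        # diagonal of Cov(eps)
eps = rng.normal(scale=np.sqrt(Sigma_diag)[:, None], size=(p, n))

X = mu[:, None] + L @ F + eps                     # X = mu + L F + eps, columns are observations
print(np.abs(np.cov(X) - (L @ L.T + np.diag(Sigma_diag))).max())   # small for large n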

Latent Variables

The similarity with the PCA setup, where \(X-\mu \sim (UD)V^T\), is not accidental; in early factor analysis, PCA and Factor Analysis (FA) were considered synonymous.

Now, note \[ Cov(X) = \mathbb E(X-\mu)(X-\mu)^T = \mathbb E(LF+\epsilon)(LF+\epsilon)^T =\\ \mathbb E(LF)(LF)^T + \mathbb E(LF)\epsilon^T + \mathbb E\epsilon(LF)^T + \mathbb E\epsilon\epsilon^T = \\ L(\mathbb EFF^T)L^T + L\mathbb EF\epsilon^T + (\mathbb E\epsilon F^T)L^T + \mathbb E\epsilon\epsilon^T \]

\(F\) orthonormal means \(\mathbb EFF^T=\mathbb 1\). \(\epsilon\) zero-mean and independent of \(F\) means \(\mathbb E\epsilon F^T = 0\) and \(\mathbb EF\epsilon^T = 0\), while \(\mathbb E\epsilon\epsilon^T=\Sigma\)

\[ Cov(X) = LL^T + \Sigma \]

Latent Variables

The variance for \(X_j\) decomposes as \[ \mathbb VX_j = (LL^T)_{jj} + \Sigma_{jj} = h^2_j + \psi_j \]

for the communality \(h^2_j\) and the uniqueness \(\psi_j\).

Eigenvalue decomposition estimates the loadings, communality and uniqueness through:

\[ Cov(X) = V\Lambda V^T = Cov(V\sqrt{\Lambda}) \\ \hat L = V\sqrt{\Lambda} \qquad \hat\psi_j = \sigma_j - (\hat L\hat L^T)_{jj} \]

This approach emphasizes (because it is PCA in disguise) the spread of each variable, ie the uniqueness.
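
A minimal numpy sketch of this eigenvalue-based estimate (the function name and the choice of \(k\) are not from the slides):

import numpy as np

def pc_factor_estimate(X, k):
    # X has one row per observation; k is the number of factors to keep
    C = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(C)                    # eigenvalues in ascending order
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]     # keep the k largest eigenpairs
    L_hat = V * np.sqrt(lam)                      # L-hat = V sqrt(Lambda)
    psi_hat = np.diag(C) - np.sum(L_hat ** 2, axis=1)   # psi-hat_j = C_jj - (L L^T)_jj
    return L_hat, psi_hat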

Latent Variables

To put the focus on communality instead, an iterated method works from the sample correlation matrix \(R\), using \[ LL^T = R-\Sigma \] and iteratively updating (a small numpy sketch follows the list):

  1. \(\hat R = R-\hat\Sigma\)
  2. \(\hat V\hat\Lambda\hat V^T\), the eigenvalue decomposition of \(\hat R\), giving \(\hat L = \hat V\sqrt{\hat\Lambda}\) restricted to the leading factors
  3. \(\hat\Sigma = \mathbb 1 - \text{diag}(\hat L\hat L^T)\)
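
A small numpy version of this iteration (the number of factors \(k\), the zero initial \(\hat\Sigma\) and the fixed iteration count are illustrative choices):

import numpy as np

def iterated_principal_factors(R, k, n_iter=50):
    # R: sample correlation matrix, assumed given
    Sigma_hat = np.zeros(R.shape[0])                  # start with no uniqueness
    for _ in range(n_iter):
        R_hat = R - np.diag(Sigma_hat)                # 1. reduced correlation matrix
        lam, V = np.linalg.eigh(R_hat)                # 2. eigenvalue decomposition
        lam, V = lam[::-1][:k], V[:, ::-1][:, :k]
        L_hat = V * np.sqrt(np.clip(lam, 0, None))    #    L-hat = V sqrt(Lambda)
        Sigma_hat = 1.0 - np.sum(L_hat ** 2, axis=1)  # 3. Sigma-hat = 1 - diag(L L^T)
    return L_hat, Sigma_hat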

Latent Variables

To get an easy to interpret solution, assume normality and seek a maximum likelihood estimate. Write \(S=LL^T+\Sigma\),

\[ \ell = \log\mathcal L = \\ -\frac12\left[ np\log2\pi - n\log|S^{-1}| + \sum (x_i-\mu)^TS^{-1}(x_i-\mu) \right] \]

and seek loadings \(L\) that maximize \(\ell\).
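
scikit-learn's FactorAnalysis fits this model by maximum likelihood; a minimal sketch, with placeholder data and an illustrative number of factors:

import numpy as np
from sklearn import decomposition

X = np.random.default_rng(0).normal(size=(200, 6))   # stand-in data

fa = decomposition.FactorAnalysis(n_components=2)    # maximum-likelihood factor analysis
fa.fit(X)
L_hat = fa.components_.T        # loadings, one row per variable
psi_hat = fa.noise_variance_    # estimated uniquenesses (diagonal of Sigma)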

Rotated loadings

If \(X=\mu+LF+\epsilon\) is a latent variable decomposition of \(X\), then for any orthonormal matrix \(R\), \[ X=\mu+(LR)(R^TF) + \epsilon = \mu + L'F' + \epsilon\\ Cov(X) = L'L'^T + \Sigma = (LR)(LR)^T + \Sigma = LRR^TL^T+\Sigma \] where \(L' = LR\) and \(F' = R^TF\).

With \(R\) orthonormal, \(RR^T=\mathbb 1\), so the covariance is invariant under transforming factors and loadings by an orthonormal transformation.

Several principles for finding an optimal rotation – varimax, quartimax, equamax, parsimax – work by balancing an optimization between a sum of quartics (fourth powers of the loadings) and a sum of squared sums of squares.
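
As an illustration of the varimax idea, here is a commonly used SVD-based iteration (a sketch, not tied to any particular library; gamma = 1 corresponds to varimax, other values give other members of the family):

import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    # L: (p, k) loading matrix; returns rotated loadings and the rotation R
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # gradient-like target balancing fourth powers against squared column sums of squares
        M = L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0)))
        u, s, vt = np.linalg.svd(M)
        R = u @ vt                                    # nearest orthonormal rotation
        d_old, d = d, np.sum(s)
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return L @ R, R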

Independent Component Analysis

Independent Component Analysis

If our data were explained by the influence of a collection of unobserved variables – latent variables, or factors – one linear model would be

\[ X = \mu + LF + \epsilon \]

where

  • \(F\) measures the factors. Now the assumption is strengthened: the factors are independent (not merely uncorrelated) with unit variance – \(F\) is still orthonormal
  • \(L\) contains the loadings – how much influence each factor has on each variable
  • \(\epsilon\) is normal independent noise: \(\Sigma=Cov(\epsilon)\) is a diagonal matrix. We also assume \(\epsilon\) independent from \(F\).

Independent Component Analysis

With independent component analysis, under an assumption that the factors are non-Gaussian, the loadings \(L\) can be uniquely determined: they are no longer determined only up to rotation.

By focusing on non-Gaussian distributions, ICA components tend to have long tails and emphasize extreme examples.
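
A minimal scikit-learn sketch (the Laplace sources and the mixing matrix are made-up stand-ins for long-tailed, non-Gaussian factors):

import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 2))             # non-Gaussian (long-tailed) sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])      # illustrative mixing matrix (loadings)
X = S @ A.T                                 # observed mixtures

ica = decomposition.FastICA(n_components=2, random_state=0)
F_hat = ica.fit_transform(X)                # recovered factors (up to order and sign)
L_hat = ica.mixing_                         # estimated mixing matrix / loadings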

ICA: Example

(example figures not reproduced here)

Entropy

Let \(X\) be a random variable with density function \(p(x)\).

The information content \(I(X)\) of \(X\) is the minimal amount of information required to encode \(X\). For events (sets of possible outcomes) \(A, B\):

  • \(I(A) = 0\) if \(\mathbb P(A) = 1\). (no information if the outcome is guaranteed)
  • \(I(A) > 0\) if \(\mathbb P(A) < 1\).
  • \(I(A\cap B) = I(A)+I(B)\) for independent \(A, B\). (information is additive)

From the multiplicative law of probability for independent events it follows that if \(I(A) = f(\mathbb P(A))\), then \(f(x\cdot y) = f(x)+f(y)\).

Entropy

Functions \(f\) such that \(f(x\cdot y)=f(x)+f(y)\) are all multiples of logarithms: \(I(A)=K\log(\mathbb P(A))\).

To ensure positive information, \(K<0\); we define \(I(A)=-\log_b(\mathbb P(A))\).
Base 2: bits. Base \(e\): nats. Base 10: hartleys.

The entropy of a random variable \(X\) is its expected information content:

\[ H(X) = \mathbb EI = -\int p(x)\log p(x)dx \]

Theorem: a Gaussian \(X\) maximizes entropy among random variables with fixed variance.
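
For reference, the maximum itself has a closed form: a univariate Gaussian with variance \(\sigma^2\) has entropy

\[ H(\mathcal N(\mu,\sigma^2)) = \frac12\log(2\pi e\sigma^2) \]

(in nats), so any random variable with variance \(\sigma^2\) has entropy at most \(\frac12\log(2\pi e\sigma^2)\).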

Entropy

If \(X\) is a random vector, we can use entropy to build a measure of dependence between its components:

The mutual information can be defined as \[ MI(X) = \sum H(X_j)-H(X) \]

This is an instance of the Kullback-Leibler divergence (relative entropy) between probability densities: \[ KL(p \| q) = \mathbb E_p\left[I_q - I_p\right] = \int\log\frac{p(x)}{q(x)}p(x)dx \\ MI(X) = KL\left(p \middle\| \prod p_j\right) \]

Entropy and ICA

Without loss of generality, \(X\) is zero-centered, decorrelated and with unit variance (whitened). Then, since both \(X\) and \(F\) are white, it follows that \(L\) is orthogonal. Hence:

If \(X=LF\) then \(F=L^TX\), and \[ MI(F) = \sum H(F_j) - H(X) - \log|\det L| = \sum H(F_j)-H(X) \]

Choosing \(L\) to minimize \(MI(F)\) amounts to maximizing independence between the components of \(L^TX\); equivalently, minimizing the sum of entropies of the individual components, which (by the theorem above, at fixed variance) is equivalent to maximizing their departure from normality.

Entropy and ICA

Simplification (Hyvärinen – Oja, 2000) by using negentropy instead: the departure from normality is measured through \[ J(F_j) = H(Z_j) - H(F_j) \qquad Z_j\sim\mathcal N(\mathbb EF_j, \mathbb VF_j) \\ J(F_j) \approx\frac{1}{a^2}\left[ \mathbb E\log\cosh(aF_j) - \mathbb E\log\cosh(aZ_j) \right]^2 \]

with \(1\leq a\leq 2\). This (replacing expectations and variances with sample expectations and sample variances) yields the FastICA algorithm for calculating ICA.
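
A direct numpy transcription of the approximation (sample means stand in for the expectations; the function name, Monte-Carlo size and default a are illustrative):

import numpy as np

def negentropy_logcosh(f, a=1.0, n_mc=100000, seed=0):
    # approximate J(F) = (1/a^2) * [E log cosh(aF) - E log cosh(aZ)]^2
    # with Z Gaussian matched to the sample mean and variance of F
    rng = np.random.default_rng(seed)
    f = np.asarray(f, dtype=float)
    z = rng.normal(f.mean(), f.std(), size=n_mc)
    ef = np.mean(np.log(np.cosh(a * f)))
    ez = np.mean(np.log(np.cosh(a * z)))
    return (ef - ez) ** 2 / a ** 2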

Product Density ICA algorithm

Here is one way to perform ICA, using generalized additive models. Write \(\phi=\text{PDF}_{\mathcal N(0,1)}\), and represent each factor density as \[ f_j(s_j) = \phi(s_j)\exp[g_j(s_j)] \] for some choice of functions \(g_j\). Assume whitened \(X\). Then the log-likelihood for \(LF\) is \[ \ell(L, g_j | X) = \sum_{i=1}^N\sum_{j=1}^p[\log\phi(L_j^Tx_i)+g_j(L_j^Tx_i)] \]

Product Density ICA algorithm

We add constraints to enforce

  • \(\hat f_j=\phi\exp[\hat g_j]\) is a probability density
  • \(\hat g_j\) keeps its jerk low (i.e. the third derivative \(\hat g_j'''\) stays small)

This leads us to maximize

\[ \sum_{j=1}^p \left[ \frac{1}{N}\sum_{i=1}^N[\log\phi(L_j^Tx_i) + g_j(L_j^Tx_i)] - \\ \int\phi(t)\exp[{g_j(t)}]dt - \lambda_j\int[g_j'''(t)]^2dt \right] \]

Product Density ICA algorithm

Using quartic splines as a function representation for \(\hat g_j\), the algorithm amounts to:

Initialize \(L\) as a random Gaussian matrix, followed by orthogonalization of columns.

  1. For each \(g_j\), fix \(L\) and all other \(g_k\), optimize the target separately for each \(g_j\)
  2. Fixing the \(g_j\), optimize for \(L\)

Hastie and Tibshirani (who proposed the algorithm) have clever steps for each part of this algorithm, using a spline representation for the \(g_j\) together with an SVD for the orthogonalization.
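
The orthogonalization step, at least, is simple to sketch with numpy (this is not the full Hastie-Tibshirani algorithm, just the projection of a matrix onto the nearest one with orthonormal columns via its SVD):

import numpy as np

def orthogonalize_columns(L):
    # replace L by the nearest matrix with orthonormal columns
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ Vt

L0 = orthogonalize_columns(np.random.default_rng(0).normal(size=(5, 2)))   # random Gaussian init, then orthogonalize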

Multi-dimensional Scaling

Distance fidelity

All methods for dimensionality reduction share a common goal:

Goal Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).

PCA tries to do this by finding a linear projection. Result: if \(d(x_i,x_j)\) is small, then \(d(z_i,z_j)\) is also small. Converse does not hold.

SOM tries to fit a map to the data directly. It allows for very large distances in data space between the representatives of neighbouring map grid locations.

Distance fidelity

Goal Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).

If we know the goal, we could just try to fulfill this goal explicitly and directly.

Define a stress function:

\[ S_M(z) = \sum_{i\neq j} L(d(x_i,x_j), d'(z_i,z_j)) \]

for some loss function \(L\) and some distance \(d'\). Next, find \(z\) to minimize \(S_M\) given data \(x\).

Stress functions

Loss is usually the square of some difference between distance measures.

Name and stress \(L(d_{ij},z_i,z_j)\):

  • Kruskal-Shephard / least squares: \((d_{ij}-\|z_i-z_j\|)^2\)
  • Sammon: \(\frac{(d_{ij}-\|z_i-z_j\|)^2}{d_{ij}}\)
  • Classical: \((s(x_i,x_j)-s(z_i,z_j))^2\), with \(s(y_i,y_j) = \langle y_i-\overline y, y_j-\overline y\rangle\)
  • Shephard-Kruskal non-metric: \(S(z) = \frac{\sum_{i\neq j}(\|z_i-z_j\|^2-\theta(d_{ij}))^2}{\sum_{i\neq j}\|z_i-z_j\|^2}\), with \(\theta\) an increasing function

Minimizing stress

Usually done through gradient descent.

  1. Calculate \[ \nabla S(z) = \left( \frac{\partial}{\partial z_1}S(z), \frac{\partial}{\partial z_2}S(z), \dots \frac{\partial}{\partial z_N}S(z) \right) \]
  2. Update \[ z = z - \epsilon\nabla S(z) \]

Gradient descent has a LOT of variations, improvements, tactics, details, and forms the foundation of most of the field of deep learning.
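
A minimal numpy sketch of gradient descent on the Kruskal-Shephard least-squares stress (the function name, learning rate, iteration count and initialization are illustrative; a precomputed distance matrix D is assumed):

import numpy as np

def mds_gradient_descent(D, dim=2, lr=0.005, n_iter=1000, seed=0):
    # D: (N, N) matrix of pairwise data distances d_ij
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Z = rng.normal(scale=0.1, size=(N, dim))           # random initial representatives
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]           # z_i - z_j
        dist = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(dist, 1.0)                    # avoid dividing by zero on the diagonal
        coef = -4.0 * (D - dist) / dist                # from d/dz_i of sum_{i!=j} (d_ij - ||z_i-z_j||)^2
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * diff).sum(axis=1)   # gradient with respect to each z_i
        Z -= lr * grad                                 # z = z - epsilon * grad S(z)
    return Z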

Local MDS

By adjusting the stress function, the influence of large distances can be decreased to emphasize the maintenance of local distances at the expense of very large distances.

Let \(N\) be the symmetric \(k\)-nearest neighbor graph: \((i,j)\in N\) if either \(x_i\) is one of \(x_j\)'s \(k\) nearest neighbors, or \(x_j\) is one of \(x_i\)'s.

\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 + \sum_{(i,j)\not\in N} w(D-\|z_i-z_j\|)^2 \]

This uses some large constant \(D\) and some weight \(w\) to make non-neighbours look very remote, while giving their influence a low weight. Taking \(w\approx 1/D\), expanding the square and dropping the terms that are constant or vanish for large \(D\) produces

\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 - \tau\sum_{(i,j)\not\in N} \|z_i-z_j\| \]
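
A small numpy sketch evaluating this simplified stress (the neighbour indicator matrix, the embedding Z and tau are assumed given; names are illustrative):

import numpy as np

def local_mds_stress(D, Z, is_neighbor, tau=1.0):
    # D: (N, N) data distances, Z: (N, dim) embedding,
    # is_neighbor: boolean (N, N) matrix for the symmetric k-NN graph N
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    off = ~np.eye(len(Z), dtype=bool)                  # exclude i == j
    nb, far = is_neighbor & off, (~is_neighbor) & off
    return np.sum((D[nb] - dist[nb]) ** 2) - tau * np.sum(dist[far])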