25 March, 2018
With scikit-learn
```python
from sklearn import decomposition  # implements PCA, KernelPCA, ICA, FA, ...

pca = decomposition.PCA(n_components)
scores = pca.fit_transform(X)  # or pca.fit() if not interested in scores
loadings = pca.components_
pca.explained_variance_, pca.explained_variance_ratio_
```
With scikit-learn
Model | Task | Options |
---|---|---|
decomposition.SparsePCA | Uses an optimization penalty to minimize the number of non-zero loadings | alpha controls the emphasis on sparsity |
decomposition.KernelPCA | Kernel PCA | kernel takes the values "linear", "poly", "rbf", "sigmoid", "cosine", "precomputed". Different kernels take different parameters – check the documentation. |
decomposition.IncrementalPCA | Memory-efficient PCA | same as PCA |
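As a rough illustration of these variants (the data matrix `X`, the component counts, and the parameter values below are placeholders, not from the text), sparse and kernel PCA follow the same fit/transform pattern as `PCA`:

```python
from sklearn import decomposition
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 10))  # placeholder data matrix

# Sparse PCA: alpha controls how strongly non-zero loadings are penalized.
spca = decomposition.SparsePCA(n_components=3, alpha=1.0)
sparse_scores = spca.fit_transform(X)
sparse_loadings = spca.components_       # many entries are exactly zero

# Kernel PCA: the kernel argument selects the kernel, here an RBF kernel.
kpca = decomposition.KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
kernel_scores = kpca.fit_transform(X)
```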
In R, there are two commands: princomp and prcomp.
princomp | prcomp |
---|---|
Uses eigen to calculate principal components, relying on a full eigenvalue decomposition. Only handles feature extraction of variables and requires more observations than variables. | Preferred. Uses svd to calculate principal components. |
```r
X.pca = prcomp(X)
explained.variance = X.pca$sdev^2
explained.variance.ratio = explained.variance / sum(explained.variance)
X.scores = X.pca$x
X.loadings = X.pca$rotation
```
If our data were explained by the influence of a collection of unobserved variables (latent variables, or factors), one linear model would be
\[ X = \mu + LF + \epsilon \]
where \(\mu\) is the mean of \(X\), \(F\) is a vector of unobserved factors, \(L\) is a matrix of loadings, and \(\epsilon\) is a noise term independent of \(F\).
The similarity with the PCA setup, where \(X-\mu \approx (UD)V^T\), is not accidental; in early factor analysis, PCA and Factor Analysis (FA) were considered synonymous.
Now, note \[ Cov(X) = \mathbb E(X-\mu)(X-\mu)^T = \mathbb E(LF+\epsilon)(LF+\epsilon)^T =\\ \mathbb E(LF)(LF)^T + \mathbb E(LF)\epsilon^T + \mathbb E\epsilon(LF)^T + \mathbb E\epsilon\epsilon^T = \\ L(\mathbb EFF^T)L^T + L\mathbb EF\epsilon^T + (\mathbb E\epsilon F^T)L^T + \mathbb E\epsilon\epsilon^T \]
\(F\) orthonormal means \(\mathbb EFF^T=\mathbb 1\); \(\epsilon\) centered, independent of \(F\), and with covariance \(\Sigma\) means \(\mathbb E\epsilon F^T = 0\), \(\mathbb EF\epsilon^T = 0\) and \(\mathbb E\epsilon\epsilon^T=\Sigma\). Hence
\[ Cov(X) = LL^T + \Sigma \]
The variance for \(X_j\) decomposes as \[ \mathbb VX_j = (LL^T)_{jj} + \Sigma_{jj} = h^2_j + \psi_j \]
for the communality \(h^2_j\) and the uniqueness \(\psi_j\).
Eigenvalue decomposition (keeping the \(k\) leading eigenvalue/eigenvector pairs) estimates the loadings, communality and uniqueness through:
\[ Cov(X) = V\Lambda V^T = (V\sqrt{\Lambda})(V\sqrt{\Lambda})^T \\ \hat L = V\sqrt{\Lambda} \qquad \hat\psi_j = \sigma_j^2 - (\hat L\hat L^T)_{jj} \]
This approach emphasizes (because it is PCA in disguise) the spread of each variable, i.e. the uniqueness.
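A minimal numpy sketch of this estimate (the function name and the choice to start from the sample covariance are mine, not from the text):

```python
import numpy as np

def principal_component_factors(X, k):
    """Estimate loadings, communalities and uniquenesses from the
    eigenvalue decomposition of the sample covariance, as above."""
    S = np.cov(X, rowvar=False)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest eigenvalues
    V, lam = eigvecs[:, order], eigvals[order]
    L_hat = V * np.sqrt(lam)                  # L-hat = V sqrt(Lambda)
    communality = (L_hat ** 2).sum(axis=1)    # h_j^2 = (L L^T)_jj
    uniqueness = np.diag(S) - communality     # psi_j = sigma_j^2 - h_j^2
    return L_hat, communality, uniqueness
```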
To put the focus on the communality instead, an iterated method uses the sample correlation matrix \(R\) and the relation \[ LL^T = R-\Sigma \] by iteratively updating the two estimates: compute \(\hat L\) from an eigenvalue decomposition of \(R-\hat\Sigma\), update the uniquenesses \(\hat\psi_j = 1 - (\hat L\hat L^T)_{jj}\), and repeat until convergence.
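A sketch of that iteration in numpy (the starting guess for the uniquenesses is one common heuristic, not prescribed by the text):

```python
import numpy as np

def iterated_principal_factors(X, k, n_iter=50):
    """Iterated principal-factor estimate on the sample correlation matrix R."""
    R = np.corrcoef(X, rowvar=False)
    # Start from a common heuristic: communality ~ largest absolute correlation.
    psi = 1.0 - np.max(np.abs(R - np.eye(len(R))), axis=0)
    for _ in range(n_iter):
        reduced = R - np.diag(psi)                   # R - Sigma
        eigvals, eigvecs = np.linalg.eigh(reduced)
        order = np.argsort(eigvals)[::-1][:k]
        lam = np.clip(eigvals[order], 0.0, None)     # guard against negative eigenvalues
        L = eigvecs[:, order] * np.sqrt(lam)
        psi = 1.0 - (L ** 2).sum(axis=1)             # update the uniquenesses
    return L, psi
```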
To get an easy to interpret solution, assume normality and seek a maximum likelihood estimate. Write \(S=LL^T+\Sigma\),
\[ \ell = \log\mathcal L = \\ -\frac12\left[ np\log2\pi - n\log|S^{-1}| + \sum (x_i-\mu)^TS^{-1}(x_i-\mu) \right] \]
and seek loadings \(L\) that maximize \(\ell\).
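In scikit-learn, maximum-likelihood factor analysis is available as `decomposition.FactorAnalysis` (fit by EM). A brief sketch, with `X` a placeholder data matrix as before:

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=3)
factor_scores = fa.fit_transform(X)   # estimated factors for each observation
loadings = fa.components_             # estimated loadings L (one row per factor)
uniquenesses = fa.noise_variance_     # estimated diagonal of Sigma
log_likelihoods = fa.loglike_         # log-likelihood at each EM iteration
```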
If \(X=\mu+LF+\epsilon\) is a latent variable decomposition of \(X\), then for any orthonormal matrix \(R\), writing \(L'=LR\) and \(F'=R^TF\), \[ X=\mu+(LR)(R^TF) + \epsilon = \mu + L'F' + \epsilon\\ Cov(X) = L'L'^T + \Sigma = (LR)(LR)^T + \Sigma = LRR^TL^T+\Sigma \]
With \(R\) orthonormal, \(RR^T=\mathbb 1\), so the covariance is invariant under transforming factors and loadings by an orthonormal transformation.
Several principles exist for finding an optimal rotation: varimax, quartimax, equamax and parsimax all work by balancing an optimization between a sum of quartics and a sum of (sums of squares).
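As one concrete example, the varimax criterion can be optimized with the classical SVD-based fixed-point iteration; the sketch below is a standard formulation of that algorithm, not taken from the text. Recent versions of scikit-learn also accept `rotation="varimax"` directly in `FactorAnalysis`.

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-6):
    """Rotate a (p x k) loadings matrix L by the varimax criterion."""
    p, k = L.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        # Gradient of the varimax criterion with respect to the rotation.
        grad = L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                            # project back onto orthonormal matrices
        if s.sum() < objective * (1 + tol):   # stop when the criterion stalls
            break
        objective = s.sum()
    return L @ R
```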
With independent component analysis, under an assumption of non-normal data, the loadings \(L\) can be determined uniquely, not merely up to rotation.
By focusing on non-Gaussian distributions, ICA tends to find components with long tails, emphasizing extreme examples.
Let \(X\) be a random variable with density function \(p(x)\).
The information content \(I(X)\) of \(X\) is the minimal amount of information required to encode the information in \(X\). For independent events (some sets of possible outcomes) \(A, B\) we require additivity: \(I(A\cap B)=I(A)+I(B)\).
From the multiplicative law of probability it follows that if \(I(A) = f(\mathbb P(A))\), then \(f(x\cdot y) = f(x)+f(y)\).
Functions \(f\) such that \(f(x\cdot y)=f(x)+f(y)\) are all multiples of logarithms: \(I(A)=K\log(\mathbb P(A))\).
To ensure positive information, \(K<0\); we define \(I(A)=-\log_b(\mathbb P(A))\).
Base 2: bits. Base \(e\): nats. Base 10: hartleys.
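For example, in base 2:
\[ I(\text{fair coin lands heads}) = -\log_2\tfrac12 = 1 \text{ bit}, \qquad I(\text{fair die shows a six}) = -\log_2\tfrac16 \approx 2.58 \text{ bits}. \]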
The entropy of a random variable \(X\) is its expected information content:
\[ H(X) = \mathbb EI = -\int p(x)\log p(x)dx \]
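As a worked example of the definition (using natural logarithms, so nats), a Gaussian \(Z\sim\mathcal N(\mu,\sigma^2)\) with density \(\phi_{\mu,\sigma}\) has
\[ H(Z) = -\int \phi_{\mu,\sigma}(z)\log\phi_{\mu,\sigma}(z)\,dz = \tfrac12\log(2\pi\sigma^2) + \tfrac12\,\mathbb E\frac{(Z-\mu)^2}{\sigma^2} = \tfrac12\log(2\pi e\sigma^2), \]
which is exactly the maximal value referred to in the theorem below.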
Theorem: A Gaussian \(X\) maximizes entropy among random variables with fixed variance.
If \(X\) is a random vector, we can use entropy to build a measure of dependence between its components:
The mutual information can be defined as \[ MI(X) = \sum H(X_j)-H(X) \]
This is an instance of the Kullback-Leibler divergence (relative entropy) between probability densities: \[ KL(p \| q) = \mathbb E_p\log\frac{p(X)}{q(X)} = \int\log\frac{p(x)}{q(x)}p(x)dx \\ MI(X) = KL\left(p \middle\| \prod p_j\right) \]
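A tiny discrete illustration (the joint table below is made up for the example): mutual information computed as the KL divergence between the joint distribution and the product of its marginals.

```python
import numpy as np

# Joint distribution of two binary variables (rows: X1, columns: X2).
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
p1 = p_joint.sum(axis=1)            # marginal of X1
p2 = p_joint.sum(axis=0)            # marginal of X2
p_indep = np.outer(p1, p2)          # product of the marginals

# MI(X) = KL(p_joint || p_indep), here in nats.
mi = np.sum(p_joint * np.log(p_joint / p_indep))
```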
Without loss of generality, \(X\) is zero-centered, decorrelated and with unit variance (whitened). Then, since both \(X\) and the factors \(F\) are whitened, it follows that \(L\) is orthogonal. Hence:
If \(X=LF\) then \(F=L^TX\), and \[ MI(F) = \sum H(F_j) - H(X) - \log|\det L| = \sum H(F_j)-H(X) \]
Choosing \(L\) to minimize \(MI(F)\) amounts to maximizing independence between the components of \(F=L^TX\); since \(H(X)\) is fixed, this is equivalent to minimizing the sum of entropies of the separate components, which in turn is equivalent to maximizing their departure from normality.
Simplification (Hyvärinen and Oja, 2000) by using negentropy instead: the departure from normality is measured through \[ J(F_j) = H(Z_j) - H(F_j) \qquad Z_j\sim\mathcal N(\mathbb EF_j, \mathbb VF_j) \\ J(F_j) \approx\frac{1}{a^2}\left[ \mathbb E\log\cosh(aF_j) - \mathbb E\log\cosh(aZ_j) \right]^2 \]
with \(1\leq a\leq 2\). This (replacing expectations and variances with sample expectations and sample variances) yields the FastICA algorithm for calculating ICA.
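FastICA is available in scikit-learn; `fun="logcosh"` corresponds to the log-cosh negentropy approximation above, with `fun_args["alpha"]` playing the role of \(a\). The data matrix `X` and the component count below are placeholders.

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=3, fun="logcosh", fun_args={"alpha": 1.0})
F = ica.fit_transform(X)     # estimated independent components (the factors)
mixing = ica.mixing_         # estimated mixing matrix
unmixing = ica.components_   # estimated unmixing matrix
```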
Here is one way to perform ICA, using generalized additive models. Write \(\phi=\text{PDF}_{\mathcal N(0,1)}\), and represent each factor density as \[ f_j(s_j) = \phi(s_j)\exp[g_j(s_j)] \] for some smooth functions \(g_j\). Assume whitened \(X\). Then the log-likelihood for \(LF\) is \[ \ell(L, g_j | X) = \sum_{i=1}^N\sum_{j=1}^p[\log\phi(L_j^Tx_i)+g_j(L_j^Tx_i)] \]
We add constraints to enforce that each \(f_j\) integrates to one (so that it is a proper density) and that each \(g_j\) is smooth.
This leads us to maximize
\[ \sum_{j=1}^p \left[ \frac{1}{N}\sum_{i=1}^N[\log\phi(L_j^Tx_i) + g_j(L_j^Tx_i)] - \\ \int\phi(t)\exp[{g_j(t)}]dt - \lambda_j\int[g_j'''(t)]^2dt \right] \]
Using quartic splines as a function representation for \(\hat g_j\), the algorithm amounts to:
1. Initialize \(L\) as a random Gaussian matrix, followed by orthogonalization of the columns.
2. Alternate until convergence: with \(L\) fixed, estimate each \(\hat g_j\) by maximizing the penalized likelihood above; with the \(\hat g_j\) fixed, update the columns of \(L\) and re-orthogonalize.
Hastie and Tibshirani (who proposed the algorithm) have clever steps for each part of this algorithm, using a spline representation, together with SVD decomposition for orthogonalization.
All methods for dimensionality reduction share a common goal:
Goal: Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).
PCA tries to do this by finding a linear projection. Result: if \(d(x_i,x_j)\) is small, then \(d(z_i,z_j)\) is also small. The converse does not hold.
SOM tries to fit a map to the data directly. Allows for very large distances in data space between representatives of map grid locations.
Goal: Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).
If we know the goal, we could just try to fulfill this goal explicitly and directly.
Define a stress function:
\[ S_M(z) = \sum_{i\neq j} L(d(x_i,x_j), d'(z_i,z_j)) \]
for some loss function \(L\) and some distance \(d'\). Next, find \(z\) to minimize \(S_M\) given data \(x\).
Loss is usually the square of some difference between distance measures.
Name | Stress \(L(d_{ij},z_i,z_j)\) |
---|---|
Kruskal-Shepard / least squares | \((d_{ij}-\|z_i-z_j\|)^2\) |
Sammon | \(\frac{(d_{ij}-\|z_i-z_j\|)^2}{d_{ij}}\) |
Classical | \((s(x_i,x_j)-s(z_i,z_j))^2\) with \(s(y_i,y_j) = \langle y_i-\overline y, y_j-\overline y\rangle\) |
Shepard-Kruskal non-metric | \(S(z) = \frac{\sum_{i\neq j}(\|z_i-z_j\|^2-\theta(d_{ij}))^2}{\sum_{i\neq j}\|z_i-z_j\|^2}\) with \(\theta\) an increasing function |
The minimization is usually done through gradient descent.
Gradient descent has a LOT of variations, improvements, tactics, details, and forms the foundation of most of the field of deep learning.
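In scikit-learn, least-squares (Kruskal-Shepard) MDS is available as `manifold.MDS`; note that scikit-learn minimizes the stress with SMACOF iterations rather than plain gradient descent. `X` below is a placeholder data matrix; passing `dissimilarity="precomputed"` allows supplying a distance matrix instead, and `metric=False` gives the non-metric (Shepard-Kruskal) variant.

```python
from sklearn.manifold import MDS

mds = MDS(n_components=2, metric=True, dissimilarity="euclidean", random_state=0)
Z = mds.fit_transform(X)     # low-dimensional representatives z_1, ..., z_N
final_stress = mds.stress_   # value of the stress at the solution
```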
By adjusting the stress function, the influence of large distances can be decreased to emphasize the maintenance of local distances at the expense of very large distances.
Let \(N\) be the symmetric \(k\)-nearest neighbor graph: \((i,j)\in N\) if either \(x_i\) is one of \(x_j\)'s \(k\) nearest neighbors, or \(x_j\) is one of \(x_i\)'s.
\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 + \sum_{(i,j)\not\in N} w(D-\|z_i-z_j\|)^2 \]
This uses some large constant \(D\) and a weight \(w\) to make non-neighbors look very remote while giving their influence a low weight. For simplification, taking \(w\approx 1/D\) with \(D\) large produces, up to an additive constant and with \(\tau = 2wD\),
\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 - \tau\sum_{(i,j)\not\in N} \|z_i-z_j\| \]
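A direct numpy sketch of evaluating this simplified stress (the function name and the pair representation of the neighbor graph are mine); minimizing it over \(z\) would then proceed by gradient descent as above.

```python
import numpy as np

def local_mds_stress(X, Z, neighbors, tau):
    """Simplified local-MDS stress: squared error on neighbor pairs,
    minus tau times the embedded distance for non-neighbor pairs.
    `neighbors` is a set of index pairs (i, j) with i < j from the kNN graph."""
    n = len(X)
    stress = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dz = np.linalg.norm(Z[i] - Z[j])
            if (i, j) in neighbors:
                dx = np.linalg.norm(X[i] - X[j])
                stress += (dx - dz) ** 2
            else:
                stress -= tau * dz
    return stress
```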