25March, 2018
Recall that dimensionality reduction is about the following:
Goal Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).
If what we care about is local distances more than global distances: we are willing to have a large error in the approximation if the points are far apart to start with, the methods we develop focus on connectivity more than on shape.
This makes all locally faithful dimensionality reduction methods essentially topological.
By adjusting the stress function, the influence of large distances can be decreased to emphasize the maintenance of local distances at the expense of very large distances.
Let \(N\) be the symmetric \(k\)-nearest neighbor graph: \((i,j)\in N\) if either \(x_i\) is one of \(x_j\)'s \(k\) nearest neighbors, or \(x_j\) is one of \(x_i\)'s.
\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 + \sum_{(i,j)\not\in N} w(D-\|z_i-z_j\|)^2 \]
This uses some large constant \(D\) and some weight \(w\) to make non-neighbors look very remote, and gives their influence a low weight. For simplification, \(w\approx 1/D\) produces
\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 - \tau\sum_{(i,j)\not\in N} \|z_i-z_j\| \]
Idea All reasonable shapes are locally almost Euclidean. So approximate each data point as a linear combination of nearby points. Glue these local approximations together and fit the patches in a lower dimensional space.
The third step can be written \[ \text{Tr}[(Y-WY)^T(Y-WY)] = \text{Tr}[Y^T(1-W)^T(1-W)Y] \] We can find \(\hat Y\) as trailing eigenvectors of \((1-W)^T(1-W)\).
Isometric feature Mapping uses distances between neighbors to estimate global distances, and then uses multi-dimensional scaling where distances are computed by accumulating nearest neighbor distances along the neighbor graph.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.
Usually the resulting complex is visualized using graph layout techniques, most commonly a force feedback layout algorithm.
scikit-learn
has models in sklearn.manifold
:
Model | Task |
---|---|
Isomap |
Isomap |
LocallyLinearEmbedding |
LLE |
MDS |
MDS and local MDS |
TSNE |
t-SNE (not covered) |
Ayasdi has a commercial implementation of Mapper. Open source implementations include kepler-mapper
and PythonMapper
.
The vegan
package includes functions isomap
, metaMDS
(local MDS). The lle
package has an implementation of LLE.
Mapper is available in the package TDAmapper
.
We have data.
We have clustered our data. Picked appropriate parameters.
Now we get new data.
What do we do with it?
Idea assign new observations to the known clusters.
This leads us to an implicit version of supervised learning.
Definition Supervised Learning: given paired data \((X,y)\) learn a function \(f(X)\) that accurately predicts \(y\) as \(f(X)\).
Definition Regression: supervised learning with continuous responses \(y\).
Definition Classification: supervised learning with categorical responses \(y\).
One solution is to assign to a new \(X\) whatever class its nearest neighbor belongs to.
More refined: majority vote among \(k\) nearest neighbors.
Pro
Con
Fundamentally in classification problems, we are looking for estimated probabilities (in a Bayesian sense) of a new observation \(x\) being assigned a class label \(g\).
Hence, we want to estimate \[ \mathbb P(g | x) \]
With these probabilities in hand, we can
A plain linear regression would be a bad fit for describing a probability: \(\beta\cdot X\) takes on values in \((-\infty,\infty)\) while probabilities are in \((0,1)\).
One option is to focus on the log odds: \[ \omega = \log\left(\frac{p}{1-p}\right) \qquad\qquad p = \frac{\exp[\omega]}{1+\exp[\omega]} = \frac{1}{\exp[-\omega]+1} \]
The quantity arises naturally when analyzing Bernoulli and Binomial probabilities, and we see it is equivalent to the probability itself.
Log odds takes values within \((-\infty,\infty)\).
Here is one approach: estimate \(p\) based on predictors \(x\) by fitting \(\omega\) to a linear regression.
Our model is, for a vector \(X\) of predictors, \[ \omega = X\beta \\ p = \frac{1}{1+\exp[-(X\beta)]} \]
With a logistic regression fitted to data, we can classify by assigning the label \(g_j\) with maximal value for \(\omega\).
In the binary case, \(\beta^T\cdot X > 0\) can be used to choose the label \(g_2\) over \(g_1\).
Let us model the log probability linearly \[ \log\mathbb P(g_j) = (X\beta)_j - \beta_0 \] where we pick \(\beta_0\) to ensure that the resulting distribution across the \(g_j\) is a probability distribution.
Then \[ \mathbb P(g_j|X) = \frac{\exp[(X\beta)_j]}{\beta_0} \\ 1 = \sum_j \mathbb P(g_j|X) = \frac{\sum\exp[(X\beta)_j]}{\beta_0} \] so \(\beta_0=\sum_j\exp[(X\beta)_j]\)
The softmax function is \[ \text{Softmax}(j,p_1,\dots,p_k) = \frac{\exp[p_j]}{\sum\exp[p_i]} \]
Hence, the log-linear model specifies \[ \mathbb P(g_j|X) = \text{Softmax}(\bullet, X\beta) \]
Note that across all \(k\) classes, the model is redundant: the normalization is there to ensure \(\sum\mathbb P(g_j|X)=1\). By picking one probability as reference probability, the redundancy is removed, and \(\beta\) less underdetermined:
\[ \beta'_j = \beta_j-\beta_k \qquad \beta'_k = 0 \\ \mathbb P(g_j|X) = \frac{\exp[(\beta' X)_j]}{\sum\exp[(\beta' X)_i]+1} \\ \mathbb P(g_k|X) = \frac{1}{\sum\exp[(\beta' X)_i]+1} \]
We arrive at the same formula if we start from doing \(k-1\) different binary classifications:
\[ \log\frac{\mathbb P(g_j|X)}{\mathbb P(g_k|X)} = \beta_jX\\ \mathbb P(g_j|X) = \mathbb P(g_k|X)\exp[\beta_jX] \\ 1 = \mathbb P(g_k|X)\left(1+\sum_{j=1}^{k-1}\exp[\beta_jX]\right) \\ \mathbb P(g_k|X) = \frac{1}{1+\sum\exp[\beta_jX]} \]
If we have \(k\) different class labels \(g_1,\dots,g_k\), we can use a one-hot encoding or incidence matrix to train a multivariate linear regression \[ \omega = X\beta \qquad\omega\in\mathbb R^{1\times k} \qquad\beta\in\mathbb R^{p\times k} \qquad X\in\mathbb R^{1\times p} \]
This would express \[ \omega \]
Among the \(k\) assigned log odds in \(\omega\), we get \(k\) likelihoods as \[ \ell_j = \frac{1}{\exp[\omega_j]} \]
Let's work with two example data sets
We can start out with a linear regression against a response that takes on the value \(1\) for the group a
, \(0\) for b
.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.5078209 | 0.0124431 | 40.81153 | 0 |
V1 | 0.1852310 | 0.0140017 | 13.22918 | 0 |
V2 | 0.2199979 | 0.0130105 | 16.90929 | 0 |
We can start out with a linear regression against a response that takes on the value \(1\) for the group a
, \(0\) for b
.
This linear regression takes values between -0.4152142 and 1.4396741, not allowing for any kind of probabilistic interpretation.
Just putting these values through the inverse logit function \(c\mapsto 1/(1+\exp[-c])\) gets us
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.1380221 | 0.4899565 | 0.2817027 | 0.7781715 |
V1 | 3.6409265 | 0.8893603 | 4.0938711 | 0.0000424 |
V2 | 4.4381955 | 0.9667133 | 4.5910151 | 0.0000044 |
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
Logistic B | Logistic A | Linear B | Linear A | |
---|---|---|---|---|
a | 2 | 148 | 2 | 148 |
b | 147 | 3 | 147 | 3 |
The similarity in results is not very remarkable; the decision boundaries for both cases are similar:
Once we hit more than two groups, the response variable becomes response variables. We create a one-hot or incidence matrix to encode class membership, and then train a multivariate system to predict this.
a | b | c |
---|---|---|
150 | 2 | 3 |
0 | 148 | 0 |
0 | 0 | 147 |
This multinomial logistic regression is an example of one-vs-rest classification: for each label \(g_j\), train a classifier to distinguish \(g_j\) from other labels – combine all classifiers using softmax (or other scheme).
An alternative approach to the logistic regression route is through discriminant functions.
For a Bayesian perspective, we would start out with some set \(\pi_1,\dots,\pi_k\) of prior probabilities – a priori estimates of the likelihoods of each class. Write \(f_j(x) = \text{PDF}_{g_j}(x)\). Then by Bayes rule \[ \mathbb P(g_j|x) = \frac{f_j(x)\pi_j}{\sum f_\ell(x)\pi_\ell} \]
Access to \(f_j\) is a powerful simplification of estimating \(\mathbb P(g_j|x)\) directly.
Suppose each \(f_j\) is a multivariate Gaussian density, and that all share a common covariance matrix \(\Sigma\).
\[ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \log\frac{f_j(x)}{f_\ell(x)} + \log\frac{\pi_j}{\pi_\ell} = \\ \log\frac{\pi_j}{\pi_\ell} - \frac12(\mu_j+\mu_\ell)^T\Sigma^{-1}(\mu_j+\mu_\ell) + x^T\Sigma^{-1}(\mu_k-\mu_\ell) \]
Writing \[ \delta_j(x) = x^T\Sigma^{-1}\mu_j - \frac12\mu_j^T\Sigma^{-1}\mu_j+\log\pi_j \\ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \delta_j(x)-\delta_\ell(x) \]
Estimating from data, we find from \(N\) observations with \(N_j\) instances of class \(j\), \[ \hat\pi_j = N_j/N \qquad \hat\mu_j = \sum_{g(x)=j} x_i/N_j \\ \hat\Sigma = \sum_j\sum_{g(x)=j}(x-\hat\mu_j)(x-\hat\mu_j)^T/(N-K) \\ \hat\delta_j(x) = x^T\hat\Sigma^{-1}\hat\mu_j-\frac12\hat\mu_j^T\hat\Sigma^{-1}\hat\mu_j+\log\hat\pi_j \] and classify each new observation to the discriminator with the largest value.
The two-class case reduces (using the Gaussian assumption) to a linear regression.
If the groups are assumed Gaussian, but the covariances not assumed equal, then the cancellations that generated \(\delta_j\) earlier no longer take place.
Instead, the quadratic discriminant analysis uses discriminants \[ \delta_j(x) = -\frac12\log\left|\Sigma_j\right|-\frac12(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j)+\log\pi_j \]
scikit-learn
supports the linear methods in sklearn.linear_models
and discriminant analysis in sklearn.discriminant_analysis
Model | Task |
---|---|
sklearn.linear_models.LinearRegression |
Linear Regression |
sklearn.linear_models.LogisticRegression |
Logistic Regression |
sklearn.discriminant_analysis.LinearDiscriminantAnalysis |
LDA |
sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis |
QDA |
Two-class classification with linear regression is supported by the basic lm
linear model function.
Two-class logistic regression is handled by glm
:
model = glm(class ~ predictors, data=data, family=binomial())
The package nnet
contains the function multinom
for multiple logistic regression.
The package MASS
contains lda
and qda
for linear and quadratic discriminant analysis.