25 March, 2018

Topological methods and manifold learning

Local distances

Recall that dimensionality reduction is about the following:

Goal Find representatives \(z_1,\dots,z_N\) of the points \(x_1,\dots,x_N\) such that \(d(z_i,z_j)\approx d(x_i,x_j)\).

If we care about local distances more than global distances, meaning we are willing to accept a large approximation error for pairs of points that start out far apart, then the methods we develop focus on connectivity more than on shape.

This makes all locally faithful dimensionality reduction methods essentially topological.

Local MDS

By adjusting the stress function, we can decrease the influence of large distances, preserving local distances at the expense of accuracy for the very large ones.

Let \(N\) be the symmetric \(k\)-nearest neighbor graph: \((i,j)\in N\) if either \(x_i\) is one of \(x_j\)'s \(k\) nearest neighbors, or \(x_j\) is one of \(x_i\)'s.

\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 + \sum_{(i,j)\not\in N} w(D-\|z_i-z_j\|)^2 \]

This uses a large constant \(D\) to make non-neighbors look very remote and a small weight \(w\) to keep their influence low. Taking \(w\propto 1/D\) and letting \(D\) grow large simplifies this, up to an additive constant, to

\[ S(z) = \sum_{(i,j)\in N} (d(x_i,x_j) - \|z_i-z_j\|)^2 - \tau\sum_{(i,j)\not\in N} \|z_i-z_j\| \]
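As a rough sketch (not a reference implementation), this simplified stress can be evaluated for a candidate embedding as below; the data X, the embedding Z, the neighborhood size k and the weight tau are all placeholders, and each unordered pair is counted once.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_mds_stress(X, Z, k=10, tau=0.1):
    # Simplified local MDS stress: squared error on symmetric kNN pairs,
    # minus tau times the embedded distances of all non-neighbor pairs.
    N = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbor = np.zeros((N, N), dtype=bool)
    for i in range(N):
        neighbor[i, idx[i, 1:]] = True        # skip the point itself
    neighbor = neighbor | neighbor.T          # symmetric kNN graph
    Dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    Dz = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    iu = np.triu_indices(N, k=1)              # each unordered pair once
    in_N = neighbor[iu]
    return ((Dx[iu][in_N] - Dz[iu][in_N]) ** 2).sum() - tau * Dz[iu][~in_N].sum()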

LLE

Idea All reasonable shapes are locally almost Euclidean. So approximate each data point as a linear combination of nearby points. Glue these local approximations together and fit the patches in a lower dimensional space.

LLE

  1. For each \(x_i\) find its \(k\) nearest neighbor set \(N(i)\).
  2. \(x_i\approx \sum_{j\in N(i)}w_{ij}x_j\) – pick the \(w_{ij}\) to minimize the \(L_2\) reconstruction error.
    Also enforce \(\sum_j w_{ij}=1\) and set \(w_{ij}=0\) if \(j\) is not a neighbor of \(i\).
  3. Find \(y_i\in\mathbb R^d\) for \(d<p\) minimizing \(\sum_i \left\|y_i-\sum_j w_{ij}y_j\right\|^2\) with these \(w_{ij}\) held fixed.

The third step can be written \[ \text{Tr}[(Y-WY)^T(Y-WY)] = \text{Tr}[Y^T(I-W)^T(I-W)Y] \] so we can take \(\hat Y\) to be the trailing (smallest-eigenvalue) eigenvectors of \((I-W)^T(I-W)\), discarding the constant eigenvector that corresponds to eigenvalue \(0\).
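A minimal numpy sketch of this eigen-step, assuming the weight matrix W from step 2 has already been assembled (the function name is mine):

import numpy as np

def lle_embedding(W, d):
    # Bottom eigenvectors of M = (I - W)^T (I - W); the smallest eigenvalue is
    # ~0 with a constant eigenvector, which we discard.
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]                # skip the constant eigenvector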

ISOMAP

Isometric Feature Mapping (Isomap) uses distances between neighbors to estimate global distances: geodesic distances are approximated by accumulating nearest-neighbor distances along shortest paths in the neighbor graph, and multidimensional scaling is then applied to these estimated distances.
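In scikit-learn this is a one-liner; a small usage sketch on a synthetic swiss roll (the parameter values are arbitrary):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, noise=0.05)
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # 2-d embedding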

Mapper

Mapper takes a completely different approach: instead of mapping points into a low-dimensional space, Mapper builds a simplicial complex to represent the data.

Usually the resulting complex is visualized using graph layout techniques, most commonly a force-directed layout algorithm.

In Python

scikit-learn has models in sklearn.manifold:

Model Task
Isomap Isomap
LocallyLinearEmbedding LLE
MDS MDS and local MDS
TSNE t-SNE (not covered)

Ayasdi has a commercial implementation of Mapper. Open source implementations include kepler-mapper and PythonMapper.
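A rough kepler-mapper sketch follows; the data, filter, cover and clustering choices are arbitrary, and the Cover/clusterer argument names assume a recent version of the package:

import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 3)                       # placeholder data
mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(X, projection=[0])   # filter: first coordinate
graph = mapper.map(lens, X,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.3),
                   clusterer=DBSCAN(eps=0.3, min_samples=3))
mapper.visualize(graph, path_html="mapper_output.html")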

In R

The vegan package includes functions isomap, metaMDS (local MDS). The lle package has an implementation of LLE.

Mapper is available in the package TDAmapper.

Classification

Handle new observations

We have data.

We have clustered our data and picked appropriate parameters.

Now we get new data.

What do we do with it?

Handle new observations

Idea Assign new observations to the known clusters.

This leads us to an implicit version of supervised learning.

Definition Supervised Learning: given paired data \((X,y)\) learn a function \(f(X)\) that accurately predicts \(y\) as \(f(X)\).

Definition Regression: supervised learning with continuous responses \(y\).

Definition Classification: supervised learning with categorical responses \(y\).

Nearest Neighbor Classification

One solution is to assign to a new \(X\) whatever class its nearest neighbor belongs to.

More refined: majority vote among \(k\) nearest neighbors.

Pro

  • Sensitive to the "true" shapes of the classes: no parametric assumptions are imposed.

Con

  • Requires a lot of distance calculations for every new observation.
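A minimal scikit-learn sketch of the k-nearest-neighbor classifier described above, on toy data:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # majority vote among 5 neighbors
print(knn.predict([[0.0, 0.0]]))                      # label for a new observation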

A Probabilistic View

Fundamentally, in classification problems we are looking for estimated probabilities (in a Bayesian sense) that a new observation \(x\) has class label \(g\).

Hence, we want to estimate \[ \mathbb P(g | x) \]

With these probabilities in hand, we can

  1. Assign \(\hat g=\text{arg}\max_g \hat{\mathbb P}(g|x)\)
  2. Draw a random observation from the distribution \(\hat{\mathbb P}(g|x)\)
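Both decision rules are one-liners once estimated probabilities are available; the probabilities below are made up purely for illustration.

import numpy as np

p_hat = np.array([0.2, 0.5, 0.3])                    # estimated P(g | x)
labels = np.array(["g1", "g2", "g3"])
hard_label = labels[np.argmax(p_hat)]                # rule 1: arg max
random_label = np.random.choice(labels, p=p_hat)     # rule 2: draw from the distribution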

Log odds

A plain linear regression would be a bad fit for describing a probability: \(\beta\cdot X\) takes on values in \((-\infty,\infty)\) while probabilities are in \((0,1)\).

One option is to focus on the log odds: \[ \omega = \log\left(\frac{p}{1-p}\right) \qquad\qquad p = \frac{\exp[\omega]}{1+\exp[\omega]} = \frac{1}{\exp[-\omega]+1} \]

This quantity arises naturally when analyzing Bernoulli and binomial probabilities, and since the map between \(p\) and \(\omega\) is invertible, the log odds carries the same information as the probability itself.

Log odds takes values within \((-\infty,\infty)\).
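A quick numerical sanity check that the two maps above invert each other (a sketch; the helper names are mine):

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(omega):
    return 1 / (1 + np.exp(-omega))

p = np.linspace(0.01, 0.99, 5)
assert np.allclose(inv_logit(logit(p)), p)   # the two maps undo each other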

Logistic Regression

Here is one approach: estimate \(p\) from the predictors \(x\) by modelling the log odds \(\omega\) as a linear function of the predictors.

Our model is, for a vector \(X\) of predictors, \[ \omega = X\beta \\ p = \frac{1}{1+\exp[-(X\beta)]} \]

With a logistic regression fitted to data, we can classify by assigning the label \(g_j\) with maximal value for \(\omega\).

In the binary case, \(\beta^T\cdot X > 0\) can be used to choose the label \(g_2\) over \(g_1\).
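A small scikit-learn sketch on synthetic data, checking that the sign of the fitted linear score (decision_function returns \(X\beta\) plus the intercept) reproduces the predicted labels:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)                        # X @ beta + intercept
assert np.array_equal(scores > 0, clf.predict(X) == 1)   # the sign decides the label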

Multiple classes

Let us model the log probability linearly \[ \log\mathbb P(g_j|X) = (X\beta)_j - \log\beta_0 \] where we pick \(\beta_0\) to ensure that the resulting distribution across the \(g_j\) is a probability distribution.

Then \[ \mathbb P(g_j|X) = \frac{\exp[(X\beta)_j]}{\beta_0} \\ 1 = \sum_j \mathbb P(g_j|X) = \frac{\sum\exp[(X\beta)_j]}{\beta_0} \] so \(\beta_0=\sum_j\exp[(X\beta)_j]\)

Multiple classes

The softmax function is \[ \text{Softmax}(j,p_1,\dots,p_k) = \frac{\exp[p_j]}{\sum\exp[p_i]} \]

Hence, the log-linear model specifies \[ \mathbb P(g_j|X) = \text{Softmax}(j, X\beta) \]
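A minimal numpy sketch of the softmax; subtracting the maximum is a standard numerical-stability trick, not part of the model:

import numpy as np

def softmax(scores):
    # scores: one entry (X beta)_j per class
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1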

Reduced multiclass logistic regression

Note that across all \(k\) classes the model is redundant: the normalization is there to ensure \(\sum_j\mathbb P(g_j|X)=1\). By picking one class as the reference, the redundancy is removed and \(\beta\) is no longer underdetermined:

\[ \beta'_j = \beta_j-\beta_k \qquad \beta'_k = 0 \\ \mathbb P(g_j|X) = \frac{\exp[(\beta' X)_j]}{\sum\exp[(\beta' X)_i]+1} \\ \mathbb P(g_k|X) = \frac{1}{\sum\exp[(\beta' X)_i]+1} \]
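A quick numerical check of this redundancy: subtracting the \(k\)-th score from every score (so that \(\beta'_k=0\)) leaves the softmax probabilities unchanged. The scores here are made up:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([1.5, 0.3, -0.7])          # (X beta)_j for classes g_1..g_k
reduced = scores - scores[-1]                # beta'_j = beta_j - beta_k, last entry 0
assert np.allclose(softmax(scores), softmax(reduced))   # same class probabilities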

Multiclass logistic regression from binary

We arrive at the same formula if we start from doing \(k-1\) different binary classifications:

\[ \log\frac{\mathbb P(g_j|X)}{\mathbb P(g_k|X)} = \beta_jX\\ \mathbb P(g_j|X) = \mathbb P(g_k|X)\exp[\beta_jX] \\ 1 = \mathbb P(g_k|X)\left(1+\sum_{j=1}^{k-1}\exp[\beta_jX]\right) \\ \mathbb P(g_k|X) = \frac{1}{1+\sum\exp[\beta_jX]} \]

Incidence matrix

If we have \(k\) different class labels \(g_1,\dots,g_k\), we can use a one-hot encoding or incidence matrix to train a multivariate linear regression \[ \omega = X\beta \qquad\omega\in\mathbb R^{1\times k} \qquad\beta\in\mathbb R^{p\times k} \qquad X\in\mathbb R^{1\times p} \]

This would express \(\omega\) as a vector of \(k\) log odds, one for each class label.

Among the \(k\) assigned log odds in \(\omega\), we get \(k\) likelihoods as \[ \ell_j = \frac{1}{1+\exp[-\omega_j]} \]
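A sketch of the indicator-matrix approach on toy data; classifying by the largest score is equivalent to classifying by the largest \(\ell_j\), since the inverse logit is monotone:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
Y = np.eye(len(np.unique(y)))[y]            # incidence (one-hot) matrix, one column per class
lin = LinearRegression().fit(X, Y)          # multivariate linear regression
omega = lin.predict(X)                      # one score per class per observation
y_hat = omega.argmax(axis=1)                # assign the class with the largest score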

In practice

Let's work with two example data sets

Two-class classification

We can start out with a linear regression against a response that takes on the value \(1\) for the group a, \(0\) for b.

term estimate std.error statistic p.value
(Intercept) 0.5078209 0.0124431 40.81153 0
V1 0.1852310 0.0140017 13.22918 0
V2 0.2199979 0.0130105 16.90929 0

Two-class classification

This linear regression takes values between -0.4152142 and 1.4396741, not allowing for any kind of probabilistic interpretation.

Just putting these fitted values through the inverse logit function \(c\mapsto 1/(1+\exp[-c])\) squeezes them into \((0,1)\).

Two-class classification

More appropriate than pushing a linear fit through the logistic function and hoping for the best is to fit the logistic regression directly.

term estimate std.error statistic p.value
(Intercept) 0.1380221 0.4899565 0.2817027 0.7781715
V1 3.6409265 0.8893603 4.0938711 0.0000424
V2 4.4381955 0.9667133 4.5910151 0.0000044

Two-class classification

Cross-tabulating the true groups (rows) against the labels predicted by the logistic and the linear fits:

Logistic B Logistic A Linear B Linear A
a 2 148 2 148
b 147 3 147 3

Two-class classification

The similarity in results is not very surprising: the decision boundaries of the two fits are nearly the same.

Three-class classification

Once we have more than two groups, the single response variable becomes several response variables. We create a one-hot or incidence matrix to encode class membership, and then train a multivariate system to predict it.

Three-class classification

a b c
150 2 3
0 148 0
0 0 147

Classification schemes

This multinomial logistic regression is an example of one-vs-rest classification: for each label \(g_j\), train a classifier to distinguish \(g_j\) from other labels – combine all classifiers using softmax (or other scheme).
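scikit-learn exposes one-vs-rest as a generic wrapper; a small sketch with a logistic base classifier on toy data:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)   # one binary classifier per class
print(ovr.predict(X[:5]))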

Discriminant functions

An alternative approach to the logistic regression route is through discriminant functions.

From a Bayesian perspective, we start out with a set \(\pi_1,\dots,\pi_k\) of prior probabilities – a priori estimates of how likely each class is. Write \(f_j(x) = \text{PDF}_{g_j}(x)\) for the class-conditional density. Then by Bayes' rule \[ \mathbb P(g_j|x) = \frac{f_j(x)\pi_j}{\sum_\ell f_\ell(x)\pi_\ell} \]

Having access to the densities \(f_j\) is a powerful simplification compared to estimating \(\mathbb P(g_j|x)\) directly.
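A tiny sketch of Bayes' rule in action, assuming (purely for illustration) one-dimensional Gaussian class densities and made-up priors:

import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])                              # pi_j (assumed)
densities = [norm(-2.0, 1.0), norm(0.0, 1.0), norm(3.0, 1.5)]   # f_j (assumed)
def posterior(x):
    f = np.array([d.pdf(x) for d in densities])   # f_j(x)
    return f * priors / (f * priors).sum()        # Bayes' rule
print(posterior(0.5))                             # P(g_j | x), sums to 1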

Linear Discriminant Analysis (LDA)

Suppose each \(f_j\) is a multivariate Gaussian density, and that all share a common covariance matrix \(\Sigma\).

\[ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \log\frac{f_j(x)}{f_\ell(x)} + \log\frac{\pi_j}{\pi_\ell} = \\ \log\frac{\pi_j}{\pi_\ell} - \frac12(\mu_j+\mu_\ell)^T\Sigma^{-1}(\mu_j-\mu_\ell) + x^T\Sigma^{-1}(\mu_j-\mu_\ell) \]

Writing \[ \delta_j(x) = x^T\Sigma^{-1}\mu_j - \frac12\mu_j^T\Sigma^{-1}\mu_j+\log\pi_j \\ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \delta_j(x)-\delta_\ell(x) \]

Linear Discriminant Analysis (LDA)

Estimating from data with \(N\) observations, of which \(N_j\) belong to class \(j\), we use \[ \hat\pi_j = N_j/N \qquad \hat\mu_j = \sum_{g(x_i)=j} x_i/N_j \\ \hat\Sigma = \sum_j\sum_{g(x_i)=j}(x_i-\hat\mu_j)(x_i-\hat\mu_j)^T/(N-K) \\ \hat\delta_j(x) = x^T\hat\Sigma^{-1}\hat\mu_j-\frac12\hat\mu_j^T\hat\Sigma^{-1}\hat\mu_j+\log\hat\pi_j \] and classify each new observation to the class whose discriminant takes the largest value.

The two-class case reduces (using the Gaussian assumption) to a linear regression.
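A minimal numpy sketch of the plug-in estimates and discriminants above (the function names are mine; no attention is paid to numerical robustness):

import numpy as np

def lda_fit(X, y):
    # plug-in estimates: class priors, class means, pooled covariance
    classes = np.unique(y)
    N, K = len(y), len(classes)
    pis = np.array([np.mean(y == c) for c in classes])
    mus = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = [X[y == c] - mus[j] for j, c in enumerate(classes)]
    Sigma = sum(Z.T @ Z for Z in centered) / (N - K)
    return classes, pis, mus, np.linalg.inv(Sigma)
def lda_predict(X, classes, pis, mus, Sigma_inv):
    # delta_j(x) = x^T Sigma^{-1} mu_j - 0.5 mu_j^T Sigma^{-1} mu_j + log pi_j
    deltas = X @ Sigma_inv @ mus.T - 0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(pis)
    return classes[np.argmax(deltas, axis=1)]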

Quadratic Discriminant Analysis (QDA)

If the groups are assumed Gaussian but the covariances are not assumed equal, the cancellations that produced the linear discriminants \(\delta_j\) above no longer take place.

Instead, the quadratic discriminant analysis uses discriminants \[ \delta_j(x) = -\frac12\log\left|\Sigma_j\right|-\frac12(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j)+\log\pi_j \]

In Python

scikit-learn supports the linear methods in sklearn.linear_model and discriminant analysis in sklearn.discriminant_analysis:

Model Task
sklearn.linear_model.LinearRegression Linear Regression
sklearn.linear_model.LogisticRegression Logistic Regression
sklearn.discriminant_analysis.LinearDiscriminantAnalysis LDA
sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis QDA
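A short usage sketch of the discriminant analysis estimators on toy data:

from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]), qda.predict(X[:5]))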

In R

Two-class classification with linear regression is supported by the basic lm linear model function.

Two-class logistic regression is handled by glm:

model = glm(class ~ predictors, data=data, family=binomial())

The package nnet contains the function multinom for multinomial logistic regression.

The package MASS contains lda and qda for linear and quadratic discriminant analysis.