25 March, 2018

Classification

Handle new observations

We have data.

We have clustered our data. Picked appropriate parameters.

Now we get new data.

What do we do with it?

Handle new observations

Idea: assign new observations to the known clusters.

This leads us to an implicit version of supervised learning.

Definition Supervised Learning: given paired data \((X,y)\) learn a function \(f(X)\) that accurately predicts \(y\) as \(f(X)\).

Definition Regression: supervised learning with continuous responses \(y\).

Definition Classification: supervised learning with categorical responses \(y\).

Nearest Neighbor Classification

One solution is to assign to a new \(X\) whatever class its nearest neighbor belongs to.

More refined: majority vote among \(k\) nearest neighbors.

Pro

  • Can adapt closely to the "true" shapes of the class regions.

Con

  • Requires computing distances to every known observation for each new observation.
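
As a concrete illustration of the nearest neighbor rule, here is a minimal sketch in R using the class package; train, test and train_labels are made-up stand-ins for the labelled data and the new observations.

library(class)

train <- matrix(rnorm(40), ncol = 2)                 # 20 known observations
train_labels <- factor(rep(c("a", "b"), each = 10))  # their cluster labels
test <- matrix(rnorm(10), ncol = 2)                  # 5 new observations

# majority vote among the k = 5 nearest neighbors of each new observation
knn(train, test, cl = train_labels, k = 5)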

A Probabilistic View

Fundamentally, in classification problems we are looking for estimated probabilities (in a Bayesian sense) that a new observation \(x\) belongs to the class with label \(g\).

Hence, we want to estimate \[ \mathbb P(g | x) \]

With these probabilities in hand, we can

  1. Assign \(\hat g=\text{arg}\max_g \hat{\mathbb P}(g|x)\)
  2. Draw a random observation from the distribution \(\hat{\mathbb P}(g|x)\)
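
As a tiny sketch of these two uses, suppose phat is a hypothetical vector of estimated probabilities \(\hat{\mathbb P}(g|x)\) for one new observation:

phat <- c(a = 0.7, b = 0.2, c = 0.1)

names(phat)[which.max(phat)]         # 1. assign the most probable label
sample(names(phat), 1, prob = phat)  # 2. draw a random label from the distribution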

Log odds

A plain linear regression would be a bad fit for describing a probability: \(\beta\cdot X\) takes on values in \((-\infty,\infty)\) while probabilities are in \((0,1)\).

One option is to focus on the log odds: \[ \omega = \log\left(\frac{p}{1-p}\right) \qquad\qquad p = \frac{\exp[\omega]}{1+\exp[\omega]} = \frac{1}{\exp[-\omega]+1} \]

This quantity arises naturally when analyzing Bernoulli and binomial probabilities, and since the transformation is invertible, the log odds carries the same information as the probability itself.

Log odds takes values within \((-\infty,\infty)\).
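
In code, the transformation and its inverse are one-liners; a quick sketch:

log_odds <- function(p) log(p / (1 - p))
inv_log_odds <- function(omega) 1 / (1 + exp(-omega))

log_odds(0.75)                # log(3), about 1.10
inv_log_odds(log_odds(0.75))  # recovers 0.75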

Logistic Regression

Here is one approach: estimate \(p\) from the predictors \(x\) by modelling the log odds \(\omega\) with a linear regression.

Our model is, for a vector \(X\) of predictors, \[ \omega = X\beta \\ p = \frac{1}{1+\exp[-(X\beta)]} \]

With a logistic regression fitted to data, we can classify by assigning the label \(g_j\) with maximal value for \(\omega\).

In the binary case, \(\beta^TX > 0\) can be used to choose the label \(g_2\) over \(g_1\).
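
A minimal sketch of this in R, assuming a hypothetical data frame df with predictors V1, V2 and a two-level factor column class whose second level is g2:

fit <- glm(class ~ V1 + V2, data = df, family = binomial())

omega <- predict(fit, newdata = df, type = "link")      # fitted log odds X beta
p_hat <- predict(fit, newdata = df, type = "response")  # fitted probabilities

# choosing g2 whenever X beta > 0 is the same as choosing it whenever p_hat > 1/2
predicted <- ifelse(omega > 0, "g2", "g1")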

Multiple classes

Let us model the log probability linearly \[ \log\mathbb P(g_j|X) = (X\beta)_j - \log\beta_0 \] where we pick \(\beta_0\) to ensure that the resulting values across the \(g_j\) form a probability distribution.

Then \[ \mathbb P(g_j|X) = \frac{\exp[(X\beta)_j]}{\beta_0} \\ 1 = \sum_j \mathbb P(g_j|X) = \frac{\sum\exp[(X\beta)_j]}{\beta_0} \] so \(\beta_0=\sum_j\exp[(X\beta)_j]\)

Multiple classes

The softmax function is \[ \text{Softmax}(j,p_1,\dots,p_k) = \frac{\exp[p_j]}{\sum\exp[p_i]} \]

Hence, the log-linear model specifies \[ \mathbb P(g_j|X) = \text{Softmax}(j, X\beta) \]
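
A direct transcription into R, where the vector eta stands in for the scores \((X\beta)_j\):

softmax <- function(eta) exp(eta) / sum(exp(eta))  # subtracting max(eta) inside exp() improves numerical stability

eta <- c(1.2, -0.3, 0.5)  # hypothetical scores for three classes
softmax(eta)              # estimated P(g_j | X), summing to 1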

Reduced multiclass logistic regression

Note that across all \(k\) classes the model is redundant: the normalization is only there to ensure \(\sum_j\mathbb P(g_j|X)=1\). By picking one class as the reference, the redundancy is removed and \(\beta\) is no longer underdetermined:

\[ \beta'_j = \beta_j-\beta_k \qquad \beta'_k = 0 \\ \mathbb P(g_j|X) = \frac{\exp[(X\beta')_j]}{1+\sum_{i<k}\exp[(X\beta')_i]} \\ \mathbb P(g_k|X) = \frac{1}{1+\sum_{i<k}\exp[(X\beta')_i]} \]
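
A small numerical sketch of the reduced parameterization, with hypothetical scores for \(k=3\) classes and class \(k\) as the reference:

eta_reduced <- c(0.8, -0.4)          # scores (X beta')_j for g_1, g_2; g_3 is the reference

denom <- 1 + sum(exp(eta_reduced))
p <- c(exp(eta_reduced), 1) / denom  # P(g_1|X), P(g_2|X), P(g_3|X)
sum(p)                               # equals 1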

Multiclass logistic regression from binary

We arrive at the same formula if we start from doing \(k-1\) different binary classifications:

\[ \log\frac{\mathbb P(g_j|X)}{\mathbb P(g_k|X)} = \beta_jX\\ \mathbb P(g_j|X) = \mathbb P(g_k|X)\exp[\beta_jX] \\ 1 = \mathbb P(g_k|X)\left(1+\sum_{j=1}^{k-1}\exp[\beta_jX]\right) \\ \mathbb P(g_k|X) = \frac{1}{1+\sum\exp[\beta_jX]} \]

Incidence matrix

If we have \(k\) different class labels \(g_1,\dots,g_k\), we can use a one-hot encoding or incidence matrix to train a multivariate linear regression \[ \omega = X\beta \qquad\omega\in\mathbb R^{1\times k} \qquad\beta\in\mathbb R^{p\times k} \qquad X\in\mathbb R^{1\times p} \]

This would express \(\omega\) as a vector of \(k\) log odds, one for each class label.

Among the \(k\) assigned log odds in \(\omega\), we get \(k\) likelihoods as \[ \ell_j = \frac{1}{1+\exp[-\omega_j]} \]
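
A sketch of this in R, assuming a hypothetical data frame df with predictors V1, V2 and a factor column class; model.matrix builds the incidence matrix and lm handles the multivariate regression:

Y <- model.matrix(~ class - 1, data = df)  # one-hot / incidence matrix, n x k

fit <- lm(Y ~ V1 + V2, data = df)          # one column of coefficients per class
omega <- predict(fit, newdata = df)        # n x k matrix of fitted log odds
ell <- 1 / (1 + exp(-omega))               # the k per-class likelihoods above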

In practice

Let's work with two example data sets

Two-class classification

We can start out with a linear regression against a response that takes on the value \(1\) for group a and \(0\) for group b.

term         estimate   std.error  statistic  p.value
(Intercept)  0.5078209  0.0124431  40.81153   0
V1           0.1852310  0.0140017  13.22918   0
V2           0.2199979  0.0130105  16.90929   0

Two-class classification

This linear regression takes values between \(-0.4152142\) and \(1.4396741\), which does not allow any kind of probabilistic interpretation.

Just putting these fitted values through the inverse logit function \(c\mapsto 1/(1+\exp[-c])\) at least maps them into \((0,1)\).

Two-class classification

More appropriate than applying the logistic function and hoping for the best is to fit the logistic regression directly.

term         estimate   std.error  statistic  p.value
(Intercept)  0.1380221  0.4899565  0.2817027  0.7781715
V1           3.6409265  0.8893603  4.0938711  0.0000424
V2           4.4381955  0.9667133  4.5910151  0.0000044

Two-class classification

Comparing the resulting classifications from the logistic and the linear fit:

    Logistic B  Logistic A  Linear B  Linear A
a   2           148         2         148
b   147         3           147       3

Two-class classification

The similarity in results is not surprising: the decision boundaries for the two fits are very similar.

Three-class classification

Once we have more than two groups, the single response variable becomes several response variables. We create a one-hot or incidence matrix to encode class membership, and then train a multivariate system to predict it.
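
One way to carry this out in R is nnet::multinom, which fits the multinomial logistic regression directly rather than through an explicit incidence matrix; a sketch, assuming a hypothetical data frame df3 with predictors V1, V2 and a factor column class with levels a, b, c:

library(nnet)

fit <- multinom(class ~ V1 + V2, data = df3)
predicted <- predict(fit, newdata = df3, type = "class")
table(predicted, df3$class)  # cross-tabulate predictions against the true labels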

Three-class classification

The resulting classification counts:

a     b     c
150   2     3
0     148   0
0     0     147

Classification schemes

This multinomial logistic regression is an example of one-vs-rest classification: for each label \(g_j\), train a classifier to distinguish \(g_j\) from all other labels, then combine the classifiers using softmax (or another scheme).
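
A sketch of the one-vs-rest construction built from binary logistic fits, using the same hypothetical three-class data frame df3 as above:

classes <- levels(df3$class)

# one binary logistic regression per label: this class vs the rest
fits <- lapply(classes, function(cl)
  glm(I(class == cl) ~ V1 + V2, data = df3, family = binomial()))

# per-class scores, combined with softmax into a probability per class
scores <- sapply(fits, predict, newdata = df3, type = "link")
probs <- exp(scores) / rowSums(exp(scores))
colnames(probs) <- classes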

Discriminant functions

An alternative approach to the logistic regression route is through discriminant functions.

From a Bayesian perspective, we would start out with a set \(\pi_1,\dots,\pi_k\) of prior probabilities – a priori estimates of how likely each class is. Write \(f_j(x) = \text{PDF}_{g_j}(x)\) for the class-conditional density. Then by Bayes' rule \[ \mathbb P(g_j|x) = \frac{f_j(x)\pi_j}{\sum_\ell f_\ell(x)\pi_\ell} \]

Having access to the densities \(f_j\) is a powerful simplification compared to estimating \(\mathbb P(g_j|x)\) directly.

Linear Discriminant Analysis (LDA)

Suppose each \(f_j\) is a multivariate Gaussian density, and that all share a common covariance matrix \(\Sigma\).

\[ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \log\frac{f_j(x)}{f_\ell(x)} + \log\frac{\pi_j}{\pi_\ell} = \\ \log\frac{\pi_j}{\pi_\ell} - \frac12(\mu_j+\mu_\ell)^T\Sigma^{-1}(\mu_j-\mu_\ell) + x^T\Sigma^{-1}(\mu_j-\mu_\ell) \]

Writing \[ \delta_j(x) = x^T\Sigma^{-1}\mu_j - \frac12\mu_j^T\Sigma^{-1}\mu_j+\log\pi_j \\ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \delta_j(x)-\delta_\ell(x) \]

Linear Discriminant Analysis (LDA)

Estimating from data, we find from \(N\) observations with \(N_j\) instances of class \(j\), \[ \hat\pi_j = N_j/N \qquad \hat\mu_j = \sum_{g_i=j} x_i/N_j \\ \hat\Sigma = \sum_j\sum_{g_i=j}(x_i-\hat\mu_j)(x_i-\hat\mu_j)^T/(N-K) \\ \hat\delta_j(x) = x^T\hat\Sigma^{-1}\hat\mu_j-\frac12\hat\mu_j^T\hat\Sigma^{-1}\hat\mu_j+\log\hat\pi_j \] and classify each new observation to the class whose discriminant \(\hat\delta_j\) takes the largest value.

The two-class case reduces (using the Gaussian assumption) to a linear regression.
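
A sketch of the discriminant computation following these estimates; X is a hypothetical numeric matrix of predictors and g a factor of class labels. In practice one would normally reach for MASS::lda instead.

lda_discriminants <- function(X, g) {
  N <- nrow(X); K <- nlevels(g); classes <- levels(g)
  pi_hat <- as.numeric(table(g)) / N  # prior estimates N_j / N
  mu_hat <- lapply(classes, function(j) colMeans(X[g == j, , drop = FALSE]))
  # pooled within-class covariance with denominator N - K
  Sigma_hat <- Reduce(`+`, lapply(seq_len(K), function(j) {
    Xc <- sweep(X[g == classes[j], , drop = FALSE], 2, mu_hat[[j]])
    t(Xc) %*% Xc
  })) / (N - K)
  Sinv <- solve(Sigma_hat)
  # delta_j(x) = x' Sigma^{-1} mu_j - (1/2) mu_j' Sigma^{-1} mu_j + log pi_j
  function(x) sapply(seq_len(K), function(j) {
    drop(x %*% Sinv %*% mu_hat[[j]]
         - 0.5 * t(mu_hat[[j]]) %*% Sinv %*% mu_hat[[j]]
         + log(pi_hat[j]))
  })
}

# delta <- lda_discriminants(X, g); which.max(delta(x_new)) classifies x_new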

Quadratic Discriminant Analysis (QDA)

If the groups are assumed Gaussian but the covariance matrices are not assumed equal, the cancellations that produced the linear \(\delta_j\) earlier no longer take place.

Instead, the quadratic discriminant analysis uses discriminants \[ \delta_j(x) = -\frac12\log\left|\Sigma_j\right|-\frac12(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j)+\log\pi_j \]
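
Rather than coding these discriminants by hand, one would typically use MASS::qda; a usage sketch with the same hypothetical data frame df3 as before:

library(MASS)

fit <- qda(class ~ V1 + V2, data = df3)
predict(fit, df3)$class  # classify to the largest quadratic discriminant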

In Python

scikit-learn supports the linear methods in sklearn.linear_model and discriminant analysis in sklearn.discriminant_analysis

Model                                                         Task
sklearn.linear_model.LinearRegression                         Linear Regression
sklearn.linear_model.LogisticRegression                       Logistic Regression
sklearn.discriminant_analysis.LinearDiscriminantAnalysis      LDA
sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis   QDA

In R

Two-class classification with linear regression is supported by the basic lm linear model function.

Two-class logistic regression is handled by glm:

model = glm(class ~ predictors, data=data, family=binomial())

The package nnet contains the function multinom for multinomial logistic regression.

The package MASS contains lda and qda for linear and quadratic discriminant analysis.

Support Vector Machines

Tricking linear models

Separating hyperplanes

New plan! Let's find a hyperplane that creates as large a "no point's land" as possible.

For any hyperplane defined by \(\beta X=0\), measure the largest \(\mu\) such that \(\beta y_i < -\mu < \mu < \beta x_j\) for all \(i,j\), where the \(y_i\) are the observations in one class and the \(x_j\) those in the other.

Tricking hyperplanes

Support Vector Machines

Instead of requiring a clean separation, penalize misclassified points by how far they are misclassified.

Label the two classes by \(y_i=\pm1\). The maximal-margin separating hyperplane from earlier is given by the minimizing \(w, b\) of \[ \min \|w\| \qquad\text{subject to}\qquad y_i(w\cdot x_i-b) \geq 1 \quad\forall i \]

If there is no separating hyperplane, the conditions cannot be fulfilled. Instead weaken the condition, using a loss function that penalizes the misclassification. \[ \min_{w,b}\left[ \frac1n\sum\max(0, 1-y_i(w\cdot x_i-b)) +\lambda\|w\|^2 \right] \]

If \(x_i\) is confidently correctly classified, the penalty \(\max(0,1-y_i(w\cdot x_i-b))\) is 0. When in doubt, the penalty increases, and when \(x_i\) is misclassified the penalty exceeds 1. The \(\|w\|^2\) term rewards a wide margin.
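
A direct transcription of this objective into R, with hypothetical X (one observation per row), labels y in \(\{-1,+1\}\), and parameters w, b, lambda:

svm_objective <- function(w, b, X, y, lambda) {
  margins <- y * (X %*% w - b)  # y_i (w . x_i - b)
  mean(pmax(0, 1 - margins)) + lambda * sum(w^2)
}

Minimizing this over \(w\) and \(b\) (for instance by gradient methods) gives the soft-margin classifier; packages such as kernlab and e1071 provide ready-made implementations.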