25March, 2018
We have data.
We have clustered our data. Picked appropriate parameters.
Now we get new data.
What do we do with it?
Idea assign new observations to the known clusters.
This leads us to an implicit version of supervised learning.
Definition Supervised Learning: given paired data \((X,y)\) learn a function \(f(X)\) that accurately predicts \(y\) as \(f(X)\).
Definition Regression: supervised learning with continuous responses \(y\).
Definition Classification: supervised learning with categorical responses \(y\).
One solution is to assign to a new \(X\) whatever class its nearest neighbor belongs to.
More refined: majority vote among \(k\) nearest neighbors.
Pro
Con
Fundamentally in classification problems, we are looking for estimated probabilities (in a Bayesian sense) of a new observation \(x\) being assigned a class label \(g\).
Hence, we want to estimate \[ \mathbb P(g | x) \]
With these probabilities in hand, we can
A plain linear regression would be a bad fit for describing a probability: \(\beta\cdot X\) takes on values in \((-\infty,\infty)\) while probabilities are in \((0,1)\).
One option is to focus on the log odds: \[ \omega = \log\left(\frac{p}{1-p}\right) \qquad\qquad p = \frac{\exp[\omega]}{1+\exp[\omega]} = \frac{1}{\exp[-\omega]+1} \]
The quantity arises naturally when analyzing Bernoulli and Binomial probabilities, and we see it is equivalent to the probability itself.
Log odds takes values within \((-\infty,\infty)\).
Here is one approach: estimate \(p\) based on predictors \(x\) by fitting \(\omega\) to a linear regression.
Our model is, for a vector \(X\) of predictors, \[ \omega = X\beta \\ p = \frac{1}{1+\exp[-(X\beta)]} \]
With a logistic regression fitted to data, we can classify by assigning the label \(g_j\) with maximal value for \(\omega\).
In the binary case, \(\beta^T\cdot X > 0\) can be used to choose the label \(g_2\) over \(g_1\).
Let us model the log probability linearly \[ \log\mathbb P(g_j) = (X\beta)_j - \beta_0 \] where we pick \(\beta_0\) to ensure that the resulting distribution across the \(g_j\) is a probability distribution.
Then \[ \mathbb P(g_j|X) = \frac{\exp[(X\beta)_j]}{\beta_0} \\ 1 = \sum_j \mathbb P(g_j|X) = \frac{\sum\exp[(X\beta)_j]}{\beta_0} \] so \(\beta_0=\sum_j\exp[(X\beta)_j]\)
The softmax function is \[ \text{Softmax}(j,p_1,\dots,p_k) = \frac{\exp[p_j]}{\sum\exp[p_i]} \]
Hence, the log-linear model specifies \[ \mathbb P(g_j|X) = \text{Softmax}(\bullet, X\beta) \]
Note that across all \(k\) classes, the model is redundant: the normalization is there to ensure \(\sum\mathbb P(g_j|X)=1\). By picking one probability as reference probability, the redundancy is removed, and \(\beta\) less underdetermined:
\[ \beta'_j = \beta_j-\beta_k \qquad \beta'_k = 0 \\ \mathbb P(g_j|X) = \frac{\exp[(\beta' X)_j]}{\sum\exp[(\beta' X)_i]+1} \\ \mathbb P(g_k|X) = \frac{1}{\sum\exp[(\beta' X)_i]+1} \]
We arrive at the same formula if we start from doing \(k-1\) different binary classifications:
\[ \log\frac{\mathbb P(g_j|X)}{\mathbb P(g_k|X)} = \beta_jX\\ \mathbb P(g_j|X) = \mathbb P(g_k|X)\exp[\beta_jX] \\ 1 = \mathbb P(g_k|X)\left(1+\sum_{j=1}^{k-1}\exp[\beta_jX]\right) \\ \mathbb P(g_k|X) = \frac{1}{1+\sum\exp[\beta_jX]} \]
If we have \(k\) different class labels \(g_1,\dots,g_k\), we can use a one-hot encoding or incidence matrix to train a multivariate linear regression \[ \omega = X\beta \qquad\omega\in\mathbb R^{1\times k} \qquad\beta\in\mathbb R^{p\times k} \qquad X\in\mathbb R^{1\times p} \]
This would express \[ \omega \]
Among the \(k\) assigned log odds in \(\omega\), we get \(k\) likelihoods as \[ \ell_j = \frac{1}{\exp[\omega_j]} \]
Let's work with two example data sets
We can start out with a linear regression against a response that takes on the value \(1\) for the group a
, \(0\) for b
.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.5078209 | 0.0124431 | 40.81153 | 0 |
V1 | 0.1852310 | 0.0140017 | 13.22918 | 0 |
V2 | 0.2199979 | 0.0130105 | 16.90929 | 0 |
We can start out with a linear regression against a response that takes on the value \(1\) for the group a
, \(0\) for b
.
This linear regression takes values between -0.4152142 and 1.4396741, not allowing for any kind of probabilistic interpretation.
Just putting these values through the inverse logit function \(c\mapsto 1/(1+\exp[-c])\) gets us
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.1380221 | 0.4899565 | 0.2817027 | 0.7781715 |
V1 | 3.6409265 | 0.8893603 | 4.0938711 | 0.0000424 |
V2 | 4.4381955 | 0.9667133 | 4.5910151 | 0.0000044 |
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
More appropriate than using the logistic function and hoping for the best would be to fit the logistic regression directly.
Logistic B | Logistic A | Linear B | Linear A | |
---|---|---|---|---|
a | 2 | 148 | 2 | 148 |
b | 147 | 3 | 147 | 3 |
The similarity in results is not very remarkable; the decision boundaries for both cases are similar:
Once we hit more than two groups, the response variable becomes response variables. We create a one-hot or incidence matrix to encode class membership, and then train a multivariate system to predict this.
a | b | c |
---|---|---|
150 | 2 | 3 |
0 | 148 | 0 |
0 | 0 | 147 |
This multinomial logistic regression is an example of one-vs-rest classification: for each label \(g_j\), train a classifier to distinguish \(g_j\) from other labels – combine all classifiers using softmax (or other scheme).
An alternative approach to the logistic regression route is through discriminant functions.
For a Bayesian perspective, we would start out with some set \(\pi_1,\dots,\pi_k\) of prior probabilities – a priori estimates of the likelihoods of each class. Write \(f_j(x) = \text{PDF}_{g_j}(x)\). Then by Bayes rule \[ \mathbb P(g_j|x) = \frac{f_j(x)\pi_j}{\sum f_\ell(x)\pi_\ell} \]
Access to \(f_j\) is a powerful simplification of estimating \(\mathbb P(g_j|x)\) directly.
Suppose each \(f_j\) is a multivariate Gaussian density, and that all share a common covariance matrix \(\Sigma\).
\[ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \log\frac{f_j(x)}{f_\ell(x)} + \log\frac{\pi_j}{\pi_\ell} = \\ \log\frac{\pi_j}{\pi_\ell} - \frac12(\mu_j+\mu_\ell)^T\Sigma^{-1}(\mu_j+\mu_\ell) + x^T\Sigma^{-1}(\mu_k-\mu_\ell) \]
Writing \[ \delta_j(x) = x^T\Sigma^{-1}\mu_j - \frac12\mu_j^T\Sigma^{-1}\mu_j+\log\pi_j \\ \log\frac {\mathbb P(g_j|x)} {\mathbb P(g_\ell|x)} = \delta_j(x)-\delta_\ell(x) \]
Estimating from data, we find from \(N\) observations with \(N_j\) instances of class \(j\), \[ \hat\pi_j = N_j/N \qquad \hat\mu_j = \sum_{g(x)=j} x_i/N_j \\ \hat\Sigma = \sum_j\sum_{g(x)=j}(x-\hat\mu_j)(x-\hat\mu_j)^T/(N-K) \\ \hat\delta_j(x) = x^T\hat\Sigma^{-1}\hat\mu_j-\frac12\hat\mu_j^T\hat\Sigma^{-1}\hat\mu_j+\log\hat\pi_j \] and classify each new observation to the discriminator with the largest value.
The two-class case reduces (using the Gaussian assumption) to a linear regression.
If the groups are assumed Gaussian, but the covariances not assumed equal, then the cancellations that generated \(\delta_j\) earlier no longer take place.
Instead, the quadratic discriminant analysis uses discriminants \[ \delta_j(x) = -\frac12\log\left|\Sigma_j\right|-\frac12(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j)+\log\pi_j \]
scikit-learn
supports the linear methods in sklearn.linear_models
and discriminant analysis in sklearn.discriminant_analysis
Model | Task |
---|---|
sklearn.linear_models.LinearRegression |
Linear Regression |
sklearn.linear_models.LogisticRegression |
Logistic Regression |
sklearn.discriminant_analysis.LinearDiscriminantAnalysis |
LDA |
sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis |
QDA |
Two-class classification with linear regression is supported by the basic lm
linear model function.
Two-class logistic regression is handled by glm
:
model = glm(class ~ predictors, data=data, family=binomial())
The package nnet
contains the function multinom
for multiple logistic regression.
The package MASS
contains lda
and qda
for linear and quadratic discriminant analysis.
New plan! Let's find a hyperplane forming as large as possible a No point's land.
For any hyperplane defined by \(\beta X=0\), measure the largest \(\mu\) such that \(\beta y_i < -\mu < \mu < \beta x_j\) for all \(i,j\).
New plan! Let's find a hyperplane forming as large as possible a No point's land.
For any hyperplane defined by \(\beta X=0\), measure the largest \(\mu\) such that \(\beta y_i < -\mu < \mu < \beta x_j\) for all \(i,j\).
## Setting default kernel parameters
Instead of requiring a clean separation, penalize misclassified points by how far they are misclassified.
Label two classes by \(y_i=\pm1\). The separating hyperplane earlier is the minimizing \(w, b\) of \[ \min \|w\| \qquad\text{subject to}\qquad y_i(w\cdot x_i-b) \geq 1 \quad\forall i \]
If there is no separating hyperplane, the conditions cannot be fulfilled. Instead weaken the condition, using a loss function that penalizes the misclassification. \[ \min_{w,b}\left[ \frac1n\sum\max(0, 1-y_i(w\cdot x_i-b)) +\lambda\|w\|^2 \right] \]
\[ \min_{w,b}\left[ \frac1n\sum\max(0, 1-y_i(w\cdot x_i-b)) +\lambda\|w\|^2 \right] \]
If \(y_i\) is confidently correctly classified, the penalty \(\max(0,1-y_i(w\cdot x_i-b))\) is 0. When in doubt, the penalty increases, and when misclassified the penalty will exceed 1. The \(\|w\|^2\) term narrows the margin.