Lecture 3

  • A survey of standard Machine Learning techniques.
  • Methods for feature selection and complexity reduction.

Reminder: Choose your semester projects by today.

Fundamental splits in ML

Supervised vs Unsupervised Learning

In Supervised Learning, data comes as matched pairs $(x_i,y_i)$, and the learning task is to construct a function $f:X\to Y$ that best generalizes observations.

Unsupervised Learning treats unpaired data $x_i$ and tries to generate structure and insight for $X$ itself.

Regression vs Classification

Regression is supervised learning where the target $Y$ is continuous.

Classification is supervised learning where $Y$ is discrete.

Linear Regression

Most will have seen this at some point.

  • Least squares fitting
  • $y \sim \sum_j\beta_jx_j$ or $y \sim X\beta + \epsilon$; closed form estimate $\hat\beta = (X^TX)^{-1}X^Ty$ (see the sketch after this list)
  • Expects noise to be
    1. Uncorrelated: $\mathbb{E}[\epsilon_i\epsilon_j|X] = 0$ for $i\neq j$
    2. Normal: $\epsilon\sim\mathcal{N}(0,\sigma^2I_n)$
    3. Exogenous: $\mathbb{E}[\epsilon|X]=0$
  • Expects variables to be
    1. Linearly independent
    2. Numeric (categorical variables can be handled with one-hot encoding)
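
As a minimal sketch (not part of the lecture) of the closed-form estimate above, using NumPy and synthetic data with illustrative names:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), X])                # prepend an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # y = X beta + noise

# closed-form least squares estimate: beta_hat = (X^T X)^{-1} X^T y
# (solve() is used instead of an explicit inverse for numerical stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)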

Linear Regression with Regularization

When linear independence is violated, coefficients can grow out of bounds. One way around this is to add a penalty term to the optimization objective that penalizes large coefficients:

  • $L_2$ - Ridge Regression / Tikhonov Regularization - Bayesian-motivated linear regression (equivalent to a Gaussian prior on the coefficients)
  • $L_1$ - Lasso - Combines coefficient shrinking with enforcing sparsity
  • $L_0$ - Best subset selection - Enforce sparsity

Once $L_1$ or $L_0$ regularization is introduced, a closed form estimate of $\hat\beta$ is no longer available and we turn to iterative optimization techniques. Ridge is the exception: it retains the closed form $\hat\beta = (X^TX+\lambda I)^{-1}X^Ty$.
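
A hedged sketch of this difference, reusing the synthetic X and y from the previous sketch: Ridge keeps a closed form, while Lasso is fit by an iterative solver (scikit-learn's coordinate descent; the lecture's $\lambda$ is called alpha there).

import numpy as np
from sklearn.linear_model import Lasso

lam = 1.0

# Ridge still admits a closed form: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Lasso has no closed form; scikit-learn fits it by coordinate descent
beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_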

Polynomial Regression

The "linear" in linear regression refers to $\beta$, not to $X$. By adding products and powers of variables as new columns, linear regression can be used to fit curved data.
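
A brief sketch (scikit-learn, a synthetic cubic signal; names are illustrative): expand $x$ into its powers and run ordinary linear regression on the expanded columns.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 80).reshape(-1, 1)
y = x.ravel() ** 3 - x.ravel() + rng.normal(scale=0.2, size=80)

# the powers of x become new columns; the model remains linear in beta
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
y_hat = model.predict(x)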

Splines

Various methods exist for fitting piecewise polynomials (splines) to data, typically requiring continuity and smoothness at the breakpoints (knots).
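
One common option is a smoothing spline; here a sketch with SciPy on synthetic data (the smoothing factor s is an illustrative choice).

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# cubic (k=3) smoothing spline; s trades fidelity to the data for smoothness
spline = UnivariateSpline(x, y, k=3, s=1.0)
y_smooth = spline(np.linspace(0, 10, 500))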

Classification with Linear Regression

Logistic Regression chains a linear regression with the logistic function $\sigma(t)=\frac{1}{1+e^{-t}}$, modeling $P(y=1\mid x)=\sigma(\beta_0+x^T\beta)$:

[Figure: the logistic (sigmoid) function]

The resulting output can be interpreted as a probability measuring confidence in class membership.

Logistic regression produces a linear classifier: its decision boundary, where the predicted probability equals $1/2$, is a hyperplane.
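
A small sketch (scikit-learn, synthetic two-feature data; the dataset is illustrative) showing the probability output and the linear boundary.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# class-membership probabilities: the logistic function applied to the linear score
proba = clf.predict_proba(X[:5])

# the decision boundary is the hyperplane clf.coef_ @ x + clf.intercept_ = 0
print(clf.coef_, clf.intercept_, proba)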

Leaving Linearity

$k$-Nearest Neighbors

Method for both regression and classification. To predict $\hat{f}(x^*)$:

  1. Find $k$ nearest neighbors $x_{i_1},\dots,x_{i_k}$ of $x^*$.
  2. Generate prediction from $y_{i_1},\dots,y_{i_k}$

For regression, the mean or the median of the neighbors is often used.

For classification, majority or plurality vote is commonly used.
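
A minimal sketch with scikit-learn on synthetic one-dimensional data (names illustrative): regression averages the neighbors' targets, classification takes a majority vote.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y_reg = np.sin(X.ravel()) + rng.normal(scale=0.1, size=100)  # continuous target
y_cls = (y_reg > 0).astype(int)                              # discrete target

# regression: mean of the 5 nearest neighbors' targets
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)

# classification: majority vote among the 5 nearest neighbors
knn_cls = KNeighborsClassifier(n_neighbors=5).fit(X, y_cls)

x_new = np.array([[2.5]])
print(knn_reg.predict(x_new), knn_cls.predict(x_new))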

$k$-Nearest Neighbors

Degree 4 curve. Cubic and degree 4 polynomial regression. k-nearest neighbors regression.

In [51]:
# xi, yi: noisy samples from the degree-4 curve; xs: a dense evaluation grid;
# y_1, y_2, y_3: the three fits listed in the caption above (computed in omitted cells)
plot(xi, yi, '*b')
plot(xs, y_1, 'r')
plot(xs, y_2, 'g')
plot(xs, y_3, 'orange')

$k$-Nearest Neighbors

Two interleaved half-moons. Logistic regression vs. 5-nearest neighbor majority vote.

[Figure: decision boundaries of logistic regression and 5-nearest-neighbor majority vote on the half-moon data]

Decision Tree

  • Check one feature at a time.
  • Figure out a good offset to break between groups.
  • Produce a flowchart to make predictions.

Just like $k$-NN, it can be used both for (piecewise constant) regression and for classification.

Unlike $k$-NN, it does not require storing and searching the entire training set at prediction time.

Produces interpretable prediction processes.
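
A short sketch of the flowchart idea, using scikit-learn and the iris data (not the lecture's example); export_text prints the learned feature/threshold checks.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# the fitted tree is an interpretable flowchart of feature/threshold checks
print(export_text(tree, feature_names=list(iris.feature_names)))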

[Figure: an example decision tree]


Decision Tree in action

As a Classifier:

Majority vote within each region. Often the tree is trained until every region is pure (contains a single class) - but restrictions on tree size may make this difficult.

As a Regressor:

Mean or median within each region.
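
A sketch of the regression case (scikit-learn, synthetic data): each leaf predicts the mean of its region, giving a piecewise constant fit.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=80)

# each leaf predicts the mean target of its region: a piecewise constant fit
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
y_hat = reg.predict(X)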


Complexity Reduction

Linear Models

  • $n \gg p$ - least squares has low bias
  • $n \geq p$ and no collinearity - least squares has a unique solution
  • $n < p$ - no unique solution, $\infty$ variance

How to fix it

  • Subset selection - pick $k$ out of the $p$ predictors
  • Shrinkage / regularization - reduce coefficients
  • Dimension reduction - project onto a smaller data space

Model Evaluation

Residual sum of squares (RSS) emphasizes model precision over model simplicity. To compensate for model complexity, use an adjusted error measure.

With $\mathcal{L}$ the maximized likelihood using all data and $p$ free parameters:

  • Akaike Information Criterion: $AIC = 2p - 2\log\mathcal{L}$
  • Bayesian Information Criterion: $BIC = p\log(n) - 2\log\mathcal{L}$

AIC is derived from Maximum Likelihood. BIC is derived from Bayesian theory.
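
For a linear model with Gaussian errors the log-likelihood can be written in terms of the RSS, so both criteria can be computed directly; a sketch (the helper name is made up here):

import numpy as np

def gaussian_aic_bic(rss, n, p):
    # log-likelihood of a linear model with Gaussian errors, at the MLE sigma^2 = RSS/n
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * p - 2 * log_lik           # p = number of free parameters
    bic = p * np.log(n) - 2 * log_lik   # heavier penalty than AIC once n >= 8
    return aic, bic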

Subset Selection

  1. Fit all $2^p$ possible subset models. Pick the one with the smallest error measure.
    For each $k$: fit all ${p\choose k}$ subsets of size $k$ against training data; then pick among the values of $k$ using validation data.
  2. Add the best remaining predictor to the partial model, one at a time. Pick the best among the resulting sequence of partial models (sketched below).
    Use AIC or BIC to balance model complexity against model precision.
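
A sketch of approach 2 (forward stepwise selection); here cross-validated $R^2$ stands in for the validation/AIC/BIC criterion, and all names are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y):
    # greedily add the predictor that most improves cross-validated R^2;
    # return the best subset seen along the way
    p = X.shape[1]
    remaining, selected = list(range(p)), []
    best_subset, best_score = [], -np.inf
    for _ in range(p):
        score, j = max((np.mean(cross_val_score(LinearRegression(),
                                                X[:, selected + [j]], y, cv=5)), j)
                       for j in remaining)
        selected.append(j)
        remaining.remove(j)
        if score > best_score:
            best_score, best_subset = score, list(selected)
    return best_subset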

Shrinkage

Linear regression minimizes $$ RSS = \sum\left(y_i-\beta_0-\sum_j\beta_jx_{ij}\right)^2 $$

Adding a penalty to the size of the coefficients reduces an infinitude of solutions to (hopefully) a unique solution. New minimization target: $$ RSS + \lambda\sum_j|\beta_j|^p $$

  • $p=0$: reduce # non-zero coefficients
  • $p=1$: Lasso - in practice approximates $p=0$
  • $p=2$: Ridge Regression
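
A quick sketch (scikit-learn, synthetic data with only three truly relevant predictors) of the practical difference between the last two bullets: the $L_2$ penalty shrinks all coefficients, while the $L_1$ penalty drives most of them exactly to zero.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                   # only 3 relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# count non-zero coefficients: ridge leaves all 20 non-zero, lasso zeroes out most
print((np.abs(ridge.coef_) > 1e-8).sum(), (np.abs(lasso.coef_) > 1e-8).sum())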

Dimension Reduction

Find a (linear) projection $\pi$ to a lower dimensional space. Try to keep distortion low.

  • Principal Component Analysis
  • Random Projection

Principal Component Analysis

Try to decorrelate the data.

Suppose $P$ is a transformation that decorrelates $X$, i.e. $PX$ has a diagonal covariance matrix. Assuming $X$ is centered, so that $\text{Cov}(X)=\mathbb{E}[XX^T]$:

$$ \text{Cov}(PX) = \mathbb{E}[PX(PX)^T] = \mathbb{E}[PXX^TP^T] = P\mathbb{E}[XX^T]P^T = P\text{Cov}(X)P^T $$

A matrix that decorrelates $X$ is therefore a matrix that diagonalizes $\text{Cov}(X)$.

Principal Component Analysis

A diagonalizing matrix is a matrix with eigenvectors as columns. This leads us to define:

Definition The principal components of a dataset $X$ are the eigenvectors of $\text{Cov}(X)$.

The corresponding eigenvalues give the amount of total variance explained by each component.

Principal components produce an orthonormal basis such that the basis element $p_i$ determines the direction of largest variance in the orthogonal complement of $\langle p_1,\dots,p_{i-1}\rangle$.
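
A sketch of the computation under these definitions (NumPy, synthetic correlated 2-D data): center, form the covariance matrix, and take its eigendecomposition.

import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition

# sort by decreasing eigenvalue: columns of `components` are the principal components
order = np.argsort(eigvals)[::-1]
components, explained_var = eigvecs[:, order], eigvals[order]

scores = Xc @ components                # the data expressed in the decorrelated basis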


Random Projection

Johnson-Lindenstrauss Lemma Given $n$ points $X\subset\mathbb{R}^p$, $0<\epsilon<1$, $m>8\log(n)/\epsilon^2$, there is a linear map $f:\mathbb{R}^p\to\mathbb{R}^m$ such that $$ (1-\epsilon)\|u-v\|^2 \leq \|f(u)-f(v)\|^2 \leq (1+\epsilon)\|u-v\|^2 $$

Such maps $f$ are plentiful: a randomly sampled projection (for instance, a random matrix with orthonormalized rows) satisfies the bound with high probability, so sampling, checking the distortion, and re-sampling on failure produces a good enough mapping efficiently.
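
A sketch of that procedure (NumPy/SciPy); a scaled Gaussian matrix is used here as the usual practical stand-in for an orthonormal one, and the dimensions are illustrative.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
n, p, eps = 200, 1000, 0.5
m = int(np.ceil(8 * np.log(n) / eps ** 2))  # target dimension from the lemma

X = rng.normal(size=(n, p))
R = rng.normal(size=(p, m)) / np.sqrt(m)    # random (scaled) Gaussian projection
Y = X @ R

# check the distortion of all pairwise squared distances; re-sample R if this fails
ratio = pdist(Y) ** 2 / pdist(X) ** 2
print(ratio.min(), ratio.max())             # should lie in [1 - eps, 1 + eps] w.h.p.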