Lecture 3

  • A survey of standard Machine Learning techniques.
  • Methods for feature selection and complexity reduction.

Reminder: Choose your semester projects by today.

Fundamental splits in ML

Supervised vs Unsupervised Learning

In Supervised Learning, data comes as matched pairs $(x_i,y_i)$, and the learning task is to construct a function $f:X\to Y$ that best generalizes observations.

Unsupervised Learning treats unpaired data $x_i$ and tries to generate structure and insight for $X$ itself.

Regression vs Classification

Regression is supervised learning where the target $Y$ is continuous.

Classification is supervised learning where $Y$ is discrete.

Linear Regression

Most will have seen this at some point.

  • Least squares fitting
  • $y \sim \sum_j\beta_jx_j$ or $y \sim X\beta + \epsilon$; closed form estimate $\hat\beta = (X^TX)^{-1}X^Ty$ (see the sketch after this list)
  • Expects noise to be
    1. Uncorrelated: $\mathbb{E}[\epsilon_i\epsilon_j|X] = 0$ for $i\neq j$
    2. Normal: $\epsilon\sim\mathcal{N}(0,\sigma^2I_n)$
    3. Exogenous: $\mathbb{E}[\epsilon|X]=0$
  • Expects variables to be
    1. Linearly independent
    2. Numeric (categorical variables can be handled with one-hot encoding)
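
As a minimal sketch (not part of the lecture) of the closed-form estimate above, using NumPy and synthetic data with illustrative names:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), X])                # prepend an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # y = X beta + noise

# closed-form least squares estimate: beta_hat = (X^T X)^{-1} X^T y
# (solve() is used instead of an explicit inverse for numerical stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)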

Linear Regression with Regularization

When linear independence is violated, coefficients can grow out of bounds. One way around this is to add a penalty term to the optimization objective that penalizes large coefficients:

  • $L_2$ - Ridge Regression / Tikhonov Regularization - Bayesian-motivated linear regression (equivalent to a Gaussian prior on the coefficients)
  • $L_1$ - Lasso - Combines coefficient shrinking with enforcing sparsity
  • $L_0$ - Best subset selection - Enforce sparsity

Once $L_1$ or $L_0$ regularization is introduced, a closed form estimate of $\hat\beta$ is no longer available and we turn to iterative optimization techniques. Ridge is the exception: it retains the closed form $\hat\beta = (X^TX+\lambda I)^{-1}X^Ty$.
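
A hedged sketch of this difference, reusing the synthetic X and y from the previous sketch: Ridge keeps a closed form, while Lasso is fit by an iterative solver (scikit-learn's coordinate descent; the lecture's $\lambda$ is called alpha there).

import numpy as np
from sklearn.linear_model import Lasso

lam = 1.0

# Ridge still admits a closed form: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Lasso has no closed form; scikit-learn fits it by coordinate descent
beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_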

Polynomial Regression

The "linear" in linear regression refers to $\beta$, not to $X$. By adding products and powers of variables as new columns, linear regression can be used to fit curved data.
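
A brief sketch (scikit-learn, a synthetic cubic signal; names are illustrative): expand $x$ into its powers and run ordinary linear regression on the expanded columns.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 80).reshape(-1, 1)
y = x.ravel() ** 3 - x.ravel() + rng.normal(scale=0.2, size=80)

# the powers of x become new columns; the model remains linear in beta
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
y_hat = model.predict(x)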

Splines

Various methods exist for fitting piecewise polynomials (splines) to data, typically requiring continuity and smoothness at the breakpoints (knots).
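
One common option is a smoothing spline; here a sketch with SciPy on synthetic data (the smoothing factor s is an illustrative choice).

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# cubic (k=3) smoothing spline; s trades fidelity to the data for smoothness
spline = UnivariateSpline(x, y, k=3, s=1.0)
y_smooth = spline(np.linspace(0, 10, 500))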

Classification with Linear Regression

Logistic Regression chains a linear regression with the logistic function $\sigma(t)=\frac{1}{1+e^{-t}}$, modeling $P(y=1\mid x)=\sigma(\beta_0+x^T\beta)$:

[Figure: the logistic (sigmoid) function]

The resulting output can be interpreted as a probability measuring confidence in class membership.

Logistic regression produces a linear classifier: its decision boundary, where the predicted probability equals $1/2$, is a hyperplane.
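
A small sketch (scikit-learn, synthetic two-feature data; the dataset is illustrative) showing the probability output and the linear boundary.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# class-membership probabilities: the logistic function applied to the linear score
proba = clf.predict_proba(X[:5])

# the decision boundary is the hyperplane clf.coef_ @ x + clf.intercept_ = 0
print(clf.coef_, clf.intercept_, proba)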

Leaving Linearity

$k$-Nearest Neighbors

Method for both regression and classification. To predict $\hat{f}(x^*)$:

  1. Find $k$ nearest neighbors $x_{i_1},\dots,x_{i_k}$ of $x^*$.
  2. Generate prediction from $y_{i_1},\dots,y_{i_k}$

For regression, the mean or the median of the neighbors is often used.

For classification, majority or plurality vote is commonly used.
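
A minimal sketch with scikit-learn on synthetic one-dimensional data (names illustrative): regression averages the neighbors' targets, classification takes a majority vote.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y_reg = np.sin(X.ravel()) + rng.normal(scale=0.1, size=100)  # continuous target
y_cls = (y_reg > 0).astype(int)                              # discrete target

# regression: mean of the 5 nearest neighbors' targets
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)

# classification: majority vote among the 5 nearest neighbors
knn_cls = KNeighborsClassifier(n_neighbors=5).fit(X, y_cls)

x_new = np.array([[2.5]])
print(knn_reg.predict(x_new), knn_cls.predict(x_new))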

$k$-Nearest Neighbors

Degree 4 curve. Cubic and degree 4 polynomial regression. k-nearest neighbors regression.

In [51]:
# xi, yi: noisy samples from the degree-4 curve; xs: a dense evaluation grid;
# y_1, y_2, y_3: the three fits listed in the caption above (computed in omitted cells)
plot(xi, yi, '*b')
plot(xs, y_1, 'r')
plot(xs, y_2, 'g')
plot(xs, y_3, 'orange')

$k$-Nearest Neighbors

Two interleaved half-moons. Logistic regression vs. 5-nearest neighbor majority vote.

[Figure: decision boundaries of logistic regression and 5-nearest-neighbor majority vote on the half-moon data]

Decision Tree

  • Check one feature at a time.
  • Figure out a good offset to break between groups.
  • Produce a flowchart to make predictions.

Just like $k$-NN, it can be used both for (piecewise constant) regression and for classification.

Unlike $k$-NN, it does not require storing and searching the entire training set at prediction time.

Produces interpretable prediction processes.
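
A short sketch of the flowchart idea, using scikit-learn and the iris data (not the lecture's example); export_text prints the learned feature/threshold checks.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# the fitted tree is an interpretable flowchart of feature/threshold checks
print(export_text(tree, feature_names=list(iris.feature_names)))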

[Figure: an example decision tree]


Decision Tree in action

As a Classifier:

Majority vote within each region. Often the tree is trained until every region is pure (contains a single class) - but restrictions on tree size may make this difficult.

As a Regressor:

Mean or median within each region.
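
A sketch of the regression case (scikit-learn, synthetic data): each leaf predicts the mean of its region, giving a piecewise constant fit.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=80)

# each leaf predicts the mean target of its region: a piecewise constant fit
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
y_hat = reg.predict(X)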


Complexity Reduction

Linear Models

  • $n \gg p$ - least squares has low bias
  • $n \geq p$ and no collinearity - least squares has a unique solution
  • $n < p$ - no unique solution, $\infty$ variance

How to fix it

  • Subset selection - pick $k$ out of the $p$ predictors
  • Shrinkage / regularization - reduce coefficients
  • Dimension reduction - project onto a smaller data space

Model Evaluation

Residual sum of squares (RSS) emphasizes model precision over model simplicity. To compensate for model complexity, use an adjusted error measure.

With $\mathcal{L}$ the maximized likelihood using all data and $p$ free parameters:

  • Akaike Information Criterion: $AIC = 2p - 2\log\mathcal{L}$
  • Bayesian Information Criterion: $BIC = p\log(n) - 2\log\mathcal{L}$

AIC is derived from Maximum Likelihood. BIC is derived from Bayesian theory.
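
For a linear model with Gaussian errors the log-likelihood can be written in terms of the RSS, so both criteria can be computed directly; a sketch (the helper name is made up here):

import numpy as np

def gaussian_aic_bic(rss, n, p):
    # log-likelihood of a linear model with Gaussian errors, at the MLE sigma^2 = RSS/n
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * p - 2 * log_lik           # p = number of free parameters
    bic = p * np.log(n) - 2 * log_lik   # heavier penalty than AIC once n >= 8
    return aic, bic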

Subset Selection

  1. Fit all $2^p$ possible subset models. Pick the one with the smallest error measure.
    For each $k$: fit all ${p\choose k}$ subsets of size $k$ against training data; then pick among the values of $k$ using validation data.
  2. Add the best remaining predictor to the partial model, one at a time. Pick the best among the resulting sequence of partial models (sketched below).
    Use AIC or BIC to balance model complexity against model precision.
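
A sketch of approach 2 (forward stepwise selection); here cross-validated $R^2$ stands in for the validation/AIC/BIC criterion, and all names are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y):
    # greedily add the predictor that most improves cross-validated R^2;
    # return the best subset seen along the way
    p = X.shape[1]
    remaining, selected = list(range(p)), []
    best_subset, best_score = [], -np.inf
    for _ in range(p):
        score, j = max((np.mean(cross_val_score(LinearRegression(),
                                                X[:, selected + [j]], y, cv=5)), j)
                       for j in remaining)
        selected.append(j)
        remaining.remove(j)
        if score > best_score:
            best_score, best_subset = score, list(selected)
    return best_subset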

Shrinkage

Linear regression minimizes $$ RSS = \sum\left(y_i-\beta_0-\sum_j\beta_jx_{ij}\right)^2 $$

Adding a penalty to the size of the coefficients reduces an infinitude of solutions to (hopefully) a unique solution. New minimization target: $$ RSS + \lambda\sum_j|\beta_j|^p $$

  • $p=0$: reduce # non-zero coefficients
  • $p=1$: Lasso - in practice approximates $p=0$
  • $p=2$: Ridge Regression
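
A quick sketch (scikit-learn, synthetic data with only three truly relevant predictors) of the practical difference between the last two bullets: the $L_2$ penalty shrinks all coefficients, while the $L_1$ penalty drives most of them exactly to zero.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                   # only 3 relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# count non-zero coefficients: ridge leaves all 20 non-zero, lasso zeroes out most
print((np.abs(ridge.coef_) > 1e-8).sum(), (np.abs(lasso.coef_) > 1e-8).sum())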

Dimension Reduction

Find a (linear) projection $\pi$ to a lower dimensional space. Try to keep distortion low.

  • Principal Component Analysis
  • Random Projection

Principal Component Analysis

Try to decorrelate the data.

Suppose $P$ is a transformation that decorrelates $X$, i.e. $PX$ has a diagonal covariance matrix. Assuming $X$ is centered, so that $\text{Cov}(X)=\mathbb{E}[XX^T]$:

$$ \text{Cov}(PX) = \mathbb{E}[PX(PX)^T] = \mathbb{E}[PXX^TP^T] = P\mathbb{E}[XX^T]P^T = P\text{Cov}(X)P^T $$

A matrix that decorrelates $X$ is therefore a matrix that diagonalizes $\text{Cov}(X)$.

Principal Component Analysis

A diagonalizing matrix is a matrix with eigenvectors as columns. This leads us to define:

Definition The principal components of a dataset $X$ are the eigenvectors of $\text{Cov}(X)$.

The corresponding eigenvalues give the amount of total variance explained by each component.

Principal components produce an orthonormal basis such that the basis element $p_i$ determines the direction of largest variance in the orthogonal complement of $\langle p_1,\dots,p_{i-1}\rangle$.
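
A sketch of the computation under these definitions (NumPy, synthetic correlated 2-D data): center, form the covariance matrix, and take its eigendecomposition.

import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition

# sort by decreasing eigenvalue: columns of `components` are the principal components
order = np.argsort(eigvals)[::-1]
components, explained_var = eigvecs[:, order], eigvals[order]

scores = Xc @ components                # the data expressed in the decorrelated basis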


Random Projection

Johnson-Lindenstrauss Lemma Given $n$ points $X\subset\mathbb{R}^p$, $0<\epsilon<1$, $m>8\log(n)/\epsilon^2$, there is a linear map $f:\mathbb{R}^p\to\mathbb{R}^m$ such that $$ (1-\epsilon)\|u-v\|^2 \leq \|f(u)-f(v)\|^2 \leq (1+\epsilon)\|u-v\|^2 $$

Such maps $f$ are plentiful: a randomly sampled projection (for instance, a random matrix with orthonormalized rows) satisfies the bound with high probability, so sampling, checking the distortion, and re-sampling on failure produces a good enough mapping efficiently.
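
A sketch of that procedure (NumPy/SciPy); a scaled Gaussian matrix is used here as the usual practical stand-in for an orthonormal one, and the dimensions are illustrative.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
n, p, eps = 200, 1000, 0.5
m = int(np.ceil(8 * np.log(n) / eps ** 2))  # target dimension from the lemma

X = rng.normal(size=(n, p))
R = rng.normal(size=(p, m)) / np.sqrt(m)    # random (scaled) Gaussian projection
Y = X @ R

# check the distortion of all pairwise squared distances; re-sample R if this fails
ratio = pdist(Y) ** 2 / pdist(X) ** 2
print(ratio.min(), ratio.max())             # should lie in [1 - eps, 1 + eps] w.h.p.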