8 March, 2018

Supervision, modeling and error measures

Supervised learning

We are deliberately avoiding supervised learning for most of this course.

Supervised learning Given paired data \(\{(X_i, y_i)\}\), find a function \(f\) that describes the joint distribution \(\mathbb P(X, y)\).

Unsupervised learning Given data \(\{X_i\}\) find a function \(f\) to produce information about the distribution \(\mathbb P(X)\).

Supervised learning

Very often accomplished through the optimization of a loss function: given \(\{X_i, y_i\}\), find a function \(f\) that minimizes \[ \text{Err} = \sum_j L(f(X_j), y_j) \]

Name                          \(L(f(X), y)\)
Square (Euclidean, \(L_2\))   \(\sum_k (f(X)_k - y_k)^2\)
Absolute (\(L_1\))            \(\sum_k \mid f(X)_k - y_k \mid\)
0-1 loss                      number of disagreements
Deviance                      \(-2\log \mathbb P(y \mid f(X))\)
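A minimal numpy sketch of these losses, assuming arrays y_hat for the predictions \(f(X)\) and y for the targets (the names are chosen here for illustration):

import numpy as np

def squared_loss(y_hat, y):
    # L2 loss: sum of squared componentwise differences
    return np.sum((y_hat - y) ** 2)

def absolute_loss(y_hat, y):
    # L1 loss: sum of absolute componentwise differences
    return np.sum(np.abs(y_hat - y))

def zero_one_loss(y_hat, y):
    # number of disagreements between predicted and true labels
    return np.sum(y_hat != y)

y_hat = np.array([0.9, 2.1, 2.8])
y = np.array([1.0, 2.0, 3.0])
print(squared_loss(y_hat, y), absolute_loss(y_hat, y))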

Models and Generalizability

A model has several important tasks:

  • Introspection – discerning features about data or a process through the parameters learned by a model
  • Estimation – proposing, with bounds on uncertainty, a true value for some quantity
  • Prediction – proposing, with bounds on uncertainty, the next value from a process

A good predictive model will be able to generalize: it produces low-error predictions not only on already known cases, but also on new data.

Models and Generalizability

det er vanskeligt at spå – især om fremtiden – Storm P

Predictions are difficult – especially about the future.

Models and Generalizability

To evaluate prediction quality and capacity, we want access to the distribution of the prediction error: for \(X_0, y_0\) a new observation, quantify

\[ E_\text{pred} = \mathbb{E} L(f(X_0),y_0) \]

If we can quantify \(E_\text{pred}\) (or even better, the distribution \(\mathbb P[L(f(X_0),y_0) > x]\)), we can produce predictions with uncertainty bounds that hold with a fixed probability of correctness.

Bias - Variance decomposition

In general, for an estimator \(\hat X\) of a fixed quantity \(X\), with finite expectation and variance, the squared error loss decomposes, with \(\text{Bias}(\hat X)=X-\mathbb E\hat X\), \[ \mathbb E[X-\hat X]^2 = \mathbb E[(X-\mathbb E\hat X)-(\hat X-\mathbb E\hat X)]^2 = \\ \color{green}{\mathbb E[X-\mathbb E\hat X]^2} + \color{orange}{\mathbb E[\hat X-\mathbb E\hat X]^2} - 2\mathbb E[(X-\mathbb E\hat X)(\hat X-\mathbb E\hat X)] = \\ \color{green}{\text{Bias}(\hat X)^2} + \color{orange}{\mathbb V\hat X} \color{lightgrey}{ - 2(X-\mathbb E\hat X)(\mathbb E\hat X-\mathbb E\hat X) } \]

Bias - Variance decomposition

Suppose \(Y=f(X) + \epsilon\) with \(\mathbb E\epsilon = 0\) and \(\mathbb V\epsilon = \sigma^2\). With squared error loss, and a regressor \(\hat f\), the prediction error at a fixed input \(X\) is \[ E_\text{pred}(X) = \mathbb{E} L(\hat f(X),Y) = \mathbb EL(\hat f(X),f(X)+\epsilon) = \\ \mathbb E[f(X) + \epsilon - \hat f(X)]^2 = \\ \color{blue}{\mathbb E[f(X)-\mathbb E\hat f(X)]^2} + \color{green}{\mathbb E\epsilon^2} + \color{orange}{\mathbb E[\hat f(X)-\mathbb E\hat f(X)]^2} \\ +2\mathbb E[\epsilon(f(X)-\mathbb E\hat f(X))] -2\mathbb E[\epsilon(\hat f(X)-\mathbb E\hat f(X))] \\ -2\mathbb E[(f(X)-\mathbb E\hat f(X))(\hat f(X)-\mathbb E\hat f(X))] = \\ \color{blue}{\text{Bias}(\hat f(X))^2} + \color{green}{\mathbb V\epsilon} + \color{orange}{\mathbb V\hat f(X)} = \\ \color{green}{\text{Irreducible Error}} + \color{blue}{\text{Bias}^2} + \color{orange}{\text{Variance}} \]
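A small Monte Carlo sketch of this decomposition. The true function, noise level, and the deliberately crude regressor (predicting the sample mean of the responses) are assumptions made here so that bias, variance and irreducible error can all be estimated empirically:

import numpy as np

rng = np.random.default_rng(0)
f = np.sin                  # assumed true regression function
sigma = 0.3                 # assumed noise standard deviation
x0 = 1.0                    # fixed test input
n_train, n_reps = 20, 5000

preds, errors = [], []
for _ in range(n_reps):
    # draw a fresh training set and a fresh response at x0
    x = rng.uniform(0, 2, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    f_hat_x0 = y.mean()     # high-bias regressor: predict the mean response
    y0 = f(x0) + rng.normal(0, sigma)
    preds.append(f_hat_x0)
    errors.append((y0 - f_hat_x0) ** 2)

preds = np.array(preds)
bias2 = (f(x0) - preds.mean()) ** 2
variance = preds.var()
# the two printed numbers should nearly agree
print(np.mean(errors), sigma ** 2 + bias2 + variance)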

Overfitting and underfitting

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. – John von Neumann


Modeling trade-offs

Increasing model complexity tends to decrease bias and increase variance.

Increasing model complexity increases the risk of overfitting and decreases the risk of underfitting.

               Known data   New data
Underfitting   High error   High error
Overfitting    Low error    High error
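A sketch of this trade-off with polynomial regression of increasing degree; the data, the degrees and the noise level are invented here for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # error on known data
          mean_squared_error(y_test, model.predict(X_test)))    # error on new data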

Error types

\(E_\text{train}\) (\(\overline{\text{err}}\)) Average error over known data.

\(E_\mathcal{T}\) (\(\text{Err}_{\mathcal T}\)) Expected error on completely new data after training on a specific training set \(\mathcal T\).

\(E_\text{in}\) (\(\text{Err}_\text{in}\)) Expected error over new responses at the already known (training) inputs.

\(E_\text{pred}\) (\(\text{Err}\)) Expected error over new data: \(\mathbb EE_\mathcal{T}\).

We have easy access to \(E_\text{train}\). We want to quantify \(E_\text{pred}\). Pie-in-the-sky dream scenario: complete information about the distribution of \(E_\mathcal{T}\) over all \(\mathcal T\).

Goals for model analysis

Model selection Given a set of candidate models, pick the one that performs the best.

Model assessment Given a chosen model, estimate its prediction error on new data.

Estimating errors

With enough data, we can set part of it aside to estimate \(E_\text{pred}\) by leaving it out of the entire analysis. This leads to the Train/Test split paradigm:

                Train               Validation          Test
Estimates       \(E_\text{train}\)  \(E_\text{pred}\)   \(E_\text{pred}\)
Use for         Fitting model       Selecting model     Assessing model
Allowed use     Entire process      After fitting       After fixing all parameters
Common ratios   50%                 25%                 25%

All estimation of \(E_\text{pred}\) is built to simulate going back to an experimental setup and gathering new data after fixing (registering!) all aspects of the subsequent analysis.

Test/Train splits

In Python sklearn.model_selection.train_test_split

In R base::sample

Test/Train splits in Python

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25
)

test_size and train_size give a proportion of the data if a float between 0 and 1; an absolute count if an integer

random_state seed for the random generator, for reproducibility

shuffle (boolean) set to False to keep the original ordering of the data instead of shuffling

stratify (array of labels) stratified sampling: preserve the proportion of each subpopulation in both splits
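The 50/25/25 split from the table above can be obtained by calling train_test_split twice; a sketch with placeholder data (the arrays X and y here are toy stand-ins):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # placeholder data
y = np.arange(20)

# peel off 25% for test, then split the remaining 75% in the ratio 2:1
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=42)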

Test/Train splits in R

# choose split sizes, e.g. 50% train and 25% test (the rest becomes validation)
n.train = floor(0.50 * nrow(X))
n.test = floor(0.25 * nrow(X))

all.idx = 1:nrow(X)
train.idx = sample.int(nrow(X), size=n.train)
test.idx = sample(all.idx[-train.idx], size=n.test)
validation.idx = all.idx[-c(train.idx, test.idx)]

train = X[train.idx,]
validation = X[validation.idx,]
test = X[test.idx,]

caret::createDataPartition packs this sequence into a single function call.

Cross-validation

Losing data

For small data sets, leaving out separate chunks of data both for validation and for testing may be prohibitively expensive: we may end up with too little data to fit a model properly at all.

We still cannot touch the test data – any interaction with test data pollutes the statistical methods and (vastly) underestimates prediction error.

Cross-validation offers a different route.

K-fold cross-validation

Split data into \(K\) equally sized parts.

Repeat train-validate step \(K\) times, each time leaving one of the \(K\) parts out for validation.

Average validation error across the validation parts.

Select the model with lowest validation error.

Common choices for \(K\) are 5 and 10.
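A sketch of this procedure with scikit-learn's cross_val_score; the dataset and the two candidate models are stand-ins, not the ones used later in the lecture:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

for name, model in candidates.items():
    # K = 5 folds; scoring returns negated MSE, so flip the sign to get a loss
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, scores.mean())
# pick the candidate with the lowest mean validation error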

Example: Forest Fires

A collection of 500 Portuguese forest fires.

Task: predict burned area from meteorological data.

Candidate models:

  • Linear regression on humidity, temperature, wind and rain
  • Linear regression on these + 4 additional meteorological indices
  • Ridge regression on the 8 parameters.

Consider these models black boxes.

Example: Forest Fires

                      Linear 4    Linear 8    Ridge 8
Mean Validation MSE   2.166473    2.1391546   1.9874294
Test set MSE          2.9865383   3.056937    2.7227432

Leave One Out

The limit case of \(K\)-fold cross-validation is \(K=N\): each model is trained \(N\) times, each time on all but one observation. The average prediction error gives an estimate of \(E_\text{pred}\).

This approach has a long history in statistics: it was introduced in 1949 by Maurice Quenouille, and named jackknife by John Tukey in 1958. The jackknife estimates a parameter by estimating it for all leave-one-out subsamples and taking the average estimate.
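A minimal numpy sketch of the jackknife as described: compute a statistic (here the mean, purely for illustration) on every leave-one-out subsample and average:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, 50)   # toy sample

# the statistic computed on each leave-one-out subsample
loo_estimates = np.array([np.mean(np.delete(x, i)) for i in range(len(x))])

jackknife_estimate = loo_estimates.mean()
# standard jackknife formula for the standard error of the estimate
n = len(x)
jackknife_se = np.sqrt((n - 1) / n * np.sum((loo_estimates - jackknife_estimate) ** 2))
print(jackknife_estimate, jackknife_se)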

Cross-validation in Python

scikit-learn has a large range of cross-validation helper functions and dataset splitting objects in the submodule model_selection. Some of the most relevant are

KFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
for train_idx, val_idx in kfold.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # do stuff with the train/validation split

Cross-validation in R

Many libraries include their own cross-validation implementations.

caret is a meta-package with a unified calling interface for a large range of (supervised) machine learning methods. caret will do a lot of hyper-parameter estimation automatically, with methods for cross-validation as well as grid search etc.

Bootstrap

Bootstrap – inflate data

Bootstrap methods form a very generic framework for statistical estimation.

Basic idea If we were able to generate arbitrarily large samples from the true distribution, we could generate quantifiably tight estimates of quantities.

Let \(\hat\theta = \hat\theta(X)\) be a statistic, calculated from data \(X\), that estimates a quantity \(\theta\). If \(\hat\theta\) is consistent, larger and larger samples lead to \(\hat\theta\) converging (with probability 1) to \(\theta\). If \(\hat\theta\) is unbiased, \(\mathbb E\hat\theta=\theta\).

The bootstrap estimates \(\mathbb E\hat\theta\) by repeatedly calculating \(\hat\theta\) on new samples from the empirical distribution.

Empirical distribution

Given a data sample \(X_1,\dots,X_N\), the empirical distribution of the sample approximates the distribution of the data source \(X\).

The empirical distribution is the distribution that takes the value \(X_j\) with probability \(1/N\). We can sample from it by taking samples from \(X_1,\dots,X_N\) randomly with replacement.

Bootstrap distribution statistics

Statistics that describe the estimator \(\hat\theta\) can be derived from its empirical distribution using sample statistics on bootstrap samples. Let \(X^1,\dots,X^B\) be bootstrap samples from the empirical distribution: \(X^b = X^b_1,\dots,X^b_N\) drawn randomly with replacement from \(X_1,\dots,X_N\).

\[ \mathbb E\hat\theta\approx \overline\theta = \frac{1}{B}\sum_{b=1}^B\hat\theta(X^b) \qquad \mathbb V\hat\theta \approx \frac{1}{B-1}\sum_{b=1}^B(\hat\theta(X^b)-\overline{\theta})^2 \]
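A numpy sketch of these two estimates, using the median as the illustrative statistic \(\hat\theta\) and B = 1000 resamples (both choices are mine):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200)   # toy data sample
B = 1000

# B bootstrap samples drawn with replacement, statistic evaluated on each
boot_stats = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                       for _ in range(B)])

boot_mean = boot_stats.mean()       # estimates E theta-hat
boot_var = boot_stats.var(ddof=1)   # estimates V theta-hat, with 1/(B-1) normalization
print(boot_mean, boot_var)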

Example: k-Means on Iris

We cluster the iris dataset using k-means, and estimate the total within-group spread. We will use 100 bootstrap samples from the full dataset.
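A sketch of how this could look with scikit-learn; the choice of k = 3 clusters and the use of KMeans' inertia_ as the "total within-group spread" are assumptions made here:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
rng = np.random.default_rng(4)
B = 100

spreads = []
for _ in range(B):
    # bootstrap sample of the observations, drawn with replacement
    idx = rng.integers(0, len(X), len(X))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])
    spreads.append(km.inertia_)   # total within-cluster sum of squares

print(np.mean(spreads), np.std(spreads, ddof=1))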

Estimating \(E_\text{pred}\)

The bootstrap can be used for model selection and model assessment – since these are questions of estimating a statistical quantity. It is especially attractive when a train/test split is too expensive.

First approach. Fit \(f^1,\dots,f^B\) to bootstrapped samples \(X^1,\dots,X^B\). \[ E_\text{boot} = \frac{1}{B}\sum_b \frac{1}{N}\sum_{i} L(f^b(X_i), y_i) \]

\(E_\text{boot}\) uses the full training sample as its test sample. Hence the training of each \(f^b\) shares observations with its test set.

Estimating \(E_\text{pred}\)

Fit \(f^1,\dots,f^B\) to bootstrapped samples \(X^1,\dots,X^B\). \(E_\text{boot}\) shares training and test data for each model \(f^b\).

Instead, try to evaluate each \(f^b\) only on observations not in \(X^b\). Let \(C^{-i}=\{b : X_i\not\in X^b\}\) be bootstrap indices that avoid the \(i\)th sample.

\[ E_\text{boot}^{(1)} = \frac{1}{N}\sum_i \frac{1}{|C^{-i}|}\sum_{b\in C^{-i}} L(f^b(X_i),y_i) \]
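A sketch of \(E_\text{boot}^{(1)}\) with squared error loss; the regressor and dataset are placeholders standing in for an arbitrary model \(f\):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(5)
N, B = len(X), 100

# losses[i] collects L(f^b(X_i), y_i) over the fits b whose bootstrap sample avoided i
losses = [[] for _ in range(N)]
for _ in range(B):
    idx = rng.integers(0, N, N)              # indices of the bootstrap sample X^b
    out = np.setdiff1d(np.arange(N), idx)    # observations not in X^b
    f_b = LinearRegression().fit(X[idx], y[idx])
    for i, pred in zip(out, f_b.predict(X[out])):
        losses[i].append((pred - y[i]) ** 2)

# average over C^{-i} first, then over the observations left out at least once
e_boot1 = np.mean([np.mean(l) for l in losses if l])
print(e_boot1)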

Estimating \(E_\text{pred}\)

Depending on how the models' error rates vary with sample size, \(E_\text{boot}\) and \(E_\text{boot}^{(1)}\) may introduce significant bias.

This can be alleviated by interpolating between \(E_\text{boot}^{(1)}\) and \(E_\text{train}\). The derivation in chapter 7.11 uses the probability of an observation occurring in a bootstrap sample \[ \mathbb P(x_i\in X^b) = 1-\left(1-\frac{1}{N}\right)^N \approx 1-e^{-1} \approx 0.632 \\ \mathbb P(x_i\not\in X^b) = 1-\mathbb P(x_i\in X^b) \approx 0.368 \]
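These probabilities weight the interpolation; in the book's notation, rewritten with the error names used here, this gives the .632 estimator \[ \hat E^{(.632)} = 0.368\, E_\text{train} + 0.632\, E_\text{boot}^{(1)}. \]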

Model parsimony

Cross-validation and bootstrap

One rule suggested by our book is:

Estimate errors \(E^j\) for models \(f^j\) and for these error estimates, calculate the mean \(\overline E^j\) and the standard error \(\text{SE}_{E^j}\).

Let \(k\) be the index minimizing \(\overline E^k\).

Pick the smallest (most parsimonious) model \(j\) with \[ \overline E^j-\text{SE}_{E^j} < \overline E^k+\text{SE}_{E^k}, \] that is, the most parsimonious model whose standard error interval intersects that of the best model.
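A small sketch of this rule, assuming the candidate models are ordered from most to least parsimonious and that each entry of cv_errors holds that model's individual validation-error estimates:

import numpy as np

def one_se_pick(cv_errors):
    # cv_errors: list of 1-d arrays, one per model, ordered by increasing complexity
    means = np.array([e.mean() for e in cv_errors])
    ses = np.array([e.std(ddof=1) / np.sqrt(len(e)) for e in cv_errors])
    k = means.argmin()   # index of the best (lowest mean error) model
    # most parsimonious model whose error interval intersects that of the best model
    candidates = np.nonzero(means - ses < means[k] + ses[k])[0]
    return candidates.min()

# toy example: three models, five error estimates each
rng = np.random.default_rng(6)
errors = [rng.normal(m, 0.1, 5) for m in (1.0, 0.8, 0.79)]
print(one_se_pick(errors))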

In-sample methods

By focusing on new observations at the known inputs, we can compare the training error to an expected prediction error while controlling for variation in the inputs, which makes the analysis easier.

We define the optimism \(\text{op}=E_\text{in}-E_\text{train}\). This quantity is still difficult to estimate, but for a large range of loss functions (including squared error & 0-1), the expected optimism is \[ \mathbb E\,\text{op} = \omega = \frac{2}{N}\sum_i \text{Cov}(\hat y_i,y_i) \] Solving for \(E_\text{in}\) and taking expectations, we get \[ \mathbb EE_\text{in} = \mathbb EE_\text{train} + \frac{2}{N}\sum_i \text{Cov}(\hat y_i,y_i) \]

Optimism

For linear methods with \(d\) inputs and additive noise with variance \(\sigma_\epsilon^2\), \[ \sum_i \text{Cov}(\hat y_i,y_i) = d\sigma_\epsilon^2 \\ \mathbb EE_\text{in} = \mathbb EE_\text{train} + 2\frac{d}{N}\sigma_\epsilon^2 \]

Similar formulas exist for other error models.

From this we can derive several model selection criteria.

\(C_p\) statistic

Estimate \(\sigma_\epsilon^2\) with \(\hat\sigma_\epsilon^2\) by training a low-bias model (to isolate the noise) and using its mean squared error as the variance estimate.

\[ C_p = E_\text{train} + 2\frac{d}{N}\hat\sigma_\epsilon^2 \] estimates \(\mathbb EE_\text{in}\).

For models that fit \(d\) parameters under squared error loss, model selection can be done by picking the model that minimizes \(C_p\).
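A sketch of the \(C_p\) computation for nested linear models; taking the full fit as the low-bias model for \(\hat\sigma_\epsilon^2\) and using the first \(d\) columns as the candidates are choices made here for illustration:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
N = len(y)

# low-bias reference model on all inputs; its training MSE estimates the noise variance
full = LinearRegression().fit(X, y)
sigma2_hat = np.mean((y - full.predict(X)) ** 2)

for d in (2, 5, 10):   # candidate models using the first d inputs
    model = LinearRegression().fit(X[:, :d], y)
    e_train = np.mean((y - model.predict(X[:, :d])) ** 2)
    cp = e_train + 2 * d / N * sigma2_hat
    print(d, cp)
# pick the model with the smallest C_p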

The Akaike Information Criterion

More generally applicable is the Akaike Information Criterion (AIC). This applies when using a log-likelihood loss function.

\[ -2\,\mathbb E[\log\mathbb P(Y|\hat\theta)] \approx -\frac{2}{N}\mathbb E\left[\sum_i\log\mathbb P(y_i|\hat\theta)\right] + 2\frac{d}{N} \\ \text{AIC} = -\frac{2}{N}\sum_i\log\mathbb P(y_i|\hat\theta) + 2\frac{d}{N} \\ \text{AIC}(\alpha) = E_\text{train}(\alpha)+2\frac{d(\alpha)}{N}\hat\sigma_\epsilon^2 \] where the last formula is for a family of models indexed by a tuning parameter \(\alpha\).

The Bayesian Information Criterion

\[ \text{BIC} = -2\sum_i\log\mathbb P(y_i|\hat\theta) + d\log N \]

The Bayesian Information Criterion (BIC) emerges from approximating the model evidence that goes into the posterior odds for comparing two models in a Bayesian setting: \[ \text{odds} = \frac{\mathbb P(M_m|X)}{\mathbb P(M_\ell|X)} = \frac{\mathbb P(M_m)}{\mathbb P(M_\ell)}\cdot \frac{\mathbb P(X|M_m)}{\mathbb P(X|M_\ell)} \\ \log\mathbb P(X|M_m) \approx \log\mathbb P(X|\hat\theta_m,M_m)-\frac{d_m}{2}\log N+\text{constant} \] So with log-likelihood loss, \(-\text{BIC}/2\) approximates \(\log\mathbb P(X|M_m)\) well enough to calculate the posterior odds.

The Bayesian Information Criterion

\[ \text{BIC} = -2\sum_i\log\mathbb P(y_i|\hat\theta) + d\log N \]

The Bayesian viewpoint produces posterior probabilities – estimates of the probability of a particular model being the best explanation of the observed data – using BIC: \[ \mathbb P(M_m|X) \approx \frac{\exp\left[-\text{BIC}(M_m)/2\right]} {\sum_k\exp\left[-\text{BIC}(M_k)/2\right]} \]
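A small sketch turning hypothetical BIC values for three candidate models into these approximate posterior probabilities:

import numpy as np

bic = np.array([1012.3, 1004.8, 1009.1])   # hypothetical BIC values

# subtract the minimum before exponentiating, for numerical stability
rel = -(bic - bic.min()) / 2
posterior = np.exp(rel) / np.exp(rel).sum()
print(posterior)   # approximate P(M_m | X) for each candidate model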

As \(N\to\infty\), BIC will choose the correct model, whereas AIC will choose too complex models.

For finite \(N\), however, BIC tends to pick models that are too simple.