Recall from the reading: at the core of Machine Learning is the Bias-Variance tradeoff: Mean Squared Error decomposes as $$ MSE = \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{V}[\hat{f}(x)] + \textrm{Bias}(\hat{f}(x))^2 + \mathbb{V}[\epsilon] $$
In other words, the MSE is produced by the variance of the predictor, the squared bias of the predictor, and the variance of the error terms. Writing $y = f(x) + \epsilon$, bias here means the deviation $\mathbb{E}[\hat{f}(x)] - f(x)$ of the average prediction from the true function value.
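For completeness, here is a sketch of the derivation, assuming $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ independent of $\hat{f}(x)$: $$ \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{E}[(f(x)-\hat{f}(x))^2] + \mathbb{V}[\epsilon] = \mathbb{V}[\hat{f}(x)] + \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2 + \mathbb{V}[\epsilon] $$ The first step expands the square and uses $\mathbb{E}[\epsilon] = 0$; the second adds and subtracts $\mathbb{E}[\hat{f}(x)]$ inside the square.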
The trade-off arises because models with extremely low bias often have high variance, and models with extremely low variance often have high bias.
To pick a model, or a model parameter, we have the following process:
1. Split the data into a training set and a validation set.
2. Use the training set to train the model.
3. Evaluate the model on the validation set.
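A minimal sketch of one round of this process (assuming the titanic data plus the cleanup transformer and model used in the examples below):

from sklearn import model_selection

# hold out 25% of the rows as a validation set
train, val = model_selection.train_test_split(titanic, test_size=0.25)
X, y = cleanup.fit_transform(train), train["Survived"]   # fit preprocessing on the training rows only
X_val, y_val = cleanup.transform(val), val["Survived"]
model.fit(X, y)               # train on the training set
model.score(X_val, y_val)     # accuracy on the validation set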
As several of you have noticed, this validation evaluation can jump around a lot.
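The spread can be seen by repeating the split-train-evaluate loop; a sketch (same assumed names as above) that collects 100 scores:

scores = []
for i in range(100):
    # a fresh random split each time
    train, val = model_selection.train_test_split(titanic, test_size=0.25)
    model.fit(cleanup.fit_transform(train), train["Survived"])
    scores.append(model.score(cleanup.transform(val), val["Survived"]))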
hist(scores)
(Histogram of the validation scores: a roughly bell-shaped spread from about 0.74 to 0.87.)
A second issue is that the validation set is excluded from use for training: the dataset is artificially shrunk.
We will talk about three different ways to get around these issues: leave-one-out cross-validation, $k$-fold cross-validation, and the bootstrap.
This follows Chapter 5 in the textbook (pp 175-200)
The largest set of training data we can use while still validating is $N-1$: leave one single data point out, train on the remaining data and evaluate on that one point.
For each single point, our estimate will be poor, but averaging the results over every point in the data gives a much better estimate of the validation error.
loo = model_selection.LeaveOneOut()
scores = []
for ix_train, ix_val in loo.split(titanic):
    # LeaveOneOut yields positional indices, so index with .iloc
    X = cleanup.fit_transform(titanic.iloc[ix_train, :])
    y = titanic["Survived"].iloc[ix_train]
    X_val = cleanup.transform(titanic.iloc[ix_val, :])
    y_val = titanic["Survived"].iloc[ix_val]
    model.fit(X, y)
    scores.append(model.score(X_val, y_val))
mean(scores)
0.7934904601571269
Rather than going through the entire pipeline separately for each data point, a much more economical approach is $k$-fold Cross Validation:
Leave-one-out is an $N$-fold cross validation.
Much more common choices are $k=5$ or $k=10$.
Each training set has $(k-1)n/k$ observations, but each observation is used in $k-1$ of the training sets.
total_pipeline = pipeline.make_pipeline(cleanup, model)
scores = model_selection.cross_val_score(total_pipeline,
                                         titanic,              # X
                                         titanic["Survived"],  # y
                                         cv=10)
scores
array([0.78888889, 0.79775281, 0.76404494, 0.80898876, 0.78651685, 0.76404494, 0.78651685, 0.78651685, 0.80898876, 0.83146067])
Our data can be seen as an approximation of a true distribution from which the data is sampled.
Since the empirical distribution in the data approximates the true distribution, sampling from the data is approximately the same as sampling from the true distribution.
So if we want to estimate some statistical value, with information about the uncertainty in the estimate, we can simulate repeated experiments by sampling from the data.
The bootstrap is not implemented directly in sklearn, but can be built using sklearn.utils.resample or df.sample in a loop.
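For instance, a minimal sketch of bootstrapping the survival rate and its uncertainty (assuming the titanic data from above):

from sklearn import utils

boot_rates = []
for b in range(1000):
    # each resample draws len(titanic) rows with replacement: one simulated experiment
    boot = utils.resample(titanic, replace=True, n_samples=len(titanic))
    boot_rates.append(boot["Survived"].mean())
mean(boot_rates), std(boot_rates)   # the estimate and its uncertainty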
Let's pick one of several options for the titanic task:
model1 = linear_model.LogisticRegression(solver="lbfgs")
model2 = neighbors.KNeighborsClassifier(5)
model3 = neighbors.KNeighborsClassifier(10)
pipe1 = pipeline.make_pipeline(cleanup, model1)
pipe2 = pipeline.make_pipeline(cleanup, model2)
pipe3 = pipeline.make_pipeline(cleanup, model3)
boxplot(
    c_[model_selection.cross_val_score(pipe1, titanic, titanic["Survived"], cv=10),
       model_selection.cross_val_score(pipe2, titanic, titanic["Survived"], cv=10),
       model_selection.cross_val_score(pipe3, titanic, titanic["Survived"], cv=10),
    ])
gca().set_xticklabels(["logreg","5-nn","10-nn"])
ylabel("accuracy")
Where decision trees really shine is when they are combined.
Ensemble Learning methods combine results from several simple models to produce a more competent model.
We distinguish two main types: averaging methods such as bagging and random forests, which fit many models independently and combine their predictions, and boosting methods, which fit models sequentially, each correcting the ones before it.
Notice that the Decision Tree has a much larger variance in cross validation.
We know that if $Z_1,\dots,Z_N$ are iid with variance $\sigma^2$, then $$ \frac{1}{N}\sum_{j=1}^N Z_j\quad\text{has variance}\quad\sigma^2/N $$
Since the problem with Decision Trees is the high variance, we could try averaging several Decision Trees to produce a lower variance prediction.
Since Decision Trees have a very low bias, that low bias is retained when averaging.
We usually won't have access to many training sets to produce many different Decision Trees.
But we know from the Bootstrap process that by treating the data as an empirical distribution, we can simulate having many training sets.
This gives us bagging: draw $B$ bootstrap samples $X^1,\dots,X^B$ from the training data, fit a model $\hat{f}^{*b}$ to each sample, and average the predictions: $$ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x) $$
Using many trees is not in itself a risk of overfitting; in practice, using enough trees that the error settles is good enough. Often $B=100$ strikes a good balance between computational load and error reduction.
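A sketch with sklearn's BaggingClassifier (assuming the cleanup transformer, the titanic data, and the pipeline and model_selection imports from the earlier examples), bagging $B=100$ trees:

from sklearn import ensemble, tree

bagged = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(),
                                    n_estimators=100)   # B = 100 bootstrapped trees
bag_pipe = pipeline.make_pipeline(cleanup, bagged)
model_selection.cross_val_score(bag_pipe, titanic, titanic["Survived"], cv=10).mean()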
When bagging, there is a trick that allows us to estimate validation error without cross-validation: out-of-bag validation (OOB)
For each component model $\hat{f}^{*b}$, use $X\setminus X^b$ as a validation set. For a full validation score, average the scores from the components.
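In sklearn this is available via the oob_score flag. A sketch, assuming preprocessed features X = cleanup.fit_transform(titanic) and labels y = titanic["Survived"]:

from sklearn import ensemble, tree

bagged = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(),
                                    n_estimators=100,
                                    oob_score=True)   # score each row using only trees that did not see it
bagged.fit(X, y)
bagged.oob_score_   # OOB accuracy, with no separate validation set needed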
Bagged decision trees end up correlated with each other. This weakness can be handled by explicitly decorrelating the models. This adjustment produces the random forest models.
Usually, the subset $I_b$ of predictors considered at each split has size $|I_b|\approx\sqrt{p}$. Decreasing this size can help with data that has many highly correlated predictors.
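A sketch with sklearn's RandomForestClassifier (same assumed pipeline pieces as above); max_features="sqrt" gives the $|I_b|\approx\sqrt{p}$ rule:

from sklearn import ensemble

forest = ensemble.RandomForestClassifier(n_estimators=100,
                                         max_features="sqrt")  # consider ~sqrt(p) predictors per split
forest_pipe = pipeline.make_pipeline(cleanup, forest)
model_selection.cross_val_score(forest_pipe, titanic, titanic["Survived"], cv=10).mean()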
With boosting, the plan is to build each model to predict the residuals from the previous models, combining them with a scaling factor $\lambda$ to keep the change slow. For regression, it looks like this:
1. Set $\hat{f}(x) = 0$ and residuals $r_i = y_i$.
2. For $b = 1,\dots,B$: fit a model $\hat{f}^b$ to the residuals, update $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)$, and update $r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)$.
The final model is $\sum\lambda\hat{f}^b$.
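A minimal sketch of this loop for regression, using shallow trees as base learners (the names X and y are assumed numeric feature and target arrays, not part of the original):

import numpy as np
from sklearn import tree

def boost_fit(X, y, B=100, lam=0.01):
    residuals = np.asarray(y, dtype=float).copy()
    models = []
    for b in range(B):
        f_b = tree.DecisionTreeRegressor(max_depth=2)
        f_b.fit(X, residuals)                 # fit to the current residuals
        residuals -= lam * f_b.predict(X)     # shrink each update by lambda
        models.append(f_b)
    return models

def boost_predict(models, X, lam=0.01):
    return lam * sum(f_b.predict(X) for f_b in models)   # final model: sum of lambda * f^b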
One of the earlier boosting algorithms, AdaBoost (Adaptive Boosting), adjusts $\lambda$ from step to step, adapting the speed of learning to the current state of the predictor and the data.
AdaBoost is (according to Wikipedia) considered the best out-of-the-box classifier: best performance without tweaking and adjustments.
Boosting can be phrased as an optimization problem. Performing gradient descent on this optimization problem produces gradient boosting.
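In sklearn this is available as, for example, GradientBoostingClassifier (a sketch, with the same assumed pipeline pieces as in the earlier examples):

from sklearn import ensemble

gb = ensemble.GradientBoostingClassifier(n_estimators=100,
                                         learning_rate=0.1)   # the shrinkage factor lambda from above
gb_pipe = pipeline.make_pipeline(cleanup, gb)
model_selection.cross_val_score(gb_pipe, titanic, titanic["Survived"], cv=10).mean()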
One extremely popular implementation and adaptation is XGBoost: it runs on all major machine learning platforms and regularly appears among winners of Kaggle competitions.
scikit-learn
Classifiers
- sklearn.tree.DecisionTreeClassifier
- sklearn.ensemble.RandomForestClassifier
- sklearn.ensemble.BaggingClassifier - takes a classifier as its first argument
- sklearn.ensemble.AdaBoostClassifier
- sklearn.ensemble.GradientBoostingClassifier
- xgboost.XGBClassifier
Regressors
- sklearn.tree.DecisionTreeRegressor
- sklearn.ensemble.RandomForestRegressor
- sklearn.ensemble.BaggingRegressor - takes a regressor as its first argument
- sklearn.ensemble.AdaBoostRegressor
- sklearn.ensemble.GradientBoostingRegressor
- xgboost.XGBRegressor