Bias-Variance Tradeoff

Recall from the reading that at the core of Machine Learning lies the Bias-Variance tradeoff: the expected Mean Squared Error decomposes as $$ \text{MSE} = \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{V}[\hat{f}(x)] + \textrm{Bias}(\hat{f}(x))^2 + \mathbb{V}[\epsilon] $$

In other words, the MSE is the sum of the variance of the predictor, the squared bias of the predictor, and the variance of the error term. Bias here means the expected deviation $\mathbb{E}[\hat{f}(x)] - f(x)$ of the prediction from the true underlying value $f(x)$.
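
To see where this comes from, assume $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ independent of $\hat{f}(x)$. Then $$ \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{E}[(f(x)-\hat{f}(x))^2] + \mathbb{V}[\epsilon] = \mathbb{V}[\hat{f}(x)] + \left(\mathbb{E}[\hat{f}(x)]-f(x)\right)^2 + \mathbb{V}[\epsilon] $$ where the cross term $2\,\mathbb{E}[(f(x)-\hat{f}(x))\,\epsilon]$ vanishes because $\epsilon$ is independent of $\hat{f}(x)$ and has mean zero.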

The trade-off lies in how models with extremely low bias often have high variance, and extremely low variance models often have high bias.

Validation

To pick a model, or a model parameter, we have the following process:

  1. Split data into train and validation
  2. Fit model to train
  3. Evaluate model on validation

As several of you have noticed, this validation evaluation can jump around a lot.
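
As a sketch of how such a collection of scores can be produced: repeat the split-train-evaluate process many times and collect the validation scores. (Here cleanup, model and titanic are the preprocessor, classifier and DataFrame used in the cells below; the 100 repetitions and the 25% split are arbitrary choices.)

from sklearn import model_selection

scores = []
for _ in range(100):                                        # repeat split-train-evaluate
    train, val = model_selection.train_test_split(titanic, test_size=0.25)
    X, y = cleanup.fit_transform(train), train["Survived"]  # fit preprocessor on train only
    X_val, y_val = cleanup.transform(val), val["Survived"]  # transform the validation set
    model.fit(X, y)
    scores.append(model.score(X_val, y_val))                # accuracy on the validation set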

In [3]:
hist(scores)
Out[3]:
(array([ 6.,  8.,  9., 13., 22., 19., 16.,  5.,  1.,  1.]),
 array([0.73542601, 0.74932735, 0.7632287 , 0.77713004, 0.79103139,
        0.80493274, 0.81883408, 0.83273543, 0.84663677, 0.86053812,
        0.87443946]),
 <a list of 10 Patch objects>)

A second issue is that the validation set is excluded from training: the effective training set is artificially shrunk.

Cross validation and bootstrap

We will talk about three different ways to get around these issues:

  1. Leave-One-Out Cross-Validation
  2. $k$-fold Cross-Validation
  3. Bootstrap

This follows Chapter 5 in the textbook (pp 175-200)

Leave-one-out

The largest set of training data we can use while still validating is $N-1$: leave one single data point out, train on the remaining data and evaluate on that one point.

For each single point, this estimate will be poor -- but by doing this for every point in the data separately and averaging, we get a much better estimate of the validation error.

  1. For $i=1..N$
    1. Split data into $T = \{(x_1,y_1),\dots,(x_{i-1},y_{i-1}),(x_{i+1},y_{i+1}),\dots,(x_N,y_N)\}$ and $V = \{(x_i,y_i)\}$
    2. Train preprocessor and model on $T$.
    3. Preprocess, predict and score on $V$, producing a score $s_i$.
  2. Return $\hat{s}=\frac{1}{N}\sum s_i$.

Benefits

  1. Lower bias than using a validation set
  2. Training set is almost identical between tries -- stable models
  3. Deterministic

Drawbacks

  1. Computationally expensive; model fit runs $N$ times
In [4]:
loo = model_selection.LeaveOneOut()
scores = []
for ix_train, ix_val in loo.split(titanic):
  X = cleanup.fit_transform(titanic.loc[ix_train,:])
  y = titanic.loc[ix_train, "Survived"]
  X_val = cleanup.transform(titanic.loc[ix_val,:])
  y_val = titanic.loc[ix_val, "Survived"]
  model.fit(X, y)
  scores.append(model.score(X_val, y_val))
In [5]:
mean(scores)
Out[5]:
0.7934904601571269

$k$-fold Cross Validation

Rather than going through the entire pipeline separately for each data point, a much more economical approach is $k$-fold Cross Validation:

  1. Split the data into $k$ folds $F_1,\dots,F_k$.
  2. For each fold $F_i$:
    1. Train preprocessor and model on $\bigcup_{j\neq i}F_j$
    2. Preprocess, predict and score $F_i$, producing a score $s_i$
  3. Return $\hat{s} = \frac{\sum s_i}{k}$

Leave-one-out is an $N$-fold cross validation.

Much more common choices are $k=5$ or $k=10$.

Each training set has $(k-1)n/k$ observations, but each observation is used in $k-1$ of the training sets.

Benefits

  1. Uses the entire training set, unlike Validation
  2. Computationally more efficient than Leave-one-out

Drawbacks

  1. The process is randomized (but averaging stabilizes the result).
  2. Uses less training data in each fit than leave-one-out -- produces a larger variation of models.
In [6]:
total_pipeline = pipeline.make_pipeline(cleanup, model)
scores = model_selection.cross_val_score(total_pipeline,
                                         titanic,             # X
                                         titanic["Survived"], # y
                                         cv=10)
scores
Out[6]:
array([0.78888889, 0.79775281, 0.76404494, 0.80898876, 0.78651685,
       0.76404494, 0.78651685, 0.78651685, 0.80898876, 0.83146067])

Bootstrap

Our data can be seen as an approximation of a true distribution from which the data is sampled.

Since the empirical distribution in the data approximates the true distribution, sampling from the data is approximately the same as sampling from the true distribution.

So if we want to estimate some statistical value, with information about the uncertainty in the estimate, we can simulate repeated experiments by sampling from the data.

  1. For $b=1..B$
    1. Sample $X^{*b}$ from $X$ with replacement and equal size
    2. Calculate the statistical value $\hat\alpha^{*b}$
  2. Compute $$ \hat{\alpha} = \frac{1}{B}\sum\hat\alpha^{*b}\qquad \text{SE}_{B}(\hat\alpha) = \sqrt{\frac{1}{B-1}\sum\left(\hat\alpha^{*b}-\hat\alpha\right)^2} $$ Or use the collection $\{\hat\alpha^{*b}\}$ to analyze the sampling distribution directly.

The bootstrap is not implemented directly in sklearn, but can be built using sklearn.utils.resample or df.sample in a loop.
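
A minimal sketch of this, estimating the survival rate in the titanic DataFrame from the earlier cells together with its bootstrap standard error. (The choice of statistic and $B=1000$ are illustrative assumptions.)

import numpy as np
from sklearn.utils import resample

B = 1000
alpha_star = []
for b in range(B):
    boot = resample(titanic)                    # sample with replacement, same size as the data
    alpha_star.append(boot["Survived"].mean())  # the statistic alpha-hat on this bootstrap sample

alpha_hat = np.mean(alpha_star)                 # bootstrap estimate
se_hat = np.std(alpha_star, ddof=1)             # bootstrap standard error, 1/(B-1) inside the square root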

So how do we pick a model?

Let's pick one of several options for the titanic task:

In [7]:
model1 = linear_model.LogisticRegression(solver="lbfgs")
model2 = neighbors.KNeighborsClassifier(5)
model3 = neighbors.KNeighborsClassifier(10)
pipe1 = pipeline.make_pipeline(cleanup, model1)
pipe2 = pipeline.make_pipeline(cleanup, model2)
pipe3 = pipeline.make_pipeline(cleanup, model3)

boxplot(
 c_[model_selection.cross_val_score(pipe1, titanic, titanic["Survived"], cv=10),
    model_selection.cross_val_score(pipe2, titanic, titanic["Survived"], cv=10),
    model_selection.cross_val_score(pipe3, titanic, titanic["Survived"], cv=10),
])
gca().set_xticklabels(["logreg","5-nn","10-nn"])
ylabel("accuracy")
Out[7]:
Text(0, 0.5, 'accuracy')

Ensemble Learning

Where decision trees really shine is when they are combined.

Ensemble Learning methods combine results from several simple models to produce a more competent model.

We distinguish two main types:

  • Bagging: use bootstrap and averaging over several decision trees to reduce variance
  • Boosting: repeatedly fit decision trees to the residuals of an intermediate model, slowly improving the full model

Bagging

Notice that the Decision Tree has a much larger variance in cross validation.

In [5]:
fig
Out[5]:
[figure: cross-validation score comparison including a Decision Tree; image not shown]

We know that if $Z_1,\dots,Z_N$ are iid with variance $\sigma^2$, then $$ \frac{1}{N}\sum Z_j\quad\text{has variance}\quad\sigma^2/N $$

Since the problem with Decision Trees is the high variance, we could try averaging several Decision Trees to produce a lower variance prediction.

Since Decision Trees have a very low bias, that low bias is retained when averaging.


We usually won't have access to many training sets to produce many different Decision Trees.

But we know from the Bootstrap process that by treating the data as an empirical distribution, we can simulate having many training sets.

This gives us bagging:

  1. Generate $B$ bootstrap samples $X^1,\dots,X^B$ from $X$
  2. Train models $\hat{f}^{*1},\dots,\hat{f}^{*B}$ on these samples.
  3. For the bagged model, use: $$ \hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum\hat{f}^{*b}(x) $$

Using many trees is not in itself a risk for overfitting. In practice, using enough trees that the error settles down is good enough; often $B=100$ is a good balance between computational load and error reduction.
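
As a sketch, bagged decision trees are available pre-packaged in scikit-learn; here reusing the cleanup preprocessor and the titanic data from the earlier cells, with $B=100$ trees as suggested above:

from sklearn import ensemble, tree, pipeline, model_selection

# 100 decision trees, each trained on a bootstrap sample, predictions averaged
bag = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(), n_estimators=100)
bag_pipe = pipeline.make_pipeline(cleanup, bag)
model_selection.cross_val_score(bag_pipe, titanic, titanic["Survived"], cv=10).mean()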

Out of bag error estimation

When bagging, there is a trick that allows us to estimate validation error without cross-validation: out-of-bag validation (OOB)

For each component model $\hat{f}^{*b}$, use $X\setminus X^b$ as a validation set. For a full validation score, average the scores from each component.
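
A sketch of out-of-bag scoring with scikit-learn's BaggingClassifier. Since the model is fit only once here (no cross_val_score), the data is preprocessed explicitly with the cleanup transformer from the earlier cells:

from sklearn import ensemble, tree

X = cleanup.fit_transform(titanic)
y = titanic["Survived"]
bag = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(),
                                 n_estimators=100, oob_score=True)
bag.fit(X, y)
bag.oob_score_          # accuracy estimated on the out-of-bag samples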

Random Forests

Bagged decision trees end up correlated with each other, since a few strong predictors tend to dominate the top splits in every tree. This weakness can be handled by explicitly decorrelating the models. This adjustment produces the random forest models.

  1. Generate $B$ bootstrap samples $X^1,\dots,X^B$ from $X$.
  2. Generate random samples $I_1,\dots,I_B$ of the predictors.
  3. Train models $\hat{f}^{*b}$ on the subset $X^b[I_b]$.
  4. For the bagged model use: $$ \hat{f}_{rf}(x) = \frac{1}{B}\sum\hat{f}^{*b}(x[I_b]) $$

Usually, we use $|I_b|\approx\sqrt{p}$, where $p$ is the total number of predictors. Decreasing this size can help with data that has many highly correlated predictors. (In the standard Random Forest algorithm, the random predictor subset is in fact drawn afresh at each split rather than once per tree.)
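
A sketch using scikit-learn's implementation, where max_features="sqrt" corresponds to $|I_b|\approx\sqrt{p}$; cleanup and titanic are the objects from the earlier cells:

from sklearn import ensemble, pipeline, model_selection

rf = ensemble.RandomForestClassifier(n_estimators=100, max_features="sqrt")
rf_pipe = pipeline.make_pipeline(cleanup, rf)
model_selection.cross_val_score(rf_pipe, titanic, titanic["Survived"], cv=10).mean()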

Boosting

With boosting, the plan is to build each model to predict the residuals from the previous models and combine it using a scaling factor to keep the change slow. For regression, it looks like this:

  1. Fit $\hat{f}^1$ to training data $(X,y)$. Set $\hat{f}=\lambda\hat{f}^1$
  2. Calculate residuals $r = y-\lambda\hat{f}^1(X) = y-\hat{f}(X)$.
  3. Fit $\hat{f}^b$ to training data $(X,r)$.
  4. Update predictor $\hat{f}=\hat{f}+\lambda\hat{f}^b$.
  5. Update residuals $r = r - \lambda\hat{f}^b(X)$
  6. Loop back to 3.

The final model is $\sum\lambda\hat{f}^b$.
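
The loop above can be sketched directly, using shallow regression trees as the base models. The synthetic data and the choices $B=100$, $\lambda=0.1$ and depth-2 trees are illustrative assumptions only:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

B, lam = 100, 0.1
trees = []
r = y.copy()                                           # residuals start as the raw targets
for b in range(B):
    t = DecisionTreeRegressor(max_depth=2).fit(X, r)   # fit the next tree to the residuals
    trees.append(t)
    r -= lam * t.predict(X)                            # update residuals with the shrunken new tree

def f_hat(X_new):
    # the final boosted model: the sum of all trees, scaled by lambda
    return lam * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - f_hat(X)) ** 2))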

Commonly used boosting algorithms

AdaBoost

One of the earlier boosting algorithms, AdaBoost (Adaptive Boosting) adjusts $\lambda$ from step to step, adapting the speed of learning to the current state of the predictor and the data.

AdaBoost is (according to Wikipedia) considered the best out-of-the-box classifier: best performance without tweaking and adjustments.
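
A sketch, plugging scikit-learn's AdaBoostClassifier into the same pipeline as the earlier models (cleanup and titanic as before; 100 estimators is an arbitrary choice):

from sklearn import ensemble, pipeline, model_selection

ada_pipe = pipeline.make_pipeline(cleanup, ensemble.AdaBoostClassifier(n_estimators=100))
model_selection.cross_val_score(ada_pipe, titanic, titanic["Survived"], cv=10).mean()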

Gradient Boosting / XGBoost

Boosting can be phrased as an optimization problem. Performing gradient descent on this optimization problem produces gradient boosting.

One extremely popular implementation and adaptation is XGBoost - this runs on all major machine learning platforms and regularly appears among winners of Kaggle competitions.
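
A sketch with scikit-learn's gradient boosting in the same pipeline; if xgboost is installed, xgboost.XGBClassifier() can be dropped in the same way. The parameter values here are arbitrary:

from sklearn import ensemble, pipeline, model_selection

gb = ensemble.GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_pipe = pipeline.make_pipeline(cleanup, gb)
model_selection.cross_val_score(gb_pipe, titanic, titanic["Survived"], cv=10).mean()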


In scikit-learn

Classifiers

  • sklearn.tree.DecisionTreeClassifier
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.ensemble.BaggingClassifier - takes a classifier as its first argument
  • sklearn.ensemble.AdaBoostClassifier
  • sklearn.ensemble.GradientBoostingClassifier
  • xgboost.XGBClassifier

Regressors

  • sklearn.tree.DecisionTreeRegressor
  • sklearn.ensemble.RandomForestRegressor
  • sklearn.ensemble.BaggingRegressor - takes a regressor as its first argument
  • sklearn.ensemble.AdaBoostRegressor
  • sklearn.ensemble.GradientBoostingRegressor
  • xgboost.XGBRegressor