Recall from the reading: at the core of Machine Learning is the Bias-Variance tradeoff: Mean Squared Error decomposes as $$ MSE = \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{V}[\hat{f}(x)] + \textrm{Bias}(\hat{f}(x))^2 + \mathbb{V}[\epsilon] $$
In other words, the MSE is produced by the variance of the predictor, the squared bias of the predictor, and the variance of the error terms. Writing $y = f(x) + \epsilon$, bias here means the deviation $\mathbb{E}[\hat{f}(x)] - f(x)$ of the average prediction from the true function value.
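For completeness, here is a sketch of the derivation, assuming $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ independent of $\hat{f}(x)$: $$ \mathbb{E}[(y-\hat{f}(x))^2] = \mathbb{E}[(f(x)-\hat{f}(x))^2] + \mathbb{V}[\epsilon] = \mathbb{V}[\hat{f}(x)] + \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2 + \mathbb{V}[\epsilon] $$ The first step expands the square and uses $\mathbb{E}[\epsilon] = 0$; the second adds and subtracts $\mathbb{E}[\hat{f}(x)]$ inside the square.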
The trade-off arises because models with extremely low bias often have high variance, and models with extremely low variance often have high bias.
To pick a model, or a model parameter, we have the following process:
1. Split the data into a training set and a validation set.
2. Use the training set to train the model.
3. Evaluate the model on the validation set.
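A minimal sketch of one round of this process (assuming the titanic data plus the cleanup transformer and model used in the examples below):

from sklearn import model_selection

# hold out 25% of the rows as a validation set
train, val = model_selection.train_test_split(titanic, test_size=0.25)
X, y = cleanup.fit_transform(train), train["Survived"]   # fit preprocessing on the training rows only
X_val, y_val = cleanup.transform(val), val["Survived"]
model.fit(X, y)               # train on the training set
model.score(X_val, y_val)     # accuracy on the validation set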
As several of you have noticed, this validation evaluation can jump around a lot.
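The spread can be seen by repeating the split-train-evaluate loop; a sketch (same assumed names as above) that collects 100 scores:

scores = []
for i in range(100):
    # a fresh random split each time
    train, val = model_selection.train_test_split(titanic, test_size=0.25)
    model.fit(cleanup.fit_transform(train), train["Survived"])
    scores.append(model.score(cleanup.transform(val), val["Survived"]))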
hist(scores)
(Histogram of the validation scores: a roughly bell-shaped spread from about 0.74 to 0.87.)
A second issue is that the validation set is excluded from use for training: the dataset is artificially shrunk.
We will talk about three different ways to get around these issues: leave-one-out cross-validation, $k$-fold cross-validation, and the bootstrap.
This follows Chapter 5 in the textbook (pp 175-200)
The largest set of training data we can use while still validating is $N-1$: leave one single data point out, train on the remaining data and evaluate on that one point.
For each single point, our estimate will be poor, but averaging the results over every point in the data gives a much better estimate of the validation error.
loo = model_selection.LeaveOneOut()
scores = []
for ix_train, ix_val in loo.split(titanic):
    # LeaveOneOut yields positional indices, so index with .iloc
    X = cleanup.fit_transform(titanic.iloc[ix_train, :])
    y = titanic["Survived"].iloc[ix_train]
    X_val = cleanup.transform(titanic.iloc[ix_val, :])
    y_val = titanic["Survived"].iloc[ix_val]
    model.fit(X, y)
    scores.append(model.score(X_val, y_val))
mean(scores)
0.7934904601571269
Rather than going through the entire pipeline separately for each data point, a much more economical approach is $k$-fold Cross Validation:
Leave-one-out is an $N$-fold cross validation.
Much more common choices are $k=5$ or $k=10$.
Each training set has $(k-1)n/k$ observations, but each observation is used in $k-1$ of the training sets.
total_pipeline = pipeline.make_pipeline(cleanup, model)
scores = model_selection.cross_val_score(total_pipeline,
                                         titanic,              # X
                                         titanic["Survived"],  # y
                                         cv=10)
scores
array([0.78888889, 0.79775281, 0.76404494, 0.80898876, 0.78651685, 0.76404494, 0.78651685, 0.78651685, 0.80898876, 0.83146067])
Our data can be seen as an approximation of a true distribution from which the data is sampled.
Since the empirical distribution in the data approximates the true distribution, sampling from the data is approximately the same as sampling from the true distribution.
So if we want to estimate some statistical value, with information about the uncertainty in the estimate, we can simulate repeated experiments by sampling from the data.
The bootstrap is not implemented directly in sklearn, but can be built using sklearn.utils.resample or df.sample in a loop.
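For instance, a minimal sketch of bootstrapping the survival rate and its uncertainty (assuming the titanic data from above):

from sklearn import utils

boot_rates = []
for b in range(1000):
    # each resample draws len(titanic) rows with replacement: one simulated experiment
    boot = utils.resample(titanic, replace=True, n_samples=len(titanic))
    boot_rates.append(boot["Survived"].mean())
mean(boot_rates), std(boot_rates)   # the estimate and its uncertainty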
Let's pick one of several options for the titanic task:
model1 = linear_model.LogisticRegression(solver="lbfgs")
model2 = neighbors.KNeighborsClassifier(5)
model3 = neighbors.KNeighborsClassifier(10)
pipe1 = pipeline.make_pipeline(cleanup, model1)
pipe2 = pipeline.make_pipeline(cleanup, model2)
pipe3 = pipeline.make_pipeline(cleanup, model3)
boxplot(
    c_[model_selection.cross_val_score(pipe1, titanic, titanic["Survived"], cv=10),
       model_selection.cross_val_score(pipe2, titanic, titanic["Survived"], cv=10),
       model_selection.cross_val_score(pipe3, titanic, titanic["Survived"], cv=10),
    ])
gca().set_xticklabels(["logreg","5-nn","10-nn"])
ylabel("accuracy")
Where decision trees really shine is when they are combined.
Ensemble Learning methods combine results from several simple models to produce a more competent model.
We distinguish two main types: averaging methods such as bagging and random forests, which fit many models independently and combine their predictions, and boosting methods, which fit models sequentially, each correcting the ones before it.
Notice that the Decision Tree has a much larger variance in cross validation.
We know that if $Z_1,\dots,Z_N$ are iid with variance $\sigma^2$, then $$ \frac{1}{N}\sum_{j=1}^N Z_j\quad\text{has variance}\quad\sigma^2/N $$
Since the problem with Decision Trees is the high variance, we could try averaging several Decision Trees to produce a lower variance prediction.
Since Decision Trees have a very low bias, that low bias is retained when averaging.
We usually won't have access to many training sets to produce many different Decision Trees.
But we know from the Bootstrap process that by treating the data as an empirical distribution, we can simulate having many training sets.
This gives us bagging: draw $B$ bootstrap samples $X^1,\dots,X^B$ from the training data, fit a model $\hat{f}^{*b}$ to each sample, and average the predictions: $$ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x) $$
Using many trees is not in itself a risk of overfitting; in practice, using enough trees that the error settles is good enough. Often $B=100$ strikes a good balance between computational load and error reduction.
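A sketch with sklearn's BaggingClassifier (assuming the cleanup transformer, the titanic data, and the pipeline and model_selection imports from the earlier examples), bagging $B=100$ trees:

from sklearn import ensemble, tree

bagged = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(),
                                    n_estimators=100)   # B = 100 bootstrapped trees
bag_pipe = pipeline.make_pipeline(cleanup, bagged)
model_selection.cross_val_score(bag_pipe, titanic, titanic["Survived"], cv=10).mean()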
When bagging, there is a trick that allows us to estimate validation error without cross-validation: out-of-bag validation (OOB)
For each component model $\hat{f}^{*b}$, use $X\setminus X^b$ as a validation set. For a full validation score, average the scores from the components.
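In sklearn this is available via the oob_score flag. A sketch, assuming preprocessed features X = cleanup.fit_transform(titanic) and labels y = titanic["Survived"]:

from sklearn import ensemble, tree

bagged = ensemble.BaggingClassifier(tree.DecisionTreeClassifier(),
                                    n_estimators=100,
                                    oob_score=True)   # score each row using only trees that did not see it
bagged.fit(X, y)
bagged.oob_score_   # OOB accuracy, with no separate validation set needed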
Bagged decision trees end up correlated with each other. This weakness can be handled by explicitly decorrelating the models. This adjustment produces the random forest models.
Usually, the subset $I_b$ of predictors considered at each split has size $|I_b|\approx\sqrt{p}$. Decreasing this size can help with data that has many highly correlated predictors.
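A sketch with sklearn's RandomForestClassifier (same assumed pipeline pieces as above); max_features="sqrt" gives the $|I_b|\approx\sqrt{p}$ rule:

from sklearn import ensemble

forest = ensemble.RandomForestClassifier(n_estimators=100,
                                         max_features="sqrt")  # consider ~sqrt(p) predictors per split
forest_pipe = pipeline.make_pipeline(cleanup, forest)
model_selection.cross_val_score(forest_pipe, titanic, titanic["Survived"], cv=10).mean()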
With boosting, the plan is to build each model to predict the residuals from the previous models, combining them with a scaling factor $\lambda$ to keep the change slow. For regression, it looks like this:
1. Set $\hat{f}(x) = 0$ and residuals $r_i = y_i$.
2. For $b = 1,\dots,B$: fit a model $\hat{f}^b$ to the residuals, update $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)$, and update $r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)$.
The final model is $\sum\lambda\hat{f}^b$.
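A minimal sketch of this loop for regression, using shallow trees as base learners (the names X and y are assumed numeric feature and target arrays, not part of the original):

import numpy as np
from sklearn import tree

def boost_fit(X, y, B=100, lam=0.01):
    residuals = np.asarray(y, dtype=float).copy()
    models = []
    for b in range(B):
        f_b = tree.DecisionTreeRegressor(max_depth=2)
        f_b.fit(X, residuals)                 # fit to the current residuals
        residuals -= lam * f_b.predict(X)     # shrink each update by lambda
        models.append(f_b)
    return models

def boost_predict(models, X, lam=0.01):
    return lam * sum(f_b.predict(X) for f_b in models)   # final model: sum of lambda * f^b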
One of the earlier boosting algorithms, AdaBoost (Adaptive Boosting), adjusts $\lambda$ from step to step, adapting the speed of learning to the current state of the predictor and the data.
AdaBoost is (according to Wikipedia) considered the best out-of-the-box classifier: best performance without tweaking and adjustments.
Boosting can be phrased as an optimization problem. Performing gradient descent on this optimization problem produces gradient boosting.
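In sklearn this is available as, for example, GradientBoostingClassifier (a sketch, with the same assumed pipeline pieces as in the earlier examples):

from sklearn import ensemble

gb = ensemble.GradientBoostingClassifier(n_estimators=100,
                                         learning_rate=0.1)   # the shrinkage factor lambda from above
gb_pipe = pipeline.make_pipeline(cleanup, gb)
model_selection.cross_val_score(gb_pipe, titanic, titanic["Survived"], cv=10).mean()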
One extremely popular implementation and adaptation is XGBoost: it runs on all major machine learning platforms and regularly appears among winners of Kaggle competitions.
scikit-learn
Classifiers
- sklearn.tree.DecisionTreeClassifier
- sklearn.ensemble.RandomForestClassifier
- sklearn.ensemble.BaggingClassifier - takes a classifier as its first argument
- sklearn.ensemble.AdaBoostClassifier
- sklearn.ensemble.GradientBoostingClassifier
- xgboost.XGBClassifier
Regressors
- sklearn.tree.DecisionTreeRegressor
- sklearn.ensemble.RandomForestRegressor
- sklearn.ensemble.BaggingRegressor - takes a regressor as its first argument
- sklearn.ensemble.AdaBoostRegressor
- sklearn.ensemble.GradientBoostingRegressor
- xgboost.XGBRegressor