Stacking

If several ML models perform badly on different parts of the dataset, stacking can combine their strengths and improve overall performance.

Stacking is complex to implement and may produce only small improvements -- so it is not common in production.

It does, however, very often produce some improvement -- which makes it very common on Kaggle.

In [15]:
display(SVG(data=svg.stdout))
[Diagram: data feeds into model1, model2, and model3; their predictions feed into the metamodel, which produces the final prediction.]

How to train a stack

The basic idea is to use the outputs of several models as features for a meta model that combines their predictions.

To avoid overfitting, each meta-model feature used in training is created as an out-of-fold (validation) prediction -- see the sketch after these steps:

  1. Split data into $k$ cross-validation folds data.1, ..., data.k
  2. To produce predictions for the input in a fold data.i, fit all models against all other folds, then predict on the validation fold data.i
  3. Train the meta model using the validation fold predictions from all models -- optionally also other features from the data
  4. Retrain all models on the full training data
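
A minimal sketch of these four steps, using cross_val_predict from scikit-learn to produce the out-of-fold predictions; the dataset and base models here are placeholders chosen just for illustration:

import numpy as np
from sklearn import datasets, ensemble, linear_model, model_selection, svm

X, y = datasets.load_iris(return_X_y=True)
base_models = [svm.SVC(probability=True), ensemble.RandomForestClassifier(n_estimators=10)]

# Steps 1-2: out-of-fold predicted probabilities from each base model, via 5-fold CV
meta_features = np.column_stack([
    model_selection.cross_val_predict(m, X, y, cv=5, method="predict_proba")
    for m in base_models
])

# Step 3: train the meta model on the out-of-fold predictions
meta_model = linear_model.LogisticRegression(max_iter=1000).fit(meta_features, y)

# Step 4: retrain every base model on the full training data
for m in base_models:
    m.fit(X, y)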

How to predict with a stack

  1. Predict using each of the models
  2. Feed the predictions as features to the meta model -- as in the sketch below
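
Continuing the sketch above, prediction just chains the two stages:

# Base model outputs become the meta model's features; here scored on the training
# data only to keep the sketch self-contained -- in practice this would be new data
stack_features = np.column_stack([m.predict_proba(X) for m in base_models])
stack_predictions = meta_model.predict(stack_features)
print((stack_predictions == y).mean())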

Stacking in sklearn

Starting with scikit-learn 0.22: use ensemble.StackingClassifier. It takes the component models as a list of (name, model) pairs, and uses the usual name__parameter double-underscore convention for addressing nested parameters (for example in grid searches).

In [2]:
from sklearn import datasets, ensemble, svm, linear_model, preprocessing, pipeline, model_selection
X, y = datasets.load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, stratify=y)
# Component models: a scaled SVC and a small random forest
estimators = [
    ("svc", pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC())),
    ("rf", ensemble.RandomForestClassifier(n_estimators=10))
]
# Meta model: logistic regression on top of the component predictions
model = ensemble.StackingClassifier(estimators, final_estimator=linear_model.LogisticRegression())
model.fit(X_train, y_train).score(X_val, y_val)
Out[2]:
0.9736842105263158
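
Continuing the cell above, the double-underscore addressing reaches both the component models and the meta model, so the whole stack can be tuned as one estimator. The parameter names follow from the estimator name ("svc"), then the step name make_pipeline assigns ("svc"), then the parameter; final_estimator__ reaches the meta model. The grid values here are arbitrary:

search = model_selection.GridSearchCV(
    model,
    param_grid={
        "svc__svc__C": [0.1, 1.0, 10.0],   # stack estimator "svc" -> pipeline step "svc" -> C
        "final_estimator__C": [0.5, 1.0],  # the logistic regression meta model's C
    },
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)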

Support Vector Machines

I don't want us to leave the classifier challenge without first introducing Support Vector Machines, so we'll do that now.

Support Vector Machines start out as a way to place linear decision boundaries more rigidly and deliberately, and then, through kernels, grow into a powerful and flexible machine learning method.

WARNING: Mathematics incoming

We will dig into optimization topics and a reformulation of logistic regression. This part will move fast and draw on advanced topics.

Consider the logistic regression

Let the target take values $y=\pm1$. We use the trick of including a variable $x_0$ with constant value 1, so that the affine expression $Ax+b$ can be written as $w\cdot x$.

$$ \mathbb{P}(y=\pm1|x,w) = \sigma(y(w\cdot x)) $$

where $\sigma(t) = 1/(1+e^{-t})$ is the logistic (sigmoid) function. The log likelihood of $w$ given $x$ and $y$ is

$$ \ell(w) = \sum_i\log\sigma(y_i(w\cdot x_i)) $$
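
As a direct numpy transcription of this formula (the data and weights below are made up for illustration):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    # X already contains the constant column x_0 = 1; y takes values +1 / -1
    return np.sum(np.log(sigmoid(y * (X @ w))))

# A tiny made-up example: three points, two features plus the constant
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]])
y = np.array([1, -1, 1])
w = np.array([0.1, 0.8, -0.2])
print(log_likelihood(w, X, y))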

Consider the logistic regression

The negative log likelihood $-\ell(w)$ is a convex function, which in optimization terms means it has a unique critical point, and that critical point is the global optimum.

It turns out we can rewrite the problem

$$ \max_w\ell(w) \quad\text{as}\quad \max_\alpha\,-\frac12\sum_{ij}\alpha_i\alpha_jy_iy_j\color{blue}{(x_i\cdot x_j)} - \sum_iH(\alpha_i) $$

for a particular function $H$.

The Kernel trick

Notice how the optimization problem only depends on $x$ through the dot products $x_i\cdot x_j$.

This observation is fundamental to a great many machine learning techniques and approaches.

Why Kernels?

Recall the nested circles from last lecture:

In [28]:
fig1
Out[28]:

Logistic regression performs catastrophically badly on this dataset.

Why Kernels?

Suppose we instead embed the data points in a higher-dimensional space. Say we add a coordinate $z$ equal to each point's distance from the center of the dataset.

In [29]:
fig2
Out[29]:

Now we can separate the two parts of the dataset with a plane.

In [32]:
print(acc)
2D logistic regression accuracy: 0.498
3D logistic regression accuracy: 0.988
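
The figures and numbers above come from the lecture's own data; a minimal way to reproduce the same comparison, assuming sklearn.datasets.make_circles with arbitrarily chosen parameters, is:

import numpy as np
from sklearn import datasets, linear_model, model_selection

# Nested circles: not linearly separable in the plane
X, y = datasets.make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, random_state=0)

# 2D: logistic regression on the raw coordinates
acc_2d = linear_model.LogisticRegression().fit(X_train, y_train).score(X_val, y_val)

# 3D: add z = distance from the center as an extra coordinate
def embed(X):
    return np.column_stack([X, np.linalg.norm(X, axis=1)])

acc_3d = linear_model.LogisticRegression().fit(embed(X_train), y_train).score(embed(X_val), y_val)
print(f"2D logistic regression accuracy: {acc_2d:.3f}")
print(f"3D logistic regression accuracy: {acc_3d:.3f}")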

Why Kernels?

The trick of embedding a non-linear problem into a higher dimensional space where it becomes linear is immensely powerful.

Many of the interesting embeddings are extremely high-dimensional, sometimes infinite-dimensional. The actual embedding $\Phi(x)$ of the data points can be difficult or even impossible to compute.

This is where the shape of the optimization problem comes in. Recall from the logistic regression theory that in the end the data was only present through the factor $\color{blue}{x_i\cdot x_j}$. If we can cheaply calculate $\Phi(x_i)\cdot\Phi(x_j)$ without ever forming $\Phi(x_i)$ or $\Phi(x_j)$, we get the benefits of the embedding without paying the full cost.

This observation is called the kernel trick. We often write $K(x,y)$ for $\Phi(x)\cdot\Phi(y)$.
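
A tiny numerical check of this identity -- not from the lecture -- for the degree-2 polynomial kernel $K(x,y) = (x\cdot y + 1)^2$, whose explicit embedding $\Phi$ is already six-dimensional for two-dimensional inputs:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Explicit embedding for the degree-2 polynomial kernel:
# Phi(v) = (v1^2, v2^2, sqrt(2) v1 v2, sqrt(2) v1, sqrt(2) v2, 1)
def phi(v):
    return np.array([v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     1.0])

# The kernel computes the same inner product without ever forming Phi
print(phi(x) @ phi(z))     # 144.0
print((x @ z + 1) ** 2)    # 144.0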

Common Kernels

Name | Formula | Common use
Linear | $K(x,y) = x\cdot y$ |
Polynomial | $K(x,y) = (x\cdot y+c)^d$ |
Sigmoid | $K(x,y) = \tanh(\gamma x\cdot y + c)$ | Neural networks
Radial Basis Function | $K(x,y) = \exp[-\gamma|x-y|^2]$ |
Laplacian | $K(x,y) = \exp[-\gamma|x-y|_1]$ | Useful for noiseless data
Chi-squared | $K(x,y) = \exp[-\gamma\sum_i(x_i-y_i)^2/(x_i+y_i)]$ | Used in computer vision
Cosine similarity | $K(x,y) = (x\cdot y)/(|x|\cdot|y|)$ | Useful for tf-idf text embeddings
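
Several of these are built into svm.SVC via the kernel= argument ("linear", "poly", "rbf", "sigmoid"), and all of them are available as Gram-matrix functions in sklearn.metrics.pairwise (usable with kernel="precomputed"). A small sketch with arbitrarily chosen points and parameters:

import numpy as np
from sklearn.metrics import pairwise

# Three points with non-negative features (the chi-squared kernel requires non-negative input)
X = np.array([[0.5, 1.0], [1.0, 1.0], [2.0, 0.5]])

# Each call returns the 3x3 Gram matrix of K(x_i, x_j) for the rows of X
print(pairwise.linear_kernel(X))
print(pairwise.polynomial_kernel(X, degree=2, coef0=1.0))
print(pairwise.sigmoid_kernel(X, gamma=0.5, coef0=1.0))
print(pairwise.rbf_kernel(X, gamma=0.5))
print(pairwise.laplacian_kernel(X, gamma=0.5))
print(pairwise.chi2_kernel(X, gamma=0.5))
print(pairwise.cosine_similarity(X))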

Which linear classifier do we use?

In [38]:
fig1
Out[38]:

Which linear classifier do we use?

In [39]:
fig2
Out[39]:

Which linear classifier do we use?

One option is to maximize the margin: pick the separating hyperplane that maximizes its distance to the nearest points of the two classes. Equivalently, for a hyperplane $\beta\cdot x + \beta_0 = 0$ with $\|\beta\| = 1$, maximize the smallest value of $|\beta\cdot x_i + \beta_0|$ over the training points.

Notation is simpler if we pick class labels $y_i\in\{-1,1\}$.

$$ \max_{\|\beta\|=1,\ \beta_0,\ M} M \qquad\text{subject to}\qquad y_i(\beta\cdot x_i + \beta_0) \geq M \quad\text{for all } i $$
In [56]:
fig3
Out[56]:

Support Vectors

With a maximum margin classifier, not all data points are created equal. Points far away from the margin can move around without influencing it, while points on the margin immediately influence the fit.

These points are called support vectors.

In [94]:
fig3
Out[94]:

Non-separable classes

If the classes overlap -- so that no separating hyperplane exists -- then the optimization problem

$$ \max_{\|\beta\|=1,\ \beta_0,\ M} M \qquad\text{subject to}\qquad y_i(\beta\cdot x_i + \beta_0) \geq M \quad\text{for all } i $$

has no solution: the constraints can never be fulfilled.

The solution is to allow for overlap, using a budget parameter for that overlap.

Non-separable classes

Write $\zeta_i$ for the allowed overlap for the point $x_i$: the extent to which $x_i$ may sit inside the margin -- or even on the wrong side of the classifying hyperplane -- in other words, the extent to which the model is allowed to be uncertain about that point.

After rescaling $\beta$ and $\beta_0$ so that the margin condition reads $1$ instead of $M$ (that is, $M = 1/\|\beta\|$), we can write this allowance as

$$ y_i(\beta\cdot x_i+\beta_0) \geq 1-\zeta_i \qquad \zeta_i \geq 0 $$

Now, with these restrictions, we can optimize for a balance between a small $\|\beta\|$ (a wide margin) and small $\zeta_i$ (few margin violations). The optimization becomes

$$ \min_{\beta, \beta_0, \zeta} \|\beta\|_2 + C\|\zeta\|_1 $$

Variations exist where $\|\beta\|_1$ or $\|\zeta\|_2$ is used instead.

The resulting classifier is called a support vector machine.
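
In scikit-learn this classifier is svm.SVC (or svm.LinearSVC for the purely linear case). A small sketch on an arbitrary synthetic dataset, where the C argument is the balance parameter from the optimization above and support_vectors_ exposes the support vectors:

from sklearn import datasets, svm

# Two informative features, two overlapping classes -- an arbitrary synthetic example
X, y = datasets.make_classification(n_samples=200, n_features=2, n_informative=2,
                                    n_redundant=0, random_state=0)

# Small C tolerates more margin violations (wider margin); large C penalizes them harder
clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)   # the points that actually determine the fit
print(clf.score(X, y))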

Enter Kernels

Just like with the logistic regression example when we introduced kernels, the support vector machine optimization problem can be reformulated as

$$ \max_\alpha \sum_i\alpha_i - \frac12\sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i\cdot x_j) \qquad\text{subject to}\qquad 0\le\alpha_i\le C,\quad \sum_i\alpha_iy_i=0 $$

where $\beta = \sum_i\alpha_i y_i x_i$ and $\beta_0$ can be recovered from any support vector on the margin.

Since fitting this model only depends on the data through the dot products $x_i\cdot x_j$, and prediction only needs $\beta\cdot x = \sum_i\alpha_iy_i(x_i\cdot x)$ -- we can apply kernels!

In [128]:
fig4
Out[128]:

Primary tuning parameter

An SVM model is parametrized by:

  • Choice of kernel (including kernel parameters)
  • Choice of balance parameter $C$
In [131]:
fig5
Out[131]:

In [139]:
fig6
Out[139]:
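
In practice the two are tuned together, for example with a grid search over C and the RBF kernel's gamma; the dataset and grid values below are arbitrary:

from sklearn import datasets, model_selection, svm

X, y = datasets.load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],   # balance parameter
    "gamma": [0.01, 0.1, 1.0],      # RBF kernel width
}
search = model_selection.GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)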

Unbalanced Classes

Used without care, support vector machines behave badly on unbalanced classes: the error budget tends to be spent almost entirely on the smaller class, which the classifier can end up largely ignoring.

In [156]:
fig7
Out[156]:
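
One standard mitigation in scikit-learn is to reweight the error budget per class with the class_weight argument; a sketch on an artificially unbalanced dataset (the 95/5 split and all other parameters are arbitrary):

from sklearn import datasets, svm
from sklearn.metrics import balanced_accuracy_score

# A 95/5 class split -- the small class is cheap to sacrifice
X, y = datasets.make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

plain = svm.SVC().fit(X, y)
weighted = svm.SVC(class_weight="balanced").fit(X, y)  # scale C inversely to class frequency

# Balanced accuracy, evaluated on the training data only to keep the sketch short
print(balanced_accuracy_score(y, plain.predict(X)))
print(balanced_accuracy_score(y, weighted.predict(X)))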