Stacking

If several ML models perform badly on different parts of the dataset, stacking can combine their strengths and improve overall performance.

Stacking is complex to implement and may produce only small improvements -- so it is not common in production.

It does, however, very often produce some improvement -- which makes it very common on Kaggle.

In [15]:
display(SVG(data=svg.stdout))
[Diagram: data feeds into model1, model2, and model3; their predictions feed into the metamodel, which produces the final prediction.]

How to train a stack

The basic idea is to use the outputs of several models as features for a meta model that combines their predictions.

To avoid overfitting, each meta-model feature used in training is created as an out-of-fold (validation) prediction -- see the sketch after these steps:

  1. Split data into $k$ cross-validation folds data.1, ..., data.k
  2. To produce predictions for the input in a fold data.i, fit all models against all other folds, then predict on the validation fold data.i
  3. Train the meta model using the validation fold predictions from all models -- optionally also other features from the data
  4. Retrain all models on the full training data
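
A minimal sketch of these four steps, using cross_val_predict from scikit-learn to produce the out-of-fold predictions; the dataset and base models here are placeholders chosen just for illustration:

import numpy as np
from sklearn import datasets, ensemble, linear_model, model_selection, svm

X, y = datasets.load_iris(return_X_y=True)
base_models = [svm.SVC(probability=True), ensemble.RandomForestClassifier(n_estimators=10)]

# Steps 1-2: out-of-fold predicted probabilities from each base model, via 5-fold CV
meta_features = np.column_stack([
    model_selection.cross_val_predict(m, X, y, cv=5, method="predict_proba")
    for m in base_models
])

# Step 3: train the meta model on the out-of-fold predictions
meta_model = linear_model.LogisticRegression(max_iter=1000).fit(meta_features, y)

# Step 4: retrain every base model on the full training data
for m in base_models:
    m.fit(X, y)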

How to predict with a stack

  1. Predict using each of the models
  2. Feed the predictions as features to the meta model -- as in the sketch below
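
Continuing the sketch above, prediction just chains the two stages:

# Base model outputs become the meta model's features; here scored on the training
# data only to keep the sketch self-contained -- in practice this would be new data
stack_features = np.column_stack([m.predict_proba(X) for m in base_models])
stack_predictions = meta_model.predict(stack_features)
print((stack_predictions == y).mean())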

Stacking in sklearn

Starting with scikit-learn 0.22: use ensemble.StackingClassifier. It takes the component models as a list of (name, model) pairs, and uses the usual name__parameter double-underscore convention for addressing nested parameters (for example in grid searches).

In [2]:
from sklearn import datasets, ensemble, svm, linear_model, preprocessing, pipeline, model_selection
X, y = datasets.load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, stratify=y)
# Component models: a scaled SVC and a small random forest
estimators = [
    ("svc", pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC())),
    ("rf", ensemble.RandomForestClassifier(n_estimators=10))
]
# Meta model: logistic regression on top of the component predictions
model = ensemble.StackingClassifier(estimators, final_estimator=linear_model.LogisticRegression())
model.fit(X_train, y_train).score(X_val, y_val)
Out[2]:
0.9736842105263158
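
Continuing the cell above, the double-underscore addressing reaches both the component models and the meta model, so the whole stack can be tuned as one estimator. The parameter names follow from the estimator name ("svc"), then the step name make_pipeline assigns ("svc"), then the parameter; final_estimator__ reaches the meta model. The grid values here are arbitrary:

search = model_selection.GridSearchCV(
    model,
    param_grid={
        "svc__svc__C": [0.1, 1.0, 10.0],   # stack estimator "svc" -> pipeline step "svc" -> C
        "final_estimator__C": [0.5, 1.0],  # the logistic regression meta model's C
    },
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)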

Support Vector Machines

I don't want us to leave the classifier challenge without first introducing Support Vector Machines, so we'll do that now.

Support Vector Machines start out as a way to place linear decision boundaries more rigidly and deliberately, and then, through kernels, grow into a powerful and flexible machine learning method.

WARNING: Mathematics incoming

We will dig into optimization topics and a reformulation of logistic regression. This part will move fast and draw on advanced topics.

Consider the logistic regression

Let the target take values $y=\pm1$. We use the trick of including a variable $x_0$ with constant value 1, so that the affine expression $Ax+b$ can be written as $w\cdot x$.

$$ \mathbb{P}(y=\pm1|x,w) = \sigma(y(w\cdot x)) $$

where $\sigma(t) = 1/(1+e^{-t})$ is the logistic (sigmoid) function. The log likelihood of $w$ given $x$ and $y$ is

$$ \ell(w) = \sum_i\log\sigma(y_i(w\cdot x_i)) $$
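
As a direct numpy transcription of this formula (the data and weights below are made up for illustration):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    # X already contains the constant column x_0 = 1; y takes values +1 / -1
    return np.sum(np.log(sigmoid(y * (X @ w))))

# A tiny made-up example: three points, two features plus the constant
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]])
y = np.array([1, -1, 1])
w = np.array([0.1, 0.8, -0.2])
print(log_likelihood(w, X, y))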

Consider the logistic regression

The negative log likelihood $-\ell(w)$ is a convex function, which in optimization terms means it has a unique critical point, and that critical point is the global optimum.

It turns out we can rewrite the problem

$$ \max_w\ell(w) \quad\text{as}\quad \max_\alpha\,-\frac12\sum_{ij}\alpha_i\alpha_jy_iy_j\color{blue}{(x_i\cdot x_j)} - \sum_iH(\alpha_i) $$

for a particular function $H$.

The Kernel trick

Notice how the optimization problem only depends on $x$ through the dot products $x_i\cdot x_j$.

This observation is fundamental to a great many machine learning techniques and approaches.

Why Kernels?

Recall the nested circles from last lecture:

In [28]:
fig1
Out[28]:

Logistic regression performs catastrophically badly on this dataset.

Why Kernels?

Suppose we instead embed the data points in a higher-dimensional space. Say we add a coordinate $z$ equal to each point's distance from the center of the dataset.

In [29]:
fig2
Out[29]:

Now we can separate the two parts of the dataset with a plane.

In [32]:
print(acc)
2D logistic regression accuracy: 0.498
3D logistic regression accuracy: 0.988
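
The figures and numbers above come from the lecture's own data; a minimal way to reproduce the same comparison, assuming sklearn.datasets.make_circles with arbitrarily chosen parameters, is:

import numpy as np
from sklearn import datasets, linear_model, model_selection

# Nested circles: not linearly separable in the plane
X, y = datasets.make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, random_state=0)

# 2D: logistic regression on the raw coordinates
acc_2d = linear_model.LogisticRegression().fit(X_train, y_train).score(X_val, y_val)

# 3D: add z = distance from the center as an extra coordinate
def embed(X):
    return np.column_stack([X, np.linalg.norm(X, axis=1)])

acc_3d = linear_model.LogisticRegression().fit(embed(X_train), y_train).score(embed(X_val), y_val)
print(f"2D logistic regression accuracy: {acc_2d:.3f}")
print(f"3D logistic regression accuracy: {acc_3d:.3f}")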

Why Kernels?

The trick of embedding a non-linear problem into a higher dimensional space where it becomes linear is immensely powerful.

Many of the interesting embeddings are extremely high-dimensional, sometimes infinite-dimensional. The actual embedding $\Phi(x)$ of the data points can be difficult or even impossible to compute.

This is where the shape of the optimization problem comes in. Recall from the logistic regression theory that in the end the data was only present through the factor $\color{blue}{x_i\cdot x_j}$. If we can cheaply calculate $\Phi(x_i)\cdot\Phi(x_j)$ without ever forming $\Phi(x_i)$ or $\Phi(x_j)$, we get the benefits of the embedding without paying the full cost.

This observation is called the kernel trick. We often write $K(x,y)$ for $\Phi(x)\cdot\Phi(y)$.
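
A tiny numerical check of this identity -- not from the lecture -- for the degree-2 polynomial kernel $K(x,y) = (x\cdot y + 1)^2$, whose explicit embedding $\Phi$ is already six-dimensional for two-dimensional inputs:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Explicit embedding for the degree-2 polynomial kernel:
# Phi(v) = (v1^2, v2^2, sqrt(2) v1 v2, sqrt(2) v1, sqrt(2) v2, 1)
def phi(v):
    return np.array([v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     1.0])

# The kernel computes the same inner product without ever forming Phi
print(phi(x) @ phi(z))     # 144.0
print((x @ z + 1) ** 2)    # 144.0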

Common Kernels

Name | Formula | Common use
Linear | $K(x,y) = x\cdot y$ |
Polynomial | $K(x,y) = (x\cdot y+c)^d$ |
Sigmoid | $K(x,y) = \tanh(\gamma x\cdot y + c)$ | Neural networks
Radial Basis Function | $K(x,y) = \exp[-\gamma|x-y|^2]$ |
Laplacian | $K(x,y) = \exp[-\gamma|x-y|_1]$ | Useful for noiseless data
Chi-squared | $K(x,y) = \exp[-\gamma\sum_i(x_i-y_i)^2/(x_i+y_i)]$ | Used in computer vision
Cosine similarity | $K(x,y) = (x\cdot y)/(|x|\cdot|y|)$ | Useful for tf-idf text embeddings
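
Several of these are built into svm.SVC via the kernel= argument ("linear", "poly", "rbf", "sigmoid"), and all of them are available as Gram-matrix functions in sklearn.metrics.pairwise (usable with kernel="precomputed"). A small sketch with arbitrarily chosen points and parameters:

import numpy as np
from sklearn.metrics import pairwise

# Three points with non-negative features (the chi-squared kernel requires non-negative input)
X = np.array([[0.5, 1.0], [1.0, 1.0], [2.0, 0.5]])

# Each call returns the 3x3 Gram matrix of K(x_i, x_j) for the rows of X
print(pairwise.linear_kernel(X))
print(pairwise.polynomial_kernel(X, degree=2, coef0=1.0))
print(pairwise.sigmoid_kernel(X, gamma=0.5, coef0=1.0))
print(pairwise.rbf_kernel(X, gamma=0.5))
print(pairwise.laplacian_kernel(X, gamma=0.5))
print(pairwise.chi2_kernel(X, gamma=0.5))
print(pairwise.cosine_similarity(X))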

Which linear classifier do we use?

In [38]:
fig1
Out[38]:

Which linear classifier do we use?

In [39]:
fig2
Out[39]:

Which linear classifier do we use?

One option is to maximize the margin: pick the separating hyperplane that maximizes its distance to the nearest points of the two classes. Equivalently, for a hyperplane $\beta\cdot x + \beta_0 = 0$ with $\|\beta\| = 1$, maximize the smallest value of $|\beta\cdot x_i + \beta_0|$ over the training points.

Notation is simpler if we pick class labels $y_i\in\{-1,1\}$.

$$ \max_{\|\beta\|=1,\ \beta_0,\ M} M \qquad\text{subject to}\qquad y_i(\beta\cdot x_i + \beta_0) \geq M \quad\text{for all } i $$
In [56]:
fig3
Out[56]:

Support Vectors

With a maximum margin classifier, not all data points are created equal. Points far away from the margin can move around without influencing it, while points on the margin immediately influence the fit.

These points are called support vectors.

In [94]:
fig3
Out[94]:

Non-separable classes

If the classes overlap -- so that no separating hyperplane exists -- then the optimization problem

$$ \max_{\|\beta\|=1,\ \beta_0,\ M} M \qquad\text{subject to}\qquad y_i(\beta\cdot x_i + \beta_0) \geq M \quad\text{for all } i $$

has no solution: the constraints can never be fulfilled.

The solution is to allow for overlap, using a budget parameter for that overlap.

Non-separable classes

Write $\zeta_i$ for the allowed overlap for the point $x_i$: the extent to which $x_i$ may sit inside the margin -- or even on the wrong side of the classifying hyperplane -- in other words, the extent to which the model is allowed to be uncertain about that point.

After rescaling $\beta$ and $\beta_0$ so that the margin condition reads $1$ instead of $M$ (that is, $M = 1/\|\beta\|$), we can write this allowance as

$$ y_i(\beta\cdot x_i+\beta_0) \geq 1-\zeta_i \qquad \zeta_i \geq 0 $$

Now, with these restrictions, we can optimize for a balance between a small $\|\beta\|$ (a wide margin) and small $\zeta_i$ (few margin violations). The optimization becomes

$$ \min_{\beta, \beta_0, \zeta} \|\beta\|_2 + C\|\zeta\|_1 $$

Variations exist where $\|\beta\|_1$ or $\|\zeta\|_2$ is used instead.

The resulting classifier is called a support vector machine.
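
In scikit-learn this classifier is svm.SVC (or svm.LinearSVC for the purely linear case). A small sketch on an arbitrary synthetic dataset, where the C argument is the balance parameter from the optimization above and support_vectors_ exposes the support vectors:

from sklearn import datasets, svm

# Two informative features, two overlapping classes -- an arbitrary synthetic example
X, y = datasets.make_classification(n_samples=200, n_features=2, n_informative=2,
                                    n_redundant=0, random_state=0)

# Small C tolerates more margin violations (wider margin); large C penalizes them harder
clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)   # the points that actually determine the fit
print(clf.score(X, y))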

Enter Kernels

Just like with the logistic regression example when we introduced kernels, the support vector machine optimization problem can be reformulated as

$$ \max_\alpha \sum_i\alpha_i - \frac12\sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i\cdot x_j) \qquad\text{subject to}\qquad 0\le\alpha_i\le C,\quad \sum_i\alpha_iy_i=0 $$

where $\beta = \sum_i\alpha_i y_i x_i$ and $\beta_0$ can be recovered from any support vector on the margin.

Since fitting this model only depends on the data through the dot products $x_i\cdot x_j$, and prediction only needs $\beta\cdot x = \sum_i\alpha_iy_i(x_i\cdot x)$ -- we can apply kernels!

In [128]:
fig4
Out[128]:

Primary tuning parameter

An SVM model is parametrized by:

  • Choice of kernel (including kernel parameters)
  • Choice of balance parameter $C$
In [131]:
fig5
Out[131]:

In [139]:
fig6
Out[139]:
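
In practice the two are tuned together, for example with a grid search over C and the RBF kernel's gamma; the dataset and grid values below are arbitrary:

from sklearn import datasets, model_selection, svm

X, y = datasets.load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],   # balance parameter
    "gamma": [0.01, 0.1, 1.0],      # RBF kernel width
}
search = model_selection.GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)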

Unbalanced Classes

Used without care, support vector machines behave badly on unbalanced classes: the error budget tends to be spent almost entirely on the smaller class, which the classifier can end up largely ignoring.

In [156]:
fig7
Out[156]:
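
One standard mitigation in scikit-learn is to reweight the error budget per class with the class_weight argument; a sketch on an artificially unbalanced dataset (the 95/5 split and all other parameters are arbitrary):

from sklearn import datasets, svm
from sklearn.metrics import balanced_accuracy_score

# A 95/5 class split -- the small class is cheap to sacrifice
X, y = datasets.make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

plain = svm.SVC().fit(X, y)
weighted = svm.SVC(class_weight="balanced").fit(X, y)  # scale C inversely to class frequency

# Balanced accuracy, evaluated on the training data only to keep the sketch short
print(balanced_accuracy_score(y, plain.predict(X)))
print(balanced_accuracy_score(y, weighted.predict(X)))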