If several ML models perform badly in different parts of the dataset, stacking can improve performance.
Stacking is complex to implement, and may produce only small improvements -- not common in production.
Stacking very often produces improvements -- very common on Kaggle.
The basic idea is to use several model outputs as features for a meta model that combines predictions from several sources.
To avoid overfitting, each feature in the training phase is created as a validation feature:
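Split `data` into $k$ cross-validation folds `data.1`, ..., `data.k`. For each fold `data.i`, fit all models against all other folds, then predict on the validation fold `data.i`. These out-of-fold predictions become the training features for the meta model.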
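The fold-by-fold construction described above can be sketched with `model_selection.cross_val_predict`, which returns for each row the prediction of a model fitted on the other folds. This is a minimal sketch, not the only way to do it: the choice of five folds, the two component models and the prediction methods are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets, ensemble, linear_model, model_selection, svm

X, y = datasets.load_iris(return_X_y=True)

# Component models, each paired with the method whose output becomes a feature.
base_models = [
    (svm.SVC(), "decision_function"),
    (ensemble.RandomForestClassifier(n_estimators=10, random_state=0), "predict_proba"),
]

# Every row is predicted by a model that never saw that row during fitting.
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
meta_features = np.hstack([
    model_selection.cross_val_predict(model, X, y, cv=cv, method=method)
    for model, method in base_models
])

# The meta model is trained on the out-of-fold predictions only.
meta_model = linear_model.LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)  # (150, 6): three columns per component model
```

At prediction time the component models would be refitted on all of the training data before their outputs are passed to the meta model.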
sklearn
Starting with scikit-learn 0.22: use `ensemble.StackingClassifier`. This object requires a list of pairs of a string name and a model as the component models, and uses the `__` convention for addressing their parameters.
from sklearn import datasets, ensemble, svm, linear_model, preprocessing, pipeline, model_selection
X, y = datasets.load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, stratify=y)
estimators = [
("svc", pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC())),
("rf", ensemble.RandomForestClassifier(n_estimators=10))
]
model = ensemble.StackingClassifier(estimators, final_estimator=linear_model.LogisticRegression())
model.fit(X_train,y_train).score(X_val,y_val)
0.9736842105263158
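As a small illustration of the `__` addressing convention, the sketch below sets parameters on the component models and on the meta model through `set_params`. The parameter values are arbitrary and chosen only for the example.

```python
from sklearn import ensemble, linear_model, pipeline, preprocessing, svm

estimators = [
    ("svc", pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC())),
    ("rf", ensemble.RandomForestClassifier(n_estimators=10, random_state=0)),
]
model = ensemble.StackingClassifier(
    estimators, final_estimator=linear_model.LogisticRegression()
)

# "rf__n_estimators": parameter n_estimators of the component named "rf".
# "svc__svc__C": the component "svc" is a pipeline, and make_pipeline names
#   each step after its lowercased class, so the SVC step is also called "svc".
# "final_estimator__C": parameter C of the meta model.
model.set_params(rf__n_estimators=25, svc__svc__C=10.0, final_estimator__C=0.5)

params = model.get_params()
print(params["rf__n_estimators"], params["svc__svc__C"], params["final_estimator__C"])
```

The same names can be used as keys in a `model_selection.GridSearchCV` parameter grid.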
I don't want us to leave the classifier challenge without introducing Support Vector Machines first. So we'll do that now.
Support Vector Machines start out as a way to make linear decision boundaries more rigid and well-placed, and from there, through the use of kernels, grow into a powerful and flexible machinery for machine learning.
We will dig into optimization topics and the derivation of logistic regression. This part will be fast, and draw on advanced topics.
Let the target take values $y=\pm1$. We use the trick of including a variable $x_0$ with constant value 1 so that $AX+B$ can be written as $w\cdot x$.
$$ \mathbb{P}(y=\pm1|x,w) = \sigma(y(w\cdot x)), \qquad \sigma(t) = \frac{1}{1+e^{-t}} $$
Here $\sigma$ is the logistic function, the inverse of the logit. The log likelihood of $w$ given $x$ and $y$ is
$$ \ell(w) = \sum_i\log\sigma(y_i(w\cdot x_i)) $$
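As a numerical sanity check of the $\pm1$ formulation, the sketch below evaluates $\sum_i\log\sigma(y_i(w\cdot x_i))$ with the logistic function $\sigma(t) = 1/(1+e^{-t})$ and compares it with scikit-learn's `log_loss`. The synthetic dataset and the helper name `log_likelihood` are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics

X, y01 = datasets.make_classification(n_samples=200, n_features=4, random_state=0)
y = 2 * y01 - 1                              # relabel the classes as -1 / +1
X1 = np.column_stack([np.ones(len(X)), X])   # x_0 = 1 absorbs the intercept

def log_likelihood(w, X, y):
    """sum_i log sigma(y_i (w . x_i)), using a numerically stable log sigma."""
    return -np.logaddexp(0, -y * (X @ w)).sum()

clf = linear_model.LogisticRegression().fit(X, y)
w = np.concatenate([clf.intercept_, clf.coef_[0]])

ours = -log_likelihood(w, X1, y)
reference = metrics.log_loss(y, clf.predict_proba(X), normalize=False)
print(np.isclose(ours, reference))
```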
The negative log likelihood $-\ell(w)$ is a convex function, which in optimization terms means that any critical point is a global optimum.
It turns out we can rewrite the problem
$$ \min_w-\ell(w) \quad\text{to}\quad \max_\alpha\sum_{ij}\alpha_i\alpha_jy_iy_j\color{blue}{(x_i\cdot x_j)} - \sum_iH(\alpha_i) $$
for a particular function $H$, with one dual variable $\alpha_i$ per data point.
Notice how the optimization problem only depends on $x$ through the dot products $x_i\cdot x_j$.
This is fundamental for very many machine learning techniques and approaches.
Recall the nested circles from last lecture:
fig1
Logistic regression performs catastrophically badly on this dataset.
Suppose we instead embed the datapoints in a higher dimensional space? Say we set $z$ to be the distance from the center of the dataset.
fig2
Now we can separate the two parts of the dataset with a plane.
print(acc)
2D logistic regression accuracy: 0.498
3D logistic regression accuracy: 0.988
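A comparison of this kind can be sketched as follows. The dataset parameters (`n_samples`, `noise`, `factor`) and the use of 5-fold cross-validation are assumptions, so the numbers will not match the ones above exactly.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

# Two nested circles; the noise and factor values are illustrative choices.
X, y = datasets.make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)

# Embed into 3D: z is the distance of each point from the center of the dataset.
z = np.linalg.norm(X - X.mean(axis=0), axis=1)
X3 = np.column_stack([X, z])

accs = {}
for name, features in [("2D", X), ("3D", X3)]:
    scores = model_selection.cross_val_score(
        linear_model.LogisticRegression(), features, y, cv=5
    )
    accs[name] = scores.mean()
    print(f"{name} logistic regression accuracy: {accs[name]:.3f}")
```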
The trick of embedding a non-linear problem into a higher dimensional space where it becomes linear is immensely powerful.
Many of the interesting embeddings are extremely high-dimensional, sometimes infinite-dimensional. The actual embedding $\Phi(x)$ of the data points can be difficult or even impossible to compute explicitly.
This is where the shape of the optimization problem comes in. Recall from the Logistic Regression theory that in the end the data was only present through the factor $\color{blue}{x_i\cdot x_j}$. If we can comfortably calculate $\Phi(x_i)\cdot\Phi(x_j)$ without calculating $\Phi(x_i)$ or $\Phi(x_j)$, we can use the benefits without paying as high a cost.
This observation is called the kernel trick. We often write $K(x,y)$ for $\Phi(x)\cdot\Phi(y)$.
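A small worked example, assuming the homogeneous quadratic kernel $K(x,y) = (x\cdot y)^2$ on 2D inputs: its explicit embedding is $\Phi(x) = (x_1^2, \sqrt2x_1x_2, x_2^2)$, and the sketch below checks that both routes give the same number. The function names `phi` and `K` are chosen for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x . y)^2 on 2D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def K(x, y):
    """The same inner product, computed without ever forming phi."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(y))  # embed first, then take the dot product
implicit = K(x, y)                 # kernel trick: stay in the original space
print(np.isclose(explicit, implicit))
```

For a polynomial kernel of degree $d$ on $n$ features the explicit embedding has on the order of $n^d$ coordinates, while the kernel evaluation still costs a single $n$-dimensional dot product.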
Name | Formula | Common use |
---|---|---|
Linear | $K(x,y) = x\cdot y$ | |
Polynomial | $K(x,y) = (x\cdot y+c)^d$ | |
Sigmoid | $K(x,y) = \tanh(\gamma x\cdot y + c)$ | Neural networks |
Radial Basis Function | $K(x,y) = \exp[-\gamma\lVert x-y\rVert_2^2]$ | |
Laplacian | $K(x,y) = \exp[-\gamma\lVert x-y\rVert_1]$ | Useful for noiseless data |
Chi-squared | $K(x,y) = \exp[-\gamma\sum_i(x_i-y_i)^2/(x_i+y_i)]$ | Used in computer vision |
Cosine similarity | $K(x,y) = (x\cdot y)/(\lVert x\rVert\cdot\lVert y\rVert)$ | Useful for tf-idf text embeddings |
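Several of these kernels are available in `sklearn.metrics.pairwise`. The sketch below compares a few of them with the formulas in the table on random data. The values of `gamma`, `c` and `d` are arbitrary, and note that scikit-learn's polynomial kernel has an extra `gamma` factor on the dot product, set to 1 here to match the table.

```python
import numpy as np
from sklearn.metrics import pairwise

rng = np.random.default_rng(0)
A = rng.random((4, 3))  # entries in [0, 1): the chi-squared kernel needs non-negative data
B = rng.random((5, 3))
gamma, c, d = 0.5, 1.0, 3

# Pairwise differences and sums with shape (4, 5, 3), for the manual formulas.
diff = A[:, None, :] - B[None, :, :]
total = A[:, None, :] + B[None, :, :]

# Each entry pairs the library result with the formula from the table.
checks = {
    "polynomial": (
        pairwise.polynomial_kernel(A, B, degree=d, gamma=1.0, coef0=c),
        (A @ B.T + c) ** d,
    ),
    "rbf": (
        pairwise.rbf_kernel(A, B, gamma=gamma),
        np.exp(-gamma * (diff ** 2).sum(axis=-1)),
    ),
    "laplacian": (
        pairwise.laplacian_kernel(A, B, gamma=gamma),
        np.exp(-gamma * np.abs(diff).sum(axis=-1)),
    ),
    "chi2": (
        pairwise.chi2_kernel(A, B, gamma=gamma),
        np.exp(-gamma * (diff ** 2 / total).sum(axis=-1)),
    ),
}
for name, (library, manual) in checks.items():
    print(name, np.allclose(library, manual))
```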
fig1
fig2
One option is to maximize the margin: pick the separating hyperplane to maximize its distance from the two classes. Equivalently, for the hyperplane $\beta X + \beta_0 = 0$ with $\|\beta\|_2 = 1$, maximize the smallest distance $\min_i|\beta X_i + \beta_0|$.
Notation is simpler if we pick class labels $y_i\in\{-1,1\}$.
$$ \max_{M, \beta, \beta_0} M \qquad\text{subject to}\qquad \|\beta\|_2 = 1, \quad y_i(\beta X_i + \beta_0) \geq M \text{ for all } i $$
fig3
With a maximum margin classifier, not all data points are created equal. Points far away from the margin can move around without influencing the maximum margin, while points on the margin immediately influence the fit.
These points are called support vectors.
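The role of the support vectors can be illustrated with a small experiment: fit a linear SVM on separable data, move a point that is not a support vector further away from the boundary, and refit. This is a sketch on made-up toy data, and the very large `C` is an assumption used to approximate the hard-margin classifier.

```python
import numpy as np
from sklearn import svm

# Two well-separated clusters (illustrative toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A very large C approximates the maximum margin (hard-margin) classifier.
clf = svm.SVC(kernel="linear", C=1e6).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "points")

# Pick a point that is NOT a support vector and push it away from the other class.
far = np.setdiff1d(np.arange(len(X)), clf.support_)[0]
X_moved = X.copy()
X_moved[far] += 5 * y[far]

clf_moved = svm.SVC(kernel="linear", C=1e6).fit(X_moved, y)
same_fit = np.allclose(clf.coef_, clf_moved.coef_, atol=1e-2) and np.allclose(
    clf.intercept_, clf_moved.intercept_, atol=1e-2
)
print("hyperplane unchanged:", same_fit)
```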
fig3
If the classes overlap, so that no separating hyperplane exists, then the optimization problem
$$ \max_{M, \beta, \beta_0} M \qquad\text{subject to}\qquad \|\beta\|_2 = 1, \quad y_i(\beta X_i + \beta_0) \geq M \text{ for all } i $$
has no solution with a positive margin $M$: the constraints can never be fulfilled.
The solution is to allow for overlap, using a budget parameter for that overlap.
Write $\zeta_i$ for the allowed overlap for the point $x_i$: $\zeta_i$ is the extent to which $x_i$ is allowed to sit inside the margin, or even on the wrong side of the classifying hyperplane. In other words, it is the extent to which the model is allowed to be uncertain about $x_i$.
We can write this allowance as
$$ y_i(\beta X_i+\beta_0) \geq 1-\zeta_i \qquad \zeta_i \geq 0 $$
Here the margin has been rescaled to 1, so its geometric width is $1/\|\beta\|_2$ and a wide margin corresponds to a small $\beta$. Now, with these restrictions we can optimize for a balance between small $\beta$ and small $\zeta_i$. The optimization becomes
$$ \min_{\beta, \beta_0, \zeta} \frac12\|\beta\|_2^2 + C\|\zeta\|_1 $$
Variations exist where $\|\beta\|_1$ or $\|\zeta\|_2$ are used.
The resulting classifier is called a support vector machine.
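The trade-off controlled by $C$ can be sketched by fitting a linear SVM on overlapping classes for a few values of $C$ and computing the slack $\zeta_i = \max(0, 1 - y_i(\beta x_i + \beta_0))$ of every point. The toy data and the particular values of $C$ are assumptions made for illustration.

```python
import numpy as np
from sklearn import svm

# Two overlapping Gaussian clusters (illustrative toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, (100, 2)), rng.normal(1, 1.5, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

results = {}
for C in [0.01, 1.0, 100.0]:
    clf = svm.SVC(kernel="linear", C=C).fit(X, y)
    # decision_function returns beta . x_i + beta_0 for every point.
    zeta = np.maximum(0, 1 - y * clf.decision_function(X))
    results[C] = {
        "beta_norm": np.linalg.norm(clf.coef_),
        "total_slack": zeta.sum(),
        "n_support": len(clf.support_),
    }
    print(C, results[C])
```

A small $C$ makes overlap cheap, giving a small $\beta$ (wide margin) and a large total slack. A large $C$ shrinks the total slack at the price of a larger $\beta$.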
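Just like with the logistic regression example when we introduced kernels, the support vector machine optimization problem can be reformulated as
$$ \max_\alpha \sum_i\alpha_i - \frac12\sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i\cdot x_j\rangle \qquad\text{subject to}\qquad 0\leq\alpha_i\leq C, \quad \sum_i\alpha_iy_i = 0 $$
where $\beta = \sum_i\alpha_i y_i x_i$, and $\beta_0$ is recovered from any support vector $x_k$ with $0<\alpha_k<C$ as $\beta_0 = y_k - \langle\beta\cdot x_k\rangle$.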
Since fitting this model only depends on the data through $\langle x_i\cdot x_j\rangle$, and using it only requires $\langle\beta\cdot x\rangle = \sum_i\alpha_iy_i\langle x_i\cdot x\rangle$, we can apply kernels!
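In scikit-learn, a fitted `svm.SVC` exposes the products $\alpha_iy_i$ of the support vectors as `dual_coef_`. The sketch below uses them to rebuild $\beta$ for a linear kernel, and to rebuild the decision function for an RBF kernel using kernel evaluations only. The dataset and the values of `gamma` and `C` are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import pairwise

X, y = datasets.make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# Linear kernel: beta = sum_i alpha_i y_i x_i, summed over the support vectors.
linear = svm.SVC(kernel="linear", C=1.0).fit(X, y)
beta = linear.dual_coef_ @ linear.support_vectors_
print(np.allclose(beta, linear.coef_))

# RBF kernel: beta is never formed. The decision function only needs
# K(x_i, x) between the support vectors x_i and the new points x.
gamma = 1.0
rbf = svm.SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
K = pairwise.rbf_kernel(X, rbf.support_vectors_, gamma=gamma)
manual = K @ rbf.dual_coef_[0] + rbf.intercept_[0]
print(np.allclose(manual, rbf.decision_function(X)))
```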
fig4
An SVM model is parametrized by:
fig5
fig6
Used without care, support vector machines behave badly with unbalanced classes: the error budget can be spent almost entirely on the small class, resulting in poor fits.
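One common remedy in scikit-learn is the `class_weight` parameter of `svm.SVC`, which scales $C$ per class so that errors on the small class cost more. The sketch below compares recall on the small class with and without `class_weight="balanced"`. The synthetic 950/50 split is an assumption made for illustration.

```python
import numpy as np
from sklearn import metrics, model_selection, svm

# Illustrative unbalanced data: 950 points in the large class, 50 in the small one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (950, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 950 + [1] * 50)
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X, y, stratify=y, random_state=0
)

recalls = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = svm.SVC(kernel="linear", C=1.0, class_weight=class_weight)
    clf.fit(X_train, y_train)
    # Recall on class 1: the share of small-class points that are found.
    recalls[name] = metrics.recall_score(y_val, clf.predict(X_val))
    print(f"{name}: recall on the small class = {recalls[name]:.2f}")
```

The reweighting usually trades some accuracy on the large class for better recall on the small one, so whether it helps depends on which errors matter more.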
fig7