Expensive Computation

$k$-NN is versatile and often performs well, but it stores the entire training dataset and searches through it every time it makes a prediction. Notably, the Tabular Playground data has 288 columns ($d$) and 200,000 rows ($n$).

Time complexity for brute-force $k$-NN on $m$ predictions is $O(nmd)$. This can be sped up by storing the data in a geometric lookup tree. Building the tree increases training time to $O(dn\log n)$, keeps training-data storage at $O(dn)$, and reduces prediction time to $O(m\log n)$.
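To get a feel for the scale (assuming, hypothetically, a test set of roughly the same size as the training set, $m \approx 2\times10^5$): brute force needs on the order of $n \cdot m \cdot d \approx 2\times10^5 \cdot 2\times10^5 \cdot 288 \approx 10^{13}$ coordinate-wise distance operations, while a lookup tree needs on the order of $m\log_2 n \approx 2\times10^5 \cdot 18 \approx 4\times10^6$ node visits (ignoring the per-node distance cost in $d$ dimensions).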

Judging by the training and prediction speeds, sklearn.neighbors.KNeighborsClassifier seems to be doing brute force here. We can change this by picking an algorithm explicitly, e.g. model = KNeighborsClassifier(algorithm="kd_tree") or model = KNeighborsClassifier(algorithm="ball_tree").
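A minimal sketch of forcing the tree-based index, using a small synthetic stand-in for the Tabular Playground table (the real data is not loaded here):

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the real 200,000 x 288 table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The default algorithm="auto" lets scikit-learn choose; force a tree index instead.
model = KNeighborsClassifier(algorithm="kd_tree")  # or algorithm="ball_tree"
model.fit(X, y)
print(model.predict(X[:5]))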

Strategies for slow computations

Many Machine Learning algorithms are quite parallelizable: this invites several possible strategies to deal with slow computations.

  • Subsample - work only on a subset of the training data.
  • Parallelize - spread the work onto several CPUs or onto a GPU. This usually requires switching to a more advanced library (see the sketch after this list).
  • Switch algorithms.
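For the parallelize option, the lightest-weight version stays inside scikit-learn: KNeighborsClassifier accepts an n_jobs argument that spreads the neighbour queries over CPU cores (using a GPU really would require a different library). A minimal sketch, again on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 uses all available CPU cores for the neighbour queries.
model = KNeighborsClassifier(n_jobs=-1)
model.fit(X, y)
model.predict(X[:100])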

Subsampling to train on 10% of the data gave me a much quicker result: after 10 minutes of computation I scored 0.82119. That placed me at 436th out of 474 and did not beat Kaggle's benchmark solution.
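A sketch of the subsampling step (the 10% fraction carries over, but the synthetic data below is just a placeholder for the real table):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# Train on a random 10% of the rows.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 10, replace=False)

model = KNeighborsClassifier()
model.fit(X[idx], y[idx])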

If you want trees, I'll give you trees!

Since $k$-NN is (often) sped up using search trees, we could simply keep the search trees themselves and discard the training points stored in them!

Decision Trees

Create a binary tree that segments $X$ into regions with piecewise constant predictions.

For each node in the tree, pick a predictor at random and choose a threshold for that predictor.

Two strategies:

  1. Pick threshold at random between min and max
  2. Pick threshold to optimize the split
In [4]:
from IPython.display import SVG
display(SVG("lecture7-titanic-dt.svg"))
[Figure: decision tree fitted to the Titanic data (891 samples). The root splits on Sex; deeper splits use Age, Pclass, SibSp and Fare, with Gini impurity, sample counts and class counts shown at each node.]
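The figure above was generated ahead of time from the Titanic data; a similar picture can be drawn with scikit-learn's own plotting helper. A minimal sketch on synthetic data (the actual Titanic table is not loaded here):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_classification(n_samples=891, n_features=7, random_state=0)

# A shallow tree keeps the plot readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(tree, filled=True)
plt.show()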

In scikit-learn

Important hyperparameters

sklearn.tree.DecisionTreeClassifier

  • criterion: method for measuring quality of a split. One of gini (for Gini impurity) or entropy (for information gain).
  • splitter: strategy for splitting. One of best (for optimizing the criterion) or random (for picking a random splitting point).
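For instance, the two threshold strategies from above map onto splitter; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Strategy 2: optimize each split (the default).
best = DecisionTreeClassifier(criterion="gini", splitter="best").fit(X, y)

# Strategy 1: pick the splitting point at random.
rand = DecisionTreeClassifier(criterion="entropy", splitter="random", random_state=0).fit(X, y)

print(best.score(X, y), rand.score(X, y))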

sklearn.tree.DecisionTreeRegressor

  • criterion: method for measuring the quality of a split. One of mse (minimize mean squared error, i.e. variance), friedman_mse (a variant of mse) or mae (minimize mean absolute error, i.e. deviation from the median). Recent scikit-learn releases rename mse and mae to squared_error and absolute_error.
  • splitter: as above
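A corresponding regressor sketch on synthetic data (using the newer criterion names, so it assumes a recent scikit-learn release):

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)

# "squared_error" is the newer name for "mse"; "absolute_error" the newer name for "mae".
model = DecisionTreeRegressor(criterion="squared_error", splitter="best", max_depth=5)
model.fit(X, y)
print(model.score(X, y))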