Expensive Computation

$k$-NN is versatile and often performs well, but it stores the entire training dataset and searches through it every time it makes a prediction. Notably, the Tabular Playground data has 288 columns ($d$) and 200,000 rows ($n$).

Time complexity for brute-force $k$-NN on $m$ predictions is $O(nmd)$. This can be sped up by storing the data in a geometric lookup tree. Building the tree increases training time to $O(dn\log n)$, keeps training-data storage at $O(dn)$, and reduces prediction time to $O(m\log n)$.
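To get a feel for the scale (assuming, hypothetically, a test set of roughly the same size as the training set, $m \approx 2\times10^5$): brute force needs on the order of $n \cdot m \cdot d \approx 2\times10^5 \cdot 2\times10^5 \cdot 288 \approx 10^{13}$ coordinate-wise distance operations, while a lookup tree needs on the order of $m\log_2 n \approx 2\times10^5 \cdot 18 \approx 4\times10^6$ node visits (ignoring the per-node distance cost in $d$ dimensions).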

Judging by the training and prediction speeds, sklearn.neighbors.KNeighborsClassifier seems to be doing brute force here. We can change this by picking an algorithm explicitly, e.g. model = KNeighborsClassifier(algorithm="kd_tree") or model = KNeighborsClassifier(algorithm="ball_tree").
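A minimal sketch of forcing the tree-based index, using a small synthetic stand-in for the Tabular Playground table (the real data is not loaded here):

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the real 200,000 x 288 table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The default algorithm="auto" lets scikit-learn choose; force a tree index instead.
model = KNeighborsClassifier(algorithm="kd_tree")  # or algorithm="ball_tree"
model.fit(X, y)
print(model.predict(X[:5]))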

Strategies for slow computations

Many Machine Learning algorithms are quite parallelizable: this invites several possible strategies to deal with slow computations.

  • Subsample - work only on a subset of the training data.
  • Parallelize - spread the work onto several CPUs or onto a GPU. This usually requires switching to a more advanced library (see the sketch after this list).
  • Switch algorithms.
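For the parallelize option, the lightest-weight version stays inside scikit-learn: KNeighborsClassifier accepts an n_jobs argument that spreads the neighbour queries over CPU cores (using a GPU really would require a different library). A minimal sketch, again on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 uses all available CPU cores for the neighbour queries.
model = KNeighborsClassifier(n_jobs=-1)
model.fit(X, y)
model.predict(X[:100])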

Subsampling to train on 10% of the data gave me a much quicker result: after 10 minutes of computation I scored 0.82119. That placed me at 436th out of 474 and did not beat Kaggle's benchmark solution.
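A sketch of the subsampling step (the 10% fraction carries over, but the synthetic data below is just a placeholder for the real table):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# Train on a random 10% of the rows.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 10, replace=False)

model = KNeighborsClassifier()
model.fit(X[idx], y[idx])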

If you want trees, I'll give you trees!

Since $k$-NN is (often) sped up using search trees, we could simply keep the search trees themselves and discard the training points stored in them!

Decision Trees

Create a binary tree that segments $X$ into regions with piecewise constant predictions.

For each node in the tree, pick a predictor at random and choose a threshold for that predictor.

Two strategies:

  1. Pick threshold at random between min and max
  2. Pick threshold to optimize the split
In [4]:
from IPython.display import SVG
display(SVG("lecture7-titanic-dt.svg"))
[Figure: decision tree fitted to the Titanic data (891 samples). The root splits on Sex; deeper splits use Age, Pclass, SibSp and Fare, with Gini impurity, sample counts and class counts shown at each node.]
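The figure above was generated ahead of time from the Titanic data; a similar picture can be drawn with scikit-learn's own plotting helper. A minimal sketch on synthetic data (the actual Titanic table is not loaded here):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_classification(n_samples=891, n_features=7, random_state=0)

# A shallow tree keeps the plot readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(tree, filled=True)
plt.show()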

In scikit-learn

Important hyperparameters

sklearn.tree.DecisionTreeClassifier

  • criterion: method for measuring quality of a split. One of gini (for Gini impurity) or entropy (for information gain).
  • splitter: strategy for splitting. One of best (for optimizing the criterion) or random (for picking a random splitting point).
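For instance, the two threshold strategies from above map onto splitter; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Strategy 2: optimize each split (the default).
best = DecisionTreeClassifier(criterion="gini", splitter="best").fit(X, y)

# Strategy 1: pick the splitting point at random.
rand = DecisionTreeClassifier(criterion="entropy", splitter="random", random_state=0).fit(X, y)

print(best.score(X, y), rand.score(X, y))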

sklearn.tree.DecisionTreeRegressor

  • criterion: method for measuring the quality of a split. One of mse (minimize mean squared error, i.e. variance), friedman_mse (a variant of mse) or mae (minimize mean absolute error, i.e. deviation from the median). Recent scikit-learn releases rename mse and mae to squared_error and absolute_error.
  • splitter: as above
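A corresponding regressor sketch on synthetic data (using the newer criterion names, so it assumes a recent scikit-learn release):

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)

# "squared_error" is the newer name for "mse"; "absolute_error" the newer name for "mae".
model = DecisionTreeRegressor(criterion="squared_error", splitter="best", max_depth=5)
model.fit(X, y)
print(model.score(X, y))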