$k$-NN is versatile and often performs well, but it stores and searches the entire training dataset every time it predicts. Notably, the Tabular Playground data has 288 columns ($d$) and 200'000 rows ($n$).
Time complexity for brute-force $k$-NN on $m$ predictions is $O(nmd)$. This can be sped up by storing the data in a geometric lookup tree. This increases training time to $O(dn\log n)$, keeps training-data storage at $O(dn)$, and reduces prediction time to $O(m\log n)$.
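A back-of-the-envelope operation count (ignoring constant factors, and assuming we make $m = n$ predictions on data of the sizes above) shows why the tree pays off at this scale:

```python
import math

# Sizes matching the Tabular Playground data; m is an assumption.
n, d = 200_000, 288      # training rows, features
m = 200_000              # number of predictions

brute_force = n * m * d              # O(nmd): distances to every training point
tree_build = d * n * math.log2(n)    # O(dn log n): one-time training cost
tree_query = m * math.log2(n)        # O(m log n): prediction cost

print(f"brute force:        {brute_force:.2e} ops")
print(f"tree build + query: {tree_build + tree_query:.2e} ops")
```

The tree's one-time build cost is dwarfed by the savings at prediction time.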
Based on the speeds of training and predicting, sklearn.neighbors.KNeighborsClassifier seems to be doing brute force. We can change this by picking an algorithm with model = KNeighborsClassifier(algorithm="kd_tree") or model = KNeighborsClassifier(algorithm="ball_tree").
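A minimal sketch of switching backends, on synthetic data (the shapes here are illustrative, not the competition dataset); all three backends return the same neighbors, only the search strategy differs:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # small synthetic dataset
y = (X[:, 0] > 0).astype(int)

for algorithm in ("brute", "kd_tree", "ball_tree"):
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X, y)
    print(algorithm, model.score(X, y))  # same predictions, different speed
```

The default, algorithm="auto", lets scikit-learn choose a backend based on the data.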
Many machine learning algorithms are quite parallelizable, which invites several possible strategies for dealing with slow computations.
Subsampling to train on 10% of the data gave me a much quicker result: after 10 minutes of computation I scored 0.82119. That placed me 436th of 474 and did not beat Kaggle's benchmark solution.
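A sketch of the subsampling idea, on synthetic stand-in data (the 10% fraction mirrors the experiment above; the dataset and model settings here are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 10))       # synthetic stand-in for the real data
y = (X.sum(axis=1) > 0).astype(int)

# Keep a random 10% of the rows before fitting.
idx = rng.choice(len(X), size=len(X) // 10, replace=False)
model = KNeighborsClassifier(n_neighbors=5).fit(X[idx], y[idx])
print(model.score(X, y))                # accuracy evaluated on the full set
```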
Since $k$-NN (often) speeds up using search trees, we could just retain the search trees themselves, discarding the data in the trees!
Create a binary tree that segments $X$ into piecewise-constant prediction regions. At each node in the tree, split on a threshold for some predictor. Two strategies: optimize a split-quality criterion, or pick the predictor and its threshold at random.
from IPython.display import SVG
display(SVG("lecture7-titanic-dt.svg"))
scikit-learn

Important hyperparameters

sklearn.tree.DecisionTreeClassifier

- criterion: method for measuring the quality of a split. One of gini (for Gini impurity) or entropy (for information gain).
- splitter: strategy for splitting. One of best (for optimizing the criterion) or random (for picking a random splitting point).

sklearn.tree.DecisionTreeRegressor

- criterion: method for measuring the quality of a split. One of mse (minimize mean squared error, i.e. variance), friedman_mse (a variation of mse), or mae (minimize mean absolute error, i.e. deviation from the median).
- splitter: as above.