$k$-NN is versatile and often performs well, but it stores the entire training dataset and searches through it every time it predicts. Notably, the Tabular Playground data has 288 columns ($d$) and 200'000 rows ($n$).
Time complexity for brute-force $k$-NN on $m$ predictions is $O(nmd)$. This can be sped up by storing the data in a geometric lookup tree: training time increases to $O(dn\log n)$, training data storage is $O(dn)$, and prediction time drops to roughly $O(dm\log n)$ in favorable (low-dimensional) cases.
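To get a feel for these numbers, here is a back-of-the-envelope estimate; the test-set size $m$ used below is purely illustrative, not the actual Tabular Playground test size.

n, d = 200_000, 288          # training rows and columns
m = 100_000                  # hypothetical number of predictions (illustrative only)
log_n = 18                   # log2(200_000) is roughly 17.6

brute_force = n * m * d      # O(nmd) distance-coordinate operations
tree_build = d * n * log_n   # O(dn log n) to build the tree
tree_query = d * m * log_n   # O(dm log n) to answer the queries, ignoring constants

print(f"brute force: ~{brute_force:.1e} operations")  # ~5.8e+12
print(f"tree build:  ~{tree_build:.1e} operations")   # ~1.0e+09
print(f"tree query:  ~{tree_query:.1e} operations")   # ~5.2e+08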
Judging by the speeds of training and predicting, sklearn.neighbors.KNeighborsClassifier seems to be using brute force here.
We can change this by picking an algorithm with model = KNeighborsClassifier(algorithm="kd_tree") or model = KNeighborsClassifier(algorithm="ball_tree").
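For example (a sketch: X_train, y_train and X_test are placeholder names for the Tabular Playground features, labels and test rows):

from sklearn.neighbors import KNeighborsClassifier

# Ask for a k-d tree instead of brute-force search.
model = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
model.fit(X_train, y_train)          # builds the tree
predictions = model.predict(X_test)  # answers queries by tree lookups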
Many Machine Learning algorithms are quite parallelizable: this invites several possible strategies to deal with slow computations.
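In scikit-learn, the simplest of these is the n_jobs parameter, which spreads the neighbor queries over CPU cores; a minimal sketch:

from sklearn.neighbors import KNeighborsClassifier

# n_jobs=-1 uses all available CPU cores for the neighbor queries.
model = KNeighborsClassifier(algorithm="kd_tree", n_jobs=-1)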
Subsampling, i.e. training on only 10% of the data, gave me a much quicker result: after 10 minutes of computation I scored 0.82119. That placed me at 436/474 and did not beat Kaggle's benchmark solution.
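Such a subsample can be drawn in one line (a sketch, assuming the training data is a pandas DataFrame named train with a label column named target; both are placeholder names):

from sklearn.neighbors import KNeighborsClassifier

# Keep a random 10% of the training rows.
train_small = train.sample(frac=0.10, random_state=42)
X_small = train_small.drop(columns="target")
y_small = train_small["target"]

model = KNeighborsClassifier(algorithm="kd_tree", n_jobs=-1)
model.fit(X_small, y_small)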
Since $k$-NN (often) speeds up when using search trees, we could retain just the search trees themselves and discard the data stored in them!
Create a binary tree to segment $X$ into piecewise constant prediction segments.
For each node in the tree, pick a predictor at random, and put in a threshold for that predictor.
Two strategies for placing the threshold: optimize some measure of split quality, or pick a random splitting point (these correspond to the splitter options described below).
from IPython.display import SVG
display(SVG("lecture7-titanic-dt.svg"))
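A minimal sketch of the construction above (the random-threshold variant; build_tree and predict_one are made-up names, and this is not what scikit-learn does internally):

import numpy as np

def build_tree(X, y, depth=0, max_depth=5, rng=np.random.default_rng(0)):
    # Leaf: stop when the node is pure or the depth limit is reached,
    # and predict the majority class (piecewise constant prediction).
    if depth == max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    j = rng.integers(X.shape[1])                   # pick a predictor at random
    t = rng.uniform(X[:, j].min(), X[:, j].max())  # put in a threshold for it
    mask = X[:, j] <= t
    if mask.all() or (~mask).all():                # degenerate split: make a leaf instead
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, rng),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, rng)}

def predict_one(tree, x):
    # Walk from the root to a leaf by comparing against thresholds.
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

Predictions are then piecewise constant: every row that ends up in the same leaf gets the same class.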
scikit-learn
Important hyperparameters

sklearn.tree.DecisionTreeClassifier
criterion: method for measuring quality of a split. One of gini (for Gini impurity) or entropy (for information gain).
splitter: strategy for splitting. One of best (for optimizing the criterion) or random (for picking a random splitting point).

sklearn.tree.DecisionTreeRegressor
criterion: method for measuring quality of a split. One of mse (minimize mean square error, i.e. variance), friedman_mse (a variation of mse) or mae (minimize mean absolute error, i.e. mean deviation from the median).
splitter: as above.
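For the Tabular Playground task, using these hyperparameters might look like the following sketch (X_train, y_train, X_valid and y_valid are placeholder names for a train/validation split):

from sklearn.tree import DecisionTreeClassifier

# Gini impurity with the optimizing splitter.
model = DecisionTreeClassifier(criterion="gini", splitter="best")
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))  # accuracy on the held-out split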