Second Competition: NYC Cab Trip Durations

Our second competition will be https://www.kaggle.com/c/nyc-taxi-trip-duration. This competition was held about 4 years ago and remains accessible for learning purposes: you can compare your performance to the leaderboard as it stood when the competition ended.

We are using this competition for a few reasons:

  1. It is a regression competition.
  2. It is about feature selection and engineering.
  3. It is about augmenting data sets.

The competition very explicitly encouraged people to create and publish augmenting datasets to improve everybody's predictions. Hence people have published data on weather conditions, gasoline prices, GPS routes, and traffic accidents, all of which could be folded into your own predictions.

Regression, not Classification

In this competition, we are predicting a continuous value: we are solving a regression problem, where previously we were working on a classification problem. Many of the models we have already seen work here as well, often by making piecewise constant predictions. The scikit-learn convention is that the model names end in Regressor instead of Classifier; a short example follows the list below.

  • ensemble.AdaBoostRegressor
  • ensemble.ExtraTreesRegressor
  • ensemble.RandomForestRegressor
  • linear_model.LinearRegression
  • neighbors.KNeighborsRegressor
  • svm.SVR
  • tree.DecisionTreeRegressor
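As a minimal sketch of that convention, here is one of the tree models fit to a made-up continuous target (the data and parameters are purely illustrative):

import numpy as np
from sklearn import tree

# Toy continuous target: same estimator family as before, Regressor instead of Classifier.
X = np.random.rand(200, 2)
y = 3600 * X[:, 0] + 600 * X[:, 1]   # a made-up "trip duration" in seconds

reg = tree.DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict(X[:5]))            # piecewise constant predictions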

Parallelization in scikit-learn

We discussed earlier how to use xgboost with tree_method="gpu_hist" to draw on GPU parallelization.

The standard Kaggle allocation is 4 CPU cores. Every scikit-learn model that supports parallelization accepts the argument n_jobs=-1 to use all the processor cores it can find. Using this, you should be able to get CPU usage up to 400% during training and prediction.
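A quick sketch of the flag on one of the ensemble regressors from the list above (the data and n_estimators value are made up for illustration):

import numpy as np
from sklearn import ensemble

# n_jobs=-1 lets the forest use every core Kaggle provides.
X = np.random.rand(1000, 5)
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])

model = ensemble.RandomForestRegressor(n_estimators=200, n_jobs=-1)
model.fit(X, y)                  # training fans out over all available cores
predictions = model.predict(X)   # prediction is parallelized as well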

GLM and transformed targets

A good first guess might be to use model = linear_model.LinearRegression(). Here, however, that will fail the scoring step: a linear regression can (and on this data does) produce negative predictions, and the logarithmic error metric used for scoring cannot be computed for negative values.

Taxi cab trips do not have negative trip durations.

There are several ways to approach this discrepancy; most of them put the predictions through some function to adjust the prediction domain. We have seen this already: recall that Logistic Regression is a linear regression put through the "logistic function": $$ \hat{y} = \frac{1}{1+e^{-(X\beta+\beta_0)}} $$
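For strictly positive targets such as trip durations, the analogous trick would be to exponentiate the linear predictor, since the exponential function only produces positive values: $$ \hat{y} = e^{X\beta+\beta_0} > 0 $$ This is precisely the inverse of the $\ln(\mu)$ link that shows up in the GLM table below.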

GLM

Generalized Linear Models are a classical technique for extending linear regression to a wider range of applications.

Linear Regression here means that for a vector $X$ of independent variables and a vector $\beta$ of model coefficients, we have a probabilistic model for the response variable given as $$ [Y | X] \sim \mathcal{N}(X\beta, \sigma^2) $$

Generalized Linear Models extend this in two ways: by allowing the response to follow some distribution $\mathcal{P}$ other than the normal, and by introducing a link function $g$, so that instead $$ [Y | X] \sim \mathcal{P}(g^{-1}(X\beta)) $$

Logistic Regression uses the link function $g(\mu) = \ln(\mu/(1-\mu))$, the logit; its inverse $g^{-1}$ is the usual logistic function.

GLM for new domains

| Domain of response | Distribution | Link $g(\mu)$ | Use case |
| --- | --- | --- | --- |
| $\mathbb{R}$ | Normal | identity | linear response data |
| $(0,\infty)$ | Exponential | $-1/\mu$ | exponential response data, scale parameters |
| $(0,\infty)$ | Gamma | $-1/\mu$ | exponential response data, scale parameters |
| $(0,\infty)$ | Inverse Gaussian | $1/\mu^2$ | |
| $\mathbb{N}$ | Poisson | $\ln(\mu)$ | counts of occurrences in a fixed amount of time/space |
| $\{0,1\}$ | Bernoulli | logit | outcome of a single yes/no occurrence |
| $[N]$ | Binomial | logit | count of "yes" outcomes in $N$ yes/no occurrences |

GLM in scikit-learn

scikit-learn implements several GLM models:

  • linear_model.LinearRegression for the Normal distribution with the identity link function
  • linear_model.LogisticRegression for the Bernoulli distribution with the logit link function
  • Normal/identity, Poisson/$\ln(\mu)$, Gamma/$\ln(\mu)$ and Inverse Gaussian/$\ln(\mu)$ are all special cases of something called a Tweedie distribution GLM regressor (a short usage sketch follows this list):
    • linear_model.PoissonRegressor for the Poisson case
    • linear_model.GammaRegressor for the Gamma case
    • linear_model.TweedieRegressor with the power parameter to choose freely among these options.
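A minimal sketch of these regressors on made-up count data (data and parameters are purely illustrative):

import numpy as np
from sklearn import linear_model

# Toy count-valued target driven by two features.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = rng.poisson(np.exp(1.5 * X[:, 0] + 0.5 * X[:, 1]))

# PoissonRegressor uses the log link, so its predictions are always positive.
glm = linear_model.PoissonRegressor().fit(X, y)
print(glm.predict(X[:5]))

# The same model through the Tweedie interface: power=1 selects the Poisson case.
glm2 = linear_model.TweedieRegressor(power=1, link="log").fit(X, y)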

Link functions and transformed target regression

The approach with the link function can be used even if your regressor is not linear!

scikit-learn provides compose.TransformedTargetRegressor to build a model that is trained on predicting $g(y)$ instead of $y$, and then automatically returns $g^{-1}(f(X))$ as its predictions. This way you can transform the target domain the same way as with GLMs, but use it with arbitrary regressors.

The resulting code could be something like

import numpy as np
from sklearn import compose, model_selection, pipeline, preprocessing, svm

# params (hyperparameter distributions to sample) and rmsle (the competition's
# scorer) are assumed to be defined elsewhere in the notebook.
model = model_selection.RandomizedSearchCV(
    compose.TransformedTargetRegressor(
        regressor=pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            svm.LinearSVR()),
        func=np.log, inverse_func=np.exp),
    params, scoring=rmsle, n_jobs=-1)
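Note that with func=np.log and inverse_func=np.exp the inner regressor is trained on log-durations, which lines up nicely with the logarithmic error metric used for scoring.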

Feature Engineering and merging data

If you look through my notebook, you'll notice I generate a LOT of extra features in the datasets: I create a column with the (kilometer) distance between start and end point, and columns for weekday, week of the year, hour of the day, minute of the hour, hour of the week and even estimate traffic speed in the training data.

...and then I don't use a bunch of them.

Generating this kind of derived data is massively important in Machine Learning: figuring out the right transformations that make something suddenly separable, or linear, or otherwise easy to model makes all the difference in the world.
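As a rough sketch of such derived features (assuming the competition's column names such as pickup_datetime and the pickup/dropoff coordinates; the details in my notebook differ):

import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def add_features(df):
    t = pd.to_datetime(df["pickup_datetime"])
    df["distance_km"] = haversine_km(df["pickup_longitude"], df["pickup_latitude"],
                                     df["dropoff_longitude"], df["dropoff_latitude"])
    df["weekday"] = t.dt.weekday
    df["week_of_year"] = t.dt.isocalendar().week.astype(int)
    df["hour"] = t.dt.hour
    df["minute"] = t.dt.minute
    df["hour_of_week"] = 24 * df["weekday"] + df["hour"]
    return df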

Next steps in feature engineering here would be to interpolate the traffic speed entries and add them as a feature to the test set. Maybe use a $k$-Nearest Neighbors regressor on just hour-of-the-week or hour-and-minute to transfer the information?
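One possible sketch of that idea, with toy stand-ins for the hour_of_week and traffic_speed columns (in practice they would come from the feature engineering above):

import numpy as np
import pandas as pd
from sklearn import neighbors

# Toy stand-ins for the derived training-set columns.
train = pd.DataFrame({"hour_of_week": np.arange(168),
                      "traffic_speed": 20 + 10 * np.random.rand(168)})
test = pd.DataFrame({"hour_of_week": np.random.randint(0, 168, size=50)})

# Learn speed as a function of hour-of-the-week, then transfer it to the test rows.
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
knn.fit(train[["hour_of_week"]], train["traffic_speed"])
test["traffic_speed"] = knn.predict(test[["hour_of_week"]])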