Second Competition: NYC Cab Trip Durations

Our second competition will be https://www.kaggle.com/c/nyc-taxi-trip-duration. This competition was held about 4 years ago and remains accessible for learning purposes: you can compare your performance to the leaderboard as it stood when the competition ended.

We are using this competition for a few reasons:

  1. It is a regression competition.
  2. It is about feature selection and engineering.
  3. It is about augmenting data sets.

The competition very explicitly encouraged people to create and publish augmenting datasets to improve everybody's predictions. Hence people have published data on weather conditions, gasoline prices, GPS routes, and traffic accidents, all of which could be folded into your own predictions.

Regression, not Classification

In this competition, we are predicting a continuous value: we are solving a regression problem, where previously we were working on a classification problem. Many of the models we have already seen work here as well, often by making piecewise constant predictions. The scikit-learn convention is that the model names end in Regressor instead of Classifier; a short example follows the list below.

  • ensemble.AdaBoostRegressor
  • ensemble.ExtraTreesRegressor
  • ensemble.RandomForestRegressor
  • linear_model.LinearRegression
  • neighbors.KNeighborsRegressor
  • svm.SVR
  • tree.DecisionTreeRegressor
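As a minimal sketch of that convention, here is one of the tree models fit to a made-up continuous target (the data and parameters are purely illustrative):

import numpy as np
from sklearn import tree

# Toy continuous target: same estimator family as before, Regressor instead of Classifier.
X = np.random.rand(200, 2)
y = 3600 * X[:, 0] + 600 * X[:, 1]   # a made-up "trip duration" in seconds

reg = tree.DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict(X[:5]))            # piecewise constant predictions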

Parallelization in scikit-learn

We discussed earlier how to use xgboost with tree_method="gpu_hist" to draw on GPU parallelization.

The standard Kaggle allocation is 4 CPU cores. Every scikit-learn model that supports parallelization accepts the argument n_jobs=-1 to use all the processor cores it can find. Using this, you should be able to get CPU usage up to 400% during training and prediction.
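A quick sketch of the flag on one of the ensemble regressors from the list above (the data and n_estimators value are made up for illustration):

import numpy as np
from sklearn import ensemble

# n_jobs=-1 lets the forest use every core Kaggle provides.
X = np.random.rand(1000, 5)
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])

model = ensemble.RandomForestRegressor(n_estimators=200, n_jobs=-1)
model.fit(X, y)                  # training fans out over all available cores
predictions = model.predict(X)   # prediction is parallelized as well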

GLM and transformed targets

A good first guess might be to use model = linear_model.LinearRegression(). Here, however, that will fail the scoring step: a linear regression can (and on this data does) produce negative predictions, and the logarithmic error metric used for scoring cannot be computed for negative values.

Taxi cab trips do not have negative trip durations.

There are several ways to approach this discrepancy; most of them put the predictions through some function to adjust the prediction domain. We have seen this already: recall that Logistic Regression is a linear regression put through the "logistic function": $$ \hat{y} = \frac{1}{1+e^{-(X\beta+\beta_0)}} $$
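For strictly positive targets such as trip durations, the analogous trick would be to exponentiate the linear predictor, since the exponential function only produces positive values: $$ \hat{y} = e^{X\beta+\beta_0} > 0 $$ This is precisely the inverse of the $\ln(\mu)$ link that shows up in the GLM table below.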

GLM

Generalized Linear Models are a classical technique for extending linear regression to a wider range of applications.

Linear Regression here means that for a vector $X$ of independent variables and a vector $\beta$ of model coefficients, we have a probabilistic model for the response variable given as $$ [Y | X] \sim \mathcal{N}(X\beta, \sigma^2) $$

Generalized Linear Models extend this in two ways: by allowing the response to follow some distribution $\mathcal{P}$ other than the normal, and by introducing a link function $g$, so that instead $$ [Y | X] \sim \mathcal{P}(g^{-1}(X\beta)) $$

Logistic Regression uses the link function $g(\mu) = \ln(\mu/(1-\mu))$, the logit; its inverse $g^{-1}$ is the usual logistic function.

GLM for new domains

| Domain of response | Distribution | Link $g(\mu)$ | Use case |
| --- | --- | --- | --- |
| $\mathbb{R}$ | Normal | identity | linear response data |
| $(0,\infty)$ | Exponential | $-1/\mu$ | exponential response data, scale parameters |
| $(0,\infty)$ | Gamma | $-1/\mu$ | exponential response data, scale parameters |
| $(0,\infty)$ | Inverse Gaussian | $1/\mu^2$ | |
| $\mathbb{N}$ | Poisson | $\ln(\mu)$ | counts of occurrences in a fixed amount of time/space |
| $\{0,1\}$ | Bernoulli | logit | outcome of a single yes/no occurrence |
| $[N]$ | Binomial | logit | count of "yes" outcomes in $N$ yes/no occurrences |

GLM in scikit-learn

scikit-learn implements several GLM models:

  • linear_model.LinearRegression for the Normal distribution with the identity link function
  • linear_model.LogisticRegression for the Bernoulli distribution with the logit link function
  • Normal/identity, Poisson/$\ln(\mu)$, Gamma/$\ln(\mu)$ and Inverse Gaussian/$\ln(\mu)$ are all special cases of something called a Tweedie distribution GLM regressor (a short usage sketch follows this list):
    • linear_model.PoissonRegressor for the Poisson case
    • linear_model.GammaRegressor for the Gamma case
    • linear_model.TweedieRegressor with the power parameter to choose freely among these options.
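A minimal sketch of these regressors on made-up count data (data and parameters are purely illustrative):

import numpy as np
from sklearn import linear_model

# Toy count-valued target driven by two features.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = rng.poisson(np.exp(1.5 * X[:, 0] + 0.5 * X[:, 1]))

# PoissonRegressor uses the log link, so its predictions are always positive.
glm = linear_model.PoissonRegressor().fit(X, y)
print(glm.predict(X[:5]))

# The same model through the Tweedie interface: power=1 selects the Poisson case.
glm2 = linear_model.TweedieRegressor(power=1, link="log").fit(X, y)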

Link functions and transformed target regression

The approach with the link function can be used even if your regressor is not linear!

scikit-learn provides compose.TransformedTargetRegressor to build a model that is trained on predicting $g(y)$ instead of $y$, and then automatically returns $g^{-1}(f(X))$ as its predictions. This way you can transform the target domain the same way as with GLMs, but use it with arbitrary regressors.

The resulting code could be something like

import numpy as np
from sklearn import compose, model_selection, pipeline, preprocessing, svm

# params (hyperparameter distributions to sample) and rmsle (the competition's
# scorer) are assumed to be defined elsewhere in the notebook.
model = model_selection.RandomizedSearchCV(
    compose.TransformedTargetRegressor(
        regressor=pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            svm.LinearSVR()),
        func=np.log, inverse_func=np.exp),
    params, scoring=rmsle, n_jobs=-1)
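Note that with func=np.log and inverse_func=np.exp the inner regressor is trained on log-durations, which lines up nicely with the logarithmic error metric used for scoring.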

Feature Engineering and merging data

If you look through my notebook, you'll notice I generate a LOT of extra features in the datasets: I create a column with the (kilometer) distance between start and end point, and columns for weekday, week of the year, hour of the day, minute of the hour, hour of the week and even estimate traffic speed in the training data.

...and then I don't use a bunch of them.

Generating this kind of derived data is massively important in Machine Learning: figuring out the right transformations that make something suddenly separable, or linear, or otherwise easy to model makes all the difference in the world.
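As a rough sketch of such derived features (assuming the competition's column names such as pickup_datetime and the pickup/dropoff coordinates; the details in my notebook differ):

import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def add_features(df):
    t = pd.to_datetime(df["pickup_datetime"])
    df["distance_km"] = haversine_km(df["pickup_longitude"], df["pickup_latitude"],
                                     df["dropoff_longitude"], df["dropoff_latitude"])
    df["weekday"] = t.dt.weekday
    df["week_of_year"] = t.dt.isocalendar().week.astype(int)
    df["hour"] = t.dt.hour
    df["minute"] = t.dt.minute
    df["hour_of_week"] = 24 * df["weekday"] + df["hour"]
    return df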

Next steps in feature engineering here would be to interpolate the traffic speed entries and add them as a feature to the test set. Maybe use a $k$-Nearest Neighbors regressor on just hour-of-the-week or hour-and-minute to transfer the information?
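One possible sketch of that idea, with toy stand-ins for the hour_of_week and traffic_speed columns (in practice they would come from the feature engineering above):

import numpy as np
import pandas as pd
from sklearn import neighbors

# Toy stand-ins for the derived training-set columns.
train = pd.DataFrame({"hour_of_week": np.arange(168),
                      "traffic_speed": 20 + 10 * np.random.rand(168)})
test = pd.DataFrame({"hour_of_week": np.random.randint(0, 168, size=50)})

# Learn speed as a function of hour-of-the-week, then transfer it to the test rows.
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
knn.fit(train[["hour_of_week"]], train["traffic_speed"])
test["traffic_speed"] = knn.predict(test[["hour_of_week"]])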