Regression and regularization

Recall that Linear Regression picks out the coefficients $\beta=(\beta_0,\dots,\beta_p)$ minimizing $\|X\beta-y\|_2^2$. This optimization problem has a closed-form solution: we can write down a formula for $\beta$ in terms of $X$ and $y$ directly, namely $\hat\beta=(X^\top X)^{-1}X^\top y$. The model is linear in the sense that no two $\beta_i$ are multiplied with each other; the matrix $X$ itself can contain products and powers of the predictors.
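As a quick sanity check, here is a minimal sketch (on made-up data) comparing the normal-equations solution with numpy's generic least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # made-up design matrix (a column of ones would give the intercept)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Closed-form solution via the normal equations
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))    # True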

When predictors are correlated, coefficients can easily blow up. For an extreme case, assume $x_1=x_2$. Then only the sum of the two coefficients matters: the linear combinations $x_1+0\cdot x_2$ and $101x_1-100x_2$ give identical predictions, so arbitrarily large coefficients can cancel each other out without changing the fit.
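To make this concrete, a tiny sketch on synthetic data, showing that two very different coefficient vectors give exactly the same predictions once a predictor is duplicated:

import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])          # x2 is an exact copy of x1

small = np.array([1.0, 0.0])           # modest coefficients
large = np.array([101.0, -100.0])      # huge coefficients that cancel

# Only the sum of the two coefficients matters, so the predictions agree
print(np.allclose(X @ small, X @ large))    # True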

Two main ways to deal with exploding coefficients:

  1. Avoid collinearity in the predictors
  2. Use regularization to restrict coefficients.

Ridge Regression - $L_2$ penalties

Ridge Regression adapts the optimization problem behind linear regression to also take the size of the coefficients into account: the ridge coefficients are the $\beta$ that minimize $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$, where $\|x\|_2^2$ is the sum of squares of the entries of $x$.

As $\alpha$ grows, the penalty pulls the coefficients toward zero, and the fit becomes more robust to collinearity (at the cost of some added bias).
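A small sketch on synthetic, nearly collinear data, illustrating the shrinkage: the coefficient norm decreases as alpha grows:

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

for alpha in [0.001, 0.1, 10, 1000]:
    coef = linear_model.Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(np.linalg.norm(coef), 4))   # norm shrinks as alpha grows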

In scikit-learn

Ridge Regression is implemented in linear_model.Ridge and takes a parameter alpha to set the balance between the two terms. Instead of explicitly setting up a grid search to pick alpha, you can use linear_model.RidgeCV, which has built-in cross-validation over a list of alpha values.
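A minimal usage sketch, assuming a feature matrix X and target y:

import numpy as np
from sklearn import linear_model

ridge = linear_model.Ridge(alpha=10).fit(X, y)                 # fixed penalty
ridge_cv = linear_model.RidgeCV(alphas=np.logspace(-3, 3, 7)).fit(X, y)
print(ridge_cv.alpha_)                                         # alpha chosen by cross-validation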

When combining with compose.TransformedTargetRegressor (and a custom scorer), you may still end up needing to set up a GridSearchCV by hand, as sketched below.
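A hand-rolled version might look like this sketch. The rmsle scorer here is an assumption (built with make_scorer); X and y are assumed to exist, with a positive-valued target so that np.log applies.

import numpy as np
from sklearn import compose, linear_model, metrics, model_selection

# RMSLE as a scorer; lower is better, hence greater_is_better=False
rmsle = metrics.make_scorer(
    lambda y_true, y_pred: np.sqrt(metrics.mean_squared_log_error(y_true, y_pred)),
    greater_is_better=False)

model = model_selection.GridSearchCV(
    compose.TransformedTargetRegressor(
        regressor=linear_model.Ridge(),
        func=np.log, inverse_func=np.exp),
    param_grid={"regressor__alpha": np.logspace(-3, 3, 7)},
    scoring=rmsle, cv=5)
model.fit(X, y)
print(model.best_params_)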

Ridge Regression with alpha=10 inside a TransformedTargetRegressor lowered my RMSLE from 0.76346 to 0.68627.

Feature Selection - different strategies

With many features, varying greatly in how informative they are, it can be very effective to remove confusing or noisy ones.

Some methods include:

  1. Remove low-variance features. If a feature is almost constant, it won't help us differentiate between observations. feature_selection.VarianceThreshold
  2. Use statistical tests to score features and keep the best ones. For regression, scikit-learn provides score functions based on the F-test and on mutual information. feature_selection.SelectKBest or feature_selection.SelectPercentile with f_regression or mutual_info_regression (see the sketch after this list).
  3. Some ML models can estimate feature importance scores; these can be used to prune low-informative features. feature_selection.SelectFromModel with e.g. ensemble.ExtraTreesRegressor
  4. Iteratively either add (forward-SFS) or remove (backward-SFS) features based on some cross-validated quality score. feature_selection.SequentialFeatureSelector
  5. Use regularization to encourage a (linear) model to produce sparse solutions. linear_model.Lasso.
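A sketch of two of these strategies, again assuming a feature matrix X and target y:

from sklearn import ensemble, feature_selection

# Keep the 10% of features with the best F-test scores
kbest = feature_selection.SelectPercentile(
    score_func=feature_selection.f_regression, percentile=10)
X_kbest = kbest.fit_transform(X, y)

# Keep the features a tree ensemble considers important
from_model = feature_selection.SelectFromModel(
    ensemble.ExtraTreesRegressor(n_estimators=100, random_state=0))
X_model = from_model.fit_transform(X, y)

print(X.shape[1], X_kbest.shape[1], X_model.shape[1])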

Lasso - $L_1$-regularized linear regression

Instead of using $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$, we could use \[ \frac{1}{2n}\|X\beta-y\|_2^2+\alpha\|\beta\|_1 \] where $\|\beta\|_1=\sum_j|\beta_j|$. The factor $1/2n$ is added to make the sum-of-squares and the sum of absolute values comparable to each other.

The effect is that a Lasso model is encouraged to set many coefficients exactly to 0, effectively ignoring features that do not contribute much to the prediction.

In scikit-learn

Lasso is available as linear_model.Lasso. Cross-validation to set $\alpha$ is exposed in linear_model.LassoCV; the same issues with custom scorers and target transformations as with Ridge regression apply here.
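A sketch showing both the sparsity and the built-in cross-validation (X and y assumed as before):

from sklearn import linear_model

lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
# Coefficients driven exactly to zero (how many depends on alpha and the data)
print((lasso.coef_ == 0).sum(), "of", lasso.coef_.size, "coefficients are zero")

lasso_cv = linear_model.LassoCV(cv=5).fit(X, y)   # picks alpha along a regularization path
print(lasso_cv.alpha_)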

Lasso lowered my RMSLE from 0.76346 to 0.68622 (compared to 0.68627 for Ridge).

Non-linearity in linear models

What if the features do not relate linearly to the target? We can handle that by adding new features to the data that represent powers and products of the existing ones.

This approach is called Polynomial Regression; scikit-learn implements the feature expansion as preprocessing.PolynomialFeatures, whose output you feed into an ordinary linear model.

This can grow the feature set dramatically as the degree increases: with $p$ input features and degree $d$, the expansion has $\binom{p+d}{d}$ columns (including the bias column).
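A quick look at the blow-up, using 10 input features as an example:

import numpy as np
from sklearn import preprocessing

X_demo = np.random.default_rng(3).normal(size=(5, 10))   # 10 features
for degree in [2, 3, 4]:
    poly = preprocessing.PolynomialFeatures(degree=degree)
    print(degree, poly.fit_transform(X_demo).shape[1])
# With the bias column included: degree 2 -> 66 columns, 3 -> 286, 4 -> 1001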

Putting it all together

# Imports for the full pipeline search; rmsle is the RMSLE scorer built earlier
import numpy as np
from sklearn import (compose, feature_selection, linear_model,
                     model_selection, pipeline, preprocessing)

params = {
    "regressor__poly__degree": [2],
    "regressor__features__score_func": [feature_selection.f_regression],
    "regressor__features__percentile": [5, 10, 15],
    "regressor__lm__alpha": np.logspace(-3, 3, 7),
}
model = model_selection.RandomizedSearchCV(
    compose.TransformedTargetRegressor(
        regressor=pipeline.Pipeline([
            ("features", feature_selection.SelectPercentile()),   # keep the best-scoring features
            ("scaler", preprocessing.StandardScaler()),           # standardize before penalizing
            ("poly", preprocessing.PolynomialFeatures()),         # powers and products of features
            ("lm", linear_model.Lasso()),                         # L1-regularized linear model
        ]),
        func=np.log, inverse_func=np.exp),   # fit on log(target), predict back on the original scale
    params, scoring=rmsle, cv=5, n_jobs=-1)

It turned out that feature_selection.mutual_info_regression took up a LOT of memory and had to be excluded to fit the pipeline into 16 GB.

This dropped my score to 0.67835, from the previous 0.76346 and 0.68622.