Recall that Linear Regression picks out the coefficients $\beta=(\beta_0,\dots,\beta_p)$ minimizing $\|X\beta-y\|_2^2$. This optimization problem has a closed-form solution: we can write down a formula for $\beta$ in terms of $X$ and $y$ directly. It is linear in the sense that no two $\beta_i$ are multiplied with each other; the matrix $X$ itself can still contain products and powers of the predictors.
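Concretely, assuming $X^\top X$ is invertible, that formula is the familiar normal-equations solution:

\[ \hat\beta=(X^\top X)^{-1}X^\top y \]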
When predictors correlate, coefficients can easily blow up. For an extreme case, assume $x_1=x_2$. Then the linear combinations $x_1-x_2$ and $100x_1-100x_2$ give identical results (both are zero), so arbitrarily large offsetting coefficients fit the data equally well.
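A small sketch of this effect (not from the original text; the exact numbers depend on the random seed, but the two fitted coefficients typically come out enormous and nearly cancelling):

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-6, size=200)   # x2 is x1 plus a whisper of noise
y = 3 * x1 + rng.normal(scale=0.1, size=200)

X = np.column_stack([x1, x2])
print(linear_model.LinearRegression().fit(X, y).coef_)   # huge, opposite-sign coefficients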
Two main ways to deal with exploding coefficients:
Ridge Regression adapts the optimization problem behind linear regression to also take the sizes of the coefficients into account: the ridge coefficients are the $\beta$ that minimize $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$ (where $\|x\|_2^2$ is the sum of squares of the entries in $x$).
As $\alpha$ grows, the penalty pulls the coefficients toward zero, which makes them more robust to collinearity.
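Like ordinary least squares, this problem has a closed-form solution (a standard fact, written here ignoring the unpenalized intercept):

\[ \hat\beta_{\text{ridge}}=(X^\top X+\alpha I)^{-1}X^\top y \]

The $\alpha I$ term keeps the matrix invertible even when $X^\top X$ is singular, which is exactly what happens in the collinear case above.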
Ridge Regression is implemented in linear_model.Ridge, and takes a parameter alpha to set the balance between the two terms. Instead of explicitly setting up a grid search to pick alpha, you can use linear_model.RidgeCV, which has built-in cross-validation for alpha.
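A minimal sketch (toy data here; in practice X and y would be the prepared features and target):

import numpy as np
from sklearn import datasets, linear_model

# Toy data standing in for the real feature matrix and target.
X, y = datasets.make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

# RidgeCV cross-validates over the candidate alphas during fit.
ridge = linear_model.RidgeCV(alphas=np.logspace(-3, 3, 7)).fit(X, y)
print(ridge.alpha_)   # the alpha that scored best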
When combining it with compose.TransformedTargetRegressor, you may still end up needing to set up a GridSearchCV by hand.
Ridge Regression with alpha=10 inside a TransformedTargetRegressor lowered my RMSLE from 0.76346 to 0.68627.
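Not the exact code from that experiment, but a sketch of that kind of setup; the rmsle scorer is hand-rolled here, and the toy data merely stands in for the real (positive-valued) target:

import numpy as np
from sklearn import compose, datasets, linear_model, metrics, model_selection

# Toy data with a positive, skewed target standing in for the real one.
X, y = datasets.make_regression(n_samples=300, n_features=20, noise=1.0, random_state=0)
y = np.exp(y / y.std())

# Custom RMSLE scorer; greater_is_better=False because lower RMSLE is better.
def rmsle_score(y_true, y_pred):
    return np.sqrt(metrics.mean_squared_log_error(y_true, y_pred))

rmsle = metrics.make_scorer(rmsle_score, greater_is_better=False)

# Ridge on a log-transformed target, with alpha chosen by grid search.
model = model_selection.GridSearchCV(
    compose.TransformedTargetRegressor(
        regressor=linear_model.Ridge(),
        func=np.log, inverse_func=np.exp),
    {"regressor__alpha": np.logspace(-3, 3, 7)},
    scoring=rmsle, cv=5).fit(X, y)

print(model.best_params_)   # e.g. {'regressor__alpha': 10.0}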
With very many features of widely varying informativeness, it can be very effective to remove confusing or noisy features.
Some methods include:
- feature_selection.VarianceThreshold
- feature_selection.SelectKBest or feature_selection.SelectPercentile with the score functions f_regression or mutual_info_regression
- feature_selection.SelectFromModel with e.g. ensemble.ExtraTreesRegressor
- feature_selection.SequentialFeatureSelector
- linear_model.Lasso

Instead of using $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$, we could use

\[ \frac{1}{2n}\|X\beta-y\|_2^2+\alpha\|\beta\|_1 \]

where $\|\beta\|_1=\sum_j|\beta_j|$. The factor $1/2n$ is added to make the sum-of-squares term and the sum-of-absolute-values penalty comparable to each other.
The effect is that a Lasso model is encouraged to set many coefficients to exactly 0, effectively ignoring features that do not contribute much to the prediction.
Lasso is available as linear_model.Lasso.
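A quick illustration of that sparsity, on synthetic data rather than the real dataset:

import numpy as np
from sklearn import datasets, linear_model

# Toy data where only 5 of the 50 features actually matter.
X, y = datasets.make_regression(n_samples=200, n_features=50, n_informative=5, noise=1.0, random_state=0)

lasso = linear_model.Lasso(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # most of the 50 coefficients come out exactly zero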
Cross-validation to set $\alpha$ is exposed in linear_model.LassoCV; the same issues with custom scorers and target transformations as with Ridge regression apply here.
Lasso lowered my score from 0.76346 to 0.68622 (compared to 0.68627 for Ridge).
Maybe the features do not relate linearly to the target? We can handle that by adding new features that represent powers and products of the existing features.
This approach is called Polynomial Regression, and scikit-learn implements it as preprocessing.PolynomialFeatures.
This can grow the feature set by a lot if you let your degree get high.
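For example (made-up dimensions, but the arithmetic is general): with 36 input features, degree 2 already produces 703 columns.

import numpy as np
from sklearn import preprocessing

X = np.random.rand(10, 36)   # 10 rows, 36 features
poly = preprocessing.PolynomialFeatures(degree=2)
print(poly.fit_transform(X).shape)   # (10, 703): 1 bias + 36 linear + 666 squares and products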
import numpy as np
from sklearn import (compose, feature_selection, linear_model,
                     model_selection, pipeline, preprocessing)

# Hyperparameter space for the randomized search.
params = {
    "regressor__poly__degree": [2],
    "regressor__features__score_func": [feature_selection.f_regression],
    "regressor__features__percentile": [5, 10, 15],
    "regressor__lm__alpha": np.logspace(-3, 3, 7),
}

# Select features, scale, expand polynomially, then fit Lasso,
# all on a log-transformed target; rmsle is the custom scorer from earlier.
model = model_selection.RandomizedSearchCV(
    compose.TransformedTargetRegressor(
        regressor=pipeline.Pipeline([
            ("features", feature_selection.SelectPercentile()),
            ("scaler", preprocessing.StandardScaler()),
            ("poly", preprocessing.PolynomialFeatures()),
            ("lm", linear_model.Lasso()),
        ]),
        func=np.log, inverse_func=np.exp),
    params, scoring=rmsle, cv=5, n_jobs=-1)
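Fitting and inspecting is then the usual search-object routine (X and y assumed to hold the prepared training data):

model.fit(X, y)
print(model.best_params_)   # the winning degree, percentile, and alpha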
It turned out that feature_selection.mutual_info_regression took up a LOT of memory, and had to be excluded to fit the pipeline into 16 GB.
This dropped my score to 0.67835, from the previous 0.76346 and 0.68622.