Recall that Linear Regression picks out coefficients $\beta=(\beta_0,\dots,\beta_p)$ minimizing $\|X\beta-y\|_2^2$. This optimization problem has a closed-form solution: we can write down a formula for $\beta$ in terms of $X$ and $y$ directly. The model is linear in the sense that no two $\beta_i$ are multiplied with each other; the matrix $X$ itself may still contain products and powers of the predictors.
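As a quick illustration of the closed form, here is a sketch on synthetic data (names and numbers are made up): the normal-equations formula $\beta=(X^TX)^{-1}X^Ty$ agrees with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data: an intercept column plus two predictors (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.01 * rng.normal(size=50)

# The closed form comes from the normal equations: beta = (X^T X)^{-1} X^T y.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem with better
# numerical stability.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_closed)  # close to beta_true
```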
When predictors correlate, coefficients can easily blow up. For an extreme case, assume $x_1=x_2$. Then the linear combinations $x_1$ and $101x_1-100x_2$ give identical predictions, so the coefficients are not uniquely determined.
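A quick numeric check of this degeneracy (synthetic data; the specific coefficient pairs are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1.copy()  # x1 == x2 exactly
X = np.column_stack([x1, x2])

# Two very different coefficient vectors give the same predictions,
# so least squares has no way to choose between them.
pred_a = X @ np.array([1.0, 0.0])
pred_b = X @ np.array([101.0, -100.0])
print(np.allclose(pred_a, pred_b))  # True
```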
Two main ways to deal with exploding coefficients:
Ridge Regression adapts the optimization problem that produces linear regression to also take the sizes of the coefficients into account. Ridge regression coefficients are the $\beta$ that minimize $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$ (where $\|x\|_2^2$ is the sum of squares of the entries of $x$).
As $\alpha$ grows, the coefficients are shrunk toward zero, which makes the fit more robust to collinearity (at the cost of some bias).
Ridge Regression is implemented in `linear_model.Ridge` and takes a parameter `alpha` to set the balance. Instead of explicitly setting up a grid search to pick `alpha`, you can use `linear_model.RidgeCV`, which has built-in cross-validation for `alpha`.
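A sketch of how this looks in practice, on synthetic, nearly collinear data (the alpha grid is an arbitrary choice):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)  # nearly collinear pair
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

# RidgeCV tries each candidate alpha with built-in cross-validation and
# keeps the coefficients at a sensible scale despite the collinearity.
ridge = linear_model.RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(ridge.alpha_, ridge.coef_)
```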
When combining with `compose.TransformedTargetRegressor`, you may still end up needing to set up a `GridSearchCV` by hand.
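A minimal sketch of such a hand-built search (synthetic data; the grid and CV settings are illustrative):

```python
import numpy as np
from sklearn import compose, linear_model, model_selection

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
# A positive target, so that log/exp is a valid transformation pair.
y = np.exp(X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200))

# Ridge on a log-transformed target; alpha is tuned by an explicit grid
# search, since RidgeCV's built-in selection would score in the
# transformed space rather than against the original target.
search = model_selection.GridSearchCV(
    compose.TransformedTargetRegressor(
        regressor=linear_model.Ridge(),
        func=np.log, inverse_func=np.exp),
    {"regressor__alpha": np.logspace(-3, 3, 7)},
    cv=5)
search.fit(X, y)
print(search.best_params_)
```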
Ridge Regression with `alpha=10` inside a `TransformedTargetRegressor` lowered my RMSLE from 0.76346 to 0.68627.
With many, many features that vary greatly in how informative they are, it can be very effective to remove confusing or noisy features.
Some methods include:

- `feature_selection.VarianceThreshold`
- `feature_selection.SelectKBest` or `feature_selection.SelectPercentile` with score functions such as `f_regression` or `mutual_info_regression`
- `feature_selection.SelectFromModel` with, e.g., `ensemble.ExtraTreesRegressor`
- `feature_selection.SequentialFeatureSelector`
- `linear_model.Lasso`
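For instance, `SelectPercentile` with `f_regression` might be used like this (synthetic data in which only the first two of twenty features matter):

```python
import numpy as np
from sklearn import feature_selection

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
# Only columns 0 and 1 actually drive the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Keep the 10% of features with the strongest univariate F-statistic.
selector = feature_selection.SelectPercentile(
    score_func=feature_selection.f_regression, percentile=10)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape, selector.get_support().nonzero()[0])  # (300, 2) [0 1]
```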
Instead of minimizing $\|X\beta-y\|_2^2+\alpha\|\beta\|_2^2$, we could minimize \[ \frac{1}{2n}\|X\beta-y\|_2^2+\alpha\|\beta\|_1 \] where $\|\beta\|_1=\sum_j|\beta_j|$. The factor $1/2n$ is added to make the sum-of-squares term comparable to the sum of absolute values.
The effect is that a Lasso model is encouraged to set many coefficients exactly to 0, effectively ignoring features that contribute little to the prediction.
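A small demonstration of that sparsity (synthetic data; `alpha=0.1` is an arbitrary choice):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
# Only two of the ten features matter; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # the noise features' coefficients are exactly 0.0
```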
Lasso is available as `linear_model.Lasso`. Cross-validation to set $\alpha$ is exposed in `linear_model.LassoCV`; the same issues with custom scorers and target transformations as with Ridge Regression apply here. Lasso lowered my score from 0.76346 to 0.68622 (compared to 0.68627 for Ridge).
Maybe the features do not relate linearly to the target? We can handle that by adding new features that represent the various powers and products of the existing features.
This approach is called Polynomial Regression, and scikit-learn implements it as `preprocessing.PolynomialFeatures`. Beware that the number of generated features grows rapidly with the degree.
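To see how fast it grows, count the output features for ten inputs at a few degrees (the counts follow $\binom{n+d}{d}$, including the bias column):

```python
import numpy as np
from sklearn import preprocessing

X = np.zeros((1, 10))  # 10 original features
# include_bias=True (the default) adds the constant column.
counts = {d: preprocessing.PolynomialFeatures(degree=d).fit_transform(X).shape[1]
          for d in (2, 3, 4)}
print(counts)  # {2: 66, 3: 286, 4: 1001}
```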
```python
import numpy as np
from sklearn import (compose, feature_selection, linear_model,
                     model_selection, pipeline, preprocessing)

params = {
    "regressor__poly__degree": [2],
    "regressor__features__score_func": [feature_selection.f_regression],
    "regressor__features__percentile": [5, 10, 15],
    "regressor__lm__alpha": np.logspace(-3, 3, 7),
}
model = model_selection.RandomizedSearchCV(
    compose.TransformedTargetRegressor(
        regressor=pipeline.Pipeline([
            ("features", feature_selection.SelectPercentile()),
            ("scaler", preprocessing.StandardScaler()),
            ("poly", preprocessing.PolynomialFeatures()),
            ("lm", linear_model.Lasso())]),
        func=np.log, inverse_func=np.exp),
    params, scoring=rmsle, cv=5, n_jobs=-1)  # rmsle: the scorer defined earlier
```
It turned out that `feature_selection.mutual_info_regression` took up a LOT of memory and had to be excluded to fit the pipeline into 16 GB. This lowered my score to 0.67835 from the previous 0.76346 and 0.68622.