Submissions will be evaluated based on their categorization accuracy.
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
As we accumulate more options for building models, we need a way to choose between them.
A rudimentary first step: split the training data into training and validation sets.
from sklearn import model_selection
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y)
Train each model on X_train, y_train; score each model on X_val, y_val; pick the best-scoring model for submission.
Before you do anything else, look at the data.
Do a bit of exploratory data analysis.
Look through histograms on the Kaggle data description page.
Use e.g. seaborn.pairplot, or scatter from matplotlib, to see how variables interact with the target value.
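A quick sketch of this kind of exploratory plot. The `sqft`/`price` frame below is a made-up stand-in for the competition data; seaborn's pairplot is shown commented out since it does the same thing for all variable pairs at once.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({'sqft': rng.uniform(500, 3000, size=100)})
df['price'] = 100 * df['sqft'] + rng.normal(scale=5000, size=100)

# Scatter one feature against the target to eyeball the relationship.
fig, ax = plt.subplots()
ax.scatter(df['sqft'], df['price'], s=10)
ax.set_xlabel('sqft')
ax.set_ylabel('price')
fig.savefig('sqft_vs_price.png')

# With seaborn installed, one call plots every pair of variables:
# import seaborn as sns
# sns.pairplot(df)
```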
Build a cleanup pipeline. Your data is likely messy; do what you can to clean it up. The code from last week is helpful here.
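A minimal cleanup pipeline might look like this. The imputation and scaling steps are illustrative choices; what your data actually needs depends on what the exploratory analysis turned up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Messy toy data with a missing value, standing in for the real dataset.
X = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})

cleanup = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill in missing values
    ('scale', StandardScaler()),                   # zero mean, unit variance
])
X_clean = cleanup.fit_transform(X)
```

Because it is a Pipeline, the exact same cleanup gets applied to the test data via `transform`, avoiding train/test mismatches.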
linear_model.LinearRegression
Least squares fitting. Classical statistics.
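A tiny usage sketch, on data that lies exactly on the line $y = 2x + 1$ so the fitted coefficients are easy to check:

```python
import numpy as np
from sklearn import linear_model

# Three points exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept
```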
linear_model.Lasso
Linear regression with a penalty for non-zero coefficients. It estimates a sparse model, which makes interpretation easier.
Takes a penalty weight $\alpha$ that determines how aggressively the method forces sparsity.
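To see the sparsity in action, here is a sketch on synthetic data where only two of ten features matter; the `alpha=0.1` value is an arbitrary illustration, not a tuned choice.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually influence the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X, y)
# Larger alpha drives more coefficients exactly to zero.
print(np.count_nonzero(lasso.coef_))
```

In practice $\alpha$ is itself chosen by validation, e.g. with `linear_model.LassoCV`.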
linear_model.Lars
Least Angle Regression: linear regression that introduces new features one by one.
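A small sketch of the one-by-one behaviour on synthetic data; `n_nonzero_coefs=2` is an illustrative cap on how many features the path may introduce.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only features 0 and 3 influence the target.
y = 2 * X[:, 0] + X[:, 3] + rng.normal(scale=0.1, size=100)

lars = linear_model.Lars(n_nonzero_coefs=2)
lars.fit(X, y)
# active_ lists the features chosen, in the order they entered the model.
print(lars.active_)
```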
Allows for curved relationships. Best used in a pipeline with the preprocessing.PolynomialFeatures transform. Note that this drives up the total number of dimensions dramatically.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# PolynomialFeatures already adds a constant column, so skip the intercept.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])