Finish up Titanic

Make sure you all have at least one submission to Titanic before moving on. Come write the team's best score on the whiteboard.

New Competition

We will start a new competition today:

https://kaggle.com/c/tabular-playground-series-feb-2022

Use the same evaluation metric as the competition

Submissions will be evaluated based on their categorization accuracy.

from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))

Validate, validate, validate!

As we gain more options for building models, we need a way to choose between them.

  • Different methods (linear regression, polynomial regression, ridge regression, random forests, ...)
  • Different feature sets

Rudimentary first step: split the training data into training and validation sets.

from sklearn import model_selection
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y)

Train each model on X_train, y_train, score each model on X_val, y_val. Pick best scoring model for submission.
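A minimal sketch of this workflow on synthetic data, with two stand-in candidate models (the specific models and data here are illustrative, not recommendations):

```python
from sklearn import datasets, ensemble, linear_model, model_selection

# Synthetic regression data stands in for the competition's training set
X, y = datasets.make_regression(n_samples=200, n_features=5, noise=10.0,
                                random_state=0)

# Hold out a validation set from the training data
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X, y, random_state=0)

# Train each candidate on the training split...
candidates = {
    "linear": linear_model.LinearRegression(),
    "forest": ensemble.RandomForestRegressor(random_state=0),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # ...and score it on the held-out validation split
    scores[name] = model.score(X_val, y_val)

# Pick the best-scoring model for submission
best = max(scores, key=scores.get)
print(best, scores[best])
```

The held-out score is what you trust; the training score will flatter complex models.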

First Steps

Before you do anything else, look at the data. Do a bit of exploratory data analysis. Look through the histograms on the Kaggle data description page. Use e.g. seaborn.pairplot or matplotlib's scatter to see how the variables interact with the target value.
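As a quick non-graphical complement (a sketch on synthetic data; the column names are made up), pandas can summarize the data and show which features track the target:

```python
import numpy as np
import pandas as pd

# Made-up data: "target" depends strongly on f0 and not at all on f1
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f0": rng.normal(size=100),
    "f1": rng.normal(size=100),
})
df["target"] = 2 * df["f0"] + rng.normal(scale=0.1, size=100)

print(df.describe())                        # ranges, means, obvious outliers
print(df.corr()["target"].sort_values())    # which features track the target
```

High correlations flag features worth plotting against the target with pairplot or scatter.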

Build a cleanup pipeline. It is likely that your data is messy; do what you can to clean it up. The code from last week is helpful here.

End for now

Regression

  • Linear Regression
  • Lasso
  • LARS
  • Polynomial Regression (i.e. use products of features as additional features)

Linear Regression

linear_model.LinearRegression

Least squares fitting. Classical statistics.
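A tiny illustration on made-up, noise-free data, where least squares recovers the generating coefficients exactly:

```python
import numpy as np
from sklearn import linear_model

# y = 3*x0 - 2*x1 + 1, with no noise, so least squares recovers it exactly
X = np.random.default_rng(0).normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1

model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # ~[3, -2] and ~1
```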

Lasso

linear_model.Lasso

Linear regression with a penalty for non-zero coefficients. It estimates a sparse model, which makes interpretation easier.

Takes a penalty weight $\alpha$, determining how aggressively the method forces sparsity.
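A sketch on synthetic data (the alpha value is an arbitrary choice for illustration) showing how Lasso zeroes out irrelevant features:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two of the ten features actually matter
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = linear_model.Lasso(alpha=0.5)
model.fit(X, y)
# The penalty drives most coefficients to exactly zero
print(model.coef_)
print((model.coef_ != 0).sum())
```

A larger alpha zeroes out more coefficients; cross-validate alpha rather than guessing it.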

LARS

linear_model.Lars

Least Angle Regression: linear regression that introduces new features one by one.
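A sketch on synthetic data: `coef_path_` records the coefficients at each step, and `active_` records the order in which features entered the model:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Feature 0 dominates, so LARS should pick it up first
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = linear_model.Lars()
model.fit(X, y)
print(model.coef_path_.shape)  # one row per feature, one column per step
print(model.active_)           # order in which features became active
```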

Polynomial Regression

Allows for curved relationships.

Best used in a pipeline with the preprocessing.PolynomialFeatures transform.

Drives up the total number of features dramatically.

from sklearn import linear_model, pipeline, preprocessing

model = pipeline.Pipeline([('poly', preprocessing.PolynomialFeatures(degree=3)),
                           ('linear', linear_model.LinearRegression(fit_intercept=False))])
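Fitting such a pipeline on a small made-up dataset shows the blow-up: 5 input features expand to 56 polynomial columns at degree 3:

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

model = pipeline.Pipeline([('poly', preprocessing.PolynomialFeatures(degree=3)),
                           ('linear', linear_model.LinearRegression(fit_intercept=False))])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 5 made-up input features
y = X[:, 0] ** 2 + X[:, 1]      # a curved relationship

model.fit(X, y)
# 5 input columns expand to C(5 + 3, 3) = 56 polynomial columns
print(model.named_steps['poly'].n_output_features_)
```

With many raw features, cap the degree at 2 or use feature selection, or the column count becomes unmanageable.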