Submissions will be evaluated based on their categorization accuracy.
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
As we accumulate more options for building models, we need a way to choose between them.
A rudimentary first step: split the training data into training and validation sets.
from sklearn import model_selection
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y)
Train each model on X_train, y_train; score each model on X_val, y_val; pick the best-scoring model for submission.
Before you do anything else, look at the data.
Do a bit of exploratory data analysis.
Look through histograms on the Kaggle data description page.
Use e.g. seaborn.pairplot, or scatter from matplotlib, to see how variables interact with the target value.
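A quick sketch of this kind of exploratory plot. The `sqft`/`price` frame below is a made-up stand-in for the competition data; seaborn's pairplot is shown commented out since it does the same thing for all variable pairs at once.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({'sqft': rng.uniform(500, 3000, size=100)})
df['price'] = 100 * df['sqft'] + rng.normal(scale=5000, size=100)

# Scatter one feature against the target to eyeball the relationship.
fig, ax = plt.subplots()
ax.scatter(df['sqft'], df['price'], s=10)
ax.set_xlabel('sqft')
ax.set_ylabel('price')
fig.savefig('sqft_vs_price.png')

# With seaborn installed, one call plots every pair of variables:
# import seaborn as sns
# sns.pairplot(df)
```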
Build a cleanup pipeline. Your data is likely messy; do what you can to clean it up. The code from last week is helpful here.
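A minimal cleanup pipeline might look like this. The imputation and scaling steps are illustrative choices; what your data actually needs depends on what the exploratory analysis turned up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Messy toy data with a missing value, standing in for the real dataset.
X = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})

cleanup = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill in missing values
    ('scale', StandardScaler()),                   # zero mean, unit variance
])
X_clean = cleanup.fit_transform(X)
```

Because it is a Pipeline, the exact same cleanup gets applied to the test data via `transform`, avoiding train/test mismatches.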
linear_model.LinearRegression
Least squares fitting. Classical statistics.
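A tiny usage sketch, on data that lies exactly on the line $y = 2x + 1$ so the fitted coefficients are easy to check:

```python
import numpy as np
from sklearn import linear_model

# Three points exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept
```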
linear_model.Lasso
Linear regression with a penalty for non-zero coefficients. It estimates a sparse model, which makes interpretation easier.
Takes a penalty weight $\alpha$ that determines how aggressively the method forces sparsity.
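To see the sparsity in action, here is a sketch on synthetic data where only two of ten features matter; the `alpha=0.1` value is an arbitrary illustration, not a tuned choice.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually influence the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X, y)
# Larger alpha drives more coefficients exactly to zero.
print(np.count_nonzero(lasso.coef_))
```

In practice $\alpha$ is itself chosen by validation, e.g. with `linear_model.LassoCV`.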
linear_model.Lars
Least Angle Regression: linear regression that introduces new features one by one.
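A small sketch of the one-by-one behaviour on synthetic data; `n_nonzero_coefs=2` is an illustrative cap on how many features the path may introduce.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only features 0 and 3 influence the target.
y = 2 * X[:, 0] + X[:, 3] + rng.normal(scale=0.1, size=100)

lars = linear_model.Lars(n_nonzero_coefs=2)
lars.fit(X, y)
# active_ lists the features chosen, in the order they entered the model.
print(lars.active_)
```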
Allows for curved relationships. Best used in a pipeline with the preprocessing.PolynomialFeatures transform. Note that this drives up the total number of dimensions dramatically.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# PolynomialFeatures already adds a constant column, so skip the intercept.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])