Consider the Titanic dataset, and my naïve classifier based on Age and Fare.

In [2]:
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")
Out[2]:
Text(0,0.5,'Fare')

We can fit a logistic regression model as our classifier. Again, I am using the less-than-optimal method of filling missing Age and Fare values with 0.

In [3]:
from sklearn import linear_model
model = linear_model.LogisticRegression(solver="lbfgs")
model.fit(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])
Out[3]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [4]:
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")

# create a mesh to plot in
x_min, x_max = 0, 80
y_min, y_max = 0, 500
xx, yy = np.meshgrid(np.arange(x_min, x_max, 5),
                     np.arange(y_min, y_max, 50))


# Plot the decision boundary. For that, we predict the class of each
# point in the mesh [x_min, x_max] x [y_min, y_max].
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Reshape the predictions back onto the mesh and draw the boundary
Z = Z.reshape(xx.shape)
contour(xx, yy, Z, colors="teal", levels=[0.5], linewidths=5)
Out[4]:
<matplotlib.contour.QuadContourSet at 0x1176703c8>

Logistic Regression has a simplistic decision boundary: a straight line. Everything on one side is classified into one group, everything on the other side into the other.
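In terms of the fitted model, the boundary is the set of points where the predicted probability of survival equals $0.5$, i.e. where $\beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Fare} = 0$ (here $\beta_0$ denotes the intercept and $\beta_1, \beta_2$ the two coefficients; the symbols are just notation, not names from the code above). That is exactly the equation of a straight line in the Age/Fare plane.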

For many tasks, Logistic Regression runs a distinct risk of underfitting.

Our first alternative classifier is k-NN (k-nearest neighbors); a small from-scratch sketch follows the two steps below.

  1. For each prediction point $x$, find the $k$ nearest neighboring points.
  2. Assign to $x$ the most common class among the neighbors.
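
To make the two steps concrete, here is a minimal from-scratch sketch. This is my own illustration, not the scikit-learn implementation: the function name and the choice of Euclidean distance are assumptions, and X_train / y_train are NumPy arrays.

import numpy as np
from collections import Counter

def knn_predict_one(x, X_train, y_train, k=5):
    # 1. distances from x to every training point, then the k nearest
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # 2. majority vote among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
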
In [5]:
from sklearn import neighbors
knn_model = neighbors.KNeighborsClassifier(5)
knn_model.fit(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])
Out[5]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
In [6]:
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")

# Plot the decision boundary. For that, we predict the class of each
# point in the mesh [x_min, x_max] x [y_min, y_max].
Z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])

# Reshape the predictions back onto the mesh and draw the boundary
Z = Z.reshape(xx.shape)
contour(xx, yy, Z, colors="teal", levels=[0.5], linewidths=5)
Out[6]:
<matplotlib.contour.QuadContourSet at 0x1176dfd30>

The k-NN classifier is quite a lot more flexible than the logistic regression classifier, but it requires more storage and more processing power to use.
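
How flexible depends on the number of neighbors: small values of $k$ produce very wiggly boundaries, large values smooth them out. A quick sketch of the effect on training accuracy, reusing the cells above (the names X_raw and m are mine; small $k$ looks artificially good here precisely because we score on the data the model was fitted on):

# Training accuracy of k-NN for several choices of k (illustration only)
X_raw = titanic[["Age", "Fare"]].fillna(0)
for k in (1, 5, 25, 100):
    m = neighbors.KNeighborsClassifier(k).fit(X_raw, titanic["Survived"])
    print(k, m.score(X_raw, titanic["Survived"]))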

In [7]:
{"logreg": model.score(titanic[["Age","Fare"]].fillna(0), titanic["Survived"]), 
 "knn": knn_model.score(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])}
Out[7]:
{'knn': 0.76767676767676762, 'logreg': 0.6689113355780022}

For logistic regression, it is enough to store the fitted coefficients: one number for each variable, plus the intercept.

For a k-NN classifier, the entire training set has to be kept, as well as an additional data structure (such as a k-d tree or ball tree) to make nearest-neighbor lookups fast.
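
A quick way to see the difference with the models fitted above (a small sketch; the comments describe what is stored, not exact sizes):

# Logistic regression: the whole model is a handful of numbers
print(model.coef_, model.intercept_)      # one coefficient per feature, plus the intercept

# k-NN: prediction needs every training row, so memory grows with the data
print(titanic[["Age", "Fare"]].shape)     # this entire matrix is retained inside knn_model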

Pipelines

It is important that both training and test data go through the same preprocessing steps.

scikit-learn provides a mechanism that makes this easier: pipelines.

A pipeline gathers up several consecutive transformer steps and allows you to fit and transform the entire sequence at once.

In [8]:
from sklearn import compose, impute, pipeline, preprocessing

numeric_features = ["Age", "Fare"]
categorical_features = ["Pclass", "Sex"]

numeric_transformer = pipeline.make_pipeline(
  impute.SimpleImputer(strategy="median"),
  preprocessing.StandardScaler())
categorical_transformer = pipeline.make_pipeline(
  impute.SimpleImputer(strategy="constant", fill_value="NA"),
  preprocessing.OneHotEncoder(handle_unknown="ignore"))

preprocessor = compose.make_column_transformer(
  (numeric_transformer, numeric_features),
  (categorical_transformer, categorical_features))
In [9]:
preprocessor.fit(titanic)
X = preprocessor.transform(titanic)
y = titanic["Survived"]

model.fit(X, y)
knn_model.fit(X, y)

{"logreg": model.score(X, y), "knn": knn_model.score(X, y)}
Out[9]:
{'knn': 0.86756453423120095, 'logreg': 0.79685746352413023}
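
The preprocessor and a classifier can also be chained into one pipeline, so a single fit/score call runs the whole sequence on the raw DataFrame (a sketch; clf is a name introduced here, not from the notebook above):

clf = pipeline.make_pipeline(
  preprocessor,
  linear_model.LogisticRegression(solver="lbfgs"))
clf.fit(titanic, titanic["Survived"])
clf.score(titanic, titanic["Survived"])

Note that this score is still computed on the training data, which brings us to the next point.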

Do note: these accuracy scores are not reliable. Why?
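
All the scores so far are computed on the same rows the models were fitted on, so they are optimistic. A minimal sketch of a more honest check on a held-out split (the split, random_state, and variable names are my own illustration):

from sklearn import model_selection

train, test = model_selection.train_test_split(titanic, random_state=0)
knn_clf = pipeline.make_pipeline(preprocessor, neighbors.KNeighborsClassifier(5))
knn_clf.fit(train, train["Survived"])    # preprocessing is fitted on the training split only...
knn_clf.score(test, test["Survived"])    # ...and applied unchanged to the test split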

  • sklearn.neighbors.KNeighborsClassifier: k-NN classifier
  • sklearn.pipeline.make_pipeline: Create sequence of transformations
  • sklearn.compose.make_column_transformer: Transform different columns differently

Logistic Regression is prone to underfitting, but very lean in memory and processing power. It is a linear classifier.

k-NN is more prone to overfitting and more expensive in memory and processing power, but it can produce highly non-linear decision boundaries.