Consider the Titanic dataset, and my naïve classifier based on Age and Fare.
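# Note: the plotting calls below (scatter, xlabel, ylabel) and np assume a
# pylab-style notebook session; in a plain script you would import them from
# matplotlib.pyplot and numpy first.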
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")
We can fit a logistic regression model as our classifier. Again, I am using the less-than-optimal method of assigning Age=0 and Fare=0 to missing data points.
from sklearn import linear_model
model = linear_model.LogisticRegression(solver="lbfgs")
model.fit(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")
# create a mesh to plot in
x_min, x_max = 0, 80
y_min, y_max = 0, 500
xx, yy = np.meshgrid(np.arange(x_min, x_max, 5),
                     np.arange(y_min, y_max, 50))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
contour(xx, yy, Z, colors="teal", levels=[0.5], linewidths=5)
Logistic Regression has a simplistic decision boundary: a straight line. Everything on one side of the line is classified into one class, everything on the other side into the other.
For many tasks, Logistic Regression therefore runs a distinct risk of underfitting.
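To see why the boundary is a straight line: logistic regression predicts "survived" exactly when the linear score it computes is positive, so the boundary is where that score equals zero. A minimal sketch that reads the line off the fitted model above (coef_ and intercept_ are the standard scikit-learn attributes for this):
# The decision boundary is where w_age * Age + w_fare * Fare + b == 0, i.e. a straight line.
w_age, w_fare = model.coef_[0]
b = model.intercept_[0]
print("boundary: %.4f * Age + %.4f * Fare + %.4f = 0" % (w_age, w_fare, b))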
Our first alternative classifier is k-NN.
from sklearn import neighbors
knn_model = neighbors.KNeighborsClassifier(5)
knn_model.fit(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])
scatter(titanic["Age"], titanic["Fare"], c=titanic["Survived"])
xlabel("Age")
ylabel("Fare")
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = knn_model.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
contour(xx, yy, Z, colors="teal", levels=[0.5], linewidths=5)
The k-NN classifier is quite a lot more flexible than the logistic regression classifier, but it requires more storage and more processing power to use.
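How flexible k-NN is depends on the number of neighbors: a very small k gives a wiggly boundary that hugs individual points, a large k smooths it out towards the majority class. A quick sketch of that effect (the particular values of k are arbitrary, and these are training-set accuracies, so the small-k numbers will flatter the model):
# Training accuracy for a few neighborhood sizes; small k memorizes the training data.
features = titanic[["Age", "Fare"]].fillna(0)
for k in (1, 5, 25, 101):
    clf = neighbors.KNeighborsClassifier(k)
    clf.fit(features, titanic["Survived"])
    print(k, clf.score(features, titanic["Survived"]))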
{"logreg": model.score(titanic[["Age","Fare"]].fillna(0), titanic["Survived"]),
"knn": knn_model.score(titanic[["Age","Fare"]].fillna(0), titanic["Survived"])}
{'knn': 0.76767676767676762, 'logreg': 0.6689113355780022}
For logistic regression, it is enough to store the fitted coefficients - one number for each variable, plus the intercept.
For a k-NN classifier, the entire data set has to be kept - as well as an additional data structure to make nearest neighbor lookups fast.
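As a rough sketch of that difference in size (exact memory use depends on the implementation): the fitted logistic regression is fully described by its coefficients and intercept, while the k-NN classifier has to retain every training row.
# Logistic regression: one coefficient per feature, plus the intercept.
print(model.coef_.size + model.intercept_.size)        # 3 numbers for 2 features
# k-NN: all training rows are kept (891 rows x 2 features here).
print(titanic[["Age", "Fare"]].fillna(0).shape)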
It is important that both training and test data go through the same preprocessing steps.
scikit-learn provides a mechanism that makes this easier: pipelines.
A pipeline gathers up several consecutive transformer steps and allows you to fit and transform the entire sequence at once.
from sklearn import compose, impute, pipeline, preprocessing
numeric_features = ["Age", "Fare"]
categorical_features = ["Pclass", "Sex"]
numeric_transformer = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="median"),
    preprocessing.StandardScaler())
categorical_transformer = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="constant", fill_value="NA"),
    preprocessing.OneHotEncoder(handle_unknown="ignore"))
preprocessor = compose.make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features))
preprocessor.fit(titanic)
X = preprocessor.transform(titanic)
y = titanic["Survived"]
model.fit(X, y)
knn_model.fit(X, y)
{"logreg": model.score(X, y), "knn": knn_model.score(X, y)}
{'knn': 0.86756453423120095, 'logreg': 0.79685746352413023}
Do note: these accuracy scores are not reliable. Why?
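One way to check is to hold out some rows, fit on the rest, and score only on the held-out rows. A minimal sketch using sklearn.model_selection.train_test_split (the test fraction and random_state here are arbitrary choices):
from sklearn import model_selection
# Fit on 75% of the rows, score on the 25% the models never saw during fitting.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)
held_out_scores = {
    "logreg": model.fit(X_train, y_train).score(X_test, y_test),
    "knn": knn_model.fit(X_train, y_train).score(X_test, y_test)}
held_out_scores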
sklearn.neighbors.KNeighborsClassifier: k-NN classifier
sklearn.pipeline.make_pipeline: Create a sequence of transformations
sklearn.compose.make_column_transformer: Transform different columns differently
Logistic Regression is prone to underfitting, but is very lean in memory and processing power; it is a linear classifier.
k-NN is more prone to overfitting, and more expensive in memory and processing power; it gives highly non-linear decision boundaries.