Models have parameters
The process of training a model has hyperparameters
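For instance (a small illustration, not from the examples below; it assumes the same X and y used throughout): the number of neighbors is a hyperparameter we choose before training, while the coefficients of a fitted logistic regression are parameters learned from the data.
from sklearn import linear_model, neighbors

# n_neighbors is a hyperparameter: chosen before training
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# coefficients are parameters: learned from the data during fit()
logreg = linear_model.LogisticRegression(max_iter=1000).fit(X, y)
print(logreg.coef_)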
from sklearn import model_selection, neighbors

# Trying several values of k by hand quickly gets repetitive
model1 = neighbors.KNeighborsClassifier(3)
model2 = neighbors.KNeighborsClassifier(4)
model3 = neighbors.KNeighborsClassifier(5)
model4 = neighbors.KNeighborsClassifier(10)
model5 = neighbors.KNeighborsClassifier(50)
scores1 = model_selection.cross_val_score(model1, X, y)
scores2 = model_selection.cross_val_score(model2, X, y)
scores3 = model_selection.cross_val_score(model3, X, y)
scores4 = model_selection.cross_val_score(model4, X, y)
scores5 = model_selection.cross_val_score(model5, X, y)
# The same thing as a list comprehension
ks = [3, 4, 5, 10, 50]
models = [neighbors.KNeighborsClassifier(k) for k in ks]
scores = [model_selection.cross_val_score(model, X, y)
          for model in models]
# Two hyperparameters: try all combinations
ks = [3, 4, 5, 10, 50]
weights = ["uniform", "distance"]
models = [neighbors.KNeighborsClassifier(k, weights=w)
          for k in ks for w in weights]
scores = [model_selection.cross_val_score(model, X, y)
          for model in models]
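To read the results off, one could for example (a sketch, not from the original) compare mean cross-validation scores across the combinations:
import numpy as np

# Pair each model's mean CV score with its (k, weights) combination
params = [(k, w) for k in ks for w in weights]
mean_scores = [s.mean() for s in scores]
best_k, best_w = params[np.argmax(mean_scores)]
print(best_k, best_w, max(mean_scores))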
This is called grid search: systematically trying every combination of hyperparameter values (imagine a square grid with one axis per hyperparameter...) to find the best option.
from sklearn import pipeline, preprocessing

parameters = {
    'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50],
    'kneighborsclassifier__weights': ["uniform", "distance"]
}
model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())
gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X, y)
You can also give GridSearchCV a list of parameter dictionaries, which is useful if the availability of some parameters depends on the value of other parameters.
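For example (a sketch, not from the lecture, swapping in a logistic regression: penalty="l1" is only supported by some solvers, so each dictionary only lists compatible combinations):
from sklearn import linear_model

model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               linear_model.LogisticRegression(max_iter=1000))
# A list of grids: "l1" only works with liblinear/saga, so it gets its own grid
parameters = [
    {'logisticregression__penalty': ['l2'],
     'logisticregression__solver': ['lbfgs']},
    {'logisticregression__penalty': ['l1'],
     'logisticregression__solver': ['liblinear', 'saga']},
]
gscv = model_selection.GridSearchCV(model, parameters, cv=5)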
A GridSearchCV object has important members:
| Member | Contains |
|---|---|
| .predict() | calls predict on the best estimator |
| .transform() | calls transform on the best estimator |
| .best_estimator_ | the best estimator |
| .best_score_ | the score of the best estimator |
| .best_params_ | the parameters for the best estimator |
| .cv_results_ | detailed results - timings, parameter combinations, scores - for all CV runs |
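For instance, .cv_results_ can be turned into a pandas DataFrame for inspection (a sketch; the column names follow the parameter grid used above):
import pandas as pd

results = pd.DataFrame(gscv.cv_results_)
print(results[['param_kneighborsclassifier__n_neighbors',
               'param_kneighborsclassifier__weights',
               'mean_test_score', 'rank_test_score']])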
parameters = {
    'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50],
    'kneighborsclassifier__weights': ["uniform", "distance"]
}
model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())
gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('kneighborsclassifier',
                                        KNeighborsClassifier())]),
             param_grid={'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50],
                         'kneighborsclassifier__weights': ['uniform',
                                                           'distance']})
print(gscv.score(X_test, y_test))
0.9699666295884316
gscv.best_params_
{'kneighborsclassifier__n_neighbors': 4,
 'kneighborsclassifier__weights': 'distance'}
model.get_params() - lists all parameters known to a model
pipeline.make_pipeline:
This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
Explicitly constructing a Pipeline allows you to set names yourself:
from sklearn import decomposition, linear_model, pipeline

pipe = pipeline.Pipeline([
    ("pca", decomposition.PCA()),
    ("lasso", linear_model.Lasso())])
Parameters in a pipeline are named [componentname]__[parametername].
model.get_params() - lists all parameters known to a model
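As an illustration (a sketch using the explicitly named pipeline above), the names chosen in the Pipeline constructor become the prefixes of the tunable parameters:
# With the names "pca" and "lasso" set above, parameters are addressed as
# pca__... and lasso__...
pipe.set_params(pca__n_components=5, lasso__alpha=0.1)
parameters = {
    'pca__n_components': [2, 5, 10],
    'lasso__alpha': [0.01, 0.1, 1.0]
}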
compose.make_column_transformer:
This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types.
Explicitly constructing a ColumnTransformer allows you to set names yourself:
from sklearn import compose

transformer = compose.ColumnTransformer([
    ("numeric", numeric_cleanup, numeric_columns),
    ("categorical", categorical_cleanup, categorical_columns)])
Parameters in a column transformer are named [componentname]__[parametername].
Example using the titanic cleanup chain we have used repeatedly:
>>> cleanup.get_params()
[..........]
'pipeline-1__simpleimputer__strategy': 'median',
'pipeline-2__simpleimputer__strategy': 'constant',
'pipeline-2__simpleimputer__fill_value': 'NA',
'pipeline-2__onehotencoder__handle_unknown': 'ignore',
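These names can be used directly, e.g. with set_params (a sketch, assuming the cleanup object whose parameters are printed above):
# Hypothetical usage: change the fill value for the categorical imputer.
# The name contains "-", so it must be passed via a dict, not as a keyword.
cleanup.set_params(**{'pipeline-2__simpleimputer__fill_value': 'missing'})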
Binary Classifier: pick $A$ or $B$.
Multi-Class Classifier: pick one of $A_1,\dots,A_k$.
For our current competition, we have 10 classes, so a binary classifier on its own will not work.
One option: use a multi-class probability model in Logistic Regression.
(set multi_class="multinomial")
Another option: build an ensemble of binary classifiers (One-vs-Rest or One-vs-One).
| | One-vs-Rest | One-vs-One |
|---|---|---|
| Dataset Size | Full dataset | Subset (two classes at a time) |
| Training Rounds | $O(k)$ | $O(k^2)$ |
For methods that scale badly with dataset size, One-vs-One may still be faster overall, even though many more models are trained, because each model only sees the data for two classes.
In sklearn:

- linear_model.LogisticRegression(multi_class="multinomial")
- neighbors.KNeighborsClassifier
- tree.DecisionTreeClassifier
- ensemble.RandomForestClassifier
- ensemble.AdaBoostClassifier
- ensemble.GradientBoostingClassifier
- linear_model.LogisticRegression(multi_class="ovr")
- multiclass.OneVsRestClassifier - meta classifier. Takes a classifier object.
- Several Support Vector Machine systems (to be discussed later)
- multiclass.OneVsOneClassifier - meta classifier.
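A minimal sketch of the meta classifiers (the base estimator here is just for illustration; assumes X_train, y_train, X_test, y_test as above):
from sklearn import linear_model, multiclass

base = linear_model.LogisticRegression(max_iter=1000)
ovr = multiclass.OneVsRestClassifier(base)   # trains k binary models
ovo = multiclass.OneVsOneClassifier(base)    # trains k*(k-1)/2 binary models
ovr.fit(X_train, y_train)
print(ovr.score(X_test, y_test))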