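Models have parameters: values that are learned from the data during training.
The process of training a model has hyperparameters: settings that are chosen before training, such as the number of neighbors in the examples below.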
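To make the distinction concrete, here is a minimal sketch. It assumes the iris dataset and a logistic regression model as stand-ins (neither is prescribed by these slides): `C` is set by us before training, while `coef_` only exists after `fit()` has learned it from the data.

```python
from sklearn import datasets, linear_model

# Assumed stand-in data: iris has 4 features and 3 classes.
X, y = datasets.load_iris(return_X_y=True)

# C and max_iter are hyperparameters: we choose them before training.
model = linear_model.LogisticRegression(C=1.0, max_iter=1000)
hyper = model.get_params()
print(hyper["C"])

# coef_ and intercept_ are parameters: fit() learns them from the data.
model.fit(X, y)
print(model.coef_.shape)  # one row of weights per class, one column per feature
```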
from sklearn import model_selection, neighbors  # X, y are assumed to be loaded already
model1 = neighbors.KNeighborsClassifier(3)
model2 = neighbors.KNeighborsClassifier(4)
model3 = neighbors.KNeighborsClassifier(5)
model4 = neighbors.KNeighborsClassifier(10)
model5 = neighbors.KNeighborsClassifier(50)
scores1 = model_selection.cross_val_score(model1, X, y)
scores2 = model_selection.cross_val_score(model2, X, y)
scores3 = model_selection.cross_val_score(model3, X, y)
scores4 = model_selection.cross_val_score(model4, X, y)
scores5 = model_selection.cross_val_score(model5, X, y)
ks = [3,4,5,10,50]
models = [neighbors.KNeighborsClassifier(k) for k in ks]
scores = [model_selection.cross_val_score(model, X, y)
          for model in models]
ks = [3,4,5,10,50]
weights = ["uniform", "distance"]
models = [neighbors.KNeighborsClassifier(k, weights=w)
          for k in ks for w in weights]
scores = [model_selection.cross_val_score(model, X, y)
          for model in models]
This is called grid search: systematically trying all combinations of hyperparameter values (imagine a square grid with one axis per hyperparameter) to find the best option.
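The manual version of the search can be finished by comparing mean cross-validation scores. The sketch below assumes the scikit-learn digits dataset as a stand-in for `X, y`; the `results` dictionary and the choice of "highest mean score wins" are illustrative, not taken from the slides.

```python
from sklearn import datasets, model_selection, neighbors

# Assumed stand-in data; replace with your own X, y.
X, y = datasets.load_digits(return_X_y=True)

ks = [3, 4, 5, 10, 50]
weights = ["uniform", "distance"]

# Evaluate every (k, weights) combination with cross-validation.
results = {}
for k in ks:
    for w in weights:
        model = neighbors.KNeighborsClassifier(k, weights=w)
        scores = model_selection.cross_val_score(model, X, y)
        results[(k, w)] = scores.mean()

# Pick the combination with the highest mean score.
best = max(results, key=results.get)
print(best, results[best])
```

This is the bookkeeping that GridSearchCV automates.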
parameters = {
    'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50],
    'kneighborsclassifier__weights': ["uniform", "distance"]
}
model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())
gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X, y)
You can also give a list of parameter dictionaries to GridSearchCV if the availability of some parameters depends on the value of other parameters.
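As a sketch of the list-of-dictionaries form: the Minkowski exponent `p` of KNeighborsClassifier only matters when `metric="minkowski"`, so it is searched in one grid and left out of the other. The dataset (iris) and the particular metrics are assumptions chosen for illustration.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

X, y = datasets.load_iris(return_X_y=True)  # assumed stand-in data

model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())

# Each dictionary is its own grid; combinations are not mixed across them.
parameters = [
    {"kneighborsclassifier__metric": ["minkowski"],
     "kneighborsclassifier__p": [1, 2],
     "kneighborsclassifier__n_neighbors": [3, 5]},
    {"kneighborsclassifier__metric": ["chebyshev"],
     "kneighborsclassifier__n_neighbors": [3, 5]},
]

gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X, y)

# 1*2*2 candidates from the first grid plus 1*2 from the second.
n_candidates = len(gscv.cv_results_["params"])
print(n_candidates, gscv.best_params_)
```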
A GridSearchCV object has important members:
Member | Contains |
---|---|
.predict() | calls predict on the best estimator |
.transform() | calls transform on the best estimator |
.best_estimator_ | the best estimator |
.best_score_ | the score of the best estimator |
.best_params_ | the parameters for the best estimator |
.cv_results_ | detailed results - timings, parameter combinations, scores - for all CV runs |
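The members in the table can be inspected after fitting. This is a small sketch on the iris dataset (an assumed stand-in) with a reduced grid; the `cv_results_` keys used here (`params`, `mean_test_score`, `std_test_score`) are standard entries of that dictionary.

```python
from sklearn import datasets, model_selection, neighbors, pipeline, preprocessing

X, y = datasets.load_iris(return_X_y=True)  # assumed stand-in data

model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())
parameters = {
    "kneighborsclassifier__n_neighbors": [3, 5, 10],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}
gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X, y)

print(gscv.best_params_)  # winning parameter combination
print(gscv.best_score_)   # its mean cross-validation score

# One row per parameter combination: mean and spread of the CV scores.
results = gscv.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, round(mean, 3), round(std, 3))

# predict() is forwarded to best_estimator_, refit on all of X, y.
predictions = gscv.predict(X[:5])
```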
parameters = {
    'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50],
    'kneighborsclassifier__weights': ["uniform", "distance"]
}
model = pipeline.make_pipeline(preprocessing.StandardScaler(),
                               neighbors.KNeighborsClassifier())
gscv = model_selection.GridSearchCV(model, parameters, cv=5)
gscv.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('standardscaler', StandardScaler()), ('kneighborsclassifier', KNeighborsClassifier())]), param_grid={'kneighborsclassifier__n_neighbors': [3, 4, 5, 10, 50], 'kneighborsclassifier__weights': ['uniform', 'distance']})
print(gscv.score(X_test, y_test))
0.9699666295884316
gscv.best_params_
{'kneighborsclassifier__n_neighbors': 4, 'kneighborsclassifier__weights': 'distance'}
model.get_params() - lists all parameters known to a model
pipeline.make_pipeline: This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
Explicitly constructing a pipeline allows you to set names yourself. The steps are passed as a list of (name, estimator) tuples:
pipe = pipeline.Pipeline([
    ("pca", decomposition.PCA()),
    ("lasso", linear_model.Lasso())])
Parameters in a pipeline are named [componentname]__[parametername].
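A short sketch of the naming scheme, comparing explicit names with the automatic ones from make_pipeline. The variable names `pipe` and `auto` are illustrative.

```python
from sklearn import decomposition, linear_model, pipeline

# Explicit names: the steps are a list of (name, estimator) tuples.
pipe = pipeline.Pipeline([
    ("pca", decomposition.PCA()),
    ("lasso", linear_model.Lasso()),
])

# Nested parameters are addressed as [componentname]__[parametername].
params = pipe.get_params()
print("pca__n_components" in params, "lasso__alpha" in params)

# The same names work for set_params and for GridSearchCV parameter grids.
pipe.set_params(pca__n_components=2, lasso__alpha=0.1)

# make_pipeline derives the names from the lowercased class names.
auto = pipeline.make_pipeline(decomposition.PCA(), linear_model.Lasso())
auto_names = [name for name, _ in auto.steps]
print(auto_names)
```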
model.get_params() - lists all parameters known to a model
compose.make_column_transformer: This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types.
Explicitly constructing a column transformer allows you to set names yourself. The transformers are passed as a list of (name, transformer, columns) tuples:
transformer = compose.ColumnTransformer([
    ("numeric", numeric_cleanup, numeric_columns),
    ("categorical", categorical_cleanup, categorical_columns)])
Parameters in a column transformer are named [componentname]__[parametername].
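The effect of naming on the parameter keys can be checked directly. In this sketch `numeric_cleanup` and `categorical_cleanup` are hypothetical two-branch cleanup pipelines, and the integer column indices are placeholders for real column lists.

```python
from sklearn import compose, impute, pipeline, preprocessing

# Hypothetical cleanup pipelines for numeric and categorical columns.
numeric_cleanup = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="median"))
categorical_cleanup = pipeline.make_pipeline(
    impute.SimpleImputer(strategy="constant", fill_value="NA"),
    preprocessing.OneHotEncoder(handle_unknown="ignore"))

# Explicit names; columns 0, 1 and 2 are placeholder indices.
named = compose.ColumnTransformer([
    ("numeric", numeric_cleanup, [0, 1]),
    ("categorical", categorical_cleanup, [2]),
])

# Automatic names: both transformers are Pipelines, so they get numbered.
auto = compose.make_column_transformer(
    (numeric_cleanup, [0, 1]),
    (categorical_cleanup, [2]))

named_keys = named.get_params()
auto_keys = auto.get_params()
print("numeric__simpleimputer__strategy" in named_keys)
print("pipeline-1__simpleimputer__strategy" in auto_keys)
```

Explicit names give shorter, more readable keys when the transformer is later used inside a parameter grid.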
Example using the titanic cleanup chain we have used repeatedly:
>>> cleanup.get_params()
[..........]
'pipeline-1__simpleimputer__strategy': 'median',
'pipeline-2__simpleimputer__strategy': 'constant',
'pipeline-2__simpleimputer__fill_value': 'NA',
'pipeline-2__onehotencoder__handle_unknown': 'ignore',
Binary Classifier: pick $A$ or $B$.
Multi-Class Classifier: pick one of $A_1,\dots,A_k$.
Our current competition has 10 classes, so a single binary classifier will not work.
One option: use a multi-class probability model in Logistic Regression (set multi_class="multinomial").
Another option: Ensemble Classifiers.
| | One-vs-Rest | One-vs-One |
|---|---|---|
| Dataset Size | Full dataset | Subset |
| Training Rounds | $O(k)$ | $O(k^2)$ |
For methods whose training cost grows quickly with dataset size, One-vs-One may train faster overall even though many more models are trained, because each model only sees the data for two classes.
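A back-of-the-envelope sketch of that trade-off. The dataset size `n`, the balanced classes, and the cost model "training cost grows as rows to the power `exponent`" are all assumptions for illustration, not measurements.

```python
k = 10    # number of classes
n = 1000  # hypothetical number of rows, assumed evenly split over classes

# One-vs-Rest: one model per class, each trained on the full dataset.
ovr_models = k
ovr_rows_per_model = n

# One-vs-One: one model per pair of classes, each trained on two classes only.
ovo_models = k * (k - 1) // 2
ovo_rows_per_model = 2 * n // k

def total_cost(models, rows, exponent):
    """Assumed cost model: each fit costs rows ** exponent."""
    return models * rows ** exponent

# With linear cost the two schemes are close; with quadratic cost
# One-vs-One is clearly cheaper despite training more models.
for exponent in (1, 2):
    print(exponent,
          total_cost(ovr_models, ovr_rows_per_model, exponent),
          total_cost(ovo_models, ovo_rows_per_model, exponent))
```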
In sklearn:
linear_model.LogisticRegression(multi_class="multinomial")
neighbors.KNeighborsClassifier
tree.DecisionTreeClassifier
ensemble.RandomForestClassifier
ensemble.AdaBoostClassifier
ensemble.GradientBoostingClassifier
linear_model.LogisticRegression(multi_class="ovr")
multiclass.OneVsRestClassifier - meta classifier. Takes a classifier object.
Several Support Vector Machine systems (to be discussed later)
multiclass.OneVsOneClassifier - meta classifier.
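The two meta classifiers can be tried side by side. This sketch assumes the scikit-learn digits dataset (10 classes) and a shallow decision tree as the wrapped classifier; both are illustrative choices.

```python
from sklearn import datasets, multiclass, tree

# Assumed stand-in data with 10 classes.
X, y = datasets.load_digits(return_X_y=True)

# Any classifier object can be wrapped; a small tree keeps this fast.
base = tree.DecisionTreeClassifier(max_depth=5, random_state=0)

ovr = multiclass.OneVsRestClassifier(base).fit(X, y)
ovo = multiclass.OneVsOneClassifier(base).fit(X, y)

print(len(ovr.estimators_))  # one binary model per class
print(len(ovo.estimators_))  # one binary model per pair of classes

# Both wrappers expose the usual predict() interface.
ovr_pred = ovr.predict(X[:3])
ovo_pred = ovo.predict(X[:3])
print(ovr_pred, ovo_pred)
```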