Scikit-Learn GridSearchCV with Multiple Repetitions

Pipeline within GridSearch repeats more than expected

I have built an example to check the behavior:

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

values = [{'gender':'Female'} if i%2==0 else {'gender':'Male'} for i in range(100)]

X = pd.DataFrame(values)
y = [0 if i%2==0 else 1 for i in range(100)]

def binary_data(df):
    df.gender = df.gender.map({'Female': 0, 'Male': 1})
    print(df.shape)
    return df

columntransf = ColumnTransformer([('binarydata', FunctionTransformer(binary_data), ['gender'])])
model_pipeline = Pipeline([
    ('preprocessing', columntransf),
    ('classifier', LogisticRegression(solver='lbfgs'))
])
param_grid = {}
search = GridSearchCV(model_pipeline, param_grid, scoring='accuracy')
search.fit(X, y)

And yes, as you said, I obtain 11 prints:

(80, 1)
(20, 1)
(80, 1)
(20, 1)
(80, 1)
(20, 1)
(80, 1)
(20, 1)
(80, 1)
(20, 1)
(100, 1)

But look at the size of the last set: it's the size of the whole dataset.

You forgot the main objective of a machine learning model: to learn from a dataset, from all the data in your dataset.

What you are doing with cross-validation is getting an estimate of your model's performance while searching for the best hyperparameters with grid search.

To make it clearer: CV is used to evaluate how well your model performs with a given set of parameters, and after that your whole dataset, together with the best parameters, is used for the final fit.

Another observation: how would the .predict() method work otherwise? We need only one model at the end, not five of them, to make a prediction.

The model used in the end, fitted on the whole dataset, is the one you can extract with:

search.best_estimator_

In the general case, that's the reason why we hold out a test set from the dataset: to assess whether our model will generalize well.
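As a minimal sketch of that workflow, continuing the toy example above (the train/test split is my addition, not part of the original code):

from sklearn.model_selection import train_test_split

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# GridSearchCV cross-validates on X_train only; with refit=True (the default)
# it then refits the pipeline on all of X_train using the best parameters.
search = GridSearchCV(model_pipeline, param_grid, scoring='accuracy')
search.fit(X_train, y_train)

# One final model, fitted on the whole training set, makes the predictions
final_model = search.best_estimator_
print(final_model.score(X_test, y_test))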

From scikit-learn:

3.1. Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the
same data is a methodological mistake: a model that would just repeat
the labels of the samples that it has just seen would have a perfect
score but would fail to predict anything useful on yet-unseen data.
This situation is called overfitting.

Why does GridSearchCV in scikit-learn spawn so many threads

From sklearn.GridSearchCV doc:

n_jobs : int, default=1
Number of jobs to run in parallel.

pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

If I understand the documentation properly, GridSearchCV spawns as many threads as there are grid points and only runs n_jobs of them simultaneously. The number 31, I believe, is some kind of cap on your 40 possible values. Try playing with the pre_dispatch parameter.
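For illustration, here is a hedged sketch of capping parallelism; the dataset and the SVC grid are placeholders, only n_jobs and pre_dispatch matter:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Run at most 4 fits in parallel and dispatch at most 2*n_jobs jobs at a time,
# instead of creating one job per grid point up front.
search = GridSearchCV(
    SVC(),
    {'C': [0.1, 1, 10, 100]},
    n_jobs=4,
    pre_dispatch='2*n_jobs',
)
search.fit(X, y)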

The other 11 threads, I believe, have nothing to do with GridSearchCV itself, since they appear at the same level. I think they are leftovers from other commands.

By the way, I don't observe such behavior on a Mac (I only see 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.

Here is my pstree output (part of the path deleted for privacy):

 └─┬= 00396 *** -fish
   └─┬= 21743 *** python /Users/***/scratch_5.py
     ├─── 21775 *** python /Users/***/scratch_5.py
     ├─── 21776 *** python /Users/***/scratch_5.py
     ├─── 21777 *** python /Users/***/scratch_5.py
     ├─── 21778 *** python /Users/***/scratch_5.py
     └─── 21779 *** python /Users/***/scratch_5.py

Answer to the second comment:

That's actually your code. I just generated a separable 1-D two-class problem:

import numpy as np

N = 50000
Xs = np.concatenate((np.random.random(N), 3 + np.random.random(N))).reshape(-1, 1)
ys = np.concatenate((np.zeros(N), np.ones(N)))

100k samples were enough to keep the CPU busy for about a minute.

Grid search preprocess multiple hyperparameters and multiple estimators

Indeed, when param_grid is a list of dictionaries, the search occurs over the union of the grids generated by each dictionary. So your code actually checks six hyperparameter combinations (a sketch of the grid that produces them follows the list):

  • PCA dim 2, RandomForest default (depth=None)
  • PCA dim 3, RandomForest default (depth=None)
  • PCA default (dim 2), RandomForest depth 5
  • PCA default (dim 2), RandomForest depth 15
  • PCA default (dim 2), KNN k=2
  • PCA default (dim 2), KNN k=3
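For reference, this union-of-grids behaviour comes from a param_grid that presumably looked something like the following (the step names are taken from your pipeline; the rest is my reconstruction, so treat it as an assumption):

from sklearn.neighbors import KNeighborsClassifier

# Each dict is expanded into its own sub-grid; parameters not listed in a
# dict keep their defaults. Together these yield the 6 combinations above.
param_grid = [
    {'prep2__pcadtm__n_components': [2, 3]},                        # RandomForest at default depth
    {'clf__max_depth': [5, 15]},                                    # PCA at its default (2 components)
    {'clf': [KNeighborsClassifier()], 'clf__n_neighbors': [2, 3]},  # swap in KNN, PCA default
]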

To search the full cross-product instead, you'd need something like

search_space = {
    'prep2__pcadtm__n_components': [2, 3],
    'clf': [
        RandomForestClassifier(max_depth=5),
        RandomForestClassifier(max_depth=15),
        KNeighborsClassifier(n_neighbors=2),
        KNeighborsClassifier(n_neighbors=3),
    ],
}

Depending on your actual needs, it might of course get unwieldy to list all the hyperparameter combinations you want for each model. In that case it might be simplest to nest searches:

rf_gs = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'max_depth': [5, 15]},
)
kn_gs = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [2, 3]},
)

pipeline = Pipeline([
    ('prep', preprocess),
    ('prep2', preprocess2),
    ('clf', RandomForestClassifier()),
])

search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'prep2__pcadtm__n_components': [2, 3],
        'clf': [rf_gs, kn_gs],
    },
    scoring='accuracy',
    cv=3,
    return_train_score=True,
)

This also has the effect of computing the preprocessors fewer times. But see also the memory parameter of Pipeline.

Also, note that this approach changes the CV folds fairly dramatically. If you want a "flat" search, you could write a quick script to generate the longer list used in the first approach.
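For example, a quick sketch of such a generator (ParameterGrid is from scikit-learn; the step names follow the grid above, and the per-model grids are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# One configured estimator per hyperparameter combination, to plug into
# the 'clf' entry of the flat search space from the first approach.
model_grids = [
    (RandomForestClassifier, {'max_depth': [5, 15]}),
    (KNeighborsClassifier, {'n_neighbors': [2, 3]}),
]

clf_candidates = [
    cls(**params)
    for cls, grid in model_grids
    for params in ParameterGrid(grid)
]

search_space = {
    'prep2__pcadtm__n_components': [2, 3],
    'clf': clf_candidates,
}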

GridSearchCV and prediction errors analysis (scikit-learn)

I understood my conceptual error; I'll post it here since it may help other ML beginners like me!

The solution that should work is to use cross_val_predict, splitting the folds the same way as in GridSearchCV. In fact, cross_val_predict re-trains the model on each fold and does not reuse the previously trained model! So the result is the same as getting the predictions on the validation sets during GridSearchCV.
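A hedged sketch of that idea (the data, estimator, and grid are placeholders; the key point is passing the same cv splitter to both GridSearchCV and cross_val_predict):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=cv)
search.fit(X, y)

# cross_val_predict re-fits a clone of the best estimator on each training fold,
# so every prediction comes from a model that never saw that sample.
y_pred = cross_val_predict(search.best_estimator_, X, y, cv=cv)
print(confusion_matrix(y, y_pred))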

How is cross-validation performed, and how does GridSearchCV() do it specifically?

So, the k-fold method:

You split your training set into k parts (folds), for example 5. You take the first part as the validation set and the other 4 parts as the training set. You train, and this gives you a training/CV performance. You do this 5 times (once per fold), with each fold becoming the validation set in turn and the others the training set. At the end you take the mean of the performances to obtain the CV performance of your model. That's the k-fold part.
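A small sketch of that procedure using scikit-learn's KFold (the data and model are just placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = SVC()
    model.fit(X[train_idx], y[train_idx])                 # train on 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # validate on the remaining fold

print(np.mean(scores))  # the CV performance of the model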

Now, GridSearchCV is a hyperparameter tuner that uses the k-fold method. The principle is: you give grid search a dictionary with all the hyperparameters you want to test; it then tests every combination and selects the best set of hyperparameters (the one with the best CV performance). It can take a very long time.

You pass a model (estimator) to grid search, a set of parameters, and, if you want, the number of folds.

Example:

GridSearchCV(SVC(), parameters, cv=5)

where SVC() is the estimator, parameters is your dictionary of hyperparameters, and cv is the number of folds.
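Putting it together, here is a minimal runnable sketch (the data and parameter values are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

parameters = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(X, y)

print(search.best_params_)  # the best set of hyperparameters
print(search.best_score_)   # their mean cross-validated accuracy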


