Implement K-fold cross validation in MLPClassifier (Python)

You do not need to split your data into train and test sets yourself; the KFold splitter produces a train/test split for each fold.

from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

kf = KFold(n_splits=10)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

# X, y are your feature matrix and target vector (NumPy arrays here)
for train_indices, test_indices in kf.split(X):
    clf.fit(X[train_indices], y[train_indices])
    print(clf.score(X[test_indices], y[test_indices]))

KFold partitions your dataset into n equal folds. On each iteration, one fold is held out as the test set and the remaining n-1 folds are used for training, so every observation is used for testing exactly once. This gives a fairly reliable estimate of your model's accuracy because it is averaged over several held-out portions of the data.
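If you only need the per-fold scores, scikit-learn's cross_val_score wraps this exact loop for you. A minimal sketch, assuming the same X, y, kf and clf as above:

from sklearn.model_selection import cross_val_score

# One score per fold; the default scoring for a classifier is accuracy
scores = cross_val_score(clf, X, y, cv=kf)
print(scores)
print(scores.mean())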

K-fold cross validation with different results every time

I found the solution to my question.

Setting a NumPy random seed, as below, solved the problem:

import numpy as np

np.random.seed(22)
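Note that np.random.seed only fixes NumPy's global generator. A more explicit alternative (my own suggestion, not part of the original answer) is to pass random_state to the splitter and to the estimator directly:

from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

# Fixing random_state makes both the fold assignment and the network's
# weight initialization reproducible across runs
kf = KFold(n_splits=10, shuffle=True, random_state=22)
clf = MLPClassifier(random_state=22)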

How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross validation?

You can use the .loc method to accomplish this. This question has a nice answer that shows how to use it: df.loc[index_position, "column_name"] = some_value

So, here is an edited version of the code you posted (I needed data, and removed roc_auc since we are not using probabilities, per your edit):

from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()

k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []

# Create the column that will hold the out-of-fold predictions
X['Prediction'] = 1

# Define which columns to use as model inputs
model_columns = [x for x in X.columns if x != 'Prediction']

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])

    acc = accuracy_score(y_test, pred_values)
    acc_score.append(acc)

    # Write this fold's predictions back into the dataframe
    X.loc[test_index, 'Prediction'] = pred_values

avg_acc_score = sum(acc_score) / k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))

# Add the label back, per the question
X['Label'] = y

# Print the first 5 rows to show that it works
print(X.head(n=5))

Yields

accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
mean radius mean texture ... Prediction Label
0 17.99 10.38 ... 0 0
1 20.57 17.77 ... 0 0
2 19.69 21.25 ... 0 0
3 11.42 20.38 ... 1 0
4 20.29 14.34 ... 0 0

[5 rows x 32 columns]

(Obviously the model/values etc are all arbitrary)
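As a side note, if the only goal is to attach out-of-fold predictions to the dataframe, cross_val_predict does the bookkeeping in one call. A sketch under the same assumptions (same X, y, model, model_columns and kf as above):

from sklearn.model_selection import cross_val_predict

# Each row is predicted by the model trained on the folds that did not contain it
X['Prediction'] = cross_val_predict(model, X[model_columns], y, cv=kf)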

Different Confusion Matrix with Cross-Validation

As mentioned in the comments, regarding the first question, the first option is the way to go: split the whole dataset via train_test_split first, and then call the .split() method of the chosen cross-validator object on the training set only.
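A minimal sketch of that pattern (the dataset and classifier here are placeholders of my own choosing, not from the question):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a final test set first; cross-validate only on the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = MLPClassifier(random_state=0, max_iter=1000)

for train_idx, val_idx in skf.split(X_train, y_train):
    clf.fit(X_train[train_idx], y_train[train_idx])
    y_pred = clf.predict(X_train[val_idx])
    print(confusion_matrix(y_train[val_idx], y_pred))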

For the second point, the issue lies in some default parameters of StratifiedKFold and StratifiedShuffleSplit, and in the slightly different meaning of the n_splits parameter.

  • For StratifiedKFold, the n_splits parameter is the number of folds, as per the documentation. Therefore, setting n_splits=5 means that, on each of the 5 iterations, the model is trained on 4 folds (80% of the training set) and tested on the remaining fold (20% of the training set).

  • For StratifiedShuffleSplit, the n_splits parameter specifies the number of reshuffling and splitting iterations. On the other hand, it is the train_size parameter (together with test_size) that defines how big each split will be, relative to the size of the training set. In particular, according to the docs, if neither is specified, the defaults amount to train_size=0.9 (90% of the training set) and test_size=0.1 (10% of the training set).
    Therefore, specifying test_size in the StratifiedShuffleSplit constructor, e.g. as below, should solve your problem:

    stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0, test_size=0.2)

Low K-fold accuracy for First Fold

It is entirely possible that this is just an artifact of the data. There is no reason to implement K-fold by hand: scikit-learn has the functionality built in. If you want to test your implementation, run the same experiment with scikit-learn's KFold with shuffling turned off and check whether you get the same results.

It is best practice to shuffle your data anyway prior to running cross validation.
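For example, a quick way to check whether the first fold is simply unlucky (the dataset and model below are placeholders, not from the question) is to compare the fold scores with and without shuffling:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = MLPClassifier(random_state=1, max_iter=1000)

# If the low first-fold score disappears once the rows are shuffled,
# the original ordering of the data was the culprit
for shuffle in (False, True):
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    print(shuffle, cross_val_score(clf, X, y, cv=kf))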

Training 8 different classifiers with cross-validation gives the same accuracy with the same file?

It looks like you are performing CV on the same model in each iteration. I would suggest using a Pipeline for each model and then performing CV on each model's pipeline; you can check the Pipeline documentation for more information on how to use them.
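A sketch of that setup (the scaler and the list of models are illustrative assumptions): build one Pipeline per classifier so each gets its own preprocessing, then cross-validate each pipeline separately:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    'logreg': LogisticRegression(max_iter=5000),
    'tree': DecisionTreeClassifier(random_state=0),
    'mlp': MLPClassifier(random_state=0, max_iter=1000),
}

# Scaling is re-fit inside each CV fold, so nothing leaks from the
# validation fold into training, and each model is evaluated independently
for name, estimator in models.items():
    pipe = Pipeline([('scale', StandardScaler()), ('clf', estimator)])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean())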


