Implement K-fold cross validation in MLPClassification Python
Do not split your data into train and test. This is automatically handled by the KFold cross-validation.
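from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

# X and y are assumed to be NumPy arrays of features and labels
kf = KFold(n_splits=10)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
for train_indices, test_indices in kf.split(X):
    # With the default warm_start=False, fit() retrains from scratch on each fold
    clf.fit(X[train_indices], y[train_indices])
    print(clf.score(X[test_indices], y[test_indices]))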
K-fold cross-validation partitions your dataset into k roughly equal portions (folds). Each fold serves as the test set exactly once, while the model is trained on the remaining k-1 folds. Averaging the k scores gives a more reliable estimate of your model's accuracy than a single train/test split, because every sample is used for testing once.
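To make the partitioning concrete, here is a minimal sketch on a toy array (the 10-sample array and the 5-fold setting are assumptions chosen for illustration). It prints the indices of each split and checks that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption for illustration): 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)

test_folds = []
for train_idx, test_idx in kf.split(X):
    # Within one split, train and test indices never overlap
    assert set(train_idx).isdisjoint(test_idx)
    test_folds.append(test_idx)
    print("train:", train_idx, "test:", test_idx)

# Across all splits, the test folds together cover every sample once
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

Each test fold here holds 2 of the 10 samples, and the model would be trained on the other 8 in that iteration.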
K cross validation with different results everytime
I found the solution to my question.
Setting a random seed as below solved the problem (np.random.seed returns None, so there is nothing to assign):
np.random.seed(22)
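An alternative to seeding NumPy's global generator is to pass random_state explicitly to both the splitter and the classifier. The sketch below is only an illustration: the breast cancer dataset, the small network, the low max_iter (which may trigger a convergence warning) and the helper name run_cv are all assumptions, not part of the original question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)

def run_cv(seed):
    # Hypothetical helper: the same seed fixes both the fold assignment
    # and the network's weight initialization
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=kf)

first = run_cv(22)
second = run_cv(22)
print(first)
print(second)
```

With the seeds fixed this way, the two runs should report the same per-fold scores without touching global state.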
How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross validation?
You can use the .loc method to accomplish this. This question has a nice answer that shows how to use it: df.loc[index_position, "column_name"] = some_value
So, an edited version of the code you posted (I needed data, and removed auc_roc since we aren't using probabilities per your edit):
from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
X,y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()
k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
auroc_score = []
# Create columns
X['Prediction'] = 1
# Define what values to use for the model
model_columns = [x for x in X.columns if x != 'Prediction']
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])
    acc = accuracy_score(y_test, pred_values)
    acc_score.append(acc)
    # Add values to the dataframe (positions match labels because X has a default RangeIndex)
    X.loc[test_index, 'Prediction'] = pred_values
avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
# Add label back per question
X['Label'] = y
# Print first 5 rows to show that it works
print(X.head(n=5))
Yields
accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
   mean radius  mean texture  ...  Prediction  Label
0        17.99         10.38  ...           0      0
1        20.57         17.77  ...           0      0
2        19.69         21.25  ...           0      0
3        11.42         20.38  ...           1      0
4        20.29         14.34  ...           0      0
[5 rows x 32 columns]
(Obviously the model/values etc are all arbitrary)
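If you only need the out-of-fold predictions and not the per-fold scores, scikit-learn's cross_val_predict can replace the manual loop. This is a sketch under assumptions: the max_iter and random_state values are arbitrary, and the result is written to a copy (named result here for illustration) so the feature frame stays clean.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
kf = KFold(n_splits=5)
model = MLPClassifier(max_iter=50, random_state=0)  # arbitrary settings

# Each row's prediction comes from the model trained while that row was held out
oof_pred = cross_val_predict(model, X, y, cv=kf)

# Attach predictions and labels to a copy, keeping X free of non-feature columns
result = X.copy()
result['Prediction'] = oof_pred
result['Label'] = y
print(result[['Prediction', 'Label']].head())
```

Working on a copy also removes the need to filter out the Prediction column before each fit.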
Different Confusion Matrix with Cross-Validation
As noted in the comments, for the first question the first option is the way to go: split the whole dataset with train_test_split, then call the .split() method of the chosen cross-validator object on the training set.
For the second point, the issue lies in the default parameters of StratifiedKFold and StratifiedShuffleSplit, and in the slightly different meaning of the n_splits parameter.
For StratifiedKFold, n_splits is the number of folds, as per the documentation. Setting n_splits=5 therefore means that the model is trained on 4 folds (80% of the training set) and tested on the remaining fold (20% of the training set), with each fold serving as the test fold once.
For StratifiedShuffleSplit, n_splits specifies the number of reshuffling and splitting iterations. The size of each split is instead controlled by train_size and test_size (relative to the size of the training set). According to the docs, if neither is specified, the defaults are test_size=0.1 (10% of the training set) and train_size=0.9 (the remaining 90%).
Therefore, specifying test_size in the StratifiedShuffleSplit constructor should solve your problem, e.g.:
stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0, test_size=0.2)
Low K-fold accuracy for First Fold
It is entirely possible that this is just a result of the data. There is no reason to implement this by hand; scikit-learn has the functionality built in. If you want to test your implementation, try running the experiment with the shuffle parameter off to see if you get the same results.
It is best practice to shuffle your data anyway prior to running cross validation.
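A quick way to see why shuffling matters is to run the built-in cross_val_score on data that is sorted by class. The sketch below uses the iris dataset and an arbitrary small MLP (both assumptions for illustration); with 3 unshuffled folds, each test fold contains a class the model never saw during training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Iris rows are ordered by class: 50 of class 0, then 50 of class 1, then 50 of class 2
X, y = load_iris(return_X_y=True)
clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10,),
                    max_iter=500, random_state=0)

# Without shuffling, every test fold is a single class absent from training
unshuffled = cross_val_score(clf, X, y, cv=KFold(n_splits=3))

# With shuffling, each fold contains a mix of all classes
shuffled = cross_val_score(
    clf, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0))

print(unshuffled)
print(shuffled)
```

The unshuffled scores are all zero because a model trained on two classes cannot predict the third. Real datasets are rarely this extreme, but any ordering in the rows can skew individual folds in the same way.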
Training 8 different classifiers with crossvalidation give same accuracy with the same file?
I would suggest using a pipeline for each model. It looks like you are performing CV on the same model on each iteration. You can check the doc here for more information on how to use them. Then perform CV for each model pipeline.
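As a rough sketch of that suggestion, the loop below builds a separate pipeline for each estimator and cross-validates each one with the same folds. The three classifiers, the StandardScaler step and the dataset are stand-ins chosen for illustration, not the eight models from the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stand-in models (assumption); substitute your own classifiers here
models = {
    'logreg': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, estimator in models.items():
    # A fresh pipeline per model, so the estimator actually changes each iteration
    pipe = make_pipeline(StandardScaler(), estimator)
    results[name] = cross_val_score(pipe, X, y, cv=cv).mean()
    print(name, results[name])
```

If every model still reports an identical score with a loop like this, the likely culprit is that the same estimator object is being evaluated on each pass instead of the loop variable.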