Stratified Train/Test-Split in Scikit-Learn

[update for 0.17]

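See the docs of sklearn.model_selection.train_test_split (in 0.17 the function lived in sklearn.cross_validation; it moved to sklearn.model_selection in 0.18):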

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/update for 0.17]

There is a pull request here.
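But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want. With the sklearn.model_selection API (0.18 and later), StratifiedKFold is no longer iterable itself, so the equivalent is train, test = next(StratifiedKFold(...).split(X, y)).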
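As a rough illustration of the StratifiedKFold workaround with the current API, the sketch below takes the first fold as a single stratified split. The toy arrays, n_splits=4, and random_state=0 are assumptions made for the example, not values from the original answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumed for illustration): 100 samples with an 80/20 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# With n_splits=4, each test fold holds about one quarter of the data.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first fold and use its index arrays as one train/test split.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(train_idx), len(test_idx))
print(y_train.mean(), y_test.mean())  # fraction of class 1 in each part
```

The test fraction is fixed at 1/n_splits here, so train_test_split with stratify is the more flexible option when it is available.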

Parameter stratify from method train_test_split (scikit Learn)

Scikit-Learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17 as indicated in the documentation you quoted.

So you just need to update Scikit-Learn.
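To check whether the installed release is new enough, something like the following sketch may help. It only parses the version string; the 0.17 threshold comes from the answer above, and the variable name supports_stratify is made up for the example.

```python
import sklearn

# Parse "major.minor" from the version string, e.g. "0.16.1" -> (0, 16).
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])

# The stratify parameter of train_test_split was added in 0.17.
supports_stratify = (major, minor) >= (0, 17)

print("scikit-learn version:", sklearn.__version__)
print("stratify available:", supports_stratify)

# If this prints False, upgrading should fix the error, for example with:
#   pip install --upgrade scikit-learn
```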

How to stratify the training and testing data in Scikit-Learn?

If you want to shuffle and split your data with 0.3 test ratio, you can use

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

where X is your data, y is the corresponding labels, test_size is the fraction of the data held out for testing (0.3 means 30%), and shuffle=True shuffles the data before splitting.

To make sure that the train and test sets keep the same proportions of the values in a given column, pass that column to the stratify parameter.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    shuffle=True,
                                                    stratify=X['YOUR_COLUMN_LABEL'])
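A small self-contained sketch of stratifying on a column is shown below. The DataFrame, the column name "group" (standing in for YOUR_COLUMN_LABEL), and random_state=0 are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 'group' plays the role of YOUR_COLUMN_LABEL.
X = pd.DataFrame({
    "feature": range(100),
    "group": ["a"] * 70 + ["b"] * 30,
})
y = pd.Series([i % 2 for i in range(100)], name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=X["group"], random_state=0
)

# Both parts should show the same 70/30 mix of 'a' and 'b'.
print(X_train["group"].value_counts(normalize=True))
print(X_test["group"].value_counts(normalize=True))
```

Note that stratifying on a feature column balances that column, not necessarily the labels in y; pass stratify=y if the label distribution is what should be preserved.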

Stratified train/validation/test split without scikit-learn

To perform stratified data splitting, you need to know which class each data point belongs to. If you have a list of data points and a corresponding list of classes, you can extract all the points that belong to a certain class and split them according to the input proportions.

Here's some code that implements the idea:
Note that you'll have to add some array that tracks the classes the data points belong to after being split in the loop.

import numpy as np

train, valid, test = 0.6, 0.2, 0.2
data_points = np.random.rand(1000, 32, 32)
classes     = np.random.randint(0, 10, size=(1000,))
class_set   = np.unique(classes)
data_train  = []
data_valid  = []
data_test   = []
for class_i in class_set:
    # np.where returns a tuple; take its first element to get the index array
    data_inds    = np.where(classes == class_i)[0]
    data_i       = data_points[data_inds]
    N_i          = len(data_inds)
    N_i_train    = int(N_i * train)
    N_i_valid    = int(N_i * valid)
    data_train.append(data_i[:N_i_train])
    data_valid.append(data_i[N_i_train:N_i_train + N_i_valid])
    # whatever remains after train and valid goes to test
    data_test.append(data_i[N_i_train + N_i_valid:])

data_train = np.concatenate(data_train)
data_valid = np.concatenate(data_valid)
data_test  = np.concatenate(data_test)
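One possible way to keep track of the class labels, as the note above suggests, is to split index arrays per class and then index both the data and the labels with them. The function name stratified_split, its parameters, and the per-class shuffling are assumptions of this sketch rather than part of the original answer.

```python
import numpy as np

def stratified_split(data, labels, train=0.6, valid=0.2, seed=0):
    """Split data and labels per class into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    train_idx, valid_idx, test_idx = [], [], []
    for class_i in np.unique(labels):
        # Indices of this class, shuffled so the split does not depend on order.
        inds = rng.permutation(np.flatnonzero(labels == class_i))
        n_train = int(len(inds) * train)
        n_valid = int(len(inds) * valid)
        train_idx.append(inds[:n_train])
        valid_idx.append(inds[n_train:n_train + n_valid])
        test_idx.append(inds[n_train + n_valid:])  # remainder goes to test
    train_idx, valid_idx, test_idx = (
        np.concatenate(part) for part in (train_idx, valid_idx, test_idx)
    )
    # Index data and labels with the same arrays so they stay aligned.
    return (
        (data[train_idx], labels[train_idx]),
        (data[valid_idx], labels[valid_idx]),
        (data[test_idx], labels[test_idx]),
    )

data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))
(train_x, train_y), (valid_x, valid_y), (test_x, test_y) = stratified_split(
    data_points, classes
)
print(train_x.shape, valid_x.shape, test_x.shape)
```

Working with indices instead of slicing the data directly means any number of parallel arrays (labels, sample weights, identifiers) can be split consistently.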

Pandas stratified splitting into train, test, and validation sets based on the target variable and its cluster

Since you have your data already split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.

from sklearn.model_selection import train_test_split

train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])

With these sizes, roughly 20% of the subset goes to validation, 24% to test, and 56% to training. Then do the same for target one and combine all the subsets.
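The per-target splitting and recombination could look roughly like the sketch below. The helper split_subset, the toy DataFrame, and the column names "Target" and "Cluster" are assumptions for illustration; only the two train_test_split calls mirror the answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_subset(subset, seed=0):
    """Split one target subset into train/test/validation, stratified by 'Cluster'."""
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed
    )
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed
    )
    return train, test, validation

# Hypothetical frame: two targets, each containing two equally sized clusters.
df = pd.DataFrame({
    "Target": [0] * 100 + [1] * 100,
    "Cluster": ([0] * 50 + [1] * 50) * 2,
    "value": range(200),
})
zeroes = df[df["Target"] == 0]
ones = df[df["Target"] == 1]

# Split each target subset separately, then stack the matching parts.
parts = [split_subset(zeroes), split_subset(ones)]
train = pd.concat([p[0] for p in parts])
test = pd.concat([p[1] for p in parts])
validation = pd.concat([p[2] for p in parts])
print(len(train), len(test), len(validation))
```

Because each target subset is split on its own, both the target ratio and the cluster ratio within each target carry over to all three sets.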

How can I do a train/test split in scikit-learn?

The parameter (stratify = y) inside train_test_split is what is giving you the error. Stratification only makes sense when your labels are discrete classes whose values repeat. For example, say your label column contains the values 0 and 1. Passing stratify = y then preserves the original proportion of the labels in both the training and the test samples: if the full data set has 60% 1s and 40% 0s, each split will have the same proportions.
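The 60/40 case can be checked with a short sketch like the one below. The toy arrays, test_size=0.25, and random_state=0 are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumed): 60% ones and 40% zeros.
y = np.array([1] * 60 + [0] * 40)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The mean of a 0/1 label array is the fraction of ones in it.
print(y_train.mean(), y_test.mean())
```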
