Stratified Splitting of the Data

Stratified Train/Test-split in scikit-learn

[update for 0.17]

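See the docs of sklearn.model_selection.train_test_split. The stratify parameter was added in 0.17; the sklearn.model_selection import path shown here applies from 0.18 onward (earlier releases used sklearn.cross_validation):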

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)

[/update for 0.17]

There is a pull request here.
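But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want. (In current releases StratifiedKFold is no longer iterable itself; the index pairs come from StratifiedKFold(...).split(X, y).)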
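
With a current scikit-learn, the StratifiedKFold route might look like the following. This is a minimal sketch: the toy arrays X and y, the choice of n_splits=4 (so the test fold is about 25% of the data), and the random_state are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 100 samples with an 80/20 class balance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# With 4 folds, each test fold holds a quarter of the samples.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) index pair from the generator.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Both parts keep the 80/20 class ratio of y, which is the property that stratify=y gives you in train_test_split.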

Pandas stratified splitting into train, test, and validation sets based on the target variable and its cluster

Since you have your data already split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.

train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])

Then do the same for the rows with target one and concatenate the corresponding subsets (for example with pd.concat).
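
Put together, the whole procedure could be sketched as below. The DataFrame df, its column names Target and feature, the helper split_one_target, and the fixed seed are all hypothetical; only the Cluster column and the 0.8 / 0.7 train sizes come from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: a binary 'Target' and a 'Cluster' label with 3 values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=600),
    "Target": np.repeat([0, 1], 300),
    "Cluster": np.tile([0, 1, 2], 200),
})

def split_one_target(subset, seed=0):
    """Split one target's rows into train/test/validation, stratified by Cluster."""
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed)
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed)
    return train, test, validation

# Split each target group separately, then recombine the matching parts.
parts = [split_one_target(group) for _, group in df.groupby("Target")]
train = pd.concat([p[0] for p in parts])
test = pd.concat([p[1] for p in parts])
validation = pd.concat([p[2] for p in parts])
```

Because every target group is split on its own, each of the three resulting sets keeps both the target balance and the per-target cluster proportions of the full data.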

Stratified split of subset of data

Just figured it out in case someone else is wondering the same.

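I can simply pass integers for the train_size and test_size parameters; integers are interpreted as absolute numbers of samples rather than fractions, so the split only draws that many rows. Then I run the split again on the test set with a 50/50 ratio to get a validation set and a test set.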
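
A minimal sketch of that approach, assuming a made-up dataset of 10,000 balanced samples and arbitrarily chosen counts of 1,000 training and 400 held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10,000 samples, two balanced classes.
X = np.arange(10_000).reshape(-1, 1)
y = np.arange(10_000) % 2

# Integer sizes are absolute sample counts, so only 1,400 of the
# 10,000 rows end up in the stratified subset.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=1000, test_size=400, stratify=y, random_state=0)

# Halve the held-out part into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

The rows that fall into neither part of the first split are simply left unused.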

Parameter stratify from method train_test_split (scikit-learn)

scikit-learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17, as indicated in the documentation you quoted.

So you just need to update scikit-learn.

Stratified train/validation/test split without scikit-learn

To perform stratified data splitting, you need to know which class each data point belongs to. If you have a list of data points and a corresponding list of classes, you can extract all the points that belong to a certain class and split them according to the input proportions.

Here's some code that implements the idea. Note that it only splits the data points: to keep the labels as well, you'll have to add an array that tracks which class each data point belongs to as it is split in the loop.

import numpy as np

# Fractions for the train, validation, and test sets (test gets the remainder).
train, valid, test = 0.6, 0.2, 0.2
data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))
class_set = np.unique(classes)
data_train = []
data_valid = []
data_test = []
for class_i in class_set:
    # np.where returns a tuple; take the index array for the first axis.
    data_inds = np.where(classes == class_i)[0]
    data_i = data_points[data_inds]
    N_i = len(data_inds)
    N_i_train = int(N_i * train)
    N_i_valid = int(N_i * valid)
    data_train.append(data_i[:N_i_train])
    data_valid.append(data_i[N_i_train:N_i_train + N_i_valid])
    data_test.append(data_i[N_i_train + N_i_valid:])

# Stack the per-class pieces into one array per split.
data_train = np.concatenate(data_train)
data_valid = np.concatenate(data_valid)
data_test = np.concatenate(data_test)
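
One way to keep the labels alongside the data is to split index arrays per class and apply them to both arrays at the end. The sketch below is one possible variant, not the original author's code: the function name stratified_split, its fractions and seed parameters, and the per-class shuffling are assumptions added for illustration.

```python
import numpy as np

def stratified_split(data, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split data and labels into train/valid/test parts, class by class."""
    rng = np.random.default_rng(seed)
    train_frac, valid_frac, _ = fractions  # the test part takes the remainder
    index_parts = ([], [], [])
    for class_i in np.unique(labels):
        # Shuffle the indices of this class so the split is not order-dependent.
        inds = rng.permutation(np.where(labels == class_i)[0])
        n_train = int(len(inds) * train_frac)
        n_valid = int(len(inds) * valid_frac)
        index_parts[0].append(inds[:n_train])
        index_parts[1].append(inds[n_train:n_train + n_valid])
        index_parts[2].append(inds[n_train + n_valid:])
    splits = []
    for part in index_parts:
        idx = np.concatenate(part)
        # Indexing both arrays with the same indices keeps points and labels aligned.
        splits.append((data[idx], labels[idx]))
    return splits

data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))
(train_x, train_y), (valid_x, valid_y), (test_x, test_y) = stratified_split(
    data_points, classes)
```

Working on indices rather than on the data itself means the same split can be applied to any number of parallel arrays.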

