Stratified Train/Test-split in scikit-learn
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
[/update for 0.17]
There is a pull request here.
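But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want. (In releases that provide sklearn.model_selection, the folds come from StratifiedKFold(...).split(X, y) rather than from iterating over the object itself.)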
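As a rough sketch of the StratifiedKFold workaround with the sklearn.model_selection API: the toy X and y, n_splits=4 (to get a test fold of about 25%), and random_state are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (illustrative only): 100 samples, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# With 4 folds, each test fold holds about 25% of the samples.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) pair of index arrays.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Only the first fold is used here; the remaining folds are simply discarded.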
Parameter stratify from method train_test_split (scikit Learn)
Scikit-Learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17 as indicated in the documentation you quoted.
So you just need to update Scikit-Learn.
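A quick way to confirm whether the installed release is recent enough is to inspect sklearn.__version__. The parsing below is a minimal sketch that assumes the version string starts with "major.minor"; the variable names are illustrative.

```python
import re
import sklearn

# Extract the leading "major.minor" from a version string such as "1.4.2".
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

# The stratify parameter of train_test_split was added in 0.17.
supports_stratify = (major, minor) >= (0, 17)
print(sklearn.__version__, "supports stratify:", supports_stratify)
```

If this reports False, upgrading with pip install -U scikit-learn should resolve the error.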
How to stratify the training and testing data in Scikit-Learn?
If you want to shuffle and split your data with a 0.3 test ratio, you can use
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is the corresponding labels, test_size is the fraction of the data held out for testing, and shuffle=True shuffles the data before splitting.
To make sure that the train and test sets keep the same proportions of the values in a column, pass that column to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    shuffle=True,
                                                    stratify=X['YOUR_COLUMN_LABEL'])
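To see the effect of stratifying on a column, here is a small sketch with a made-up DataFrame. The column name "group" stands in for YOUR_COLUMN_LABEL, and the data and random_state are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: "group" plays the role of YOUR_COLUMN_LABEL.
X = pd.DataFrame({
    "feature": range(100),
    "group": ["a"] * 70 + ["b"] * 30,
})
y = pd.Series([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=X["group"], random_state=0
)

# Both parts should keep the 70/30 mix of "group".
print(X_train["group"].value_counts(normalize=True))
print(X_test["group"].value_counts(normalize=True))
```

Note that stratifying on a feature column balances that column, not necessarily the labels in y.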
Stratified train/validation/test split without scikit-learn
To perform stratified data splitting, you need to know which class each data point belongs to. If you have a list of data points and a corresponding list of classes, you can extract all the points that belong to a certain class and split them according to the input proportions.
Note that you'll also have to add an array that tracks which class each data point belongs to after being split in the loop.
Here's some code that implements the idea:
import numpy as np

# Fractions for the train, validation, and test sets
train, valid, test = 0.6, 0.2, 0.2
data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))
class_set = np.unique(classes)
data_train = []
data_valid = []
data_test = []
for class_i in class_set:
    # np.where returns a tuple; its first element is the index array
    data_inds = np.where(classes == class_i)[0]
    data_i = data_points[data_inds]
    N_i = len(data_inds)
    N_i_train = int(N_i * train)
    N_i_valid = int(N_i * valid)
    # Whatever is left after train and validation goes to the test set
    data_train.append(data_i[:N_i_train])
    data_valid.append(data_i[N_i_train:N_i_train + N_i_valid])
    data_test.append(data_i[N_i_train + N_i_valid:])
data_train = np.concatenate(data_train)
data_valid = np.concatenate(data_valid)
data_test = np.concatenate(data_test)
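One possible way to keep track of the classes after splitting is to split index arrays instead of the data, then use the same indices on both the data and the labels. The helper name stratified_split_indices, the per-class shuffle, and the seed are assumptions added for this sketch; they are not part of the original answer.

```python
import numpy as np

def stratified_split_indices(classes, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, valid, test) index arrays, stratified by class."""
    rng = np.random.default_rng(seed)
    train_idx, valid_idx, test_idx = [], [], []
    for class_i in np.unique(classes):
        # Shuffle the positions of this class so the split is random.
        inds = rng.permutation(np.flatnonzero(classes == class_i))
        n_train = int(len(inds) * fractions[0])
        n_valid = int(len(inds) * fractions[1])
        train_idx.append(inds[:n_train])
        valid_idx.append(inds[n_train:n_train + n_valid])
        # The remainder goes to the test set.
        test_idx.append(inds[n_train + n_valid:])
    return tuple(np.concatenate(p) for p in (train_idx, valid_idx, test_idx))

data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))

train_idx, valid_idx, test_idx = stratified_split_indices(classes)

# The same indices select both the data and the matching labels.
data_train, labels_train = data_points[train_idx], classes[train_idx]
data_valid, labels_valid = data_points[valid_idx], classes[valid_idx]
data_test, labels_test = data_points[test_idx], classes[test_idx]
```

Because the fractions are truncated with int(), the test set absorbs any rounding remainder within each class.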
Pandas stratified splitting into train, test, and validation set based on the target variable its cluster
Since you have your data already split by target, you simply need to call train_test_split
on each subset and use the cluster column for stratification.
train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])
Then do the same for the target-one subset and combine the corresponding pieces (train with train, test with test, validation with validation).
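Put together, the per-target procedure might look like the sketch below. The DataFrame, the "Target" and "Feature" column names, the split_subset helper, and random_state are hypothetical; only the "Cluster" column and the 0.8 / 0.7 train sizes come from the answer above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: a binary "Target" and a "Cluster" column with 3 clusters.
df = pd.DataFrame({
    "Target": [0] * 60 + [1] * 60,
    "Cluster": [0, 1, 2] * 40,
    "Feature": range(120),
})

def split_subset(subset, seed=0):
    """Split one target subset into train/test/validation, stratified by Cluster."""
    train_test, validation = train_test_split(
        subset, train_size=0.8, stratify=subset["Cluster"], random_state=seed
    )
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test["Cluster"], random_state=seed
    )
    return train, test, validation

# Split each target value separately, then stack the matching pieces.
parts = [split_subset(group) for _, group in df.groupby("Target")]
train = pd.concat([p[0] for p in parts])
test = pd.concat([p[1] for p in parts])
validation = pd.concat([p[2] for p in parts])
```

Each of the three resulting frames then contains both targets, with the cluster mix preserved inside each target.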
how can I train test split in scikit learn
The parameter (stratify = y) inside train_test_split is what gives you the error. Stratification is meant for labels with repeating values, i.e. class labels. For example, say your label column has the values 0 and 1. Passing stratify = y then preserves the original proportion of your labels in both the training and test samples: if the data has 60% 1s and 40% 0s, each sample keeps that same proportion.
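The 60/40 case can be checked directly. This is a small sketch on made-up arrays; the sample count, test_size, and random_state are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60% ones and 40% zeros.
X = np.arange(100).reshape(100, 1)
y = np.array([1] * 60 + [0] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The mean of a 0/1 label array is the share of ones in that split.
print("share of 1s in train:", y_train.mean())
print("share of 1s in test:", y_test.mean())
```

Without stratify=y, the two shares would generally drift away from 0.6 by chance, especially on small datasets.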