Stratified Train/Test-split in scikit-learn
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply take the first fold, e.g. train_idx, test_idx = next(StratifiedKFold(...).split(X, y)),
and use the train and test indices if you want.
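In current scikit-learn (0.18+) the folds come from the split() method rather than from iterating the object directly. A minimal sketch of the index-based approach on toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 100 samples with a 70/30 class imbalance
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 70 + [1] * 30)

# Take the first fold's index arrays as a stratified 75/25 train/test split
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Both parts keep roughly the original 30% positive rate
print(y_train.mean(), y_test.mean())
```

This only uses one of the four folds; the remaining folds are simply discarded.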
Pandas stratified splitting into train, test, and validation set based on the target variable and its cluster
Since you have your data already split by target, you simply need to call train_test_split
on each subset and use the cluster column for stratification.
train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])
Then do the same for target one and combine all the subsets.
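Putting the pieces together, here is a sketch under the assumption of two dataframes, zeroes and ones (one per target value, as in the question), each with a 'Cluster' column; the data here is made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one dataframe per target value, each with cluster labels
zeroes = pd.DataFrame({'Cluster': ['a', 'b'] * 50, 'target': 0})
ones = pd.DataFrame({'Cluster': ['a', 'b'] * 30, 'target': 1})

def split_subset(df):
    # 80% train+test vs. 20% validation, stratified on the cluster column
    train_test, validation = train_test_split(
        df, train_size=0.8, stratify=df['Cluster'], random_state=0)
    # Split the remainder 70/30 into train and test, again by cluster
    train, test = train_test_split(
        train_test, train_size=0.7, stratify=train_test['Cluster'], random_state=0)
    return train, test, validation

train_0, test_0, validation_0 = split_subset(zeroes)
train_1, test_1, validation_1 = split_subset(ones)

# Combine the per-target pieces into the final sets
train = pd.concat([train_0, train_1])
test = pd.concat([test_0, test_1])
validation = pd.concat([validation_0, validation_1])
```

Because each subset is split separately, the target proportions are preserved exactly, and stratify on 'Cluster' keeps the cluster mix within each target.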
Stratified split of subset of data
Just figured it out, in case someone else is wondering the same: the train_size and test_size parameters also accept integers, so I can pass absolute sample counts. Then I run the split again on the held-out set with a 50/50 ratio to get a validation and a test set.
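A sketch of that two-step approach; the sample counts and data here are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, balanced binary labels
X = np.arange(200).reshape(200, 1)
y = np.array([0, 1] * 100)

# Integer sizes: take exactly 100 training samples and hold out exactly 60
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=100, test_size=60, stratify=y, random_state=0)

# Split the held-out pool 50/50 into validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_train), len(X_valid), len(X_test))
```

Note that when train_size and test_size are integers that don't add up to the full dataset, the leftover samples are simply dropped from both outputs.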
Parameter stratify from method train_test_split (scikit Learn)
Scikit-Learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17 as indicated in the documentation you quoted.
So you just need to update Scikit-Learn.
Stratified train/validation/test split without scikit-learn
To perform stratified data splitting, you need to know which class each data point belongs to. If you have a list of data points and a corresponding list of classes, you can extract all the points that belong to a certain class and split them according to the input proportions.
Here's some code that implements the idea:
Note that you'll also have to track which class each data point belongs to after the split; the loop below only collects the data arrays themselves.
import numpy as np

# Desired split proportions
train, valid, test = 0.6, 0.2, 0.2

# Dummy data: 1000 32x32 samples spread over 10 classes
data_points = np.random.rand(1000, 32, 32)
classes = np.random.randint(0, 10, size=(1000,))
class_set = np.unique(classes)

data_train = []
data_valid = []
data_test = []
for class_i in class_set:
    # Indices of all points in this class (np.where returns a tuple of arrays)
    data_inds = np.where(classes == class_i)[0]
    data_i = data_points[data_inds]
    N_i = len(data_inds)
    N_i_train = int(N_i * train)
    N_i_valid = int(N_i * valid)
    data_train.append(data_i[:N_i_train])
    data_valid.append(data_i[N_i_train:N_i_train + N_i_valid])
    data_test.append(data_i[N_i_train + N_i_valid:])

data_train = np.concatenate(data_train)
data_valid = np.concatenate(data_valid)
data_test = np.concatenate(data_test)