How to Create Test and Train Samples from One Dataframe with Pandas

How do I create test and train samples from one dataframe with pandas?

I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
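
Note that the boolean mask gives an approximate 80/20 split (79/21 here). If you need the split to be exact, a minimal alternative sketch using pandas' sample and drop (the fraction and seed are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2))

# sample exactly 80% of the rows for training; the remainder becomes the test set
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)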

How to split datatable dataframe into train and test dataset in python

To split a datatable dataframe into train and test sets with train_test_split(dt_df, classe) from sklearn.model_selection, the solution I use is to convert the datatable dataframe to numpy, as I mentioned in my question post, or to a pandas dataframe, as commented by @Manoor Hassan (to and back again):

source code before split method:

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])  # extract the target column as a flat numpy array
del dt_df[:, "classe"]                 # drop the target column from the feature frame

source code after split method:

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)
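
Optionally, a quick sanity check on the held-out predictions (accuracy_score is just one illustrative metric, not part of the original snippet):

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, pred_test))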

method 1: convert to numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

method 2: convert to numpy, then convert back to a datatable dataframe after the split:

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

X_train = dt.Frame(X_train)

# source code after split method

method 3: convert to pandas dataframe

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

All 3 methods work fine, but there is a difference in the time performance of training (ExTrCl.fit) and prediction (ExTrCl.predict). For a CSV file of about 500 MB I get these results:


                        T convert   T train   T pred
M1  to_numpy                3          85       0.5
M2  to_numpy and back       3.5        29       0.5
M3  to pandas               4          37       4


How to split data into 3 sets (train, validation and test)?

Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts:

  • 60% - train set,
  • 20% - validation set,
  • 20% - test set


In [305]: train, validate, test = \
              np.split(df.sample(frac=1, random_state=42),
                       [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919  0.792216  0.206294  0.440346  0.038960
2  0.301010  0.625697  0.604724  0.936968  0.870064
1  0.642237  0.690403  0.813658  0.525379  0.396053
9  0.488484  0.389640  0.599637  0.122919  0.106505
8  0.842717  0.793315  0.554084  0.100361  0.367465
7  0.185214  0.603661  0.217677  0.281780  0.938540

In [307]: validate
Out[307]:
          A         B         C         D         E
5  0.806176  0.008896  0.362878  0.058903  0.026328
6  0.145777  0.485765  0.589272  0.806329  0.703479

In [308]: test
Out[308]:
          A         B         C         D         E
4  0.521640  0.332210  0.370177  0.859169  0.401087
3  0.333348  0.964011  0.083498  0.670386  0.169619

[int(.6*len(df)), int(.8*len(df))] is the indices_or_sections argument for numpy.split().

Here is a small demo of np.split() usage - let's split a 20-element array into the following parts: 80%, 10%, 10%:

In [45]: a = np.arange(1, 21)

In [46]: a
Out[46]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
Out[47]:
[array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]),
array([17, 18]),
array([19, 20])]
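
The same recipe can be wrapped in a small helper function; this is only a sketch, and the default fractions and seed are illustrative:

import numpy as np

def train_validate_test_split(df, train_frac=0.6, validate_frac=0.2, seed=42):
    # shuffle all rows, then cut at the train and train+validate marks
    shuffled = df.sample(frac=1, random_state=seed)
    train_end = int(train_frac * len(df))
    validate_end = int((train_frac + validate_frac) * len(df))
    return np.split(shuffled, [train_end, validate_end])

train, validate, test = train_validate_test_split(df)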

Divide pandas data frame into test and train based on unique ID

I sorted the issue out in the following way:

from sklearn.model_selection import train_test_split

# split on the unique sample IDs so that all rows belonging to one ID end up in the same set
samplelist = data["ID_sample"].unique()
training_samp, test_samp = train_test_split(samplelist, train_size=0.7, test_size=0.3, random_state=5, shuffle=True)

training_data = data[data['ID_sample'].isin(training_samp)]
test_data = data[data['ID_sample'].isin(test_samp)]
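
An optional sanity check (illustrative, not part of the original answer) confirms that no sample ID appears in both sets and no rows were lost:

# every ID should land in exactly one of the two sets
assert set(training_data['ID_sample']).isdisjoint(test_data['ID_sample'])
assert len(training_data) + len(test_data) == len(data)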

Pandas stratified splitting into train, test, and validation set based on the target variable's cluster

Since you have your data already split by target, you simply need to call train_test_split on each subset and use the cluster column for stratification.

train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])

Then do the same for target one and combine all the subsets, as in the sketch below.
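
A sketch of the full flow, assuming the data was pre-split by target into zeroes and ones (the variable name ones is assumed here by analogy with zeroes from the question), each with a 'Cluster' column:

import pandas as pd
from sklearn.model_selection import train_test_split

# target == 0 subset: 80% train+test vs 20% validation, then 70/30 train/test
train_test_0, validation_0 = train_test_split(zeroes, train_size=0.8, stratify=zeroes['Cluster'])
train_0, test_0 = train_test_split(train_test_0, train_size=0.7, stratify=train_test_0['Cluster'])

# target == 1 subset: same proportions
train_test_1, validation_1 = train_test_split(ones, train_size=0.8, stratify=ones['Cluster'])
train_1, test_1 = train_test_split(train_test_1, train_size=0.7, stratify=train_test_1['Cluster'])

# recombine the per-target pieces into the final sets
train = pd.concat([train_0, train_1])
test = pd.concat([test_0, test_1])
validation = pd.concat([validation_0, validation_1])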

How do you split a pandas multiindex dataframe into train/test sets?

You have 'date' as an index; that's why your query doesn't work. For an index, you can use:

df_train.loc['2020-12-31':]

That selects all rows where the date index is >= '2020-12-31'. So, if you would like to keep only rows where the date is strictly greater than '2020-12-31', you should use df_train.loc['2021-01-01':].
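
A minimal sketch with hypothetical data, assuming 'date' is the first level of the MultiIndex and the index is sorted:

import numpy as np
import pandas as pd

# build a small two-level index: dates around the split point, two tickers per date
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-12-29", periods=5, freq="D"), ["A", "B"]],
    names=["date", "ticker"],
)
df = pd.DataFrame({"x": np.arange(10)}, index=idx).sort_index()

# slice on the first ('date') level of the MultiIndex
df_train = df.loc[:"2020-12-31"]   # rows up to and including 2020-12-31
df_test = df.loc["2021-01-01":]    # rows after 2020-12-31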


