R: How to Split a Data Frame into Training, Validation, and Test Sets

How to split a data frame into training, validation, and test sets based on IDs?

The code you posted from the previous train/validate/test question assigns a train, validate, or test label to each row of a data frame and then splits based on the label of each row:

spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(df)), 
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df, g)

Instead, you could assign a label to each unique level of your ID factor variable and split based on the label assigned to the ID of each row:

set.seed(144)
spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq_along(unique(df$Contact.ID)), 
  length(unique(df$Contact.ID))*cumsum(c(0,spec)),
  labels = names(spec)
))
(res = split(df, g[as.factor(df$Contact.ID)]))
# $train
#   Contact.ID          Date.Time Age Gender Attendance
# 1          A 2012-07-0618:54:48  37   Male         30
# 2          A 2012-07-0620:50:18  37   Male         30
# 3          A 2012-08-1420:18:44  37   Male         30
# 8          C 2013-10-2217:46:07  40   Male          5
# 9          C 2013-10-2711:21:00  40   Male          5
# 
# $test
#   Contact.ID          Date.Time Age Gender Attendance
# 4          B 2012-03-1516:58:15  27 Female         40
# 5          B 2012-04-1810:57:02  27 Female         40
# 6          B 2012-04-1817:31:22  27 Female         40
# 7          B 2012-04-1818:37:00  27 Female         40
# 
# $validate
#    Contact.ID          Date.Time Age Gender Attendance
# 10          D 2012-07-2814:48:33  20 Female         12

Note that this changes the interpretation of the split proportions: the 60% assigned to the training set is now 60% of the unique subject IDs, not 60% of the rows.
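To make that difference concrete, here is a minimal base-R sketch that compares the share of IDs with the share of rows in each set, and checks that no ID ends up in more than one set. The toy data frame is hypothetical (it only mirrors the row counts per ID in the output above), and the lookup uses match() rather than the factor indexing shown earlier; both give each ID a single label.

```r
# Hypothetical toy data: four IDs with 3, 4, 2, and 1 rows
df <- data.frame(
  Contact.ID = rep(c("A", "B", "C", "D"), times = c(3, 4, 2, 1)),
  Attendance = rep(c(30, 40, 5, 12), times = c(3, 4, 2, 1))
)

set.seed(144)
spec <- c(train = .6, test = .2, validate = .2)
ids <- unique(df$Contact.ID)

# One label per unique ID, in shuffled order
g <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))

# Look up each row's label through its ID
row_label <- g[match(df$Contact.ID, ids)]

# Share of IDs per set versus share of rows per set
id_share  <- prop.table(table(g))
row_share <- prop.table(table(row_label))
print(id_share)
print(row_share)

# Number of distinct sets each ID appears in (should be 1 for every ID)
sets_per_id <- tapply(as.character(row_label), df$Contact.ID,
                      function(x) length(unique(x)))
print(sets_per_id)
```

With only four IDs the ID shares cannot match 60/20/20 exactly, and the row shares drift further when IDs have very different numbers of rows, so it is worth printing both before training.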

How to split data into training and validation in R?

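This is how data splitting is done in Max Kuhn's book on the caret package. createDataPartition() draws a stratified random sample of row indices (here 75% of the rows within each level of iris$Species), and the remaining rows form the held-out set.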

library(caret)
set.seed(4650)
trainIndex <- createDataPartition(iris$Species, 
                                  p = .75, 
                                  list = FALSE, 
                                  times = 1)

irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]

Split data into training and test set: How to make sure all factors are included in training set?

This can be done with createDataPartition() from the caret package. It samples within each level of the factor, so every level is represented in the training set.

library(caret)
samp = createDataPartition(as.factor(b$x), p = 0.75, list = FALSE)

train = b[samp,]
test = b[-samp,]
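If you want to see the idea without caret, or confirm that no level was lost, the following base-R sketch samples within each level and then lists any levels missing from the training rows. It is an illustration of per-level sampling, not caret's actual implementation, and the data frame b with columns x and y is made up for the example.

```r
# Hypothetical data: factor x with one rare level ("c")
set.seed(1)
b <- data.frame(
  x = factor(rep(c("a", "b", "c"), times = c(50, 45, 5))),
  y = rnorm(100)
)

# Row indices grouped by level of x
by_level <- split(seq_len(nrow(b)), b$x)

# Take 75% of the indices within each level, rounding up
samp <- unlist(lapply(by_level, function(idx) {
  idx[sample.int(length(idx), ceiling(length(idx) * 0.75))]
}), use.names = FALSE)

train <- b[samp, ]
test  <- b[-samp, ]

# Levels of x that never appear in the training rows (ideally none)
missing_levels <- setdiff(levels(b$x), as.character(train$x))
print(missing_levels)
```

Rounding up within each level is what keeps a rare level from being dropped; a plain sample() over all rows gives no such guarantee.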

How do I split my data set between training and testing sets while keeping the ratio of the target variable in both sets?

What you want is a stratified split of your dataset. You can do this with createDataPartition() from the caret package. Just make sure your Leaver variable is stored as a factor.

See a code example below.

library(caret)
data(GermanCredit)

prop.table(table(GermanCredit$Class))
#  Bad Good 
#  0.3  0.7 

index <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)

# train
prop.table(table(GermanCredit$Class[index]))
#  Bad Good 
#  0.3  0.7 

# test
prop.table(table(GermanCredit$Class[-index]))
#  Bad Good 
#  0.3  0.7
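The same stratification idea can be extended to three sets by reusing the label-and-split approach from the first answer, but applying it separately within each class. The sketch below is base R only and uses the built-in iris data instead of GermanCredit; the helper label_within() and the seed are assumptions made for this example, not part of caret.

```r
set.seed(42)
spec <- c(train = .6, test = .2, validate = .2)

# Hypothetical helper: shuffled train/test/validate labels for n rows
label_within <- function(n, spec) {
  sample(cut(seq_len(n), n * cumsum(c(0, spec)), labels = names(spec)))
}

# Start with an all-NA factor, then fill in labels one class at a time
g <- factor(rep(NA, nrow(iris)), levels = names(spec))
for (idx in split(seq_len(nrow(iris)), iris$Species)) {
  g[idx] <- label_within(length(idx), spec)
}

res <- split(iris, g)

# Rows per set, and class proportions inside each set
print(sapply(res, nrow))
print(lapply(res, function(d) prop.table(table(d$Species))))
```

Because the labels are drawn per class, each of the three sets keeps the class ratio of the full data, up to rounding when a class size is not divisible by the split proportions.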
