How to split a data frame into training, validation, and test sets dependent on IDs?
The code you posted from the previous train/validate/test question assigns a train, validate, or test label to each row of a data frame and then splits based on the label of each row:
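# Proportion of rows for each set
spec = c(train = .6, test = .2, validate = .2)
# Cut the row positions into three bins, then shuffle the resulting labels
g = sample(cut(
  seq(nrow(df)),
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df, g)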
Instead, you could assign a label to each unique level of your ID factor variable and split based on the label assigned to the ID of each row:
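set.seed(144)
spec = c(train = .6, test = .2, validate = .2)
# One shuffled label per unique ID (not per row)
g = sample(cut(
  seq_along(unique(df$Contact.ID)),
  length(unique(df$Contact.ID))*cumsum(c(0,spec)),
  labels = names(spec)
))
# Index g by each row's ID (via its factor code) so all rows of an ID share a label
(res = split(df, g[as.factor(df$Contact.ID)]))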
# $train
# Contact.ID Date.Time Age Gender Attendance
# 1 A 2012-07-0618:54:48 37 Male 30
# 2 A 2012-07-0620:50:18 37 Male 30
# 3 A 2012-08-1420:18:44 37 Male 30
# 8 C 2013-10-2217:46:07 40 Male 5
# 9 C 2013-10-2711:21:00 40 Male 5
#
# $test
# Contact.ID Date.Time Age Gender Attendance
# 4 B 2012-03-1516:58:15 27 Female 40
# 5 B 2012-04-1810:57:02 27 Female 40
# 6 B 2012-04-1817:31:22 27 Female 40
# 7 B 2012-04-1818:37:00 27 Female 40
#
# $validate
# Contact.ID Date.Time Age Gender Attendance
# 10 D 2012-07-2814:48:33 20 Female 12
Note that this changes the interpretation of the split proportions: the 60% assigned to the training set is now 60% of the unique subject IDs, not 60% of the rows.
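To make that last point concrete, here is a minimal base R sketch. The toy data frame and its `value` column are made up for illustration, and the lookup of labels by ID name is a small variation on the indexing used above. It checks that no ID ends up in more than one set and compares the share of IDs with the share of rows in each set.

```r
# Toy data (hypothetical): 10 IDs with different numbers of rows each
set.seed(144)
df <- data.frame(
  Contact.ID = rep(LETTERS[1:10], times = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)),
  value = seq_len(30)
)

spec <- c(train = .6, test = .2, validate = .2)
ids <- unique(df$Contact.ID)

# One shuffled label per unique ID, named by the ID it belongs to
g <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))
names(g) <- ids

# Look up each row's label through its ID, then split
row_label <- g[as.character(df$Contact.ID)]
res <- split(df, row_label)

# Number of distinct sets each ID appears in (should be 1 for every ID)
sets_per_id <- tapply(as.character(row_label), df$Contact.ID,
                      function(x) length(unique(x)))
all(sets_per_id == 1)

# Share of IDs per set versus share of rows per set
prop.table(table(g))
prop.table(table(row_label))
```

The ID shares match `spec`, while the row shares depend on how many rows each sampled ID happens to have.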
How to split data into training and validation in R?
This is how data splitting is done in Max Kuhn's book on the caret package.
library(caret)
set.seed(4650)
trainIndex <- createDataPartition(iris$Species,
                                  p = .75,
                                  list = FALSE,
                                  times = 1)
irisTrain <- iris[ trainIndex,]
irisTest <- iris[-trainIndex,]
Split data into training and test set: How to make sure all factors are included in training set?
This can be done with the caret package's createDataPartition() function, which samples within each level of the factor.
library(caret)
samp = createDataPartition(as.factor(b$x), p = 0.75, list = FALSE)
train = b[samp,]
test = b[-samp,]
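Whichever function produces the split, it is worth checking that no level is missing from the training set. The sketch below uses base R only: the toy data frame `b` is made up, and the per-level sampling is an illustration of the idea, not caret's implementation.

```r
# Toy data (made up): level "c" is rare
b <- data.frame(
  x = rep(c("a", "b", "c"), times = c(20, 10, 2)),
  y = seq_len(32)
)

# Sample within each level; ceiling() keeps at least one row
# of every non-empty level in the training set
set.seed(1)
rows_by_level <- split(seq_len(nrow(b)), b$x)
samp <- unlist(lapply(rows_by_level, function(idx) {
  idx[sample.int(length(idx), ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

train <- b[samp, ]
test  <- b[-samp, ]

# TRUE when no level of x is missing from the training set
all(unique(b$x) %in% train$x)

# Levels that appear in test but not in train (ideally none)
setdiff(unique(test$x), unique(train$x))
```

The same two checks at the end can be run on the `train` and `test` objects from the caret call above.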
How do I split my data set between training and testing sets while keeping the ratio of the target variable in both sets?
What you want to do is stratified splitting of your dataset. You can do this with the createDataPartition function from the caret package. Just make sure your Leaver variable is set as a factor.
See a code example below.
library(caret)
data(GermanCredit)
prop.table(table(GermanCredit$Class))
# Bad Good
# 0.3 0.7
index <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
# train
prop.table(table(GermanCredit$Class[index]))
# Bad Good
# 0.3 0.7
# test
prop.table(table(GermanCredit$Class[-index]))
# Bad Good
# 0.3 0.7
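For comparison, a plain random split gives no such guarantee. The base R sketch below uses a hypothetical helper, `class_shares`, and a made-up target vector with the same 30/70 balance. It tabulates the class shares of the full data, the training rows, and the test rows for any index vector, so the same check can be applied to an index from createDataPartition.

```r
# Hypothetical helper: class shares in the full data, the training rows,
# and the test rows for a given vector of training row indices
class_shares <- function(y, index) {
  y <- as.factor(y)
  rbind(
    all   = prop.table(table(y)),
    train = prop.table(table(y[index])),
    test  = prop.table(table(y[-index]))
  )
}

# Imbalanced toy target (made up): 30% "Bad", 70% "Good"
y <- factor(rep(c("Bad", "Good"), times = c(30, 70)))

# Plain random split: the train and test shares can drift from 0.3 / 0.7
set.seed(42)
plain_index <- sample(length(y), size = 60)
round(class_shares(y, plain_index), 2)
```

With a stratified index the `train` and `test` rows should sit close to the `all` row; with a plain random index they may not, especially for small or heavily imbalanced data.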