How can I ensure that a partition has representative observations from each level of a factor?
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need. It is available on CRAN, and the homepage is here:
caret - data splitting
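As a minimal sketch (the built-in iris data and the 80/20 ratio are used purely for illustration, they are not the asker's setup), passing the factor itself as y makes createDataPartition() sample within each level:

```r
# Illustrative sketch: iris and p = 0.8 are assumptions, not the asker's data.
library(caret)

set.seed(42)  # make the random split reproducible

# With a factor as y, sampling is done within each level
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train <- iris[idx, ]
test  <- iris[-idx, ]

# Every level of the factor should appear in both parts
table(train$Species)
table(test$Species)
```

Comparing the two tables is a quick way to confirm that no level was left out of either part.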
The function below is based partly on code I found online a while back, which I then modified slightly to better handle edge cases (such as asking for a sample size larger than the set, or larger than a subset).
stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  #        number) as the first two arguments.
  #      * Decide on your sample size. For a sample proportional to the
  #        population, enter "size" as a decimal. For an equal number
  #        of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  #            where the grouping variable is the fourth variable, use:
  #
  #            > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  #            named "z"; grouping variable is the third variable:
  #
  #            > stratified(z, 3, 5)
  #
  library(sampling)  # provides strata() and getdata()

  # strata() expects the data sorted by the stratification variable
  temp <- df[order(df[[group]]), ]
  colsToReturn <- ncol(df)

  # Don't attempt to sample more than the smallest group contains
  dfCounts <- table(df[[group]])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    # Proportional sampling: a fraction of each group, rounded up
    size <- ceiling(table(temp[[group]]) * size)
  } else {
    # Fixed sampling: the same number of rows from every group
    size <- rep(size, times = length(table(temp[[group]])))
  }

  # Simple random sampling without replacement within each stratum
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  dsample <- getdata(temp, strat)

  # Sort by the first column and drop the extra columns added by getdata()
  dsample <- dsample[order(dsample[[1]]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)
}
Error in createDataPartition.... : y must have at least 2 data points
I guess what you need is
set.seed(7)
validationIndex <- caret::createDataPartition(TBdta$MEDV, p=0.80, list=FALSE)
validation <- TBdta[-validationIndex,]
dataset <- TBdta[validationIndex,]
So that you have
dim(validation)
#[1] 99 14
dim(dataset)
#[1] 407 14
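The error message itself says that the y passed in had fewer than two elements. One common way to end up there (an assumption here, since the original call is not shown) is a misspelled or wrongly-cased column name, because `$` on a missing column returns NULL. Below is a small base R check that could be run before partitioning; the data frame toy and the helper get_outcome are made up for illustration:

```r
# Hypothetical stand-in data; only the column name MEDV comes from the answer above.
toy <- data.frame(
  MEDV = c(24.0, 21.6, 34.7, 33.4, 36.2),
  RM   = c(6.6, 6.4, 7.2, 7.0, 7.1)
)

# A wrongly-cased name silently yields NULL, i.e. a y of length 0
bad_y <- toy$medv
is.null(bad_y)

# Illustrative helper that fails early with a clearer message
get_outcome <- function(df, column) {
  if (!column %in% names(df)) {
    stop("Column '", column, "' not found in data frame")
  }
  y <- df[[column]]
  if (length(y) < 2) {
    stop("Outcome needs at least 2 values, got ", length(y))
  }
  y
}

y <- get_outcome(toy, "MEDV")
length(y)
```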
Partitioning data set in R based on multiple classes of observations
This may be longer, but I think it's more intuitive, and it can be done in base R ;)
# create the data frame you've described
x <-
    data.frame(
        cl =
            c(
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 )
            ) ,
        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 )
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply(
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) ,
        # break the sample function out by
        # the classification variable
        x$cl ,
        # use the sample function within
        # each classification variable group
        sample ,
        # send the size = 67 parameter
        # through to the sample() function
        size = 67
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
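If the classes are not all the same size, a fixed size = 67 no longer gives a consistent share per class. A variation on the same base R idea, sketched here with made-up data and a hypothetical helper name, samples a proportion within each class instead. It indexes into the row numbers explicitly, because sample(v, n) treats a length-one numeric v as 1:v, which would go wrong for a class with a single row.

```r
# Made-up, unbalanced example data (not from the question)
set.seed(1)
x2 <- data.frame(
  cl  = rep(c("A", "B", "C"), times = c(100, 40, 10)),
  val = rnorm(150)
)

# Hypothetical helper: pick round(p * n) row numbers within one class,
# but always at least one
sample_within <- function(rows, p) {
  n <- max(1, round(length(rows) * p))
  rows[sample.int(length(rows), n)]
}

# Apply the helper within each class and flatten to one vector of row numbers
tr2 <- unlist(tapply(seq_len(nrow(x2)), x2$cl, sample_within, p = 2/3))

training2 <- x2[tr2, ]
testing2  <- x2[-tr2, ]

# Class shares should be close in both parts
prop.table(table(training2$cl))
prop.table(table(testing2$cl))
```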
How to sample/partition panel data by individuals( preferably with caret library)?
I think there's a little bug in the sampling approach using sample(): it is using the id variable like a row number. Instead, the function needs to fetch all rows belonging to an ID:
nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ]
testing <- data[!data$id %in% inTrainID, ]
head(training[, 1:5], 10)
#    id FEMALE YEAR AGE   HANDDUM
# 1   1      0 1984  54 0.0000000
# 2   1      0 1985  55 0.0000000
# 3   1      0 1986  56 0.0000000
# 8   3      1 1984  58 0.1687193
# 9   3      1 1986  60 1.0000000
# 10  3      1 1987  61 0.0000000
# 11  3      1 1988  62 1.0000000
# 12  4      1 1985  29 0.0000000
# 13  5      0 1987  27 1.0000000
# 14  5      0 1988  28 0.0000000
dim(data)
# [1] 27326 41
dim(training)
# [1] 20566 41
dim(testing)
# [1] 6760 41
20566/27326
### 75.26% were selected for training
Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.
table(data$WORKING) / nrow(data)
# 0 1
# 0.3229525 0.6770475
#
table(training$WORKING) / nrow(training)
# 0 1
# 0.3226685 0.6773315
#
table(testing$WORKING) / nrow(testing)
# 0 1
# 0.3238166 0.6761834
### virtually equal
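The same idea can be wrapped in a small helper. This is only a sketch on made-up panel data; the function name split_by_id and the toy columns are assumptions, not part of the original question. The final check confirms that no individual ends up in both sets.

```r
# Made-up panel: 20 individuals with 1 to 5 yearly observations each
set.seed(123)
n_obs <- sample(1:5, 20, replace = TRUE)
panel <- data.frame(
  id   = rep(1:20, times = n_obs),
  year = unlist(lapply(n_obs, function(n) 1984 + seq_len(n) - 1))
)

# Hypothetical helper: sample IDs, then keep all rows of the sampled IDs
split_by_id <- function(data, id_col, p) {
  ids      <- unique(data[[id_col]])
  train_id <- ids[sample.int(length(ids), round(length(ids) * p))]
  in_train <- data[[id_col]] %in% train_id
  list(training = data[in_train, ], testing = data[!in_train, ])
}

parts <- split_by_id(panel, "id", p = 0.75)

# No individual may appear in both sets, so this count is zero
length(intersect(parts$training$id, parts$testing$id))
```

Note that p applies to the number of individuals, not the number of rows, so the row share in the training set will only be approximately p when individuals have different numbers of observations.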
Splitting Dataframe into Confirmatory and Exploratory Samples
You can check out my stratified function, which you should be able to use like this:
set.seed(1) ## just so you can reproduce this
## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)
## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]
summary(sample1)
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
summary(sample2)
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
Compare the following and see if they are within your expectations.
x1 <- round(prop.table(
xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.
Update:
stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.
set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
#
# $SET2
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
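For readers who want to see the idea without the function, here is a rough base R sketch of splitting on the combination of several factors. The data are made up, and this is not how stratified() is implemented; it only illustrates sampling about half of each Gender/Region/Age cell.

```r
# Made-up data with the same grouping columns as above
set.seed(1)
dat2 <- data.frame(
  Gender = factor(sample(c("F", "M"), 1000, replace = TRUE)),
  Region = factor(sample(1:5, 1000, replace = TRUE)),
  Age    = factor(sample(1:6, 1000, replace = TRUE)),
  X1     = rnorm(1000)
)

# One stratum per observed combination of the three factors
cell <- interaction(dat2$Gender, dat2$Region, dat2$Age, drop = TRUE)

# Within each cell, pick about half of the row numbers (rounded up)
pick <- unlist(lapply(split(seq_len(nrow(dat2)), cell), function(rows) {
  rows[sample.int(length(rows), ceiling(length(rows) / 2))]
}))

set1 <- dat2[pick, ]
set2 <- dat2[-pick, ]

# Compare cell proportions between the two halves
round(prop.table(table(set1$Gender, set1$Region)), 3)
round(prop.table(table(set2$Gender, set2$Region)), 3)
```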