How to Ensure That a Partition Has Representative Observations from Each Level of a Factor

How can I ensure that a partition has representative observations from each level of a factor?

Try the caret package, in particular the function createDataPartition(). It should do exactly what you need. The package is available on CRAN, and its data-splitting documentation is here:

caret - data splitting
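A minimal sketch of that suggestion, assuming caret is installed. The built-in iris data and the 80% split are illustrative stand-ins, not taken from the question:

```r
library(caret)

set.seed(42)
# For a factor outcome, createDataPartition() samples within each level,
# so every level of Species is represented in the training set.
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Per-level counts in each partition
table(train$Species)
table(test$Species)
```

If the counts in the two tables look lopsided for your own data, check that the outcome is really a factor; for a numeric outcome the function groups values by percentiles instead.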

The function below is partly based on code I found online a while back, which I modified slightly to better handle edge cases (such as asking for a sample size larger than the data set or one of its groups).

stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  #        number) as the first two arguments.
  #      * Decide on your sample size. For a sample proportional to the
  #        population, enter "size" as a decimal. For an equal number
  #        of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  #            where the grouping variable is the fourth variable, use:
  #
  #            > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  #            named "z"; grouping variable is the third variable:
  #
  #            > stratified(z, 3, 5)
  #
  library(sampling)  # provides strata() and getdata()

  # strata() expects the data sorted by the stratification variable
  temp <- df[order(df[[group]]), ]
  colsToReturn <- ncol(df)

  # Don't attempt to sample more rows than the smallest group has
  dfCounts <- table(df[[group]])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    # Proportional sample: a fraction of each group, rounded up
    size <- ceiling(table(temp[[group]]) * size)
  } else {
    # Fixed sample: the same number of rows from every group
    size <- rep(size, times = length(table(temp[[group]])))
  }
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  dsample <- getdata(temp, strat)

  # Sort by the first column and drop the bookkeeping columns added by getdata()
  dsample <- dsample[order(dsample[[1]]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)
}
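For readers without the sampling package, the same idea can be sketched in base R. The helper name stratified_base is hypothetical, and it differs from the function above in one respect: a whole-number size is capped per group rather than at the size of the smallest group.

```r
# Hypothetical helper: stratified sampling without replacement in base R.
#   size < 1  -> that fraction of each group (rounded up)
#   size >= 1 -> that many rows per group, capped at the group's size
stratified_base <- function(df, group, size) {
  # Row indices of df, split by the levels of the grouping column
  idx_by_group <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  picked <- lapply(idx_by_group, function(idx) {
    n <- if (size < 1) ceiling(length(idx) * size) else min(size, length(idx))
    # sample.int() avoids the surprise of sample() on a length-1 vector
    idx[sample.int(length(idx), n)]
  })
  df[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

set.seed(1)
s <- stratified_base(mtcars, "cyl", 0.5)
table(s$cyl)  # half of each cylinder group, rounded up
```

Unlike the version above, this keeps the original row names, which makes it easy to recover the unsampled rows afterwards.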

Error in createDataPartition.... : y must have at least 2 data points

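I think what you need is the following. Note that createDataPartition() returns the indices of the sampled 80% of rows, so negating them selects the remaining 20% as the validation set: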

set.seed(7)
validationIndex <- caret::createDataPartition(TBdta$MEDV, p=0.80, list=FALSE)
validation <- TBdta[-validationIndex,]
dataset <- TBdta[validationIndex,]

So that you have

dim(validation)
#[1] 99 14
dim(dataset)
#[1] 407 14

Partitioning data set in r based on multiple classes of observations

This may be longer, but I think it's more intuitive, and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl =
            c(
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 )
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 )
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply(
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) ,

        # break the sample function out by
        # the classification variable
        x$cl ,

        # use the sample function within
        # each classification variable group
        sample ,

        # send the size = 67 parameter
        # through to the sample() function
        size = 67
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
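To confirm that the tapply() approach gives every class the same share, the split can be tabulated by class. The sketch below rebuilds a smaller stand-in for x (one numeric column instead of seven) so that it runs on its own:

```r
set.seed(10)
# Toy data: 4 classes with 100 rows each (stand-in for x above)
x <- data.frame(cl = rep(c("A", "B", "C", "D"), each = 100),
                othernum1 = rnorm(400))

# Sample 67 row indices within each class, then flatten to one vector
tr <- unlist(tapply(seq_len(nrow(x)), x$cl, sample, size = 67))

training.df <- x[tr, ]
testing.df  <- x[-tr, ]

table(training.df$cl)  # 67 rows per class
table(testing.df$cl)   # 33 rows per class
```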

How to sample/partition panel data by individuals (preferably with the caret library)?

I think there's a little bug in the sampling approach using sample(): It is using the id variable like a row number. Instead, the function needs to fetch all rows belonging to an ID:

nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ]
testing <- data[!data$id %in% inTrainID, ]

head(training[, 1:5], 10)
#    id FEMALE YEAR AGE   HANDDUM
# 1   1      0 1984  54 0.0000000
# 2   1      0 1985  55 0.0000000
# 3   1      0 1986  56 0.0000000
# 8   3      1 1984  58 0.1687193
# 9   3      1 1986  60 1.0000000
# 10  3      1 1987  61 0.0000000
# 11  3      1 1988  62 1.0000000
# 12  4      1 1985  29 0.0000000
# 13  5      0 1987  27 1.0000000
# 14  5      0 1988  28 0.0000000

dim(data)
# [1] 27326 41
dim(training)
# [1] 20566 41
dim(testing)
# [1] 6760 41
20566/27326
### 75.26% were selected for training

Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.

table(data$WORKING) / nrow(data)
#         0         1
# 0.3229525 0.6770475
#
table(training$WORKING) / nrow(training)
#         0         1
# 0.3226685 0.6773315
#
table(testing$WORKING) / nrow(testing)
#         0         1
# 0.3238166 0.6761834
### virtually equal
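The key property of the ID-based split above is that no individual ends up in both sets. A self-contained sketch on made-up panel data (the panel data frame and its columns are illustrative assumptions) shows one way to check this:

```r
set.seed(123)
# Toy panel: 20 individuals with 1 to 5 yearly rows each (illustrative only)
n_obs <- sample(1:5, 20, replace = TRUE)
panel <- data.frame(id   = rep(1:20, times = n_obs),
                    year = 1983 + sequence(n_obs))

# Sample IDs, not rows, then pull every row belonging to a sampled ID
ids <- unique(panel$id)
train_ids <- sample(ids, round(length(ids) * 0.75))
training <- panel[panel$id %in% train_ids, ]
testing  <- panel[!panel$id %in% train_ids, ]

# No individual should appear in both sets
length(intersect(training$id, testing$id))     # 0
nrow(training) + nrow(testing) == nrow(panel)  # TRUE
```

Because individuals contribute different numbers of rows, the share of rows in the training set is only approximately 75%, as in the output above.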

Splitting Dataframe into Confirmatory and Exploratory Samples

You can check out my stratified function, which you should be able to use like this:

set.seed(1) ## just so you can reproduce this

## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)

## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

summary(sample1)
#  Gender   Region   Age     X1
#  F:235    1:112    1:84    Min.   :-2.82847
#  M:259    2: 90    2:78    1st Qu.:-0.69711
#           3: 94    3:82    Median :-0.03200
#           4: 97    4:80    Mean   :-0.01401
#           5:101    5:90    3rd Qu.: 0.63844
#                    6:80    Max.   : 2.90422
summary(sample2)
#  Gender   Region   Age     X1
#  F:238    1:114    1:85    Min.   :-2.76808
#  M:268    2: 92    2:81    1st Qu.:-0.55173
#           3: 97    3:83    Median : 0.02559
#           4: 99    4:83    Mean   : 0.05789
#           5:104    5:91    3rd Qu.: 0.74102
#                    6:83    Max.   : 3.58466

Compare the following and see if they are within your expectations.

x1 <- round(prop.table(
  xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
  xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
  xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
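One way to turn that comparison into a single number is the largest absolute difference in cell proportions. The sketch below is self-contained and rests on assumptions: dat is simulated, and a base R split of half of each stratum stands in for the stratified() call.

```r
set.seed(1)
# Simulated stand-in for dat (column names follow the example above)
dat <- data.frame(
  Gender = factor(sample(c("F", "M"), 1000, replace = TRUE), levels = c("F", "M")),
  Region = factor(sample(1:5, 1000, replace = TRUE), levels = 1:5),
  Age    = factor(sample(1:6, 1000, replace = TRUE), levels = 1:6)
)

# Base R stand-in for the stratified split: half of each stratum, rounded up
strata_id <- interaction(dat$Gender, dat$Region, dat$Age, drop = TRUE)
pick <- unlist(lapply(split(seq_len(nrow(dat)), strata_id), function(idx) {
  idx[sample.int(length(idx), ceiling(length(idx) / 2))]
}), use.names = FALSE)
sample1 <- dat[pick, ]
sample2 <- dat[-pick, ]

# Cell proportions over the same Gender x Age x Region grid
cells <- function(d) prop.table(xtabs(~ Gender + Age + Region, data = d))

# Largest gap between the full data and each half
max(abs(cells(dat) - cells(sample1)))
max(abs(cells(dat) - cells(sample2)))
```

Keeping the columns as factors with fixed levels ensures the three tables have identical dimensions, so they can be subtracted cell by cell.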

It should work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.


Update:

stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.

set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
#  Gender   Region   Age     X1
#  F:235    1:112    1:84    Min.   :-2.82847
#  M:259    2: 90    2:78    1st Qu.:-0.69711
#           3: 94    3:82    Median :-0.03200
#           4: 97    4:80    Mean   :-0.01401
#           5:101    5:90    3rd Qu.: 0.63844
#                    6:80    Max.   : 2.90422
#
# $SET2
#  Gender   Region   Age     X1
#  F:238    1:114    1:85    Min.   :-2.76808
#  M:268    2: 92    2:81    1st Qu.:-0.55173
#           3: 97    3:83    Median : 0.02559
#           4: 99    4:83    Mean   : 0.05789
#           5:104    5:91    3rd Qu.: 0.74102
#                    6:83    Max.   : 3.58466

