How to Ensure That a Partition Has Representative Observations from Each Level of a Factor

How can I ensure that a partition has representative observations from each level of a factor?

Try the caret package, in particular the function createDataPartition(). It should do exactly what you need. The package is available on CRAN, and its homepage is here:

caret - data splitting
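As a minimal sketch of that suggestion, using the built-in iris data with Species as the factor (the data set, the seed and the 80/20 split are illustrative assumptions, not part of the original question). The guard simply skips the example when caret is not installed:

```r
# Sketch only: iris and p = 0.8 are illustrative assumptions
if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(42)
  # Sampling is done within each level of the factor passed as y
  idx <- caret::createDataPartition(iris$Species, p = 0.8, list = FALSE)
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  # Every level should appear in both partitions
  print(table(train$Species))
  print(table(test$Species))
}
```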

The function below is partly code I found on the net a while back, which I then modified slightly to better handle edge cases (such as asking for a sample size larger than the set, or larger than a subset).

stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  # number) as the first two arguments.
  # * Decide on your sample size. For a sample proportional to the
  # population, enter "size" as a decimal. For an equal number
  # of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  # where the grouping variable is the fourth variable, use:
  #
  # > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  # named "z"; grouping variable is the third variable:
  #
  # > stratified(z, 3, 5)
  #
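  library(sampling)  # provides strata() and getdata()
  # Sort by the grouping column so the per-group sizes computed below
  # line up with the order of the strata
  temp <- df[order(df[[group]]), ]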
  colsToReturn <- ncol(df)

  #Don't want to attempt to sample more than possible
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    size = ceiling(table(temp[group]) * size)
  } else if (size >= 1) {
    size = rep(size, times=length(table(temp[group])))
  }
  strat = strata(temp, stratanames = names(temp[group]),
                 size = size, method = "srswor")
  dsample <- getdata(temp, strat)

  dsample <- dsample[order(dsample[[1]]), ]
  dsample <- data.frame(dsample[,1:colsToReturn], row.names=NULL)
  return(dsample)

}
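For comparison, here is a hedged base R sketch of the same idea that does not need the sampling package. The name stratified_base is hypothetical, and it differs from the function above in one respect: a whole-number size is capped per group rather than at the size of the smallest group.

```r
# Hypothetical helper, not the function above: base R only, same "size" convention
stratified_base <- function(df, group, size) {
  # Row indices split by the levels of the grouping column
  idx_by_group <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  sampled <- lapply(idx_by_group, function(idx) {
    # size < 1: proportion of the group; otherwise a count capped at the group size
    n <- if (size < 1) ceiling(length(idx) * size) else min(size, length(idx))
    # Index into idx so a one-row group is not expanded to 1:idx by sample()
    idx[sample.int(length(idx), n)]
  })
  df[sort(unlist(sampled, use.names = FALSE)), , drop = FALSE]
}

set.seed(1)
s_prop  <- stratified_base(iris, "Species", 0.1)  # about 10% of each species
s_fixed <- stratified_base(iris, "Species", 5)    # 5 rows per species
table(s_prop$Species)
table(s_fixed$Species)
```

The grouping column can be given by name or by number, since df[[group]] accepts both.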

Error in createDataPartition.... : y must have at least 2 data points

I guess what you need is:

set.seed(7)
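# createDataPartition() returns the row indices of the 80% sample
validationIndex <- caret::createDataPartition(TBdta$MEDV, p=0.80, list=FALSE)
# the remaining 20% of rows form the validation set
validation <- TBdta[-validationIndex,]
# the sampled 80% of rows form the working dataset
dataset <- TBdta[validationIndex,]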

So that you have

dim(validation)
#[1] 99 14
dim(dataset)
#[1] 407  14

Partitioning a data set in R based on multiple classes of observations

This may be longer, but I think it's more intuitive and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl = 
            c( 
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 ) 
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 ) 
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply( 
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) , 

        # break the sample function out by
        # the classification variable
        x$cl , 

        # use the sample function within
        # each classification variable group
        sample , 

        # send the size = 67 parameter
        # through to the sample() function
        size = 67 
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
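The fixed size = 67 works here because every class has exactly 100 rows. A hedged variant for classes of unequal size is sketched below; the data frame y, its class counts and the two-thirds training fraction are all assumptions made for illustration.

```r
# Illustrative data (an assumption): classes of unequal size
set.seed(2)
y <- data.frame(
  cl  = rep(c("A", "B", "C"), times = c(90, 30, 9)),
  num = rnorm(129)
)

# Sample two thirds of the row numbers within each class
tr2 <- unlist(
  tapply(seq_len(nrow(y)), y$cl, function(rows) {
    # Index into rows so a single-row class is handled correctly
    rows[sample.int(length(rows), round(length(rows) * 2 / 3))]
  })
)

train2 <- y[tr2, ]
test2  <- y[-tr2, ]
table(train2$cl)
table(test2$cl)
```

Sampling positions with sample.int() rather than calling sample() on the row numbers avoids a known quirk: when sample() receives a single number n, it samples from 1:n instead of returning n.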

How to sample/partition panel data by individuals (preferably with the caret library)?

I think there's a little bug in the sampling approach using sample(): It is using the id variable like a row number. Instead, the function needs to fetch all rows belonging to an ID:

nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ] 
testing <- data[!data$id %in% inTrainID, ] 

head(training[, 1:5], 10)
#    id FEMALE YEAR AGE   HANDDUM
# 1   1      0 1984  54 0.0000000
# 2   1      0 1985  55 0.0000000
# 3   1      0 1986  56 0.0000000
# 8   3      1 1984  58 0.1687193
# 9   3      1 1986  60 1.0000000
# 10  3      1 1987  61 0.0000000
# 11  3      1 1988  62 1.0000000
# 12  4      1 1985  29 0.0000000
# 13  5      0 1987  27 1.0000000
# 14  5      0 1988  28 0.0000000

dim(data)
# [1] 27326    41
dim(training)
# [1] 20566    41
dim(testing)
# [1] 6760   41
20566/27326
### 75.26% were selected for training

Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.

table(data$WORKING) / nrow(data)
#         0         1 
# 0.3229525 0.6770475 
#
table(training$WORKING) / nrow(training)
#         0         1 
# 0.3226685 0.6773315 
#
table(testing$WORKING) / nrow(testing)
#         0         1 
# 0.3238166 0.6761834 
### virtually equal
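The same steps can be tried on a small synthetic panel to confirm that no individual ends up in both sets. Everything in this sketch (the panel data frame, 40 individuals, the 0.7 probability) is made up for illustration and is not the data from the question.

```r
# Synthetic panel (an assumption): 40 individuals with 1 to 5 rows each
set.seed(123)
n_obs <- sample(1:5, 40, replace = TRUE)
panel <- data.frame(
  id      = rep(seq_len(40), times = n_obs),
  WORKING = rbinom(sum(n_obs), 1, 0.7)
)

# Sample individuals, then keep every row that belongs to a sampled id
ids         <- unique(panel$id)
train_id    <- sample(ids, round(length(ids) * 0.75))
panel_train <- panel[panel$id %in% train_id, ]
panel_test  <- panel[!panel$id %in% train_id, ]

# No individual should appear in both sets
length(intersect(panel_train$id, panel_test$id))
# The row share is only approximately 0.75 when individuals have different numbers of rows
nrow(panel_train) / nrow(panel)
# Sampling by id does not stratify on WORKING, so compare the class shares explicitly
prop.table(table(panel_train$WORKING))
prop.table(table(panel_test$WORKING))
```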

Splitting Dataframe into Confirmatory and Exploratory Samples

You can check out my stratified function, which you should be able to use like this:

set.seed(1) ## just so you can reproduce this

## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)

## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

summary(sample1)
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422
summary(sample2)
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466

Compare the following and see if they are within your expectations.

x1 <- round(prop.table(
  xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
  xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
  xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
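One way to make that comparison concrete is to look at the largest absolute difference between the cell proportions. The sketch below is self-contained and uses a synthetic stand-in for dat and a plain base R split of each Gender x Region x Age cell instead of stratified(); all names ending in _demo or starting with demo are hypothetical.

```r
# Synthetic stand-in for "dat" (an assumption; the original data is not shown)
set.seed(1)
dat_demo <- data.frame(
  Gender = factor(sample(c("F", "M"), 1000, replace = TRUE)),
  Region = factor(sample(1:5, 1000, replace = TRUE)),
  Age    = factor(sample(1:6, 1000, replace = TRUE))
)

# Take half of each Gender x Region x Age cell (base R, not stratified())
cells <- split(seq_len(nrow(dat_demo)),
               interaction(dat_demo$Gender, dat_demo$Region, dat_demo$Age),
               drop = TRUE)
half <- unlist(
  lapply(cells, function(i) i[sample.int(length(i), ceiling(length(i) / 2))]),
  use.names = FALSE
)
demo1 <- dat_demo[half, ]
demo2 <- dat_demo[-half, ]

# Cell proportions in the full data and in each half
p_all <- prop.table(xtabs(~ Gender + Age + Region, data = dat_demo))
p_1   <- prop.table(xtabs(~ Gender + Age + Region, data = demo1))
p_2   <- prop.table(xtabs(~ Gender + Age + Region, data = demo2))

# Largest absolute deviation from the full-data proportions
max(abs(p_1 - p_all))
max(abs(p_2 - p_all))
```

Because the columns are factors, the three tables keep the same dimensions even if a cell is empty in one half, so they can be subtracted directly.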

It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.

Update:

stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.

set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422  
#
# $SET2
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466