Drawing a Stratified Sample in R

drawing a stratified sample in R

If you have a stratified design, then I believe you can sample randomly within each stratum. Here is a short algorithm to do proportional sampling in each stratum, using ddply:

library(plyr)
set.seed(1)
dat <- data.frame(
id = 1:100,
Category = sample(LETTERS[1:3], 100, replace=TRUE, prob=c(0.2, 0.3, 0.5))
)

sampleOne <- function(id, fraction=0.1){
sort(sample(id, round(length(id)*fraction)))
}

ddply(dat, .(Category), summarize, sampleID=sampleOne(id, fraction=0.2))

Category sampleID
1 A 21
2 A 29
3 A 72
4 B 13
5 B 20
6 B 42
7 B 58
8 B 82
9 B 100
10 C 1
11 C 11
12 C 14
13 C 33
14 C 38
15 C 40
16 C 63
17 C 64
18 C 71
19 C 92

Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample

You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).

set.seed(1)

samples <- 1

sample1 <- iris %>%
group_by(Species) %>%
mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))

sample1

sample1
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0
#> 2 4.9 3 1.4 0.2 setosa 0
#> 3 4.7 3.2 1.3 0.2 setosa 0
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5 3.6 1.4 0.2 setosa 0
#> 6 5.4 3.9 1.7 0.4 setosa 0
#> 7 4.6 3.4 1.4 0.3 setosa 0
#> 8 5 3.4 1.5 0.2 setosa 0
#> 9 4.4 2.9 1.4 0.2 setosa 0
#> 10 4.9 3.1 1.5 0.1 setosa 0
#> # ... with 140 more rows

To get the sampled values, simply filter to find the 1s:

sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 1
#> 2 5.6 3 4.1 1.3 versicolor 1
#> 3 6.3 3.3 6 2.5 virginica 1

Created on 2022-05-16 by the reprex package (v2.0.1)

Drawing equally-sized samples from differently-sized substrata of a dataframe in R

Consider by to split data frame by position with needed sampling. Then rbind all dfs together outside the loop with do.call().

df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])

final_df <- do.call(rbind, df_list)

Currently you index the entire (not subsetted) data frame in each iteration. Also, you are using rbind inside a for loop which is memory-intensive and not advised.

Specifically,

  • by is the object-oriented wrapper to tapply and essentially splits a data frame into subsets by factor(s) and passes each subset into a defined function. Here sub is just the name of subsetted variable (can be named anything). The result here is a list of data frames.
  • do.call essentially runs a compact version of an expanded call across multiple elements where rbind(df1, df2, df3) is equivalent to do.call(rbind, list(df1, df2, df3)). The key here to note is rbind is not called inside a loop (avoiding the danger of growing complex objects like a data frame inside an iteration) but once outside the loop.


Related Topics



Leave a reply



Submit