drawing a stratified sample in R
If you have a stratified design, then I believe you can sample randomly within each stratum. Here is a short algorithm to do proportional sampling in each stratum, using ddply
:
library(plyr)
set.seed(1)
dat <- data.frame(
id = 1:100,
Category = sample(LETTERS[1:3], 100, replace=TRUE, prob=c(0.2, 0.3, 0.5))
)
sampleOne <- function(id, fraction=0.1){
sort(sample(id, round(length(id)*fraction)))
}
ddply(dat, .(Category), summarize, sampleID=sampleOne(id, fraction=0.2))
Category sampleID
1 A 21
2 A 29
3 A 72
4 B 13
5 B 20
6 B 42
7 B 58
8 B 82
9 B 100
10 C 1
11 C 11
12 C 14
13 C 33
14 C 38
15 C 40
16 C 63
17 C 64
18 C 71
19 C 92
Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample
You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).
set.seed(1)
samples <- 1
sample1 <- iris %>%
group_by(Species) %>%
mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))
sample1
sample1
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0
#> 2 4.9 3 1.4 0.2 setosa 0
#> 3 4.7 3.2 1.3 0.2 setosa 0
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5 3.6 1.4 0.2 setosa 0
#> 6 5.4 3.9 1.7 0.4 setosa 0
#> 7 4.6 3.4 1.4 0.3 setosa 0
#> 8 5 3.4 1.5 0.2 setosa 0
#> 9 4.4 2.9 1.4 0.2 setosa 0
#> 10 4.9 3.1 1.5 0.1 setosa 0
#> # ... with 140 more rows
To get the sampled values, simply filter to find the 1s:
sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 1
#> 2 5.6 3 4.1 1.3 versicolor 1
#> 3 6.3 3.3 6 2.5 virginica 1
Created on 2022-05-16 by the reprex package (v2.0.1)
Drawing equally-sized samples from differently-sized substrata of a dataframe in R
Consider by
to split data frame by position with needed sampling. Then rbind
all dfs together outside the loop with do.call()
.
df_list <- by(df, df$position, function(sub) sub[sample(1:nrow(sub), 3),])
final_df <- do.call(rbind, df_list)
Currently you index the entire (not subsetted) data frame in each iteration. Also, you are using rbind
inside a for
loop which is memory-intensive and not advised.
Specifically,
by
is the object-oriented wrapper totapply
and essentially splits a data frame into subsets by factor(s) and passes each subset into a defined function. Here sub is just the name of subsetted variable (can be named anything). The result here is a list of data frames.do.call
essentially runs a compact version of an expanded call across multiple elements whererbind(df1, df2, df3)
is equivalent todo.call(rbind, list(df1, df2, df3))
. The key here to note isrbind
is not called inside a loop (avoiding the danger of growing complex objects like a data frame inside an iteration) but once outside the loop.
Related Topics
As(X, 'Double') and As.Double(X) Are Inconsistent
Ggplot2: Fill Color Behaviour of Geom_Ribbon
Group_By() into Fill() Not Working as Expected
How to Plot Pie Charts in Haplonet Haplotype Networks {Pegas}
Reduce Space Between Grid.Arrange Plots
Binning Data, Finding Results by Group, and Plotting Using R
How to Optimize the Following Code with Nested While-Loop? Multicore an Option
Generating Non-Duplicate Combination Pairs in R
Several Substitutions in One Line R
R Issue with Rounding Milliseconds
How to Neatly Align the Regression Equation and R2 and P Value
How to Pass Vector to Integrate Function