Sample Rows of Subgroups from Dataframe with Dplyr

How can I randomly sample a subgroup with multiple rows from within a larger group?

Like so perhaps:


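library(dplyr)

set.seed(100)
df %>%
    group_by(ID, Group) %>%
    sample_n(1) %>%       # pick one random row per ID/Group pair
    select(-Score) %>%    # keep only the keys and the sampled Color
    left_join(df, by = c("ID", "Group", "Color"))  # pull back every row of that subgroup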

I think I misunderstood you at first, but this may be what you are after: one Color is sampled within each ID/Group pair, and every row of that subgroup is kept.

Output:


        ID Group  Color Score
1    Bravo     1 yellow  0.65
2    Bravo     1 yellow  0.70
3    Bravo     1 yellow  0.90
4  Charlie     1    red  0.55
5  Charlie     2    red  0.60
6  Charlie     3    red  0.80
7  Charlie     4    red  0.90
8    Delta     1    red  0.85
9    Delta     2    red  0.63
10   Delta     2    red  0.51
11    Echo     1 yellow  0.85
12    Echo     1 yellow  0.89
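
A variation worth considering: sampling a row, as above, makes a subgroup with more rows more likely to be picked. The sketch below instead gives every Color within an ID/Group pair the same chance by sampling from the distinct subgroups first. The toy df is hypothetical (the original data is not shown), and slice_sample() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: two Color subgroups inside each ID/Group pair
df <- data.frame(
  ID    = c("Bravo", "Bravo", "Bravo", "Bravo", "Delta", "Delta", "Delta"),
  Group = c(1, 1, 1, 1, 2, 2, 2),
  Color = c("yellow", "yellow", "red", "red", "red", "blue", "blue"),
  Score = c(0.65, 0.70, 0.40, 0.30, 0.63, 0.51, 0.20)
)

set.seed(100)
picked <- df %>%
  distinct(ID, Group, Color) %>%  # one row per candidate subgroup
  group_by(ID, Group) %>%
  slice_sample(n = 1) %>%         # choose one Color per ID/Group pair
  ungroup()

# Keep every row of df that belongs to a chosen subgroup
result <- inner_join(df, picked, by = c("ID", "Group", "Color"))
result
```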

Take random sample by group

Try this:

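library(plyr)
# For each ID, draw 500 rows at random without replacement.
# This errors if any ID has fewer than 500 rows; cap the size with min(500, nrow(x)) if needed.
ddply(df, .(ID), function(x) x[sample(nrow(x), 500), ])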
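
If some groups may be smaller than the requested sample size, the size can be capped per group. This is a minimal sketch that swaps ddply for base R split() and lapply(); the toy df and the value of n are assumptions for illustration.

```r
set.seed(1)

# Hypothetical toy data: groups of 5, 2 and 8 rows
df <- data.frame(
  ID    = rep(c("a", "b", "c"), times = c(5, 2, 8)),
  value = seq_len(15)
)

n <- 3  # rows wanted per group (hypothetical)

sampled <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  # Take n rows, or the whole group when it has fewer than n rows
  x[sample(nrow(x), min(n, nrow(x))), , drop = FALSE]
}))
table(sampled$ID)
```

Group "b" has only two rows, so both are returned instead of raising an error.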

Randomly sample groups

Use sample() to choose the group labels, then filter on them:

library(dplyr)
iris %>% filter(Species %in% sample(levels(Species), 2))
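
Note that levels() only works on factors. For a character column, a sketch along the same lines could take the labels from unique() instead; the base R version below is one way to write it and is not the original answer's code.

```r
set.seed(1)

# Sample two group labels; unique() works for factor and character columns alike
keep <- sample(unique(as.character(iris$Species)), 2)

# Keep all rows belonging to the sampled groups
subset_iris <- iris[iris$Species %in% keep, ]
table(droplevels(subset_iris$Species))
```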

How to sample rows without replacement within (multiple) subgroups in R

Here is a base R solution. If you want to sample all elements of a vector exactly once, then just sample(vec) and it will return a permutation of vec.

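set.seed(42)
# For each participant, pair an independent permutation of colour
# with an independent permutation of item (both assumed to have equal length)
res <- lapply(participant_id, function(p) {
  data.frame(participant_id = rep(p, length(item)),
             colour = sample(colour), item = sample(item))
})
res <- do.call(rbind, res)  # stack the per-participant data frames
res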
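
To try this out, here is the same idea with made-up inputs. The values of participant_id, colour and item are hypothetical, since the question's data is not shown.

```r
set.seed(42)

# Hypothetical inputs
participant_id <- 1:3
colour <- c("red", "green", "blue")
item   <- c("cup", "pen", "key")

res <- do.call(rbind, lapply(participant_id, function(p) {
  # sample(vec) with no size argument returns a permutation of vec
  data.frame(participant_id = p,
             colour = sample(colour),
             item   = sample(item))
}))

# Every participant sees each colour and each item exactly once
table(res$participant_id, res$colour)
```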

How to randomly subset data with dplyr?

Maybe this is what you are after:

# sample from distinct values of No
my_groups <- 
  df %>% 
  select(No) %>% 
  distinct %>% 
  sample_n(5)

# keep every row of df that belongs to a sampled group
my_df <-
  left_join(my_groups, df, by = "No")

Select subgroups with replacement in a data frame in R

You could generate your group sample:

x <- sample(unique(df$groups), 3, replace = TRUE)

Then select the appropriate parts of df:

do.call(rbind, lapply(x, function(i) df[df$groups == i,]))
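
Because the groups are drawn with replacement, the same group can appear more than once in the result. The sketch below, on a hypothetical toy df, adds a draw column (an assumption, not part of the original answer) so that repeated copies of a group can still be told apart.

```r
set.seed(3)

# Hypothetical toy data: four groups of two rows each
df <- data.frame(groups = rep(c("a", "b", "c", "d"), each = 2),
                 value  = 1:8)

# Draw three group labels, allowing repeats
x <- sample(unique(df$groups), 3, replace = TRUE)

# Stack the rows of each drawn group, tagged with the index of the draw
boot <- do.call(rbind, lapply(seq_along(x), function(k) {
  cbind(draw = k, df[df$groups == x[k], ])
}))
boot
```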

sample with dplyr and rowwise

"The very first row shows that col_1 and col_2 are different, while I expect them to be the same."

set.seed(7) ensures that every run of your script creates the same my_df. It does not make every call to sample() return the same number, so col_1 and col_2 need not be equal. If you run your code twice, however, both runs will give you the same col_1.
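
A small sketch of this behaviour, using sample(1:100, 1) as a stand-in for the sampling done in the question:

```r
set.seed(7)
a1 <- sample(1:100, 1)  # first draw after seeding
a2 <- sample(1:100, 1)  # second draw: a separate draw, not a repeat of a1

set.seed(7)             # reseeding restarts the same random sequence
b1 <- sample(1:100, 1)
b2 <- sample(1:100, 1)

identical(a1, b1)  # TRUE: same seed, same position in the sequence
identical(a2, b2)  # TRUE
```

The seed fixes the sequence as a whole; consecutive draws within it are still separate random draws.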

"I expect col_1 and col_2 be sampled from set_diff column."

From the documentation of sample: If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Therefore, if set_diff equals 3, a sample is drawn from c(1,2,3).
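
The following sketch shows the length-1 behaviour and one way around it. The helper name safe_sample is hypothetical; it follows the indexing pattern suggested in the sample() help page.

```r
set.seed(7)

x <- 3

# sample(3, 1) draws from 1:3, not from the single value 3
draws <- replicate(200, sample(x, 1))

# Hypothetical helper: sample positions, then index, so a length-1
# vector is treated as a set containing one value
safe_sample <- function(v, ...) v[sample.int(length(v), ...)]

safe <- replicate(200, safe_sample(x, 1))

range(draws)  # stays within 1 and 3
unique(safe)  # only the value 3
```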