Randomly Sample Data Frame into 3 Groups in R

Do you need the partitioning to be exact? If not,

set.seed(7)
ss <- sample(1:3,size=nrow(mtcars),replace=TRUE,prob=c(0.6,0.2,0.2))
train <- mtcars[ss==1,]
test <- mtcars[ss==2,]
cvr <- mtcars[ss==3,]

should do it.

Or, as @Frank says in comments, you can split() the original data to keep them as elements of a list:

mycars <- setNames(split(mtcars,ss), c("train","test","cvr"))
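If the partitioning does need to be exact, one option is to build a label vector with fixed group sizes and shuffle it. This is only a sketch: the rounding rule (round the first two groups, give the remainder to the third) and the names n, sizes, lab and parts are my own assumptions, not part of the answer above.

```r
set.seed(7)
n <- nrow(mtcars)

# Exact group sizes: round the first two, give the remainder to the third
sizes <- c(train = round(0.6 * n), test = round(0.2 * n))
sizes["cvr"] <- n - sum(sizes)

# One label per row, in random order
lab <- sample(rep(names(sizes), times = sizes))

# Keep the pieces as a named list, as in the split() variant above
parts <- split(mtcars, factor(lab, levels = names(sizes)))
sapply(parts, nrow)
```

With the 32 rows of mtcars the group sizes come out as 19, 6 and 7, whatever the seed.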

How to randomly split data into three equal sizes?

IMO it should be sufficient to just assign random project names.

dat$ProjectName <- sample(factor(rep(1:3, length.out=nrow(dat)), 
                          labels=paste0("Project", 1:3)))

Result

head(dat)
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ProjectName
# 1  1  1  0  1  1  1  1  0  1   0    Project1
# 2  1  1  1  1  1  1  0  0  1   0    Project1
# 3  0  0  1  1  0  0  0  1  1   1    Project1
# 4  1  1  1  0  1  0  1  1  0   1    Project3
# 5  1  0  0  1  1  1  1  0  0   1    Project1
# 6  1  0  0  0  0  1  0  1  1   1    Project3

table(dat$ProjectName)
# Project1 Project2 Project3 
#     3186     3186     3186

Data

set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, replace=TRUE)))
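When the number of rows is not a multiple of 3, the same rep(..., length.out=) idea still works; the group sizes then differ by at most one. A small sketch on a made-up 10-row data frame (toy and grp are hypothetical names):

```r
set.seed(1)
toy <- data.frame(x = 1:10)  # hypothetical data, 10 rows

# rep() cycles 1,2,3,1,2,3,... up to nrow(toy); sample() shuffles the labels
toy$grp <- sample(factor(rep(1:3, length.out = nrow(toy)),
                         labels = paste0("Project", 1:3)))

table(toy$grp)
```

Here one group gets 4 rows and the other two get 3 each.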

How can I randomly sample a subgroup with multiple rows from within a larger group?

Like so perhaps:


library(dplyr)
set.seed( 100 )
df %>% group_by( ID, Group ) %>%
    sample_n(1) %>%
    select( -Score ) %>%
    left_join( df, by=c("ID","Group","Color") )

Think I misunderstood you at first, but this sounds like it could be it.

Output:


        ID Group  Color Score
1    Bravo     1 yellow  0.65
2    Bravo     1 yellow  0.70
3    Bravo     1 yellow  0.90
4  Charlie     1    red  0.55
5  Charlie     2    red  0.60
6  Charlie     3    red  0.80
7  Charlie     4    red  0.90
8    Delta     1    red  0.85
9    Delta     2    red  0.63
10   Delta     2    red  0.51
11    Echo     1 yellow  0.85
12    Echo     1 yellow  0.89
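For anyone without dplyr, roughly the same thing can be sketched in base R. I'm assuming, as the join keys above suggest, that the subgroup is identified by Color within each ID/Group; the data frame below is made up for illustration.

```r
set.seed(100)

# Hypothetical data: several rows per ID/Group/Color combination
df <- data.frame(
  ID    = c("Bravo", "Bravo", "Bravo", "Bravo", "Charlie", "Charlie", "Charlie"),
  Group = c(1, 1, 1, 1, 1, 1, 2),
  Color = c("yellow", "yellow", "red", "red", "red", "blue", "red"),
  Score = c(0.65, 0.70, 0.40, 0.50, 0.55, 0.30, 0.60)
)

# Distinct ID/Group/Color combinations
combos <- unique(df[c("ID", "Group", "Color")])

# Pick one Color at random within each ID/Group
picked <- do.call(rbind, lapply(
  split(combos, list(combos$ID, combos$Group), drop = TRUE),
  function(g) g[sample(nrow(g), 1), ]
))

# Keep every row of df that belongs to a picked combination
result <- merge(df, picked, by = c("ID", "Group", "Color"))
result
```

Sampling from the distinct combinations rather than from the rows gives each Color the same chance, regardless of how many rows it has.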

How to randomly sample entire group based on multiple grouping conditions

You can use lubridate::floor_date to create groups and then filter one randomly sampled frame per group. You can manually set the interval you need in floor_date, here it's "1 minute".

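library(dplyr)
library(lubridate)

df %>% 
  mutate(datetime = ymd_hms(datetime),
         fl = floor_date(datetime, "1 minute")) %>% 
  group_by(uniquename, fl) %>% 
  # index into the unique frames; sample(x, 1) would draw from 1:x if only one frame is present
  filter(frame == unique(frame)[sample.int(n_distinct(frame), 1)])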

output:

# A tibble: 11 × 5
# Groups:   uniquename, fl [4]
   uniquename frame id    datetime            fl              
   <chr>      <dbl> <chr> <dttm>              <dttm>             
 1 unique1        2 b1    2021-05-05 07:05:03 2021-05-05 07:05:00
 2 unique1        2 b2    2021-05-05 07:05:03 2021-05-05 07:05:00
 3 unique1        2 b3    2021-05-05 07:05:03 2021-05-05 07:05:00
 4 unique1        3 b2    2021-05-05 07:07:03 2021-05-05 07:07:00
 5 unique1        3 b4    2021-05-05 07:07:03 2021-05-05 07:07:00
 6 unique2        1 b3    2021-06-06 09:17:25 2021-06-06 09:17:00
 7 unique2        1 b4    2021-06-06 09:17:25 2021-06-06 09:17:00
 8 unique2       16 b1    2021-06-06 09:20:59 2021-06-06 09:20:00
 9 unique2       16 b2    2021-06-06 09:20:59 2021-06-06 09:20:00
10 unique2       16 b3    2021-06-06 09:20:59 2021-06-06 09:20:00
11 unique2       16 b4    2021-06-06 09:20:59 2021-06-06 09:20:00
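The same grouping logic can also be sketched in base R, using trunc() in place of floor_date for the 1-minute bins. The data below is made up, and key and chosen are hypothetical helper names. It indexes with sample.int rather than calling sample() on the frame values, since sample(x, 1) draws from 1:x when x is a single number.

```r
set.seed(1)

# Hypothetical data: two frames in one minute for u1, a single frame for u2
df <- data.frame(
  uniquename = c("u1", "u1", "u1", "u1", "u2", "u2"),
  frame      = c(1, 1, 2, 2, 16, 16),
  datetime   = as.POSIXct(c("2021-05-05 07:05:01", "2021-05-05 07:05:01",
                            "2021-05-05 07:05:30", "2021-05-05 07:05:30",
                            "2021-06-06 09:17:25", "2021-06-06 09:17:25"),
                          tz = "UTC")
)

# Floor each timestamp to the minute
df$fl <- as.POSIXct(trunc(df$datetime, units = "mins"))

# One key per uniquename / minute combination
key <- paste(df$uniquename, format(df$fl))

# For each key, pick one of the frames that occur in it
chosen <- vapply(split(df$frame, key), function(f) {
  u <- unique(f)
  u[sample.int(length(u), 1)]
}, numeric(1))

# Keep all rows belonging to the chosen frame of their group
result <- df[df$frame == unname(chosen[key]), ]
result
```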

Sample n rows from a data frame by group using another data frame

We can do an inner_join with 'pick_df', group by 'manufacturer' and 'year', and then call sample_n with the first value of 'pick' in each group.

library(dplyr)   
library(ggplot2)  # provides the mpg data
mpg %>%
    inner_join(pick_df) %>% 
    group_by(manufacturer, year) %>%
    sample_n(first(pick))
# A tibble: 27 x 12
# Groups:   manufacturer, year [5]
#   manufacturer model       displ  year   cyl trans      drv     cty   hwy fl    class       pick
#   <chr>        <chr>       <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>      <int>
# 1 audi         a4 quattro    1.8  1999     4 auto(l5)   4        16    25 p     compact        6
# 2 audi         a6 quattro    2.8  1999     6 auto(l5)   4        15    24 p     midsize        6
# 3 audi         a4            2.8  1999     6 auto(l5)   f        16    26 p     compact        6
# 4 audi         a4 quattro    2.8  1999     6 auto(l5)   4        15    25 p     compact        6
# 5 audi         a4            1.8  1999     4 auto(l5)   f        18    29 p     compact        6
# 6 audi         a4            2.8  1999     6 manual(m5) f        18    26 p     compact        6
# 7 honda        civic         1.8  2008     4 manual(m5) f        26    34 r     subcompact     3
# 8 honda        civic         2    2008     4 manual(m6) f        21    29 p     subcompact     3
# 9 honda        civic         1.8  2008     4 auto(l5)   f        24    36 c     subcompact     3
#10 land rover   range rover   4.2  2008     8 auto(s6)   4        12    18 r     suv            2
# … with 17 more rows
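Since pick_df is not shown here, this is a self-contained base R sketch of the same join-then-sample idea on the built-in mtcars data. The lookup table and its values are made up; only the pattern (join, split by group, sample the first value of pick) mirrors the answer.

```r
set.seed(42)

# Hypothetical lookup table: how many rows to draw per cyl value
pick_df <- data.frame(cyl = c(4, 8), pick = c(3, 2))

# Inner join: groups missing from pick_df (cyl == 6) drop out
joined <- merge(mtcars, pick_df, by = "cyl")

# Within each group, draw as many rows as its 'pick' value says
sampled <- do.call(rbind, lapply(split(joined, joined$cyl), function(g) {
  g[sample(nrow(g), g$pick[1]), ]
}))

table(sampled$cyl)
```

This draws 3 of the four-cylinder cars and 2 of the eight-cylinder cars.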

Sample n random rows per group in a dataframe

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different sizes fairly easily.
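A small self-contained run of the ave approach, on a made-up data frame (the column names color and X1 follow the snippet above, the values are hypothetical):

```r
set.seed(1)

# Hypothetical data: 5 red, 4 green and 2 blue rows
df <- data.frame(color = rep(c("red", "green", "blue"), times = c(5, 4, 2)),
                 X1    = 1:11)

# Random rank 1..n within each color
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

# Up to 3 random rows per color; groups with fewer rows are kept whole
top3 <- df[rndid <= 3, ]
table(top3$color)

# Re-using rndid for a different sample size
top1 <- df[rndid <= 1, ]
```

Groups with fewer than 3 rows (blue here) are simply returned in full instead of raising an error.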

Take random sample by group

Try this:

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
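Note that sample() without replacement throws an error if an ID has fewer rows than the requested sample size. A base R sketch that caps the size at the group size, on made-up data (n_per_group is a hypothetical parameter standing in for the 500 above):

```r
set.seed(1)

# Hypothetical data: ID "a" has 10 rows, ID "b" only 3
df <- data.frame(ID = rep(c("a", "b"), times = c(10, 3)), value = 1:13)
n_per_group <- 5

# Sample within each ID, never asking for more rows than the group has
sampled <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  x[sample(nrow(x), min(n_per_group, nrow(x))), ]
}))

table(sampled$ID)
```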

Randomly assign sample into groups in R

Using toString.

df$class <- factor(apply(df[c("City", "Age_Group")], 1, toString))
levels(df$class)
# [1] "City 1, 0-9"    "City 1, 10-19"  "City 1, 20-29"  "City 1, 30-39" 
# [5] "City 1, 40-49"  "City 1, 50-59"  "City 1, 60-69"  "City 1, 70-79" 
# [9] "City 1, 80-89"  "City 1, 90+"    "City 10, 0-9"   "City 10, 10-19"
# [13] "City 10, 20-29" [...]

To get random samples, you could split the data set by "class" into subsets, say s, and work out how many groups you get when you divide nrow(s) by 20 (individuals per group). Take the ceiling of this usually non-integer number, say x, and then exploit R's recycling: bind 1:ceiling(x) to s using cbind and let it recycle to nrow(s), where we may safely suppressWarnings. Then use sample to shuffle the order and keep just column [,2]. Finally, use do.call(rbind, .) to unsplit the data set, and delete the row names if we want.

set.seed(1)  ## for sake of reproducibility
df <- `rownames<-`(do.call(rbind, by(df, df$class, function(s) 
  transform(s, SAMP=suppressWarnings(
    sample(cbind(s$class, SAMP=1:ceiling(nrow(s)/20))[,2])
    )))), NULL)

Result:

This yields a "SAMP" column with approximately equal-sized groups of ~20 members within each "class".

df[60:70, ]  ##example rows
#      ID    City Age_Group          class SAMP
# 60 8766 City 01       0-9   City 01, 0-9    4
# 61 8775 City 01       0-9   City 01, 0-9    1
# 62 9021 City 01       0-9   City 01, 0-9    3
# 63 9041 City 01       0-9   City 01, 0-9    3
# 64 9482 City 01       0-9   City 01, 0-9    1
# 65 9622 City 01       0-9   City 01, 0-9    1
# 66   47 City 01     10-19 City 01, 10-19    4
# 67  698 City 01     10-19 City 01, 10-19    3
# 68  833 City 01     10-19 City 01, 10-19    1
# 69 1166 City 01     10-19 City 01, 10-19    1
# 70 1221 City 01     10-19 City 01, 10-19    2

Check the first ten classes and the sizes of their SAMP groups:

by(df$SAMP, df$class, table)[1:10]
# $`City 01, 0-9`
# 
# 1  2  3  4 
# 17 16 16 16 
# 
# $`City 01, 10-19`
# 
# 1  2  3  4 
# 18 17 17 17 
# 
# $`City 01, 20-29`
# 
# 1  2  3  4 
# 18 18 17 17 
# 
# $`City 01, 30-39`
# 
# 1  2  3  4 
# 19 19 19 19 
# 
# $`City 01, 40-49`
# 
# 1  2  3  4 
# 19 19 19 18 
# 
# $`City 01, 50-59`
# 
# 1  2  3  4  5 
# 18 17 17 17 17 
# 
# $`City 01, 60-69`
# 
# 1  2  3  4 
# 16 16 16 16 
# 
# $`City 01, 70-79`
# 
# 1  2  3  4 
# 19 19 19 19 
# 
# $`City 01, 80-89`
# 
# 1  2  3  4 
# 20 19 19 19 
# 
# $`City 01, 90+`
# 
# 1  2  3  4 
# 18 17 17 17
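The group sizes above follow directly from the recycling step. Here it is in isolation for a single subset of 65 rows; this sketch uses rep_len() instead of the cbind recycling from the answer, which gives the same labels without needing suppressWarnings. The names n, k and samp are hypothetical.

```r
set.seed(1)

n <- 65                 # rows in one hypothetical "class" subset
k <- ceiling(n / 20)    # number of groups needed: 4

# Recycle 1..k to length n, then shuffle so membership is random
samp <- sample(rep_len(seq_len(k), n))

table(samp)
```

The counts are 17, 16, 16 and 16: recycling hands out the labels as evenly as possible, and sample() only changes which row gets which label.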

In case you want the numbering to identify the class as well, rather than restarting at 1 within each class, just paste "class" (as numeric) and "SAMP" together.

df <- transform(df, SAMP2=paste(as.numeric(class), SAMP, sep="."))
head(df)
#    ID    City Age_Group        class SAMP SAMP2
# 1 193 City 01       0-9 City 01, 0-9    3   1.3
# 2 480 City 01       0-9 City 01, 0-9    1   1.1
# 3 742 City 01       0-9 City 01, 0-9    2   1.2
# 4 757 City 01       0-9 City 01, 0-9    1   1.1
# 5 811 City 01       0-9 City 01, 0-9    3   1.3
# 6 870 City 01       0-9 City 01, 0-9    3   1.3
