Sample Random Rows Within Each Group in a Data.Table

Sample random rows within each group in a data.table

Maybe something like this?

> DT[,.SD[sample(.N, min(3,.N))],by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.)
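
For a self-contained check of the idiom above, here is a minimal sketch on made-up data (the table `DT` and its columns `a` and `b` are assumptions for illustration, not the asker's data). Group 3 has only two rows, which is why `min(3, .N)` matters: `sample(.N, 3)` would raise an error there.

```r
library(data.table)

set.seed(42)
# Hypothetical data: groups 1 and 2 have five rows each, group 3 has only two
DT <- data.table(a = c(rep(1:2, each = 5), 3, 3), b = sample(1000, 12))

# Draw up to 3 rows per group without replacement
res <- DT[, .SD[sample(.N, min(3, .N))], by = a]

# Rows returned per group: 3, 3 and 2
counts <- res[, .N, by = a]
print(counts)
```

Groups with three or more rows contribute exactly three rows; smaller groups are returned whole, in shuffled order.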

R data.table: Random sample of rows from second table by group

A direct translation of your needs is:

DT2[DT1, on=.(group), allow.cartesian=TRUE, .(var1, obs=obs[sample(.N, 2L)]), by=.EACHI]

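This might be faster. Note that it samples with replacement (so one var1 can receive the same obs twice) and attaches var1 by position, which assumes DT1 is already ordered by group: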

gn <- DT1[, .(nsamp=2*.N), keyby=.(group)]
DT2[gn, on=.(group), .(obs=obs[sample(.N, nsamp, replace=TRUE)]), by=.EACHI][,
var1 := rep(DT1$var1, each=2L)]

data:

set.seed(0L)
library(data.table)
DT1 <- data.table(var1=101:120, group=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs=201:213, group=c(1,1,1,2,2,2,3,3,3,4,4,4,5))

sample output:

    group var1 obs
1: 1 101 203
2: 1 101 201
3: 1 102 202
4: 1 102 203
5: 1 103 203
6: 1 103 201
7: 1 104 203
8: 1 104 202
9: 1 105 202
10: 1 105 203
11: 2 106 204
12: 2 106 206
13: 2 107 204
14: 2 107 205
15: 2 108 205
16: 2 108 206
17: 2 109 205
18: 2 109 206
19: 3 110 209
20: 3 110 207
21: 3 111 209
22: 3 111 208
23: 3 112 207
24: 3 112 208
25: 4 113 210
26: 4 113 212
27: 4 114 211
28: 4 114 210
29: 4 115 211
30: 4 115 212
31: 4 116 211
32: 4 116 210
33: 4 117 211
34: 4 117 210
35: 4 118 210
36: 4 118 211
37: 4 119 212
38: 4 119 211
39: 4 120 210
40: 4 120 211
group var1 obs

Take random sample by group

Try this:

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
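
Note that the call above fails for any ID with fewer than 500 rows, because `sample()` cannot draw more items than exist without replacement. A minimal sketch of a guarded version follows; the data frame and the sample size of 3 are assumptions chosen for illustration.

```r
library(plyr)

set.seed(1)
# Hypothetical data: ID "a" has 5 rows, ID "b" has only 2
df <- data.frame(ID = c(rep("a", 5), rep("b", 2)), value = 1:7)

n_per_group <- 3  # illustrative sample size, stands in for the 500 above

# Cap the sample size at the group size so small groups do not raise an error
out <- ddply(df, .(ID), function(x) {
  x[sample(nrow(x), min(nrow(x), n_per_group)), ]
})
print(table(out$ID))
```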

random sample of rows with at least x of each group with several conditions

In dplyr you could do this...

library(dplyr)

df2 <- mydf %>% group_by(age, region, gender) %>% sample_n(1) #select one from each group

sample <- mydf %>% sample_n(24 - nrow(df2)) %>% #select rest randomly
bind_rows(df2) #add first set back in

Your example data does not cover all the possible groups because of the way you have constructed it (6=2*3, so very cyclic), but this approach should work in a more general case.
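
One caveat with the two-step approach: the second draw is taken from all of mydf, so it can pick rows that are already in df2. A possible variant, sketched under the assumption that the data has a unique row identifier (the `id` column and the data below are made up for illustration), removes the already chosen rows first. It uses slice_sample(), the newer replacement for sample_n() in dplyr 1.0.0 and later.

```r
library(dplyr)

set.seed(1)
# Hypothetical data: 60 rows spread over 2 x 3 x 2 = 12 groups
mydf <- data.frame(
  id     = 1:60,
  age    = rep(c("young", "old"), times = 30),
  region = rep(c("N", "S", "W"), times = 20),
  gender = rep(c("f", "m"), each = 30)
)

total <- 24  # overall sample size, as in the answer above

# One row from each observed group
df2 <- mydf %>%
  group_by(age, region, gender) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Fill up from the rows not already chosen, so no row appears twice
rest <- mydf %>%
  anti_join(df2, by = "id") %>%
  slice_sample(n = total - nrow(df2))

result <- bind_rows(df2, rest)
```

The result has 24 distinct rows and still contains at least one row from every group.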

Sample n random rows per group in a dataframe with dplyr when some observations have less than n rows

For each group, you can sample either x rows or the whole group, whichever is smaller:

library(dplyr)

x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))

# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1

However, note that sample_n() has been superseded by slice_sample(), and n() does not work inside slice_sample(). There is an open issue about this.


However, as @tmfmnk mentioned, we don't need to call n() here, because slice_sample() silently truncates n to the group size when sampling without replacement. Try:

df %>% group_by(samples, groups) %>% slice_sample(n = x)
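
To see that no explicit min() is needed with slice_sample(), here is a small sketch on made-up data (the data frame below is an assumption, built only to have groups of different sizes). With the default `replace = FALSE`, a group smaller than `n` is simply returned whole.

```r
library(dplyr)

set.seed(1)
# Hypothetical data: group (A, 1) has 3 rows, (A, 2) has 1 row, (B, 1) has 2 rows
df <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

x <- 2
res <- df %>% group_by(samples, groups) %>% slice_sample(n = x)

# Group sizes after sampling: at most x rows, fewer where the group is smaller
sizes <- res %>% summarise(n = n(), .groups = "drop")
print(sizes)
```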

How to efficiently sample from a datatable by column in R?

You can use sample on .N for each group and select 1 random row.

library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], A]

# A B C
#1: A 31 143
#2: D 16 175
#3: B 100 165
#4: E 27 190
#5: C 90 197

dplyr has the slice_sample() function (previously sample_n()) for this:

library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)

Flag randomly selected N rows by group in data.table

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]

Or use head, though I think that should be slower, since it shuffles the whole group before keeping two indices:

dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]

If the number of rows to flag differs between groups, you can index a vector of sizes by the group counter .GRP:

flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
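
As a quick check of the variable-size version, the following sketch builds a made-up table (the group labels and sizes are assumptions) and counts the flagged rows per group. `.GRP` numbers the groups in order of first appearance, so `flagsz` has to be given in that order.

```r
library(data.table)

set.seed(7)
# Hypothetical data: four groups of sizes 4, 1, 3 and 5
dt <- data.table(C1 = rep(c("a", "b", "c", "d"), times = c(4, 1, 3, 5)))

flagsz <- c(2, 1, 2, 3)  # rows to flag in groups a, b, c, d

# Flag a random subset of rows in each group, capped at the group size
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]

# Number of flagged rows per group
flagged <- dt[, .(n_flagged = sum(C3)), by = C1]
print(flagged)
```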

from data table, randomly select one row per group

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
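
The two steps described above can be written out separately, which makes the idiom easier to follow. This is a minimal sketch on made-up data; the columns `v` and `label` are assumptions standing in for the extra columns of the original dataset.

```r
library(data.table)

set.seed(3)
# Hypothetical data: three groups in z, plus extra columns to carry along
dt <- data.table(z     = rep(c("x", "y", "q"), times = c(4, 2, 3)),
                 v     = 1:9,
                 label = letters[1:9])

# Step 1: pick one global row index (.I) at random within each group
idx <- dt[, .I[sample(.N, 1)], by = z]$V1

# Step 2: subset the original table with those indices
picked <- dt[idx]
print(picked)
```

Working with row indices avoids materialising `.SD` for every group, which tends to matter when there are many small groups.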

Sample n random rows per group in a dataframe

You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.

rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]

This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can reuse the rndid vector to create subsets of different sizes fairly easily.
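
A small base R sketch of this approach, on made-up data (the data frame below is an assumption; only the column names `X1` and `color` come from the answer):

```r
set.seed(5)
# Hypothetical data: color "red" has 5 rows, "blue" has 2
df <- data.frame(color = c(rep("red", 5), rep("blue", 2)),
                 X1    = as.numeric(1:7))

# Within each color, assign a random permutation of 1..group size
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

# Keep up to 3 rows per color; original row order and row names are preserved
sub3 <- df[rndid <= 3, ]

# Reuse the same ids for a smaller subset (one row per color)
sub1 <- df[rndid <= 1, ]
print(sub3)
```

Because the ids are a permutation within each group, `sub1` is always contained in `sub3`.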


