From Data Table, Randomly Select One Row Per Group

The OP provided only a single column in the example. Assuming the original dataset has multiple columns, we group by 'z', sample one position from the sequence of rows in each group (sample(.N, 1)), map it to the table-wide row index with .I, extract the resulting index column ($V1), and use it to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
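
To make the two steps concrete, here is a minimal sketch on a made-up table. The column name val, the toy values, and the seed are assumptions for illustration, not the OP's data.

```r
library(data.table)

# Hypothetical toy data: 'z' is the grouping column, 'val' an extra column.
dt <- data.table(z = c("a", "a", "a", "b", "b", "c"), val = 1:6)

set.seed(1)  # only so the draw is repeatable

# Step 1: per group, pick one position in 1..N and map it to a table-wide row number.
idx <- dt[, .I[sample(.N, 1)], by = z]$V1

# Step 2: subset the original table by those row numbers (all columns are kept).
picked <- dt[idx]
print(picked)
```

Because only integer row numbers are built per group, this usually avoids materialising .SD for every group, which is the likely reason the answer prefers it for wide tables.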

How to randomly choose only one row in each group

In plain R you can use sample() within tapply():

df$Chosen <- 0
df[-tapply(-seq_along(df$Region),df$Region, sample, size=1),]$Chosen <- 1
df
   Region Combo Chosen
1       A     1      0
2       A     2      1
3       A     3      0
4       B     1      1
5       B     2      0
6       C     1      1
7       D     1      0
8       D     2      0
9       D     3      1
10      D     4      0

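Note the -(-selected_row_number) trick: the row numbers are negated before sampling and the result is negated back. When a group has a single row number n, sample(n, 1) would draw from 1 to n, whereas sample(-n, 1) can only return -n.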
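
The behaviour behind that trick can be checked in base R alone. This is a small sketch; the number 7 and the seed are arbitrary choices for illustration.

```r
# sample(x, size) treats a single number x >= 1 as the range 1:x.
set.seed(42)
draws_pos <- replicate(100, sample(7, 1))   # draws from 1:7, not just 7
draws_neg <- replicate(100, sample(-7, 1))  # a single negative number is returned as-is

range(draws_pos)     # covers several values between 1 and 7
unique(-draws_neg)   # always 7 after negating back
```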

Sample random rows within each group in a data.table

Maybe something like this?

> DT[,.SD[sample(.N, min(3,.N))],by = a]
   a   b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343

(Thanks to Josh for the correction, below.)
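
The min(3, .N) guard matters when some groups have fewer than three rows. A minimal sketch, assuming a made-up table in which group 2 has only two rows:

```r
library(data.table)

# Hypothetical data: group 1 has five rows, group 2 only two.
DT <- data.table(a = c(1, 1, 1, 1, 1, 2, 2), b = 101:107)

set.seed(10)
# min(3, .N) caps the sample size at the group size, so small groups
# return all of their rows instead of raising an error.
res <- DT[, .SD[sample(.N, min(3, .N))], by = a]
print(res[, .N, by = a])
```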

Random Sample 1 row for each unique column value in R

library(dplyr)
df %>% group_by(match_no) %>% sample_n(1)

Flag randomly selected N rows by group in data.table

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]

Or use head(), though I think that should be slower, since sample(.N) shuffles the whole group before head() keeps the first two:

dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]

If the number of rows to flag differs by group, you can index a vector of sizes with .GRP:

flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
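
A quick way to see what flagsz[.GRP] does is to count the flags per group afterwards. The table below is invented for the sketch; note that with by (rather than keyby) the groups are numbered in order of first appearance, so flagsz must follow that order.

```r
library(data.table)

# Hypothetical data: four groups in C1 with 3, 1, 4 and 5 rows.
dt <- data.table(C1 = rep(c("g1", "g2", "g3", "g4"), times = c(3, 1, 4, 5)),
                 C2 = 1:13)
flagsz <- c(2, 1, 2, 3)  # rows to flag in the 1st, 2nd, 3rd and 4th group

set.seed(7)
# .GRP is the running group counter, so flagsz[.GRP] picks that group's size.
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]

flagged <- dt[, .(n_flagged = sum(C3)), by = C1]
print(flagged)
```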

Take random sample by group

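Try this (it assumes every ID has at least 500 rows, since sample() cannot draw more items than exist when sampling without replacement):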

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
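
If some groups may be smaller than the requested size, one option is to cap the size per group. The following is a base R sketch of that idea, not the answer's plyr code; the toy data frame and n_per_group are assumptions.

```r
# Hypothetical data: ID "x" has 6 rows, ID "y" has 2.
df <- data.frame(ID = rep(c("x", "y"), times = c(6, 2)), value = 1:8)
n_per_group <- 4  # assumed sample size for this sketch (500 in the answer above)

set.seed(3)
sampled <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  # sample.int() draws row positions; min() caps the size at the group size.
  x[sample.int(nrow(x), min(n_per_group, nrow(x))), , drop = FALSE]
}))
table(sampled$ID)
```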

How to efficiently sample from a datatable by column in R?

You can use sample on .N for each group and select 1 random row.

library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], A]

#   A   B   C
#1: A  31 143
#2: D  16 175
#3: B 100 165
#4: E  27 190
#5: C  90 197

dplyr has the slice_sample() function (which supersedes sample_n()) for this:

library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)

R data.table: Random sample of rows from second table by group

A direct translation of your needs is:

DT2[DT1, on=.(group), allow.cartesian=TRUE, .(var1, obs=obs[sample(.N, 2L)]), by=.EACHI]

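This might be faster, though note that it samples with replacement within each group, so the two observations drawn for one var1 may coincide: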

gn <- DT1[, .(nsamp=2*.N), keyby=.(group)]
DT2[gn, on=.(group), .(obs=obs[sample(.N, nsamp, replace=TRUE)]), by=.EACHI][,
    var1 := rep(DT1$var1, each=2L)]

data:

set.seed(0L)
library(data.table)
DT1 <- data.table(var1=101:120, group=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs=201:213, group=c(1,1,1,2,2,2,3,3,3,4,4,4,5))

sample output:

    group var1 obs
 1:     1  101 203
 2:     1  101 201
 3:     1  102 202
 4:     1  102 203
 5:     1  103 203
 6:     1  103 201
 7:     1  104 203
 8:     1  104 202
 9:     1  105 202
10:     1  105 203
11:     2  106 204
12:     2  106 206
13:     2  107 204
14:     2  107 205
15:     2  108 205
16:     2  108 206
17:     2  109 205
18:     2  109 206
19:     3  110 209
20:     3  110 207
21:     3  111 209
22:     3  111 208
23:     3  112 207
24:     3  112 208
25:     4  113 210
26:     4  113 212
27:     4  114 211
28:     4  114 210
29:     4  115 211
30:     4  115 212
31:     4  116 211
32:     4  116 210
33:     4  117 211
34:     4  117 210
35:     4  118 210
36:     4  118 211
37:     4  119 212
38:     4  119 211
39:     4  120 210
40:     4  120 211
    group var1 obs
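
One way to sanity-check the first join is to confirm that every var1 receives two observations and that each sampled obs belongs to the same group in DT2. The sketch below rebuilds the data shown above; the names per_var and check are assumptions introduced for the check.

```r
library(data.table)

DT1 <- data.table(var1 = 101:120,
                  group = c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs = 201:213, group = c(1,1,1,2,2,2,3,3,3,4,4,4,5))

set.seed(0L)
res <- DT2[DT1, on = .(group), allow.cartesian = TRUE,
           .(var1, obs = obs[sample(.N, 2L)]), by = .EACHI]

# Each var1 should appear exactly twice ...
per_var <- res[, .N, by = var1]
# ... and every sampled obs should carry the same group in DT2.
check <- merge(res, DT2, by = "obs", suffixes = c("", ".src"))

all(per_var$N == 2L)
all(check$group == check$group.src)
```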

How to randomly sample entire group based on multiple grouping conditions

You can use lubridate::floor_date to create groups and then filter one randomly sampled frame per group. You can manually set the interval you need in floor_date, here it's "1 minute".

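library(dplyr)
library(lubridate)

df %>% 
  mutate(datetime = ymd_hms(datetime),
         fl = floor_date(datetime, "1 minute")) %>% 
  group_by(uniquename, fl) %>% 
  # index into unique(frame): sample(n, 1) on a single number n would draw from 1:n
  filter(frame == unique(frame)[sample.int(n_distinct(frame), 1)])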

output:

# A tibble: 11 × 5
# Groups:   uniquename, fl [4]
   uniquename frame id    datetime            fl              
   <chr>      <dbl> <chr> <dttm>              <dttm>             
 1 unique1        2 b1    2021-05-05 07:05:03 2021-05-05 07:05:00
 2 unique1        2 b2    2021-05-05 07:05:03 2021-05-05 07:05:00
 3 unique1        2 b3    2021-05-05 07:05:03 2021-05-05 07:05:00
 4 unique1        3 b2    2021-05-05 07:07:03 2021-05-05 07:07:00
 5 unique1        3 b4    2021-05-05 07:07:03 2021-05-05 07:07:00
 6 unique2        1 b3    2021-06-06 09:17:25 2021-06-06 09:17:00
 7 unique2        1 b4    2021-06-06 09:17:25 2021-06-06 09:17:00
 8 unique2       16 b1    2021-06-06 09:20:59 2021-06-06 09:20:00
 9 unique2       16 b2    2021-06-06 09:20:59 2021-06-06 09:20:00
10 unique2       16 b3    2021-06-06 09:20:59 2021-06-06 09:20:00
11 unique2       16 b4    2021-06-06 09:20:59 2021-06-06 09:20:00
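
For a self-contained run, here is a sketch on invented data in which two frames of "unique1" fall into the same minute, so exactly one of them is kept whole. The rows and the seed are assumptions; the filter indexes into unique(frame) as a precaution, because sample() would read a lone frame number n as the range 1:n.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: frames 1 and 2 of "unique1" share the minute 07:05.
df <- tibble(
  uniquename = c("unique1", "unique1", "unique1", "unique2"),
  frame      = c(1, 1, 2, 5),
  id         = c("b1", "b2", "b1", "b1"),
  datetime   = c("2021-05-05 07:05:01", "2021-05-05 07:05:01",
                 "2021-05-05 07:05:40", "2021-06-06 09:17:25")
)

set.seed(2)
res <- df %>%
  mutate(datetime = ymd_hms(datetime),
         fl = floor_date(datetime, "1 minute")) %>%
  group_by(uniquename, fl) %>%
  # Pick one frame value per group, then keep every row of that frame.
  filter(frame == unique(frame)[sample.int(n_distinct(frame), 1)]) %>%
  ungroup()

print(res)
```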
