How to Extract a Few Random Rows from a Data.Table on the Fly

I have just made .N work in i. New README item:

.N is now available in i, FR#724. Thanks to newbie (indirectly) and Farrel (directly).

This now works :

DT[...][...][sample(.N,3)]

e.g.

> random.length  <-  sample(x = 15:30, size = 1)
> data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(.N, 3)] 
         city score
1:   New York     4
2: Pittsburgh     3
3:  Cape Town     9
>
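
As a self-contained sketch of the same idea, the snippet below builds a small hypothetical table (the values are illustrative, not from the original post) and caps the sample size with min(.N, 3L), so the call does not fail when the table has fewer than three rows.

```r
library(data.table)

set.seed(123)  # only so the draw is reproducible

# Hypothetical example data
DT <- data.table(
  city  = c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),
  score = c(4L, 9L, 3L, 7L, 1L)
)

# .N in i is the number of rows of DT, so this draws 3 distinct row numbers
picked <- DT[sample(.N, min(.N, 3L))]

# The same cap keeps the call valid on a table with fewer than 3 rows
small <- DT[1:2][sample(.N, min(.N, 3L))]
```

Because sample() draws without replacement by default, the three rows in picked are always distinct rows of DT.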

sample from data.table

Part 1

If you want to count the number of unique ids and some ids repeat within groups

dat[, .(n_ids = uniqueN(id)), group]

If ids don't repeat within groups or you don't want to count them on a unique basis

dat[, .(n_ids = .N), group]
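
To make the difference between the two counts concrete, here is a minimal sketch on hypothetical data in which one id repeats inside a group. The table and its values are assumptions made for illustration.

```r
library(data.table)

# Hypothetical data: id 1 appears twice in group "a"
dat <- data.table(
  group = c("a", "a", "a", "b", "b"),
  id    = c(1L, 1L, 2L, 3L, 4L)
)

unique_counts <- dat[, .(n_ids = uniqueN(id)), by = group]  # distinct ids per group
row_counts    <- dat[, .(n_ids = .N), by = group]           # rows per group
```

For group "a" the first form counts 2 ids and the second counts 3 rows; for group "b" both give 2.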

Part 2

If ids repeat within groups and you want to return all rows for the randomly selected id in each group

dat[dat[, .(id = sample(id, 1)), group], on = .(id, group)]

If ids do not repeat, or you only want one row per group anyway

dat[dat[, sample(.I, 1), group]$V1]
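
One caveat with both forms: when sample() receives a single number n, it draws from 1:n rather than returning n, so a group with one numeric id or one row can yield a value that is not in the group. The sketch below is a variant, not the original answer's code; it indexes with sample.int(.N, 1L) to avoid this. The data is hypothetical.

```r
library(data.table)

set.seed(1)

# Hypothetical data: group "b" contains a single id, 7
dat <- data.table(
  group = c("a", "a", "a", "b"),
  id    = c(1L, 1L, 2L, 7L),
  value = c(10, 20, 30, 40)
)

# Option 1: pick one id per group, then return every row for that id.
# Indexing by position avoids sample(7L, 1), which would draw from 1:7.
picked_ids <- dat[, .(id = id[sample.int(.N, 1L)]), by = group]
all_rows   <- dat[picked_ids, on = .(id, group)]

# Option 2: pick one row per group by its position in dat.
one_row <- dat[dat[, .I[sample.int(.N, 1L)], by = group]$V1]
```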

Thanks to Frank's comment, you can also do the second option for parts 1 & 2 above in one line. This returns the row like dat[dat[, sample(.I, 1), group]$V1] but also adds a column N showing the number of ids (assumed to equal the number of rows in the group)

dat[sample(.N), c(.SD[1], .N), keyby=group]

How to take a random subsample from rows of a data.table per factor?

We could join with a key/value dataset and use .I to sample

DT[DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
      on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]$V1]

If we split this into parts:

data.table(factor = letters[1:3], val = c(12, 20, 5))

is a key/value data.table used to bring 'val' into the original dataset as a column by joining on 'factor'.

In the second step, we do the joining

DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
      on = .(factor)]

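Now, we sample the row index, grouped by 'factor', specifying the size as the first element of 'val', extract the row-index column $V1 and use it to subset the original dataset. Note that the join returns rows in the order of the key/value table, so these indices only line up with DT when DT is already sorted by 'factor'. i.e.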

DT[....$V1]
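
A runnable sketch of the whole procedure follows. It is a variant rather than the answer's exact code: it uses an update join so that the row positions come from DT itself, whatever order DT is in, and it uses sample.int() so that a one-row group is handled correctly. The data and the name sizes are hypothetical.

```r
library(data.table)

set.seed(42)

# Hypothetical data, deliberately not sorted by 'factor'
DT <- data.table(
  factor = rep(c("b", "a", "c"), times = c(4, 3, 2)),
  x      = 1:9
)

# Key/value table: how many rows to draw for each factor level
sizes <- data.table(factor = letters[1:3], val = c(12, 20, 5))

# Update join: add 'val' to DT by reference, keeping DT's row order
DT[sizes, on = .(factor), val := i.val]

# Draw row positions per group (with replacement), then subset DT
idx <- DT[, .I[sample.int(.N, val[1L], replace = TRUE)], by = factor]$V1
res <- DT[idx]
```

Sampling with replacement is what allows a group of 3 rows to contribute 12 sampled rows; without replace = TRUE the requested size could not exceed the group size.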

Subset by column criteria AND randomly sample rows of a data.table

To achieve that, I would utilize the .I special symbol as follows:

DT <- as.data.table(mtcars)

DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]

Now you can do some computations:

set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
   , lapply(.SD, mean)
   , by = am
   , .SDcols = 3:5]

which gives:

   am   disp       hp     drat
1:  0 325.64 179.0667 3.224667
2:  1 243.00 204.7500 3.890000

If you want to reuse that index vector at a later moment, you can store it beforehand:

idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))

DT[idx, lapply(.SD, mean), .SDcols = 3:5]
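
As a quick check that the index vector selects what is intended, the sketch below rebuilds it with which() and counts rows per cylinder class. Drawing positions with sample.int() is a small deviation from the code above, chosen so the draw stays correct even if only one row matched.

```r
library(data.table)

set.seed(2019)

DT <- as.data.table(mtcars)

idx8 <- DT[, which(cyl == 8)]   # every 8-cylinder row
idx6 <- DT[, which(cyl == 6)]   # candidate 6-cylinder rows

# Keep all 8-cylinder rows plus 5 randomly chosen 6-cylinder rows
idx <- c(idx8, idx6[sample.int(length(idx6), 5L)])
sub <- DT[idx]

counts <- sub[, .N, by = cyl]
```

mtcars has 14 eight-cylinder and 7 six-cylinder cars, so sub contains 19 rows: 14 with cyl == 8 and 5 with cyl == 6.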

Question: Randomly pick rows based on the number of other elements with data.table?

We can join DT and DTmin so that every row carries its MIN value, and then select MIN random rows for each combination of CATA and ITEM.

library(data.table)
DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]

Similarly, with dplyr:

library(dplyr)
DT %>%
  left_join(DTmin, by = 'CATA') %>%
  group_by(CATA, ITEM) %>%
  sample_n(first(MIN))

Because the MIN value is the same for every row within a group, any of them can be used; here I take the first one.
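
The question's data is not shown here, so the following sketch uses hypothetical tables. In particular, building DTmin as the smallest item count within each CATA is an assumption about what MIN holds.

```r
library(data.table)

set.seed(1)

# Hypothetical data: item counts differ within each category
DT <- data.table(
  CATA = rep(c("A", "B"), each = 6),
  ITEM = rep(c("i1", "i2", "i1", "i2"), times = c(4, 2, 3, 3)),
  val  = 1:12
)

# Assumed construction of DTmin: the smallest item count within each CATA
DTmin <- DT[, .N, by = .(CATA, ITEM)][, .(MIN = min(N)), by = CATA]

# Join MIN onto every row, then keep MIN random rows per (CATA, ITEM)
res <- DT[DTmin, on = "CATA"][, .SD[sample(.N, first(MIN))], by = .(CATA, ITEM)]
```

With this data MIN is 2 for category A and 3 for category B, so each item in A keeps 2 rows and each item in B keeps 3.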

Change values for a random selection of a data.table subset

Besides the suggestions above, you can also sample the row index, which can be computed with the which() function:

dt[sample(which(city == "New York"), 1), score:=555L]
dt
#           city score
#  1:   Tel Aviv     8
#  2:  Amsterdam     3
#  3:  Cape Town    10
#  4:   New York     1
#  5:  Cape Town    10
#  6: Pittsburgh     2
#  7: Pittsburgh     8
#  8:  Amsterdam    10
#  9:  Amsterdam     8
# 10:  Amsterdam     4
# 11:   Tel Aviv     7
# 12:  Amsterdam     2
# 13: Pittsburgh     1
# 14:  Amsterdam     3
# 15: Pittsburgh     2
# 16:   New York     7
# 17:   Tel Aviv    10
# 18:   New York    10
# 19:  Cape Town     1
# 20:  Amsterdam     7
# 21:  Amsterdam     3
# 22:   New York   555
# 23:  Cape Town     6
# 24:   New York     1
# 25:   Tel Aviv    10
#           city score
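
If only one row matches the condition, sample(which(...), 1) receives a single number and draws from 1:n instead of returning that row. A minimal sketch of a guarded variant, on hypothetical data, picks a position within the vector of matching rows:

```r
library(data.table)

set.seed(7)

# Hypothetical data
dt <- data.table(
  city  = c("New York", "Tel Aviv", "New York", "Amsterdam"),
  score = c(1L, 8L, 7L, 3L)
)

rows <- dt[, which(city == "New York")]      # positions of matching rows
pick <- rows[sample.int(length(rows), 1L)]   # one of them, chosen at random

dt[pick, score := 555L]                      # update that row by reference
```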

R - Extract random sample with Conditional using 'Which' in Loop

The group_by and sample_n functions in the dplyr package let you do this easily:

library(dplyr)
subset <- H0_LONG %>%
    group_by(Patch) %>%
    sample_n(25)

This approach will typically also run faster than a for loop. Note that this code is just another way of writing:

subset <- sample_n(group_by(H0_LONG, Patch), 25)
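
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(n = 25). For readers staying in data.table, a rough equivalent is sketched below. H0_LONG here is a hypothetical stand-in for the question's data, and capping the size with min() is an added assumption so that patches with fewer than 25 rows are kept whole instead of raising an error.

```r
library(data.table)

set.seed(3)

# Hypothetical stand-in for H0_LONG: one large patch and one small patch
H0_LONG <- data.table(
  Patch = rep(c("p1", "p2"), times = c(40, 10)),
  value = rnorm(50)
)

# Up to 25 random rows per Patch, without replacement
subset25 <- H0_LONG[, .SD[sample(.N, min(.N, 25L))], by = Patch]
```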

Pick one random element from a vector for each row of a data.table

We can group by Name (this assumes each Name identifies a single row) and draw one sample per group:

dt[, NextItem := sample(x, 1), by = Name]

Or you can also do this with .N instead of by

dt[, NextItem := sample(x, .N, replace = TRUE)]
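
The two forms differ when a Name repeats. The sketch below uses a hypothetical vector x and table dt to show that the .N form draws independently for every row, while the by = Name form gives rows with the same Name the same value.

```r
library(data.table)

set.seed(11)

x  <- c("apple", "banana", "cherry")                 # hypothetical candidate values
dt <- data.table(Name = c("n1", "n2", "n2", "n3"))   # "n2" appears twice

# One independent draw per row, regardless of repeated names
dt[, NextItem := sample(x, .N, replace = TRUE)]

# One draw per distinct Name: rows sharing a Name share the value
dt[, NextItemByName := sample(x, 1L), by = Name]
```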

Extract rows using multiple conditions related to the order of occurrence of zero and one in R rule()

You could create a small function that reflects your four conditions, and then apply that function by group

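f <- function(z, p) {
  p1 <- which(p == 1)   # positions where pos is 1
  z0 <- which(z == 0)   # positions where zero is 0
  # first and last position of a 1
  c1 <- c(p1[1], p1[length(p1)])
  # the row just before the first 1, if it holds a 0
  c2 <- ifelse(z[p1[1] - 1] == 0, as.integer(p1[1] - 1), NA_integer_)
  # the first 0 after the first 1
  c3 <- min(z0[which(z0 > p1[1])], na.rm = TRUE)
  # the last position holding either a 1 or a 0
  c4 <- max(p1, z0, na.rm = TRUE)
  unique(c(c1, c2, c3, c4))
}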

Now, apply that function by group

library(dplyr)
df %>%
  group_by(ID) %>%
  filter(row_number() %in% f(zero,pos))

Output:

     ID var    zero   pos
  <dbl> <chr> <dbl> <dbl>
1    60 X2        0    NA
2    60 X3       NA     1
3    60 X6        0    NA
4    60 X9       NA     1
5    61 X1       NA     1
6    61 X4        0    NA
7    61 X9       NA     1
8    61 X10       0    NA

Or, using data.table

library(data.table)
setDT(df)[, .SD[f(zero,pos)], by=ID]
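
Unlike filter(), subsetting .SD by position returns rows in the order the positions are given, and an NA position produces a row of NAs. A hedged sketch of one way to line the two versions up is to wrap the call in sort(), which orders the positions and drops NA. The data below is hypothetical, not the data from the question.

```r
library(data.table)

# f as defined above, repeated so the snippet runs on its own
f <- function(z, p) {
  p1 <- which(p == 1)
  z0 <- which(z == 0)
  c1 <- c(p1[1], p1[length(p1)])
  c2 <- ifelse(z[p1[1] - 1] == 0, as.integer(p1[1] - 1), NA_integer_)
  c3 <- min(z0[which(z0 > p1[1])], na.rm = TRUE)
  c4 <- max(p1, z0, na.rm = TRUE)
  unique(c(c1, c2, c3, c4))
}

# Hypothetical data: in ID 61 the row before the first 1 is NA, so f returns an NA
df <- data.table(
  ID   = rep(c(60, 61), times = c(10, 3)),
  var  = c(paste0("X", 1:10), paste0("X", 1:3)),
  zero = c(NA, 0, NA, NA, NA, 0, NA, NA, NA, NA,   NA, NA, 0),
  pos  = c(NA, NA, 1, NA, NA, NA, NA, NA, 1, NA,   NA, 1, NA)
)

# sort() puts the positions in row order and drops any NA returned by f
res <- df[, .SD[sort(f(zero, pos))], by = ID]
```

For ID 60 this keeps rows X2, X3, X6 and X9 in their original order; for ID 61 it keeps X2 and X3.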