How do you extract a few random rows from a data.table on the fly
Have just made .N work in i. New README item:
.N is now available in i, FR#724. Thanks to newbie indirectly here and Farrel directly here.
This now works:
DT[...][...][sample(.N,3)]
e.g.
> random.length <- sample(x = 15:30, size = 1)
> data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(.N, 3)]
city score
1: New York 4
2: Pittsburgh 3
3: Cape Town 9
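A small reproducible sketch of the same idea may help. The table and its columns here are made up for illustration; the only assumption is a data.table version in which .N is available in i.

```r
library(data.table)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical data: 20 rows, 4 cities, scores 1..10 repeated twice
DT <- data.table(
  city  = rep(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv"), each = 5),
  score = rep(1:10, times = 2)
)

# Three random rows from the whole table
three <- DT[sample(.N, 3)]

# In a chain, .N refers to the number of rows left after the
# previous step, so this samples only among rows with score > 5
two_high <- DT[score > 5][sample(.N, 2)]
```

Because .N is evaluated against whatever table the bracket is applied to, there is no need to store the intermediate result just to count its rows.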
sample from data.table
Part 1
If you want to count the number of unique ids and some ids repeat within groups
dat[, .(n_ids = uniqueN(id)), group]
If ids don't repeat within groups or you don't want to count them on a unique basis
dat[, .(n_ids = .N), group]
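To see the difference between the two counts, here is a minimal sketch on a made-up dat where id 1 repeats inside group "a":

```r
library(data.table)

# Hypothetical data: id 1 appears twice in group "a"
dat <- data.table(
  group = c("a", "a", "a", "b", "b"),
  id    = c(1, 1, 2, 3, 4)
)

# Distinct ids per group: a -> 2, b -> 2
unique_ids <- dat[, .(n_ids = uniqueN(id)), by = group]

# Rows per group: a -> 3, b -> 2
row_counts <- dat[, .(n_ids = .N), by = group]
```

The two only agree when no id repeats within a group.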
Part 2
If ids repeat within groups and you want to return all rows for the randomly selected id in each group
dat[dat[, .(id = sample(id, 1)), group], on = .(id, group)]
If ids do not repeat, or you only want one row per group anyway
dat[dat[, sample(.I, 1), group]$V1]
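One edge case worth keeping in mind: base R's sample(x, 1) treats a single number x >= 1 as 1:x, so sample(.I, 1) on a group with exactly one row can return a row from a different group. A hedged alternative is to sample a position within the group and then index .I with it. The data below is made up for illustration:

```r
library(data.table)

# Hypothetical data: group "b" has a single row, at row index 4
dat <- data.table(
  group = c("a", "a", "a", "b"),
  id    = c(10L, 11L, 12L, 13L)
)

# sample(.N, 1) draws a position in 1..(rows in group);
# .I[...] maps that position back to a row number of dat
one_per_group <- dat[dat[, .I[sample(.N, 1)], by = group]$V1]

# Repeating the draw: the row picked for group "b" is always row 4
b_rows <- replicate(50, dat[, .I[sample(.N, 1)], by = group]$V1[2])
```

When every group is known to have at least two rows, the shorter sample(.I, 1) form behaves the same way.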
Thanks to Frank's comment, you can also do the second option for parts 1 & 2 above in one line. This returns the same kind of row as dat[dat[, sample(.I, 1), group]$V1] but also adds a column N showing the number of ids (assumed to equal the number of rows in the group):
dat[sample(.N), c(.SD[1], .N), keyby=group]
How to take a random subsample from rows of a data.table per factor?
We could join with a key/value dataset and use .I to sample:
DT[DT[data.table(factor = letters[1:3], val = c(12, 20, 5)),
on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]$V1]
Breaking this into parts:
data.table(factor = letters[1:3], val = c(12, 20, 5)) is a key/value data.table. Joining it on 'factor' adds 'val' as a column to the original dataset.
In the second step, we do the join:
DT[data.table(factor = letters[1:3], val = c(12, 20, 5)),
on = .(factor)]
Now we sample the row index, grouped by 'factor', taking the size from the first element of 'val'. We then extract the row-index column $V1 and use it to subset the original dataset, i.e.
DT[....$V1]
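Note that X[Y, on = ...] returns rows in the order of Y, so the .I values computed on the joined result only line up with DT when DT is already sorted by 'factor' in the same order as the key/value table. A sketch that does not rely on row order, assuming made-up data and using an update join on a copy so that 'val' is attached without reordering anything:

```r
library(data.table)

set.seed(1)  # only so the sketch is reproducible

# Hypothetical data: 60 rows in shuffled order, 10 "a", 30 "b", 20 "c"
DT <- data.table(
  factor = sample(rep(letters[1:3], times = c(10, 30, 20))),
  x      = 1:60
)
sizes <- data.table(factor = letters[1:3], val = c(12, 20, 5))

# Attach the per-factor sample size to a copy, keeping DT's row order
tmp <- copy(DT)
tmp[sizes, on = .(factor), val := i.val]

# Per factor, draw val[1] positions with replacement and map them to row numbers
idx <- tmp[, .I[sample(.N, val[1], replace = TRUE)], by = factor]$V1

res <- DT[idx]
res[, .N, keyby = factor]
```

Sampling is with replacement here, as in the answer above, because factor "a" has fewer rows (10) than its requested size (12).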
Subset by column criteria AND randomly sample rows of a data.table
To achieve that, I would use the .I special symbol as follows:
DT <- as.data.table(mtcars)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))]
Now you can do some computations:
set.seed(2019)
DT[c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
, lapply(.SD, mean)
, by = am
, .SDcols = 3:5]
which gives:
am disp hp drat
1: 0 325.64 179.0667 3.224667
2: 1 243.00 204.7500 3.890000
If you want to reuse that index vector at a later moment, you can store it beforehand:
idx <- c(DT[, .I[cyl == 8]], sample(DT[, .I[cyl == 6]], 5))
DT[idx, lapply(.SD, mean), .SDcols = 3:5]
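As a quick sanity check on the stored index, the following sketch builds it in two named steps and counts what ends up in the subset. It uses which() inside j, which gives the same row numbers as .I[cyl == 8]; the variable names are just for illustration.

```r
library(data.table)

set.seed(2019)
DT <- as.data.table(mtcars)

keep_all <- DT[, which(cyl == 8)]  # every 8-cylinder row
pool     <- DT[, which(cyl == 6)]  # candidate 6-cylinder rows

# All 8-cylinder rows plus 5 randomly chosen 6-cylinder rows
idx <- c(keep_all, sample(pool, 5))
res <- DT[idx]

# mtcars has 14 eight-cylinder cars, so this shows 5 rows for cyl 6 and 14 for cyl 8
res[, .N, keyby = cyl]
```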
Question: Randomly pick rows based on the number of other elements with data.table?
We can join DT and DTmin so that the MIN value becomes a column in the data, and then, for each CATA and ITEM, select MIN rows at random.
library(data.table)
DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]
Similarly, with dplyr:
library(dplyr)
DT %>%
left_join(DTmin, by = 'CATA') %>%
group_by(CATA, ITEM) %>%
sample_n(first(MIN))
All the MIN values are the same throughout a group, so we can use any of them; I use the first one.
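Since the question's data is not shown here, this is a sketch of the data.table version on made-up DT and DTmin tables (the ID column and all values are assumptions):

```r
library(data.table)

set.seed(10)  # only so the sketch is reproducible

# Hypothetical data: 2 categories, 2 items each, 4 rows per item
DT <- data.table(
  CATA = rep(c("A", "B"), each = 8),
  ITEM = rep(c("i1", "i2", "i3", "i4"), each = 4),
  ID   = 1:16
)
# Hypothetical per-category sample sizes
DTmin <- data.table(CATA = c("A", "B"), MIN = c(2L, 3L))

# Join MIN onto the rows, then draw MIN rows within each (CATA, ITEM) group
res <- DT[DTmin, on = "CATA"][, .SD[sample(.N, first(MIN))], by = .(CATA, ITEM)]

res[, .N, by = .(CATA, ITEM)]
```

Each item in category A ends up with 2 rows and each item in category B with 3. The draw fails if a group has fewer rows than its MIN, since sampling is without replacement.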
Change values for a random selection of a data.table subset
Besides the suggestions above, you can also sample the row index (which can be computed with the which function):
dt[sample(which(city == "New York"), 1), score:=555L]
dt
# city score
# 1: Tel Aviv 8
# 2: Amsterdam 3
# 3: Cape Town 10
# 4: New York 1
# 5: Cape Town 10
# 6: Pittsburgh 2
# 7: Pittsburgh 8
# 8: Amsterdam 10
# 9: Amsterdam 8
# 10: Amsterdam 4
# 11: Tel Aviv 7
# 12: Amsterdam 2
# 13: Pittsburgh 1
# 14: Amsterdam 3
# 15: Pittsburgh 2
# 16: New York 7
# 17: Tel Aviv 10
# 18: New York 10
# 19: Cape Town 1
# 20: Amsterdam 7
# 21: Amsterdam 3
# 22: New York 555
# 23: Cape Town 6
# 24: New York 1
# 25: Tel Aviv 10
# city score
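If there might be only one matching row, sample(which(...), 1) can misfire, because base R's sample() treats a single number x >= 1 as 1:x. A sketch that sidesteps this by sampling a position rather than the index itself (the data and the helper names ny and pick are made up):

```r
library(data.table)

set.seed(7)  # only so the sketch is reproducible

# Hypothetical data
dt <- data.table(
  city  = c("Tel Aviv", "New York", "Cape Town", "New York", "Amsterdam"),
  score = c(8L, 1L, 10L, 7L, 3L)
)

# Row numbers of the subset we want to pick from
ny <- dt[, which(city == "New York")]

# Draw a position within ny, then look up the row number it holds
pick <- ny[sample.int(length(ny), 1L)]

# Update that single row by reference
dt[pick, score := 555L]
```

With two or more matches this behaves like the one-liner above; with exactly one match it always updates that row.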
R - Extract random sample with Conditional using 'Which' in Loop
The group_by and sample_n functions in the dplyr package let you do this easily:
library(dplyr)
subset <- H0_LONG %>%
group_by(Patch) %>%
sample_n(25)
This approach will typically also run faster than a for loop. Note that this code is just another way of writing:
subset <- sample_n(group_by(H0_LONG, Patch), 25)
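For comparison with the rest of this page, roughly the same grouped draw can be written in data.table. This is a sketch on a made-up H0_LONG (the value column and the patch names are assumptions):

```r
library(data.table)

set.seed(123)  # only so the sketch is reproducible

# Hypothetical data: 3 patches with 40 rows each
H0_LONG <- data.table(
  Patch = rep(c("P1", "P2", "P3"), each = 40),
  value = rnorm(120)
)

# 25 random rows from each Patch, without replacement
subset_dt <- H0_LONG[, .SD[sample(.N, 25)], by = Patch]

subset_dt[, .N, by = Patch]
```

As with sample_n, the draw fails if any group has fewer than 25 rows, since sampling is without replacement.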
Pick one random element from a vector for each row of a data.table
We can group by Name and then call sample:
dt[, NextItem := sample(x, 1), by = Name]
Or you can also do this with .N instead of by:
dt[, NextItem := sample(x, .N, replace = TRUE)]
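The two forms differ when a Name repeats: grouping by Name makes one draw per distinct Name and reuses it for every row of that group, while the .N form makes an independent draw for every row. A sketch on made-up data, using two hypothetical column names so both results sit side by side:

```r
library(data.table)

set.seed(99)  # only so the sketch is reproducible

x  <- c("apple", "banana", "cherry")              # vector to draw from
dt <- data.table(Name = c("Ann", "Bob", "Ann", "Cid"))  # "Ann" appears twice

# One draw per distinct Name, shared by all rows with that Name
dt[, PerName := sample(x, 1), by = Name]

# One independent draw per row
dt[, PerRow := sample(x, .N, replace = TRUE)]
```

If Name is unique per row, the two give the same kind of result and the .N form avoids the grouping overhead.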
Extract rows using multiple conditions related to the order of occurrence of zero and one in R rule()
You could create a small function that reflects your four conditions, and then apply that function by group
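f <- function(z,p) {
# positions where pos is 1 and where zero is 0
p1 = which(p==1)
z0 = which(z==0)
# first and last position of a 1
c1 = c(p1[1],p1[length(p1)])
# the position just before the first 1, if zero is 0 there
c2 = ifelse(z[p1[1]-1]==0, as.integer(p1[1]-1),as.integer(NA))
# the first 0 after the first 1
c3 = min(z0[which(z0>p1[1])], na.rm=TRUE)
# the last position holding either a 1 or a 0
c4 = max(p1,z0, na.rm=TRUE)
unique(c(c1,c2,c3,c4))
}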
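To make the four pieces concrete, here is the function run on a tiny made-up pair of vectors (not the question's data), with each intermediate value traced in the comments:

```r
# Same function as above, repeated so the sketch is self-contained
f <- function(z, p) {
  p1 <- which(p == 1)
  z0 <- which(z == 0)
  c1 <- c(p1[1], p1[length(p1)])
  c2 <- ifelse(z[p1[1] - 1] == 0, as.integer(p1[1] - 1), as.integer(NA))
  c3 <- min(z0[which(z0 > p1[1])], na.rm = TRUE)
  c4 <- max(p1, z0, na.rm = TRUE)
  unique(c(c1, c2, c3, c4))
}

# Hypothetical input: zeros at positions 1 and 3, ones at positions 2 and 4
z <- c(0, NA, 0, NA)
p <- c(NA, 1, NA, 1)

# c1 = 2, 4   (first and last 1)
# c2 = 1      (position 1 is a 0 directly before the first 1)
# c3 = 3      (first 0 after position 2)
# c4 = 4      (last position holding a 1 or a 0)
rows <- f(z, p)
rows
```

The result is the set of row positions 2, 4, 1, 3, which is what gets matched against row_number() or used to index .SD within each group.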
Now, apply that function by group
library(dplyr)
df %>%
group_by(ID) %>%
filter(row_number() %in% f(zero,pos))
Output:
ID var zero pos
<dbl> <chr> <dbl> <dbl>
1 60 X2 0 NA
2 60 X3 NA 1
3 60 X6 0 NA
4 60 X9 NA 1
5 61 X1 NA 1
6 61 X4 0 NA
7 61 X9 NA 1
8 61 X10 0 NA
Or, using data.table
library(data.table)
setDT(df)[, .SD[f(zero,pos)], by=ID]
Output:
ID var zero pos
<num> <char> <num> <num>
1: 60 X3 NA 1
2: 60 X9 NA 1
3: 60 X2 0 NA
4: 60 X6 0 NA
5: 61 X1 NA 1
6: 61 X9 NA 1
7: 61 X4 0 NA
8: 61 X10 0 NA