How to Speed Up Subset by Groups

Great question!

I'll use df and dt as the object names, for easy/quick typing.

df = datas.tbl
dt = datas.dt

Comparison at the -O3 optimisation level:

First, here's the timing on my system with the current CRAN version of dplyr and the devel version of data.table. The devel version of dplyr seems to suffer from a performance regression (which is being fixed by Romain).

system.time(df %>% group_by(id1, id2) %>% filter(datetime == max(datetime)))
# 25.291 0.128 25.610

system.time(dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
# 17.191 0.075 17.349

I ran this quite a few times, and the timings didn't seem to change. However, I compile all packages with the -O3 optimisation flag (by setting ~/.R/Makevars appropriately), and I've observed that data.table's performance improves at -O3 much more than that of the other packages I've compared it with.
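
For reference, a minimal sketch of what such a ~/.R/Makevars could contain (the exact flags are an assumption and depend on your compiler; adjust as needed):

# ~/.R/Makevars (hypothetical example)
CFLAGS   = -O3 -Wall
CXXFLAGS = -O3 -Wall

Packages installed from source after this change (including data.table) will be compiled with these flags.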

Grouping speed comparison

Second, it is important to understand the reason for the slowness. Let's first compare the time taken just to group.

system.time(group_by(df, id1, id2))
# 0.303 0.007 0.311
system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
# 0.002 0.000 0.002

Even though there are 250,000 rows in total, your data is only ~38MB. At this size, the grouping step is unlikely to account for a noticeable share of the total run time.

Although data.table's grouping is >100x faster here, it's clearly not the reason for the overall slowness...

Why is it slow?

So what's the reason? Let's turn on datatable.verbose option and check again:

options(datatable.verbose = TRUE)
dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
# Detected that j uses these columns: datetime
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as '.I[datetime == max(datetime)]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ...
# memcpy contiguous groups took 0.097s for 230000 groups
# eval(j) took 17.129s for 230000 calls
# done dogroups in 17.597 secs

So eval(j) alone took ~97% of the time! The expression we provide in j is evaluated once for each group. Since there are 230,000 groups, and each eval() call carries a penalty, that adds up.

Avoiding the eval() penalty

Since we're aware of this penalty, we've gone ahead and started implementing internal versions of some commonly used functions: sum, mean, min, max. This will/should be expanded to as many other functions as possible (when we find time).

So, let's try computing the time for just obtaining max(datetime) first:

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
# Detected that j uses these columns: datetime
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'list(max(datetime))'
# GForce optimized j to 'list(gmax(datetime))'

And it's instant. Why? Because max() gets internally optimised to gmax() and there's no eval() call for each of the 230K groups.

So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise internally, and we have not gotten to it yet.

Workaround

So now that we know the issue, and a way to get around it, let's use it.

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
dt[dt.agg, on = c("id1", "id2", "datetime")] # v1.9.5+

This takes ~0.14 seconds on my Mac.
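
As a quick sanity check (a sketch, assuming the dt and dt.agg objects defined above; res1 and res2 are hypothetical names), the join returns the same set of rows as the original filter-by-group expression, including ties for the group maximum:

res1 <- dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
res2 <- dt[dt.agg, on = c("id1", "id2", "datetime")]
fsetequal(res1, res2)   # TRUE: both keep every row equal to its group's maximum datetime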

Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

dt[, .(datetime = base::max(datetime)), by = .(id1, id2)]

I agree optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.

Data.table - subsetting within groups during group by is slow

GForce makes grouped operations run faster and works on expressions of the form list(x = funx(X), y = funy(Y), ...), where X and Y are column names and funx and funy belong to the set of optimized functions.

  • For a full description of what works, see ?GForce.
  • To test if an expression works, read the messages from DT[, expr, by=, verbose=TRUE] (see the sketch below).
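
As a sketch of that verbose check, using the OP's DT (the exact wording of the message may differ between data.table versions):

DT[, .(sum_x1 = sum(x1)), by = c('id1','id2','id3'), verbose = TRUE]
# the messages include a line like
#   GForce optimized j to 'list(gsum(x1))'
# when j is not covered by GForce, that line is absent and j is evaluated once per group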

In the OP's case, we have sum_x1_F1 = sum(x1[filter_var < 5]), which is not covered by GForce even though sum(v) for a plain column v is. In this special case, we can create a helper column v = x1*(filter_var < 5) and sum that:

DT[, v := x1*(filter_var < 5)]

system.time(DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(v)
), by = c('id1','id2','id3')])
# user system elapsed
# 0.63 0.19 0.81

For comparison, timing the OP's code on my computer:

system.time(DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(x1[filter_var < 5])  # this line slows everything down
), by = c('id1','id2','id3')])
# user system elapsed
# 9.00 0.02 9.06
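
The same helper-column trick extends to other GForce-covered aggregates. For example, a conditional mean can be built from two GForce-friendly sums and a division afterwards (a sketch; the columns f and avg_x1_F1 and the object agg are hypothetical names, not from the OP):

DT[, f := filter_var < 5]   # store the filter condition as a plain (logical) column
DT[, v := x1 * f]           # as above: x1 zeroed out where the condition is FALSE
agg <- DT[, .(sv = sum(v), sf = sum(f)), by = c('id1','id2','id3')]  # both sums are simple sum(column) calls
agg[, avg_x1_F1 := sv / sf] # mean of x1 where filter_var < 5
DT[, c('v', 'f') := NULL]   # drop the helper columns when done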

Faster equivalent to group_by %>% expand in R

You could do

out <- DT[, .(col = seq.int(Start_year, 2015L)), by = ID]
out
# ID col
# 1: 1 1999
# 2: 1 2000
# 3: 1 2001
# 4: 1 2002
# 5: 1 2003
# 6: 1 2004
# 7: 1 2005
# 8: 1 2006
# 9: 1 2007
# ...

In your case you would probably need to do

setDT(df)[, .(col = seq.int(Start_year, 2015L)), by = ID]

A tidyverse way of the same idea

library(readr); library(dplyr); library(tidyr)
tbl <- read_table(text)

tbl %>%
  group_by(ID) %>%
  mutate(Start_year = list(seq.int(Start_year, 2015L))) %>%
  # rename(new_col = Start_year)
  unnest()
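
If you are on a recent tidyverse (an assumption: dplyr >= 1.1), the list-column plus unnest() step can also be written more directly with reframe(), which may return multiple rows per group:

tbl %>%
  group_by(ID) %>%
  reframe(col = seq.int(Start_year, 2015L))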

Data

text <- "ID    Start_year
01 1999
02 2004
03 2015
04 2007"

library(data.table)
DT <- fread(text)

Subset a data.table to get the most recent 3 or more rows within a duration by group


EDIT: Corrected interpretation of question

It seems I had misinterpreted the OP's requirements.

Now, I understand that the OP wants to find

  1. for each group
  2. the most recent sequence of dates
  3. which all lie within a period of two years, and
  4. which consist of three or more entries.

This can be solved by grouping in a non-equi join to cover requirements (1) and (3), followed by filtering for requirement (4) and subsetting for requirement (2). Finally, the indices of the affected rows of test.dt are retrieved:

setorder(test.dt, group, -date)   # most recent date first within each group
idx <- test.dt[.(group = group, upper = date, lower = date - years(2)),  # lubridate::years()
               on = .(group, date <= upper, date >= lower),   # non-equi self-join: requirements (1) and (3)
               .N, by = .EACHI][                              # count the dates falling in each 2-year window
  N >= 3,                                                     # requirement (4): windows with 3 or more entries
  seq(.I[1L], length.out = N[1L]), by = group]$V1             # requirement (2): rows of the most recent such window
test.dt[idx]
    group       date idx     age_yr
1: 1 2017-03-08 1 0.00000000
2: 1 2016-10-27 2 0.36164384
3: 1 2016-09-19 3 0.46575342
4: 1 2015-05-27 4 1.78356164
5: 2 2016-04-17 1 0.00000000
6: 2 2016-03-24 2 0.06575342
7: 2 2015-09-16 3 0.58630137
8: 2 2015-02-09 4 1.18630137
9: 2 2014-09-19 5 1.57808219
10: 2 2014-08-24 6 1.64931507
11: 2 2014-06-01 7 1.87945205
12: 2 2014-05-09 8 1.94246575
13: 2 2014-04-21 9 1.99178082
14: 3 2013-07-02 1 0.00000000
15: 3 2013-04-13 2 0.21917808
16: 3 2013-03-18 3 0.29041096
17: 3 2012-10-31 4 0.66849315
18: 3 2012-10-30 5 0.67123288
19: 3 2012-10-03 6 0.74520548
20: 3 2012-06-01 7 1.08493151
21: 4 2010-08-06 1 0.00000000
22: 4 2009-11-17 2 0.71780822
23: 4 2009-06-19 3 1.13150685
24: 4 2009-04-15 4 1.30958904
25: 4 2009-02-20 5 1.45753425
26: 4 2008-11-18 6 1.71506849
27: 4 2008-10-24 7 1.78356164
28: 5 2011-07-13 1 0.00000000
29: 5 2011-01-19 2 0.47945205
30: 5 2010-07-18 3 0.98630137
31: 5 2009-10-10 4 1.75616438
group date idx age_yr

Please note that I have used the same set.seed(1L) as in IceCreamToucan's answer when creating test.dt, in order to compare both results.

Wrong interpretation of question

If I understand correctly, the OP wants to keep, for each group, either the 3 most recent dates (regardless of how old they are) or all dates which occurred within the last 2 years counted from the most recent date (even if there are more than 3).

The approach below uses data.table's special symbol .I, which holds the row numbers (indices) of the original data.table while grouping.

So, the indices of the three most recent dates for each group can be determined by

setorder(test.dt, group, -date)
test.dt[, .I[1:3], keyby = group]
    group V1
1: 1 1
2: 1 2
3: 1 3
4: 2 18
5: 2 19
6: 2 20
7: 3 48
8: 3 49
9: 3 50
10: 4 55
11: 4 56
12: 4 57
13: 5 64
14: 5 65
15: 5 66
16: 6 72
17: 6 73
18: 6 74

The indices of the dates which occurred within the last two years counted from the most recent date can be determined by

test.dt[, .I[max(date) <= date %m+% years(2)], keyby = group]

Here, lubridate's date arithmetic is used to avoid problems with leap years.
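
As a small illustration of the difference (a sketch, not part of the original data):

library(lubridate)
dmy("29/2/2016") + years(2)     # NA: 29 Feb 2018 does not exist
dmy("29/2/2016") %m+% years(2)  # "2018-02-28": rolled back to the last valid day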

Both sets of indices can be combined using a union() operation, which removes duplicate indices. The combined set of indices is then used to subset the original data.table:

setorder(test.dt, group, -date)
test.dt[test.dt[, union(.I[1:3], .I[max(date) <= date %m+% years(2)]), keyby = group]$V1]
    group       date idx     age_yr
1: 1 2017-04-18 1 0.00000000
2: 1 2017-02-22 2 0.15068493
3: 1 2016-09-15 3 0.58904110
4: 1 2016-08-26 4 0.64383562
5: 1 2016-07-26 5 0.72876712
6: 1 2015-08-14 6 1.67945205
7: 2 2016-03-26 1 0.00000000
8: 2 2015-12-08 2 0.29863014
9: 2 2015-11-21 3 0.34520548
10: 2 2015-05-23 4 0.84383562
11: 2 2015-04-22 5 0.92876712
12: 2 2014-06-08 6 1.80000000
13: 3 2013-07-02 1 0.00000000
14: 3 2013-05-23 2 0.10958904
15: 3 2012-10-24 3 0.68767123
16: 3 2012-10-06 4 0.73698630
17: 3 2012-06-16 5 1.04383562
18: 3 2012-03-15 6 1.29863014
19: 3 2012-01-26 7 1.43287671
20: 4 2010-07-20 1 0.00000000
21: 4 2010-02-21 2 0.40821918
22: 4 2009-11-19 3 0.66575342
23: 4 2009-08-04 4 0.95890411
24: 4 2009-01-26 5 1.47945205
25: 4 2009-01-17 6 1.50410959
26: 4 2008-07-26 7 1.98356164
27: 5 2011-04-10 1 0.00000000
28: 5 2011-04-04 2 0.01643836
29: 5 2011-04-01 3 0.02465753
30: 5 2011-03-05 4 0.09863014
31: 5 2010-12-28 5 0.28219178
32: 5 2009-08-23 6 1.63013699
33: 5 2009-08-07 7 1.67397260
34: 6 2021-02-21 1 0.00000000
35: 6 2018-12-03 2 2.22191781
36: 6 2014-09-11 3 6.45205479
group date idx age_yr

Please note that idx and age_yr have been added to verify the result.

Data

I have added a 6th group of dates which represents the use case where 3 dates are picked regardless of age.

library(data.table)
library(lubridate)   # for dmy(), years(), %m+%

set.seed(123L)   # required for reproducible data
test.dt <- data.table(
  group = c(
    rep(1, times = 17),
    rep(2, times = 30),
    rep(3, times = 7),
    rep(4, times = 9),
    rep(5, times = 8),
    rep(6, times = 5)
  ),
  date = c(
    sample(seq(dmy('28/8/2007'), dmy('3/10/2017'), by = 'day'), 17),
    sample(seq(dmy('7/5/2007'), dmy('19/4/2016'), by = 'day'), 30),
    sample(seq(dmy('28/12/2011'), dmy('3/10/2013'), by = 'day'), 7),
    sample(seq(dmy('21/12/2007'), dmy('11/11/2010'), by = 'day'), 9),
    sample(seq(dmy('27/8/2007'), dmy('5/2/2012'), by = 'day'), 8),
    sample(seq(dmy('27/8/2001'), dmy('5/2/2029'), by = 'day'), 5)
  )
)
# add columns to verify the result
test.dt[order(-date), idx := rowid(group)]
test.dt[, age_yr := as.integer(max(date) - date) / 365, by = group]
test.dt

