Subset rows corresponding to max value by group using data.table
Here's the fast data.table way:
bdt[bdt[, .I[g == max(g)], by = id]$V1]
This avoids constructing .SD, which is the bottleneck in your expressions.
edit: Actually, the main reason the OP's code is slow is not just that it uses .SD, but that it uses it in a particular way - by calling [.data.table, which at the moment has a huge overhead, so running it in a loop (once per group when using by) accumulates a very large penalty.
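The two approaches can be compared on a small made-up table (the OP's bdt is not shown in the question, so the data below is hypothetical):

```r
library(data.table)

# Hypothetical example data standing in for the OP's bdt
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 7, 5, 5, 2))

# Slow idiom: [.data.table runs once per group inside `by`
slow <- bdt[, .SD[g == max(g)], by = id]

# Fast idiom: compute row indices per group first, then subset once
fast <- bdt[bdt[, .I[g == max(g)], by = id]$V1]

all.equal(slow, fast)  # TRUE
```

Both return the same rows; the second version just avoids the per-group subsetting overhead.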
data.table: Select row with maximum value by group with several grouping variables
You can compare value with the maximum value within each A and B group, extract the logical vector, and use it to subset the data.table.
library(data.table)
setDT(mydf)
mydf[mydf[, value == max(value), by = .(A, B)]$V1]
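The question's mydf is not shown; a minimal sketch with made-up data (note the groups are contiguous, so the grouped logical vector lines up with the original row order):

```r
library(data.table)

# Hypothetical data standing in for the question's mydf
mydf <- data.frame(A = c("x", "x", "y"), B = c(1, 1, 2),
                   value = c(10, 20, 5))
setDT(mydf)

# Logical vector marking each row that carries its group maximum, then subset
mydf[mydf[, value == max(value), by = .(A, B)]$V1]
#    A B value
# 1: x 1    20
# 2: y 2     5
```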
Select the row with the maximum value in each group
Here's a data.table solution:
require(data.table) ## 1.9.2
group <- as.data.table(group)
If you want to keep all the entries corresponding to max values of pt within each group:
group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
If you'd like just the first max value of pt:
group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
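The difference does show up once a group contains a tie; a sketch with hypothetical data:

```r
library(data.table)

# Hypothetical data with a tied maximum in group 1
group <- data.table(Subject = c(1, 1, 2), pt = c(5, 5, 3), Event = 1:3)

# pt == max(pt) keeps every tied row
group[group[, .I[pt == max(pt)], by = Subject]$V1]
#    Subject pt Event
# 1:       1  5     1
# 2:       1  5     2
# 3:       2  3     3

# which.max(pt) keeps only the first maximum per group
group[group[, .I[which.max(pt)], by = Subject]$V1]
#    Subject pt Event
# 1:       1  5     1
# 2:       2  3     3
```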
Extract the n highest value by group with data.table in R
We can order by b in descending order, then take the first three rows of each group.
library(data.table)
DT[order(-b), head(.SD, 3), by = a]
# a b d
#1: 1 100 1.4647474
#2: 1 61 -1.1250266
#3: 1 51 0.9435628
#4: 2 82 0.3302404
#5: 2 72 -0.0219803
#6: 2 55 1.6865777
How to take all records with max value for each group
You can do this with data.table:
library(data.table)
setDT(df)[, .SD[follow_group == max(follow_group)], by = user]
or this with dplyr:
library(dplyr)
df %>%
  group_by(user) %>%
  filter(follow_group == max(follow_group))
Result (data.table, then dplyr):
user time follow_group
1: 1 2017-09-01 00:10:01 2
2: 1 2017-09-01 00:11:01 2
3: 2 2017-09-01 00:01:03 1
4: 2 2017-09-01 00:01:08 1
5: 2 2017-09-01 00:03:01 1
# A tibble: 5 x 3
# Groups: user [2]
user time follow_group
<int> <chr> <int>
1 1 2017-09-01 00:10:01 2
2 1 2017-09-01 00:11:01 2
3 2 2017-09-01 00:01:03 1
4 2 2017-09-01 00:01:08 1
5 2 2017-09-01 00:03:01 1
Data:
df = structure(list(user = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), time = c("2017-09-01 00:01:01",
"2017-09-01 00:01:20", "2017-09-01 00:03:01", "2017-09-01 00:10:01",
"2017-09-01 00:11:01", "2017-09-01 00:01:03", "2017-09-01 00:01:08",
"2017-09-01 00:03:01"), follow_group = c(1L, 1L, 1L, 2L, 2L,
1L, 1L, 1L)), class = "data.frame", .Names = c("user", "time",
"follow_group"), row.names = c(NA, -8L))
data.table sum by group and return row with max value
Here's one way to do it:
library(data.table)
dd <- data.table(
f = c("a", "a", "a", "b", "b"),
g = c(1,2,3,4,5))
dd[, .(g = sum(g)), by = f][which.max(g)]
f g
1: b 9
Subset a data.table to get the most recent 3 or more rows within a duration by group
EDIT: Corrected interpretation of question
It seems I had misinterpreted OP's requirements.
Now, I understand that the OP wants to find

1. for each group
2. the most recent sequence of dates
3. which all lie within a period of two years and
4. which consists of three or more entries.

This can be solved by grouping in a non-equi join to cover requirements (1) and (3), subsequent filtering for requirement (4), and subsetting for requirement (2). Finally, the indices of the affected rows of test.dt are retrieved:
setorder(test.dt, group, -date)
idx <- test.dt[.(group = group, upper = date, lower = date - years(2)),
on = .(group, date <= upper, date >= lower), .N, by = .EACHI][
N >= 3, seq(.I[1L], length.out = N[1L]), by = group]$V1
test.dt[idx]
group date idx age_yr
1: 1 2017-03-08 1 0.00000000
2: 1 2016-10-27 2 0.36164384
3: 1 2016-09-19 3 0.46575342
4: 1 2015-05-27 4 1.78356164
5: 2 2016-04-17 1 0.00000000
6: 2 2016-03-24 2 0.06575342
7: 2 2015-09-16 3 0.58630137
8: 2 2015-02-09 4 1.18630137
9: 2 2014-09-19 5 1.57808219
10: 2 2014-08-24 6 1.64931507
11: 2 2014-06-01 7 1.87945205
12: 2 2014-05-09 8 1.94246575
13: 2 2014-04-21 9 1.99178082
14: 3 2013-07-02 1 0.00000000
15: 3 2013-04-13 2 0.21917808
16: 3 2013-03-18 3 0.29041096
17: 3 2012-10-31 4 0.66849315
18: 3 2012-10-30 5 0.67123288
19: 3 2012-10-03 6 0.74520548
20: 3 2012-06-01 7 1.08493151
21: 4 2010-08-06 1 0.00000000
22: 4 2009-11-17 2 0.71780822
23: 4 2009-06-19 3 1.13150685
24: 4 2009-04-15 4 1.30958904
25: 4 2009-02-20 5 1.45753425
26: 4 2008-11-18 6 1.71506849
27: 4 2008-10-24 7 1.78356164
28: 5 2011-07-13 1 0.00000000
29: 5 2011-01-19 2 0.47945205
30: 5 2010-07-18 3 0.98630137
31: 5 2009-10-10 4 1.75616438
group date idx age_yr
Please note that I have used the same set.seed(1L) as in IceCreamToucan's answer when creating test.dt, to compare both results.
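The grouped non-equi join idiom above may be easier to see on a toy table (hypothetical data): for each row, count how many same-group values fall into a window ending at that row's own value, just as the answer counts dates within a two-year window ending at each date.

```r
library(data.table)

# Toy data: one group, three values, already sorted descending
dt <- data.table(g = c(1, 1, 1), x = c(10, 8, 1))

# For each row of the lookup table, count rows of dt with the same g
# and x inside the window [x - 5, x]
cnt <- dt[.(g = g, upper = x, lower = x - 5),
          on = .(g, x <= upper, x >= lower), .N, by = .EACHI]
cnt$N
# [1] 2 1 1
```

Row one (window [5, 10]) matches 10 and 8; the other two windows each match only their own value.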
Wrong interpretation of question
If I understand correctly, the OP wants to keep for each group either the most recent 3 dates (regardless how old) or all dates which occurred within the last 2 years counted from the most recent date (even if more than 3).
The approach below uses the data.table special symbol .I, which holds the row number (or index) in the original data.table while grouping.
So, the indices of the three most recent dates for each group can be determined by
setorder(test.dt, group, -date)
test.dt[, .I[1:3], keyby = group]
group V1
1: 1 1
2: 1 2
3: 1 3
4: 2 18
5: 2 19
6: 2 20
7: 3 48
8: 3 49
9: 3 50
10: 4 55
11: 4 56
12: 4 57
13: 5 64
14: 5 65
15: 5 66
16: 6 72
17: 6 73
18: 6 74
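One caveat, shown with hypothetical data: .I[1:3] pads any group that has fewer than three rows with NA indices, so head(.I, 3) is the safer variant in that situation.

```r
library(data.table)

# Hypothetical data: group 2 has only one row
short <- data.table(group = c(1, 1, 2), x = 1:3)

# .I[1:3] pads the short group with NA
short[, .I[1:3], keyby = group]$V1
# [1]  1  2 NA  3 NA NA

# head(.I, 3) takes at most three indices without padding
short[, head(.I, 3), keyby = group]$V1
# [1] 1 2 3
```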
The indices of the dates which occurred within the last two years counted from the most recent date can be determined by
test.dt[, .I[max(date) <= date %m+% years(2)], keyby = group]
Here, lubridate's date arithmetic is used to avoid problems with leap years.
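A quick illustration of the leap-year issue: adding years() across a Feb 29 produces a non-existent date (NA), while %m+% rolls back to the last valid day of the month.

```r
library(lubridate)

# Plain period arithmetic lands on the non-existent 2018-02-29
ymd("2016-02-29") + years(2)
# [1] NA

# %m+% rolls back to the last valid day instead
ymd("2016-02-29") %m+% years(2)
# [1] "2018-02-28"
```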
Both sets of indices can be combined using a set union() operation, which removes duplicate indices. This combined set of indices is then used to subset the original data.table:
setorder(test.dt, group, -date)
test.dt[test.dt[, union(.I[1:3], .I[max(date) <= date %m+% years(2)]), keyby = group]$V1]
group date idx age_yr
1: 1 2017-04-18 1 0.00000000
2: 1 2017-02-22 2 0.15068493
3: 1 2016-09-15 3 0.58904110
4: 1 2016-08-26 4 0.64383562
5: 1 2016-07-26 5 0.72876712
6: 1 2015-08-14 6 1.67945205
7: 2 2016-03-26 1 0.00000000
8: 2 2015-12-08 2 0.29863014
9: 2 2015-11-21 3 0.34520548
10: 2 2015-05-23 4 0.84383562
11: 2 2015-04-22 5 0.92876712
12: 2 2014-06-08 6 1.80000000
13: 3 2013-07-02 1 0.00000000
14: 3 2013-05-23 2 0.10958904
15: 3 2012-10-24 3 0.68767123
16: 3 2012-10-06 4 0.73698630
17: 3 2012-06-16 5 1.04383562
18: 3 2012-03-15 6 1.29863014
19: 3 2012-01-26 7 1.43287671
20: 4 2010-07-20 1 0.00000000
21: 4 2010-02-21 2 0.40821918
22: 4 2009-11-19 3 0.66575342
23: 4 2009-08-04 4 0.95890411
24: 4 2009-01-26 5 1.47945205
25: 4 2009-01-17 6 1.50410959
26: 4 2008-07-26 7 1.98356164
27: 5 2011-04-10 1 0.00000000
28: 5 2011-04-04 2 0.01643836
29: 5 2011-04-01 3 0.02465753
30: 5 2011-03-05 4 0.09863014
31: 5 2010-12-28 5 0.28219178
32: 5 2009-08-23 6 1.63013699
33: 5 2009-08-07 7 1.67397260
34: 6 2021-02-21 1 0.00000000
35: 6 2018-12-03 2 2.22191781
36: 6 2014-09-11 3 6.45205479
group date idx age_yr
Please note that idx and age_yr have been added to verify the result.
Data
I have added a 6th group of dates which represents the use case where 3 dates are picked regardless of age.
library(data.table)
library(lubridate) # for dmy()
set.seed(123L) # required for reproducible data
test.dt <- data.table(
group = c(
rep(1, times = 17),
rep(2, times = 30),
rep(3, times = 7),
rep(4, times = 9),
rep(5, times = 8),
rep(6, times = 5)
),
date = c(
sample(seq(dmy('28/8/2007'), dmy('3/10/2017'), by = 'day'), 17),
sample(seq(dmy('7/5/2007'), dmy('19/4/2016'), by = 'day'), 30),
sample(seq(dmy('28/12/2011'), dmy('3/10/2013'), by = 'day'), 7),
sample(seq(dmy('21/12/2007'), dmy('11/11/2010'),by = 'day'), 9),
sample(seq(dmy('27/8/2007'), dmy('5/2/2012'), by = 'day'), 8),
sample(seq(dmy('27/8/2001'), dmy('5/2/2029'), by = 'day'), 5)
)
)
# add data to verify result
test.dt[order(-date), idx := rowid(group)]
test.dt[, age_yr := as.integer(max(date) - date)/365, by = group]
test.dt
Ratio of row value to sum of rows in a group using r data.table
You can use prop.table to get the ratio of value within each year and quarter.
library(data.table)
dt[, pct_byQtrYr := prop.table(value), .(year, quarter)]
dt
# ID year quarter value pct_byQtrYr
# 1: A 2020 4 4.0 0.1951220
# 2: B 2020 4 10.5 0.5121951
# 3: C 2020 4 6.0 0.2926829
# 4: A 2021 1 6.6 0.2933333
# 5: B 2021 1 15.0 0.6666667
# 6: C 2021 1 0.9 0.0400000
# 7: A 2021 2 6.2 0.1980831
# 8: B 2021 2 9.8 0.3130990
# 9: C 2021 2 15.3 0.4888179
#10: A 2021 3 5.0 0.5263158
#11: B 2021 3 3.4 0.3578947
#12: C 2021 3 1.1 0.1157895
This is equivalent to dividing value by the sum of the group:
dt[, pct_byQtrYr := value/sum(value), .(year, quarter)]
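With no margin argument, prop.table(x) is exactly x / sum(x); using the 2020 Q4 values from the output above:

```r
# The 2020 Q4 values from the table above
x <- c(4, 10.5, 6)

prop.table(x)
# [1] 0.1951220 0.5121951 0.2926829

all.equal(prop.table(x), x / sum(x))
# [1] TRUE
```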