Subset rows corresponding to max value by group using data.table
Here's the fast data.table way:
bdt[bdt[, .I[g == max(g)], by = id]$V1]
This avoids constructing .SD, which is the bottleneck in your expressions.
edit: Actually, the main reason the OP's code is slow is not just that it uses .SD, but that it uses it in a particular way - by calling [.data.table, which at the moment has a huge overhead, so running it in a loop (once per group when using by) accumulates a very large penalty.
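The two approaches can be compared on a small made-up table (the OP's bdt is not shown in the question, so the data below is hypothetical):

```r
library(data.table)

# Hypothetical example data standing in for the OP's bdt
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 7, 5, 5, 2))

# Slow idiom: [.data.table runs once per group inside `by`
slow <- bdt[, .SD[g == max(g)], by = id]

# Fast idiom: compute row indices per group first, then subset once
fast <- bdt[bdt[, .I[g == max(g)], by = id]$V1]

all.equal(slow, fast)  # TRUE
```

Both return the same rows; the second version just avoids the per-group subsetting overhead.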
data.table: Select row with maximum value by group with several grouping variables
You can compare value with the maximum value within each A and B group, extract the logical vector, and use it to subset the data.table.
library(data.table)
setDT(mydf)
mydf[mydf[, value == max(value), by = .(A, B)]$V1]
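The question's mydf is not shown; a minimal sketch with made-up data (note the groups are contiguous, so the grouped logical vector lines up with the original row order):

```r
library(data.table)

# Hypothetical data standing in for the question's mydf
mydf <- data.frame(A = c("x", "x", "y"), B = c(1, 1, 2),
                   value = c(10, 20, 5))
setDT(mydf)

# Logical vector marking each row that carries its group maximum, then subset
mydf[mydf[, value == max(value), by = .(A, B)]$V1]
#    A B value
# 1: x 1    20
# 2: y 2     5
```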
Select the row with the maximum value in each group
Here's a data.table solution:
require(data.table) ## 1.9.2
group <- as.data.table(group)
If you want to keep all the entries corresponding to max values of pt within each group:
group[group[, .I[pt == max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
If you'd like just the first max value of pt:
group[group[, .I[which.max(pt)], by=Subject]$V1]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
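The difference does show up once a group contains a tie; a sketch with hypothetical data:

```r
library(data.table)

# Hypothetical data with a tied maximum in group 1
group <- data.table(Subject = c(1, 1, 2), pt = c(5, 5, 3), Event = 1:3)

# pt == max(pt) keeps every tied row
group[group[, .I[pt == max(pt)], by = Subject]$V1]
#    Subject pt Event
# 1:       1  5     1
# 2:       1  5     2
# 3:       2  3     3

# which.max(pt) keeps only the first maximum per group
group[group[, .I[which.max(pt)], by = Subject]$V1]
#    Subject pt Event
# 1:       1  5     1
# 2:       2  3     3
```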
Extract the n highest value by group with data.table in R
We can order by b in descending order, then take the first three rows of each group.
library(data.table)
DT[order(-b), head(.SD, 3), by = a]
# a b d
#1: 1 100 1.4647474
#2: 1 61 -1.1250266
#3: 1 51 0.9435628
#4: 2 82 0.3302404
#5: 2 72 -0.0219803
#6: 2 55 1.6865777
How to take all records with max value for each group
You can do this with data.table:
library(data.table)
setDT(df)[, .SD[follow_group == max(follow_group)], by = user]
or this with dplyr:
library(dplyr)
df %>%
  group_by(user) %>%
  filter(follow_group == max(follow_group))
Result (data.table, then dplyr):
user time follow_group
1: 1 2017-09-01 00:10:01 2
2: 1 2017-09-01 00:11:01 2
3: 2 2017-09-01 00:01:03 1
4: 2 2017-09-01 00:01:08 1
5: 2 2017-09-01 00:03:01 1
# A tibble: 5 x 3
# Groups: user [2]
user time follow_group
<int> <chr> <int>
1 1 2017-09-01 00:10:01 2
2 1 2017-09-01 00:11:01 2
3 2 2017-09-01 00:01:03 1
4 2 2017-09-01 00:01:08 1
5 2 2017-09-01 00:03:01 1
Data:
df = structure(list(user = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), time = c("2017-09-01 00:01:01",
"2017-09-01 00:01:20", "2017-09-01 00:03:01", "2017-09-01 00:10:01",
"2017-09-01 00:11:01", "2017-09-01 00:01:03", "2017-09-01 00:01:08",
"2017-09-01 00:03:01"), follow_group = c(1L, 1L, 1L, 2L, 2L,
1L, 1L, 1L)), class = "data.frame", .Names = c("user", "time",
"follow_group"), row.names = c(NA, -8L))
data.table sum by group and return row with max value
Here's one way to do it:
library(data.table)
dd <- data.table(
f = c("a", "a", "a", "b", "b"),
g = c(1,2,3,4,5))
dd[, .(g = sum(g)), by = f][which.max(g)]
f g
1: b 9
Subset a data.table to get the most recent 3 or more rows within a duration by group
EDIT: Corrected interpretation of question
It seems I had misinterpreted OP's requirements.
Now, I understand that the OP wants to find

1. for each group
2. the most recent sequence of dates
3. which all lie within a period of two years and
4. which consists of three or more entries.

This can be solved by grouping in a non-equi join to cover requirements (1) and (3), subsequent filtering for requirement (4), and subsetting for requirement (2). Finally, the indices of the affected rows of test.dt are retrieved:
setorder(test.dt, group, -date)
idx <- test.dt[.(group = group, upper = date, lower = date - years(2)),
on = .(group, date <= upper, date >= lower), .N, by = .EACHI][
N >= 3, seq(.I[1L], length.out = N[1L]), by = group]$V1
test.dt[idx]
group date idx age_yr
1: 1 2017-03-08 1 0.00000000
2: 1 2016-10-27 2 0.36164384
3: 1 2016-09-19 3 0.46575342
4: 1 2015-05-27 4 1.78356164
5: 2 2016-04-17 1 0.00000000
6: 2 2016-03-24 2 0.06575342
7: 2 2015-09-16 3 0.58630137
8: 2 2015-02-09 4 1.18630137
9: 2 2014-09-19 5 1.57808219
10: 2 2014-08-24 6 1.64931507
11: 2 2014-06-01 7 1.87945205
12: 2 2014-05-09 8 1.94246575
13: 2 2014-04-21 9 1.99178082
14: 3 2013-07-02 1 0.00000000
15: 3 2013-04-13 2 0.21917808
16: 3 2013-03-18 3 0.29041096
17: 3 2012-10-31 4 0.66849315
18: 3 2012-10-30 5 0.67123288
19: 3 2012-10-03 6 0.74520548
20: 3 2012-06-01 7 1.08493151
21: 4 2010-08-06 1 0.00000000
22: 4 2009-11-17 2 0.71780822
23: 4 2009-06-19 3 1.13150685
24: 4 2009-04-15 4 1.30958904
25: 4 2009-02-20 5 1.45753425
26: 4 2008-11-18 6 1.71506849
27: 4 2008-10-24 7 1.78356164
28: 5 2011-07-13 1 0.00000000
29: 5 2011-01-19 2 0.47945205
30: 5 2010-07-18 3 0.98630137
31: 5 2009-10-10 4 1.75616438
group date idx age_yr
Please note that I have used the same set.seed(1L) as in IceCreamToucan's answer when creating test.dt, to compare both results.
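The grouped non-equi join idiom above may be easier to see on a toy table (hypothetical data): for each row, count how many same-group values fall into a window ending at that row's own value, just as the answer counts dates within a two-year window ending at each date.

```r
library(data.table)

# Toy data: one group, three values, already sorted descending
dt <- data.table(g = c(1, 1, 1), x = c(10, 8, 1))

# For each row of the lookup table, count rows of dt with the same g
# and x inside the window [x - 5, x]
cnt <- dt[.(g = g, upper = x, lower = x - 5),
          on = .(g, x <= upper, x >= lower), .N, by = .EACHI]
cnt$N
# [1] 2 1 1
```

Row one (window [5, 10]) matches 10 and 8; the other two windows each match only their own value.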
Wrong interpretation of question
If I understand correctly, the OP wants to keep for each group either the most recent 3 dates (regardless how old) or all dates which occurred within the last 2 years counted from the most recent date (even if more than 3).
The approach below uses the data.table special symbol .I, which holds the row number (or index) in the original data.table while grouping.
So, the indices of the three most recent dates for each group can be determined by
setorder(test.dt, group, -date)
test.dt[, .I[1:3], keyby = group]
group V1
1: 1 1
2: 1 2
3: 1 3
4: 2 18
5: 2 19
6: 2 20
7: 3 48
8: 3 49
9: 3 50
10: 4 55
11: 4 56
12: 4 57
13: 5 64
14: 5 65
15: 5 66
16: 6 72
17: 6 73
18: 6 74
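One caveat, shown with hypothetical data: .I[1:3] pads any group that has fewer than three rows with NA indices, so head(.I, 3) is the safer variant in that situation.

```r
library(data.table)

# Hypothetical data: group 2 has only one row
short <- data.table(group = c(1, 1, 2), x = 1:3)

# .I[1:3] pads the short group with NA
short[, .I[1:3], keyby = group]$V1
# [1]  1  2 NA  3 NA NA

# head(.I, 3) takes at most three indices without padding
short[, head(.I, 3), keyby = group]$V1
# [1] 1 2 3
```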
The indices of the dates which occurred within the last two years counted from the most recent date can be determined by
test.dt[, .I[max(date) <= date %m+% years(2)], keyby = group]
Here, lubridate's date arithmetic is used to avoid problems with leap years.
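A quick illustration of the leap-year issue: adding years() across a Feb 29 produces a non-existent date (NA), while %m+% rolls back to the last valid day of the month.

```r
library(lubridate)

# Plain period arithmetic lands on the non-existent 2018-02-29
ymd("2016-02-29") + years(2)
# [1] NA

# %m+% rolls back to the last valid day instead
ymd("2016-02-29") %m+% years(2)
# [1] "2018-02-28"
```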
Both sets of indices can be combined using a set union() operation, which removes duplicate indices. This combined set of indices is then used to subset the original data.table:
setorder(test.dt, group, -date)
test.dt[test.dt[, union(.I[1:3], .I[max(date) <= date %m+% years(2)]), keyby = group]$V1]
group date idx age_yr
1: 1 2017-04-18 1 0.00000000
2: 1 2017-02-22 2 0.15068493
3: 1 2016-09-15 3 0.58904110
4: 1 2016-08-26 4 0.64383562
5: 1 2016-07-26 5 0.72876712
6: 1 2015-08-14 6 1.67945205
7: 2 2016-03-26 1 0.00000000
8: 2 2015-12-08 2 0.29863014
9: 2 2015-11-21 3 0.34520548
10: 2 2015-05-23 4 0.84383562
11: 2 2015-04-22 5 0.92876712
12: 2 2014-06-08 6 1.80000000
13: 3 2013-07-02 1 0.00000000
14: 3 2013-05-23 2 0.10958904
15: 3 2012-10-24 3 0.68767123
16: 3 2012-10-06 4 0.73698630
17: 3 2012-06-16 5 1.04383562
18: 3 2012-03-15 6 1.29863014
19: 3 2012-01-26 7 1.43287671
20: 4 2010-07-20 1 0.00000000
21: 4 2010-02-21 2 0.40821918
22: 4 2009-11-19 3 0.66575342
23: 4 2009-08-04 4 0.95890411
24: 4 2009-01-26 5 1.47945205
25: 4 2009-01-17 6 1.50410959
26: 4 2008-07-26 7 1.98356164
27: 5 2011-04-10 1 0.00000000
28: 5 2011-04-04 2 0.01643836
29: 5 2011-04-01 3 0.02465753
30: 5 2011-03-05 4 0.09863014
31: 5 2010-12-28 5 0.28219178
32: 5 2009-08-23 6 1.63013699
33: 5 2009-08-07 7 1.67397260
34: 6 2021-02-21 1 0.00000000
35: 6 2018-12-03 2 2.22191781
36: 6 2014-09-11 3 6.45205479
group date idx age_yr
Please note that idx and age_yr have been added to verify the result.
Data
I have added a 6th group of dates which represents the use case where 3 dates are picked regardless of age.
library(data.table)
library(lubridate) # for dmy()
set.seed(123L) # required for reproducible data
test.dt <- data.table(
group = c(
rep(1, times = 17),
rep(2, times = 30),
rep(3, times = 7),
rep(4, times = 9),
rep(5, times = 8),
rep(6, times = 5)
),
date = c(
sample(seq(dmy('28/8/2007'), dmy('3/10/2017'), by = 'day'), 17),
sample(seq(dmy('7/5/2007'), dmy('19/4/2016'), by = 'day'), 30),
sample(seq(dmy('28/12/2011'), dmy('3/10/2013'), by = 'day'), 7),
sample(seq(dmy('21/12/2007'), dmy('11/11/2010'),by = 'day'), 9),
sample(seq(dmy('27/8/2007'), dmy('5/2/2012'), by = 'day'), 8),
sample(seq(dmy('27/8/2001'), dmy('5/2/2029'), by = 'day'), 5)
)
)
# add data to verify result
test.dt[order(-date), idx := rowid(group)]
test.dt[, age_yr := as.integer(max(date) - date)/365, by = group]
test.dt
Ratio of row value to sum of rows in a group using r data.table
You can use prop.table to get the ratio of value within each year and quarter.
library(data.table)
dt[, pct_byQtrYr := prop.table(value), .(year, quarter)]
dt
# ID year quarter value pct_byQtrYr
# 1: A 2020 4 4.0 0.1951220
# 2: B 2020 4 10.5 0.5121951
# 3: C 2020 4 6.0 0.2926829
# 4: A 2021 1 6.6 0.2933333
# 5: B 2021 1 15.0 0.6666667
# 6: C 2021 1 0.9 0.0400000
# 7: A 2021 2 6.2 0.1980831
# 8: B 2021 2 9.8 0.3130990
# 9: C 2021 2 15.3 0.4888179
#10: A 2021 3 5.0 0.5263158
#11: B 2021 3 3.4 0.3578947
#12: C 2021 3 1.1 0.1157895
This is equivalent to dividing value by the sum of the group:
dt[, pct_byQtrYr := value/sum(value), .(year, quarter)]
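With no margin argument, prop.table(x) is exactly x / sum(x); using the 2020 Q4 values from the output above:

```r
# The 2020 Q4 values from the table above
x <- c(4, 10.5, 6)

prop.table(x)
# [1] 0.1951220 0.5121951 0.2926829

all.equal(prop.table(x), x / sum(x))
# [1] TRUE
```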