R Data.Table Conditional Aggregation

R data.table conditional aggregation

Given that d is your input structure:

library(data.table)

d[, c.sum := sum(d$val[d$id1 %in% id1]), by = id2][]

How it works: by=id2 groups the data.table d by id2. Inside each group, the bare name id1 refers to that group's id1 values, while d$id1 and d$val refer to the full columns, so d$id1 %in% id1 flags every row of d whose id1 occurs in the current group. sum(d$val[...]) adds up val over those rows, and c.sum:=sum(...) stores the result in a new column c.sum of d. The trailing [] only forces the result to print.

The output is:

#    id1 id2 val c.sum
# 1:   a   x   2     9
# 2:   a   y   1    12
# 3:   a   z   2    12
# 4:   b   x   1     9
# 5:   b   u   3     4
# 6:   c   y   4    12
# 7:   c   z   3    12
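
To reproduce this, d has to exist first. The sketch below rebuilds it from the id1, id2 and val columns of the output above (the original input is not shown here, so that reconstruction is an assumption) and then applies the same expression:

```r
library(data.table)

# Input reconstructed from the id1, id2 and val columns of the printed output
d <- data.table(
  id1 = c("a", "a", "a", "b", "b", "c", "c"),
  id2 = c("x", "y", "z", "x", "u", "y", "z"),
  val = c(2, 1, 2, 1, 3, 4, 3)
)

# For each id2 group, sum val over all rows of d whose id1 occurs in that group
d[, c.sum := sum(d$val[d$id1 %in% id1]), by = id2]
print(d)
```

For example, group x contains id1 values a and b, so its c.sum covers every row with id1 equal to a or b, whatever its id2.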

Conditional aggregation in dplyr and data.table

You can use ifelse:

library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(new_var = ifelse(Sepal.Length < 5,
                          mean(Sepal.Length[Sepal.Length < 5]), NA)) %>%
  ungroup()

#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
#           <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
#  1          5.1         3.5          1.4         0.2 setosa    NA
#  2          4.9         3            1.4         0.2 setosa     4.67
#  3          4.7         3.2          1.3         0.2 setosa     4.67
#  4          4.6         3.1          1.5         0.2 setosa     4.67
#  5          5           3.6          1.4         0.2 setosa    NA
#  6          5.4         3.9          1.7         0.4 setosa    NA
#  7          4.6         3.4          1.4         0.3 setosa     4.67
#  8          5           3.4          1.5         0.2 setosa    NA
#  9          4.4         2.9          1.4         0.2 setosa     4.67
# 10          4.9         3.1          1.5         0.1 setosa     4.67
# … with 140 more rows
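
The heading also mentions data.table, but only the dplyr version is given. A minimal data.table sketch of the same idea could look like this; the name iris_dt is made up for illustration, and NA_real_ is used instead of NA so the new column is numeric in every group:

```r
library(data.table)

iris_dt <- as.data.table(iris)  # iris_dt is a hypothetical name for the copy

# Per Species: rows with Sepal.Length < 5 get the mean of those short sepals,
# all other rows get NA
iris_dt[, new_var := ifelse(Sepal.Length < 5,
                            mean(Sepal.Length[Sepal.Length < 5]),
                            NA_real_),
        by = Species]

head(iris_dt, 10)
```

Unlike the dplyr pipeline, := modifies iris_dt by reference, so no reassignment or ungroup() step is needed.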

Conditional row wise aggregation using or in data.table

I think you are not consistently including or excluding the end dates in your calculations of percentage_open_10. If we include end dates, you can use the following:

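ndays <- 10L
# Treat unpaid documents as still open by giving them a far-future payment date
setnafill(dt, fill=as.IDate("9999-12-31"), cols="payment_date")

# First day of the ndays-long window that ends on (and includes) creation_date
dt[, cd10 := creation_date - ndays + 1L]

dt[, c("open_docs_10", "percentage_open_10") :=
    # Non-equi self-join: for each document (i), match the same customer's documents (x)
    # created on or before it and paid on or after the start of its window
    .SD[.SD, on=.(customer_id, creation_date<=creation_date, payment_date>=cd10),
        allow.cartesian=TRUE, by=.EACHI, {
            ix <- x.document_id != i.document_id  # exclude the document itself
            p <- 0
            if (any(ix)) {
                lastd <- min(c(i.creation_date, max(x.payment_date[ix])))
                firstd <- max(c(i.cd10, min(x.creation_date[ix])))
                p <- (lastd - firstd + 1) / ndays
            }
            .(.N - 1L, p)
        }][, (1L:3L) := NULL]
]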

Output:

   customer_id document_id creation_date payment_date       cd10 open_docs_10 percentage_open_10
1:           1           1    2005-03-01   2005-03-05 2005-02-20            0                0.0
2:           1           2    2005-03-03   2005-03-07 2005-02-22            1                0.3
3:           1           3    2005-03-10   9999-12-31 2005-03-01            2                0.7
4:           1           4    2005-03-25   2005-03-28 2005-03-16            1                1.0
5:           2           5    2006-03-01   2006-03-05 2006-02-20            0                0.0
6:           2           6    2006-03-04   9999-12-31 2006-02-23            1                0.4
7:           2           7    2006-03-10   2006-03-15 2006-03-01            2                1.0
8:           2           8    2006-03-12   2006-03-16 2006-03-03            3                1.0

However, with 3 million rows, I am not hopeful that this can be completed in a few seconds.
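
Before running this on the full data, the join conditions can be sanity-checked on a small sample with a plain row-by-row loop. The sketch below is such a cross-check for open_docs_10 only, not a replacement for the join. The sample input is reconstructed from the output above (9999-12-31 taken to mean a missing payment date), and dt_check, paid, window_start and open_docs_check are names made up for illustration:

```r
# Sample input reconstructed from the printed output; NA marks unpaid documents
dt_check <- data.frame(
  customer_id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  document_id   = 1:8,
  creation_date = as.Date(c("2005-03-01", "2005-03-03", "2005-03-10", "2005-03-25",
                            "2006-03-01", "2006-03-04", "2006-03-10", "2006-03-12")),
  payment_date  = as.Date(c("2005-03-05", "2005-03-07", NA, "2005-03-28",
                            "2006-03-05", NA, "2006-03-15", "2006-03-16"))
)

ndays <- 10L

# Unpaid documents count as open indefinitely
paid <- dt_check$payment_date
paid[is.na(paid)] <- as.Date("9999-12-31")

# First day of the window ending on (and including) each creation_date
window_start <- dt_check$creation_date - ndays + 1L

# For each document, count the same customer's other documents that were
# created on or before it and paid on or after the start of its window
open_docs_check <- vapply(seq_len(nrow(dt_check)), function(i) {
  sum(dt_check$customer_id == dt_check$customer_id[i] &
      dt_check$document_id != dt_check$document_id[i] &
      dt_check$creation_date <= dt_check$creation_date[i] &
      paid >= window_start[i])
}, integer(1))

print(open_docs_check)
```

This loop is quadratic in the number of rows, so it is only useful for checking a handful of customers, not for the 3 million row case.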

Conditional aggregation based on groups in a data frame R

We can use map_dfc to loop over the comparison values. In each iteration, group by 'Col1' and create two columns: the number of rows where 'Col7' is less than or equal to the current value, and the sum of the 'Col4' values on those same rows. The resulting columns are then bound to the original data:

library(purrr)
library(dplyr)
map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
    group_by(Col1) %>%
    transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
              !! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7 <= .x])) %>%
    ungroup %>%
    select(-Col1)) %>%
  bind_cols(Data_Frame, .)

Output:

#  Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
#1   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0                0         0                0         1                2          3
#2   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0                0         0                0         1                2          3
#3   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0                0         0                0         1                2          3
#4   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0                0         0                0         3                9          3
#5   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0                0         0                0         3                9          3
#6   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0                0         0                0         3                9          3
#7   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         2                8         3               12         3               12          3
#8   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2                8         3               12         3               12          3
#9   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2                8         3               12         3               12          3
#  Last10SumCol4Col7
#1                 8
#2                 8
#3                 8
#4                 9
#5                 9
#6                 9
#7                12
#8                12
#9                12

OP's code gives the wrong sum because Data_Frame[Data_Frame$Col7 <= 2, ] ignores the grouping: it subsets the whole data frame instead of only the rows in the current group. Within tidyverse verbs there is no need for Data_Frame$, since bare column names already refer to the current group's values. If the data itself has to be referenced, use . or cur_data(). Here, the condition Col7 <= 2 is all that is needed.
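
The difference is easy to see on a toy example. In the sketch below, toy and the column names whole and grouped are made up for illustration. The first summary reaches back into the full data frame with toy$, the second uses bare column names:

```r
library(dplyr)

# Hypothetical four-row data set with two groups
toy <- data.frame(
  Col1 = c("A1", "A1", "A2", "A2"),
  Col4 = c(1, 2, 3, 4),
  Col7 = c(1, 5, 2, 2)
)

res <- toy %>%
  group_by(Col1) %>%
  summarise(
    # toy$ bypasses the grouping: every group gets the total over all rows
    whole   = sum(toy$Col4[toy$Col7 <= 2]),
    # bare names see only the rows of the current group
    grouped = sum(Col4[Col7 <= 2])
  )

print(res)
```

The whole column repeats the same overall total for both groups, while grouped holds the per-group sums that were actually wanted.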

R data.table Conditional Sum: Cleaner way

Try this:

wages[, list(foo = sum(tentgelt > te_med) / .N), by = year]
#    year  foo
# 1: 2010 0.45
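
Since the mean of a logical vector is the share of TRUE values, sum(cond) / .N can also be written as mean(cond). A small sketch with made-up data (wages_demo and its values are hypothetical, only the column names follow the snippet above):

```r
library(data.table)

# Hypothetical wages data; te_med is taken to be a per-year threshold
wages_demo <- data.table(
  year     = c(2010, 2010, 2010, 2010, 2011, 2011),
  tentgelt = c(10, 20, 30, 40, 15, 25),
  te_med   = c(15, 15, 15, 15, 20, 20)
)

# Share of rows per year where tentgelt exceeds te_med, written both ways
res <- wages_demo[, .(foo      = sum(tentgelt > te_med) / .N,
                      foo_mean = mean(tentgelt > te_med)),
                  by = year]

print(res)
```

Both columns hold the same values; mean() is shorter but behaves differently from the .N version when the comparison contains NA, so the choice depends on how missing values should be treated.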

