R data.table conditional aggregation
Given that d is your input structure:
library(data.table)
d[,c.sum:=sum(d$val[d$id1 %in% id1]),by=id2][]
How it works: by=id2 groups the input data table d by id2; d$id1 %in% id1 selects the rows of d whose id1 matches the id1 values of the group under consideration; sum(d$val[...]) takes the sum of val over those rows; finally, c.sum := sum(...) adds the column c.sum to d by reference. The trailing [] is needed only to print the updated table.
The output is:
# id1 id2 val c.sum
# 1: a x 2 9
# 2: a y 1 12
# 3: a z 2 12
# 4: b x 1 9
# 5: b u 3 4
# 6: c y 4 12
# 7: c z 3 12
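The input table d is not shown in the question; a version reconstructed from the printed output above (the column names and values are assumptions taken from that printout) makes the snippet runnable end to end:

```r
library(data.table)

# Reconstructed from the printed output; not from the original question
d <- data.table(id1 = c("a", "a", "a", "b", "b", "c", "c"),
                id2 = c("x", "y", "z", "x", "u", "y", "z"),
                val = c(2, 1, 2, 1, 3, 4, 3))

# For each id2 group, sum val over all rows of d whose id1 occurs in the group
d[, c.sum := sum(d$val[d$id1 %in% id1]), by = id2][]
```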
Conditional aggregation in dplyr and data.table
You can use ifelse:
library(dplyr)
iris %>%
group_by(Species) %>%
mutate(new_var = ifelse(Sepal.Length < 5,
mean(Sepal.Length[Sepal.Length < 5]), NA)) %>%
ungroup()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
# <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa NA
# 2 4.9 3 1.4 0.2 setosa 4.67
# 3 4.7 3.2 1.3 0.2 setosa 4.67
# 4 4.6 3.1 1.5 0.2 setosa 4.67
# 5 5 3.6 1.4 0.2 setosa NA
# 6 5.4 3.9 1.7 0.4 setosa NA
# 7 4.6 3.4 1.4 0.3 setosa 4.67
# 8 5 3.4 1.5 0.2 setosa NA
# 9 4.4 2.9 1.4 0.2 setosa 4.67
#10 4.9 3.1 1.5 0.1 setosa 4.67
# … with 140 more rows
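The heading mentions data.table as well; a minimal sketch of the same conditional group mean there, using fifelse with the built-in iris data (the translation to data.table is my own, not from the answer):

```r
library(data.table)

DT <- as.data.table(iris)

# Within each Species, assign the group mean of the sub-5 lengths
# to sub-5 rows, NA elsewhere
DT[, new_var := fifelse(Sepal.Length < 5,
                        mean(Sepal.Length[Sepal.Length < 5]),
                        NA_real_),
   by = Species]
```

fifelse requires both branches to have the same type, hence NA_real_ rather than plain NA.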
Conditional row wise aggregation using or in data.table
I think you are not consistently including or excluding the end dates in your calculations of percentage_open_10. If end dates are included, you can use the following:
ndays <- 10L
setnafill(dt, fill = as.IDate("9999-12-31"), cols = "payment_date")
dt[, cd10 := creation_date - ndays + 1L]
dt[, c("open_docs_10", "percentage_open_10") :=
     # non-equi self join: for each row, find this customer's documents
     # created on or before it and still unpaid within the 10-day window
     .SD[.SD, on = .(customer_id, creation_date <= creation_date, payment_date >= cd10),
         allow.cartesian = TRUE, by = .EACHI, {
           ix <- x.document_id != i.document_id   # exclude the row itself
           p <- 0
           if (any(ix)) {
             lastd  <- min(c(i.creation_date, max(x.payment_date[ix])))
             firstd <- max(c(i.cd10, min(x.creation_date[ix])))
             p <- (lastd - firstd + 1) / ndays
           }
           .(.N - 1L, p)
         }][, (1L:3L) := NULL]
]
output:
customer_id document_id creation_date payment_date cd10 open_docs_10 percentage_open_10
1: 1 1 2005-03-01 2005-03-05 2005-02-20 0 0.0
2: 1 2 2005-03-03 2005-03-07 2005-02-22 1 0.3
3: 1 3 2005-03-10 9999-12-31 2005-03-01 2 0.7
4: 1 4 2005-03-25 2005-03-28 2005-03-16 1 1.0
5: 2 5 2006-03-01 2006-03-05 2006-02-20 0 0.0
6: 2 6 2006-03-04 9999-12-31 2006-02-23 1 0.4
7: 2 7 2006-03-10 2006-03-15 2006-03-01 2 1.0
8: 2 8 2006-03-12 2006-03-16 2006-03-03 3 1.0
However, with 3 million rows, I am not hopeful that this can be completed in a few seconds.
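For reference, a dt reconstructed from the output above (an assumption, since the question's input isn't shown; the two 9999-12-31 rows correspond to NA payment dates before the setnafill call):

```r
library(data.table)

# Reconstructed from the printed output; NA payment dates are documents
# still open, later filled with 9999-12-31 by setnafill
dt <- data.table(
  customer_id   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  document_id   = 1:8,
  creation_date = as.IDate(c("2005-03-01", "2005-03-03", "2005-03-10", "2005-03-25",
                             "2006-03-01", "2006-03-04", "2006-03-10", "2006-03-12")),
  payment_date  = as.IDate(c("2005-03-05", "2005-03-07", NA, "2005-03-28",
                             "2006-03-05", NA, "2006-03-15", "2006-03-16"))
)
```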
Conditional aggregation based on groups in a data frame R
We could use map to loop over the comparison values and, grouped by 'Col1', create two columns per iteration: the count of rows where 'Col7' is less than or equal to the looped value, and the sum of the corresponding 'Col4' values where 'Col7' is less than or equal to that value.
library(purrr)
library(dplyr)
map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
group_by(Col1) %>%
transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
!! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7<= .x])) %>%
ungroup %>%
select(-Col1)) %>%
bind_cols(Data_Frame, .)
output:
#Col1 Col2 Col3 Col4 Col5 Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
#1 A1 2011-03-11 2018-10-22 4 7 9.7917808 10 0 0 0 0 1 2 3
#2 A1 2014-08-21 2019-05-24 2 6 6.3452055 10 0 0 0 0 1 2 3
#3 A1 2016-01-17 2020-12-25 2 3 4.9371585 5 0 0 0 0 1 2 3
#4 A2 2017-06-30 2018-10-12 1 1 3.4712329 5 0 0 0 0 3 9 3
#5 A2 2018-07-11 2019-09-24 4 3 2.4410959 5 0 0 0 0 3 9 3
#6 A2 2018-11-28 2020-12-19 4 2 2.0575342 5 0 0 0 0 3 9 3
#7 A3 2019-09-04 2018-10-22 4 5 1.2931507 2 2 8 3 12 3 12 3
#8 A3 2020-02-29 2019-06-14 4 1 0.8060109 1 2 8 3 12 3 12 3
#9 A3 2020-07-12 2020-12-20 4 2 0.4410959 1 2 8 3 12 3 12 3
# Last10SumCol4Col7
#1 8
#2 8
#3 8
#4 9
#5 9
#6 9
#7 12
#8 12
#9 12
The issue in the OP's code giving the wrong sum is that Data_Frame[Data_Frame$Col7 <= 2, ] breaks the grouping: it subsets the whole column rather than only the rows within the current group. Within the tidyverse, we don't need the Data_Frame$ prefix; if we need to refer to the current data, use . or cur_data(). Also, here we just need Col7 <= 2.
R data.table Conditional Sum: Cleaner way
Try this:
wages[, list(foo= sum(tentgelt > te_med)/.N), by = year]
# year foo
# 1: 2010 0.45
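Since sum(cond)/.N is just the proportion of TRUE values, mean() expresses it even more directly. A sketch with made-up data, as the original wages table isn't shown:

```r
library(data.table)

# Hypothetical stand-in for the original 'wages' table
wages <- data.table(year     = rep(c(2010L, 2011L), each = 4),
                    tentgelt = c(10, 30, 25, 5, 40, 10, 50, 60),
                    te_med   = c(20, 20, 20, 20, 35, 35, 35, 35))

# Share of rows per year where tentgelt exceeds te_med
res <- wages[, .(foo = mean(tentgelt > te_med)), by = year]
# 2010 -> 0.5, 2011 -> 0.75
```

mean() treats the logical vector as 0/1, so it equals sum(tentgelt > te_med)/.N when there are no NAs.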