Cumsum by Group

R cumsum per group in dplyr

Ah. After fiddling around I seem to have found it.

pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))
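(The sample data isn't shown here; something close to the following reproduces it, since set.seed(123) gives exactly the sales values that appear in the outputs below.)

library(dplyr)

set.seed(123)
df <- data.frame(
  dates = rep(seq(as.Date("2014-01-02"), as.Date("2014-01-06"), by = "day"), 3),
  group = rep(c("A", "B", "C"), each = 5),
  sales = rnorm(15)
)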

Output with the for loop from the question:

> pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
> for(grp in unique(df$group)){
+ subs = filter(df, group == grp) %>% arrange(dates)
+ pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
+ }
> pdf
        dates group       sales         cs
 1 2014-01-02     A -0.56047565 -0.5604756
 2 2014-01-03     A -0.23017749 -0.7906531
 3 2014-01-04     A  1.55870831  0.7680552
 4 2014-01-05     A  0.07050839  0.8385636
 5 2014-01-06     A  0.12928774  0.9678513
 6 2014-01-02     B  1.71506499  1.7150650
 7 2014-01-03     B  0.46091621  2.1759812
 8 2014-01-04     B -1.26506123  0.9109200
 9 2014-01-05     B -0.68685285  0.2240671
10 2014-01-06     B -0.44566197 -0.2215949
11 2014-01-02     C  1.22408180  1.2240818
12 2014-01-03     C  0.35981383  1.5838956
13 2014-01-04     C  0.40077145  1.9846671
14 2014-01-05     C  0.11068272  2.0953498
15 2014-01-06     C -0.55584113  1.5395087

Output with this line of code:

> pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
> pdf
Source: local data frame [15 x 4]
Groups: group

        dates group       sales         cs
 1 2014-01-02     A -0.56047565 -0.5604756
 2 2014-01-03     A -0.23017749 -0.7906531
 3 2014-01-04     A  1.55870831  0.7680552
 4 2014-01-05     A  0.07050839  0.8385636
 5 2014-01-06     A  0.12928774  0.9678513
 6 2014-01-02     B  1.71506499  1.7150650
 7 2014-01-03     B  0.46091621  2.1759812
 8 2014-01-04     B -1.26506123  0.9109200
 9 2014-01-05     B -0.68685285  0.2240671
10 2014-01-06     B -0.44566197 -0.2215949
11 2014-01-02     C  1.22408180  1.2240818
12 2014-01-03     C  0.35981383  1.5838956
13 2014-01-04     C  0.40077145  1.9846671
14 2014-01-05     C  0.11068272  2.0953498
15 2014-01-06     C -0.55584113  1.5395087

How can I use cumsum within a group in Pandas?

You can call transform and pass the cumsum function to add that column to your df:

In [156]:
df['cumsum'] = df.groupby('id')['val'].transform(pd.Series.cumsum)
df

Out[156]:
  id   stuff  val  cumsum
0  A      12    1       1
1  B   23232    2       2
2  A      13   -3      -2
3  C    1234    1       1
4  D    3235    5       5
5  B    3236    6       8
6  C  732323   -2      -1

With respect to your error: you're passing the name of the column as a list to cumsum, which is meaningless; cumsum should be called on the grouped column itself, without any column argument.

So this works:

In [159]:
df.groupby('id')['val'].cumsum()

Out[159]:
0    1
1    2
2   -2
3    1
4    5
5    8
6   -1
dtype: int64

Calculate cumulative sum (cumsum) by group

df$csum <- ave(df$value, df$id, FUN=cumsum)

ave is the "go-to" function if you want a by-group vector of the same length as an existing vector, where the result can be computed from each group's sub-vector alone. If you need by-group processing based on multiple "parallel" columns, the base-R strategy is do.call(rbind, by(dfrm, grp, FUN)).
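As a quick sketch of that second pattern (the data frame and column names here are made up for illustration, not taken from the question):

# toy data where per-group processing needs more than one column
dfrm <- data.frame(id     = rep(c("a", "b"), each = 3),
                   value  = 1:6,
                   weight = c(1, 2, 1, 3, 1, 2))

# by() hands each group's sub-data-frame to FUN; rbind the pieces back together
res <- do.call(rbind, by(dfrm, dfrm$id, function(d) {
  d$csum <- cumsum(d$value * d$weight)  # cumsum of a quantity built from two columns
  d
}))
res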

Cumulative sum with `all` or `any` by group

You're not actually doing a cumsum; nothing needs to be summed. You are looking for the row number within the group.

Here are a couple ways with dplyr:

df %>%
  group_by(group) %>%
  mutate(
    result1 = row_number() * any(y %% 3 == 0),
    result2 = case_when(
      any(y %% 3 == 0) ~ row_number(),
      TRUE ~ 0L
    )
  )
# # A tibble: 12 × 4
# # Groups:   group [6]
#    group     y result1 result2
#    <int> <int>   <int>   <int>
#  1     1     1       0       0
#  2     1     2       0       0
#  3     2     3       1       1
#  4     2     4       2       2
#  5     3     5       1       1
#  6     3     6       2       2
#  7     4     7       0       0
#  8     4     8       0       0
#  9     5     9       1       1
# 10     5    10       2       2
# 11     6    11       1       1
# 12     6    12       2       2

Pandas: Cumulative sum within group with two conditions

You can use .where() on the conditions x < 1 and x >= 1 to temporarily set value_1 to 0 where the condition fails, and then take the groupby cumulative sum, as follows.

Grouping by y is handled by .groupby, while the condition on x is handled by .where(), detailed below:

.where() keeps the column values where the condition is true and replaces them (here with 0) where it is false. So in the first line, rows with x < 1 keep their value_1, and those values feed into the subsequent cumsum; rows where x < 1 is false have value_1 masked to 0, and adding 0 during the cumsum has the same effect as leaving those values out of cumsum_1.

The second line accumulates value_1 into cumsum_2 using the opposite condition, x >= 1. Together, the two lines allocate value_1 to cumsum_1 or cumsum_2 according to whether x < 1 or x >= 1.

(Thanks for the suggestion of @tdy to simplify the codes)

df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()

Result:

print(df)

      x  y  value_1  cumsum_1  cumsum_2
0  0.10  1       12        12         0
1  1.20  1       10        12        10
2  0.25  1        7        19        10
3  1.00  2        3         0         3
4  0.72  2        5         5         3
5  1.50  2       10         5        13

cumsum by group

library(data.table)

data <- data.table(group1 = c('A', 'A', 'A', 'B', 'B'), sum = c(1, 2, 4, 3, 7))

data[, list(cumsum = cumsum(sum)), by = list(group1)]
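If you want to keep all of the original columns and just add the running total, a small variant of the same example (a sketch, not part of the original answer) adds the column by reference with :=

library(data.table)

data <- data.table(group1 = c('A', 'A', 'A', 'B', 'B'), sum = c(1, 2, 4, 3, 7))

# add the per-group cumulative sum as a new column, modifying data in place
data[, cs := cumsum(sum), by = group1]
data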

Group by cumulative sums with conditions

Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:

library(data.table)
library(zoo)

rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
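If the end goal is, say, a cumulative sum within those runs, the run ids above can be used directly as a grouping vector. A minimal sketch, assuming a numeric column value that is not in the original question, and a guessed ID vector shaped so that it reproduces the run ids shown above:

library(data.table)
library(zoo)

# guessed data in the shape of the question's df; 'value' is an assumed column
df <- data.frame(ID    = c(NA, 1, NA, NA, NA, 2, 3, NA, 4, NA, NA),
                 value = 1:11)

grp <- rleid(na.locf0(df$ID))                # 1 2 2 2 2 3 4 4 5 5 5
df$csum <- ave(df$value, grp, FUN = cumsum)  # cumulative sum within each run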

