R cumsum per group in dplyr
After some fiddling around I seem to have found it:
pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))
Output with the for loop from the question:
> pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
> for(grp in unique(df$group)){
+ subs = filter(df, group == grp) %>% arrange(dates)
+ pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
+ }
> pdf
dates group sales cs
1 2014-01-02 A -0.56047565 -0.5604756
2 2014-01-03 A -0.23017749 -0.7906531
3 2014-01-04 A 1.55870831 0.7680552
4 2014-01-05 A 0.07050839 0.8385636
5 2014-01-06 A 0.12928774 0.9678513
6 2014-01-02 B 1.71506499 1.7150650
7 2014-01-03 B 0.46091621 2.1759812
8 2014-01-04 B -1.26506123 0.9109200
9 2014-01-05 B -0.68685285 0.2240671
10 2014-01-06 B -0.44566197 -0.2215949
11 2014-01-02 C 1.22408180 1.2240818
12 2014-01-03 C 0.35981383 1.5838956
13 2014-01-04 C 0.40077145 1.9846671
14 2014-01-05 C 0.11068272 2.0953498
15 2014-01-06 C -0.55584113 1.5395087
Output with this line of code:
> pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
> pdf
Source: local data frame [15 x 4]
Groups: group
dates group sales cs
1 2014-01-02 A -0.56047565 -0.5604756
2 2014-01-03 A -0.23017749 -0.7906531
3 2014-01-04 A 1.55870831 0.7680552
4 2014-01-05 A 0.07050839 0.8385636
5 2014-01-06 A 0.12928774 0.9678513
6 2014-01-02 B 1.71506499 1.7150650
7 2014-01-03 B 0.46091621 2.1759812
8 2014-01-04 B -1.26506123 0.9109200
9 2014-01-05 B -0.68685285 0.2240671
10 2014-01-06 B -0.44566197 -0.2215949
11 2014-01-02 C 1.22408180 1.2240818
12 2014-01-03 C 0.35981383 1.5838956
13 2014-01-04 C 0.40077145 1.9846671
14 2014-01-05 C 0.11068272 2.0953498
15 2014-01-06 C -0.55584113 1.5395087
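For readers coming from pandas, the same idea (sort by date, then take a running total within each group) might look roughly like the sketch below. The data frame is hypothetical and deliberately out of date order within group A; only the column names follow the R example above.

```python
import pandas as pd

# Hypothetical data (not from the question); group "A" is out of date order.
df = pd.DataFrame({
    "dates": pd.to_datetime(["2014-01-03", "2014-01-02", "2014-01-02", "2014-01-03"]),
    "group": ["A", "A", "B", "B"],
    "sales": [2.0, 1.0, 10.0, 5.0],
})

# Sort first so the running total follows date order inside each group,
# then accumulate sales per group.
pdf = df.sort_values(["group", "dates"]).reset_index(drop=True)
pdf["cs"] = pdf.groupby("group")["sales"].cumsum()
```

Sorting before accumulating matters here because a cumulative sum depends on row order, which is the role arrange(dates) plays in the dplyr version.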
How can I use cumsum within a group in Pandas?
You can call transform and pass the cumsum function to add that column to your df:
In [156]:
df['cumsum'] = df.groupby('id')['val'].transform(pd.Series.cumsum)
df
Out[156]:
id stuff val cumsum
0 A 12 1 1
1 B 23232 2 2
2 A 13 -3 -2
3 C 1234 1 1
4 D 3235 5 5
5 B 3236 6 8
6 C 732323 -2 -1
With respect to your error: cumsum does not take a column name as an argument, so passing the name of the column as a list is meaningless. Select the column from the groupby object first and then call cumsum on the result.
So this works:
In [159]:
df.groupby('id')['val'].cumsum()
Out[159]:
0 1
1 2
2 -2
3 1
4 5
5 8
6 -1
dtype: int64
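As a quick check that the two spellings agree, here is a small sketch that rebuilds the frame shown above and compares them. The variable names via_transform and via_method are illustrative only.

```python
import pandas as pd

# Same values as the example frame shown above.
df = pd.DataFrame({
    "id": ["A", "B", "A", "C", "D", "B", "C"],
    "stuff": [12, 23232, 13, 1234, 3235, 3236, 732323],
    "val": [1, 2, -3, 1, 5, 6, -2],
})

grouped = df.groupby("id")["val"]

# Both calls return a Series aligned with df's original index.
via_transform = grouped.transform("cumsum")
via_method = grouped.cumsum()

df["cumsum"] = via_method
```

Because the result keeps the original index, it can be assigned straight back as a new column without any re-sorting.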
Calculate cumulative sum (cumsum) by group
df$csum <- ave(df$value, df$id, FUN=cumsum)
ave is the "go-to" function if you want a by-group vector of equal length to an existing vector and it can be computed from those sub-vectors alone. If you need by-group processing based on multiple "parallel" values, the base strategy is do.call(rbind, by(dfrm, grp, FUN)).
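To make "a by-group vector of equal length" concrete, here is a minimal plain-Python sketch of what ave with FUN=cumsum computes. The helper name grouped_cumsum and the sample data are hypothetical, and this is not how ave is implemented internally.

```python
from collections import defaultdict

def grouped_cumsum(values, groups):
    """Running total of values within each group, kept in original row order."""
    totals = defaultdict(int)  # one running total per group label
    out = []
    for value, group in zip(values, groups):
        totals[group] += value
        out.append(totals[group])
    return out

# Hypothetical data: the result has one entry per input row, like ave().
csum = grouped_cumsum([1, 2, 3, 4, 5], ["a", "b", "a", "b", "a"])
```

The output has the same length and order as the input, so it can be stored as a new column next to the original values.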
Cumulative sum with `all` or `any` by group
You're not actually doing a cumsum; nothing needs to be summed. You are looking for the row number within the group.
Here are a couple of ways with dplyr:
df %>%
  group_by(group) %>%
  mutate(
    result1 = row_number() * any(y %% 3 == 0),
    result2 = case_when(
      any(y %% 3 == 0) ~ row_number(),
      TRUE ~ 0L
    )
  )
# # A tibble: 12 × 4
# # Groups: group [6]
# group y result1 result2
# <int> <int> <int> <int>
# 1 1 1 0 0
# 2 1 2 0 0
# 3 2 3 1 1
# 4 2 4 2 2
# 5 3 5 1 1
# 6 3 6 2 2
# 7 4 7 0 0
# 8 4 8 0 0
# 9 5 9 1 1
# 10 5 10 2 2
# 11 6 11 1 1
# 12 6 12 2 2
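The same "row number within the group, zeroed when no value in the group is a multiple of 3" logic can be sketched in pandas as below. The input is reconstructed from the printed tibble; the column name result and the intermediate variable names are illustrative assumptions.

```python
import pandas as pd

# Input reconstructed from the tibble above: six groups of two rows, y = 1..12.
df = pd.DataFrame({
    "group": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "y": list(range(1, 13)),
})

# True on every row of a group that contains at least one multiple of 3.
is_multiple = df["y"] % 3 == 0
has_multiple = is_multiple.groupby(df["group"]).transform("any")

# cumcount() is 0-based, so add 1 to match dplyr's row_number().
row_in_group = df.groupby("group").cumcount() + 1

# Keep the row number where the group qualifies, otherwise 0.
df["result"] = row_in_group.where(has_multiple, 0)
```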
Pandas: Cumulative sum within group with two conditions
You can use .where() with the conditions x < 1 and x >= 1 to temporarily set the values of value_1 to 0 where the condition does not hold, and then apply a groupby cumsum.
The second condition is handled by .groupby, while the first condition is handled by .where(), as detailed below:
.where() keeps the column values where the condition is true and replaces them (with 0 in this case) where the condition is false. So for the first condition, x < 1, value_1 keeps its values and feeds them to the subsequent cumsum step. For rows where x < 1 is False, value_1 is masked to 0. Passing these zeros to cumsum has the same effect as leaving the original values out of the accumulation into column cumsum_1.
The second line of code accumulates value_1 into column cumsum_2 under the opposite condition, x >= 1. In effect, the two lines allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.
(Thanks to @tdy for the suggestion that simplified the code.)
df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
Result:
print(df)
x y value_1 cumsum_1 cumsum_2
0 0.10 1 12 12 0
1 1.20 1 10 12 10
2 0.25 1 7 19 10
3 1.00 2 3 0 3
4 0.72 2 5 5 3
5 1.50 2 10 5 13
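To see the masking step on its own, the sketch below rebuilds the frame from the printed result and exposes the intermediate masked Series. The names masked and plain are illustrative only.

```python
import pandas as pd

# Values taken from the printed result above.
df = pd.DataFrame({
    "x": [0.10, 1.20, 0.25, 1.00, 0.72, 1.50],
    "y": [1, 1, 1, 2, 2, 2],
    "value_1": [12, 10, 7, 3, 5, 10],
})

# Step 1: keep value_1 where x < 1, replace it with 0 elsewhere.
masked = df["value_1"].where(df["x"] < 1, 0)

# Step 2: running total of the masked values within each y group.
cumsum_1 = masked.groupby(df["y"]).cumsum()
cumsum_2 = df["value_1"].where(df["x"] >= 1, 0).groupby(df["y"]).cumsum()

# The two conditions are complementary, so the pieces add up to the
# unconditional grouped running total.
plain = df.groupby("y")["value_1"].cumsum()
```

Since every row satisfies exactly one of the two conditions, cumsum_1 + cumsum_2 equals the plain grouped cumsum, which is a handy sanity check.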
cumsum by group
library(data.table)
data <- data.table(group1=c('A','A','A','B','B'),sum=c(1,2,4,3,7))
data[,list(cumsum = cumsum(sum)),by=list(group1)]
Group by cumulative sums with conditions
Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:
library(data.table)
library(zoo)
rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
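The two steps (carry the last non-missing value forward, then number consecutive runs) can be sketched in plain Python as follows. The question's df$ID is not shown here, so the input below is hypothetical, chosen only so that it produces the same ids as the output above; fill_forward and run_ids are illustrative helpers, not the zoo or data.table implementations.

```python
def fill_forward(values):
    """Replace each None with the most recent non-None value (leading Nones stay)."""
    out, last = [], None
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out

def run_ids(values):
    """Number consecutive runs of equal values, starting at 1."""
    ids, current = [], 0
    previous = object()  # sentinel that compares unequal to any real value
    for value in values:
        if value != previous:
            current += 1
            previous = value
        ids.append(current)
    return ids

# Hypothetical ID column with gaps (None stands in for NA).
ids = ["a", "b", None, None, None, "c", "d", None, "e", None, None]
groups = run_ids(fill_forward(ids))
```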