Calculate cumulative sum (cumsum) by group
df$csum <- ave(df$value, df$id, FUN=cumsum)
ave is the "go-to" function if you want a by-group vector of equal length to an existing vector and it can be computed from those sub-vectors alone. If you need by-group processing based on multiple "parallel" values, the base strategy is do.call(rbind, by(dfrm, grp, FUN)).
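For readers working in pandas, the same idea (a by-group vector of equal length to the original) can be sketched with groupby followed by cumsum. The data frame and its id and value columns below are hypothetical stand-ins that mirror the R call above; this is an illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: two interleaved groups.
df = pd.DataFrame({
    "id": ["a", "b", "a", "b", "a"],
    "value": [1, 10, 2, 20, 3],
})

# Running total within each id. The result has one entry per row,
# in the original row order, much like ave(value, id, FUN=cumsum) in R.
df["csum"] = df.groupby("id")["value"].cumsum()

print(df)
```

Because the result keeps the original index, it can be assigned straight back as a new column without sorting the frame by group first.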
Pandas: Cumulative sum within group with two conditions
You can use .where() with the condition x < 1 or x >= 1 to temporarily set the values of value_1 to 0 wherever the condition is false, and then apply a groupby cumsum, as shown in the two lines of code below.
The first condition (on x) is handled by .where(), and the second condition (accumulating within each group of y) is handled by .groupby, as detailed below:
.where() keeps the column values where the condition is true and replaces them (with 0 in this case) where the condition is false. So for the first condition, x < 1, value_1 keeps its values in the rows where x < 1, and these are fed to the subsequent cumsum step. In rows where x < 1 is False, value_1 is masked to 0. Adding these zeros in cumsum has the same effect as leaving the original values of value_1 out of the accumulation into column cumsum_1.
The second line of code accumulates value_1 into column cumsum_2 under the opposite condition, x >= 1. Together, these two lines allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.
(Thanks to @tdy for the suggestion to simplify the code.)
df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
Result:
print(df)
      x  y  value_1  cumsum_1  cumsum_2
0  0.10  1       12        12         0
1  1.20  1       10        12        10
2  0.25  1        7        19        10
3  1.00  2        3         0         3
4  0.72  2        5         5         3
5  1.50  2       10         5        13
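To try the two lines end to end, the sketch below rebuilds the input from the x, y and value_1 columns printed in the result above (the original question's data is assumed to match) and keeps the masked intermediate in a variable named masked, which is introduced here only for illustration.

```python
import pandas as pd

# Input reconstructed from the printed result above.
df = pd.DataFrame({
    "x": [0.10, 1.20, 0.25, 1.00, 0.72, 1.50],
    "y": [1, 1, 1, 2, 2, 2],
    "value_1": [12, 10, 7, 3, 5, 10],
})

# Intermediate step: rows with x >= 1 contribute 0 instead of value_1.
masked = df["value_1"].where(df["x"] < 1, 0)

# Accumulate the masked values within each group of y.
df["cumsum_1"] = masked.groupby(df["y"]).cumsum()

# Same pattern with the opposite condition.
df["cumsum_2"] = df["value_1"].where(df["x"] >= 1, 0).groupby(df["y"]).cumsum()

print(df)
```

Inspecting masked before the cumsum step is a quick way to confirm which rows are being zeroed out.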
How to calculate cumulative sum (reversed) of a Python DataFrame within given groups?
You can try a Series groupby on the reversed column:
df['new'] = df.loc[::-1, 'Chi'].groupby(df['Basin']).cumsum()
df
Out[858]:
Basin (n=17 columns) Chi new
0 13.0 ... 4 14
1 13.0 ... 8 10
2 13.0 ... 2 2
3 21.0 ... 4 10
4 21.0 ... 6 6
5 38.0 ... 1 14
6 38.0 ... 7 13
7 38.0 ... 2 6
8 38.0 ... 4 4
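For a self-contained check, the sketch below uses only the Basin and Chi columns from the output above (the other columns are omitted) and reverses the rows with iloc before grouping, which is a slight variation on the line above rather than the exact same call.

```python
import pandas as pd

df = pd.DataFrame({
    "Basin": [13.0, 13.0, 13.0, 21.0, 21.0, 38.0, 38.0, 38.0, 38.0],
    "Chi": [4, 8, 2, 4, 6, 1, 7, 2, 4],
})

# Reverse the rows, accumulate within each Basin, then let index
# alignment put each running total back on its original row.
df["new"] = df.iloc[::-1].groupby("Basin")["Chi"].cumsum()

print(df)
```

The assignment aligns on the index, so the reversed row order of the intermediate result does not affect the order of df.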
Calculating cumulative sum for multiple columns in R
If I understand what you are doing, you're taking the sum for each month, then doing the cumulative sums across the months. This is usually pretty easy in dplyr.
library(dplyr)
df %>%
  group_by(Year, Month, Group, SubGroup) %>%
  summarize(
    V1_sum = sum(V1),
    V2_sum = sum(V2)
  ) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1_cumsum = cumsum(V1_sum),
    V2_cumsum = cumsum(V2_sum)
  )
# A tibble: 6 x 8
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 Feb A a 50 0 50 0
# 2 2020 Feb B a 10 1 10 1
# 3 2020 Feb B b 60 6 60 6
# 4 2020 Jan A a 20 1 70 1
# 5 2020 Jan A b 20 2 20 2
# 6 2020 Jan B b 20 2 80 8
But you'll notice that the monthly cumulative sums are backwards (i.e. January comes after February), because by default group_by orders character groups alphabetically. Also, you don't see the empty values because dplyr doesn't fill them in.
To fix the order of the months, you can either make your months numeric (convert to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize. With drop = FALSE, aggregate includes all combinations of the grouping factors and fills the missing ones with NA, which you can then replace with 0 using tidyr::replace_na, for example.
library(dplyr)
library(tidyr)
df <- data.frame("Year"=2020,
                 "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 "Group"=c("A","A","A","B","A","B","B","B"),
                 "SubGroup"=c("a","a","b","b","a","b","a","b"),
                 "V1"=c(10,10,20,20,50,50,10,10),
                 "V2"=c(0,1,2,2,0,5,1,1))
df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)
# Get monthly sums
df1 <- with(df, aggregate(
  list(V1_sum = V1, V2_sum = V2),
  list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
  FUN = sum, drop = FALSE
))
df1 <- df1 %>%
  # Replace NA with 0
  mutate(
    V1_sum = replace_na(V1_sum, 0),
    V2_sum = replace_na(V2_sum, 0)
  ) %>%
  # Get cumulative sum across months
  group_by(Year, Group, SubGroup) %>%
  mutate(V1cumsum = cumsum(V1_sum),
         V2cumsum = cumsum(V2_sum)) %>%
  ungroup() %>%
  select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
This gives the same result as your example:
# # A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <dbl> <ord> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 20 1
# 2 2020 Feb A a 70 1
# 3 2020 Jan B a 0 0
# 4 2020 Feb B a 10 1
# 5 2020 Jan A b 20 2
# 6 2020 Feb A b 20 2
# 7 2020 Jan B b 20 2
# 8 2020 Feb B b 80 8
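The same three steps (monthly sums, filling in missing combinations with 0, cumulative sums in calendar order) can also be sketched in pandas. This is an analogue of the R answer, not a translation of it: reindexing over MultiIndex.from_product stands in for aggregate(..., drop = FALSE), and the names monthly, full and cum are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": 2020,
    "Month": ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "Group": ["A", "A", "A", "B", "A", "B", "B", "B"],
    "SubGroup": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "V1": [10, 10, 20, 20, 50, 50, 10, 10],
    "V2": [0, 1, 2, 2, 0, 5, 1, 1],
})

keys = ["Year", "Group", "SubGroup", "Month"]

# Step 1: monthly sums for the combinations that occur in the data.
monthly = df.groupby(keys)[["V1", "V2"]].sum()

# Step 2: every combination of the keys, with months in calendar
# order; combinations absent from the data are filled with 0.
full = pd.MultiIndex.from_product(
    [[2020], ["A", "B"], ["a", "b"], ["Jan", "Feb"]], names=keys
)
monthly = monthly.reindex(full, fill_value=0)

# Step 3: cumulative sum across months within Year/Group/SubGroup.
cum = monthly.groupby(level=["Year", "Group", "SubGroup"]).cumsum()

print(cum.reset_index())
```

Listing the months explicitly in from_product fixes their order, which plays the role of the ordered factor in the R code.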
cumsum by group
library(data.table)
data <- data.table(group1=c('A','A','A','B','B'),sum=c(1,2,4,3,7))
data[,list(cumsum = cumsum(sum)),by=list(group1)]
Group by cumulative sums with conditions
Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:
library(data.table)
library(zoo)
rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
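A rough pandas counterpart of this fill-then-number-the-runs idea is sketched below. The ids series is hypothetical (the question's data is not shown here), ffill plays the role of na.locf0, and comparing each value with its predecessor before a cumsum stands in for rleid.

```python
import pandas as pd

# Hypothetical ID column with gaps; not the data from the question.
ids = pd.Series([1.0, float("nan"), 2.0, float("nan"), float("nan"), 1.0])

# Carry the last seen ID forward into the gaps.
filled = ids.ffill()

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those flags numbers the runs 1, 2, 3, ...
run_id = (filled != filled.shift()).cumsum()

print(run_id.tolist())
```

Leading gaps would need extra handling in this sketch, since NaN never compares equal to itself and each leading NaN would start its own run.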