Calculating cumulative sum for each row
You want cumsum():
df <- within(df, acc_sum <- cumsum(count))
Cumulative sum for each row of data for the same ID
Here is an approach that I think meets your expectations. Group by id and calculate the sum of value for each id via summarise. You can then add your cumulative sum column with mutate. Based on your comments, I included an ifelse so that if id is NA, it is not given a cumulative sum but NA instead. Finally, to combine your cumulative sum data with your original dataset, join the two tables.
library(tidyverse)
df %>%
arrange(id) %>%
group_by(id) %>%
summarise(sum = sum(value)) %>%
mutate(cs=ifelse(is.na(id), NA, cumsum(sum))) %>%
left_join(df)
Output
# A tibble: 7 x 5
     id   sum    cs var   value
  <dbl> <dbl> <dbl> <fct> <dbl>
1     1     2     2 a         1
2     1     2     2 a         1
3     2     1     3 b         0
4     2     1     3 b         1
5     2     1     3 b         0
6     5     0     3 e         0
7    NA     1    NA f         1
Cumulative sum on preceding rows in the same column - R
test %>%
  mutate(
    # running total of the *preceding* rows: lag shifts by one,
    # so the current row is excluded from its own cumulative sum
    cumsum_1 = cumsum(lag(loss_to_layer, default = 0)),
    # cap each loss at whatever remains of the 127,000,000 layer
    new_col = pmin(loss_to_layer, 127000000 - cumsum_1),
    # once the layer is exhausted, the result is 0 rather than negative
    new_col = ifelse(new_col < 0, 0, new_col)
  ) %>%
  select(-cumsum_1)
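For illustration, the same capped-layer logic can be sketched in pandas. The column name loss_to_layer and the 127,000,000 limit come from the answer above; the sample values are made up:

```python
import numpy as np
import pandas as pd

# cumsum of the *preceding* rows (shift excludes the current row),
# then cap each loss at the remaining layer capacity, floored at 0
test = pd.DataFrame({"loss_to_layer": [60_000_000, 50_000_000, 40_000_000]})
prior = test["loss_to_layer"].cumsum().shift(1, fill_value=0)
test["new_col"] = np.minimum(test["loss_to_layer"], 127_000_000 - prior).clip(lower=0)
print(test["new_col"].tolist())  # [60000000, 50000000, 17000000]
```

The third row gets 17,000,000 because only that much of the layer remains after the first two losses.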
Measure to calculate Running sum of n rows for each group in PowerBI desktop
You need a simple running sum:
RunningSum =
VAR currRow = SELECTEDVALUE ( 'ValByGroup'[Date] )
RETURN
    CALCULATE (
        SUM ( 'ValByGroup'[Value] ),
        FILTER (
            ALLEXCEPT ( 'ValByGroup', 'ValByGroup'[Group] ),
            'ValByGroup'[Date] <= currRow && 'ValByGroup'[Date] > currRow - 4
        )
    )
EDIT:
We can add an additional index column to our table. Here is an example.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMjIwMtI1MAQiJR2lRCA2UIrVwSJsiCpshF3YBLshpthVg8xOwm42FmET7MKm2IWNEcKxAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Date = _t, Group = _t, Value = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Date", type date}, {"Group", type text}, {"Value", Int64.Type}}),
#"Sorted Rows" = Table.Sort(#"Changed Type",{{"Date", Order.Ascending}}),
#"Grouped Rows" = Table.Group(#"Sorted Rows", {"Group"}, {{"Count", each _, type table [Date=nullable date, Group=nullable text, Value=nullable number]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddIndexColumn([Count],"index",1)),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Date", "Group", "Value", "index"}, {"Custom.Date", "Custom.Group", "Custom.Value", "Custom.index"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Custom",{"Count", "Group"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom.Date", "Date"}, {"Custom.Group", "Group"}, {"Custom.Value", "Value"}, {"Custom.index", "index"}})
in
#"Renamed Columns"
Then, instead of [Date], we can use the index column in our measure.
How to get cumulative sum
select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as sum
from @t t1
inner join @t t2 on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id
SQL Fiddle example
Output
| ID | SOMENUMT | SUM |
-----------------------
| 1 | 10 | 10 |
| 2 | 12 | 22 |
| 3 | 3 | 25 |
| 4 | 15 | 40 |
| 5 | 23 | 63 |
Edit: this is a generalized solution that will work across most database platforms. When a better solution is available for your specific platform (e.g., Gareth's), use it!
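As a sketch of that platform-specific alternative: on engines with window-function support (PostgreSQL, SQL Server 2012+, SQLite 3.25+, MySQL 8+), a windowed SUM gives the same running total without the O(n²) self-join. The demo below runs it against the same sample data via Python's built-in sqlite3; the table name t stands in for the @t table variable above:

```python
import sqlite3

# Windowed running total: SUM(...) OVER (ORDER BY id) accumulates
# each row plus everything before it, matching the self-join output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, SomeNumt INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 12), (3, 3), (4, 15), (5, 23)])
rows = conn.execute(
    "SELECT id, SomeNumt, SUM(SomeNumt) OVER (ORDER BY id) AS running_sum "
    "FROM t ORDER BY id"
).fetchall()
print(rows)  # [(1, 10, 10), (2, 12, 22), (3, 3, 25), (4, 15, 40), (5, 23, 63)]
```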
Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)
Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas. For your case, treating your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:
import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    # carry the total over only while it is negative; reset it otherwise
    total = total if total < 0 else 0
df['outcome'] = y
print(df)
Out :
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
If you still need more speed, I suggest having a look at numba as per jezrael's answer.
Edit - Performance test
I got curious about performance, so I wrote this module with all 3 approaches.
I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np
def pditerrows(df):
"""Iterate over DataFrame using `iterrows`"""
for index, row in df.iterrows():
if index == 0:
df.loc[index, 'outcome'] = df.loc[index, 'quantity']
else:
previous_outcome = df.loc[index-1, 'outcome']
if previous_outcome > 0:
previous_outcome = 0
df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
return df
def nparray(df):
"""Convert DataFrame column to `numpy` arrays."""
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
total += xi
y[i] = total
total = total if total < 0 else 0
df['outcome'] = y
return df
@njit
def f(x, lim):
result = np.empty(len(x))
result[0] = x[0]
for i, j in enumerate(x[1:], 1):
previous_outcome = result[i-1]
if previous_outcome > lim:
previous_outcome = 0
result[i] = previous_outcome + x[i]
return result
def numbaloop(df):
"""Convert DataFrame to `numpy` arrays and loop using `numba`.
See [https://stackoverflow.com/a/69750009/5069105]
"""
df['outcome'] = f(df.quantity.to_numpy(), 0)
return df
def create_df(size):
"""Create a DataFrame filed with -1's and 15's, with 90% of
the entries equal to -1 and 10% equal to 15, randomly
placed in the array.
"""
df = pd.DataFrame(
np.random.choice(
(-1, 15),
size=size,
p=[0.9, 0.1]
),
columns=['quantity'])
return df
# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running for a somewhat small array, size = 20_000, leads to:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.
And for larger arrays, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Making numba again 25+ times faster than numpy arrays for this example, and confirming that you should avoid iterrows() at all costs for anything more than a couple of hundred rows.
Pyspark Cumulative sum within Partition for moving last 2 (N) rows
What you want is to sum the last 2 rows (current row included), so simply specify the rowsBetween like this:
from pyspark.sql import functions as F, Window
w = Window.partitionBy('Period').orderBy('Month').rowsBetween(-1, Window.currentRow)
df = df.withColumn('CustLast2', F.sum('Cust').over(w))
You inverted the lower and upper bounds of the window frame in your attempt.
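For illustration, the same frame (current row plus the one before it, per Period partition, ordered by Month) can be sketched in plain pandas. The column names Period, Month, and Cust follow the question; the sample values are made up:

```python
import pandas as pd

# Rolling sum over a 2-row window within each Period group:
# window=2 covers the current row and the previous one,
# min_periods=1 lets the first row of each group use itself alone.
df = pd.DataFrame({
    "Period": [1, 1, 1, 2, 2],
    "Month":  [1, 2, 3, 1, 2],
    "Cust":   [10, 20, 30, 5, 7],
})
df = df.sort_values(["Period", "Month"])
df["CustLast2"] = (
    df.groupby("Period")["Cust"]
      .rolling(window=2, min_periods=1)
      .sum()
      .reset_index(level=0, drop=True)  # drop the group level added by groupby
)
print(df["CustLast2"].tolist())  # [10.0, 30.0, 50.0, 5.0, 12.0]
```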