Calculating Cumulative Sum For Each Row

You want cumsum()

df <- within(df, acc_sum <- cumsum(count))
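
For comparison, the same row-wise running total is also a one-liner in pandas. This is just a minimal sketch with made-up data; the count/acc_sum column names simply mirror the R snippet above:

import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3, 4]})
df["acc_sum"] = df["count"].cumsum()   # running total of count
print(df)
#    count  acc_sum
# 0      1        1
# 1      2        3
# 2      3        6
# 3      4       10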

Cumulative sum for each row of data for the same ID

Here is an approach that I think meets your expectations.

First, group by id and calculate the sum of value for each id via summarise().

You can then add your cumulative sum column with mutate(). Based on your comments, I included an ifelse() so that rows where id is NA do not get a cumulative sum but are given NA instead.

Finally, to combine the cumulative sums with your original dataset, join the two tables (here with left_join()).

library(tidyverse)

df %>%
  arrange(id) %>%
  group_by(id) %>%
  summarise(sum = sum(value)) %>%
  mutate(cs = ifelse(is.na(id), NA, cumsum(sum))) %>%
  left_join(df)

Output

# A tibble: 7 x 5
     id   sum    cs var   value
  <dbl> <dbl> <dbl> <fct> <dbl>
1     1     2     2 a         1
2     1     2     2 a         1
3     2     1     3 b         0
4     2     1     3 b         1
5     2     1     3 b         0
6     5     0     3 e         0
7    NA     1    NA f         1

Cumulative sum on preceding rows in the same column - R

test %>%
  mutate(
    # running total of loss_to_layer over the preceding rows (0 for the first row)
    cumsum_1 = cumsum(lag(loss_to_layer, default = 0)),
    # each row gets at most the remaining capacity of the 127,000,000 layer
    new_col = pmin(loss_to_layer, 127000000 - cumsum_1),
    # if the layer is already exhausted, the remainder is negative, so floor at zero
    new_col = ifelse(new_col < 0, 0, new_col)
  ) %>%
  select(-cumsum_1)
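
To see what this does on a toy example, here is a hedged pandas sketch of the same logic (the loss_to_layer name and the 127,000,000 limit come from the snippet above; the sample losses are made up): each row receives at most whatever is left of the layer after the preceding rows, and never a negative amount.

import numpy as np
import pandas as pd

test = pd.DataFrame({"loss_to_layer": [100_000_000, 50_000_000, 30_000_000]})
limit = 127_000_000

# cumulative sum of the preceding rows (0 for the first row), like cumsum(lag(...)) in R
cumsum_prior = test["loss_to_layer"].cumsum().shift(fill_value=0)
# pmin(...) against the remaining capacity, then floor at zero
test["new_col"] = np.minimum(test["loss_to_layer"], limit - cumsum_prior).clip(0)

print(test["new_col"].tolist())   # [100000000, 27000000, 0] -- the layer is exhausted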

Measure to calculate Running sum of n rows for each group in PowerBI desktop

You need a simple running sum:

RunningSum =
VAR currRow = SELECTEDVALUE ( 'ValByGroup'[Date] )
RETURN
    CALCULATE (
        SUM ( 'ValByGroup'[Value] ),
        FILTER (
            ALLEXCEPT ( 'ValByGroup', 'ValByGroup'[Group] ),
            'ValByGroup'[Date] <= currRow && 'ValByGroup'[Date] > currRow - 4
        )
    )

EDIT:

We can add an additional index column to our table. Here is an example:

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMjIwMtI1MAQiJR2lRCA2UIrVwSJsiCpshF3YBLshpthVg8xOwm42FmET7MKm2IWNEcKxAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Date = _t, Group = _t, Value = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Date", type date}, {"Group", type text}, {"Value", Int64.Type}}),
#"Sorted Rows" = Table.Sort(#"Changed Type",{{"Date", Order.Ascending}}),
#"Grouped Rows" = Table.Group(#"Sorted Rows", {"Group"}, {{"Count", each _, type table [Date=nullable date, Group=nullable text, Value=nullable number]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each Table.AddIndexColumn([Count],"index",1)),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Date", "Group", "Value", "index"}, {"Custom.Date", "Custom.Group", "Custom.Value", "Custom.index"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Custom",{"Count", "Group"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom.Date", "Date"}, {"Custom.Group", "Group"}, {"Custom.Value", "Value"}, {"Custom.index", "index"}})
in
#"Renamed Columns"

Then, instead of 'ValByGroup'[Date], we can use the new index column in the measure (i.e., apply the same <= / > comparisons to [index] rather than [Date]).

How to get cumulative sum

select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as sum
from @t t1
inner join @t t2 on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id

SQL Fiddle example

Output

| ID | SOMENUMT | SUM |
-----------------------
|  1 |       10 |  10 |
|  2 |       12 |  22 |
|  3 |        3 |  25 |
|  4 |       15 |  40 |
|  5 |       23 |  63 |

Edit: this is a generalized solution that will work across most DB platforms. When there is a better solution available for your specific platform (e.g., Gareth's), use it!

Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)

Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.

For your case, treating the DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:

import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y

print(df)

Output:

    quantity  outcome
0         -1     -1.0
1         -1     -2.0
2         -1     -3.0
3         -1     -4.0
4         15     11.0
5         -1     -1.0
6         -1     -2.0
7         -1     -3.0
8         -1     -4.0
9          5      1.0
10        -1     -1.0
11        15     14.0
12        -1     -1.0
13        -1     -2.0
14        -1     -3.0

If you still need more speed, have a look at numba, as in jezrael's answer.

Edit - Performance test

I got curious about performance, so I put together the module below with all 3 approaches.

I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.

"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.

Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np


def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""

    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome']
            if previous_outcome > 0:
                previous_outcome = 0

            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

    return df


def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""

    x = np.array(df.quantity)
    y = np.zeros(x.size)

    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0

    df['outcome'] = y

    return df


@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result


def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [https://stackoverflow.com/a/69750009/5069105]
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df


def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
        np.random.choice(
            (-1, 15),
            size=size,
            p=[0.9, 0.1]
        ),
        columns=['quantity'])
    return df


# Make sure all tests lead to the same result
df = pd.DataFrame([-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, +15, -1, -1, -1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))

Running for a somewhat small array, size = 20_000, leads to:

In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)

In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.

And for larger arrays, size = 200_000, we get:

In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)

In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P

In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This makes numba again 25+ times faster than numpy arrays for this example, and confirms that you should avoid using iterrows() at all costs for anything more than a couple of hundred rows.

Pyspark Cumulative sum within Partition for moving last 2 (N) rows

What you want is to sum the last 2 rows (current row included), so simply specify the rowsBetween like this:

from pyspark.sql import functions as F, Window

w = Window.partitionBy('Period').orderBy('Month').rowsBetween(-1, Window.currentRow)

df = df.withColumn('CustLast2', F.sum('Cust').over(w))

You inverted the lower and upper bounds of the window frame in your attempt.
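
For reference, a minimal end-to-end sketch of the pattern (the Period/Month/Cust column names follow the question; the sample values are made up):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Made-up sample data: one period, three months
df = spark.createDataFrame(
    [("2021", 1, 10), ("2021", 2, 20), ("2021", 3, 30)],
    ["Period", "Month", "Cust"],
)

# Frame = previous row + current row within each Period, ordered by Month
w = Window.partitionBy("Period").orderBy("Month").rowsBetween(-1, Window.currentRow)

df.withColumn("CustLast2", F.sum("Cust").over(w)).show()
# CustLast2 comes out as 10, 30, 50 (10, 10+20, 20+30)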


