Cumsum as a New Column in an Existing Pandas Data

Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:

df['CUMSUM_C'] = df['SUM_C'].cumsum()

Result:

df
Out[34]:
   A  B  SUM_C  CUMSUM_C
0  1  1     10        10
1  1  2     20        30
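
For reference, here is a minimal self-contained sketch that reproduces the result above. The input values are taken from the printed output; how the original frame was built is an assumption.

```python
import pandas as pd

# Input frame rebuilt from the output shown above (construction is assumed)
df = pd.DataFrame({'A': [1, 1], 'B': [1, 2], 'SUM_C': [10, 20]})

# cumsum() returns a Series aligned on the same index, so it can be
# assigned directly as a new column
df['CUMSUM_C'] = df['SUM_C'].cumsum()
print(df)
```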

cumsum pandas create new column

You're close--you just need to call cumsum():

>>> df.sort_values(['user_id', 'timestamp']).groupby('user_id')['pageviews'].cumsum()
1     4
0     7
2    14
4     2
3     4
Name: pageviews, dtype: int64

As a function:

def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
    df.sort_values([by, 'timestamp'], inplace=True)
    df['sum_pageviews'] = df.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    return df

Note that this will not just return the DataFrame but modify it in-place.
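
If you would rather leave the caller's frame untouched, a non-mutating variant is possible. The following is only a sketch under the same column-name assumptions; pageviews_per_user_copy is a hypothetical name that does not appear in the original answer.

```python
import pandas as pd

def pageviews_per_user_copy(df, by='user_id', aggcol='pageviews', **kwargs):
    # Without inplace=True, sort_values returns a new, sorted DataFrame
    out = df.sort_values([by, 'timestamp'])
    # The new column is added to the sorted copy only
    out['sum_pageviews'] = out.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    return out

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'pageviews': [3, 4, 7, 2, 2],
    'conversion': [True, False, False, True, False],
    'timestamp': ['08:01:12', '07:02:14', '08:02:14', '10:12:15', '05:12:18'],
})
result = pageviews_per_user_copy(df)
print(result)
print(df.columns.tolist())  # the original frame has no sum_pageviews column
```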


Here's how you would use the function:

>>> df
   user_id  pageviews  conversion timestamp
0        1          3        True  08:01:12
1        1          4       False  07:02:14
2        1          7       False  08:02:14
3        2          2        True  10:12:15
4        2          2       False  05:12:18
>>> def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
...     df.sort_values([by, 'timestamp'], inplace=True)
...     df['sum_pageviews'] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
...     return df
...
>>> pageviews_per_user(df)
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4
>>> df
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4

Although timestamp is not a column of datetimes (just strings, as far as Pandas is concerned), it still sorts correctly: zero-padded HH:MM:SS strings sort lexicographically in the same order as the times they represent.
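
To illustrate the point about string sorting, here is a small sketch. The first Series reuses the timestamps from the example; the second, with unpadded hours, is a made-up counterexample. Converting with pd.to_timedelta is one possible way to get a true time ordering, not something the original answer does.

```python
import pandas as pd

stamps = pd.Series(['08:01:12', '07:02:14', '08:02:14', '10:12:15', '05:12:18'])

# Row order after sorting the raw strings vs. sorting real durations
order_as_strings = stamps.sort_values().index.tolist()
order_as_durations = pd.to_timedelta(stamps).sort_values().index.tolist()
print(order_as_strings == order_as_durations)

# Hypothetical counterexample: without zero padding, '10:00:00' sorts
# before '9:00:00' as a string, because '1' < '9'
unpadded = pd.Series(['9:00:00', '10:00:00'])
bad_order = unpadded.sort_values().index.tolist()
good_order = pd.to_timedelta(unpadded).sort_values().index.tolist()
print(bad_order, good_order)
```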

The by, aggcol, and **kwargs parameters make the function a bit more general in case you'd like to group on other column names. If you don't need that, you could hardcode these values into the function body, as is done in your question. **kwargs lets you pass any additional keyword arguments through to groupby().
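
As a rough illustration of these parameters, the sketch below groups on conversion instead of user_id and forwards sort=False to groupby() through **kwargs. It uses the version of the function that does not already hardcode sort=False, since passing the same keyword twice would raise a TypeError. The choice of conversion as a grouping column is just for demonstration.

```python
import pandas as pd

def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
    df.sort_values([by, 'timestamp'], inplace=True)
    df['sum_pageviews'] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
    return df

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'pageviews': [3, 4, 7, 2, 2],
    'conversion': [True, False, False, True, False],
    'timestamp': ['08:01:12', '07:02:14', '08:02:14', '10:12:15', '05:12:18'],
})

# Group on a different column; sort=False is passed straight to groupby()
out = pageviews_per_user(df, by='conversion', sort=False)
print(out)
```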

Cumulative sum by column in pandas dataframe

You can perform the cumsum per group using groupby + cumsum:

df['z'] = df.groupby('x')['y'].cumsum()

output:

   x    y    z
0  0   67   67
1  0   -5   62
2  1   78   78
3  1   47  125
4  1   88  213
5  1   12  225
6  1   -4  221
7  2   14   14
8  2  232  246
9  2   28  274
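
A self-contained sketch of the same idea follows. The first frame is rebuilt from the output above; the second, with interleaved groups, is a made-up example showing that the rows do not need to be contiguous or sorted, because the result stays aligned to the original index.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
    'y': [67, -5, 78, 47, 88, 12, -4, 14, 232, 28],
})
df['z'] = df.groupby('x')['y'].cumsum()
print(df)

# Hypothetical frame with interleaved groups: each row still gets the
# running total of its own group, in the original row order
mixed = pd.DataFrame({'x': [0, 1, 0, 1], 'y': [1, 10, 2, 20]})
mixed['z'] = mixed.groupby('x')['y'].cumsum()
print(mixed)
```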

How can I create a new column with a conditional cumulative sum using pandas?

A much more efficient solution wraps the update rule in a NumPy ufunc and accumulates it (see also "Generalized cumulative functions in NumPy/SciPy?"):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])

def my_sum(acc, x):
    # acc is the running total so far, x is the next value (-1, 0, or 1)
    if x == 0 and acc < 0:
        return acc + 1
    if x == 1 and acc < 0:
        return acc + 2
    if x == -1 and acc <= 0:
        return acc - 1
    if x == 0 and acc > 0:
        return acc - 1
    if x == -1 and acc > 0:
        return acc - 2
    if x == 1 and acc >= 0:
        return acc + 1
    if x == 0 and acc == 0:
        return acc

# Wrap the Python function as a binary ufunc so that .accumulate() is available
u_my_sum = np.frompyfunc(my_sum, 2, 1)
# np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype
df['mysum'] = u_my_sum.accumulate(df['val'].to_numpy(), dtype=object).astype(np.int64)
print(df)
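
To make the mechanics easier to see, here is a smaller sketch of the same frompyfunc/accumulate pattern with a simpler rule. clipped_add is a hypothetical function (a running sum that never drops below zero), not part of the original answer. Note that accumulate takes the first element as the initial accumulator without passing it through the function, which the plain loop mirrors.

```python
import numpy as np

def clipped_add(acc, x):
    # Hypothetical update rule: running sum that is clipped at zero
    return max(acc + x, 0)

u_clipped_add = np.frompyfunc(clipped_add, 2, 1)
values = np.array([3, -5, 2, -1, 4])

# dtype=object is needed because frompyfunc ufuncs operate on Python objects
running = u_clipped_add.accumulate(values, dtype=object).astype(np.int64)

# Equivalent explicit loop, for comparison
expected = [int(values[0])]
for x in values[1:]:
    expected.append(clipped_add(expected[-1], int(x)))

print(running.tolist(), expected)
```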

How to combine cumulative sum with data in Pandas

You can just concatenate the strings:

di.astype(str) + ' (' + di.cumsum(axis=1).astype(str) + ')'

output:

     Day 1    Day 2    Day 3
0  15 (15)  10 (25)  20 (45)
1    1 (1)  15 (16)   6 (22)
2    8 (8)  14 (22)  26 (48)
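
Here is a runnable sketch of the above. The frame di is rebuilt from the values outside the parentheses in the output; its exact construction in the question is an assumption.

```python
import pandas as pd

di = pd.DataFrame({
    'Day 1': [15, 1, 8],
    'Day 2': [10, 15, 14],
    'Day 3': [20, 6, 26],
})

# cumsum(axis=1) accumulates across the columns of each row; both frames
# are converted to strings so that + concatenates element by element
combined = di.astype(str) + ' (' + di.cumsum(axis=1).astype(str) + ')'
print(combined)
```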

Cumulative Sum of a column based on values in another column?

Use GroupBy.cumsum with a helper Series, created by comparing Name with 'AAAA' and taking Series.cumsum of the result:

df['Sum_Cummulative'] = df.groupby(df['Name'].eq('AAAA').cumsum())['Number'].cumsum()
print(df)
    Name  Number  Sum_Cummulative
0   AAAA       7                7
1      B       8               15
2      C       9               24
3      D      10               34
4      E       1               35
5   AAAA       1                1
6      O       2                3
7      C      34               37
8      D       5               42
9      E       6               48
10  AAAA       7                7
11     D       8               15
12     C       9               24
13     D      10               34
14     E       1               35
15  AAAA       1                1
16     B       7                8
17     C       8               16
18     D       2               18
19     E       3               21
20  AAAA       5                5
21     L       6               11
22     M       7               18
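
The helper Series is the key step, so here is a short sketch that prints it on its own, using a shortened, made-up version of the data. Each 'AAAA' row is True, and the cumulative sum of those booleans produces a block id that increases by one at every 'AAAA'.

```python
import pandas as pd

# Shortened, hypothetical sample in the same shape as the data above
df = pd.DataFrame({
    'Name': ['AAAA', 'B', 'C', 'AAAA', 'O', 'AAAA'],
    'Number': [7, 8, 9, 1, 2, 5],
})

# True at each 'AAAA'; cumsum turns the markers into block ids 1, 2, 3, ...
helper = df['Name'].eq('AAAA').cumsum()
print(helper.tolist())

# Grouping by the block ids restarts the running total at every 'AAAA'
df['Sum_Cummulative'] = df.groupby(helper)['Number'].cumsum()
print(df)
```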

Cumsum based on taking the first value from another column, then creating a new calculation

I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit

does result in

    Final Account    Date  Units  New Units
0               A  Jun-21      0          0
1               A  Jul-21      0          0
2               A  Aug-21      0          0
3               A  Sep-21      0          0
4               A  Oct-21     10          2
5               A  Nov-21      0          0
6               A  Dec-21     20          3
7               A  Jan-22      0          0
8               A  Feb-22      0          0
9               A  Mar-22      7          0
10              A  Apr-22     12          0
11              A  May-22     35          5
12              A  Jun-22      0          0

The first step identifies the blocks in column Units whose last item is relevant for building the new units: successive zeros, followed by non-zeros, up to (but not including) the next zero. This

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()

results in

0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     3
8     3
9     3
10    3
11    3
12    4

Then group column Units along these blocks, grab the last item of each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula), and store it in the new column New Units.

(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)
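
Putting the pieces together, here is a self-contained sketch that rebuilds the example frame from the table above, runs the loop, and then derives Existing Units with .cumsum(). The frame construction is an assumption based on the printed output. For the three relevant rows the formula works out as (10 - 0 * 2) // 5 = 2, (20 - 2 * 2) // 5 = 3, and (35 - 3 * 2) // 5 = 5.

```python
import pandas as pd

# Example frame rebuilt from the printed table (construction is assumed)
df = pd.DataFrame({
    'Final Account': ['A'] * 13,
    'Date': ['Jun-21', 'Jul-21', 'Aug-21', 'Sep-21', 'Oct-21', 'Nov-21', 'Dec-21',
             'Jan-22', 'Feb-22', 'Mar-22', 'Apr-22', 'May-22', 'Jun-22'],
    'Units': [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0],
})

# A new block starts at every zero that follows a non-zero (or the start)
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    # Only the last row of each block matters
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit

# Running total of the new units
df['Existing Units'] = df['New Units'].cumsum()
print(df)
```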


If there are multiple accounts (indicated in the comments), then one way to apply the procedure to each account separately would be to pack it into a function (here new_units), .groupby() over the Final Account column, and .apply() the function to the groups:

def new_units(sdf):
    groups = (sdf['Units'].eq(0) & sdf['Units'].shift().ne(0)).cumsum()
    last = 0
    for _, group in sdf['Units'].groupby(groups):
        i, unit = group.index[-1], group.iloc[-1]
        if unit != 0:
            new_unit = (unit - last * 2) // 5
            last = sdf.at[i, 'New Units'] = new_unit
    return sdf

df['New Units'] = 0
# group_keys=False keeps the original index instead of adding the account as an extra index level
df = df.groupby('Final Account', group_keys=False).apply(new_units)

How to stop and restart cumsum using a marker in another column

IIUC, you want to cumsum per group until you reach a True. Then, after this row, restart the count.

You can use an extra group based on the "end" value (also using a cumsum):

df['total'] = (df.groupby(['device_name',
                           df['end'].shift(1, fill_value=0).cumsum()])
                 ['value'].cumsum())

output:

    device_name  value    end  total
0            A5      1  False      1
1            A5      7  False      8
2            A5      2   True     10
3            A5      1  False      1
4            A5      1  False      2
5            A5      1  False      3
6            A5      1   True      4
7            A6      2  False      2
8            A6      4  False      6
9            A6      2  False      8
10           A6      2   True     10
11           A6      2  False      2
12           A6      2  False      4

NB. I get a different value for row #2.

NB.2. for purists, the extra group could also be computed using a groupby. It doesn't really matter in this case. The internal groups will just not start from zero after the first group, but their name will not be used anywhere in the output.
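
For completeness, here is one way the per-device variant mentioned in NB.2 might look. This is a sketch, not the original author's code: the frame is rebuilt from the output above, and the restart counter is computed with GroupBy.shift and GroupBy.cumsum so that it starts from zero within each device.

```python
import pandas as pd

# Frame rebuilt from the printed output (construction is assumed)
df = pd.DataFrame({
    'device_name': ['A5'] * 7 + ['A6'] * 6,
    'value': [1, 7, 2, 1, 1, 1, 1, 2, 4, 2, 2, 2, 2],
    'end': [False, False, True, False, False, False, True,
            False, False, False, True, False, False],
})

# Shift the end marker by one row within each device, so that the row
# carrying the marker still belongs to the block it closes
ended = df['end'].astype(int)
shifted = ended.groupby(df['device_name']).shift(fill_value=0)

# Per-device block counter: 0 for the first block of each device
restart = shifted.groupby(df['device_name']).cumsum()

df['total'] = df.groupby(['device_name', restart])['value'].cumsum()
print(df)
```

The totals are the same as with the global counter; only the internal group labels differ.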


