Cumsum as a new column in an existing Pandas data
Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:
df['CUMSUM_C'] = df['SUM_C'].cumsum()
Result:
df
Out[34]:
   A  B  SUM_C  CUMSUM_C
0  1  1     10        10
1  1  2     20        30
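For reference, here is a minimal self-contained sketch of the same idea. The input frame is reconstructed from the result shown above, so its exact contents are an assumption:

```python
import pandas as pd

# Input frame reconstructed from the printed result (assumed, not from the question)
df = pd.DataFrame({'A': [1, 1], 'B': [1, 2], 'SUM_C': [10, 20]})

# Series.cumsum() returns a Series with the same index, so plain assignment works
df['CUMSUM_C'] = df['SUM_C'].cumsum()
print(df)
```

The assignment adds the column to df itself; nothing else in the frame changes.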
cumsum pandas create new column
You're close: you just need to call cumsum():
>>> df.sort_values(['user_id', 'timestamp']).groupby('user_id')['pageviews'].cumsum()
1     4
0     7
2    14
4     2
3     4
Name: pageviews, dtype: int64
As a function:
def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
    df.sort_values([by, 'timestamp'], inplace=True)
    df['sum_pageviews'] = df.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    return df
Note that this will not just return the DataFrame but modify it in-place.
Here's how you would use the function:
>>> df
   user_id  pageviews  conversion  timestamp
0        1          3        True   08:01:12
1        1          4       False   07:02:14
2        1          7       False   08:02:14
3        2          2        True   10:12:15
4        2          2       False   05:12:18
>>> def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
...     df.sort_values([by, 'timestamp'], inplace=True)
...     df['sum_pageviews'] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
...     return df
...
>>> pageviews_per_user(df)
   user_id  pageviews  conversion  timestamp  sum_pageviews
1        1          4       False   07:02:14              4
0        1          3        True   08:01:12              7
2        1          7       False   08:02:14             14
4        2          2       False   05:12:18              2
3        2          2        True   10:12:15              4
>>> df
   user_id  pageviews  conversion  timestamp  sum_pageviews
1        1          4       False   07:02:14              4
0        1          3        True   08:01:12              7
2        1          7       False   08:02:14             14
4        2          2       False   05:12:18              2
3        2          2        True   10:12:15              4
Although timestamp is not a column of datetimes (just strings, as far as Pandas is concerned), it can still be sorted lexicographically. For zero-padded HH:MM:SS strings, that order matches chronological order.
The parameters by, aggcol, and **kwargs make the function a bit more general in case you'd like to group on other column names. If not, you could also hardcode these into the function body, as is done in your question. **kwargs lets you pass any additional keyword arguments to groupby().
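If modifying the caller's frame is undesirable, one possible variant leaves the input untouched. This is only a sketch: the name pageviews_per_user_copy is made up for illustration, and the sample data is copied from the frame shown above.

```python
import pandas as pd

def pageviews_per_user_copy(df, by='user_id', aggcol='pageviews', **kwargs):
    # Hypothetical non-mutating variant: without inplace=True,
    # sort_values returns a new, sorted DataFrame
    out = df.sort_values([by, 'timestamp'])
    out['sum_pageviews'] = out.groupby(by=by, **kwargs)[aggcol].cumsum()
    return out

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'pageviews': [3, 4, 7, 2, 2],
    'conversion': [True, False, False, True, False],
    'timestamp': ['08:01:12', '07:02:14', '08:02:14', '10:12:15', '05:12:18'],
})

result = pageviews_per_user_copy(df)
print(result)
print('sum_pageviews' in df.columns)  # the original frame has no new column
```

The trade-off is an extra copy of the data in exchange for a function without side effects.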
Cumulative sum by column in pandas dataframe
You can perform the cumsum per group using groupby + cumsum:
df['z'] = df.groupby('x')['y'].cumsum()
output:
   x    y    z
0  0   67   67
1  0   -5   62
2  1   78   78
3  1   47  125
4  1   88  213
5  1   12  225
6  1   -4  221
7  2   14   14
8  2  232  246
9  2   28  274
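To reproduce this output end to end, the input can be rebuilt from the x and y columns above. A minimal sketch:

```python
import pandas as pd

# x and y are copied from the output table above
df = pd.DataFrame({
    'x': [0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
    'y': [67, -5, 78, 47, 88, 12, -4, 14, 232, 28],
})

# The running total restarts whenever the group key x changes
df['z'] = df.groupby('x')['y'].cumsum()
print(df)
```

Because groupby(...).cumsum() keeps the original index, the result lines up with df row by row even if the groups are not contiguous.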
How can I create a new column with a conditional cumulative sum using pandas?
A much more efficient solution (see also the question "Generalized cumulative functions in NumPy/SciPy?"):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])

def my_sum(acc, x):
    # acc is the running total so far, x is the next value (-1, 0 or 1)
    if x == 0 and acc < 0:
        return acc + 1
    if x == 1 and acc < 0:
        return acc + 2
    if x == -1 and acc <= 0:
        return acc - 1
    if x == 0 and acc > 0:
        return acc - 1
    if x == -1 and acc > 0:
        return acc - 2
    if x == 1 and acc >= 0:
        return acc + 1
    if x == 0 and acc == 0:
        return acc

# Wrap the Python function as a ufunc with 2 inputs and 1 output
u_my_sum = np.frompyfunc(my_sum, 2, 1)
# np.object was removed in NumPy 1.24; the builtin object is equivalent
df['mysum'] = u_my_sum.accumulate(df.val, dtype=object).astype(np.int64)
print(df)
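Since the example above uses random input, its output differs on every run. As a sanity check, the sketch below applies the same rule to a small fixed input (chosen here for illustration) and compares the ufunc result with a plain itertools.accumulate loop:

```python
from itertools import accumulate

import numpy as np

def my_sum(acc, x):
    # Same conditional update rule as in the answer above
    if x == 0 and acc < 0:
        return acc + 1
    if x == 1 and acc < 0:
        return acc + 2
    if x == -1 and acc <= 0:
        return acc - 1
    if x == 0 and acc > 0:
        return acc - 1
    if x == -1 and acc > 0:
        return acc - 2
    if x == 1 and acc >= 0:
        return acc + 1
    if x == 0 and acc == 0:
        return acc

values = np.array([1, 1, 0, -1, -1, 0, 1])  # illustrative input

u_my_sum = np.frompyfunc(my_sum, 2, 1)
via_ufunc = u_my_sum.accumulate(values, dtype=object).astype(np.int64)

# Pure-Python reference: the first element is taken as-is, like ufunc.accumulate
via_loop = list(accumulate(values.tolist(), my_sum))

print(via_ufunc.tolist())  # [1, 2, 1, -1, -2, -1, 1]
print(via_loop)            # [1, 2, 1, -1, -2, -1, 1]
```

Both approaches seed the running total with the first value and call my_sum(acc, x) for each later element.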
How to combine cumulative sum with data in Pandas
You can just concatenate the strings:
di.astype(str) + ' (' + di.cumsum(axis=1).astype(str) + ')'
output:
     Day 1    Day 2    Day 3
0  15 (15)  10 (25)  20 (45)
1    1 (1)  15 (16)   6 (22)
2    8 (8)  14 (22)  26 (48)
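A runnable sketch of the same expression, with the input frame di reconstructed from the first number in each cell of the output above:

```python
import pandas as pd

# Daily values taken from the output above (the number before each parenthesis)
di = pd.DataFrame({
    'Day 1': [15, 1, 8],
    'Day 2': [10, 15, 14],
    'Day 3': [20, 6, 26],
})

# cumsum(axis=1) accumulates across the columns of each row
combined = di.astype(str) + ' (' + di.cumsum(axis=1).astype(str) + ')'
print(combined)
```

Note that the result holds strings, so it is suited for display rather than further arithmetic.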
Cumulative Sum of a column based on values in another column?
Use GroupBy.cumsum with a helper Series, created by comparing the Name column with 'AAAA' and applying Series.cumsum:
df['Sum_Cummulative']=df.groupby(df['Name'].eq('AAAA').cumsum())['Number'].cumsum()
print (df)
    Name  Number  Sum_Cummulative
0   AAAA       7                7
1      B       8               15
2      C       9               24
3      D      10               34
4      E       1               35
5   AAAA       1                1
6      O       2                3
7      C      34               37
8      D       5               42
9      E       6               48
10  AAAA       7                7
11     D       8               15
12     C       9               24
13     D      10               34
14     E       1               35
15  AAAA       1                1
16     B       7                8
17     C       8               16
18     D       2               18
19     E       3               21
20  AAAA       5                5
21     L       6               11
22     M       7               18
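To see why this works, it can help to look at the helper Series on its own. The sketch below uses the first ten rows of the data above; the variable name helper is chosen here for illustration:

```python
import pandas as pd

# First ten rows of the data shown above
df = pd.DataFrame({
    'Name': ['AAAA', 'B', 'C', 'D', 'E', 'AAAA', 'O', 'C', 'D', 'E'],
    'Number': [7, 8, 9, 10, 1, 1, 2, 34, 5, 6],
})

# Each 'AAAA' row bumps the counter, so every block gets its own group id
helper = df['Name'].eq('AAAA').cumsum()
print(helper.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

df['Sum_Cummulative'] = df.groupby(helper)['Number'].cumsum()
print(df)
```

The helper never has to be stored as a column: groupby accepts any Series aligned with the frame's index.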
cumsum based on taking first value from another column then creating a new calculation
I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit
does result in
    Final Account    Date  Units  New Units
0               A  Jun-21      0          0
1               A  Jul-21      0          0
2               A  Aug-21      0          0
3               A  Sep-21      0          0
4               A  Oct-21     10          2
5               A  Nov-21      0          0
6               A  Dec-21     20          3
7               A  Jan-22      0          0
8               A  Feb-22      0          0
9               A  Mar-22      7          0
10              A  Apr-22     12          0
11              A  May-22     35          5
12              A  Jun-22      0          0
The first step identifies the blocks in column Units whose last item is relevant for building the new units: successive zeros, followed by non-zeros, up to the next zero. This
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
results in
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     3
8     3
9     3
10    3
11    3
12    4
Then group column Units along these blocks, grab the last item of each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula), and store it in the new column New Units.
(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)
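As a brief sketch of that last remark, using the New Units values from the result above:

```python
import pandas as pd

# New Units values copied from the result table above
df = pd.DataFrame({'New Units': [0, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0, 5, 0]})

# Running total of the new units
df['Existing Units'] = df['New Units'].cumsum()
print(df['Existing Units'].tolist())  # [0, 0, 0, 0, 2, 2, 5, 5, 5, 5, 5, 10, 10]
```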
If there are multiple accounts (indicated in the comments), then one way to apply the procedure to each account separately would be to pack it into a function (here new_units), .groupby() over the Final Account column, and .apply() the function to the groups:
def new_units(sdf):
    groups = (sdf['Units'].eq(0) & sdf['Units'].shift().ne(0)).cumsum()
    last = 0
    for _, group in sdf['Units'].groupby(groups):
        i, unit = group.index[-1], group.iloc[-1]
        if unit != 0:
            new_unit = (unit - last * 2) // 5
            last = sdf.at[i, 'New Units'] = new_unit
    return sdf

df['New Units'] = 0
# group_keys=False keeps the original index (needed in pandas 2.0 and later)
df = df.groupby('Final Account', group_keys=False).apply(new_units)
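An alternative sketch for the multi-account case, which is not the approach above but a variation on it, computes the values per account with transform on the Units column only. The helper name new_units_for and the second account B are made up for illustration; account A uses the Units values from the example.

```python
import pandas as pd

def new_units_for(units):
    """Return the New Units values for a single account's Units series."""
    blocks = (units.eq(0) & units.shift().ne(0)).cumsum()
    result = pd.Series(0, index=units.index)
    last = 0
    for _, block in units.groupby(blocks):
        i, unit = block.index[-1], block.iloc[-1]
        if unit != 0:
            # Same formula as above, applied to the last item of the block
            last = (unit - last * 2) // 5
            result.at[i] = last
    return result

df = pd.DataFrame({
    # Account B is hypothetical and only there to show the per-account reset
    'Final Account': ['A'] * 13 + ['B'] * 3,
    'Units': [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0, 0, 15, 0],
})

df['New Units'] = df.groupby('Final Account')['Units'].transform(new_units_for)
print(df)
```

Because the function receives and returns a Series with the group's own index, transform can align the results back to df without touching the other columns.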
How to stop and restart cumsum using a marker in another column
IIUC, you want to cumsum per group until you reach a True. Then, after this row, restart the count.
You can use an extra group based on the "end" value (also using a cumsum):
df['total'] = (df.groupby(['device_name',
                           df['end'].shift(1, fill_value=False).cumsum()])
                 ['value'].cumsum())
output:
    device_name  value    end  total
0            A5      1  False      1
1            A5      7  False      8
2            A5      2   True     10
3            A5      1  False      1
4            A5      1  False      2
5            A5      1  False      3
6            A5      1   True      4
7            A6      2  False      2
8            A6      4  False      6
9            A6      2  False      8
10           A6      2   True     10
11           A6      2  False      2
12           A6      2  False      4
NB. Note that I get a different value for row #2.
NB.2. For purists, the extra group could also be computed using a groupby. It doesn't really matter in this case: the internal groups will just not start from zero after the first group, but their names are not used anywhere in the output.
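The following sketch rebuilds the example from the output above and exposes the extra grouping key as its own variable (named block here for illustration), which makes the restart logic easier to inspect. It uses fill_value=False so the shifted column stays boolean.

```python
import pandas as pd

# Data copied from the output table above
df = pd.DataFrame({
    'device_name': ['A5'] * 7 + ['A6'] * 6,
    'value': [1, 7, 2, 1, 1, 1, 1, 2, 4, 2, 2, 2, 2],
    'end': [False, False, True, False, False, False, True,
            False, False, False, True, False, False],
})

# Shifting by one row keeps the 'end' row inside the block it closes;
# the cumulative sum then gives each block its own number
block = df['end'].shift(1, fill_value=False).cumsum()
print(block.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3]

df['total'] = df.groupby(['device_name', block])['value'].cumsum()
print(df)
```

Grouping by device_name as well ensures that a new device always starts a fresh count, even if the previous device's last row was not marked as an end.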