Python - rolling functions for GroupBy object
cumulative sum
To answer the question directly, the cumsum method produces the desired series:
In [17]: df
Out[17]:
id x
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
In [18]: df.groupby('id').x.cumsum()
Out[18]:
0 0
1 1
2 3
3 3
4 7
5 12
Name: x, dtype: int64
pandas rolling functions per group
More generally, any rolling function can be applied to each group as follows (using the new .rolling method as commented by @kekert). Note that the return type is a multi-indexed series, which is different from previous (deprecated) pd.rolling_* methods.
In [10]: df.groupby('id')['x'].rolling(2, min_periods=1).sum()
Out[10]:
id
a 0 0.00
1 1.00
2 3.00
b 3 3.00
4 7.00
5 9.00
Name: x, dtype: float64
To apply the per-group rolling function and receive the result in the original DataFrame order, use transform instead:
In [16]: df.groupby('id')['x'].transform(lambda s: s.rolling(2, min_periods=1).sum())
Out[16]:
0 0
1 1
2 3
3 3
4 7
5 9
Name: x, dtype: int64
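As a self-contained sketch of the two approaches above (the DataFrame is rebuilt here from the printed example, and the variable names multi, aligned and via_transform are only illustrative), the multi-indexed result can also be brought back to the original row order by dropping the group level:

```python
import pandas as pd

# Rebuild the example frame shown above.
df = pd.DataFrame({'id': list('aaabbb'), 'x': range(6)})

# groupby + rolling returns a Series indexed by (id, original index).
multi = df.groupby('id')['x'].rolling(2, min_periods=1).sum()

# Dropping the 'id' level and sorting restores the original row order.
aligned = multi.reset_index(level=0, drop=True).sort_index()

# transform yields the same values, already aligned to df's index.
via_transform = df.groupby('id')['x'].transform(
    lambda s: s.rolling(2, min_periods=1).sum()
)
```

Either result can be assigned straight back as a new column of df, since both share its index.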
deprecated approach
For reference, here's how the deprecated pandas.rolling_mean (since removed from pandas) behaved:
In [16]: df.groupby('id')['x'].apply(pd.rolling_mean, 2, min_periods=1)
Out[16]:
0 0.0
1 0.5
2 1.5
3 3.0
4 3.5
5 4.5
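On current pandas versions the same numbers can be obtained with the .rolling method. This is a minimal sketch, assuming the same example frame as above; the name rolling_mean is just a local variable here, not the old function:

```python
import pandas as pd

# Same example data as in the cumulative sum section.
df = pd.DataFrame({'id': list('aaabbb'), 'x': range(6)})

# Per-group rolling mean over 2 rows, aligned to the original index.
rolling_mean = df.groupby('id')['x'].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)
```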
Efficient way to perform operations (rolling mean/add new columns) in each group from pandas groupby
- Use pandas.DataFrame.groupby and .apply a function with your calculations.
import pandas as pd
data = {'SecuCode': ['600455.SH', '600455.SH', '600455.SH', '600455.SH', '600455.SH', '600551.SH', '600551.SH', '600551.SH', '600551.SH', '600551.SH'],
'TradingDay': ['2013-01-04', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10', '2018-12-24', '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28'],
'Volume': [1484606, 1315166, 1675933, 1244098, 751279, 1166098, 3285799, 3534143, 2462501, 2282954],
'AShare': [49717768, 49717768, 49717768, 49717768, 49717768, 505825296, 505825296, 505825296, 505825296, 505825296]}
df = pd.DataFrame(data)
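# function with calculations, applied to one group at a time
def calcs(df: pd.DataFrame) -> pd.DataFrame:
    df['volumn_percentage'] = df['Volume'] / df['AShare']
    # 2-row rolling mean within the group; the first row of each group is NaN
    df['turnover'] = df['volumn_percentage'].rolling(2).mean()
    return df
# groupby and apply the function with the calculations
# group_keys=False keeps the original flat index in the result
df_new = df.groupby('SecuCode', group_keys=False).apply(calcs)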
# print(df_new)
SecuCode TradingDay Volume AShare volumn_percentage turnover
0 600455.SH 2013-01-04 1484606 49717768 0.029861 NaN
1 600455.SH 2013-01-07 1315166 49717768 0.026453 0.028157
2 600455.SH 2013-01-08 1675933 49717768 0.033709 0.030081
3 600455.SH 2013-01-09 1244098 49717768 0.025023 0.029366
4 600455.SH 2013-01-10 751279 49717768 0.015111 0.020067
5 600551.SH 2018-12-24 1166098 505825296 0.002305 NaN
6 600551.SH 2018-12-25 3285799 505825296 0.006496 0.004401
7 600551.SH 2018-12-26 3534143 505825296 0.006987 0.006741
8 600551.SH 2018-12-27 2462501 505825296 0.004868 0.005928
9 600551.SH 2018-12-28 2282954 505825296 0.004513 0.004691
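As an alternative to the groupby/apply approach above, the ratio does not depend on the group, so it can be computed on the whole frame and only the rolling mean done per group. This is a sketch under that assumption, using a shortened copy of the sample data (three rows per SecuCode):

```python
import pandas as pd

# Shortened copy of the sample data above.
df = pd.DataFrame({
    'SecuCode': ['600455.SH'] * 3 + ['600551.SH'] * 3,
    'Volume': [1484606, 1315166, 1675933, 1166098, 3285799, 3534143],
    'AShare': [49717768] * 3 + [505825296] * 3,
})

# Row-wise ratio: no grouping needed.
df['volumn_percentage'] = df['Volume'] / df['AShare']

# Rolling mean restarts in each group, so each group's first row is NaN.
df['turnover'] = (
    df.groupby('SecuCode')['volumn_percentage']
      .transform(lambda s: s.rolling(2).mean())
)
```

This avoids mutating the group inside the applied function and keeps the original index without extra options.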
Apply rolling function to groupby over several columns
- It's easiest to handle resample and rolling with date frequencies when we have a single-level datetime index.
- However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs, so I groupby and sum.
- I unstack one level, date, so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T.
- Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index.
- Finally, I do a rolling 3-day sum and resample that result every 2 days with resample.
- I clean this up with a bit of renaming indices and one more pivot.
s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()
d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)
d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, axis=1)
print(d1)
                A   B
date       id
2017-02-05 33  20  10
2017-02-07 33  20  20
2017-02-09 33   0   0
2017-02-11 33   3   0
2017-02-13 33   7   0
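The question's input frame isn't reproduced here, so the following is only a reduced sketch of the reindex, rolling and resample steps on a single hypothetical series (the dates and values are made up for illustration):

```python
import pandas as pd

# Hypothetical points on irregular dates (2017-02-07 and 2017-02-08 are missing).
s = pd.Series(
    [10, 5, 3],
    index=pd.to_datetime(['2017-02-05', '2017-02-06', '2017-02-09']),
)

# Fill the gaps so every calendar day is present, with 0 points on missing days.
full_range = pd.date_range(s.index.min(), s.index.max())
d = s.reindex(full_range, fill_value=0)

# Rolling 3-day sum, then keep the first value of every 2-day bin.
result = d.rolling('3D').sum().resample('2D').first()
```

With these values the result has entries on 2017-02-05, 2017-02-07 and 2017-02-09, one per 2-day bin.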
Pandas rolling slope on groupby objects
Try:
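import numpy as np
from scipy.stats import linregress

# slope of weight against row position within each 10-row window, per tag
df['rolling_slope'] = (df.groupby('tags')['weight']
                         .rolling(window=10, min_periods=2)
                         .apply(lambda v: linregress(np.arange(len(v)), v).slope)
                         .reset_index(level=0, drop=True)
                      )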
But this is rolling on number of rows only, not really looking back 10 days. There's also an option rolling('10D'), but you would need to set date as index.
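A rough sketch of the rolling('10D') variant follows. The data is hypothetical, the helper names slope and rolling_slope are made up here, and np.polyfit is swapped in for linregress only to keep the example free of a scipy dependency. It regresses weight on elapsed days rather than on row position:

```python
import numpy as np
import pandas as pd

# Hypothetical data: weight grows by 2 per day within each tag.
df = pd.DataFrame({
    'tags': ['a', 'a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-05',
                            '2020-01-20', '2020-01-01', '2020-01-02']),
    'weight': [0.0, 4.0, 8.0, 38.0, 0.0, 2.0],
})

def slope(v):
    # v is a Series indexed by date; fit weight against elapsed days.
    days = np.asarray((v.index - v.index[0]).days, dtype=float)
    return np.polyfit(days, v.to_numpy(), 1)[0]

def rolling_slope(g):
    # Time-based windows need a sorted datetime index.
    s = g.sort_values('date').set_index('date')['weight']
    return s.rolling('10D', min_periods=2).apply(slope, raw=False)

# One rolling series per tag, stacked under a (tag, date) index.
out = pd.concat({tag: rolling_slope(g) for tag, g in df.groupby('tags')})
```

Windows holding fewer than two observations, such as the isolated 2020-01-20 row for tag a, come out as NaN.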
Rolling operations on DataFrameGroupby object
I have found a workable solution but it only works if for each id each date is unique. This is the case in my data with some additional processing:
new_df = df.groupby(['id','date']).mean().reset_index()
which returns:
id date target
0 1.0 2017-01-01 0
1 1.0 2017-01-21 1
2 1.0 2017-10-01 0
3 2.0 2017-01-01 1
4 2.0 2017-01-21 0
5 2.0 2017-10-01 0
I can then use the rolling method on a groupby object to get the desired result:
df = new_df.set_index('date')
df.iloc[::-1].groupby('id')['target'].rolling(window='180D',
                                              center=False).apply(lambda x: x[:-1].sum())
There are two tricks here:
- I reverse the order of the dates (.iloc[::-1]) to take a forward-looking window; this has been suggested in other SO questions.
- I drop the last entry of the sum to remove the 'current' date from the sum, so it only looks forward.
The second 'hack' means it only works when there are no repeats of dates for a given id.
I would be interested in making a more robust solution (e.g., where dates are repeated for an id).
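One possible direction for the more robust variant is sketched below. It is not the method used above: it uses a cumulative sum plus numpy.searchsorted, and it assumes the window covers rows strictly after the current date up to and including date + 180 days. The helper name forward_sum and the sample rows (with a repeated date for id 1) are hypothetical:

```python
import numpy as np
import pandas as pd

def forward_sum(g, days=180):
    # Sum of 'target' over rows dated in (date, date + days] within one id.
    g = g.sort_values('date')
    d = g['date'].to_numpy()
    # c[k] is the sum of the first k targets in date order.
    c = np.concatenate([[0.0], g['target'].to_numpy(dtype=float).cumsum()])
    # First position past all rows sharing the current date.
    left = np.searchsorted(d, d, side='right')
    # First position past the end of the forward window.
    right = np.searchsorted(d, d + np.timedelta64(days, 'D'), side='right')
    return pd.Series(c[right] - c[left], index=g.index)

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2017-01-01', '2017-01-21', '2017-01-21',
                            '2017-10-01', '2017-01-01', '2017-01-21']),
    'target': [0, 1, 1, 0, 1, 0],
})

# Results keep the original index, so assignment aligns row by row.
df['forward_sum'] = pd.concat([forward_sum(g) for _, g in df.groupby('id')])
```

Because all rows on the current date are excluded by position rather than by dropping one entry, repeated dates within an id do not break the result.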
Pandas returns incorrect groupby rolling sum of zeros for float64 when having many groups
TLDR: this is a side effect of optimization; the workaround is to use a non-pandas sum.
The reason is that pandas tries to optimize. Naive rolling window functions will take O(n*w) time. However, if we're aware the function is a sum, we can subtract one element going out of window and add the one getting into it. This approach no longer depends on window size, and is always O(n).
The caveat is that we now get floating-point precision side effects, which manifest much like what you've described.
Sources: Python code calling window aggregation, Cython implementation of the rolling sum
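To illustrate the add-and-subtract idea and why it can leave a residue, here is a pure-Python sketch. It is not pandas' actual implementation, and the function names and input values are chosen only to make the effect visible:

```python
import math

def rolling_sum_incremental(values, window):
    # O(n): add the entering element, subtract the one leaving the window.
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running)
    return out

def rolling_sum_recomputed(values, window):
    # O(n*w): sum each window from scratch.
    return [math.fsum(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]

values = [1e16, 1.0, 0.0, 0.0]
incremental = rolling_sum_incremental(values, 2)
recomputed = rolling_sum_recomputed(values, 2)
# The 1.0 is lost when added to 1e16, so the running total drifts:
# the last window [0.0, 0.0] gives -1.0 incrementally but 0.0 when recomputed.
```

In pandas, recomputing each window with a non-pandas sum would look something like .rolling(w).apply(np.sum, raw=True), which trades the O(n) speed for a sum that does not carry error over from earlier windows.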