Python - Rolling Functions for Groupby Object

cumulative sum

To answer the question directly, the cumsum method produces the desired series:

In [17]: df
Out[17]:
  id  x
0  a  0
1  a  1
2  a  2
3  b  3
4  b  4
5  b  5

In [18]: df.groupby('id').x.cumsum()
Out[18]:
0 0
1 1
2 3
3 3
4 7
5 12
Name: x, dtype: int64

pandas rolling functions per group

More generally, any rolling function can be applied to each group as follows (using the new .rolling method, as noted by @kekert). Note that the result is a multi-indexed Series, unlike the output of the previous (now deprecated) pd.rolling_* methods.

In [10]: df.groupby('id')['x'].rolling(2, min_periods=1).sum()
Out[10]:
id
a   0     0.00
    1     1.00
    2     3.00
b   3     3.00
    4     7.00
    5     9.00
Name: x, dtype: float64

To apply the per-group rolling function and receive the result in the original dataframe order, use transform instead:

In [16]: df.groupby('id')['x'].transform(lambda s: s.rolling(2, min_periods=1).sum())
Out[16]:
0 0
1 1
2 3
3 3
4 7
5 9
Name: x, dtype: int64


deprecated approach

For reference, here's how the now deprecated pandas.rolling_mean behaved:

In [16]: df.groupby('id')['x'].apply(pd.rolling_mean, 2, min_periods=1)
Out[16]:
0 0.0
1 0.5
2 1.5
3 3.0
4 3.5
5 4.5

Efficient way to perform operations (rolling mean/add new columns) in each group from pandas groupby

  • Use pandas.DataFrame.groupby and .apply a function with your calculations
import pandas as pd

data = {'SecuCode': ['600455.SH', '600455.SH', '600455.SH', '600455.SH', '600455.SH',
                     '600551.SH', '600551.SH', '600551.SH', '600551.SH', '600551.SH'],
        'TradingDay': ['2013-01-04', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10',
                       '2018-12-24', '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28'],
        'Volume': [1484606, 1315166, 1675933, 1244098, 751279,
                   1166098, 3285799, 3534143, 2462501, 2282954],
        'AShare': [49717768, 49717768, 49717768, 49717768, 49717768,
                   505825296, 505825296, 505825296, 505825296, 505825296]}

df = pd.DataFrame(data)

# function with calculations (indentation restored so the function is runnable)
def calcs(df: pd.DataFrame) -> pd.DataFrame:
    df['volumn_percentage'] = df['Volume'] / df['AShare']
    df['turnover'] = df['volumn_percentage'].rolling(2).mean()
    return df

# groupby and apply the function with the calculations
df_new = df.groupby('SecuCode').apply(calcs)

# print(df_new)
SecuCode TradingDay Volume AShare volumn_percentage turnover
0 600455.SH 2013-01-04 1484606 49717768 0.029861 NaN
1 600455.SH 2013-01-07 1315166 49717768 0.026453 0.028157
2 600455.SH 2013-01-08 1675933 49717768 0.033709 0.030081
3 600455.SH 2013-01-09 1244098 49717768 0.025023 0.029366
4 600455.SH 2013-01-10 751279 49717768 0.015111 0.020067
5 600551.SH 2018-12-24 1166098 505825296 0.002305 NaN
6 600551.SH 2018-12-25 3285799 505825296 0.006496 0.004401
7 600551.SH 2018-12-26 3534143 505825296 0.006987 0.006741
8 600551.SH 2018-12-27 2462501 505825296 0.004868 0.005928
9 600551.SH 2018-12-28 2282954 505825296 0.004513 0.004691

Apply rolling function to groupby over several columns

  • It's easiest to handle resample and rolling with date frequencies when we have a single level datetime index.
  • However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs so I groupby and sum
  • I unstack one level date so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T
  • Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index
  • Finally, I do a rolling 3 day sum and resample that result every 2 days with resample
  • I clean this up with a bit of renaming indices and one more pivot.

s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()

d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)

d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, axis=1)
print(d1)

                A   B
date       id
2017-02-05 33  20  10
2017-02-07 33  20  20
2017-02-09 33   0   0
2017-02-11 33   3   0
2017-02-13 33   7   0

Pandas rolling slope on groupby objects

Try:

import numpy as np
from scipy.stats import linregress

df['rolling_slope'] = (df.groupby('tags')['weight']
                         .rolling(window=10, min_periods=2)
                         .apply(lambda v: linregress(np.arange(len(v)), v).slope)
                         .reset_index(level=0, drop=True)
                       )

Note that this rolls over a fixed number of rows only; it does not actually look back 10 days. There is also a time-based option, rolling('10D'), but it requires the dates to be set as the index.
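As a minimal sketch of the time-based variant (with made-up data reusing the question's tags/weight column names), moving the dates into the index lets each window cover the 10 calendar days up to and including the current row:

```python
import pandas as pd

# hypothetical data reusing the question's column names
df = pd.DataFrame({
    'tags': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-01-20',
                            '2021-01-01', '2021-01-08']),
    'weight': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# rolling('10D') needs a datetime index that is monotonic within each group;
# a sum is used here for brevity in place of the slope calculation
out = (df.set_index('date')
         .groupby('tags')['weight']
         .rolling('10D', min_periods=1)
         .sum())
```

The result is again a multi-indexed Series keyed by (tags, date); for group 'a', the 2021-01-20 window contains only that row, because 2021-01-05 falls outside the trailing 10-day span.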

Rolling operations on DataFrameGroupby object

I have found a workable solution, but it only works if each date is unique for a given id. This is the case in my data after some additional processing:

new_df = df.groupby(['id','date']).mean().reset_index()

which returns:

    id      date      target
0 1.0 2017-01-01 0
1 1.0 2017-01-21 1
2 1.0 2017-10-01 0
3 2.0 2017-01-01 1
4 2.0 2017-01-21 0
5 2.0 2017-10-01 0

I can then use the rolling method on a groupby object to get the desired result:

df = new_df.set_index('date')

df.iloc[::-1].groupby('id')['target'].rolling(window='180D',
                                              center=False).apply(lambda x: x[:-1].sum())

There are two tricks here:

  1. I reverse the order of the dates (.iloc[::-1]) to take a forward-looking window; this has been suggested in other SO questions.

  2. I drop the last entry of the sum to remove the 'current' date from the sum, so it only looks forward.

The second 'hack' means it only works when there are no repeats of dates for a given id.

I would be interested in making a more robust solution (e.g., where dates are repeated for an id).

Pandas returns incorrect groupby rolling sum of zeros for float64 when having many groups

TLDR: this is a side effect of optimization; the workaround is to use a non-pandas sum.

The reason is that pandas tries to optimize. Naive rolling window functions will take O(n*w) time. However, if we're aware the function is a sum, we can subtract one element going out of window and add the one getting into it. This approach no longer depends on window size, and is always O(n).

The caveat is that we now get floating point precision side effects, which manifest in the way you've described.
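The add/subtract scheme and its precision drift can be reproduced in a few lines of plain NumPy; this is an illustrative sketch, not pandas' actual Cython implementation:

```python
import numpy as np

def rolling_sum_sliding(x, w):
    # O(n) sliding window: add the entering element, subtract the leaving one
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    s = 0.0
    for i, v in enumerate(x):
        s += v
        if i >= w:
            s -= x[i - w]      # element dropping out of the window
        if i >= w - 1:
            out[i] = s
    return out

# a huge value poisons the running sum: once 1e16 is absorbed, adding 1.0
# is lost to rounding, and later windows inherit the error
vals = [1e16, 1.0, 0.0, 0.0]
print(rolling_sum_sliding(vals, 2))  # third window: 0.0 (true sum is 1.0),
                                     # fourth window: -1.0 (true sum is 0.0)
```

The workaround mentioned above, e.g. .rolling(w).apply(np.sum, raw=True), recomputes each window from scratch, trading the O(n) shortcut for exact per-window arithmetic.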

Sources: Python code calling window aggregation, Cython implementation of the rolling sum


