Pandas: Rolling Mean by Time Interval

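Since this question was asked, pandas has added a time-window capability: rolling() accepts an offset string such as '2s' as the window when the index is datetime-like.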

# Assumes: from pandas import DataFrame, Timestamp
In [1]: df = DataFrame({'B': range(5)})

In [2]: df.index = [Timestamp('20130101 09:00:00'),
   ...:             Timestamp('20130101 09:00:02'),
   ...:             Timestamp('20130101 09:00:03'),
   ...:             Timestamp('20130101 09:00:05'),
   ...:             Timestamp('20130101 09:00:06')]

In [3]: df
Out[3]: 
                     B
2013-01-01 09:00:00  0
2013-01-01 09:00:02  1
2013-01-01 09:00:03  2
2013-01-01 09:00:05  3
2013-01-01 09:00:06  4

In [4]: df.rolling(2, min_periods=1).sum()
Out[4]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  5.0
2013-01-01 09:00:06  7.0

In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  7.0
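
The session above uses sum(); since the question is about a rolling mean, here is a minimal sketch of the same comparison with mean(). It reuses the timestamps from the example and assumes the default window closure, where a '2s' window ending at time t covers (t - 2s, t].

```python
import pandas as pd

idx = pd.to_datetime([
    '2013-01-01 09:00:00',
    '2013-01-01 09:00:02',
    '2013-01-01 09:00:03',
    '2013-01-01 09:00:05',
    '2013-01-01 09:00:06',
])
df = pd.DataFrame({'B': range(5)}, index=idx)

# Count-based window: always the current row and the one before it.
by_rows = df['B'].rolling(2, min_periods=1).mean()

# Time-based window: every row whose timestamp falls in (t - 2s, t].
by_time = df['B'].rolling('2s', min_periods=1).mean()

print(pd.DataFrame({'by_rows': by_rows, 'by_time': by_time}))
```

The two columns differ at 09:00:02 and 09:00:05, where the previous row is a full 2 seconds old and therefore falls outside the time window.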

pandas: rolling mean on time interval plus grouping on index

My approach:

df = df.assign(day=df['ds time'].dt.normalize(),
               hour=df['ds time'].dt.hour)

ret_df = df.merge(df.drop('ds time', axis=1)
           .set_index('day')
           .groupby(['id','hour']).rolling('7D').mean()
           .drop(['hour','id'], axis=1),
         on=['id','hour','day'],
         how='left',
         suffixes=['','_roll']
        ).drop(['day','hour'], axis=1)

Sample data:

dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')

np.random.seed(1)
df = pd.DataFrame({
    'id': np.repeat([6,7], len(dates)),
    'ds time': np.tile(dates,2),
    'X': np.arange(len(dates)*2),
    'Y': np.random.randint(0,10, len(dates)*2)
})
df.head()

Output ret_df.head():

   id             ds time  X  Y  X_roll  Y_roll
0   6 2020-02-21 00:00:00  0  5     0.0     5.0
1   6 2020-02-21 01:00:00  1  8     1.0     8.0
2   6 2020-02-21 02:00:00  2  9     2.0     9.0
3   6 2020-02-21 03:00:00  3  5     3.0     5.0
4   6 2020-02-21 04:00:00  4  0     4.0     0.0
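
For comparison, a simpler variant that groups by id only (not by hour) and rolls directly over the timestamp column. This is a sketch on made-up data, not the approach above: the values and the '2D' window are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical data: two ids, three timestamps each.
df = pd.DataFrame({
    'id': [6, 6, 6, 7, 7, 7],
    'ds time': pd.to_datetime(['2020-02-21', '2020-02-22', '2020-02-24'] * 2),
    'X': [1.0, 3.0, 5.0, 10.0, 20.0, 60.0],
})

# A time-based window needs a datetime-like index, so move 'ds time' there,
# then roll within each id. The result is indexed by (id, ds time).
rolled = (df.set_index('ds time')
            .groupby('id')['X']
            .rolling('2D')
            .mean())
print(rolled)
```

For id 6 the value on 2020-02-24 is just 5.0, because the row from 2020-02-22 lies exactly 2 days back and the window's left edge is open.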

Calculating the mean for each time of day with a rolling window with pandas

You can pre-filter the dataset to the 13 days preceding dt, group by time of day, apply a 7-day rolling window with min_periods=7, and take the mean. dropna then removes the dates that have fewer than 7 previous days of values:

# generate sample dataset
ix = pd.date_range('2021-01-01', '2021-05-01', freq='15min')
df = pd.DataFrame({
        'Phase1': np.random.uniform(0, 1, len(ix)),
        'Phase2': np.random.uniform(0, 1, len(ix)),
        'Phase3': np.random.uniform(0, 1, len(ix)),
    }, index=ix)
df['Sum'] = df.sum(1)

# set max date
dt = pd.to_datetime('2021-02-14')

# keep only rows in [dt - 13 days, dt)
z = df.loc[(df.index >= dt - pd.DateOffset(days=13)) & (df.index < dt)]

# calculate 7-day rolling average for the same time of the day
# for 7 days preceding `dt`
(z
     .groupby(z.index.time)
     .rolling('7d', min_periods=7)
     .mean()
     .dropna()
     .droplevel(0)
     .sort_index())

Output:

                       Phase1    Phase2    Phase3       Sum
2021-02-07 00:00:00  0.479466  0.731746  0.503017  1.714229
2021-02-07 00:15:00  0.443550  0.423135  0.543204  1.409889
2021-02-07 00:30:00  0.465272  0.626117  0.454462  1.545851
2021-02-07 00:45:00  0.528733  0.433475  0.386822  1.349029
2021-02-07 01:00:00  0.425309  0.360065  0.488509  1.273884
...                       ...       ...       ...       ...
2021-02-13 22:45:00  0.519717  0.490549  0.524330  1.534596
2021-02-13 23:00:00  0.367935  0.460093  0.373338  1.201366
2021-02-13 23:15:00  0.597424  0.438130  0.478259  1.513813
2021-02-13 23:30:00  0.675142  0.443580  0.330791  1.449514
2021-02-13 23:45:00  0.474604  0.355723  0.596467  1.426794
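
To see what the groupby-by-time step does, here is a small deterministic sketch of the same pattern. The data is an assumption made for illustration: two observations per day (00:00 and 12:00) over 8 days, with values 0..15, so the averages can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 8 days, observations at 00:00 and 12:00.
ix = pd.date_range('2021-01-01', periods=16, freq=pd.Timedelta(hours=12))
s = pd.Series(np.arange(16, dtype=float), index=ix)

out = (s
       .groupby(s.index.time)          # one group per time of day
       .rolling('7D', min_periods=7)   # needs 7 daily values in the window
       .mean()
       .dropna()                       # drop days with fewer than 7 values
       .droplevel(0)                   # remove the time-of-day level
       .sort_index())
print(out)
```

Only days 7 and 8 survive dropna: for example, the 00:00 value on 2021-01-07 is the mean of the 00:00 values from 2021-01-01 through 2021-01-07.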

How to perform a rolling average for irregular time intervals in pandas?

I am using NumPy broadcasting:

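# `ori` is not defined here; presumably the question's DataFrame with columns timeCol and dataCol
df=pd.DataFrame({'startTime':np.arange(13),'endTime':np.arange(13)+3})
# column vector of observation times; Series[:,None] is no longer supported in recent pandas
s=ori.timeCol.values[:,None]
# s1[i, j] is True when observation i falls inside window j
s1=(df.startTime.values-s<=0)&(df.endTime.values-s>=0)
# per-window mean: sum of matching values / number of matches (NaN for empty windows)
df['New']=ori.dataCol.dot(s1)/s1.sum(axis=0)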
df
    startTime  endTime  New
0           0        3  5.0
1           1        4  5.0
2           2        5  5.0
3           3        6  NaN
4           4        7  NaN
5           5        8  NaN
6           6        9  NaN
7           7       10  8.0
8           8       11  6.0
9           9       12  6.0
10         10       13  5.0
11         11       14  5.0
12         12       15  6.0
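
A self-contained sketch of the same broadcasting idea. The observations in `ori` and the windows in `win` are made up for illustration (the question's data is not shown here); only the column names follow the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular observations.
ori = pd.DataFrame({'timeCol': [0, 1, 5, 9],
                    'dataCol': [4.0, 6.0, 10.0, 2.0]})
# Hypothetical windows; the last one contains no observation.
win = pd.DataFrame({'startTime': [0, 4, 8, 12],
                    'endTime': [3, 7, 11, 14]})

t = ori['timeCol'].to_numpy()[:, None]            # shape (n_obs, 1)
start = win['startTime'].to_numpy()               # shape (n_win,)
end = win['endTime'].to_numpy()

# Broadcasting (n_obs, 1) against (n_win,) gives an (n_obs, n_win) mask.
inside = (start <= t) & (t <= end)

sums = ori['dataCol'].to_numpy() @ inside.astype(float)
counts = inside.sum(axis=0)

# 0 / 0 yields NaN for empty windows; silence the warning for that case.
with np.errstate(invalid='ignore', divide='ignore'):
    means = sums / counts

win['New'] = means
print(win)
```

The mask has one entry per (observation, window) pair, so memory grows with the product of the two lengths; that is the trade-off for avoiding a Python loop.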

Pandas rolling time window by days instead of individual rows

If your data is never negative, you can roll first and then broadcast each day's maximum back to its rows:

# if your index is not always on the day, e.g. 2017-01-01 01:00:00
# use `pd.Grouper(freq='D')` instead of `level` 
df.rolling('3D').sum().groupby(level='t').transform('max')

Output:

              a
t              
2017-01-01  1.0
2017-01-02  2.0
2017-01-03  3.0
2017-01-04  3.0
2017-01-05  5.0
2017-01-05  5.0
2017-01-05  5.0
2017-01-06  6.0
2017-01-06  6.0
2017-01-07  7.0
2017-01-07  7.0
2017-01-08  5.0

Edit: In the general case, aggregate by the day and map back:

s = df.groupby(pd.Grouper(freq='D')).sum().rolling('3D').sum()
df.index.floor('D').to_series().map(s['a'])

Output:

t
2017-01-01    1.0
2017-01-02    2.0
2017-01-03    3.0
2017-01-04    3.0
2017-01-05    5.0
2017-01-05    5.0
2017-01-05    5.0
2017-01-06    6.0
2017-01-06    6.0
2017-01-07    7.0
2017-01-07    7.0
2017-01-08    5.0
Name: t, dtype: float64
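
A minimal sketch of the aggregate-then-map approach on made-up data (the values below are an assumption, not the question's data), showing that every row of a day receives the same 3-day total:

```python
import pandas as pd

# Hypothetical data: 2017-01-02 has two rows.
idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02',
                      '2017-01-03', '2017-01-04']).rename('t')
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Step 1: one row per calendar day, then a 3-day rolling sum over the days.
daily = df.groupby(pd.Grouper(freq='D')).sum().rolling('3D').sum()

# Step 2: look up each original row's day in the daily result.
per_row = df.index.floor('D').to_series().map(daily['a'])
print(per_row)
```

Because the rolling step runs on daily totals, it does not matter whether the values are positive or negative, which is why this version covers the general case.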

window (bucketing) by time for rolling_* in Pandas

Not sure if you ended up figuring out a solution, but I recently asked a similar question. It was pointed out that pandas 0.19.0 now has support for Time-aware Rolling.

I think that you should be able to perform your rolling calculation on 5 min windows with the below:

df1['VWAP'] = df1['Volume_Scaled_Price'].rolling('5min').sum() / df1['QTY'].rolling('5min').sum()

Also - here is a list of the offset aliases that are currently supported.

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
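
A small sketch of the VWAP line on made-up trades. The prices, quantities and timestamps are assumptions for illustration; the column names follow the snippet above, and Volume_Scaled_Price is taken to mean price times quantity.

```python
import pandas as pd

# Hypothetical trades at irregular times.
idx = pd.to_datetime(['2021-01-04 09:30', '2021-01-04 09:32',
                      '2021-01-04 09:34', '2021-01-04 09:40'])
df1 = pd.DataFrame({'Price': [10.0, 20.0, 30.0, 40.0],
                    'QTY': [1, 1, 2, 4]}, index=idx)
df1['Volume_Scaled_Price'] = df1['Price'] * df1['QTY']

# Time-aware windows: each row looks back over the previous 5 minutes.
df1['VWAP'] = (df1['Volume_Scaled_Price'].rolling('5min').sum()
               / df1['QTY'].rolling('5min').sum())
print(df1)
```

The 09:40 trade stands alone in its window, so its VWAP equals its own price; the 09:34 row averages all three earlier trades weighted by quantity.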

Pandas groupby rolling mean, but only for the most recent row to save calculation time

You can use groupby(...).agg(...). To compute the rolling mean for only the most recent data, sort by time in descending order, take head(3) within each group, and compute its mean.

Use:

new_df = (df.sort_values(by=['time'], ascending = False)
            .groupby('id', as_index = False)
            .agg(
              time = ('time', 'first'), 
              price = ('price', lambda x: x.head(3).values.mean())
             ))

Prints:

>>> new_df
    id   time  price
0  ABC  01:04    100
1  QRS  01:04     25
2  XYZ  01:04     50
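
A variant of the same idea that avoids the lambda: sort ascending, keep the last 3 rows per id with tail(3), then aggregate. This is a sketch on made-up data (ids, times and prices are assumptions), not the question's DataFrame.

```python
import pandas as pd

# Hypothetical data; XYZ has fewer than 3 rows.
df = pd.DataFrame({
    'id': ['ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ'],
    'time': ['01:01', '01:02', '01:03', '01:04', '01:03', '01:04'],
    'price': [90.0, 100.0, 110.0, 90.0, 40.0, 60.0],
})

latest = (df.sort_values('time')
            .groupby('id').tail(3)      # last 3 rows of each id
            .groupby('id')
            .agg(time=('time', 'last'),     # most recent timestamp
                 price=('price', 'mean')))  # mean over those rows
print(latest)
```

Groups with fewer than 3 rows are averaged over whatever rows they have, which matches the behavior of head(3) in the answer above.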
