Pandas: rolling mean by time interval
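Since this question was asked, pandas has gained a time-window capability (time-aware rolling, added in 0.19.0): on a datetime-like index, rolling accepts an offset string such as '2s' as the window. The session below assumes DataFrame and Timestamp have been imported from pandas.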
In [1]: df = DataFrame({'B': range(5)})
In [2]: df.index = [Timestamp('20130101 09:00:00'),
...: Timestamp('20130101 09:00:02'),
...: Timestamp('20130101 09:00:03'),
...: Timestamp('20130101 09:00:05'),
...: Timestamp('20130101 09:00:06')]
In [3]: df
Out[3]:
B
2013-01-01 09:00:00 0
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
In [4]: df.rolling(2, min_periods=1).sum()
Out[4]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 5.0
2013-01-01 09:00:06 7.0
In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 7.0
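The two results above differ only at 09:00:05. An integer window counts rows, while an offset window spans time and, by default, excludes its left edge. Here is a minimal sketch of that behaviour using the same data as the session above. The variable names are illustrative only, and closed='both' is shown as one possible way to include the left edge, not as something the original answer used.

```python
import pandas as pd

idx = pd.to_datetime([
    "2013-01-01 09:00:00",
    "2013-01-01 09:00:02",
    "2013-01-01 09:00:03",
    "2013-01-01 09:00:05",
    "2013-01-01 09:00:06",
])
df = pd.DataFrame({"B": range(5)}, index=idx)

# Integer window: always the current row plus the previous row.
by_rows = df["B"].rolling(2, min_periods=1).sum()

# Offset window: rows whose timestamp lies in (t - 2s, t].
by_time = df["B"].rolling("2s", min_periods=1).sum()

# Same span, but with both edges included: [t - 2s, t].
by_time_closed = df["B"].rolling("2s", closed="both").sum()

print(pd.DataFrame({"rows": by_rows,
                    "2s": by_time,
                    "2s closed=both": by_time_closed}))
```

At 09:00:05 the default '2s' window holds only the current row, because the row at 09:00:03 sits exactly on the excluded left edge.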
pandas: rolling mean on time interval plus grouping on index
My approach:
df = df.assign(day=df['ds time'].dt.normalize(),
hour=df['ds time'].dt.hour)
ret_df = df.merge(df.drop('ds time', axis=1)
.set_index('day')
.groupby(['id','hour']).rolling('7D').mean()
.drop(['hour','id'], axis=1),
on=['id','hour','day'],
how='left',
suffixes=['','_roll']
).drop(['day','hour'], axis=1)
Sample data:
dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')
np.random.seed(1)
df = pd.DataFrame({
'id': np.repeat([6,7], len(dates)),
'ds time': np.tile(dates,2),
'X': np.arange(len(dates)*2),
'Y': np.random.randint(0,10, len(dates)*2)
})
df.head()
Output of ret_df.head():
id ds time X Y X_roll Y_roll
0 6 2020-02-21 00:00:00 0 5 0.0 5.0
1 6 2020-02-21 01:00:00 1 8 1.0 8.0
2 6 2020-02-21 02:00:00 2 9 2.0 9.0
3 6 2020-02-21 03:00:00 3 5 3.0 5.0
4 6 2020-02-21 04:00:00 4 0 4.0 0.0
Calculating the mean for each time of day with a rolling window with pandas
You can pre-filter the dataset to include only the 13 days preceding the dt date, then groupby time of day, take a 7-day rolling window with min_periods=7, take the mean, and dropna to remove dates that have accumulated values for fewer than 7 of the previous days:
# generate sample dataset
ix = pd.date_range('2021-01-01', '2021-05-01', freq='15min')
df = pd.DataFrame({
'Phase1': np.random.uniform(0, 1, len(ix)),
'Phase2': np.random.uniform(0, 1, len(ix)),
'Phase3': np.random.uniform(0, 1, len(ix)),
}, index=ix)
df['Sum'] = df.sum(1)
# set max date
dt = pd.to_datetime('2021-02-14')
# keep only values in [dt - 13 days, dt)
z = df.loc[(df.index >= dt - pd.DateOffset(days=13)) & (df.index < dt)]
# calculate 7-day rolling average for the same time of the day
# for 7 days preceding `dt`
(z
.groupby(z.index.time)
.rolling('7d', min_periods=7)
.mean()
.dropna()
.droplevel(0)
.sort_index())
Output:
Phase1 Phase2 Phase3 Sum
2021-02-07 00:00:00 0.479466 0.731746 0.503017 1.714229
2021-02-07 00:15:00 0.443550 0.423135 0.543204 1.409889
2021-02-07 00:30:00 0.465272 0.626117 0.454462 1.545851
2021-02-07 00:45:00 0.528733 0.433475 0.386822 1.349029
2021-02-07 01:00:00 0.425309 0.360065 0.488509 1.273884
... ... ... ... ...
2021-02-13 22:45:00 0.519717 0.490549 0.524330 1.534596
2021-02-13 23:00:00 0.367935 0.460093 0.373338 1.201366
2021-02-13 23:15:00 0.597424 0.438130 0.478259 1.513813
2021-02-13 23:30:00 0.675142 0.443580 0.330791 1.449514
2021-02-13 23:45:00 0.474604 0.355723 0.596467 1.426794
How to perform a rolling average for irregular time intervals in pandas?
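I am using numpy broadcasting (ori is the original DataFrame from the question, with columns timeCol and dataCol):
df=pd.DataFrame({'startTime':np.arange(13),'endTime':np.arange(13)+3})
# column vector of observation times, shape (n_obs, 1)
s=ori.timeCol.values[:,None]
# s1[i, j] is True when observation i falls inside window j
s1=(df.startTime.values-s<=0)&(df.endTime.values-s>=0)
# per-window sum divided by per-window count; empty windows give NaN (0/0)
df['New']=ori.dataCol.dot(s1)/s1.sum(axis=0)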
df
startTime endTime New
0 0 3 5.0
1 1 4 5.0
2 2 5 5.0
3 3 6 NaN
4 4 7 NaN
5 5 8 NaN
6 6 9 NaN
7 7 10 8.0
8 8 11 6.0
9 9 12 6.0
10 10 13 5.0
11 11 14 5.0
12 12 15 6.0
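Since the answer above relies on the asker's ori frame, here is a self-contained sketch of the same broadcasting idea. The observation values and the windows frame are hypothetical and were not taken from the question. np.divide with where= is used only to avoid the divide-by-zero warning for empty windows.

```python
import numpy as np
import pandas as pd

# Hypothetical irregularly spaced observations (not the question's data).
ori = pd.DataFrame({"timeCol": [0, 1, 2, 7, 9, 12],
                    "dataCol": [4.0, 6.0, 5.0, 8.0, 4.0, 6.0]})
# Hypothetical windows, each closed on both ends.
windows = pd.DataFrame({"startTime": [0, 3, 7, 10],
                        "endTime": [3, 6, 10, 13]})

t = ori["timeCol"].to_numpy()[:, None]          # shape (n_obs, 1)
start = windows["startTime"].to_numpy()         # shape (n_windows,)
end = windows["endTime"].to_numpy()

# inside[i, j] is True when observation i lies in window j.
inside = (start <= t) & (t <= end)

counts = inside.sum(axis=0)
totals = ori["dataCol"].to_numpy() @ inside.astype(float)

# Mean per window; windows without observations stay NaN.
windows["New"] = np.divide(totals, counts,
                           out=np.full(len(counts), np.nan),
                           where=counts > 0)
print(windows)
```

The boolean matrix has n_obs × n_windows entries, so this trades memory for speed and is best suited to moderately sized inputs.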
Pandas rolling time window by days instead of individual rows
If your data is always positive, you can transform after rolling:
# if your index is not always on the day, e.g. 2017-01-01 01:00:00
# use `pd.Grouper(freq='D')` instead of `level`
df.rolling('3D').sum().groupby(level='t').transform('max')
Output:
a
t
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 3.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-06 6.0
2017-01-06 6.0
2017-01-07 7.0
2017-01-07 7.0
2017-01-08 5.0
Edit: In the general case, aggregate by the day and map back:
s = df.groupby(pd.Grouper(freq='D')).sum().rolling('3D').sum()
df.index.floor('D').to_series().map(s['a'])
Output:
t
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 3.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-05 5.0
2017-01-06 6.0
2017-01-06 6.0
2017-01-07 7.0
2017-01-07 7.0
2017-01-08 5.0
Name: t, dtype: float64
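The "aggregate by the day and map back" step can be sketched end to end as follows. The data is hypothetical (the question's frame is not reproduced here), and reindex is used for the mapping as one possible alternative to map.

```python
import pandas as pd

# Hypothetical data: several rows may fall on the same calendar day,
# and one day (2017-01-03) has no rows at all.
idx = pd.to_datetime(["2017-01-01 01:00", "2017-01-02 05:00",
                      "2017-01-02 09:00", "2017-01-04 12:00"])
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# 1) Collapse to one value per day (days without rows sum to 0).
daily = df["a"].groupby(pd.Grouper(freq="D")).sum()

# 2) Roll over whole days instead of individual rows.
rolled = daily.rolling("3D").sum()

# 3) Map each original row back to the rolled value of its day.
df["a_3d"] = rolled.reindex(df.index.floor("D")).to_numpy()
print(df)
```

Both rows on 2017-01-02 receive the same 3-day total, which is the point of rolling by day rather than by row.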
window (bucketing) by time for rolling_* in Pandas
Not sure if you ended up figuring out a solution, but I recently asked a similar question. It was pointed out that pandas 0.19.0 now has support for Time-aware Rolling.
I think that you should be able to perform your rolling calculation on 5 min windows with the below:
df1['VWAP'] = df1['Volume_Scaled_Price'].rolling('5min').sum() / df1['QTY'].rolling('5min').sum()
Also - here is a list of the offset aliases that are currently supported.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
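A small runnable sketch of the VWAP line above. It assumes that Volume_Scaled_Price means price times quantity. That assumption, the Price column, and the sample trades are all hypothetical.

```python
import pandas as pd

idx = pd.to_datetime(["2020-01-01 09:00", "2020-01-01 09:02",
                      "2020-01-01 09:04", "2020-01-01 09:10"])
df1 = pd.DataFrame({"Price": [10.0, 12.0, 11.0, 20.0],
                    "QTY": [1, 3, 2, 5]}, index=idx)

# Assumed definition: price weighted by traded quantity.
df1["Volume_Scaled_Price"] = df1["Price"] * df1["QTY"]

# Rolling 5-minute sums over the DatetimeIndex, then their ratio.
num = df1["Volume_Scaled_Price"].rolling("5min").sum()
den = df1["QTY"].rolling("5min").sum()
df1["VWAP"] = num / den
print(df1)
```

The last trade is more than five minutes after the others, so its window contains only itself and its VWAP equals its own price.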
Pandas groupby rolling mean, but only for the most recent row to save calculation time
You can use groupby.agg. To compute the rolling mean for only the most recent rows, sort by time in descending order, take head(3) within each group, and compute the mean of those values.
Use:
new_df = (df.sort_values(by=['time'], ascending = False)
.groupby('id', as_index = False)
.agg(
time = ('time', 'first'),
price = ('price', lambda x: x.head(3).values.mean())
))
Prints:
>>> new_df
id time price
0 ABC 01:04 100
1 QRS 01:04 25
2 XYZ 01:04 50
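As a sanity check, the shortcut can be compared with the last value of a full per-group rolling mean. This is a sketch on hypothetical data: the ids, times and prices are made up, and a window of 3 rows is assumed.

```python
import pandas as pd

# Hypothetical prices, four ticks per id.
df = pd.DataFrame({
    "id": ["ABC"] * 4 + ["XYZ"] * 4,
    "time": ["01:01", "01:02", "01:03", "01:04"] * 2,
    "price": [70.0, 90.0, 100.0, 110.0, 20.0, 40.0, 50.0, 60.0],
})

# Shortcut: newest rows first, then average the first 3 rows per id.
new_df = (df.sort_values("time", ascending=False)
            .groupby("id", as_index=False)
            .agg(time=("time", "first"),
                 price=("price", lambda x: x.head(3).mean())))

# Reference: full rolling mean per id, keeping only the last value.
full_last = (df.sort_values("time")
               .groupby("id")["price"]
               .rolling(3).mean()
               .groupby(level="id").last())

print(new_df)
print(full_last)
```

The shortcut computes one mean per group instead of one per row, which is where the time saving comes from.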