Pandas: resample timeseries with groupby
In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but TimeGrouper is now deprecated in favor of pd.Grouper.
Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can also group by datetime columns, which you can specify through the key parameter.
You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
Use count to count the number of events in each group:
grouper['Event'].count()
# Location
# 2014-08-25 21:00:00 HK 1
# LDN 1
# 2014-08-25 22:00:00 LDN 2
# Name: Event, dtype: int64
Use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Location HK LDN
# 2014-08-25 21:00:00 1 1
# 2014-08-25 22:00:00 NaN 2
And then use fillna to change the NaNs into zeros.
Putting it all together,
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location HK LDN
2014-08-25 21:00:00 1 1
2014-08-25 22:00:00 0 2
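The whole pipeline can be run as a self-contained sketch. The sample events below are assumed (chosen to reproduce the counts shown); note that recent pandas prefers the lowercase '1h' frequency alias:

```python
import pandas as pd

# Assumed sample data: four events across two locations,
# reproducing the counts in the output above
df = pd.DataFrame(
    {"Location": ["HK", "LDN", "LDN", "LDN"],
     "Event": ["e1", "e2", "e3", "e4"]},
    index=pd.to_datetime(["2014-08-25 21:00:01", "2014-08-25 21:00:02",
                          "2014-08-25 22:00:01", "2014-08-25 22:00:02"]),
)

# group by hour and location, count events, pivot Location to columns, zero-fill
grouper = df.groupby([pd.Grouper(freq="1h"), "Location"])
result = grouper["Event"].count().unstack("Location").fillna(0)
print(result)
```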
Pandas: combine resampling and group by. How do I average values within same groups before summing them up in a time bin?
Use pd.Grouper and aggregate with mean:
print (dataframe.groupby(['Group',pd.Grouper(freq='2D', key='Time')]).mean())
Col1 Col2 Col3
Group Time
A 2020-05-18 10 20 30
2020-05-20 10 20 30
B 2020-05-18 10 20 30
2020-05-20 10 20 30
C 2020-05-18 10 20 30
2020-05-20 10 20 30
And then sum by the second index level, Time:
df = (
    dataframe
    .groupby(['Group', pd.Grouper(freq='2D', key='Time')])
    .mean()
    .groupby(level='Time').sum()  # .sum(level=1) was removed in pandas 2.0
)
print (df)
Col1 Col2 Col3
Time
2020-05-18 30 60 90
2020-05-20 30 60 90
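A self-contained sketch under assumed data (constant columns per group, so each 2-day mean is trivially 10/20/30, matching the intermediate table above):

```python
import pandas as pd

# Assumed sample data: one row per group per day, with constant values
times = pd.to_datetime(["2020-05-18", "2020-05-19", "2020-05-20", "2020-05-21"])
rows = [{"Group": g, "Time": t, "Col1": 10, "Col2": 20, "Col3": 30}
        for g in "ABC" for t in times]
dataframe = pd.DataFrame(rows)

# mean within each (Group, 2-day bin), then sum across groups per bin
per_bin = dataframe.groupby(["Group", pd.Grouper(freq="2D", key="Time")]).mean()
result = per_bin.groupby(level="Time").sum()
print(result)
```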
Retaining time of max with Pandas GroupBy and resample
You can add station to the groupby, which makes it possible to use GroupBy.agg with both max and idxmax:
df = pd_df.groupby(['station', pd.Grouper(freq='D')]).gust_speed.agg(['max','idxmax'])
print (df)
max idxmax
station valid_dt
AMO 2020-05-21 24 2020-05-21 23:00:00
2020-05-22 22 2020-05-22 00:00:00
YSA 2020-05-23 23 2020-05-23 22:00:00
2020-05-24 23 2020-05-24 22:00:00
2020-05-25 19 2020-05-25 00:00:00
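A minimal runnable sketch, with the hourly gust readings below assumed (reconstructed to match the first station's rows in the output):

```python
import pandas as pd

# Assumed sample data: hourly gust speeds for one station across two days
idx = pd.to_datetime(["2020-05-21 22:00", "2020-05-21 23:00",
                      "2020-05-22 00:00", "2020-05-22 01:00"])
pd_df = pd.DataFrame({"station": ["AMO"] * 4,
                      "gust_speed": [20, 24, 22, 18]}, index=idx)
pd_df.index.name = "valid_dt"

# daily max per station, plus the timestamp at which the max occurred
df = pd_df.groupby(["station", pd.Grouper(freq="D")]).gust_speed.agg(["max", "idxmax"])
print(df)
```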
How to resample a Time Series for a specific timerange for each group in a dataframe?
One option is to follow this similar answer: create a function and apply it to each group. You can set a vector of dates to use for reindexing, and put that into a function which will be applied to each group:
START = '01-01-2020'
END = '06-01-2020'
DATE_RANGE = pd.date_range(START, END, freq='MS')
def apply_reindex(df):
    reindexed = df.set_index('date').reindex(DATE_RANGE)
    filled = reindexed.fillna({'group': df.name, 'value1': 0, 'value2': 0})
    filled.index.name = 'date'
    filled = filled.reset_index()
    return filled
There's some annoying index setting/renaming stuff to be able to call reindex on the date column and then return it to being a column.
You can then call the function with:
df.groupby('group').apply(apply_reindex).reset_index(drop=True)
Which produces:
date group value1 value2
0 2020-01-01 A 12.0 2.0
1 2020-02-01 A 11.0 4.0
2 2020-03-01 A 7.0 5.0
3 2020-04-01 A 23.0 3.0
4 2020-05-01 A 0.0 0.0
5 2020-06-01 A 0.0 0.0
6 2020-01-01 B 0.0 0.0
7 2020-02-01 B 0.0 0.0
8 2020-03-01 B 66.0 1.0
9 2020-04-01 B 61.0 8.0
10 2020-05-01 B 0.0 0.0
11 2020-06-01 B 55.0 8.0
12 2020-01-01 C 0.0 0.0
13 2020-02-01 C 0.0 0.0
14 2020-03-01 C 0.0 0.0
15 2020-04-01 C 3.0 0.0
16 2020-05-01 C 0.0 0.0
17 2020-06-01 C 22.0 112.0
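For completeness, a runnable version; the input rows below are assumptions, reconstructed from the non-zero rows of the output:

```python
import pandas as pd

# Assumed input: only the months where each group actually has data
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
                            "2020-03-01", "2020-04-01", "2020-06-01",
                            "2020-04-01", "2020-06-01"]),
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "value1": [12, 11, 7, 23, 66, 61, 55, 3, 22],
    "value2": [2, 4, 5, 3, 1, 8, 8, 0, 112],
})

DATE_RANGE = pd.date_range("01-01-2020", "06-01-2020", freq="MS")

def apply_reindex(g):
    # reindex each group onto the full month-start range, zero-filling the gaps
    reindexed = g.set_index("date").reindex(DATE_RANGE)
    filled = reindexed.fillna({"group": g.name, "value1": 0, "value2": 0})
    filled.index.name = "date"
    return filled.reset_index()

result = df.groupby("group").apply(apply_reindex).reset_index(drop=True)
print(result)
```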
Pandas upsampling using groupby and resample
Create a DatetimeIndex and remove the on parameter from resample:
print (data.set_index('date').groupby('id').resample('D').asfreq())
id
id date
1 2018-01-01 1.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 1.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 1.0
2 2018-01-01 2.0
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 2.0
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 2.0
print (data.set_index('date').groupby('id').resample('D').ffill())
#alternatives for older pandas versions (fillna('pad') and pad() are deprecated)
#print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#print (data.set_index('date').groupby('id').resample('D').pad())
id
id date
1 2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 1
2018-01-09 1
2018-01-10 1
2 2018-01-01 2
2018-01-02 2
2018-01-03 2
2018-01-04 2
2018-01-05 2
2018-01-06 2
2018-01-07 2
2018-01-08 2
2018-01-09 2
2018-01-10 2
EDIT:
If you want to use sum with missing values, you need the min_count=1 parameter - see sum:
min_count : int, default 0
    The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
    New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
print (data.groupby('id').resample('D', on='date').sum(min_count=1))
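The upsample-and-fill step can be sketched as a self-contained example; the sample data below is assumed, with an extra value column added so the forward fill is visible:

```python
import pandas as pd

# Assumed sample data: one reading per id every few days
data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-10"] * 2),
    "value": [10, 11, 12, 20, 21, 22],
})

# upsample each id to daily frequency, forward-filling between readings
filled = data.set_index("date").groupby("id")["value"].resample("D").ffill()
print(filled)
```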
Pandas time series resampling with month and with group by column
You can reset_index to turn the cleareddate field into a column in your dataframe. I'd create a new column with the month, and then you can perform a straightforward groupby on that.
df.reset_index(inplace=True)
df['month'] = df.cleareddate.dt.month
df.groupby(['month','bts_name']).agg('sum').duration
Notes:
- I have assumed that the index is of type datetime. If this is not the case, then add a df.cleareddate = pd.to_datetime(df.cleareddate) line after the reset_index line.
- Note that months in two different years will be grouped as the same. Do you need to group by year as well? If so, add another column for year and add that term to your groupby columns.
EDIT:
After discussion in comments with @sriman I have included another way of achieving the above with the pandas resample method.
# your data
df = pd.DataFrame({
'bts_name': ['1002_NUc_Marathalli','1002_NUc_Marathalli',
'1002_NUc_Marathalli','1002_NUc_Marathalli',
'1003_IU2_Munnekolalu'],
'duration': [95,188,1332,940,73]
}, index=pd.to_datetime(['2019-01-19','2019-01-21',
'2019-02-11','2019-04-12','2019-01-11']))
# solution
def resample(group):
    # select the numeric column so newer pandas does not try to sum the strings
    return group.resample('M')[['duration']].sum()
result = df.groupby('bts_name').apply(resample)
# result
print(result)
duration
bts_name
1002_NUc_Marathalli 2019-01-31 283
2019-02-28 1332
2019-03-31 0
2019-04-30 940
1003_IU2_Munnekolalu 2019-01-31 73
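Using the same sample data, the first (month-column) approach can be run end to end; rename_axis().reset_index() is used here only to give the index column an explicit name:

```python
import pandas as pd

# same sample data as above
df = pd.DataFrame({
    "bts_name": ["1002_NUc_Marathalli", "1002_NUc_Marathalli",
                 "1002_NUc_Marathalli", "1002_NUc_Marathalli",
                 "1003_IU2_Munnekolalu"],
    "duration": [95, 188, 1332, 940, 73],
}, index=pd.to_datetime(["2019-01-19", "2019-01-21",
                         "2019-02-11", "2019-04-12", "2019-01-11"]))

# move the datetime index into a named column, derive the month, group and sum
df = df.rename_axis("cleareddate").reset_index()
df["month"] = df.cleareddate.dt.month
result = df.groupby(["month", "bts_name"])["duration"].sum()
print(result)
```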
Groupby and resample timeseries so date ranges are consistent
Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.
Set the index, then use pandas.MultiIndex.from_product to produce the Cartesian product of values. I also use fill_value=0 to fill in those missing values.
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
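A runnable version of this first variant, with the input rows assumed (reconstructed from the non-zero rows of the output):

```python
import pandas as pd

# Assumed input: only the (date, group) pairs that actually have values
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02",
                            "2010-01-03", "2010-01-03", "2010-01-06"]),
    "group": [1, 2, 1, 1, 2, 1],
    "value": [1, 5, 2, 3, 6, 4],
})

# Cartesian product of all dates and all groups, zero-filling missing pairs
d = df.set_index(["date", "group"])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names,
)
result = d.reindex(midx, fill_value=0).reset_index()
print(result)
```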
Or
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()
date group value
0 2010-01-01 1 1.0
1 2010-01-01 2 5.0
2 2010-01-02 1 2.0
3 2010-01-02 2 NaN
4 2010-01-03 1 3.0
5 2010-01-03 2 6.0
6 2010-01-04 1 NaN
7 2010-01-04 2 NaN
8 2010-01-05 1 NaN
9 2010-01-05 2 NaN
10 2010-01-06 1 4.0
11 2010-01-06 2 NaN
Another dance we could do is a cleaned-up version of OP's attempt. Again I use fill_value=0 to fill in missing values. We could leave that out to produce the NaNs.
df.set_index(['date', 'group']) \
  .unstack(fill_value=0) \
  .asfreq('D', fill_value=0) \
  .stack().reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
Or
df.set_index(['date', 'group']) \
  .unstack() \
  .asfreq('D') \
  .stack(dropna=False).reset_index()
date group value
0 2010-01-01 1 1.0
1 2010-01-01 2 5.0
2 2010-01-02 1 2.0
3 2010-01-02 2 NaN
4 2010-01-03 1 3.0
5 2010-01-03 2 6.0
6 2010-01-04 1 NaN
7 2010-01-04 2 NaN
8 2010-01-05 1 NaN
9 2010-01-05 2 NaN
10 2010-01-06 1 4.0
11 2010-01-06 2 NaN
Resample dataframe with specific start/end dates along with a groupby
I am using reindex here; the key is setting up the MultiIndex:
df.index = pd.to_datetime(df.index).date
df = df.groupby([df.index, df['txn_type'], df['cust_id']]).agg({'txn_amt': 'sum'}).reset_index(level=[1, 2])
drange = pd.date_range(end=df.index.max(), periods=5)
idx = pd.MultiIndex.from_product([drange, df.cust_id.unique(), df.txn_type.unique()])
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])
Newdf
level_1 level_2 txn_amt
2019-03-03 100 Credit 0
2019-03-03 100 Debit 0
2019-03-03 101 Credit 0
2019-03-03 101 Debit 0
2019-03-04 100 Credit 0
2019-03-04 100 Debit 0
2019-03-04 101 Credit 0
2019-03-04 101 Debit 0
2019-03-05 100 Credit 25000
2019-03-05 100 Debit 0
2019-03-05 101 Credit 25000
2019-03-05 101 Debit 0
2019-03-06 100 Credit 0
2019-03-06 100 Debit 4000
2019-03-06 101 Credit 0
2019-03-06 101 Debit 7000
2019-03-07 100 Credit 0
2019-03-07 100 Debit 1000
2019-03-07 101 Credit 0
2019-03-07 101 Debit 0
Mean Aggregations using pandas GroupBy and Time Series resampling
The following should also work. Based on your question, 'sum' displays the sum of Volume based on year, and 'mean' displays the mean of volumes based on the daily mean, both being grouped by Session and DateTime. (Just some groupby chaining with joins.)
import pandas as pd
data = {
'DateTime':['2020-12-16 08:00:00','2020-12-16 08:30:00','2020-12-16 09:00:00','2020-12-16 09:30:00','2020-12-17 08:00:00','2020-12-17 08:30:00','2020-12-17 09:00:00','2020-12-17 09:30:00','2020-12-18 08:00:00','2020-12-18 08:30:00','2020-12-18 09:00:00','2020-12-18 09:30:00','2019-11-18 08:00:00','2019-11-18 08:30:00','2019-11-18 09:00:00','2019-11-18 09:30:00'],
'Volume':[1000,500,1000,3000,2000,2000,2000,2000,1000,1000,1000,1000,1000,1000,1000,1000],
'Session':['PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH']
}
df = pd.DataFrame(data)
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.index = pd.to_datetime(df['DateTime'])
#See below code
x = (df.groupby([df.index.strftime('%Y'), 'Session', df.index.strftime('%Y-%m-%d')])
       .agg({'Volume': ['sum', 'mean']})
       .groupby(['DateTime', 'Session'], level=2)
       .agg(['sum', 'mean']))
x['Volume'].drop('mean', axis=1, level=0)
Groupby and resample at 1min frequency using forward fill in Python
Use a custom function that defines how many new rows are needed via Timedelta with date_range and DataFrame.reindex:
def f(x):
    new = x.index[0] + pd.Timedelta(5, unit='Min')
    r = pd.date_range(x.index[0], new, freq='Min')
    return x.reindex(r, method='ffill')
df = (df.reset_index()
        .set_index('timestamp')
        .groupby(['index', 'id'], sort=False)['data']
        .apply(f)
        .reset_index(level=0, drop=True)
        .rename_axis(['id', 'timestamp'])
        .reset_index()
      )
print (df)
id timestamp data
0 1 2017-01-02 13:14:53.040 10.0
1 1 2017-01-02 13:15:53.040 10.0
2 1 2017-01-02 13:16:53.040 10.0
3 1 2017-01-02 13:17:53.040 10.0
4 1 2017-01-02 13:18:53.040 10.0
5 1 2017-01-02 13:19:53.040 10.0
6 1 2017-01-02 12:04:43.240 11.0
7 1 2017-01-02 12:05:43.240 11.0
8 1 2017-01-02 12:06:43.240 11.0
9 1 2017-01-02 12:07:43.240 11.0
10 1 2017-01-02 12:08:43.240 11.0
11 1 2017-01-02 12:09:43.240 11.0
12 2 2017-01-02 15:22:06.540 1.0
13 2 2017-01-02 15:23:06.540 1.0
14 2 2017-01-02 15:24:06.540 1.0
15 2 2017-01-02 15:25:06.540 1.0
16 2 2017-01-02 15:26:06.540 1.0
17 2 2017-01-02 15:27:06.540 1.0
18 2 2017-01-03 13:55:34.240 2.0
19 2 2017-01-03 13:56:34.240 2.0
20 2 2017-01-03 13:57:34.240 2.0
21 2 2017-01-03 13:58:34.240 2.0
22 2 2017-01-03 13:59:34.240 2.0
23 2 2017-01-03 14:00:34.240 2.0
because if you use ffill directly, the output is different:
df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id timestamp
1 2017-01-02 12:04:00 NaN
2017-01-02 12:05:00 11.0
2017-01-02 12:06:00 11.0
2017-01-02 12:07:00 11.0
2017-01-02 12:08:00 11.0
2 2017-01-03 13:51:00 1.0
2017-01-03 13:52:00 1.0
2017-01-03 13:53:00 1.0
2017-01-03 13:54:00 1.0
2017-01-03 13:55:00 1.0
Name: data, Length: 1425, dtype: float64
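The custom-function approach can be sketched end to end; the sample rows below are assumed, and pd.Timedelta(minutes=5) / freq='min' are used for current pandas:

```python
import pandas as pd

# Assumed sample data: one reading per row; duplicate ids are kept apart by
# grouping on the original row number (the 'index' column after reset_index)
df = pd.DataFrame({
    "id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2017-01-02 13:14:53.040",
                                 "2017-01-02 12:04:43.240",
                                 "2017-01-02 15:22:06.540"]),
    "data": [10.0, 11.0, 1.0],
})

def f(x):
    # build six minute-spaced stamps from the first timestamp, forward-fill
    end = x.index[0] + pd.Timedelta(minutes=5)
    r = pd.date_range(x.index[0], end, freq="min")
    return x.reindex(r, method="ffill")

out = (df.reset_index()
         .set_index("timestamp")
         .groupby(["index", "id"], sort=False)["data"]
         .apply(f)
         .reset_index(level=0, drop=True)
         .rename_axis(["id", "timestamp"])
         .reset_index())
print(out)
```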