Pandas: Resample Timeseries with Groupby

In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead: the syntax is largely the same, but pd.TimeGrouper was deprecated in favor of pd.Grouper and has since been removed from pandas.

Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns which you can specify through the key parameter.
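
For example, a minimal sketch (the column and value names here are illustrative, not from the original question):

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01 09:15', '2024-01-01 09:45',
                            '2024-01-01 10:20']),
    'Value': [1, 2, 3],
})

# group by the 'Date' column (not the index) in hourly bins
print (df.groupby(pd.Grouper(key='Date', freq='1H'))['Value'].sum())
# 2024-01-01 09:00:00    3
# 2024-01-01 10:00:00    3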


You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])

use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.


Putting it all together,

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
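
As an aside, unstack accepts a fill_value argument, so the fillna step can be folded in (and the counts stay integers):

result = grouper['Event'].count().unstack('Location', fill_value=0)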

Pandas: combine resampling and group by. How do I average values within the same groups before summing them up in a time bin?

Use Grouper and aggregate with mean:

print (dataframe.groupby(['Group', pd.Grouper(freq='2D', key='Time')]).mean())
                  Col1  Col2  Col3
Group Time
A     2020-05-18    10    20    30
      2020-05-20    10    20    30
B     2020-05-18    10    20    30
      2020-05-20    10    20    30
C     2020-05-18    10    20    30
      2020-05-20    10    20    30

And then sum by the second level, Time:

df = (
    dataframe
      .groupby(['Group', pd.Grouper(freq='2D', key='Time')])
      .mean()
      .sum(level=1)
)

print (df)
            Col1  Col2  Col3
Time
2020-05-18    30    60    90
2020-05-20    30    60    90
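
Note that summing by level (.sum(level=1)) was deprecated in pandas 1.3 and removed in 2.0; on current pandas, the equivalent is an explicit groupby on the index level:

df = (
    dataframe
      .groupby(['Group', pd.Grouper(freq='2D', key='Time')])
      .mean()
      .groupby(level='Time').sum()   # replaces .sum(level=1)
)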

Retaining time of max with Pandas GroupBy and resample

You can add station to the groupby, then use GroupBy.agg with both max and idxmax:

df = pd_df.groupby(['station', pd.Grouper(freq='D')]).gust_speed.agg(['max','idxmax'])
print (df)
                    max              idxmax
station valid_dt
AMO     2020-05-21   24 2020-05-21 23:00:00
        2020-05-22   22 2020-05-22 00:00:00
YSA     2020-05-23   23 2020-05-23 22:00:00
        2020-05-24   23 2020-05-24 22:00:00
        2020-05-25   19 2020-05-25 00:00:00
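
If you then need the full rows at the times of the maxima, a minimal sketch (assuming the idxmax timestamps are labels in pd_df's DatetimeIndex):

# select the original rows at the timestamps where each daily max occurred
rows_at_max = pd_df.loc[df['idxmax']]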

How to resample a Time Series for a specific timerange for each group in a dataframe?

One option is to set a vector of dates to use for reindexing, put that into a function, and apply the function to each group:

START = '01-01-2020'
END = '06-01-2020'
DATE_RANGE = pd.date_range(START, END, freq='MS')

def apply_reindex(df):
    reindexed = df.set_index('date').reindex(DATE_RANGE)
    filled = reindexed.fillna({'group': df.name, 'value1': 0, 'value2': 0})
    filled.index.name = 'date'
    filled = filled.reset_index()
    return filled

There's some annoying index setting/renaming stuff to be able to call reindex on the date column and then return it to be a column.

You can then call the function with:

df.groupby('group').apply(apply_reindex).reset_index(drop=True)

Which produces:

          date group  value1  value2
0   2020-01-01     A    12.0     2.0
1   2020-02-01     A    11.0     4.0
2   2020-03-01     A     7.0     5.0
3   2020-04-01     A    23.0     3.0
4   2020-05-01     A     0.0     0.0
5   2020-06-01     A     0.0     0.0
6   2020-01-01     B     0.0     0.0
7   2020-02-01     B     0.0     0.0
8   2020-03-01     B    66.0     1.0
9   2020-04-01     B    61.0     8.0
10  2020-05-01     B     0.0     0.0
11  2020-06-01     B    55.0     8.0
12  2020-01-01     C     0.0     0.0
13  2020-02-01     C     0.0     0.0
14  2020-03-01     C     0.0     0.0
15  2020-04-01     C     3.0     0.0
16  2020-05-01     C     0.0     0.0
17  2020-06-01     C    22.0   112.0

Pandas upsampling using groupby and resample

Create a DatetimeIndex and remove the on parameter from resample:

print (data.set_index('date').groupby('id').resample('D').asfreq())
                id
id date
1  2018-01-01  1.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  1.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  1.0
2  2018-01-01  2.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  2.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  2.0

print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
               id
id date
1  2018-01-01   1
   2018-01-02   1
   2018-01-03   1
   2018-01-04   1
   2018-01-05   1
   2018-01-06   1
   2018-01-07   1
   2018-01-08   1
   2018-01-09   1
   2018-01-10   1
2  2018-01-01   2
   2018-01-02   2
   2018-01-03   2
   2018-01-04   2
   2018-01-05   2
   2018-01-06   2
   2018-01-07   2
   2018-01-08   2
   2018-01-09   2
   2018-01-10   2

EDIT:

If you want to use sum with missing values, you need the min_count=1 parameter of sum:

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
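
A quick illustration of the difference:

import numpy as np
import pandas as pd

print (pd.Series([np.nan]).sum())             # 0.0
print (pd.Series([np.nan]).sum(min_count=1))  # nan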

print (data.groupby('id').resample('D', on='date').sum(min_count=1))

Pandas time series resampling with month and with group by column

You can reset_index to set the cleareddate field as a column in your dataframe. I'd create a new column with month and then you can perform a straightforward groupby on that.

df.reset_index(inplace=True)
df['month'] = df.cleareddate.dt.month
df.groupby(['month','bts_name']).agg('sum').duration

Notes:

  1. I have assumed that the index is of type datetime. If this is not the case, add the line df.cleareddate = pd.to_datetime(df.cleareddate) after the reset_index line.
  2. Note that months in two different years will be grouped together. Do you need to group by year as well? If so, add another column for year and add that term to your groupby columns, as sketched below.
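
For example, a minimal sketch of grouping by year as well (assuming cleareddate is already datetime):

df['year'] = df.cleareddate.dt.year
df['month'] = df.cleareddate.dt.month
df.groupby(['year', 'month', 'bts_name']).duration.sum()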

EDIT:
After discussion in comments with @sriman I have included another way of achieving the above with the pandas resample method.

# your data
df = pd.DataFrame({
    'bts_name': ['1002_NUc_Marathalli', '1002_NUc_Marathalli',
                 '1002_NUc_Marathalli', '1002_NUc_Marathalli',
                 '1003_IU2_Munnekolalu'],
    'duration': [95, 188, 1332, 940, 73]
}, index=pd.to_datetime(['2019-01-19', '2019-01-21',
                         '2019-02-11', '2019-04-12', '2019-01-11']))

# solution
def resample(group):
    return group.resample('M').sum()

result = df.groupby('bts_name').apply(resample)

# result
print(result)

                                 duration
bts_name
1002_NUc_Marathalli  2019-01-31       283
                     2019-02-28      1332
                     2019-03-31         0
                     2019-04-30       940
1003_IU2_Munnekolalu 2019-01-31        73

Groupby and resample timeseries so date ranges are consistent

Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.


Set the index then use pandas.MultiIndex.from_product to produce the Cartesian product of values. I also use fill_value=0 to fill in those missing values.

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN

Another dance we could do is a cleaned up version of OP's attempt. Again I use fill_value=0 to fill in missing values. We could leave that out to produce the NaN.

df.set_index(['date', 'group']) \
  .unstack(fill_value=0) \
  .asfreq('D', fill_value=0) \
  .stack().reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

df.set_index(['date', 'group']) \
  .unstack() \
  .asfreq('D') \
  .stack(dropna=False).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN

Resample dataframe with specific start/end dates along with a groupby

I am using reindex here; the key is setting up the MultiIndex:

df.index = pd.to_datetime(df.index).date
df = df.groupby([df.index, df['txn_type'], df['cust_id']]).agg({'txn_amt': 'sum'}).reset_index(level=[1, 2])
drange = pd.date_range(end=df.index.max(), periods=5)
idx = pd.MultiIndex.from_product([drange, df.cust_id.unique(), df.txn_type.unique()])
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])
Newdf
           level_1 level_2  txn_amt
2019-03-03     100  Credit        0
2019-03-03     100   Debit        0
2019-03-03     101  Credit        0
2019-03-03     101   Debit        0
2019-03-04     100  Credit        0
2019-03-04     100   Debit        0
2019-03-04     101  Credit        0
2019-03-04     101   Debit        0
2019-03-05     100  Credit    25000
2019-03-05     100   Debit        0
2019-03-05     101  Credit    25000
2019-03-05     101   Debit        0
2019-03-06     100  Credit        0
2019-03-06     100   Debit     4000
2019-03-06     101  Credit        0
2019-03-06     101   Debit     7000
2019-03-07     100  Credit        0
2019-03-07     100   Debit     1000
2019-03-07     101  Credit        0
2019-03-07     101   Debit        0
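
A small tweak worth considering: passing names to MultiIndex.from_product avoids the generic level_1/level_2 column names after reset_index. A sketch:

idx = pd.MultiIndex.from_product(
    [drange, df.cust_id.unique(), df.txn_type.unique()],
    names=['date', 'cust_id', 'txn_type']   # name the levels up front
)
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])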

Mean Aggregations using pandas GroupBy and Time Series resampling

The following should also work. Based on your question, 'sum' shows the sum of Volume per year and 'mean' shows the mean of the daily volumes, both grouped by 'Session' and 'DateTime' (just some groupby chaining with joins).

import pandas as pd

data = {
    'DateTime': ['2020-12-16 08:00:00', '2020-12-16 08:30:00', '2020-12-16 09:00:00', '2020-12-16 09:30:00',
                 '2020-12-17 08:00:00', '2020-12-17 08:30:00', '2020-12-17 09:00:00', '2020-12-17 09:30:00',
                 '2020-12-18 08:00:00', '2020-12-18 08:30:00', '2020-12-18 09:00:00', '2020-12-18 09:30:00',
                 '2019-11-18 08:00:00', '2019-11-18 08:30:00', '2019-11-18 09:00:00', '2019-11-18 09:30:00'],
    'Volume': [1000, 500, 1000, 3000, 2000, 2000, 2000, 2000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000],
    'Session': ['PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH']
}

df = pd.DataFrame(data)
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.index = pd.to_datetime(df['DateTime'])

# see below
x = (df.groupby([df.index.strftime('%Y'), 'Session', df.index.strftime('%Y-%m-%d')])
       .agg({'Volume': ['sum', 'mean']})
       .groupby(['DateTime', 'Session'], level=2)
       .agg(['sum', 'mean']))
x['Volume'].drop('mean', axis=1, level=0)

Groupby and resample at 1min frequency using forward fill in Python

Use a custom function that defines how many new rows are needed, via Timedelta with date_range and DataFrame.reindex:

def f(x):
    new = x.index[0] + pd.Timedelta(5, unit='Min')
    r = pd.date_range(x.index[0], new, freq='Min')
    return x.reindex(r, method='ffill')

df = (df.reset_index()
        .set_index('timestamp')
        .groupby(['index', 'id'], sort=False)['data']
        .apply(f)
        .reset_index(level=0, drop=True)
        .rename_axis(['id', 'timestamp'])
        .reset_index()
      )


print (df)
    id               timestamp  data
0    1 2017-01-02 13:14:53.040  10.0
1    1 2017-01-02 13:15:53.040  10.0
2    1 2017-01-02 13:16:53.040  10.0
3    1 2017-01-02 13:17:53.040  10.0
4    1 2017-01-02 13:18:53.040  10.0
5    1 2017-01-02 13:19:53.040  10.0
6    1 2017-01-02 12:04:43.240  11.0
7    1 2017-01-02 12:05:43.240  11.0
8    1 2017-01-02 12:06:43.240  11.0
9    1 2017-01-02 12:07:43.240  11.0
10   1 2017-01-02 12:08:43.240  11.0
11   1 2017-01-02 12:09:43.240  11.0
12   2 2017-01-02 15:22:06.540   1.0
13   2 2017-01-02 15:23:06.540   1.0
14   2 2017-01-02 15:24:06.540   1.0
15   2 2017-01-02 15:25:06.540   1.0
16   2 2017-01-02 15:26:06.540   1.0
17   2 2017-01-02 15:27:06.540   1.0
18   2 2017-01-03 13:55:34.240   2.0
19   2 2017-01-03 13:56:34.240   2.0
20   2 2017-01-03 13:57:34.240   2.0
21   2 2017-01-03 13:58:34.240   2.0
22   2 2017-01-03 13:59:34.240   2.0
23   2 2017-01-03 14:00:34.240   2.0

because if ffill is used with resample directly, the output is different - resample snaps the timestamps to whole minutes and fills the entire span of each group, rather than just five rows per original observation:

df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id  timestamp
1   2017-01-02 12:04:00     NaN
    2017-01-02 12:05:00    11.0
    2017-01-02 12:06:00    11.0
    2017-01-02 12:07:00    11.0
    2017-01-02 12:08:00    11.0
                            ...
2   2017-01-03 13:51:00     1.0
    2017-01-03 13:52:00     1.0
    2017-01-03 13:53:00     1.0
    2017-01-03 13:54:00     1.0
    2017-01-03 13:55:00     1.0
Name: data, Length: 1425, dtype: float64

