Pandas: Resample Timeseries with Groupby

In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead: the syntax is largely the same, but pd.TimeGrouper was deprecated in favor of pd.Grouper and has since been removed from pandas.

Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns which you can specify through the key parameter.
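
For example, a minimal sketch (the column and value names here are illustrative, not from the original question):

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01 09:15', '2024-01-01 09:45',
                            '2024-01-01 10:20']),
    'Value': [1, 2, 3],
})

# group by the 'Date' column (not the index) in hourly bins
print (df.groupby(pd.Grouper(key='Date', freq='1H'))['Value'].sum())
# 2024-01-01 09:00:00    3
# 2024-01-01 10:00:00    3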


You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])

use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.


Putting it all together,

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
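
As an aside, unstack accepts a fill_value argument, so the fillna step can be folded in (and the counts stay integers):

result = grouper['Event'].count().unstack('Location', fill_value=0)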

Pandas: combine resampling and group by. How do I average values within the same groups before summing them up in a time bin?

Use Grouper and aggregate with mean:

print (dataframe.groupby(['Group', pd.Grouper(freq='2D', key='Time')]).mean())
                  Col1  Col2  Col3
Group Time
A     2020-05-18    10    20    30
      2020-05-20    10    20    30
B     2020-05-18    10    20    30
      2020-05-20    10    20    30
C     2020-05-18    10    20    30
      2020-05-20    10    20    30

And then sum by the second level, Time:

df = (
    dataframe
      .groupby(['Group', pd.Grouper(freq='2D', key='Time')])
      .mean()
      .sum(level=1)
)

print (df)
            Col1  Col2  Col3
Time
2020-05-18    30    60    90
2020-05-20    30    60    90
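
Note that summing by level (.sum(level=1)) was deprecated in pandas 1.3 and removed in 2.0; on current pandas, the equivalent is an explicit groupby on the index level:

df = (
    dataframe
      .groupby(['Group', pd.Grouper(freq='2D', key='Time')])
      .mean()
      .groupby(level='Time').sum()   # replaces .sum(level=1)
)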

Retaining time of max with Pandas GroupBy and resample

You can add station to the groupby, then use GroupBy.agg with both max and idxmax:

df = pd_df.groupby(['station', pd.Grouper(freq='D')]).gust_speed.agg(['max','idxmax'])
print (df)
                    max              idxmax
station valid_dt
AMO     2020-05-21   24 2020-05-21 23:00:00
        2020-05-22   22 2020-05-22 00:00:00
YSA     2020-05-23   23 2020-05-23 22:00:00
        2020-05-24   23 2020-05-24 22:00:00
        2020-05-25   19 2020-05-25 00:00:00
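
If you then need the full rows at the times of the maxima, a minimal sketch (assuming the idxmax timestamps are labels in pd_df's DatetimeIndex):

# select the original rows at the timestamps where each daily max occurred
rows_at_max = pd_df.loc[df['idxmax']]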

How to resample a Time Series for a specific timerange for each group in a dataframe?

One option is to set a vector of dates to use for reindexing, put that into a function, and apply the function to each group:

START = '01-01-2020'
END = '06-01-2020'
DATE_RANGE = pd.date_range(START, END, freq='MS')

def apply_reindex(df):
    reindexed = df.set_index('date').reindex(DATE_RANGE)
    filled = reindexed.fillna({'group': df.name, 'value1': 0, 'value2': 0})
    filled.index.name = 'date'
    filled = filled.reset_index()
    return filled

There's some annoying index setting/renaming stuff to be able to call reindex on the date column and then return it to be a column.

You can then call the function with:

df.groupby('group').apply(apply_reindex).reset_index(drop=True)

Which produces:

          date group  value1  value2
0   2020-01-01     A    12.0     2.0
1   2020-02-01     A    11.0     4.0
2   2020-03-01     A     7.0     5.0
3   2020-04-01     A    23.0     3.0
4   2020-05-01     A     0.0     0.0
5   2020-06-01     A     0.0     0.0
6   2020-01-01     B     0.0     0.0
7   2020-02-01     B     0.0     0.0
8   2020-03-01     B    66.0     1.0
9   2020-04-01     B    61.0     8.0
10  2020-05-01     B     0.0     0.0
11  2020-06-01     B    55.0     8.0
12  2020-01-01     C     0.0     0.0
13  2020-02-01     C     0.0     0.0
14  2020-03-01     C     0.0     0.0
15  2020-04-01     C     3.0     0.0
16  2020-05-01     C     0.0     0.0
17  2020-06-01     C    22.0   112.0

Pandas upsampling using groupby and resample

Create a DatetimeIndex and remove the on parameter from resample:

print (data.set_index('date').groupby('id').resample('D').asfreq())
                id
id date
1  2018-01-01  1.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  1.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  1.0
2  2018-01-01  2.0
   2018-01-02  NaN
   2018-01-03  NaN
   2018-01-04  NaN
   2018-01-05  2.0
   2018-01-06  NaN
   2018-01-07  NaN
   2018-01-08  NaN
   2018-01-09  NaN
   2018-01-10  2.0

print (data.set_index('date').groupby('id').resample('D').fillna('pad'))
#alternatives
#print (data.set_index('date').groupby('id').resample('D').ffill())
#print (data.set_index('date').groupby('id').resample('D').pad())
               id
id date
1  2018-01-01   1
   2018-01-02   1
   2018-01-03   1
   2018-01-04   1
   2018-01-05   1
   2018-01-06   1
   2018-01-07   1
   2018-01-08   1
   2018-01-09   1
   2018-01-10   1
2  2018-01-01   2
   2018-01-02   2
   2018-01-03   2
   2018-01-04   2
   2018-01-05   2
   2018-01-06   2
   2018-01-07   2
   2018-01-08   2
   2018-01-09   2
   2018-01-10   2

EDIT:

If you want to use sum with missing values, you need the min_count=1 parameter of sum:

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
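
A quick illustration of the difference:

import numpy as np
import pandas as pd

print (pd.Series([np.nan]).sum())             # 0.0
print (pd.Series([np.nan]).sum(min_count=1))  # nan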

print (data.groupby('id').resample('D', on='date').sum(min_count=1))

Pandas time series resampling with month and with group by column

You can reset_index to set the cleareddate field as a column in your dataframe. I'd create a new column with month and then you can perform a straightforward groupby on that.

df.reset_index(inplace=True)
df['month'] = df.cleareddate.dt.month
df.groupby(['month','bts_name']).agg('sum').duration

Notes:

  1. I have assumed that the index is of type datetime. If this is not the case, add the line df.cleareddate = pd.to_datetime(df.cleareddate) after the reset_index line.
  2. Note that months in two different years will be grouped together. Do you need to group by year as well? If so, add another column for year and add that term to your groupby columns, as sketched below.
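
For example, a minimal sketch of grouping by year as well (assuming cleareddate is already datetime):

df['year'] = df.cleareddate.dt.year
df['month'] = df.cleareddate.dt.month
df.groupby(['year', 'month', 'bts_name']).duration.sum()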

EDIT:
After discussion in comments with @sriman I have included another way of achieving the above with the pandas resample method.

# your data
df = pd.DataFrame({
    'bts_name': ['1002_NUc_Marathalli', '1002_NUc_Marathalli',
                 '1002_NUc_Marathalli', '1002_NUc_Marathalli',
                 '1003_IU2_Munnekolalu'],
    'duration': [95, 188, 1332, 940, 73]
}, index=pd.to_datetime(['2019-01-19', '2019-01-21',
                         '2019-02-11', '2019-04-12', '2019-01-11']))

# solution
def resample(group):
    return group.resample('M').sum()

result = df.groupby('bts_name').apply(resample)

# result
print(result)

                                 duration
bts_name
1002_NUc_Marathalli  2019-01-31       283
                     2019-02-28      1332
                     2019-03-31         0
                     2019-04-30       940
1003_IU2_Munnekolalu 2019-01-31        73

Groupby and resample timeseries so date ranges are consistent

Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.


Set the index then use pandas.MultiIndex.from_product to produce the Cartesian product of values. I also use fill_value=0 to fill in those missing values.

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN

Another dance we could do is a cleaned up version of OP's attempt. Again I use fill_value=0 to fill in missing values. We could leave that out to produce the NaN.

df.set_index(['date', 'group']) \
  .unstack(fill_value=0) \
  .asfreq('D', fill_value=0) \
  .stack().reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

df.set_index(['date', 'group']) \
  .unstack() \
  .asfreq('D') \
  .stack(dropna=False).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN

Resample dataframe with specific start/end dates along with a groupby

I am using reindex here; the key is setting up the MultiIndex:

df.index = pd.to_datetime(df.index).date
df = df.groupby([df.index, df['txn_type'], df['cust_id']]).agg({'txn_amt': 'sum'}).reset_index(level=[1, 2])
drange = pd.date_range(end=df.index.max(), periods=5)
idx = pd.MultiIndex.from_product([drange, df.cust_id.unique(), df.txn_type.unique()])
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])
Newdf
           level_1 level_2  txn_amt
2019-03-03     100  Credit        0
2019-03-03     100   Debit        0
2019-03-03     101  Credit        0
2019-03-03     101   Debit        0
2019-03-04     100  Credit        0
2019-03-04     100   Debit        0
2019-03-04     101  Credit        0
2019-03-04     101   Debit        0
2019-03-05     100  Credit    25000
2019-03-05     100   Debit        0
2019-03-05     101  Credit    25000
2019-03-05     101   Debit        0
2019-03-06     100  Credit        0
2019-03-06     100   Debit     4000
2019-03-06     101  Credit        0
2019-03-06     101   Debit     7000
2019-03-07     100  Credit        0
2019-03-07     100   Debit     1000
2019-03-07     101  Credit        0
2019-03-07     101   Debit        0
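
A small tweak worth considering: passing names to MultiIndex.from_product avoids the generic level_1/level_2 column names after reset_index. A sketch:

idx = pd.MultiIndex.from_product(
    [drange, df.cust_id.unique(), df.txn_type.unique()],
    names=['date', 'cust_id', 'txn_type']   # name the levels up front
)
Newdf = df.set_index(['cust_id', 'txn_type'], append=True).reindex(idx, fill_value=0).reset_index(level=[1, 2])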

Mean Aggregations using pandas GroupBy and Time Series resampling

The following should also work. Based on your question, 'sum' shows the sum of Volume per year and 'mean' shows the mean of the daily volumes, both grouped by 'Session' and 'DateTime' (just some groupby chaining with joins).

import pandas as pd

data = {
    'DateTime': ['2020-12-16 08:00:00', '2020-12-16 08:30:00', '2020-12-16 09:00:00', '2020-12-16 09:30:00',
                 '2020-12-17 08:00:00', '2020-12-17 08:30:00', '2020-12-17 09:00:00', '2020-12-17 09:30:00',
                 '2020-12-18 08:00:00', '2020-12-18 08:30:00', '2020-12-18 09:00:00', '2020-12-18 09:30:00',
                 '2019-11-18 08:00:00', '2019-11-18 08:30:00', '2019-11-18 09:00:00', '2019-11-18 09:30:00'],
    'Volume': [1000, 500, 1000, 3000, 2000, 2000, 2000, 2000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000],
    'Session': ['PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH', 'PRTH', 'PRTH', 'RTH', 'RTH']
}

df = pd.DataFrame(data)
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.index = pd.to_datetime(df['DateTime'])

# see below
x = (df.groupby([df.index.strftime('%Y'), 'Session', df.index.strftime('%Y-%m-%d')])
       .agg({'Volume': ['sum', 'mean']})
       .groupby(['DateTime', 'Session'], level=2)
       .agg(['sum', 'mean']))
x['Volume'].drop('mean', axis=1, level=0)

Groupby and resample at 1min frequency using forward fill in Python

Use a custom function that defines how many new rows are needed, via Timedelta with date_range and DataFrame.reindex:

def f(x):
    new = x.index[0] + pd.Timedelta(5, unit='Min')
    r = pd.date_range(x.index[0], new, freq='Min')
    return x.reindex(r, method='ffill')

df = (df.reset_index()
        .set_index('timestamp')
        .groupby(['index', 'id'], sort=False)['data']
        .apply(f)
        .reset_index(level=0, drop=True)
        .rename_axis(['id', 'timestamp'])
        .reset_index()
      )


print (df)
    id               timestamp  data
0    1 2017-01-02 13:14:53.040  10.0
1    1 2017-01-02 13:15:53.040  10.0
2    1 2017-01-02 13:16:53.040  10.0
3    1 2017-01-02 13:17:53.040  10.0
4    1 2017-01-02 13:18:53.040  10.0
5    1 2017-01-02 13:19:53.040  10.0
6    1 2017-01-02 12:04:43.240  11.0
7    1 2017-01-02 12:05:43.240  11.0
8    1 2017-01-02 12:06:43.240  11.0
9    1 2017-01-02 12:07:43.240  11.0
10   1 2017-01-02 12:08:43.240  11.0
11   1 2017-01-02 12:09:43.240  11.0
12   2 2017-01-02 15:22:06.540   1.0
13   2 2017-01-02 15:23:06.540   1.0
14   2 2017-01-02 15:24:06.540   1.0
15   2 2017-01-02 15:25:06.540   1.0
16   2 2017-01-02 15:26:06.540   1.0
17   2 2017-01-02 15:27:06.540   1.0
18   2 2017-01-03 13:55:34.240   2.0
19   2 2017-01-03 13:56:34.240   2.0
20   2 2017-01-03 13:57:34.240   2.0
21   2 2017-01-03 13:58:34.240   2.0
22   2 2017-01-03 13:59:34.240   2.0
23   2 2017-01-03 14:00:34.240   2.0

because if ffill is used with resample directly, the output is different - resample snaps the timestamps to whole minutes and fills the entire span of each group, rather than just five rows per original observation:

df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id  timestamp
1   2017-01-02 12:04:00     NaN
    2017-01-02 12:05:00    11.0
    2017-01-02 12:06:00    11.0
    2017-01-02 12:07:00    11.0
    2017-01-02 12:08:00    11.0
                            ...
2   2017-01-03 13:51:00     1.0
    2017-01-03 13:52:00     1.0
    2017-01-03 13:53:00     1.0
    2017-01-03 13:54:00     1.0
    2017-01-03 13:55:00     1.0
Name: data, Length: 1425, dtype: float64

