Max and Min Date in Pandas Groupby

Max and Min date in pandas groupby

You need to combine the functions that apply to the same column, like this:

In [116]: gb.agg({'sum_col' : np.sum,
     ...:         'date' : [np.min, np.max]})
Out[116]:
                       date             sum_col
                       amin        amax     sum
type weekofyear
A    25          2014-06-22  2014-06-22       1
     26          2014-06-25  2014-06-25       1
     27          2014-07-05  2014-07-05       2
B    26          2014-06-24  2014-06-24       2
     27          2014-07-02  2014-07-02       1
C    26          2014-06-25  2014-06-25       3
     27          2014-07-06  2014-07-06       3
     30          2014-07-27  2014-07-27       1
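For context, gb can be reproduced with a small sketch like the one below. The input values are assumptions reconstructed from the printed output; note that recent pandas versions prefer the string aliases 'min'/'max' (which label the result columns min/max instead of amin/amax) over passing numpy functions.

import pandas as pd

# Hypothetical input reconstructed from the output above
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'date': pd.to_datetime(['2014-06-22', '2014-06-25', '2014-07-05',
                            '2014-06-24', '2014-07-02', '2014-06-25',
                            '2014-07-06', '2014-07-27']),
    'sum_col': [1, 1, 2, 2, 1, 3, 3, 1],
})
df['weekofyear'] = df['date'].dt.isocalendar().week  # ISO week number

gb = df.groupby(['type', 'weekofyear'])
gb.agg({'sum_col': 'sum', 'date': ['min', 'max']})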

Pandas groupby value and get value of max date and min date

Try sort_values by year; then you can groupby and select first for the value at the earliest year and last for the value at the latest:

g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()

Output:

item
A 12
B 20
Name: value, dtype: int64
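For a runnable sketch, here is a hypothetical df matching the output above (item, year and value are the only columns assumed):

import pandas as pd

df = pd.DataFrame({
    'item':  ['A', 'A', 'B', 'B'],
    'year':  [2019, 2021, 2018, 2020],
    'value': [3, 15, 5, 25],
})

g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()  # value at max year minus value at min year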

Finding the min and max date from a timeseries range in pandas

I would advise using a groupby on the "Site" column and aggregating each group into a min and max date.

df.groupby("Site").agg({'date': ['min', 'max']})

This will return the min and max date for each site.

I haven't tried out the code, but it should do what you want.
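For reference, a minimal runnable sketch with made-up Site/date values; flattening the resulting MultiIndex columns is optional:

import pandas as pd

# Hypothetical data; the Site names and dates are assumptions
df = pd.DataFrame({
    'Site': ['X', 'X', 'Y', 'Y'],
    'date': pd.to_datetime(['2020-01-01', '2020-03-15',
                            '2020-02-01', '2020-02-20']),
})

res = df.groupby("Site").agg({'date': ['min', 'max']})
res.columns = ['_'.join(c) for c in res.columns]  # date_min, date_max
print(res)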

Pandas group by two fields, pick min date and next max date from other group

Shifting max_date per group

Here max_date is defined as the min_date of the next id per brand.
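For a runnable sketch, assume a minimal input like this (the dates are hypothetical, chosen to reproduce the outputs below):

import pandas as pd

data = pd.DataFrame({
    'model_id': [1, 1, 2],
    'brand': ['nike', 'nike', 'nike'],
    'release_date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
})

Then: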

(data
 .groupby(['model_id', 'brand'])
 .agg(min_date=('release_date', 'min'))
 .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
 #.astype(str).to_markdown()  # uncomment for markdown
)

Output:

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-03 |
| (2, 'nike') | 2021-01-03 | NaT        |

Previous answer

You need to mask the data afterwards:

(data
 .groupby(['model_id', 'brand'])
 .agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
 .assign(max_date=lambda d: d['max_date'].mask(d['max_date'].eq(d['min_date'])))
 #.astype(str).to_markdown()  # uncomment for markdown
)

Output (as markdown):

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-02 |
| (2, 'nike') | 2021-01-03 | NaT        |

How to calculate difference between max and min date for each user

Create a new dataframe grouped per user_id, with named columns for the min and max dates, then merge it with the original.

Data input:

import numpy as np
import pandas as pd

# Note: user_id is drawn at random, so the exact output below varies run to run
df = pd.DataFrame({
    "user_id": np.random.randint(10000, 10004, 15, dtype="int32"),
    "purchase_date": pd.date_range(start='2022-01-01', periods=15, freq='8H'),
    "C": pd.Series(1, index=list(range(15)), dtype="float32"),
    "D": np.array([5] * 15, dtype="int32"),
    "E": "foo",
})
df['purchase_date'] = pd.to_datetime(df['purchase_date']).dt.normalize()



Solution:

df_grouped = df.groupby(['user_id']).agg(
    date_min=('purchase_date', 'min'),
    date_max=('purchase_date', 'max')).reset_index()
df_grouped['diff'] = (df_grouped['date_max'] - df_grouped['date_min']).dt.days
df1 = pd.merge(df, df_grouped)
df1

Out:

    user_id purchase_date    C  D    E   date_min   date_max  diff
0     10001    2022-01-01  1.0  5  foo 2022-01-01 2022-01-04     3
1     10001    2022-01-02  1.0  5  foo 2022-01-01 2022-01-04     3
2     10001    2022-01-03  1.0  5  foo 2022-01-01 2022-01-04     3
3     10001    2022-01-04  1.0  5  foo 2022-01-01 2022-01-04     3
4     10000    2022-01-01  1.0  5  foo 2022-01-01 2022-01-04     3
5     10000    2022-01-02  1.0  5  foo 2022-01-01 2022-01-04     3
6     10000    2022-01-03  1.0  5  foo 2022-01-01 2022-01-04     3
7     10000    2022-01-04  1.0  5  foo 2022-01-01 2022-01-04     3
8     10002    2022-01-01  1.0  5  foo 2022-01-01 2022-01-05     4
9     10002    2022-01-02  1.0  5  foo 2022-01-01 2022-01-05     4
10    10002    2022-01-03  1.0  5  foo 2022-01-01 2022-01-05     4
11    10002    2022-01-05  1.0  5  foo 2022-01-01 2022-01-05     4
12    10002    2022-01-05  1.0  5  foo 2022-01-01 2022-01-05     4
13    10003    2022-01-04  1.0  5  foo 2022-01-04 2022-01-05     1
14    10003    2022-01-05  1.0  5  foo 2022-01-04 2022-01-05     1
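If a separate grouped frame and the merge are not needed, the same three columns can be added in place with groupby.transform, which broadcasts each group's aggregate back onto its rows. A sketch using the df above:

g = df.groupby('user_id')['purchase_date']
df['date_min'] = g.transform('min')
df['date_max'] = g.transform('max')
df['diff'] = (df['date_max'] - df['date_min']).dt.days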

Pandas group by on one column with max date on another column python

You can use boolean indexing with groupby and transform:

df_new = df[df.groupby('dealer').date.transform('max') == df['date']]

   invoice_no  dealer  billing_change_previous_month       date
1         100       1                         -41981 2017-01-30
2        5505       2                              0 2017-01-30

The solution works as expected even when there are more than two dealers (addressing the question posted by Ben Smith):

df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635, 10000, 10001],
    'dealer': [1, 1, 2, 2, 3, 3],
    'billing_change_previous_month': [0, -41981, 0, 58730, 9000, 100],
    'date': ['2016-12-31', '2017-01-30', '2017-01-30', '2016-12-31',
             '2019-12-31', '2020-01-31'],
})

df['date'] = pd.to_datetime(df['date'])
df[df.groupby('dealer').date.transform('max') == df['date']]

   invoice_no  dealer  billing_change_previous_month       date
1         100       1                         -41981 2017-01-30
2        5505       2                              0 2017-01-30
5       10001       3                            100 2020-01-31
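Note that transform('max') keeps every row that ties for the latest date within a dealer. If exactly one row per dealer is wanted even with ties, idxmax is an alternative (a sketch; it selects the first row attaining each group's maximum):

df_new = df.loc[df.groupby('dealer')['date'].idxmax()]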

Python Min/Max Dates with Groupby

Starting from your original dataframe, you can build a helper Series with Series.shift to compare each row's Price with the previous one, and use it for grouping; then groupby and agg with min and max, rename the columns, and reset the index.
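A minimal sketch of an input frame, reconstructed from the output below (the values are assumptions):

import pandas as pd

# Hypothetical data: a daily series whose Price runs 3, then 4, then 3 again
df = pd.DataFrame({
    'ds': pd.date_range('2017-01-01', '2017-01-14'),
    'Price': [3] * 5 + [4] * 4 + [3] * 5,
})

Then: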

s = df['Price'].ne(df['Price'].shift()).cumsum()

d = {"min": "start_dt", "max": "end_dt"}
out = (df.groupby([s, 'Price'])['ds'].agg(['min', 'max'])
         .rename(columns=d)
         .droplevel(0).reset_index())


print(out)

   Price   start_dt     end_dt
0      3 2017-01-01 2017-01-05
1      4 2017-01-06 2017-01-09
2      3 2017-01-10 2017-01-14

Pandas Groupby with Agg Min/Max date

In pandas, NaN is used as the missing value and is ignored in most operations, so it's the right one to use. If you're still getting an error, it's probably because you've got datetime.date objects in there (you definitely do have them there; the point is that they're probably what's causing the problems).

For example, if your missing values are "" and your column dtypes are object with internal types of datetime.date, I get:

In [496]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
[...]
TypeError: '<=' not supported between instances of 'datetime.date' and 'str'

but if I switch to pandas-native time objects and NaNs, it works:

In [500]: df["p_date"] = pd.to_datetime(df["p_date"])

In [501]: df["s_date"] = pd.to_datetime(df["s_date"])

In [502]: df
Out[502]:
   issue     p_date     s_date
0  issue 2012-11-01        NaT
1  issue 2013-12-09        NaT
2  issue 2014-12-08        NaT
3  issue        NaT 2016-01-13
4  issue 2012-11-01        NaT
5  issue        NaT 2014-03-26
6  issue        NaT 2015-05-29
7  issue 2013-12-18        NaT
8  issue        NaT 2016-01-13

In [503]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
Out[503]:
           p_date     s_date
issue
issue  2012-11-01 2016-01-13
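For an end-to-end sketch, here is a hypothetical frame mixing datetime.date objects with "" markers, plus the conversion that fixes it (pd.to_datetime turns empty strings into NaT, which min and max then skip):

import datetime
import pandas as pd

# Hypothetical reproduction: object columns mixing datetime.date with ""
df = pd.DataFrame({
    'issue': ['issue'] * 3,
    'p_date': [datetime.date(2012, 11, 1), datetime.date(2013, 12, 9), ''],
    's_date': ['', '', datetime.date(2016, 1, 13)],
})

# Aggregating the raw object columns raises the TypeError shown above;
# converting first makes min/max work and skip the NaT values
df['p_date'] = pd.to_datetime(df['p_date'])
df['s_date'] = pd.to_datetime(df['s_date'])
df.groupby('issue').agg({'p_date': 'min', 's_date': 'max'})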

Get the difference between max and min for a groupby in pandas and calculate the average

For pandas 0.25+ it is possible to use named aggregations, then subtract and divide the resulting columns.
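First, a hypothetical input frame reconstructed from the printed results below:

import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'GOOG', 'MSFT', 'MSFT'],
    'fy': [2010, 2011, 2010, 2009, 2009],
    'fp': [0, 0, 0, 0, 0],
    'f_date': ['2010-01-01 12:12:34', '2012-01-01 12:12:34',
               '2010-01-01 12:12:34', '2010-01-01 12:12:34',
               '2014-05-01 12:12:34'],
    'rn': [0, 1, 0, 0, 1],
})

Then aggregate, subtract and divide: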

df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg(min1=('f_date', 'min'),
                                               max1=('f_date', 'max'),
                                               rn=('rn', 'max'))

df['new'] = df['max1'].sub(df['min1']).div(df['rn'].add(1))
print (df)
                              min1                max1  rn               new
ticker fy   fp
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0   0 days 00:00:00
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34   1   0 days 00:00:00
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0   0 days 00:00:00
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34   1 790 days 12:00:00

Or, if necessary, convert the difference of datetimes (timedeltas) to seconds with Series.dt.total_seconds:

df['new'] = df['max1'].sub(df['min1']).dt.total_seconds().div(df['rn'].add(1))
print (df)
                              min1                max1  rn         new
ticker fy   fp
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0         0.0
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34   1         0.0
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0         0.0
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34   1  68299200.0

Solution for older pandas versions:

df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg({'f_date': ['min', 'max'],
                                                'rn': 'max'})
df.columns = df.columns.map('_'.join)
df['new'] = df['f_date_max'].sub(df['f_date_min']).div(df['rn_max'].add(1))
print (df)
                        f_date_min          f_date_max  rn_max  \
ticker fy   fp
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34       0
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34       1
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34       0
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34       1

                            new
ticker fy   fp
AAPL   2010 0   0 days 00:00:00
       2011 0   0 days 00:00:00
GOOG   2010 0   0 days 00:00:00
MSFT   2009 0  790 days 12:00:00

Last, if necessary, convert the MultiIndex to columns:

df = df.reset_index()
print (df)
  ticker    fy  fp          f_date_min          f_date_max  rn_max  \
0   AAPL  2010   0 2010-01-01 12:12:34 2010-01-01 12:12:34       0
1   AAPL  2011   0 2012-01-01 12:12:34 2012-01-01 12:12:34       1
2   GOOG  2010   0 2010-01-01 12:12:34 2010-01-01 12:12:34       0
3   MSFT  2009   0 2010-01-01 12:12:34 2014-05-01 12:12:34       1

                 new
0    0 days 00:00:00
1    0 days 00:00:00
2    0 days 00:00:00
3  790 days 12:00:00

