Max and Min date in pandas groupby
You need to combine the functions that apply to the same column, like this:
In [116]: gb.agg({'sum_col' : np.sum,
     ...:         'date' : [np.min, np.max]})
Out[116]:
                       date                sum_col
                       amin        amax        sum
type weekofyear
A    25          2014-06-22  2014-06-22        1
     26          2014-06-25  2014-06-25        1
     27          2014-07-05  2014-07-05        2
B    26          2014-06-24  2014-06-24        2
     27          2014-07-02  2014-07-02        1
C    26          2014-06-25  2014-06-25        3
     27          2014-07-06  2014-07-06        3
     30          2014-07-27  2014-07-27        1
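With pandas 0.25+, the same result can be expressed with named aggregation, which avoids the amin/amax MultiIndex columns. A minimal sketch on made-up data (the column names mirror the output above, but the values are invented):

```python
import pandas as pd

# Hypothetical sample data with the same columns as the answer.
df = pd.DataFrame({
    'type': ['A', 'A', 'B'],
    'weekofyear': [25, 25, 26],
    'date': pd.to_datetime(['2014-06-22', '2014-06-20', '2014-06-24']),
    'sum_col': [1, 2, 2],
})

# Named aggregation gives flat, readable column names instead of the
# ('date', 'amin') MultiIndex produced by the dict-of-lists form.
out = df.groupby(['type', 'weekofyear']).agg(
    date_min=('date', 'min'),
    date_max=('date', 'max'),
    sum_col=('sum_col', 'sum'),
)
```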
Pandas groupby value and get value of max date and min date
Try sort_values by year, then you can groupby and select first for min and last for max:
g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()
Output:
item
A 12
B 20
Name: value, dtype: int64
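A self-contained sketch of the same idea, using invented item/year/value data that reproduces the output above:

```python
import pandas as pd

# Hypothetical input matching the answer's item/year/value shape.
df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B'],
    'year': [2020, 2021, 2019, 2022],
    'value': [3, 15, 5, 25],
})

# After sorting by year, first() holds the value at the earliest year
# and last() the value at the latest year within each item.
g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()
```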
Finding the min and max date from a timeseries range in pandas
I would advise using a groupby on the "Site" column and aggregating each group into a min and a max date.
df.groupby("Site").agg({'date': ['min', 'max']})
This will return the min and max date for each site.
I haven't tried out the code, but it should do what you want.
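A quick check of that one-liner on made-up data (the "Site" and "date" column names come from the answer; the values are invented):

```python
import pandas as pd

# Small invented frame to confirm the groupby/agg call returns
# the earliest and latest date per site.
df = pd.DataFrame({
    'Site': ['X', 'X', 'Y'],
    'date': pd.to_datetime(['2020-01-05', '2020-03-01', '2020-02-10']),
})

# Columns come back as a ('date', 'min') / ('date', 'max') MultiIndex.
res = df.groupby('Site').agg({'date': ['min', 'max']})
```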
Pandas group by two fields, pick min date and next max date from other group
shifting max_date per group
Here max_date is defined as the min_date of the next model_id within the same brand (shift(-1) pulls the following group's value up one row).
(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'))
.assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
#.astype(str).to_markdown() # uncomment for markdown
)
output:
| | min_date | max_date |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-03 |
| (2, 'nike') | 2021-01-03 | NaT |
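The same snippet run end-to-end on invented release data that reproduces the table above:

```python
import pandas as pd

# Hypothetical release data matching the answer's columns.
data = pd.DataFrame({
    'model_id': [1, 1, 2],
    'brand': ['nike', 'nike', 'nike'],
    'release_date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
})

out = (data
       .groupby(['model_id', 'brand'])
       .agg(min_date=('release_date', 'min'))
       # each group's max_date becomes the next group's min_date
       # within the same brand; the last group gets NaT
       .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
       )
```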
previous answer
You need to mask the data afterwards:
(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
.assign(max_date=lambda d: d['max_date'].mask(d['max_date'].eq(d['min_date'])))
#.astype(str).to_markdown() # uncomment for markdown
)
output (as markdown):
| | min_date | max_date |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-02 |
| (2, 'nike') | 2021-01-03 | NaT |
How to calculate difference between max and min date for each user
Create a new dataframe grouped per user_id with named columns for the min and max dates, then merge it with the original.
data input:
import numpy as np
import pandas as pd
df = pd.DataFrame({
"user_id": (np.random.randint(10000,10004,15, dtype="int32")),
"purchase_date": (pd.date_range(start='2022-01-01', periods=15, freq='8H')),
"C": pd.Series(1, index=list(range(15)), dtype="float32"),
"D": np.array([5] * 15, dtype="int32"),
"E": "foo",
})
df['purchase_date'] = pd.to_datetime(df['purchase_date']).dt.normalize()
# Solution
df_grouped = df.groupby(['user_id']).agg(
date_min=('purchase_date', 'min'),
date_max=('purchase_date', 'max'))\
.reset_index()
df_grouped['diff']=(df_grouped['date_max']-df_grouped['date_min']).dt.days
df1 = pd.merge(df, df_grouped)
df1
Out:
user_id purchase_date C D E date_min date_max diff
0 10001 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
1 10001 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
2 10001 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
3 10001 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
4 10000 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
5 10000 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
6 10000 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
7 10000 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
8 10002 2022-01-01 1.0 5 foo 2022-01-01 2022-01-05 4
9 10002 2022-01-02 1.0 5 foo 2022-01-01 2022-01-05 4
10 10002 2022-01-03 1.0 5 foo 2022-01-01 2022-01-05 4
11 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
12 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
13 10003 2022-01-04 1.0 5 foo 2022-01-04 2022-01-05 1
14 10003 2022-01-05 1.0 5 foo 2022-01-04 2022-01-05 1
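If the per-row min/max columns are all you need, groupby.transform can broadcast them directly and skip the separate merge step; a hedged alternative sketch on made-up data:

```python
import pandas as pd

# Hypothetical purchases; transform broadcasts each group's min/max
# back onto the original rows, so no merge is needed.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'purchase_date': pd.to_datetime(['2022-01-01', '2022-01-04', '2022-01-02']),
})

g = df.groupby('user_id')['purchase_date']
df['date_min'] = g.transform('min')
df['date_max'] = g.transform('max')
df['diff'] = (df['date_max'] - df['date_min']).dt.days
```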
Pandas group by on one column with max date on another column python
You can use boolean indexing with groupby and transform:
df_new = df[df.groupby('dealer').date.transform('max') == df['date']]
invoice_no dealer billing_change_previous_month date
1 100 1 -41981 2017-01-30
2 5505 2 0 2017-01-30
The solution works as expected even if there are more than two dealers (to address the question posted by Ben Smith):
df = pd.DataFrame({'invoice_no': [110, 100, 5505, 5635, 10000, 10001],
                   'dealer': [1, 1, 2, 2, 3, 3],
                   'billing_change_previous_month': [0, -41981, 0, 58730, 9000, 100],
                   'date': ['2016-12-31', '2017-01-30', '2017-01-30',
                            '2016-12-31', '2019-12-31', '2020-01-31']})
df['date'] = pd.to_datetime(df['date'])
df[df.groupby('dealer').date.transform('max') == df['date']]
invoice_no dealer billing_change_previous_month date
1 100 1 -41981 2017-01-30
2 5505 2 0 2017-01-30
5 10001 3 100 2020-01-31
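An equivalent route, assuming you want exactly one max-date row per dealer, is groupby.idxmax; a small sketch on invented rows showing both forms side by side:

```python
import pandas as pd

df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# transform('max') broadcasts each dealer's max date to every row,
# so the comparison keeps all rows holding that max (ties included).
latest = df[df.groupby('dealer')['date'].transform('max') == df['date']]

# idxmax returns one row label per dealer, even when dates are tied.
latest_idx = df.loc[df.groupby('dealer')['date'].idxmax()]
```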
Python Min/Max Dates with Groupby
Starting from your original dataframe, you can use a helper column with Series.shift to compare each row with the previous one and use the result for grouping, then groupby and agg with min and max, rename and reset the index:
s = df['Price'].ne(df['Price'].shift()).cumsum()
d = {"min":"start_dt", "max":"end_dt"}
out = (df.groupby([s,'Price'])['ds'].agg(['min','max']).rename(columns=d)
.droplevel(0).reset_index())
print(out)
Price start_dt end_dt
0 3 2017-01-01 2017-01-05
1 4 2017-01-06 2017-01-09
2 3 2017-01-10 2017-01-14
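A runnable version with invented daily prices that produces the same three runs:

```python
import pandas as pd

# Hypothetical daily prices with consecutive runs of equal values.
df = pd.DataFrame({
    'ds': pd.date_range('2017-01-01', periods=9, freq='D'),
    'Price': [3, 3, 4, 4, 4, 3, 3, 3, 3],
})

# cumsum over "value changed" flags gives each consecutive run its own
# label, so repeated prices in separate runs stay separate groups.
s = df['Price'].ne(df['Price'].shift()).cumsum()
out = (df.groupby([s, 'Price'])['ds'].agg(['min', 'max'])
         .rename(columns={'min': 'start_dt', 'max': 'end_dt'})
         .droplevel(0).reset_index())
```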
Pandas Groupby with Agg Min/Max date
In pandas, NaN is used as the missing value, and is ignored for most operations, so it's the right one to use. If you're still getting an error, it's probably because you've got a datetime.date there (well, you've definitely got that there; I mean that it's probably causing the problems).
For example, if your missing values are "" and your column dtypes are object with internal types of datetime.date, I get:
In [496]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
[...]
TypeError: '<=' not supported between instances of 'datetime.date' and 'str'
but if I switch to pandas-native time objects and NaNs, it works:
In [500]: df["p_date"] = pd.to_datetime(df["p_date"])
In [501]: df["s_date"] = pd.to_datetime(df["s_date"])
In [502]: df
Out[502]:
issue p_date s_date
0 issue 2012-11-01 NaT
1 issue 2013-12-09 NaT
2 issue 2014-12-08 NaT
3 issue NaT 2016-01-13
4 issue 2012-11-01 NaT
5 issue NaT 2014-03-26
6 issue NaT 2015-05-29
7 issue 2013-12-18 NaT
8 issue NaT 2016-01-13
In [503]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
Out[503]:
p_date s_date
issue
issue 2012-11-01 2016-01-13
get the difference between max and min for a groupby in pandas and calculate the average
For pandas 0.25+ it is possible to use named aggregations, then subtract and divide the columns:
df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg(min1=('f_date','min'),
max1=('f_date','max'),
rn=('rn', 'max'))
df['new'] = df['max1'].sub(df['min1']).div(df['rn'].add(1))
print (df)
min1 max1 rn new
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0 days 00:00:00
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1 0 days 00:00:00
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0 days 00:00:00
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1 790 days 12:00:00
Or, if necessary, convert the difference of datetimes (timedeltas) to seconds with Series.dt.total_seconds:
df['new'] = df['max1'].sub(df['min1']).dt.total_seconds().div(df['rn'].add(1))
print (df)
min1 max1 rn new
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0.0
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1 0.0
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0 0.0
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1 68299200.0
Solution for older pandas versions:
df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg({'f_date':['min','max'],
'rn':'max'})
df.columns = df.columns.map('_'.join)
df['new'] = df['f_date_max'].sub(df['f_date_min']).div(df['rn_max'].add(1))
print (df)
f_date_min f_date_max rn_max \
ticker fy fp
AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1
GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1
new
ticker fy fp
AAPL 2010 0 0 days 00:00:00
2011 0 0 days 00:00:00
GOOG 2010 0 0 days 00:00:00
MSFT 2009 0 790 days 12:00:00
Last if necessary convert MultiIndex
to columns:
df = df.reset_index()
print (df)
ticker fy fp f_date_min f_date_max rn_max \
0 AAPL 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
1 AAPL 2011 0 2012-01-01 12:12:34 2012-01-01 12:12:34 1
2 GOOG 2010 0 2010-01-01 12:12:34 2010-01-01 12:12:34 0
3 MSFT 2009 0 2010-01-01 12:12:34 2014-05-01 12:12:34 1
new
0 0 days 00:00:00
1 0 days 00:00:00
2 0 days 00:00:00
3 790 days 12:00:00