Pandas: Group by Name and Take Row With Most Recent Date

Keeping only rows with most recent date in dataframe

This can be done with sort_values and drop_duplicates:

df = df.sort_values(by=['Modified Date'], ascending=False)
df = df.drop_duplicates(subset='School ID', keep='first')

The sort ensures that the newest date for each school appears first, and drop_duplicates with keep='first' then keeps only that first (newest) row per school.
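As a minimal runnable sketch of the two steps above: the sample rows and the Status column are hypothetical and only there to make the result visible; the School ID and Modified Date names follow the snippet.

```python
import pandas as pd

# Hypothetical sample data: two schools, two modification dates each.
df = pd.DataFrame({
    "School ID": [1, 1, 2, 2],
    "Modified Date": pd.to_datetime(
        ["2021-01-05", "2021-03-01", "2020-12-31", "2020-06-15"]
    ),
    "Status": ["old", "new", "new", "old"],
})

latest = (
    df.sort_values(by="Modified Date", ascending=False)  # newest rows first
      .drop_duplicates(subset="School ID", keep="first")  # first row per school
)
print(latest)
```

Only the rows marked "new" should remain, one per school.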

group by pandas dataframe and select latest in each group

Use idxmax in groupby to get the index label of the latest date in each group, then slice df with loc:

df.loc[df.groupby('id').date.idxmax()]

    id  product        date
2  220     6647  2014-10-16
5  826     3380  2015-05-19
8  901     4555  2014-11-01
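The following sketch splits the one-liner into its two steps so the intermediate result is visible. The input rows are hypothetical; only the column names come from the answer. It assumes the index labels are unique, since loc selects by label.

```python
import pandas as pd

# Hypothetical input; column names follow the answer above.
df = pd.DataFrame({
    "id": [220, 220, 826, 826],
    "product": [1111, 6647, 3380, 2222],
    "date": pd.to_datetime(
        ["2014-09-01", "2014-10-16", "2015-05-19", "2015-02-14"]
    ),
})

# Step 1: for each id, the index label of the row holding the latest date.
latest_idx = df.groupby("id")["date"].idxmax()

# Step 2: select those rows by label.
latest = df.loc[latest_idx]
print(latest)
```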

Pandas groupby a column and sort by date and get only the latest row

If date has higher precedence than content_id, use that fact in sort_values:

out = df.sort_values(['user_id','date','content_id']).groupby(['user_id'])[['content_id','date']].last()

Another possibility is to convert date to datetime and then find the index of the latest date using groupby + idxmax; then use loc to filter the desired output:

df['date'] = pd.to_datetime(df['date'])
out = df.loc[df.groupby('user_id')['date'].idxmax()]

Output:

         content_id        date
user_id
123              20  2020-10-14
234              19  2021-05-26
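To compare the two approaches side by side, here is a small sketch on hypothetical data (the rows are made up; the column names follow the answer). The set_index call is an addition so that both results share the same layout.

```python
import pandas as pd

# Hypothetical input with two rows per user.
df = pd.DataFrame({
    "user_id": [123, 123, 234, 234],
    "content_id": [18, 20, 19, 17],
    "date": pd.to_datetime(
        ["2020-10-01", "2020-10-14", "2021-05-26", "2021-01-02"]
    ),
})

# Approach 1: sort, then take the last row of each group.
by_sort = (
    df.sort_values(["user_id", "date", "content_id"])
      .groupby("user_id")[["content_id", "date"]]
      .last()
)

# Approach 2: locate the row with the latest date in each group.
by_idxmax = df.loc[df.groupby("user_id")["date"].idxmax()].set_index("user_id")

print(by_sort)
print(by_idxmax)
```

On this data both give the same rows. One difference to keep in mind: last() returns the last non-null value per column, so the results can diverge when the columns contain missing values.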

Filter for most recent event by group with pandas

It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:

df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]

Output:

   account_number product   sale_date
3             123    sale  2022-01-01
1             423  rental  2021-10-01
4             513    sale  2021-11-30
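A short sketch of why the conversion matters. The rows and the month/day/year string format are assumptions chosen for illustration; the original data may use a different format.

```python
import pandas as pd

# Hypothetical data with dates stored as month/day/year strings.
df = pd.DataFrame({
    "account_number": [123, 123],
    "product": ["rental", "sale"],
    "sale_date": ["12/01/2021", "01/02/2022"],
})

# String comparison is lexicographic, so the 2021 date looks "larger".
latest_as_str = df["sale_date"].max()
print(latest_as_str)

# After conversion the comparison is chronological.
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%m/%d/%Y")
out = df.loc[df.groupby("account_number")["sale_date"].idxmax()]
print(out)
```

Here the string maximum is the 2021 entry, while the datetime-based result picks the 2022 sale.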

Select all rows with 2 most recent dates by ID

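You can try groupby().nth. Note that n=2 refers to the third row of each group (positions are zero-based), so the comparison keeps the two most recent dates only when the rows are sorted by date in descending order and each date appears exactly twice per group, as in the sample data: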

df[df['date']>=df.groupby("id")["date"].transform('nth', n=2)]

Output:

    id        date  value1  value2
0    a  2020-12-07      10    1000
1    a  2020-12-07      10    1000
2    a  2020-12-05      10    1000
3    a  2020-12-05      10    1000
6    b  2021-12-07      20    2000
7    b  2021-12-07      20    2000
8    b  2021-09-05      20    2000
9    b  2021-09-05      20    2000
12   c  2021-09-05      30    3000
13   c  2021-09-05      30    3000
14   c  2021-02-05      30    3000
15   c  2021-02-05      30    3000
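If the rows are not sorted, or dates repeat a varying number of times, a dense rank per group is one alternative that does not depend on row position. This is a different technique from the nth approach above, sketched here on hypothetical data.

```python
import pandas as pd

# Hypothetical unsorted data; dates repeat an uneven number of times.
df = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 4,
    "date": pd.to_datetime([
        "2020-12-07", "2020-12-05", "2020-12-07", "2020-12-01", "2020-12-05",
        "2021-12-07", "2021-09-05", "2021-01-01", "2021-09-05",
    ]),
})

# Dense rank: 1 for the most recent distinct date in each group,
# 2 for the next distinct date, and so on; ties share a rank.
date_rank = df.groupby("id")["date"].rank(method="dense", ascending=False)

# Keep every row whose date is among the two most recent distinct dates.
top2 = df[date_rank <= 2]
print(top2)
```

The rows dated 2020-12-01 (group a) and 2021-01-01 (group b) are dropped because they hold the third most recent date in their group.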

Group By Customer Id and Also Take Date Column With Most Recent Value In Pandas

I think it would be easier to just sort them by date and then drop the duplicates.

df = df.sort_values('date_cancelled', ascending=False)
df = df.drop_duplicates(subset='owner_id', keep='first')
print(df)
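One detail worth checking on real data is missing dates. The sketch below uses hypothetical rows and assumes date_cancelled can be empty for some owners; sort_values places missing values last by default, so a row with a real date wins over a row without one.

```python
import pandas as pd

# Hypothetical data; None becomes NaT (a missing date) after conversion.
df = pd.DataFrame({
    "owner_id": [1, 1, 2, 3, 3],
    "date_cancelled": pd.to_datetime(
        ["2021-01-10", None, None, "2020-05-01", "2020-08-15"]
    ),
})

latest = (
    df.sort_values("date_cancelled", ascending=False)  # NaT rows sort last
      .drop_duplicates(subset="owner_id", keep="first")
      .sort_values("owner_id")  # only for a readable printout
)
print(latest)
```

Owner 2 has no date at all, so its single row is kept with a missing value.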

