Group by Pandas Dataframe and Select Latest in Each Group

Group by pandas DataFrame and select latest in each group

Use idxmax on the grouped date column to get the index label of the latest row in each group, then select those rows with loc:

df.loc[df.groupby('id')['date'].idxmax()]

    id  product        date
2  220     6647  2014-10-16
5  826     3380  2015-05-19
8  901     4555  2014-11-01
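For a version you can run end to end, here is a minimal sketch. The sample rows are made up for illustration (the original frame is not shown), and the date column is assumed to already be a datetime dtype.

```python
import pandas as pd

# Hypothetical sample data; only the id/product/date column names come from the answer.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-14", "2014-10-05", "2014-11-01",
    ]),
})

# For each id, idxmax gives the index label of the row holding the latest date.
latest_idx = df.groupby("id")["date"].idxmax()

# loc then pulls exactly those rows out of the original frame.
latest = df.loc[latest_idx]
print(latest)
```

Because loc selects by label, this relies on the index labels being unique.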

Group by pandas DataFrame and select next upcoming date in each group

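Filter to the dates after the cutoff first, then sort by id and date and drop duplicates so that only the earliest remaining date per id is kept: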

df[df['date']>'2020-12-01'].sort_values(['id','date']).drop_duplicates('id')

Output:

    id  product        date
2  220     6647  2020-12-16
4  826     3380  2020-12-09
8  901     4555  2021-11-01
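The same steps as a self-contained sketch. The data and the cutoff variable are hypothetical, and the comparison uses a Timestamp rather than a string on the assumption that the date column is a datetime dtype.

```python
import pandas as pd

# Hypothetical sample data with one past date and several upcoming ones.
df = pd.DataFrame({
    "id": [220, 220, 826, 826, 901],
    "product": [6647, 6647, 3380, 3380, 4555],
    "date": pd.to_datetime([
        "2020-11-20", "2020-12-16", "2020-12-30", "2020-12-09", "2021-11-01",
    ]),
})

cutoff = pd.Timestamp("2020-12-01")  # assumed "today"

upcoming = (
    df[df["date"] > cutoff]          # keep only dates after the cutoff
    .sort_values(["id", "date"])     # earliest upcoming date first within each id
    .drop_duplicates("id")           # keep="first" by default, so one row per id
)
print(upcoming)
```

An id with no date after the cutoff simply does not appear in the result.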

Get only the first and last rows of each group with pandas

Use groupby, find the head and tail for each group, and concat the two.

g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170

If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.


Details

g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250
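A runnable sketch of the whole pipeline on a small, made-up frame (the values are illustrative only). It assumes the rows are already in time order within each ID, and it asks for a stable sort so the first row stays ahead of the last row within each group.

```python
import pandas as pd

# Hypothetical data: A has three rows, B has two, D has only one.
df = pd.DataFrame({
    "Time": ["8:00", "9:00", "10:00", "15:00", "20:00", "23:00"],
    "ID": ["A", "B", "A", "D", "A", "B"],
    "X": [23, 24, 25, 30, 35, 38],
})

g = df.groupby("ID")

first_last = (
    pd.concat([g.head(1), g.tail(1)])    # first row and last row of every group
    .drop_duplicates()                   # single-row groups (D) appear in both halves
    .sort_values("ID", kind="mergesort") # stable sort keeps first before last
    .reset_index(drop=True)
)
print(first_last)
```

Note that drop_duplicates compares row values, not index labels, so two genuinely different rows with identical values would also collapse into one.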

Pandas groupby select last row or second to last row based on value (0 or 1) in another column

You are checking whether x['churned']==1 holds for all rows in the group. To check whether it is true for at least one row in the group, use any():

df = df.groupby(['CustomerID'],as_index=False).apply \
(lambda x: x.iloc[-2] if (x['churned']==1).any() \
else x.iloc[-1]).reset_index()
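Here is a small sketch of the same selection written as an explicit loop over the groups, which makes the any() check easy to inspect. The sample data, the month column, and the pick_row helper are hypothetical. The len(group) > 1 guard is an addition of mine, since iloc[-2] raises an IndexError on a one-row group.

```python
import pandas as pd

# Hypothetical data: customer 1 churned in their last month, customer 2 never churned.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "month": [1, 2, 3, 1, 2],
    "churned": [0, 0, 1, 0, 0],
})

def pick_row(group):
    """Second-to-last row if any row is churned, otherwise the last row."""
    has_churn = (group["churned"] == 1).any()
    # Guard (an assumption, not in the original answer): fall back to the
    # last row when the group is too short to have a second-to-last one.
    pos = -2 if has_churn and len(group) > 1 else -1
    return group.iloc[pos]

rows = [pick_row(group) for _, group in df.groupby("CustomerID")]
out = pd.DataFrame(rows).reset_index(drop=True)
print(out)
```

With all() in place of any(), customer 1 would not count as churned because only one of their three rows has churned == 1.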

Pandas get topmost n records within each group

Have you tried:

df.groupby('id').head(2)

Output generated:

       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1

(Keep in mind that head(2) takes the first two rows in their existing order, so you may need to sort first, depending on your data.)

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
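A minimal sketch on made-up data, including the sort-first variant mentioned above. In recent pandas versions head keeps the original row labels rather than building a MultiIndex, so reset_index(drop=True) there simply renumbers the rows.

```python
import pandas as pd

# Hypothetical data: ids 1 and 2 have more than two rows, ids 3 and 4 have one.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
    "value": [1, 2, 3, 1, 2, 3, 4, 1, 1],
})

# First two rows of each id, in the frame's existing order.
top2 = df.groupby("id").head(2).reset_index(drop=True)

# Two largest values per id: sort so the largest come first within each id.
top2_largest = (
    df.sort_values(["id", "value"], ascending=[True, False])
    .groupby("id")
    .head(2)
    .reset_index(drop=True)
)
print(top2)
print(top2_largest)
```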

Keep X% last rows by group in Pandas

groupby-apply-tail

Pass the desired size to tail() in a GroupBy.apply(). This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.

ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15

ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15


groupby-apply-iloc

Alternatively, index the desired size via iloc/slicing, but this is clunkier since [-0:] does not actually get the last 0 rows, so we have to check against that:

ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15

ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15
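If you would rather avoid apply altogether, the sketch below is an alternative (not the method from the answer above) that builds a boolean mask from each row's distance to the end of its group. The keep_last_fraction helper is hypothetical, and the sample data was chosen to be consistent with the outputs shown: A has two rows and B has five.

```python
import pandas as pd

# Hypothetical data consistent with the outputs above.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "B", "B"],
    "value": [1, 2, 11, 12, 13, 14, 15],
})

def keep_last_fraction(frame, key, ratio):
    """Keep the last int(ratio * group_size) rows of every group."""
    g = frame.groupby(key)
    size = g[key].transform("size")          # group size, repeated on every row
    keep = (size * ratio).astype(int)        # truncates toward zero, like int()
    from_end = g.cumcount(ascending=False)   # 0 for the last row of each group
    return frame[from_end < keep].reset_index(drop=True)

out_60 = keep_last_fraction(df, "ID", 0.6)
out_40 = keep_last_fraction(df, "ID", 0.4)
print(out_60)
print(out_40)
```

A group whose computed size is 0 drops out entirely, so the "last 0 rows" case needs no special handling here either.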

Pandas groupby a column and sort by date and get only the latest row

If date has higher precedence than content_id, put it ahead of content_id in sort_values so that last() returns the row with the latest date:

out = df.sort_values(['user_id','date','content_id']).groupby(['user_id'])[['content_id','date']].last()

Another possibility is to convert date to datetime and then find the index of the latest date in each group using groupby + idxmax; then use loc to select the desired rows:

df['date'] = pd.to_datetime(df['date'])
out = df.loc[df.groupby('user_id')['date'].idxmax()]

Output:

         content_id        date
user_id
123              20  2020-10-14
234              19  2021-05-26
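The two approaches can be compared side by side in a short sketch. The data is made up, with a deliberate tie on date for user 123 to show where the results can differ: sorting lets content_id break the tie, while idxmax returns the first occurrence of the maximum date.

```python
import pandas as pd

# Hypothetical data; user 123 has two rows sharing the same latest date.
df = pd.DataFrame({
    "user_id": [123, 123, 234, 234],
    "content_id": [18, 20, 19, 21],
    "date": ["2020-10-14", "2020-10-14", "2021-05-26", "2021-01-03"],
})
df["date"] = pd.to_datetime(df["date"])

# Approach 1: sort, then take the last row per user (user_id becomes the index).
by_last = (
    df.sort_values(["user_id", "date", "content_id"])
    .groupby("user_id")[["content_id", "date"]]
    .last()
)

# Approach 2: look up the row label of the latest date per user.
by_idxmax = df.loc[df.groupby("user_id")["date"].idxmax()]

print(by_last)
print(by_idxmax)
```

Without ties on date, both approaches pick the same rows; they differ only in the shape of the result (grouped index versus original row labels).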

Filter for most recent event by group with pandas

It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:

df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]

Output:

   account_number product   sale_date
3             123    sale  2022-01-01
1             423  rental  2021-10-01
4             513    sale  2021-11-30
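To illustrate why the conversion matters, here is a sketch with made-up data. It assumes the strings are in month/day/year format, which is a guess for illustration and not something the question states. Strings compare character by character, so the "latest" string is not necessarily the latest date.

```python
import pandas as pd

# Hypothetical data with dates stored as MM/DD/YYYY strings (assumed format).
df = pd.DataFrame({
    "account_number": [123, 123, 423],
    "product": ["rental", "sale", "rental"],
    "sale_date": ["12/01/2021", "01/15/2022", "10/01/2021"],
})

# As strings, "12/01/2021" sorts after "01/15/2022" because '1' > '0'.
string_max = df["sale_date"].max()

# After conversion the comparison is chronological.
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%m/%d/%Y")
out = df.loc[df.groupby("account_number")["sale_date"].idxmax()]

print(string_max)
print(out)
```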

