Pandas Get Topmost N Records Within Each Group

Pandas get topmost n records within each group

Did you try

df.groupby('id').head(2)

Output generated:

       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1

(Keep in mind that you might need to sort before calling head, depending on your data.)
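
For example, if you want each group's two largest values rather than its first two rows in order of appearance, sort first. A minimal sketch with made-up data, assuming a numeric value column as in the question:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [3, 1, 2, 5, 4]})

# Sorting by 'value' descending first makes head(2) return each
# group's two largest rows instead of its first two by position.
print(df.sort_values('value', ascending=False).groupby('id').head(2))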

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1

Sorting columns and selecting top n rows in each group pandas dataframe

There are two solutions:

1. sort_values and aggregate head:

df1 = df.sort_values('score', ascending=False).groupby('pidx').head(2)
print (df1)

    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1

2. set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    # Loop over the groups, sort each one, and keep its first two rows.
    for key, values in grouped:
        new_df = pd.concat([new_df,
                            grouped.get_group(key).sort_values('score', ascending=True)[:2]],
                           axis=0)
    return new_df

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
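
So on this data, sort_values followed by groupby.head is by far the fastest approach: roughly 20x faster than the nlargest solution and 70x faster than the explicit loop.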

How to get top n records from each category in a Python dataframe?

You can use groupby.transform on the version column, factorizing the id column to get an incremental value (from 0 upward) for each id within its group; then compare that value to your n and use loc with the resulting mask to select the wanted rows.

n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
    id  Name  version  copies  price
0    6  MSFT     10.0       5    100
1    6  TSLA     10.0      10    200
2    6  ORCL     10.0      15    300
3    5  MSFT     10.0      20    400
4    5  TSLA     10.0      25    500
5    5  ORCL     10.0      30    600
9    3  MSFT      5.0      50   1000
10   3  TSLA      5.0      55   1100
11   3  ORCL      5.0      60   1200
12   2  MSFT      5.0      65   1300
13   2  TSLA      5.0      70   1400
14   2  ORCL      5.0      75   1500
15   1  MSFT     15.0      80   1600
16   1  TSLA     15.0      85   1700
17   1  ORCL     15.0      90   1800

Another option is to use groupby.head after drop_duplicates to keep the unique version-id pairs, then select those pairs with a merge:

df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
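
For instance, on a small hypothetical frame (column names taken from the question), the merge keeps every row whose version-id pair survives the head:

import pandas as pd

# Hypothetical data; column names follow the question.
df = pd.DataFrame({'id':      [6, 6, 5, 5, 4, 3],
                   'Name':    ['MSFT', 'TSLA', 'MSFT', 'TSLA', 'MSFT', 'MSFT'],
                   'version': [10.0, 10.0, 10.0, 10.0, 10.0, 5.0]})

n = 2
# First n distinct ids per version...
top_ids = df[['version', 'id']].drop_duplicates().groupby('version').head(n)
# ...then an inner merge keeps only rows whose (version, id) pair is in top_ids.
print(df.merge(top_ids))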

Pandas top n values in each group

You can sort_values by both ['item', 'value'] and then groupby.head:

df.sort_values(['item', 'value'], ascending=False).groupby('item').head(10)

Or with nlargest:

df.groupby('item').value.nlargest(10).reset_index()
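
Note that nlargest on a grouped Series returns a Series indexed by (item, original row label), so reset_index also turns the original row label into an extra column (named level_1 when the index is unnamed).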

How to keep only the top n% rows of each group of a pandas dataframe?

You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64

Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:

n = 0.5

g = df.groupby(0)
# cumcount numbers the rows within each group (0-based); a row is kept
# only if its position falls within the first n fraction of its group's size.
flags = (g.cumcount() + 1) <= g[1].transform('size') * n

Then apply the condition, set the index as the first series and (if required) sort the index:

df = df.loc[flags].set_index(0).sort_index()

print(df)

   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0

As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.

get top n rows per group pandas

This will do it:

(df.groupby(['Products', 'Month'], as_index=False)
   .sum()
   .sort_values(['Products', 'Sales'], ascending=(True, False))
   .groupby('Products')
   .head(2))

  Products   Month  Sales
1        A  201902     31
2        A  201903     28
4        B  201902     27
3        B  201901      4
7        C  201904     33
6        C  201903      5

Pandas dataframe get first row of each group

>>> df.groupby('id').first()
     value
id
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get n first records, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

How to select top n row from each group after group by in pandas?

I'd recommend sorting your counts in descending order first; you can then call GroupBy.head:

(freq_df.sort_values('count', ascending=False)
        .groupby(['open_year', 'open_month'], sort=False)
        .head(5))
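
A minimal self-contained version of the same idea (the freq_df below is hypothetical; only the column names come from the question):

import pandas as pd

# Hypothetical data; column names follow the question.
freq_df = pd.DataFrame({'open_year':  [2019, 2019, 2019, 2020],
                        'open_month': [1, 1, 2, 1],
                        'count':      [10, 30, 5, 7]})

print(freq_df.sort_values('count', ascending=False)
             .groupby(['open_year', 'open_month'], sort=False)
             .head(5))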

