Pandas Get Topmost N Records Within Each Group

Pandas get topmost n records within each group

Did you try

df.groupby('id').head(2)

Output generated:

       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1

(Keep in mind that you might need to sort before calling head, depending on your data.)
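
For example, if you want each group's two largest values rather than its first two rows in order of appearance, sort first. A minimal sketch with made-up data, assuming a numeric value column as in the question:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [3, 1, 2, 5, 4]})

# Sorting by 'value' descending first makes head(2) return each
# group's two largest rows instead of its first two by position.
print(df.sort_values('value', ascending=False).groupby('id').head(2))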

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1

Sorting columns and selecting top n rows in each group pandas dataframe

There are two solutions:

1. sort_values and aggregate head:

df1 = df.sort_values('score', ascending=False).groupby('pidx').head(2)
print (df1)

    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1

2. set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    # Loop over the groups, sort each one, and keep its first two rows.
    for key, values in grouped:
        new_df = pd.concat([new_df,
                            grouped.get_group(key).sort_values('score', ascending=True)[:2]],
                           axis=0)
    return new_df

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
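
So on this data, sort_values followed by groupby.head is by far the fastest approach: roughly 20x faster than the nlargest solution and 70x faster than the explicit loop.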

How to get top n records from each category in a Python dataframe?

You can use groupby.transform on the version column, factorizing the id column to get an incremental value (from 0 upward) for each id within its group; then compare that value to your n and use loc with the resulting mask to select the wanted rows.

n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
    id  Name  version  copies  price
0    6  MSFT     10.0       5    100
1    6  TSLA     10.0      10    200
2    6  ORCL     10.0      15    300
3    5  MSFT     10.0      20    400
4    5  TSLA     10.0      25    500
5    5  ORCL     10.0      30    600
9    3  MSFT      5.0      50   1000
10   3  TSLA      5.0      55   1100
11   3  ORCL      5.0      60   1200
12   2  MSFT      5.0      65   1300
13   2  TSLA      5.0      70   1400
14   2  ORCL      5.0      75   1500
15   1  MSFT     15.0      80   1600
16   1  TSLA     15.0      85   1700
17   1  ORCL     15.0      90   1800

Another option is to use groupby.head after drop_duplicates to keep the unique version-id pairs, then select those pairs with a merge:

df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
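
For instance, on a small hypothetical frame (column names taken from the question), the merge keeps every row whose version-id pair survives the head:

import pandas as pd

# Hypothetical data; column names follow the question.
df = pd.DataFrame({'id':      [6, 6, 5, 5, 4, 3],
                   'Name':    ['MSFT', 'TSLA', 'MSFT', 'TSLA', 'MSFT', 'MSFT'],
                   'version': [10.0, 10.0, 10.0, 10.0, 10.0, 5.0]})

n = 2
# First n distinct ids per version...
top_ids = df[['version', 'id']].drop_duplicates().groupby('version').head(n)
# ...then an inner merge keeps only rows whose (version, id) pair is in top_ids.
print(df.merge(top_ids))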

Pandas top n values in each group

You can sort_values by both ['item', 'value'] and then groupby.head:

df.sort_values(['item', 'value'], ascending=False).groupby('item').head(10)

Or with nlargest:

df.groupby('item').value.nlargest(10).reset_index()
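
Note that nlargest on a grouped Series returns a Series indexed by (item, original row label), so reset_index also turns the original row label into an extra column (named level_1 when the index is unnamed).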

How to keep only the top n% rows of each group of a pandas dataframe?

You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64

Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:

n = 0.5

g = df.groupby(0)
# cumcount numbers the rows within each group (0-based); a row is kept
# only if its position falls within the first n fraction of its group's size.
flags = (g.cumcount() + 1) <= g[1].transform('size') * n

Then apply the condition, set the index as the first series and (if required) sort the index:

df = df.loc[flags].set_index(0).sort_index()

print(df)

   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0

As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.

get top n rows per group pandas

This will do it:

(df.groupby(['Products', 'Month'], as_index=False)
   .sum()
   .sort_values(['Products', 'Sales'], ascending=(True, False))
   .groupby('Products')
   .head(2))

  Products   Month  Sales
1        A  201902     31
2        A  201903     28
4        B  201902     27
3        B  201901      4
7        C  201904     33
6        C  201903      5

Pandas dataframe get first row of each group

>>> df.groupby('id').first()
     value
id
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get n first records, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

How to select top n row from each group after group by in pandas?

I'd recommend sorting your counts in descending order first; you can then call GroupBy.head:

(freq_df.sort_values('count', ascending=False)
        .groupby(['open_year', 'open_month'], sort=False)
        .head(5))
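
A minimal self-contained version of the same idea (the freq_df below is hypothetical; only the column names come from the question):

import pandas as pd

# Hypothetical data; column names follow the question.
freq_df = pd.DataFrame({'open_year':  [2019, 2019, 2019, 2020],
                        'open_month': [1, 1, 2, 1],
                        'count':      [10, 30, 5, 7]})

print(freq_df.sort_values('count', ascending=False)
             .groupby(['open_year', 'open_month'], sort=False)
             .head(5))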

