Pandas get topmost n records within each group
Did you try
df.groupby('id').head(2)
Output generated:
       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
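For example, on a small made-up frame shaped like the one above (the id/value data here is invented for illustration), the one-liner looks like:

```python
import pandas as pd

# Hypothetical sample data mirroring the question's structure
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 1, 1]})

# Take the first two rows of each id group and flatten the index
top2 = df.groupby('id').head(2).reset_index(drop=True)
print(top2)
```

Note that `head(2)` keeps the first two rows per group in their current order, which is why sorting first matters.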
Sorting columns and selecting top n rows in each group pandas dataframe
There are two solutions:

1. sort_values and aggregate with head:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1
2. set_index and aggregate with nlargest:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5
Timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)
def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df,
                            grouped.get_group(key).sort_values('score', ascending=True)[:2]],
                           axis=0)
    return new_df
print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
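On a toy frame (values invented, column names as in the benchmark above), the two solutions keep the same (pidx, score) pairs, just in a different row order, which is easy to sanity-check:

```python
import pandas as pd

# Small invented frame with the benchmark's columns
df = pd.DataFrame({'mainid': [1, 1, 1, 1, 2, 2],
                   'pidx':   ['a', 'a', 'a', 'c', 'x', 'x'],
                   'pidy':   ['c', 'e', 'b', 'a', 'w', 'y'],
                   'score':  [5, 8, 2, 7, 12, 1]})

# Fast path: sort once, then take the first two rows per group
fast = df.sort_values('score', ascending=False).groupby('pidx').head(2)

# Slow path: nlargest per group, then restore the index levels as columns
slow = (df.set_index(['mainid', 'pidy'])
          .groupby('pidx')['score']
          .nlargest(2)
          .reset_index())

print(sorted(zip(fast['pidx'], fast['score'])))
print(sorted(zip(slow['pidx'], slow['score'])))
```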
How to get top n records from each category in a Python dataframe?
You can use groupby.transform on the version column and factorize the id column to get an incremental value (0, 1, 2, ...) for each distinct id within a group, then compare it to your n and use loc with this mask to select the wanted rows.
n = 2
print(df.loc[df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0])<n])
    id  Name  version  copies  price
0    6  MSFT     10.0       5    100
1    6  TSLA     10.0      10    200
2    6  ORCL     10.0      15    300
3    5  MSFT     10.0      20    400
4    5  TSLA     10.0      25    500
5    5  ORCL     10.0      30    600
9    3  MSFT      5.0      50   1000
10   3  TSLA      5.0      55   1100
11   3  ORCL      5.0      60   1200
12   2  MSFT      5.0      65   1300
13   2  TSLA      5.0      70   1400
14   2  ORCL      5.0      75   1500
15   1  MSFT     15.0      80   1600
16   1  TSLA     15.0      85   1700
17   1  ORCL     15.0      90   1800
Another option is to use groupby.head after drop_duplicates to keep the unique version-id pairs, then select those pairs with a merge:
df.merge(df[['version','id']].drop_duplicates().groupby('version').head(n))
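To see why the factorize trick works, here is a minimal sketch on invented data with only the id and version columns (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'id':      [6, 6, 5, 4, 3, 3, 2],
                   'version': [10.0, 10.0, 10.0, 10.0, 5.0, 5.0, 5.0]})

n = 2
# factorize numbers each distinct id 0, 1, 2, ... within its version group,
# so keeping codes < n keeps all rows of the first n ids per version
mask = df.groupby('version')['id'].transform(lambda x: pd.factorize(x)[0]) < n
print(df.loc[mask])
```

Every row of a kept id survives, which is what distinguishes this from a plain per-group `head(n)`.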
Pandas top n values in each group
You can sort_values by both ['item', 'value'] and then call groupby.head:
df.sort_values(['item', 'value'], ascending=False).groupby('item').head(10)
Or with nlargest
:
df.groupby('item').value.nlargest(10).reset_index()
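A minimal sketch of the nlargest variant (item/value data made up, and n = 2 instead of 10 to keep it small):

```python
import pandas as pd

df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                   'value': [3, 1, 2, 5, 4]})

# Top 2 values per item; reset_index turns the (item, original row)
# MultiIndex back into columns
top = df.groupby('item')['value'].nlargest(2).reset_index()
print(top)
```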
How to keep only the top n% rows of each group of a pandas dataframe?
You can construct a Boolean series of flags from the groupby and filter with it. First, let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0
As you can see, the resulting dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
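Putting the steps above together into one runnable snippet (same seed and frame as the example):

```python
import numpy as np
import pandas as pd

# Reproduce the answer's setup: 10 rows of random 0/1 values
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

n = 0.5
g = df.groupby(0)
# cumcount numbers the rows 0, 1, 2, ... inside each group; keeping the
# first size * n of them retains the top 50% of rows per group
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
result = df.loc[flags]
print(result[0].value_counts())
```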
get top n rows per group pandas
This will do it:
(df.groupby(['Products', 'Month'], as_index=False)
   .sum()
   .sort_values(['Products', 'Sales'], ascending=(True, False))
   .groupby('Products')
   .head(2))

  Products   Month  Sales
1        A  201902     31
2        A  201903     28
4        B  201902     27
3        B  201901      4
7        C  201904     33
6        C  201903      5
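As a self-contained check, here is the same pipeline on invented sales rows that aggregate to the totals shown above:

```python
import pandas as pd

# Hypothetical monthly sales rows (values invented for illustration)
df = pd.DataFrame({'Products': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Month':    [201902, 201902, 201903, 201901, 201902,
                                201903, 201904, 201904],
                   'Sales':    [11, 20, 28, 4, 27, 5, 30, 3]})

# Aggregate per (product, month), sort within product by sales
# descending, then keep the top two months per product
top2 = (df.groupby(['Products', 'Month'], as_index=False)
          .sum()
          .sort_values(['Products', 'Sales'], ascending=(True, False))
          .groupby('Products')
          .head(2))
print(top2)
```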
Pandas dataframe get first row of each group
>>> df.groupby('id').first()
value
id
1 first
2 first
3 first
4 second
5 first
6 first
7 fourth
If you need id
as column:
>>> df.groupby('id').first().reset_index()
id value
0 1 first
1 2 first
2 3 first
3 4 second
4 5 first
5 6 first
6 7 fourth
To get n first records, you can use head():
>>> df.groupby('id').head(2).reset_index(drop=True)
id value
0 1 first
1 1 second
2 2 first
3 2 second
4 3 first
5 3 third
6 4 second
7 4 fifth
8 5 first
9 6 first
10 6 second
11 7 fourth
12 7 fifth
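A short sketch (data invented) showing first() alongside an equivalent drop_duplicates formulation:

```python
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 2, 2, 3],
                   'value': ['first', 'second', 'first', 'second', 'first']})

# First row of each id group, with id back as a column
firsts = df.groupby('id').first().reset_index()
print(firsts)

# drop_duplicates keeps the first row per id as well
same = df.drop_duplicates('id').reset_index(drop=True)
print(same)
```

Both keep the first occurrence per id; `first()` additionally sorts the groups by the key.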
How to select top n row from each group after group by in pandas?
I'd recommend sorting your counts in descending order first; you can then call GroupBy.head:
(freq_df.sort_values('count', ascending=False)
.groupby(['open_year','open_month'], sort=False).head(5)
)