GroupBy pandas DataFrame and select most common value

You can use value_counts() to get a count Series and then take the first value of its index, which is the most frequent value:

import pandas as pd

source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                       'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])

If you want to perform other aggregations in the same .agg() call, try named aggregation:

# Let's add a new column, account
source['account'] = [1, 2, 3, 3]

source.groupby(['Country', 'City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))

GroupBy pandas DataFrame and select most common value which is alphabetically first

Try with groupby and mode:

mapper = df.groupby("province")["city"].agg(lambda x: x.mode().sort_values().iloc[0]).to_dict()
df["city"] = df["city"].where(df["city"].notnull(),
                              df["province"].map(mapper))

>>> df
   province       city
0         A    newyork
1         A     london
2         A    newyork
3         A     london
4         A     london
5         A     london
6         A    houston
7         B  hyderabad
8         B    karachi
9         B  hyderabad
10        B  hyderabad
11        B  hyderabad
12        B    beijing
13        B    karachi

Group by a column to find the most frequent value in another column?

Use SeriesGroupBy.value_counts and select first value of index:

df = df.groupby('col1')['col2'].apply(lambda x: x.value_counts().index[0]).reset_index()
print(df)
    col1 col2
0   blue   nb
1  green   gx

Or add DataFrame.drop_duplicates:

df = df.groupby('col1')['col2'].value_counts().reset_index(name='v')

df = df.drop_duplicates('col1')[['col1', 'col2']]
print(df)
    col1 col2
0   blue   nb
2  green   gx

Or use Series.mode and select first value by positions by Series.iat:

df = df.groupby('col1')['col2'].apply(lambda x: x.mode().iat[0]).reset_index()
print(df)
    col1 col2
0   blue   nb
1  green   gx

EDIT:

The problem arises with groups that contain only NaNs:

import numpy as np
import pandas as pd

d = {'col1': ['green', 'green', 'green', 'blue', 'blue', 'blue'],
     'col2': [np.nan, np.nan, np.nan, 'nb', 'nb', 'mj']}
df = pd.DataFrame(data=d)

f = lambda x: np.nan if x.isnull().all() else x.value_counts().index[0]
# or
# f = lambda x: next(iter(x.value_counts().index), np.nan)
# another solution
# f = lambda x: next(iter(x.mode()), np.nan)
df = df.groupby('col1')['col2'].apply(f).reset_index()
print(df)
    col1 col2
0   blue   nb
1  green  NaN
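The commented alternatives behave the same way; for example, the mode-based variant can be checked like this (mode() drops NaN, so an all-NaN group yields an empty Series and next() falls back to np.nan):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['green', 'green', 'green', 'blue', 'blue', 'blue'],
                   'col2': [np.nan, np.nan, np.nan, 'nb', 'nb', 'mj']})

# mode() ignores NaN, so it is empty for the all-NaN 'green' group;
# next(iter(...), np.nan) then supplies the NaN fallback
f = lambda x: next(iter(x.mode()), np.nan)
out = df.groupby('col1')['col2'].apply(f).reset_index()
```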

pandas groupby and find most frequent value (mode)

You can calculate both the count and the max of the dates, then sort on those values and keep one row per user with groupby().head(1) (drop_duplicates would also work):

s = df.groupby(['user_id','product_id'])['created_at'].agg(['count','max'])
s.sort_values(['count','max'], ascending=False).groupby('user_id').head(1)

Output:

                    count                 max
user_id product_id
3       400             2 2021-04-21 10:20:00
1       200             2 2020-06-24 10:10:24
2       300             1 2021-01-21 10:20:00
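A minimal sketch with made-up rows (the column names follow the question, the data is hypothetical):

```python
import pandas as pd

# Hypothetical purchase log: user 1 bought product 200 twice, product 100 once
df = pd.DataFrame({
    'user_id': [1, 1, 1],
    'product_id': [100, 200, 200],
    'created_at': pd.to_datetime(['2020-06-20 09:00:00',
                                  '2020-06-21 08:00:00',
                                  '2020-06-24 10:10:24'])})

s = df.groupby(['user_id', 'product_id'])['created_at'].agg(['count', 'max'])
# Sort so the most frequent pair (ties broken by latest date) comes first,
# then keep one row per user
top = s.sort_values(['count', 'max'], ascending=False).groupby('user_id').head(1)
```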

Python: select most frequent using group by

In the comments you note you're using pandas. You can do something like the following:

>>> df

            tag  category
0    automotive         8
1            ba         8
2        bamboo         8
3        bamboo         8
4        bamboo         8
5        bamboo         8
6        bamboo         8
7        bamboo        10
8        bamboo         8
9        bamboo         9
10       bamboo         8
11       bamboo        10
12       bamboo         8
13       bamboo         9
14       bamboo         8
15  banana tree         8
16  banana tree         8
17  banana tree         8
18  banana tree         8
19         bath         9

Do a groupby on 'tag' and, within each group, take the mode of the 'category' column. However, we have to make it conditional, because pandas doesn't return a single number for the mode when a group has fewer than 3 observations; in the special cases of 1 or 2 observations we can just return the group itself. We can use the aggregate/agg method with a lambda function to do this:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath 9

Note that when a group is multi-modal you will get a NumPy array back. For example, suppose there were two entries for bath (all the other data the same):

tag|category
bath|9
bath|10

In that case the output would be:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath [9, 10]

You can also use the value_counts method instead of mode. Once again, do a groupby on 'tag' for the 'category' column and then within each group use the value_counts method. value_counts arranges in descending order so you want to grab the index of the first row:

>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])

tag
automotive 8
ba 8
bamboo 8
banana tree 8
bath 9

However, this won't return an array in multi-modal situations. It will just return the first mode.
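If you always want every tied mode back, one possible alternative (an assumption about intent, not from the original answer) is to collect mode() into a list, which keeps ties visible for every group size:

```python
import pandas as pd

df = pd.DataFrame({'tag': ['bamboo', 'bamboo', 'bamboo', 'bath', 'bath'],
                   'category': [8, 8, 9, 9, 10]})

# mode() returns all tied values in sorted order; tolist() keeps ties visible
modes = df.groupby('tag')['category'].agg(lambda x: x.mode().tolist())
```

Single-mode groups come back as one-element lists, multi-modal groups as longer lists.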

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Use SeriesGroupBy.value_counts, which sorts by default, then add DataFrame.drop_duplicates after Series.reset_index to keep the top value per group:

df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
   A  most_freq  freq
0  0          3     2
2  1          0     1
4  2          6     1
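With a hypothetical df_test (the original frame isn't shown in the question), the chain runs like this:

```python
import pandas as pd

# Hypothetical input resembling the question's df_test
df_test = pd.DataFrame({'A': [0, 0, 0, 1, 2],
                        'B': [3, 3, 5, 0, 6]})

out = (df_test.groupby('A')['B']
              .value_counts()          # counts, sorted descending per group
              .rename_axis(['A', 'most_freq'])
              .reset_index(name='freq')
              .drop_duplicates('A'))   # keep the top row of each group
```

Because value_counts sorts descending within each group, the first row per group is the most frequent value together with its count.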

Finding most common values with Pandas GroupBy and value_counts

Use head on the value_counts result within each group:

df.groupby('Area Name')['Code Description'].apply(lambda x: x.value_counts().head(3))

Output:

Area Name
77th Street  RAPE, FORCIBLE                                                   1
Foothill     CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)0060     1
N Hollywood  CRIMINAL THREATS - NO WEAPON DISPLAYED                           2
             VIOLATION OF RESTRAINING ORDER                                   1
             ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT                   1
Southeast    ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT                   1
West Valley  CRIMINAL THREATS - NO WEAPON DISPLAYED                           2
Name: Code Description, dtype: int64

How to find common values in groupby groups?

Since df is already sorted by tour, we could use groupby + first:

df['val'] = df.groupby('user')['val'].transform('first')

Output:

    user  game  tour  val
0    jim     1     1   10
1   john     1     1   12
2   jack     2     1   14
3    jim     2     1   10
4    mel     3     2   20
5    jim     3     2   10
6    mat     4     2   14
7   nick     4     2   20
8    tim     5     3   16
9   john     5     3   12
10   lin     6     3   16
11  mick     6     3   20
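If the frame were not already sorted by tour, you could (as a variation on the same idea) sort first and still use transform('first'), since transform keeps the original index and assignment aligns back correctly:

```python
import pandas as pd

# Hypothetical unsorted rows: jim's earliest-tour value is 10
df = pd.DataFrame({'user': ['jim', 'john', 'jim'],
                   'tour': [2, 1, 1],
                   'val':  [99, 12, 10]})

# Sort by tour so 'first' picks each user's earliest-tour value;
# transform preserves the original index, so assignment realigns the rows
df['val'] = df.sort_values('tour').groupby('user')['val'].transform('first')
```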

Retrieve most frequent value for each couple of values

Use custom lambda function for first mode:

df = (df.groupby(['Latitude', 'Longitude'])['street_type']
        .agg(lambda x: x.mode().iat[0])
        .reset_index())

