Get Statistics For Each Group (Such as Count, Mean, etc.) Using Pandas Groupby

Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

On a groupby object, the agg function can take a list of functions so that several aggregation methods are applied at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
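A minimal, self-contained sketch of the same pattern on made-up data (the column names here are just placeholders), to show the shape of the result:

import pandas as pd

# hypothetical data: col1/col2 are the grouping keys, col3/col4 are values
df = pd.DataFrame({'col1': ['x', 'x', 'y', 'y'],
                   'col2': ['a', 'a', 'b', 'b'],
                   'col3': [1.0, 2.0, 3.0, 4.0],
                   'col4': [10, 20, 30, 40]})

# one (mean, count) pair of sub-columns per value column
print(df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count']))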

Pandas, groupby and count

You seem to want to group by several columns at once:

df.groupby(['revenue','session','user_id'])['user_id'].count()

should give you what you want.
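As a side note (not in the original answer): count() only counts non-null values, while size() counts every row in each group. A tiny sketch with invented data:

import pandas as pd

# hypothetical data
df = pd.DataFrame({'revenue': [10, 10, 20],
                   'session': [1, 1, 2],
                   'user_id': ['u1', 'u1', None]})

print(df.groupby(['revenue', 'session'])['user_id'].count())  # ignores the None
print(df.groupby(['revenue', 'session'])['user_id'].size())   # counts every row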

Pandas Groupby: Count and mean combined

You can use groupby with aggregate:

df = df.groupby('source') \
       .agg({'text': 'size', 'sent': 'mean'}) \
       .rename(columns={'text': 'count', 'sent': 'mean_sent'}) \
       .reset_index()
print(df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
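Named aggregation (available since pandas 0.25) gives the same result without the separate rename step; a minimal sketch reusing the same columns:

(df.groupby('source')
   .agg(count=('text', 'size'), mean_sent=('sent', 'mean'))
   .reset_index())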

Pandas groupby and count numbers of item by conditions

You can use named aggregation on the groupby:

df_test.groupby(['ID1', 'ID2']).agg(
    Count_ID2=('ID2', 'count'),
    Count_ID3=('ID3', 'count'),
    Count_condition=("condition", lambda x: str(x).count('!')))

prints:

         Count_ID2  Count_ID3  Count_condition
ID1 ID2
A   a            3          3                1
    aa           1          1                1
    aaa          2          2                0
B   b            2          2                1
    bb           2          2                1

In the above we count the occurrences with the "count" aggregation for the "ID2" and "ID3" columns, and use a small custom function that counts the occurrences of "!" in the "condition" column. We do this for each group and return named columns for the aggregation results.
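If the "condition" column holds strings, an arguably cleaner alternative to the str(x).count('!') trick (this is a suggestion, not part of the original answer) is to count per element and then sum over the group:

# assumes "condition" is a string column: count '!' per value, then total per group
df_test.groupby(['ID1', 'ID2']).agg(
    Count_condition=('condition', lambda x: x.str.count('!').sum()))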

Python Pandas: group by and average, count, median

Try using pd.NamedAgg:

df.groupby('User').agg(avg_time=('time', 'mean'),
                       mean_time=('time', 'median'),
                       state=('state', 'first'),
                       user_count=('time', 'count')).reset_index()

Output:

  User  avg_time  mean_time state  user_count
0    A       1.5        1.5    CA           2
1    B       3.0        3.0    ID           1
2    C       4.0        4.0    OR           3

Pandas Groupby Syntax explanation

It reads as if you are calling the .mean() function on the age column specifically. The second appears as if you are calling .mean() on the whole groupby object and selecting the age column after?

That is exactly what's happening. df.groupby() returns a DataFrameGroupBy object, not a plain dataframe. Calling .mean() on it computes the mean of each column independently of the others and returns a dataframe with one row per group and one column per aggregated column, which you can then index to pull out the age column.

Reversing the order selects the single column first (a SeriesGroupBy) and only then calculates the mean, for that column alone. If you know you only want the mean of a single column, it is faster to isolate that column first rather than calculate the mean of every column (especially with a very large dataframe).
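A small sketch to make the difference concrete (the DataFrame and column names here are invented for illustration):

import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b'],
                   'age': [20, 30, 40],
                   'height': [170, 180, 190]})

# selects the 'age' column first, then averages only that column
fast = df.groupby('team')['age'].mean()

# averages every numeric column, then selects 'age' from the result
slow = df.groupby('team').mean(numeric_only=True)['age']

print(fast.equals(slow))  # True: same values, different amount of work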

How do I use the Pandas groupby function to calculate the mean for the previous year?

Create a sample data set

import pandas
import numpy as np

df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
     }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year

Use transform to add the current season average to the data frame

df["season_avg"] = df.groupby(["datetime", "player"])["score"].transform("mean")
df

  player   datetime  score  year  season_avg
0      B 2020-01-01     40  2020   40.000000
1      A 2020-01-01     50  2020   50.000000
2      A 2021-01-01    100  2021  153.333333
3      B 2021-01-01    200  2021  166.666667
4      A 2021-01-01    160  2021  153.333333
5      B 2021-01-01    140  2021  166.666667
6      B 2021-01-01    160  2021  166.666667
7      A 2021-01-01    200  2021  153.333333

A plain shift cannot be applied here because years are repeated across rows, so shifting per player just picks up the previous row (often from the same season) rather than the previous year:

df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")

0           NaN
1           NaN
2     50.000000
3     40.000000
4    153.333333
5    166.666667
6    166.666667
7    153.333333
Name: season_avg, dtype: float64

Compute the average from the previous year and join it back to the original dataframe

savg = (df.groupby(["year", "player"])
          .agg(last_season_avg=("score", "mean"))
          .reset_index())
savg["year"] = savg["year"] + 1
savg

   year player  last_season_avg
0  2021      A        50.000000
1  2021      B        40.000000
2  2022      A       153.333333
3  2022      B       166.666667

df.merge(savg, on=["player", "year"], how="left")

  player   datetime  score  year  season_avg  last_season_avg
0      B 2020-01-01     40  2020   40.000000              NaN
1      A 2020-01-01     50  2020   50.000000              NaN
2      A 2021-01-01    100  2021  153.333333             50.0
3      B 2021-01-01    200  2021  166.666667             40.0
4      A 2021-01-01    160  2021  153.333333             50.0
5      B 2021-01-01    140  2021  166.666667             40.0
6      B 2021-01-01    160  2021  166.666667             40.0
7      A 2021-01-01    200  2021  153.333333             50.0

Another way to compute the average from the previous year uses shift, which is arguably more elegant than adding 1 to the year:

savg = (df.groupby(["year", "player"])
          .agg(season_avg=("score", "mean"))
          .reset_index()
          .sort_values(["year"]))
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")

