Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
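To make the shape of the result concrete, here is a minimal sketch on a small made-up frame (the values are assumptions for illustration, only the column names come from the answer). Passing a list to agg gives two-level column labels: the original column on the first level and the statistic on the second.

```python
import pandas as pd

# Hypothetical data; only the column names follow the answer above.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1.0, 3.0, 5.0, 7.0],
    "col4": [10, 20, 30, 40],
})

stats = (
    df[["col1", "col2", "col3", "col4"]]
    .groupby(["col1", "col2"])
    .agg(["mean", "count"])
)

# Columns are a MultiIndex of (original column, statistic).
print(stats)

# A single statistic is selected with a tuple.
print(stats[("col3", "mean")])
```

If flat column names are easier to work with downstream, named aggregation (keyword arguments to agg) is usually the simpler route.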
Pandas, groupby and count
You seem to want to group by several columns at once:
df.groupby(['revenue','session','user_id'])['user_id'].count()
should give you what you want
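A small sketch of the same call on made-up data (the values are assumptions). One thing to watch: the resulting Series is named user_id, which clashes with the index level of the same name, so pass a new name when turning it back into a flat frame.

```python
import pandas as pd

# Hypothetical data; the column names follow the answer above.
df = pd.DataFrame({
    "revenue": [10, 10, 20, 20, 20],
    "session": [1, 1, 2, 2, 3],
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
})

counts = df.groupby(["revenue", "session", "user_id"])["user_id"].count()
print(counts)

# Rename the values while resetting the index to avoid a duplicate
# "user_id" column.
flat = counts.reset_index(name="count")
print(flat)
```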
Pandas Groupby: Count and mean combined
You can use groupby with aggregate:
df = df.groupby('source') \
.agg({'text':'size', 'sent':'mean'}) \
.rename(columns={'text':'count','sent':'mean_sent'}) \
.reset_index()
print (df)
source count mean_sent
0 bar 2 0.415
1 foo 3 -0.500
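Note that size counts every row in a group, while count skips missing values. The sketch below uses made-up data with one missing text value to show the difference, and uses named aggregation so that no rename step is needed. The column names non_null and count are just illustrative choices.

```python
import pandas as pd

# Hypothetical data with one missing "text" value in the "foo" group.
df = pd.DataFrame({
    "source": ["foo", "foo", "foo", "bar", "bar"],
    "text": ["a", None, "c", "d", "e"],
    "sent": [-1.0, -0.5, 0.0, 0.33, 0.5],
})

out = (
    df.groupby("source")
    .agg(
        count=("text", "size"),      # all rows in the group
        non_null=("text", "count"),  # rows where "text" is not missing
        mean_sent=("sent", "mean"),
    )
    .reset_index()
)
print(out)
```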
Pandas groupby and count numbers of item by conditions
You can use named aggregation with groupby:
df_test.groupby(
['ID1','ID2']).agg(
Count_ID2=('ID2', 'count'),
Count_ID3=('ID3', 'count'),
Count_condition=("condition", lambda x: str(x).count('!')))
prints:
Count_ID2 Count_ID3 Count_condition
ID1 ID2
A a 3 3 1
aa 1 1 1
aaa 2 2 0
B b 2 2 1
bb 2 2 1
In the above we count the occurrences with aggfunc="count" for columns "ID2" and "ID3", and create a small custom function which counts the occurrences of ! for the "condition" column. We do this for each group and return named columns for our aggregation results.
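One caveat with str(x).count('!'): it counts characters in the printed form of the group, and pandas shortens that printout for long Series, so large groups may be undercounted. A possibly more robust variant, sketched here on made-up data (the values are assumptions), counts per value with the .str accessor and sums within the group.

```python
import pandas as pd

# Hypothetical data; column names follow the answer above.
df_test = pd.DataFrame({
    "ID1": ["A", "A", "A", "B", "B"],
    "ID2": ["a", "a", "aa", "b", "b"],
    "ID3": ["x", "y", "z", "x", "y"],
    "condition": ["ok!", "fine", "no!!", "ok", "yes!"],
})

result = df_test.groupby(["ID1", "ID2"]).agg(
    Count_ID3=("ID3", "count"),
    # Count "!" characters in each value, then add them up per group.
    Count_condition=("condition", lambda s: s.str.count("!").sum()),
)
print(result)
```

This counts every ! character; if you only want the number of rows that contain one, s.str.contains("!", regex=False).sum() would be the analogous expression.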
Python Pandas : group by in groups by and average, count, median
Try using named aggregation (the tuples below are shorthand for pd.NamedAgg):
df.groupby('User').agg(avg_time=('time','mean'),
                       median_time=('time','median'),
                       state=('state','first'),
                       user_count=('time','count')).reset_index()
Output:
User avg_time median_time state user_count
0 A 1.5 1.5 CA 2
1 B 3.0 3.0 ID 1
2 C 4.0 4.0 OR 3
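For reference, the same aggregation can be spelled out with explicit pd.NamedAgg objects. The input below is made up for illustration (the question's data is not shown here), and the output column name median_time is my own choice.

```python
import pandas as pd

# Hypothetical input data, not taken from the question.
df = pd.DataFrame({
    "User": ["A", "A", "B", "C", "C", "C"],
    "time": [1, 2, 3, 3, 4, 5],
    "state": ["CA", "CA", "ID", "OR", "OR", "OR"],
})

summary = df.groupby("User").agg(
    avg_time=pd.NamedAgg(column="time", aggfunc="mean"),
    median_time=pd.NamedAgg(column="time", aggfunc="median"),
    state=pd.NamedAgg(column="state", aggfunc="first"),
    user_count=pd.NamedAgg(column="time", aggfunc="count"),
).reset_index()
print(summary)
```

The tuple form and the pd.NamedAgg form behave the same; the explicit form is just easier to read when there are many output columns.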
Pandas Groupby Syntax explanation
It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object, not a dataframe. The .mean() method is applied column-wise by default, so within each group the mean of each column is calculated independently of the other columns, and the results are returned as a DataFrame (from which a column can be selected) if run on the full grouped object.
Reversing the order selects a single column first and then calculates the mean, which returns a Series. If you know you only want the mean for a single column, it will be faster to isolate that first, rather than calculate the mean for every column (especially if you have a very large dataframe).
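The two orderings can be compared side by side. This is a minimal sketch on made-up data: only the age column comes from the question, while team and height are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; "age" follows the question, the rest is made up.
df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "age": [20, 30, 40, 50],
    "height": [170.0, 180.0, 160.0, 175.0],
})

grouped = df.groupby("team")        # a DataFrameGroupBy; nothing is computed yet
all_means = grouped.mean()          # DataFrame: mean of every column per group
age_after = grouped.mean()["age"]   # compute all means, then pick one column
age_first = grouped["age"].mean()   # pick one column, then compute only its mean

print(type(grouped).__name__)
print(all_means)
print(age_first)
```

Both age_after and age_first hold the same values; the second form simply avoids computing means for columns you then throw away.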
How do I use the Pandas groupby function to calculate the mean for the previous year?
Create a sample data set
import pandas
import numpy as np
df = pandas.DataFrame(
{'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
'2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
'score': [40, 50, 100, 200, 160, 140, 160, 200],
}
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
Use transform to add the current season average to the data frame
df["season_avg"] = df.groupby(["datetime", "player"])["score"].transform("mean")
df
player datetime score year season_avg
0 B 2020-01-01 40 2020 40.000000
1 A 2020-01-01 50 2020 50.000000
2 A 2021-01-01 100 2021 153.333333
3 B 2021-01-01 200 2021 166.666667
4 A 2021-01-01 160 2021 153.333333
5 B 2021-01-01 140 2021 166.666667
6 B 2021-01-01 160 2021 166.666667
7 A 2021-01-01 200 2021 153.333333
A plain shift cannot be applied here because each player has several rows per year, so shifting by one row often lands on another row from the same year
df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")
0 NaN
1 NaN
2 50.000000
3 40.000000
4 153.333333
5 166.666667
6 166.666667
7 153.333333
Name: season_avg, dtype: float64
Compute the average from the previous year and join them to the original dataframe
savg = (df.groupby(["year", "player"])
.agg(last_season_avg = ("score", "mean"))
.reset_index())
savg["year"] = savg["year"] + 1
savg
year player last_season_avg
0 2021 A 50.000000
1 2021 B 40.000000
2 2022 A 153.333333
3 2022 B 166.666667
df.merge(savg, on=["player", "year"], how="left" )
player datetime score year season_avg last_season_avg
0 B 2020-01-01 40 2020 40.000000 NaN
1 A 2020-01-01 50 2020 50.000000 NaN
2 A 2021-01-01 100 2021 153.333333 50.0
3 B 2021-01-01 200 2021 166.666667 40.0
4 A 2021-01-01 160 2021 153.333333 50.0
5 B 2021-01-01 140 2021 166.666667 40.0
6 B 2021-01-01 160 2021 166.666667 40.0
7 A 2021-01-01 200 2021 153.333333 50.0
Another way to compute the average from the previous year is to use shift, which is maybe more elegant than doing year + 1.
savg = (df.groupby(["year", "player"])
.agg(season_avg = ("score", "mean"))
.reset_index()
.sort_values(["year"])
)
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")
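To finish this variant, the shifted averages still need to be joined back onto the original rows. A minimal end-to-end sketch, reusing the sample scores from above but building the year column directly; it calls .shift() on the grouped column, which I believe is equivalent to transform("shift") here. One assumption to be aware of: shift picks up the previous row for that player, so it only equals the previous calendar year if every player has data for each consecutive year.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["B", "A", "A", "B", "A", "B", "B", "A"],
    "year": [2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021],
    "score": [40, 50, 100, 200, 160, 140, 160, 200],
})

# One row per (year, player), ordered by year.
savg = (
    df.groupby(["year", "player"])
    .agg(season_avg=("score", "mean"))
    .reset_index()
    .sort_values(["year"])
)

# Within each player, take the season average from the previous row.
savg["last_season_avg"] = savg.groupby("player")["season_avg"].shift()

# Attach the previous-season average to every original row.
out = df.merge(
    savg[["year", "player", "last_season_avg"]],
    on=["year", "player"],
    how="left",
)
print(out)
```

If players can skip seasons, the year + 1 merge shown earlier is the safer choice, since it matches on the actual year rather than on row position.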