Multiple aggregations of the same column using pandas GroupBy.agg()
As of 2022-06-20, the following (named aggregation) is the accepted practice for aggregations:
df.groupby('dummy').agg(
    Mean=('returns', np.mean),
    Sum=('returns', np.sum))
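A minimal runnable sketch of named aggregation (the `returns` values here are made up, and the string spellings `'mean'`/`'sum'` are used, which avoid the FutureWarning newer pandas raises for `np.mean`/`np.sum`):

```python
import numpy as np
import pandas as pd

# hypothetical data: one group with ten uniform draws
rng = np.random.default_rng(0)
df = pd.DataFrame({'dummy': 1, 'returns': rng.random(10)})

# named aggregation: each keyword is (column, aggregation)
# and becomes a flat output column name
out = df.groupby('dummy').agg(
    Mean=('returns', 'mean'),
    Sum=('returns', 'sum'))
print(out)
```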
The sections below the fold are kept for historical versions of pandas.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
or as a dictionary of dictionaries (note: this spelling was removed in pandas 1.0):
In [21]: df.groupby('dummy').agg({'returns':
             {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012
Apply multiple functions to multiple groupby columns
The second half of the currently accepted answer is outdated and hits two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.
If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name via the special __name__ attribute, like this:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
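On modern pandas you can skip the __name__ trick entirely with named aggregation, which lets each output column carry its own flat label (a sketch with made-up data; the column names `a_sum`, `d_range`, etc. are my own choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((4, 4)), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

# each keyword becomes a flat output column name,
# so the lambda never shows up as <lambda>
out = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()))
print(out)
```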
Using apply and returning a Series
Now, if you have multiple columns that need to interact together, you cannot use agg, which implicitly passes a Series to the aggregating function. With apply, the entire group gets passed into the function as a DataFrame.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
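A self-contained version of the same pattern (the data is made up here). Selecting the needed columns before apply also sidesteps the pandas >= 2.2 deprecation warning about the grouping column being passed to the function:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((4, 4)), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

def f(x):
    # x is the whole sub-DataFrame, so columns can interact
    return pd.Series({
        'a_sum': x['a'].sum(),
        'a_max': x['a'].max(),
        'b_mean': x['b'].mean(),
        'c_d_prodsum': (x['c'] * x['d']).sum(),
    })

# select the columns f actually uses before applying
out = df.groupby('group')[['a', 'b', 'c', 'd']].apply(f)
print(out)
```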
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
How to implement multiple aggregations using pandas groupby, referencing a specific column
Two step process... aggregate for the sum of stores and idxmin for rank... then use idxmin to slice the original dataframe and join it with the aggregate:
agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))
df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']].join(agged.stores, on='title')
title t_no t_descr rank stores
0 A 1 a 1 1000
1 B 1 a 1 1800
3 C 2 b 2 800
4 D 1 a 1 1800
7 E 3 c 3 700
6 F 4 d 4 500
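A runnable sketch of the two-step process with made-up data (`idxmin` returns the index label of the row holding each group's minimum rank, which is what makes the slice work):

```python
import pandas as pd

# hypothetical data: two rows for title 'A', keep the one with the best rank
df = pd.DataFrame({
    'title':  ['A', 'A', 'B'],
    't_no':   [1, 2, 1],
    'rank':   [1, 3, 2],
    'stores': [600, 400, 800],
})

# step 1: aggregate stores, and grab the row label of the minimum rank
agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))

# step 2: slice the original frame at those labels and join the totals
out = (df.loc[agged['rank'], ['title', 't_no', 'rank']]
         .join(agged['stores'], on='title'))
print(out)
```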
Pandas groupby, how to do multiple aggregations on multiple columns?
Use GroupBy.agg:
df2 = df.groupby('Product', as_index=False).agg({'occasion': '|'.join, 'count': 'sum'})
print(df2)
# Product occasion count
#0 cake wedding 2
#1 chairs funeral|wedding 5
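The input frame isn't shown above; a sketch with assumed data that reproduces that output:

```python
import pandas as pd

# hypothetical input consistent with the output above
df = pd.DataFrame({
    'Product':  ['cake', 'chairs', 'chairs'],
    'occasion': ['wedding', 'funeral', 'wedding'],
    'count':    [2, 3, 2],
})

# '|'.join concatenates the occasion strings per product,
# while 'sum' totals the counts
df2 = df.groupby('Product', as_index=False).agg(
    {'occasion': '|'.join, 'count': 'sum'})
print(df2)
```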
Pandas Groupby: Aggregations on the same column but totals based on two different critera / dataframes
You need a Total_RFQ column using size for the total counts, and a Done_RFQ column counted via a boolean mask: compare the values with 'Done' and sum the Trues:
d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.eq('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
display_name security_type1 currency_str Total_RFQ Done_RFQ Done_Pct
0 A GOVT USD 1 1 100.0
1 B CORP NZD 1 0 0.0
2 B CORP USD 1 1 100.0
3 C CORP EUR 2 1 50.0
4 C CORP GBP 2 2 100.0
5 C CORP USD 1 1 100.0
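Note that passing a list of (name, func) tuples to a SeriesGroupBy for renaming was deprecated in pandas 0.20 and later removed; on current pandas the same result comes from named aggregation (a sketch with made-up data and fewer grouping keys):

```python
import pandas as pd

# hypothetical data in the same shape as the question
df = pd.DataFrame({
    'display_name': ['A', 'A', 'B'],
    'state':        ['Done', 'Open', 'Done'],
})

# keyword = func on a SeriesGroupBy names the output column directly
out = (df.groupby('display_name')['state']
         .agg(Total_RFQ='size',
              Done_RFQ=lambda x: x.eq('Done').sum())
         .reset_index())
out['Done_Pct'] = out['Done_RFQ'] / out['Total_RFQ'] * 100
print(out)
```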
If you need to check substrings instead:
d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.str.contains('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
display_name security_type1 currency_str Total_RFQ Done_RFQ Done_Pct
0 A GOVT USD 1 1 100.0
1 B CORP NZD 1 0 0.0
2 B CORP USD 1 1 100.0
3 C CORP EUR 2 1 50.0
4 C CORP GBP 2 2 100.0
5 C CORP USD 1 1 100.0
different aggregated sums of the same column based on categorical values in other columns
One option is to do a groupby and unstack:
(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)
set_id normal_pieces spare_pieces num_pieces
0 A 29.0 8.0 37.0
1 B 24.0 8.0 32.0
2 C 10.0 NaN 10.0
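The input isn't shown above; with assumed data, the chain above runs like this (a set with no spare pieces simply gets NaN in that column, which the row-wise sum skips):

```python
import pandas as pd

# hypothetical input: pieces per set, split into normal and spare
df = pd.DataFrame({
    'set_id':   ['A', 'A', 'B', 'C'],
    'is_spare': [False, True, False, False],
    'quantity': [29, 8, 24, 10],
})

out = (df
       .groupby(['set_id', 'is_spare'])['quantity']
       .sum()
       .unstack('is_spare')
       .rename(columns={False: 'normal_pieces', True: 'spare_pieces'})
       # row-wise total; NaN (no spares) is skipped by sum
       .assign(num_pieces=lambda d: d.sum(axis='columns'))
       .rename_axis(columns=None)
       .reset_index())
print(out)
```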
For the updated solution you could again use a groupby and unstack, but I'll jump straight to pivot_table, which is a wrapper around groupby and pivot:
temp = df.pivot_table(index='set_id',
                      columns=['is_spare', 'color'],
                      values='quantity',
                      aggfunc='sum')
# get the sum per color (`red`, `grey`, ...) across the top level
colors = temp.groupby(level='color', axis=1).sum()
#pandas MultiIndex works nicely here
# where we can select the top columns and sum
# in this case, `False`, and `True`
(temp.assign(num_pieces=temp.sum(1),
             normal_pieces=temp[False].sum(1),
             spare_pieces=temp[True].sum(1),
             # assign is basically an expansion of a dictionary
             # and here we take advantage of that
             **colors)
     .drop(columns=[False, True])
     .reset_index()
     .rename_axis(columns=[None, None], index=None)
)
set_id num_pieces normal_pieces spare_pieces black grey red white
0 A 37.0 29.0 8.0 0.0 8.0 16.0 13.0
1 B 32.0 24.0 8.0 13.0 0.0 9.0 10.0
2 C 10.0 10.0 0.0 0.0 10.0 0.0 0.0
Another option, that may be a bit faster (groupby is called only once), is to use get_dummies, before grouping:
temp = df.set_index('set_id').loc[:, ['is_spare', 'color', 'quantity']]
# get_dummies returns 0 and 1, depending on if the value exists
# so if `blue` exists for a row, 1 is assigned, else 0
(pd.get_dummies(temp.drop(columns='quantity'),
                columns=['is_spare', 'color'],
                prefix='',
                prefix_sep='')
 # here we do a conditional replacement,
 # similar to python's if-else statement:
 # keep the 0s, replace the 1s with quantity
 .where(lambda df: df == 0, temp.quantity, axis=0)
 # from here on it is grouping, with some renaming
 .groupby('set_id')
 .sum()
 .assign(num_pieces=lambda df: df[['False', 'True']].sum(1))
 .rename(columns={'False': 'normal_pieces', 'True': 'spare_pieces'})
)
normal_pieces spare_pieces black grey red white num_pieces
set_id
A 29 8 0 8 16 13 37
B 24 8 13 0 9 10 32
C 10 0 0 10 0 0 10
How to apply different aggregation functions to same column by using pandas Groupby
The following should work:
data.groupby(['A','B']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Basically, calling agg and passing a list of functions will generate multiple columns with those functions applied.
Example:
In [12]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Out[12]:
a
mean std count
b
0 -0.769198 0.158049 2
1 0.247708 0.743606 2
2 -0.312705 NaN 1
You can also pass the method names as strings. The common ones work; some of the more obscure ones don't (I can't remember which), but in this case they work fine. Thanks to @ajcr for the suggestion:
In [16]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg(['mean', 'std', 'count'])
Out[16]:
a
mean std count
b
0 -1.037301 0.790498 2
1 -0.495549 0.748858 2
2 -0.644818 NaN 1
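The string spellings and the function objects dispatch to the same computations; a quick check under made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({'a': rng.standard_normal(5), 'b': [0, 0, 1, 1, 2]})

# same aggregations spelled two ways; column names come out
# identical ('mean', 'std', 'count') because agg uses __name__
by_name = df.groupby('b')['a'].agg(['mean', 'std', 'count'])
by_func = df.groupby('b')['a'].agg([pd.Series.mean, pd.Series.std, pd.Series.count])
print(by_name)
```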
pivot_table(), multiple aggfuncs to the *same* column - possible?
IIUC use:
pivot = df.pivot_table(index=['Year', 'Month'],
                       values=['Claims'],
                       aggfunc={'Claims': ['min', 'max']})
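A runnable sketch with assumed claims data (passing a dict of column-to-function-list as aggfunc yields one output column per function):

```python
import pandas as pd

# hypothetical claims data
df = pd.DataFrame({
    'Year':   [2021, 2021, 2022],
    'Month':  [1, 1, 2],
    'Claims': [10, 30, 5],
})

# min and max of the same column, per (Year, Month)
pivot = df.pivot_table(index=['Year', 'Month'],
                       values=['Claims'],
                       aggfunc={'Claims': ['min', 'max']})
print(pivot)
```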