Multiple Aggregations of the Same Column Using Pandas Groupby.Agg()

Multiple aggregations of the same column using pandas GroupBy.agg()

As of 2022-06-20, the following (named aggregation) is the accepted practice for aggregations:

df.groupby('dummy').agg(
    Mean=('returns', np.mean),
    Sum=('returns', np.sum))
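
For completeness, here is a minimal runnable sketch of the same named-aggregation pattern; the frame is hypothetical, and string aggregator names are used, which newer pandas versions prefer over np.mean / np.sum:

import pandas as pd

# hypothetical data with the same column names as the snippet above
df = pd.DataFrame({'dummy': [1, 1, 1], 'returns': [0.03, 0.04, 0.02]})

# named aggregation: output column name = (input column, aggregation)
df.groupby('dummy').agg(
    Mean=('returns', 'mean'),
    Sum=('returns', 'sum'),
)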

Everything below the fold is kept for historical versions of pandas.

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
           mean       sum
dummy
1      0.036901  0.369012

or as a dictionary of dictionaries, which also renames the output columns:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
        returns
           Mean       Sum
dummy
1      0.036901  0.369012

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and has two deprecations. First and most importantly, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapping column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and assign a custom name to its special __name__ attribute, like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401
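
Alternatively, as a sketch assuming pandas 0.25 or later, named aggregation lets you name the output columns directly, with no __name__ trick and flat (non-MultiIndex) column names:

# named aggregation on the same frame: keyword = (input column, aggregation)
# the output names (a_sum, d_range, ...) are just illustrative choices
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()),
)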

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

How to implement multiple aggregations using pandas groupby, referencing a specific column

This is a two-step process:

first aggregate, taking the sum of stores and the idxmin of rank;

then use that idxmin result to slice the original dataframe and join it with the aggregate.

agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))
df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']].join(agged.stores, on='title')

  title  t_no t_descr  rank  stores
0     A     1       a     1    1000
1     B     1       a     1    1800
3     C     2       b     2     800
4     D     1       a     1    1800
7     E     3       c     3     700
6     F     4       d     4     500
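
As a quick aside on the idxmin step (using a toy Series, not the question's data): idxmin returns the index label of the minimum value, which is what makes the df.loc slice above work:

import pandas as pd

s = pd.Series([3, 1, 2], index=[10, 11, 12])
s.idxmin()  # 11 -> the index label where the minimum value sits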

Pandas groupby, how to do multiple aggregations on multiple columns?

Use GroupBy.agg:

df2 = df.groupby('Product', as_index=False).agg({'occasion': '|'.join, 'count': 'sum'})
print(df2)
#   Product         occasion  count
# 0    cake          wedding      2
# 1  chairs  funeral|wedding      5
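
For context, here is a minimal sketch of the kind of input this assumes; the question's frame isn't shown, so the rows and counts below are hypothetical. '|'.join concatenates the occasion strings per product while 'sum' totals count:

import pandas as pd

# hypothetical input: one row per Product/occasion pair
df = pd.DataFrame({
    'Product':  ['cake', 'chairs', 'chairs'],
    'occasion': ['wedding', 'funeral', 'wedding'],
    'count':    [2, 2, 3],
})

df2 = df.groupby('Product', as_index=False).agg({'occasion': '|'.join, 'count': 'sum'})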

Pandas Groupby: Aggregations on the same column but totals based on two different criteria / dataframes

I believe you need a Total_RFQ column computed with size (total counts) and a Done_RFQ column counted via a boolean mask: compare the values with 'Done' and sum the Trues:

d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.eq('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1     100.0
1            B           CORP          NZD          1         0       0.0
2            B           CORP          USD          1         1     100.0
3            C           CORP          EUR          2         1      50.0
4            C           CORP          GBP          2         2     100.0
5            C           CORP          USD          1         1     100.0

If you need to check substrings instead:

d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.str.contains('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1     100.0
1            B           CORP          NZD          1         0       0.0
2            B           CORP          USD          1         1     100.0
3            C           CORP          EUR          2         1      50.0
4            C           CORP          GBP          2         2     100.0
5            C           CORP          USD          1         1     100.0
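
As a sketch of the same computation in newer pandas (assuming 0.25 or later), named aggregation on the selected column avoids the list of (name, function) tuples:

# equivalent named-aggregation form (out is just an illustrative variable name)
out = (df.groupby(['display_name', 'security_type1', 'currency_str'])['state']
         .agg(Total_RFQ='size',
              Done_RFQ=lambda x: x.eq('Done').sum())
         .reset_index())
out['Done_Pct'] = out['Done_RFQ'] / out['Total_RFQ'] * 100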

Different aggregated sums of the same column based on categorical values in other columns

One option is to do a groupby and unstack:

(df
 .groupby(['set_id', 'is_spare'])
 .quantity
 .sum()
 .unstack('is_spare')
 .rename(columns={False: 'normal_pieces', True: 'spare_pieces'})
 .assign(num_pieces=lambda df: df.sum(axis='columns'))
 .rename_axis(columns=None)
 .reset_index()
)

  set_id  normal_pieces  spare_pieces  num_pieces
0      A           29.0           8.0        37.0
1      B           24.0           8.0        32.0
2      C           10.0           NaN        10.0
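
If the NaN for sets that have no spare pieces is unwanted, unstack's fill_value argument handles it at the reshape step; a sketch on the same frame:

(df
 .groupby(['set_id', 'is_spare'])
 .quantity
 .sum()
 .unstack('is_spare', fill_value=0)  # 0 instead of NaN where a group is missing
 .rename(columns={False: 'normal_pieces', True: 'spare_pieces'})
 .assign(num_pieces=lambda df: df.sum(axis='columns'))
 .rename_axis(columns=None)
 .reset_index()
)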

For the updated solution, you can use a groupby and unstack; I'll just jump straight to pivot_table, which is a wrapper around groupby and pivot:


temp = df.pivot_table(index='set_id',
                      columns=['is_spare', 'color'],
                      values='quantity',
                      aggfunc='sum')

# get the sum per color (`red`, `black`, ...) across the is_spare levels
colors = temp.groupby(level='color', axis=1).sum()

# pandas MultiIndex works nicely here:
# we can select the top-level columns and sum them,
# in this case `False` and `True`
(temp.assign(num_pieces=temp.sum(1),
             normal_pieces=temp[False].sum(1),
             spare_pieces=temp[True].sum(1),
             # assign is basically an expansion of a dictionary,
             # and here we take advantage of that
             **colors)
     .drop(columns=[False, True])
     .reset_index()
     .rename_axis(columns=[None, None], index=None)
)
  set_id  num_pieces  normal_pieces  spare_pieces  black  grey   red  white
0      A        37.0           29.0           8.0    0.0   8.0  16.0   13.0
1      B        32.0           24.0           8.0   13.0   0.0   9.0   10.0
2      C        10.0           10.0           0.0    0.0  10.0   0.0    0.0

Another option, which may be a bit faster (groupby is called only once), is to use get_dummies before grouping:

temp = df.set_index('set_id').loc[:, ['is_spare', 'color', 'quantity']]

# get_dummies returns 0 and 1, depending on whether the value exists:
# if `blue` exists for a row, 1 is assigned, else 0
(pd.get_dummies(temp.drop(columns='quantity'),
                columns=['is_spare', 'color'],
                prefix='',
                prefix_sep='')
 # here we do a conditional replacement,
 # similar to python's if-else statement,
 # replacing the 1s with quantity
 .where(lambda df: df == 0, temp.quantity, axis=0)
 # from here on it is grouping
 # with some renaming
 .groupby('set_id')
 .sum()
 .assign(num_pieces=lambda df: df[['False', 'True']].sum(1))
 .rename(columns={'False': 'normal_pieces', 'True': 'spare_pieces'})
)

        normal_pieces  spare_pieces  black  grey  red  white  num_pieces
set_id
A                  29             8      0     8   16     13          37
B                  24             8     13     0    9     10          32
C                  10             0      0    10    0      0          10

How to apply different aggregation functions to same column by using pandas Groupby

The following should work:

data.groupby(['A','B']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])

Basically, calling agg and passing a list of functions will generate multiple columns, one for each function applied.

Example:

In [12]:

df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Out[12]:
               a
            mean       std  count
b
0      -0.769198  0.158049      2
1       0.247708  0.743606      2
2      -0.312705       NaN      1

You can also pass the method names as strings. The common ones work; some of the more obscure ones don't (I can't remember which), but in this case they work fine. Thanks to @ajcr for the suggestion:

In [16]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg(['mean', 'std', 'count'])

Out[16]:
               a
            mean       std  count
b
0      -1.037301  0.790498      2
1      -0.495549  0.748858      2
2      -0.644818       NaN      1

pivot_table(), multiple aggfuncs to the *same* column - possible?

If I understand correctly (IIUC), use:

pivot = df.pivot_table(index=['Year', 'Month'], values=['Claims'], aggfunc={'Claims': ['min','max']})
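
A minimal usage sketch, with a hypothetical frame standing in for the question's data; the aggfunc dict maps the Claims column to both 'min' and 'max', producing one output column per function:

import pandas as pd

# hypothetical data
df = pd.DataFrame({
    'Year':   [2021, 2021, 2022],
    'Month':  [1, 1, 2],
    'Claims': [10, 30, 5],
})

pivot = df.pivot_table(index=['Year', 'Month'], values=['Claims'],
                       aggfunc={'Claims': ['min', 'max']})
print(pivot)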

