Multiple Aggregations of the Same Column Using Pandas Groupby.Agg()

Multiple aggregations of the same column using pandas GroupBy.agg()

As of 2022-06-20, the following (named aggregation) is the accepted practice for aggregations:

df.groupby('dummy').agg(
    Mean=('returns', np.mean),
    Sum=('returns', np.sum))
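
For completeness, here is a minimal runnable sketch of the same named-aggregation pattern; the frame is hypothetical, and string aggregator names are used, which newer pandas versions prefer over np.mean / np.sum:

import pandas as pd

# hypothetical data with the same column names as the snippet above
df = pd.DataFrame({'dummy': [1, 1, 1], 'returns': [0.03, 0.04, 0.02]})

# named aggregation: output column name = (input column, aggregation)
df.groupby('dummy').agg(
    Mean=('returns', 'mean'),
    Sum=('returns', 'sum'),
)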

Everything below the fold is kept for historical versions of pandas.

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
           mean       sum
dummy
1      0.036901  0.369012

or as a dictionary of dictionaries, which also renames the output columns:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
        returns
           Mean       Sum
dummy
1      0.036901  0.369012

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and has two deprecations. First and most importantly, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapping column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and assign a custom name to its special __name__ attribute, like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401
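
Alternatively, as a sketch assuming pandas 0.25 or later, named aggregation lets you name the output columns directly, with no __name__ trick and flat (non-MultiIndex) column names:

# named aggregation on the same frame: keyword = (input column, aggregation)
# the output names (a_sum, d_range, ...) are just illustrative choices
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()),
)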

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

How to implement multiple aggregations using pandas groupby, referencing a specific column

This is a two-step process:

first aggregate, taking the sum of stores and the idxmin of rank;

then use that idxmin result to slice the original dataframe and join it with the aggregate.

agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))
df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']].join(agged.stores, on='title')

  title  t_no t_descr  rank  stores
0     A     1       a     1    1000
1     B     1       a     1    1800
3     C     2       b     2     800
4     D     1       a     1    1800
7     E     3       c     3     700
6     F     4       d     4     500
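
As a quick aside on the idxmin step (using a toy Series, not the question's data): idxmin returns the index label of the minimum value, which is what makes the df.loc slice above work:

import pandas as pd

s = pd.Series([3, 1, 2], index=[10, 11, 12])
s.idxmin()  # 11 -> the index label where the minimum value sits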

Pandas groupby, how to do multiple aggregations on multiple columns?

Use GroupBy.agg:

df2 = df.groupby('Product', as_index=False).agg({'occasion': '|'.join, 'count': 'sum'})
print(df2)
#   Product         occasion  count
# 0    cake          wedding      2
# 1  chairs  funeral|wedding      5
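
For context, here is a minimal sketch of the kind of input this assumes; the question's frame isn't shown, so the rows and counts below are hypothetical. '|'.join concatenates the occasion strings per product while 'sum' totals count:

import pandas as pd

# hypothetical input: one row per Product/occasion pair
df = pd.DataFrame({
    'Product':  ['cake', 'chairs', 'chairs'],
    'occasion': ['wedding', 'funeral', 'wedding'],
    'count':    [2, 2, 3],
})

df2 = df.groupby('Product', as_index=False).agg({'occasion': '|'.join, 'count': 'sum'})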

Pandas Groupby: Aggregations on the same column but totals based on two different criteria / dataframes

I believe you need a Total_RFQ column computed with size (total counts) and a Done_RFQ column counted via a boolean mask: compare the values with 'Done' and sum the Trues:

d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.eq('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1     100.0
1            B           CORP          NZD          1         0       0.0
2            B           CORP          USD          1         1     100.0
3            C           CORP          EUR          2         1      50.0
4            C           CORP          GBP          2         2     100.0
5            C           CORP          USD          1         1     100.0

If you need to check substrings instead:

d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.str.contains('Done').sum())]
df = df.groupby(['display_name', 'security_type1', 'currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print(df)
  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1     100.0
1            B           CORP          NZD          1         0       0.0
2            B           CORP          USD          1         1     100.0
3            C           CORP          EUR          2         1      50.0
4            C           CORP          GBP          2         2     100.0
5            C           CORP          USD          1         1     100.0
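
As a sketch of the same computation in newer pandas (assuming 0.25 or later), named aggregation on the selected column avoids the list of (name, function) tuples:

# equivalent named-aggregation form (out is just an illustrative variable name)
out = (df.groupby(['display_name', 'security_type1', 'currency_str'])['state']
         .agg(Total_RFQ='size',
              Done_RFQ=lambda x: x.eq('Done').sum())
         .reset_index())
out['Done_Pct'] = out['Done_RFQ'] / out['Total_RFQ'] * 100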

Different aggregated sums of the same column based on categorical values in other columns

One option is to do a groupby and unstack:

(df
 .groupby(['set_id', 'is_spare'])
 .quantity
 .sum()
 .unstack('is_spare')
 .rename(columns={False: 'normal_pieces', True: 'spare_pieces'})
 .assign(num_pieces=lambda df: df.sum(axis='columns'))
 .rename_axis(columns=None)
 .reset_index()
)

  set_id  normal_pieces  spare_pieces  num_pieces
0      A           29.0           8.0        37.0
1      B           24.0           8.0        32.0
2      C           10.0           NaN        10.0
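
If the NaN for sets that have no spare pieces is unwanted, unstack's fill_value argument handles it at the reshape step; a sketch on the same frame:

(df
 .groupby(['set_id', 'is_spare'])
 .quantity
 .sum()
 .unstack('is_spare', fill_value=0)  # 0 instead of NaN where a group is missing
 .rename(columns={False: 'normal_pieces', True: 'spare_pieces'})
 .assign(num_pieces=lambda df: df.sum(axis='columns'))
 .rename_axis(columns=None)
 .reset_index()
)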

For the updated solution, you can use a groupby and unstack; I'll just jump straight to pivot_table, which is a wrapper around groupby and pivot:


temp = df.pivot_table(index='set_id',
                      columns=['is_spare', 'color'],
                      values='quantity',
                      aggfunc='sum')

# get the sum per color (`red`, `black`, ...) across the is_spare levels
colors = temp.groupby(level='color', axis=1).sum()

# pandas MultiIndex works nicely here:
# we can select the top-level columns and sum them,
# in this case `False` and `True`
(temp.assign(num_pieces=temp.sum(1),
             normal_pieces=temp[False].sum(1),
             spare_pieces=temp[True].sum(1),
             # assign is basically an expansion of a dictionary,
             # and here we take advantage of that
             **colors)
     .drop(columns=[False, True])
     .reset_index()
     .rename_axis(columns=[None, None], index=None)
)
  set_id  num_pieces  normal_pieces  spare_pieces  black  grey   red  white
0      A        37.0           29.0           8.0    0.0   8.0  16.0   13.0
1      B        32.0           24.0           8.0   13.0   0.0   9.0   10.0
2      C        10.0           10.0           0.0    0.0  10.0   0.0    0.0

Another option, which may be a bit faster (groupby is called only once), is to use get_dummies before grouping:

temp = df.set_index('set_id').loc[:, ['is_spare', 'color', 'quantity']]

# get_dummies returns 0 and 1, depending on whether the value exists:
# if `blue` exists for a row, 1 is assigned, else 0
(pd.get_dummies(temp.drop(columns='quantity'),
                columns=['is_spare', 'color'],
                prefix='',
                prefix_sep='')
 # here we do a conditional replacement,
 # similar to python's if-else statement,
 # replacing the 1s with quantity
 .where(lambda df: df == 0, temp.quantity, axis=0)
 # from here on it is grouping
 # with some renaming
 .groupby('set_id')
 .sum()
 .assign(num_pieces=lambda df: df[['False', 'True']].sum(1))
 .rename(columns={'False': 'normal_pieces', 'True': 'spare_pieces'})
)

        normal_pieces  spare_pieces  black  grey  red  white  num_pieces
set_id
A                  29             8      0     8   16     13          37
B                  24             8     13     0    9     10          32
C                  10             0      0    10    0      0          10

How to apply different aggregation functions to same column by using pandas Groupby

The following should work:

data.groupby(['A','B']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])

Basically, calling agg and passing a list of functions will generate multiple columns, one for each function applied.

Example:

In [12]:

df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg([pd.Series.mean, pd.Series.std, pd.Series.count])
Out[12]:
               a
            mean       std  count
b
0      -0.769198  0.158049      2
1       0.247708  0.743606      2
2      -0.312705       NaN      1

You can also pass the method names as strings. The common ones work; some of the more obscure ones don't (I can't remember which), but in this case they work fine. Thanks to @ajcr for the suggestion:

In [16]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':[0,0,1,1,2]})
df.groupby(['b']).agg(['mean', 'std', 'count'])

Out[16]:
               a
            mean       std  count
b
0      -1.037301  0.790498      2
1      -0.495549  0.748858      2
2      -0.644818       NaN      1

pivot_table(), multiple aggfuncs to the *same* column - possible?

If I understand correctly (IIUC), use:

pivot = df.pivot_table(index=['Year', 'Month'], values=['Claims'], aggfunc={'Claims': ['min','max']})
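
A minimal usage sketch, with a hypothetical frame standing in for the question's data; the aggfunc dict maps the Claims column to both 'min' and 'max', producing one output column per function:

import pandas as pd

# hypothetical data
df = pd.DataFrame({
    'Year':   [2021, 2021, 2022],
    'Month':  [1, 1, 2],
    'Claims': [10, 30, 5],
})

pivot = df.pivot_table(index=['Year', 'Month'], values=['Claims'],
                       aggfunc={'Claims': ['min', 'max']})
print(pivot)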

