Aggregation in Pandas

Question 1

How can I perform aggregation with Pandas?

Expanded aggregation documentation.

Aggregating functions reduce the dimension of the returned object: the output Series/DataFrame has the same number of rows as the original, or fewer.

Some common aggregating functions are tabulated below:


Function     Description
mean()       Compute mean of groups
sum()        Compute sum of group values
size()       Compute group sizes (including missing values)
count()      Compute count of group values (excluding missing values)
std()        Standard deviation of groups
var()        Compute variance of groups
sem()        Standard error of the mean of groups
describe()   Generate descriptive statistics
first()      Compute first of group values
last()       Compute last of group values
nth()        Take nth value, or a subset if n is a list
min()        Compute min of group values
max()        Compute max of group values
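A quick illustration of a few of the reducers above, applied in one pass with agg (a minimal sketch on toy data; the column names are made up for the example):

```python
import pandas as pd

# hypothetical toy data: one grouping column, one value column
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# apply several reducers from the table at once
res = df.groupby('g')['x'].agg(['mean', 'sum', 'size', 'min', 'max'])
print(res)
```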
import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
A B C D E
0 foo one 2 3 0
1 foo two 4 1 0
2 bar three 2 1 1
3 foo two 1 0 3
4 bar two 3 1 4
5 foo one 2 1 0

Aggregation on a selected column using Cython-implemented functions:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5

If no column is selected after the groupby, the aggregate function is applied to all columns other than the grouping keys, here the A, B columns:

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3

You can also restrict aggregation to a subset of columns by passing a list of column names after the groupby (note the double brackets; selecting with a bare 'C','D' is deprecated in modern pandas):

df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()
print (df3)
A B C D
0 bar three 2 1
1 bar two 3 1
2 foo one 4 4
3 foo two 5 1

The same results can be obtained with DataFrameGroupBy.agg:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3

To apply multiple functions to one column, pass a list of tuples of (new column name, aggregating function):

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
A B average total
0 bar three 2.0 2
1 bar two 3.0 3
2 foo one 2.0 4
3 foo two 2.5 5

To apply multiple functions to all columns, pass the same list of tuples without selecting a column:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
C D E
average total average total average total
A B
bar three 2.0 2 1.0 1 1.0 1
two 3.0 3 1.0 1 4.0 4
foo one 2.0 4 2.0 4 0.0 0
two 2.5 5 0.5 1 1.5 3

This produces a MultiIndex in the columns:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To flatten the MultiIndex into ordinary columns, use map with join:

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3

Another solution is to pass a list of aggregating functions, flatten the MultiIndex, and rename the resulting columns with str.replace:

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                          .str.replace('sum','total')
                          .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3

To specify a different aggregating function for each column, pass a dictionary:

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
A B C_total D_average
0 bar three 2 1.0
1 bar two 3 1.0
2 foo one 4 2.0
3 foo two 5 0.5

You can also pass a custom function:

def func(x):
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
A B C_total D_sum_first_and_last
0 bar three 2 2
1 bar two 3 2
2 foo one 4 4
3 foo two 5 1

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A B
bar three 2
two 3
foo one 4
two 5
Name: C, dtype: int32

First check the Index and type of a Pandas object:

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

There are two ways to convert the MultiIndex Series back to columns:

  • add the parameter as_index=False:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
  • use Series.reset_index:
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5

If group by one column:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar 5
foo 9
Name: C, dtype: int32

... get Series with Index:

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

And the solution is the same as for the MultiIndex Series:

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
A C
0 bar 5
1 foo 9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
A C
0 bar 5
1 foo 9

Question 3

How can I aggregate mainly strings columns (to lists, tuples, strings with separator)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three', 'two', 'one'],
                   'D' : [1, 2, 3, 2, 3, 1, 2]})
print (df)
A B C D
0 a one three 1
1 c two one 2
2 b three two 3
3 b two two 2
4 a two three 3
5 c one two 1
6 b three one 2

Instead of an aggregating function, you can pass list, tuple, or set to collect the column's values:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]

An alternative is to use GroupBy.apply:

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]

For converting to strings with a separator, use .join only if it is a string column:

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
A B
0 a one,two
1 b three,two,three
2 c two,one

If it is a numeric column, use a lambda function with astype for converting to strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1

Another solution is to convert to strings before the groupby:

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1

For converting all columns, don't pass any list of column(s) after the groupby. Column D is missing from the output because of the automatic exclusion of 'nuisance' columns: all numeric columns, on which ','.join fails, are silently dropped. (Recent pandas versions raise an error here instead of excluding such columns.)

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
A B C
0 a one,two three,three
1 b three,two,three two,two,one
2 c two,one one,two

So to keep all columns, it's necessary to convert them to strings inside the aggregation:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
A B C D
0 a one,two three,three 1,3
1 b three,two,three two,two,one 3,2,2
2 c two,one one,two 2,1

Question 4

How can I aggregate counts?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three', 'two', 'one'],
                   'D' : [np.nan, 2, 3, 2, 3, np.nan, 2]})
print (df)
A B C D
0 a one three NaN
1 c two NaN 2.0
2 b three NaN 3.0
3 b two two 2.0
4 a two three 3.0
5 c one two NaN
6 b three one 2.0

Function GroupBy.size returns the size of each group, including missing values:

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
A COUNT
0 a 2
1 b 3
2 c 2

Function GroupBy.count excludes missing values:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
A COUNT
0 a 2
1 b 2
2 c 1

Use count on multiple columns to count the non-missing values per column:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
A B_COUNT C_COUNT D_COUNT
0 a 2 2 1
1 b 3 2 3
2 c 2 1 1

A related function is Series.value_counts. It returns the counts of unique values in descending order, so the first element is the most frequently occurring one. It excludes NaN values by default.

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
A COUNT
0 b 3
1 a 2
2 c 2

If you want the same output as groupby + size, add Series.sort_index:

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
A COUNT
0 a 2
1 b 3
2 c 2

Question 5

How can I create a new column filled by aggregated values?

Method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

See the Pandas documentation for more information.

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6)})
print (df)
A B C D
0 foo one 2 3
1 foo two 4 1
2 bar three 2 1
3 foo two 1 0
4 bar two 3 1
5 foo one 2 1


df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')


df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')

print (df)

A B C D C1 C2 C3 D3 C4 D4
0 foo one 2 3 9 4 9 5 4 4
1 foo two 4 1 9 5 9 5 5 1
2 bar three 2 1 5 2 5 2 2 1
3 foo two 1 0 9 5 9 5 5 1
4 bar two 3 1 5 3 5 2 3 1
5 foo one 2 1 9 4 9 5 4 4

Pandas Dataframe Aggregate Object Type

A first idea is to use GroupBy.first for the object columns:

aggregated = groupped.agg({
    'name' : ['first'],
    'id' : ['first'],
    'date' : ['first'],
    'number_one' : ['sum'],
    'type' : ['first'],
    'number_two' : ['sum'],
})

To avoid a MultiIndex in the columns, remove the []:

aggregated = groupped.agg({
    'name' : 'first',
    'id' : 'first',
    'date' : 'first',
    'number_one' : 'sum',
    'type' : 'first',
    'number_two' : 'sum',
})

A more general solution is to aggregate numeric columns with sum and take the first value of every other column, via a lambda function:

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else x.iat[0]
aggregated = groupped.agg(f)
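Since the grouped object above is not shown, here is a self-contained sketch of the same dtype-dispatching lambda on made-up data:

```python
import numpy as np
import pandas as pd

# hypothetical frame mixing a numeric and an object column
df = pd.DataFrame({'g': ['x', 'x', 'y'],
                   'name': ['a', 'b', 'c'],
                   'val': [1, 2, 3]})

# numeric columns -> sum, everything else -> first value of the group
f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else x.iat[0]
out = df.groupby('g').agg(f)
print(out)
```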

Multiple aggregations of the same column using pandas GroupBy.agg()

As of 2022-06-20, the below is the accepted practice for aggregations:

df.groupby('dummy').agg(
    Mean=('returns', np.mean),
    Sum=('returns', np.sum))

The examples below the fold are kept for historical versions of pandas.

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
             {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012

Python Pandas aggregate count and max value

df.groupby(by=['host', 'ip'])['ts'].agg(['max', 'count'])

You group by the two properties and call multiple aggregation functions using agg.
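A self-contained sketch of the same one-liner, on hypothetical log data with the host/ip/ts columns it expects:

```python
import pandas as pd

# made-up log rows: two requests from one host/ip pair, one from another
df = pd.DataFrame({'host': ['h1', 'h1', 'h2'],
                   'ip': ['10.0.0.1', '10.0.0.1', '10.0.0.2'],
                   'ts': [100, 250, 300]})

# per (host, ip): latest timestamp and number of rows
out = df.groupby(by=['host', 'ip'])['ts'].agg(['max', 'count'])
print(out)
```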

Custom aggregation of pandas dataframe

To get the desired output you could do something like this:

# lambda also possible but this looks a bit cleaner
def weights(grp):
    val1, val2, val3 = grp
    return 0.4*val1 + 0.5*val2 + 0.5*val3

# the aggregations on B and D are just examples. You can change them to whatever you like
df.groupby('C').agg({'A': weights, 'B': 'first', 'D': lambda x: 'aggregated'}).reset_index()

Output:

    C    A  B           D
0  XX  3.4  X  aggregated
1  YY  4.8  Y  aggregated
2  ZZ  6.2  Z  aggregated

Second-level aggregation in pandas

Use .groupby() + .sum() + value_counts() + .agg():

df2 = DF.groupby('F1')['F2'].sum()
df3 = (DF.groupby(['F1', 'F3'])['F3']
         .value_counts()
         .reset_index([2], name='count')
         .apply(lambda x: x['F3'] + '-' + str(x['count']), axis=1)
      )
df4 = df3.groupby(level=0).agg(' '.join)
df4.name = 'F3'
df_out = pd.concat([df2, df4], axis=1).reset_index()

Result:

print(df_out)

F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 xx-1 zz-2
2 C 8 yy-1 zz-3
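The same result can be reached a bit more directly with size instead of value_counts; a runnable sketch on toy data that reproduces the output above (the F1/F2/F3 names and values are assumed from the snippet):

```python
import pandas as pd

# toy DF consistent with the result shown above
DF = pd.DataFrame({'F1': ['A','A','A','B','B','B','C','C','C','C'],
                   'F2': [1, 1, 2, 2, 2, 3, 2, 2, 2, 2],
                   'F3': ['xx','yy','zz','xx','zz','zz','yy','zz','zz','zz']})

f2 = DF.groupby('F1')['F2'].sum()
f3 = (DF.groupby(['F1', 'F3']).size()              # count each F1/F3 pair
        .reset_index(name='n')
        .assign(F3=lambda d: d['F3'] + '-' + d['n'].astype(str))
        .groupby('F1')['F3'].agg(' '.join))        # join "value-count" labels per F1
df_out = pd.concat([f2, f3], axis=1).reset_index()
print(df_out)
```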

Ungrouping a pandas dataframe after aggregation operation

You can do two different things:

(1) Create an aggregate DataFrame using groupby.agg and calling appropriate methods. The code below lists all names corresponding to a location:

out = dataframe.groupby(by=['location'], as_index=False).agg({'people':'sum', 'name':list})

(2) Use groupby.transform to add a new column to dataframe that has the sum of people by location in each row:

dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
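Both variants in one runnable sketch, on a minimal stand-in for the dataframe referenced above (the location/name/people values are made up):

```python
import pandas as pd

# minimal stand-in for the 'dataframe' referenced above
dataframe = pd.DataFrame({'location': ['NY', 'NY', 'LA'],
                          'name': ['ann', 'bob', 'cid'],
                          'people': [10, 20, 30]})

# (1) one row per location, names collected into a list
out = dataframe.groupby(by=['location'], as_index=False).agg(
    {'people': 'sum', 'name': list})

# (2) the per-location sum broadcast back onto every original row
dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
print(out)
print(dataframe)
```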

Issues with groupby and aggregate in pandas

Use transform instead:

df['PV_SUM'] = df.groupby('DOCKET').PV.transform(sum)

Output:

  DOCKET  PV  PV_SUM
0 1a 1 3
1 1a 1 3
2 1a 1 3
3 1b 0 2
4 1b 1 2
5 1b 1 2

The issue with your code is that df.groupby('DOCKET').agg({'PV':sum}) returns a dataframe with DOCKET as the index and PV as the value column. When you try assigning it back to the dataframe, pandas looks for matching indexes and, since there are no matches, it fills the column with NaN.

For example, take a look at the output from df.groupby('DOCKET').agg({'PV':sum}):

        PV
DOCKET
1a 3
1b 2

Since pandas matches on the index, you can first set the index of your dataframe to DOCKET; then the assignment works as expected:

result = df.groupby('DOCKET').agg({'PV':sum})
df = df.set_index('DOCKET')
df['PV_SUM'] = result
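A runnable version of that fix on toy data matching the output shown earlier (the 'PV' column is selected explicitly so a Series, not a one-column frame, is assigned):

```python
import pandas as pd

# toy DOCKET/PV frame matching the example output above
df = pd.DataFrame({'DOCKET': ['1a', '1a', '1a', '1b', '1b', '1b'],
                   'PV': [1, 1, 1, 0, 1, 1]})

result = df.groupby('DOCKET').agg({'PV': 'sum'})   # indexed by DOCKET
df = df.set_index('DOCKET')                        # now the indexes can align
df['PV_SUM'] = result['PV']                        # Series aligns on DOCKET
df = df.reset_index()
print(df)
```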

