Pandas Dataframe Aggregate Function Using Multiple Columns

Pandas - dataframe groupby - how to get sum of multiple columns

By using apply

df.groupby(['col1', 'col2'])[['col3', 'col4']].apply(lambda x: x.astype(int).sum())
Out[1257]:
           col3  col4
col1 col2            
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4

If you want to use agg:

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
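
For reference, a minimal runnable sketch with assumed sample data (invented to match the output above; col3 and col4 hold numbers stored as strings, hence the astype(int)):

import pandas as pd

# assumed sample data; the numeric columns are stored as strings on purpose
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': ['c', 'c', 'd', 'd', 'e'],
                   'col3': ['1', '1', '1', '1', '2'],
                   'col4': ['2', '2', '2', '2', '4']})

# apply receives each group as a DataFrame, so both columns can be cast and summed together
df.groupby(['col1', 'col2'])[['col3', 'col4']].apply(lambda x: x.astype(int).sum())

# agg works per column; cast first so 'sum' adds numbers instead of concatenating strings
df.astype({'col3': int, 'col4': int}).groupby(['col1', 'col2']).agg({'col3': 'sum', 'col4': 'sum'})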

Apply Same Aggregation on Multiple Columns when Using Groupby (python)

You can generate a dict:

d = {**{"payment_amount": 'sum'}, 
**dict.fromkeys(["user_id" , "category" , "name"], 'first')}

print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}

expected_output = example_df.groupby("user_id").agg(d)

A more general solution is:

d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}

expected_output = example_df.groupby("user_id").agg(d)
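
A minimal runnable sketch with assumed sample data (the example_df below is invented for illustration):

import pandas as pd

# assumed sample data: several payments per user
example_df = pd.DataFrame({'user_id': [1, 1, 2],
                           'category': ['books', 'books', 'games'],
                           'name': ['Ann', 'Ann', 'Bob'],
                           'payment_amount': [10, 15, 7]})

d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'

# 'user_id' is both the group key and a 'first'-aggregated column, which is harmless here
expected_output = example_df.groupby("user_id").agg(d)
print(expected_output)
#          user_id category name  payment_amount
# user_id
# 1              1    books  Ann              25
# 2              2    games  Bob               7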

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and contains two deprecated usages. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group                                                  
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group                                                       
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401
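
Not part of the original answer, but worth noting: since pandas 0.25, named aggregation gives the same control over output column names without the __name__ workaround (it produces flat column names rather than a MultiIndex). Reusing the df and max_min defined above:

# named aggregation: keyword is the output column name, value is (column, function)
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    d_max_minus_min=('d', max_min),
)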

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group gets passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group                                           
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494
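
A version note that is not from the original answer: on pandas 2.2+, this apply emits a DeprecationWarning about operating on the grouping columns. Since f never touches 'group', you can, as far as I know, opt in to the future behaviour and silence the warning with:

df.groupby('group').apply(f, include_groups=False)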

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group                                        
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

pandas groupby aggregate customised function with multiple columns

IIUC

df.groupby(['pool']).apply(lambda x: pd.Series({'count': sum(x['count']),
                                                'newavg': newAvg(x)}))
Out[58]:
      count  newavg
pool               
A       7.0     4.0
B       7.0    11.0
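
Both df and newAvg come from the original question and are not shown here. Purely to make the snippet runnable, here is a sketch with invented data and a hypothetical newAvg (a count-weighted average of a made-up 'value' column); the numbers will not match the output above:

import pandas as pd

# hypothetical data, invented for illustration only
df = pd.DataFrame({'pool': ['A', 'A', 'B', 'B'],
                   'count': [3, 4, 2, 5],
                   'value': [10, 20, 5, 8]})

def newAvg(x):
    # hypothetical helper: count-weighted average of 'value'
    return (x['count'] * x['value']).sum() / x['count'].sum()

df.groupby(['pool']).apply(lambda x: pd.Series({'count': sum(x['count']),
                                                'newavg': newAvg(x)}))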

Python pandas groupby aggregate on multiple columns, then pivot

Edited for pandas 0.22+, considering the deprecation of dict-based renaming in a groupby aggregation.

We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
    .agg(rnm_cols.keys()).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
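
The original df is not shown in this excerpt. A minimal sketch with assumed sample data (invented, but chosen so the aggregates match the output above); wrapping the dict keys in list() is just a defensive touch for newer pandas:

import pandas as pd

# assumed sample data: prices per item across three shops
df = pd.DataFrame({'Category': ['Technology', 'Technology', 'Books', 'Clothes'],
                   'Item': ['Computer', 'Tablet', 'Novel', 'Shirt'],
                   'Shop1': [200, 250, 17, 45],
                   'Shop2': [300, 300, 20, 50],
                   'Shop3': [400, 350, 21, 53]})

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
(df.set_index(['Category', 'Item'])
   .stack()
   .groupby('Category')
   .agg(list(rnm_cols.keys()))
   .rename(columns=rnm_cols))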

option 1
use agg

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6

option 2
more for less
use describe

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0

Pandas - aggregate multiple columns with pivot_table

You could reset_index on df_res, group by "ind0", and with agg apply different functions to different columns: joining the unique values of "ind1" and summing "X" and "Y".

out = df_res.reset_index().groupby('ind0').agg({'ind1': lambda x: ', '.join(x.unique()), 'X':'sum', 'Y':'sum'})

Or, if you have multiple columns that need the same function, you could also build the dict with a dict comprehension:

funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k:'sum' for k in ('X','Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)

Output:

cols   ind1    X     Y
ind0                  
Q         R  1.0   2.0
W      R, S  7.0  11.0
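
df_res comes from the original question (a pivot_table result with an (ind0, ind1) MultiIndex and X/Y columns). A minimal sketch with assumed data that reproduces the aggregation above (the 'cols' axis name from the pivot is omitted):

import pandas as pd

# assumed stand-in for the pivot_table result from the question
df_res = pd.DataFrame({'X': [1.0, 3.0, 4.0], 'Y': [2.0, 5.0, 6.0]},
                      index=pd.MultiIndex.from_tuples([('Q', 'R'), ('W', 'R'), ('W', 'S')],
                                                      names=['ind0', 'ind1']))

funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k: 'sum' for k in ('X', 'Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)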

Pandas groupby - aggregate - function of two columns

First create the column new before the groupby and then aggregate with sum; your solution rewritten with named aggregation is:

counts = (df.assign(new=df['col3'] * df['col4'])
            .groupby(['col1', 'col2'], as_index=False)
            .agg(counts=('col1', 'size'),
                 col3_mean=('col3', 'mean'),
                 col4_median=('col4', 'median'),
                 col4_min=('col4', 'min'),
                 both_sum=('new', 'sum')))
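
For reference, a self-contained sketch with assumed sample data showing the shape of the result (the values are invented for illustration):

import pandas as pd

# assumed sample data
df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': ['x', 'x', 'y'],
                   'col3': [1, 2, 3],
                   'col4': [4, 5, 6]})

counts = (df.assign(new=df['col3'] * df['col4'])
            .groupby(['col1', 'col2'], as_index=False)
            .agg(counts=('col1', 'size'),
                 col3_mean=('col3', 'mean'),
                 col4_median=('col4', 'median'),
                 col4_min=('col4', 'min'),
                 both_sum=('new', 'sum')))
print(counts)
#   col1 col2  counts  col3_mean  col4_median  col4_min  both_sum
# 0    a    x       2        1.5          4.5         4        14
# 1    b    y       1        3.0          6.0         6        18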

