Apply Multiple Functions to Multiple Groupby Columns

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and contains two deprecated patterns. First, and most importantly, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix (it has been removed from pandas entirely).
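For reference, the modern replacement for the dict-of-dicts pattern is named aggregation, available since pandas 0.25. A minimal sketch, using column names matching the example frame below:

```python
import numpy as np
import pandas as pd

# Same shape as the example frame used below.
df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]

# Named aggregation: each keyword becomes an output column name,
# and the tuple is (source column, aggregation function).
result = df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()),
)
```

This gives flat, custom-named columns directly, avoiding both the nested-dict deprecation and the ugly lambda column name discussed below.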

If you want to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed to the function as a DataFrame.
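A quick way to see the difference is to record the type each method hands to your function; a small sketch (the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'group': [0, 0, 1, 1]})

seen = []

def record_type(x):
    # agg hands this function one Series per group;
    # apply hands it the whole sub-DataFrame per group.
    seen.append(type(x).__name__)
    return 0

df.groupby('group')['a'].agg(record_type)   # receives Series
df.groupby('group').apply(record_type)      # receives DataFrame
```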

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

How to apply *multiple* functions to pandas groupby apply?

For this specific issue, how about grouping after taking the difference?

(df['x'] - df['y']).groupby(df['id']).agg(['min', 'max'])

More generically, you could probably do something like

df.groupby('id').apply(lambda x: pd.Series({'min': mindist(x), 'max': maxdist(x)}))
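Here mindist and maxdist come from the original question and are not shown; as a hedged sketch, assume they are helpers that reduce a group to a scalar (the x - y distance below is a made-up stand-in):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'x': [5, 8, 3, 9],
                   'y': [1, 2, 1, 4]})

# Hypothetical helpers standing in for the question's mindist/maxdist.
def mindist(g):
    return (g['x'] - g['y']).min()

def maxdist(g):
    return (g['x'] - g['y']).max()

# One row per id, with 'min' and 'max' columns from the returned Series.
result = df.groupby('id').apply(
    lambda g: pd.Series({'min': mindist(g), 'max': maxdist(g)}))
```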

How can I apply multiple functions involving multiple columns of a pandas dataframe with grouby?

You first need to assign the max-score difference to a mean(max-score) column; then this is a simple groupby + agg:

(df.assign(**{'mean(max-score)': df['max'] - df['score']})
   .groupby('id', as_index=False)
   .agg({'cat': 'first', 'date': 'first', 'mean(max-score)': 'mean'})
)

output:

   id cat   date  mean(max-score)
0  s1   A  12/06              4.1
1  s2   C  11/04             13.4
2  s3   E  08/02              5.6

Apply multiple custom functions to multiple columns on multiple groupby objects in Pandas in Python

I don't know if there's a simpler way, but one approach is to use currying. I wasn't able to find a way to add a column through the groupby structure itself (the structures involved are designed around immutable data), so I dealt with the data in the groupby objects (g1 and g2 below) directly. You can see whether the following code does what you want:

def sum_curry(x, y):
    return lambda df: sum(df[x]) + sum(df[y])

def diff_curry(x, y):
    return lambda df: sum(df[x]) - sum(df[y])

def append_prod(df):
    df['E'] = df['C'] * df['D']
    return df

g1_sums = g1.apply(sum_curry('B', 'C'))
g1_diffs = g1.apply(diff_curry('C', 'D'))
g2_diffs = g2.apply(diff_curry('B', 'C'))
g2_with_prod = [(group[0], append_prod(group[1])) for group in g2]

Pandas groupby aggregate apply multiple functions to multiple columns

In pandas, the agg operation takes one or more individual methods to apply to the relevant columns and returns a summary of the outputs. Since Python lists can hold multiple objects, you can pass a list of functions into the aggregator, and each function is applied to every column. In your case, you were passing a dictionary, which forces you to handle each column individually and is much more manual. Happy to explain further if this isn't clear.

ss=tt.groupby('Group').agg(['count','mean','median'])
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
ss
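Since tt is not shown above, here is a self-contained sketch of the same pattern with made-up data, showing how the MultiIndex columns get flattened:

```python
import pandas as pd

# Hypothetical data; 'Group' matches the grouping column used above.
tt = pd.DataFrame({'Group': ['x', 'x', 'y'],
                   'val': [1.0, 3.0, 5.0]})

ss = tt.groupby('Group').agg(['count', 'mean', 'median'])
# agg with a list of functions yields MultiIndex columns such as
# ('val', 'count'); join each tuple into a flat name like 'val_count'.
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
```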

pandas, apply multiple functions of multiple columns to groupby object

I think you can avoid agg and apply altogether: first multiply with mul, then divide with div, and finally group by the index and aggregate with sum:

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

print (lasts)
   elapsed_time  num_cores  running_time user
0         40000          7         30000    a
1         50000          8         20000    s
2         60000          9         30000    d
3         90000          4         15000    d

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
      elapsed_days  running_days
user
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
            .div(86400)
            .groupby(level=0)
            .sum())
      elapsed_time  running_time
user
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

Groupby and Apply Functions on multiple Columns with 1-to-many relationship

You can use a regex to add the URL part:

woLink = 'example.org/woNum='
df['Link'] = df['Work Order'].str.replace(r'(\d+)', rf'{woLink}\1', regex=True)

output:

                 Date  Ticket ID            Work Order                                                      Link
0 2018-08-30 22:52:25    1444008             119846184                               example.org/woNum=119846184
1 2021-09-29 13:33:49    1724734  122445397, 122441551  example.org/woNum=122445397, example.org/woNum=122441551
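As a self-contained check (the original df is not shown, so the data here is made up; note the regex=True flag, required since pandas 2.0, where str.replace defaults to literal matching):

```python
import pandas as pd

# Made-up frame with the same 'Work Order' column shape as above.
df = pd.DataFrame({'Work Order': ['119846184', '122445397, 122441551']})

woLink = 'example.org/woNum='
# Prefix every run of digits with the URL fragment.
df['Link'] = df['Work Order'].str.replace(r'(\d+)', rf'{woLink}\1', regex=True)
```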

Apply multiple functions to GroupBy object in a specific order

is that doable (without having to re-group and without using .apply)?

I think in general not; only if there are exactly 2 values per group, or some other pattern in the data, are there alternatives.

#if always 2 values per id in order
df1 = df.groupby("id")['date'].agg(['min','max'])
max_diff_for_each_id = df1['max'].sub(df1['min']).dt.days

Or:

#if always 2 values per id 
df2 = df.groupby("id")['date'].agg(['first','last'])

max_diff_for_each_id = df2['last'].sub(df2['first']).dt.days

One idea is to convert id to the index, but max(level=0) is only a hidden .groupby(level=0).max(), so this is a bit of a trick solution (in my opinion); note that the level argument of max is deprecated in recent pandas in favor of an explicit groupby:

max_diff_for_each_id = df.set_index('id').groupby("id")['date'].diff().max(level=0).dt.days

A double groupby is also possible:

max_diff_for_each_id = df.groupby("id")['date'].diff(1).groupby(df["id"]).max().dt.days

Or create custom functions like:

max_diff_for_each_id = df.groupby("id")['date'].apply(lambda x: x.diff().max()).dt.days

max_diff_for_each_id = df.groupby("id")['date'].agg(lambda x: x.diff().max()).dt.days


print (max_diff_for_each_id)
id
1    5
2    1
dtype: int64

Multiple aggregations of the same column using pandas GroupBy.agg()

As of 2022-06-20, the below (named aggregation) is the accepted practice; prefer the string aliases 'mean'/'sum', since passing np.mean/np.sum to agg is deprecated in recent pandas:

df.groupby('dummy').agg(
    Mean=('returns', 'mean'),
    Sum=('returns', 'sum'))

The material below the fold is kept for historical versions of pandas.

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
           mean       sum
dummy
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
        returns
           Mean       Sum
dummy
1      0.036901  0.369012

Apply Same Aggregation on Multiple Columns when Using Groupby (python)

You can generate dict:

d = {**{"payment_amount": 'sum'},
     **dict.fromkeys(["user_id", "category", "name"], 'first')}

print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}

expected_output = example_df.groupby("user_id").agg(d)

A more general solution:

d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}

expected_output = example_df.groupby("user_id").agg(d)
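A runnable sketch of the general solution, with a made-up example_df that only shares the column names used above:

```python
import pandas as pd

# Hypothetical data; column names match the example above.
example_df = pd.DataFrame({'user_id': [1, 1, 2],
                           'category': ['a', 'a', 'b'],
                           'name': ['n1', 'n1', 'n2'],
                           'payment_amount': [10, 20, 5]})

# Every column defaults to 'first'; only payment_amount is summed.
d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'

expected_output = example_df.groupby('user_id').agg(d)
```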

