Apply multiple functions to multiple groupby columns
The second half of the currently accepted answer is outdated and contains two deprecations. First, and most important, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.
If you want to work with two separate columns at the same time, I suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name via the special __name__ attribute, like this:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'
df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Using apply and returning a Series
Now, if you have multiple columns that need to interact together, you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed to the function as a DataFrame.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
How to apply *multiple* functions to pandas groupby apply?
For this specific issue, how about a groupby after taking the difference?
(df['x']-df['y']).groupby(df['id']).agg(['min','max'])
More generically, you could probably do something like
df.groupby('id').apply(lambda x: pd.Series({'min': mindist(x), 'max': maxdist(x)}))
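As a runnable sketch of the first suggestion (the data below is hypothetical; `mindist`/`maxdist` from the generic version are the asker's own helpers, so only the difference-then-groupby form is shown):

```python
import pandas as pd

# Hypothetical sample data with the id/x/y columns the answer assumes
df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'x': [10, 20, 5, 7],
                   'y': [3, 4, 1, 6]})

# Take the difference first, then group the resulting Series by id
result = (df['x'] - df['y']).groupby(df['id']).agg(['min', 'max'])
print(result)
#     min  max
# id
# 1     7   16
# 2     1    4
```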
How can I apply multiple functions involving multiple columns of a pandas dataframe with groupby?
You first need to assign max-score to mean(max-score); then this is a simple groupby + agg:
(df.assign(**{'mean(max-score)': df['max']-df['score']})
.groupby('id', as_index=False)
.agg({'cat': 'first', 'date': 'first', 'mean(max-score)': 'mean'})
)
output:
id cat date mean(max-score)
0 s1 A 12/06 4.1
1 s2 C 11/04 13.4
2 s3 E 08/02 5.6
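The question's input frame isn't shown above, so here is a runnable sketch of the same assign + groupby + agg pattern on hypothetical data (column names match, values do not):

```python
import pandas as pd

# Hypothetical input resembling the question's data
df = pd.DataFrame({'id': ['s1', 's1', 's2'],
                   'cat': ['A', 'A', 'C'],
                   'date': ['12/06', '12/06', '11/04'],
                   'max': [10, 12, 20],
                   'score': [6, 8, 7]})

# Compute max - score as a new column, then take its per-id mean
out = (df.assign(**{'mean(max-score)': df['max'] - df['score']})
         .groupby('id', as_index=False)
         .agg({'cat': 'first', 'date': 'first', 'mean(max-score)': 'mean'}))
print(out)
```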
Apply multiple custom functions to multiple columns on multiple groupby objects in Pandas in Python
I don't know if there's a simpler way to do it, but one approach is currying. I wasn't able to find a way to use the groupby structure to add a column (the structures involved are designed around immutable data), so I just dealt with the data in the groupby object directly. Check whether the following code does what you want:
def sum_curry(x, y):
    return lambda df: sum(df[x]) + sum(df[y])

def diff_curry(x, y):
    return lambda df: sum(df[x]) - sum(df[y])

def append_prod(df):
    df['E'] = df['C'] * df['D']
    return df
g1_sums = g1.apply(sum_curry('B','C'))
g1_diffs = g1.apply(diff_curry('C','D'))
g2_diffs = g2.apply(diff_curry('B','C'))
g2_with_prod = [(group[0], append_prod(group[1])) for group in g2]
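The groupby objects `g1` and `g2` are never defined in the answer, so here is a self-contained sketch of the currying idea on hypothetical data (the `key` grouping column and values are assumptions):

```python
import pandas as pd

# Hypothetical data; g1 in the answer is a pre-existing groupby object
df = pd.DataFrame({'key': ['x', 'x', 'y'],
                   'B': [1, 2, 3],
                   'C': [4, 5, 6],
                   'D': [7, 8, 9]})

def sum_curry(x, y):
    # Returns a function of a group DataFrame, closing over the column names
    return lambda g: sum(g[x]) + sum(g[y])

g1 = df.groupby('key')
sums = g1.apply(sum_curry('B', 'C'))
print(sums)
```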
Pandas groupby aggregate apply multiple functions to multiple columns
In pandas, the agg operation takes one or more individual methods to be applied to the relevant columns and returns a summary of the outputs. In Python, lists hold and parse multiple entities. In this case, I pass a list of functions into the aggregator. In your case, you were passing a dictionary, which means you had to handle each column individually, making it very manual. Happy to explain further if this isn't clear.
ss=tt.groupby('Group').agg(['count','mean','median'])
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
ss
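The frame `tt` isn't defined in the answer; here is a runnable sketch of the list-of-functions aggregation plus the column-flattening step on hypothetical data:

```python
import pandas as pd

# Hypothetical data standing in for tt
tt = pd.DataFrame({'Group': ['a', 'a', 'b'],
                   'val': [1, 3, 5]})

# Apply all three statistics to every non-grouping column at once
ss = tt.groupby('Group').agg(['count', 'mean', 'median'])

# Flatten the (column, statistic) MultiIndex into single strings
ss.columns = ['_'.join(col).strip() for col in ss.columns.values]
print(ss)
#        val_count  val_mean  val_median
# Group
# a              2       2.0         2.0
# b              1       5.0         5.0
```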
pandas, apply multiple functions of multiple columns to groupby object
I think you can avoid agg or apply: first multiply with mul, then divide with div, and finally group by the index and aggregate with sum:
lasts = pd.DataFrame({'user':['a','s','d','d'],
'elapsed_time':[40000,50000,60000,90000],
'running_time':[30000,20000,30000,15000],
'num_cores':[7,8,9,4]})
print (lasts)
elapsed_time num_cores running_time user
0 40000 7 30000 a
1 50000 8 20000 s
2 60000 9 30000 d
3 90000 4 15000 d
by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
elapsed_days running_days
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
.div(86400)
.groupby(level=0)
.sum())
elapsed_time running_time
user
a 3.240741 2.430556
d 10.416667 3.819444
s 4.629630 1.851852
Groupby and Apply Functions on multiple Columns with 1-to-many relationship
You can use a regex to add the URL part:
woLink = r'example.org/woNum='
df['Link'] = df['Work Order'].str.replace(r'(\d+)', rf'{woLink}\1', regex=True)
output:
Date Ticket ID Work Order Link
0 2018-08-30 22:52:25 1444008 119846184 example.org/woNum=119846184
1 2021-09-29 13:33:49 1724734 122445397, 122441551 example.org/woNum=122445397, example.org/woNum=122441551
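As a self-contained sketch of the same replacement (only the `Work Order` column is reconstructed; the other columns from the output above are omitted, and `regex=True` is required in pandas >= 2.0, where literal replacement became the default):

```python
import pandas as pd

# Hypothetical frame with only the column the regex operates on
df = pd.DataFrame({'Work Order': ['119846184', '122445397, 122441551']})

woLink = r'example.org/woNum='
# Prefix every run of digits with the URL stem via a backreference
df['Link'] = df['Work Order'].str.replace(r'(\d+)', rf'{woLink}\1', regex=True)
print(df['Link'].tolist())
```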
Apply multiple functions to GroupBy object in a specific order
is that doable (without having to re-group and without using .apply)
I think not in general; but if there are only 2 values per group, or the data follows some other pattern, there are alternatives.
#if always 2 values per id in order
df1 = df.groupby("id")['date'].agg(['min','max'])
max_diff_for_each_id = df1['max'].sub(df1['min']).dt.days
Or:
#if always 2 values per id
df2 = df.groupby("id")['date'].agg(['first','last'])
max_diff_for_each_id = df2['last'].sub(df2['first']).dt.days
One idea is to convert id to the index, but max(level=0) is just a hidden .groupby(level=0).max(), so this is a bit of a trick solution (in my opinion):
max_diff_for_each_id = df.set_index('id').groupby("id")['date'].diff().max(level=0).dt.days
It is also possible to chain multiple groupby calls, like:
max_diff_for_each_id = df.groupby("id")['date'].diff(1).groupby(df["id"]).max().dt.days
Or create custom functions like:
max_diff_for_each_id = df.groupby("id")['date'].apply(lambda x: x.diff().max()).dt.days
max_diff_for_each_id = df.groupby("id")['date'].agg(lambda x: x.diff().max()).dt.days
print (max_diff_for_each_id)
id
1 5
2 1
dtype: int64
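A runnable sketch of the agg variant, with hypothetical dates chosen to reproduce the output shown above (5 days for id 1, 1 day for id 2):

```python
import pandas as pd

# Hypothetical data: two dates per id
df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'date': pd.to_datetime(['2021-01-01', '2021-01-06',
                                           '2021-03-01', '2021-03-02'])})

# Largest consecutive gap within each id, as whole days
max_diff_for_each_id = (df.groupby('id')['date']
                          .agg(lambda x: x.diff().max())
                          .dt.days)
print(max_diff_for_each_id)
```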
Multiple aggregations of the same column using pandas GroupBy.agg()
As of 2022-06-20, the following is the accepted practice for aggregations:
df.groupby('dummy').agg(
    Mean=('returns', np.mean),
    Sum=('returns', np.sum))
The material below the fold is kept for historical versions of pandas.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
or as a dictionary:
In [21]: df.groupby('dummy').agg({'returns':
             {'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012
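A self-contained sketch of the named-aggregation form on hypothetical data (string names like 'mean'/'sum' are used instead of the numpy functions, which recent pandas versions deprecate in agg):

```python
import pandas as pd

# Hypothetical data resembling the question's df
df = pd.DataFrame({'dummy': [1] * 4,
                   'returns': [0.1, 0.2, 0.3, 0.4]})

# Named aggregation: keyword = (column, function)
out = df.groupby('dummy').agg(Mean=('returns', 'mean'),
                              Sum=('returns', 'sum'))
print(out)
```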
Apply Same Aggregation on Multiple Columns when Using Groupby (python)
You can generate the dict:
d = {**{"payment_amount": 'sum'},
**dict.fromkeys(["user_id" , "category" , "name"], 'first')}
print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}
expected_output = example_df.groupby("user_id").agg(d)
A more general solution would be:
d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}
expected_output = example_df.groupby("user_id").agg(d)
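Since example_df isn't shown, here is a runnable sketch of the general solution on hypothetical data (I drop the grouping key from the dict, since it becomes the index after groupby):

```python
import pandas as pd

# Hypothetical example_df matching the column names used in the answer
example_df = pd.DataFrame({'user_id': [1, 1, 2],
                           'category': ['a', 'a', 'b'],
                           'name': ['n1', 'n1', 'n2'],
                           'payment_amount': [10, 20, 5]})

# Default every column to 'first', then override the one to be summed
d = dict.fromkeys(example_df.columns.drop('user_id'), 'first')
d['payment_amount'] = 'sum'

expected_output = example_df.groupby('user_id').agg(d)
print(expected_output)
```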