groupby weighted average and sum in pandas dataframe
EDIT: update aggregation so it works with recent version of pandas
To pass multiple functions to a groupby object, you pass tuples pairing the source column with the aggregation function to apply (named aggregation):
import numpy as np

# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])
# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)
# Group by and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
You can see more here:
- http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
and in a similar question here:
- Apply multiple functions to multiple groupby columns
Hope this helps
[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling
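The snippet above assumes a pre-existing df; a self-contained run of the same named-aggregation pattern, using made-up sample data for the columns referenced above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the column names used above
df = pd.DataFrame({
    "contract": ["C", "C", "CC", "CC"],
    "month": ["Z", "Z", "U", "U"],
    "year": [5, 5, 5, 5],
    "buys": ["Sell", "Sell", "Buy", "Buy"],
    "adjusted_lots": [-10, -9, 2, 3],
    "price": [400.0, 450.0, 3300.0, 3350.0],
})

# Weighted mean of price, using adjusted_lots of the same rows as weights
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

out = df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=("adjusted_lots", "sum"),
    price_weighted_mean=("price", wm),
)
print(out)
```

For the Buy group this gives a lot sum of 5 and a weighted mean of (3300*2 + 3350*3) / 5 = 3330.0.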
Calculate the weighted average using groupby in Python
This should do the trick:
import pandas as pd
def calculator(df, columns):
    # average of columns[1], weighted by columns[0]
    weighted_avg = (df[columns[0]] * df[columns[1]]).sum() / df[columns[0]].sum()
    return weighted_avg

cols = ['tot_SKU', 'avg_lag']
Sums = df.groupby('SF_type').apply(lambda x: calculator(x, cols))
df.join(Sums.rename('sums'), on='SF_type')
Edit: Added the requested merge with the old dataframe
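Since df isn't defined in the answer, here is a self-contained run with made-up values for the SF_type, tot_SKU, and avg_lag columns:

```python
import pandas as pd

# Hypothetical sample data with the column names used above
df = pd.DataFrame({
    "SF_type": ["x", "x", "y"],
    "tot_SKU": [10, 30, 20],
    "avg_lag": [2.0, 4.0, 5.0],
})

def calculator(df, columns):
    # average of columns[1], weighted by columns[0]
    return (df[columns[0]] * df[columns[1]]).sum() / df[columns[0]].sum()

cols = ["tot_SKU", "avg_lag"]
Sums = df.groupby("SF_type").apply(lambda x: calculator(x, cols))
result = df.join(Sums.rename("sums"), on="SF_type")
print(result)
```

Group x gets (10*2 + 30*4) / 40 = 3.5 broadcast back to both of its rows; group y gets 5.0.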
Calculating weighted average by GroupBy.agg and a named aggregation
A weighted average requires 2 separate Series (i.e. a DataFrame). Because of this, GroupBy.apply is the appropriate method to use. Use pd.concat to join the results.
pd.concat([t.groupby('bucket').agg(NR=('bucket', 'count'),
                                   AVG_QTY=('qty', np.mean)),
           (t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
             .rename('W_AVG_QTY'))],
          axis=1)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
This can be done with agg, assuming your DataFrame has a unique Index, though I can't guarantee it will be very performant given all the slicing. We create our own function that accepts the Series of values and the entire DataFrame. The function then subsets the DataFrame using the Series' index to obtain the weights for each group.
import functools
import numpy as np

def my_w_avg(s, df, wcol):
    # weights come from the full DataFrame, aligned on the group's index
    return np.average(s, weights=df.loc[s.index, wcol])
t.groupby('bucket').agg(
NR= ('bucket', 'count'),
AVG_QTY= ('qty', np.mean),
W_AVG_QTY= ('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
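The t DataFrame isn't shown in the answer; a self-contained run of the functools.partial pattern above, with made-up bucket/qty/weight values:

```python
import functools

import numpy as np
import pandas as pd

# Hypothetical sample data for the columns used above
t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [100, 500, 300],
    "weight": [1.0, 3.0, 2.0],
})

def my_w_avg(s, df, wcol):
    # s is the group's qty Series; look up its weights in the full frame
    return np.average(s, weights=df.loc[s.index, wcol])

out = t.groupby("bucket").agg(
    NR=("bucket", "count"),
    AVG_QTY=("qty", "mean"),
    W_AVG_QTY=("qty", functools.partial(my_w_avg, df=t, wcol="weight")),
)
print(out)
```

Bucket a has a plain mean of 300 but a weighted mean of (100*1 + 500*3) / 4 = 400.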
Groupby and weighted average
Your error occurs because you cannot do multiple-series/column operations with agg; agg takes one series/column at a time. Let's use apply and pd.concat.
g = df.groupby(['STAND_ID','Species'])
newdf = pd.concat([g.apply(lambda x: np.average(x['Height'], weights=x['Volume'])),
                   g.apply(lambda x: np.average(x['Stems'], weights=x['Volume']))],
                  axis=1, keys=['Height', 'Stems']).unstack()
Edit a better solution:
g = df.groupby(['STAND_ID','Species'])
newdf = g.apply(lambda x: pd.Series([np.average(x['Height'], weights=x['Volume']),
                                     np.average(x['Stems'], weights=x['Volume'])],
                                    index=['Height', 'Stems'])).unstack()
Output:
Height Stems
Species Broadleaves Conifer Broadleaves Conifer
STAND_ID
1 19.0 20.000000 2000.0 1500.000000
2 NaN 13.000000 NaN 1000.000000
3 24.0 24.363636 1200.0 1636.363636
python pandas weighted average with the use of groupby agg()
You can use the x you have in the lambda (specifically, use its .index to get the values you want). For example:
import pandas as pd
import numpy as np
def weighted_avg(group_df, whole_df, values, weights):
    v = whole_df.loc[group_df.index, values]
    w = whole_df.loc[group_df.index, weights]
    return (v * w).sum() / w.sum()
dfr = pd.DataFrame(np.random.randint(1, 50, size=(4, 4)), columns=list("ABCD"))
dfr["group"] = [1, 1, 0, 1]
print(dfr)
dfr = (
    dfr.groupby("group")
    .agg({"A": "mean", "B": "sum", "C": lambda x: weighted_avg(x, dfr, "D", "C")})
    .reset_index()
)
print(dfr)
Prints:
A B C D group
0 32 2 34 29 1
1 33 32 15 49 1
2 4 43 41 10 0
3 39 33 7 31 1
group A B C
0 0 4.000000 43 10.000000
1 1 34.666667 67 34.607143
EDIT: As @enke stated in the comments, you can call your weighted_avg function with an already-filtered dataframe (dropping the whole_df parameter, since the filtered frame carries both columns):
weighted_avg(dfr.loc[x.index], 'D', 'C')
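That one-liner assumes weighted_avg has been reworked to take the filtered frame as its only DataFrame argument; a minimal sketch of that variant, with made-up sample data:

```python
import pandas as pd

def weighted_avg(df, values, weights):
    # df is already restricted to the group's rows
    v = df[values]
    w = df[weights]
    return (v * w).sum() / w.sum()

# Hypothetical sample data: C holds the weights, D the values
dfr = pd.DataFrame({
    "C": [2, 3, 5],
    "D": [10, 20, 30],
    "group": [1, 1, 0],
})

out = dfr.groupby("group").agg(
    {"C": lambda x: weighted_avg(dfr.loc[x.index], "D", "C")}
)
print(out)
```

Group 1 gives (10*2 + 20*3) / 5 = 16.0; group 0 gives 30.0.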
weighted average aggregation on multiple columns of df
Change the function so it works with multiple columns, and, to avoid losing the columns used for grouping, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
#columns used for groupby
groups = ["Group", "Year", "Month"]
#processing all another columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
#create index and processing all columns by variable cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
pandas and groupby: how to calculate weighted averages within an agg
It is possible, but really complicated:
np.random.seed(234)
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'
f = lambda x: x.sum() / df['b'] .sum()
f.__name__ = '%'
g = df.groupby('a').agg(
    {'b': ['sum', f, 'mean', wm],
     'c': ['sum', 'mean'],
     'd': ['sum']})
g.columns = g.columns.map('_'.join)
print (g)
d_sum c_sum c_mean b_sum b_% b_mean b_wa
a
5 2104 2062 5.976812 2067 0.344672 5.991304 5.969521
6 1859 1857 5.951923 1875 0.312656 6.009615 5.954667
7 2058 2084 6.075802 2055 0.342671 5.991254 6.085645
Solution with apply:
def func(x):
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()
    c1 = x['c'].sum()
    c2 = x['c'].mean()
    d1 = x['d'].sum()
    cols = ['b sum', 'b %', 'wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1, b2, b3, b4, c1, c2, d1], index=cols)
g = df.groupby('a').apply(func)
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
g.loc['total']=g.sum()
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
total 5997.0 1.000000 18.009832 17.992173 6003.0 18.004536 6021.0
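A less complicated route (not taken in the answer above) is to avoid the closure over df entirely: since a weighted average is just sum(b*c) / sum(b), it can be built from two plain grouped sums. A sketch on the same random data:

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])

# Weighted average of c per group of a, weighted by b:
# pre-multiply, take grouped sums, then divide
g = df.assign(bc=df['b'] * df['c']).groupby('a')[['b', 'bc']].sum()
b_wa = g['bc'] / g['b']
print(b_wa)
```

This matches the b_wa column from the agg solution and needs no lambda referencing the outer dataframe.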