Groupby Weighted Average and Sum in Pandas Dataframe


EDIT: updated the aggregation so it works with recent versions of pandas.

To pass multiple functions to a groupby object, you pass one tuple per output column, pairing the source column with the aggregation function that applies to it:
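The answer assumes the question's df; a hypothetical stand-in (values made up, reverse-engineered so the aggregates match the output shown below) makes the snippet runnable:

import numpy as np
import pandas as pd

# Hypothetical input; values chosen to reproduce the output below.
df = pd.DataFrame({
    "contract": ["C", "C", "CC", "SB", "W"],
    "month": ["Z", "Z", "U", "V", "Z"],
    "year": [5, 5, 5, 5, 5],
    "buys": ["Sell", "Sell", "Buy", "Buy", "Sell"],
    "adjusted_lots": [-10, -9, 5, 12, -5],
    "price": [424.00, 425.75, 3328.00, 11.6375, 554.85],
})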

# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)

# Groupby and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=("adjusted_lots", "sum"),
    price_weighted_mean=("price", wm),
)

                          adjusted_lots  price_weighted_mean
contract month year buys
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5           554.850000
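One caveat: wm closes over the outer df to fetch its weights, so it only works while the grouped frame still shares df's index. The same result can be obtained with the self-contained apply pattern used in the answers below; a sketch, not part of the original answer:

(df.groupby(["contract", "month", "year", "buys"])
   .apply(lambda g: np.average(g["price"], weights=g["adjusted_lots"]))
   .rename("price_weighted_mean"))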

You can see more here:

  • http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once

and in a similar question here:

  • Apply multiple functions to multiple groupby columns

Hope this helps

[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

Calculate the weighted average using groupby in Python

So this should do the trick, I think:

import pandas as pd

def calculator(df, columns):
    # Weighted average: values in columns[1], weighted by columns[0].
    weighted_sum = (df[columns[0]] * df[columns[1]]).sum() / df[columns[0]].sum()
    return weighted_sum

cols = ['tot_SKU', 'avg_lag']

Sums = df.groupby('SF_type').apply(lambda x: calculator(x, cols))
df.join(Sums.rename('sums'), on='SF_type')

Edit: Added the requested merge with the old dataframe
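A minimal end-to-end sketch with hypothetical data (column names from the question, values made up), to show what the join produces:

import pandas as pd

df = pd.DataFrame({'SF_type': ['A', 'A', 'B', 'B'],
                   'tot_SKU': [10, 30, 20, 20],
                   'avg_lag': [2.0, 4.0, 1.0, 3.0]})

Sums = df.groupby('SF_type').apply(lambda x: calculator(x, ['tot_SKU', 'avg_lag']))
df.join(Sums.rename('sums'), on='SF_type')
# 'sums' holds each row's group-level weighted average lag:
# A -> (10*2 + 30*4) / 40 = 3.5, B -> (20*1 + 20*3) / 40 = 2.0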

Calculating weighted average by GroupBy.agg and a named aggregation

A weighted average requires 2 separate Series (i.e. a DataFrame). Because of this, GroupBy.apply is the correct aggregation method to use; use pd.concat to join the results.
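The frame t comes from the question; a hypothetical stand-in (values reverse-engineered so the aggregates match the output shown) makes both snippets in this answer runnable:

import numpy as np
import pandas as pd

# Hypothetical t; values chosen to reproduce the aggregates below.
t = pd.DataFrame({'bucket': ['a', 'a', 'b', 'b', 'b'],
                  'qty':    [200, 400, 150, 500, 1050],
                  'weight': [3,   7,   1,   1,   2]})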

pd.concat([t.groupby('bucket').agg(NR=('bucket', 'count'),
                                   AVG_QTY=('qty', np.mean)),
           (t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
             .rename('W_AVG_QTY'))],
          axis=1)

#        NR     AVG_QTY  W_AVG_QTY
#bucket
#a        2  300.000000      340.0
#b        3  566.666667      687.5

This can be done with agg, assuming your DataFrame has a unique Index, though I can't guarantee it will be very performant given all the slicing. We create our own function that accepts the Series of values and the entire DataFrame. The function then subsets the DataFrame using the Series to obtain the weights for each group.

import functools
import numpy as np

def my_w_avg(s, df, wcol):
    # Align on s.index to pull the matching weights out of the full frame.
    return np.average(s, weights=df.loc[s.index, wcol])

t.groupby('bucket').agg(
    NR=('bucket', 'count'),
    AVG_QTY=('qty', np.mean),
    W_AVG_QTY=('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
)

#        NR     AVG_QTY  W_AVG_QTY
#bucket
#a        2  300.000000      340.0
#b        3  566.666667      687.5
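The unique-Index assumption matters because the weights are fetched through s.index; a quick illustration (my addition, not from the original answer) of how duplicate labels break the alignment:

# With duplicated index labels, .loc returns more weights than there are
# values, and np.average raises a length-mismatch ValueError:
t_dup = pd.concat([t, t])         # index 0..4 appears twice
t_dup.loc[[0], 'weight']          # two rows come back for a single label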

Groupby and weighted average

Your error is because you cannot do multiple series/column operations using agg; agg takes one series/column at a time. Let's use apply and pd.concat.

g = df.groupby(['STAND_ID', 'Species'])
newdf = pd.concat([g.apply(lambda x: np.average(x['Height'], weights=x['Volume'])),
                   g.apply(lambda x: np.average(x['Stems'], weights=x['Volume']))],
                  axis=1, keys=['Height', 'Stems']).unstack()

Edit: a better solution:

g = df.groupby(['STAND_ID', 'Species'])
newdf = g.apply(lambda x: pd.Series([np.average(x['Height'], weights=x['Volume']),
                                     np.average(x['Stems'], weights=x['Volume'])],
                                    index=['Height', 'Stems'])).unstack()

Output:

               Height                   Stems
Species   Broadleaves    Conifer  Broadleaves      Conifer
STAND_ID
1                19.0  20.000000       2000.0  1500.000000
2                 NaN  13.000000          NaN  1000.000000
3                24.0  24.363636       1200.0  1636.363636

python pandas weighted average with the use of groupby agg()

You can use the x you have in the lambda (specifically, use its .index) to get the values you want. For example:

import pandas as pd
import numpy as np

def weighted_avg(group_df, whole_df, values, weights):
    # Slice the matching rows out of the whole frame via the group's index.
    v = whole_df.loc[group_df.index, values]
    w = whole_df.loc[group_df.index, weights]
    return (v * w).sum() / w.sum()

dfr = pd.DataFrame(np.random.randint(1, 50, size=(4, 4)), columns=list("ABCD"))
dfr["group"] = [1, 1, 0, 1]

print(dfr)
dfr = (
    dfr.groupby("group")
    .agg({"A": "mean", "B": "sum", "C": lambda x: weighted_avg(x, dfr, "D", "C")})
    .reset_index()
)
print(dfr)

Prints:

    A   B   C   D  group
0  32   2  34  29      1
1  33  32  15  49      1
2   4  43  41  10      0
3  39  33   7  31      1

   group          A   B          C
0      0   4.000000  43  10.000000
1      1  34.666667  67  34.607143

EDIT: As @enke stated in the comments, you can call your weighted_avg function with an already-filtered dataframe:

weighted_avg(dfr.loc[x.index], 'D', 'C')
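With that call, the function no longer needs the whole_df parameter. A sketch of the reworked version (my reading of the comment, not code from the original answer):

def weighted_avg(group_df, values, weights):
    # group_df is already the group's slice (dfr.loc[x.index] in the lambda),
    # so no index lookup against the whole frame is needed.
    return (group_df[values] * group_df[weights]).sum() / group_df[weights].sum()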

weighted average aggregation on multiple columns of df

Change the function so it works on multiple columns and, to avoid removing the columns used for grouping, convert them to a MultiIndex:

def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        # Each row's contribution: value * weight / sum of the group's weights.
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()

#columns used for groupby
groups = ["Group", "Year", "Month"]
#process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)

#create index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
  Group  Year  Month    Calcium   Nitrogen
0     A  2020      1  28.000000   4.000000
1     A  2020      1  46.800000   2.400000
2     A  2021      5  36.000000   2.727273
3     A  2021      5  24.545455   3.636364
4     B  2021      8  90.000000  10.000000
5     C  2021      8  51.111111  11.111111
6     C  2021      8  42.222222   4.444444
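Because wavg returns each row's value * weight / sum-of-group-weights share, summing the rows of a group recovers that group's weighted average. If you want one row per group instead (my extrapolation, not part of the original answer):

# Collapse the per-row contributions into one weighted average per group:
df1.groupby(groups, sort=False)[list(cols)].sum().reset_index()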

pandas and groupby: how to calculate weighted averages within an agg

It is possible, but really complicated:

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])

# Weighted average of c, using the aggregated column (b) as weights:
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'

# Share of the column's total:
f = lambda x: x.sum() / df['b'].sum()
f.__name__ = '%'

g = df.groupby('a').agg({'b': ['sum', f, 'mean', wm],
                         'c': ['sum', 'mean'],
                         'd': ['sum']})
g.columns = g.columns.map('_'.join)
print (g)

   d_sum  c_sum    c_mean  b_sum       b_%    b_mean      b_wa
a
5   2104   2062  5.976812   2067  0.344672  5.991304  5.969521
6   1859   1857  5.951923   1875  0.312656  6.009615  5.954667
7   2058   2084  6.075802   2055  0.342671  5.991254  6.085645

Solution with apply:

def func(x):
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()

    c1 = x['c'].sum()
    c2 = x['c'].mean()

    d1 = x['d'].sum()
    cols = ['b sum', 'b %', 'wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1, b2, b3, b4, c1, c2, d1], index=cols)

g = df.groupby('a').apply(func)
print (g)
print (g)
    b sum       b %        wa    b mean   c sum    c mean   d sum
a
5  2067.0  0.344672  5.969521  5.991304  2062.0  5.976812  2104.0
6  1875.0  0.312656  5.954667  6.009615  1857.0  5.951923  1859.0
7  2055.0  0.342671  6.085645  5.991254  2084.0  6.075802  2058.0

g.loc['total'] = g.sum()
print (g)

        b sum       b %         wa     b mean   c sum     c mean   d sum
a
5      2067.0  0.344672   5.969521   5.991304  2062.0   5.976812  2104.0
6      1875.0  0.312656   5.954667   6.009615  1857.0   5.951923  1859.0
7      2055.0  0.342671   6.085645   5.991254  2084.0   6.075802  2058.0
total  5997.0  1.000000  18.009832  17.992173  6003.0  18.004536  6021.0

(Note that the total row simply sums every column, so it is only meaningful for the sum columns; a column-wise sum of means or weighted averages has no statistical interpretation.)

