groupby weighted average and sum in pandas dataframe
EDIT: update aggregation so it works with recent version of pandas
To pass multiple functions to a groupby object, you pass tuples pairing the source column with the aggregation function to apply (named aggregation):
import numpy as np

# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])
# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)
# Group by and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
You can see more here:
- http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
and in a similar question here:
- Apply multiple functions to multiple groupby columns
Hope this helps
[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling
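The snippet above assumes a pre-existing df; a self-contained run of the same named-aggregation pattern, using made-up sample data for the columns referenced above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the column names used above
df = pd.DataFrame({
    "contract": ["C", "C", "CC", "CC"],
    "month": ["Z", "Z", "U", "U"],
    "year": [5, 5, 5, 5],
    "buys": ["Sell", "Sell", "Buy", "Buy"],
    "adjusted_lots": [-10, -9, 2, 3],
    "price": [400.0, 450.0, 3300.0, 3350.0],
})

# Weighted mean of price, using adjusted_lots of the same rows as weights
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

out = df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=("adjusted_lots", "sum"),
    price_weighted_mean=("price", wm),
)
print(out)
```

For the Buy group this gives a lot sum of 5 and a weighted mean of (3300*2 + 3350*3) / 5 = 3330.0.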
Calculate the weighted average using groupby in Python
This should do the trick:
import pandas as pd
def calculator(df, columns):
    # average of columns[1], weighted by columns[0]
    weighted_avg = (df[columns[0]] * df[columns[1]]).sum() / df[columns[0]].sum()
    return weighted_avg

cols = ['tot_SKU', 'avg_lag']
Sums = df.groupby('SF_type').apply(lambda x: calculator(x, cols))
df.join(Sums.rename('sums'), on='SF_type')
Edit: Added the requested merge with the old dataframe
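Since df isn't defined in the answer, here is a self-contained run with made-up values for the SF_type, tot_SKU, and avg_lag columns:

```python
import pandas as pd

# Hypothetical sample data with the column names used above
df = pd.DataFrame({
    "SF_type": ["x", "x", "y"],
    "tot_SKU": [10, 30, 20],
    "avg_lag": [2.0, 4.0, 5.0],
})

def calculator(df, columns):
    # average of columns[1], weighted by columns[0]
    return (df[columns[0]] * df[columns[1]]).sum() / df[columns[0]].sum()

cols = ["tot_SKU", "avg_lag"]
Sums = df.groupby("SF_type").apply(lambda x: calculator(x, cols))
result = df.join(Sums.rename("sums"), on="SF_type")
print(result)
```

Group x gets (10*2 + 30*4) / 40 = 3.5 broadcast back to both of its rows; group y gets 5.0.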
Calculating weighted average by GroupBy.agg and a named aggregation
A weighted average requires 2 separate Series (i.e. a DataFrame). Because of this, GroupBy.apply is the appropriate method to use. Use pd.concat to join the results.
pd.concat([t.groupby('bucket').agg(NR=('bucket', 'count'),
                                   AVG_QTY=('qty', np.mean)),
           (t.groupby('bucket').apply(lambda gp: np.average(gp.qty, weights=gp.weight))
             .rename('W_AVG_QTY'))],
          axis=1)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
This can be done with agg, assuming your DataFrame has a unique Index, though I can't guarantee it will be very performant given all the slicing. We create our own function that accepts the Series of values and the entire DataFrame. The function then subsets the DataFrame using the Series' index to obtain the weights for each group.
import functools
import numpy as np

def my_w_avg(s, df, wcol):
    # weights come from the full DataFrame, aligned on the group's index
    return np.average(s, weights=df.loc[s.index, wcol])
t.groupby('bucket').agg(
NR= ('bucket', 'count'),
AVG_QTY= ('qty', np.mean),
W_AVG_QTY= ('qty', functools.partial(my_w_avg, df=t, wcol='weight'))
)
# NR AVG_QTY W_AVG_QTY
#bucket
#a 2 300.000000 340.0
#b 3 566.666667 687.5
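The t DataFrame isn't shown in the answer; a self-contained run of the functools.partial pattern above, with made-up bucket/qty/weight values:

```python
import functools

import numpy as np
import pandas as pd

# Hypothetical sample data for the columns used above
t = pd.DataFrame({
    "bucket": ["a", "a", "b"],
    "qty": [100, 500, 300],
    "weight": [1.0, 3.0, 2.0],
})

def my_w_avg(s, df, wcol):
    # s is the group's qty Series; look up its weights in the full frame
    return np.average(s, weights=df.loc[s.index, wcol])

out = t.groupby("bucket").agg(
    NR=("bucket", "count"),
    AVG_QTY=("qty", "mean"),
    W_AVG_QTY=("qty", functools.partial(my_w_avg, df=t, wcol="weight")),
)
print(out)
```

Bucket a has a plain mean of 300 but a weighted mean of (100*1 + 500*3) / 4 = 400.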
Groupby and weighted average
Your error occurs because you cannot do multiple-series/column operations with agg; agg takes one series/column at a time. Let's use apply and pd.concat.
g = df.groupby(['STAND_ID','Species'])
newdf = pd.concat([g.apply(lambda x: np.average(x['Height'], weights=x['Volume'])),
                   g.apply(lambda x: np.average(x['Stems'], weights=x['Volume']))],
                  axis=1, keys=['Height', 'Stems']).unstack()
Edit a better solution:
g = df.groupby(['STAND_ID','Species'])
newdf = g.apply(lambda x: pd.Series([np.average(x['Height'], weights=x['Volume']),
                                     np.average(x['Stems'], weights=x['Volume'])],
                                    index=['Height', 'Stems'])).unstack()
Output:
Height Stems
Species Broadleaves Conifer Broadleaves Conifer
STAND_ID
1 19.0 20.000000 2000.0 1500.000000
2 NaN 13.000000 NaN 1000.000000
3 24.0 24.363636 1200.0 1636.363636
python pandas weighted average with the use of groupby agg()
You can use the x you have in the lambda (specifically, use its .index to get the values you want). For example:
import pandas as pd
import numpy as np
def weighted_avg(group_df, whole_df, values, weights):
    v = whole_df.loc[group_df.index, values]
    w = whole_df.loc[group_df.index, weights]
    return (v * w).sum() / w.sum()
dfr = pd.DataFrame(np.random.randint(1, 50, size=(4, 4)), columns=list("ABCD"))
dfr["group"] = [1, 1, 0, 1]
print(dfr)
dfr = (
    dfr.groupby("group")
    .agg({"A": "mean", "B": "sum", "C": lambda x: weighted_avg(x, dfr, "D", "C")})
    .reset_index()
)
print(dfr)
Prints:
A B C D group
0 32 2 34 29 1
1 33 32 15 49 1
2 4 43 41 10 0
3 39 33 7 31 1
group A B C
0 0 4.000000 43 10.000000
1 1 34.666667 67 34.607143
EDIT: As @enke stated in the comments, you can call your weighted_avg function with an already-filtered dataframe (dropping the whole_df parameter, since the filtered frame carries both columns):
weighted_avg(dfr.loc[x.index], 'D', 'C')
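That one-liner assumes weighted_avg has been reworked to take the filtered frame as its only DataFrame argument; a minimal sketch of that variant, with made-up sample data:

```python
import pandas as pd

def weighted_avg(df, values, weights):
    # df is already restricted to the group's rows
    v = df[values]
    w = df[weights]
    return (v * w).sum() / w.sum()

# Hypothetical sample data: C holds the weights, D the values
dfr = pd.DataFrame({
    "C": [2, 3, 5],
    "D": [10, 20, 30],
    "group": [1, 1, 0],
})

out = dfr.groupby("group").agg(
    {"C": lambda x: weighted_avg(dfr.loc[x.index], "D", "C")}
)
print(out)
```

Group 1 gives (10*2 + 20*3) / 5 = 16.0; group 0 gives 30.0.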
weighted average aggregation on multiple columns of df
Change the function so it works with multiple columns, and, to avoid losing the columns used for grouping, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
#columns used for groupby
groups = ["Group", "Year", "Month"]
#processing all another columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
#create index and processing all columns by variable cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
pandas and groupby: how to calculate weighted averages within an agg
It is possible, but really complicated:
np.random.seed(234)
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'
f = lambda x: x.sum() / df['b'] .sum()
f.__name__ = '%'
g = df.groupby('a').agg(
    {'b': ['sum', f, 'mean', wm],
     'c': ['sum', 'mean'],
     'd': ['sum']})
g.columns = g.columns.map('_'.join)
print (g)
d_sum c_sum c_mean b_sum b_% b_mean b_wa
a
5 2104 2062 5.976812 2067 0.344672 5.991304 5.969521
6 1859 1857 5.951923 1875 0.312656 6.009615 5.954667
7 2058 2084 6.075802 2055 0.342671 5.991254 6.085645
Solution with apply:
def func(x):
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()
    c1 = x['c'].sum()
    c2 = x['c'].mean()
    d1 = x['d'].sum()
    cols = ['b sum', 'b %', 'wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1, b2, b3, b4, c1, c2, d1], index=cols)
g = df.groupby('a').apply(func)
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
g.loc['total']=g.sum()
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
total 5997.0 1.000000 18.009832 17.992173 6003.0 18.004536 6021.0
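A less complicated route (not taken in the answer above) is to avoid the closure over df entirely: since a weighted average is just sum(b*c) / sum(b), it can be built from two plain grouped sums. A sketch on the same random data:

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])

# Weighted average of c per group of a, weighted by b:
# pre-multiply, take grouped sums, then divide
g = df.assign(bc=df['b'] * df['c']).groupby('a')[['b', 'bc']].sum()
b_wa = g['bc'] / g['b']
print(b_wa)
```

This matches the b_wa column from the agg solution and needs no lambda referencing the outer dataframe.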