Weighted Means by Group and Column

Display weighted mean by group in the data.frame

If we use mutate, then we can avoid the left_join

library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667

The weighted means of group is not equal to the total mean in pandas groupby

You compute a weighted mean within each group, so when you compute the total mean from the weighted means, the correct weight for each group is the sum of the weights within the group (and not the size of the group).

In [47]: wsums = df.groupby("groups").apply(lambda d: d["weights"].sum())

In [48]: total_mean_from_group_means = np.average(group_means, weights=wsums)

In [49]: total_mean_from_group_means
Out[49]: 0.5070955626929458

Weighted means for several columns, by groups (in a data.table)

Normally, you should be able to do:

dt2 <- dt[,lapply(.SD,weighted.mean,w=weights), 
by = list(region,state,county), .SDcols = colsToKeep]

i.e., just by providing just those columns to .SDcols. But at the moment, this won't work due to a bug, in that weights column won't be available because it's not specified in .SDcols.

Until it's fixed, we can accomplish this as follows:

dt2 <- dt[, lapply(mget(colsToKeep), weighted.mean, w = weights), 
by = list(region, state, county)]
# region state county y1980 y1990
# 1: 1 1 1 100.0000 200.0000
# 2: 1 1 2 50.0000 100.0000
# 3: 1 1 3 1000.0000 500.0000
# 4: 1 1 4 113.2353 144.1176
# 5: 2 2 1 10.0000 50.0000
# 6: 2 2 2 34.0000 82.0000
# 7: 2 2 3 100.0000 100.0000

Calculate weighted means for multiple grouping with different weightings in R

dfnew <- setDT(df)[, lapply(.SD, function(x) weighted.mean(x, value, na.rm = TRUE)), by = c("year", "site"), .SDcols = wingL:proW]

I had to include the na.rm statement! I think this gives the correct results. Thanks everyone for helping me think it through as I did have errors by grouping - was over thinking it.

It does replace the original values, but I can live with that.

Weighted means for groups in r - using aggregate and weighted.mean functions together

Here is a way to compute the weighted means with aggregate called by by().

res <- by(df, df$X1, function(DF){
aggregate(cbind(Y1, Y2, Y3) ~ X1, DF, function(y, w)
weighted.mean(y, w = DF[['wgt']], na.rm = TRUE))
})
do.call(rbind, res)
# X1 Y1 Y2 Y3
#A A 2.152503 2.633935 18.93457
#B B 6.677851 3.589251 16.90102
#C C 10.194695 2.638378 16.70958

How to calculate weighted mean using group_by function in Pandas

Another option without using apply (which is generally not recommended for performance reasons, see timing example below):

(df.iloc[:, 2:]
.multiply(df['length'], axis=0)
.divide(df.groupby('id')['length'].transform('sum'), axis=0)
.groupby(df['id'])
.sum()
.add_prefix('weighted_'))

Output:

    weighted_0  weighted_1  weighted_2  weighted_3  weighted_40
id
1 0.133333 0.016667 0.500000 0.2 0.300000
3 0.281818 0.345455 0.381818 0.5 0.390909

For example, timing on a 10K rows dataset with 5K unique ids:

%timeit f_this_answer(df)
2.87 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit f_with_apply(df)
2.04 s ± 5.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This answer runs in 3 ms compared with 2 s execution time for Jérôme Richard's answer with apply (approximately 700x improvement)

Weighted mean of a group, where weight is from another group

First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but I am sure you could do this with fill() or across().

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))

# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120

# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))

# Add a column for weight
new_dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,w)) %>%
ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.

Created on 2020-11-03 by the reprex package (v0.3.0)



Related Topics



Leave a reply



Submit