Group by in R, ddply with weighted.mean

Grouped weighted mean in R (preferably with ddply, but whatever works)

An approach with data.table

library(data.table)

setDT(mydf)[, list(normalMean   = mean(gain),
                   weightedMean = weighted.mean(gain, wt/sum(wt))),
            by = group]

#    group normalMean weightedMean
# 1:     a       0.03   0.01227273
# 2:     b       0.06   0.09272727
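
The snippet assumes mydf has group, gain, and wt columns. A minimal, hypothetical construction (the values below are illustrative only and will not reproduce the output above) would be:

library(data.table)

# hypothetical data with the column names used above
mydf <- data.frame(group = rep(c("a", "b"), each = 5),
                   gain  = rnorm(10, 0.05, 0.05),
                   wt    = runif(10))

# weighted.mean() rescales the weights internally,
# so wt and wt/sum(wt) give the same result
setDT(mydf)[, list(normalMean   = mean(gain),
                   weightedMean = weighted.mean(gain, wt)),
            by = group]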

Group by in R, ddply with weighted.mean

Use an anonymous function:

> ddply(iris,"Species",function(X) data.frame(wmn=weighted.mean(X$Sepal.Length,
+ X$Petal.Length),
+ mn=mean(X$Sepal.Length)))
Species wmn mn
1 setosa 5.016963 5.006
2 versicolor 5.978075 5.936
3 virginica 6.641535 6.588
>

This computes the weighted mean of Sepal.Length (weighted by Petal.Length) as well as the unweighted mean, and returns both.
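
Since the weights here are just another column of iris, the first row can be checked directly against weighted.mean() on the setosa subset:

# should match the wmn value reported for setosa above
with(subset(iris, Species == "setosa"),
     weighted.mean(Sepal.Length, Petal.Length))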

How to use ddply to get weighted-mean of class in dataframe?

You might find what you want in summarise (see ?summarise). I can replicate your code with summarise as follows:

library(plyr)
set.seed(123)
frame <- data.frame(class = sample(LETTERS[1:5], 20, replace = TRUE),
                    x = rnorm(20), x2 = rnorm(20), weights = rnorm(20))

ddply(frame, .(class), summarise,
      x2 = weighted.mean(x2, weights))

To do this for x as well, just add another line inside the summarise call:

ddply(frame, .(class), summarise,
      x  = weighted.mean(x, weights),
      x2 = weighted.mean(x2, weights))

Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or run summarise on a melted data frame (reshape2 package) and then cast back to the original form; a melt/cast sketch is included at the end of this answer. Here is an example with colwise:

wmean.vars <- c("x", "x2")

ddply(frame, .(class), function(x)
  colwise(weighted.mean, w = x$weights)(x[wmean.vars]))

Finally, if you don't like having to specify wmean.vars, you can also do:

ddply(frame, .(class), function(x)
  numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))

which computes a weighted average for every numeric column, excluding the weights themselves.
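
Here is a sketch of the melt/cast route mentioned in the edit above, assuming the same frame and using reshape2 together with plyr:

library(reshape2)

# long format: one row per (class, variable) value, keeping the weights alongside
molten <- melt(frame, id.vars = c("class", "weights"))

# weighted mean per class and variable, then back to wide form
res <- ddply(molten, .(class, variable), summarise,
             wmean = weighted.mean(value, weights))
dcast(res, class ~ variable, value.var = "wmean")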

Display weighted mean by group in the data.frame

If we use mutate, we can avoid the left_join:

library(dplyr)
df %>%
  group_by(education) %>%
  mutate(weighted_income = weighted.mean(income, weight))
#    obs income education weight weighted_income
#  <int>  <int>    <fctr>  <int>           <dbl>
#1     1   1000         A     10        1166.667
#2     2   2000         B      1        1583.333
#3     3   1500         B      5        1583.333
#4     4   2000         A      2        1166.667
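
mutate() keeps all the original rows and repeats the group value on each of them; if one row per education level is wanted instead, summarise() collapses the groups (a minimal variation on the code above):

df %>%
  group_by(education) %>%
  summarise(weighted_income = weighted.mean(income, weight))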

Calculate a weighted mean by group with dplyr (and replicate other approaches)

This is a very common issue when the plyr package is attached after dplyr, because plyr::summarise then masks dplyr::summarise. Just call dplyr::summarise explicitly; it is the first thing to check when summarise produces unexpected results.
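
For example, qualifying the call with the namespace sidesteps the mask regardless of which package was attached last (a minimal sketch using the same df, A, B and P columns as in the code below):

library(dplyr)
library(plyr)   # attaching plyr after dplyr masks dplyr's summarise

df %>%
  group_by(B) %>%
  dplyr::summarise(wm = weighted.mean(A, P))   # explicit namespace avoids plyr::summarise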

Another way is to detach the plyr package before using dplyr:

detach("package:plyr")
library("dplyr")
df %>% group_by(B) %>%
summarise(wm = weighted.mean(A, P))
# B wm
# <dbl> <dbl>
# 1 10 1.6
# 2 20 1.8

Weighted mean of a group, where weight is from another group

First of all, this is a hacky solution, and I am sure there is a better approach. The goal is to make a new column containing the weights; this approach does so via the matching behaviour of left_join(), but you could probably also achieve it with fill() or across().

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1, 2, 1, 2), Group = rep(1, 4), Year = rep(2016, 4),
                  Field = c("AA", "AA", "TOTAL", "TOTAL"), VALUE = c(10, 16, 100, 120))

# Make a new dataframe containing the TOTAL values
weights <- dat %>%
  filter(Field == "TOTAL") %>%
  mutate(w = VALUE) %>%
  select(-Field, -VALUE)
weights
#>   ID Group Year   w
#> 1  1     1 2016 100
#> 2  2     1 2016 120

# Make a new frame containing the original values and the weights
new_dat <- left_join(dat, weights, by = c("Group", "Year", "ID"))

# Compute the weighted mean within each group (the weight column came from the join)
new_dat %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, w)) %>%
  ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#>   Group  Year Field     m
#>   <dbl> <dbl> <chr> <dbl>
#> 1     1  2016 AA     13.3
#> 2     1  2016 TOTAL  111.

Created on 2020-11-03 by the reprex package (v0.3.0)
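
An alternative that stays within dplyr and avoids building a separate weights frame is to pull the TOTAL value into a weight column inside a grouped mutate(); a sketch, assuming every ID/Group/Year combination has exactly one TOTAL row:

dat %>%
  group_by(ID, Group, Year) %>%
  mutate(w = VALUE[Field == "TOTAL"]) %>%   # weight taken from the TOTAL row of the same ID/Group/Year
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, w), .groups = "drop")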

R using dplyr to compute weighted statistics by group

I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:

library(dplyr)
datasets::mtcars %>%
  group_by(gear) %>%
  summarize(n = n(),
            mpg_weighted_by_weight = sum(mpg * wt) / sum(wt),
            mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
            mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2)) / (sum(wt) - 1)),
            mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))

# A tibble: 3 x 6
   gear     n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int>                  <dbl>                        <dbl>  <dbl>        <dbl>
1     3    15                   15.6                         15.6   3.32         3.32
2     4    12                   23.6                         23.6   4.81         4.81
3     5     5                   19.7                         19.7   5.63         5.63

I wasn't familiar with the formula for the weighted standard deviation, so I cheated and relied on the formula used by Hmisc::wtd.var. If you Ctrl-click the function name in RStudio, it shows the underlying code of the function; most of it is error handling until the bottom:

# Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE,
          method = c("unbiased", "ML"))
{
  # ... skipping error handling
  sw <- sum(weights)
  # ...
  xbar <- sum(weights * x) / sw
  sum(weights * ((x - xbar)^2)) / (sw - 1)
}
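
The same unbiased formula is easy to wrap in a small helper if you would rather not depend on Hmisc; a minimal sketch (weighted_sd is a hypothetical name, not from any package):

# unbiased weighted standard deviation, mirroring Hmisc::wtd.var's default method
weighted_sd <- function(x, w) {
  xbar <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - xbar)^2) / (sum(w) - 1))
}

# with dplyr loaded as above, this should reproduce the mpg_sd column
mtcars %>%
  group_by(gear) %>%
  summarize(mpg_sd = weighted_sd(mpg, wt))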

Using ddply in combo with weighted.mean in a for loop with dynamic variables

In general, you don't need loops in R for grouping and summarising (what you would call collapsing in Stata). You can use dplyr for this type of operation:

df1 %>%
  group_by(GEOLEV2, DHSYEAR) %>%
  summarise(
    across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
  )

# A tibble: 6 x 4
# Groups:   GEOLEV2 [4]
#   GEOLEV2   DHSYEAR   age highest_year_edu
#   <chr>       <dbl> <dbl>            <dbl>
# 1 768001001    1988  29.5              4
# 2 768001002    1988  32              NaN
# 3 768001002    1998  30                2.5
# 4 768001002    2013  17                2
# 5 768001006    1998  22                5
# 6 768002016    2013  25.5              4
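
If the columns to summarise are held in a character vector (the "dynamic variables" from the question), all_of() can be used inside across() instead of a column range; a minimal sketch, where vars_to_average is a hypothetical vector name:

vars_to_average <- c("age", "highest_year_edu")

df1 %>%
  group_by(GEOLEV2, DHSYEAR) %>%
  summarise(across(all_of(vars_to_average),
                   ~ weighted.mean(.x, v005, na.rm = TRUE)))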

