Group by in R, Ddply with Weighted.Mean

grouped weighted mean in r (preferably with ddply but whatever works)

An approach with data.table

library(data.table)

setDT(mydf)[,list(normalMean=mean(gain), 
             weightedMean=weighted.mean(gain, wt/sum(wt))), 
             by = group]

#   group normalMean weightedMean
#1:     a       0.03   0.01227273
#2:     b       0.06   0.09272727

group by in R, ddply with weighted.mean

Use an anonymous function:

> ddply(iris,"Species",function(X) data.frame(wmn=weighted.mean(X$Sepal.Length,
+                                                               X$Petal.Length),
+                                             mn=mean(X$Sepal.Length)))
     Species      wmn    mn
1     setosa 5.016963 5.006
2 versicolor 5.978075 5.936
3  virginica 6.641535 6.588
>

This computes a weighted mean of Sepal.Length (weighted by Petal.Length) as well as unweighted mean and returns both.

How to use ddply to get weighted-mean of class in dataframe?

You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:

library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20), 
                    x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise, 
      x2 = weighted.mean(x2, weights))

To do this for x as well, just add that line to be passed into the summarise function:

ddply(frame, .(class), summarise, 
      x = weighted.mean(x, weights),
      x2 = weighted.mean(x2, weights))

Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to original form. Here's an example.

That would give:

wmean.vars <- c("x", "x2")

ddply(frame, .(class), function(x)
      colwise(weighted.mean, w = x$weights)(x[wmean.vars]))

Finally, if you don't like having to specify wmean.vars, you can also do:

ddply(frame, .(class), function(x)
      numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))

which will compute a weighted-average for every numerical field, excluding the weights themselves.

Display weighted mean by group in the data.frame

If we use mutate, then we can avoid the left_join

library(dplyr)
df %>%
   group_by(education) %>% 
   mutate(weighted_income = weighted.mean(income, weight))
#    obs income education weight weighted_income
#  <int>  <int>    <fctr>  <int>           <dbl>
#1     1   1000         A     10        1166.667
#2     2   2000         B      1        1583.333
#3     3   1500         B      5        1583.333
#4     4   2000         A      2        1166.667

calculate a weighted mean by group with dplyr (and replicate other approaches)

This is very common thing that happens when package plyr is loaded because plyr::summarise can override dplyr::summarise function. Just use dplyr::summarise. It's the first thing to check if summarise outputs unexpected results.

Another way is to detach the plyr package before using dplyr:

detach("package:plyr")
library("dplyr")
df %>% group_by(B) %>%
    summarise(wm = weighted.mean(A, P))
#       B    wm
#    <dbl> <dbl>
# 1    10   1.6
# 2    20   1.8

Weighted mean of a group, where weight is from another group

First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but I am sure you could do this with fill() or across().

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))

# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#>   ID Group Year   w
#> 1  1     1 2016 100
#> 2  2     1 2016 120

# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))

# Add a column for weight
new_dat %>% 
  filter(Year>2013) %>% 
  group_by(Group, Year, Field) %>% 
  summarize(m = weighted.mean(VALUE,w)) %>%
  ungroup() 
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#>   Group  Year Field     m
#>   <dbl> <dbl> <chr> <dbl>
#> 1     1  2016 AA     13.3
#> 2     1  2016 TOTAL 111.

^{Created on 2020-11-03 by the reprex package (v0.3.0)}

R using dplyr to compute weighed statistics by group

I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:

library(dplyr)
datasets::mtcars %>% 
  group_by(gear) %>%
  summarize(n = n(),
            mpg_weighted_by_weight = sum(mpg*wt) / sum(wt),
            mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
            
            mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2))/(sum(wt)-1)),
            mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))

# A tibble: 3 x 6
   gear     n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int>                  <dbl>                        <dbl>  <dbl>        <dbl>
1     3    15                   15.6                         15.6   3.32         3.32
2     4    12                   23.6                         23.6   4.81         4.81
3     5     5                   19.7                         19.7   5.63         5.63

I wasn't familiar with the formula for weighted standard deviation, but rather cheated and relied on the formula from Hmisc::wtd.var. If you control-click on the formula name in RStudio, it shows the underlying code of the function. Most of it is error handling until the bottom:

#Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased", 
  "ML")) 
{
  # ...  skipping error handling
  sw <- sum(weights)
  # ...
  xbar <- sum(weights * x)/sw
  sum(weights * ((x - xbar)^2))/(sw - 1)
}

Using ddply in combo with weighted.mean in a for loop with dynamic variables

In general in R you don't need loops for grouping and summarising (which you would call collapsing in Stata). You can use dplyr for this type of operation:

df1  %>% 
    group_by(GEOLEV2, DHSYEAR)  %>% 
    summarise(
        across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
    )

# A tibble: 6 x 4
# Groups:   GEOLEV2 [4]
#   GEOLEV2   DHSYEAR   age highest_year_edu
#   <chr>       <dbl> <dbl>            <dbl>
# 1 768001001    1988  29.5              4
# 2 768001002    1988  32              NaN
# 3 768001002    1998  30                2.5
# 4 768001002    2013  17                2
# 5 768001006    1998  22                5
# 6 768002016    2013  25.5              4