Grouped weighted mean in R (preferably with ddply, but whatever works)
An approach with data.table:
library(data.table)
setDT(mydf)[,list(normalMean=mean(gain),
weightedMean=weighted.mean(gain, wt/sum(wt))),
by = group]
# group normalMean weightedMean
#1: a 0.03 0.01227273
#2: b 0.06 0.09272727
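A side note on the wt/sum(wt) term above: as far as I can tell it is harmless but not required, because weighted.mean divides by the sum of the weights itself. A minimal base R sketch, using made-up numbers (not the mydf from the answer), to check that:

```r
# Hypothetical toy data; these values are assumptions for illustration only.
gain <- c(0.01, 0.02, 0.06, 0.10)
wt   <- c(5, 1, 2, 8)

raw        <- weighted.mean(gain, wt)            # weights used as given
normalized <- weighted.mean(gain, wt / sum(wt))  # weights rescaled to sum to 1
by_hand    <- sum(gain * wt) / sum(wt)           # the textbook definition

# All three should agree up to floating-point tolerance.
print(all.equal(raw, normalized))
print(all.equal(raw, by_hand))
```

So inside the data.table call, weighted.mean(gain, wt) should give the same per-group result.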
group by in R, ddply with weighted.mean
Use an anonymous function:
> ddply(iris,"Species",function(X) data.frame(wmn=weighted.mean(X$Sepal.Length,
+ X$Petal.Length),
+ mn=mean(X$Sepal.Length)))
Species wmn mn
1 setosa 5.016963 5.006
2 versicolor 5.978075 5.936
3 virginica 6.641535 6.588
This computes the weighted mean of Sepal.Length (weighted by Petal.Length) as well as the unweighted mean for each species, and returns both.
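If plyr is not available, the same split-apply-combine idea can be sketched in base R. This is only an illustration of the technique, not the ddply implementation; the names groups and res are my own:

```r
# Split iris into one data frame per species.
groups <- split(iris, iris$Species)

# For each piece, compute the weighted and the plain mean of Sepal.Length.
res <- data.frame(
  Species = names(groups),
  wmn = vapply(groups,
               function(d) weighted.mean(d$Sepal.Length, d$Petal.Length),
               numeric(1)),
  mn  = vapply(groups,
               function(d) mean(d$Sepal.Length),
               numeric(1)),
  row.names = NULL
)
print(res)
```

The result should have one row per species with the same wmn and mn columns as the ddply call.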
How to use ddply to get weighted-mean of class in dataframe?
You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
# Draw 20 class labels explicitly instead of relying on recycling of 5 values
frame <- data.frame(class=sample(LETTERS[1:5], 20, replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add that line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to the original form. Here's an example with colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted average for every numeric column, excluding the weights themselves.
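For comparison, here is a rough base R sketch of the same "every numeric column except the weights" idea. It is not what numcolwise does internally. The data frame is a variant of the one above with positive weights drawn by runif, which is my own assumption (rnorm weights can be negative, which makes a weighted mean hard to interpret), and value_cols and res are hypothetical names:

```r
set.seed(123)
# Assumed example data: same shape as `frame` above, but with positive weights.
frame <- data.frame(class = sample(LETTERS[1:5], 20, replace = TRUE),
                    x = rnorm(20), x2 = rnorm(20), weights = runif(20))

# All numeric columns except the weight column itself.
is_num     <- vapply(frame, is.numeric, logical(1))
value_cols <- setdiff(names(frame)[is_num], "weights")

# One output row per class, one weighted mean per value column.
res <- do.call(rbind, lapply(split(frame, frame$class), function(d) {
  data.frame(class = d$class[1],
             lapply(d[value_cols], weighted.mean, w = d$weights))
}))
print(res)
```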
Display weighted mean by group in the data.frame
If we use mutate, then we can avoid the left_join:
library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667
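To sanity-check the numbers, the same per-row column can be rebuilt in base R. The data frame below is reconstructed from the printed output above; the lookup-by-name step is just one possible way to broadcast the group value back to each row:

```r
# Data reconstructed from the output shown above.
df <- data.frame(obs = 1:4,
                 income = c(1000, 2000, 1500, 2000),
                 education = factor(c("A", "B", "B", "A")),
                 weight = c(10, 1, 5, 2))

# One weighted mean per education level (a named numeric vector).
wm <- vapply(split(df, df$education),
             function(d) weighted.mean(d$income, d$weight),
             numeric(1))

# Look each row's group up by name, which repeats the value within a group.
df$weighted_income <- unname(wm[as.character(df$education)])
print(df)
```

For group A this is (1000*10 + 2000*2) / 12 = 1166.667, and for group B it is (2000*1 + 1500*5) / 6 = 1583.333, matching the mutate result.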
calculate a weighted mean by group with dplyr (and replicate other approaches)
This commonly happens when the plyr package is loaded after dplyr, because plyr::summarise then masks dplyr::summarise and ignores the grouping. Call dplyr::summarise explicitly. It's the first thing to check if summarise returns unexpected results.
Another way is to detach the plyr package before using dplyr:
detach("package:plyr")
library("dplyr")
df %>% group_by(B) %>%
summarise(wm = weighted.mean(A, P))
# B wm
# <dbl> <dbl>
# 1 10 1.6
# 2 20 1.8
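A small sketch of the namespace-qualified variant, which should not depend on the order in which packages were attached. The data frame is hypothetical: I picked values so that the grouped weighted means come out as 1.6 and 1.8, but it is not the data from the question. The dplyr part only runs if dplyr is installed:

```r
# Hypothetical data, chosen so the group results are 1.6 and 1.8.
df <- data.frame(A = c(1, 2, 1, 2),
                 B = c(10, 10, 20, 20),
                 P = c(2, 3, 1, 4))

# Reference values from base R, independent of plyr or dplyr.
expected <- vapply(split(df, df$B),
                   function(d) weighted.mean(d$A, d$P),
                   numeric(1))
print(expected)

if (requireNamespace("dplyr", quietly = TRUE)) {
  grouped <- dplyr::group_by(df, B)
  # Qualified call: always dplyr's summarise, even if plyr is attached.
  out <- dplyr::summarise(grouped, wm = weighted.mean(A, P))
  print(out)
}
```

To see which attached packages define summarise in a given session, find("summarise") is a quick check.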
Weighted mean of a group, where weight is from another group
First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but I am sure you could also do this with fill() or across().
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))
# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120
# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))
# Compute the weighted mean of VALUE by group, using the joined weights
new_dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,w)) %>%
ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.
Created on 2020-11-03 by the reprex package (v0.3.0)
R using dplyr to compute weighted statistics by group
I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:
library(dplyr)
datasets::mtcars %>%
group_by(gear) %>%
summarize(n = n(),
mpg_weighted_by_weight = sum(mpg*wt) / sum(wt),
mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2))/(sum(wt)-1)),
mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))
# A tibble: 3 x 6
gear n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 3 15 15.6 15.6 3.32 3.32
2 4 12 23.6 23.6 4.81 4.81
3 5 5 19.7 19.7 5.63 5.63
I wasn't familiar with the formula for weighted standard deviation, so I cheated and relied on the formula from Hmisc::wtd.var. If you control-click on the function name in RStudio, it shows the underlying code of the function. Most of it is error handling until the bottom:
#Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased",
"ML"))
{
# ... skipping error handling
sw <- sum(weights)
# ...
xbar <- sum(weights * x)/sw
sum(weights * ((x - xbar)^2))/(sw - 1)
}
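The core of that excerpt can be written as a standalone base R function. This is a sketch of only the lines shown above, not a replacement for Hmisc::wtd.var (it skips the error handling, na.rm, normwt and the "ML" method), and the name wtd_var_sketch is my own:

```r
# Weighted variance following the excerpt: sum(w * (x - xbar)^2) / (sum(w) - 1).
# Assumes no missing values and weights that sum to more than 1.
wtd_var_sketch <- function(x, w) {
  sw   <- sum(w)
  xbar <- sum(w * x) / sw           # weighted mean
  sum(w * (x - xbar)^2) / (sw - 1)  # "unbiased"-style denominator
}

x <- mtcars$mpg

# With all weights equal to 1 the formula reduces to the usual sample variance.
print(all.equal(wtd_var_sketch(x, rep(1, length(x))), var(x)))

# Weighted standard deviation of mpg, weighted by wt, over the whole data set.
print(sqrt(wtd_var_sketch(x, mtcars$wt)))
```

Note the sum(w) - 1 denominator: it treats the weights like counts, so the result changes if the weights are rescaled.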
Using ddply in combo with weighted.mean in a for loop with dynamic variables
In general in R you don't need loops for grouping and summarising (which you would call collapsing in Stata). You can use dplyr for this type of operation:
df1 %>%
group_by(GEOLEV2, DHSYEAR) %>%
summarise(
across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
)
# A tibble: 6 x 4
# Groups: GEOLEV2 [4]
# GEOLEV2 DHSYEAR age highest_year_edu
# <chr> <dbl> <dbl> <dbl>
# 1 768001001 1988 29.5 4
# 2 768001002 1988 32 NaN
# 3 768001002 1998 30 2.5
# 4 768001002 2013 17 2
# 5 768001006 1998 22 5
# 6 768002016 2013 25.5 4
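The NaN in the output above is probably a group where every value of that column is missing. A quick base R sketch of how weighted.mean treats missing values with na.rm = TRUE (the numbers are made up):

```r
# NA in x is dropped together with its weight.
a <- weighted.mean(c(10, NA), c(1, 3), na.rm = TRUE)
print(a)  # 10

# If every x is NA, nothing is left and the result is 0/0.
b <- weighted.mean(c(NA_real_, NA_real_), c(1, 3), na.rm = TRUE)
print(b)  # NaN

# na.rm does not cover the weights: an NA weight gives NA.
cc <- weighted.mean(c(10, 20), c(1, NA), na.rm = TRUE)
print(cc)  # NA
```

So if the weight column (v005 in the example) can be missing, those rows would need to be filtered out before summarising.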