Calculate Group Mean While Excluding Current Observation Using Dplyr

There is no need to define a custom function. Instead, sum all elements of the group, subtract the current value, and divide by the number of elements in the group minus 1.

library(dplyr)

df %>%
  group_by(grouping) %>%
  # leave-one-out mean: (group sum - own value) / (group size - 1)
  mutate(special_mean = (sum(value) - value) / (n() - 1))
# grouping value special_mean
# (chr) (int) (dbl)
#1 A 1 8.5
#2 A 6 6.0
#3 A 11 3.5
#4 B 2 9.5
#5 B 7 7.0
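
As a sanity check, the closed-form expression can be compared against a brute-force version that really drops each element in turn. This is only a sketch in base R (using ave rather than dplyr, so it runs without extra packages). The data frame below is hypothetical: it is chosen to resemble the output above, and the sixth value (12) is an assumption, not something shown in the original.

```r
# Hypothetical data, chosen to resemble the printed output above
df <- data.frame(
  grouping = rep(c("A", "B"), each = 3),
  value    = c(1, 6, 11, 2, 7, 12)
)

# Closed form: (group sum - own value) / (group size - 1)
loo_fast <- ave(df$value, df$grouping,
                FUN = function(x) (sum(x) - x) / (length(x) - 1))

# Brute force: drop element i and average what is left
loo_slow <- ave(df$value, df$grouping,
                FUN = function(x) sapply(seq_along(x), function(i) mean(x[-i])))

# Both approaches should agree
all.equal(loo_fast, loo_slow)
```

Note that for a group with a single row the closed form evaluates 0/0, which is NaN in R, so such groups may need separate handling.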

Calculate group variance while excluding current observation

Try the following:

library(dplyr)

DF %>%
  group_by(School) %>%
  # for each row i in the group, take the variance of all grades except the i-th
  mutate(Var_grade = purrr::map_dbl(row_number(), ~ var(grade[-.x])))

# School grade Var_grade
# <int> <dbl> <dbl>
#1 1 90 112.
#2 1 80 12.5
#3 1 95 50
#4 2 100 108.
#5 2 65 225
#6 2 70 308.
#7 2 85 358.

In base R you can use ave with sapply:

DF$Var_grade <- with(DF, ave(grade, School, FUN = function(x)
  sapply(seq_along(x), function(i) var(x[-i]))))

data

DF <- data.frame(School = rep(1:2, c(3, 4)), 
grade = c(90, 80, 95, 100, 65, 70, 85))

Exclude current observation from computation in dplyr pipe

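For the general case of removing the current observation before a calculation, you could use map_dbl. Note that indexing with day, as in price[-.x], assumes that day equals the row position within each group.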

library(dplyr)
library(purrr)
da %>%
  group_by(ice_id) %>%
  mutate(mean_price = mean(price),
         mean_price_without = map_dbl(day, ~ mean(price[-.x])))
# Or, as alternatives for the last line:
#   mean_price_without = map_dbl(day, ~ mean(price[day != .x]))
#   mean_price_without = map_dbl(row_number(), ~ mean(price[-.x]))


# ice_id day price mean_price mean_price_without
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1.6 1.77 1.85
#2 1 2 1.9 1.77 1.7
#3 1 3 1.8 1.77 1.75
#4 2 1 2.1 2.15 2.17
#5 2 2 2.05 2.15 2.2
#6 2 3 2.3 2.15 2.08
#7 3 1 0.5 0.417 0.375
#8 3 2 0.4 0.417 0.425
#9 3 3 0.35 0.417 0.45
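
The difference between indexing by the day value and indexing by row position only shows up when the days are not 1, 2, 3, and so on. A minimal base R sketch of that situation follows; the day values are hypothetical and sapply stands in for map_dbl so that no packages are needed.

```r
price <- c(1.6, 1.9, 1.8)
day   <- c(2, 3, 5)  # hypothetical: days that do not match row positions

# Indexing by the day value drops whichever element sits at that position;
# an out-of-range negative index such as -5 drops nothing at all
by_value <- sapply(day, function(d) mean(price[-d]))

# Indexing by position always drops the current row
by_position <- sapply(seq_along(price), function(i) mean(price[-i]))

by_value
by_position
```

When in doubt, the row_number() variant (or seq_along in base R) is the safer of the alternatives, since it does not depend on the contents of day.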

Get group mean with multiple grouping variables and excluding own group value


library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n() - 1))

#> # A tibble: 12 x 5
#> # Groups: state, year [4]
#> state county year value q
#> <chr> <chr> <int> <int> <dbl>
#> 1 AL a 2011 68 30.5
#> 2 AL a 2012 63 42
#> 3 AL b 2011 53 38
#> 4 AL b 2012 56 45.5
#> 5 AL c 2011 8 60.5
#> 6 AL c 2012 28 59.5
#> 7 CA d 2011 7 40
#> 8 CA d 2012 69 41
#> 9 CA e 2011 39 24
#> 10 CA e 2012 79 36
#> 11 CA f 2011 41 23
#> 12 CA f 2012 3 74

Data:

# data_frame() is deprecated; use tibble() instead
df <- tibble(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12) # random draw: no seed is set, so values will differ between runs
)

Taking group means, excluding the observation itself (and dealing with NA's)

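If I understand correctly, this should do what you're looking for:

library(data.table)

# DF must be a data.table; call setDT(DF) first if it is a data.frame
DF[, mean_value := (sum(value, na.rm = TRUE) - value) / (sum(!is.na(value)) - !is.na(value)),
   by = c("iso", "year")]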

A B D value iso year mean_value
1: 0 1 1 NA ECU 2009 NA
2: 1 0 2 1 ECU 2009 2.0
3: 1 0 1 2 ECU 2009 1.0
4: 0 0 3 1 BRA 2011 0.5
5: 1 0 4 0 BRA 2011 1.0
6: 0 0 3 1 BRA 2011 0.5
7: 0 1 7 NA ECU 2008 NA
8: 1 0 1 1 ECU 2008 1.0
9: 1 0 1 1 ECU 2008 1.0
10: 0 0 3 2 BRA 2012 2.0
11: 0 0 3 2 BRA 2012 2.0
12: 1 0 4 NA BRA 2012 NA

Note: you may also want to handle edge cases such as a group with only one non-missing value. For that row the denominator is zero and the result is NaN (0/0).
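
One way to guard against that edge case is to wrap the same arithmetic in a small helper that returns NA when there are no other non-missing values to average. This is only a sketch: loo_mean_na is a hypothetical name, not a function from data.table or any other package.

```r
# Hypothetical helper: leave-one-out mean that ignores NAs in the other
# observations and returns NA when nothing else is left to average
loo_mean_na <- function(x) {
  ok <- !is.na(x)
  n_others <- sum(ok) - ok            # non-missing values other than the current one
  total <- sum(x, na.rm = TRUE)
  out <- (total - x) / n_others       # rows where x is NA stay NA
  out[n_others == 0] <- NA_real_      # avoid NaN from 0/0
  out
}

loo_mean_na(c(NA, 1, 2))
loo_mean_na(5)

# Possible use with data.table, assuming the column names used above:
# DF[, mean_value := loo_mean_na(value), by = c("iso", "year")]
```

Rows whose own value is NA still get NA here, which matches the output shown above. Whether that is the desired behaviour depends on the question being asked.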

Compute mean excluding current value

I am not sure your expected output is correct for group 1, but you can do:

library(data.table)

setDT(df)[, avg2 := (sum(b) - b) / (.N - 1), by = a]
df

# a b avg avg2
#1: 1 7 3 1.0
#2: 1 0 3 4.5
#3: 1 2 3 3.5
#4: 2 1 2 3.0
#5: 2 3 2 1.0

Calculate standard deviation by group excluding current observation in R

One option is to use dplyr with mapply. mapply iterates over the row positions within each group, and each sd call excludes the current row.

library(dplyr)

df %>%
  group_by(country) %>%
  mutate(Sp_SD = mapply(function(x) sd(weight[-x]), 1:n()))


# # A tibble: 6 x 3
# # Groups: country [2]
# country weight Sp_SD
# <fctr> <dbl> <dbl>
# 1 A 10.0 0.707
# 2 A 11.0 1.41
# 3 A 12.0 0.707
# 4 B 20.0 3.54
# 5 B 25.0 7.07
# 6 B 30.0 3.54
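
The mean, variance and standard deviation answers all follow the same pattern, so it can be factored into a small helper that takes the summary function as an argument. A minimal sketch in base R, where leave_one_out is a hypothetical name introduced only for illustration:

```r
# Hypothetical helper: apply summary function f to x with element i removed,
# for every position i; extra arguments (e.g. na.rm = TRUE) are passed on to f
leave_one_out <- function(x, f, ...) {
  vapply(seq_along(x), function(i) f(x[-i], ...), numeric(1))
}

weight <- c(10, 11, 12)       # the weights of country A from the output above
leave_one_out(weight, sd)
leave_one_out(weight, mean)

# Inside a grouped dplyr pipeline this could be used as, for example:
# df %>% group_by(country) %>% mutate(Sp_SD = leave_one_out(weight, sd))
```

This evaluates f once per row, so it is slower than the closed-form sum trick for the mean, but it works for any summary that returns a single number.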

dplyr mutate: Excluding observations similar to the current one

You can likely do this more succinctly, but this will get you the result.

First add columns holding the total number of observations and the total sum of Y for the whole data.frame. Then group by the X column and compute the same two quantities per group. Subtracting the group values from the totals gives the sum and count of all other groups, and their ratio is the mean.

data

df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
Y = c(1:6))

solution

library(tidyverse)
df %>%
  mutate(total_sum = sum(Y),
         total_obs = n()) %>%
  group_by(X) %>%
  mutate(group_sum = sum(Y),
         group_obs = n()) %>%
  ungroup() %>%
  mutate(other_group_sum = total_sum - group_sum,
         other_group_obs = total_obs - group_obs,
         other_mean = other_group_sum / other_group_obs) %>%
  select(X, Y, other_mean)

result

# A tibble: 6 x 3
X Y other_mean
<fct> <int> <dbl>
1 A 1 4.50
2 A 2 4.50
3 B 3 3.50
4 B 4 3.50
5 C 5 2.50
6 C 6 2.50
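
For comparison, a shorter base R sketch of the same idea is possible, using ave to get the per-group sum and count. It uses the data from this answer and is meant as an illustration of the arithmetic, not as a replacement for the pipeline above.

```r
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                 Y = c(1:6))

group_sum <- ave(df$Y, df$X, FUN = sum)     # sum of Y within each row's own group
group_obs <- ave(df$Y, df$X, FUN = length)  # size of each row's own group

# Mean of every observation that belongs to a different group
df$other_mean <- (sum(df$Y) - group_sum) / (nrow(df) - group_obs)
df
```

If there is only one group in the data, the denominator is zero and the result is NaN, so that case would need its own handling.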

