Replace Na with Grouped Means in R

How to replace NA with mean by group / subset?

Not my own technique I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header=TRUE)

library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder

Edit A non plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
for (j in which(is.na(dat[, i]))) {
dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i], na.rm = TRUE)
}
}

Edit many moons later here is a data.table & dplyr approach:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
group_by(taxa) %>%
mutate(
length = impute.mean(length),
width = impute.mean(width)
)

Replace NA with grouped means in R?

I slightly changed your example, because the data frame you provided had columns of different lengths, but this should solve your problem:

First, I loaded the packages in tidyverse. Then I grouped data by month. The second pipe runs a mutate_all function so it automatically changes all columns.

library(tidyverse)

df <- tibble(x1 = c(13, NA, 16, 17, 16, 12), x2 = c(1, 4, 3, 5, NA, 4),
month = c(1, 1, 1, 2, 2, 2))

new_df <- df %>% group_by(month) %>%
mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE),.)))

Let me know if this is of any help.

Use tidyverse to replace NA with mean of data, by group

Use mutate and dplyr::case_when instead of summarise :

dat1 %>%
group_by(id, operator) %>%
mutate(nummos2 = case_when(is.na(nummos) ~ mean(nummos, na.rm=TRUE),
TRUE ~ as.numeric(nummos)
)
)

How to replace NAs by group mean?

There are many ways to do that in R. For example

# generate dataframe with some interval vaulues
df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
# add a steps column
df1$steps <- 1:nrow(df)
# copy the dataframe
df2 <- df1
# replace some steps values with missings in df1
df1$steps[c(1,2,5, 14)] <- NA

# sapply goes thru every unique interval...
sapply(df1$interval, function(interval_i){
# replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
# you must use <<- not <- to assign it to df outside of sapply
})

Replacing NAs in R dataframe with mean based on group and apply the same to multiple columns

You can use :

library(dplyr)

df %>%
group_by(group) %>%
mutate(across(WC:We, NAs_replace_function)) %>%
ungroup -> df_after_imputation

df_after_imputation

# group WC Sixltr Dic I We
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2.3 3.3 0 3.1 4.6
#2 A 2.3 3.3 0 3 4.6
#3 B 3.5 2.7 2.4 2.7 2.2
#4 B 4 2.7 2.4 1.9 2.2

Replace NA with mean of variable grouped by time and treatment

I think I would just use indexing in base R for this:

within(df, {A1[is.na(A1) & time == 0] <- mean(A1[trt == "2" & time == 0])
B1[is.na(B1) & time == 0] <- mean(B1[trt == "2" & time == 0])})
#> # A tibble: 24 x 4
#> time trt A1 B1
#> <dbl> <fct> <dbl> <dbl>
#> 1 0 2 6.30 5.73
#> 2 0 2 5.43 5.73
#> 3 0 2 5.60 5.45
#> 4 0 1 5.78 5.63
#> 5 0 1 5.78 5.63
#> 6 0 1 5.78 5.63
#> 7 14 2 6.17 6.60
#> 8 14 2 6.43 7.03
#> 9 14 2 6.82 7.12
#> 10 14 1 2.30 3.03
#> # ... with 14 more rows

Created on 2020-05-15 by the reprex package (v0.3.0)

replace NA value with the group value

Try ave. It applies a function to groups. Have a look at ?ave for details, e.g.:

df$med_card_new <- ave(df$med_card, df$hhold_no, FUN=function(x)unique(x[!is.na(x)]))

# person_id hhold_no med_card med_card_new
#1 1 1 1 1
#2 2 1 1 1
#3 3 1 NA 1
#4 4 1 NA 1
#5 5 1 NA 1
#6 6 2 0 0
#7 7 2 0 0
#8 8 2 0 0
#9 9 2 0 0

Please note that this will only work if not all values in a household are NA and the should not differ (e.g. person 1 == 1, person 2 == 0).

replace NA with mean of column groups

Here is another solution using reshape from base R, an often forgotten function with amazing power.

x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(y){
y[is.na(y)] = mean(y, na.rm = T)
return(y)
})
x3 = reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time',
sep = "")

Here is how it works. First, we reshape the data to long format, where a and b become columns and the years become rows. Second, we replace NAs in columns a and b with their respective means. Finally, we reshape the data back to the wide format. reshape is a confusing function, but working through the examples on the help page will get you up to speed.

EDIT

To reorder columns, you can do

x3[,names(x)]

To replace the rownames, you can do

rownames(x3) = 1:NROW(x3)

Replace NA with sample() by group

We can write a function if there is NA value in LENGTH and number of rows in group is more than 1, select random value from the group excluding NA values. If the number of values in group is just 1 then select random value from the entire data excluding NA values. We apply the function for both the columns LENGTH and NUMPOINTS.

replace_func <- function(x, y) {
inds <- is.na(x)
if (length(x) > 1 & any(inds)) {
x[inds] <- sample(x[!inds], sum(inds))
x
}
else if(any(inds)) {
x[inds] <- sample(y[!is.na(y)], 1)
x
} else x
}

library(dplyr)

exam %>%
group_by(Outlet) %>%
mutate(nuLENGTH = replace_func(LENGTH, .$LENGTH),
nuNUMPOINTS = replace_func(NUMPOINTS, .$NUMPOINTS))

# LENGTH NUMPOINTS CTRY_ Outlet nuLENGTH nuNUMPOINTS
# <dbl> <dbl> <fct> <fct> <dbl> <dbl>
# 1 NA NA WCY_____ES ESO0244476 7046. 19
# 2 45447. 0 WCY_____FR DSO0213034 45447. 0
# 3 161038. 8 WCY_____FR DSO0213034 161038. 8
# 4 78148. 5 WCY_____FR DSO0213034 78148. 5
# 5 7193. 2 WCY_____FR DSO0045543 7193. 2
# 6 1572. 0 WCY_____FR DSO0045543 1572. 0
# 7 18191. 9 WCY_____FR DSO0045543 18191. 9
# 8 20366. 0 WCY_____FR DSO0045543 20366. 0
# 9 10015. 0 WCY_____FR DSO0045543 10015. 0
#10 1404. 0 WCY_____FR DSO0045543 1404. 0
# … with 32 more rows

replace NA with groups mean in a non specified number of columns

If you don't mind using dplyr:



library(dplyr)

dat %>%
group_by(ID) %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
#> # A tibble: 7 x 5
#> # Groups: ID [2]
#> id ID length width extra
#> <int> <fctr> <dbl> <dbl> <dbl>
#> 1 101 collembola 2.1 0.90 1
#> 2 102 mite 1.5 0.70 3
#> 3 103 mite 1.1 0.80 2
#> 4 104 collembola 1.0 0.70 3
#> 5 105 collembola 1.5 0.50 4
#> 6 106 mite 1.5 0.75 3
#> 7 106 mite 1.9 0.75 4


Related Topics



Leave a reply



Submit