Replace Na with Grouped Means in R

How to replace NA with mean by group / subset?

Not my own technique I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header=TRUE)

library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
     width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder

Edit A non plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
    for (j in which(is.na(dat[, i]))) {
        dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i],  na.rm = TRUE)
    }
}

Edit many moons later here is a data.table & dplyr approach:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)  
    )

Replace NA with grouped means in R?

I slightly changed your example, because the data frame you provided had columns of different lengths, but this should solve your problem:

First, I loaded the packages in tidyverse. Then I grouped data by month. The second pipe runs a mutate_all function so it automatically changes all columns.

library(tidyverse)

df <- tibble(x1 = c(13, NA, 16, 17, 16, 12), x2 = c(1, 4, 3, 5, NA, 4),
             month = c(1, 1, 1, 2, 2, 2))

new_df <- df %>%  group_by(month) %>%
  mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE),.)))

Let me know if this is of any help.

Use tidyverse to replace NA with mean of data, by group

Use mutate and dplyr::case_when instead of summarise :

dat1 %>%
    group_by(id, operator) %>%
    mutate(nummos2 = case_when(is.na(nummos) ~ mean(nummos, na.rm=TRUE),
                               TRUE ~ as.numeric(nummos) 
                              )
           )

How to replace NAs by group mean?

There are many ways to do that in R. For example

# generate dataframe with some interval vaulues
df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
# add a steps column
df1$steps <- 1:nrow(df)
# copy the dataframe
df2 <- df1
# replace some steps values with missings in df1
df1$steps[c(1,2,5, 14)] <- NA

# sapply goes thru every unique interval...
sapply(df1$interval, function(interval_i){
  # replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
  df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
# you must use <<- not <- to assign it to df outside of sapply
})

Replacing NAs in R dataframe with mean based on group and apply the same to multiple columns

You can use :

library(dplyr)

df %>%
  group_by(group) %>%
  mutate(across(WC:We, NAs_replace_function)) %>%
  ungroup -> df_after_imputation

df_after_imputation

#  group    WC Sixltr   Dic     I    We
#  <chr> <dbl>  <dbl> <dbl> <dbl> <dbl>
#1 A       2.3    3.3   0     3.1   4.6
#2 A       2.3    3.3   0     3     4.6
#3 B       3.5    2.7   2.4   2.7   2.2
#4 B       4      2.7   2.4   1.9   2.2

Replace NA with mean of variable grouped by time and treatment

I think I would just use indexing in base R for this:

within(df, {A1[is.na(A1) & time == 0] <- mean(A1[trt == "2" & time == 0])
            B1[is.na(B1) & time == 0] <- mean(B1[trt == "2" & time == 0])})
#> # A tibble: 24 x 4
#>     time trt      A1    B1
#>    <dbl> <fct> <dbl> <dbl>
#>  1     0 2      6.30  5.73
#>  2     0 2      5.43  5.73
#>  3     0 2      5.60  5.45
#>  4     0 1      5.78  5.63
#>  5     0 1      5.78  5.63
#>  6     0 1      5.78  5.63
#>  7    14 2      6.17  6.60
#>  8    14 2      6.43  7.03
#>  9    14 2      6.82  7.12
#> 10    14 1      2.30  3.03
#> # ... with 14 more rows

^{Created on 2020-05-15 by the reprex package (v0.3.0)}

replace NA value with the group value

Try ave. It applies a function to groups. Have a look at ?ave for details, e.g.:

df$med_card_new <- ave(df$med_card, df$hhold_no, FUN=function(x)unique(x[!is.na(x)]))

#   person_id hhold_no med_card med_card_new
#1          1        1        1            1
#2          2        1        1            1
#3          3        1       NA            1
#4          4        1       NA            1
#5          5        1       NA            1
#6          6        2        0            0
#7          7        2        0            0
#8          8        2        0            0
#9          9        2        0            0

Please note that this will only work if not all values in a household are NA and the should not differ (e.g. person 1 == 1, person 2 == 0).

replace NA with mean of column groups

Here is another solution using reshape from base R, an often forgotten function with amazing power.

x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(y){
  y[is.na(y)] = mean(y, na.rm = T)
  return(y)
})
x3 = reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time', 
 sep = "")

Here is how it works. First, we reshape the data to long format, where a and b become columns and the years become rows. Second, we replace NAs in columns a and b with their respective means. Finally, we reshape the data back to the wide format. reshape is a confusing function, but working through the examples on the help page will get you up to speed.

EDIT

To reorder columns, you can do

x3[,names(x)]

To replace the rownames, you can do

rownames(x3) = 1:NROW(x3)

Replace NA with sample() by group

We can write a function if there is NA value in LENGTH and number of rows in group is more than 1, select random value from the group excluding NA values. If the number of values in group is just 1 then select random value from the entire data excluding NA values. We apply the function for both the columns LENGTH and NUMPOINTS.

replace_func <- function(x, y) {
    inds <- is.na(x)
    if (length(x) > 1 & any(inds)) {
     x[inds] <- sample(x[!inds], sum(inds))
     x
    }
    else if(any(inds)) {
       x[inds] <- sample(y[!is.na(y)], 1)
       x
    } else x
}

library(dplyr) 

exam %>%
  group_by(Outlet) %>%
  mutate(nuLENGTH = replace_func(LENGTH, .$LENGTH), 
         nuNUMPOINTS = replace_func(NUMPOINTS, .$NUMPOINTS)) 

#  LENGTH NUMPOINTS CTRY_      Outlet     nuLENGTH nuNUMPOINTS
#     <dbl>     <dbl> <fct>      <fct>         <dbl>       <dbl>
# 1     NA         NA WCY_____ES ESO0244476    7046.          19
# 2  45447.         0 WCY_____FR DSO0213034   45447.           0
# 3 161038.         8 WCY_____FR DSO0213034  161038.           8
# 4  78148.         5 WCY_____FR DSO0213034   78148.           5
# 5   7193.         2 WCY_____FR DSO0045543    7193.           2
# 6   1572.         0 WCY_____FR DSO0045543    1572.           0
# 7  18191.         9 WCY_____FR DSO0045543   18191.           9
# 8  20366.         0 WCY_____FR DSO0045543   20366.           0
# 9  10015.         0 WCY_____FR DSO0045543   10015.           0
#10   1404.         0 WCY_____FR DSO0045543    1404.           0
# … with 32 more rows

replace NA with groups mean in a non specified number of columns

If you don't mind using dplyr:

library(dplyr)

dat %>% 
  group_by(ID) %>% 
  mutate_if(is.numeric, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
#> # A tibble: 7 x 5
#> # Groups:   ID [2]
#>      id         ID length width extra
#>   <int>     <fctr>  <dbl> <dbl> <dbl>
#> 1   101 collembola    2.1  0.90     1
#> 2   102       mite    1.5  0.70     3
#> 3   103       mite    1.1  0.80     2
#> 4   104 collembola    1.0  0.70     3
#> 5   105 collembola    1.5  0.50     4
#> 6   106       mite    1.5  0.75     3
#> 7   106       mite    1.9  0.75     4