How to replace NA with mean by group / subset?
Not my own technique; I saw it on the boards a while back:
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header=TRUE)
library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
width = impute.mean(width))
dat2[order(dat2$id), ] #plyr orders by group so we have to reorder
Edit: a non-plyr approach with a for loop:
for (i in which(sapply(dat, is.numeric))) {
for (j in which(is.na(dat[, i]))) {
dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i], na.rm = TRUE)
}
}
Edit: many moons later, here are data.table and dplyr approaches:
data.table
library(data.table)
setDT(dat)
dat[, length := impute.mean(length), by = taxa][,
width := impute.mean(width), by = taxa]
dplyr
library(dplyr)
dat %>%
group_by(taxa) %>%
mutate(
length = impute.mean(length),
width = impute.mean(width)
)
Replace NA with grouped means in R?
I slightly changed your example, because the data frame you provided had columns of different lengths, but this should solve your problem:
First, I loaded the packages in tidyverse. Then I grouped data by month. The second pipe runs a mutate_all function so it automatically changes all columns.
library(tidyverse)
df <- tibble(x1 = c(13, NA, 16, 17, 16, 12), x2 = c(1, 4, 3, 5, NA, 4),
month = c(1, 1, 1, 2, 2, 2))
new_df <- df %>% group_by(month) %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
mutate_all(~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
Let me know if this is of any help.
Use tidyverse to replace NA with mean of data, by group
Use mutate and dplyr::case_when instead of summarise:
dat1 %>%
group_by(id, operator) %>%
mutate(nummos2 = case_when(is.na(nummos) ~ mean(nummos, na.rm=TRUE),
TRUE ~ as.numeric(nummos)
)
)
How to replace NAs by group mean?
There are many ways to do that in R. For example
# generate dataframe with some interval values
df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
# add a steps column
df1$steps <- 1:nrow(df1)
# copy the dataframe
df2 <- df1
# replace some steps values with missings in df1
df1$steps[c(1,2,5, 14)] <- NA
# sapply goes through every unique interval...
sapply(unique(df1$interval), function(interval_i){
# replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
# you must use <<- not <- to assign to df1 outside of sapply
})
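As an alternative to the `<<-` pattern, here is a minimal base R sketch that computes the group means once with `tapply` and looks them up by name. Note one deliberate difference, which is an assumption on my part: it averages the observed (non-missing) values in `df1` itself instead of using a complete copy such as `df2`, so the filled-in numbers differ from the answer above. The names `group_means` and `missing` are only for illustration.

```r
# Same toy data as above
df1 <- data.frame(interval = rep(seq(0, 25, 5), 5))
df1$steps <- seq_len(nrow(df1))
df1$steps[c(1, 2, 5, 14)] <- NA

# One mean per interval, computed from the non-missing values only
group_means <- tapply(df1$steps, df1$interval, mean, na.rm = TRUE)

# Look up each missing row's group mean by the interval's name
missing <- is.na(df1$steps)
df1$steps[missing] <- unname(group_means[as.character(df1$interval[missing])])
```

No assignment to the enclosing environment is needed, because nothing is modified from inside a function.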
Replacing NAs in R dataframe with mean based on group and apply the same to multiple columns
You can use:
library(dplyr)
df %>%
group_by(group) %>%
mutate(across(WC:We, NAs_replace_function)) %>%
ungroup -> df_after_imputation
df_after_imputation
# group WC Sixltr Dic I We
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2.3 3.3 0 3.1 4.6
#2 A 2.3 3.3 0 3 4.6
#3 B 3.5 2.7 2.4 2.7 2.2
#4 B 4 2.7 2.4 1.9 2.2
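The snippet above calls `NAs_replace_function` without showing its definition. A minimal sketch of what such a helper could look like follows; the body is an assumption, not the original author's code. It replaces NA with the mean of the non-missing values and leaves an all-NA vector untouched, since there is nothing to average in that case.

```r
# Hypothetical helper: the name matches the call above, the body is assumed
NAs_replace_function <- function(x) {
  if (all(is.na(x))) return(x)  # no observed values: leave the group as-is
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

NAs_replace_function(c(2.3, NA))       # the NA is filled with 2.3
NAs_replace_function(c(NA_real_, NA))  # stays all NA
```

Inside `group_by(group) %>% mutate(across(...))` the function receives one group's values at a time, so the mean is taken per group.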
Replace NA with mean of variable grouped by time and treatment
I think I would just use indexing in base R for this:
within(df, {A1[is.na(A1) & time == 0] <- mean(A1[trt == "2" & time == 0])
B1[is.na(B1) & time == 0] <- mean(B1[trt == "2" & time == 0])})
#> # A tibble: 24 x 4
#> time trt A1 B1
#> <dbl> <fct> <dbl> <dbl>
#> 1 0 2 6.30 5.73
#> 2 0 2 5.43 5.73
#> 3 0 2 5.60 5.45
#> 4 0 1 5.78 5.63
#> 5 0 1 5.78 5.63
#> 6 0 1 5.78 5.63
#> 7 14 2 6.17 6.60
#> 8 14 2 6.43 7.03
#> 9 14 2 6.82 7.12
#> 10 14 1 2.30 3.03
#> # ... with 14 more rows
Created on 2020-05-15 by the reprex package (v0.3.0)
replace NA value with the group value
Try ave. It applies a function to groups. Have a look at ?ave for details, e.g.:
df$med_card_new <- ave(df$med_card, df$hhold_no, FUN=function(x)unique(x[!is.na(x)]))
# person_id hhold_no med_card med_card_new
#1 1 1 1 1
#2 2 1 1 1
#3 3 1 NA 1
#4 4 1 NA 1
#5 5 1 NA 1
#6 6 2 0 0
#7 7 2 0 0
#8 8 2 0 0
#9 9 2 0 0
Please note that this only works if at least one value in each household is not NA, and the non-NA values within a household must not differ (e.g. person 1 == 1, person 2 == 0).
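To make those two caveats explicit, here is a hedged sketch of a more defensive variant. The helper name `fill_if_unique` is hypothetical; it fills NAs only when a household has exactly one distinct non-NA value, and otherwise returns the values unchanged. The data frame is rebuilt from the output shown above.

```r
# Hypothetical helper: fill NAs only when the group agrees on a single value
fill_if_unique <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 1) x[is.na(x)] <- vals
  x  # all-NA or conflicting groups come back unchanged
}

# Data reconstructed from the printed output above
df <- data.frame(
  person_id = 1:9,
  hhold_no  = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  med_card  = c(1, 1, NA, NA, NA, 0, 0, 0, 0)
)
df$med_card_new <- ave(df$med_card, df$hhold_no, FUN = fill_if_unique)
```

With this version a household whose members disagree keeps its NAs instead of raising an error, which may or may not be what you want.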
replace NA with mean of column groups
Here is another solution using reshape from base R, an often forgotten function with amazing power.
x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(y){
y[is.na(y)] = mean(y, na.rm = T)
return(y)
})
x3 = reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time',
sep = "")
Here is how it works. First, we reshape the data to long format, where a and b become columns and the years become rows. Second, we replace NAs in columns a and b with their respective means. Finally, we reshape the data back to the wide format. reshape is a confusing function, but working through the examples on the help page will get you up to speed.
EDIT
To reorder columns, you can do
x3[,names(x)]
To replace the rownames, you can do
rownames(x3) = 1:NROW(x3)
Replace NA with sample() by group
We can write a function that handles two cases. If a group has more than one row and the column contains NA values, replace them with values sampled at random from the non-NA values of that group. If the group has only one row, sample a value from the non-NA values of the entire data instead. We apply the function to both columns, LENGTH and NUMPOINTS.
replace_func <- function(x, y) {
inds <- is.na(x)
if (length(x) > 1 && any(inds)) {
x[inds] <- sample(x[!inds], sum(inds))
x
}
else if(any(inds)) {
x[inds] <- sample(y[!is.na(y)], 1)
x
} else x
}
library(dplyr)
exam %>%
group_by(Outlet) %>%
mutate(nuLENGTH = replace_func(LENGTH, .$LENGTH),
nuNUMPOINTS = replace_func(NUMPOINTS, .$NUMPOINTS))
# LENGTH NUMPOINTS CTRY_ Outlet nuLENGTH nuNUMPOINTS
# <dbl> <dbl> <fct> <fct> <dbl> <dbl>
# 1 NA NA WCY_____ES ESO0244476 7046. 19
# 2 45447. 0 WCY_____FR DSO0213034 45447. 0
# 3 161038. 8 WCY_____FR DSO0213034 161038. 8
# 4 78148. 5 WCY_____FR DSO0213034 78148. 5
# 5 7193. 2 WCY_____FR DSO0045543 7193. 2
# 6 1572. 0 WCY_____FR DSO0045543 1572. 0
# 7 18191. 9 WCY_____FR DSO0045543 18191. 9
# 8 20366. 0 WCY_____FR DSO0045543 20366. 0
# 9 10015. 0 WCY_____FR DSO0045543 10015. 0
#10 1404. 0 WCY_____FR DSO0045543 1404. 0
# … with 32 more rows
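The function above can still fail in a few edge cases: a group with more NAs than observed values (sampling without replacement), a group of several rows that are all NA, and the well-known `sample()` quirk where a single number `n` is treated as `1:n`. The following is a sketch of one way to guard against these, not the original answer's method. The name `sample_fill` is hypothetical, and it falls back to the overall pool whenever the group has no observed values, which is a slightly different rule from the one described above.

```r
# Hypothetical variant: sample donors by index, with replacement
sample_fill <- function(x, pool) {
  inds <- is.na(x)
  if (!any(inds)) return(x)
  donors <- x[!inds]
  # No observed values in the group: fall back to the whole column
  if (length(donors) == 0) donors <- pool[!is.na(pool)]
  # Nothing to sample from anywhere: leave the NAs in place
  if (length(donors) == 0) return(x)
  # sample.int on positions avoids sample()'s special case for a single number
  x[inds] <- donors[sample.int(length(donors), sum(inds), replace = TRUE)]
  x
}

set.seed(1)
sample_fill(c(NA, 5, NA), pool = c(1, 2, 3))  # both NAs become 5
sample_fill(NA_real_, pool = c(7, NA))        # becomes 7
```

It can be dropped into the same `mutate()` call in place of `replace_func`.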
replace NA with groups mean in a non specified number of columns
If you don't mind using dplyr:
library(dplyr)
dat %>%
group_by(ID) %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
#> # A tibble: 7 x 5
#> # Groups: ID [2]
#> id ID length width extra
#> <int> <fctr> <dbl> <dbl> <dbl>
#> 1 101 collembola 2.1 0.90 1
#> 2 102 mite 1.5 0.70 3
#> 3 103 mite 1.1 0.80 2
#> 4 104 collembola 1.0 0.70 3
#> 5 105 collembola 1.5 0.50 4
#> 6 106 mite 1.5 0.75 3
#> 7 106 mite 1.9 0.75 4
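If you would rather avoid dplyr, the same idea can be sketched in base R with `ave()` applied to every numeric column. The data below is a small toy example made up for illustration (it is not the asker's data), and `impute_mean` and `num_cols` are names I chose for the sketch.

```r
# Toy data, assumed for illustration only
dat <- data.frame(
  id     = 101:106,
  ID     = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# All numeric columns except the row identifier, however many there are
num_cols <- setdiff(names(dat)[vapply(dat, is.numeric, logical(1))], "id")

# ave() applies impute_mean within each level of ID and keeps the row order
dat[num_cols] <- lapply(dat[num_cols], function(col) ave(col, dat$ID, FUN = impute_mean))
```

Because the numeric columns are detected at run time, the code does not need to change when columns are added.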