Replacing NAs in R with Nearest Value

Replacing NAs in R with nearest value

Here is a very fast one. It uses findInterval to find, for each NA in your original data, the two non-NA positions (one on each side) that should be considered:

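f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  # nothing to fill if there are no NAs, nothing to fill from if all are NA
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # for each NA, the interval of non-NA positions it is assigned to
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  # cap at the number of non-NA positions (not N) so the index stays in range
  right.pos  <- non.na.pos[pmin(length(non.na.pos), intervals + 1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  # ties go to the left neighbour
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}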

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
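
To see what findInterval contributes, the same steps can be traced by hand on a short vector. This is only an illustrative sketch on made-up data; the names toy and filled are assumptions and not part of the original answer.

```r
# Made-up toy vector: non-NA values at positions 2, 5 and 9
toy <- c(NA, 3, NA, NA, 7, NA, NA, NA, 9, NA)

na.pos     <- which(is.na(toy))    # 1 3 4 6 7 8 10
non.na.pos <- which(!is.na(toy))   # 2 5 9

# With all.inside = TRUE every NA is assigned to one of the two
# intervals [2, 5) or [5, 9), even if it lies outside both
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)
intervals                          # 1 1 1 2 2 2 2

left.pos  <- non.na.pos[intervals]       # 2 2 2 5 5 5 5
right.pos <- non.na.pos[intervals + 1]   # 5 5 5 9 9 9 9

# A negative distance means both candidates lie on the same side of the NA
# (this happens at the two ends of the vector); the comparison then picks
# the candidate that is actually adjacent
filled <- toy
filled[na.pos] <- ifelse(na.pos - left.pos <= right.pos - na.pos,
                         toy[left.pos], toy[right.pos])
filled                             # 3 3 3 7 7 7 7 9 9 9
```

Position 4 is one step from the 7 and two steps from the 3, so it takes 7; position 7 is equidistant from 7 and 9 and the tie goes to the left.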

Replace NA with the nearest value based on another variable, while keeping NA for observations that have no non-missing neighbour

One option is to use case_when from dplyr (loaded as part of tidyverse). For each NA in x: if the previous row is not NA and is closer in year, return x from that row; if the next row is not NA and is closer in year, return x from that row. If the previous row is itself NA, return x from the next row, and if the next row is NA, return x from the previous row. A row that is not NA simply keeps its x. An NA whose two neighbours are both NA, or both non-NA and equally distant in year, stays NA.

library(tidyverse)

dat %>%
  mutate(x = case_when(
    is.na(x) & !is.na(lag(x))  & year - lag(year) < lead(year) - year ~ lag(x),
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    is.na(x) & is.na(lag(x))  ~ lead(x),
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x
  ))

Output

   year  x
1  2000  1
2  2001  2
3  2002  3
4  2003  3
5  2005  5
6  2006  5
7  2007 NA
8  2008  9
9  2009  9
10 2010 10
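
The input dat is not reproduced in this answer, so here is a small self-contained sketch of the same case_when rules on made-up data. The data frame toy and its values are assumptions chosen so that one NA is closer to the row above and one is closer to the row below.

```r
library(dplyr)

# Made-up data; the real `dat` from the question is not reproduced here
toy <- data.frame(year = c(2000, 2001, 2004, 2007, 2008),
                  x    = c(1, NA, 4, NA, 8))

res <- toy %>%
  mutate(x = case_when(
    is.na(x) & !is.na(lag(x))  & year - lag(year) < lead(year) - year ~ lag(x),
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    is.na(x) & is.na(lag(x))  ~ lead(x),
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x
  ))

res$x   # 1 1 4 8 8
```

The NA at 2001 is one year from 2000 and three from 2004, so it takes 1; the NA at 2007 is three years from 2004 and one from 2008, so it takes 8.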

Replacing NA values by nearest value and factor

An option using nearest rolling from data.table:

cols <- paste0("V", 1L:6L)
for (x in cols) {
  DT[is.na(get(x)), (x) :=
    DT[!is.na(get(x))][.SD, on = .(factorID, RDate), roll = "nearest",
                       get(paste0("x.", x))]]
}

output:

   factorID       Date RDate  V1  V2  V3   V4  V5  V6
1:        1 1989-02-06  6976 318 351 172  570 260 108
2:        1 1989-05-13  7072  77 305 591  835 801 550
3:        1 1989-05-29  7088  77 305 591  835 801 550
4:        1 1989-06-14  7104 252 305 286  835 271  85
5:        2 1989-02-06  6976 236 389 323 2078 908 373
6:        2 1989-05-13  7072  77  62 591 2001 801 550
7:        2 1989-05-29  7088  55  62 410 2001 801 550
8:        2 1989-06-14  7104 351 508 456 1618 780 421

data:

library(data.table)
DT <- fread("factorID Date RDate V1 V2 V3 V4 V5 V6
1 1989-02-06 6976 318 351 172 570 260 108
1 1989-05-13 7072 77 NA 591 NA 801 550
1 1989-05-29 7088 NA NA NA NA NA NA
1 1989-06-14 7104 252 305 286 835 271 85
2 1989-02-06 6976 236 389 323 2078 908 373
2 1989-05-13 7072 77 NA 591 NA 801 550
2 1989-05-29 7088 55 62 410 2001 NA NA
2 1989-06-14 7104 351 508 456 1618 780 421")

Note that for factorID = 1 and V2, 1989-06-14 is the nearest date with a non-NA value to both 1989-05-13 and 1989-05-29, and hence 305 is used to fill these NA rows.
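
For readers unfamiliar with rolling joins, here is a minimal sketch of what roll = "nearest" does on its own, outside the loop. The tables known and query and their contents are made up for illustration.

```r
library(data.table)

# Made-up lookup table of known values and a table of query points
known <- data.table(t = c(1, 5, 10), v = c("a", "b", "c"))
query <- data.table(t = c(2, 4, 9))

# For each row of `query`, take the row of `known` whose t is closest
res <- known[query, on = .(t), roll = "nearest"]
res$v   # "a" "b" "c"
```

In the loop above the same idea is applied per column: the rows where the column is not NA play the role of known, and the NA rows (.SD) play the role of query, matched exactly on factorID and by nearest RDate.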

How to replace NAs with the average of the nearest two values

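We can use na.approx from zoo, which interpolates linearly between the neighbouring non-NA values (for a single NA, that is the average of the two nearest values). Leading and trailing NAs cannot be interpolated, so they are then filled from the nearest non-NA value.

library(zoo)
# interpolate interior NAs, then fill trailing and leading NAs
data[] <- lapply(data, function(x)
  na.locf0(na.locf0(na.approx(x, na.rm = FALSE)), fromLast = TRUE))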

Or with tidyverse

library(dplyr)
library(tidyr)
data %>%
  mutate_all(na.approx, na.rm = FALSE) %>%
  fill(everything(), .direction = 'updown')
#    A B C
# 1 10 2 4
# 2 25 2 4
# 3 21 2 6
# 4 17 2 4
# 5  8 2 3
# 6  3 2 3

data

data <- structure(list(A = c(10L, 25L, NA, 17L, 8L, 3L), 
B = c(2L, NA, NA, NA, NA, 2L),
C = c(NA, 4L, 6L, 4L, 3L, NA)), class = "data.frame", row.names = c(NA, -6L))

R: replacing NA with value of closest point

Yup.

First, make your data frame with data.frame or things all get coerced to characters:

data<-data.frame(LAT=LAT,LON=LON,COLOR=COLOR)

Split the data frame up - you could probably do this in one go but this makes things a bit more obvious:

query = data[is.na(data$COLOR),]
colours = data[!is.na(data$COLOR),]
library(FNN)
neighs = get.knnx(colours[,c("LAT","LON")],query[,c("LAT","LON")],k=1)

Now insert the replacement colours directly into the data dataframe:

data[is.na(data$COLOR),"COLOR"]=colours$COLOR[neighs$nn.index]
plot(data$LON,data$LAT,col=data$COLOR,pch=19)

Note, however, that distance is being computed using Pythagorean (Euclidean) geometry on lat-long, which isn't accurate because the earth isn't flat. You might have to transform your coordinates to a projected coordinate system first.
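
As a rough illustration of why this matters, here is a base-R sketch that picks the nearest neighbour by great-circle (haversine) distance instead. It is not part of FNN; the helper haversine_km, the data frame pts and its values are all assumptions made up for this example, and the brute-force loop is only sensible for small data.

```r
# Hypothetical helper: great-circle distance in km (inputs in degrees)
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up points at high latitude, where a degree of longitude is short
pts <- data.frame(LAT   = c(61.5, 60, 60),
                  LON   = c(10, 12, 10),
                  COLOR = c("red", "blue", NA),
                  stringsAsFactors = FALSE)

known <- pts[!is.na(pts$COLOR), ]

# Plain Euclidean distance on degrees for the third point picks "red"
eucl <- sqrt((known$LAT - pts$LAT[3])^2 + (known$LON - pts$LON[3])^2)
eucl_pick <- known$COLOR[which.min(eucl)]

# Great-circle distance picks "blue" for the same point
for (i in which(is.na(pts$COLOR))) {
  d <- haversine_km(pts$LAT[i], pts$LON[i], known$LAT, known$LON)
  pts$COLOR[i] <- known$COLOR[which.min(d)]
}
```

At 60 degrees north, two degrees of longitude cover roughly 111 km while 1.5 degrees of latitude cover roughly 167 km, so the two distance measures disagree about which neighbour is closest.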

Replacing NAs with latest non-NA value

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6
9 3 2 3 2
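
If you would rather not depend on zoo, the same idea can be sketched in base R. This is not how na.locf is implemented; locf is a hypothetical helper name, and unlike na.locf's default it keeps leading NAs instead of dropping them.

```r
# Hypothetical base-R helper: carry the last non-NA value forward
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # slot 1 holds NA for positions before the first value
}

locf(c(2, NA, 1, 4, NA, NA))   # 2 2 1 4 4 4
locf(c(NA, 9, NA))             # NA 9 9
```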

Replacing missing value with mean of 2 nearest rows in R

na.approx in the zoo package does that. If there can be leading or trailing NA values and you want to:

  • extend the nearest non-NA values, add the rule = 2 argument to na.approx, or
  • leave those as NA, add the na.rm = FALSE argument to na.approx.

See ?na.approx for further arguments. Other possibilities from the same package include na.spline (fill in with cubic spline fit), na.aggregate (mean of all non-NA values), na.locf (last value carried forward) and na.StructTS (seasonal Kalman filter).

library(zoo)

A[, list(Value = na.approx(Value))]

giving:

    Value
 1:     1
 2:     2
 3:     3
 4:     4
 5:     5
 6:     6
 7:     7
 8:     8
 9:     9
10:    10
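
The effect of the two arguments on leading and trailing NAs is easiest to see on a plain vector. A minimal sketch on made-up data (the vector v is an assumption, not the asker's column):

```r
library(zoo)

# Made-up vector with a leading NA, an interior NA and a trailing NA
v <- c(NA, 1, NA, 3, NA)

default_fill <- na.approx(v)                 # 1 2 3 (end NAs dropped)
keep_na      <- na.approx(v, na.rm = FALSE)  # NA 1 2 3 NA
extended     <- na.approx(v, rule = 2)       # 1 1 2 3 3
```

Note that the default shortens the vector, which matters when the result is assigned back into a column of fixed length.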

Replacing row values with the closest conditional values in R

Here's one way to do it.

rle gives you the run-length encoding of locNumb. In it, replace the negative values with NA, then use the na.locf function from the zoo package to carry the nearest non-negative values forward (and then backward, for any leading NAs). Finally, inverse.rle rebuilds the full-length vector, which we add to the original data.frame df as newlocNumb.

Any additional condition can then be used to copy some of the original values from the locNumb column back into the newlocNumb column.

require(zoo)
pred_trip <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
locNumb <- c(-1, -1, -1, -1, -1, 2, 2, 2, 2, 3, 3, 0, 0, 0, 4, 4, 4, 4, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5)
df <- data.frame(pred_trip, locNumb)

RLE <- rle(df$locNumb)

RLE
## Run Length Encoding
## lengths: int [1:8] 5 4 2 3 4 5 6 4
## values : num [1:8] -1 2 3 0 4 -1 0 5

RLE$values[RLE$values < 0] <- NA

while (any(is.na(RLE$values))) {
  RLE$values <- na.locf(na.locf(RLE$values, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}

df$newlocNumb <- inverse.rle(RLE)

df
##    pred_trip locNumb newlocNumb
## 1          0      -1          2
## 2          0      -1          2
## 3          0      -1          2
## 4          0      -1          2
## 5          0      -1          2
## 6          1       2          2
## 7          1       2          2
## 8          1       2          2
## 9          1       2          2
## 10         0       3          3
## 11         0       3          3
## 12         0       0          0
## 13         1       0          0
## 14         1       0          0
## 15         1       4          4
## 16         0       4          4
## 17         0       4          4
## 18         0       4          4
## 19         0      -1          4
## 20         0      -1          4
## 21         0      -1          4
## 22         0      -1          4
## 23         0      -1          4
## 24         0       0          0
## 25         0       0          0
## 26         0       0          0
## 27         1       0          0
## 28         1       0          0
## 29         1       0          0
## 30         1       5          5
## 31         1       5          5
## 32         1       5          5
## 33         1       5          5

