Replacing NAs in R with nearest value
Here is a very fast one. It uses findInterval to find the two non-NA positions (one on each side) that should be considered for each NA in your original data:
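f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  # Nothing to do if there are no NAs, or nothing but NAs
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # For each NA, index of the interval of non-NA positions that brackets it
  intervals <- findInterval(na.pos, non.na.pos,
                            all.inside = TRUE)
  left.pos <- non.na.pos[pmax(1, intervals)]
  # Clamp to the number of non-NA positions (not N) so the index stays valid
  right.pos <- non.na.pos[pmin(length(non.na.pos), intervals + 1)]
  left.dist <- na.pos - left.pos
  right.dist <- right.pos - na.pos
  # Ties go to the left neighbour
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}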
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
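To make the findInterval step concrete, here is a small trace on a toy vector (the vector is made up for illustration, it is not the benchmark data). It repeats the core lines of f1 by hand, so treat it as a sketch of the idea rather than a replacement for the function:

```r
# Toy input (hypothetical): NAs at positions 2, 3 and 6
dat <- c(10, NA, NA, 40, 50, NA, 70)

na.pos     <- which(is.na(dat))     # positions holding NA
non.na.pos <- which(!is.na(dat))    # positions holding a value

# For each NA, the index (into non.na.pos) of the interval that brackets it
idx <- findInterval(na.pos, non.na.pos, all.inside = TRUE)

left.pos  <- non.na.pos[idx]        # lower end of the bracketing interval
right.pos <- non.na.pos[idx + 1]    # upper end of the bracketing interval

# Pick whichever end is closer; a tie goes to the left one because of <=
filled <- dat
filled[na.pos] <- ifelse(na.pos - left.pos <= right.pos - na.pos,
                         dat[left.pos], dat[right.pos])
filled
```

Position 6 is equally far from 50 and 70, and it receives 50, which shows the tie rule.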
Replace NA with the nearest value based on another variable, while keeping NA for observations that have no non-missing neighbour
One option would be to make use of case_when from tidyverse. Essentially, if the previous row has a closer year and is not NA, then return x from that row. If the next row has a closer year and is not NA, then return x from that row instead. If the row above is NA, then return the row below, and likewise if the row below is NA, then return the row above. If a row does not have an NA, then just return x.
library(tidyverse)
dat %>%
  mutate(x = case_when(
    is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    is.na(x) & is.na(lag(x)) ~ lead(x),
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x))
Output
year x
1 2000 1
2 2001 2
3 2002 3
4 2003 3
5 2005 5
6 2006 5
7 2007 NA
8 2008 9
9 2009 9
10 2010 10
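Since the input data is not shown here, the following sketch runs the same case_when rules on a small made-up tibble (the name toy and its values are assumptions for illustration). It has three consecutive NAs, so the middle one has no non-missing neighbour and should stay NA:

```r
library(dplyr)

# Hypothetical toy data: the NA in 2002 has NA on both sides
toy <- tibble(year = 2000:2004,
              x    = c(1, NA, NA, NA, 5))

res <- toy %>%
  mutate(x = case_when(
    is.na(x) & !is.na(lag(x))  & year - lag(year) < lead(year) - year ~ lag(x),
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    is.na(x) & is.na(lag(x))  ~ lead(x),
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x))

res$x
```

Note that lag(x) and lead(x) are evaluated on the original column, so a value filled in one row is not propagated to the next; that is what keeps the 2002 row missing.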
Replacing NA values by nearest value and factor
An option using a nearest rolling join from data.table:
cols <- paste0("V", 1L:6L)
for (x in cols) {
  DT[is.na(get(x)), (x) :=
       DT[!is.na(get(x))][.SD, on = .(factorID, RDate), roll = "nearest", get(paste0("x.", x))]]
}
output:
factorID Date RDate V1 V2 V3 V4 V5 V6
1: 1 1989-02-06 6976 318 351 172 570 260 108
2: 1 1989-05-13 7072 77 305 591 835 801 550
3: 1 1989-05-29 7088 77 305 591 835 801 550
4: 1 1989-06-14 7104 252 305 286 835 271 85
5: 2 1989-02-06 6976 236 389 323 2078 908 373
6: 2 1989-05-13 7072 77 62 591 2001 801 550
7: 2 1989-05-29 7088 55 62 410 2001 801 550
8: 2 1989-06-14 7104 351 508 456 1618 780 421
data:
library(data.table)
DT <- fread("factorID Date RDate V1 V2 V3 V4 V5 V6
1 1989-02-06 6976 318 351 172 570 260 108
1 1989-05-13 7072 77 NA 591 NA 801 550
1 1989-05-29 7088 NA NA NA NA NA NA
1 1989-06-14 7104 252 305 286 835 271 85
2 1989-02-06 6976 236 389 323 2078 908 373
2 1989-05-13 7072 77 NA 591 NA 801 550
2 1989-05-29 7088 55 62 410 2001 NA NA
2 1989-06-14 7104 351 508 456 1618 780 421")
Note that for factorID=1 and V2, 1989-06-14 is the nearest date with a non-NA value for both 1989-05-13 and 1989-05-29, and hence 305 is used to fill these NA rows.
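For readers new to rolling joins, here is a minimal sketch of roll = "nearest" on its own, using two made-up tables (known, query and their columns are hypothetical names). The first join column is matched exactly and the last one is rolled to the closest available value:

```r
library(data.table)

# Hypothetical lookup table with known values on days 10 and 20
known <- data.table(id = 1L, day = c(10L, 20L), v = c(100, 200))
# Hypothetical query days that need a value
query <- data.table(id = 1L, day = c(12L, 17L, 25L))

# For each query row: match id exactly, then roll to the nearest day
res <- known[query, on = .(id, day), roll = "nearest"]
res$v
```

Day 12 picks up the value from day 10, while days 17 and 25 pick up the value from day 20. The example avoids exact ties on purpose.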
How to replace NAs with the average of the nearest two values
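We can use na.approx from zoo, then carry values forward and backward to fill any leading or trailing NA:
library(zoo)
data[] <- lapply(data, function(x)
  na.locf0(na.locf0(na.approx(x, na.rm = FALSE)), fromLast = TRUE))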
Or with tidyverse (na.approx still comes from zoo):
library(dplyr)
library(tidyr)
library(zoo)
data %>%
  mutate(across(everything(), ~ na.approx(.x, na.rm = FALSE))) %>%
  fill(everything(), .direction = 'updown')
# A B C
#1 10 2 4
#2 25 2 4
#3 21 2 6
#4 17 2 4
#5 8 2 3
#6 3 2 3
data
data <- structure(list(A = c(10L, 25L, NA, 17L, 8L, 3L),
B = c(2L, NA, NA, NA, NA, 2L),
C = c(NA, 4L, 6L, 4L, 3L, NA)), class = "data.frame", row.names = c(NA, -6L))
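A quick check of why linear interpolation answers the "average of the nearest two values" question: for a single missing value the interpolated point is exactly the mean of its neighbours, while for a longer gap the fills are evenly spaced steps. The vectors below are made up for illustration:

```r
library(zoo)

# One missing value between 10 and 20: the fill is their mean
single_gap <- na.approx(c(10, NA, 20))

# Two missing values between 2 and 8: evenly spaced steps, not the mean
double_gap <- na.approx(c(2, NA, NA, 8))

single_gap
double_gap
```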
R: replacing NA with value of closest point
Yup.
First, make your data frame with data.frame, or everything gets coerced to character:
data<-data.frame(LAT=LAT,LON=LON,COLOR=COLOR)
Split the data frame up - you could probably do this in one go but this makes things a bit more obvious:
query = data[is.na(data$COLOR),]
colours = data[!is.na(data$COLOR),]
library(FNN)
neighs = get.knnx(colours[,c("LAT","LON")],query[,c("LAT","LON")],k=1)
Now insert the replacement colours directly into the data data frame:
data[is.na(data$COLOR),"COLOR"]=colours$COLOR[neighs$nn.index]
plot(data$LON,data$LAT,col=data$COLOR,pch=19)
Note however that distance is being computed with Pythagorean (Euclidean) geometry on the raw lat-long values, which isn't accurate because the earth isn't flat. You might have to transform your coordinates to a projected system first.
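One way to sidestep the flat-earth issue without projecting is to compute great-circle distances directly. The sketch below is a brute-force alternative to the FNN lookup, not part of the answer above: haversine_km and nearest_index are hypothetical helper names, the Earth is assumed to be a sphere of radius 6371 km, and the points are made up.

```r
# Great-circle distance in km (haversine formula) on a sphere of radius r.
# r = 6371 km is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Index of the nearest row of `known` for every row of `query` (brute force)
nearest_index <- function(query, known) {
  vapply(seq_len(nrow(query)), function(i) {
    which.min(haversine_km(query$LAT[i], query$LON[i], known$LAT, known$LON))
  }, integer(1))
}

# Hypothetical points at high latitude, where a degree of longitude is short
known <- data.frame(LAT = c(80, 75), LON = c(10, 0), COLOR = c("red", "blue"))
query <- data.frame(LAT = 80, LON = 0)

idx <- nearest_index(query, known)

# Same lookup using plain Euclidean distance on degrees, for comparison
flat_idx <- which.min((known$LAT - query$LAT)^2 + (known$LON - query$LON)^2)

as.character(known$COLOR[idx])
as.character(known$COLOR[flat_idx])
```

Here the two methods disagree: the point 10 degrees of longitude away is nearer on the sphere, but the Euclidean calculation on degrees prefers the point 5 degrees of latitude away. The brute-force loop is fine for small data; for large data a projected coordinate system plus a k-d tree search is likely the better route.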
Replacing NAs with latest non-NA value
You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
# 1 2 3 4 5 6
# 2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
# 1 2 3 4 5 6
# 2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
# 2 3 4 5 6
# 9 3 2 3 2
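As the cz example shows, a leading NA is dropped by default, which changes the length of the result. If the output has to line up with the input (for example when assigning back into a data frame column), a sketch like the following keeps the length with na.rm = FALSE and then optionally fills what is left from the other direction. The vector is made up for illustration:

```r
library(zoo)

x <- c(NA, 9, 3, NA, 2)   # hypothetical input with a leading NA

# Carry forward but keep the leading NA, so the length is unchanged
fwd <- na.locf(x, na.rm = FALSE)

# Then carry backward to fill whatever is still missing at the start
both <- na.locf(fwd, fromLast = TRUE)

fwd
both
```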
Replacing missing value with mean of 2 nearest rows in R
na.approx in the zoo package does that. If there can be leading or trailing NA values and you want to:
- extend the nearest non-NA values, add the rule = 2 argument to na.approx, or
- leave those as NA, add the na.rm = FALSE argument to na.approx.
See ?na.approx for further arguments. Other possibilities from the same package include na.spline (fill in with cubic spline fit), na.aggregate (mean of all non-NA values), na.locf (last value carried forward) and na.StructTS (seasonal Kalman filter).
library(zoo)
A[, list(Value = na.approx(Value))]
giving:
Value
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
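The difference between the two options for leading and trailing NA values is easiest to see side by side. This is a small sketch on a made-up vector, with NA at both ends and one interior gap:

```r
library(zoo)

x <- c(NA, 4, NA, 8, NA)   # hypothetical input

# Interior gap is interpolated; the ends stay NA and the length is kept
kept <- na.approx(x, na.rm = FALSE)

# Interior gap is interpolated; the ends take the nearest non-NA value
extended <- na.approx(x, rule = 2)

# Default behaviour: leading and trailing NA are dropped from the result
trimmed <- na.approx(x)

kept
extended
trimmed
```

The default call returns a shorter vector, which is why one of the two arguments is usually needed when the result is assigned back into a column.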
Replacing row values with the closest conditional values in R
Here's one way to do it. rle will give you run length encodings, from which you can replace the negative values with NA and then use the na.locf function from the zoo package to carry forward (and carry backward) the nearest non-negative values. Finally, the inverse.rle function rebuilds the full-length vector, which we can add to our original data.frame df as newlocNumb. Any additional condition can then be used to copy some of the original values from the locNumb column back into the newlocNumb column.
library(zoo)
pred_trip <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
locNumb <- c(-1, -1, -1, -1, -1, 2, 2, 2, 2, 3, 3, 0, 0, 0, 4, 4, 4, 4, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5)
df <- data.frame(pred_trip, locNumb)
RLE <- rle(df$locNumb)
RLE
## Run Length Encoding
## lengths: int [1:8] 5 4 2 3 4 5 6 4
## values : num [1:8] -1 2 3 0 4 -1 0 5
RLE$values[RLE$values < 0] <- NA
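# One forward pass followed by one backward pass fills every NA run;
# no loop is needed (a loop would never end if all values were NA)
RLE$values <- na.locf(na.locf(RLE$values, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)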
df$newlocNumb <- inverse.rle(RLE)
df
## pred_trip locNumb newlocNumb
## 1 0 -1 2
## 2 0 -1 2
## 3 0 -1 2
## 4 0 -1 2
## 5 0 -1 2
## 6 1 2 2
## 7 1 2 2
## 8 1 2 2
## 9 1 2 2
## 10 0 3 3
## 11 0 3 3
## 12 0 0 0
## 13 1 0 0
## 14 1 0 0
## 15 1 4 4
## 16 0 4 4
## 17 0 4 4
## 18 0 4 4
## 19 0 -1 4
## 20 0 -1 4
## 21 0 -1 4
## 22 0 -1 4
## 23 0 -1 4
## 24 0 0 0
## 25 0 0 0
## 26 0 0 0
## 27 1 0 0
## 28 1 0 0
## 29 1 0 0
## 30 1 5 5
## 31 1 5 5
## 32 1 5 5
## 33 1 5 5
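The same steps on a much shorter vector may make the mechanics easier to follow. This is a sketch on made-up data; in particular, keep_original stands in for "any additional condition" and is purely hypothetical:

```r
library(zoo)

x <- c(-1, -1, 2, 2, 2, -1, 3)   # hypothetical input, negatives are placeholders

r <- rle(x)                       # one entry per run of equal values
r$values[r$values < 0] <- NA      # mark the negative runs as missing

# Fill each missing run from the previous run, then from the next run
r$values <- na.locf(na.locf(r$values, na.rm = FALSE),
                    fromLast = TRUE, na.rm = FALSE)

filled <- inverse.rle(r)          # expand the runs back to full length

# Hypothetical extra condition: keep the original value where flagged
keep_original <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
result <- ifelse(keep_original, x, filled)

filled
result
```

Working on the run values rather than on the full vector means each block of negatives is filled as a unit, and the forward pass takes priority over the backward pass.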