Replacing NAs With Latest Non-NA Value

Replacing NAs with latest non-NA value

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2
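For intuition about what na.locf() does, here is a minimal base R sketch of the same carry-forward idea. The helper name locf_base is hypothetical (it is not part of zoo). Unlike the default na.locf(), it keeps leading NAs, which corresponds to na.rm = FALSE.

```r
# Hypothetical helper, not part of zoo: carry the last non-NA value forward.
locf_base <- function(x) {
  # For each position, the index of the most recent non-NA value (0 if none yet).
  idx <- cummax(seq_along(x) * (!is.na(x)))
  # Prepend an NA so that index 0 maps to NA.
  c(NA, x)[idx + 1]
}

locf_base(c(2, NA, 1, 4, 5, 2))   # 2 2 1 4 5 2
locf_base(c(NA, 9, 3, 2, 3, 2))   # NA 9 3 2 3 2
```

The cummax() trick avoids an explicit loop, so it stays fast on long vectors.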

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
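As a cross-check on the long-format output shown earlier, the grouped fill can also be reproduced without any packages. This is only a sketch in base R: locf_base is a hypothetical stand-in for na.locf0, and df is rebuilt compactly with the same values as the input above.

```r
# Hypothetical base R stand-in for zoo::na.locf0 (keeps leading NAs).
locf_base <- function(x) c(NA, x)[cummax(seq_along(x) * (!is.na(x))) + 1]

# Same values as the input data frame, built more compactly.
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)
)

# ave() applies the function within each houseID and returns the
# results in the original row order.
filled <- transform(df, price = ave(price, houseID, FUN = locf_base))
filled$price
# NA 100 100 120 120 NA NA NA 30 30 NA 44 44 44 44
```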

Replace NAs for a group of values with a non-NA character in group in R

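Here is an alternative using na.locf from the zoo package. Sorting by test within each participant puts the NAs last, so the non-NA value is carried forward into them: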

library(zoo)
library(dplyr)
df %>% 
  group_by(participant_id) %>% 
  arrange(participant_id, test) %>% 
  mutate(test = zoo::na.locf(test, na.rm=FALSE))

   participant_id test 
   <chr>          <chr>
 1 ps1            test1
 2 ps1            test1
 3 ps1            test1
 4 ps1            test1
 5 ps2            test2
 6 ps2            test2
 7 ps3            test3
 8 ps3            test3
 9 ps3            test3
10 ps3            test3

Tidyverse: Replacing NAs with latest non-NA values using tidyverse tools

We can replace the NAs before 2017 with the value recorded in 2017 for each country.

library(dplyr)

df %>% 
  group_by(country) %>% 
  mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
  #Similarly with ifelse
  #mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))

#  country  year value
#  <chr>   <int> <int>
#1 usa      2015   100
#2 usa      2016   100
#3 usa      2017   100
#4 usa      2018    NA
#5 aus      2015    50
#6 aus      2016    50
#7 aus      2017    50
#8 aus      2018    60
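
If the 2017 value is simply the next non-NA value after the gap, tidyr::fill() with .direction = "up" gives the same result. This is a sketch under that assumption and not a general equivalent of the replace() call: fill() pulls from whatever non-NA value comes next, not specifically from 2017. The input below is an assumed reconstruction, since the original data is not shown.

```r
library(dplyr)
library(tidyr)

# Assumed input, reconstructed to be consistent with the output shown above.
df <- data.frame(
  country = rep(c("usa", "aus"), each = 4),
  year    = rep(2015:2018, times = 2),
  value   = c(NA, NA, 100L, NA, NA, NA, 50L, 60L)
)

filled <- df %>%
  group_by(country) %>%
  fill(value, .direction = "up") %>%  # each NA takes the next non-NA value in its group
  ungroup()

filled$value
# 100 100 100 NA 50 50 50 60
```

The trailing NA for usa in 2018 stays NA because there is no later value to pull from.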

Fill NAs with either last or next non NA value in R

Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.

Here is the code, using a spiced up version of your example:

library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"), 
                    year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L), 
                    pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L), 
                    pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)), 
               class = "data.frame", row.names = c(NA, -18L))

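fill_nearest = function(x){
  # positions of the non-missing values
  keys = which(!is.na(x))
  # nothing to impute from: return the input unchanged
  if(length(keys) == 0) return(x)
  # for each position, the index of the nearest non-missing value
  # (seq_along() is safe for length-1 groups, unlike seq.int(x))
  b = map_dbl(seq_along(x), ~keys[which.min(abs(.x - keys))])
  x[b]
}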

df %>% 
  group_by(id) %>% 
  arrange(id, year) %>%
  mutate(pop_imputated = fill_nearest(pop)) %>% 
  ungroup()
#> # A tibble: 18 x 5
#>    id     year   pop pop_exp pop_imputated
#>    <chr> <int> <int>   <int>         <int>
#>  1 E1     2000    NA      NA            NA
#>  2 E2     2000    NA     120           120
#>  3 E2     2001    NA     120           120
#>  4 E2     2003   120     120           120
#>  5 E2     2005   125     125           125
#>  6 E3     1999   115     115           115
#>  7 E3     2001   300     300           300
#>  8 E3     2003    NA     300           300
#>  9 E4     2004    10      10            10
#> 10 E4     2005    NA      10            10
#> 11 E4     2006    NA      10            10
#> 12 E4     2007    NA       9             9
#> 13 E4     2008    NA       9             9
#> 14 E4     2009     9       9             9
#> 15 E4     2018    NA       9             9
#> 16 E4     2019     8       8             8
#> 17 E5     2002    12      12            12
#> 18 E5     2003    80      80            80

Created on 2021-05-13 by the reprex package (v2.0.0)

Because this relies on a purrr loop, it may get slow on a very large dataset.

EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference for the "distance" between values. For instance, you would rather have row 15 as 8 than as 9, because 2018 is much closer to 2019 than to 2009.
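
The year-based distance idea could look roughly like the following. This is my own base R sketch, not the version from the linked issue; the name fill_nearest_by and its ref argument are hypothetical.

```r
# Hypothetical variant: measure "nearest" on a reference column (e.g. year)
# instead of on row position.
fill_nearest_by <- function(x, ref) {
  keys <- which(!is.na(x))
  # Nothing to impute from: return the input unchanged.
  if (length(keys) == 0) return(x)
  # For each row, pick the non-missing row whose ref value is closest.
  # which.min() returns the first minimum, so ties go to the earlier row.
  nearest <- vapply(
    ref,
    function(r) keys[which.min(abs(r - ref[keys]))],
    integer(1)
  )
  x[nearest]
}

# Group E4 from the example data.
year <- c(2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L)
pop  <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)

fill_nearest_by(pop, year)
# 10 10 10 9 9 9 8 8
```

Here the 2018 row is filled with 8 (from 2019) rather than 9 (from 2009).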

Replacing NA from a specific column with latest non-NA value from that row in R

If it is a large data.frame, a vectorized solution may be more efficient than looping over rows. Get the logical index of the elements in 'col1' that are NA ('i1'). Use max.col to return, for each row, the position of the first non-NA element among columns 3 to 5 ('j1'). Build a row/column index matrix ('m1') with cbind. Then fill the missing 'col1' values with the elements extracted from columns 3 to 5 using 'm1', and set those source elements to NA.

df1 <- as.data.frame(df)
i1 <- is.na(df1$col1)
j1 <- max.col(!is.na(df1[3:5]), "first")
m1 <- cbind(which(i1), j1[i1])
df1$col1[i1] <- df1[3:5][m1]
df1[3:5][m1] <- NA

Output:

> df1
  fruits col1 col2 col3 col4
1  apple    4    5   10   20
2 banana  100   NA   NA    4
3 ananas   10   NA    5    1
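
To make the intermediate objects concrete, here is the same code run on a small input. The input is an assumption: it is reconstructed to be consistent with the output above, since the original data is not shown.

```r
# Assumed input, reconstructed from the output shown above.
df1 <- data.frame(
  fruits = c("apple", "banana", "ananas"),
  col1   = c(4, NA, NA),
  col2   = c(5, NA, 10),
  col3   = c(10, 100, 5),
  col4   = c(20, 4, 1)
)

i1 <- is.na(df1$col1)                     # FALSE TRUE TRUE
j1 <- max.col(!is.na(df1[3:5]), "first")  # 1 2 1: first non-NA among col2..col4
m1 <- cbind(which(i1), j1[i1])            # (row, column) pairs: (2, 2) and (3, 1)

df1$col1[i1] <- df1[3:5][m1]              # pull 100 and 10 into col1
df1[3:5][m1] <- NA                        # blank out the cells that were moved
df1
```

Note that the column positions in j1 and m1 are relative to the df1[3:5] subset, not to the full data frame.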

Replace NA values if last and next non-NA value are the same

You can fill forwards and backwards, then set the rows where they don't match to NA.

library(zoo)
library(dplyr)

df %>% 
  mutate_if(is.factor, as.character) %>% 
  group_by(ID) %>%
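  # na.rm = FALSE keeps leading/trailing NAs so the vector length is unchanged
  mutate(result = na.locf(with_missing, fromLast = TRUE, na.rm = FALSE),
         result = ifelse(result == na.locf(with_missing, na.rm = FALSE), result, NA))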

#    ID with_missing desired_result result
# 1   1            a              a      a
# 2   1            a              a      a
# 3   1         <NA>              a      a
# 4   1         <NA>              a      a
# 5   1            a              a      a
# 6   1            a              a      a
# 7   2            a              a      a
# 8   2            a              a      a
# 9   2         <NA>           <NA>   <NA>
# 10  2            b              b      b
# 11  2            b              b      b
# 12  2            b              b      b
# 13  3            a              a      a
# 14  3         <NA>           <NA>   <NA>
# 15  3         <NA>           <NA>   <NA>
# 16  3         <NA>           <NA>   <NA>
# 17  3            c              c      c
# 18  3            c              c      c
# 19  4            b              b      b
# 20  4         <NA>           <NA>   <NA>
# 21  4            a              a      a
# 22  4            a              a      a
# 23  4            a              a      a
# 24  4            a              a      a
# 25  5            a              a      a
# 26  5         <NA>              a      a
# 27  5         <NA>              a      a
# 28  5         <NA>              a      a
# 29  5         <NA>              a      a
# 30  5            a              a      a
# 31  6            a              a      a
# 32  6            a              b      a
# 33  6         <NA>              b   <NA>
# 34  6            b              b      b
# 35  6            a              a      a
# 36  6            a              a      a
# 37  7            a              a      a
# 38  7            a              a      a
# 39  7         <NA>              a      a
# 40  7         <NA>              a      a
# 41  7            a              a      a
# 42  7            a              a      a
# 43  8            a              a      a
# 44  8            a              a      a
# 45  8         <NA>           <NA>   <NA>
# 46  8            b              b      b
# 47  8            b              b      b
# 48  8            b              b      b
# 49  9            a              a      a
# 50  9         <NA>           <NA>   <NA>
# 51  9         <NA>           <NA>   <NA>
# 52  9         <NA>           <NA>   <NA>
# 53  9            c              c      c
# 54  9            c              c      c
# 55 10            b              b      b
# 56 10         <NA>           <NA>   <NA>
# 57 10            a              a      a
# 58 10            a              a      a
# 59 10            a              a      a
# 60 10            a              a      a
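
The fill-both-ways-then-compare idea can be seen on a single vector. This is a base R sketch with no grouping; locf_base is a hypothetical stand-in for na.locf(x, na.rm = FALSE), and the vector x is made up for illustration.

```r
# Hypothetical base R stand-in for zoo::na.locf(x, na.rm = FALSE).
locf_base <- function(x) c(NA, x)[cummax(seq_along(x) * (!is.na(x))) + 1]

x <- c("a", NA, NA, "a", NA, "b")

forward  <- locf_base(x)            # fill downward:  a a a a a b
backward <- rev(locf_base(rev(x)))  # fill upward:    a a a a b b

# Keep the filled value only where both directions agree.
result <- ifelse(forward == backward, forward, NA)
result
# "a" "a" "a" "a" NA  "b"
```

The gap between two "a" values is filled, while the gap between "a" and "b" stays NA.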

Replace non NA values of a column in a dataframe with a constant

Two things need to be done: take the conjunction of ind with the non-NA condition, and index by numeric positions so that no NAs are used as indices.

A one-liner will do it.

data$yi[ which(ind & !is.na(data$yi)) ] <- Tukey_max
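
A small worked example of the one-liner follows. How ind was computed is not shown in the question, so the definition below (values above a threshold) and the sample data are assumptions for illustration only.

```r
# Assumed sample data and threshold.
data <- data.frame(yi = c(1, NA, 50, 3, 80))
Tukey_max <- 10

# Assumed definition of ind; the comparison yields NA where yi is NA.
ind <- data$yi > Tukey_max   # FALSE NA TRUE FALSE TRUE

# which() keeps only the positions that are TRUE, dropping FALSE and NA.
pos <- which(ind & !is.na(data$yi))   # 3 5
data$yi[pos] <- Tukey_max

data$yi
# 1 NA 10 3 10
```

The NA in yi is left untouched and only the flagged non-NA values are replaced.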
