Populate NAs in a Vector Using Prior Non-NA Values

Replacing NAs with latest non-NA value

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2
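
Note that the last call returns only five observations: by default na.locf() drops leading NAs, since there is no earlier value to carry forward. A minimal sketch of the difference, using a plain numeric vector instead of a zoo series (the vector x is made up for illustration):

```r
library(zoo)

# Hypothetical input: one leading NA and one interior NA.
x <- c(NA, 9, 3, NA, 3, 2)

# Default (na.rm = TRUE): the leading NA is dropped, so the result is shorter.
dropped <- na.locf(x)

# na.rm = FALSE: the leading NA is kept and the length is preserved.
kept <- na.locf(x, na.rm = FALSE)

length(dropped)  # 5
length(kept)     # 6
kept             # NA 9 3 3 3 2
```

Preserving the length matters whenever the filled vector is assigned back to a data frame column.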

Replace NAs for a group of values with a non-NA character in group in R

Here is an alternative using na.locf from the zoo package:

library(zoo)
library(dplyr)
df %>% 
  group_by(participant_id) %>% 
  arrange(participant_id, test) %>% 
  mutate(test = zoo::na.locf(test, na.rm=FALSE))

   participant_id test 
   <chr>          <chr>
 1 ps1            test1
 2 ps1            test1
 3 ps1            test1
 4 ps1            test1
 5 ps2            test2
 6 ps2            test2
 7 ps3            test3
 8 ps3            test3
 9 ps3            test3
10 ps3            test3

Populate NAs in a vector using prior non-NA values?

library(zoo)
na.locf(test)
[1] 1 2 2 2 5 5 9 9 9
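
The vector test is not shown in the answer. Purely as an illustration, the following assumed input reproduces the printed result:

```r
library(zoo)

# Assumed input, chosen so that the filled result matches the output above.
test <- c(1, 2, NA, NA, 5, NA, 9, NA, NA)

# Each NA takes the most recent non-NA value to its left.
filled <- na.locf(test)
filled
```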

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
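
To see why the grouping matters, here is a small sketch on made-up data (the data frame toy is hypothetical and not the input used in this answer). It contrasts an ungrouped fill with the grouped ave() approach:

```r
library(zoo)

# Made-up data: two houses, three years each.
toy <- data.frame(
  houseID = c(1L, 1L, 1L, 2L, 2L, 2L),
  year    = rep(1995:1997, times = 2),
  price   = c(NA, 100, NA, NA, NA, 30)
)

# Ungrouped fill: house 1's price of 100 leaks into house 2's leading NAs.
ungrouped <- na.locf0(toy$price)

# Grouped fill: each house is filled separately, so house 2 keeps its NAs
# until its own first observed price.
grouped <- ave(toy$price, toy$houseID, FUN = na.locf0)

ungrouped  # NA 100 100 100 100 30
grouped    # NA 100 100  NA  NA 30
```

Because na.locf0 always returns a vector of the same length as its input, it can be passed to ave() directly.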

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

Filling NA values with last non-NA's if between repeated identical non-NA values

Here is a base R for loop solution.

Write a function that compares each pair of consecutive non-NA values and, if they are equal, fills the NA values between them with that value.

fill_NA_values <- function(x) {
  #Index of non-NA values
  non_na_values <- which(!is.na(x))
  #loop over each index.
  for(i in seq_along(non_na_values[-1])) {
    #If two consecutive non-NA value are the same
    if(x[non_na_values[i]] == x[non_na_values[i + 1]]) {
      #Fill the NA values in between with the value.
      x[(non_na_values[i] + 1):(non_na_values[i+1] -1)] <- x[non_na_values[i]]
    }
  }
  x
}

Apply this to multiple columns using lapply.

df[-1] <- lapply(df[-1], fill_NA_values)
df

#         date    X1    X3    X4
#1  2004-12-27    NA    NA    NA
#2  2004-12-28 2.299 2.349 2.348
#3  2004-12-29 2.299 2.349 2.348
#4  2005-01-03 2.299 2.349 2.348
#5  2005-01-04 2.299 2.349 2.348
#6  2005-01-05 2.299 2.349 2.348
#7  2005-01-06 2.299 2.349 2.348
#8  2005-01-10 2.299 2.349 2.348
#9  2005-01-11 2.299 2.349 2.348
#10 2005-01-12 2.299    NA    NA
#11 2005-01-17 2.299    NA    NA
#12 2005-01-18 2.299    NA    NA
#13 2005-01-19 2.299    NA    NA
#14 2005-01-24 2.299    NA    NA
#15 2005-01-25 2.299 2.369 2.368
#16 2005-01-26 2.299    NA    NA
#17 2005-01-31 2.299    NA    NA
#18 2005-02-01    NA    NA    NA
#19 2005-02-02    NA    NA    NA
#20 2005-02-08    NA    NA    NA
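
As an aside, the same rule can be sketched without an explicit loop: carry values forward, carry values backward, and fill an NA only where both directions agree. The helper names locf_base and fill_between_equal are made up for this sketch. It should give the same result as the loop on numeric vectors, though it has not been checked against the original data.

```r
# Base R last-observation-carried-forward; leading NAs stay NA.
locf_base <- function(x) {
  # Index of the latest non-NA element seen so far (0 if none yet).
  idx <- cummax(seq_along(x) * !is.na(x))
  c(NA, x)[idx + 1]
}

# Fill an NA only when the nearest non-NA values on both sides are equal.
fill_between_equal <- function(x) {
  fwd <- locf_base(x)            # value carried forward from the left
  bwd <- rev(locf_base(rev(x)))  # value carried backward from the right
  fillable <- is.na(x) & !is.na(fwd) & !is.na(bwd) & fwd == bwd
  ifelse(fillable, fwd, x)
}

# Hypothetical input vector.
x <- c(NA, 1, NA, NA, 1, NA, 2, NA, 2, 3, NA)
result <- fill_between_equal(x)
result
# NA 1 1 1 1 NA 2 2 2 3 NA
```

The NA between 1 and 2 is left alone because its neighbours differ, and the leading and trailing NAs are left alone because they have a non-NA value on one side only.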

Tidyverse: Replacing NAs with latest non-NA values using tidyverse tools

For each country, we can replace the NAs in years before 2017 with that country's 2017 value.

library(dplyr)

df %>% 
  group_by(country) %>% 
  mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
  #Similarly with ifelse
  #mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))

#  country  year value
#  <chr>   <int> <int>
#1 usa      2015   100
#2 usa      2016   100
#3 usa      2017   100
#4 usa      2018    NA
#5 aus      2015    50
#6 aus      2016    50
#7 aus      2017    50
#8 aus      2018    60
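
The input data frame is not shown in the answer. A self-contained sketch, with an input that is only assumed (reverse-engineered from the output above), could look like this:

```r
library(dplyr)

# Assumed input, reconstructed from the output shown above.
df <- data.frame(
  country = rep(c("usa", "aus"), each = 4),
  year    = rep(2015:2018, times = 2),
  value   = c(NA, NA, 100L, NA, NA, NA, 50L, 60L),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(country) %>%
  # Only NAs before 2017 are replaced; the 2018 NA for usa is left as is.
  mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017])) %>%
  ungroup()

res$value  # 100 100 100 NA 50 50 50 60
```

This relies on every country having exactly one row for 2017, so that value[year == 2017] yields a single value per group.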

Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner

As the OP has mentioned that memory consumption is crucial, here is a data.table approach which uses the na.locf() function from the zoo package:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows 
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"), 
   by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]

   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S

I've refrained from chaining the expressions, so the code can be executed step by step.

Note that data.table updates in place, without copying the whole data set, which saves memory as well as time.

Prerequisites and additional explanations

In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.

However, as the solution above depends on these assumptions it is worthwhile to add some checks:

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
  ln = 1:8,
  Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, the Taxonomy must be specified at least for the top Level "domain". This can be double-checked with:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)

The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e. DT[(na), ..., in order to prepend the additional letters to Taxonomy where Taxonomy was originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy) which would cover duplicates as mentioned in this comment. Finally, I recognized that I can scrap this column because of the prerequisites.

Replace NA with previous or next value, by group, using dplyr

library(dplyr) #group_by and %>% come from dplyr
library(tidyr) #fill is part of tidyr

ps1 %>% 
  group_by(userID) %>% 
  #fill(color, age, gender) %>% #default direction down
  fill(color, age, gender, .direction = "downup")

Which gives you:

Source: local data frame [9 x 4]
Groups: userID [3]

  userID  color    age gender
   <dbl> <fctr> <fctr> <fctr>
1     21   blue   3yrs      F
2     21   blue   2yrs      F
3     21    red   2yrs      M
4     22   blue   3yrs      F
5     22   blue   3yrs      F
6     22   blue   3yrs      F
7     23    red   4yrs      F
8     23    red   4yrs      F
9     23   gold   4yrs      F
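
To illustrate what the .direction argument changes, here is a minimal sketch on made-up data (the data frame toy is hypothetical, not the ps1 data from the question):

```r
library(dplyr)
library(tidyr)

# Made-up data: user 1 has a leading and a trailing NA, user 2 only a leading NA.
toy <- data.frame(
  userID = c(1, 1, 1, 2, 2),
  color  = c(NA, "blue", NA, NA, "red"),
  stringsAsFactors = FALSE
)

# "down" carries the previous value forward; leading NAs in a group remain.
down <- toy %>% group_by(userID) %>% fill(color, .direction = "down") %>% ungroup()

# "downup" fills down first, then fills any remaining leading NAs upward.
downup <- toy %>% group_by(userID) %>% fill(color, .direction = "downup") %>% ungroup()

down$color    # NA "blue" "blue" NA "red"
downup$color  # "blue" "blue" "blue" "red" "red"
```

In both cases the fill stays within each userID group, so values never cross from one user to the next.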

Replace NA row with non-NA value from previous row and certain column

In the end I wrote my own vectorized version, which returns the expected output:

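library(zoo)  # for na.locf

na.replace <- function(x, k) {
    # rows where column k is missing
    isNA <- is.na(x[, k])
    # overwrite those rows with the last non-NA value of column k
    x[isNA, ] <- na.locf(x[, k], na.rm = FALSE)[isNA]
    x
}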

UPDATE

Better solution, without any packages

na.lomf <- function(x) {
    if (length(x) > 0L) {
        non.na.idx <- which(!is.na(x))
        if (is.na(x[1L])) {
            non.na.idx <- c(1L, non.na.idx)
        }
        rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
    }
}

na.lomf(c(NA, 1, 2, NA, NA, 3, NA, NA, 4, NA))
# [1] NA  1  2  2  2  3  3  3  4  4
