Populate NAs in a Vector Using Prior Non-NA Values

Replacing NAs with latest non-NA value

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6
9 3 2 3 2
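The last call above returns only five values because na.locf() drops a leading NA by default. A minimal sketch of the difference, on a made-up plain vector rather than the zoo objects from the help page:

```r
library(zoo)

x <- c(NA, 9, 3, NA, 3, 2)  # made-up input with a leading NA

dropped <- na.locf(x)                 # default na.rm = TRUE drops the leading NA
kept    <- na.locf(x, na.rm = FALSE)  # keeps the leading NA, so the length is unchanged

length(dropped)  # 5
length(kept)     # 6
```

Keeping the length unchanged matters when the result is assigned back into a data frame column, which is why the grouped examples further down pass na.rm = FALSE.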

Replace NAs for a group of values with a non-NA character in group in R

Here is an alternative way using na.locf from zoo package:

library(zoo)
library(dplyr)
df %>%
  group_by(participant_id) %>%
  arrange(participant_id, test) %>%
  mutate(test = zoo::na.locf(test, na.rm = FALSE))
   participant_id test 
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3

Populate NAs in a vector using prior non-NA values?

library(zoo)
na.locf(test)
[1] 1 2 2 2 5 5 9 9 9
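The definition of test is not shown above. As a rough sketch, the input below is a hypothetical vector chosen to reproduce that output, and the fill is done in base R with an index trick instead of zoo:

```r
# Hypothetical input, chosen so the result matches the output shown above
test <- c(1, 2, NA, NA, 5, NA, 9, NA, NA)

# For each position, the index of the most recent non-NA element (0 if none yet)
idx <- cummax(seq_along(test) * (!is.na(test)))

# Prepend NA so that index 0 (no prior value) maps to NA
filled <- c(NA, test)[idx + 1]
filled
# [1] 1 2 2 2 5 5 9 9 9
```

Unlike the default na.locf() call, this variant keeps leading NAs, so the result always has the same length as the input.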

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
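To see that ave() restarts the fill inside each group, here is a small sketch on made-up data (the toy data frame and its column names are assumptions, not the houseID data from the question):

```r
library(zoo)

# Hypothetical toy data: group "b" starts with an NA
toy <- data.frame(g = c("a", "a", "a", "b", "b"),
                  v = c(1, NA, NA, NA, 5))

# na.locf0 keeps leading NAs, so each group keeps its original length
toy$filled <- ave(toy$v, toy$g, FUN = na.locf0)
toy$filled
# [1]  1  1  1 NA  5
```

The NA at the start of group "b" is not filled with the last value of group "a", which is the point of grouping.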

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

Filling NA values with last non-NA's if between repeated identical non-NA values

Here is a base R for loop solution.

Write a function that compares each pair of consecutive non-NA values and, if they are equal, fills the NA values between them with that value.

fill_NA_values <- function(x) {
  # Indices of the non-NA values
  non_na_values <- which(!is.na(x))
  # Loop over each pair of consecutive non-NA indices
  for (i in seq_along(non_na_values[-1])) {
    # If two consecutive non-NA values are the same
    if (x[non_na_values[i]] == x[non_na_values[i + 1]]) {
      # Fill the NA values in between with that value
      x[(non_na_values[i] + 1):(non_na_values[i + 1] - 1)] <- x[non_na_values[i]]
    }
  }
  x
}
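The same rule can also be written without a loop. The sketch below is an alternative, not the function above: it fills forward and backward and keeps a filled value only where both directions agree. The helper names fill_forward, fill_backward and fill_between_equal are made up for illustration.

```r
# Base R forward fill that keeps leading NAs
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # index of last non-NA so far, 0 if none
  c(NA, x)[idx + 1]
}

# A backward fill is a forward fill on the reversed vector
fill_backward <- function(x) rev(fill_forward(rev(x)))

fill_between_equal <- function(x) {
  f <- fill_forward(x)
  b <- fill_backward(x)
  # Fill only NAs whose nearest non-NA neighbours on both sides are equal
  same <- is.na(x) & !is.na(f) & !is.na(b) & f == b
  x[same] <- f[same]
  x
}

res <- fill_between_equal(c(NA, 1, NA, NA, 1, NA, 2, NA))
res
# [1] NA  1  1  1  1 NA  2 NA
```

The NA between 1 and 2 stays NA because its neighbours differ, and leading and trailing NAs are left alone because they have a neighbour on one side only.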

Apply this for multiple columns using lapply.

df[-1] <- lapply(df[-1], fill_NA_values)
df

# date X1 X3 X4
#1 2004-12-27 NA NA NA
#2 2004-12-28 2.299 2.349 2.348
#3 2004-12-29 2.299 2.349 2.348
#4 2005-01-03 2.299 2.349 2.348
#5 2005-01-04 2.299 2.349 2.348
#6 2005-01-05 2.299 2.349 2.348
#7 2005-01-06 2.299 2.349 2.348
#8 2005-01-10 2.299 2.349 2.348
#9 2005-01-11 2.299 2.349 2.348
#10 2005-01-12 2.299 NA NA
#11 2005-01-17 2.299 NA NA
#12 2005-01-18 2.299 NA NA
#13 2005-01-19 2.299 NA NA
#14 2005-01-24 2.299 NA NA
#15 2005-01-25 2.299 2.369 2.368
#16 2005-01-26 2.299 NA NA
#17 2005-01-31 2.299 NA NA
#18 2005-02-01 NA NA NA
#19 2005-02-02 NA NA NA
#20 2005-02-08 NA NA NA

Tidyverse: Replacing NAs with latest non-NA values using tidyverse tools

We can replace the NAs before 2017 with the value available for 2017 in each country.

library(dplyr)

df %>%
  group_by(country) %>%
  mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
# Similarly with ifelse:
# mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))

# country year value
# <chr> <int> <int>
#1 usa 2015 100
#2 usa 2016 100
#3 usa 2017 100
#4 usa 2018 NA
#5 aus 2015 50
#6 aus 2016 50
#7 aus 2017 50
#8 aus 2018 60

Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner

As the OP has mentioned that memory consumption is crucial, here is a data.table approach which uses the na.locf() function from the zoo package:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"),
by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1 domain D__Eukaryota D
2: A1 kingdom K__Chloroplastida K
3: A1 phylum P__K__Chloroplastida P
4: A1 class C__Mamiellophyceae C
5: A1 order O__C__Mamiellophyceae O
6: A1 family F__O__C__Mamiellophyceae F
7: A1 genus G__Crustomastix G
8: A1 species S__Crustomastix sp. MBIC10709 S

I've refrained from chaining the expressions, so the code can be executed step by step.

Note that data.table updates in place without copying the whole data set, which saves memory as well as time.
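One way to observe the update by reference is to compare the table's memory address before and after a := assignment. This is only a sketch on made-up data, using data.table's address() helper:

```r
library(data.table)

DT <- data.table(x = c(1, NA, 3))  # made-up data
before <- address(DT)              # memory address of the table

DT[, flag := is.na(x)]             # add a column by reference

after <- address(DT)
identical(before, after)           # expected TRUE: the table itself was not copied
```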

Prerequisites and additional explanations

In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.

However, as the solution above depends on these assumptions it is worthwhile to add some checks:

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
ln = 1:8,
Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, the Taxonomy must be specified at least for the top Level "domain". This can be double-checked with:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)

The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e. DT[(na), ..., in order to prepend the additional letters to Taxonomy where Taxonomy was originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy) which would cover duplicates as mentioned in this comment. Finally, I recognized that I can scrap this column because of the prerequisites.

Replace NA with previous or next value, by group, using dplyr

library(dplyr) # group_by and %>%
library(tidyr) # fill is part of tidyr

ps1 %>%
  group_by(userID) %>%
  # fill(color, age, gender) %>%   # default direction is "down"
  fill(color, age, gender, .direction = "downup")

Which gives you:

Source: local data frame [9 x 4]
Groups: userID [3]

userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
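As a quick illustration of what the .direction argument changes, here is a sketch on a made-up single-column data frame (not the ps1 data from the question):

```r
library(tidyr)

# Made-up example with leading, interior and trailing NAs
d <- data.frame(v = c(NA, "x", NA, "y", NA))

down   <- fill(d, v)$v                         # default: carry values downward
up     <- fill(d, v, .direction = "up")$v      # carry values upward
downup <- fill(d, v, .direction = "downup")$v  # down first, then up for what is left

down    # NA  "x" "x" "y" "y"
up      # "x" "x" "y" "y" NA
downup  # "x" "x" "x" "y" "y"
```

With "downup", only NAs that have no earlier value in their group are filled from below, which is why it suits the "previous or next value" requirement.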

Replace NA row with non-NA value from previous row and certain column

In the end I wrote my own vectorized version, which returns the expected output:

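na.replace <- function(x, k) {
  # Rows where column k is NA
  isNA <- is.na(x[, k])
  # Overwrite those rows with the last non-NA value of column k (requires zoo)
  x[isNA, ] <- zoo::na.locf(x[, k], na.rm = FALSE)[isNA]
  x
}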

UPDATE

Better solution, without any packages

na.lomf <- function(x) {
  if (length(x) > 0L) {
    non.na.idx <- which(!is.na(x))
    if (is.na(x[1L])) {
      non.na.idx <- c(1L, non.na.idx)
    }
    rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
  }
}

na.lomf(c(NA, 1, 2, NA, NA, 3, NA, NA, 4, NA))
# [1] NA 1 2 2 2 3 3 3 4 4
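The idea behind na.lomf is run-length expansion: each non-NA value is repeated until the next non-NA value starts. The steps can be traced by hand on a shorter made-up vector (the variable names starts and runs are only for illustration):

```r
x <- c(NA, 1, 2, NA, NA, 3)

starts <- which(!is.na(x))                  # 2 3 6: where each run begins
if (is.na(x[1L])) starts <- c(1L, starts)   # 1 2 3 6: a leading NA forms its own run

runs <- diff(c(starts, length(x) + 1L))     # 1 1 3 1: length of each run

out <- rep.int(x[starts], runs)
out
# [1] NA  1  2  2  2  3
```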

