Replacing NAs with latest non-NA value
You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2
Replace NAs for a group of values with a non-NA character in group in R
Here is an alternative way using na.locf from the zoo package:
library(zoo)
library(dplyr)
df %>%
group_by(participant_id) %>%
arrange(participant_id, test) %>%
mutate(test = zoo::na.locf(test, na.rm=FALSE))
participant_id test
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3
Populate NAs in a vector using prior non-NA values?
library(zoo)
na.locf(test)
[1] 1 2 2 2 5 5 9 9 9
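The test vector itself is not shown in the answer. As a minimal sketch, assuming an input like the one below (hypothetical, chosen only because it is consistent with the printed result), the call can be reproduced like this:

```r
library(zoo)

# Hypothetical input; the original `test` vector is not shown in the answer.
test <- c(1, 2, NA, NA, 5, NA, 9, NA, NA)

# Each NA takes the value of the closest non-NA element before it.
filled <- na.locf(test)
filled
```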
Replace missing values (NA) with most recent non-NA by group
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
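To make the difference concrete, the following sketch compares the three variants on a vector with a leading NA. The vector x is made up for illustration. With its default na.rm = TRUE, na.locf drops the leading NA and returns a shorter vector, which is why the length-preserving variants are the ones used inside mutate or ave.

```r
library(zoo)

x <- c(NA, 100, NA, 120, NA)  # illustrative values only

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)

short <- na.locf(x)    # leading NA removed: length 4
keep0 <- na.locf0(x)   # leading NA kept:    length 5
keep2 <- na.locf2(x)   # same values as na.locf0 for a plain vector

length(short)
length(keep0)
```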
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
Filling NA values with last non-NA's if between repeated identical non-NA values
Here is a base R for loop solution.
Write a function that compares two consecutive non-NA values and, if they are the same, fills the NA values between them with that value.
fill_NA_values <- function(x) {
#Index of non-NA values
non_na_values <- which(!is.na(x))
#loop over each index.
for(i in seq_along(non_na_values[-1])) {
#If two consecutive non-NA value are the same
if(x[non_na_values[i]] == x[non_na_values[i + 1]]) {
#Fill the NA values in between with the value.
x[(non_na_values[i] + 1):(non_na_values[i+1] -1)] <- x[non_na_values[i]]
}
}
x
}
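The same rule can also be written without an explicit loop. The sketch below is an alternative, not the code from the answer: it assumes the zoo package is available and uses a hypothetical helper name, fill_between_equal. It fills forward and backward with na.locf0 and keeps a filled value only where both directions agree, which happens exactly when the nearest non-NA values on either side are equal.

```r
library(zoo)

# Hypothetical helper: vectorised variant of the loop-based rule.
fill_between_equal <- function(x) {
  fwd <- na.locf0(x)                   # nearest non-NA value before each position
  bwd <- na.locf0(x, fromLast = TRUE)  # nearest non-NA value after each position
  agree <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  ifelse(agree, fwd, x)
}

x <- c(NA, 1, NA, NA, 1, NA, 2, NA)  # made-up example
res <- fill_between_equal(x)
res
# [1] NA  1  1  1  1 NA  2 NA
```

Because ifelse drops attributes, this sketch is meant for plain numeric or character vectors rather than factors or dates.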
Apply this to multiple columns using lapply.
df[-1] <- lapply(df[-1], fill_NA_values)
df
# date X1 X3 X4
#1 2004-12-27 NA NA NA
#2 2004-12-28 2.299 2.349 2.348
#3 2004-12-29 2.299 2.349 2.348
#4 2005-01-03 2.299 2.349 2.348
#5 2005-01-04 2.299 2.349 2.348
#6 2005-01-05 2.299 2.349 2.348
#7 2005-01-06 2.299 2.349 2.348
#8 2005-01-10 2.299 2.349 2.348
#9 2005-01-11 2.299 2.349 2.348
#10 2005-01-12 2.299 NA NA
#11 2005-01-17 2.299 NA NA
#12 2005-01-18 2.299 NA NA
#13 2005-01-19 2.299 NA NA
#14 2005-01-24 2.299 NA NA
#15 2005-01-25 2.299 2.369 2.368
#16 2005-01-26 2.299 NA NA
#17 2005-01-31 2.299 NA NA
#18 2005-02-01 NA NA NA
#19 2005-02-02 NA NA NA
#20 2005-02-08 NA NA NA
Tidyverse: Replacing NAs with latest non-NA values *using tidyverse tools*
We can replace the NAs before 2017 with the value available in 2017 for each country.
library(dplyr)
df %>%
group_by(country) %>%
mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
#Similarly with ifelse
#mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))
# country year value
# <chr> <int> <int>
#1 usa 2015 100
#2 usa 2016 100
#3 usa 2017 100
#4 usa 2018 NA
#5 aus 2015 50
#6 aus 2016 50
#7 aus 2017 50
#8 aus 2018 60
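The input data frame is not shown in the answer. The sketch below uses a hypothetical df, constructed only to be consistent with the output above, and assumes each country has exactly one row for 2017. Note that the NA in 2018 is left alone because the condition only covers years before 2017.

```r
library(dplyr)

# Hypothetical input, constructed to be consistent with the output shown above.
df <- data.frame(
  country = rep(c("usa", "aus"), each = 4),
  year    = rep(2015:2018, times = 2),
  value   = c(NA, NA, 100, NA, NA, NA, 50, 60)
)

res <- df %>%
  group_by(country) %>%
  mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017])) %>%
  ungroup()

res$value
```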
Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner
As the OP has mentioned that memory consumption is crucial, here is a data.table approach which uses the na.locf() function from the zoo package:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"),
by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
ID Level Taxonomy Letter
1: A1 domain D__Eukaryota D
2: A1 kingdom K__Chloroplastida K
3: A1 phylum P__K__Chloroplastida P
4: A1 class C__Mamiellophyceae C
5: A1 order O__C__Mamiellophyceae O
6: A1 family F__O__C__Mamiellophyceae F
7: A1 genus G__Crustomastix G
8: A1 species S__Crustomastix sp. MBIC10709 S
I've refrained from chaining the expressions, so the code can be executed step by step.
Note that data.table updates in place without copying the whole data set, which saves memory as well as time.
Prerequisites and additional explanations
In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.
However, as the solution above depends on these assumptions it is worthwhile to add some checks:
# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
ln = 1:8,
Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]
In addition, it has to be ensured that the Taxonomy is specified at least for the top Level "domain". This can be double-checked with:
# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)
The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e. DT[(na), ..., in order to prepend the additional letters to Taxonomy where Taxonomy was originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy) which would cover duplicates as mentioned in this comment. Finally, I recognized that I can scrap this column because of the prerequisites.
Replace NA with previous or next value, by group, using dplyr
library(dplyr) #group_by and %>% come from dplyr
library(tidyr) #fill is part of tidyr
ps1 %>%
group_by(userID) %>%
#fill(color, age, gender) %>% #default direction down
fill(color, age, gender, .direction = "downup")
Which gives you:
Source: local data frame [9 x 4]
Groups: userID [3]
userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
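Since ps1 is not shown, here is a minimal sketch on made-up data (the toy data frame and its values are assumptions) illustrating what .direction = "downup" does within each group: values are first carried down, and any NAs still left at the top of a group are then filled from below.

```r
library(dplyr)
library(tidyr)

# Hypothetical data; the original ps1 is not shown.
toy <- data.frame(
  userID = c(21, 21, 21, 22, 22),
  color  = c(NA, "blue", NA, "red", NA)
)

filled <- toy %>%
  group_by(userID) %>%
  fill(color, .direction = "downup") %>%
  ungroup()

filled$color
# [1] "blue" "blue" "blue" "red"  "red"
```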
Replace NA row with non-NA value from previous row and certain column
Finally, I wrote my own vectorized version. It returns the expected output:
na.replace <- function(x, k) {
isNA <- is.na(x[, k])
x[isNA, ] <- zoo::na.locf(x[, k], na.rm = FALSE)[isNA]
x
}
UPDATE
Better solution, without any packages
na.lomf <- function(x) {
if (length(x) > 0L) {
non.na.idx <- which(!is.na(x))
if (is.na(x[1L])) {
non.na.idx <- c(1L, non.na.idx)
}
rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
}
}
na.lomf(c(NA, 1, 2, NA, NA, 3, NA, NA, 4, NA))
# [1] NA 1 2 2 2 3 3 3 4 4
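Because na.lomf always returns a vector of the same length as its input, it can be passed to ave for a package-free fill by group. The sketch below repeats the function so that it runs on its own. The houseID and price vectors are made up for illustration.

```r
# na.lomf as defined above, repeated so the example is self-contained
na.lomf <- function(x) {
  if (length(x) > 0L) {
    non.na.idx <- which(!is.na(x))
    if (is.na(x[1L])) {
      non.na.idx <- c(1L, non.na.idx)
    }
    rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
  }
}

houseID <- c(1, 1, 1, 2, 2, 2)          # made-up grouping variable
price   <- c(NA, 100, NA, NA, NA, 30)   # made-up values

# Fill within each group; leading NAs of a group stay NA.
filled <- ave(price, houseID, FUN = na.lomf)
filled
# [1]  NA 100 100  NA  NA  30
```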