Replace missing values (NA) with most recent non-NA by group
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
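To make the difference between these functions concrete, here is a small sketch on a made-up vector (the values are illustrative and not taken from the input data). It assumes a zoo version recent enough to provide na.locf0.

```r
library(zoo)

x <- c(NA, 1, NA, 2)   # illustrative vector with a leading NA

# na.locf() defaults to na.rm = TRUE, so the leading NA is dropped
trimmed <- na.locf(x)      # values 1 1 2 (length 3)

# na.locf0() keeps the leading NA, so the length is preserved
kept <- na.locf0(x)        # values NA 1 1 2 (length 4)

# the wrapper used in the solutions keeps the leading NA as well
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
wrapped <- na.locf2(x)     # values NA 1 1 2 (length 4)
```

Preserving the length matters in grouped operations, where the filled vector has to line up with the rows of its group.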
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED Re-arranged and added more solutions. Revised dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
Replace NAs for a group of values with a non-NA character in group in R
Here is an alternative way using na.locf from the zoo package:
library(zoo)
library(dplyr)
df %>%
group_by(participant_id) %>%
arrange(participant_id, test) %>%
mutate(test = zoo::na.locf(test, na.rm=FALSE))
participant_id test
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3
How to replace NA with most recent non-NA by group?
As another base R solution, here is a poor man's na.locf
fill_down <- function(v) {
if (length(v) > 1) {
keep <- c(TRUE, !is.na(v[-1]))
v[keep][cumsum(keep)]
} else v
}
To fill down by group, the approach is to use tapply() to split and apply to each group, and split<- to combine groups to the original geometry, as
fill_down_by_group <- function(v, grp) {
## original 'by hand':
## split(v, grp) <- tapply(v, grp, fill_down)
## v
## done by built-in function `ave()`
ave(v, grp, FUN=fill_down)
}
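As a quick check of the two helpers, here is a minimal base R sketch on made-up data (the vector v and the grouping grp are illustrative, not from the question). It repeats the definition of fill_down() so that it runs on its own.

```r
# poor man's na.locf: carry the last non-NA value forward
fill_down <- function(v) {
  if (length(v) > 1) {
    keep <- c(TRUE, !is.na(v[-1]))   # always keep the first element
    v[keep][cumsum(keep)]
  } else v
}

v   <- c(NA, 1, NA, 2, NA, NA, 3)              # illustrative values
grp <- c("a", "a", "a", "b", "b", "c", "c")    # illustrative groups

# without grouping, values leak across group boundaries
ungrouped <- fill_down(v)            # NA 1 1 2 2 2 3

# ave() applies fill_down() within each group and restores the order
grouped <- ave(v, grp, FUN = fill_down)   # NA 1 1 2 2 NA 3
```

Note the sixth element: it stays NA in the grouped result because group "c" has no earlier non-NA value to carry forward.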
To process multiple columns, one might
elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
Notes
I would be interested in seeing how a dplyr solution handles many columns without hard-coding each. Answering my own question, I guess this is
library(dplyr); library(tidyr)
df %>% group_by(name) %>% fill_(elts)
A more efficient base solution when the groups are already 'grouped' (e.g., identical(grp, sort(grp))) is
fill_down_by_grouped <- function(v, grp) {
  if (length(v) > 1) {
    ## keep the first element of each group (even if NA) and every non-NA value
    keep <- !(duplicated(grp) & is.na(v))
    v[keep][cumsum(keep)]
  } else v
}
For me, fill_down() on a vector with about 10M elements takes ~225ms; fill_down_by_grouped() takes ~300ms independent of the number of groups; fill_down_by_group() scales with the number of groups: ~2s for 10000 groups, about 36s for 10M groups.
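The single-pass idea for already-sorted groups can be sketched as follows. This is an illustrative version under the assumption that the duplicate check is keyed on grp, so that the first element of every group is always kept; the name fill_down_sorted and the data are made up for the example.

```r
# single-pass fill for groups that are already contiguous (sorted)
fill_down_sorted <- function(v, grp) {
  # keep the first element of each group and every non-NA value
  keep <- !(duplicated(grp) & is.na(v))
  v[keep][cumsum(keep)]
}

v   <- c(NA, 1, NA, 2, NA, NA, 3)   # illustrative values
grp <- c(1, 1, 1, 2, 2, 3, 3)       # illustrative groups, already sorted

single_pass <- fill_down_sorted(v, grp)   # NA 1 1 2 2 NA 3
```

Because the first element of each group is always kept, a group that starts with NA keeps that NA instead of inheriting a value from the previous group.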
Fill NAs with either last or next non NA value in R
Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.
Here is the code, using a spiced up version of your example:
library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"),
year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L),
pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L),
pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)),
class = "data.frame", row.names = c(NA, -18L))
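fill_nearest = function(x){
  keys = which(!is.na(x))
  # nothing to impute from: return the all-NA input unchanged
  if(length(keys)==0) return(x)
  # for each position, index of the closest non-missing value (ties go to the earlier one)
  b = map_dbl(seq_along(x), ~keys[which.min(abs(.x-keys))])
  x[b]
}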
df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(pop_imputated = fill_nearest(pop)) %>%
ungroup()
#> # A tibble: 18 x 5
#> id year pop pop_exp pop_imputated
#> <chr> <int> <int> <int> <int>
#> 1 E1 2000 NA NA NA
#> 2 E2 2000 NA 120 120
#> 3 E2 2001 NA 120 120
#> 4 E2 2003 120 120 120
#> 5 E2 2005 125 125 125
#> 6 E3 1999 115 115 115
#> 7 E3 2001 300 300 300
#> 8 E3 2003 NA 300 300
#> 9 E4 2004 10 10 10
#> 10 E4 2005 NA 10 10
#> 11 E4 2006 NA 10 10
#> 12 E4 2007 NA 9 9
#> 13 E4 2008 NA 9 9
#> 14 E4 2009 9 9 9
#> 15 E4 2018 NA 9 9
#> 16 E4 2019 8 8 8
#> 17 E5 2002 12 12 12
#> 18 E5 2003 80 80 80
Created on 2021-05-13 by the reprex package (v2.0.0)
As I had to use a purrr loop, it might get a bit slow on a huge dataset though.
EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference to calculate the "distance" between the values. For instance, you would rather have row 15 as 8 than as 9 because the year is much closer.
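The idea of measuring distance on a reference column can be sketched in base R as below. This is not the version from the linked issue; the function name fill_nearest_by and its ref argument are hypothetical, and the data reproduces only the E4 group from the example above.

```r
# impute each element from the non-missing value whose `ref` is closest
# (hypothetical helper; ties go to the earlier non-missing value)
fill_nearest_by <- function(x, ref) {
  keys <- which(!is.na(x))
  if (length(keys) == 0) return(x)   # nothing to impute from
  nearest <- vapply(
    seq_along(x),
    function(i) keys[which.min(abs(ref[i] - ref[keys]))],
    integer(1)
  )
  x[nearest]
}

year <- c(2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L)
pop  <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)

by_year <- fill_nearest_by(pop, year)   # 10 10 10 9 9 9 8 8
```

With distance measured in years, the 2018 row takes the 2019 value (8) rather than the 2009 value (9), which is the behaviour described above for row 15.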
Replacing NAs with latest non-NA value
You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2
fill in NA based on the last non-NA value for each group in R
This may have been answered before, but I don't know if it's been answered in a dplyr context. zoo::na.locf() is your friend (with na.rm = FALSE, so that a group starting with NA keeps its length):
m %>% group_by(y1) %>% mutate(y4 = zoo::na.locf(y3, na.rm = FALSE))
Replace NA values if last and next non-NA value are the same
You can fill forwards and backwards, then set the rows where they don't match to NA.
library(zoo)
library(dplyr)
df %>%
mutate_if(is.factor, as.character) %>%
group_by(ID) %>%
mutate(result = na.locf(with_missing, fromLast = T),
result = ifelse(result == na.locf(with_missing), result, NA))
# ID with_missing desired_result result
# 1 1 a a a
# 2 1 a a a
# 3 1 <NA> a a
# 4 1 <NA> a a
# 5 1 a a a
# 6 1 a a a
# 7 2 a a a
# 8 2 a a a
# 9 2 <NA> <NA> <NA>
# 10 2 b b b
# 11 2 b b b
# 12 2 b b b
# 13 3 a a a
# 14 3 <NA> <NA> <NA>
# 15 3 <NA> <NA> <NA>
# 16 3 <NA> <NA> <NA>
# 17 3 c c c
# 18 3 c c c
# 19 4 b b b
# 20 4 <NA> <NA> <NA>
# 21 4 a a a
# 22 4 a a a
# 23 4 a a a
# 24 4 a a a
# 25 5 a a a
# 26 5 <NA> a a
# 27 5 <NA> a a
# 28 5 <NA> a a
# 29 5 <NA> a a
# 30 5 a a a
# 31 6 a a a
# 32 6 a b a
# 33 6 <NA> b <NA>
# 34 6 b b b
# 35 6 a a a
# 36 6 a a a
# 37 7 a a a
# 38 7 a a a
# 39 7 <NA> a a
# 40 7 <NA> a a
# 41 7 a a a
# 42 7 a a a
# 43 8 a a a
# 44 8 a a a
# 45 8 <NA> <NA> <NA>
# 46 8 b b b
# 47 8 b b b
# 48 8 b b b
# 49 9 a a a
# 50 9 <NA> <NA> <NA>
# 51 9 <NA> <NA> <NA>
# 52 9 <NA> <NA> <NA>
# 53 9 c c c
# 54 9 c c c
# 55 10 b b b
# 56 10 <NA> <NA> <NA>
# 57 10 a a a
# 58 10 a a a
# 59 10 a a a
# 60 10 a a a
Tidyverse: Replacing NAs with latest non-NA values *using tidyverse tools*
We can replace the NAs before 2017 with the value available in 2017 for each country.
library(dplyr)
df %>%
group_by(country) %>%
mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
#Similarly with ifelse
#mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))
# country year value
# <chr> <int> <int>
#1 usa 2015 100
#2 usa 2016 100
#3 usa 2017 100
#4 usa 2018 NA
#5 aus 2015 50
#6 aus 2016 50
#7 aus 2017 50
#8 aus 2018 60
Replace NA when last and next non-NA values are equal
Perform an na.locf0 both forwards and backwards; if the two results are the same, use the common value, otherwise use NA. The grouping is done with ave.
library(zoo)
filler <- function(x) {
forward <- na.locf0(x)
backward <- na.locf0(x, fromLast = TRUE)
ifelse(forward == backward, forward, NA)
}
transform(dat, message = ave(message, id, FUN = filler))
giving:
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8