Fill in Na Based on the Last Non-Na Value for Each Group in R

fill in NA based on the last non-NA value for each group in R

This may have been answered before, but I don't know if it's been answered in a dplyr context. zoo::na.locf() is your friend:

m %>% group_by(y1) %>% mutate(y4=zoo::na.locf(y3))

Fill NAs with either last or next non NA value in R

Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.

Here is the code, using a spiced up version of your example:

library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"),
year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L),
pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L),
pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)),
class = "data.frame", row.names = c(NA, -18L))

fill_nearest = function(x){
keys=which(!is.na(x))
if(length(keys)==0) return(NA)
b = map_dbl(seq.int(x), ~keys[which.min(abs(.x-keys))])
x[b]
}

df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(pop_imputated = fill_nearest(pop)) %>%
ungroup()
#> # A tibble: 18 x 5
#> id year pop pop_exp pop_imputated
#> <chr> <int> <int> <int> <int>
#> 1 E1 2000 NA NA NA
#> 2 E2 2000 NA 120 120
#> 3 E2 2001 NA 120 120
#> 4 E2 2003 120 120 120
#> 5 E2 2005 125 125 125
#> 6 E3 1999 115 115 115
#> 7 E3 2001 300 300 300
#> 8 E3 2003 NA 300 300
#> 9 E4 2004 10 10 10
#> 10 E4 2005 NA 10 10
#> 11 E4 2006 NA 10 10
#> 12 E4 2007 NA 9 9
#> 13 E4 2008 NA 9 9
#> 14 E4 2009 9 9 9
#> 15 E4 2018 NA 9 9
#> 16 E4 2019 8 8 8
#> 17 E5 2002 12 12 12
#> 18 E5 2003 80 80 80

Created on 2021-05-13 by the reprex package (v2.0.0)

As I had to use a purrr loop, it might get a bit slow in a huge dataset though.

EDIT: I suggested to add this option in tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function to use the year column as the reference to calculate the "distance" between the values. For instance, you would rather have row 15 as 8 than as 9 because the year is much closer.

Replace NAs for a group of values with a non-NA character in group in R

Here is an alternative way using na.locf from zoo package:

library(zoo)
library(dplyr)
df %>%
group_by(participant_id) %>%
arrange(participant_id, test) %>%
mutate(test = zoo::na.locf(test, na.rm=FALSE))
   participant_id test 
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3

Fill missing values (NA) before the first non-NA value by group

You could use cumall(is.na(age)) to find the positions before the first non-NA value.

library(dplyr)

data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()

# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9

How to select only first non NA value of each group in R?

A dplyr alternative. Assuming that by "first" you simply mean the first row, in the order given, by group.

Note that (Id, VISIT) in your example data gives 2 groups for Baseline.

library(dplyr)

mydata %>%
group_by(Id, VISIT) %>%
filter(!is.na(Score)) %>%
slice(1) %>%
ungroup()

Result:

# A tibble: 5 x 3
Id VISIT Score
<dbl> <chr> <dbl>
1 1 Baseline 2
2 1 Screeing 1
3 1 Week 9 78
4 2 Baseline 5
5 2 Week 2 3

Get the latest non-NA value based on date column by group

library(dplyr)
library(tidyr)

df1 %>%
mutate(date = as.Date(date)) %>%
group_by(country_name) %>%
arrange(date) %>%
select(-date) %>%
fill(everything()) %>%
slice(n())

#> # A tibble: 2 x 4
#> # Groups: country_name [2]
#> country_name column_1 column_2 column_3
#> <chr> <dbl> <int> <dbl>
#> 1 UK 0.5 3 NA
#> 2 US 10 3 7.3

Data:

read.table(text = "country_name  date         column_1 column_2 column_3
US 2016-11-02 7.5 NA NA
US 2017-09-12 NA NA 9
US 2017-09-19 NA 8 10
US 2020-02-10 10 NA NA
US 2021-03-10 NA NA 7.3
US 2021-05-02 NA 3 NA
UK 2016-11-02 NA 2 NA
UK 2017-09-12 0.5 3 NA",
header = T, stringsAsFactors = F) -> df1

Replacing NAs with latest non-NA value

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6
9 3 2 3 2

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))

REVISED Re-arranged and added more solutions. Revised dplyr/zoo solution to conform to latest changes dplyr. Applied fixed and factored out na.locf2 from all solutions.

How to get value of last non-NA column

You can use max.col with ties.method set as "last" to get last non-NA value in each row.

test$val <- test[cbind(1:nrow(test), max.col(!is.na(test), ties.method = 'last'))]
test

# date a b c val
#1 2020-01-01 4 NA NA 4
#2 2020-01-02 3 2 NA 2
#3 2020-01-03 4 1 5 5

creating a table of last non-NA values that only stores last non-NA values

maybe such a solution will be useful

tidyverse

df <- data.frame(
id = c(rep('a', 5), rep('b', 5), rep('c', 5)),
year = c(seq(2011, 2015), seq(2011, 2015), seq(2011, 2015)),
x = c(c(1:3, NA, NA), c(1:5), rep(NA, 5)),
y = c(c(NA, NA, 3, 4, NA), c(NA, 2, 3, 4, NA), rep(NA, 5)),
z = c(c(1:5), c(1:4, NA), rep(NA, 5))
)

library(tidyverse)
df %>%
pivot_longer(-c(id, year), values_drop_na = TRUE) %>%
group_by(id, name) %>%
slice_tail(n = 1) %>%
ungroup() %>%
pivot_wider(id_cols = c(id, year), names_from = name, values_from = value)
#> # A tibble: 5 x 5
#> id year x y z
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 a 2013 3 NA NA
#> 2 a 2014 NA 4 NA
#> 3 a 2015 NA NA 5
#> 4 b 2015 5 NA NA
#> 5 b 2014 NA 4 4

Created on 2021-11-24 by the reprex package (v2.0.1)

data.table


library(data.table)
library(magrittr)

melt(data = setDT(df), id.vars = c("id", "year"), na.rm = TRUE) %>%
.[order(id, year), last(.SD), by = list(id, variable)] %>%
dcast(formula = id + year ~ variable)

#> id year x y z
#> 1: a 2013 3 NA NA
#> 2: a 2014 NA 4 NA
#> 3: a 2015 NA NA 5
#> 4: b 2014 NA 4 4
#> 5: b 2015 5 NA NA

Created on 2021-11-24 by the reprex package (v2.0.1)



Related Topics



Leave a reply



Submit