Extract Last Non-Missing Value in Row with data.table

How to get value of last non-NA column

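You can use max.col() with ties.method = "last" to get the last non-NA value in each row. Applied to the logical matrix !is.na(test), it returns, for each row, the index of the last column that is not NA; that index is then used to pick the value.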

test$val <- test[cbind(1:nrow(test), max.col(!is.na(test), ties.method = 'last'))]
test

# date a b c val
#1 2020-01-01 4 NA NA 4
#2 2020-01-02 3 2 NA 2
#3 2020-01-03 4 1 5 5
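For readers who want to run this end to end, here is a minimal sketch. The data frame is rebuilt by hand from the printed output above, so its exact construction is an assumption. The sketch also restricts the lookup to the value columns a, b and c (an illustrative choice, not part of the original answer), so the result stays numeric and the date column can never be picked.

```r
# Illustration data reconstructed from the printed output (assumed, not the asker's original).
test <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  a = c(4, 3, 4),
  b = c(NA, 2, 1),
  c = c(NA, NA, 5)
)

# Work on the value columns only, as a numeric matrix.
m <- as.matrix(test[c("a", "b", "c")])

# For each row: index of the last column that is not NA.
last_idx <- max.col(!is.na(m), ties.method = "last")

# Matrix indexing with (row, column) pairs picks one value per row.
test$val <- m[cbind(seq_len(nrow(m)), last_idx)]
test$val
```

If a row has no non-NA value at all, max.col() still returns the last column, so the picked value is simply NA.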

Extract last non-missing value in row with data.table

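Here's another way: add a result column, then walk the columns from right to left and fill res wherever it is still NA. rev(names(dat))[-1] drops res itself, which is the last column once the first line has run.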

dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]

   X1 X2 X3 X4 X5 res
1:  u NA NA NA NA   u
2:  f  q NA NA NA   q
3:  f  b  w NA NA   w
4:  k  g  h NA NA   h
5:  u  b  r NA NA   r
6:  f  q  w  x  t   t
7:  u  g  h  i  e   e
8:  u  q  r  n  t   t
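The loop can be tried on a small table as follows. This is only a sketch: the three-row table is made up for illustration, and value_cols is a hypothetical helper variable that records the column names before res is added, which avoids relying on the [-1] trick.

```r
library(data.table)

# Small made-up table; NA inside c() with strings becomes NA_character_.
dat <- data.table(
  X1 = c("u", "f", "f"),
  X2 = c(NA, "q", "b"),
  X3 = c(NA, NA, "w")
)

value_cols <- names(dat)          # remember the value columns before adding res
dat[, res := NA_character_]

# Right to left: only rows whose res is still NA get filled.
for (v in rev(value_cols)) {
  dat[is.na(res), res := get(v)]
}

dat$res
```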

Benchmarks

Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. I'm just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.

Functions:

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
  if(!length(wh)) return(ans)
  ans[wh] = as.character(x[[length(x)]])[wh]
  Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}

alex2 = function(x){
  x[, res := NA_character_]
  wh = x[, .I]
  for (v in (length(x)-1):1){
    if (!length(wh)) break
    set(x, j="res", i=wh, v = x[[v]][wh])
    wh = wh[is.na(x$res[wh])]
  }
  x$res
}

frank = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
  return(x$res)
}

frank2 = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
  x$res
}

Example data and benchmark:

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)

library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)

Unit: milliseconds
         expr       min        lq      mean    median         uq        max neval
  frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
 frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
   alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
  alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30
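The data.table functions above add res to their input by reference. As a rough sketch, the shrinking-index idea behind alex2 (only revisit rows that are still unresolved) can also be written without touching the input. last_non_na is a hypothetical helper name, and the small table is made-up illustration data, not the benchmark data; no timing claim is made for this variant.

```r
library(data.table)

# Hypothetical helper: returns the last non-NA value per row as character,
# leaving x unchanged. Works on any list of equal-length columns.
last_non_na <- function(x) {
  res <- rep(NA_character_, length(x[[1L]]))
  wh <- seq_along(res)                 # rows still unresolved
  for (j in rev(seq_along(x))) {       # columns, right to left
    if (!length(wh)) break
    res[wh] <- as.character(x[[j]][wh])
    wh <- wh[is.na(res[wh])]           # keep only rows that are still NA
  }
  res
}

small <- data.table(
  X1 = c("u", "f", "k"),
  X2 = c(NA, "q", "g"),
  X3 = c(NA, NA, "h")
)
out <- last_non_na(small)
out
```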

keep last non missing observation for all variables by group

Using data.table:

library(data.table)

d[, lapply(.SD, function(x) last(na.omit(x))), g]

# g a b c
#1: 1 1 2 <NA>
#2: 2 4 4 c
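A self-contained version of the grouped call might look like the sketch below. The table d is rebuilt from the printed output, so its rows are an assumption, and last_obs is a hypothetical helper that returns an explicit NA when a group has no non-missing value instead of relying on how a zero-length result is filled.

```r
library(data.table)

# Illustration data reconstructed from the printed output (assumed).
d <- data.table(
  g = c(1, 1, 2, 2),
  a = c(1, NA, 3, 4),
  b = c(NA, 2, 4, NA),
  c = c(NA, NA, "c", NA)
)

# Hypothetical helper: last non-NA element, or a typed NA if there is none.
last_obs <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep)) keep[length(keep)] else x[1L]  # x[1L] is NA when all are NA
}

res <- d[, lapply(.SD, last_obs), by = g]
res
```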

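Getting the position of the last non-NA value in a row in an R data.table

We can use max.col(). Because max.col() returns the last column even for a row that is entirely NA, the result is multiplied by +(rowSums(...) > 0), which turns the position into 0 for such rows: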

max.col(!is.na(dt[, -1]), ties.method = 'last') * +(rowSums(!is.na(dt[,-1])) > 0)
#[1] 4 2 3 0
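To make the two factors visible, here is a small sketch that computes them separately. The table dt and its column names are made up so that the result matches the output shown above; the first column plays the role of an id column that is dropped with dt[, -1].

```r
library(data.table)

# Made-up data: the first column is an id, the rest are value columns.
dt <- data.table(
  id = 1:4,
  a = c(1, 1, NA, NA),
  b = c(2, 2, 2, NA),
  c = c(3, NA, 3, NA),
  d = c(4, NA, NA, NA)
)

non_na <- !is.na(as.matrix(dt[, -1]))               # logical matrix of the value columns
last_pos <- max.col(non_na, ties.method = "last")   # last TRUE per row (last column if none)
has_any <- rowSums(non_na) > 0                      # FALSE for all-NA rows

pos <- last_pos * has_any                           # all-NA rows become 0
pos
```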

Extract and collapse non-missing elements by row in the data.table

Using melt():

data[, row := .I
][, melt(.SD, id.vars = "row")
][order(row, value), paste0(unique(value[!is.na(value)]), collapse = "&&&"), by = row]

row V1
1: 1 1
2: 2
3: 3 1
4: 4 1
5: 5 1&&&2
6: 6 1&&&2
7: 7 2
8: 8 1&&&2
9: 9 1&&&2
10: 10 2
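The same chain can be unrolled into named steps, which may be easier to follow. This is a sketch on made-up data; the column names x, y, z and the result name collapsed are assumptions for illustration.

```r
library(data.table)

# Made-up data with one all-NA row.
data <- data.table(
  x = c(1, NA, 2),
  y = c(2, NA, 2),
  z = c(NA, NA, 1)
)

data[, row := .I]                       # remember the original row number
long <- melt(data, id.vars = "row")     # one line per (row, column) pair

# Sort values within each row, drop NAs and duplicates, then paste together.
res <- long[order(row, value),
            .(collapsed = paste0(unique(value[!is.na(value)]), collapse = "&&&")),
            by = row]
res
```

A row without any non-missing value collapses to an empty string, which is why some rows in the output above look blank.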

Alternatively, using your original function:

data[, function_non_missing(unlist(.SD)), by = 1:nrow(data)]

nrow V1
1: 1 1
2: 2
3: 3 2
4: 4 1&&&&2
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 1
8: 8 2
9: 9 1&&&&2
10: 10 1&&&&2

Get value of last non-NA row per column in data.table

If the dataset is a data.table, loop over the columns in .SD (the Subset of Data), keep the non-NA elements of each column (x[!is.na(x)]), and extract the last of them with tail().

df1[, lapply(.SD, function(x) tail(x[!is.na(x)],1))]
# a b c
#1: 63 57 4
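A runnable sketch of this call is shown below. The table df1 is made up so that it produces the same result as the output above; the original data is not shown in the source.

```r
library(data.table)

# Made-up data: each column ends with one or more NAs.
df1 <- data.table(
  a = c(1, 63, NA),
  b = c(57, NA, NA),
  c = c(NA, 4, NA)
)

# Per column: drop NAs, then take the last remaining element.
res <- df1[, lapply(.SD, function(x) tail(x[!is.na(x)], 1))]
res
```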

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Here are some data.table-based solutions.

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]

# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>

method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of the overhead of calling [ on .SD for every group)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
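Methods 2 and 3 rely on the same trick: cumsum(!is.na(type)) stops growing after the last non-NA value, and which.max() returns the first position of the maximum, which is exactly that last non-NA row. For a group that is entirely NA the cumulative sum is all zeros, so the first row is kept. The sketch below shows method 2 on a small made-up table; df and idx are illustrative names.

```r
library(data.table)

# Made-up data with duplicate (id, year) groups.
df <- data.table(
  id   = c(1, 1, 2, 2, 3),
  year = c(2002, 2002, 2008, 2008, 2013),
  type = c("A", NA, NA, "B", NA)
)

# .I holds global row numbers, so idx can be used to subset df directly.
idx <- df[, .I[which.max(cumsum(!is.na(type)))], by = .(id, year)]$V1
res <- df[idx]
res
```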

How to fill NA with last non-missing value from previous columns?

You could use

library(dplyr)
df %>%
mutate(V5 = coalesce(V4, V3, V2, V1))

This returns

# A tibble: 7 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.19 2.45 0.83 0.87 0.87
2 1.13 0.79 0.68 5.43 5.43
3 1.18 1.09 1.04 NA 1.04
4 1.11 1.1 4.24 NA 4.24
5 1.16 1.13 NA NA 1.13
6 1.18 NA NA NA 1.18
7 1.44 NA 9.17 NA 9.17

Or, more generally, from https://github.com/tidyverse/funs/issues/54#issuecomment-892377998

df %>% 
mutate(V5 = do.call(coalesce, rev(across(-V5))))

or https://github.com/tidyverse/funs/issues/54#issuecomment-1096449488

df %>% 
mutate(V5 = coalesce(!!!rev(select(., -V5))))
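For comparison, the same right-to-left coalescing can be sketched in base R with Reduce(). This is a swapped-in technique, not the dplyr code from the linked comments; coalesce2 is a hypothetical two-argument helper, and the data frame reuses three rows from the output above.

```r
# Three rows taken from the printed tibble above (rows 1, 5 and 7).
df <- data.frame(
  V1 = c(1.19, 1.16, 1.44),
  V2 = c(2.45, 1.13, NA),
  V3 = c(0.83, NA, 9.17),
  V4 = c(0.87, NA, NA)
)

# Hypothetical helper: keep a where it is present, otherwise fall back to b.
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# Start with the last column and fall back to earlier ones, column by column.
df$V5 <- Reduce(coalesce2, rev(as.list(df)))
df$V5
```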

update non-missing values based on most recent date

A data.table option

setDT(data)[, employ := last(na.omit(employ[order(year)])), id]

gives

    id year employ
 1:  1 2010    yes
 2:  1 2011    yes
 3:  2 2010    yes
 4:  2 2011    yes
 5:  3 2010     no
 6:  3 2011     no
 7:  4 2010    yes
 8:  4 2011    yes
 9:  5 2010     no
10:  5 2011     no
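The data.table version can be tried on a small table as sketched below. The input is made up (the original data is not shown in the source), and tail() on the non-NA values stands in for last(na.omit(...)). The second group is deliberately stored out of year order to show why order(year) matters.

```r
library(data.table)

# Made-up data; group 2 is not sorted by year.
data <- data.table(
  id     = c(1, 1, 2, 2),
  year   = c(2010, 2011, 2011, 2010),
  employ = c("yes", NA, NA, "no")
)

# Within each id: sort by year, drop NAs, take the most recent value,
# and recycle it over the whole group.
data[, employ := {
  v <- employ[order(year)]
  tail(v[!is.na(v)], 1L)
}, by = id]

data$employ
```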

A dplyr way might be

data %>%
group_by(id) %>%
mutate(employ = last(na.omit(employ[order(year)])))

which gives

      id  year employ
   <dbl> <dbl> <chr>
 1     1  2010 yes
 2     1  2011 yes
 3     2  2010 yes
 4     2  2011 yes
 5     3  2010 no
 6     3  2011 no
 7     4  2010 yes
 8     4  2011 yes
 9     5  2010 no
10     5  2011 no
