How to Implement Coalesce Efficiently in R

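On my machine, a Reduce-based implementation runs about 4 to 5 times faster than the coalesce function it is benchmarked against (median of roughly 23 versus 102 microseconds):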

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    # positions still missing in the running result
    i <- which(is.na(x))
    # fill them from the next vector
    x[i] <- y[i]
    x
  },
  list(...))
}

> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
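
The benchmark inputs are not shown above. As a minimal sketch, assuming three hypothetical numeric vectors of equal length (the names first, second, and third are illustrative only), coalesce2 fills each NA from the next vector that has a value at that position:

```r
# Reduce-based coalesce, repeated here so the sketch is self-contained
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}

# Hypothetical inputs (not the vectors used in the benchmark)
first  <- c(1, NA, NA, 4)
second <- c(NA, 2, NA, 40)
third  <- c(10, 20, 3, NA)

# Earlier arguments take priority over later ones
result <- coalesce2(first, second, third)
print(result)
```

Note that this version indexes each later vector by position, so a shorter vector is not recycled; the sketch assumes all inputs have the same length.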

How to use Coalesce function on a dataframe

coalesce() only treats NA as missing, not empty strings. Convert the blanks to NA first, then try:

library(dplyr)

df[df == ''] <- NA
df %>% mutate(RCC = coalesce(RC3, RC2, RC1))

#     R1       R2  RC1   RC2      RC3      RCC
#1 15515      515   AW SSSBB KKAJDJHW KKAJDJHW
#2  5156 5156.11-   FG  <NA> XVVJAKWA XVVJAKWA
#3 65656     415-   ZA  <NA>     <NA>       ZA
#4  1566    1455-   ZI ZXXQA     <NA>    ZXXQA
#5  2857      886 <NA>  <NA>     <NA>     <NA>
#6  8888      888   CW CQAER CDDGAJJA CDDGAJJA
#7 65656      777 <NA>  <NA> GGGAJTTD GGGAJTTD
#8  1566      666 <NA> KKHDY     <NA>    KKHDY
#9 65651     4457 <NA> TTQWW BBNMNJJI BBNMNJJI
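
To see why the conversion matters, here is a small sketch on a hypothetical three-row data frame (toy is made up for illustration and reuses only the RC1, RC2, and RC3 column names). It compares coalesce() before and after the blanks are turned into NA:

```r
library(dplyr)

# Hypothetical toy data: empty strings stand in for missing values
toy <- data.frame(RC1 = c("AW", "FG", ""),
                  RC2 = c("SSSBB", "", ""),
                  RC3 = c("KKAJDJHW", "", ""),
                  stringsAsFactors = FALSE)

# Without conversion, "" counts as a real value, so RC3 always wins
before <- coalesce(toy$RC3, toy$RC2, toy$RC1)

# Convert blanks to NA, then coalesce again
toy[toy == ""] <- NA
after <- coalesce(toy$RC3, toy$RC2, toy$RC1)

print(before)
print(after)
```

In the second row, the unconverted call returns an empty string, while the converted call falls through to RC1.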

Is there an R function that unifies multiple columns?

We can use coalesce() from dplyr, which takes the first non-NA value at each position:

library(dplyr)
df <- df %>%
  mutate(C = coalesce(A, B))

Iteratively dplyr::coalesce()

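If the column names follow a pattern such as something and something.etc (a base name, optionally followed by a dot and a suffix), you may try:

library(dplyr)
library(stringr)
library(purrr)  # provides map_df()

df %>%
  split.default(str_remove(names(.), "\\..*")) %>%
  map_df(~ coalesce(!!! .x))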

      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     1     2     3
3     1     2     3
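
The input data frame is not shown above. As a rough sketch of the same grouping idea in base R only (a swapped-in alternative, not the dplyr/purrr pipeline itself), assuming a hypothetical data frame whose columns share a base name before the first dot:

```r
# Hypothetical input: columns a, a.x, a.y belong together, as do b and b.x
df <- data.frame(a = c(1, NA, NA), a.x = c(NA, 1, NA), a.y = c(NA, NA, 1),
                 b = c(2, NA, 2), b.x = c(NA, 2, NA),
                 c = c(3, 3, 3))

# Group columns by base name (everything before the first dot)
groups <- split.default(df, sub("\\..*", "", names(df)))

# Within each group, keep the first non-NA value per row
coalesce_cols <- function(cols) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), cols)
}

result <- as.data.frame(lapply(groups, coalesce_cols))
print(result)
```

split.default() treats the data frame as a list of columns, so each group is itself a small data frame that can be reduced column by column.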

Merge two columns of data table on condition

If you are working with data.table, you can try fcoalesce:

> setDT(df)[, lab3 := fcoalesce(lab2, lab1)][]
   lab1 lab2 lab3
1:    5    7    7
2:    8   10   10
3:   NA    3    3
4:    9   NA    9
5:   NA   NA   NA
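
For a self-contained version, the sketch below rebuilds the input from the lab1 and lab2 values printed above (stored as doubles here, which is an assumption) and applies the same call:

```r
library(data.table)

# Input reconstructed from the printed output; column types are assumed
df <- data.frame(lab1 = c(5, 8, NA, 9, NA),
                 lab2 = c(7, 10, 3, NA, NA))

# setDT() converts by reference; lab2 takes priority, lab1 fills the gaps
setDT(df)[, lab3 := fcoalesce(lab2, lab1)]
print(df)
```

The argument order decides which column wins: lab2 is used wherever it is present, and lab1 only where lab2 is NA.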

Dealing with values that occur on the same date

One option is coalesce(), which returns, for each row, the first non-NA element across the columns passed as arguments:

library(dplyr)
df1 %>%
  transmute(Date, A01 = coalesce(A01, A01_CD), A01_CD = NA_real_)
#        Date      A01 A01_CD
#1 1966/05/07 4.870000     NA
#2 1966/05/08 4.918333     NA
#3 1966/05/09 4.892000     NA
#4 1966/05/10 4.858917     NA
#5 1966/05/11 4.842000     NA
#6 1967/03/18 5.950000     NA

Or in base R, using max.col() to find the first non-NA column in each row and a two-column index matrix to extract it:

df1$A01 <- df1[-1][cbind(seq_len(nrow(df1)), max.col(!is.na(df1[-1]), 'first'))]
df1$A01
#[1] 4.870000 4.918333 4.892000 4.858917 4.842000 5.950000
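
The one-liner is dense, so here is a step-by-step sketch of the same indexing idea on a hypothetical three-row data frame (vals and the intermediate names are illustrative only):

```r
# Hypothetical value columns, using the A01 / A01_CD naming from the example
vals <- data.frame(A01 = c(4.87, 4.918333, NA),
                   A01_CD = c(4.87, NA, 5.95))

# Step 1: logical matrix marking which cells hold a value
present <- !is.na(vals)

# Step 2: for each row, the index of the first column that holds a value
first_col <- max.col(present, ties.method = "first")

# Step 3: a two-column (row, column) matrix picks one cell per row
picked <- vals[cbind(seq_len(nrow(vals)), first_col)]
print(picked)
```

If a row has no value in any column, every entry of present is FALSE, the tie resolves to column 1, and the picked cell is NA, which is the desired result.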

data

df1 <- structure(list(Date = c("1966/05/07", "1966/05/08", "1966/05/09", 
"1966/05/10", "1966/05/11", "1967/03/18"), A01 = c(4.87, 4.918333,
4.892, 4.858917, 4.842, NA), A01_CD = c(4.87, NA, 4.86, NA, NA,
5.95)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "211"))

Coalesce columns in df

coalesce() only works on "real" missing values. In your data, "N/A" is a character string, so you first need to convert those strings to NA.

library(dplyr)

df %>%
  mutate(across(where(is.character), ~ na_if(.x, "N/A")),
         TotalWarning = coalesce(Primary.Warning.Vertical,
                                 Primary.Warning.Horizontal,
                                 Secondary.Sensor.Warning.Vertical,
                                 Secondary.Sensor.Warning.Horizontal))

# Primary.Warning.Vertical Primary.Warning.Horizontal Secondary.Sensor.Warning.Vertical Secondary.Sensor.Warning.Horizontal TotalWarning
# 1 <NA> 2 <NA> <NA> 2
# 2 <NA> 2 <NA> <NA> 2
# 3 <NA> 1.1 <NA> <NA> 1.1
# 4 <NA> 2 <NA> <NA> 2
# 5 <NA> 2 <NA> <NA> 2
# 6 <NA> 2 <NA> <NA> 2
# 7 <NA> 1.7 <NA> <NA> 1.7
# 8 <NA> 2 <NA> <NA> 2
# 9 <NA> 2 <NA> <NA> 2
# 10 <NA> 2 <NA> <NA> 2

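Typing out long variable names is tedious. To simplify the code, you can coalesce all columns of the data frame at once, in their existing column order:

df %>%
  mutate(across(where(is.character), ~ na_if(.x, "N/A")),
         # cur_data() is deprecated as of dplyr 1.1.0; pick(everything()) is its replacement
         TotalWarning = do.call(coalesce, cur_data()))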
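
Because the do.call() form passes every column to coalesce() in order, the column order decides which value wins. A minimal sketch on a hypothetical two-column data frame (toy, first, and second are made-up names), run outside of mutate() to keep it short:

```r
library(dplyr)

# Hypothetical data with "N/A" strings in place of missing values
toy <- data.frame(first  = c("N/A", "2", "N/A"),
                  second = c("1.1", "N/A", "N/A"),
                  stringsAsFactors = FALSE)

# Convert the "N/A" strings to real NA in every column
toy[] <- lapply(toy, function(col) na_if(col, "N/A"))

# Pass all columns to coalesce() as positional arguments, left to right
total <- do.call(coalesce, unname(as.list(toy)))
print(total)
```

If the data frame also holds columns that should not take part (an ID column, for example), select the relevant columns first; otherwise they are coalesced along with the rest.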

Coalesce two string columns with alternating missing values to one

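You may try pmax with na.rm = TRUE. This works here because each row has exactly one non-missing value; if both were present, pmax would return the larger string rather than the first one.

df$c <- pmax(df$a, df$b, na.rm = TRUE)
df
#       a    b     c
# 1   dog <NA>   dog
# 2 mouse <NA> mouse
# 3  <NA>  cat   cat
# 4  bird <NA>  bird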

...or ifelse:

df$c <- ifelse(is.na(df$a), df$b, df$a)
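
The two approaches agree only while at most one column has a value per row. A short sketch with hypothetical vectors, where the third position is filled in both, shows the difference (pmax is called with na.rm = TRUE so that it skips missing values):

```r
# Hypothetical inputs; position 3 has a value in both vectors
a <- c("dog", NA, "ant")
b <- c(NA, "cat", "bird")

# pmax keeps the larger string when both are present
via_pmax <- pmax(a, b, na.rm = TRUE)

# ifelse keeps a whenever a is not missing
via_ifelse <- ifelse(is.na(a), b, a)

print(via_pmax)
print(via_ifelse)
```

For true coalesce semantics, where the first column always has priority, the ifelse form is the safer of the two.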

For more than two columns, there are several more general ways to implement coalesce in R, such as the Reduce-based version and dplyr::coalesce() shown above.


