How to Omit NA Values While Pasting Numerous Column Values Together

How to omit NA values while pasting numerous column values together?

You could use na.omit() to drop the NA values, then paste. toString() is also handy here, since it is equivalent to paste(..., collapse = ", ").
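In this first answer, dd2 is not shown; a small data frame consistent with the output below (the column names here are invented) would be:

dd2 <- data.frame(id = c("A", "B", "C", "D", "E"),
                  g1 = c("AK2", "HFM1", "TRR", "TRR", "RTT"),
                  g2 = c("PPT", "PPT", NA, "RTT", NA),
                  g3 = c(NA, NA, NA, "GGT", NA))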

apply(dd2, 1, function(x) toString(na.omit(x)))
# [1] "A, AK2, PPT" "B, HFM1, PPT" "C, TRR"
# [4] "D, TRR, RTT, GGT" "E, RTT"

If there are specific columns you are using, then:

apply(dd2[, cols], 1, function(x) toString(na.omit(x)))

Omit NA values while pasting two column values together in R

You could do something like this, temporarily replacing NA values with the empty string "":

cbind(
  dd2,
  combination = paste(dd2[, 2], replace(dd2[, 3], is.na(dd2[, 3]), ""), sep = "*")
)
# customer_sample_id Left.Gene.Symbols Right.Gene.Symbols combination
# [1,] "AMLM12001KP" "AK2" NA "AK2*"
# [2,] "AMLM12001KP" "HFM1" "PPT" "HFM1*PPT"
# [3,] "AMLM12001KP" "HFM1" NA "HFM1*"
# [4,] "AMLM12001KP" "HFM1" "GGT" "HFM1*GGT"
# [5,] "AMLM12001KP" "HFM1" NA "HFM1*"

Of course, substitute your column names for the column numbers above; I didn't write them out because they are long.
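For instance, using the names from the printed output (assuming dd2 has those columns):

cbind(
  dd2,
  combination = paste(dd2[, "Left.Gene.Symbols"],
                      replace(dd2[, "Right.Gene.Symbols"],
                              is.na(dd2[, "Right.Gene.Symbols"]), ""),
                      sep = "*")
)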

Paste together columns but ignore NAs

Using paste():
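The question's df is not reproduced here; a data frame consistent with the outputs below (a guess at its shape) would be:

df <- data.frame(col1 = c("A", NA, "B", "C"),
                 col2 = c(NA, NA, NA, "E"),
                 col3 = c(NA, "D", NA, NA),
                 col4 = NA)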

data.frame(col1=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E

Or without apply:

data.frame(col1=unname(as.list(as.data.frame(t(df))) |>
  (\(x) sapply(x, \(x) paste(x[!is.na(x)], collapse=',')))()))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E

To add it as a column, use transform().

transform(df, colX=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1 col2 col3 col4 colX
# 1 A <NA> <NA> NA A
# 2 <NA> <NA> D NA D
# 3 B <NA> <NA> NA B
# 4 C E <NA> NA C,E

Note: you could also replace \(x) x[!is.na(x)] with na.omit, since its attributes vanish in the paste; see e.g. @G. Grothendieck's answer.
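A quick check of that note: na.omit() attaches an na.action attribute, but paste() ignores it.

x <- c("C", NA, "E")
na.omit(x)
# [1] "C" "E"
# attr(,"na.action")
# [1] 2
# attr(,"class")
# [1] "omit"
paste(na.omit(x), collapse = ",")
# [1] "C,E"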

Paste columns together without NAs becoming characters

Paste the columns using na.omit(); see this example:

# reproducible example
owners4 <- data.frame(FirstName = c("Aa", "Bb", NA),
                      MiddleInitial = c("T", "U", NA),
                      LastName = c(NA, "Yyy", NA))

owners4$Name <- apply(owners4[, c("FirstName", "MiddleInitial", "LastName")], 1,
                      function(i) paste(na.omit(i), collapse = " "))

owners4
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy
# 3 <NA> <NA> <NA>

Now filter out the rows where Name is blank:

result <- owners4[ owners4$Name != "", ]
result
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy

Cleanly paste columns that contain NAs

Here's a way in base R:
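(dat is not shown in the question; a data frame consistent with the output would be:)

dat <- data.frame(col1 = c("stuff", "stuff", "stuff", NA, NA),
                  col2 = c("things", NA, "things", "things", NA))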

dat$col3 <- apply(dat, 1, function(x) paste0(na.omit(x), collapse = "; "))

# col1 col2 col3
# 1 stuff things stuff; things
# 2 stuff <NA> stuff
# 3 stuff things stuff; things
# 4 <NA> things things
# 5 <NA> <NA>

Suppress NAs in paste()

For the purpose of a "true NA": the most direct route seems to be modifying the pasted value to be NA whenever it ends up as the empty string "":

paste3 <- function(..., sep = ", ") {
  L <- list(...)
  # replace NAs with "" so paste() doesn't print a literal "NA"
  L <- lapply(L, function(x) { x[is.na(x)] <- ""; x })
  # paste, then collapse doubled separators and trim leading/trailing ones
  ret <- gsub(paste0("(^", sep, "|", sep, "$)"), "",
              gsub(paste0(sep, sep), sep,
                   do.call(paste, c(L, list(sep = sep)))))
  # an empty result means every input was NA: return a true NA
  is.na(ret) <- ret == ""
  ret
}

val <- paste3(c("a", "b", "c", NA), c("A", "B", NA, NA))
val
# [1] "a, A" "b, B" "c"    NA

Combine columns to remove NAs

A dplyr::coalesce-based solution:

data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5

Data

data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
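
coalesce() picks the first non-NA value element-wise across its arguments. For intuition (not as a replacement), a rough base R sketch of the same idea:

data$mycol <- apply(data[, c("x", "y", "z")], 1,
                    function(r) r[!is.na(r)][1])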

How to optimize pasting single/multiple column names with their values based on some condition

Up front, I confess that I have not been able to beat the benchmark (thanks for the challenge). There might be ways to wring a little more speed out of it, but let me recommend a method that does the same thing (faster with smaller data, about the same with large data) while supporting per-rule functions. It isn't what you asked for directly, but you hinted at different functions for each rule.

(I've updated the code, thanks to @Cole for finding a remnant of my early exploration.)

RULES <- list(
Rule1 = list(
rule = "Rule1",
lhs = "t1",
rhs = c("a", "b", "c"),
fun = function(z) !is.na(z) & z > 0
),
Rule2 = list(
rule = "Rule2",
lhs = "t2",
rhs = "d",
fun = is.na
)
)

fun9 <- function(dat, RULES = list()) {
nr <- nrow(dat)
# RE <- lapply(seq_along(RULES), function(ign) rep("", nr))
RE <- asplit(matrix("", nrow = length(RULES), ncol = nr), 1)
for (r in seq_along(RULES)) {
fun <- RULES[[r]]$fun
lhs <- RULES[[r]]$lhs
for (rhs in RULES[[r]]$rhs) {
lgl <- do.call(fun, list(dat[[rhs]]))
set(dat, which(lgl), lhs, NA)
RE[[r]][lgl] <- sprintf("%s %s=1", RE[[r]][lgl], rhs)
}
ind <- nzchar(RE[[r]])
RE[[r]][ind] <- sprintf("%s:%s(%s)", RULES[[r]]$rule, lhs, RE[[r]][ind])
}
set(dat, j = "re", value = do.call(paste, c(RE, sep = ";")))
}

The premise of the RULES and using fun9 should be self-evident.
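For illustration, extending the rule set is just a matter of appending another list entry. This hypothetical Rule3 (invented here, and not used in the benchmarks below) would flag non-finite values in column a:

RULES$Rule3 <- list(
  rule = "Rule3",
  lhs = "t1",
  rhs = "a",
  fun = function(z) !is.finite(z)
)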

Benchmarking with small data seems promising:

set.seed(2021)
dat <- data.table(id = 1:10,
                  t1 = rnorm(10),
                  t2 = rnorm(10),
                  a = c(0, NA, 0, 1, 0, NA, 1, 1, 0, 1),
                  b = c(0, NA, NA, 0, 1, 0, 1, NA, 1, 1),
                  c = c(0, NA, 0, NA, 0, 1, NA, 1, 1, 1),
                  d = c(0, NA, 1, 1, 0, 1, 0, 1, NA, 1),
                  re = "")
fun9(dat, RULES)[]
# id t1 t2 a b c d re
# <int> <num> <num> <num> <num> <num> <num> <char>
# 1: 1 -0.1224600 -1.0822049 0 0 0 0 ;
# 2: 2 0.5524566 NA NA NA NA NA ;Rule2:t2( d=1)
# 3: 3 0.3486495 0.1819954 0 NA 0 1 ;
# 4: 4 NA 1.5085418 1 0 NA 1 Rule1:t1( a=1);
# 5: 5 NA 1.6044701 0 1 0 0 Rule1:t1( b=1);
# 6: 6 NA -1.8414756 NA 0 1 1 Rule1:t1( c=1);
# 7: 7 NA 1.6233102 1 1 NA 0 Rule1:t1( a=1 b=1);
# 8: 8 NA 0.1313890 1 NA 1 1 Rule1:t1( a=1 c=1);
# 9: 9 NA NA 0 1 1 NA Rule1:t1( b=1 c=1);Rule2:t2( d=1)
# 10: 10 NA 1.5133183 1 1 1 1 Rule1:t1( a=1 b=1 c=1);

bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 9.52ms 11.1ms 88.5 316KB 2.06 43 1 486ms <NULL> <Rprofmem[,3] [84 x 3]> <bch:tm [44]> <tibble [44 x 3]>
# 2 fun9(dat, RULES) 97.5us 113.5us 7760. 416B 6.24 3731 3 481ms <NULL> <Rprofmem[,3] [2 x 3]> <bch:tm [3,734]> <tibble [3,734 x 3]>

Just from `itr/sec`, this fun9 looks to be quite a bit faster here (thousands of iterations per second versus under a hundred).

With larger data:

set.seed(2021)
n <- 200000
dat <- data.table(id = 1:n,
                  t1 = rnorm(n),
                  t2 = rnorm(n),
                  a = sample(c(0, NA, 1), n, replace = TRUE),
                  b = sample(c(0, NA, 1), n, replace = TRUE),
                  c = sample(c(0, NA, 1), n, replace = TRUE),
                  d = sample(c(0, NA, 1), n, replace = TRUE),
                  re = "")
bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 1.24s 1.24s 0.806 62.9MB 1.61 1 2 1.24s <NULL> <Rprofmem[,3] [150 x 3]> <bch:tm [1]> <tibble [1 x 3]>
# 2 fun9(dat, RULES) 296.11ms 315.4ms 3.17 53.8MB 4.76 2 3 630.8ms <NULL> <Rprofmem[,3] [70 x 3]> <bch:tm [2]> <tibble [2 x 3]>

While this solution does not use tidytable or its flow, it is faster. The cleanup of re is another step, likely to bring this speed back down to mortal levels :-).
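As one possible sketch of that cleanup (an assumption about the desired final form, stripping the empty-rule separators left in re):

dat[, re := gsub("^;+|;+$", "", re)]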

Side note: I was trying to use lapply, mget, and other tricks to do things within the data.table data environment, but in the end, using data.table::set (https://stackoverflow.com/a/16846530/3358272) and simple vectors appeared to be the fastest.
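As a minimal illustration of set() modifying a column by reference (a toy example, not part of the original answer):

library(data.table)
dt <- data.table(x = 1:3)
set(dt, i = 2L, j = "x", value = NA_integer_)
dt$x
# [1]  1 NA  3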


