How to Omit NA Values While Pasting Numerous Column Values Together

How to omit NA values while pasting numerous column values together?

You could use na.omit() to drop the NA values, then paste. toString() is also handy here, since it is equivalent to paste(..., collapse = ", ").
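In this first answer, dd2 is not shown; a small data frame consistent with the output below (the column names here are invented) would be:

dd2 <- data.frame(id = c("A", "B", "C", "D", "E"),
                  g1 = c("AK2", "HFM1", "TRR", "TRR", "RTT"),
                  g2 = c("PPT", "PPT", NA, "RTT", NA),
                  g3 = c(NA, NA, NA, "GGT", NA))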

apply(dd2, 1, function(x) toString(na.omit(x)))
# [1] "A, AK2, PPT" "B, HFM1, PPT" "C, TRR"
# [4] "D, TRR, RTT, GGT" "E, RTT"

If there are specific columns you are using, then:

apply(dd2[, cols], 1, function(x) toString(na.omit(x)))

Omit NA values while pasting two column values together in R

You could do something like this, temporarily replacing NA values with the empty string "":

cbind(
  dd2,
  combination = paste(dd2[, 2], replace(dd2[, 3], is.na(dd2[, 3]), ""), sep = "*")
)
# customer_sample_id Left.Gene.Symbols Right.Gene.Symbols combination
# [1,] "AMLM12001KP" "AK2" NA "AK2*"
# [2,] "AMLM12001KP" "HFM1" "PPT" "HFM1*PPT"
# [3,] "AMLM12001KP" "HFM1" NA "HFM1*"
# [4,] "AMLM12001KP" "HFM1" "GGT" "HFM1*GGT"
# [5,] "AMLM12001KP" "HFM1" NA "HFM1*"

Of course, substitute your column names for the column numbers above; I didn't write them out because they are long.
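For instance, using the names from the printed output (assuming dd2 has those columns):

cbind(
  dd2,
  combination = paste(dd2[, "Left.Gene.Symbols"],
                      replace(dd2[, "Right.Gene.Symbols"],
                              is.na(dd2[, "Right.Gene.Symbols"]), ""),
                      sep = "*")
)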

Paste together columns but ignore NAs

Using paste():
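The question's df is not reproduced here; a data frame consistent with the outputs below (a guess at its shape) would be:

df <- data.frame(col1 = c("A", NA, "B", "C"),
                 col2 = c(NA, NA, NA, "E"),
                 col3 = c(NA, "D", NA, NA),
                 col4 = NA)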

data.frame(col1=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E

Or without apply:

data.frame(col1=unname(as.list(as.data.frame(t(df))) |>
  (\(x) sapply(x, \(x) paste(x[!is.na(x)], collapse=',')))()))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E

To add it as a column, use transform().

transform(df, colX=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1 col2 col3 col4 colX
# 1 A <NA> <NA> NA A
# 2 <NA> <NA> D NA D
# 3 B <NA> <NA> NA B
# 4 C E <NA> NA C,E

Note: you could also replace \(x) x[!is.na(x)] with na.omit, since its attributes vanish in the paste; see e.g. @G. Grothendieck's answer.
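A quick check of that note: na.omit() attaches an na.action attribute, but paste() ignores it.

x <- c("C", NA, "E")
na.omit(x)
# [1] "C" "E"
# attr(,"na.action")
# [1] 2
# attr(,"class")
# [1] "omit"
paste(na.omit(x), collapse = ",")
# [1] "C,E"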

Paste columns together without NAs becoming characters

Paste the columns using na.omit(); see this example:

# reproducible example
owners4 <- data.frame(FirstName = c("Aa", "Bb", NA),
                      MiddleInitial = c("T", "U", NA),
                      LastName = c(NA, "Yyy", NA))

owners4$Name <- apply(owners4[, c("FirstName", "MiddleInitial", "LastName")], 1,
                      function(i) paste(na.omit(i), collapse = " "))

owners4
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy
# 3 <NA> <NA> <NA>

Now filter out the rows where Name is blank:

result <- owners4[ owners4$Name != "", ]
result
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy

Cleanly paste columns that contain NAs

Here's a way in base R:
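(dat is not shown in the question; a data frame consistent with the output would be:)

dat <- data.frame(col1 = c("stuff", "stuff", "stuff", NA, NA),
                  col2 = c("things", NA, "things", "things", NA))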

dat$col3 <- apply(dat, 1, function(x) paste0(na.omit(x), collapse = "; "))

# col1 col2 col3
# 1 stuff things stuff; things
# 2 stuff <NA> stuff
# 3 stuff things stuff; things
# 4 <NA> things things
# 5 <NA> <NA>

Suppress NAs in paste()

For the purpose of a "true NA": the most direct route seems to be modifying the pasted value to be NA whenever it ends up as the empty string "":

paste3 <- function(..., sep = ", ") {
  L <- list(...)
  # replace NAs with "" so paste() doesn't print a literal "NA"
  L <- lapply(L, function(x) { x[is.na(x)] <- ""; x })
  # paste, then collapse doubled separators and trim leading/trailing ones
  ret <- gsub(paste0("(^", sep, "|", sep, "$)"), "",
              gsub(paste0(sep, sep), sep,
                   do.call(paste, c(L, list(sep = sep)))))
  # an empty result means every input was NA: return a true NA
  is.na(ret) <- ret == ""
  ret
}

val <- paste3(c("a", "b", "c", NA), c("A", "B", NA, NA))
val
# [1] "a, A" "b, B" "c"    NA

Combine columns to remove NAs

A dplyr::coalesce-based solution:

data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5

Data

data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
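
coalesce() picks the first non-NA value element-wise across its arguments. For intuition (not as a replacement), a rough base R sketch of the same idea:

data$mycol <- apply(data[, c("x", "y", "z")], 1,
                    function(r) r[!is.na(r)][1])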

How to optimize pasting single/multiple column names with their values based on some condition

Up front, I confess that I have not been able to beat the benchmark (thanks for the challenge). There might be ways to wring a little more speed out of it, but let me recommend a method that does the same thing (faster with smaller data, about the same with large data) while supporting per-rule functions. It isn't what you asked for directly, but you hinted at different functions for each rule.

(I've updated the code, thanks to @Cole for finding a remnant of my early exploration.)

RULES <- list(
Rule1 = list(
rule = "Rule1",
lhs = "t1",
rhs = c("a", "b", "c"),
fun = function(z) !is.na(z) & z > 0
),
Rule2 = list(
rule = "Rule2",
lhs = "t2",
rhs = "d",
fun = is.na
)
)

fun9 <- function(dat, RULES = list()) {
nr <- nrow(dat)
# RE <- lapply(seq_along(RULES), function(ign) rep("", nr))
RE <- asplit(matrix("", nrow = length(RULES), ncol = nr), 1)
for (r in seq_along(RULES)) {
fun <- RULES[[r]]$fun
lhs <- RULES[[r]]$lhs
for (rhs in RULES[[r]]$rhs) {
lgl <- do.call(fun, list(dat[[rhs]]))
set(dat, which(lgl), lhs, NA)
RE[[r]][lgl] <- sprintf("%s %s=1", RE[[r]][lgl], rhs)
}
ind <- nzchar(RE[[r]])
RE[[r]][ind] <- sprintf("%s:%s(%s)", RULES[[r]]$rule, lhs, RE[[r]][ind])
}
set(dat, j = "re", value = do.call(paste, c(RE, sep = ";")))
}

The premise of the RULES and using fun9 should be self-evident.
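For illustration, extending the rule set is just a matter of appending another list entry. This hypothetical Rule3 (invented here, and not used in the benchmarks below) would flag non-finite values in column a:

RULES$Rule3 <- list(
  rule = "Rule3",
  lhs = "t1",
  rhs = "a",
  fun = function(z) !is.finite(z)
)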

Benchmarking with small data seems promising:

set.seed(2021)
dat <- data.table(id = 1:10,
                  t1 = rnorm(10),
                  t2 = rnorm(10),
                  a = c(0, NA, 0, 1, 0, NA, 1, 1, 0, 1),
                  b = c(0, NA, NA, 0, 1, 0, 1, NA, 1, 1),
                  c = c(0, NA, 0, NA, 0, 1, NA, 1, 1, 1),
                  d = c(0, NA, 1, 1, 0, 1, 0, 1, NA, 1),
                  re = "")
fun9(dat, RULES)[]
# id t1 t2 a b c d re
# <int> <num> <num> <num> <num> <num> <num> <char>
# 1: 1 -0.1224600 -1.0822049 0 0 0 0 ;
# 2: 2 0.5524566 NA NA NA NA NA ;Rule2:t2( d=1)
# 3: 3 0.3486495 0.1819954 0 NA 0 1 ;
# 4: 4 NA 1.5085418 1 0 NA 1 Rule1:t1( a=1);
# 5: 5 NA 1.6044701 0 1 0 0 Rule1:t1( b=1);
# 6: 6 NA -1.8414756 NA 0 1 1 Rule1:t1( c=1);
# 7: 7 NA 1.6233102 1 1 NA 0 Rule1:t1( a=1 b=1);
# 8: 8 NA 0.1313890 1 NA 1 1 Rule1:t1( a=1 c=1);
# 9: 9 NA NA 0 1 1 NA Rule1:t1( b=1 c=1);Rule2:t2( d=1)
# 10: 10 NA 1.5133183 1 1 1 1 Rule1:t1( a=1 b=1 c=1);

bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 9.52ms 11.1ms 88.5 316KB 2.06 43 1 486ms <NULL> <Rprofmem[,3] [84 x 3]> <bch:tm [44]> <tibble [44 x 3]>
# 2 fun9(dat, RULES) 97.5us 113.5us 7760. 416B 6.24 3731 3 481ms <NULL> <Rprofmem[,3] [2 x 3]> <bch:tm [3,734]> <tibble [3,734 x 3]>

Just from `itr/sec`, this fun9 looks to be quite a bit faster here (thousands of iterations per second versus under a hundred).

With larger data:

set.seed(2021)
n <- 200000
dat <- data.table(id = 1:n,
                  t1 = rnorm(n),
                  t2 = rnorm(n),
                  a = sample(c(0, NA, 1), n, replace = TRUE),
                  b = sample(c(0, NA, 1), n, replace = TRUE),
                  c = sample(c(0, NA, 1), n, replace = TRUE),
                  d = sample(c(0, NA, 1), n, replace = TRUE),
                  re = "")
bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 1.24s 1.24s 0.806 62.9MB 1.61 1 2 1.24s <NULL> <Rprofmem[,3] [150 x 3]> <bch:tm [1]> <tibble [1 x 3]>
# 2 fun9(dat, RULES) 296.11ms 315.4ms 3.17 53.8MB 4.76 2 3 630.8ms <NULL> <Rprofmem[,3] [70 x 3]> <bch:tm [2]> <tibble [2 x 3]>

While this solution does not use tidytable or its flow, it is faster. The cleanup of re is another step, likely to bring this speed back down to mortal levels :-).
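As one possible sketch of that cleanup (an assumption about the desired final form, stripping the empty-rule separators left in re):

dat[, re := gsub("^;+|;+$", "", re)]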

Side note: I was trying to use lapply, mget, and other tricks to do things within the data.table data environment, but in the end, using data.table::set (https://stackoverflow.com/a/16846530/3358272) and simple vectors appeared to be the fastest.
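As a minimal illustration of set() modifying a column by reference (a toy example, not part of the original answer):

library(data.table)
dt <- data.table(x = 1:3)
set(dt, i = 2L, j = "x", value = NA_integer_)
dt$x
# [1]  1 NA  3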


