How to omit NA values while pasting numerous column values together?
You could try na.omit()
to omit the values, then paste. Also, you could use toString()
, as it is the equivalent of paste(..., collapse = ", ")
.
apply(dd2, 1, function(x) toString(na.omit(x)))
# [1] "A, AK2, PPT" "B, HFM1, PPT" "C, TRR"
# [4] "D, TRR, RTT, GGT" "E, RTT"
If you have specific columns you are using then
apply(dd2[, cols], 1, function(x) toString(na.omit(x)))
Omit NA values while pasting two column values together in R
You could do something like this, temporarily replacing NA
values with the empty character ""
.
cbind(
dd2,
combination = paste(dd2[,2], replace(dd2[,3], is.na(dd2[,3]), ""), sep = "*")
)
# customer_sample_id Left.Gene.Symbols Right.Gene.Symbols combinations
# [1,] "AMLM12001KP" "AK2" NA "AK2*"
# [2,] "AMLM12001KP" "HFM1" "PPT" "HFM1*PPT"
# [3,] "AMLM12001KP" "HFM1" NA "HFM1*"
# [4,] "AMLM12001KP" "HFM1" "GGT" "HFM1*GGT"
# [5,] "AMLM12001KP" "HFM1" NA "HFM1*"
Of course substitute your column names for the column numbers above. I didn't write them because they are too long.
Paste together columns but ignore NAs
Using paste
.
data.frame(col1=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E
Or without apply
:
data.frame(col1=unname(as.list(as.data.frame(t(df))) |>
(\(x) sapply(x, \(x) paste(x[!is.na(x)], collapse=',')))()))
# col1
# 1 A
# 2 D
# 3 B
# 4 C,E
To add as a column use transform
.
transform(df, colX=sapply(apply(df, 1, \(x) x[!is.na(x)]), paste, collapse=','))
# col1 col2 col3 col4 colX
# 1 A <NA> <NA> NA A
# 2 <NA> <NA> D NA D
# 3 B <NA> <NA> NA B
# 4 C E <NA> NA C,E
Note: Actually, you also could replace \(x) x[!is.na(x)]
by na.omit
, since it's attributes vanish; see e.g. @ G. Grothendieck's answer.
Paste columns together without NAs becoming characters
Paste columns using na.omit, see example:
# reproducible example
owners4 <- data.frame(FirstName = c("Aa", "Bb", NA),
MiddleInitial = c("T", "U", NA),
LastName = c(NA, "Yyy", NA))
owners4$Name <- apply(owners4[, c("FirstName", "MiddleInitial", "LastName")], 1,
function(i){ paste(na.omit(i), collapse = " ") })
owners4
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy
# 3 <NA> <NA> <NA>
Now filter out rows where Name is blank
result <- owners4[ owners4$Name != "", ]
result
# FirstName MiddleInitial LastName Name
# 1 Aa T <NA> Aa T
# 2 Bb U Yyy Bb U Yyy
cleanly paste columns that contain NAs
Here's a way in base R -
dat$col3 <- apply(dat, 1, function(x) paste0(na.omit(x), collapse = "; "))
col1 col2 col3
1 stuff things stuff; things
2 stuff <NA> stuff
3 stuff things stuff; things
4 <NA> things things
5 <NA> <NA>
suppress NAs in paste()
For the purpose of a "true-NA": Seems the most direct route is just to modify the value returned by paste2
to be NA
when the value is ""
paste3 <- function(...,sep=", ") {
L <- list(...)
L <- lapply(L,function(x) {x[is.na(x)] <- ""; x})
ret <-gsub(paste0("(^",sep,"|",sep,"$)"),"",
gsub(paste0(sep,sep),sep,
do.call(paste,c(L,list(sep=sep)))))
is.na(ret) <- ret==""
ret
}
val<- paste3(c("a","b", "c", NA), c("A","B", NA, NA))
val
#[1] "a, A" "b, B" "c" NA
Combine column to remove NA's
A dplyr::coalesce
based solution could be as:
data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
Data
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
How to optimize pasting single/multiple column names with its values based on some condition
Up front, I confess that I have not been able to beat the benchmarking (thanks for the challenge). There might be ways to wring a little bit of speed out of it, but let me recommend a method that does the same thing (faster with smaller data, about the same with large data) but supporting per-rule functions. It isn't what you asked directly, but you hinted at different functions for each rule.
(I've updated the code, thanks to @Cole for finding a remnant of my early exploration.)
RULES <- list(
Rule1 = list(
rule = "Rule1",
lhs = "t1",
rhs = c("a", "b", "c"),
fun = function(z) !is.na(z) & z > 0
),
Rule2 = list(
rule = "Rule2",
lhs = "t2",
rhs = "d",
fun = is.na
)
)
fun9 <- function(dat, RULES = list()) {
nr <- nrow(dat)
# RE <- lapply(seq_along(RULES), function(ign) rep("", nr))
RE <- asplit(matrix("", nrow = length(RULES), ncol = nr), 1)
for (r in seq_along(RULES)) {
fun <- RULES[[r]]$fun
lhs <- RULES[[r]]$lhs
for (rhs in RULES[[r]]$rhs) {
lgl <- do.call(fun, list(dat[[rhs]]))
set(dat, which(lgl), lhs, NA)
RE[[r]][lgl] <- sprintf("%s %s=1", RE[[r]][lgl], rhs)
}
ind <- nzchar(RE[[r]])
RE[[r]][ind] <- sprintf("%s:%s(%s)", RULES[[r]]$rule, lhs, RE[[r]][ind])
}
set(dat, j = "re", value = do.call(paste, c(RE, sep = ";")))
}
The premise of the RULES
and using fun9
should be self-evident.
Benchmarking with small data seems promising:
set.seed(2021)
dat <- data.table(id = 1:10,
t1 = rnorm(10),
t2 = rnorm(10),
a = c(0, NA, 0, 1, 0, NA, 1, 1, 0, 1),
b = c(0, NA, NA, 0, 1, 0, 1, NA, 1, 1),
c = c(0, NA, 0, NA, 0, 1, NA, 1, 1, 1),
d = c(0, NA, 1, 1, 0, 1, 0, 1, NA, 1),
re = "")
fun9(dat, RULES)[]
# id t1 t2 a b c d re
# <int> <num> <num> <num> <num> <num> <num> <char>
# 1: 1 -0.1224600 -1.0822049 0 0 0 0 ;
# 2: 2 0.5524566 NA NA NA NA NA ;Rule2:t2( d=1)
# 3: 3 0.3486495 0.1819954 0 NA 0 1 ;
# 4: 4 NA 1.5085418 1 0 NA 1 Rule1:t1( a=1);
# 5: 5 NA 1.6044701 0 1 0 0 Rule1:t1( b=1);
# 6: 6 NA -1.8414756 NA 0 1 1 Rule1:t1( c=1);
# 7: 7 NA 1.6233102 1 1 NA 0 Rule1:t1( a=1 b=1);
# 8: 8 NA 0.1313890 1 NA 1 1 Rule1:t1( a=1 c=1);
# 9: 9 NA NA 0 1 1 NA Rule1:t1( b=1 c=1);Rule2:t2( d=1)
# 10: 10 NA 1.5133183 1 1 1 1 Rule1:t1( a=1 b=1 c=1);
bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 9.52ms 11.1ms 88.5 316KB 2.06 43 1 486ms <NULL> <Rprofmem[,3] [84 x 3]> <bch:tm [44]> <tibble [44 x 3]>
# 2 fun9(dat, RULES) 97.5us 113.5us 7760. 416B 6.24 3731 3 481ms <NULL> <Rprofmem[,3] [2 x 3]> <bch:tm [3,734]> <tibble [3,734 x 3]>
Just from `itr/sec`
, this fun9
looks to be a bit faster.
With larger data:
set.seed(2021)
n <- 200000
dat <- data.table(id = 1:n,
t1 = rnorm(n),
t2 = rnorm(n),
a = sample(c(0, NA, 1), n, replace = TRUE),
b = sample(c(0, NA, 1), n, replace = TRUE),
c = sample(c(0, NA, 1), n, replace = TRUE),
d = sample(c(0, NA, 1), n, replace = TRUE),
re = "")
bench::mark(fun4(dat), fun9(dat, RULES), check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fun4(dat) 1.24s 1.24s 0.806 62.9MB 1.61 1 2 1.24s <NULL> <Rprofmem[,3] [150 x 3]> <bch:tm [1]> <tibble [1 x 3]>
# 2 fun9(dat, RULES) 296.11ms 315.4ms 3.17 53.8MB 4.76 2 3 630.8ms <NULL> <Rprofmem[,3] [70 x 3]> <bch:tm [2]> <tibble [2 x 3]>
While this solution does not use tidytable
or its flow, it is faster. The cleanup of re
is another step, likely to bring this speed back down to mortal levels :-).
Side note: I was trying to use lapply
, mget
, and other tricks to do things within the data.table
data environment, but in the end, using data.table::set
(https://stackoverflow.com/a/16846530/3358272) and simple vectors appeared to be the fastest.
Related Topics
Quit and Restart a Clean R Session from Within R
Dplyr Summarise: Equivalent of ".Drop=False" to Keep Groups With Zero Length in Output
How to Number/Label Data-Table by Group-Number from Group_By
Applying a Function to Every Row of a Table Using Dplyr
Usage of '...' (Three-Dots or Dot-Dot-Dot) in Functions
Difference Between the == and %In% Operators in R
Error - Replacement Has [X] Rows, Data Has [Y]
Replace Na in Column With Value in Adjacent Column
Offline Install of R Package and Dependencies
Create Sequence of Repeated Values, in Sequence
Conditional Merge/Replacement in R
Test If Characters Are in a String
Unordered Combinations of All Lengths
How to Remove Outliers from a Dataset