grepl in R to find matches to any of a list of character strings
You can use an "or" (|) operator inside the regular expression passed to grepl.
ifelse(grepl("dog|cat", data$animal), "keep", "discard")
# [1] "keep" "keep" "discard" "keep" "keep" "keep" "keep" "discard"
# [9] "keep" "keep" "keep" "keep" "keep" "keep" "discard" "keep"
#[17] "discard" "keep" "keep" "discard" "keep" "keep" "discard" "keep"
#[25] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[33] "keep" "discard" "keep" "discard" "keep" "discard" "keep" "keep"
#[41] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[49] "keep" "discard"
The regular expression dog|cat tells the regular expression engine to look for either "dog" or "cat", so a string containing either one counts as a match.
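When the alternatives live in a character vector rather than being typed by hand, the same pattern can be assembled with paste(). This is a minimal sketch: the vectors animals and keep_words are hypothetical stand-ins for data$animal and the list of strings, and it assumes the keywords contain no regex metacharacters.

```r
# Hypothetical data, not taken from the original question
animals <- c("dog", "big cat", "bird", "catfish", "horse")
keep_words <- c("dog", "cat")

# Collapse the keywords into a single alternation: "dog|cat"
pattern <- paste(keep_words, collapse = "|")

result <- ifelse(grepl(pattern, animals), "keep", "discard")
result
# [1] "keep"    "keep"    "discard" "keep"    "discard"
```

Note that "catfish" is kept, because grepl matches substrings unless the pattern says otherwise.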
R - grepl to find matching strings in any order
| in regex means "or". That is why it returns TRUE for both texts.
You have to test whether "illegal parking" is followed (with or without something in between) by "obstruction", which in regex is "illegal parking.*obstruction", or whether the two appear the other way around. Together that gives "illegal parking.*obstruction|obstruction.*illegal parking":
grepl("illegal parking.*obstruction|obstruction.*illegal parking", Text1, ignore.case=TRUE)
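The difference between "either term" and "both terms in any order" can be checked on a small example. The vector complaints below is hypothetical; the original Text1 is not shown.

```r
# Hypothetical complaint texts, assumed for illustration only
complaints <- c("Illegal parking causing obstruction",
                "Obstruction due to illegal parking",
                "Illegal parking only",
                "Obstruction only")

# Plain alternation: TRUE when at least one term is present
either <- grepl("illegal parking|obstruction", complaints, ignore.case = TRUE)
either
# [1] TRUE TRUE TRUE TRUE

# Both orders spelled out: TRUE only when both terms are present
both <- grepl("illegal parking.*obstruction|obstruction.*illegal parking",
              complaints, ignore.case = TRUE)
both
# [1]  TRUE  TRUE FALSE FALSE
```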
R's grepl() to find multiple strings exists
Text <- c("instance", "percentage", "n",
"instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE
The latter one works by looking for:
('instance')(any character sequence)('percentage')
OR
('percentage')(any character sequence)('instance')
Naturally if you need to find any combination of more than two words, this will get pretty complicated. Then the solution mentioned in the comments would be easier to implement and read.
Another alternative that might be relevant when matching many words is to use positive look-ahead (which can be thought of as a 'non-consuming' match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.
# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))
# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)
# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)
# they produce identical results
identical(longperl, longstrd)
Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you
pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, you can specify word boundaries using \\b. For example:
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
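The vectorised pattern construction and the word boundaries can be combined into one small function. This is only a sketch: contains_all_words is a hypothetical helper name, it puts \\b around every word (not just "cent"), and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE where every element of `words` occurs
# in `text` as a whole word, in any order.
contains_all_words <- function(text, words) {
  # Builds e.g. "(?=.*\\bcent\\b)(?=.*\\belement\\b)"
  pattern <- paste0("(?=.*\\b", words, "\\b)", collapse = "")
  grepl(pattern, text, perl = TRUE)
}

tx <- c("cent element", "percentage element", "element cent", "element centimetre")
res <- contains_all_words(tx, c("cent", "element"))
res
# [1]  TRUE FALSE  TRUE FALSE
```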
Using grepl to match character in a string of characters with delimiters
Try this, choosing one of type1 or type2 (same result), whichever you prefer.
library(dplyr)
left_join(y, x, by = "variable") %>%
mutate(
type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")),
type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations)
)
# # A tibble: 6 x 6
# variable species active combinations type1 type2
# <chr> <chr> <chr> <chr> <lgl> <lgl>
# 1 A 16 16 16, 17, 18 TRUE TRUE
# 2 C 16 16 16,18 TRUE TRUE
# 3 A 17 <NA> 16, 17, 18 TRUE TRUE
# 4 C 17 <NA> 16,18 FALSE FALSE
# 5 A 18 <NA> 16, 17, 18 TRUE TRUE
# 6 C 18 <NA> 16,18 TRUE TRUE
Or starting with the original y:
y
# variable species active
# 1 A, C 16 16
# 2 <NA> 17 <NA>
# 3 <NA> 18 <NA>
y %>%
mutate(variable = zoo::na.locf(variable)) %>%
tidyr::separate_rows(variable) %>%
left_join(., x, by = "variable") %>%
mutate(type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")), type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations))
# # A tibble: 6 x 6
# variable species active combinations type1 type2
# <chr> <chr> <chr> <chr> <lgl> <lgl>
# 1 A 16 16 16, 17, 18 TRUE TRUE
# 2 C 16 16 16,18 TRUE TRUE
# 3 A 17 <NA> 16, 17, 18 TRUE TRUE
# 4 C 17 <NA> 16,18 FALSE FALSE
# 5 A 18 <NA> 16, 17, 18 TRUE TRUE
# 6 C 18 <NA> 16,18 TRUE TRUE
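The two mapply calls can be tried in isolation, without the join. The vectors below are hypothetical values shaped like the species and combinations columns in the output above. Both calls compare element i of species with element i of combinations, which is the pairwise behaviour a single grepl call does not give you.

```r
# Hypothetical values modelled on the columns shown above
species      <- c("16", "17", "17", "18")
combinations <- c("16, 17, 18", "16,18", "16, 17, 18", "16,18")

# type1: split each combination on non-digits, then test membership
type1 <- unname(mapply(`%in%`, species, strsplit(combinations, "\\D+")))

# type2: one word-bounded pattern per row, matched against that row only
type2 <- unname(mapply(grepl, paste0("\\b", species, "\\b"), combinations))

type1
# [1]  TRUE FALSE  TRUE  TRUE
identical(type1, type2)
# [1] TRUE
```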
FYI, some things wrong with your question:
When asking questions that include warnings or errors, you need to include them; in this case, grepl's first argument must be length 1, and it appears you are ignoring the warning:
grepl(y$species, y$combinations)
# Warning in grepl(y$species, y$combinations) :
# argument 'pattern' has length > 1 and only the first element will be used
ifelse in your code seems to work, but you are using it incorrectly: it requires a no= argument as well, so there needs to be something as its third argument. It does not error here because everything resolves to be true (which is another problem), so it never attempts to evaluate no=.
ifelse(c(T,T), 1:2)
# [1] 1 2
ifelse(c(T,F), 1:2)
# Error in ifelse(c(T, F), 1:2) : argument "no" is missing, with no default
ifelse(c(T,F), 1:2, 11:12)
# [1] 1 12
What you're attempting to do is merge/join x and y, so the tools you want are among base::merge and dplyr::*_join (for starters, others exist). To better understand what's going on in a join, I suggest you see How to join (merge) data frames (inner, outer, left, right), https://stackoverflow.com/a/6188334/3358272.
Filter column using grepl to keep particular string match
Use ^ to match the beginning of a string, as in
df2 <- df1[,grepl("^H|^TCGA", colnames(df1))]
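A small check of the anchored pattern, using a hypothetical data frame df_demo since df1 is not shown. The sketch adds drop = FALSE, which is not in the line above: without it, base R returns a plain vector instead of a data frame when only one column matches.

```r
# Hypothetical data frame standing in for df1
df_demo <- data.frame(H1 = 1:2, TCGA_01 = 3:4, OTHER = 5:6, XH = 7:8)

# Keep only columns whose names start with "H" or "TCGA"
keep <- grepl("^H|^TCGA", colnames(df_demo))
kept <- df_demo[, keep, drop = FALSE]

colnames(kept)
# [1] "H1"      "TCGA_01"
```

"XH" is dropped because ^ anchors the match to the start of the name.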
We can also use dplyr with starts_with():
library(dplyr)
df1 %>%
  select(starts_with(c('H', 'TCGA')))