Grepl in R to Find Matches to Any of a List of Character Strings

grepl in R to find matches to any of a list of character strings

You can use an "or" (|) statement inside the regular expression of grepl.

ifelse(grepl("dog|cat", data$animal), "keep", "discard")
# [1] "keep" "keep" "discard" "keep" "keep" "keep" "keep" "discard"
# [9] "keep" "keep" "keep" "keep" "keep" "keep" "discard" "keep"
#[17] "discard" "keep" "keep" "discard" "keep" "keep" "discard" "keep"
#[25] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[33] "keep" "discard" "keep" "discard" "keep" "discard" "keep" "keep"
#[41] "keep" "keep" "keep" "keep" "keep" "keep" "keep" "keep"
#[49] "keep" "discard"

The regular expression dog|cat tells the regex engine to look for either "dog" or "cat"; grepl returns TRUE for every element that contains at least one of them.
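When the strings to match are stored in a vector rather than typed into the pattern, the alternation can be built with paste(). The following is a minimal sketch; the animal and targets vectors are invented for illustration and stand in for data$animal and the list of strings.

```r
# Hypothetical data standing in for data$animal
animal <- c("dog", "big cat", "bird", "hotdog", "fish")

# Hypothetical list of strings to look for
targets <- c("dog", "cat")

# Collapse the list into one alternation pattern: "dog|cat"
pattern <- paste(targets, collapse = "|")

# TRUE wherever any target occurs as a substring
hits <- grepl(pattern, animal)

result <- ifelse(hits, "keep", "discard")
print(result)
```

Note that this is substring matching, so "hotdog" is kept as well. If the targets can contain regex metacharacters such as . or (, they would need escaping before being pasted into a pattern.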

R - grepl to find matching strings in any order

| in regex means "or", which is why the pattern is TRUE for both texts.
To require both phrases, test whether "illegal parking" is followed (with or without anything in between) by "obstruction", which in regex is "illegal parking.*obstruction". To allow the reverse order as well, join the two orderings with |: "illegal parking.*obstruction|obstruction.*illegal parking".

grepl("illegal parking.*obstruction|obstruction.*illegal parking", Text1, ignore.case=TRUE)
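The question's Text1 is not shown, so the sketch below uses invented strings to contrast a plain "or" with the two-ordering pattern.

```r
# Hypothetical example strings; the original Text1 is not available
Text1 <- c("Illegal parking causing obstruction",
           "Obstruction due to illegal parking",
           "illegal parking only",
           "obstruction only")

# Plain "or": TRUE when either phrase is present
either <- grepl("illegal parking|obstruction", Text1, ignore.case = TRUE)

# Both phrases required, in either order
both <- grepl("illegal parking.*obstruction|obstruction.*illegal parking",
              Text1, ignore.case = TRUE)

print(either)
print(both)
```

Only the first two strings contain both phrases, so only they pass the second test, while all four pass the first.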

R's grepl() to find multiple strings exists

Text <- c("instance", "percentage", "n", 
"instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage')  
OR
('percentage')(any character sequence)('instance')

Naturally, if you need to find any combination of more than two words, this gets complicated quickly. In that case the solution mentioned in the comments (separate grepl calls combined with &, shown further down) is easier to implement and read.

Another alternative that may be relevant when matching many words is positive look-ahead, which can be thought of as a 'non-consuming' match. This requires Perl-compatible regular expressions, enabled with perl=TRUE.

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you have the patterns stored in a vector, you can condense the expressions significantly:

pat <- c("instance", "percentage", "element", "character")

longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
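As a possible variation on the condensed forms, the per-pattern results can be combined with Reduce() instead of rowSums(). This is a sketch, not part of the original answer; Text3 is an invented input. One reason to consider it is that sapply() may simplify to a plain vector when the text has length one, which rowSums() does not accept.

```r
pat <- c("instance", "percentage", "element", "character")

# Hypothetical input strings
Text3 <- c("instance percentage element character n",
           "instance percentage element o p",
           "character element percentage instance")

# One logical vector per pattern, combined element-wise with &
all_present <- Reduce(`&`, lapply(pat, grepl, x = Text3))

# Same test expressed as chained look-aheads
lookahead <- grepl(paste0("(?=.*", pat, ")", collapse = ""), Text3, perl = TRUE)

print(all_present)
print(identical(all_present, lookahead))

# Also works on a single string
single <- Reduce(`&`, lapply(pat, grepl, x = "instance"))
```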

As asked for in the comments: to match exact words only, i.e. not substrings of longer words, specify word boundaries with \\b. For example:

tx <- c("cent element", "percentage element", "element cent", "element centimetre")

grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
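The word-boundary idea can be combined with the vector-based pattern construction shown earlier. A minimal sketch, assuming every word in the (hypothetical) words vector should match as a whole word:

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")

# Hypothetical list of words that must all appear as whole words
words <- c("cent", "element")

# Wrap each word in \\b...\\b inside its own look-ahead
pattern <- paste0("(?=.*\\b", words, "\\b)", collapse = "")

whole_word <- grepl(pattern, tx, perl = TRUE)

print(pattern)
print(whole_word)
```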

Using grepl to match character in a string of characters with delimiters

Try this, choosing one of type1 or type2 (same result), whichever you prefer.

library(dplyr)
left_join(y, x, by = "variable") %>%
  mutate(
    type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")),
    type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations)
  )
# # A tibble: 6 x 6
#   variable species active combinations type1 type2
#   <chr>    <chr>   <chr>  <chr>        <lgl> <lgl>
# 1 A        16      16     16, 17, 18   TRUE  TRUE
# 2 C        16      16     16,18        TRUE  TRUE
# 3 A        17      <NA>   16, 17, 18   TRUE  TRUE
# 4 C        17      <NA>   16,18        FALSE FALSE
# 5 A        18      <NA>   16, 17, 18   TRUE  TRUE
# 6 C        18      <NA>   16,18        TRUE  TRUE

Or starting with the original y:

y
#   variable species active
# 1     A, C      16     16
# 2     <NA>      17   <NA>
# 3     <NA>      18   <NA>

y %>%
  mutate(variable = zoo::na.locf(variable)) %>%
  tidyr::separate_rows(variable) %>%
  left_join(., x, by = "variable") %>%
  mutate(
    type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")),
    type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations)
  )
# # A tibble: 6 x 6
#   variable species active combinations type1 type2
#   <chr>    <chr>   <chr>  <chr>        <lgl> <lgl>
# 1 A        16      16     16, 17, 18   TRUE  TRUE
# 2 C        16      16     16,18        TRUE  TRUE
# 3 A        17      <NA>   16, 17, 18   TRUE  TRUE
# 4 C        17      <NA>   16,18        FALSE FALSE
# 5 A        18      <NA>   16, 17, 18   TRUE  TRUE
# 6 C        18      <NA>   16,18        TRUE  TRUE
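To see the two row-wise tests in isolation, here is a base-R sketch without the join. The joined data frame is invented for illustration (the question's full x and y are not shown); its last row is chosen so that a plain substring match would wrongly succeed.

```r
# Hypothetical stand-in for the joined table
joined <- data.frame(
  species      = c("16", "17", "17", "16"),
  combinations = c("16, 17, 18", "16,18", "16, 17, 18", "116, 18"),
  stringsAsFactors = FALSE
)

# type1: split each string on runs of non-digits, then test membership
type1 <- mapply(`%in%`, joined$species, strsplit(joined$combinations, "\\D+"),
                USE.NAMES = FALSE)

# type2: one word-bounded pattern per row, matched against that row's string
type2 <- mapply(grepl, paste0("\\b", joined$species, "\\b"), joined$combinations,
                USE.NAMES = FALSE)

# For contrast: an unbounded pattern matches "16" inside "116"
naive <- mapply(grepl, joined$species, joined$combinations, USE.NAMES = FALSE)

print(type1)
print(type2)
print(naive)
```

mapply() pairs the i-th pattern with the i-th string, which is what a single grepl() call cannot do because it only uses the first pattern.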

FYI, some things wrong with your question:

  1. When a question involves warnings or errors, include them in the question. Here, grepl's pattern argument must have length 1, and it appears you are ignoring the resulting warning:

    grepl(y$species, y$combinations)
    # Warning in grepl(y$species, y$combinations) :
    # argument 'pattern' has length > 1 and only the first element will be used
  2. The ifelse in your code appears to work, but it is used incorrectly: ifelse also requires a no= argument, so it needs a third argument. It does not error here only because every test value is TRUE (which is a separate problem), so no= is never evaluated.

    ifelse(c(T,T), 1:2)
    # [1] 1 2
    ifelse(c(T,F), 1:2)
    # Error in ifelse(c(T, F), 1:2) : argument "no" is missing, with no default
    ifelse(c(T,F), 1:2, 11:12)
    # [1] 1 12
  3. What you're attempting to do is merge/join x and y, so the tools you want are among base::merge and dplyr::*_join (for starters, others exist). To better understand what's going on in a join, I suggest you see How to join (merge) data frames (inner, outer, left, right), https://stackoverflow.com/a/6188334/3358272.
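For the third point, a base-R left join might look like the sketch below. The column names follow the answer, but the values in x and y are invented, since the question's data is not reproduced here.

```r
# Hypothetical tables; values are invented for illustration
x <- data.frame(variable     = c("A", "C"),
                combinations = c("16, 17, 18", "16,18"),
                stringsAsFactors = FALSE)
y <- data.frame(variable = c("A", "C", "A", "C"),
                species  = c("16", "16", "17", "17"),
                stringsAsFactors = FALSE)

# Left join: keep every row of y and attach the matching row of x
joined <- merge(y, x, by = "variable", all.x = TRUE)

print(joined)
```

Each row of y now carries the combinations string for its variable, so a row-wise test such as mapply() with grepl can be applied afterwards.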

Filter column using grepl to keep particular string match

Use ^ to match the beginning of a string, as in

df2 <- df1[,grepl("^H|^TCGA", colnames(df1))]

We can also use dplyr with starts_with():

library(dplyr)

df1 %>%
  select(starts_with(c('H', 'TCGA')))
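A small base-R sketch of the anchored pattern on an invented data frame (the original df1 is not shown) makes the effect of ^ visible: a column with "H" in the middle of its name is not selected.

```r
# Hypothetical data frame; column names are invented for illustration
df1 <- data.frame(H1 = 1:2, TCGA_01 = 3:4, XH2 = 5:6, other = 7:8)

# Anchored alternation: the name must START with "H" or "TCGA"
keep <- grepl("^H|^TCGA", colnames(df1))

# drop = FALSE keeps a data frame even if only one column matches
df2 <- df1[, keep, drop = FALSE]

print(colnames(df2))
```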

