Use Grepl to Search Either of Multiple Substrings in a Text

Use grepl to search either of multiple substrings in a text

You could paste the genres together with an "or" | separator and run that through grepl as a single regular expression.

x <- c("Action", "Adventure", "Animation", ...)
grepl(paste(x, collapse = "|"), my_text)

Here's an example.

x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)
# [1] TRUE FALSE TRUE
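One caveat with pasting terms into a single regular expression is that any regex metacharacters in the terms (such as + or parentheses) are interpreted as regex syntax. A minimal sketch of one way around this, assuming hypothetical search terms that contain such characters, is to match each term literally with fixed = TRUE and combine the per-term results with an element-wise OR:

```r
# Hypothetical terms and texts; two of the terms contain regex metacharacters
terms <- c("C++", "Sci-Fi (new)", "Animation")
my_text <- c("Written in C++.", "This has none.", "Filed under Sci-Fi (new).")

# Match each term literally (fixed = TRUE), giving one logical vector per term
hits <- lapply(terms, grepl, x = my_text, fixed = TRUE)

# Element-wise OR across the per-term results
any_term <- Reduce("|", hits)
print(any_term)
# [1]  TRUE FALSE  TRUE
```

This trades the single regular expression for one grepl call per term, which is usually fine for a modest number of terms.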

Using R's grepl() to check whether multiple strings exist


Text <- c("instance", "percentage", "n", 
"instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage')  
OR
('percentage')(any character sequence)('instance')

Naturally, if you need to find any combination of more than two words, this will get pretty complicated. In that case, the solution mentioned in the comments (one grepl call per word, combined with &) would be easier to implement and read.

Another alternative that might be relevant when matching many words is to use positive look-ahead (which can be thought of as a 'non-consuming' match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you

pat <- c("instance", "percentage", "element", "character")

longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
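If the rowSums construction reads as too terse, the same "every pattern must match" test can be sketched with Reduce. The texts below are hypothetical and only serve to illustrate the idea; the look-ahead version is included for comparison:

```r
pat <- c("instance", "percentage", "element", "character")

# Hypothetical texts for illustration
txt <- c("instance percentage element character n",
         "instance percentage element o p",
         "p character element percentage instance")

# One logical vector per pattern, combined with an element-wise AND
all_reduce <- Reduce("&", lapply(pat, grepl, x = txt))

# Look-ahead version for comparison
all_perl <- grepl(paste0("(?=.*", pat, ")", collapse = ""), txt, perl = TRUE)

print(all_reduce)
# [1]  TRUE FALSE  TRUE
print(identical(all_reduce, all_perl))
# [1] TRUE
```

Unlike the sapply approach, Reduce over a list also behaves the same way when there is only a single text to test.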

As asked for in the comments, if you want to match whole words only, rather than substrings, you can specify word boundaries using \\b. For example:

tx <- c("cent element", "percentage element", "element cent", "element centimetre")

grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
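When the words are stored in a vector, the word-boundary markers can be added programmatically. The following is a small sketch under the assumption that every word should be matched as a whole word (note that this also applies \\b to "element", unlike the example above):

```r
words <- c("cent", "element")
tx <- c("cent element", "percentage element", "element cent", "element centimetre")

# Wrap each word in word boundaries, then in a positive look-ahead
whole <- paste0("\\b", words, "\\b")
lookahead <- paste0("(?=.*", whole, ")", collapse = "")

all_whole_words <- grepl(lookahead, tx, perl = TRUE)
print(all_whole_words)
# [1]  TRUE FALSE  TRUE FALSE
```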

Using grepl for multiple texts

It's not all that elegant, but this function does what you want:

funny_replace <- function(c, b, a) {

  max_or_null <- function(x) {
    if (length(x) != 0) max(x) else NULL
  }

  multi_grep <- function(b, x) {
    which(sapply(b, grepl, x))
  }

  replace_one <- function(s, b, a) {
    a[max_or_null(multi_grep(b, s))]
  }

  unlist(sapply(c, replace_one, b, a))
}
funny_replace(c, b, a)
# there is one there one is there is one three two
# "one" "one" "three"

It works as follows: max_or_null is used to return either the maximum value of a vector or NULL, if the vector is empty. This is used later to ensure that elements of c, where no pattern from b matched, are handled correctly.

multi_grep searches multiple patterns in a single string (the usual grep does the opposite: one pattern in multiple strings) and returns the indices of the patterns that were found.

replace_one takes a single string and uses multi_grep to check which of the patterns in b are found in it. It then uses max_or_null to return either the largest of these indices, or NULL if nothing matched. Finally, the element with this index is picked from a.

replace_one is then applied to each element of c to get the desired result.

I think it's a more functional solution than yours or a for loop, since it avoids the repeated assignment. On the other hand, it seems a bit complicated.

By the way: I have used a, b, and c everywhere to make it easier to match my code to your example. However, single-letter names like these are not considered good practice.
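For illustration, here is a condensed sketch of the same idea with descriptive names. The inputs are hypothetical (the original a, b and c are not shown here), last_match is an invented helper name, and unlike funny_replace this variant returns NA for strings where no pattern matches instead of dropping them:

```r
# Hypothetical inputs with descriptive names (stand-ins for c, b and a)
texts        <- c("there is one", "nothing here", "there is one three two")
patterns     <- c("one", "two", "three")
replacements <- c("ONE", "TWO", "THREE")

# Return the replacement for the last pattern found in s, or NA if none matches
last_match <- function(s, patterns, replacements) {
  idx <- which(vapply(patterns, grepl, logical(1), x = s))
  if (length(idx) == 0) NA_character_ else replacements[max(idx)]
}

result <- vapply(texts, last_match, character(1),
                 patterns, replacements, USE.NAMES = FALSE)
print(result)
# [1] "ONE"   NA      "THREE"
```

Returning NA keeps the result the same length as the input, which makes it easier to store next to the original strings.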

Check text for multiple substrings

As the warning message says, grepl will only use one pattern to match. However, you can build a single pattern that covers all of your names by joining them with an OR "|".

PAT = paste(comp$n, collapse="|")
grepl(PAT, text$text)
[1] TRUE TRUE TRUE

Matching multiple patterns

Yes, you can. The | in a grep pattern has the same meaning as or. So you can test for your pattern by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:

x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"

grep(pattern, x)
[1] 1 2 3

This returns the indices of the elements of your vector that contain a match (in this case, the first three).

Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:

grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
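As a short sketch of how the two results are typically used, both the index vector and the logical vector can subset x, and grep can also return the matching elements directly with value = TRUE:

```r
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"

by_index   <- x[grep(pattern, x)]             # subset by position
by_logical <- x[grepl(pattern, x)]            # subset by logical mask
by_value   <- grep(pattern, x, value = TRUE)  # let grep return the elements

print(by_value)
# [1] "1100" "0010" "1001"

# A logical mask is also easy to negate
non_matching <- x[!grepl(pattern, x)]
print(non_matching)
# [1] "1111"
```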

See ?regex for help about regular expressions in R.


Edit:
To avoid creating pattern manually we can use paste:

myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")

R: grepl to find matching strings in any order

| in regex means "or". That's why it is TRUE on both texts.
To require both phrases, test whether "illegal parking" is followed (with or without something in between) by "obstruction", which in regex is "illegal parking.*obstruction". To allow the opposite order as well, use "illegal parking.*obstruction|obstruction.*illegal parking".

grepl("illegal parking.*obstruction|obstruction.*illegal parking", Text1, ignore.case=TRUE)
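The difference between the two patterns can be seen on a small example. Text1 below is a hypothetical stand-in, since the original data is not shown:

```r
# Hypothetical stand-in for Text1
Text1 <- c("Illegal Parking causing obstruction",
           "Obstruction due to illegal parking",
           "Illegal parking only",
           "Obstruction only")

# "|" alone: TRUE if either phrase occurs
either <- grepl("illegal parking|obstruction", Text1, ignore.case = TRUE)

# Both phrases required, in either order
both <- grepl("illegal parking.*obstruction|obstruction.*illegal parking",
              Text1, ignore.case = TRUE)

print(either)
# [1] TRUE TRUE TRUE TRUE
print(both)
# [1]  TRUE  TRUE FALSE FALSE
```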

Linux Grep Command - Extract multiple texts between strings

To get 5281181XXXXX or the second string located between '334110' and "-" you can use a pattern like:

\b(?:SubId-|334110)\K[^,\s-]+

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • (?: Non capture group to match as a whole
    • SubId- Match literally
    • | Or
    • 334110 Match literally
  • ) Close the non capture group
  • \K Forget what is matched so far
  • [^,\s-]+ Match 1+ occurrences of any char except a comma, a whitespace char, or -


That will match:

5281181XXXXX
0102036XX

The command could look like

zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 | grep -oP '\b(?:SubId-|334110)\K[^,\s-]+' > analisis1.txt
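Because the pattern uses PCRE syntax, it can also be tried out in R with perl = TRUE. The log line below is hypothetical (the real log format is not shown here) and only illustrates which parts the pattern picks out:

```r
# Hypothetical log line; the real log format is an assumption
line <- "ResCode-5005, SubId-5281181XXXXX, Imsi-3341100102036XX-01"

# Same pattern as in the grep -oP command, with backslashes doubled for R
pattern <- "\\b(?:SubId-|334110)\\K[^,\\s-]+"

# gregexpr finds all matches; regmatches extracts them
matches <- regmatches(line, gregexpr(pattern, line, perl = TRUE))[[1]]
print(matches)
# [1] "5281181XXXXX" "0102036XX"
```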

How do i match multiple strings of a line using grep command

I'm not sure you can do this with pure grep, as you'd need to be able to specify a regex with grouped terms, and then only print out certain regex groups rather than everything matched by the entire regex - so you'd e.g. specify (.*INFO)(.*)(Param : [0-9]*) as the regex and then only print groups 1 and 3 (assuming you start counting at 1).

You can however use sed to post-process the output for you:

% cat foo
04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.
05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.
% grep 'Param :' foo | sed 's/\(.*INFO\)\(.*\)\(Param : [0-9]*\)\(.*\)/\1 \3/'
04-06-2013 INFO Param : 39
05-06-2013 INFO Param : 21

What I'm doing above is replacing the match with just groups 1 and 3, separated by a space.
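The same filter-then-substitute idea can be sketched in R, with grepl playing the role of grep and sub the role of sed. The first two lines are taken from the example file above; the third is a hypothetical line added to show the filtering step:

```r
lines <- c(
  "04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.",
  "05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.",
  "06-06-2013 DEBUG a line without the parameter"  # hypothetical extra line
)

# Keep only lines that contain the literal text "Param :"
with_param <- lines[grepl("Param :", lines, fixed = TRUE)]

# Replace each line with groups 1 and 3, separated by a space
trimmed <- sub("(.*INFO)(.*)(Param : [0-9]*)(.*)", "\\1 \\3", with_param)
print(trimmed)
# [1] "04-06-2013 INFO Param : 39" "05-06-2013 INFO Param : 21"
```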


grep searching multiple strings in a file

grepl() returns TRUE for each element that matches the pattern. Use this to filter your input vector. If you aren't familiar with regular expressions, it's probably wise to spend some time learning them. In this case, the pattern searches for your string with one or more digits in the middle.

input <- c("log2.read.counts.2289_12_Tumor_NF4_CTTGTAA_L002", 
"log2.read.counts.2289_1_Tail_cont_ATCACGA_L002",
"log2.read.counts.2289_2_Tail_Lmyc_CGATGTA_L002",
"log2.read.counts.2289_3_Tail_Nfib_TTAGGCA_L002",
"log2.read.counts.2289_4_Cell_LmycS3_TGACCAA_L002" )
> input[grepl("log2\\.read\\.counts\\.2289_[0-9]+_Tail", input)]
[1] "log2.read.counts.2289_1_Tail_cont_ATCACGA_L002"
[2] "log2.read.counts.2289_2_Tail_Lmyc_CGATGTA_L002"
[3] "log2.read.counts.2289_3_Tail_Nfib_TTAGGCA_L002"

