Use grepl to search for any of multiple substrings in a text
You could paste the genres together with an "or" (|) separator and run that through grepl as a single regular expression.
x <- c("Action", "Adventure", "Animation", ...)
grepl(paste(x, collapse = "|"), my_text)
Here's an example.
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)
# [1] TRUE FALSE TRUE
R's grepl() to check whether multiple strings exist
Text <- c("instance", "percentage", "n",
"instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE
The latter one works by looking for:
('instance')(any character sequence)('percentage')
OR
('percentage')(any character sequence)('instance')
Naturally, if you need to match more than two words in any order, this gets complicated quickly, because every possible ordering has to be spelled out. In that case the solution mentioned in the comments (one grepl call per word, combined with &) is easier to implement and read.
Another alternative that might be relevant when matching many words is to use positive look-ahead (which can be thought of as a 'non-consuming' match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.
# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
"character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))
# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
Text2, perl=TRUE)
# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) &
grepl("percentage", Text2) &
grepl("element", Text2) &
grepl("character", Text2)
# they produce identical results
identical(longperl, longstrd)
Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you
pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
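The rowSums line is compact but not obvious at first glance. As a minimal sketch (txt and the shortened pat below are made-up inputs for illustration, not taken from the question), this spells out what it computes and compares it with two equivalent formulations:

```r
# Hypothetical inputs, chosen only for illustration
pat <- c("instance", "percentage")
txt <- c("instance percentage", "instance only", "percentage of instance", "none")

# Logical matrix: one row per string, one column per pattern
hits <- sapply(pat, grepl, txt)

# TRUE - 1L is 0 and FALSE - 1L is -1, so a row sums to 0
# only when every pattern matched that string
all_rowsums <- rowSums(hits - 1L) == 0L

# Same test, folding the per-pattern logical vectors together with `&`
all_reduce <- Reduce(`&`, lapply(pat, grepl, txt))

# Same test with one look-ahead per pattern
all_perl <- grepl(paste0("(?=.*", pat, ")", collapse = ""), txt, perl = TRUE)

print(all_reduce)
```

One thing to keep in mind, as far as I can tell: sapply only returns a matrix here when the text vector has more than one element, so the rowSums variant should be expected to fail on a single string, whereas the Reduce and look-ahead variants do not depend on that.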
As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, you can specify word boundaries using \\b. E.g.:
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE
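If the words are stored in a vector, the word-boundary version can be built the same way as the look-ahead pattern earlier. The following is only a sketch; whole_word and all_whole are names introduced here for illustration, and it assumes every entry of pat should be matched as a whole word:

```r
pat <- c("cent", "element")
tx <- c("cent element", "percentage element", "element cent", "element centimetre")

# One look-ahead per word, each wrapped in \\b so substrings do not count
whole_word <- paste0("(?=.*\\b", pat, "\\b)", collapse = "")
all_whole <- grepl(whole_word, tx, perl = TRUE)

print(whole_word)
print(all_whole)
```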
Using grepl for multiple texts
It's not all that elegant, but this function does what you want:
funny_replace <- function(c, b, a) {
  max_or_null <- function(x) {
    if (length(x) != 0) max(x) else NULL
  }
  multi_grep <- function(b, x) {
    which(sapply(b, grepl, x))
  }
  replace_one <- function(s, b, a) {
    a[max_or_null(multi_grep(b, s))]
  }
  unlist(sapply(c, replace_one, b, a))
}
funny_replace(c, b, a)
# there is one there one is there is one three two
# "one" "one" "three"
It works as follows: max_or_null is used to return either the maximum value of a vector or NULL, if the vector is empty. This is used later to ensure that elements of c where no pattern from b matched are handled correctly.
multi_grep searches multiple patterns in a single string (the usual grep does the opposite: one pattern in multiple strings) and returns the indices of the patterns that were found.
replace_one takes a single string and checks which of the patterns in b are found using multi_grep. It then uses max_or_null to either return the largest of these indices, or NULL if nothing matched. Finally, the element with this index is picked from a.
replace_one is then applied to each element of c to get the desired result.
I think it's a more functional solution than yours or a for loop, since it avoids the repeated assignment. On the other hand, it seems a bit complicated.
By the way: I have used a, b, and c everywhere to make it easier to match my code to your example. However, this is not considered good practice, not least because c is also the name of a base R function.
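For comparison, here is a more compact sketch of the same idea with descriptive names. The inputs are hypothetical (the question's original a, b and c are not shown here), last_match is a name made up for this example, and unlike the function above it returns NA for strings where nothing matched instead of dropping them:

```r
# Hypothetical inputs, chosen to resemble the output shown above
strings      <- c("there is one", "there one is", "there is one three two", "nothing here")
patterns     <- c("one", "two", "three")
replacements <- c("one", "two", "three")

# For one string, return the replacement that belongs to the
# last pattern (highest index) found in it, or NA if none matched
last_match <- function(s, patterns, replacements) {
  idx <- which(vapply(patterns, grepl, logical(1), x = s))
  if (length(idx) == 0) NA_character_ else replacements[max(idx)]
}

result <- vapply(strings, last_match, character(1),
                 patterns, replacements, USE.NAMES = FALSE)
print(result)
```

Using vapply instead of sapply fixes the return type up front, so the result stays a character vector of the same length as the input.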
Check text for multiple substrings
As the warning message says, grepl only uses the first pattern when it is given more than one. However, you can make a single pattern that covers all of your names by joining them with an OR "|".
PAT = paste(comp$n, collapse="|")
grepl(PAT, text$text)
[1] TRUE TRUE TRUE
Matching multiple patterns
Yes, you can. The | in a grep pattern has the same meaning as "or". So you can test for your patterns by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns the indices of the elements of your vector that contained a match (in this case the first three).
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating the pattern manually, we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
R - grepl to find matching strings in any order
| in regex means "or". That's why it is TRUE for both texts.
You have to test whether "illegal parking" is followed (with or without something in between) by "obstruction", which in regex is "illegal parking.*obstruction", or whether it is the other way around. Together that gives "illegal parking.*obstruction|obstruction.*illegal parking":
grepl("illegal parking.*obstruction|obstruction.*illegal parking", Text1, ignore.case=TRUE)
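To make the difference concrete, here is a small sketch. Text1 below is a made-up stand-in, since the question's actual data is not shown here; either and both are illustrative names:

```r
# Hypothetical input standing in for the question's Text1
Text1 <- c("Illegal parking causing obstruction",
           "Obstruction due to illegal parking",
           "illegal parking",
           "obstruction on the road")

# "or": TRUE as soon as one of the two phrases occurs
either <- grepl("illegal parking|obstruction", Text1, ignore.case = TRUE)

# both phrases, in either order
both <- grepl("illegal parking.*obstruction|obstruction.*illegal parking",
              Text1, ignore.case = TRUE)

print(either)
print(both)
```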
Linux Grep Command - Extract multiple texts between strings
To get 5281181XXXXX or the second string, located between '334110' and "-", you can use a pattern like:
\b(?:SubId-|334110)\K[^,\s-]+
The pattern matches:
\b          A word boundary to prevent a partial word match
(?:         Non capture group to match as a whole
SubId-      Match literally
|           Or
334110      Match literally
)           Close the non capture group
\K          Forget what is matched so far
[^,\s-]+    Match 1+ occurrences of any char except a whitespace char, a comma (,) or a hyphen (-)
That will match:
5281181XXXXX
0102036XX
The command could look like
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 | grep -oP '\b(?:SubId-|334110)\K[^,\s-]+' > analisis1.txt
How do I match multiple strings of a line using the grep command
I'm not sure you can do this with pure grep, as you'd need to be able to specify a regex with grouped terms and then print only certain groups rather than everything matched by the entire regex. For example, you'd specify (.*INFO)(.*)(Param : [0-9]*) as the regex and then print only groups 1 and 3 (assuming you start counting at 1).
You can however use sed to post-process the output for you:
% cat foo
04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.
05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.
% grep 'Param :' foo | sed 's/\(.*INFO\)\(.*\)\(Param : [0-9]*\)\(.*\)/\1 \3/'
04-06-2013 INFO Param : 39
05-06-2013 INFO Param : 21
What I'm doing above is replacing the match with just groups 1 and 3, separated by a space.
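For readers working in R rather than the shell, the same group-based rewrite can be sketched with sub(). This is only an illustration: the two log lines are copied from the example above, and log_lines, has_param and trimmed are names made up for the sketch.

```r
# The two example lines from the file shown above
log_lines <- c(
  "04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.",
  "05-06-2013 INFO blah blah allah line 2 ending here with Param : 21."
)

# Keep only lines that contain the literal text "Param :" (the grep step)
has_param <- grepl("Param :", log_lines, fixed = TRUE)

# Replace the whole line with groups 1 and 3, separated by a space (the sed step)
trimmed <- sub("(.*INFO)(.*)(Param : [0-9]*)(.*)", "\\1 \\3", log_lines[has_param])

print(trimmed)
```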
grep searching multiple strings in a file
grepl() returns TRUE for each element that matches the pattern. Use this to filter your input vector. If you aren't familiar with regular expressions, it's probably wise to spend some time learning them. In this case, the pattern searches for your string with one or more digits in the middle.
input <- c("log2.read.counts.2289_12_Tumor_NF4_CTTGTAA_L002",
"log2.read.counts.2289_1_Tail_cont_ATCACGA_L002",
"log2.read.counts.2289_2_Tail_Lmyc_CGATGTA_L002",
"log2.read.counts.2289_3_Tail_Nfib_TTAGGCA_L002",
"log2.read.counts.2289_4_Cell_LmycS3_TGACCAA_L002" )
> input[grepl("log2\\.read\\.counts\\.2289_[0-9]+_Tail", input)]
[1] "log2.read.counts.2289_1_Tail_cont_ATCACGA_L002"
[2] "log2.read.counts.2289_2_Tail_Lmyc_CGATGTA_L002"
[3] "log2.read.counts.2289_3_Tail_Nfib_TTAGGCA_L002"