Using regex in R to detect a specific word anywhere in string
You may use grep with the regex pattern \bLT\b:
samplestr <- c("LT BLAHBLAH", "BLAH LT BLAH", "BLAHLT BLOO")
output <- grep("\\bLT\\b", samplestr, value=TRUE)
output
[1] "LT BLAHBLAH" "BLAH LT BLAH"
The pattern \bLT\b has word boundaries on either side of LT, so it only matches LT when it appears as a standalone word or, more generally, when it is surrounded by non-word characters (or by the start or end of the string).
Use stringr to extract the whole word in a string with a particular set of characters in it
To match a literal ?, it needs to be escaped as \\?, so A\\? will match A?. \\w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_]) and * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy).
unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"
Test if a large number of whole words appear in a string variable using grepl
You can dynamically add word boundaries using paste0:
df$test <- grepl(paste0('\\b', test, '\\b', collapse = '|'), df$string)
df
#  id    string  test
#1  1 clayville FALSE
#2  2   madison FALSE
#3  3   roberts  TRUE
#4  4     david  TRUE
#5  5  davidson FALSE
Complete word matching using grepl in R
"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:
> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE
Note that this matches "is!" but not "iso".
Regular expression to match a line that doesn't contain a word
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
Non-capturing variant:
^(?:(?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
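As a quick check of the three variants above, here is a minimal Python sketch. The sample strings and the variable names (no_hede, no_hede_dotall, no_hede_class) are made up for illustration; Python has no /.../ delimiters, so the DOT-ALL modifier is passed as the re.DOTALL flag.

```python
import re

# Hypothetical sample lines; only the ones without "hede" should be kept.
lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD"]

# Non-capturing variant of the pattern discussed above.
no_hede = re.compile(r"^(?:(?!hede).)*$")
kept = [s for s in lines if no_hede.match(s)]
print(kept)

# Without DOT-ALL, the dot stops at a line break, so multi-line input fails.
text = "first line\nsecond line"
print(no_hede.match(text) is not None)

# With DOT-ALL, or with [\s\S] in place of the dot, line breaks are consumed too.
no_hede_dotall = re.compile(r"^(?:(?!hede).)*$", re.DOTALL)
no_hede_class = re.compile(r"^(?:(?!hede)[\s\S])*$")
print(no_hede_dotall.match(text) is not None)
print(no_hede_class.match(text) is not None)

# Both multi-line variants still reject input that contains "hede".
bad = "first line\nhede"
print(no_hede_dotall.match(bad) is not None)
print(no_hede_class.match(bad) is not None)
```

The first list should contain only the three strings without "hede"; the multi-line text is rejected by the plain pattern and accepted by the two line-break-aware variants.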
Explanation
A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index    0      1      2      3      4      5      6      7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
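The walk over the empty strings can be reproduced with a small Python sketch. It assumes that the gaps e1..e9 correspond to string offsets 0..8, and it uses the pos argument of a compiled pattern's match method to test the bare lookahead at each offset; the names guard, results and failing are illustrative only.

```python
import re

s = "ABhedeCD"
guard = re.compile(r"(?!hede)")  # the zero-width check on its own

# Test the lookahead at every gap e1..e9 (offsets 0..len(s)).
results = [(f"e{i + 1}", guard.match(s, i) is not None) for i in range(len(s) + 1)]
for name, ok in results:
    print(name, "passes" if ok else "fails")

# Collect the gaps where the lookahead fails.
failing = [name for name, ok in results if not ok]
print(failing)

# Because one gap fails, the anchored pattern cannot consume the whole input.
whole = re.match(r"^((?!hede).)*$", s)
print(whole)
```

Only the gap directly in front of "hede" fails the check, which is enough to make the anchored pattern reject the whole string.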
regex match whole word and punctuation with it using re.search()
There are two issues here.
- In regex, . is special. It means "match one of any character". However, you are trying to use it to match a regular period. (It will indeed match that, but it will also match everything else.) Instead, to match a period, you need to use the pattern \. and, to match either a period or a hyphen, you can use a class, like [-.].
- You are using \b at the end of your pattern to match the word boundary, but \b is defined as being the boundary between a word character and a non-word character, and periods and spaces are both non-word characters. There is therefore no boundary between a period and the space after it, so Python won't find a match. Instead, you could use a lookahead assertion, which checks what follows the match without consuming it.
Now, to match a whole word - any word - you can do something like \w+, which matches one or more word characters.
Also, it is quite possible that there won't be a match anyway, so you should check whether a match occurred using an if statement or a try statement. Putting it all together:
import re

txt = "The indian in. Spain."
pattern = r"\w+[-.]"
# Look ahead for a non-word character instead of using a trailing \b.
x = re.search(r"\b" + pattern + r"(?=\W)", txt)
if x:
    print(x.start(), x.end())
Edit
There is one problem with the lookahead assertion above - it won't match at the end of the string. This means that if your text is "The rain in Spain." then it won't match "Spain.", as there is no non-word character following the final period.
To fix this, you can use a negative lookahead assertion, which succeeds when the following text does not match the pattern, and also does not consume the string.
x = re.search(r"\b" + pattern + r"(?!\w)", txt)
This will match when the character after the word is anything other than a word character, including the end of the string.
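To see the difference between the two assertions side by side, here is a small sketch that runs both on the example text from the edit above. The names positive and negative are illustrative, not part of the original answer.

```python
import re

pattern = r"\w+[-.]"
# Positive lookahead: requires an actual non-word character after the match.
positive = re.compile(r"\b" + pattern + r"(?=\W)")
# Negative lookahead: only requires that no word character follows.
negative = re.compile(r"\b" + pattern + r"(?!\w)")

txt = "The rain in Spain."

# The final period is the last character, so the positive lookahead fails.
pos_match = positive.search(txt)
print(pos_match)

# The negative lookahead also succeeds at the end of the string.
neg_match = negative.search(txt)
if neg_match:
    print(neg_match.group(), neg_match.span())
```

The positive version finds nothing here, while the negative version reports the final word together with its period.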
R Extract a word from a character string using pattern matching
Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string (?<=^| ), and then as few characters as possible .*? until the next space or the end of the string (?=$| ). Note that you can combine all the strings into a vector and a vector will be returned. If you want all matches for each string, then use str_extract_all instead of str_extract and you get a list with a vector for each string. If you want to match several alternative prefixes, use alternation inside a group, (AA|BB), as shown.
mytext <- c(
as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100,
as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)
library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"
as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"
Created on 2018-04-29 by the reprex package (v0.2.0).
Filter containing a word
We can use grep with [ to subset the rows. For [, drop = TRUE by default, so we need to set drop = FALSE; otherwise a single-column result is simplified to a vector.
df1[grep("\\bJeff\\b", df1$Path, ignore.case = TRUE),, drop = FALSE]
or with subset, where we don't need drop = FALSE because that is already the default:
subset(df1, grepl("\\bJeff\\b", Path, ignore.case = TRUE))
# Path
#1 Adam > Bob > Jeff
#4 Jeff > Adam > Bob
#5 Adam > Kevin > Jeff
The pattern we match would be "Jeff", but to make it more stringent, i.e. not to match "Jeffy" or "Jefferson", we can add the word boundary (\\b) before and after the word.
Grep in R to find words with custom extended boundaries
Use PCRE regex with lookarounds:
grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)
The (?<![A-Z]) negative lookbehind will fail the match if the word is preceded by an uppercase ASCII letter, and the negative lookahead (?![A-Z]) will fail the match if the word is followed by an uppercase ASCII letter.
To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).
An R demo:
> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5