Using Regex in R to Find Strings as Whole Words (But Not Strings as Part of Words)

Using regex in R to detect a specific word anywhere in string

You may use grep with the regex pattern \bLT\b:

samplestr <- c("LT BLAHBLAH", "BLAH LT BLAH", "BLAHLT BLOO")
output <- grep("\\bLT\\b", samplestr, value=TRUE)
output

[1] "LT BLAHBLAH" "BLAH LT BLAH"

The pattern \bLT\b has a word boundary on either side of LT, so it only matches LT when it appears as a standalone word or, more generally, when it is surrounded by non-word characters or by the start or end of the string.
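
For comparison, here is a minimal sketch of the same whole-word filter in Python's re module, which also supports \b. The sample strings are the ones from the R example; the variable names are only illustrative.

```python
import re

# Same sample strings as in the R example above.
samplestr = ["LT BLAHBLAH", "BLAH LT BLAH", "BLAHLT BLOO"]

# \b asserts a position between a word character and a non-word
# character (or the start/end of the string).
whole_word = re.compile(r"\bLT\b")

# Keep only the strings in which LT occurs as a standalone word.
output = [s for s in samplestr if whole_word.search(s)]
print(output)  # ['LT BLAHBLAH', 'BLAH LT BLAH']
```

"BLAHLT BLOO" is dropped because the L of LT is preceded by H, a word character, so there is no boundary at that position.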

Use stringr to extract the whole word in a string with a particular set of characters in it

To match a literal ?, escape it as \\?, so A\\? will match A?. \\w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_]) and * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy).

unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"

Test if a large number of whole words appear in a string variable using grepl

You can add word boundaries dynamically using paste0:

df$test <- grepl(paste0('\\b', test, '\\b', collapse = '|'), df$string)
df
#  id    string  test
#1  1 clayville FALSE
#2  2   madison FALSE
#3  3   roberts  TRUE
#4  4     david  TRUE
#5  5  davidson FALSE
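
The same construction can be sketched in Python as a point of comparison. The strings come from the printed data frame above; the search words are an assumption, since the original `test` vector is not shown. re.escape is added here as a precaution and is not part of the R answer.

```python
import re

# Strings taken from the printed data frame above.
strings = ["clayville", "madison", "roberts", "david", "davidson"]

# Hypothetical search words: the original `test` vector is not shown.
words = ["roberts", "david"]

# Wrap each word in \b...\b and join the alternatives with |.
# re.escape guards against words that contain regex metacharacters.
pattern = re.compile("|".join(rf"\b{re.escape(w)}\b" for w in words))

flags = [bool(pattern.search(s)) for s in strings]
print(flags)  # [False, False, True, True, False]
```

"davidson" stays False because the d of "david" is followed by s, so the closing \b cannot match there.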

Complete word matching using grepl in R

"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:

> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE

Note that this matches the "is" in "it is!" but not the "is" in "iso" or "this".

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
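
The behaviour described above can be checked with a short Python sketch. It uses the non-capturing form of the pattern; the test strings other than "ABhedeCD" are made up for illustration.

```python
import re

# Tempered negative lookahead: before consuming each character,
# check that "hede" does not start at the current position.
no_hede = re.compile(r"^(?:(?!hede).)*$")

lines = ["ABhedeCD", "ABCD", "hede", "", "he de"]
kept = [line for line in lines if no_hede.search(line)]
print(kept)  # ['ABCD', '', 'he de']

# With re.DOTALL the dot also matches line breaks, so a multi-line
# input is tested as one string.
no_hede_dotall = re.compile(r"^(?:(?!hede).)*$", re.DOTALL)
print(bool(no_hede_dotall.search("AB\nCD")))    # True
print(bool(no_hede_dotall.search("AB\nhede")))  # False
```

For a plain substring test like this, `"hede" not in line` is simpler; the lookahead form is mainly useful when the condition has to live inside a larger pattern.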

regex match whole word and punctuation with it using re.search()

There are two issues here.

  1. In regex . is special. It means "match one of any character". However, you are trying to use it to match a regular period. (It will indeed match that, but it will also match everything else.) Instead, to match a period, you need to use the pattern \.. And to change that to match either a period or a hyphen, you can use a class, like [-.].
  2. You are using \b at the end of your pattern to match the word boundary, but \b is defined as the boundary between a word character and a non-word character, and periods and spaces are both non-word characters. There is therefore no boundary between the period and the space that follows it, so Python won't find a match. Instead, you could use a lookahead assertion, which checks the character that follows without consuming it.

Now, to match a whole word - any word - you can do something like \w+, which matches one or more word characters.

Also, it is quite possible that there won't be a match anyway, so you should check whether a match occurred using an if statement or a try statement. Putting it all together:

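import re

txt = "The indian in. Spain."
pattern = r"\w+[-.]"
# Word boundary, then a word ending in . or -, then a non-word character (not consumed)
x = re.search(r"\b" + pattern + r"(?=\W)", txt)
if x:  # re.search returns None when nothing matches
    print(x.start(), x.end())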

Edit

There is one problem with the lookahead assertion above: it can't succeed at the end of the string. This means that if your text is "The rain in Spain." then it won't match "Spain.", as there is no non-word character following the final period.

To fix this, you can use a negative lookahead assertion, which matches when the following text does not include the pattern, and also does not consume the string.

x = re.search(r"\b" + pattern + r"(?!\w)", txt)

This will match when the character after the word is anything other than a word character, including the end of the string.
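
A small side-by-side sketch of the two assertions on the end-of-string case. The names `strict` and `relaxed` are only illustrative.

```python
import re

pattern = r"\w+[-.]"
txt = "The rain in Spain."

# Positive lookahead: requires an actual non-word character next,
# so it cannot succeed at the very end of the string.
strict = re.search(r"\b" + pattern + r"(?=\W)", txt)

# Negative lookahead: only requires that no word character follows,
# which is also true at the end of the string.
relaxed = re.search(r"\b" + pattern + r"(?!\w)", txt)

print(strict)                            # None
print(relaxed.group(), relaxed.span())   # Spain. (12, 18)
```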

R Extract a word from a character string using pattern matching

Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), then as few characters as possible, .*?, up to the next space or the end of the string, (?=$| ). Because the functions are vectorised, you can put all the strings in one character vector and get a vector back. If you want all matches for each string, use str_extract_all instead of str_extract; it returns a list with one vector per string. To match several prefixes, use alternation inside a group, such as (AA|BB), as shown.

mytext <- c(
  as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Want to return AA12345
  as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100
  as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
  as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
  as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
  as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)

library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"

as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"

Created on 2018-04-29 by the reprex package (v0.2.0).
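
For readers working in Python, a rough equivalent is sketched below. Python's re module only accepts fixed-width lookbehinds, so the (?<=^| ) form used above is not expected to compile there; this sketch swaps in (?<!\S) and \S* instead. That is a different technique from the stringr answer and treats any whitespace, not just a space, as a delimiter.

```python
import re

# A subset of the strings from the R example above.
mytext = [
    "HORSE MONKEY LIZARD AA12345 SWORDFISH",
    "ELEPHANT KOALA POLAR.BEAR",
    "ELEPHANT AA12345 KOALA POLAR.BEAR AA5785",
]

# (?<!\S) means "not preceded by a non-whitespace character", which
# covers both the start of the string and a preceding space.
# \S* then consumes the rest of the whitespace-delimited token.
token = re.compile(r"(?<!\S)AA\S*")

for s in mytext:
    print(token.findall(s))

# Several prefixes: use a non-capturing group so that findall
# returns whole matches rather than just the group contents.
multi = re.compile(r"(?<!\S)(?:AA|BB)\S*")
print(multi.findall("TULIP AA999 DAISY BB123"))  # ['AA999', 'BB123']
```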

Filter containing a word

We can use grep with [ for row indexing. Because [ uses drop = TRUE by default, set drop = FALSE so that a one-column or one-row result is not simplified to a vector:

df1[grep("\\bJeff\\b", df1$Path, ignore.case = TRUE),, drop = FALSE]

or use subset, which does not need drop = FALSE because that is already its default:

subset(df1, grepl("\\bJeff\\b", Path, ignore.case = TRUE))
# Path
#1 Adam > Bob > Jeff
#4 Jeff > Adam > Bob
#5 Adam > Kevin > Jeff

The pattern to match is "Jeff"; to make it stricter, so that it does not match "Jeffy" or "Jefferson", add a word boundary (\\b) before and after the word.

Grep in R to find words with custom extended boundaries

Use PCRE regex with lookarounds:

grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)

The (?<![A-Z]) negative lookbehind will fail the match if the word is preceded with an uppercase ASCII letter and the negative lookahead (?![A-Z]) will fail the match if the word is followed with an uppercase ASCII letter.

To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).

For example:

> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
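
To see how the custom boundaries differ from \b, here is a comparison sketched in Python with the same sample data. The `standard` pattern and the use of re.escape are additions for illustration and are not part of the R answer.

```python
import re

targettext = ["DOG MOUSE CAT", "DOG MOUSE:CAT", "DOG_MOUSE9CAT",
              "MOUSE9CAT", "DOG_MOUSE"]
searchwords = ["MOUSE", "FROG"]

alternatives = "|".join(re.escape(w) for w in searchwords)

# Custom boundary: the word may not touch an uppercase ASCII letter.
custom = re.compile(rf"(?<![A-Z])(?:{alternatives})(?![A-Z])")
# Standard boundary, for comparison: digits and _ count as word characters.
standard = re.compile(rf"\b(?:{alternatives})\b")

# Python positions are 0-based, whereas R's grep returns 1-based indices.
custom_hits = [i for i, s in enumerate(targettext) if custom.search(s)]
standard_hits = [i for i, s in enumerate(targettext) if standard.search(s)]
print(custom_hits)    # [0, 1, 2, 3, 4]
print(standard_hits)  # [0, 1]
```

With \b, "DOG_MOUSE9CAT", "MOUSE9CAT" and "DOG_MOUSE" are missed because _ and 9 are word characters; the lookarounds only exclude uppercase letters, so all five strings match.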

