How to Get the Text Between Two Words in R

How to get the text between two words in R?

You need .* at the end to match zero or more characters after the 'first':

gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"
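
The original value of x is not shown, so the following is only a minimal sketch with a hypothetical input sentence. It illustrates how the two alternatives in the pattern strip the text before and after the part you want to keep.

```r
# Hypothetical input; the original value of x is not shown in the answer.
x <- "This is my first post"

# Alternative 1: everything up to and including "This" plus trailing whitespace.
# Alternative 2: optional whitespace, "first", and everything after it.
# Both matched parts are replaced with the empty string.
between <- gsub('^.*This\\s*|\\s*first.*$', '', x)
print(between)
```

Without the trailing .* in the second alternative, only "first" itself would be removed and the rest of the sentence would stay in the result.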

How do I extract text between two characters in R

You may use

> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"

Here, the regex CITY:\s*\K.* matches:

  • CITY: - a literal substring CITY:
  • \s* - zero or more whitespace characters
  • \K - match reset operator that discards the text matched so far (it clears the current match buffer)
  • .* - any zero or more characters other than line break characters, as many as possible.

Note that since it is a PCRE regex, perl=TRUE is indispensable.
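
As a self-contained sketch (the input vector below is hypothetical, since the original x is not shown), the \K approach can be compared with a plain capture group, which gives the same result here:

```r
# Hypothetical input vector; the original x is not shown in the answer.
x <- c("NAME: JOHN CITY: ATLANTA", "NAME: MARY CITY: LAS VEGAS")

# \K drops "CITY:" and the whitespace from the reported match,
# so only the text after it is extracted. \K needs perl = TRUE.
cities <- unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl = TRUE)))

# Alternative: capture the value and replace the whole string with group 1.
cities_alt <- sub(".*CITY:\\s*(.*)", "\\1", x, perl = TRUE)

print(cities)
print(cities_alt)
```

The sub() variant returns the input unchanged when there is no match, whereas regmatches() simply returns nothing for that element, so the two differ on non-matching strings.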

Extracting a string between other two strings in R

You may use str_match with the pattern STR1 (.*?) STR2. Note that the spaces in this pattern are meaningful: to match anything between STR1 and STR2, use STR1(.*?)STR2, and to trim whitespace around the captured value, use STR1\\s*(.*?)\\s*STR2. If there are multiple occurrences, use str_match_all.

If the text you need may span line breaks, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 / (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
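
The two variations mentioned above (multiple occurrences and text spanning line breaks) might look like the following sketch. It assumes the stringr package is installed; the strings a and b are made-up inputs for illustration.

```r
library(stringr)

# Made-up input with two occurrences of the STR1 ... STR2 pair.
a <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# str_match_all returns a list with one matrix per input string;
# column 2 of that matrix holds capture group 1 for every occurrence.
all_matches <- str_match_all(a, "STR1\\s*(.*?)\\s*STR2")[[1]][, 2]

# Made-up input where the wanted text spans two lines.
b <- "STR1\nline one\nline two\nSTR2"

# With (?s) the dot also matches line breaks, so the capture can cross lines.
multi_line <- str_match(b, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]

# Without (?s) the dot stops at the first line break and nothing matches.
no_dotall <- str_match(b, "STR1\\s*(.*?)\\s*STR2")[, 2]

print(all_matches)
print(multi_line)
print(no_dotall)
```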

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
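
If all occurrences are needed in base R rather than only the first, one possible sketch uses gregexpr with a PCRE pattern. Instead of a capture group, it relies on \K and a lookahead so that the match itself is the wanted text; this is a swapped-in technique, not the regexec approach shown above.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# \K discards "STR1" and the whitespace after it from the match;
# (?=\\s*STR2) requires STR2 to follow without consuming it.
pattern_all <- "STR1\\s*\\K.*?(?=\\s*STR2)"

# gregexpr finds every match; regmatches returns a list with one
# character vector per input string.
all_values <- regmatches(test, gregexpr(pattern_all, test, perl = TRUE))[[1]]
print(all_values)
```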

How to extract text between two words using str_extract?

You can get the match without the dotall mode by first matching Address: and then capturing in group 1 all the lines that do not start with "This grant":

Address:\r?\n((?:(?!This grant\b).*(?:\r?\n|$))*)

In parts:

  • Address:\r?\n Match Address: and a newline
  • ( Capture group 1
    • (?: Non-capturing group
      • (?!This grant\b).* Match the whole line if what is directly to the right is not "This grant"
      • (?:\r?\n|$) Match either a newline or assert the end of the string
    • )* Close the non-capturing group and repeat it to get all the lines
  • ) Close group 1

For example

library(stringr)
test_case <- "Address:
The PowerPool Corp
1434 Holyfried Route, Unit A
Melope, VA 21151

This grant is issued"

str_match(test_case, "Address:\\r?\\n((?:(?!This grant\\b).*(?:\\r?\\n|$))*)")[,2]

Output

[1] "The PowerPool Corp\n1434 Holyfried Route, Unit A\nMelope, VA 21151\n\n"
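
The captured block ends with the blank line that precedes "This grant". A possible follow-up, sketched here in base R (regexec with perl = TRUE is used instead of str_match, which is an assumption about what is acceptable in your setup), trims that trailing whitespace and splits the address into separate lines:

```r
test_case <- "Address:\nThe PowerPool Corp\n1434 Holyfried Route, Unit A\nMelope, VA 21151\n\nThis grant is issued"

# Same pattern as above; the negative lookahead needs PCRE, hence perl = TRUE.
pattern <- "Address:\\r?\\n((?:(?!This grant\\b).*(?:\\r?\\n|$))*)"

# Element 1 of the result is the full match, element 2 is capture group 1.
block <- regmatches(test_case, regexec(pattern, test_case, perl = TRUE))[[1]][2]

# Drop the trailing newlines, then split on LF or CRLF line endings.
address_lines <- strsplit(trimws(block), "\r?\n")[[1]]
print(address_lines)
```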

regex: get text between two words (in R)

It doesn't look like you are capturing anything in your search pattern. You just need some parentheses in there to capture the target, so that \\1 returns it:

words <- c("these are some different abstract words that might be between keywords or they might just be bounded by abstract ideas")
gsub(".* abstract (.*) keywords.*", "\\1", words)
[1] "words that might be between"
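
One caveat with this approach is that gsub returns the input unchanged when the pattern does not match. A small helper that returns NA instead could be sketched as follows; the function name between_words is hypothetical, and it assumes that left and right contain no regex metacharacters.

```r
# Hypothetical helper: returns the text between `left` and `right`
# for each element of x, or NA when the pair is not found.
# Assumes `left` and `right` contain no regex metacharacters.
between_words <- function(x, left, right) {
  pattern <- paste0(left, "\\s*(.*?)\\s*", right)
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each element is c(full_match, group_1), or character(0) when nothing matched.
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

words <- c("some abstract words that might be between keywords here",
           "no markers in this one")
result <- between_words(words, "abstract", "keywords")
print(result)
```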

extract substrings of text between two repeating strings

I cannot think of a row-wise solution right now, but maybe this helps as well:

library(dplyr)
documents %>%
  mutate(text = strsplit(as.character(text), 'PART ')) %>%
  tidyr::unnest(text) %>%
  mutate(text = trimws(sub('\\d+ (.*) Matters.*', '\\1', text))) %>%
  filter(text != '') %>%
  group_by(doc_id) %>%
  summarise(text = paste(text, collapse = ', '))

It splits each text at PART, so that we can work on each piece separately and cut the relevant text out of the longer string. Afterwards, the pieces are concatenated per doc_id.
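
The same split, extract, and paste idea can also be sketched per document in base R, which may serve as a row-wise variant. The documents data frame below is invented for illustration (the original data is not shown), and extract_parts is a hypothetical helper name.

```r
# Invented example data; the original `documents` is not shown.
documents <- data.frame(
  doc_id = c(1, 2),
  text = c("PART 1 Governance Matters are discussed. PART 2 Finance Matters follow.",
           "PART 1 Legal Matters only."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: handles the text of one document at a time.
extract_parts <- function(txt) {
  # Split at the literal marker, then keep only pieces that mention "Matters".
  pieces <- strsplit(txt, "PART ", fixed = TRUE)[[1]]
  pieces <- pieces[grepl("Matters", pieces, fixed = TRUE)]
  # Keep what lies between the part number and " Matters".
  topics <- trimws(sub("\\d+ (.*) Matters.*", "\\1", pieces))
  paste(topics, collapse = ", ")
}

documents$topics <- vapply(documents$text, extract_parts, character(1),
                           USE.NAMES = FALSE)
print(documents$topics)
```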

How to gsub on the text between two words in R?

You need to find the unknown word between "Tree" and "Lake" first. You can use

unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\w+ = one or more word characters) up to Lake, and then matches the rest of the string, so each matching string is replaced with its captured word (strings without a match are returned unchanged). The result is a vector; you can access its first element with the [[1]] index.

Then, when you know the word, replace it with

gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)

Here, the pattern is [[:space:]]*( + unknown_word[[1]] + )[[:space:]]*. It matches zero or more whitespace characters on both sides of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the whitespace on each side is collapsed into a single space (or a space is added if there was none), a newline is inserted before the word, and \\1 restores the unknown word. You may replace [[:space:]] with \\s.

UPDATE

If you need to add a newline only before occurrences of RU that are whole words, use the \b word boundary:

> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"
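
Putting both steps together, a minimal end-to-end sketch might look like this. The text vector is inferred from the output above, and perl = TRUE is added here as an assumption to get PCRE handling of \b; the original calls use the default engine.

```r
# Input inferred from the output shown above.
text <- c("TreeRULakeSunWater", "A B C RU D")

# Step 1: find the unknown word between "Tree" and "Lake".
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text, perl = TRUE)
word <- unknown_word[[1]]

# Step 2a: without word boundaries, the word is also changed inside "TreeRULake...".
loose <- gsub(paste0("[[:space:]]*(", word, ")[[:space:]]*"),
              " \n\\1 ", text, perl = TRUE)

# Step 2b: with \b, only standalone occurrences of the word are changed.
strict <- gsub(paste0("[[:space:]]*\\b(", word, ")\\b[[:space:]]*"),
               " \n\\1 ", text, perl = TRUE)

print(word)
print(loose)
print(strict)
```

Note that the extracted word is pasted into a regex as-is, so a word containing regex metacharacters would need escaping first.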

