How to Extract Substring Between Patterns "_" and "." in R

Extracting a string between two other strings in R

You may use str_match with the pattern STR1 (.*?) STR2. Note that the spaces in this pattern are meaningful: to match anything between STR1 and STR2 use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim surrounding whitespace from the captured value. If there are multiple occurrences, use str_match_all.

Also, if the text you need to match can span line breaks, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 / (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
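The two variations mentioned above (multiple occurrences, and values that span line breaks) can be sketched as follows. The input strings b and multi are made up for illustration; only str_match and str_match_all from stringr are used.

```r
library(stringr)

# Hypothetical input with two STR1 ... STR2 pairs
b <- "STR1 GET_ME STR2, filler, STR1 GET_ME2 STR2"

# str_match_all returns a list with one matrix per input string;
# column 1 holds the full match, column 2 the first capture group
all_res <- str_match_all(b, "STR1\\s*(.*?)\\s*STR2")
all_res[[1]][, 2]
# "GET_ME" "GET_ME2"

# Hypothetical input where the wanted value contains a line break
multi <- "STR1 line one\nline two STR2"

# Without (?s) the dot does not match "\n", so nothing is found (NA)
no_flag <- str_match(multi, "STR1\\s*(.*?)\\s*STR2")[, 2]

# With (?s) the dot also matches "\n"
with_flag <- str_match(multi, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]
```

Here with_flag keeps the embedded newline, so any further cleanup of the captured value is left to the caller.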

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
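If every occurrence is needed rather than just the first, one possible base R sketch is to collect the full matches with gregexpr and then strip the delimiters with sub. This two-step approach is an illustration, not part of the answer above; perl = TRUE is used so that the lazy quantifier behaves as in PCRE.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# Step 1: collect every full STR1 ... STR2 match
full <- regmatches(test, gregexpr("STR1\\s*.*?\\s*STR2", test, perl = TRUE))[[1]]

# Step 2: strip the delimiters and surrounding whitespace from each match
inner <- sub("^STR1\\s*(.*?)\\s*STR2$", "\\1", full, perl = TRUE)
inner
# "GET_ME" "GET_ME2"
```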

Extract a substring according to a pattern

Here are a few ways:

1) sub

sub(".*:", "", string)
## [1] "E001" "E002" "E003"

2) strsplit

sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"

3) read.table

read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"

4) substring

This assumes the second portion always starts at the 4th character (which is the case in the example in the question):

substring(string, 4)
## [1] "E001" "E002" "E003"

4a) substring/regex

If the colon were not always in a known position we could modify (4) by searching for it:

substring(string, regexpr(":", string) + 1)

5) strapplyc

strapplyc returns the parenthesized portion:

library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"

6) read.dcf

This one only works if the substrings before the colon are unique (which they are in the example in the question). It also requires that the separator be a colon (which it is in the question). If a different separator were used, we could use sub to replace it with a colon first. For example, if the separator were _, we could first run string <- sub("_", ":", string).

c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"

7) separate

7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for after, and then extract the latter.

library(dplyr)
library(tidyr)

DF <- data.frame(string)
DF %>%
  separate(string, into = c("pre", "post")) %>%
  pull("post")
## [1] "E001" "E002" "E003"

7b) Alternatively, separate can be used to create only the post column; we then unlist and unname the resulting data frame:

library(dplyr)
library(tidyr)

DF %>%
  separate(string, into = c(NA, "post")) %>%
  unlist %>%
  unname
## [1] "E001" "E002" "E003"

8) trimws

We can use trimws to trim word characters off the left and then use it again to trim the colon.

trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"

Note

The input string is assumed to be:

string <- c("G1:E001", "G2:E002", "G3:E003")
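With the input defined, the base R approaches above can be checked against each other. The sketch below is only a sanity check and covers approaches 1 to 4a; the vector varied is a hypothetical input added to show why searching for the colon (4a) is more robust than a fixed offset (4).

```r
string <- c("G1:E001", "G2:E002", "G3:E003")
expected <- c("E001", "E002", "E003")

# Run each base R approach on the same input
results <- list(
  sub        = sub(".*:", "", string),
  strsplit   = sapply(strsplit(string, ":"), "[", 2),
  read.table = read.table(text = string, sep = ":", as.is = TRUE)$V2,
  substring  = substring(string, 4),
  regexpr    = substring(string, regexpr(":", string) + 1)
)

# TRUE only if every approach returns exactly the expected vector
all_agree <- all(vapply(results, identical, logical(1), expected))
print(all_agree)

# Hypothetical input whose prefix length varies
varied <- c("G1:E001", "G10:E002", "G100:E003")
fixed_offset <- substring(varied, 4)                      # "E001" ":E002" "0:E003"
by_colon     <- substring(varied, regexpr(":", varied) + 1)  # "E001" "E002" "E003"
```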

How to extract substring between patterns _ and . in R

To achieve this, you need a regexp that

  • matches an (optional) arbitrary string in front of the _ : .*
  • matches a literal _ : [_]
  • matches everything up to (but not including) the next . and stores it in capturing group no. 1 : ([^.]+)
  • matches a literal . : [.]
  • matches an (optional) arbitrary string after the . : .*

In your call to gsub, you then

  • use the regular expression we built in the previous step
  • replace the whole string with the contents of the first capturing group: \\1 (we need to escape the backslash, hence the double backslash)

Example:

gsub(".*[_]([^.]+)[.].*", "\\1", "MA0051_IRF2.xml")
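To make the behaviour of this pattern concrete, here is a small sketch that applies it to several inputs. Apart from MA0051_IRF2.xml, the file names are hypothetical. Since the pattern consumes the whole string, sub gives the same result as gsub here. Note that a string the pattern does not match is returned unchanged, which is easy to overlook.

```r
pattern <- ".*[_]([^.]+)[.].*"

# Only the first name comes from the example above; the rest are made up
files <- c("MA0051_IRF2.xml", "a_b_c.d.e", "no_separator", "plain.txt")

extracted <- sub(pattern, "\\1", files)
extracted
# "IRF2" "c" "no_separator" "plain.txt"

# grepl can flag the inputs that were actually matched
matched <- grepl(pattern, files)
```

Because the leading .* is greedy, the capture for "a_b_c.d.e" starts after the last underscore that is still followed by a dot, which is why the result is "c".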

How to extract a string between double quotes in R?

There are multiple strings between double quotes, so you need some other identifier to pick out the one you want. For example, extract the quoted string that follows HREF= (x is the input string from the question):

sub('.*HREF="(.*?)".*', '\\1', x)
#[1] "D188_2020-03-30.csv"
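The input is not shown above, so the sketch below uses a hypothetical x, invented so that it contains more than one quoted string. It shows why the extra identifier matters, and uses a negated character class [^"]* as an alternative to the lazy quantifier (an assumption of this sketch, not the answer's pattern).

```r
# Hypothetical input; the real x from the question is not shown here
x <- '<A HREF="D188_2020-03-30.csv" TYPE="text/csv">D188</A>'

# Without an identifier, the greedy .* leaves only the last quoted string
last_quoted <- sub('.*"(.*)".*', '\\1', x)

# Anchoring on HREF= and stopping at the next quote picks the wanted one
href <- sub('.*HREF="([^"]*)".*', '\\1', x)
href
# "D188_2020-03-30.csv"
```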

How to extract everything until first occurrence of pattern

To get L0, you may use

> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"

The [^_]+ matches 1 or more chars other than _.

Also, you may split the string with _:

> x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"

This way, you will have all the substrings you need.

The same can be achieved with

> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
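If stringr is not available, the same results can be sketched in base R. This is an alternative to the stringr calls above, not a replacement for them.

```r
x <- "L0_123_abc"

# Drop the first underscore and everything after it
first_part <- sub("_.*", "", x)
first_part
# "L0"

# Split on the literal underscore to get all the pieces
parts <- strsplit(x, "_", fixed = TRUE)[[1]]
parts
# "L0" "123" "abc"
```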

How to extract everything after a specific string?

With str_extract. \\b is a zero-length token that matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string):

library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"

We can also use a back reference with sub. \\1 refers to the string matched by the first capture group (.+), which is one or more characters following the last - in the string:

sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"

This also works with str_replace if stringr is already loaded:

library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"

A third option is to use strsplit and extract the second element from each item of the resulting list (similar to word from @akrun's answer):

sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"

stringr also has a str_split variant of this:

str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"

Find string between two substrings

import re

s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))

How to extract a string between two patterns with special characters in R

You need to use a lazy dot pattern (.+?), and the pattern should match the entire input, because sub replaces the whole match with the contents of the capture group:

a <- "|Request|\nSample inlet port of the HIP cartridge with |overflow| formed "
sub("^.*\\|Request\\|\\s*(.+?)\\s*\\|.*$", "\\1", a)

[1] "Sample inlet port of the HIP cartridge with"
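When the markers contain regex metacharacters such as |, another option is to avoid regular expressions for the markers altogether and locate them as literal text with fixed = TRUE. The helper extract_between_fixed below is a hypothetical function written for this sketch, not part of base R, and it handles a single string only.

```r
# Hypothetical helper: returns the text between the first occurrence of
# `left` and the next occurrence of `right`, both treated as literal text
extract_between_fixed <- function(x, left, right) {
  start <- as.integer(regexpr(left, x, fixed = TRUE))
  if (start == -1L) return(NA_character_)

  # Everything after the left marker
  rest <- substring(x, start + nchar(left))

  end <- as.integer(regexpr(right, rest, fixed = TRUE))
  if (end == -1L) return(NA_character_)

  # Keep the text before the right marker and trim surrounding whitespace
  trimws(substring(rest, 1, end - 1L))
}

a <- "|Request|\nSample inlet port of the HIP cartridge with |overflow| formed "
between <- extract_between_fixed(a, "|Request|", "|")
between
# "Sample inlet port of the HIP cartridge with"
```

Unlike sub, which returns the input unchanged when nothing matches, this helper returns NA when either marker is missing.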

How to extract the substring between two markers?

Using regular expressions (see the documentation of the re module for further reference):

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = ''  # apply your error handling

# found: 1234

