Extracting a string between two other strings in R
You may use str_match with STR1 (.*?) STR2 (note that the spaces are "meaningful"; if you want to match anything in between STR1 and STR2, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim the value you need). If you have multiple occurrences, use str_match_all.
Also, if you need to match strings that span line breaks/newlines, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 / (?s)STR1\\s*(.*?)\\s*STR2.
library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
Another way, using base R regexec (to get the first match):
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
Extract a substring according to a pattern
Here are a few ways:
1) sub
sub(".*:", "", string)
## [1] "E001" "E002" "E003"
2) strsplit
sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"
3) read.table
read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"
4) substring
This assumes the second portion always starts at the 4th character (which is the case in the example in the question):
substring(string, 4)
## [1] "E001" "E002" "E003"
4a) substring/regex
If the colon were not always in a known position we could modify (4) by searching for it:
substring(string, regexpr(":", string) + 1)
5) strapplyc
strapplyc returns the parenthesized portion:
library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"
6) read.dcf
This one only works if the substrings prior to the colon are unique (which they are in the example in the question). It also requires that the separator be a colon (which it is in the question). If a different separator were used, we could use sub to replace it with a colon first. For example, if the separator were _ then string <- sub("_", ":", string)
c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"
7) separate
7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for after, and then extract the latter.
library(dplyr)
library(tidyr)
library(purrr)
DF <- data.frame(string)
DF %>%
  separate(string, into = c("pre", "post")) %>%
  pull("post")
## [1] "E001" "E002" "E003"
7b) Alternatively, separate can be used to create just the post column, and then unlist and unname the resulting data frame:
library(dplyr)
library(tidyr)
DF %>%
  separate(string, into = c(NA, "post")) %>%
  unlist %>%
  unname
## [1] "E001" "E002" "E003"
8) trimws
We can use trimws to trim word characters off the left and then use it again to trim the colon (the whitespace argument used here requires R 3.6.0 or later).
trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"
Note
The input string is assumed to be:
string <- c("G1:E001", "G2:E002", "G3:E003")
How to extract substring between patterns _ and . in R
To achieve this, you need a regexp that
- matches an (optional) arbitrary string in front of the _: .*
- matches a literal _: [_]
- matches everything up to (but not including) the next . and stores it in capturing group no. 1: ([^.]+)
- matches a literal .: [.]
- matches an (optional) arbitrary string after the .: .*
In your call to gsub, you then
- use the regular expression we built in the previous step
- replace the whole string with the contents of the first capturing group: \\1 (we need to escape the backslash, hence the double backslash)
Example:
gsub(".*[_]([^.]+)[.].*", "\\1", "MA0051_IRF2.xml")
# [1] "IRF2"
How to extract a string between "" in R?
There are multiple strings between "", so you need another identifier to extract the one you want. Try extracting the string between "" that follows "HREF".
sub('.*HREF="(.*?)".*', '\\1', x)
#[1] "D188_2020-03-30.csv"
How to extract everything until first occurrence of pattern
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
> x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
How to extract everything after a specific string?
With str_extract. \\b is a zero-length token that matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string):
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
We can also use a back reference with sub. \\1 refers to the string matched by the first capture group (.+), which is any character one or more times following a - at the end:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
This also works with str_replace if stringr is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
A third option is to use strsplit and extract the second element from each item of the resulting list (similar to word from @akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has a str_split variant of this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"
Find string between two substrings
import re
s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))
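The pattern above uses a greedy (.*), which is fine here because the closing marker occurs only once. A minimal sketch of a more defensive variant follows, assuming the markers may contain regex metacharacters or may repeat in the input. The helper name between is hypothetical and not part of the original answer.

```python
import re

def between(text, start, end):
    """Return the shortest substring between start and end, or None if absent."""
    # re.escape treats the markers as literal text; (.*?) is lazy, so the
    # match stops at the first closing marker; re.DOTALL lets '.' span newlines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    m = re.search(pattern, text, flags=re.DOTALL)
    return m.group(1) if m else None

s = 'asdf=5;iwantthis123jasd'
print(between(s, 'asdf=5;', '123jasd'))  # iwantthis
```

Returning None instead of raising keeps the call site simple when the markers are missing, which the one-liner above does not handle.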
How to extract string between two patterns with special characters in R
You need to use a lazy dot pattern, and your pattern should match the entire input, given that you are replacing it with the capture group:
a <- "|Request|\nSample inlet port of the HIP cartridge with |overflow| formed "
sub("^.*\\|Request\\|\\s*(.+?)\\s*\\|.*$", "\\1", a)
[1] "Sample inlet port of the HIP cartridge with"
How to extract the substring between two markers?
Using regular expressions (see the re module documentation for further reference):
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = ''  # apply your error handling
# found: 1234
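Both variants return only the first match. If the markers can occur several times, a sketch along these lines with re.findall collects every captured substring. The sample text is made up for illustration.

```python
import re

text = 'AAA12ZZZ junk AAA345ZZZ'
# With a single capture group, findall returns the list of captured strings.
found_all = re.findall(r'AAA(.+?)ZZZ', text)
print(found_all)  # ['12', '345']

# No match yields an empty list, so no exception handling is needed.
none_found = re.findall(r'AAA(.+?)ZZZ', 'no markers here')
```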