How to Do str_extract with Base R

How to do str_extract with base R?

1) strcapture To extract a string of digits and dots from "release 1.2.3" using only base R, strcapture can be used:

x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
## version
## 1 1.2.3

2) regexec/regmatches regmatches combined with regexec (or regexpr) also works; that approach has already been covered in another answer.
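For completeness, here is a minimal sketch of the regmatches approach. The helper name base_str_extract is hypothetical (it is not part of base R or of the original answer). It wraps regexpr so that, like stringr::str_extract, it returns NA for elements without a match; regmatches on its own silently drops them.

```r
# Hypothetical helper: a base R stand-in for stringr::str_extract.
# Returns the first match per element, or NA where nothing matches.
base_str_extract <- function(string, pattern, perl = FALSE) {
  m <- regexpr(pattern, string, perl = perl)
  out <- rep(NA_character_, length(string))
  hit <- !is.na(m) & m != -1          # elements that actually matched
  out[hit] <- regmatches(string, m)   # regmatches returns matches only
  out
}

x <- "release 1.2.3"
single <- base_str_extract(x, "[0-9.]+")
single
## [1] "1.2.3"

mixed <- base_str_extract(c(x, "no version here"), "[0-9.]+")
mixed
## [1] "1.2.3" NA
```

Keeping the result the same length as the input is what makes the helper safe to use on a data frame column.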

3) sub Also it is often possible to use sub:

sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"

3a) If the match is known to be at the beginning or end of the string, delete everything after or before it:

sub(".* ", "", x)
## [1] "1.2.3"

4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:

gsub("[^0-9.]", "", x)
## [1] "1.2.3"

5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:

read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"

5a) grep/scan

grep("^[0-9.]+$", scan(text = x, what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"

5b) grep/strsplit

grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"

6) substring If we know the character position of the field we can use substring like this:

substring(x, 9)
## [1] "1.2.3"

6a) substring/regexpr Alternatively, regexpr can locate the character position for us:

substring(x, regexpr("\\d", x))
## [1] "1.2.3"

7) read.dcf Sometimes the input can be converted to DCF form, in which case it can be read with read.dcf. Such data has the form name: value.

read.dcf(textConnection(sub(" ", ": ", x)))
## release
## [1,] "1.2.3"

Base R Equivalent of `stringr::str_extract_all`

Changing regexpr() for gregexpr() will do the trick:

# Returns every match per input element, as a list (like stringr::str_extract_all)
str_extract <- function(string, pattern) {
  regmatches(string, gregexpr(pattern, string))
}

pattern <- "xx|xx\\."
str_extract("xx (xx.)", pattern)

Output:

[[1]]
[1] "xx" "xx."

R's documentation is quite straightforward about the functions regexpr and gregexpr:

regexpr returns an integer vector of the same length as text giving
the starting position of the first match or -1 if there is none, with
attribute "match.length", an integer vector giving the length of the
matched text (or -1 for no match).

and

gregexpr returns a list of the same length as text each element of
which is of the same form as the return value for regexpr, except that
the starting positions of every (disjoint) match are given.
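To make the two return values concrete, here is a small sketch on a made-up input string (the string s is an assumption chosen for illustration):

```r
s <- "a1b22c333"

# regexpr: position and length of the first match only
first <- regexpr("[0-9]+", s)
as.integer(first)
## [1] 2
attr(first, "match.length")
## [1] 1

# gregexpr: one list element per input string, holding every match
every <- gregexpr("[0-9]+", s)[[1]]
as.integer(every)
## [1] 2 4 7
attr(every, "match.length")
## [1] 1 2 3

# regmatches turns those positions back into substrings
regmatches(s, gregexpr("[0-9]+", s))[[1]]
## [1] "1"   "22"  "333"
```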

extract pattern using stringr

You can use a lookbehind pattern:

as.integer(stringr::str_extract(values, '(?<=A)\\d+'))
#[1] 9 15
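The same lookbehind also works without stringr, as long as perl = TRUE is set, because R's default regex engine does not support lookarounds. The snippet above does not show the input, so the values vector below is an assumed stand-in:

```r
# Assumed example input; the original `values` vector is not shown
values <- c("xA9", "yA15z")

# perl = TRUE enables PCRE, which supports the (?<=...) lookbehind
m <- regexpr("(?<=A)\\d+", values, perl = TRUE)
nums <- as.integer(regmatches(values, m))
nums
## [1]  9 15
```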

Using str_extract in R to extract a number before a substring with regex

You can use a lookahead in the regular expression, (?=...):

library(stringr)

str_extract("17 nights$5 Days", "(\\d)+(?= nights)")

(\d) - a digit

(\d)+ - one or more digits

(?= nights) - that comes in front of " nights"

The lookbehind, (?<=...), can also come in handy.

A good reference cheatsheet is available from RStudio's website: https://raw.githubusercontent.com/rstudio/cheatsheets/main/regex.pdf
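As a rough base R counterpart, both lookarounds can be tried on the same string with regexpr and perl = TRUE. The variable names here are illustrative only:

```r
s <- "17 nights$5 Days"

# Lookahead: digits that are followed by " nights"
nights <- regmatches(s, regexpr("\\d+(?= nights)", s, perl = TRUE))
nights
## [1] "17"

# Lookbehind: digits that are preceded by a literal "$"
days <- regmatches(s, regexpr("(?<=\\$)\\d+", s, perl = TRUE))
days
## [1] "5"
```

Neither lookaround consumes characters, so " nights" and "$" are required for a match but are not part of the result.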

stringr::str_extract all elements of a list R

Use str_extract_all with the pattern banana.\\w+ to get "banana" together with the word that follows it.

all_terms %>%
  str_extract_all("banana.\\w+") %>%
  unlist()

# [1] "banana word2" "banana split" "banana ice"

Without unlist, you get a list:

str_extract_all(all_terms, "banana.\\w+")

[[1]]
[1] "banana word2" "banana split"

[[2]]
character(0)

[[3]]
[1] "banana ice"
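A base R sketch of the same extraction uses gregexpr with regmatches. The original all_terms is not shown, so the vector below is an assumption constructed to resemble it:

```r
# Assumed input; the original `all_terms` is not shown
all_terms <- c("banana word2 and banana split",
               "no fruit here",
               "banana ice cream")

# One list element per input string; character(0) where nothing matches
per_term <- regmatches(all_terms, gregexpr("banana.\\w+", all_terms))

# Flatten to a single character vector, as unlist() does above
flat <- unlist(per_term)
flat
## [1] "banana word2" "banana split" "banana ice"
```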

Info about the semantics in str_extract in R? (With an example)

In base R, you can use sub to extract the number after the first underscore.

sub('\\d+_(\\d+)_.*', '\\1', files)
#[1] "3"

where \\d+ matches one or more digits.

() is a capture group; it captures the value that we are interested in, and \\1 in the replacement refers back to it.

You can use the same regex in str_match if you want to use stringr.

stringr::str_match(files, '\\d+_(\\d+)_.*')[, 2]
#[1] "3"
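One caveat with the sub approach is that an input that does not match the pattern is returned unchanged rather than as NA. A small sketch, using an assumed files vector since the original is not shown:

```r
# Assumed example input; the original `files` is not shown
files <- c("12_3_sample.csv", "readme.txt")
pattern <- "\\d+_(\\d+)_.*"

# sub leaves non-matching strings untouched
plain <- sub(pattern, "\\1", files)
plain
## [1] "3"          "readme.txt"

# Guard with grepl to get NA for non-matches instead
guarded <- ifelse(grepl(pattern, files), sub(pattern, "\\1", files), NA_character_)
guarded
## [1] "3" NA
```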

R use str_extract (stringr) to export a string between _

You are probably better off with str_match, as this supports capture groups.
That way you can add the _ on either side for context but return only the part you are interested in. The (\\w+?) is the capture group, and str_match returns it as the second column, hence the [,2] (the first column is what str_extract would return).

library(stringr)
str_match(x,"ROH_(\\w+?)_")[,2]

[1] "Pete" "Annette" "Steve"
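For comparison, a base R sketch of the same idea uses regexec, which reports capture groups alongside the full match. The input x is not shown above, so the vector here is an assumption, and perl = TRUE is set so that the lazy quantifier +? is handled by PCRE:

```r
# Assumed example input; the original `x` is not shown
x <- c("ROH_Pete_2020", "ROH_Annette_2021", "ROH_Steve_2022")

# Each list element is c(full match, group 1), like a row of str_match
m <- regmatches(x, regexec("ROH_(\\w+?)_", x, perl = TRUE))

# Take the second entry (the capture group); NA if there was no match
who <- vapply(m, `[`, character(1), 2)
who
## [1] "Pete"    "Annette" "Steve"
```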

