How to do str_extract with base R?
1) strcapture If you want to extract a string of digits and dots from "release 1.2.3" using base R, then:
x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
##   version
## 1   1.2.3
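strcapture can also split a match into several typed columns, one per capture group. The following is only a sketch, assuming the version always has exactly three numeric parts; the column names major, minor and patch are made up for illustration.

```r
x <- "release 1.2.3"

# One capture group per column of the prototype data frame;
# each captured string is converted to the type of the matching column.
# The column names are hypothetical, chosen only for this example.
proto <- data.frame(major = integer(), minor = integer(), patch = integer())
v <- strcapture("([0-9]+)\\.([0-9]+)\\.([0-9]+)", x, proto)
v
```

Here v is a one-row data frame, so the parts can be used as numbers (for example v$major) without a separate conversion step.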
2) regexec/regmatches There are also regmatches and regexec, but those have already been covered in another answer.
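Since that other answer is not reproduced here, this is a minimal sketch of the regmatches approach using the same x as above. It is one plausible way to combine these functions, not necessarily what the other answer showed.

```r
x <- "release 1.2.3"

# regexpr() finds the first match; regmatches() returns the matched text
first <- regmatches(x, regexpr("[0-9.]+", x))

# regexec() also reports capture groups: element 1 is the whole match,
# element 2 is the first parenthesised group
groups <- regmatches(x, regexec("release ([0-9.]+)", x))[[1]]
version <- groups[2]
```

Note that regmatches combined with regexpr silently drops elements that do not match, so on a longer vector the result can be shorter than the input.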
3) sub Also, it is often possible to use sub:
sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"
3a) If you know the match is at the beginning or end then delete everything after or before it:
sub(".* ", "", x)
## [1] "1.2.3"
4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:
gsub("[^0-9.]", "", x)
## [1] "1.2.3"
5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:
read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"
5a) grep/scan
grep("^[0-9.]+$", scan(textConnection(x), what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"
5b) grep/strsplit
grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"
6) substring If we know the character position of the field, we can use substring like this:
substring(x, 9)
## [1] "1.2.3"
6a) substring/regexpr Or we may be able to use regexpr to locate the character position for us:
substring(x, regexpr("\\d", x))
## [1] "1.2.3"
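When the field is not at the end of the string, the "match.length" attribute returned by regexpr gives the end position as well. A small sketch, assuming a made-up input with trailing text:

```r
x <- "release 1.2.3 (stable)"   # hypothetical input with text after the field

m <- regexpr("[0-9.]+", x)              # start position of the first match
len <- attr(m, "match.length")          # length of the matched text
field <- substring(x, m, m + len - 1)   # characters from start to end of match
field
```

This is essentially what regmatches does internally, spelled out by hand.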
7) read.dcf Sometimes it is possible to convert the input to DCF form, in which case it can be read with read.dcf. Such data is of the form name: value
read.dcf(textConnection(sub(" ", ": ", x)))
##      release
## [1,] "1.2.3"
Base R Equivalent of `stringr::str_extract_all`
Swapping regexpr() for gregexpr() will do the trick:
str_extract <- function(string, pattern) {
  regmatches(string, gregexpr(pattern, string))
}
pattern <- "xx|xx\\."
str_extract("xx (xx.)", pattern)
Output:
[[1]]
[1] "xx" "xx."
R's documentation is quite straightforward about the functions regexpr and gregexpr:
regexpr returns an integer vector of the same length as text giving
the starting position of the first match or -1 if there is none, with
attribute "match.length", an integer vector giving the length of the
matched text (or -1 for no match).
and
gregexpr returns a list of the same length as text each element of
which is of the same form as the return value for regexpr, except that
the starting positions of every (disjoint) match are given.
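Based on the regexpr behaviour quoted above, a single-match counterpart can be sketched as follows. The name base_str_extract is made up for this example, and returning NA for elements without a match is an assumption modelled on how stringr::str_extract behaves.

```r
# Hypothetical helper: first match per element, NA where nothing matches
base_str_extract <- function(string, pattern, ...) {
  m <- regexpr(pattern, string, ...)
  out <- rep(NA_character_, length(string))
  hit <- !is.na(m) & m != -1          # regexpr() returns -1 for "no match"
  out[hit] <- regmatches(string, m)   # regmatches() returns only the hits
  out
}

res <- base_str_extract(c("release 1.2.3", "no version", "v10"), "[0-9.]+")
res
```

Extra arguments such as perl = TRUE are passed through to regexpr, so the result keeps the same length as the input regardless of which regex engine is used.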
extract pattern using stringr
You can use a lookbehind pattern:
as.integer(stringr::str_extract(values, '(?<=A)\\d+'))
#[1] 9 15
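The same lookbehind also works in base R, provided perl = TRUE is set, because the default regex engine does not support lookaround. The values vector below is hypothetical, since the original input is not shown here.

```r
values <- c("A9B", "xA15")   # hypothetical input; the original data is not shown

# Lookbehind needs the PCRE engine, hence perl = TRUE
m <- regexpr("(?<=A)\\d+", values, perl = TRUE)
nums <- as.integer(regmatches(values, m))
nums
```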
Using str_extract in R to extract a number before a substring with regex
You can use the lookahead regular expression (?=):
library(stringr)
str_extract("17 nights$5 Days", "(\\d)+(?= nights)")
(\d) - a digit
(\d)+ - one or more digits
(?= nights) - followed by " nights", which is checked but not included in the match
The lookbehind (?<=) can also come in handy.
A good reference cheatsheet is available from RStudio's website: https://raw.githubusercontent.com/rstudio/cheatsheets/main/regex.pdf
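For comparison, here is a sketch of both lookarounds in base R on the same string. It relies on perl = TRUE; the lookbehind pattern for the number of days is an illustration that is not part of the original answer.

```r
s <- "17 nights$5 Days"

# Lookahead: digits that are followed by " nights"
nights <- regmatches(s, regexpr("\\d+(?= nights)", s, perl = TRUE))

# Lookbehind: digits that are preceded by a literal "$"
days <- regmatches(s, regexpr("(?<=\\$)\\d+", s, perl = TRUE))
```

In both cases the lookaround text itself is left out of the returned match.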
stringr::str_extract all elements of a list R
Use str_extract_all and \\w+ to get the word after banana (together with banana itself).
all_terms %>%
  str_extract_all("banana.\\w+") %>%
  unlist()
# [1] "banana word2" "banana split" "banana ice"
Without unlist, you get a list:
str_extract_all(all_terms, "banana.\\w+")
[[1]]
[1] "banana word2" "banana split"
[[2]]
character(0)
[[3]]
[1] "banana ice"
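A base R counterpart can be sketched with gregexpr and regmatches. The all_terms vector below is hypothetical, chosen so that it produces the same matches as the output shown above.

```r
# Hypothetical input; the original all_terms is not shown
all_terms <- c("banana word2 and banana split", "apple pie", "banana ice")

# gregexpr() finds every match per element; regmatches() returns a list
found <- regmatches(all_terms, gregexpr("banana.\\w+", all_terms))

# Flatten to a plain character vector, as unlist() does in the pipeline above
flat <- unlist(found)
```

Elements without a match yield character(0) in the list and simply disappear after unlist.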
Info about the semantics in str_extract in R? (With an example)
In base R, you can use sub to extract the number after the first underscore.
sub('\\d+_(\\d+)_.*', '\\1', files)
#[1] "3"
where \\d+ matches one or more digits and () is a capture group that captures the value we are interested in.
You can use the same regex in str_match if you want to use stringr.
stringr::str_match(files, '\\d+_(\\d+)_.*')[, 2]
[1] "3"
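One caveat with sub is that it returns the input unchanged when the pattern does not match, whereas str_match returns NA. The sketch below shows the difference and one possible base R workaround using regexec; the files vector is made up for illustration.

```r
files <- c("12_3_report.csv", "notes.txt")   # hypothetical file names
pattern <- "\\d+_(\\d+)_.*"

# sub() leaves non-matching elements as they are
via_sub <- sub(pattern, "\\1", files)

# regexec() + regmatches() gives character(0) for non-matches,
# which can be turned into NA explicitly
m <- regmatches(files, regexec(pattern, files))
via_regexec <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
```

With this input, via_sub still contains "notes.txt" for the second element, while via_regexec has NA there.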
R use str_extract (stringr) to export a string between _
You are probably better off with str_match, as this allows capture groups. So you can add the _ on either side for context but only return the bit you are interested in. The (\\w+?) is the capture group, and str_match returns this as the second column, hence the [,2] (the first column is what str_extract would return).
library(stringr)
str_match(x,"ROH_(\\w+?)_")[,2]
[1] "Pete" "Annette" "Steve"
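A rough base R equivalent of the str_match call can be built from regexec and regmatches. In this sketch the x vector is hypothetical, every element is assumed to match, and the pattern uses [^_]+ in place of the lazy \\w+? so that it does not depend on which regex engine is in use.

```r
x <- c("ROH_Pete_NG", "ROH_Annette_SA", "ROH_Steve_MN")   # hypothetical input

# Each list element holds the full match followed by the capture group
m <- regmatches(x, regexec("ROH_([^_]+)_", x))

# Stack into a matrix shaped like str_match() output:
# column 1 = full match, column 2 = first capture group
mat <- do.call(rbind, m)
between <- mat[, 2]
```

If some elements might not match, the rows would no longer line up with x, so the NA handling would need to be added before calling rbind.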