Extract a substring according to a pattern
Here are a few ways:
1) sub
sub(".*:", "", string)
## [1] "E001" "E002" "E003"
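One caveat with (1): .* is greedy, so if a string contained more than one colon, everything up to the last colon would be removed. A minimal sketch of the difference, using a made-up input with two colons (not from the question) and a negated character class that stops at the first colon:

```r
# Hypothetical input with two colons; not part of the original question
x <- c("G1:E001:a", "G2:E002:b")

# Greedy .* removes everything up to the LAST colon
after_last <- sub(".*:", "", x)

# [^:]* cannot cross a colon, so this removes only up to the FIRST colon
after_first <- sub("^[^:]*:", "", x)

print(after_last)   # "a" "b"
print(after_first)  # "E001:a" "E002:b"
```

For the single-colon input used in the question both patterns give the same result.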
2) strsplit
sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"
3) read.table
read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"
4) substring
This assumes the second portion always starts at the 4th character (which is the case in the example in the question):
substring(string, 4)
## [1] "E001" "E002" "E003"
4a) substring/regex
If the colon were not always in a known position we could modify (4) by searching for it:
substring(string, regexpr(":", string) + 1)
5) strapplyc
strapplyc returns the parenthesized portion:
library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"
6) read.dcf
This one only works if the substrings prior to the colon are unique (which they are in the example in the question). It also requires that the separator be a colon (which it is in the question). If a different separator were used, we could use sub to replace it with a colon first. For example, if the separator were _ then we would first run string <- sub("_", ":", string).
c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"
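The separator replacement described in (6) can be sketched as follows. The underscore-separated input string2 is made up for illustration; everything else follows the read.dcf call above.

```r
# Hypothetical input using "_" instead of ":" as the separator
string2 <- c("G1_E001", "G2_E002", "G3_E003")

# Replace only the first "_" in each element so read.dcf sees "key:value" lines
string2 <- sub("_", ":", string2)

# read.dcf returns a one-row character matrix; c() flattens it to a vector
con <- textConnection(string2)
result <- c(read.dcf(con))
close(con)

print(result)
```

Note that sub replaces only the first match per element, so underscores in the part after the separator would be left alone.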
7) separate
7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for the part after, and then extract the latter.
library(dplyr)
library(tidyr)
DF <- data.frame(string)
DF %>%
separate(string, into = c("pre", "post")) %>%
pull("post")
## [1] "E001" "E002" "E003"
7b) Alternately, separate can be used to create just the post column, and then unlist and unname the resulting data frame:
library(dplyr)
library(tidyr)
DF %>%
separate(string, into = c(NA, "post")) %>%
unlist %>%
unname
## [1] "E001" "E002" "E003"
8) trimws
We can use trimws to trim word characters off the left and then use it again to trim the colon. The whitespace argument used here requires R 3.6.0 or later.
trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"
Note
The input string is assumed to be:
string <- c("G1:E001", "G2:E002", "G3:E003")
Extracting part of string by position in R
We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep that in a capture group. As we want to extract the third set of non-_ characters, we repeat that group 2 times ({2}), followed by another capture group of one or more non-_ characters, and then the rest of the characters, indicated by .*. In the replacement, we use the backreference to the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
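The same idea generalises to any field position by building the repeat count into the pattern. The helper nth_field below is a hypothetical name introduced only for this sketch; it assumes the separator is a single character that is safe inside a character class, and it uses * rather than + so that empty fields are allowed.

```r
# Hypothetical helper: extract the n-th field of a delimited string with sub().
# Assumes sep is a single character that is safe inside [...] (e.g. "_").
nth_field <- function(x, n, sep = "_") {
  # Skip n - 1 "field + separator" groups, capture the next field, drop the rest
  pattern <- sprintf("^(?:[^%s]*%s){%d}([^%s]*).*$", sep, sep, n - 1L, sep)
  sub(pattern, "\\1", x, perl = TRUE)
}

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
third <- nth_field(str1, 3)
print(third)  # "HIG"
```

If a string has fewer than n fields the pattern does not match and sub returns the input unchanged, so it is worth checking for that case in real data.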
Or another option is with scan
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option, as mentioned by @RHertel, would be to use read.table/read.csv on the string:
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
data
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
How to extract the middle part of a string in a data frame in R?
Using the stringr package:
library(stringr)
str_extract(pccmit$Description, "(?<=GN=).*(?= PE)")
(?<=GN=) is a lookbehind asserting that the match is preceded by GN=, and (?= PE) is a lookahead asserting that it is followed by a space and PE, with .* matching everything in the middle.
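For anyone without stringr installed, the same lookaround pattern should also work in base R with regmatches and regexpr. This is only a sketch: the vector desc is a made-up stand-in for pccmit$Description, whose real contents are not shown in the question.

```r
# Hypothetical stand-in for pccmit$Description; the real data is not shown
desc <- c("Nucleolar complex protein 2 homolog OS=Homo sapiens GN=NOC2L PE=1 SV=4",
          "Kelch-like protein 17 OS=Homo sapiens GN=KLHL17 PE=1 SV=1")

# Lookarounds need perl = TRUE in base R
m <- regexpr("(?<=GN=).*(?= PE)", desc, perl = TRUE)
genes <- regmatches(desc, m)

print(genes)  # "NOC2L" "KLHL17"
```

Unlike str_extract, regmatches silently drops elements that do not match, so the result can be shorter than the input.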
Extract parts of a string in R
You can simply use strsplit with the regex [-_] to get all the parts. The perl=TRUE option is not needed for this pattern.
stamp <- "section_d1_2010-07-01_08_00.txt"
strsplit(stamp, '[-_]')[[1]]
# [1] "section" "d1" "2010" "07" "01" "08" "00.txt"
See demo.
https://regex101.com/r/cK4iV0/8
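If the trailing .txt is not wanted in the last part, one option is to strip the extension before splitting. A small sketch, assuming the layout of the example stamp (the positions of year, month and day are taken from that one string):

```r
stamp <- "section_d1_2010-07-01_08_00.txt"

# Drop the file extension first so the last part is "00" rather than "00.txt"
base <- tools::file_path_sans_ext(stamp)
parts <- strsplit(base, "[-_]")[[1]]

# Parts 3 to 5 are year, month and day in this particular layout
stamp_date <- as.Date(paste(parts[3:5], collapse = "-"))

print(parts)
print(stamp_date)
```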
Extract part of the strings with specific format
You might use a pattern with a lookahead that asserts 9-21 word characters to the right (\\w also matches the underscore), followed by an underscore, and then match the first 2 parts with the single underscore between them:
^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+
Explanation
- ^ Start of string
- (?= Positive lookahead, assert what is to the right of the current location is
- \\w{9,21}_[A-Z0-9] Match 9-21 word chars followed by an underscore and a char A-Z or a digit
- ) Close the lookahead
- [A-Z]+ Match 1+ chars A-Z
- _ Match the first underscore
- [A-Z0-9]+ Match 1+ chars A-Z or a digit
x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
regmatches(x, regexpr("^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+", x, perl = TRUE))
Output
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
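One thing to keep in mind with regmatches is that it drops elements that do not match, so the output can be shorter than the input. A minimal sketch of keeping the result aligned with the input, using a made-up vector x2 that includes one string too short to satisfy the lookahead:

```r
# Hypothetical input: the second element is too short to satisfy the lookahead
x2 <- c("XY_ABCD101_12_ACE", "AB_C1", "XKY_BBAAUUU124_100")

m <- regexpr("^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+", x2, perl = TRUE)

# regexpr returns -1 for non-matches; fill those positions with NA
out <- rep(NA_character_, length(x2))
out[m != -1] <- regmatches(x2, m)

print(out)  # "XY_ABCD101" NA "XKY_BBAAUUU124"
```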
Extract part of string values, make new column names, and make dataframe wide
Revised scenario
- Using tidyr::extract saves you the extra mutate step, as you can extract the two desired strings directly into two columns using a regex.
library(tidyverse)
whatiactuallyhave <- tibble(v1 = c('abc [effort]', 'abc [effort]', 'def [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'abc [scope]', 'def [scope]', 'ghi [scope]', 'ghi [scope]'),
scores = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))
whatiactuallyhave %>%
tidyr::extract(v1, into = c('v1', 'name'), regex = '(\\w+)\\s\\[(\\w+)\\]') %>%
group_by(v1, name) %>%
mutate(d = row_number()) %>%
pivot_wider(names_from = name, values_from = scores, values_fill = NA) %>%
select(-d)
#> # A tibble: 6 x 3
#> # Groups: v1 [3]
#> v1 effort scope
#> <chr> <chr> <chr>
#> 1 abc 1 6
#> 2 abc 2 7
#> 3 def 3 8
#> 4 def 4 <NA>
#> 5 ghi 5 9
#> 6 ghi <NA> 10
Created on 2021-05-26 by the reprex package (v2.0.0)
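To see what the regex itself captures without loading the tidyverse, the two groups can be pulled out in base R with regexec and regmatches. This is just a sketch on a shortened, made-up version of the v1 column, not a replacement for the pipeline above.

```r
# Shortened, hypothetical version of the v1 column
v1 <- c("abc [effort]", "def [scope]", "ghi [effort]")

# regexec returns the full match plus one entry per capture group
m <- regmatches(v1, regexec("(\\w+)\\s\\[(\\w+)\\]", v1))

# Keep only the two capture groups (elements 2 and 3) of each match
parts <- do.call(rbind, lapply(m, "[", 2:3))
colnames(parts) <- c("v1", "name")

print(parts)
```

The first column holds the group label and the second the name that later becomes a column header in pivot_wider.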
Earlier answer
library(tidyverse)
whatihave <- tibble(v1 = c('abc [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'def [scope]', 'ghi [scope]'),
scores = c(1:6))
whatihave %>%
separate(v1, into = c('v1', 'name'), sep = ' \\[') %>%
mutate(name = str_remove(name, '\\]')) %>%
pivot_wider(names_from = name, values_from = scores)
# A tibble: 3 x 3
v1 effort scope
<chr> <int> <int>
1 abc 1 4
2 def 2 5
3 ghi 3 6
R extract part of string
Try this:
sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
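The lazy quantifier (.*?) is what makes the capture stop at the first semicolon after GN=. A short sketch of the difference, on a made-up value of a (the original input is not shown in the answer); perl = TRUE is added here so the lazy quantifiers follow PCRE semantics, which the original call does not do:

```r
# Hypothetical input; the original value of a is not shown in the answer
a <- "OS=Homo sapiens;GN=NOC2L;PE=1;SV=4"

# Lazy capture stops at the first ";" after "GN="
lazy <- sub(".*?GN=(.*?);.*", "\\1", a, perl = TRUE)

# Greedy capture runs up to the last ";" in the string
greedy <- sub(".*?GN=(.*);.*", "\\1", a, perl = TRUE)

print(lazy)    # "NOC2L"
print(greedy)  # "NOC2L;PE=1"
```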
R: Extract substring and paste the same substring at the end of a string
We can use sub and build groups in the pattern argument by wrapping them in (). We can access these groups in the replacement argument with \\ followed by the group number.
strs <- c("A11B3XyC4",
"A1B14C23XyC16",
"B14C23XyC16D3")
sub("(.*)(Xy)(.*)", "\\1\\3\\. \\2", strs)
#> [1] "A11B3C4. Xy" "A1B14C23C16. Xy" "B14C23C16D3. Xy"
Created on 2021-08-27 by the reprex package (v0.3.0)