R Extract Part of String

Extract a substring according to a pattern

Here are a few ways:

1) sub

sub(".*:", "", string)
## [1] "E001" "E002" "E003"

2) strsplit

sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"

3) read.table

read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"

4) substring

This assumes the second portion always starts at the 4th character (which is the case in the example in the question):

substring(string, 4)
## [1] "E001" "E002" "E003"

4a) substring/regex

If the colon were not always in a known position we could modify (4) by searching for it:

substring(string, regexpr(":", string) + 1)
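
As a minimal sketch of (4a), assuming a hypothetical input string2 whose prefixes vary in length (it is not part of the original question), the colon is located per element and the remainder is taken from the next character:

```r
# Hypothetical input: the prefix before the colon varies in length
string2 <- c("G1:E001", "G22:E002", "G333:E003")

# Position of the first colon in each element (-1 if there is none)
pos <- regexpr(":", string2, fixed = TRUE)

# Everything after that colon
after_colon <- substring(string2, pos + 1)
after_colon
## [1] "E001" "E002" "E003"
```

If an element contains no colon, regexpr returns -1, so substring starts at position 0 and the whole element comes back unchanged.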

5) strapplyc

strapplyc returns the parenthesized portion:

library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"

6) read.dcf

This one only works if the substrings prior to the colon are unique (which they are in the example in the question). Also it requires that the separator be colon (which it is in the question). If a different separator were used then we could use sub to replace it with a colon first. For example, if the separator were _ then string <- sub("_", ":", string)

c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"

7) separate

7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for after, and then extract the latter.

library(dplyr)
library(tidyr)

DF <- data.frame(string)
DF %>%
  separate(string, into = c("pre", "post")) %>%
  pull("post")
## [1] "E001" "E002" "E003"

7b) Alternately separate can be used to just create the post column and then unlist and unname the resulting data frame:

library(dplyr)
library(tidyr)

DF %>%
  separate(string, into = c(NA, "post")) %>%
  unlist %>%
  unname
## [1] "E001" "E002" "E003"

8) trimws

We can use trimws to trim word characters off the left and then use it again to trim the colon. The whitespace argument used here requires R 3.6.0 or later.

trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"

Note

The input string is assumed to be:

string <- c("G1:E001", "G2:E002", "G3:E003")

Extracting part of string by position in R

We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and wrap this in a capture group. As we want to extract the third set of non-_ characters, we repeat that group 2 times ({2}), follow it with another capture group of one or more non-_ characters, and match the rest of the string with .*. In the replacement, we use the backreference for the second capture group (\\2).

sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
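
The same idea can be generalized to any field position. This is only a sketch: extract_nth is a hypothetical helper (not from the original answer) that builds the pattern with sprintf, uses a non-capturing group for the skipped fields, and therefore refers to the wanted field as \\1 rather than \\2.

```r
# Hypothetical helper: extract the n-th "_"-separated field of each string
extract_nth <- function(x, n) {
  # Skip n - 1 fields (each followed by "_"), then capture the next field
  pattern <- sprintf("^(?:[^_]+_){%d}([^_]+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
extract_nth(str1, 3)
#[1] "HIG"
extract_nth(str1, 1)
#[1] "ABC"
```

Note that when a string has fewer than n fields the pattern does not match, and sub returns the input unchanged rather than NA.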

Or another option is with scan

scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"

A similar option as mentioned by @RHertel would be to use read.table/read.csv on the string

read.table(text=str1, sep="_", stringsAsFactors=FALSE)[,3]

data

str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"

How to extract the middle part of a string in a data frame in R?

Using the stringr package:

library(stringr)
str_extract(pccmit$Description, "(?<=GN=).*(?= PE)")

(?<=GN=) is a lookbehind asserting that the match is preceded by GN=, and (?= PE) is a lookahead asserting that it is followed by " PE"; .* matches everything in between.
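
For illustration, the same lookaround pattern can be tried in base R with regexpr and perl = TRUE. The desc string below is a made-up example of what a Description value might look like; the real pccmit data is not shown in the question.

```r
# Hypothetical Description value; the actual pccmit data is not available
desc <- "Nucleolar complex protein 2 homolog OS=Homo sapiens GN=NOC2L PE=1 SV=4"

# Lookbehind and lookahead need the PCRE engine, hence perl = TRUE
m <- regexpr("(?<=GN=).*(?= PE)", desc, perl = TRUE)
gene <- regmatches(desc, m)
gene
# [1] "NOC2L"
```

One difference to keep in mind: str_extract returns NA for elements without a match, whereas regmatches silently drops them.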

Extract parts of a string in R

You can simply use strsplit with the regex [-_] to get all the parts.

stamp <- "section_d1_2010-07-01_08_00.txt"
strsplit(stamp, '[-_]')[[1]]
# [1] "section" "d1" "2010" "07" "01" "08" "00.txt"

See demo.

https://regex101.com/r/cK4iV0/8

Extract part of the strings with specific format

You might use a pattern that asserts 9-21 word characters to the right (the underscore counts as a word character), followed by an underscore and an uppercase letter or digit, and then matches the first 2 parts joined by the single underscore:

^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+

Explanation

  • ^ Start of string
  • (?= Positive lookahead, assert what is to the right of the current location is
    • \\w{9,21}_[A-Z0-9] Match 9-21 word chars followed by an underscore and a char A-Z or a digit
  • ) Close the lookahead
  • [A-Z]+ Match 1+ chars A-Z
  • _ Match the first underscore
  • [A-Z0-9]+ Match 1+ chars A-Z or a digit


x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
regmatches(x, regexpr("^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+", x, perl = TRUE))

Output

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
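
Because regmatches drops elements that do not match, the result can be shorter than the input. A possible way to keep the positions aligned, sketched here with a hypothetical second element "AB_C1" that is too short to satisfy the lookahead, is to fill a vector of NA values:

```r
# Second element is a made-up non-matching example
x <- c("XY_ABCD101_12_ACE", "AB_C1")
pattern <- "^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+"

# regexpr returns -1 for elements without a match
m <- regexpr(pattern, x, perl = TRUE)

# Start with NA everywhere, then fill in the matched elements
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
# [1] "XY_ABCD101" NA
```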

Extract part of string values, make new column names, and make dataframe wide

Revised scenario

  • Using tidyr::extract saves the extra mutate step, because the regex extracts the two desired strings directly into two columns.
library(tidyverse)
whatiactuallyhave <- tibble(v1 = c('abc [effort]', 'abc [effort]', 'def [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'abc [scope]', 'def [scope]', 'ghi [scope]', 'ghi [scope]'),
                            scores = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))

whatiactuallyhave %>%
  tidyr::extract(v1, into = c('v1', 'name'), regex = '(\\w+)\\s\\[(\\w+)\\]') %>%
  group_by(v1, name) %>%
  mutate(d = row_number()) %>%
  pivot_wider(names_from = name, values_from = scores, values_fill = NA) %>%
  select(-d)

#> # A tibble: 6 x 3
#> # Groups:   v1 [3]
#>   v1    effort scope
#>   <chr> <chr>  <chr>
#> 1 abc   1      6
#> 2 abc   2      7
#> 3 def   3      8
#> 4 def   4      <NA>
#> 5 ghi   5      9
#> 6 ghi   <NA>   10

Created on 2021-05-26 by the reprex package (v2.0.0)
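
To see what the two capture groups in that regex pick up without loading the tidyverse, here is a small base R sketch using sub with backreferences. The variable names prefix and name are chosen for illustration only.

```r
# Two sample values in the same "word [word]" shape as the data above
v1 <- c("abc [effort]", "def [scope]")
pattern <- "(\\w+)\\s\\[(\\w+)\\]"

# Group 1 is the word before the bracket, group 2 the word inside it
prefix <- sub(pattern, "\\1", v1)
name <- sub(pattern, "\\2", v1)

prefix
#> [1] "abc" "def"
name
#> [1] "effort" "scope"
```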



Earlier answer

library(tidyverse)

whatihave <- tibble(v1 = c('abc [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'def [scope]', 'ghi [scope]'),
                    scores = c(1:6))

whatihave %>%
  separate(v1, into = c('v1', 'name'), sep = ' \\[') %>%
  mutate(name = str_remove(name, '\\]')) %>%
  pivot_wider(names_from = name, values_from = scores)

# A tibble: 3 x 3
  v1    effort scope
  <chr>  <int> <int>
1 abc        1     4
2 def        2     5
3 ghi        3     6

R extract part of string

Try this:

sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
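
The input a is not shown in the answer. As a sketch, assume it is a semicolon-delimited string that contains a GN= field; the value below is made up for illustration. perl = TRUE is added here only to make the lazy quantifier semantics explicit.

```r
# Hypothetical input; the original value of `a` is not given
a <- "ID=Q9Y3T9;GN=NOC2L;PE=1;SV=4"

# Lazily skip to "GN=", capture up to the next ";", discard the rest
gene_name <- sub(".*?GN=(.*?);.*", "\\1", a, perl = TRUE)
gene_name
# [1] "NOC2L"
```

If the string has no GN= field followed by a semicolon, the pattern does not match and sub returns a unchanged.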

R: Extract substring and paste the same substring at the end of a string

We can use sub and build groups in the pattern argument by wrapping them in (). We can access these groups in the replacement argument with \\ followed by the group number.

strs <- c("A11B3XyC4",
          "A1B14C23XyC16",
          "B14C23XyC16D3")

sub("(.*)(Xy)(.*)", "\\1\\3\\. \\2", strs)
#> [1] "A11B3C4. Xy" "A1B14C23C16. Xy" "B14C23C16D3. Xy"

Created on 2021-08-27 by the reprex package (v0.3.0)
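
To inspect what each group captures before writing the replacement, one option is regexec together with regmatches, which returns the full match followed by the groups in order. This is a small sketch on the first example string only:

```r
strs <- c("A11B3XyC4")

# Element 1 is the full match; elements 2-4 are groups \\1, \\2 and \\3
groups <- regmatches(strs, regexec("(.*)(Xy)(.*)", strs))[[1]]
groups
#> [1] "A11B3XyC4" "A11B3"     "Xy"        "C4"
```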


