R Split String by Symbol

Split a string by a plus sign (+) character

Use

strsplit("(1)+(2)", "\\+")

or

strsplit("(1)+(2)", "+", fixed = TRUE)

Calling strsplit("(1)+(2)", "+") does not work because, unless specified otherwise, the split argument is treated as a regular expression, and the + character is special in regex. Other characters that need the same care are:

  • ?
  • *
  • .
  • ^
  • $
  • \
  • |
  • { }
  • [ ]
  • ( )
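As a minimal sketch of how the same care applies to the other characters in this list, the snippet below splits on a literal dot in three ways and on a literal pipe with fixed = TRUE. The input strings and variable names are made up for illustration.

```r
# Hypothetical inputs, chosen only to illustrate the point
s <- "a.b.c"

# 1. Escape the metacharacter with a double backslash
by_escape <- strsplit(s, "\\.")[[1]]

# 2. Put it inside a character class, where "." is literal
by_class <- strsplit(s, "[.]")[[1]]

# 3. Turn off regex interpretation entirely
by_fixed <- strsplit(s, ".", fixed = TRUE)[[1]]

# The pipe is also a metacharacter, so fixed = TRUE helps here too
by_pipe <- strsplit("x|y", "|", fixed = TRUE)[[1]]
```

All three dot variants should give the same pieces; fixed = TRUE is usually the simplest choice when the delimiter is a plain string.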

Split a character string by the symbol *

You need to escape the star...

test = "23*45"

strsplit( test , "\\*" )
#[[1]]
#[1] "23" "45"

The split argument is a regular expression, and * means the preceding item is matched zero or more times. You are effectively splitting on nothing, i.e. splitting into individual characters, as noted in the Details section of ?strsplit. \\* means: treat * as a literal *.

Alternatively use the fixed argument...

strsplit( test , "*" , fixed = TRUE )
#[[1]]
#[1] "23" "45"

This makes R treat the split pattern as a literal string rather than a regular expression.
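The [[1]] in the printed results appears because strsplit() returns a list with one element per input string. A small sketch with a made-up input vector shows how that list might be handled when several strings are split at once:

```r
# Hypothetical input vector with a literal * as delimiter
x <- c("23*45", "6*7*8")

# One list element per input string
parts <- strsplit(x, "*", fixed = TRUE)

# Number of pieces in each string
n_parts <- lengths(parts)

# First piece of each string, as a plain character vector
first <- vapply(parts, `[`, character(1), 1)
```

vapply() is used here rather than sapply() so that the result type is checked; that is a design preference, not a requirement.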

Split R string into individual characters

You could use

data.frame(Reduce(rbind, strsplit(df$V1, "")))

This returns

     X1 X2 X3 X4 X5 X6
init  g  g  g  g  c  c
X     c  c  c  c  t  t
X.1   t  t  t  t  t  t
X.2   a  a  a  a  a  a

or

data.frame(do.call(rbind, strsplit(df$V1, "")))

which returns

  X1 X2 X3 X4 X5 X6
1  g  g  g  g  c  c
2  c  c  c  c  t  t
3  t  t  t  t  t  t
4  a  a  a  a  a  a
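The snippets above do not show how df is built. The following self-contained sketch assumes a data frame with a single character column V1 whose strings all have the same length; the values are reconstructed from the printed output and are an assumption, not the original data.

```r
# Hypothetical input, inferred from the printed output above
df <- data.frame(V1 = c("ggggcc", "cccctt", "tttttt", "aaaaaa"),
                 stringsAsFactors = FALSE)

# Splitting on "" yields one character vector per row
chars <- strsplit(df$V1, "")

# Stack the vectors into a character matrix, one row per string
mat <- do.call(rbind, chars)

# Convert to a data frame with one column per character position
out <- as.data.frame(mat, stringsAsFactors = FALSE)
```

If the strings had different lengths, rbind would recycle the shorter vectors with a warning, so equal lengths are worth checking first.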

How to split a string after the nth character in R

You can use substr if you always want to split after the second character.

District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# state code: characters 1 to 2
state <- substr(District, 1, 2)
# district number: characters 3 to 4
district <- substr(District, 3, 4)
# combine into a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
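For a split position other than 2, the same idea can be wrapped in a small helper. This is only a sketch: split_after and its column names head and tail are hypothetical, not part of any package.

```r
# Hypothetical helper: split each string after its n-th character
split_after <- function(x, n) {
  data.frame(
    head = substr(x, 1, n),       # characters 1..n
    tail = substring(x, n + 1),   # everything from n + 1 to the end
    stringsAsFactors = FALSE
  )
}

res <- split_after(c("AR01", "AZ03", "CA21"), 2)
```

substring() is used for the second part because its end position defaults to the end of the string, so the tail length does not need to be known in advance.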

Split string every x characters in a data frame

One option is separate() from tidyr:

library(tidyverse)
df %>%
  separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
#  id  x1  x2  x3
#1  1 ABC DEF GHI
#2  2 ZAB CDJ HIA

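To make this more generic, compute the split positions from the string length (this assumes every string has the same length as the first one):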

n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) +1))
df %>%
  separate(seq, into = nm1, sep = s1)

Or, in base R, use strsplit() with a regex lookbehind to split the 'seq' column after every 3 characters. This returns a list, whose elements are then combined with rbind:

df[paste0("x", 1:3)] <- do.call(rbind, 
strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))

NOTE: It is better to avoid column names that start with non-standard characters such as digits. For that reason, 'x' is prepended to the names.
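If the lookbehind pattern feels opaque, fixed-width chunks can also be cut with substring() and explicit start positions. The sketch below is an alternative to the approach above, not the same technique; split_every is a hypothetical helper and the two input strings are taken from the example output.

```r
# Hypothetical helper: cut one non-empty string into chunks of n characters
split_every <- function(x, n) {
  starts <- seq(1, nchar(x), by = n)                     # 1, n + 1, 2n + 1, ...
  substring(x, starts, pmin(starts + n - 1, nchar(x)))   # last chunk may be shorter
}

# Apply to each string, then stack the pieces row-wise
pieces <- lapply(c("ABCDEFGHI", "ZABCDJHIA"), split_every, n = 3)
wide <- do.call(rbind, pieces)
```

The resulting matrix can be assigned to new columns in the same way as in the strsplit version, provided all strings have the same length.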

R: How to split string into pieces

You can try str_extract_all:

stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"

With base R:

regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]

Here we extract every run of upper-case letters, lower-case letters, or underscores. Everything else is ignored, so characters like �\\00? do not appear in the final output.
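Since x is not defined in the snippets above, here is a self-contained sketch of the base R variant. The input string is invented for illustration and is not the original data.

```r
# Hypothetical input: words separated by semicolons, plus a run of digits
x <- "CN;Shandong;Zibo;12345;IMG_HAS"

# gregexpr() finds every match; regmatches() extracts the matched text
words <- regmatches(x, gregexpr("[A-Za-z_]+", x))[[1]]
```

Digits and semicolons are outside the character class, so they act as separators and are dropped, while the underscore keeps IMG_HAS together as one piece.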

Split character by multiple criteria in R

In base R, we can use strsplit

out <- strsplit("variable1+variable2 + variable3*variable4+ variable5", 
"\\s*[*+]\\s*")[[1]]

Output:

out
[1] "variable1" "variable2" "variable3" "variable4" "variable5"

The structure is

dput(out)
c("variable1", "variable2", "variable3", "variable4", "variable5"
)
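In the pattern above, * and + need no escaping because they sit inside a character class, and the \\s* on either side absorbs optional whitespace. As a sketch, the same idea might be extended to more delimiters; the expression below and the choice of - and / as extra delimiters are assumptions for illustration.

```r
# Hypothetical expression with four different operators and uneven spacing
expr <- "a - b/c*d+ e"

# "-" is placed first in the class so it is read literally, not as a range
terms <- strsplit(expr, "\\s*[-+*/]\\s*")[[1]]
```

Placing the hyphen first (or last) in the class avoids it being interpreted as a range between two characters.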

