R Strsplit with Multiple Unordered Split Arguments

R strsplit with multiple unordered split arguments?

Actually, strsplit treats its split argument as a regular expression, just as grep does. (Neither a comma nor a space is a regex metacharacter here, so the double-escaped commas in the pattern below are harmless but not required. Likewise, writing "\\s" instead of a literal space would be more to improve readability than of necessity):

> strsplit(test_1, "\\, |\\,| ")  # three possibilities OR'ed
[[1]]
[1] "abc" "def" "ghi" "klm"

> strsplit(test_2, "\\, |\\,| ")
[[1]]
[1] "abc" "def" "ghi" "klm"

Without both the comma-followed-by-a-space alternative and the bare-comma alternative in the pattern, you would have gotten some empty strings ("") in the result. It might have been clearer if I had written:

> strsplit(test_2, "\\,\\s|\\,|\\s")
[[1]]
[1] "abc" "def" "ghi" "klm"
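To illustrate why the two-character alternative matters, here is a minimal sketch. The input string is a hypothetical stand-in, because the original test_1 and test_2 are not shown in this answer.

```r
# Hypothetical input; the original test_1/test_2 are not shown above.
x <- "abc, def ghi,klm"

# Splitting on a comma OR a space treats ", " as two separate delimiters,
# which leaves an empty string between them.
naive <- strsplit(x, ",| ")[[1]]

# Adding the two-character alternative ", " consumes both characters at once.
combined <- strsplit(x, ", |,| ")[[1]]

print(naive)
print(combined)
```

The first result should contain an empty element where the comma and the space sit next to each other; the second should not.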

@Fojtasek is so right: Using character classes often simplifies the task because it creates an implicit logical OR:

> strsplit(test_2, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"

> strsplit(test_1, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"

Use strsplit with multiple delimiters

You can also try str_split from stringr:

library(stringr)
lapply(str_split(df$V1, "(?<!\\()\\-|[:\\)\\(]"), function(x) x[x != ""])

Result:

[[1]]
[1] "Chr3" "153922357" "153944632" "-"

[[2]]
[1] "Chr11" "70010183" "70015411" "-"

Data:

df = read.table(text = "Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-)", stringsAsFactors = FALSE)
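If stringr is not available, something similar can probably be done in base R. The sketch below is not the answer's method: it swaps the negative lookbehind for a negative lookahead (a "-" that is not followed by ")"), and it uses a plain character vector x as an assumed stand-in for df$V1.

```r
# Same coordinates as the data above, as a plain character vector (assumed stand-in for df$V1).
x <- c("Chr3:153922357-153944632(-)", "Chr11:70010183-70015411(-)")

# Split on ":", "(" or ")", or on a "-" that is NOT followed by ")".
# The lookahead (?!\\)) is Perl syntax, so perl = TRUE is required.
parts <- strsplit(x, "[:()]|-(?!\\))", perl = TRUE)

print(parts)
```

The lookahead is a deliberate choice. strsplit works through the string piece by piece, so a lookbehind placed at the start of the remaining piece may not see the characters already consumed.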

R strsplit() with multiple criteria

We can use regex lookarounds to split each line at the whitespace that follows 'is' or 'never'. Here, (?<=\\bis)\\s+ matches one or more whitespace characters (\\s+) preceded by the word 'is', and the alternative after the |, (?<=\\bnever)\\s+, matches whitespace preceded by the word 'never'.

strsplit(str[,1], "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)
#[[1]]
#[1] "This is" "line one"

#[[2]]
#[1] "This is" "not line one"

#[[3]]
#[1] "This can never" "be line one"

If we want to remove the 'is' and 'never' as well:

strsplit(str[,1], "(?:\\s+(is|never)\\s+)")
#[[1]]
#[1] "This" "line one"

#[[2]]
#[1] "This" "not line one"

#[[3]]
#[1] "This can" "be line one"
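For a self-contained run of both variants, here is a small sketch. The vector input_lines is an assumption reconstructed from the outputs shown above, standing in for str[,1]. The second pattern is written without the non-capturing group and with perl = TRUE, which is a simplification rather than the exact call above.

```r
# Assumed input, reconstructed from the printed results; stands in for str[,1].
input_lines <- c("This is line one",
                 "This is not line one",
                 "This can never be line one")

# Keep 'is'/'never': split only at the whitespace that follows them.
kept <- strsplit(input_lines, "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)

# Drop 'is'/'never': the keyword and its surrounding whitespace form the delimiter.
dropped <- strsplit(input_lines, "\\s+(is|never)\\s+", perl = TRUE)

print(kept)
print(dropped)
```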

R: how to split string correctly when there are multiple signs

The strsplit function from base R is somewhat limited: when the input ends with a match, it drops the final empty string, so strsplit("A++", "\\+") returns "A" "" rather than three pieces. Try the stringr or stringi libraries. For example:

library(stringr)
str_split("A++", "\\+")

This has your required return:

[[1]]
[1] "A" "" ""

str_split is vectorized over both the input string and the match pattern.
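If adding a package is not an option, one possible base-R workaround is to append one extra delimiter before splitting, so that the piece strsplit drops is the padding rather than real data. This is only a sketch under that assumption, not something the answer above proposes.

```r
x <- "A++"

# Plain strsplit: the final empty string after the last "+" is dropped.
base_result <- strsplit(x, "\\+")[[1]]

# Workaround sketch: pad with one extra "+" so the dropped piece is the padding.
padded_result <- strsplit(paste0(x, "+"), "\\+")[[1]]

print(base_result)
print(padded_result)
```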

strsplit doesn't always split on '?'

To split on any of several characters, place them inside [] and remove fixed = TRUE, or paste the patterns together with | so that any one of them acts as a delimiter:

strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = "[.!?]")[[1]]

According to ?strsplit

split - If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
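The three behaviours mentioned here (a character class, a single literal with fixed = TRUE, and recycling of split along x) can be compared in a short sketch. The trimws() calls and the second input vector are additions for illustration only.

```r
txt <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# Character class: any of ".", "!" or "?" is a delimiter (regex mode, no fixed = TRUE).
by_class <- trimws(strsplit(txt, "[.!?]")[[1]])

# A single literal delimiter can use fixed = TRUE, so "?" is not read as a regex quantifier.
by_fixed <- trimws(strsplit(txt, "?", fixed = TRUE)[[1]])

# split longer than 1 is recycled along x: "." applies to the first string, "-" to the second.
recycled <- strsplit(c("a.b", "c-d"), c(".", "-"), fixed = TRUE)

print(by_class)
print(recycled)
```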

How do I attribute different parameters to the function strsplit(split = )?

This should work if you want to split on both:

library(stringr)
x <- c("banana.apple turning.something")
str_split(x, "[\\.\\s]")
# [[1]]
# [1] "banana" "apple" "turning" "something"
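The same split should also be possible without stringr. A minimal base-R sketch, using the POSIX class [:space:] inside the bracket expression in place of \\s:

```r
x <- "banana.apple turning.something"

# Inside [], "." is literal and [:space:] matches any whitespace character.
pieces <- strsplit(x, "[.[:space:]]")[[1]]

print(pieces)
```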

strsplit by spaces greater than one in R

You may specify it through a repetition quantifier.

strsplit(mystr, "\\s{2,}")

The regex \\s{2,} matches a run of two or more consecutive whitespace characters, so single spaces are left alone.
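A quick sketch of the difference between splitting on any whitespace and splitting only on runs of two or more. The value of mystr here is hypothetical, since the original string is not shown.

```r
# Hypothetical input: single spaces inside fields, longer runs between fields.
mystr <- "alpha beta   gamma    delta epsilon"

# Two or more whitespace characters: single spaces stay inside the pieces.
wide_gaps <- strsplit(mystr, "\\s{2,}")[[1]]

# One or more whitespace characters: every word becomes its own piece.
any_gap <- strsplit(mystr, "\\s+")[[1]]

print(wide_gaps)
print(any_gap)
```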

strsplit: split strings from integers

An option is to use look-aheads/look-behinds

ss <- "w17u2"

unlist(strsplit(ss, "((?<=[a-z])(?![a-z])|(?<=\\d)(?!\\d))", perl = TRUE))
#[1] "w" "17" "u" "2"

Explanation:

(?<=[a-z])(?![a-z]) splits the string at the position where the preceding character matches [a-z] and the following character does not match [a-z]. Similarly, (?<=\\d)(?!\\d) splits the string at the position where the preceding character is a digit and the following character is not a digit. The final regular expression joins both patterns with | (alternation), and the lookarounds require perl = TRUE.
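As a cross-check, the same pieces can be obtained by extracting runs of letters or digits instead of splitting between them. This is an alternative technique, not the one described above, and the second input "ab12cd3" is a made-up example.

```r
# "w17u2" is from the answer; "ab12cd3" is a hypothetical extra input.
ss <- c("w17u2", "ab12cd3")

# Splitting approach from above: zero-width lookarounds mark the boundaries.
split_result <- strsplit(ss, "(?<=[a-z])(?![a-z])|(?<=\\d)(?!\\d)", perl = TRUE)

# Extraction approach: pull out each maximal run of lowercase letters or digits.
extract_result <- regmatches(ss, gregexpr("[a-z]+|[0-9]+", ss))

print(split_result)
print(extract_result)
```

The extraction form may be easier to read when the pieces are simpler to describe than the boundaries between them.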


