R strsplit with multiple unordered split arguments?
Actually, strsplit uses regular expression patterns as well. (A comma is not a regex metacharacter, so the double-escaped "\\," used below is harmless but unnecessary; a plain "," behaves the same. Likewise, writing the space as "\\s" is a matter of readability rather than necessity.):
> strsplit(test_1, "\\, |\\,| ") # three possibilities OR'ed
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_2, "\\, |\\,| ")
[[1]]
[1] "abc" "def" "ghi" "klm"
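The pattern needs both "\\, " (comma followed by a space) and "\\," as alternatives: with only "\\,| ", each comma-plus-space in test_2 would be split twice and you would have gotten empty strings ("") in the result. It might have been clearer if I had written: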
> strsplit(test_2, "\\,\\s|\\,|\\s")
[[1]]
[1] "abc" "def" "ghi" "klm"
@Fojtasek is so right: using character classes often simplifies the task because a class is an implicit logical OR, and the + quantifier treats a run of delimiters as a single split point:
> strsplit(test_2, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_1, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
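The definitions of test_1 and test_2 are not shown above. The sketch below assumes two plausible values (one mixing spaces and commas, one using comma-plus-space) to illustrate why the character class with + is the more forgiving pattern. The input values are an assumption for illustration only.

```r
# Assumed inputs (hypothetical; the original definitions are not shown)
test_1 <- "abc def,ghi klm"
test_2 <- "abc, def, ghi, klm"

# Naive alternation: each ", " is split twice, leaving empty strings
naive <- strsplit(test_2, ",| ")[[1]]

# Character class with +: a run of delimiters counts as one split point
res_1 <- strsplit(test_1, "[, ]+")[[1]]
res_2 <- strsplit(test_2, "[, ]+")[[1]]

print(naive)
print(res_1)
print(res_2)
```

The unescaped comma works here, which also shows that the escaping in the earlier patterns is optional.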
Use strsplit with multiple delimiters
You can also try str_split from stringr:
library(stringr)
lapply(str_split(df$V1, "(?<!\\()\\-|[:\\)\\(]"), function(x) x[x != ""])
Result:
[[1]]
[1] "Chr3" "153922357" "153944632" "-"
[[2]]
[1] "Chr11" "70010183" "70015411" "-"
Data:
df = read.table(text = " Chr3:153922357-153944632(-)
Chr11:70010183-70015411(-) ")
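If stringr is not available, a different technique gives the same fields in base R: instead of splitting on delimiters, capture each field with regexec() and regmatches(). This is a sketch of an alternative, not the method used in the answer above, and it assumes every record follows the name:start-end(strand) layout of the sample data.

```r
# Same two records as in the sample data
x <- c("Chr3:153922357-153944632(-)", "Chr11:70010183-70015411(-)")

# One capture group per field: name, start, end, strand
pat <- "^([^:]+):([0-9]+)-([0-9]+)\\((.)\\)$"

# regexec() reports the full match plus each group; drop the full match
fields <- lapply(regmatches(x, regexec(pat, x)), function(m) m[-1])
print(fields)
```

Capturing avoids the need for a lookbehind to protect the strand sign, at the cost of failing (returning an empty vector) for records that do not fit the layout.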
R strsplit() with multiple criteria
We can use regex lookarounds to split each line at the whitespace after 'is' or 'never'. Here, (?<=\\bis)\\s+ matches one or more whitespace characters (\\s+) that follow the word 'is', and the alternative after | does the same for whitespace that follows the word 'never'. Lookbehinds require perl = TRUE.
strsplit(str[,1], "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)
#[[1]]
#[1] "This is" "line one"
#[[2]]
#[1] "This is" "not line one"
#[[3]]
#[1] "This can never" "be line one"
If we also want to remove 'is' and 'never':
strsplit(str[,1], "(?:\\s+(is|never)\\s+)")
#[[1]]
#[1] "This" "line one"
#[[2]]
#[1] "This" "not line one"
#[[3]]
#[1] "This can" "be line one"
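The lookbehind pattern can be built from a vector of words rather than typed by hand. The helper below, split_after_words, is a hypothetical name introduced for this sketch, and the input vector is an assumption reconstructed from the output shown above, since the original str object is not given.

```r
# Hypothetical helper: split at the whitespace that follows any given word.
# Words are pasted into the regex as-is, so they must not contain
# regex metacharacters.
split_after_words <- function(x, words) {
  # One fixed-length lookbehind per word, joined with |
  pattern <- paste0("(?<=\\b", words, ")\\s+", collapse = "|")
  strsplit(x, pattern, perl = TRUE)
}

# Assumed input, reconstructed from the output shown above
txt <- c("This is line one",
         "This is not line one",
         "This can never be line one")

parts <- split_after_words(txt, c("is", "never"))
print(parts)
```

Keeping each word in its own lookbehind sidesteps PCRE's requirement that a lookbehind match a fixed length.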
R: how to split string correctly when there are multiple signs
The strsplit function from base R is somewhat limited: a match at the end of the input does not produce a trailing empty string, so the last empty field is dropped. Try the stringr or stringi packages instead. For example:
library(stringr)
str_split("A++", "\\+")
This returns the required result:
[[1]]
[1] "A" "" ""
str_split is vectorized over both the input string and the pattern.
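For comparison, here is a base R sketch that keeps the trailing empty fields without any extra package. It uses regmatches() with invert = TRUE, which returns the text between matches; this is an alternative technique, not something the answer above proposes.

```r
x <- "A++"

# strsplit() drops the final empty field
base_split <- strsplit(x, "\\+")[[1]]

# regmatches(..., invert = TRUE) returns the pieces between matches,
# including empty pieces at the end of the string
kept <- regmatches(x, gregexpr("\\+", x), invert = TRUE)[[1]]

print(base_split)
print(kept)
```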
strsplit doesn't always split on '?'
To split on any of several delimiters, place them inside a character class ([]) and remove fixed = TRUE, or paste the patterns together with | so that any one of them matches:
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = "[.!?]")[[1]]
According to ?strsplit, for the split argument: "If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x."
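The two options described above (character class versus alternation) can be compared side by side. The sketch below also shows why fixed = TRUE has to go: with it, the class is taken as a literal string and nothing is split. The variable names are introduced here for illustration.

```r
txt <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# Option 1: character class; ., ! and ? are literal inside []
by_class <- strsplit(txt, "[.!?]")[[1]]

# Option 2: alternation; . and ? must be escaped outside a class
delims <- c("\\.", "!", "\\?")
by_alt <- strsplit(txt, paste(delims, collapse = "|"))[[1]]

# With fixed = TRUE the pattern is literal, so "[.!?]" never matches
by_fixed <- strsplit(txt, "[.!?]", fixed = TRUE)[[1]]

# Strip the spaces left around each sentence
print(trimws(by_class))
```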
How do I attribute different parameters to the function strsplit(split = )?
This should work if you want to split on both:
library(stringr)
x <- c("banana.apple turning.something")
str_split(x, "[\\.\\s]")
# [[1]]
# [1] "banana" "apple" "turning" "something"
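The same split can be sketched in base R without stringr, using a POSIX class for whitespace inside the bracket expression. This is an equivalent written for comparison; the answer above uses str_split.

```r
x <- c("banana.apple turning.something")

# "." is literal inside [], and [:space:] covers spaces, tabs and newlines
parts <- strsplit(x, "[.[:space:]]")[[1]]
print(parts)
```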
strsplit by spaces greater than one in R
You may specify it through a repetition quantifier.
strsplit(mystr, "\\s{2,}")
The regex \\s{2,} matches two or more consecutive whitespace characters, so single spaces inside a field are left alone.
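A small runnable sketch of this quantifier follows. The value of mystr is not shown in the answer, so the input below is an assumption chosen to have single spaces inside fields and longer gaps between them.

```r
# Assumed input (hypothetical): single spaces within fields, 2+ between them
mystr <- "New York  Los Angeles     San Francisco"

# Split only where at least two whitespace characters occur in a row
fields <- strsplit(mystr, "\\s{2,}")[[1]]
print(fields)
```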
strsplit: split strings from integers
An option is to use look-aheads/look-behinds
ss <- "w17u2"
unlist(strsplit(ss, "((?<=[a-z])(?![a-z])|(?<=\\d)(?!\\d))", perl = TRUE))
#[1] "w" "17" "u" "2"
Explanation:
(?<=[a-z])(?![a-z]) splits the string at the position where the preceding character matches [a-z] and the following character does not match [a-z]. Similarly, (?<=\\d)(?!\\d) splits the string at the position where the preceding character is a digit and the following character is not a digit. The final regular expression is the alternation (OR) of both patterns.
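An alternative that avoids lookarounds is to extract the runs directly instead of splitting between them. The sketch below uses gregexpr() and regmatches() from base R; it is a different technique from the one explained above and assumes the input contains only lowercase letters and digits.

```r
ss <- "w17u2"

# Match each maximal run of letters or of digits and pull the runs out
runs <- regmatches(ss, gregexpr("[a-z]+|[0-9]+", ss))[[1]]
print(runs)
```

Any character outside the two classes would simply be skipped by this pattern, whereas the lookaround split keeps it attached to a neighbouring piece.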