How to Use the Strsplit Function with a Period

How to split a string using the period/dot/decimal point '.' as a delimiter in R

In a regular expression, . is a metacharacter that matches any single character, so to split on a literal period we need to either escape it (\\.) or pass fixed = TRUE:

strsplit(s, '.', fixed = TRUE)[[1]][2]
[1] "334"

According to ?strsplit

split - character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.

Also, because strsplit returns a list, extract the first list element with [[1]] and then take its second element with [2].
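To see why the escape matters, here is a minimal sketch comparing the unescaped pattern with three ways of matching a literal period. The value of s is an assumption made for illustration (the original does not show it); any string containing a single period should behave the same way.

```r
# Hypothetical input; the original value of `s` is not shown in the text.
s <- "12.334"

# Unescaped: "." is a regex wildcard, so every character is treated as a
# delimiter and only empty strings are left over.
unescaped <- strsplit(s, ".")[[1]]

# Three equivalent ways to split on a literal period.
escaped   <- strsplit(s, "\\.")[[1]]               # backslash escape
bracketed <- strsplit(s, "[.]")[[1]]               # character class
literal   <- strsplit(s, ".", fixed = TRUE)[[1]]   # no regex at all

print(unescaped)
print(literal)
```

fixed = TRUE is usually the simplest choice when the delimiter is a plain string, since nothing needs escaping.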


Or wrap the pattern with fixed() in str_split from the stringr package

library(stringr)
str_split(s, fixed('.'))[[1]][2]
[1] "334"

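We can also get the same output with trimws (the whitespace argument requires R 3.6.0 or later), which here strips everything up to and including the last period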

trimws(s, whitespace = ".*\\.")
[1] "334"

Or with sub

sub(".*\\.", "", s)
[1] "334"
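These approaches are not interchangeable once the string contains more than one period: the greedy .* in sub() removes everything up to the last period, while [[1]][2] picks the second piece. A short sketch, using a file-name-like string as an assumed input:

```r
# Assumed input with two periods, chosen only to show the difference.
f <- "ACC_1346.table.txt"

# Greedy ".*\\." swallows everything up to and including the LAST period.
after_last <- sub(".*\\.", "", f)

# Indexing the split result with [2] returns the SECOND piece instead.
pieces <- strsplit(f, ".", fixed = TRUE)[[1]]
second <- pieces[2]

# tail() picks the last piece regardless of how many periods there are.
last_piece <- tail(pieces, 1)

print(c(after_last = after_last, second = second, last_piece = last_piece))
```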

R strsplit doesn't split on . ?

To avoid regex altogether, use fixed = TRUE in the call to strsplit:

infile = "ACC_1346.table.txt"
x = strsplit(infile, ".", fixed = TRUE)

x

[[1]]
[1] "ACC_1346" "table" "txt"

strsplit using a space instead of a period

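The error comes from the fact that data.frame coerces your character vector into a factor (the default before R 4.0.0, where stringsAsFactors was TRUE), and strsplit only accepts character input, as noted in the documentation.

You can either convert the column back to character first: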

student.exam.data$Student <-  strsplit(as.character(student.exam.data$Student), " ", fixed = TRUE)

Or

student.exam.data <- data.frame(Student,Math,Science,English, stringsAsFactors = FALSE)
student.exam.data$Student <- strsplit(student.exam.data$Student, " ", fixed = TRUE)
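The failure and the fix can be reproduced in a small sketch. The student names and scores below are made up for illustration, and stringsAsFactors is set explicitly so the behaviour does not depend on the R version.

```r
# Hypothetical data; names and scores are invented for this example.
Student <- c("John Smith", "Jane Doe")
Math <- c(90, 85)

as_factor <- data.frame(Student, Math, stringsAsFactors = TRUE)
as_char   <- data.frame(Student, Math, stringsAsFactors = FALSE)

# strsplit() rejects factors, so capture the error message instead of stopping.
msg <- tryCatch(
  {
    strsplit(as_factor$Student, " ", fixed = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With a character column the split works; take the first word of each name.
parts <- strsplit(as_char$Student, " ", fixed = TRUE)
first_names <- vapply(parts, function(p) p[1], character(1))
print(first_names)
```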

Use strsplit starting at end of string

It looks like you're a bit mixed up about how to use look-around assertions. The pattern you're using, "(?<=.{4})", is a look-behind assertion that says "find me all inter-character spaces that are preceded by four characters of any kind", which is not what you really want.

The pattern you actually want, "(?=.{4}$)", is a look-ahead assertion that finds the single inter-character space that is followed by four characters of any kind followed by the end of the string.

There is, unfortunately, an unpleasant twist. strsplit() interacts oddly with zero-width look-ahead matches, because after each split it restarts the search on the remainder of the string; as a result, the pattern you'll actually need is "(?<=.)(?=.{4}$)". Here's what that looks like in action:

x <- c("Samp003A", "Sam003A")
strsplit(x, split="(?<=.)(?=.{4}$)", perl=TRUE)
# [[1]]
# [1] "Samp" "003A"
#
# [[2]]
# [1] "Sam" "003A"

If all you really want are the final four characters of each entry, maybe just use substr(), like this:

x <- c("Samp003A", "Sam003A")
substr(x, start=nchar(x)-3, stop=nchar(x))
# [1] "003A" "003A"
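If both halves are needed without any regex, the same substr() idea extends naturally. The helper below is a hypothetical sketch (split_tail is not a base R function) and assumes every string has at least n characters.

```r
# Hypothetical helper: split each string into (head, last n characters).
# Assumes nchar(x) >= n for every element.
split_tail <- function(x, n = 4) {
  len <- nchar(x)
  head_part <- substr(x, 1, len - n)
  tail_part <- substr(x, len - n + 1, len)
  # Pair the two halves element by element, returning a list like strsplit().
  mapply(function(h, t) c(h, t), head_part, tail_part,
         SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

x <- c("Samp003A", "Sam003A")
res <- split_tail(x)
print(res)
```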

Regexes work on their own, but not when used together in strsplit

Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:

#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
     ^

#match + left removed
"AB-A4K7-01A-13B-J29Q-10"

#further matches found and removed
"01A-13B-J29Q-10"

#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"

#algorithm continues
"13B-J29Q-10"

This was fixed by replacing the [-.] class that detected the edges of the token in the lookbehind and lookahead with a word-boundary anchor (\\b), as per Jota's suggestion:

> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=TRUE)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
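The effect is easier to see on a smaller input. The string and both patterns below are assumptions chosen for illustration, not the original poster's data: the first lookbehind needs the hyphen that strsplit has already consumed, while the second relies only on a word boundary.

```r
# Assumed toy input: split on "-" and also between the digit and the letter.
x <- "a-1b"

# The lookbehind requires the hyphen, but by the time "1b" is searched the
# hyphen has been removed along with everything to its left.
with_hyphen <- strsplit(x, "-|(?<=-\\d)(?=[a-z])", perl = TRUE)[[1]]

# A word boundary still matches at the start of the remaining string.
with_boundary <- strsplit(x, "-|(?<=\\b\\d)(?=[a-z])", perl = TRUE)[[1]]

print(with_hyphen)
print(with_boundary)
```

Testing each alternative separately with grepl() or regmatches() on the full string would report a match for both patterns, which is why the problem only shows up inside strsplit.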

apply strsplit rowwise

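This should do the trick: split each element on a literal period, then use sapply with "[[" to pull the second piece out of every list element.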

R> sapply(strsplit(as.character(h$tes), "\\."), "[[", 2)
[1] "abc" "di" "lik"
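A self-contained version of the same idea is sketched below. The contents of h are not shown in the original, so the data frame here is hypothetical (only the column name tes and the three results come from the call above), and vapply is swapped in for sapply so the return type is checked.

```r
# Hypothetical data frame; only the column name `tes` comes from the original.
h <- data.frame(tes = c("1.abc", "2.di", "3.lik"), stringsAsFactors = FALSE)

# Split every row on a literal period.
pieces <- strsplit(h$tes, ".", fixed = TRUE)

# Take the second piece of each row; vapply guarantees one string per row
# (NA if a row has no second piece).
h$second <- vapply(pieces, function(p) p[2], character(1))

print(h)
```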

How to use strsplit on SparkDataFrame

Instead of strsplit, you need to use the Spark-specific functions listed in the SparkR API documentation. Specifically, use the split_string function combined with getItem (note the L suffix, which forces the index to be an integer):

new_df <- withColumn(sdf, "new_id", getItem(split_string(sdf$old, ","), 0L))

