How to split a string using the period/dot/decimal point '.' as a delimiter in R
We need to either escape the dot (\\.) or use fixed = TRUE, because . is a regex metacharacter that matches any character:
strsplit(s, '.', fixed = TRUE)[[1]][2]
[1] "334"
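To see why the escape or fixed = TRUE is needed, the following sketch compares the three calls side by side. The value of s is not shown in the original, so "12.334" is an assumed stand-in chosen only because it reproduces the "334" output.

```r
# Assumed input: the original does not show s, so this value is hypothetical.
s <- "12.334"

# Unescaped dot: every character is treated as a delimiter,
# so only empty strings are left over.
unescaped <- strsplit(s, ".")[[1]]

# Escaped dot: the regex matches a literal period.
escaped <- strsplit(s, "\\.")[[1]]

# fixed = TRUE: the pattern is taken literally, no regex involved.
literal <- strsplit(s, ".", fixed = TRUE)[[1]]

print(unescaped)
print(escaped)
print(literal)
```

The last two calls should give the same pieces; the first one shows the failure that the question title describes.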
According to ?strsplit
split - character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
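The two special cases in that passage (an empty split, and a split vector longer than 1) can be illustrated with a small sketch. The input strings are made up for the example.

```r
# An empty split breaks the string into single characters.
chars <- strsplit("abc", "")[[1]]
print(chars)

# A split vector is recycled along x: "." is used for the first
# string and "-" for the second (fixed = TRUE keeps both literal).
x <- c("a.b", "c-d")
parts <- strsplit(x, c(".", "-"), fixed = TRUE)
print(parts)
```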
Also, as strsplit returns a list, extract the list element with [[ and get the second element ([2]).
Or wrap the pattern with fixed() in str_split:
library(stringr)
str_split(s, fixed('.'))[[1]][2]
[1] "334"
We can also get the output with trimws
trimws(s, whitespace = ".*\\.")
[1] "334"
Or with sub
sub(".*\\.", "", s)
[1] "334"
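One thing worth keeping in mind with the sub approach is that ".*" is greedy, so the pattern removes everything up to the last dot. A minimal sketch, using a file name with two dots as the example input, and a second pattern (an assumption, not from the original answer) that stops at the first dot instead:

```r
f <- "ACC_1346.table.txt"

# Greedy: drops everything up to and including the LAST dot.
after_last <- sub(".*\\.", "", f)

# Anchored character class: drops everything up to and including the FIRST dot.
after_first <- sub("^[^.]*\\.", "", f)

print(after_last)
print(after_first)
```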
R strsplit doesn't split on . ?
To avoid regex altogether, use fixed = TRUE in the call to strsplit:
infile = "ACC_1346.table.txt"
x = strsplit(infile, ".", fixed = TRUE)
x
[[1]]
[1] "ACC_1346" "table" "txt"
strsplit using a space instead of a period
The error comes from the fact that data.frame() coerces your character vector into a factor (the default behaviour before R 4.0.0), and strsplit() throws an error when given a factor, as noted in the documentation.
Either you can do
student.exam.data$Student <- strsplit(as.character(student.exam.data$Student), " ", fixed = TRUE)
Or
student.exam.data <- data.frame(Student,Math,Science,English, stringsAsFactors = FALSE)
student.exam.data$Student <- strsplit(student.exam.data$Student, " ", fixed = TRUE)
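The failure and the fix can be reproduced in a few lines. This is a sketch with made-up student names; stringsAsFactors = TRUE is set explicitly so that the factor coercion happens regardless of the R version.

```r
# Hypothetical data: the names are invented for illustration.
Student <- c("Ann Lee", "Bob Ray")
df <- data.frame(Student, stringsAsFactors = TRUE)

# Calling strsplit() on the factor column fails; capture the message.
msg <- tryCatch(
  {
    strsplit(df$Student, " ", fixed = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Converting to character first makes the split work.
parts <- strsplit(as.character(df$Student), " ", fixed = TRUE)
print(parts)
```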
Use strsplit starting at end of string
It looks like you're a bit mixed up about how to use look-around assertions. The pattern you're using, "(?<=.{4})", is a look-behind assertion that says "find me all inter-character spaces that are preceded by four characters of any kind", which is not what you really want.
The pattern you actually want, "(?=.{4}$)", is a look-ahead assertion that finds the single inter-character space that is followed by four characters of any kind followed by the end of the string.
There is, unfortunately, an unpleasant twist. For reasons discussed in the answers to this question, strsplit() interacts oddly with look-ahead assertions; as a result, the pattern you'll actually need is "(?<=.)(?=.{4}$)". Here's what that looks like in action:
x <- c("Samp003A", "Sam003A")
strsplit(x, split = "(?<=.)(?=.{4}$)", perl = TRUE)
# [[1]]
# [1] "Samp" "003A"
#
# [[2]]
# [1] "Sam" "003A"
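To make the "unpleasant twist" concrete, the sketch below runs both patterns on the same input so the results can be compared. The variable names are only illustrative, and the output of the look-ahead-only call is left for the reader to inspect.

```r
x <- c("Samp003A", "Sam003A")

# Look-ahead only: the pattern one would expect to work.
lookahead_only <- strsplit(x, "(?=.{4}$)", perl = TRUE)

# Look-behind plus look-ahead: the pattern recommended above.
combined <- strsplit(x, "(?<=.)(?=.{4}$)", perl = TRUE)

print(lookahead_only)
print(combined)
```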
If all you really want are the final four characters of each entry, maybe just use substr(), like this:
x <- c("Samp003A", "Sam003A")
substr(x, start=nchar(x)-3, stop=nchar(x))
# [1] "003A" "003A"
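If both halves are needed, the same substr() idea extends naturally and avoids regular expressions entirely. A minimal sketch, assuming every entry has at least four characters:

```r
x <- c("Samp003A", "Sam003A")
n <- nchar(x)

# Everything before the final four characters.
prefix <- substr(x, 1, n - 4)

# The final four characters.
suffix <- substr(x, n - 3, n)

print(prefix)
print(suffix)
```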
Regexes work on their own, but not when used together in strsplit
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not behave as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, so the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [-.] class, which the lookbehind used to detect the edges of the token, with a word-boundary anchor (\\b), as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
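The "remove the match and everything to its left" behaviour described above can be imitated with a short loop around regexpr(). This is a simplified re-implementation for illustration only, not R's actual source; the function name split_like_strsplit is hypothetical, and the combined form of the original failing pattern is an assumption pieced together from the fragments shown above.

```r
# Simplified imitation of the strsplit() loop (illustrative only).
split_like_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m[1] == -1) {
      # No further match: the remainder is the last piece.
      out <- c(out, x)
      break
    }
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start + len - 1 > 0) {
      # Keep the text left of the match, then discard match and left part.
      out <- c(out, substr(x, 1, start - 1))
      x <- substr(x, start + len, nchar(x))
    } else {
      # Empty match at the very start: peel off a single character.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    }
  }
  out
}

id <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"
# Assumed reconstruction of the original (failing) pattern.
hyphen_lookbehind <- "[-.]|(?<=[-.]\\d{2})(?=[A-Z][-.])"
# The fixed pattern from the answer.
boundary_lookbehind <- "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)"

print(split_like_strsplit(id, hyphen_lookbehind))
print(split_like_strsplit(id, boundary_lookbehind))
```

Because each search runs on the shortened string, a lookbehind can no longer see the hyphen that was consumed by the previous match, whereas \\b still matches at the start of the remaining text.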
apply strsplit rowwise
This should do the trick
R> sapply(strsplit(as.character(h$tes), "\\."), "[[", 2)
[1] "abc" "di" "lik"
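A slightly more defensive variant is sketched below. The data frame h is not shown in the original, so its contents here are assumed values chosen to reproduce the output above. Using vapply() with single-bracket indexing returns NA instead of failing when an entry has no second piece.

```r
# Hypothetical data: the original does not show h, so these values are assumed.
h <- data.frame(tes = c("1.abc", "2.di", "3.lik"), stringsAsFactors = TRUE)

# Split on a literal dot, then take the second piece of each row.
parts <- strsplit(as.character(h$tes), ".", fixed = TRUE)
second <- vapply(parts, function(p) p[2], character(1))
print(second)

# An entry without a dot yields NA rather than an error.
with_missing <- vapply(
  strsplit(c("1.abc", "nodot"), ".", fixed = TRUE),
  function(p) p[2],
  character(1)
)
print(with_missing)
```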
How to use strsplit on SparkDataFrame
Instead of strsplit, you need to use Spark-specific functions, which you can find in the Spark R API documentation. Specifically, use the split_string function combined with the getItem function (note the L suffix, which forces the index to be an integer):
new_df <- withColumn(sdf, "new_id", getItem(split_string(sdf$old, ","), 0L))