Split text string in a data.table columns
Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit()
to get the results directly (and in a much more efficient manner):
require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
# PREFIX VALUE PX PY
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D
tstrsplit()
basically is a wrapper for transpose(strsplit())
, where transpose()
function, also recently implemented, transposes a list. Please see ?tstrsplit()
and ?transpose()
for examples.
See history for old answers.
Split text string in a data.table columns by location
You can split on regex "(?<=[A-Za-z])(?=[0-9])"
if you want to split between letters and digits, (?<=[A-Za-z])(?=[0-9]) restricts the split to a position that is preceded by a letter and followed by a digit:
The regex contains two parts, look behind (?<=[A-Za-z])
which means after a letter and look ahead (?=[0-9])
, i.e before a digit, see more about regex look around, in r, you need to specify perl=TRUE
to use Perl-compatible regexps to make these work:
DT2[, c("L", "D") := tstrsplit(a, "(?<=[A-Za-z])(?=[0-9])", perl=TRUE)][]
# a b L D
#1: A10 0.01487372 A 10
#2: B11 0.95035709 B 11
#3: C12 0.49230300 C 12
#4: D13 0.67183871 D 13
#5: E14 0.40076579 E 14
#6: A15 0.27871477 A 15
Split string in data.frame data.table into two columns with Base R
First, we need some sample data, taken from what you posted:
dataset <- data.frame(reCNru = c(15.9, 19.2, 25.2, 21.3),
rn = c("oberschlumpf.2020-05-13", "oberschlumpf.2020-05-12",
"oberschlumpf.2020-05-11", "oberschlumpf.2020-05-10"),
stringsAsFactors = FALSE)
Then we apply the following code in Base R:
newdataset <- setNames(do.call(rbind.data.frame, strsplit(unlist(dataset$rn), '\\.')),
c('rn1', 'rn2'))
newdataset$reCNru <- dataset$reCNru
Perhaps it is interesting to see the solution given by the tidyverse:
dataset %>% tidyr::separate(col = rn, into = c("rn1","rn2"), sep = "\\.")
You will have:
reCNru rn1 rn2
1 15.9 oberschlumpf 2020-05-13
2 19.2 oberschlumpf 2020-05-12
3 25.2 oberschlumpf 2020-05-11
4 21.3 oberschlumpf 2020-05-10
Be aware that the separator is not just "."
but an expression which represents the dot.
Hope it helps.
Split a column of strings into variable number of columns using data.table in R
Just pushed two functions transpose()
and tstrsplit()
in data.table v1.9.5.
With this we can do:
require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
# category.tag transaction category tag
# 1: toys/David 1 toys David
# 2: toys/David 2 toys David
# 3: toys/James 3 toys James
# 4: toys 4 toys NA
# 5: toys 5 toys NA
# 6: toys/James 6 toys James
tstrsplit
is a wrapper for transpose(strsplit(as.character(x), ...))
. And you can also pass fill=.
to fill missing values with any other value than NA
.
transpose()
can also be used on lists, data frames and data tables.
How to split a column in multiple columns using data.table
Use tstrsplit
with keep = 1:3
to keep only the first three columns:
dt[, c("bins", "positions", "IDs") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
name bin position ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
Splitting text column into ragged multiple new columns in a data table in R
Check out cSplit
from my "splitstackshape" package. It works on either data.frame
s or data.table
s (but always returns a data.table
).
Assuming KFB's sample data is at least slightly representative of your actual data, you can try:
library(splitstackshape)
cSplit(df, "x", " ")
# x_1 x_2 x_3 x_4
# 1: This is interesting NA
# 2: This actually is not
Another (blazing) option is to use stri_split_fixed
with simplify = TRUE
(from "stringi") (which is obviously deemed to enter the "splitstackshape" code soon):
library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "This" "is" "interesting" NA
# [2,] "This" "actually" "is" "not"
Split a string with varying length in a data table
I wouldn't use regex for this- it won't be efficient for a big data set. It seems like the word you looking for is always located after the second space. A very simple and efficient solution could be
d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]]
Benchmark
bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
"gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
"tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
# expr min lq mean median uq max neval
# sub: 982.1185 998.6264 1058.1576 1025.8775 1083.1613 1405.051 100
# gsub: 1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027 100
# tstrsplit: 385.4785 452.6476 498.8681 470.8281 537.5499 1044.691 100
can I use string split with dcast in data.table?
We can use data.table
methods i.e. dcast
library(data.table)
dcast(dt[, {x1 <- strsplit(x, "\\."); c(list(unlist(x1)),
.SD[rep(seq_len(.N), lengths(x1))])}], id + x ~ V1, length)
# id x NA ab cde co cox hij kl
#1: 1 <NA> 1 0 0 0 0 0 0
#2: 2 ab.cde 0 1 1 0 0 0 0
#3: 3 co.hij.ab 0 1 0 1 0 1 0
#4: 4 cox.cde.kl 0 0 1 0 1 0 1
#5: 5 <NA> 1 0 0 0 0 0 0
Related Topics
Concatenate Unique Strings After Groupby in R
Drop-Down Checkbox Input in Shiny
Ggplot Separate Legend and Plot
What's the Difference Between '1L' and '1'
R Suppress Startupmessages from Dependency
A Matrix Version of Cor.Test()
Convert Seconds to Days: Hours:Minutes:Seconds
Dplyr Mutate Rowwise Max of Range of Columns
Reduce PDF File Size of Plots by Filtering Hidden Objects
Automatically Delete Files/Folders
Reading Multiple Files and Calculating Mean Based on User Input
What's the Best Way to Use R Scripts on the Command Line (Terminal)