Split Text String in a Data.Table Columns

Split text string in a data.table columns

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
#    PREFIX VALUE PX PY
# 1:    A_B     1  A  B
# 2:    A_C     2  A  C
# 3:    A_D     3  A  D
# 4:    B_A     4  B  A
# 5:    B_C     5  B  C
# 6:    B_D     6  B  D

tstrsplit() basically is a wrapper for transpose(strsplit()), where transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit() and ?transpose() for examples.

See history for old answers.

Split text string in a data.table columns by location

You can split on regex "(?<=[A-Za-z])(?=[0-9])" if you want to split between letters and digits, (?<=[A-Za-z])(?=[0-9]) restricts the split to a position that is preceded by a letter and followed by a digit:

The regex contains two parts, look behind (?<=[A-Za-z]) which means after a letter and look ahead (?=[0-9]), i.e before a digit, see more about regex look around, in r, you need to specify perl=TRUE to use Perl-compatible regexps to make these work:

DT2[, c("L", "D") := tstrsplit(a, "(?<=[A-Za-z])(?=[0-9])", perl=TRUE)][]

#     a          b L  D
#1: A10 0.01487372 A 10
#2: B11 0.95035709 B 11
#3: C12 0.49230300 C 12
#4: D13 0.67183871 D 13
#5: E14 0.40076579 E 14
#6: A15 0.27871477 A 15

Split string in data.frame data.table into two columns with Base R

First, we need some sample data, taken from what you posted:

dataset <- data.frame(reCNru = c(15.9, 19.2, 25.2, 21.3),
                      rn = c("oberschlumpf.2020-05-13", "oberschlumpf.2020-05-12", 
                             "oberschlumpf.2020-05-11", "oberschlumpf.2020-05-10"), 
                      stringsAsFactors = FALSE)

Then we apply the following code in Base R:

newdataset <- setNames(do.call(rbind.data.frame, strsplit(unlist(dataset$rn), '\\.')), 
         c('rn1', 'rn2')) 
newdataset$reCNru <- dataset$reCNru

Perhaps it is interesting to see the solution given by the tidyverse:

dataset %>% tidyr::separate(col = rn, into = c("rn1","rn2"), sep = "\\.")

You will have:

reCNru          rn1        rn2
1   15.9 oberschlumpf 2020-05-13
2   19.2 oberschlumpf 2020-05-12
3   25.2 oberschlumpf 2020-05-11
4   21.3 oberschlumpf 2020-05-10

Be aware that the separator is not just "." but an expression which represents the dot.

Hope it helps.

Split a column of strings into variable number of columns using data.table in R

Just pushed two functions transpose() and tstrsplit() in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

tstrsplit is a wrapper for transpose(strsplit(as.character(x), ...)). And you can also pass fill=. to fill missing values with any other value than NA.

transpose() can also be used on lists, data frames and data tables.

How to split a column in multiple columns using data.table

Use tstrsplit with keep = 1:3 to keep only the first three columns:

dt[, c("bins", "positions", "IDs") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]

                                name  bin  position  ID
1:                bin1_position1_ID1 bin1 position1 ID1
2:                bin2_position2_ID2 bin2 position2 ID2
3:                bin3_position3_ID3 bin3 position3 ID3
4:                bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5

Splitting text column into ragged multiple new columns in a data table in R

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not

Another (blazing) option is to use stri_split_fixed with simplify = TRUE (from "stringi") (which is obviously deemed to enter the "splitstackshape" code soon):

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4] 
# [1,] "This" "is"       "interesting" NA   
# [2,] "This" "actually" "is"          "not"

Split a string with varying length in a data table

I wouldn't use regex for this- it won't be efficient for a big data set. It seems like the word you looking for is always located after the second space. A very simple and efficient solution could be

d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]]

Benchmark

bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
                               "gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
                               "tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
#        expr       min        lq      mean    median        uq      max neval
#       sub:   982.1185  998.6264 1058.1576 1025.8775 1083.1613 1405.051   100
#      gsub:  1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027   100
# tstrsplit:   385.4785  452.6476  498.8681  470.8281  537.5499 1044.691   100

can I use string split with dcast in data.table?

We can use data.table methods i.e. dcast

library(data.table)
dcast(dt[, {x1 <- strsplit(x, "\\."); c(list(unlist(x1)), 
      .SD[rep(seq_len(.N), lengths(x1))])}], id + x ~ V1, length)
#    id          x NA ab cde co cox hij kl
#1:  1       <NA>  1  0   0  0   0   0  0
#2:  2     ab.cde  0  1   1  0   0   0  0
#3:  3  co.hij.ab  0  1   0  1   0   1  0
#4:  4 cox.cde.kl  0  0   1  0   1   0  1
#5:  5       <NA>  1  0   0  0   0   0  0