Split Text String in a Data.Table Columns

Split text string in a data.table columns

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
# PREFIX VALUE PX PY
# 1: A_B 1 A B
# 2: A_C 2 A C
# 3: A_D 3 A D
# 4: B_A 4 B A
# 5: B_C 5 B C
# 6: B_D 6 B D

tstrsplit() basically is a wrapper for transpose(strsplit()), where transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit() and ?transpose() for examples.

See history for old answers.

Split text string in a data.table columns by location

You can split on regex "(?<=[A-Za-z])(?=[0-9])" if you want to split between letters and digits, (?<=[A-Za-z])(?=[0-9]) restricts the split to a position that is preceded by a letter and followed by a digit:

The regex contains two parts, look behind (?<=[A-Za-z]) which means after a letter and look ahead (?=[0-9]), i.e before a digit, see more about regex look around, in r, you need to specify perl=TRUE to use Perl-compatible regexps to make these work:

DT2[, c("L", "D") := tstrsplit(a, "(?<=[A-Za-z])(?=[0-9])", perl=TRUE)][]

# a b L D
#1: A10 0.01487372 A 10
#2: B11 0.95035709 B 11
#3: C12 0.49230300 C 12
#4: D13 0.67183871 D 13
#5: E14 0.40076579 E 14
#6: A15 0.27871477 A 15

Split string in data.frame data.table into two columns with Base R

First, we need some sample data, taken from what you posted:

dataset <- data.frame(reCNru = c(15.9, 19.2, 25.2, 21.3),
rn = c("oberschlumpf.2020-05-13", "oberschlumpf.2020-05-12",
"oberschlumpf.2020-05-11", "oberschlumpf.2020-05-10"),
stringsAsFactors = FALSE)

Then we apply the following code in Base R:

newdataset <- setNames(do.call(rbind.data.frame, strsplit(unlist(dataset$rn), '\\.')), 
c('rn1', 'rn2'))
newdataset$reCNru <- dataset$reCNru

Perhaps it is interesting to see the solution given by the tidyverse:

dataset %>% tidyr::separate(col = rn, into = c("rn1","rn2"), sep = "\\.")

You will have:

reCNru          rn1        rn2
1 15.9 oberschlumpf 2020-05-13
2 19.2 oberschlumpf 2020-05-12
3 25.2 oberschlumpf 2020-05-11
4 21.3 oberschlumpf 2020-05-10

Be aware that the separator is not just "." but an expression which represents the dot.

Hope it helps.

Split a column of strings into variable number of columns using data.table in R

Just pushed two functions transpose() and tstrsplit() in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
# category.tag transaction category tag
# 1: toys/David 1 toys David
# 2: toys/David 2 toys David
# 3: toys/James 3 toys James
# 4: toys 4 toys NA
# 5: toys 5 toys NA
# 6: toys/James 6 toys James

tstrsplit is a wrapper for transpose(strsplit(as.character(x), ...)). And you can also pass fill=. to fill missing values with any other value than NA.

transpose() can also be used on lists, data frames and data tables.

How to split a column in multiple columns using data.table

Use tstrsplit with keep = 1:3 to keep only the first three columns:

dt[, c("bins", "positions", "IDs") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
                                name  bin  position  ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5

Splitting text column into ragged multiple new columns in a data table in R

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
# x_1 x_2 x_3 x_4
# 1: This is interesting NA
# 2: This actually is not

Another (blazing) option is to use stri_split_fixed with simplify = TRUE (from "stringi") (which is obviously deemed to enter the "splitstackshape" code soon):

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "This" "is" "interesting" NA
# [2,] "This" "actually" "is" "not"

Split a string with varying length in a data table

I wouldn't use regex for this- it won't be efficient for a big data set. It seems like the word you looking for is always located after the second space. A very simple and efficient solution could be

d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]] 

Benchmark

bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
"gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
"tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
# expr min lq mean median uq max neval
# sub: 982.1185 998.6264 1058.1576 1025.8775 1083.1613 1405.051 100
# gsub: 1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027 100
# tstrsplit: 385.4785 452.6476 498.8681 470.8281 537.5499 1044.691 100

can I use string split with dcast in data.table?

We can use data.table methods i.e. dcast

library(data.table)
dcast(dt[, {x1 <- strsplit(x, "\\."); c(list(unlist(x1)),
.SD[rep(seq_len(.N), lengths(x1))])}], id + x ~ V1, length)
# id x NA ab cde co cox hij kl
#1: 1 <NA> 1 0 0 0 0 0 0
#2: 2 ab.cde 0 1 1 0 0 0 0
#3: 3 co.hij.ab 0 1 0 1 0 1 0
#4: 4 cox.cde.kl 0 0 1 0 1 0 1
#5: 5 <NA> 1 0 0 0 0 0 0


Related Topics



Leave a reply



Submit