Splitting Text to Words with R and cSplit()

The cSplit function returns a data.table.

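What you are describing is the default print behavior for data.tables: a table with more than 100 rows is shown as just its first and last five rows. To see this in action, try the following:

library(data.table)
as.data.table(airquality)         # 153 rows, so the output is truncated
print(as.data.table(airquality))  # same truncated output

print(as.data.table(airquality), nrows = Inf)  # prints every row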

Thus, to get the full table displayed, you can try:

library(splitstackshape)
print(cSplit(data, "text", " ", "long"), nrows = Inf)
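
The `data` object from the question is not shown here, so the sketch below uses a small hypothetical data frame (the `id` and `text` columns are assumptions). It reproduces in base R roughly what the "long" direction produces, one row per word. It is only meant to illustrate the reshaping, not how cSplit is implemented.

```r
# Hypothetical stand-in for `data`; the original object is not shown.
data <- data.frame(id = 1:2,
                   text = c("one of the", "i want to"),
                   stringsAsFactors = FALSE)

# Split each string on single spaces.
words <- strsplit(data$text, " ", fixed = TRUE)

# Repeat each id once per word, then stack the words into one column.
long <- data.frame(id = rep(data$id, lengths(words)),
                   text = unlist(words),
                   stringsAsFactors = FALSE)
long
```

With real data the long result can easily exceed 100 rows, which is when the truncated printing described above kicks in.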

Split words in R Dataframe column

We can try with gsub. Capture each run of one or more non-whitespace characters (\\S+) as a group (in this case there are 3 words); then, in the replacement, rearrange the backreferences and insert a delimiter (,), which read.table uses to split the result into separate columns.

df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", 
                   "\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to
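
To make the regex step easier to follow, this small sketch runs only the gsub call and shows the comma-delimited strings that read.table then parses. The name `rearranged` is just an illustrative choice, and the data frame is rebuilt inline so the block runs on its own.

```r
# Same two strings as the df1 example.
df1 <- data.frame(Text = c("one of the", "i want to"),
                  stringsAsFactors = FALSE)

# \\1, \\2 and \\3 refer to the first, second and third captured word.
rearranged <- gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
                   "\\1,\\1 \\2,\\2 \\3", df1$Text)
rearranged
# [1] "one,one of,of the" "i,i want,want to"
```

Because the pattern names exactly three groups, this approach only suits strings with exactly three words.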

data

df1 <- structure(list(Text = c("one of the", "i want to")), 
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))

Split string into multiple rows by capital letters with cSplit

One option is separate_rows from tidyr:

library(dplyr)
library(tidyr)
survey %>%
  separate_rows(q1, sep=",(?=[A-Z])")
#                         q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

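With cSplit, the fixed argument is TRUE by default. Setting fixed = FALSE to use a regular expression fails for this pattern, most likely because cSplit passes it to strsplit() without perl = TRUE (see the call in the error message), and R's default regex engine does not support lookaheads such as (?=[A-Z]):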

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed)
: invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
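
The failure can be reproduced without cSplit. The sketch below, in base R only, calls strsplit on two of the survey responses with and without perl = TRUE; the variable names are illustrative.

```r
# Two of the responses from the survey data.
x <- c("I like this,I like that",
       "I like this, but not much,I like that")
pattern <- ",(?=[A-Z])"  # a comma followed by a capital letter

# Default engine: the lookahead is rejected, so keep the error message.
default_result <- tryCatch(
  suppressWarnings(strsplit(x, pattern)),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the lookahead is understood and the split succeeds.
perl_result <- strsplit(x, pattern, perl = TRUE)

default_result
perl_result
```

Here `default_result` ends up holding an error message rather than a list of pieces, which matches the error cSplit reports.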

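One way around this is to first modify the column with a function that accepts PCRE patterns (sub or gsub with perl = TRUE), replacing the separator with a simple character such as ":", and then call cSplit on that character. Note that sub replaces only the first match in each string, which is enough for this data; use gsub if a cell can hold more than two responses.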

cSplit(transform(survey, q1 = sub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
"q1", sep=":", direction = "long")
#                          q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that
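
The difference between sub and gsub matters once a cell holds three or more responses. The string below is a hypothetical example, not part of the survey data.

```r
# Hypothetical response with three parts.
x <- "I like this,I like that,I like both"

# sub() replaces only the first comma that precedes a capital letter.
first_only <- sub(",(?=[A-Z])", ":", x, perl = TRUE)
first_only
# [1] "I like this:I like that,I like both"

# gsub() replaces every such comma.
all_matches <- gsub(",(?=[A-Z])", ":", x, perl = TRUE)
all_matches
# [1] "I like this:I like that:I like both"
```

Splitting `all_matches` on ":" then yields all three responses, whereas `first_only` would leave the last two joined.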

data

survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))

Split data frame string column into multiple columns

Use stringr::str_split_fixed:

library(stringr)
str_split_fixed(before$type, "_and_", 2)
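
The `before` data frame is not shown here, so the following sketch builds a hypothetical one (its values and the new column names `type_1` and `type_2` are assumptions) and attaches the two pieces as columns.

```r
library(stringr)

# Hypothetical stand-in for `before`; the original data frame is not shown.
before <- data.frame(attr = 1:3,
                     type = c("foo_and_bar", "foo_and_bar_2", "a_and_b_and_c"),
                     stringsAsFactors = FALSE)

# n = 2 caps the result at two columns, so only the first "_and_" splits.
parts <- str_split_fixed(before$type, "_and_", 2)

# str_split_fixed() returns a character matrix; copy its columns over.
before$type_1 <- parts[, 1]
before$type_2 <- parts[, 2]
before
```

Because the number of pieces is fixed at 2, the third row keeps "b_and_c" together in the second column.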

Splitting text column into ragged multiple new columns in a data table in R

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not

Another (very fast) option is stri_split_fixed from "stringi" with simplify = TRUE (which will probably find its way into the "splitstackshape" code soon):

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4]
# [1,] "This" "is"       "interesting" NA
# [2,] "This" "actually" "is"          "not"
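
KFB's sample data is not reproduced here; judging from the output, it looks like the two strings used below, which is an assumption. As a rough base R sketch of the same ragged split, each row can be padded with NA up to the longest row. This only mirrors the shape of the results shown here and is not how cSplit or stringi work internally.

```r
# Sample data inferred from the printed results (an assumption).
df <- data.frame(x = c("This is interesting", "This actually is not"),
                 stringsAsFactors = FALSE)

# One character vector per row, of differing lengths.
parts <- strsplit(df$x, " ", fixed = TRUE)
width <- max(lengths(parts))

# Pad each vector with NA to a common length, then bind as rows.
padded <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
padded
```

The result is a 2 x 4 character matrix with NA in the last cell of the first row.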

Split text with '\r\n'

The problem is not with the splitting but rather with the WriteLine. A \n in a string printed with WriteLine will produce an "extra" line.

Example

var text =
    "some interesting text\n" +
    "some text that should be in the same line\r\n" +
    "some text should be in another line";

string[] stringSeparators = new string[] { "\r\n" };
string[] lines = text.Split(stringSeparators, StringSplitOptions.None);
Console.WriteLine("Nr. Of items in list: " + lines.Length); // 2 items
foreach (string s in lines)
{
    Console.WriteLine(s); // but prints 3 lines in total: the first item still contains \n
}

To fix the problem, remove the \n before you print the string:

Console.WriteLine(s.Replace("\n", ""));

Split text string in a data.table columns

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
#    PREFIX VALUE PX PY
# 1:    A_B     1  A  B
# 2:    A_C     2  A  C
# 3:    A_D     3  A  D
# 4:    B_A     4  B  A
# 5:    B_C     5  B  C
# 6:    B_D     6  B  D

tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit and ?transpose for examples.
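
To illustrate the two steps that tstrsplit() combines, the sketch below performs them separately. The `dt` object is reconstructed from the printed output, which is an assumption about the original data, and `by_row` and `by_col` are illustrative names.

```r
library(data.table)

# Sample data reconstructed from the printed output (an assumption).
dt <- data.table(PREFIX = c("A_B", "A_C", "A_D", "B_A", "B_C", "B_D"),
                 VALUE = 1:6)

# strsplit() gives one character vector per row ...
by_row <- strsplit(dt$PREFIX, "_", fixed = TRUE)

# ... and transpose() turns that into one vector per output column.
by_col <- transpose(by_row)

# Assign the two vectors as new columns by reference.
dt[, c("PX", "PY") := by_col]
dt
```

In practice the single tstrsplit() call is shorter; the split version is only useful for seeing what happens in between.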

