Splitting text to words with R and cSplit()
The cSplit function returns a data.table.
What you are describing is the default print behavior for data.tables. To see this in action, try the following:
library(data.table)
as.data.table(airquality)
print(as.data.table(airquality))
print(as.data.table(airquality), nrows = Inf)
Thus, to get the full table displayed, you can try:
library(splitstackshape)
print(cSplit(data, "text", " ", "long"), nrows = Inf)
Split words in R Dataframe column
We can try with gsub. Capture each run of one or more non-whitespace characters (\\S+) as a group (in this case there are 3 words); then, in the replacement, rearrange the backreferences and insert a delimiter (,), which read.table uses to split the result into separate columns.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
"\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to
data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
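To see what the substitution produces before read.table splits it, here is a minimal base R sketch on a single string. The variable names (pattern, replacement, joined, pieces) are only illustrative and not part of the original answer.

```r
# Same pattern and replacement as in the answer, applied to one string
pattern <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
replacement <- "\\1,\\1 \\2,\\2 \\3"

# Each backreference (\\1, \\2, \\3) is one captured word
joined <- gsub(pattern, replacement, "one of the")
print(joined)

# The comma is the delimiter that read.table(sep = ",") later splits on
pieces <- strsplit(joined, ",", fixed = TRUE)[[1]]
print(pieces)
```

Note that the pattern assumes exactly three words per string; other word counts would need a different pattern.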
Split string into multiple rows by capital letters with cSplit
An option with separate_rows
library(dplyr)
library(tidyr)
survey %>%
separate_rows(q1, sep=",(?=[A-Z])")
# q1
#1 I like this
#2 I like that
#3 I like this, but not much
#4 I like that, but not much
#5 I like this
#6 I like that
#7 I like this, but not much
#8 I like that
With cSplit, there is an argument fixed, which is TRUE by default. If we set fixed = FALSE to use this regex, it fails. The strsplit call shown in the error does not pass perl = TRUE, and a lookahead such as (?=[A-Z]) needs the PCRE engine:
library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)
Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed)
: invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
One way to bypass this is to first rewrite the column with a function that accepts PCRE patterns (sub/gsub with perl = TRUE), replacing the separator with a plain one, and then run cSplit on that new sep:
cSplit(transform(survey, q1 = gsub(",(?=[A-Z])", ":", q1, perl = TRUE)),
       "q1", sep=":", direction = "long")
# q1
#1: I like this
#2: I like that
#3: I like this, but not much
#4: I like that, but not much
#5: I like this
#6: I like that
#7: I like this, but not much
#8: I like that
data
survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much",
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))
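The lookahead pattern can also be checked in base R, where strsplit accepts perl = TRUE directly. This is only a sketch on a toy vector (x, parts and long are illustrative names), not a replacement for cSplit.

```r
# Toy input mirroring two of the survey answers above
x <- c("I like this, but not much", "I like this,I like that")

# perl = TRUE enables the PCRE lookahead (?=[A-Z]): the comma is consumed,
# while the capital letter that follows it is kept in the next piece
parts <- strsplit(x, ",(?=[A-Z])", perl = TRUE)

# One row per piece, similar in spirit to direction = "long"
long <- data.frame(q1 = unlist(parts), stringsAsFactors = FALSE)
print(long)
```

The comma in ", but not much" is left alone because it is followed by a space rather than a capital letter.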
Split data frame string column into multiple columns
Use stringr::str_split_fixed
library(stringr)
str_split_fixed(before$type, "_and_", 2)
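As a hedged illustration of how the resulting matrix could be attached back to the data frame, the sketch below invents a small stand-in for before (the real data is not shown here) and uses hypothetical column names type_1 and type_2.

```r
library(stringr)

# Hypothetical stand-in for the asker's `before` data frame
before <- data.frame(type = c("foo_and_bar", "foo_and_bar_2", "baz"),
                     stringsAsFactors = FALSE)

# Character matrix with exactly 2 columns; missing pieces are padded with ""
parts <- str_split_fixed(before$type, "_and_", 2)

# Name the new columns (names chosen here for illustration) and bind them on
new_cols <- setNames(as.data.frame(parts, stringsAsFactors = FALSE),
                     c("type_1", "type_2"))
after <- cbind(before, new_cols)
print(after)
```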
Splitting text column into ragged multiple new columns in a data table in R
Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).
Assuming KFB's sample data is at least slightly representative of your actual data, you can try:
library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not
Another (blazing fast) option is to use stri_split_fixed with simplify = TRUE (from "stringi"), which is expected to make its way into the "splitstackshape" code soon:
library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4]
# [1,] "This" "is"       "interesting" NA
# [2,] "This" "actually" "is"          "not"
Split text with '\r\n'
The problem is not with the splitting but with WriteLine. A lone \n left in a string printed with WriteLine still produces an "extra" line.
Example
var text =
    "some interesting text\n" +
    "some text that should be in the same line\r\n" +
    "some text should be in another line";
// Split only on the Windows line ending; the lone \n stays inside the first item
string[] stringSeparators = new string[] { "\r\n" };
string[] lines = text.Split(stringSeparators, StringSplitOptions.None);
Console.WriteLine("Nr. Of items in list: " + lines.Length); // 2 items
foreach (string s in lines)
{
    Console.WriteLine(s); // But will print 3 lines in total.
}
To fix the problem, remove the \n before you print the string:
Console.WriteLine(s.Replace("\n", ""));
Split text string in a data.table columns
Update: From version 1.9.6 (on CRAN as of Sep '15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):
require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
#    PREFIX VALUE PX PY
# 1:    A_B     1  A  B
# 2:    A_C     2  A  C
# 3:    A_D     3  A  D
# 4:    B_A     4  B  A
# 5:    B_C     5  B  C
# 6:    B_D     6  B  D
tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit and ?transpose for examples.
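To make the "transpose a list" idea concrete, here is a base R sketch of the same reshaping. It is not data.table's implementation, only an illustration of the concept; prefix, parts and cols are illustrative names.

```r
# A few values in the same shape as the PREFIX column above
prefix <- c("A_B", "A_C", "B_A")

# strsplit() returns one character vector per input string (row-wise pieces)
parts <- strsplit(prefix, "_", fixed = TRUE)

# "Transpose" the list: the i-th output vector collects the i-th piece of
# every row, giving one vector per new column (NA where a piece is missing)
n <- max(lengths(parts))
cols <- lapply(seq_len(n), function(i) {
  vapply(parts, function(p) p[i], character(1))
})
print(cols)
```

Each element of cols corresponds to one new column, which is why the result can be assigned to c("PX", "PY") in a single := call.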