Split comma-separated strings in a column into separate rows
This old question frequently is being used as dupe target (tagged with r-faq
). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmark as guidance which of the approaches is the fastest1.
The benchmarked solutions include
- Matthew Lundberg's base R approach but modified according to Rich Scriven's comment,
- Jaap's two
data.table
methods and twodplyr
/tidyr
approaches, - Ananda's
splitstackshape
solution, - and two additional variants of Jaap's
data.table
methods.
Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark
package (see code below).
The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.
Benchmark results
The benchmark results show that for sufficiently large data frames all data.table
methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table
method 2 and the variant DT3
are the fastest, magnitudes faster than the slowest methods.
Remarkably, the timings of the two tidyverse
methods and the splistackshape
solution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.
For smaller data frames, Matt's base R solution and data.table
method 4 seem to have less overhead than the other methods.
Code
director <-
c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula",
"Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A",
"A", "B", "B", "B", "B", "B", "B", "A")
library(data.table)
library(magrittr)
Define function for benchmark runs of problem size n
run_mb <- function(n) {
# compute number of benchmark runs depending on problem size `n`
mb_times <- scales::squish(10000L / n , c(3L, 100L))
cat(n, " ", mb_times, "\n")
# create data
DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
DT <- as.data.table(DF)
# start benchmarks
microbenchmark::microbenchmark(
matt_mod = {
s <- strsplit(as.character(DF$director), ',')
data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
jaap_DT1 = {
DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
][!is.na(director)]},
jaap_DT2 = {
DT[, strsplit(as.character(director), ",", fixed=TRUE),
by = .(AB, director)][,.(director = V1, AB)]},
jaap_dplyr = {
DF %>%
dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
tidyr::unnest(director)},
jaap_tidyr = {
tidyr::separate_rows(DF, director, sep = ",")},
cSplit = {
splitstackshape::cSplit(DF, "director", ",", direction = "long")},
DT3 = {
DT[, strsplit(as.character(director), ",", fixed=TRUE),
by = .(AB, director)][, director := NULL][
, setnames(.SD, "V1", "director")]},
DT4 = {
DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))),
by = .(AB)]},
times = mb_times
)
}
Run benchmark for different problem sizes
# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)
Prepare data for plotting
mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]
Create chart
library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()
Session info & package versions (excerpt)
devtools::session_info()
#Session info
# version R version 3.3.2 (2016-10-31)
# system x86_64, mingw32
#Packages
# data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2)
# dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1)
# forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
# ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
# magrittr * 1.5 2014-11-22 CRAN (R 3.3.0)
# microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
# splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3)
# tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)
1My curiosity was piqued by this exuberant comment Brilliant! Orders of magnitude faster! to a tidyverse
answer of a question which was closed as a duplicate of this question.
How to split comma separated strings in a column into different columns if they're not of same length using python or pandas in jupyter notebook
We can use a regular expression pattern to find all the matching key-value pairs from each row of column_A
, then map
the list of pairs from each row to dictionary in order to create records then construct a dataframe from these records
pd.DataFrame(map(dict, df['column_A'].str.findall(r'\s*([^:,]+):\s*([^,]+)')))
See the online regex demo
Garbage Organics Recycle Junk
0 Tissues Milk Cardboards NaN
1 Paper Towels Eggs Glass Feces
2 cups NaN Plastic bottles NaN
Here is an alternate approach in case you don't want to use regular expression patterns
df['column_A'].str.split(', ').explode()\
.str.split(': ', expand=True)\
.set_index(0, append=True)[1].unstack()
Split delimited strings in a column and insert as new rows
Here is another way of doing it..
df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F)
df
## V1 V2
## 1 1 a,b,c
## 2 2 a,c
## 3 3 b,d
## 4 4 e,f
s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
## V1 V2
## 1 1 a
## 2 1 b
## 3 1 c
## 4 2 a
## 5 2 c
## 6 3 b
## 7 3 d
## 8 4 e
## 9 4 f
unlist and split a column to add to rows without losing information of other column in R
We can use separate_rows
library(tidyr)
separate_rows(df, col1, sep = ";")
If we want to use strsplit
, then have to rep
licate the rows based on the length
of list
elements (lengths
)
lst1 <- strsplit(df$col1, ";")
df1 <- df[rep(seq_len(nrow(df)), lengths(lst1)),]
df1$col1 <- unlist(lst1)
Split one row into multiple rows based on comma-separated string column
Use unnest
on the array returned by split
.
SELECT a,split_b
FROM tbl
CROSS JOIN UNNEST(SPLIT(b,',')) AS t (split_b)
Split strings by commas into columns
Here are several different ways to extract and use the data sticking with read.table()
.
I started with two fake sets of data. In one with nothing of value in the column name (real column name is in row one).
df1 <- data.frame("V1" = c("A,B,C,D",
"AA,D,E,F3",
"Car1,Car2,Car3,Car4",
"a,b,c,d",
"a1,b1,c1,d1"))
# V1
# 1 A,B,C,D
# 2 AA,D,E,F3
# 3 Car1,Car2,Car3,Car4
# 4 a,b,c,d
# 5 a1,b1,c1,d1
In the other, the string that is listed as the column name is a list of the would-be names.
df2 <- data.frame("A,B,C,D" = c("AA,D,E,F3",
"Car1,Car2,Car3,Car4",
"a,b,c,d",
"a1,b1,c1,d1"),
check.names = F)
# A,B,C,D
# 1 AA,D,E,F3
# 2 Car1,Car2,Car3,Car4
# 3 a,b,c,d
# 4 a1,b1,c1,d1
To extract the names and values delimited by a comma, where the would-be headings are in row 1 (using df1
).
# single data.frame with headers concatenated in the first row
df.noHeader <- read.table(col.names = unlist(strsplit(df1[1,],
split = "[,]")),
sep = ",",
skip = 1, # since the headers were in row 1
text = unlist(df1, use.names = F))
# A B C D
# 1 AA D E F3
# 2 Car1 Car2 Car3 Car4
# 3 a b c d
# 4 a1 b1 c1 d1
For clarity, this is what works for when the names are in the column name of the original data frame.
# splitting the original header when splitting the data
df.header <- read.table(col.names = unlist(strsplit(names(df2),
split = "[,]")),
sep = ",",
text = unlist(df2))
# A B C D
# 1 AA D E F3
# 2 Car1 Car2 Car3 Car4
# 3 a b c d
# 4 a1 b1 c1 d1
If you had the headings in some other row, you only need to change the value in the call to strsplit()
, like this:
# if the headers were in row 2
df.noHeader <- read.table(col.names = unlist(strsplit(df1[2,], # <- see 2 here
split = "[,]")),
sep = ",",
skip = 2, # since the headers were in row 2
text = unlist(df1, use.names = F))
# AA D E F3
# 1 Car1 Car2 Car3 Car4
# 2 a b c d
# 3 a1 b1 c1 d1
# if the headers were in row 3
df.noHeader <- read.table(col.names = unlist(strsplit(df1[3,], # <- see 3 here
split = "[,]")),
sep = ",",
skip = 3, # since the headers were in row 3
text = unlist(df1, use.names = F))
# Car1 Car2 Car3 Car4
# 1 a b c d
# 2 a1 b1 c1 d1
Split (explode) pandas dataframe string entry to separate rows
How about something like this:
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Then you just have to rename the columns
Turning a Comma Separated string into individual rows
You can use the wonderful recursive functions from SQL Server:
Sample table:
CREATE TABLE Testdata
(
SomeID INT,
OtherID INT,
String VARCHAR(MAX)
);
INSERT Testdata SELECT 1, 9, '18,20,22';
INSERT Testdata SELECT 2, 8, '17,19';
INSERT Testdata SELECT 3, 7, '13,19,20';
INSERT Testdata SELECT 4, 6, '';
INSERT Testdata SELECT 9, 11, '1,2,3,4';
The query
WITH tmp(SomeID, OtherID, DataItem, String) AS
(
SELECT
SomeID,
OtherID,
LEFT(String, CHARINDEX(',', String + ',') - 1),
STUFF(String, 1, CHARINDEX(',', String + ','), '')
FROM Testdata
UNION all
SELECT
SomeID,
OtherID,
LEFT(String, CHARINDEX(',', String + ',') - 1),
STUFF(String, 1, CHARINDEX(',', String + ','), '')
FROM tmp
WHERE
String > ''
)
SELECT
SomeID,
OtherID,
DataItem
FROM tmp
ORDER BY SomeID;
-- OPTION (maxrecursion 0)
-- normally recursion is limited to 100. If you know you have very long
-- strings, uncomment the option
Output
SomeID | OtherID | DataItem
--------+---------+----------
1 | 9 | 18
1 | 9 | 20
1 | 9 | 22
2 | 8 | 17
2 | 8 | 19
3 | 7 | 13
3 | 7 | 19
3 | 7 | 20
4 | 6 |
9 | 11 | 1
9 | 11 | 2
9 | 11 | 3
9 | 11 | 4
Related Topics
Creating a New Column Based on Unique Id With Values in R
Creating Grouped Bar-Plot of Multi-Column Data in R
Subtracting Two Columns to Give a New Column in R
Extract Rows for the First Occurrence of a Variable in a Data Frame
Sort (Order) Data Frame Rows by Multiple Columns
Combine Two Data Frames by Rows (Rbind) When They Have Different Sets of Columns
How to Convert Excel Date Format to Proper Date in R
Remove 'A' from Legend When Using Aesthetics and Geom_Text
Sum Rows in Data.Frame or Matrix
Generating All Distinct Permutations of a List in R
Concatenate String Columns and Order in Alphabetical Order
Remove Unwanted Symbols from Expression Function - R
Choose the Top Five Values from Each Group in R