R Semicolon Delimited a Column into Rows

R semicolon delimited a column into rows

You could try unnest from tidyr after splitting the "PolId" column and get the unique rows

library(dplyr)
library(tidyr)
unnest(setNames(strsplit(df$PolId, ';'), df$Description),
Description) %>% unique()

Or using base R with stack/strsplit/duplicated. Split the "PolId" (strsplit) by the delimiter(;), name the output list elements with "Description" column, stack the list to get a 'data.frame' and use duplicated to remove the duplicate rows.

df1 <- stack(setNames(strsplit(df$PolId, ';'), df$Description))
setNames(df1[!duplicated(df1),], names(df))
# PolId Description
#1 ABC123 TEST1
#2 ABC456 TEST1
#3 ABC789 TEST1
#10 AAA123 TEST1
#11 AAA123 TEST2
#12 ABB123 TEST3
#13 ABC123 TEST3

Or another option without using strsplit

v1 <- with(df, tapply(PolId, Description, FUN= function(x) {
x1 <- paste(x, collapse=";")
gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl=TRUE)}))
library(stringr)
Description <- rep(names(v1), str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)
# PolId Description
#1 ABC123 TEST1
#2 ABC456 TEST1
#3 ABC789 TEST1
#4 AAA123 TEST1
#5 AAA123 TEST2
#6 ABB123 TEST3
#7 ABC123 TEST3

R data.table: melt comma-separated column into rows?

Here's a way:

# get last two column names
cols=names(df)[-1]

# convert columns to list
df[,(cols) := lapply(.SD, strsplit, split=','), .SDcols=cols]

# explode the list
dfnew = df[, lapply(.SD, unlist), by=1:nrow(df)]
dfnew[, nrow:=NULL]

print(dfnew)

Id Column1 Column2
1: 1 A H
2: 1 B H
3: 1 C H
4: 2 D J
5: 2 E K
6: 3 F L
7: 3 F M
8: 3 F N

Split comma-separated strings in a column into separate rows

This old question frequently is being used as dupe target (tagged with r-faq). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmark as guidance which of the approaches is the fastest1.

The benchmarked solutions include

  • Matthew Lundberg's base R approach but modified according to Rich Scriven's comment,
  • Jaap's two data.table methods and two dplyr / tidyr approaches,
  • Ananda's splitstackshapesolution,
  • and two additional variants of Jaap's data.table methods.

Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.

Benchmark results

Sample Image

The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, magnitudes faster than the slowest methods.

Remarkably, the timings of the two tidyverse methods and the splistackshape solution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.

Code

director <- 
c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula",
"Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A",
"A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

Define function for benchmark runs of problem size n

run_mb <- function(n) {
# compute number of benchmark runs depending on problem size `n`
mb_times <- scales::squish(10000L / n , c(3L, 100L))
cat(n, " ", mb_times, "\n")
# create data
DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
DT <- as.data.table(DF)
# start benchmarks
microbenchmark::microbenchmark(
matt_mod = {
s <- strsplit(as.character(DF$director), ',')
data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
jaap_DT1 = {
DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
][!is.na(director)]},
jaap_DT2 = {
DT[, strsplit(as.character(director), ",", fixed=TRUE),
by = .(AB, director)][,.(director = V1, AB)]},
jaap_dplyr = {
DF %>%
dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
tidyr::unnest(director)},
jaap_tidyr = {
tidyr::separate_rows(DF, director, sep = ",")},
cSplit = {
splitstackshape::cSplit(DF, "director", ",", direction = "long")},
DT3 = {
DT[, strsplit(as.character(director), ",", fixed=TRUE),
by = .(AB, director)][, director := NULL][
, setnames(.SD, "V1", "director")]},
DT4 = {
DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))),
by = .(AB)]},
times = mb_times
)
}

Run benchmark for different problem sizes

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

Prepare data for plotting

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()

Session info & package versions (excerpt)

devtools::session_info()
#Session info
# version R version 3.3.2 (2016-10-31)
# system x86_64, mingw32
#Packages
# data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2)
# dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1)
# forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
# ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
# magrittr * 1.5 2014-11-22 CRAN (R 3.3.0)
# microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
# splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3)
# tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)

1My curiosity was piqued by this exuberant comment Brilliant! Orders of magnitude faster! to a tidyverse answer of a question which was closed as a duplicate of this question.

Split delimited strings in a column and insert as new rows

Here is another way of doing it..

df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F)

df
## V1 V2
## 1 1 a,b,c
## 2 2 a,c
## 3 3 b,d
## 4 4 e,f

s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
## V1 V2
## 1 1 a
## 2 1 b
## 3 1 c
## 4 2 a
## 5 2 c
## 6 3 b
## 7 3 d
## 8 4 e
## 9 4 f

R separate comma separated cells into rows and Cartesian product

There are times when a for loop is totally fine to work with in R. This is one of those times. Try:

library(splitstackshape)
cols <- c("name", "new")
for (i in cols) {
mydf <- cSplit(mydf, i, ",", "long")
}

mydf
## name AB new
## 1: AB A 1
## 2: AB A 2
## 3: AB A 3
## 4: BW A 1
## 5: BW A 2
## 6: BW A 3
## 7: x B 4
## 8: x B 5
## 9: x B 6
## 10: x B 7
## 11: y B 4
## 12: y B 5
## 13: y B 6
## 14: y B 7
## 15: z B 4
## 16: z B 5
## 17: z B 6
## 18: z B 7

Here's a small test using slightly bigger data:

# concat.test = sample data from "splitstackshape"
test <- do.call(rbind, replicate(5000, concat.test, FALSE))

fun1 <- function() {
cols <- c("Likes", "Siblings")
for (i in cols) {
test <- cSplit(test, i, ",", "long")
}
test
}

fun2 <- function() {
test %>%
separate_rows("Likes") %>%
separate_rows("Siblings")
}

system.time(fun1())
# user system elapsed
# 3.205 0.056 3.261
system.time(fun2())
# user system elapsed
# 11.598 0.066 11.662

splitting multiple values in one column into multiple rows R

There's a function separate_rows in tidyr to do just that:

library(tidyr)
## The ";\\s+" means that the separator is a ";" followed by one or more spaces
separate_rows(df,abilities,sep=";\\s+")
number abilities
<dbl> <chr>
1 51 b1261
2 57 d710
3 57 b1301
4 57 d550

Separate Comma Delimited Cells To New Rows

Here's an approach that should work for you. I'm assuming that your three input vectors are in different objects. We are going to create a list of those inputs and write a function that process each object and returns them in the form of a data.frame with plyr.

The things to take note of here are the splitting of the character vector into it's component parts, then using as.numeric to convert the numbers from the character form when they were split. Since R fills matrices by column, we define a 2 column matrix and let R fill the values for us. We then retrieve the Name column and put it all together in a data.frame. plyr is nice enough to process the list and convert it into a data.frame for us automatically.

library(plyr)

a <- paste("A",1, 2,3,4,5,6, sep = ",", collapse = "")
b <- paste("B",1, 2,4,5, sep = ",", collapse = "")
c <- paste("C",1, 2,3,4,6,7,8,9, sep = ",", collapse = "")

input <- list(a,b,c)

splitter <- function(x) {
x <- unlist(strsplit(x, ","))
out <- data.frame(x[1], matrix(as.numeric(x[-1]), ncol = 2))
colnames(out) <- c("Name", "Start", "End")
return(out)
}


ldply(input, splitter)

And the output:

> ldply(input, splitter)
Name Start End
1 A 1 4
2 A 2 5
3 A 3 6
4 B 1 4
5 B 2 5
6 C 1 6
7 C 2 7
8 C 3 8
9 C 4 9

Split multiple columns into rows

Here's a simple base R approach in two steps:

1) split the columns:

x <- lapply(d[-1], strsplit, "|", fixed = TRUE)

2) expand and combine:

d2 <- setNames(do.call(rbind, Map(expand.grid, d$id, x$name, x$score)), names(d)) 

The result is then:

#    id name score
#1 22 c e
#2 22 e e
#3 565 m k
#4 565 q k
#5 565 m e
#6 565 q e
#7 893 w e
#8 893 w k
#9 893 w e
#10 415 w e
#11 415 s e
#12 415 e e
#13 415 w o
#14 415 s o
#15 415 e o


Related Topics



Leave a reply



Submit