Proper/fastest way to reshape a data.table
The data.table package implements faster melt/dcast functions (in C). It also adds extra features, such as the ability to melt and cast multiple columns. See the new vignette, Efficient reshaping using data.tables, on GitHub.
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(); just use dcast().
- melt:
  - is capable of melting columns of type 'list'.
  - gains variable.factor and value.factor, which default to TRUE and FALSE respectively for compatibility with reshape2. They directly control whether the variable and value columns are returned as factors or not.
  - melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting, and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate functions and multiple value.var columns. See the vignette or ?dcast.
  - the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify rows uniquely. See ?dcast.
Old benchmarks:
- melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
- dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
Reminder of the Cologne (Dec 2013) presentation, slide 32: Why not submit a dcast pull request to reshape2?
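A minimal sketch of those melt options on a toy table (the data here is invented for illustration; assumes data.table v1.9.0+):

```r
library(data.table)
DT <- data.table(id = 1:2, a = c(1, NA), b = c(3, 4))

# na.rm = TRUE drops NAs during the melt itself, not in a separate pass
melt(DT, id.vars = "id", na.rm = TRUE)
#    id variable value
# 1:  1        a     1
# 2:  1        b     3
# 3:  2        b     4

# variable.factor = FALSE returns 'variable' as character rather than factor
is.factor(melt(DT, id.vars = "id", variable.factor = FALSE)$variable)
# [1] FALSE
```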
Fastest way to reshape variable values as columns
There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=T)
library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))
dt[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
by = list(PatientID, Year)]
# PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,] 1 Y1 1 1 0
# [2,] 2 Y2 0 0 1
# [3,] 3 Y1 1 0 0
EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that aggregate() returns the counts as a matrix column rather than as separate data.frame columns; the second line below fixes that up.)
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)
Reshape long structured data.table into a wide structure using data.table functionality?
I'll pick an example with unequal groups so that it's easier to illustrate for the general case:
A <- data.table(x=c(1,1,1,2,2), y=c(1,2,3,1,2), v=(1:5)/5)
> A
x y v
1: 1 1 0.2
2: 1 2 0.4
3: 1 3 0.6
4: 2 1 0.8
5: 2 2 1.0
The first step is to get the number of elements/entries for each group of "x" to be the same. Here, for x=1 there are 3 values of y, but only 2 for x=2. So, we'll have to fix that first with NA for x=2, y=3.
setkey(A, x, y)
A[CJ(unique(x), unique(y))]
Now, to get it to wide format, we group by "x" and use as.list on v as follows:
out <- A[CJ(unique(x), unique(y))][, as.list(v), by=x]
x V1 V2 V3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0 NA
Now, you can set the names of the reshaped columns using reference with setnames
as follows:
setnames(out, c("x", as.character(unique(A$y))))
x 1 2 3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0 NA
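For completeness, current versions of data.table can produce the same wide table (including the NA fill and column names) in a single dcast call; a sketch using the same A as above:

```r
dcast(A, x ~ y, value.var = "v")
#    x   1   2   3
# 1: 1 0.2 0.4 0.6
# 2: 2 0.8 1.0  NA
```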
R reshape() extremely slow
If you look at what functions reshape calls using the profvis package, you can see that almost all of the total time is spent on this one line in the function. The interaction function is used only to combine your two id columns into a single column.
data[, tempidname] <- interaction(data[, idvar],
drop = TRUE)
Rather than interaction, you could use do.call(paste0, data[, idvar]). You can use a wrapper function to create an environment in which interaction is replaced by this faster function.
new_reshape <- function(...){
interaction <- function(x, drop) do.call(paste0, x)
environment(reshape) <- environment()
reshape(...)
}
Now it's much faster:
system.time(dfWide <- reshape(data = dfLong,
v.names = c("Date", "Score"),
timevar = "Key",
idvar = c("Index1", "Index2"),
sep = "_Q",
direction = "wide"))
# user system elapsed
# 35.292 0.538 36.236
system.time(new_dfWide <- new_reshape(data = dfLong,
v.names = c("Date", "Score"),
timevar = "Key",
idvar = c("Index1", "Index2"),
sep = "_Q",
direction = "wide"))
# user system elapsed
# 0.071 0.009 0.081
all.equal(new_dfWide, dfWide)
# [1] TRUE
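The reason the substitution is safe: reshape only needs a unique key per id combination, and pasting the id columns row-wise yields one just as well as building a factor with interaction() does (though paste0 without a separator can in principle collide, e.g. ids "1","11" and "11","1" both give "111"). A small sketch with toy data:

```r
ids <- data.frame(a = c(1, 1, 2), b = c("x", "y", "x"))
interaction(ids, drop = TRUE)  # slow path: a factor with levels "1.x" "1.y" "2.x"
do.call(paste0, ids)           # fast path: plain strings "1x" "1y" "2x"
```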
You can be even faster than that by using plyr:::ninteraction. Its only non-base dependency is plyr:::id_var, which itself has no dependencies, meaning that if you can't install packages you can just copy-paste these function definitions pretty easily (adding a comment giving credit).
new_reshape <- function(...){
# interaction = plyr:::ninteraction
# id_var = plyr:::id_var
`%||%` <- function(a, b) if (is.null(a)) b else a  # null-default helper used below (not in base R before 4.4)
interaction <-
function (.variables, drop = FALSE)
{
lengths <- vapply(.variables, length, integer(1))
.variables <- .variables[lengths != 0]
if (length(.variables) == 0) {
n <- nrow(.variables) %||% 0L
return(structure(seq_len(n), n = n))
}
if (length(.variables) == 1) {
return(id_var(.variables[[1]], drop = drop))
}
ids <- rev(lapply(.variables, id_var, drop = drop))
p <- length(ids)
ndistinct <- vapply(ids, attr, "n", FUN.VALUE = numeric(1),
USE.NAMES = FALSE)
n <- prod(ndistinct)
if (n > 2^31) {
char_id <- do.call("paste", c(ids, sep = "\r"))
res <- match(char_id, unique(char_id))
}
else {
combs <- c(1, cumprod(ndistinct[-p]))
mat <- do.call("cbind", ids)
res <- c((mat - 1L) %*% combs + 1L)
}
attr(res, "n") <- n
if (drop) {
id_var(res, drop = TRUE)
}
else {
structure(as.integer(res), n = attr(res, "n"))
}
}
id_var <-
function (x, drop = FALSE)
{
if (length(x) == 0)
return(structure(integer(), n = 0L))
if (!is.null(attr(x, "n")) && !drop)
return(x)
if (is.factor(x) && !drop) {
x <- addNA(x, ifany = TRUE)
id <- as.integer(x)
n <- length(levels(x))
}
else {
levels <- sort(unique(x), na.last = TRUE)
id <- match(x, levels)
n <- max(id)
}
structure(id, n = n)
}
environment(reshape) <- environment()
reshape(...)
}
system.time(new_dfWide <- new_reshape(data = dfLong,
v.names = c("Date", "Score"),
timevar = "Key",
idvar = c("Index1", "Index2"),
sep = "_Q",
direction = "wide"))
# user system elapsed
# 0.015 0.000 0.015
Undo reshape to original data.table format
The development version of data.table, v1.9.5
has new features for melt
and dcast
data.table methods. Check the Efficient reshaping using data.tables for more details, and the Getting started Wiki for other vignettes.
Long-to-Wide - dcast accepts multiple value.var columns:
require(data.table) # v1.9.5+
DT.c = dcast(DT, Abb ~ input, value.var = c('AT_Car', 'BE_Car'))
# Abb AT_Car_glass AT_Car_iron BE_Car_glass BE_Car_iron
# 1: AT 5 5 3 0
# 2: BE 5 0 2 2
Wide-to-Long - melt also accepts a list for measure.vars:
DT.m = melt(DT.c, id = "Abb", measure.vars = patterns("^AT_Car", "^BE_Car"),
variable.name = "input", value.name = c("AT_Car", "BE_Car"))
setattr(DT.m$input, 'levels', c("glass", "iron"))
# Abb input AT_Car BE_Car
# 1: AT glass 5 3
# 2: BE glass 5 2
# 3: AT iron 5 0
# 4: BE iron 0 2
Reshape wide data to long with multiple rows using data.table
This can be solved using data.table's rowid() function:
library(data.table)
m <- melt(tdat, id.vars="am")
dcast(m, variable + rowid(am) ~ am)[, am := NULL][]
variable 0 1
1: qsec 18.1831600 17.3600000
2: qsec 17.8200000 17.0200000
3: vs 0.3684211 0.5384615
4: vs 0.0000000 1.0000000
5: am 0.0000000 1.0000000
6: am 0.0000000 1.0000000
7: gear 3.2105260 4.3846150
8: gear 3.0000000 4.0000000
9: carb 2.7368420 2.9230770
10: carb 3.0000000 2.0000000
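rowid() is what lets dcast distinguish the repeated rows per variable: it numbers rows within each group in order of appearance. A tiny standalone sketch:

```r
library(data.table)
rowid(c("a", "a", "b", "a"))
# [1] 1 2 1 3
```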
Data
library(data.table)
tdat <- fread(
"# i am qsec vs am gear carb
# 1: 1 17.36000 0.5384615 1 4.384615 2.923077
# 2: 1 17.02000 1.0000000 1 4.000000 2.000000
# 3: 0 18.18316 0.3684211 0 3.210526 2.736842
# 4: 0 17.82000 0.0000000 0 3.000000 3.000000",
drop = 1:2, colClasses = list(integer = c(3, 6))
)
Alternatively, the sample dataset can be produced in a more concise way without doubling the am
column:
setDT(mtcars[7:11])[, lapply(.SD, function(y) c(mean(y), median(y))), by = am]
am qsec vs gear carb
1: 1 17.36000 0.5384615 4.384615 2.923077
2: 1 17.02000 1.0000000 4.000000 2.000000
3: 0 18.18316 0.3684211 3.210526 2.736842
4: 0 17.82000 0.0000000 3.000000 3.000000