Proper/Fastest Way to Reshape a Data.Table

The data.table package implements faster melt/dcast functions (in C). It also adds features such as melting and casting multiple columns at once. See the new Efficient reshaping using data.tables vignette on GitHub.

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

  • There is no need to load the reshape2 package prior to casting. If you want it loaded for other operations, load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(); just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

    • use the rowid() function directly in the formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.

Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?
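To illustrate the multiple value.var / multiple fun.aggregate features listed above, here is a minimal sketch on made-up toy data (DT, id, grp, v1, v2 are all hypothetical names, not from any question here):

```r
library(data.table) # v1.9.6+

# Hypothetical toy data, just for illustration
DT <- data.table(id  = c(1, 1, 2, 2),
                 grp = c("a", "b", "a", "b"),
                 v1  = 1:4, v2 = 5:8)

# Multiple value.var: one set of wide columns per value column
dcast(DT, id ~ grp, value.var = c("v1", "v2"))
#    id v1_a v1_b v2_a v2_b
# 1:  1    1    2    5    6
# 2:  2    3    4    7    8

# Multiple fun.aggregate: one set of wide columns per function
dcast(DT, id ~ grp, value.var = "v1", fun.aggregate = list(sum, length))
```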

Fastest way to reshape variable values as columns

There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:

df <- read.table(text = "PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header = TRUE)

library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))

dt[, list(TRAUMA    = sum(PrimaryConditionGroup == "TRAUMA"),
          PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
          SEIZURE   = sum(PrimaryConditionGroup == "SEIZURE")),
   by = list(PatientID, Year)]

#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0

EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that the aggregated column comes back as a matrix rather than as separate data.frame columns; the second line below fixes that up.)

out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))

2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.

library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data = mdf, fun.aggregate = length)

Reshape long structured data.table into a wide structure using data.table functionality?

I'll pick an example with unequal groups so that it's easier to illustrate for the general case:

A <- data.table(x = c(1,1,1,2,2), y = c(1,2,3,1,2), v = (1:5)/5)
A
   x y   v
1: 1 1 0.2
2: 1 2 0.4
3: 1 3 0.6
4: 2 1 0.8
5: 2 2 1.0

The first step is to get the number of elements/entries for each group of "x" to be the same. Here, for x=1 there are 3 values of y, but only 2 for x=2. So, we'll have to fix that first with NA for x=2, y=3.

setkey(A, x, y)
A[CJ(unique(x), unique(y))]
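Running the cross join makes the padding explicit; the missing (x = 2, y = 3) combination comes back as NA (self-contained rerun of the toy data above):

```r
library(data.table)

# Same toy data as above
A <- data.table(x = c(1,1,1,2,2), y = c(1,2,3,1,2), v = (1:5)/5)
setkey(A, x, y)

# Join A to all key combinations; missing ones are padded with NA
A[CJ(unique(x), unique(y))]
#    x y   v
# 1: 1 1 0.2
# 2: 1 2 0.4
# 3: 1 3 0.6
# 4: 2 1 0.8
# 5: 2 2 1.0
# 6: 2 3  NA
```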

Now, to get it to wide format, we should group by "x" and use as.list on v as follows:

out <- A[CJ(unique(x), unique(y))][, as.list(v), by = x]
out
   x  V1  V2  V3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0  NA

Now, you can set the names of the reshaped columns by reference with setnames as follows:

setnames(out, c("x", as.character(unique(A$y))))

   x   1   2   3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0  NA

R reshape() extremely slow

If you profile what reshape calls with the profvis package, you can see that almost all of the total time is spent on this one line inside the function. The interaction function is used only to combine your two id columns into a single key column.

data[, tempidname] <- interaction(data[, idvar], drop = TRUE)

Rather than interaction, you could use do.call(paste0, data[, idvar]). You can wrap reshape in a function whose environment redefines interaction as this faster alternative.

new_reshape <- function(...) {
  interaction <- function(x, drop) do.call(paste0, x)
  environment(reshape) <- environment()
  reshape(...)
}
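As a quick sanity check on the substitution (toy id columns, not from the question), both constructions produce one distinct key per id combination; only the separator differs:

```r
# Hypothetical two-column id, just for illustration
ids <- data.frame(a = c("x", "x", "y"), b = c(1, 2, 1))

as.character(interaction(ids, drop = TRUE))  # "x.1" "x.2" "y.1"
do.call(paste0, ids)                         # "x1"  "x2"  "y1"
```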

Now it's much faster

system.time(dfWide <- reshape(data = dfLong,
                              v.names = c("Date", "Score"),
                              timevar = "Key",
                              idvar = c("Index1", "Index2"),
                              sep = "_Q",
                              direction = "wide"))
#   user  system elapsed
# 35.292   0.538  36.236

system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))
#   user  system elapsed
#  0.071   0.009   0.081

all.equal(new_dfWide, dfWide)
# [1] TRUE

You can be even faster than that by using plyr:::ninteraction. The only non-base dependencies of this function are plyr:::id_var and the null-default operator %||% (part of base R since 4.4), which themselves have no dependencies, meaning if you can't install packages you can just copy-paste these definitions pretty easily (adding a comment giving credit).

new_reshape <- function(...) {
  # On R < 4.4, also define the null-default operator:
  # `%||%` <- function(a, b) if (is.null(a)) b else a

  # interaction = plyr:::ninteraction
  interaction <- function(.variables, drop = FALSE) {
    lengths <- vapply(.variables, length, integer(1))
    .variables <- .variables[lengths != 0]
    if (length(.variables) == 0) {
      n <- nrow(.variables) %||% 0L
      return(structure(seq_len(n), n = n))
    }
    if (length(.variables) == 1) {
      return(id_var(.variables[[1]], drop = drop))
    }
    ids <- rev(lapply(.variables, id_var, drop = drop))
    p <- length(ids)
    ndistinct <- vapply(ids, attr, "n", FUN.VALUE = numeric(1),
                        USE.NAMES = FALSE)
    n <- prod(ndistinct)
    if (n > 2^31) {
      char_id <- do.call("paste", c(ids, sep = "\r"))
      res <- match(char_id, unique(char_id))
    }
    else {
      combs <- c(1, cumprod(ndistinct[-p]))
      mat <- do.call("cbind", ids)
      res <- c((mat - 1L) %*% combs + 1L)
    }
    attr(res, "n") <- n
    if (drop) {
      id_var(res, drop = TRUE)
    }
    else {
      structure(as.integer(res), n = attr(res, "n"))
    }
  }
  # id_var = plyr:::id_var
  id_var <- function(x, drop = FALSE) {
    if (length(x) == 0)
      return(structure(integer(), n = 0L))
    if (!is.null(attr(x, "n")) && !drop)
      return(x)
    if (is.factor(x) && !drop) {
      x <- addNA(x, ifany = TRUE)
      id <- as.integer(x)
      n <- length(levels(x))
    }
    else {
      levels <- sort(unique(x), na.last = TRUE)
      id <- match(x, levels)
      n <- max(id)
    }
    structure(id, n = n)
  }
  environment(reshape) <- environment()
  reshape(...)
}
system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))
#   user  system elapsed
#  0.015   0.000   0.015

Undo reshape to original data.table format

The development version of data.table, v1.9.5, has new features for the melt and dcast data.table methods. Check the Efficient reshaping using data.tables vignette for more details, and the Getting started wiki for other vignettes.

Long-to-Wide - dcast accepts multiple columns:

require(data.table) # v1.9.5+
DT.c = dcast(DT, Abb ~ input, value.var = c('AT_Car', 'BE_Car'))
#    Abb AT_Car_glass AT_Car_iron BE_Car_glass BE_Car_iron
# 1:  AT            5           5            3           0
# 2:  BE            5           0            2           2

Wide-to-Long - melt also accepts a list for measure.vars:

DT.m = melt(DT.c, id = "Abb", measure.vars = patterns("^AT_Car", "^BE_Car"),
            variable.name = "input", value.name = c("AT_Car", "BE_Car"))
setattr(DT.m$input, 'levels', c("glass", "iron"))
#    Abb input AT_Car BE_Car
# 1:  AT glass      5      3
# 2:  BE glass      5      2
# 3:  AT  iron      5      0
# 4:  BE  iron      0      2

Reshape wide data to long with multiple rows using data.table

This can be solved using data.table's rowid() function:

library(data.table)
m <- melt(tdat, id.vars="am")
dcast(m, variable + rowid(am) ~ am)[, am := NULL][]
    variable          0          1
 1:     qsec 18.1831600 17.3600000
 2:     qsec 17.8200000 17.0200000
 3:       vs  0.3684211  0.5384615
 4:       vs  0.0000000  1.0000000
 5:       am  0.0000000  1.0000000
 6:       am  0.0000000  1.0000000
 7:     gear  3.2105260  4.3846150
 8:     gear  3.0000000  4.0000000
 9:     carb  2.7368420  2.9230770
10:     carb  3.0000000  2.0000000
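The role of rowid() in the formula above is to number rows within each group, which is what lets dcast place repeated values on separate rows instead of aggregating them. A minimal sketch on toy values:

```r
library(data.table)

# rowid() gives a running count within each distinct value
rowid(c("a", "a", "b", "a"))
# [1] 1 2 1 3
```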

Data

library(data.table)
tdat <- fread(
"# i am qsec vs am gear carb
# 1: 1 17.36000 0.5384615 1 4.384615 2.923077
# 2: 1 17.02000 1.0000000 1 4.000000 2.000000
# 3: 0 18.18316 0.3684211 0 3.210526 2.736842
# 4: 0 17.82000 0.0000000 0 3.000000 3.000000",
drop = 1:2, colClasses = list(integer = c(3, 6))
)

Alternatively, the sample dataset can be produced in a more concise way without doubling the am column:

setDT(mtcars[7:11])[, lapply(.SD, function(y) c(mean(y), median(y))), by = am]
   am     qsec        vs     gear     carb
1: 1 17.36000 0.5384615 4.384615 2.923077
2: 1 17.02000 1.0000000 4.000000 2.000000
3: 0 18.18316 0.3684211 3.210526 2.736842
4: 0 17.82000 0.0000000 3.000000 3.000000

