What Methods to Use to Reshape Very Large Data Sets

How to reshape data from long to wide format

Using the reshape function:

reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
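
As a minimal sketch of what this call does, the example below builds a small long-format data frame and reshapes it. The contents of dat1 are hypothetical (the original data is not shown); only the column names name and numbers come from the call above.

```r
# Hypothetical long-format data: one row per (name, numbers) pair
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, times = 2),
  value   = c(0.34, -0.70, -0.06, 0.21, -0.90, 0.15, 1.09, -0.32)
)

# One row per name; one value.<numbers> column per distinct value of numbers
wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
names(wide)
# [1] "name"    "value.1" "value.2" "value.3" "value.4"
```

Every column that is neither idvar nor timevar is treated as a measured variable and gets one wide column per time point.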

R reshape() extremely slow

If you profile reshape with the profvis package, you can see that almost all of the total time is spent on this one line of the function. The interaction function is used there only to combine your two id columns into a single column.

data[, tempidname] <- interaction(data[, idvar], drop = TRUE)

Rather than interaction, you could use do.call(paste0, data[, idvar]). You can use a function to create an environment with interaction equal to this faster function.

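new_reshape <- function(...) {
  # Shadow interaction() with a faster stand-in that pastes the id columns together
  interaction <- function(x, drop) do.call(paste0, x)
  # Give a local copy of reshape this environment so it finds the stand-in first
  environment(reshape) <- environment()
  reshape(...)
}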
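
One caveat, shown here as a hedged sketch rather than as part of the original answer: pasting id columns together with no separator can merge distinct id pairs into the same key. The ids data frame, the dfLong values, and the name new_reshape_sep below are hypothetical; the variant simply swaps paste0 for paste with a separator that is unlikely to appear in the data.

```r
# Two distinct id pairs that collapse to the same key without a separator
ids <- data.frame(Index1 = c("1", "11"), Index2 = c("12", "2"),
                  stringsAsFactors = FALSE)
key_plain <- do.call(paste0, ids)                 # "112" "112"
key_sep   <- do.call(paste, c(ids, sep = "\r"))   # two different keys

# Hypothetical variant of the wrapper that keeps such pairs apart
new_reshape_sep <- function(...) {
  interaction <- function(x, drop) do.call(paste, c(x, sep = "\r"))
  environment(reshape) <- environment()
  reshape(...)
}

dfLong <- data.frame(Index1 = c("1", "1", "11", "11"),
                     Index2 = c("12", "12", "2", "2"),
                     Key    = c(1, 2, 1, 2),
                     Score  = c(10, 20, 30, 40),
                     stringsAsFactors = FALSE)
wide_sep <- new_reshape_sep(data = dfLong, v.names = "Score", timevar = "Key",
                            idvar = c("Index1", "Index2"), direction = "wide")
nrow(wide_sep)  # one row per distinct (Index1, Index2) pair
```

If the id columns can never produce ambiguous concatenations (for example, fixed-width codes), the plain paste0 version is fine as written.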

Now it's much faster:

system.time(dfWide <- reshape(data = dfLong,
                              v.names = c("Date", "Score"),
                              timevar = "Key",
                              idvar = c("Index1", "Index2"),
                              sep = "_Q",
                              direction = "wide"))
#    user  system elapsed
#  35.292   0.538  36.236

system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))

#    user  system elapsed
#   0.071   0.009   0.081

all.equal(new_dfWide, dfWide)
# [1] TRUE
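
The timings above depend on dfLong, which is not shown. To try the comparison yourself, a data set of the same shape can be sketched as follows. The column names are taken from the reshape calls above; the sizes, values, and the n_ids variable are assumptions made purely for illustration, so absolute timings will differ.

```r
set.seed(1)
n_ids <- 2000  # assumed number of (Index1, Index2) pairs; increase to see the slowdown

# Four quarterly observations (Key = 1..4) per id pair
dfLong <- data.frame(
  Index1 = rep(seq_len(n_ids), each = 4),
  Index2 = rep(sample(letters, n_ids, replace = TRUE), each = 4),
  Key    = rep(1:4, times = n_ids),
  Date   = rep(as.Date("2020-01-01") + 0:3, times = n_ids),
  Score  = rnorm(4 * n_ids)
)

system.time(dfWide <- reshape(data = dfLong,
                              v.names = c("Date", "Score"),
                              timevar = "Key",
                              idvar = c("Index1", "Index2"),
                              sep = "_Q",
                              direction = "wide"))
dim(dfWide)  # n_ids rows; 2 id columns plus Date_Q* and Score_Q* for each Key
```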

You can be even faster than that by using plyr:::ninteraction. Its only non-base dependencies are plyr:::id_var, which itself has no dependencies, and the %||% operator (a plyr internal that base R provides only from version 4.4.0, and which is reached only when the id columns are empty). So if you can't install packages, you can copy-paste the function definitions fairly easily (adding a comment giving credit).

new_reshape <- function(...) {
  # interaction = plyr:::ninteraction
  # id_var = plyr:::id_var
  interaction <- function(.variables, drop = FALSE) {
    lengths <- vapply(.variables, length, integer(1))
    .variables <- .variables[lengths != 0]
    if (length(.variables) == 0) {
      n <- nrow(.variables) %||% 0L
      return(structure(seq_len(n), n = n))
    }
    if (length(.variables) == 1) {
      return(id_var(.variables[[1]], drop = drop))
    }
    ids <- rev(lapply(.variables, id_var, drop = drop))
    p <- length(ids)
    ndistinct <- vapply(ids, attr, "n", FUN.VALUE = numeric(1),
                        USE.NAMES = FALSE)
    n <- prod(ndistinct)
    if (n > 2^31) {
      char_id <- do.call("paste", c(ids, sep = "\r"))
      res <- match(char_id, unique(char_id))
    } else {
      combs <- c(1, cumprod(ndistinct[-p]))
      mat <- do.call("cbind", ids)
      res <- c((mat - 1L) %*% combs + 1L)
    }
    attr(res, "n") <- n
    if (drop) {
      id_var(res, drop = TRUE)
    } else {
      structure(as.integer(res), n = attr(res, "n"))
    }
  }
  id_var <- function(x, drop = FALSE) {
    if (length(x) == 0)
      return(structure(integer(), n = 0L))
    if (!is.null(attr(x, "n")) && !drop)
      return(x)
    if (is.factor(x) && !drop) {
      x <- addNA(x, ifany = TRUE)
      id <- as.integer(x)
      n <- length(levels(x))
    } else {
      levels <- sort(unique(x), na.last = TRUE)
      id <- match(x, levels)
      n <- max(id)
    }
    structure(id, n = n)
  }
  environment(reshape) <- environment()
  reshape(...)
}
system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))

#    user  system elapsed
#   0.015   0.000   0.015

Functions for creating and reshaping big data in R using the FF package

No reshape method exists explicitly for ffdf objects, but the same result is straightforward to get with package ffbase: use ffdfdply, split by Subject, and apply reshape inside the function.

An example on the Indometh dataset with 1,000,000 subjects:

require(ffbase)
require(datasets)
data(Indometh)

## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n = nrow(x), rfun = rnorm)
dim(x)
# [1] 11000000        3

## and reshape to wide format
result <- ffdfdply(x = x, split = x$Subject, FUN = function(datawithseveralsplitelements) {
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
# [1] "ffdf"
colnames(result)
# [1] "Subject"   "conc.0.25" "conc.0.5"  "conc.0.75" "conc.1"    "conc.1.25"
# [7] "conc.2"    "conc.3"    "conc.4"    "conc.5"    "conc.6"    "conc.8"
dim(result)
# [1] 1000000      12
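
The split-apply-combine idea behind the ffdfdply call can be illustrated in plain base R. This is only an in-memory analogy under assumed toy data (10 subjects, 4 time points, chunks of 3 subjects), not how ffbase is implemented: whole subjects are kept together in each chunk, each chunk is reshaped on its own, and the pieces are stacked.

```r
set.seed(42)
times <- c(0.25, 0.5, 1, 2)                      # assumed time points
long  <- expand.grid(time = times, Subject = factor(1:10))
long$conc <- rnorm(nrow(long))

# Assign whole subjects to chunks so no subject is split across chunks
chunk_id <- (as.integer(long$Subject) - 1L) %/% 3L

# Reshape each chunk independently, then stack the wide pieces
pieces <- lapply(split(long, chunk_id), function(chunk) {
  reshape(chunk, v.names = "conc", idvar = "Subject",
          timevar = "time", direction = "wide")
})
result <- do.call(rbind, pieces)
dim(result)  # one row per subject; Subject plus one conc column per time point
```

Stacking works here because every chunk contains every time point, so each piece has the same columns. With ragged data the wide pieces would need their columns aligned before combining.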

Reshape DF from long to wide in R using Reshape2 without an aggregation function

We can use dcast from data.table, which accepts multiple value.var columns. Convert the data.frame to a data.table with setDT(df), then call dcast with the formula and the value.var columns specified.

library(data.table)
dcast(setDT(df), id~gid, value.var=names(df)[2:6])

NOTE: The data.table method is generally faster than the reshape2 one.
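
A minimal sketch of the same call on toy data, assuming the data.table package is installed. The data frame below is hypothetical; only the id and gid column names come from the call above, and x and y stand in for the value columns.

```r
library(data.table)

# Hypothetical long data: two value columns measured for each (id, gid) pair
df <- data.frame(
  id  = c(1, 1, 2, 2),
  gid = c("a", "b", "a", "b"),
  x   = c(10, 20, 30, 40),
  y   = c(0.1, 0.2, 0.3, 0.4)
)

# One row per id; one output column per value.var / gid combination
wide <- dcast(setDT(df), id ~ gid, value.var = c("x", "y"))
dim(wide)
```

Because every (id, gid) pair occurs exactly once here, no aggregation function is needed; with duplicate pairs, dcast falls back to counting rows unless fun.aggregate is supplied.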

dcast for huge dataframe [R]

The easy solution in this case turned out to be switching back to the old reshape package, which means using cast instead of dcast. Arun's comments are very useful, provided one can actually update.

Efficient way to reshape (alternately) thousands of data

It takes so long because the out matrix is not preallocated.

For example this completed in about 1 second on my laptop:

in=rand(71760,320);
m=240; n=320;
[R,C]=size(in);
R_out=R/m;

out=zeros(m*C,R_out);
for k=1:m %from row 1 to nth row
    for i=1:C %reshape every column of nth row
        out(i+C*(k-1),:) = in(k:m:end,i)';
    end
end

Alternative method

A better practice is a vectorized-style approach with arrayfun, which can be written in a single line:

out=cell2mat(arrayfun(@(k) in(k:m:end,:)', 1:m,'uniformoutput',0)');

This also runs more quickly.

Fastest way to reshape variable values as columns

There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:

df <- read.table(text="PatientID Year  PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
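3 Y1 TRAUMA", header=TRUE, stringsAsFactors=TRUE)
# stringsAsFactors=TRUE restores the pre-R-4.0 default that the aggregate() example below relies on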

library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))

dt[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]

#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0

EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that, when PrimaryConditionGroup is a factor, aggregate returns a data.frame whose third column is itself a matrix of counts; the second line below fixes that up.)

out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
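
To make the matrix-column point concrete, here is a small sketch on the same four rows, with PrimaryConditionGroup built explicitly as a factor. It also shows do.call(data.frame, ...) as an alternative way to flatten the result; that alternative is a suggestion for illustration, not part of the original answer.

```r
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year      = factor(c("Y1", "Y1", "Y2", "Y1")),
  PrimaryConditionGroup = factor(c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"))
)

out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data = df, FUN = table)

# Three top-level columns, the last of which is a matrix with one column per level
ncol(out)
is.matrix(out$PrimaryConditionGroup)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, out)
dim(flat)
```

The factor matters: table() on a factor returns a count for every level, so each group yields a vector of the same length and aggregate can bind them into a matrix.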

2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.

library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)

