How to reshape data from long to wide format
Using the reshape function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
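To make the call above concrete, here is a minimal sketch. The contents of dat1 are made up for illustration; only the column names name and numbers come from the call itself.

```r
# Hypothetical long-format data: one row per (name, numbers) pair.
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, times = 2),
  value   = c(0.34, -0.70, -0.06, -0.42, -0.90, 0.18, -0.53, -0.76)
)

# One row per name; each level of numbers becomes its own value.* column.
wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
names(wide)
```

Because v.names is not given, every column other than idvar and timevar is spread out, so the result here has the columns name, value.1, value.2, value.3 and value.4.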
R reshape() extremely slow
If you profile reshape with the profvis package, you can see that almost all of the total time is spent on the one line of the function shown below. The interaction function is used there only to combine your two id columns into a single column.
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
Rather than interaction, you could use do.call(paste0, data[, idvar]). You can use a wrapper function to create an environment in which interaction is bound to this faster function.
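new_reshape <- function(...){
  # Local replacement that masks base::interaction inside reshape()
  interaction <- function(x, drop) do.call(paste0, x)
  # Make a local copy of reshape() that looks up names in this environment first
  environment(reshape) <- environment()
  reshape(...)
}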
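One caveat with the paste0 replacement: without a separator, distinct id combinations can collapse to the same key. The sketch below shows the collision and a hypothetical variant, safe_reshape (not part of the original answer), that joins the id columns with a separator assumed not to occur in the data.

```r
# paste0() without a separator can map different id pairs to the same key.
paste0(1, 12) == paste0(11, 2)                        # both keys are "112"
paste(1, 12, sep = "\r") == paste(11, 2, sep = "\r")  # keys stay distinct

# Hypothetical variant of the wrapper above; "\r" is an assumed-safe separator.
safe_reshape <- function(...) {
  interaction <- function(x, drop) {
    do.call(paste, c(unname(as.list(x)), sep = "\r"))
  }
  environment(reshape) <- environment()
  reshape(...)
}

# Small made-up example where paste0() would merge the two id pairs.
long <- data.frame(a = c(1, 1, 11, 11), b = c(12, 12, 2, 2),
                   t = c(1, 2, 1, 2), v = 1:4)
wide <- safe_reshape(long, idvar = c("a", "b"), timevar = "t",
                     v.names = "v", direction = "wide")
nrow(wide)
```

With the separator in place the two id pairs stay separate, so wide has one row for each of them.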
Now it's much faster:
system.time(dfWide <- reshape(data = dfLong,
                              v.names = c("Date", "Score"),
                              timevar = "Key",
                              idvar = c("Index1", "Index2"),
                              sep = "_Q",
                              direction = "wide"))
# user system elapsed
# 35.292 0.538 36.236
system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))
# user system elapsed
# 0.071 0.009 0.081
all.equal(new_dfWide, dfWide)
# [1] TRUE
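The dfLong used for these timings is not shown. To try the comparison yourself, the sketch below builds a small stand-in; the column names match the calls above, but the sizes and values are invented, and the ids are fixed-width strings so that the paste0 keys stay unambiguous.

```r
# Hypothetical data: 1000 (Index1, Index2) pairs, each observed at 4 keys.
set.seed(1)
dfLong <- expand.grid(
  Index1 = sprintf("A%03d", 1:50),
  Index2 = sprintf("B%02d", 1:20),
  Key = 1:4,
  stringsAsFactors = FALSE
)
dfLong$Date  <- as.Date("2020-01-01") + sample(0:365, nrow(dfLong), replace = TRUE)
dfLong$Score <- round(rnorm(nrow(dfLong)), 2)

# Wrapper repeated from above so that this block runs on its own.
new_reshape <- function(...) {
  interaction <- function(x, drop) do.call(paste0, x)
  environment(reshape) <- environment()
  reshape(...)
}

# Run both versions with identical arguments and compare the results.
args <- list(data = dfLong, v.names = c("Date", "Score"), timevar = "Key",
             idvar = c("Index1", "Index2"), sep = "_Q", direction = "wide")
dfWide     <- do.call(reshape, args)
new_dfWide <- do.call(new_reshape, args)
all.equal(new_dfWide, dfWide)
```

At this size both versions finish quickly; increase the number of distinct ids to see the gap that system.time reports above.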
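You can be even faster than that by using plyr:::ninteraction. Its only dependency outside base R is plyr:::id_var, which itself has no dependencies, so if you can't install packages you can simply copy and paste both function definitions (adding a comment giving credit). Note that the copy below also calls %||%, which is part of base R only from version 4.4.0; on older versions you would need to define it yourself, e.g. as function(a, b) if (is.null(a)) b else a.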
new_reshape <- function(...){
  # interaction = plyr:::ninteraction
  # id_var = plyr:::id_var
  interaction <-
    function (.variables, drop = FALSE)
    {
      lengths <- vapply(.variables, length, integer(1))
      .variables <- .variables[lengths != 0]
      if (length(.variables) == 0) {
        n <- nrow(.variables) %||% 0L
        return(structure(seq_len(n), n = n))
      }
      if (length(.variables) == 1) {
        return(id_var(.variables[[1]], drop = drop))
      }
      ids <- rev(lapply(.variables, id_var, drop = drop))
      p <- length(ids)
      ndistinct <- vapply(ids, attr, "n", FUN.VALUE = numeric(1),
                          USE.NAMES = FALSE)
      n <- prod(ndistinct)
      if (n > 2^31) {
        char_id <- do.call("paste", c(ids, sep = "\r"))
        res <- match(char_id, unique(char_id))
      }
      else {
        combs <- c(1, cumprod(ndistinct[-p]))
        mat <- do.call("cbind", ids)
        res <- c((mat - 1L) %*% combs + 1L)
      }
      attr(res, "n") <- n
      if (drop) {
        id_var(res, drop = TRUE)
      }
      else {
        structure(as.integer(res), n = attr(res, "n"))
      }
    }
  id_var <-
    function (x, drop = FALSE)
    {
      if (length(x) == 0)
        return(structure(integer(), n = 0L))
      if (!is.null(attr(x, "n")) && !drop)
        return(x)
      if (is.factor(x) && !drop) {
        x <- addNA(x, ifany = TRUE)
        id <- as.integer(x)
        n <- length(levels(x))
      }
      else {
        levels <- sort(unique(x), na.last = TRUE)
        id <- match(x, levels)
        n <- max(id)
      }
      structure(id, n = n)
    }
  environment(reshape) <- environment()
  reshape(...)
}
system.time(new_dfWide <- new_reshape(data = dfLong,
                                      v.names = c("Date", "Score"),
                                      timevar = "Key",
                                      idvar = c("Index1", "Index2"),
                                      sep = "_Q",
                                      direction = "wide"))
# user system elapsed
# 0.015 0.000 0.015
Functions for creating and reshaping big data in R using the FF package
The reshape function does not explicitly exist for ffdf objects, but it is quite straightforward to apply with functionality from the ffbase package: use ffdfdply, split by Subject, and apply reshape inside the function.
An example on the Indometh dataset with 1000000 subjects.
require(ffbase)
require(datasets)
data(Indometh)
## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
[1] 11000000 3
## and reshape to wide format
result <- ffdfdply(x=x, split=x$Subject, FUN=function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
[1] "ffdf"
colnames(result)
[1] "Subject" "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
dim(result)
[1] 1000000 12
Reshape DF from long to wide in R using Reshape2 without an aggregation function
We can use dcast from data.table, which can take multiple value.var columns. Convert the 'data.frame' to a 'data.table' with setDT(df), then call dcast with the formula and value.var specified.
library(data.table)
dcast(setDT(df), id~gid, value.var=names(df)[2:6])
NOTE: The data.table method would be faster than the reshape2 one.
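As a small illustration of passing several value.var columns at once, here is a sketch with made-up data. The column names x and y are hypothetical and stand in for columns 2 to 6 of the df in the answer.

```r
library(data.table)

# Hypothetical input: two value columns (x, y) per id/gid combination.
df <- data.frame(
  id  = c(1, 1, 2, 2),
  gid = c(1, 2, 1, 2),
  x   = c(10, 11, 12, 13),
  y   = c("a", "b", "c", "d")
)

# Each value column is spread across the gid levels; no aggregation is needed
# because every (id, gid) pair occurs exactly once.
wide <- dcast(setDT(df), id ~ gid, value.var = c("x", "y"))
names(wide)
```

The resulting columns are named by combining each value column with the gid level, giving id, x_1, x_2, y_1 and y_2.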
dcast for huge dataframe [R]
The easy solution in this case turned out to be switching back to the old reshape package, which means using cast instead of dcast. Arun's comments are highly useful, provided one can actually update.
Efficient way to reshape (alternately) thousands of data
It takes so long because the out matrix is not preallocated, so it has to grow as the loop runs. Preallocating it fixes this.
For example this completed in about 1 second on my laptop:
in=rand(71760,320);
m=240; n=320;
[R,C]=size(in);
R_out=R/m;
out=zeros(m*C,R_out);  % preallocate the output
for k=1:m      % for each row offset k = 1..m
    for i=1:C  % reshape every column at that offset
        out(i+C*(k-1),:) = in(k:m:end,i)';
    end
end
Alternative method
A more compact approach uses arrayfun, which does the whole thing in a single line:
out=cell2mat(arrayfun(@(k) in(k:m:end,:)', 1:m,'uniformoutput',0)');
This also runs more quickly.
Fastest way to reshape variable values as columns
There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=TRUE)
library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))
dt[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
# PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,] 1 Y1 1 1 0
# [2,] 2 Y2 0 0 1
# [3,] 3 Y1 1 0 0
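The call above names each condition by hand. When the set of conditions is not known in advance, one possible alternative (a sketch, not part of the original answer) is data.table's dcast with a counting aggregate, which creates one column per distinct value:

```r
library(data.table)

# Same four example rows as above, built directly as a data.table.
dt <- data.table(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA")
)

# Count the rows for every (PatientID, Year) and condition combination.
wide <- dcast(dt, PatientID + Year ~ PrimaryConditionGroup,
              value.var = "PrimaryConditionGroup", fun.aggregate = length)
wide
```

The condition columns are generated from the data rather than typed out, so a new condition appearing in the input gets its own column automatically.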
EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that aggregate returns the counts as a single matrix column inside the data.frame, rather than as separate columns; the second line below fixes that up.)
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)