Why Can't One Have Several 'Value.Var' in 'Dcast'

Can the value.var in dcast be a list or have multiple value variables?

From v1.9.6 of data.table, we can cast multiple value.var columns simultaneously (and also use multiple aggregation functions in fun.aggregate). Please see ?dcast and the Efficient reshaping using data.tables vignette for more.

Here's how we could use dcast:

dcast(setDT(mydf), x1 ~ x2, value.var=c("salt", "sugar"))
# x1 salt_1 salt_2 salt_3 sugar_1 sugar_2 sugar_3
# 1: 1 3 4 6 1 2 2
# 2: 2 10 3 9 5 3 6
# 3: 3 10 7 7 4 6 7
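As a self-contained sketch of the same call: the table `toy` and its columns `id` and `grp` are made up for illustration, and only `salt` and `sugar` echo the example above. Each value column is spread across the levels of the right-hand-side variable, and the new columns are named `<value.var>_<level>`.

```r
library(data.table)

# Hypothetical toy data: one row per (id, grp) pair, two value columns
toy <- data.table(
  id    = c(1, 1, 2, 2),
  grp   = c("a", "b", "a", "b"),
  salt  = c(3, 4, 10, 3),
  sugar = c(1, 2, 5, 3)
)

# Cast both value columns at once by passing a character vector
wide <- dcast(toy, id ~ grp, value.var = c("salt", "sugar"))

names(wide)
# "id" "salt_a" "salt_b" "sugar_a" "sugar_b"
```

Because every (id, grp) combination occurs only once here, no aggregation function is needed.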

Why can't one have several `value.var` in `dcast`?

This question is very much related to your other question from earlier today.

@beginneR wrote in the comments that "As long as the existing data is already in long-format, I don't see any general need to melt it before casting." In my answer posted at your other question, I gave an example of when melt would be required, or rather, how to decide whether your data are long enough.

This question here is another example of when further melting would be required since point 3 in my answer is not satisfied.

To get the behavior you want, try the following:

library(MASS)      # provides the Cars93 data set
library(reshape2)  # melt() and dcast()
C93L <- melt(Cars93, measure.vars = c("Price", "Weight"))
dcast(C93L, AirBags ~ DriveTrain + variable, mean, value.var = "value")
# AirBags 4WD_Price 4WD_Weight Front_Price Front_Weight
# 1 Driver & Passenger NaN NaN 26.17273 3393.636
# 2 Driver only 21.38 3623 18.69286 2996.250
# 3 None 13.88 2987 12.98571 2703.036
# Rear_Price Rear_Weight
# 1 33.20 3515.0
# 2 28.23 3463.5
# 3 14.90 3610.0
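The same melt-then-cast idea can be sketched without Cars93. This version swaps in the built-in mtcars data and data.table's own melt and dcast rather than reshape2's, so the column choices (`am`, `vs`, `mpg`, `disp`) are assumptions for illustration only.

```r
library(data.table)

# Stack the two measure columns into a single "value" column,
# with "variable" recording which column each value came from
long <- melt(as.data.table(mtcars),
             id.vars = c("am", "vs"),
             measure.vars = c("mpg", "disp"))

# Now there is only one value column, so a single value.var is enough;
# putting "variable" on the right-hand side yields one output column
# per (vs, variable) combination
wide <- dcast(long, am ~ vs + variable,
              fun.aggregate = mean, value.var = "value")

names(wide)
```

The resulting columns combine the `vs` level and the original column name, which is the pattern shown in the Cars93 output above.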

An alternative is to use aggregate to calculate the means, and then use reshape or dcast to go from "long" to "wide". Both are required since reshape does not perform any aggregation:

temp <- aggregate(cbind(Price, Weight) ~ AirBags + DriveTrain,
                  Cars93, mean)
temp
# AirBags DriveTrain Price Weight
# 1 Driver only 4WD 21.38000 3623.000
# 2 None 4WD 13.88000 2987.000
# 3 Driver & Passenger Front 26.17273 3393.636
# 4 Driver only Front 18.69286 2996.250
# 5 None Front 12.98571 2703.036
# 6 Driver & Passenger Rear 33.20000 3515.000
# 7 Driver only Rear 28.23000 3463.500
# 8 None Rear 14.90000 3610.000

reshape(temp, direction = "wide",
idvar = "AirBags", timevar = "DriveTrain")
# AirBags Price.4WD Weight.4WD Price.Front Weight.Front
# 1 Driver only 21.38 3623 18.69286 2996.250
# 2 None 13.88 2987 12.98571 2703.036
# 3 Driver & Passenger NA NA 26.17273 3393.636
# Price.Rear Weight.Rear
# 1 28.23 3463.5
# 2 14.90 3610.0
# 3 33.20 3515.0

dcast with multiple variables in val.var option

The error occurs when we are using reshape2::dcast instead of data.table::dcast because reshape2::dcast doesn't support more than one value.var.

The documentation for ?reshape2::dcast gives

value.var - name of column which stores values, see guess_value for default strategies to figure this out.

while in ?data.table::dcast it is

value.var - Name of the column whose values will be filled to cast. Function guess() tries to, well, guess this column automatically, if none is provided. Cast multiple value.var columns simultaneously by passing their names as a character vector. See Examples.


With a small reproducible example

library(reshape2)
data(mtcars)
dcast(mtcars, vs + am ~ carb, fun.aggregate = sum, value.var = c('mpg', 'disp'))

Error in .subset2(x, i, exact = exact) : subscript out of bounds
In addition: Warning messages:
1: In dcast(mtcars, vs + am ~ carb, fun.aggregate = sum, value.var = c("mpg",

If we convert to data.table

library(data.table)
dcast(as.data.table(mtcars), vs + am ~ carb, fun.aggregate = sum, value.var = c('mpg', 'disp'))
# vs am mpg_1 mpg_2 mpg_3 mpg_4 mpg_6 mpg_8 disp_1 disp_2 disp_3 disp_4 disp_6 disp_8
#1: 0 0 0.0 68.6 48.9 63.1 0.0 0 0.0 1382.0 827.4 2082.0 0 0
#2: 0 1 0.0 26.0 0.0 57.8 19.7 15 0.0 120.3 0.0 671.0 145 301
#3: 1 0 61.0 47.2 0.0 37.0 0.0 0 603.1 287.5 0.0 335.2 0 0
#4: 1 1 116.4 82.2 0.0 0.0 0.0 0 336.8 291.8 0.0 0.0 0 0

In the OP's code, it would be

summary_out <- dcast(setDT(DB1), 
REGION_ID + REGION_NAME ~ STATUS,
fun.aggregate = sum,
value.var = c("SALES","PROFIT"))

Error using dcast with multiple value.var

I encountered this same thing and it was frustrating as heck.

The problem is that you need to "force" the data.table dcast function; otherwise the reshape2 function is used (for example, when reshape2 is attached after data.table and masks dcast).

The only way I was successful was to call dcast with an explicit namespace:

# multiple value.var
data.table::dcast(dt, x + y ~ z, fun.aggregate = sum, value.var = c("d1", "d2"))
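A quick way to see which dcast a bare call would pick up is to list the attached packages that define it. The sketch below assumes only data.table is attached and uses a made-up table `dt` with the column names from the call above.

```r
library(data.table)

# Attached packages that define `dcast`, in search-path order;
# a bare dcast() call uses the first entry
find("dcast")

# Hypothetical data with the column names used above
dt <- data.table(
  x  = c(1, 1, 2),
  y  = c("p", "p", "q"),
  z  = c("m", "m", "n"),
  d1 = c(1, 2, 3),
  d2 = c(10, 20, 30)
)

# The explicit namespace removes any ambiguity about which dcast runs
out <- data.table::dcast(dt, x + y ~ z,
                         fun.aggregate = sum,
                         value.var = c("d1", "d2"))
out
```

If "package:reshape2" appears before "package:data.table" in the `find()` result, the unqualified call goes to reshape2.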

reshape2: dcast when there are multiple values for one cell but keep this values

This can be done with dcast (here from data.table) though you need a row identifier.

library(data.table)
dcast(dt, HLA_Status + rowid(HLA_Status, variable) ~ variable)
# HLA_Status HLA_Status_1 CCL24 SPP1
#1: PC 1 5.698 2.698
#2: PC 2 89.457 9.457
#3: PC 3 78.230 8.230
#4: PP 1 9.645 23.120
#5: PP 2 56.320 36.320
#6: PP 3 7.268 17.268

data

dt <- fread("    HLA_Status    variable      value
PP CCL24 9.645
PP CCL24 56.32
PP CCL24 7.268
PC CCL24 5.698
PC CCL24 89.457
PC CCL24 78.23
PP SPP1 23.12
PP SPP1 36.32
PP SPP1 17.268
PC SPP1 2.698
PC SPP1 9.457
PC SPP1 8.23")

dcast data.table with multiple value.var's of different classes

An imperfect method:

inDT[, rn := rowid(id)]
Filter(function(z) !all(is.na(z)),
dcast(inDT, rn ~ id, value.var = list("int_value", "num_value", "timestamp_value")))
# rn int_value_int_id_1 int_value_int_id_2 num_value_num_id timestamp_value_timestamp_id
# <int> <int> <int> <num> <POSc>
# 1: 1 2020 1 0.1 2021-09-23 09:15:41
# 2: 2 NA 2 0.2 2021-09-23 09:15:40
# 3: 3 NA 3 0.3 2021-09-23 09:15:39
# 4: 4 NA 4 0.4 2021-09-23 09:15:38
# 5: 5 NA 5 0.5 2021-09-23 09:15:37
# 6: 6 NA 6 0.6 2021-09-23 09:15:36
# 7: 7 NA 7 0.7 2021-09-23 09:15:35
# 8: 8 NA 8 0.8 2021-09-23 09:15:34
# 9: 9 NA 9 0.9 2021-09-23 09:15:33
# 10: 10 NA 10 1.0 2021-09-23 09:15:32

Note: I had to add rn, a column giving the row number within each id, because pivoting needs a key that says which input rows end up together in the same output row.
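To make the role of the row number concrete, here is a minimal sketch with made-up data (the table `toy` and its values are hypothetical). `rowid()` counts occurrences within each group, and that counter becomes the left-hand side of the cast.

```r
library(data.table)

# rowid() numbers the rows within each distinct value
rowid(c("a", "a", "b", "a", "b"))
# 1 2 1 3 2

# Hypothetical long data with repeated ids
toy <- data.table(id = c("a", "a", "b", "b"),
                  value = c(1, 2, 3, 4))

# Add the within-id counter, then pivot on it:
# the k-th row of each id lands in output row k
toy[, rn := rowid(id)]
wide <- dcast(toy, rn ~ id, value.var = "value")
wide
```

Without `rn`, the two rows of each id would compete for a single cell and dcast would have to aggregate them.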

dcast specific column and keep all

This might not be exactly what you want because you have a separate column for value. Then, what do you put under PPT, TMAX and TMIN?

Here's how to put the value under the appropriate column with dplyr and tidyr:

library(dplyr)
library(tidyr)
df1 %>%
spread(element,value)
# date year month day gridNumber PPT TMAX TMIN
# 1 1899-12-15 1899 12 15 526228 0.0000 43.4782 21.7403
# 2 1899-12-16 1899 12 16 526228 0.0000 43.3297 20.7510
# 3 1899-12-17 1899 12 17 526229 0.0000 57.3625 25.8157
# 4 1899-12-18 1899 12 18 526229 0.2105 NA NA

This can be written in one line using tidyr only:

spread(df1,element,value)

dcast for numeric and character columns in R - returning length by default

We can specify length in fun.aggregate if the counts are what is needed:

library(data.table)
dcast(setDT(data), zip + date + calories ~ data_source,
value.var=c("user","price"), fun.aggregate = length)

Based on the data shown, there are no duplicates, so this works without an aggregation function:

dcast(setDT(data), zip + date + calories ~ data_source, value.var=c("user","price"))

If there are duplicates, make the combinations unique by adding a rowid over the grouping variables:

dcast(setDT(data), rowid(zip, date, calories) + zip + date + calories 
~ data_source, value.var=c("user","price"))