Can the value.var in dcast Be a List or Have Multiple Value Variables

can the value.var in dcast be a list or have multiple value variables?

As of data.table v1.9.6, dcast can cast multiple value.var columns simultaneously (and can also apply multiple aggregation functions through fun.aggregate). See ?dcast and the "Efficient reshaping using data.tables" vignette for more.

Here's how we could use dcast:

dcast(setDT(mydf), x1 ~ x2, value.var=c("salt", "sugar"))
# x1 salt_1 salt_2 salt_3 sugar_1 sugar_2 sugar_3
# 1: 1 3 4 6 1 2 2
# 2: 2 10 3 9 5 3 6
# 3: 3 10 7 7 4 6 7
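The answer mentions that fun.aggregate can also take several functions, but shows no example of it. The following is a minimal sketch of that case, assuming a small made-up table (`toy` and its values are hypothetical; only the column names mirror the example above). The result should hold one column per combination of value column, aggregation function, and x2 level.

```r
library(data.table)

# Hypothetical toy data; values are invented for illustration only.
toy <- data.table(
  x1    = c(1, 1, 1, 2, 2, 2),
  x2    = c(1, 2, 2, 1, 1, 2),
  salt  = c(3, 4, 6, 10, 3, 9),
  sugar = c(1, 2, 2, 5, 3, 6)
)

# Cast two value columns at once and apply two aggregation functions.
# 2 value columns x 2 functions x 2 levels of x2 = 8 result columns, plus x1.
wide <- dcast(toy, x1 ~ x2,
              value.var = c("salt", "sugar"),
              fun.aggregate = list(sum, mean))
print(wide)
```

Passing the functions as a list is what triggers the multi-function behaviour; a single function would be passed directly, as in the other examples on this page.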

dcast with multiple variables in val.var option

The error occurs when we are using reshape2::dcast instead of data.table::dcast because reshape2::dcast doesn't support more than one value.var.

The documentation for ?reshape2::dcast gives

value.var - name of column which stores values, see guess_value for default strategies to figure this out.

while in ?data.table::dcast it is

value.var - Name of the column whose values will be filled to cast. Function guess() tries to, well, guess this column automatically, if none is provided. Cast multiple value.var columns simultaneously by passing their names as a character vector. See Examples.


With a small reproducible example

library(reshape2)
data(mtcars)
dcast(mtcars, vs + am ~ carb, fun.aggregate = sum, value.var = c('mpg', 'disp'))

Error in .subset2(x, i, exact = exact) : subscript out of bounds
In addition: Warning messages:
1: In dcast(mtcars, vs + am ~ carb, fun.aggregate = sum, value.var = c("mpg",

If we convert to data.table

library(data.table)
dcast(as.data.table(mtcars), vs + am ~ carb, fun.aggregate = sum, value.var = c('mpg', 'disp'))
# vs am mpg_1 mpg_2 mpg_3 mpg_4 mpg_6 mpg_8 disp_1 disp_2 disp_3 disp_4 disp_6 disp_8
#1: 0 0 0.0 68.6 48.9 63.1 0.0 0 0.0 1382.0 827.4 2082.0 0 0
#2: 0 1 0.0 26.0 0.0 57.8 19.7 15 0.0 120.3 0.0 671.0 145 301
#3: 1 0 61.0 47.2 0.0 37.0 0.0 0 603.1 287.5 0.0 335.2 0 0
#4: 1 1 116.4 82.2 0.0 0.0 0.0 0 336.8 291.8 0.0 0.0 0 0
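Because both packages export a function called dcast, which one runs can depend on the order in which the packages were attached. One way to sidestep that, sketched here rather than taken from the original answer, is to qualify the call with its namespace:

```r
library(data.table)

# Qualify the call so that reshape2::dcast cannot mask it,
# whichever package happened to be attached last.
wide <- data.table::dcast(
  as.data.table(mtcars),
  vs + am ~ carb,
  fun.aggregate = sum,
  value.var = c("mpg", "disp")
)
print(wide)
```

The result should match the table printed above: one mpg_* and one disp_* column for each value of carb.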

In the OP's code, it would be

summary_out <- dcast(setDT(DB1), 
REGION_ID + REGION_NAME ~ STATUS,
fun.aggregate = sum,
value.var = c("SALES","PROFIT"))

Why can't one have several `value.var` in `dcast`?

This question is very much related to your other question from earlier today.

@beginneR wrote in the comments that "As long as the existing data is already in long-format, I don't see any general need to melt it before casting." In my answer posted at your other question, I gave an example of when melt would be required, or rather, how to decide whether your data are long enough.

This question here is another example of when further melting would be required since point 3 in my answer is not satisfied.

To get the behavior you want, try the following:

library(MASS)      # provides the Cars93 data set
library(reshape2)
C93L <- melt(Cars93, measure.vars = c("Price", "Weight"))
dcast(C93L, AirBags ~ DriveTrain + variable, mean, value.var = "value")
# AirBags 4WD_Price 4WD_Weight Front_Price Front_Weight
# 1 Driver & Passenger NaN NaN 26.17273 3393.636
# 2 Driver only 21.38 3623 18.69286 2996.250
# 3 None 13.88 2987 12.98571 2703.036
# Rear_Price Rear_Weight
# 1 33.20 3515.0
# 2 28.23 3463.5
# 3 14.90 3610.0

An alternative is to use aggregate to calculate the means, and then use reshape or dcast to go from "long" to "wide". Both are required since reshape does not perform any aggregation:

temp <- aggregate(cbind(Price, Weight) ~ AirBags + DriveTrain, 
Cars93, mean)
# AirBags DriveTrain Price Weight
# 1 Driver only 4WD 21.38000 3623.000
# 2 None 4WD 13.88000 2987.000
# 3 Driver & Passenger Front 26.17273 3393.636
# 4 Driver only Front 18.69286 2996.250
# 5 None Front 12.98571 2703.036
# 6 Driver & Passenger Rear 33.20000 3515.000
# 7 Driver only Rear 28.23000 3463.500
# 8 None Rear 14.90000 3610.000

reshape(temp, direction = "wide",
idvar = "AirBags", timevar = "DriveTrain")
# AirBags Price.4WD Weight.4WD Price.Front Weight.Front
# 1 Driver only 21.38 3623 18.69286 2996.250
# 2 None 13.88 2987 12.98571 2703.036
# 3 Driver & Passenger NA NA 26.17273 3393.636
# Price.Rear Weight.Rear
# 1 28.23 3463.5
# 2 14.90 3610.0
# 3 33.20 3515.0
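For comparison, the same summary can probably be produced in a single step with data.table's dcast, since it accepts several value.var columns. This is a sketch, not part of the original answer, and it assumes the MASS package (which ships Cars93) is installed:

```r
library(data.table)
library(MASS)  # provides the Cars93 data set

# Mean Price and mean Weight per AirBags x DriveTrain, cast in one call.
wide <- dcast(as.data.table(Cars93), AirBags ~ DriveTrain,
              fun.aggregate = mean,
              value.var = c("Price", "Weight"))
print(wide)
```

Here the column names take the form Price_4WD, Weight_4WD, and so on, rather than the 4WD_Price ordering produced by the melt-then-cast approach.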

on dcast() argument value.var

reshape2 has been retired and tidyr's spread has been superseded; the tidyverse now wants you to use pivot_wider. I'm not up to date on that syntax, but dcast from data.table still does what you want.

library(data.table)
d1 <- data.table(ID = c(11,11,11,12,12,12),
codes = c('a', 'a', 'a', 'b', 'a', 'a'),
gfreq = c(.5,.5,.5,NA,.5,.5))
dcast(d1, ID ~ codes)
#> Using 'gfreq' as value column. Use 'value.var' to override
#> Aggregate function missing, defaulting to 'length'
#> ID a b
#> 1: 11 3 0
#> 2: 12 2 1

d2 <- data.table(ID = c(11,11,11,12,12,12),
codes = c('a', 'a', 'a', 'b', 'a', 'a'))
dcast(d2, ID ~ codes)
#> Using 'codes' as value column. Use 'value.var' to override
#> Aggregate function missing, defaulting to 'length'
#> ID a b
#> 1: 11 3 0
#> 2: 12 2 1

## If you only want 1's and 0's
dcast(unique(d2), ID ~ codes,
fun.aggregate = length)
#> Using 'codes' as value column. Use 'value.var' to override
#> ID a b
#> 1: 11 1 0
#> 2: 12 1 1

Created on 2019-10-16 by the reprex package (v0.3.0)
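In the examples above, dcast guesses the value column and falls back to length, which also counts the NA in gfreq. A small sketch of naming value.var explicitly and counting only non-missing entries (the anonymous aggregation function is an illustrative choice, not something from the original answer):

```r
library(data.table)

d1 <- data.table(ID    = c(11, 11, 11, 12, 12, 12),
                 codes = c("a", "a", "a", "b", "a", "a"),
                 gfreq = c(.5, .5, .5, NA, .5, .5))

# Name the value column explicitly and count only non-missing values,
# so the NA recorded for ID 12 / code "b" is not counted.
counts <- dcast(d1, ID ~ codes,
                value.var = "gfreq",
                fun.aggregate = function(x) sum(!is.na(x)))
print(counts)
```

With both arguments supplied, neither of the two informational messages shown above should appear.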

Apply dcast multiple times for different variables

Here is an option with cSplit_e

library(splitstackshape)
library(magrittr)  # provides the %>% pipe
cSplit_e(mydf, 'V1', type = 'character', fill = '0') %>%
cSplit_e('V2', type = 'character', fill = '0')
# A V1 V2 V1_x V1_y V2_u V2_v V2_w
#1: A x u 1 0 1 0 0
#2: B x v 1 0 0 1 0
#3: C y w 0 1 0 0 1
#4: D x v 1 0 0 1 0
#5: E y u 0 1 1 0 0

Or with table from base R

do.call(cbind, lapply(2:3, function(i) table(mydf$A, mydf[[i]])))

Or the same approach in data.table syntax

nm1 <- names(mydf)[-1]
out <- mydf[, lapply(.SD, function(x)
as.data.frame.matrix(table(A, x))), .SDcols = nm1]
mydf[, names(out) := out][]
# A V1 V2 V1.x V1.y V2.u V2.v V2.w
#1: A x u 1 0 1 0 0
#2: B x v 1 0 0 1 0
#3: C y w 0 1 0 0 1
#4: D x v 1 0 0 1 0
#5: E y u 0 1 1 0 0
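Since the question asks about dcast specifically, another possibility is to melt both columns first and then cast once, instead of casting each variable separately. The sketch below rebuilds mydf from the output printed above (the construction itself is an assumption, as the original data definition is not shown):

```r
library(data.table)

# mydf reconstructed from the printed output above (assumed, not original).
mydf <- data.table(A  = c("A", "B", "C", "D", "E"),
                   V1 = c("x", "x", "y", "x", "y"),
                   V2 = c("u", "v", "w", "v", "u"))

# Stack V1 and V2 into long form, then cast a single time;
# length() yields 1 where a combination occurs and 0 where it does not.
long <- melt(mydf, id.vars = "A")
wide <- dcast(long, A ~ variable + value,
              fun.aggregate = length, value.var = "value")
print(wide)
```

Unlike the cSplit_e output, this result does not retain the original V1 and V2 columns; they could be joined back on A if needed.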

dcast With multiple Ids and variables

A tidyverse solution, using gather and spread from the tidyr package:

library(dplyr)
library(tidyr) #version 1.0.0 which has pivot_wider

df1 %>%
group_by(Type) %>%
mutate(name_x = row_number()) %>%
gather(key=var, value=val, c(Score, Time)) %>%
mutate(var = paste(var, name_x, sep="_")) %>%
select(-name_x) %>%
spread(key=var, value=val)

#> # A tibble: 3 x 11
#> # Groups: Type [3]
#> id Date Type Score_1 Score_2 Score_3 Score_4 Time_1 Time_2 Time_3 Time_4
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 2001~ aaa 123 456 789 NA 12:12 13:12 14:12 <NA>
#> 2 2 2001~ ddd 113 145 NA NA 15:12 16:12 <NA> <NA>
#> 3 3 2001~ bbb 789 145 113 145 17:12 18:12 19:12 20:12

You can do the same with pivot_wider much more conveniently:

df1 %>% 
group_by(Type) %>%
mutate(name_x = row_number()) %>%
pivot_wider(id_cols = c("id","Date", "Type"),
names_from = c("name_x"),
values_from = c("Score", "Time"))

Data:

df1 <- data.frame(id=c(1,1,1,2,2,3,3,3,3),
Date = c(rep("2001-01-13", 3), rep("2001-01-16", 2), rep("2001-01-18", 4)),
Type = c(rep("aaa",3), rep("ddd", 2), rep("bbb",4)),
Score = c(123,456,789,113,145,789,145,113,145),
Time = paste0(12:20, ":12"),
stringsAsFactors = FALSE)
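For readers who prefer data.table, a rough equivalent of the pivot_wider call can be sketched with dcast, using rowid to number the rows within each Type. The object names `dt1` and `wide` are illustrative; the data are the same as df1 above.

```r
library(data.table)

dt1 <- data.table(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Date  = c(rep("2001-01-13", 3), rep("2001-01-16", 2), rep("2001-01-18", 4)),
  Type  = c(rep("aaa", 3), rep("ddd", 2), rep("bbb", 4)),
  Score = c(123, 456, 789, 113, 145, 789, 145, 113, 145),
  Time  = paste0(12:20, ":12")
)

# rowid(Type) numbers the rows within each Type (1, 2, 3, ...);
# both value columns are then spread across those numbers.
wide <- dcast(dt1, id + Date + Type ~ rowid(Type),
              value.var = c("Score", "Time"))
print(wide)
```

This should give the same Score_1 ... Time_4 layout as the tidyr versions, with NA where a Type has fewer than four rows.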

reshape2: dcast when there are multiple values for one cell but keep this values

This can be done with dcast (here from data.table) though you need a row identifier.

library(data.table)
dcast(dt, HLA_Status + rowid(HLA_Status, variable) ~ variable)
# HLA_Status HLA_Status_1 CCL24 SPP1
#1: PC 1 5.698 2.698
#2: PC 2 89.457 9.457
#3: PC 3 78.230 8.230
#4: PP 1 9.645 23.120
#5: PP 2 56.320 36.320
#6: PP 3 7.268 17.268

data

dt <- fread("    HLA_Status    variable      value
PP CCL24 9.645
PP CCL24 56.32
PP CCL24 7.268
PC CCL24 5.698
PC CCL24 89.457
PC CCL24 78.23
PP SPP1 23.12
PP SPP1 36.32
PP SPP1 17.268
PC SPP1 2.698
PC SPP1 9.457
PC SPP1 8.23")

dcast with value being text

Since you had dcast in your title, I'll assume data.table:

data.table::dcast(question ~ employeeid, data = df, value.var = "Answer")
# question 1 2
# 1 do you like apples? No No
# 2 do you like milk? Yes No

but an alternative:

tidyr::spread(df, employeeid, Answer)
# question 1 2
# 1 do you like apples? No No
# 2 do you like milk? Yes No

Edit: since it appears you have dupes in the data, you can find the "most-occurring" answer with:

most <- function(x) names(sort(table(x), decreasing = TRUE))[1]  # sort descending so the most frequent value comes first
data.table::dcast(question~employeeid, data=df, value.var="Answer", fun.aggregate = most)
# question 1 2
# 1 do you like apples? Yes Yes
# 2 do you like milk? No Yes
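To make the aggregation step easier to check, here is a self-contained sketch on made-up survey data (`survey`, its values, and the helper name `most_common` are hypothetical). The table of answers is sorted in decreasing order so that the most frequent answer comes first:

```r
library(data.table)

# Hypothetical survey data with repeated answers per question and employee.
survey <- data.table(
  question   = rep(c("do you like apples?", "do you like milk?"), each = 5),
  employeeid = c(1, 1, 1, 2, 2,  1, 1, 1, 2, 2),
  Answer     = c("Yes", "Yes", "No", "No", "No",
                 "No", "No", "Yes", "Yes", "Yes")
)

# Most frequent value: tabulate, sort descending, take the first name.
most_common <- function(x) names(sort(table(x), decreasing = TRUE))[1]

wide <- dcast(survey, question ~ employeeid,
              value.var = "Answer",
              fun.aggregate = most_common)
print(wide)
```

Ties are resolved by whichever value sorts first, so a different rule would be needed if ties matter.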

dcast function taking arguments from two value variables

I'm not sure I understood your goal, but as I read it, a quick and dirty way is to group by cars and state first, compute the new column, and then dcast the resulting data.table:

mycars <- as.data.table(mycars)

temp <- mycars[, .(z = car_PS_var(PS_mean, PS_stdv)),
by = c("cars", "state")]

dcast(temp, cars ~ state)

cars 1 2
1: A 1.449275 1.449275
2: B 4.325825 4.325825
3: C 4.545340 4.545340
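The snippet above depends on mycars and car_PS_var, neither of which is shown. The following self-contained sketch follows the same two steps with invented data and a hypothetical stand-in function, `ratio_fun`, which simply divides the mean of one column by the mean of the other; it is not the asker's actual function.

```r
library(data.table)

# Hypothetical stand-in for car_PS_var(): takes two columns, returns one value.
ratio_fun <- function(m, s) mean(m) / mean(s)

# Invented data with the same column names as in the answer.
mycars <- data.table(
  cars    = rep(c("A", "B"), each = 4),
  state   = rep(c(1, 1, 2, 2), times = 2),
  PS_mean = c(100, 120, 90, 110, 200, 220, 150, 170),
  PS_stdv = c(10, 12, 20, 20, 20, 22, 16, 16)
)

# Step 1: compute the two-column statistic per cars/state group.
temp <- mycars[, .(z = ratio_fun(PS_mean, PS_stdv)), by = .(cars, state)]

# Step 2: spread the single result column across the states.
wide <- dcast(temp, cars ~ state, value.var = "z")
print(wide)
```

Aggregating before casting is what allows a function of two columns here, since fun.aggregate inside dcast only sees one value column at a time.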

Is it possible to use dcast without variable column?

With dcast, we can build the right-hand side of the formula on the fly from an expression that combines paste0 and rowid:

library(data.table)
dcast(dt, id ~ paste0('var_', rowid(id)))

-output

   id var_1 var_2
1: 1 100 300
2: 2 200 NA
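To see why this works, it helps to look at what rowid returns. The sketch below assumes a two-column input whose value column is called `val` (a hypothetical name, since the original dt is not shown), with values chosen to match the output above:

```r
library(data.table)

# Assumed input; the value column name `val` is hypothetical.
dt <- data.table(id = c(1, 2, 1), val = c(100, 200, 300))

# rowid() counts occurrences within each id: here 1, 1, 2.
idx <- rowid(dt$id)
print(idx)

# Those counters, prefixed with "var_", become the new column names.
wide <- dcast(dt, id ~ paste0("var_", rowid(id)), value.var = "val")
print(wide)
```

An id with fewer rows than the longest group is padded with NA, as for id 2 here.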

