Spread VS Dcast

Spread vs dcast

Let's say you were starting with data that looked like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise for the "value" column.

If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>% 
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")

Spread or dcast and fill in counts

I think you want to count the features and not sum them. Try with the function length.

tidyr::pivot_wider(df, names_from = feature, 
            values_from = feature, values_fn = length, values_fill = 0)

Or with dcast.

library(data.table)
dcast(setDT(df), id~feature, value.var = 'feature', fun.aggregate = length)

In base R, using table(df) would give the same output.

table(df)

#     feature
#id     f1 f2 f3 f4 f5
#  id1   1  0  1  1  0
#  id10  1  0  1  1  0
#  id2   1  1  0  0  1
#  id3   0  1  1  1  0
#  id4   1  0  1  0  1
#  id5   1  1  0  0  1
#  id6   1  1  1  0  0
#  id7   1  0  0  1  1
#  id8   1  1  0  0  1
#  id9   0  1  0  1  1

Is it possible to use spread on multiple columns in tidyr similar to dcast?

One option would be to create a new 'Prod_Count' by joining the 'Product' and 'Country' columns by paste, remove those columns with the select and reshape from 'long' to 'wide' using spread from tidyr.

 library(dplyr)
 library(tidyr)
 sdt %>%
 mutate(Prod_Count=paste(Product, Country, sep="_")) %>%
 select(-Product, -Country)%>% 
 spread(Prod_Count, value)%>%
 head(2)
 #  Year      A_AI       B_EI
 #1 1990 0.7878674  0.2486044
 #2 1991 0.2343285 -1.1694878

Or we can avoid a couple of steps by using unite from tidyr (from @beetroot's comment) and reshape as before.

 sdt%>% 
 unite(Prod_Count, Product,Country) %>%
 spread(Prod_Count, value)%>% 
 head(2)
 #   Year      A_AI       B_EI
 # 1 1990 0.7878674  0.2486044
 # 2 1991 0.2343285 -1.1694878

Spread and dcast not aligning correctly

Your issue is with precision with your lat & long values

> unique(dat$long)
# [1] -73.45833 -73.45833
> unique(dat$lat)
# [1] 42.04167 42.04167

## Notice two 'unique' values that are 'printed' the same

If you look at your structure data closely you will see you have slightly different values in your lat and long

For example, rounding the lat/long values removes the precision 'error'

dat$lat  <- round(dat$lat, 4)
dat$long <- round(dat$long, 4)

spread(dat, element, value)

## now tmin is aligned 

    #gridNumber  fips cropArea  state       county_name     long     lat  year month      ppt   tmax  tmean   tmin
#(int) (int)    (dbl) (fctr)             (chr)    (dbl)   (dbl) (int) (chr)    (dbl)  (dbl)  (dbl)  (dbl)
#1     266783  9005        0     CT Litchfield County -73.4583 42.0417  1996   apr 6.642506 53.294 43.052 32.810
#2     266783  9005        0     CT Litchfield County -73.4583 42.0417  1996   aug 2.519680 76.550 66.551 56.552
#3     266783  9005        0     CT Litchfield County -73.4583 42.0417  1996   dec       NA     NA     NA 26.132

Replacing dplyr's spread by pivot_wider and gather by pivot_longer

This seems to be another instance of this issue on Github, which is supposed to be fixed in the dev version of tidyr. After updating tidyr (i.e.devtools::install_github("tidyverse/tidyr")) I get comparable performance on your example:

library(tidyverse)

dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")

returndata <- tibble(stock = sort(rep(letters, length(dates))),
                     month = rep(dates, length(letters)),
                     ret   = runif(length(dates) * length(letters)) - 0.5)

bench::mark(
  spread = returndata %>% spread(stock, ret),
  pivot_wider = returndata %>% pivot_wider(names_from = stock, values_from = ret)
)
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spread        8.83ms   9.57ms     100.         0B     6.39
#> 2 pivot_wider  10.96ms  11.37ms      86.1        0B     4.42

^{Created on 2019-11-25 by the reprex package (v0.3.0)}

R pivot_wider or dcast to cast multiple columns

We can use pivot_wider

library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
   mutate(rn = rowid(NAME)) %>%
   pivot_wider(names_from = rn, values_from = c(CATEGORY, NUMBER_CATEGORY))

-output

# A tibble: 2 x 7
#  NAME  CATEGORY_1 CATEGORY_2 CATEGORY_3 NUMBER_CATEGORY_1 NUMBER_CATEGORY_2 NUMBER_CATEGORY_3
#  <chr> <chr>      <chr>      <chr>                  <int>             <int>             <int>
#1 Amy   Low        Med        <NA>                     180               185                NA
#2 John  Low        Med        Med                      118               182               185

data

df1 <- structure(list(NAME = c("Amy", "Amy", "John", "John", "John"), 
    CATEGORY = c("Low", "Med", "Low", "Med", "Med"), NUMBER_CATEGORY = c(180L, 
    185L, 118L, 182L, 185L)), class = "data.frame", row.names = c(NA, 
-5L))

how to spread or cast multiple values in r

We could do this using dplyr/tidyr. We reshape the 'data' from 'wide' to 'long' format with gather specifying the columns (starts_with('value')) to be combined to a key/value column pair ('Var/Val'), unite the 'Var' and 'y' column to create a single 'Var1' column, and reconvert back to 'wide' format with spread.

 library(dplyr)
 library(tidyr)
 data %>%
      gather(Var, val, starts_with("value")) %>% 
      unite(Var1,Var, y) %>% 
      spread(Var1, val)

 #      x value.1_a value.1_b value.1_c value.1_d value.2_a value.2_b   value.2_c
 #1   blue         5         6         7         8        17        18        19
 #2  green         9        10        11        12        21        22        23
 #3    red         1         2         3         4        13        14        15
 #    value.2_d
 #1        20
 #2        24
 #3        16

Update

(After 6 months)

Reshaping multiple value columns to wide is now possible with dcast from data.table_1.9.5 without using the melt. We can install the devel version from here

 library(data.table)
 dcast(setDT(data), x~y, value.var=c('value.1', 'value.2'))
 #       x a_value.1 b_value.1 c_value.1 d_value.1 a_value.2 b_value.2 c_value.2
 #1:  blue         5         6         7         8        17        18        19
 #2: green         9        10        11        12        21        22        23
 #3:   red         1         2         3         4        13        14        15
 #   d_value.2
 #1:        20
 #2:        24
 #3:        16

R spread vs gather in tidyr

You can use a combined gather + spread approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:

resultsData %>% 
    gather(key, value, -person, -currentTest) %>% 
    unite(header, c('currentTest', 'key'), sep = "_") %>% 
    spread(header, value)

# A tibble: 10 x 13
#   person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# *  <int>        <dbl>         <dbl>        <dbl>         <dbl>        <dbl>         <dbl>
# 1      1   2.67176351    1.28429706   -0.8694723   -0.47279938  1.090681730     0.5774694
# 2      2  -2.35492691   -0.61835955    1.4448554    0.05972074  4.375377235     1.9274817
# 3      3  -0.09170997    0.03945743    2.7310495    0.92420153 -0.116850493    -0.2455936
# 4      4  -0.36285183   -0.44860832    1.0175425    0.65888418  1.953383260     0.1261048
# 5      5  -1.41628434   -0.96177712   -4.7126717   -1.98980726 -1.690035639    -0.5593383
# 6      6   2.05218144    0.70247190    1.9712327    0.66024130  2.622505813     1.2980212
# 7      7  -2.12568428   -0.45522204   -0.5644579    0.08963695 -0.008373791     0.7194067
# 8      8  -2.77742515   -1.23154913    0.5076592   -0.82839994  1.841929380     0.9694145
# 9      9  -0.55830618   -0.79723499   -0.3169969   -0.83807424 -0.675899869    -0.8146971
#10     10  -1.24415955   -0.70973496   -3.3176237   -1.65919710  2.085066600     0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
#   E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>

If you need to rename the columns:

resultsData %>% 
    gather(key, value, -person, -currentTest) %>% 
    unite(header, c('currentTest', 'key'), sep = "_") %>% 
    spread(header, value) %>% 
    rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))

data.table is copied when using spread()?

I think using dcast is the way to go. However, a possible solution using tidyr::spread would be to add setDT() to the piped call, i.e.,

set.seed(123)

# install.packages(c("data.table"), dependencies = TRUE)
library(data.table)

mydt <- data.table(id = 1:100, x = sample(LETTERS[1:6], size = 100, replace = TRUE),
                   group = paste0("group", sample(1:3, size = 100, replace = TRUE)),
                   prob = runif(100, 0, 1)
                   )

class(mydt)
mydt2 <- mydt %>% tidyr::spread(group, prob) %>% setDT()

mydt2[!is.na(group1), new.col := x]