How to Use the Spread Function Properly in Tidyr

Problem with {tidyr} spread function dropping rows

Because the data contains some duplicate rows, spread cannot tell them apart. We can give each row a unique identifier by creating a column with row_number()

library(dplyr)
library(tidyr)
df %>% 
    mutate(i=1, rn = row_number()) %>% 
    spread(group, i, fill=0) %>%
    select(-rn)

Or using pivot_wider

df %>%
   mutate(rn = row_number(), i = 1) %>%
   pivot_wider(names_from = group, values_from = i, values_fill = list(i = 0)) %>%
   select(-rn)
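
The original data is not shown, so here is a minimal sketch with an invented `df` (the column `name` and all values are assumptions made for illustration). It shows the row-number trick keeping both copies of a duplicated row:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the ("x", "a") row appears twice
df <- tibble(name  = c("x", "x", "y"),
             group = c("a", "a", "b"))

# rn makes every input row unique, so the two ("x", "a") rows
# end up as two separate rows in the wide result
wide <- df %>%
    mutate(i = 1, rn = row_number()) %>%
    spread(group, i, fill = 0) %>%
    select(-rn)

wide
```

Without the rn column, the two identical rows would compete for the same output cell and spread would have no way to keep them apart.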

Is it possible to use spread on multiple columns in tidyr similar to dcast?

One option is to create a new column, 'Prod_Count', by joining the 'Product' and 'Country' columns with paste, drop the two original columns with select, and then reshape from 'long' to 'wide' using spread from tidyr.

 library(dplyr)
 library(tidyr)
 sdt %>%
 mutate(Prod_Count=paste(Product, Country, sep="_")) %>%
 select(-Product, -Country)%>% 
 spread(Prod_Count, value)%>%
 head(2)
 #  Year      A_AI       B_EI
 #1 1990 0.7878674  0.2486044
 #2 1991 0.2343285 -1.1694878

Or we can avoid a couple of steps by using unite from tidyr (from @beetroot's comment) and reshape as before.

 sdt%>% 
 unite(Prod_Count, Product,Country) %>%
 spread(Prod_Count, value)%>% 
 head(2)
 #   Year      A_AI       B_EI
 # 1 1990 0.7878674  0.2486044
 # 2 1991 0.2343285 -1.1694878
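
With tidyr 1.0.0 or later, the same reshaping can probably be done in one step, since pivot_wider accepts several columns in names_from and joins their values with "_" by default. The sketch below uses an invented stand-in for `sdt` (the values are assumptions, rounded for readability):

```r
library(tidyr)

# Hypothetical stand-in for sdt
sdt <- data.frame(Year    = c(1990, 1990, 1991, 1991),
                  Product = c("A", "B", "A", "B"),
                  Country = c("AI", "EI", "AI", "EI"),
                  value   = c(0.79, 0.25, 0.23, -1.17))

# Column names are built from Product and Country, e.g. "A_AI"
wide <- pivot_wider(sdt,
                    names_from  = c(Product, Country),
                    values_from = value)

wide
```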

tidyr spread function generates sparse matrix when compact vector expected

The key here is that spread doesn't aggregate the data.

Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

a <- data.frame(P=c(FALSE,TRUE,FALSE,TRUE,FALSE),A=c(FALSE,FALSE,TRUE,TRUE,TRUE), Freq = 1) %>% 
    unite(S,A,P)
a
##             S Freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(S, Freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA

Without aggregation, this is the only result that makes sense: every input row stays a separate row.

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables
and the key column, this value will be substituted.

In your case, there aren't any other variables to combine with the key column. Had there been, then...

b <- data.frame(P=c(FALSE,TRUE,FALSE,TRUE,FALSE),A=c(FALSE,FALSE,TRUE,TRUE,TRUE), Freq = 1
                                , h = rep(c("foo", "bar"), length.out = 5)) %>% 
    unite(S,A,P)
b
##             S Freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)

...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).

The tidyr/dplyr way to do it would be to use group_by and summarize instead of xtabs. Because summarize keeps the grouping columns in its result, spread can tell which observations belong in the same row:

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
## 
##     h           S Freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
## Source: local data frame [2 x 5]
## 
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
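
As an alternative sketch, assuming tidyr 1.0.0 or later is available: pivot_wider can aggregate while reshaping through its values_fn argument, so the separate group_by and summarize step is not needed. The list form of values_fn and values_fill is used here because it should be accepted across tidyr 1.x versions.

```r
library(dplyr)
library(tidyr)

# Same data as b above
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
    unite(S, A, P)

# values_fn sums Freq for rows that share the same h and S;
# values_fill supplies 0 for combinations that never occur
wide <- b %>%
    pivot_wider(names_from  = S,
                values_from = Freq,
                values_fn   = list(Freq = sum),
                values_fill = list(Freq = 0))

wide
```

Unlike spread, pivot_wider orders rows and columns by first appearance, so the layout may differ from the output above even though the counts agree.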

How do you use spread() when your data has multiple key variables?

You can use unite from tidyr to combine the three columns into one prior to spreading.

Then you can spread, using the new column as the key and the "result" as value.

I also removed columns "a" through "p" prior to spreading, as it didn't seem like these were needed in the desired result.

pracdf2 %>%
     unite("allgroups", month, scenario, ptype) %>%
     select(-(a:p)) %>%
     spread(allgroups, result)

# A tibble: 2 x 13
  ID    `1_a_p.high` `1_a_p.low` `1_a_p.mid` `1_b_p.high` `1_b_p.low` `1_b_p.mid` `2_a_p.high` `2_a_p.low`
  <fct>        <dbl>       <dbl>       <dbl>        <dbl>       <dbl>       <dbl>        <dbl>       <dbl>
1 a              160        96.2       128          423         254         338            209       126  
2 b              120        72.0        96.0         20.9        12.5        16.7          133        79.5
# ... with 4 more variables: `2_a_p.mid` <dbl>, `2_b_p.high` <dbl>, `2_b_p.low` <dbl>, `2_b_p.mid` <dbl>

R - tidyr - mutate and spread multiple columns

Rather than spread(), you can use pivot_wider(), which was added in tidyr 1.0.0. Its values_from argument lets you specify multiple columns at once:

library(dplyr)
library(tidyr)

my_df_test %>% 
  group_by(V1, V2) %>% 
  mutate(new = V3, V3 = toString(V3)) %>% 
  pivot_wider(
    names_from  = new,
    values_from = c(V6, V7)
  )
#> # A tibble: 2 x 9
#> # Groups:   V1, V2 [4]
#>      V1 V2    V3     V4    V5    V6_S1 V6_S2 V7_S1 V7_S2
#>   <dbl> <fct> <chr>  <fct> <fct> <fct> <fct> <fct> <fct>
#> 1     1 A     S1, S2 x     y     A     C     D     F    
#> 2     2 B     S1     x     y     B     <NA>  E     <NA>

Created on 2019-09-18 by the reprex package (v0.3.0)

tidyr::spread() with multiple keys and values

Reshaping with multiple value variables can best be done with dcast from data.table or reshape from base R.

library(data.table)
out <- dcast(setDT(df), id ~ paste0("time", time), value.var = c("x", "y"), sep = "")
out
#    id     xtime1     xtime2      xtime3      ytime1      ytime2      ytime3
# 1:  1  0.4334921 -0.5205570 -1.44364515  0.49288757 -1.26955148 -0.83344256
# 2:  2  0.4785870  0.9261711  0.68173681  1.24639813  0.91805332  0.34346260
# 3:  3 -1.2067665  1.7309593  0.04923993  1.28184341 -0.69435556  0.01609261
# 4:  4  0.5240518  0.7481787  0.07966677 -1.36408357  1.72636849 -0.45827205
# 5:  5  0.3733316 -0.3689391 -0.11879819 -0.03276689  0.91824437  2.18084692
# 6:  6  0.2363018 -0.2358572  0.73389984 -1.10946940 -1.05379502 -0.82691626
# 7:  7 -1.4979165  0.9026397  0.84666801  1.02138768 -0.01072588  0.08925716
# 8:  8  0.3428946 -0.2235349 -1.21684977  0.40549497  0.68937085 -0.15793111
# 9:  9 -1.1304688 -0.3901419 -0.10722222 -0.54206830  0.34134397  0.48504564
#10: 10 -0.5275251 -1.1328937 -0.68059800  1.38790593  0.93199593 -1.77498807

Using reshape we could do

# setDF(df) # in case df is a data.table now
reshape(df, idvar = "id", timevar = "time", direction = "wide")
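
Since `df` itself is not shown, here is a small runnable sketch of the reshape call on invented data (two ids, three time points; all values are assumptions). Note that base reshape names the new columns with a "." separator, such as x.1, rather than the xtime1 style produced by the dcast call above.

```r
# Hypothetical long data: 2 ids x 3 time points, two value columns
df <- data.frame(id   = rep(1:2, each = 3),
                 time = rep(1:3, times = 2),
                 x    = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                 y    = c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6))

# Every column other than id and time is spread across the time values
wide <- reshape(df, idvar = "id", timevar = "time", direction = "wide")

names(wide)
```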

Spread multiple columns in a function

We'll return to the answer provided in the question linked to, but for the moment let's start with a more naive approach.

One idea would be to spread each value column individually, and then join the results, i.e.

library(dplyr)
library(tidyr)
library(tibble)

dat_avg <- dat %>% 
    select(-sd) %>%
    spread(key = grp,value = avg) %>%
    rename(a_avg = a,
           b_avg = b)

dat_sd <- dat %>% 
    select(-avg) %>%
    spread(key = grp,value = sd) %>%
    rename(a_sd = a,
           b_sd = b)

full_join(dat_avg,
          dat_sd,
          by = 'id')

# A tibble: 2 x 5
     id     a_avg      b_avg      a_sd      b_sd
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

(I used a full_join just in case we run into situations where not all combinations of the join columns appear in all of them.)

Let's start with a function that works like spread but allows you to pass the key and value columns as characters:

spread_chr <- function(data, key_col, value_cols, fill = NA, 
                       convert = FALSE,drop = TRUE,sep = NULL){
    n_val <- length(value_cols)
    result <- vector(mode = "list", length = n_val)
    id_cols <- setdiff(names(data), c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join, by = id_cols)
}

dat %>%
  spread_chr(key_col = "grp",
             value_cols = c("avg","sd"),
             sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The key ideas here are to unquote the arguments key_col and value_cols[i] with the !! operator, and to use the sep argument of spread to control the resulting value column names.
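
To see those two ideas in isolation, here is a minimal sketch of a single spread call with the column names held in character variables. The data frame is an invented stand-in for `dat` with rounded values, and the variable name `value_col` is introduced only for this illustration.

```r
library(tidyr)

# Hypothetical stand-in for dat, with only one value column
dat <- data.frame(id  = c(1L, 1L, 2L, 2L),
                  grp = c("a", "b", "a", "b"),
                  avg = c(1.37, -0.56, 0.36, 0.63))

key_col   <- "grp"
value_col <- "avg"

# !! inserts the strings held in key_col and value_col;
# a non-NULL sep gives names of the form <key name><sep><key value>
wide <- spread(dat, key = !!key_col, value = !!value_col,
               sep = paste0("_", value_col, "_"))

names(wide)
```

This is the same call the loop in spread_chr makes once per value column before joining the pieces.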

If we wanted to convert this function to accept unquoted arguments for the key and value columns, we could modify it like so:

spread_nq <- function(data, key_col,..., fill = NA, 
                      convert = FALSE, drop = TRUE, sep = NULL){
    val_quos <- rlang::quos(...)
    key_quo <- rlang::enquo(key_col)
    value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
    key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

    n_val <- length(value_cols)
    result <- vector(mode = "list",length = n_val)
    id_cols <- setdiff(names(data),c(key_col,value_cols))

    for (i in seq_along(result)){
        result[[i]] <- spread(data = data[,c(id_cols,key_col,value_cols[i]),drop = FALSE],
                              key = !!key_col,
                              value = !!value_cols[i],
                              fill = fill,
                              convert = convert,
                              drop = drop,
                              sep = paste0(sep,value_cols[i],sep))
    }

    result %>%
        purrr::reduce(.f = full_join,by = id_cols)
}

dat %>%
  spread_nq(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id grp_avg_a  grp_avg_b  grp_sd_a  grp_sd_b
  <int>     <dbl>      <dbl>     <dbl>     <dbl>
1     1 1.3709584 -0.5646982 0.6569923 0.7050648
2     2 0.3631284  0.6328626 0.4577418 0.7191123

The change here is that we capture the unquoted arguments with rlang::quos and rlang::enquo and then simply convert them back to characters using tidyselect::vars_select.

Returning to the solution in the linked question that uses a sequence of gather, unite and spread, we can use what we've learned to make a function like this:

spread_nt <- function(data,key_col,...,fill = NA,
                      convert = TRUE,drop = TRUE,sep = "_"){
  key_quo <- rlang::enquo(key_col)
  val_quos <- rlang::quos(...)
  value_cols <- unname(tidyselect::vars_select(names(data),!!!val_quos))
  key_col <- unname(tidyselect::vars_select(names(data),!!key_quo))

  data %>%
    gather(key = ..var..,value = ..val..,!!!val_quos) %>%
    unite(col = ..grp..,c(key_col,"..var.."),sep = sep) %>%
    spread(key = ..grp..,value = ..val..,fill = fill,
           convert = convert,drop = drop,sep = NULL)
}

dat %>%
  spread_nt(key_col = grp,avg,sd,sep = "_")

# A tibble: 2 x 5
     id     a_avg      a_sd      b_avg      b_sd
* <int>     <dbl>     <dbl>      <dbl>     <dbl>
1     1 1.3709584 0.6569923 -0.5646982 0.7050648
2     2 0.3631284 0.4577418  0.6328626 0.7191123

This relies on the same techniques from rlang from the last example. We're using some unusual names like ..var.. for our intermediate variables in order to reduce the chances of name collisions with existing columns in our data frame.

Also, we're using the sep argument in unite to control the resulting column names, so in this case when we spread we force sep = NULL.
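
For comparison, a sketch of how the same result might be reached without a custom function, assuming tidyr 1.1.0 or later (where, as far as I know, the names_glue argument of pivot_wider was introduced). The data frame is an invented stand-in for `dat` with rounded values.

```r
library(tidyr)

# Hypothetical stand-in for dat
dat <- data.frame(id  = c(1L, 1L, 2L, 2L),
                  grp = c("a", "b", "a", "b"),
                  avg = c(1.37, -0.56, 0.36, 0.63),
                  sd  = c(0.66, 0.71, 0.46, 0.72))

# .value stands for the name of each values_from column,
# so the glue template yields names such as "a_avg" and "b_sd"
wide <- pivot_wider(dat,
                    names_from  = grp,
                    values_from = c(avg, sd),
                    names_glue  = "{grp}_{.value}")

wide
```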