Problem with {tidyr} spread function dropping rows
Because there are some duplicate rows, we can first create a unique row-identifier column with row_number():
library(dplyr)
library(tidyr)
df %>%
  mutate(i = 1, rn = row_number()) %>%
  spread(group, i, fill = 0) %>%
  select(-rn)
Or, using pivot_wider:
df %>%
  mutate(rn = row_number(), i = 1) %>%
  pivot_wider(names_from = group, values_from = i, values_fill = list(i = 0))
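To see why the row identifier matters, here is a minimal sketch on made-up data (the data frame df below is an assumption, since the original data is not shown). The first two rows are exact duplicates; the rn column keeps them apart while reshaping and is dropped afterwards.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: rows 1 and 2 are exact duplicates
df <- data.frame(id = c(1, 1, 2), group = c("a", "a", "b"))

wide <- df %>%
  mutate(rn = row_number(), i = 1) %>%   # rn makes every row unique
  pivot_wider(names_from = group, values_from = i,
              values_fill = list(i = 0)) %>%
  select(-rn)                            # the helper column is no longer needed

wide
```

Both duplicate rows survive in the result, each with a 1 in column a and a 0 in column b.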
Is it possible to use spread on multiple columns in tidyr similar to dcast?
One option would be to create a new 'Prod_Count' column by joining the 'Product' and 'Country' columns with paste, remove those columns with select, and reshape from 'long' to 'wide' using spread from tidyr.
library(dplyr)
library(tidyr)
sdt %>%
  mutate(Prod_Count = paste(Product, Country, sep = "_")) %>%
  select(-Product, -Country) %>%
  spread(Prod_Count, value) %>%
  head(2)
# Year A_AI B_EI
#1 1990 0.7878674 0.2486044
#2 1991 0.2343285 -1.1694878
Or we can avoid a couple of steps by using unite from tidyr (from @beetroot's comment) and reshape as before.
sdt %>%
  unite(Prod_Count, Product, Country) %>%
  spread(Prod_Count, value) %>%
  head(2)
# Year A_AI B_EI
# 1 1990 0.7878674 0.2486044
# 2 1991 0.2343285 -1.1694878
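With tidyr 1.0.0 or later, the same reshape can probably be written without the intermediate column, because pivot_wider accepts several columns in names_from and joins them with "_" by default. The sketch below uses a small invented sdt (the values are placeholders, not the data from the question).

```r
library(tidyr)

# Hypothetical stand-in for sdt; the values are made up
sdt <- data.frame(
  Year    = c(1990, 1990, 1991, 1991),
  Product = c("A", "B", "A", "B"),
  Country = c("AI", "EI", "AI", "EI"),
  value   = c(0.5, 1.5, 2.5, 3.5)
)

# Column names are built from Product and Country, e.g. "A_AI"
wide <- pivot_wider(sdt, names_from = c(Product, Country), values_from = value)
wide
```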
tidyr spread function generates sparse matrix when compact vector expected
The key here is that spread doesn't aggregate the data.
Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:
a <- data.frame(P = c(F, T, F, T, F), A = c(F, F, T, T, T), Freq = 1) %>%
  unite(S, A, P)
a
## S Freq
## 1 FALSE_FALSE 1
## 2 FALSE_TRUE 1
## 3 TRUE_FALSE 1
## 4 TRUE_TRUE 1
## 5 TRUE_FALSE 1
a %>% spread(S, Freq)
## FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 1 NA NA NA
## 2 NA 1 NA NA
## 3 NA NA 1 NA
## 4 NA NA NA 1
## 5 NA NA 1 NA
Without aggregation, no other result would make sense.
This is predictable based on the help file for the fill parameter:
If there isn't a value for every combination of the other variables
and the key column, this value will be substituted.
In your case, there aren't any other variables to combine with the key column. Had there been, then...
b <- data.frame(P = c(F, T, F, T, F), A = c(F, F, T, T, T), Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)
b
## S Freq h
## 1 FALSE_FALSE 1 foo
## 2 FALSE_TRUE 1 bar
## 3 TRUE_FALSE 1 foo
## 4 TRUE_TRUE 1 bar
## 5 TRUE_FALSE 1 foo
b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)
...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).
The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, so spread can tell which observations belong in the same row:
b %>% group_by(h, S) %>%
  summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
##
## h S Freq
## 1 bar FALSE_TRUE 1
## 2 bar TRUE_TRUE 1
## 3 foo FALSE_FALSE 1
## 4 foo TRUE_FALSE 2
b %>% group_by(h, S) %>%
  summarize(Freq = sum(Freq)) %>%
  spread(S, Freq)
## Source: local data frame [2 x 5]
##
## h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar NA 1 NA 1
## 2 foo 1 NA 2 NA
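As a hedged alternative, assuming tidyr 1.0.0 or later, pivot_wider can aggregate and reshape in one call through its values_fn argument. The sketch below rebuilds b as defined above and sums the duplicated TRUE_FALSE rows for "foo".

```r
library(dplyr)
library(tidyr)

# Same data as b above, written with explicit TRUE/FALSE
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# values_fn tells pivot_wider how to combine duplicated cells
wide <- b %>%
  pivot_wider(names_from = S, values_from = Freq,
              values_fn = list(Freq = sum))
wide
```

Combinations that never occur are still left as NA, as in the spread output.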
How do you use spread() when your data has multiple key variables?
You can use unite from tidyr to combine the three columns into one prior to spreading.
Then you can spread, using the new column as the key and "result" as the value.
I also removed columns "a" through "p" prior to spreading, as it didn't seem like these were needed in the desired result.
pracdf2 %>%
  unite("allgroups", month, scenario, ptype) %>%
  select(-(a:p)) %>%
  spread(allgroups, result)
# A tibble: 2 x 13
ID `1_a_p.high` `1_a_p.low` `1_a_p.mid` `1_b_p.high` `1_b_p.low` `1_b_p.mid` `2_a_p.high` `2_a_p.low`
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 160 96.2 128 423 254 338 209 126
2 b 120 72.0 96.0 20.9 12.5 16.7 133 79.5
# ... with 4 more variables: `2_a_p.mid` <dbl>, `2_b_p.high` <dbl>, `2_b_p.low` <dbl>, `2_b_p.mid` <dbl>
R - tidyr - mutate and spread multiple columns
Rather than spread(), you can use the new pivot_wider() that was added in the tidyr 1.0.0 release. It has a values_from argument that allows you to specify multiple columns at once:
library(dplyr)
library(tidyr)
my_df_test %>%
  group_by(V1, V2) %>%
  mutate(new = V3, V3 = toString(V3)) %>%
  pivot_wider(
    names_from = new,
    values_from = c(V6, V7)
  )
#> # A tibble: 2 x 9
#> # Groups: V1, V2 [4]
#> V1 V2 V3 V4 V5 V6_S1 V6_S2 V7_S1 V7_S2
#> <dbl> <fct> <chr> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 1 A S1, S2 x y A C D F
#> 2 2 B S1 x y B <NA> E <NA>
Created on 2019-09-18 by the reprex package (v0.3.0)
tidyr::spread() with multiple keys and values
Reshaping with multiple value variables can best be done with dcast from data.table or reshape from base R.
library(data.table)
out <- dcast(setDT(df), id ~ paste0("time", time), value.var = c("x", "y"), sep = "")
out
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# 1: 1 0.4334921 -0.5205570 -1.44364515 0.49288757 -1.26955148 -0.83344256
# 2: 2 0.4785870 0.9261711 0.68173681 1.24639813 0.91805332 0.34346260
# 3: 3 -1.2067665 1.7309593 0.04923993 1.28184341 -0.69435556 0.01609261
# 4: 4 0.5240518 0.7481787 0.07966677 -1.36408357 1.72636849 -0.45827205
# 5: 5 0.3733316 -0.3689391 -0.11879819 -0.03276689 0.91824437 2.18084692
# 6: 6 0.2363018 -0.2358572 0.73389984 -1.10946940 -1.05379502 -0.82691626
# 7: 7 -1.4979165 0.9026397 0.84666801 1.02138768 -0.01072588 0.08925716
# 8: 8 0.3428946 -0.2235349 -1.21684977 0.40549497 0.68937085 -0.15793111
# 9: 9 -1.1304688 -0.3901419 -0.10722222 -0.54206830 0.34134397 0.48504564
#10: 10 -0.5275251 -1.1328937 -0.68059800 1.38790593 0.93199593 -1.77498807
Using reshape, we could do:
# setDF(df) # in case df is a data.table now
reshape(df, idvar = "id", timevar = "time", direction = "wide")
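For a quick feel of what the reshape call returns, here is a small sketch on invented data (two ids, three time points; the numbers are placeholders). Base reshape names the wide columns by joining the value name and the time with ".", so x becomes x.1, x.2 and x.3.

```r
# Hypothetical long data with two value columns, x and y
df <- data.frame(
  id   = rep(1:2, each = 3),
  time = rep(1:3, times = 2),
  x    = c(1, 2, 3, 4, 5, 6),
  y    = c(10, 20, 30, 40, 50, 60)
)

# One row per id, one column per value/time combination
wide <- reshape(df, idvar = "id", timevar = "time", direction = "wide")
wide
```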
Spread multiple columns in a function
We'll return to the answer provided in the question linked to, but for the moment let's start with a more naive approach.
One idea would be to spread each value column individually, and then join the results, i.e.
library(dplyr)
library(tidyr)
library(tibble)
dat_avg <- dat %>%
  select(-sd) %>%
  spread(key = grp, value = avg) %>%
  rename(a_avg = a,
         b_avg = b)
dat_sd <- dat %>%
  select(-avg) %>%
  spread(key = grp, value = sd) %>%
  rename(a_sd = a,
         b_sd = b)
full_join(dat_avg,
          dat_sd,
          by = 'id')
# A tibble: 2 x 5
id a_avg b_avg a_sd b_sd
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
(I used a full_join just in case we run into situations where not all combinations of the join columns appear in all of them.)
Let's start with a function that works like spread but allows you to pass the key and value columns as characters:
spread_chr <- function(data, key_col, value_cols, fill = NA,
                       convert = FALSE, drop = TRUE, sep = NULL) {
  n_val <- length(value_cols)
  result <- vector(mode = "list", length = n_val)
  id_cols <- setdiff(names(data), c(key_col, value_cols))
  for (i in seq_along(result)) {
    result[[i]] <- spread(data = data[, c(id_cols, key_col, value_cols[i]), drop = FALSE],
                          key = !!key_col,
                          value = !!value_cols[i],
                          fill = fill,
                          convert = convert,
                          drop = drop,
                          sep = paste0(sep, value_cols[i], sep))
  }
  result %>%
    purrr::reduce(.f = full_join, by = id_cols)
}
dat %>%
  spread_chr(key_col = "grp",
             value_cols = c("avg", "sd"),
             sep = "_")
# A tibble: 2 x 5
id grp_avg_a grp_avg_b grp_sd_a grp_sd_b
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
The key ideas here are to unquote the arguments key_col and value_cols[i] using the !! operator, and to use the sep argument in spread to control the resulting value column names.
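The effect of sep can be seen in isolation. When sep is not NULL, spread names each new column as the key column's name, then sep, then the key value. The sketch below uses a made-up data frame (long is a hypothetical name) and passes "_avg_" directly, which is the string spread_chr builds for the avg column when called with sep = "_".

```r
library(tidyr)

# Hypothetical long data with a single value column
long <- data.frame(id  = c(1, 1, 2, 2),
                   grp = c("a", "b", "a", "b"),
                   avg = c(1, 2, 3, 4))

# New column names are "<key name><sep><key value>"
wide <- spread(long, key = grp, value = avg, sep = "_avg_")
names(wide)
```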
If we wanted to convert this function to accept unquoted arguments for the key and value columns, we could modify it like so:
spread_nq <- function(data, key_col, ..., fill = NA,
                      convert = FALSE, drop = TRUE, sep = NULL) {
  val_quos <- rlang::quos(...)
  key_quo <- rlang::enquo(key_col)
  value_cols <- unname(tidyselect::vars_select(names(data), !!!val_quos))
  key_col <- unname(tidyselect::vars_select(names(data), !!key_quo))
  n_val <- length(value_cols)
  result <- vector(mode = "list", length = n_val)
  id_cols <- setdiff(names(data), c(key_col, value_cols))
  for (i in seq_along(result)) {
    result[[i]] <- spread(data = data[, c(id_cols, key_col, value_cols[i]), drop = FALSE],
                          key = !!key_col,
                          value = !!value_cols[i],
                          fill = fill,
                          convert = convert,
                          drop = drop,
                          sep = paste0(sep, value_cols[i], sep))
  }
  result %>%
    purrr::reduce(.f = full_join, by = id_cols)
}
dat %>%
  spread_nq(key_col = grp, avg, sd, sep = "_")
# A tibble: 2 x 5
id grp_avg_a grp_avg_b grp_sd_a grp_sd_b
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 -0.5646982 0.6569923 0.7050648
2 2 0.3631284 0.6328626 0.4577418 0.7191123
The change here is that we capture the unquoted arguments with rlang::quos and rlang::enquo and then simply convert them back to characters using tidyselect::vars_select.
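That capture-and-convert step can be tried on its own. The helper below (cols_to_chr is a hypothetical name, not part of any package) is a minimal sketch of the same pattern, assuming rlang and tidyselect are installed and that vars_select is still exported by the installed tidyselect version.

```r
# Hypothetical helper: turn bare column names into a character vector
cols_to_chr <- function(data, ...) {
  quos <- rlang::quos(...)                      # capture the unquoted arguments
  unname(tidyselect::vars_select(names(data), !!!quos))
}

# Made-up data frame to select from
dat <- data.frame(id = 1:2, grp = c("a", "b"), avg = c(1, 2), sd = c(0.1, 0.2))

picked <- cols_to_chr(dat, avg, sd)
picked
```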
Returning to the solution in the linked question that uses a sequence of gather, unite and spread, we can use what we've learned to make a function like this:
spread_nt <- function(data, key_col, ..., fill = NA,
                      convert = TRUE, drop = TRUE, sep = "_") {
  key_quo <- rlang::enquo(key_col)
  val_quos <- rlang::quos(...)
  value_cols <- unname(tidyselect::vars_select(names(data), !!!val_quos))
  key_col <- unname(tidyselect::vars_select(names(data), !!key_quo))
  data %>%
    gather(key = ..var.., value = ..val.., !!!val_quos) %>%
    unite(col = ..grp.., c(key_col, "..var.."), sep = sep) %>%
    spread(key = ..grp.., value = ..val.., fill = fill,
           convert = convert, drop = drop, sep = NULL)
}
dat %>%
  spread_nt(key_col = grp, avg, sd, sep = "_")
# A tibble: 2 x 5
id a_avg a_sd b_avg b_sd
* <int> <dbl> <dbl> <dbl> <dbl>
1 1 1.3709584 0.6569923 -0.5646982 0.7050648
2 2 0.3631284 0.4577418 0.6328626 0.7191123
This relies on the same rlang techniques as the last example. We're using some unusual names like ..var.. for our intermediate variables in order to reduce the chances of name collisions with existing columns in our data frame.
Also, we're using the sep argument in unite to control the resulting column names, so in this case when we spread we force sep = NULL.
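Written out without the wrapper, the gather, unite and spread sequence looks like the sketch below. The data frame dat here is invented for illustration (the original dat is not shown), and plain intermediate names are used instead of ..var.. for readability.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per id and group, two value columns
dat <- data.frame(id  = c(1L, 1L, 2L, 2L),
                  grp = c("a", "b", "a", "b"),
                  avg = c(1, 2, 3, 4),
                  sd  = c(0.1, 0.2, 0.3, 0.4))

wide <- dat %>%
  gather(key = "var", value = "val", avg, sd) %>%  # stack avg and sd into one column
  unite("grp_var", grp, var, sep = "_") %>%        # build names such as "a_avg"
  spread(grp_var, val)                             # one column per group/value pair
wide
```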