Spread VS Dcast

Spread vs dcast

Let's say you were starting with data that looked like this:

mydf
# Weight Height Fitted interval limit value
# 1 42 153.4 51.0792 conf lwr 49.15463
# 2 42 153.4 51.0792 pred lwr 32.82122
# 3 42 153.4 51.0792 conf upr 53.00376
# 4 42 153.4 51.0792 pred upr 69.33717
# 5 42 153.4 51.0792 conf lwr 60.00000
# 6 42 153.4 51.0792 pred lwr 90.00000

Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise for the "value" column.

If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

mydf %>% 
group_by(Weight, Height, Fitted, interval, limit) %>%
summarise(value = mean(value)) %>%
spread(limit, value)
# Source: local data frame [2 x 6]
#
# Weight Height Fitted interval lwr upr
# (dbl) (dbl) (dbl) (chr) (dbl) (dbl)
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
# Weight Height Fitted interval lwr upr
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,
51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",
"conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,
2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),
value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,
90)), .Names = c("Weight", "Height", "Fitted", "interval",
"limit", "value"), row.names = c(NA, 6L), class = "data.frame")

Spread or dcast and fill in counts

I think you want to count the features and not sum them. Try with the function length.

tidyr::pivot_wider(df, names_from = feature, 
values_from = feature, values_fn = length, values_fill = 0)

Or with dcast.

library(data.table)
dcast(setDT(df), id~feature, value.var = 'feature', fun.aggregate = length)

In base R, using table(df) would give the same output.

table(df)

# feature
#id f1 f2 f3 f4 f5
# id1 1 0 1 1 0
# id10 1 0 1 1 0
# id2 1 1 0 0 1
# id3 0 1 1 1 0
# id4 1 0 1 0 1
# id5 1 1 0 0 1
# id6 1 1 1 0 0
# id7 1 0 0 1 1
# id8 1 1 0 0 1
# id9 0 1 0 1 1

Is it possible to use spread on multiple columns in tidyr similar to dcast?

One option would be to create a new 'Prod_Count' by joining the 'Product' and 'Country' columns by paste, remove those columns with the select and reshape from 'long' to 'wide' using spread from tidyr.

 library(dplyr)
library(tidyr)
sdt %>%
mutate(Prod_Count=paste(Product, Country, sep="_")) %>%
select(-Product, -Country)%>%
spread(Prod_Count, value)%>%
head(2)
# Year A_AI B_EI
#1 1990 0.7878674 0.2486044
#2 1991 0.2343285 -1.1694878

Or we can avoid a couple of steps by using unite from tidyr (from @beetroot's comment) and reshape as before.

 sdt%>% 
unite(Prod_Count, Product,Country) %>%
spread(Prod_Count, value)%>%
head(2)
# Year A_AI B_EI
# 1 1990 0.7878674 0.2486044
# 2 1991 0.2343285 -1.1694878

Spread and dcast not aligning correctly

Your issue is with precision with your lat & long values

> unique(dat$long)
# [1] -73.45833 -73.45833
> unique(dat$lat)
# [1] 42.04167 42.04167

## Notice two 'unique' values that are 'printed' the same

If you look at your structure data closely you will see you have slightly different values in your lat and long

For example, rounding the lat/long values removes the precision 'error'

dat$lat  <- round(dat$lat, 4)
dat$long <- round(dat$long, 4)

spread(dat, element, value)

## now tmin is aligned

#gridNumber fips cropArea state county_name long lat year month ppt tmax tmean tmin
#(int) (int) (dbl) (fctr) (chr) (dbl) (dbl) (int) (chr) (dbl) (dbl) (dbl) (dbl)
#1 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 apr 6.642506 53.294 43.052 32.810
#2 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 aug 2.519680 76.550 66.551 56.552
#3 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 dec NA NA NA 26.132

Replacing dplyr's spread by pivot_wider and gather by pivot_longer

This seems to be another instance of this issue on Github, which is supposed to be fixed in the dev version of tidyr. After updating tidyr (i.e.devtools::install_github("tidyverse/tidyr")) I get comparable performance on your example:

library(tidyverse)

dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")

returndata <- tibble(stock = sort(rep(letters, length(dates))),
month = rep(dates, length(letters)),
ret = runif(length(dates) * length(letters)) - 0.5)

bench::mark(
spread = returndata %>% spread(stock, ret),
pivot_wider = returndata %>% pivot_wider(names_from = stock, values_from = ret)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spread 8.83ms 9.57ms 100. 0B 6.39
#> 2 pivot_wider 10.96ms 11.37ms 86.1 0B 4.42

Created on 2019-11-25 by the reprex package (v0.3.0)

R pivot_wider or dcast to cast multiple columns

We can use pivot_wider

library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
mutate(rn = rowid(NAME)) %>%
pivot_wider(names_from = rn, values_from = c(CATEGORY, NUMBER_CATEGORY))

-output

# A tibble: 2 x 7
# NAME CATEGORY_1 CATEGORY_2 CATEGORY_3 NUMBER_CATEGORY_1 NUMBER_CATEGORY_2 NUMBER_CATEGORY_3
# <chr> <chr> <chr> <chr> <int> <int> <int>
#1 Amy Low Med <NA> 180 185 NA
#2 John Low Med Med 118 182 185

data

df1 <- structure(list(NAME = c("Amy", "Amy", "John", "John", "John"), 
CATEGORY = c("Low", "Med", "Low", "Med", "Med"), NUMBER_CATEGORY = c(180L,
185L, 118L, 182L, 185L)), class = "data.frame", row.names = c(NA,
-5L))

how to spread or cast multiple values in r

We could do this using dplyr/tidyr. We reshape the 'data' from 'wide' to 'long' format with gather specifying the columns (starts_with('value')) to be combined to a key/value column pair ('Var/Val'), unite the 'Var' and 'y' column to create a single 'Var1' column, and reconvert back to 'wide' format with spread.

 library(dplyr)
library(tidyr)
data %>%
gather(Var, val, starts_with("value")) %>%
unite(Var1,Var, y) %>%
spread(Var1, val)

# x value.1_a value.1_b value.1_c value.1_d value.2_a value.2_b value.2_c
#1 blue 5 6 7 8 17 18 19
#2 green 9 10 11 12 21 22 23
#3 red 1 2 3 4 13 14 15
# value.2_d
#1 20
#2 24
#3 16

Update

(After 6 months)

Reshaping multiple value columns to wide is now possible with dcast from data.table_1.9.5 without using the melt. We can install the devel version from here

 library(data.table)
dcast(setDT(data), x~y, value.var=c('value.1', 'value.2'))
# x a_value.1 b_value.1 c_value.1 d_value.1 a_value.2 b_value.2 c_value.2
#1: blue 5 6 7 8 17 18 19
#2: green 9 10 11 12 21 22 23
#3: red 1 2 3 4 13 14 15
# d_value.2
#1: 20
#2: 24
#3: 16

R spread vs gather in tidyr

You can use a combined gather + spread approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:

resultsData %>% 
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value)

# A tibble: 10 x 13
# person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.67176351 1.28429706 -0.8694723 -0.47279938 1.090681730 0.5774694
# 2 2 -2.35492691 -0.61835955 1.4448554 0.05972074 4.375377235 1.9274817
# 3 3 -0.09170997 0.03945743 2.7310495 0.92420153 -0.116850493 -0.2455936
# 4 4 -0.36285183 -0.44860832 1.0175425 0.65888418 1.953383260 0.1261048
# 5 5 -1.41628434 -0.96177712 -4.7126717 -1.98980726 -1.690035639 -0.5593383
# 6 6 2.05218144 0.70247190 1.9712327 0.66024130 2.622505813 1.2980212
# 7 7 -2.12568428 -0.45522204 -0.5644579 0.08963695 -0.008373791 0.7194067
# 8 8 -2.77742515 -1.23154913 0.5076592 -0.82839994 1.841929380 0.9694145
# 9 9 -0.55830618 -0.79723499 -0.3169969 -0.83807424 -0.675899869 -0.8146971
#10 10 -1.24415955 -0.70973496 -3.3176237 -1.65919710 2.085066600 0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
# E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>

If you need to rename the columns:

resultsData %>% 
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value) %>%
rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))

data.table is copied when using spread()?

I think using dcast is the way to go. However, a possible solution using tidyr::spread would be to add setDT() to the piped call, i.e.,

set.seed(123)

# install.packages(c("data.table"), dependencies = TRUE)
library(data.table)

mydt <- data.table(id = 1:100, x = sample(LETTERS[1:6], size = 100, replace = TRUE),
group = paste0("group", sample(1:3, size = 100, replace = TRUE)),
prob = runif(100, 0, 1)
)

class(mydt)
mydt2 <- mydt %>% tidyr::spread(group, prob) %>% setDT()

mydt2[!is.na(group1), new.col := x]


Related Topics



Leave a reply



Submit