Spread vs dcast
Let's say you were starting with data that looked like this:
mydf
# Weight Height Fitted interval limit value
# 1 42 153.4 51.0792 conf lwr 49.15463
# 2 42 153.4 51.0792 pred lwr 32.82122
# 3 42 153.4 51.0792 conf upr 53.00376
# 4 42 153.4 51.0792 pred upr 69.33717
# 5 42 153.4 51.0792 conf lwr 60.00000
# 6 42 153.4 51.0792 pred lwr 90.00000
Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.
tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)
As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise
for the "value" column.
If you stopped the execution at the summarise
step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread
would work as expected.
mydf %>%
group_by(Weight, Height, Fitted, interval, limit) %>%
summarise(value = mean(value)) %>%
spread(limit, value)
# Source: local data frame [2 x 6]
#
# Weight Height Fitted interval lwr upr
# (dbl) (dbl) (dbl) (chr) (dbl) (dbl)
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
This matches the expected output from dcast
with fun.aggregate = mean
.
reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
# Weight Height Fitted interval lwr upr
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
Sample data:
mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4,
153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,
51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",
"conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,
2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),
value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,
90)), .Names = c("Weight", "Height", "Fitted", "interval",
"limit", "value"), row.names = c(NA, 6L), class = "data.frame")
Spread or dcast and fill in counts
I think you want to count the features and not sum
them. Try with the function length
.
tidyr::pivot_wider(df, names_from = feature,
values_from = feature, values_fn = length, values_fill = 0)
Or with dcast
.
library(data.table)
dcast(setDT(df), id~feature, value.var = 'feature', fun.aggregate = length)
In base R, using table(df)
would give the same output.
table(df)
# feature
#id f1 f2 f3 f4 f5
# id1 1 0 1 1 0
# id10 1 0 1 1 0
# id2 1 1 0 0 1
# id3 0 1 1 1 0
# id4 1 0 1 0 1
# id5 1 1 0 0 1
# id6 1 1 1 0 0
# id7 1 0 0 1 1
# id8 1 1 0 0 1
# id9 0 1 0 1 1
Is it possible to use spread on multiple columns in tidyr similar to dcast?
One option would be to create a new 'Prod_Count' by joining the 'Product' and 'Country' columns by paste
, remove those columns with the select
and reshape from 'long' to 'wide' using spread
from tidyr
.
library(dplyr)
library(tidyr)
sdt %>%
mutate(Prod_Count=paste(Product, Country, sep="_")) %>%
select(-Product, -Country)%>%
spread(Prod_Count, value)%>%
head(2)
# Year A_AI B_EI
#1 1990 0.7878674 0.2486044
#2 1991 0.2343285 -1.1694878
Or we can avoid a couple of steps by using unite
from tidyr
(from @beetroot's comment) and reshape as before.
sdt%>%
unite(Prod_Count, Product,Country) %>%
spread(Prod_Count, value)%>%
head(2)
# Year A_AI B_EI
# 1 1990 0.7878674 0.2486044
# 2 1991 0.2343285 -1.1694878
Spread and dcast not aligning correctly
Your issue is with precision with your lat
& long
values
> unique(dat$long)
# [1] -73.45833 -73.45833
> unique(dat$lat)
# [1] 42.04167 42.04167
## Notice two 'unique' values that are 'printed' the same
If you look at your structure
data closely you will see you have slightly different values in your lat
and long
For example, rounding the lat/long values removes the precision 'error'
dat$lat <- round(dat$lat, 4)
dat$long <- round(dat$long, 4)
spread(dat, element, value)
## now tmin is aligned
#gridNumber fips cropArea state county_name long lat year month ppt tmax tmean tmin
#(int) (int) (dbl) (fctr) (chr) (dbl) (dbl) (int) (chr) (dbl) (dbl) (dbl) (dbl)
#1 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 apr 6.642506 53.294 43.052 32.810
#2 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 aug 2.519680 76.550 66.551 56.552
#3 266783 9005 0 CT Litchfield County -73.4583 42.0417 1996 dec NA NA NA 26.132
Replacing dplyr's spread by pivot_wider and gather by pivot_longer
This seems to be another instance of this issue on Github, which is supposed to be fixed in the dev version of tidyr
. After updating tidyr
(i.e.devtools::install_github("tidyverse/tidyr")
) I get comparable performance on your example:
library(tidyverse)
dates <- seq(from = as.Date("1975-01-01"), to = as.Date("2019-10-31"), by = "months")
returndata <- tibble(stock = sort(rep(letters, length(dates))),
month = rep(dates, length(letters)),
ret = runif(length(dates) * length(letters)) - 0.5)
bench::mark(
spread = returndata %>% spread(stock, ret),
pivot_wider = returndata %>% pivot_wider(names_from = stock, values_from = ret)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spread 8.83ms 9.57ms 100. 0B 6.39
#> 2 pivot_wider 10.96ms 11.37ms 86.1 0B 4.42
Created on 2019-11-25 by the reprex package (v0.3.0)
R pivot_wider or dcast to cast multiple columns
We can use pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
mutate(rn = rowid(NAME)) %>%
pivot_wider(names_from = rn, values_from = c(CATEGORY, NUMBER_CATEGORY))
-output
# A tibble: 2 x 7
# NAME CATEGORY_1 CATEGORY_2 CATEGORY_3 NUMBER_CATEGORY_1 NUMBER_CATEGORY_2 NUMBER_CATEGORY_3
# <chr> <chr> <chr> <chr> <int> <int> <int>
#1 Amy Low Med <NA> 180 185 NA
#2 John Low Med Med 118 182 185
data
df1 <- structure(list(NAME = c("Amy", "Amy", "John", "John", "John"),
CATEGORY = c("Low", "Med", "Low", "Med", "Med"), NUMBER_CATEGORY = c(180L,
185L, 118L, 182L, 185L)), class = "data.frame", row.names = c(NA,
-5L))
how to spread or cast multiple values in r
We could do this using dplyr/tidyr
. We reshape the 'data' from 'wide' to 'long' format with gather
specifying the columns (starts_with('value')
) to be combined to a key/value column pair ('Var/Val'), unite
the 'Var' and 'y' column to create a single 'Var1' column, and reconvert back to 'wide' format with spread
.
library(dplyr)
library(tidyr)
data %>%
gather(Var, val, starts_with("value")) %>%
unite(Var1,Var, y) %>%
spread(Var1, val)
# x value.1_a value.1_b value.1_c value.1_d value.2_a value.2_b value.2_c
#1 blue 5 6 7 8 17 18 19
#2 green 9 10 11 12 21 22 23
#3 red 1 2 3 4 13 14 15
# value.2_d
#1 20
#2 24
#3 16
Update
(After 6 months)
Reshaping multiple value columns to wide is now possible with dcast
from data.table_1.9.5
without using the melt
. We can install the devel version from here
library(data.table)
dcast(setDT(data), x~y, value.var=c('value.1', 'value.2'))
# x a_value.1 b_value.1 c_value.1 d_value.1 a_value.2 b_value.2 c_value.2
#1: blue 5 6 7 8 17 18 19
#2: green 9 10 11 12 21 22 23
#3: red 1 2 3 4 13 14 15
# d_value.2
#1: 20
#2: 24
#3: 16
R spread vs gather in tidyr
You can use a combined gather
+ spread
approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value)
# A tibble: 10 x 13
# person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.67176351 1.28429706 -0.8694723 -0.47279938 1.090681730 0.5774694
# 2 2 -2.35492691 -0.61835955 1.4448554 0.05972074 4.375377235 1.9274817
# 3 3 -0.09170997 0.03945743 2.7310495 0.92420153 -0.116850493 -0.2455936
# 4 4 -0.36285183 -0.44860832 1.0175425 0.65888418 1.953383260 0.1261048
# 5 5 -1.41628434 -0.96177712 -4.7126717 -1.98980726 -1.690035639 -0.5593383
# 6 6 2.05218144 0.70247190 1.9712327 0.66024130 2.622505813 1.2980212
# 7 7 -2.12568428 -0.45522204 -0.5644579 0.08963695 -0.008373791 0.7194067
# 8 8 -2.77742515 -1.23154913 0.5076592 -0.82839994 1.841929380 0.9694145
# 9 9 -0.55830618 -0.79723499 -0.3169969 -0.83807424 -0.675899869 -0.8146971
#10 10 -1.24415955 -0.70973496 -3.3176237 -1.65919710 2.085066600 0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
# E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>
If you need to rename the columns:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value) %>%
rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))
data.table is copied when using spread()?
I think using dcast
is the way to go. However, a possible solution using tidyr::spread
would be to add setDT()
to the piped call, i.e.,
set.seed(123)
# install.packages(c("data.table"), dependencies = TRUE)
library(data.table)
mydt <- data.table(id = 1:100, x = sample(LETTERS[1:6], size = 100, replace = TRUE),
group = paste0("group", sample(1:3, size = 100, replace = TRUE)),
prob = runif(100, 0, 1)
)
class(mydt)
mydt2 <- mydt %>% tidyr::spread(group, prob) %>% setDT()
mydt2[!is.na(group1), new.col := x]
Related Topics
R/Quantmod: Multiple Charts All Using the Same Y-Axis
Remove Duplicate Values Based on 2 Columns
Lme4::Glmer VS. Stata's Melogit Command
Changing Word Template for Knitr in Rmarkdown
How to Control Ggplot's Plotting Area Proportions Instead of Fitting Them to Devices in R
Get Stack Trace on Trycatch'Ed Error in R
Replace Blank Cells with Character
Igraph Axes Xlim Ylim Plot Incorrectly
Porting Set Operations from R's Data Frames to Data Tables: How to Identify Duplicated Rows
How to Save Output from Ggforce::Facet_Grid_Paginate in Only One PDF
Easiest Way to Discretize Continuous Scales for Ggplot2 Color Scales
R: Creating a Map of Selected Canadian Provinces and U.S. States
How to Edit and Save Changes Made on Shiny Datatable Using Dt Package
Image in R Leaflet Marker Popups
R: Merge Based on Multiple Conditions (With Non-Equal Criteria)
Ggplot2:How to Reduce the Width and the Space Between Bars with Geom_Bar