Using spread with duplicate identifiers for rows
The issue is the two columns for both A and B. If we can make that a single value column, we can spread the data as you would like. Take a look at the output of jj_melt when you use the code below.
library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:
library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
# month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1: 1 17 8 11 8
# 2: 2 13 8 13 5
# 3: 3 15 6 15 11
If you would like to use the tidyr solution, combine it with dcast to summarize by sum.
jj <- as.data.frame(jj)  # setDT() above turned jj into a data.table; convert it back
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
dcast(month ~ temp, fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
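On tidyr 1.0.0 or later, the same sum can probably be obtained without dcast. The following is a minimal sketch, not part of the original answer: the jj data frame is a hypothetical reconstruction inferred from the outputs printed here, and pivot_longer()/pivot_wider() stand in for gather()/dcast().

```r
library(tidyr)

# Hypothetical reconstruction of `jj`; the values are inferred from the
# outputs printed in this answer, not taken from the original question.
jj <- data.frame(
  month   = rep(1:3, times = 4),
  student = rep(c("Amy", "Bob"), each = 6),
  A       = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
  B       = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5)
)

jj_wide <- jj %>%
  # stack A and B into one value column, like melt() / gather()
  pivot_longer(c(A, B), names_to = "variable", values_to = "value") %>%
  # one column per student/variable pair; duplicate cells are summed
  pivot_wider(names_from = c(student, variable),
              values_from = value,
              values_fn = list(value = sum))

jj_wide
```

The values_fn argument plays the role that fun=sum plays in dcast: it says how to combine several values that land in the same cell.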
Edit
Based on your new requirements, I have added an activity column.
library(dplyr)
jj %>% group_by(month, student) %>%
mutate(id=1:n()) %>%
melt(id=c("month", "id", "student")) %>%
dcast(... ~ student + variable, value.var="value")
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 1 2 8 5 5 3
# 3 2 1 7 7 2 4
# 4 2 2 6 6 6 1
# 5 3 1 6 8 1 6
# 6 3 2 9 7 5 5
The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
dcast(... ~ temp) %>%
arrange(id)
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 2 2 7 7 2 4
# 3 3 3 6 8 1 6
# 4 1 4 8 5 5 3
# 5 2 5 6 6 6 1
# 6 3 6 9 7 5 5
The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.
library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
# month activityID A_Amy A_Bob B_Amy B_Bob
# 1: 1 1 9 3 6 5
# 2: 1 4 8 5 5 3
# 3: 2 2 7 2 7 4
# 4: 2 5 6 6 6 1
# 5: 3 3 6 1 8 6
# 6: 3 6 9 5 7 5
Spread with duplicate identifiers (using tidyverse and %>%)
We can use tidyverse. After grouping by 'start_end' and 'id', create a sequence column 'ind', then spread from 'long' to 'wide' format.
library(dplyr)
library(tidyr)
df %>%
group_by(start_end, id) %>%
mutate(ind = row_number()) %>%
spread(start_end, date) %>%
select(start, end)
# id start end
#* <int> <fctr> <fctr>
#1 2 1994-05-01 1996-11-04
#2 4 1979-07-18 NA
#3 5 2005-02-01 2009-09-17
#4 5 2010-10-01 2012-10-06
Or using tidyr_1.0.0
chop(df, date) %>%
spread(start_end, date) %>%
unnest(c(start, end))
R Spread Error: Duplicate identifiers for rows
We need to create a sequence column within each Index group and then spread:
library(tidyverse)
df %>%
group_by(Index) %>%
mutate(ind = row_number()) %>%
spread(Index, confint, convert = FALSE)
NOTE: This would be an issue in the original dataset and not in the example data shown in the post.
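To make the failure and the fix concrete, here is a small sketch on made-up data. The df below is hypothetical (only the column names Index and confint come from the post); it repeats each Index value so that spread() has no unique key per output cell.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: each Index value occurs twice.
df <- tibble(Index   = c("lower", "upper", "lower", "upper"),
             confint = c(0.1, 0.9, 0.2, 0.8))

# Without a sequence column, spread() stops with a duplicate-keys error.
failed <- inherits(try(spread(df, Index, confint), silent = TRUE), "try-error")

# Numbering the rows within each Index value gives every cell a unique key.
fixed <- df %>%
  group_by(Index) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  spread(Index, confint)

failed
fixed
```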
How to spread columns with duplicate identifiers?
Right now you have two age values for Female and three for Male, and no other variables keeping them from being collapsed into a single row, as spread tries to do with values with similar/no index values:
library(tidyverse)
df <- tibble(x = c('a', 'b'), y = 1:2)  # tibble() replaces the deprecated data_frame()
df # 2 rows...
#> # A tibble: 2 x 2
#> x y
#> <chr> <int>
#> 1 a 1
#> 2 b 2
df %>% spread(x, y) # ...become one if there's only one value for each.
#> # A tibble: 1 x 2
#> a b
#> * <int> <int>
#> 1 1 2
spread doesn't apply a function to combine multiple values (à la dcast), so rows must be indexed so there's one or zero values for a location, e.g.
df <- tibble(i = c(1, 1, 2, 2, 3, 3),
x = c('a', 'b', 'a', 'b', 'a', 'b'),
y = 1:6)
df # the two rows with each `i` value here...
#> # A tibble: 6 x 3
#> i x y
#> <dbl> <chr> <int>
#> 1 1 a 1
#> 2 1 b 2
#> 3 2 a 3
#> 4 2 b 4
#> 5 3 a 5
#> 6 3 b 6
df %>% spread(x, y) # ...become one row here.
#> # A tibble: 3 x 3
#> i a b
#> * <dbl> <int> <int>
#> 1 1 1 2
#> 2 2 3 4
#> 3 3 5 6
If your values aren't indexed naturally by the other columns, you can add a unique index column (e.g. by adding the row numbers as a column), which will stop spread from trying to collapse the rows:
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L),
.Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"),
.Names = c("age", "gender"))
df %>% mutate(i = row_number()) %>% spread(gender, age)
#> # A tibble: 5 x 3
#> i Female Male
#> * <int> <chr> <chr>
#> 1 1 <NA> 21
#> 2 2 17 <NA>
#> 3 3 32 <NA>
#> 4 4 <NA> 29
#> 5 5 <NA> 15
If you want to remove it afterwards, add on select(-i). This doesn't produce a terribly useful data.frame in this case, but can be very useful in the midst of more complicated reshaping.
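If a more compact result is wanted, one option (a sketch, not part of the original answer) is to number the rows within each gender rather than across the whole table, the same grouped-index idea used in other answers on this page. The df here rebuilds the same five rows with tibble().

```r
library(dplyr)
library(tidyr)

# Same five rows as above, rebuilt with tibble() for brevity.
df <- tibble(age    = c("21", "17", "32", "29", "15"),
             gender = factor(c("Male", "Female", "Female", "Male", "Male")))

compact <- df %>%
  group_by(gender) %>%
  mutate(i = row_number()) %>%  # 1, 2, ... restarting for each gender
  ungroup() %>%
  spread(gender, age) %>%
  select(-i)                    # drop the helper index afterwards

compact
```

This pairs the first Female with the first Male and so on, so it only makes sense when that pairing is arbitrary or otherwise acceptable.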
R: Spread data.frame/tibble with shared keys and missing data
Assuming 'name' is always present for each entry, we can create an identifier column and reshape using pivot_wider.
library(dplyr)
a %>%
group_by(grp = cumsum(categories == 'name')) %>%
tidyr::pivot_wider(names_from = categories, values_from = values) %>%
ungroup %>%
select(-grp)
# name sex age weight high
# <chr> <chr> <chr> <chr> <chr>
#1 Emma female 32 72 175
#2 Jane female 28 NA 165
#3 Emma female 42 63 170
Same logic in data.table:
library(data.table)
dcast(setDT(a), cumsum(categories == 'name')~categories, value.var = 'values')
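To see why cumsum(categories == 'name') works as an identifier, it may help to look at the grouping vector on its own. The data frame a below is a hypothetical reconstruction based on the output printed above (Jane has no 'weight' row); the sketch uses base R only.

```r
# Hypothetical reconstruction of `a`, inferred from the printed result.
a <- data.frame(
  categories = c("name", "sex", "age", "weight", "high",
                 "name", "sex", "age", "high",
                 "name", "sex", "age", "weight", "high"),
  values     = c("Emma", "female", "32", "72", "175",
                 "Jane", "female", "28", "165",
                 "Emma", "female", "42", "63", "170"),
  stringsAsFactors = FALSE
)

# Each 'name' row adds 1 to the running total, so all rows belonging
# to the same person share one group number.
grp <- cumsum(a$categories == "name")

grp
table(grp)  # rows per person: 5, 4 and 5
```

Because the group number increases at every 'name' row, the two entries called Emma stay separate even though their names are identical.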
Turning variable values into column names; duplicate identifiers for rows in tidyr::spread
tidyr provides some syntax for dealing with this problem.
# set up
library(dplyr)
library(tidyr)
dat <- tibble(
id = factor(c("A","B","C","D","E")),
demographic_info1 = round(rnorm(5),2),
demographic_info2 = round(rnorm(5),2),
election_1 = c(NA,"GN2016","GN2016","SE2016","GN2008"),
election_2 = c(NA,"MT2014","GN2012","GN2016","GN2004"),
election_3 = c(NA,NA,NA,"MT2014","GN2000"),
election_4 = c(NA,NA,NA,"GN2012",NA),
election_5 = c(NA,NA,NA,"MT2010",NA)
)
What we eventually want is a TRUE or FALSE for every voter (5) x election (8) pairing. When we gather the data into a long format, we only see the voter x election combinations that exist in the data set.
d_votes <- dat %>%
gather("variable", "election", election_1:election_5) %>%
select(-variable) %>%
mutate(voted = TRUE)
d_votes
#> # A tibble: 25 x 5
#> id demographic_info1 demographic_info2 election voted
#> <fctr> <dbl> <dbl> <chr> <lgl>
#> 1 A 0.76 -0.23 <NA> TRUE
#> 2 B -0.80 0.08 GN2016 TRUE
#> 3 C -0.33 1.60 GN2016 TRUE
#> 4 D -0.50 -1.27 SE2016 TRUE
#> 5 E -1.03 0.59 GN2008 TRUE
#> 6 A 0.76 -0.23 <NA> TRUE
#> 7 B -0.80 0.08 MT2014 TRUE
#> 8 C -0.33 1.60 GN2012 TRUE
#> 9 D -0.50 -1.27 GN2016 TRUE
#> 10 E -1.03 0.59 GN2004 TRUE
#> # ... with 15 more rows
count(d_votes, election)
#> # A tibble: 9 x 2
#> election n
#> <chr> <int>
#> 1 GN2000 1
#> 2 GN2004 1
#> 3 GN2008 1
#> 4 GN2012 2
#> 5 GN2016 3
#> 6 MT2010 1
#> 7 MT2014 2
#> 8 SE2016 1
#> 9 <NA> 13
We need to generate every combination of voter and election. tidyr's expand() function creates all combinations of variables from different columns/vectors of data. (It works like the base function expand.grid(), so the name expand() is evocative.)
d_possible_votes <- d_votes %>%
expand(nesting(id, demographic_info1, demographic_info2),
election)
d_possible_votes
#> # A tibble: 40 x 4
#> id demographic_info1 demographic_info2 election
#> <fctr> <dbl> <dbl> <chr>
#> 1 A 0.76 -0.23 GN2000
#> 2 A 0.76 -0.23 GN2004
#> 3 A 0.76 -0.23 GN2008
#> 4 A 0.76 -0.23 GN2012
#> 5 A 0.76 -0.23 GN2016
#> 6 A 0.76 -0.23 MT2010
#> 7 A 0.76 -0.23 MT2014
#> 8 A 0.76 -0.23 SE2016
#> 9 B -0.80 0.08 GN2000
#> 10 B -0.80 0.08 GN2004
#> # ... with 30 more rows
Note that we now have 8 elections x 5 ids = 40 rows.
We used the nesting() function to treat each (id, demographic_info1, demographic_info2) set/row as a single unit; demographics are nested within ids. Expanding provided all 40 combinations of (id, demographic_info1, demographic_info2) x election.
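The effect of nesting() is easier to see on a tiny example. This sketch uses made-up data (the obs tibble and its info column are hypothetical) and compares expanding with and without nesting().

```r
library(tidyr)

# Hypothetical data: two voters, each with one fixed `info` value.
obs <- tibble(id       = c("A", "A", "B"),
              info     = c(0.5, 0.5, -1.2),
              election = c("GN2012", "GN2016", "GN2016"))

# Crossing everything pairs each id with every info value:
# 2 ids x 2 info values x 2 elections = 8 rows, half of them meaningless.
crossed <- expand(obs, id, info, election)

# nesting() keeps only the (id, info) pairs that occur in the data:
# 2 pairs x 2 elections = 4 rows.
nested <- expand(obs, nesting(id, info), election)

nrow(crossed)
nrow(nested)
```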
If we join the observed votes onto the possible votes, the voted column is populated with TRUE or NA values. tidyr's replace_na() function can correct those NA values.
d_possible_votes <- d_possible_votes %>%
left_join(d_votes) %>%
replace_na(list(voted = FALSE))
#> Joining, by = c("id", "demographic_info1", "demographic_info2", "election")
d_possible_votes
#> # A tibble: 40 x 5
#> id demographic_info1 demographic_info2 election voted
#> <fctr> <dbl> <dbl> <chr> <lgl>
#> 1 A 0.76 -0.23 GN2000 FALSE
#> 2 A 0.76 -0.23 GN2004 FALSE
#> 3 A 0.76 -0.23 GN2008 FALSE
#> 4 A 0.76 -0.23 GN2012 FALSE
#> 5 A 0.76 -0.23 GN2016 FALSE
#> 6 A 0.76 -0.23 MT2010 FALSE
#> 7 A 0.76 -0.23 MT2014 FALSE
#> 8 A 0.76 -0.23 SE2016 FALSE
#> 9 B -0.80 0.08 GN2000 FALSE
#> 10 B -0.80 0.08 GN2004 FALSE
#> # ... with 30 more rows
Now, we can spread out the elections and achieve the desired dataframe.
spread(d_possible_votes, election, voted)
#> # A tibble: 5 x 11
#> id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A 0.76 -0.23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 2 B -0.80 0.08 FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
#> 3 C -0.33 1.60 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
#> 4 D -0.50 -1.27 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#> 5 E -1.03 0.59 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
This pattern of generating combinations of identifiers, joining actual data, and correcting missing values is very common, so much so that tidyr includes a function, complete(), to do all three at once.
d_votes %>%
complete(nesting(id, demographic_info1, demographic_info2),
election, fill = list(voted = FALSE)) %>%
spread(election, voted)
#> # A tibble: 5 x 11
#> id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A 0.76 -0.23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 2 B -0.80 0.08 FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
#> 3 C -0.33 1.60 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
#> 4 D -0.50 -1.27 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#> 5 E -1.03 0.59 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
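For a quick check of what complete() does on its own, here is a reduced sketch on hypothetical data (the votes tibble is made up and has no demographic columns, so nesting() is not needed).

```r
library(tidyr)

# Hypothetical observed votes: A voted once, B voted twice.
votes <- tibble(id       = c("A", "B", "B"),
                election = c("GN2016", "GN2012", "GN2016"),
                voted    = TRUE)

# complete() adds the missing (id, election) pair and fills `voted`
# with FALSE instead of NA.
filled <- complete(votes, id, election, fill = list(voted = FALSE))

# With exactly one row per pair, spreading is unambiguous.
wide <- spread(filled, election, voted)

filled
wide
```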
How to spread or cast a single column to several - Error: Duplicate identifiers for rows
We can create an ID column for each experiment group to overcome this issue.
library(dplyr)
library(tidyr)
df2 <- df %>%
arrange(experiment, mod) %>%
group_by(experiment) %>%
mutate(ID = 1:n()) %>%
spread(ID, mod) %>%
ungroup()
df2
# # A tibble: 4 x 3
# experiment `1` `2`
# <fct> <fct> <fct>
# 1 ex1 mod1 mod7
# 2 ex2 mod8 NA
# 3 ex3 mod1 NA
# 4 ex7 mod3 mod9