Using spread with duplicate identifiers for rows
The issue is that the values sit in two columns, A and B. If we can collapse those into one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.
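library(reshape2)
# Stack A and B into a single value column, keeping month and student as ids
jj_melt <- melt(jj, id.vars = c("month", "student"))
# Cast back to wide, summing the duplicated month/student/variable cells
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var = "value", fun.aggregate = sum)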
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
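To make the steps above reproducible, here is a minimal sketch with a stand-in for jj. The original data frame was not shown, so the values below are reconstructed from the outputs printed in this answer and the row order is an assumption.

```r
library(reshape2)

# Hypothetical stand-in for `jj`, rebuilt from the printed outputs in this
# answer; the row order of the real data is an assumption.
jj <- data.frame(
  month   = rep(1:3, times = 4),
  student = rep(c("Amy", "Bob"), each = 6),
  A       = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
  B       = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5)
)

# Long format: A and B are stacked into a single `value` column, and
# `variable` records which column each value came from.
jj_melt <- melt(jj, id.vars = c("month", "student"))

# Each month/student/variable combination occurs twice, so an aggregation
# function is needed to reduce it to a single cell.
jj_spread <- dcast(jj_melt, month ~ student + variable,
                   value.var = "value", fun.aggregate = sum)
jj_spread
```

With this stand-in, jj_melt has 24 rows (12 rows times 2 measured columns), and the cast collapses them to one row per month.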
I won't mark this as a duplicate, since the other question did not summarize by sum, but the data.table answer there could help with one additional argument, fun=sum:
library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
# month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1: 1 17 8 11 8
# 2: 2 13 8 13 5
# 3: 3 15 6 15 11
If you would like to use the tidyr solution, combine it with dcast to summarize by sum.
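jj <- as.data.frame(jj)  # assign the result; setDT() above turned jj into a data.table
library(tidyr)
jj %>%
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  # call reshape2's dcast explicitly, since data.table masks it once loaded
  reshape2::dcast(month ~ temp, value.var = "value", fun.aggregate = sum)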
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
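As a possible alternative that stays within tidyr, the sketch below uses pivot_longer and pivot_wider, the functions that superseded gather and spread. This is not the approach from the answer above: it assumes tidyr 1.1.0 or later (where values_fn accepts a single function) and uses a hypothetical jj reconstructed from the printed outputs.

```r
library(tidyr)

# Hypothetical stand-in for `jj`, rebuilt from the printed outputs in this
# answer; the row order of the real data is an assumption.
jj <- data.frame(
  month   = rep(1:3, times = 4),
  student = rep(c("Amy", "Bob"), each = 6),
  A       = c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
  B       = c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5)
)

jj_wide <- jj %>%
  # stack A and B into one value column
  pivot_longer(c(A, B), names_to = "variable", values_to = "value") %>%
  # build column names such as Amy_A from student and variable, and sum the
  # duplicated cells; without values_fn, pivot_wider() warns and returns
  # list-columns instead
  pivot_wider(names_from = c(student, variable),
              values_from = value,
              values_fn = sum)
jj_wide
```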
Edit
Based on your new requirements, I have added an activity column.
library(dplyr)
jj %>% group_by(month, student) %>%
mutate(id=1:n()) %>%
melt(id=c("month", "id", "student")) %>%
dcast(... ~ student + variable, value.var="value")
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 1 2 8 5 5 3
# 3 2 1 7 7 2 4
# 4 2 2 6 6 6 1
# 5 3 1 6 8 1 6
# 6 3 2 9 7 5 5
The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:
library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
dcast(... ~ temp) %>%
arrange(id)
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 2 2 7 7 2 4
# 3 3 3 6 8 1 6
# 4 1 4 8 5 5 3
# 5 2 5 6 6 6 1
# 6 3 6 9 7 5 5
The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.
library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
# month activityID A_Amy A_Bob B_Amy B_Bob
# 1: 1 1 9 3 6 5
# 2: 1 4 8 5 5 3
# 3: 2 2 7 2 7 4
# 4: 2 5 6 6 6 1
# 5: 3 3 6 1 8 6
# 6: 3 6 9 5 7 5
Spread with duplicate identifiers for rows
In order for spread to work as intended, the resulting data frame must have uniquely identified rows and columns. In the case of your data, the "date" column is the only unique identifier after spreading. However, rows 36 and 38 are identical:
date tmin state
36 2018-01-03 -3 OH
38 2018-01-03 -3 OH
This puts tidyr in the impossible position of trying to resolve two data points to the same row and column. In addition, rows 35 and 37 both have the same date and state, once again creating the impossible situation of placing two different values into the same position in the new data frame:
date tmin state
35 2018-01-03 NA UT
37 2018-01-03 22 UT
The following data cleanup will make spreading possible:
df %>%
filter(!is.na(tmin)) %>% # remove NA values
unique %>% # remove duplicated rows
spread(state, tmin)
date OH UT
1 2018-01-02 -4 24
2 2018-01-03 -3 22
3 2018-01-04 11 19
4 2018-01-05 3 23
5 2018-01-06 0 29
...
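Before cleaning, it can help to list which date/state pairs actually collide. The following is a minimal sketch on a hypothetical miniature of the data: the column names come from the question, but the rows are invented to mimic the two problems described above.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the data: one exact duplicate (OH, 2018-01-03)
# and one NA that shares a date/state pair with a real value (UT, 2018-01-03).
df <- data.frame(
  date  = as.Date(c("2018-01-02", "2018-01-02", "2018-01-03",
                    "2018-01-03", "2018-01-03", "2018-01-03")),
  tmin  = c(-4, 24, NA, -3, 22, -3),
  state = c("OH", "UT", "UT", "OH", "UT", "OH")
)

# Every date/state pair that occurs more than once would collide in spread().
dupes <- df %>%
  count(date, state) %>%
  filter(n > 1)
dupes

# After dropping NAs and exact duplicates, each pair is unique again.
wide <- df %>%
  filter(!is.na(tmin)) %>%
  distinct() %>%
  spread(state, tmin)
wide
```

If dupes still has rows after the cleanup, the remaining collisions hold genuinely different values and need an explicit rule (for example, a mean or a first value) rather than deduplication.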
R Spread Error: Duplicate identifiers for rows
We need to create a sequence column and then spread:
library(tidyverse)
df %>%
group_by(Index) %>%
mutate(ind = row_number()) %>%
spread(Index, confint, convert = FALSE)
NOTE: This would be an issue in the original dataset, not in the example data shown in the post.
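A small worked example of the sequence-column idea, assuming a hypothetical data frame: the column names Index and confint follow the code above, but the values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: each Index value appears twice, so spread() alone
# would fail with a duplicate-identifiers error.
df <- data.frame(
  Index   = c("lower", "upper", "lower", "upper"),
  confint = c(0.1, 0.9, 0.2, 0.8)
)

wide <- df %>%
  group_by(Index) %>%
  mutate(ind = row_number()) %>%  # 1, 2, ... within each Index value
  ungroup() %>%
  spread(Index, confint)
wide
```

The ind column pairs the first "lower" with the first "upper", the second with the second, and so on, which gives spread one value per cell.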
How to spread columns with duplicate identifiers?
Right now you have two age values for Female and three for Male, and no other variables keeping them from being collapsed into a single row, as spread tries to do with values with similar/no index values:
library(tidyverse)
df <- tibble(x = c('a', 'b'), y = 1:2)  # tibble() replaces the deprecated data_frame()
df # 2 rows...
#> # A tibble: 2 x 2
#> x y
#> <chr> <int>
#> 1 a 1
#> 2 b 2
df %>% spread(x, y) # ...become one if there's only one value for each.
#> # A tibble: 1 x 2
#> a b
#> * <int> <int>
#> 1 1 2
spread doesn't apply a function to combine multiple values (à la dcast), so rows must be indexed so there's one or zero values for a location, e.g.
df <- tibble(i = c(1, 1, 2, 2, 3, 3),
x = c('a', 'b', 'a', 'b', 'a', 'b'),
y = 1:6)
df # the two rows with each `i` value here...
#> # A tibble: 6 x 3
#> i x y
#> <dbl> <chr> <int>
#> 1 1 a 1
#> 2 1 b 2
#> 3 2 a 3
#> 4 2 b 4
#> 5 3 a 5
#> 6 3 b 6
df %>% spread(x, y) # ...become one row here.
#> # A tibble: 3 x 3
#> i a b
#> * <dbl> <int> <int>
#> 1 1 1 2
#> 2 2 3 4
#> 3 3 5 6
If your values aren't indexed naturally by the other columns, you can add a unique index column (e.g. by adding the row numbers as a column), which will stop spread from trying to collapse the rows:
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L),
.Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"),
.Names = c("age", "gender"))
df %>% mutate(i = row_number()) %>% spread(gender, age)
#> # A tibble: 5 x 3
#> i Female Male
#> * <int> <chr> <chr>
#> 1 1 <NA> 21
#> 2 2 17 <NA>
#> 3 3 32 <NA>
#> 4 4 <NA> 29
#> 5 5 <NA> 15
If you want to remove it afterwards, add on select(-i). This doesn't produce a terribly useful data.frame in this case, but can be very useful in the midst of more complicated reshaping.
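If a more compact result is wanted, one option is to number the rows within each gender rather than across the whole table. This is a sketch of a variation, not part of the answer above; it rebuilds the same five rows with tibble() and factor().

```r
library(dplyr)
library(tidyr)

# Same five rows as above, rebuilt with tibble() and factor().
df <- tibble(
  age    = c("21", "17", "32", "29", "15"),
  gender = factor(c("Male", "Female", "Female", "Male", "Male"))
)

wide <- df %>%
  group_by(gender) %>%
  mutate(i = row_number()) %>%  # index restarts at 1 for each gender
  ungroup() %>%
  spread(gender, age)
wide
```

Because the index restarts per group, the first Female and the first Male share a row, so the result has three rows instead of five, with NA only where one group runs out of values.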
Spread with duplicate identifiers (using tidyverse and %>%)
We can use tidyverse. After grouping by 'start_end' and 'id', create a sequence column 'ind', then spread from 'long' to 'wide' format:
library(dplyr)
library(tidyr)
df %>%
group_by(start_end, id) %>%
mutate(ind = row_number()) %>%
spread(start_end, date) %>%
select(start, end)
# id start end
#* <int> <fctr> <fctr>
#1 2 1994-05-01 1996-11-04
#2 4 1979-07-18 NA
#3 5 2005-02-01 2009-09-17
#4 5 2010-10-01 2012-10-06
Or, using tidyr 1.0.0:
chop(df, date) %>%
spread(start_end, date) %>%
unnest(c(start, end))
Duplicate Identifiers - How to Dynamically Spread Repeating Values in a Column using Tidyverse
You can create a new column to assign a unique id for each phone number for a given Id and then spread using that new column. This avoids the "duplicate id" problem. When spreading, this approach inherently adds as many columns as are needed to accommodate all of the unique phone numbers for each Id. For example, in the code below I've called that new column seq (for "sequence"):
library(tidyverse)
Df %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(seq=paste0("Phone_",1:n())) %>%
spread(seq, value)
Id Phone_1 Phone_2 Phone_3
1 111 6453348736 NA NA
2 121 6783450101 NA NA
3 144 6451123425 NA NA
4 145 6457886543 NA NA
5 148 6572231223 6541132112 NA
6 165 7372222222 NA NA
7 177 6548887777 NA NA
8 188 7890986543 6785554444 8764443344
9 199 6532881717 NA NA
R - Wrong error message - Error: Duplicate identifiers for rows
This ought to work. As a comment noted, the error occurs because spread tries to combine rows that are no longer uniquely identified after the gather. rowid_to_column is a simple function that converts the row ids into a column. The row ids are larger than the size of the original dataset because, after gathering, you have a data frame with 8003580 rows.
data2 <- data %>%
gather(Day, value, Day01:Day31) %>%
tibble::rowid_to_column() %>%
spread(Station, value)
I ran into memory issues trying to actually do this on my laptop though.
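To see what this does on a scale that fits in memory, here is a minimal sketch with a hypothetical three-row data set. The Station and DayNN column names follow the code above; the values, and the reduction to two day columns, are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical data: station S1 appears twice, so after gathering there are
# duplicate Station/Day pairs that spread() cannot place on its own.
data <- tibble(
  Station = c("S1", "S1", "S2"),
  Day01   = c(1, 2, 3),
  Day02   = c(4, 5, 6)
)

# 3 rows x 2 day columns = 6 rows in long format.
long <- gather(data, Day, value, Day01:Day02)

wide <- long %>%
  rowid_to_column() %>%   # adds a `rowid` column numbered 1..6
  spread(Station, value)
wide
```

Every long-format row keeps its own row in the result, with a single non-NA station value per row. That sparsity is likely why the full-size version is so memory hungry.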