How to Spread Columns with Duplicate Identifiers

How to spread columns with duplicate identifiers?

Right now you have two age values for Female and three for Male, and no other variable that keeps them from being collapsed into a single row. spread tries to collapse rows that share the same index values (or have no index at all) into one row:

library(tidyverse)

df <- tibble(x = c('a', 'b'), y = 1:2) # tibble() replaces the deprecated data_frame()

df # 2 rows...
#> # A tibble: 2 x 2
#> x y
#> <chr> <int>
#> 1 a 1
#> 2 b 2

df %>% spread(x, y) # ...become one if there's only one value for each.
#> # A tibble: 1 x 2
#> a b
#> * <int> <int>
#> 1 1 2

spread doesn't apply a function to combine multiple values (à la dcast), so the rows must be indexed such that each location in the result receives zero or one value, e.g.

# tibble() replaces the deprecated data_frame()
df <- tibble(i = c(1, 1, 2, 2, 3, 3),
             x = c('a', 'b', 'a', 'b', 'a', 'b'),
             y = 1:6)

df # the two rows with each `i` value here...
#> # A tibble: 6 x 3
#> i x y
#> <dbl> <chr> <int>
#> 1 1 a 1
#> 2 1 b 2
#> 3 2 a 3
#> 4 2 b 4
#> 5 3 a 5
#> 6 3 b 6

df %>% spread(x, y) # ...become one row here.
#> # A tibble: 3 x 3
#> i a b
#> * <dbl> <int> <int>
#> 1 1 1 2
#> 2 2 3 4
#> 3 3 5 6
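
For contrast, here is a minimal sketch of what happens when two rows compete for the same cell. The data frame `dup` is a made-up example, and the exact wording of the error differs between tidyr versions, so the code only captures the message instead of assuming its text:

```r
library(tidyr)

# Hypothetical data: key "a" appears twice and no other column tells the rows apart
dup <- data.frame(x = c("a", "a", "b"), y = 1:3)

# spread() will not guess which of the two "a" values to keep, so it raises an error;
# tryCatch() turns that error into a plain character string we can inspect
msg <- tryCatch(
  {
    spread(dup, x, y)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(msg)
```

Adding an index column, as in the example with `i` above, is what removes the conflict.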

If your values aren't indexed naturally by the other columns, you can add a unique index column (e.g. by adding the row numbers as a column), which will stop spread from trying to collapse the rows:

df <- structure(list(age = c("21", "17", "32", "29", "15"), 
gender = structure(c(2L, 1L, 1L, 2L, 2L),
.Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"),
.Names = c("age", "gender"))

df %>% mutate(i = row_number()) %>% spread(gender, age)
#> # A tibble: 5 x 3
#> i Female Male
#> * <int> <chr> <chr>
#> 1 1 <NA> 21
#> 2 2 17 <NA>
#> 3 3 32 <NA>
#> 4 4 <NA> 29
#> 5 5 <NA> 15

If you want to remove it afterwards, add on select(-i). This doesn't produce a terribly useful data.frame in this case, but can be very useful in the midst of more complicated reshaping.
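
If a more compact result is wanted, one possible variation (not part of the answer above) is to number the rows within each gender instead of across the whole table. The sketch below rebuilds the same five rows with `tibble()` and assumes dplyr and tidyr are installed:

```r
library(dplyr)
library(tidyr)

# Same five observations as above, rebuilt by hand
df <- tibble(
  age    = c("21", "17", "32", "29", "15"),
  gender = factor(c("Male", "Female", "Female", "Male", "Male"))
)

wide <- df %>%
  group_by(gender) %>%
  mutate(i = row_number()) %>%  # index restarts at 1 for each gender
  ungroup() %>%
  spread(gender, age)

wide
# Three rows: i = 1..3, with Female missing in the last row
# because there are only two Female values
```

Here the index only says "first, second, third value seen for this gender", so values that end up on the same row are not otherwise related.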

Spread with duplicate identifiers for rows

In order for spread to work as intended, the resulting data frame must have uniquely identified rows and columns. In the case of your data, the "date" column is the only unique identifier after spreading. However, rows 36 and 38 are identical:

         date tmin state
36 2018-01-03 -3 OH
38 2018-01-03 -3 OH

This puts tidyr in the impossible position of trying to resolve two data points to the same row and column. In addition, rows 35 and 37 both have the same date and state, once again creating the impossible situation of placing two different values into the same position in the new data frame:

         date tmin state
35 2018-01-03 NA UT
37 2018-01-03 22 UT

The following data cleanup will make spreading possible:

df %>%
  filter(!is.na(tmin)) %>%  # remove NA values
  unique() %>%              # remove duplicated rows
  spread(state, tmin)

date OH UT
1 2018-01-02 -4 24
2 2018-01-03 -3 22
3 2018-01-04 11 19
4 2018-01-05 3 23
5 2018-01-06 0 29
...
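
The original data is not shown in full here, so the following is only a small sketch of the same cleanup on made-up rows. It reuses the four problem rows quoted above (35 to 38) plus two invented rows for 2018-01-02:

```r
library(dplyr)
library(tidyr)

# Hypothetical excerpt: one exact duplicate (OH) and one NA that clashes with a real value (UT)
df <- data.frame(
  date  = c("2018-01-02", "2018-01-02", "2018-01-03", "2018-01-03", "2018-01-03", "2018-01-03"),
  tmin  = c(-4, 24, NA, -3, 22, -3),
  state = c("OH", "UT", "UT", "OH", "UT", "OH"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  filter(!is.na(tmin)) %>%  # drops the UT row with a missing tmin
  unique() %>%              # drops the second copy of the OH row
  spread(state, tmin)

wide
```

Note that this only works because the conflicting rows were either exact duplicates or NA. Two different non-missing values for the same date and state would still need an explicit rule (for example a mean) before spreading.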

Using spread with duplicate identifiers for rows

The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.

library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11

I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum (R partially matches this to dcast's fun.aggregate argument):

library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
# month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1: 1 17 8 11 8
# 2: 2 13 8 13 5
# 3: 3 15 6 15 11

If you would like to use the tidyr solution, combine it with dcast to summarize by sum.

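jj <- as.data.frame(jj)  # assign the result; this undoes the earlier setDT(jj)
library(tidyr)
library(reshape2)        # provides dcast() for data frames
jj %>%
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  dcast(month ~ temp, value.var = "value", fun = sum)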
# month Amy_A Amy_B Bob_A Bob_B
# 1 1 17 11 8 8
# 2 2 13 13 8 5
# 3 3 15 15 6 11
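
For readers who want to stay within tidyr, a possible alternative is `pivot_wider()` with its `values_fn` argument. This is a sketch and not part of the original answer: it assumes tidyr 1.1.0 or later, and `jj` is rebuilt by hand from the outputs shown in this answer, since the original data is not included.

```r
library(dplyr)
library(tidyr)

# jj reconstructed from the printed results (an assumption, not the asker's actual object)
jj <- tibble(
  month   = rep(c(1, 2, 3), times = 4),
  student = rep(c("Amy", "Bob"), each = 6),
  A = c(9, 7, 6, 8, 6, 9,  3, 2, 1, 5, 6, 5),
  B = c(6, 7, 8, 5, 6, 7,  5, 4, 6, 3, 1, 5)
)

jj_wide <- jj %>%
  pivot_longer(c(A, B), names_to = "variable", values_to = "value") %>%
  pivot_wider(
    names_from  = c(student, variable),  # builds names such as Amy_A
    values_from = value,
    values_fn   = sum                    # combine duplicates by summing
  ) %>%
  arrange(month)

jj_wide
```

`values_fn` plays the role that `fun=sum` plays in the dcast calls above: it tells the reshaping step how to combine several values that land in the same cell.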

Edit

Based on your new requirements, I have added an activity column.

library(dplyr)
jj %>% group_by(month, student) %>%
mutate(id=1:n()) %>%
melt(id=c("month", "id", "student")) %>%
dcast(... ~ student + variable, value.var="value")
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 1 2 8 5 5 3
# 3 2 1 7 7 2 4
# 4 2 2 6 6 6 1
# 5 3 1 6 8 1 6
# 6 3 2 9 7 5 5

The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:

library(tidyr)
jj %>%
gather(variable, value, -(month:student)) %>%
unite(temp, student, variable) %>%
group_by(temp) %>%
mutate(id=1:n()) %>%
dcast(... ~ temp) %>%
arrange(id)
# month id Amy_A Amy_B Bob_A Bob_B
# 1 1 1 9 6 3 5
# 2 2 2 7 7 2 4
# 3 3 3 6 8 1 6
# 4 1 4 8 5 5 3
# 5 2 5 6 6 6 1
# 6 3 6 9 7 5 5

The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.

library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
# month activityID A_Amy A_Bob B_Amy B_Bob
# 1: 1 1 9 3 6 5
# 2: 1 4 8 5 5 3
# 3: 2 2 7 2 7 4
# 4: 2 5 6 6 6 1
# 5: 3 3 6 1 8 6
# 6: 3 6 9 5 7 5
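
A tidyr counterpart that also avoids the melt and cast round trip might look like the sketch below. It is an addition to the answer, assumes tidyr 1.0.0 or later for `pivot_wider()`, and again uses a hand-built `jj` reconstructed from the printed outputs:

```r
library(dplyr)
library(tidyr)

# jj reconstructed from the printed results (an assumption, not the asker's actual object)
jj <- tibble(
  month   = rep(c(1, 2, 3), times = 4),
  student = rep(c("Amy", "Bob"), each = 6),
  A = c(9, 7, 6, 8, 6, 9,  3, 2, 1, 5, 6, 5),
  B = c(6, 7, 8, 5, 6, 7,  5, 4, 6, 3, 1, 5)
)

jj_ids <- jj %>%
  group_by(month, student) %>%
  mutate(id = row_number()) %>%  # 1st, 2nd, ... activity per month and student
  ungroup() %>%
  pivot_wider(names_from = student, values_from = c(A, B)) %>%
  arrange(month, id)

jj_ids
```

Because `values_from` takes both `A` and `B`, the column names come out as `A_Amy`, `A_Bob`, `B_Amy`, `B_Bob`, the same layout as the data.table result.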

Duplicate Identifiers - How to Dynamically Spread Repeating Values in a Column using Tidyverse

You can create a new column to assign a unique id for each phone number for a given Id and then spread using that new column. This avoids the "duplicate id" problem. When spreading, this approach inherently adds as many columns as are needed to accommodate all of the unique phone numbers for each Id. For example, in the code below I've called that new column seq (for "sequence"):

library(tidyverse)

Df %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(seq=paste0("Phone_",1:n())) %>%
spread(seq, value)
     Id    Phone_1    Phone_2    Phone_3
1 111 6453348736 NA NA
2 121 6783450101 NA NA
3 144 6451123425 NA NA
4 145 6457886543 NA NA
5 148 6572231223 6541132112 NA
6 165 7372222222 NA NA
7 177 6548887777 NA NA
8 188 7890986543 6785554444 8764443344
9 199 6532881717 NA NA
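
Since `Df` itself is not shown, here is a small self-contained sketch of the same pipeline. The column names `Phone_a` and `Phone_b` and the row layout are invented for illustration; only the phone numbers are borrowed from the output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: repeated Ids, a missing value, and one number listed twice for Id 148
Df <- tibble(
  Id      = c(148, 148, 188),
  Phone_a = c("6572231223", "6541132112", "7890986543"),
  Phone_b = c(NA, "6572231223", "6785554444")
)

wide <- Df %>%
  gather(key, value, -Id) %>%        # stack all phone columns
  filter(!is.na(value)) %>%
  select(-key) %>%
  group_by(Id) %>%
  filter(!duplicated(value)) %>%     # keep each number once per Id
  mutate(seq = paste0("Phone_", row_number())) %>%
  ungroup() %>%
  spread(seq, value)

wide
```

One caveat worth knowing: with ten or more numbers per Id, the columns would sort as `Phone_1`, `Phone_10`, `Phone_2`, because spread orders the new column names alphabetically.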

Turning variable values into column names; duplicate identifiers for rows in tidyr::spread

tidyr provides some syntax for dealing with this problem.

# set up
library(dplyr)
library(tidyr)
dat <- tibble(
id = factor(c("A","B","C","D","E")),
demographic_info1 = round(rnorm(5),2),
demographic_info2 = round(rnorm(5),2),
election_1 = c(NA,"GN2016","GN2016","SE2016","GN2008"),
election_2 = c(NA,"MT2014","GN2012","GN2016","GN2004"),
election_3 = c(NA,NA,NA,"MT2014","GN2000"),
election_4 = c(NA,NA,NA,"GN2012",NA),
election_5 = c(NA,NA,NA,"MT2010",NA)
)

What we eventually want is a TRUE or FALSE for every voter (5) x election (8) pairing. When we gather the data into a long format, we only see the voter x election combinations that actually exist in the dataset.

d_votes <- dat %>%
gather("variable", "election", election_1:election_5) %>%
select(-variable) %>%
mutate(voted = TRUE)
d_votes
#> # A tibble: 25 x 5
#> id demographic_info1 demographic_info2 election voted
#> <fctr> <dbl> <dbl> <chr> <lgl>
#> 1 A 0.76 -0.23 <NA> TRUE
#> 2 B -0.80 0.08 GN2016 TRUE
#> 3 C -0.33 1.60 GN2016 TRUE
#> 4 D -0.50 -1.27 SE2016 TRUE
#> 5 E -1.03 0.59 GN2008 TRUE
#> 6 A 0.76 -0.23 <NA> TRUE
#> 7 B -0.80 0.08 MT2014 TRUE
#> 8 C -0.33 1.60 GN2012 TRUE
#> 9 D -0.50 -1.27 GN2016 TRUE
#> 10 E -1.03 0.59 GN2004 TRUE
#> # ... with 15 more rows

count(d_votes, election)
#> # A tibble: 9 x 2
#> election n
#> <chr> <int>
#> 1 GN2000 1
#> 2 GN2004 1
#> 3 GN2008 1
#> 4 GN2012 2
#> 5 GN2016 3
#> 6 MT2010 1
#> 7 MT2014 2
#> 8 SE2016 1
#> 9 <NA> 13

We need to generate every combination of voter and election. tidyr's expand() function creates all combinations of variables from different columns/vectors of data. (It works like the base function expand.grid(), so the name expand() is evocative).

d_possible_votes <- d_votes %>%
expand(nesting(id, demographic_info1, demographic_info2),
election)
d_possible_votes
#> # A tibble: 40 x 4
#> id demographic_info1 demographic_info2 election
#> <fctr> <dbl> <dbl> <chr>
#> 1 A 0.76 -0.23 GN2000
#> 2 A 0.76 -0.23 GN2004
#> 3 A 0.76 -0.23 GN2008
#> 4 A 0.76 -0.23 GN2012
#> 5 A 0.76 -0.23 GN2016
#> 6 A 0.76 -0.23 MT2010
#> 7 A 0.76 -0.23 MT2014
#> 8 A 0.76 -0.23 SE2016
#> 9 B -0.80 0.08 GN2000
#> 10 B -0.80 0.08 GN2004
#> # ... with 30 more rows

Note that we now have 8 elections x 5 ids = 40 rows.

We used the nesting() function to treat each (id, demographic_info1, demographic_info2) set/row as a single unit; demographics are nested within ids. Expanding provided all 40 combinations of (id, demographic_info1, demographic_info2) x election.

If we join the observed votes onto the possible votes, the voted column is populated with TRUE or NA values. tidyr's replace_na() function can correct those NA values.

d_possible_votes <- d_possible_votes %>%
left_join(d_votes) %>%
replace_na(list(voted = FALSE))
#> Joining, by = c("id", "demographic_info1", "demographic_info2", "election")
d_possible_votes
#> # A tibble: 40 x 5
#> id demographic_info1 demographic_info2 election voted
#> <fctr> <dbl> <dbl> <chr> <lgl>
#> 1 A 0.76 -0.23 GN2000 FALSE
#> 2 A 0.76 -0.23 GN2004 FALSE
#> 3 A 0.76 -0.23 GN2008 FALSE
#> 4 A 0.76 -0.23 GN2012 FALSE
#> 5 A 0.76 -0.23 GN2016 FALSE
#> 6 A 0.76 -0.23 MT2010 FALSE
#> 7 A 0.76 -0.23 MT2014 FALSE
#> 8 A 0.76 -0.23 SE2016 FALSE
#> 9 B -0.80 0.08 GN2000 FALSE
#> 10 B -0.80 0.08 GN2004 FALSE
#> # ... with 30 more rows

Now, we can spread out the elections and achieve the desired dataframe.

spread(d_possible_votes, election, voted)
#> # A tibble: 5 x 11
#> id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A 0.76 -0.23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 2 B -0.80 0.08 FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
#> 3 C -0.33 1.60 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
#> 4 D -0.50 -1.27 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#> 5 E -1.03 0.59 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

This pattern of generating combinations of identifiers, joining actual data, and correcting missing values is very common, so much so that tidyr includes a function, complete(), to do all three at once.

d_votes %>%
complete(nesting(id, demographic_info1, demographic_info2),
election, fill = list(voted = FALSE)) %>%
spread(election, voted)
#> # A tibble: 5 x 11
#> id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A 0.76 -0.23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 2 B -0.80 0.08 FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
#> 3 C -0.33 1.60 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
#> 4 D -0.50 -1.27 FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#> 5 E -1.03 0.59 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
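
To see what complete() does in isolation, here is a stripped-down sketch on a made-up three-row table (the `votes` object is hypothetical and has no demographic columns or missing elections):

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: voter A has one recorded vote, voter B has two
votes <- tibble(
  id       = c("A", "B", "B"),
  election = c("GN2016", "GN2016", "MT2014"),
  voted    = TRUE
)

# complete() adds the missing (A, MT2014) pairing and fills voted with FALSE
full <- votes %>%
  complete(id, election, fill = list(voted = FALSE))

wide <- spread(full, election, voted)
wide
```

The `fill` list is what replaces the separate left_join() and replace_na() steps from the longer version above.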

Spread with duplicate identifiers (using tidyverse and %>%)

We can use tidyverse. After grouping by 'start_end' and 'id', create a sequence column 'ind', then spread from 'long' to 'wide' format:

library(dplyr)
library(tidyr)
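df %>%
  group_by(start_end, id) %>%
  mutate(ind = row_number()) %>%  # 1st, 2nd, ... date per id and start/end
  ungroup() %>%                   # drop the grouping before reshaping
  spread(start_end, date) %>%
  select(id, start, end)          # name id explicitly and drop the helper column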
# id start end
#* <int> <fctr> <fctr>
#1 2 1994-05-01 1996-11-04
#2 4 1979-07-18 NA
#3 5 2005-02-01 2009-09-17
#4 5 2010-10-01 2012-10-06

Or, using tidyr 1.0.0 or later:

chop(df, date) %>%
spread(start_end, date) %>%
unnest(c(start, end))
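
The input `df` is not shown in this answer, so the sketch below rebuilds a plausible version from the printed result (character dates instead of factors, which is an assumption) and runs the first pipeline on it:

```r
library(dplyr)
library(tidyr)

# Hypothetical long data reconstructed from the output above; id 4 has no end date
df <- tibble(
  id        = c(2L, 2L, 4L, 5L, 5L, 5L, 5L),
  start_end = c("start", "end", "start", "start", "end", "start", "end"),
  date      = c("1994-05-01", "1996-11-04", "1979-07-18",
                "2005-02-01", "2009-09-17", "2010-10-01", "2012-10-06")
)

wide <- df %>%
  group_by(start_end, id) %>%
  mutate(ind = row_number()) %>%  # distinguishes the two start/end pairs of id 5
  ungroup() %>%
  spread(start_end, date) %>%
  select(id, start, end)

wide
```

Without `ind`, the two "start" rows for id 5 would target the same cell and spread would stop with a duplicate-identifier error.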

How to spread columns with key and value of equal length (without creating ids)?

The spread function tries to identify in which row the values should go and tries to use var1 as identifier for the rows.

As @Frank mentioned, you need to create a column to be used as the row identifier. Because it identifies the rows, the numbers you provide determine the order of your values.

Check this example:

library(tidyverse)

df = tibble(var1 = c('a', 'a', 'a', 'b','b', 'b'), value=c(2,1,4,2,1,4))

df %>%
group_by(var1) %>%
mutate(id = row_number()) %>%
spread(var1, value)

# # A tibble: 3 x 3
# id a b
# * <int> <dbl> <dbl>
# 1 1 2 2
# 2 2 1 1
# 3 3 4 4

df %>%
group_by(var1) %>%
mutate(id = n():1) %>%
spread(var1, value)

# # A tibble: 3 x 3
# id a b
# * <int> <dbl> <dbl>
# 1 1 4 4
# 2 2 1 1
# 3 3 2 2

