Tidy Data.Frame with Repeated Column Names

Tidy data.frame with repeated column names

You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A','^B','^C','^ID'), 
                 value.name = c('A','B','C','ID'))

which gives:

 > tidy_toy
                  file_path condition Trial.Num variable  A B  C    ID
 1:     root/some.extension  Baseline         1        1  2 3  5   car
 2:    root/thing.extension  Baseline         2        1  3 6 45   car
 3:     root/else.extension  Baseline         3        1  4 4  6   car
 4: root/uniquely.extension Treatment         1        1  5 3  7   car
 5:  root/defined.extension Treatment         2        1  6 7  3   car
 6:     root/some.extension  Baseline         1        2  2 1  7  bike
 7:    root/thing.extension  Baseline         2        2  5 4  4  bike
 8:     root/else.extension  Baseline         3        2  7 5  4  bike
 9: root/uniquely.extension Treatment         1        2  1 7 37  bike
10:  root/defined.extension Treatment         2        2  4 6  8  bike
11:     root/some.extension  Baseline         1        3  4 9  0 plane
12:    root/thing.extension  Baseline         2        3  9 5  4 plane
13:     root/else.extension  Baseline         3        3 68 7 56 plane
14: root/uniquely.extension Treatment         1        3  9 8  7 plane
15:  root/defined.extension Treatment         2        3  9 0  8 plane

Another option is to use a list of column-indexes for measure.vars:

tidy_toy <- melt(setDT(toy), 
                 measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)), 
                 value.name = c('A','B','C','ID'))

Making the column-names unique isn't necessary then.

A more complicated method that creates names that are better distinguishable by the patterns argument:

# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms, 
                         rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                             each = length(names(tt)[tt > 1])), 
                         sep = '.')

# your columnnames are now unique:
> names(toy)
 [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
 [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     

# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                 value.name = c('A','B','C','ID'))

which will give the same end-result.

As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.

Import file with repeated column names for each individual R

Perhaps the best strategy may depend also on how you input your raw data (e.g., from Excel).

If you happen to have Excel data, you can use read_excel from tidyverse and can include .name_repair = "minimal" to prevent changes in column names.

In this case, with repair_names you can have a consistent structure to column names that are repeated, perhaps with an underscore (this would give you Name, Name_1, Name_2, Age, Age_1, Age_2, etc.).

Finally, pivot_longer of your repeated columns would provide a tidy data frame.

Also, there are a number of alternative ways to fix your repeating column names and make unique; for example, make.unique called on names(df) or clean_names(df) from janitor package.

library(tidyverse)
library(readxl)

df <- read_excel("raw_data.xlsx", .name_repair = "minimal")

df %>%
  repair_names(sep = "_") %>%
  pivot_longer(-c(Family, Location), names_to = c(".value", "variable"), names_sep = "_") %>%
  select(-variable)

pivot to wide dataframe with repeating column names

It is not recommended to have duplicate column names, therefore, we modify the 'name' column by appending an unique index created with rowid, and use that to reshape with pivot_wider

library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
long_df %>% 
   mutate(name = str_c(name, "_", rowid(ID, name))) %>% 
   pivot_wider(names_from = name, values_from = value, names_sort = TRUE)

-output

# A tibble: 5 x 7
#  ID    cond1_1 cond1_2 cond2_1 cond2_2 cond3_1 cond3_2
#  <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#1 A       0.293   0.920    6.44    9.14    18.5    71.9
#2 B       0.225   0.280    4.34    2.78    59.7    16.9
#3 C       0.704   0.764    7.05    1.40    75.3    64.0
#4 D       0.519   0.802    7.06    5.51    22.4    66.7
#5 E       0.663   0.255    3.88    2.25    30.1    14.2

If it needs to have repeating names, just strip off the _\\d+ at the end of the name with str_remove

Lengthen data frame with duplicate names

The problem is not pivot_wider, it can be used on data.frames containing columns with the same name - mutate can't. So we need to transform the columns to character columns either by (i) using base R or (ii) if you want to stay in the larger tidyverse purrr::modify_at (after all a data.frame is always a list). After that its just a regular call to pivot_wider.

df <- setNames(data.frame(c(1,2,3), 
                          c(4,5,6), 
                          c("a","b","c"), 
                          c("d","e","f"), 
                          stringsAsFactors = F), 
               c("a","b","c","c"))

library(dplyr)
library(tidyr)

# Alternatively use base R to transform cols to character
# df[,c("a", "b")] <- lapply(df[,c("a", "b")], as.character)

df %>%
  purrr::modify_at(c("a","b"), as.character) %>% 
  pivot_longer(-a,
               names_to = "Names",
               values_to = "Values")
#> # A tibble: 9 x 3
#>   a     Names Values
#>   <chr> <chr> <chr> 
#> 1 1     b     4     
#> 2 1     c     a     
#> 3 1     c     d     
#> 4 2     b     5     
#> 5 2     c     b     
#> 6 2     c     e     
#> 7 3     b     6     
#> 8 3     c     c     
#> 9 3     c     f

^{Created on 2021-02-23 by the reprex package (v0.3.0)}

Tidying data with several repeating variables in R

You are on the right track with using gather, but need some additional steps to split the prefix off the column names. Try the following:

library(dplyr)
library(tidyr)

df = data.frame(
  id = c(1,2,3),
  c_health = c(3,2,4),
  c_animals = c(2,3,1),
  z_health = c(7,8,9),
  z_animals = c(9,7,6),
  stringsAsFactors = FALSE
)

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(prefix = substr(question,1,1),
         desc = substr(question,3,nchar(question))) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

Update - my comment on differing prefix lengths does not return the correct answer. Because [] indexing does not work that way inside mutate. Same idea but correct syntax as follows:

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(split = strsplit(question, "_")) %>%
  mutate(prefix = sapply(split, function(x){x[1]}),
         desc = sapply(split, function(x){x[2]})) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

R dates as column names containing duplicate values (need to retain original date)

You have two header rows, which is pretty messy. I'd recommend re-reading the data, skipping the date line, then incorporating the date line as part of the column names.

If you already have the data read in, you can try something like this:

library(data.table)
df2 <- setDT(df[-1, ])
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))
# Current data
df2
#     Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1      pending      pending     pending
# 2: Org2     complete     complete    complete
# 3: Org3     complete      pending     pending
# 4: Org4      pending     complete    complete

# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
#      Org    value     Date Contract
#  1: Org1  pending 12/16/18      111
#  2: Org2 complete 12/16/18      111
#  3: Org3 complete 12/16/18      111
#  4: Org4  pending 12/16/18      111
#  5: Org1  pending 12/16/18      222
#  6: Org2 complete 12/16/18      222
#  7: Org3  pending 12/16/18      222
#  8: Org4 complete 12/16/18      222
#  9: Org1  pending  1/18/18      222
# 10: Org2 complete  1/18/18      222
# 11: Org3  pending  1/18/18      222
# 12: Org4 complete  1/18/18      222

If you do want to stick with dplyr and tidyr, here's a translation of the above:

library(dplyr)
library(tidyr)
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>% 
  slice(-1) %>% 
  pivot_longer(-Org) %>% 
  separate(name, into = c("Date", "Contract"), sep = "_")

Note that you have to rename the dataset before you start chaining the other commands together.

R: gather() to Tidy data with two column headings

This is one option:

library(tidyverse)

# get unique Year values and create column names (to add later)
df %>%
  filter(Country_Territory == "Year") %>%
  gather() %>%
  filter(value != "Year" & !is.na(value)) %>%
  pull(value) %>%
  unique() %>%
  paste0("Year_",.) -> col_years

# reshape data (excluding the Year row)
df %>%
  filter(Country_Territory != "Year") %>%
  gather(key,y,-Country_Territory, -WBCode) %>%
  separate(key, c("measure","v")) %>%
  group_by(v = ifelse(is.na(v), 0, v)) %>%
  nest() -> df_info

reduce(df_info$data, function(x,y) left_join(x,y,by=c("Country_Territory","WBCode","measure"))) %>%
  setNames(c("Country_Territory", "WBCode", "measure", col_years))

# # A tibble: 18 x 5
#   Country_Territory WBCode measure  Year_1996 Year_1998
#   <chr>             <chr>  <chr>        <dbl>     <dbl>
# 1 Andorra           ADO    Estimate      1.32      1.38
# 2 Afghanistan       AFG    Estimate     -1.29     -1.18
# 3 Angola            AGO    Estimate     -1.17     -1.41
# 4 Andorra           ADO    StdErr        0.48      0.46
# 5 Afghanistan       AFG    StdErr        0.34      0.33
# 6 Angola            AGO    StdErr        0.26      0.21
# 7 Andorra           ADO    NumSrc        1         1   
# 8 Afghanistan       AFG    NumSrc        2         2   
# 9 Angola            AGO    NumSrc        4         6   
# 10 Andorra           ADO    Rank         87.1      89.2 
# 11 Afghanistan       AFG    Rank          4.3       9.79
# 12 Angola            AGO    Rank          9.68      1.55
# 13 Andorra           ADO    Lower        72.0      74.7 
# 14 Afghanistan       AFG    Lower         0         0   
# 15 Angola            AGO    Lower         0.54      0   
# 16 Andorra           ADO    Upper        96.8      96.9 
# 17 Afghanistan       AFG    Upper        27.4      31.4 
# 18 Angola            AGO    Upper        27.4      13.4

If you save the above output as df_upd you can reshape a bit more to get Year as one column:

df_upd %>%
  gather(Year, value, -Country_Territory, -WBCode, -measure) %>%
  separate(Year, c("y","Year"), convert = T) %>%
  select(-y)

# # A tibble: 36 x 5
#   Country_Territory WBCode measure   Year  value
#   <chr>             <chr>  <chr>    <int>  <dbl>
# 1 Andorra           ADO    Estimate  1996   1.32
# 2 Afghanistan       AFG    Estimate  1996  -1.29
# 3 Angola            AGO    Estimate  1996  -1.17
# 4 Andorra           ADO    StdErr    1996   0.48
# 5 Afghanistan       AFG    StdErr    1996   0.34
# 6 Angola            AGO    StdErr    1996   0.26
# 7 Andorra           ADO    NumSrc    1996   1   
# 8 Afghanistan       AFG    NumSrc    1996   2   
# 9 Angola            AGO    NumSrc    1996   4   
# 10 Andorra           ADO    Rank      1996  87.1 
# # ... with 26 more rows

In pandas, how to turn a dataframe into tidy data?

Use melt:

out = df.melt('year', var_name='localization', value_name='number_of_tests')

You can also use:

out = df.set_index('year').rename_axis(columns='localization').unstack() \
        .rename('number_of_tests').reset_index()

Tidy Data.Frame with Repeated Column Names