Tidy Data.Frame with Repeated Column Names

Tidy data.frame with repeated column names

You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy),
measure.vars = patterns('^A','^B','^C','^ID'),
value.name = c('A','B','C','ID'))

which gives:

 > tidy_toy
file_path condition Trial.Num variable A B C ID
1: root/some.extension Baseline 1 1 2 3 5 car
2: root/thing.extension Baseline 2 1 3 6 45 car
3: root/else.extension Baseline 3 1 4 4 6 car
4: root/uniquely.extension Treatment 1 1 5 3 7 car
5: root/defined.extension Treatment 2 1 6 7 3 car
6: root/some.extension Baseline 1 2 2 1 7 bike
7: root/thing.extension Baseline 2 2 5 4 4 bike
8: root/else.extension Baseline 3 2 7 5 4 bike
9: root/uniquely.extension Treatment 1 2 1 7 37 bike
10: root/defined.extension Treatment 2 2 4 6 8 bike
11: root/some.extension Baseline 1 3 4 9 0 plane
12: root/thing.extension Baseline 2 3 9 5 4 plane
13: root/else.extension Baseline 3 3 68 7 56 plane
14: root/uniquely.extension Treatment 1 3 9 8 7 plane
15: root/defined.extension Treatment 2 3 9 0 8 plane

Another option is to use a list of column-indexes for measure.vars:

tidy_toy <- melt(setDT(toy), 
measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)),
value.name = c('A','B','C','ID'))

Making the column-names unique isn't necessary then.


A more complicated method that creates names that are better distinguishable by the patterns argument:

# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms,
rep(seq(length(nms) / length(names(tt)[tt > 1])),
each = length(names(tt)[tt > 1])),
sep = '.')

# your columnnames are now unique:
> names(toy)
[1] "file_path" "Condition" "Trial.Num" "A.1" "B.1" "C.1" "ID.1" "A.2"
[9] "B.2" "C.2" "ID.2" "A.3" "B.3" "C.3" "ID.3"

# tidy the data into long format
tidy_toy <- melt(setDT(toy),
measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'),
value.name = c('A','B','C','ID'))

which will give the same end-result.


As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.

Import file with repeated column names for each individual R

Perhaps the best strategy may depend also on how you input your raw data (e.g., from Excel).

If you happen to have Excel data, you can use read_excel from tidyverse and can include .name_repair = "minimal" to prevent changes in column names.

In this case, with repair_names you can have a consistent structure to column names that are repeated, perhaps with an underscore (this would give you Name, Name_1, Name_2, Age, Age_1, Age_2, etc.).

Finally, pivot_longer of your repeated columns would provide a tidy data frame.

Also, there are a number of alternative ways to fix your repeating column names and make unique; for example, make.unique called on names(df) or clean_names(df) from janitor package.

library(tidyverse)
library(readxl)

df <- read_excel("raw_data.xlsx", .name_repair = "minimal")

df %>%
repair_names(sep = "_") %>%
pivot_longer(-c(Family, Location), names_to = c(".value", "variable"), names_sep = "_") %>%
select(-variable)

pivot to wide dataframe with repeating column names

It is not recommended to have duplicate column names, therefore, we modify the 'name' column by appending an unique index created with rowid, and use that to reshape with pivot_wider

library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
long_df %>%
mutate(name = str_c(name, "_", rowid(ID, name))) %>%
pivot_wider(names_from = name, values_from = value, names_sort = TRUE)

-output

# A tibble: 5 x 7
# ID cond1_1 cond1_2 cond2_1 cond2_2 cond3_1 cond3_2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 0.293 0.920 6.44 9.14 18.5 71.9
#2 B 0.225 0.280 4.34 2.78 59.7 16.9
#3 C 0.704 0.764 7.05 1.40 75.3 64.0
#4 D 0.519 0.802 7.06 5.51 22.4 66.7
#5 E 0.663 0.255 3.88 2.25 30.1 14.2

If it needs to have repeating names, just strip off the _\\d+ at the end of the name with str_remove

Lengthen data frame with duplicate names

The problem is not pivot_wider, it can be used on data.frames containing columns with the same name - mutate can't. So we need to transform the columns to character columns either by (i) using base R or (ii) if you want to stay in the larger tidyverse purrr::modify_at (after all a data.frame is always a list). After that its just a regular call to pivot_wider.

df <- setNames(data.frame(c(1,2,3), 
c(4,5,6),
c("a","b","c"),
c("d","e","f"),
stringsAsFactors = F),
c("a","b","c","c"))

library(dplyr)
library(tidyr)

# Alternatively use base R to transform cols to character
# df[,c("a", "b")] <- lapply(df[,c("a", "b")], as.character)

df %>%
purrr::modify_at(c("a","b"), as.character) %>%
pivot_longer(-a,
names_to = "Names",
values_to = "Values")
#> # A tibble: 9 x 3
#> a Names Values
#> <chr> <chr> <chr>
#> 1 1 b 4
#> 2 1 c a
#> 3 1 c d
#> 4 2 b 5
#> 5 2 c b
#> 6 2 c e
#> 7 3 b 6
#> 8 3 c c
#> 9 3 c f

Created on 2021-02-23 by the reprex package (v0.3.0)

Tidying data with several repeating variables in R

You are on the right track with using gather, but need some additional steps to split the prefix off the column names. Try the following:

library(dplyr)
library(tidyr)

df = data.frame(
id = c(1,2,3),
c_health = c(3,2,4),
c_animals = c(2,3,1),
z_health = c(7,8,9),
z_animals = c(9,7,6),
stringsAsFactors = FALSE
)

output = df %>%
# gather on all columns other than id
gather(key = "question", value = "response", -all_of("id")) %>%
# split off prefix and rest of column name
mutate(prefix = substr(question,1,1),
desc = substr(question,3,nchar(question))) %>%
# keep just the columns of interest
select(id, prefix, desc, response) %>%
# reshape wider
spread(prefix, response)

Update - my comment on differing prefix lengths does not return the correct answer. Because [] indexing does not work that way inside mutate. Same idea but correct syntax as follows:

output = df %>%
# gather on all columns other than id
gather(key = "question", value = "response", -all_of("id")) %>%
# split off prefix and rest of column name
mutate(split = strsplit(question, "_")) %>%
mutate(prefix = sapply(split, function(x){x[1]}),
desc = sapply(split, function(x){x[2]})) %>%
# keep just the columns of interest
select(id, prefix, desc, response) %>%
# reshape wider
spread(prefix, response)

R dates as column names containing duplicate values (need to retain original date)

You have two header rows, which is pretty messy. I'd recommend re-reading the data, skipping the date line, then incorporating the date line as part of the column names.

If you already have the data read in, you can try something like this:

library(data.table)
df2 <- setDT(df[-1, ])
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))
# Current data
df2
# Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1 pending pending pending
# 2: Org2 complete complete complete
# 3: Org3 complete pending pending
# 4: Org4 pending complete complete

# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
# Org value Date Contract
# 1: Org1 pending 12/16/18 111
# 2: Org2 complete 12/16/18 111
# 3: Org3 complete 12/16/18 111
# 4: Org4 pending 12/16/18 111
# 5: Org1 pending 12/16/18 222
# 6: Org2 complete 12/16/18 222
# 7: Org3 pending 12/16/18 222
# 8: Org4 complete 12/16/18 222
# 9: Org1 pending 1/18/18 222
# 10: Org2 complete 1/18/18 222
# 11: Org3 pending 1/18/18 222
# 12: Org4 complete 1/18/18 222

If you do want to stick with dplyr and tidyr, here's a translation of the above:

library(dplyr)
library(tidyr)
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>%
slice(-1) %>%
pivot_longer(-Org) %>%
separate(name, into = c("Date", "Contract"), sep = "_")

Note that you have to rename the dataset before you start chaining the other commands together.

R: gather() to Tidy data with two column headings

This is one option:

library(tidyverse)

# get unique Year values and create column names (to add later)
df %>%
filter(Country_Territory == "Year") %>%
gather() %>%
filter(value != "Year" & !is.na(value)) %>%
pull(value) %>%
unique() %>%
paste0("Year_",.) -> col_years

# reshape data (excluding the Year row)
df %>%
filter(Country_Territory != "Year") %>%
gather(key,y,-Country_Territory, -WBCode) %>%
separate(key, c("measure","v")) %>%
group_by(v = ifelse(is.na(v), 0, v)) %>%
nest() -> df_info

reduce(df_info$data, function(x,y) left_join(x,y,by=c("Country_Territory","WBCode","measure"))) %>%
setNames(c("Country_Territory", "WBCode", "measure", col_years))

# # A tibble: 18 x 5
# Country_Territory WBCode measure Year_1996 Year_1998
# <chr> <chr> <chr> <dbl> <dbl>
# 1 Andorra ADO Estimate 1.32 1.38
# 2 Afghanistan AFG Estimate -1.29 -1.18
# 3 Angola AGO Estimate -1.17 -1.41
# 4 Andorra ADO StdErr 0.48 0.46
# 5 Afghanistan AFG StdErr 0.34 0.33
# 6 Angola AGO StdErr 0.26 0.21
# 7 Andorra ADO NumSrc 1 1
# 8 Afghanistan AFG NumSrc 2 2
# 9 Angola AGO NumSrc 4 6
# 10 Andorra ADO Rank 87.1 89.2
# 11 Afghanistan AFG Rank 4.3 9.79
# 12 Angola AGO Rank 9.68 1.55
# 13 Andorra ADO Lower 72.0 74.7
# 14 Afghanistan AFG Lower 0 0
# 15 Angola AGO Lower 0.54 0
# 16 Andorra ADO Upper 96.8 96.9
# 17 Afghanistan AFG Upper 27.4 31.4
# 18 Angola AGO Upper 27.4 13.4

If you save the above output as df_upd you can reshape a bit more to get Year as one column:

df_upd %>%
gather(Year, value, -Country_Territory, -WBCode, -measure) %>%
separate(Year, c("y","Year"), convert = T) %>%
select(-y)

# # A tibble: 36 x 5
# Country_Territory WBCode measure Year value
# <chr> <chr> <chr> <int> <dbl>
# 1 Andorra ADO Estimate 1996 1.32
# 2 Afghanistan AFG Estimate 1996 -1.29
# 3 Angola AGO Estimate 1996 -1.17
# 4 Andorra ADO StdErr 1996 0.48
# 5 Afghanistan AFG StdErr 1996 0.34
# 6 Angola AGO StdErr 1996 0.26
# 7 Andorra ADO NumSrc 1996 1
# 8 Afghanistan AFG NumSrc 1996 2
# 9 Angola AGO NumSrc 1996 4
# 10 Andorra ADO Rank 1996 87.1
# # ... with 26 more rows

In pandas, how to turn a dataframe into tidy data?

Use melt:

out = df.melt('year', var_name='localization', value_name='number_of_tests')

You can also use:

out = df.set_index('year').rename_axis(columns='localization').unstack() \
.rename('number_of_tests').reset_index()


Related Topics



Leave a reply



Submit