Reading Two-Line Headers in R

Reading two-line headers in R

I would do two steps, assuming we know that the first row contains the labels, and there are always two headers.

header <- scan("file.txt", nlines = 1, what = character())
data <- read.table("file.txt", skip = 2, header = FALSE)

Then add the character vector header on as the names component:

names(data) <- header

For your data this would be

header <- scan("data.txt", nlines = 1, what = character())
data <- read.table("data.txt", skip = 2, header = FALSE)
names(data) <- header

head(data)

> head(data)
trt biomass yield
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59

If you want the units, as per @DWin's answer, then do a second scan() on line 2

header2 <- scan("data.txt", skip = 1, nlines = 1, what = character())
names(data) <- paste0(header, header2)

> head(data)
trtcrop biomassMg/ha yieldbu/ac
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59

Read csv with two headers into a data.frame

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5

Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.

You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL

Update

To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:

temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

Gather a dataset with multiple header rows

Note: Find the original dataset (i.e. df0) at the bottom.


It can be done in tidyverse by creating a character vector with metal name, unit, and method pasted together. Later we can use that vector to rename the columns. After that, reshaping the data into long format we'll have a column with all those three together (i.e. name column) which can be separated into three columns in order to get the desired output. See below;

library(dplyr)
library(tidyr)

df1 <- df0
## creating a character vector with the desired column names
df1 %>%
select(-c(1,2)) %>%
slice(1:2) %>%
mutate_if(is.factor, as.character) %>%
rbind(names(.), .) %>%
summarise_all(toString) %>%
as.character -> comp.header

## renaming columns and then converting to long format
df1 %>%
slice(-c(1,2)) %>%
rename_at(3:ncol(.), list(~comp.header)) %>%
pivot_longer(-c(Sample_Date, Site_Code)) %>%
separate(name, sep = ", ", into = c("Metal", "Detection", "Method"))

#> # A tibble: 20 x 6
#> Sample_Date Site_Code Metal Detection Method value
#> <fct> <fct> <chr> <chr> <chr> <fct>
#> 1 21/07/2016 1 Arsenic ug/L TM66TW 0.7
#> 2 21/07/2016 1 Barium ug/L TM66TW 88.6
#> 3 21/07/2016 1 Boron ug/L TM66TW 24
#> 4 21/07/2016 1 Antimony ug/L TM66FW <0.3
#> 5 21/07/2016 A Arsenic ug/L TM66TW 0.7
#> 6 21/07/2016 A Barium ug/L TM66TW 110
#> # ... with 14 more rows



In base we can achieve this by appending the site code and units to the headers, then converting data to long format and at last separating the column with column names to three columns with "Metal", "Unit", and "Method". See below;

df1 <- df0
## making sure that everything is character and not factpr
df1[] <- lapply(df1, as.character)

## appending unit and site info to the header names
names(df1)[3:ncol(df1)] <- paste(names(df1)[3:ncol(df1)],
df1[1,3:ncol(df1)],
df1[2,3:ncol(df1)], sep = ";")

## removing second and third row
df1 <- df1[3:nrow(df1),]

## converting wide to long
df2 <- cbind(df1[1:2], stack(lapply(df1[-c(1, 2)], as.character)))

## separate ind column to three desired variables and append to the dataframe
df2 <- data.frame(subset(df2, select = -ind),
setNames(as.data.frame(do.call(rbind, strsplit(as.character(df2$ind), ';'))),
c("Metal", "Detection", "Code")))

## rearranging the columns
df2 <- df2[c(1,2,4:ncol(df2), 3)]

## result
head(df2)
#> Sample_Date Site_Code Metal Detection Code values
#> 1 21/07/2016 1 Arsenic ug/L TM66TW 0.7
#> 2 21/07/2016 A Arsenic ug/L TM66TW 0.7
#> 3 15/09/2016 1 Arsenic ug/L TM66TW 0.5
#> 4 15/09/2016 A Arsenic ug/L TM66TW 0.4
#> 5 15/09/2016 2 Arsenic ug/L TM66TW 0.6
#> 6 21/07/2016 1 Barium ug/L TM66TW 88.6
#> ...






Data:

## reading data as df0 to preserve it for both solutions
read.table(text="Sample_Date Site_Code Arsenic Barium Boron Antimony
. . ug/L ug/L ug/L ug/L
. . TM66TW TM66TW TM66TW TM66FW
21/07/2016 1 0.7 88.6 24 <0.3
21/07/2016 A 0.7 110 19 <0.3
15/09/2016 1 0.5 67 32 <0.3
15/09/2016 A 0.4 96.5 22 <0.3
15/09/2016 2 0.6 122 26 <0.3", header= T) -> df0

Read excel with two-line headers in R

As long as the Excel file is small, I would approach it like this.

library(readxl)
library(zoo)
library(tidyr)
library(dplyr)
library(reshape2)

cols <- read_excel("path_to_file.xlsx",
col_names = FALSE)[1:2, ]

cols[1, ] <- na.locf(as.character(cols[1, ]),
na.rm = FALSE)

cols <- vapply(X = cols,
FUN = function(x) if (is.na(x[1])) x[2] else paste0(x, collapse = "_"),
FUN.VALUE = character(1))

DFrame <- read_excel("path_to_file.xlsx",
col_names = FALSE,
skip = 2) %>%
setNames(cols) %>%
mutate(row_id = row_number())

DFrame

DFrame %>%
select(row_id, A_a:B_c) %>%
gather(column, value, A_a:B_c) %>%
mutate(AB_group = sub("(^[A-B])_.+$", "\\1", column),
column = sub("^[A-B]_", "", column)) %>%
dcast(row_id + AB_group ~ column,
value.var = "value")

The downside is that, since read_excel doesn't have an nrows argument, this code actually reads the file in twice. There are other packages that can read excel files, such as gdata::read.xls that will let you only read in the top two rows to get the column names. I think that package has a little more overhead.

Regardless, when I face this situation, I either import the column names separately and manipulate them to what I want before assigning them to the data, or I just don't read in the names at all and manually reset all of the variable names using setNames.

Reshape data with multiple header rows in R

The tricky part is getting the correct column names and then performing gather and separate, which it sounds like you're already familiar with. Normally I do not like selecting rows by number, but in this case I think it makes sense if you're reliably reading the data from Excel in this way.

note: If a person's name has a . in it, the last step would fail.

library(tidyverse)

less_messy <- messy %>%
mutate_all(as.character) %>%
set_names(c("Part",
paste(names(.)[2:ncol(.)],
.[1, 2:ncol(.)],
sep = "-"))) %>%
`[`(2:nrow(.), ) %>%
gather("key", "value", contains("Measurement")) %>%
separate("key", c("person", "measurement"), sep = "-") %>%
mutate_at("person", ~ stringr::str_replace(.x, "\\..*",""))

less_messy
#> Part person measurement value
#> 1 Part Number 1 Julie Measurement 1 33
#> 2 Part Number 2 Julie Measurement 1 34
#> 3 Part Number 3 Julie Measurement 1 33
#> 4 Part Number 1 Julie Measurement 2 32
#> 5 Part Number 2 Julie Measurement 2 31
#> 6 Part Number 3 Julie Measurement 2 31
#> 7 Part Number 1 Joe Measurement 1 33
#> 8 Part Number 2 Joe Measurement 1 33
#> 9 Part Number 3 Joe Measurement 1 30
#> 10 Part Number 1 Joe Measurement 2 31
#> 11 Part Number 2 Joe Measurement 2 32
#> 12 Part Number 3 Joe Measurement 2 31

data

changed marginally because I noticed a discrepancy between what you showed and your code for the measurement numbers:

messy <- data.frame(
" " = c(" ", "Part Number 1", "Part Number 2", "Part Number 3"),
Julie = c("Measurement 1", 33, 34, 33),
Julie = c("Measurement 2", 32, 31, 31),
Joe = c("Measurement 1", 33, 33, 30),
Joe = c("Measurement 2", 31, 32, 31))

When reading a file, can I specify some column names and let R read the rest from the header row?

There's not a way to do it in one step with read.csv (or any other CSV-reading function I'm aware of).

It's trivial to do it in two steps.

Just do it in two steps.

How to format data from excel containing two rows of column headers to be able to use in R?

One option would be to

  1. read your excel file with option colNames = FALSE
  2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator which makes it easy to split the names later on.
  3. Get rid of the first two rows
  4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)

# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]

library(tidyr)
library(dplyr)

df %>%
pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8

DATA

df <- structure(list(X1 = c(
NA, NA, "Total", "A", "B", "C", "D", "E",
"F", "G"
), X2 = c(
"Year Rolling Avg.", "Share", NA, "1", "1",
"1", "1", "1", "1", "1"
), X3 = c(
"Year Rolling Avg.", "Y", "1",
"2", "3", "4", "5", "6", "7", "8"
), X4 = c(
"Year Rolling Avg.",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
"Jan-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
"Jan-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
"Feb-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
"Feb-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
NA,
10L
), class = "data.frame")


Related Topics



Leave a reply



Submit