Reading in Files with Two Rows for Header

Reading two-line headers in R

I would do two steps, assuming we know that the first row contains the labels, and there are always two headers.

header <- scan("file.txt", nlines = 1, what = character())
data <- read.table("file.txt", skip = 2, header = FALSE)

Then add the character vector header on as the names component:

names(data) <- header

For your data this would be

header <- scan("data.txt", nlines = 1, what = character())
data <- read.table("data.txt", skip = 2, header = FALSE)
names(data) <- header

head(data)

> head(data)
trt biomass yield
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59

If you want the units, as per @DWin's answer, then do a second scan() on line 2

header2 <- scan("data.txt", skip = 1, nlines = 1, what = character())
names(data) <- paste0(header, header2)

> head(data)
trtcrop biomassMg/ha yieldbu/ac
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59

When reading a file, can I specify some column names and let R read the rest from the header row?

There's not a way to do it in one step with read.csv (or any other CSV-reading function I'm aware of).

It's trivial to do it in two steps.

Just do it in two steps.

Skip multiple header rows while using read_excel or read.excel in R

You just need to skip = 3 instead of 2, as you need to skip the header when you read in the data. Since we have already defined column names in myCols, then we do not need to keep the column name row when you read it in.

library(readxl)

myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE))
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)

Output

  indicator_name         M2   GDP
<dttm> <dbl> <dbl>
1 2018-01-01 00:00:00 6.71 8.17
2 2018-01-02 00:00:00 6.79 8.19
3 2018-01-03 00:00:00 6.77 8.21
4 2018-01-04 00:00:00 6.73 8.20
5 2018-01-05 00:00:00 6.67 8.20
6 2018-01-06 00:00:00 6.62 8.21
7 2018-01-07 00:00:00 6.62 8.21
8 2018-01-08 00:00:00 6.64 8.22
9 2018-01-09 00:00:00 6.64 8.22

If you have the first column name blank, then you could replace the NA in the column names before reading in the data.

library(tidyverse)

myCols <- read_excel("./test123.xlsx", n_max = 2, col_names = FALSE) %>%
slice(1) %>%
mutate(across(everything(), ~replace_na(., "indicator_name"))) %>%
as.character()
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)

Benchmark

In this instant, it looks like it would still be faster to just filter out the rows after reading them in.

Sample Image

bm <- microbenchmark::microbenchmark(filter_before = {myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE));
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)},
filter_after = {myDF2 <- read_excel("./test123.xlsx");
myDF2 <- myDF2[-c(1:2),]},
times = 1000)
autoplot(bm)


Related Topics



Leave a reply



Submit