Reading two-line headers in R
I would do two steps, assuming we know that the first row contains the labels, and there are always two headers.
header <- scan("file.txt", nlines = 1, what = character())
data <- read.table("file.txt", skip = 2, header = FALSE)
Then add the character vector header
on as the names
component:
names(data) <- header
For your data this would be
header <- scan("data.txt", nlines = 1, what = character())
data <- read.table("data.txt", skip = 2, header = FALSE)
names(data) <- header
head(data)
> head(data)
trt biomass yield
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59
If you want the units, as per @DWin's answer, then do a second scan()
on line 2
header2 <- scan("data.txt", skip = 1, nlines = 1, what = character())
names(data) <- paste0(header, header2)
> head(data)
trtcrop biomassMg/ha yieldbu/ac
1 C2 17.76 205.92
2 C2 17.96 207.86
3 CC 17.72 197.22
4 CC 18.42 205.20
5 CCW 18.15 200.51
6 CCW 17.45 190.59
When reading a file, can I specify some column names and let R read the rest from the header row?
There's not a way to do it in one step with read.csv
(or any other CSV-reading function I'm aware of).
It's trivial to do it in two steps.
Just do it in two steps.
Skip multiple header rows while using read_excel or read.excel in R
You just need to skip = 3
instead of 2, as you need to skip the header when you read in the data. Since we have already defined column names in myCols
, then we do not need to keep the column name row when you read it in.
library(readxl)
myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE))
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)
Output
indicator_name M2 GDP
<dttm> <dbl> <dbl>
1 2018-01-01 00:00:00 6.71 8.17
2 2018-01-02 00:00:00 6.79 8.19
3 2018-01-03 00:00:00 6.77 8.21
4 2018-01-04 00:00:00 6.73 8.20
5 2018-01-05 00:00:00 6.67 8.20
6 2018-01-06 00:00:00 6.62 8.21
7 2018-01-07 00:00:00 6.62 8.21
8 2018-01-08 00:00:00 6.64 8.22
9 2018-01-09 00:00:00 6.64 8.22
If you have the first column name blank, then you could replace the NA
in the column names before reading in the data.
library(tidyverse)
myCols <- read_excel("./test123.xlsx", n_max = 2, col_names = FALSE) %>%
slice(1) %>%
mutate(across(everything(), ~replace_na(., "indicator_name"))) %>%
as.character()
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)
Benchmark
In this instant, it looks like it would still be faster to just filter out the rows after reading them in.
bm <- microbenchmark::microbenchmark(filter_before = {myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE));
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)},
filter_after = {myDF2 <- read_excel("./test123.xlsx");
myDF2 <- myDF2[-c(1:2),]},
times = 1000)
autoplot(bm)
Related Topics
Merging Data Frames with Different Number of Rows and Different Columns
How to Test If Object Is a Vector
Can Lapply Not Modify Variables in a Higher Scope
How to Fit Long Text into Ggplot2 Facet Titles
R: How to Create a Vector of Functions
Ggplot2 Time Series Plotting: How to Omit Periods When There Is No Data Points
Conditionally Apply Pipeline Step Depending on External Value
How to Store the Returned Value from a Shiny Module in Reactivevalues
Convert Lat/Lon to Zipcode/Neighborhood Name
How to Ddply() Without Sorting
How to Call the 'Function' Function
More Efficient Strategy for Which() or Match()
Drawing Simple Mediation Diagram in R