Read CSV with Two Headers into a Data.Frame

Read csv with two headers into a data.frame

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5

Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.

You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL

Update

To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:

temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

How to read a CSV file with multiple headers into two DataFrames in pandas, one with the headers and one with the data with some headers removed?

You can use:

#read file with MultiIndex
df = pd.read_csv(file, header=[0,1], na_values=[-1,''])
print (df)
Name Height Age
Unnamed: 0_level_1 Metres Unnamed: 2_level_1
0 A NaN 25.0
1 B 95.0 NaN

df1 = df.copy()
#remove first level of MultiIndex
df1.columns = df1.columns.droplevel(1)
print (df1)
Name Height Age
0 A NaN 25.0
1 B 95.0 NaN

#select first level of MultiIndex
a = df.columns.get_level_values(level=0)
#select second level of MultiIndex and replace Unnamed
b = df.columns.get_level_values(level=1).str.replace('Un.*','')
#DataFrame constructor
df2 = pd.DataFrame([a, b])
print (df2)
0 1 2
0 Name Height Age
1 Metres

Pandas: Read CSV with multiple headers

To keep both header rows, I would suggest to create a pd.Multindex from the first two rows of your data.

Therefore, you will need to import your data without header.

import numpy as np
import pandas as pd

df = pd.read_csv('~/Desktop/stackoverflow_data.csv', sep=r"[| ^]", header=None, engine='python')
df.reset_index(inplace=True)
df.fillna(np.nan, inplace=True)
df.head()

Output:

    level_0     level_1     level_2     0   1
0 HEADER 20181130 NaN NaN NaN
1 [Col1] [Col2] [Col3] [Col4] [Col5]
2 The quick "bro,wn" fox jumped
3 over the fat lazy dog
4 m1213 4,12r4 fr,34 ,56,gt 12fr,12fr

Then you will need to zip the two first rows as tuples (and btw remove the square brackets) and create a Multindex object:

cols = tuple(zip(df.iloc[0], df.iloc[1].apply(lambda x: x[1:-1])))

header = pd.MultiIndex.from_tuples(cols, names=['Lvl_1', 'Lvl_2'])

# delete the header rows and assign new header
df.drop([0,1], inplace=True)
df.columns = header

df.head()

This is the output:

Lvl_1   HEADER 20181130       NaN                   
Lvl_2 Col1 Col2 Col3 Col4 Col5
2 The quick "bro,wn" fox jumped
3 over the fat lazy dog
4 m1213 4,12r4 fr,34 ,56,gt 12fr,12fr
5 Trailer N NaN NaN NaN

Gather a dataset with multiple header rows

Note: Find the original dataset (i.e. df0) at the bottom.


It can be done in tidyverse by creating a character vector with metal name, unit, and method pasted together. Later we can use that vector to rename the columns. After that, reshaping the data into long format we'll have a column with all those three together (i.e. name column) which can be separated into three columns in order to get the desired output. See below;

library(dplyr)
library(tidyr)

df1 <- df0
## creating a character vector with the desired column names
df1 %>%
select(-c(1,2)) %>%
slice(1:2) %>%
mutate_if(is.factor, as.character) %>%
rbind(names(.), .) %>%
summarise_all(toString) %>%
as.character -> comp.header

## renaming columns and then converting to long format
df1 %>%
slice(-c(1,2)) %>%
rename_at(3:ncol(.), list(~comp.header)) %>%
pivot_longer(-c(Sample_Date, Site_Code)) %>%
separate(name, sep = ", ", into = c("Metal", "Detection", "Method"))

#> # A tibble: 20 x 6
#> Sample_Date Site_Code Metal Detection Method value
#> <fct> <fct> <chr> <chr> <chr> <fct>
#> 1 21/07/2016 1 Arsenic ug/L TM66TW 0.7
#> 2 21/07/2016 1 Barium ug/L TM66TW 88.6
#> 3 21/07/2016 1 Boron ug/L TM66TW 24
#> 4 21/07/2016 1 Antimony ug/L TM66FW <0.3
#> 5 21/07/2016 A Arsenic ug/L TM66TW 0.7
#> 6 21/07/2016 A Barium ug/L TM66TW 110
#> # ... with 14 more rows



In base we can achieve this by appending the site code and units to the headers, then converting data to long format and at last separating the column with column names to three columns with "Metal", "Unit", and "Method". See below;

df1 <- df0
## making sure that everything is character and not factpr
df1[] <- lapply(df1, as.character)

## appending unit and site info to the header names
names(df1)[3:ncol(df1)] <- paste(names(df1)[3:ncol(df1)],
df1[1,3:ncol(df1)],
df1[2,3:ncol(df1)], sep = ";")

## removing second and third row
df1 <- df1[3:nrow(df1),]

## converting wide to long
df2 <- cbind(df1[1:2], stack(lapply(df1[-c(1, 2)], as.character)))

## separate ind column to three desired variables and append to the dataframe
df2 <- data.frame(subset(df2, select = -ind),
setNames(as.data.frame(do.call(rbind, strsplit(as.character(df2$ind), ';'))),
c("Metal", "Detection", "Code")))

## rearranging the columns
df2 <- df2[c(1,2,4:ncol(df2), 3)]

## result
head(df2)
#> Sample_Date Site_Code Metal Detection Code values
#> 1 21/07/2016 1 Arsenic ug/L TM66TW 0.7
#> 2 21/07/2016 A Arsenic ug/L TM66TW 0.7
#> 3 15/09/2016 1 Arsenic ug/L TM66TW 0.5
#> 4 15/09/2016 A Arsenic ug/L TM66TW 0.4
#> 5 15/09/2016 2 Arsenic ug/L TM66TW 0.6
#> 6 21/07/2016 1 Barium ug/L TM66TW 88.6
#> ...






Data:

## reading data as df0 to preserve it for both solutions
read.table(text="Sample_Date Site_Code Arsenic Barium Boron Antimony
. . ug/L ug/L ug/L ug/L
. . TM66TW TM66TW TM66TW TM66FW
21/07/2016 1 0.7 88.6 24 <0.3
21/07/2016 A 0.7 110 19 <0.3
15/09/2016 1 0.5 67 32 <0.3
15/09/2016 A 0.4 96.5 22 <0.3
15/09/2016 2 0.6 122 26 <0.3", header= T) -> df0

Handling multiple column headers and same column names in csv - pandas/python

You do not want to transpose the dataframe but stack one column level. Simply you must declare to pandas that the csv file has a 2 rows header:

data=pd.read_csv('transposedata.csv', header=[0,1]).stack(level=0).sort_index(level=2)

It should give:

             X   Y
AA A PROD1 1 2
BB B PROD1 3 4
CC C PROD1 5 6
DD D PROD1 7 8
AA A PROD2 9 10
BB B PROD2 11 12
CC C PROD2 13 14
DD D PROD2 15 16

Read CSV with multiple Headers

Less efficient and clever than CT Zhu's solution but maybe a little simpler:

import pandas as pd
from StringIO import StringIO

with open ('foo.csv', 'r') as myfile:
data = myfile.read().split('\n\n')

pieces = [pd.read_csv(StringIO(x),sep=' ') for x in data]
print pd.concat(pieces,axis=1)

h1 h2 h3 h4 h5 h6
0 11 12 13 14 15 16
1 10 10 10 10 10 10

Split a dataframe with multiple header rows into unique dataframes

I did a brute force method below.

I also used a snippet from @enke to get the row-header indices. I originally thought all the row-headers had Line as their value so I commented that out and used the snippet from @enke

Note, I changed the data you had to make the output split data easier to see. I changed row index 2 to have DateX and LibraryFileX to see the new header applied, and the values in the column Line

"""
Extract row as header and split up df into chunks
"""

import pandas as pd

# Original data
#data = {'Line': ['0', '0', 'Line', '0', '0'], 'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
# 'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Changed 'Line' Values and the row-header values to show outputs better
data = {'Line': ['0', '1', 'Line', '2', '3'], 'Date': ['8/25/2021', '8/25/2021', 'DateX', '8/25/2021', '8/25/2021'],
'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFileX','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}

# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print("INPUT")
print(df)
print("")

FIELD_TO_SEARCH = 'Line'
# VALUE_TO_MATCH_FIELD = 'Line'
# header_indices = df.index[df[FIELD_TO_SEARCH] == VALUE_TO_MATCH_FIELD].tolist()
header_indices = df.index[pd.to_numeric(df[FIELD_TO_SEARCH], errors='coerce').isna()].tolist()
# Add one row past the end so we can have a stopping point later
header_indices.append(df.shape[0] + 1)

# Preallocate output df list with the first chunk (using existing headers).
list_of_dfs = [df.iloc[0:header_indices[0]]]

if len(header_indices) > 1:
for idx in range(len(header_indices) - 1):
# Extract new header
header_index = header_indices[idx]
next_header_index = header_indices[idx + 1]
current_header = df.iloc[[header_index]].values.flatten().tolist()

# Make a df from this chunk
current_df = df[header_index + 1:next_header_index]
# Apply the new header
current_df.columns = current_header
current_df.reset_index(drop=True, inplace=True)
list_of_dfs.append(current_df)

# Show output
print("OUTPUT")
for df_index, current_df in enumerate(list_of_dfs):
print("DF chunk index: {}".format(df_index))
print(current_df)

Here's the output from what I ran:

INPUT
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 1 8/25/2021 PSI_210825_G2_ASD4_F.LIB
2 Line DateX LibraryFileX
3 2 8/25/2021 PSI_210825_G2_ASD3.LIB
4 3 8/25/2021 PSI_210825_G2_ASD3.LIB

OUTPUT
DF chunk index: 0
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 1 8/25/2021 PSI_210825_G2_ASD4_F.LIB
DF chunk index: 1
Line DateX LibraryFileX
0 2 8/25/2021 PSI_210825_G2_ASD3.LIB
1 3 8/25/2021 PSI_210825_G2_ASD3.LIB


Related Topics



Leave a reply



Submit