Reading a CSV File into Pandas Dataframe With Quotation in Some Entries

How to read a CSV file where rows are quoted into a dataframe

Looking carefully through pd.read_csv's (many) options, I can't find a way of removing these quotes during the read, and, thinking about it, I'm not sure there should be one.

Quoting is done for individual values, not rows. For example, it is often used to intentionally store commas as cell data rather than as cell separators. By telling pandas to ignore the quoting, you're also telling it that the quote character (") is just normal data.
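To make that concrete, here is a minimal sketch comparing the default quoting behaviour with csv.QUOTE_NONE. The one-row sample string is made up for illustration and is not taken from the question's file.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical one-row file: the first cell contains a comma.
sample = '"x,y",3\n'

# Default behaviour: the quotes delimit one cell, so the comma inside is data.
quoted = pd.read_csv(StringIO(sample), header=None)

# QUOTE_NONE: quotes are ordinary characters, so the comma splits the cell
# and the quote characters stay in the values.
unquoted = pd.read_csv(StringIO(sample), header=None, quoting=csv.QUOTE_NONE)

print(quoted.shape, unquoted.shape)
```

The second frame has one more column than the first, and its cells still carry the literal quote characters.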

For this case, I'd either strip the CSV file of quotes prior to reading it, or remove the quotes after reading it.



Method 1. Removing the quotes before reading the file

There are two ways. The first is doing it line by line, which is more intuitive but less efficient. The other way is doing it all in one go (less intuitive but more efficient):

import re

# Strip the quotes that wrap each line (and the very first/last ones) in one pass
with open('test_csv.csv') as f:
    text = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', f.read())
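The line-by-line variant mentioned above is not shown in the answer. One possible sketch is below; the helper name strip_line_quotes and the sample data are assumptions made for illustration.

```python
from io import StringIO


def strip_line_quotes(lines):
    """Return the text with the quotes that wrap each line removed."""
    cleaned = []
    for line in lines:
        # Drop the line ending first so the trailing quote is exposed,
        # then strip quote characters from both ends of the line.
        cleaned.append(line.rstrip('\r\n').strip('"'))
    return '\n'.join(cleaned) + '\n'


# Any iterable of lines works, including an open file object.
sample = StringIO('"a,b,c"\n"1,2,3"\n')
text_line_by_line = strip_line_quotes(sample)
print(text_line_by_line)
```

Like the regex, this removes every quote at the edges of a line, so it is only suitable when the outer quotes are known to be noise.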

Now, you can either write the processed data back to the file and read the file with pd.read_csv, or you can read the CSV from the string directly. I'll show both methods:

Writing back to the file:

import pandas as pd

with open('test_csv.csv', 'w') as f:
    f.write(text)

header_df = pd.read_csv("test_csv.csv", ...)
data_df = pd.read_csv("test_csv.csv", ...)
footer_df = pd.read_csv("test_csv.csv", ...)

Reading directly from the processed string:

from io import StringIO
s = StringIO(text)

header_df = pd.read_csv(s, ...); s.seek(0)
data_df = pd.read_csv(s, ...); s.seek(0)
footer_df = pd.read_csv(s, ...); s.seek(0)
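As a self-contained check of Method 1, the sketch below applies the same regex to a small in-memory sample and reads the result through StringIO. The sample text is invented; the real file's layout (header, data and footer blocks) is not reproduced here.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical file contents: every line is wrapped in quotes.
raw = '"a,b,c"\n"1,2,3"\n"4,5,6"\n'

# Same substitution as above: quotes around line breaks are dropped while the
# line break itself (group 1) is kept; quotes at the very start/end are removed.
cleaned = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', raw)

df = pd.read_csv(StringIO(cleaned))
print(df)
```

This relies on Python 3.5 or later, where an unmatched group in the replacement string is substituted with an empty string.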


Method 2. Removing the quotes after reading the file

Use df.iloc[:, [0, -1]] to select the first and last column of a dataframe:

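def remove_quotes(df):
    # Strip the stray quotes from the first and last columns, then cast back to int
    df.iloc[:, [0, -1]] = df.iloc[:, [0, -1]].astype(str).apply(lambda col: col.str.strip('"')).astype(int)
    # The first and last column names carry the same stray quotes
    df.columns = df.columns.str.strip('"')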

remove_quotes(header_df)
remove_quotes(data_df)
remove_quotes(footer_df)

Output:

>>> header_df
   x  y  z  w
0  1  2  3  4

>>> data_df
   a  s  d  f  g  h  j  k  l  z  x  c  v  b  n  m
0  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
1  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
2  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
3  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
4  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
5  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7
6  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7

>>> footer_df
   x  y  z  w
0  1  2  3  4
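For readers who want to try Method 2 end to end, here is a hedged, non-mutating variant of the same idea. The function name strip_edge_quotes and the sample text are assumptions, and the sample is read with quoting=3 (csv.QUOTE_NONE) so that the quotes survive into the frame, which may differ from how the original frames were read.

```python
from io import StringIO

import pandas as pd


def strip_edge_quotes(df):
    """Return a copy with quotes removed from the first/last column and the headers."""
    out = df.copy()
    out.columns = out.columns.str.strip('"')
    # Assumes column names are unique, so label-based assignment is unambiguous.
    for col in (out.columns[0], out.columns[-1]):
        out[col] = pd.to_numeric(out[col].astype(str).str.strip('"'))
    return out


# Hypothetical input: each line wrapped in quotes, read with quoting disabled.
raw_df = pd.read_csv(StringIO('"x,y,z,w"\n"1,2,3,4"\n'), quoting=3)
clean_df = strip_edge_quotes(raw_df)
print(clean_df)
```

Assigning whole columns by label, rather than through iloc, lets pandas replace the column dtype instead of writing the converted values into the existing string column.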

Reading a csv file into pandas dataframe with quotation in some entries

The main problem lies in the way Microsoft Excel actually saves the CSV file. Opening the same file in Notepad shows extra quotation marks on every line that contains quotes:

1) A quote is added at the start and end of the line.

2) Each existing quote is escaped with one more quote.

Hence, when the CSV file is imported into pandas, the whole line is taken as one string and ends up entirely in the first column.

To tackle this:

I read the CSV file, corrected it with regex substitutions, and saved the result as a text file. Then I imported this text file as a pandas dataframe. Problem solved.

import re

with open('csvdata.csv', 'r') as csv_file, open('new_csv.txt', 'w') as corrected_csv:
    for line in csv_file:
        # removing starting and ending quotes of a line
        line = re.sub(r'^"|"$', '', line)
        # substituting escaped quote with a single quote
        line = re.sub(r'""', '"', line)
        corrected_csv.write(line)
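The effect described above can be reproduced in memory. The sketch below uses an invented two-line sample in which only the data line was wrapped and escaped; the helper name unwrap_line is hypothetical.

```python
import re
from io import StringIO

import pandas as pd


def unwrap_line(line):
    """Undo the outer quotes and the doubled inner quotes on one line."""
    line = re.sub(r'^"|"$', '', line)   # outer quotes ($ also matches before a trailing newline)
    return line.replace('""', '"')      # escaped quote back to a single quote


# Hypothetical content: the data line was wrapped in quotes and its inner quotes doubled.
raw = 'col1,col2,col3\n"1,""x, y"",3"\n'

# Read as-is, the whole data line lands in the first column.
broken = pd.read_csv(StringIO(raw))

fixed_text = ''.join(unwrap_line(line) for line in StringIO(raw))
fixed = pd.read_csv(StringIO(fixed_text))
print(fixed)
```

After the fix, the inner quotes do their normal job again, so the comma inside "x, y" stays part of the cell.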

pandas read csv - recognize entries in quotation marks while using space delimiter?

Try changing sep='\s' to sep=' ':

df = pd.read_csv('<your file>', sep=' ', quotechar='"')
print(df)

Prints:

   Flag   Rootname  ...        Date                              Target Description
0     1  lcjw02hwq  ...  2014-10-23  ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD
1     1  lcjw02ikq  ...  2014-10-23  ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD

[2 rows x 19 columns]
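A reduced, self-contained version of the same call is sketched below. The three-column sample is invented (the real file has 19 columns). The pandas documentation notes that multi-character separators such as '\s' are treated as regular expressions and that regex delimiters are prone to ignoring quoted data, which is why a plain single space is used here.

```python
from io import StringIO

import pandas as pd

# Hypothetical space-delimited content with quoted multi-word fields.
sample = (
    'Flag Rootname "Target Description"\n'
    '1 lcjw02hwq "ISM;ABSORPTION LINE SYSTEM"\n'
)

# A single-character separator keeps quote handling active.
df_space = pd.read_csv(StringIO(sample), sep=' ', quotechar='"')
print(df_space)
```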

[Image: screenshot from LibreOffice of the file that df.to_csv() then produces]

How to read CSV file with pandas containing quotes and using multiple separators

I suggest replacing chunks of more than one " before or after a comma with a single occurrence, and then using pd.read_csv with the quotechar='"' argument to make sure the quoted fields end up in a single column:

content = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)
#...
combined_csv = pd.read_csv(csv, sep=";|,", engine="python", quotechar='"')

Regex details:

  • (?<![^,]) - immediately before the current location, there must be a comma or start of string
  • "{2,} - two or more " chars
  • | - or
  • "{2,} - two or more " chars
  • (?![^,]) - immediately after the current location, there must be a comma or end of string.
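To see the substitution in isolation, here is a small sketch on an invented one-line sample. It reads the result with the default comma separator only, which is an assumption made to keep the example simple; the pandas documentation cautions that regex delimiters are prone to ignoring quoted data, so the multi-separator call above is not exercised here.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical line where a field is wrapped in doubled quotes.
content = 'a,""hello, world"",b'

# Collapse runs of two or more quotes that touch a comma or the string boundary.
fixed = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)

df_fixed = pd.read_csv(StringIO(fixed), header=None, quotechar='"')
print(fixed)
print(df_fixed)
```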

How to read CSV file ignoring commas between quotes with Pandas

You might try

data = pd.read_csv('testfile.csv', sep=',', quotechar='"',
                   skipinitialspace=True, encoding='utf-8')

which tells pandas to ignore the space that comes after the comma; otherwise it can't recognize the quote, because the quote is no longer the first character of the field.

EDIT: Apparently this does not work for the author of the question.

Therefore, here is a script that produces the wanted result.
I have Python 3.8.9 and pandas 1.2.3.

itworks.py

import pandas as pd

with open("testfile.csv", "w") as f:
    f.write("""column1,column2,column3
a, b, c
a, c, "c, d"
""")

data = pd.read_csv("testfile.csv", sep=",", quotechar='"', skipinitialspace=True, encoding="utf-8")
print(data)

$ python itworks.py
  column1 column2 column3
0       a       b       c
1       a       c    c, d
$

Try to reproduce this minimal example.

Properly reading in a CSV file in pandas with double quotes


Solution 1. Remove the starting and ending " (double quote) from each line, then use

input_data = pd.read_csv('temp.csv', sep=',')

Solution 2. Use the parameter quoting=3 (the value of csv.QUOTE_NONE), which makes pandas treat quote characters as ordinary data

input_data = pd.read_csv('temp.csv', encoding='iso-8859-1', engine='python', sep=',', quoting=3)

Solution 3. Remove the extra "" from each value, so that each column's values are exactly what you wanted
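Solution 3 is given without code. One way it might look, combined with Solution 2, is sketched below. The sample text and the column names id and name are invented, and stripping every leading and trailing quote is an assumption about what "exactly what you wanted" means.

```python
import csv
from io import StringIO

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical content: whole lines wrapped in quotes, inner quotes doubled.
text = '"id,name"\n"1,""Alice"""\n'

# Read with quoting disabled so every quote character arrives as data.
df_raw = pd.read_csv(StringIO(text), quoting=csv.QUOTE_NONE)

# Strip the leftover quotes from the headers and from every non-numeric column.
df_clean = df_raw.copy()
df_clean.columns = df_clean.columns.str.strip('"')
df_clean = df_clean.apply(
    lambda col: col if is_numeric_dtype(col) else col.astype(str).str.strip('"')
)
print(df_clean)
```

Columns cleaned this way remain strings, so numeric columns that carried quotes would still need an explicit conversion afterwards.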


