How to read a CSV file where rows are quoted into a dataframe
Looking carefully through pd.read_csv's (many) options, I can't find a way of removing these quotes during the read, and, thinking about it, I'm not sure there should be one.
Quoting is done for individual values, not rows. For example, it is often used to intentionally store commas as cell data rather than as cell separators. By telling pandas to ignore the quoting, you're also telling it that the quote character (") is just normal data.
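To make the point about per-value quoting concrete, here is a small sketch. The sample strings and variable names are invented for illustration and are not the data from the question.

```python
import csv
from io import StringIO

import pandas as pd

# Default quoting: the comma inside the quoted value stays in one cell.
raw_quoted = 'name,note\nalice,"likes a, b"\n'
quoted = pd.read_csv(StringIO(raw_quoted))

# QUOTE_NONE: the quote characters are kept as ordinary data.
raw_plain = 'name,note\nalice,"hello"\n'
unquoted = pd.read_csv(StringIO(raw_plain), quoting=csv.QUOTE_NONE)

print(quoted.loc[0, "note"])
print(unquoted.loc[0, "note"])
```

With default quoting the cell holds the text including its comma, while with csv.QUOTE_NONE the cell keeps its surrounding quote characters.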
For this case, I'd either strip the CSV file of quotes prior to reading it, or remove the quotes after reading it.
Method 1. Removing the quotes before reading the file
There are two ways. The first is to do it line by line, which is more intuitive but less efficient. The other is to do it all in one go (less intuitive but more efficient):
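import re
import pandas as pd

# Strip the quotes around each line break and at the very start and end of the file
with open('test_csv.csv') as f:
    text = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', f.read())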
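The line-by-line variant is mentioned above but not shown. A minimal sketch, assuming each row is wrapped in a single pair of double quotes, might look like this. The helper name strip_row_quotes and the sample rows are hypothetical.

```python
from io import StringIO

import pandas as pd


def strip_row_quotes(lines):
    """Yield each line without the double quotes that wrap the whole row."""
    for line in lines:
        # Drop the line ending first so the trailing quote is reachable.
        yield line.rstrip("\r\n").strip('"') + "\n"


# Stand-in for iterating over an open file object.
sample = ['"x,y,z,w"\n', '"1,2,3,4"\n']
text = "".join(strip_row_quotes(sample))

df = pd.read_csv(StringIO(text))
print(df)
```

For a real file, the list could be replaced by the file object itself, since iterating over it yields one line at a time.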
Now, you can either write the processed data back to the file and read the file with pd.read_csv, or you can read the CSV from the string directly. I'll show both methods:
Writing back to the file:
with open('test_csv.csv', 'w') as f:
    f.write(text)
header_df = pd.read_csv("test_csv.csv", ...)
data_df = pd.read_csv("test_csv.csv", ...)
footer_df = pd.read_csv("test_csv.csv", ...)
Reading directly from the processed string:
from io import StringIO
s = StringIO(text)
header_df = pd.read_csv(s, ...); s.seek(0)
data_df = pd.read_csv(s, ...); s.seek(0)
footer_df = pd.read_csv(s, ...); s.seek(0)
Method 2. Removing the quotes after reading the file
Use df.iloc[:, [0, -1]] to select the first and last columns of a dataframe:
def remove_quotes(df):
    df.iloc[:, [0, -1]] = df.iloc[:, [0, -1]].astype(str).apply(lambda col: col.str.strip('"')).astype(int)
    df.columns = df.columns.str.strip('"')
remove_quotes(header_df)
remove_quotes(data_df)
remove_quotes(footer_df)
Output:
>>> header_df
x y z w
0 1 2 3 4
>>> data_df
a s d f g h j k l z x c v b n m
0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
4 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
6 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7
>>> footer_df
x y z w
0 1 2 3 4
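The read step for Method 2 is not shown above. The following end-to-end sketch assumes the file is read with quoting=csv.QUOTE_NONE, so that the row-level quotes survive as data in the first and last columns. The sample data is invented, and the cleanup is written column by column rather than through df.iloc, purely as an illustration.

```python
import csv
from io import StringIO

import pandas as pd

# Every row, including the header, is wrapped in double quotes.
raw = '"a,b,c"\n"1,2,3"\n"4,5,6"\n'

# QUOTE_NONE keeps the quotes as data: columns are '"a', 'b', 'c"'.
df = pd.read_csv(StringIO(raw), quoting=csv.QUOTE_NONE)

# Clean the header, then the first and last columns.
df.columns = df.columns.str.strip('"')
for col in (df.columns[0], df.columns[-1]):
    df[col] = df[col].astype(str).str.strip('"').astype(int)

print(df)
```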
Reading a csv file into pandas dataframe with quotation in some entries
The main problem lies in how Microsoft Excel actually saves the CSV file. Opening the same file in Notepad shows that extra quotation marks have been added to every line that contains quotes:
1) A quote is added at the start and end of the line.
2) Each existing quote is escaped with one more quote.
Hence, when the file is imported into pandas, each such line is treated as a single string and ends up entirely in the first column.
To tackle this, I read the CSV file, corrected it with regex substitutions, and saved the result as a text file. I then imported that text file as a pandas dataframe. Problem solved.
import re

# Read the original file and write the corrected lines to a new text file
with open('csvdata.csv', 'r') as csv_file, open('new_csv.txt', 'w') as corrected_csv:
    for line in csv_file:
        # remove the quote at the start and end of the line
        line = re.sub(r'^"|"$', '', line)
        # replace each escaped (doubled) quote with a single quote
        line = re.sub(r'""', '"', line)
        corrected_csv.write(line)
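The same two substitutions can also be applied in memory, without an intermediate text file. This is a sketch under assumptions: the helper name unwrap_line and the sample row are made up, and only rows that contain quotes are assumed to be wrapped.

```python
import re
from io import StringIO

import pandas as pd


def unwrap_line(line):
    """Undo the outer quotes and the doubled inner quotes on one line."""
    line = re.sub(r'^"|"$', '', line)  # quote at line start and line end
    return line.replace('""', '"')     # escaped quote -> single quote


raw = 'id,comment,score\n"1,""hello, world"",3"\n'
fixed = "".join(unwrap_line(l) for l in raw.splitlines(keepends=True))

df = pd.read_csv(StringIO(fixed))
print(df)
```

After the fix, the second row is an ordinary CSV row in which only the comment field is quoted, so pandas splits it into three columns.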
pandas read csv - recognize entries in quotation marks while using space delimiter?
Try changing sep='\s' to sep=' ':
df = pd.read_csv('<your file>', sep=' ', quotechar='"')
print(df)
Prints:
Flag Rootname ... Date Target Description
0 1 lcjw02hwq ... 2014-10-23 ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD
1 1 lcjw02ikq ... 2014-10-23 ISM;ABSORPTION LINE SYSTEM;HIGH VELOCITY CLOUD
[2 rows x 19 columns]
df.to_csv() then produces the following (LibreOffice screenshot not reproduced here):
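To try the sep=' ' suggestion without the original file, here is a small self-contained sketch. The two-line sample is invented and only loosely modeled on the output above.

```python
from io import StringIO

import pandas as pd

# Space-delimited data where one header and one value contain spaces
# and are therefore wrapped in double quotes.
raw = (
    'Flag Rootname "Target Description"\n'
    '1 lcjw02hwq "ISM;ABSORPTION LINE SYSTEM"\n'
)

# A single-character separator lets the parser honor quotechar.
df = pd.read_csv(StringIO(raw), sep=' ', quotechar='"')
print(df)
```

The quoted header and the quoted value each stay in one column even though they contain the separator character.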
How to read CSV file with pandas containing quotes and using multiple separators
I suggest replacing chunks of more than one " before or after a comma with a single occurrence, and then using pd.read_csv with the quotechar='"' argument to make sure the quoted fields end up in a single column:
content = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)
#...
combined_csv = pd.read_csv(csv, sep=";|,", engine="python", quotechar='"')
Regex details:
(?<![^,]) - immediately before the current location, there must be a comma or start of string
"{2,} - two or more " chars
| - or
"{2,} - two or more " chars
(?![^,]) - immediately after the current location, there must be a comma or end of string.
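As a quick check of the pattern described above, the sketch below applies it to a single made-up line and then parses the result. It uses a plain comma separator rather than the regex separator from the answer, to keep the example minimal.

```python
import re
from io import StringIO

import pandas as pd

# One line where the quoted field is wrapped in doubled quotes.
content = 'a,""b, c"",d'

# Collapse runs of 2+ quotes that touch a comma or the string boundary.
cleaned = re.sub(r'(?<![^,])"{2,}|"{2,}(?![^,])', '"', content)
print(cleaned)

df = pd.read_csv(StringIO(cleaned), header=None, quotechar='"')
print(df)
```

The opening run is matched by the first alternative (it follows a comma) and the closing run by the second (it precedes a comma), so the field is left with one quote on each side.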
How to read CSV file ignoring commas between quotes with Pandas
You might try
data = pd.read_csv('testfile.csv', sep=',', quotechar='"',
skipinitialspace=True, encoding='utf-8')
which tells pandas to ignore the space that comes after the comma, otherwise it can't recognize the quote.
EDIT: Apparently this does not work for the author of the question, so here is a script that produces the desired result.
I am using Python 3.8.9 and pandas 1.2.3.
itworks.py
import pandas as pd
with open("testfile.csv", "w") as f:
f.write("""column1,column2,column3
a, b, c
a, c, "c, d"
""")
data = pd.read_csv("testfile.csv", sep=",", quotechar='"', skipinitialspace=True, encoding="utf-8")
print(data)
$ python itworks.py
column1 column2 column3
0 a b c
1 a c c, d
$
Try to reproduce this minimal example.
Properly reading in a CSV file in pandas with double quotes
Solution 1. Remove the starting and ending " (double quote) from each line, then use
input_data = pd.read_csv('temp.csv', sep=',')
Solution 2. Use the parameter quoting=3 (the value of csv.QUOTE_NONE), so that quotes are treated as ordinary characters
input_data = pd.read_csv('temp.csv', encoding='iso-8859-1', engine='python', sep=',', quoting=3)
Solution 3. Remove the extra "" from each value (each column's values will then be exactly what you wanted)
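No code is given for Solution 3. One possible sketch, assuming the values carry doubled quotes and the file is read with quoting=csv.QUOTE_NONE as in Solution 2, is to strip the quotes from every non-numeric column afterwards. The sample data is invented.

```python
import csv
from io import StringIO

import pandas as pd

# Text values wrapped in doubled quotes.
raw = 'id,name\n1,""Ann""\n2,""Bob""\n'

# Keep the quotes as data while reading.
df = pd.read_csv(StringIO(raw), quoting=csv.QUOTE_NONE)

# Strip leading and trailing quotes from the text columns only.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].str.strip('"')

print(df)
```

Numeric columns are skipped because the .str accessor only applies to text data.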