UnicodeDecodeError when reading CSV file in Pandas with Python
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8", for reading, and generally utf-8 for to_csv.
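A minimal sketch of that workflow, assuming pandas is installed. The file names and sample rows are made up for illustration: a small Latin-1 file is created, read with an explicit encoding, and written back out as UTF-8.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "latin1.csv")  # hypothetical input file
    dst = os.path.join(tmp, "utf8.csv")    # hypothetical output file

    # Create a sample file encoded as ISO-8859-1 (made-up data).
    with open(src, "wb") as fh:
        fh.write("name,city\nJosé,Málaga\n".encode("ISO-8859-1"))

    # Reading with the default encoding (utf-8) fails on this file.
    try:
        pd.read_csv(src)
        default_failed = False
    except UnicodeDecodeError:
        default_failed = True

    # Reading with the matching encoding works.
    df = pd.read_csv(src, encoding="ISO-8859-1")

    # Write as UTF-8 and read the result back.
    df.to_csv(dst, index=False, encoding="utf-8")
    round_trip = pd.read_csv(dst, encoding="utf-8")

print(default_failed)
print(round_trip)
```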
You can also use an alias such as 'latin' instead of 'ISO-8859-1', or the closely related Windows encoding 'cp1252', which is a separate codec rather than a true alias (see the Python docs, which also list the numerous other encodings you may encounter).
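Whether a name is a true alias can be checked with the standard codecs module, which reports the codec a name resolves to. The sketch below is illustrative only; the sample byte 0x80 is an assumption, chosen because it is one of the places where cp1252 and ISO-8859-1 differ.

```python
import codecs

# 'latin' and 'ISO-8859-1' resolve to the same underlying codec.
same_codec = codecs.lookup("latin").name == codecs.lookup("ISO-8859-1").name

# 'cp1252' is a separate codec: it agrees with ISO-8859-1 on most bytes,
# but assigns printable characters to parts of the 0x80-0x9f range.
distinct_codec = codecs.lookup("cp1252").name != codecs.lookup("ISO-8859-1").name

sample = b"\x80"  # assumed sample byte
as_cp1252 = sample.decode("cp1252")      # the euro sign
as_latin1 = sample.decode("ISO-8859-1")  # a C1 control character

print(same_codec, distinct_codec, repr(as_cp1252), repr(as_latin1))
```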
See the relevant Pandas documentation, the Python docs examples on csv files, and the many related questions here on SO. A good background resource is "What every developer should know about unicode and character sets".
To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page), or file -i (Linux) or file -I (macOS) (see man page).
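If neither tool is available, a rough check can be done in Python itself by attempting a strict decode with a few candidate encodings. This is a minimal sketch, not what enca or file do internally; the function name guess_encoding and the candidate list are assumptions for illustration. ISO-8859-1 accepts any byte sequence, so it only makes sense as the last candidate.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample inputs.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("ISO-8859-1")

print(guess_encoding(b"plain text"))
print(guess_encoding(utf8_bytes))
print(guess_encoding(latin1_bytes))
```

A successful decode only shows that the bytes are valid in that encoding, not that the resulting text is what the author wrote, so it is worth inspecting the decoded output before trusting the guess.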
python pandas, unicode decode error on read_csv
Try using s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='Latin-1')
Encoding issues mostly arise from the characters within the data. While utf-8 supports all languages according to pandas' documentation, utf-8 has a byte structure that must be respected at all times. A file saved in a single-byte encoding such as Latin-1 does not follow that structure. There, the Latin small letter i with diaeresis (ï), the right-pointing double angle quotation mark (») and the inverted question mark (¿) are stored as the single bytes 0xef, 0xbb and 0xbf respectively, and such a byte on its own is not a valid utf-8 sequence. Hence your error.
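The byte-level behaviour can be reproduced without pandas. The sketch below uses only built-in str and bytes methods; the variable names are illustrative.

```python
text = "ï»¿"

latin1_bytes = text.encode("ISO-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes per character

# 0xbb is a UTF-8 continuation byte, so it cannot start a character.
try:
    b"\xbb".decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason

# Taken together, the three Latin-1 bytes happen to form the UTF-8
# byte order mark, which is why "ï»¿" often shows up at the start of
# a UTF-8 file that was opened as Latin-1.
bom = latin1_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes, error_reason, repr(bom))
```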
pandas csv UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 162: invalid start byte
read_csv takes an encoding argument to deal with files in different formats; "ISO-8859-1" should work for you:
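import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
file = r'C:\...\thpdb.csv'
# Read only the 'Peptide Sequence' column, decoding the file as ISO-8859-1
df = pd.read_csv(file, usecols=['Peptide Sequence'], encoding="ISO-8859-1")
print(df)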
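One reason "ISO-8859-1" makes this particular error go away is that it maps every possible byte value to a character, so decoding never fails. The sketch below illustrates this with the byte 0x81 from the error message; it is a standalone illustration and does not use the file from the question.

```python
byte = b"\x81"

# UTF-8: 0x81 is a continuation byte and cannot begin a character.
try:
    byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# cp1252: Python's codec leaves 0x81 undefined, so this fails as well.
try:
    byte.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# ISO-8859-1 maps all 256 byte values, so decoding always succeeds.
decoded = byte.decode("ISO-8859-1")
all_bytes_ok = len(bytes(range(256)).decode("ISO-8859-1")) == 256

print(utf8_ok, cp1252_ok, repr(decoded), all_bytes_ok)
```

Because the decode always succeeds, the result can still contain wrong characters if the file was really written in another encoding, so it is worth checking a few non-ASCII values after loading.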