UnicodeDecodeError when reading CSV file in Pandas with Python


read_csv takes an encoding option to deal with files that are not UTF-8 encoded. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8", for reading, and generally utf-8 for to_csv.
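As a minimal sketch of that read/write pattern (the file names and sample data here are made up for illustration, and the file is created on the fly so the snippet is self-contained):

```python
import os
import tempfile

import pandas as pd

# Build a small ISO-8859-1 encoded CSV so the example does not need a real file.
raw = "name,city\nZoë,Málaga\n".encode("ISO-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "latin1.csv")  # hypothetical input file
    dst = os.path.join(tmp, "utf8.csv")    # hypothetical output file
    with open(src, "wb") as fh:
        fh.write(raw)

    # Reading with the matching encoding decodes the accented letters correctly.
    df = pd.read_csv(src, encoding="ISO-8859-1")

    # Writing back as UTF-8 (which is also the to_csv default) normalizes the file.
    df.to_csv(dst, index=False, encoding="utf-8")

    with open(dst, "rb") as fh:
        utf8_bytes = fh.read()
```

After this, the output file can be read back with a plain pd.read_csv(dst), since utf-8 is the default there as well.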

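You can also use an alias such as 'latin' or 'latin1' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter). 'cp1252' (Windows) is not an alias but a closely related encoding: it differs from ISO-8859-1 only in the byte range 0x80-0x9F, and it is often the right choice for files produced on Windows.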
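How these names relate can be checked with the standard library alone. The following is a small sketch, not tied to any particular file; the variable names are only for illustration:

```python
import codecs

# 'latin' and 'ISO-8859-1' resolve to the same codec; 'cp1252' is a different one.
same_codec = codecs.lookup("latin").name == codecs.lookup("ISO-8859-1").name
cp1252_is_alias = codecs.lookup("cp1252").name == codecs.lookup("ISO-8859-1").name

# Byte 0x80 is the euro sign in cp1252 but a control character in ISO-8859-1.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("ISO-8859-1")

# Byte 0x81 is unassigned in Python's cp1252 codec, so decoding fails there,
# while ISO-8859-1 accepts every possible byte value.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True
latin1_0x81 = b"\x81".decode("ISO-8859-1")
```

This is why ISO-8859-1 never raises a decode error: it will always produce some text, even when it is not the encoding the file was actually written in.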

See the relevant Pandas documentation, the python docs examples on csv files, and plenty of related questions here on SO. A good background resource is "What every developer should know about unicode and character sets".

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page), or file -i (Linux) or file -I (macOS) (see man page).
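If neither tool is available, a rough fallback is to try a few candidate encodings in order from Python. This is only a sketch and not real detection: guess_encoding is a hypothetical helper, the candidate list is an assumption, and latin-1 accepts any byte sequence, so it has to come last and only tells you that nothing stricter matched.

```python
from typing import Optional, Sequence


def guess_encoding(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Optional[str]:
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
        return name
    return None  # only reachable if the candidate list has no catch-all
```

To use it with a file, read a sample in binary mode, for example guess_encoding(open(path, "rb").read()), and pass the result to read_csv as encoding. A successful decode does not prove the guess is right, so it is worth looking at a few non-ASCII values afterwards.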

python pandas, unicode decode error on read_csv

Try using s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='Latin-1')

Encoding issues mostly arise from the characters within the data. While utf-8 can represent all languages, it is a multi-byte encoding with a byte structure that must be respected at all times: every non-ASCII character is stored as a lead byte followed by one or more continuation bytes. In Latin-1, characters such as latin small letter i with diaeresis (ï), right-pointing double angle quotation mark (») and inverted question mark (¿) are stored as the single bytes 0xef, 0xbb and 0xbf respectively. A lone byte like that in a Latin-1 file does not form a valid utf-8 sequence, so the utf-8 decoder fails. Hence your error.
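The byte-level difference can be seen directly in Python. The snippet below is a small illustration using only built-in string methods; decodes_as_utf8 is a helper name made up for this example:

```python
# The same three characters, stored under two different encodings.
latin1_bytes = "ï»¿".encode("latin-1")  # one byte per character
utf8_bytes = "ï»¿".encode("utf-8")      # two bytes per character


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Each Latin-1 byte is tested on its own: 0xef is a lead byte with nothing
# after it, and 0xbb / 0xbf are continuation bytes with no lead byte.
# (Taken together, 0xef 0xbb 0xbf happen to form the UTF-8 byte order mark,
# which is why they are not tested as one sequence here.)
lone_byte_results = [decodes_as_utf8(bytes([b])) for b in latin1_bytes]
```

Here every entry of lone_byte_results is False, while decodes_as_utf8(utf8_bytes) is True.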

pandas csv UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 162: invalid start byte

read_csv takes an encoding argument to deal with files in different encodings. "ISO-8859-1" should work for you. See here:

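import pandas as pd

# Raw string, so the backslashes in the Windows path are not treated as escapes
file = r'C:\...\thpdb.csv'
# ISO-8859-1 maps every byte to a character, so byte 0x81 no longer raises
df = pd.read_csv(file, usecols=['Peptide Sequence'], encoding="ISO-8859-1")
print(df)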

