How to Detect the Encoding of a CSV File

Encodings are a complex subject, which we won't try to explain here. The important thing to know is that PANDA cannot infer from a CSV file what encoding it is in.

If the file contains characters that are not supported by the default encoding (UTF-8), this can cause PANDA to generate errors. When uploading or importing such a CSV file, PANDA raises an error related to the file's encoding.

The default is right in the large majority of cases, but when it isn't, figuring out the correct encoding can be very tricky. Using non-ASCII characters is never trivial, but it is sometimes unavoidable: most of the world's languages use non-Latin alphabets or add diacritics to the standard Latin script. The default character encoding in stylo is UTF-8; deviating from it can cause problems.

This function allows users to check the character encoding of the text files in a corpus. A summary is returned to the terminal, and a detailed list reporting the most probable encodings of all the text files in the folder can be written to a CSV file. The function is essentially a wrapper around guess_encoding() from the readr package by Wickham et al. (2017). To convert the files to UTF-8, try the change.encoding() function.
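For Python readers, the same workflow can be sketched with the standard library alone. The helper names below are hypothetical (this is not stylo's actual code), and the trial-decoding heuristic is a simplification: scan a folder, guess each file's encoding by attempting candidate decodings, and write the report to a CSV file.

```python
import csv
from pathlib import Path

# Candidate encodings to try, most restrictive first: UTF-8 rejects
# invalid byte sequences, so a clean UTF-8 decode is a strong signal.
CANDIDATES = ("utf-8", "windows-1252", "utf-16")

def guess_file_encoding(path):
    """Return the first candidate encoding that decodes the file cleanly."""
    raw = Path(path).read_bytes()
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

def report_folder(folder, out_csv="encoding_report.csv"):
    """Write a (file, encoding) report for every .txt file in a folder."""
    rows = [(p.name, guess_file_encoding(p))
            for p in sorted(Path(folder).glob("*.txt"))]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(("file", "encoding"))
        writer.writerows(rows)
    return rows
```

Unlike a statistical detector, this only reports the first candidate that decodes without error, so the candidate list and its order matter.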

Recently, I have had to deal with a lot of spreadsheets coming my way with different kinds of data. I don't control these CSV files, so I never know how they are generated. If I simply read such a file, I would often get something like this:

UnicodeDecodeError Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
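You can reproduce the same failure in pure Python: byte 0x99 is the trademark sign in Windows-1252, but it is not a valid start byte in UTF-8. The sample bytes below are invented for illustration.

```python
# 0x99 is '™' in Windows-1252, but an invalid start byte in UTF-8
raw = b"Acme\x99 sales"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # same 'invalid start byte' message as the pandas traceback

print(raw.decode("windows-1252"))  # decodes cleanly
```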

Basically, when you write the following, you assume that the data was encoded in UTF-8 (the default):

import pandas as pd

data = pd.read_csv("my_data.csv")

However, if the file is not actually UTF-8, you get the nasty error shown above. What to do? Try some common encodings manually, or stare at the raw bytes and try to figure it out?
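If you do want to brute-force it, a small loop over common encodings is enough. Here it is sketched on an in-memory byte string; the sample data and the candidate list are my own choices:

```python
# 'café' encoded as Windows-1252: the 0xE9 byte is invalid in UTF-8
raw = b"caf\xe9"

for enc in ("utf-8", "windows-1252", "utf-16"):
    try:
        text = raw.decode(enc)
        print(f"decoded with {enc}: {text}")
        break
    except UnicodeDecodeError:
        continue
```

The catch is that a successful decode does not prove the encoding is right (Latin-1, for instance, accepts any byte sequence), which is why a statistical detector is the better tool.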

A much better way is to let the chardet module do it for you. Here we read the first ten thousand bytes to guess the encoding. Note that chardet is not 100% accurate; its output includes a confidence level for the detected encoding. But it is still better than guessing manually.

import chardet

# look at the first ten thousand bytes to guess the character encoding
with open("my_data.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
The result is:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

So chardet is 73% confident that the right encoding is "Windows-1252". Now we can use this data to specify the encoding type as we try to read the file.

data = pd.read_csv("my_data.csv", encoding='Windows-1252')

No errors!
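As a final step, it is often worth re-saving the data as UTF-8 so downstream tools stop tripping over it. A sketch with an in-memory file stands in for the real one here (the sample data and output path are invented):

```python
import io
import pandas as pd

# simulate a Windows-1252 CSV on disk (invented sample data)
raw = "name,price\ncafé,3.50\n".encode("windows-1252")

# read with the detected encoding, exactly as with a real file path
data = pd.read_csv(io.BytesIO(raw), encoding="windows-1252")

# round-trip to UTF-8; with a real file you would write
# data.to_csv("my_data_utf8.csv", index=False, encoding="utf-8")
utf8_csv = data.to_csv(index=False)
print(utf8_csv)
```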
