How to Detect the Right Encoding for read.csv

How to detect the right encoding for read.csv?

First of all, based on a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.

I've struggled with this many times and have come to a non-automatic solution:

Use iconvlist() to get all possible encodings:

codepages <- setNames(iconvlist(), iconvlist())

Then read the data using each of them:

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warnings here

The important thing here is to know the structure of your file (separator, headers). Set the encoding with the fileEncoding argument, and read only a few rows.

Now you can look at the results:

unique(do.call(rbind, sapply(x, dim)))
#         [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

It seems the correct one is the one with 3 rows and 29 columns, so let's see which encodings produced it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"

You could look at the data too:

x[maybe_ok]

For your file, all these encodings return identical data (partly because there is some redundancy in the list, as you can see).

If you don't know the specifics of your file, you need to use readLines with some changes in the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and do more magic to find the correct ones), as sketched below.
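
A minimal sketch of that readLines variant, assuming the same "encoding.asc" file as above; the acceptance criterion (three readable lines) is illustrative, not definitive:

x <- lapply(codepages, function(enc) try({
  con <- file("encoding.asc", encoding = enc)
  on.exit(close(con))
  readLines(con, n = 3)  # read a few raw lines instead of a table
}, silent = TRUE))
# length instead of dim: keep encodings that produced the expected lines
maybe_ok <- sapply(x, function(lines)
  !inherits(lines, "try-error") && length(lines) == 3)
codepages[maybe_ok]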

How to check encoding of a CSV file

You can use Notepad++ to evaluate a file's encoding without needing to write code. The detected encoding of the open file is displayed on the status bar, at the far right. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.

type of encoding to read csv files in pandas

A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays; most encodings can handle plain ASCII correctly. The problem arises with non-ASCII characters. Example:

character   Latin1 code   cp850 code   UTF-8 codes
é           '\xe9'        '\x82'       '\xc3\xa9'
è           '\xe8'        '\x8a'       '\xc3\xa8'
ö           '\xf6'        '\x94'       '\xc3\xb6'
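
You can reproduce these byte values yourself with a quick check in any Python 3 interpreter (the characters and encodings are the ones from the table):

for ch in "éèö":
    # encode the same character under each encoding and compare the bytes
    print(ch, ch.encode("latin-1"), ch.encode("cp850"), ch.encode("utf-8"))
# é b'\xe9' b'\x82' b'\xc3\xa9'
# è b'\xe8' b'\x8a' b'\xc3\xa8'
# ö b'\xf6' b'\x94' b'\xc3\xb6'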

pd.read_csv not sure how to determine the encoding for my csv files

I was able to figure this out. It's not the most elegant solution, but it works. I made a function that finds all CSV files in the current working directory; if any of the filenames contain a 'µ' character, it is replaced with an '_'. The function returns a list of all CSV file names. I understand that this could potentially create naming conflicts, but since I'm the end user I'll be careful.

# -*- coding: Latin-1 -*-
import os
import pandas as pd

def find_csv_filenames_remove_nonASCII(path_to_dir, suffix=".csv"):
    # Rename any CSV whose name contains 'µ', then return all CSV file names.
    filenames = os.listdir(path_to_dir)
    filenames_fixed = []
    for filename in filenames:
        if filename.endswith(suffix) and 'µ' in filename:
            new_filename = filename.replace('µ', '_')
            os.rename(os.path.join(path_to_dir, filename),
                      os.path.join(path_to_dir, new_filename))
            filenames_fixed.append(new_filename)
        elif filename.endswith(suffix):
            filenames_fixed.append(filename)
    return filenames_fixed

csv_list_cwd = find_csv_filenames_remove_nonASCII(os.getcwd())

for csv_file in csv_list_cwd:
    df_cwd = pd.read_csv(csv_file, encoding="Latin-1")

How to use the appropriate encoding when reading csv in Pandas?

First, look at the encoding of the file:

import chardet

with open(path + file, "rb") as f:
    data = f.read()
print(chardet.detect(data))

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

Then pass the detected encoding to read_csv:

df_assets_and_liab = pd.read_csv(path + file, encoding='ISO-8859-1')

