How to detect the right encoding for read.csv?
First of all, according to a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.
I've struggled with this many times and have arrived at a non-automatic solution:
Use iconvlist to get all possible encodings:
codepages <- setNames(iconvlist(), iconvlist())
Then read the data using each of them:
x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                                                    fileEncoding = enc,
                                                    nrows = 3, header = TRUE, sep = "\t"))) # you get lots of errors/warnings here
It is important to know the structure of the file (separator, headers). Set the encoding using the fileEncoding argument, and read only a few rows.
Now you can look at the results:
unique(do.call(rbind, sapply(x, dim)))
# [,1] [,2]
# 437 14 2
# CP1200 3 29
# CP12000 0 1
It seems the correct result is the one with 3 rows and 29 columns, so let's see which encodings produce it:
maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"
You can look at the data too:
x[maybe_ok]
For your file, all of these encodings return identical data (partly because there is some redundancy among the names, as you can see).
If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you need more magic to find the correct ones).
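For readers working in Python rather than R, the same brute-force idea can be sketched roughly as follows. This is not the answer's code: the function name shapes_by_encoding, its parameters, and the sample bytes are all made up for illustration, and it assumes a tab-separated file with a header row, as in the R example.

```python
import csv
import io


def shapes_by_encoding(data, encodings, delimiter="\t", nrows=3):
    """Map each encoding that can decode `data` to (data rows read, header columns).

    Hypothetical helper: a Python analogue of trying every name from iconvlist()
    and comparing dim() of the results.
    """
    shapes = {}
    for enc in encodings:
        try:
            text = data.decode(enc)
        except (UnicodeError, LookupError):
            continue  # bytes are invalid in this encoding, or the codec is unknown
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
        try:
            # Read the header plus at most `nrows` data rows.
            rows = [row for _, row in zip(range(nrows + 1), reader)]
        except csv.Error:
            continue  # the decoded text is not parseable as delimited data
        if rows:
            shapes[enc] = (len(rows) - 1, len(rows[0]))
    return shapes


# Made-up sample: a two-column, tab-separated file stored as UTF-16 with a BOM.
sample = "name\tcity\nZoé\tKöln\n".encode("utf-16")
shapes = shapes_by_encoding(sample, ["utf-8", "utf-16", "latin-1"])
print(shapes)
```

As in the R version, the shape you expect (here one data row and two columns) is what singles out the plausible encodings. Single-byte codecs such as latin-1 accept any byte sequence, so a successful decode alone proves nothing.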
How to check encoding of a CSV file
You can use Notepad++ to check a file's encoding without writing any code. The detected encoding of the open file is displayed on the bottom bar, at the far right. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.
type of encoding to read csv files in pandas
A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays: most encodings handle plain ASCII characters correctly. The problem arises with non-ASCII characters. Example:
character | Latin1 code | cp850 code | UTF-8 codes |
---|---|---|---|
é | '\xe9' | '\x82' | '\xc3\xa9' |
è | '\xe8' | '\x8a' | '\xc3\xa8' |
ö | '\xf6' | '\x94' | '\xc3\xb6' |
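The byte values in the table can be reproduced with Python's built-in codecs. The short sketch below is only an illustration (the variable names are arbitrary). It also shows why the wrong encoding is easy to miss: decoding UTF-8 bytes as Latin1 raises no error, it just produces different characters.

```python
# Encode each character from the table with each codec and show the raw bytes.
chars = ["é", "è", "ö"]
codecs_to_try = ["latin-1", "cp850", "utf-8"]

table = {ch: {enc: ch.encode(enc) for enc in codecs_to_try} for ch in chars}
for ch, row in table.items():
    print(ch, row)

# Decoding with the wrong codec does not fail; it silently yields other characters.
mojibake = "é".encode("utf-8").decode("latin-1")
print(mojibake)  # Ã©
```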
pd.read_csv not sure how to determine the encoding for my csv files
I was able to figure this out. It's not the most elegant solution, but it works. I made a method that finds all csv files in the current working directory; if any of the filenames contain a "µ" character, it replaces that character with an "_". It returns a list of all csv file names. I understand that this could potentially create naming conflicts, but since I'm the end user I'll be careful.
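# -*- coding: Latin-1 -*-
import os
import pandas as pd

# The function name comes from the call further down; the ".csv" default
# for suffix is an assumption based on the description above.
def find_csv_filenames_remove_nonASCII(path_to_dir, suffix=".csv"):
    filenames = os.listdir(path_to_dir)
    filenames_fixed = []
    for filename in filenames:
        if filename.endswith(suffix) and 'µ' in filename:
            # Rename the file on disk so its name no longer contains 'µ'
            new_filename = filename.replace('µ', '_')
            os.rename(os.path.join(path_to_dir, filename),
                      os.path.join(path_to_dir, new_filename))
            filenames_fixed.append(new_filename)
        elif filename.endswith(suffix):
            filenames_fixed.append(filename)
    return filenames_fixed

csv_list_cwd = find_csv_filenames_remove_nonASCII(os.getcwd())
for csv_file in csv_list_cwd:
    # Each iteration overwrites df_cwd; process or store it inside the loop
    df_cwd = pd.read_csv(csv_file, encoding="Latin-1")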
How to use the appropriate encoding when reading csv in Pandas?
First, detect the encoding of the file:
import chardet
with open(path+file, "rb") as f:
    data = f.read()
print(chardet.detect(data))
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
Then read the file with that encoding:
df_assets_and_liab = pd.read_csv(path+file, encoding='ISO-8859-1')
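A confidence of 0.73 is a guess, and ISO-8859-1 can decode any byte sequence, so a successful read proves little. One inexpensive cross-check, sketched here with the standard library only, is to look for a byte order mark and to test whether the bytes are strictly valid ASCII or UTF-8. The function name cheap_encoding_hint is made up for this example, and a None result just means these checks are inconclusive.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def cheap_encoding_hint(data):
    """Return a codec name suggested by a BOM or strict validation, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)  # strict decoding: raises on any invalid byte
            return name
        except UnicodeDecodeError:
            pass
    return None  # inconclusive: probably some single-byte legacy encoding


print(cheap_encoding_hint("é".encode("utf-8")))
print(cheap_encoding_hint(b"\xe9"))
```

Non-ASCII bytes that happen to form valid UTF-8 are rare in legacy single-byte files, so a "utf-8" answer is a fairly strong hint. When the result is None, a statistical guess such as the chardet one above, or knowledge of where the file came from, is still needed.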