Reading a UTF-8 CSV File with Python

The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
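As an aside, the codecs.open route mentioned above is for plain (non-CSV) text files; a minimal sketch, assuming a placeholder filename, might look like this:

import codecs

# codecs.open decodes from UTF-8 as you read, so each line comes back
# as a unicode object rather than a raw byte-string
f = codecs.open('some_utf8_text_file.txt', 'r', 'utf-8')
for line in f:
    print line.rstrip()
f.close()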

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
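For instance, a sketch of such a transcoding wrapper, assuming the input is ISO-8859-1 (the helper name is just for illustration):

def iso8859_to_utf8(latin_data):
    # re-encode each line so the csv module still sees UTF-8 byte-strings
    for line in latin_data:
        yield line.decode('iso-8859-1').encode('utf-8')

reader = unicode_csv_reader(iso8859_to_utf8(open('da.csv')))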

read utf-8 CSV file into dataframe

I fixed it thanks to the post at this question:

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I tried the fix suggested there:

df = pd.read_csv('myfile.csv', encoding='cp1252')

and it worked! The file was in Windows code page 1252 (cp1252), not UTF-8.
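If you are not sure in advance which of the two encodings a file uses, a small fallback along these lines (a sketch, reusing the same filename) can help:

import pandas as pd

try:
    # assume UTF-8 first; fall back to cp1252 for files saved by
    # Windows applications such as Excel
    df = pd.read_csv('myfile.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('myfile.csv', encoding='cp1252')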

Opening a CSV explicitly saved as UTF-8 still shows its encoding as cp1252

try this:

with open(filename, encoding="utf8") as f:
    print(f)  # the repr of the file object shows the encoding it was opened with
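From there, a minimal Python 3 sketch for actually reading the CSV, assuming the file really is UTF-8:

import csv

# newline="" is what the csv module expects for files it parses itself
with open(filename, encoding="utf8", newline="") as f:
    for row in csv.reader(f):
        print(row)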

Open csv file in utf-8 with Python

You can try using pandas.

import pandas
myfile = open('myfile.csv')
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=';')
print(data.values)

or unicodecsv

import unicodecsv
myfile = open('myfile.csv')
data = unicodecsv.reader(myfile, encoding='utf-8', delimiter=';')
for row in data:
    print row

You may be able to install them using pip:

pip install pandas

pip install unicodecsv

Depending on your needs, you may also get by with simple string operations (note that this skips the header row and will not cope with quoted fields containing the delimiter):

data = [line.strip().split(';') for i, line in enumerate(open('./foo.csv').readlines()) if i != 0]

Update
You can also try replacing unicode characters with ASCII equivalents:

from StringIO import StringIO
import codecs
import csv
import unicodedata

...

try:
    # read the whole file, normalize it to ASCII, then hand the result to csv.reader
    self.FichierE = StringIO(
        unicodedata.normalize(
            'NFKD', codecs.open(self.CheminFichierE, "r", "utf-8").read()
        ).encode('ascii', 'ignore'))
    self.ReaderFichierE = csv.reader(self.FichierE, delimiter=';')

except IOError:
    self.TextCtrl.AppendText(u"Fichier E n'a pas été trouvé")
    return

try:
    DataFichierE = [ligne for ligne in self.ReaderFichierE]
except UnicodeDecodeError:
    self.TextCtrl.AppendText(self.NomFichierE + u" n'est pas lisible")
    return
except UnicodeEncodeError:
    self.TextCtrl.AppendText(self.NomFichierE + u" n'est pas lisible (ASCII)")
    return
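The core trick, stripped of the surrounding class (the function name is just for illustration):

import unicodedata

def to_ascii(text):
    # decompose accented characters and drop anything that has no
    # ASCII representation, e.g. u'café' -> 'cafe'
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')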

Trouble with UTF-8 CSV input in Python

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.

Your 2nd and 3rd snippets are confused. Something like the following is all that you need:

import csv

f = open('your_utf8_encoded_file.csv', 'rb')
reader = csv.reader(f)
for utf8_row in reader:
    unicode_row = [x.decode('utf8') for x in utf8_row]
    print unicode_row

