General Unicode/Utf-8 Support for CSV Files in Python 2.6

Reading a UTF8 CSV file with Python

The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
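For illustration, a transcoding wrapper along those lines might look like the sketch below. The helper name transcoding_csv_reader is hypothetical and 'latin-1' is only a placeholder for whatever encoding your input files actually use:

import csv

def transcoding_csv_reader(byte_lines, source_encoding='latin-1', dialect=csv.excel, **kwargs):
    # Re-encode each raw line to UTF-8 before the csv module sees it ...
    utf8_lines = (line.decode(source_encoding).encode('utf-8') for line in byte_lines)
    # ... then decode the parsed cells back to unicode.
    for row in csv.reader(utf8_lines, dialect=dialect, **kwargs):
        yield [unicode(cell, 'utf-8') for cell in row]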

Python DictWriter writing UTF-8 encoded CSV files

UPDATE: The 3rd-party unicodecsv module implements this 7-year-old answer for you. An example follows the code below. There's also a Python 3 solution that doesn't require a 3rd-party module.

Original Python 2 Answer

If using Python 2.7 or later, use a dict comprehension to remap the dictionary to utf-8 before passing to DictWriter:

# coding: utf-8
import csv
D = {'name':u'马克','pinyin':u'mǎkè'}
f = open('out.csv','wb')
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = csv.DictWriter(f,sorted(D.keys()))
w.writeheader()
w.writerow({k:v.encode('utf8') for k,v in D.items()})
f.close()

You can use this idea to update UnicodeWriter to DictUnicodeWriter:

# coding: utf-8
import csv
import cStringIO
import codecs

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, D):
        self.writer.writerow({k: v.encode("utf-8") for k, v in D.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for D in rows:
            self.writerow(D)

    def writeheader(self):
        self.writer.writeheader()

D1 = {'name': u'马克', 'pinyin': u'Mǎkè'}
D2 = {'name': u'美国', 'pinyin': u'Měiguó'}
f = open('out.csv', 'wb')
f.write(u'\ufeff'.encode('utf8'))  # BOM (optional...Excel needs it to open UTF-8 file properly)
w = DictUnicodeWriter(f, sorted(D1.keys()))
w.writeheader()
w.writerows([D1, D2])
f.close()

Python 2 unicodecsv Example:

# coding: utf-8
import unicodecsv as csv

D = {u'name':u'马克',u'pinyin':u'mǎkè'}

with open('out.csv', 'wb') as f:
    w = csv.DictWriter(f, fieldnames=sorted(D.keys()), encoding='utf-8-sig')
    w.writeheader()
    w.writerow(D)

Python 3:

Additionally, Python 3's built-in csv module supports Unicode natively:

# coding: utf-8
import csv

D = {u'name':u'马克',u'pinyin':u'mǎkè'}

# Use newline='' instead of 'wb' in Python 3.
with open('out.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(D.keys()))
    w.writeheader()
    w.writerow(D)

Read Unicode from CSV

It appears you have written Python lists directly to your CSV file, which produced the [...] literal syntax instead of normal columns. In doing so you removed most of the information that could have been used to turn the data back into Python lists of unicode strings.

What you have left are Python unicode literals, but without the quotes. Use the unicode_escape codec to decode the values back to Unicode:

with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        value = line.rstrip('\r\n').decode('unicode_escape')
        print value

or add back the u'..' quoting, using a triple-quoted string in an attempt to avoid needing to escape embedded quotes:

from ast import literal_eval

with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        value = literal_eval("u'''{}'''".format(line.rstrip('\r\n')))
        print value

If you still have the original file (with the [u'...'] formatted lines), use the ast.literal_eval() function to turn those back into Python lists. No point in using the CSV module here:

from ast import literal_eval

with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        lis = literal_eval(line)
        value = lis[0]
        print value

Demo with unicode_escape:

>>> for line in b0rken:
...     print line.rstrip('\r\n').decode('unicode_escape')
...
Aeronáutica
Niš
Künste
École de l'Air

Cannot convert csv from utf-8 to ansi with csv writer python 2.6

There may be some redundant code here, but I got this to work by doing the following:

  • First I did the encoding using the .decode and .encode functions to convert the text to "cp1252".

  • Then I read the csv from the cp1252-encoded file and wrote it to a new csv.

...

import datetime
import csv

# Define what filenames to read
filenames = ["FILE1","FILE2"]

infilenames = [filename+".csv" for filename in filenames]
outfilenames = [filename+"_out_.csv" for filename in filenames]
midfilenames = [filename+"_mid_.csv" for filename in filenames]

# Iterate over each file
for infilename, outfilename, midfilename in zip(infilenames, outfilenames, midfilenames):

    # Open file and read utf-8 text, then encode in cp1252
    infile = open(infilename, "r")
    infilet = infile.read()
    infilet = infilet.decode("utf-8")
    infilet = infilet.encode("cp1252", "ignore")

    # write cp1252 encoded file
    midfile = open(midfilename, "w")
    midfile.write(infilet)
    midfile.close()

    # read csv with new cp1252 encoding
    midfile = open(midfilename, "r")
    reader = csv.reader(midfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # define output
    outfile = open(outfilename, "w")
    writer = csv.writer(outfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_NONE, escapechar='\\')

    # write output to new csv file
    for row in reader:
        writer.writerow(row)

    print("written file", outfilename)
    infile.close()
    midfile.close()
    outfile.close()

Unicode to UTF8 for CSV Files - Python via xlrd

I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:

this_row.append(s.cell_value(row,col))

to:

this_row.append(s.cell_value(row,col).encode('utf8'))

If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:

val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)

Python - read csv file of unicode substitutions

I don't think your problem actually exists:

Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.

Those are just different representations of the exact same string. You can test it yourself:

>>> u'\xe0' == u'\u00e0'
True

The actual problem is that you're not doing any replacing. In this code:

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string.replace(old, new)
    return string

You're just calling string.replace over and over, which returns a new string, but does nothing to string itself. (It can't do anything to string itself; strings are immutable.) What you want is:

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string = string.replace(old, new)
    return string

However, if string really is a UTF-8-encoded str, as the function name implies, this still won't work. When you UTF-8-encode u'\u00e0', what you get is '\xc3\xa0'. There is no \u00e0 in there to be replaced. So, what you really need to do is decode it, do the replaces, then re-encode. Like this:

def replace_UTF8(self, string):
    u = string.decode('utf-8')
    for old, new in self.mapping:
        print new
        u = u.replace(old, new)
    return u.encode('utf-8')

Or, even better, keep things as unicode instead of encoded str throughout your program except at the very edges, so you don't have to worry about this stuff.
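As a rough sketch of that "unicode at the core, bytes only at the edges" idea (the filenames and codec here are just placeholders):

import codecs

# Decode once, where the data enters the program ...
with codecs.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # unicode from here on

# ... work purely with unicode objects in the middle ...
text = text.replace(u'\u00e0', u'a')

# ... and encode once, where the data leaves the program.
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)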


Finally, this is a very slow and complicated way to do the replacing, when strings (both str and unicode) have a built-in translate method to do exactly what you want.

Instead of building your table as a list of pairs of Unicode strings, build it as a dict mapping ordinals to ordinals:

mapping = {}
for row in reader:
    mapping[ord(row[0].decode("unicode_escape"))] = ord(row[1])

And now, the whole thing is a one-liner, even with your encoding mess:

def replace_UTF8(self, string):
    return string.decode('utf-8').translate(self.mapping).encode('utf-8')

A resilient, actually working CSV implementation for non-ascii?

You are attempting to apply a solution to a different problem. Note this:

def utf_8_encoder(unicode_csv_data):

You are feeding it str objects.

The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-based encoding such as cp125x, any East Asian encoding, or UTF-8; not UTF-16, not UTF-32), and the delimiter, this will work:

for row in csv.reader(open("foo.csv", "rb"), delimiter=known_delimiter):
    row = [item.decode(encoding) for item in row]

Your sample_euro.csv looks like cp1252 with comma delimiter. The Russian one looks like cp1251 with semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used and maybe the currency also -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".

Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.

Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""

You must know the encoding for ANY file-reading exercise to work.

Guessing the encoding correctly all the time for any encoding in any size file is not very difficult -- it's impossible. However restricting the scope to csv files saved out of Excel or Open Office in the user's locale's default encoding, and of a reasonable size, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.

Update 2 in response to """working code would be most welcome"""

Working code (Python 2.x):

from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

# Exercise for the reader: replace the above with a class

import csv
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2] # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)

with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)

Output 1:

delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
u'Overf\xf8rsel utland',
u'UTLBET; ID 9710032001647082',
u'1990.00',
u'']
[u'31-01-11',
u'Overf\xf8ring',
u'OVERF\xd8RING MELLOM EGNE KONTI',
u'5750.00',
u';']

Output 2:

delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
u'04.02.2011 23:20',
u'300,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421',
u'']
[u'-',
u'04.02.2011 23:15',
u'450,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
u'']
[u'-',
u'13.01.2011 02:05',
u'100,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421 kolombina',
u'']

Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and use xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
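For reference, reading such a workbook with xlrd is roughly as simple as the sketch below (the filename and sheet index are placeholders); text cells come back as unicode, so there is no decoding step:

import xlrd

book = xlrd.open_workbook('sample.xls')  # hypothetical filename
sheet = book.sheet_by_index(0)
for rownum in range(sheet.nrows):
    # cell_value returns unicode for text cells, floats for numbers and dates
    print [sheet.cell_value(rownum, colnum) for colnum in range(sheet.ncols)]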

python's encoding difference

Firstly, the writerow interface expects a list-like object, so the first snippet is correct for this interface. In your second snippet, however, the method assumes that the string you have passed as an argument is a list, and iterates over it character by character, which is probably not what you wanted. You could try writerow([temp]) and see that it should match the output of the first case, as in the sketch below.
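A minimal illustration of that difference (temp is just a stand-in for whatever string you are writing):

import csv

with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    temp = 'abc'
    writer.writerow(temp)    # wrong: each character becomes its own column -> a,b,c
    writer.writerow([temp])  # right: one column holding the whole string -> abc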

Secondly, I want to warn you that the Python 2 csv module is notorious for headaches with unicode; basically, unicode is unsupported there. Try using unicodecsv as a drop-in replacement for the csv module if you need unicode support. Then you won't need to encode the strings before writing them to the file: you write the unicode objects directly and let the library handle the encoding, as sketched below.
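A rough sketch of that approach (the row contents here are only examples):

# coding: utf-8
import unicodecsv

with open('out.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    # Pass unicode objects straight through; unicodecsv encodes them on write.
    writer.writerow([u'马克', u'mǎkè'])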


