Parsing a CSV file using different encodings and libraries
Looking at the file in question:
$ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700 ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00 n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500 c.y...B.u.d.g.e.
The byte order mark fffe at the start suggests the file encoding is little-endian UTF-16, and the 00 bytes at every other position back this up. This would suggest that you should be able to do this:
CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...
However that gives me invalid byte sequence in UTF-16LE (ArgumentError) coming from inside the CSV library. I think this is due to IO#gets only returning a single byte when faced with the BOM when called from CSV, resulting in invalid UTF-16. You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:
CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...
You might prefer to convert the string to a more familiar encoding instead, in which case you could do:
CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...
Both of these appear to work okay.
How to check encoding of a CSV file
You can use Notepad++ to evaluate a file's encoding without needing to write code. The detected encoding of the open file is displayed on the far right of the status bar. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.
Using the right encoding for csv file in Python 3
If I understand correctly, you have a csv file with cp1252 encoding. If that is the case, all you have to do is open the file with the right encoding. As far as the csv parsing is concerned, I would use the csv module from the standard library. Alternatively, you may want to look into a more specialized library like pandas. Anyway, to parse your csv you could do just:
import csv

with open(filepath, 'r', encoding='cp1252') as file_obj:
    # adjust the parameters according to your file, see docs for more
    csv_obj = csv.reader(file_obj, delimiter='\t', quotechar='"')
    for row in csv_obj:
        # row is a list of entries
        # this would print all entries, separated by commas
        print(', '.join(row))
Reading a UTF8 CSV file with Python
The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:
import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
Mixed encoding in csv file
In fact, there was a mixed encoding for random cells in several places. Probably, there was an issue when exporting the data from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formed characters, it assumes that the whole line is encoded in the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library module that provides functionality to work painlessly with csv (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys

def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding = "UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding = "Windows-1252") # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames = reader.fieldnames, lineterminator = "\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
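What ftfy.fix_text repairs in each cell is classic mojibake: correct UTF-8 bytes that were decoded with the wrong codec somewhere upstream. The mechanism can be demonstrated with the standard library alone; ftfy's value is that it figures out which wrong codec was used automatically, whereas this sketch hard-codes Windows-1252:

```python
good = "café — ✓"

# Simulate the export bug: UTF-8 bytes wrongly decoded as Windows-1252,
# so each multi-byte character turns into several garbage characters
# (the 'café' part, for instance, becomes 'cafÃ©')
mojibake = good.encode("utf-8").decode("windows-1252")

# Undo it: re-encode with the wrong codec to recover the original bytes,
# then decode them as the UTF-8 they always were
fixed = mojibake.encode("windows-1252").decode("utf-8")
assert fixed == good
```

ftfy applies essentially this round-trip, trying a handful of likely wrong codecs and keeping the result that looks most plausible.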
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
How to determine the encoding of text
EDIT: chardet seems to be unmaintained, but most of this answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.
Correctly detecting the encoding all times is impossible.
(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
You can also use UnicodeDammit. It will try the following methods:
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
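The tail end of that strategy (try UTF-8, then fall back to Windows-1252) can be approximated with the standard library alone. This is an illustrative sketch, not UnicodeDammit itself:

```python
def guess_decode(data):
    """Try candidate encodings in order; return (text, encoding_used)."""
    for encoding in ("utf-8", "windows-1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # windows-1252 leaves a few bytes undefined (0x81, 0x8D, ...),
            # so even it can fail on arbitrary binary data
            continue
    # latin-1 maps every byte to a code point, so it never fails
    return data.decode("latin-1"), "latin-1"
```

Because valid UTF-8 is statistically unlikely to occur by accident, trying it first rarely misclassifies single-byte-encoded text.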
How to read a text file with mixed encodings in Scala or Java?
This is how I managed to do it with Java:
FileInputStream input;
String result = null;
try {
    input = new FileInputStream(new File("invalid.txt"));
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    InputStreamReader reader = new InputStreamReader(input, decoder);
    BufferedReader bufferedReader = new BufferedReader(reader);
    StringBuilder sb = new StringBuilder();
    String line = bufferedReader.readLine();
    while (line != null) {
        sb.append(line);
        line = bufferedReader.readLine();
    }
    bufferedReader.close();
    result = sb.toString();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println(result);
The invalid file is created with bytes:
0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94
which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.
With .REPLACE you see the standard unicode replacement character being used:
//"h�ellö� wö�rld�"
With .IGNORE, you see the invalid bytes ignored:
//"hellö wörld"
Without specifying .onMalformedInput, you get:
java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
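For comparison, the same three behaviours exist in Python via the errors argument to bytes.decode, using the exact byte sequence above:

```python
raw = bytes([0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20,
             0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94])

# errors="ignore" drops the 4 malformed bytes, like CodingErrorAction.IGNORE
ignored = raw.decode("utf-8", errors="ignore")    # 'hellö wörld'

# errors="replace" substitutes U+FFFD for each one, like CodingErrorAction.REPLACE
replaced = raw.decode("utf-8", errors="replace")  # 'h\ufffdellö\ufffd wö\ufffdrld\ufffd'

# the default, errors="strict", raises UnicodeDecodeError --
# the counterpart of Java's MalformedInputException
```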