Parsing a CSV File Using Different Encodings and Libraries

Looking at the file in question:

 $ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700 ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00 n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500 c.y...B.u.d.g.e.

The byte order mark ff fe at the start suggests the file encoding is little-endian UTF-16, and the 00 bytes at every other position back this up.

This would suggest that you should be able to do this:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...

However, that gives me invalid byte sequence in UTF-16LE (ArgumentError) from inside the CSV library. I think this is because IO#gets, when called by CSV, returns only a single byte when it hits the BOM, resulting in invalid UTF-16.

You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:

CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...

You might prefer to convert the string to a more familiar encoding instead, in which case you could do:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...

Both of these appear to work okay.

How to check encoding of a CSV file

You can use Notepad++ to evaluate a file's encoding without needing to write code. The detected encoding of the open file is displayed on the status bar at the bottom, far right side. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop-down.
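
If you'd rather check from code, here is a minimal Python sketch (the sniff_bom helper name and the testfile.csv path are illustrative) that looks for a byte order mark at the start of the file, which is enough to identify BOM-prefixed files like the UTF-16 example earlier on this page:

import codecs

def sniff_bom(path):
    # illustrative helper: guess an encoding from a leading byte order mark, if any
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in [(codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')]:
        if head.startswith(bom):
            return name
    return None  # no BOM; the file may still be any BOM-less encoding

print(sniff_bom('testfile.csv'))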

Using the right encoding for csv file in Python 3

If I understand correctly, you have a csv file with cp1252 encoding. If that is the case, all you have to do is open the file with the right encoding. For parsing the csv itself, I would use the csv module from the standard library. Alternatively, you may want to look into a more specialized library like pandas (see the sketch after the example below).

Anyway, to parse your csv you could do just:

import csv

with open(filepath, 'r', encoding='cp1252') as file_obj:
    # adjust the parameters according to your file, see docs for more
    csv_obj = csv.reader(file_obj, delimiter='\t', quotechar='"')
    for row in csv_obj:
        # row is a list of entries
        # this would print all entries, separated by commas
        print(', '.join(row))
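
If you do go with pandas instead, a minimal sketch under the same assumptions (cp1252 encoding, tab-delimited, reusing filepath from above):

import pandas as pd

# read the tab-delimited cp1252 file into a DataFrame;
# adjust sep/quotechar to match your file, just as with csv.reader above
df = pd.read_csv(filepath, encoding='cp1252', sep='\t', quotechar='"')
print(df.head())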

Reading a UTF8 CSV file with Python

The .encode method gets applied to a Unicode string to make a byte-string; but you're calling it on a byte-string instead... the wrong way 'round! Look at the codecs module in the standard library and codecs.open in particular for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in utf-8 data, and that's what you're already getting, so your code can be much simpler:

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level) of the form line.decode('whateverweirdcodec').encode('utf-8'). But you can probably just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
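
For example, a minimal variation of the reader above for ISO-8859-1 input (the latin1_csv_reader name is just illustrative) only changes the decode step:

import csv

def latin1_csv_reader(latin1_data, dialect=csv.excel, **kwargs):
    # identical to unicode_csv_reader, except cells are decoded as ISO-8859-1
    csv_reader = csv.reader(latin1_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'iso-8859-1') for cell in row]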

Mixed encoding in csv file

In fact, the file had mixed encodings, with problems in random cells in several places. Probably, there was an issue when the data was exported from its original source.

The problem with ftfy is that it processes the file line by line, and if it encounters well-formatted characters, it assumes the whole line is encoded in the same way and that the strange characters were intentional.

Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python's standard library provides the csv module for working painlessly with csv files (especially because it escapes cells correctly).

This is the code I used to process the file:

import csv
import ftfy
import sys

def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)

    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")

    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)

    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)

And then, calling:

$ python fix_encoding.py data.csv out.csv

will output a csv file with the right encoding.

How to determine the encoding of text

EDIT: chardet seems to be unmaintained, but most of the answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that kind of study to try to detect the encoding; it is a port of Mozilla's auto-detection code.
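
A minimal sketch of how it is typically used (the data.csv filename is just an example):

import chardet

with open('data.csv', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, 'language': ''}
# guess['encoding'] can be None if nothing plausible was found
text = raw.decode(guess['encoding'])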

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
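
A minimal sketch of using UnicodeDammit, which ships with Beautiful Soup (the bs4 package); again the filename is illustrative:

from bs4 import UnicodeDammit

with open('data.csv', 'rb') as f:
    raw = f.read()

dammit = UnicodeDammit(raw)
print(dammit.original_encoding)   # the encoding it settled on
text = dammit.unicode_markup      # the decoded text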

How to read a text file with mixed encodings in Scala or Java?

This is how I managed to do it with Java:

// imports needed: java.io.*, java.nio.charset.Charset,
// java.nio.charset.CharsetDecoder, java.nio.charset.CodingErrorAction
FileInputStream input;
String result = null;
try {
    input = new FileInputStream(new File("invalid.txt"));
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    InputStreamReader reader = new InputStreamReader(input, decoder);
    BufferedReader bufferedReader = new BufferedReader(reader);
    StringBuilder sb = new StringBuilder();
    String line = bufferedReader.readLine();
    while (line != null) {
        sb.append(line);
        line = bufferedReader.readLine();
    }
    bufferedReader.close();
    result = sb.toString();

} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println(result);

The invalid file is created with bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

This is hellö wörld in UTF-8 with 4 invalid bytes mixed in.

With .REPLACE you see the standard unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"

Without specifying .onMalformedInput, you get:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
at sun.nio.cs.StreamDecoder.read(Unknown Source)
at java.io.InputStreamReader.read(Unknown Source)
at java.io.BufferedReader.fill(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
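
As an aside, the same three behaviours can be reproduced with Python's decode error handlers, assuming the byte sequence listed above:

data = bytes([0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77,
              0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94])

print(data.decode('utf-8', errors='replace'))  # h�ellö� wö�rld�
print(data.decode('utf-8', errors='ignore'))   # hellö wörld
# data.decode('utf-8') with the default 'strict' handler raises UnicodeDecodeError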

