UnicodeDecodeError reading binary input

You need to read the files in binary mode:

bin_file_A = open(infile_A, "rb")
bin_file_B = open(infile_B, "rb")

Why does Python3 get a UnicodeDecodeError reading a text file where Python2 does not?

Your file is not UTF-8 encoded. Figure out what encoding is used and specify that explicitly when opening the file:

with open('negative-words.txt', 'r', encoding="<correct codec>") as f:
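
If you don't know the codec, one option is a detection library. A minimal sketch using the third-party chardet package (an assumption; charset-normalizer offers a similar API):

import chardet  # pip install chardet -- a heuristic detector, not a guarantee

with open('negative-words.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

with open('negative-words.txt', 'r', encoding=guess['encoding']) as f:
    text = f.read()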

In Python 2, str is a binary string containing encoded data, not Unicode text, which is why Python 2 doesn't raise the error here. You would hit the same issue in Python 2 if you used io.open() (which decodes the way Python 3's open() does), or if you tried to decode the data you read with word.decode('utf8').

You probably want to read up on Unicode and Python. I strongly recommend Ned Batchelder's Pragmatic Unicode.

How to prevent UnicodeDecodeError when reading piped input from sys.stdin?

I finally managed to work around this issue by not using sys.stdin!

Instead I used with open(0, 'rb'). Where:

  • 0 is the file descriptor for stdin.
  • 'rb' opens it in binary mode for reading.

This seems to circumvent the issues with the system trying to interpret your locale's character encoding in the pipe. I got the idea after seeing that the following worked and returned the correct (non-printable) characters:

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

▒▒▒
▒▒

So to correctly read any pipe data, I used:

if not sys.stdin.isatty():
    try:
        with open(0, 'rb') as f:
            inpipe = f.read()
    except Exception as e:
        err_unknown(e)
    # This can't happen in binary mode:
    # except UnicodeDecodeError as e:
    #     err_unicode(e)
    ...

That will read your pipe data into a Python bytes object.
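
For what it's worth, an equivalent route that avoids reopening file descriptor 0 is the binary layer that sys.stdin already exposes; a sketch:

import sys

# sys.stdin is a text wrapper; sys.stdin.buffer is the underlying
# binary stream, so this reads raw bytes just like open(0, 'rb').
if not sys.stdin.isatty():
    inpipe = sys.stdin.buffer.read()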

The next problem was to determine whether the pipe data was coming from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated by echo -ne "\xBA\xDD\xAT\xA0", as shown in the OP. In my case I just used a regex to look for characters outside the ASCII hex-digit range.

if inpipe:
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if r == []:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Surely this could be done better and smarter. (Feel free to comment!)
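
One alternative, sketched below, is to skip the regex and simply attempt the decode, letting the codec decide. Note this is a looser test: it accepts any valid UTF-8, not just hex digits.

def is_probably_text(data: bytes) -> bool:
    # Heuristic: treat the payload as text if it decodes cleanly as UTF-8.
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True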


Addendum: (from here)

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode 'w+b' opens and truncates the file to 0 bytes. 'r+b' opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including b in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when t is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default) otherwise an error will be raised.
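
The distinction the docs draw can be demonstrated in a few lines (the file name is illustrative):

# Write a byte that is invalid as the start of a UTF-8 sequence.
with open('sample.bin', 'wb') as f:
    f.write(b'\xff')

with open('sample.bin', 'rb') as f:
    print(f.read())        # b'\xff' -- a bytes object, no decoding

try:
    with open('sample.bin', 'r', encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError as e:
    print(e)               # 'utf-8' codec can't decode byte 0xff in position 0 ...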

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Python tries to convert a byte array (a bytes object, which it assumes holds a UTF-8-encoded string) to a Unicode string (str). This process is, of course, decoding according to the UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).

Since you did not provide any code we could look at, we can only guess at the rest.

From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I propose recoding it like this:

with open(path, 'rb') as f:
    contents = f.read()

The b in the mode specifier passed to open() states that the file shall be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
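
If you later work out what the bytes actually are, you can decode them explicitly at that point; the codec below is only an illustration:

text = contents.decode('latin-1')  # assumed codec, adjust once known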

Unable to reproduce UnicodeDecodeError in Windows with sample data

The file is not being written as UTF-8. The bug is that open() without an explicit encoding keyword argument opens the file in the default encoding for your system, and so the file is being read as CP1252.

Apparently the book presupposes that you are on a system where the default system encoding is UTF-8, which is true on every remotely sane modern system which isn't Windows (sorry for the tautology). It would be surprising if this isn't actually explained in the book itself.
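
To make the examples behave the same on Windows, pass the encoding explicitly on both the write and the read; a quick sketch (file name illustrative):

with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('\u00f1\u00f2')          # ñò, written as UTF-8 regardless of locale

with open('data.txt', 'r', encoding='utf-8') as f:
    print(f.read())                  # ñò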

Your understanding of UTF-8 is clearly incomplete. There is no way for \xf1\xf2 to be valid UTF-8. You can inspect the actual UTF-8 encoding of the corresponding code points e.g. with

>>> '\u00f1\u00f2'.encode('utf-8')
b'\xc3\xb1\xc3\xb2'
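
Conversely, you can watch those raw bytes fail as UTF-8 and succeed as Latin-1:

>>> b'\xf1\xf2'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte
>>> b'\xf1\xf2'.decode('latin-1')
'ñò'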

Probably now would be a good time to read up on encodings. The Stack Overflow character-encoding tag info page has a brief primer and links to more resources, including Joel Spolsky's standard piece The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). For Python, probably also read Ned Batchelder's Pragmatic Unicode. Both are short enough to read before bed, or just reserve 45 minutes here and now, perhaps with some time for experimentation while you read.

UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?

Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as low surrogate code points ranging from U+DC80 to U+DCFF. These surrogate code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
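
A quick sketch of the two most useful choices (the file name is illustrative), plus the surrogateescape round trip described above:

# 'replace' swaps each run of malformed bytes for U+FFFD on decoding:
with open('negative-words.txt', encoding='utf-8', errors='replace') as f:
    text = f.read()

# 'surrogateescape' preserves the original bytes losslessly:
raw = b'caf\xff'
s = raw.decode('utf-8', errors='surrogateescape')       # 'caf\udcff'
assert s.encode('utf-8', errors='surrogateescape') == raw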

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

E.g.

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
for i, line in enumerate(f, 1):
errors = detect_decoding_errors_line(line)
if errors:
print(f"Found errors on line {i}:")
for (col, b) in errors:
print(f" {col + 1:2d}: {b[0]:02x}")

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line. If the file is big enough, that single huge 'line' can in turn lead to a MemoryError.

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
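
If you can't tell which codec it is, one common pattern, sketched below with an assumed candidate list, is to try the likely codecs in order. Note that 'latin-1' never raises (every byte is valid), so it only belongs at the end as a last resort:

for enc in ('utf-8', 'cp1252', 'latin-1'):
    try:
        with open(filename, encoding=enc) as f:
            text = f.read()
        break                      # first codec that decodes cleanly wins
    except UnicodeDecodeError:
        continue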

UnicodeDecodeError when reading CSV file in Pandas with Python

read_csv takes an encoding option to deal with files in different encodings. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.

You can also use 'latin' (a Python codec alias for 'ISO-8859-1'), or the closely related Windows encoding 'cp1252' (see the python docs, which also list numerous other encodings you may encounter).

See relevant Pandas documentation,
python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see its man page) or file -i (Linux) or file -I (macOS) (see the man page).
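
As a sketch, both the explicit codec and, on pandas 1.3+ (an assumption about your version), the encoding_errors passthrough look like this:

import pandas as pd

df = pd.read_csv('file.csv', encoding='ISO-8859-1')

# pandas 1.3+ forwards error handling to the decoder:
df = pd.read_csv('file.csv', encoding='utf-8', encoding_errors='replace')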


