Python3 Utf-8 Decode Issue

The problem is with the print() call, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a UnicodeDecodeError.

Whenever you use the print() function, Python converts its arguments to str and then encodes the result to bytes, which are sent to the terminal (or wherever Python's standard output goes).
The codec used for encoding (e.g. UTF-8 or ASCII) depends on the environment.
In an ideal case,

  • the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
  • the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can encode every Unicode character).
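
To make the two conditions concrete, here is a minimal sketch of what happens when each one is violated. The cp1252 "terminal" is only an assumption chosen for illustration; the real codec depends on your environment.

```python
text = "é"

# Condition 1 violated: Python encodes as UTF-8, but the (assumed) terminal
# decodes the bytes as cp1252, so each UTF-8 byte shows up as its own character.
sent_bytes = text.encode("utf-8")
shown = sent_bytes.decode("cp1252")
print(ascii(shown))  # ascii() keeps this line printable on any terminal

# Condition 2 violated: the codec cannot represent the character at all.
try:
    text.encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
print(error_name)
```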

In your case, the second condition isn't met for the Linux Docker container you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:

  • Set environment variables: on Linux, Python's default encodings depend (at least partially) on the locale-related environment variables. In my experience, this takes a bit of trial and error; setting LC_ALL to a value containing "UTF-8" worked for me once. You'll have to put the setting in the start-up script of the shell your terminal runs, e.g. .bashrc.
  • Re-encode STDOUT, like so:

    sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')

    The encoding used has to match the one of the terminal.

  • Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
  • Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).

There might be other options, but I doubt that there are nicer ones.
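
As a rough sketch of the "re-encode STDOUT" option: on Python 3.7 and later, text streams such as sys.stdout offer a reconfigure() method, which avoids opening a second file object. The helper name force_utf8 and the in-memory fake_stdout below are made up for this illustration; in a real script you would pass sys.stdout. Setting the environment variable PYTHONIOENCODING=utf-8 before starting Python should have a similar effect without any code changes.

```python
import io


def force_utf8(stream):
    """Return a text stream that encodes its output as UTF-8 (hypothetical helper)."""
    if hasattr(stream, "reconfigure"):  # io.TextIOWrapper on Python 3.7+
        stream.reconfigure(encoding="utf-8")
        return stream
    # Fallback for older versions: wrap the underlying binary buffer again.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8", line_buffering=True)


# Stand-in for an ASCII-only stdout, so the sketch does not touch the real one.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")

stream = force_utf8(fake_stdout)
print("é", file=stream)  # would raise UnicodeEncodeError without force_utf8()
stream.flush()
written = stream.buffer.getvalue()
```

In a script this would be called once at start-up, e.g. force_utf8(sys.stdout). As noted above, it only helps if the terminal really expects UTF-8.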

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Python tries to convert a byte sequence (a bytes object which it assumes to be a UTF-8-encoded string) to a Unicode string (str). This process is a decoding according to UTF-8 rules. While doing so, it encounters a byte which is not allowed in UTF-8-encoded data (namely the 0xff at position 0).

Since you did not provide any code we could look at, we can only guess at the rest.

From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it like this:

with open(path, 'rb') as f:
    contents = f.read()

That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
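
If the file is supposed to be text after all, a leading 0xff is sometimes the first byte of a UTF-16 byte order mark (FF FE). That is only a guess here, since the byte could just as well belong to a binary format (JPEG files also start with 0xff). Under that assumption, a small sketch for decoding the bytes you read; the helper name decode_with_bom and the sample data are hypothetical:

```python
import codecs


def decode_with_bom(raw, default="utf-8"):
    """Decode bytes, honouring a leading byte order mark if there is one."""
    # UTF-32 is checked first: its little-endian BOM begins with the UTF-16 one.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return raw.decode("utf-32")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # this codec reads and strips the BOM
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode(default)  # no BOM: may still raise UnicodeDecodeError


# Sample bytes as they might come out of f.read() in binary mode.
sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")

try:
    sample.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = f"failed at byte {exc.start}"

text = decode_with_bom(sample)
```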

How to fix incorrectly UTF-8 decoded string?

The requests library tries to guess the encoding of the response.
It's possible requests is decoding the response as cp1252 (aka Windows-1252).

I'm guessing this because if you take that text and encode it back to cp1252 and then decode it as utf-8, you'll see the correct text:

>>> 'crianÃ§a'.encode('cp1252').decode('utf-8')
'criança'

Based on that, I'd guess that if you ask your response object what encoding it guessed, it'll tell you cp1252:

>>> response.encoding
'cp1252'

Forcing requests to decode as utf-8 instead, like this, will probably fix your issue:

>>> response.encoding = 'utf-8'
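
If you only have the already garbled text and no response object, the round trip from above can be wrapped in a small function. This is a sketch under the same guess as before (UTF-8 bytes wrongly decoded as cp1252); fix_mojibake is a made-up name, and the call raises an exception if the text was not produced that way.

```python
def fix_mojibake(text, wrong="cp1252", right="utf-8"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)


raw = "criança".encode("utf-8")   # the bytes the server actually sent
garbled = raw.decode("cp1252")    # what a wrong guess turns them into
repaired = fix_mojibake(garbled)
print(ascii(garbled), ascii(repaired))
```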

How am I supposed to fix this utf-8 encoding error?

When untangling a string in which escape sequences were accidentally doubled (i.e. \\ instead of \), the special unicode_escape codec can be used to turn them back into the intended characters for further processing. However, since the input is already a str, it first has to be turned into bytes. If the entire string consists of valid ASCII code points, ascii can be the codec for that initial conversion. If the str also contains regular non-ASCII code points, use utf8 instead: unicode_escape does not treat their bytes as escape sequences, so they pass through untouched. Examples:

>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'

Since the unicode_escape codec decodes bytes as latin1, the intermediate string can simply be encoded back to bytes with the latin1 codec, and then turned into a proper str through the utf8 (or whatever appropriate target) codec:

>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
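
The latin1 step works because latin1 maps every byte value 0 to 255 to the code point with the same number, so the round trip loses nothing. A short sketch that checks this and wraps the whole chain in one function; the name repair_escaped_utf8 is made up for this example:

```python
# latin1 is a one-to-one mapping between byte values and code points 0..255.
all_bytes = bytes(range(256))
roundtrip_ok = all_bytes.decode("latin1").encode("latin1") == all_bytes


def repair_escaped_utf8(text):
    """Resolve doubled \\xNN escapes that spell out UTF-8 bytes (hypothetical helper)."""
    unescaped = text.encode("utf8").decode("unicode_escape")
    return unescaped.encode("latin1").decode("utf8")


fixed = repair_escaped_utf8("La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.")
fixed2 = repair_escaped_utf8("La funci\\xc3\\xb3n estándar datetime.")
```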

As requested, an addendum to clarify the partially messed up string. Note that attempting to decode broken_string2 using the ascii codec will not work, due to the presence of the unescaped á character.

>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)

UTF-8 encoding issue with Python 3

You are passing a string which contains non-ASCII characters to urllib.request.urlopen, and such a string isn't a valid URI (it is a valid IRI or Internationalized Resource Identifier, though).

You need to make the IRI a valid URI before passing it to urlopen. The specifics of this depend on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.

Since your problem is exclusively due to the path containing Unicode characters, assuming your IRI is stored in the variable iri, you can fix it using the following:

import urllib.parse
import urllib.request

split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2]) # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)

urllib.request.urlopen(url).read()
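
For IRIs where the domain contains non-ASCII characters as well, a rough sketch that also converts the host using Python's built-in idna codec. The function name iri_to_uri and the example address are made up, and user names or passwords in the URL are deliberately ignored to keep it short.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (sketch; drops any userinfo part)."""
    parts = urlsplit(iri)
    netloc = parts.netloc
    if parts.hostname:
        # Domain part: IDNA/Punycode via the standard "idna" codec.
        netloc = parts.hostname.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc += f":{parts.port}"
    # Other parts: percent-encoding; "%" stays safe so existing escapes survive.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("http://bücher.example/café?q=é")
print(uri)
```

The result can then be passed to urllib.request.urlopen() like the url variable above.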

However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and has automatic IRI handling.


