Dangers of sys.setdefaultencoding('utf-8')

Why should we NOT use sys.setdefaultencoding('utf-8') in a Python script?

As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.

This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.

The only way to actually use it is with a reload hack that brings the attribute back.

Also, the use of sys.setdefaultencoding() has always been discouraged, and in Python 3 (py3k) the function is gone altogether: the default encoding is hard-wired to "utf-8", and trying to call sys.setdefaultencoding() raises an AttributeError.
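
A quick way to confirm this on a given interpreter is to inspect the sys module directly. The sketch below assumes Python 3; on Python 2 the reported default would be 'ascii' instead, and the variable names are only illustrative.

```python
import sys

# On Python 3 the default str <-> bytes encoding is fixed at UTF-8.
default_encoding = sys.getdefaultencoding()

# The setter does not exist on Python 3, so there is nothing to call
# (and nothing for a reload hack to bring back).
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding, has_setter)
```

If has_setter is False, any code that still calls sys.setdefaultencoding() will fail with an AttributeError and is best removed.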

I suggest some pointers for reading:

  • http://blog.ianbicking.org/illusive-setdefaultencoding.html
  • http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
  • http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
  • http://boodebr.org/main/python/all-about-python-and-unicode
  • http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python

Persist UTF-8 as Default Encoding

Please take a look at the site.py library module: it is the place where sys.setdefaultencoding is called. You could, I think, modify or substitute this module in order to make the change permanent on your machine. Here is some of its source code; the comments explain part of it:

def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Full source: https://hg.python.org/cpython/file/2.7/Lib/site.py

This is the place where they delete the sys.setdefaultencoding function, if you were wondering:

def main():

    ...

    # Remove sys.setdefaultencoding() so that users cannot change the
    # encoding after initialization. The test for presence is needed when
    # this module is run as a script, because this code is executed twice.
    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding

Python print doesn't work after sys.setdefaultencoding('utf-8')

The site module removes sys.setdefaultencoding for a reason, and you shouldn't use reload(sys) to restore it. Instead, my solution would be to do nothing: Python automatically detects the output encoding from the LANG environment variable or, on Windows, from the chcp code page.

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import os
>>> sys.stdout.encoding
'UTF-8'
>>> os.environ["LANG"]
'pl_PL.UTF-8'
>>> print u"\xabtest\xbb"
«test»
>>>

But that can cause problems when the detected encoding doesn't contain the characters you want. In that case the chance of displaying them correctly is close to zero, so you should degrade gracefully: fall back to a pure-ASCII version of the text, use Unidecode to produce usable output, or simply fail. For example, you could catch the exception and print a basic version of the string instead.
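
As a rough sketch of that fallback, using only the standard library instead of Unidecode, the hypothetical helper below (to_printable is not part of any library) returns the text unchanged when the target encoding can represent it and an ASCII approximation otherwise. It assumes Python 3, and characters without an ASCII base form are silently dropped.

```python
import unicodedata

def to_printable(text, encoding):
    """Return text unchanged if `encoding` can represent it, else an ASCII fallback."""
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        # Decompose accented letters (e.g. e-acute -> e + combining accent),
        # then drop everything that is not ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

sample = u"\xabcaf\xe9\xbb"

pretty = to_printable(sample, "utf-8")  # encodable, so returned as-is
plain = to_printable(sample, "ascii")   # not encodable, so degraded

# ascii() keeps this line printable even on an ASCII-only terminal.
print(ascii(pretty), plain)
```

In a real script the second argument would typically come from sys.stdout.encoding, with 'ascii' as a conservative default when that attribute is empty.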

$ LANG=C python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import os
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> os.environ["LANG"]
'C'
>>> print u"\xabtest\xbb"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)
>>>

Windows, however, has its own problems with Unicode support. While chcp 65001 should technically work, it doesn't actually work unless you're using Python 3.3. Python uses the portable stdio.h, but cmd.exe expects Windows-specific calls such as WriteConsoleW(). In practice, only 8-bit encodings (such as CP437) work reliably.

The workaround is to use another terminal that supports Unicode properly, such as Cygwin's console or the IDLE shell included with Python.

AttributeError: 'module' object has no attribute 'setdefaultencoding'

Python 3 has no sys.setdefaultencoding() function. It cannot be reinstated by reload(sys) like it can on Python 2 (which you really shouldn't do in any case).

Since the default on Python 3 is UTF-8 already, there is no point in leaving those statements in.

In Python 2, sys.setdefaultencoding() was used to plaster over implicit encoding problems (caused by concatenating byte strings and unicode values, and other such mixed-type situations) rather than fixing the problems themselves. Python 3 did away with implicit encoding and decoding, so using the plaster to set a different encoding would make no difference anyway.
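
The difference can be illustrated with a small sketch, assuming Python 3 and made-up sample data. Mixing bytes and text is rejected outright, and the actual fix is to decode once, explicitly, at the point where the bytes enter the program.

```python
raw = b"caf\xc3\xa9"   # bytes as they might arrive from a file or socket
suffix = u" au lait"   # text

try:
    # Python 2 would implicitly decode `raw` with the default (ASCII) codec
    # here and fail on the non-ASCII bytes.
    mixed = raw + suffix
except TypeError:
    # Python 3 refuses to guess and raises immediately.
    mixed = None

# Decode at the boundary, naming the encoding explicitly.
text = raw.decode("utf-8") + suffix

print(ascii(text))
```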

However, if this is a 3rd-party library, then you probably will run into other problems as it clearly has not been made compatible with Python 3.

Python Not Accepting UTF-8 Coding

You can set it to utf-8 as:

import sys
reload(sys)
sys.setdefaultencoding("utf8")

Changing default encoding of Python?

Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')

(Note for Python 3.4+: reload() is in the importlib library.)

This is not a safe thing to do, though: this is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Reenabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).

PS: This hack doesn't seem to work with Python 3.9 anymore.

Python3 utf-8 decode issue

The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.

Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (e.g. UTF-8 or ASCII) depends on the environment.
In an ideal case,

  • the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
  • the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can encode every Unicode character).
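
Both failure modes can be reproduced without a terminal. The following is a minimal sketch, assuming Python 3, with Latin-1 standing in for a terminal that expects a different codec than the one Python writes.

```python
text = u"\xe9"  # the character e-acute

# First condition violated: bytes written as UTF-8 but read back as Latin-1.
# The two UTF-8 bytes show up as two separate characters (mojibake).
mojibake = text.encode("utf-8").decode("latin-1")

# Second condition violated: the codec has no representation for the character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii(mojibake), ascii_ok)
```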

In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:

  • Set environment variables: on Linux, Python's encoding defaults depend on them (at least partially). In my experience, this takes a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script for the shell your terminal runs, e.g. .bashrc.
  • Re-encode STDOUT, like so:

    sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')

    The encoding used has to match the one of the terminal.

  • Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
  • Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
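
The re-encoding option can also be written with io.TextIOWrapper. The sketch below assumes Python 3 and uses an in-memory io.BytesIO as a stand-in for sys.stdout.buffer so that it runs without a terminal; utf8_writer is a hypothetical helper name, not a standard function.

```python
import io

def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        newline="\n",         # no platform-specific newline translation
        line_buffering=True,  # flush whenever a newline is written
    )

# Stand-in for sys.stdout.buffer.
fake_stdout_buffer = io.BytesIO()
out = utf8_writer(fake_stdout_buffer)

print(u"\xe9", file=out)
out.flush()

# The character was encoded as the two UTF-8 bytes C3 A9, plus the newline.
written = fake_stdout_buffer.getvalue()
```

In a real script one would pass sys.stdout.buffer instead of the BytesIO object; on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") achieves a similar effect in place.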

There might be other options, but I doubt that there are nicer ones.


