Why should we NOT use sys.setdefaultencoding(utf-8) in a py script?
As per the documentation: This allows you to switch from the default ASCII to other encodings such as UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
This function is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, sitecustomize.py. After this module has been evaluated, the setdefaultencoding() function is removed from the sys module.
The only way to actually use it is with a reload hack that brings the attribute back.
Also, the use of sys.setdefaultencoding() has always been discouraged. In early py3k it became a no-op, with the encoding hard-wired to "utf-8" and any attempt to change it raising an error; current Python 3 releases no longer provide the function at all.
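The usual alternative to changing the process-wide default is to convert explicitly at the points where bytes enter or leave the program. The following is only a minimal Python 3 sketch of that idea; the helper names read_text and write_bytes are hypothetical, and the byte string stands in for data read from a file or socket.

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes explicitly at the input boundary (hypothetical helper)."""
    return raw.decode(encoding)


def write_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text explicitly at the output boundary (hypothetical helper)."""
    return text.encode(encoding)


raw = b"caf\xc3\xa9"               # UTF-8 bytes for "café"
text = read_text(raw)              # a str with four characters
round_tripped = write_bytes(text)  # back to the original bytes
```

Because the encoding is named at each boundary, nothing here depends on what the interpreter's default happens to be.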
I suggest some pointers for reading:
- http://blog.ianbicking.org/illusive-setdefaultencoding.html
- http://nedbatchelder.com/blog/200401/printing_unicode_from_python.html
- http://www.diveintopython3.net/strings.html#one-ring-to-rule-them-all
- http://boodebr.org/main/python/all-about-python-and-unicode
- http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
Persist UTF-8 as Default Encoding
Please take a look at the site.py library: it is the place where sys.setdefaultencoding is called. You could, I think, modify or substitute this module in order to make the change permanent on your machine. Here is some of its source code; the comments explain what is going on:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
Full source https://hg.python.org/cpython/file/2.7/Lib/site.py.
This is the place where they delete the sys.setdefaultencoding function, if you were wondering:
def main():
    ...
    # Remove sys.setdefaultencoding() so that users cannot change the
    # encoding after initialization. The test for presence is needed when
    # this module is run as a script, because this code is executed twice.
    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding
python print doesn't work after sys.setdefaultencoding('utf-8')
The sys.setdefaultencoding function is removed by site for a reason, and you shouldn't use reload(sys) to restore it. Instead, my solution would be to do nothing: Python automatically detects the encoding based on the LANG environment variable, or on the chcp code page on Windows.
$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import os
>>> sys.stdout.encoding
'UTF-8'
>>> os.environ["LANG"]
'pl_PL.UTF-8'
>>> print u"\xabtest\xbb"
«test»
>>>
But that can cause problems when the encoding lacks the characters you want. In that case you should try to degrade gracefully: the chance of displaying the characters you want is close to zero, so fall back to a pure-ASCII version, use Unidecode to produce usable output, or simply fail. You could, for example, catch the exception and print a basic version of the string instead.
$ LANG=C python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import os
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> os.environ["LANG"]
'C'
>>> print u"\xabtest\xbb"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)
>>>
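The graceful degradation suggested above could look roughly like the following Python 3 sketch. The safe_print helper is hypothetical, and an in-memory ASCII-only stream stands in for a terminal started with LANG=C; the fallback simply replaces unencodable characters with "?" (Unidecode, a third-party package, would give a nicer approximation but is not used here).

```python
import io


def safe_print(text, stream):
    """Write text to stream, degrading to plain ASCII if encoding fails."""
    try:
        stream.write(text + "\n")
    except UnicodeEncodeError:
        # Replace every character the stream cannot encode with "?".
        fallback = text.encode("ascii", errors="replace").decode("ascii")
        stream.write(fallback + "\n")


# Simulate a terminal that only accepts ASCII (as with LANG=C).
buffer = io.BytesIO()
ascii_stream = io.TextIOWrapper(buffer, encoding="ascii", newline="\n")
safe_print("\xabtest\xbb", ascii_stream)
ascii_stream.flush()
output = buffer.getvalue()
```

The same helper writes the string unchanged when the stream's encoding can represent it, so it is only the failing case that loses information.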
Windows, however, has its own problems with Unicode support. While chcp 65001 should technically work, it doesn't actually work unless you're using Python 3.3. Python uses the portable stdio.h, but cmd.exe expects Windows-specific calls such as WriteConsoleW(). In practice, only 8-bit encodings (such as CP437) work reliably.
The workaround is to use another terminal that supports Unicode properly, such as Cygwin's console or the IDLE shell included with Python.
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Python 3 has no sys.setdefaultencoding() function. It cannot be reinstated by reload(sys) like it can on Python 2 (which you really shouldn't do in any case).
Since the default on Python 3 is UTF-8 already, there is no point in leaving those statements in.
In Python 2, sys.setdefaultencoding() was used to plaster over implicit encoding problems (caused by concatenating byte strings and unicode values, and other such mixed-type situations) rather than fixing the problems themselves. Python 3 did away with implicit encoding and decoding, so using the plaster to set a different encoding would make no difference anyway.
However, if this is a 3rd-party library, then you probably will run into other problems as it clearly has not been made compatible with Python 3.
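The points in this answer can be checked from a Python 3 interpreter. This is just an illustrative sketch; the variable names are arbitrary.

```python
import sys

# On Python 3 the setter is simply absent from the sys module.
has_setter = hasattr(sys, "setdefaultencoding")

# The default encoding is fixed and can only be queried.
default_encoding = sys.getdefaultencoding()

# Mixing bytes and str is no longer converted implicitly; it fails loudly.
try:
    b"abc" + "def"
    mixing_raises = False
except TypeError:
    mixing_raises = True
```

Since the mixed-type concatenation raises instead of silently decoding, there is no implicit conversion left for a default encoding to influence.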
Python Not Accepting UTF-8 Coding
In Python 2, you can set it to UTF-8 as follows:
import sys
reload(sys)
sys.setdefaultencoding("utf8")
Changing default encoding of Python?
Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
(Note for Python 3.4+: reload() is in the importlib library.)
This is not a safe thing to do, though: it is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Re-enabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).
PS: This hack doesn't seem to work with Python 3.9 anymore.
Python3 utf-8 decode issue
The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print()
function, Python converts its arguments to a str
and subsequently encodes the result to bytes
, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
- the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
- the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can encode all Unicode characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
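Both failure modes can be reproduced without a terminal. The following is a small Python 3 sketch: decoding UTF-8 bytes as Latin-1 stands in for a terminal that expects the wrong codec, and encoding to ASCII stands in for the Docker environment described here.

```python
text = "é"

# Condition 1 violated: bytes written as UTF-8 but interpreted as Latin-1.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# Condition 2 violated: the codec cannot represent the character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The first case fails silently with wrong output, while the second raises the UnicodeEncodeError discussed in this answer.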
These are a few options to address this problem:
- Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script for the shell your terminal runs, e.g. .bashrc.
- Re-encode STDOUT, like so:
  sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
  The encoding used has to match the one of the terminal.
- Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
- Avoid print() altogether. Use open(fn, encoding=...) for output and the logging module for progress info. Depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.
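The "re-encode STDOUT" option from the list above can also be sketched with io.TextIOWrapper. This is only an illustration under assumptions: the utf8_writer helper is hypothetical, and an in-memory byte buffer stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="\n", write_through=True
    )


# A BytesIO stands in for sys.stdout.buffer here.
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)
print("é", file=out)
out.flush()
written = fake_stdout.getvalue()

# On a real terminal the equivalent would be something like
#   sys.stdout = utf8_writer(sys.stdout.buffer)
# or, on Python 3.7+, sys.stdout.reconfigure(encoding="utf-8").
```

As with the other options, this only helps if the terminal itself expects UTF-8.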