Changing default encoding of Python?
Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
(Note for Python 3.4+: reload() is in the importlib library.)
This is not a safe thing to do, though: it is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Re-enabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).
PS: This hack doesn't work with Python 3 (for example 3.9) anymore: sys.setdefaultencoding() does not exist there, even after reloading sys.
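To see where a given interpreter stands before trying the hack, a quick check like the following may help. It is only a sketch; on Python 3 I would expect the setter to be absent and the default to be utf-8.

```python
import sys

# On Python 3 the setter is not part of sys at all, so reload() cannot bring it back.
has_setter = hasattr(sys, "setdefaultencoding")
default_encoding = sys.getdefaultencoding()

print("setdefaultencoding available:", has_setter)
print("default encoding:", default_encoding)
```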
python change character encoding to utf_8
You can check the encoding with the following method:
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>>
If the encoding is ascii, set it to utf-8. Open the following file (I am using Python 2.7):
/usr/lib/python2.7/sitecustomize.py
then update the following line to use utf-8:
sys.setdefaultencoding("utf-8")
[Edit 2]
Can you add the following at the start of your code and then check:
>>> try:
...     import apport_python_hook
... except ImportError:
...     pass
... else:
...     apport_python_hook.install()
...
>>> import sys
>>>
>>> sys.setdefaultencoding("utf-8")
>>>
>>>
Persist UTF-8 as Default Encoding
Please take a look at the site.py library module: it is the place where sys.setdefaultencoding happens. You could, I think, modify or substitute this module in order to make the change permanent on your machine. Here is some of its source code; the comments explain part of it:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
Full source https://hg.python.org/cpython/file/2.7/Lib/site.py.
This is the place where they delete the sys.setdefaultencoding function, if you were wondering:
def main():
    ...
    # Remove sys.setdefaultencoding() so that users cannot change the
    # encoding after initialization. The test for presence is needed when
    # this module is run as a script, because this code is executed twice.
    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding
Change python 3.7 default encoding from cp1252 to cp65001 aka UTF-8
Putting the following at the top of your Python script resolved this for me; I am able to print all characters without error.
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = 'utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding = 'utf-8')
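On Python 3.7 and later, a possible alternative to re-wrapping the streams is io.TextIOWrapper.reconfigure(), which changes the encoding of an existing text stream in place. The sketch below uses a hypothetical helper, force_utf8, and demonstrates it on an in-memory stream standing in for sys.stdout, so it can be run without touching the real console.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 in place if it supports reconfigure()."""
    if hasattr(stream, "reconfigure"):  # io.TextIOWrapper, Python 3.7+
        stream.reconfigure(encoding="utf-8")
        return True
    return False


# In-memory stand-in for a console stream that starts out as cp1252.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="cp1252")

changed = force_utf8(fake_stdout)
fake_stdout.write("абвгд")  # not representable in cp1252
fake_stdout.flush()
encoded = buffer.getvalue()
```

In a real script the calls would be force_utf8(sys.stdout) and force_utf8(sys.stderr), after importing sys.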
How to change default character encoding for Python IDLE?
For 2.7, 3.5, the command line you show responds, for me, with cp437 - the IBM PC or DOS encoding. Output to the Windows console is limited to a subset of Basic Multilingual Plane (BMP) Unicode characters.
For 3.6, Python's handling of the Windows console was drastically improved to use utf-8 and potentially print any unicode character, depending on font availability.
For all current versions, IDLE also reports, for me, cp1252 (Latin 1). Since there is an attempt to get the system encoding, I don't know why the difference. But it hardly makes any difference as it is a dummy or fake value. To me, it is deceptive in that non-latin1 chars cannot be encoded with latin1, whereas all BMP chars can be printed in IDLE. So I have thought about a replacement.
When (unicode) strings are written to sys.stdout (usually with print), the string object is pickled to bytes in the user process, sent through a socket (implementation detail subject to change) to the IDLE process, and unpickled back to a string object. The effect is as if the string was encoded and decoded with one of the non-lossy utf codings. UTF-32 might be the closest to what pickling does.
The IDLE process calls tkinter text.insert(index, string), which asks tk to insert the string in the widget. But that only works for BMP characters. The net effect is as if the output encoding were ucs-2, though I believe tk uses a truncated utf-8 internally.
Similarly, any BMP character you enter in the shell or editor can be sent to the user process stdin after being displayed.
Anyway, changing pseudofile.encoding has no effect, which is why it was made read-only by this part of the patch for issue 9290:
-        self.encoding = encoding
+        self._encoding = encoding
+
+    @property
+    def encoding(self):
+        return self._encoding
The initial underscore means that _encoding is a private (not hidden) implementation detail that should be ignored by users.
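The effect of that change can be shown with a small stand-alone class. PseudoFile here is a hypothetical stand-in written for illustration, not IDLE's actual class; it only mirrors the property pattern from the patch.

```python
class PseudoFile:
    """Hypothetical stand-in that mirrors the read-only encoding property."""

    def __init__(self, encoding):
        self._encoding = encoding

    @property
    def encoding(self):
        # No setter is defined, so assignment to .encoding is rejected.
        return self._encoding


out = PseudoFile("cp1252")
try:
    out.encoding = "utf-8"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```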
Python default string encoding
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
- For the purpose of code parsing:
  - str (Py2) -- not applicable, raw bytes from the file are taken
  - unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
  - bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
- For the purpose of transcoding:
  - both (Py2) -- sys.getdefaultencoding() (ascii almost always)
    - there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
  - both (Py3) -- no implicit conversion; you must call encode()/decode() explicitly (their encoding argument defaults to utf-8)
- For the purpose of I/O:
  - unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
  - str (Py2) -- not applicable, raw bytes are written
  - str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
  - bytes (Py3) -- none, printing produces its repr() instead
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)
> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'
<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'
<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that implies.
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
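The cp1251/cp866 experiment from the Python 2 section can be repeated in Python 3 without creating files, since compile() accepts source as bytes and applies the encoding declaration it finds there. The helper name literal_from_source is made up for this sketch; the byte values are the ones from the transcript above.

```python
def literal_from_source(declared_encoding):
    """Compile the same raw source bytes under a given encoding declaration."""
    source = (
        b"# coding: " + declared_encoding.encode("ascii") + b"\n"
        b"s = '\xe0\xe1\xe2\xe3\xe4'\n"
    )
    namespace = {}
    # The whole source is decoded with the declared encoding before parsing.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


as_cp1251 = literal_from_source("cp1251")
as_cp866 = literal_from_source("cp866")
print(ascii(as_cp1251), ascii(as_cp866))
```

The literal's bytes never change; only the declaration does, and with it the characters the str ends up containing.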
Transcoding
As per the clarification at the start:
- str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
- unicode (Py2)/str (Py3) -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
- a decode() and encode() -- with the default encoding -- is done implicitly when converting str<->unicode:
  - in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
  - when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
Python 3
There's no implicit "default encoding" conversion at all: implicit conversion between str and bytes is now prohibited.
- bytes can only be decoded and str -- encoded. If the encoding argument is omitted, decode() and encode() use utf-8, never a locale- or site-dependent default.
- converting bytes->str with str() (incl. implicitly) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
- converting str->bytes with bytes() and no encoding is prohibited
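A short Python 3 session-style sketch of these rules follows. The variable names are arbitrary and the sample text is the one used earlier in this answer.

```python
text = "абвгд"

# Explicit conversions: str -> bytes with encode(), bytes -> str with decode().
raw = text.encode("cp1251")
round_trip = raw.decode("cp1251")

# str() on a bytes object does not decode it; it returns the object's repr.
as_repr = str(raw)

# Mixing the two types is refused rather than converted implicitly.
try:
    text + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# bytes() will not guess an encoding for a str argument either.
try:
    bytes(text)
    encoding_guessed = True
except TypeError:
    encoding_guessed = False
```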
Printing
This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.
Python 2
- A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into here.)
  - For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
- str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
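PYTHONIOENCODING is also honoured by Python 3, so its effect can be checked from a Python 3 script by starting a child interpreter with the variable set. This is a sketch under that assumption; child_stdout_encoding is a made-up helper name, and the child's stdout is a pipe rather than a console.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set; return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalise aliases (for example "latin-1" vs "iso8859-1") before returning.
    return codecs.lookup(reported).name


reported_name = child_stdout_encoding("latin-1")
print(reported_name)
```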
Python 3
The changes are:
- Now files opened with text vs. binary mode natively accept str or bytes, correspondingly, and outright refuse to process the wrong type. Text-mode files always have an encoding set, locale.getpreferredencoding(False) being the default.
- print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether
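Both points can be tried on in-memory streams. In this sketch an io.TextIOWrapper over an io.BytesIO stands in for a text-mode file, and the helper name accepts is made up for illustration.

```python
import io

binary_file = io.BytesIO()
# newline="\n" keeps line endings untranslated so the bytes are the same on every OS.
text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="\n")


def accepts(stream, value):
    """Return True if stream.write(value) succeeds, False on a TypeError."""
    try:
        stream.write(value)
        return True
    except TypeError:
        return False


text_takes_bytes = accepts(text_file, b"\xe0")       # wrong type for text mode
binary_takes_str = accepts(io.BytesIO(), "abc")      # wrong type for binary mode

# print() converts the bytes object with str(), which yields its repr.
print(b"\xe0", file=text_file)
text_file.flush()
written = binary_file.getvalue()
```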
Is there a way to change the default encoding for all run configurations within Pydev?
You can change the default encoding at Window > Preferences > General > Workspace > Text file encoding (set it to Other > US-ASCII).
Why i can't change my python default encoding?
You need to read the following:
- Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
- How to print UTF-8 encoded text to the console in Python < 3?
I recommend adding # -*- coding: utf-8 -*- to the top of your .py file.