Why Does Python Print Unicode Characters When the Default Encoding Is ASCII

Why does Python print unicode characters when the default encoding is ASCII?

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

When you try to print a Unicode string such as u'\xe9', Python implicitly tries to encode it using the encoding scheme currently stored in sys.stdout.encoding. Python picks up this setting from the environment it was started from. If it can't find a proper encoding in the environment, only then does it revert to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let's exit the Python shell for a moment and set bash's environment to some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the Python shell again and verify that it does indeed revert to its default ASCII encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output a Unicode character outside the ASCII range, you should get a nice error message:

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
in position 0: ordinal not in range(128)

Let's exit Python and discard that bash shell.

We'll now observe what happens after Python outputs strings. For this we'll first start a bash shell within a graphical terminal (I use Gnome Terminal) and set the terminal to decode output with ISO-8859-1, aka latin-1 (graphical terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn't change the actual shell environment's encoding; it only changes the way the terminal itself will decode the output it's given, a bit like a web browser does. You can therefore change the terminal's encoding independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) Python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é", so that's what the terminal displays.

(2) Python implicitly encodes the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance "UTF-8". After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see the later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 only covers 0 to 255 and so decodes the stream one byte at a time. 0xc3a9 is 2 bytes long; the latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169), which yields 2 characters: Ã and ©.

(3) Python encodes the Unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that latin-1's code point range is 0-255 and points to the exact same characters as Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields "é", and that's what's displayed.
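
As an aside, you can reproduce case (2) without a terminal by doing the latin-1 decoding yourself. A minimal sketch, in the same Python 2 interpreter as above:

>>> u'\xe9'.encode('UTF-8')                     # the bytes Python actually sends to stdout
'\xc3\xa9'
>>> u'\xe9'.encode('UTF-8').decode('latin-1') == u'\xc3\xa9'   # what a latin-1 terminal makes of them: Ã and ©
True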

Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) Python outputs the binary string as is. The terminal attempts to decode that stream with UTF-8, but UTF-8 doesn't understand the lone value 0xe9 (see the later explanation) and is therefore unable to convert it to a Unicode code point. No code point found, no character printed.

(5) Python implicitly encodes the Unicode string with whatever's in sys.stdout.encoding, still "UTF-8". The resulting binary string is '\xc3\xa9'. The terminal receives the stream and decodes 0xc3a9, also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol "é". The terminal displays "é".

(6) Python encodes the Unicode string with latin-1, which yields a binary string with the same value, '\xe9'. Again, for the terminal this is pretty much the same as case (4).

Conclusions:
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent of the shell's.


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. E.g. by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode; 0xc8 points to 'È' in latin-1 and Unicode; 0xe9 points to 'é' in latin-1 and Unicode.
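
You can see this correspondence from the interpreter (Python 2 syntax, like the sessions above):

>>> ord('A'), ord(u'A')                    # same code point in ASCII and Unicode
(65, 65)
>>> '\xe9'.decode('latin-1') == u'\xe9'    # latin-1 byte 0xe9 maps straight to U+00E9
True
>>> unichr(0xe9)
u'\xe9'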

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward approach would be to simply use a code point's value in the Unicode map as its value in its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1-to-1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements: the most economical ones don't cover all Unicode code points; for example, ASCII only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ASCII range ('B', which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful, as it stores all characters in 4 bytes.
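
A quick way to compare the storage cost of the same characters under different schemes (Python 2 syntax; the -le codec variants are used here only to leave out the byte order mark):

>>> len(u'B'.encode('ascii')), len(u'B'.encode('utf-8'))
(1, 1)
>>> len(u'B'.encode('utf-16-le')), len(u'B'.encode('utf-32-le'))
(2, 4)
>>> len(u'\xe9'.encode('utf-8'))           # 'é' needs 2 bytes in UTF-8
2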

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x's show the actual space reserved to "store" the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn't change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010 <-- Unicode code point 0x42
0100 0010 <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx <-- (from 2048 to 65535)
  • the leading bits '110' indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas '1110' indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner '10' flag bits are used to signal the beginning of an inner byte.
  • again, the x's mark the space where the Unicode code point value is stored after encoding.

e.g. 'é' Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
   C    3    A    9

The 0xe9 Unicode code point, after UTF-8 encoding, becomes 0xc3a9, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-Unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
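
You can check the whole round trip from the interpreter (Python 2 syntax, matching the sessions above; the exact wording of the error message varies between versions):

>>> u'\xe9'.encode('utf-8')
'\xc3\xa9'
>>> u'\xe9'.encode('utf-8').decode('latin-1')      # what a latin-1 terminal displays: Ã©
u'\xc3\xa9'
>>> '\xe9'.decode('utf-8')                         # a lone 0xe9 byte is not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: ...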

Why can't Python print Unicode symbols?

Printing unicode objects requires Python to guess the output encoding and encode the Unicode code points with it.

On your VPS server, the output encoding appears to be ASCII, which is the default when no encoding could be detected (such as when using a pipe). If you run the same code on a terminal, the terminal encoding is usually detected and the encoding succeeds.

The solution is to encode explicitly, according to your script's requirements.

Please do read the Python Unicode HOWTO to understand how Python does this detection and why it needs to encode for you.
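
A minimal Python 2 sketch of that advice (the utf-8 fallback is an assumption, pick whatever your consumers expect; on Python 3 you would write the bytes to sys.stdout.buffer instead):

import sys

text = u'\xe9'
# sys.stdout.encoding is None when output goes to a pipe; fall back to an explicit choice
encoding = sys.stdout.encoding or 'utf-8'
sys.stdout.write(text.encode(encoding) + '\n')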

Python default string encoding

There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.

Short answer:

  • For the purpose of code parsing:
    • str (Py2) -- not applicable, raw bytes from the file are taken
    • unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
    • bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
  • For the purpose of transcoding:
    • both (Py2) -- sys.getdefaultencoding() (ascii almost always)
      • there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
    • both (Py3) -- none; conversions only happen through explicit encode()/decode() calls
  • For the purpose of I/O:
    • unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
    • str (Py2) -- not applicable, raw bytes are written
    • str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
    • bytes (Py3) -- none, printing produces its repr() instead

First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.

Now...

Reading the source and parsing string literals

At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.

Python 2

Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)

> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'

<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'

<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'

So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".

If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that implies.

Python 3

Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
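
For instance, in Python 3:

data = b"\xc3\xa9"   # fine: non-ASCII bytes written as escape sequences
# data = b"é"        # SyntaxError: bytes can only contain ASCII literal characters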

Transcoding

As per the clarification at the start:

  • str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
  • unicode (Py2)/str (Py3) -- characters => can only be encoded

Python 2

In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".

Now, here's a caveat:

  • a decode() and encode() -- with the default encoding -- is done implicitly when converting str<->unicode:

    • in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
    • when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
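
A small Python 2 sketch of those implicit conversions going wrong (the byte string here is latin-1-encoded 'é', an arbitrary example):

>>> u"name: %s" % "\xe9"        # string formatting implicitly decodes the str with ASCII
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
>>> "\xe9".encode("utf-8")      # encode() on a str first *decodes* it with ASCII
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)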

Python 3

There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.

  • bytes can only be decoded and str -- encoded; specify the encoding explicitly (omitting it means UTF-8, not some mutable process-wide default).
  • converting bytes->str (incl. implicitly) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
  • converting str->bytes is prohibited
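
In concrete terms, a small Python 3 sketch:

>>> 'é'.encode('utf-8')            # characters -> bytes, encoding spelled out
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')    # bytes -> characters
'é'
>>> str(b'\xc3\xa9')               # no decoding happens; you just get the repr
"b'\\xc3\\xa9'"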

Printing

This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.

Python 2

  • A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into here.)
    • For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
  • str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).

Python 3

The changes are:

  • Now files opened with text vs. binary mode natively accept str or bytes, correspondingly, and outright refuse to process the wrong type. Text-mode files always have an encoding set, locale.getpreferredencoding(False) being the default.
  • print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether
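
For example, in a Python 3 session on a UTF-8 terminal:

>>> print("é")           # encoded with the stream's encoding (locale default, or PYTHONIOENCODING)
é
>>> print(b'\xc3\xa9')   # bytes are not decoded; print() shows their repr instead
b'\xc3\xa9'
>>> print(b'\xc3\xa9'.decode('utf-8'))   # decode explicitly if you want the characters
é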

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Read the Python Unicode HOWTO. This error is the very first example.

Do not use str() to convert from unicode to encoded text / bytes.

Instead, use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
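
A sketch of the "work entirely in unicode" route (Python 2, matching the snippet above; the values are made up, and p.agent_info is simplified to a local variable):

agent_contact = u'Alice'         # hypothetical values, normally read from elsewhere
agent_telno = u'\xa0555 0100'
agent_info = u' '.join((agent_contact, agent_telno)).strip()   # stays unicode internally
payload = agent_info.encode('utf-8')   # encode once, at the output boundary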

Printing a unicode string to the console works, but fails when redirected to a file. How to fix?

Set the PYTHONIOENCODING environment variable:

SET PYTHONIOENCODING=cp936
windows_prn_utf8.py > 1.txt

no UnicodeError when using print with a default encoding set to ASCII

print uses sys.stdout.encoding, not sys.getdefaultencoding():

When Python finds its output attached to a terminal, it sets the
sys.stdout.encoding attribute to the terminal's encoding. The print
statement's handler will automatically encode unicode arguments into
str output.

>>> import sys
>>> print(sys.stdout.encoding)
utf-8
>>> print(sys.getdefaultencoding())
ascii
>>> name = u'\u0935\u0948\u092D\u0935'
>>> print name
वैभव

Unicode characters not getting displayed correctly on localhost

A browser doesn't have to use UTF-8 as its default encoding; it may default to something like ISO-8859-2 instead.

The browser doesn't know what encoding is inside the file, so you have to use an HTTP header to inform it:

self.send_header("Content-Type", "text/plain; charset=utf-8")

Minimal working example

from http.server import HTTPServer, BaseHTTPRequestHandler

class Serv(BaseHTTPRequestHandler):

    def do_GET(self):
        text = '‾'

        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()

        #self.wfile.write(bytes(text, 'utf-8'))
        self.wfile.write(text.encode('utf-8'))

print('Serving http://localhost:8080')
httpd = HTTPServer(('localhost', 8080), Serv)
httpd.serve_forever()

EDIT:

If you send an HTML file, then inside the file you can use the HTML tag

<meta charset="utf-8">

Minimal working example

from http.server import HTTPServer, BaseHTTPRequestHandler

class Serv(BaseHTTPRequestHandler):

    def do_GET(self):
        text = '''<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8">
</head>

<body>

</body>

</html>
'''

        self.send_response(200)
        self.end_headers()

        #self.wfile.write(bytes(text, 'utf-8'))
        self.wfile.write(text.encode('utf-8'))

print('Serving http://localhost:8080')
httpd = HTTPServer(('localhost', 8080), Serv)
httpd.serve_forever()

Trying to print ASCII characters 128 to 160, why does it stop at 157?

Firstly, ASCII only goes up to 127 (0x7F). chr() actually returns the Unicode character.

I think the problem is that when U+9D (157) Operating System Command (OSC) is printed, your terminal starts a control string and waits for a String Terminator like U+9C String Terminator, U+1B Escape followed by U+5C backslash, or U+7 BEL. Since none of those sequences are ever printed later, the terminal stops showing the output. For more info, see ANSI escape code § Fe Escape sequences and C1 control codes on Wikipedia.

Unicode characters U+80 (128) to U+9F (159) are control characters, meaning they're not generally printable, so you were never going to get sensible output in the first place.
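
If you just want to inspect that range without sending raw control bytes to the terminal, a sketch like this (Python 3) is safer:

import unicodedata

for i in range(128, 161):
    ch = chr(i)
    # 'Cc' marks control characters; repr() avoids emitting them to the terminal
    print(i, hex(i), repr(ch), unicodedata.category(ch))

Using repr() (or decoding/encoding explicitly) keeps the terminal's state intact.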


