Python, Unicode, and the Windows Console

Python, Unicode, and the Windows console

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!


Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails - Python Wiki

Here's a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б

There's some more information on that page, well worth a read.
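
For reference, a rough Python 3 counterpart of the wrapping shown above (just a sketch; the excerpt itself is Python 2) is to re-wrap the binary layer of sys.stdout with an explicit encoding:

import io
import sys

# Re-wrap the underlying binary stream with the encoding you want;
# line_buffering keeps interactive output flushing per line.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8',
                              line_buffering=True)
print('\u0411')   # Б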

Python Unicode - What Characters Can Be Printed in Windows Console?

To answer your question, we need to check several layers of Unicode.

Valid Unicode code points range from 0 to U+10FFFF. You can find the category of a code point with unicodedata.category(char).

The values from U+D800 to U+DFFF are surrogates; they should not be used as standalone code points (and they cannot be encoded/decoded on their own in UTF-16). [They were introduced to extend UCS-2 (early Unicode, with code points only up to U+FFFF) to UTF-16 (up to U+10FFFF). Old programs/languages (such as JavaScript) may expose a surrogate pair as two code units instead of one code point.]

Note: Python allows lone surrogates because of the surrogateescape error handler (mostly used to read sys.argv), but you should not emit them: use them only internally, and convert them properly before output.

So, do not try to use such code points.
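
A small sketch of how surrogateescape works, to show why such code points are only internal placeholders:

# Undecodable bytes survive as lone surrogates (U+DC80..U+DCFF) and must be
# re-encoded with the same handler, never printed as-is.
raw = b'caf\xe9'                                  # not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')     # 'caf\udce9'
assert text.encode('utf-8', 'surrogateescape') == raw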

There are also the noncharacters: U+FDD0–U+FDEF, plus the code points ending in FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, … U+10FFFE, U+10FFFF) [from Wikipedia, Unicode]. These should not be used either; the BOM (U+FEFF) may appear, but only as the first character. Reason: for the first block, see "What's the purpose of the noncharacters U+FDD0 to U+FDEF?"; the others exist to help autodetect encodings, so they should never appear as real text: if you detect one, you know you are using a wrong encoding, and you change encoding until the first code point is valid.
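
If you want to filter them explicitly, a small helper (hypothetical name, just following the ranges above) could look like this:

def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the two last code points of every plane (66 in total)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF)

assert is_noncharacter(0xFFFE) and is_noncharacter(0x10FFFF)
assert not is_noncharacter(0xFEFF)   # the BOM itself is not a noncharacter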

Now, with unicodedata.category(char), you can also get the category of a code point (see Unicode character categories). Characters up to U+1F, and U+7F–U+9F, are control characters: do not print them.

You may have formatting characters, which could modify nearby characters.

So you may want to exclude the C* (note: this will discard all the above characters) and maybe also the Z* (white spaces) character categories.

So now you have the printable characters, as known to the unicodedata standard module. Use unicodedata.unidata_version to check which Unicode version the database reflects. You may optionally allow the Cn category (unassigned): such code points may have been assigned in a newer Unicode version.
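
For example, you can query both the database version and individual categories like this:

import unicodedata

print(unicodedata.unidata_version)         # e.g. '14.0.0', depends on your Python
print(unicodedata.category('\x07'))        # 'Cc' -> control, do not print
print(unicodedata.category(' '))           # 'Zs' -> space separator
print(unicodedata.category('A'))           # 'Lu' -> uppercase letter, printable
print(unicodedata.category('\U0001f7e9'))  # 'So' if assigned in this database, else 'Cn'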

But this is not enough. You also need a font that can display such characters. Google's Noto ("no tofu") fonts are (I think) the most complete.

But this is also not enough. You only get the standard representation of each character (and probably not even that: you should add a U+200C (ZWNJ) after each character to force fonts not to join adjacent characters, e.g. in Indic scripts). And you still miss all the characters that are represented by a combination of code points: e.g. many accented characters, characters enclosed in circles or squares, country flags (which need two regional indicator characters in the correct order), etc.

Note: I'm curious on how to get all glyphs from a font file, but this is not your question.

ADDENDUM:

I forgot to say: combining characters cannot be displayed alone, so you need to precede them with e.g. U+25CC (DOTTED CIRCLE); you can check for them with unicodedata.combining(char).

So you may use this code

# if your console is not UTF-8 (or another Unicode encoding) and Python
# does not detect that, you will get garbage
import unicodedata

combining = '\u25cc'    # DOTTED CIRCLE, shown before combining marks
placeholder = '\ufffd'  # REPLACEMENT CHARACTER, shown for C* categories
zwnj = '\u200c'         # ZERO WIDTH NON-JOINER, keeps characters from joining

line = ''
for code in range(0x10FFFF + 1):
    c = chr(code)
    cat = unicodedata.category(c)
    if cat.startswith('C'):  # and cat != 'Cn':
        r = placeholder
    elif cat.startswith('Z'):
        r = ' '
    elif unicodedata.combining(c) > 0:
        r = combining + c + zwnj
    else:
        r = c + zwnj
    line += r
    if code % 256 == 255:
        print(line)
        line = ''

python: unicode in Windows terminal, encoding used?

Unicode is not an encoding. You encode into byte strings and decode into Unicode:

>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'

The Windows console uses legacy code pages from DOS. On US Windows it is:

>>> import sys
>>> sys.stdout.encoding
'cp437'

Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

Your results may vary.

UnicodeEncodeError with Windows console in Python 3

https://wiki.python.org/moin/PrintFails details this error.

"UnicodeEncodeError: 'charmap' codec can't encode character u'\u1234' in position 0: character maps to undefined"

This means that the python console app can't write the given character to the console's encoding.

More specifically, the python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

...

By default, the console in Microsoft Windows only displays 256 characters (cp437, or "Code page 437", the original IBM PC extended ASCII character set from 1981).

If you try to print a character that this code page cannot represent, you will get a UnicodeEncodeError.

Setting the PYTHONIOENCODING environment variable as described above can be used to suppress the error messages. Setting to "utf-8" is not recommended as this produces an inaccurate, garbled representation of the output to the console. For best results, use your console's correct default codepage and a suitable error handler other than "strict".
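
For example, a sketch for Python 3.7+ that keeps the console's own code page and just relaxes the error handler:

import sys

# Keep sys.stdout.encoding as-is (the console code page) but stop the
# default 'strict' handler from raising on unencodable characters.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(errors='backslashreplace')

print('\u1234')   # prints the escape \u1234 instead of raising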

Try ignoring some of this advice and doing the following in Windows CMD:

set PYTHONIOENCODING=utf-8
chcp 65001

Also set your console font to: Lucida Console

This should set the console to a crappy UTF-8 emulation and force Python to encode to UTF-8.

You may find it simpler to write the results to a UTF-8 encoded file instead of writing to a console.

Use https://github.com/Drekin/win-unicode-console

Python (2.7) and reading Unicode argvs from the Windows command line

The file name is being received correctly. You can verify this by encoding sys.argv[1] as UTF-8 and writing it to a file (opened in binary mode) and then opening the file in a text editor that supports UTF-8.
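
A minimal verification sketch (assuming Python 2.7 on Windows; the file name argv_check.txt is just an example):

import sys

arg = sys.argv[1]
if isinstance(arg, bytes):                 # plain sys.argv gives ANSI-code-page bytes
    arg = arg.decode('mbcs')
with open('argv_check.txt', 'wb') as f:    # binary mode: write raw bytes
    f.write(arg.encode('utf-8'))           # re-encode as UTF-8 for inspection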

The Windows command prompt is unable to display the characters correctly despite the 'chcp' command changing the codepage to UTF-8 because the terminal font does not contain those characters. The command prompt is unable to substitute characters from other fonts.

Is there a way to print certain unicode characters in python terminal from Windows?

Maybe you have wrong escape sequences in your string literals:

import unicodedata   # access to the Unicode Character Database

def check_unicode(s):
    print(len(s), s)
    for char in s:
        print(char, '{:04x}'.format(ord(char)),
              unicodedata.category(char),
              unicodedata.name(char, '(unknown)'))

Output:

check_unicode( u"\u2b1c\u1f7e8\u1f7e9") # original string literals
5 ⬜὾8὾9
⬜ 2b1c So WHITE LARGE SQUARE
὾ 1f7e Cn (unknown)
8 0038 Nd DIGIT EIGHT
὾ 1f7e Cn (unknown)
9 0039 Nd DIGIT NINE
check_unicode( u"\u2b1c\U0001f7e8\U0001f7e9") # adjusted string literals
3 ⬜🟨🟩
⬜ 2b1c So WHITE LARGE SQUARE
🟨 1f7e8 So LARGE YELLOW SQUARE
🟩 1f7e9 So LARGE GREEN SQUARE

Edit: run in Windows Terminal using the default Cascadia Code font…

Windows console encoding

First of all, for all your non-ASCII characters, what matters here is your console encoding and Windows locale settings: you are using byte strings, and Python just prints out the bytes it received. Your keyboard input is encoded to a specific byte or byte sequence by the console before those bytes are passed on to Python. To Python this is all just opaque data (numbers in the range 0–255), and print passes those bytes back to the console the same way Python received them.

In Powershell, what encoding is used for the bytes sent to Python via command-line switches is not determined by the chcp codepage, but by the Language for non-Unicode programs setting in your control panel (search for Region, then find the Administrative tab). It is this setting that encodes é to 0xE9 before passing it to Python as a command-line argument. There are a large number of Windows codepages that use 0xE9 for é (but there is no such thing as an ANSI encoding).

The same applies to environment variables. Python refers to the encoding Windows uses here as the MBCS codec; you can decode command-line parameters or environment variables to Unicode using the 'mbcs' codec, which uses the MultiByteToWideChar() and WideCharToMultiByte() Windows API functions, with the CP_ACP flag.
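
A short sketch of that on Python 2 (the variable names here are just examples):

import locale
import os
import sys

print(locale.getpreferredencoding())           # e.g. 'cp1252': the ANSI code page
args = [a.decode('mbcs') for a in sys.argv]    # command-line bytes -> unicode
path = os.environ['PATH'].decode('mbcs')       # environment variable -> unicode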

When using the interactive prompt, Python is passed bytes as encoded by the Powershell console locale codepage, set with chcp. For you that's codepage 850, and a byte with the hex value 0x82 is received when you type é. Because print sends the same 0x82 byte back to the same console, the console then translates 0x82 back to a é character on the screen.

Only when you use Unicode text (with a unicode string literal like u'é') would Python do any decoding and encoding of the data. print writes to sys.stdout, which is configured to encode Unicode data to the current locale (or PYTHONIOENCODING if set), so print u'é' would write that Unicode object to sys.stdout, which then encodes that object to bytes using the configured codec, and those bytes are then written to the console.
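
Roughly, what print u'é' does when stdout is a console is equivalent to this Python 2 sketch:

import sys

text = u'\xe9'                              # u'é'
encoding = sys.stdout.encoding or 'ascii'   # e.g. 'cp850' for this console
sys.stdout.write(text.encode(encoding))     # unicode -> bytes in the console code page
sys.stdout.write('\n')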

To produce the unicode object from the u'é' source code text (itself a sequence of bytes), Python does have to decode the source code given. For the -c command line, the bytes that are passed in are decoded as Latin-1. In the interactive console, the locale is used. So python -c "print u'é'" and print u'é' in the interactive session will result in different output.

It should be noted that Python 3 uses Unicode strings throughout, and command-line parameters and environment variables are loaded into Python with the Windows 'wide' APIs to access the data as UTF-16, then presented as Unicode string objects. You can still access console data and filesystem information as byte strings, but as of Python 3.6, accessing the filesystem and stdin/stdout/stderr streams as binary uses UTF-8 encoded data (again using the 'wide' APIs).


