What's the Deal with Python 3.4, Unicode, Different Languages and Windows

The problem was (see the Python 3.6 update below) with the Windows console, which supports only the character set (code page) appropriate for the region targeted by your version of Windows. By default, Python raises an exception when it is asked to output characters that the code page does not support.

Python can read an environment variable, PYTHONIOENCODING, to output in other encodings or to change the default error handling. Below, I've checked the console's default code page and changed the error handling to print a ? instead of raising an error for characters that the console's current code page does not support.

C:\>chcp
Active code page: 437   # Note, US Windows OEM code page.

C:\>set PYTHONIOENCODING=437:replace

C:\>example.py
Leo? Janá?ek
Zdzis?aw Beksi?ski
??? ?? ??
??
?????? ??? ?????????? ????????
Minha Língua Portuguesa: çáà

Note the US OEM code page is limited to ASCII and some Western European characters.
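
The effect of the :replace error handler can be reproduced without a console. This is a minimal sketch, assuming the code page is cp437 as in the session above; the helper name simulate_console is hypothetical and simply round-trips text through the code page.

```python
def simulate_console(text, encoding="cp437", errors="replace"):
    """Round-trip text through a code page, roughly as a console limited to it would show it."""
    # Characters missing from the code page are encoded as b"?" by the "replace" handler.
    return text.encode(encoding, errors=errors).decode(encoding)


# š and č are not in cp437, but á is, so only two characters are replaced.
czech = simulate_console("Leoš Janáček")
# ł and ń are not in cp437 either.
polish = simulate_console("Zdzisław Beksiński")
```

Here czech and polish match the first two lines of the console output above, and Portuguese text passes through unchanged because cp437 includes its accented letters.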

Below, I've instructed Python to use UTF-8, but since the Windows console doesn't support it, I redirect the output to a file and display it in Notepad:

C:\>set PYTHONIOENCODING=utf8
C:\>example.py >out.txt
C:\>notepad out.txt

[Screenshot: out.txt open in Notepad]

On Windows, it's best to use a Python IDE that supports UTF-8 instead of the console when working with multiple languages. If you use only one language, select it as the system locale in the Region and Language control panel, and the console will support the characters of that language.

Update for Python 3.6

Python 3.6 now uses Windows Unicode APIs to write directly to the console, so the only limit is the console font's support for the characters. The following code works in a US Windows console. I have a Chinese language pack installed, so it even displays the Chinese and Japanese text if the console font is changed. Even without the correct font, replacement characters are shown in the console, and copying and pasting them into an environment such as this web page will display the characters correctly.

#!python3.6
#coding: utf8
czech = 'Leoš Janáček'
print(czech)

pl = 'Zdzisław Beksiński'
print(pl)

jp = 'リング 山村 貞子'
print(jp)

chinese = '五行'
print(chinese)

MIR = 'Машина для Инженерных Расчётов'
print(MIR)

pt = 'Minha Língua Portuguesa: çáà'
print(pt)

Output:

Leoš Janáček
Zdzisław Beksiński
リング 山村 貞子
五行
Машина для Инженерных Расчётов
Minha Língua Portuguesa: çáà

Python, Unicode, and the Windows console

Note: This answer is somewhat outdated (it dates from 2008, and the code excerpt is written for Python 2). Use the solution below with care.

Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails - Python Wiki

Here's a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б

There's some more information on that page, well worth a read.
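
The excerpt above is Python 2. In Python 3, a rough equivalent of wrapping sys.stdout is to wrap the underlying binary stream in io.TextIOWrapper with an explicit encoding and error handler. The following is only a sketch: the helper name wrap_stream is hypothetical, and an in-memory io.BytesIO stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def wrap_stream(binary_stream, encoding="utf-8", errors="replace"):
    """Wrap a binary stream in a text layer with an explicit encoding."""
    # newline="\n" disables newline translation so the written bytes are predictable.
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors=errors,
                            newline="\n", write_through=True)


buf = io.BytesIO()          # stand-in for sys.stdout.buffer
out = wrap_stream(buf)
out.write("\u0411\n")       # same Cyrillic letter as in the wiki excerpt
out.flush()
data = buf.getvalue()       # the UTF-8 bytes that were written
```

In a real script one would pass sys.stdout.buffer instead of the BytesIO object. On Python 3.7 and later, sys.stdout.reconfigure(encoding=..., errors=...) is an alternative that avoids replacing the stream.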

python os.walk and unicode error

Here's a test case:

C:\TEST
├───dir1
│       file1™
│
└───dir2
        file2

Here's a script (Python 3.x):

import os

spath = r'c:\test'

for root,dirs,files in os.walk(spath):
    print(root)

for dirs in os.walk(spath):                             
    print(dirs)

Here's the output, on an IDE that supports UTF-8 (PythonWin, in this case):

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
('c:\\test\\dir1', [], ['file1™'])
('c:\\test\\dir2', [], ['file2'])

Here's the output, on my Windows console, which defaults to cp437:

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
Traceback (most recent call last):
  File "C:\test.py", line 9, in <module>
    print(dirs)
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 47: character maps to <undefined>

For Question 1, print(root) works because no directory name contains a character that is unsupported by the output encoding. print(dirs), however, prints the whole (root, dirs, files) tuple, and one of the file names contains a character (™, U+2122) that the Windows console's code page cannot encode.
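
If the output has to go to such a console anyway, one option is to escape the unsupported characters before printing. This is a minimal sketch, assuming the console code page is cp437 as in the traceback above; the helper name console_safe is hypothetical.

```python
def console_safe(text, encoding="cp437"):
    """Replace characters the encoding cannot represent with \\uXXXX escapes."""
    # "backslashreplace" substitutes an escape sequence instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# One tuple as os.walk would yield it for the test tree above.
entry = ("c:\\test\\dir1", [], ["file1\u2122"])
safe_line = console_safe(str(entry))
print(safe_line)  # contains only characters that cp437 can encode
```

The trademark sign is shown as an escape sequence rather than the real character, so this trades fidelity for not crashing.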

For Question 2, the first example misspelled utf-8 as utf=8, and the second example didn't declare an encoding for the file the output was written to, so it used a default that didn't support the character.

Try this:

import os

spath = r'c:\test'

with open('os_walk4_align.txt', 'w', encoding='utf8') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)

Content of os_walk4_align.txt, encoded in UTF-8:

c:\test ['dir1', 'dir2'] []
c:\test\dir1 [] ['file1™']
c:\test\dir2 [] ['file2']

Python Unicode Does not support character U+25BE

I assume you are printing to the Windows console. The Windows console does not default to (and has poor support for) UTF-8, but you can change the code page and try again:

C:\>chcp 65001
Active code page: 65001

C:\>py
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u25be')
▾

>>> import unicodedata as ud
>>> ud.name('\u25be')
'BLACK DOWN-POINTING SMALL TRIANGLE'

That displays the correct character for me on US English Windows using the Consolas console font, but not with Lucida Console or Raster Fonts. Make sure the font you are using supports the character.
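
To find out in advance which characters a given code page cannot represent, each character can be test-encoded. This is a sketch only; the helper name unencodable is hypothetical, and it checks the encoding, not whether the console font has a glyph for the character.

```python
import unicodedata


def unencodable(text, encoding):
    """Return (character, Unicode name) pairs for characters the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append((ch, unicodedata.name(ch, "<unnamed>")))
    return missing


in_cp437 = unencodable("a\u25be", "cp437")   # U+25BE is not part of cp437
in_utf8 = unencodable("a\u25be", "utf-8")    # UTF-8 can encode every character
```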
