What's the deal with Python 3.4, Unicode, different languages and Windows?
The problem was (see the Python 3.6 update below) with the Windows console, which supports only the legacy code page appropriate for the region targeted by your version of Windows. By default, Python throws an exception when unsupported characters are output.
Python can read an environment variable to output in other encodings or to change the default error handling. Below, I've read the console's default code page and changed the default error handling to print a ? instead of throwing an error for characters that are unsupported in the console's current code page.
C:\>chcp
Active code page: 437 # Note, US Windows OEM code page.
C:\>set PYTHONIOENCODING=437:replace
C:\>example.py
Leo? Janá?ek
Zdzis?aw Beksi?ski
??? ?? ??
??
?????? ??? ?????????? ????????
Minha Língua Portuguesa: çáà
Note the US OEM code page is limited to ASCII and some Western European characters.
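The effect of the :replace error handler can be previewed without touching the console. This is a minimal sketch, assuming the console uses code page 437 as in the session above; the helper name preview_in_code_page is made up for illustration and is not part of Python.

```python
def preview_in_code_page(text, code_page="cp437", errors="replace"):
    """Round-trip text through a code page to see which characters survive."""
    return text.encode(code_page, errors=errors).decode(code_page)

# Characters missing from cp437 become '?', as with PYTHONIOENCODING=437:replace.
czech = preview_in_code_page("Leoš Janáček")
print(czech)

# 'backslashreplace' is another built-in handler that keeps the code points visible.
escaped = preview_in_code_page("Leoš", errors="backslashreplace")
print(escaped)
```

Any error handler name accepted by str.encode can also be used after the colon in PYTHONIOENCODING.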
Below I've instructed Python to use UTF-8, but since the Windows console doesn't support it, I redirect the output to a file and display it in Notepad:
C:\>set PYTHONIOENCODING=utf8
C:\>example >out.txt
C:\>notepad out.txt
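The same encoding choice can be made inside a script instead of through the environment variable. The sketch below is an illustration under stated assumptions: it wraps a binary stream in io.TextIOWrapper, and an in-memory io.BytesIO stands in for sys.stdout.buffer so that it runs anywhere. The helper name utf8_text_stream is hypothetical.

```python
import io

def utf8_text_stream(binary_stream, errors="strict"):
    """Wrap a binary stream so text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors=errors)

# Stand-in for sys.stdout.buffer; a real script would pass that instead.
raw = io.BytesIO()
out = utf8_text_stream(raw)
print("Zdzisław Beksiński", file=out)
out.flush()  # push the encoded bytes through to the underlying stream

encoded = raw.getvalue()
print(len(encoded), "bytes written")
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is a shorter way to get the same result for the real standard output.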
On Windows, it's best to use a Python IDE that supports UTF-8 instead of the console when working with multiple languages. If you are using only one language, select it as the system locale in the Region and Language control panel and the console will support the characters of that language.
Update for Python 3.6
Python 3.6 now uses Windows Unicode APIs to write directly to the console, so the only limit is the console font's support for the characters. The following code works in a US Windows console. Because I have a Chinese language pack installed, it even displays the Chinese and Japanese if the console font is changed. Even without the correct font, replacement characters are shown in the console, and copying and pasting them into an environment such as this web page displays the characters correctly.
#!python3.6
#coding: utf8
czech = 'Leoš Janáček'
print(czech)
pl = 'Zdzisław Beksiński'
print(pl)
jp = 'リング 山村 貞子'
print(jp)
chinese = '五行'
print(chinese)
MIR = 'Машина для Инженерных Расчётов'
print(MIR)
pt = 'Minha Língua Portuguesa: çáà'
print(pt)
Output:
Leoš Janáček
Zdzisław Beksiński
リング 山村 貞子
五行
Машина для Инженерных Расчётов
Minha Língua Portuguesa: çáà
Python, Unicode, and the Windows console
Note: This answer is outdated (from 2008, written for Python 2). Please use the solution below with care.
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.
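For readers on Python 3, the wrapping idea from the excerpt can be tried as follows. This is only a sketch and not taken from the wiki page: it assumes UTF-8 rather than locale.getpreferredencoding() so the result is predictable, and it writes to an in-memory io.BytesIO instead of the real standard output.

```python
import codecs
import io

# In-memory byte stream standing in for a binary stdout.
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; calling it wraps the stream.
writer = codecs.getwriter("utf-8")(raw)

line = "\u0411\n"  # CYRILLIC CAPITAL LETTER BE plus a newline
writer.write(line)

encoded = raw.getvalue()
print(type(line), len(line), encoded)
```

In Python 3, sys.stdout is already a text stream with an encoding, so this kind of wrapping is rarely needed outside of binary streams.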
python os.walk and unicode error
Here's a test case:
C:\TEST
├───dir1
│       file1™
│
└───dir2
        file2
Here's a script (Python 3.x):
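import os
spath = r'c:\test'
# First loop: print only the directory path of each step.
for root,dirs,files in os.walk(spath):
    print(root)
# Second loop: each item is the whole (root, dirs, files) tuple.
for dirs in os.walk(spath):
    print(dirs)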
Here's the output, on an IDE that supports UTF-8 (PythonWin, in this case):
c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
('c:\\test\\dir1', [], ['file1™'])
('c:\\test\\dir2', [], ['file2'])
Here's the output on my Windows console, which defaults to cp437:
c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
Traceback (most recent call last):
File "C:\test.py", line 9, in <module>
print(dirs)
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 47: character maps to <undefined>
For Question 1, the reason print(root) works is that no directory had a character that wasn't supported by the output encoding, but print(dirs) is now printing a tuple containing (root,dirs,files) and one of the files has an unsupported character in the Windows console.
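The difference between the two print calls can be checked without a console. A minimal sketch, assuming the console encoding is cp437 as in the traceback above; can_encode is a hypothetical helper, not a standard function.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# One item as yielded by os.walk for the test tree above.
entry = ('c:\\test\\dir1', [], ['file1™'])

root_ok = can_encode(entry[0], 'cp437')     # path only: plain ASCII
tuple_ok = can_encode(str(entry), 'cp437')  # whole tuple: contains U+2122
print(root_ok, tuple_ok)

# ascii() escapes non-ASCII characters, so this prints on any console.
print(ascii(entry))
```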
For Question 2, the first example misspelled utf-8 as utf=8, and the second example didn't declare an encoding for the file the output was written to, so it used a default that didn't support the character.
Try this:
import os
spath = r'c:\test'
# Declare the file encoding explicitly so any filename can be written.
with open('os_walk4_align.txt', 'w', encoding='utf8') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)
Content of os_walk4_align.txt, encoded in UTF-8:
c:\test ['dir1', 'dir2'] []
c:\test\dir1 [] ['file1™']
c:\test\dir2 [] ['file2']
Python Unicode Does not support character U+25BE
I assume you are printing to the Windows console. The Windows console does not default to (and has poor support for) UTF-8, but you can change the code page and try again:
C:\>chcp 65001
Active code page: 65001
C:\>py
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u25be')
▾
>>> import unicodedata as ud
>>> ud.name('\u25be')
'BLACK DOWN-POINTING SMALL TRIANGLE'
That displays the correct character for me on US English Windows using the Consolas console font, but not the Lucida Console or Raster Fonts fonts. Make sure the font you are using supports the character.
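If neither the code page nor the font can be changed, one fallback is to print the character's name instead of the character itself. This is a sketch under the assumption that an ASCII-only description is acceptable; it uses the built-in namereplace error handler, which is available from Python 3.5.

```python
import unicodedata

ch = "\u25be"

# Look up the official Unicode name of the character.
name = unicodedata.name(ch)

# 'namereplace' turns unencodable characters into \N{...} escapes.
fallback = ch.encode("ascii", errors="namereplace").decode("ascii")
print(name)
print(fallback)
```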