Understanding Python Unicode and Linux Terminal

The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.

But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have tagged your question "python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case sys.getdefaultencoding() is generally 'ascii', and in your case it definitely is. And of course you cannot encode Japanese characters as ASCII, so you get an error.

Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal uses some other encoding, though; you can get the terminal encoding from sys.stdout.encoding and use that instead (it will be None when redirecting under Python 2).
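A minimal sketch of that pattern (the helper name and the UTF-8 fallback are mine, not from the original answer): use the stream's detected encoding when there is one, fall back otherwise, and write the encoded bytes yourself so redirection keeps working.

```python
import io

def write_encoded(text, stream, fallback='utf-8'):
    """Encode text for a byte stream, with a fallback encoding.

    Under Python 2, sys.stdout.encoding is None when output is
    redirected; in that case we encode with the fallback so the
    redirected output still comes out as valid UTF-8.
    """
    enc = getattr(stream, 'encoding', None) or fallback
    stream.write(text.encode(enc, errors='replace'))

# usage with a pipe-like byte stream (no .encoding attribute):
buf = io.BytesIO()
write_encoded(u'\u65e5\u672c\u8a9e', buf)  # Japanese text
```

In a real script the stream would be sys.stdout (Python 2) or sys.stdout.buffer (Python 3) rather than a BytesIO.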

In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).

python... encoding issue when using linux

When redirecting the output, sys.stdout is not connected to a terminal and Python cannot determine the output encoding. When not directing the output, Python can detect that sys.stdout is a TTY and will use the codec configured for that TTY when printing unicode.

Set the PYTHONIOENCODING environment variable to tell Python what encoding to use in such cases, or encode explicitly.
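One way to see PYTHONIOENCODING take effect is to ask a child interpreter, started with the variable set and its stdout connected to a pipe, what encoding Python chose (a Python 3 sketch; the exact reported spelling of the name can vary slightly between versions):

```python
import os
import subprocess
import sys

# Spawn a child interpreter whose stdout is a pipe (not a TTY), with
# PYTHONIOENCODING set; the child reports the encoding Python picked
# for that non-terminal stdout.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)
```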

Encoding issue while printing a string in Python in Linux

To get started, read https://docs.python.org/3/howto/unicode.html

To read a text file, just open it as a text file and specify an encoding if needed:

open('test.txt', 'r', encoding='utf-8')

Read operations on that file will then return Unicode strings rather than byte strings. As a rule, whenever you handle text, always use Unicode objects.
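A quick way to verify that text-mode reads give you Unicode strings (using a temporary file here rather than the test.txt above, so the snippet is self-contained):

```python
import os
import tempfile

# write some non-ASCII text, then read it back in text mode
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf-8',
                                 delete=False) as f:
    f.write('k\u03a9')
    path = f.name

with open(path, 'r', encoding='utf-8') as f:
    s = f.read()

print(type(s).__name__)  # str: a Unicode string, not bytes
os.remove(path)
```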

Printing Unicode to the console is another can of worms, and it is poorly supported, especially on Windows. But there are plenty of answers to that problem on StackOverflow already, eg. here: Python, Unicode, and the Windows console and Understanding Python Unicode and Linux terminal

print unicode characters in terminal with python 3

Err... print them...

3>> print('♔♕♖')
♔♕♖

Windows will probably need chcp 65001 before running the script.

Linux/Python: encoding a unicode string for print

I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
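In Python 3 terms, the PrintFails idea of "replace sys.stdout with an encoding wrapper" can be sketched as follows (the helper name is mine; the wiki page targets Python 2, where codecs.getwriter plays this role instead):

```python
import io

def wrap_utf8(binary_stream):
    """Wrap a byte stream so that written text is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8',
                            errors='replace')

# In a script you would do: sys.stdout = wrap_utf8(sys.stdout.buffer)
# when the detected encoding cannot represent what you need to print.
raw = io.BytesIO()
out = wrap_utf8(raw)
out.write('\u2654\u2655\u2656')  # chess symbols
out.flush()
```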

printing and writing Unicode characters in Python

won't display on the command line terminal

What errors do you get? In any event, the following works if you remove the unnecessary str() conversion and quote 'name' on a terminal that supports UTF-8, such as Linux:

import requests
import json

url = 'https://api.discogs.com/releases/7828220'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
art = requests.get(url, headers=headers)
json_object = json.loads(art.text)
print(json_object['companies'][0]['name'])

Output:

ООО "Парадиз"

On Windows, the command console may not default to an encoding that supports the characters you are trying to print. One easy way is to switch to a supported encoding, in this case chcp 1251 changes the code page to one supporting Russian, and will make the above work.

To write it to a file, use io.open with an explicit encoding:

import io
with io.open('output.txt', 'w', encoding='utf8') as f:
    f.write(json_object['companies'][0]['name'])

Python3 and encoding: different on linux and on OSX?

Python asks the terminal what encoding is being used, and encodes unicode strings to bytes when printing. Your Ubuntu server is not configured for UTF-8 display, your Mac terminal is.

See https://askubuntu.com/questions/87227/switch-encoding-of-terminal-with-a-command for help with switching your terminal locale. Any locale that can handle the specific code points you are trying to print is fine, but UTF-8 can handle all of Unicode.

You can see what Python detected by printing sys.stdout.encoding:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'

Help me understand why Unicode only works sometimes with Python

I/O in Python (and most other languages) is based on bytes. When you write a byte string (str in 2.x, bytes in 3.x) to a file, the bytes are simply written as-is. When you write a Unicode string (unicode in 2.x, str in 3.x) to a file, the data needs to be encoded to a byte sequence.

For a further explanation of this distinction see the Dive into Python 3 chapter on strings.

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

Here, the string is a byte string. Because the encoding of your source file is UTF-8, the bytes are

'abcd k\xce\xa9 \xe2\x98\xa0 \xc2\xb0C \xe2\x88\x9aHz \xc2\xb5F \xc3\xbc \xe2\x98\x83 \xe2\x99\xa5'

The print statement writes these bytes to the console as-is. But the Windows console interprets byte strings as being encoded in the "OEM" code page, which in the US is 437. So the string you actually see on your screen is

abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ

On your Ubuntu system, this doesn't cause a problem because there the default console encoding is UTF-8, so you don't have the discrepancy between source file encoding and console encoding.
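You can reproduce that mojibake on any system by round-tripping through the two encodings involved (a small illustrative sketch, not from the original answer):

```python
# Encode the text as UTF-8, then misinterpret those bytes as CP437,
# which is exactly what the US Windows console does with them.
s = 'abcd k\u03a9 \u2620 \u00b0C \u221aHz \u00b5F \u00fc \u2603 \u2665'
mojibake = s.encode('utf-8').decode('cp437')
print(mojibake)
```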

print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

When printing a Unicode string, the string has to get encoded into bytes. But it only works if you have an encoding that supports those characters. And you don't.

  • The default IBM437 encoding lacks the characters ☠☃♥
  • The windows-1252 encoding used by Spyder lacks the characters Ω☠√☃♥.

So, in both cases, you get a UnicodeEncodeError trying to print the string.
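The failure is easy to demonstrate directly (a minimal sketch, using the codec names as Python spells them):

```python
# cp437 (the US "OEM" code page) has no mapping for the
# skull-and-crossbones character, so encoding raises the same
# error that print hits internally.
try:
    '\u2620'.encode('cp437')
    failed = False
except UnicodeEncodeError as e:
    failed = True
    print('UnicodeEncodeError:', e.reason)
```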

What gives?

Windows and Linux took vastly different approaches to supporting Unicode.

Originally, they both worked pretty much the same way: Each locale has its own language-specific char-based encoding (the "ANSI code page" in Windows). Western languages used ISO-8859-1 or windows-1252, Russian used KOI8-R or windows-1251, etc.

When Windows NT added support for Unicode (in the early days, when it was assumed that Unicode would use 16-bit characters), it did so by creating a parallel version of its API that used wchar_t instead of char. For example, the MessageBox function was split into the two functions:

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

The "W" functions are the "real" ones. The "A" functions exist for backwards compatibility with DOS-based Windows and mostly just convert their string arguments to UTF-16 and then call the corresponding "W" function.

In the Unix world (specifically, Plan 9), writing a whole new version of the POSIX API was seen as impractical, so Unicode support was approached in a different manner. The existing support for multi-byte encoding in CJK locales was used to implement a new encoding now known as UTF-8.

The preference towards UTF-8 on Unix-like systems and UTF-16 on Windows is a huge pain in the ass when writing cross-platform code that supports Unicode. Python tries to hide this from the programmer, but printing to the console is one of Joel's "leaky abstractions".
