Understanding Python Unicode and Linux terminal
The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.
But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have tagged your question with "Python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case sys.getdefaultencoding() is generally 'ascii', and in your case it definitely is. And of course, you cannot encode Japanese characters as ASCII, so you get an error.
Your best bet when using Python 2 is to encode the string as UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That will not work if your terminal uses some other encoding, but you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
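The advice above about using sys.stdout.encoding when it is available, and UTF-8 otherwise, can be sketched roughly as follows. This is a minimal Python 3 sketch, not the original poster's code: the helper names pick_output_encoding and encode_for are made up for illustration, and the UTF-8 fallback is an assumption you may want to change.

```python
import sys


def pick_output_encoding(stream, fallback="utf-8"):
    """Return the stream's encoding, or the fallback when it has none."""
    # Under Python 2 a redirected sys.stdout reports None here. The fallback
    # is our own choice (an assumption), not something Python picks for us.
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text, fallback="utf-8"):
    """Encode text to bytes using the encoding chosen for the stream."""
    return text.encode(pick_output_encoding(stream, fallback))


# Show which encoding would be used for the current standard output.
print(pick_output_encoding(sys.stdout))
```

Passing the stream in as an argument keeps the decision in one place, so the same helper works for a terminal, a pipe, or a file object.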
python... encoding issue when using linux
When redirecting the output, sys.stdout is not connected to a terminal and Python cannot determine the output encoding. When not redirecting the output, Python can detect that sys.stdout is a TTY and will use the codec configured for that TTY when printing unicode.
Set the PYTHONIOENCODING environment variable to tell Python what encoding to use in such cases, or encode explicitly.
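One way to see the effect of PYTHONIOENCODING is to start a child interpreter whose output goes to a pipe and ask it what it thinks sys.stdout.encoding is. This is a sketch for Python 3. The helper name child_stdout_encoding is hypothetical, and the exact spelling of the reported name may vary between Python versions, so it is normalized with codecs.lookup.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return what it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,  # a pipe, so the child is not attached to a TTY
        check=True,
    )
    return result.stdout.decode("ascii").strip()


reported = child_stdout_encoding("utf-8")
# Normalize the spelling of the codec name before showing it.
print(codecs.lookup(reported).name)
```

From a shell, the equivalent is to set the variable in front of the command that is being redirected.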
Encoding issue while printing a string in Python in Linux
To get started, read https://docs.python.org/3/howto/unicode.html
To read a text file, just open it as a text file and specify an encoding if needed:
open('test.txt','r', encoding="utf-8")
Read operations on that file will then return Unicode strings rather than byte strings. As a rule, whenever you handle text, always use Unicode objects.
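As a small illustration of the difference, the sketch below writes a file in a temporary directory (the file name and sample text are arbitrary), then reads it back once in text mode and once in binary mode. It assumes Python 3.

```python
import os
import tempfile

text = "\u65e5\u672c\u8a9e \u2654"  # sample text: three Japanese characters and a chess king

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # Text mode with an explicit encoding: str in, bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Text mode again: the bytes are decoded back to a Unicode string.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

    # Binary mode: the raw UTF-8 bytes, with no decoding applied.
    with open(path, "rb") as f:
        raw = f.read()

print(type(decoded).__name__, type(raw).__name__)
```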
Printing Unicode to the console is another can of worms, and it is especially poorly supported on Windows. But there are plenty of answers to that problem already on Stack Overflow, e.g. here: "Python, Unicode, and the Windows console" and "Understanding Python Unicode and Linux terminal".
print unicode characters in terminal with python 3
Err... print them...
>>> print('♔♕♖')
♔♕♖
Windows will probably need chcp 65001 before running the script.
Linux/Python: encoding a unicode string for print
I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
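The wiki page describes the approach for Python 2. A rough Python 3 analogue of "wrap the output stream in something that encodes with the arguments you choose" might look like the sketch below. The helper name wrap_utf8 is made up for illustration, and an in-memory buffer stands in for the real output stream so the result can be inspected.

```python
import io


def wrap_utf8(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        errors="replace",     # substitute instead of raising on bad data
        line_buffering=True,  # flush after each newline, like a console
    )


buffer = io.BytesIO()  # stand-in for a real binary output stream
out = wrap_utf8(buffer)
print("\u2654\u2655\u2656", file=out)  # three chess pieces
out.flush()

print(buffer.getvalue())  # the encoded bytes that reached the stream
```

For the real standard output, the same idea would be sys.stdout = wrap_utf8(sys.stdout.buffer); on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") does the job without a wrapper.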
printing and writing Unicode characters in Python
won't display on the command line terminal
What errors do you get? In any event, the following works on a terminal that supports UTF-8, such as Linux, if you remove the unnecessary str() conversion and quote 'name':
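import json

import requests

url = 'https://api.discogs.com/releases/7828220'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
art = requests.get(url, headers=headers)
# art.text is already a Unicode string, so json.loads returns Unicode values
json_object = json.loads(art.text)
# Index with the quoted key 'name' and print it without wrapping it in str()
print(json_object['companies'][0]['name'])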
Output:
ООО "Парадиз"
On Windows, the command console may not default to an encoding that supports the characters you are trying to print. One easy fix is to switch to a supported encoding: in this case, chcp 1251 changes the code page to one supporting Russian and will make the above work.
To write it to a file, use io.open with an encoding:
import io
with io.open('output.txt','w',encoding='utf8') as f:
    f.write(json_object['companies'][0]['name'])
Python3 and encoding: different on linux and on OSX?
Python asks the terminal what encoding is being used, and encodes unicode strings to bytes when printing. Your Ubuntu server is not configured for UTF-8 display, your Mac terminal is.
See https://askubuntu.com/questions/87227/switch-encoding-of-terminal-with-a-command for help with switching your terminal locale. Any locale that can handle the specific codepoints you are trying to print is fine, but UTF-8 can handle all of Unicode.
You can see what Python detected by printing sys.stdout.encoding:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
Help me understand why Unicode only works sometimes with Python
I/O in Python (and most other languages) is based on bytes. When you write a byte string (str in 2.x, bytes in 3.x) to a file, the bytes are simply written as-is. When you write a Unicode string (unicode in 2.x, str in 3.x) to a file, the data needs to be encoded to a byte sequence.
For a further explanation of this distinction see the Dive into Python 3 chapter on strings.
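In Python 3 terms, the distinction can be seen directly. The sketch below uses an in-memory binary stream as a stand-in for a file opened in binary mode; the sample text is arbitrary.

```python
import io

text = "k\u03a9"             # a Unicode string: 'k' followed by the ohm sign
data = text.encode("utf-8")  # the byte sequence that would end up in a file

print(len(text), len(data))  # characters versus bytes
print(data)

binary_stream = io.BytesIO()
binary_stream.write(data)  # bytes are written as-is

try:
    binary_stream.write(text)  # a str has to be encoded first
    str_accepted = True
except TypeError:
    str_accepted = False

print(str_accepted)
```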
print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
Here, the string is a byte string (this is Python 2, where an unprefixed literal is a str). Because the encoding of your source file is UTF-8, the bytes are
'abcd k\xce\xa9 \xe2\x98\xa0 \xc2\xb0C \xe2\x88\x9aHz \xc2\xb5F \xc3\xbc \xe2\x98\x83 \xe2\x99\xa5'
The print statement writes these bytes to the console as-is. But the Windows console interprets byte strings as being encoded in the "OEM" code page, which in the US is 437. So the string you actually see on your screen is
abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ
On your Ubuntu system, this doesn't cause a problem because there the default console encoding is UTF-8, so you don't have the discrepancy between source file encoding and console encoding.
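The mismatch between source encoding and console encoding can be reproduced on any platform by encoding the text as UTF-8 and then decoding those bytes as code page 437, which is roughly what the console does when it displays them. A minimal Python 3 sketch, with the text written using escapes so the code points are unambiguous:

```python
# "abcd k<ohm> <skull> <degree>C <sqrt>Hz <micro>F <u-umlaut> <snowman> <heart>"
text = "abcd k\u03a9 \u2620 \u00b0C \u221aHz \u00b5F \u00fc \u2603 \u2665"

# What the source file (and a Python 2 byte string) actually contains.
utf8_bytes = text.encode("utf-8")

# What a console using code page 437 makes of those same bytes.
as_seen_in_cp437 = utf8_bytes.decode("cp437")

# What a UTF-8 console makes of them: the original text again.
as_seen_in_utf8 = utf8_bytes.decode("utf-8")

print(utf8_bytes)
print(as_seen_in_utf8 == text)
```

The variable as_seen_in_cp437 holds the garbled text; printing it is left out here because that in turn needs a console able to display box-drawing characters.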
print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
When printing a Unicode string, the string has to get encoded into bytes. But it only works if you have an encoding that supports those characters. And you don't.
- The default IBM437 encoding lacks the characters ☠☃♥.
- The windows-1252 encoding used by Spyder lacks the characters Ω☠√☃♥.
So, in both cases, you get a UnicodeEncodeError trying to print the string.
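Which characters a given encoding lacks can be checked by trying to encode them one at a time. The sketch below is Python 3; the helper name unencodable is made up for illustration, and "cp437" and "cp1252" are Python's codec names for IBM437 and windows-1252.

```python
# The same test string, written with escapes so the code points are unambiguous.
text = "abcd k\u03a9 \u2620 \u00b0C \u221aHz \u00b5F \u00fc \u2603 \u2665"


def unencodable(text, encoding):
    """Return the distinct characters of text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return "".join(missing)


for name in ("cp437", "cp1252", "utf-8"):
    # ascii() keeps the report printable on any console.
    print(name, ascii(unencodable(text, name)))
```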
What gives?
Windows and Linux took vastly different approaches to supporting Unicode.
Originally, they both worked pretty much the same way: Each locale has its own language-specific char-based encoding (the "ANSI code page" in Windows). Western languages used ISO-8859-1 or windows-1252, Russian used KOI8-R or windows-1251, etc.
When Windows NT added support for Unicode (in the early days when it was assumed that Unicode would use 16-bit characters), it did so by creating a parallel version of its API that used wchar_t instead of char. For example, the MessageBox function was split into the two functions:
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
The "W" functions are the "real" ones. The "A" functions exist for backwards compatibility with DOS-based Windows and mostly just convert their string arguments to UTF-16 and then call the corresponding "W" function.
In the Unix world (specifically, Plan 9), writing a whole new version of the POSIX API was seen as impractical, so Unicode support was approached in a different manner. The existing support for multi-byte encoding in CJK locales was used to implement a new encoding now known as UTF-8.
The preference towards UTF-8 on Unix-like systems and UTF-16 on Windows is a huge pain in the ass when writing cross-platform code that supports Unicode. Python tries to hide this from the programmer, but printing to the console is one of Joel's "leaky abstractions".