Linux/Python: Encoding a Unicode String for Print


I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
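For reference, a minimal sketch of that wrapper approach, rendered for Python 3 with io.TextIOWrapper rather than the codecs writer (a later answer in this thread explains why the codecs-based wrapper can misbehave). The helper name wrap_stdout and the FakeStdout stand-in are illustrative only:

```python
import io

# Replace a stdout-like stream with a text wrapper that encodes with an
# explicit codec and substitutes characters the codec cannot represent.
def wrap_stdout(stream, encoding="utf-8"):
    return io.TextIOWrapper(stream.buffer, encoding=encoding,
                            errors="replace", line_buffering=True)

class FakeStdout:
    """Stand-in for sys.stdout so the demo does not touch the real stream."""
    def __init__(self):
        self.buffer = io.BytesIO()

fake = FakeStdout()
wrapped = wrap_stdout(fake, encoding="ascii")
print(u"caf\u00e9", file=wrapped)   # é cannot be encoded as ASCII...
wrapped.flush()
print(fake.buffer.getvalue())       # ...so it is replaced: b'caf?\n'
```

In a real script you would do `sys.stdout = wrap_stdout(sys.stdout)` once at startup instead of using a fake stream.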

Encoding issue while printing a string in Python in Linux

To get started, read https://docs.python.org/3/howto/unicode.html

To read a text file, just open it as a text file and specify an encoding if needed:

open('test.txt','r', encoding="utf-8")

Read operations on that file will then return Unicode strings rather than byte strings. As a rule, whenever you handle text, always use Unicode objects.
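A quick self-contained round-trip check of that rule (the temporary file name is arbitrary):

```python
import os
import tempfile

# Write UTF-8 bytes, then read them back as text with an explicit encoding.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(u'na\u00efve caf\u00e9\n'.encode('utf-8'))

with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(type(text).__name__)   # str -- a Unicode string, not bytes
os.remove(path)
```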

Printing Unicode to the console is another can of worms, and it is especially poorly supported on Windows. But there are plenty of answers to that problem on Stack Overflow already, e.g. here: Python, Unicode, and the Windows console and Understanding Python Unicode and Linux terminal

Understanding Python Unicode and Linux terminal

The terminal has a character set, and Python knows what that character set is, so it will automatically decode your Unicode strings to the byte-encoding that the terminal uses, in your case UTF-8.

But when you redirect, you are no longer using the terminal. You are now just using a Unix pipe. That Unix pipe doesn't have a charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set.
You have tagged your question "python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case sys.getdefaultencoding() is generally 'ascii' (and in your case it definitely is), and of course you cannot encode Japanese characters as ASCII, so you get an error.

Your best bet when using Python 2 is to encode the string with UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That means it will not work if your terminal uses some other encoding, though; in that case you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
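A sketch of that pattern, written so it also runs under Python 3 for demonstration. safe_print is a hypothetical helper, not part of any standard library:

```python
import sys

# "Encode before printing", with a UTF-8 fallback for the redirection case
# where the stream reports no encoding (as happens under Python 2).
def safe_print(text, stream=None):
    stream = stream if stream is not None else sys.stdout
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    data = text.encode(encoding, errors='replace')
    buffer = getattr(stream, 'buffer', None)   # Python 3 text streams
    if buffer is not None:
        buffer.write(data)
    else:
        stream.write(data)                     # raw byte stream

safe_print(u'\u65e5\u672c\u8a9e\n')   # Japanese text, as in the question
```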

In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).

printing and writing Unicode characters in Python

won't display on the command line terminal

What errors do you get? In any event, the following works once you remove the unnecessary str() conversion and quote 'name', on a terminal that supports UTF-8 (such as a typical Linux terminal):

import requests
import json

url = 'https://api.discogs.com/releases/7828220'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
art = requests.get(url, headers=headers)
json_object = json.loads(art.text)
print json_object['companies'][0]['name']

Output:

ООО "Парадиз"

On Windows, the command console may not default to an encoding that supports the characters you are trying to print. One easy fix is to switch to a supported encoding; in this case chcp 1251 changes the code page to one that supports Russian and will make the above work.

To write it to a file, use io.open with an encoding:

import io
with io.open('output.txt','w',encoding='utf8') as f:
    f.write(json_object['companies'][0]['name'])

How to make python 3 print() utf8

Clarification:

TestText = "Test - āĀēĒčČ..šŠūŪžŽ" # this is not UTF-8; it is a Unicode string in Python 3.x.
TestText2 = TestText.encode('utf8') # this is a UTF-8-encoded byte string.

To send UTF-8 to stdout regardless of the console's encoding, use its buffer interface, which accepts bytes:

import sys
sys.stdout.buffer.write(TestText2)
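Putting the two snippets together into one runnable piece, with an explicit flush, since bytes written to the buffer bypass the text layer's own buffering:

```python
import sys

TestText = "Test - āĀēĒčČ..šŠūŪžŽ"   # a Unicode (str) object in Python 3
data = TestText.encode('utf8')        # the UTF-8-encoded byte string

# Writes to the underlying buffer bypass sys.stdout's text layer,
# so flush explicitly to keep output ordered with regular print() calls.
sys.stdout.buffer.write(data + b"\n")
sys.stdout.buffer.flush()
```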

Print unicode string to console OK but fails when redirect to a file. How to fix?

Set the PYTHONIOENCODING environment variable:

SET PYTHONIOENCODING=cp936
windows_prn_utf8.py > 1.txt
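The same effect can be observed from Python itself by launching a child interpreter with the variable set; even with stdout piped, the child's sys.stdout.encoding follows the variable. (sys.executable is used here only to locate a Python to run.)

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONIOENCODING set and stdout piped.
# The child's sys.stdout.encoding follows the variable, not the pipe.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    capture_output=True, text=True, env=env)
print(result.stdout.strip())
```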

Proper way to print unicode characters to the console in Python when using inline scripts

The problem isn't printing to the console, the problem is interpreting the -c argument from the command line:

$ python -c "print repr('é')"
'\xc3\xa9' # OK, expected byte string
$ python -c "print repr('é'.decode('utf-8'))"
u'\xe9' # OK, byte string decoded explicitly
$ python -c "print repr(u'é')"
u'\xc3\xa9' # bad, decoded implicitly as iso-8859-1

Seems the problem is Python doesn't know what encoding command line arguments are using, so you get the same kind of problem as if a source code file had the wrong encoding. In that case you would tell Python what encoding the source used with a coding comment, and you can do that here too:

$ python -c "# coding=utf-8
print repr(u'é')"
u'\xe9'

Generally I'd try to avoid Unicode on the command line though, especially if you might ever have to run on Windows where the story is much worse.

How to print UTF-8 encoded text to the console in Python 3?

How to print UTF-8 encoded text to the console in Python < 3?

print u"some unicode text \N{EURO SIGN}"
print b"some utf-8 encoded bytestring \xe2\x82\xac".decode('utf-8')

i.e., if you have a Unicode string then print it directly. If you have a bytestring then convert it to Unicode first.

Your locale settings (LANG, LC_CTYPE) indicate a UTF-8 locale, and therefore (in theory) you could print a UTF-8 bytestring directly and it should be displayed correctly in your terminal (if the terminal settings are consistent with the locale settings, as they should be). But you should avoid it: do not hardcode the character encoding of your environment inside your script; print Unicode directly instead.

There are many wrong assumptions in your question.

You do not need to set PYTHONIOENCODING to match your locale settings in order to print Unicode to the terminal. A UTF-8 locale supports all Unicode characters, i.e., it works as is.

You do not need the workaround sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout). It may break if some code (that you do not control) does need to print bytes, and/or it may break while printing Unicode to the Windows console (wrong codepage, can't print undecodable characters). Correct locale settings and/or the PYTHONIOENCODING envvar are enough. Also, if you need to replace sys.stdout, then use io.TextIOWrapper() instead of the codecs module, like the win-unicode-console package does.

sys.getdefaultencoding() is unrelated to your locale settings and to PYTHONIOENCODING. Your assumption that setting PYTHONIOENCODING should change sys.getdefaultencoding() is incorrect. You should check sys.stdout.encoding instead.

sys.getdefaultencoding() is not used when you print to the console. It may be used as a fallback on Python 2 if stdout is redirected to a file/pipe, unless PYTHONIOENCODING is set:

$ python2 -c'import sys; print(sys.stdout.encoding)'
UTF-8
$ python2 -c'import sys; print(sys.stdout.encoding)' | cat
None
$ PYTHONIOENCODING=utf8 python2 -c'import sys; print(sys.stdout.encoding)' | cat
utf8

Do not call sys.setdefaultencoding("UTF-8"); it may corrupt your data silently and/or break 3rd-party modules that do not expect it. Remember that sys.getdefaultencoding() is used in Python 2 to convert bytestrings (str) to/from unicode implicitly, e.g., "a" + u"b". See also the quote in @mesilliac's answer.
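As a concrete illustration of why that implicit coercion was dangerous, Python 3 removed it entirely; mixing bytes and str now fails loudly instead of guessing an encoding:

```python
# Python 2 silently decodes b"a" using sys.getdefaultencoding() here;
# Python 3 refuses to guess and raises TypeError instead.
try:
    b"a" + u"b"
except TypeError as exc:
    print("TypeError:", exc)
```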

Python 'ascii' encode problems in print statement

LANG is used to determine your locale; if you don't set specific LC_* variables, the LANG variable is used as the default.

The filesystem encoding is determined by the LC_CTYPE variable, but if you haven't set that variable specifically, the LANG environment variable is used instead.

Printing uses sys.stdout, a text stream configured with the codec your terminal uses. Your terminal settings are also locale-specific; your LANG variable should really reflect the locale your terminal is set to. If that is UTF-8, you need to make sure your LANG variable reflects that. sys.stdout uses locale.getpreferredencoding(False) (like all text streams opened without an explicit encoding) and on POSIX systems that consults LC_CTYPE too.
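To see which encodings are actually in play on a given system, the values can be inspected directly (they vary with LANG/LC_CTYPE, so no particular output is guaranteed):

```python
import locale
import sys

# The default codec for text streams opened without an explicit encoding:
print(locale.getpreferredencoding(False))
# What the already-open stdout stream actually uses:
print(sys.stdout.encoding)
```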


