How to Print Utf-8 Encoded Text to the Console in Python < 3

How to print UTF-8 encoded text to the console in Python < 3?

How to print UTF-8 encoded text to the console in Python < 3?

print u"some unicode text \N{EURO SIGN}"
print b"some utf-8 encoded bytestring \xe2\x82\xac".decode('utf-8')

i.e., if you have a Unicode string then print it directly. If you have
a bytestring then convert it to Unicode first.

Your locale settings (LANG, LC_CTYPE) indicate a utf-8 locale and
therefore (in theory) you could print a utf-8 bytestring directly and it
should be displayed correctly in your terminal (if terminal settings
are consistent with the locale settings and they should be) but you
should avoid it: do not hardcode the character encoding of your
environment inside your script; print Unicode directly instead.
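
As a minimal sketch of the "convert to Unicode first" advice, the snippet below decodes bytestrings at the boundary and leaves Unicode text untouched. The helper name to_text is hypothetical (it is not part of any library), and the sketch assumes the incoming bytes really are UTF-8:

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it first if it is a bytestring.

    to_text is a hypothetical helper name used only for illustration.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value


already_text = to_text(u"some unicode text \N{EURO SIGN}")
from_bytes = to_text(b"some utf-8 encoded bytestring \xe2\x82\xac")

# Both values are Unicode text now, so print(already_text) and
# print(from_bytes) leave the choice of output encoding to the environment.
```

The encoding named here describes the input data, not the terminal, so nothing about the environment is hardcoded.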

There are many wrong assumptions in your question.

With your locale settings, you do not need to set PYTHONIOENCODING
to print Unicode to the terminal. A utf-8 locale supports all Unicode characters, i.e., it works as is.

You do not need the workaround sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout). It may
break if some code (that you do not control) does need to print bytes,
and/or it may break while printing Unicode to the Windows console (wrong codepage, can't print characters that the codepage cannot encode). Correct locale settings and/or the PYTHONIOENCODING envvar are enough. Also, if you need to replace sys.stdout, then use io.TextIOWrapper() instead of the codecs module, as the win-unicode-console package does.
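
For the case where sys.stdout really does have to be replaced, here is a rough sketch of the io.TextIOWrapper() approach. It is demonstrated on an in-memory buffer so it can be run anywhere; the helper name wrap_utf8 and the choice of errors="backslashreplace" are assumptions for illustration, not what win-unicode-console itself does:

```python
import io


def wrap_utf8(binary_stream, errors="backslashreplace"):
    """Wrap a binary stream in a text layer that encodes to UTF-8.

    wrap_utf8 is a hypothetical helper name.
    """
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors=errors, line_buffering=True
    )


# Demonstration on an in-memory binary buffer instead of the real stdout.
demo_buffer = io.BytesIO()
writer = wrap_utf8(demo_buffer)
writer.write(u"\N{EURO SIGN}")
writer.flush()
encoded = demo_buffer.getvalue()  # the UTF-8 bytes produced by the wrapper

# On Python 3 the real stream could be replaced in the same way:
#     sys.stdout = wrap_utf8(sys.stdout.buffer)
```

Unlike the codecs writer, the wrapper still exposes the underlying binary stream as its buffer attribute, so code that needs to emit bytes has somewhere to write them.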

sys.getdefaultencoding() is unrelated to your locale settings and to
PYTHONIOENCODING. Your assumption that setting PYTHONIOENCODING
should change sys.getdefaultencoding() is incorrect. You should
check sys.stdout.encoding instead.
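
To see the difference between these values on your own machine, something like the following sketch can help. The function name describe_encodings is made up for this example:

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that are easy to confuse with each other.

    describe_encodings is a hypothetical helper name.
    """
    return {
        # Encoding actually used when printing text to stdout.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Interpreter-wide default; not driven by locale or PYTHONIOENCODING.
        "default": sys.getdefaultencoding(),
        # Encoding suggested by the locale settings (LANG, LC_CTYPE).
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-8s %s" % (name, value))
```

Only the "stdout" entry tells you how print will encode text; the other two are shown for comparison.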

sys.getdefaultencoding() is not used when you print to the
console. It may be used as a fallback on Python 2 if stdout is
redirected to a file/pipe unless PYTHONIOENCODING is set:

$ python2 -c'import sys; print(sys.stdout.encoding)'
UTF-8
$ python2 -c'import sys; print(sys.stdout.encoding)' | cat
None
$ PYTHONIOENCODING=utf8 python2 -c'import sys; print(sys.stdout.encoding)' | cat
utf8
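
The same experiment can be scripted. The sketch below starts a child interpreter with its stdout connected to a pipe and reports what the child sees as sys.stdout.encoding; child_stdout_encoding is a hypothetical helper name. Note that it runs whatever interpreter executes the script, and Python 3 reports a locale-based encoding for a pipe rather than None as in the python2 session above:

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(ioencoding=None):
    """Return sys.stdout.encoding as seen by a child whose stdout is a pipe.

    child_stdout_encoding is a hypothetical helper name.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a known state
    if ioencoding is not None:
        env["PYTHONIOENCODING"] = ioencoding
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return output.decode("ascii").strip()


reported = child_stdout_encoding("utf8")
# codecs.lookup() normalizes spelling variants such as "utf8" and "UTF-8".
normalized = codecs.lookup(reported).name
```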

Do not call sys.setdefaultencoding("UTF-8"); it may corrupt your
data silently and/or break 3rd-party modules that do not expect
it. Remember sys.getdefaultencoding() is used to convert bytestrings
(str) to/from unicode in Python 2 implicitly e.g., "a" + u"b". See also,
the quote in @mesilliac's answer.

How to make python 3 print() utf8

Clarification:

TestText = "Test - āĀēĒčČ..šŠūŪžŽ" # this is not UTF-8; it is a Unicode string in Python 3.x.
TestText2 = TestText.encode('utf8') # this is a UTF-8-encoded byte string.

To send UTF-8 to stdout regardless of the console's encoding, use its buffer interface, which accepts bytes:

import sys
sys.stdout.buffer.write(TestText2)
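
If you mix this with ordinary print() calls, the output can end up out of order because the text layer and the byte layer buffer separately. A small sketch of a helper that flushes around the raw write is shown below; the name write_utf8 is hypothetical, and the demonstration uses an in-memory stream with an ASCII text layer to stand in for a console that cannot encode the characters:

```python
import io
import sys


def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the bytes to the stream's binary buffer.

    write_utf8 is a hypothetical helper name.
    """
    stream = sys.stdout if stream is None else stream
    stream.flush()  # emit pending print() output before the raw bytes
    stream.buffer.write(text.encode("utf-8"))
    stream.buffer.flush()


# Stand-in for a console whose text layer only supports ASCII.
raw = io.BytesIO()
ascii_console = io.TextIOWrapper(raw, encoding="ascii")
write_utf8("Test - \u0101\u0100\u0113\u0112", ascii_console)
written = raw.getvalue()  # UTF-8 bytes, despite the ASCII text layer
```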

Printing UTF-8 characters in Python 3 to the web

Thanks to the feedback from users here, I was able to piece together a solution:

  1. The Content-Type line must include charset=utf-8.
  2. Apache's configuration file must include SetEnv LANG en_US.UTF-8.

A great debugging tool was to print the value of sys.stdout.encoding; it should return "UTF-8", not "ANSI_X3.4-1968" (which is ASCII).
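
For reference, here is a rough sketch of a CGI response whose header declares charset=utf-8. It takes a different route from the LANG fix above: it encodes the whole response to bytes explicitly, so it does not depend on sys.stdout.encoding. The function name build_cgi_response and the text/html content type are assumptions for illustration:

```python
def build_cgi_response(body_text):
    """Return a complete CGI response as UTF-8 encoded bytes.

    build_cgi_response is a hypothetical helper name; text/html is assumed.
    """
    header = b"Content-Type: text/html; charset=utf-8\r\n\r\n"
    return header + body_text.encode("utf-8")


response = build_cgi_response(u"<p>Price: 5 \N{EURO SIGN}</p>")

# In a real CGI script the bytes would go straight to the binary stream:
#     sys.stdout.buffer.write(response)
#     sys.stdout.buffer.flush()
```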

printing UTF-8 in Python 3 using Sublime Text 3

The answer was actually in the question linked in your question - PYTHONIOENCODING needs to be set to "utf-8". However, since OS X is silly and doesn't pick up on environment variables set in Terminal or via .bashrc or similar files, this won't work in the way indicated in the answer to the other question. Instead, you need to pass that environment variable to Sublime.

Luckily, ST3 build systems (I don't know about ST2) have the "env" option. This is a dictionary of keys and values passed to exec.py, which is responsible for running build systems without the "target" option set. As discussed in our comments above, I indicated that your sample program worked fine on a UTF-8-encoded text file containing non-ASCII characters when run with ST3 (Build 3122) on Linux, but not with the same version run on OS X. All that was necessary to get it to run was to change the build system to include this line:

"env": {"PYTHONIOENCODING": "utf8"},

I saved the build system, hit Cmd+B, and the program ran fine.

BTW, if you'd like to read exec.py, or Packages/Python/Python.sublime-build, or any other file packed up in a .sublime-package archive, install PackageResourceViewer via Package Control. Use the "Open Resource" option in the Command Palette to pick individual files, or "Extract Package" to extract an entire package to your Packages folder (both options are preceded by "PackageResourceViewer:", or prv using fuzzy search). The Packages folder is accessed by selecting Sublime Text → Preferences → Browse Packages… (just Preferences → Browse Packages… on other operating systems). It is located on your hard drive in the following location:

  • Linux: ~/.config/sublime-text-3/Packages
  • OS X: ~/Library/Application Support/Sublime Text 3/Packages
  • Windows Regular Install: C:\Users\YourUserName\AppData\Roaming\Sublime Text 3\Packages
  • Windows Portable Install: InstallationFolder\Sublime Text 3\Data\Packages

Once files are saved to your Packages folder (if you just view them via the "Open Resource" option and close without changing or saving them, they won't be), they will override the identically-named file contained within the .sublime-package archive. So, for instance, if you want to edit the default Python.sublime-build file in the Python package, your changes will be saved as Packages/Python/Python.sublime-build, and when you choose the Python build system from the menu, it will only use your version.

Printing utf8 strings in Sublime Text's console with Windows

I have found a possible fix: add the encoding parameter in the Python.sublime-build file:

{
"cmd": ["python", "-u", "$file"],
"file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
"selector": "source.python",
"encoding": "cp1252",
...

Note: "encoding": "latin1" seems to work as well, but - I don't know why - "encoding": "utf8" does not work, even if the .py file is UTF8, even if Python 3 uses UTF8, etc. Mystery!


Edit: This works now:

{
"cmd": ["python", "-u", "$file"],
"file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
"selector": "source.python",
"encoding": "utf8",
"env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"},
}

Linked topic:

  • Setting the correct encoding when piping stdout in Python and this answer in particular

  • How to change the preferred encoding in Sublime Text 3 for MacOS for the env trick.

Python 3 - print utf-8 encoded data into console (not \x00(\x00A\x04 )

That's not UTF-8.

3>> b"\x00(\x00A\x04>\x042\x04<\x045\x04A".decode('utf-16be')
'(Aовмес'

Note that "utf-16be" was chosen based on your sample data; it is more likely to be UTF-16LE instead.
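
When you are not sure about the byte order, a quick sketch like this decodes the same sample both ways so you can compare the results by eye. The sample bytes are the ones from above; ascii() is used so the output is printable on any console:

```python
sample = b"\x00(\x00A\x04>\x042\x04<\x045\x04A"

as_big_endian = sample.decode("utf-16-be")
as_little_endian = sample.decode("utf-16-le")

# ascii() escapes non-ASCII characters, so this works on any terminal.
print("BE:", ascii(as_big_endian))
print("LE:", ascii(as_little_endian))
```

With this particular sample, the big-endian reading gives Latin and Cyrillic characters, while the little-endian reading gives unrelated code points. If the real data is UTF-16LE, the sample was probably cut one byte off from a character boundary.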

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 through 3.2, where the u prefix is a syntax error), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1 (backslashes included), not the two bytes they stand for. Those are 8 bytes and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
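
To illustrate what the file should contain when the conversion is done properly, here is a small sketch that lets Python play the role of the editor: it writes text with an explicit UTF-8 encoding, then reads the file back both as raw bytes and as text. The function name roundtrip is hypothetical, and a temporary file is used instead of f2:

```python
import io
import os
import tempfile


def roundtrip(text):
    """Write text as UTF-8, then return (raw bytes on disk, decoded text).

    roundtrip is a hypothetical helper name.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with io.open(path, "rb") as f:
            raw = f.read()
        with io.open(path, "r", encoding="utf-8") as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded


raw_bytes, decoded_text = roundtrip(u"Capit\xe1n")
```

Here the accented character occupies two bytes on disk (c3 a1) and comes back as a single character after decoding.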

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
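
If you need this in more than one place, the chain can be wrapped in a small function. This is only a sketch: the name unescape_utf8_literal is made up, and it assumes the input contains nothing but ASCII characters (the first step raises UnicodeEncodeError otherwise):

```python
def unescape_utf8_literal(text):
    """Turn text containing literal \\xNN byte escapes into the intended string.

    unescape_utf8_literal is a hypothetical helper name. The input must be
    pure ASCII, and the escaped bytes must form valid UTF-8.
    """
    raw = text.encode("ascii")                # str -> bytes, escapes still literal
    as_chars = raw.decode("unicode_escape")   # \xc3 becomes the character U+00C3
    as_bytes = as_chars.encode("latin-1")     # map U+00C3 back to the byte 0xC3
    return as_bytes.decode("utf-8")           # interpret the bytes as UTF-8


result = unescape_utf8_literal('Capit\\xc3\\xa1n\n')
```

The latin-1 step works because that codec maps each of the first 256 code points to the byte with the same value.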

