Setting the Correct Encoding When Piping Stdout in Python

Setting the correct encoding when piping stdout in Python

Your code works when the script is run directly in a terminal because Python encodes the output using whatever encoding your terminal application is using. If you are piping, Python cannot detect a terminal encoding, so you must encode the output yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)
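
The example above is Python 2. As a rough sketch of the same decode, process, encode pattern in Python 3, the conversion can work on byte streams directly. The helper name convert_stream and the in-memory demo streams are assumptions made for illustration, and the sketch assumes the input really is ISO-8859-1.

```python
import io


def convert_stream(src, dst, in_encoding="iso8859-1", out_encoding="utf-8"):
    """Read byte lines from src, uppercase them as text, write bytes to dst.

    convert_stream is a hypothetical helper name, used for illustration only.
    """
    for raw_line in src:
        # Decode what you receive:
        line = raw_line.decode(in_encoding)

        # Work with Unicode internally:
        line = line.upper()

        # Encode what you send:
        dst.write(line.encode(out_encoding))


# In-memory byte streams stand in for stdin and stdout in this demo.
source = io.BytesIO("åäö\n".encode("iso8859-1"))
sink = io.BytesIO()
convert_stream(source, sink)
```

To use it as a real filter, one would pass sys.stdin.buffer and sys.stdout.buffer, which expose the underlying byte streams in Python 3.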

Setting the system default encoding is a bad idea, because some modules and libraries you use may rely on it being ASCII. Don't do it.

How to set sys.stdout encoding in Python 3?

Since Python 3.7 you can change the encoding of standard streams with reconfigure():

sys.stdout.reconfigure(encoding='utf-8')

You can also modify how encoding errors are handled by adding an errors parameter.
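
As a small sketch of the errors parameter, the snippet below calls reconfigure() on an io.TextIOWrapper, which is the same class as sys.stdout in Python 3.7 and later. The in-memory buffer standing in for a piped stdout is an assumption made to keep the example self-contained.

```python
import io

# An in-memory byte buffer stands in for a piped stdout (assumption for the demo).
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="utf-8")

# Same call shape as sys.stdout.reconfigure(...), with an explicit error handler:
# characters that ASCII cannot represent are written as backslash escapes
# instead of raising UnicodeEncodeError.
stream.reconfigure(encoding="ascii", errors="backslashreplace")
stream.write("åäö")
stream.flush()

written = raw.getvalue()
```

Other standard error handlers such as 'replace' or 'ignore' can be passed the same way.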

Setting stdout UTF8 encoding with Python3

After further research, the solution is to use SetEnv PYTHONIOENCODING utf8 in .htaccess files, as detailed here: mod_cgi + utf8 + Python3 produces no output.

For other processes it might be interesting to put PYTHONIOENCODING=utf8 in /etc/environment for persistence (not sure if it does the job for all processes that could call a Python script).
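
One way to check that the variable reaches a given process is to start a child interpreter with it set and ask what encoding it sees. This is only a sketch; the helper name child_stdout_encoding is hypothetical, and the child's stdout is a pipe here, which is the case under discussion.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Return sys.stdout.encoding as reported by a child Python whose stdout is a pipe."""
    # Copy the current environment and add PYTHONIOENCODING for the child only.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


reported = child_stdout_encoding("latin-1")
# Codec names have several aliases, so normalise before comparing.
normalised = codecs.lookup(reported).name
```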

python failure on output redirect/pipe

Try adding these three lines of code right at the beginning of the script (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding('utf8')

I had the same problem. It is caused by accented letters (or other non-ASCII characters) in the strings: it seems that your variable item["text"] is UTF-8 encoded.
I tried various encoding and decoding methods offered by some libraries.
The only solution that proved effective in my case is the one shown above.
I hope it works for you too.

Why is there a unicode error only when piping the output of a python script?

Python knows how to handle encoding inside your program because it uses whatever encoding your terminal application is using.

When you are sending (piping) your output out, it needs to be encoded. This is because a pipe carries a stream of bytes between applications. Every pipe is a unidirectional channel, where one side writes data and the other side reads it.

Using pipes or redirections, you are sending data to a file descriptor, which is then read by another application.

So you need to make sure Python correctly encodes the data before it sends it out, and then the receiving program needs to decode it before processing.
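
To make the "pipes carry bytes" point concrete, here is a minimal Python 3 sketch that pushes text through an operating-system pipe within a single process. The helper name send_through_pipe is hypothetical, and the payload is kept short so the write cannot block on a full pipe buffer.

```python
import os


def send_through_pipe(text, encoding="utf-8"):
    """Encode text, send it through an OS pipe, and decode it on the other side."""
    read_fd, write_fd = os.pipe()

    # The writing side must turn text into bytes before sending.
    with os.fdopen(write_fd, "wb") as writer:
        writer.write(text.encode(encoding))

    # The reading side receives raw bytes and has to decode them itself.
    with os.fdopen(read_fd, "rb") as reader:
        raw = reader.read()

    return raw, raw.decode(encoding)


raw_bytes, decoded = send_through_pipe("bibliothèque")
```

If the two sides disagree about the encoding, the bytes still arrive intact but the decoded text is wrong or decoding fails.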

You also might find this question useful

Update: I'll try to elaborate on encoding. What I mean by the first line of my answer is that, because your Python interpreter uses a specific encoding, it knows how to translate the hex values (the actual bytes) into symbols.

My interpreter doesn't; if I try to create a string from your text - I get an error:

>>> s = 'bibliothèque'
Unsupported characters in input

This is because my interpreter uses a different encoding.

Your shell uses a different encoding than the Python interpreter. When Python sends data out of your program, it uses the default encoding: ASCII. It can't translate your special character (which is represented by the hex value \xe8) using ASCII. So you have to specify which encoding Python should use when sending it.
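
The failure can be reproduced directly, without any pipe. The sketch below uses Python 3 string methods; the variable names are only for illustration.

```python
text = "bibliothèque"

# ASCII has no code for "è", so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encodings that do cover the character produce different bytes for it.
latin1_bytes = text.encode("latin-1")  # "è" becomes the single byte \xe8
utf8_bytes = text.encode("utf-8")      # "è" becomes the two bytes \xc3\xa8
```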

You might be able to overcome this if you change your shell encoding - check this question on SO.

PS - There's a great video by Ned Batchelder about Unicode on YouTube. Maybe this will shed some more light on the subject.

UnicodeEncodeError if piping output to wc -l

First, let me point out that this is not a problem in Python 3, and fixing it is in fact one of the reasons that it was worth a compatibility-breaking change to the language in the first place. But I'll assume you have a good reason for using Python 2, and can't just upgrade.

The proximate cause here (assuming you're using Python 2.7 on a POSIX platform—things can be more complicated on older 2.x, and on Windows) is the value of sys.stdout.encoding. When you start up the interpreter, it does the equivalent of this pseudocode:

if isatty(stdoutfd):
    sys.stdout.encoding = parse_locale(os.environ.get('LC_CTYPE'))
else:
    sys.stdout.encoding = None

And every time you write to a file, including sys.stdout, including implicitly from a print statement, it does something like this:

if isinstance(s, unicode):
    if self.encoding:
        s = s.encode(self.encoding)
    else:
        s = s.encode(sys.getdefaultencoding())

The actual code does standard POSIX stuff looking for fallbacks like LANG, and hardcodes a fallback to UTF-8 in some cases for Mac OS X, etc., but this is close enough.

This is only sparsely documented, under file.encoding:

The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.
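
The rule described above can be imitated in a few lines. This is a sketch written in Python 3 syntax, not the interpreter's real code; the function name encode_for_write is hypothetical, and "ascii" is assumed as the system default encoding, as it is in a standard Python 2.7 setup.

```python
def encode_for_write(text, stream_encoding, default_encoding="ascii"):
    """Mimic the Python 2 rule: use the stream's encoding if set, else the default."""
    encoding = stream_encoding or default_encoding
    return text.encode(encoding)


# Terminal case: the stream advertises an encoding, so the write succeeds.
tty_bytes = encode_for_write("è", "UTF-8")

# Piped case: the encoding is None, so the ASCII default is used and fails.
try:
    encode_for_write("è", None)
    piped_failed = False
except UnicodeEncodeError:
    piped_failed = True
```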

To verify that this is your problem, try the following:

$ python -c 'print __import__("sys").stdout.encoding'
UTF-8
$ python -c 'print __import__("sys").stdout.encoding' | cat
None

To be extra sure this is the problem:

$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
Latin-1
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
Latin-1
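
The same check can be written as a small Python 3 helper that reports whether a stream is a terminal and which encoding it advertises. The name describe_stream is hypothetical, and the in-memory stream only stands in for a pipe so the snippet runs on its own.

```python
import io


def describe_stream(stream):
    """Return (is_a_terminal, encoding) for a text stream."""
    return stream.isatty(), getattr(stream, "encoding", None)


# A text wrapper over an in-memory buffer is not a terminal, like a pipe.
fake_pipe = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
pipe_info = describe_stream(fake_pipe)
```

Calling describe_stream(sys.stdout) in a script, once directly and once with "| cat" appended, shows the difference between the two situations on a real system.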

So, how do you fix this?

Well, the obvious way is to upgrade to Python 3.6, where you'll get UTF-8 in both cases, but I'll assume there's a reason you're using Python 2.7 and can't easily change it.

The right solution is actually pretty complicated. But if you want a quick-and-dirty solution that works for your system, and for most current Linux and Mac systems with standard Python 2.7 setups (even though it may be disastrously wrong for older Linux systems, older Python 2.x versions, and weird setups), you can do one of the following:

Set the environment variable PYTHONIOENCODING to override the detection and force UTF-8. Setting this in your profile or similar may be worth doing if you know that every terminal and every tool you're ever going to use from this account is UTF-8, although it's a terrible idea if that isn't true.
Check sys.stdout.encoding and wrap it with a 'UTF-8' encoding if it's None.
Explicitly .encode('UTF-8') on everything you print.
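
The second option might look roughly like the sketch below. The helper name ensure_utf8_writer is hypothetical, and the demo uses in-memory streams in Python 3 syntax; in Python 2.7 the analogous step would be rebinding sys.stdout to the wrapped stream when sys.stdout.encoding is None.

```python
import codecs
import io


def ensure_utf8_writer(stream):
    """Wrap stream in a UTF-8 encoder if it does not advertise an encoding."""
    if getattr(stream, "encoding", None) is None:
        # codecs.getwriter returns a StreamWriter class that encodes on write.
        return codecs.getwriter("utf-8")(stream)
    return stream


# A byte sink with no encoding, standing in for a piped stdout.
byte_sink = io.BytesIO()
writer = ensure_utf8_writer(byte_sink)
writer.write("åäö")

# A stream that already has an encoding is returned unchanged.
text_sink = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
same = ensure_utf8_writer(text_sink)
```

Hardcoding UTF-8 this way carries the same caveat as the options above: it is only right when everything reading the output expects UTF-8.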
