Redirecting Python's Stdout to a File Fails with UnicodeEncodeError

Redirecting python's stdout to the file fails with UnicodeEncodeError

Since nobody's jumped in yet, here's my shot. In Python 2, stdout gets an encoding when it is attached to a console, but not when it is redirected to a file. This script reproduces the problem:

import sys

msg = {'text':u'\2026'}
sys.stderr.write('default encoding: %s\n' % sys.stdout.encoding)
print msg['text']

Running it with stdout redirected to a file shows the error:

$ python bad.py >/tmp/xxx
default encoding: None
Traceback (most recent call last):
  File "bad.py", line 5, in <module>
    print msg['text']
UnicodeEncodeError: 'ascii' codec can't encode character u'\x82' in position 0: ordinal not in range(128)

Adding an explicit encoding to the above script (falling back to UTF-8 when sys.stdout.encoding is None):

import sys

msg = {'text':u'\2026'}
sys.stderr.write('default encoding: %s\n' % sys.stdout.encoding)
encoding = sys.stdout.encoding or 'utf-8'
print msg['text'].encode(encoding)

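and the problem is solved. (The literal u'\2026' is parsed as the octal escape \202, i.e. u'\x82', followed by the digit 6, which is why the traceback names u'\x82' and the file shows 6; the ellipsis character would be written u'\u2026'.)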

$ python good.py >/tmp/xxx
default encoding: None
$ cat /tmp/xxx
6

Redirecting python output to a file causes UnicodeEncodeError on Windows

Python has to write bytes, not strings, to stdout, so it needs an encoding to convert one into the other.

The encoding (used to convert strings into bytes) is determined differently on each platform:

  • on Linux and macOS it comes from the current locale;
  • on Windows it comes from the "Current language for non-Unicode programs" setting (the code page set in the command-line window is irrelevant).

(Thanks to @Eric Leung for precise link)

The follow-up question would be why Python on Windows uses the system locale for non-Unicode programs rather than the code page set by the chcp command, but I will leave that for someone else.

It should also be mentioned that there is a checkbox titled "Beta: Use Unicode UTF-8..." in Region Settings on Windows 10 (to open it, press Win+R and type intl.cpl). With the checkbox ticked, the above example works without error. However, it is off by default and buried deep in the system settings.
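To see which encoding Python actually picks on a given machine, a small diagnostic along these lines may help. This is only a sketch for Python 3; the function name report_encodings is made up for illustration and is not part of any library.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python reports for text I/O on this machine."""
    return {
        # Encoding attached to the current stdout, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale-derived default that open() normally falls back to.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names, not for file contents.
        "filesystem": sys.getfilesystemencoding(),
    }

if __name__ == "__main__":
    # Write to stderr so the report stays visible when stdout is redirected.
    for name, value in sorted(report_encodings().items()):
        sys.stderr.write("%s: %s\n" % (name, value))
```

Running it once on the console and once with stdout redirected shows whether the two cases differ on your system.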

Error occurs when trying to redirect Python UTF-8 stdout to a file on Windows

When redirecting I/O, Python uses a default encoding on Windows (cp1252 on US Windows), but it will consult an environment variable if you want to override it:

C:\> set PYTHONIOENCODING=utf8
C:\> test.py > out.txt

Since Python 3.7, set PYTHONUTF8=1 (UTF-8 mode) will also make Python default to UTF-8 for files and I/O redirection.
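If you want to check the effect of these variables without touching your shell, something like the following should work on Python 3.7+. It is a sketch, not a definitive recipe: child_stdout_encoding is a hypothetical helper that starts a child interpreter with a piped stdout and reports what that child sees.

```python
import codecs
import os
import subprocess
import sys

def child_stdout_encoding(extra_env):
    """Run a child Python with a piped stdout and report its sys.stdout.encoding."""
    env = dict(os.environ)
    # Drop any inherited settings so only extra_env influences the child.
    env.pop("PYTHONIOENCODING", None)
    env.pop("PYTHONUTF8", None)
    env.update(extra_env)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalize spellings such as 'utf8' or 'UTF-8' to the codec's canonical name.
    return codecs.lookup(out.decode("ascii").strip()).name

if __name__ == "__main__":
    print(child_stdout_encoding({"PYTHONIOENCODING": "utf8"}))
    print(child_stdout_encoding({"PYTHONUTF8": "1"}))
```

The child's stdout is a pipe here, which is the same situation as redirecting to a file.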

UnicodeEncodeError when redirecting stdout

In Python 2, pipes that don't lead to the terminal don't have an encoding (sys.stdout.encoding is None), so you'll need to check sys.stdout.isatty() and encode the text yourself if needed.
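One way to express that check is sketched below. The helper encode_for_stream and the FakePipe class are invented for this example; FakePipe just mimics a redirected Python 2 stdout that reports no encoding, and UTF-8 as the fallback is an assumption you may want to change.

```python
import sys

def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text for a stream, falling back when the stream has no encoding."""
    # On Python 2, stream.encoding is None for pipes and redirected files.
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)

class FakePipe(object):
    """Stand-in for a redirected stdout that reports no encoding."""
    encoding = None

    def isatty(self):
        return False

if __name__ == "__main__":
    pipe = FakePipe()
    if not pipe.isatty():
        data = encode_for_stream(u"\u2026", pipe)
        sys.stderr.write("encoded %d bytes for the pipe\n" % len(data))
```

In real code you would pass sys.stdout instead of FakePipe and write the resulting bytes out yourself.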

Redirect stdout to a file with unicode encoding while keeping windows eol in python 2

Option 1

Redirection is a shell operation. You don't have to change the Python code at all, but you do have to tell Python what encoding to use if redirected. That is done with an environment variable. The following code redirects both stdout and stderr to a UTF-8-encoded file:

test.bat

set PYTHONIOENCODING=utf8
python test.py >out.txt 2>&1

test.py

#coding:utf8
import sys
print u"我不喜欢你女朋友!"
print >>sys.stderr, u"你需要一个新的。"

out.txt (encoded in UTF-8)

我不喜欢你女朋友!
你需要一个新的。

Hex dump of out.txt

0000: E6 88 91 E4 B8 8D E5 96 9C E6 AC A2 E4 BD A0 E5
0010: A5 B3 E6 9C 8B E5 8F 8B EF BC 81 0D 0A E4 BD A0
0020: E9 9C 80 E8 A6 81 E4 B8 80 E4 B8 AA E6 96 B0 E7
0030: 9A 84 E3 80 82 0D 0A

Note: You do need to print Unicode strings for this to work. Print byte strings and you get the bytes you print.

Option 2

codecs.open may force binary mode, but codecs.getwriter doesn't. Give it a file opened in text mode:

#coding:utf8
import sys
import codecs
sys.stdout = sys.stderr = codecs.getwriter('utf8')(open('out.txt','w'))
print u"我不喜欢你女朋友!"
print >>sys.stderr, u"你需要一个新的。"

(same output and hexdump as above)
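If you can write the file yourself rather than swapping sys.stdout, io.open (available in Python 2.6+ and identical to open in Python 3) lets you set both the encoding and the line ending. The following is a sketch under that assumption; write_utf8_crlf is a made-up helper name.

```python
# -*- coding: utf-8 -*-
import io

def write_utf8_crlf(path, lines):
    """Write lines as UTF-8 text, ending each one with CRLF on any platform."""
    # newline="\r\n" makes the text layer translate every "\n" to "\r\n".
    with io.open(path, "w", encoding="utf-8", newline="\r\n") as out:
        for line in lines:
            out.write(line + u"\n")
```

Calling write_utf8_crlf("out.txt", [...]) with the two strings from test.py should give a file with UTF-8 bytes and 0D 0A line endings, regardless of the platform default.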

UnicodeEncodeError while redirecting output to file - python 2.7

Problem solved. Turns out the issue wasn't the result from the API but the row['object'] from the df. I wrote a simple function

def force_to_unicode(text):
    return text if isinstance(text, unicode) else text.decode('utf8')

and then I just edited the second for loop:

for result in reverse_result:
    a = force_to_unicode(row['object'])
    b = result['formatted_address']
    print(a, ',', b, file=f)  # write result to csv file
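The helper above relies on the Python 2 unicode type. A variant that should behave the same way on Python 2 and 3 might look like this; to_text is a hypothetical name, and UTF-8 as the default is an assumption carried over from the original function.

```python
def to_text(value, encoding="utf8"):
    """Return value as text, decoding it first if it is a byte string."""
    # On Python 2, bytes is an alias for str, so plain strings are decoded
    # and unicode objects pass through untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```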

UnicodeDecodeError when redirecting to file

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) string of characters, and (2) string/array of bytes. This distinction was mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes). The relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system; such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
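A short illustration of these points, written so that it runs on Python 3 (and should also run on Python 2.7); the sample character é is my own choice, not something from the question.

```python
import unicodedata

text = u"\xe9"  # "é" as a single code point (U+00E9)

# Encoding: characters -> bytes. Different encodings give different bytes.
utf8_bytes = text.encode("utf-8")      # b'\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'\xe9'

# Decoding: bytes -> characters, using the same encoding that produced them.
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot represent this character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same user-perceived character as base letter + combining accent.
decomposed = unicodedata.normalize("NFD", text)  # u'e\u0301'
```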

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).


Some more information on Unicode, characters and code points, if you are interested:

Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"; these code points can be combined together to form a "grapheme cluster".
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
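The Korean example can be checked with the standard unicodedata module, which can compose the three code points into the single precomposed one. A small sketch:

```python
import unicodedata

decomposed = u"\u1100\u1161\u11a8"  # three conjoining jamo code points
# NFC normalization composes them into one precomposed syllable.
composed = unicodedata.normalize("NFC", decomposed)

lengths = (len(decomposed), len(composed))
same_character = composed == u"\uac01"
```

So if you need lengths that are closer to what a user perceives, normalizing to NFC first helps in cases like this one, although it does not cover every grapheme cluster.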

In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
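The same difference can be seen within Python 3 alone by comparing a str with its encoded bytes. The snake emoji U+1F40D below is just an illustrative choice of a character whose UTF-8 encoding starts with the byte 0xF0.

```python
snake = u"\U0001F40D"  # illustrative character outside the Basic Multilingual Plane

# Byte view: the UTF-8 encoding is four bytes long.
encoded = snake.encode("utf-8")
first_byte = encoded[0:1]   # slicing keeps the bytes type

# Character view (Python 3): the string holds a single code point.
first_char = snake[0:1]
```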

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal's encoding, then the terminal is not configured for displaying these characters. Python will complain with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal. The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, check that the characters you entered manually are the ones you intended; for instance, your first character (\u001A) is not printable, if I'm not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
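As one possibility for Python 3.7 and later (my own suggestion, not taken from the wiki page above), text streams have a reconfigure() method that can switch the encoding in place. The sketch below demonstrates it on an in-memory stream; force_utf8 is a hypothetical helper name.

```python
import io

def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (Python 3.7+)."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream

# Demonstration on an in-memory stream that starts out as ASCII-only.
raw = io.BytesIO()
ascii_stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
force_utf8(ascii_stream)
ascii_stream.write(u"\u2026\n")  # would raise UnicodeEncodeError with ASCII
ascii_stream.flush()
demo_bytes = raw.getvalue()
```

In a script you would call force_utf8(sys.stdout) once at startup, before anything is printed.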

Python Script Called in Powershell Fails to Write to Stdout when Piped to File

The root cause lies in the way Python handles stdout. Python does some low-level detection to get the encoding of the system, then creates an io.TextIOWrapper with the encoding set to what it detected; that is what you get in sys.stdout (stderr and stdin work the same way).

This detection returns UTF-8 when running in the shell, because PowerShell works in UTF-8 and puts a layer of translation between the system and the running program. When piping to another program, however, the communication is direct, without the PowerShell translation, and it uses the system's encoding, which for Windows is cp1252 (also known as Windows-1252).

system <(cp1252)> posh <(utf-8)> python # here stdout returns to the shell
system <(cp1252)> posh <(utf-8)> python <(cp1252)> pipe| or redirect> # here stdout moves directly to the next program

As for your issue, without seeing the rest of your program and the input file, my best guess is an encoding mismatch, most likely when reading the input file. By default, Python 3 opens text files with the locale's preferred encoding (locale.getpreferredencoding(False)), which is usually UTF-8 on Linux and macOS but cp1252 on a US Windows system. If the file is in some other encoding, the best case is garbage text and the worst case is an encoding exception.

To solve it, you need to know which encoding your input file was created with. That can be tricky, and automatic detection is usually slow. Another option is to work with the files as bytes, but that may not be possible depending on the processing involved.
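If you only have a short list of plausible encodings, a cheap alternative to full detection is to try them in order on the raw bytes. This is a rough sketch, and decode_with_candidates and its default candidate list are assumptions made for the example.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes data cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Put the strictest encoding (usually UTF-8) first: single-byte encodings such as cp1252 accept almost any input, so a successful decode with them does not prove the guess was right.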

Unicode error when outputting python script output to file

You can use the codecs module to write unicode data to the file:

import codecs

something = u"\u2026"  # placeholder for the unicode text to write
with codecs.open("out.txt", "w", "utf-8") as out_file:
    out_file.write(something)

'print' writes to standard output, and if your console doesn't support UTF-8 it can cause such an error, even if you pipe stdout to a file.


