How to Direct Output to a File When There Are UTF-8 Characters

How do I direct output to a file when there are UTF-8 characters?

Without modifying the script, you can just set the environment variable PYTHONIOENCODING=utf8 and Python will assume that encoding when redirecting to a file.
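For example, a minimal sketch (the script name example.py and the output file name are just illustrations):

# example.py
# -*- coding: UTF-8 -*-
# Run as:  PYTHONIOENCODING=utf8 python example.py > out.txt
print(u'中文')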

References:

https://docs.python.org/2.7/using/cmdline.html#envvar-PYTHONIOENCODING
https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONIOENCODING

How to redirect utf-8 output into txt file

I've found the solution. Just add one line at the top of the Python script:

# -*- coding: UTF-8 -*-

For example, a simple script named utf8.py:

# -*- coding: UTF-8 -*-

if __name__ == '__main__':
    s = u'中文'
    print s.encode('utf-8')

Then redirect the print to a txt file:

[zfz@server tmp]$ python utf8.py >> utf8.txt
[zfz@server tmp]$ cat utf8.txt
中文

The Chinese characters can be output to the txt file correctly.

Issues when writing UTF-8 characters to stdout using WriteFile

It looks like your bytes are being interpreted as ASCII instead. The character ñ in UTF-16 has a hex encoding of 0x00F1. 0xF1 corresponds to ± in ASCII codepage 437. Same is true of the other characters that are printed. It looks like the bytes, as defined by your use of UTF-16 literal, are not lost, but are rather interpreted as single ASCII bytes 0xF1 0x00 etc. by the stream.

See related post here: How to Output Unicode Strings on the Windows Console

That post says that you should use WriteConsoleW instead. The arguments for that function are the same as for WriteFile, except that str is expected to be UTF-16:

DWORD dwToWrite, dwWritten;
dwToWrite = wcslen(str);
HANDLE hParentStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
BOOL bSuccess = WriteConsoleW(hParentStdOut, str, dwToWrite, &dwWritten, NULL);

Write to UTF-8 file in Python

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
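For reference, a minimal sketch of that alternative, reusing the file name "lol" from above: the "utf-8-sig" codec writes the BOM for you, so you only write the actual content:

import codecs

file = codecs.open("lol", "w", "utf-8-sig")
file.write(u"Hello")  # the BOM (EF BB BF) is written automatically before this
file.close()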

Dotted characters in cmd file output

First of all, the characters you need are not available by default, so please run: chcp 65001 - this will change the encoding to UTF-8.

Then you can run your command: dir /s /b /o:n /a:d > folderlist.txt.

If you need to view the output in the cmd window, change the default font to Lucida Console.


Before the OP clarified what he was after, I managed to come up with this:

Assuming your filesystem is of reasonable size, you can use the tree command, which produces an indented tree structure for you. You can also tell it to use ASCII characters with the /A flag. Try it out!

cmd: tree /A ["directory path"] > "DRIVE_LETTER:\PATH_TO_EXPORT\tree.txt"

/A - ASCII characters instead of extended characters.

/F - Display the names of the files in each folder.

Note that the "directory path" is optional, but it seems like you want to use it in this case (insert the path of your starting-folder).

If you can't use tree (for simple output reasons), you can use:

DIR YOUR_DRIVE_LETTER:\ /B /S > "DRIVE_LETTER:\PATH_TO_EXPORT\tree.txt"

/B flag = bare format (combined with /S, each entry is listed with its full path).

powershell: Get-ChildItem | tree > "DRIVE_LETTER:\PATH_TO_EXPORT\tree.txt"

How does file reading work in utf-8 encoding?

Assuming you're using io.open() (not the same as the builtin open()), using text mode gets you an instance of io.TextIOWrapper, so this should answer your question:

Text I/O over a binary storage (such as a file) is significantly slower than binary I/O over the same storage, because it implies conversions from unicode to binary data using a character codec. This can become noticeable if you handle huge amounts of text data (for example very large log files). Also, TextIOWrapper.tell() and TextIOWrapper.seek() are both quite slow due to the reconstruction algorithm used.

NOTE: You should also be aware that this still doesn't guarantee that seek() will skip over characters, but rather over Unicode code points (a single character can be composed of more than one code point; for example ą can be written as u'\u0105' or u'a\u0328' - both will print the same character).

Source: http://docs.python.org/library/io.html#id1
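As a small sketch of that note (Python 2 syntax, to match the rest of this answer):

# -*- coding: UTF-8 -*-
precomposed = u'\u0105'   # "ą" as a single code point
combining = u'a\u0328'    # "a" followed by a combining ogonek; displays the same

print precomposed == combining          # False: different code point sequences
print len(precomposed), len(combining)  # 1 2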

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
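If you are producing the file from Python rather than an editor, a minimal sketch of the same idea (reusing the file name 'f2' from the example above) is to write the actual character and let the codec produce the UTF-8 bytes:

# Python 2.x
import codecs

f = codecs.open('f2', 'w', encoding='utf-8')
f.write(u'Capit\xe1n\n')  # the codec writes the bytes C3 A1 for \xe1
f.close()

f = codecs.open('f2', 'r', encoding='utf-8')
print f.read()            # Capitán
f.close()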

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
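For instance, a small sketch of that second step:

# Python 2.x
>>> 'Capit\\xc3\\xa1n\n'.decode('string_escape').decode('utf-8')
u'Capit\xe1n\n'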

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'

How can I change the Standard Out to UTF-8 in Java

The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":

PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);

Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.

UnicodeDecodeError when redirecting to file

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) string of characters, and (2) string/array of bytes.

This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system).

The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
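As a small sketch of these two operations in Python 3 (the byte values shown are what UTF-8 defines for this character):

# Python 3.x
>>> "é".encode('utf-8')          # encoding: characters -> bytes
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('utf-8')  # decoding: bytes -> characters
'é'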

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).


Some more information on Unicode, characters and code points, if you are interested:

Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these code points can be combined together to form a "grapheme cluster".
Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
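A small sketch of that point, using the standard library's unicodedata module:

# Python 3.x
>>> import unicodedata
>>> s = "\u1100\u1161\u11a8"
>>> len(s)                                # 3 code points, 1 user-perceived character
3
>>> len(unicodedata.normalize('NFC', s))  # NFC composes them into the single code point U+AC01
1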

In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
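As a sketch of what that failure looks like, encoding a character outside the target encoding raises the same exception you would see when printing it to a terminal that uses that encoding (the exact message can vary between Python versions):

# Python 2.x
>>> u'\u0bc3'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0bc3' in position 0: ordinal not in range(128)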

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, you can check that the UTF-8 data you put in manually is correct; for instance, your first character (\u001A) is not printable, if I'm not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
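As a rough sketch of the Python 3 equivalent (not taken from that question; sys.stdout.reconfigure() assumes Python 3.7+, while wrapping sys.stdout.buffer works on earlier 3.x versions):

import io
import sys

# Python 3.7+: change the encoding of the existing stream in place
sys.stdout.reconfigure(encoding='utf-8')

# Older Python 3: wrap the underlying binary buffer instead
# sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

print("\u0BC3\u1451\U0001D10C")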


