UTF-8 and os.listdir()

The OS X filesystem mostly uses decomposed characters rather than their combined form. You'll need to normalise the filenames back to the NFC combined normalised form:

import os
import unicodedata
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]

Because os.listdir() is given a unicode path here, it returns the filenames as unicode; otherwise you'd need to decode each bytestring to unicode first.

Also see the unicodedata.normalize() function documentation.
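
To illustrate what the normalisation step does, here is a minimal Python 3 sketch. The filename is made up for the example, and the NFD call only simulates what a decomposing filesystem might hand back:

```python
import unicodedata

name = 'caf\xe9.txt'                             # composed: é is the single code point U+00E9
decomposed = unicodedata.normalize('NFD', name)  # simulated: e followed by combining U+0301

print(len(name), len(decomposed))                        # the decomposed form is one code point longer
print(name == decomposed)                                # False: different code point sequences
print(unicodedata.normalize('NFC', decomposed) == name)  # True once normalised back to NFC
```

The two strings render identically, which is why the mismatch is easy to miss when comparing names from os.listdir() against names from another source.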

Python unicode os.listdir() not returning results from API

The difference is that your file system uses decomposed Unicode characters. If you normalize the returned filenames to composed Unicode characters, it will work. \xe9 is the Unicode character é, and e\u0301 is an ASCII e followed by a combining acute accent:

>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC',u'Ame\u0301lie')
True

So use:

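import os
import unicodedata as ud

# tmdb is assumed to be imported already, as in the question's code
print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    # Normalize to the composed (NFC) form before querying the API
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)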

UnicodeEncodeError when using os.listdir

UnicodeEncodeError indicates that the error occurs when you try to print the filenames. If os.listdir() itself had a problem, you would see a UnicodeDecodeError (Decode, not Encode).

Because you use a Unicode pathname, os.listdir() returns filenames that are already decoded; on Windows the filesystem uses UTF-16 to encode filenames and those are easily decoded in Python (sys.getfilesystemencoding() tells Python what codec to use).

However, the Windows console uses a different encoding; in your case gbk, and that codec cannot display all the different characters that UTF-16 can encode.

Look for a print() call in your code. You could perhaps use print(filename.encode('gbk', errors='replace')) to print the filenames instead; unprintable characters will be replaced by a question mark.

Alternatively, you could use b'F:\\music' as the path and work with raw bytestrings instead of Unicode.
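
The errors='replace' idea can be sketched as a small Python 3 helper. The name printable is hypothetical, and 'ascii' stands in for the console codec (gbk in the question) so that the effect is easy to see:

```python
import sys

def printable(name, encoding=None):
    """Return name with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to the console encoding, or ascii if none is reported
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return name.encode(encoding, errors='replace').decode(encoding)

# Neither é nor the euro sign exists in ascii, so both become '?'
print(printable('caf\xe9 \u20ac.mp3', encoding='ascii'))
```

Decoding back to text after the lossy encode keeps the result a normal string, so print() shows the name rather than a bytes repr.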

listdir doesn't print non-english letters correctly

Solved it: adding # -*- coding: utf-8 -*- at the top of the file fixed the problem.

UnicodeDecodeError when performing os.walk

This problem stems from two fundamental issues. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux encoding is 'utf8'. You can verify these encodings via:

import sys
sys.getdefaultencoding()      # Python's default encoding
sys.getfilesystemencoding()   # encoding used for filenames on this OS

When os module functions that return directory contents, namely os.walk and os.listdir, return a list containing both ascii-only and non-ascii filenames, the ascii-encoded filenames are converted automatically to unicode. The others are not. Therefore, the result is a list containing a mix of unicode and str objects. It is the str objects that can cause problems down the line. Since they are not ascii, Python has no way of knowing what encoding to use, and therefore they can't be decoded automatically into unicode.

Therefore, a common operation such as os.path.join(dir, file), where dir is unicode and file is an encoded str, will fail if file is not ascii-encoded (the default). The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.

That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you probably have the correct encoding. You can verify further by calling print on the unicode version and checking that the correct filename is rendered.
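
As a minimal Python 3 sketch of this check (the byte string is a made-up example of a name written by a Windows system), the same bytes fail to decode as utf8 but decode cleanly as windows-1252:

```python
raw = b'caf\xe9.txt'   # 'café.txt' as encoded by windows-1252

try:
    raw.decode('utf8')
    is_utf8 = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte utf8 sequence, but '.' is not a valid continuation byte
    is_utf8 = False

print(is_utf8)
print(raw.decode('windows-1252'))
```

Note that windows-1252 accepts nearly any byte sequence, so a successful decode alone proves little; looking at the rendered name is the real test.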

One last wrinkle. On a Linux system with files of Windows origin, it is possible or even probable to have a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable one is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

def decodeName(name):
    if type(name) == str:  # leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            # not valid utf8, so fall back to windows-1252
            name = name.decode('windows-1252')
    return name

The function tries a utf8 decoding first. If that fails, it falls back to windows-1252. Use this function after an os call that returns a list of files:

for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
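
The code above targets Python 2. A rough Python 3 equivalent, offered only as a sketch, walks with a bytes path so that names come back undecoded and applies the same utf8-then-windows-1252 fallback. The names decode_name and walk_decoded are hypothetical:

```python
import os

def decode_name(name):
    """Decode a bytes filename, trying utf8 first, then windows-1252."""
    if isinstance(name, bytes):
        try:
            return name.decode('utf8')
        except UnicodeDecodeError:
            return name.decode('windows-1252')
    return name  # already text, leave it alone

def walk_decoded(top):
    """Yield (root, files) pairs with every name decoded to str."""
    # A bytes path makes os.walk return bytes names
    for root, dirs, files in os.walk(os.fsencode(top)):
        yield decode_name(root), [decode_name(f) for f in files]
```

Usage would look like `for root, files in walk_decoded('/some/dir'): ...`, where the path is a placeholder.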

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

http://farmdev.com/talks/unicode/

I highly recommend it for anyone struggling with unicode issues.

Python: How do I interact with unicode filenames on Windows? (Python 2.7)

You may be getting ?s while displaying that filename in the console using os.listdir() but you can access that filename without any problems as internally everything is stored in binary. If you are trying to copy the filename and paste it directly in python, it will be interpreted as mere question marks...

If you want to open that file and perform any operations, then, have a look at this...

import os
files = os.listdir(".")

# Possible output:
# ["a.txt", "file.py", ..., "??.html"]

filename = files[-1] # The last file in this case
f = open(filename, 'r')

# Sample file operation

lines = f.readlines()
print(lines)
f.close()

EDIT:

In Python 2, you need to pass current path as Unicode which could be done using: os.listdir(u'.'), where the . means current path. This will return the list of filenames in Unicode...
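
The same rule carries over to Python 3, with str and bytes in place of unicode and str: the type of the path decides the type of the returned names. A minimal Python 3 sketch, using a temporary directory and a made-up filename:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listing is not empty
    open(os.path.join(tmp, 'a.txt'), 'w').close()

    as_text = os.listdir(tmp)                # str path -> str names
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes path -> bytes names

print(as_text, as_bytes)
```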

How to convert filename with invalid UTF-8 characters back to bytes?

Use an error handler; in this case the surrogateescape error handler looks appropriate:

Value: 'surrogateescape'
Meaning: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)

The os.fsencode() utility function uses this handler; it encodes to sys.getfilesystemencoding() using the surrogateescape error handler when applicable for your OS:

Encode filename to the filesystem encoding with 'surrogateescape' error handler, or 'strict' on Windows; return bytes unchanged.

In reality it uses 'strict' only when the filesystem encoding is mbcs, a codec only available on Windows (see the os module source).

Demo:

>>> import sys
>>> ld = ['\udc80']
>>> [fn.encode(sys.getfilesystemencoding()) for fn in ld]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>>> [fn.encode(sys.getfilesystemencoding(), 'surrogateescape') for fn in ld]
[b'\x80']
>>> import os
>>> [os.fsencode(fn) for fn in ld]
[b'\x80']
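
To show the full round trip, here is a minimal sketch that names the codec explicitly instead of relying on sys.getfilesystemencoding(), so it behaves the same on any platform. The byte string is a made-up filename:

```python
raw = b'report-\x80.txt'   # 0x80 is not valid utf-8 on its own

# Decoding maps the undecodable byte to the lone surrogate U+DC80
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))

# Encoding with the same handler restores the original byte
back = text.encode('utf-8', 'surrogateescape')
print(back == raw)
```

This is what lets a name that is not valid in the filesystem encoding survive being handled as a str and still be passed back to the OS unchanged.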

