Reading Unicode File Data with BOM Chars in Python

Reading Unicode file data with BOM chars in Python

There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM is absent:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM might appear in the files you are reading, just use utf-8-sig and don't worry about it.
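The same applies when reading from a file: pass encoding='utf-8-sig' to open() and any leading BOM is stripped for you. A small sketch (example.txt is a hypothetical file created here for demonstration):

```python
# Create a file that starts with a UTF-8 BOM (hypothetical name "example.txt")
with open("example.txt", "wb") as f:
    f.write(b"\xef\xbb\xbfhello")

# utf-8-sig strips the BOM if present, and behaves like utf-8 if it is absent
with open("example.txt", encoding="utf-8-sig") as f:
    print(f.read())  # hello -- no leading '\ufeff'
```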

Adding BOM (Unicode signature) while saving file in Python

Write it directly at the beginning of the file:

file_new.write('\ufeff')
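Alternatively, opening the file with encoding='utf-8-sig' makes Python emit the signature for you on the first write. A sketch with hypothetical file names, showing both approaches producing identical bytes:

```python
# Option 1: write the BOM (U+FEFF) yourself at the start of the file
with open("with_bom_manual.txt", "w", encoding="utf-8") as file_new:
    file_new.write("\ufeff")
    file_new.write("hello")

# Option 2: the utf-8-sig codec emits the BOM automatically on the first write
with open("with_bom_codec.txt", "w", encoding="utf-8-sig") as file_new:
    file_new.write("hello")

# Both files start with the same three bytes: EF BB BF
```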

Detect Byte Order Mark (BOM) in Python

The simple answer is: read the first 4 bytes and look at them.

with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these if-statements is important,
    # otherwise UTF-32 LE may be detected as UTF-16 LE as well
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")

The not-so-simple answer is:

A binary file may happen to begin with bytes that look like a BOM, so this check can produce false positives.

Other than that, text files without a BOM can typically be treated as UTF-8.
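The check above can be wrapped into a helper that returns a codec name usable with open(). This is a sketch: the BOM constants come from the stdlib codecs module, while detect_codec and the UTF-8 fallback are my own naming and choice, following the note that BOM-less text files are usually UTF-8:

```python
import codecs

def detect_codec(path):
    """Sniff a BOM and return a codec name suitable for open()."""
    with open(path, "rb") as f:
        beginning = f.read(4)
    # Longer signatures first, so UTF-32 LE is not mistaken for UTF-16 LE
    if beginning.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"       # the utf-32 codec consumes the BOM itself
    if beginning.startswith(codecs.BOM_UTF32_LE):
        return "utf-32"
    if beginning.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"    # strips the BOM on decode
    if beginning.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"       # utf-16 reads endianness from the BOM
    if beginning.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "utf-8"            # no BOM: assume plain UTF-8
```

Returning "utf-16"/"utf-32" (rather than the endian-specific variants) lets those codecs consume the BOM themselves during decoding, so no stray '\ufeff' ends up in the text.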

Unicode (UTF-8) reading and writing to files in Python

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal escape text \xc3\xa1 (backslash, x, c, 3, and so on). Those are 8 bytes, and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
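Unpacking that one-liner step by step may make it clearer (a sketch; the intermediate variable names are mine):

```python
# Python 3.x -- the chain above, decomposed
s = 'Capit\\xc3\\xa1n\n'          # literal backslash-escape text, as read from the file
b1 = s.encode('ascii')            # to bytes, still containing literal '\', 'x', 'c', '3', ...
s1 = b1.decode('unicode_escape')  # escapes interpreted: '\xc3' -> U+00C3, '\xa1' -> U+00A1
b2 = s1.encode('latin-1')         # those code points map 1:1 back to the bytes C3 A1
result = b2.decode('utf-8')       # C3 A1 is the UTF-8 encoding of 'á'
print(result)  # Capitán
```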

Python: How to translate UTF8 String containing unicode decoded characters (Ok\u00c9 to Oké)

I've found the problem: the encoding/decoding was wrong. The text came in as Windows-1252 encoding.

I used

import chardet
chardet.detect(var3.encode())

to detect the proper encoding, and then did a

var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')

conversion to eventually get it into the right format!
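More generally, this kind of mojibake arises when UTF-8 bytes are decoded with the wrong codec. The usual repair (a sketch, not necessarily the author's exact data flow) is to re-encode with the wrong codec and decode with the right one:

```python
# UTF-8 bytes mistakenly decoded as Windows-1252 produce mojibake
raw = 'Oké'.encode('utf-8')            # b'Ok\xc3\xa9' -- the real data
garbled = raw.decode('windows-1252')   # 'OkÃ©' -- wrong codec applied
fixed = garbled.encode('windows-1252').decode('utf-8')
print(fixed)  # Oké
```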


