Reading Unicode file data with BOM chars in Python
There is no reason to check whether a BOM exists or not: utf-8-sig
handles that for you and behaves exactly like utf-8
if the BOM is not present:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig
correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig
and don't worry about it.
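The same applies when reading files: pass the codec to open(). A minimal sketch, with "example.txt" as a placeholder filename:

```python
# Writing with utf-8-sig prepends the BOM; reading with utf-8-sig
# strips it again, so the round trip is transparent.
with open("example.txt", "w", encoding="utf-8-sig") as f:
    f.write("hello")

with open("example.txt", "r", encoding="utf-8-sig") as f:
    print(f.read())  # hello (no BOM character in the result)
```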
Adding BOM (unicode signature) while saving file in python
Write it directly at the beginning of the file:
file_new.write('\ufeff')
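Alternatively, open the file with the utf-8-sig codec and let it prepend the BOM for you. A short sketch assuming placeholder filenames:

```python
# Two equivalent ways to produce a BOM-prefixed UTF-8 file.
with open("out_manual.txt", "w", encoding="utf-8") as file_new:
    file_new.write('\ufeff')   # write the BOM explicitly
    file_new.write('data')

with open("out_codec.txt", "w", encoding="utf-8-sig") as f:
    f.write('data')            # the codec prepends the BOM itself
```

Both files end up byte-for-byte identical.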
Detect Byte Order Mark (BOM) in Python
The simple answer is: read the first 4 bytes and look at them.
with open("utf32le.file", "rb") as file:
    beginning = file.read(4)

# The order of these if-statements is important
# otherwise UTF-32 LE may be detected as UTF-16 LE as well
if beginning == b'\x00\x00\xfe\xff':
    print("UTF-32 BE")
elif beginning == b'\xff\xfe\x00\x00':
    print("UTF-32 LE")
elif beginning[0:3] == b'\xef\xbb\xbf':
    print("UTF-8")
elif beginning[0:2] == b'\xff\xfe':
    print("UTF-16 LE")
elif beginning[0:2] == b'\xfe\xff':
    print("UTF-16 BE")
else:
    print("Unknown or no BOM")
The not-so-simple answer is:
There may be binary files that seem to have a BOM but are really just binary files whose data happens to look like one.
Other than that, you can typically treat text files without a BOM as UTF-8 as well.
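The checks above can be wrapped into a small reusable helper; the function name and the (codec, bom_length) return convention here are our own, not from the original answer:

```python
# Return the codec name and BOM length for a byte prefix,
# or (None, 0) when no BOM is found. Longer BOMs are checked
# first so UTF-32 LE is not mistaken for UTF-16 LE.
def detect_bom(data: bytes):
    boms = [
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),  # before UTF-16 LE
        (b'\xef\xbb\xbf', 'utf-8-sig'),
        (b'\xff\xfe', 'utf-16-le'),
        (b'\xfe\xff', 'utf-16-be'),
    ]
    for bom, codec in boms:
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0

print(detect_bom(b'\xef\xbb\xbfhello'))  # ('utf-8-sig', 3)
```

The returned codec name can be passed straight to bytes.decode() after slicing off the BOM.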
Unicode (UTF-8) reading and writing to files in Python
In the notation u'Capit\xe1n\n'
(should be just 'Capit\xe1n\n'
in 3.x, and must be in 3.0 and 3.1), the \xe1
represents just one character. \x
is an escape sequence, indicating that e1
is in hexadecimal.
Writing Capit\xc3\xa1n
into the file in a text editor means that it actually contains the literal characters \xc3\xa1
(backslash, x, and so on), not the two bytes they name. Those are 8 bytes, and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á
in the editor, which should then handle the conversion to UTF-8 and save it.
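The same idea applies when writing from Python itself: write the real character and let the codec encode it. A sketch with 'f3' as a placeholder filename:

```python
# Write the actual character; the utf-8 codec produces the
# two-byte sequence for á on disk.
with open('f3', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')

# Reading the raw bytes back shows the encoded form:
with open('f3', 'rb') as f:
    print(f.read())  # b'Capit\xc3\xa1n\n'
```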
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape
codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str
that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1
in the original string. To get a unicode
result, decode again with UTF-8.
In 3.x, the string_escape
codec is replaced with unicode_escape
, and it is strictly enforced that we can only encode
from a str
to bytes
, and decode
from bytes
to str
. unicode_escape
needs to start with a bytes
in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3
and \xa1
as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Python: How to translate UTF8 String containing unicode decoded characters (Ok\u00c9 to Oké)
I've found the problem: the encoding/decoding was wrong. The text came in as Windows-1252 encoding.
I used
import chardet
chardet.detect(var3.encode())
to detect the proper encoding, and then did a
var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')
conversion to eventually get it into the right format!
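For reference, the more common direction of this repair (our own sketch, not part of the original answer): 'OKÃ‰' is the mojibake you get when the UTF-8 bytes of 'OKÉ' are mis-decoded as Windows-1252, and it can be undone by reversing the wrong step:

```python
# Mojibake repair: re-encode with the codec that was wrongly used
# to decode, then decode with the codec the bytes were really in.
garbled = 'OK\u00c3\u2030'                              # 'OKÃ‰'
fixed = garbled.encode('windows-1252').decode('utf-8')
print(fixed)                                            # 'OKÉ'
```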