Convert UTF-8 with BOM to UTF-8 with no BOM in Python
Simply use the "utf-8-sig" codec:
fp = open("file.txt", "rb")  # binary mode, so read() returns bytes that can be decoded
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s.
If your files are big, you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
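import os, sys, codecs
BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)
path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0  # write position, which trails the read position by BOMLEN bytes
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            # skip ahead to the read position before reading the next chunk
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        # step back to the end of the written data and cut off the leftover bytes
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()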
It opens the file, reads a chunk, and writes it back to the file 3 bytes earlier than where it was read. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
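The copy-to-a-new-file approach mentioned above could look roughly like the following. This is a minimal Python 3 sketch, not newtover's actual code; the function name copy_without_bom and its parameters are made up for illustration.

```python
import codecs
import shutil

def copy_without_bom(src_path, dst_path, bufsize=4096):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present.

    Hypothetical helper: the data is streamed in chunks, so the whole file
    is never held in memory.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            # Not a BOM: these bytes are real content, so keep them.
            dst.write(head)
        # Stream the rest of the file in bufsize-sized chunks.
        shutil.copyfileobj(src, dst, bufsize)
```

Once the copy has succeeded, the new file can replace the original, for example with os.replace(dst_path, src_path).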
As for guessing the encoding, you can just loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work, since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
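One way to let the caller treat the fallback differently is to report which codec succeeded. The following is a minimal Python 3 sketch under that assumption; the name decode_with_guess is hypothetical, and the input is assumed to be a bytes object.

```python
def decode_with_guess(data):
    """Return (text, encoding) for the first codec that decodes data.

    Hypothetical helper: it tries the same codecs, in the same order, as
    the decode() function above, but also reports which one matched.
    """
    for encoding in ("utf-8-sig", "utf-16"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte value to a character, so this cannot fail.
    return data.decode("latin-1"), "latin-1"
```

A caller can then check whether the second element is "latin-1" and handle that result as a low-confidence guess.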
Is utf-8-sig suitable for decoding both UTF-8 and UTF-8 BOM?
The utf-8-sig codec will decode both utf-8-sig-encoded text and text encoded with the standard utf-8 encoding:
>>> s = 'Straße'
>>> utf8_sig = s.encode('utf-8-sig')
>>> utf8 = s.encode('utf-8')
>>> print(utf8_sig.decode('utf-8-sig'))
Straße
>>> print(utf8.decode('utf-8-sig'))
Straße
From the codecs docs:
Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
The utf-8-sig encoding is most common in Windows environments. If you're sharing files with users on macOS or *nix systems, the standard utf-8 encoding is what they would expect to receive.
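At the byte level, the only difference between the two encodings is the three-byte prefix. A small Python 3 sketch that checks this (the variable names are illustrative only):

```python
import codecs

text = "Straße"
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# utf-8-sig output is the plain UTF-8 bytes preceded by the BOM.
same_apart_from_bom = with_bom == codecs.BOM_UTF8 + without_bom

# Decoding with utf-8-sig and re-encoding with utf-8 drops the BOM.
converted = with_bom.decode("utf-8-sig").encode("utf-8")
```

This decode-then-encode round trip is a simple way to prepare a file for recipients who expect plain utf-8.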
Convert UTF-16 to UTF-8 and remove BOM?
Just use str.decode and str.encode (in Python 3, bytes.decode and str.encode):
with open(ff_name, 'rb') as source_file:
    with open(target_file_name, 'w+b') as dest_file:
        contents = source_file.read()
        dest_file.write(contents.decode('utf-16').encode('utf-8'))
str.decode will get rid of the BOM for you (and deduce the endianness).
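For large files in Python 3, the same conversion can be streamed by letting open() handle the decoding and encoding. This is a minimal sketch rather than the answer's own method; the function name utf16_to_utf8 is made up for illustration.

```python
import shutil

def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 without a BOM (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Copies decoded text in chunks instead of reading the whole file.
        shutil.copyfileobj(src, dst)
```

The utf-16 codec consumes the BOM while decoding and the utf-8 codec does not write one, so the output starts directly with the text.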
Reading Unicode file data with BOM chars in Python
There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig correctly decodes the given string regardless of the existence of a BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and don't worry about it.
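The same codec name can be passed straight to open() when reading files. A minimal Python 3 sketch, with a hypothetical helper name read_text:

```python
def read_text(path):
    """Read a UTF-8 file as str, dropping a leading BOM if there is one."""
    with open(path, encoding="utf-8-sig") as fp:
        return fp.read()
```

Files with and without a BOM produce the same text, so the caller does not need to inspect the first bytes.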
How to generate XML, UTF-8 with BOM using Python Element Tree?
A peek into the source of ElementTree.write shows that the prolog is hardcoded there (https://github.com/python/cpython/blob/main/Lib/xml/etree/ElementTree.py or permalink https://github.com/python/cpython/blob/ee0ac328d38a86f7907598c94cb88a97635b32f8/Lib/xml/etree/ElementTree.py). Therefore, using the internals of ET is probably the only option (other than monkey-patching the module) to write the required preamble and keep the BOM in the file:
import xml.etree.ElementTree as ET
qnames, namespaces = ET._namespaces(tree._root, None)
with open(lang_resx_fpath, 'w', encoding='utf-8-sig') as f:
    f.write("<?xml version='1.0' encoding='utf-8'?>\n")
    ET._serialize_xml(f.write,
                      tree._root, qnames, namespaces,
                      short_empty_elements=False)
This is probably no more elegant than your solution (and may be even less so). Its only advantage is that it does not require writing the file twice, which is a minor benefit except for huge XML files.
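If the declaration that ElementTree.write emits by default is acceptable, a variant that avoids private helpers may be enough: write the BOM bytes first, then let write() append to the same binary file object. This is a sketch under that assumption, not a replacement for the approach above; the function name write_xml_with_bom is hypothetical.

```python
import codecs
import xml.etree.ElementTree as ET

def write_xml_with_bom(tree, path):
    """Write an ElementTree to path as UTF-8, preceded by a UTF-8 BOM."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)
        # Given a binary file object, write() emits encoded bytes,
        # including the default XML declaration, after the BOM.
        tree.write(f, encoding="utf-8", xml_declaration=True,
                   short_empty_elements=False)
```

The file is still written only once. A custom prolog would need the approach shown earlier, since the declaration text is not configurable here.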
How am I supposed to handle the BOM while text processing using sys.stdin in Python 3?
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply need to use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...  # processing here
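A similar effect can be had with io.TextIOWrapper around the byte stream. The sketch below is an alternative under the same assumptions, not the answer's own code; the helper name text_lines is hypothetical, and an io.BytesIO object stands in for sys.stdin.buffer so the example can run on its own.

```python
import io

def text_lines(byte_stream):
    """Yield decoded lines from a binary stream, skipping a UTF-8 BOM."""
    reader = io.TextIOWrapper(byte_stream, encoding="utf-8-sig",
                              errors="replace")
    for line in reader:
        yield line

# io.BytesIO stands in for sys.stdin.buffer in this demonstration.
sample = io.BytesIO(b"\xef\xbb\xbffirst\nsecond\n")
lines = list(text_lines(sample))
```

In a real script you would pass sys.stdin.buffer instead of sample. Note that the wrapper closes the underlying stream when it is garbage-collected, which matters if stdin is needed again afterwards.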