Python Load JSON File with UTF-8 BOM Header

Python load json file with UTF-8 BOM header

You can open with codecs:

import json
import codecs

json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))

or read the raw bytes and decode with utf-8-sig yourself before passing the result to loads:

json.loads(open('sample.json', 'rb').read().decode('utf-8-sig'))
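
In Python 3 you can also skip codecs entirely and pass the codec name straight to the built-in open(); the utf-8-sig codec then strips the BOM for you while decoding:

import json

with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)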

Python Load UTF-8 JSON

In Python 2, the csv module does not support writing Unicode. You need to encode it manually here, as otherwise your Unicode values are encoded for you using ASCII (which is why you got the encoding exception).

This also means you need to write the UTF-8 BOM manually, but only if you really need it. UTF-8 can only be written one way; a Byte Order Mark is not needed to read UTF-8 files. Microsoft likes to add it to files to make it easier for its tools to detect the encoding, but the UTF-8 BOM can actually make it harder for other tools to work correctly, because they won't ignore the extra initial character.

Use:

import csv
import codecs

with open('EFSDUMP.csv', 'wb') as csv_file:
    csv_file.write(codecs.BOM_UTF8)
    content_writer = csv.writer(csv_file)
    content_writer.writerow([unicode(v).encode('utf8') for v in data.values()])

Note that this'll write your values in arbitrary (dictionary) order. The unicode() call will convert non-string types to unicode strings first before encoding.

To be explicit: you've loaded the JSON data just fine. It is the CSV writing that failed for you.
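
For comparison, in Python 3 the csv module works with text directly, so you can let the file object do the encoding; a minimal sketch, assuming data is the dict loaded from the JSON file:

import csv

# Python 3: the utf-8-sig codec writes the BOM for you on the first write.
with open('EFSDUMP.csv', 'w', newline='', encoding='utf-8-sig') as csv_file:
    content_writer = csv.writer(csv_file)
    content_writer.writerow([str(v) for v in data.values()])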

Why do I get a ValueError with JSON using a Windows txt file encoded with UTF-8?

The BOM must be the culprit. Open the file with codecs.open(path, "r", "utf-8-sig"), or auto-detect the encoding before opening, e.g. as described in "Reading Unicode file data with BOM chars in Python".
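
If you do want to auto-detect, a minimal sketch is to look at the first bytes for a known BOM before choosing the codec (the helper name is made up for illustration, and only the UTF-8 and UTF-16 marks are handled):

import codecs
import json

def load_json_guess_bom(path):
    # Illustrative helper: pick a codec based on the BOM, default to plain UTF-8.
    with open(path, 'rb') as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        return json.loads(raw.decode('utf-8-sig'))
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return json.loads(raw.decode('utf-16'))
    return json.loads(raw.decode('utf-8'))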

BOM character copied into JSON in Python 3

No, the UTF-8 standard does not define a BOM character. That's because UTF-8 has no byte order ambiguity issue like UTF-16 and UTF-32 do. The Unicode consortium doesn't recommend using U+FEFF at the start of a UTF-8 encoded file, while the IETF actively discourages it if alternatives to specify the codec exist. From the Wikipedia article on BOM usage in UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use.

[...]

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."

The Unicode standard only 'permits' the BOM because it is a regular character, just like any other: a zero-width non-breaking space. As a result, the Unicode consortium recommends that it not be removed when decoding, to preserve information (in case it had a different meaning or you want to retain compatibility with tools that have come to rely on it).
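
For illustration, U+FEFF really is just an ordinary, named character:

import unicodedata

print(unicodedata.name('\ufeff'))  # ZERO WIDTH NO-BREAK SPACE
print(hex(ord('\ufeff')))          # 0xfeff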

You have two options:

  • Strip the character from the decoded string yourself. Note that str.strip() alone won't remove it, because Python does not classify U+FEFF as whitespace; explicitly strip the BOM instead (a short sketch follows this list):

    text = text.lstrip('\ufeff')  # remove the BOM if present

    (technically that'll remove any number of zero-width non-breaking space characters, but that is probably what you'd want anyway).

  • Open the file with the utf-8-sig codec instead. That codec was added to handle exactly such files: it removes the UTF-8 BOM byte sequence from the start, if present, before decoding, and it works just as well on files without those bytes.
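
A minimal sketch of the first option applied to a JSON file (the file name is illustrative):

import json

with open('data.json', encoding='utf-8') as f:
    text = f.read().lstrip('\ufeff')  # drop a leading BOM if present
data = json.loads(text)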

BOM in server response screws up JSON parsing

You should probably yell at whoever's running this service, because a BOM on UTF-8 text makes no sense. The BOM exists to disambiguate byte order, and UTF-8, being a byte-oriented encoding, has no byte order to disambiguate.

That said, ideally you should decode bytes before doing anything else with them. Luckily, Python has a codec that recognizes and removes the BOM: utf-8-sig.

>>> '\xef\xbb\xbffoo'.decode('utf-8-sig')
u'foo'

So you just need:

data = json.loads(response.decode('utf-8-sig'))
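
In Python 3 the same approach works as long as response is a bytes object (for example requests' resp.content or the result of urllib.request.urlopen(...).read()); the payload below is illustrative:

import json

response = b'\xef\xbb\xbf{"foo": "bar"}'  # raw bytes with a leading BOM
data = json.loads(response.decode('utf-8-sig'))
print(data['foo'])  # bar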

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Simply use the "utf-8-sig" codec:

fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
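
A minimal sketch of that copy-to-a-new-file approach (the input and output names are illustrative):

import codecs

with open("with_bom.txt", "rb") as src, open("without_bom.txt", "wb") as dst:
    head = src.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        dst.write(head)  # no BOM found, keep the first bytes
    # copy the rest of the file unchanged, in chunks
    for chunk in iter(lambda: src.read(4096), b""):
        dst.write(chunk)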

As for guessing the encoding, you can loop through the encodings from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we fall back to Latin-1; this always works, since all 256 byte values are legal in Latin-1. You may want to return None instead in that case, since Latin-1 is really just a fallback and your code might want to handle it more carefully (if it can).
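
For example, paired with a binary read (the file name is illustrative):

with open("unknown.txt", "rb") as f:
    text = decode(f.read())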

Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

import io

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

