Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence

Use the ensure_ascii=False switch with json.dumps(), then encode the result to UTF-8 yourself:

>>> import json
>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Caveats for Python 2

For Python 2, there are a few more caveats to take into account. If you are writing this to a file, use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() to write to that file:

import io, json

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if it is a str
    json_file.write(unicode(data))
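
To see why the workaround is needed: with ASCII-only input, json.dumps() can return a str even with ensure_ascii=False, and io text files accept only unicode. A minimal sketch of how that bites (Python 2; the exact TypeError message may vary by minor version):

>>> import io, json
>>> with io.open('filename', 'w', encoding='utf8') as json_file:
...     json_file.write(json.dumps({'a': 1}, ensure_ascii=False))
...
Traceback (most recent call last):
  ...
TypeError: can't write str to text stream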

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s = json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

json.dumps \u escaped unicode to utf8

You have Unicode data and want UTF-8 encoded JSON output:

>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
 "content": "你好"
}

My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.

However, if your terminal is not set up for UTF-8, it is the terminal that shows the 'wrong' characters:

>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
 "content": "ä½ å¥½"
}

Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.

This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
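
For comparison, decoding the same bytes with the right codec recovers the text; a quick check using the data from above:

>>> utf8_bytes = json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
>>> print utf8_bytes.decode('utf8')
{
 "content": "你好"
}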

Python encoding and json dumps

You're doing nothing wrong (and neither is Python).

Python's json module simply takes the safe route and escapes non-ASCII characters. This is a valid way of representing such characters in JSON, and any conforming parser will resurrect the proper Unicode characters when parsing the string:

>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
>>> json.loads('{"Cr\\u00eapes": 5}')
{'Crêpes': 5}

Don't forget that JSON is just a representation of your data, and both "ê" and "\u00ea" are valid JSON representations of the string ê. Conforming JSON parsers should handle both correctly.
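
You can verify this round-trip yourself; both spellings parse to the same Python string:

>>> json.loads('"\\u00ea"') == json.loads('"ê"')
True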

It is possible to disable this behaviour, though; see the json.dump() documentation:

>>> json.dumps({'Crêpes': 5}, ensure_ascii=False)
'{"Crêpes": 5}'

converting JSON (UTF-8) to JSON (unicode escape) & raw strings (can't have an odd number of backslashes)

It looks like you just want ensure_ascii=True (the default):

C:\>type input.json
{"context" : "-\" 너"}

C:\>py
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> with open('input.json', encoding='utf8') as f:
...     data = json.load(f)
...
>>> data
{'context': '-" 너'}
>>> with open('output.json', 'w', encoding='utf8') as f:
...     json.dump(data, f)
...
>>> ^Z

C:\>type output.json
{"context": "-\" \ub108"}

Dump Chinese data into a json file

Check out the docs for json.dump.

Specifically, it has an ensure_ascii switch that, when set to False, makes the function write the characters out as-is instead of escaping them.

If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
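
Putting that together, a minimal sketch of dumping Chinese data to a file (Python 3 assumed; the file name and sample data here are made up for illustration):

import json

data = {'name': '你好'}  # hypothetical sample data
with open('data.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False)

The file then contains {"name": "你好"} rather than {"name": "\u4f60\u597d"}.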

Python Saving JSON Files as UTF-8

Set ensure_ascii to False:

>>> x = {'some_key': u'Enviar invitación privada'}
>>> print json.dumps(x, ensure_ascii=False)
{"some_key": "Enviar invitación privada"}

json.dump() uses ASCII codec encoding (instead of requested UTF-8) when redirecting stdout to a file

Consolidating all the comments and answers into one final answer:

Note: this answer is for Python 2.7. Python 3 is likely to be different.

The JSON spec says that JSON files are UTF-8 encoded. However, the Python json package does not like to take chances, so it writes straight ASCII and escapes Unicode characters in the output.

You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str. In that case, encoding the unicode output is your problem.

There is no way to make the json package generate UTF-8 or any other encoding on output. It's either ASCII or unicode; take your pick.

The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
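
A quick sketch of what encoding actually does (Python 2): it decodes byte-string input before serialising, and has no effect on the output encoding:

>>> json.dumps({'k': '\xd7\x91'}, encoding='utf8')  # UTF-8 bytes for ב
'{"k": "\\u05d1"}'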

Here's what finally worked for me:

import codecs, json, sys

ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)

tl;dr: the real mystery was why it didn't barf when stdout simply went to the terminal. It turned out that stdout.write() detects when output is going to a terminal and encodes per the $LANG environment variable. When output goes to a file, the unicode is encoded to ASCII instead, and an error results when a non-encodable character is encountered.
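
You can observe this by checking sys.stdout.encoding under both conditions (Python 2; the terminal value depends on your locale, UTF-8 here):

$ python -c "import sys; print sys.stdout.encoding"
UTF-8
$ python -c "import sys; print sys.stdout.encoding" > out.txt && cat out.txt
None

When the encoding is None, Python falls back to the ascii codec on write, which is what triggers the error.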


