Facebook JSON badly encoded
I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.
What this means is that any non-ASCII character in the string data was encoded twice. First to UTF-8, and then the UTF-8 bytes were encoded again by interpreting them as Latin-1 encoded data (which maps exactly 256 characters to the 256 possible byte values), by using the \uHHHH JSON escape notation (so a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step encoded byte values in the range 0-255, this resulted in a series of \u00HH sequences (a literal backslash, a literal lowercase letter u, two 0 zero digits and two hex digits).
E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
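To make the double encoding concrete, here is a small Python sketch that reproduces the damage on purpose. It assumes the export behaves exactly as described above; the variable names are illustrative only and nothing here comes from Facebook's actual code.

```python
import json

name = 'Radosław'

# Step 1: the text is correctly encoded to UTF-8 bytes (ł becomes C5 82).
utf8_bytes = name.encode('utf8')

# Step 2: those bytes are misread as Latin-1, one character per byte.
misread = utf8_bytes.decode('latin1')

# Step 3: the misread text is serialised as JSON with ASCII-only escapes.
exported = json.dumps(misread)
print(exported)
```

The printed value is the same "Rados\u00c5\u0082aw" string that shows up in the download, which is why reversing steps 2 and 1 repairs it.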
You can repair the damage in two ways:
Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:
>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'
This would require a full traversal of your data structure to find all those strings, of course.
Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:
import json
import os
import re
from functools import partial
# Replace every \u00hh escape with the single byte it stands for.
fix_mojibake_escapes = partial(
    re.compile(rb'\\u00([\da-f]{2})').sub,
    lambda m: bytes.fromhex(m[1].decode()),
)
# subdir and file are whatever path parts your own code already has.
with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())
data = json.loads(repaired)
(If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode()), and write m.group(1) instead of m[1].)
From your sample data this produces:
{'content': 'No to trzeba ostatnie treningi zrobić xD',
'sender_name': 'Radosław',
'timestamp': 1524558089,
'type': 'Generic'}
The regular expression matches against all \u00HH sequences in the binary data and replaces those with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
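For the first approach, the "full traversal" could look roughly like the sketch below. The helper name fix_mojibake is hypothetical, and the choice to leave a string untouched when it cannot be round-tripped is an assumption about what you would want, not something the export guarantees.

```python
import json


def fix_mojibake(value):
    """Recursively repair every string in a parsed JSON structure."""
    if isinstance(value, str):
        try:
            return value.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not double-encoded (or not repairable this way): keep as-is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so repair them as well as the values.
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value


raw = r'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089}'
fixed = fix_mojibake(json.loads(raw))
print(fixed)
```

This walks every list and dict, so no property is missed, at the cost of building a new copy of the structure.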
Fixing Facebook JSON Encoding in Node Js
Solved... in a way. If there's a better way to do it, let me know.
So, here's the amended function:
// fs is Node's built-in module; iconv is assumed to be the iconv-lite package.
readFacebookJson(filename) {
    var content = fs.readFileSync(filename, "utf8");
    const json = JSON.parse(content)
    return json
}
fixEncoding(string) {
    return iconv.decode(iconv.encode(string, "latin1"), "utf8")
}
It wasn't the readFileSync() screwing things up, it was the JSON.parse(). So we read the file as utf8 like usual; however, we then need to do the latin1 encoding/decoding on the strings that are now properties of the parsed JSON, not on the whole JSON file before it's parsed. I did this with a map().
messages = readFacebookJson(filename).messages.map(message => {
    const toReturn = message;
    toReturn.sender_name = fixEncoding(toReturn.sender_name)
    if (typeof message.content !== "undefined") {
        toReturn.content = fixEncoding(message.content)
    }
    return toReturn;
}),
The issue here is of course that some properties might be missed. So make sure you know what properties contain what.
Encoding/decoding issue with Facebook json messages. C# parsing
Here is the answer:
private string DecodeString(string text)
{
    Encoding targetEncoding = Encoding.GetEncoding("ISO-8859-1");
    var unescapeText = System.Text.RegularExpressions.Regex.Unescape(text);
    return Encoding.UTF8.GetString(targetEncoding.GetBytes(unescapeText));
}
I collected all the answers, mixed them, and here we are. Thank you.
unknown encoding for facebook messages
I would use the package ftfy to solve this problem: https://github.com/LuminosoInsight/python-ftfy
>>> from ftfy import fix_text
>>> fix_text(u'Comment il est \u00c3\u00a9go\u00c3\u00afste :s')
'Comment il est égoïste :s'
I was having problems installing the current version but it worked like a charm with pip install 'ftfy<5'
Mojibake when reading JSON containing escaped unicode - wrongly decoded as Latin-1?
What you have there is not the correct notation for the emoji; it really means "ð" and three undefined codepoints, so the translation you get is correct! (The \u... notation is independent of encoding.)
The proper notation for 😅, Unicode U+1F605, in JavaScript is \ud83d\ude05. Use that in the JSON.
{
"message": "\ud83d\ude05"
}
If, on the other hand, your question is how you can get the correct results from the wrong data, then yes, as the comments say, you may have to jump through some hoops to do that.
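To see where \ud83d\ude05 comes from, the surrogate pair can be computed by hand. The sketch below is in Python rather than JavaScript, and the helper name to_json_escape is made up for illustration; it only shows the standard UTF-16 surrogate arithmetic that JSON's \u notation relies on.

```python
import json


def to_json_escape(char):
    """Return the JSON \\uXXXX escape(s) for a single character."""
    code = ord(char)
    if code < 0x10000:
        # Characters in the Basic Multilingual Plane need one escape.
        return '\\u%04x' % code
    # Anything above U+FFFF is split into a high and a low surrogate.
    code -= 0x10000
    high = 0xD800 + (code >> 10)
    low = 0xDC00 + (code & 0x3FF)
    return '\\u%04x\\u%04x' % (high, low)


escaped = to_json_escape('\U0001F605')
decoded = json.loads('"%s"' % escaped)
print(escaped, decoded == '\U0001F605')
```

A JSON parser joins the two escapes back into the single code point U+1F605, which is why this notation works regardless of the file's byte encoding.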
Facebook/messenger archive contains emoji that I am unable to parse
.encode('latin1').decode('utf8') is correct - it results in the codepoint U+fe33a. This codepoint is in a Private Use Area (PUA) (specifically Supplementary Private Use Area-A), so everyone can assign their own meaning to that codepoint (maybe Facebook wanted to use a crying face when there wasn't yet one in Unicode, so they used the PUA?).
Googling for that char (https://www.google.com/search?q=) makes Google autocorrect it to U+1f62d (😭) - sadly I have no idea how Google maps U+fe33a to U+1f62d.
Googling for U+fe33a site:unicode.org gives https://unicode.org/L2/L2010/10132-emojidata.pdf, which lists U+1F62D as the proposed official codepoint.
As that document from Unicode lists U+fe33a as a codepoint used by Google, I searched for android old emoji codepoints pua. Among other stuff, two actually usable results:
- How to get Android emoji code point - the question links to:
  - https://unicodey.com/emoji-data/table.htm - an HTML table that seems to be acceptably parsable
  - and even better: https://github.com/google/mozc/blob/master/src/data/emoji/emoji_data.tsv - a tab-separated list that maps modern codepoints to legacy PUA codepoints and other information, like this:
    1F62D FE33A E72D E411
    [...]
- https://github.com/googlei18n/noto-emoji/issues/115 - this thread links to:
  - https://github.com/Crissov/noto-emoji/blob/legacy-pua/emoji_aliases.txt - a machine-readable document that translates legacy PUA codepoints to modern codepoints, like this:
    FE33A;1F62D # Google
I included my search queries in the answer because none of the results I found are in any way authoritative - but it should be enough to get your tool working :-)
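As a rough idea of how such an alias list could be put to use, here is a Python sketch that turns lines in the "FE33A;1F62D # Google" shape into a translation table. The function name parse_aliases is hypothetical, and the parser assumes every usable line has exactly two single hex codepoints separated by a semicolon; the real file may contain other shapes, which this sketch simply skips.

```python
def parse_aliases(lines):
    """Build a str.translate() table from 'LEGACY;MODERN # comment' lines."""
    table = {}
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        entry = line.split('#', 1)[0].strip()
        parts = [part.strip() for part in entry.split(';')]
        if len(parts) != 2:
            continue
        try:
            table[int(parts[0], 16)] = chr(int(parts[1], 16))
        except ValueError:
            # Not a plain pair of hex codepoints: ignore the line.
            continue
    return table


sample_lines = ['# legacy PUA aliases', 'FE33A;1F62D # Google']
table = parse_aliases(sample_lines)

legacy_text = 'so sad \U000FE33A'
print(legacy_text.translate(table))
```

Run this after the .encode('latin1').decode('utf8') step, since the PUA codepoints only appear once the mojibake itself has been undone.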
How was this JSON string's unicode badly encoded, and how can I reverse it?
Looks like you interpreted a UTF-8 string as Latin-1, then encoded that as UTF-8 and interpreted it as Latin-1 again, and encoded that to JSON. Here's a fix in Python:
>>> s
'"R\\u00c3\\u0083\\u00c2\\u00b6yksopp"'
>>> json.loads(s)
'RÃ\x83Â¶yksopp'
>>> json.loads(s).encode('latin1')
b'R\xc3\x83\xc2\xb6yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8')
'RÃ¶yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8').encode('latin1')
b'R\xc3\xb6yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8').encode('latin1').decode('utf-8')
'Röyksopp'
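If you don't know in advance how many times the text was mangled, the round trip can be repeated until it stops working. This is only a sketch: the name undo_mojibake and the max_rounds limit are my own, and in rare cases a legitimate string could survive an extra round and be changed when it shouldn't be.

```python
import json


def undo_mojibake(text, max_rounds=5):
    """Repeat the latin1 -> utf-8 round trip until it no longer applies."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode('latin1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text is no longer valid mojibake, so stop here.
            break
        if candidate == text:
            # Pure ASCII round-trips to itself; nothing left to repair.
            break
        text = candidate
    return text


s = '"R\\u00c3\\u0083\\u00c2\\u00b6yksopp"'
print(undo_mojibake(json.loads(s)))
```

For the string above the loop runs two successful rounds and stops on the third, because the repaired text no longer decodes as UTF-8 after a Latin-1 encode.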
How does Facebook encode emoji in the json Graph API?
Answering my own question though most of the credit belongs to @bobince for showing me the way in the comments above.
The answer is that Facebook encodes emoji using the "Google" encoding as seen on this Unicode table.
I have created a ruby gem called emojivert that can convert from one encoding to another, including from "Google" to "Unified". It is based on another existing project called rails-emoji.
So the failing example above would be fixed by doing:
string = ActiveSupport::JSON.decode('"\udbba\udf59"')
fixed = Emojivert.google_to_unified(string)
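As a quick cross-check of the "Google" encoding explanation, the escape pair from the example can be decoded to see which codepoint it stands for. This sketch uses Python instead of Ruby and only relies on the standard json module; it does not use or reproduce emojivert.

```python
import json

# The surrogate pair from the example above, as it appears in the JSON text.
decoded = json.loads('"\\udbba\\udf59"')
code_point = ord(decoded)

# Supplementary Private Use Area-A spans U+F0000 to U+FFFFD.
in_pua_a = 0xF0000 <= code_point <= 0xFFFFD
print('U+%X' % code_point, in_pua_a)
```

The pair decodes to U+FEB59, which sits in the private use range, so it carries no standard meaning and needs a mapping table to become a regular emoji.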