Facebook JSON Badly Encoded

I can indeed confirm that the Facebook download data is incorrectly encoded; it is a mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I'll make sure to file a bug report.

What this means is that any non-ASCII character in the string data was encoded twice: first to UTF-8, and then the UTF-8 bytes were encoded again by interpreting them as Latin-1 data (an encoding that maps exactly 256 characters to the 256 possible byte values) and writing them out in the \uHHHH JSON escape notation (a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step only ever encodes byte values in the range 0-255, the result is a series of \u00HH sequences (a literal backslash, a literal lowercase letter u, two zero digits and two hex digits).

E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
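
For illustration, here is how that double encoding plays out in a Python session (my own demonstration of the damage, not a repair):

    >>> import json
    >>> 'Radosław'.encode('utf8')                   # step 1: encode to UTF-8 bytes
    b'Rados\xc5\x82aw'
    >>> 'Radosław'.encode('utf8').decode('latin1')  # step 2: misread those bytes as Latin-1
    'RadosÅ\x82aw'
    >>> json.dumps('Radosław'.encode('utf8').decode('latin1'))
    '"Rados\\u00c5\\u0082aw"'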

You can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:

     >>> import json
     >>> data = r'"Rados\u00c5\u0082aw"'
     >>> json.loads(data).encode('latin1').decode('utf8')
     'Radosław'

     This would require a full traversal of your data structure to find all those strings, of course; see the sketch after this list.

  2. Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:

     import json
     import os
     import re
     from functools import partial

     # Replace each \u00hh escape with the byte its two hex digits denote.
     fix_mojibake_escapes = partial(
         re.compile(rb'\\u00([\da-f]{2})').sub,
         lambda m: bytes.fromhex(m[1].decode()),
     )

     # subdir and file come from however you walk the export directory.
     with open(os.path.join(subdir, file), 'rb') as binary_data:
         repaired = fix_mojibake_escapes(binary_data.read())
     data = json.loads(repaired)

    (If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode())).

    From your sample data this produces:

     {'content': 'No to trzeba ostatnie treningi zrobić xD',
      'sender_name': 'Radosław',
      'timestamp': 1524558089,
      'type': 'Generic'}

    The regular expression matches against all \u00HH sequences in the binary data and replaces those with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
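
For option 1, here is a minimal sketch of such a traversal (the helper name fix_strings and the raw_json variable are my own, not part of the export):

    import json

    def fix_strings(obj):
        # Recursively re-encode every string as Latin-1 and decode it as UTF-8.
        if isinstance(obj, str):
            return obj.encode('latin1').decode('utf8')
        if isinstance(obj, list):
            return [fix_strings(value) for value in obj]
        if isinstance(obj, dict):
            return {fix_strings(key): fix_strings(value) for key, value in obj.items()}
        return obj

    data = fix_strings(json.loads(raw_json))  # raw_json: the JSON document read as text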

Fixing Facebook JSON Encoding in Node.js

Solved... in a way. If there's a better way to do it, let me know.

So, here's the amended function:

readFacebookJson(filename) {
    var content = fs.readFileSync(filename, "utf8");
    const json = JSON.parse(content);
    return json;
}

fixEncoding(string) {
    return iconv.decode(iconv.encode(string, "latin1"), "utf8");
}

It wasn't the readFileSync() screwing things up, it was the JSON.parse(). So we read the file as UTF-8 like usual; however, we then need to do the Latin-1 encoding/decoding on the strings that are now properties of the parsed JSON, not on the whole JSON file before it's parsed. I did this with a map().

messages = readFacebookJson(filename).messages.map(message => {
    const toReturn = message;
    toReturn.sender_name = fixEncoding(toReturn.sender_name);
    if (typeof message.content !== "undefined") {
        toReturn.content = fixEncoding(message.content);
    }
    return toReturn;
});

The issue here is of course that some properties might be missed. So make sure you know which properties contain what.

Encoding/decoding issue with Facebook json messages. C# parsing

Here is the answer:

private string DecodeString(string text)
{
    Encoding targetEncoding = Encoding.GetEncoding("ISO-8859-1");
    var unescapeText = System.Text.RegularExpressions.Regex.Unescape(text);
    return Encoding.UTF8.GetString(targetEncoding.GetBytes(unescapeText));
}

I've collected all the answers, combined them, and here we are. Thank you.

unknown encoding for facebook messages

I would use the package ftfy to solve this problem: https://github.com/LuminosoInsight/python-ftfy

>>> from ftfy import fix_text
>>> fix_text(u'Comment il est \u00c3\u00a9go\u00c3\u00afste :s')
'Comment il est égoïste :s'

I was having problems installing the current version, but it worked like a charm with pip install 'ftfy<5'.

Mojibake when reading JSON containing escaped unicode - wrongly decoded as Latin-1?

What you have there is not the correct notation for the emoji; it really means "ð" and three undefined codepoints, so the translation you get is correct! (The \u... notation is independent of encoding.)

The proper notation for 😅, Unicode U+1F605, in JavaScript is \ud83d\ude05. Use that in the JSON.

{
    "message": "\ud83d\ude05"
}
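
As an illustration (my own sketch, in Python), here is the surrogate-pair arithmetic behind that escape, and the resulting decode:

>>> import json
>>> cp = 0x1F605 - 0x10000          # supplementary codepoints are offset by 0x10000
>>> hex(0xD800 + (cp >> 10))        # high surrogate from the top 10 bits
'0xd83d'
>>> hex(0xDC00 + (cp & 0x3FF))      # low surrogate from the bottom 10 bits
'0xde05'
>>> json.loads('"\\ud83d\\ude05"')  # the pair decodes to U+1F605
'😅'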

If, on the other hand, your question is how you can get the correct results from the wrong data, then yes, as the comments say you may have to run through some hoops to do that.

Facebook/messenger archive contains emoji that I am unable to parse

.encode('latin1').decode('utf8') is correct - it results in the codepoint U+FE33A. This codepoint is in a Private Use Area (PUA) (specifically Supplemental Private Use Area-A), so anyone can assign their own meaning to that codepoint. (Maybe Facebook wanted to use a crying face when there wasn't yet one in Unicode, so they used the PUA?)

Googling for that char (https://www.google.com/search?q=) makes Google autocorrect it to U+1F62D (😭) - sadly I have no idea how Google maps U+FE33A to U+1F62D.

Googling for U+fe33a site:unicode.org gives https://unicode.org/L2/L2010/10132-emojidata.pdf, which lists U+1F62D as proposed official codepoint.

As that document from Unicode lists U+FE33A as a codepoint used by Google, I searched for "android old emoji codepoints pua". Among other stuff, two actually usable results:

  1. How to get Android emoji code point - the question links to:

    • https://unicodey.com/emoji-data/table.htm - an HTML table that seems to be acceptably parsable
    • and even better: https://github.com/google/mozc/blob/master/src/data/emoji/emoji_data.tsv - a tab-separated list that maps modern codepoints to legacy PUA codepoints and other information, like this:

      1F62D FE33A E72D E411[...]
  2. https://github.com/googlei18n/noto-emoji/issues/115 - this thread links to:

    • https://github.com/Crissov/noto-emoji/blob/legacy-pua/emoji_aliases.txt - a machine-readable document that translates legacy PUA codepoints to modern codepoints, like this:

      FE33A;1F62D # Google

I included my search queries in the answer because none of the results I found are in any way authoritative - but it should be enough to get your tool working :-)
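
A minimal sketch (mine, in Python) of turning that aliases file into a lookup table and applying it, assuming the one-codepoint-per-line FE33A;1F62D format shown above:

def load_pua_aliases(path):
    # Parse lines like "FE33A;1F62D # Google" into a {legacy: modern} map.
    aliases = {}
    with open(path) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments and blanks
            if not line:
                continue
            legacy, modern = line.split(';')
            aliases[chr(int(legacy, 16))] = chr(int(modern, 16))
    return aliases

def modernize(text, aliases):
    # Replace each legacy PUA character with its modern equivalent.
    return ''.join(aliases.get(ch, ch) for ch in text)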

How was this JSON string's unicode badly encoded, and how can I reverse it?

Looks like you interpreted a UTF-8 string as Latin-1, then encoded that as UTF-8 and interpreted it as Latin-1 again, and encoded that to JSON. Here's a fix in Python:

>>> s
'"R\\u00c3\\u0083\\u00c2\\u00b6yksopp"'
>>> json.loads(s)
'RÃ\x83Â¶yksopp'
>>> json.loads(s).encode('latin1')
b'R\xc3\x83\xc2\xb6yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8')
'RÃ¶yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8').encode('latin1')
b'R\xc3\xb6yksopp'
>>> json.loads(s).encode('latin1').decode('utf-8').encode('latin1').decode('utf-8')
'Röyksopp'
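
The round-trip had to be applied twice here. As a generalization, a small sketch (the helper name undo_mojibake is my own) can repeat the Latin-1/UTF-8 round-trip until the text stops changing:

import json

def undo_mojibake(text, max_rounds=5):
    # Re-encode as Latin-1 and decode as UTF-8 until the text stabilizes.
    for _ in range(max_rounds):
        try:
            fixed = text.encode('latin1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no longer encodable/decodable, so we are done
        if fixed == text:
            break
        text = fixed
    return text

print(undo_mojibake(json.loads(r'"R\u00c3\u0083\u00c2\u00b6yksopp"')))  # Röyksopp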

How does Facebook encode emoji in the json Graph API?

Answering my own question though most of the credit belongs to @bobince for showing me the way in the comments above.

The answer is that Facebook encodes emoji using the "Google" encoding as seen on this Unicode table.

I have created a ruby gem called emojivert that can convert from one encoding to another, including from "Google" to "Unified". It is based on another existing project called rails-emoji.

So the failing example above would be fixed by doing:

string = ActiveSupport::JSON.decode('"\udbba\udf59"')  # => the legacy "Google" PUA character (U+FEB59)
fixed = Emojivert.google_to_unified(string)            # => the corresponding unified emoji

