PHP Decoding and Encoding JSON with Unicode Characters

PHP decoding and encoding json with unicode characters

Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.

Here's why I think so:

  • json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
  • You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what happens when UTF-8 text is read as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, but that byte sequence reads as ó in ISO 8859-1).
  • Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1 (this was the default before PHP 5.4). (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
  • You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
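The byte-level reasoning in the list above can be checked with a small sketch. It assumes the mbstring extension is available and uses mb_convert_encoding for the ISO 8859-1 to UTF-8 step that utf8_encode performs; the variable names are illustrative only.

```php
<?php
// "Odómetro" as ISO 8859-1 bytes: ó is the single byte 0xF3.
$latin1 = "Od\xF3metro";

// json_encode only accepts valid UTF-8, so the Latin-1 string is rejected.
$rejected = json_encode($latin1);

// Convert ISO 8859-1 -> UTF-8 (the same job utf8_encode does); ó becomes 0xC3 0xB3.
$utf8    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$encoded = json_encode($utf8);

// Reading those UTF-8 bytes as if they were ISO 8859-1 gives the "mangled" form.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump($rejected);          // bool(false)
echo bin2hex($utf8), "\n";    // 4f64c3b36d6574726f
echo $encoded, "\n";          // "Od\u00f3metro"
echo $mangled, "\n";          // OdÃ³metro
```

If the first var_dump shows false for your real data as well, that is a strong hint the string is not UTF-8 when it reaches json_encode.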

PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.

So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
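As a rough sketch of that connection string: the host, database name and user below are placeholders, not values from the question, and the actual connection call is left commented out because it needs a reachable server.

```php
<?php
// Hypothetical connection parameters; only the options part comes from the text above.
$connString = "host=localhost dbname=mydb user=me options='--client_encoding=UTF8'";

// Uncomment when a server is available:
// $conn = pg_connect($connString);
// Alternatively, the encoding can be set after connecting:
// pg_set_client_encoding($conn, 'UTF8');

echo $connString, "\n";
```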

Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).

php json request: json_decode unicode string

After some searching, I found this, which strips control characters and all non-ASCII bytes before decoding:

$result = json_decode(preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $file), true);
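To see what that pattern actually removes, here is a small sketch with a made-up input: a UTF-8 byte order mark followed by JSON containing "ó". The input is an assumption for illustration; the original question's data is not shown.

```php
<?php
// Hypothetical input: BOM (EF BB BF) + JSON whose value contains ó (C3 B3).
$file = "\xEF\xBB\xBF" . '{"name":"Od' . "\xC3\xB3" . 'metro"}';

// The BOM alone is enough to make json_decode fail.
$direct = json_decode($file, true);

// The pattern drops every control byte and every byte >= 0x80.
$clean  = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $file);
$result = json_decode($clean, true);

var_dump($direct);   // NULL
var_dump($result);   // the BOM is gone, but so is the ó: "Odmetro"
```

The decode succeeds, but every non-ASCII character is silently deleted, so this is only suitable when the payload is expected to be pure ASCII.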

PHP - Replace JSON with the correct Unicode symbol

According to the PHP documentation on stripslashes(), it

un-quotes a quoted string.

This means it removes the backslashes that JSON uses to escape characters (including \uXXXX Unicode sequences). Once they are gone, you can no longer tell whether a sequence such as "u0024" was meant to be a Unicode escape or is simply text your user entered.

Besides that, you will run into trouble when using stripslashes() on a JSON value that contains escaped quotes. Consider this example:

{
"key": "\"value\""
}

This will become invalid when using stripslashes() because it will then look like this:

{
"key": ""value""
}

This cannot be parsed because it is no longer valid JSON. If you leave stripslashes() out, the JSON parser converts all escape sequences itself, including any Unicode sequences your data may contain, so the decoded values already hold the real characters when PHP outputs them to the client.

Conclusion: I'd suggest not using stripslashes() when dealing with JSON, as it may break things (as seen in the previous example, but also in your problem).
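Both failure modes can be reproduced in a few lines. This is a minimal sketch; the "price" key is invented for the example.

```php
<?php
// Single quotes keep the backslashes exactly as they appear in the JSON text.
$json = '{"key": "\"value\"", "price": "\u0024"}';

// Decoding the untouched JSON resolves both kinds of escape.
$ok = json_decode($json, true);

// stripslashes removes the backslashes, leaving ""value"" and a bare u0024.
$stripped = stripslashes($json);
$broken   = json_decode($stripped, true);

// Even without escaped quotes, the Unicode escape is lost for good.
$onlyUnicode = json_decode(stripslashes('{"price": "\u0024"}'), true);

var_dump($ok);           // key => "value" (with quotes), price => $
var_dump($broken);       // NULL
var_dump($onlyUnicode);  // price => u0024
```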

Reference: Why are my special Unicode characters encoded weird using json_encode?

First of all: There's nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in ECMAScript (section 7.8.4 "String Literals") and is described as such:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

In short: Any character can be encoded as \u...., where .... is the Unicode code point of the character (or the code point of half of a UTF-16 surrogate pair, for characters outside the BMP).

"馬"
"\u99ac"

These two string literals represent the exact same character, they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.
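A quick way to confirm the equivalence is to decode both forms and compare the results. The sketch below assumes the source file is saved as UTF-8; the emoji line is an extra illustration of the surrogate-pair case mentioned above.

```php
<?php
// Single quotes keep \u99ac as a literal JSON escape rather than a PHP one.
$literal = json_decode('"馬"');
$escaped = json_decode('"\u99ac"');

// A character outside the BMP is written as two \u escapes (a surrogate pair).
$emoji = json_decode('"\ud83d\ude00"');   // U+1F600

var_dump($literal === $escaped);   // bool(true)
echo bin2hex($escaped), "\n";      // e9a6ac
echo bin2hex($emoji), "\n";        // f09f9880
```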

PHP's json_encode preferably encodes non-ASCII characters using \u.... escape sequences. Technically it doesn't have to, but it does. And the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference, it is not necessary in any way to transport "Unicode characters" in JSON.
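The point can be checked with a round trip: both encodings decode back to the same data. A minimal sketch, again assuming a UTF-8 source file:

```php
<?php
$data = ['foo' => '馬'];

$escaped = json_encode($data);                          // default: \uXXXX escapes
$literal = json_encode($data, JSON_UNESCAPED_UNICODE);  // literal UTF-8 characters

echo $escaped, "\n";   // {"foo":"\u99ac"}
echo $literal, "\n";   // {"foo":"馬"}

// Either form decodes back to the original array.
var_dump(json_decode($escaped, true) === $data);   // bool(true)
var_dump(json_decode($literal, true) === $data);   // bool(true)
```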

Php json_encode converts utf8 string to characters codes

They should be converted back on the other end when it's decoded. This is the safest option, as it might not be possible to guarantee that the transmission or storage will not corrupt a multi-byte encoding.

If you're certain that everything is safe for UTF8 end-to-end you can do:

$res = json_encode($foo, \JSON_UNESCAPED_UNICODE);

http://php.net/manual/en/function.json-encode.php

Any way to return PHP `json_encode` with encode UTF-8 and not Unicode?

{"a":"\u00e1"} and {"a":"á"} are different ways to write the same JSON document; the JSON decoder will decode the Unicode escape.

In PHP 5.4+, json_encode has the JSON_UNESCAPED_UNICODE option for plain output. On older PHP versions, you can roll your own JSON encoder that does not escape non-ASCII characters, or use PEAR's JSON encoder and remove lines 349 to 433.

JSON Encoding and decoding of UTF8 characters in PHP

The default behavior of json_encode is to escape all non-ASCII characters as \uXXXX sequences. If your PHP is version 5.4.0 or greater, you can pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode to get the behavior you're expecting. There are numerous hacks to get this behavior in earlier versions, including preprocessing your object to encode non-ASCII characters as HTML entities, then reversing the transformation afterwards.

Convert Unicode from JSON string with PHP

It is not UTF-16 encoding. It looks like a bogus encoding, because the \uXXXX escape is independent of any UTF or UCS encoding of Unicode: it names a code point, not bytes. \u00c2\u00a3 really maps to the string £.

What you should have is \u00a3, which is the Unicode code point for £.

{0xC2, 0xA3} is the UTF-8 encoded 2-byte character for this code point.

If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of Unicode code points to a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.

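function fixBadUnicode($str) {
    // The /e modifier was removed in PHP 7, so a callback rebuilds the two raw bytes instead.
    $toBytes = function ($m) { return chr(hexdec($m[1])) . chr(hexdec($m[2])); };
    // Note: utf8_decode is deprecated as of PHP 8.2.
    return utf8_decode(preg_replace_callback('/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/', $toBytes, $str));
}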

Example here: http://phpfiddle.org/main/code/6sq-rkn

Edit:

If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:

function fixBadUnicodeForJson($str) {
    // Turn every \u00XX escape back into the raw byte XX, so runs of two, three
    // or four escapes become the UTF-8 sequence they were made from.
    // The /e modifier was removed in PHP 7, hence preg_replace_callback.
    return preg_replace_callback('/\\\\u00([0-9a-f]{2})/', function ($m) {
        return chr(hexdec($m[1]));
    }, $str);
}

Edit 2: fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.

Be careful: some of these characters, which probably come from an editor such as Word, cannot be represented in ISO-8859-1 and will therefore appear as '?' after utf8_decode.
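As an alternative sketch (not the approach used in this answer), the same repair can be done after decoding instead of on the raw JSON text: decode first, then map each code point in the range U+0000 to U+00FF back to the single byte it was made from. This assumes the mbstring extension and that the value really was double-encoded; undoDoubleEncoding is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reverses "each UTF-8 byte was escaped as its own code point".
function undoDoubleEncoding(string $s): string
{
    // Converting UTF-8 -> ISO-8859-1 turns each code point U+0000..U+00FF
    // into the single byte with the same value, recovering the original bytes.
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

$decoded = json_decode('"\u00c2\u00a3"');   // "Â£" as UTF-8: c3 82 c2 a3
$fixed   = undoDoubleEncoding($decoded);    // bytes c2 a3, which is "£" in UTF-8

echo bin2hex($decoded), "\n";   // c382c2a3
echo bin2hex($fixed), "\n";     // c2a3
```

Only apply this to values known to be double-encoded: a correctly encoded string containing characters above U+00FF would be damaged by the conversion.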


