Decode Utf-8 Encoding in Json String

How to decode json string as UTF-8?

A quick aside first: UTF-8 is typically an external format, represented as an array of bytes. It's what you might send over the network as part of an HTTP response. Internally, Dart stores strings as sequences of UTF-16 code units. The utf8 encoder/decoder converts between internal strings and external byte arrays.

This is why you are using utf8.decode(response.bodyBytes): it takes the raw body bytes and converts them to an internal string. (response.body basically does this too, but it chooses the bytes-to-string decoder based on the charset in the response's Content-Type header. When no charset is given (as is often the case), the http package falls back to Latin-1, which doesn't work if the response is actually in a different charset.) By calling utf8.decode yourself, you override the (potentially wrong) choice made by http because you know that this particular server always sends UTF-8. (It may not, of course!)
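
The effect of picking the wrong decoder is easy to reproduce outside Dart. The following is a minimal Python sketch, not the http package's actual code; the sample JSON body is made up for illustration. It decodes the same UTF-8 bytes once as Latin-1 and once as UTF-8.

```python
# UTF-8 bytes as they might arrive in an HTTP response body (sample data).
body_bytes = '{"name": "Horníková"}'.encode("utf-8")

# Decoding with the wrong charset does not fail; it silently produces mojibake,
# because every byte is a valid Latin-1 character.
as_latin1 = body_bytes.decode("latin-1")

# Decoding with the charset the server actually used gives the intended text.
as_utf8 = body_bytes.decode("utf-8")

print(as_latin1)
print(as_utf8)
```

Note that the Latin-1 decode raises no error, which is why this kind of bug usually shows up only as garbled characters on screen.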

Another aside: setting a Content-Type header on a request is rarely useful. On a typical GET you aren't sending any content, so it doesn't have a type! Nor does it influence the content type or charset that the server will send back to you. The Accept header might be what you are looking for. It's a hint to the server about what type of content you'd like back, but not all servers respect it.
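
As a rough illustration in Python (the URL is a placeholder, and nothing is actually sent), a body-less GET request that states its preference with Accept rather than Content-Type might look like this:

```python
import urllib.request

# Hypothetical endpoint, used only for illustration; the request is never sent.
url = "https://example.com/api/items"

# A GET request has no body, so there is no content to describe with Content-Type.
# Accept states which response type the client would prefer.
request = urllib.request.Request(url, headers={"Accept": "application/json"})

print(request.get_method())
print(request.get_header("Accept"))
```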

So why are your special characters still incorrect? Try printing utf8.decode(response.bodyBytes) before passing it to the JSON decoder. Does it look right in the console? (It is very useful to create a simple Dart command line application for this type of issue; I find it easier to set breakpoints and inspect variables in a simple ten-line Dart app.) Try using something like Wireshark to capture the bytes on the wire (again, it is useful to have the simple Dart app for this). Or try using Postman to send the same request and inspect the response.

How are you trying to show the characters? It may simply be that the font you are using doesn't have them.
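
One way to tell a decoding problem from a display problem is to look at code points rather than glyphs. The sketch below is in Python for brevity, and describe is a hypothetical helper name. It prints the raw bytes and the code points produced by each decoder; if the code points are right but the screen is wrong, the font or renderer is the likely culprit.

```python
def describe(text: str) -> list[str]:
    """Return one 'U+XXXX' label per character, independent of any font."""
    return [f"U+{ord(ch):04X}" for ch in text]


raw = "í".encode("utf-8")  # sample data: the two bytes C3 AD

print(raw.hex(" "))                     # the bytes as received
print(describe(raw.decode("utf-8")))    # a single code point when decoded correctly
print(describe(raw.decode("latin-1")))  # two code points when decoded as Latin-1
```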

Decode UTF-8 encoding in JSON string

Your text is already UTF-8 encoded. For a literal you would tell Python so with a b prefix, but since you're using json and reading the input as a string, you have to undo the encoding manually. Because the input is a str rather than bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding its contents, and pass the same codec to open() to prevent it from applying its own default encoding. The resulting bytes can then simply be decoded as UTF-8 to get the desired result.

Note that since the encoding and decoding have to be applied to the file content, you have to read the file yourself and transform the loaded string, so you should use json.loads() instead of json.load().

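import json

# Read the file with 'raw_unicode_escape', turn the text back into bytes
# with the same codec, then decode those bytes as UTF-8 before parsing.
with open('test.json', encoding='raw_unicode_escape') as f:
    d = json.loads(f.read().encode('raw_unicode_escape').decode())

print(d)  # {'sender_name': 'Horníková'}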
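
A different way to get the same result, shown here only as a sketch, is to parse the JSON first and repair each string afterwards. It assumes the file contains UTF-8 bytes written as one \u00XX escape per byte (the sample string below is a guess at what test.json holds), and fix_mojibake is a hypothetical helper name.

```python
import json

# Assumed sample input: UTF-8 bytes written out as one \u00XX escape per byte.
raw_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'


def fix_mojibake(value):
    """Recursively reinterpret each string's code points as bytes and decode them as UTF-8."""
    if isinstance(value, str):
        return value.encode("latin-1").decode("utf-8")
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value


d = fix_mojibake(json.loads(raw_json))
print(d)
```

This variant raises UnicodeEncodeError if a string contains characters above U+00FF, so it only suits data that is damaged in this particular way throughout.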

How to ensure that the JSON string is UTF-8 encoded in Java

You need to set the character encoding for OutputStreamWriter when you create it:

httpConn.connect();
wr = new OutputStreamWriter(httpConn.getOutputStream(), StandardCharsets.UTF_8);
wr.write(jsonObject.toString());
wr.flush();

Otherwise it defaults to the "platform default encoding," which on Java versions before 18 is whatever encoding has historically been used for text files on the system you are running on, and is not necessarily UTF-8.
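
The same principle applies in other languages: name the encoding at the point where text becomes bytes. A minimal Python sketch follows, using an in-memory buffer in place of the HTTP connection's output stream; the payload is sample data.

```python
import io
import json

payload = {"sender_name": "Horníková"}  # sample data

# Wrap the byte sink in a text writer with an explicit encoding instead of
# relying on whatever the platform default happens to be.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(json.dumps(payload, ensure_ascii=False))
writer.flush()

sent_bytes = buffer.getvalue()
print(sent_bytes)
```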

encode string from json utf-8

Assuming you are running this code on Windows, then the problem is two-fold:

  • you are not telling TStringList.LoadFromFile() what the encoding of the file is. So, unless the file begins with a UTF-8 BOM (which is unlikely with a JSON file), it will be decoded as ANSI, not as UTF-8, thus corrupting any non-ASCII characters.

  • you are converting the decoded text back into bytes without specifying an encoding. The overload of ParseJSONValue() you are using expects UTF-8 encoded bytes, but BytesOf() will encode to ANSI, not to UTF-8, thus corrupting non-ASCII characters even further.

That is why you are getting garbage text from the JSON.

There are other problems with your code, too. Namely, a memory leak and a double-free, due to you mismanaging the initial TJSONObject.

Try this instead.

procedure TForm1.jsonTest;
var
  JSONData, JSON: TJSONObject;
  jArr: TJSONArray;
  s: TStringList;
  i, j: Integer;
  jValue: TJSONValue;
  data: string;
begin
  // Decode the file explicitly as UTF-8 instead of relying on the ANSI default
  s := TStringList.Create;
  try
    s.LoadFromFile('clientOrders.json', TEncoding.UTF8);
    data := s.Text;
  finally
    s.Free;
  end;
  { Alternatively:
  data := IOUtils.TFile.ReadAllText('clientOrders.json', TEncoding.UTF8);
  }
  // ParseJSONValue() expects UTF-8 bytes, so encode explicitly as UTF-8
  jValue := TJSONObject.ParseJSONValue(TEncoding.UTF8.GetBytes(data), 0);
  if jValue = nil then
    raise Exception.Create('This is not a JSON');
  try
    JSON := jValue as TJSONObject;
    jArr := JSON.Get(0).JsonValue as TJSONArray;
    for i := 0 to jArr.Size - 1 do
    begin
      JSONData := jArr.Get(i) as TJSONObject;
      for j := 0 to JSONData.Size - 1 do
      begin
        ShowMessage(JSONData.Get(j).JsonValue.ToString);
      end;
    end;
  finally
    // Freeing the root value also frees everything it owns
    jValue.Free;
  end;
end;

Alternatively, don't decode the file bytes into a string just to convert them back into bytes. Load them as-is into ParseJSONValue() instead, e.g.:

procedure TForm1.jsonTest;
var
  ...
  jValue: TJSONValue;
  data: TBytesStream;
begin
  data := TBytesStream.Create;
  try
    data.LoadFromFile('clientOrders.json');
    // Bytes may be larger than the loaded content, so pass the actual size
    jValue := TJSONObject.ParseJSONValue(data.Bytes, 0, data.Size);
    ...
  finally
    data.Free;
  end;
end;

Or:

procedure TForm1.jsonTest;
var
  ...
  jValue: TJSONValue;
  data: TBytes;
begin
  data := IOUtils.TFile.ReadAllBytes('clientOrders.json');
  jValue := TJSONObject.ParseJSONValue(data, 0);
  ...
end;
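
For comparison, the "hand the parser the raw bytes" idea can be sketched in Python as well. The file content below is made up, and a temporary directory stands in for wherever clientOrders.json really lives.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clientOrders.json"
    # Made-up sample content, written as UTF-8 bytes.
    path.write_bytes('{"orders": [{"client": "Horníková"}]}'.encode("utf-8"))

    # No intermediate string: the parser receives the UTF-8 bytes directly.
    parsed = json.loads(path.read_bytes())

print(parsed)
```

Skipping the decode and re-encode round trip removes both places where a wrong default encoding could creep in.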

PHP json_encode json_decode UTF-8

This is an encoding issue. It looks like at some point, the data gets represented as ISO-8859-1.

Every part of your process needs to be UTF-8 encoded.

  • The database connection

  • The database tables

  • Your PHP file (if you are using special characters inside that file as shown in your example above)

  • The content-type headers that you output

Not able to decode UTF-8 encoded json text with Cpanel::JSON::XS

decode_json expects UTF-8, but you are providing decoded text (a string of Unicode Code Points).

Use

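use utf8;
use Encode qw( encode_utf8 );
use Cpanel::JSON::XS qw( decode_json );

# Encode the decoded literal to UTF-8 bytes before handing it to decode_json
my $json_utf8 = encode_utf8( '{ "title": "Outlining — How to outline" }' );

my $data = decode_json( $json_utf8 );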

or

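use utf8;
use Cpanel::JSON::XS qw( decode_json );

# With "no utf8", the literal stays as the raw UTF-8 bytes of the source file
my $json_utf8 = do { no utf8; '{ "title": "Outlining — How to outline" }' };

my $data = decode_json( $json_utf8 );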

or

use utf8;
use Cpanel::JSON::XS;

my $json_ucp = '{ "title": "Outlining — How to outline" }';

my $data = Cpanel::JSON::XS->new->decode( $json_ucp ); # Implied: ->utf8(0)

(The middle one seems hackish to me. The first one might be used if you get data from multiple sources, and the others provide it encoded.)
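
The distinction between decoded text and UTF-8 bytes is the same in any language. The Python sketch below is only an analogy, not Cpanel::JSON::XS behaviour: Python's json.loads accepts both forms, whereas decode_json accepts only the encoded one. It shows that the two inputs are different objects of different lengths even though they describe the same document.

```python
import json

text = '{ "title": "Outlining — How to outline" }'  # decoded text (code points)
utf8_bytes = text.encode("utf-8")                    # the same document as UTF-8 bytes

# The dash is one character in the text but three bytes in UTF-8.
print(len(text), len(utf8_bytes))

from_text = json.loads(text)
from_bytes = json.loads(utf8_bytes)
print(from_text == from_bytes)
```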


