JSON Character Encoding

What does Content-type: application/json; charset=utf-8 really mean?

The header denotes how the content is encoded. It is not always possible to deduce the type of the content from the content itself, i.e. you can't necessarily just look at the content and know what to do with it. That's what HTTP headers are for: they tell the recipient what kind of content they're (supposedly) dealing with.

Content-type: application/json; charset=utf-8 designates the content to be in JSON format, encoded in the UTF-8 character encoding. Designating the encoding is somewhat redundant for JSON: the current JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems, and the application/json media type defines no charset parameter. So in this case the receiving server is apparently happy knowing that it's dealing with JSON and assumes that the encoding is UTF-8 by default, which is why it works with or without the header.

Does this encoding limit the characters that can be in the message body?

No. You can send anything you want in the header and the body. But if the two don't match, you may get wrong results. If you specify in the header that the content is UTF-8 encoded but you're actually sending Latin1 encoded content, the receiver may produce garbage data, trying to interpret Latin1 encoded data as UTF-8. Of course, if you specify that you're sending Latin1 encoded data and you're actually doing so, then yes, you're limited to the 256 characters you can encode in Latin1.
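To make the mismatch concrete, here is a minimal Java sketch that encodes a small JSON text with one charset and decodes it with the other. The class and method names are made up for illustration; only the standard library is used.

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Sender used Latin1, receiver assumes UTF-8.
    static String latin1BytesReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // Byte sequences that are not valid UTF-8 are replaced with U+FFFD.
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Sender used UTF-8, receiver assumes Latin1.
    static String utf8BytesReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Every byte maps to some Latin1 character, so nothing fails; the text is just wrong.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"caf\u00e9\"}";
        System.out.println(latin1BytesReadAsUtf8(json));
        System.out.println(utf8BytesReadAsLatin1(json));
    }
}
```

In the second direction the two UTF-8 bytes of é show up as the two characters Ã©, which is the classic symptom of this kind of mismatch.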

JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for JavaScript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.

The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.

Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
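As a rough sketch of that defence, the following Java helper rewrites <, > and & as numeric escape sequences after the JSON has been serialized. It assumes its input is already valid JSON, where these characters can only occur inside string values; the class and method names are hypothetical.

```java
final class HtmlSafeJson {
    // Replace HTML-sensitive characters with their six-character JSON escapes.
    // Assumes the input is valid JSON, so these characters only occur inside strings.
    static String escapeHtmlSensitive(String json) {
        StringBuilder out = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            switch (c) {
                case '<': out.append("\\u003c"); break;
                case '>': out.append("\\u003e"); break;
                case '&': out.append("\\u0026"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

A decoder sees exactly the same string values afterwards, because the escaped and unescaped forms are equivalent in JSON.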

Some frameworks, including PHP's json_encode() (by default; the JSON_UNESCAPED_UNICODE flag turns this off), always emit numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.

So, I guess you could just decide which to use like this:

  • Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.

  • Otherwise, use the numeric escape sequences.
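If you do need the second option and your encoder doesn't offer it, a post-processing step along these lines is one possibility. This is only a sketch with hypothetical names; it walks the string one UTF-16 code unit at a time, so characters outside the BMP come out as two escapes (a surrogate pair), as JSON requires.

```java
final class AsciiOnlyJson {
    // Rewrite every non-ASCII UTF-16 code unit as a backslash-u escape with four hex digits.
    // Assumes valid JSON input, where non-ASCII characters only occur inside strings.
    static String escapeNonAscii(String json) {
        StringBuilder out = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The result is pure ASCII, so it survives transports that mangle bytes above 0x7F.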

JSON character encoding in javascript different from java

There are two different things happening: Unicode encoding and JSON string escaping.

Per section 2.5 (Strings) of the JSON RFC (RFC 4627):

.. All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped ..

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence .. [and characters outside the BMP are escaped as UTF-16 encoded surrogate pairs]

That is, the JSON strings of "•é" and "\u2022é" are equivalent. It is entirely up to the serialization implementation on which (additional) characters to escape, and both forms are valid.

It is this JSON string (which is Unicode text) that can be encoded when converted to a byte-stream. In the example it's encoded via UTF-8 encoding. Two JSON strings may then be equivalent without being byte-equivalent at the stream level or character-equivalent at the JSON text level.
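The following Java sketch illustrates the point with the two forms "•é" and "\u2022é". The decoder here is deliberately simplistic and hypothetical: it only understands backslash-u escapes, ignores every other JSON escape, and throws on malformed hex digits, so it is not a replacement for a real JSON parser.

```java
import java.nio.charset.StandardCharsets;

final class EquivalentJsonStrings {
    // Simplified: resolves only backslash-u escapes followed by four hex digits.
    static String decodeUnicodeEscapes(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Number of bytes the JSON text occupies once encoded as UTF-8.
    static int utf8Length(String jsonText) {
        return jsonText.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Both forms decode to the same two characters, yet the unescaped text is 7 bytes in UTF-8 and the escaped one is 10.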


As for JSONObject, it escapes a character c when the following condition holds:

    c < ' '
        || (c >= '\u0080' && c < '\u00a0')
        || (c >= '\u2000' && c < '\u2100')

One reason the characters in the range [\u2000, \u2100) may be escaped is to ensure the resulting JSON is also valid JavaScript. The article "JSON: The JavaScript subset that isn't" discusses the issue: the Unicode code points \u2028 and \u2029 were treated as line terminators inside JavaScript string literals (until ES2019 allowed them there), but not in JSON. (There are other Unicode separator characters in the range, so the rule might as well catch them in one go.)
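Written as a standalone predicate, the JSONObject rule quoted above explains why • is escaped while é is left alone. The class and method names below are only for illustration, and characters such as " and \ that JSON always requires escaping are not covered by this condition.

```java
final class JsonObjectEscapeRule {
    // Mirrors the range check quoted above; true means the character is written as an escape.
    static boolean isEscaped(char c) {
        return c < ' '
            || (c >= '\u0080' && c < '\u00a0')
            || (c >= '\u2000' && c < '\u2100');
    }

    public static void main(String[] args) {
        System.out.println(isEscaped('\u2022')); // bullet, inside the third range
        System.out.println(isEscaped('\u00e9')); // e with acute accent, outside all ranges
        System.out.println(isEscaped('\u2028')); // line separator, inside the third range
    }
}
```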

How to ensure that the JSON string is UTF-8 encoded in Java

You need to set the character encoding for OutputStreamWriter when you create it:

    httpConn.connect();
    // Pass the charset explicitly so the JSON text is written as UTF-8 bytes
    wr = new OutputStreamWriter(httpConn.getOutputStream(), StandardCharsets.UTF_8);
    wr.write(jsonObject.toString());
    wr.flush();

Otherwise it defaults to the "platform default encoding," which is some encoding that has been used historically for text files on whatever system you are running.
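To see what the charset argument changes without needing a network connection, here is a small self-contained sketch that writes into a ByteArrayOutputStream instead of an HTTP connection. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class ExplicitCharsetWrite {
    // Writes the JSON text through an OutputStreamWriter with an explicit charset
    // and returns the bytes that would go over the wire.
    static byte[] write(String json, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer wr = new OutputStreamWriter(sink, charset)) {
            wr.write(json);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

With UTF-8, the character é is written as the two bytes C3 A9; with ISO-8859-1 it is the single byte E9, which a UTF-8 decoder on the other side would reject or replace.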

JSON request with accents/latin characters

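As far as MIME is concerned, JSON is a binary format with no separate concept of text encoding (as can be deduced from its media type starting with application/ rather than text/). JSON is always encoded as Unicode (UTF-8, UTF-16 or UTF-32), as is very clear from the specification (section 8.1). The newer RFC 8259 narrows this further and requires UTF-8 for JSON exchanged between systems.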

It may be that the server sends you invalid JSON (incorrectly coded as Latin-1 which will probably make it look like bad UTF-8 to the parser). The remedy would then be

  1. Fix the server.
  2. If failing 1., you need some kind of hack:
    1. Convert NSData to NSString using Latin1 character encoding
    2. Convert NSString to NSData using UTF-8 character encoding
    3. Parse JSON
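The steps above are phrased in terms of NSData and NSString; the same workaround can be sketched in Java as follows. This assumes the server really sent ISO-8859-1 (if it is actually Windows-1252, bytes in the 0x80 to 0x9F range would need that charset instead), and the names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Repair {
    // Step 1: interpret the raw response bytes as Latin1 text.
    // Step 2: re-encode that text as UTF-8 so a JSON parser can read it.
    static byte[] latin1ToUtf8(byte[] latin1Body) {
        String text = new String(latin1Body, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

In Java the re-encoding step is often unnecessary, since most JSON parsers accept the decoded String directly; it is shown here to mirror the steps listed above.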

How to Set Character Encoding to UTF-8 for JSONObject in the Android

Updated in 2021

Java

I finally solved my problem.

Final code:

    @Override
    protected Boolean doInBackground(String... urls) {
        try {
            HttpGet httpGet = new HttpGet(urls[0]);
            HttpClient httpClient = new DefaultHttpClient();
            HttpResponse response = httpClient.execute(httpGet);
            int status = response.getStatusLine().getStatusCode();

            // Change the status check to:
            if (status == HttpStatus.SC_OK) {
                // Edit the response handling so the body is decoded explicitly as UTF-8:
                String data = EntityUtils.toString(response.getEntity(), cz.msebera.android.httpclient.protocol.HTTP.UTF_8);
                JSONObject jsono = new JSONObject(data);
                JSONArray jarray = jsono.getJSONArray("news");
                for (int i = 0; i < jarray.length(); i++) {
                    JSONObject object = jarray.getJSONObject(i);

                    News news = new News();

                    news.setTitle(object.getString("title"));
                    news.setDescription(object.getString("description"));
                    news.setDate(object.getString("date"));
                    news.setImage(object.getString("image"));

                    newsList.add(news);
                }
                return true;
            }
        } catch (ParseException e1) {
            e1.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (JSONException e) {
            e.printStackTrace();
        }
        return false;
    }
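The key step in the code above is decoding the response body as UTF-8 instead of relying on a default. For comparison, here is a sketch of the same idea using only the standard library; it is not part of the original answer, and the class and method names are hypothetical. It works on any InputStream, for example the one returned by an HTTP connection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8BodyReader {
    // Reads the whole stream and decodes it as UTF-8, regardless of the platform default.
    static String readUtf8(InputStream body) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(body, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The returned String can then be handed to new JSONObject(...) exactly as in the answer's code.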

Kotlin

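    fun doInBackground(vararg urls: String?): Boolean? {
        try {
            val httpGet = HttpGet(urls[0])
            val httpClient: HttpClient = DefaultHttpClient()
            val response: HttpResponse = httpClient.execute(httpGet)
            val status: Int = response.getStatusLine().getStatusCode()
            if (status == HttpStatus.SC_OK) {
                // Decode the body explicitly as UTF-8, mirroring the Java version
                val data = EntityUtils.toString(response.getEntity(), cz.msebera.android.httpclient.protocol.HTTP.UTF_8)
                // ... build newsList from JSONObject(data) as in the Java version ...
                return true
            }
        } catch (e: ParseException) {
            e.printStackTrace()
        } catch (e: IOException) {
            e.printStackTrace()
        } catch (e: JSONException) {
            e.printStackTrace()
        }
        return false
    }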

How to escape special characters in building a JSON string?

A JSON string must be double-quoted, according to the specs, so you don't need to escape '.

If you have to use a special character in your JSON string, you can escape it with the \ character.

See this list of escape sequences used in JSON:

\b      Backspace (ASCII code 08)
\f      Form feed (ASCII code 0C)
\n      New line
\r      Carriage return
\t      Tab
\"      Double quote
\\      Backslash character
\/      Forward slash (escaping it is optional)
\uXXXX  Any character, given as four hexadecimal digits


However, even though it is contrary to the spec, the author could use \'.

This is bad because:

  • It IS contrary to the specs
  • The result is no longer a valid JSON string

It may still work with lenient parsers, whether you like it or not, but strict parsers such as JavaScript's JSON.parse reject it.

For new readers: always use double quotes for your JSON strings.
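If you are building JSON strings by hand rather than through a library, the escaping rules above can be sketched in Java roughly like this. The class and method names are hypothetical, and a real JSON library is usually the safer choice.

```java
final class JsonStringEscaper {
    // Wraps the text in double quotes and escapes what JSON requires:
    // the double quote, the backslash, and control characters below U+0020.
    static String quote(String s) {
        StringBuilder out = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b");  break;
                case '\f': out.append("\\f");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters have no short form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

Note that the single quote passes through unchanged, since it needs no escaping inside a double-quoted JSON string.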


