How to Send Non-English Unicode String Using Http Header

How to send non-English unicode string using HTTP header?

Is it not POSSIBLE or ALLOWED to send non-English string using HTTP Header?

Per the HTTP standards, it's not possible to put non-ISO-8859-1 characters directly in an HTTP header. ISO-8859-1 gives you ASCII ("English"?) characters plus the common Western European accented letters.

However in practice you can't even use the extended ISO-8859-1 characters, because servers and browsers don't agree on what to do with non-ASCII characters in headers. Safari takes RFC2616 at its word and treats high bytes as ISO-8859-1 characters; Mozilla takes UTF-16 code unit low bytes, which is similar but weirder; Opera and Chrome decode from UTF-8; IE uses the local system code page.

So in reality all you can put in an HTTP header is simple ASCII with no control codes. If you want anything more, you'll have to come up with an encoding scheme (e.g. UTF-8 + Base64). RFC 2616 suggests RFC 2047 encoded-words as a standard form of encoding, but this makes no sense given RFC 2047's own definitions of where encoded-words are allowed, and nothing supports it.
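As a minimal sketch of the "UTF-8 + Base64" idea mentioned above: the text is turned into UTF-8 bytes, and those bytes into Base64, which is plain ASCII and safe in a header value. The class name HeaderValueCodec and its methods are hypothetical, not part of any standard or library, and both ends of the connection would have to agree on the scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: an ad hoc scheme, not defined by any HTTP standard.
final class HeaderValueCodec {

    // Text -> UTF-8 bytes -> Base64 (ASCII letters, digits, '+', '/', '=').
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverse the two steps on the receiving side.
    static String decode(String headerValue) {
        return new String(Base64.getDecoder().decode(headerValue), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Arabic "marhaban", written with escapes so the source file stays ASCII.
        String original = "\u0645\u0631\u062d\u0628\u0627";
        String wire = encode(original);
        System.out.println(wire);                           // ASCII-only header value
        System.out.println(decode(wire).equals(original));  // prints "true"
    }
}
```

The receiver has to know that the header is encoded this way; nothing in the header itself announces it.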

Sending UTF-8 values in HTTP headers results in Mojibake

HTTP headers don't support UTF-8. They officially support ISO-8859-1 only. See also RFC 2616 section 2:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

Your best bet is to URL-encode and decode them.

response.setHeader("Info", URLEncoder.encode(arabicWord, "UTF-8"));

and

String arabicWord = URLDecoder.decode(response.getHeader("Info"), "UTF-8");

URL-encoding transforms them into %nn escapes, which are plain ASCII and therefore perfectly valid ISO-8859-1. Note that data sent in headers may be subject to size limits. It is better to send it in the response body, in plain text, JSON, CSV or XML format. Using custom HTTP headers this way is a design smell.
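The round trip above can be checked without a servlet container. The sketch below is a standalone illustration: the class name PercentEncodedHeader is hypothetical, and it assumes Java 10 or later for the Charset overloads of URLEncoder.encode and URLDecoder.decode (the String overloads shown above work on older versions). Note that URLEncoder implements form encoding, so a space becomes "+" rather than "%20"; URLDecoder reverses that.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper wrapping the encode/decode pair used for the header value.
final class PercentEncodedHeader {

    // Non-ASCII characters become %nn escapes of their UTF-8 bytes.
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Turns %nn escapes (and '+') back into the original text.
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wire = encode("caf\u00e9 menu");   // "café menu", escaped to keep the source ASCII
        System.out.println(wire);                  // prints "caf%C3%A9+menu"
        System.out.println(decode(wire).equals("caf\u00e9 menu")); // prints "true"
    }
}
```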

Current state of non-ascii values in http headers?

All current browsers implement RFC 8187 - you probably did something wrong. It would be helpful if you posted an example field value generated by your code.

How to encode the filename parameter of Content-Disposition header in HTTP?

There is discussion of this, including links to browser testing and backwards compatibility, in RFC 5987, "Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters" (since obsoleted by RFC 8187).

RFC 2183 indicates that such headers should be encoded according to RFC 2184, which was obsoleted by RFC 2231, covered by the RFC above.
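To make the parameter encoding concrete, here is a rough sketch of the ext-value form defined by RFC 5987 and its successor RFC 8187: the charset name, two single quotes around an empty language tag, then the value with every byte outside the attr-char set percent-encoded from its UTF-8 form. The class ExtValueEncoder and its method names are hypothetical, and the "attachment" disposition type is only an example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch of the ext-value syntax, not a full implementation.
final class ExtValueEncoder {

    // attr-char: characters that may appear unescaped in an ext-value.
    private static final String ATTR_CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$&+-.^_`|~";

    // Builds UTF-8''<percent-encoded value> with an empty language tag.
    static String encode(String value) {
        StringBuilder out = new StringBuilder("UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (ATTR_CHARS.indexOf(octet) >= 0) {
                out.append((char) octet);                    // safe ASCII, copy as-is
            } else {
                out.append(String.format("%%%02X", octet));  // everything else as %XX
            }
        }
        return out.toString();
    }

    // The trailing '*' on the parameter name marks the value as an ext-value.
    static String contentDisposition(String filename) {
        return "attachment; filename*=" + encode(filename);
    }

    public static void main(String[] args) {
        // "\u20ac" is the euro sign; prints: attachment; filename*=UTF-8''%E2%82%AC%20rates.txt
        System.out.println(contentDisposition("\u20ac rates.txt"));
    }
}
```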

Http GET of source containing non-UTF-8 characters

getResponseBodyAsString() uses the HTTP response's Content-Type header to know what the response body's charset is so the data can be converted to a String as needed. getResponseBody() simply returns the body's raw bytes as-is, which you are then converting to a String using the platform's default charset. Since you are able to get the desired String output by converting the raw bytes manually, that suggests to me that the HTTP server is not specifying a charset in the response's Content-Type header at all, or is specifying the wrong charset.

YÃ¡Ã±ez is what the UTF-8 encoded bytes of Yáñez look like when they are decoded as ISO-8859-1, so it is odd that the String(byte[]) constructor would be able to decode the raw bytes correctly, unless the platform's default charset is actually UTF-8. It does make sense for getResponseBodyAsString() to return YÃ¡Ã±ez if the response charset used is ISO-8859-1, which is the default charset for text/... media types sent over HTTP when no charset is explicitly specified, per RFC 2616 Section 3.7.1.
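The mismatch can be reproduced locally. The sketch below encodes Yáñez as UTF-8 and decodes those bytes as ISO-8859-1, which yields YÃ¡Ã±ez, then reverses the mistake. The class and method names are hypothetical, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how the garbled name arises and how to undo it.
final class MojibakeDemo {

    // What a client does when it assumes ISO-8859-1 for a body that is really UTF-8.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Recover the original: get the raw bytes back, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Y\u00e1\u00f1ez";                 // Yáñez
        String garbled = misdecode(name);
        // Each accented letter became two characters: U+00C3 followed by U+00A1 or U+00B1.
        System.out.println(garbled.equals("Y\u00c3\u00a1\u00c3\u00b1ez")); // prints "true"
        System.out.println(repair(garbled).equals(name));                  // prints "true"
    }
}
```

The repair step only works because ISO-8859-1 maps every byte to a character, so no information is lost in the wrong decoding.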

I would suggest looking for a bug in the server script that is sending the data (or filing a bug report with the server admin) before suspecting a bug in getResponseBodyAsString(). You can use a packet sniffer like Wireshark, or a debugging proxy like Fiddler, to confirm the missing or invalid charset in the response Content-Type header.


