Linux to Windows Bad Encoding Response


UTF-8 is designed to encode the Unicode character set. It can't, in general, be used to carry arbitrary binary data, because most byte sequences are not valid UTF-8, and (depending on the implementation) decoders may reject or mangle them.

You need to pass the request from PHP to your C# program in raw binary form, or in an encoding such as Base64, which is designed for arbitrary binary data.
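As a sketch of the Base64 route (shown in Java for illustration; the same round trip works with PHP's base64_encode() and C#'s Convert.FromBase64String()):

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        // Arbitrary binary data, including bytes that are not valid UTF-8
        byte[] binary = {(byte) 0x80, (byte) 0xFF, 0x00, 0x41};

        // Encode to a plain ASCII string that survives any text transport
        String encoded = Base64.getEncoder().encodeToString(binary);
        System.out.println(encoded); // gP8AQQ==

        // The receiving side decodes back to the exact original bytes
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(binary, decoded)); // true
    }
}
```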

Http response decoding behaves differently from Windows to Linux

The problem came from Linux having its default charset set to UTF-8.

Adding the argument -Dfile.encoding=ISO-8859-1 to $CATALINA_OPTS in Tomcat's config solved my problem.
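One common place to set this is a setenv.sh script, which Tomcat sources on startup if it exists (the exact path and your choice of encoding may differ per installation):

```shell
# $CATALINA_BASE/bin/setenv.sh — create it if it doesn't exist
CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=ISO-8859-1"
export CATALINA_OPTS
```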

Linux using command file -i returns wrong value charset=unknown-8bit for a windows-1252 encoded file

It's important to understand what a character encoding is and isn't.

A text file is actually just a stream of bits; or, since we've mostly agreed that there are 8 bits in a byte, a stream of bytes. A character encoding is a lookup table (and sometimes a more complicated algorithm) for deciding what characters to show to a human for that stream of bytes.

For instance, the character "€" encoded in Windows-1252 is the string of bits 10000000. That same string of bits will mean other things in other encodings - most encodings assign some meaning to all 256 possible bytes.
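This claim is easy to check; a small Java sketch encoding "€" (U+20AC) with two different lookup tables:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EuroBytes {
    public static void main(String[] args) {
        String euro = "\u20AC"; // "€"

        // In Windows-1252 the euro sign is the single byte 0x80 (10000000)
        byte[] cp1252 = euro.getBytes(Charset.forName("windows-1252"));
        System.out.println(Arrays.toString(cp1252)); // [-128], i.e. 0x80

        // In UTF-8 the very same character is three bytes: 0xE2 0x82 0xAC
        byte[] utf8 = euro.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // [-30, -126, -84]
    }
}
```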

If a piece of software knows that the file is supposed to be read as Windows-1252, it can look up a mapping for that encoding and show you a "€". This is how browsers are displaying the right thing: you've told them in the Content-Type header to use the Windows-1252 lookup table.

Once you save the file to disk, that "Windows-1252" label from the Content-Type header isn't stored anywhere. So any program looking at that file can see that it contains the string of bits 10000000, but it doesn't know what mapping table to look that up in. Nothing you do in the HTTP headers is going to change that - none of those affect how the file is saved on disk.

In this particular case the "file" command could look at the "encoding" marker inside the XML document, and find the "windows-1252" there. My guess is that it simply doesn't have that functionality. So instead it uses its general logic for guessing an encoding: it's probably something ASCII-compatible, because it starts with the bytes that spell <?xml in ASCII; but it's not ASCII itself, because it has bytes outside the range 00000000 to 01111111; anything beyond that is hard to guess, so it outputs "unknown-8bit".
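A rough Java re-creation of that guessing logic (a hypothetical sketch of the heuristic described above, not file(1)'s actual implementation):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class GuessEncoding {
    public static String guessEncoding(byte[] data) {
        // Any byte with the high bit set is outside 00000000..01111111
        boolean has8bit = false;
        for (byte b : data) {
            if ((b & 0x80) != 0) { has8bit = true; break; }
        }
        if (!has8bit) {
            return "us-ascii";
        }
        try {
            // If the bytes happen to form valid UTF-8, report that
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "utf-8";
        } catch (CharacterCodingException e) {
            // Some 8-bit encoding, but nothing says which one
            return "unknown-8bit";
        }
    }

    public static void main(String[] args) {
        // 0x80 is windows-1252 "€", but on its own it's just an unknown 8-bit byte
        System.out.println(guessEncoding(new byte[]{(byte) 0x80})); // unknown-8bit
    }
}
```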

mysql console (windows-linux), wrong character set?

Set PuTTY to interpret received data as UTF-8 in Window -> Translation "Character set on received data".

Java String encoding - Linux different than on Windows

Both machines have the same Locale in Java (Locale.getDefault()) -> I tried that already.

It is the default charset, not the default locale, that determines what character set is used when decoding/encoding a string without a specified charset.

Check what Charset.defaultCharset().name() returns on your Windows and Linux machines. I expect that they will be different, based on the symptoms that you are reporting.
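A one-liner to run on both machines for comparison:

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // This is what String.getBytes() and new String(byte[]) use
        // when no charset is passed explicitly
        System.out.println(Charset.defaultCharset().name());
        // Typically "UTF-8" on Linux; often "windows-1252" (or another
        // locale-dependent charset) on older Windows JVMs
    }
}
```

Note that since Java 18 (JEP 400) the default charset is UTF-8 on all platforms, which makes this class of bug less common on recent JVMs.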

Wrong text encoding when parsing json data

You're reading the data as ISO 8859-1, but the file is actually UTF-8. Pass the charset explicitly when you construct the reader instead of relying on the platform default.

Also: curl doesn't care about encodings - it just passes bytes through. The problem is really in your Java code.
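A minimal sketch of the fix - reading with an explicit charset (the file name data.json is hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadJsonUtf8 {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("data.json"); // hypothetical file name
        Files.writeString(file, "{\"name\":\"café\"}", StandardCharsets.UTF_8);

        // Fragile: new String(bytes) or new FileReader(...) on older JVMs
        // fall back to the platform default, e.g. ISO 8859-1 - which mangles
        // the multi-byte UTF-8 sequence for "é"

        // Explicit: name the charset, so the bytes decode the same way on
        // every machine
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(r.readLine()); // {"name":"café"}
        }
    }
}
```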
