Encoding Conversion in Java

Java convert encoding

A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE bytes interpreted as some single-byte encoding.

A patch would be

String string = StringEscapeUtils.unescapeHtml4(s); // static method, Apache Commons Lang/Text
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, StandardCharsets.UTF_16BE);

Now string should be a correct Unicode String.

System.out.println(string);

If the operating system, for instance, is in Cp1251, the Cyrillic text should be converted correctly.

  • The characters in s are presumably actually the bytes of UTF-16BE text
  • Getting the bytes of the string in a single-byte encoding performs no real conversion, since ISO-8859-1 maps every char value 0–255 to the same byte
  • Then make a String of those bytes as being UTF-16BE, which is internally converted to Unicode (actually UTF-16 too)
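The round trip above can be demonstrated end to end. The Cyrillic sample text here is just an illustration; the repair works because ISO-8859-1 maps every byte value losslessly to the char of the same value:

```java
import java.nio.charset.StandardCharsets;

public class Utf16beRepair {
    public static void main(String[] args) {
        String original = "Привет"; // what the text should say
        // Simulate the corruption: UTF-16BE bytes wrongly decoded as ISO-8859-1
        String mangled = new String(original.getBytes(StandardCharsets.UTF_16BE),
                                    StandardCharsets.ISO_8859_1);
        // The repair: ISO-8859-1 gives back exactly the original bytes
        byte[] rawBytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        String repaired = new String(rawBytes, StandardCharsets.UTF_16BE);
        System.out.println(repaired.equals(original)); // true
    }
}
```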

Encoding conversion in java

You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)

EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
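For example, with the StandardCharsets constants (available since Java 7) there are no charset names to mistype and no checked UnsupportedEncodingException to handle:

```java
import java.nio.charset.StandardCharsets;

public class CharsetExample {
    public static void main(String[] args) {
        String text = "café";
        // Encode with a Charset instance instead of a String name
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        // Decode with the same Charset instance
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text)); // true
    }
}
```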

See "URL Encoding (or: 'What are those "%20" codes in URLs?')".

Converting String from One Charset to Another

The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:

  1. Your input data is bytes in one encoding
  2. Your output data needs to be bytes in another encoding

In that case, it's straightforward:

byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));

If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
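The transcodeField method itself isn't shown here, but given the description it presumably boils down to a decode-then-encode pair; the body below is a guess at it, not the original code:

```java
import java.nio.charset.Charset;

public class Transcode {
    // Hypothetical reconstruction of the transcodeField method described above
    static byte[] transcodeField(byte[] source, Charset from, Charset to) {
        // Decode the bytes with the input charset, then re-encode with the
        // output charset; unmappable characters become the replacement byte ('?')
        return new String(source, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = "Här".getBytes(Charset.forName("ISO-8859-1"));
        byte[] utf8 = transcodeField(latin1,
                Charset.forName("ISO-8859-1"), Charset.forName("UTF-8"));
        System.out.println(utf8.length); // 4: 'ä' takes two bytes in UTF-8
    }
}
```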

However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:

String s = new String(source.getBytes(inputEncoding), outputEncoding);

This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are assumed to be in the specified encoding. Since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except where the two encodings use the same byte values, which is common for "normal" characters like abcd, but differs for more complex ones like the accented characters éêäöñ).

So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a Unicode-compatible encoding such as UTF-8, UTF-16 etc. is great: your characters will still be safe even if your String contains all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.

Now finally some examples of good and bad usage.

String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)

String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"

The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.

Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.

As a last issue, Charset and Encoding aren't the same thing, but they're very much related.

¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.


NOTE

It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.

// Input comes from network/file/other place and we have misconfigured the encoding 
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes of "Här är några merkkejä", wrongly decoded as ISO-8859-1
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char back as a single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate the String, decoding the bytes as UTF-8

If no characters were corrupted in input, the String can be "fixed" this way. However, the proper approach is to use the correct encoding when reading the input in the first place, rather than fixing it afterwards, especially if there's a chance of it becoming corrupted along the way.
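Reading the input with the right encoding up front might look like this (UTF-8 is an assumption here; use whatever encoding the source actually produces):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Stand-in for bytes arriving from a network/file source
        byte[] incoming = "Här är några merkkejä".getBytes(StandardCharsets.UTF_8);
        // Declare the encoding explicitly instead of relying on the platform default
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(incoming), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            System.out.println(line); // Här är några merkkejä — no repair step needed
        }
    }
}
```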

Interpret a string from one encoding to another in java

import java.nio.charset.Charset;

String encodedString = new String(originalString.getBytes(Charset.forName("ISO-8859-15")), Charset.forName("UTF-8"));

Note that this re-interprets bytes rather than converting the String; as explained above, it is only useful to repair a String that was decoded with the wrong charset in the first place.

String encoding (UTF-8) JAVA

According to the javadoc of String#getBytes(String charsetName):

Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.

And the documentation of String(byte[] bytes, Charset charset)

Constructs a new String by decoding the specified array of bytes using
the specified charset.

Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both methods to preserve the actual string value; i.e., your second example is wrong:

// This is wrong because you are calling getBytes() with default charset
// But converting those bytes to string using UTF-8 encoding. This will
// mostly work because default encoding is usually UTF-8, but it can fail
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
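The correct version uses the same charset on both sides, so the round trip is lossless:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String string1 = "Привет, 風水";
        // Encode and decode with the same charset; the value survives unchanged
        byte[] bytes = string1.getBytes(StandardCharsets.UTF_8);
        String string2 = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(string2.equals(string1)); // true
    }
}
```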

How do I convert between ISO-8859-1 and UTF-8 in Java?

In general, you can't do this losslessly. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause replacement characters (?) to appear in your text when unsupported characters are found.

To transcode text:

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
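A sketch of that stricter behavior using CharsetEncoder: configuring CodingErrorAction.REPORT makes encoding fail loudly instead of silently substituting '?':

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncode {
    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // 風水 cannot be represented in ISO-8859-1, so this throws
            encoder.encode(CharBuffer.wrap("風水"));
            System.out.println("encoded");
        } catch (CharacterCodingException e) {
            System.out.println("unmappable character detected");
        }
    }
}
```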


