Java convert encoding
A String in Java should always be correct Unicode. In your case you seem to have UTF-16BE bytes interpreted as some single-byte encoding.
A patch would be
// StringEscapeUtils comes from Apache Commons Text (or Commons Lang 3)
String string = StringEscapeUtils.unescapeHtml4(s);
byte[] b = string.getBytes(StandardCharsets.ISO_8859_1);
string = new String(b, StandardCharsets.UTF_16BE);
Now string should be a correct Unicode String.
System.out.println(string);
If, for instance, the operating system uses Cp1251, the Cyrillic text should be converted correctly.
- The characters in s are actually the bytes of UTF-16BE text, I guess
- By getting the bytes of the string in a single-byte encoding, hopefully no conversion takes place
- Then make a String from the bytes, treating them as UTF-16BE; internally they are converted to Unicode (actually UTF-16BE too)
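The three steps above can be checked with a small sketch. The class name Utf16Repair and the helper garble are made up for illustration, and the fault is simulated by decoding UTF-16BE bytes as ISO-8859-1; the real input may have been damaged differently.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Repair {
    // Simulates the assumed fault: UTF-16BE bytes decoded as a single-byte charset.
    static String garble(String original) {
        byte[] utf16 = original.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    // Undoes it: ISO-8859-1 maps chars 0..255 back to the same byte values,
    // so the original UTF-16BE bytes are recovered and can be decoded properly.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Cyrillic "Privet"
        String garbled = garble(original);
        System.out.println(garbled.length());                  // 12: one char per byte
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

ISO-8859-1 is used for the middle step because it round-trips every byte value unchanged, which is why "hopefully no conversion takes place" holds for it.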
Encoding conversion in java
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)
EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
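As a rough illustration of the typo concern: a misspelled charset name only fails at run time, while a StandardCharsets constant is checked by the compiler. The class name CharsetLookup and the helper isSupportedName are invented for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetLookup {
    // Returns false when the name is misspelled or not available in this JVM.
    static boolean isSupportedName(String name) {
        try {
            Charset.forName(name);
            return true;
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false;
        }
    }

    // Encodes and decodes with the same Charset instance.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "na\u00efve"; // "naive" with a diaeresis on the i
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(isSupportedName("UTF-8"));  // true
        System.out.println(isSupportedName("UTF-88")); // false: typo found only at run time
    }
}
```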
See "URL Encoding (or: 'What are those "%20" codes in URLs?')".
Converting String from One Charset to Another
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
- Your input data is bytes in one encoding
- Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF-8 to ASCII), those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
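The body of transcodeField isn't shown here, so the following is only a guess at what such a method might look like: decode the input bytes to a String, then encode that String in the target charset. The class name Transcode is made up; the signature follows the call above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Hypothetical stand-in for transcodeField: bytes in, bytes out.
    static byte[] transcodeField(byte[] in, Charset from, Charset to) {
        String text = new String(in, from); // decode: bytes -> Unicode
        return text.getBytes(to);           // encode: Unicode -> bytes
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // e-acute takes 2 bytes
        byte[] latin1 = transcodeField(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] ascii = transcodeField(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);

        System.out.println(utf8.length);   // 5
        System.out.println(latin1.length); // 4: e-acute is a single byte in ISO-8859-1
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?
    }
}
```

The last line shows the corruption described above: ASCII has no e-acute, so the encoder writes ? and the original character cannot be recovered.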
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs for more complex ones like the accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a Unicode-compatible encoding such as UTF-8, UTF-16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
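If you want to know up front whether a String survives a given encoding, one option is to ask a CharsetEncoder. This is just a sketch; the class name EncodableCheck and the helper fitsIn are invented for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {
    // True when every character of text can be represented in the charset.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String chinese = "Feng shui: \u98a8\u6c34"; // the two Chinese characters for feng shui
        String nordic = "H\u00e4r";                 // "Har" with a-umlaut

        System.out.println(fitsIn(chinese, StandardCharsets.UTF_8));     // true
        System.out.println(fitsIn(chinese, StandardCharsets.US_ASCII));  // false
        System.out.println(fitsIn(nordic, StandardCharsets.ISO_8859_1)); // true
        System.out.println(fitsIn(nordic, StandardCharsets.US_ASCII));   // false
    }
}
```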
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the Nordic characters, they use different bytes to represent them, and using the wrong encoding when decoding results in mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically a String is stored internally in the JVM as UTF-16 up to Java 8, and as either Latin-1 or UTF-16 (compact strings) from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However, the proper approach is to use the correct encoding when reading input, not fix it afterwards, especially if there's a chance of it becoming corrupted.
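For comparison, here is roughly what "use the correct encoding when reading" could look like. A ByteArrayInputStream stands in for the network or file source, and the class name ReadWithCharset and helper readFirstLine are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Decodes the stream with an explicit charset instead of the platform default.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sent = "H\u00e4r \u00e4r";                    // two a-umlauts
        byte[] wire = sent.getBytes(StandardCharsets.UTF_8); // what arrives on the wire

        String good = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String bad = readFirstLine(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(sent)); // true
        System.out.println(bad.length());      // 8: each a-umlaut became two chars
    }
}
```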
Interpret a string from one encoding to another in java
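import java.nio.charset.Charset;
// Only meaningful when originalString holds UTF-8 bytes that were wrongly decoded as ISO-8859-15
String encodedString = new String(originalString.getBytes(Charset.forName("ISO-8859-15")), Charset.forName("UTF-8"));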
String encoding (UTF-8) JAVA
According to the javadoc of String#getBytes(String charsetName):
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
And the documentation of String(byte[] bytes, Charset charset):
Constructs a new String by decoding the specified array of bytes using
the specified charset.
Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array and converts it back to a string. You have to use the same charset for both methods to preserve the actual string value. I.e. your second example is wrong:
// This is wrong because you are calling getBytes() with the default charset
// but converting those bytes to a string using UTF-8. This will mostly
// work because the default encoding is usually UTF-8, but it can fail,
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
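A quick way to see why the charsets have to match is to make both of them explicit and compare. The class name MatchingCharsets and the helper roundTrip are made up for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MatchingCharsets {
    // Encodes with one charset and decodes with another (possibly the same).
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u00e9t\u00e9"; // French "ete" with two e-acutes

        // Same charset on both sides: the value is preserved.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text)); // true

        // Different charsets: the ISO-8859-1 bytes are not valid UTF-8, so the text is damaged.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).equals(text)); // false
    }
}
```

Calling getBytes() with no argument is the second case in disguise whenever the platform default happens not to be UTF-8.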
How do I convert between ISO-8859-1 and UTF-8 in Java?
In general, you can't do this in both directions without loss. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 replaces every unsupported character with a replacement character (a ? when using String.getBytes), so that information is lost.
To transcode text:
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
or
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
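A minimal sketch of both options using CharsetEncoder follows. The class name StrictLatin1 and its helper methods are invented here; the underscore is just an arbitrary choice of replacement byte.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictLatin1 {
    // Throws CharacterCodingException instead of silently writing '?'.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Substitutes a caller-chosen byte for anything ISO-8859-1 cannot represent.
    static byte[] encodeWithReplacement(String text, byte replacement) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        try {
            return toArray(encoder.encode(CharBuffer.wrap(text)));
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected when both actions are REPLACE
        }
    }

    // Convenience wrapper: true when strict encoding succeeds.
    static boolean canEncodeStrict(String text) {
        try {
            encodeStrict(text);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static byte[] toArray(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String price = "\u20ac5"; // euro sign followed by 5; the euro sign is not in ISO-8859-1
        System.out.println(canEncodeStrict("caf\u00e9")); // true
        System.out.println(canEncodeStrict(price));       // false
        byte[] replaced = encodeWithReplacement(price, (byte) '_');
        System.out.println(new String(replaced, StandardCharsets.ISO_8859_1)); // _5
    }
}
```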