Java: how to undo conversion from UTF-8 to ISO-8859-1
Suppose we have a string containing UTF-8 text that was mistakenly decoded as ISO-8859-1 ("double-encoded" characters), such as Ã©.
To convert it back to the intended characters, we can use the String constructor that takes an array of byte and a Charset object. The class java.nio.charset.StandardCharsets provides constants for various Charset objects.
String accentE =
    new String(
        "Ã©".getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8
    );
which is é
How do I convert between ISO-8859-1 and UTF-8 in Java?
In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause replacement characters (a ? when encoding with String.getBytes) to appear in your text when unsupported characters are found.
To transcode text:
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
or
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
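A minimal, self-contained sketch of both directions follows. It uses the StandardCharsets constants instead of charset names, which avoids the checked UnsupportedEncodingException; the class and method names (TranscodeDemo, latin1ToUtf8, utf8ToLatin1) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class TranscodeDemo {
    // Re-encode ISO-8859-1 bytes as UTF-8. Every Latin-1 character exists in Unicode, so this is lossless.
    static byte[] latin1ToUtf8(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    // Re-encode UTF-8 bytes as ISO-8859-1. Characters outside Latin-1 are replaced by '?'.
    static byte[] utf8ToLatin1(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};                  // "é" in ISO-8859-1
        byte[] utf8 = latin1ToUtf8(latin1);             // two bytes: 0xC3 0xA9
        System.out.println(utf8.length);

        // Cyrillic Ya (U+044F) has no ISO-8859-1 representation.
        byte[] lossy = utf8ToLatin1("\u044F".getBytes(StandardCharsets.UTF_8));
        System.out.println((char) lossy[0]);
    }
}
```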
You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
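As a rough sketch of that lower-level control, the following uses CharsetEncoder with CodingErrorAction. The class and method names (StrictLatin1, encodeStrict, encodeReplacing) and the underscore replacement are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictLatin1 {
    // Copy the remaining bytes of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    // Throws CharacterCodingException if any character cannot be encoded as ISO-8859-1.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Substitutes the given byte for every character that cannot be encoded.
    static byte[] encodeReplacing(String text, byte replacement) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] {replacement});
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    public static void main(String[] args) throws CharacterCodingException {
        // The smart quote (U+201C) is replaced by '_'.
        System.out.println(new String(encodeReplacing("a\u201Cb", (byte) '_'), StandardCharsets.ISO_8859_1));
        try {
            encodeStrict("\u201C");
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```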
Converting UTF-8 to ISO-8859-1 in Java
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C, “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
decoding and encoding strings, ISO-8859-1 to UTF-8 in Java
It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but it can be UTF-16, depending on the content of the String).
Quote from JEP 254:
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But the internal storage encoding doesn't actually matter. When it is time to be written, the String will use whatever encoding you choose at the time of writing.
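A small sketch of that point: the same String produces different bytes depending on the charset requested when it is written out, regardless of how the JVM stores it internally. The class and method names (EncodingAtWriteTime, encodedLength) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingAtWriteTime {
    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // "é", one char in the String
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // one byte: 0xE9
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // two bytes: 0xC3 0xA9
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // two bytes: 0x00 0xE9
    }
}
```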
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to @VGR @Eugene @skomisa for the help
Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java
Since ISO-8859-1 is a 1 byte per character encoding in which every byte value maps to a character, decoding with it will always work. The UTF-8 bytes are converted to incorrect characters, but luckily there's no information lost.
Changing the characters back to bytes using ISO-8859-1 encoding gives you the original byte array, containing characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
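The round trip described above can be sketched as follows; MojibakeRepair and repair are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Undo a wrong ISO-8859-1 decode: recover the original bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String intended = "\u00E9\u044F"; // "é" followed by Cyrillic Ya
        byte[] utf8 = intended.getBytes(StandardCharsets.UTF_8);

        // Simulate the mistake: UTF-8 bytes decoded as ISO-8859-1 (one char per byte).
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                 // 4 chars, one per UTF-8 byte
        System.out.println(repair(wrong).equals(intended)); // true
    }
}
```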
The opposite of this is not (always¹) true, as UTF-8 is a multibyte encoding. The decoding process may encounter invalid byte sequences and replace them with the replacement character � (U+FFFD). At that point you've lost information and can't get the original bytes back anymore.
¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
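To illustrate the lossy direction, here is a hedged sketch that decodes ISO-8859-1 bytes as UTF-8 and then tries to get the bytes back. The names LossyDirection and throughUtf8 are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyDirection {
    // Decode bytes as UTF-8 and encode the result as UTF-8 again.
    static byte[] throughUtf8(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1, not a valid UTF-8 sequence on its own
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("\uFFFD"));                   // true: replaced during decoding
        System.out.println(Arrays.equals(latin1, throughUtf8(latin1)));   // false: original byte is gone

        byte[] ascii = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(ascii, throughUtf8(ascii)));     // true: 0-127 survives
    }
}
```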
Convert UTF-8 to ISO-8859-1 with Numeric Character Reference
UPDATE: Removed unnecessary DOM loading.
Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.
Example
Transformer transformer = TransformerFactory.newInstance().newTransformer();
// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
new StreamResult(new File("test-utf8.xml")));
// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
new StreamResult(new File("test-8859-1.xml")));
test.xml (input, UTF-8)
<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
Translated by https://translate.google.com (except emoji)
test-utf8.xml (output, UTF-8)
<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
test-8859-1.xml (output, ISO-8859-1)
<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
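<czech>Ahoj sv&#283;te</czech>
<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
<chinese>&#20320;&#22909;,&#19990;&#30028;</chinese>
<emoji>&#128075; &#127758;</emoji>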
</test>
If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.