Converting Utf-8 to Iso-8859-1 in Java - How to Keep It as Single Byte

Java: how to undo conversion from UTF-8 to ISO-8859-1

Suppose we have a string containing double-encoded characters, that is, UTF-8 bytes that were mistakenly decoded as ISO-8859-1, such as Ã© (which should have been é).

To repair such a string, get its bytes back using ISO-8859-1 and decode them as UTF-8 with this constructor of String: pass an array of bytes and a Charset object. The class java.nio.charset.StandardCharsets provides constants for various Charset objects.

String accentE = 
    new String(
        "Ã©".getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8
    )
;

which is é

See this code run live at IdeOne.com.

How do I convert between ISO-8859-1 and UTF-8 in Java?

In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause unsupported characters to be replaced: String.getBytes substitutes a question mark (?) for every character that ISO-8859-1 cannot represent.

To transcode text:

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
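As a minimal sketch of that lower-level approach, the class below uses CharsetEncoder with CodingErrorAction to either fail on un-encodable characters or substitute a chosen byte. The class and method names (StrictLatin1, encodeOrFail, encodeWithReplacement) are made up for illustration; only the java.nio.charset calls are standard.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictLatin1 {
    private StrictLatin1() {}

    /** Encodes text as ISO-8859-1 and throws if any character cannot be represented. */
    static byte[] encodeOrFail(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    /** Encodes text as ISO-8859-1, writing the given byte in place of each un-encodable character. */
    static byte[] encodeWithReplacement(String text, byte replacement)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the readable portion of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }
}
```

With this sketch, encodeOrFail("\u044F") throws a CharacterCodingException because Cyrillic Ya has no ISO-8859-1 mapping, while encodeWithReplacement("a\u044Fb", (byte) '_') yields the bytes for a_b.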

Converting UTF-8 to ISO-8859-1 in Java

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n"
    + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).

Quote from JEP 254:

We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.

But the internal storage encoding doesn't actually matter. When the String is written out, it is encoded with whatever encoding you choose at that time.
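To illustrate, here is a small sketch showing that the same String produces different bytes depending only on the charset chosen when it is written. The class and helper names (WriteTimeEncoding, encodedHex) are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the String is the same, only the output charset differs.
final class WriteTimeEncoding {
    private WriteTimeEncoding() {}

    /** Returns the bytes of text in the given charset as space-separated hex pairs. */
    static String encodedHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // é
        System.out.println(encodedHex(text, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(encodedHex(text, StandardCharsets.UTF_8));      // C3 A9
        System.out.println(encodedHex(text, StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```

How the JVM stores the String internally plays no part in any of these results.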

My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.

Thank you to @VGR @Eugene @skomisa for the help

Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java

Since ISO-8859-1 is a one-byte-per-character encoding in which every byte value maps to a character, decoding with it will always work. The UTF-8 bytes are converted to incorrect characters, but luckily there's no information lost.

Changing the characters back to bytes using ISO-8859-1 encoding gives you the original byte array, containing characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.

The opposite of this is not (always¹) true, as UTF-8 is a multibyte encoding. When bytes are decoded as UTF-8, the decoder may encounter invalid byte sequences and replace them with the replacement character (U+FFFD, �). At that point you've lost information and can't get the original bytes back anymore.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
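The difference can be checked with a short sketch that decodes bytes with a charset, encodes the result again, and compares it to the input. The class and method names (RoundTripCheck, survivesRoundTrip) are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for comparing lossless and lossy round trips.
final class RoundTripCheck {
    private RoundTripCheck() {}

    /** Returns true if decoding and re-encoding with the charset reproduces the original bytes. */
    static boolean survivesRoundTrip(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        return Arrays.equals(original, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 }; // é encoded as UTF-8
        byte[] latin1Bytes = { (byte) 0xE9 };            // é encoded as ISO-8859-1

        // Any byte sequence survives a trip through ISO-8859-1.
        System.out.println(survivesRoundTrip(utf8Bytes, StandardCharsets.ISO_8859_1)); // true

        // A lone 0xE9 is not valid UTF-8, so it decodes to U+FFFD and is lost.
        System.out.println(survivesRoundTrip(latin1Bytes, StandardCharsets.UTF_8));    // false
    }
}
```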

Convert UTF-8 to ISO-8859-1 with Numeric Character Reference

UPDATE: Removed unnecessary DOM loading.

Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.

Example

import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-utf8.xml")));

// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-8859-1.xml")));

test.xml (input, UTF-8)

<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>

Translated by https://translate.google.com (except emoji)

test-utf8.xml (output, UTF-8)

<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>

test-8859-1.xml (output, ISO-8859-1)

<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
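<czech>Ahoj sv&#283;te</czech>
<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
<chinese>&#20320;&#22909;,&#19990;&#30028;</chinese>
<emoji>&#128075; &#127758;</emoji>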
</test>

If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.


