Guessing the Encoding of Text Represented as Byte[] in Java

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
            new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole buffer to the detector, then signal the end of the data.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null if the detector could not reach a confident guess.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
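For completeness, a minimal usage sketch (the file name here is a made-up example), reading the raw bytes first and then decoding with the guessed charset:

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] bytes = Files.readAllBytes(Paths.get("input.txt")); // hypothetical input file
String encoding = guessEncoding(bytes);
String text = new String(bytes, Charset.forName(encoding));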

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

Java: How to determine the correct charset encoding of a stream

I have used this library, similar to jchardet, for detecting the encoding in Java:
https://github.com/albfernandez/juniversalchardet
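For a stream you don't have to buffer everything up front. A sketch using the same UniversalDetector API as above, feeding chunks until the detector is confident (the 4096-byte buffer size is an arbitrary choice):

import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

public static String guessEncoding(InputStream in) throws IOException {
    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096]; // arbitrary chunk size
    int read;
    // Feed chunks until the stream ends or the detector has made up its mind.
    while ((read = in.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, read);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset(); // may be null
    detector.reset();
    return encoding;
}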

How to detect charset in Java?

As deceze commented, there is no reliable way to automatically detect the encoding of a text.

Most encodings use one byte per character, so the same sequence of bytes means a totally different string in different encodings. Pretty much the only thing you can reliably say is "this is not a valid UTF-8 string"; other frequently used encodings do not even have strict rules about which byte sequences are and are not valid.

Your best option is to know the encoding of the message up front. The next-best option is to preserve the text as a byte array alongside the decoded "UTF-8 string".

If you only have to accept a very limited set of encodings (say UTF-8, UTF-16 and cp1252), you can try some heuristics. For example, most English strings in UTF-16 will have 0 as every other byte; failing that, you can check whether the bytes are valid UTF-8, and if they are not, the text is likely in the remaining encoding. A sketch of that idea follows.
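A minimal sketch of that heuristic (the zero-byte threshold and the windows-1252 fallback are arbitrary choices for this UTF-8/UTF-16/cp1252 scenario):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static Charset guessFromLimitedSet(byte[] bytes) {
    // Heuristic 1: English text in UTF-16 has a zero byte in most 2-byte units.
    int zeroBytes = 0;
    for (byte b : bytes) {
        if (b == 0) zeroBytes++;
    }
    if (bytes.length > 0 && zeroBytes > bytes.length / 4) {
        return StandardCharsets.UTF_16;
    }
    // Heuristic 2: strict UTF-8 decoding rejects byte sequences that are not valid UTF-8.
    CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        utf8.decode(ByteBuffer.wrap(bytes));
        return StandardCharsets.UTF_8;
    } catch (CharacterCodingException e) {
        // Not valid UTF-8, so it is likely the remaining candidate.
        return Charset.forName("windows-1252");
    }
}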

How is this file encoded?

This was Mac OS Roman encoding. With the following Java code, the text was properly decoded:

InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");

I don't know how to delete my own question. I don't think it is useful anymore...

Is specifying String encoding when parsing byte[] really necessary?

If I know my installation has a default encoding of UTF-8, is it really necessary to specify the encoding to still follow "best practice"?

But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)

And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?

If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:

  • If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".

  • If the requirement is to always encode (or decode) using the platform default, then do that.

  • If the requirement is to support multiple encodings (or the requirement might change) then make the encoding name a configuration (or command line) parameter, resolve to a Charset object and use that.
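For the last case, a small sketch of what "resolve to a Charset object and use that" can look like (the "app.encoding" property key is made up for this example):

import java.nio.charset.Charset;

// "app.encoding" is a hypothetical configuration key; resolve the name once, then reuse.
static Charset configuredCharset() {
    String name = System.getProperty("app.encoding", "UTF-8");
    return Charset.forName(name); // fails fast on an unsupported name
}

// Decoding and encoding then always go through the resolved Charset:
static String decode(byte[] bytes) {
    return new String(bytes, configuredCharset());
}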

The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.

(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)

Detect (or best guess of) incoming string encoding in Java

The UTF-8 encoding should be easy to verify:

UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm.
(from Wikipedia)

Take a look at this site to see the algorithm.
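In Java, a minimal structural check along those lines might look like this (simplified: it validates lead and continuation byte patterns but does not reject every overlong or surrogate sequence):

static boolean looksLikeUtf8(byte[] bytes) {
    int i = 0;
    while (i < bytes.length) {
        int lead = bytes[i] & 0xFF;
        int trailing;
        if (lead < 0x80) {
            trailing = 0;                        // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;                        // 2-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2;                        // 3-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3;                        // 4-byte sequence
        } else {
            return false;                        // invalid lead byte
        }
        for (int j = 1; j <= trailing; j++) {
            // Every continuation byte must match the bit pattern 10xxxxxx.
            if (i + j >= bytes.length || (bytes[i + j] & 0xC0) != 0x80) {
                return false;
            }
        }
        i += trailing + 1;
    }
    return true;
}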

How to identify the encoding charset of a file in Java?

As already mentioned, there is no certain way to detect the encoding. But there are many heuristics that allow you to make a smart guess about a file's encoding.

If there is no way for you to know the encoding for sure, have a look at the Apache Tika project and its EncodingDetector interface.
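A sketch of how that interface is used (assuming the tika-parsers module is on the classpath and class names as found in recent Tika versions; the detector needs a stream that supports mark/reset, hence the BufferedInputStream, and the file name is a made-up example):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;

static Charset detectWithTika(String fileName) throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(fileName))) {
        EncodingDetector detector = new Icu4jEncodingDetector();
        // Returns null if the detector cannot make a confident guess.
        return detector.detect(in, new Metadata());
    }
}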

How to detect encoding mismatch

When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.

From a byte sequence, there's no way to clearly tell how it is to be interpreted.

A few ideas:

  • You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and forever.
  • You could try to decode the byte array in both versions, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to the expected characters. I wouldn't recommend going that way, as it needs a lot of work and there's still some probability of error.
  • For the new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a byte order mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (they would be read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1. Still, there's a non-zero probability of error. A sketch of this follows below.
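A minimal sketch of the marker convention from the last bullet, assuming new entries are written with the EF BB BF prefix:

import java.nio.charset.StandardCharsets;

static String decodeWithBomConvention(byte[] bytes) {
    if (bytes.length >= 3
            && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB
            && (bytes[2] & 0xFF) == 0xBF) {
        // Marker present: decode as UTF-8, skipping the three marker bytes.
        return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
    }
    // No marker: treat it as a legacy ISO-8859-1 entry.
    return new String(bytes, StandardCharsets.ISO_8859_1);
}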

If ever possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry the "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.


