Java: How to Determine the Correct Charset Encoding of a Stream

I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet

Is there a way to check the charset encoding of a .txt file with Java?

You cannot know with absolute certainty which charset is used in the general case. I found this to be a good read.

http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html

The section "Automatic detection of encoding" is especially relevant.

How to find out if a stream complies with the charset encoding ISO-8859-1

I need to detect whether a byte array contains only characters that comply with the ISO-8859-1 encoding.

Well, every stream of binary data can be viewed as "valid" in ISO-8859-1, as it's simply a single-byte-per-character scheme mapping bytes 0-255 to U+0000 to U+00FF in a trivial way. Compare that with UTF-8 or UTF-16, where certain byte sequences are simply invalid.

So a method to determine whether a stream contained valid ISO-8859-1 could just return true - but that doesn't mean that the original text was encoded in ISO-8859-1... it may be meaningless to a human when decoded with ISO-8859-1, but still valid.
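As a rough sketch of that difference, a decoder can be told to report malformed input instead of silently replacing it. The class and method names here (`StrictDecodeCheck`, `decodesCleanly`) are made up for illustration; the `CharsetDecoder` calls are the standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StrictDecodeCheck {

    /** Returns true if the bytes decode under the given charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte is a lone 0xE9.
        byte[] latin1Text = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("valid UTF-8?      " + decodesCleanly(latin1Text, StandardCharsets.UTF_8));
        System.out.println("valid ISO-8859-1? " + decodesCleanly(latin1Text, StandardCharsets.ISO_8859_1));
    }
}
```

With the sample bytes, the UTF-8 check fails because 0xE9 starts a multi-byte sequence that never completes, while the ISO-8859-1 check passes, as it would for any byte array. A passing check only says the bytes are valid in that charset, not that it is the charset the text was written in.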

If you know that the original plain text won't include certain characters (e.g. unprintable control characters) you could detect that quite simply just by checking whether any byte in the stream was blacklisted. More advanced detection might check for unexpected patterns - but it becomes very heuristic, and may be tightly coupled to what the original source text is expected to be like.
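A minimal sketch of such a blacklist check follows. It assumes the text should contain no control characters other than tab, line feed and carriage return; that choice, and the names `Latin1Plausibility` and `looksLikePrintableLatin1`, are assumptions for illustration, so adjust the ranges to whatever your source text can legitimately contain.

```java
// Hypothetical helper, not part of any library.
final class Latin1Plausibility {

    /**
     * Returns true if no byte maps to an ISO-8859-1 control character,
     * other than tab, line feed, or carriage return.
     */
    static boolean looksLikePrintableLatin1(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean allowedWhitespace = v == '\t' || v == '\n' || v == '\r';
            boolean c0Control = v < 0x20 && !allowedWhitespace;  // U+0000..U+001F
            boolean delOrC1Control = v >= 0x7F && v <= 0x9F;     // U+007F..U+009F
            if (c0Control || delOrC1Control) {
                return false;
            }
        }
        return true;
    }
}
```

Bytes in the 0x80-0x9F range are worth singling out: in ISO-8859-1 they are control characters, whereas windows-1252 uses most of them for printable characters such as curly quotes, so finding them often hints that the data is really windows-1252.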

How to get the real character encoding of a file in Java

In general, it is not possible to always detect exactly what the character encoding of a text file is, because nothing stored in a plain text file states the encoding explicitly. You can make some intelligent guesses, but don't expect them to be right every time.
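One of the cheaper guesses is to look for a byte order mark (BOM) at the start of the file. The sketch below only covers the UTF-8 and UTF-16 BOMs, and the names `BomSniffer` and `charsetFromBom` are made up for illustration. Many files carry no BOM at all, so an empty result tells you nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /** Guesses a charset from a leading byte order mark; returns empty if none is recognized. */
    static Optional<Charset> charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch does not distinguish the two.
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In practice you would pass in the first few bytes read from the file and fall back to other heuristics, or to a sensible default, when the result is empty.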

The link that cebewee posted in the comments has more information on how to detect what the character encoding of a text file is.

How to check the charset of string in Java?

Strings in Java, AFAIK, do not retain their original encoding; as far as the API is concerned, they are always sequences of UTF-16 code units.
What you want is to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call comes too late.
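To illustrate why inspecting the String is too late, here is a small sketch (the class and method names are made up for illustration). Once bytes have been decoded with the wrong charset, the malformed parts are replaced with U+FFFD, and re-encoding the String cannot bring the original bytes back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
final class StringEncodingDemo {

    /** Decodes the bytes as UTF-8, re-encodes the String as UTF-8, and reports whether the bytes survived. */
    static boolean survivesUtf8RoundTrip(byte[] original) {
        // new String(byte[], Charset) replaces malformed input with U+FFFD instead of failing.
        String decoded = new String(original, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, reencoded);
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes survive: " + survivesUtf8RoundTrip(latin1Bytes));
        System.out.println("UTF-8 bytes survive:      " + survivesUtf8RoundTrip(utf8Bytes));
    }
}
```

The String itself carries no record of which charset produced it; `getBytes` simply encodes the characters into whatever charset you ask for.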

Ideally, if you can get hold of the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well.


