What Is the Most Accurate Encoding Detector

Java : How to determine the correct charset encoding of a stream

I have used this library, similar to jchardet, for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet
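
For reference, the usual way to drive juniversalchardet looks roughly like this (a sketch based on the library's README; constructor details can differ between versions):

import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectEncoding {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream(args[0])) {
            UniversalDetector detector = new UniversalDetector(null);

            // Feed the detector until it is confident or the stream ends.
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
            detector.dataEnd();

            String encoding = detector.getDetectedCharset();
            System.out.println(encoding != null ? encoding : "No encoding detected");
            detector.reset();
        }
    }
}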

Detect (or best guess of) incoming string encoding in Java

The UTF-8 encoding should be easy to verify:

UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm.
from wikipedia
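
A quick way to perform that verification in Java (my own sketch, not the algorithm from the article; it simply attempts a strict UTF-8 decode with the standard library):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Returns true if the bytes decode cleanly as UTF-8.
static boolean looksLikeUtf8(byte[] bytes) {
    try {
        // A strict decoder throws on any malformed or unmappable sequence.
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

Note that plain ASCII also passes this test, since ASCII is a subset of UTF-8; a positive result only means "consistent with UTF-8".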

Take a look at this site to see the algorithm

Character Encoding Detection Algorithm

Years ago we had character set detection for a mail application, and we rolled our own. The mail app was actually a WAP application, and the phone expected UTF-8. There were several steps:

Universal

We could easily detect if text was UTF-8, as there is a specific bit pattern in the top bits of the second and subsequent bytes of each multi-byte sequence. Once you found that pattern repeated a certain number of times you could be certain it was UTF-8.
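
A hand-rolled version of that check might look like the sketch below (my illustration of the described bit-pattern test, not the original mail-application code; the repetition threshold is arbitrary):

// Returns true if the buffer is structurally valid UTF-8 and contains
// at least a few multi-byte sequences (the repeated pattern mentioned above).
static boolean hasUtf8Pattern(byte[] b) {
    int multiByteSequences = 0;
    int i = 0;
    while (i < b.length) {
        int lead = b[i] & 0xFF;
        int continuation;
        if (lead < 0x80) {                    // plain ASCII byte
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            continuation = 1;                 // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            continuation = 2;                 // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            continuation = 3;                 // 11110xxx
        } else {
            return false;                     // not a valid UTF-8 lead byte
        }
        if (i + continuation >= b.length) {
            return false;                     // truncated sequence
        }
        for (int j = 1; j <= continuation; j++) {
            if ((b[i + j] & 0xC0) != 0x80) {
                return false;                 // continuation bytes must be 10xxxxxx
            }
        }
        multiByteSequences++;
        i += continuation + 1;
    }
    // "Repeated a certain number of times": 3 is an arbitrary threshold here.
    return multiByteSequences >= 3;
}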

If the file begins with a UTF-16 byte order mark, you can probably assume the rest of the text is that encoding. Otherwise, detecting UTF-16 isn't nearly as easy as UTF-8, unless you can detect the surrogate pairs pattern: but the use of surrogate pairs is rare, so that doesn't usually work. UTF-32 is similar, except there are no surrogate pairs to detect.
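
A BOM check is easy to sketch in Java (again an illustration, not the original code); the UTF-32 marks must be tested before the UTF-16 ones, because FF FE is a prefix of the UTF-32LE mark FF FE 00 00:

// Returns the charset name implied by a byte order mark, or null if none.
static String charsetFromBom(byte[] b) {
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
        return "UTF-8";
    }
    if (b.length >= 4 && (b[0] & 0xFF) == 0x00 && (b[1] & 0xFF) == 0x00
            && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
        return "UTF-32BE";
    }
    if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
            && (b[2] & 0xFF) == 0x00 && (b[3] & 0xFF) == 0x00) {
        return "UTF-32LE";
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
        return "UTF-16BE";
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
        return "UTF-16LE";
    }
    return null;
}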

Regional detection

Next we would assume the reader was in a certain region. For instance, if the user was seeing the UI localized in Japanese, we could then attempt detection of the three main Japanese encodings. ISO-2022-JP is again easy to detect with the escape sequences. If that fails, determining the difference between EUC-JP and Shift-JIS is not as straightforward. It's more likely that a user would receive Shift-JIS text, but there were characters in EUC-JP that didn't exist in Shift-JIS, and vice versa, so sometimes you could get a good match.
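
For illustration, spotting the ISO-2022-JP escape sequences can be as simple as the following sketch (it only looks for the ESC $ B and ESC $ @ switches into JIS X 0208, which is usually enough of a hint):

// ISO-2022-JP text is 7-bit and announces the Japanese character set with
// an escape sequence; finding one is a strong hint for that encoding.
static boolean looksLikeIso2022Jp(byte[] b) {
    for (int i = 0; i + 2 < b.length; i++) {
        if (b[i] == 0x1B && b[i + 1] == '$' && (b[i + 2] == 'B' || b[i + 2] == '@')) {
            return true;
        }
    }
    return false;
}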

The same procedure was used for Chinese encodings and other regions.

User's choice

If these didn't provide satisfactory results, the user must manually choose an encoding.

How to determine the encoding of text

EDIT: chardet seems to be unmaintained, but most of this answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative

Correctly detecting the encoding in every case is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

Guessing the text encoding from a UTF-8 BOM file in Java 6

As of Java 1.7 the Character class knows about Unicode scripts such as Arabic and Hebrew.

// Positive score means more Arabic code points, negative means more Hebrew.
int score = s.codePoints()
        .map(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC ? 1
                 : Character.UnicodeScript.of(cp) == Character.UnicodeScript.HEBREW ? -1
                 : 0)
        .sum();

For Java 1.6, checking the directionality might be sufficient, since there is both a DIRECTIONALITY_RIGHT_TO_LEFT and a DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:

String s = "אבגדהאבגדהصضطظع"; // First Hebrew, then Arabic.
int i0 = 0;
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);
    i += Character.charCount(codePoint);
    boolean rtl = Character.getDirectionality(codePoint)
            == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
    boolean rtlArabic = Character.getDirectionality(codePoint)
            == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    System.out.printf("[%d - %d] '%s': RTL %s, RTL-Arabic %s%n",
            i0, i, s.substring(i0, i), rtl, rtlArabic);
    i0 = i;
}

[0 - 1] 'א': RTL true, RTL-Arabic false
[1 - 2] 'ב': RTL true, RTL-Arabic false
[2 - 3] 'ג': RTL true, RTL-Arabic false
[3 - 4] 'ד': RTL true, RTL-Arabic false
[4 - 5] 'ה': RTL true, RTL-Arabic false
[5 - 6] 'א': RTL true, RTL-Arabic false
[6 - 7] 'ב': RTL true, RTL-Arabic false
[7 - 8] 'ג': RTL true, RTL-Arabic false
[8 - 9] 'ד': RTL true, RTL-Arabic false
[9 - 10] 'ה': RTL true, RTL-Arabic false
[10 - 11] 'ص': RTL false, RTL-Arabic true
[11 - 12] 'ض': RTL false, RTL-Arabic true
[12 - 13] 'ط': RTL false, RTL-Arabic true
[13 - 14] 'ظ': RTL false, RTL-Arabic true
[14 - 15] 'ع': RTL false, RTL-Arabic true

So

int arabic(String s) {
    int n = 0;
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
            ++n;
            if (n > 1000) {
                break; // enough evidence, stop counting
            }
        }
    }
    return n;
}

int hebrew(String s) {
    int n = 0;
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
            ++n;
            if (n > 1000) {
                break; // enough evidence, stop counting
            }
        }
    }
    return n;
}

if (arabic(s) > 0) {
    return "Windows-1256";
} else if (hebrew(s) > 0) {
    return "Windows-1255";
} else {
    return "Klingon-1257";
}

How to detect encoding mismatch

When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.

From a byte sequence, there's no way to clearly tell how it is to be interpreted.

A few ideas:

  • You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and forever.
  • You could try to decode the byte array in both versions, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to expected characters. I wouldn't recommend going that way, as it needs a lot of work and there's still some probability of error.
  • For the new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a Byte Order Mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (they would be read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1 (a sketch of this check is shown below). Still, there's a non-zero probability of error.

If ever possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry on with "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
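
If you do go with the marker convention from the last bullet, the read side could look roughly like this (a sketch; the method name and the choice of ISO-8859-1 as the legacy encoding are assumptions taken from the discussion above):

import java.nio.charset.StandardCharsets;

// Decodes bytes that were stored either as marker-prefixed UTF-8 (new entries)
// or as plain ISO-8859-1 (old entries). The EF BB BF prefix is the convention
// described above; the names here are illustrative, not from any original code.
static String decodeLegacyOrUtf8(byte[] decrypted) {
    if (decrypted.length >= 3
            && (decrypted[0] & 0xFF) == 0xEF
            && (decrypted[1] & 0xFF) == 0xBB
            && (decrypted[2] & 0xFF) == 0xBF) {
        // New format: strip the marker and decode as UTF-8.
        return new String(decrypted, 3, decrypted.length - 3, StandardCharsets.UTF_8);
    }
    // Old format: assume ISO-8859-1.
    return new String(decrypted, StandardCharsets.ISO_8859_1);
}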

Detect actual charset encoding in UTF

You need to unwrap the UTF-8 encoding and then pass it to a character-encoding detection library.

If random 8-bit data is encoded into UTF-8 (assuming an identity mapping, i.e. a C4 byte is assumed to represent U+00C4, as is the case with ISO-8859-1 and its superset Windows 1252), you end up with something like

Source:  8F    0A 20 FE    65
Result: C2 8F 0A 20 C3 BE 65

(because the UTF-8 encoding of U+008F is C2 8F, and U+00FE is C3 BE). You need to revert this encoding in order to obtain the source string, so that you can then identify its character encoding.
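
In Java, that unwrapping step is a simple decode/encode round-trip, because ISO-8859-1 maps U+0000 through U+00FF back to bytes 00 through FF one-to-one; the recovered bytes can then be fed to any detector (juniversalchardet is used here purely as an example):

import java.nio.charset.StandardCharsets;
import org.mozilla.universalchardet.UniversalDetector;

// doubleEncoded contains UTF-8 bytes produced by wrongly treating legacy
// 8-bit data as ISO-8859-1 text.
static String detectUnderlyingCharset(byte[] doubleEncoded) {
    // Undo the identity mapping: UTF-8 decode, then ISO-8859-1 encode.
    byte[] original = new String(doubleEncoded, StandardCharsets.UTF_8)
            .getBytes(StandardCharsets.ISO_8859_1);

    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(original, 0, original.length);
    detector.dataEnd();
    return detector.getDetectedCharset();  // may be null if nothing matched
}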

In Python, something like

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import chardet

mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'
print(chardet.detect(mystery.encode('cp1252')))

Result:

{'confidence': 0.99, 'encoding': 'ISO-8859-5'}

On the Unix command line,

vnix$ echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
> iconv -t cp1252 | chardet
<stdin>: ISO-8859-5 (confidence: 0.99)

or iconv -t cp1252 file | chardet to decode a file and pass it to chardet.

(For this to work successfully at the command line, you need to have your environment properly set up for transparent Unicode handling. I am assuming that your shell, your terminal, and your locale are adequately configured. Try a recent Ubuntu Live CD or something if your regular environment is stuck in the 20th century.)

In the general case, you cannot know that the incorrectly applied encoding is CP 1252 but in practice, I guess it's going to be correct (as in, yield correct results for this scenario) most of the time. In the worst case, you would have to loop over all available legacy 8-bit encodings and try them all, then look at the one(s) with the highest confidence rating from chardet. Then, the example above will be more complex, too -- the mapping from legacy 8-bit data to UTF-8 will no longer be a simple identity mapping, but rather involve a translation table as well (for example, a byte F5 might correspond arbitrarily to U+0092 or whatever).

(Incidentally, iconv -l spits out a long list of aliases, so you will get a lot of fundamentally identical results if you use that as your input. But here is a quick ad-hoc attempt at fixing your slightly weird Perl script.

#!/bin/sh
iconv -l |
grep -F -v -e UTF -e EUC -e 2022 -e ISO646 -e GB2312 -e 5601 |
while read enc; do
echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
iconv -f utf-8 -t "${enc%//}" 2>/dev/null |
chardet | sed "s%^[^:]*%${enc%//}%"
done |
grep -Fwive ascii -e utf -e euc -e 2022 -e None |
sort -k4rn

The output still contains a lot of chaff, but once you remove that, the verdict is straightforward.

It makes no sense to try any multi-byte encodings such as UTF-16, ISO-2022, GB2312, EUC_KR etc in this scenario. If you convert a string into one of these successfully, then the result will most definitely be in that encoding. This is outside the scope of the problem outlined above: a string converted from an 8-bit encoding into UTF-8 using the wrong translation table.

The ones which returned ascii definitely did something wrong; most of them will have received an empty input, because iconv failed with an error. In a Python script, error handling would be more straightforward.)

Refactoring auto-detect file's encoding

The best way to refactor this code would be to bring in a third-party library that does character detection for you, because it will probably do a better job and will make your code smaller.
See this question for a few alternatives.
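
For instance, ICU4J's CharsetDetector reduces the whole job to a few lines (a sketch; juniversalchardet, mentioned earlier on this page, is another option):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Detects the most likely charset of raw bytes using ICU4J.
static String guessCharset(byte[] data) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);
    CharsetMatch match = detector.detect();  // best match, or null
    return match != null ? match.getName() : null;
}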


