Character Encoding Detection Algorithm


Years ago, we rolled our own character set detection for a mail application. The mail app was actually a WAP application, and the phone expected UTF-8. Detection ran in several steps:

Universal

We could easily detect whether text was UTF-8, since there is a specific bit pattern in the top bits of the second and subsequent bytes of each multi-byte sequence. Once you found that pattern repeated a certain number of times, you could be fairly certain the text was UTF-8.
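That bit pattern can be checked directly: a UTF-8 lead byte of the form 110xxxxx, 1110xxxx, or 11110xxx must be followed by exactly one, two, or three 10xxxxxx continuation bytes. A minimal sketch of the idea (the threshold of 3 multi-byte sequences is an arbitrary choice for illustration, not the original app's value):

```python
def looks_like_utf8(data: bytes, min_sequences: int = 3) -> bool:
    """Return True if `data` is valid UTF-8 and contains at least
    `min_sequences` multi-byte sequences (lead + continuations)."""
    i, found = 0, 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # single-byte ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:        # 110xxxxx -> 1 continuation byte
            need = 1
        elif 0xE0 <= b <= 0xEF:      # 1110xxxx -> 2 continuation bytes
            need = 2
        elif 0xF0 <= b <= 0xF4:      # 11110xxx -> 3 continuation bytes
            need = 3
        else:                        # byte can't start a UTF-8 sequence
            return False
        tail = data[i + 1:i + 1 + need]
        if len(tail) != need or any(t & 0xC0 != 0x80 for t in tail):
            return False             # continuation must be 10xxxxxx
        found += 1
        i += 1 + need
    return found >= min_sequences
```

Text in a legacy encoding rarely survives this check, because its high bytes almost never happen to line up as lead byte plus the right number of 10xxxxxx continuations.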

If the file begins with a UTF-16 byte order mark, you can probably assume the rest of the text is that encoding. Otherwise, detecting UTF-16 isn't nearly as easy as UTF-8, unless you can detect the surrogate-pair pattern; but surrogate pairs are rare, so that doesn't usually work. UTF-32 is similar, except there are no surrogate pairs to detect.

Regional detection

Next we would assume the reader was in a certain region. For instance, if the user was seeing the UI localized in Japanese, we could then attempt detection of the three main Japanese encodings. ISO-2022-JP is again easy to detect thanks to its escape sequences. If that fails, telling EUC-JP and Shift-JIS apart is not as straightforward. It was more likely that a user would receive Shift-JIS text, but there were byte sequences valid in EUC-JP that weren't valid in Shift-JIS, and vice-versa, so sometimes you could get a good match.
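One hedged way to exploit that asymmetry, using Python's stdlib codecs rather than the original mail app's code: strictly decode with each candidate and keep the ones that succeed. When more than one succeeds, the tie still has to be broken by likelihood (here, Shift-JIS ordered first, reflecting the assumption above that it was the more common encoding to receive):

```python
def plausible_japanese_encodings(data: bytes) -> list[str]:
    """Strict trial-decode against the three main Japanese encodings.
    Returns every candidate that decodes without error, most likely first."""
    candidates = ["shift_jis", "euc_jp", "iso2022_jp"]
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)   # strict mode raises on any invalid byte
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok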

The same procedure was used for Chinese encodings and other regions.

User's choice

If these didn't provide satisfactory results, the user had to choose an encoding manually.

Java : How to determine the correct charset encoding of a stream

I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet

how to get the real character encoding of a file java

In general, it is not possible to detect a text file's character encoding with certainty: nothing stored in a plain text file explicitly records which encoding was used. You can make some intelligent guesses, but don't expect to always be able to identify the encoding exactly.

The link that cebewee posted in the comments has more information on how to detect what the character encoding of a text file is.

How to determine the encoding of text

EDIT: chardet seems to be unmaintained, but most of the answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding all times is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
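The tail of that chain (detector guess, then UTF-8, then Windows-1252) can be sketched with a simple fallback loop; `chardet` is treated as optional here and skipped when not installed. This illustrates the fallback order, not UnicodeDammit's actual implementation:

```python
def dammit_style_decode(data: bytes) -> tuple[str, str]:
    """Try a chardet guess (if available), then UTF-8, then
    Windows-1252. Returns (decoded_text, encoding_used)."""
    candidates = []
    try:
        import chardet                      # optional dependency
        guess = chardet.detect(data)["encoding"]
        if guess:
            candidates.append(guess)
    except ImportError:
        pass
    candidates += ["utf-8", "windows-1252"]
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # windows-1252 leaves a few byte values unassigned; as a last
    # resort decode leniently, replacing anything undecodable
    return data.decode("windows-1252", errors="replace"), "windows-1252"
```

Windows-1252 makes a useful last resort precisely because nearly every byte value maps to some character, so the decode almost never fails — it just may produce mojibake.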

Auto-Detect Character Encoding in Java

Mozilla's universalchardet is supposed to be the most effective detector out there. juniversalchardet is the Java port of it, and there is one more port. Read this SO answer for more information: Character Encoding Detection Algorithm.

How do browsers determine the encoding used?

They can guess it based on heuristics.

I don't know how good browsers are at encoding detection today, but MS Word did a very good job of it and recognized even charsets I'd never heard of before. You can just open a *.txt file with a random encoding and see.

This algorithm usually involves statistical analysis of byte patterns, such as the frequency distribution of trigraphs in various languages as encoded in each code page to be detected; the same statistical analysis can also be used to perform language detection.
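A toy version of such analysis: count byte trigrams in known-encoding training text, then score unknown bytes against each model and pick the best match. Real detectors use far larger models and proper smoothing; this sketch only illustrates the scoring idea.

```python
from collections import Counter
import math

def trigram_model(sample: bytes) -> Counter:
    """Frequency table of byte trigrams from known-encoding training text."""
    return Counter(sample[i:i + 3] for i in range(len(sample) - 2))

def score(data: bytes, model: Counter) -> float:
    """Sum of log-probabilities of data's trigrams under the model.
    Higher means the byte patterns look more like the training text."""
    total = sum(model.values()) or 1
    s = 0.0
    for i in range(len(data) - 2):
        # add-one smoothing so unseen trigrams don't zero the score out
        s += math.log((model[data[i:i + 3]] + 1) / (total + 1))
    return s

def guess_encoding(data: bytes, models: dict[str, Counter]) -> str:
    """Return the model name whose training text best matches `data`."""
    return max(models, key=lambda enc: score(data, models[enc]))
```

Because UTF-8 and EUC-JP place Japanese text in almost disjoint byte ranges, even this tiny model separates them cleanly once trained on a few sentences.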

https://en.wikipedia.org/wiki/Charset_detection

Firefox uses the Mozilla Charset Detectors. The way they work is explained here, and you can also change their heuristic preferences. The Mozilla Charset Detectors were even forked into uchardet, which works better and detects more languages.

[Update: As commented below, it moved to chardetng since Firefox 73]

Chrome previously used the ICU detector but switched to CED almost 2 years ago.


None of the detection algorithms is perfect; they can guess incorrectly, because they are only guessing in the first place!

This process is not foolproof because it depends on statistical data.

That's how the famous "Bush hid the facts" bug occurred. Bad guessing also introduces vulnerabilities into the system.

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.

http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none

As a result, the encoding should always be explicitly stated.


