How to Detect the Character Encoding of a Text File

How to determine encoding table of a text file

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii

Also, have a look here.

How to detect the character encoding of a text file?

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires a “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

Get encoding of a file in Windows

Open up your file using regular old vanilla Notepad that comes with Windows.

It will show you the encoding of the file when you click "Save As...".

It'll look like this:
Sample Image

Whatever the default-selected encoding is, that is what your current encoding is for the file.


If it is UTF-8, you can change it to ANSI and click save to change the encoding (or visa-versa).

I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a onetime export, so Notepad fit the bill for me.

FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.

More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicdoe

Detect text file encoding

Turns out that auto-detecting the encoding is impossible for the general case.

However, there is a workaround to at least fall back to the system locale if the text is not valid UTF-8/UTF-16/UTF-32 text. It uses QTextCodec::codecForUtfText(), which tries to decode a byte array using UTF-8, UTF-16 and UTF-32, and returns the supplied default codec if it fails.

Code to do it:

QTextCodec *codec = QTextCodec::codecForUtfText(byteArray, QTextCodec::codecForName("System"));
const QString &text = codec->toUnicode(byteArray);

Update

The above code will not detect UTF-8 without BOM, however, as codecForUtfText() relies on the BOM markers. To detect UTF-8 without BOM, see https://stackoverflow.com/a/18228382/492336.

How to determine the encoding of text

EDIT: chardet seems to be unmantained but most of the answer applies. Check https://pypi.org/project/charset-normalizer/ for an alternative

Correctly detecting the encoding all times is impossible.

(From chardet FAQ:)

However, some encodings are optimized
for specific languages, and languages
are not random. Some character
sequences pop up all the time, while
other sequences make no sense. A
person fluent in English who opens a
newspaper and finds “txzqJv 2!dasd0a
QqdKjvz” will instantly recognize that
that isn't English (even though it is
composed entirely of English letters).
By studying lots of “typical” text, a
computer algorithm can simulate this
kind of fluency and make an educated
guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

How to correctly determine character encoding of text files?

For starters, there's no such physical encoding as "Unicode". What you probably mean by this is UTF-16. Secondly, any file is valid in "ANSI", or any single-byte encoding for that matter. The only thing you can do is guess in the best order which is most likely to throw out invalid matches.

You should check, in this order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as indicator whether it's big endian or little endian, then check the rest of the file whether it conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8. If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.

If you expect UTF-16 files without BOM as well (it's possible for, for example, XML files which specify the encoding in the XML declaration), then you have to shove that rule in there as well. Though any of the above may produce a false positive, falsely identifying an ANSI file as UTF-* (though it's unlikely). You should always have metadata that tells you what encoding a file is in, detecting it after the fact is not possible with 100% accuracy.

How can I detect the encoding/codepage of a text file?

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

How to check if a .txt file is in ASCII or UTF-8 format in Windows environment?

Text files in Windows don't have a format. There's an unofficial convention that if the file starts with the BOM codepoint in UTF-8 format that it's UTF-8, but that convention isn't universally supported. That would be the 3 byte sequence "\xef\xbf\xbe", i.e. ￾ in the Latin-1 character set.

Is there way to check charset encoding of .txt file with Java?

You cannot know with absolute certainty which charset is used in the general case. I found this to be a good read.

http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html

Especially the section Automatic detection of encoding.



Related Topics



Leave a reply



Submit