How to Detect the Encoding/Codepage of a Text File

How can I detect the encoding/codepage of a text file

You can't detect the codepage; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

How to determine encoding table of a text file

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii


Get encoding of a file in Windows

Open up your file using regular old vanilla Notepad that comes with Windows.

It will show you the encoding of the file when you click "Save As...".

It'll look like this:
(Screenshot: Notepad's "Save As" dialog, with the file's encoding shown in the Encoding drop-down.)

Whichever encoding is selected by default is the encoding the file is currently saved in.


If it is UTF-8, you can change it to ANSI and click Save to change the encoding (or vice versa).

I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a one-time export, so Notepad fit the bill for me.

FYI: As I understand it, "Unicode" (as listed in Notepad) is a misnomer for UTF-16.

More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicode

How to determine codepage of a file (that had some codepage transformation applied to it)

It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (ISO-8859-*, DOS and Windows codepages) use the same range of byte values, so no particular byte or set of bytes will tell you which codepage the text is in.

There is one encoding that is easy to tell apart: UTF-8. If the text is valid UTF-8, it is almost certainly not ISO-8859-* or any Windows codepage, because while all byte values are valid in those encodings, the chance of a valid UTF-8 multi-byte sequence appearing in such text by accident is almost zero.

Beyond that, it depends on which other encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and ISO-8859-2 requires spell-checking the words that contain the three or so characters that differ and seeing which reading produces fewer errors.

If you can limit the set of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them all, eliminates the obviously wrong ones, and uses a spell-checker to pick the most likely candidate. I don't know of any existing tool that does this.
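
For illustration, here is a minimal Python sketch of that idea. It assumes you can supply a word list for the language in question; the candidate codepages, file names, and scoring heuristic are placeholders, not prescribed above.

# Sketch: try a few candidate codepages and score each decoding by how many of
# its words appear in a known word list (a crude stand-in for a spell-checker).
CANDIDATES = ["utf-8", "cp1250", "iso-8859-2", "cp852"]  # illustrative list

def guess_codepage(raw, wordlist):
    best, best_score = None, -1.0
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # eliminate the obviously wrong ones
        words = text.lower().split()
        if not words:
            continue
        score = sum(w in wordlist for w in words) / len(words)
        if score > best_score:
            best, best_score = enc, score
    return best

# Hypothetical usage:
# wordlist = {line.strip().lower() for line in open("words.txt", encoding="utf-8")}
# print(guess_codepage(open("data.txt", "rb").read(), wordlist))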

How to determine the encoding of text?

EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses this kind of statistical study to try to detect the encoding. chardet is a port of Mozilla's auto-detection code.
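
A minimal chardet call looks roughly like this (the file name is just a placeholder):

import chardet

with open("unknown.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, ...}
print(guess["encoding"], guess["confidence"])

text = raw.decode(guess["encoding"] or "utf-8", errors="replace")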

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
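
A minimal UnicodeDammit sketch, assuming Beautiful Soup 4 is installed (the file name is a placeholder):

from bs4 import UnicodeDammit

with open("unknown.html", "rb") as f:
    raw = f.read()

dammit = UnicodeDammit(raw)
print(dammit.original_encoding)   # the encoding UnicodeDammit settled on
text = dammit.unicode_markup      # the document decoded to a Unicode string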

How to detect the character encoding of a text file?

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
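
As a rough illustration, a check along these lines (in Python; the function name is illustrative) captures the pattern described above:

def looks_like_utf32(data):
    """Return 'utf-32-be', 'utf-32-le', or None, based on the 00 {00-10} xx xx pattern."""
    if len(data) < 4 or len(data) % 4 != 0:
        return None
    if all(data[i] == 0x00 and data[i + 1] <= 0x10 for i in range(0, len(data), 4)):
        return "utf-32-be"
    if all(data[i + 2] <= 0x10 and data[i + 3] == 0x00 for i in range(0, len(data), 4)):
        return "utf-32-le"
    return None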

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
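
In code this is a one-line check; for instance, in Python:

def is_ascii(data):
    # US-ASCII uses only byte values 0x00-0x7F
    return all(b < 0x80 for b in data)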

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
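
Validation is easiest to perform by simply attempting a strict decode; a Python sketch:

def is_valid_utf8(data):
    try:
        data.decode("utf-8")   # strict decoding raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False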

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.
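
A rough Python sketch of the zero-byte heuristic for mostly-Latin text; the 70% threshold is an arbitrary assumption, not part of the answer above:

def guess_utf16_without_bom(data):
    """Heuristic only: Latin-1-range text in UTF-16 has a 0x00 in every other byte."""
    if len(data) < 2 or len(data) % 2 != 0:
        return None
    half = len(data) // 2
    even_zeros = sum(1 for b in data[0::2] if b == 0)
    odd_zeros = sum(1 for b in data[1::2] if b == 0)
    if even_zeros > 0.7 * half and odd_zeros == 0:
        return "utf-16-be"   # high (zero) byte comes first
    if odd_zeros > 0.7 * half and even_zeros == 0:
        return "utf-16-le"
    return None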

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.
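
For the common byte-oriented (non-EBCDIC) case, the XML check can be sketched like this in Python; the function name and regex are illustrative:

import re

def xml_declared_encoding(data):
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]                     # skip a UTF-8 BOM if present
    if not data.startswith(b"<?xml"):
        return None
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return m.group(1).decode("ascii") if m else "utf-8"   # UTF-8 is the XML default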

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

Determine TextFile Encoding?

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.

The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't work until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function

However, the problem with this approach is that MSDN seems to imply that the StreamReader may only detect certain kinds of encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

If (data(0) = &HFF) And (data(1) = &HFE) Then
    ' Data starts with UTF-16 (little endian) BOM
End If

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte-array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) And (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function

Of course, the above function assumes that the data is at least two bytes in length and the BOM is exactly two bytes. So, while it illustrates how to do it as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    Return data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function

So, the problem then becomes, how do you get a list of all the encodings? Well it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings which returns a list of all of the supported encoding objects. Therefore, you could create a method which loops through all of the encoding objects, gets the BOM of each one and compares it to the byte array until you find one that matches. For instance:

Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function

Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    If bom.Length <> 0 Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function

Once you make a function like that, then you could detect the encoding of a file like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If

However, the problem remains, how do you automatically detect the correct encoding when there is no BOM? Technically it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that if no BOM is present, then UTF-8 will suffice. However, if any of the files happen to use something else, without a BOM, then that won't work.

As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:

  • TextFileEncodingDetector
  • Utf8Checker
  • GetTextEncoding

Even though this question was for C#, you may also find the answers to it useful.

Effective way to find any file's Encoding

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this method cannot determine the file's encoding.

*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}

Is it possible to detect text file encoding of two possible?

For example, I allow users to use Unicode UTF-8 and ISO-8859-2 for their CSV files. Is it possible to detect whether it is the former or the latter?

It's not possible with 100% accuracy because, for example, the bytes C3 B1 are an equally valid representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to all 256 possible bytes, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters if non-ASCII).

However, the converse is not true. UTF-8 has strict rules about what sequences are valid. More than 99% of possible 8-octet sequences are not valid UTF-8. And your CSV files are probably much longer than that. Because of this, you can get good accuracy if you:

  1. Perform a UTF-8 validity check. If it passes, assume the data is UTF-8.
  2. Otherwise, assume it's ISO-8859-2.
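
A minimal Python sketch of that two-step check (the function name is illustrative):

def decode_csv_bytes(raw):
    """Assume UTF-8 if the bytes validate as UTF-8; otherwise fall back to ISO-8859-2."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-2")   # every byte sequence is valid ISO-8859-2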

However, is it possible to detect whether the encoding is one of the two allowed?

UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation.
UTF-16 can be detected by the presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to contain unpaired surrogates).

If you have at least one "detectable" encoding, then you can check for the detectable encoding, and use the undetectable encoding as a fallback.

If both encodings are "undetectable", like ISO-8859-1 and ISO-8859-2, then it's more difficult. You could try a statistical approach like chardet uses.


