Effective Way to Find Any File's Encoding

Effective way to find any file's Encoding

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this method cannot determine the file's encoding.

UPDATED 4/08/2020 to include UTF-32LE detection and return the correct encoding for UTF-32BE.

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
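
A minimal usage sketch (assuming the method above is in scope; the file name is just a placeholder) might look like this:

var encoding = GetEncoding("test.txt");
Console.WriteLine(encoding.EncodingName);

// Then open the file with the detected encoding rather than the default.
using (var reader = new StreamReader("test.txt", encoding))
{
    string contents = reader.ReadToEnd();
}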

Get encoding of a file in Windows

Open up your file using regular old vanilla Notepad that comes with Windows.

It will show you the encoding of the file when you click "Save As...".


Whatever the default-selected encoding is, that is what your current encoding is for the file.


If it is UTF-8, you can change it to ANSI and click Save to change the encoding (or vice versa).

I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a one-time export, so Notepad fit the bill for me.
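
If you have to convert more than one file, the same conversion can be scripted. Here is a minimal C# sketch, assuming the source really is UTF-8 and that "ANSI" means Windows-1252 (the usual Western code page); the file names are placeholders:

// Read the file as UTF-8 and rewrite it as Windows-1252 ("ANSI" on Western systems).
string text = File.ReadAllText("export-utf8.txt", Encoding.UTF8);

// On .NET Core / .NET 5+, register CodePagesEncodingProvider.Instance before this call.
File.WriteAllText("export-ansi.txt", text, Encoding.GetEncoding(1252));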

FYI: From my understanding, "Unicode" (as listed in Notepad) is a misnomer for UTF-16.

More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicode

Determine TextFile Encoding?

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.

The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't work until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function

However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader may only detect certain kinds of encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

If (data(0) = &HFF) And (data(1) = &HFE) Then
    ' Data starts with UTF-16 (little endian) BOM
End If

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte-array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) And (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function

Of course, the above function assumes that the data is at least two bytes in length and the BOM is exactly two bytes. So, while it illustrates how to do it as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    Return data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function

So, the problem then becomes, how do you get a list of all the encodings? Well it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings which returns a list of all of the supported encoding objects. Therefore, you could create a method which loops through all of the encoding objects, gets the BOM of each one and compares it to the byte array until you find one that matches. For instance:

Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function

Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    If bom.Length <> 0 Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function

Once you make a function like that, then you could detect the encoding of a file like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If

However, the problem remains, how do you automatically detect the correct encoding when there is no BOM? Technically it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that if no BOM is present, then UTF-8 will suffice. However, if any of the files happen to use something else, without a BOM, then that won't work.

As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing) and sometimes they are not accurate. Basically they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). Here is a list of some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful:

  • TextFileEncodingDetector
  • Utf8Checker
  • GetTextEncoding

Even though this question was for C#, you may also find the answers to it useful.

How to determine the encoding of text

EDIT: chardet seems to be unmaintained, but most of the answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding all times is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

How to detect the character encoding of a text file?

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
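
As a rough illustration (a sketch, not part of the original answer), the big-endian check could look like this in C#; the little-endian case is the mirror image:

// Returns true if every 4-byte unit matches the UTF-32BE pattern 00 {00-10} xx xx.
static bool LooksLikeUtf32BigEndian(byte[] data)
{
    if (data.Length == 0 || data.Length % 4 != 0) return false;
    for (int i = 0; i < data.Length; i += 4)
    {
        if (data[i] != 0x00 || data[i + 1] > 0x10) return false;
    }
    return true;
}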

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
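
In code, that check is a simple scan (sketch):

// Returns true if every byte is in the 00-7F range, i.e. plain ASCII.
static bool LooksLikeAscii(byte[] data)
{
    foreach (byte b in data)
    {
        if (b >= 0x80) return false;
    }
    return true;
}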

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
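
One way to run that validation in .NET is to decode with a strict UTF-8 decoder and see whether it throws; this is a sketch, not part of the original answer:

// Returns true if the data decodes cleanly as UTF-8.
// UTF8Encoding(false, true) uses an exception-throwing decoder fallback
// (second argument is throwOnInvalidBytes).
static bool ValidatesAsUtf8(byte[] data)
{
    try
    {
        new UTF8Encoding(false, true).GetCharCount(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}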

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.
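
The zero-byte heuristic from the previous paragraph could be sketched like this (the 80% threshold is an arbitrary assumption, not from the answer):

// Crude heuristic for Latin-heavy text: in UTF-16LE most odd-offset (high) bytes are zero.
static bool LooksLikeUtf16LittleEndian(byte[] data)
{
    if (data.Length < 2 || data.Length % 2 != 0) return false;
    int zeroHighBytes = 0;
    for (int i = 1; i < data.Length; i += 2)
    {
        if (data[i] == 0x00) zeroHighBytes++;
    }
    return zeroHighBytes >= (data.Length / 2) * 0.8; // arbitrary 80% threshold
}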

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.
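
A simplified sketch of the XML case (ASCII-compatible declarations only, ignoring the EBCDIC and UTF-16 variants; not part of the original answer):

// Returns the declared encoding for a file starting with "<?xml", UTF-8 if no
// encoding attribute is present, or null if the data is not an XML declaration.
static Encoding GetXmlDeclaredEncoding(byte[] data)
{
    string prefix = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 200));
    if (!prefix.StartsWith("<?xml")) return null;

    var match = Regex.Match(prefix, @"encoding\s*=\s*[""']([^""']+)[""']");
    return match.Success ? Encoding.GetEncoding(match.Groups[1].Value) : Encoding.UTF8;
}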

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.
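
For example, with the Ude NuGet package (one .NET port of Mozilla's universal charset detector), usage is roughly as follows; this is a sketch from memory, so verify the API against whichever port you choose:

byte[] data = File.ReadAllBytes("test.txt");
var detector = new Ude.CharsetDetector();
detector.Feed(data, 0, data.Length);
detector.DataEnd();

// Charset is null when the detector could not reach a conclusion.
Console.WriteLine(detector.Charset ?? "Unknown");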

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.
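
One practical note if you go this route on .NET Core / .NET 5+: Windows-1252 is not available out of the box there, so you first have to register the code-pages encoding provider (from the System.Text.Encoding.CodePages package):

// Required once on .NET Core / .NET 5+ before legacy code pages can be requested.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding fallback = Encoding.GetEncoding(1252); // Windows-1252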

Refactoring auto-detect file's encoding

The best way to refactor this code would be to bring in a third-party library that does character detection for you, because it will probably do a better job and will make your code smaller.
See this question for a few alternatives.

SQL/SSIS encoding detect

I don't know if there is a specific tool that does exactly what you need (file encoding detection inside SQL Server).

However, you can write one that does the job inside SQL Server as per your requirements. The file encoding detection can easily be done in C#, and IMHO that is a more elegant solution than trying to write one in SQL, if that is even possible at all.

So you should write a C# function that does that and then host it inside SQL Server for execution.
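
As a rough sketch of the C# side (the class and function names are illustrative, not taken from the linked posts), a CLR scalar function could wrap a BOM-based detection routine such as the one in the answer linked below:

using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

public class EncodingDetection
{
    // SQL CLR scalar function; the assembly needs EXTERNAL_ACCESS permission to read files.
    [SqlFunction]
    public static SqlString DetectFileEncoding(SqlString path)
    {
        using (var reader = new StreamReader(path.Value, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // force the BOM (if any) to be read
            return new SqlString(reader.CurrentEncoding.WebName);
        }
    }
}

Once the assembly and function are registered, it could be called from T-SQL along the lines of SELECT dbo.DetectFileEncoding(N'C:\data\file.txt');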

Here is a post that explains how to detect any encoding in a file

Effective way to find any file's Encoding

And here is a blog post that explains how to register a C# method and execute it from SQL Server

http://www.codemahek.com/blog/execute-c-methods-from-sql-server

There should be enough information here for you to do what you want.

Welcome to Stackoverflow!


