Determine a String'S Encoding in C#

Is there a way to check the encoding of a C# string?

Strings in C# (well, .NET) don't have encoding, effectively... or you can view them all as UTF-16, given that they're a sequence of char values, which are UTF-16 code units.

Normally, however, you only need to care about encoding when you convert from a string to a binary form (e.g. down a socket or to a file). At that point, you should specify the encoding explicitly - the string itself has no concept of this.

The only aspect which "defaults" to UTF-8 is that there are plenty of .NET APIs which are overloaded to either accept an encoding or not, and if no encoding is specified, UTF-8 is used. File.ReadAllText is an example of this. However, after reading the file there's no distinction between "text which was read from a UTF-8 file" and "text which was read from a Big5 file" etc.

How to know string encoding in C#


I get an accent: "Acción"

The presence of the à character is a dead give-away. Accented capital A characters have character code 0xC0 and up. Which is often the first byte in a two-byte utf-8 encoded character. The ó glyph is codepoint U+00F3, the utf-8 encoding for it is 0xC3 + 0xB3. Which are the codepoints for à and ³

The strings are encoded in utf-8 but you are reading it with an 8-bit encoding like Encoding.Default

How to detect the character encoding of a text file?

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely-used to make this approach practical.

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires a “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

Effective way to find any file's Encoding

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness, by analyzing its byte order mark (BOM). If the file does not have a BOM, this cannot determine the file's encoding.

*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
// Read the BOM
var bom = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(bom, 0, 4);
}

// Analyze the BOM
if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

// We actually have no idea what the encoding is if we reach this point, so
// you may wish to return null instead of defaulting to ASCII
return Encoding.ASCII;
}


Related Topics



Leave a reply



Submit