How to Write Out a Text File in C# with a Code Page Other Than UTF-8

How do I write out a text file in C# with a code page other than UTF-8?

using System.IO;
using System.Text;

using (StreamWriter sw = new StreamWriter(File.Open(myfilename, FileMode.Create), Encoding.WhateverYouWant))
{
    sw.WriteLine("my text...");
}

An alternate way of getting your encoding:

using System.IO;
using System.Text;

using (var sw = new StreamWriter(File.Open(@"c:\myfile.txt", FileMode.CreateNew), Encoding.GetEncoding("iso-8859-1")))
{
    sw.WriteLine("my text...");
}

Check out the docs for the StreamWriter constructor.

How to read text files with ANSI encoding and non-English letters?

var text = File.ReadAllText(file, Encoding.GetEncoding(codePage));

A list of code page identifiers: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers?redirectedfrom=MSDN
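Note that on .NET Core and .NET 5+, only a handful of encodings are available out of the box; Windows code pages such as 1252 first require registering the provider from the System.Text.Encoding.CodePages NuGet package. A minimal round-trip sketch (the temp file name is just for illustration):

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithCodePage
{
    static void Main()
    {
        // On .NET Core / .NET 5+, Windows code pages are not available until the
        // provider from the System.Text.Encoding.CodePages package is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Hypothetical file: write a string out in code page 1252...
        string path = Path.Combine(Path.GetTempPath(), "cp1252-sample.txt");
        Encoding cp1252 = Encoding.GetEncoding(1252);
        File.WriteAllText(path, "café", cp1252);

        // ...and read it back with the same code page.
        string text = File.ReadAllText(path, cp1252);
        Console.WriteLine(text); // café

        File.Delete(path);
    }
}
```

On .NET Framework the RegisterProvider call is unnecessary (all Windows code pages are built in), but it is harmless to include.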

How can I detect the encoding/codepage of a text file?

You can't detect the code page; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

C# encoding when reading files

The reason is that, by default, the encoding used when reading text files is UTF-8.

Encoding.Default is not (despite its name) the default encoding used when reading files! On .NET Framework it returns the system's current ANSI code page; on .NET Core and .NET 5+ it actually returns UTF-8.

A much better name for Encoding.Default would have been Encoding.UsingCurrentCodePage, in my opinion. ;)

Also note that rather than using File.ReadLines(filePath, Encoding.GetEncoding(1252)) you could use File.ReadLines(filePath, Encoding.Default).

You would do that if your code is trying to read files that have been created in a different code page than 1252, and that code page is the current code page for the system on which the code is running.

The only reason you should be using code pages is if you are reading or writing legacy files.
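As an illustration of why the distinction matters, here is a minimal sketch of what happens when code page 1252 bytes are fed to the default UTF-8 decoder:

```csharp
using System;
using System.Text;

class LegacyCodePageDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // "é" is a single byte (0xE9) in code page 1252...
        byte[] legacyBytes = Encoding.GetEncoding(1252).GetBytes("é");
        Console.WriteLine(legacyBytes[0]); // 233

        // ...but a lone 0xE9 is invalid UTF-8, so decoding it with the
        // default UTF-8 decoder yields the replacement character U+FFFD.
        string misread = Encoding.UTF8.GetString(legacyBytes);
        Console.WriteLine(misread == "\uFFFD"); // True
    }
}
```

This is exactly the failure mode you see when a legacy file is opened without specifying its code page.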

Effective way to find any file's Encoding

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's encoding by analyzing its byte order mark (BOM). If the file does not have a BOM, this approach cannot determine the file's encoding.

*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // UTF-7 (obsolete in .NET 5+)
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
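A quick way to exercise the helper above (assuming the GetEncoding method is in scope; the file name is just for illustration): a StreamWriter created with Encoding.UTF8 emits the EF BB BF preamble for a new file, so the helper should report UTF-8:

```csharp
// Hypothetical round trip: Encoding.UTF8 has a 3-byte preamble (EF BB BF),
// which StreamWriter writes for a new file, so GetEncoding should detect UTF-8.
string path = Path.Combine(Path.GetTempPath(), "bom-sample.txt");
using (var sw = new StreamWriter(new FileStream(path, FileMode.Create), Encoding.UTF8))
{
    sw.WriteLine("hello");
}
Encoding detected = GetEncoding(path);
Console.WriteLine(detected.WebName); // utf-8
File.Delete(path);
```

A file written with new UTF8Encoding(false) (no BOM) would fall through to the ASCII default, which is exactly the limitation the answer describes.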

File encoding doesn't work

You can use other overloads of Encoding.GetEncoding to control what happens when a Unicode character can't be converted to your target code page; see the MSDN documentation for details.
The same can be achieved by explicitly setting the Encoding.EncoderFallback property.

For example, you can use the following to throw an exception every time conversion of a Unicode character fails:

Encoding enc = Encoding.GetEncoding(28605, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

Note: The default EncoderFallback is System.Text.InternalEncoderBestFitFallback, which produces question marks for unknown code points.
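To see the difference in behavior, here's a small sketch contrasting the default (best-fit) fallback with the exception fallback; code page 28605 is ISO-8859-15, as in the snippet above (CodePagesEncodingProvider registration is assumed for .NET Core and later):

```csharp
using System;
using System.Text;

class FallbackDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to make code page 28605 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Default fallback: the snowman character U+2603 has no mapping in
        // ISO-8859-15, so it silently becomes '?'.
        Encoding lenient = Encoding.GetEncoding(28605);
        Console.WriteLine(lenient.GetString(lenient.GetBytes("a\u2603b"))); // a?b

        // Exception fallback: the same conversion now throws instead.
        Encoding strict = Encoding.GetEncoding(28605,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes("a\u2603b");
        }
        catch (EncoderFallbackException ex)
        {
            Console.WriteLine($"Cannot encode U+{(int)ex.CharUnknown:X4}"); // Cannot encode U+2603
        }
    }
}
```

The exception fallback is the safer choice when silent data loss is unacceptable.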

Set UTF-8 encoding on StreamWriter

using (var streamWriter = new System.IO.StreamWriter(new FileStream(dlg.FileName, FileMode.Create), Encoding.UTF8))
{
    // write with streamWriter here; disposing it flushes and closes the file
}
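One subtlety worth knowing: StreamWriter only writes a byte order mark when the encoding supplies one, so Encoding.UTF8 produces a BOM while new UTF8Encoding(false) does not. A sketch (the temp file names are just for illustration):

```csharp
using System;
using System.IO;
using System.Text;

class Utf8BomDemo
{
    static void Main()
    {
        string withBom = Path.Combine(Path.GetTempPath(), "with-bom.txt");
        string withoutBom = Path.Combine(Path.GetTempPath(), "without-bom.txt");

        // Encoding.UTF8 has a preamble, so StreamWriter emits EF BB BF first.
        using (var sw = new StreamWriter(new FileStream(withBom, FileMode.Create), Encoding.UTF8))
            sw.Write("hi");

        // new UTF8Encoding(false) suppresses the BOM.
        using (var sw = new StreamWriter(new FileStream(withoutBom, FileMode.Create), new UTF8Encoding(false)))
            sw.Write("hi");

        Console.WriteLine(new FileInfo(withBom).Length);    // 5 (3-byte BOM + "hi")
        Console.WriteLine(new FileInfo(withoutBom).Length); // 2

        File.Delete(withBom);
        File.Delete(withoutBom);
    }
}
```

Whether you want the BOM depends on the consumer: some tools use it to detect UTF-8, others choke on it.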

My C# code doesn't read special characters from file

The characters

č, ć, š, đ, ž

suggest that this could be one of the ANSI code pages of Eastern Europe. A recommendation is then to try

CodePagesEncodingProvider.Instance.GetEncoding(1250)

as the encoding.

Sadly, there's no easy way to guess the code page of an 8-bit file. UTF-8 (and the other Unicode encodings) were designed precisely to overcome such issues. Thus, if you have any control over how the source files are created, strongly recommend producing UTF-8 files.
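For reference, a hedged sketch of the code page 1250 round trip; the sample string is just the set of characters from the question:

```csharp
using System;
using System.Text;

class CentralEuropeanDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to make code page 1250 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Each of these characters is a single byte in Windows-1250
        // (Central European), e.g. "š" is 0x9A.
        Encoding cp1250 = CodePagesEncodingProvider.Instance.GetEncoding(1250);
        byte[] bytes = cp1250.GetBytes("čćšđž");
        Console.WriteLine(bytes.Length); // 5: one byte per character

        // ...and the text round-trips correctly when decoded with the same code page.
        Console.WriteLine(cp1250.GetString(bytes)); // čćšđž

        // The same five bytes decoded as UTF-8 would come out garbled.
    }
}
```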


