StreamWriter and UTF-8 Byte Order Marks

How to output Byte Order Mark when writing to TextWriter?

Short Version

String zwnbsp = "\uFEFF"; //Zero-width non-breaking space (U+FEFF)

//The Zero-width non-breaking space character ***is*** the Byte-Order-Mark (BOM).
String s = zwnbsp+"The quick brown fox jumped over the lazy dog.";
writer.Write(s);

Long Version

At some point i realized how simple the solution is.

i used to think that the Unicode Byte-Order-Mark was some special signature. i used to think i had to carefully decide which byte sequence i wanted to output, in order to output the correct BOM:

  • 0xFE 0xFF
  • 0xFF 0xFE
  • 0xEF 0xBB 0xBF

But since then i realized that the Byte-Order-Mark is not some special byte sequence that you have to prepend to your file.

The BOM is just a Unicode character. You don't output any bytes; you only output the character U+FEFF. When you write that character, the serializer converts it for you into whatever encoding you're using.

The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) was chosen for good reason. It's a space, so it has no meaning, and it is zero width, so you shouldn't even see it.

That means that my question is fundamentally flawed. There is no such thing as "writing a byte-order-mark". You just make sure the first character you write out is U+FEFF. In my case i am writing to a TextWriter:

void WriteStuffToTextWriter(TextWriter writer)
{
    String csvExport = GetExportAsCSV();

    writer.Write("\uFEFF"); //Output Unicode character U+FEFF as a byte order marker
    writer.Write(csvExport);
}

The TextWriter will handle converting the Unicode character U+FEFF into whatever byte encoding it has been configured to use.
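As a quick check of this claim, the following sketch writes the single character U+FEFF through a StreamWriter and prints the bytes that end up in the stream. The class name BomBytesDemo and the method EncodeBom are made up for illustration, and the encodings are deliberately constructed so that they emit no preamble of their own; every byte in the output therefore comes from the character itself.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomBytesDemo
{
    // Writes only the character U+FEFF through a StreamWriter and
    // returns the bytes that reached the stream, formatted as hex.
    public static string EncodeBom(Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding, 1024, leaveOpen: true))
            {
                writer.Write("\uFEFF");
            }
            return BitConverter.ToString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // Assumption: encodings built with the "no preamble" option,
        // so the writer adds nothing on its own.
        Console.WriteLine(EncodeBom(new UTF8Encoding(false)));           // EF-BB-BF
        Console.WriteLine(EncodeBom(new UnicodeEncoding(false, false))); // FF-FE
        Console.WriteLine(EncodeBom(new UnicodeEncoding(true, false)));  // FE-FF
    }
}
```

If the writer were instead given Encoding.UTF8 on a fresh stream, the output would contain EF-BB-BF twice: once from the encoding's preamble and once from the character.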

Note: Any code is released into the public domain. No attribution required.

Default UTF-8 encoder of StreamWriter doesn't return Preamble

The preamble is optional at the encoding level; new UTF8Encoding(true) and new UTF8Encoding(false) provide UTF8 encodings with/without a BOM (preamble) as the only difference. Encoding.UTF8 uses the "with" option, whereas StreamWriter, when no encoding is supplied, defaults to "without". Both are valid, and neither is specifically "right" or "wrong".

If you care deeply about whether or not the BOM is present: supply the Encoding yourself explicitly, choosing the appropriate option.
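One way to see which option a given encoding uses is to look at the length of its preamble. The sketch below is illustrative only; PreambleDemo and its method names are hypothetical, while GetPreamble and StreamWriter.Encoding are the standard framework members.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreambleDemo
{
    // Number of BOM bytes the encoding asks a writer to emit at the start of a stream.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    // Preamble length of the encoding StreamWriter picks when none is supplied.
    public static int DefaultWriterPreambleLength()
    {
        using (var writer = new StreamWriter(new MemoryStream()))
        {
            return writer.Encoding.GetPreamble().Length;
        }
    }

    public static void Main()
    {
        Console.WriteLine(PreambleLength(new UTF8Encoding(true)));  // 3
        Console.WriteLine(PreambleLength(new UTF8Encoding(false))); // 0
        Console.WriteLine(PreambleLength(Encoding.UTF8));           // 3
        Console.WriteLine(DefaultWriterPreambleLength());           // 0
    }
}
```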

Write text files without Byte Order Mark (BOM)?

In order to omit the byte order mark (BOM), your stream must use an instance of UTF8Encoding other than System.Text.Encoding.UTF8 (which is configured to generate a BOM). There are two easy ways to do this:

1. Explicitly specifying a suitable encoding:

  1. Call the UTF8Encoding constructor with False for the encoderShouldEmitUTF8Identifier parameter.

  2. Pass the UTF8Encoding instance to the stream constructor.

' VB.NET:
Dim utf8WithoutBom As New System.Text.UTF8Encoding(False)
Using sink As New StreamWriter("Foobar.txt", False, utf8WithoutBom)
    sink.WriteLine("...")
End Using

// C#:
var utf8WithoutBom = new System.Text.UTF8Encoding(false);
using (var sink = new StreamWriter("Foobar.txt", false, utf8WithoutBom))
{
    sink.WriteLine("...");
}

2. Using the default encoding:

If you do not supply an Encoding to StreamWriter's constructor at all, StreamWriter will by default use an UTF8 encoding without BOM, so the following should work just as well:

' VB.NET:
Using sink As New StreamWriter("Foobar.txt")
    sink.WriteLine("...")
End Using

// C#:
using (var sink = new StreamWriter("Foobar.txt"))
{
    sink.WriteLine("...");
}

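Finally, note that omitting the BOM is generally safe only for UTF-8. With UTF-16, a reader relies on the BOM to determine the byte order unless the encoding is declared some other way.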

How do I unmarshal a JSON string containing a UTF-8 Byte Order Mark (BOM)?

UTF-8 has a well-defined byte order. There's no such thing as big-endian UTF-8 vs little-endian UTF-8; there is only UTF-8. This means that a byte order marker or BOM in UTF-8 is pointless. Some software thinks it's pointful: that it marks a data file as being stored in UTF-8 (vs UTF-16-LE or UTF-16-BE, each of which would start with the two bytes 0xFF and 0xFE in one order or the other, if that UTF-16-xx file has a BOM). If you agree that such software is wrong, either don't use it or use it in a way that suppresses this initial BOM.

As Jim B noted, systems that generate JSON text must not embed a UTF-8-encoded BOM (which comes out as the three bytes 0xEF, 0xBB, 0xBF) at the front of their output. However, a parser may accept and ignore a BOM at the start of a stream. To do that in Go, inspect the incoming stream data and remove an initial BOM if present, passing the rest of the data on as the JSON bytes. But you're probably better off making your C# code generate allowed output, rather than fancying up your Go code to allow forbidden input.
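The "remove an initial BOM if present" step is just a three-byte prefix check. A minimal sketch of that check follows, written in C# to match the other examples on this page rather than in Go; the names JsonBom and StripUtf8Bom are hypothetical, and the logic carries over to any language directly.

```csharp
using System;
using System.Text;

public static class JsonBom
{
    // Returns the input without a leading UTF-8 BOM (EF BB BF).
    // Input that does not start with a BOM is returned unchanged.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        if (!hasBom)
        {
            return data;
        }

        var result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, (byte)'{', (byte)'}' };
        Console.WriteLine(Encoding.UTF8.GetString(StripUtf8Bom(withBom))); // {}
    }
}
```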

StreamWriter is appending BOM character 65279 to end of file

As I implied in my comment about byte order marks, what you want is to stop StreamWriter from adding a byte order mark. Whether it adds one depends on the encoding you pass to it.

For example, try creating your own encoding that does not write a byte order mark:

static void Main(string[] args)
{
    for (int i = 0; i < 3; i++)
    {
        using (FileStream stream = new FileStream("file.txt", FileMode.OpenOrCreate))
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true, 0x1000, true))
        using (StreamWriter writer = new StreamWriter(stream, new UTF8Encoding(false), 0x1000, true))
        {
            Console.WriteLine("Read \"" + reader.ReadToEnd() + "\" from the file.");
        }
    }
    Console.ReadLine();
}

By using new UTF8Encoding(false) as your UTF8 encoder, the encoder is explicitly instructed not to use Unicode byte order marks. This is described in the MSDN entry for the UTF8Encoding constructor.

Unicode Encoding - handling Byte Order Mark

From another post (StreamWriter and UTF-8 Byte Order Marks):

"The issue is due to the fact that you are using the static UTF8 property on the Encoding class.

When the GetPreamble method is called on the instance of the Encoding class returned by the UTF8 property, it returns the byte order mark (the byte array of three characters) and is written to the stream before any other content is written to the stream"

So, in this case I changed

XmlTextWriter xmlTextWriter = new XmlTextWriter(stream, Encoding.Unicode);

to

XmlTextWriter xmlTextWriter = new XmlTextWriter(stream, new System.Text.UnicodeEncoding(false, false));

and it works OK.

C# StreamWriter writes extra bytes to the Stream

When I run this locally I get:

10240

10240

10243

On further inspection, the extra 3 bytes appear at the beginning of the stream: 239 187 191, or EF BB BF in hex. This is the Byte Order Mark (BOM): https://en.wikipedia.org/wiki/Byte_order_mark

To remove these extra bytes from the output, use new UTF8Encoding(false) instead of Encoding.UTF8 when creating the StreamWriter, which omits the BOM:

using (var sw = new StreamWriter(memStream, new UTF8Encoding(false), 4194304 /* 4 MiB */, leaveOpen: true))
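To make the 3-byte difference reproducible, here is a small sketch that writes the same text through both encodings and compares the resulting stream lengths. The question's original code is not shown above, so the payload (10240 ASCII characters) and the names StreamLengthDemo and WrittenLength are assumptions chosen to match the numbers quoted.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamLengthDemo
{
    // Writes charCount ASCII characters through a StreamWriter and
    // returns the number of bytes that ended up in the stream.
    public static long WrittenLength(Encoding encoding, int charCount)
    {
        using (var memStream = new MemoryStream())
        {
            using (var sw = new StreamWriter(memStream, encoding, 4096, leaveOpen: true))
            {
                sw.Write(new string('a', charCount));
            }
            return memStream.Length;
        }
    }

    public static void Main()
    {
        // Encoding.UTF8 emits a 3-byte preamble before the content.
        Console.WriteLine(WrittenLength(Encoding.UTF8, 10240));           // 10243
        // UTF8Encoding(false) writes the content only.
        Console.WriteLine(WrittenLength(new UTF8Encoding(false), 10240)); // 10240
    }
}
```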

