How to output Byte Order Mark when writing to TextWriter?
Short Version
String zwnbsp = "\uFEFF"; //Zero-width non-breaking space (\u takes exactly four hex digits; \x is variable-length)
//The Zero-width non-breaking space character ***is*** the Byte-Order-Mark (BOM).
String s = zwnbsp+"The quick brown fox jumped over the lazy dog.";
writer.Write(s);
Long Version
At some point i realized how simple the solution is.
i used to think that the Unicode Byte-Order-Mark was some special signature. i used to think i had to carefully decide which byte sequence i wanted to output, in order to output the correct BOM:
- 0xFE 0xFF
- 0xFF 0xFE
- 0xEF 0xBB 0xBF
But since then i realized that the Byte-Order-Mark is not some special byte sequence that you have to prepend to your file.
The BOM is just a Unicode character. You don't output any bytes; you only output the character U+FEFF. When you write that character, the serializer converts it to whatever encoding you're using.
The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) was chosen for good reason. It's a space, so it has no meaning, and it is zero width, so you shouldn't even see it.
That means that my question is fundamentally flawed. There is no such thing as "writing a byte-order-mark". You just make sure the first character you write out is U+FEFF. In my case i am writing to a TextWriter:
void WriteStuffToTextWriter(TextWriter writer)
{
    String csvExport = GetExportAsCSV();
    writer.Write("\uFEFF"); //Output Unicode character U+FEFF as a byte order marker
    writer.Write(csvExport);
}
The TextWriter will handle converting the Unicode character U+FEFF into whatever byte encoding it has been configured to use.
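To make the point concrete, here is a minimal sketch that writes a lone U+FEFF through a StreamWriter under three encodings and prints the resulting bytes. The class and method names (BomAsCharacterDemo, EncodeBomCharacter) are made up for illustration. Each encoding is constructed with its own preamble switched off, so the bytes shown come only from encoding the character.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomAsCharacterDemo
{
    // Hypothetical helper: writes a lone U+FEFF through a StreamWriter
    // and returns the produced bytes as a hex string such as "EF-BB-BF".
    public static string EncodeBomCharacter(Encoding encoding)
    {
        var buffer = new MemoryStream();
        using (var writer = new StreamWriter(buffer, encoding))
        {
            writer.Write('\uFEFF');
        }
        // MemoryStream.ToArray still works after the writer has closed the stream.
        return BitConverter.ToString(buffer.ToArray());
    }

    public static void Main()
    {
        // The second constructor argument of UnicodeEncoding (and the only one of
        // UTF8Encoding) is false here, so no preamble is emitted by the encoding.
        Console.WriteLine(EncodeBomCharacter(new UTF8Encoding(false)));           // EF-BB-BF
        Console.WriteLine(EncodeBomCharacter(new UnicodeEncoding(false, false))); // FF-FE (UTF-16 LE)
        Console.WriteLine(EncodeBomCharacter(new UnicodeEncoding(true, false)));  // FE-FF (UTF-16 BE)
    }
}
```

These are the same three byte sequences listed earlier. One caveat: if the writer's encoding already emits a preamble (for example Encoding.UTF8 on a fresh stream), writing U+FEFF yourself as well would put two BOMs at the start of the output.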
Note: Any code is released into the public domain. No attribution required.
Default UTF-8 encoder of StreamWriter doesn't return Preamble
The preamble is optional at the encoding level; new UTF8Encoding(true) and new UTF8Encoding(false) provide UTF8 encodings with/without a BOM (preamble) as the only difference. Encoding.UTF8 uses the "with" option, and clearly for some reason StreamWriter in this scenario is choosing "without", but both are valid - and neither is specifically "right" or "wrong".
If you care deeply about whether or not the BOM is present: supply the Encoding yourself explicitly, choosing the appropriate option.
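The difference is easy to inspect directly. The following is a small sketch (the names PreambleDemo, PreambleLength and WriteWithDefaultEncoding are invented for this example) that compares the preamble length of each encoding and shows what a StreamWriter created without an explicit encoding writes.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreambleDemo
{
    // Number of preamble (BOM) bytes the encoding would emit at the start of a stream.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    // Writes text with a StreamWriter that was given no encoding at all
    // and returns the bytes as a hex string.
    public static string WriteWithDefaultEncoding(string text)
    {
        var buffer = new MemoryStream();
        using (var writer = new StreamWriter(buffer))
        {
            writer.Write(text);
        }
        return BitConverter.ToString(buffer.ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(PreambleLength(Encoding.UTF8));           // 3
        Console.WriteLine(PreambleLength(new UTF8Encoding(true)));  // 3
        Console.WriteLine(PreambleLength(new UTF8Encoding(false))); // 0
        Console.WriteLine(WriteWithDefaultEncoding("hi"));          // 68-69 (no EF-BB-BF prefix)
    }
}
```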
Write text files without Byte Order Mark (BOM)?
In order to omit the byte order mark (BOM), your stream must use an instance of UTF8Encoding other than System.Text.Encoding.UTF8 (which is configured to generate a BOM). There are two easy ways to do this:
1. Explicitly specifying a suitable encoding:
- Call the UTF8Encoding constructor with False for the encoderShouldEmitUTF8Identifier parameter.
- Pass the UTF8Encoding instance to the stream constructor.
' VB.NET:
Dim utf8WithoutBom As New System.Text.UTF8Encoding(False)
Using sink As New StreamWriter("Foobar.txt", False, utf8WithoutBom)
    sink.WriteLine("...")
End Using
// C#:
var utf8WithoutBom = new System.Text.UTF8Encoding(false);
using (var sink = new StreamWriter("Foobar.txt", false, utf8WithoutBom))
{
    sink.WriteLine("...");
}
2. Using the default encoding:
If you do not supply an Encoding to StreamWriter's constructor at all, StreamWriter will by default use a UTF8 encoding without BOM, so the following should work just as well:
' VB.NET:
Using sink As New StreamWriter("Foobar.txt")
    sink.WriteLine("...")
End Using
// C#:
using (var sink = new StreamWriter("Foobar.txt"))
{
    sink.WriteLine("...");
}
Finally, note that omitting the BOM is only safe for UTF-8, where the byte order is fixed; a UTF-16 file without a BOM leaves the reader to guess its byte order.
How do I unmarshal a JSON string containing a UTF-8 Byte Order Mark (BOM)?
UTF-8 has a well-defined byte order. There's no such thing as big-endian UTF-8 vs little-endian UTF-8; there is only UTF-8. This means that a byte order marker or BOM in UTF-8 is pointless. Some software thinks it's pointful: that it marks a data file as being stored in UTF-8 (vs UTF-16-LE or UTF-16-BE, each of which would start with the two bytes 0xFF and 0xFE but in either order, if that UTF-16-xx file has a BOM). As long as you agree that such software is wrong, don't use it, or use it in a way that defeats this initial BOM.
As Jim B noted, systems that generate JSON text must not put a UTF-8-encoded BOM (the three bytes 0xEF, 0xBB, 0xBF) at the front of their output. A parser, however, may accept and ignore a BOM at the start of a stream. To do that in Go, inspect the incoming stream data and remove an initial BOM if present, passing the rest of the data on as the JSON bytes. But you're probably better off making your C# code generate allowed output, rather than fancying up your Go code to allow forbidden input.
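The check described above is just a three-byte prefix comparison. Below is a minimal sketch of that logic, written in C# to match the other code on this page rather than in Go. The helper name StripUtf8Bom is hypothetical and not part of any library; the same comparison carries over to a Go byte slice.

```csharp
using System;
using System.Text;

public static class JsonBomDemo
{
    // Hypothetical helper: returns the input without a leading UTF-8 BOM
    // (0xEF 0xBB 0xBF). Input that has no BOM is returned unchanged.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        if (!hasBom)
        {
            return data;
        }
        var rest = new byte[data.Length - 3];
        Array.Copy(data, 3, rest, 0, rest.Length);
        return rest;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, (byte)'{', (byte)'}' };
        byte[] withoutBom = { (byte)'{', (byte)'}' };

        // Both inputs end up as the same two JSON bytes.
        Console.WriteLine(Encoding.UTF8.GetString(StripUtf8Bom(withBom)));    // {}
        Console.WriteLine(Encoding.UTF8.GetString(StripUtf8Bom(withoutBom))); // {}
    }
}
```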
StreamWriter is appending BOM character 65279 to end of file
As I implied in my earlier comment about byte order marks, you are trying to stop StreamWriter from adding a byte order mark. Whether it adds one depends on the encoding you are using.
For example, try creating your own encoding that does not write a byte order mark:
static void Main(string[] args)
{
    for (int i = 0; i < 3; i++)
    {
        using (FileStream stream = new FileStream("file.txt", FileMode.OpenOrCreate))
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true, 0x1000, true))
        using (StreamWriter writer = new StreamWriter(stream, new UTF8Encoding(false), 0x1000, true))
        {
            Console.WriteLine("Read \"" + reader.ReadToEnd() + "\" from the file.");
        }
    }
    Console.ReadLine();
}
By using new UTF8Encoding(false) as your UTF8 encoder, the encoder is explicitly instructed not to use Unicode byte order marks. This is described in the MSDN entry for the UTF8Encoding constructor.
Unicode Encoding - handling Byte Order Mark
From another post (StreamWriter and UTF-8 Byte Order Marks):
"The issue is due to the fact that you are using the static UTF8 property on the Encoding class.
When the GetPreamble method is called on the instance of the Encoding class returned by the UTF8 property, it returns the byte order mark (the byte array of three characters) and is written to the stream before any other content is written to the stream"
So, in this case I change
XmlTextWriter xmlTextWriter = new XmlTextWriter(stream,Encoding.Unicode);
To
XmlTextWriter xmlTextWriter = new XmlTextWriter(stream,new System.Text.UnicodeEncoding (false,false));
and it works ok.
C# StreamWriter writes extra bytes to the Stream
When I run this locally I get
10240
10240
10243
On further inspection, the extra 3 bytes appear at the beginning of the stream: 239 187 191, or EF BB BF in hex. This is the Byte Order Mark (BOM): https://en.wikipedia.org/wiki/Byte_order_mark
To remove these extra bytes from the output, use new UTF8Encoding(false) instead of Encoding.UTF8 when creating the StreamWriter:
using (var sw = new StreamWriter(memStream, new UTF8Encoding(false), 4194304 /* 4 MiB */, leaveOpen: true))
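The three-byte difference can be reproduced on a small scale. This sketch uses an invented helper, EncodedLength, and a ten-character ASCII payload (both are assumptions for illustration, not the code from the question) to compare the stream length under the two encodings.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtraBytesDemo
{
    // Hypothetical helper: writes text to a MemoryStream with the given
    // encoding and returns how many bytes ended up in the stream.
    public static long EncodedLength(Encoding encoding, string text)
    {
        using (var memStream = new MemoryStream())
        {
            using (var sw = new StreamWriter(memStream, encoding, 1024, leaveOpen: true))
            {
                sw.Write(text);
            } // Disposing the writer flushes it; leaveOpen keeps memStream usable.
            return memStream.Length;
        }
    }

    public static void Main()
    {
        string payload = "0123456789"; // 10 ASCII characters, so 10 bytes in UTF-8

        Console.WriteLine(EncodedLength(Encoding.UTF8, payload));           // 13
        Console.WriteLine(EncodedLength(new UTF8Encoding(false), payload)); // 10
    }
}
```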