Strip the Byte Order Mark from String in C#

If the variable xml is of type string, something has already gone wrong: in a character string, the BOM should not appear as three separate characters, but as a single code point (U+FEFF).

Instead of DownloadString, use DownloadData and parse the byte array. The XML parser should recognize the BOM itself and skip it (apart from using it to auto-detect the document encoding as UTF-8).
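
A minimal sketch of the byte-based approach, assuming the payload is a UTF-8 XML document. XmlBytes.Parse is a hypothetical helper name, and its byte[] argument stands in for whatever DownloadData returns; the sample bytes are built by hand so the snippet does not need a network connection.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XmlBytes
{
    // Hypothetical helper: parse raw bytes (e.g. from WebClient.DownloadData)
    // and let the XML reader consume the BOM while detecting the encoding.
    public static XDocument Parse(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }

    // Builds a small UTF-8 document with a leading BOM, for illustration only.
    public static byte[] SampleWithBom()
    {
        byte[] bom = { 0xEF, 0xBB, 0xBF };
        byte[] body = Encoding.UTF8.GetBytes("<root>hello</root>");
        byte[] all = new byte[bom.Length + body.Length];
        bom.CopyTo(all, 0);
        body.CopyTo(all, bom.Length);
        return all;
    }
}
```

Because the data never passes through a string, there is no stray U+FEFF character in front of the root element for the parser to trip over.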

Unicode conversion to String leaves leading Byte order mark

No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all other characters in the byte[].

The only code that ought to be interpreting and filtering out the BOM would be code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.

All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, in C# strings are stored internally as UTF-16, so there's very little to that conversion when the original data is already in UTF-16 :) ).

How to remove BOM from byte array

All of the .NET XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

using (var stream = new MemoryStream(bytes))
{
    var document = XDocument.Load(stream);
    ...
}

Once you have an XDocument you can then use it to emit the bytes without the BOM:

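// The encoding must be set before the writer is created;
// writer.Settings is read-only afterwards.
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
byte[] bytesWithoutBOM;
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream, settings))
    {
        document.WriteTo(writer);
    } // Disposing the writer flushes its output into the stream.
    bytesWithoutBOM = stream.ToArray();
}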

Removing BOM characters from AJAX-posted string

The UTF-8 BOM bytes get decoded to \ufeff, the Unicode character "zero width no-break space". You can't see them, can't hear them. Filter them out with:

   var good = bad.Replace("\ufeff", "");

How do I ignore the UTF-8 Byte Order Marker in String comparisons?

Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding if you don't want it, but you should consider whether the byte array should contain the BOM to begin with.
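
Removing the BOM yourself after decoding could look roughly like this. It is a sketch that assumes UTF-8 input; BomText.DecodeWithoutBom is a hypothetical helper name, not a framework method.

```csharp
using System.Text;

public static class BomText
{
    // Hypothetical helper: decode UTF-8 bytes, then drop a single leading
    // U+FEFF if one is present. Any U+FEFF further into the text is left alone.
    public static string DecodeWithoutBom(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }
}
```

With the leading character gone, an ordinal comparison of the decoded text against the expected string no longer fails because of an invisible first character.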

EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:

using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by the letter 'A'.
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

        // Encoding.GetString keeps the BOM as U+FEFF.
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length); // 2

        // StreamReader detects the BOM and skips it.
        string viaStreamReader;
        using (StreamReader reader = new StreamReader
            (new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length); // 1
    }
}

XDocument how to save without Byte Order Mark AND preserve formatting/whitespace

Problem solved (Updated due to issue with creating unnecessary whitespace):

XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = new UTF8Encoding(false); // false = do not emit a BOM
using (var writer = XmlWriter.Create(file, settings))
{
    xdoc.Save(writer);
}

UTF8 Encoding not adding byte order mark

The BOM parameter in the constructor does not affect the result of GetBytes; it affects the result of GetPreamble. Callers are expected to prepend it manually.

byte[] bom = new UTF8Encoding(true).GetPreamble(); // 3 bytes
byte[] noBom = new UTF8Encoding(false).GetPreamble(); // 0 bytes
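
Writing the preamble in front of the encoded bytes by hand might be sketched as follows. BomBytes.EncodeWithBom is a hypothetical helper name used only for illustration.

```csharp
using System.Text;

public static class BomBytes
{
    // Hypothetical helper: encode text as UTF-8 and put the preamble in front.
    public static byte[] EncodeWithBom(string text)
    {
        var encoding = new UTF8Encoding(true);
        byte[] preamble = encoding.GetPreamble(); // EF BB BF
        byte[] body = encoding.GetBytes(text);    // never contains the BOM
        byte[] result = new byte[preamble.Length + body.Length];
        preamble.CopyTo(result, 0);
        body.CopyTo(result, preamble.Length);
        return result;
    }
}
```

For the input "A" this yields four bytes: the three preamble bytes followed by 0x41.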

