How do I ignore the UTF-8 Byte Order Marker in String comparisons?
Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should contain the BOM to start with.
EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:
using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by the letter 'A'
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

        // Encoding.GetString keeps the BOM as a character: length is 2
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length);

        // StreamReader consumes the BOM while decoding: length is 1
        string viaStreamReader;
        using (StreamReader reader = new StreamReader
               (new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length);
    }
}
Confusion about Java conversion of bytes to String for comparison of byte order marks
A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match, then the file was written without a BOM.
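private static final char BOM = '\uFEFF'; // Unicode Byte Order Mark

// readFirstLineOfFile stands in for however you read the file's first line
String firstLine = readFirstLineOfFile("filename.txt");
// Check for an empty line first so charAt(0) cannot throw
if (!firstLine.isEmpty() && firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}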
Reading UTF-8 - BOM marker
In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
Take a look at this solution: Handle UTF8 file with BOM
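As a rough illustration of consuming the BOM manually with only the standard library, here is a minimal sketch. The class name BomSkippingReader and the helper open are made up for this example; it assumes the input is UTF-8 and relies on BufferedReader's mark/reset to peek at the first decoded character.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomSkippingReader {
    private static final int BOM = 0xFEFF; // U+FEFF as returned by Reader.read()

    // Hypothetical helper: wraps the stream in a UTF-8 reader and
    // consumes a single leading BOM character if one is present.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);              // remember the start of the stream
        if (reader.read() != BOM) {
            reader.reset();          // first char was content (or EOF): rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by the letter 'A'
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        try (BufferedReader r = open(new ByteArrayInputStream(withBom))) {
            System.out.println(r.readLine().length()); // prints 1
        }
    }
}
```

The same wrapper works for a FileInputStream; a stream without a BOM is returned unchanged.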
Comparing Strings with different byte order marks in Java
Remove the byte order mark (BOM) from the strings as you read them from a file. The character for this is "\uFEFF":
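import java.nio.charset.StandardCharsets;

public class Foo {
    public static void main(final String[] args) {
        // b1 starts with the UTF-8 BOM bytes (EF BB BF as signed bytes); b2 does not
        final byte[] b1 = {-17, -69, -65, 83, 72, 69, 79, 71, 77, 73, 79, 70};
        final byte[] b2 = {83, 72, 69, 79, 71, 77, 73, 79, 70};
        // Decode explicitly as UTF-8 rather than relying on the platform default charset
        final String s1 = new String(b1, StandardCharsets.UTF_8).replace("\uFEFF", "");
        final String s2 = new String(b2, StandardCharsets.UTF_8).replace("\uFEFF", "");
        System.out.println(s1);
        System.out.println(s2);
        System.out.println(s1.equals(s2));
    }
}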
prints:
SHEOGMIOF
SHEOGMIOF
true
What's the difference between UTF-8 and UTF-8 with BOM?
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
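To make the byte sequence concrete, here is a small sketch of checking whether raw data starts with the UTF-8 signature. The class Utf8BomCheck and the method hasUtf8Bom are hypothetical names for this example, not part of any library.

```java
final class Utf8BomCheck {
    // Returns true if the data begins with the three bytes EF BB BF.
    // Java bytes are signed, so each one is masked with 0xFF before comparing.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        byte[] withoutBom = {0x41, 0x42};
        System.out.println(hasUtf8Bom(withBom));    // prints true
        System.out.println(hasUtf8Bom(withoutBom)); // prints false
    }
}
```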
Remove BOM from string in Java
You're replacing the BOM with U+0000, rather than with an empty string. You should replace the BOM with the empty string, e.g.
out.write(l.replace("\uFEFF", "") + "\n");
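Note that replace removes every U+FEFF in the line, not just one at the start. If you only want to drop a leading BOM and leave any interior occurrence alone, something along these lines may be closer to the intent. This is a sketch; LeadingBom and strip are names invented for the example.

```java
final class LeadingBom {
    private static final String BOM = "\uFEFF";

    // Removes a single U+FEFF from the start of the string, if present.
    // Occurrences elsewhere in the string are left untouched.
    static String strip(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        System.out.println(strip("\uFEFFabc").length()); // prints 3
        System.out.println(strip("abc").length());       // prints 3
    }
}
```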
Removing BOM characters from AJAX-posted string
The UTF-8 BOM bytes get decoded to \ufeff, the Unicode character "zero width no-break space". It is invisible, so you can't see it in the string. Filter it out with:
var good = bad.Replace("\ufeff", "");