How to Ignore the Utf-8 Byte Order Marker in String Comparisons

How do I ignore the UTF-8 Byte Order Marker in String comparisons?

Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should consider the BOM to start with.

EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:

using System;
using System.IO;
using System.Text;

class Test
{
static void Main()
{
byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
string viaEncoding = Encoding.UTF8.GetString(withBom);
Console.WriteLine(viaEncoding.Length);

string viaStreamReader;
using (StreamReader reader = new StreamReader
(new MemoryStream(withBom), Encoding.UTF8))
{
viaStreamReader = reader.ReadToEnd();
}
Console.WriteLine(viaStreamReader.Length);
}
}

Confusion about Java conversion of bytes to String for comparison of byte order marks

A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match then the file was written without a BOM.

private final static char BOM = '\uFEFF';    // Unicode Byte Order Mark
String firstLine = readFirstLineOfFile("filename.txt");
if (firstLine.charAt(0) == BOM) {
// We have a BOM
} else {
// No BOM present.
}

Reading UTF-8 - BOM marker

In Java, you have to consume manually the UTF8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it will break existing tools like JavaDoc or XML parsers. The Apache IO Commons provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM

Comparing Strings with different byte order masks in Java

Remove the byte order mask (BOM) from the strings as you read them from a file. The character code for this is "\uFEFF"

public class Foo {
public static void main(final String[] args) {
final byte[] b1 = {-17, -69, -65, 83, 72, 69, 79, 71, 77, 73, 79, 70};
final byte[] b2 = {83, 72, 69, 79, 71, 77, 73, 79, 70};

final String s1 = new String(b1).replace("\uFEFF", "");
final String s2 = new String(b2).replace("\uFEFF", "");

System.out.println(s1);
System.out.println(s2);
System.out.println(s1.equals(s2));
}
}

prints:

SHEOGMIOF
SHEOGMIOF
true

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes


... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

Remove BOM from string in Java

You're replacing the BOM with U+0000, rather than with an empty string. You should replace the BOM with the empty string, e.g.

out.write(l.replace("\uFEFF", "") + "\n");

Removing BOM characters from AJAX-posted string

The utf-8 BOM bytes get translated to \ufeff. Unicode character "Zero width no-break space", can't see them, can't hear them. Filter them out with:

   var good = bad.Replace("\ufeff", "");


Related Topics



Leave a reply



Submit