What Is Xml Bom and How to Detect It

What is XML BOM and how do I detect it?

For a ANSI XML file it should actually be removed. If you want to use UTF-8 you don't really need it. Only for UTF-16 and UTF-32 it is needed.

The Byte-Order-Mark (or BOM), is a
special marker added at the very
beginning of an Unicode file encoded
in UTF-8, UTF-16 or UTF-32. It is used
to indicate whether the file uses the
big-endian or little-endian byte
order. The BOM is mandatory for UTF-16
and UTF-32, but it is optional for
UTF-8.

(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)

Regarding the question on how detect this in java.

Check the following answer to this question: Java : How to determine the correct charset encoding of a stream and if you now want to determine the BOM yourself (at your own risk) check for example this code Java Tip: How to read a file and automatically specify the correct encoding.

Basically just read in the first few bytes yourself and then determine if you may have found a BOM.

How to generate XML, UTF-8 with BOM using Python Element Tree?

Peek into sources of ElementTree.write shows that prolog is hardcoded there (https://github.com/python/cpython/blob/main/Lib/xml/etree/ElementTree.py or permalink https://github.com/python/cpython/blob/ee0ac328d38a86f7907598c94cb88a97635b32f8/Lib/xml/etree/ElementTree.py). Therefore probably using internals of ET is the only option (other than monkey-pathing module), to write required preamble and keep BOM in the file:

import xml.etree.ElementTree as ET
qnames, namespaces = ET._namespaces(tree._root, None)
with open(lang_resx_fpath,'w',encoding='utf-8-sig') as f:
f.write("<?xml version='1.0' encoding='utf-8'?>\n" )
ET._serialize_xml(f.write,
tree._root, qnames, namespaces,
short_empty_elements=False)

Probably it is not more elegant than your solution (and maybe it is even less elegant). The only advantage is that it does not require writing file twice, which would be minor benefit besides some huge XML files.

How to parse an XML file containing BOM?

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));

Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:

private InputStream openStream(final URL url) throws IOException
{
final URLConnection cxn = url.openConnection();
final String contentEncoding = cxn.getContentEncoding();
if(contentEncoding == null)
return cxn.getInputStream();
else if(contentEncoding.equalsIgnoreCase("gzip")
|| contentEncoding.equalsIgnoreCase("x-gzip"))
return new GZIPInputStream(cxn.getInputStream());
else
throw new IOException("Unexpected content-encoding: " + contentEncoding);
}

(warning: not tested) and then use:

builder.build(openStream(aUrl.openStream()));

. This is basically equivalent to the above — aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream() — except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection.

How to Remove BOM from an XML file in Java

Having a tool breaking because of a BOM in an UTF-8 file is a very common thing in my experience. I don't know why there where so many downvotes (but then it gives me the chance to try to get enough vote to win a special SO badge ; )

More seriously: an UTF-8 BOM doesn't typically make that much sense but it is fully valid (although discouraged) by the specs. Now the problem is that a lot of people aren't aware that a BOM is valid in UTF-8 and hence wrote broken tools / APIs that do not process correctly these files.

Now you may have two different issues: you may want to process the file from Java or you need to use Java to programmatically create/fix files that other (broken) tools need.

I've had the case in one consulting gig where the helpdesk would keep getting messages from users that had problems with some text editor that would mess up perfectly valid UTF-8 files produced by Java. So I had to work around that issue by making sure to remove the BOM from every single UTF-8 file we were dealing with.

I you want to delete a BOM from a file, you could create a new file and skip the first three bytes. For example:

... $  file  /tmp/src.txt 
/tmp/src.txt: UTF-8 Unicode (with BOM) English text

... $ ls -l /tmp/src.txt
-rw-rw-r-- 1 tact tact 1733 2012-03-16 14:29 /tmp/src.txt

... $ hexdump -C /tmp/src.txt | head -n 1
00000000 ef bb bf 50 6f 6b 65 ...

As you can see, the file starts with "ef bb bf", this is the (fully valid) UTF-8 BOM.

Here's a method that takes a file and makes a copy of it by skipping the first three bytes:

 public static void workAroundbrokenToolsAndAPIs(File sourceFile, File destFile) throws IOException {
if(!destFile.exists()) {
destFile.createNewFile();
}

FileChannel source = null;
FileChannel destination = null;

try {
source = new FileInputStream(sourceFile).getChannel();
source.position(3);
destination = new FileOutputStream(destFile).getChannel();
destination.transferFrom( source, 0, source.size() - 3 );
}
finally {
if(source != null) {
source.close();
}
if(destination != null) {
destination.close();
}
}
}

Note that it's "raw": you'd typically want to first make sure you have a BOM before calling this or "Bad Thinks May Happen" [TM].

You can look at your file afterwards:

... $  file  /tmp/dst.txt 
/tmp/dst.txt: UTF-8 Unicode English text

... $ ls -l /tmp/dst.txt
-rw-rw-r-- 1 tact tact 1730 2012-03-16 14:41 /tmp/dst.txt

... $ hexdump -C /tmp/dst.txt
00000000 50 6f 6b 65 ...

And the BOM is gone...

Now if you simply want to transparently remove the BOM for one your broken Java API, then you could use the pushbackInputStream described here: why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

private static InputStream checkForUtf8BOMAndDiscardIfAny(InputStream inputStream) throws IOException {
PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
byte[] bom = new byte[3];
if (pushbackInputStream.read(bom) != -1) {
if (!(bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF)) {
pushbackInputStream.unread(bom);
}
}
return pushbackInputStream; }

Note that this works, but shall definitely NOT fix the more serious issue where you can have other tools in the work chain not working correctly with UTF-8 files having a BOM.

And here's a link to a question with a more complete answer, covering other encodings as well:

Byte order mark screws up file reading in Java

Parsing and removing BOM/Preamble from XML via filesystem

C3 AF C2 BB C2 BF looks to be a double UTF-8 encoded BOM. UTF-8 encoding of the BOM is EF BB BF. If you were to treat each of those as a separate character and UTF-8 encode, you'd end up with the sequence that you're seeing.

So the document you have is broken. Something is taking a document containing a UTF-8 BOM and treating it as extended ASCII. If you can't get the documents fixed at source, I'd be inclined to look for that specific sequence at the start of the file and strip it if present.

If the documents in question use other extended ASCII characters, there's a good chance they'll be broken too.

XmlReader breaks on UTF-8 BOM

The xml string must not (!) contain the BOM, the BOM is only allowed in byte data (e.g. streams) which is encoded with UTF-8. This is because the string representation is not encoded, but already a sequence of unicode characters.

It therefore seems that you load the string wrong, which is in code you unfortunatley didn't provide.

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter which you can then convert to a string with ToString. Since this avoids passing through a byte representation it is not only faster but also avoids such problems.

Something like this:

private static string SerializeResponse(Response response)
{
var output = new StringWriter();
var writer = XmlWriter.Create(output);
new XmlSerializer(typeof(Response)).Serialize(writer, response);
return output.ToString();
}


Related Topics



Leave a reply



Submit