Reading Utf-8 - Bom Marker

Reading UTF-8 - BOM marker

In Java, you have to consume manually the UTF8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it will break existing tools like JavaDoc or XML parsers. The Apache IO Commons provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM

Read a UTF-8 text file with BOM

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")?. ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).

Reading Unicode file data with BOM chars in Python

There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it

Byte order mark screws up file reading in Java

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream


Here is a class I coded a while ago, I just edited the package name before pasting. Nothing special, it is quite similar to solutions posted in SUN's bug database. Incorporate it in your code and you're fine.

/* ____________________________________________________________________________
*
* File: UnicodeBOMInputStream.java
* Author: Gregory Pakosz.
* Date: 02 - November - 2005
* ____________________________________________________________________________
*/
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
* The <code>UnicodeBOMInputStream</code> class wraps any
* <code>InputStream</code> and detects the presence of any Unicode BOM
* (Byte Order Mark) at its beginning, as defined by
* <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
*
* <p>The
* <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
* defines 5 types of BOMs:<ul>
* <li><pre>00 00 FE FF = UTF-32, big-endian</pre></li>
* <li><pre>FF FE 00 00 = UTF-32, little-endian</pre></li>
* <li><pre>FE FF = UTF-16, big-endian</pre></li>
* <li><pre>FF FE = UTF-16, little-endian</pre></li>
* <li><pre>EF BB BF = UTF-8</pre></li>
* </ul></p>
*
* <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
* or not.
* </p>
* <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
* wrapped <code>InputStream</code> object.</p>
*/
public class UnicodeBOMInputStream extends InputStream
{
/**
* Type safe enumeration class that describes the different types of Unicode
* BOMs.
*/
public static final class BOM
{
/**
* NONE.
*/
public static final BOM NONE = new BOM(new byte[]{},"NONE");

/**
* UTF-8 BOM (EF BB BF).
*/
public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
(byte)0xBB,
(byte)0xBF},
"UTF-8");

/**
* UTF-16, little-endian (FF FE).
*/
public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
(byte)0xFE},
"UTF-16 little-endian");

/**
* UTF-16, big-endian (FE FF).
*/
public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
(byte)0xFF},
"UTF-16 big-endian");

/**
* UTF-32, little-endian (FF FE 00 00).
*/
public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
(byte)0xFE,
(byte)0x00,
(byte)0x00},
"UTF-32 little-endian");

/**
* UTF-32, big-endian (00 00 FE FF).
*/
public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
(byte)0x00,
(byte)0xFE,
(byte)0xFF},
"UTF-32 big-endian");

/**
* Returns a <code>String</code> representation of this <code>BOM</code>
* value.
*/
public final String toString()
{
return description;
}

/**
* Returns the bytes corresponding to this <code>BOM</code> value.
*/
public final byte[] getBytes()
{
final int length = bytes.length;
final byte[] result = new byte[length];

// Make a defensive copy
System.arraycopy(bytes,0,result,0,length);

return result;
}

private BOM(final byte bom[], final String description)
{
assert(bom != null) : "invalid BOM: null is not allowed";
assert(description != null) : "invalid description: null is not allowed";
assert(description.length() != 0) : "invalid description: empty string is not allowed";

this.bytes = bom;
this.description = description;
}

final byte bytes[];
private final String description;

} // BOM

/**
* Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
* specified <code>InputStream</code>.
*
* @param inputStream an <code>InputStream</code>.
*
* @throws NullPointerException when <code>inputStream</code> is
* <code>null</code>.
* @throws IOException on reading from the specified <code>InputStream</code>
* when trying to detect the Unicode BOM.
*/
public UnicodeBOMInputStream(final InputStream inputStream) throws NullPointerException,
IOException

{
if (inputStream == null)
throw new NullPointerException("invalid input stream: null is not allowed");

in = new PushbackInputStream(inputStream,4);

final byte bom[] = new byte[4];
final int read = in.read(bom);

switch(read)
{
case 4:
if ((bom[0] == (byte)0xFF) &&
(bom[1] == (byte)0xFE) &&
(bom[2] == (byte)0x00) &&
(bom[3] == (byte)0x00))
{
this.bom = BOM.UTF_32_LE;
break;
}
else
if ((bom[0] == (byte)0x00) &&
(bom[1] == (byte)0x00) &&
(bom[2] == (byte)0xFE) &&
(bom[3] == (byte)0xFF))
{
this.bom = BOM.UTF_32_BE;
break;
}

case 3:
if ((bom[0] == (byte)0xEF) &&
(bom[1] == (byte)0xBB) &&
(bom[2] == (byte)0xBF))
{
this.bom = BOM.UTF_8;
break;
}

case 2:
if ((bom[0] == (byte)0xFF) &&
(bom[1] == (byte)0xFE))
{
this.bom = BOM.UTF_16_LE;
break;
}
else
if ((bom[0] == (byte)0xFE) &&
(bom[1] == (byte)0xFF))
{
this.bom = BOM.UTF_16_BE;
break;
}

default:
this.bom = BOM.NONE;
break;
}

if (read > 0)
in.unread(bom,0,read);
}

/**
* Returns the <code>BOM</code> that was detected in the wrapped
* <code>InputStream</code> object.
*
* @return a <code>BOM</code> value.
*/
public final BOM getBOM()
{
// BOM type is immutable.
return bom;
}

/**
* Skips the <code>BOM</code> that was found in the wrapped
* <code>InputStream</code> object.
*
* @return this <code>UnicodeBOMInputStream</code>.
*
* @throws IOException when trying to skip the BOM from the wrapped
* <code>InputStream</code> object.
*/
public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
{
if (!skipped)
{
in.skip(bom.bytes.length);
skipped = true;
}
return this;
}

/**
* {@inheritDoc}
*/
public int read() throws IOException
{
return in.read();
}

/**
* {@inheritDoc}
*/
public int read(final byte b[]) throws IOException,
NullPointerException
{
return in.read(b,0,b.length);
}

/**
* {@inheritDoc}
*/
public int read(final byte b[],
final int off,
final int len) throws IOException,
NullPointerException
{
return in.read(b,off,len);
}

/**
* {@inheritDoc}
*/
public long skip(final long n) throws IOException
{
return in.skip(n);
}

/**
* {@inheritDoc}
*/
public int available() throws IOException
{
return in.available();
}

/**
* {@inheritDoc}
*/
public void close() throws IOException
{
in.close();
}

/**
* {@inheritDoc}
*/
public synchronized void mark(final int readlimit)
{
in.mark(readlimit);
}

/**
* {@inheritDoc}
*/
public synchronized void reset() throws IOException
{
in.reset();
}

/**
* {@inheritDoc}
*/
public boolean markSupported()
{
return in.markSupported();
}

private final PushbackInputStream in;
private final BOM bom;
private boolean skipped = false;

} // UnicodeBOMInputStream

And you're using it this way:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
public static void main(final String[] args) throws Exception
{
FileInputStream fis = new FileInputStream("test/offending_bom.txt");
UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

System.out.println("detected BOM: " + ubis.getBOM());

System.out.print("Reading the content of the file without skipping the BOM: ");
InputStreamReader isr = new InputStreamReader(ubis);
BufferedReader br = new BufferedReader(isr);

System.out.println(br.readLine());

br.close();
isr.close();
ubis.close();
fis.close();

fis = new FileInputStream("test/offending_bom.txt");
ubis = new UnicodeBOMInputStream(fis);
isr = new InputStreamReader(ubis);
br = new BufferedReader(isr);

ubis.skipBOM();

System.out.print("Reading the content of the file after skipping the BOM: ");
System.out.println(br.readLine());

br.close();
isr.close();
ubis.close();
fis.close();
}

} // UnicodeBOMInputStreamUsage

How to read in UTF8+BOM file using PHP and not have the BOM appear as content?

Nope. You have to do it manually.

The BOM is part of signalling byte order in the UTF-16LE and UTF-16BE encodings, so it makes some sense for UTF-16 decoders to remove it automatically (and so many do).

However UTF-8 always has the same byte order, and aims for ASCII compatibility, so including a BOM was never envisaged as part of the encoding scheme as specified, and so really it isn't supposed to receive any special treatment from UTF-8 decoders.

The UTF-8 faux-BOM is not part of the encoding, but an ad hoc (and somewhat controversial) marker some (predominantly Microsoft) applications use to signal that the file is probably UTF-8. It's not a standard in itself, so specifications that build on UTF-8, like XML and JSON, have had to make special dispensation for it.

Java: UTF-8 and BOM

Yes, it is still true that Java cannot handle the BOM in UTF8 encoded files. I came across this issue when parsing several XML files for data formatting purposes. Since you can't know when you might come across them, I would suggest stripping the BOM marker out if you find it at runtime or following the advice that tchrist gave.

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes


... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

What does rb:bom|utf-8 mean in CSV.open in Ruby?

When reading a text file in Ruby you need to specify the encoding or it will revert to the default, which might be wrong.

If you're reading CSV files that are BOM encoded then you need to do it that way.

Pure UTF-8 encoding can't deal with the BOM header so you need to read it and skip past that part before treating the data as UTF-8. That notation is how Ruby expresses that requirement.



Related Topics



Leave a reply



Submit