Read a UTF-8 Text File with BOM

Read a UTF-8 text file with BOM

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).

Why does BOM stick around when reading a UTF-8 file?

UTF-8 is a byte-based encoding, so endianness is irrelevant and an initial byte order mark (BOM) is unnecessary and generally discouraged in UTF-8 data. But its validity and function depend on the consuming application, so Perl cannot simply strip it from the data unasked.

The Unicode BOM character U+FEFF shares an encoding with the ZERO WIDTH NO-BREAK SPACE character, so if layout is the only issue it should not cause a problem if it is left in, even when multiple sources are concatenated so that it appears in the middle of a data stream.

In most applications UTF-8 data sources are treated transparently, so that a file containing only 7-bit ASCII data is identical to the UTF-8 encoding of the same data. Such data must not contain a BOM, because it would interfere with that transparency. For instance, the shebang #! line at the start of a UTF-8-encoded shell command file must not be preceded by a byte order mark, as the shell would simply fail to recognise it.

You can strip the BOM character from the beginning of decoded Unicode data, whatever the source, with

s/\A\N{BOM}//

Of course, the character can be removed throughout a string by using a global substitution (s///g) with the \A anchor removed, or more tidily with

tr/\N{BOM}//d

Update

Character streams are read as a sequence of bytes, and in 16-bit or 32-bit encodings you need to know whether it is the least-significant (little-endian) or the most-significant (big-endian) byte that appears first, so that you know how to assemble those bytes into multi-byte characters.

The BOM character is always U+FEFF. Its whole point is that this never changes. So if I read the first two bytes from a file and they are FF and FE in that order, then I know that the whole file is UTF-16 (or UTF-32) encoded with the least-significant byte followed by the most-significant byte, i.e. little-endian, and I can then correctly interpret the rest of the file.

But byte order is meaningless in byte-based encodings. Every character is represented by a sequence of one or more bytes, and the data is identical regardless of the endianness of its originating system. The BOM character U+FEFF is encoded in UTF-8 as the three hex bytes EF, BB, BF in that order, and that is invariant.
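As an illustration of those byte patterns, here is a minimal sniffing sketch (in Python for brevity; the function name is mine, and the codecs constants are the standard library's spellings of the marks described above):

import codecs

# Order matters: the UTF-32 marks begin with the UTF-16 marks,
# so the longer patterns must be tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, 'utf-32-be'),  # 00 00 FE FF
    (codecs.BOM_UTF16_LE, 'utf-16-le'),  # FF FE
    (codecs.BOM_UTF16_BE, 'utf-16-be'),  # FE FF
    (codecs.BOM_UTF8, 'utf-8'),          # EF BB BF
]

def sniff_bom(path):
    """Return the encoding signalled by a leading BOM, or None."""
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None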

The File::BOM module

In my opinion, File::BOM makes a simple concept unnecessarily complicated.

I can see it being useful if you have to handle many different Unicode files with different encodings from platforms with different endianness, but in such circumstances the variations in the character sequence for the record separator at the end of each line of text are likely to be more of an issue.

As long as you know the encoding of a file before you open it, you should just open it and read it according to that encoding. If the presence of a BOM character in the data is a problem, then just use s/// or tr///d to remove it. But bear in mind that the BOM character should be ignored transparently on all Unicode-compliant systems.

How to read in UTF8+BOM file using PHP and not have the BOM appear as content?

Nope. You have to do it manually.

The BOM is part of signalling byte order in the UTF-16 and UTF-32 encoding schemes, so it makes some sense for UTF-16 decoders to remove it automatically (and many do).

However, UTF-8 always has the same byte order and aims for ASCII compatibility, so a BOM was never envisaged as part of the encoding scheme as specified, and it really isn't supposed to receive any special treatment from UTF-8 decoders.

The UTF-8 faux-BOM is not part of the encoding, but an ad hoc (and somewhat controversial) marker some (predominantly Microsoft) applications use to signal that the file is probably UTF-8. It's not a standard in itself, so specifications that build on UTF-8, like XML and JSON, have had to make special dispensation for it.
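The manual fix is just a three-byte comparison at the start of the data. A minimal sketch (shown here in Python for brevity; the helper name is mine, and the same check can be written in PHP with substr):

def strip_utf8_bom(data: bytes) -> bytes:
    # Drop the UTF-8 faux-BOM (EF BB BF) if the data begins with it.
    return data[3:] if data.startswith(b'\xef\xbb\xbf') else data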

Reading UTF-8 - BOM marker

In Java, you have to consume the UTF-8 BOM manually if present. This behaviour is documented in the Java bug database; there will be no fix for now because a fix would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM

Reading Unicode file data with BOM chars in Python

There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM is not present:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and don't worry about it.
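The same codec works when reading from a file rather than a byte string; a minimal sketch (the file name is illustrative):

# Any leading BOM is consumed by the codec before you see the text.
with open('data.csv', encoding='utf-8-sig') as f:
    text = f.read()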

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
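You can see the three signature bytes directly, for example in a Python shell:

>>> '\ufeff'.encode('utf-8')
b'\xef\xbb\xbf'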

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes


... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

Write a text file encoded in UTF-8 with a BOM through java.nio

As far as I know, there's no direct way in the standard Java NIO library to write text files in UTF-8 with BOM format.

But that's not a problem, since the BOM is nothing but a special character at the start of a text stream, represented as \uFEFF. Just add it manually to the CSV file, e.g.:

List<String> lines =
    Arrays.asList("\uFEFF" + "Übernahme", "Außendarstellung", "€", "@", "UTF-8?");
// write the lines as UTF-8 via java.nio (the file name is illustrative)
Files.write(Paths.get("data.csv"), lines, StandardCharsets.UTF_8);
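Written this way, the file begins with exactly the bytes EF BB BF (the UTF-8 encoding of \uFEFF) followed by the UTF-8 data, which is what applications that expect a BOM (predominantly Microsoft ones, as noted above) look for.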

