How to Remove the BOM from a UTF-8 Encoded File

How can I remove the BOM from a UTF-8 file?

A BOM is Unicode codepoint U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
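You can confirm those bytes directly in Python, for instance (a quick check, not part of the original answer):

# Encoding U+FEFF as UTF-8 yields exactly the bytes EF BB BF
assert '\ufeff'.encode('utf-8') == b'\xef\xbb\xbf'
print('\ufeff'.encode('utf-8').hex())  # prints: efbbbf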

With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:

sed -i $'1s/^\uFEFF//' file.txt

This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.

If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell without a printf builtin, provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:

sed "$(printf '1s/^\357\273\277//')" file.txt

(The -i in-place edit flag is also a GNU extension; this version writes the possibly-modified file to stdout.)

Why does BOM stick around when reading a UTF-8 file?

UTF-8 is a byte-based encoding, so endianness is irrelevant and an initial byte order mark (BOM) is unnecessary and generally discouraged in UTF-8 data. But its validity and function depend on the prevailing application, so Perl cannot simply strip it from the data without question.

The Unicode BOM character U+FEFF shares an encoding with the ZERO WIDTH NO-BREAK SPACE character, so if layout is the only issue it should not cause a problem if it is left in, even if multiple sources are concatenated so that it appears in the middle of a data stream.
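Python's Unicode database shows that shared identity, if you want to see it for yourself (an illustration, not from the original answer):

import unicodedata

# The canonical name of U+FEFF is the no-break-space one
print(unicodedata.name('\ufeff'))  # ZERO WIDTH NO-BREAK SPACE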

In most file applications UTF-8 data sources are treated transparently, so that a file containing only 7-bit ASCII data is identical to the UTF-8 encoding of the same data. Such data must not contain a BOM, because it would interfere with that transparency. For instance, the shebang #! line at the start of a UTF-8-encoded shell command file must not be preceded by a byte order mark, as the shell would simply fail to recognise it.
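As a rough illustration of the shebang problem, this Python sketch (the script name is hypothetical) checks whether a BOM sits in front of a script's #! line:

import codecs

with open('myscript.sh', 'rb') as f:  # hypothetical script file
    head = f.read(len(codecs.BOM_UTF8) + 2)

if head.startswith(codecs.BOM_UTF8):
    # The kernel looks for the literal bytes '#!' at offset 0,
    # so a preceding BOM stops the shebang being recognised.
    print('BOM precedes the shebang; the interpreter line will be ignored')
elif head.startswith(b'#!'):
    print('shebang starts the file, as it should')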

You can strip the BOM character from the beginning of decoded Unicode data, whatever the source, with

s/\A\N{BOM}//

Of course, the character can be removed throughout a string by using a global substitution with the \A anchor removed, or more tidily with

tr/\N{BOM}//d




Update

Character streams are read as a sequence of bytes, and in 16-bit or 32-bit encodings you need to know whether it is the least-significant (little-endian) or the most-significant (big-endian) byte that appears first, so that you know how to assemble those bytes into a multi-byte character.

The BOM character is always U+FEFF; its whole point is that it never changes. So if I read the first two bytes from a file and they are FF and FE in that order, then I know that the whole file is UTF-16 (or UTF-32) encoded with the least-significant byte followed by the most-significant byte, or little-endian, and I can then correctly interpret the rest of the file.
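A minimal sketch of that detection logic in Python (the file name is an assumption, and only UTF-16 is considered):

with open('data.txt', 'rb') as f:  # hypothetical UTF-16 file
    head = f.read(2)

if head == b'\xff\xfe':
    print('little-endian: decode the rest as UTF-16-LE')
elif head == b'\xfe\xff':
    print('big-endian: decode the rest as UTF-16-BE')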

But byte order is meaningless in byte-based encodings. Every character is represented by a sequence of one or more bytes, and the data is identical regardless of the endianness of the originating system. The BOM character U+FEFF is encoded in UTF-8 as the three hex bytes EF, BB, BF, in that order, and that is invariant.

The File::BOM module

In my opinion, File::BOM makes a simple concept unnecessarily complicated

I can see it being useful if you have to handle many different Unicode files with different encodings from platforms with different endianness, but in such circumstances the variations in the character sequence for the record separator at the end of each line of text are likely to be more of an issue.

As long as you know the encoding of a file before you open it, you should just open it and read it according to that standard. If the presence of a BOM character in the data is a problem, then just use s/// or tr///d to remove it. But bear in mind that the BOM character should be ignored transparently on all Unicode-compliant systems.

Remove a BOM character in a file

In Notepad++, look in the same Encoding menu and click "Convert to UTF-8", which saves the file without a BOM.


Read a UTF-8 text file with BOM

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).

How to remove the BOM characters from a UTF-8 encoded CSV?

Here is a function that does this:

public static void SaveAsUTF8WithoutByteOrderMark(string fileName)
{
    SaveAsUTF8WithoutByteOrderMark(fileName, null);
}

public static void SaveAsUTF8WithoutByteOrderMark(string fileName, Encoding encoding)
{
    if (fileName == null)
        throw new ArgumentNullException("fileName");

    if (encoding == null)
    {
        encoding = Encoding.Default;
    }

    // Read with the caller's encoding, then rewrite as UTF-8 without a BOM
    // (the false argument to UTF8Encoding suppresses the BOM).
    File.WriteAllText(fileName, File.ReadAllText(fileName, encoding), new UTF8Encoding(false));
}

Is there a way to remove the BOM from a UTF-8 encoded file?

So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

require 'json'

def read_json_file(file_name, index)
  file = File.open("#{file_name}\\game.json", "r")
  content = file.read.force_encoding("UTF-8")
  file.close

  # Strip the UTF-8 BOM byte sequence wherever it appears
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Simply use the "utf-8-sig" codec:

fp = open("file.txt", "rb")  # read raw bytes so the decode below works
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer; that would be simpler, but uses twice the disk space for a short period.
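For completeness, here is a minimal sketch of that copy-to-a-new-file approach (the temporary file name is an assumption):

import codecs
import os
import sys

src = sys.argv[1]
tmp = src + '.nobom'  # hypothetical temporary name

with open(src, 'rb') as fin, open(tmp, 'wb') as fout:
    chunk = fin.read(4096)
    if chunk.startswith(codecs.BOM_UTF8):
        chunk = chunk[len(codecs.BOM_UTF8):]  # drop the three BOM bytes
    while chunk:
        fout.write(chunk)
        chunk = fin.read(4096)

os.replace(tmp, src)  # replace the original with the BOM-less copy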

As for guessing the encoding, you can just loop through the encodings from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we fall back to Latin-1, which will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in that case, since it's really a fallback and your code might want to handle it more carefully (if it can).

How to remove the BOM from a Base64-encoded UTF string?

Normally, when you read UTF (with BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF text correctly regardless of whether or not the text file has a BOM:

File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.

The problem is that you're dealing with UTF text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array or delegate that job to a StreamReader, which is exactly what ReadAllText() does:

var bytes = Convert.FromBase64String(fileContent);
string finalString = null;

using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.

How to remove the namespace and use UTF-8 No BOM encoding on an XML

The pipeline component is probably working fine and already doing its job of removing the BOM and encoding to UTF-8.

Your second screenshot shows Notepad++.
The "Encode in" feature of Notepad++ lets you display the content of your file in a particular encoding.

However, it is not an encoding detector.
Detecting an encoding can be a difficult task, especially when the file has no BOM, because some encodings have similarities (for example, UTF-8's first 128 characters are the same as ASCII).

Does your input file actually contain any character encoded in a way unique to UTF-8? That could be a good test.
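In Python terms, that test could look something like this (the file name is an assumption):

with open('input.xml', 'rb') as f:  # hypothetical input file
    data = f.read()

# Bytes >= 0x80 can only come from a multi-byte UTF-8 sequence (or some
# other 8-bit encoding); pure ASCII is indistinguishable from BOM-less UTF-8.
if any(b > 0x7f for b in data):
    print('contains non-ASCII bytes: the encoding matters')
else:
    print('pure ASCII: UTF-8 and ASCII are identical here')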


