How to Re-Add a Unicode Byte Order Marker in Linux

How can I re-add a unicode byte order marker in linux?

Something like (backup first)):

for i in $(ls *.sql)
do
cp "$i" "$i.temp"
printf '\xFF\xFE' > "$i"
cat "$i.temp" >> "$i"
rm "$i.temp"
done

How to fix Byte-Order Mark found in UTF-8 File validation warning

The location part of your question is easy: The byte-order mark (BOM) will be at the very beginning of the file.

When editing the file, in the bottom status bar toward the right VS Code shows you what encoding is being used for the current file:

Status bar showing "UTF-8 with BOM"

Click it to open the command palette with the options "Reopen with encoding" and "Save with encoding":

The command palette showing the options

Click "Save with Encoding" to get a list of encodings:

Command palette showing list of file encodings such as UTF-8, UTF-16 LE, UTF-16 BE

Choosing an encoding saves the file with that encoding.

See also this note in the Unicode site's FAQ about the BOM and UTF-8 files. It has no function other than to call out that the file is, in fact, UTF-8. In particular, it has no effect on the byte order (the main reason we have BOMs), because the byte order of UTF-8 is fixed.

Using awk to remove the Byte-order mark

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.

Enjoy!

-- ADDENDUM --

Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to EF BB BF UTF-8 BOM bytes from the above table.

Character conversion using iconv without the Unicode Byte Order Mark

I found that this works for me:

iconv_t conv = iconv_open("UTF16LE", "UTF32");

Though I believe it's implementation-dependent.

There is a similar question: Using iconv to convert from UTF-16BE to UTF-8 without BOM

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes


... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

How can I remove the BOM from a UTF-8 file?

A BOM is Unicode codepoint U+FEFF; the UTF-8 encoding consists of the three hex values 0xEF, 0xBB, 0xBF.

With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:

sed -i $'1s/^\uFEFF//' file.txt

This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.

If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh as well as any shell without a printf builtin, provided that /usr/bin/printf is the Gnu version ), but if you want a Posix-compatible version you could use:

sed "$(printf '1s/^\357\273\277//')" file.txt

(The -i in-place edit flag is also a Gnu extension; this version writes the possibly-modified file to stdout.)

Force UTF-8 Byte Order Mark in Perl file output

Try doing this:

print $fh chr(65279);

after opening the file.



Related Topics



Leave a reply



Submit