How can I remove the BOM from a UTF-8 file?
A BOM is Unicode codepoint U+FEFF; the UTF-8 encoding consists of the three hex values 0xEF, 0xBB, 0xBF.
With bash, you can create a UTF-8 BOM with the $''
special quoting form, which implements Unicode escapes: $'\uFEFF'
. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:
sed -i $'1s/^\uFEFF//' file.txt
This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.
If you are using some other shell, you might find that "$(printf '\ufeff')"
produces the BOM character (that works with zsh
as well as any shell without a printf
builtin, provided that /usr/bin/printf
is the Gnu version ), but if you want a Posix-compatible version you could use:
sed "$(printf '1s/^\357\273\277//')" file.txt
(The -i
in-place edit flag is also a Gnu extension; this version writes the possibly-modified file to stdout.)
Remove a BOM character in a file
If you look in the same menu. Click "Convert to UTF-8."
Batch script remove BOM () from file
This is because the type
command will preserve the UTF-8 BOM, so when you combine multiple files which have the BOM, the final file will contain multiple BOMs in various places in middle of the file.
If you are certain that all the SQL files that you want to combine, start with the BOM, then you can use the following script to remove the BOM from each of them before actually combining them.
This is done by piping the output of type
. The other side of pipe will consume the first 3 bytes (The BOM) with the help of 3 pause
commands. each pause
will consume one byte. The rest of stream will be send to the findstr
command to append it to final script.
Since the SQL files are encoded UTF-8 and they may contain any characters in the Unicode range, certain code pages will interfere with the operation and may cause the final SQL script to be corrupted.
So this has been taken into account and the batch file will be restarted with code page 437 which is safe for accessing any binary sequence.
@echo off
setlocal DisableDelayedExpansion
setlocal EnableDelayedExpansion
for /F "tokens=*" %%a in ('chcp') do for %%b in (%%a) do set "CP=%%~nb"
if !CP! NEQ 437 if !CP! NEQ 65001 chcp 437 >nul && (
REM for file operations, the script must restatred in a new instance.
"%COMSPEC%" /c "%~f0"
REM Restoring previous code page
chcp !CP! >nul
exit /b
)
endlocal
set "RemoveUTF8BOM=(pause & pause & pause)>nul"
set "echoNL=echo("
set "FinalScript=C:\FinalScript\AllScripts.sql"
:: If you want the final script to start with UTF-8 BOM (This is optional)
:: Create an empty file in NotePad and save it as UTF8-BOM.txt with UTF-8 encoding.
:: Or Create a file in your HexEditor with this byte sequence: EF BB BF
:: and save it as UTF8-BOM.txt
:: The file must be exactly 3 bytes with the above sequence.
(
type "UTF8-BOM.txt" 2>nul
REM This assumes that all sql files start with UTF-8 BOM
REM If not, then they will loose their first 3 otherwise legitimate characters.
REM Resulting in a final corrupted script.
for %%A in (*.sql) do (type "%%~A" & %echoNL%)|(%RemoveUTF8BOM% & findstr "^")
)>"%FinalScript%"
How can I remove any UTF-8 BOM that exists -within- some text, not at the start of some text
Assuming that your file is small enough to hold in memory, and that you have an Enumerable.Replace
extension method for replacing subsequences, then you could use:
var bytes = File.ReadAllBytes(filePath);
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
var empty = Enumerable.Empty<byte>();
bytes = bytes.Replace(bom, empty).ToArray();
File.WriteAllBytes(filePath, bytes);
Here is a simple (inefficient) implementation of the Replace
extension method:
public static IEnumerable<TSource> Replace<TSource>(
this IEnumerable<TSource> source,
IEnumerable<TSource> match,
IEnumerable<TSource> replacement)
{
return Replace(source, match, replacement, EqualityComparer<TSource>.Default);
}
public static IEnumerable<TSource> Replace<TSource>(
this IEnumerable<TSource> source,
IEnumerable<TSource> match,
IEnumerable<TSource> replacement,
IEqualityComparer<TSource> comparer)
{
int sLength = source.Count();
int mLength = match.Count();
if (sLength < mLength || mLength == 0)
return source;
int[] matchIndexes = (
from sIndex in Enumerable.Range(0, sLength - mLength + 1)
where source.Skip(sIndex).Take(mLength).SequenceEqual(match, comparer)
select sIndex
).ToArray();
var result = new List<TSource>();
int sPosition = 0;
foreach (int mPosition in matchIndexes)
{
var sPart = source.Skip(sPosition).Take(mPosition - sPosition);
result.AddRange(sPart);
result.AddRange(replacement);
sPosition = mPosition + mLength;
}
var sLastPart = source.Skip(sPosition).Take(sLength - sPosition);
result.AddRange(sLastPart);
return result;
}
UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)
Since you state:
All of my (text) files are currently
stored in UTF-8 with the BOM
then use the 'utf-8-sig' codec to decode them:
>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'
It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
Is there a way to remove the BOM from a UTF-8 encoded file?
So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
def read_json_file(file_name, index)
content = ''
file = File.open("#{file_name}\\game.json", "r")
content = file.read.force_encoding("UTF-8")
content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
json = JSON.parse(content)
print json
end
Remove BOM from string in Java
You're replacing the BOM with U+0000, rather than with an empty string. You should replace the BOM with the empty string, e.g.
out.write(l.replace("\uFEFF", "") + "\n");
Related Topics
How to Find Directory of Some Command
How to Find the Maximum Stack Size
Optimize PDF Files (With Ghostscript or Other)
How to Get the Name of the Current Git Branch into a Variable in a Shell Script
How to Compile and Link a 32-Bit Windows Executable Using Mingw-W64
Syntax Error Near Unexpected Token 'Then'
How to Set Cron to Display Gui Application
Getting Pids from Ps -Ef |Grep Keyword
Unix Standard Directory to Put Custom Executables or Scripts
The Concept of 'Hold Space' and 'Pattern Space' in Sed
Get Ceiling Integer from Number in Linux (Bash)
How Clear and Invalidate Arm V7 Processor Cache from User Mode on Linux 2.6.35
How to Run Nginx Within a Docker Container Without Halting
Node.Js: Cannot Find Module 'Request'
Linux Shell Sort File According to the Second Column
Write a Bash Shell Script That Consumes a Constant Amount of Ram for a User Defined Time