How to Remove Multiple Utf-8 Bom Sequences

How to remove multiple UTF-8 BOM sequences

you would use the following code to remove utf8 bom

//Remove UTF8 Bom

function remove_utf8_bom($text)
{
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

Batch script remove BOM (ï»¿) from file

This is because the type command will preserve the UTF-8 BOM, so when you combine multiple files which have the BOM, the final file will contain multiple BOMs in various places in middle of the file.

If you are certain that all the SQL files that you want to combine, start with the BOM, then you can use the following script to remove the BOM from each of them before actually combining them.

This is done by piping the output of type. The other side of pipe will consume the first 3 bytes (The BOM) with the help of 3 pause commands. each pause will consume one byte. The rest of stream will be send to the findstr command to append it to final script.

Since the SQL files are encoded UTF-8 and they may contain any characters in the Unicode range, certain code pages will interfere with the operation and may cause the final SQL script to be corrupted.

So this has been taken into account and the batch file will be restarted with code page 437 which is safe for accessing any binary sequence.

@echo off
setlocal DisableDelayedExpansion


setlocal EnableDelayedExpansion
for /F "tokens=*" %%a in ('chcp') do for %%b in (%%a) do set "CP=%%~nb"
if  !CP! NEQ 437 if !CP! NEQ 65001 chcp 437 >nul && (

    REM for file operations, the script must restatred in a new instance.
    "%COMSPEC%" /c "%~f0"

    REM Restoring previous code page
    chcp !CP! >nul
    exit /b
)
endlocal


set "RemoveUTF8BOM=(pause & pause & pause)>nul"
set "echoNL=echo("
set "FinalScript=C:\FinalScript\AllScripts.sql"

:: If you want the final script to start with UTF-8 BOM (This is optional)
:: Create an empty file in NotePad and save it as UTF8-BOM.txt with UTF-8 encoding.
:: Or Create a file in your HexEditor with this byte sequence: EF BB BF
:: and save it as UTF8-BOM.txt
:: The file must be exactly 3 bytes with the above sequence.
(
    type "UTF8-BOM.txt" 2>nul

    REM This assumes that all sql files start with UTF-8 BOM
    REM If not, then they will loose their first 3 otherwise legitimate characters.
    REM Resulting in a final corrupted script.
    for %%A in (*.sql) do (type "%%~A" & %echoNL%)|(%RemoveUTF8BOM% & findstr "^")

)>"%FinalScript%"

Is there a way to remove the BOM from a UTF-8 encoded file?

So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

def read_json_file(file_name, index)
  content = ''
  file = File.open("#{file_name}\\game.json", "r") 
  content = file.read.force_encoding("UTF-8")

  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end

PHP, remove BOM when requiring a PHP file

Simply do not add the BOM when creating the file. It serves no purpose as such.

The most likely explanation for your "the only way it worked for me" is simply a bad testing method, no more, no less. Meaning, your file was being created with UTF-8 perfectly fine, just whatever method you used to confirm that was flawed. I'll guess here that you opened the resulting file in some text editor, and that editor told you the file encoding was "ANSI" or "ASCII" or such.

Well, a plain text file does not declare its encoding anywhere. Your text editor was just taking a guess as to its encoding. If the file contents are just plain English/ASCII, there's no difference whatsoever between ASCII, ANSI and UTF-8. Your text editor simply told you one of the possible answers, where any answer is equally valid. Adding a BOM explicitly places a hint at the beginning of the file that the encoding is UTF-8, which the editor picked up on.

This, or something similar, is likely your entire non-issue.

Remove a BOM character in a file

If you look in the same menu. Click "Convert to UTF-8."

If you look in the same menu. Click "Convert to UTF-8.

How do I remove ï»¿ from the beginning of a file?

Three words for you:

Byte Order Mark (BOM)

That's the representation for the UTF-8 BOM in ISO-8859-1. You have to tell your editor to not use BOMs or use a different editor to strip them out.

To automatize the BOM's removal you can use awk as shown in this question.

As another answer says, the best would be for PHP to actually interpret the BOM correctly, for that you can use mb_internal_encoding(), like this:

 <?php
   //Storing the previous encoding in case you have some other piece 
   //of code sensitive to encoding and counting on the default value.      
   $previous_encoding = mb_internal_encoding();

   //Set the encoding to UTF-8, so when reading files it ignores the BOM       
   mb_internal_encoding('UTF-8');

   //Process the CSS files...

   //Finally, return to the previous encoding
   mb_internal_encoding($previous_encoding);

   //Rest of the code...
  ?>

How to Remove Multiple Utf-8 Bom Sequences