Elegant Way to Search for UTF-8 Files With BOM

Elegant way to search for UTF-8 files with BOM?

What about this one simple command which not just finds but clears the nasty BOM? :)

find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;

I love "find" :)

Warning: the above will also modify binary files that begin with those three bytes.

If you just want to list the files that contain a BOM, use this one:

grep -rl $'\xEF\xBB\xBF' .
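
Note that the grep command matches the three bytes anywhere in a file, not only at the start. If only files that begin with a BOM matter, a minimal Ruby sketch could look like this. The helper names starts_with_bom? and files_with_bom are made up for illustration and are not part of any library:

```ruby
require 'find'

UTF8_BOM = "\xEF\xBB\xBF".b # the three BOM bytes as a binary string

# Returns true when the first three bytes of the file are the UTF-8 BOM.
# Empty or very short files simply compare unequal.
def starts_with_bom?(path)
  File.binread(path, 3) == UTF8_BOM
end

# Walks `root` recursively and returns the regular files that start with a BOM.
def files_with_bom(root = '.')
  Find.find(root).select { |path| File.file?(path) && starts_with_bom?(path) }
end
```

Calling files_with_bom('.') should then return roughly what the grep command lists, minus the files where the bytes only appear later in the content.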

How do I remove ï»¿ from the beginning of a file?

Three words for you:

Byte Order Mark (BOM)

That's the representation for the UTF-8 BOM in ISO-8859-1. You have to tell your editor to not use BOMs or use a different editor to strip them out.
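
To see where those characters come from, here is a small Ruby sketch (the variable names are arbitrary) that takes the three BOM bytes and decodes them as ISO-8859-1 instead of UTF-8:

```ruby
# The UTF-8 BOM as raw bytes.
bom_bytes = "\xEF\xBB\xBF".b

# Pretend the bytes are ISO-8859-1, then convert to UTF-8 for display.
as_latin1 = bom_bytes.force_encoding('ISO-8859-1').encode('UTF-8')

puts as_latin1 # prints: ï»¿
```

Each byte maps to one Latin-1 character (0xEF, 0xBB, 0xBF), which is why a BOM shows up as three odd characters when a file is read with the wrong encoding.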

To automate the BOM's removal you can use awk, as shown in this question.

As another answer says, the best solution would be for PHP to actually interpret the BOM correctly. For that you can use mb_internal_encoding(), like this:
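
This is not the awk approach referred to above, but the same idea can be sketched in Ruby. The function name strip_bom! is hypothetical, and the sketch assumes the file is small enough to read into memory:

```ruby
BOM = "\xEF\xBB\xBF".b # UTF-8 BOM as a binary string

# Removes a leading UTF-8 BOM from the file at `path`, rewriting it in place.
# Returns true if a BOM was removed, false if the file was left untouched.
def strip_bom!(path)
  data = File.binread(path)
  return false unless data.start_with?(BOM)

  # Drop the first three bytes and write the rest back unchanged.
  File.binwrite(path, data.byteslice(3, data.bytesize - 3))
  true
end
```

Because it only rewrites files that actually start with the BOM, running it twice on the same file is harmless.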

<?php
// Store the previous encoding in case some other piece of code
// is sensitive to encoding and counts on the default value.
$previous_encoding = mb_internal_encoding();

// Set the encoding to UTF-8, so when reading files it ignores the BOM
mb_internal_encoding('UTF-8');

// Process the CSS files...

// Finally, return to the previous encoding
mb_internal_encoding($previous_encoding);

// Rest of the code...
?>

How to avoid tripping over UTF-8 BOM when reading files

With Ruby 1.9.2 or later you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the file actually contains a BOM or not.
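
A quick way to convince yourself is to write the same text with and without a BOM and read both back. This is only a sketch: the helper read_via_bom_mode and the temp-file prefix are invented for the demonstration.

```ruby
require 'tempfile'

# Writes `bytes` to a temporary file and reads it back in mode r:bom|utf-8.
def read_via_bom_mode(bytes)
  Tempfile.create('bom-demo') do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    File.read(tmp.path, mode: 'r:bom|utf-8')
  end
end

with_bom    = read_via_bom_mode("\xEF\xBB\xBFhello".b)
without_bom = read_via_bom_mode('hello'.b)

puts with_bom == without_bom # prints: true
```

In both cases the result is the plain string "hello" tagged as UTF-8; the BOM, when present, is consumed by the reader.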


You may also use the encoding option with other methods:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array with all lines).

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8') { |csv|
  csv.each { |row| p row }
}

What makes a file UTF-8?

Text is UTF-8 because it's valid as UTF-8 and the author decides it is.

How the author's decision is communicated to the consumer is a different question. It involves convention, guessing, and various schemes for in-band or out-of-band signalling: an HTTP or HTML charset declaration, a BOM (which improves guessing), some envelope or embedding format, additional data streams, file naming, and many more.
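
The "valid as UTF-8" half of that statement can at least be checked mechanically. A minimal Ruby sketch follows, with valid_utf8? being a made-up helper name. Passing the check is necessary, not sufficient: pure ASCII text is valid UTF-8 and valid ISO-8859-1 at the same time.

```ruby
# Returns true when `bytes` form a well-formed UTF-8 sequence.
# Works on a copy so the caller's string keeps its original encoding.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

valid_utf8?("h\xC3\xA9llo".b) # "é" encoded as two UTF-8 bytes
valid_utf8?("h\xE9llo".b)     # "é" as a single ISO-8859-1 byte, not valid UTF-8
```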

UTF-8 BOM added to downloaded file

You should check for a BOM in the script file too. Usually, if your IDE saves UTF-8 files with a BOM, the BOM sits before the opening <?php tag, so PHP treats it as output.
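
One way to check a script for this, sketched in Ruby with a hypothetical helper name, is to look at whatever bytes precede the first <?php tag. Anything there, whether a BOM or stray whitespace, is sent to the client as output:

```ruby
# Returns the raw bytes that precede the first "<?php" tag,
# or nil when the file contains no such tag.
def bytes_before_php_tag(path)
  data = File.binread(path)
  index = data.index('<?php'.b)
  index && data.byteslice(0, index)
end
```

An empty string means the tag is the very first thing in the file; a result of "\xEF\xBB\xBF" means the editor saved the script with a BOM.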

Files generated by CodeSmith are UTF-8 with BOM; how do I change them to UTF-8 without BOM?

There are two attributes you can set on the CodeTemplate directive to control this: Encoding and ResponseEncoding. They control how the template is rendered and saved.

https://codesmith.atlassian.net/wiki/display/Generator/The+CodeTemplate+Directive


