How to avoid tripping over UTF-8 BOM when reading files

With Ruby 1.9.2 and later you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block so the data survives it
File.open('file.txt', 'r:bom|utf-8') { |file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the BOM is actually present in the file or not.
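A quick self-contained check of that behavior (a sketch using a temp file, not code from the original answer):

```ruby
require 'tempfile'

# Write a file that starts with the three BOM bytes, then read it back
# with and without the bom layout.
file = Tempfile.new('bom_demo')
file.binmode
file.write("\xEF\xBB\xBFhello")
file.close

with_bom    = File.read(file.path, mode: 'r:utf-8')      # BOM kept as U+FEFF
without_bom = File.read(file.path, mode: 'r:bom|utf-8')  # BOM stripped
```

Here `with_bom` starts with the U+FEFF character, while `without_bom` is just the payload.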


You may also use the encoding option with other commands:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(This returns an array of all the lines.)

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8') { |csv|
  csv.each { |row| p row }
}

Is there a way to remove the BOM from a UTF-8 encoded file?

So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

require 'json'

def read_json_file(file_name, index)
  file = File.open("#{file_name}\\game.json", "r")
  content = file.read.force_encoding("UTF-8")
  file.close

  # strip the UTF-8 BOM bytes if present
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end
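An equivalent, slightly more compact strip (not the original answer's code) matches the BOM as the single code point U+FEFF instead of spelling out its byte sequence:

```ruby
content = "\xEF\xBB\xBFhello".force_encoding('UTF-8')

# U+FEFF is the BOM as a character, so one anchored sub at the start of
# the string removes it.
content.sub!(/\A\uFEFF/, '')
```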

Read a UTF-8 text file with BOM

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).

Fix BOM issues when reading UTF-8 encoded CSVs with VBA

EDIT: I found that loading the CSV with a querytable object (see this good example) or through a WorkbookQuery object (introduced in Excel 2016) are the easiest and probably most reliable ways to proceed (see an example from the documentation here).

OLD ANSWER:

Talking with @Profex encouraged me to investigate the issue further. It turns out there are two problems: the BOM and the delimiter used for the CSV. The ADO connection string I need to use is:

strCon = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\test\;Extended Properties='text;HDR=YES;CharacterSet=65001;FMT=Delimited(;)'"

But FMT does not work with a semicolon (FMT=Delimited(;)), at least with Microsoft.ACE.OLEDB.12.0 on an x64 system (Excel x64). Thus, @Profex was quite right to state:

even though the first field name has a ? in front of it, it doesn't
look like it actually matters

given that he was using FMT=Delimited on a CSV delimited by a simple comma (",").

Some people suggest editing the registry so that the semicolon delimiter is accepted. I'd like to avoid that. Also, I'd rather not create a schema.ini file (even if that may be the best solution for complex CSVs). Thus, the only remaining solutions require editing the CSV before creating the ADODB.Connection.

I know my CSV will always have the problematical BOM as well as the same basic structure (something like "date";"count"). Thus I decided to go with this code:

Dim arrByte() As Byte
Dim strFilename As String
Dim iFile As Integer
Dim strBuffer As String
strFilename = "C:\Users\test\t1.csv"
If Dir(strFilename) <> "" Then 'check if the file exists, because if not, it would be created when it is opened in Binary mode.
    iFile = FreeFile
    Open strFilename For Binary Access Read Write As #iFile
    strBuffer = String(3, " ") 'We know the BOM has a length of 3
    Get #iFile, , strBuffer
    If strBuffer = Chr(239) & Chr(187) & Chr(191) Then 'Check if the BOM (0xEF 0xBB 0xBF) is there
        strBuffer = String(LOF(iFile) - 3, " ")
        Get #iFile, , strBuffer 'the current read position is fine because we already used a Get. We store the whole content of the file without the BOM in strBuffer
        arrByte = Replace(strBuffer, ";", ",") 'We replace every semicolon with a comma
        Put #iFile, 1, arrByte
    End If
    Close #iFile
End If

(note: one might use arrByte = StrConv(Replace(strBuffer, ";", ","), vbFromUnicode) because the bytes array is in ANSI format).

Strip the byte order mark from string in C#

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
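The same byte-level approach can be sketched in Ruby (the raw buffer here is a stand-in for the downloaded bytes, not WebClient output):

```ruby
# Pretend these are the raw bytes of a downloaded XML document.
raw = "\xEF\xBB\xBF<root/>".b

BOM = "\xEF\xBB\xBF".b

# Strip the BOM at the byte level, before any string-level parsing.
xml_bytes = raw.start_with?(BOM) ? raw[BOM.bytesize..] : raw
xml = xml_bytes.force_encoding('UTF-8')
```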

Preserve UTF-8 BOM in Browser Downloads

Workaround (from comments): Since only the first three bytes are read, you can prepend two BOMs to the source, which will result in the downloaded file being valid UTF-8 with a BOM.
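A sketch of that workaround (the CSV content is made up): prepend the BOM twice, so that after the browser strips the first one during decoding, one BOM survives in the saved file.

```ruby
csv = "date;count\n2024-01-01;3\n"

# Two BOMs: the browser strips the first while UTF-8 decoding, leaving
# one in the downloaded file.
payload = "\uFEFF\uFEFF" + csv

# Simulate the browser stripping a single leading BOM:
downloaded = payload.sub(/\A\uFEFF/, '')
```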

As far as Excel specifically: Per the answer at https://stackoverflow.com/a/16766198/1143392, newer versions of Excel (from Office 365) do now support UTF-8.

As far as the cause of the behavior described in the question: the cause is that the relevant specs require the BOM to be stripped out, and that’s what browsers do. That is, browsers conform to the requirements of the UTF-8 decode algorithm in the Encoding spec, which is this:

To UTF-8 decode a byte stream stream, run these steps:

  1. Let buffer be an empty byte sequence.

  2. Read three bytes from stream into buffer.

  3. If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.

  4. Let output be a code point stream.

  5. Run UTF-8’s decoder with stream and output.

  6. Return output.

Step 3 is what causes the BOM to be stripped.

Given the Encoding spec requires that, I think there’s no way to tell browsers to keep the BOM.
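Those steps can be sketched in Ruby (a toy decoder over a binary StringIO, not a browser implementation; it assumes the stream holds binary-encoded bytes):

```ruby
require 'stringio'

# UTF-8 decode per the Encoding spec, BOM handling only: read three
# bytes; if they are not 0xEF 0xBB 0xBF, prepend them back to the stream.
def utf8_decode(stream)
  buffer = stream.read(3) || ''
  unless buffer.b == "\xEF\xBB\xBF".b
    # "prepend buffer to stream"
    stream = StringIO.new(buffer + (stream.read || ''))
  end
  stream.read.to_s.force_encoding('UTF-8')
end
```

With a BOM the three bytes are silently dropped; without one, all the bytes reach the decoder.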

Using PowerShell to write a file in UTF-8 without the BOM

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:

$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

This has come up before, and I found the answer on Stack Overflow when it happened to me. The linked answer uses a PushbackInputStream to test for the BOM.
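Ruby's IO has the same pushback capability built in via ungetbyte, so a sketch of the linked technique looks like this (skip_bom is a made-up helper name, not from the linked answer):

```ruby
require 'stringio'

# Peek at up to three bytes; push them back unless they are the BOM.
def skip_bom(io)
  bytes = [io.getbyte, io.getbyte, io.getbyte]
  unless bytes == [0xEF, 0xBB, 0xBF]
    bytes.compact.reverse_each { |b| io.ungetbyte(b) }
  end
  io
end
```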

Send CSV file encoded in UTF-8 with BOM in Java

I didn't do much to fix my issue, and I'm still not sure what was wrong. I only had to change the PrintWriter to a Writer, and add the charset in my javascript code.

Backend service

public void exportStreetsToCsv(Set<Street> streets, Writer writer) throws IOException {
writer.write('\uFEFF'); // Write BOM
// ...

Frontend download

const blobFile = new Blob([response.data], { type: 'text/csv;charset=utf-8' });
this.FileSaver.saveAs(blobFile, 'test.csv');
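The same pattern (write U+FEFF ahead of the payload so Excel detects UTF-8) sketched in Ruby, with made-up file name and data:

```ruby
require 'csv'
require 'tempfile'

# Build the CSV, then write U+FEFF before it; in a UTF-8 string that
# character encodes to the 0xEF 0xBB 0xBF bytes Excel looks for.
csv_body = CSV.generate do |csv|
  csv << ['name', 'city']
  csv << ['José', 'München']
end

out = Tempfile.new(['streets', '.csv'])
out.binmode
out.write("\uFEFF" + csv_body)
out.close
```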

