How to avoid tripping over UTF-8 BOM when reading files

With Ruby 1.9.2 and later you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block so the data survives it
File.open('file.txt', 'r:bom|utf-8') { |file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the BOM is actually present in the file or not.
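A quick self-contained check of that behavior (a sketch using a temp file, not code from the original answer):

```ruby
require 'tempfile'

# Write a file that starts with the three BOM bytes, then read it back
# with and without the bom layout.
file = Tempfile.new('bom_demo')
file.binmode
file.write("\xEF\xBB\xBFhello")
file.close

with_bom    = File.read(file.path, mode: 'r:utf-8')      # BOM kept as U+FEFF
without_bom = File.read(file.path, mode: 'r:bom|utf-8')  # BOM stripped
```

Here `with_bom` starts with the U+FEFF character, while `without_bom` is just the payload.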


You may also use the encoding option with other commands:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(This returns an array of all the lines.)

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8') { |csv|
  csv.each { |row| p row }
}

Is there a way to remove the BOM from a UTF-8 encoded file?

So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

require 'json'

def read_json_file(file_name, index)
  file = File.open("#{file_name}\\game.json", "r")
  content = file.read.force_encoding("UTF-8")
  file.close

  # strip the UTF-8 BOM bytes if present
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end
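An equivalent, slightly more compact strip (not the original answer's code) matches the BOM as the single code point U+FEFF instead of spelling out its byte sequence:

```ruby
content = "\xEF\xBB\xBFhello".force_encoding('UTF-8')

# U+FEFF is the BOM as a character, so one anchored sub at the start of
# the string removes it.
content.sub!(/\A\uFEFF/, '')
```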

Read a UTF-8 text file with BOM

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).

Fix BOM issues when reading UTF-8 encoded CSVs with VBA

EDIT: I found that loading the CSV with a querytable object (see this good example) or through a WorkbookQuery object (introduced in Excel 2016) are the easiest and probably most reliable ways to proceed (see an example from the documentation here).

OLD ANSWER:

Talking with @Profex encouraged me to investigate the issue further. It turns out there are two problems: the BOM and the delimiter used for the CSV. The ADO connection string I need to use is:

strCon = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\test\;Extended Properties='text;HDR=YES;CharacterSet=65001;FMT=Delimited(;)'"

But FMT does not work with a semicolon (FMT=Delimited(;)), at least with Microsoft.ACE.OLEDB.12.0 on an x64 system (Excel x64). Thus, @Profex was quite right to state:

even though the first field name has a ? in front of it, it doesn't
look like it actually matters

given that he was using FMT=Delimited on a CSV delimited by a simple comma (",").

Some people suggest editing the registry so that the semicolon delimiter is accepted. I'd like to avoid that. Also, I'd rather not create a schema.ini file (even if that may be the best solution for complex CSVs). Thus, the only remaining solutions require editing the CSV before creating the ADODB.Connection.

I know my CSV will always have the problematical BOM as well as the same basic structure (something like "date";"count"). Thus I decided to go with this code:

Dim arrByte() As Byte
Dim strFilename As String
Dim iFile As Integer
Dim strBuffer As String
strFilename = "C:\Users\test\t1.csv"
If Dir(strFilename) <> "" Then 'check if the file exists, because if not, it would be created when it is opened in Binary mode.
    iFile = FreeFile
    Open strFilename For Binary Access Read Write As #iFile
    strBuffer = String(3, " ") 'We know the BOM has a length of 3
    Get #iFile, , strBuffer
    If strBuffer = Chr(239) & Chr(187) & Chr(191) Then 'Check if the BOM (0xEF 0xBB 0xBF) is there
        strBuffer = String(LOF(iFile) - 3, " ")
        Get #iFile, , strBuffer 'the current read position is fine because we already used a Get. We store the whole content of the file without the BOM in strBuffer
        arrByte = Replace(strBuffer, ";", ",") 'We replace every semicolon with a comma
        Put #iFile, 1, arrByte
    End If
    Close #iFile
End If

(note: one might use arrByte = StrConv(Replace(strBuffer, ";", ","), vbFromUnicode) because the bytes array is in ANSI format).

Strip the byte order mark from string in C#

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
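The same byte-level approach can be sketched in Ruby (the raw buffer here is a stand-in for the downloaded bytes, not WebClient output):

```ruby
# Pretend these are the raw bytes of a downloaded XML document.
raw = "\xEF\xBB\xBF<root/>".b

BOM = "\xEF\xBB\xBF".b

# Strip the BOM at the byte level, before any string-level parsing.
xml_bytes = raw.start_with?(BOM) ? raw[BOM.bytesize..] : raw
xml = xml_bytes.force_encoding('UTF-8')
```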

Preserve UTF-8 BOM in Browser Downloads

Workaround (from comments): Since only the first three bytes are read, you can prepend two BOMs to the source, which will result in the downloaded file being valid UTF-8 with a BOM.
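A sketch of that workaround (the CSV content is made up): prepend the BOM twice, so that after the browser strips the first one during decoding, one BOM survives in the saved file.

```ruby
csv = "date;count\n2024-01-01;3\n"

# Two BOMs: the browser strips the first while UTF-8 decoding, leaving
# one in the downloaded file.
payload = "\uFEFF\uFEFF" + csv

# Simulate the browser stripping a single leading BOM:
downloaded = payload.sub(/\A\uFEFF/, '')
```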

As far as Excel specifically: Per the answer at https://stackoverflow.com/a/16766198/1143392, newer versions of Excel (from Office 365) do now support UTF-8.

As far as the cause of the behavior described in the question: the cause is that the relevant specs require the BOM to be stripped out, and that’s what browsers do. That is, browsers conform to the requirements of the UTF-8 decode algorithm in the Encoding spec, which is this:

To UTF-8 decode a byte stream stream, run these steps:

  1. Let buffer be an empty byte sequence.

  2. Read three bytes from stream into buffer.

  3. If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.

  4. Let output be a code point stream.

  5. Run UTF-8’s decoder with stream and output.

  6. Return output.

Step 3 is what causes the BOM to be stripped.

Given the Encoding spec requires that, I think there’s no way to tell browsers to keep the BOM.
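Those steps can be sketched in Ruby (a toy decoder over a binary StringIO, not a browser implementation; it assumes the stream holds binary-encoded bytes):

```ruby
require 'stringio'

# UTF-8 decode per the Encoding spec, BOM handling only: read three
# bytes; if they are not 0xEF 0xBB 0xBF, prepend them back to the stream.
def utf8_decode(stream)
  buffer = stream.read(3) || ''
  unless buffer.b == "\xEF\xBB\xBF".b
    # "prepend buffer to stream"
    stream = StringIO.new(buffer + (stream.read || ''))
  end
  stream.read.to_s.force_encoding('UTF-8')
end
```

With a BOM the three bytes are silently dropped; without one, all the bytes reach the decoder.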

Using PowerShell to write a file in UTF-8 without the BOM

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:

$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

This has come up before, and I found the answer on Stack Overflow when it happened to me. The linked answer uses a PushbackInputStream to test for the BOM.
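Ruby's IO has the same pushback capability built in via ungetbyte, so a sketch of the linked technique looks like this (skip_bom is a made-up helper name, not from the linked answer):

```ruby
require 'stringio'

# Peek at up to three bytes; push them back unless they are the BOM.
def skip_bom(io)
  bytes = [io.getbyte, io.getbyte, io.getbyte]
  unless bytes == [0xEF, 0xBB, 0xBF]
    bytes.compact.reverse_each { |b| io.ungetbyte(b) }
  end
  io
end
```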

Send CSV file encoded in UTF-8 with BOM in Java

I didn't do much to fix my issue, and I'm still not sure what was wrong. I only had to change the PrintWriter to a Writer, and add the charset in my javascript code.

Backend service

public void exportStreetsToCsv(Set<Street> streets, Writer writer) throws IOException {
writer.write('\uFEFF'); // Write BOM
// ...

Frontend download

const blobFile = new Blob([response.data], { type: 'text/csv;charset=utf-8' });
this.FileSaver.saveAs(blobFile, 'test.csv');
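The same pattern (write U+FEFF ahead of the payload so Excel detects UTF-8) sketched in Ruby, with made-up file name and data:

```ruby
require 'csv'
require 'tempfile'

# Build the CSV, then write U+FEFF before it; in a UTF-8 string that
# character encodes to the 0xEF 0xBB 0xBF bytes Excel looks for.
csv_body = CSV.generate do |csv|
  csv << ['name', 'city']
  csv << ['José', 'München']
end

out = Tempfile.new(['streets', '.csv'])
out.binmode
out.write("\uFEFF" + csv_body)
out.close
```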

