How to avoid tripping over UTF-8 BOM when reading files
With Ruby 1.9.2 and later you can use the mode r:bom|utf-8:
text_without_bom = nil # define the variable outside the block so the data survives it
File.open('file.txt', "r:bom|utf-8"){|file|
text_without_bom = file.read
}
or
text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')
or
text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')
It doesn't matter whether the file contains a BOM or not.
You may also use the encoding option with other methods:
text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')
(You get an array of all lines.)
Or with CSV:
require 'csv'
CSV.open(@filename, 'r:bom|utf-8'){|csv|
csv.each{ |row| p row }
}
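A quick self-contained check (temp files and content are made up for the demo) that the bom|utf-8 mode behaves the same whether the BOM is present or not:

```ruby
require 'tempfile'

# Helper: write raw bytes to a temp file and return its path
def make_file(bytes)
  f = Tempfile.new('bom_demo')
  f.binmode
  f.write(bytes)
  f.close
  f.path
end

a = File.read(make_file("\xEF\xBB\xBFhello"), mode: 'r:bom|utf-8')
b = File.read(make_file("hello"), mode: 'r:bom|utf-8')

puts a == b # both reads yield "hello": the BOM is stripped only if present
```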
Is there a way to remove the BOM from a UTF-8 encoded file?
So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
require 'json'

def read_json_file(file_name, index)
  File.open("#{file_name}\\game.json", "r"){|file|
    content = file.read.force_encoding("UTF-8")
    content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
    json = JSON.parse(content)
    print json
  }
end
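On Ruby 2.5+ the same cleanup can be written without raw byte escapes, since the BOM is just the single code point U+FEFF (a sketch; content stands in for any string already read as UTF-8):

```ruby
content = "\xEF\xBB\xBFhello".force_encoding("UTF-8")

# U+FEFF is the BOM as a code point; delete_prefix only touches the start of the string
content = content.delete_prefix("\uFEFF")

puts content
```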
Read a UTF-8 text file with BOM
Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:
As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).
Fix BOM issues when reading UTF-8 encoded CSVs with VBA
EDIT: I found that loading the CSV with a QueryTable object, or through a WorkbookQuery object (introduced in Excel 2016), is the easiest and probably most reliable way to proceed.
OLD ANSWER:
Talking with @Profex encouraged me to investigate the issue further. It turns out there are two problems: the BOM and the delimiter used in the CSV. The ADO connection string I need to use is:
strCon = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\test\;Extended Properties='text;HDR=YES;CharacterSet=65001;FMT=Delimited(;)'"
But FMT does not work with a semicolon (FMT=Delimited(;)), at least with Microsoft.ACE.OLEDB.12.0 on an x64 system (Excel x64). Thus, @Profex was quite right to state:

even though the first field name has a ? in front of it, it doesn't look like it actually matters

given that he was using FMT=Delimited on a CSV delimited by a simple comma (",").
Some people suggest editing the registry so that the semicolon delimiter is accepted. I'd like to avoid that. I'd also rather not create a schema.ini file (even if that may be the best solution for complex CSVs). Thus, the only remaining solutions require editing the CSV before creating the ADODB.Connection.
I know my CSV will always have the problematic BOM as well as the same basic structure (something like "date";"count"), so I decided to go with this code:
Dim arrByte() As Byte
Dim strFilename As String
Dim iFile As Integer
Dim strBuffer As String

strFilename = "C:\Users\test\t1.csv"
If Dir(strFilename) <> "" Then 'Check that the file exists; opening a missing file for Binary would create it
    iFile = FreeFile
    Open strFilename For Binary Access Read Write As #iFile
    strBuffer = String(3, " ") 'The UTF-8 BOM is 3 bytes long
    Get #iFile, , strBuffer
    If strBuffer = Chr$(239) & Chr$(187) & Chr$(191) Then 'Check if the BOM (EF BB BF) is there
        strBuffer = String(LOF(iFile) - 3, " ")
        Get #iFile, , strBuffer 'The read position is already past the BOM, so this reads the rest of the file
        arrByte = Replace(strBuffer, ";", ",") 'Replace every semicolon with a comma
        Put #iFile, 1, arrByte
    End If
    Close #iFile
End If
(Note: one might use arrByte = StrConv(Replace(strBuffer, ";", ","), vbFromUnicode) to get an ANSI byte array, since assigning a String directly to a Byte array yields UTF-16 bytes.)
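For comparison, the same strip-the-BOM-and-swap-the-delimiter rewrite is short in Ruby (a sketch; the filename and sample content are made up):

```ruby
path = "t1.csv"
File.binwrite(path, "\xEF\xBB\xBF\"date\";\"count\"\n") # sample input with a BOM

data = File.binread(path)
data = data.byteslice(3, data.bytesize - 3) if data.start_with?("\xEF\xBB\xBF".b) # drop the BOM
File.binwrite(path, data.tr(';', ',')) # swap the delimiter and rewrite the file

puts File.binread(path)
```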
Strip the byte order mark from string in C#
If the variable xml is of type string, you did something wrong already: in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
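The same byte-level idea sketched in Ruby (raw stands in for hypothetical response bytes): strip the three BOM bytes before handing the data to a parser, rather than fighting with a decoded string.

```ruby
raw = "\xEF\xBB\xBF<root/>".b # e.g. the body of an HTTP response, as raw bytes

# Strip the BOM at the byte level, then tag the remainder as UTF-8
raw = raw.byteslice(3, raw.bytesize - 3) if raw.start_with?("\xEF\xBB\xBF".b)
xml = raw.force_encoding("UTF-8")

puts xml
```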
Preserve UTF-8 BOM in Browser Downloads
Workaround (from comments): Since only the first three bytes are read, you can prepend two BOMs to the source, which will result in the downloaded file being valid UTF-8 with a BOM.
As for Excel specifically: per the answer at https://stackoverflow.com/a/16766198/1143392, newer versions of Excel (from Office 365) now support UTF-8.
As for the cause of the behavior described in the question: the relevant specs require the BOM to be stripped out, and that's what browsers do. That is, browsers conform to the requirements of the UTF-8 decode algorithm in the Encoding spec, which is this:
To UTF-8 decode a byte stream stream, run these steps:

1. Let buffer be an empty byte sequence.
2. Read three bytes from stream into buffer.
3. If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.
4. Let output be a code point stream.
5. Run UTF-8's decoder with stream and output.
6. Return output.
Step 3 is what causes the BOM to be stripped.
Given that the Encoding spec requires this, I think there's no way to tell browsers to keep the BOM.
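A quick sketch simulating the spec's step 3 shows why the double-BOM workaround survives one decode (utf8_decode here is a toy stand-in, not a browser):

```ruby
# Simulate the Encoding spec's UTF-8 decode: consume a leading BOM if present
def utf8_decode(bytes)
  bytes = bytes.byteslice(3, bytes.bytesize - 3) if bytes.start_with?("\xEF\xBB\xBF".b)
  bytes.force_encoding("UTF-8")
end

bom = "\xEF\xBB\xBF".b
single = utf8_decode(bom + "hi".b)       # BOM stripped, leaving "hi"
double = utf8_decode(bom + bom + "hi".b) # one BOM stripped, one survives

puts single
puts double.start_with?("\uFEFF")
```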
Using PowerShell to write a file in UTF-8 without the BOM
Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:
$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)
why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?
This has come up before, and I found the answer on Stack Overflow when it happened to me. The linked answer uses a PushbackInputStream to test for the BOM.
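The same peek-and-push-back idea can be sketched in Ruby with IO#ungetbyte (StringIO stands in for the real stream; names are made up):

```ruby
require 'stringio'

# Peek at the first three bytes; push them back unless they are the UTF-8 BOM
def skip_bom(io)
  head = io.read(3)
  io.ungetbyte(head) if head && head != "\xEF\xBB\xBF".b
  io
end

with_bom    = skip_bom(StringIO.new("\xEF\xBB\xBFabc".b))
without_bom = skip_bom(StringIO.new("abc".b))

a = with_bom.read
b = without_bom.read
puts a, b # both streams now yield "abc"
```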
Send CSV file encoded in UTF-8 with BOM in Java
I didn't do much to fix my issue, and I'm still not sure what was wrong. I only had to change the PrintWriter to a Writer, and add the charset in my JavaScript code.
Backend service
public void exportStreetsToCsv(Set<Street> streets, Writer writer) throws IOException {
writer.write('\uFEFF'); // Write BOM
// ...
Frontend download
const blobFile = new Blob([response.data], { type: 'text/csv;charset=utf-8' });
this.FileSaver.saveAs(blobFile, 'test.csv');
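For completeness, the same prepend-a-BOM-so-Excel-detects-UTF-8 trick is a single write in Ruby (the filename and rows are made up):

```ruby
require 'csv'

csv = CSV.generate{|out|
  out << ["date", "count"]
  out << ["2024-01-01", "3"]
}
File.write("export.csv", "\uFEFF" + csv) # a leading U+FEFF makes Excel pick UTF-8

puts File.read("export.csv", mode: "r:bom|utf-8") # reading with bom|utf-8 strips it again
```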