How to Write a BOM Marker to a File in Ruby


Alas, I think your manual approach is the way to go; at least, I don't know a better way:

http://blog.grayproductions.net/articles/miscellaneous_m17n_details

To quote from JEG2's article:

Ruby 1.9 won't automatically add a BOM to your data, so you're going
to need to take care of that if you want one. Luckily, it's not too
tough. The basic idea is just to print the bytes needed at the
beginning of a file.
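The quoted approach can be sketched as follows (the file name and CSV content are invented for illustration): print the three BOM bytes first, then write the data as usual.

```ruby
# Manually print the UTF-8 BOM at the beginning of the file, then the data.
# "\uFEFF" is the BOM code point; UTF-8 encodes it as the bytes EF BB BF.
File.open("data.csv", "w:UTF-8") do |f|
  f.print "\uFEFF"
  f.print "id,name\n1,Alice\n"
end

File.binread("data.csv", 3).bytes  # => [239, 187, 191]
```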

Ruby: Check for Byte Order Marker

string.start_with?("\uFEFF")

From Ruby official documentation:

\xnn      hexadecimal bit pattern, where nn is 1-2 hexadecimal digits ([0-9a-fA-F])

\unnnn  Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])

That said, to interpolate a Unicode character, one should use the \uXXXX notation. It is safe and we can reliably use this version.
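A short sketch of the check. Note that the BOM is the single code point U+FEFF, which UTF-8 encodes as the three bytes EF BB BF; writing the three byte values as separate \u escapes would produce three unrelated Latin-1-range characters instead.

```ruby
BOM = "\uFEFF"  # U+FEFF (ZERO WIDTH NO-BREAK SPACE); EF BB BF in UTF-8

with_bom    = "\uFEFFhello"
without_bom = "hello"

with_bom.start_with?(BOM)     # => true
without_bom.start_with?(BOM)  # => false
BOM.bytes                     # => [239, 187, 191]
```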

What does rb:bom|utf-8 mean in CSV.open in Ruby?

When reading a text file in Ruby you need to specify the encoding or it will revert to the default, which might be wrong.

If you're reading CSV files that are BOM encoded then you need to do it that way.

Reading the file as plain UTF-8 leaves the BOM bytes sitting at the front of your data, so you need to detect and skip them before treating the rest as UTF-8. The bom|utf-8 notation is how Ruby expresses that requirement: strip a leading BOM if one is present, then treat the stream as UTF-8.
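As a sketch (the file name and columns are invented for illustration), here is a BOM-prefixed CSV written out and read back through the bom|utf-8 layer:

```ruby
require 'csv'

# Write a UTF-8 CSV file that starts with a BOM.
File.open("export.csv", "w:UTF-8") do |f|
  f.print "\uFEFF"
  f.print "id,name\n1,Alice\n"
end

# "rb:bom|utf-8" opens the file in binary read mode, strips a leading
# BOM if one is present, and tags the data as UTF-8.
table = nil
CSV.open("export.csv", "rb:bom|utf-8", headers: true) do |csv|
  table = csv.read
end

table.headers        # => ["id", "name"]  (no "\uFEFF" stuck to "id")
table.first["name"]  # => "Alice"
```

Without the bom| prefix, the BOM would be read as data and the first header would come back as "\uFEFFid", silently breaking lookups by column name.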

Is there a way to remove the BOM from a UTF-8 encoded file?

So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.

I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string

require 'json'

def read_json_file(file_name, index)
  # Read the file (closing it when done), tag the data as UTF-8,
  # and strip a leading BOM if one is present.
  content = File.open("#{file_name}\\game.json", "r") { |file| file.read }
  content.force_encoding("UTF-8")
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end

How to use internal/external encoding when importing a YAML file?

I believe officially YAML only supports UTF-8 (and maybe UTF-16). There have historically been all sorts of encoding confusions in YAML libraries. I think you are going to run into trouble trying to have YAML in something other than a Unicode encoding.

  1. What actually happens when I don't set default_internal to UTF-8?

Encoding.default_internal controls the encoding your input will be converted to when it is read in, at least by operations that respect Encoding.default_internal (not everything does). Rails seems to set it to UTF-8. So if you don't set Encoding.default_internal to UTF-8 yourself, it may be UTF-8 already anyway.

If Encoding.default_internal is nil, then the operations that respect it won't try to convert input upon reading it in; they'll leave the input in whatever encoding it was believed to originate in.

If you set it to something else, say "WINDOWS-1252", Ruby would automatically convert your input to WINDOWS-1252 when reading it in with File.open, which would likely confuse YAML::load when you pass it a string that is now encoded and tagged as WINDOWS-1252. Generally there's no good reason to do this, so leave Encoding.default_internal alone.

Note: The Ruby docs say:

"You should not set ::default_internal in Ruby code as strings created before changing the value may have a different encoding from strings created after the change. Instead you should use ruby -E to invoke Ruby with the correct default_internal."

See also: http://ruby-doc.org/core-1.9.3/Encoding.html#method-c-default_internal


  2. Which encodings do the strings in both examples have?

I don't really know. One would have to look at the bytes and try to figure out whether they are legal bytes for various plausible encodings and, beyond being legal, whether they mean something likely to be intended.

For example take: "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That's a perfectly legal UTF-8 string, but as humans we know it's probably not intended, and is probably garbage, quite likely the result of an encoding misinterpretation. But a computer has no way to know that: it's perfectly legal UTF-8, and, hey, maybe someone actually did mean to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ"; after all, I just did, when writing this post!

So you can try to interpret the bytes according to various encodings and see if any of them make sense.

You're really just guessing at this point. Which means...


  3. How can I load a file even if I don't know its encoding?

Generally, you can not. You need to know and keep track of encodings. There's no real way to know what the bytes mean without knowing their encoding.

If you have some legacy data for which you've lost this, you've got to try to figure it out, either manually or with code that guesses likely encodings based on heuristics. Here's one Ruby gem, Charlock Holmes, that tries to guess, using the ICU library's heuristics (this particular gem only works on MRI).

What Ruby says in response to string.encoding is just the encoding the string is tagged with. The string can be tagged with the wrong encoding; the bytes in the string may not actually mean what is intended in the encoding it's tagged with, in which case you'll get garbage.

Ruby will do the right thing with your string, instead of creating garbage, only if the string's encoding tag is correct. For most input operations, the tag is determined by Encoding.default_external (which usually starts out as UTF-8, or as ASCII-8BIT, which really means the null encoding: binary data not tagged with an encoding), or by passing an argument to File.open: File.open("something", "r:UTF-8") or, equivalently, File.open("something", "r", :encoding => "UTF-8"). The actual bytes are determined by whatever is in the file. It's up to you to tell Ruby the correct encoding so it can interpret those bytes as the text they were intended to mean.
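To make the tagging-versus-converting distinction concrete, here is a minimal sketch (the file name is invented): force_encoding only changes the tag on the same bytes, while encode actually transcodes them.

```ruby
# Write some UTF-8 text, then read it back with an explicit external encoding.
File.write("greeting.txt", "héllo", encoding: "UTF-8")

s = File.open("greeting.txt", "r:UTF-8") { |f| f.read }
s.encoding  # => #<Encoding:UTF-8>

# force_encoding retags the same bytes; encode transcodes them.
latin = "caf\u00E9".encode("ISO-8859-1")           # "é" is the single byte E9
latin.dup.force_encoding("UTF-8").valid_encoding?  # => false (E9 alone is invalid UTF-8)
latin.encode("UTF-8")                              # => "café" (bytes transcoded)
```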

There were a couple posts recently to reddit /r/ruby that try to explain how to troubleshoot and workaround encoding issues that you may find helpful:

  • http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
  • http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/

Also, this is my favorite article on understanding encoding generally: http://kunststube.net/encoding/

For YAML files in particular, if I were you, I'd just make sure they are all in UTF-8. Life will be much easier and you won't have to worry about it. If you have some legacy ones that have become corrupted, it's going to be a pain to fix them, but that's what you've got to do, unless you can just rewrite them from scratch. Try to fix them to be in valid and correct UTF-8, and from here on out keep all your YAML in UTF-8.
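A minimal sketch of that policy (the file name and keys are invented): always write YAML as UTF-8, and strip a stray BOM on load in case some editor added one.

```ruby
require 'yaml'

# Always write YAML as UTF-8.
File.write("settings.yml", { "greeting" => "héllo" }.to_yaml, encoding: "UTF-8")

# "bom|utf-8" strips a leading BOM, if present, before parsing.
data = YAML.safe_load(File.read("settings.yml", encoding: "bom|utf-8"))
data["greeting"]  # => "héllo"
```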

What's the difference between UTF-8 and UTF-8 with BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows a reader to more reliably guess that a file is encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
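The contrast is easy to see in Ruby: UTF-16 has two possible byte orders for U+FEFF, while UTF-8 always produces the same three bytes regardless of platform.

```ruby
bom = "\uFEFF"

bom.encode("UTF-16BE").bytes  # => [254, 255]       FE FF, big-endian
bom.encode("UTF-16LE").bytes  # => [255, 254]       FF FE, little-endian
bom.encode("UTF-8").bytes     # => [239, 187, 191]  EF BB BF, no variants
```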

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.


