How to Set the Default String Encoding on Ruby 1.9

Set UTF-8 as default for Ruby 1.9.3

To change the source encoding (i.e. the encoding your source code is actually written in), you currently have to use the magic comment:

# encoding: utf-8

It is not enough to either set the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to set the magic encoding comment on top of files to set the source encoding.
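To see the three encodings side by side, here is a minimal sketch that prints each of them. The literal is kept ASCII-only so the snippet runs under any locale; the variable names are purely illustrative.

```ruby
# encoding: utf-8
# The magic comment must be the first line (or the second, after a shebang).

source_encoding = __ENCODING__      # the source encoding of this file
literal         = "plain literal"   # literals are tagged with the source encoding

puts "source encoding:  #{source_encoding}"
puts "literal encoding: #{literal.encoding}"
puts "default external: #{Encoding.default_external}"          # assumed encoding of data read from IO
puts "default internal: #{Encoding.default_internal.inspect}"  # nil means "do not transcode"
```

Changing Encoding.default_external or Encoding.default_internal does not change what `__ENCODING__` reports, which is why the magic comment is still needed on 1.9.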

In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.

As for encoding defaults:

  • Ruby 1.8 and below didn't know the concept of string encodings at all. Strings were more or less byte arrays.
  • Ruby 1.9: the default source encoding is US-ASCII.
  • Ruby 2.0 and above: the default source encoding is UTF-8.

Thus, if you use Ruby 2.0 or later, you can skip the encoding comment and assume UTF-8 source encoding by default.

Can I set the default string encoding on Ruby 1.9?

Don't confuse file encoding with string encoding

The purpose of the # encoding comment at the top of a file is to tell Ruby (while it reads and interprets your code) and your editor how to handle any non-ASCII characters in the file. It is only necessary if the file contains at least one non-ASCII character, e.g. in your config/locale files.

To define the encoding in all your files at once, you can use the magic_encoding gem; it can insert the utf-8 magic comment into all Ruby files in your app.

The error you're getting at runtime, Encoding::CompatibilityError, happens when you try to concatenate two Strings during program execution and their encodings are incompatible.

This most likely happens when:

  • you are using L10N strings (e.g. UTF-8) and concatenate them with e.g. an ASCII string (in your view)

  • the user types in a string in a foreign language (e.g. UTF-8), and your view tries to print it out along with some fixed string which you pre-defined (ASCII). force_encoding will help there. In Ruby 1.9 there are also Encoding.default_external and Encoding.default_internal, which set the default encodings for data read from IO.
    And there is config.encoding in Rails in the config/application.rb file.

  • Strings which come from your database and are then combined with other Strings in your view.
    (their encodings could be either way around, and incompatible).
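The error can be reproduced in a few lines. The following is a rough sketch with made-up strings: one UTF-8 string and one ISO-8859-1 string that both contain a non-ASCII character, plus an ASCII-only string for comparison.

```ruby
utf8  = "caf\u00E9"                                    # UTF-8, contains a non-ASCII char
latin = [0xE9].pack("C").force_encoding("ISO-8859-1")  # the byte E7+2 = E9 is "e acute" in ISO-8859-1
ascii = "menu: ".encode("US-ASCII")                    # ASCII-only content

# ASCII-only strings are compatible with UTF-8, so this works.
combined = ascii + utf8
puts combined.encoding

# Two strings with non-ASCII content in different encodings are not compatible.
error = nil
begin
  utf8 + latin
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# Transcoding one side to the other's encoding fixes the concatenation.
fixed = utf8 + latin.encode("UTF-8")
puts fixed.encoding
```

Note that the fix here is encode (which converts the bytes), because the ISO-8859-1 string was labelled correctly. force_encoding is the right tool only when the bytes are already correct and just carry the wrong label.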

Side-Note: Make sure to specify a default encoding when you create your database!

    create database yourproject  DEFAULT CHARACTER SET utf8;

If you want to use EMOJIs in your strings:

    create database yourproject DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_bin;

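and all indexes on string columns which may contain EMOJI need to be limited to at most 191 characters. Those columns should use CHARACTER SET utf8mb4 COLLATE utf8mb4_bin.

The reason for this is that MySQL's utf8 character set stores at most 3 bytes per character, whereas EMOJI need 4 bytes of storage. At 4 bytes per character, 191 characters is the most that fits into the 767-byte index key limit of older InnoDB row formats (767 / 4 = 191).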
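The byte counts behind the 191 figure can be checked from Ruby. This is only an arithmetic sketch; the 767-byte value is an assumption about the InnoDB index key prefix limit of older row formats, so check the limit of your own MySQL setup.

```ruby
euro  = "\u20AC"      # EURO SIGN, inside the Basic Multilingual Plane
emoji = "\u{1F600}"   # GRINNING FACE, outside the Basic Multilingual Plane

puts euro.bytesize    # bytes needed in UTF-8 for the euro sign
puts emoji.bytesize   # bytes needed in UTF-8 for the emoji

index_byte_limit = 767            # assumed InnoDB index key prefix limit
puts index_byte_limit / 3         # max indexed characters with 3-byte utf8
puts index_byte_limit / 4         # max indexed characters with 4-byte utf8mb4
```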

Please check this Yehuda Katz article, which covers this in-depth, and explains it very well:
(there is specifically a section 'Incompatible Encodings')

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

http://yehudakatz.com/2010/05/17/encodings-unabridged/

and:

http://zargony.com/2009/07/24/ruby-1-9-and-file-encodings

http://graysoftinc.com/character-encodings

Set global default encoding for ruby 1.9

You can either:

  1. set your RUBYOPT environment variable to "-E utf-8"
  2. or use https://github.com/m-ryan/magic_encoding
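The -E switch sets the default external encoding, i.e. the encoding Ruby assumes for data it reads. As a hedged illustration, the same default can also be set from code; the temp file name below is arbitrary, and setting the default at runtime is used here only to make the effect visible.

```ruby
require "tempfile"

# Roughly what "-E utf-8" does: set the assumed encoding of data read from IO.
Encoding.default_external = Encoding::UTF_8

content = nil
Tempfile.create("encoding-demo") do |file|
  file.binmode
  file.write([0xC3, 0xA9].pack("C*"))   # the UTF-8 bytes for "e acute"
  file.flush
  content = File.read(file.path)         # tagged with the default external encoding
end

puts content.encoding
puts content.valid_encoding?
```

This affects strings read from files and other IO, not the encoding of string literals in your source, which is what the magic comment controls.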

Why is the default encoding in Rails not UTF-8?

Ruby 2.0 uses UTF-8 by default. In 1.9 you must declare it explicitly. According to naruse:

The default script encoding change.

Default script encoding (when magic comment is not specified) is
changed into UTF-8 [#6679]. In Ruby 1.9, the default script encoding is
US-ASCII. We changed it to be UTF-8 after considering the following
pros and cons. UTF-8 default is convenient because the majority of
modern applications use UTF-8. This change does not impact any 1.9 code
if Magic Comments are in place. The default script encoding in 1.9
without Magic Comment is either US-ASCII or ASCII-8BIT. In UTF-8, then
some string manipulation could become slower.

Source: Rubyist Magazine

Handling string encoding with the same code in Ruby 1.8 and 1.9

I'm with Mike Lewis in using respond_to, but don't do it on the variable res everywhere throughout your code.

I took a look at your code in gateway.rb and it looks like everywhere you are using res, it gets set by a call to make_api_request so you could add this before your return statement in that method:

doc = doc.force_encoding("UTF-8") if doc.respond_to?(:force_encoding) 

Even if res is set in other places, as long as it's not literally every string you encounter, I'm sure you can find a way to refactor the code that makes sense and solves the problem in one place instead of everywhere you encounter it.

Are you having a problem in other places?
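One way to keep the check in a single place is a small helper. The sketch below is an assumption about how that could look; force_utf8 is a hypothetical name and is not taken from gateway.rb.

```ruby
# Hypothetical helper: relabel a value as UTF-8 when the Ruby version supports it.
# On Ruby 1.8, Strings do not respond to force_encoding, so the value is returned as-is.
def force_utf8(value)
  return value unless value.respond_to?(:force_encoding)
  value.dup.force_encoding("UTF-8")   # dup so the caller's object is not mutated
end

body = [0xC3, 0xA9].pack("C*")        # raw bytes, tagged ASCII-8BIT on 1.9+
puts body.encoding
puts force_utf8(body).encoding
puts force_utf8(nil).inspect          # non-strings pass through untouched
```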

ruby 1.9, force_encoding, but check

(update: see https://github.com/jrochkind/scrub_rb)

So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb

But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":

a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"

Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
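One caveat with the 'binary' trick: ASCII-8BIT has no mapping to UTF-8 for any byte above 0x7F, so valid multi-byte characters get replaced too, not only the broken bytes. The sketch below compares it with String#scrub, which is built in from Ruby 2.1 on (the scrub_rb gem linked above is a backport for older versions). The input string is made up for illustration.

```ruby
# "bad: " + an invalid byte pair + " caf" + the valid UTF-8 bytes for "e acute"
bytes = "bad: ".bytes + [0xC3, 0x28] + " caf".bytes + [0xC3, 0xA9]
input = bytes.pack("C*").force_encoding("UTF-8")
puts input.valid_encoding?

# The trick from above: every byte >= 0x80 is "undefined" and gets replaced.
via_binary = input.encode("UTF-8", "binary", :undef => :replace)

# String#scrub (Ruby 2.1+) replaces only the bytes that are actually invalid.
via_scrub = input.scrub

puts via_binary.valid_encoding?
puts via_scrub.valid_encoding?
puts via_scrub.end_with?("caf\u00E9")   # the valid character survives scrub
```

If the strings can contain legitimate non-ASCII text, scrub is probably the safer choice; the binary variant is fine when everything outside ASCII should be dropped anyway.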

Is UTF-8 the default encoding in Ruby v.2?

Yes. UTF-8 has only been the default encoding since Ruby 2.0.

If you are aware that his examples were from Ruby 1.9, then check the features added in newer versions of Ruby. There are not that many.

Determine character encoding in Ruby 1.9.3

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes - C3 A7.

See here for other ways to encode ç.

As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL. This means the best approach is probably to check whether the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

unless query.valid_encoding?
  query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end

Here's the process in IRB (plus an escaping at the end for fun)

a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1") # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
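The same fallback can be wrapped in a method so that already-valid UTF-8 input passes through unchanged. This is a sketch only; utf8_or_latin1 is a hypothetical name, and the ISO-8859-1 guess is the assumption discussed above.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 if they are valid UTF-8,
# otherwise assume ISO-8859-1 and transcode to UTF-8.
def utf8_or_latin1(str)
  utf8 = str.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?
  str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

from_latin1 = utf8_or_latin1([0xE7].pack("C"))         # what "%E7" unescapes to
from_utf8   = utf8_or_latin1([0xC3, 0xA7].pack("C*"))  # what "%C3%A7" unescapes to

puts from_latin1.bytes.inspect
puts from_utf8.bytes.inspect
puts from_latin1 == from_utf8
```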

Ruby 1.9.x and string encoding

Well, updating to Rails 3.1.3 and mysql2 0.3.10 seems to have solved my issue (I was running Rails 3.0.3 and mysql2 0.2.6). It seems weird to me though, as Ruby 1.9 is over 3 years old and Rails 3.0.3 was released well after that, so I don't see why Rails 3.0.x wouldn't play nicely with Ruby 1.9's new string encodings. If anyone can add to this, I would be grateful.

how to convert character encoding with ruby 1.9

As the exception points out, your string is ASCII-8BIT encoded. You should change the encoding. There is a long story behind that, but if you are interested in a quick solution, just call force_encoding on the string before you do any processing:

s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"
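It may help to see why force_encoding is the right call here rather than encode: the bytes are already valid UTF-8 and only the label is wrong. A short sketch using the same three bytes (the en dash) as in the string above:

```ruby
raw = [0xE2, 0x80, 0x93].pack("C*")   # the UTF-8 bytes of an en dash, tagged ASCII-8BIT
puts raw.encoding

# force_encoding changes only the label; the bytes stay the same.
relabelled = raw.dup.force_encoding("UTF-8")
puts relabelled.bytes == raw.bytes
puts relabelled.valid_encoding?

# encode tries to convert the bytes, and binary data has no defined conversion to UTF-8.
conversion_error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  conversion_error = e
end
puts conversion_error.class
```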

