Set Global Default Encoding For Ruby 1.9

Set global default encoding for ruby 1.9

You can either:

  1. set your RUBYOPT environment variable to "-E utf-8"
  2. or use https://github.com/m-ryan/magic_encoding

Set UTF-8 as default for Ruby 1.9.3

To change the source encoding (i.e. the encoding your actual written source code is in), you have to use the magic comment currently:

# encoding: utf-8

It is not enough to either set the internal encoding (the encoding of the internal string representation after conversion) or the external encoding (the assumed encoding of read files). You actually have to set the magic encoding comment on top of files to set the source encoding.

In ChiliProject we have a rake task which sets the correct encoding header in all files automatically before a release.

As for encoding defaults:

  • Ruby 1.8 and below didn't knew the concept of string encodings at all. Strings were more or less byte arrays.
  • Ruby 1.9: default string encoding is US_ASCII everywhere.
  • Ruby 2.0 and above: default string encoding is UTF-8.

Thus, if you use Ruby 2.0, you could skip the encoding comment and correctly assume UTF-8 encoding everywhere by default.

JSON encoding issue with Ruby 1.9 and HTTParty

In Ruby 1.9, encoding is explicit now. However, Rails may or may not be configured to send the responses in the encoding you expect. You'll have to set the global configuration setting:

Encoding.default_external = "utf-8".

I believe the encoding that Ruby specifies by default for serialization is the platform default. In America on Windows that would be CodePage-1251. Other countries would have an alternate encoding.

Edit: Also see this url if the json is executed against MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response

Edit 2: Rails core and its suite of libraries (ActiveRecord, et. al.) will respect the Encoding.default_external configuration setting which encodes all the values it sends. Unfortunately, because encoding is a relatively new concept to Ruby not every 3rd party library has been adjusted for proper encoding. The ones that have may require additional configuration settings for those libraries. This includes MySQL, and the RSolr library you were using.

In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.

The Encoding.force_encoding method is designed to treat the byte sequence with that new encoding, but does not change any of the underlying data. So it is possible to have invalid byte sequences. There is another method called .encode() that will transform the bytes from one encoding to another and guarantees valid byte sequences. For more information read this:

http://blog.grayproductions.net/articles/ruby_19s_string

How to set #encoding: utf-8 in Rails globally

This is a duplicate of Set global default encoding for ruby 1.9, but in your case I suggest using the I18n:

redirect_to root_url, :notice => I18n.t 'sessions.destroy.success'

# config/locales/ru.yml
ru:
sessions:
destroy:
success: Вышли успешно

As for locales key naming, AFAIK there's no convention, here I use "controller_name.action_name.result" scheme.

I receive incompatible character encodings: CP850 and UTF-8 when displaying the £ symbol on my ramaze app

Try to force the encoding to see if that makes the problem go away:

your_string.force_encoding(::Encoding::UTF_8)

If it does, dive into your app and spot what is setting the wrong encoding, where, and why.

It's possibly server-/webpage-related, as in the page you're serving is rendered as US-ASCII owing to a header. Or the server is started with encoding other than UTF-8. Or something other to that effect. Your script ends up with a piece of external data that isn't UTF-8.

How can I avoid putting the magic encoding comment on top of every UTF-8 file in Ruby 1.9?

Explicit is better than implicit. Writing out the name of the encoding is good for your text editor, your interpreter, and anyone else who wants to look at the file. Different platforms have different defaults -- UTF-8, Windows-1252, Windows-1251, etc. -- and you will either hamper portability or platform integration if you automatically pick one over the other. Requiring more explicit encodings is a Good Thing.

It might be a good idea to integrate your Rails app with GetText. Then all of your UTF-8 strings will be isolated to a small number of translation files, and your Ruby modules will be clean ASCII.

How can I globally ignore invalid byte sequences in UTF-8 strings?

I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT encoding). This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY')  # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode and replace any undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace)  # => "Men\uFFFD"
s.valid_encoding? # => true

Unfortunately, the steps above would end up mangling a lot of UTF-8 codepoints because the bytes in them would not be recognized. If you had a three-byte UTF-8 characters like "\uFFFD" it would be interpreted as three separate bytes and each one would get converted to the replacement character. Maybe you could do something like this:

def to_utf8(str)
str = str.force_encoding("UTF-8")
return str if str.valid_encoding?
str = str.force_encoding("BINARY")
str.encode("UTF-8", invalid: :replace, undef: :replace)
end

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.

Ruby 1.9 iso-8859-8-i encoding

When you have input where Ruby or OS has incorrectly assign encoding, then conversions will not work. That's because Ruby will start with the wrong assumption and try to maintain the wrong characters when converting.

However, if you know from some other source what the correct encoding is, you can use force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note this alters the object in place.

E.g.

contents = final.body
contents.force_encoding( 'ISO-8859-8' )
puts contents

At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.

I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).



Related Topics



Leave a reply



Submit