Does Ruby Support Unicode and How Does It Work

Does Ruby support Unicode and how does it work?

What you heard is outdated and applies (only partially) to Ruby 1.8 and earlier. The latest stable version of Ruby (1.9) supports no fewer than 95 different character encodings (counted on my system just now). This includes pretty much all known Unicode Transformation Formats, including UTF-8.
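To see what your own interpreter supports, you can ask it directly. This is a minimal sketch, assuming Ruby 1.9 or later; the exact count depends on the Ruby version and build, so it will probably not be 95 on your machine.

```ruby
# List the encodings this Ruby build knows about (the count varies by version).
encoding_count = Encoding.list.size
puts "#{encoding_count} encodings available"

# The Unicode Transformation Formats are among them.
utf_names = Encoding.name_list.grep(/\AUTF-/)
puts utf_names.sort.join(", ")

# Every string carries its own encoding tag.
greeting = "caf\u00e9"
puts greeting.encoding   # UTF-8, because the literal uses a \u escape
puts greeting.bytesize   # 5 (the accented letter takes two bytes)
puts greeting.length     # 4 characters
```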

The previous stable version of Ruby (1.8) has partial support for UTF-8.

If you use Rails, it takes care of default UTF-8 encoding for you. If all you need is UTF-8 encoding awareness, Rails will work for you whether you run Ruby 1.9 or Ruby 1.8. If you have very specific character encoding requirements, you should aim for Ruby 1.9.

If you're really interested, here is a series of articles describing the encoding issues in Ruby 1.8 and how they were worked around, and eventually solved in Ruby 1.9. Rails still includes workarounds for many common flaws in Ruby 1.8.

Unicode characters in a Ruby script?

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII superset: each code unit is stored as two bytes, so every ASCII character carries an extra \0 byte. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
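The byte values can be checked from Ruby 1.9 or later. The sketch below encodes U+FEFF in both byte orders; `utf16_byte_order` is a hypothetical helper written for this illustration, not a library method.

```ruby
# U+FEFF serialized in each UTF-16 byte order.
bom_le = "\uFEFF".encode("UTF-16LE").bytes   # => [255, 254], i.e. \xFF \xFE
bom_be = "\uFEFF".encode("UTF-16BE").bytes   # => [254, 255], i.e. \xFE \xFF

# Hypothetical helper: guess the byte order from the first two bytes of some data.
def utf16_byte_order(data)
  case data.byteslice(0, 2).bytes
  when [0xFF, 0xFE] then :little_endian
  when [0xFE, 0xFF] then :big_endian
  else :unknown
  end
end

sample = "\uFEFFhi".encode("UTF-16LE")
puts utf16_byte_order(sample)     # little_endian
puts "\377\376".bytes == bom_le   # true: octal 377 376 is hex FF FE
```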

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
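If you have Ruby 1.9 or later at hand, you can confirm what those escaped bytes spell. This is only a sanity check, and it assumes the checking script itself is saved as UTF-8:

```ruby
escaped = "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"

# Tag the bytes as UTF-8 explicitly (a no-op when the source file is already UTF-8).
text = escaped.dup.force_encoding("UTF-8")

puts text.valid_encoding?   # true
puts text.bytesize          # 27 bytes
puts text.length            # 9 characters, three bytes each
puts text == "こんにちは・今日は"   # true
```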

However! Writing to the console is itself a big problem. The encoding used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is, again misleadingly, known as the ANSI code page.

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
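To check that the cp932 bytes and the UTF-8 bytes describe the same text, you can transcode one to the other in Ruby 1.9 or later. A minimal sketch, assuming your Ruby build includes the Windows-31J converter (Ruby's name for cp932):

```ruby
cp932_bytes = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"

# Tell Ruby how to interpret the bytes, then transcode them to UTF-8.
sjis_text = cp932_bytes.dup.force_encoding("Windows-31J")
utf8_text = sjis_text.encode("UTF-8")

puts sjis_text.bytesize   # 18 (two bytes per character in cp932)
puts utf8_text.bytesize   # 27 (three bytes per character in UTF-8)
puts utf8_text.length     # 9
puts utf8_text            # こんにちは・今日は
```

Note the difference between the two calls: force_encoding only relabels the existing bytes, while encode produces new bytes in the target encoding.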

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
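In Ruby 1.9 or later you can at least inspect which encoding Ruby assumes for the outside world, and see why Japanese text cannot survive a trip through code page 1252. This is a sketch of the limitation, not a workaround:

```ruby
# What Ruby assumes about the environment (values differ per machine).
puts Encoding.default_external
puts Encoding.locale_charmap

japanese = "こんにちは"

# Code page 1252 has no Japanese characters, so a strict conversion fails...
begin
  japanese.encode("Windows-1252")
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.message}"
end

# ...and a lenient conversion can only substitute placeholders.
fallback = japanese.encode("Windows-1252", undef: :replace, replace: "?")
puts fallback   # ?????
```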

Ruby Output Unicode Character

In Ruby 1.9.x+

Use String#encode:

checkmark = "\u2713"
puts checkmark.encode('utf-8')

prints

✓

In Ruby 1.8.7

puts '\u2713'.gsub(/\\u[\da-f]{4}/i) { |m| [m[-4..-1].to_i(16)].pack('U') }
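The one-liner above finds each literal \uXXXX sequence, reads its four hex digits as a code point, and packs that code point into UTF-8 bytes with the 'U' directive. Unpacked into a helper it may be easier to follow; `unescape_unicode` is a hypothetical name, and this version also runs on later Rubies:

```ruby
# Hypothetical helper: replace literal \uXXXX sequences with the real characters.
def unescape_unicode(text)
  text.gsub(/\\u([\da-f]{4})/i) { [$1.hex].pack("U") }
end

# Single quotes keep the backslash sequence literal, as in the one-liner above.
check = unescape_unicode('\u2713 done')
puts check                          # ✓ done
puts check == "\u2713 done"         # true
puts [0x2713].pack("U").encoding    # UTF-8
```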

Ruby 1.9 doesn't support Unicode normalization yet

If you are aware of the consequences, i.e. accented characters will not be transliterated in Ruby 1.9.1 + Rails 2.3.x, place this in config/initializers to silence the warning:

# http://stackoverflow.com/questions/2135247/ruby-1-9-doesnt-support-unicode-normalization-yet
module ActiveSupport
  module Inflector
    # Calling String#parameterize prints a warning under Ruby 1.9,
    # even if the data in the string doesn't need transliterating.
    if Rails.version =~ /^2\.3/
      undef_method :transliterate
      def transliterate(string)
        string.dup
      end
    end
  end
end

Rails 3 does indeed solve this issue, so a more future-proof solution would be to migrate towards that.
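For reference, Ruby itself gained String#unicode_normalize in version 2.2, so on a modern Ruby the missing piece is available without Rails. The sketch below strips accents by decomposing characters and dropping the combining marks; `strip_accents` is a hypothetical helper and is not how Rails implements transliterate.

```ruby
# Hypothetical helper: decompose accented letters (NFD), then drop combining marks.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_accents("Crème brûlée")   # Creme brulee

# The same visible character can be stored in two different ways.
composed   = "\u00e9"    # a single precomposed code point
decomposed = "e\u0301"   # plain "e" followed by a combining acute accent
puts composed == decomposed                          # false
puts composed == decomposed.unicode_normalize(:nfc)  # true
```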

Ruby string escape for supplementary plane Unicode characters

You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:

s = "\u{1F609}" # => "😉"

The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:

s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
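A quick way to convince yourself that the braces form really produces a single supplementary-plane character is to look at its code points and bytes. A small sketch, assuming Ruby 1.9 or later:

```ruby
wink = "\u{1F609}"

puts wink.length                                    # 1
puts wink.codepoints.inspect                        # [128521], which is 0x1F609
puts wink.bytes.map { |b| "%02X" % b }.join(" ")    # F0 9F 98 89

# In UTF-16 the same character needs a surrogate pair.
utf16_units = wink.encode("UTF-16BE").unpack("n*")
puts utf16_units.map { |u| "%04X" % u }.join(" ")   # D83D DE09

# Several code points in one pair of braces.
greeting = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!"
puts greeting.length                                # 12
```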

You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:

# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1

# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
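The same effect can be reproduced in a single file by relabeling the bytes yourself, which shows that only the encoding tag differs, not the data. A minimal sketch:

```ruby
raw = "\xF0\x9F\x98\x89".b   # the same four bytes, tagged as binary

as_utf8   = raw.dup.force_encoding("UTF-8")
as_latin1 = raw.dup.force_encoding("ISO-8859-1")

puts as_utf8.length                     # 1: one four-byte character
puts as_latin1.length                   # 4: four one-byte characters
puts as_utf8.bytes == as_latin1.bytes   # true: the bytes never changed
```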

Unicode character sent to server is returned as garbage

We were able to solve this problem by migrating to a different JSON encoding engine:

get "/foo" do
  resp = "..." # a string containing the affected Unicode character

  puts MultiJson.adapter
  puts MultiJson.dump(resp) # Fails

  MultiJson.engine = :jrjackson
  puts MultiJson.adapter
  puts MultiJson.dump(resp) # Succeeds
end
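When comparing JSON engines it helps to have a quick round-trip check. The sketch below uses Ruby's standard json library as a stand-in, so it does not reproduce the original failure; `json_round_trips?` is a hypothetical helper.

```ruby
require "json"

# Hypothetical helper: does the text survive being dumped to JSON and parsed back?
def json_round_trips?(text)
  encoded = JSON.generate([text])
  JSON.parse(encoded).first == text
end

puts json_round_trips?("\u{1F609}")      # true with the standard json library
puts json_round_trips?("Привет, мир!")   # true
```

The same check could be pointed at another encoder by swapping the two JSON calls for that library's dump and load methods.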

How should one add UTF-8 support to sorting in Ruby (including the ł character, without affecting portability)?

A good solution is to use the twitter-cldr-rb gem: https://github.com/twitter/twitter-cldr-rb

require 'twitter_cldr'
collator = TwitterCldr::Collation::Collator.new
collator.sort(['m', 'ł', 'l'])
# => ["l", "ł", "m"]
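To see why a collator is needed at all, compare it with Ruby's built-in sort, which orders strings by code point and therefore puts ł (U+0142) after every ASCII letter. The second half of this sketch is a hand-rolled alternative for lowercase Polish words only; the alphabet string and `polish_sort_key` are assumptions made for illustration, and a real collation library handles far more cases.

```ruby
words = ["m", "ł", "l"]

# Plain sort compares code points, so ł lands after m.
puts words.sort.inspect   # ["l", "m", "ł"]

# Assumed ordering: the Polish alphabet plus q, v and x, lowercase only.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = POLISH_ALPHABET.each_char.with_index.to_h

# Hypothetical helper: map each character to its rank in the alphabet;
# unknown characters sort after all known letters, by code point.
def polish_sort_key(word)
  word.each_char.map { |ch| RANK.fetch(ch, RANK.size + ch.ord) }
end

sorted = words.sort_by { |w| polish_sort_key(w) }
puts sorted.inspect       # ["l", "ł", "m"]
```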

