Unicode Characters in a Ruby Script

Unicode characters in a Ruby script?

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII superset: each code unit is two bytes, so every ASCII character carries a \0 as its other byte. That will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
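
If you're curious what that looks like from Ruby's side, here's a minimal check (assuming Ruby 1.9+, where String#encode exists):

# Every ASCII character gains a 0x00 companion byte in UTF-16LE.
p "puts".encode("UTF-16LE").bytes.to_a
# => [112, 0, 117, 0, 116, 0, 115, 0]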

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
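
As a quick sanity check (again assuming Ruby 1.9+):

# U+FEFF encoded as UTF-16LE comes out as the bytes FF FE,
# which a byte-oriented tool shows back to you as the octal escapes \377\376.
p "\uFEFF".encode("UTF-16LE").bytes.to_a
# => [255, 254]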

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
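
A minimal sketch of such a source file, saved as UTF-8 without a BOM (the magic comment matters on Ruby 1.9, where the default source encoding is US-ASCII; Ruby 1.8 simply ignores it):

# encoding: utf-8
puts "こんにちは・今日は"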

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"

However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.
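
On Ruby 1.9+ you can ask what encoding Ruby has picked up from the environment; a rough check (the values in the comments are examples, not guarantees):

puts Encoding.locale_charmap     # e.g. "UTF-8" on Linux/OS X, a local code page on Windows
puts Encoding.default_external   # the external encoding Ruby derives from the locale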

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. Its IO is still byte-based, going through the C standard library functions, and that means it is limited to Windows's local code page.)
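
If you're on Ruby 1.9+ and your text only uses characters the local code page can represent, one workaround sketch is to transcode explicitly before printing (the "?" replacement character here is an assumption; pick whatever suits you):

text = "こんにちは・今日は"
console = Encoding.find("locale")   # the code page Ruby derives from the system locale
puts text.encode(console, undef: :replace, replace: "?")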

Ruby Output Unicode Character

In Ruby 1.9.x+

Use String#encode:

checkmark = "\u2713"
puts checkmark.encode('utf-8')

prints

✓
In Ruby 1.8.7

puts '\u2713'.gsub(/\\u[\da-f]{4}/i) { |m| [m[-4..-1].to_i(16)].pack('U') }

How to print Unicode characters in Command Prompt with Ruby

You need to enclose the Unicode code point in { and } if the number of hex digits isn't 4 (credit: /u/Stefan), e.g.:

heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred

Alternatively, you can put the Unicode character directly in your source. That's easy on macOS via the Emoji & Symbols menu (Ctrl + Command + Space by default); on Windows 10 a similar picker opens with Win + ; . Both work in most applications, most likely including your text editor or Ruby IDE:

macOS Emoji Menu

heart = "♥"
package = "br>fire_and_one_hundred = "br>puts heart
puts package
puts fire_and_one_hundred

Output:

♥
📦
🔥💯

How it looks in the macOS terminal:


How can I detect certain Unicode characters in a string in Ruby?

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }

#true
#true
#true
#false

\p{} matches a character’s Unicode script.

The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. (Source: the Ruby Regexp documentation.)
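
For example, to test for just one of those scripts:

'광고 프로그램' =~ /\p{Hangul}/                    # => 0 (match at the first character)
'Watashi ha bakana gaijin desu.' =~ /\p{Hangul}/   # => nil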

How can I use Unicode/Korean characters in a Ruby program?

You can set the encoding when you read the file using the encoding option to File.read:

correctly_encoded_text = File.read("my_korean_text.txt", encoding: "UTF-8")
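
If the file on disk is not UTF-8 in the first place, File.read can also transcode it on the way in with an external:internal encoding pair; a hedged sketch, assuming the file is actually EUC-KR:

korean_text = File.read("my_korean_text.txt", encoding: "EUC-KR:UTF-8")
korean_text.encoding   # => #<Encoding:UTF-8>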

ruby: unicode character decimal value to \uXXXX conversion? .ord method not working

mu is too short's answer is cool.

But, the simplest answer is:

'好'.ord.to_s(16)     # => '597d'
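
If you want the full \uXXXX form rather than the bare hex digits, one way is to pad and prefix it yourself (note the single quotes, so \u is not treated as an escape):

format('\u%04x', '好'.ord)   # => "\\u597d", i.e. the six characters \u597d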

Need a range with all unicode characters

[*32..65535].map do |e|
  e.chr(Encoding::UTF_8).tap do |char|
    char =~ /\p{Alnum}|\p{Punct}/ || raise
  end rescue nil # rescuing both conversion and self-raised
end.compact

The above iterates through the code points from 32 to 65535, keeping only alphanumerics and punctuation.

NB: the approach above, while more or less robust, fails to match diacritics that are part of combined characters such as ç or ö.
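
If you also need to catch those, one hedged sketch (unicode_normalize needs Ruby 2.2+) is to decompose the string first and look for the Mark category:

'ç'.unicode_normalize(:nfd) =~ /\p{M}/   # => 1 (the combining cedilla after the base 'c')
'c' =~ /\p{M}/                           # => nil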

Unicode characters in Ruby 1.9.3 IRB with RVM

RVM has issues with readline installed via homebrew. This gist worked perfectly for me:

$ rvm get latest
$ rvm pkg install readline
$ rvm install 1.9.3 --with-readline-dir=$rvm_path/usr

Instead of install you can use reinstall.

How do I escape a Unicode string with Ruby?

In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.

>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"

>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""

>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil

In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:

>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""

In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):

>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"

