Unicode characters in a Ruby script?
You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
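As a rough illustration of the BOM bytes described here, a script could check the first bytes of a file before handing it to Ruby. This sketch is not part of the original answer: the detect_bom helper and the BOMS table are hypothetical names, it needs Ruby 2.0 or later for String#b, and it only knows UTF-8 and UTF-16 (a UTF-32LE file also starts with FF FE, so treat the result as a hint).

```ruby
# Hypothetical lookup table: BOM byte sequence => encoding name.
BOMS = {
  "\xEF\xBB\xBF".b => "UTF-8",
  "\xFF\xFE".b     => "UTF-16LE",
  "\xFE\xFF".b     => "UTF-16BE"
}.freeze

# Returns the name of the BOM at the start of +bytes+, or nil if there is none.
def detect_bom(bytes)
  data = bytes.b # binary copy, so the comparison is byte-for-byte
  BOMS.each { |bom, name| return name if data.start_with?(bom) }
  nil
end

# "\377\376" (octal) is the same two bytes as "\xFF\xFE" (hex).
puts detect_bom("\377\376h\0i\0").inspect # a UTF-16LE "hi" preceded by a BOM
puts detect_bom("puts 'hi'").inspect      # plain ASCII, no BOM
```

For a real file, the first few bytes could be read with File.binread(path, 3) and passed to the helper.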
Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):
puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
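An escaped literal like this one does not have to be typed by hand. The following is a minimal sketch for Ruby 1.9 or later (not Ruby 1.8, which this answer is about), assuming the source file itself is saved as UTF-8; escape_bytes is a hypothetical helper name.

```ruby
# Hypothetical helper: render every byte of a string as a \xNN escape.
def escape_bytes(text)
  text.bytes.map { |byte| format("\\x%02x", byte) }.join
end

greeting = "こんにちは・今日は"

# Prints a ready-to-paste puts statement containing only ASCII characters.
puts %(puts "#{escape_bytes(greeting)}")
```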
However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.
So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:
puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
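The effect of the code page can be reproduced without a Windows console. This is a sketch for Ruby 1.9 or later, where String#encode can transcode between encodings; it assumes a UTF-8 source file and uses Ruby's encoding names Windows-31J (code page 932) and Windows-1252. The for_code_page helper is hypothetical.

```ruby
# Hypothetical helper: transcode text for a given code page, substituting "?"
# for characters the code page cannot represent.
def for_code_page(text, code_page)
  text.encode(code_page, undef: :replace, replace: "?")
end

japanese = "こんにちは"

# Code page 932 has these characters, stored as two bytes each.
p for_code_page(japanese, "Windows-31J").bytes

# Code page 1252 has none of them, so every character is replaced.
p for_code_page(japanese, "Windows-1252")
```

Without the undef: :replace option, the second call would raise Encoding::UndefinedConversionError instead.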
(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
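In Ruby 1.9 and later you can at least ask Ruby which encodings it has picked up from the environment. A small sketch, with encoding_report as a hypothetical helper name (the safe navigation operator needs Ruby 2.3+); the values depend entirely on the platform and locale, so none are shown here.

```ruby
# Hypothetical helper: collect the encodings Ruby derived from the environment.
def encoding_report
  {
    locale_charmap:   Encoding.locale_charmap,         # charset name reported by the OS locale
    default_external: Encoding.default_external.name,  # encoding assumed for IO unless overridden
    stdout_external:  $stdout.external_encoding&.name  # nil when no encoding is set on $stdout
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```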
Ruby Output Unicode Character
In Ruby 1.9.x+
Use String#encode:
checkmark = "\u2713"
puts checkmark.encode('utf-8')
prints
✓
In Ruby 1.8.7
puts '\u2713'.gsub(/\\u[\da-f]{4}/i) { |m| [m[-4..-1].to_i(16)].pack('U') }
✓
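On Ruby 1.9 and later, the same idea (turning literal \uXXXX text, for example read from a file, into real characters) can be written a little more directly. A sketch only: unescape_unicode is a hypothetical name, and it handles just the four-digit form.

```ruby
# Hypothetical helper: replace literal \uXXXX sequences with the characters they name.
def unescape_unicode(text)
  # \h matches one hex digit; $1.hex turns the captured digits into an Integer.
  text.gsub(/\\u(\h{4})/) { [$1.hex].pack("U") }
end

puts unescape_unicode('\u2713 done')
```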
How to print unicode characters in Command Prompt with Ruby
You need to enclose the unicode character in { and } if the number of hex digits isn't 4 (credit: /u/Stefan), e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
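To check which code points actually ended up in a string, String#codepoints (Ruby 1.9+) can help. A minimal sketch; the code_points helper name is made up for illustration.

```ruby
# Hypothetical helper: list a string's code points in U+XXXX notation.
def code_points(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

p code_points("\u2665")
p code_points("\u{1F525 1F4AF}") # two code points inside one pair of braces
```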
Alternatively, you could just put the Unicode character directly in your source. This is quite easy, at least on macOS, with the Emoji & Symbols menu, opened by Ctrl + Command + Space by default (a similar menu can be accessed on Windows 10 with Win + ;). It works in most applications, most likely including your text editor or Ruby IDE:
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯
How can I detect certain Unicode characters in a string in Ruby?
(ruby 1.9.2)
#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }
#true
#true
#true
#false
\p{...} matches a character by Unicode property; given a script name such as Han, it matches characters that belong to that script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Wow. (Source: the Ruby Regexp documentation.)
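If reopening String is not desirable, the same test can live in a plain method that also reports which scripts were found. A sketch assuming Ruby 2.4 or later for String#match?; cjk_scripts and CJK_SCRIPTS are hypothetical names, not part of the original answer.

```ruby
# Hypothetical table: one pattern per script the original regex checks for.
CJK_SCRIPTS = {
  han:      /\p{Han}/,
  hiragana: /\p{Hiragana}/,
  katakana: /\p{Katakana}/,
  hangul:   /\p{Hangul}/
}.freeze

# Returns the names of the CJK scripts that occur in +text+ (empty if none).
def cjk_scripts(text)
  CJK_SCRIPTS.select { |_name, pattern| text.match?(pattern) }.keys
end

p cjk_scripts("日本")
p cjk_scripts("Watashi ha bakana gaijin desu.")
```

An empty result plays the role of contains_cjk? returning false.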
How can I use Unicode/Korean characters in a Ruby program?
You can set the encoding when you read the file using the encoding option to File.read:
correctly_encoded_text = File.read("my_korean_text.txt", encoding: "UTF-8")
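Note that the encoding option only labels the bytes as UTF-8; it does not check them. The sketch below adds that check. The read_utf8 helper is hypothetical, and the temporary file (Tempfile.create, Ruby 2.1+) merely stands in for my_korean_text.txt so the example is self-contained.

```ruby
require "tempfile"

# Hypothetical helper: read a file as UTF-8 and fail loudly if the bytes are not valid UTF-8.
def read_utf8(path)
  text = File.read(path, encoding: "UTF-8")
  raise ArgumentError, "#{path} is not valid UTF-8" unless text.valid_encoding?
  text
end

Tempfile.create("my_korean_text") do |file|
  file.write("한국어") # stand-in content for the real file
  file.flush
  text = read_utf8(file.path)
  puts text.encoding
  puts text
end
```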
ruby: unicode character decimal value to \uXXXX conversion? .ord method not working
mu is too short's answer is cool.
But, the simplest answer is:
'好'.ord.to_s(16) # => '597d'
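Note that to_s(16) does not pad, so a character such as é gives 'e9' rather than the four digits that \u expects. A sketch of one way to build the full escape, with to_u_escape as a hypothetical name; code points above U+FFFF fall back to the braced form.

```ruby
# Hypothetical helper: format a single character as a \uXXXX or \u{...} escape.
def to_u_escape(char)
  cp = char.ord
  cp <= 0xFFFF ? format("\\u%04x", cp) : format("\\u{%x}", cp)
end

puts to_u_escape("好")
puts to_u_escape("é")
```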
Need a range with all unicode characters
[*32..65535].map do |e|
  e.chr(Encoding::UTF_8).tap do |char|
    char =~ /\p{Alnum}|\p{Punct}/ || raise
  end rescue nil # rescuing both conversion and self-raised
end.compact
The above iterates through the code points from 32 to 65535 (the Basic Multilingual Plane), selecting alphanumerics and punctuation.
NB: The approach above, while more or less robust, fails to match combining diacritics, that is, the marks that form part of combined characters like ç or ö.
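If combining marks should be kept as well, one option is to add the Mark category (\p{M}) to the pattern and to skip the surrogate range explicitly instead of rescuing. This is a sketch under those assumptions, not the original author's code; matching_chars and SURROGATES are hypothetical names, and it assumes Ruby 2.4+ for String#match?.

```ruby
# Surrogate code points cannot be encoded as UTF-8, so Integer#chr raises for them.
SURROGATES = (0xD800..0xDFFF).freeze

# Hypothetical helper: characters in +range+ that are alphanumeric,
# punctuation, or combining marks.
def matching_chars(range = 32..0xFFFF)
  range.reject { |cp| SURROGATES.cover?(cp) }
       .map    { |cp| cp.chr(Encoding::UTF_8) }
       .select { |char| char.match?(/\p{Alnum}|\p{Punct}|\p{M}/) }
end

puts matching_chars.size
```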
Unicode characters in Ruby 1.9.3 IRB with RVM
RVM has issues with readline installed via homebrew. This gist worked perfectly for me:
$ rvm get latest
$ rvm pkg install readline
$ rvm install 1.9.3 --with-readline-dir=$rvm_path/usr
Instead of install you can use reinstall.
How do I escape a Unicode string with Ruby?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
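To leave printable ASCII alone and escape only the rest, one possibility on Ruby 1.9+ is a gsub over the non-ASCII characters. A sketch; escape_non_ascii is a hypothetical name, and it uses the braced \u{...} form so that code points above U+FFFF work too.

```ruby
# Hypothetical helper: escape only the non-ASCII characters of a UTF-8 string.
def escape_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) { |char| format("\\u{%x}", char.ord) }
end

puts escape_non_ascii("hello\u0639!")
```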