How can I detect certain Unicode characters in a string in Ruby?
(ruby 1.9.2)
#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }
#true
#true
#true
#false
\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Wow. (Source: the Ruby Regexp documentation.)
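Since each script has its own \p{} property, the same idea extends from a yes/no check to reporting which scripts occur in a string. The following is only a sketch: the SCRIPTS table and the scripts_in helper are hypothetical names, not part of the original answer, and the choice of scripts is arbitrary.

```ruby
# encoding: UTF-8

# Hypothetical lookup table: a label for each script and the regex that
# matches one character of that script.
SCRIPTS = {
  :han      => /\p{Han}/,
  :hiragana => /\p{Hiragana}/,
  :katakana => /\p{Katakana}/,
  :hangul   => /\p{Hangul}/,
  :cyrillic => /\p{Cyrillic}/,
  :latin    => /\p{Latin}/
}

# Hypothetical helper: returns the labels of all scripts found in text,
# in the order they are listed in SCRIPTS.
def scripts_in(text)
  SCRIPTS.select { |_name, pattern| text =~ pattern }.keys
end

p scripts_in('日本語のテキスト')   # => [:han, :hiragana, :katakana]
p scripts_in('Привет, world')      # => [:cyrillic, :latin]
```

Punctuation, digits and spaces belong to the Common script, so they do not show up under any of the labels above.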
Check if a string contains a character in a unicode range (using Ruby)
The easiest thing would probably be a regex using String#index, String#match, or even String#[]:
string.index(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string.match(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string[/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/]
All three will give you nil (which is falsey) if they don't find the pattern and a non-nil value (which is truthy) if they do.
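To make the difference in return values concrete, here is a small sketch that runs the three calls against one matching and one non-matching string. The constant name RANGE and the sample strings are made up for illustration.

```ruby
# encoding: UTF-8

# The character class from the answer above, stored once for reuse.
RANGE = /[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/

hit  = "price: 5 \u20AC"    # U+20AC (euro sign) falls inside \u2000-\u2BFF
miss = "plain ascii text"

p hit.index(RANGE)          # => 9   (character offset of the first match)
p hit.match(RANGE).class    # => MatchData
p hit[RANGE]                # => "€" (the matched text itself)

p miss.index(RANGE)         # => nil
p miss.match(RANGE)         # => nil
p miss[RANGE]               # => nil
```

Note that the first range starts at \u007B, so the ASCII characters {, |, } and ~ also count as matches with this particular pattern.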
Ruby Output Unicode Character
In Ruby 1.9.x+
Use String#encode:
checkmark = "\u2713"
puts checkmark.encode('utf-8')
prints
✓
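If the code point is only known as a number at run time, a "\u2713" literal is not an option. As a sketch of two alternatives in Ruby 1.9+ (the variable names are illustrative), Array#pack with the 'U' directive and Integer#chr with an explicit encoding both build the same string:

```ruby
# encoding: UTF-8

codepoint = 0x2713                    # CHECK MARK

from_pack = [codepoint].pack('U')     # Array#pack, 'U' = UTF-8 character
from_chr  = codepoint.chr('UTF-8')    # Integer#chr with a target encoding

puts from_pack
puts from_chr
puts from_pack == "\u2713"            # true: same string as the literal
```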
In Ruby 1.8.7
puts '\u2713'.gsub(/\\u[\da-f]{4}/i) { |m| [m[-4..-1].to_i(16)].pack('U') }
✓
Detecting non-ASCII characters in Rails
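All ideographic language encodings use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).
You may compare the character length to the byte length of the string as a quick and dirty detector. It is not foolproof, though: in UTF-8 every non-ASCII character (for example é) also takes more than one byte, so this detects non-ASCII text in general rather than ideographs specifically.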
class String
  def multibyte?
    chars.count < bytes.count
  end
end
"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
How to check string contains special character in ruby
special = "?<>',?[]}{=-)(*&^%$#`~{}"
regex = /[#{special.gsub(/./){|char| "\\#{char}"}}]/
You can then use the regex to test whether a string contains any of the special characters:
if some_string =~ regex
This looks a bit complicated: what's going on in this bit
special.gsub(/./){|char| "\\#{char}"}
is to turn this
"?<>',?[]}{=-)(*&^%$#`~{}"
into this:
"\\?\\<\\>\\'\\,\\?\\[\\]\\}\\{\\=\\-\\)\\(\\*\\&\\^\\%\\$\\#\\`\\~\\{\\}"
This is every character in special, escaped with a \ (which is itself escaped in the string, i.e. \\ rather than \). This is then used to build a regex like this:
/[<every character in special, escaped>]/
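As an alternative to escaping every character by hand, Ruby's built-in Regexp.escape can do the escaping. This is a swapped-in technique, not the approach of the answer above; it escapes only regex metacharacters (including -, ^ and ]), which is enough for use inside a character class. A minimal sketch with made-up test strings:

```ruby
special = "?<>',?[]}{=-)(*&^%$#`~{}"

# Regexp.escape returns a copy of the string with regex metacharacters
# backslash-escaped, so it can be interpolated into a character class.
regex = /[#{Regexp.escape(special)}]/

["plain words 123", "what?", "a-b", "x^y"].each do |s|
  puts "#{s.inspect}: #{(s =~ regex) ? 'contains a special character' : 'clean'}"
end
```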
Ruby string escape for supplementary plane Unicode characters
You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:
s = "\u{1F609}" # => "😉"
The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:
s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:
# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1
# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
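To check what such a literal actually contains, the string can be inspected at the character, code point and byte level. The sketch below assumes a UTF-8 source file; the variable names are illustrative. It also shows that a byte-escape string can be tagged as UTF-8 with String#force_encoding, which changes only the encoding label and not the bytes.

```ruby
# encoding: UTF-8

s = "\u{1F609}"

p s.length                              # => 1 (one character)
p s.bytesize                            # => 4 (four bytes in UTF-8)
p s.codepoints.to_a == [0x1F609]        # => true
p s.bytes.map { |b| '%02X' % b }        # => ["F0", "9F", "98", "89"]

# The same four bytes written as byte escapes, explicitly tagged as UTF-8.
from_bytes = "\xF0\x9F\x98\x89".dup.force_encoding('UTF-8')
p from_bytes == s                       # => true
```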