How to Detect Certain Unicode Characters in a String in Ruby

How can I detect certain Unicode characters in a string in Ruby?

(ruby 1.9.2)

# encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }

# true
# true
# true
# false

\p{...} matches any single character that belongs to the named Unicode script; for example, \p{Han} matches one Han character.

The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
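
As a rough illustration of how these script names can be used, the sketch below checks a string against a short list of scripts. The helper name scripts_in and the candidate list are assumptions made for this example and are not part of the original answer.

```ruby
# encoding: UTF-8

# Hypothetical helper (not from the original answer): returns the names of
# the candidate scripts that occur at least once in +text+.
def scripts_in(text, candidates = %w[Han Hiragana Katakana Hangul Latin Cyrillic])
  candidates.select { |script| text =~ Regexp.new("\\p{#{script}}") }
end

p scripts_in('日本語のテキスト')   # => ["Han", "Hiragana", "Katakana"]
p scripts_in('Привет, world')     # => ["Latin", "Cyrillic"]
```

Punctuation, digits and spaces belong to the Common script, so they do not show up under any of the names listed here.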

Wow. (Source: Ruby Regexp documentation.)

Check if a string contains a character in a Unicode range (using Ruby)

The easiest thing would probably be a regex using String#index, String#match, or even String#[]:

string.index(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string.match(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/)
string[/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/]

All three will give you nil (which is falsey) if they don't find the pattern and non-nil (which will be truthy) if they do.
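
A minimal sketch of how the String#index form might be wrapped in a predicate. The method name in_symbol_ranges? is an assumption for this example. Note that an index of 0 is still truthy in Ruby, so a match at the start of the string is not mistaken for a miss.

```ruby
# encoding: UTF-8

# Hypothetical predicate wrapping the String#index form shown above.
def in_symbol_ranges?(string)
  !string.index(/[\u007B-\u00BF\u02B0-\u037F\u2000-\u2BFF]/).nil?
end

arrow = "\u2192 next"   # U+2192 (rightwards arrow) lies inside \u2000-\u2BFF

p arrow.index(/[\u2000-\u2BFF]/)     # => 0, which is still truthy in Ruby
p in_symbol_ranges?(arrow)           # => true
p in_symbol_ranges?('plain text')    # => false
p in_symbol_ranges?('{braces}')      # => true, because U+007B is "{"
```

The last call is worth noticing: the first range starts at U+007B, so a few ASCII characters ({, |, } and ~) are matched as well.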

Ruby Output Unicode Character

In Ruby 1.9.x+

Use String#encode:

checkmark = "\u2713"
puts checkmark.encode('utf-8')

prints

✓

In Ruby 1.8.7

puts '\u2713'.gsub(/\\u[\da-f]{4}/i) { |m| [m[-4..-1].to_i(16)].pack('U') }
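
When the code point is available as an integer rather than as a literal escape, Ruby 1.9+ offers a couple of other routes to the same string. A minimal sketch; the variable names are illustrative only.

```ruby
# encoding: UTF-8

codepoint = 0x2713   # CHECK MARK

from_chr  = codepoint.chr(Encoding::UTF_8)   # Integer#chr with an explicit encoding
from_pack = [codepoint].pack('U')            # 'U' packs a code point as UTF-8

puts from_chr                    # prints the check mark
p from_chr == "\u2713"           # => true
p from_pack == "\u2713"          # => true
p from_chr.unpack('U*')          # => [10003]
```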

Detecting non-ASCII characters in Rails

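In UTF-8, every non-ASCII character, including all CJK ideographs, is encoded as more than one byte. Ruby 1.9+ is aware of the difference between bytes and characters; Ruby 1.8 is not.

You can compare the character length to the byte length of the string as a quick and dirty detector. It is not foolproof: it flags any non-ASCII character (an accented Latin letter, for example), not only ideographs, and it only works for multibyte encodings such as UTF-8.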

class String
  def multibyte?
    chars.count < bytes.count
  end
end

"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false

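Ruby 1.9+ also provides String#ascii_only?, which expresses a similar check more directly. The sketch below uses it and shows that the check flags any non-ASCII character, not only ideographs. The method name non_ascii? is an assumption for this example.

```ruby
# encoding: UTF-8

# Hypothetical helper built on String#ascii_only? (available since Ruby 1.9).
def non_ascii?(string)
  !string.ascii_only?
end

accented = "caf\u00E9"      # "café" with a precomposed e-acute

p non_ascii?('可口可樂')     # => true
p non_ascii?('qwerty')      # => false
p non_ascii?(accented)      # => true, although it contains no ideographs

# The byte/character comparison agrees for this UTF-8 string:
p accented.chars.count      # => 4
p accented.bytes.count      # => 5
```
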
How to check whether a string contains a special character in Ruby

special = "?<>',?[]}{=-)(*&^%$#`~{}"
regex = /[#{special.gsub(/./){|char| "\\#{char}"}}]/

You can then use the regex to test if a string contains the special character:

if some_string =~ regex

This looks a bit complicated. The expression

special.gsub(/./){|char| "\\#{char}"}

turns this

"?<>',?[]}{=-)(*&^%$#`~{}"

into this:

"\\?\\<\\>\\'\\,\\?\\[\\]\\}\\{\\=\\-\\)\\(\\*\\&\\^\\%\\$\\#\\`\\~\\{\\}"

This is every character in special, escaped with a \ (which is itself escaped in the string literal, i.e. \\ rather than \). It is then used to build a regex like this:

/[<every character in special, escaped>]/
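
Ruby's core library also has Regexp.escape, which performs a similar escaping step. The following is a sketch of that alternative, assuming the same special string; it is not the approach used in the answer above.

```ruby
special = "?<>',?[]}{=-)(*&^%$#`~{}"

# Regexp.escape backslash-escapes regex metacharacters, including "-", "]"
# and "^", so the result is safe to place inside a character class.
regex = Regexp.new("[#{Regexp.escape(special)}]")

p 'price: $5' =~ regex    # => 7
p 'hello' =~ regex        # => nil
p !!('a-b' =~ regex)      # => true
```

Characters without special meaning in a regex, such as < or ', are left unescaped, which makes the generated pattern somewhat easier to read.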

Ruby string escape for supplementary plane Unicode characters

You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:

s = "\u{1F609}" # => "😉"

The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:

s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
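
To see what such an escape produces, the string can be inspected with a few standard String methods. A minimal sketch; the variable names are illustrative.

```ruby
# encoding: UTF-8

wink = "\u{1F609}"   # WINKING FACE, outside the Basic Multilingual Plane

p wink.length            # => 1 (characters)
p wink.bytesize          # => 4 (UTF-8 bytes)
p wink.codepoints.to_a   # => [128521]
p wink.ord.to_s(16)      # => "1f609"

greeting = "\u{41f 440 438 432 435 442}"
p greeting.length        # => 6, one character per run of hex digits
```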

You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:

# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1

# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
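
If a string ends up with the right bytes but the wrong encoding label, String#force_encoding can relabel it without changing the bytes. A minimal sketch, in which the binary string stands in for data obtained from elsewhere:

```ruby
# encoding: UTF-8

# Four raw bytes labelled as binary: each byte counts as one character.
raw = "\xF0\x9F\x98\x89".dup.force_encoding(Encoding::BINARY)
p raw.length             # => 4

# Relabel the same bytes as UTF-8; no transcoding takes place.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
p utf8.length            # => 1
p utf8.valid_encoding?   # => true
p utf8 == "\u{1F609}"    # => true
```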

