Specify Unicode Character in Regular Expression
You can write /\x02/:
"\u0002" =~ /\x02/
#=> 0
If you're not sure how to write the escape, you can start from a string:
Regexp.new("\u0002")
#=> /\x02/
Here's another example:
"☀☁☂" =~ /\u2602/
#=> 2
As mentioned by @TomLord in the comments, you can also specify a range. To check whether a string includes a character from the Unicode Arrows block (U+2190 to U+21FF):
"↹" =~ /[\u2190-\u21FF]/
#=> 0
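The range check above can be wrapped in a small predicate. This is only a sketch: the method name contains_arrow? is made up for illustration, and String#match? assumes Ruby 2.4 or later. It also shows the braced \u{...} form, which is needed for code points above U+FFFF.

```ruby
# Hypothetical helper: true if the text contains a character from the
# Unicode Arrows block (U+2190..U+21FF).
def contains_arrow?(text)
  text.match?(/[\u2190-\u21FF]/)
end

contains_arrow?("a → b")   #=> true  (→ is U+2192)
contains_arrow?("a -> b")  #=> false (plain ASCII)

# Code points above U+FFFF use the braced escape.
"😀" =~ /\u{1F600}/         #=> 0
```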
Regexp does not match utf8 characters in words (\w+)
The metacharacter \w is equivalent to the character class [a-zA-Z0-9_]; it matches only ASCII letters, digits, and _.
Instead, use the character property \p{Word}:
'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
# => #<MatchData ": Ørbæk">
According to the Character Properties section of the Ruby Regexp documentation:
/\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
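To make the difference concrete, here is a minimal side-by-side comparison on the city name from the example. The method names ascii_words and property_words are invented for this sketch, and the source file is assumed to be UTF-8 (the default since Ruby 2.0).

```ruby
# \w is ASCII-only, so non-ASCII letters split the word.
def ascii_words(text)
  text.scan(/\w+/)
end

# \p{Word} covers Unicode letters, marks, numbers and connector punctuation.
def property_words(text)
  text.scan(/\p{Word}+/)
end

ascii_words('Ørbæk')     #=> ["rb", "k"]
property_words('Ørbæk')  #=> ["Ørbæk"]
```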
How to match non-Unicode string with regexp in Ruby?
Not every byte sequence is a valid Unicode string (or, more specifically, a valid UTF-8 string).
Your single-byte string, for example, is not:
str = "\xa0"
str.encoding #=> #<Encoding:UTF-8>
str.valid_encoding? #=> false
str.codepoints # ArgumentError (invalid byte sequence in UTF-8)
To work with an arbitrary byte string, you have to set its encoding to binary (ASCII-8BIT):
str = "\xa0".b # <-- note the .b
str.encoding #=> #<Encoding:ASCII-8BIT>
str.valid_encoding? #=> true
str.codepoints #=> [160]
and also set the regexp encoding to ASCII-8BIT (via the n modifier):
str =~ /\xa0/n
#=> 0
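Both steps can be combined in one helper. The following is a sketch under the assumption that you only need the position of the byte 0xA0; the method name index_of_byte_a0 is hypothetical. Because the string is treated as binary, the returned index is a byte offset.

```ruby
# Hypothetical helper: byte offset of the first 0xA0 byte, or nil.
# String#b returns a copy tagged as ASCII-8BIT, so the input is not modified.
def index_of_byte_a0(data)
  data.b =~ /\xa0/n
end

index_of_byte_a0("abc\xa0")  #=> 3
index_of_byte_a0("abc")      #=> nil
```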
Ruby 1.9.3 Regex utf8 \w accented characters
Try the POSIX bracket expression [[:word:]]:
'ein grüner Hund'.scan(/[[:word:]]+/u)
See the Ruby Regexp documentation on POSIX bracket expressions.
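For reference, this is roughly what the scan returns. The sketch assumes a UTF-8 source file, in which case the /u flag is not needed; posix_words is a name made up for the example. Other POSIX brackets such as [[:upper:]] are Unicode-aware in the same way.

```ruby
# Hypothetical helper: split text into words using the POSIX bracket class.
def posix_words(text)
  text.scan(/[[:word:]]+/)
end

posix_words('ein grüner Hund')  #=> ["ein", "grüner", "Hund"]

# Other POSIX brackets also recognise non-ASCII characters.
'Ørbæk'.scan(/[[:upper:]]/)     #=> ["Ø"]
```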
How do you use unicode characters within a regular expression in Ruby?
Try this:
text.encode('utf-8', 'utf-8').gsub(/《.*?》/u, '')
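A self-contained version of the substitution, as a sketch: the method name strip_book_titles and the sample sentence are invented here, and the input is assumed to already be valid UTF-8. The second method writes the same pattern with \u escapes (《 is U+300A, 》 is U+300B), which can be handy when the source file should stay ASCII-only.

```ruby
# Hypothetical helper: remove every 《...》 span (non-greedy, so each
# pair of brackets is removed separately).
def strip_book_titles(text)
  text.gsub(/《.*?》/, '')
end

# Same pattern written with code point escapes.
def strip_book_titles_escaped(text)
  text.gsub(/\u300A.*?\u300B/, '')
end

strip_book_titles("我读了《红楼梦》和《西游记》。")  #=> "我读了和。"
```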
How to match unicode words with ruby 1.9?
# encoding=utf-8
p "föö".match(/\p{Word}+/)[0] == "föö"
Regex \w doesn't process utf-8 characters in Ruby 1.9.2
Define "doesn't match utf-8 characters"? If you expect \w to match anything other than exactly the uppercase and lowercase ASCII letters, the ASCII digits, and underscore, it won't -- Ruby has defined \w to be equivalent to [A-Za-z0-9_] regardless of Unicode. Maybe you want \p{Word} or something similar instead.
Ref: Ruby 1.9 Regexp documentation (see section "Character Classes").
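Which property counts as "something similar" depends on what should be treated as part of a word. As a rough illustration (the sample string is made up and uses precomposed ö characters), \p{L} matches letters only, while \p{Word} also accepts digits and the underscore:

```ruby
sample = "föö_42"

sample.scan(/\w+/)         #=> ["f", "_42"]   ASCII word characters only
sample.scan(/\p{L}+/)      #=> ["föö"]        Unicode letters only
sample.scan(/\p{Word}+/)   #=> ["föö_42"]     letters, marks, numbers, connectors
```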