How to Use Regex for UTF-8 in Ruby

Specify Unicode Character in Regular Expression

You can write the character as a hex escape, /\x02/:

"\u0002" =~ /\x02/
#=> 0

If you're not sure how to write the escape, you can build the regexp from a string:

Regexp.new("\u0002")
#=> /\x02/
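
The same idea extends to building a pattern from a code point number. This is only a minimal sketch: the variable names are illustrative and not part of the original answer, and Regexp.escape is added as a precaution in case the character happens to be a regexp metacharacter.

```ruby
# Illustrative names; Integer#chr, Regexp.escape and Regexp.new are core Ruby.
codepoint = 0x2602                           # U+2602 UMBRELLA
char      = codepoint.chr(Encoding::UTF_8)   # the character as a UTF-8 string
pattern   = Regexp.new(Regexp.escape(char))  # escape in case it is a metacharacter

p "☀☁☂" =~ pattern   # 2 (a character index, not a byte index)
```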

Here's another example, this time with a \u escape:

"☀☁☂" =~ /\u2602/
#=> 2

As mentioned by @TomLord in the comments, you can also specify a range. To check whether a string includes a character from the Unicode Arrows block (U+2190 to U+21FF):

"↹" =~ /[\u2190-\u21FF]/
#=> 0
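
As a small sketch of how the range could be reused, the snippet below stores it in a constant and wraps it in a predicate. The names ARROWS and contains_arrow? are hypothetical, and String#match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper around the range from the answer above.
# U+2190..U+21FF is the Unicode "Arrows" block.
ARROWS = /[\u2190-\u21FF]/

def contains_arrow?(text)
  text.match?(ARROWS)
end

p contains_arrow?("a → b")   # true
p contains_arrow?("a -> b")  # false (plain ASCII, no arrow character)
p "↹ ← →".scan(ARROWS)       # every arrow character found in the string
```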

Regexp does not match utf8 characters in words (\w+)

The metacharacter \w is equivalent to the character class [a-zA-Z0-9_]; it matches only ASCII letters, ASCII digits, and the underscore.

Instead use the character property \p{Word}:

'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
# => #<MatchData ": Ørbæk">

According to the Character Properties section of the Ruby Regexp documentation:

/\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
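
To make the four categories concrete, here is a rough sketch that tests one sample character per category against \p{Word} and \w. The sample characters and the samples hash are my own choices for illustration, not taken from the documentation.

```ruby
# One illustrative character per general category named above.
samples = {
  "Letter"                => "\u00D8", # Ø
  "Mark"                  => "\u0301", # combining acute accent
  "Number"                => "\u0663", # Arabic-Indic digit three
  "Connector_Punctuation" => "_",
}

samples.each do |category, char|
  unicode = char.match?(/\p{Word}/)  # Unicode-aware property
  ascii   = char.match?(/\w/)        # ASCII-only shorthand
  puts "#{category}: \\p{Word}=#{unicode} \\w=#{ascii}"
end
```

Of these samples, only the underscore is also matched by \w.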

How to match non-Unicode string with regexp in Ruby?

Not every byte sequence is a valid Unicode string (or, more specifically, a valid UTF-8 string).

Your single-byte string, for example, is not:

str = "\xa0"

str.encoding #=> #<Encoding:UTF-8>
str.valid_encoding? #=> false
str.codepoints # ArgumentError (invalid byte sequence in UTF-8)

To work with an arbitrary byte string, you have to set its encoding to binary (ASCII-8BIT):

str = "\xa0".b      # <-- note the .b

str.encoding #=> #<Encoding:ASCII-8BIT>
str.valid_encoding? #=> true
str.codepoints #=> [160]

and also set the regexp encoding to ASCII-8BIT (via the n modifier):

str =~ /\xa0/n
#=> 0
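
Putting the two steps together, a sketch of a helper that reports where a given byte occurs in arbitrary data might look like this. The method name byte_a0_offsets is hypothetical; it relies only on String#b, String#scan and Regexp.last_match.

```ruby
# Hypothetical helper: byte offsets of every 0xA0 byte in arbitrary data.
def byte_a0_offsets(data)
  bytes   = data.b   # binary (ASCII-8BIT) copy; the argument is left untouched
  offsets = []
  # In a binary string every character is one byte, so match offsets are byte offsets.
  bytes.scan(/\xa0/n) { offsets << Regexp.last_match.begin(0) }
  offsets
end

p byte_a0_offsets("a\xa0b\xa0")  # [1, 3]
p byte_a0_offsets("abc")         # []
```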

Ruby 1.9.3 Regex utf8 \w accented characters

Try the [[:word:]] bracket expression:

'ein grüner Hund'.scan(/[[:word:]]+/u)

See the POSIX brackets section of the Ruby Regexp documentation.

How do you use unicode characters within a regular expression in Ruby?

Try this:

text.encode('utf-8', 'utf-8').gsub(/《.*?》/u, '')
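
If the input may contain bytes that are not valid UTF-8, matching a UTF-8 regexp against it can fail with an ArgumentError. One possible way to guard against that, assuming Ruby 2.1 or later, is String#scrub. The sketch below uses made-up sample data and is not the original answer's method.

```ruby
# Made-up sample: valid UTF-8 text with one stray invalid byte appended in the middle.
raw = "keep《drop》" + "\xff" + " this"

clean  = raw.scrub("")               # drop invalid byte sequences (Ruby 2.1+)
result = clean.gsub(/《.*?》/, "")   # remove the bracketed span, non-greedily

p raw.valid_encoding?    # false
p clean.valid_encoding?  # true
p result                 # "keep this"
```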

How to match unicode words with ruby 1.9?

# encoding=utf-8 
p "föö".match(/\p{Word}+/)[0] == "föö"

Regex \w doesn't process utf-8 characters in Ruby 1.9.2

Define "doesn't match utf-8 characters"? If you expect \w to match anything other than the uppercase and lowercase ASCII letters, the ASCII digits, and the underscore, it won't: Ruby defines \w as equivalent to [A-Za-z0-9_] regardless of Unicode. You probably want \p{Word} or a similar character property instead.
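
For a side-by-side comparison, the following sketch tokenizes one made-up sample string with \w, \p{Word}, [[:word:]] and \p{L} (letters only). The sample text is an assumption chosen to include accented letters, an underscore and a digit.

```ruby
text = "Ørbæk_2 grüner"

p text.scan(/\w+/)          # ["rb", "k_2", "gr", "ner"]  ASCII only, so words are split
p text.scan(/\p{Word}+/)    # ["Ørbæk_2", "grüner"]
p text.scan(/[[:word:]]+/)  # ["Ørbæk_2", "grüner"]       same set as \p{Word}
p text.scan(/\p{L}+/)       # ["Ørbæk", "grüner"]         letters only, no "_" or digits
```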

Ref: Ruby 1.9 Regexp documentation (see section "Character Classes").


