How to Match Unicode Words with Ruby 1.9

How to match unicode words with ruby 1.9?

# encoding: utf-8
p "föö".match(/\p{Word}+/)[0] == "föö"

Ruby 1.9.3 Regex utf8 \w accented characters

Try

'ein grüner Hund'.scan(/[[:word:]]+/u)

See the Ruby Regexp documentation on POSIX bracket expressions.

Regex \w doesn't process utf-8 characters in Ruby 1.9.2

What do you mean by "doesn't match utf-8 characters"? If you expect \w to match anything other than the uppercase and lowercase ASCII letters, the ASCII digits, and the underscore, it won't: Ruby defines \w as equivalent to [A-Za-z0-9_], regardless of Unicode. You probably want \p{Word} or something similar instead.

Ref: Ruby 1.9 Regexp documentation (see section "Character Classes").
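To make the difference concrete, here is a minimal sketch comparing \w with \p{Word} on the same string. The sample text and variable names are made up for illustration, and it assumes the script and the string are UTF-8 encoded.

```ruby
# encoding: utf-8

# Hypothetical sample text with accented letters.
text = "föö bär_1"

# \w is ASCII-only, so each accented letter splits a word.
ascii_words = text.scan(/\w+/)

# \p{Word} covers Unicode letters, marks, numbers and connector punctuation.
unicode_words = text.scan(/\p{Word}+/)

p ascii_words    # => ["f", "b", "r_1"]
p unicode_words  # => ["föö", "bär_1"]
```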

How to specify Regexp for unicode cyrillic characters in Ruby 1.9

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't target any unicode character.

You probably want to use [[:alnum:]] instead, which includes all unicode alphabetic and numeric characters. Check also [[:word:]] and [[:alpha:]].
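As a rough illustration of the bracket classes mentioned above, the sketch below runs them against a short Cyrillic string. The sample text is invented for the example, and it assumes a UTF-8 source file and string.

```ruby
# encoding: utf-8

# Hypothetical sample text mixing Cyrillic letters, digits and punctuation.
text = "Привет, мир 42!"

letters   = text.scan(/[[:alpha:]]+/)  # Unicode letters only
alnum     = text.scan(/[[:alnum:]]+/)  # Unicode letters and digits
words     = text.scan(/[[:word:]]+/)   # letters, marks, numbers, connector punctuation
ascii_only = text.scan(/\w+/)          # \w sees only the ASCII digits here

p letters     # => ["Привет", "мир"]
p alnum       # => ["Привет", "мир", "42"]
p words       # => ["Привет", "мир", "42"]
p ascii_only  # => ["42"]
```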

Regexp does not match utf8 characters in words (\w+)

The metacharacter \w is equivalent to the character class [a-zA-Z0-9_]; it matches only ASCII letters, digits, and _.

Instead use the character property \p{Word}:

'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
# => #<MatchData ": Ørbæk">

According to Character Properties from Ruby Regexp documentation:

/\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
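A small sketch of what those categories mean in practice: a combining accent (a Mark) and the underscore (Connector_Punctuation) count as word characters, while a hyphen does not. The sample strings are made up for illustration.

```ruby
# encoding: utf-8

# "e" followed by U+0301 COMBINING ACUTE ACCENT, which is in the Mark category.
decomposed = "cafe\u0301"
whole = decomposed.match(/\p{Word}+/)[0]

# The underscore is Connector_Punctuation, so it stays inside the match.
snake = "snake_case".match(/\p{Word}+/)[0]

# The hyphen is in none of the four categories, so it splits the words.
parts = "well-known".scan(/\p{Word}+/)

p whole == decomposed  # => true
p snake                # => "snake_case"
p parts                # => ["well", "known"]
```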

Match unicode text with Ruby 1.8.7

Unicode properties were added to Ruby in version 1.9, so in older versions you have to use POSIX classes such as [:space:] or [:alpha:].

See POSIX Bracket Expressions for more details.

How do you specify a regex character range that will work in European languages other than English?

WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

This should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthand for uppercase and lowercase Unicode letters. (\w already includes the underscore.)

See also this answer - you might need to run Ruby in UTF-8 mode for this to work, and possibly your script must be encoded in UTF-8, too.
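For what it's worth, here is a quick sketch of how that pattern behaves. The local variable wiki_word simply mirrors the WIKI_WORD regex from the answer, and the sample sentences are invented. Note that the trailing \w* is still ASCII-only, so accented letters after the second capital are not covered by this sketch.

```ruby
# encoding: utf-8

# Same pattern as WIKI_WORD above, held in a local variable for the example.
wiki_word = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# A CamelCase word whose lowercase run contains a non-ASCII letter.
hit = "siehe GrünerHund hier"[wiki_word]

# Two separate words: there is no second capital inside a single word.
miss = "ein grüner Hund"[wiki_word]

p hit   # => "GrünerHund"
p miss  # => nil
```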


