How to match Unicode words with Ruby 1.9?
# encoding=utf-8
p "föö".match(/\p{Word}+/)[0] == "föö"
Ruby 1.9.3 Regex utf8 \w accented characters
Try
'ein grüner Hund'.scan(/[[:word:]]+/u)
Regex \w doesn't process utf-8 characters in Ruby 1.9.2
Define "doesn't match utf-8 characters"? If you expect \w to match anything other than exactly the uppercase and lowercase ASCII letters, the ASCII digits, and underscore, it won't: Ruby has defined \w to be equivalent to [A-Za-z0-9_] regardless of Unicode. Maybe you want \p{Word} or something similar instead.
Ref: Ruby 1.9 Regexp documentation (see section "Character Classes").
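To make the difference concrete, here is a minimal sketch comparing the two patterns on the German sample string used earlier. The helper names ascii_words and unicode_words are made up for this illustration and are not part of Ruby.

```ruby
# encoding: utf-8

# Hypothetical helper: splits on \w, which is ASCII-only ([a-zA-Z0-9_]).
def ascii_words(text)
  text.scan(/\w+/)
end

# Hypothetical helper: splits on \p{Word}, which is Unicode-aware.
def unicode_words(text)
  text.scan(/\p{Word}+/)
end

p ascii_words("ein grüner Hund")   # => ["ein", "gr", "ner", "Hund"]
p unicode_words("ein grüner Hund") # => ["ein", "grüner", "Hund"]
```

The "ü" is not in [a-zA-Z0-9_], so \w+ breaks "grüner" into two pieces, while \p{Word}+ keeps it whole.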
How to specify Regexp for unicode cyrillic characters in Ruby 1.9
This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII character.
You probably want to use [[:alnum:]] instead, which includes all Unicode alphabetic and numeric characters. Check also [[:word:]] and [[:alpha:]].
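As a rough illustration of how these classes differ on Cyrillic input, the sketch below scans one sample string with each of them. The method name scan_with_classes and the sample text are assumptions made for this example.

```ruby
# encoding: utf-8

# Hypothetical helper: runs the same text through \w and three POSIX bracket classes.
def scan_with_classes(text)
  {
    w:     text.scan(/\w+/),          # ASCII letters, digits, underscore only
    alnum: text.scan(/[[:alnum:]]+/), # Unicode letters and digits, no underscore
    alpha: text.scan(/[[:alpha:]]+/), # Unicode letters only
    word:  text.scan(/[[:word:]]+/)   # like alnum, plus connector punctuation such as "_"
  }
end

result = scan_with_classes("Привет, мир_2024")
p result[:w]     # => ["_2024"]
p result[:alnum] # => ["Привет", "мир", "2024"]
p result[:alpha] # => ["Привет", "мир"]
p result[:word]  # => ["Привет", "мир_2024"]
```

Which class fits depends on whether digits and underscores should count as part of a word in your data.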
Regexp does not match utf8 characters in words (\w+)
The metacharacter \w is equivalent to the character class [a-zA-Z0-9_]; it matches only ASCII letters, digits, and _.
Instead, use the character property \p{Word}:
'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
# => #<MatchData ": Ørbæk">
According to the Character Properties section of the Ruby Regexp documentation:
/\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
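The categories listed above can be spot-checked with a small predicate. This is only a sketch: word_only? is a made-up helper name, and the number case is tested with a decimal digit only, so it says nothing about other kinds of numeric characters.

```ruby
# encoding: utf-8

# Hypothetical helper: true when every character in str matches \p{Word}.
def word_only?(str)
  !!(str =~ /\A\p{Word}+\z/)
end

p word_only?("Ørbæk")      # letters               => true
p word_only?("e\u0301")    # letter + combining mark (U+0301) => true
p word_only?("\u0663")     # Arabic-Indic digit three (a decimal digit) => true
p word_only?("snake_case") # "_" is connector punctuation => true
p word_only?("foo-bar")    # "-" is dash punctuation => false
```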
Match unicode text with Ruby 1.8.7
Unicode properties were added to Ruby in version 1.9, so in older versions you have to use POSIX classes like [:space:] or [:alpha:].
See POSIX Bracket Expressions for more details.
How do you specify a regex character range that will work in European languages other than English?
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u
should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthands for uppercase and lowercase Unicode letters. (\w already includes the underscore.)
See also this answer - you might need to run Ruby in UTF-8 mode for this to work, and possibly your script must be encoded in UTF-8, too.
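For a simpler view of what \p{Lu} and \p{Ll} match on their own, here is a small sketch that picks out capitalized words from a Danish sample. The helper name capitalized_words and the sample sentence are assumptions for this example; it deliberately leaves out \b and \w from the WIKI_WORD pattern above.

```ruby
# encoding: utf-8

# Hypothetical helper: finds runs of one uppercase letter followed by lowercase letters,
# using Unicode case properties instead of [A-Z] and [a-z].
def capitalized_words(text)
  text.scan(/\p{Lu}\p{Ll}+/)
end

p capitalized_words("Ørbæk og Århus, ikke københavn") # => ["Ørbæk", "Århus"]
```

With [A-Z][a-z]+ neither "Ørbæk" nor "Århus" would be found, since "Ø", "Å", and "æ" fall outside the ASCII ranges.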