What's the difference between /\p{Alpha}/i and /\p{L}/i in ruby?
They seem to be equivalent. (Edit: sometimes, see the end of this answer)
It seems like Ruby supports \p{Alpha}
since version 1.9. In POSIX \p{Alpha}
is equal to \p{L&}
(for regular expressions with Unicode support; see here). This matches all characters that have an upper and lower case variant (see here). Unicase letters would not be matched (while they would be matched by \p{L}).
This does not seem to be true for Ruby (I picked a random Arabic character, since Arabic has a unicase alphabet):
- \p{L} (any letter) matches.
- Case-sensitive classes \p{Lu}, \p{Ll}, \p{Lt} don't match. As expected.
- \p{L&} doesn't match. As expected.
- \p{Alpha} matches.
This is a strong indication that \p{Alpha}
is just an alias for \p{L}
in Ruby. On Rubular you can also see that \p{Alpha}
was not available in Ruby 1.8.7.
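The check described above can be reproduced with a short script. This is only a sketch: the choice of ARABIC LETTER BEH (U+0628) as the unicase test character and the variable names are my own assumptions, not the character used in the original test, and \p{L&} is left out on purpose.

```ruby
# ARABIC LETTER BEH (U+0628): general category Lo, i.e. a unicase letter.
# (Assumed test character; any unicase letter should behave the same way.)
letter = "\u0628"

classes = {
  'L'     => /\p{L}/,     # any letter
  'Lu'    => /\p{Lu}/,    # uppercase letter
  'Ll'    => /\p{Ll}/,    # lowercase letter
  'Lt'    => /\p{Lt}/,    # titlecase letter
  'Alpha' => /\p{Alpha}/  # the class under discussion
}

# Map each class name to true/false depending on whether it matches the letter.
results = classes.transform_values { |re| re.match?(letter) }

results.each do |name, matched|
  puts "\\p{#{name}}: #{matched ? 'match' : 'no match'}"
end
```

Only \p{L} and \p{Alpha} should report a match, which is the pattern of results the list above describes.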
Note that the i
modifier is irrelevant in any case, because both \p{Alpha}
and \p{L}
match both upper- and lower-case characters anyway.
EDIT:
Aha, there is a difference! I just found this PDF about Ruby's new regex engine (in use as of Ruby 1.9 as stated above). \p{Alpha}
is available regardless of encoding (and will probably just match [A-Za-z]
if there is no Unicode support), while \p{L}
is specifically a Unicode property. That means, \p{Alpha}
behaves exactly as in POSIX regexes, with the difference that here it corresponds to \p{L}, but in POSIX it corresponds to \p{L&}.
Matching strings that contain a letter with the first character not being a number
The [a-z]*
and [[:space:]]*
patterns can match an empty string, so they make no difference to validation. Also, = is not a digit, so it is matched by the [^\d] negated character class, which is a consuming pattern: it requires an actual character other than a digit to be present in the string.
You may rely on a lookahead that will restrict the start of string position:
/\A(?!\d).*[a-z]/im
Or a slightly faster, Unicode-friendly version:
/\A(?!\d)\P{L}*\p{L}/
See the regex demo
Details:
- \A - start of a string
- (?!\d) - the first char cannot be a digit
- \P{L}* - 0 or more (*) chars other than letters (or .* - any 0+ chars, including line breaks if the /m modifier is used)
- \p{L} - a letter
The m
modifier enables the .
to match line break chars in a Ruby regex.
Use [a-z]
when you need to restrict the letters to those in ASCII table only. Also, \p{L}
may be replaced with [[:alpha:]]
and \P{L}
with [^[:alpha:]]
.
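To see how the two patterns behave, here is a small sketch that runs both against a few sample strings. The sample strings and variable names are my own and are purely illustrative.

```ruby
ascii_re   = /\A(?!\d).*[a-z]/im         # ASCII letters only, . also matches line breaks
unicode_re = /\A(?!\d)\P{L}*\p{L}/       # any Unicode letter

# Hypothetical inputs: "\u00E9" is the single letter e with acute accent.
samples = ["abc", "=1a", "1abc", "123", "", "\u00E9"]

# For each sample, record whether each regex accepts it.
verdicts = samples.map do |s|
  [s, ascii_re.match?(s), unicode_re.match?(s)]
end

verdicts.each do |s, ascii_ok, unicode_ok|
  puts "#{s.inspect}: ascii=#{ascii_ok} unicode=#{unicode_ok}"
end
```

Strings starting with a digit and strings without any letter are rejected by both patterns. The accented letter is accepted only by the \p{L} version.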
Regular expression \p{L} and \p{N}
\p{L}
matches a single code point in the category "letter".
\p{N}
matches any kind of numeric character in any script.
Source: regular-expressions.info
If you're going to work with regular expressions a lot, I'd suggest bookmarking that site, it's very useful.
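As a quick illustration of the two properties, the sketch below scans a mixed string. The sample text is my own: it contains three letters (a, é, 中) and three numeric characters (the digit 1, superscript two, and the Roman numeral eight).

```ruby
# a, é (U+00E9), 1, superscript two (U+00B2), Roman numeral eight (U+2167), 中 (U+4E2D)
text = "a\u00E91\u00B2\u2167\u4E2D"

letters = text.scan(/\p{L}/)  # code points in the "letter" category
numbers = text.scan(/\p{N}/)  # code points in any numeric category (Nd, Nl, No)

p letters
p numbers
```

Note that \p{N} is broader than decimal digits: it also covers characters such as superscripts and Roman numerals.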
Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby?
The two characters in question are:
- U+0BC0 Tamil Vowel Sign II, with the following (relevant) properties:
- General Category: Nonspacing Mark
- Alphabetic: Yes
- U+0BCD Tamil Sign Virama, with the following (relevant) properties:
- General Category: Nonspacing Mark
- Alphabetic: No
The Ruby documentation for the Regexp
class does not explicitly spell out what [[:alpha:]]
matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]]
as an example, saying it matches anything with the Unicode property Nd (Decimal Number).
While not explicitly documented, it makes sense to equate the Regexp
POSIX bracket expression [[:alpha:]]
with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
On the other hand, the documentation for Onigmo (the Regexp
engine used in YARV, and mirrored in all other implementations) does explicitly specify the workings of [[:alpha:]]
. In fact, it specifies it in two different places, and they contradict each other:
- In doc/RE, it says that [[:alpha:]] matches Letter | Mark.
- In doc/UnicodeProps.txt, it seems to imply that [[:alpha:]] matches Alphabetic.
So what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation (doc/RE) is incorrect, and the Ruby documentation is imprecise.
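The behaviour discussed above can be checked directly. This is a minimal sketch with variable names of my own; it assumes the Ruby in use ships Unicode data in which the Alphabetic property of these two characters is as listed above.

```ruby
vowel_sign_ii = "\u0BC0" # TAMIL VOWEL SIGN II: Nonspacing Mark, Alphabetic: Yes
virama        = "\u0BCD" # TAMIL SIGN VIRAMA:   Nonspacing Mark, Alphabetic: No

# For each character, check its general category and the POSIX bracket class.
report = [vowel_sign_ii, virama].map do |ch|
  {
    code_point: format("U+%04X", ch.ord),
    nonspacing_mark: ch.match?(/\p{Mn}/),
    alpha: ch.match?(/[[:alpha:]]/)
  }
end

report.each { |row| p row }
```

Both characters are nonspacing marks, so a Letter | Mark definition would accept both. If only U+0BC0 matches [[:alpha:]], that is consistent with the Alphabetic reading.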
How to use regex to swap number position in a string in ruby?
The working regex is:
(\d)(\p{L}+)(\d)
str = str.gsub(/(\d)(\p{L}+)(\d)/, '\3\2\1')
\p{L} ... Matches a character from the Unicode category “letter” (any letter character of any language)
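A short usage sketch of that substitution, wrapped in a lambda so it can be tried on several inputs. The lambda name and the sample strings are hypothetical.

```ruby
# Swap the digit before a run of letters with the digit after it.
swap = ->(s) { s.gsub(/(\d)(\p{L}+)(\d)/, '\3\2\1') }

first  = swap.call("1abc2")
second = swap.call("7\u00E9t\u00E99 and 3xy4") # "été" contains non-ASCII letters

puts first
puts second
```

The single-quoted replacement string matters here: in a double-quoted string, \3 would be read as an escape sequence rather than as a back-reference.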
Capitalize first letter of each numbered line
You can use gsub
to update several lines:
str = <<TEXT
1. i was just doing this problem.
2. also eating so much food.
3. it was nice listening to the Mahler.
TEXT
puts str.gsub(/^(\d+\.[ \t]+)(\w)/) { "#{$1}#{$2.upcase}" }
Output:
1. I was just doing this problem.
2. Also eating so much food.
3. It was nice listening to the Mahler.
- ^ matches the beginning of a line
- (\d+\.[ \t]+) captures the beginning of a numbered line, i.e. one or more digits, followed by a literal dot and spaces / tabs
- (\w) captures a single word character
The first capture group is returned unmodified: #{$1}
, whereas the second capture group is upcased: #{$2.upcase}
Since upcase
only affects letters, you could also upcase
everything up to and including the first letter:
puts str.gsub(/^\d+\.[ \t]+\w/, &:upcase)
How to count instances of any Unicode letter in my string
You could try using String#scan
, passing your \p{L}
regex, and then chain the count
method:
string = "aá"
p string.scan(/\p{L}/).count
# 2
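If the string also contains digits, punctuation or whitespace, only the letters are counted. As an alternative sketch (my own variation, not part of the answer above), the count can also be taken character by character without building an intermediate array of matches:

```ruby
# a, á, space, 1, underscore, ж (Cyrillic), exclamation mark
mixed = "a\u00E1 1_\u0436!"

scan_count = mixed.scan(/\p{L}/).count                     # via String#scan
char_count = mixed.each_char.count { |ch| ch.match?(/\p{L}/) } # via Enumerable#count

p scan_count
p char_count
```

Both approaches should agree; which one reads better is largely a matter of taste.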
Regex that will match a combination of letters and numbers NOT ending with km
From the examples you provided, it appears your serial numbers will always start with a digit and end with a letter -- if this isn't true, then refer to my comment and read up what it's going to take to assist you better.
This pattern should work:
/(\d+[a-z0-9]+[a-z](?<!km\b))(?:,|$)/i
This requires the following conditions:
- \d+ start with one or more (+) digits (\d)
- [a-z0-9]+ followed by any alphanumeric character ([a-z0-9]), one or more times (+)
- [a-z] ending with a letter
- (?<!km\b) negative lookbehind that rejects the match if it ends with the letters km
- (?:,|$) asserts that the match is followed by either a comma , or the end of the string $
This uses a single capturing group (...) so that the captured value does not include the comma , that is part of the entire match.
See it on regex101
Scanning for Unicode Numbers in a string with \d
Noted by Brian Candler on ruby-talk:
- \w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
- \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w
in the same Oniguruma doc we see the text:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.
This would explain why \d
does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
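The same contrast can be shown for digits. The sketch below uses ARABIC-INDIC DIGIT THREE (U+0663) as an example of a non-ASCII decimal digit; that choice and the variable names are my own.

```ruby
arabic_indic_three = "\u0663" # ARABIC-INDIC DIGIT THREE, general category Nd
text = "7#{arabic_indic_three}"

ascii_digits   = text.scan(/\d/)          # \d: ASCII digits only
posix_digits   = text.scan(/[[:digit:]]/) # POSIX bracket: Unicode decimal digits
unicode_digits = text.scan(/\p{Nd}/)      # explicit Unicode property

p ascii_digits
p posix_digits
p unicode_digits
```

Here \d should find only the ASCII 7, while [[:digit:]] and \p{Nd} should find both characters.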
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u
flag (e.g. /\w/u
) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc