What's the Difference Between /\p{Alpha}/i and /\p{L}/i in Ruby

What's the difference between /\p{Alpha}/i and /\p{L}/i in ruby?

They seem to be equivalent. (Edit: sometimes, see the end of this answer)

It seems that Ruby has supported \p{Alpha} since version 1.9. In POSIX, \p{Alpha} is equal to \p{L&} (for regular expressions with Unicode support). This matches all characters that have an upper- and lower-case variant. Unicase letters would not be matched (while they would be matched by \p{L}).

This does not seem to be true for Ruby (I picked a random Arabic character, since Arabic has a unicase alphabet):

  • \p{L} (any letter) matches.
  • Case-sensitive classes \p{Lu}, \p{Ll}, \p{Lt} don't match. As expected.
  • \p{L&} doesn't match. As expected.
  • \p{Alpha} matches.

Which seems to be a very good indication that \p{Alpha} is just an alias for \p{L} in Ruby. On Rubular you can also see that \p{Alpha} was not available in Ruby 1.8.7.
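The checks above can be reproduced with a short script. This is only a sketch: it assumes Ruby 2.4 or later (for String#match?) and uses U+0628 ARABIC LETTER BEH as the sample character, since the answer does not say which character was actually picked.

```ruby
# U+0628 ARABIC LETTER BEH is a unicase letter (general category Lo).
# The choice of character is an assumption; the original only says "a random Arabic character".
ch = "\u0628"

checks = {
  '\p{L}'     => ch.match?(/\p{L}/),     # any letter
  '\p{Lu}'    => ch.match?(/\p{Lu}/),    # uppercase letter
  '\p{Ll}'    => ch.match?(/\p{Ll}/),    # lowercase letter
  '\p{Lt}'    => ch.match?(/\p{Lt}/),    # titlecase letter
  '\p{Alpha}' => ch.match?(/\p{Alpha}/)  # POSIX-style alphabetic class
}

checks.each { |name, matched| puts "#{name}: #{matched}" }
```

Only the \p{L} and \p{Alpha} lines should report true, which is what the list above describes.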

Note that the i modifier is irrelevant in any case, because both \p{Alpha} and \p{L} match both upper- and lower-case characters anyway.

EDIT:

Aha, there is a difference! I just found this PDF about Ruby's new regex engine (in use as of Ruby 1.9, as stated above). \p{Alpha} is available regardless of encoding (and will probably just match [A-Za-z] if there is no Unicode support), while \p{L} is specifically a Unicode property. That means \p{Alpha} behaves exactly as in POSIX regexes, with the difference that here it corresponds to \p{L}, but in POSIX it corresponds to \p{L&}.

Matching strings that contain a letter with the first character not being a number

The [a-z]* and [[:space:]]* patterns can match an empty string, so they make no real difference for validation. Also, = is not a digit, so it is matched by the negated character class [^\d], which is a consuming pattern: it requires a character other than a digit to be present in the string.

You may rely on a lookahead that will restrict the start of string position:

/\A(?!\d).*[a-z]/im

Or an even slightly faster, Unicode-friendly version:

/\A(?!\d)\P{L}*\p{L}/

See the regex demo

Details:

  • \A - start of a string
  • (?!\d) - the first char cannot be a digit
  • \P{L}* - 0 or more (*) chars other than letters

    or
  • .* - any 0+ chars (including line breaks if the /m modifier is used)
  • \p{L} - a letter

The m modifier enables the . to match line break chars in a Ruby regex.

Use [a-z] when you need to restrict the letters to those in ASCII table only. Also, \p{L} may be replaced with [[:alpha:]] and \P{L} with [^[:alpha:]].
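As a quick sanity check, here is a small sketch that runs the Unicode-friendly pattern against a few strings. The sample strings are made up for illustration, and String#match? assumes Ruby 2.4 or later.

```ruby
# Unicode-aware variant from the answer: the first character must not be a digit,
# and the string must contain at least one letter.
pattern = /\A(?!\d)\P{L}*\p{L}/

# Hypothetical inputs chosen to exercise each part of the pattern.
samples = ["abc", "1abc", "=1a", "=12", "", "-\u00E9"]

results = samples.map { |s| [s, s.match?(pattern)] }.to_h
results.each { |s, ok| puts "#{s.inspect}: #{ok}" }
```

"1abc" fails because of the lookahead, while "=12" and the empty string fail because they contain no letter.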

Regular expression \p{L} and \p{N}

\p{L} matches a single code point in the category "letter".

\p{N} matches any kind of numeric character in any script.

Source: regular-expressions.info

If you're going to work with regular expressions a lot, I'd suggest bookmarking that site, it's very useful.
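To see both properties side by side in Ruby, here is a minimal sketch. The sample string is an assumption chosen to include a non-ASCII decimal digit and a letter-like number.

```ruby
# "a", "1", ARABIC-INDIC DIGIT THREE (U+0663, category Nd),
# and ROMAN NUMERAL EIGHT (U+2167, category Nl).
text = "a1\u0663\u2167"

letters = text.scan(/\p{L}/)  # code points in the "letter" category
numbers = text.scan(/\p{N}/)  # any kind of numeric character

p letters
p numbers
```

The Roman numeral lands in the \p{N} list rather than the \p{L} list, because its general category is a number category even though it looks like a letter.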

Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby?

The two characters in question are:

  • U+0BC0 Tamil Vowel Sign II, with the following (relevant) properties:

    • General Category: Nonspacing Mark
    • Alphabetic: Yes
  • U+0BCD Tamil Sign Virama, with the following (relevant) properties:

    • General Category: Nonspacing Mark
    • Alphabetic: No

The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).

While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.

On the other hand, the documentation for Onigmo (the Regexp engine used in YARV, and mirrored in all other implementations) does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:

  • In doc/RE, it says that [[:alpha:]] matches Letter | Mark.
  • In doc/UnicodeProps.txt, it seems to imply that [[:alpha:]] matches Alphabetic.

So, what seems to be going on, is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore, Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation is incorrect, and the Ruby documentation is imprecise.
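The behavior described here can be checked directly. This is a sketch assuming Ruby 2.4 or later (for String#match?); the results depend on the Unicode data bundled with the Ruby build in use.

```ruby
vowel_sign = "\u0BC0"  # TAMIL VOWEL SIGN II (Alphabetic: Yes)
virama     = "\u0BCD"  # TAMIL SIGN VIRAMA   (Alphabetic: No)

report = [vowel_sign, virama].map do |ch|
  {
    codepoint: format("U+%04X", ch.ord),
    alpha:     ch.match?(/[[:alpha:]]/),  # POSIX bracket expression
    letter:    ch.match?(/\p{L}/),        # general category Letter
    mark:      ch.match?(/\p{M}/)         # general category Mark
  }
end

report.each { |row| p row }
```

If [[:alpha:]] really meant Letter | Mark, both characters would match it, since both are marks. Only the one with the Alphabetic property should.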

How to use regex to swap number position in a string in ruby?

The working regex is:

(\d)(\p{L}+)(\d)

str = str.gsub(/(\d)(\p{L}+)(\d)/, '\3\2\1')

\p{L} ... Matches a character from the Unicode category “letter” (any letter character of any language)
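A small sketch of the substitution on a few made-up inputs (the sample strings are assumptions, not taken from the question):

```ruby
# Swap the digits on either side of a run of letters.
# '\3\2\1' is single-quoted so the backreferences reach gsub unchanged.
swap = ->(s) { s.gsub(/(\d)(\p{L}+)(\d)/, '\3\2\1') }

ascii   = swap.call("1abc2")
unicode = swap.call("7\u00E9t\u00E99")  # accented letters are covered by \p{L}
no_run  = swap.call("12")               # no letters between the digits, so no change

puts ascii, unicode, no_run
```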

Capitalize first letter of each numbered line

You can use gsub to update several lines:

str = <<TEXT
1. i was just doing this problem.
2. also eating so much food.
3. it was nice listening to the Mahler.
TEXT

puts str.gsub(/^(\d+\.[ \t]+)(\w)/) { "#{$1}#{$2.upcase}" }

Output:

1. I was just doing this problem.
2. Also eating so much food.
3. It was nice listening to the Mahler.

  • ^ matches the beginning of a line
  • (\d+\.[ \t]+) captures the beginning of a numbered line, i.e. one or more digits, followed by a literal dot and spaces / tabs
  • (\w) captures a single word character

The first capture group is returned unmodified: #{$1}, whereas the second capture group is upcased: #{$2.upcase}

Since upcase only affects letters, you could also upcase everything up to and including the first letter:

puts str.gsub(/^\d+\.[ \t]+\w/, &:upcase)
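One caveat: in Ruby, \w only covers ASCII word characters, so a line that starts with an accented letter is left untouched. The sketch below swaps in \p{L} for that case. The sample line is hypothetical, and upcasing non-ASCII letters assumes Ruby 2.4 or later.

```ruby
line = "1. \u00E9tude in C minor."

# Original pattern: \w does not match the leading accented letter, so nothing changes.
with_w = line.gsub(/^(\d+\.[ \t]+)(\w)/) { "#{$1}#{$2.upcase}" }

# Variant with \p{L}: matches any Unicode letter after the number prefix.
with_l = line.gsub(/^(\d+\.[ \t]+)(\p{L})/) { "#{$1}#{$2.upcase}" }

puts with_w
puts with_l
```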

How to count instances of any Unicode letter in my string

You could try using String#scan, passing your \p{L} regex, and then chain the count method:

string = "aá"
p string.scan(/\p{L}/).count
# 2

Regex that will match a combination of letters and numbers NOT ending with km

From the examples you provided, it appears your serial numbers will always start with a digit and end with a letter. If this isn't true, then refer to my comment and read up on what it will take for us to assist you better.

This pattern should work:

/(\d+[a-z0-9]+[a-z](?<!km\b))(?:,|$)/i

This requires the following conditions:

  • \d+ starts with one or more (+) digits (\d)
  • [a-z0-9]+ followed by one or more (+) alphanumeric characters ([a-z0-9])
  • [a-z] followed by a single letter
  • (?<!km\b) a negative lookbehind that rejects the match if the text just consumed ends with the letters km
  • (?:,|$) a non-capturing group that requires the match to be followed by either a comma , or the end of the string $

This uses a single capturing group (...) so you don't include the comma , that comes with the entire match

See it on regex101

Scanning for Unicode Numbers in a string with \d

Noted by Brian Candler on ruby-talk:

  • \w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
  • \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
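The same contrast holds for digits, which is the workaround mentioned above. A minimal sketch, using ARABIC-INDIC DIGIT THREE as an assumed sample of a non-ASCII digit:

```ruby
text = "1\u0663"  # ASCII "1" followed by ARABIC-INDIC DIGIT THREE (U+0663)

ascii_only  = text.scan(/\d/)           # \d: ASCII digits only
posix_class = text.scan(/[[:digit:]]/)  # POSIX bracket: Unicode decimal digits
by_property = text.scan(/\p{Nd}/)       # explicit Unicode property Decimal_Number

p ascii_only, posix_class, by_property
```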

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."

[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:

https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc


