Scanning for Unicode Numbers in a String with \d

Noted by Brian Candler on ruby-talk:

  • \w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
  • \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
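For comparison only (this is Python, not Ruby, and says nothing about how Ruby's regexp engine is configured), Python's re module exposes a similar two-mode split explicitly: with str patterns, \d and \w are Unicode-aware by default, and the re.ASCII flag restricts them to ASCII. A minimal sketch, with a sample string chosen purely for illustration:

```python
import re

# Sample text: 'a', ASCII '1', ARABIC-INDIC DIGIT THREE (U+0663), 'ç' (U+00E7).
text = "a1\u0663\u00E7"

# Default mode for str patterns: \d and \w follow Unicode properties.
unicode_digits = re.findall(r"\d", text)
unicode_word = re.findall(r"\w", text)

# re.ASCII restricts \d to [0-9] and \w to [A-Za-z0-9_].
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)
ascii_word = re.findall(r"\w", text, flags=re.ASCII)

print(unicode_digits, ascii_digits)
print(unicode_word, ascii_word)
```

The ASCII-only results mirror what the Ruby example above shows for \w, while the default results correspond to what the bracket classes match.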

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."

[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:

https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

Character class for Unicode digits

Quoting the Java docs about isDigit:

A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.

So, I believe the pattern to match digits should be \p{Nd}.

Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
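The ideone example is not reproduced here. As a rough cross-check of the same idea in Python rather than Java, the sketch below tests a character's general category directly; the helper name is_decimal_digit and the sample characters are assumptions made for illustration. Nd is the short name of the category that Java calls DECIMAL_DIGIT_NUMBER.

```python
import unicodedata


def is_decimal_digit(ch: str) -> bool:
    """Return True when ch has Unicode general category Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


# ASCII 7, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO, ROMAN NUMERAL EIGHT, a letter.
samples = ["7", "\u0663", "\u00B2", "\u2167", "x"]
for ch in samples:
    print(repr(ch), unicodedata.category(ch), is_decimal_digit(ch))
```

Only the first two samples are in Nd; the superscript and the Roman numeral are numeric characters in other categories (No and Nl), so a \p{Nd}-style check rejects them.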

How do you check in python whether a string contains only numbers?

You'll want to use the isdigit method on your str object:

if len(isbn) == 10 and isbn.isdigit():
    pass  # isbn is ten characters long and every character is a digit

From the isdigit documentation:

str.isdigit()

Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
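Because of the definition quoted above, isdigit accepts more than the ASCII characters 0-9. The sketch below shows the difference between isdigit, isdecimal, and isnumeric, plus one possible stricter check; the helper name is_ascii_digits is hypothetical, and str.isascii assumes Python 3.7 or later.

```python
def is_ascii_digits(s: str) -> bool:
    """Hypothetical stricter check: non-empty and made only of ASCII 0-9."""
    return s.isascii() and s.isdigit()


superscript_two = "\u00B2"   # Numeric_Type=Digit, but not a decimal digit
arabic_three = "\u0663"      # Numeric_Type=Decimal
one_half = "\u00BD"          # Numeric_Type=Numeric only

for label, s in [("superscript two", superscript_two),
                 ("arabic-indic three", arabic_three),
                 ("one half", one_half),
                 ("ascii", "0123456789")]:
    print(label, s.isdigit(), s.isdecimal(), s.isnumeric(), is_ascii_digits(s))
```

If the value is later passed to int(), note that int() accepts decimal digits such as the Arabic-Indic three but raises ValueError for the superscript two, even though isdigit returns True for both.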

How do I verify that a string only contains letters, numbers, underscores and dashes?

A regular expression will do the trick with very little code:

import re

...

if re.match(r"^[A-Za-z0-9_-]*$", my_little_string):
    pass  # do something here
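One detail worth knowing about the pattern above: in Python, $ also matches just before a trailing newline, so a string such as "abc\n" passes the check. A minimal sketch of a stricter variant using re.fullmatch follows; the names ALLOWED and is_valid are hypothetical, and the * quantifier is kept from the original, so the empty string is still accepted.

```python
import re

# Same character set as above; fullmatch makes the ^ and $ anchors unnecessary.
ALLOWED = re.compile(r"[A-Za-z0-9_-]*")


def is_valid(s: str) -> bool:
    """Return True when every character of s is a letter, digit, '_' or '-'."""
    return ALLOWED.fullmatch(s) is not None


print(is_valid("my_little-string1"))                           # accepted
print(is_valid("abc\n"))                                       # rejected by fullmatch
print(re.match(r"^[A-Za-z0-9_-]*$", "abc\n") is not None)      # accepted by ^...$
```

Changing * to + would additionally reject the empty string, if that is the behavior you want.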

Iterate Through Unicode Ranges

VB.NET unfortunately does not treat chars the same way C# does. A char is actually just a number (known as a character code) that represents a letter, so to a computer it makes sense that you would be able to use them in a loop.

However, for it to work in VB.NET you would have to convert the chars into integers first to be able to use them in a loop, then convert the integer in each iteration back into a Char:

For i As Integer = AscW("A"c) To AscW("Z"c)
    Dim c As Char = ChrW(i)
    Yield c
Next

As for your second example, Unicode code points are represented in the form U+####. The #### part is a hexadecimal number, which can be written in VB.NET in the form &H####. To the compiler a hexadecimal number is just a normal number, so all you need to do is to change your loop to:

For i As Integer = &H0000 To &H007F
    Dim c As Char = ChrW(i)
    Yield c
Next

  • AscW() function
  • ChrW() function
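For readers who think in Python rather than VB.NET, the same convert-to-integer, loop, convert-back pattern can be sketched as a generator. The function name chars_in_range is hypothetical; ord and chr play the roles of AscW and ChrW here, and the upper bound is inclusive to match the For ... To loops above.

```python
def chars_in_range(first: int, last: int):
    """Yield one character per code point from first to last, inclusive."""
    # range() excludes its upper bound, so add 1 to mirror VB.NET's inclusive To.
    for code_point in range(first, last + 1):
        yield chr(code_point)


# Letters A to Z, using ord() to get the code points of the endpoints.
print("".join(chars_in_range(ord("A"), ord("Z"))))

# The U+0000 to U+007F block, written with hexadecimal literals.
ascii_block = list(chars_in_range(0x0000, 0x007F))
print(len(ascii_block))
```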

Creating Unicode and Unicode-without-whitespace generators in ScalaCheck

You could do this by putting together a sequence of your non-whitespace characters, another of whitespace, and then picking from either only the non-whitespace, or from both together:

import org.scalacheck.Gen

val myChars = ('A' to 'Z') ++ ('a' to 'z')
val ws = Seq(' ', '\t')

val myCharsGenNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars))
}

val myCharsGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars ++ ws))
}

I would suggest considering what you're really testing for, though—the more you restrict the test cases, the less you're checking about how your program will behave on unexpected inputs.


