\d Less Efficient Than [0-9]

Why is \d slower than [0-9]?

\d matches every Unicode decimal digit, while [0-9] is limited to these 10 ASCII characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that \d matches but [0-9] does not.
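The same distinction shows up outside .NET. As a minimal sketch in Python (a different engine from the one discussed here, so treat it only as an illustration), \d on a str pattern matches Unicode decimal digits unless the re.ASCII flag is set:

```python
import re

persian = "۱۲۳"  # Persian (Extended Arabic-Indic) digits

# \d is Unicode-aware by default for str patterns in Python 3.
unicode_match = re.fullmatch(r"\d+", persian) is not None
# An explicit ASCII range never matches non-ASCII digits.
range_match = re.fullmatch(r"[0-9]+", persian) is not None
# re.ASCII restricts \d to the ten ASCII digits.
ascii_flag_match = re.fullmatch(r"\d+", persian, re.ASCII) is not None

print(unicode_match, range_match, ascii_flag_match)
```

The wider character set that \d has to consider is the reason given above for the speed difference.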

You can generate a list of all such characters using the following code:

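using System;
using System.Text;
using System.Text.RegularExpressions;

// Collect every UTF-16 code unit that \d matches.
var sb = new StringBuilder();
for (int i = 0; i <= char.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());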

Which generates:

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789

Why does Apache Commons consider '१२३' numeric?

Because it checks whether the "CharSequence contains only Unicode digits" (quoting your linked documentation).

Each of these characters returns true for Character.isDigit, whose documentation notes:

Some Unicode character ranges that contain digits:

  • '\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')
  • '\u0660' through '\u0669', Arabic-Indic digits
  • '\u06F0' through '\u06F9', Extended Arabic-Indic digits
  • '\u0966' through '\u096F', Devanagari digits
  • '\uFF10' through '\uFF19', Fullwidth digits

Many other character ranges contain digits as well.

१२३ are Devanagari digits:

  • १ is DEVANAGARI DIGIT ONE, \u0967
  • २ is DEVANAGARI DIGIT TWO, \u0968
  • ३ is DEVANAGARI DIGIT THREE, \u0969
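These code points can be checked against the Unicode character database. The sketch below uses Python's standard unicodedata module rather than Java, so it only mirrors what Character.isDigit reports and is not the Apache Commons implementation:

```python
import unicodedata

text = "१२३"

# Code point, official Unicode name, and digit value of each character.
code_points = [f"U+{ord(ch):04X}" for ch in text]
names = [unicodedata.name(ch) for ch in text]
values = [unicodedata.digit(ch) for ch in text]

for cp, name, value in zip(code_points, names, values):
    print(cp, name, value)

# Python's int() also accepts Unicode decimal digits.
print(text.isdecimal(), int(text))
```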

Looking for regex solution

The pattern would be [0-9]{2}\.[0-9]{2}\.[0-9]{4}

[0-9] stands for any single digit from 0 to 9; the number in curly brackets ({2} in this case) indicates how many times the preceding pattern should be repeated.
You need to escape the dots with a backslash because otherwise each one would be interpreted as "any character".
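A short Python sketch of this pattern follows. The helper name is_date and the sample strings are made up for illustration, and the unescaped variant is included only to show why the backslashes matter:

```python
import re

# Escaped dots: only a literal "." is accepted as the separator.
DATE = re.compile(r"[0-9]{2}\.[0-9]{2}\.[0-9]{4}")
# Unescaped dots: any character is accepted as the separator.
UNESCAPED = re.compile(r"[0-9]{2}.[0-9]{2}.[0-9]{4}")

def is_date(s):
    """Return True if s has the shape dd.mm.yyyy (shape only, no range checks)."""
    return DATE.fullmatch(s) is not None

print(is_date("24.12.2023"))                          # well-formed
print(is_date("24-12-2023"))                          # wrong separator
print(UNESCAPED.fullmatch("24-12-2023") is not None)  # accepted by the sloppy pattern
```

Note that the pattern checks only the shape: a string such as 99.99.9999 also matches.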

RegEx syntax and efficiency

In terms of performance, {1,} and + are equivalent, but the first has more characters to read. And {1} is not necessary. Neither makes much difference, though.

More generally, it is not a matter of preference. If you have to match a numeric ID that can be anything from 1 up to some large number, doing so without + (or {1,}, or * with \d written twice) would be difficult. Use:

\d+

or

[0-9]+

or

[0-9][0-9]*

if you prefer.
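As a quick check, here is a Python sketch that runs these spellings (plus the {1,} form) against a made-up sample string. It assumes ASCII-only input, since \d can also match non-ASCII digits:

```python
import re

# Four ways to say "one or more digits".
patterns = [r"\d+", r"\d{1,}", r"[0-9]+", r"[0-9][0-9]*"]
sample = "id=42, id=1234567, id=x"  # hypothetical input

results = [re.findall(p, sample) for p in patterns]
for pattern, found in zip(patterns, results):
    print(pattern, found)
```

Each pattern finds the same two IDs, so the choice between them is about readability.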

Besides, [aA-zZ] matches a and Z (each twice, since both already fall inside the range A-z) and anything between A and z, including [, ], _ and a few other symbols (see an ASCII table).
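The extra characters can be listed directly. This Python sketch compares [aA-zZ] with the presumably intended [a-zA-Z] over the ASCII code points that sit between Z and a:

```python
import re

sloppy = re.compile(r"[aA-zZ]")   # really "a", the range A-z, and "Z"
strict = re.compile(r"[a-zA-Z]")  # ASCII letters only

# ASCII characters located strictly between "Z" and "a".
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))

# Characters accepted by the sloppy class but not by the strict one.
extras = [ch for ch in between if sloppy.fullmatch(ch) and not strict.fullmatch(ch)]
print(extras)
```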

Is (*i).member less efficient than i->member

When you return a reference, that's exactly the same as passing back a pointer, pointer semantics excluded.

You pass back a sizeof(void*) element, not a sizeof(yourClass).

So when you do that:

Person& Person::someFunction() {
    ...
    return *this;
}

You return a reference, and a reference is typically implemented with the same size as a pointer, so there's no runtime difference.

The same goes for your use of (*i).name: in that case the dereference yields an lvalue, which has the same semantics as a reference.

Python regex that will require at least 6 characters to return true

Try this:

regex = (r"^[a-zA-Z0-9_\s-]{6,}$")

With the re module, this pattern matches any string made up of 6 or more of the allowed characters (letters, digits, underscores, whitespace and hyphens).
For example:

import re

txt = "string"
print(re.search(r"^[a-zA-Z0-9_\s-]{6,}$", txt))

If the string has fewer than 6 characters, re.search returns None.
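To see both outcomes side by side, here is a small sketch that wraps the pattern in a helper. The name is_valid and the sample inputs are made up for illustration:

```python
import re

PATTERN = re.compile(r"^[a-zA-Z0-9_\s-]{6,}$")

def is_valid(text):
    """Return True if text is 6+ characters drawn only from the allowed set."""
    return PATTERN.search(text) is not None

for candidate in ["string", "short", "six 66", "bad!chars"]:
    print(repr(candidate), is_valid(candidate))
```

Converting the match object to a boolean this way is what makes the check "return true" in the sense of the question.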

How to write the regex pattern to get the matched string?

You can use

^[A-Z]+-[0-9]+\s+-\s+(?:[0-9]+[.)]\s*)?[A-Za-z]+

Explanation:

  • ^ - start of string
  • [A-Z]+ - 1 or more uppercase ASCII letters
  • - - a hyphen
  • [0-9]+ - 1 or more digits
  • \s+ - 1+ whitespaces
  • - - a hyphen
  • \s+ - see above
  • (?:[0-9]+[.)]\s*)? - an optional sequence of:

    • [0-9]+ - 1+ digits
    • [.)] - a literal . or )
    • \s* - 0+ whitespaces
  • [A-Za-z]+ - 1 or more ASCII letters
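A minimal Python sketch of the pattern in use. The original input is not shown here, so the two sample lines are assumptions chosen to exercise the optional numbered part:

```python
import re

PATTERN = re.compile(r"^[A-Z]+-[0-9]+\s+-\s+(?:[0-9]+[.)]\s*)?[A-Za-z]+")

# Hypothetical inputs: one with the optional "1." part, one without it.
with_number = PATTERN.match("ABC-123 - 1. Title of the item")
without_number = PATTERN.match("XY-7 - Heading and more text")

# group(0) holds the matched string.
print(with_number.group(0))
print(without_number.group(0))
```

The match stops at the end of the first run of letters because nothing in the pattern consumes the space that follows.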

Strip all numerics with length less or greater than 6

As Eily mentioned in another comment, the first issue is \b. It is a word-boundary anchor, so it will not match numbers that sit inside words, as in your examples.

My solution is to remove \b and, to make sure a longer run of digits is never matched partially, add a negative lookbehind at the start and a negative lookahead at the end of the pattern.

(?<!\d)(\d{1,5}|\d{7,})(?!\d)

Edit: I accidentally typed {1,6} instead of {1,5}.
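A short Python sketch of the pattern applied with re.sub. The sample text is made up, and the original question may have used a different language or tool:

```python
import re

# Digit runs of length 1-5 or 7+, not touching any other digit on either side.
NOT_SIX = re.compile(r"(?<!\d)(\d{1,5}|\d{7,})(?!\d)")

text = "abc12345 def123456 ghi1234567 x1"  # hypothetical input
cleaned = NOT_SIX.sub("", text)
print(cleaned)
```

The lookarounds stop the pattern from removing part of the six-digit run, which is why 123456 survives intact.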


