Function to Create Regex Matching a Number Range

Here's a quick hack:

<?php

function regex_range($from, $to) {

    if ($from < 0 || $to < 0) {
        throw new Exception("Negative values not supported");
    }

    if ($from > $to) {
        throw new Exception("Invalid range $from..$to, from > to");
    }

    // Boundaries that split [$from, $to] into sub-ranges in which
    // every digit position varies independently of the others
    $ranges = array($from);
    $increment = 1;
    $next = $from;
    $higher = true;

    while ($next <= $to) {

        if ($next + $increment - 1 > $to) {
            // A full block of this size would overshoot $to:
            // stop growing and fall back to the next smaller block size
            if (end($ranges) != $next) {
                $ranges[] = $next;
            }
            $increment = intdiv($increment, 10);
            $higher = false;
            continue;
        }

        $next += $increment;

        // While growing, switch to a larger block size at each round number
        if ($higher && $next <= $to && $next % ($increment * 10) === 0) {
            $ranges[] = $next;
            $increment *= 10;
        }
    }

    $ranges[] = $to + 1;

    $regex = '/^(?:';

    // Turn each pair of neighbouring boundaries into one alternative
    for ($i = 0; $i < count($ranges) - 1; $i++) {
        $str_from = (string)($ranges[$i]);
        $str_to = (string)($ranges[$i + 1] - 1);

        for ($j = 0; $j < strlen($str_from); $j++) {
            if ($str_from[$j] == $str_to[$j]) {
                $regex .= $str_from[$j];
            }
            else {
                $regex .= "[" . $str_from[$j] . "-" . $str_to[$j] . "]";
            }
        }
        $regex .= "|";
    }

    // Drop the trailing "|" and close the group
    return substr($regex, 0, strlen($regex) - 1) . ')$/';
}

function test($from, $to) {
    try {
        printf("%-10s %s\n", $from . '-' . $to, regex_range($from, $to));
    } catch (Exception $e) {
        echo $e->getMessage() . "\n";
    }
}

test(2, 8);
test(5, 35);
test(5, 100);
test(12, 1234);
test(123, 123);
test(256, 321);
test(256, 257);
test(180, 195);
test(2, 1);
test(-2, 4);

?>

which produces:

2-8        /^(?:[2-8])$/
5-35       /^(?:[5-9]|[1-2][0-9]|3[0-5])$/
5-100      /^(?:[5-9]|[1-9][0-9]|100)$/
12-1234    /^(?:1[2-9]|[2-9][0-9]|[1-9][0-9][0-9]|1[0-1][0-9][0-9]|12[0-2][0-9]|123[0-4])$/
123-123    /^(?:123)$/
256-321    /^(?:25[6-9]|2[6-9][0-9]|3[0-1][0-9]|32[0-1])$/
256-257    /^(?:25[6-7])$/
180-195    /^(?:18[0-9]|19[0-5])$/
Invalid range 2..1, from > to
Negative values not supported

Not properly tested, use at your own risk!

And yes, the generated regex could be written more compactly in many cases, but I leave that as an exercise for the reader :)

How to generate a regex for number range between 0.01 - 8.50

Assuming there should be 2 digits after the dot, you might use alternations:

^(?:0\.0[1-9]|0\.[1-9][0-9]|[1-7]\.[0-9]{2}|8\.(?:[0-4][0-9]|50))$
  • ^ Start of string
  • (?: Non-capturing group, so that both anchors apply to every alternative
  • 0\.0[1-9] Match 0.01 to 0.09
  • | Or
  • 0\.[1-9][0-9] Match 0.10 to 0.99
  • | Or
  • [1-7]\.[0-9]{2} Match 1.00 to 7.99
  • | Or
  • 8\.(?:[0-4][0-9]|50) Match 8.00 to 8.49, or 8.50
  • ) Close the non-capturing group
  • $ End of string

Regex demo
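
As a quick check, here is a minimal Java sketch of such a pattern. The class and method names are made up for illustration. It uses find() rather than matches() on purpose, so that the anchors do the work: with the alternation wrapped in a group, ^ and $ apply to every branch, whereas without the group an input such as 18.50 could slip through via the last branch.

```java
import java.util.regex.Pattern;

public class DecimalRangeCheck {
    // Hypothetical helper: accepts 0.01 through 8.50 with exactly two decimals.
    // The (?: ... ) group makes ^ and $ apply to all four alternatives.
    static final Pattern RANGE = Pattern.compile(
        "^(?:0\\.0[1-9]|0\\.[1-9][0-9]|[1-7]\\.[0-9]{2}|8\\.(?:[0-4][0-9]|50))$");

    static boolean inRange(String s) {
        return RANGE.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"0.00", "0.01", "0.99", "7.99", "8.50", "8.51", "18.50"};
        for (String s : samples) {
            System.out.println(s + " -> " + inRange(s));
        }
    }
}
```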

Why doesn't [01-12] range work as expected?

You seem to have misunderstood how character class definitions work in regex.

To match any of the strings 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, or 12, something like this works:

0[1-9]|1[0-2]

References

  • regular-expressions.info/Character Classes

    • Numeric Ranges (have many examples on matching strings interpreted as numeric ranges)

Explanation

A character class, by itself, attempts to match one and exactly one character from the input string. [01-12] actually defines [012], a character class that matches one character from the input against any of the 3 characters 0, 1, or 2.

The - range definition goes from 1 to 1, which includes just 1. On the other hand, something like [1-9] includes 1, 2, 3, 4, 5, 6, 7, 8, 9.

Beginners often make the mistake of defining things like [this|that]. This doesn't "work". This character class is equivalent to [this|a], i.e. it matches one character from the input against any of the 6 characters t, h, i, s, | or a. More than likely, (this|that) is what is intended.
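
The difference is easy to see in code. The following is a small Java sketch (the class and method names are only for illustration) that tries the patterns discussed above against whole strings:

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True only if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [01-12] is just the set {0, 1, 2}: one character, never two.
        System.out.println(fullMatch("[01-12]", "2"));        // single character from the set
        System.out.println(fullMatch("[01-12]", "12"));       // two characters cannot match
        // Alternation over two-character strings does what was intended.
        System.out.println(fullMatch("0[1-9]|1[0-2]", "12"));
        System.out.println(fullMatch("0[1-9]|1[0-2]", "13"));
        // [this|that] is a set of single characters, including '|'.
        System.out.println(fullMatch("[this|that]", "this"));
        System.out.println(fullMatch("(this|that)", "this"));
    }
}
```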

References

  • regular-expressions.info/Brackets for Grouping and Alternation with the vertical bar

How ranges are defined

So it's obvious now that a pattern like between [24-48] hours doesn't "work". The character class in this case is equivalent to [248].

That is, - in a character class definition doesn't define a numeric range in the pattern. Regex engines don't really "understand" numbers in the pattern, with the exception of finite repetition syntax (e.g. a{3,5} matches between 3 and 5 occurrences of a).

Range definition instead uses the ASCII/Unicode encoding of the characters to define ranges. The character 0 is encoded in ASCII as decimal 48; 9 is 57. Thus, the character class [0-9] includes all characters whose values are between decimal 48 and 57 in the encoding. Rather sensibly, by design these are the characters 0, 1, ..., 9.
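
To make this concrete, here is a short Java sketch (names are illustrative only) that prints the code points of 0 and 9 and lists which digits a given character class actually accepts:

```java
import java.util.regex.Pattern;

public class AsciiRangeDemo {
    // Returns the digits 0..9 that the given character class matches.
    static String matchingDigits(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println((int) '0' + " " + (int) '9');   // 48 57
        System.out.println(matchingDigits("[24-48]"));     // 248
        System.out.println(matchingDigits("[0-9]"));       // 0123456789
    }
}
```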

See also

  • Wikipedia/ASCII

Another example: A to Z

Let's take a look at another common character class definition [a-zA-Z]

In ASCII:

  • A = 65, Z = 90
  • a = 97, z = 122

This means that:

  • [a-zA-Z] and [A-Za-z] are equivalent
  • In most flavors, [a-Z] is likely to be an illegal character range

    • because a (97) is "greater than" Z (90)
  • [A-z] is legal, but also includes these six characters:

    • [ (91), \ (92), ] (93), ^ (94), _ (95), ` (96)
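
Both points can be checked in Java's java.util.regex flavor, which rejects a reversed range at compile time. This is a minimal sketch with an illustrative helper name; other flavors may behave differently.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LetterRangeDemo {
    // True if the regex compiles in java.util.regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[a-Z]"));                 // reversed range is rejected
        System.out.println(Pattern.matches("[A-z]", "_"));     // '_' (95) lies between Z and a
        System.out.println(Pattern.matches("[A-Za-z]", "_"));  // the explicit class excludes it
    }
}
```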

Related questions

  • is the regex [a-Z] valid and if yes then is it the same as [a-zA-Z]

What is the regex for positive number range 18-65? It must accept only two digits

You are quite close. The issue with your expression is that it is missing anchors, which tell the regex engine to match only the given string as a whole (and to fail if anything else comes before or after).

Change your expression slightly, to this: ^(1[89]|[2-5][0-9]|6[0-5])$ (example here). The ^ in front ensures that matching starts at the very beginning of the string, while the $ at the end ensures that matching stops at the end of the string. Together they ensure that the string matches the pattern as a whole; thus 19 would be matched, but 119 would not.

Alternatively (as will most likely be suggested), it would be easier to split by - and then use the usual mathematical operators.
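
The effect of the anchors can be seen by searching with and without them. This is a small Java sketch; the class, field, and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class AgeRangeDemo {
    static final Pattern ANCHORED = Pattern.compile("^(1[89]|[2-5][0-9]|6[0-5])$");
    static final Pattern UNANCHORED = Pattern.compile("1[89]|[2-5][0-9]|6[0-5]");

    // Accepts only a two-digit number from 18 to 65.
    static boolean isValidAge(String s) {
        return ANCHORED.matcher(s).find();
    }

    // Without anchors, a match anywhere inside the input is enough.
    static boolean containsAge(String s) {
        return UNANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"17", "18", "19", "65", "66", "119"}) {
            System.out.println(s + " anchored=" + isValidAge(s)
                + " unanchored=" + containsAge(s));
        }
    }
}
```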

RegEx: number range excluding specific number

You may use a negative lookahead assertion in your regex:

~\bAM19/0(?!803)[678]\d{2}\b~

RegEx Demo

Here we have a negative lookahead (?!803) after matching 19/0 which will fail the match if 803 appears right after 19/0 in input.

Also note that by using an alternate regex delimiter ~ you can avoid escaping / in your regex.
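
For comparison, here is the same pattern as a Java sketch. Java patterns take no delimiters, so the ~ is dropped and only the backslashes need doubling inside the string literal. The sample inputs are invented for illustration.

```java
import java.util.regex.Pattern;

public class ExcludeNumberDemo {
    // Same pattern as above; (?!803) rejects the one excluded number.
    static final Pattern CODE = Pattern.compile("\\bAM19/0(?!803)[678]\\d{2}\\b");

    static boolean containsCode(String text) {
        return CODE.matcher(text).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AM19/0600", "AM19/0802", "AM19/0803", "AM19/0900"}) {
            System.out.println(s + " -> " + containsCode(s));
        }
    }
}
```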

a regular expression generator for number ranges

Here's my solution and an algorithm with complexity O(log n) (n is the end of the range). I believe it is the simplest one here:

Basically, split your task into these steps:

  1. Gradually "weaken" the start of the range.
  2. Gradually "weaken" the end of the range.
  3. Merge those two.

By "weaken", I mean finding the end of the range that can be represented by a simple regex for this specific number, for example:

145 -> 149, 150 -> 199, 200 -> 999, 1000 -> etc.

Here's a backward one, for the end of the range:

387 -> 380, 379 -> 300, 299 -> 0

Merging would be the process of noticing the overlap of 299->0 and 200->999 and combining those into 200->299.

As a result, you would get this set of numbers (first list intact, second one inverted):

145, 149, 150, 199, 200, 299, 300, 379, 380, 387

Now, here is the fun part. Take the numbers in pairs, and convert them to ranges:

145-149, 150-199, 200-299, 300-379, 380-387

Or in regex:

14[5-9], 1[5-9][0-9], 2[0-9][0-9], 3[0-7][0-9], 38[0-7]

Here's what the code for the weakening would look like:

public static int next(int num) {
    // Convert to String for easier operations
    final char[] chars = String.valueOf(num).toCharArray();
    // Go through all digits backwards
    for (int i = chars.length - 1; i >= 0; i--) {
        // Skip the 0 changing it to 9. For example, for 190->199
        if (chars[i] == '0') {
            chars[i] = '9';
        } else { // If any other digit is encountered, change that to 9, for example, 195->199, or with both rules: 150->199
            chars[i] = '9';
            break;
        }
    }

    return Integer.parseInt(String.valueOf(chars));
}

// Same thing, but reversed. 387 -> 380, 379 -> 300, etc
public static int prev(int num) {
    final char[] chars = String.valueOf(num).toCharArray();
    for (int i = chars.length - 1; i >= 0; i--) {
        if (chars[i] == '9') {
            chars[i] = '0';
        } else {
            chars[i] = '0';
            break;
        }
    }

    return Integer.parseInt(String.valueOf(chars));
}

The rest is technical details and is easy to implement. Here's an implementation of this O(log n) algorithm: https://ideone.com/3SCvZf

Oh, and by the way, it works with other ranges too, for example for range 1-321654 the result is:

[1-9]
[1-9][0-9]
[1-9][0-9][0-9]
[1-9][0-9][0-9][0-9]
[1-9][0-9][0-9][0-9][0-9]
[1-2][0-9][0-9][0-9][0-9][0-9]
3[0-1][0-9][0-9][0-9][0-9]
320[0-9][0-9][0-9]
321[0-5][0-9][0-9]
3216[0-4][0-9]
32165[0-4]

And for 129-131 it's:

129
13[0-1]
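
The three steps can be sketched end to end as follows. This is not the linked ideone code but an independent, minimal Java sketch of the same idea: the class name, toPatterns, and digitPattern are made up for the example, long replaces int to avoid overflow on large inputs, and the merge is done by collecting the range ends from both directions in one sorted set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class RangeRegexSketch {
    // Weaken upward: 145 -> 149, 150 -> 199, 200 -> 999
    static long next(long num) {
        char[] chars = Long.toString(num).toCharArray();
        for (int i = chars.length - 1; i >= 0; i--) {
            boolean wasZero = chars[i] == '0';
            chars[i] = '9';
            if (!wasZero) break;
        }
        return Long.parseLong(new String(chars));
    }

    // Weaken downward: 387 -> 380, 379 -> 300, 299 -> 0
    static long prev(long num) {
        char[] chars = Long.toString(num).toCharArray();
        for (int i = chars.length - 1; i >= 0; i--) {
            boolean wasNine = chars[i] == '9';
            chars[i] = '0';
            if (!wasNine) break;
        }
        return Long.parseLong(new String(chars));
    }

    static List<String> toPatterns(long start, long end) {
        if (start < 0 || start > end) {
            throw new IllegalArgumentException("need 0 <= start <= end");
        }
        TreeSet<Long> stops = new TreeSet<>();   // inclusive upper ends of the sub-ranges
        stops.add(end);
        long s = start;
        while (next(s) < end) {                  // step 1: weaken the start
            stops.add(next(s));
            s = next(s) + 1;
        }
        long t = end;
        while (prev(t) - 1 >= start) {           // step 2: weaken the end
            t = prev(t) - 1;
            stops.add(t);
        }
        List<String> patterns = new ArrayList<>();
        long lo = start;
        for (long hi : stops) {                  // step 3: merged, sorted ends give the ranges
            patterns.add(digitPattern(lo, hi));
            lo = hi + 1;
        }
        return patterns;
    }

    // lo and hi have the same number of digits here; compare them position by position.
    static String digitPattern(long lo, long hi) {
        String a = Long.toString(lo);
        String b = Long.toString(hi);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                sb.append(a.charAt(i));
            } else {
                sb.append('[').append(a.charAt(i)).append('-').append(b.charAt(i)).append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(String.join("\n", toPatterns(145, 387)));
        System.out.println("^(?:" + String.join("|", toPatterns(129, 131)) + ")$");
    }
}
```

Joining the returned alternatives with | inside an anchored group gives a complete pattern for the range.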

Regex for matching number ranges with specific units

You could make the pattern a bit more specific and optionally match whitespace chars instead of hard-coding all the possible space variations:

\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b

Explanation

  • \b A word boundary
  • \d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
  • (?: Non capture group
    • \s*-\s* Match - between optional whitespace chars
    • \d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
  • )? Close the non capture group and make it optional
  • \s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
  • (?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
  • \b A word boundary

See a regex demo.
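
A brief Java sketch of how this pattern might be applied to free text. The class name, method name, and sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpeedUnitDemo {
    static final Pattern SPEED = Pattern.compile(
        "\\b\\d+(?:[.,]\\d+)?(?:\\s*-\\s*\\d+(?:[.,]\\d+)?)?\\s*km\\s*/\\s*(?:h|10min)\\b");

    // Collects every number or number range that is followed by km/h or km/10min.
    static List<String> findSpeeds(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SPEED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Wind 50 km/h, gusts 60-80 km / h, rain 3,5 km/10min.";
        for (String hit : findSpeeds(text)) {
            System.out.println(hit);
        }
    }
}
```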


