How to Determine If a Number Is a Prime with Regex

How does this regex find primes?

I think the article explains it rather well, but I'll try my hand at it as well.

Input is in unary form. 1 is 1, 2 is 11, 3 is 111, etc. Zero is an empty string.

The first part of the regex matches 0 and 1 as non-prime. The second is where the magic kicks in.

(11+?) starts by finding divisors. It starts by being defined as 11, or 2. \1 is a variable referring to that previously captured match, so \1+ determines if the number is divisible by that divisor. (111111 starts by assigning the variable to 11, and then determines that the remaining 1111 is 11 repeated, so 6 is divisible by 2.)

If the number is not divisible by two, the regex engine increments the divisor. (11+?) becomes 111, and we try again. If at any point the regex matches, the number has a divisor that yields no remainder, and so the number cannot be prime.

How to determine if a number is a prime with regex?

You said you understand this part, but just to emphasize, the String generated has a length equal to the number supplied. So the string has three characters if and only if n == 3.

.?

The first part of the regex says, "any character, zero or one times". So basically, is there zero or one character-- or, per what I mentioned above, n == 0 || n == 1. If we have the match, then return the negation of that. This corresponds with the fact that zero and one are NOT prime.

(..+?)\\1+

The second part of the regex is a little trickier, relying on groups and backreferences. A group is anything in parentheses, which will then be captured and stored by the regex engine for later use. A backreference is a matched group that is used later on in the same regex.

The group captures 1 character, then 1 or more of any character. (The + character means one or more, but ONLY of the previous character or group. So this is not "two or four or six etc. characters", but rather "two or three etc." The +? is like +, but it tries to match as few characters as possible. + normally tries to gobble the whole string if it can, which is bad in this case because it prevents the backreference part from working.)

The next part is the backreference: That same set of characters (two or more), appearing again. Said backreference appears one or more times.

So. The captured group corresponds to a natural number of characters (from 2 onward) captured. Said group then appears some natural number of times (also from 2 onward). If there IS a match, this implies that it's possible to find a product of two numbers greater than or equal to 2 that match the n-length string... meaning you have a composite n. So again, return the negation of the successful match: n is NOT prime.

If no match can be found, then you can't come up with a your product of two natural numbers greater than or equal to 2... and you have both a non-match and a prime, hence again the returning of the negation of the match result.

Do you see it now? It's unbelievably tricky (and computationally expensive!) but it's also kind of simple at the same time, once you get it. :-)

I can elaborate if you have further questions, like on how regex parsing actually works. But I'm trying to keep this answer simple for now (or as simple as it can ever be).

Regex to match string with prime length?

^(?!(xx+)\1+$)

works by performing a negative lookahead at the start of the line. It will reject any line that consists of

  • a number of 2 or more xes
  • followed by the same number of xes, possibly multiple times

In other words, the regex works by matching any number of xes that can not be divided in smaller, equally sized groups of size >= 2.

To exclude the case of just a single x, you could use ^(?!(xx+)\1+$|x$).

How does this regular expression work?

The first ? is for matching the empty string (i.e. 0) as non-prime. If you don't care whether the regexp matches 0, then it's not necessary.

The second ? is only for efficiency. + is normally "greedy", which means it matches as many characters as are available, and then backtracks if the rest of the regexp fails to match. The +? makes it non-greedy, so it matches only 1 character, and then tries matching more if the rest of the regexp fails to match. (See the Quantifiers section of perlre for more about greedy vs. non-greedy matching.)

In this particular regexp, the (11+?) means it tests divisibility by 2 ('11'), then 3 ('111'), then 4, etc. If you used (11+), it would test divisibility by N (the number itself), then N-1, then N-2, etc. Since a divisor must be no greater than N/2, without the ? it would waste time testing a lot of "potential" divisors that can't possibly work. It would still match non-prime numbers, just more slowly. (Also, $1 would be the largest divisor instead of the smallest one.)

regex - What is the complexity of this regular expression for primes detect?

From empirical data it appears to be O(n2).

I ran the Ruby code on every 100th of the first 10000 primes. Here are the results:

Graph showing time taken to check if a number is prime.

The blue dots are the recorded times and the orange line is y = 2.9e-9 * x^2. The line fits the data perfectly, indicating that the complexity is O(n2).

This is to be expected since the regular expression checks all possible divisors to see if any of them occurs a whole number of times in the string.

How does the regex expression that checks for prime numbers work?

The basic premise is that this regular expression examines a ones representation of the number (e.g. 5 = 11111). By checking for the presence of ones (1) in certain positions or groupings it can identify the number as prime.

Additional References:

  • Credit where credit is due -
    http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
  • Great explanation - http://www.noulakaz.net/weblog/2007/03/18/a-regular-expression-to-check-for-prime-numbers/

Can a Perl/Java/etc regular expression be written to match decimal (non-)prime numbers?

NOTE: These Regexes where written in for PHP and use possessive quantifiers which are used in many but not all languages, for example java-script does not support them. Also this is very inefficient and will quickly become infeasible.

EDIT: here it is for base 10 \b(((\d)(?=[\d\s]*(\4{0,10}(n(?=.*n\3)|nn(?=.*1\3)|n{3}(?=.*2\3)|n{4}(?=.*3\3)|n{5}(?=.*4\3)|n{6}(?=.*5\3)|n{7}(?=.*6\3)|n{8}(?=.*7\3)|n{9}(?=.*8\3))?)))+)(?![\d\s]*(n(?=\4))++(..?1|(...*)\8+1)) I have used base 2 after this to make things easier.

EDIT: this one will allow you to pass in a string containing several binary numbers and matches those that are prime \b(((\d)(?=[\d\s]*(\4{0,2}n(?=.*\3)|\4{0,2})))+)(?![\d\s]*(n(?=\4))++(..?1|(...*)\7+1))
It basically does this by using boundary \b instead of start of string ^, it allows any number of decimals and spaces when moving forward to the ns and wraps the whole of the portion that tests the base 1 representations in a negative look-ahead. Apart from that it work in the same way as the one below.
As an example 1111 1011 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn1
will match 1011.

I have managed to get something I think works (checked to 25) and it matches non-primes. Here it is for base 2 (easier to explain) ^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+\s(n(?=\3))*+\K(..?1|(..+?)\6+1)this can be expanded up to base n, but this expands the regex very quickly. To get this regex to work I need a couple of perquisites (A bit hacky), the input string must be the number followed by a space followed by at least as many n characters as the value you of your number (if you had the number 10 you need at leat 10 ns after it) followed by the digits of your base in order excluding your 0 digit (e.g for base 10 123456789), not including your 0. For example 11 nnnnnnnnnnnnnn1. This is due to the fact that the regexes have no accessible storage so I need to use capturing groups to do this. Finally this regex uses /x to ignore whitespaces in the expression, strip out all the space if you don't want to use this.

I will now explain how this works in 3 steps.
This regex works in 3 parts:

Part 1:this part changes a base n > 1 to base 1 as a capturing group of ns

This is the part ^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+ it works very similarly to the a^nb^n example in the question. The ^ at the front means that the full match has to start at the beginning this is important for later.
The main structure of this code is ^((\d)(?= \d*\s (suff)))+ This takes each decimal between the start and the first space and performs a positive look-ahead using (\d)(?=) where \d is a decimal and (?=) is a look-ahead the \d is in a capturing group () for later on. It is the digit we are currently looking at.

The inside of the look-ahead is not actually to check a condition ahead but instead to build up a capturing group representing our number in base 1. The inside of the capturing group looks like this

\d*\s(\3{0,2}n(?=.*\2)|\3{0,2})) 

The part \d*\s basically moves the characters we are looking at past the rest of the remaining digits \d* (\d is digit and * is 0 to n as many times as possible) this now leaves us looking at the start of the ns.

(\3{0,2}n(?=.*\2)|\3{0,2}))

is a self referencing capturing group here is where the need for the digits you have put at the end comes in. This group matches itself 0 to 2 times but as many times as possible (using \3{0,2} with \3 meaning caturing group 3 and {0,2} meaning match from 0 to 2 times) this means that if there is a number before the current digit its base 1 representation is multiplied by 2. This would be 10 for base 10 or 16 for base 16. If this is the first digit the group will be undefined so it will match 0 times. It then either adds a single n or no n based on matching the digit we are currently working on (using its capturing group). It does this by using a positive look ahead to look to the end of the input where we put the digits, n(?=.*\2) this matches n if it can find anything followed by the digit we are working on. This allows it to identify what digit it is we are working on at this point. (I would have used a look behind but they are fixed length)
If you had base 3 and wanted to check if the digit you are currently working on is 2 you would use nn(?=.*1\2) this would match nn only if the digit was two. We have used an or operator '|' for all of these and if no digit is found we assume it is 0 and add no ns. As this is in the capturing group this matching is then saved in the group.

In summary of this part what we do is take each digit look ahead take the base 1 representation of the previous digits (saved in the capturing group )and multiply it by the base then match it, then add to it the base one representation of the digit and save it in the group. If you do this for each digit in turn you will get a base one representation of the number. Lets look at and example. 101 nnnnnnnnnnnnnnnnn1

First it goes to the sat because of ^. so :101 nnnnnnnnnnnnnnnnn1

Then it goes to the first digit and saves it in a capturing grope 101 nnnnnnnnnnnnnnnnn1

Group2 : 1

It uses a look-ahead using \d*\s to go past all the digits and the first space. 101 nnnnnnnnnnnnnnnnn1

It is now inside capturing group 3

It takes this caputing group's previous value and matches it 0 to 2 times

As it is undefined it matches 0 time.

It now looks ahead again to try to find a digit matching the digit in capturing group 2 101 nnnnnnnnnnnnnnnnn1

as it has been found it matches 1 n in capturing group 3 2 101 nnnnnnnnnnnnnnnnn1

It now leaves group 3, updating its value and leaves the look ahead
Group 3 = n

It now looks at the next digit and saves that in a capturing group 101 nnnnnnnnnnnnnnnnn1

group 2 = 0

group 3 = n

It then also uses a look-ahead and goes to the first n 101 nnnnnnnnnnnnnnnnn1

It then matches group 3 0 to 2 time but as many as possible so n 101 nnnnnnnnnnnnnnnnn1

It then uses a look-ahead to try to match the digit in group 2 which it can do so it adds no ns, befor returning out of group 3 and the look-ahead

group3 = nn

It now looks at the next digit and saves it in group 2 101 nnnnnnnnnnnnnnnnn1
Using a look-ahead it goes to the start of the ns and matches 2 times group 3 101 nnnnnnnnnnnnnnnnn1
It then uses a look-ahead to try to match the digit in group 2 it finds it so matches a n and returns out of group 3 and the look-ahead.
group3 = nnnnn
Group 3 now contains the base 1 representation of our number.

Part 2 Reduces the ns to the size of the base 1 representation of your number

\s(n(?=\3))*+\K

This matches the space and then matches ns for as long as you can match group 3 (the base one representation of your number) in front. It does this by matching n as many times as possible using a *+ which is possessive (it never lets go of a matching this is to stop the matching from being shrunk later to make a match work) n has a posive look-ahead n(?=\3) which means n will be matched as long as there is a group 3 ahead of it (\3 gives capturing group 3). This leaves us with our base 1 representation and digits being the only thing left unmatched. We then us \K to say start the matching again from here.

Part3
We now use the same algorithm mentioned in the question to get primes apart from we force it not to match between the start of the base on representation and the start of the digits. You can read how that works Here How to determine if a number is a prime with regex?

Finally to make this into a base n regex you have to do a few things

 ^((\d)(?= \d*\s(\3{0,2}n(?=.*\2)|\3{0,2})))+\s(n(?=\3))*+\K(..?1|(..+?)\6+1)

fist add some more digits at the end of your input string then change the n

?=.*\2 to  n?=.*n\2 |  n?=.*1\2   n?=.*3\2 ..,  n?=.***n**\2

Finally change the \3{0,2} to \3{0,n}. where n is the base. Also remember that this will not work without the correct input string.

Arcane isPrime method in Java

First, note that this regex applies to numbers represented in a unary counting system, i.e.

1       is 1
11 is 2
111 is 3
1111 is 4
11111 is 5
111111 is 6
1111111 is 7

and so on. Really, any character can be used (hence the .s in the expression), but I'll use "1".

Second, note that this regex matches composite (non-prime) numbers; thus negation detects primality.


Explanation:

The first half of the expression,

.?

says that the strings "" (0) and "1" (1) are matches, i.e. not prime (by definition, though arguable.)

The second half, in simple English, says:

Match the shortest string whose length is at least 2, for example, "11" (2). Now, see if we can match the entire string by repeating it. Does "1111" (4) match? Does "111111" (6) match? Does "11111111" (8) match? And so on. If not, then try it again for the next shortest string, "111" (3). Etc.

You can now see how, if the original string can't be matched as a multiple of its substrings, then by definition, it's prime!

BTW, the non-greedy operator ? is what makes the "algorithm" start from the shortest and count up.


Efficiency:

It's interesting, but certainly not efficient, by various arguments, some of which I'll consolidate below:

  • As @TeddHopp notes, the well-known sieve-of-Eratosthenes approach would not bother to check multiples of integers such as 4, 6, and 9, having been "visited" already while checking multiples of 2 and 3. Alas, this regex approach exhaustively checks every smaller integer.

  • As @PetarMinchev notes, we can "short-circuit" the multiples-checking scheme once we reach the square root of the number. We should be able to because a factor greater than the square root must partner with a factor lesser than the square root (since otherwise two factors greater than the square root would produce a product greater than the number), and if this greater factor exists, then we should have already encountered (and thus, matched) the lesser factor.

  • As @Jesper and @Brian note with concision, from a non-algorithmic perspective, consider how a regular expression would begin by allocating memory to store the string, e.g. char[9000] for 9000. Well, that was easy, wasn't it? ;)

  • As @Foon notes, there exist probabilistic methods which may be more efficient for larger numbers, though they may not always be correct (turning up pseudoprimes instead). But also there are deterministic tests that are 100% accurate and far more efficient than sieve-based methods. Wolfram's has a nice summary.



Related Topics



Leave a reply



Submit