How Does This Pcre Pattern Detect Palindromes

How does this PCRE pattern detect palindromes?

Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:

^(.)(.)(.) ... \3\2\1$

we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:

^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...

but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...

which could be converted into

^(?: 
(.)(?=.*(\1\2)$)
)*

Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

so our expression becomes

^(?: 
(.)(?=.*(\1(?(2)\2|))$)
)*

Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.

^(?: 
(.)(?=.*(\1(?(2)\2|))$)
)*\2$

We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.

^(?: 
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$

This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So

^(?: 
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.

Why will this recursive regex only match when a character repeats 2^n - 1 times?

The phenomenon you're observing is due to the fact that PCRE subpattern recursion is atomic, unlike Perl. The man page actually covers this problem in great detail:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure
.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):

    ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the pattern is longer than three characters
.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alternative
is taken and the recursion kicks in. The recursive call to subpattern 1
successfully matches the next character ("b"). (Note that the beginning
and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:

    ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the
pattern to this:

    ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alternatives
at the higher level:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$



WARNING!!!


The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba" is correctly matched, if
the subject is "ababa", PCRE finds the palindrome "aba" at the start,
then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alternatives,
so the entire match fails.

Additional references

  • regular-expressions.info/Atomic grouping

    • (?>…) in some flavor is atomic grouping syntax
    • Lookarounds (?=…), (?!…), (?<=…), (?<!…), are all atomic
    • Possessive quantifier (e.g. a*+) is also atomic
    • PCRE recursive subpattern and subroutine calls are also atomic

A closer look at the pattern

The atomicity argument is correct, but perhaps it's not obvious how it explains why the pattern behaves as observed. Let's take a closer look and see how this all fits:

We will use the first pattern:

^(([a-z])(?1)\2|[a-z]?)$

I will use the following notation to denote the recursion:

  • 1 means the character was captured into group 2 in the first alternate
  • 2 means the character was matched by the second alternate

    • Or if the 2 is not above a character, the zero repetition option of ? is exercised
  • \ means the character was matched by the backreference to group 2 in first alternate
  • _ denotes the bottom of a recursive branch

    • This branch will NOT be reentered even if there are other alternatives!

Now let's consider "aaa" as input:

      _
1 1 1 2
a a a # This is the first bottom of the recursion,
# now we go back to the third 1 and try to match \.
# This fails, so the third 1 becomes 2.
_
1 1 2
a a a # Now we go back to the second 1 and try to match \.
# This fails, so the second 1 becomes 2.
_
1 2
a a a # The second level matched! now we go back to the first level...

_____
1 2 \
a a a # Now the first 1 can match \, and entire pattern matches!!

Now consider "aaaaa":

          _
1 1 1 1 1 2
a a a a a # Fifth 1 can't match \, so it becomes 2.
_
1 1 1 1 2
a a a a a # Fourth 1 can't match \, so it becomes 2.
_____
1 1 1 2 /
a a a a a # Here's a crucial point. The third 1 successfully matched.
# Now we're back to the second 1 and try to match \, but this fails.
# However, since PCRE recursion is atomic, the third 1 will NOT be
# reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a # Anchors don't match, so the first 1 becomes 2, and then also the
# anchors don't match, so the pattern fails to match.

Note that once a recursion level matches on the first alternative, the second alternative will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now consider "aa":

    _
1 1 2
a a
_
1 2
a a # The second level matched by taking the one repetition option on ?.
# We now go back to the first level, and we can't match \.
# Since PCRE recursion is atomic, we can't go back to the second level
# to try the zero repetition option on ?.
_
2
a a # Anchors don't match, trying zero option on ? also doesn't help,
# so the pattern fails to match!

Note that once a recursion level matches on the one repetition of the ? on the second alternative, the zero repetition option will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now let's consider aaaaaaa

              _
1 1 1 1 1 1 1 2
a a a a a a a
_
1 1 1 1 1 1 2
a a a a a a a
_____
1 1 1 1 1 2 \
a a a a a a a # A crucial point: the fifth level matched and now the fourth
# level can't match \, but it does NOT reenter the fifth level to
# try 2. Instead, the fourth level tries 2.
_____
1 1 1 2 \
a a a a a a a
_________
1 1 1 2 \ \
a a a a a a a
_____________
1 1 1 2 \ \ \
a a a a a a a # Entire pattern is a match!

Note that even though PCRE subpattern recursion is atomic, it can still successfully match a palindrome consisting of a character repeating 2n-1 times.


Now, just for fun, let's try "abcba":

          _
1 1 1 1 1 2
a b c b a
_
1 1 1 1 2
a b c b a

1 1 1 2
a b c b a # Third level attempts \, but c does not match a!
# So we go back to third 1 and try 2.
_____
1 1 2 \
a b c b a
_________
1 1 2 \ \
a b c b a # Entire pattern is a match!

That is, the pattern doesn't just match "only when a character repeats 2n-1 times". It can indeed match "abcba" (as seen on ideone.com). It can NOT, however, match "ababa", nor can it match "aaaaa" (see the WARNING on the man page!), because subpattern recursion in PCRE is atomic.

You can apply this same tracing process to explain the behavior of the pattern on any input.

How to check that a string is a palindrome using regular expressions?

The answer to this question is that "it is impossible". More specifically, the interviewer is wondering if you paid attention in your computational theory class.

In your computational theory class you learned about finite state machines. A finite state machine is composed of nodes and edges. Each edge is annotated with a letter from a finite alphabet. One or more nodes are special "accepting" nodes and one node is the "start" node. As each letter is read from a given word we traverse the given edge in the machine. If we end up in an accepting state then we say that the machine "accepts" that word.

A regular expression can always be translated into an equivalent finite state machine. That is, one that accepts and rejects the same words as the regular expression (in the real world, some regexp languages allow for arbitrary functions, these don't count).

It is impossible to build a finite state machine that accepts all palindromes. The proof relies on the facts that we can easily build a string that requires an arbitrarily large number of nodes, namely the string

a^x b a^x (eg., aba, aabaa, aaabaaa, aaaabaaaa, ....)

where a^x is a repeated x times. This requires at least x nodes because, after seeing the 'b' we have to count back x times to make sure it is a palindrome.

Finally, getting back to the original question, you could tell the interviewer that you can write a regular expression that accepts all palindromes that are smaller than some finite fixed length. If there is ever a real-world application that requires identifying palindromes then it will almost certainly not include arbitrarily long ones, thus this answer would show that you can differentiate theoretical impossibilities from real-world applications. Still, the actual regexp would be quite long, much longer than equivalent 4-line program (easy exercise for the reader: write a program that identifies palindromes).

Regular Expression for finding palindromes behaving strange

Your regular expression cannot match overlapping regions (you'd need to play with look-arounds with capturing groups to do that).

The expression matches the first three x characters first; it matches:

  • one character (group 1), zero spaces (group 2), an optional character (the ? is greedy), the zero spaces from group 2, the one character from group 1.

The second match then has to start after that; the two xx characters match because the [a-z]? pattern is optional.

You cannot create a regular expression to match palindromes in general (at least not with the Python re engine), as there is no facility to match an arbitrary-width previous group in reverse.

BASH Palindrome Checker

Like this:

for word in `grep -E '^[a-z]{3,45}$' /usr/share/dict/words`;
do [ $word == `echo $word | rev` ] && echo $word;
done;

Output using my dictionary:

aha
bib
bob
boob
...
wow

Update

As pointed out in the comments, reading in most of the dictionary into a variable in the for loop might not be the most efficient, and risks triggering errors in some shells. Here's an updated version:

grep -E '^[a-z]{3,45}$' /usr/share/dict/words | while read -r word;
do [ $word == `echo $word | rev` ] && echo $word;
done;

How does this pattern match hyphen without escape?

Let's break it down:

[]-a-z]
^^ ^
|| +---- 3
|+------ 2
+------- 1

1 is a literal ] since it appears at the start of the pattern, and [] is an invalid character class in PCRE.

The 2 hyphen is therefore the second character in the class, and introduces a range, between ] and a.

The next hyphen, 3, is treated literally, because the previous token, a is the end of the previous range. Another range cannot be introduced at this point. In PCRE, a - is treated literally if it's in a place where a range cannot be introduced or if it's escaped. We usually place literal hyphens at the start or the end of the range to make it obvious, but this is not required.

Then, z is a simple literal.

PCRE follows the Perl syntax. This is documented like so:

About ]:

A ] is normally either the end of a POSIX character class (see POSIX Character Classes below), or it signals the end of the bracketed character class. If you want to include a ] in the set of characters, you must generally escape it.

However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.

About hyphens:

If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash.

Note that this refers to Perl syntax. Other flavors may have different behavior. For instance, [] is a valid (empty) character class in JavaScript that cannot match anything.

The catch is that, depending on the options, PCRE could also interpret this in the JS way (there's a couple of JS compatibility flags). From the PCRE2 docs:

An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special by default. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash. This means that, by default, an empty class cannot be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at the start does end the (empty) class.

The documented PCRE behavior about the hyphen is, unsurprisingly, matching the Perl behavior:

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class, or immediately after a range. For example, [b-d-z] matches letters in the range b to d, a hyphen character, or z.



Related Topics



Leave a reply



Submit