Problem with Quantifiers and Look-Behind

Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')

Many regular expression libraries do only allow strict expressions to be used in look behind assertions like:

  • only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
  • only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
  • only match strings with a upper bound length: (?<=\s{,4}) (up to four repetitions)

The reason for these limitations are mainly because those libraries can’t process regular expressions backwards at all or only a limited subset.

Another reason could be to avoid authors to build too complex regular expressions that are heavy to process as they have a so called pathological behavior (see also ReDoS).

See also section about limitations of look-behind assertions on Regular-Expressions.info.

Problem with quantifiers and look-behind

The issue is that Ruby doesn't support variable-length lookbehinds. Quantifiers aren't out per se, but they can't cause the length of the lookbehind to be nondeterministic.

Perl has the same restriction, as does just about every major language featuring regexes.

Try using the straightforward match (\w*)\W*?o instead of the lookbehind.

Java regex lookbehind issue with quantifiers

You may leverage a feature in a Java regex engine that is called "constrained width lookbehind":

Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.

That means, you may replace the {2} limiting quantifier with a limiting quantifier with both minimum and maximum values, e.g. {0,100} to allow zero to a hundred whitespace symbols. Adjust them as you see fit.

Besides, you needn't use a tempered greedy token (?:(?![\n]).)* as the dot in Java regex does not match a newline. Just replace it with .* to match any zero or more chars other than newline. So, your pattern might look as simple as (?i)(?<=survey\s{0,100}:).*.

Can we use quantifiers inside a lookbehind expression?

After looking at the Pattern code and trying to trace it, I'm convinced that this is just a bug. Both examples should result in an exception. But the logic that tests for this is incorrect.

This code appears in a couple places:

 temp = info.maxLength * cmax + maxL;
info.maxLength = temp;
if (temp < maxL) {
info.maxValid = false;
}

Note that as long as maxLength and cmax are nonnegative, temp should never be less than maxL unless overflow has occurred. maxValid is eventually used by the lookbehind code; if maxValid is false, "look-behind group does not have obvious maximum length error" is thrown.

From what I can tell, in a regex like <prefix> <expression>{m,n}, in the above code, info.maxLength is the maximum length of "expression", cmax is the upper bound of the quantifier, and maxL is the maximum length of "prefix". When a quantifier is * or +, the upper bound is set to Integer.MAX_VALUE. (All the variables here are int.) This means that there will be overflow unless info.maxLength is 1 and maxL is 0. That's exactly the case with

(?<=a*)bc

since the pattern with the quantifier has length 1, and there is nothing before the a* which means maxL will be 0. That's why this case falls through the cracks.

For any other values, the computation will overflow, but that doesn't necessarily mean temp < maxL will be true. If info.maxLength is even, then an exception will be thrown; but if info.maxLength is odd, the pattern will compile if maxL is small enough. This is because of the way wraparound works, mathematically; the code that attempts to check for overflow is quite faulty. This means that

(?<=a*)bc         // succeeds
(?<=(ab)*)bc // throws exception
(?<=(abc)*)bc // succeeds
(?<=(abcd)*)bc // throws exception
(?<=(abcde)*)bc // succeeds

Also:

(?<=xa*)bc        // throws exception
(?<=x(abc)*)bc // succeeds

Note: It should be noted that in your example regex, lookbehind is useless:

(?<=a*)bc

The lookbehind says to test whether the current location is preceded by zero or more occurrences of the letter a. This is always true, trivially. So the lookbehind serves no purpose here. Similarly,

(?<=a+)bc

is equivalent to

(?<=a)bc

since as long as there's one a preceding the current location, it doesn't matter how many more there might be. The same is true of all the examples that don't throw an exception, except for this one:

(?<=x(abc)*)bc    // succeeds

since here the matcher does have to go backwards looking for abc's in the string and making sure the last one is preceded by x. It seems that Pattern should throw an exception in this case, but because of the faulty overflow-check logic, it isn't. Therefore, I'm uncertain whether it would actually return the correct result, or whether it might crash or go into an infinite loop. However, I haven't tried it.

Really, the code should just check directly whether cmax is equal to Integer.MAX_VALUE, instead of using it in computation and hoping the code can tell later that the result is bogus.

Why can't you use repetition quantifiers in zero-width look behind assertions?

The ultimate answer to such a question is in the engine's code, and at the bottom of the answer you'll be able to dive into the section of the PCRE engine's code responsible for ensuring fixed-length in lookbehinds—if you're interested in knowing the finest details. In the meantime, let's gradually zoom into the question from higher levels.

Variable-Width Lookbehind vs. Infinite-Width Lookbehind

First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:

  • the engine knows that the width of what precedes must be within 5 to ten characters (not supported in PCRE)
  • the engine knows that the width of what precedes must be either 5 or ten character (supported in PCRE)

In contrast, in infinite-width lookbehind, you can use quantified tokens such as a+

Engines that Support Infinite-Width Lookbehind

For the record, these engines support infinite lookbehind:

  • .NET (C#, VB.NET etc.)
  • Matthew Barnett's regex module for Python
  • JGSoft (EditPad etc.; not available in a programming language).

As far as I know, they are the only ones.

Variable Lookbehind in PCRE

In PCRE, the most relevant section in the documentation is this:

The contents of a lookbehind assertion are restricted such that all
the strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length.

Therefore, the following lookbehind is valid:

(?<=a |big )cat

However, none of these are:

  • (?<=a\s?|big )cat (the sides of the alternation do not have a fixed width)
  • (?<=@{1,10})cat (variable width)
  • (?<=\R)cat (\R does not have a fixed-width as it can match \n, \r\n, etc.)
  • (?<=\X)cat (\X does not have a fixed-width as a Unicode grapheme cluster can contain a variable number of bytes.)
  • (?<=a+)cat (clearly not fixed)

Lookbehind with Zero-Width Match but Infinite Repetition

Now consider this:

(?<=(?=@+))(cat#+)

On the face of it, this is a fixed-width lookbehind, because it can only ever find a zero-width match (defined by the lookahead (?=@++)). Is that a trick to get around the infinite lookbehind limitation?

No. PCRE will choke on this. Even though the content of the lookbehind is zero-width, PCRE will not allow infinite repetition in the lookbehind. Anywhere. When the documentation says all the strings it matches must have a fixed length, it should really be:

All the strings that any of its components matches must have a fixed
length.

Workarounds: Life without Infinite Lookbehind

In PCRE, the two main solutions to problems where infinite lookbehinds would help are \K and capture Groups.

Workaround #1: \K

The \K assertion tells the engine to drop what was matched so far from the final match it returns.

Suppose you want (?<=@+)cat#+, which is not legal in PCRE. Instead, you can use:

@+\Kcat#+

Workaround #2: Capture Groups

Another way to proceed is to match whatever you would have placed in a lookbehind, and to capture the content of interest in a capture group. You then retrieve the match from the capture group.

For instance, instead of the illegal (?<=@+)cat#+, you would use:

@+(cat#+)

In R, this could look like this:

matches <- regexpr("@+(cat#+)", subject, perl=TRUE);
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
regmatches(subject, result)

In languages that don't support \K, this is often the only solution.

Engine Internals: What Does the PCRE Code Say?

The ultimate answer is to be found in pcre_compile.c. If you examine the code block that starts with this comment:

If lookbehind, check that this branch matches a fixed-length string

You find that the grunt work is done by the find_fixedlength() function.

I reproduce it here for anyone who would like to dive into further details.

static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
int length = -1;

register int branchlength = 0;
register pcre_uchar *cc = code + 1 + LINK_SIZE;

/* Scan along the opcodes for this branch. If we get to the end of the
branch, check the length against that of the other branches. */

for (;;)
{
int d;
pcre_uchar *ce, *cs;
register pcre_uchar op = *cc;

switch (op)
{
/* We only need to continue for OP_CBRA (normal capturing bracket) and
OP_BRA (normal non-capturing bracket) because the other variants of these
opcodes are all concerned with unlimited repeated groups, which of course
are not of fixed length. */

case OP_CBRA:
case OP_BRA:
case OP_ONCE:
case OP_ONCE_NC:
case OP_COND:
d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
if (d < 0) return d;
branchlength += d;
do cc += GET(cc, 1); while (*cc == OP_ALT);
cc += 1 + LINK_SIZE;
break;

/* Reached end of a branch; if it's a ket it is the end of a nested call.
If it's ALT it is an alternation in a nested call. An ACCEPT is effectively
an ALT. If it is END it's the end of the outer call. All can be handled by
the same code. Note that we must not include the OP_KETRxxx opcodes here,
because they all imply an unlimited repeat. */

case OP_ALT:
case OP_KET:
case OP_END:
case OP_ACCEPT:
case OP_ASSERT_ACCEPT:
if (length < 0) length = branchlength;
else if (length != branchlength) return -1;
if (*cc != OP_ALT) return length;
cc += 1 + LINK_SIZE;
branchlength = 0;
break;

/* A true recursion implies not fixed length, but a subroutine call may
be OK. If the subroutine is a forward reference, we can't deal with
it until the end of the pattern, so return -3. */

case OP_RECURSE:
if (!atend) return -3;
cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1); /* Start subpattern */
do ce += GET(ce, 1); while (*ce == OP_ALT); /* End subpattern */
if (cc > cs && cc < ce) return -1; /* Recursion */
d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
if (d < 0) return d;
branchlength += d;
cc += 1 + LINK_SIZE;
break;

/* Skip over assertive subpatterns */

case OP_ASSERT:
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
do cc += GET(cc, 1); while (*cc == OP_ALT);
cc += PRIV(OP_lengths)[*cc];
break;

/* Skip over things that don't match chars */

case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
cc += cc[1] + PRIV(OP_lengths)[*cc];
break;

case OP_CALLOUT:
case OP_CIRC:
case OP_CIRCM:
case OP_CLOSE:
case OP_COMMIT:
case OP_CREF:
case OP_DEF:
case OP_DNCREF:
case OP_DNRREF:
case OP_DOLL:
case OP_DOLLM:
case OP_EOD:
case OP_EODN:
case OP_FAIL:
case OP_NOT_WORD_BOUNDARY:
case OP_PRUNE:
case OP_REVERSE:
case OP_RREF:
case OP_SET_SOM:
case OP_SKIP:
case OP_SOD:
case OP_SOM:
case OP_THEN:
case OP_WORD_BOUNDARY:
cc += PRIV(OP_lengths)[*cc];
break;

/* Handle literal characters */

case OP_CHAR:
case OP_CHARI:
case OP_NOT:
case OP_NOTI:
branchlength++;
cc += 2;
#ifdef SUPPORT_UTF
if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
break;

/* Handle exact repetitions. The count is already in characters, but we
need to skip over a multibyte character in UTF8 mode. */

case OP_EXACT:
case OP_EXACTI:
case OP_NOTEXACT:
case OP_NOTEXACTI:
branchlength += (int)GET2(cc,1);
cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
break;

case OP_TYPEEXACT:
branchlength += GET2(cc,1);
if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
cc += 2;
cc += 1 + IMM2_SIZE + 1;
break;

/* Handle single-char matchers */

case OP_PROP:
case OP_NOTPROP:
cc += 2;
/* Fall through */

case OP_HSPACE:
case OP_VSPACE:
case OP_NOT_HSPACE:
case OP_NOT_VSPACE:
case OP_NOT_DIGIT:
case OP_DIGIT:
case OP_NOT_WHITESPACE:
case OP_WHITESPACE:
case OP_NOT_WORDCHAR:
case OP_WORDCHAR:
case OP_ANY:
case OP_ALLANY:
branchlength++;
cc++;
break;

/* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
otherwise \C is coded as OP_ALLANY. */

case OP_ANYBYTE:
return -2;

/* Check a class for variable quantification */

case OP_CLASS:
case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
case OP_XCLASS:
/* The original code caused an unsigned overflow in 64 bit systems,
so now we use a conditional statement. */
if (op == OP_XCLASS)
cc += GET(cc, 1);
else
cc += PRIV(OP_lengths)[OP_CLASS];
#else
cc += PRIV(OP_lengths)[OP_CLASS];
#endif

switch (*cc)
{
case OP_CRSTAR:
case OP_CRMINSTAR:
case OP_CRPLUS:
case OP_CRMINPLUS:
case OP_CRQUERY:
case OP_CRMINQUERY:
case OP_CRPOSSTAR:
case OP_CRPOSPLUS:
case OP_CRPOSQUERY:
return -1;

case OP_CRRANGE:
case OP_CRMINRANGE:
case OP_CRPOSRANGE:
if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
branchlength += (int)GET2(cc,1);
cc += 1 + 2 * IMM2_SIZE;
break;

default:
branchlength++;
}
break;

/* Anything else is variable length */

case OP_ANYNL:
case OP_BRAMINZERO:
case OP_BRAPOS:
case OP_BRAPOSZERO:
case OP_BRAZERO:
case OP_CBRAPOS:
case OP_EXTUNI:
case OP_KETRMAX:
case OP_KETRMIN:
case OP_KETRPOS:
case OP_MINPLUS:
case OP_MINPLUSI:
case OP_MINQUERY:
case OP_MINQUERYI:
case OP_MINSTAR:
case OP_MINSTARI:
case OP_MINUPTO:
case OP_MINUPTOI:
case OP_NOTMINPLUS:
case OP_NOTMINPLUSI:
case OP_NOTMINQUERY:
case OP_NOTMINQUERYI:
case OP_NOTMINSTAR:
case OP_NOTMINSTARI:
case OP_NOTMINUPTO:
case OP_NOTMINUPTOI:
case OP_NOTPLUS:
case OP_NOTPLUSI:
case OP_NOTPOSPLUS:
case OP_NOTPOSPLUSI:
case OP_NOTPOSQUERY:
case OP_NOTPOSQUERYI:
case OP_NOTPOSSTAR:
case OP_NOTPOSSTARI:
case OP_NOTPOSUPTO:
case OP_NOTPOSUPTOI:
case OP_NOTQUERY:
case OP_NOTQUERYI:
case OP_NOTSTAR:
case OP_NOTSTARI:
case OP_NOTUPTO:
case OP_NOTUPTOI:
case OP_PLUS:
case OP_PLUSI:
case OP_POSPLUS:
case OP_POSPLUSI:
case OP_POSQUERY:
case OP_POSQUERYI:
case OP_POSSTAR:
case OP_POSSTARI:
case OP_POSUPTO:
case OP_POSUPTOI:
case OP_QUERY:
case OP_QUERYI:
case OP_REF:
case OP_REFI:
case OP_DNREF:
case OP_DNREFI:
case OP_SBRA:
case OP_SBRAPOS:
case OP_SCBRA:
case OP_SCBRAPOS:
case OP_SCOND:
case OP_SKIPZERO:
case OP_STAR:
case OP_STARI:
case OP_TYPEMINPLUS:
case OP_TYPEMINQUERY:
case OP_TYPEMINSTAR:
case OP_TYPEMINUPTO:
case OP_TYPEPLUS:
case OP_TYPEPOSPLUS:
case OP_TYPEPOSQUERY:
case OP_TYPEPOSSTAR:
case OP_TYPEPOSUPTO:
case OP_TYPEQUERY:
case OP_TYPESTAR:
case OP_TYPEUPTO:
case OP_UPTO:
case OP_UPTOI:
return -1;

/* Catch unrecognized opcodes so that when new ones are added they
are not forgotten, as has happened in the past. */

default:
return -4;
}
}
/* Control never gets here */
}

Unlimited quantifiers in a complex lookbehind

Keep in mind that the lookbehind will only look as far behind as it must. For example, (?<=\s+) will be satisfied if the previous character is a space; it doesn't need to look any farther back.

The same is true of your lookbehind. If it's not the beginning of the string and the previous character is not whitespace, an open-parenthesis or a period, there's no point looking any farther back. It's equivalent to this:

(?<=^|[\s(.])

Your lookahead can be condensed in the same way. If it's not the end of the string, and the next character is not whitespace, a close-parenthesis or a period, there's no point looking any further:

(?=[\s).]|$)

So the final regex is:

(?<=^|[\s(.])(?:item|item1|item2)(?=[\s).]|$)

Quantifiers inside a positive look-behind using Python

You don't need a lookbehind assertion. You could match what comes before the number, and capture the value in the named capturing group cat_num

Make the s optional using [cC]ats?

\b[cC]ats? (?P<cat_num>5A)\b

Regex demo

Or a bit broader match:

\b[cC]ats? (?P<cat_num>\d+[A-Z]+)\b

Regex demo

regex lookahead problems with greedy quantifier

Not totally sure of the requirements, this works on your samples.

 # ^(?i)\d{3}(?:[ ](?:([ACERV])[ ]?(?![ACERV ]*\1)){1,3}(?<![ ]))?$

^ # BOL
(?i) # Case insensitive modifier
\d{3} # 3 digits
(?: # Cluster grp, character block (optional)
[ ] # Space, required
(?: # Cluster grp
( [ACERV] ) # (1), Capture single character [ACERV]
[ ]? # [ ], optional
(?! # Negative lookahead
[ACERV ]* # As many [ACERV] or [ ] needed
\1 # to find what is captured in group 1
# Found it, the assertion fails
) # End Negative lookahead
){1,3} # End Cluster grp, gets 1-3 [ACERV] characters
(?<! [ ] ) # No dangling [ ] at end
)? # End Cluster grp, character block (optional)
$ # EOL

update - Adjusted to replace lookbehind.

 # ^(?i)\d{3}(?!.*[ ]$)(?:[ ](?:([ACERV])[ ]?(?![ACERV ]*\1)){1,3})?$

^ # BOL
(?i) # Case insensitive modifier
\d{3} # 3 digits
(?! .* [ ] $ ) # No dangling [ ] at end
(?: # Cluster grp, character block (optional)
[ ] # Space, required
(?: # Cluster grp
( [ACERV] ) # (1), Capture single character [ACERV]
[ ]? # [ ], optional
(?! # Negative lookahead
[ACERV ]* # As many [ACERV] or [ ] needed
\1 # to find what is captured in group 1
# Found it, the assertion fails
) # End Negative lookahead
){1,3} # End Cluster grp, gets 1-3 [ACERV] characters
)? # End Cluster grp, character block (optional)
$ # EOL


Related Topics



Leave a reply



Submit