Variable-Length Lookbehind-Assertion Alternatives For Regular Expressions

Most of the time, you can avoid variable-length lookbehinds by using \K.

s/(?<=foo.*)bar/moo/s;

would be

s/foo.*\Kbar/moo/s;

Anything up to the last \K encountered is not considered part of the match (e.g. for the purposes of replacement, $&, etc.).

Negative lookbehinds are a little trickier.

s/(?<!foo.*)bar/moo/s;

would be

s/^(?:(?!foo).)*\Kbar/moo/s;

because (?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.


If you're just matching, you might not even need the \K.

/foo.*bar/s

/^(?:(?!foo).)*bar/s
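Not every engine has \K (Python's built-in re module does not). As a rough sketch of the same two rewrites there, the prefix can be captured and put back in the replacement. The function names below are made up for illustration, and count=1 is an assumption chosen to mirror a single s/// without the g flag.

```python
import re

# Approximates s/foo.*\Kbar/moo/s: group 1 holds what \K would have discarded.
def replace_bar_after_foo(text: str) -> str:
    return re.sub(r'(foo.*)bar', r'\1moo', text, count=1, flags=re.S)

# Approximates s/^(?:(?!foo).)*\Kbar/moo/s: the prefix may not contain "foo".
def replace_bar_not_after_foo(text: str) -> str:
    return re.sub(r'^((?:(?!foo).)*)bar', r'\1moo', text, count=1, flags=re.S)

print(replace_bar_after_foo("bar foo bar bar"))     # bar foo bar moo
print(replace_bar_not_after_foo("xx bar foo bar"))  # xx moo foo bar
print(replace_bar_not_after_foo("foo bar"))         # foo bar (unchanged)
```

Because foo.* is greedy, the positive version replaces the last bar after foo, which is also what the \K form does.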

Variable length look-behind

Use \K as a special case.

It acts as a variable-length positive lookbehind assertion:

/eat_(?:apple|pear|orange)_\Ktoday|yesterday/g

Alternatively, you can list out your lookbehind assertions separately:

/(?:(?<=eat_apple_)|(?<=eat_pear_)|(?<=eat_orange_))today|yesterday/g
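The "list the lookbehinds separately" form also works in engines that only accept fixed-width lookbehinds, such as Python's re, because each assertion has a fixed width on its own. The sketch below uses a made-up sample string. It also groups (?:today|yesterday), on the assumption that both words are meant to require the prefix; as written above, the alternation lets a bare yesterday match too.

```python
import re

# Each lookbehind is fixed-width by itself, so the alternation of
# lookbehinds compiles even though the prefixes differ in length.
PATTERN = re.compile(
    r'(?:(?<=eat_apple_)|(?<=eat_pear_)|(?<=eat_orange_))(?:today|yesterday)'
)

text = "eat_apple_today eat_pear_yesterday eat_kiwi_today yesterday"
matches = PATTERN.findall(text)
print(matches)  # ['today', 'yesterday']
```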

However, I would suggest that it's a rare problem that could use that feature but couldn't be rethought to use a combination of other, more common regex features.

In other words, if you get stuck on a specific problem, feel free to share it here, and I'm sure someone can come up with a different (perhaps better) approach.

Regex Error: A lookbehind assertion has to be fixed width

Use a full string match restart (\K) instead of the invalid variable-length lookbehind.

/^(?:username|Email|Url):? *\K\V+/mi

Make the colon and space optional by trailing them with ? or *.

Use \V+ to match the remaining characters in the line, excluding vertical whitespace (such as \r and \n).

See the broader canonical: Variable-length lookbehind-assertion alternatives for regular expressions


To protect your script from falsely matching values instead of matching labels, notice the use of ^ with the m modifier. This will ensure that you are matching labels that occur at the start of a line.

Without a start of line anchor, Somethingelse: url whoops will match whoops.

To make multiple matches in PHP, the g pattern modifier is not used. Instead, apply the pattern with preg_match_all().
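The effect of the start-of-line anchor can be checked with a small sketch in Python. The assumptions here: Python's re has neither \K nor \V, so a capture group and [^\r\n]+ stand in for them (only an approximation of \V), re.findall plays the role of preg_match_all(), and the sample text is invented.

```python
import re

# Label, optional colon, optional spaces, then the rest of the line in group 1.
LABELS = r'(?:username|email|url):? *([^\r\n]+)'

text = "Username: alice\nSomethingelse: url whoops\nURL https://example.com"

# With ^ and re.M, only labels at the start of a line are accepted.
anchored = re.findall('^' + LABELS, text, flags=re.M | re.I)
# Without the anchor, "url" inside the second line is mistaken for a label.
unanchored = re.findall(LABELS, text, flags=re.I)

print(anchored)    # ['alice', 'https://example.com']
print(unanchored)  # ['alice', 'whoops', 'https://example.com']
```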

Alternatives to variable-width lookbehind in Python regex

In the case you described, you need to use a capture group:

"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"

will become

r"ORIG\s?:\s?/\s?([A-Z0-9]+)"

The value will be in .group(1). Note that raw strings are preferred.

Here is some sample code:

import re

# Group 1 captures the value after the variable-width "ORIG : / " prefix.
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(p.search(test_str).group(1))  # texthere

Unless you need overlapping matches, using a capturing group instead of a look-behind is rather straightforward.
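When there are several occurrences, the same capture-group pattern can be used with findall or finditer. This is a small sketch with an invented input string: findall returns only the group, and span(1) gives the position a lookbehind-based match would have reported.

```python
import re

ORIG = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)

text = "ORIG:/ABC123 and orig : / XY9"

# With exactly one capture group, findall returns the group contents.
values = ORIG.findall(text)
# span(1) is the (start, end) of the captured value, not of the whole match.
spans = [m.span(1) for m in ORIG.finditer(text)]

print(values)  # ['ABC123', 'XY9']
print(spans)   # [(6, 12), (26, 29)]
```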

R: workaround for variable-width lookbehind

You can use the lookbehind alternative \K instead. This escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.

Quoting rexegg:

The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before \K.

Using it in context:

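# Input vector, reconstructed from the output shown below
ba <- c("baa", "aba", "abba", "abbba", "aaba", "aabba")
sub('a[ab]b{1,2}\\Ka', 'i', ba, perl=TRUE)
# [1] "baa" "aba" "abbi" "abbbi" "aabi" "aabbi"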

Avoiding lookarounds:

sub('(a[ab]b{1,2})a', '\\1i', ba)
# [1] "baa" "aba" "abbi" "abbbi" "aabi" "aabbi"
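The capture-group variant carries over to engines without \K almost unchanged. As a sketch in Python: the input list is an assumption, reconstructed from the output printed above, and count=1 mirrors R's sub(), which replaces only the first match.

```python
import re

# Hypothetical inputs chosen to reproduce the R output shown above.
words = ["baa", "aba", "abba", "abbba", "aaba", "aabba"]

# Group 1 keeps the prefix; only the trailing "a" is swapped for "i".
result = [re.sub(r'(a[ab]b{1,2})a', r'\1i', w, count=1) for w in words]

print(result)  # ['baa', 'aba', 'abbi', 'abbbi', 'aabi', 'aabbi']
```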

PHP Regex negative lookbehind variable length alternatives issue

The correct answer would be to use a DOM parser instead. For a quick and dirty (and sometimes faster) way though, you could use the (*SKIP)(*FAIL) mechanism which PCRE implements:

<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+


A complete PHP walk-through would be:

<?php
$string = <<<DATA
<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes<b bad<>b> <bad& hi>;<strong >b<a<
DATA;

$regex = '~<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+~';
$string = preg_replace_callback($regex,
    function($match) {
        return htmlentities($match[0]);
    },
    $string);

echo $string;
?>

Which yields:

<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes&lt;b bad&lt;&gt;b&gt; &lt;bad&amp; hi&gt;;<strong >b&lt;a&lt;

However, as stated many times on StackOverflow before, consider using a parser instead. After all, that's what they are made for.


A parser-based approach could be:

$dom = new DOMDocument();
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);

echo $dom->saveHTML();

However, your presented snippet is corrupt so regular expressions might be the only way to handle it.
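The (*SKIP)(*FAIL) verbs are PCRE-specific. In an engine without them, a common substitute is to match the part to keep in one branch and capture the part to change in another, then decide in a replacement callback. The following is a minimal sketch of that idea in Python; it is an approximation of the PHP version above, not a port. The function name and sample input are made up, and html.escape stands in for htmlentities.

```python
import html
import re

# Branch 1 matches a well-formed tag (to be left alone);
# branch 2 captures runs of stray <, > or & in group 1.
TAG_OR_STRAY = re.compile(r'<[^<>&]+>|([<>&]+)')

def escape_stray(text: str) -> str:
    def repl(m: re.Match) -> str:
        stray = m.group(1)
        # Group 1 is None when the tag branch matched: return it unchanged.
        return html.escape(stray) if stray is not None else m.group(0)
    return TAG_OR_STRAY.sub(repl, text)

print(escape_stray("yes<b bad<>b> <i>ok</i> a&b"))
# yes&lt;b bad&lt;&gt;b&gt; <i>ok</i> a&amp;b
```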

How to do a perl variable length positive lookbehind or something comparable

Use the \K token (match resetter) for variable-length lookbehinds in Perl:

foo.*\Kbat

Perl:

perl -0777 -pe 's/foo.*\Kbat/cap/g' file

Fixed-length regex lookbehind complains of variable-length lookbehind

The problem is caused by a bug that was fixed in PCRE 6.7. Quoting the changelog:

A negated single-character class was not being recognized as
fixed-length in lookbehind assertions such as (?<=[^f]), leading to an
incorrect compile error "lookbehind assertion is not fixed length"

PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. Since you still hit this bug, your server must be running an older PCRE version, so for a preg_split-based workaround you have to use a pattern without a negated character class. For example:

$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';

However, I find the whole approach a bit weird: what if the , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.

On second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:

$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);

I really think I covered all the issues here; those trailing slashes were a killer.
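The same alternation trick can be tried outside PHP. As a sketch, here is the pattern applied with Python's re.findall to the same sample string: each field is a run of characters that are neither a backslash nor a comma, or a backslash followed by any character (or by the end of the string).

```python
import re

# Same sample as the PHP snippet: e ,a\,b\\,c\\\,d\\
s = r'e ,a\,b\\,c\\\,d\\'

# An escaped pair (backslash + any char) is consumed as a unit, so a comma
# only ends a field when an even number of backslashes precedes it.
fields = re.findall(r'(?:[^\\,]|\\(?:.|$))+', s)

for field in fields:
    print(field)
# e 
# a\,b\\
# c\\\,d\\
```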

What's the technical reason for lookbehind assertion MUST be fixed length in regex?

Lookahead and lookbehind aren't nearly as similar as their names imply. The lookahead expression works exactly the same as it would if it were a standalone regex, except it's anchored at the current match position and it doesn't consume what it matches.

Lookbehind is a whole different story. Starting at the current match position, it steps backward through the text one character at a time, attempting to match its expression at each position. In cases where no match is possible, the lookbehind has to go all the way to the beginning of the text (one character at a time, remember) before it gives up. Compare that to the lookahead expression, which gets applied exactly once.

This is a gross oversimplification, of course, and not all flavors work that way, but you get the idea. The way lookbehinds are applied is fundamentally different from (and much, much less efficient than) the way lookaheads are applied. It only makes sense to put a limit on how far back the lookbehind has to look.
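The backward-stepping behaviour described above can be imitated by hand. The sketch below is a deliberately naive emulation, not how any particular engine implements lookbehind: the helper names are made up, and it simply tries every possible start position before the current one, which makes the cost of an unbounded lookbehind visible.

```python
import re

def lookbehind_holds(pattern: str, text: str, pos: int) -> bool:
    """Return True if `pattern` matches some substring ending exactly at pos."""
    compiled = re.compile(pattern, re.S)
    # Step backward one character at a time, all the way to the start
    # of the text if nothing matches earlier.
    for start in range(pos, -1, -1):
        if compiled.fullmatch(text, start, pos):
            return True
    return False

def find_with_lookbehind(behind: str, target: str, text: str) -> list[int]:
    """Emulate (?<=behind)target and return the start offsets of the matches."""
    return [m.start() for m in re.finditer(target, text)
            if lookbehind_holds(behind, text, m.start())]

positions = find_with_lookbehind(r'foo.*', r'bar', "bar foo x bar")
print(positions)  # [10]
```

Each candidate match may trigger a scan back to the beginning of the text, so the work grows with the distance from the start, whereas a fixed-width lookbehind only ever inspects a bounded window.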
