Variable-length lookbehind-assertion alternatives for regular expressions
Most of the time, you can avoid variable-length lookbehinds by using \K.
s/(?<=foo.*)bar/moo/s;
would be
s/foo.*\Kbar/moo/s;
Anything up to the last \K encountered is not considered part of the match (e.g. for the purposes of replacement, $&, etc.).
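For engines without \K (Python's built-in re module is one of them), the same replacement can be sketched with a capture group that is written back in the replacement. The sample string is made up for illustration.

```python
import re

# Python's built-in re module has no \K, so capture the prefix and re-insert it.
# Same effect as the Perl substitution s/foo.*\Kbar/moo/s.
pattern = re.compile(r"(foo.*)bar", re.DOTALL)

text = "foo 1 bar 2 bar"  # illustrative input, not from the original answer
result = pattern.sub(r"\1moo", text, count=1)
print(result)  # foo 1 bar 2 moo
```

Because .* is greedy, it is the last "bar" after "foo" that gets replaced, just as with the \K version.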
Negative lookbehinds are a little trickier.
s/(?<!foo.*)bar/moo/s;
would be
s/^(?:(?!foo).)*\Kbar/moo/s;
because (?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.
If you're just matching, you might not even need the \K.
/foo.*bar/s
/^(?:(?!foo).)*bar/s
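The negative case can be sketched the same way in Python, again with a capture group standing in for \K. The helper name replace_bar and the sample strings are illustrative assumptions.

```python
import re

# Emulates the Perl s/^(?:(?!foo).)*\Kbar/moo/s with a capture group,
# because Python's built-in re module does not support \K.
no_foo_before = re.compile(r"^((?:(?!foo).)*)bar", re.DOTALL)

def replace_bar(text: str) -> str:
    # Replace a "bar" only if no "foo" occurs anywhere before it.
    return no_foo_before.sub(r"\1moo", text, count=1)

print(replace_bar("a bar b"))       # a moo b
print(replace_bar("foo then bar"))  # foo then bar
```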
Variable length look-behind
Use \K as a special case. It's a variable-length positive lookbehind assertion:
/eat_(?:apple|pear|orange)_\Ktoday|yesterday/g
Alternatively, you can list out your lookbehind assertions separately:
/(?:(?<=eat_apple_)|(?<=eat_pear_)|(?<=eat_orange_))today|yesterday/g
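The separate-lookbehind form also works in engines that insist on fixed width, since each alternative is fixed on its own. A small Python check follows; the sample sentence is invented, and wrapping today|yesterday in a group is an assumption about the intended match.

```python
import re

# One fixed-width lookbehind per alternative; each is legal on its own in Python's re.
# The trailing (?:today|yesterday) group is an assumption about the intended match.
pattern = re.compile(
    r"(?:(?<=eat_apple_)|(?<=eat_pear_)|(?<=eat_orange_))(?:today|yesterday)"
)

sample = "eat_apple_today eat_pear_yesterday eat_kiwi_today"  # illustrative input
matches = pattern.findall(sample)
print(matches)  # ['today', 'yesterday']
```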
However, I would suggest that it's rare to find a problem that needs that feature and can't be reworked to use a combination of other, more common regex features.
In other words, if you get stuck on a specific problem, feel free to share it here, and I'm sure someone can come up with a different (perhaps better) approach.
Regex Error: A lookbehind assertion has to be fixed width
Use a full string match restart (\K) instead of the invalid variable-length lookbehind.
/^(?:username|Email|Url):? *\K\V+/mi
Make the colon and space optional by trailing them with ? or *.
Use \V+ to match the remaining characters in the line, excluding vertical whitespace (such as \r and \n).
See the broader canonical: Variable-length lookbehind-assertion alternatives for regular expressions
To protect your script from falsely matching values instead of labels, notice the use of ^ with the m modifier. This ensures that you only match labels that occur at the start of a line.
Without a start-of-line anchor, Somethingelse: url whoops will match whoops.
To make multiple matches in PHP, the g pattern modifier is not used. Instead, apply the pattern with preg_match_all().
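For comparison, here is a rough Python counterpart of the label pattern. Python's re module has neither \K nor \V, so this sketch uses a capture group and approximates \V+ with [^\r\n]+; the input text is invented for illustration.

```python
import re

# Capture-group counterpart of /^(?:username|Email|Url):? *\K\V+/mi.
# [^\r\n]+ approximates \V+, which Python's re module does not provide.
label_value = re.compile(
    r"^(?:username|Email|Url):? *([^\r\n]+)",
    re.MULTILINE | re.IGNORECASE,
)

# Illustrative input; the second line must not yield "whoops".
text = "Username: bob\nSomethingelse: url whoops\nurl http://example.com"
values = label_value.findall(text)
print(values)  # ['bob', 'http://example.com']
```

findall() plays the role of preg_match_all() here, returning every captured value in one call.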
Alternatives to variable-width lookbehind in Python regex
In the case you described, you need to use capture groups:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is a sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
# The wanted value sits in capture group 1 (print is a function in Python 3)
print(p.search(test_str).group(1))
Unless you need overlapping matches, using capturing groups instead of a look-behind is rather straightforward.
R: workaround for variable-width lookbehind
You can use the lookbehind alternative \K instead. This escape sequence resets the starting point of the reported match, and any previously consumed characters are no longer included.
Quoted from rexegg:
The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before \K.
Using it in context:
sub('a[ab]b{1,2}\\Ka', 'i', ba, perl=T)
# [1] "baa" "aba" "abbi" "abbbi" "aabi" "aabbi"
Avoiding lookarounds:
sub('(a[ab]b{1,2})a', '\\1i', ba)
# [1] "baa" "aba" "abbi" "abbbi" "aabi" "aabbi"
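The capture-group variant carries over to other languages unchanged. A Python rendering follows; the input list ba is an assumption, reconstructed from the output shown above since the original vector is not given.

```python
import re

# Assumed input vector, reconstructed from the output shown above.
ba = ["baa", "aba", "abba", "abbba", "aaba", "aabba"]

# Capture the prefix and put it back, replacing only the trailing "a".
# count=1 mirrors R's sub(), which replaces the first match only.
result = [re.sub(r"(a[ab]b{1,2})a", r"\1i", s, count=1) for s in ba]
print(result)  # ['baa', 'aba', 'abbi', 'abbbi', 'aabi', 'aabbi']
```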
PHP Regex negative lookbehind variable length alternatives issue
The correct answer would be to use a DOM parser instead. For a quick and dirty (and sometimes faster) way though, you could use the (*SKIP)(*FAIL) mechanism which PCRE implements:
<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+
A complete PHP walk-through would be:
<?php
$string = <<<DATA
<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes<b bad<>b> <bad& hi>;<strong >b<a<
DATA;
$regex = '~<[^<>&]+>(*SKIP)(*FAIL)|[<>&]+~';
$string = preg_replace_callback($regex,
    function($match) {
        return htmlentities($match[0]);
    },
    $string);
echo $string;
?>
Which yields:
<b>bold</b>, <strong>bold</strong>
<i>italic</i>, <em>italic</em>
<a href="http://www.example.com/" >inline URL</a>
<code>inline fixed-width code</code>
<pre>pre-formatted fixed-width code block</pre>
yes&lt;b bad&lt;&gt;b&gt; &lt;bad&amp; hi&gt;;<strong >b&lt;a&lt;
However, as stated many times on StackOverflow before, consider using a parser instead. After all, that's what they are made for.
A parser way could be:
$dom = new DOMDocument();
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);
echo $dom->saveHTML();
However, the snippet you presented is malformed, so regular expressions might be the only way to handle it.
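Engines without backtracking control verbs can approximate the (*SKIP)(*FAIL) idea by capturing the part to keep and deciding in a callback. The Python sketch below is one way to do that, not the method of the answer above; the names TOKEN and escape_stray are made up for illustration.

```python
import html
import re

# Group 1 captures well-formed tags; the second branch catches stray <, > and &.
TOKEN = re.compile(r"(<[^<>&]+>)|[<>&]+")

def escape_stray(text: str) -> str:
    def repl(m):
        if m.group(1) is not None:
            return m.group(0)  # a tag: leave it untouched
        return html.escape(m.group(0), quote=False)
    return TOKEN.sub(repl, text)

line = "yes<b bad<>b> <bad& hi>;<strong >b<a<"
escaped = escape_stray(line)
print(escaped)
# yes&lt;b bad&lt;&gt;b&gt; &lt;bad&amp; hi&gt;;<strong >b&lt;a&lt;
```

Consuming the whole tag in the first branch has the same effect as (*SKIP): the engine never retries a position inside a tag it has already accepted.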
How to do a perl variable length positive lookbehind or something comparable
Use the \K token (match resetter) for variable-length look-behinds in Perl:
foo.*\Kbat
Perl:
perl -0777 -pe 's/foo.*\Kbat/cap/g' file
Fixed-length regex lookbehind complains of variable-length lookbehind
The problem is caused by the bug fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as fixed-length in lookbehind assertions such as (?<=[^f]), leading to an incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. Since you still hit this bug, your server evidently runs an older version, so for a preg_split-based workaround you have to use a pattern without a negated character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if the , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
On second thought, one can use preg_match_all instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
I really think I covered all the issues here; those trailing slashes were a killer.
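The same alternation trick can be tried outside PHP. Here is a Python sketch using the same test string; the name FIELD is illustrative.

```python
import re

# Same alternation as the PHP pattern: a field is a run of
#   - any character except backslash or comma, or
#   - a backslash plus the character it escapes (or a trailing backslash).
FIELD = re.compile(r"(?:[^\\,]|\\(?:.|$))+")

s = "e ,a\\,b\\\\,c\\\\\\,d\\\\"  # same test string as the PHP example
fields = FIELD.findall(s)
for f in fields:
    print(f)
```

A comma preceded by an odd number of backslashes stays inside its field, while one preceded by an even number ends the field, which is the case a fixed-width lookbehind cannot express.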
What's the technical reason for lookbehind assertion MUST be fixed length in regex?
Lookahead and lookbehind aren't nearly as similar as their names imply. The lookahead expression works exactly the same as it would if it were a standalone regex, except it's anchored at the current match position and it doesn't consume what it matches.
Lookbehind is a whole different story. Starting at the current match position, it steps backward through the text one character at a time, attempting to match its expression at each position. In cases where no match is possible, the lookbehind has to go all the way to the beginning of the text (one character at a time, remember) before it gives up. Compare that to the lookahead expression, which gets applied exactly once.
This is a gross oversimplification, of course, and not all flavors work that way, but you get the idea. The way lookbehinds are applied is fundamentally different from (and much, much less efficient than) the way lookaheads are applied. It only makes sense to put a limit on how far back the lookbehind has to look.
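As one concrete data point, Python's built-in re module enforces such a limit at compile time, while lookaheads are accepted with any quantifier. The helper name compiles is hypothetical, and other flavors may behave differently.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=foo)bar"))    # True: the lookbehind is exactly 3 characters wide
print(compiles(r"(?<=foo.*)bar"))  # False: the lookbehind has no fixed width
print(compiles(r"(?=foo.*)bar"))   # True: lookaheads carry no such restriction
```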