What Are the Differences Between Lazy, Greedy and Possessive Quantifiers

What are the differences between lazy, greedy and possessive quantifiers?

Take the string

aaaab

and see how the following regexes match it:

Regex          Submatches
group 1 group 2 group3
(a?)(a*)(ab) a aa ab
(a??)(a*)(ab) aaa ab
(a?+)(a*)(ab) a aa ab
(a*)(a?)(ab) aaa ab
(a*?)(a?)(ab) aa a ab
(a*+)(a?)(ab) aaaa <Match fails!>
(a+)(a*)(ab) aaa ab
(a+?)(a*)(ab) a aa ab
(a++)(a*)(ab) aaaa <Match fails!>

Explanation:

  • a? tries to match one a, but it's prepared to match nothing if that's necessary for the whole match to succeed.
  • a?? tries to match nothing, but it's prepared to match one a if that's necessary for the whole match to succeed.
  • a?+ tries to match one a. If it can do that, it will not back down to match nothing if that were necessary for the overall match to succeed. If it can't match an a, then it will gladly match nothing, though.
  • a* tries to match as many as as it can, but it's prepared to match fewer as, even nothing if that's necessary for the whole match to succeed.
  • a*? tries to match nothing, but it's prepared to match just as many as as is absolutely necessary in order for the whole match to succeed, but not more.
  • a*+ tries to match as many as as it can. If it can do that, it will not back down to match fewer as if that were necessary for the overall match to succeed. If it can't match even a single a, then it will gladly match nothing, though.
  • a+ tries to match as many as as it can, but it's prepared to match fewer as (but at least one) if that's necessary for the whole match to succeed.
  • a+? tries to match only one a, but it's prepared to match just as many as as is absolutely necessary in order for the whole match to succeed, but not more.
  • a++ tries to match as many as as it can. If it can do that, it will not back down to match fewer as if that were necessary for the overall match to succeed. If it can't match even a single a, then the regex fails.

Greedy vs. Reluctant vs. Possessive Qualifiers

I'll give it a shot.

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!

A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!

In your example, it then starts the process over with the remaining unmatched portion of the string, "xxxxxxfoo", following the same process.

A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.

How do greedy / lazy (non-greedy) / possessive quantifiers work internally?

For your input string fooaaafoooobbbfoo.

Case 1: When you're using this regex:

foo.*

First remember this fact that engine traverses from left to right.

With that in mind above regex will match first foo which is at the start of input and then .* will greedily match longest possible match which is rest of the text after foo till end. At this point matching stops as there is nothing to match after .* in your pattern.

Case 2: When you're using this regex:

.*foo

Here again .* will greedily match longest possible match before matching last foo which is right the end of input.

Case 3: When you're using this regex:

foo.*foo

Which will match first foo found in input i.e. foo at the start then .* will greedily match longest possible match before matching last foo which is right the end of input.

Case 4: When you're using this regex with lazy quantifier:

foo.*?foo

Which will match first foo found in input i.e. foo at the start then .*? will lazily match shortest possible match before matching next foo which is second instance of foo starting at position 6 in input.

Case 5: When you're using this regex with possessive quantifier:

foo.*+foo

Which will match first foo found in input i.e. foo at the start then .*+ is using possessive quantifier which means match as many times as possible, without giving back. This will match greedily longest possible match till end and since possessive quantifier doesn't allow engine to backtrack hence presence of foo at the end of part will cause failure as engine will fail to match last foo.

regex possessive quantifier against lazy or greedy

From the Java Pattern documentation:

Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.

In your example, the < in your regex matches < in the string, then .++ matches the entire rest of the string, em>. You still have a > in your regex, but there are no characters left in the string for it to match (because .++ consumed them all). So the match fails.

If the quantifier were greedy, i.e. if it were .+ instead of .++, at this point the regular expression engine would try reducing the portion matched by .+ by one character, to just em, and try again. This time the match would succeed, because there would be a > left in the string for the > in the regex to match.

EDIT: A lazy quantifier would work like a greedy quantifier in reverse. Instead of starting by trying to match the whole rest of the string and backing off character by character, the lazy quantifier would start by trying to match a single character, in this case just e. If that doesn't allow the full regex to match (which it wouldn't here, because you'd have > in the regex trying to match m in the string), the lazy quantifier would move up to matching two characters, em. Then the > in the regex would line up with > in the string and the match would succeed. If it didn't work out, though, the lazy quantifier would move up to three characters, and so on.

Can someone explain Possessive Quantifiers to me? (Regular Expressions)

Perhaps the best place to start is Regex Tutorial - Possessive Quantifiers:

When discussing the repetition
operators or quantifiers, I explained
the difference between greedy and lazy
repetition. Greediness and laziness
determine the order in which the regex
engine tries the possible permutations
of the regex pattern. A greedy
quantifier will first try to repeat
the token as many times as possible,
and gradually give up matches as the
engine backtracks to find an overall
match. A lazy quantifier will first
repeat the token as few times as
required, and gradually expand the
match as the engine backtracks through
the regex to find an overall match.


Possessive quantifiers are a way to prevent the regex engine from
trying all permutations. This is primarily useful for performance
reasons. You can also use possessive quantifiers to eliminate certain
matches.

Why are greedy quantifiers less expensive than lazy quantifiers

Greedy and lazy quantifiers are equally (in)expensive when applied correctly. However, lazy quantifiers got a bad reputation of being slow because they can be, and often are, applied to compensate for imprecision in a pattern.

Consider a simple example: <.*?> vs. <.*>.

When both expressions are applied to the same input

<abcdefghijklmnopqrstuvwxyz0123456789>

they match exactly the same text. The difference is only in how they arrive at the match: the "lazy" expression tries increasingly longer strings, arriving at the match after 40 steps (demo 1). The greedy expression, on the other hand, goes all the way to the end, and backtracks once to arrive at the match after only 5 steps (demo 2).

Note that the situation can be reversed if you add more characters after the >:

<abcdefghijklmnopqrstuvwxyz0123456789>abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789

Now greedy expression becomes the "slow one", taking 149 steps (demo 3) while the lazy expression continues to take the same 40 steps as it did before (demo 4).

PCRE: Lazy and Greedy at the same time (Possessive Quantifiers)

Better use token_get_all to get the tokens of a PHP code and iterate them. PHPDoc style comments tokens can be identified with T_DOC_COMMENT.

How exactly does the possessive quantifier work?

The + after another quantifier means "don't allow the regex engine to backtrack into whatever the previous token has matched". (See a tutorial on possessive quantifiers here).

So when you apply .*foo to "xfooxxxxxxfoo", the .* first matches the entire string. Then, since foo can't be matched, the regex engine backtracks until that's possible, achieving a match when .* has matched "xfooxxxxxx" and foo has matched "foo".

Now the additional + prevents that backtracking from happening, so the match fails.

When you write (.*)+foo. the + takes on an entirely different meaning; now it means "one or more of the preceding token". You've created nested quantifiers, which is not a good idea, by the way. If you apply that regex to a string like "xfoxxxxxxxxxfox", you'll run into catastrophic backtracking.

Regex lazy vs greedy confusion

You are missing the fact that a regex engine works from left to right, position by position, and succeeds as soon as it finds a match at the current position.

In your example, the first position where the pattern succeeds is at the second "a".

The laziness works only on the right side.

If you want to obtain "xxx", a better way is to use a negated character class [^ab]* instead of .*?

Note: not exactly related to the subject, but good to know: a DFA regex engine will try to get the largest result in case of alternation, a NFA gives you the first that succeeds.



Related Topics



Leave a reply



Submit