Why Does a Simple .*? Non-Greedy Regex Greedily Include Additional Characters Before a Match

I figured out a solution with some help from the question "Regex lazy vs greedy confusion".

In regex engines like the one used by JavaScript (backtracking NFA engines, I believe), non-greedy only gives you the match that is shortest going left to right - from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always start from the first one it reaches (which will actually give the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If not, move to the next character and repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
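To make the difference concrete, here is a minimal Python sketch of the two strategies described above. The helper names leftmost_lazy and shortest_overall are made up for illustration, and the explicit loop only approximates what a backtracking engine does internally.

```python
import re


def leftmost_lazy(pattern, text):
    """What the engine does: stop at the first start position that matches."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        match = compiled.match(text, start)  # anchored attempt at this position
        if match:
            return match
    return None


def shortest_overall(pattern, text):
    """What I expected: try every start position and keep the shortest match."""
    compiled = re.compile(pattern)
    attempts = (compiled.match(text, start) for start in range(len(text) + 1))
    candidates = [m for m in attempts if m]
    return min(candidates, key=lambda m: len(m.group()), default=None)


text = "a1 a2 b"
print(leftmost_lazy(r"a.*?b", text).group())     # a1 a2 b
print(shortest_overall(r"a.*?b", text).group())  # a2 b
print(re.search(r"a.*?b", text).group())         # a1 a2 b (same as leftmost_lazy)
```

The lazy quantifier only shortens the match from a given starting point; it never moves the starting point to the right.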

You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "any character that does not start the left-side match". Negating a string like this requires a negative lookahead inside a non-capturing group, but it's as simple as dropping the string into the empty lookahead of (?:(?!).), which gives (?:(?!HOHO).) for the string HOHO.

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is essentially going through each character like this:

HOHO - Does this match the left side?
(?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
_HO_ - If so, grab everything until the right-hand match
? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
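The steps above can be checked with a short Python sketch. The sample string is invented for illustration; only the two patterns come from the answer.

```python
import re

# Hypothetical input with two left-hand markers and two right-hand markers.
text = "xx HOHO aaa HOHO bbb _HO_ ccc _HO_"

# Plain lazy version: starts at the FIRST HOHO and stops at the nearest _HO_.
lazy = re.search(r"HOHO.*?_HO_", text)

# Lookahead version: the middle part may not contain another HOHO,
# so the attempt from the first HOHO fails and the engine moves on.
tempered = re.search(r"HOHO(?:(?!HOHO).)*?_HO_", text)

print(lazy.group())      # HOHO aaa HOHO bbb _HO_
print(tempered.group())  # HOHO bbb _HO_
```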

What do 'lazy' and 'greedy' mean in the context of regular expressions?

Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:

<em>Hello World</em>

You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.

Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the match stops at the first > it comes across, which is what we want.
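As a quick check, the same two patterns can be run in Python (a small sketch; the variable names are arbitrary):

```python
import re

html = "<em>Hello World</em>"

# Greedy: .+ runs to the end of the string, then backs up to the LAST ">".
greedy_matches = re.findall(r"<.+>", html)

# Lazy: .+? grows one character at a time and stops at the FIRST ">".
lazy_matches = re.findall(r"<.+?>", html)

print(greedy_matches)  # ['<em>Hello World</em>']
print(lazy_matches)    # ['<em>', '</em>']
```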

I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.

How can I write a regex which matches non greedy?

The non-greedy ? works perfectly fine. You just need to enable the "dot matches all" option in the regex engine you are testing with (regexpal, the tester you used, also has this option). By default, . does not match line breaks in most regex engines, so you have to tell them explicitly that . should match line breaks too.

For example,

<img\s.*?>

works fine!

Also, read about how dot behaves in various regex flavours.
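In Python, for instance, the "dot matches all" option is the re.DOTALL flag. The sketch below uses a made-up tag that spans two lines to show the difference:

```python
import re

# Hypothetical tag whose attributes are split across two lines.
html = '<p><img src="a.png"\n     alt="A"></p>'

# Without the flag, "." cannot cross the line break, so nothing matches.
without_flag = re.search(r"<img\s.*?>", html)

# With re.DOTALL, "." also matches "\n" and the lazy match reaches the first ">".
with_flag = re.search(r"<img\s.*?>", html, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group()))
```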

Non greedy (reluctant) regex matching in sed?

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a more modern regex flavor. Fortunately, Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|$1|'
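If Perl is not at hand, the same substitution can be sketched with Python's re.sub. The URL below is a placeholder, not taken from the original question:

```python
import re

line = "http://www.example.com/some/path/index.html"

# Lazy: keep everything up to the first "/" after the scheme, drop the rest.
lazy_result = re.sub(r"(http://.*?/).*", r"\1", line)

# Greedy, for comparison: the group runs up to the LAST "/".
greedy_result = re.sub(r"(http://.*/).*", r"\1", line)

print(lazy_result)    # http://www.example.com/
print(greedy_result)  # http://www.example.com/some/path/
```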

Python non-greedy regexes

You seek the all-powerful *?

From the docs, Greedy versus Non-Greedy

the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little text as possible.
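A small sketch comparing each greedy quantifier with its non-greedy form on the same input (the input strings are arbitrary):

```python
import re

text = "aaaa"

# Each pair is (greedy result, non-greedy result), both anchored at the start.
star = (re.match(r"a*", text).group(), re.match(r"a*?", text).group())
plus = (re.match(r"a+", text).group(), re.match(r"a+?", text).group())
optional = (re.match(r"a?", text).group(), re.match(r"a??", text).group())
counted = (re.match(r"a{2,4}", text).group(), re.match(r"a{2,4}?", text).group())

print(star)      # ('aaaa', '')
print(plus)      # ('aaaa', 'a')
print(optional)  # ('a', '')
print(counted)   # ('aaaa', 'aa')
```

Each non-greedy form takes the minimum the quantifier allows, because nothing after it in the pattern forces it to take more.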

How to do a non-greedy match in grep?

You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?.

By default grep doesn't support non-greedy modifiers, but with GNU grep you can use grep -P to get Perl-compatible syntax (this requires a grep built with PCRE support).

Regular expression to stop at first match

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".

Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:

/location="(.*?)"/

Adding a ? after a quantifier (?, *, + or {m,n}) makes it non-greedy.

Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
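A quick Python sketch of the difference, using the sample text quoted above as the input line:

```python
import re

line = 'location="file path/level1/level2" xxx some="xxx"'

# Greedy: the group runs up to the LAST double quote on the line.
greedy_value = re.search(r'location="(.*)"', line).group(1)

# Lazy: the group stops at the FIRST closing double quote.
lazy_value = re.search(r'location="(.*?)"', line).group(1)

print(greedy_value)  # file path/level1/level2" xxx some="xxx
print(lazy_value)    # file path/level1/level2
```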

How to avoid (linear) back tracking in a non-greedy regular expression?

Instead of writing something like .*?\s*$ where .*? must check if there's a white-space or not before taking each character, you can use character classes and a group (atomic if possible) to limit the impact of the non-greedy quantifier.

In short, you can change .*?\s*$ to something like (?>\s*\S+)*?\s*$ (obviously (?>\s*\S+)*\s*$ or (?:\s*\S+)*+\s*$ is faster and produces the same result).
When you write it this way, \s*$ is only tested after the last non-whitespace position (at the next white-space or at the end of the string).

If the atomic group feature isn't available, you can emulate it like this:

(?>expression)   =>    (?=(expression))\1

Note: for your particular case, you can also change .*?\s*$ to (?:.*\S)?\s*$
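As a sketch, the lookahead-plus-backreference emulation can be tried in Python, whose re module only gained native atomic groups in version 3.11. Wrapping each pattern in a capturing group and the sample strings are my additions for illustration; the check only confirms that the three patterns capture the same text and says nothing about speed.

```python
import re

# Original lazy pattern, with a group around the part we want to keep.
naive = re.compile(r"(.*?)\s*$")

# Atomic group emulated as (?=(expression))\N; here the inner group is number 2.
emulated = re.compile(r"((?:(?=(\s*\S+))\2)*)\s*$")

# Simpler alternative from the note above.
simple = re.compile(r"((?:.*\S)?)\s*$")

samples = ["  hello   world   ", "no-trailing-space", "   ", ""]
results = [
    (naive.match(s).group(1), emulated.match(s).group(1), simple.match(s).group(1))
    for s in samples
]

for captured in results:
    print(captured)
```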

How can I make my match non greedy in vim?

Instead of .* use .\{-}.

%s/style=".\{-}"//g

Also, see :help non-greedy

Javascript regex lazy quantifier not working as expected

You could start the match with one or more digits \d+ followed by ), and then use a negated character class [^;] that matches any character except a ; so the match can never run past a semicolon.

The word boundaries \b prevent the match from starting or ending in the middle of a larger word.

\b\d+\)[^;]*- ex\b
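The original input is not shown here, so the sketch below uses an invented semicolon-separated string to compare a lazy dot with the negated character class (in Python, though the pattern behaves the same way in JavaScript):

```python
import re

# Hypothetical input: only the second item ends in "- ex".
text = "1) foo - bar; 2) baz - ex; 3) qux"

# Lazy dot: starts at the first "1)" and happily crosses the ";".
lazy = re.search(r"\b\d+\).*?- ex\b", text)

# Negated class: cannot cross ";", so the attempt from "1)" fails
# and the match starts at "2)" instead.
negated = re.search(r"\b\d+\)[^;]*- ex\b", text)

print(lazy.group())     # 1) foo - bar; 2) baz - ex
print(negated.group())  # 2) baz - ex
```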
