Why does a simple .*? non-greedy regex greedily include additional characters before a match?
I figured out a solution with some help from Regex lazy vs greedy confusion.
In regex engines like the one used by JavaScript (NFA engines, I believe), non-greedy only gives you the shortest match going left to right: from the first left-hand match that fits to the nearest right-hand match.
Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).
Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
You can approximate a regex that is non-greedy in both directions by replacing the dot (.) with a negation meaning "not the left-side match". Negating a string like this requires a negative lookahead and a non-capturing group, but it's as simple as dropping the string into the lookahead of (?:(?!).), for example (?:(?!HOHO).)
For example, the equivalent of HOHO.*?_HO_ that is non-greedy on both the left and the right would be:
HOHO(?:(?!HOHO).)*?_HO_
So the regex engine is essentially going through each character like this:
- HOHO: Does this match the left side?
- (?:(?!HOHO).)*: If so, can I reach the right-hand side without any repeats of the left side?
- _HO_: If so, grab everything until the right-hand match.
- The ? modifier on * or +: If there are multiple right-hand matches, choose the nearest one.
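The difference between the two patterns can be seen in a minimal Python sketch. The sample string is a made-up input (not from the original question), chosen so that two left-hand markers precede a single right-hand marker; Python's re module is a backtracking engine that should behave the same way as JavaScript's here.

```python
import re

# Hypothetical sample text: two "HOHO" markers before one "_HO_" marker.
text = "HOHO_a_HOHO_b_HO_"

# Lazy dot: the match still begins at the FIRST "HOHO" the engine reaches.
lazy = re.search(r"HOHO.*?_HO_", text).group()

# Tempered dot: each consumed character must not begin another "HOHO",
# so the attempt from the first "HOHO" fails and the engine moves on.
tempered = re.search(r"HOHO(?:(?!HOHO).)*?_HO_", text).group()

print(lazy)      # HOHO_a_HOHO_b_HO_
print(tempered)  # HOHO_b_HO_
```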
What do 'lazy' and 'greedy' mean in the context of regular expressions?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non-newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across is where we want to stop the matching.
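The same comparison can be tried in Python, whose re module treats <.+> and <.+?> the same way for this input. This is only a small sketch using the example string from above.

```python
import re

html = "<em>Hello World</em>"

# Greedy: .+ runs to the end of the line, then backs off to the LAST ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST ">" after each "<".
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<em>Hello World</em>']
print(lazy)    # ['<em>', '</em>']
```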
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
How can I write a regex which matches non greedy?
The non-greedy ? works perfectly fine. You just need to select the "dot matches all" option in the regex engine you are testing with (regexpal, the engine you used, also has this option). This is because regex engines generally don't match line breaks when you use the dot (.); you need to tell them explicitly that you want . to match line breaks too.
For example,
<img\s.*?>
works fine once that option is enabled.
Also, read about how dot behaves in various regex flavours.
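In Python the "dot matches all" option is the re.DOTALL flag. The sketch below uses an invented multi-line img tag (an assumption, since the original input is not shown) to illustrate the effect on the pattern above.

```python
import re

# Hypothetical tag whose attributes are split across two lines.
tag = '<img src="a.png"\n     alt="demo">'
pattern = r"<img\s.*?>"

# Without the flag, "." refuses to cross the newline, so no ">" is reachable.
without_flag = re.search(pattern, tag)

# With re.DOTALL, "." also matches "\n" and the lazy match stops at the first ">".
with_flag = re.search(pattern, tag, re.DOTALL)

print(without_flag)              # None
print(with_flag.group() == tag)  # True
```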
Non greedy (reluctant) regex matching in sed?
Neither basic nor extended POSIX/GNU regex syntax recognizes the non-greedy quantifier; you need a more modern regex flavor. Fortunately, a Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|$1|'
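For comparison, the same substitution can be sketched in Python, which also supports the lazy quantifier. The URL below is a placeholder chosen for illustration, not taken from the original question.

```python
import re

# Placeholder URL, assumed for illustration only.
url = "http://www.example.com/path/page.html"

# Lazy: keep everything up to the FIRST "/" after the scheme, drop the rest.
lazy = re.sub(r"(http://.*?/).*", r"\1", url)

# Greedy, for contrast: .* backs off only to the LAST "/".
greedy = re.sub(r"(http://.*/).*", r"\1", url)

print(lazy)    # http://www.example.com/
print(greedy)  # http://www.example.com/path/
```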
Python non-greedy regexes
You seek the all-powerful *?
From the docs, Greedy versus Non-Greedy:
"the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little text as possible."
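A quick way to see what "as little text as possible" means is to run each quantifier, greedy and non-greedy, against the same input. The input "aaaa" and the bounds {2,3} are arbitrary choices for this sketch.

```python
import re

text = "aaaa"

# Each greedy quantifier paired with its non-greedy counterpart.
pairs = [("a*", "a*?"), ("a+", "a+?"), ("a?", "a??"), ("a{2,3}", "a{2,3}?")]

results = {
    pattern: re.match(pattern, text).group()
    for pair in pairs
    for pattern in pair
}

for greedy, lazy in pairs:
    print(f"{greedy!r} -> {results[greedy]!r}   {lazy!r} -> {results[lazy]!r}")
# 'a*' -> 'aaaa'   'a*?' -> ''
# 'a+' -> 'aaaa'   'a+?' -> 'a'
# 'a?' -> 'a'   'a??' -> ''
# 'a{2,3}' -> 'aaa'   'a{2,3}?' -> 'aa'
```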
How to do a non-greedy match in grep?
You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?
By default grep doesn't support non-greedy modifiers, but you can use grep -P to use the Perl syntax.
Regular expression to stop at first match
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc.) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
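Here is a small Python sketch of the greedy and non-greedy versions side by side. The full input line is an assumption, reconstructed by putting location= in front of the fragment quoted above.

```python
import re

# Assumed input line, reconstructed from the fragment quoted in the answer.
line = 'location="file path/level1/level2" xxx some="xxx"'

# Greedy: (.*) runs to the LAST double quote on the line.
greedy = re.search(r'location="(.*)"', line).group(1)

# Non-greedy: (.*?) stops at the FIRST closing double quote.
lazy = re.search(r'location="(.*?)"', line).group(1)

print(greedy)  # file path/level1/level2" xxx some="xxx
print(lazy)    # file path/level1/level2
```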
How to avoid (linear) back tracking in a non-greedy regular expression?
Instead of writing something like .*?\s*$ where .*? must check whether there's white-space or not before taking each character, you can use character classes and a group (atomic if possible) to limit the impact of the non-greedy quantifier.
In short, you can change .*?\s*$ to something like (?>\s*\S+)*?\s*$ (obviously (?>\s*\S+)*\s*$ or (?:\s*\S+)*+\s*$ is faster and produces the same result).
When you write it this way, \s*$ is only tested after the last non-whitespace position (at the next white-space or at the end of the string).
If the atomic group feature isn't available, you can emulate it like this:
(?>expression) => (?=(expression))\1
Note: for your particular case, you can also change .*?\s*$ to (?:.*\S)?\s*$
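As a rough check that these rewrites are interchangeable, the Python sketch below compares the original lazy pattern, the lookahead-plus-backreference emulation of the atomic group, and the (?:.*\S)? variant on a few invented single-line samples. The group names body and tok are added only for this illustration, and the emulation is used because native atomic groups need a recent Python version. It compares results only; it does not measure speed.

```python
import re

# Original pattern: the lazy dot re-tests \s*$ before every character.
LAZY = re.compile(r"(?P<body>.*?)\s*$")

# Emulated atomic group: (?=(expr))\1 written with a named group "tok".
EMULATED = re.compile(r"(?P<body>(?:(?=(?P<tok>\s*\S+))(?P=tok))*)\s*$")

# Simpler variant: greedily run to the last non-whitespace character.
OPTIONAL = re.compile(r"(?P<body>(?:.*\S)?)\s*$")

# Invented single-line samples (no newlines).
samples = ["  hello world   ", "no-trailing", "   ", ""]

bodies = {
    sample: [p.match(sample).group("body") for p in (LAZY, EMULATED, OPTIONAL)]
    for sample in samples
}

for sample, found in bodies.items():
    print(repr(sample), "->", found)
```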
How can I make my match non greedy in vim?
Instead of .* use .\{-}, for example:
%s/style=".\{-}"//g
Also, see :help non-greedy
Javascript regex lazy quantifier not working as expected
You could start the match by matching one or more digits (\d+) followed by ), and then use a negated character class [^;] matching any character except a ; so that the match cannot run past a semicolon. The word boundaries \b prevent the matched word characters from being part of a larger word.
\b\d+\)[^;]*- ex\b
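The effect of the negated character class can be sketched in Python. Both the input string and the lazy pattern used for contrast are hypothetical, since the original question's data is not shown here.

```python
import re

# Hypothetical input: semicolon-separated items, only one ending in "- ex".
text = "1) foo; 2) bar - ex; 3) baz"

# A lazy dot still starts at the first "1)" and runs across the semicolon.
lazy = re.search(r"\d+\).*?- ex", text).group()

# The negated class cannot cross ";", so the attempt from "1)" fails
# and the match starts at "2)" instead.
negated = re.search(r"\b\d+\)[^;]*- ex\b", text).group()

print(lazy)     # 1) foo; 2) bar - ex
print(negated)  # 2) bar - ex
```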