How to Make Regular Expression into Non-Greedy

How can I write a regex which matches non greedy?

The non-greedy ? works perfectly fine. You just need to enable the "dot matches all" option in the regex engine you are testing with (regexpal, the engine you used, also has this option). By default, most regex engines do not let . match line breaks, so you have to tell them explicitly that . should match line breaks too.

For example,

<img\s.*?>

works fine!

Check the results here.

Also, read about how dot behaves in various regex flavours.
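As a rough illustration in Python (an assumption; the original question was tested in regexpal), the "dot matches all" option corresponds to the re.DOTALL flag. The sample HTML string below is made up for the demonstration:

```python
import re

# Hypothetical sample input: the first <img> tag spans two lines.
html = '<p>intro</p><img src="a.png"\n     alt="first"><img src="b.png">'

pattern = r'<img\s.*?>'

# Without DOTALL, "." cannot cross the line break, so the two-line tag is missed.
without_dotall = re.findall(pattern, html)

# With DOTALL, "." also matches "\n", and the lazy .*? still stops at the first ">".
with_dotall = re.findall(pattern, html, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```

The non-greedy quantifier behaves the same in both calls; only the set of characters that . may match changes.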

How to make Regular expression into non-greedy?

The non-greedy quantifiers are like their greedy counterparts, but with a ? immediately following them:

*   - zero or more
*?  - zero or more (non-greedy)
+   - one or more
+?  - one or more (non-greedy)
?   - zero or one
??  - zero or one (non-greedy)
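A small sketch of the list above, using Python's re module as an assumed engine. Each pair applies the greedy and the non-greedy form to the same made-up input, anchored at the start of the string:

```python
import re

text = 'aaa'

# re.match anchors at position 0, so only the quantifier decides the length.
greedy_star = re.match(r'a*', text).group()   # takes as many "a" as possible
lazy_star = re.match(r'a*?', text).group()    # zero is allowed, so it takes none
greedy_plus = re.match(r'a+', text).group()   # takes as many "a" as possible
lazy_plus = re.match(r'a+?', text).group()    # must take one, so it takes exactly one
greedy_opt = re.match(r'a?', text).group()    # prefers one
lazy_opt = re.match(r'a??', text).group()     # prefers zero

for name, value in [('*', greedy_star), ('*?', lazy_star), ('+', greedy_plus),
                    ('+?', lazy_plus), ('?', greedy_opt), ('??', lazy_opt)]:
    print(f'a{name:<3} -> {value!r}')
```

On their own the lazy forms match very little; they become useful when something else follows them in the pattern.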

How can I use regular expression with non greedy in python from right to left?

You could use re.findall with the following regex pattern:

\bstep into(?:(?!step into).)*?\bstep out\b

Python script:

import re

inp = """step into
1
2
step into
3
4
step out"""
matches = re.findall(r'\bstep into(?:(?!step into).)*?\bstep out\b', inp, flags=re.DOTALL)
print(matches)

This prints:

['step into\n3\n4\nstep out']

Here is an explanation of the regex pattern:

\bstep into            match "step into"
(?:(?!step into).)*?   match any content, across newlines, so long as "step into"
                       is NOT encountered before seeing "step out"
\bstep out\b           match the first "step out" after "step into"
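To see why the negative lookahead is needed, here is a sketch that compares the pattern with a plain lazy .*? on the same input. The variable names are only illustrative:

```python
import re

inp = "step into\n1\n2\nstep into\n3\n4\nstep out"

# A plain lazy match still starts at the FIRST "step into": laziness only
# shortens the right-hand end of a match, never the left-hand end.
plain_lazy = re.findall(r'\bstep into.*?\bstep out\b', inp, flags=re.DOTALL)

# The lookahead (?!step into) stops the match from running over another
# "step into", so the engine has to restart at the later one.
tempered = re.findall(r'\bstep into(?:(?!step into).)*?\bstep out\b',
                      inp, flags=re.DOTALL)

print(plain_lazy)
print(tempered)
```

The first call returns the whole input as one match; the second returns only the block that begins at the last "step into".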

How to make regex match non-greedy?

I know there are two answers already, but sometimes it helps to have another way to look at it and handle it.

The Problem

When the engine is positioned before the first h, it makes its best effort to match the regex http.*?500.jpg. Can the regex match at that point? Yes, it can. After matching http, the engine keeps lazily matching until it meets 500.jpg. There is nothing to stop it. You have told it to match only as many characters as necessary, and that is what it is doing.

In contrast, suppose you have this string with two occurrences of 500.jpg:

http://google.com<img src="http://google.com/500.jpg 1500.jpg
                                                   ^ lazy .*? stops here
                                                            ^ greedy .* stops here

The greedy one will match the whole string. But the lazy one will stop as soon as it can: in the same place as before. This is where you can see the difference between greedy and lazy.

Workaround: Don't Use Dot-Star—Use The Right Token

Suppose you knew that each http string has a space or newline after it. You could use a lazy match with http\S*?\.jpg. The point is that \S*?, which matches any character that is not whitespace (spaces, newlines, tabs, etc.), is not able to jump over the space, unlike the dot-star.
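The three patterns discussed here can be compared side by side. This sketch assumes Python's re module and reuses the sample string with two occurrences of 500.jpg:

```python
import re

s = 'http://google.com<img src="http://google.com/500.jpg 1500.jpg'

# Lazy dot-star: starts at the first "http" and stops at the first 500.jpg.
lazy = re.search(r'http.*?500.jpg', s).group()

# Greedy dot-star: starts at the first "http" and runs to the last 500.jpg.
greedy = re.search(r'http.*500.jpg', s).group()

# \S cannot cross the space in '<img src', so the attempt at the first "http"
# fails and the engine moves on to the second "http".
no_space = re.search(r'http\S*?\.jpg', s).group()

print(lazy)
print(greedy)
print(no_space)
```

Only the \S version returns the URL on its own, because the first attempt is forced to fail instead of being allowed to stretch.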

Reference

In addition, I highly recommend you read the article below, as it should help with any remaining confusion.

The Many Degrees of Regex Greed

Non-greedy string regular expression matching

Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.

The engine searches for matches from left to right. Yes, all of the strings aaaab, aaab, aab, and ab match your pattern, but aaaab, being the one that starts furthest to the left, is the one that is returned.

So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y")
#      [,1]
# [1,] "aaaab y"

Here all of the strings aaaab y, aaaab yy, aaaab yyy matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.


So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? By adding a greedy pattern .* in the front, you are now forcing the process to put the last possible a into the captured group.
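For readers who do not use R, here is a rough Python equivalent of the str_match calls above. It assumes the pattern in the question was something like a.*?b, which is not shown in this answer:

```python
import re

text = 'xxx aaaab yyy'

# Assumed original pattern: the leftmost "a" wins, so laziness changes nothing.
leftmost = re.search(r'a.*?b', text).group()

# Laziness matters when several matches start at the same position.
lazy_y = re.search(r'a.*?y', text).group()

# A greedy .* in front pushes the capture group as far right as possible.
m = re.search(r'.*(a.*b)', text)
whole, last_ab = m.group(0), m.group(1)

print(leftmost)
print(lazy_y)
print(whole, '|', last_ab)
```

group(0) and group(1) correspond to the [,1] and [,2] columns in the R output.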

How can I do a non greedy regex query in notepad++?

Use a reluctant (aka non-greedy) expression:

\\cite\[(.*?)] 

See a live demo.

Adding the question mark changes the .* from greedy (the default) to reluctant, so it consumes as little as possible to find a match; i.e. it won't run from the start of one search term all the way to the end of a later one.

i.e. using .* the match would be:

foo \cite[aaa]\cite[bbb] something here \cite[ccc] bar
    ^----------------------1---------------------^

but with .*? the matches would be:

foo \cite[aaa]\cite[bbb] something here \cite[ccc] bar
    ^---1----^^---2----^                ^---3----^

Minor note: ] does not need escaping.
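The same comparison can be reproduced outside Notepad++. This sketch uses Python's re module as a stand-in (an assumption; Notepad++ uses a different regex engine, though both handle these two patterns alike):

```python
import re

text = r'foo \cite[aaa]\cite[bbb] something here \cite[ccc] bar'

# Greedy: one match running from the first \cite[ to the last ].
greedy = [m.group(0) for m in re.finditer(r'\\cite\[(.*)]', text)]

# Reluctant: each \cite[...] is matched on its own.
lazy = [m.group(0) for m in re.finditer(r'\\cite\[(.*?)]', text)]

print(greedy)
print(lazy)
```

The unescaped ] is accepted here as well, in line with the minor note above.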

Python non-greedy regexes

You seek the all-powerful *?

From the docs, Greedy versus Non-Greedy

the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little
text as possible.
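A short sketch of what that means in practice. The <.*> pair is the example used in the Python documentation; the {m,n}? pair is added here for illustration:

```python
import re

# Greedy .* runs to the last ">", lazy .*? stops at the first one.
greedy_tag = re.match(r'<.*>', '<a> b <c>').group()
lazy_tag = re.match(r'<.*?>', '<a> b <c>').group()

# {m,n} takes up to n repetitions, {m,n}? takes only the minimum m.
greedy_range = re.match(r'a{3,5}', 'aaaaa').group()
lazy_range = re.match(r'a{3,5}?', 'aaaaa').group()

print(greedy_tag, '|', lazy_tag)
print(greedy_range, '|', lazy_range)
```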

How do greedy / lazy (non-greedy) / possessive quantifiers work internally?

Take your input string fooaaafoooobbbfoo.

Case 1: When you're using this regex:

foo.*

First, remember that the engine traverses the input from left to right.

With that in mind, the above regex matches the first foo, at the start of the input, and then .* greedily matches the longest possible text, which is the rest of the input after foo. At this point matching stops, as there is nothing left to match after .* in your pattern.

Case 2: When you're using this regex:

.*foo

Here again .* greedily matches as much as it can, up to the end of the input, and then backtracks just far enough to match the last foo, which is right at the end of the input.

Case 3: When you're using this regex:

foo.*foo

This matches the first foo found in the input, i.e. the foo at the start; then .* greedily matches the longest possible text before the last foo, which is right at the end of the input.

Case 4: When you're using this regex with lazy quantifier:

foo.*?foo

This matches the first foo found in the input, i.e. the foo at the start; then .*? lazily matches the shortest possible text before the next foo, which is the second instance of foo, starting at position 6 (counting from 0) in the input.

Case 5: When you're using this regex with possessive quantifier:

foo.*+foo

This matches the first foo found in the input, i.e. the foo at the start; then .*+ uses a possessive quantifier, which means "match as many times as possible, without giving back". It greedily matches the longest possible text, up to the end of the input. Since a possessive quantifier does not allow the engine to backtrack, the foo at the end of the pattern causes the match to fail: there is no input left for it to match.
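The five cases can be checked with a short script. This is a sketch in Python, chosen here for illustration; note that Python's re module only accepts possessive quantifiers from version 3.11 onward, so case 5 is guarded by a version check:

```python
import re
import sys

s = 'fooaaafoooobbbfoo'

case1 = re.search(r'foo.*', s).group()      # first foo, then the rest of the input
case2 = re.search(r'.*foo', s).group()      # everything up to and including the last foo
case3 = re.search(r'foo.*foo', s).group()   # first foo through the last foo
case4 = re.search(r'foo.*?foo', s).group()  # first foo through the next foo

# Case 5: .*+ consumes the rest of the input and never gives anything back,
# so the trailing foo in the pattern cannot match and the search fails.
case5 = None
if sys.version_info >= (3, 11):
    case5 = re.search(r'foo.*+foo', s)

print(case1, case2, case3, case4, case5, sep='\n')
```

Cases 1 to 3 all return the whole input, case 4 returns fooaaafoo, and case 5 finds no match.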


