Python Non-Greedy Regexes

Python non-greedy regexes

You seek the all-powerful *?

From the docs, Greedy versus Non-Greedy:

the non-greedy qualifiers *?, +?, ??, or {m,n}? [...] match as little
text as possible.
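As a minimal sketch of the difference (the sample string and variable names are made up for illustration), compare a greedy and a non-greedy quantifier on the same input:

```python
import re

# Hypothetical sample text containing two angle-bracket tags.
text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", text).group()

print(greedy)  # <a> b <c>
print(lazy)    # <a>
```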

How can I use regular expression with non greedy in python from right to left?

You could use re.findall with the following regex pattern:

\bstep into(?:(?!step into).)*?\bstep out\b

Python script:

import re

inp = """step into
1
2
step into
3
4
step out"""
matches = re.findall(r'\bstep into(?:(?!step into).)*?\bstep out\b', inp, flags=re.DOTALL)
print(matches)

This prints:

['step into\n3\n4\nstep out']

Here is an explanation of the regex pattern:

\bstep into               match "step into"
(?:(?!step into).)*?      match any content, across newlines, as long as "step into"
                          is NOT encountered before seeing "step out"
\bstep out\b              match the first "step out" after "step into"
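To see why the negative lookahead is needed, here is a small sketch that runs the pattern with and without it on the same input (the variable names plain and tempered are only for illustration). A lazy .*? on its own still starts at the first "step into", because the regex engine always prefers the leftmost match:

```python
import re

inp = "step into\n1\n2\nstep into\n3\n4\nstep out"

# Lazy quantifier only: the match starts at the FIRST "step into"
# and expands until it reaches "step out".
plain = re.findall(r"\bstep into.*?\bstep out\b", inp, flags=re.DOTALL)

# With the lookahead, no character inside the match may begin another
# "step into", so only the innermost block can match.
tempered = re.findall(
    r"\bstep into(?:(?!step into).)*?\bstep out\b", inp, flags=re.DOTALL
)

print(plain)     # ['step into\n1\n2\nstep into\n3\n4\nstep out']
print(tempered)  # ['step into\n3\n4\nstep out']
```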

Non greedy regex pattern for email Python

I guess what you are looking for is:

pattern = re.compile(r"[a-zA-Z]+[\w.-]*@[a-z]+[.][a-z]+")

[a-zA-Z]+ -> the email starts with one or more letters

\w -> any letter, digit, or underscore

[\w.-]* -> 0 or more letters, digits, underscores, ., or -

[a-z]+ -> a lower case domain 1 or more characters long

[.] -> put the . in square brackets, because . has a special meaning in regex

import re

pattern = re.compile(r"[a-zA-Z]+[\w.-]*@[a-z]+[.][a-z]+")
matches = pattern.findall("aar@g.com ooo11..com hellow_world@hello.world 1234hey@hey.com")
print(matches)

Result:

['aar@g.com', 'hellow_world@hello.world', 'hey@hey.com']

Is there any point in using non-greedy in the middle of pattern in Regex Python

They make sense when something comes after them in the pattern. For example, compare

import re

p1 = re.compile(r'a.*?b')
p2 = re.compile(r'a.*b')

x = 'abb'
p1.match(x).group() # = 'ab'
p2.match(x).group() # = 'abb'

More concretely, they’re useful if you want to exclude a delimiter. For example, to match text between quotes you could write

pattern = r'"[^"]*"'

Or you could write

pattern = r'".*?"'
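A quick sketch of how the two patterns behave, next to the greedy version for contrast (the sample text is made up for illustration):

```python
import re

# Hypothetical input with two quoted strings.
text = 'say "hello" and "bye"'

# Negated character class: a quote, then anything that is not a quote.
negated = re.findall(r'"[^"]*"', text)

# Non-greedy: stop at the first closing quote.
lazy = re.findall(r'".*?"', text)

# Greedy: runs from the first quote to the LAST quote in the text.
greedy = re.findall(r'".*"', text)

print(negated)  # ['"hello"', '"bye"']
print(lazy)     # ['"hello"', '"bye"']
print(greedy)   # ['"hello" and "bye"']
```

On input like this the first two give the same result. One difference to keep in mind: . does not match a newline unless re.DOTALL is set, while [^"] does.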

Python regex non greedy

In your expression "^this .* that (?P<occurrence>.*?) " the first .* is greedy, so it matches all the way to the last that.

Change your example to:

import re

content = "this is how we want that first_occurrence over there but that second_occurrence it is always wrong when "
match = re.search(r"^this .*? that (?P<occurrence>.*?) ", content)
print(match.groupdict())

This prints:

{'occurrence': 'first_occurrence'}
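For comparison, here is a sketch of what the original greedy version does on the same string (the names greedy and lazy are only for illustration). The greedy .* runs to the end of the string and backtracks to the last " that ", so the group picks up the second occurrence:

```python
import re

content = "this is how we want that first_occurrence over there but that second_occurrence it is always wrong when "

# Original pattern: the first .* is greedy.
greedy = re.search(r"^this .* that (?P<occurrence>.*?) ", content)

# Fixed pattern: the first .*? is non-greedy.
lazy = re.search(r"^this .*? that (?P<occurrence>.*?) ", content)

print(greedy.group("occurrence"))  # second_occurrence
print(lazy.group("occurrence"))    # first_occurrence
```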

non greedy Python regex from end of string

Your pattern (_.+)?\.xml$ has an optional capturing group that starts at the first underscore and runs until .xml can be matched at the end of the string. It does not take into account how many underscores should appear in between.

To only match the last part you can omit the capturing group. You could use a negated character class and use the anchor $ to assert the end of the line as it is the last part:

[^_]+_[^_]+\.xml$

That will match

  • [^_]+ Match any character except _ one or more times
  • _ Match a literal underscore
  • [^_]+ Match any character except _ one or more times
  • \.xml$ Match .xml at the end of the string

For example:

import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'
# Raw string so that \. reaches the regex engine unchanged
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
if result:
    print(result.group())  # XX1234567890_84481.xml
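To make the difference visible, this sketch runs the original pattern and the suggested one on the same file name (the variable names are only for illustration):

```python
import re

test1 = 'AB_x-y-z_XX1234567890_84481.xml'

# Original pattern: the optional group starts at the FIRST underscore.
original = re.search(r'(_.+)?\.xml$', test1)

# Suggested pattern: exactly one underscore between two runs of non-underscores.
fixed = re.search(r'[^_]+_[^_]+\.xml$', test1)

print(original.group(1))  # _x-y-z_XX1234567890_84481
print(fixed.group())      # XX1234567890_84481.xml
```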

Non greedy regex within parenthesis and contains text

Instead of allowing all characters, you can allow all characters except the closing parenthesis by using [^\)] where the . is now.

re.findall(r'\(([^\)]*meh[^\)]*?)\)', test)
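Here is a runnable sketch. The original test string is not shown, so the one below is a made-up stand-in, and the dot-based pattern is an assumption about what the question roughly looked like:

```python
import re

# Hypothetical input; the real `test` string from the question is unknown.
test = "(foo) (meh bar) (baz meh) (nope)"

# Assumed dot-based version: .*? is lazy, but . can still cross a ')'
# while the engine looks for "meh", so the first match spills over.
with_dot = re.findall(r'\((.*?meh.*?)\)', test)

# Negated class: the match can never leave the current pair of parentheses.
with_class = re.findall(r'\(([^\)]*meh[^\)]*?)\)', test)

print(with_dot)    # ['foo) (meh bar', 'baz meh']
print(with_class)  # ['meh bar', 'baz meh']
```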

Python Regex non greedy match

The issue you're experiencing is due to the nature of backtracking in regex. The regex engine is parsing the string at each given position therein, and as such, will attempt every option of the pattern until it either matches or fails at that position. If it matches, it will consume those characters and if it fails it will continue to the next position until the end of the string is met.

The keyword here is backtracks. I think Microsoft's documentation does a great job of defining this term (the important part is the clause about returning to a previous saved state):

Backtracking occurs when a regular expression pattern contains
optional quantifiers or alternation constructs, and the regular
expression engine returns to a previous saved state to continue its
search for a match. Backtracking is central to the power of regular
expressions; it makes it possible for expressions to be powerful and
flexible, and to match very complex patterns. At the same time, this
power comes at a cost. Backtracking is often the single most important
factor that affects the performance of the regular expression engine.
Fortunately, the developer has control over the behavior of the
regular expression engine and how it uses backtracking. This topic
explains how backtracking works and how it can be controlled.

The regex engine backtracks to a previous saved state. It cannot forward track to a future saved state, although that would be pretty neat! Since you've specified that your match should end with at (the lazy quantifier precedes it), it will exhaust every regex option until \w{1,2} ending in at proves true.

How can you get around this? Well, the easiest way is probably to use a capture group:

\w*(\w{1,2}?at)
\w*(\w{1,2}at) # yields same results as above (but in more steps)
\w*(\wat) # yields same results as above (faster method)
\wat # yields same results as above (fastest method)
\b\w{1,2}at\b # perhaps this is what OP is after?

  • \w* Matches any word character any number of times. This is a fix to allow us to simulate forward tracking (this is not a proper term, just used in the context of the rest of my answer above). It will match as many characters as possible and work its way backwards until a match occurs.
  • The rest of the pattern the OP already had. In fact, \w{2} will never be met since \w will always only be met once (since the \w* token is greedy), therefore \wat can be used instead of \w*(\wat). Perhaps the OP intended to use anchors such as \b in the regex: \b\w{1,2}at\b? This doesn't differ from the original nature of the OP's regex either, since making the quantifier lazy would have theoretically yielded the same results in the context of forward tracking (one match of \w would have satisfied \w{1,2}?, thus \w{2} would never be reached).
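The patterns above can be tried out with a short sketch. The OP's input is not shown, so the word "combat" is a made-up example, and \w{1,2}?at is assumed to be the OP's original pattern:

```python
import re

word = "combat"  # hypothetical input

# Assumed original pattern: the leftmost possible start wins, so the lazy
# quantifier still has to expand to two characters to reach "at".
original = re.search(r"\w{1,2}?at", word).group()

# Greedy \w* in front consumes as much as it can, then backtracks.
lazy_group = re.search(r"\w*(\w{1,2}?at)", word).group(1)
greedy_group = re.search(r"\w*(\w{1,2}at)", word).group(1)
single = re.search(r"\wat", word).group()

# With word boundaries the whole word must be 3 or 4 characters long.
bounded = re.search(r"\b\w{1,2}at\b", word)

print(original)      # mbat
print(lazy_group)    # bat
print(greedy_group)  # bat
print(single)        # bat
print(bounded)       # None
```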

