Python Regex Find All Overlapping Matches

Python regex find all overlapping matches?

Use a capturing group inside a lookahead. The lookahead captures the text you're interested in, but the actual match is technically the zero-width substring before the lookahead, so the matches are technically non-overlapping:

import re 
s = "123456789123456789"
matches = re.finditer(r'(?=(\d{10}))',s)
results = [int(match.group(1)) for match in matches]
# results:
# [1234567891,
# 2345678912,
# 3456789123,
# 4567891234,
# 5678912345,
# 6789123456,
# 7891234567,
# 8912345678,
# 9123456789]

python 3 regex - find all overlapping matches' start and end index in a string

The problem you get is related to the fact that a lookahead is a zero-width assertion that consumes (i.e. adds to the match result) no text. It is a mere position in the string. Thus, all your matches start and end at the same location in the string.

You need to enclose the lookahead pattern with a capturing group (i.e. (?=(11111))) and access start and end of group 1 (with i.start(1) and i.end(1)):

import re
s = '1'*15
result = re.finditer(r'(?=(11111))', s)

for i in result:
print(i.start(1), i.end(1))

See the Python demo, its output is

(0, 5)
(1, 6)
(2, 7)
(3, 8)
(4, 9)
(5, 10)
(6, 11)
(7, 12)
(8, 13)
(9, 14)
(10, 15)

How do you get overlapping matches with re.findall()?

Note that 'tex1 ', ' text2 ' and ' taxw ' are overlapping, the spaces before text2 and taxw are matched and consumed by the pattern during the preceding iteration.

What you may do is place the final \s* into a capturing group and just concatenate it with the whole match:

import re
x=" tex1 text2 taxw ello how are 123 "
y=x.split()
sear=re.compile(r'\s*\b\w*x\w*\b(?=(\s*))')
a=["{}{}".format(x.group(),x.group(1)) for x in sear.finditer(x)]
print(a) # => [' tex1 ', ' text2 ', ' taxw ']

See the Python demo

The (?=(\s*)) is a non-consuming positive lookahead that does not move the regex index, and thus, it can match the preceding spaces before the following matches.

How to find overlapping matches with a regexp?

findall doesn't yield overlapping matches by default. This expression does however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

Python Regex : I would like to find all overlapping and non overlapping pattern matches

Different order, but same result and significantly faster likely:

import re

substrings=['hello','world','hello world']
joined='|'.join(substrings)
reg=re.compile(rf"\b(?={joined}\b)")

for m in reg.finditer(sentence):
for e in substrings:
offset=m.span()[0]
if sentence[offset:offset+len(e)]==e:
print((offset,offset+len(e)), e)

If you want to make sure that you do not match hello worldLY (ie, just the prefix of the substring) you can do:

substrings=['hello','world','hello world']
joined='|'.join(substrings)
reg=re.compile(rf"\b(?={joined}\b)")
indvidual=list(map(re.compile, [rf'\b({e})\b' for e in substrings]))
for m in reg.finditer(sentence):
for i,e in enumerate(indvidual):
offset=m.span()[0]
if m2:=e.match(sentence[offset:]):
print((offset,offset+m2.span()[1]), substrings[i])

Either prints:

(5, 10) hello
(5, 16) hello world
(11, 16) world
(21, 26) hello

Python regular expressions with overlap

Here are two ways that can be done in one pass.

Match what you don't want, capture what you do want

With this approach by matching the following regular expression we obtain the words of interest in capture group 1, paying no attention to the matches that do not result in a capture.

(?:^|\").*?(?:$|\")|(\b(?:jake|donald|rita|katie)\b)

This expression must of course be constructed programmatically from the array keywords, which is not difficult.

Demo

This makes use of the fact that well-formed strings contain an even number of double-quotes.

The string can be broken down as follows.

(?:^|\")  # match the beginning of the string or a double-quote
.*? # match zero or more characters lazily
(?:$|\") # match the end of the string or a double-quote
| # or
( # begin capture group 1
\b # match word boundary
(?:jake|donald|rita|katie) # match one of the four words
\b # match word boundary
) # end capture group 1

The word boundaries are to avoid matching words such as 'heritage' (which contains the name of a lovely meter maid).


Count trailing double-quotes to determine if a word is within a quoted string

Observe that a word is within a doubly-quoted string if and only if the part of the string that follows the word contains an odd number of double quotes. Suppose, for example, the string were:

'jack betty "donald jake rita" lorie katie danny "katie"'

^^^^^^ ^^^^ ^^^^ ^^^^^

"donald", "jake" and "rita" are each followed by three double-quotes, so we infer they are within a doubly-quoted string. Same for "katie" which followed by one double-quote. All the remaining words are followed by a an even number of double-quotes so we infer they are not within a double-quoted string. Naturally, this is assuming that the entire string is well-formed in the sense that it contains an even number of double-quotes.

You can use re.findall with the following regular expression to identify the words of interest.

\b(?:jake|donald|rita|katie)\b(?=[^\"]*(?:(?:\"[^\"]*){2})*\"[^\"]*$)

Demo.

The regex can be broken down as follows.

\b             # match a word boundary
(?:jake|donald|rita|katie) # match one of four words in a non-capture group
\b # match a word boundary
(?= # begin a positive lookhead
[^\"]* # match zero or more chars other than dq's (double-quotes)
(?: # begin a non-capture group
(?: # begin a non-capture group
\"[^\"]* # match a dq followed by zero or more chars other than a dq
){2} # end non-capture group and execute twice
)* # end non-capture group and execute zero or more times
\"[^\"]* # match a dq followed by zero or more chars other than a dq
$ # match end of string
) # end positive lookahead

Use of this expression may require considerable backtracking by the regex engine. As shown at the link, the regex engine required 261 steps to parse the string given above.

Regex including overlapping matches with same start

As I've said above, regex is a primarily linear and single-rule-only kind of engine - you can choose between greedy capture or not, but you cannot select both. Also, most regex engines do not support overlapping matches (and even those who support it kind of fake it with substrings / forced head move) because it also doesn't fit regex philosophy.

If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:

def find_substrings(data, start, end):
result = []
s_len = len(start) # a shortcut for `start` length
e_len = len(end) # a shortcut for `end` length
current_pos = data.find(start) # find the first occurrence of `start`
while current_pos != -1: # loop while we can find `start` in our data
# find the first occurrence of `end` after the current occurrence of `start`
end_pos = data.find(end, current_pos + s_len)
while end_pos != -1: # loop while we can find `end` after the current `start`
end_pos += e_len # just so we include the selected substring
result.append(data[current_pos:end_pos]) # add the current substring
end_pos = data.find(end, end_pos) # find the next `end` after the curr. `start`
current_pos = data.find(start, current_pos + s_len) # find the next `start`
return result

Which will yield:

substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']

But you'll have to modify it for more complex matches.



Related Topics



Leave a reply



Submit