How to Find Overlapping Matches With a Regexp

How to find overlapping matches with a regexp?

findall doesn't yield overlapping matches by default. This expression does however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

How to match all overlapping pattern

There's no reason to use a regex for this. Regex is overkill for such a simple task--it's over complex, and less efficient. Instead you should just use strings.Index, and a for loop:

input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
idx = append(idx, j+i)
j += i+1
}
fmt.Println("Indexes:", idx)

Playground link

How can I match overlapping strings with regex?

You can't do this with a regex alone, but you can get pretty close:

var pat = /(?=(\d{3}))\d/g;var results = [];var match;
while ( (match = pat.exec( '1234567' ) ) != null ) { results.push( match[1] );}
console.log(results);

How to use regex to find all overlapping matches

Use a capturing group inside a lookahead. The lookahead captures the text you're interested in, but the actual match is technically the zero-width substring before the lookahead, so the matches are technically non-overlapping:

import re 
s = "123456789123456789"
matches = re.finditer(r'(?=(\d{10}))',s)
results = [int(match.group(1)) for match in matches]
# results:
# [1234567891,
# 2345678912,
# 3456789123,
# 4567891234,
# 5678912345,
# 6789123456,
# 7891234567,
# 8912345678,
# 9123456789]

Overlapping matches in Regex

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))

Original answer (2008)

A possible solution could be to use a positive look behind:

(?<=n)n

It would give you the end position of:

  1. nnnn
     
  2. nnnn
     
  3. nnnn

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

I would prefer to his proposition (?=nn)n the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).



Regex including overlapping matches with same start

As I've said above, regex is a primarily linear and single-rule-only kind of engine - you can choose between greedy capture or not, but you cannot select both. Also, most regex engines do not support overlapping matches (and even those who support it kind of fake it with substrings / forced head move) because it also doesn't fit regex philosophy.

If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:

def find_substrings(data, start, end):
result = []
s_len = len(start) # a shortcut for `start` length
e_len = len(end) # a shortcut for `end` length
current_pos = data.find(start) # find the first occurrence of `start`
while current_pos != -1: # loop while we can find `start` in our data
# find the first occurrence of `end` after the current occurrence of `start`
end_pos = data.find(end, current_pos + s_len)
while end_pos != -1: # loop while we can find `end` after the current `start`
end_pos += e_len # just so we include the selected substring
result.append(data[current_pos:end_pos]) # add the current substring
end_pos = data.find(end, end_pos) # find the next `end` after the curr. `start`
current_pos = data.find(start, current_pos + s_len) # find the next `start`
return result

Which will yield:

substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']

But you'll have to modify it for more complex matches.

python 3 regex - find all overlapping matches' start and end index in a string

The problem you get is related to the fact that a lookahead is a zero-width assertion that consumes (i.e. adds to the match result) no text. It is a mere position in the string. Thus, all your matches start and end at the same location in the string.

You need to enclose the lookahead pattern with a capturing group (i.e. (?=(11111))) and access start and end of group 1 (with i.start(1) and i.end(1)):

import re
s = '1'*15
result = re.finditer(r'(?=(11111))', s)

for i in result:
print(i.start(1), i.end(1))

See the Python demo, its output is

(0, 5)
(1, 6)
(2, 7)
(3, 8)
(4, 9)
(5, 10)
(6, 11)
(7, 12)
(8, 13)
(9, 14)
(10, 15)

Regexp to capture overlapping matches

You can use a positive lookahead assertion to get all 3 matches:

(?=(A.A))

RegEx Demo

For your input it finds 3 matches in captured group #1:

  1. ABA
  2. ACA
  3. ADA

PHP Code:

if (preg_match_all('/(?=(A.A))/', "ABACADA", $m))
print_r($m[1]); // printing index 1

Output:

Array
(
[0] => ABA
[1] => ACA
[2] => ADA
)

Find overlapping matches and submatches using regular expressions in Python

I think you can do it this way.

The subsets will consist of ever decreasing size of the previous matches.

That's about all you have to do.

So, it's fairly straight forward to design the regex.

The regex will only match multiples of 3 chars.

The beginning and middle are captured in group 1.

This is used for the new text value which is just the last match

minus the last 3 chars.

Regex explained:

 (                             # (1 start), Whole match minus last 3 chars
(?: ATA | ATT ) # Starts with one of these 3 char sequence
(?: # Cluster group
[ATCGN]{3} # Any 3 char sequence consisting of these chars
)+ # End cluster, do 1 to many times
) # (1 end)
(?: AGA | AGG | TAA | TAG ) # Last 3 char sequence, one of these

Python code sample:

Demo

import re

r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"

m = r.search(text)
while m:
print("Found: " + m.group(0))
text = m.group(1)
m = r.search(text)

Output:

Found:  ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG

Using this method, the subsets being tested are these:

ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG

ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC

ATAAAGCCATTTACCGTACATAGCACATTA

ATTTACCGTACA

We can benchmark the time the regex takes to match these.

Regex1:   ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160


Related Topics



Leave a reply



Submit