How to find overlapping matches with a regexp?
findall
doesn't yield overlapping matches by default. This expression does however:
>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']
Here (?=...)
is a lookahead assertion:
(?=...)
matches if...
matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov)
will match'Isaac '
only if it’s followed by'Asimov'
.
How to match all overlapping pattern
There's no reason to use a regex for this. Regex is overkill for such a simple task--it's over complex, and less efficient. Instead you should just use strings.Index
, and a for loop:
input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
idx = append(idx, j+i)
j += i+1
}
fmt.Println("Indexes:", idx)
Playground link
How can I match overlapping strings with regex?
You can't do this with a regex alone, but you can get pretty close:
var pat = /(?=(\d{3}))\d/g;var results = [];var match;
while ( (match = pat.exec( '1234567' ) ) != null ) { results.push( match[1] );}
console.log(results);
How to use regex to find all overlapping matches
Use a capturing group inside a lookahead. The lookahead captures the text you're interested in, but the actual match is technically the zero-width substring before the lookahead, so the matches are technically non-overlapping:
import re
s = "123456789123456789"
matches = re.finditer(r'(?=(\d{10}))',s)
results = [int(match.group(1)) for match in matches]
# results:
# [1234567891,
# 2345678912,
# 3456789123,
# 4567891234,
# 5678912345,
# 6789123456,
# 7891234567,
# 8912345678,
# 9123456789]
Overlapping matches in Regex
Update 2016:
To get nn
, nn
, nn
, SDJMcHattie proposes in the comments (?=(nn))
(see regex101).
(?=(nn))
Original answer (2008)
A possible solution could be to use a positive look behind:
(?<=n)n
It would give you the end position of:
- nnnn
- nnnn
- nnnn
As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)
I would prefer to his proposition (?=nn)n
the simpler form:
(n)(?=(n))
That would reference the first position of the strings you want and would capture the second n in group(2).
That is so because:
- Any valid regular expression can be used inside the lookahead.
- If it contains capturing parentheses, the backreferences will be saved.
So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
Regex including overlapping matches with same start
As I've said above, regex is a primarily linear and single-rule-only kind of engine - you can choose between greedy capture or not, but you cannot select both. Also, most regex engines do not support overlapping matches (and even those who support it kind of fake it with substrings / forced head move) because it also doesn't fit regex philosophy.
If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:
def find_substrings(data, start, end):
result = []
s_len = len(start) # a shortcut for `start` length
e_len = len(end) # a shortcut for `end` length
current_pos = data.find(start) # find the first occurrence of `start`
while current_pos != -1: # loop while we can find `start` in our data
# find the first occurrence of `end` after the current occurrence of `start`
end_pos = data.find(end, current_pos + s_len)
while end_pos != -1: # loop while we can find `end` after the current `start`
end_pos += e_len # just so we include the selected substring
result.append(data[current_pos:end_pos]) # add the current substring
end_pos = data.find(end, end_pos) # find the next `end` after the curr. `start`
current_pos = data.find(start, current_pos + s_len) # find the next `start`
return result
Which will yield:
substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']
But you'll have to modify it for more complex matches.
python 3 regex - find all overlapping matches' start and end index in a string
The problem you get is related to the fact that a lookahead is a zero-width assertion that consumes (i.e. adds to the match result) no text. It is a mere position in the string. Thus, all your matches start and end at the same location in the string.
You need to enclose the lookahead pattern with a capturing group (i.e. (?=(11111))
) and access start and end of group 1 (with i.start(1)
and i.end(1)
):
import re
s = '1'*15
result = re.finditer(r'(?=(11111))', s)
for i in result:
print(i.start(1), i.end(1))
See the Python demo, its output is
(0, 5)
(1, 6)
(2, 7)
(3, 8)
(4, 9)
(5, 10)
(6, 11)
(7, 12)
(8, 13)
(9, 14)
(10, 15)
Regexp to capture overlapping matches
You can use a positive lookahead assertion to get all 3 matches:
(?=(A.A))
RegEx Demo
For your input it finds 3 matches in captured group #1:
ABA
ACA
ADA
PHP Code:
if (preg_match_all('/(?=(A.A))/', "ABACADA", $m))
print_r($m[1]); // printing index 1
Output:
Array
(
[0] => ABA
[1] => ACA
[2] => ADA
)
Find overlapping matches and submatches using regular expressions in Python
I think you can do it this way.
The subsets will consist of ever decreasing size of the previous matches.
That's about all you have to do.
So, it's fairly straight forward to design the regex.
The regex will only match multiples of 3 chars.
The beginning and middle are captured in group 1.
This is used for the new text value which is just the last match
minus the last 3 chars.
Regex explained:
( # (1 start), Whole match minus last 3 chars
(?: ATA | ATT ) # Starts with one of these 3 char sequence
(?: # Cluster group
[ATCGN]{3} # Any 3 char sequence consisting of these chars
)+ # End cluster, do 1 to many times
) # (1 end)
(?: AGA | AGG | TAA | TAG ) # Last 3 char sequence, one of these
Python code sample:
Demo
import re
r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
m = r.search(text)
while m:
print("Found: " + m.group(0))
text = m.group(1)
m = r.search(text)
Output:
Found: ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG
Using this method, the subsets being tested are these:
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA
We can benchmark the time the regex takes to match these.
Regex1: ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160
Related Topics
How to Create a Text Input Box With Pygame
In Python, How to Determine If an Object Is Iterable
Bare Asterisk in Function Arguments
How to Find the Cumulative Sum of Numbers in a List
What Is the Python Equivalent of Static Variables Inside a Function
How to Pass a String into Subprocess.Popen (Using the Stdin Argument)
Manually Raising (Throwing) an Exception in Python
How to Check If a String Is a Substring of Items in a List of Strings
How to Count the Frequency of the Elements in an Unordered List
Identify Groups of Continuous Numbers in a List
What Exactly Is Current Working Directory
How Can the Euclidean Distance Be Calculated With Numpy
Unicodeencodeerror: 'Charmap' Codec Can't Encode Characters