How to Match Overlapping Strings With Regex

How can I match overlapping strings with regex?

You can't do this with a regex alone, but you can get pretty close:

var pat = /(?=(\d{3}))\d/g;

var results = [];

var match;

while ( (match = pat.exec( '1234567' ) ) != null ) {

results.push( match[1] );

}

console.log(results);

How to match all overlapping pattern

There's no reason to use a regex for this. Regex is overkill for such a simple task--it's over complex, and less efficient. Instead you should just use strings.Index, and a for loop:

input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
idx = append(idx, j+i)
j += i+1
}
fmt.Println("Indexes:", idx)

Playground link

Replacing overlapping matches in a string (regex or string operations)

I think I would forgo regex and write a simple loop as below (there is room for improvement), because I think it would be quicker and more understandable.

        public IEnumerable<int> FindStartingOccurrences(string input, string pattern)
{
var occurrences = new List<int>();

for (int i=0; i<input.Length; i++)
{
if (input.Length+1 > i+pattern.Length)
{
if (input.Substring(i, pattern.Length) == pattern)
{
occurrences.Add(i);
}
}
}

return occurrences;
}

and then call like:

var occurrences = FindStartingOccurrences("aaabbaaaaaccaadaaa", "aa");

Overlapping matches in Regex

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))

Original answer (2008)

A possible solution could be to use a positive look behind:

(?<=n)n

It would give you the end position of:

  1. nnnn
     
  2. nnnn
     
  3. nnnn

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

I would prefer to his proposition (?=nn)n the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).



Regexp to capture overlapping matches

You can use a positive lookahead assertion to get all 3 matches:

(?=(A.A))

RegEx Demo

For your input it finds 3 matches in captured group #1:

  1. ABA
  2. ACA
  3. ADA

PHP Code:

if (preg_match_all('/(?=(A.A))/', "ABACADA", $m))
print_r($m[1]); // printing index 1

Output:

Array
(
[0] => ABA
[1] => ACA
[2] => ADA
)

Regex including overlapping matches with same start

As I've said above, regex is a primarily linear and single-rule-only kind of engine - you can choose between greedy capture or not, but you cannot select both. Also, most regex engines do not support overlapping matches (and even those who support it kind of fake it with substrings / forced head move) because it also doesn't fit regex philosophy.

If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:

def find_substrings(data, start, end):
result = []
s_len = len(start) # a shortcut for `start` length
e_len = len(end) # a shortcut for `end` length
current_pos = data.find(start) # find the first occurrence of `start`
while current_pos != -1: # loop while we can find `start` in our data
# find the first occurrence of `end` after the current occurrence of `start`
end_pos = data.find(end, current_pos + s_len)
while end_pos != -1: # loop while we can find `end` after the current `start`
end_pos += e_len # just so we include the selected substring
result.append(data[current_pos:end_pos]) # add the current substring
end_pos = data.find(end, end_pos) # find the next `end` after the curr. `start`
current_pos = data.find(start, current_pos + s_len) # find the next `start`
return result

Which will yield:

substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']

But you'll have to modify it for more complex matches.

How to find overlapping matches with a regexp?

findall doesn't yield overlapping matches by default. This expression does however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.



Related Topics



Leave a reply



Submit