Returning Overlapping Regular Expressions

How to find overlapping matches with a regexp?

findall doesn't yield overlapping matches by default. This expression does however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

Overlapping matches in Regex

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))

Original answer (2008)

A possible solution could be to use a positive look behind:

(?<=n)n

It would give you the end position of:

  1. nnnn
     
  2. nnnn
     
  3. nnnn

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

I would prefer to his proposition (?=nn)n the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).



Python regular expressions with overlap

Here are two ways that can be done in one pass.

Match what you don't want, capture what you do want

With this approach by matching the following regular expression we obtain the words of interest in capture group 1, paying no attention to the matches that do not result in a capture.

(?:^|\").*?(?:$|\")|(\b(?:jake|donald|rita|katie)\b)

This expression must of course be constructed programmatically from the array keywords, which is not difficult.

Demo

This makes use of the fact that well-formed strings contain an even number of double-quotes.

The string can be broken down as follows.

(?:^|\")  # match the beginning of the string or a double-quote
.*? # match zero or more characters lazily
(?:$|\") # match the end of the string or a double-quote
| # or
( # begin capture group 1
\b # match word boundary
(?:jake|donald|rita|katie) # match one of the four words
\b # match word boundary
) # end capture group 1

The word boundaries are to avoid matching words such as 'heritage' (which contains the name of a lovely meter maid).


Count trailing double-quotes to determine if a word is within a quoted string

Observe that a word is within a doubly-quoted string if and only if the part of the string that follows the word contains an odd number of double quotes. Suppose, for example, the string were:

'jack betty "donald jake rita" lorie katie danny "katie"'

^^^^^^ ^^^^ ^^^^ ^^^^^

"donald", "jake" and "rita" are each followed by three double-quotes, so we infer they are within a doubly-quoted string. Same for "katie" which followed by one double-quote. All the remaining words are followed by a an even number of double-quotes so we infer they are not within a double-quoted string. Naturally, this is assuming that the entire string is well-formed in the sense that it contains an even number of double-quotes.

You can use re.findall with the following regular expression to identify the words of interest.

\b(?:jake|donald|rita|katie)\b(?=[^\"]*(?:(?:\"[^\"]*){2})*\"[^\"]*$)

Demo.

The regex can be broken down as follows.

\b             # match a word boundary
(?:jake|donald|rita|katie) # match one of four words in a non-capture group
\b # match a word boundary
(?= # begin a positive lookhead
[^\"]* # match zero or more chars other than dq's (double-quotes)
(?: # begin a non-capture group
(?: # begin a non-capture group
\"[^\"]* # match a dq followed by zero or more chars other than a dq
){2} # end non-capture group and execute twice
)* # end non-capture group and execute zero or more times
\"[^\"]* # match a dq followed by zero or more chars other than a dq
$ # match end of string
) # end positive lookahead

Use of this expression may require considerable backtracking by the regex engine. As shown at the link, the regex engine required 261 steps to parse the string given above.

Returning overlapping regular expressions

Sure, match an empty string and place a look-ahead after it that captures /.* in a capturing group:

Matcher m = Pattern.compile("(?=(/.*))").matcher("/abc/def/ghi");
while(m.find()) {
System.out.println(m.group(1));
}

would print:

/abc/def/ghi
/def/ghi
/ghi

How to match all overlapping pattern

There's no reason to use a regex for this. Regex is overkill for such a simple task--it's over complex, and less efficient. Instead you should just use strings.Index, and a for loop:

input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
idx = append(idx, j+i)
j += i+1
}
fmt.Println("Indexes:", idx)

Playground link

How can I match overlapping strings with regex?

You can't do this with a regex alone, but you can get pretty close:

var pat = /(?=(\d{3}))\d/g;var results = [];var match;
while ( (match = pat.exec( '1234567' ) ) != null ) { results.push( match[1] );}
console.log(results);

How to use regex to find all overlapping matches

Use a capturing group inside a lookahead. The lookahead captures the text you're interested in, but the actual match is technically the zero-width substring before the lookahead, so the matches are technically non-overlapping:

import re 
s = "123456789123456789"
matches = re.finditer(r'(?=(\d{10}))',s)
results = [int(match.group(1)) for match in matches]
# results:
# [1234567891,
# 2345678912,
# 3456789123,
# 4567891234,
# 5678912345,
# 6789123456,
# 7891234567,
# 8912345678,
# 9123456789]

PCRE regular expression overlapping matches

A common trick is to use capturing technique inside an unanchored positive lookahead. Use this regex with preg_match_all:

(?=(1....1))

See regex demo

The values are in $matches[1]:

$re = "/(?=(1....1))/"; 
$str = "001110000100001100001";
preg_match_all($re, $str, $matches);
print_r($matches[1]);

See lookahead reference:

Lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.

If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)).



Related Topics



Leave a reply



Submit