Pcre Regular Expression Overlapping Matches

PCRE regular expression overlapping matches

A common trick is to use capturing technique inside an unanchored positive lookahead. Use this regex with preg_match_all:

(?=(1....1))

See regex demo

The values are in $matches[1]:

$re = "/(?=(1....1))/"; 
$str = "001110000100001100001"; 
preg_match_all($re, $str, $matches);
print_r($matches[1]);

See lookahead reference:

Lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)).

Overlapping regex matches

Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:

regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
tmp = RNA
while (match = regex.search(tmp)):
    matches.append(match)
    tmp = tmp[match.start()-2:]  #Back up two to get the UG portion.  Shouldn't matter, but safer.

for m in matches:
    print m.group(0)

Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG, as it continues on through to match UAGUGA and get cut off at UAA. To combat this, then, you might wish to use the ? operator to make your operator lazy - an approach that then will be unable to capture the longer substring.

Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA? What's the correct behaviour there?

I might favor a regular expression free approach, by iterating over the string and its substrings:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append(RNA[start+3:])
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate.find(terminator, end) > -1):
            end = candidate.find(terminator, end)
            matches.append(candidate[:end])
            end += 1

for match in matches:
    print match

This way, you're sure to get all matches, no matter what.

If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append((RNA[start+3:], start+3))
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate[0].find(terminator, end) > -1):
            end = candidate[0].find(terminator, end)
            matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
            end += 1

for match in matches:
    print "%d - %d: %s" % match

which prints:

7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG

Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:

from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0))

That placed before the final print nets you:

007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG

Overlapping matches in Regex

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))

Original answer (2008)

A possible solution could be to use a positive look behind:

(?<=n)n

It would give you the end position of:

nnnn
nnnn
nnnn

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

I would prefer to his proposition (?=nn)n the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

Any valid regular expression can be used inside the lookahead.
If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).

Lookahead regex failing to find the same overlapping matches

We can generalize the solution to any regex.

Let's say we have a valid regex pattern which you want to search for overlapping matches.

In order to get overlapping matches, we need to avoid consuming characters in each match, relying on the bump-along mechanism to evaluate the regex on every position of the string. This can be achieved by surrounding the whole regex in a look-ahead (?=<pattern>), and we can nest a capturing group to capture the match (?=(<pattern>)).

_{This technique works for Python re engine since after it found an empty match, it will simply bump-along and will not re-evaluate the regex at the same position but looking for non-empty match on the second try like PCRE engine.}

Sample code:

import re

inp = '10.5.20.52.48.10'
matches = [m[0] if type(m) is tuple else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]

Output:

['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']

If the original pattern doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.

However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.

Do note that this method doesn't give you overlapping matches starting at the same index (e.g. looking for a+ in aaa will give you 3 matches instead of 6 matches). It's not possible to implement overlapping match starting at the same index in most regex flavors/library, except for Perl.

Java regex - overlapping matches

Make the matcher attempt to start its next scan from the latter \d+.

Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
    do {
        allMatches.add(m.group());
    } while (m.find(m.start(1)));
}

Overlapping matches in R

The standard regmatches does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Obseerve

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.

I've created a regcapturedmatches() function that I often use for such tasks. For example

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

The gregexpr is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.

Get overlapping positions of pattern in string in r

You can use gregexpr and a PCRE regex with a capturing group enclosing the entire positive lookahead pattern:

pattern <- "(?=([1-]{2}))"
s <- "-1-"
res <- gregexpr(pattern, s, perl=TRUE)
starts <- attr(res[[1]],'capture.start') 
lengths <- attr(res[[1]],'capture.length')
ends <- starts + lengths - 1
df_positions <- do.call(rbind, Map(data.frame, start=starts, end=ends, length=lengths))
df_positions

Output:

  start end length
1     1   2      2
2     2   3      2

See an R demo

Pcre Regular Expression Overlapping Matches