PCRE regular expression overlapping matches
A common trick is to use capturing technique inside an unanchored positive lookahead. Use this regex with preg_match_all
:
(?=(1....1))
See regex demo
The values are in $matches[1]
:
$re = "/(?=(1....1))/";
$str = "001110000100001100001";
preg_match_all($re, $str, $matches);
print_r($matches[1]);
See lookahead reference:
Lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
If you want to store the match of the regex inside a lookahead, you have to put capturing parentheses around the regex inside the lookahead, like this:
(?=(regex))
.
Overlapping regex matches
Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:
regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
tmp = RNA
while (match = regex.search(tmp)):
matches.append(match)
tmp = tmp[match.start()-2:] #Back up two to get the UG portion. Shouldn't matter, but safer.
for m in matches:
print m.group(0)
Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA
? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG
, as it continues on through to match UAGUGA
and get cut off at UAA
. To combat this, then, you might wish to use the ?
operator to make your operator lazy - an approach that then will be unable to capture the longer substring.
Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA
? What's the correct behaviour there?
I might favor a regular expression free approach, by iterating over the string and its substrings:
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0
while (RNA.find('AUG', start) > -1):
start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
candidates.append(RNA[start+3:])
start += 1
matches = []
for candidate in candidates:
for terminator in ['UAG', 'UGA', 'UAA']:
end = 1;
while(candidate.find(terminator, end) > -1):
end = candidate.find(terminator, end)
matches.append(candidate[:end])
end += 1
for match in matches:
print match
This way, you're sure to get all matches, no matter what.
If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0
while (RNA.find('AUG', start) > -1):
start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
candidates.append((RNA[start+3:], start+3))
start += 1
matches = []
for candidate in candidates:
for terminator in ['UAG', 'UGA', 'UAA']:
end = 1;
while(candidate[0].find(terminator, end) > -1):
end = candidate[0].find(terminator, end)
matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
end += 1
for match in matches:
print "%d - %d: %s" % match
which prints:
7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG
Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:
from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0))
That placed before the final print nets you:
007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG
Overlapping matches in Regex
Update 2016:
To get nn
, nn
, nn
, SDJMcHattie proposes in the comments (?=(nn))
(see regex101).
(?=(nn))
Original answer (2008)
A possible solution could be to use a positive look behind:
(?<=n)n
It would give you the end position of:
- nnnn
- nnnn
- nnnn
As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)
I would prefer to his proposition (?=nn)n
the simpler form:
(n)(?=(n))
That would reference the first position of the strings you want and would capture the second n in group(2).
That is so because:
- Any valid regular expression can be used inside the lookahead.
- If it contains capturing parentheses, the backreferences will be saved.
So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
Lookahead regex failing to find the same overlapping matches
We can generalize the solution to any regex.
Let's say we have a valid regex pattern
which you want to search for overlapping matches.
In order to get overlapping matches, we need to avoid consuming characters in each match, relying on the bump-along mechanism to evaluate the regex on every position of the string. This can be achieved by surrounding the whole regex in a look-ahead (?=<pattern>)
, and we can nest a capturing group to capture the match (?=(<pattern>))
.
This technique works for Python re
engine since after it found an empty match, it will simply bump-along and will not re-evaluate the regex at the same position but looking for non-empty match on the second try like PCRE engine.
Sample code:
import re
inp = '10.5.20.52.48.10'
matches = [m[0] if type(m) is tuple else m for m in re.findall(r'(?=(\d+(\.\d+){2}))', inp)]
Output:
['10.5.20', '0.5.20', '5.20.52', '20.52.48', '0.52.48', '52.48.10', '2.48.10']
If the original pattern
doesn't have numbered backreferences then we can build the overlapping version of the regex with string concatenation.
However, if it does, the regex will need to be modified manually to correct the backreferences which have been shifted by the additional capturing group.
Do note that this method doesn't give you overlapping matches starting at the same index (e.g. looking for a+
in aaa
will give you 3 matches instead of 6 matches). It's not possible to implement overlapping match starting at the same index in most regex flavors/library, except for Perl.
Java regex - overlapping matches
Make the matcher attempt to start its next scan from the latter \d+
.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
do {
allMatches.add(m.group());
} while (m.find(m.start(1)));
}
Overlapping matches in R
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.
Get overlapping positions of pattern in string in r
You can use gregexpr
and a PCRE regex with a capturing group enclosing the entire positive lookahead pattern:
pattern <- "(?=([1-]{2}))"
s <- "-1-"
res <- gregexpr(pattern, s, perl=TRUE)
starts <- attr(res[[1]],'capture.start')
lengths <- attr(res[[1]],'capture.length')
ends <- starts + lengths - 1
df_positions <- do.call(rbind, Map(data.frame, start=starts, end=ends, length=lengths))
df_positions
Output:
start end length
1 1 2 2
2 2 3 2
See an R demo
Related Topics
Programmatically Determine Whether to Describe an Object with "A" or "An"
Fatal Error: Call to Undefined Method MySQLi::Error()
PHP Regex for Validating a Url
Call to Undefined Function MySQLi_Result::Num_Rows()
How Does "Do Something or Die()" Work in PHP
How to Run the Bind_Param() Statement in PHP
Get Number of Rows Matched by Update Query with PHP MySQLi
Select MySQL Rows But Rows into Columns and Column into Rows
How to Http Post Special Chars in Swift
Fatal Error: Call to Undefined Function Oci_Connect()
Laravel: I Can't Send More Then 2 Variables from Controller to a View