C++ Regex for Overlapping Matches

C++ regex for overlapping matches

Put your regex inside a capturing group and wrap that group in a positive lookahead.

To make it work on Mac as well, make sure the regex consumes a single character at each match by placing a . after the lookahead (or [\s\S] if you also need to match line break characters).

Then, you will need to amend the code to get the first capturing group value like this:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    std::string input_seq = "CCCC";
    std::regex re("(?=(CCC))."); // <-- PATTERN MODIFICATION
    std::sregex_iterator next(input_seq.begin(), input_seq.end(), re);
    std::sregex_iterator end;
    while (next != end) {
        std::smatch match = *next;
        std::cout << match.str(1) << "\t" << "\t" << match.position() << "\t" << "\n"; // <-- SEE HERE
        next++;
    }
    return 0;
}

Output:

CCC     0
CCC     1
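
For comparison, here is a minimal Python sketch of the same idea. Python's re module is used only for illustration, and the variable names are made up for this example. It contrasts a plain CCC pattern with the lookahead version, with and without the consuming dot.

```python
import re

input_seq = "CCCC"

# Plain pattern: the first match consumes "CCC", so only one match is found.
plain = [(m.group(0), m.start()) for m in re.finditer(r"CCC", input_seq)]

# Lookahead with a capturing group, followed by a consuming "." as in the C++ pattern.
overlapping = [(m.group(1), m.start()) for m in re.finditer(r"(?=(CCC)).", input_seq)]

# Same lookahead without the "."; Python's re steps past empty matches on its own.
zero_width = [(m.group(1), m.start()) for m in re.finditer(r"(?=(CCC))", input_seq)]

print(plain)        # [('CCC', 0)]
print(overlapping)  # [('CCC', 0), ('CCC', 1)]
print(zero_width)   # [('CCC', 0), ('CCC', 1)]
```

In Python the consuming dot makes no difference to the result; it matters only for engines or library implementations that do not advance reliably after a zero-length match.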

Regex search overlapping matches c++11

You are actually looking for overlapping matches. This can be achieved using a regex lookahead like this:

(?=((?:55|66)[0-9a-fA-F]{8,}\/r))

You will find the matches in question in group 1. The full-match, however, is empty.

Note: in the regex demo, the pattern uses a literal /r instead of a carriage return, for demonstration purposes only. The sample code below uses a real \r.

Sample Code:

#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main() {
    std::string subject("0055\r06550003665508090705\r0970");
    try {
        std::regex re("(?=((?:55|66)[0-9a-fA-F]{8,}\r))");
        std::sregex_iterator next(subject.begin(), subject.end(), re);
        std::sregex_iterator end;
        while (next != end) {
            std::smatch match = *next;
            std::cout << match.str(1) << "\n"; // group 1 holds the text; the full match is empty
            next++;
        }
    } catch (const std::regex_error& e) {
        // Syntax error in the regular expression
        std::cerr << "regex error: " << e.what() << "\n";
    }
    return 0;
}
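
To see the "empty full match, text in group 1" behaviour directly, here is a small Python sketch run on the same subject string. It is an illustration under the assumption that Python's re treats this pattern the same way as the C++ code; the variable names are invented for the example.

```python
import re

subject = "0055\r06550003665508090705\r0970"
pattern = re.compile(r"(?=((?:55|66)[0-9a-fA-F]{8,}\r))")

# group(0) is the zero-length full match; group(1) is the text seen by the lookahead.
full_matches = [m.group(0) for m in pattern.finditer(subject)]
captured = [m.group(1) for m in pattern.finditer(subject)]

print(full_matches)  # ['', '', '']
print(captured)      # ['550003665508090705\r', '665508090705\r', '5508090705\r']
```

The three captured strings start at different positions but all end at the same carriage return, which is exactly the kind of overlap a consuming pattern would miss.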

See also: Regex-Info: C++ Regular Expressions with std::regex

Counting overlapping matches with Regex in C#

This will return 4 as you expect:

Regex.Matches("020202020", @"0(?=20)").Count;

The lookahead matches the 20 without consuming it, so the next match attempt starts at the position following the first 0. You can even do the whole regex as a lookahead:

Regex.Matches("020202020", @"(?=020)").Count;

The regex engine automatically bumps ahead one position each time it makes a zero-length match. So, to find all runs of three 2's or four 2's, you can use:

Regex.Matches("22222222", @"(?=222)").Count;  // 6

...and:

Regex.Matches("22222222", @"(?=2222)").Count;  // 5

EDIT: Looking over your question again, it occurs to me you might be looking for 2's interspersed with 0's:

Regex.Matches("020202020", @"(?=20202)").Count;  // 2

If you don't know how many 0's there will be, you can use this:

Regex.Matches("020202020", @"(?=20*20*2)").Count;  // 2

And of course, you can use quantifiers to reduce repetition in the regex:

Regex.Matches("020202020", @"(?=2(?:0*2){2})").Count;  // 2
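
The same counts can be reproduced in Python, shown here only as a cross-check of the numbers above. The helper name count_overlapping is made up for this sketch; it simply wraps the given pattern in a lookahead.

```python
import re

def count_overlapping(pattern, text):
    """Count the start positions at which `pattern` matches, overlaps included."""
    # Every match of the lookahead is zero-length, so the engine moves ahead
    # one position at a time and tests each start position.
    return len(re.findall(f"(?={pattern})", text))

print(count_overlapping("020", "020202020"))          # 4
print(len(re.findall("020", "020202020")))            # 2, matches may not overlap
print(count_overlapping("222", "22222222"))           # 6
print(count_overlapping("2222", "22222222"))          # 5
print(count_overlapping("2(?:0*2){2}", "020202020"))  # 2
```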

Replacing overlapping matches in a string (regex or string operations)

I think I would forgo regex and write a simple loop as below (there is room for improvement), because I think it would be quicker and more understandable.

public IEnumerable<int> FindStartingOccurrences(string input, string pattern)
{
    var occurrences = new List<int>();

    for (int i=0; i<input.Length; i++)
    {
        if (input.Length+1 > i+pattern.Length)
        {
            if (input.Substring(i, pattern.Length) == pattern)
            {
                occurrences.Add(i);
            }
        }
    }

    return occurrences;
}

and then call like:

var occurrences = FindStartingOccurrences("aaabbaaaaaccaadaaa", "aa");
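
A rough Python equivalent of the same loop, given as a sketch rather than a drop-in replacement. It uses str.find and restarts the search one character after each hit so that overlaps are kept. The function name and the handling of an empty pattern are assumptions made for this example.

```python
def find_starting_occurrences(text, pattern):
    """Return the start index of every occurrence of `pattern`, overlaps included."""
    occurrences = []
    if not pattern:
        return occurrences  # assumption: an empty pattern has no occurrences
    i = text.find(pattern)
    while i != -1:
        occurrences.append(i)
        # Step by one, not by len(pattern), so overlapping hits are not skipped.
        i = text.find(pattern, i + 1)
    return occurrences

print(find_starting_occurrences("aaabbaaaaaccaadaaa", "aa"))
# [0, 1, 5, 6, 7, 8, 12, 15, 16]
```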

Getting overlapping regex matches in C#

Your first match already consumes the 1 that precedes the second run of zeros.

100001 0001
^^^^^^

This is the first match. The rest is just 0001 which does not match your regex.


You can circumvent this behavior if you are using lookaheads/lookbehinds:

(?<=1)(0*)(?=1)

If your engine does not support lookbehinds (JavaScript engines older than ES2018 do not), a single lookahead is enough, because only the trailing 1 needs to stay unconsumed:

1(0*)(?=1)

And a hint for your regex101 example: you did not add the global flag, so only the first match is returned.
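
The three behaviours can be compared side by side in Python. This is only a sketch: the input 1000010001 and the consuming pattern 1(0*)1 are assumptions reconstructed from the description above, not taken from the original question.

```python
import re

text = "1000010001"  # assumed input

# Assumed original pattern: both 1s are consumed, so the second run of zeros is missed.
consuming = re.findall(r"1(0*)1", text)

# Lookbehind plus lookahead: neither 1 is consumed.
lookaround = re.findall(r"(?<=1)(0*)(?=1)", text)

# Lookahead only: the leading 1 is consumed, the trailing 1 is left for the next match.
lookahead_only = re.findall(r"1(0*)(?=1)", text)

print(consuming)       # ['0000']
print(lookaround)      # ['0000', '000']
print(lookahead_only)  # ['0000', '000']
```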

Regular expression for excluding overlapping matches

(?<![\/\-\.a-zA-Z0-9])([a-zA-Z0-9]+[\/\-\.][a-zA-Z0-9]+)(?![\/\-\.a-zA-Z0-9])

This works as you asked; see the Regex101 demo.


Example: Foo [1234-101] bar 456B/102 baz 5/3/2016

Matches: 1234-101 and 456B/102

Example: Foo [1234-101] bar 5/22/2016

Matches: 1234-101
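
The two examples can be checked with a short Python sketch. The pattern is copied unchanged from above; using Python's re here is an assumption about the target engine, chosen only because it supports the fixed-width lookbehind.

```python
import re

pattern = re.compile(
    r"(?<![\/\-\.a-zA-Z0-9])([a-zA-Z0-9]+[\/\-\.][a-zA-Z0-9]+)(?![\/\-\.a-zA-Z0-9])"
)

# The lookarounds reject any candidate that touches another separator or
# alphanumeric character, which is what rules out pieces of the dates.
first = pattern.findall("Foo [1234-101] bar 456B/102 baz 5/3/2016")
second = pattern.findall("Foo [1234-101] bar 5/22/2016")

print(first)   # ['1234-101', '456B/102']
print(second)  # ['1234-101']
```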

How to find overlapping matches with a regexp?

findall doesn't yield overlapping matches by default. This expression does however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
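
If the start positions are needed as well, the same trick can be wrapped in a small helper. This is a sketch; the name overlapping_matches is made up, and it assumes the pattern you pass in is safe to embed inside another group.

```python
import re

def overlapping_matches(pattern, text):
    """Return (start, matched_text) for every position where `pattern` matches."""
    # Group 1 is the whole wrapped pattern; the lookahead itself consumes nothing.
    wrapped = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.group(1)) for m in wrapped.finditer(text)]

print(overlapping_matches(r"\w\w", "hello"))
# [(0, 'he'), (1, 'el'), (2, 'll'), (3, 'lo')]
```

Note that this reports at most one match per start position, namely the one the engine prefers at that position.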

Regex including overlapping matches with same start

As I've said above, regex is primarily a linear, single-rule kind of engine: you can choose between greedy and non-greedy capture, but you cannot have both at once. Also, most regex engines do not support overlapping matches (and even those that do more or less fake it with substrings or a forced head move), because it does not fit the regex philosophy.

If you're looking only for simple overlapping matches between two substrings, you can implement it yourself:

def find_substrings(data, start, end):
    result = []
    s_len = len(start)  # a shortcut for `start` length
    e_len = len(end)  # a shortcut for `end` length
    current_pos = data.find(start)  # find the first occurrence of `start`
    while current_pos != -1:  # loop while we can find `start` in our data
        # find the first occurrence of `end` after the current occurrence of `start`
        end_pos = data.find(end, current_pos + s_len)
        while end_pos != -1:  # loop while we can find `end` after the current `start`
            end_pos += e_len  # just so we include the selected substring
            result.append(data[current_pos:end_pos])  # add the current substring
            end_pos = data.find(end, end_pos)  # find the next `end` after the curr. `start`
        current_pos = data.find(start, current_pos + s_len)  # find the next `start`
    return result

Which will yield:

substrings = find_substrings("BADACBA", "B", "A")
# ['BA', 'BADA', 'BADACBA', 'BA']

But you'll have to modify it for more complex matches.
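
One possible modification, sketched under the assumption that the start and end delimiters can be written as regex patterns: collect every delimiter match with a lookahead, then pair each start with every end that begins at or after it. The name find_spans is hypothetical.

```python
import re

def find_spans(data, start_pattern, end_pattern):
    """Return every substring that runs from a start match to a later end match."""
    # Lookaheads give one (begin, end) pair per start position, overlaps included.
    starts = [(m.start(1), m.end(1)) for m in re.finditer(f"(?=({start_pattern}))", data)]
    ends = [(m.start(1), m.end(1)) for m in re.finditer(f"(?=({end_pattern}))", data)]
    return [
        data[s_begin:e_end]
        for s_begin, s_end in starts
        for e_begin, e_end in ends
        if e_begin >= s_end  # the end delimiter must not overlap the start delimiter
    ]

print(find_spans("BADACBA", "B", "A"))
# ['BA', 'BADA', 'BADACBA', 'BA']
```

Unlike find_substrings, this version also considers end delimiters that overlap each other, so the two can differ for multi-character delimiters.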

How can I Prioritize Overlapping Patterns in RegEx?

How about the following (regex101.com example):

/((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)(?:\h[-–]\h|\|)?(.*?)(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)/gm

Explanation

[Bb]lank

All matches for "blank" check for a lower OR uppercase "B"

((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)

The 1st capturing group: match either an alphanumeric first value or a "blank" first value that is followed by " - " or " – " (positive lookahead), OR a "blank" first value that won't have a 2nd matching group.

(?:\h[-–]\h|\|)?

A separator of " - " OR " – " OR "|" which will occur zero or one times.

(.*?)

Lazily match the contents of the 2nd capturing group.

(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)

Using a positive lookahead, look for a "blank" OR "|" OR an alphanumeric first value followed by " - " or " – " OR the end of the line (to catch the last item on the row). This marks where the capture should end.
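
To try the pieces together, here is a Python port of the pattern, offered as a sketch only. Python's re does not support \h, so [ \t] is substituted for it, and the sample row is invented for illustration since the original input is not shown here.

```python
import re

# Port of the pattern above with [ \t] standing in for \h (an assumption of this sketch).
pattern = re.compile(
    r"((?:[A-Z0-9]{1,4}|[Bb]lank)(?=[ \t][-–][ \t])|[Bb]lank)"  # group 1: first value
    r"(?:[ \t][-–][ \t]|\|)?"                                    # optional separator
    r"(.*?)"                                                     # group 2: lazy second value
    r"(?=[Bb]lank|\||[A-Z0-9]{1,4}[ \t][-–][ \t]|$)",            # where the capture must stop
    re.MULTILINE,
)

row = "AB12 - first|Blank|CD3 – second"  # hypothetical sample row
pairs = pattern.findall(row)

print(pairs)
# [('AB12', 'first'), ('Blank', ''), ('CD3', 'second')]
```

The bare "Blank" item falls through to the second alternative of group 1 and ends up with an empty second group, as described in the explanation.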


