Getting Overlapping Regex Matches in C#

Overlapping matches in Regex

Update 2016:

To get nn, nn, nn from nnnn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101). The match itself is zero-length; each nn is read from capture group 1.

(?=(nn))
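
In C#, this could look roughly like the sketch below. It assumes the input is nnnn and is written as a top-level program (C# 9 or later); the variable names are only illustrative.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The lookahead consumes nothing, so the engine retries at every position;
// the matched text is read from capture group 1, not from the match itself.
var pairs = Regex.Matches("nnnn", @"(?=(nn))")
    .Cast<Match>()
    .Select(m => (m.Index, Text: m.Groups[1].Value))
    .ToList();

foreach (var (index, text) in pairs)
    Console.WriteLine($"{index}: {text}");   // prints 0: nn, 1: nn, 2: nn on separate lines
```

m.Value would be empty for every match here, which is why the capture group is needed.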

Original answer (2008)

A possible solution could be to use a positive lookbehind:

(?<=n)n

It would give you the end position of:

  1. [nn]nn

  2. n[nn]n

  3. nn[nn]

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example).

Instead of his proposed (?=nn)n, I would prefer the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
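
To illustrate the last point, here is a small C# sketch in which 'n' stands for something slightly more complicated. The sub-pattern [a-z]\d and the input a1b2c3 are made up for the example; the code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for 'n': a lowercase letter followed by a digit.
const string n = @"[a-z]\d";
var pattern = $"({n})(?=({n}))";   // same shape as (n)(?=(n))

var pairs = Regex.Matches("a1b2c3", pattern)
    .Cast<Match>()
    .Select(m => (First: m.Groups[1].Value, Second: m.Groups[2].Value))
    .ToList();

foreach (var (first, second) in pairs)
    Console.WriteLine($"{first} followed by {second}");
// a1 followed by b2
// b2 followed by c3
```

Only group 1 is consumed, so b2 is still available as the start of the second match.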



Getting overlapping regex matches in C#

Your first match already consumes the 1 in front of the second run of zeros:

100001 0001
^^^^^^

This is the first match. The rest is just 0001, which does not match your regex.


You can circumvent this behavior if you are using lookaheads/lookbehinds:

(?<=1)(0*)(?=1)



If your regex flavor has no lookbehinds (JavaScript did not support them when this was written), a single lookahead is enough to keep the first match from consuming the shared 1:

1(0*)(?=1)



And a hint for your regex101 example: you did not add the global flag, so only the first match is returned.
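
A rough C# comparison of the three variants is sketched below. Two things are assumptions, since the question itself is not shown here: that the input is the example string without the space, and that the original pattern looked something like 1(0*)1. The helper ZeroRuns is hypothetical, and the code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed input: the example string above, without the space.
const string input = "1000010001";

// Hypothetical helper: returns group 1 (the run of zeros) for every match.
string[] ZeroRuns(string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

var consuming  = ZeroRuns(@"1(0*)1");            // assumed shape of the original pattern
var lookaround = ZeroRuns(@"(?<=1)(0*)(?=1)");   // neither 1 is consumed
var lookahead  = ZeroRuns(@"1(0*)(?=1)");        // only the trailing 1 is left unconsumed

Console.WriteLine(string.Join(",", consuming));   // 0000
Console.WriteLine(string.Join(",", lookaround));  // 0000,000
Console.WriteLine(string.Join(",", lookahead));   // 0000,000
```

Regex.Matches always returns every match, so there is no global flag to forget in .NET.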

Counting overlapping matches with Regex in C#

This will return 4 as you expect:

Regex.Matches("020202020", @"0(?=20)").Count;

The lookahead matches the 20 without consuming it, so the next match attempt starts at the position following the first 0. You can even do the whole regex as a lookahead:

Regex.Matches("020202020", @"(?=020)").Count;

The regex engine automatically bumps ahead one position each time it makes a zero-length match. So, to find all runs of three 2's or four 2's, you can use:

Regex.Matches("22222222", @"(?=222)").Count;  // 6

...and:

Regex.Matches("22222222", @"(?=2222)").Count;  // 5

EDIT: Looking over your question again, it occurs to me you might be looking for 2's interspersed with 0's:

Regex.Matches("020202020", @"(?=20202)").Count;  // 2

If you don't know how many 0's there will be, you can use this:

Regex.Matches("020202020", @"(?=20*20*2)").Count;  // 2

And of course, you can use quantifiers to reduce repetition in the regex:

Regex.Matches("020202020", @"(?=2(?:0*2){2})").Count;  // 2
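
If you need this in several places, the lookahead trick can be wrapped in a small helper. The sketch below is only an illustration: CountOverlapping is a made-up name, and it assumes the pattern you pass in cannot match an empty string (otherwise every position would count). It is written as a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps the pattern in a lookahead so that every start
// position is tested and nothing is consumed between attempts.
int CountOverlapping(string input, string pattern) =>
    Regex.Matches(input, $"(?=(?:{pattern}))").Count;

Console.WriteLine(CountOverlapping("020202020", "020"));   // 4
Console.WriteLine(CountOverlapping("22222222", "222"));    // 6
Console.WriteLine(CountOverlapping("22222222", "2222"));   // 5
```

The non-capturing group keeps alternations such as a|b inside the lookahead.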

Replacing overlapping matches in a string (regex or string operations)

I think I would forgo regex and write a simple loop as below (there is room for improvement), because I think it would be quicker and more understandable.

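using System;
using System.Collections.Generic;

public IEnumerable<int> FindStartingOccurrences(string input, string pattern)
{
    var occurrences = new List<int>();

    // Stop at the last index where the whole pattern still fits in the input.
    for (int i = 0; i <= input.Length - pattern.Length; i++)
    {
        // Ordinal comparison in place, without allocating a substring per position.
        if (string.CompareOrdinal(input, i, pattern, 0, pattern.Length) == 0)
        {
            occurrences.Add(i);
        }
    }

    return occurrences;
}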

and then call like:

var occurrences = FindStartingOccurrences("aaabbaaaaaccaadaaa", "aa");
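
For comparison, the start positions for the call above can also be obtained with a zero-width lookahead. This is a sketch rather than a recommendation; it assumes the search text should be treated literally, hence the Regex.Escape. It is written as a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "aaabbaaaaaccaadaaa";
var literal = "aa";

// Regex.Escape keeps the search text literal; the lookahead makes each match
// zero-width, so overlapping occurrences are all reported.
var starts = Regex.Matches(input, $"(?={Regex.Escape(literal)})")
    .Cast<Match>()
    .Select(m => m.Index)
    .ToArray();

Console.WriteLine(string.Join(", ", starts));  // 0, 1, 5, 6, 7, 8, 12, 15, 16
```

The loop should report the same nine positions for this input.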

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.

I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.

The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz' and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.
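
A minimal C# sketch of that approach, assuming the start and end tokens are single characters. AllSpans is a made-up name, and the code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Collections.Generic;

// Hypothetical helper: collects every substring that starts with `start`
// and ends with `end`, including overlapping and nested ones.
List<string> AllSpans(string text, char start, char end)
{
    var spans = new List<string>();
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] != start) continue;

        // From this start, every later occurrence of `end` closes one span.
        for (int j = i + 1; j < text.Length; j++)
        {
            if (text[j] == end)
                spans.Add(text.Substring(i, j - i + 1));
        }
    }
    return spans;
}

var result = AllSpans("akzzaz", 'a', 'z');
Console.WriteLine(string.Join(", ", result));  // akz, akzz, akzzaz, az
```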

How is Regex.Split giving me overlapping matches?

From the documentation:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, splitting the string " plum-pear" on a hyphen placed within capturing parentheses adds a string element that contains the hyphen to the returned array.

You have two sets of capturing parentheses, one inclusive of the quotes and one exclusive. These return the strings you are seeing.
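
The effect is easy to reproduce. The first two calls below follow the documentation's plum-pear example; the third uses a made-up input and pattern with two nested groups, similar in shape to what is described above. The code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

var plain    = Regex.Split("plum-pear", "-");     // delimiter is dropped
var captured = Regex.Split("plum-pear", "(-)");   // captured delimiter becomes an element

// Hypothetical pattern: outer group includes the quotes, inner group excludes them.
var nested = Regex.Split("x\"q\"y", "(\"([^\"]*)\")");

Console.WriteLine(string.Join("|", plain));      // plum|pear
Console.WriteLine(string.Join("|", captured));   // plum|-|pear
Console.WriteLine(string.Join("|", nested));     // x|"q"|q|y
```

Using non-capturing groups, (?:...), keeps the delimiter text out of the result.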

Note that the pattern for Regex.Split isn't supposed to match the desired results; it's supposed to match the delimiters. A quoted string is usually not a delimiter.

Also, your results seem very odd, because you've used a greedy match. Apparently the requirement "The input string is split as many times as possible." makes matching non-greedy for the entire operation.

Overall, I'd say you're using the wrong tool. Regular expressions are, depending on implementation, incapable of dealing with nested groupings or extremely inefficient. A simple DFA should work much better and never need more than a single scan.

How can I Prioritize Overlapping Patterns in RegEx?

How about the following (regex101.com example):

/((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)(?:\h[-–]\h|\|)?(.*?)(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)/gm

Explanation

[Bb]lank

All matches for "blank" accept a lowercase or uppercase "B".

((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)

The 1st capturing group: match either the alphanumeric first value or a "blank" first value followed by " - " or " – " (positive lookahead), OR a "blank" first value that won't have a 2nd matching group.

(?:\h[-–]\h|\|)?

A separator of " - " OR " – " OR "|" which will occur zero or one times.

(.*?)

Ungreedily match the 2nd matching group.

(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)

Using a positive lookahead, look for a "blank" OR "|" OR an alphanumeric first value followed by " - " or " – ", OR the end of the line (to catch the last item on the row). This marks where the capture should end.
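
If you want to run this from C#, note that \h (horizontal whitespace) is PCRE syntax that .NET does not accept. The sketch below substitutes [ \t] for it and maps the /gm flags to Regex.Matches with RegexOptions.Multiline. The input line is invented for the example, and the code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumption: [ \t] is an adequate replacement for PCRE's \h in this data.
const string h = @"[ \t]";
var pattern =
    $@"((?:[A-Z0-9]{{1,4}}|[Bb]lank)(?={h}[-–]{h})|[Bb]lank)" +   // group 1: first value
    $@"(?:{h}[-–]{h}|\|)?(.*?)" +                                  // separator, group 2
    $@"(?=[Bb]lank|\||[A-Z0-9]{{1,4}}{h}[-–]{h}|$)";               // where group 2 ends

// Hypothetical input row.
var items = Regex.Matches("A1 - foo|blank|B2 - bar", pattern, RegexOptions.Multiline)
    .Cast<Match>()
    .Select(m => (Key: m.Groups[1].Value, Value: m.Groups[2].Value))
    .ToList();

foreach (var (key, value) in items)
    Console.WriteLine($"[{key}] [{value}]");
// [A1] [foo]
// [blank] []
// [B2] [bar]
```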

How do I fix this regex? (How to get overlapping matches that share a word?)

With RegexOptions.Multiline | RegexOptions.IgnoreCase

 ^(?<Title>.*(?:Manager|Officer)).*\n(?<Name>.*)(?:\n(?!.*(?:Manager|Officer))(?<Detail>.*))+$

See: http://regexhero.net/tester/?id=1ac1bd9f-be0a-4bea-ac01-cc32a6605ae7

Retrieve values using

Match.Groups["Name"].Value
Match.Groups["Title"].Value
Match.Groups["Detail"].Captures[i].Value   // i = 0 .. Captures.Count - 1 (zero-based)
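
Put together, reading the groups might look like the sketch below. The input text is invented, and it assumes lines are separated by \n only (with \r\n line endings the pattern would need adjusting). The code is a top-level program (C# 9 or later).

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: a title line, a name line, then one or more detail lines.
var text = "Sales Manager\nJane Doe\n123 Main St\n555-0100\n" +
           "Chief Officer\nJohn Roe\n9 Elm St";

var pattern = @"^(?<Title>.*(?:Manager|Officer)).*\n(?<Name>.*)" +
              @"(?:\n(?!.*(?:Manager|Officer))(?<Detail>.*))+$";

var matches = Regex.Matches(text, pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase);

foreach (Match m in matches)
{
    var name = m.Groups["Name"].Value;
    var title = m.Groups["Title"].Value;

    // A group inside a repeated construct keeps every iteration in Captures.
    var details = m.Groups["Detail"].Captures.Cast<Capture>().Select(c => c.Value);

    Console.WriteLine($"{name} ({title}): {string.Join("; ", details)}");
}
// Jane Doe (Sales Manager): 123 Main St; 555-0100
// John Roe (Chief Officer): 9 Elm St
```

Groups["Detail"].Value on its own would only return the last captured detail line.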

