What's the Difference Between "Groups" and "Captures" in .NET Regular Expressions

What's the difference between groups and captures in .NET regular expressions?

You won't be the first who's fuzzy about it. Here's what the famous Jeffrey Friedl has to say about it (pages 437+):

Depending on your view, it either adds
an interesting new dimension to the
match results, or adds confusion and
bloat.

And further on:

The main difference between a Group
object and a Capture object is that
each Group object contains a
collection of Captures representing
all the intermediary matches by the
group during the match, as well as the
final text matched by the group.

And a few pages later, this is his conclusion:

After getting past the .NET
documentation and actually
understanding what these objects add,
I've got mixed feelings about them. On
one hand, it's an interesting
innovation [..] on the other hand, it
seems to add an efficiency burden [..]
of a functionality that won't be used
in the majority of cases

In other words: they are very similar, but occasionally and as it happens, you'll find a use for them. Before you grow another grey beard, you may even get fond of the Captures...


Since neither the above nor what's said in the other post really seems to answer your question, consider the following. Think of Captures as a kind of history tracker. When the regex makes its match, it goes through the string from left to right (ignoring backtracking for a moment), and when a capturing group matches, it stores the matched text in $x (x being the group number), let's say $1.

Normal regex engines, when the capturing parentheses are repeated, throw away the current $1 and replace it with the new value. Not .NET, which keeps this history and places it in the group's Captures collection (Captures[0] being the first capture, and so on).

If we change your regex to look as follows:

MatchCollection matches = Regex.Matches("{Q}{R}{S}", @"(\{[A-Z]\})+");

you will notice that the first Group has a single Capture (the first group always being the whole match, i.e., equal to $0) and the second group holds {S}, i.e. only the last text that group matched. However, and here's the catch, if you want to find the other two captures, they're in that group's Captures collection, which contains all intermediary captures: {Q}, {R} and {S}.

If you ever wondered how to get from a repeated group, which only shows its last match, to the individual captures that are clearly there in the string, you must use Captures.

A final word on your final question: the total match always has exactly one Capture; don't confuse that with the individual Groups. Captures are only interesting inside groups.
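To make the "history tracker" idea concrete, here is a minimal sketch that reads both the last value of group 1 and its full capture history for the {Q}{R}{S} example. The class and method names (CaptureHistoryDemo, CapturesOfGroup1, LastValueOfGroup1) are made up for illustration; only the pattern and input come from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class CaptureHistoryDemo
{
    private const string Pattern = @"(\{[A-Z]\})+";

    // Returns every capture recorded for group 1, in the order they were matched.
    public static List<string> CapturesOfGroup1(string input)
    {
        var result = new List<string>();
        Match match = Regex.Match(input, Pattern);
        if (!match.Success)
        {
            return result;
        }
        foreach (Capture capture in match.Groups[1].Captures)
        {
            result.Add(capture.Value);
        }
        return result;
    }

    // Returns what Groups[1].Value reports: only the last capture of the group.
    public static string LastValueOfGroup1(string input)
    {
        return Regex.Match(input, Pattern).Groups[1].Value;
    }
}
```

Calling CapturesOfGroup1("{Q}{R}{S}") should give the three items {Q}, {R} and {S}, while LastValueOfGroup1 on the same input gives just {S}.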

Differences among .NET Capture, Group, Match

Here's a simpler example than the one in the document @Dav cited:

using System;
using System.Text.RegularExpressions;

string s0 = @"foo%123%456%789";
Regex r0 = new Regex(@"^([a-z]+)(?:%([0-9]+))+$");
Match m0 = r0.Match(s0);
if (m0.Success)
{
    Console.WriteLine(@"full match: {0}", m0.Value);
    Console.WriteLine(@"group #1: {0}", m0.Groups[1].Value);
    Console.WriteLine(@"group #2: {0}", m0.Groups[2].Value);
    // Group #2 matched three times; each pass is kept in its Captures collection.
    Console.WriteLine(@"group #2 captures: {0}, {1}, {2}",
        m0.Groups[2].Captures[0].Value,
        m0.Groups[2].Captures[1].Value,
        m0.Groups[2].Captures[2].Value);
}

result:

full match: foo%123%456%789
group #1: foo
group #2: 789
group #2 captures: 123, 456, 789

The full match and group #1 results are straightforward, but the others require some explanation. Group #2, as you can see, is inside a non-capturing group that's controlled by a + quantifier. It matches three times, but if you request its Value, you only get what it matched the third time around--the final capture. Similarly, if you use the $2 placeholder in a replacement string, the final capture is what gets inserted in its place.

In most regex flavors, that's all you can get; each intermediate capture is overwritten by the next and lost; .NET is almost unique in preserving all of the captures and making them available after the match is performed. You can access them directly as I did here, or iterate through the CaptureCollection as you would a MatchCollection. There's no equivalent for the $1-style replacement-string placeholders, though.
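Here's a rough sketch of both points: that $2 in a replacement string only sees the final capture, and that you can walk the CaptureCollection yourself when you need all of them. It reuses the pattern from the example above; the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class ReplacementDemo
{
    private static readonly Regex Pattern =
        new Regex(@"^([a-z]+)(?:%([0-9]+))+$");

    // "$2" in a replacement string expands to the final capture of group 2 only.
    public static string ReplaceWithGroup2(string input)
    {
        return Pattern.Replace(input, "$2");
    }

    // To use every capture, iterate the CaptureCollection and build the text yourself.
    public static string JoinGroup2Captures(string input, string separator)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return string.Empty;
        }
        var values = new List<string>();
        foreach (Capture c in m.Groups[2].Captures)
        {
            values.Add(c.Value);
        }
        return string.Join(separator, values);
    }
}
```

With the input foo%123%456%789, ReplaceWithGroup2 should return 789, and JoinGroup2Captures with a "," separator should return 123,456,789.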

So the reason the API design is so ugly (as you put it) is twofold: first it was adapted from Perl's integral regex support to .NET's object-oriented framework; then the CaptureCollection structure was grafted onto it. Perl 6 offers a much cleaner solution, but the authors accomplished that by rewriting Perl practically from scratch and throwing backward compatibility out the window.

Why does regex match capture the whole string as a group in C# when the whole pattern does not have an enclosing parentheses?

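The value of a capture group is overwritten on each pass of its quantifier, as in (a)*, so the group only reports its last iteration. Change it to (abc)((?:de)*)(fgh) so that the capture group wraps the whole repetition.

The extra group you see is group 0, which is the overall match of the regex. So groups 0, 1, 2 and 3 make 4 groups.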
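A small sketch of the difference, assuming the original pattern looked something like (abc)(de)*(fgh) and using a made-up input abcdedefgh (neither is shown in the question, so treat both as assumptions). The GroupCountDemo class is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class GroupCountDemo
{
    // Returns the Value of every group in the first match, group 0 included.
    public static string[] GroupValues(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        var values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }
}
```

For abcdedefgh, the assumed pattern (abc)(de)*(fgh) should report de for group 2 (the last pass only), while (abc)((?:de)*)(fgh) should report dede. In both cases there are four entries, because index 0 is the whole match.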

Regular Expression Groups in C#

The ( ) acts as a capture group. So the matches array has all of the matches that C# finds in your string, and the sub-array has the values of the capture groups inside each of those matches. If you don't want that extra level of capture, just remove the ( ).

What is the difference between a group and match in .NET's RegEx?

A Match is an object that indicates a particular regular expression matched (a portion of) the target text. A Group indicates a portion of a match, if the original regular expression contained group markers (basically a pattern in parentheses). For example, with the following code:

string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";
Regex r = new Regex(pat);
Match m = r.Match(text);

m would be a Match object that contains two capturing groups (besides group 0, which is the whole match "One car"): group 1, from (\w+), which captured "One", and group 2, from (car), which matched, well, "car".
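Since Match only returns the first hit, it may help to see the same pattern run over the whole string with Matches. This is just a sketch; the MatchVersusGroupDemo class and DescribeMatches method are names I made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class MatchVersusGroupDemo
{
    // One line per Match: the whole match, then the values of groups 1 and 2.
    public static List<string> DescribeMatches(string text)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(\w+)\s+(car)"))
        {
            lines.Add(m.Value + " -> " + m.Groups[1].Value + ", " + m.Groups[2].Value);
        }
        return lines;
    }
}
```

For "One car red car blue car" this should yield three matches, with group 1 being One, red and blue in turn and group 2 being car each time.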

How to read RegEx Captures in C#

The C# regex API can be quite confusing. There are groups and captures:

  • A group represents a capturing group; it's used to extract a substring from the text.
  • There can be several captures per group if the group appears inside a quantifier.

The hierarchy is:

  • Match
    • Group
      • Capture

(a match can have several groups, and each group can have several captures)

For example:

Subject: aabcabbc
Pattern: ^(?:(a+b+)c)+$

In this example, there is only one group: (a+b+). This group is inside a quantifier, and is matched twice. It generates two captures: aab and abb:

aabcabbc
^^^ ^^^
Cap1 Cap2

When a group is not inside of a quantifier, it generates only one capture. In your case, you have 3 groups, and each group captures once. You can use match.Groups[1].Value, match.Groups[2].Value and match.Groups[3].Value to extract the 3 substrings you're interested in, without resorting to the capture notion at all.
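The Match > Group > Capture hierarchy for the aabcabbc example can be walked like this. It's a minimal sketch; the CaptureWalkDemo class and its method are hypothetical names, and the "index:value" output format is just my choice for readability.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class CaptureWalkDemo
{
    // Lists each capture of group 1 as "index:value", in match order.
    public static List<string> Group1Captures(string subject)
    {
        var found = new List<string>();
        Match match = Regex.Match(subject, @"^(?:(a+b+)c)+$");
        if (!match.Success)
        {
            return found;
        }
        Group group = match.Groups[1];
        foreach (Capture capture in group.Captures)
        {
            found.Add(capture.Index + ":" + capture.Value);
        }
        return found;
    }
}
```

For aabcabbc this should produce 0:aab and 4:abb, matching the Cap1 and Cap2 positions in the diagram above.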

Why does Regex.Match include noncapturing groups in the result?

Matching is not the same thing as capturing. (?:\d) simply means match a subpattern containing \d, but don't bother putting it in a capture group. Your entire pattern (?:\d)\w looks for a (?:\d) followed by a \w; it's functionally equivalent to \d\w.

If you're trying to match a \w only when it is preceded by a \d, use a lookbehind assertion instead:

System.Text.RegularExpressions.Regex.Match("b3a", @"(?<=\d)\w").Value
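A side-by-side sketch of the two patterns, in case it helps to see what each one returns. The LookbehindDemo class and its methods are hypothetical names; the patterns and the "b3a" input are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class LookbehindDemo
{
    // The non-capturing group still consumes the digit, so it is part of the match.
    public static string NonCapturing(string input)
    {
        return Regex.Match(input, @"(?:\d)\w").Value;
    }

    // The lookbehind only asserts that a digit precedes; the digit is not consumed.
    public static string Lookbehind(string input)
    {
        return Regex.Match(input, @"(?<=\d)\w").Value;
    }

    // Number of groups the pattern defines, counting group 0 (the whole match).
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }
}
```

On "b3a", NonCapturing should return 3a and Lookbehind should return a. GroupCount for (?:\d)\w should be 1, since only group 0 exists, which shows that the non-capturing group adds nothing to Groups even though its text is in the match.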

How do I access named capturing groups in a .NET Regex?

Use the group collection of the Match object, indexing it with the capturing group name, e.g.

foreach (Match m in mc)
{
    MessageBox.Show(m.Groups["link"].Value);
}
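For a version you can run outside a WinForms app, here is a sketch that collects the values instead of showing message boxes. The pattern is an assumption on my part (the question's regex isn't shown here); all it needs is a group named link. NamedGroupDemo and ExtractLinks are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class NamedGroupDemo
{
    // Assumed pattern: captures the text between the quotes of href="..." as "link".
    private static readonly Regex LinkPattern =
        new Regex(@"href=""(?<link>[^""]*)""");

    // Returns the value of the named group "link" for every match in the input.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            links.Add(m.Groups["link"].Value);
        }
        return links;
    }
}
```

Indexing Groups by name and by number both work; a named group also gets a number, so m.Groups["link"] and m.Groups[1] refer to the same group in this pattern.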

