What Are Regular Expression Balancing Groups

What are regular expression Balancing Groups?

As far as I know, balancing groups are unique to .NET's regex flavor.

Aside: Repeated Groups

First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).

To illustrate this with an example, consider the pattern

(.)+

and the string "abcd".

in all other regex flavors, capturing group 1 will simply yield one result: d (note, the full match will of course be abcd as expected). This is because every new use of the capturing group overwrites the previous capture.

.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like

Match m = new Regex(@"(.)+").Match("abcd");

you will find that

m.Groups[1].Captures

Is a CaptureCollection whose elements correspond to the four captures

0: "a"
1: "b"
2: "c"
3: "d"

where the number is the index into the CaptureCollection. So basically every time the group is used again, a new capture is pushed onto the stack.

It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like

(?<word>\w+)\W+(?<word>\w+)

to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar" and inspecting

m.Groups["word"].Captures

we find two captures

0: "foo"
1: "bar"

This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET's feature of being able to track multiple captures which are listed in this CaptureCollection. But I said, this collection is a stack. So can we pop things from it?

Enter: Balancing Groups

It turns out we can. If we use a group like (?<-word>...), then the last capture is popped from the stack word if the subexpression ... matches. So if we change our previous expression to

(?<word>\w+)\W+(?<-word>\w+)

Then the second group will pop the first group's capture, and we will receive an empty CaptureCollection in the end. Of course, this example is pretty useless.

But there's one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels - and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$

So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches (s while pushing them onto the stack. The third alternative matches )s while popping elements from the stack (if possible!).

Note: Just to clarify, we're only checking that there are no unmatched parentheses! This means that string containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(]) right after the ^.

This pattern is not perfect (or entirely correct) though.

Finale: Conditional Patterns

There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar) would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is

(?(condition)truePattern|falsePattern)

where the falsePattern is optional - if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I'll focus on the latter case here. If it's the name of a capturing group, then truePattern is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no) reads "if name has matched and captured something (that is still on the stack), use pattern yes otherwise use pattern no".

So at the end of our above pattern we could add something like (?(Open)failPattern) which causes the entire pattern to fail, if the Open-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!) (an empty negative lookahead). So we have our final pattern:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$

Note that this conditional syntax has nothing per se to do with balancing groups but it's necessary to harness their full power.

From here, the sky is the limit. Many very sophisticated uses are possible and there are some gotchas when used in combination with other .NET-Regex features like variable-length lookbehinds (which I had to learn the hard way myself). The main question however is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off, just walking the string manually character-by-character and counting nesting levels in an integer.

Addendum: What's with the (?<A-B>...) syntax?

Credits for this part go to Kobi (see his answer below for more details).

Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses' contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.

But .NET provides one more convenience feature here: if we use (?<A-B>subPattern), not only is a capture popped from stack B, but also everything between that popped capture of B and this current group is pushed onto stack A. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair's content onto another stack:

^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$

Kobi provided this Live-Demo in his answer

So taking all of these things together we can:

  • Remember arbitrarily many captures
  • Validate nested structures
  • Capture each nesting level

All in a single regular expression. If that's not exciting... ;)

Some resources that I found helpful when I first learned about them:

  • http://blog.stevenlevithan.com/archives/balancing-groups
  • MSDN on balancing groups
  • MSDN on conditional patterns
  • http://kobikobi.wordpress.com/tag/balancing-group/ (slightly academic, but has some interesting applications)

Regex and balancing groups

Assuming your input is already valid, and you want to parse it, here is a rather simple regex to achieve that:

(?:
(?<ConditionId>\d+)
|
(?<Operator>[XUO])
|
(?<Open>\()
|
(?<Group-Open>\))
)+

Working example - Regex Storm - switch to the table tab to see all captures.

The pattern captures:

  • Numbers into the $ConditionId group.
  • Operators into the $Operator group.
  • Sub expressions in parentheses into the $Group group (needs a better name?). For example, for the string 22X(34U(42O27)), it will have two captures: 42O27 and 34U(42O27).

Each capture contains the position of the matches string. The relations between $Group and its contained $Operators, $ConditionIds and sub-$Groups is expressed only using these positions.

The (?<Group-Open>) syntax is used when we reach a closing parenthesis to capture everything since the corresponding opening parenthesis. This is explained in more detailed here: What are regular expression Balancing Groups?

Regex with balancing groups

I suggest capturing those values using

\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*

See the regex demo.

Details:

  • \w+(?:\.\w+)* - match 1+ word chars followed with . + 1+ word chars 1 or more times
  • \[ - a literal [
  • (?:,?(?<res>\w+(?:\[[^][]*])?))* - 0 or more sequences of:

    • ,? - an optional comma
    • (?<res>\w+(?:\[[^][]*])?) - Group "res" capturing:

      • \w+ - one or more word chars (perhaps, you would like [\w.]+)
      • (?:\[[^][]*])? - 1 or 0 (change ? to * to match 1 or more) sequences of a [, 0+ chars other than [ and ], and a closing ].

A C# demo below:

var line = "System.Action[Int32,Dictionary[Int32,Int32],Int32]";
var pattern = @"\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups["res"].Captures.Cast<Capture>()
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);

UPDATE: To account for unknown depth [...] substrings, use

\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*

See the regex demo

using regular expression balancing groups to match nesting tags

Typically, to match nested tags, you'd want something similar to:

(?>
\{(?<Open>\w+)\}
|
\{/(?<-Open>\<Open>)\}
|
(?(Open)[^{}]+)
)*
(?(Open)(?!))

Working example: Regex Storm

This way you can match nested tags of different types, which looks like what you're trying to do. For example, it would match this:

{list}
nesting loop content
{world}
{list}
{hello}
bala ...
{/hello}
{/list}
{/world}
{/list}

Notes:

  • I'm using (?(Open)[^{}]+) so we only match free text if it is within tags.
  • I'm using the same group for the top level and the inner levels.

Yours was pretty close. You are basically missing one alternation between the middle groups:

(
\{(?<NAME2>.+?)\}(?<OPEN>)
[^\{\}]*?
)
| # <---- This
(
\{\/\<NAME2>\}(?<-OPEN>)
[^\{\}]*?
)

Working example

However, you are always using the last value of $NAME2. $NAME2 is a stack, but you never pop values from it, only push. This causes a bug: it would also match this string (which is probably wrong):

{list}             # Set $Name = "world"
nesting loop content
{world} # Set $Name2 = "world"
{world} # Set $Name2 = "world"
{hello} # Set $Name2 = "hello"
bala ...
{/hello} # Match $Name2 ("hello")
{/hello} # Match $Name2 ("hello")
{/hello} # Match $Name2 ("hello")
{/list} # Match $Name ("list")

See also:

  • What are regular expression Balancing Groups?
  • How to make balancing group capturing?

Regular expression that uses balancing groups

To capture a whole IF/ENDIF block with balanced IF statements, you can use this regex:

%IF\s+(?<Name>\w+)
(?<Contents>
(?> #Possessive group, so . will not match IF/ENDIF
\s|
(?<IF>%IF)| #for IF, push
(?<-IF>%ENDIF)| #for ENDIF, pop
. # or, anything else, but don't allow
)+
(?(IF)(?!)) #fail on extra open IFs
) #/Contents
%ENDIF

The point here is this: you cannot capture in a single Match more than one of every named group. You will only get one (?<Name>\w+) group, for example, of the last captured value. In my regex, I kept the Name and Contents groups of your simple regex, and limited the balancing inside the Contents group - the regex is still wrapped in IF and ENDIF.

If becomes interesting when your data is more complex. For example:

%IF MY_VAR             
some text
%IF OTHER_VAR
some other text
%ENDIF
%IF OTHER_VAR2
some other text 2
%ENDIF
%ENDIF
%IF OTHER_VAR3
some other text 3
%ENDIF

Here, you will get two matches, one for MY_VAR, and one for OTHER_VAR3. If you want to capture the two ifs on MY_VAR's content, you have to rerun the regex on its Contents group (you can get around it by using a lookahead if you must - wrap the whole regex in (?=...), but you'll need to put it into a logical structure somehow, using positions and lengths).

Now, I won't explain too much, because it seems you get the basics, but a short note about the contents group - I've uses a possessive group to avoid backtracking. Otherwise, it would be possible for the dot to eventually match whole IFs and break the balance. A lazy match on the group would behave similarly (( )+? instead of (?> )+).

Regular Expressions Balancing Group

You're pretty close.

Adapted from the second answer on this question (I use that as my canonical "balancing xxx in C#/.NET regex engine" answer, upvote it if it helped you! It has helped me in the past):

var r = new Regex(@"
[^{}]* # any non brace stuff.
\{( # First '{' + capturing bracket
(?:
[^{}] # Match all non-braces
|
(?<open> \{ ) # Match '{', and capture into 'open'
|
(?<-open> \} ) # Match '}', and delete the 'open' capture
)+ # Change to * if you want to allow {}
(?(open)(?!)) # Fails if 'open' stack isn't empty!
)\} # Last '}' + close capturing bracket
"; RegexOptions.IgnoreWhitespace);

Regex: Match multiple balancing groups

Assuming you only want to match basic code like those samples in your question, you can use

(?<method_name>\w+)\s*\((?s:.*?)\)\s*(?<method_body>\{(?>[^{}]+|\{(?<n>)|}(?<-n>))*(?(n)(?!))})

See demo

Sample Image

To access the values you need, use .Groups["method_name"].Value and .Groups["method_body"].Value.

How to make balancing group capturing?

I think that a different approach is required here. Once you match the first larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups inside as the regex doesn't allow you to capture the individual R{ ... } groups.

So, there had to be some way to capture and not consume and the obvious way to do that was to use a positive lookahead. From there, you can put the expression you used, albeit with some changes to adapt to the new change in focus, and I came up with:

(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))

[I also renamed the 'Open' to 'O' and removed the named capture for the close brace to make it shorter and avoid noises in the matches]

On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:

1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}

Breakdown of regex:

(?=                         # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
(?: # First non-capture group opening
(?: # Second non-capture group opening
(?'O'{) # Get the named opening brace
[^{}]* # Any non-brace
)+ # Close of second non-capture group and repeat over as many times as necessary
(?: # Third non-capture group opening
(?'-O'}) # Removal of named opening brace when encountered
[^{}]*? # Any other non-brace characters in case there are more nested braces
)+ # Close of third non-capture group and repeat over as many times as necessary
)+ # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
(?(O)(?!)) # Condition to prevent unbalanced braces
) # Close capture group
) # Close positive lookahead

The following will not work in C#

I actually wanted to try out how it should be working out on the PCRE engine, since there was the option to have recursive regex and I think it was easier since I'm more familiar with it and which yielded a shorter regex :)

(?=([A-Z]{(?:[^{}]|(?1))+}))

regex101 demo

(?=                    # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
{ # Opening brace
(?: # Opening non-capture group
[^{}] # Matches non braces
| # OR
(?1) # Recurse first capture group
)+ # Close non-capture group and repeat as many times as necessary
} # Closing brace
) # Close of capture group
) # Close of positive lookahead

How to match balanced delimiters in javascript regex?

I do not believe that this is possible in JavaScript, though it's hard to prove. For example, Java and PHP do not have the features that you mention (recursive interpolation, balancing groups), but this fascinating Stack Overflow answer shows how to match anbn using regexes in those languages. (Adapting that answer to the present case, the Java regex ^(?:(?:<(?=<*(\1?+>)))+\1)*$ should work. Correction: no, it's not that easily adapted.) But that answer depends on Java's support for the possessive quantifier ?+ (like ? except that you can't backtrack into it), and JavaScript doesn't have that.

That said, you can solve the referenced puzzle by writing this:

^(?:<(?:<(?:<(?:<(?:<(?:<(?:<>)*>)*>)*>)*>)*>)*>)*$

which matches up to seven levels of nesting. That's the most that any of the strings has, so it's all you need. (Several of the other puzzles at that page advise you to cheat because they're asking for something technically impossible; so while an elegant solution would obviously be more appealing, there's no reason to assume that one exists.)



Related Topics



Leave a reply



Submit