How to Match Nested Brackets with a Regex Without Using Recursion or Balancing Groups

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:







































































































NoteComponentDescription
(?=\()Make sure '(' follows before doing any hard work.
(?:Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('(?=This lookahead deals with finding the next '('.
.*?\((?!.*?\1)Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again
(.*\)(?!.*\2).*)Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')'(?=This lookahead deals with finding the next ')'
.*?\)(?!.*?\2)Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
.Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation.*?(?=\1)Match up to and including the last '(' found.
[^(]*(?=\2$)Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

How can I match nested brackets using regex?

Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl, PHP and .NET support recursive patterns.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

while($text =~ /(\(([^()]|(?R))*\))/g) {
print("----------\n$1\n");
}

which will print:

----------
(outer
(center
(inner)
(inner)
center)
ouer)
----------
(outer
(inner)
ouer)
----------
(outer
ouer)

Or, the PHP equivalent:

$text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
[0] => Array
(
[0] => (outer
(center
(inner)
(inner)
center)
ouer)
[1] => (outer
(inner)
ouer)
[2] => (outer
ouer)
)

...

An explanation:


( # start group 1
\( # match a literal '('
( # group 2
[^()] # any char other than '(' and ')'
| # OR
(?R) # recursively match the entir pattern
)* # end group 2 and repeat zero or more times
\) # match a literal ')'
) # end group 1

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.

Regular expression to match balanced parentheses

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.

Need a regex expert to match nested brackets

You need to use preg_replace_callback, with a pattern able to extract content between _{ and } or ^{ and } with nested curly brackets, and a callback function that will replace all occurrences of \dfrac in the match. Example:

$pattern = '~[_^]({[^{}]*(?:(?1)[^{}]*)*})~';

$result = preg_replace_callback($pattern,
function ($m) { return str_replace('\dfrac', '\frac', $m[0]); },
$text);

pattern details:

~              # pattern delimiter
[_^] # _ or ^
( # open the capture group 1
{
[^{}]* # all that is not a curly bracket
(?: # open a non capturing group
(?1) # the recursion is here:
# (?1) refers to the subpattern contained in capture group 1
# (so the current capture group)
[^{}]* #
)* # repeat the non capturing group as needed
}
) # close the capture group 1
~

Note: if curly brackets are not always balanced, you can change quantifiers to possessive to prevent too much backtracking and to make the pattern fail faster:

$pattern = '~[_^]({[^{}]*+(?:(?1)[^{}]*)*+})~';

or you can use an atomic group as well (or better):

$pattern = '~[_^]({(?>[^{}]*(?:(?1)[^{}]*)*)})~';

Python: How to match nested parentheses with regex?

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.

How to write a recursive regex that matches nested parentheses?

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left = '{';/* put YOUR left delimiter here. */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left = preg_quote( $delimiter_left, $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern = $delimiter_wrap . $delimiter_left
. '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
. $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );

Matching balanced parenthesis in Ruby using recursive regular expressions like perl

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{
(?<re>
\(
(?:
(?> [^()]+ )
|
\g<re>
)*
\)
)
}x

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.



Related Topics



Leave a reply



Submit