How to Write a Recursive Regex That Matches Nested Parentheses

How to write a recursive regex that matches nested parentheses?

When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters which where { and }. So my approach was to make it more generic.

Here is a script to generate the regex pattern with your own variable left and right delimiters.

$delimiter_wrap  = '~';
$delimiter_left = '{';/* put YOUR left delimiter here. */
$delimiter_right = '}';/* put YOUR right delimiter here. */

$delimiter_left = preg_quote( $delimiter_left, $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern = $delimiter_wrap . $delimiter_left
. '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
. $delimiter_right . $delimiter_wrap;

/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:







































































































NoteComponentDescription
(?=\()Make sure '(' follows before doing any hard work.
(?:Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('(?=This lookahead deals with finding the next '('.
.*?\((?!.*?\1)Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again
(.*\)(?!.*\2).*)Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')'(?=This lookahead deals with finding the next ')'
.*?\)(?!.*?\2)Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
.Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation.*?(?=\1)Match up to and including the last '(' found.
[^(]*(?=\2$)Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

Recursive regex with text before nested parenthesis

(?R) isn't a magical incantation to obtain a pattern able to handle balanced things (like parenthesis for example). (?R) is the same thing than (?0), it is an alias for "the capture group zero", in other words, the whole pattern.

In the same way you can use (?1), (?2), etc. as aliases for the sub-patterns in group 1, 2, etc.

As an aside, note that except for (?0) and (?R) that are obviously always in their sub-pattern, since it is the whole pattern, (?1), (?2) induce a recursion only if they are in their respective own groups, and can be used only to not rewrite a part of a pattern.

something\((?:[^()]|(?R))*\) doesn't work because it imposes each nested (or not) opening parenthesis to be preceded by something in your string.

Conclusion, you can't use (?R) here, and you need to create a capture group to only handle nested parenthesis:

(\((?:[^()]|(?1))*\))

that can be written in a more efficient way:

(\([^()]*(?:(?1)[^()]*)*+\))

To finish you only need to add something that is no more included in the recursion:

something(\([^()]*(?:(?1)[^()]*)*+\))

Note that if something is a sub-pattern with an undetermined number of capture groups, it is more handy to refer to the last opened capture group with a relative reference like this:

som(eth)ing(\([^()]*(?:(?-1)[^()]*)*+\))

How can I match nested brackets using regex?

Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl, PHP and .NET support recursive patterns.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

while($text =~ /(\(([^()]|(?R))*\))/g) {
print("----------\n$1\n");
}

which will print:

----------
(outer
(center
(inner)
(inner)
center)
ouer)
----------
(outer
(inner)
ouer)
----------
(outer
ouer)

Or, the PHP equivalent:

$text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
[0] => Array
(
[0] => (outer
(center
(inner)
(inner)
center)
ouer)
[1] => (outer
(inner)
ouer)
[2] => (outer
ouer)
)

...

An explanation:


( # start group 1
\( # match a literal '('
( # group 2
[^()] # any char other than '(' and ')'
| # OR
(?R) # recursively match the entir pattern
)* # end group 2 and repeat zero or more times
\) # match a literal ')'
) # end group 1

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.

Java - Regex, nested recursion match

I know this is a RegEx related question but I thought I would just throw this in here.

Me personally, I wouldn't attempt RegEx to acquire a particular nested bracketed depth even though the depth you are trying to get is only at a depth of 1. Below are a couple methods that can retrieve the required bracketed sub-strings at whatever depth you like for bracket types of: (), [], {}, and <>:

/**
* This method will parse out (retrieve) the contents of nested brackets
* groups within a bracketed string and return those group contents within a
* Single Dimensional String Array. The best way to see how this works is by
* examples. Let's say we have the following bracketed string:<pre>
*
* String a = "1(2(3)(4))(5(6)(7))";</pre><br>
*
* In the above string we can see that there are two instances of level 1
* bracketed groups which in each level 1 group nest two more level 2
* bracketed groups:
* <pre>
*
* Level 1 Level 2
* (2(3)(4)) (3) (4)
*
* ==================================
*
* Level 1 Level 2
* (5(6)(7)) (6) (7)</pre><br>
*
* Bracketed groups: <b>(2(3)(4))</b> and <b>(5(6)(7))</b> are both
* considered to be at nest level 1 (level 1 is the outer most level). They
* are both individual groups because they both have their own set of outer
* brackets. Within each level 1 group we have two sets of level 2 bracketed
* groups (4 level 2 groups altogether) which consist of: <b>(3)</b> &
* <b>(4)</b> and <b>(6)</b> & <b>
* (7)</b>.<br><br>
*
* This method also utilizes the bracketsMaxDepth() method.
*
* @param bracketedString (String) The string which contains bracketed
* content to parse.<br>
*
* @param desiredDepth (Integer) Default is 0 (full depth). The
* nested depth to retrieve bracketed content
* from.<br> If the bracket depth supplied is
* deeper than what is contained within the
* supplied input string then an <b>IllegalArgu-
* mentException</b> is thrown explaining as
* such.
*
* @param bracketType (String) You must supply the bracket type to
* process or null to use the default bracket
* type of parentheses ("()"). You can provide
* the bracket type to work against by supplying
* either a single open or close bracket or both
* open and close brackets, for example, any one
* of the following are all acceptable entries if
* parentheses are required:<pre>
*
* "(" ")" "()" ")("</pre><br>
*
* Any one of four (4) bracket types can be supplied. The allowable Bracket
* Types are:<pre>
*
* () Parentheses
* {} Curly Braces
* [] Square Brackets
* <> Chevron Brackets</pre>
*
* @param removeOuterBrackets (Optional - Boolean - Default is false) By
* default the outer brackets for each found
* group are also attached to the returned
* results. If true is supplied to this optional
* parameter then the outer brackets are removed
* from the returned group results.<br>
*
* @return (1D String Array) The determined nested groups desired.<br>
*
* @throws IllegalArgumentException if a depth is supplied greater than the
* available bracketed depth contained
* within the supplied input string. This
* exception is also thrown if it is found
* that the supplied Bracket Type is
* unbalanced (open and closed braces are
* not properly paired) within the supplied
* input string.
*/
public String[] getNestedBracketedGroups(String bracketedString, int desiredDepth,
String bracketType, boolean... removeOuterBrackets) {
boolean removeOuter = false;
if (removeOuterBrackets.length > 0) {
removeOuter = removeOuterBrackets[0];
}

int d = bracketsMaxDepth(bracketedString, bracketType);
if (desiredDepth == 0) {
//Default for this method is 0 (full depth).
desiredDepth = 1;
}
if (d == -1) {
// Throw Exception...
throw new IllegalArgumentException("\n\ngetNestedBracketedGroups() Method Error!\n"
+ "Brackets mismatch in supplied string!\n");
}
else if (d < desiredDepth) {
// Throw Another Exception...
throw new IllegalArgumentException("\n\ngetNestedBracketedGroups() Method Error!\n"
+ "Invalid Depth Supplied! Brackets within the supplied string go to a\n"
+ "maximum depth of (" + d + ") and therefore can not go to the supplied "
+ "depth\nof (" + desiredDepth + "). Change the desired depth.\n");
}


char open = '('; // Default
char close = ')'; // Default
String bType = Character.toString(open);
if (bracketType != null && !bracketType.trim().isEmpty()) {
bType = Character.toString(bracketType.trim().charAt(0));
}

switch (bType) {
case "(":
case ")":
open = '(';
close = ')';
break;
case "{":
case "}":
open = '{';
close = '}';
break;
case "[":
case "]":
open = '[';
close = ']';
break;
case "<":
case ">":
open = '<';
close = '>';
break;
default:
throw new IllegalArgumentException("\ngetNestedBracketedGroups() Method Error!\n"
+ "Unknown bracket type supplied (" + bType + ")!\n");
}

List<String> list = new ArrayList<>();
int n = bracketedString.length();
char[] c = bracketedString.toCharArray();
int depth = 0;
String strg = "";
for (int i = 0; i < n; i++) {
if (c[i] == open) {
depth++;
}
if ((depth >= desiredDepth)) {
if (c[i] == close) {
depth--;
}
strg += Character.toString(c[i]);

if (depth < desiredDepth) {
strg = strg.trim();
if (removeOuter) {
if (strg.startsWith(Character.toString(open))) {
strg = strg.substring(1);
}
if (strg.endsWith(Character.toString(close))) {
strg = strg.substring(0,
strg.lastIndexOf(Character.toString(close)));
}
}
list.add(strg);
strg = "";
}

continue;
}
if (c[i] == close) {
depth--;
}
if (!strg.isEmpty()) {
strg = strg.trim();
if (removeOuter) {
if (strg.startsWith(Character.toString(open))) {
strg = strg.substring(1);
}
if (strg.endsWith(Character.toString(close))) {
strg = strg.substring(0,
strg.lastIndexOf(Character.toString(close)));
}
}
list.add(strg);
strg = "";
}
}
if (!strg.isEmpty()) {
strg = strg.trim();
if (removeOuter) {
if (strg.startsWith(Character.toString(open))) {
strg = strg.substring(1);
}
if (strg.endsWith(Character.toString(close))) {
strg = strg.substring(0,
strg.lastIndexOf(Character.toString(close)));
}
}
list.add(strg);
}

return list.toArray(new String[list.size()]);
}

/**
* This method takes a string and returns the maximum depth of nested
* brackets. The bracket type to check the depth for is supplied within the
* bracketType parameter.<br><br>
*
* @param bracketedString (String) The string to process.<br>
*
* @param bracketType (String - Default is "()") Either a open bracket,
* or a close bracket, or both open and closed
* brackets can be supplied (no white-spaces). This
* method will process any <b>one</b> of 4 different
* bracket types and they are as follows:<pre>
*
* () Parentheses (Default)
* {} Curly Braces
* [] Square Brackets
* <> Chevron Brackets</pre>
*
* @return (Integer) The maximum depth of the supplied bracket type. 0 is
* returned if there are no brackets of the type supplied within the
* supplied string. -1 is returned if there is an unbalance within
* the supplied string of the supplied bracket type. For every open
* bracket there must be a close bracket and visa versa.
*/
public static int bracketsMaxDepth(String bracketedString, String... bracketType) {
char open = '('; // Default
char close = ')'; // Default
if (bracketType.length > 0) {
String bType = Character.toString(bracketType[0].charAt(0));
switch (bType) {
case "(":
case ")":
open = '(';
close = ')';
break;
case "{":
case "}":
open = '{';
close = '}';
break;
case "[":
case "]":
open = '[';
close = ']';
break;
case "<":
case ">":
open = '<';
close = '>';
break;
default:
throw new IllegalArgumentException("\nbracketsMaxDepth() Method Error!\n"
+ "Unknown bracket type supplied (" + bType + ")!\n");
}
}

int current_max = 0; // current count
int max = 0; // overall maximum count
int n = bracketedString.length();
char[] c = bracketedString.toCharArray();

// Traverse the input string
for (int i = 0; i < n; i++) {
if (c[i] == open) {
current_max++;
// update max if required
if (current_max > max) {

max = current_max;
}
}
else if (c[i] == close) {
if (current_max > 0) {
current_max--;
}
else {
return -1;
}
}
}

// finally check for unbalanced string
if (current_max != 0) {
return -1;
}
return max;
}

Here is an example of how you might use these methods:

String string = "('fdmg') (R:Percentual ('dmg') (20))";
System.out.println("String with parentheses: " + string);
int maxBracketDepth = bracketsMaxDepth(string, "()");
System.out.println("Depth of nested parentheses: " + maxBracketDepth);
System.out.println();

System.out.println("WITH BRACKETS:");
System.out.println("==============");
for (int i = 1; i <= maxBracketDepth; i++) {
String[] a = getNestedBracketedGroups(string, i, "()");
System.out.println("Parenthesized groups at depth: " + i);
for (String b : a) {
System.out.println(b);
}
System.out.println();
}

System.out.println("WITHOUT BRACKETS:");
System.out.println("=================");
for (int i = 1; i <= maxBracketDepth; i++) {
String[] a = getNestedBracketedGroups(string, i, "()", true);
System.out.println("Parenthesized groups at depth: " + i);
for (String b : a) {
System.out.println(b);
}
System.out.println();
}

Console Window should display:

String with parentheses: ('fdmg') (R:Percentual ('dmg') (20))
Depth of nested parentheses: 2

WITH BRACKETS:
==============
Parenthesized groups at depth: 1
('fdmg')
(R:Percentual ('dmg') (20))

Parenthesized groups at depth: 2
('dmg')
(20)

WITHOUT BRACKETS:
=================
Parenthesized groups at depth: 1
'fdmg'
R:Percentual ('dmg') (20)

Parenthesized groups at depth: 2
'dmg'
20


Related Topics



Leave a reply



Submit