Regex Nested Parentheses

Can regular expressions be used to match nested patterns?

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: up to O(n^3) for general context-free parsing vs. O(n) for a regular expression, with n = length(input). (LL and LR parsers themselves run in linear time on the grammars they accept.)
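To make the fixed-depth idea concrete, here is a minimal Python sketch that builds such a pattern by nesting it inside itself a chosen number of times. The helper name balanced_pattern is made up for this illustration, and the sketch only handles round parentheses.

```python
import re


def balanced_pattern(max_depth: int) -> str:
    """Build a regex for one parenthesised group nested at most max_depth levels."""
    # Depth 1: a pair of parentheses with no parentheses inside.
    pattern = r"\([^()]*\)"
    for _ in range(max_depth - 1):
        # One more level: plain characters or a group of the previous depth.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


depth_two = re.compile(balanced_pattern(2))
print(depth_two.fullmatch("(a (b) c)") is not None)      # True
print(depth_two.fullmatch("(a (b (c)) d)") is not None)  # False: three levels deep
```

Every extra level has to be written into the pattern up front, which is exactly why no single regular expression of this kind covers arbitrary depth.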

There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.

For more background: Automata Theory at Wikipedia

What is a RegEx that can match text in parentheses with nested parentheses?

To extract the text from your example data, I think you can use this regex:

\(pattern:?\s?(.+?\)?)\)

  • match \(pattern
  • an optional colon: :?
  • an optional whitespace \s?
  • start capturing group (
  • capture one or more characters non greedy .+?
  • an optional \)
  • close capturing group
  • match \)

    var string = "Some text (pattern: SOME TEXT THAT (I WANT TO EXTRACT)) a bit more text (another pattern: ignore that text) and may be a little more text Some text (pattern: SOME TEXT THAT I WANT TO EXTRACT) a bit more text (another pattern: ignore that text) and may be a little more text";
    var myRegexp = /\(pattern:?\s?(.+?\)?)\)/g;
    var matches;
    while ((matches = myRegexp.exec(string)) !== null) {
        console.log(matches[1]);
    }
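The same pattern should work unchanged in Python's re module. A small sketch, using a shortened sample string of my own rather than the original data:

```python
import re

# Same pattern as above, written as a Python raw string.
PATTERN = re.compile(r"\(pattern:?\s?(.+?\)?)\)")

text = (
    "Some text (pattern: SOME TEXT THAT (I WANT TO EXTRACT)) a bit more text "
    "(another pattern: ignore that text) and (pattern: PLAIN TEXT) at the end"
)

# With a single capturing group, findall returns that group for each match.
extracted = PATTERN.findall(text)
print(extracted)
# ['SOME TEXT THAT (I WANT TO EXTRACT)', 'PLAIN TEXT']
```

Note that the optional \)? only helps when the nested group sits right before the closing parenthesis. For input like (pattern: a (b) c) the capture stops at the first ')' and yields a (b.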

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
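If your flavour lacks forward references, the same result (every balanced group, inner ones included) can be had without a regex at all. This is a plain stack-based sketch in Python, not a translation of the expression above; the function name balanced_groups is my own, and unlike the regex it silently ignores unmatched parentheses.

```python
def balanced_groups(text: str) -> list[str]:
    """Return every balanced (...) group in text, inner groups included."""
    open_positions = []   # stack of indexes of '(' still waiting for a ')'
    found = []            # (start, end) index pairs of completed groups
    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)
        elif char == ")" and open_positions:
            start = open_positions.pop()
            found.append((start, index + 1))
    # Sort by start so each outer group comes before the groups it contains.
    return [text[start:end] for start, end in sorted(found)]


print(balanced_groups("a (b (c) d) e (f)"))
# ['(b (c) d)', '(c)', '(f)']
```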

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

Component, followed by its description (the notes mark where each stage begins):

  • (?=\()
    Make sure '(' follows before doing any hard work.
  • (?:
    Start of group used to iterate through the string, so the following lookaheads match repeatedly.

Handle '(':

  • (?=
    This lookahead deals with finding the next '('.
  • .*?\((?!.*?\1)
    Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again.
  • (.*\)(?!.*\2).*)
    Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
  • )
    End of the lookahead.

Handle ')':

  • (?=
    This lookahead deals with finding the next ')'.
  • .*?\)(?!.*?\2)
    Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
  • (.*)
    Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so a simple expression is sufficient.
  • )
    End of the lookahead.
  • .
    Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
  • )+?
    Match as few times as possible until a balanced group has been found. This is validated by the following check.

Final validation:

  • .*?(?=\1)
    Match up to and including the last '(' found.
  • [^(]*(?=\2$)
    Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

RegEx for capturing values in nested brackets

If everything between the outermost [ and ] is wanted, the expression can be simplified to something like:

(?:\[+)(.+?)(?:\]+)

Here, we capture our likely desired substring in this capturing group:

(.+?)

Then, we add two boundaries on its left and right sides using two non-capturing groups:

(?:\[+)
(?:\]+)

Demo

const regex = /(?:\[+)(.+?)(?:\]+)/g;
const str = `[[[hello-hello]][[hi-hi]]][[hi hi]]]`;
const subst = `$1`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);
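For comparison, a short Python sketch of the same expression. Here findall lists the captured values one by one, which may be closer to what is wanted than a single substituted string; the variable names are my own.

```python
import re

BRACKETED = re.compile(r"(?:\[+)(.+?)(?:\]+)")
sample = "[[[hello-hello]][[hi-hi]]][[hi hi]]]"

# One captured value per run of brackets.
values = BRACKETED.findall(sample)
print(values)  # ['hello-hello', 'hi-hi', 'hi hi']

# Replacing each match with its captured group, as in the JavaScript demo.
substituted = BRACKETED.sub(r"\1", sample)
print(substituted)  # hello-hellohi-hihi hi
```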

How can I match nested brackets using regex?

Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl and PHP (PCRE) support recursive patterns, and .NET offers balancing groups for the same purpose.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

while ($text =~ /(\(([^()]|(?R))*\))/g) {
    print("----------\n$1\n");
}

which will print:

----------
(outer
(center
(inner)
(inner)
center)
ouer)
----------
(outer
(inner)
ouer)
----------
(outer
ouer)

Or, the PHP equivalent:

$text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
[0] => Array
(
[0] => (outer
(center
(inner)
(inner)
center)
ouer)
[1] => (outer
(inner)
ouer)
[2] => (outer
ouer)
)

...

An explanation:


( # start group 1
\( # match a literal '('
( # group 2
[^()] # any char other than '(' and ')'
| # OR
(?R) # recursively match the entire pattern
)* # end group 2 and repeat zero or more times
\) # match a literal ')'
) # end group 1
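Python's built-in re module has no (?R), but the structure of the pattern maps directly onto a small recursive function. The sketch below mirrors the three pieces of the pattern. The names match_group and top_level_groups are invented for this example, and very deep nesting would hit Python's recursion limit.

```python
def match_group(text: str, start: int) -> int:
    """Return the index just past the balanced group opening at start, or -1."""
    if start >= len(text) or text[start] != "(":
        return -1
    index = start + 1
    while index < len(text):
        if text[index] == ")":          # \)    closes the current group
            return index + 1
        if text[index] == "(":          # (?R)  recurse into a nested group
            index = match_group(text, index)
            if index == -1:
                return -1
        else:                           # [^()] any other character
            index += 1
    return -1                           # input ended before the closing ')'


def top_level_groups(text: str) -> list[str]:
    """Collect non-overlapping balanced groups, scanning left to right."""
    groups, index = [], 0
    while index < len(text):
        end = match_group(text, index)
        if end == -1:
            index += 1                  # no group starts here, move on
        else:
            groups.append(text[index:end])
            index = end                 # resume after the match, like /g
    return groups


print(top_level_groups("(outer (inner) outer) x (solo)"))
# ['(outer (inner) outer)', '(solo)']
```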

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP: Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.

Regex nested parentheses

You can use this:

(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+

With your example you obtain:

0 => username
1 => TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2))
2 => password

Explanation:

(?>\w+\.)? \w+ \(    # the opening parenthesis (with the function name)
(?>                  # open an atomic group
    \( (?<DEPTH>)    # when an opening parenthesis is encountered,
                     # then increment the stack named DEPTH
  |                  # OR
    \) (?<-DEPTH>)   # when a closing parenthesis is encountered,
                     # then decrement the stack named DEPTH
  |                  # OR
    [^()]+           # content that is not a parenthesis
)*                   # close the atomic group, repeat zero or more times
\)                   # the closing parenthesis
(?(DEPTH)(?!))       # conditional: if the stack named DEPTH is not empty
                     # then fail (i.e. the parentheses are not balanced)

You can try it with this code:

using System;
using System.Text.RegularExpressions;

string input = "username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password";
string pattern = @"(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[0].Value);
}
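Outside .NET there are no balancing groups, but the DEPTH bookkeeping is easy to reproduce with an integer. The following Python sketch gets the same three items by splitting on commas that sit at depth zero. It is a different technique from the regex above, and split_top_level is a name chosen for this example.

```python
def split_top_level(text: str, separator: str = ",") -> list[str]:
    """Split text on separator, ignoring separators inside parentheses."""
    parts, current, depth = [], [], 0
    for char in text:
        if char == "(":
            depth += 1                  # same role as pushing onto DEPTH
        elif char == ")":
            depth -= 1                  # same role as popping from DEPTH
            if depth < 0:
                raise ValueError("unbalanced ')'")
        if char == separator and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if depth != 0:
        # Plays the part of the final (?(DEPTH)(?!)) check.
        raise ValueError("unbalanced '('")
    parts.append("".join(current))
    return parts


fields = split_top_level(
    "username,TB_PEOPLE.fields(FirstName,LastName,"
    "TB_PHONE.fields(num_phone1, num_phone2)),password"
)
print(fields)
```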

Regular expression for nested brackets that contain a symbol

I would tokenise the input by splitting it at commas and parentheses while keeping those delimiters as tokens. A recursive algorithm can then detect whether a comma appears directly inside a given pair of parentheses and make the appropriate replacement.

Here is a function doing the job:

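// Consumes tokens up to the matching ")" and returns
// [rewritten text, whether a comma appeared directly at this level].
// Declared at top level: a named function nested inside
// replaceWithBrackets() would be redeclared on the second call.
function recur(&$tokens) {
    $comma = false;
    $replaced = "";
    while (true) {
        $token = current($tokens);
        next($tokens);
        if ($token == ")" || $token === false) break;
        if ($token == "(") {
            [$substr, $subcomma] = recur($tokens);
            $replaced .= $subcomma ? "[$substr]" : "($substr)";
        } else {
            $comma = $comma || $token == ",";
            $replaced .= $token;
        }
    }
    return [$replaced, $comma];
}

function replaceWithBrackets($s) {
    // PREG_SPLIT_DELIM_CAPTURE keeps "(", ")" and "," as tokens.
    $tokens = preg_split("~([(),])~", $s, 0, PREG_SPLIT_DELIM_CAPTURE);
    return recur($tokens)[0];
}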
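The same tokenise-then-recurse idea can be sketched in Python. This is my own port for illustration, not part of the original answer: re.split with a capturing group keeps the delimiters, and a shared iterator plays the role of PHP's internal array pointer.

```python
import re


def replace_with_brackets(text: str) -> str:
    """Turn (...) into [...] for each pair that directly contains a comma."""
    # The capturing group makes re.split keep "(", ")" and "," as tokens.
    tokens = iter(re.split(r"([(),])", text))

    def recur() -> tuple[str, bool]:
        pieces, has_comma = [], False
        for token in tokens:            # shared iterator: each token is consumed once
            if token == ")":
                break
            if token == "(":
                inner, inner_has_comma = recur()
                pieces.append(f"[{inner}]" if inner_has_comma else f"({inner})")
            else:
                has_comma = has_comma or token == ","
                pieces.append(token)
        return "".join(pieces), has_comma

    return recur()[0]


print(replace_with_brackets("f(a, g(b), (c))"))
# f[a, g(b), (c)]
```

Only the pair that directly contains a comma is rewritten; g(b) and (c) keep their round parentheses.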

Regular expression to return string split up respecting nested parentheses

Using regex only for the task might work but it wouldn't be straightforward.

Another possibility is writing a simple algorithm to track the parentheses in the string:

  1. Split the string at all parentheses, keeping the delimiters in the result (e.g. using re.split with a capturing group).
  2. Keep two counters tracking the parentheses: start_parens_count for ( and end_parens_count for ).
  3. Using the counters, proceed by either splitting at whitespace or adding the current data to a temporary variable (term).
  4. When the leftmost parenthesis has been closed, append term to the list of values and reset the counters and the temporary variable.

Here's an example:

import re

string = "1 2 3 (test 0, test 0) (test (0 test) 0)"

result, start_parens_count, end_parens_count, term = [], 0, 0, ""
for x in re.split(r"([()])", string):
    if not x.strip():
        continue
    elif x == "(":
        if start_parens_count > 0:
            term += "("
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"
    elif start_parens_count > end_parens_count:
        term += x
    else:
        result.extend(x.strip(" ").split(" "))

print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']

Not very elegant, but works.


