How to Match String Within Parentheses (Nested) in Java

How to match string within parentheses (nested) in Java?

As I said, contrary to popular belief (don't believe everything people say) matching nested brackets is possible with regex.

The downside of using it is that you can only up to a fixed level of nesting. And for every additional level you wish to support, your regex will be bigger and bigger.

But don't take my word for it. Let me show you. The regex:

\([^()]*\)

Matches one level. For up to two levels, you'd need:

\(([^()]*|\([^()]*\))*\)

And so on. To keep adding levels, all you have to do is change the middle (second) [^()]* part to ([^()]*|\([^()]*\))* (check three levels here). As I said, it will get bigger and bigger.

Your problem:

For your case, two levels may be enough. So the Java code for it would be:

String fortranCode = "code code u(i, j, k) code code code code u(i, j, k(1)) code code code u(i, j, k(m(2))) should match this last 'u', but it doesnt.";
String regex = "(\\w+)(\\(([^()]*|\\([^()]*\\))*\\))"; // (\w+)(\(([^()]*|\([^()]*\))*\))
System.out.println(fortranCode.replaceAll(regex, "__$1%array$2"));

Input:

code code u(i, j, k) code code code code u(i, j, k(1)) code code code u(i, j, k(m(2))) should match this last 'u', but it doesnt.

Output:

code code __u%array(i, j, k) code code code code __u%array(i, j, k(1)) code code code u(i, j, __k%array(m(2))) should match this last 'u', but it doesnt.

Bottom line:

In the general case, parsers will do a better job - that's why people get so pissy about it. But for simple applications, regexes can pretty much be enough.

Note: Some flavors of regex support the nesting operator R (Java doesn't, PCRE engines like PHP and Perl do), which allows you to nest arbitrary number of levels. With them, you could do: \(([^()]|(?R))*\).

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

Indeed! It's possible using forward references:

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:







































































































NoteComponentDescription
(?=\()Make sure '(' follows before doing any hard work.
(?:Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('(?=This lookahead deals with finding the next '('.
.*?\((?!.*?\1)Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again
(.*\)(?!.*\2).*)Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')'(?=This lookahead deals with finding the next ')'
.*?\)(?!.*?\2)Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
.Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation.*?(?=\1)Match up to and including the last '(' found.
[^(]*(?=\2$)Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

How to extract sub strings containing multiple parentheses in Java?

I have developed what you requested but not just with regex, but a recursive function. Please check following code:

public static void main(String[] args)
{
String str = "(A(B(C(D(x)))))";
findStuff(str);

}

public static void findStuff(String str){
String pattern = "\\((.+)\\)";

Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(str);
while (m.find())
{
String sub = m.group(1);
System.out.println(" Word: " + sub);

findStuff(sub);
}
}

Dealing with nested parenthesis in regex

If you are only looking for a maximum depth of 5 then you can use the following regexp

(\((\((\((\((\(\))*\))*\))*\))*\))*

You can preveiw the results here http://regex101.com/r/zN1sZ2/1

As a bonus here is some psuedo code you can use to generate this string

var s = "_", depth = 5;
while(depth > 0) {
s = s.replace("_", "(\\(_\\))*");
depth--;
}
s = s.replace("_", "");

Now it as simple as changing one variable (depth) if your needs change and using the string s to perform your regexp

Match contents within square brackets, including nested square brackets

More direct solution

This solution will omit empty or whitespace only substrings

public static List<String> getStrsBetweenBalancedSubstrings(String s, Character markStart, Character markEnd) {
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastCloseBracket= 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1 && i != 0 && i!=lastCloseBracket &&
!s.substring(lastCloseBracket, i).trim().isEmpty()) {
subTreeList.add(s.substring(lastCloseBracket, i).trim());
}
}
} else if (c == markEnd) {
if (level > 0) {
level--;
lastCloseBracket = i+1;
}
}
}
if (lastCloseBracket != s.length() && !s.substring(lastCloseBracket).trim().isEmpty()) {
subTreeList.add(s.substring(lastCloseBracket).trim());
}
return subTreeList;
}

Then, use it as

String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
List<String> between_balanced = getStrsBetweenBalancedSubstrings(input, '[', ']');
System.out.println("Result: " + between_balanced);
// => Result: [Jim ate a, with, and, and ], and more here]

Original answer (more complex, shows a way to extract nested parentheses)

You can also extract all substrings inside balanced parentheses and then split with them:

String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
System.out.println("Balanced ones: " + balanced);
List<String> rx_split = new ArrayList<String>();
for (String item : balanced) {
rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
}
String rx = String.join("|", rx_split);
System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));

And this function will find all []-balanced substrings:

public static List<String> getBalancedSubstrings(String s, Character markStart, 
Character markEnd, Boolean includeMarkers) {
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenBracket = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenBracket = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}

See IDEONE demo

Result of the code execution:

Balanced ones: ['[sandwich], [pickles], [dried [onions]]', '[an[other] match]']
In-betweens: ['Jim ate a', 'with', 'and', 'and ]']

Credits: the getBalancedSubstrings is based on the peter.murray.rust's answer for How to split this “Tree-like” string in Java regex? post.

Regular expression to match balanced parentheses

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.

Regular Expression for separating strings enclosed in parentheses

So we can assume that the parentheses can nest at most two levels deep. So we can do it without too much magic. I would go with this code:

List<String> matches = new ArrayList<>();
Pattern p = Pattern.compile("\\([^()]*(?:\\([^()]*\\)[^()]*)*\\)");
Matcher m = p.matcher(inputStr);
while (m.find()) {
String fullMatch = m.group();
matches.add(fullMatch.substring(1, fullMatch.length() - 1));
}

Explanation:

  • First we match a parenthesis: \\(
  • Then we match some non-parenthesis characters: [^()]*
  • Then zero or more times: (?:...)* we will see some stuff within parentheses, and then some non-parentheses again:
  • \\([^()]*\\)[^()]* - it's important that we don't allow any more parentheses within the inside parentheses
  • And then the closing parenthesis comes: \\)
  • m.group(); returns the actual full match.
  • fullMatch.substring(1, fullMatch.length() - 1) removes the parentheses from the start and the end. You could do it with another group too. I just didn't want to make the regex uglier.

Regex nested brackets (ignoring inside brackets and whitespace)

Regex is not a good parser for text with nested blocks.

If you insist on using regex, you should match the outer part first:

@INPROCEEDINGS{Fogel95,
???
}

Capture the ???, so you can match on that in a nested loop.

The outer regex would be something like @(\w+)\{(\w+),([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}

The inner regex would be something like (\w+)\s*=\s*\{([^}]*)\}

Since a field value may be wrapped on multiple lines, you need to unwrap that.

Code

Pattern pTag = Pattern.compile("@(\\w+)" + // tag
"\\{" +
"(\\w+)" + // name
"," +
"([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content
"\\}");
Pattern pField = Pattern.compile("(\\w+)" + // field
"\\s*=\\s*" +
"\\{" +
"([^}]*)" + // value
"\\}");
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
String tag = mTag.group(1);
String name = mTag.group(2);
String content = mTag.group(3);
for (Matcher mField = pField.matcher(content); mField.find(); ) {
String field = mField.group(1);
String value = mField.group(2);
value = pNewline.matcher(value).replaceAll(" ");
System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
}
}

Test input

String input = "@INPROCEEDINGS{Fogel95,\n" +
" AUTHOR = {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
" TITLE = {An evolutionary programming approach to self-adaptation\n" +
" on finite state machines},\n" +
" BOOKTITLE = {Proceedings of the Fourth International Conference on\n" +
" Evolutionary Programming},\n" +
" YEAR = {1995},\n" +
" pages = {355--365}\n" +
"}\n" +
"\n" +
"@ARTICLE{Goldberg91,\n" +
" AUTHOR = {D. Goldberg},\n" +
" TITLE = {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
" JOURNAL = {Complex Systems},\n" +
" YEAR = {1991},\n" +
" pages = {139--167}\n" +
"}\n" +
"\n" +
"@INPROCEEDINGS{Yao96,\n" +
" AUTHOR = {X. Yao and Y. Liu},\n" +
" TITLE = {Fast evolutionary programming},\n" +
" BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
" Programming},\n" +
" YEAR = {1996},\n" +
" pages = {451--460}\n" +
"}";

Output

INPROCEEDINGS   Fogel95      AUTHOR      L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS Fogel95 TITLE An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS Fogel95 BOOKTITLE Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS Fogel95 YEAR 1995
INPROCEEDINGS Fogel95 pages 355--365
ARTICLE Goldberg91 AUTHOR D. Goldberg
ARTICLE Goldberg91 TITLE Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE Goldberg91 JOURNAL Complex Systems
ARTICLE Goldberg91 YEAR 1991
ARTICLE Goldberg91 pages 139--167


Related Topics



Leave a reply



Submit