How to match string within parentheses (nested) in Java?
As I said, contrary to popular belief (don't believe everything people say) matching nested brackets is possible with regex.
The downside of using it is that you can only up to a fixed level of nesting. And for every additional level you wish to support, your regex will be bigger and bigger.
But don't take my word for it. Let me show you. The regex:
\([^()]*\)
Matches one level. For up to two levels, you'd need:
\(([^()]*|\([^()]*\))*\)
And so on. To keep adding levels, all you have to do is change the middle (second) [^()]*
part to ([^()]*|\([^()]*\))*
(check three levels here). As I said, it will get bigger and bigger.
Your problem:
For your case, two levels may be enough. So the Java code for it would be:
String fortranCode = "code code u(i, j, k) code code code code u(i, j, k(1)) code code code u(i, j, k(m(2))) should match this last 'u', but it doesnt.";
String regex = "(\\w+)(\\(([^()]*|\\([^()]*\\))*\\))"; // (\w+)(\(([^()]*|\([^()]*\))*\))
System.out.println(fortranCode.replaceAll(regex, "__$1%array$2"));
Input:
code code u(i, j, k) code code code code u(i, j, k(1)) code code code u(i, j, k(m(2))) should match this last 'u', but it doesnt.
Output:
code code __u%array(i, j, k) code code code code __u%array(i, j, k(1)) code code code u(i, j, __k%array(m(2))) should match this last 'u', but it doesnt.
Bottom line:
In the general case, parsers will do a better job - that's why people get so pissy about it. But for simple applications, regexes can pretty much be enough.
Note: Some flavors of regex support the nesting operator R
(Java doesn't, PCRE engines like PHP and Perl do), which allows you to nest arbitrary number of levels. With them, you could do: \(([^()]|(?R))*\)
.
Is it possible to match nested brackets with a regex without using recursion or balancing groups?
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Proof
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Note | Component | Description |
---|---|---|
(?=\() | Make sure '(' follows before doing any hard work. | |
(?: | Start of group used to iterate through the string, so the following lookaheads match repeatedly. | |
Handle '(' | (?= | This lookahead deals with finding the next '('. |
.*?\((?!.*?\1) | Match up until the next '(' that is not followed by \1 . Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again | |
(.*\)(?!.*\2).*) | Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads. | |
) | ||
Handle ')' | (?= | This lookahead deals with finding the next ')' |
.*?\)(?!.*?\2) | Match up until the next ')' that is not followed by \2 . Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before. | |
(.*) | Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient. | |
) | ||
. | Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point. | |
)+? | Match as few times as possible until a balanced group has been found. This is validated by the following check | |
Final validation | .*?(?=\1) | Match up to and including the last '(' found. |
[^(]*(?=\2$) | Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group). |
How to extract sub strings containing multiple parentheses in Java?
I have developed what you requested but not just with regex, but a recursive function. Please check following code:
public static void main(String[] args)
{
String str = "(A(B(C(D(x)))))";
findStuff(str);
}
public static void findStuff(String str){
String pattern = "\\((.+)\\)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(str);
while (m.find())
{
String sub = m.group(1);
System.out.println(" Word: " + sub);
findStuff(sub);
}
}
Dealing with nested parenthesis in regex
If you are only looking for a maximum depth of 5 then you can use the following regexp
(\((\((\((\((\(\))*\))*\))*\))*\))*
You can preveiw the results here http://regex101.com/r/zN1sZ2/1
As a bonus here is some psuedo code you can use to generate this string
var s = "_", depth = 5;
while(depth > 0) {
s = s.replace("_", "(\\(_\\))*");
depth--;
}
s = s.replace("_", "");
Now it as simple as changing one variable (depth
) if your needs change and using the string s
to perform your regexp
Match contents within square brackets, including nested square brackets
More direct solution
This solution will omit empty or whitespace only substrings
public static List<String> getStrsBetweenBalancedSubstrings(String s, Character markStart, Character markEnd) {
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastCloseBracket= 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1 && i != 0 && i!=lastCloseBracket &&
!s.substring(lastCloseBracket, i).trim().isEmpty()) {
subTreeList.add(s.substring(lastCloseBracket, i).trim());
}
}
} else if (c == markEnd) {
if (level > 0) {
level--;
lastCloseBracket = i+1;
}
}
}
if (lastCloseBracket != s.length() && !s.substring(lastCloseBracket).trim().isEmpty()) {
subTreeList.add(s.substring(lastCloseBracket).trim());
}
return subTreeList;
}
Then, use it as
String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
List<String> between_balanced = getStrsBetweenBalancedSubstrings(input, '[', ']');
System.out.println("Result: " + between_balanced);
// => Result: [Jim ate a, with, and, and ], and more here]
Original answer (more complex, shows a way to extract nested parentheses)
You can also extract all substrings inside balanced parentheses and then split with them:
String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
System.out.println("Balanced ones: " + balanced);
List<String> rx_split = new ArrayList<String>();
for (String item : balanced) {
rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
}
String rx = String.join("|", rx_split);
System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));
And this function will find all []
-balanced substrings:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers) {
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenBracket = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenBracket = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
See IDEONE demo
Result of the code execution:
Balanced ones: ['[sandwich], [pickles], [dried [onions]]', '[an[other] match]']
In-betweens: ['Jim ate a', 'with', 'and', 'and ]']
Credits: the getBalancedSubstrings
is based on the peter.murray.rust's answer for How to split this “Tree-like” string in Java regex? post.
Regular expression to match balanced parentheses
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
Regular Expression for separating strings enclosed in parentheses
So we can assume that the parentheses can nest at most two levels deep. So we can do it without too much magic. I would go with this code:
List<String> matches = new ArrayList<>();
Pattern p = Pattern.compile("\\([^()]*(?:\\([^()]*\\)[^()]*)*\\)");
Matcher m = p.matcher(inputStr);
while (m.find()) {
String fullMatch = m.group();
matches.add(fullMatch.substring(1, fullMatch.length() - 1));
}
Explanation:
- First we match a parenthesis:
\\(
- Then we match some non-parenthesis characters: [^()]*
- Then zero or more times:
(?:...)*
we will see some stuff within parentheses, and then some non-parentheses again: \\([^()]*\\)[^()]*
- it's important that we don't allow any more parentheses within the inside parentheses- And then the closing parenthesis comes:
\\)
m.group();
returns the actual full match.fullMatch.substring(1, fullMatch.length() - 1)
removes the parentheses from the start and the end. You could do it with another group too. I just didn't want to make the regex uglier.
Regex nested brackets (ignoring inside brackets and whitespace)
Regex is not a good parser for text with nested blocks.
If you insist on using regex, you should match the outer part first:
@INPROCEEDINGS{Fogel95,
???
}
Capture the ???
, so you can match on that in a nested loop.
The outer regex would be something like @(\w+)\{(\w+),([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}
The inner regex would be something like (\w+)\s*=\s*\{([^}]*)\}
Since a field value may be wrapped on multiple lines, you need to unwrap that.
Code
Pattern pTag = Pattern.compile("@(\\w+)" + // tag
"\\{" +
"(\\w+)" + // name
"," +
"([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content
"\\}");
Pattern pField = Pattern.compile("(\\w+)" + // field
"\\s*=\\s*" +
"\\{" +
"([^}]*)" + // value
"\\}");
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
String tag = mTag.group(1);
String name = mTag.group(2);
String content = mTag.group(3);
for (Matcher mField = pField.matcher(content); mField.find(); ) {
String field = mField.group(1);
String value = mField.group(2);
value = pNewline.matcher(value).replaceAll(" ");
System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
}
}
Test input
String input = "@INPROCEEDINGS{Fogel95,\n" +
" AUTHOR = {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
" TITLE = {An evolutionary programming approach to self-adaptation\n" +
" on finite state machines},\n" +
" BOOKTITLE = {Proceedings of the Fourth International Conference on\n" +
" Evolutionary Programming},\n" +
" YEAR = {1995},\n" +
" pages = {355--365}\n" +
"}\n" +
"\n" +
"@ARTICLE{Goldberg91,\n" +
" AUTHOR = {D. Goldberg},\n" +
" TITLE = {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
" JOURNAL = {Complex Systems},\n" +
" YEAR = {1991},\n" +
" pages = {139--167}\n" +
"}\n" +
"\n" +
"@INPROCEEDINGS{Yao96,\n" +
" AUTHOR = {X. Yao and Y. Liu},\n" +
" TITLE = {Fast evolutionary programming},\n" +
" BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
" Programming},\n" +
" YEAR = {1996},\n" +
" pages = {451--460}\n" +
"}";
Output
INPROCEEDINGS Fogel95 AUTHOR L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS Fogel95 TITLE An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS Fogel95 BOOKTITLE Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS Fogel95 YEAR 1995
INPROCEEDINGS Fogel95 pages 355--365
ARTICLE Goldberg91 AUTHOR D. Goldberg
ARTICLE Goldberg91 TITLE Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE Goldberg91 JOURNAL Complex Systems
ARTICLE Goldberg91 YEAR 1991
ARTICLE Goldberg91 pages 139--167
Related Topics
Add Bean Programmatically to Spring Web App Context
Reliable File.Renameto() Alternative on Windows
Jlabel - Show Longer Text as Multiple Lines
How to Sort an Array of Ints Using a Custom Comparator
How to Replace Case-Insensitive Literal Substrings in Java
Socket Programming Multiple Client to One Server
Add an Background Image to a Panel
Why Is String Immutable in Java
In Java, What Are the Advantages of Streams Over Loops
Jaxb: How to Ignore Namespace During Unmarshalling Xml Document
Should We @Override an Interface's Method Implementation
Should I Return a Collection or a Stream
Java "User.Dir" Property - What Exactly Does It Mean