Using Regex to Balance Match Parenthesis

Regular expression to match balanced parentheses

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.

Using RegEx to balance match parenthesis

Regular Expressions can definitely do balanced parentheses matching. It can be tricky, and requires a couple of the more advanced Regex features, but it's not too hard.

Example:

var r = new Regex(@"
func([a-zA-Z_][a-zA-Z0-9_]*) # The func name

\( # First '('
(?:
[^()] # Match all non-braces
|
(?<open> \( ) # Match '(', and capture into 'open'
|
(?<-open> \) ) # Match ')', and delete the 'open' capture
)+
(?(open)(?!)) # Fails if 'open' stack isn't empty!

\) # Last ')'
", RegexOptions.IgnorePatternWhitespace);

Balanced matching groups have a couple of features, but for this example, we're only using the capture deleting feature. The line (?<-open> \) ) will match a ) and delete the previous "open" capture.

The trickiest line is (?(open)(?!)), so let me explain it. (?(open) is a conditional expression that only matches if there is an "open" capture. (?!) is a negative expression that always fails. Therefore, (?(open)(?!)) says "if there is an open capture, then fail".

Microsoft's documentation was pretty helpful too.

What is the balance matching regular expression for removing nested brackets composed of sets of ordered characters?

Your current 12...34 matching regex is not right since the tempered greedy token used is "corrupt" ((?!(12|34))* is missing the consuming part, .).

You just need to remember about the parts of the regex like that: 1) the leading delimiter pattern, 2) the trailing delimiter pattern, 3) the part in between should match what is not both 1 and 2, 4) the conditional construct that checks if the "technical" group capture stack is empty.

So, the numeric regex can be fixed as

12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34

(regex demo) and the CDATA one will look like

<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>

See this regex demo

NOTE: If there can be newline symbols in the string input, use RegexOptions.Singleline option or the inline modifier version, (?s), at the pattern start.

Pattern details:

  • 12 - the leading delimiter pattern
  • (?> - start of the atomic group that will match what is neither leading nor trailing patterns, and will keep track of those delimiting substrings:

    • (?!12|34).| - match any char (if RegexOptions.Singleline option is used, even including a newline) but a char that is a starting point of the 12 or 34 sequences
    • (?<o>)12| - match12` and increment the "o" group capture stack, or
    • (?<-o>)34 - match 34 and decrement the "o" group capture stack
  • )* - and repeat that (keep matching) zero or more occurrences of the patterns inside the atomic group
  • (?(o)(?!)) - the conditional construct that will check if the "o" group capture stack is empty. If it is not empty, backtracking will trigger, and balanced number of leading/trailing delimiters will be searched for.
  • 34 - the trailing delimiter pattern.

Also, [ in <![CDATA[ must be escaped, as [ is a special char outside the character class, and ] in ]]> do not have to be escaped, since outside a character class, ] is not special for a .NET regex.

Regex to capture unpaired brackets or parentheses

Maybe,

\b\d+\)

might simply return the desired output, I guess.

Demo 1

Another way is to see what left boundary you might have, which in this case, I see digits, then what other chars we'd have prior to the closing curly bracket, and then we can design some other simple expression similar to:

\b\d[^)]*\) 

Demo 2

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class RegularExpression{

public static void main(String[] args){

final String regex = "\\b\\d[^)]*\\)";
final String string = "Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1a b) demanding and 2a a) costly.\n\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1b) demanding and 2b) costly.\n\n"
+ "It is not supposed to match ( s s 1) \n";

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}


}
}

Output

Full match: 1)
Full match: 2)
Full match: 1a b)
Full match: 2a a)
Full match: 1b)
Full match: 2b)
Full match: 1)

RegEx Circuit

jex.im visualizes regular expressions:

Sample Image

Matching parenthesis in PHP

Finally figured something out.

function isParenthesis(string $expression){
$patterns = array('{','}','(',')','[',']');
$occ = array(0,0,0,0,0,0);

for($i = 0; $i < strlen($expression); $i++){
for($j = 0; $j < count($patterns); $j++){
if($expression[$i] == $patterns[$j]){
$occ[$j]++;
}
}
}
for ($i=0; $i < count($occ); $i+=2) {
if($occ[$i] != $occ[$i+1]){
return 0;
}
}
return 1;
}

$str = "{[{iHTSc}]}p(R)m(){q({})";
echo "l'expression ".$str." est correcte ? : ".isParenthesis($str);

And it works greatly ! Thanks for your contributions.

regex where parenthesis might not be balanced

As I mentioned in a comment, I can't help with .NET, but I can give you an expression that might help. I think the solution requires "negative lookahead", and perl offers that. The problem is that I haven't used perl in so long I've forgotten how to get it to march through the entire stream. If I break the stream into chunks of "(...) Tj", each on its own line, my script will work on all your examples:

$ cat pdf_data_line_by_line.txt
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for Additional Information) Tj
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for (Additional Information) Tj
0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj
0 g 1 0 0 1 2 369.42 Tm 0 g () Tj
0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj
0 g 1 0 0 1 2 341.68 Tm 0 g () Tj
0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here) Tj
$ cat get_pdf_text.pl
#!/usr/bin/perl
while (<>) {
# find some text
if ( /[^(]*\((?!\)).*\) Tj/ ) {
# strip off leading junk
s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/;
# output saved part of match
print $_;
print "YOUR DELIMITER HERE\n";
}
}
$ cat pdf_data_line_by_line.txt | ./get_pdf_text.pl
RE: Request for Additional Information
YOUR DELIMITER HERE
RE: Request for (Additional Information
YOUR DELIMITER HERE
13. Processing TT Instructions -) Audit Note 12
YOUR DELIMITER HERE
Dear test:
YOUR DELIMITER HERE
Thank you for the more random words here. )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

However, if I combine the examples into a single stream, it stops after the first one. I tried using "g" at the end of the 's' command, but it didn't help:

$ cat pdf_data_single_stream.txt
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for (Additional Information) Tj 0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj 0 g 1 0 0 1 2 369.42 Tm 0 g () Tj 0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj 0 g 1 0 0 1 2 341.68 Tm 0 g () Tj 0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here) Tj
$ cat pdf_data_single_stream.txt | ./get_pdf_text.pl
RE: Request for (Additional Information) Tj 0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj 0 g 1 0 0 1 2 369.42 Tm 0 g () Tj 0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj 0 g 1 0 0 1 2 341.68 Tm 0 g () Tj 0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

The replacement string ...

s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/

... does the following: find zero or more characters that are NOT '(', followed by a single '(' that is NOT followed by a ')' (this is where you need negative lookahead, and this eliminates '() Tj' cases), followed by zero or more spaces, then remember {the one following character if it is not a ')' and zero or more following characters}, if followed by a ') Tj', and replace all that by the remembered string.
If anyone can suggest the (probably very simple) way to get the script to march all the way through the stream, then that should solve the problem at hand.

Regex to match number in a string enclosed within brackets or Parenthesis

You may use the following .NET regex:

(?:(\()|\[)(.*?)(?(1)\)|])

See the regex demo

Details

  • (?:(\()|\[) - a non-capturing group that matches and captures into Group 1 a ( char, else just matches a [ char
  • (.*?) - Group 2: any 0 or more chars other than a newline char, as few as possible (instead of .*?, you might want to use \d+ there to match 1 or more digits, or \d{4} to match just four digits exactly, or even (?:20|19)\d{2} to match a year in the 20th and 21st c.)
  • (?(1)\)|]) - a conditional construct: if Group 1 was matched, a ) is matched, else, a ] char.

Regex for getting parentheses within parentheses infinitely

Regex flavor in Java doesn't support recursion like some other flavors do. Instead you can write your own method which will build strings from characters only if they are:

  • (
  • )
  • inside parenthesis.

To know if currently handled character is inside parenthesis we can create counter which will check parenthesis balance (you can also think of it as counter for nesting level). In short: if we saw more ( than ) then we are inside unclosed (open) parenthesis section, so we should add current character to resulting string.

Using that idea our code can look like:

String str = "hello world () (foo bar) (foo bar 2 (this looks cozy)) (foo bar 3...)";
List<String> result = new ArrayList<>();

int parenthesisNestingLevel = 0;
StringBuilder sb = new StringBuilder();
for (char ch : str.toCharArray()) {
if (ch == '(') {
parenthesisNestingLevel++;
sb.append(ch);
} else if (ch == ')') {
parenthesisNestingLevel--;
sb.append(ch);
if (parenthesisNestingLevel == 0) {
result.add(sb.toString());
sb.delete(0, sb.length());//reset sb
}
} else if (parenthesisNestingLevel > 0) {//we are inside unclosed parenthesis
sb.append(ch);
}
}

result.forEach(System.out::println);

Output:

()
(foo bar)
(foo bar 2 (this looks cozy))
(foo bar 3...)


Related Topics



Leave a reply



Submit