Regular expression to match balanced parentheses
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
Using RegEx to balance match parenthesis
Regular Expressions can definitely do balanced parentheses matching. It can be tricky, and requires a couple of the more advanced Regex features, but it's not too hard.
Example:
var r = new Regex(@"
func([a-zA-Z_][a-zA-Z0-9_]*) # The func name
\( # First '('
(?:
[^()] # Match all non-braces
|
(?<open> \( ) # Match '(', and capture into 'open'
|
(?<-open> \) ) # Match ')', and delete the 'open' capture
)+
(?(open)(?!)) # Fails if 'open' stack isn't empty!
\) # Last ')'
", RegexOptions.IgnorePatternWhitespace);
Balanced matching groups have a couple of features, but for this example, we're only using the capture deleting feature. The line (?<-open> \) )
will match a )
and delete the previous "open" capture.
The trickiest line is (?(open)(?!))
, so let me explain it. (?(open)
is a conditional expression that only matches if there is an "open" capture. (?!)
is a negative expression that always fails. Therefore, (?(open)(?!))
says "if there is an open capture, then fail".
Microsoft's documentation was pretty helpful too.
What is the balance matching regular expression for removing nested brackets composed of sets of ordered characters?
Your current 12...34
matching regex is not right since the tempered greedy token used is "corrupt" ((?!(12|34))*
is missing the consuming part, .
).
You just need to remember about the parts of the regex like that: 1) the leading delimiter pattern, 2) the trailing delimiter pattern, 3) the part in between should match what is not both 1 and 2, 4) the conditional construct that checks if the "technical" group capture stack is empty.
So, the numeric regex can be fixed as
12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34
(regex demo) and the CDATA one will look like
<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>
See this regex demo
NOTE: If there can be newline symbols in the string input, use RegexOptions.Singleline
option or the inline modifier version, (?s)
, at the pattern start.
Pattern details:
12
- the leading delimiter pattern(?>
- start of the atomic group that will match what is neither leading nor trailing patterns, and will keep track of those delimiting substrings:(?!12|34).|
- match any char (ifRegexOptions.Singleline
option is used, even including a newline) but a char that is a starting point of the12
or34
sequences(?<o>)12| - match
12` and increment the "o" group capture stack, or(?<-o>)34
- match34
and decrement the "o" group capture stack
)*
- and repeat that (keep matching) zero or more occurrences of the patterns inside the atomic group(?(o)(?!))
- the conditional construct that will check if the "o" group capture stack is empty. If it is not empty, backtracking will trigger, and balanced number of leading/trailing delimiters will be searched for.34
- the trailing delimiter pattern.
Also, [
in <![CDATA[
must be escaped, as [
is a special char outside the character class, and ]
in ]]>
do not have to be escaped, since outside a character class, ]
is not special for a .NET regex.
Regex to capture unpaired brackets or parentheses
Maybe,
\b\d+\)
might simply return the desired output, I guess.
Demo 1
Another way is to see what left boundary you might have, which in this case, I see digits, then what other chars we'd have prior to the closing curly bracket, and then we can design some other simple expression similar to:
\b\d[^)]*\)
Demo 2
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "\\b\\d[^)]*\\)";
final String string = "Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1a b) demanding and 2a a) costly.\n\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1b) demanding and 2b) costly.\n\n"
+ "It is not supposed to match ( s s 1) \n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output
Full match: 1)
Full match: 2)
Full match: 1a b)
Full match: 2a a)
Full match: 1b)
Full match: 2b)
Full match: 1)
RegEx Circuit
jex.im visualizes regular expressions:
Matching parenthesis in PHP
Finally figured something out.
function isParenthesis(string $expression){
$patterns = array('{','}','(',')','[',']');
$occ = array(0,0,0,0,0,0);
for($i = 0; $i < strlen($expression); $i++){
for($j = 0; $j < count($patterns); $j++){
if($expression[$i] == $patterns[$j]){
$occ[$j]++;
}
}
}
for ($i=0; $i < count($occ); $i+=2) {
if($occ[$i] != $occ[$i+1]){
return 0;
}
}
return 1;
}
$str = "{[{iHTSc}]}p(R)m(){q({})";
echo "l'expression ".$str." est correcte ? : ".isParenthesis($str);
And it works greatly ! Thanks for your contributions.
regex where parenthesis might not be balanced
As I mentioned in a comment, I can't help with .NET, but I can give you an expression that might help. I think the solution requires "negative lookahead", and perl offers that. The problem is that I haven't used perl in so long I've forgotten how to get it to march through the entire stream. If I break the stream into chunks of "(...) Tj", each on its own line, my script will work on all your examples:
$ cat pdf_data_line_by_line.txt
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for Additional Information) Tj
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for (Additional Information) Tj
0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj
0 g 1 0 0 1 2 369.42 Tm 0 g () Tj
0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj
0 g 1 0 0 1 2 341.68 Tm 0 g () Tj
0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here) Tj
$ cat get_pdf_text.pl
#!/usr/bin/perl
while (<>) {
# find some text
if ( /[^(]*\((?!\)).*\) Tj/ ) {
# strip off leading junk
s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/;
# output saved part of match
print $_;
print "YOUR DELIMITER HERE\n";
}
}
$ cat pdf_data_line_by_line.txt | ./get_pdf_text.pl
RE: Request for Additional Information
YOUR DELIMITER HERE
RE: Request for (Additional Information
YOUR DELIMITER HERE
13. Processing TT Instructions -) Audit Note 12
YOUR DELIMITER HERE
Dear test:
YOUR DELIMITER HERE
Thank you for the more random words here. )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE
However, if I combine the examples into a single stream, it stops after the first one. I tried using "g" at the end of the 's' command, but it didn't help:
$ cat pdf_data_single_stream.txt
q Q /Tx BMC q 0 0 471.34 407.34 re W n BT 1 0 0 1 2 397.16 Tm /Helv 12 Tf 0 g (RE: Request for (Additional Information) Tj 0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj 0 g 1 0 0 1 2 369.42 Tm 0 g () Tj 0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj 0 g 1 0 0 1 2 341.68 Tm 0 g () Tj 0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here) Tj
$ cat pdf_data_single_stream.txt | ./get_pdf_text.pl
RE: Request for (Additional Information) Tj 0 g 1 0 0 1 2 383.29 Tm 0 g ( 13. Processing TT Instructions -) Audit Note 12) Tj 0 g 1 0 0 1 2 369.42 Tm 0 g () Tj 0 g 1 0 0 1 2 355.55 Tm 0 g (Dear test:) Tj 0 g 1 0 0 1 2 341.68 Tm 0 g () Tj 0 g 1 0 0 1 2 327.8 Tm 0 g (Thank you for the more random words here. )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE
The replacement string ...
s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/
... does the following: find zero or more characters that are NOT '(', followed by a single '(' that is NOT followed by a ')' (this is where you need negative lookahead, and this eliminates '() Tj' cases), followed by zero or more spaces, then remember {the one following character if it is not a ')' and zero or more following characters}, if followed by a ') Tj', and replace all that by the remembered string.
If anyone can suggest the (probably very simple) way to get the script to march all the way through the stream, then that should solve the problem at hand.
Regex to match number in a string enclosed within brackets or Parenthesis
You may use the following .NET regex:
(?:(\()|\[)(.*?)(?(1)\)|])
See the regex demo
Details
(?:(\()|\[)
- a non-capturing group that matches and captures into Group 1 a(
char, else just matches a[
char(.*?)
- Group 2: any 0 or more chars other than a newline char, as few as possible (instead of.*?
, you might want to use\d+
there to match 1 or more digits, or\d{4}
to match just four digits exactly, or even(?:20|19)\d{2}
to match a year in the 20th and 21st c.)(?(1)\)|])
- a conditional construct: if Group 1 was matched, a)
is matched, else, a]
char.
Regex for getting parentheses within parentheses infinitely
Regex flavor in Java doesn't support recursion like some other flavors do. Instead you can write your own method which will build strings from characters only if they are:
(
)
- inside parenthesis.
To know if currently handled character is inside parenthesis we can create counter which will check parenthesis balance (you can also think of it as counter for nesting level). In short: if we saw more (
than )
then we are inside unclosed (open) parenthesis section, so we should add current character to resulting string.
Using that idea our code can look like:
String str = "hello world () (foo bar) (foo bar 2 (this looks cozy)) (foo bar 3...)";
List<String> result = new ArrayList<>();
int parenthesisNestingLevel = 0;
StringBuilder sb = new StringBuilder();
for (char ch : str.toCharArray()) {
if (ch == '(') {
parenthesisNestingLevel++;
sb.append(ch);
} else if (ch == ')') {
parenthesisNestingLevel--;
sb.append(ch);
if (parenthesisNestingLevel == 0) {
result.add(sb.toString());
sb.delete(0, sb.length());//reset sb
}
} else if (parenthesisNestingLevel > 0) {//we are inside unclosed parenthesis
sb.append(ch);
}
}
result.forEach(System.out::println);
Output:
()
(foo bar)
(foo bar 2 (this looks cozy))
(foo bar 3...)
Related Topics
Why Is Multiple Inheritance Not Allowed in Java or C#
Launching an Application (.Exe) from C#
How to Change the Table Names When Using ASP.NET Identity
How to Serialize a String as Cdata Using Xmlserializer
C# Linq Group by on Multiple Columns
How to Format Datetime Columns in Datagridview
Wpf Binding UI Events to Commands in Viewmodel
Recommended Servicestack API Structure
C# VS Java Enum (For Those New to C#)
Serialize Class Containing Dictionary Member
Keeping ASP.NET Session Open/Alive
Specifying a Custom Datetime Format When Serializing with JSON.Net
Automatically Create an Enum Based on Values in a Database Lookup Table
ASP.NET MVC $.Post Call Returning String...Need Help with Format for Jqgrid