Matching Balanced Parenthesis in Ruby Using Recursive Regular Expressions Like Perl

Matching balanced parenthesis in Ruby using recursive regular expressions like perl

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )
        |
        \g<re>
      )*
    \)
  )
}x

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.

Regular expression to match balanced parentheses

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.

Using Regexp::Common to get matched parenthesis, without the parenthesis

You can't do this with Regexp::Common::balanced, because the regex it generates only contains one capturing group, which encloses the outermost set of parentheses:

$ perl -MRegexp::Common=balanced -E 'say $RE{balanced}{-parens=>"()"}'
(?^:((?:\((?:(?>[^\(\)]+)|(?-1))*\))))
    ^                               ^
    |                               |
    +-------------HERE--------------+

Fortunately, Regexp::Common lets you define your own regexes so you can use the handy $RE{foo} syntax:

use strict;
use warnings;
use 5.010;

use Regexp::Common qw(pattern);

pattern name   => [qw(inside_parens)],
        create => q/(?x: ( \( ( (?: (?> [^()]+ ) | (?-2) )* ) \) ) )/
        ;

say $2 if 'foo(bar(baz,bat), qux())' =~ /foo$RE{inside_parens}/;

Output:

bar(baz,bat), qux()

The entire parenthetical expression is stored in $1, while the contents of the parentheses are stored in $2.

This regex is a slightly-modified version of the one described in perldoc perlre for matching balanced parentheses.

Is there a Regex-like that is capable of parsing matching symbols?

If you only have one level of parentheses, then there are two possibilities.

Option 1: use ungreedy repetition:

/\(.*?\)/

This will stop when it encounters the first ).

Option 2: use a negative character class

/\([^)]*\)/

This can only repeat characters that are not ), so it can necessarily never go past the first closing parenthesis. This option is usually preferred due to performance reasons. In addition, this option is more easily extended to allow for escaping parenthesis (so that you could match this complete string: (some\)thing) instead of throwing away thing)). But this is probably rather rarely necessary.

However if you want nested structures, this is generally too complicated for regex (although some flavors like PCRE support recursive patterns). In this case, you should just go through the string yourself and count parentheses, to keep track of your current nesting level.

Just as a side note about those recursive patterns: In PCRE (?R) simply represents the whole pattern, so inserting this somewhere makes the whole thing recursive. But then every content of parentheses must be of the same structure as the whole match. Also, it is not really possible to do meaningful one-step replacements with this, as well as using capturing groups on multiple nested levels. All in all - you are best off, not to use regular expressions for nested structures.

Update: Since you seem eager to find a regex solution, here is how you would match your example using PCRE (example implementation in PHP):

$str = 'there are (many (things (on) the)) box (except (carrots (and apples)))';
preg_match_all('/\([^()]*(?:(?R)[^()]*)*\)/', $str, $matches);
print_r($matches);

results in

Array
(
    [0] => Array
        (
            [0] => (many (things (on) the))
            [1] => (except (carrots (and apples)))
        )   
)

What the pattern does:

\(      # opening bracket
[^()]*  # arbitrarily many non-bracket characters
(?:     # start a non-capturing group for later repetition
(?R)    # recursion! (match any nested brackets)
[^()]*  # arbitrarily many non-bracket characters
)*      # close the group and repeat it arbitrarily many times
\)      # closing bracket

This allows for infinite nested levels and also for infinite parallel levels.

Note that it is not possible to get all nested levels as separate captured groups. You will always just get the inner-most or outer-most group. Also, doing a recursive replacement is not possible like this.

Regular expression scan results don't register further regex hits?

The method string from MatchData returns a "frozen copy of the string passed in to match", not what has been matched. per http://www.ruby-doc.org/core-2.0.0/MatchData.html#method-i-string

That's why you're getting the entire string returned, because you're adding each of the initial matches to trans.

You can confirm this by putting in print statements of the value of m within the innermost block. match is correctly matching 1, then 4.

The recognizing power of modern regexes

Pattern Recursion

With recursive patterns, you have a form of recursive descent matching.

This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

Restrictions on Recursion

The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

Recursion Demos

The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

\((?:[^()]*+|(?0))*\)

I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

\( (?: [^()] *+ | (?0) )* \)

Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

‘ (?: [^‘’] *+ | (?0) )* ’

Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

^((.)(?1)\2|.?)$

which you can test on most systems using something like this:

$ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words

Note that PCRE’s implementation of recursion requires the more elaborate

^(?:((.)(?1)\2|)|((.)(?3)\4|.))

This is because of restrictions on how PCRE recursion works.

Proper Parsing

To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?"ed_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?"ed_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?"ed_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?"ed_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?"ed_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]
    }x;
};

while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.

Matching text not enclosed by parenthesis

This is very far from "obvious"; on the contrary. There is no direct way to say "don't match" for a complex pattern (there is good support at a character level, with [^a], \S etc). Regex is firstly about matching things, not about not-matching them.

One approach is to match those (possibly nested) delimiters and get everything other than that.

A good tool for finding nested delimiters is the core module Text::Balanced. As it matches it can also give us the substring before the match and the rest of the string after the match.

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw(extract_bracketed);

my $text = <<'END';
(bar foo bar)
bar foo (
bar foo 
   (bar) (foo)
   )
END

my ($match, $before);
my $remainder = $text;
while (1) {
    ($match, $remainder, $before) = extract_bracketed($remainder, '(', '[^(]*');
    print $before // $remainder;
    last if not defined $match; 
}

The extract_bracketed returns the match, the remainder substring ($remainder), and the substring before the match ($before); so we keep matching in the remainder.

Taken from this post, where there are more details and another way, using Regexp::Common.

Using Regex to Split Mathematical Formula into Array

We can define the following regular expression.

R = /
    # split after an open paren if not followed by a digit
    (?<=\()      # match is preceded by an open paren, pos lookbehind
    (?!\d)       # match is not followed by a digit, neg lookahead
    [ ]*         # match >= 0 spaces
    |            # or
    # split before an open paren if paren not followed by a digit
    (?=          # begin pos lookahead
      \(         # match a left paren...
      (?!\d)     # ...not followed by a digit, neg lookahead
    )            # end pos lookahead
    [ ]*         # match >= 0 spaces        
    |            # or
    # split before a closed paren if paren not preceded by a digit
    (?<!\d)      # do not follow a digit, neg lookbehind
    (?=\))       # match a closed paren, pos lookahead
    [ ]*         # match >= 0 spaces        
    |            # or
    # split after a closed paren
    (?<=\))      # match a preceding closed paren, pos lookbehind
    [ ]*         # match >= 0 spaces        
    |            # or
    # match spaces not preceded by *, = or + and followed by a letter 
    (?<![*=+\/-]) # match is not preceded by one of '*=+\/-', neg lookbehind
    [ ]+         # match one or more spaces
    |            # or
    # match spaces followed by a letter 
    [ ]+         # match one or more spaces
    (?=\()       # match a left paren, pos lookahead
    /x           # free-spacing regex definition mode

In the first example we have the following.

algo1 = 'ABC * DEF * GHI Round(3) = JKL * MNO * PQR Round(0) = SAVE'
expected1 = ['ABC', '* DEF', '* GHI', 'Round(3)', '= JKL', '* MNO',
             '* PQR', 'Round(0)', '= SAVE']
algo1.split(R) == expected1
  #=> true

In the second example we have the following.

algo2 = '(ABC + DEF + (GHI * JKL)) - ((MNO + PQR + (STU * VWX)) * YZ) Round(0) + SUM(AAA) = SAVE'
expected2 = ['(', 'ABC', '+ DEF', '+', '(', 'GHI', '* JKL', ')', ')', '-',
             '(', '(', 'MNO', '+ PQR', '+', '(', 'STU', '* VWX', ')', ')',
             '* YZ', ')', 'Round(0)', '+ SUM', '(', 'AAA', ')', '= SAVE']
algo2.split(R) == expected2
  #=> true

The regular expression is conventionally written as follows.

R = /(?<=\()(?!\d) *|(?=\((?!\d)) *|(?<!\d)(?=\)) *|(?<=\)) *|(?<![*=+\/-]) +| +(?=\()/

In free-spacing mode I enclosed spaces in a character class ([ ]); else they would be stripped out before the expression is evaluated. That's not necessary when the regex is written conventionally.

Can regular expressions be used to match nested patterns?

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).

There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.

For more background: Automata Theory at Wikipedia

Matching Balanced Parenthesis in Ruby Using Recursive Regular Expressions Like Perl