Are Ruby 1.9 Regular Expressions Equally Powerful to a Context Free Grammar

Are Ruby 1.9 regular expressions equally powerful to a context free grammar?

This is one of the awesome things about the Oniguruma regexp engine used in Ruby 1.9 – it has the power of a parser, and is not restricted to recognizing regular languages. It has positive and negative lookahead/lookbehind, which even can be used to recognize some languages which are not context-free! Take the following as an example:

regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/

This regexp recognizes strings like “abc”, “aabbcc”, “aaabbbccc”, and so on – the number of “a”, “b”, and “c” must be equal, or it will not match.

(One limitation: you can’t use named groups in the lookahead and lookbehind.)

Although I haven’t peeked under the hood, Oniguruma seems to deal with named groups by simple recursive descent, backing up when something doesn’t match. I’ve observed that it can’t deal with left recursion. For example:

irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
from C:/Ruby192/bin/irb:12:in `<main>'

I don’t remember my parsing theory very clearly, but I think that a non-deterministic top-down parser like this should be able to parse any context-free language. (“language”, not “grammar”; if your grammar has left recursion, you will have to convert it to right recursion.) If that is incorrect, please edit this post.

Ruby regular expressions for recursive grammar definitions?

Here's a way to match a recursive pattern in Ruby 1.9, in this case an arbitrary level of nested braces:

#!/usr/bin/env ruby

text = "... { a { b { c } b } a { d } a } ...";
match = text.match(/(?<entire>\{(?:[^{}]+|\g<entire>)*\})/).captures
puts match

which will print:

{ a { b { c } b } a { d } a }

A quick break down of the pattern:

(?<entire>        # start named capture group called <entire>
\{ # match the literal '{'
(?: # start non capture group 1
[^{}]+ # match one or more chars other than '{' and '}'
| # OR
\g<entire> # recursively match named group <entire>
)* # end non capture group 1 and repeat it zero or more times
\} # match the literal '}'
) # end named capture group called <entire>

Context-free grammar describing regular expressions?

Left recursion:

Expression = Expression '|' Sequence
| Sequence
;

Sequence = Sequence Repetition
| <empty>
;

Right recursion:

Expression = Sequence '|' Expression
| Sequence
;

Sequence = Repetition Sequence
| <empty>
;

Ambiguous form:

Expression = Expression '|' Expression
| Sequence
;

Sequence = Sequence Sequence
| Repetition
| <empty>
;

Can regex be used to recognize any Context Free Language?

Basically this answer is taken from this great blog post.

So the short answer is that regular expressions with a recursive extension can recognize any context free grammar.

To show this the idea is to show a way which constructs a regex from a context free grammar.

(?<name> ...) defines a regex pattern which can later be reused with (?&name).

Any context free grammar can be written as a set of rules of the following forms:

  • A -> BC
  • A -> a

If we can write these rules as regex the regex can recognize any context free language. The only interesting rule here is the first one.

Firstly, if the rule is left-recursive we need to rewrite it to a right-recursive rule since regex only support right recursion. This rewrite is always possible. Now we can write all such rules as follows:

A -> BC
A -> DE
(?<A>(?&B)(?&C)|(?&D)(?&E))

This allows the definition of arbitrary CFG rules, so we only need to define them all and then match the initial rule.

(?(DEFINE)define rules here)^(?&initial)$

Here the (?(DEFINE)...) declares rules without matching and the initial refers to the initial rule of the grammar.

It has been some time since I heard theoretical CS courses so please correct me if there are mistakes :)

Is regex in modern programming languages really context sensitive grammar ?

In particular backreferences to capturing parentheses make regular expressions more complex than regular, context-free, or context-sensitive grammars. The name is simply historically grown (as many words). See also this section in Wikipedia and this explanation with an example from Perl.

What kind of formal languages can modern regex engines parse?

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

  • Regular expressions with support for recursive subpattern references can match all context-free languages (e.g a^n b^n).
  • Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
  • If the assertions have unlimited width (as you say), then all context-sensitive grammars can be matched. I don't know any regex flavor though that does not have fixed-width restrictions on lookbehind (and at the same time supports subpattern references).
  • Regular expressions with backreferences are NP-complete, so any other NP problem can be solved using regular expressions (after applying a polynomial-time transformation).

Some examples:

  • Matching the context-free language {a^n b^n, n>0}:

    /^(a(?1)?b)$/
    # or
    /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x
  • Matching the context-sensitive language {a^n b^n c^n, n>0}:

    /^
    (?=(a(?-1)?b)c)
    a+(b(?-1)?c)
    $/x
    # or
    /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x

Matching balanced parenthesis in Ruby using recursive regular expressions like perl

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{
(?<re>
\(
(?:
(?> [^()]+ )
|
\g<re>
)*
\)
)
}x

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.



Related Topics



Leave a reply



Submit