Recursive Pattern in Regex

Recursive pattern in regex

The pattern is:

{((?>[^{}]+|(?R))*)}

You can see this works for your example:

regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']

Explanation:

The m part needs to exclude the brackets. The use of an atomic group is needed if you want at the same time to allow a quantifier for [^{}] and to repeat the group without catastropic backtracking problems. To be more clear, if the last closing curly bracket is missing this regex engine will backtrack atomic group by atomic group instead of character by character. To drive home this point, you can make the quantifier possessive like that: {((?>[^{}]+|(?R))*+)} (or {((?:[^{}]+|(?R))*+)} since the atomic group is no more useful).

The atomic group (?>....) and the possessive quantifier ?+, *+, ++ are the two sides of the same feature. This feature forbids the regex engine to backtrack inside the group of characters that becomes an "atom" (something you can't divide in smaller parts).

The basic examples are the following two patterns that always fail for the string Recursive Pattern in Regexaab:

(?>a+)ab
a++ab

that is:

regex.match("a++ab", "Recursive Pattern in Regexaab")
regex.match("(?>a+)ab", "Recursive Pattern in Regexaab")

When you use (?:a+) or a+ the regex engine (by default) records (in prevision) all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, theses backtracking positions are no more recorded (except for the begining of the group). So when the backtracking mechanism occurs the last "a" character can't be given back. Only the entire group can be given back.

[EDIT]: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:

{([^{}]*+(?:(?R)[^{}]*)*+)}

Recursive regexp in python?

Thanks Jerry and devnull!
regex module for python instead of default re module solved my issue

import regex
>>> regex.sub(r'\((((?>[^()]+)|(?R))*)\)', r'del', 'acd (e(fg)h) ij)')
'acd del ij)'

Recursive Regex with a Pattern Matching only on Start of Match before Recursion?

That is the case when you need to use subroutines. Here, you need to enclose the recursed pattern in a capturing group and then use (?1) construct to recurse it:

import regex
m = regex.search(r'Test(\((?:[^()]++|(?1))*\))', 'Test(123, Test(123, (3), 3))')
if m:
    print(m.group()) # => Test(123, Test(123, (3), 3))

See the Python demo.

Details

Test - a prefix word
(\((?:[^()]++|(?1))*\)) - Capturing group 1 (that will be recursed with (?1)):
- \( - a ( char
- (?:[^()]++|(?1))* - zero or more reptitions of
  - [^()]++ - 1+ chars other than ( and ) (possessive match for better efficiency)
  - | - or
  - (?1) - a subroutine to recurse Capturing group #1 subpattern
- \) - a ) char.

Regex match with negative lookbehind, recursive pattern and negative lookahead

Assuming you are using PyPi regex module you can use

import regex
text = """void function{ {  {  } }}   
static stTLookupTable RxTable[MAX]={
     
    {zero, one},{zero, one},{zero, one}};"""

print( [x.group(3) for x in regex.finditer(r'=\s*({(?>[^{}]+|(?1))*})(*SKIP)(*F)|({((?>[^{}]+|(?2))*)})', text)] )
# => [' {  {  } }']

See the Python demo online.

Details:

=\s*({(?>[^{}]+|(?1))*})(*SKIP)(*F):
- = - a = char
- \s* - zero or more whitespaces
- ({(?>[^{}]+|(?1))*}) - a substring between balanced {...}
- (*SKIP)(*F) - skips the match and restarts the search from the failure position
| - or
({((?>[^{}]+|(?2))*)}) - Group 2 (technical, used for recursion):
- {((?>[^{}]+|(?2))*)} - matches a {...} substring with balanced curly braces.

You need to return Group 3 from the matches.

Regex pattern recursively - in python

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.

Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

How to match but exclude delimiters in recursive pattern

You need to capture the part of the regex you need to extract from the match:

{((?:[^{}]++|(?R))*)}
 ^_________________^

These parentheses create Group 1 that will hold the value you can access using your programming language.

Once you get the matches, you may run a second pass to extract a:b or a:[...] substrings using

[^,[]+(?:\[[^][]+])?

See the regex demo. Details:

[^,[]+ - zero or more chars other than [ and comma
(?:\[[^][]+])? - an optional sequence of [, then one or more chars other than [ and ] and then a ].

How exactly does this recursive regex work?

You understand recursion correctly. Word boundaries baffle you here. The \b around the pattern require the regex engine to only match the string if it is not preceded and followed with word chars.

See how the recursion goes here:

( o      (?1)?         o )  => oo

(?1) is then replaced with (o(?1)?o):

( o   (?>o(?1)?o)?     o )  => oo or oooo

Then again:

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

See the regex demo without word boundaries.

Why adding (?>...) in the example above? Each recursion level in PHP recursive regexes is atomic, unlike Perl, and once a preceding level fails, the engine does not go back to the following one.

When you add word boundaries, the first o and last o matched cannot have any other word chars before/after. So, ooo won't match then.

See Recursive Regular Expressions explained step by step and Word Boundary: \b at rexegg.com, too.

Why does oooooo not get matched as a whole but as oooo and oo?

Again, each recursion level is atomic. oooooo is matched like this:

(o(?1)?o) matches the first o
(?1)? gets expanded and the pattern is now (o(?>o(?1)?o)?o) and it matches the second o in the input
It goes on until (o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o) that does not match the input any longer, backtracking happens, we go to the 6th level,
The whole 6th recursion level also fails since it cannot match the necessary amount of os
This goes on until the level that can match the necessary amount of os.

See the regex debugger:

Sample Image

How can I recursively match a pattern using Regular Expressions?

Java's standard regex lib does not support recursion, so you can't match such general nested constructs with it.

But in flavors that do support recursion (Perl, PCRE, .NET, etc) you can use expressions like:

\w+(?:\((?R)(?:,(?R))*\))?

Recursive Pattern in Regex