How Can a Recursive Regexp Be Implemented in Python

Recursive pattern in regex

The pattern is:

{((?>[^{}]+|(?R))*)}

You can see this works for your example:

regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']

Explanation:

The m part needs to exclude the brackets. The use of an atomic group is needed if you want at the same time to allow a quantifier for [^{}] and to repeat the group without catastropic backtracking problems. To be more clear, if the last closing curly bracket is missing this regex engine will backtrack atomic group by atomic group instead of character by character. To drive home this point, you can make the quantifier possessive like that: {((?>[^{}]+|(?R))*+)} (or {((?:[^{}]+|(?R))*+)} since the atomic group is no more useful).

The atomic group (?>....) and the possessive quantifier ?+, *+, ++ are the two sides of the same feature. This feature forbids the regex engine to backtrack inside the group of characters that becomes an "atom" (something you can't divide in smaller parts).

The basic examples are the following two patterns that always fail for the string How Can a Recursive Regexp Be Implemented in Pythonaab:

(?>a+)ab
a++ab

that is:

regex.match("a++ab", "How Can a Recursive Regexp Be Implemented in Pythonaab")
regex.match("(?>a+)ab", "How Can a Recursive Regexp Be Implemented in Pythonaab")

When you use (?:a+) or a+ the regex engine (by default) records (in prevision) all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, theses backtracking positions are no more recorded (except for the begining of the group). So when the backtracking mechanism occurs the last "a" character can't be given back. Only the entire group can be given back.

[EDIT]: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:

{([^{}]*+(?:(?R)[^{}]*)*+)}

Regex pattern recursively - in python

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.

Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

Recursive Regex with a Pattern Matching only on Start of Match before Recursion?

That is the case when you need to use subroutines. Here, you need to enclose the recursed pattern in a capturing group and then use (?1) construct to recurse it:

import regex
m = regex.search(r'Test(\((?:[^()]++|(?1))*\))', 'Test(123, Test(123, (3), 3))')
if m:
    print(m.group()) # => Test(123, Test(123, (3), 3))

See the Python demo.

Details

Test - a prefix word
(\((?:[^()]++|(?1))*\)) - Capturing group 1 (that will be recursed with (?1)):
- \( - a ( char
- (?:[^()]++|(?1))* - zero or more reptitions of
  - [^()]++ - 1+ chars other than ( and ) (possessive match for better efficiency)
  - | - or
  - (?1) - a subroutine to recurse Capturing group #1 subpattern
- \) - a ) char.

Recursive regexp in python?

Thanks Jerry and devnull!
regex module for python instead of default re module solved my issue

import regex
>>> regex.sub(r'\((((?>[^()]+)|(?R))*)\)', r'del', 'acd (e(fg)h) ij)')
'acd del ij)'

How can I use a recursive regex or another method to recursively validate this BBcode-like markup in Python?

Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.

First, it is inefficient since you quantified it and then you quantify the whole group it is in, so making the regex engine look for more ways to match a string, and that makes it rather fragile.

Second, you only match up to the closing tag and you did not restrict the starting tag. The first step is to make the / before \1 optional, \/?. It won't stop before [tag] like tags with no attributes. To add attribute support, add an optional group after \1, (?:\s[^]]*)?. It matches an optional sequence of a whitespace and then any 0+ chars other than ].

A fixed regex will look like

\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]

Do not forget to compile it with regex.DOTALL to match across multiple newlines.

Recursively capture patterns in regex - Python

Huh silly me. Somehow, I wasn't testing the whole string on my machine ^^;

Anyway, I used this regex and it works, you just get the results you were looking for in a list, which I guess is okay. I'm not too good in python, and don't know how to transform this list into array or tuple:

>>> import re
>>> intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
>>> results = re.findall('\\n .*?([A-Z0-9\.]*) < *((?:[^>\n]|>")*) *>.*?(?:\\n|$)', intext)
>>> print results
[('ORTH.FOO', '"cali.ber,kl", \'calf\', "done"'), ('ORHT2BAR', '"what so ever>", "this that mess < up"'), ('JOKE', '\'whatthe \', "what" ')]

The parentheses indicate the first level elements and the single quotes the second level elements.