Matching Nested Structures with Regular Expressions in Python

You can't do this in general with Python's built-in regular expressions (the re module). .NET regular expressions have been extended with "balancing groups", which is what allows nested matches there.
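To make the limitation concrete, here is a minimal sketch using only the standard-library re module. It assumes parentheses as the delimiters, and the names recursion_supported, DEPTH_TWO and match_depth_two are made up for this illustration. It shows that re rejects a recursive pattern, and that a hand-written pattern only covers the nesting depth it was written for.

```python
import re

# The standard-library `re` module has no recursion construct, so a
# recursive pattern such as (?R) is rejected when it is compiled.
try:
    re.compile(r"\((?:[^()]|(?R))*\)")
    recursion_supported = True
except re.error:
    recursion_supported = False

# A hand-written pattern can only cover a fixed nesting depth.
# This one allows a single level of parentheses inside the outer pair.
DEPTH_TWO = re.compile(r"\((?:[^()]|\([^()]*\))*\)")


def match_depth_two(text):
    """Return the whole text if it is one group nested at most two deep, else None."""
    m = DEPTH_TWO.fullmatch(text)
    return m.group(0) if m else None


print(recursion_supported)               # False
print(match_depth_two("(a (b) c)"))      # (a (b) c)
print(match_depth_two("(a (b (c)) d)"))  # None: three levels deep
```

Each extra level of nesting would need the pattern to be extended by hand, which is why this approach does not generalize.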

However, PyParsing is a very nice package for this type of thing:

from pyparsing import nestedExpr

data = "( (a ( ( c ) b ) ) ( d ) e )"
print(nestedExpr().parseString(data).asList())

The output is:

[[['a', [['c'], 'b']], ['d'], 'e']]

More on PyParsing:

  • http://pyparsing.wikispaces.com/Documentation

Python: How to match nested parentheses with regex?

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solution. If you really just want to capture the pieces within parentheses, loop over the characters, incrementing a counter when you see ( and decrementing it when you see ).
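The counting loop described above might look like the following sketch. The function name top_level_groups is hypothetical, and the choice to return only the outermost groups and to silently ignore unbalanced parentheses is an assumption, not something the question specified.

```python
def top_level_groups(text):
    """Return the contents of each outermost (...) group in `text`."""
    groups = []
    depth = 0      # number of currently open parentheses
    start = None   # index just after the outermost "(" of the current group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost group just closed; keep what was inside it.
                groups.append(text[start:i])
    return groups


print(top_level_groups("(a (b) c) d (e)"))  # ['a (b) c', 'e']
```

A stray ")" with nothing open is skipped, and a group that is never closed is dropped. Raising an error instead would be an equally reasonable choice.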

Regex to extract nested patterns

C# can match nested structures with RegEx; I don't believe Python can. You could re-run the RegEx search on previous results, but this is probably less efficient (the overhead of RegEx for such a simple search) than just writing a custom parser. The text you're searching for, "[@" and "]", isn't very complex.

Here's a custom parser (in JavaScript) that would do the job.

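var txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet";

function parse(s) {
    var stack = [];   // one partial string per currently open "[@" block
    var result = [];  // completed blocks, innermost first
    for (var x = 0; x < s.length; x++) {
        var c = s.charAt(x);
        if (c == '[' && x + 1 < s.length && s.charAt(x + 1) == '@') {
            // Opening "[@": extend every open block, then start a new one.
            for (var y = 0; y < stack.length; y++)
                stack[y] += "[@";
            stack.push("[@");
            x++; // skip the '@'
        } else if (c == ']' && stack.length > 0) {
            // Closing "]": extend every open block, then emit the innermost.
            for (var y = 0; y < stack.length; y++)
                stack[y] += "]";
            result.push(stack.pop());
        } else {
            // Ordinary character: append it to every open block.
            for (var y = 0; y < stack.length; y++)
                stack[y] += c;
        }
    }
    return result;
}
parse(txt);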

It loops through the characters of the text only once and uses a stack with an if / else if / else chain to push, pop, and extend the values on that stack.
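Since the question was about Python, here is a rough Python sketch of the same idea. It is not a line-by-line port: instead of appending each character to every open block, it keeps a stack of start indices and slices the text when a block closes. The function name extract_tagged_blocks and its opener/closer parameters are assumptions made for this example.

```python
def extract_tagged_blocks(text, opener="[@", closer="]"):
    """Return every opener...closer block in `text`, innermost first."""
    starts = []   # start indices of the blocks that are currently open
    blocks = []
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            starts.append(i)
            i += len(opener)
        elif text.startswith(closer, i) and starts:
            i += len(closer)
            # Slice from the most recent opener up to and including the closer.
            blocks.append(text[starts.pop():i])
        else:
            i += 1
    return blocks


txt = ("Lorem ipsum dolor sit amet "
       "[@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet")
for block in extract_tagged_blocks(txt):
    print(block)
```

As in the JavaScript version, a "]" that appears while no block is open is ignored, and blocks that are never closed are not reported.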

How to handle nested parentheses with regex?

Standard[1] regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.

[1] I said standard, but not all regular expression engines are indeed standard. You might be able to do this with Perl, for instance, by using recursive regular expressions. For example:

my $str = "[hello [world]] abc [123] [xyz jkl]";

my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}

The output is:

[hello [world]]
abc
[123]
[xyz jkl]

EDIT: I see you're using Python; check out pyparsing.
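If you would rather stay in plain Python, the traversal suggested at the top of this answer could be sketched as follows. The function name split_top_level is hypothetical, and treating whitespace outside brackets as a token separator is an assumption chosen to mirror the Perl output above.

```python
def split_top_level(text, open_ch="[", close_ch="]"):
    """Split `text` into top-level bracketed groups and bare words."""
    tokens = []
    current = []   # characters of the token being built
    depth = 0      # number of currently open brackets
    for ch in text:
        if ch == open_ch:
            depth += 1
            current.append(ch)
        elif ch == close_ch and depth > 0:
            depth -= 1
            current.append(ch)
            if depth == 0:
                # An outermost group just closed: emit it as one token.
                tokens.append("".join(current))
                current = []
        elif ch.isspace() and depth == 0:
            # Whitespace outside any brackets ends a bare word.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        # Leftover text, including any group that was never closed.
        tokens.append("".join(current))
    return tokens


print(split_top_level("[hello [world]] abc [123] [xyz jkl]"))
# ['[hello [world]]', 'abc', '[123]', '[xyz jkl]']
```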

Regex - How to work with sub groups

{{persondata(.*)}} will match greedily, i.e. it will try to return the longest match possible. You should use {{persondata(.*?)}} if you want to get the shortest possible match. (This is usually called non-greedy or lazy matching.)

However, in this case, you have another }} inside your string. You can do something clever like {{persondata((?:.*)}}(?:.*))}}, but in general, as soon as you reach recursive structures (structures that nest themselves) you should abandon regular expressions and turn to proper parsing solutions.

You might want to look at pyparsing.
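A small sketch of the greedy versus lazy difference, using the standard re module. The sample text is invented for illustration, and the braces are escaped in the patterns for safety. Neither variant returns the balanced block, which is the point made above about nested structures.

```python
import re

# Hypothetical input: a persondata block that contains a nested {{...}}.
text = "{{persondata|name={{nested}}|born=1900}} trailing {{other}}"

# Greedy: .* runs to the LAST "}}" in the string.
greedy = re.search(r"\{\{persondata(.*)\}\}", text).group(1)

# Lazy: .*? stops at the FIRST "}}" it can reach.
lazy = re.search(r"\{\{persondata(.*?)\}\}", text).group(1)

print(greedy)  # |name={{nested}}|born=1900}} trailing {{other
print(lazy)    # |name={{nested
```

The greedy match overshoots into the unrelated {{other}} block, while the lazy match stops inside the nested one.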

Match the last group of (potentially) nested brackets

In Python, to use recursion or repeated subroutines, we need to use Matthew Barnett's outstanding regex module... And, as @CTZhu points out, you are already using it!

To be clear on terms, there can be several understandings of "nesting", such as:

  1. Simple nesting as in [C[D[E]F]], which is a subset of...
  2. More complex, family-style nesting as in [B[C] [D] [E[F][G]]].

You need to be able to handle the latter, and this short regex does it for us:

\[(?:[^[\]]++|(?R))*\]

This will match each top-level bracketed group, including everything nested inside it. Now all we need to do is print the last match.

Here is some tested Python code:

import regex # say "yeah!" for Matthew Barnett
pattern = r'\[(?:[^[\]]++|(?R))*\]'
myregex = regex.compile(pattern)

# this outputs [EEE]
matches = myregex.findall('AAA [BBB [CCC]] [EEE]')
print (matches[-1])

# this outputs [C[D[E]F]] (simple nesting)
matches = myregex.findall('AAA [BBB] [C[D[E]F]]')
print (matches[-1])

# this outputs [B[C] [D] [E[F][G]]] (family-style nesting)
matches = myregex.findall('AAA [AAA] [B[]B[B]] [B[C] [D] [E[F][G]]]')
print (matches[-1])

Bounding strings between two characters in regex

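Don't use regex; do this the traditional way. Keep a stack and a counter of open '<' characters: while more than one '<' is open, keep appending characters to the stack, and when the outermost '>' closes, join the stack into one string and append it to the results.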

The doubled backslashes in the output below are nothing to handle: that is just how Python displays a single backslash in a string's repr.

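def find_tags(your_string):
    ans = []      # completed outermost tags
    stack = []    # characters of the tag currently being collected
    tag_no = 0    # number of currently open '<'

    for c in your_string:
        if c == '<':
            tag_no += 1
            # A nested '<' is part of the outer tag's content.
            if tag_no > 1:
                stack.append(c)
        elif c == '>':
            if tag_no == 1:
                # The outermost tag closed: emit it and reset.
                ans.append(''.join(stack))
                tag_no = 0
                stack = []
            elif tag_no > 1:
                # A nested '>' is content; a stray '>' outside any tag is ignored.
                tag_no -= 1
                stack.append(c)
        elif tag_no > 0:
            stack.append(c)
    return ans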

Example output:

>>> find_tags(r'<abc>, <?.sdfs/>, <sdsld\>')
['abc', '?.sdfs/', 'sdsld\\']
>>> find_tags(r'</</\/\asa></dsdsds><sdsfsa>>')
['/</\\/\\asa></dsdsds><sdsfsa>']

Note: This runs in O(n) time as well.


