Python: How to Match Nested Parentheses With Regex

Python: How to match nested parentheses with regex?

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solution. If you really just want to capture the pieces within parentheses, just loop over the characters, incrementing a counter when you see ( and decrementing it when you see ).
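The character loop described above could look roughly like the sketch below. It uses a stack of open-parenthesis positions instead of a bare counter (the stack's length is the counter), so that every nested group can be reported, not only the outermost one. The function name parenthesized_spans and the sample string are made up for illustration.

```python
def parenthesized_spans(text):
    """Return every balanced "(...)" substring, inner groups before their parents."""
    stack = []   # positions of "(" characters that are still open
    spans = []
    for i, ch in enumerate(text):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            # Close the most recently opened parenthesis.
            start = stack.pop()
            spans.append(text[start:i + 1])
    return spans


print(parenthesized_spans("a(b(c)d)e(f)"))
# ['(c)', '(b(c)d)', '(f)']
```

An unmatched ) is silently skipped here, and an unmatched ( is never reported; a stricter version might raise an error instead.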

Regex nested parenthesis in python

Regex

(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}


Text used for test


Name1 Name2 Name3 (2000) {Education (#3.2)}
Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}
Othername California (2000) (T) (S) (ok) {state (#2.1)}

Test


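>>> import re
>>> # string holds the test text shown above; a raw string keeps the backslashes intact
>>> regex = re.compile(r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}")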
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x54e2105f36c16a48>
>>> regex.match(string)
<_sre.SRE_Match object at 0x54e2105f36c169e8>

# Run findall
>>> regex.findall(string)
[
(u'Name1 Name2 Name3' , u'' , u'3.2'),
(u'Name1 Name2 Name3' , u'ok', u'1.1'),
(u'Name1 Name2' , u'' , u'1.1'),
(u'Name1 Name2 Name3' , u'' , u'4.12'),
(u'Othername California', u'ok', u'2.1')
]

How to handle nested parentheses with regex?

Standard[1] regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.


[1] I said standard, but not every regular expression engine sticks to the standard. You might be able to do this with Perl, for instance, by using recursive regular expressions. For example:

$str = "[hello [world]] abc [123] [xyz jkl]";

my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}

[hello [world]]
abc
[123]
[xyz jkl]

EDIT: I see you're using Python; check out pyparsing.
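For comparison, here is a minimal Python sketch of the "traverse the string and keep track of bracket pairs" idea, producing the same tokens as the Perl example. It sticks to the standard library and does not use pyparsing; the function name split_top_level and its parameters are hypothetical.

```python
def split_top_level(text, open_ch="[", close_ch="]"):
    """Split on whitespace, but keep bracketed groups (nested or not) in one piece."""
    tokens, current, depth = [], "", 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
        if ch.isspace() and depth == 0:
            # Whitespace outside all brackets ends the current token.
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(split_top_level("[hello [world]] abc [123] [xyz jkl]"))
# ['[hello [world]]', 'abc', '[123]', '[xyz jkl]']
```

The sketch assumes the brackets are balanced; with a stray closing bracket the depth goes negative and later whitespace is no longer treated as a separator.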

Regex to find texts between nested parenthesis

A workaround pattern is one that matches a line starting with {{info and then matches any 0+ chars, as few as possible, up to the line with just }} on it:

re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)


Details

  • (?sm) - re.DOTALL (now, . matches a newline) and re.MULTILINE (^ now matches line start and $ matches line end positions) flags
  • ^ - start of a line
  • {{ - a {{ substring
  • [^\S\r\n]* - 0+ horizontal whitespaces
  • info - a substring
  • \s* - 0+ whitespaces
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • ^}}$ - start of a line, }} and end of the line.
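As a quick check of the pattern described above, the sketch below runs it on a made-up input. The sample text in s is an assumption for illustration only. It contains a nested {{...}} on one line, which the pattern tolerates because the closing }} has to stand alone on its own line.

```python
import re

# Hypothetical sample input: two info blocks separated by unrelated text.
s = (
    "{{info\n"
    "name = {{nested|x}}\n"
    "value = 1\n"
    "}}\n"
    "other text\n"
    "{{ info second\n"
    "}}\n"
)

pattern = r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$'
blocks = re.findall(pattern, s)

for block in blocks:
    print(repr(block))
# 'name = {{nested|x}}\nvalue = 1\n'
# 'second\n'
```

Note that this relies on layout (a }} line of its own) rather than on counting braces, so a nested block that also ends with a bare }} line would cut the match short.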

Regular expression to return string split up respecting nested parentheses

Using regex only for the task might work but it wouldn't be straightforward.

Another possibility is writing a simple algorithm to track the parentheses in the string:

  1. Split the string at all parentheses, while keeping the delimiters in the result (e.g. using re.split with a capturing group)
  2. Keep two counters tracking the parentheses: start_parens_count for ( and end_parens_count for ).
  3. Using the counters, proceed by either splitting at white spaces or adding the current data to a temporary variable (term)
  4. When the leftmost parenthesis has been closed, append term to the list of values and reset the counters and the temporary variable.

Here's an example:

import re

string = "1 2 3 (test 0, test 0) (test (0 test) 0)"


result, start_parens_count, end_parens_count, term = [], 0, 0, ""
for x in re.split(r"([()])", string):
    if not x.strip():
        continue
    elif x == "(":
        if start_parens_count > 0:
            term += "("
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"
    elif start_parens_count > end_parens_count:
        term += x
    else:
        result.extend(x.strip(" ").split(" "))


print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']

Not very elegant, but works.

Extract string between two brackets, including nested brackets in python

>>> import re
>>> s = """res = sqr(if((a>b)&(a<c),(a+b)*c,(a-b)*c)+if()+if()...)"""
>>> re.findall(r'if\((?:[^()]*|\([^()]*\))*\)', s)
['if((a>b)&(a<c),(a+b)*c,(a-b)*c)', 'if()', 'if()']

For such patterns, it is better to use the VERBOSE flag:

>>> lvl2 = re.compile(r'''
... if\( #literal if(
... (?: #start of non-capturing group
... [^()]* #non-parentheses characters
... | #OR
... \([^()]*\) #non-nested pair of parentheses
... )* #end of non-capturing group, 0 or more times
... \) #literal )
... ''', flags=re.X)
>>> re.findall(lvl2, s)
['if((a>b)&(a<c),(a+b)*c,(a-b)*c)', 'if()', 'if()']



To match any number of nested pairs, you can use the regex module; see Recursive Regular Expressions.
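If installing a third-party module is not an option, one standard-library workaround is to generate the pattern above for a fixed maximum depth. This is not the recursive approach, only a sketch of nesting the same alternation a chosen number of times; the helper names balanced_body and if_call_pattern are hypothetical.

```python
import re


def balanced_body(max_depth):
    """Pattern for text whose parentheses are balanced and nested at most max_depth deep."""
    body = r"[^()]*"  # depth 0: no parentheses allowed at all
    for _ in range(max_depth):
        # One more level: plain characters, or a "(...)" whose inside is one level shallower.
        body = r"(?:[^()]|\(" + body + r"\))*"
    return body


def if_call_pattern(max_depth):
    """Compile a pattern for if(...) whose argument text nests up to max_depth levels."""
    return re.compile(r"if\(" + balanced_body(max_depth) + r"\)")


s = "if(a+(b*(c-d)))+if()"
print(if_call_pattern(2).findall(s))  # ['if(a+(b*(c-d)))', 'if()']
print(if_call_pattern(1).findall(s))  # ['if()']
```

With max_depth=1 this accepts the same strings as the lvl2 pattern. The generated pattern grows with the depth, so this is only practical when a small upper bound on nesting is known in advance.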

How can I make a regular expression that only matches the middle bracket of nested brackets?

The easiest way to capture something that must not contain certain characters is a negated character set:

[^...] matches any single character that is not listed inside the []. As a special feature, you do not need to escape parentheses inside it, so by declaring your regex as

r'(\([^()]+\))'

you essentially capture a literal ( followed by one or more characters that are neither ( nor ), followed by a literal ).

See https://regexr.com/3nsfg

From Regex Syntax:

  • Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all
    the characters that are not in the set will be matched. For example,
    [^5] will match any character except '5', and [^^] will match any
    character except '^'. ^ has no special meaning if it’s not the first
    character in the set.
  • To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}]
    and []()[{}] will both match a parenthesis.

Code:

t = "x(x+3(x+3))"

import re

m = re.findall(r"(\([^()]+\))", t)

print(m[0])

Output:

(x+3)
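Because the pattern only ever matches innermost groups, it can be applied repeatedly to peel the nesting layer by layer. The sketch below is one way to do that, assuming it is acceptable to replace each matched group with a placeholder; the function name groups_by_layer and the "_" placeholder are made up for illustration.

```python
import re

INNERMOST = re.compile(r"\([^()]+\)")


def groups_by_layer(text):
    """Collect innermost groups, then the groups that become innermost once those are removed."""
    layers = []
    while True:
        found = INNERMOST.findall(text)
        if not found:
            break
        layers.append(found)
        # Replace each innermost group with a placeholder so its parent is innermost next time.
        text = INNERMOST.sub("_", text)
    return layers


print(groups_by_layer("x(x+3(x+3))"))
# [['(x+3)'], ['(x+3_)']]
```

The outer group is reported with the placeholder in place of its inner group, so this is mainly useful for inspecting structure, not for recovering the original substrings.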

Python Regex match parenthesis but not nested parenthesis

If (foo) in x(foo)x shall be matched, but (foo) in ((foo)) not, what you want is not possible with regular expressions in the formal sense: they describe regular languages, and a regular language cannot keep track of how deeply the current position is nested. But that context (or 'state', as Jonathon Reinhart called it in his comment) is necessary for the distinction between the (foo) substrings in x(foo)x and ((foo)).

If you only want to match strings that only consist of a parenthesized substring, without any parentheses (matched or unmatched) in that substring, the following regex will do:

^\([^()]*\)$
  • ^ and $ 'glue' the pattern to the beginning and end of the string, respectively, thereby excluding partial matches
  • note the arbitrary number of repetitions (…*) of the non-parenthesis character inside the parentheses.
  • note how special characters are not escaped inside a character set, but still have their literal meaning. (Putting backslashes in there would put literal backslashes in the character set. Or in this case out of the character set, due to the negation.)
  • note how the [ starting the character set isn't escaped, because we actually want its special meaning, rather than its literal meaning

The last two points might be specific to the dialect of regular expressions Python uses.

So this will match () and (foo) completely, but not (not even partially) (foo)bar), (foo(bar), x(foo), (foo)x or ()().
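The claims above can be checked with a few lines of Python. This is only a sanity check of the listed examples; the variable names are arbitrary.

```python
import re

pattern = re.compile(r"^\([^()]*\)$")

should_match = ["()", "(foo)"]
should_not_match = ["(foo)bar)", "(foo(bar)", "x(foo)", "(foo)x", "()()"]

for text in should_match + should_not_match:
    # search() is enough here because ^ and $ already anchor the pattern.
    print(f"{text!r}: {bool(pattern.search(text))}")
```

The first two strings print True and the remaining five print False.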


