Named Regular Expression Group "(?P<group_name>regexp)": What Does "P" Stand For

Named regular expression group (?P<group_name>regexp): what does P stand for?

Since we're all guessing, I might as well give mine: I've always thought it stood for Python. That may sound pretty stupid -- what, P for Python?! -- but in my defense, I vaguely remembered this thread [emphasis mine]:

Subject: Claiming (?P...) regex syntax extensions

From: Guido van Rossum (gui...@CNRI.Reston.Va.US)

Date: Dec 10, 1997 3:36:19 pm

I have an unusual request for the Perl developers (those that develop
the Perl language). I hope this (perl5-porters) is the right list. I
am cc'ing the Python string-sig because it is the origin of most of
the work I'm discussing here.

You are probably aware of Python. I am Python's creator; I am
planning to release a next "major" version, Python 1.5, by the end of
this year. I hope that Python and Perl can co-exist in years to come;
cross-pollination can be good for both languages. (I believe Larry
had a good look at Python when he added objects to Perl 5; O'Reilly
publishes books about both languages.)

As you may know, Python 1.5 adds a new regular expression module that
more closely matches Perl's syntax. We've tried to be as close to the
Perl syntax as possible within Python's syntax. However, the regex
syntax has some Python-specific extensions, which all begin with (?P .
Currently there are two of them:

(?P<foo>...) Similar to regular grouping parentheses, but the text
matched by the group is accessible after the match has been performed,
via the symbolic group name "foo".

(?P=foo) Matches the same string as that matched by the group named
"foo". Equivalent to \1, \2, etc. except that the group is referred
to by name, not number.

I hope that this Python-specific extension won't conflict with any
future Perl extensions to the Perl regex syntax. If you have plans to
use (?P, please let us know as soon as possible so we can resolve the
conflict. Otherwise, it would be nice if the (?P syntax could be
permanently reserved for Python-specific syntax extensions. (Is
there some kind of registry of extensions?)

to which Larry Wall replied:

[...] There's no registry as of now--yours is the first request from
outside perl5-porters, so it's a pretty low-bandwidth activity.
(Sorry it was even lower last week--I was off in New York at Internet
World.)

Anyway, as far as I'm concerned, you may certainly have 'P' with my
blessing. (Obviously Perl doesn't need the 'P' at this point. :-) [...]

So I don't know what the original choice of P was motivated by -- pattern? placeholder? penguins? -- but you can understand why I've always associated it with Python. Which, considering that (1) I don't like regular expressions and avoid them wherever possible, and (2) this thread happened fifteen years ago, is kind of odd.

python regex: what does the (?P<name>...) mean

From the re module documentation:

(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

So it is essentially the same as what you changed your pattern to, except that with your changed pattern you can no longer access the group by name, only by its number.

To understand the difference I recommend you read up on Non-capturing And Named Groups in the Regular Expression HOWTO.

You can access named groups by passing the name to the MatchObject.group() method, or get a dictionary containing all named groups with MatchObject.groupdict(); this dictionary would not include positional groups.
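
A minimal sketch of both access styles. The date pattern, the group names and the sample string are made up for illustration and do not come from the question:

```python
import re

# Hypothetical pattern: two named groups plus one unnamed (positional) group.
pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})')
match = pattern.match('2012-05-17')

by_name = match.group('year')    # look the group up by its name
by_number = match.group(1)       # the same group, by its number
named = match.groupdict()        # named groups only
everything = match.groups()      # all groups, named or not

print(by_name, by_number)
print(named)
print(everything)
```

The unnamed day group shows up in groups() but not in groupdict(), which is the distinction the paragraph above describes.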

What does this Django regular expression mean? `?P`

(?P<name>regex) - Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.

Copy paste from: http://www.regular-expressions.info/refext.html

How to get group name of match regular expression in Python?

You can get this information from the compiled expression:

>>> import re
>>> pattern = re.compile(r'(?P<name>\w+)|(?P<number>\d+)')
>>> pattern.groupindex
{'name': 1, 'number': 2}

This uses the RegexObject.groupindex attribute:

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

If you only have access to the match object, you can get to the pattern with the MatchObject.re attribute:

>>> a = list(re.finditer(r'(?P<name>\w+)|(?P<number>\d+)', 'Ala ma kota'))
>>> a[0]
<re.Match object; span=(0, 3), match='Ala'>
>>> a[0].re.groupindex
{'name': 1, 'number': 2}

If all you want to know is which group matched, look at the values; None means a group was never used in the match:

>>> a[0].groupdict()
{'name': 'Ala', 'number': None}

The number group was never used to match anything, because its value is None.

You can then find the names used in the regular expression with:

names_used = [name for name, value in matchobj.groupdict().items() if value is not None]

or if there is only ever one group that can match, you can use MatchObject.lastgroup:

name_used = matchobj.lastgroup

As a side note, your regular expression has a fatal flaw: everything that \d matches is also matched by \w, so you'll never see number used where name can match first. Reverse the order of the alternatives to avoid this:

>>> for match in re.finditer(r'(?P<name>\w+)|(?P<number>\d+)', 'word 42'):
...     print(match.lastgroup)
...
name
name
>>> for match in re.finditer(r'(?P<number>\d+)|(?P<name>\w+)', 'word 42'):
...     print(match.lastgroup)
...
name
number

but take into account that words starting with digits will still confuse things for your simple case:

>>> for match in re.finditer(r'(?P<number>\d+)|(?P<name>\w+)', 'word42 42word'):
...     print(match.lastgroup, repr(match.group(0)))
...
name 'word42'
number '42'
name 'word'

Named backreference (?P=name) issue in Python re

The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:

(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
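
The demo pattern can be checked directly. In this sketch the second input string is a made-up counterexample, added to show the backreference rejecting text that differs from what the group captured:

```python
import re

pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)')

# Both backreferences repeat exactly what the named groups captured.
ok = pattern.fullmatch('123-abcd&abcd-123')

# Hypothetical counterexample: 'abce' is not what "char" captured.
bad = pattern.fullmatch('123-abcd&abce-123')

print(ok.groupdict())
print(bad)
```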

To replace matches, use \1, \g<1> or \g<char> syntax with re.sub replacement pattern. Do not use (?P=name) for that purpose:

repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern...

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
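
A short sketch of the replacement-side syntax, assuming a simplified two-group version of the demo pattern (the variable names are illustrative):

```python
import re

pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})')
text = '123-abcd'

# Swap the two parts, first by name, then by number.
by_name = pattern.sub(r'\g<char>-\g<digit>', text)
by_number = pattern.sub(r'\2-\1', text)

# \g<1>0 is group 1 followed by a literal '0'.
with_zero = pattern.sub(r'\g<1>0', text)

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    pattern.sub(r'\10', text)
    group_ten_accepted = True
except re.error:
    group_ten_accepted = False

print(by_name, by_number, with_zero, group_ten_accepted)
```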

Regular expression naming group

A named symbolic group requires a name. It takes the form (?P<name>...). In your example, you forgot to provide a name for your groups.

Unfortunately, a group name cannot be reused, thus the following is an error.

re.compile(r'(?P<last>\w+), (?P<first>\w+)|(?P<first>\w+) (?P<last>\w+)')
# sre_constants.error: redefinition of group name 'first' ...

The above error happens because re is not smart enough to know that only one of each name will be matched. Thus you will have to match with unnamed groups and then extract first and last by number.

import re

def get_name(name):
    match = re.match(r'(\w+), (\w+)|(\w+) (\w+)', name)

    return {'first': match[2] or match[3], 'last': match[1] or match[4]}

print(get_name('James Allen'))
print(get_name('Allen, James'))

Output

{'first': 'James', 'last': 'Allen'}
{'first': 'James', 'last': 'Allen'}
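
If you would still like names rather than numbers, one possible variation (a sketch, not the approach above) is to give each alternative its own distinct group names and merge them afterwards. The suffixed names last_a, first_a, first_b and last_b are invented for this example:

```python
import re

# Each alternative gets unique names, so nothing is redefined.
NAME_RE = re.compile(
    r'(?P<last_a>\w+), (?P<first_a>\w+)|(?P<first_b>\w+) (?P<last_b>\w+)'
)

def get_name(name):
    match = NAME_RE.match(name)
    if match is None:
        return None  # neither form matched
    groups = match.groupdict()
    # Only one alternative took part in the match; the other pair is None.
    return {
        'first': groups['first_a'] or groups['first_b'],
        'last': groups['last_a'] or groups['last_b'],
    }

print(get_name('James Allen'))
print(get_name('Allen, James'))
```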

What does P mean in `/(?P<topic_id>\d+)$`

(?P<name>...) is a named group:

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

As such, it's equivalent to (...) but instead of referring to \1, you can refer to any of the following: (?P=name), \1, m.group('name'), or \g<name>, depending on the context.

Convert capture group to named capture group

You can use the following, though it would get pretty tricky if you ever have nested parentheses:

reg = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(\w+)\s(\w+)\s(\d+)"
groupNames = ["ip","name", "proto", "http_status_code"]

splitReg = [a for a in reg.split("(") if a] # skip the empty string before the first "("
if len(groupNames) == len(splitReg):
    newReg = ''.join([("(?P<" + name + ">" + val)
                      for name, val in zip(groupNames, splitReg)])
    print(newReg)

Output:

(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(?P<name>\w+)\s(?P<proto>\w+)\s(?P<http_status_code>\d+)
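
As a quick sanity check, the converted pattern can be applied to an input line. The sample line below is invented for illustration; real input may look different:

```python
import re

# The converted pattern from the output above.
new_reg = (r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(?P<name>\w+)"
           r"\s(?P<proto>\w+)\s(?P<http_status_code>\d+)")

# Hypothetical input line.
sample = "192.168.0.1 alice tcp 200"

fields = re.match(new_reg, sample).groupdict()
print(fields)
```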

Python replace multiple strings while supporting backreferences

Update: See Angus Hollands' answer for a better alternative.


I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.

However, there are some difficulties. Let's assume a repldict like this:

repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}

If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.

Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.

The solution to these problems is to enclose every dict key in a named group like so:

(?P<group1>(a))|(?P<group2>(b))

Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group, so that \1b refers to "the first group after group2".


Here's the implementation:

import re

def replaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())

    # generate a regex pattern from the keys, putting each key in a named group
    # so that we can find out which one of them matched.
    # groups are named "_<idx>" where <idx> is the index of the corresponding
    # replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys))

    def repl(match):
        # find out which key matched. We know that exactly one of the keys has
        # matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value is not None)
        group_index = int(group_name[1:])

        # now that we know which group matched, we can retrieve the
        # corresponding replacement text
        repl_text = repls[group_index]

        # number of the named wrapper group inside the combined pattern; this
        # differs from group_index + 1 when earlier keys contain groups
        group_number = match.re.groupindex[group_name]

        # now we'll manually search for backreferences in the
        # replacement text and substitute them
        def repl_backreference(m):
            reference_index = int(m.group(1))

            # \N refers to the N-th group after the wrapper group
            return match.group(group_number + reference_index)

        return re.sub(r'\\(\d+)', repl_backreference, repl_text)

    return re.sub(pattern, repl, text)

Tests:

repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )

repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )

# output: not newData.exists() or newData.val().length == 1
# (not this and not that) or not this and foo == bar

Caveats:

  • Only numerical backreferences are supported; no named references.
  • Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)

