Can't use '\1' backreference to capture-group in a function call in re.sub() repr expression


re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work because the \g<1> backreference (the unambiguous form of the first backreference, otherwise written \1) is only expanded when it appears in the replacement string itself. If you pass it to another function, that function just sees the literal string \g<1>, because the re module has had no chance to evaluate it yet. The re engine expands backreferences only after a match is found, but Python evaluates A[int(r'\g<1>')] before re.sub is even called.
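A minimal sketch of that evaluation order, reusing S and A from the question. The `error` variable is only there for illustration: the argument expression fails inside int() before re.sub() gets a chance to run.

```python
import re

S = '02143'
A = ['a', 'b', 'c', 'd', 'e']

error = None
try:
    # Python evaluates the argument A[int(r'\g<1>')] first, so int() receives
    # the literal text \g<1> and fails before re.sub() is ever called.
    re.sub(r'([0-9])', A[int(r'\g<1>')], S)
except ValueError as exc:
    error = exc

print(type(error).__name__)  # ValueError
```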

That is why re.sub accepts a callback function as the replacement argument: the callback receives each match object, so you can pass the matched group values to any other function for further processing.

See the re documentation:

re.sub(pattern, repl, string, count=0, flags=0)

If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.

Use

import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))


Note that you do not need to wrap the whole pattern in a capturing group: x.group() returns the whole match.
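As a small sketch, here is the same substitution written with a named callback, next to a variant that does use a capturing group. The function name digit_to_letter is made up for this illustration.

```python
import re

A = ['a', 'b', 'c', 'd', 'e']

def digit_to_letter(match):
    # match.group() with no argument is the same as match.group(0): the whole match.
    return A[int(match.group())]

# No capturing group: the callback reads the whole match.
without_group = re.sub(r'[0-9]', digit_to_letter, '02143')

# With a capturing group: group(1) holds the same text here.
with_group = re.sub(r'([0-9])', lambda m: A[int(m.group(1))], '02143')

print(without_group, with_group)  # acbed acbed
```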

python : pass regex back-reference value to method

read_file(..) is not called by re.sub: Python calls it first, with '\1' used literally as the filename. In addition, in a non-raw string literal '\1' is interpreted as '\x01'.

To do that you need to pass a replacement function instead:

content = re.sub(
    r'import\s+(.*)\s+\n',
    lambda m: '\n' + read_file(m.group(1)) + '\n',  # `m` is a match object
    content)
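A self-contained sketch of the same idea, under loud assumptions: read_file here is a hypothetical stand-in that looks names up in a dict instead of opening files, and the pattern is a simplified one written for this example rather than the one from the question.

```python
import re

# Hypothetical stand-in for the question's read_file(): looks names up in a dict.
FILES = {'part.txt': 'BODY'}

def read_file(name):
    return FILES[name]

content = 'header\nimport part.txt\nfooter\n'

# Simplified pattern (an assumption): a whole line of the form "import <name>".
inlined = re.sub(
    r'^import[ \t]+(\S+)[ \t]*$',
    lambda m: read_file(m.group(1)),  # called once per match, with the match object
    content,
    flags=re.MULTILINE,
)

print(repr(inlined))  # 'header\nBODY\nfooter\n'
```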

How to pass a backreference to a function

The second argument of re.sub can take a function, so you could do this:

>>> re.sub(r'\b(%s)\b' % '|'.join(ns_mapping.keys()), lambda x: ns_mapping[x.group()], sql)
'SELECT id, date, instance_id FROM __SHADOW__test.sales_1m'
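The question's ns_mapping and sql are not shown, so the following sketch uses invented values for both. It also escapes the keys with re.escape, which is an addition to the one-liner above and only matters if a key contains regex metacharacters.

```python
import re

# Invented sample data; the real ns_mapping and sql are not given in the answer.
ns_mapping = {'analytics': '__SHADOW__analytics'}
sql = 'SELECT id FROM analytics.sales WHERE analytics_flag = 1'

# \b keeps the key from matching inside longer identifiers such as analytics_flag.
pattern = r'\b(%s)\b' % '|'.join(map(re.escape, ns_mapping))
rewritten = re.sub(pattern, lambda m: ns_mapping[m.group()], sql)

print(rewritten)
# SELECT id FROM __SHADOW__analytics.sales WHERE analytics_flag = 1
```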

Python's re.sub returns data in wrong encoding from unicode

Because '\1' is the character with code point 1 (its repr form is '\x01'). Under the rules for string literals, re.sub never saw your backslash. Even if you had escaped it, as in r'\1' or '\\1', reference 1 isn't the right number; you need parentheses to define groups. r'\g<0>' would work, as described in the re.sub documentation.
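A short sketch of the three points above, with made-up variable names and sample input: what '\1' is without the raw prefix, what \g<0> does, and what happens when group 1 is referenced but the pattern defines no group.

```python
import re

one_char = '\1'      # octal escape: a single character with code point 1
two_chars = r'\1'    # raw string: a backslash followed by the digit 1
print(len(one_char), len(two_chars))  # 1 2

# \g<0> refers to the whole match, so no parentheses are needed in the pattern.
wrapped = re.sub(r'[0-9]+', r'<\g<0>>', 'a1b22')
print(wrapped)  # a<1>b<22>

group_error = None
try:
    re.sub(r'[0-9]+', r'\1', 'a1b22')  # the pattern defines no group 1
except re.error as exc:
    group_error = exc
print(group_error is not None)  # True
```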

How to apply a function on a backreference?

In addition to having a replace string, re.sub allows you to use a function to do the replacements:

>>> import re
>>> old_string = "I love the number 3 so much"
>>> def f(match):
...     return str(int(match.group(1)) + 1)
...
>>> re.sub('([0-9]+)', f, old_string)
'I love the number 4 so much'
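One detail worth sketching for numbers with more than one digit: where the quantifier sits relative to the group matters. With the + outside the parentheses, group(1) holds only the last digit of each match. The sample sentence and function names below are invented for this illustration.

```python
import re

text = 'rooms 9, 41 and 199'

def increment(match):
    # group() is the whole run of digits, so 199 becomes 200.
    return str(int(match.group()) + 1)

def increment_last_digit(match):
    # With ([0-9])+ the group is repeated, and group(1) keeps only its last repetition.
    return str(int(match.group(1)) + 1)

whole_number = re.sub(r'[0-9]+', increment, text)
last_digit_only = re.sub(r'([0-9])+', increment_last_digit, text)

print(whole_number)     # rooms 10, 42 and 200
print(last_digit_only)  # rooms 10, 2 and 10
```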


How can I do multiple substitutions using regex?

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )

import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

if __name__ == "__main__":

    text = "Larry Wall is the creator of Perl"

    dict = {
        "Larry Wall" : "Guido van Rossum",
        "creator" : "Benevolent Dictator for Life",
        "Perl" : "Python",
    }

    print(multiple_replace(dict, text))

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and pass it to multiple_replace along with the text you want translated. All the function does is build one large regex that is an alternation of all the escaped keys; regex.sub then calls the lambda for each match to look up the replacement in the dictionary.
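A compact sketch of that usage, with a shortened version of multiple_replace defined inline so the block runs on its own (the parameter is named mapping here, and mo.group(0) stands in for the slice expression). The swap dictionary is an extra example, not from the question; it shows why doing everything in one pass differs from chaining str.replace calls.

```python
import re

def multiple_replace(mapping, text):
    # One alternation of all escaped keys; each match is looked up in the dict.
    regex = re.compile("|".join(map(re.escape, mapping)))
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

trans = {"a": "aa", "b": "bb"}
doubled = multiple_replace(trans, "abc")
print(doubled)  # aabbc

# Extra illustration (an assumption, not from the question): swapping two letters.
swap = {"a": "b", "b": "a"}
swapped = multiple_replace(swap, "abba")
chained = "abba".replace("a", "b").replace("b", "a")
print(swapped, chained)  # baab aaaa
```

In the chained version the second replace also rewrites the output of the first, which the single-pass version avoids.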

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

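As @nhahtdh pointed out, one downside to this approach is that it assumes the keys are prefix-free: if one dictionary key is a prefix of another, the alternation tries keys in the order they appear in the pattern, so the longer key is never matched when the shorter one comes first.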
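One possible workaround, offered as a sketch rather than as part of the Cookbook recipe, is to sort the keys longest first before building the alternation, so a longer key is always tried before any of its prefixes. The function name and sample mapping are invented for this example.

```python
import re

def multiple_replace_longest_first(mapping, text):
    # Longest keys first, so "ab" is tried before its prefix "a".
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

mapping = {"a": "1", "ab": "2"}

# Naive order follows dict insertion order: the pattern is a|ab, so "ab" never matches.
naive = re.compile("|".join(map(re.escape, mapping)))
naive_result = naive.sub(lambda mo: mapping[mo.group(0)], "ab a")

fixed_result = multiple_replace_longest_first(mapping, "ab a")

print(naive_result)  # 1b 1
print(fixed_result)  # 2 1
```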

Replace a digit in a sentence using re.sub() in python

You may use

re.sub(r'(\d/)\d(?!\d)', r'\g<1>9', s)

The regex matches:

  • (\d/) - Group 1 (referred to with the \g<1> unambiguous backreference from the replacement pattern; the \g<N> syntax is required since, after the backreference, there is a digit): a digit and a / char
  • \d - a digit
  • (?!\d) - not followed with any other digit.

See the Python demo:

import re
s = "Your score for quiz01 is 6/8."
print( re.sub(r"(\d/)\d(?!\d)", r"\g<1>9", s) )
# => Your score for quiz01 is 6/9.
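To illustrate why the \g<1> form is needed here, the sketch below tries the ambiguous replacement \19, which the re module reads as a reference to group 19, and then shows a callback that sidesteps the ambiguity entirely. The variable names are made up for this example.

```python
import re

s = "Your score for quiz01 is 6/8."
pattern = r"(\d/)\d(?!\d)"

ambiguous_error = None
try:
    # \19 is parsed as group 19, which this pattern does not define.
    re.sub(pattern, r"\19", s)
except re.error as exc:
    ambiguous_error = exc
print(ambiguous_error is not None)  # True

# A callback builds the replacement in plain Python, so no escaping rules apply.
via_callback = re.sub(pattern, lambda m: m.group(1) + "9", s)
print(via_callback)  # Your score for quiz01 is 6/9.
```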

