Python Re.Sub Back Reference Not Back Referencing

Python re.sub back reference not back referencing

You need to use a raw-string here so that the backslash isn't processed as an escape character:

>>> import re
>>> fileText = '<text top="52" left="20" width="383" height="15" font="0"><b>test</b></text>'
>>> fileText = re.sub("<b>(.*?)</b>", r"\1", fileText, flags=re.DOTALL)
>>> fileText
'<text top="52" left="20" width="383" height="15" font="0">test</text>'
>>>

Notice how "\1" was changed to r"\1". Though it is a very small change (one character), it has a big effect. See below:

>>> "\1"
'\x01'
>>> r"\1"
'\\1'
>>>
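As a small sketch of the same point, the snippet below compares the two literals directly. The variable names `plain` and `raw` are illustrative only.

```python
# A non-raw "\1" is an octal escape: a single character with code point 1.
plain = "\1"
# A raw r"\1" keeps the backslash: two characters, a backslash and the digit 1.
raw = r"\1"

print(len(plain), len(raw))  # 1 2
print(raw == "\\1")          # True: escaping the backslash gives the same string
```

Only the two-character form reaches re.sub with a backslash in it, so only that form can act as a backreference.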

Can't use '\1' backreference to capture-group in a function call in re.sub() repr expression

re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work because \g<1> (an unambiguous way of writing the first backreference, otherwise written \1) is only interpreted inside a string replacement pattern. If you pass it to another function, that function sees just the literal string \g<1>, because the re module has had no chance to evaluate it. The re engine expands backreferences only while it performs a match, but A[int(r'\g<1>')] is evaluated before re.sub is even called.
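The evaluation order can be checked with a short sketch that reuses S and A from the question. The `outcome` variable is only there to record what happened.

```python
import re

S = '02143'
A = ['a', 'b', 'c', 'd', 'e']

try:
    # Python evaluates the arguments first, so int() receives the literal
    # text \g<1> and fails before re.sub() is ever called.
    re.sub(r'([0-9])', A[int(r'\g<1>')], S)
    outcome = 'no error'
except ValueError:
    outcome = 'ValueError'

print(outcome)
```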

That is why re.sub accepts a callback as the replacement argument: the callback receives the match object, so you can pass the matched group values to any other function for further processing.

See the re documentation:

re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.

Use

import re
S = '02143' 
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))


Note that you do not need to wrap the whole pattern in capturing parentheses: x.group() already returns the whole match.
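A minimal sketch comparing the two ways of reading the digit inside the callback. The function names `lookup` and `lookup_captured` are hypothetical.

```python
import re

S = '02143'
A = ['a', 'b', 'c', 'd', 'e']

def lookup(match):
    # match.group() is the same as match.group(0): the whole matched text.
    return A[int(match.group())]

def lookup_captured(match):
    # With a capturing group in the pattern, group(1) holds the same digit here.
    return A[int(match.group(1))]

whole = re.sub(r'[0-9]', lookup, S)
captured = re.sub(r'([0-9])', lookup_captured, S)
print(whole, captured)
```

Both calls give the same result, so the extra parentheses only matter when you need part of the match rather than all of it.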

Why don't backreferences work in Python's re.sub when using a replacement function?

There is a simpler way to achieve your goal.

As you have already seen, your replacement function gets a match object as its argument.

This object has, among others, a method group() which can be used instead:

def dashrepl(matchobj):
    return matchobj.group(0) + ' '

which will give exactly your result.
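For illustration, here is the function in use. The pattern and the input string are assumptions, since the original question's data is not shown here.

```python
import re

def dashrepl(matchobj):
    # group(0) is the whole match; append a space to it.
    return matchobj.group(0) + ' '

# Hypothetical pattern and input, chosen only to show the effect.
result = re.sub(r'\d+', dashrepl, 'a1b22c')
print(result)
```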

But you are completely right - the docs are a bit confusing in that way:

they describe the repl argument:

repl can be a string or a function; if it is a string, any backslash escapes in it are processed.

and

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

You could read this as saying that "the replacement string" returned by the function also has its backslash escapes processed.

But since that processing is described only for the case where "it is a string", the intended meaning becomes clearer, although it is not obvious at first glance.
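The difference can be seen in a short sketch. The pattern and input are made up for the example; Match.expand() is the documented way to apply template processing explicitly inside a function.

```python
import re

# A string repl is processed: \1 becomes the text of group 1.
as_string = re.sub(r'(\d)', r'[\1]', 'a1')

# A function's return value is inserted as-is: \1 stays a backslash and a 1.
as_function = re.sub(r'(\d)', lambda m: r'[\1]', 'a1')

# Match.expand() performs the backreference substitution on request.
expanded = re.sub(r'(\d)', lambda m: m.expand(r'[\1]'), 'a1')

print(as_string, as_function, expanded)
```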

Handling backreferences to capturing groups in re.sub replacement pattern

You should be using raw strings for regex, try the following:

coord_re = re.sub(r"(\d), (\d)", r"\1,\2", coords)

With your current code, the backslashes in your replacement string are escaping the digits, so you are replacing every match with the equivalent of chr(1) + "," + chr(2):

>>> '\1,\2'
'\x01,\x02'
>>> print('\1,\2')
,
>>> print(r'\1,\2')   # this is what you actually want
\1,\2

Any time you want to leave the backslash in the string, use the r prefix, or escape each backslash (\\1,\\2).
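A runnable sketch of the three spellings side by side. The sample value of `coords` is an assumption.

```python
import re

coords = "1, 2"  # hypothetical input

# Raw replacement string: \1 and \2 reach re.sub as backreferences.
raw_repl = re.sub(r"(\d), (\d)", r"\1,\2", coords)

# Escaped backslashes produce exactly the same replacement string.
escaped_repl = re.sub(r"(\d), (\d)", "\\1,\\2", coords)

# Unescaped: Python turns \1 and \2 into chr(1) and chr(2) before re.sub runs.
broken_repl = re.sub(r"(\d), (\d)", "\1,\2", coords)

print(repr(raw_repl), repr(escaped_repl), repr(broken_repl))
```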

Python re.sub: ignore backreferences in the replacement string

The previous answer using re.escape() would escape too much, and you would get undesirable backslashes in the replacement and the replaced string.

It seems like in Python only the backslash needs escaping in the replacement string, thus something like this could be sufficient:

replacement = replacement.replace("\\", "\\\\")

Example (Python 2 syntax):

import re

x = r'hai! \1 <ops> $1 \' \x \\'
print "want to see: "
print x

print "getting: "
print re.sub(".(.).", x, "###")
print "over escaped: "
print re.sub(".(.).", re.escape(x), "###")
print "could work: "
print re.sub(".(.).", x.replace("\\", "\\\\"), "###")

Output:

want to see: 
hai! \1 <ops> $1 \' \x \\
getting: 
hai! # <ops> $1 \' \x \
over escaped: 
hai\!\ \1\ \<ops\>\ \$1\ \\'\ \x\ \\
could work: 
hai! \1 <ops> $1 \' \x \\
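For Python 3, a sketch of the same idea follows. Recent Python 3 versions reject unknown letter escapes such as \x in a string repl, so passing x unmodified would raise an error there instead of printing the "getting" line above. The second option, returning the text from a function, is an alternative not mentioned in the answer; it relies on function return values being inserted without escape processing.

```python
import re

x = r'hai! \1 <ops> $1 \' \x \\'

# Option 1: double every backslash so re.sub treats each one as a literal.
escaped = re.sub(".(.).", x.replace("\\", "\\\\"), "###")

# Option 2: a function's return value is never scanned for escapes.
via_function = re.sub(".(.).", lambda m: x, "###")

print(escaped)
print(via_function)
```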

Python's re.sub returns data in wrong encoding from unicode

Because '\1' is the character with code point 1 (its repr is '\x01'). Under the rules for string literals, re.sub never saw your backslash. Even if you had escaped it, as in r'\1' or '\\1', reference 1 isn't the right number; you need parentheses to define groups. r'\g<0>' would work, as described in the re.sub documentation.
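The three cases can be sketched as follows. The pattern and the input `text` are hypothetical stand-ins for the original question's data.

```python
import re

text = 'abc'  # hypothetical input

# Non-raw '\1' is chr(1), so a control character is inserted.
control = re.sub(r'b', '\1', text)

# Raw r'\1' is a real backreference, but the pattern defines no group 1.
try:
    re.sub(r'b', r'\1', text)
    group_error = False
except re.error:
    group_error = True

# \g<0> refers to the whole match and needs no parentheses.
whole = re.sub(r'b', r'<\g<0>>', text)

print(repr(control), group_error, whole)
```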

Named backreference (?P=name) issue in Python re

The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:

(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.

To replace matches, use the \1, \g<1> or \g<char> syntax in the re.sub replacement pattern. Do not use (?P=name) for that purpose:

repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern...

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
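The following sketch puts the in-pattern and in-replacement forms next to each other, using the pattern from the demo above. The shorter `pair` pattern and the test strings are assumptions made for the example.

```python
import re

# In-pattern backreferences: (?P=char) and (?P=digit) must repeat the captures.
pattern = r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)'
matched = re.fullmatch(pattern, '123-abcd&abcd-123') is not None
mismatch = re.fullmatch(pattern, '123-abcd&abcd-999') is None

# Replacement backreferences use \g<name> or \g<number> instead.
pair = r'(?P<digit>\d{3})-(?P<char>\w{4})'
swapped = re.sub(pair, r'\g<char>-\g<digit>', '123-abcd')

# \g<2>0 is group 2 followed by a literal 0; \20 would mean group 20.
suffixed = re.sub(pair, r'\g<2>0', '123-abcd')
try:
    re.sub(pair, r'\20', '123-abcd')
    group20_error = False
except re.error:
    group20_error = True

print(matched, mismatch, swapped, suffixed, group20_error)
```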

invalid group reference when using re.sub()

You cannot use \1 replacement backreference if there are no capturing groups in the pattern. Add the capturing group to the pattern:

pattern = r'\b(' + ' (?:\w+ )?(?:\w+ )?'.join(markers) + r')\b' # or
              ^                                            ^
pattern = r'\b({})\b'.format(r' (?:\w+ )?(?:\w+ )?'.join(markers))

Or, just use \g<0> to insert the whole match rather than a capturing group value (then there is no need to amend your regex):

text = re.sub(pattern, r'<b>\g<0></b>', s)
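A self-contained sketch of the \g<0> variant. The values of `markers` and `s` are assumptions, since the question's data is not reproduced here.

```python
import re

markers = ['big', 'dog']          # hypothetical marker words
s = 'a big brown dog barks'       # hypothetical input

# No capturing group: allow up to two extra words between the markers.
pattern = r'\b{}\b'.format(r' (?:\w+ )?(?:\w+ )?'.join(markers))

# \g<0> inserts the whole match, so the pattern needs no parentheses.
text = re.sub(pattern, r'<b>\g<0></b>', s)
print(text)
```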


regex: getting backreference to number, adding to it

The actual problem is that you are supposed to pass a function as the second argument of re.sub. Instead, you are calling the function and passing its return value.

Why does it work in the first case?

Whenever a match is found, the second argument is examined. If it is a string, it is used as the replacement; if it is a function, the function is called with the match object. In your case, add_pages(r"\1") simply returns r"\1" itself, so the re.sub call translates to this:

print(re.sub(r"(?<=Page )(\d{2})", r"\1", ...))

So, it actually replaces the original matched string with the same. That is why it works.

Why doesn't it work in the second case?

But, in the second case, when you do

add_pages(r"\1")

you are trying to convert r"\1" to an integer, which is not possible. That is why it is failing.

How to fix this?

The correct way to write this is:

def add_pages(matchObject):
    # group() returns the matched digits; add 10 and convert back to a string
    return str(int(matchObject.group()) + 10)
print(re.sub(r"(?<=Page )(\d{2})", add_pages, ...))
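A runnable sketch of the fix with a made-up input string in place of the elided argument. The value of `text` is an assumption.

```python
import re

def add_pages(match_object):
    # group() is the matched two-digit page number as a string.
    return str(int(match_object.group()) + 10)

text = "Page 12, Page 34"  # hypothetical input

# Pass the function itself, not the result of calling it.
result = re.sub(r"(?<=Page )(\d{2})", add_pages, text)
print(result)
```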

