How to Replace Only Part of the Match with Python re.sub

How to replace only part of the match with python re.sub

 re.sub(r'(?:_a)?\.([^.]*)$', r'_suff.\1', "long.file.name.jpg")

?: starts a non-capturing group, so (?:_a) matches the _a without assigning it a group number; the question mark that follows makes it optional.

So in English, this says: match the final .<anything> at the end of the string, whether or not it is preceded by _a.
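As a quick check of the pattern above, here is a minimal sketch that applies it to a name with and without the _a suffix. The helper name add_suffix and the second file name are assumptions made for illustration.

```python
import re

def add_suffix(filename):
    # Hypothetical helper: swap an optional "_a" before the last extension
    # for "_suff", keeping the extension via the \1 backreference.
    return re.sub(r'(?:_a)?\.([^.]*)$', r'_suff.\1', filename)

plain = add_suffix("long.file.name.jpg")
with_a = add_suffix("long.file.name_a.jpg")  # assumed second input
print(plain)   # long.file.name_suff.jpg
print(with_a)  # long.file.name_suff.jpg
```

Both inputs end up with the same name, because the optional group consumes _a when it is present and matches nothing when it is not.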

Another way to do this would be to use a lookbehind. I mention this because lookarounds are super useful, but I didn't know about them for 15 years of writing regexes.

python re.sub, only replace part of match

You can use substitution groups:

>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(\<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'

Notice I put the first group (the one you want to keep) in parentheses and then kept it in the output by using the "\1" backreference (first group) in the replacement string.
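The same idea can be written with a named group, which some find easier to read than a numbered one. This is only a sketch: the group name tag is an assumption, and \g<tag> is the re module's syntax for referring to a named group in the replacement string.

```python
import re

my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'

# (?P<tag>...) names the part to keep; the attributes after it are matched
# but not captured, so they are dropped from the result.
pattern = re.compile(r'(?P<tag><[A-Za-z0-9_]+)\s[A-Za-z0-9_="\s]+')
result = pattern.sub(r'\g<tag>', my_string)
print(result)  # <cross_sell> --> <cross_sell>
```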

Why does re.sub replace the entire pattern, not just a capturing group within it?

Because it's supposed to replace the whole occurrence of the pattern:

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.

If it were to replace only some subgroup, then complex regexes with several groups wouldn't work. There are several possible solutions:

  1. Specify pattern in full: re.sub('ab', 'ad', 'abc') - my favorite, as it's very readable and explicit.
  2. Capture the groups you want to preserve and refer to them in the replacement string (which should be a raw string so the backslash doesn't need escaping): re.sub('(a)b', r'\1d', 'abc')
  3. Similar to the previous option: pass a callback function as the repl argument; it receives the Match object and returns the replacement text.
  4. Use lookbehinds/lookaheads, which are not included in the match but affect matching: re.sub('(?<=a)b', r'd', 'abxb') yields adxb. The ?<= at the beginning of the group says "it's a lookbehind".
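The four options listed above can be put side by side. This is a minimal sketch using the same toy input; the variable names are chosen here for illustration only.

```python
import re

text = 'abc'

# 1. Spell out the whole pattern and the whole replacement.
full = re.sub('ab', 'ad', text)

# 2. Capture the part to keep and reuse it with a backreference.
backref = re.sub('(a)b', r'\1d', text)

# 3. Build the replacement in a callback that receives the Match object.
callback = re.sub('(a)b', lambda m: m.group(1) + 'd', text)

# 4. Assert the context with a lookbehind so only "b" is matched.
lookbehind = re.sub('(?<=a)b', 'd', text)

print(full, backref, callback, lookbehind)  # adc adc adc adc
```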

Using re.sub with capture groups to replace only portion of a match

Use a lookahead to match part of the string without replacing it.

pattern = r'\A\w+(?=[@+\-/*])'

You don't need a capture group when you're just removing the match; it's needed if you need to copy parts of the input text into the result. You also don't need [] around \w. And you should get rid of the * after [@+\-/*], since you want to require one of those characters.

You should generally use raw strings when creating regular expressions, so that the regexp escape sequences won't be confused with Python escape sequences. And you should escape - in a character set unless it is the first or last character; otherwise it creates a range of characters.
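To see the lookahead pattern in action, here is a small sketch. The input strings are hypothetical, since the original question's data is not shown.

```python
import re

pattern = r'\A\w+(?=[@+\-/*])'

# Hypothetical inputs: a leading word followed by an operator character.
with_plus = re.sub(pattern, '', 'total+tax')
with_at = re.sub(pattern, '', 'user@host')
no_match = re.sub(pattern, '', 'no_operator')

print(with_plus)  # +tax
print(with_at)    # @host
print(no_match)   # no_operator (nothing matched, so nothing changed)
```

The operator character is checked by the lookahead but never consumed, so it survives in the result.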

Replacing only the captured group using re.sub and multiple replacements

You can use a regex based on a lookbehind and a lookahead, then a lambda function that steps through the replacement words:

>>> import re
>>> words = ['Swimming', 'Eating', 'Jogging']
>>> string = 'I love X. I love Y. I love Z.'  # assumed input; the question's original string is not shown
>>> pattern = re.compile(r'(?<=I love )\w+(?=\.)')
>>> print(pattern.sub(lambda m: words.pop(0), string))
I love Swimming. I love Eating. I love Jogging.
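Note that words.pop(0) empties the list as it goes and raises IndexError if there are more matches than words. A possible variant, sketched here with an assumed input string, draws from an iterator instead and falls back to the matched text when the words run out.

```python
import re

words = ['Swimming', 'Eating', 'Jogging']
text = 'I love X. I love Y. I love Z.'  # hypothetical input string

pattern = re.compile(r'(?<=I love )\w+(?=\.)')
replacements = iter(words)

# next(..., default) returns the original match once the iterator is spent,
# so surplus matches are left unchanged and the words list is not mutated.
result = pattern.sub(lambda m: next(replacements, m.group(0)), text)
print(result)  # I love Swimming. I love Eating. I love Jogging.
```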

Python re.sub() is replacing the full match even when using non-capturing groups

The general solution for such problems is using a lambda in the replacement:

import re

string = 'aBCDeFGH'

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', lambda match: '+%s+%s' % (match.group(2), match.group(4)), string))

However, as bro-grammer has commented, you can use backreferences in this case:

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', r'+\2+\4', string))
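As a rough check that the callback and the backreference forms agree, the sketch below runs both on the original input and on a second, hypothetical input where the optional groups do not participate in the match.

```python
import re

pattern = re.compile(r'(a)?([A-Z]{3})(e)?([A-Z]{3})')

def keep_upper(match):
    # Groups 1 and 3 are optional and may be None; only 2 and 4 are used.
    return '+%s+%s' % (match.group(2), match.group(4))

results = []
for sample in ('aBCDeFGH', 'BCDFGH'):  # second input is a hypothetical case
    via_callback = pattern.sub(keep_upper, sample)
    via_backref = pattern.sub(r'+\2+\4', sample)
    results.append((via_callback, via_backref))

print(results)  # [('+BCD+FGH', '+BCD+FGH'), ('+BCD+FGH', '+BCD+FGH')]
```

A callback is the more flexible choice when the replacement depends on whether an optional group matched, since match.group(n) returns None for groups that did not take part.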

Python replace only part of a re.sub match

Use this instead: re.sub("(?<=[^a-zA-Z])pi(?=[^a-zA-Z])", "(math.pi)", "2pi3 + supirse")

Visualization: http://regex101.com/r/fX5wX3
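One caveat: because each lookaround in that pattern requires an actual non-letter character, a pi at the very start or end of the string is not matched. A possible variant, sketched here with a hypothetical input, uses negative lookarounds, which only forbid a letter on either side.

```python
import re

expr = "2pi3 + supirse + pi"  # hypothetical input ending in a bare "pi"

# Original form: needs a non-letter character on both sides of "pi".
positive = re.sub(r"(?<=[^a-zA-Z])pi(?=[^a-zA-Z])", "(math.pi)", expr)

# Negative lookarounds: only forbid a letter on either side.
negative = re.sub(r"(?<![a-zA-Z])pi(?![a-zA-Z])", "(math.pi)", expr)

print(positive)  # 2(math.pi)3 + supirse + pi
print(negative)  # 2(math.pi)3 + supirse + (math.pi)
```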

Python Regular Expression; replacing a portion of match

If you want to only remove zeros after letters, you may use:

([a-zA-Z])0+

Replace with the \1 backreference.

The ([a-zA-Z]) will capture a letter and 0+ will match 1 or more zeros.

Python demo:

import re
s = 'e004_n07'
res = re.sub(r'([a-zA-Z])0+', r'\1', s)
print(res)

Note that re.sub finds and replaces all non-overlapping matches (it performs a global search and replace). If there is no match, the string is returned as is, without modifications. So there is no need to call re.match/re.search first.
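If you also want to know how many replacements were made, re.subn returns the new string together with the count. A short sketch, where the second input is a hypothetical string with nothing to replace:

```python
import re

# re.subn returns a (new_string, number_of_substitutions) tuple.
changed, count = re.subn(r'([a-zA-Z])0+', r'\1', 'e004_n07')
print(changed, count)  # e4_n7 2

# With no match, re.sub hands back the input unchanged.
unchanged = re.sub(r'([a-zA-Z])0+', r'\1', 'abc_12')  # hypothetical input
print(unchanged)       # abc_12
```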

UPDATE

To keep 1 zero if the numbers only contain zeros, you may use

import re
s = ['e004_n07','e000_n00']
res = [re.sub(r'(?<=[a-zA-Z])0+(\d*)', lambda m: m.group(1) if m.group(1) else '0', x) for x in s]
print(res)

Here, r'(?<=[a-zA-Z])0+(\d*)' regex matches one or more zeros (0+) that are after an ASCII letter ((?<=[a-zA-Z])) and then any other digits (0 or more) are captured into Group 1 with (\d*). Then, in the replacement, we check if Group 1 is empty, and if it is empty, we insert 0 (there are only zeros), else, we insert Group 1 contents (the remaining digits after the first leading zeros).
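The same replacement logic can be written as a named function, which may be easier to read and test than the inline lambda. This is a sketch: the function name and the third sample are assumptions added for illustration.

```python
import re

def strip_leading_zeros(match):
    # Group 1 holds the digits left after the run of zeros; if it is empty
    # the number was all zeros, so a single 0 is kept.
    return match.group(1) or '0'

pattern = re.compile(r'(?<=[a-zA-Z])0+(\d*)')
samples = ['e004_n07', 'e000_n00', 'e010_n100']  # last item is hypothetical
cleaned = [pattern.sub(strip_leading_zeros, s) for s in samples]
print(cleaned)  # ['e4_n7', 'e0_n0', 'e10_n100']
```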

How can I replace a string match with part of itself in Python?

Instead of using re.sub() directly, you can use re.findall() to find all substrings (matched non-greedily) that begin and end with square brackets.

Then, iterate through the matches and use the str.replace() method to replace each match in the string with the second character in the match:

import re

s = "alEhos[cr@e]sjt"

for m in re.findall(r"\[.*?\]", s):
    s = s.replace(m, m[1])

print(s)

Output:

alEhoscsjt
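For comparison, the same result can probably be reached with a single re.sub() call and a capture group. This is a sketch of that alternative, not the approach of the answer above.

```python
import re

s = "alEhos[cr@e]sjt"

# \[(.) captures the first character after "[", .*?\] lazily consumes the
# rest of the bracketed text, and \1 puts the captured character back.
result = re.sub(r'\[(.).*?\]', r'\1', s)
print(result)  # alEhoscsjt
```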

