Re.Sub Replace with Matched Content

re.sub replace with matched content

Simply use \1 instead of $1:

In [1]: import re

In [2]: method = 'images/:id/huge'

In [3]: re.sub(r'(:[a-z]+)', r'<span>\1</span>', method)
Out[3]: 'images/<span>:id</span>/huge'

Also note the use of raw strings (r'...') for regular expressions. It is not mandatory but removes the need to escape backslashes, arguably making the code slightly more readable.
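As a small sketch of the point about raw strings, the two calls below produce the same result; the non-raw version just needs the backslash doubled. The last call uses the \g<1> form, which is equivalent to \1 and stays unambiguous when a digit follows the group reference. The variable names are only for illustration.

```python
import re

method = 'images/:id/huge'

# Raw string: the backslash in \1 reaches the regex engine untouched.
raw = re.sub(r'(:[a-z]+)', r'<span>\1</span>', method)

# Plain string: the backslash has to be doubled to get the same replacement.
escaped = re.sub('(:[a-z]+)', '<span>\\1</span>', method)
print(raw == escaped)  # True

# \g<1> means the same as \1, but \g<1>0 cannot be misread as group 10.
with_digit = re.sub(r'(:[a-z]+)', r'\g<1>0', method)
print(with_digit)  # images/:id0/huge
```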

Replace matched substring using re.sub

According to the documentation, re.sub is defined as

re.sub(pattern, repl, string, count=0, flags=0)

If repl is a function, it is called for every non-overlapping occurrence of pattern.

That said, if you pass a lambda function, you can keep the code on one line. Also remember that the whole matched text is available from the match object as x[0].

I removed _ from the regex to reach the desired output.

import re

txt = "/J&L/LK/Tac1_1/shareloc.pdf"
# Drop every non-digit character, except that '_' is turned into '.'
x = re.sub(r"[^0-9]", lambda x: '.' if x[0] == '_' else '', txt)
print(x)  # 1.1
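If the lambda gets hard to read, the same idea can be written with a named function. This is only a sketch; keep_digits is a name made up for this example, and match[0] needs Python 3.6 or later (match.group(0) works everywhere).

```python
import re

def keep_digits(match):
    # Hypothetical helper: match[0] is the whole matched text,
    # the same as match.group(0).
    return '.' if match[0] == '_' else ''

txt = "/J&L/LK/Tac1_1/shareloc.pdf"
# re.sub calls keep_digits once for every non-digit character.
result = re.sub(r"[^0-9]", keep_digits, txt)
print(result)  # 1.1
```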

Getting the match number when passing a function in re.sub

Based on @Barmar's answer, I tried this:

import re

def custom_replace(match, matchcount):
    result = 'a' + str(matchcount.i)
    matchcount.i += 1
    return result

def any_request():
    matchcount = lambda: None  # an empty "object", see https://stackoverflow.com/questions/19476816/creating-an-empty-object-in-python/37540574#37540574
    matchcount.i = 0  # benefit: it's a local variable that we pass to custom_replace "by reference"
    print(re.sub(r'o', lambda match: custom_replace(match, matchcount), "oh hello wow"))
    # a0h hella1 wa2w

any_request()

and it seems to work.

Reason: I was a bit reluctant to use a global variable for this, because I'm using this inside a web framework, in a route function (called any_request() here).

If many requests run in parallel (in threads), I don't want a global variable to be "mixed" between different calls (since the operations are probably not atomic).
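A possible variation on the same idea, as a sketch only: keep the counter in an itertools.count created inside the function, so every call gets its own counter and nothing is shared between requests. The function name number_matches and its prefix parameter are made up for this example.

```python
import itertools
import re

def number_matches(pattern, text, prefix='a'):
    # Hypothetical helper: a fresh counter is created on every call,
    # so concurrent callers never share state.
    counter = itertools.count()
    return re.sub(pattern, lambda match: prefix + str(next(counter)), text)

numbered = number_matches(r'o', "oh hello wow")
print(numbered)  # a0h hella1 wa2w
```

Calling number_matches again starts from 0, because the counter lives only for the duration of one call.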

re.sub() - Replace with text from match without using capture groups?

To answer your first question, re.sub allows you to use a function instead of a fixed replacement string. E.g.

>>> s = "omglolwtfbbq"
>>> regex = r"l[\w]"
>>> re.sub(regex, lambda x: "!%s!" % x.group(), s)
'omg!lo!!lw!tfbbq'

Note that the .group method of a match object returns the whole match (whether or not capture groups are present). If you have capture groups, then .groups returns those captured groups.
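To make the difference between .group() and .groups() concrete, here is a short sketch. The first sample string is invented for the example.

```python
import re

# Pattern with two capture groups.
m = re.search(r'(\d+)-(\d+)', 'range 10-20 here')
whole = m.group()      # whole match
first = m.group(1)     # first capture group
captured = m.groups()  # all capture groups as a tuple
print(whole, first, captured)  # 10-20 10 ('10', '20')

# Pattern without capture groups: .group() still returns the whole match,
# and .groups() is an empty tuple.
m2 = re.search(r'l\w', 'omglolwtfbbq')
print(m2.group(), m2.groups())  # lo ()
```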

To answer your question about colouring specifically, I would recommend taking a look at colorama.

Python re.sub() is replacing the full match even when using non-capturing groups

The general solution for such problems is using a lambda in the replacement:

import re

string = 'aBCDeFGH'

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', lambda match: '+%s+%s' % (match.group(2), match.group(4)), string))

However, as bro-grammer has commented, you can use backreferences in this case:

print(re.sub('(a)?([A-Z]{3})(e)?([A-Z]{3})', r'+\2+\4', string))
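A quick check that both forms give the same result, plus one thing to keep in mind with the callback: an optional group that did not take part in the match is None there. This is a sketch; the second input string is made up.

```python
import re

pattern = r'(a)?([A-Z]{3})(e)?([A-Z]{3})'
string = 'aBCDeFGH'

with_callback = re.sub(pattern, lambda m: '+%s+%s' % (m.group(2), m.group(4)), string)
with_backrefs = re.sub(pattern, r'+\2+\4', string)
print(with_callback, with_backrefs)  # +BCD+FGH +BCD+FGH

# When the optional (a)? group matches nothing, group(1) is None,
# so a callback that formats it has to handle that case itself.
m = re.search(pattern, 'BCDFGH')
print(m.group(1), m.group(2))  # None BCD
```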

python regex sub replace the whole string

Match the rest of the string with a .*

import re

s = 'abcdefg'

s = re.sub(r'^abc.*', 'replacement', s)
print(s)

output:

replacement
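One caveat, shown as a sketch with an invented multi-line input: by default . does not match a newline, so .* stops at the end of the first line. Passing flags=re.DOTALL lets it run to the end of the string.

```python
import re

s = 'abc\ndef'

# Default: .* stops before the newline, so the second line survives.
first_line_only = re.sub(r'^abc.*', 'replacement', s)
print(repr(first_line_only))  # 'replacement\ndef'

# re.DOTALL: . also matches '\n', so the whole string is replaced.
whole_string = re.sub(r'^abc.*', 'replacement', s, flags=re.DOTALL)
print(repr(whole_string))  # 'replacement'
```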

Why re.sub() adds not matched string by default in Python?

You seem to have a misunderstanding of what sub does. It substitutes the text that the regex matches. The regex r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)' matches part of your string, so ONLY THE MATCHING PART is substituted; the capture groups do not affect this.
What you can do (if you don't want to add .* at the beginning and the end) is use re.findall like this:

re.findall(
    r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)',
    'START, size: 100Х200 x 50, END'
)

which will return [('size:', '100', '200', '50')]; you can then format it as you wish.
One way to do it, as a one-liner with no error handling, is:

'{1}x{2}x{3}'.format(
    *re.findall(
        r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)',
        'START, size: 100Х200 x 50, END')[0]
)
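If you do want error handling, a sketch along these lines uses re.search and checks for None before formatting. The helper name extract_size and the second test string are made up for this example, and 'size:' is left uncaptured because it is not needed in the output.

```python
import re

def extract_size(text):
    # Hypothetical helper: returns e.g. '100x200x50',
    # or None when no size is found in the text.
    m = re.search(r'size:\D+(\d+)\D+(\d+)\D+(\d+)', text)
    if m is None:
        return None
    return 'x'.join(m.groups())

found = extract_size('START, size: 100Х200 x 50, END')
missing = extract_size('no dimensions here')
print(found, missing)  # 100x200x50 None
```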

Using re.sub with capture groups to replace only portion of a match

Use a lookahead to match part of the string without replacing it.

pattern = r'\A\w+(?=[@+\-/*])'

You don't need a capture group when you're just removing the match; it's needed if you need to copy parts of the input text into the result. You also don't need [] around \w. And you should get rid of the * after [@+\-/*], since you want to require one of those characters.

You should generally use raw strings when creating regular expressions, so that the regexp escape sequences won't be confused for Python escape sequences. And you should escape - in a character set, otherwise it's used to create a range of characters.
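Here is a small sketch of that pattern in use. The input strings are invented; the point is that the operator checked by the lookahead stays in the result, and nothing is removed when no operator follows the word.

```python
import re

pattern = r'\A\w+(?=[@+\-/*])'

# The leading word is removed; the '+' is only looked at, not consumed.
with_operator = re.sub(pattern, '', 'total+tax')
print(with_operator)  # +tax

# No operator after the word, so the lookahead fails and nothing changes.
without_operator = re.sub(pattern, '', 'total tax')
print(without_operator)  # total tax
```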

How to replace only part of the match with python re.sub

re.sub(r'(?:_a)?\.([^.]*)$', r'_suff.\1', "long.file.name.jpg")

?: starts a non-capturing group, so (?:_a) matches the _a without numbering it as a group; the following question mark makes it optional.

So in English, this says: match the ending .<anything>, whether or not it is preceded by _a.
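To see the optional _a at work, here is the same call applied to a name with and without it. This is a sketch; the second file name is made up.

```python
import re

pattern = r'(?:_a)?\.([^.]*)$'

# No _a before the extension: only '.jpg' is matched and rewritten.
plain = re.sub(pattern, r'_suff.\1', "long.file.name.jpg")
print(plain)  # long.file.name_suff.jpg

# With _a before the extension: '_a.jpg' is matched, so _a is dropped too.
with_a = re.sub(pattern, r'_suff.\1', "long.file.name_a.jpg")
print(with_a)  # long.file.name_suff.jpg
```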

Another way to do this would be to use a lookbehind. I mention it because lookbehinds are super useful, but I didn't know about them for 15 years of writing regexes.

Why does re.sub replace the entire pattern, not just a capturing group within it?

Because it's supposed to replace the whole occurrence of the pattern:

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.

If it were to replace only some subgroup, then complex regexes with several groups wouldn't work. There are several possible solutions:

  1. Specify pattern in full: re.sub('ab', 'ad', 'abc') - my favorite, as it's very readable and explicit.
  2. Capture the groups you want to preserve and then refer to them in the replacement (note that it should be a raw string to avoid escaping): re.sub('(a)b', r'\1d', 'abc')
  3. Similar to previous option: provide a callback function as repl argument and make it process the Match object and return required result.
  4. Use lookbehinds/lookaheads, which are not included in the match but affect matching: re.sub('(?<=a)b', r'd', 'abxb') yields adxb. The ?<= at the beginning of the group says "it's a lookbehind".
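
The four options can be compared side by side. This is just a sketch; the variable names are for illustration only.

```python
import re

# 1. Spell out the whole pattern and the whole replacement.
full = re.sub('ab', 'ad', 'abc')

# 2. Capture the part to keep and refer to it in the replacement.
backref = re.sub('(a)b', r'\1d', 'abc')

# 3. Build the replacement in a callback from the Match object.
callback = re.sub('(a)b', lambda m: m.group(1) + 'd', 'abc')

# 4. Lookbehind: the 'a' must be there, but is not part of the match.
lookbehind = re.sub('(?<=a)b', 'd', 'abxb')

print(full, backref, callback, lookbehind)  # adc adc adc adxb
```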

