Passing a Function to re.sub in Python

Passing a function to re.sub in Python

The replacement function receives a match object, so you should call group() on it to get the matching string:

import re

number_mapping = {'1': 'one',
                  '2': 'two',
                  '3': 'three'}
s = "1 testing 2 3"

print(re.sub(r'\d', lambda x: number_mapping[x.group()], s))

prints:

one testing two three
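
A digit that is missing from the mapping would raise a KeyError inside the lambda. The following is a minimal sketch of one way to guard against that, assuming you want unmapped digits left unchanged; the function name digit_to_word and the extra digit 9 in the input are illustrative only:

```python
import re

# Same mapping as above; only some digits are covered.
number_mapping = {'1': 'one', '2': 'two', '3': 'three'}

def digit_to_word(match):
    digit = match.group()  # the whole matched text, e.g. '2'
    # Fall back to the original digit when the mapping has no entry.
    return number_mapping.get(digit, digit)

result = re.sub(r'\d', digit_to_word, "1 testing 2 3 9")
print(result)  # one testing two three 9
```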

Using a function as argument to re.sub in Python?

Notice that m.group() returns the entire string that matched, whether or not it was part of a capturing group:

In [19]: m = re.search(r"#(\w+)", s)

In [20]: m.group()
Out[20]: '#Whatthehello'

m.group(0) also returns the entire match:

In [23]: m.group(0)
Out[23]: '#Whatthehello'

In contrast, m.groups() returns all capturing groups:

In [21]: m.groups()
Out[21]: ('Whatthehello',)

and m.group(1) returns the first capturing group:

In [22]: m.group(1)
Out[22]: 'Whatthehello'

So the problem in your code originates with the use of m.group() in

re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)

since

In [7]: re.search(r"#(\w+)", s).group()
Out[7]: '#Whatthehello'

whereas if you had used .group(1), you would have gotten

In [24]: re.search(r"#(\w+)", s).group(1)
Out[24]: 'Whatthehello'

and the preceding # makes all the difference:

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

In [26]: func_replace('Whatthehello')
Out[26]: 'What the hello'

Thus, changing m.group() to m.group(1), and substituting /usr/share/dict/words for corncob_lowercase.txt,

import re

def func_replace(each_func):
    i = 0
    wordsineach_func = []
    while len(each_func) > 0:
        i = i + 1
        word_found = longest_word(each_func)
        if len(word_found) > 0:
            wordsineach_func.append(word_found)
            each_func = each_func.replace(word_found, "")
    return ' '.join(wordsineach_func)

def longest_word(phrase):
    phrase_length = len(phrase)
    words_found = []
    index = 0
    outerstring = ""
    # Grow the prefix one character at a time and record every prefix that is a word
    while index < phrase_length:
        outerstring = outerstring + phrase[index]
        index = index + 1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    # No prefix is a word: fall back to the whole phrase
    if len(words_found) == 0:
        words_found.append(phrase)
    return max(words_found, key=len)

words = []
# /usr/share/dict/words stands in for corncob_lowercase.txt, a list of dictionary words
with open('/usr/share/dict/words') as f:
    for read_word in f:
        words.append(read_word.strip())
s = "#Whatthehello #goback"
hashtags = re.findall(r"#(\w+)", s)
print(func_replace(hashtags[0]))
print(re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), s))

prints

What the hello
What the hello gob a c k

since, alas, 'gob' is longer than 'go'.
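
To reproduce that behavior without a system word list, here is a compact sketch of the same greedy longest-prefix idea. The small words set is a hypothetical stand-in for the dictionary file; it includes single letters because the output above suggests the system word list contains them. It is a condensed rewrite for illustration, not the original code:

```python
import re

# Hypothetical mini-dictionary standing in for /usr/share/dict/words.
words = {"what", "the", "hello", "go", "gob", "back", "a", "c", "k"}

def longest_word(phrase):
    # Collect every prefix of phrase that is a known word.
    prefixes = [phrase[:end] for end in range(1, len(phrase) + 1)]
    found = [p for p in prefixes if p.lower() in words]
    # Fall back to the whole phrase when no prefix is a word.
    return max(found, key=len) if found else phrase

def func_replace(text):
    pieces = []
    while text:
        word = longest_word(text)
        pieces.append(word)
        text = text.replace(word, "")
    return " ".join(pieces)

result = re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), "#Whatthehello #goback")
print(result)  # What the hello gob a c k
```

Because 'gob' is a longer prefix than 'go', the greedy choice leaves 'ack' behind, which then splits into single letters.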


One way you could have debugged this is to replace the lambda function with a regular function and then add print statements:

def foo(m):
    result = func_replace(m.group())
    print((m.group(), result))  # print both values as a tuple
    return result

In [35]: re.sub(r"#(\w+)", foo, s)
('#Whatthehello', '#Whatthehello') <-- This shows you what `m.group()` and `func_replace(m.group())` return
('#goback', '#goback')
Out[35]: '#Whatthehello #goback'

That would focus your attention on

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

which you could then compare with

In [26]: func_replace(hashtags[0])
Out[26]: 'What the hello'

In [27]: func_replace('Whatthehello')
Out[27]: 'What the hello'

That would lead you to ask: if m.group() returns '#Whatthehello', what method do I need to return 'Whatthehello'? A dive into the docs then solves the problem.

Getting the match number when passing a function in re.sub

Based on @Barmar's answer, I tried this:

import re

def custom_replace(match, matchcount):
    result = 'a' + str(matchcount.i)
    matchcount.i += 1
    return result

def any_request():
    matchcount = lambda: None  # an empty "object", see https://stackoverflow.com/questions/19476816/creating-an-empty-object-in-python/37540574#37540574
    matchcount.i = 0  # benefit: it's a local variable that we pass to custom_replace "by reference"
    print(re.sub(r'o', lambda match: custom_replace(match, matchcount), "oh hello wow"))
    # a0h hella1 wa2w

any_request()

and it seems to work.

Reason: I was a bit reluctant to use a global variable for this, because I'm using this inside a web framework, in a route function (called any_request() here).

If many requests run in parallel (in threads), I don't want a global variable to be "mixed" between different calls (since the operations are probably not atomic?).
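
A possible alternative to the attribute-on-a-lambda trick is a local itertools.count() iterator captured by the replacement lambda. This is only a sketch under the same assumption as above (one counter per request); it is not taken from the answer it builds on:

```python
import itertools
import re

def any_request():
    # A fresh counter per call: nothing is shared between requests.
    counter = itertools.count()
    return re.sub(r'o', lambda match: 'a' + str(next(counter)), "oh hello wow")

result = any_request()
print(result)  # a0h hella1 wa2w
```

Each call to any_request() creates its own counter, so numbering restarts at 0 for every request.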

Call functions from re.sub

If you want to use a function with re.sub, you need to pass a function, not an expression. As documented for re.sub, your function should take the match object as an argument and return the replacement string. You can access the groups with the usual .group(n) methods and so on. An example:

re.sub("(a+)(b+)", lambda match: "{0} as and {1} bs ".format(
    len(match.group(1)), len(match.group(2))
), "aaabbaabbbaaaabb")
# Output is '3 as and 2 bs 2 as and 3 bs 4 as and 2 bs '

Note that the function must return a string (since its result will be put back into the original string).
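
As a small illustration of that point, the sketch below doubles every number in a string. The computed value is an int, so it is converted with str() before being returned; the input text is made up for the example:

```python
import re

def double(match):
    value = int(match.group()) * 2
    # Convert back to str: the replacement is spliced into the string.
    return str(value)

result = re.sub(r'\d+', double, "3 apples and 10 pears")
print(result)  # 6 apples and 20 pears
```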

Why do I have to pass a callable to re.sub to make an uppercase string?

The second argument can be a string or a callable.

re.sub(my_re, r'\1'.upper(), "bruce's computer"): here you're passing the string \1 to re.sub. Calling upper() on it changes nothing, because the call happens before any matching and \1 contains no letters.

re.sub(my_re, lambda x: x.group(1).upper(), "bruce's computer"): here you're passing a callable, so upper() works because it is applied to the matched text.

x.group(1).upper() isn't evaluated immediately because it's wrapped in a lambda expression, which is equivalent to the non-lambda:

def func(x):
    return x.group(1).upper()

You could also pass that to re.sub: re.sub(my_re, func, "bruce's computer"). Note the lack of () after func in that case!
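
The difference can be checked directly. In this sketch my_re is a hypothetical pattern, since the original one is not shown:

```python
import re

my_re = r"(\w+)"  # hypothetical pattern; the original my_re is not shown
text = "bruce's computer"

# r'\1'.upper() is evaluated before re.sub runs and is still r'\1'.
assert r'\1'.upper() == r'\1'
unchanged = re.sub(my_re, r'\1'.upper(), text)

# The callable runs once per match, so upper() sees the matched text.
uppercased = re.sub(my_re, lambda x: x.group(1).upper(), text)

print(unchanged)   # bruce's computer
print(uppercased)  # BRUCE'S COMPUTER
```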

Calling a function on captured group in re.sub()

You pass a function to re.sub and then you pull the group from there:

def base64_encode(match):
    """
    This function takes a re 'match object' and performs
    the appropriate substitutions
    """

    group = match.group(1)
    ...  # Code to encode as base 64
    return result

re.sub(..., base64_encode, s, flags=re.I)
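
The skeleton above leaves the encoding step and the pattern open. One way it might be filled in is sketched here using the standard base64 module; the b64(...) marker syntax, the pattern, and the input string are all assumptions made for the example:

```python
import base64
import re

def base64_encode(match):
    """Return the first captured group encoded as base64 text."""
    group = match.group(1)
    encoded = base64.b64encode(group.encode("utf-8"))
    return encoded.decode("ascii")  # re.sub needs str, not bytes

# Hypothetical pattern and input: encode whatever sits inside b64(...).
s = "token=B64(hello)"
result = re.sub(r"b64\((.*?)\)", base64_encode, s, flags=re.I)
print(result)  # token=aGVsbG8=
```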

Python replace string pattern with output of function

You can pass a function to re.sub. The function will receive a match object as its argument; use .group() to extract the match as a string.

>>> import re
>>> def my_replace(match):
...     match = match.group()
...     return match + str(match.index('e'))
...
>>> string = "The quick @red fox jumps over the @lame brown dog."
>>> re.sub(r'@\w+', my_replace, string)
'The quick @red2 fox jumps over the @lame4 brown dog.'

how to pass a match object from re.sub as an argument

Use a lambda to pass the match along with the extra argument:

string = re.sub(r"{(.*?)}", lambda match: matchVar(match, another_argument), string)

Or change matchVar so that it takes only the match, def matchVar(match):, and pass the function itself: re.sub(r"{(.*?)}", matchVar, string)
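
Here is a runnable sketch of the lambda approach. The names match_var, variables, and template are hypothetical, and the extra argument is assumed to be a dict of placeholder values; the braces are escaped in the pattern for readability:

```python
import re

def match_var(match, variables):
    # Look up the captured name; leave the placeholder untouched if unknown.
    return variables.get(match.group(1), match.group(0))

variables = {"name": "World"}  # hypothetical extra argument
template = "Hello {name}, {unknown}!"

result = re.sub(r"\{(.*?)\}", lambda match: match_var(match, variables), template)
print(result)  # Hello World, {unknown}!
```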

How can I pass a callback to re.sub, but still inserting match captures?

If you pass a function, you lose the automatic expansion of backreferences. You just get the match object and have to do the work yourself. So you could:

Pick a replacement string up front rather than passing a function:

import random
import re

text = "abcdef"
pattern = "(b|e)cd(b|e)"

repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, random.choice(repl), text)
# 'abblaef' or 'abblablaef'

Or write a function that processes the match object and allows more complex processing. You can take advantage of expand to use back references:

text = "abcdef abcdef"
pattern = "(b|e)cd(b|e)"

def repl(m):
    templates = [r"\1bla\2", r"\1blabla\2"]
    # A template is picked per match, then its backreferences are expanded
    return m.expand(random.choice(templates))

re.sub(pattern, repl, text)

# 'abblaef abblablaef' and variations

You can, of course, put that logic into a lambda:

repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, lambda m: m.expand(random.choice(repl)), text)
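
Since the examples above are random, here is a deterministic sketch of what m.expand() does inside a callback: it expands backreferences in a template, and the result can then be processed further. The pattern and input are made up for illustration:

```python
import re

def swap_and_shout(m):
    # expand() fills in \1 and \2 from this match, like a string replacement would.
    expanded = m.expand(r"\2 at \1")
    return expanded.upper()

result = re.sub(r"(\w+)@(\w+)", swap_and_shout, "user@example")
print(result)  # EXAMPLE AT USER
```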

How to pass a variable to a re.sub callback?

The easiest way, I guess, is to make use of functools.partial, which allows you to create a "partially evaluated" function:

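from functools import partial
from re import sub

def evaluate(match, mappings):
    # Strip the leading '#{' and trailing '}' and evaluate what is left
    return str(eval(match.group(0)[2:-1], mappings))

mappings = {'A': 1, 'B': 2} # Or whatever ...

# string is the input text to process
newstring = sub(r'\#\{([^#]+)\}', partial(evaluate, mappings=mappings), string)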
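
A self-contained version of the same idea follows, with a made-up input string so it can be run as is. It uses group(1) instead of slicing group(0), which should be equivalent for this pattern. Keep in mind that eval() executes arbitrary code, so this is only reasonable for trusted input:

```python
import re
from functools import partial

def evaluate(match, mappings):
    # group(1) is the expression between '#{' and '}'.
    # eval() runs arbitrary code, so only use it on trusted input.
    return str(eval(match.group(1), mappings))

mappings = {'A': 1, 'B': 2}
text = "#{A} + #{B} = #{A + B}"  # hypothetical input

result = re.sub(r'#\{([^#]+)\}', partial(evaluate, mappings=mappings), text)
print(result)  # 1 + 2 = 3
```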

