Passing a function to re.sub in Python
You should call group() to get the matching string:
import re
number_mapping = {'1': 'one',
                  '2': 'two',
                  '3': 'three'}
s = "1 testing 2 3"
print(re.sub(r'\d', lambda x: number_mapping[x.group()], s))
prints:
one testing two three
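If the pattern can match a digit that has no entry in the mapping, the direct dictionary lookup raises KeyError. A minimal sketch of one way around that, assuming you want unmapped digits left untouched (the helper name spell_digit and the extra "7" in the input are made up for illustration):

```python
import re

number_mapping = {'1': 'one', '2': 'two', '3': 'three'}

def spell_digit(match):
    """Return the spelled-out digit, or the digit itself if unmapped."""
    digit = match.group()
    # dict.get falls back to the original text instead of raising KeyError.
    return number_mapping.get(digit, digit)

result = re.sub(r'\d', spell_digit, "1 testing 2 3 and 7")
print(result)  # one testing two three and 7
```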
Using a function as argument to re.sub in Python?
Notice that m.group() returns the entire string that matched, whether or not it was part of a capturing group:
In [19]: m = re.search(r"#(\w+)", s)
In [20]: m.group()
Out[20]: '#Whatthehello'
m.group(0) also returns the entire match:
In [23]: m.group(0)
Out[23]: '#Whatthehello'
In contrast, m.groups() returns all capturing groups:
In [21]: m.groups()
Out[21]: ('Whatthehello',)
and m.group(1) returns the first capturing group:
In [22]: m.group(1)
Out[22]: 'Whatthehello'
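The session above can be reproduced as a plain script. This is only a sketch that gathers the four calls in one place; the string s is the one used later in this answer, and the variable names are chosen here for illustration:

```python
import re

s = "#Whatthehello #goback"
m = re.search(r"#(\w+)", s)

whole = m.group()         # entire match, including the leading '#'
whole_again = m.group(0)  # group 0 is the same as group()
first = m.group(1)        # only the text captured by the first (...)
all_groups = m.groups()   # tuple holding every capturing group

print(whole, whole_again, first, all_groups)
```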
So the problem in your code originates with the use of m.group() in
re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)
since
In [7]: re.search(r"#(\w+)", s).group()
Out[7]: '#Whatthehello'
whereas if you had used .group(1), you would have gotten
In [24]: re.search(r"#(\w+)", s).group(1)
Out[24]: 'Whatthehello'
and the preceding # makes all the difference:
In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'
In [26]: func_replace('Whatthehello')
Out[26]: 'What the hello'
Thus, changing m.group() to m.group(1), and substituting /usr/share/dict/words for corncob_lowercase.txt,
import re
def func_replace(each_func):
    i = 0
    wordsineach_func = []
    while len(each_func) > 0:
        i = i + 1
        word_found = longest_word(each_func)
        if len(word_found) > 0:
            wordsineach_func.append(word_found)
            each_func = each_func.replace(word_found, "")
    return ' '.join(wordsineach_func)
def longest_word(phrase):
    phrase_length = len(phrase)
    words_found = []
    index = 0
    outerstring = ""
    while index < phrase_length:
        outerstring = outerstring + phrase[index]
        index = index + 1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) == 0:
        words_found.append(phrase)
    return max(words_found, key=len)
words = []
# /usr/share/dict/words stands in for corncob_lowercase.txt, a list of dictionary words
# (opened in text mode so the entries compare equal to str prefixes)
with open('/usr/share/dict/words', 'r') as f:
    for read_word in f:
        words.append(read_word.strip())
s = "#Whatthehello #goback"
hashtags = re.findall(r"#(\w+)", s)
print(func_replace(hashtags[0]))
print(re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), s))
prints
What the hello
What the hello gob a c k
since, alas, 'gob' is longer than 'go'.
One way you could have debugged this is to replace the lambda function with a regular function and then add print statements:
def foo(m):
    result = func_replace(m.group())
    print(m.group(), result)
    return result
In [35]: re.sub(r"#(\w+)", foo, s)
('#Whatthehello', '#Whatthehello') <-- This shows what `m.group()` and `func_replace(m.group())` return
('#goback', '#goback')
Out[35]: '#Whatthehello #goback'
That would focus your attention on
In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'
which you could then compare with
In [26]: func_replace(hashtags[0])
Out[26]: 'What the hello'
In [27]: func_replace('Whatthehello')
Out[27]: 'What the hello'
That would lead you to ask the question: if m.group() returns '#Whatthehello', what method do I need to return 'Whatthehello'? A dive into the docs then solves the problem.
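To see the effect of the leading # without a dictionary file, here is a small sketch. It is not the func_replace from the question: split_words is a simplified, hypothetical stand-in that greedily peels off the longest known prefix, and WORDS is a three-word vocabulary assumed for the demo.

```python
import re

# Tiny stand-in vocabulary (an assumption; the question reads a word list from a file).
WORDS = {"what", "the", "hello"}

def split_words(text):
    """Greedily peel the longest known prefix off text (hypothetical helper)."""
    found = []
    while text:
        # Longest prefix that is a known word; if none is, keep the rest as one piece.
        prefix = next(
            (text[:n] for n in range(len(text), 0, -1) if text[:n].lower() in WORDS),
            text,
        )
        found.append(prefix)
        text = text[len(prefix):]
    return " ".join(found)

s = "#Whatthehello"
with_hash = re.sub(r"#(\w+)", lambda m: split_words(m.group()), s)
without_hash = re.sub(r"#(\w+)", lambda m: split_words(m.group(1)), s)

print(with_hash)     # #Whatthehello  (no prefix starting with '#' is a word)
print(without_hash)  # What the hello
```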
Getting the match number when passing a function in re.sub
Based on @Barmar's answer, I tried this:
import re
def custom_replace(match, matchcount):
    result = 'a' + str(matchcount.i)
    matchcount.i += 1
    return result
def any_request():
    matchcount = lambda: None  # an empty "object", see https://stackoverflow.com/questions/19476816/creating-an-empty-object-in-python/37540574#37540574
    matchcount.i = 0  # benefit: it's a local variable that we pass to custom_replace "by reference"
    print(re.sub(r'o', lambda match: custom_replace(match, matchcount), "oh hello wow"))
    # a0h hella1 wa2w
any_request()
and it seems to work.
Reason: I was a bit reluctant to use a global variable for this, because I'm using this inside a web framework, in a route function (called any_request() here).
If many requests run in parallel (in threads), I don't want a global variable to be "mixed" between different calls (since the operations are probably not atomic?).
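Another way to keep the counter local to each call, sketched here as an alternative rather than as the approach from the answer above, is to create an itertools.count inside the function and let the lambda close over it. The function name number_matches and its prefix parameter are invented for this example.

```python
import itertools
import re

def number_matches(pattern, text, prefix="a"):
    """Replace each match with prefix + its 0-based match number."""
    # A fresh counter per call, so concurrent calls never share state.
    counter = itertools.count()
    return re.sub(pattern, lambda match: prefix + str(next(counter)), text)

result = number_matches(r"o", "oh hello wow")
print(result)  # a0h hella1 wa2w
```

Because the counter lives in the local scope of number_matches, each request gets its own numbering without any global variable.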
Call functions from re.sub
If you want to use a function with re.sub, you need to pass a function, not an expression. As documented for re.sub, your function should take the match object as an argument and return the replacement string. You can access the groups with the usual .group(n) methods and so on. An example:
re.sub("(a+)(b+)", lambda match: "{0} as and {1} bs ".format(
    len(match.group(1)), len(match.group(2))
), "aaabbaabbbaaaabb")
Note that the function should return strings (since they will be put back into the original string).
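As a small illustration of that last point, the sketch below computes a number inside the callback and converts it back with str() before returning. The function name double and the sample sentence are made up for this example.

```python
import re

def double(match):
    """Double the matched integer and hand it back as text."""
    value = int(match.group()) * 2
    # re.sub splices the return value into the result, so it must be a str.
    return str(value)

result = re.sub(r"\d+", double, "3 apples and 10 pears")
print(result)  # 6 apples and 20 pears
```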
Why do I have to pass a callable to re.sub to make an uppercase string?
The second argument can be a string or a callable.
re.sub(my_re, r'\1'.upper(), "bruce's computer"): you're passing a \1 string to the sub function (upper or not, it doesn't matter, since upper() runs on the literal \1 before re.sub ever sees a match)
re.sub(my_re, lambda x: x.group(1).upper(), "bruce's computer"): you're passing a callable, so the upper() works because it is applied to the matched text.
x.group(1).upper() isn't evaluated at once because it's contained in a lambda expression, equivalent to the non-lambda:
def func(x):
    return x.group(1).upper()
that you could also pass to re.sub: re.sub(my_re, func, "bruce's computer"). Note the lack of () after func in that case!
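The difference is easy to check side by side. The question's my_re is not shown here, so the sketch below assumes a simple pattern, r"(\w+)", purely for illustration:

```python
import re

my_re = r"(\w+)"  # assumed pattern; the original my_re is not shown
text = "bruce's computer"

# upper() runs immediately on the two characters \1, which have no lowercase letters.
assert r'\1'.upper() == r'\1'

with_string = re.sub(my_re, r'\1'.upper(), text)
with_callable = re.sub(my_re, lambda x: x.group(1).upper(), text)

print(with_string)    # bruce's computer
print(with_callable)  # BRUCE'S COMPUTER
```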
Calling a function on captured group in re.sub()
You pass a function to re.sub and then you pull the group from there:
def base64_encode(match):
    """
    This function takes a re 'match object' and performs
    the appropriate substitutions
    """
    group = match.group(1)
    ... #Code to encode as base 64
    return result
re.sub(...,base64_encode,s,flags=re.I)
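The answer leaves both the pattern and the encoding step elided. One possible way to fill them in, assuming the text to encode sits between square brackets and is UTF-8 (both assumptions made only for this sketch), uses the standard base64 module:

```python
import base64
import re

def base64_encode(match):
    """Return the first captured group encoded as base64 text."""
    raw = match.group(1).encode("utf-8")            # str -> bytes
    return base64.b64encode(raw).decode("ascii")    # bytes -> str for re.sub

s = "payload: [hello]"
# Hypothetical pattern: anything between square brackets.
encoded = re.sub(r"\[(.*?)\]", base64_encode, s, flags=re.I)
print(encoded)  # payload: aGVsbG8=
```

Note that the whole match, brackets included, is replaced by the callback's return value; only group(1) is encoded.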
Python replace string pattern with output of function
You can pass a function to re.sub. The function will receive a match object as the argument; use .group() to extract the match as a string.
>>> import re
>>> def my_replace(match):
...     match = match.group()
...     return match + str(match.index('e'))
...
>>> string = "The quick @red fox jumps over the @lame brown dog."
>>> re.sub(r'@\w+', my_replace, string)
'The quick @red2 fox jumps over the @lame4 brown dog.'
how to pass a match object from re.sub as an argument
You have to use a lambda and pass the match along with the extra argument:
string = re.sub(r"{(.*?)}", lambda match: matchVar(match, another_argument), string)
Or change matchVar to take only the match, def matchVar(match):, and pass the function directly, like re.sub(r"{(.*?)}", matchVar, string)
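A runnable sketch of the lambda variant follows. The body of matchVar is not shown in the question, so match_var, the values dictionary, and the template string below are hypothetical stand-ins:

```python
import re

def match_var(match, variables):
    """Look up the name between braces; leave unknown names unchanged."""
    name = match.group(1)
    # Fall back to the full original match, braces included.
    return str(variables.get(name, match.group(0)))

template = "Hello {name}, you are {age}. {unknown}"
values = {"name": "Ann", "age": 30}

# The lambda receives the match from re.sub and forwards the extra argument.
result = re.sub(r"\{(.*?)\}", lambda match: match_var(match, values), template)
print(result)  # Hello Ann, you are 30. {unknown}
```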
How can I pass a callback to re.sub, but still inserting match captures?
If you pass a function, you lose the automatic expansion of backreferences. You just get the match object and have to do the work. So you could:
Pick the replacement string before calling re.sub rather than passing a function:
import random
import re
text = "abcdef"
pattern = "(b|e)cd(b|e)"
repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, random.choice(repl), text)
# 'abblaef' or 'abblablaef'
Or write a function that processes the match object and allows more complex processing. You can take advantage of expand to use back references:
text = "abcdef abcdef"
pattern = "(b|e)cd(b|e)"
def repl(m):
    repl = [r"\1bla\2", r"\1blabla\2"]
    return m.expand(random.choice(repl))
re.sub(pattern, repl, text)
# 'abblaef abblablaef' and variations
You can, of course, put that function into a lambda:
repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, lambda m: m.expand(random.choice(repl)), text)
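Since the examples above are random, here is a deterministic sketch of what expand does on its own and inside a callback. The upper() step is added only to show post-processing that a plain replacement string could not do:

```python
import re

pattern = r"(b|e)cd(b|e)"

# expand() fills the backreferences of a template from one match object.
m = re.search(pattern, "abcdef")
expanded = m.expand(r"\1bla\2")
print(expanded)  # bblae

# Inside a callback, the expanded text can be processed further before returning.
shouted = re.sub(pattern, lambda m: m.expand(r"\1bla\2").upper(), "abcdef")
print(shouted)  # aBBLAEf
```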
How to pass a variable to a re.sub callback?
The easiest way, I guess, is to make use of functools.partial, which allows you to create a "partially evaluated" function:
from functools import partial
from re import sub
def evaluate(match, mappings):
    return str(eval(match.group(0)[2:-1], mappings))
mappings = {'A': 1, 'B': 2} # Or whatever ...
newstring = sub(r'\#\{([^#]+)\}', partial(evaluate, mappings=mappings), string)
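If the placeholders only ever name a key, the same functools.partial technique works without eval. The sketch below swaps eval for a plain dictionary lookup; the function name lookup, the narrower pattern, and the input string are assumptions made for this example:

```python
from functools import partial
import re

def lookup(match, mappings):
    """Replace #{NAME} with the value stored under NAME."""
    return str(mappings[match.group(1)])

mappings = {'A': 1, 'B': 2}

# partial binds mappings now; re.sub later supplies only the match object.
replacer = partial(lookup, mappings=mappings)
newstring = re.sub(r'#\{(\w+)\}', replacer, "#{A} + #{B}")
print(newstring)  # 1 + 2
```

A lookup like this raises KeyError for unknown names instead of evaluating arbitrary expressions, which may or may not be what you want.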