How can I do multiple substitutions using regex?
The answer proposed by @nhahtdh is valid, but I would argue that it is less Pythonic than the canonical example, whose code is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.
A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )
import re
def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    dict = {
        "Larry Wall" : "Guido van Rossum",
        "creator" : "Benevolent Dictator for Life",
        "Perl" : "Python",
    }
    print(multiple_replace(dict, text))
So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. All the function does is build one large regex that is an alternation of all the (escaped) dictionary keys; whenever one of them matches, the lambda passed to regex.sub looks up the matched text in the dictionary and returns its replacement.
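As a quick check of the idea, here is a minimal Python 3 sketch of the same function applied to that dictionary. It assumes the keys are literal strings, and the sample text is made up for illustration.

```python
import re

def multiple_replace(replacements, text):
    # Build one alternation of the escaped keys, so each key is matched literally.
    regex = re.compile("|".join(map(re.escape, replacements)))
    # Look up each matched substring in the dictionary.
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

trans = {"a": "aa", "b": "bb"}
result = multiple_replace(trans, "abc cab")  # sample text is an assumption
print(result)  # aabbc caabb
```

Because the whole text is scanned once, the "aa" produced for an "a" is never looked at again.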
You could use this function while reading from your file, for example:
with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.
As @nhahtdh pointed out, one downside to this approach is that it requires the keys to be prefix-free: if one dictionary key is a prefix of another, the alternation may match the shorter key first, and the longer key is then never replaced.
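One possible workaround, which is not part of the Cookbook recipe, is to sort the keys longest-first before joining them. This is only a sketch; the function name and the sample dictionary are hypothetical.

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Try longer keys first so a key is never shadowed by one of its own prefixes.
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

pairs = {"a": "1", "ab": "2"}  # "a" is a prefix of "ab"
out = multiple_replace_longest_first(pairs, "a ab")
print(out)  # 1 2
```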
Replacing multiple regex patterns together
Iterate your dictionary, then make a substitution using each key, value pair:
import re

replacements = {r'\spunt(?!\s*komma)': r".",
                r'punt komma': r",",
                r'(?<!punt )komma': r",",
                "paragraaf": "\n\n"}
text = "a punt komma is in this case not a komma and thats it punt"
print(text)
for key, value in replacements.items():
    text = re.sub(key, value, text)
print(text)
This outputs:
a punt komma is in this case not a komma and thats it punt
a , is in this case not a , and thats it.
Note that you should probably put word boundaries (\b) around each key regex term, to avoid matching an unintended substring.
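A rough illustration of the word-boundary idea follows. The replacement table and sample text are hypothetical, and the sketch assumes every key begins and ends with a word character (otherwise \b would not sit where you expect).

```python
import re

# Hypothetical replacement table; each key is wrapped in \b...\b when applied.
replacements = {"komma": ",", "punt": "."}

text = "kommaa komma puntje punt"
for key, value in replacements.items():
    text = re.sub(r"\b" + key + r"\b", value, text)
print(text)  # kommaa , puntje .
```

Without the boundaries, "kommaa" and "puntje" would have been partially rewritten as well.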
Python Regex sub() with multiple patterns
If you're just trying to delete specific substrings, you can combine the patterns with alternation for a single pass removal:
pat1 = r"Please check with the store to confirm holiday hours."
pat2 = r'\t'
combined_pat = r'|'.join((pat1, pat2))
stripped = re.sub(combined_pat, '', s2)
It's more complicated if the "patterns" use actual regex special characters (because then you need to worry about wrapping them to ensure the alternation breaks at the right places), but for simple fixed patterns, it's simple.
If you had real regexes, rather than fixed patterns, you might do something like:
all_pats = [...]
combined_pat = r'|'.join(map(r'(?:{})'.format, all_pats))
so any regex specials remain grouped without possibly "bleeding" across an alternation.
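For fixed strings that happen to contain regex special characters, one option is to run each of them through re.escape before joining. The strings below are invented for the example.

```python
import re

# Hypothetical fixed strings to delete; re.escape makes "(", ")" and "." literal.
literals = ["(closed).", "\t"]
combined_pat = "|".join(map(re.escape, literals))

s2 = "Open daily\t(closed). See you soon"
stripped = re.sub(combined_pat, "", s2)
print(repr(stripped))  # 'Open daily See you soon'
```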
replace more than one pattern python
You need the regex "or" operator, which is the pipe |:
re.sub(r"http\S+|@\S+","",sent)
If you have a long list of patterns that you want to remove, a common trick is to use join to create the regular expression:
to_match = [r'http\S+',
            r'@\S+',
            'something_else_you_might_want_to_remove']
re.sub('|'.join(to_match), '', sent)
Multiple regex substitutions using a dict with regex expressions as keys
If no expression you want to use matches the empty string (a reasonable assumption if you want to replace something), you can wrap each expression in a group before joining them with |, and then check which group found a match:
(exp1)|(exp2)|(exp3)
Or maybe use named groups, so you don't have to count the subgroups inside the subexpressions.
The replacement function can then look at which group matched and choose the replacement from a list.
I came up with this implementation:
import re
def dictsub(replacements, string):
    """replacements has the form {"regex1": "replacement", "regex2": "replacement2", ...}"""
    exprall = re.compile("|".join("("+x+")" for x in replacements))
    gi = 1
    replacements_by_gi = {}
    for (expr, replacement) in replacements.items():
        replacements_by_gi[gi] = replacement
        gi += re.compile(expr).groups + 1
    def choose(match):
        return replacements_by_gi[match.lastindex]
    return re.sub(exprall, choose, string)
text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))
that prints local fowls drink cocoa
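The named-group variant mentioned above could look roughly like this. It is only a sketch: the group names g0, g1, ... are made up, and it assumes the expressions do not define groups with those names themselves or rely on numbered backreferences.

```python
import re

def dictsub_named(replacements, string):
    # Map a generated group name (g0, g1, ...) to each replacement.
    by_name = {f"g{i}": repl for i, repl in enumerate(replacements.values())}
    # Wrap each expression in its named group and join them with |.
    pattern = "|".join(f"(?P<g{i}>{expr})" for i, expr in enumerate(replacements))
    # match.lastgroup is the name of the outer group that matched.
    return re.sub(pattern, lambda m: by_name[m.lastgroup], string)

result = dictsub_named({"(?<=o)a": "w", "l(?=a)": "co"}, "local foals drink cola")
print(result)  # local fowls drink cocoa
```

This avoids counting subgroups by hand, since the lookup goes by name rather than by index.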
How to chain multiple re.sub() commands in Python
Store the search/replace strings in a list and loop over it:
replacements = [
    ('__this__', 'something'),
    ('__This__', 'when'),
    (' ', 'this'),
    ('.', 'is'),
    ('__', 'different')
]
for old, new in replacements:
    stuff = re.sub(old, new, stuff)
stuff = stuff.capitalize()
Note that when you want to replace a literal . character you have to use '\.' instead of '.'.
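To make the difference concrete, here is a tiny sketch with an invented input string. re.escape is shown as an alternative to escaping by hand.

```python
import re

stuff = "a.b"  # sample input, assumed for illustration

# Unescaped, "." is a wildcard and matches every character.
wild = re.sub(".", "!", stuff)
# Escaped (by hand or via re.escape), only the literal dot is replaced.
by_hand = re.sub(r"\.", "!", stuff)
escaped = re.sub(re.escape("."), "!", stuff)

print(wild, by_hand, escaped)  # !!! a!b a!b
```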
Efficiently make many multiple substitutions in a string
As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.
- Short dictionary (847 substitution pairs)
- Medium dictionary (2528 pairs)
- Long dictionary (80430 pairs)
For dictionaries 1 and 2 (the shorter ones) I repeat each method 50 times in a loop, to get a more consistent timing. With the longer one, a single pass for one document takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8. The long one was tested on my laptop with Python 3.6. Only relative performance between methods is relevant, so the minor specifics are not important.
My string is between 28k and 29k characters.
All times given in seconds.
UPDATE: Flashtext
A colleague found Flashtext, a Python library that specializes in precisely this. It allows searching by query and also applying substitutions. It is about two orders of magnitude faster than the other alternatives. In experiment 3 my current best time was 1.8 seconds. Flashtext takes 0.015 seconds.
Regular Expressions
There are many variations, but the best tend to be very similar to this:
import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)
Execution times were:
- 1.63
- 5.03
- 7.7
Replace
This method simply applies string.replace in a loop. (Later I talk about problems with this.)
for original, replacement in self.my_dict.items():
    string = string.replace(original, replacement)
This solution proposes a variation using reduce, which applies a lambda expression iteratively. This is best understood with an example from the official documentation. The expression reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) equals ((((1+2)+3)+4)+5).
import functools
new_string = functools.reduce(lambda a, k: a.replace(*k),
my_dict.items(), string)
Python 3.8 allows assignment expressions, as in this method. At its core this also relies on string.replace.
[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]
Execution times were (in parentheses, results for the reduce and assignment expression variants):
- 1.37 (1.39) (1.50)
- 4.10 (4.12) (4.07)
- 1.9 (1.8) (no Python 3.8 on that machine)
Recursive Lambda
This proposal involves using a recursive Lambda.
mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)
Execution times were:
- 0.07
- RecursionError
- RecursionError
Practical remarks
See the update above: Flashtext is much faster than the other alternatives.
You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. It is not recommended to increase the recursion depth much in Python, so this approach is entirely discarded for longer dictionaries.
Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters on that side of the target substring (to prevent {'a': '1'} from being applied to 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as the other options.
Assignment expressions, reduce and simply looping replace offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass. So {'a': 'b', 'b': 'c'} returns 'c' for string 'a'. Dictionaries are now ordered in Python (but you may want to keep using OrderedDict), so you can set the order of substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.
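The sequential versus single-pass difference can be reproduced with a few lines. This is just a sketch using the tiny dictionary from the paragraph above; the variable names are mine.

```python
import re

my_dict = {"a": "b", "b": "c"}
string = "a"

# Sequential replace: the output of the first substitution feeds the second.
looped = string
for original, replacement in my_dict.items():
    looped = looped.replace(original, replacement)

# Single-pass regex: each position in the input is substituted at most once.
pattern = re.compile("|".join(map(re.escape, my_dict)))
single_pass = pattern.sub(lambda m: my_dict[m.group(0)], string)

print(looped, single_pass)  # c b
```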
I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I am adding spaces at both sides of punctuation (also in the dictionary for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions with spaces as well. This also works when your targets are multiple words:
string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}
Using replace and regular expressions I get string = ' This is : an island ' so that my replace loop
for original, replacement in self.my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')
returns ' This is not : a museum ' as intended. Note that 'is' in 'This' and 'island' were left alone. Regular expressions could be used to fix punctuation back, although I don't require this step.
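The preprocessing step is not shown above, so here is one way it might be done. The two regexes used for padding are my assumption, not necessarily what I run in practice; any approach that yields the padded string works.

```python
import re

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

# Pad punctuation with spaces, collapse whitespace runs, and pad both ends.
padded = re.sub(r'([^\w\s])', r' \1 ', string)
padded = ' ' + re.sub(r'\s+', ' ', padded).strip() + ' '
print(repr(padded))  # ' This is : an island '

# Replace only targets that are surrounded by spaces.
result = padded
for original, replacement in my_dict.items():
    result = result.replace(f' {original} ', f' {replacement} ')
print(repr(result))  # ' This is not : a museum '
```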
RegEx: replace multiple values in text
Assuming there are only two matches in the line, would you please try the following:
#!/usr/bin/python
import re
s = 'RPM- 1400, (Psig)- 57.66, Ts- 48.11, (Psig)- 299.33'
newval = ['22.77', '355.26'] # array of new values
val_iter = iter(newval) # iterator to return each new value
s = re.sub(r'(?<=\(Psig\)- )[\d.]+', lambda x: next(val_iter), s)
print(s)
Output:
RPM- 1400, (Psig)- 22.77, Ts- 48.11, (Psig)- 355.26
The regex (?<=\(Psig\)- )[\d.]+ matches a decimal value preceded by the string "(Psig)- ". Each time the regex matches, the value is replaced with the next value from the iterator.
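If the line might contain more matches than there are new values, a possible variation is to give next() a default so the surplus matches keep their original text. The third (Psig) entry below is invented to show this.

```python
import re

s = 'RPM- 1400, (Psig)- 57.66, Ts- 48.11, (Psig)- 299.33, (Psig)- 12.00'
newval = ['22.77', '355.26']  # only two new values for three matches
val_iter = iter(newval)

# next(iterator, default) returns the default once the iterator is exhausted,
# so surplus matches keep their original text instead of raising StopIteration.
out = re.sub(r'(?<=\(Psig\)- )[\d.]+', lambda m: next(val_iter, m.group(0)), s)
print(out)
# RPM- 1400, (Psig)- 22.77, Ts- 48.11, (Psig)- 355.26, (Psig)- 12.00
```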
Combining multiple regex substitutions
You can't do it with consecutive re.sub calls as you have shown. You can use re.finditer to find them all. Each match will provide you with a match object, which has .start() and .end() methods indicating its position. You can collect all those together, and then remove the characters at the end.
Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the bytes that match any regex. Then I use the mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:
bits = bytearray(len(text))
for pat in patterns:
    for m in re.finditer(pat, text):
        # A bytearray slice must be assigned bytes, not str, in Python 3
        bits[m.start():m.end()] = b'x' * (m.end() - m.start())
new_string = ''.join(c for c, bit in zip(text, bits) if not bit)
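Here is a self-contained sketch of the mask idea, wrapped in a hypothetical helper function and run on made-up overlapping patterns, next to the chained re.sub version for comparison.

```python
import re

def remove_all_matches(patterns, text):
    # Mark every character covered by any match, then keep the unmarked ones.
    mask = bytearray(len(text))
    for pat in patterns:
        for m in re.finditer(pat, text):
            mask[m.start():m.end()] = b'x' * (m.end() - m.start())
    return ''.join(c for c, bit in zip(text, mask) if not bit)

patterns = [r'bc', r'cd']  # hypothetical overlapping patterns
masked = remove_all_matches(patterns, 'abcdef')
print(masked)  # aef

# Consecutive re.sub calls miss 'cd' because 'bc' was already removed.
chained = 'abcdef'
for pat in patterns:
    chained = re.sub(pat, '', chained)
print(chained)  # adef
```

All patterns are matched against the original text, so the order of the patterns does not affect the result.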
Is there a way to do multiple substitutions using regsub?
With a regsub, no. There's a long-standing feature request for this sort of thing (which requires substitution with the result of evaluating a command on the match information) but it's not been acted on to date.
But you can use string map to do what you want in this case:
set a ".a/b.c..d/e/f//g"
set b [string map {".." "no" "." "yes" "//" "false" "/" "true"} $a]
puts "changed $a to $b"
# changed .a/b.c..d/e/f//g to yesatruebyescnodtrueetrueffalseg
Note that when building the map, if any from-value is a prefix of another, the longer from-value should be put first. (This is because the string map implementation checks which change to make in the order you list them in…)
It's possible to use regsub and subst to do multiple-target replacements in a two-step process, but I don't advise it for anything other than very complex cases! A nice string map is far easier to work with.