How to Do Multiple Substitutions Using Regex

How can I do multiple substitutions using regex?

The answer proposed by @nhahtdh is valid, but I would argue less pythonic than the canonical example, which uses code less opaque than his regex manipulations and takes advantage of python's built-in data structures and anonymous function feature.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )

import re 

def multiple_replace(dict, text):
  # Create a regular expression  from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

  # For each match, look-up corresponding value in dictionary
  return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) 

if __name__ == "__main__": 

  text = "Larry Wall is the creator of Perl"

  dict = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  } 

  print multiple_replace(dict, text)

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. Basically all that function is doing is creating one huge regex containing all of your regexes to translate, then when one is found, passing a lambda function to regex.sub to perform the translation dictionary lookup.

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

As @nhahtdh pointed out, one downside to this approach is that it is not prefix-free: dictionary keys that are prefixes of other dictionary keys will cause the method to break.

Replacing multiple regex patterns together

Iterate your dictionary, then make a substitution using each key, value pair:

replacements = { r'\spunt(?!\s*komma)' : r".",
                 r'punt komma' : r",",
                 r'(?<!punt )komma' : r",",
                 "paragraaf" : "\n\n" }

text = "a punt komma is in this case not a komma and thats it punt"
print(text)

for key, value in replacements.items():
    text = re.sub(key, value, text)

print(text)

This outputs:

a punt komma is in this case not a komma and thats it punt
a , is in this case not a , and thats it.

Note that you probably should be word boundaries \b around each key regex term, to avoid matching an unintentional substring.

Python Regex sub() with multiple patterns

If you're just trying to delete specific substrings, you can combine the patterns with alternation for a single pass removal:

pat1 = r"Please check with the store to confirm holiday hours."
pat2 = r'\t'
combined_pat = r'|'.join((pat1, pat2))
stripped = re.sub(combined_pat, '', s2)

It's more complicated if the "patterns" use actual regex special characters (because then you need to worry about wrapping them to ensure the alternation breaks at the right places), but for simple fixed patterns, it's simple.

If you had real regexes, rather than fixed patterns, you might do something like:

all_pats = [...]
combined_pat = r'|'.join(map(r'(?:{})'.format, all_pats))

so any regex specials remain grouped without possibly "bleeding" across an alternation.

replace more than one pattern python

You need the regex "or" operator which is the pipe |:

re.sub(r"http\S+|@\S+","",sent)

If you have a long list of patterns that you want to remove, a common trick is to use join to create the regular expression:

to_match = ['http\S+',
            '@\S+',
            'something_else_you_might_want_to_remove']

re.sub('|'.join(to_match), '', sent)

Multiple regex substitutions using a dict with regex expressions as keys

If no expression you want to use matches an empty string (which is a valid assumption if you want to replace), you can use groups before |ing the expressions, and then check which group found a match:

(exp1)|(exp2)|(exp3)

Or maybe named groups so you don't have to count the subgroups inside the subexpressions.

The replacement function than can look which group matched, and chose the replacement from a list.

I came up with this implementation:


import re
def dictsub(replacements, string):
    """things has the form {"regex1": "replacement", "regex2": "replacement2", ...}"""
    exprall = re.compile("|".join("("+x+")" for x in replacements))
    gi = 1
    replacements_by_gi = {}
    for (expr, replacement) in replacements.items():
        replacements_by_gi[gi] = replacement
        gi += re.compile(expr).groups + 1

    def choose(match):
        return replacements_by_gi[match.lastindex]

    return re.sub(exprall, choose, string)

text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))

that prints local fowls drink cocoa

How to chain multiple re.sub() commands in Python

Store the search/replace strings in a list and loop over it:

replacements = [
    ('__this__', 'something'),
    ('__This__', 'when'),
    (' ', 'this'),
    ('.', 'is'),
    ('__', 'different')
]

for old, new in replacements:
    stuff = re.sub(old, new, stuff)

stuff = stuff.capitalize()

Note that when you want to replace a literal . character you have to use '\.' instead of '.'.

Efficiently make many multiple substitutions in a string

As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.

Short dictionary (847 substitution pairs)
Medium dictionary (2528 pairs)
Long dictionary (80430 pairs)

For dictionaries 1 and 2 (shorter ones) I repeat each method 50 times in a loop, to get a more consistent timing. With the longer one a single pass for one document takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8. The long one was tested in my laptop with Python 3.6. Only relative performance between methods is relevant, so the minor specifics are not important.

My string is between 28k and 29k characters.

All times given in seconds.

UPDATE: Flashtext

A colleague found Flashtext, a Python library that specializes precisely in this. It allows searching by query and also applying substitutions. It is about two orders of magnitude faster than other alternatives. In the experiment 3 my current best time was 1.8 seconds. Flashtext takes 0.015 seconds.

Regular Expressions

There are many variations, but the best tend to be very similar to this:

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)

Execution times were:

1.63
5.03
7.7

Replace

This method simply applies string.replace in a loop. (Later I talk about problems with this.)

for original, replacement in self.my_dict.items():
    string = string.replace(original, replacement)

This solution proposes a variation using reduce, that applies a Lambda expression iteratively. This is best understood with an example from the official documentation. The expression

reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])

equals ((((1+2)+3)+4)+5)

import functools
new_string = functools.reduce(lambda a, k: a.replace(*k), 
                              my_dict.items(), string)

Python 3.8 allows assignment expressions, as in this method. In its core this also relies on string.replace.

[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]

Execution times were (in parenthesis results for reduce and assignment expressions variants):

1.37 (1.39) (1.50)
4.10 (4.12) (4.07)
1.9 (1.8) (no Python 3.8 in machine)

Recursive Lambda

This proposal involves using a recursive Lambda.

mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)

Execution times were:

0.07
RecursionError
RecursionError

Practical remarks

See the update above: Flashtext is much faster than the other alternatives.

You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. It is not recommended to increase the recursion depth much in Python, so this approach is entirely discarded for longer dictionaries.

Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters at that side of the target substring (to prevent {'a': '1'} to be applied to 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as other options.

Assignment expressions, reduce and simply looping replace offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass. So {'a': 'b', 'b': 'c'} returns 'c' for string 'a'. Dictionaries are now ordered in Python (but you may want to keep using OrderedDict) so you can set the order of substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.

I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I am adding spaces at both sides of punctuation (also in the dictionary for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions with spaces as well. This also works when your targets are multiple words:

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

Using replace and regular expressions I get string = ' This is : an island ' so that my replace loop

for original, replacement in self.my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')

returns ' This is not : a museum ' as intended. Note that 'is' in 'This' and 'island' were left alone. Regular expressions could be used to fix punctuation back, although I don't require this step.

RegEx: replace multiple values in text

Assuming there are only two matches in the line, would you please try the following:

#!/usr/bin/python

import re

s = 'RPM- 1400, (Psig)- 57.66, Ts- 48.11, (Psig)- 299.33'
newval = ['22.77', '355.26']            # array of new values
val_iter = iter(newval)                 # iterator to return each new value

s = re.sub(r'(?<=\(Psig\)- )[\d.]+', lambda x: next(val_iter), s)
print(s)

Output:

RPM- 1400, (Psig)- 22.77, Ts- 48.11, (Psig)- 355.26

The regex (?<=\(Psig\)- )[\d.]+ matches a decimal value preceded by the string (Psig)- . Each time the regex matches, the value is replaced with the output of the iterator.

Combining multiple regex substitutions

You can't do it with consecutive re.sub calls as you have shown. You can use re.finditer to find them all. Each match will provide you with a match object, which has .start and .end attributes indicating their positions. You can collect all those together, and then remove characters at the end.

Here I use a bytearray as a mutable string, used as a mask. It's initialized to zero bytes, and I mark with an 'x' all the bytes that match any regex. Then I use the bit mask to select the characters to keep in the original string, and build a new string with only the unmatched characters:

bits = bytearray(len(text))
for pat in patterns:
    for m in re.finditer(pat, text):
        bits[m.start():m.end()] = 'x' * (m.end()-m.start())
new_string = ''.join(c for c,bit in zip(text, bits) if not bit)

Is there a way to do multiple substitutions using regsub?

With a regsub, no. There's a long-standing feature request for this sort of thing (which requires substitution with the result of evaluating a command on the match information) but it's not been acted on to date.

But you can use string map to do what you want in this case:

set a ".a/b.c..d/e/f//g"
set b [string map {".." "no" "." "yes" "//" "false" "/" "true"} $a]
puts "changed $a to $b"
# changed .a/b.c..d/e/f//g to yesatruebyescnodtrueetrueffalseg

Note that when building the map, if any from-value is a prefix of another, the longer from-value should be put first. (This is because the string map implementation checks which change to make in the order you list them in…)

It's possible to use regsub and subst to do multiple-target replacements in a two-step process, but I don't advise it for anything other than very complex cases! A nice string map is far easier to work with.