Speed Up Millions of Regex Replacements in Python 3

Speed up millions of regex replacements in Python 3

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are plain strings rather than regex patterns, Eric's answer is faster.
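As a minimal sketch of the idea (the word list and replacement targets here are made up), the combined pattern can be built with alternation and applied in a single pass using a replacement callback:

```python
import re

words = ["word1", "word2", "word3"]  # hypothetical word list
replacements = {"word1": "w1", "word2": "w2", "word3": "w3"}  # hypothetical targets

# One compiled pattern with alternation; re.escape guards words that contain
# regex metacharacters
pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

# A callback picks the right replacement, so all words are handled in one pass
result = pattern.sub(lambda m: replacements[m.group(1)], "word1 and word3")
print(result)  # -> w1 and w3
```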

Speed up millions of regex replacements in a DataFrame

Use trrex: it builds a pattern equivalent to the one found in this resource (it is actually inspired by that answer):

from random import choice
from string import ascii_lowercase, digits

import pandas as pd
import trrex as tx

# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000

df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")

df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)

Output

                                  string_column      string_column_location_removed
0                        Burger King Oakland CA                         Burger King
1                       Walmart Walnut Creek CA                             Walmart
2                       Random Other Thing Here             Random Other Thing Here
3               Another random other thing here     Another random other thing here
4            Really Appreciate the help on this  Really Appreciate the help on this
...                                         ...                                 ...
1499995                 Walmart Walnut Creek CA                             Walmart
1499996                 Random Other Thing Here             Random Other Thing Here
1499997         Another random other thing here     Another random other thing here
1499998      Really Appreciate the help on this  Really Appreciate the help on this
1499999                      Thank you so Much!                  Thank you so Much!

[1500000 rows x 2 columns]

Timing (of one run of str.replace)

%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The timing does not include the time needed for building the pattern.

Disclaimer: I'm the author of trrex.

Increase regex replace speed

It's taking forever because you're updating the entire DataFrame 20 million times. There's no need for the loop: the assignment operates on the whole DataFrame, not one row at a time.

Also, you can do all the replacements at once by combining the regular expressions using alternatives with pipes.

df['text'] = df['text'].str.replace(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d', '', regex=True)

There's no need for {1} in the regular expressions. A pattern matches exactly one time unless you quantify it otherwise.
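For illustration (the sample strings below are made up), the same combined pattern behaves identically with plain re, which makes it easy to sanity-check each alternative:

```python
import re

# Combined pattern from the answer: three alternatives joined with pipes
pattern = re.compile(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d')

samples = ["id 12345678901 here", "code a123 here", "ref 1-abcde-2 here"]
cleaned = [pattern.sub('', s) for s in samples]
print(cleaned)  # -> ['id  here', 'code  here', 'ref  here']
```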

Speed up a series of regex replacement in python

You should probably do three things:

  1. Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
  2. If possible (depending on file size), read the file into memory completely.
  3. Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).

This gives you something like:

import re

regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()
    result = regex.sub("Replacement", content)
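When the substitutions differ per pattern, the regexes can still be merged into one by giving each alternative a named group and dispatching in a replacement callback. A minimal sketch (the rule names and replacements are made up):

```python
import re

# Each named group carries its own replacement text
rules = {"num": (r"\d+", "N"), "word": (r"[a-z]+", "W")}  # hypothetical rules
combined = re.compile("|".join(f"(?P<{name}>{rx})" for name, (rx, _) in rules.items()))

def repl(m):
    # m.lastgroup names the alternative that matched
    return rules[m.lastgroup][1]

print(combined.sub(repl, "abc 123"))  # -> W N
```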

Python Regex - Fast replace of multiple keywords with punctuation and starting with

You can tweak this solution to suit your needs:

  • Create another dictionary from a that will contain the same keys and the regex created from the values
  • If a * char is found, replace it with \w* if you mean any zero or more word chars, or with \S* if you mean any zero or more non-whitespace chars (adjust the def quote(self, char) method accordingly); otherwise, quote the char
  • Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
  • The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) (demo) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w) (demo)
  • Replace in a loop.

See the Python demo:

import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k, v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

for k, r in a2.items():
    text = r.sub(k, text)

print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX

Combine in an efficient way regex python

I casually discovered that applying each regex individually in a for loop is very slow with the re module, while it is surprisingly faster with the third-party regex module.
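To reproduce that comparison, one can time the same for loop under either module; a minimal sketch (the patterns and text are made up), falling back to the stdlib re when regex is not installed:

```python
import time

try:
    import regex as re_mod  # third-party 'regex' module, if installed
except ImportError:
    import re as re_mod     # stdlib fallback; same API for this usage

# Hypothetical list of small patterns applied one after another
patterns = [re_mod.compile(p) for p in (r"\bfoo\b", r"\bbar\b", r"\bbaz\b")]
text = "foo and bar but not baz " * 10000

start = time.perf_counter()
for pat in patterns:
    text = pat.sub("X", text)
print(f"loop over {len(patterns)} patterns took {time.perf_counter() - start:.4f}s")
```

Swapping the import is enough to compare the two engines on the same workload.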


