Speed Up Millions of Regex Replacements in Python 3

Speed up millions of regex replacements in Python 3

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are plain strings rather than regex patterns, Eric's answer is faster.
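As a minimal sketch of the idea (the word list and replacement targets here are made up), the combined pattern can be built with alternation and applied in a single pass using a replacement callback:

```python
import re

words = ["word1", "word2", "word3"]  # hypothetical word list
replacements = {"word1": "w1", "word2": "w2", "word3": "w3"}  # hypothetical targets

# One compiled pattern with alternation; re.escape guards words that contain
# regex metacharacters
pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

# A callback picks the right replacement, so all words are handled in one pass
result = pattern.sub(lambda m: replacements[m.group(1)], "word1 and word3")
print(result)  # -> w1 and w3
```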

Speed up millions of regex replacements in a DataFrame

Use trrex: it builds a pattern equivalent to the one found in this resource (it is actually inspired by that answer):

from random import choice
from string import ascii_lowercase, digits

import pandas as pd
import trrex as tx

# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000

df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")

df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)

Output

                                  string_column      string_column_location_removed
0                        Burger King Oakland CA                         Burger King
1                       Walmart Walnut Creek CA                             Walmart
2                       Random Other Thing Here             Random Other Thing Here
3               Another random other thing here     Another random other thing here
4            Really Appreciate the help on this  Really Appreciate the help on this
...                                         ...                                 ...
1499995                 Walmart Walnut Creek CA                             Walmart
1499996                 Random Other Thing Here             Random Other Thing Here
1499997         Another random other thing here     Another random other thing here
1499998      Really Appreciate the help on this  Really Appreciate the help on this
1499999                      Thank you so Much!                  Thank you so Much!

[1500000 rows x 2 columns]

Timing (of one run of str.replace)

%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The timing does not include the time needed for building the pattern.

Disclaimer: I'm the author of trrex.

Increase regex replace speed

It's taking forever because you're updating the entire DataFrame 20 million times. There's no need for the loop: the assignment operates on the whole DataFrame, not one row at a time.

Also, you can do all the replacements at once by combining the regular expressions using alternatives with pipes.

df['text'] = df['text'].str.replace(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d', '', regex=True)

There's no need for {1} in the regular expressions. A pattern matches exactly one time unless you quantify it otherwise.
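For illustration (the sample strings below are made up), the same combined pattern behaves identically with plain re, which makes it easy to sanity-check each alternative:

```python
import re

# Combined pattern from the answer: three alternatives joined with pipes
pattern = re.compile(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d')

samples = ["id 12345678901 here", "code a123 here", "ref 1-abcde-2 here"]
cleaned = [pattern.sub('', s) for s in samples]
print(cleaned)  # -> ['id  here', 'code  here', 'ref  here']
```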

Speed up a series of regex replacement in python

You should probably do three things:

  1. Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
  2. If possible (depending on file size), read the file into memory completely.
  3. Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).

This gives you something like:

import re

regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()
    result = regex.sub("Replacement", content)
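When the substitutions differ per pattern, the regexes can still be merged into one by giving each alternative a named group and dispatching in a replacement callback. A minimal sketch (the rule names and replacements are made up):

```python
import re

# Each named group carries its own replacement text
rules = {"num": (r"\d+", "N"), "word": (r"[a-z]+", "W")}  # hypothetical rules
combined = re.compile("|".join(f"(?P<{name}>{rx})" for name, (rx, _) in rules.items()))

def repl(m):
    # m.lastgroup names the alternative that matched
    return rules[m.lastgroup][1]

print(combined.sub(repl, "abc 123"))  # -> W N
```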

Python Regex - Fast replace of multiple keywords with punctuation and starting with

You can tweak this solution to suit your needs:

  • Create another dictionary from a that will contain the same keys and the regex created from the values
  • If a * char is found, replace it with \w* if you mean any zero or more word chars, or with \S* if you mean any zero or more non-whitespace chars (adjust the def quote(self, char) method accordingly); otherwise, quote the char
  • Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
  • The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) (demo) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w) (demo)
  • Replace in a loop.

See the Python demo:

import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k, v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

for k, r in a2.items():
    text = r.sub(k, text)

print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX

Combine in an efficient way regex python

I casually discovered that applying each regex individually in a for loop is very slow with the re module, while it is surprisingly faster with the third-party regex module.
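To reproduce that comparison, one can time the same for loop under either module; a minimal sketch (the patterns and text are made up), falling back to the stdlib re when regex is not installed:

```python
import time

try:
    import regex as re_mod  # third-party 'regex' module, if installed
except ImportError:
    import re as re_mod     # stdlib fallback; same API for this usage

# Hypothetical list of small patterns applied one after another
patterns = [re_mod.compile(p) for p in (r"\bfoo\b", r"\bbar\b", r"\bbaz\b")]
text = "foo and bar but not baz " * 10000

start = time.perf_counter()
for pat in patterns:
    text = pat.sub("X", text)
print(f"loop over {len(patterns)} patterns took {time.perf_counter() - start:.4f}s")
```

Swapping the import is enough to compare the two engines on the same workload.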


