Speed up millions of regex replacements in Python 3
One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".
Because re relies on C code to do the actual matching, the savings can be dramatic.
As @pvg pointed out in the comments, it also benefits from single pass matching.
If your words are not regex, Eric's answer is faster.
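A minimal sketch of the single-pattern idea, assuming the words are plain literals (the word list and sentences below are made up for illustration):

```python
import re

# Hypothetical word list; re.escape guards against regex metacharacters.
banned_words = ["cat", "dog", "bird"]

# One alternation wrapped in word boundaries, compiled once.
alternatives = "|".join(re.escape(w) for w in banned_words)
pattern = re.compile(rf"\b(?:{alternatives})\b")

sentences = ["The cat chased the dog.", "A catalog of birds."]

# Each sentence is scanned once, whatever the number of words.
cleaned = [pattern.sub("", s) for s in sentences]
print(cleaned)
```

Because of the \b anchors, "catalog" and "birds" are left untouched; only whole words are removed.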
Speed up millions of regex replacements in Dataframe
Use trrex. It builds a pattern equivalent to the one found in this resource (it was in fact inspired by that answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)
Output
         string_column                       string_column_location_removed
0        Burger King Oakland CA              Burger King
1        Walmart Walnut Creek CA             Walmart
2        Random Other Thing Here             Random Other Thing Here
3        Another random other thing here     Another random other thing here
4        Really Appreciate the help on this  Really Appreciate the help on this
...      ...                                 ...
1499995  Walmart Walnut Creek CA             Walmart
1499996  Random Other Thing Here             Random Other Thing Here
1499997  Another random other thing here     Another random other thing here
1499998  Really Appreciate the help on this  Really Appreciate the help on this
1499999  Thank you so Much!                  Thank you so Much!
[1500000 rows x 2 columns]
Timing (of one run of str.replace):
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed for building the pattern.
Disclaimer: I'm the author of trrex.
Increase regex replace speed
It's taking forever because you're updating the entire dataframe 20 million times. There's no need for the loop: the assignment operates on the whole df, not one row at a time.
Also, you can do all the replacements at once by combining the regular expressions using alternatives with pipes.
df['text'] = df['text'].str.replace(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d', '', regex=True)
There's no need for {1} in the regular expressions. A pattern matches exactly one time unless you quantify it otherwise.
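As a rough check of both points, here is a sketch using plain re on a single string rather than pandas, to keep it self-contained. The three separate patterns with {1} and the sample string are assumptions about what the original loop might have looked like:

```python
import re

# Assumed original patterns, each applied in its own pass.
separate = [r"\d{11}", r"[a-z]{1}\d{3}", r"\d{1}-[a-z]{5}-\d{1}"]

# Same patterns joined with pipes, with the redundant {1} dropped.
combined = re.compile(r"\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d")

sample = "id 12345678901 code a123 ref 4-hello-5 end"

# Three passes over the text, one per pattern.
one_by_one = sample
for p in separate:
    one_by_one = re.sub(p, "", one_by_one)

# One pass over the text with the combined pattern.
all_at_once = combined.sub("", sample)
print(all_at_once == one_by_one)
```

On this sample both approaches give the same result. In general, sequential passes and a single alternation can differ when one replacement creates or destroys a match for another pattern, so it is worth checking on real data.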
Speed up a series of regex replacement in python
You should probably do three things:
- Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
- If possible (depending on file size), read the file into memory completely.
- Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).
This gives you something like:
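import re
regex = re.compile(r"My big honking regex")
# files is assumed to be an iterable of already-open file objects
for datafile in files:
    content = datafile.read()  # read the whole file into memory
    result = regex.sub("Replacement", content)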
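When the substitutions differ per pattern, one way to still use a single regex is to pass a function to sub. This is only a sketch under the assumption that the search terms are literal strings; the replacements mapping and the substitute helper are hypothetical names:

```python
import re

# Hypothetical mapping from literal search text to its replacement.
replacements = {"colour": "color", "centre": "center", "grey": "gray"}

# One alternation built from the escaped keys.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, replacements)) + r")\b")

def substitute(match):
    # Look up the replacement for whichever alternative matched.
    return replacements[match.group(0)]

content = "The grey centre has no colour."
result = pattern.sub(substitute, content)
print(result)
```

The lookup by match.group(0) works here because every alternative is a literal; patterns with wildcards would need named groups or another way to tell which alternative matched.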
Python Regex - Fast replace of multiple keywords with punctuation and starting with
You can tweak this solution to suit your needs:
- Create another dictionary from a that will contain the same keys and the regex created from the values
- If a * char is found, replace it with \w* if you mean any zero or more word chars, or use \S* if you mean any zero or more non-whitespace chars (please adjust the def quote(self, char) method), else, quote the char
- Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
- The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) (demo) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w) (demo)
- Replace in a loop.
See the Python demo:
import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k,v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

for k,r in a2.items():
    text = r.sub(k, text)

print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
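The point about unambiguous word boundaries can be checked in isolation. A small sketch (the sample string is made up) showing why \b does not work for an entry like ":)" while the lookarounds do:

```python
import re

text = "behave like my dog :)"

# \b needs a word character on exactly one side of the position,
# so it never matches between a space and ":" or after ")" at the end.
with_b = re.sub(r"\b:\)\b", "XXX", text)

# The lookarounds only require that no word character is adjacent.
with_lookarounds = re.sub(r"(?<!\w):\)(?!\w)", "XXX", text)

print(with_b)
print(with_lookarounds)
```

The first call leaves the text unchanged, while the second replaces the smiley.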
Combine in an efficient way regex python
I discovered by chance that applying each regex individually in a for-loop is very slow with the re module, while it is surprisingly faster with the regex module.