What's a faster operation, re.match/search or str.find?
Which is faster is best answered by using timeit:
from timeit import timeit
import re

def find(string, text):
    if string.find(text) > -1:
        pass

def re_find(string, text):
    if re.match(text, string):
        pass

def best_find(string, text):
    if text in string:
        pass

print(timeit("find(string, text)", "from __main__ import find; string='lookforme'; text='look'"))
print(timeit("re_find(string, text)", "from __main__ import re_find; string='lookforme'; text='look'"))
print(timeit("best_find(string, text)", "from __main__ import best_find; string='lookforme'; text='look'"))
The output is:
0.441393852234
2.12302494049
0.251421928406
So you should use the in operator not only because it is easier to read, but also because it is faster.
What is the difference between re.search and re.match?
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So if you need to match at the beginning of the string, or to match the entire string, use match. It is faster. Otherwise use search.
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The "match" operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
string_with_newlines = """something
someotherthing"""

import re

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines,
               re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines,
                re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
print(m.search(string_with_newlines))  # also matches
Python: speed for in vs regular expression
Option (1) definitely is faster. For the future, do something like this to test it:
>>> import time, re
>>> if True:
... s = time.time()
... "aaaa" in "bbbaaaaaabbb"
... print time.time()-s
...
True
1.78813934326e-05
>>> if True:
... s = time.time()
... pattern = re.compile("aaaa")
... pattern.search("bbbaaaaaabbb")
... print time.time()-s
...
<_sre.SRE_Match object at 0xb74a91e0>
0.0143280029297
gnibbler's way of doing this is better, I never really played around with interpreter options so I didn't know about that one.
Differences between re.match, re.search, re.fullmatch
Giving credit to @Ruzihm's answer, since parts of my answer derive from his.
Quick overview
A quick rundown of the differences:
re.match is anchored at the start: ^pattern
- Ensures the string begins with the pattern
re.fullmatch is anchored at the start and the end: ^pattern$
- Ensures the full string matches the pattern (can be especially useful with alternations as described here)
re.search is not anchored: pattern
- Ensures the string contains the pattern
A more in-depth comparison of re.match vs re.search can be found here.
With examples:
aa # string
a|aa # regex
re.match: a
re.search: a
re.fullmatch: aa
ab # string
^a # regex
re.match: a
re.search: a
re.fullmatch: # None (no match)
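The two examples above can be reproduced with a short sketch (Python 3):

```python
import re

# Alternation: fullmatch keeps backtracking until the whole string is covered.
print(re.match(r'a|aa', 'aa').group())      # 'a'  (first alternative wins)
print(re.search(r'a|aa', 'aa').group())     # 'a'
print(re.fullmatch(r'a|aa', 'aa').group())  # 'aa' (whole string must match)

# Anchored pattern against 'ab':
print(re.match(r'^a', 'ab').group())        # 'a'
print(re.search(r'^a', 'ab').group())       # 'a'
print(re.fullmatch(r'^a', 'ab'))            # None: 'b' is left over
```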
So what about \A and \Z anchors?
The documentation states the following:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
And in the Pattern.fullmatch section it says:
If the whole string matches this regular expression, return a corresponding match object.
And, as initially found and quoted by Ruzihm in his answer:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with ^ will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<re.Match object; span=(4, 5), match='X'>
\A^A
B
X$\Z
# re.match('X', s) no match
# re.search('^X', s) no match
# ------------------------------------------
# and the string above when re.MULTILINE is enabled effectively becomes
\A^A$
^B$
^X$\Z
# re.match('X', s, re.MULTILINE) no match
# re.search('^X', s, re.MULTILINE) match X
With regards to \A and \Z, neither behaves differently under re.MULTILINE, since \A and \Z always anchor to the start and end of the whole string, regardless of mode. So using \A and \Z with any of the three methods yields the same results.
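A minimal sketch of that behaviour, using the same 'A\nB\nX' string as above:

```python
import re

s = 'A\nB\nX'

# ^ under MULTILINE matches at each line start, but \A only ever matches
# at the start of the whole string, in any mode.
print(re.search(r'^X', s, re.MULTILINE))   # matches: X starts the third line
print(re.search(r'\AX', s, re.MULTILINE))  # None: X is not at the string start
print(re.search(r'X\Z', s, re.MULTILINE))  # matches: X ends the whole string
print(re.search(r'X$', s))                 # matches even without MULTILINE
```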
Answer (line anchors vs string anchors)
What this tells me is that re.match and re.fullmatch don't anchor with the line anchors ^ and $; instead, they behave like the string anchors \A and \Z.
What's faster: Regex or string operations?
It depends
Although string manipulation will usually be somewhat faster, the actual performance heavily depends on a number of factors, including:
- How many times you parse the regex
- How cleverly you write your string code
- Whether the regex is precompiled
As the regex gets more complicated, it will take much more effort and complexity to write equivalent string manipulation code that performs well.
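A hypothetical micro-benchmark along these lines (the haystack and pattern here are made up for illustration, and the exact numbers will vary by machine):

```python
import re
import timeit

# Made-up haystack and needle, for illustration only.
text = 'lorem ipsum dolor sit amet ' * 100
compiled = re.compile('amet')

# Three ways to look for a substring; precompiling removes the
# per-call pattern lookup from the regex path.
print(timeit.timeit(lambda: 'amet' in text, number=100_000))
print(timeit.timeit(lambda: compiled.search(text), number=100_000))
print(timeit.timeit(lambda: re.search('amet', text), number=100_000))
```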
Why is a compiled python regex slower?
Short answer
If you call compiled_pattern.search(text) directly, it won't call _compile at all, so it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).
This performance difference is due to KeyErrors in the cache and slow hash calculations for compiled patterns.
re functions and SRE_Pattern methods
Any time an re function is called with a pattern as its first argument (e.g. re.search(pattern, string) or re.findall(pattern, string)), Python first tries to compile the pattern with _compile and then calls the corresponding method on the compiled pattern. For example:
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).
_compile
Here's a compact version of _compile. I simply removed the debug and flags checks:
_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p
_compile with string pattern
When _compile is called with a string pattern, the compiled pattern is saved in the _cache dict. Next time the same function is called (e.g. during the many timeit runs), _compile simply checks _cache to see whether this string has already been seen and returns the corresponding compiled pattern.
Using the ipdb debugger in Spyder, it's easy to dive into re.py during execution.
import re
pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod' \
'tempor incididunt ut labore et dolore magna aliqua.'
compiled_pattern = re.compile(pattern)
re.search(pattern, text)
re.search(pattern, text)
With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:
{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}
is saved in _cache. The compiled pattern is returned directly.
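A quick way to see the cache at work without a debugger is to compile the same string pattern twice; on CPython the second call is a cache hit and returns the very same object (an implementation detail, not a documented guarantee):

```python
import re

p1 = re.compile('sed')
p2 = re.compile('sed')
print(p1 is p2)   # True on CPython: the second compile is a cache hit

re.purge()        # clear the regular expression cache
p3 = re.compile('sed')
print(p1 is p3)   # False: the cache was emptied, so this is a fresh compile
```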
_compile with compiled pattern
slow hash
What happens if _compile is called with an already compiled pattern?
First, _compile checks if the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:
In [1]: import re
In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"
In [3]: compiled_pattern = re.compile(pattern)
In [4]: % timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: % timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
hash(compiled_pattern) is 60 times slower than hash(pattern) here.
KeyError
When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.
The KeyError gets handled and ignored. Only then does _compile check if the pattern is already compiled. If it is, it gets returned, without being written to the cache.
This means that the next time _compile is called with the same compiled pattern, it will calculate the useless, slow hash again, and will still fail with a KeyError.
Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).
This weird behaviour might be a choice to speed up calls with string patterns, but it might have been a good idea to emit a warning if _compile is called with an already compiled pattern.
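The three call styles can be compared with a timeit sketch; the exact timings depend on the machine, but this is the kind of measurement that exposes the difference:

```python
import re
import timeit

pattern = 'sed'
text = ('Lorem ipsum dolor sit amet, consectetur adipiscing elit, '
        'sed do eiusmod tempor incididunt ut labore.')
compiled_pattern = re.compile(pattern)

# Direct method call: skips _compile entirely.
print(timeit.timeit(lambda: compiled_pattern.search(text), number=100_000))
# String pattern: one cheap hash, then a cache hit.
print(timeit.timeit(lambda: re.search(pattern, text), number=100_000))
# Compiled pattern passed back to re.search: slow hash + KeyError each call.
print(timeit.timeit(lambda: re.search(compiled_pattern, text), number=100_000))
```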
Speed up millions of regex replacements in Python 3
One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".
Because re relies on C code to do the actual matching, the savings can be dramatic.
As @pvg pointed out in the comments, it also benefits from single-pass matching.
If your words are not regex, Eric's answer is faster.
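A minimal sketch of that idea, using made-up words; re.escape guards against words that contain regex metacharacters:

```python
import re

# Hypothetical word list, for illustration.
words = ['word1', 'word2', 'word3']
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, words)) + r')\b')

text = 'word1 here, word2 there, and word4 stays'
print(pattern.sub('<censored>', text))
# word4 is untouched: only whole-word matches are replaced
```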
Regex match returning 'none' while findall & search work
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string
https://docs.python.org/3/library/re.html#search-vs-match