Is It Worth Using Python's re.compile?

Is it worth using Python's re.compile?

I've had a lot of experience running a compiled regex thousands of times versus compiling on the fly, and have not noticed any perceptible difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT:
After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

From module re.py (comments are mine):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.
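For what it's worth, the "negligible difference" claim is easy to check yourself. A minimal sketch (the pattern, text, and iteration count here are my own choices):

```python
import re
import timeit

text = "the quick brown fox"
pattern = re.compile(r"\bfox\b")  # compiled once, bound to a reusable name

# Both forms find the same match; re.search() just pays an extra
# cache lookup to fetch the already-compiled pattern internally.
assert pattern.search(text).group() == re.search(r"\bfox\b", text).group()

# Exact numbers vary by machine, but the two are usually close.
t_compiled = timeit.timeit(lambda: pattern.search(text), number=100_000)
t_on_the_fly = timeit.timeit(lambda: re.search(r"\bfox\b", text), number=100_000)
print(t_compiled, t_on_the_fly)
```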

When to use re.compile

Let's say that word1, word2, ... are regexes. Let's rewrite those parts:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))

To support regexes with | in them, you would have to parenthesize the expressions:

allWords = re.compile("|".join("({})".format(x) for x in ["word1", "word2", "word3"]))

(that also works with standard words of course, and it's still worth using regexes because of the | part)
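As a quick sketch of the combined pattern in action (the sample words and text are mine):

```python
import re

words = ["word1", "word2", "word3"]

# Each alternative is wrapped in its own group, so patterns that
# themselves contain | remain intact inside the alternation.
allWords = re.compile("|".join("({})".format(w) for w in words))

# One scan finds whichever alternative occurs first.
m = allWords.search("some text with word2 inside")
assert m.group() == "word2"
```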

Now this is a disguised loop with each term hardcoded:

def bar(data, allWords):
    if allWords[0].search(data):
        temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
        return temp

    elif allWords[1].search(data):
        temp = data.split("word2", 1)[1]
        return temp

can be rewritten simply as

def bar(data, allWords):
    return allWords.split(data, maxsplit=1)[1]
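A usage sketch of that rewrite (sample data mine). One caveat worth knowing: with the parenthesized variant of allWords, re.split() also returns the captured groups, so index 1 would no longer be the tail; the ungrouped pattern is used here.

```python
import re

allWords = re.compile("|".join(["word1", "word2", "word3"]))

def bar(data, allWords):
    # split on the first matching alternative; [1] is everything after it
    return allWords.split(data, maxsplit=1)[1]

assert bar("before word2 after", allWords) == " after"
```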

In terms of performance:

  • the regular expression is compiled at start, so it's as fast as it can be
  • there's no loop or pasted expressions; the "or" part is done by the regex engine, which is most of the time compiled code: you can't beat that in pure Python
  • the match & the split are done in one operation

The last hiccup is that internally the regex engine searches for the alternatives in a loop, which makes it an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning a text cannot be matched by several of them; otherwise the longest would have to come before the shorter one).

What does python's re.compile do?

It compiles a regex into a regex object. Take a look at the docs for more info.

Should I reuse a compiled regex?

Well, if I understand correctly, all you want to write is:

def match_on_list_of_strings(list_of_strings):
    regex = re.compile(r'(?=(%s))')
    for string in list_of_strings:
        yield regex.findall(string)

That will apply your match to the strings as many times as there are strings in the list, while your regex is compiled only once.

Aaaah... but you don't need a regex for that:

def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if string in bigstring:
            yield string

or if you really want to use a re:

def match_on_list_of_strings(bigstring, list_of_strings):
    for string in list_of_strings:
        if re.match('.*' + string + '.*', bigstring):
            yield string

And then to answer your question: no, you can't compile the destination string into a regex, only the contrary. When you compile a regex, you transform the regexp into an internal representation of its automaton. You might want to read a course on NFAs and regexps.

Why is a compiled python regex slower?

Short answer

If you call compiled_pattern.search(text) directly, it won't call _compile at all, it will be faster than re.search(pattern, text) and much faster than re.search(compiled_pattern, text).

This performance difference is due to KeyErrors in the cache and slow hash calculations for compiled patterns.


re functions and SRE_Pattern methods

Any time a re function with a pattern as 1st argument (e.g. re.search(pattern, string) or re.findall(pattern, string)) is called, Python tries to compile the pattern first with _compile and then calls the corresponding method on the compiled pattern. For example:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

Note that pattern could be either a string or an already compiled pattern (an SRE_Pattern instance).

_compile

Here's a compact version of _compile. I simply removed the debug and flags checks:

_cache = {}
_pattern_type = type(sre_compile.compile("", 0))
_MAXCACHE = 512

def _compile(pattern, flags):
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    loc = None
    _cache[type(pattern), pattern, flags] = p, loc
    return p

_compile with String pattern

When _compile is called with a string pattern, the compiled pattern is saved in _cache dict. Next time the same function is called (e.g. during the many timeit runs), _compile simply checks in _cache if this string has already been seen and returns the corresponding compiled pattern.
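This caching can be observed directly. _cache is a CPython implementation detail (its key layout varies across versions), so the sketch below only checks that a cache miss grows it:

```python
import re

re.purge()                  # public API: clears the internal caches
before = len(re._cache)     # implementation detail, for illustration only
re.search("some-new-pattern", "some text with some-new-pattern in it")
after = len(re._cache)
assert after == before + 1  # the string pattern was compiled and cached
```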

Using the ipdb debugger in Spyder, it's easy to dive into re.py during execution.

import re

pattern = 'sed'
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ' \
       'tempor incididunt ut labore et dolore magna aliqua.'

compiled_pattern = re.compile(pattern)

re.search(pattern, text)
re.search(pattern, text)

With a breakpoint at the second re.search(pattern, text), it can be seen that the pair:

{(<class 'str'>, 'sed', 0): (re.compile('sed'), None)}

is saved in _cache. The compiled pattern is returned directly.

_compile with compiled pattern

slow hash

What happens if _compile is called with an already compiled pattern?

First, _compile checks if the pattern is in _cache. To do so, it needs to calculate its hash. This calculation is much slower for a compiled pattern than for a string:

In [1]: import re

In [2]: pattern = "(?:a(?:b(?:b\\é|sorbed)|ccessing|gar|l(?:armists|ternation)|ngels|pparelled|u(?:daciousness's|gust|t(?:horitarianism's|obiographi
...: es)))|b(?:aden|e(?:nevolently|velled)|lackheads|ooze(?:'s|s))|c(?:a(?:esura|sts)|entenarians|h(?:eeriness's|lorination)|laudius|o(?:n(?:form
...: ist|vertor)|uriers)|reeks)|d(?:aze's|er(?:elicts|matologists)|i(?:nette|s(?:ciplinary|dain's))|u(?:chess's|shanbe))|e(?:lectrifying|x(?:ampl
...: ing|perts))|farmhands|g(?:r(?:eased|over)|uyed)|h(?:eft|oneycomb|u(?:g's|skies))|i(?:mperturbably|nterpreting)|j(?:a(?:guars|nitors)|odhpurs
...: 's)|kindnesses|m(?:itterrand's|onopoly's|umbled)|n(?:aivet\\é's|udity's)|p(?:a(?:n(?:els|icky|tomimed)|tios)|erpetuating|ointer|resentation|
...: yrite)|r(?:agtime|e(?:gret|stless))|s(?:aturated|c(?:apulae|urvy's|ylla's)|inne(?:rs|d)|m(?:irch's|udge's)|o(?:lecism's|utheast)|p(?:inals|o
...: onerism's)|tevedore|ung|weetest)|t(?:ailpipe's|easpoon|h(?:ermionic|ighbone)|i(?:biae|entsin)|osca's)|u(?:n(?:accented|earned)|pstaging)|v(?
...: :alerie's|onda)|w(?:hirl|ildfowl's|olfram)|zimmerman's)"

In [3]: compiled_pattern = re.compile(pattern)

In [4]: %timeit hash(pattern)
126 ns ± 0.358 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit hash(compiled_pattern)
7.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

hash(compiled_pattern) is 60 times slower than hash(pattern) here.

KeyError

When a pattern is unknown, _cache[type(pattern), pattern, flags] fails with a KeyError.

The KeyError gets handled and ignored. Only then does _compile check whether the pattern is already compiled. If it is, it gets returned, without being written to the cache.

It means that the next time _compile is called with the same compiled pattern, it will perform the same useless, slow hash calculation and fail with a KeyError again.

Error handling is expensive, and I suppose that's the main reason why re.search(compiled_pattern, text) is slower than re.search(pattern, text).

This weird behaviour might be a deliberate choice to speed up calls with string patterns, but it might have been a good idea to emit a warning when _compile is called with an already compiled pattern.
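The three call styles from the short answer can be compared side by side (the pattern and text are my own; the timings illustrate the trend, not exact figures):

```python
import re
import timeit

pattern = r"\bdolor\b"
compiled = re.compile(pattern)
text = "Lorem ipsum dolor sit amet"

# All three find the same match.
assert (compiled.search(text).group()
        == re.search(pattern, text).group()
        == re.search(compiled, text).group())

# Typically t_method < t_string < t_compiled_arg, with margins
# that depend on the machine and the pattern.
t_method = timeit.timeit(lambda: compiled.search(text), number=50_000)
t_string = timeit.timeit(lambda: re.search(pattern, text), number=50_000)
t_compiled_arg = timeit.timeit(lambda: re.search(compiled, text), number=50_000)
print(t_method, t_string, t_compiled_arg)
```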

What is the use of re.compile in BeautifulSoup?

re.compile() compiles a regex pattern into a regex object.

For example,

line = re.compile("line")
print(line)

result: re.compile('line')
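The compiled object exposes the matching methods directly, and in BeautifulSoup it's typically passed where a plain string would go, e.g. soup.find_all(string=re.compile("line")) to match text containing "line". A few method calls on the object itself:

```python
import re

line = re.compile("line")

assert line.search("first line of text").group() == "line"
assert line.findall("line one, line two") == ["line", "line"]
assert line.match("linear") is not None  # match() anchors at the start
assert line.match("a line") is None      # ...so it fails mid-string
```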

Here are a few good references on regex:
1. python 3 re library
2. python re.compile

Meaning of re.compile(r"[\w']+") in Python

This is a regular expression that has been compiled for faster reuse (explained in this question: Is it worth using re.compile). The command re.compile is explained in the Python docs.

Regarding the specific regex, this searches for groups of one or more word characters (that's the \w part: letters, digits, and the underscore) or apostrophes (which are also in those square brackets). Note that whitespace is not a match, so this, generally speaking, breaks a line into words.

See the query in a Python specific regex tester to try it out or on regex101 where they offer an explanation of any regex expression.

In the phrase How's it going $#, this would have three matches: How's, it, going, but wouldn't match the group of symbols.
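That example can be checked directly:

```python
import re

words = re.compile(r"[\w']+")

# Three word-ish tokens match; the bare symbols do not.
assert words.findall("How's it going $#") == ["How's", "it", "going"]
```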

There are lots of tutorials and even some games out there but you can start with regexone to understand it better by trying some out.

Python: re.compile and re.sub

I will explain what happens with your code:

import re

file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()

match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)

fixed.write(fixed_doc)

The instruction text = file.read() creates a string object and binds it to the name (identifier) text.

As a consequence of the instruction for unwanted in text:, the name unwanted is successively bound to each character of the string referenced by text.

Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex; <.*> is only the regex pattern).

You assign this compiled regex object to the name match: that's very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could write match.match without error.

match is also the name of a function of the re module.

Using this name for your particular need is very confusing. You should avoid it.

There's the same flaw with the use of file as a name for the file handler of file f1: file is already an identifier used by the language, so you should avoid it too.

Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ', text) replaces all the occurrences found by the regex match in text with the replacement ' '.

Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. It's a habit of some anxious people to write raw strings every time they have to write a string in a regex problem.

Because of its pattern <.*>, in which the dot means "greedily eat every character situated between a < and a >, except a newline character", the occurrences caught in the text by match are each line up to its last >.

As the name unwanted doesn't appear in this instruction, the same operation is done for each character of the text, one after the other. That is to say: nothing interesting.

To analyze the execution of a program, you should put some print instructions in your code to understand what happens. For example, if you do print(repr(fixed_doc)), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.

There's one more defect in your code: you open files, but you never close them. Closing files is important; otherwise weird phenomena can occur, which I personally observed in some of my code before I realized this need. Some people claim it isn't necessary, but that's false.

By the way, the better way to open and close files is to use the with statement. It does all the work without you having to worry about it.
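A minimal sketch of the same snippet rewritten with with (strip_tags is a name I made up, and the demo writes to a throwaway temporary directory rather than f1.txt):

```python
import os
import re
import tempfile

tag = re.compile('<.*>')

def strip_tags(in_path, out_path):
    # `with` closes both files automatically, even on error
    with open(in_path) as src, open(out_path, 'w') as dst:
        dst.write(tag.sub(' ', src.read()))

# Demo on throwaway files
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'f1.txt')
dst_path = os.path.join(workdir, 'fnew.txt')
with open(src_path, 'w') as f:
    f.write('<a>keep this\nand this\n')

strip_tags(src_path, dst_path)
with open(dst_path) as f:
    result = f.read()
assert result == ' keep this\nand this\n'
```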


So , now I can propose you a code for your first problem:

import re

def ripl(mat=None, li=[]):
    if mat is None:
        li[:] = []
        return
    if mat.group(1):
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)

text = '''<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''

print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl, text))
print('3------------------------------------3')

result

1------------------------------------1
<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2

George <wxc>Washington
Joe </zazaza>Taylor

3------------------------------------3

The principle is as follows:

When the regex detects a tag,

- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text

For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.

If the match is with a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.

If the match is with an end tag, ripl() returns '' only if this end tag has previously been detected as the corresponding end tag of a start tag. This is made possible by recording in a list li the span of the corresponding end tag each time a start tag matches.

The recording list li is defined as a default argument so that the very same list is used at each call of ripl() (please refer to how default arguments work to understand this; it's subtle).

As a consequence of defining li with a default argument, the list would retain all the spans recorded for one text if several texts were analyzed successively. To avoid li retaining spans from past texts, it must be emptied between runs. I wrote the function so that its first parameter defaults to None: that allows calling ripl() without arguments, before any use of it in a regex's sub() method, to reset the list.

Then, one must remember to call ripl() before any such use.
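The mutable-default trick is easier to see in isolation (remember is a hypothetical name, not part of the answer's code):

```python
def remember(item=None, seen=[]):
    # `seen` is created once, at function definition time,
    # so it persists across calls, exactly like `li` in ripl()
    if item is None:
        seen[:] = []          # reset in place, as ripl() does
        return seen
    seen.append(item)
    return seen

remember('a')
remember('b')
assert remember('c') == ['a', 'b', 'c']  # state accumulated across calls
remember()                               # explicit reset
assert remember('d') == ['d']
```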


If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:

import re

def ripl(mat=None, li=[]):
    if mat is None:
        li[:] = []
        return
    if mat.group(1):
        return ''
    elif mat.group(2):
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)

text = '''<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''

print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl, text))
print('3------------------------------------3')

result

1------------------------------------1
<something @37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3

