Do Python Regular Expressions Have an Equivalent to Ruby'S Atomic Grouping

Do Python regular expressions have an equivalent to Ruby's atomic grouping?

Python does not directly support this feature, but you can emulate it by using a zero-width lookahead assert ((?=RE)), which matches from the current point with the same semantics you want, putting a named group ((?P<name>RE)) inside the lookahead, and then using a named backreference ((?P=name)) to match exactly whatever the zero-width assertion matched. Combined together, this gives you the same semantics, at the cost of creating an additional matching group, and a lot of syntax.

For example, the link you provided gives the Ruby example of

/"(?>.*)"/.match('"Quote"') #=> nil

We can emulate that in Python as such:

re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None

We can show that I'm doing something useful and not just spewing line noise, because if we change it so that the inner group doesn't eat the final ", it still matches:

re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict()
# => {'tmp': 'Quote'}

You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:

re.search(r'"(?=(.*))\1"', '"Quote"') # => None

(Full disclosure: I learned this trick from perl's perlre documentation, which mentions it under the documentation for (?>...).)

In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:

[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit


re_1 = re.compile(r'''\(
(
[^()]+ # x+
|
\( [^()]* \)
)+
\)
''', re.X)
re_2 = re.compile(r'''\(
(
(?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
|
\( [^()]* \)
)+
\)''', re.X)

print timeit.timeit("re_1.search('((()' + 'a' * 25)",
setup = "from __main__ import re_1",
number = 10)

print timeit.timeit("re_2.search('((()' + 'a' * 25)",
setup = "from __main__ import re_2",
number = 10)

We see a dramatic improvement:

[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05

Which only gets more dramatic as we extend the length of the search string.

Best way of using atomic groupings in Python?

Whether you are using re or regex, you will have to fix your pattern, as it is catastrophic backtracking prone. Atomic groupings are not necessary here, you need optional groupings with obligatory patterns. Also, you need to fix your alternations that may start matching at the same location inside a string.

You can use

(?:[Uu]nit|[Ss]tudio|[Ff]lat)?\s{0,5}\d+[&-]*\w*\s*(?:[Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen)(?:\s{1,5}\w+){0,5}\s{0,5}[A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}

See the regex demo.

See the Python demo:

import re

def parse_results(string):
space = r"\s{0,5}"
building_type = r"(?:[Uu]nit|[Ss]tudio|[Ff]lat)?"
street_type = (r"\d+[&-]*\w*\s*"
r"(?:[Ee]nd|[Gg]reen|[Cc]auseway|[Cc]heapside|[Cc]rescent|"
r"[Ss]treet|[Ll]ane|[Ww]alk|[Rr]oad|[Aa]venue|[Dd]rive|"
r"[Pp]ark|[Ww]ay|[Pp]lace|[Pp]arade|[Ii]ndustrial"
r"[Ee]state|[Tt]rading [Ee]state|[Hh]ouse|[Gg]reen)")
postcode = r"[A-Z0-9][A-Z0-9][A-Z0-9]?[A-Z0-9]? {1,2}[0-9][A-Z]{2}"
pattern = re.compile(rf"{building_type}{space}{street_type}(?:\s{{1,5}}\w+){{0,5}}"
rf"{space}{postcode}")
print(pattern.pattern)
try:
return [re.sub(r"\s+", r" ", x) for x in pattern.findall(string)]
except Exception as e:
return (f"Error looking for address, exception {e}")

print(parse_results('Unit 23 End AS4 0SS'))

Why atomic grouping in Python is slower than a simple non-capturing alternation?

Your assumption is not correct. The whole point of atomic patterns is to prevent backtracking into the pattern.

The atomic_group pattern is of (?=(...))\1 type in your code and the non-atomic one is of (?:...) type. So, the first one already loses to the second one due to the use of a capturing group, see capturing group VS non-capturing group.

Besides, you need to match the strings twice with the atomic_group pattern, first, with the lookahead, second, with the backreference.

So, only use this techinque when you need to control backtracking inside a longer pattern.

Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?

A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.

The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.

Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.

Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).

A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.

Example

Input string: bbabbbabbbbc

Pattern: /(?>.*)c/

The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.


Input string: bbabbbabbbbc

Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/

There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.

You can see the backtracking behavior here.

  • Without the atomic grouping /(.*|b*)[ac]/, the string will have a single match which is the whole string, due to backtracking at the end to match [ac]. Note that the engine will go back to .* to backtrack by 1 character since it still have other possibilities.

    Pattern: /(.*|b*)[ac]/
    bbabbbabbbbc
    ^ -- Start matching. Look at first item in alternation: .*
    bbabbbabbbbc
    ^ -- First match of .*, due to greedy quantifier
    bbabbbabbbbc
    X -- [ac] cannot match
    -- Backtrack to ()
    bbabbbabbbbc
    ^ -- Continue explore other possibility with .*
    -- Step back 1 character
    bbabbbabbbbc
    ^ -- [ac] matches, end of regex, a match is found
  • With the atomic grouping, all possibilities of .* is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for the b* pattern, where it successfully finds a match to the regex.

    Pattern: /((?>.*)|b*)[ac]/
    bbabbbabbbbc
    ^ -- Start matching. Look at first item in alternation: (?>.*)
    bbabbbabbbbc
    ^ -- First match of .*, due to greedy quantifier
    -- The atomic grouping will disallow .* to be backtracked and rematched
    bbabbbabbbbc
    X -- [ac] cannot match
    -- Backtrack to ()
    -- (?>.*) is atomic, check the next possibility by alternation: b*
    bbabbbabbbbc
    ^ -- Starting to rematch with b*
    bbabbbabbbbc
    ^ -- First match with b*, due to greedy quantifier
    bbabbbabbbbc
    ^ -- [ac] matches, end of regex, a match is found

    The subsequent matches will continue on from here.

Python regex: how to match anything up to a specific string and avoid backtraking when failin

One-Regex Solution

^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN

Explanation

You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. Problem is, you can't use lazy quantifiers at the end of an atomic grouping: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.

So the regex engine will match the .*?, because of the laziness it will step outside of the group to check if the next character is a P, and if it's not it won't be able to backtrack inside the group to match that next character inside the .*.

What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.

This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.

Translated with your workaround, this would lead to the above regex (added ^ since you should anchor the regex, either to the start of the string or to another regex).

Ruby Regex vs Python Regex

The last time I checked, they differed substantially in their Unicode support. Ruby in 1.9 at least has some very limited Unicode support. I believe one or two Unicode properties might be supported by now. Probably the general categories and maybe the scripts were the two I'm thinking of.

Python has less and more Unicode support at the same time. Python does seem to make it possible to meet the requirements of RL1.2a "Compatability Properties" from UTS#18 on Unicode Regular Expressions.

That said, there is a really rather nice Python library out there by Matthew Barnett (mrab) that finally adds a couple of Unicode properties to Python regexes. He supports the two most important ones: the general categories, and the script properties. It has some other intriguing features as well. It deserves some good publicity.

I don't think either of Ruby or Python support Unicode all that terribly well, although more and more gets done every day. In particular, however, neither meets even the barebones Level 1 requirement for Unicode Regular Expressions cited above. For example, RL1.2 requires that at least 11 properties be supported: General_Category, Script, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, ANY, ASCII, and ASSIGNED.

I think Python only lets you get to some of those, and only in a roundabout way. Of course, there are many, many other properties beyond these 11.

When you’re looking for Unicode support, there's more than just UTS#10 on Regular Expressions of course, although that is the one that matters most to this question and neither Ruby nor Puython are Level 1 compliant. Other very important aspects of Unicode include UAX#15, UAX#14, UTS#18, UAX#11, UAX#29, and of course the crucial UAX#44. Python has libraries for at least a couple of those, I know. I don't know that they're standard.

But when it comes to regular expression support, um, there are richer alternatives than just those two, you know. :)

Alternation in atomic grouping is useless?

It's not useless in the general case.

It doesn't work in your examples because the first choice always matches, therefore the second choice won't be tried. And the engine won't backtrack inside the atomic group, because this is pretty much the purpose of an atomic group.

If you have two disjoint patterns in an atomic group, that will make perfect sense

(something like (?>ab|cd)).

Optimizing a regular expression to parse chinese pinyin

Assuming that your regex engine supports lookbehinds, atomic groups and possessive quantifiers (that are PCRE features):

Some examples of what can be replaced:

  • all (?: by (?>

  • the begining (all the first named group) by:

    ^(?P<initial>(?>[csz]h?+|[bdfghj-npqrtwxy])?)

  • this part* by:

    |(?<![csz]h)(?<=h)(?>a(?>[io]|ng?+)?|e(?>i|ng?+)?|o(?>u|ng)|u(?>[ino]|a(?>i|ng?+)?)?)

*( ie: |(?:(?<!sh|ch|zh)(?<=h)uang|(?<!sh|ch|...|(?<!sh|ch|zh)(?<=h)u) )

  • the last part* by:

    |(?<![bcdfghj-np-tw-z])(?>a(?>[io]|ng?+)?|e(?>[ir]|ng?+)?|ou?+))$

*( ie:|(?:(?<!r|c|b|d|g|f|h|k|j|m|l|n|q|p|s|t|w|y|x|z)a|(?<!r|c|b|d|...))$ )

How to deal with the other parts:

example:

(?:(?<=ch)uang|(?<=ch)ang|(?<=ch)eng|(?<=ch)ong|(?<=ch)uai|(?<=ch)uan|(?<=ch)ai|(?<=ch)an|(?<=ch)ao|(?<=ch)en|(?<=ch)ou|(?<=ch)ua|(?<=ch)ui|(?<=ch)un|(?<=ch)uo|(?<=ch)a|(?<=ch)e|(?<=ch)i|(?<=ch)u)

_ all this kind of parts has the same lookbehind, you must do these steps for each _

# step 1: lookarounds factorization

(?<=ch)(?>ang|eng|ong|uai|uan|ai|an|ao|en|ou|ua|ui|un|uo|a|e|i|u)

# step 2: sort all the content by alphabetic order

(?<=ch)(?>a|ai|an|ang|ao|e|en|eng|i|ong|ou|u|ua|uai|uan|ui|un|uo)

# step 3: group by first letter: don't forget the ? if the letter can be alone

(?<=ch)(?>a(?>i|n|ng|o)?|e(?>n|ng)?|i|o(?>ng|u)|u(?>a|ai|an|i|n|o)?)

# step 4: reduce the terminations (ie: n & ng => ng?+)

(?<=ch)(?>a(?>i|ng?+|o)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a[in]?+|i|n|o)?)

# step 5: put single letters in a character class

(?<=ch)(?>a(?>[io]|ng?+)?|e(?>ng?+)?|i|o(?>ng|u)|u(?>a[in]?+|[ino])?)

conclusion

Although the result is shorter, the goal here is optimization. I reduced the number of tests with the factorization and the number of backtracks using atomic groups and possessive quantifiers.

some limitations

Note that regex features like atomic groups and possessive quantifiers are not supported by all regex flavors, but it is possible to remedy the problem:

  • for flavors that don't support possessive quantifiers: change ?+ to ?
  • for flavors that don't support atomic groups: change (?> to (?:

(Note that there is a trick to have atomic groups with Python, which you may test with a timer, to surround all the pattern. See this incredible post: Do Python regular expressions have an equivalent to Ruby's atomic grouping?)

Some regex engines such as javascript do not support lookbehinds. In this case, you must rewrite all your pattern using only alternations (ie |), which isn't a bad thing, since lookbehinds make your pattern slower; and give up the named captures that are not supported too. (In this context, it should be noted that to remove negative lookbehinds you need to put syllables described in these parts before all others so that they are matched first.)

other ways of optimization

  • rewrite your pattern without lookbehinds and with | instead
  • sort the different lines by the most used syllables

Using Regex to extract certain phrases but exclude phrases followed by the word of

You can emulate atomic groups with capturing groups and lookahead:

(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))(?P=section)(?! of)

Demo

Long story short:
* in positive lookahead you create a capturing group called section that finds a section pattern
* then you match the group contents in (?P=secion)
* then in negative lookahead you check that there is no of following

Here is a really good answer that explains that technique.

What is a non-capturing group in regular expressions?

Let me try to explain this with an example.

Consider the following text:

http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex

Now, if I apply the regex below over it...

(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

... I would get the following result:

Match "http://stackoverflow.com/"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "https"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"

But I don't care about the protocol -- I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).

(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

Now, my result looks like this:

Match "http://stackoverflow.com/"
Group 1: "stackoverflow.com"
Group 2: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"

See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.


EDIT:

As requested, let me try to explain groups too.

Well, groups serve many purposes. They can help you to extract exact information from a bigger match (which can also be named), they let you rematch a previous matched group, and can be used for substitutions. Let's try some examples, shall we?

Imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):

   \<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
\<(.+?)\> [^<]*? \</\1\>

The first regex has a named group (TAG), while the second one uses a common group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one uses the name to match the value, and the second one uses the group index (which starts at 1).

Let's try some substitutions now. Consider the following text:

Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.

Now, let's use this dumb regex over it:

\b(\S)(\S)(\S)(\S*)\b

This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:

Match "Lorem"
Group 1: "L"
Group 2: "o"
Group 3: "r"
Group 4: "em"
Match "ipsum"
Group 1: "i"
Group 2: "p"
Group 3: "s"
Group 4: "um"
...

Match "consectetuer"
Group 1: "c"
Group 2: "o"
Group 3: "n"
Group 4: "sectetuer"
...

So, if we apply the substitution string:

$1_$3$2_$4

... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.

L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.

You can use named groups for substitutions too, using ${name}.

To play around with regexes, I recommend http://regex101.com/, which offers a good amount of details on how the regex works; it also offers a few regex engines to choose from.



Related Topics



Leave a reply



Submit