Confusion With Atomic Grouping - How It Differs from the Grouping in Regular Expression of Ruby

Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?

A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.

The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.

Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.

Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).

A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.

Example

Input string: bbabbbabbbbc

Pattern: /(?>.*)c/

The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.


Input string: bbabbbabbbbc

Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/

There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.

You can see the backtracking behavior here.

  • Without the atomic grouping /(.*|b*)[ac]/, the string will have a single match which is the whole string, due to backtracking at the end to match [ac]. Note that the engine will go back to .* to backtrack by 1 character since it still have other possibilities.

    Pattern: /(.*|b*)[ac]/
    bbabbbabbbbc
    ^ -- Start matching. Look at first item in alternation: .*
    bbabbbabbbbc
    ^ -- First match of .*, due to greedy quantifier
    bbabbbabbbbc
    X -- [ac] cannot match
    -- Backtrack to ()
    bbabbbabbbbc
    ^ -- Continue explore other possibility with .*
    -- Step back 1 character
    bbabbbabbbbc
    ^ -- [ac] matches, end of regex, a match is found
  • With the atomic grouping, all possibilities of .* is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for the b* pattern, where it successfully finds a match to the regex.

    Pattern: /((?>.*)|b*)[ac]/
    bbabbbabbbbc
    ^ -- Start matching. Look at first item in alternation: (?>.*)
    bbabbbabbbbc
    ^ -- First match of .*, due to greedy quantifier
    -- The atomic grouping will disallow .* to be backtracked and rematched
    bbabbbabbbbc
    X -- [ac] cannot match
    -- Backtrack to ()
    -- (?>.*) is atomic, check the next possibility by alternation: b*
    bbabbbabbbbc
    ^ -- Starting to rematch with b*
    bbabbbabbbbc
    ^ -- First match with b*, due to greedy quantifier
    bbabbbabbbbc
    ^ -- [ac] matches, end of regex, a match is found

    The subsequent matches will continue on from here.

Alternation in atomic grouping is useless?

It's not useless in the general case.

It doesn't work in your examples because the first choice always matches, therefore the second choice won't be tried. And the engine won't backtrack inside the atomic group, because this is pretty much the purpose of an atomic group.

If you have two disjoint patterns in an atomic group, that will make perfect sense

(something like (?>ab|cd)).

REGEX: PCRE atomic group doesn't work

I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.

The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.

PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:

<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b

https://regex101.com/r/p572K8/2

Just by adding a single (*COMMIT) verb after the 'f' branch, it's cut the number of steps required to find a failure in this case by half.

(*COMMIT) tells the engine to commit to the match at that point. It won't even re-attempt the match starting from </ again if no match is found.

To fully optimize the expression, you'll have to add (*COMMIT) at every point after branching has occurred.

Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.

Why atomic grouping in Python is slower than a simple non-capturing alternation?

Your assumption is not correct. The whole point of atomic patterns is to prevent backtracking into the pattern.

The atomic_group pattern is of (?=(...))\1 type in your code and the non-atomic one is of (?:...) type. So, the first one already loses to the second one due to the use of a capturing group, see capturing group VS non-capturing group.

Besides, you need to match the strings twice with the atomic_group pattern, first, with the lookahead, second, with the backreference.

So, only use this techinque when you need to control backtracking inside a longer pattern.

How to match an even number of any character in a string?

Another possible regex-only solution that doesn't involve conditionals:

(.)(?<!\1\1)\1(?:\1\1)*(?!\1)

Breakdown:

(.)         # First capturing group - matches any character.
(?<!\1\1) # Negative lookbehind - ensures the matched char isn't preceded by the same char.
\1 # Match another one of the character in the 1st group (at least two in total).
(?:\1\1) # A non-capturing group that matches two occurrences of the same char.
* # Matches between zero and unlimited times of the previous group.
(?!\1) # Negative lookahead to make sure no extra occurrence of the char follows.

Demo:

string input = "aaabbashasccddee";
string pattern = @"(.)(?<!\1\1)\1(?:\1\1)*(?!\1)";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
Console.WriteLine(m.Value);

Output:

bb
cc
dd
ee

Try it online.

Why does gsub's '\1' capture group produce this string?

The regex matches "12", "45", "78", and gsub replaces them with "1", "4", "7", respectively, giving "13 46 79".

Understanding negative look aheads in regular expressions

In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:

^(?!.*localhost).*$ 

The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere the string (the "anywhere" is taken care of by the .* in front of localhost). If that localhost can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^ this means, the pattern overall cannot match. If, however the .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.

Now the other one

^((?!localhost).)*$

This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.

In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.

Can regular expressions be used to match nested patterns?

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).

There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.

For more background: Automata Theory at Wikipedia



Related Topics



Leave a reply



Submit