Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?
A ()
has some properties (include those such as (?!pattern)
, (?=pattern)
, etc. and the plain (pattern)
), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.
The normal capturing (pattern)
has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern)
doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern)
since it doesn't store the start and end index of the string matching the pattern inside.
Atomic grouping (?>pattern)
also has the non-capturing property, so the position of the text matched inside will not be captured.
Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).
A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.
Example
Input string: bbabbbabbbbc
Pattern: /(?>.*)c/
The first match by .*
is bbabbbabbbbc
due to the greedy quantifier *
. It will hold on to this match, disallowing c
from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.
Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/
, for testing /(((?>.*))|(b*))[ac]/
There are 3 matches to this regex, which are bba
, bbba
, bbbbc
. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b*
inside.
You can see the backtracking behavior here.
Without the atomic grouping
/(.*|b*)[ac]/
, the string will have a single match which is the whole string, due to backtracking at the end to match[ac]
. Note that the engine will go back to.*
to backtrack by 1 character since it still have other possibilities.Pattern: /(.*|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
bbabbbabbbbc
^ -- Continue explore other possibility with .*
-- Step back 1 character
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is foundWith the atomic grouping, all possibilities of
.*
is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for theb*
pattern, where it successfully finds a match to the regex.Pattern: /((?>.*)|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
-- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
-- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^ -- Starting to rematch with b*
bbabbbabbbbc
^ -- First match with b*, due to greedy quantifier
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is foundThe subsequent matches will continue on from here.
Alternation in atomic grouping is useless?
It's not useless in the general case.
It doesn't work in your examples because the first choice always matches, therefore the second choice won't be tried. And the engine won't backtrack inside the atomic group, because this is pretty much the purpose of an atomic group.
If you have two disjoint patterns in an atomic group, that will make perfect sense
(something like (?>ab|cd)
).
REGEX: PCRE atomic group doesn't work
I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.
The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.
PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
Just by adding a single (*COMMIT)
verb after the 'f' branch, it's cut the number of steps required to find a failure in this case by half.
(*COMMIT)
tells the engine to commit to the match at that point. It won't even re-attempt the match starting from </
again if no match is found.
To fully optimize the expression, you'll have to add (*COMMIT)
at every point after branching has occurred.
Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.
Why atomic grouping in Python is slower than a simple non-capturing alternation?
Your assumption is not correct. The whole point of atomic patterns is to prevent backtracking into the pattern.
The atomic_group
pattern is of (?=(...))\1
type in your code and the non-atomic one is of (?:...)
type. So, the first one already loses to the second one due to the use of a capturing group, see capturing group VS non-capturing group.
Besides, you need to match the strings twice with the atomic_group
pattern, first, with the lookahead, second, with the backreference.
So, only use this techinque when you need to control backtracking inside a longer pattern.
How to match an even number of any character in a string?
Another possible regex-only solution that doesn't involve conditionals:
(.)(?<!\1\1)\1(?:\1\1)*(?!\1)
Breakdown:
(.) # First capturing group - matches any character.
(?<!\1\1) # Negative lookbehind - ensures the matched char isn't preceded by the same char.
\1 # Match another one of the character in the 1st group (at least two in total).
(?:\1\1) # A non-capturing group that matches two occurrences of the same char.
* # Matches between zero and unlimited times of the previous group.
(?!\1) # Negative lookahead to make sure no extra occurrence of the char follows.
Demo:
string input = "aaabbashasccddee";
string pattern = @"(.)(?<!\1\1)\1(?:\1\1)*(?!\1)";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
Console.WriteLine(m.Value);
Output:
bb
cc
dd
ee
Try it online.
Why does gsub's '\1' capture group produce this string?
The regex matches "12"
, "45"
, "78"
, and gsub
replaces them with "1"
, "4"
, "7"
, respectively, giving "13 46 79"
.
Understanding negative look aheads in regular expressions
In both of your cases, ^
is just the start of the line (since it's not used inside a character class). Since both ^
and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:
^(?!.*localhost).*$
The ^
anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost
anywhere the string (the "anywhere" is taken care of by the .*
in front of localhost
). If that localhost
can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^
this means, the pattern overall cannot match. If, however the .*localhost
does not match (and hence localhost
does not occur in the string), the lookahead succeeds, and the .*$
simply takes care of matching the rest of the string.
Now the other one
^((?!localhost).)*$
This time the lookahead only checks at the current position (there is no .*
inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^
makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost
is found at that position. If not, all is well, and .
consumes one character. The *
then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost
- again, if not, all is well, and .
consumes another character. This is done for every single character in the string, until we reach the end.
In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
Can regular expressions be used to match nested patterns?
No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
Related Topics
Best Practices For Reusing Code Between Controllers in Ruby on Rails
How to Read a User Uploaded File, Without Saving It to the Database
Reading the Last N Lines of a File in Ruby
How to Get Filename Without Extension from File Path in Ruby
In Ruby, Is There an Array Method That Combines 'Select' and 'Map'
What Are the Magic $-Prefixed Variables in Ruby
Ruby 1.9.2 and Rails 3 Cannot Open Rails Console
Using Rvm on Ubuntu 12.04 to Use Rails. the Program 'Rails' Is Currently Not Installed
In Ruby Why Won't 'Foo = True Unless Defined(Foo)' Make the Assignment
Serving Static Files With Sinatra
Which Style of Ruby String Quoting Do You Favour
List of Ruby Operators That Can Be Overridden/Implemented
Extract a Substring from a String in Ruby Using a Regular Expression
Extract Number from String in Ruby
Read and Write Yaml Files Without Destroying Anchors and Aliases