Why Do Some Regex Engines Match .* Twice in a Single Input String

Why do some regex engines match .* twice in a single input string?

Kinda interesting question. Instead of referring to your questions first, I'll go for your comment.

Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string?

A position called end of subject string is left. It's a position and can be matched. Like other zero-width assertions and anchors \b, \B, ^, $... that assert, a dot-star .* can match an empty string. This is highly dependent on regex engine. E.g. TRegEx does it differently.

And if you do, shouldn't this result in an infinite loop?

No, this is of the main jobs of regex engines to handle. They raise a flag and store current cursor data to avoid such loops to occur. Perl docs explain it this way:

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of foo, and since the position in
the string is not moved by the match, o? would match again and again
because of the * quantifier. Another common way to create a similar
cycle is with the looping modifier /g...

Thus Perl allows such constructs, by forcefully breaking the infinite
loop
. The rules for this are different for lower-level loops given by
the greedy quantifiers *+{} , and for higher-level ones like the
/g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken)
when Perl detects that a repeated expression matched a zero-length substring.

Now back to your questions:

Is there a good reason for this behavior?

Yes, there is. Every regex engine has to meet a significant amount of challenges in order to process a text. One of which is dealing with zero-length matches. Your question raises another question,

Q: How does an engine should proceed after matching a zero-length string?

A: It all depends.

PCRE (or Ruby here) doesn't skip zero-length matches.

It matches it then raises a flag to not match the same position again with the (same)? pattern. In PCRE .* matches entire subject string then stops right after it. Being at the end, current position is a meaningful position in PCRE, positions can be matched or being asserted so there is a position (zero-length string) left to be matched. PCRE goes through the regex again (if g modifier is enabled) and finds a match at the end of subject.

PCRE then tries to advance to the next immediate position to run whole process again but it fails since there is no position left.

You see if you want to prevent the second match from being happened you need to tell engine in some way:

^.*

Or to provide a better insight into what is going on:

(?!$).*

See live demo here specially take a look at debugger window.

Why does running the same regex twice yield different results?

It's because of the /g modifier in your if. Since the if is evaluating the =~ in scalar context, it only gets the first item matched. Then, inside your if block, the @array assignment continues the search from where it left off. (This is useful for parsing.)

When you run the extra match, you've already finished matching everything in the string, so you start over from the beginning again, in list context, and you get everything then.

If you remove the g flag in your if, then things work as you expect.

Why do regex engines allow / automatically attempt matching at the end of the input string?

Note:

  • My question post contains two related, but distinct questions, for which I should have created separate posts, as I now realize.
  • The other answers here focus on one of the questions each, so in part this answer provides a road map to what answers address which question.

As for why patterns such as $<expr> are allowed (i.e., matching something after the input's end) / when they make sense:

  • dawg's answer argues that nonsensical combinations such as $.+ probably aren't prevented for pragmatic reasons; ruling them out may not be worth the effort.

  • Tim's answer shows how certain expressions can make sense after $, namely negative lookbehind assertions.

  • The second half of ivan_pozdeev's answer answer cogently synthesizes dawg's and Tim's answers.


As for why global matching finds two matches for patterns such as .* and .*$:

  • revo's answer contains great background information about zero-length (empty-string) matching, which is what the problem ultimately comes down to.

Let me complement his answer by relating it more directly to how the behavior contradicts my expectations in the context of global matching:

  • From a purely common-sense perspective, it stands to reason that once the input has been fully consumed while matching, there is by definition nothing left, so there is no reason to look for further matches.

  • By contrast, most regex engines consider the character position after the last character of the input string - the position known as end of subject string in some engines - a valid starting position for a match and therefore attempt another one.

    • If the regex at hand happens to match the empty string (produces a zero-length match; e.g., regexes such as .*, or a?), it matches that position and returns an empty-string match.

    • Conversely, you won't see an extra match if the regex doesn't (also) match the empty string - while the additional match is still attempted in all cases, no match will be found in this case, given that the empty string is the only possible match at the end-of-subject-string position.

While this provides a technical explanation of the behavior, it still doesn't tell us why matching after the last character was implemented.

The closest thing we have is an educated guess by Wiktor Stribiżew in a comment (emphasis added), which again suggests a pragmatic reason for the behavior:

... as when you get an empty string match, you might still match the next char that is still at the same index in the string. If a regex engine did not support it, these matches would be skipped. Making an exception for the end of string was probably not that critical for regex engine authors.

The first half of ivan_pozdeev's answer explains the behavior in more technical detail by telling us that the void at the end of the [input] string is a valid position for matching, just like any other character-boundary position.

However, while treating all such positions the same is certainly internally consistent and presumably simplifies the implementation, the behavior still defies common sense and has no obvious benefit to the user.



Further observations re empty-string matching:

Note: In all code snippets below, global string replacement is performed to highlight the resulting matches: each match is enclosed in [...], whereas non-matching parts of the input are passed through as-is.

In summary, 3 different, independent behaviors apply in the context of empty(-string) matches, and different engines use different combinations:

  • Whether the POSIX ERE spec's longest leftmost ruleThanks, revo. is obeyed.

  • In global matching:

    • Whether or not the character position is advanced after an empty match.
    • Whether or not another match is attempted for the by-definition empty string at the very end of the input (the 2nd question in my question post).

Matching at the end-of-subject-string position is not limited to those engines where matching continues at the same character position after an empty match.

For instance, the .NET regex engine does not do so (PowerShell example):

PS> 'a1' -replace '\d*|a', '[$&]'
[]a[1][]

That is:

  • \d* matched the empty string before a
  • a itself then did not match, which implies that the character position was advanced after the empty match.
  • 1 was matched by \d*
  • The end-of-subject-string position was again matched by \d*, resulting in another empty-string match.

Perl 5 is an example of an engine that does resume matching at the same character position:

$ "a1" | perl -ple "s/\d*|a/[$&]/g"
[][a][1][]

Note how a was matched too.

Interestingly, Perl 6 not only behaves differently, but exhibits yet another behavior variant:

$ "a1" | perl6 -pe "s:g/\d*|a/[$/]/"
[a][1][]

Seemingly, if an alternation finds both and empty and a non-empty match, only the non-empty one is reported.

Perl 6's behavior appears to be following the longest leftmost rule.

While sed and awk do as well, they don't attempt another match at the end of the string:

sed, both the BSD/macOS and GNU/Linux implementations:

$ echo a1 | sed -E 's/[0-9]*|a/[&]/g'
[a][1]

awk - both the BSD/macOS and GNU/Linux implementations as well as mawk:

$ echo a1 | awk '1 { gsub(/[0-9]*|a/, "[&]"); print }'
[a][1]

Why are regex search queries in R represented by strings?

The basic issue is that regular expressions are handled by functions in R, they aren't a built-in part of the language. Building them in would require a change in the way characters are parsed when reading R code. Since regular expressions aren't central to the language, this is seen as an unnecessary complication.

More specifically, for the R parser to handle regex(\.), you'd need a new reserved word (regex), and a whole new parsing mode to be defined, with its own complications. For example, both "" and ")" are legal regular expressions. (Ignore the quotes, just consider the characters within them.) Putting them in your suggested syntax would look like regex() and regex()), so the R parser would have to look ahead when it hit the first ) to know where the regular expression ended. But "))" is also legal, so how would it know where to stop?

Putting regular expressions into strings adds the extra layer of escapes, but at least it doesn't complicate the design of the parser.

EDITED TO ADD:

As of R 4.0.0, things are better for writing regular expressions because of the new syntax for string literals described in this NEWS entry:

There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.

So if you want to enter \., you replace the ... above with exactly what you want, with no escapes necessary:

r"(\.)"

This is parsed the same as "\.". It's not exactly what you wished for, but it's kind of close.

Why running .Net Regex with ECMAScript flavor support \A

You need to look for the answer further in the "ECMAScript Matching Behavior" article.

This option does NOT redefine the .NET-specific anchors meanings, they are still supported.

The behavior of ECMAScript and canonical regular expressions differs in three areas: character class syntax, self-referencing capturing groups, and octal versus backreference interpretation.

Character class syntax. Because canonical regular expressions support Unicode whereas ECMAScript does not, character classes in ECMAScript have a more limited syntax, and some character class language elements have a different meaning. For example, ECMAScript does not support language elements such as the Unicode category or block elements \p and \P. Similarly, the \w element, which matches a word character, is equivalent to the [a-zA-Z_0-9] character class when using ECMAScript and [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}] when using canonical behavior. For more information, see Character Classes.

Self-referencing capturing groups. A regular expression capture class with a backreference to itself must be updated with each capture iteration.

Resolution of ambiguities between octal escapes and backreferences.



























Regular expressionCanonical behaviorECMAScript behavior
\0 followed by 0 to 2 octal digitsInterpret as an octal. For example, \044 is always interpreted as an octal value and means "$".Same behavior.
\ followed by a digit from 1 to 9, followed by no additional decimal digits,Interpret as a backreference. For example, \9 always means backreference 9, even if a ninth capturing group does not exist. If the capturing group does not exist, the regular expression parser throws an ArgumentException.If a single decimal digit capturing group exists, backreference to that digit. Otherwise, interpret the value as a literal.
\ followed by a digit from 1 to 9, followed by additional decimal digitsInterpret the digits as a decimal value. If that capturing group exists, interpret the expression as a backreference. Otherwise, interpret the leading octal digits up to octal 377; that is, consider only the low 8 bits of the value. Interpret the remaining digits as literals. For example, in the expression \3000, if capturing group 300 exists, interpret as backreference 300; if capturing group 300 does not exist, interpret as octal 300 followed by 0.Interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal by using the leading octal digits up to octal 377; interpret the remaining digits as literals.



Related Topics



Leave a reply



Submit