Why Does the Order of Alternatives Matter in Regex

Why does the order of alternatives matter in regex?

The regular expression engine tries the alternatives in the order in which they are specified. So with the pattern (foo|foobar)&?, it matches foo immediately, finds no & at that position (which is fine, since the & is optional), and reports foo as the match. It then continues looking for further matches. The next bit of the input string is bar& b, which cannot be matched.

In other words, because foo is a prefix of foobar and nothing after the group forces the engine to backtrack, (foo|foobar) will never match foobar here: it always matches foo first.
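A minimal Python sketch of the same behavior. It assumes the input string is a foobar& b and uses Python's re module, which is also a backtracking engine. The last pattern, with a $ anchor, is an added illustration of how something after the group can force the engine into the second alternative.

```python
import re

text = "a foobar& b"  # assumed input string

# foo is listed first, so it wins at the position where both could match.
short_first = re.search(r"(foo|foobar)&?", text).group(0)

# Listing the longer alternative first lets it match, and &? then matches too.
long_first = re.search(r"(foobar|foo)&?", text).group(0)

# An anchor after the group makes foo fail overall, so the engine
# backtracks and tries foobar.
anchored = re.search(r"(foo|foobar)$", "foobar").group(0)

print(short_first)  # foo
print(long_first)   # foobar&
print(anchored)     # foobar
```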

Occasionally, this can be a very useful trick. The pattern (o|a|(\w)) lets you capture \w separately from a or o, because group 2 only participates when neither o nor a matched:

Regex.Replace("a foobar& b", "(o|a|(\\w))", "$2") // " fbr& b" (the leading space remains)

Python regular expressions: does ordering of alternatives matter for speed/choosing between alternatives?


Does ordering of alternatives matter for speed/choosing between alternatives?

Yes, it does. Alternative groups are analyzed from left to right, and that happens at each position in the input string.

Thus, putting the most common matches at the start already gives a speed boost, because the engine succeeds earlier at each position and has fewer alternatives to try.

With unanchored alternation lists in an NFA regex engine (as in Python), alternatives that can match at the same location should be ordered so that the longest comes first. Otherwise a shorter alternative will always "win": replacing some|someone with xxx in someone yields xxxone when you wanted xxx.
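The some|someone case can be checked directly with re.sub. This is a small sketch that only swaps the order of the two alternatives:

```python
import re

# Shorter alternative first: "some" wins and "one" is left behind.
short_first = re.sub(r"some|someone", "xxx", "someone")

# Longer alternative first: the whole word is replaced.
long_first = re.sub(r"someone|some", "xxx", "someone")

print(short_first)  # xxxone
print(long_first)   # xxx
```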

JavaScript regex: why is alternation not ordered?

The alternative graph does not match starting at the third character, but the alternative photograph does. The engine proceeds through the string from left to right.

The ordering you refer to in the question applies when alternatives match from a common starting point in the string. Otherwise, while proceeding through the "haystack" string, the alternatives are all considered. If there's a single match starting from a particular character, then the rest of the regex will proceed with that (and may of course backtrack later).
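The distinction between "different starting points" and "a common starting point" can be sketched as follows. The example uses Python's re module rather than JavaScript, and the input string photograph and the second pattern are assumptions chosen for illustration.

```python
import re

# Different starting points: "graph" cannot match at index 0 but
# "photograph" can, so the leftmost match wins even though "graph"
# is listed first.
leftmost = re.search(r"graph|photograph", "photograph").group(0)

# Common starting point: both alternatives can match at index 0,
# so the one listed first wins.
same_start = re.search(r"photo|photograph", "photograph").group(0)

print(leftmost)    # photograph
print(same_start)  # photo
```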

When several alternatives match from the same character in the source, the engine does not prefer the longer one. Under the regex semantics in the ECMAScript specification, the alternatives are tried from left to right, and the first one that lets the rest of the pattern succeed is used. The later alternatives are only tried if the engine has to backtrack.

Order of regular expression operator (..|.. ... ..|..)

Left to right, and the first alternative that matches "wins"; the others are not tried unless the engine later has to backtrack. This is typical NFA regex behavior. A good description of that behavior is provided on the Alternation page at regular-expressions.info.

Note that RegexOptions.RightToLeft only makes the regex engine examine the input string from right to left, the modifier does not impact how the regex engine processes the pattern itself.

Let me illustrate: if you have a (aaa|bb|a) regex and try to find a match in bbac using Regex.Match, the value you will obtain is bb, because bb matches at the very start of the string and the a only appears after it. If you use Regex.Matches, you will get all matches, and both bb and a will land in your results.

Also, the fact that the regex pattern is examined from left to right makes it clear that inside a non-anchored alternation group, the order of alternatives matters. If you use a (a|aa|aaa) regex to match against abbccaa, the first alternative, a, will match each a in the string, and the longer alternatives are never used. Once you add word boundaries around the group, you can place the alternatives in any order.
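Both illustrations can be reproduced in Python, where re.search plays the role of Regex.Match and re.findall the role of Regex.Matches. This is only a sketch: the string a aa aaa is a hypothetical input chosen so that the word-boundary version has something to match.

```python
import re

# First match versus all matches for (aaa|bb|a) in "bbac".
first = re.search(r"aaa|bb|a", "bbac").group(0)
all_matches = re.findall(r"aaa|bb|a", "bbac")

# Unanchored: the single "a" alternative wins at every position.
unanchored = re.findall(r"a|aa|aaa", "abbccaa")

# With word boundaries, a too-short alternative fails at the trailing \b,
# so the engine backtracks to a longer one and the order stops mattering.
sample = "a aa aaa"  # hypothetical input
bounded_short_first = re.findall(r"\b(?:a|aa|aaa)\b", sample)
bounded_long_first = re.findall(r"\b(?:aaa|aa|a)\b", sample)

print(first, all_matches)                    # bb ['bb', 'a']
print(unanchored)                            # ['a', 'a', 'a']
print(bounded_short_first == bounded_long_first)  # True
```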

Does order not matter in regular expressions?


Why can't it be b*a(ab*ab*)*b* instead?

b*a(ab*ab*)*b* does not work because it requires the second a, if there is one, to come immediately after the first a: after b*a, the only way to consume another a is to enter the group, which itself starts with a. For example, abaa would not be matched by your proposed regex when it should be. Use the regex debugger on a site like Regex101 to see this for yourself.

On the other hand, moving the whole ab* part to the start (b*ab*(ab*ab*)*) works as well.

why it is (ab*ab*)* and not (b*ab*ab*)*?

(b*ab*ab*)* does work, but the first b* in the group is redundant: any b's at that point can already be matched by the last b* of the previous repetition of the group, or, on the first repetition, by the b* that sits before the group. The leading b* therefore never needs to match anything.
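These claims can be checked by brute force. The sketch below assumes, based on the patterns discussed, that the target language is "strings over a and b with an odd number of a's"; the question itself is not quoted here, so treat that as an assumption. The helper name matches_exactly_odd_a is made up for this example.

```python
import re
from itertools import product


def matches_exactly_odd_a(pattern: str, max_len: int = 8) -> bool:
    """Return True if pattern fully matches exactly those a/b strings
    (up to max_len characters) that contain an odd number of a's."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            expected = s.count("a") % 2 == 1  # assumed target language
            if (re.fullmatch(pattern, s) is not None) != expected:
                return False
    return True


reordered = matches_exactly_odd_a(r"b*ab*(ab*ab*)*")
with_leading_b = matches_exactly_odd_a(r"b*ab*(b*ab*ab*)*")
proposed = matches_exactly_odd_a(r"b*a(ab*ab*)*b*")
abaa_matched = re.fullmatch(r"b*a(ab*ab*)*b*", "abaa") is not None

print(reordered, with_leading_b, proposed, abaa_matched)
```

Under that assumption, the two working variants agree on every string up to the tested length, while the proposed pattern fails, with abaa as one counterexample.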

Does order really matter with | here in this parser?

There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.

When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).

The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).

Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):

badex :: Spec
badex = describe "Bad example" $ do
  it "Should fail" $
    shouldMatch
      exampleLineParser
      "| 3.4 |\n"
      [ ValueNumber 3.4 ]

If you swap the parsers, you get as a result ValueNumber 3.0: the integerParser (which is now first) succeeds parsing 3, but then the rest of the input gets discarded.

To give more context, we have to see where numberParser is used:

  1. numberParser is one of the alternatives of valueParser...
  2. which is used in exampleLineParser, where valueParser is followed by readThroughBar (and I mean the relevant piece of code is literally valueParser <* readThroughBar);
  3. readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).

So if valueParser succeeds parsing just 3, then the subsequent readThroughBar will happily consume and discard the rest of the cell, .4 |.

The explanation from the blogpost you quote is only partially correct:

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

(emphasis mine) You will only lose information if your parser actively discards it, which readThroughBar does here.

As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general). That ambiguity would not be a problem here if readThroughBar were less lenient, e.g., if it consumed only whitespace before |.

I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.
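The interaction between ordered choice, backtracking, and a lenient "read through bar" step can be sketched outside Haskell. The Python below is not the parser from the question and does not use regex-applicative. All names (integer, decimal, alt, lenient_bar, strict_bar, first_value) are hypothetical, and backtracking is modeled by having each parser yield its candidate results in order.

```python
import re
from typing import Callable, Iterator, Optional, Tuple

Parser = Callable[[str], Iterator[Tuple[float, str]]]


def integer(s: str) -> Iterator[Tuple[float, str]]:
    m = re.match(r"\d+", s)
    if m:
        yield float(m.group()), s[m.end():]


def decimal(s: str) -> Iterator[Tuple[float, str]]:
    m = re.match(r"\d+\.\d+", s)
    if m:
        yield float(m.group()), s[m.end():]


def alt(p: Parser, q: Parser) -> Parser:
    """Ordered choice: candidates from p come before candidates from q."""
    def parse(s: str) -> Iterator[Tuple[float, str]]:
        yield from p(s)
        yield from q(s)
    return parse


def lenient_bar(rest: str) -> bool:
    # Accepts (and would discard) anything up to the next '|'.
    return re.match(r"[^|\n]*\|", rest) is not None


def strict_bar(rest: str) -> bool:
    # Accepts only whitespace before the next '|'.
    return re.match(r"[ \t]*\|", rest) is not None


def first_value(number: Parser, bar: Callable[[str], bool], s: str) -> Optional[float]:
    """Return the first candidate whose remainder is accepted by bar."""
    for value, rest in number(s):
        if bar(rest):
            return value
    return None


cell = "3.4 |"
lenient_int_first = first_value(alt(integer, decimal), lenient_bar, cell)
lenient_dec_first = first_value(alt(decimal, integer), lenient_bar, cell)
strict_int_first = first_value(alt(integer, decimal), strict_bar, cell)

print(lenient_int_first, lenient_dec_first, strict_int_first)  # 3.0 3.4 3.4
```

With the lenient bar, the integer-first ordering silently returns 3.0 because the leftover .4 is accepted and thrown away. With the strict bar, the same ordering rejects that candidate and falls back to the decimal result.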

Does order of characters in character class matter in regular expressions

Within a character class, - is a range operator (as in [a-f] is the same as [abcdef]). So if you want to include an actual - in your character class, it must be the first or last character (or be escaped as \-).

Therefore, your first example will match + / * -, while your second will match + / * - , and . as well.
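The original patterns from the question are not shown here, so the following Python sketch uses two hypothetical character classes that reproduce the described behavior. In ASCII, the range from + to / covers + , - . and /.

```python
import re

sample = "+,-./*"

# Hyphen last: it is a literal, so the class is exactly + / * -
literal_hyphen = re.findall(r"[+/*-]", sample)

# Hyphen between + and /: it forms the range + through /,
# which also includes , and .
range_hyphen = re.findall(r"[+-/*]", sample)

print(literal_hyphen)  # ['+', '-', '/', '*']
print(range_hyphen)    # ['+', ',', '-', '.', '/', '*']
```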

JS Regex: (a|b) Matches the a or the b part of the subexpression. Different behaviour when a and b switched

The difference is in what the regex tries to match first. The left expression takes precedence. The right is only tried when the left one fails to match.


