Order of Regular Expression Operator (..|.. ... ..|..)

Alternatives are tried from left to right, and the first alternative that matches "wins"; the remaining ones are not tried. This is typical behavior for a backtracking (NFA) regex engine. The Alternation page at regular-expressions.info gives a good description of it.

Note that RegexOptions.RightToLeft only makes the regex engine examine the input string from right to left; the modifier does not change how the regex engine processes the pattern itself.

Let me illustrate: if you have a (aaa|bb|a) regex and try to find a match in bbac using Regex.Match, the value you will obtain is bb. At the start of the string the aaa alternative fails, bb matches, and the a alternative, which appears after bb, is never tried at that position. If you use Regex.Matches, you will get all matches, and both bb and a will land in your results.
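
The same experiment can be sketched in Python, assuming its re module as a stand-in for .NET's Regex class (re.search and re.finditer play the roles of Regex.Match and Regex.Matches here; this is an illustration, not the .NET API):

```python
import re

pattern = re.compile(r"(aaa|bb|a)")
text = "bbac"

# Like Regex.Match: only the first match, found at the leftmost position.
first = pattern.search(text).group()

# Like Regex.Matches: every non-overlapping match, scanning left to right.
all_matches = [m.group() for m in pattern.finditer(text)]

print(first)        # bb
print(all_matches)  # ['bb', 'a']
```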

Also, because the regex pattern is examined from left to right, the order of alternatives matters inside a non-anchored alternation group. If you use a (a|aa|aaa) regex to match against abbccaa, the first alternative, a, will match each a in the string, and the longer alternatives are never used (see the regex demo). Once you add word boundaries, you can place the alternatives in any order (see one more regex demo).
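
A small sketch of that effect, again using Python's re module as an assumed stand-in. The input "a aa aaa" is a hypothetical string chosen so that every run of a is a separate word and the word boundaries have something to anchor to:

```python
import re

# Without boundaries, the first alternative wins at every "a".
unanchored = re.findall(r"a|aa|aaa", "abbccaa")

# Hypothetical input where each run of "a" is a separate word.
words = "a aa aaa"
short_first = re.findall(r"\b(?:a|aa|aaa)\b", words)
long_first = re.findall(r"\b(?:aaa|aa|a)\b", words)

print(unanchored)   # ['a', 'a', 'a']
print(short_first)  # ['a', 'aa', 'aaa']
print(long_first)   # ['a', 'aa', 'aaa']
```

With the boundaries in place, a shorter alternative that stops in the middle of a word fails the trailing \b, so the engine backtracks into the next alternative and the order no longer changes the result.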

Operator precedence in regular expressions

Given the Oracle doc:

Table 4-2 lists the list of metacharacters supported for use in regular expressions passed to SQL regular expression functions and conditions. These metacharacters conform to the POSIX standard; any differences in behavior from the standard are noted in the "Description" column.

And taking a look at the | value in that table:

The expression a|b matches character a or character b.

Plus taking a look at the POSIX doc:

Operator precedence
The order of precedence of operators is as follows:

  1. Collation-related bracket symbols [==] [::] [..]

  2. Escaped characters \

  3. Character set (bracket expression) []

  4. Grouping ()

  5. Single-character-ERE duplication * + ? {m,n}

  6. Concatenation

  7. Anchoring ^$

  8. Alternation |

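Since alternation has the lowest precedence, below both duplication and concatenation, I would say that H|ha+ is grouped as (?:H)|(?:ha+), that is, either H or ha+, and not as (?:H|h)a+.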
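
A quick way to check the grouping is sketched below. It uses Python's re module, which is not a POSIX or Oracle engine, on the assumption that it applies the same relative precedence to alternation, concatenation and +; the helper name whole is made up for this example:

```python
import re

pattern = re.compile(r"H|ha+")

def whole(s):
    """Return True when the entire string s matches the pattern."""
    return pattern.fullmatch(s) is not None

print(whole("H"))     # True
print(whole("haaa"))  # True
print(whole("Haaa"))  # False
```

If the pattern meant (H|h)a+, then Haaa would match as a whole and H alone would not.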

How is the AND/OR operator represented in Regular Expressions?

I'm going to assume you want to build the regex dynamically so that it can contain words other than part1 and part2, and that you want the order not to matter. If so, you can use something like this:

((^|, )(part1|part2|part3))+$

Positive matches:

part1
part2, part1
part1, part2, part3

Negative matches:

part1,           //with and without trailing spaces.
part3, part2,
otherpart1
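
Building that pattern dynamically could look roughly like the sketch below. It is written in Python as an assumption about the target language, and build_list_regex is a hypothetical helper name, not part of any library:

```python
import re

def build_list_regex(words):
    """Build ((^|, )(w1|w2|...))+$ from a list of literal words."""
    # Escape each word so regex metacharacters in it are taken literally.
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"((^|, )(" + alternatives + r"))+$")

regex = build_list_regex(["part1", "part2", "part3"])

positive = ["part1", "part2, part1", "part1, part2, part3"]
negative = ["part1,", "part1, ", "part3, part2,", "otherpart1"]

for s in positive + negative:
    print(repr(s), bool(regex.search(s)))
```

The strings in positive and negative are the examples listed above, with "part1," tested both with and without a trailing space.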

Regex AND operator

It is impossible for both (?=foo) and (?=baz) to match at the same position. That would require the next character to be both f and b simultaneously, which cannot happen.

Perhaps you want this instead:

(?=.*foo)(?=.*baz)

This says that foo must appear anywhere and baz must appear anywhere, not necessarily in that order and possibly overlapping (although overlapping is not possible in this specific case because the letters themselves don't overlap).
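
The difference between the two forms can be sketched as follows, using Python's re module as an assumed stand-in (the sample strings are made up for illustration):

```python
import re

# Both lookaheads must succeed at the very same position: never possible.
both_here = re.compile(r"(?=foo)(?=baz)")

# Each lookahead may find its word anywhere ahead of the current position.
both_anywhere = re.compile(r"(?=.*foo)(?=.*baz)")

samples = ["foo baz", "baz then foo", "only foo"]
here_results = [bool(both_here.search(s)) for s in samples]
anywhere_results = [bool(both_anywhere.search(s)) for s in samples]

print(here_results)      # [False, False, False]
print(anywhere_results)  # [True, True, False]
```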

Why does the order of alternatives matter in regex?

The regular expression engine tries the alternatives in the order in which they are specified. So when the pattern is (foo|foobar)&? it matches foo immediately, the optional & matches nothing, and that match is complete. The engine then continues looking for further matches. The next bit of the input string is bar& b, which cannot be matched.

In other words, because foo is a prefix of foobar, (foo|foobar) will never match foobar in this pattern: it always matches foo first, and nothing after the group forces the engine to backtrack into the second alternative.

Occasionally, this can be a very useful trick, actually. The pattern (o|a|(\w)) will allow you to capture \w and a or o differently:

Regex.Replace("a foobar& b", "(o|a|(\\w))", "$2") // " fbr& b" (the leading space is kept)
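
The ordering effect and the capture trick above can be reproduced with the following sketch. It uses Python's re module rather than .NET, which is an assumption of this example; re.sub with a callback stands in for Regex.Replace with "$2":

```python
import re

text = "a foobar& b"

# foo is listed first, so it wins and foobar is never tried.
short_first = re.search(r"(foo|foobar)&?", text).group()

# Listing the longer alternative first lets it match.
long_first = re.search(r"(foobar|foo)&?", text).group()

# Keep only group 2: o and a are matched by the earlier alternatives,
# so group 2 is empty for them and they are dropped.
kept = re.sub(r"(o|a|(\w))", lambda m: m.group(2) or "", text)

print(short_first)  # foo
print(long_first)   # foobar&
print(repr(kept))   # ' fbr& b'
```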

Regular expressions - what determines the precedence of a conditional?

+? is a lazy quantifier, meaning that it tries to match as few characters as possible before going further.

Normally, quantifiers are greedy: they try to match as much as possible, from left to right, and if the rest of the expression fails, they backtrack to a shorter match. Lazy quantifiers do the opposite: they try to match as few characters as possible, and if the rest of the expression doesn't match, they expand the current match.

So, the first part, (\b\w+?), will try to match 1 character (g), and see if what follows is an es or an s, and a word boundary. Since that fails, it adds one more letter, and so on, until the first part matches glass. In this phase, the second part does match the remaining es.

If you replace that with a greedy quantifier, as in (\b\w+)(?=(?:es|s)\b), it will go the other way around. First, it assigns glasses to the first part, (\b\w+), but then fails to find an additional es or s, so it backtracks to glasse, which succeeds because the remaining s matches the second part of the expression.
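
Both variants can be compared side by side. The sketch below uses Python's re module, on the assumption that it treats lazy and greedy quantifiers the same way for these patterns:

```python
import re

word = "glasses"

# Lazy: grow the capture one character at a time until the lookahead succeeds.
lazy = re.search(r"(\b\w+?)(?=(?:es|s)\b)", word).group(1)

# Greedy: take the whole word, then give back characters until it succeeds.
greedy = re.search(r"(\b\w+)(?=(?:es|s)\b)", word).group(1)

print(lazy)    # glass
print(greedy)  # glasse
```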

JavaScript regex: why is alternation not ordered?

The alternative graph does not match starting at the third character, but the alternative photograph does. The engine proceeds through the string from left to right.

The ordering you refer to in the question applies when alternatives match from a common starting point in the string. Otherwise, while proceeding through the "haystack" string, the alternatives are all considered. If there's a single match starting from a particular character, then the rest of the regex will proceed with that (and may of course backtrack later).
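
Both cases can be sketched as follows. The input string "a photograph" is hypothetical, and Python's re module is used as an assumed stand-in for the JavaScript engine:

```python
import re

text = "a photograph"  # hypothetical input

# Different starting points: the match that begins earliest in the string
# is found first, regardless of the order of the alternatives.
earliest = re.search(r"graph|photograph", text)

# Common starting point: the alternative listed first is the one used.
same_start = re.search(r"photo|photograph", text)

print(earliest.group(), earliest.start())  # photograph 2
print(same_start.group())                  # photo
```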

When several alternatives match from the same character in the source, the engine does not prefer the longer one. The ECMAScript specification says that the left alternative is tried first, and the right one is tried only if the left alternative, followed by the rest of the pattern, fails to match. So the order in which the alternatives are written decides, and the engine backtracks into a later alternative only when it has to.


