Why Won't a Longer Token in an Alternation Be Matched

Why won't a longer token in an alternation be matched?

Your assumption that regex prefers the longer alternative in an alternation is incorrect; alternation is ordered, not longest-first.

If you have a bit of time, let's look at how your regex works...

Quick refresher on how regex works: the state machine always reads from left to right, backtracking where necessary.

There are two pointers, one on the Pattern:

(cdefghijkl|bcd)

The other on your String:

abcdefghijklmnopqrstuvw

The pointer on the String moves from the left, and the engine returns as soon as it can: at each position, the alternatives are tried in order before the pointer advances.

Let's turn that into a more "sequential" trace for understanding:

    Position 0 ('a'): try cdefghijkl - fail; try bcd - fail; advance.
    Position 1 ('b'): try cdefghijkl - fail; try bcd - match!

The engine returns bcd. The longer alternative cdefghijkl would only start matching at position 2, but the engine never looks there, because a match has already been found at an earlier position: the leftmost match wins, not the longest.

Your foobar example is a different topic. As I mentioned in this post:

How regex works: The state machine always reads from left to right. ,|,, == , as it will always only match the first alternative.

    That's good, Unihedron, but how do I force it to the first alternation?

Look!*

^(?:.*?\Kcdefghijkl|.*?\Kbcd)

Here, have a regex demo.

This regex first attempts to match the first alternation anywhere in the string; only if that fails completely will it attempt the second alternation. \K resets the start of the reported match, so whatever was consumed before it (here, the lazy .*?) is excluded from the returned match.
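
If you want to try this outside of a demo site, here's a minimal sketch. I'm assuming Python's third-party regex module here (an assumption on my part; the stdlib re module doesn't support \K):

import regex  # third-party module: pip install regex

s = 'abcdefghijklmnopqrstuvw'

# Plain alternation: the leftmost match wins, so the shorter bcd is
# found at position 1 before cdefghijkl is ever tried at position 2.
print(regex.search(r'cdefghijkl|bcd', s).group())                 # bcd

# The \K trick: scan the whole string for the first alternative, and
# only fall back to the second if that scan fails entirely.
print(regex.search(r'^(?:.*?\Kcdefghijkl|.*?\Kbcd)', s).group())  # cdefghijkl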


*: \K has been supported in Ruby since 2.0.0.

Read more:

  • The Stack Overflow Regex Reference
  • On greedy vs non-greedy






Ah, I was bored, so I optimized the regex:

^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)

You can see a demo here. Instead of scanning with a lazy .*?, this version skips ahead in possessive chunks that provably cannot begin the target word, so the engine never backtracks while searching.
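
To check that the optimized pattern behaves identically (same assumption as above: Python's regex module, which supports possessive quantifiers and \K):

import regex

s = 'abcdefghijklmnopqrstuvw'
simple    = r'^(?:.*?\Kcdefghijkl|.*?\Kbcd)'
optimized = r'^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)'

print(regex.search(simple, s).group())     # cdefghijkl
print(regex.search(optimized, s).group())  # cdefghijkl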

Raku regex: Inconsistent longest token matching

There are two things at work here.

The first is the meaning of "longest token". When there is an alternation (using | or implied by use of proto regexes), the declarative prefix of each branch is extracted. Declarative means the subset of the Raku regex language that can be matched by a finite state machine. The declarative prefix is determined by taking regex elements until a non-declarative element is encountered. You can read more and find some further references in the docs.

To understand why things are this way, a small detour may be helpful. A common approach to building parsers is to write a tokenizer, which breaks the input text up into a sequence of "tokens", and then a parser that identifies larger (and perhaps recursive) structure from those tokens. Tokenizing is typically performed using a finite state machine, since it is able to rapidly cut down the search space. With Raku grammars, we don't write the tokenizer ourselves; instead, it's automatically extracted from the grammar for us (more precisely, a tokenizer is calculated per alternation point).
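
To make the contrast concrete, here is a toy sketch in Python (purely illustrative, and an assumption of mine; it is not how Rakudo actually implements this, which involves compiling an NFA per alternation point):

# First-match alternation returns the first branch that matches at the
# current position; longest-token matching tries every branch and keeps
# the longest match, so declaration order stops mattering.
BRANCHES = ["<", "<=", "<=>"]

def first_match(text, pos=0):
    for branch in BRANCHES:
        if text.startswith(branch, pos):
            return branch
    return None

def longest_token(text, pos=0):
    hits = [b for b in BRANCHES if text.startswith(b, pos)]
    return max(hits, key=len, default=None)

print(first_match("<=> $x"))    # <    (order decides)
print(longest_token("<=> $x"))  # <=>  (length decides)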

Secondly, Raku regexes are a nested language within the main Raku language, parsed in a single pass with it and compiled at the same time. (This is a departure from most languages, where regexes are provided as a library that we pass strings to.) The longest token calculation takes place at compile time. However, variables are interpolated at runtime. Therefore, a variable interpolation in a regex is non-declarative, and therefore is not considered as part of the longest token matching.

Priority in regex manipulating

How regex works: The state machine always reads from left to right. ,|,, == , as only the first alternative will ever be matched: wherever a single comma can match, the first branch succeeds immediately and the second branch ,, is never tried.

,,|, == ,, where two commas are available: the engine tries ,, first and falls back to , only when it cannot match a second comma.

However, you should use ,,? instead so there's no backtracking: a comma with an optional second comma matches the same strings without the engine having to try and abandon alternatives.
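
A quick way to see all three behaviours side by side (a sketch using Python's re module; the engine choice is my assumption, but any leftmost-first engine behaves the same):

import re

s = "a,,b,c"
print(re.findall(r',|,,', s))  # [',', ',', ','] - the ,, branch is never used
print(re.findall(r',,|,', s))  # [',,', ','] - the longer branch is tried first
print(re.findall(r',,?', s))   # [',,', ','] - same result, no alternation needed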

capture in a string everything that is not a token

Your regular expression for T_STRING most certainly doesn't do what you want. What it does do is a little more difficult to answer.

In principle, it consists only of two zero-length assertions: ^, which is only true at the beginning of the string (unless you provide the re.MULTILINE flag, which you don't), and a long negative lookahead assertion.

A pattern which consists only of zero-length assertions can only match the empty string, if it matches anything at all. But lexer patterns cannot be allowed to match the empty string. Lexers divide the input into a series of tokens, so that every character in the input belongs to some token. Each match -- and they are all matches, not searches -- starts precisely at the end of the previous match. So if a pattern could match the empty string, the lexer would try the next match at the same place, with the same result, which would be an endless loop.
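
Here is a tiny illustration of the problem (a sketch in Python; the zero-width pattern is mine, not from your code):

import re

# A zero-width pattern matches without consuming input, so a naive
# tokenizer loop that resumes at match.end() would never advance.
pat = re.compile(r'(?=a)')   # zero-length lookahead
m = pat.match("abc", 0)
print(m.start(), m.end())    # 0 0 - the next match would start at 0 again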

Some lexer generators solve this problem by forcing a minimum one-character match using a built-in catch-all error pattern, but Ply simply refuses to generate a lexer if a pattern matches the empty string. Yet Ply does not complain about this lexer specification. The only possible explanation is that the pattern cannot match anything.

The key is that Ply compiles all patterns using the re.VERBOSE flag, which allows you to separate items in regular expressions with whitespace, making the regexes slightly less unreadable. As the Python documentation indicates:

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.

This ignoring extends to newlines, and comments (starting with a # character) are stripped as well, so you can split patterns over several lines and insert comments about each piece.

We could do that, in fact, with your pattern:

def t_STRING(t):
    r'''^            # Anchor this match at the beginning of the input
        (?!          # Don't match if the next characters match:
            \)   |   # Close parenthesis
            \(   |   # Open parenthesis
                 |   # !!! HERE IS THE PROBLEM: a bare space, invisible in VERBOSE mode
            \t   |   # Tab character
            \n   |   # Newline character
            \\\/ |   # \/ token
            \/\\     # /\ token
        )
    '''
    t.value = t
    return t

As I added whitespace and comments to your pattern, I noticed that the original pattern attempted to match a space character as an alternative, written | |. But since the pattern is compiled with re.VERBOSE, that space character is ignored, leaving an empty alternative, which matches the empty string. That alternative sits inside a negative lookahead assertion, and a negative lookahead fails whenever its contents match. Of course, every string starts with the empty string, so the negative lookahead always fails. That explains why Ply didn't complain, and also why the pattern never matches anything.
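
You can watch the pattern fail for yourself (a small sketch; re.VERBOSE is how Ply compiles every rule):

import re

# The original pattern, compiled the way Ply compiles it. The space
# alternative vanishes under re.VERBOSE, leaving an empty branch that
# matches everywhere, so the negative lookahead always fails.
pat = re.compile(r'^(?!\)|\(| |\t|\n|\\\/|\/\\)', re.VERBOSE)
print(pat.match("hello"))  # None
print(pat.match(""))       # None - it cannot match anything at all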

Regardless of that particular glitch, the pattern cannot be useful: as mentioned already, a lexer pattern must match at least one character. What we want is to match any character, provided that the negative lookahead (corrected, as below) allows it. So the negative lookahead assertion should be followed by ., which will match the next character.

But you almost certainly don't want to match just one character. Presumably you wanted to match a string of characters which don't match any other token. So that means putting the negative lookahead assertion and the following . into a repetition. And remember that it needs to be a non-empty repetition (+, not *), because patterns must not have empty matches.

Finally, there is absolutely no point using an anchor assertion, because that would limit the pattern to matching only at the beginning of the input, and that is certainly not what you want. It's not at all clear what it is doing there. (I've seen recommendations which suggest using an anchor with a negative lookahead search, which I think are generally misguided, but that discussion is out of scope for this question.)

And before we write the pattern, let's make one more adjustment: in a Python regular expression, if you can replace a set of alternatives with a character class, you should do so because it is a lot more efficient. That's true even if only some of the alternatives can be replaced.

So that produces the following:

def t_STRING(t):
    r'''(                # Match one or more characters:
          (?!            # Don't match if the next characters match:
            [() \t\n] |  # Parentheses or whitespace
            \\\/      |  # \/ token
            \/\\         # /\ token
          ) .            # If none of the above match, accept a character
        )+               # and repeat as many times as possible (at least once)
    '''
    return t

I removed t.value = t. t is a token object, not a string, and the value should be the string it matched. If you overwrite the value with a circular reference, you won't be able to figure out which string was matched.

This works, but not quite in the way you intended. Since whitespace characters are excluded from T_STRING, you don't get a single token representing (/ 1 3) <= x_4. Instead, you get a series of tokens:

STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44

But I think that's reasonable. How could the lexer tell that the parentheses in (x_2 <= 2 and (b_3 are parenthesis tokens, while the parentheses in (/ 1 3) <= x_4 are part of T_STRING? That determination will need to be made in your parser.

In fact, my inclination would be to fully tokenise the input, even if you don't (yet) require a complete tokenisation. As this entire question and answer shows, attempting to recognise "everything but..." can be a lot more complicated than just recognising all the tokens. Trying to get the tokeniser to figure out which tokens are useful and which are not is often harder than tokenising everything and letting the parser decide.
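
Since the pieces above are scattered, here is a complete, runnable sketch. The token names and the rules for the /\, \/ and parenthesis tokens are my assumptions, inferred from the sample output above rather than taken from your grammar:

import ply.lex as lex

tokens = ('AND', 'OR', 'LPAREN', 'RPAREN', 'STRING')

t_AND    = r'/\\'    # the /\ token
t_OR     = r'\\/'    # the \/ token
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t\n'   # skipped between tokens

def t_STRING(t):
    r'''((?![() \t\n]|\\\/|\/\\).)+'''
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input(r'b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))')
for tok in lexer:
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

Running it on the sample input reproduces the token stream shown above.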

specifying regular expression preference (order of precedence important)

No offense, but... Before I give you the answer, let me point out the many fixes your regex needs.

While you're attempting a dual-case match, [Ii][Dd][Rr] isn't a good idea: write idr as usual and turn on the case-insensitive flag i instead.

Using \d over [0-9] makes the world happier.

Also, your Price entry is Price : 185,000, but the subpattern ([Pp][Rr][Ii][Cc][Ee]:?\s*[0-9.,]+) won't capture it because of the space before the colon. Add \s* before the colon.

See also:

  • Reference - What does this regex mean?
  • Regular-Expressions.info
  • RexEgg

Now back to placing precedence into account. You can use the same technique from this other answer of mine, which makes your regex:

/^.*?\Kidr.?\s*[\d.,]+|
.*?\Krp.?\s*[\d.,]+|
.*?\Kprice\s*:?\s*[\d.,]+|
.*?\K\s\d+\s?k\s|
.*?\K\d+rb|
.*?\K[0-9.,]+\s*ribu|
.*?\K\b\d+[.,]\d+[.,]?\d+/xis

Regex101 Demo
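
If you want to check the precedence behaviour outside the demo, here's a sketch of the first three branches; I'm assuming Python's third-party regex module (the stdlib re doesn't support \K):

import regex

pattern = regex.compile(r'''
    ^.*?\K idr .? \s* [\d.,]+       |
     .*?\K rp .? \s* [\d.,]+        |
     .*?\K price \s* :? \s* [\d.,]+
''', regex.X | regex.I | regex.S)

print(pattern.search("Price : 185,000").group())  # Price : 185,000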

Using alternation or character class for single character matching?

Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.

I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower. (A quick benchmark sketch follows the reasoning below.)

My reasoning (without ever having written a regex engine, so this is pure conjecture):

The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"

(a|b|c) however tells the regex engine to

  • remember the current position in the string for backtracking, if necessary
  • check if it's possible to match a. If so, success. If not:
  • check if it's possible to match b. If so, success. If not:
  • check if it's possible to match c. If so, success. If not:
  • give up.
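
Here's a quick, unscientific benchmark sketch (Python's re module is my assumption; absolute numbers will vary with engine, version and input, so treat it as a way to test the claim rather than a result):

import re
import timeit

text = "z" * 100 + "x"
char_class  = re.compile(r'[efx]')
alternation = re.compile(r'(e|f|x)')

print(timeit.timeit(lambda: char_class.search(text), number=100_000))
print(timeit.timeit(lambda: alternation.search(text), number=100_000))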

Does antlr4 memoize tokens?

ANTLR will recognize the 1st expr and then, if it doesn't find a BitwiseAnd, it will look for a BitwiseXor to try to match the second alternative. It won't backtrack all the way to trying to recognize the 1st expr again. It's not exactly memoization, but you get the same benefit (arguably even better).

You may find it useful to have ANTLR generate the ATN for your grammar. Use the -atn option when running the antlr4 command; this will generate *.dot files for each of your rules (both Lexer and Parser). You can then use Graphviz to render them to SVG, PDF, etc. They may look a bit intimidating at first glance, but spend a moment with them and you'll gain a LOT of insight into how ANTLR goes about parsing your input.
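
For example (the grammar name here is a placeholder, and the exact names of the .dot files ANTLR emits will vary by rule):

antlr4 -atn Expr.g4
dot -Tsvg expr.dot -o expr.svg   # render one rule's ATN with Graphviz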

The second place to look is the generated parser code. It too is much more understandable than you might expect (especially if you read it with the ATN graph handy).


