Poor Man's "Lexer" for C#

Poor man's lexer for C#

The original version I posted here as an answer had a problem in that it only worked while there was more than one "Regex" that matched the current expression. That is, as soon as only one Regex matched, it would return a token - whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".

The only solution that sits on top of Regex is to read the input line-by-line (which means you cannot have tokens that span multiple lines). I can live with this - it is, after all, a poor man's lexer! Besides, it's usually useful to get line number information out of the Lexer in any case.

So, here's a new version that addresses these issues. Credit also goes to this

using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

sealed class RegexMatcher : IMatcher
{
    private readonly Regex regex;

    public RegexMatcher(string regex) => this.regex = new Regex(string.Format("^{0}", regex));

    public int Match(string text)
    {
        var m = regex.Match(text);
        return m.Success ? m.Length : 0;
    }

    public override string ToString() => regex.ToString();
}

public sealed class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public sealed class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    private string lineRemaining;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    private void nextLine()
    {
        // Read the next line, skipping empty lines; lineRemaining becomes
        // null once the end of the input is reached.
        do
        {
            lineRemaining = reader.ReadLine();
            ++LineNumber;
            Position = 0;
        } while (lineRemaining != null && lineRemaining.Length == 0);
    }

    public bool Next()
    {
        if (lineRemaining == null)
            return false;

        // Try each definition in order; the first one that matches wins.
        foreach (var def in tokenDefinitions)
        {
            var matched = def.Matcher.Match(lineRemaining);
            if (matched > 0)
            {
                Position += matched;
                Token = def.Token;
                TokenContents = lineRemaining.Substring(0, matched);
                lineRemaining = lineRemaining.Substring(matched);
                if (lineRemaining.Length == 0)
                    nextLine();

                return true;
            }
        }
        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
            LineNumber, Position, lineRemaining));
    }

    public string TokenContents { get; private set; }
    public object Token { get; private set; }
    public int LineNumber { get; private set; }
    public int Position { get; private set; }

    public void Dispose() => reader.Dispose();
}

Example program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to Steven Levithan for this great quoted string regex
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);

Output:

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )

Simple lexical parser

I suggest using Regular Expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]*(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - @"[!\.\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a \).
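
For illustration, here is a minimal sketch of how such patterns might be combined with .NET's Regex class; the input string and the exact pattern set are made up for the example:

using System;
using System.Text.RegularExpressions;

class RegexTokenDemo
{
    static void Main()
    {
        string input = "It costs 12.50 dollars - really?!";

        // Words (letters and dashes), numbers (with an optional decimal part)
        // and some punctuation, combined into a single alternation.
        var pattern = new Regex(@"[a-zA-Z\-]+|[0-9]*(\.[0-9]+)?|[!\.\?]+");

        foreach (Match m in pattern.Matches(input))
        {
            // The number alternative can match an empty string, so skip those.
            if (m.Length > 0)
                Console.WriteLine(m.Value);
        }
    }
}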

Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words and numbers or whatever else you need to find out what expressions, exactly you need.

EDIT:

Or see Justin's answer for the particular regexes.

Use of Goto within lexer/parser

Within parsers the use of GOTO is perfectly reasonable. When you get down to a base level, the loops, conditions, etc. are all implemented as gotos, because that is what processors can do - "take the next instruction to be executed from here".

The only problem with gotos, and the reason they are so often demonised, is that they can be an indication of unstructured code, coming from unstructured thinking. Within modern high-level languages, there is no need for gotos, because all of the facilities are available to structure code well, and well-structured code implies at least some structured thinking.

So use gotos if they are needed. Don't use them just because you can't be bothered to think things through properly.

Interpreter Arguments

What you are trying to achieve is easily done with a Lexer. There are many lexers for C# (just do a quick google search), but if you want to learn, you can build one yourself.

You asked about the first step of a lexer's work, called tokenization: correctly splitting a string into smaller strings called tokens, taking into account things like single/double quotes, escape characters, variable expansions, and so on.

Tokenization is a simple task and you will find tons of ready-to-use libraries. The process goes roughly like this (a minimal sketch follows the list):

  • scan the input string character by character
  • mark boundaries for each token
  • extract substrings (tokens) to an array
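
As an illustration only (the method name and the exact quoting rules are my own assumptions, not any particular library), a character-by-character tokenizer along those lines might look like this:

using System;
using System.Collections.Generic;
using System.Text;

static class ArgTokenizer
{
    // Minimal character-by-character tokenizer: splits on whitespace,
    // honours single/double quotes and backslash escapes.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        char quote = '\0';       // active quote character, or '\0' when outside quotes
        bool hasToken = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (c == '\\' && i + 1 < input.Length)
            {
                // Escape: take the next character literally.
                current.Append(input[++i]);
                hasToken = true;
            }
            else if (quote != '\0')
            {
                // Inside quotes: everything except the closing quote is literal.
                if (c == quote) quote = '\0';
                else current.Append(c);
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;
                hasToken = true;  // allows empty quoted tokens such as ""
            }
            else if (char.IsWhiteSpace(c))
            {
                // Token boundary: emit whatever has accumulated.
                if (hasToken)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                    hasToken = false;
                }
            }
            else
            {
                current.Append(c);
                hasToken = true;
            }
        }

        if (hasToken)
            tokens.Add(current.ToString());
        return tokens;
    }
}

For example, Tokenize("echo \"hello world\" a\\ b") yields the three tokens echo, hello world and a b.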

Is a loop more efficient than regex in a lexer?

Given the architecture of the computers we use, a regex ends up being implemented with loops.

If the code is structured, it will be a combination of something like a switch statement within a while statement, with the cases in the switch representing the states of the Deterministic Finite Automaton that recognizes the same language as the regex.
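
As a minimal hand-written sketch (the regex [a-z]+[0-9]* and the state numbering are purely illustrative), that structure looks something like this:

// DFA for the illustrative regex [a-z]+[0-9]*, written as a switch
// statement inside a while loop; each case is one state of the automaton.
static bool MatchesLettersThenDigits(string s)
{
    int state = 0;                        // 0 = start, 1 = in letters, 2 = in digits
    int i = 0;
    while (i < s.Length)
    {
        char c = s[i++];
        switch (state)
        {
            case 0:
                if (c >= 'a' && c <= 'z') state = 1;
                else return false;
                break;
            case 1:
                if (c >= 'a' && c <= 'z') state = 1;
                else if (c >= '0' && c <= '9') state = 2;
                else return false;
                break;
            case 2:
                if (c >= '0' && c <= '9') state = 2;
                else return false;
                break;
        }
    }
    return state == 1 || state == 2;      // accepting states
}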

If goto is allowed, the implementation can be much more efficient than what a generic regex library can do.

Unless you have specific efficiency needs, sticking to a regex library should be efficient enough, and it will save you lots of programming (debugging) time.

What's the specification of a lexer?

That's a good question.

There is no formal definition of a lexical analyser; it's a pattern, not a standard. And the use of maximal munch is a heuristic, not an absolute rule. But it's a pretty good heuristic: maximal munch is almost always the Right Thing To Do.

But "almost always" is not "always", and many languages have exceptions to maximal munch.

One pretty common example, which goes back (at least) to Modula-2, is the range operator .., generally used with integer operands so that 0..99 indicates the range from 0 to 99. (The semantics varies from language to language.) In most languages, maximal munch would interpret 0..99 as two floating point constants (0. and .99) rather than the three tokens making up a range (0, .. and 99). So that has to be handled as an exception to maximal munch.

C++ has another exception. When digraphs were added to the language, it became possible to write [ as <:; unlike trigraphs, digraphs are just tokens so maximal munch should apply. However, :: is used in C++ as a namespace separator, and ::name indicates "name in the top-level namespace" (as opposed to namespaces active because of a using declaration). Many already existing programs included template expansions using this syntax -- Template<::classname> -- and these would have become syntax errors if the <: were suddenly interpreted as the syntactic equivalent of [. So an exception was made to maximal munch for the sequence <::, which must be the two tokens < :: unless the next character is > or :. So we have <::a → < :: a, but <:::a → <: :: a and <::> → <: :> (that is, []).

And there are other examples.

(F)lex handles these exceptions by using a slightly more flexible definition of "matching" which still depends on the maximal munch heuristic.

A (f)lex pattern can contain a single (unparenthesized) / operator. / is usually called the "trailing context" operator, but that's not precise. The / does not change the characters matched; its effect is to exclude the part of the match following the / from the matched token. So a (f)lex scanner for C++ might include the rules:

<            { return T_LT; }
<:           { return T_LT_COLON; }
</::[^:>]    { return T_LT; }

As a result, the input <::a would match all three patterns. Because of maximal munch, the last rule will win, but the token returned from that match is just < rather than the full four-character match. That means that the next scan will start by rescanning ::a (resulting in the detection of a :: token). On the other hand, the input <:::a will not match the last pattern, so the scanner will have to backtrack to the longest pattern actually matched: <:.

A similar trick can be used to ensure that 2..3 in a language with ranges (Swift, for example) does not match a floating point pattern:

[[:digit:]]+"."[[:digit:]] { ... return T_FLOAT; }
[[:digit:]]+/".." { ... return T_INTEGER; }

Again, the longest match for 2..3 is the second match, so maximal munch selects it. But the token returned is just 2; the trailing context is rescanned for the next token.

However, there is another class of maximal munch exceptions which can only be handled by making lexical analysis dependent on syntactic context. For example, EcmaScript (and other languages, including Awk) allow regular expressions literals which start with a / character. However, / might also be a division operator or the first character of the /= operator. (It might also start a comment, but since a regular expression cannot start with * that case is unproblematic.) The language syntax effectively defines two mutually exclusive contexts:

  • in "operator" context, an infix or postfix operator is possible and an operand such as a variable or literal is not.
  • in "operand" context, an operand or prefix operator is possible, but other operators are not.

So the parser always knows whether the next token might be a division operator, and when it might be a regular expression. But it doesn't have a good way to communicate that fact to the lexer. (The fact that the parser depends on lookahead tokens complicates the scenario further: for the parser to call the lexical analyser with an indication of whether to use the operator or regular expression rule, the parser would need to know that before it reads the lookahead token.)

Another apparent set of syntactic exceptions to maximal munch is ambiguous double close brackets, of which probably the most infamous is the C++ template argument. When templates were first added to C++, we needed to be careful to avoid closing two open template specializations without an intervening space, to avoid >> being recognised as a right-shift operator. That was annoying, and there was no good reason for it; there are very few expressions in which >> could be interpreted in two different ways, and in almost all such expressions, one of the possibilities is pretty contrived. So, to the relief of most C++ programmers (and the annoyance of C++ lexical analyser writers), the standard was eventually changed to allow >> to close two open template specializations instead of forcing it to be interpreted as a syntactically (or semantically) invalid right shift. (Note that this case doesn't have to be handled by the lexer. The lexer can return a single >> token, leaving it to the parser to use that single token to close two open specializations. That complicates the grammar a bit, but not unbearably.)

So, maximal munch is not quite a universal. Moreover, from time to time proposals come up for lexers which can handle ambiguous lexicons by producing multiple possible token streams, much in the same way that GLR/GLL parsers can produce multiple possible parse trees. Indeed, one often-suggested possibility is the "scannerless" parser, in which lexical analysis is simply integrated into the grammar. If a GLR/GLL parser is available, the fact that lexical analysis tends to blow up the need for lookahead is no longer a show-stopper. (Allegedly, scannerless parsing works well with PEG grammars, as well.)

On the other hand, processing parallel token streams can dramatically increase parsing time, even to the extent of requiring time quadratic in the length of the input (although this shouldn't happen with practical grammars). And I think that leads us back to the example in the original question.

All of the examples presented above show deviations from maximal munch (to the extent that they are deviations from maximal munch) involving two consecutive tokens, which increases the complexity of lexical analysis but shouldn't have much impact on the computational complexity. But in the grammar suggested in the OP, recognising a run of nine a's as starting with an aaa token rather than aaaa requires looking ahead more than one token (since runs of both eight and ten a's should start with aaaa).

In addition to placing possible additional computational demands on the parser, this non-local disambiguation presents a cognitive load on the human reader (or writer), who is often best-served by a parser whose functioning is easy to predict, even if it sometimes fails to find a potential parse.

The C example of a+++++b springs to mind; occasionally the question arises as to why this is not parsed as a++ + ++b, which is the only possible meaningful parse. I think the answer here is not so much "because maximal munch", as "because most code readers wouldn't see that parse". One of the great advantages of maximal munch, beyond its computational value, is that it is easy to explain and easy to predict.

Non-regex alternative to word boundaries

Not sure why my initial question is being downvoted, as to me it seems rather clear. I was not after getting my Regexes fixed, as profiling showed that even the simplest Regex still takes more time than I want. It may be a poor man's lexer, but I still want it to perform as well as possible.

The question, however, was whether .NET has a built-in alternative to word boundaries and, if not, how I would go about implementing it myself WITHOUT using Regex.

The answer to the first question appears to be No.

As for the second, I wrote an extension method for the char type:

public static bool IsWordCharacter(this char character)
{
    return (
        (character >= 'a' && character <= 'z') ||
        (character >= 'A' && character <= 'Z') ||
        (character >= '0' && character <= '9') ||
        character == '_');
}

According to most Regex documentation, this mimics the \w flag (negating this method with ! results in \W, obviously), which in turn is used in \b, but without matching it in the result.

I then use this in a method something like this:

return
    text.StartsWith(<needle>, StringComparison.Ordinal)
    && !text[<length of needle>].IsWordCharacter()
        ? <length of needle>
        : 0;

After which my underlying code knows if it has to use or drop the token.
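
For a concrete (hypothetical) example, a matcher for the keyword return built on the extension method above might look like this; the keyword and the method name are made up for illustration:

// Hypothetical matcher: match the keyword "return" only when it is not
// followed by a word character, using IsWordCharacter as a poor man's \b.
public static int MatchReturnKeyword(string text)
{
    const string needle = "return";

    if (!text.StartsWith(needle, StringComparison.Ordinal))
        return 0;

    // A match at the very end of the text is fine; otherwise the next
    // character must not be part of a word.
    if (text.Length == needle.Length || !text[needle.Length].IsWordCharacter())
        return needle.Length;

    return 0;
}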

Disclaimer: I'm aware it's not a full implementation of \b, but it serves my purpose.

Also, after having converted all my Regexes in this way, I went from 250ms to a mere 50ms for exactly the same file. Lexing all 110 script files I have in my possession takes less than a second in total, averaging roughly 7ms per file.

Accumulating Lexer errors in ANTLR 4

First, the reason you are getting this exception is because you need to declare Class2 as follows:

class Class2 : IAntlrErrorListener<int> 

in order to match the class signature of Lexer defined as follows:

public abstract class Lexer : Recognizer<int, LexerATNSimulator>, ITokenSource

The method signature for AddErrorListener is defined in Recognizer to match the Symbol type from the class definition as follows:

public abstract class Recognizer<Symbol, ATNInterpreter> : IRecognizer

This means that Lexer specifies "int" as the "Symbol" when extending Recognizer, and so the AddErrorListener method requires an IAntlrErrorListener of type int.

Now, the real answer to your question: don't write an error listener for a lexer. Errors in the lexing are the fault of the lex grammar, not of the user input. Error listeners are useful for handling bad user input in a Parser. Lexer grammars should be tested for coverage during development - i.e., all possible input will match some token. If you encounter a lex error during development, the default error listener will do just fine.

For more information, see the answer here: ANTLR4 Lexer error reporting (length of offending characters)

Antlr lexer support for range of occurrences?

No, there is not. Not in ANTLR v3, nor in the future (now in beta) ANTLR v4.

You could use a predicate [1] to (manually) count the number of chars the rule has matched, and stop matching after a predefined number.


[1] What is a 'semantic predicate' in ANTLR?


