How to Write a Parser in C#

How to write a Parser in C#?

I have implemented several parsers in C# - hand-written and tool generated.

A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.

You should look into some tools to generate the code for you - if you are determined to write a classical recursive descent parser (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now, that usually perform better - and have easier definitions (e.g. TDOP parsing or Monadic Parsing).

On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though its written in F#) and its performance is just shy of Google V8.

Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).

Writing an extremely simple parser

The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:

echo "Here is the date: " & date();

I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().

My token list now contains the following tokens:

echo

"Here is the date: "

&

date

()

Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is either of the date, string or math. Similarly, it checks the rest of the tokens. If at any point, a token is not supposed to be there, you can throw an error indicating syntax error or something.

For the math statement tokenization, only combine the expression that is contained in a bracket and rest of operands and operators separately. For example: 9/3 + (7-3+1) would have the tokens 9, /, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest priority token down to the lowest token priority. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.

Best/fastest way to write a parser in c#

I've had good experience with ANTLR v3. By far the biggest benefit is that it lets you write LL(*) parsers with infinite lookahead - these can be quite suboptimal, but the grammar can be written in the most straightforward and natural way with no need to refactor to work around parser limitations, and parser performance is often not a big deal (I hope you aren't writing a C++ compiler), especially in learning projects.

It also provides pretty good means of constructing meaningful ASTs without need to write any code - for every grammar production, you indicate the "crucial" token or sub-production, and that becomes a tree node. Or you can write a tree production.

Have a look at the following ANTLR grammars (listed here in order of increasing complexity) to get a gist of how it looks and feels

  • JSON grammar - with tree productions
  • Lua grammar
  • C grammar

What's the best way to write a parser by hand?

I would highly recommend the F# language as your language of choice for parsing on the .NET Platform. It's roots in the ML family of languages means it has excellent support for language-oriented programming.

Discriminated unions and pattern-matching allow for a very succinct and powerful specification of your AST. Higher-order functions allow for definition of parse operations and their composition. First-class support for monadic types allows for state management to be handled implicitly greatly simplifying the composition of parsers. Powerful type-inference greatly aides the definition of these (complex) types. And all of this can be specified and executed interactively allowing you to rapidly prototype.

Stephan Tolksdorf has put this into practice with his parser combinator library FParsec

From his examples we see how naturally an AST is specified:

type expr =
| Val of string
| Int of int
| Float of float
| Decr of expr

type stmt =
| Assign of string * expr
| While of expr * stmt
| Seq of stmt list
| IfThen of expr * stmt
| IfThenElse of expr * stmt * stmt
| Print of expr

type prog = Prog of stmt list

the implementation of the parser (partially elided) is just as succinct:

let stmt, stmtRef = createParserForwardedToRef()

let stmtList = sepBy1 stmt (ch ';')

let assign =
pipe2 id (str ":=" >>. expr) (fun id e -> Assign(id, e))

let print = str "print" >>. expr |>> Print

let pwhile =
pipe2 (str "while" >>. expr) (str "do" >>. stmt) (fun e s -> While(e, s))

let seq =
str "begin" >>. stmtList .>> str "end" |>> Seq

let ifthen =
pipe3 (str "if" >>. expr) (str "then" >>. stmt) (opt (str "else" >>. stmt))
(fun e s1 optS2 ->
match optS2 with
| None -> IfThen(e, s1)
| Some s2 -> IfThenElse(e, s1, s2))

do stmtRef:= choice [ifthen; pwhile; seq; print; assign]

let prog =
ws >>. stmtList .>> eof |>> Prog

On the second line, as an example, stmt and ch are parsers and sepBy1 is a monadic parser combinator that takes two simple parsers and returns a combination parser. In this case sepBy1 p sep returns a parser that parses one or more occurrences of p separated by sep. You can thus see how quickly a powerful parser can be combined from simple parsers. F#'s support for overridden operators also allow for concise infix notation e.g. the sequencing combinator and the choice combinator can be specified as >>. and <|>.

Best of luck,

Danny

Parsing a simple text grammar with Superpower

Step 1 of writing any Superpower parser is to figure out what the token kinds are. You have something like:

// ECL - Elevator Control Language ;-)
enum EclToken {
LParen,
RParen,
UpKeyword,
DownKeyword,
WaitKeyword,
AtSymbol,
Number,
Comma
}

Step 2, write a Tokenizer<EclToken>. This is left as a direct programming task by Superpower v1 - there aren't many helpers to lean on, you just need to write the code as in the examples.

The tokenizer takes the input string, strips out the whitespace, and figures out what the sequence of tokens is.

For your example input, the first line will be:

// (UP 100),
LParen, UpKeyword, Number, RParen, Comma

For tokens like Number that contain content, the span associated with the Result<EclToken> will point to the portion of the input string corresponding to the token. In this line, the number will be a TextSpan covering 100.

Step 3 is to figure out what you want to parse the input into. For programming languages with nested expressions, this is usually an AST. In the case of the ECL sample, it's pretty simple so you might cut it down to:

struct ElevatorCommand {        
public int Distance; // + or -
public bool IsRelative;
}

Step 4, the parser. This is usually embedded in a static class. The job of the parser is to build up more complex results (an ElevatorCommand[], here), from simpler results (number, movement).

This is where Superpower does the heavy lifting, particularly with respect to expectations and errors.

static class EclParser 
{
static TokenListParser<EclToken, int> Number =
Token.EqualTo(EclToken.Number).Apply(Numerics.IntegerInt32);
}

The first thing we do is define the parser for numbers; this one applies a built-in TextParser<int> to the content of an EclToken.Number span.

You can see more parsing machinery in this example.

A few more clues to help you find the way (not syntax checked, let alone compiled/tested):

    static TokenListParser<EclToken, ElevatorCommand> Up =
from _ in Token.EqualTo(EclToken.UpKeyword)
from distance in Number
select new ElevatorCommand {
Distance = distance,
IsRelative = false
};

static TokenListParser<EclToken, ElevatorCommand> Command =
from lp in Token.EqualTo(EclToken.LParen)
from command in Up // .Or(Down).Or(Wait)
from rp in Token.EqualTo(EclToken.RParen)
select command;

static TokenListParser<EclToken, ElevatorCommand[]> Commands =
Command.ManyDelimitedBy(Token.EqualTo(EclToken.Comma));
}

Commands is the completed parser that you can apply to the input.

It's best to build up the parser incrementally, testing each smaller parser on chunks of input they're expected to parse.

C# writing a first parser, early stackoverflow detection

Answer is explained better here: https://stackoverflow.com/a/20179940/5556724

The version of EBNF on Wikipedia is written like so:

rhs = identifier
| terminal
| "[" , rhs , "]"
| "{" , rhs , "}"
| "(" , rhs , ")"
| rhs , "|" , rhs
| rhs , "," , rhs ;

but should be written like so:

primary = identifier
| terminal
| "[" , rhs , "]"
| "{" , rhs , "}"
| "(" , rhs , ")"
;
factor = primary , { "|" , primary } ;
rhs = factor , { "," , factor } ;

otherwise there is a FIRST/FIRST conflict.

Custom Expression Parser in C#

This is an old problem that was solved a long time ago. The solution is to use a parser generator, not to write your own parser from scratch. There are many options available, one of the more popular ones being ANTLR.

Using a parser generator like ANTLR you can describe your problem using easy-to-understand EBNF-like production rules. The parser generator will generate the complicated logic it seems you are now trying to write by hand, in the language of your choosing, in your case C#. Probably the grammar of your language is already available in this format or you have used this format to describe the language to your users and in your development team.

Writing and polishing a CSV parser

As a programming exercise (for learning and gaining experience) it is probably a very reasonable thing to do. For production code, it may be better to use an existing library mainly because the work is already done. There are quite a few things to address with a CSV parser. For example (randomly off the top of my head):

  • Quoted values (strings)
  • Embedded quotes in quoted strings
  • Empty values (NULL ... or maybe even NULL vs. empty).
  • Lines without the correct number of entries
  • Headers vs. no headers.
  • Recognizing different data types (e.g., different date formats).

If you have a very specific input format in a very controlled environment, though, you may not need to deal with all of those.



Related Topics



Leave a reply



Submit