Writing a Simple Equation Parser

Writing a simple equation parser

You can use Shunting yard algorithm or Reverse Polish Notation, both of them are using stacks to handle this, wiki said it better than me.

From wiki,

While there are tokens to be read:

Read a token.
If the token is a number, then add it to the output queue.
If the token is a function token, then push it onto the stack.
If the token is a function argument separator (e.g., a comma):

Until the token at the top of the stack is a left parenthesis, pop operators off the stack onto the output queue. If no left parentheses are encountered, either the separator was misplaced or parentheses were mismatched.

If the token is an operator, o1, then:

while there is an operator token, o2, at the top of the stack, and

either o1 is left-associative and its precedence is less than or equal to that of o2,
or o1 is right-associative and its precedence is less than that of o2,

pop o2 off the stack, onto the output queue;

push o1 onto the stack.

If the token is a left parenthesis, then push it onto the stack.
If the token is a right parenthesis:

Until the token at the top of the stack is a left parenthesis, pop operators off the stack onto the output queue.
Pop the left parenthesis from the stack, but not onto the output queue.
If the token at the top of the stack is a function token, pop it onto the output queue.
If the stack runs out without finding a left parenthesis, then there are mismatched parentheses.

When there are no more tokens to read:

While there are still operator tokens in the stack:

If the operator token on the top of the stack is a parenthesis, then there are mismatched parentheses.
Pop the operator onto the output queue.

Exit.

Smart design of a math parser?

A pretty good approach would involve two steps. The first step involves converting the expression from infix to postfix (e.g. via Dijkstra's shunting yard) notation. Once that's done, it's pretty trivial to write a postfix evaluator.

Equation (expression) parser with precedence?

The hard way

You want a recursive descent parser.

To get precedence you need to think recursively, for example, using your sample string,

1+11*5

to do this manually, you would have to read the 1, then see the plus and start a whole new recursive parse "session" starting with 11... and make sure to parse the 11 * 5 into its own factor, yielding a parse tree with 1 + (11 * 5).

This all feels so painful even to attempt to explain, especially with the added powerlessness of C. See, after parsing the 11, if the * was actually a + instead, you would have to abandon the attempt at making a term and instead parse the 11 itself as a factor. My head is already exploding. It's possible with the recursive decent strategy, but there is a better way...

The easy (right) way

If you use a GPL tool like Bison, you probably don't need to worry about licensing issues since the C code generated by bison is not covered by the GPL (IANAL but I'm pretty sure GPL tools don't force the GPL on generated code/binaries; for example Apple compiles code like say, Aperture with GCC and they sell it without having to GPL said code).

Download Bison (or something equivalent, ANTLR, etc.).

There is usually some sample code that you can just run bison on and get your desired C code that demonstrates this four function calculator:

http://www.gnu.org/software/bison/manual/html_node/Infix-Calc.html

Look at the generated code, and see that this is not as easy as it sounds. Also, the advantages of using a tool like Bison are 1) you learn something (especially if you read the Dragon book and learn about grammars), 2) you avoid NIH trying to reinvent the wheel. With a real parser-generator tool, you actually have a hope at scaling up later, showing other people you know that parsers are the domain of parsing tools.


Update:

People here have offered much sound advice. My only warning against skipping the parsing tools or just using the Shunting Yard algorithm or a hand rolled recursive decent parser is that little toy languages1 may someday turn into big actual languages with functions (sin, cos, log) and variables, conditions and for loops.

Flex/Bison may very well be overkill for a small, simple interpreter, but a one off parser+evaluator may cause trouble down the line when changes need to be made or features need to be added. Your situation will vary and you will need to use your judgement; just don't punish other people for your sins [2] and build a less than adequate tool.

My favorite tool for parsing

The best tool in the world for the job is the Parsec library (for recursive decent parsers) which comes with the programming language Haskell. It looks a lot like BNF, or like some specialized tool or domain specific language for parsing (sample code [3]), but it is in fact just a regular library in Haskell, meaning that it compiles in the same build step as the rest of your Haskell code, and you can write arbitrary Haskell code and call that within your parser, and you can mix and match other libraries all in the same code. (Embedding a parsing language like this in a language other than Haskell results in loads of syntactic cruft, by the way. I did this in C# and it works quite well but it is not so pretty and succinct.)

Notes:

1 Richard Stallman says, in Why you should not use Tcl

The principal lesson of Emacs is that
a language for extensions should not
be a mere "extension language". It
should be a real programming language,
designed for writing and maintaining
substantial programs. Because people
will want to do that!

[2] Yes, I am forever scarred from using that "language".

Also note that when I submitted this entry, the preview was correct, but SO's less than adequate parser ate my close anchor tag on the first paragraph, proving that parsers are not something to be trifled with because if you use regexes and one off hacks you will probably get something subtle and small wrong.

[3] Snippet of a Haskell parser using Parsec: a four function calculator extended with exponents, parentheses, whitespace for multiplication, and constants (like pi and e).

aexpr   =   expr `chainl1` toOp
expr = optChainl1 term addop (toScalar 0)
term = factor `chainl1` mulop
factor = sexpr `chainr1` powop
sexpr = parens aexpr
<|> scalar
<|> ident

powop = sym "^" >>= return . (B Pow)
<|> sym "^-" >>= return . (\x y -> B Pow x (B Sub (toScalar 0) y))

toOp = sym "->" >>= return . (B To)

mulop = sym "*" >>= return . (B Mul)
<|> sym "/" >>= return . (B Div)
<|> sym "%" >>= return . (B Mod)
<|> return . (B Mul)

addop = sym "+" >>= return . (B Add)
<|> sym "-" >>= return . (B Sub)

scalar = number >>= return . toScalar

ident = literal >>= return . Lit

parens p = do
lparen
result <- p
rparen
return result

Techniques needed to write an arithmetic expression parser

Without knowledge of compilation techniques it would be ugly. But there is no need to learn a ton of compilation for an introductory example like this.

Look at something like http://www.codeproject.com/Articles/345888/How-to-write-a-simple-interpreter-in-JavaScript and see if it makes sense to you.

parsing math expression in c++

Use the Shunting-yard algorithm. The wikipedia description is quite comprehensive, I hope it will suffice.

You can also try to write a formal grammar, for example a parsing-expression grammar, and use a tool to generate a parser. This site about PEGs lists 3 C/C++ libraries for PEG parsing.

How to parse mathematical expressions involving parentheses

Ages ago when working on a simple graphing app, I used this algorithm (which is reasonably easy to understand and works great for simple math expressions like these) to first turn the expression into RPN and then calculated the result. RPN was nice and fast to execute for different variable values.

Of course, language parsing is a very wide topic and there are many other ways of going about it (and pre-made tools for it too)

Equation parsing in Python

Python's own internal compiler can parse this, if you use Python notation.

If your change the notation slightly, you'll be happier.

import compiler
eq= "sin(x)*x**2"
ast= compiler.parse( eq )

You get an abstract syntax tree that you can work with.

Writing an extremely simple parser

The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:

echo "Here is the date: " & date();

I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().

My token list now contains the following tokens:

echo

"Here is the date: "

&

date

()

Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is either of the date, string or math. Similarly, it checks the rest of the tokens. If at any point, a token is not supposed to be there, you can throw an error indicating syntax error or something.

For the math statement tokenization, only combine the expression that is contained in a bracket and rest of operands and operators separately. For example: 9/3 + (7-3+1) would have the tokens 9, /, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest priority token down to the lowest token priority. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.



Related Topics



Leave a reply



Submit