Parse Math Expression

How to parse mathematical expressions involving parentheses

Ages ago when working on a simple graphing app, I used this algorithm (which is reasonably easy to understand and works great for simple math expressions like these) to first turn the expression into RPN and then calculated the result. RPN was nice and fast to execute for different variable values.

Of course, language parsing is a very wide topic and there are many other ways of going about it (and pre-made tools for it too)

parsing math expression in c++

Use the Shunting-yard algorithm. The wikipedia description is quite comprehensive, I hope it will suffice.

You can also try to write a formal grammar, for example a parsing-expression grammar, and use a tool to generate a parser. This site about PEGs lists 3 C/C++ libraries for PEG parsing.

Parse Math Expression

You can try using DataTable.Compute.

A related one is DataColumn.Expression.

Also check out: Doing math in vb.net like Eval in javascript

Note: I haven't used these myself.

Parsing mathematical expression string

Tokenization is the first step towards syntactic analysis. As such, it is probably a good idea to encapsulate it into a layer of abstraction, a function that yields one token at a time. This function also discards white space and combines consecutive digits into numbers. It is hard to delegate these steps to strtok(). Finally, the function may rewrap single characters into more structured information.

#include <ctype.h>
#include <stdio.h>

#define T_NUMBER 0
#define T_OPERATOR 1
#define T_BRACKET 2

typedef struct {
int type;
int value;
} token;

/* Reads the next token from stdin. Returns 1 on success, 0 on EOL. */
int next_token(token *t) {
char c;

/* discard spaces silently */
do {
c = getchar();
} while (c == ' ');

if (isdigit(c)) {
t->type = T_NUMBER;
t->value = 0;
do {
t->value = t->value * 10 + (c - '0');
c = getchar();
} while (isdigit(c));
ungetc(c, stdin); /* save the non-digit for next time */

} else if (c == '+' || c == '-' || c == '*' || c == '/') {
t->type = T_OPERATOR;
t->value = c;

} else if (c == '(' || c == ')') {
t->type = T_BRACKET;
t->value = c;

} else if (c == '\n') {
ungetc(c, stdin); /* make sure we always return 0 from now on */
return 0;

} else {
/* handle illegal character */

}

return 1;
}

int main() {

token t;

while (next_token(&t)) {
switch (t.type) {
case T_NUMBER: printf("number %d\n", t.value); break;
case T_OPERATOR: printf("operator %c\n", t.value); break;
case T_BRACKET: printf("bracket %c\n", t.value); break;
}
}

return 0;
}

Sample run:

(22+   37 * ( 1534-9)) + 18
bracket (
number 22
operator +
number 37
operator *
bracket (
number 1534
operator -
number 9
bracket )
bracket )
operator +
number 18

Equation (expression) parser with precedence?


The hard way

You want a recursive descent parser.

To get precedence you need to think recursively, for example, using your sample string,

1+11*5

to do this manually, you would have to read the 1, then see the plus and start a whole new recursive parse "session" starting with 11... and make sure to parse the 11 * 5 into its own factor, yielding a parse tree with 1 + (11 * 5).

This all feels so painful even to attempt to explain, especially with the added powerlessness of C. See, after parsing the 11, if the * was actually a + instead, you would have to abandon the attempt at making a term and instead parse the 11 itself as a factor. My head is already exploding. It's possible with the recursive decent strategy, but there is a better way...

The easy (right) way

If you use a GPL tool like Bison, you probably don't need to worry about licensing issues since the C code generated by bison is not covered by the GPL (IANAL but I'm pretty sure GPL tools don't force the GPL on generated code/binaries; for example Apple compiles code like say, Aperture with GCC and they sell it without having to GPL said code).

Download Bison (or something equivalent, ANTLR, etc.).

There is usually some sample code that you can just run bison on and get your desired C code that demonstrates this four function calculator:

http://www.gnu.org/software/bison/manual/html_node/Infix-Calc.html

Look at the generated code, and see that this is not as easy as it sounds. Also, the advantages of using a tool like Bison are 1) you learn something (especially if you read the Dragon book and learn about grammars), 2) you avoid NIH trying to reinvent the wheel. With a real parser-generator tool, you actually have a hope at scaling up later, showing other people you know that parsers are the domain of parsing tools.


Update:

People here have offered much sound advice. My only warning against skipping the parsing tools or just using the Shunting Yard algorithm or a hand rolled recursive decent parser is that little toy languages1 may someday turn into big actual languages with functions (sin, cos, log) and variables, conditions and for loops.

Flex/Bison may very well be overkill for a small, simple interpreter, but a one off parser+evaluator may cause trouble down the line when changes need to be made or features need to be added. Your situation will vary and you will need to use your judgement; just don't punish other people for your sins [2] and build a less than adequate tool.

My favorite tool for parsing

The best tool in the world for the job is the Parsec library (for recursive decent parsers) which comes with the programming language Haskell. It looks a lot like BNF, or like some specialized tool or domain specific language for parsing (sample code [3]), but it is in fact just a regular library in Haskell, meaning that it compiles in the same build step as the rest of your Haskell code, and you can write arbitrary Haskell code and call that within your parser, and you can mix and match other libraries all in the same code. (Embedding a parsing language like this in a language other than Haskell results in loads of syntactic cruft, by the way. I did this in C# and it works quite well but it is not so pretty and succinct.)

Notes:

1 Richard Stallman says, in Why you should not use Tcl

The principal lesson of Emacs is that
a language for extensions should not
be a mere "extension language". It
should be a real programming language,
designed for writing and maintaining
substantial programs. Because people
will want to do that!

[2] Yes, I am forever scarred from using that "language".

Also note that when I submitted this entry, the preview was correct, but SO's less than adequate parser ate my close anchor tag on the first paragraph, proving that parsers are not something to be trifled with because if you use regexes and one off hacks you will probably get something subtle and small wrong.

[3] Snippet of a Haskell parser using Parsec: a four function calculator extended with exponents, parentheses, whitespace for multiplication, and constants (like pi and e).

aexpr   =   expr `chainl1` toOp
expr = optChainl1 term addop (toScalar 0)
term = factor `chainl1` mulop
factor = sexpr `chainr1` powop
sexpr = parens aexpr
<|> scalar
<|> ident

powop = sym "^" >>= return . (B Pow)
<|> sym "^-" >>= return . (\x y -> B Pow x (B Sub (toScalar 0) y))

toOp = sym "->" >>= return . (B To)

mulop = sym "*" >>= return . (B Mul)
<|> sym "/" >>= return . (B Div)
<|> sym "%" >>= return . (B Mod)
<|> return . (B Mul)

addop = sym "+" >>= return . (B Add)
<|> sym "-" >>= return . (B Sub)

scalar = number >>= return . toScalar

ident = literal >>= return . Lit

parens p = do
lparen
result <- p
rparen
return result

parsing math expression in python and solving to find an answer

If I wasn't going to rely on external libraries, I'd do it something like this:

def parse(x):
operators = set('+-*/')
op_out = [] #This holds the operators that are found in the string (left to right)
num_out = [] #this holds the non-operators that are found in the string (left to right)
buff = []
for c in x: #examine 1 character at a time
if c in operators:
#found an operator. Everything we've accumulated in `buff` is
#a single "number". Join it together and put it in `num_out`.
num_out.append(''.join(buff))
buff = []
op_out.append(c)
else:
#not an operator. Just accumulate this character in buff.
buff.append(c)
num_out.append(''.join(buff))
return num_out,op_out

print parse('3/2*15')

It's not the most elegant, but it gets you the pieces in a reasonable data structure (as far as I'm concerned anyway)

Now code to actually parse and evaluate the numbers -- This will do everything in floating point, but would be easy enough to change ...

import operator
def my_eval(nums,ops):

nums = list(nums)
ops = list(ops)
operator_order = ('*/','+-') #precedence from left to right. operators at same index have same precendece.
#map operators to functions.
op_dict = {'*':operator.mul,
'/':operator.div,
'+':operator.add,
'-':operator.sub}
Value = None
for op in operator_order: #Loop over precedence levels
while any(o in ops for o in op): #Operator with this precedence level exists
idx,oo = next((i,o) for i,o in enumerate(ops) if o in op) #Next operator with this precedence
ops.pop(idx) #remove this operator from the operator list
values = map(float,nums[idx:idx+2]) #here I just assume float for everything
value = op_dict[oo](*values)
nums[idx:idx+2] = [value] #clear out those indices

return nums[0]

print my_eval(*parse('3/2*15'))

Parse mathematical expressions with pyparsing

The actual name for this parsing problem is "infix notation" (and in recent versions of pyparsing, I am renaming operatorPrecedence to infixNotation). To see the typical implementation of infix notation parsing, look at the fourFn.py example on the pyparsing wiki. There you will see an implementation of this simplified BNF to implement 4-function arithmetic, with precedence of operations:

operand :: integer or real number
factor :: operand | '(' expr ')'
term :: factor ( ('*' | '/') factor )*
expr :: term ( ('+' | '-') term )*

So an expression is one or more terms separated by addition or subtraction operations.

A term is one or more factors separated by multiplication or division operations.

A factor is either a lowest-level operand (in this case, just integers or reals), OR an expr enclosed in ()'s.

Note that this is a recursive parser, since factor is used indirectly in the definition of expr, but expr is also used to define factor.

In pyparsing, this looks roughly like this (assuming that integer and real have already been defined):

LPAR,RPAR = map(Suppress, '()')
expr = Forward()
operand = real | integer
factor = operand | Group(LPAR + expr + RPAR)
term = factor + ZeroOrMore( oneOf('* /') + factor )
expr <<= term + ZeroOrMore( oneOf('+ -') + term )

Now using expr, you can parse any of these:

3
3+2
3+2*4
(3+2)*4

The infixNotation pyparsing helper method takes care of all the recursive definitions and groupings, and lets you define this as:

expr = infixNotation(operand,
[
(oneOf('* /'), 2, opAssoc.LEFT),
(oneOf('+ -'), 2, opAssoc.LEFT),
])

But this obscures all the underlying theory, so if you are trying to understand how this is implemented, look at the raw solution in fourFn.py.

Parsing complex mathematical expressions

You can use an operator precedence parser, such as the infamous Shunting Yard Algorithm. (This answer might also be helpful. Or maybe not.)

The shunting-yard algorithm is often presented (including in that Wikipedia article) as a way of "translating infix to reverse polish". It's not. It's a parsing algorithm, and it can easily produce an AST, or three-address code, or whatever else you want to generate with it. But I agree that an AST is the best option.

To modify the algorithm on Wikipedia to an algorithm which produces an syntax tree, do the following:

  • Replace what they call "the output queue" with a stack of tree nodes.
  • When they "push a value onto the output queue", create an AST node of the value (a literal or an identifier, most likely), and push it onto the stack.}
  • When they "push an operator onto the output queue": the nodes on the top of the stack are the child nodes of the operator syntax node. So pop them (remembering that the one on the top of the stack is the rightmost argument, for operators which take more than one operand) and combine them into a new operator node. Then push this node back onto the stack.
  • At the end of the parse, the stack will contain exactly one element (if all operators were used with the correct number of arguments), which is the AST node for the entire expression.

Many languages which allow operator syntax to be defined in the source code use exactly this algorithm.

Having done a couple of those myself, let me offer a couple of pieces of advice which you are free to ignore:

  1. Although these days Unicode offers the possibility to use all sorts of single "characters" as mathematical operators, the reality is that ⊛ is not all that easy to type, and a library which insisted you used it for some function might not be all that popular with users. Of course, you could try writing your own cross-platform Unicode-aware IDE, but that strikes me as a very deep and convoluted rabbit hole, a maze of twisty passages all alike. So it's quite likely that you'll want to allow that to be typed as something like (*). But then you have the problem that you don't know what the tokens are until you read the operator declarations. There are no truly satisfying solutions (until someone writes the aforementioned IDE), and most of the languages I've seen which allow custom operators insist that they be separated with whitespace, after having designated some set of characters which can be used as operator characters.

  2. As anyone who has written bitwise expressions in C or C++ knows, having a lot of different operator precedences is not all that user friendly. Most of us don't have intuitions which guide us in understanding how x + 2 << n ^ 1 | 7 is intended to be parsed, and most style guides require you (reasonably, IMHO) to add redundant parentheses even if you know precisely what that expressions means, because people reading your code probably don't. ((((x + 2) << n) ^ 1) | 7)). When the operators have just been added ad hoc to the syntax, intuitions are probably even less helpful. But while only a few might guess that xor comes between and and or, or that all bitwise operators except ~ bind more tightly than booleans, it probably is common to understand that multiplication-type operators have higher precedence than additive-type operators in the same domain.

    Anyway, the point I intended to make is that declarations like "⋇ has precedence 5", while perfectly reasonable in the library which defines ⋇, doesn't compose well, because "5" is a completely arbitrary number. Perhaps there is a different library intended for a different domain, which declares that "⋈ has precedence 3". The authors of the two libraries probably never came together to decide what the relative precedence of ⋇ and ⋈ are, and the fact that one used 5 and the other 3 does not indicate that one of those operators "is intended" to bind more tightly than the other one. Personally, I prefer a style where operator precedences are defined relative to other operators in the same module ("∷ has the same precedence as ⋕ and higher precedence than ⊹".) From those declarations, you could build a partial ordering, which is sufficient for operator-precedence parsing; if the parser finds two unparenthesized operators which do not have a declared precedence order, it should complain rather than just using a comparison between arbitrary numbers out of any context. But that's just my opinion.)



Related Topics



Leave a reply



Submit