Antlr: Is There a Simple Example

ANTLR: Is there a simple example?

Note: this answer is for ANTLR3! If you're looking for an ANTLR4 example, then this Q&A demonstrates how to create a simple expression parser, and evaluator using ANTLR4.


You first create a grammar. Below is a small grammar that you can use to evaluate expressions that are built using the 4 basic math operators: +, -, * and /. You can also group expressions using parenthesis.

Note that this grammar is just a very basic one: it does not handle unary operators (the minus in: -1+9) or decimals like .99 (without a leading number), to name just two shortcomings. This is just an example you can work on yourself.

Here's the contents of the grammar file Exp.g:

grammar Exp;

/* This will be the entry point of our parser. */
eval
: additionExp
;

/* Addition and subtraction have the lowest precedence. */
additionExp
: multiplyExp
( '+' multiplyExp
| '-' multiplyExp
)*
;

/* Multiplication and division have a higher precedence. */
multiplyExp
: atomExp
( '*' atomExp
| '/' atomExp
)*
;

/* An expression atom is the smallest part of an expression: a number. Or
when we encounter parenthesis, we're making a recursive call back to the
rule 'additionExp'. As you can see, an 'atomExp' has the highest precedence. */
atomExp
: Number
| '(' additionExp ')'
;

/* A number: can be an integer value, or a decimal value */
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;

/* We're going to ignore all white space characters */
WS
: (' ' | '\t' | '\r'| '\n') {$channel=HIDDEN;}
;

(Parser rules start with a lower case letter, and lexer rules start with a capital letter)

After creating the grammar, you'll want to generate a parser and lexer from it. Download the ANTLR jar and store it in the same directory as your grammar file.

Execute the following command on your shell/command prompt:

java -cp antlr-3.2.jar org.antlr.Tool Exp.g

It should not produce any error message, and the files ExpLexer.java, ExpParser.java and Exp.tokens should now be generated.

To see if it all works properly, create this test class:

import org.antlr.runtime.*;

public class ANTLRDemo {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("12*(5-6)");
ExpLexer lexer = new ExpLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExpParser parser = new ExpParser(tokens);
parser.eval();
}
}

and compile it:

// *nix/MacOS
javac -cp .:antlr-3.2.jar ANTLRDemo.java

// Windows
javac -cp .;antlr-3.2.jar ANTLRDemo.java

and then run it:

// *nix/MacOS
java -cp .:antlr-3.2.jar ANTLRDemo

// Windows
java -cp .;antlr-3.2.jar ANTLRDemo

If all goes well, nothing is being printed to the console. This means the parser did not find any error. When you change "12*(5-6)" into "12*(5-6" and then recompile and run it, there should be printed the following:

line 0:-1 mismatched input '<EOF>' expecting ')'

Okay, now we want to add a bit of Java code to the grammar so that the parser actually does something useful. Adding code can be done by placing { and } inside your grammar with some plain Java code inside it.

But first: all parser rules in the grammar file should return a primitive double value. You can do that by adding returns [double value] after each rule:

grammar Exp;

eval returns [double value]
: additionExp
;

additionExp returns [double value]
: multiplyExp
( '+' multiplyExp
| '-' multiplyExp
)*
;

// ...

which needs little explanation: every rule is expected to return a double value. Now to "interact" with the return value double value (which is NOT inside a plain Java code block {...}) from inside a code block, you'll need to add a dollar sign in front of value:

grammar Exp;

/* This will be the entry point of our parser. */
eval returns [double value]
: additionExp { /* plain code block! */ System.out.println("value equals: "+$value); }
;

// ...

Here's the grammar but now with the Java code added:

grammar Exp;

eval returns [double value]
: exp=additionExp {$value = $exp.value;}
;

additionExp returns [double value]
: m1=multiplyExp {$value = $m1.value;}
( '+' m2=multiplyExp {$value += $m2.value;}
| '-' m2=multiplyExp {$value -= $m2.value;}
)*
;

multiplyExp returns [double value]
: a1=atomExp {$value = $a1.value;}
( '*' a2=atomExp {$value *= $a2.value;}
| '/' a2=atomExp {$value /= $a2.value;}
)*
;

atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;

Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;

WS
: (' ' | '\t' | '\r'| '\n') {$channel=HIDDEN;}
;

and since our eval rule now returns a double, change your ANTLRDemo.java into this:

import org.antlr.runtime.*;

public class ANTLRDemo {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("12*(5-6)");
ExpLexer lexer = new ExpLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExpParser parser = new ExpParser(tokens);
System.out.println(parser.eval()); // print the value
}
}

Again (re) generate a fresh lexer and parser from your grammar (1), compile all classes (2) and run ANTLRDemo (3):

// *nix/MacOS
java -cp antlr-3.2.jar org.antlr.Tool Exp.g // 1
javac -cp .:antlr-3.2.jar ANTLRDemo.java // 2
java -cp .:antlr-3.2.jar ANTLRDemo // 3

// Windows
java -cp antlr-3.2.jar org.antlr.Tool Exp.g // 1
javac -cp .;antlr-3.2.jar ANTLRDemo.java // 2
java -cp .;antlr-3.2.jar ANTLRDemo // 3

and you'll now see the outcome of the expression 12*(5-6) printed to your console!

Again: this is a very brief explanation. I encourage you to browse the ANTLR wiki and read some tutorials and/or play a bit with what I just posted.

Good luck!

EDIT:

This post shows how to extend the example above so that a Map<String, Double> can be provided that holds variables in the provided expression.

To get this code working with a current version of Antlr (June 2014) I needed to make a few changes. ANTLRStringStream needed to become ANTLRInputStream, the returned value needed to change from parser.eval() to parser.eval().value, and I needed to remove the WS clause at the end, because attribute values such as $channel are no longer allowed to appear in lexer actions.

Is there a simple example of using antlr4 to create an AST from java source code and extract methods, variables and comments?

Here it is.

First, you're gonna buy the ANTLR4 book ;-)

Second, you'll download antlr4 jar and the java grammar (http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference)

Then, you can change the grammar a little bit, adding these to the header

    (...)
grammar Java;

options
{
language = Java;
}

// starting point for parsing a java file
compilationUnit
(...)

I'll change a little thing in the grammar just to illustrate something.

/*
methodDeclaration
: (type|'void') Identifier formalParameters ('[' ']')*
('throws' qualifiedNameList)?
( methodBody
| ';'
)
;
*/
methodDeclaration
: (type|'void') myMethodName formalParameters ('[' ']')*
('throws' qualifiedNameList)?
( methodBody
| ';'
)
;

myMethodName
: Identifier
;

You see, the original grammar does not let you identify the method identifier from any other identifier, so I've commented the original block and added a new one just to show you how to get what you want.

You'll have to do the same for other elements you want to retrieve, like the comments, that are currently being just skipped. That's for you :-)

Now, create a class like this to generate all the stubs

package mypackage;

public class Gen {

public static void main(String[] args) {
String[] arg0 = { "-visitor", "/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/Java.g4", "-package", "mypackage" };
org.antlr.v4.Tool.main(arg0);
}

}

Run Gen, and you'll get some java code created for you in mypackage.

Now create a Visitor. Actually, the visitor will parse itself in this example

package mypackage;

import java.io.FileInputStream;
import java.io.IOException;

import mypackage.JavaParser.MyMethodNameContext;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

/**
* @author Leonardo Kenji Feb 4, 2014
*/
public class MyVisitor extends JavaBaseVisitor<Void> {

/**
* Main Method
*
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
ANTLRInputStream input = new ANTLRInputStream(new FileInputStream("/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/MyVisitor.java")); // we'll
// parse
// this
// file
JavaLexer lexer = new JavaLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaParser parser = new JavaParser(tokens);
ParseTree tree = parser.compilationUnit(); // see the grammar ->
// starting point for
// parsing a java file

MyVisitor visitor = new MyVisitor(); // extends JavaBaseVisitor<Void>
// and overrides the methods
// you're interested
visitor.visit(tree);
}

/**
* some attribute comment
*/
private String someAttribute;

@Override
public Void visitMyMethodName(MyMethodNameContext ctx) {
System.out.println("Method name:" + ctx.getText());
return super.visitMyMethodName(ctx);
}

}

and that's it.

You'll get something like

Method name:main
Method name:visitMyMethodName

ps. one more thing. While I was writing this code in eclipse, I've got a strange exception. This is caused by Java 7 and can be fixed just adding these parameters to your compiler (thanks to this link http://java.dzone.com/articles/javalangverifyerror-expecting)

Sample Image

Antlr4 - Is there a simple example of using the ParseTree Walker?

For each of your parser rules in your grammar the generated parser will have a corresponding method with that name. Calling that method will start parsing at that rule.

Therefore if your "root-rule" is named start then you'd start parsing via gramParser.start() which returns a ParseTree. This tree can then be fed into the ParseTreeWalker alongside with the listener you want to be using.

All in all it could look something like this (EDITED BY OP):

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
import static org.antlr.v4.runtime.CharStreams.fromFileName;

public class launch{
public static void main(String[] args) {

CharStream cs = fromFileName("program.txt"); //load the file
gramLexer lexer = new gramLexer(cs); //instantiate a lexer
CommonTokenStream tokens = new CommonTokenStream(lexer); //scan stream for tokens
gramParser parser = new gramParser(tokens); //parse the tokens

ParseTree tree = parser.start(); // parse the content and get the tree
Mylistener listener = new Mylistener();

ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(listener,tree);
}}

************ NEW FILE Mylistener.java ************

public class Mylistener extends gramBaseListener {
@Override public void enterEveryRule(ParserRuleContext ctx) { //see gramBaseListener for allowed functions
System.out.println("rule entered: " + ctx.getText()); //code that executes per rule
}
}

Of course you have to replace <listener> with your implementation of BaseListener

And just one small sidenode: In Java it is convention to start classnames with capital letters and I'd advise you to stick to that in order for making the code more readable for other people.

ANTLR basic example in Java

From ANTLR here is the trivial example of parsing (and evaluating) an expression.

grammar Expr;

@header {
package test;
import java.util.HashMap;
}

@lexer::header {package test;}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

prog: stat+ ;

stat: expr NEWLINE {System.out.println($expr.value);}
| ID '=' expr NEWLINE
{memory.put($ID.text, new Integer($expr.value));}
| NEWLINE
;

expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
;

multExpr returns [int value]
: e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
;

atom returns [int value]
: INT {$value = Integer.parseInt($INT.text);}
| ID
{
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ) $value = v.intValue();
else System.err.println("undefined variable "+$ID.text);
}
| '(' e=expr ')' {$value = $e.value;}
;

ID : ('a'..'z'|'A'..'Z')+ ;
INT : '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;
WS : (' '|'\t')+ {skip();} ;

But like I mentioned in my comments, C++ is very hard to parse correctly. There are many ambiguities and requires * amount of look ahead (which ANTLR does provide). So doing this in any efficient form is complicated. That is why I recommend implementing something like PL/0 which was designed for students to write their first compiler for. Tiny BASIC is also a good start. Both of these can be implemented without using a tool like ANTLR by doing recursive descent. I have implemented both in under 1000 lines together (in C++ and C# respectively).

ANTLR is a great tool though, especially once you get your head wrapped around recursive descent you might want to upgrade to a more powerful parser. I recommend both of Terrence Parr's books, ANTLR Reference and Language Implementation Patterns. The ANTLR book will tell you everything (plus some) that you want to know about ANTLR. The second book will teach you all about parsers and compilers, from recursive descent to black-magic backtracking.

More resources from a similar SO question can be found here. And if you're into Lisp or Scheme, you can check out JScheme, it is written in Java (less than 1000 lines I believe).

ANTLR4 visitor pattern on simple arithmetic example

Sure, just label it differently. After all, the alternative '(' expr ')' isn't a #opExpr:

expr  : left=expr op=('*'|'/') right=expr #opExpr
| left=expr op=('+'|'-') right=expr #opExpr
| '(' expr ')' #parenExpr
| atom=INT #atomExpr
;

And in your visitor, you'd do something like this:

public class EvalVisitor extends ExpressionsBaseVisitor<Integer> {

@Override
public Integer visitOpExpr(@NotNull ExpressionsParser.OpExprContext ctx) {
int left = visit(ctx.left);
int right = visit(ctx.right);
String op = ctx.op.getText();
switch (op.charAt(0)) {
case '*': return left * right;
case '/': return left / right;
case '+': return left + right;
case '-': return left - right;
default: throw new IllegalArgumentException("Unknown operator " + op);
}
}

@Override
public Integer visitStart(@NotNull ExpressionsParser.StartContext ctx) {
return this.visit(ctx.expr());
}

@Override
public Integer visitAtomExpr(@NotNull ExpressionsParser.AtomExprContext ctx) {
return Integer.valueOf(ctx.getText());
}

@Override
public Integer visitParenExpr(@NotNull ExpressionsParser.ParenExprContext ctx) {
return this.visit(ctx.expr());
}

public static void main(String[] args) {
String expression = "2 * (3 + 4)";
ExpressionsLexer lexer = new ExpressionsLexer(CharStreams.fromString(expression));
ExpressionsParser parser = new ExpressionsParser(new CommonTokenStream(lexer));
ParseTree tree = parser.start();
Integer answer = new EvalVisitor().visit(tree);
System.out.printf("%s = %s\n", expression, answer);
}
}

If you run the class above, you'd see the following output:

2 * (3 + 4) = 14

Antlr grammar for parsing simple expression

You're using way too many lexer rules.

When you're defining a token like this:

BODY        : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;

then the tokenizer (lexer) will try to create the (single!) token: xxx,xxx,5,true. E.g. it does not allow any space in between it. Lexer rules (the ones starting with a capital) should really be the "atoms" of your language (the smallest parts). Whenever you start creating elements like a body, you glue atoms together in parser rules, not in lexer rules.

Try something like this:

grammar TermSpanNear;

// parser rules (the elements)
start : termpsan EOF ;
termpsan : termpsannear | 'xxx' ;
termpsannear : 'termspannear' OPENP body CLOSEP ;
body : termpsan COMMA termpsan COMMA SLOP COMMA ORDERED ;

// lexer rules (the atoms)
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ;

How to create AST with ANTLR4?

Ok, let's build a simple math example. Building an AST is totally overkill for such a task but it's a nice way to show the principle.

I'll do it in C# but the Java version would be very similar.

The grammar

First, let's write a very basic math grammar to work with:

grammar Math;

compileUnit
: expr EOF
;

expr
: '(' expr ')' # parensExpr
| op=('+'|'-') expr # unaryExpr
| left=expr op=('*'|'/') right=expr # infixExpr
| left=expr op=('+'|'-') right=expr # infixExpr
| func=ID '(' expr ')' # funcExpr
| value=NUM # numberExpr
;

OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';

NUM : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)?;
ID : [a-zA-Z]+;
WS : [ \t\r\n] -> channel(HIDDEN);

Pretty basic stuff, we have a single expr rule that handles everything (precedence rules etc).

The AST nodes

Then, let's define some AST nodes we'll use. These are totally custom and you can define them in the way you want to.

Here are the nodes we'll be using for this example:

internal abstract class ExpressionNode
{
}

internal abstract class InfixExpressionNode : ExpressionNode
{
public ExpressionNode Left { get; set; }
public ExpressionNode Right { get; set; }
}

internal class AdditionNode : InfixExpressionNode
{
}

internal class SubtractionNode : InfixExpressionNode
{
}

internal class MultiplicationNode : InfixExpressionNode
{
}

internal class DivisionNode : InfixExpressionNode
{
}

internal class NegateNode : ExpressionNode
{
public ExpressionNode InnerNode { get; set; }
}

internal class FunctionNode : ExpressionNode
{
public Func<double, double> Function { get; set; }
public ExpressionNode Argument { get; set; }
}

internal class NumberNode : ExpressionNode
{
public double Value { get; set; }
}

Converting a CST to an AST

ANTLR generated the CST nodes for us (the MathParser.*Context classes). We now have to convert these to AST nodes.

This is easily done with a visitor, and ANTLR provides us with a MathBaseVisitor<T> class, so let's work with that.

internal class BuildAstVisitor : MathBaseVisitor<ExpressionNode>
{
public override ExpressionNode VisitCompileUnit(MathParser.CompileUnitContext context)
{
return Visit(context.expr());
}

public override ExpressionNode VisitNumberExpr(MathParser.NumberExprContext context)
{
return new NumberNode
{
Value = double.Parse(context.value.Text, NumberStyles.AllowDecimalPoint | NumberStyles.AllowExponent)
};
}

public override ExpressionNode VisitParensExpr(MathParser.ParensExprContext context)
{
return Visit(context.expr());
}

public override ExpressionNode VisitInfixExpr(MathParser.InfixExprContext context)
{
InfixExpressionNode node;

switch (context.op.Type)
{
case MathLexer.OP_ADD:
node = new AdditionNode();
break;

case MathLexer.OP_SUB:
node = new SubtractionNode();
break;

case MathLexer.OP_MUL:
node = new MultiplicationNode();
break;

case MathLexer.OP_DIV:
node = new DivisionNode();
break;

default:
throw new NotSupportedException();
}

node.Left = Visit(context.left);
node.Right = Visit(context.right);

return node;
}

public override ExpressionNode VisitUnaryExpr(MathParser.UnaryExprContext context)
{
switch (context.op.Type)
{
case MathLexer.OP_ADD:
return Visit(context.expr());

case MathLexer.OP_SUB:
return new NegateNode
{
InnerNode = Visit(context.expr())
};

default:
throw new NotSupportedException();
}
}

public override ExpressionNode VisitFuncExpr(MathParser.FuncExprContext context)
{
var functionName = context.func.Text;

var func = typeof(Math)
.GetMethods(BindingFlags.Public | BindingFlags.Static)
.Where(m => m.ReturnType == typeof(double))
.Where(m => m.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] { typeof(double) }))
.FirstOrDefault(m => m.Name.Equals(functionName, StringComparison.OrdinalIgnoreCase));

if (func == null)
throw new NotSupportedException(string.Format("Function {0} is not supported", functionName));

return new FunctionNode
{
Function = (Func<double, double>)func.CreateDelegate(typeof(Func<double, double>)),
Argument = Visit(context.expr())
};
}
}

As you can see, it's just a matter of creating an AST node out of a CST node by using a visitor. The code should be pretty self-explanatory (well, maybe except for the VisitFuncExpr stuff, but it's just a quick way to wire up a delegate to a suitable method of the System.Math class).

And here you have the AST building stuff. That's all that's needed. Just extract the relevant information from the CST and keep it in the AST.

The AST visitor

Now, let's play a bit with the AST. We'll have to build an AST visitor base class to traverse it. Let's just do something similar to the AbstractParseTreeVisitor<T> provided by ANTLR.

internal abstract class AstVisitor<T>
{
public abstract T Visit(AdditionNode node);
public abstract T Visit(SubtractionNode node);
public abstract T Visit(MultiplicationNode node);
public abstract T Visit(DivisionNode node);
public abstract T Visit(NegateNode node);
public abstract T Visit(FunctionNode node);
public abstract T Visit(NumberNode node);

public T Visit(ExpressionNode node)
{
return Visit((dynamic)node);
}
}

Here, I took advantage of C#'s dynamic keyword to perform a double-dispatch in one line of code. In Java, you'll have to do the wiring yourself with a sequence of if statements like these:

if (node is AdditionNode) {
return Visit((AdditionNode)node);
} else if (node is SubtractionNode) {
return Visit((SubtractionNode)node);
} else if ...

But I just went for the shortcut for this example.

Work with the AST

So, what can we do with a math expression tree? Evaluate it, of course! Let's implement an expression evaluator:

internal class EvaluateExpressionVisitor : AstVisitor<double>
{
public override double Visit(AdditionNode node)
{
return Visit(node.Left) + Visit(node.Right);
}

public override double Visit(SubtractionNode node)
{
return Visit(node.Left) - Visit(node.Right);
}

public override double Visit(MultiplicationNode node)
{
return Visit(node.Left) * Visit(node.Right);
}

public override double Visit(DivisionNode node)
{
return Visit(node.Left) / Visit(node.Right);
}

public override double Visit(NegateNode node)
{
return -Visit(node.InnerNode);
}

public override double Visit(FunctionNode node)
{
return node.Function(Visit(node.Argument));
}

public override double Visit(NumberNode node)
{
return node.Value;
}
}

Pretty simple once we have an AST, isn't it?

Putting it all together

Last but not least, we have to actually write the main program:

internal class Program
{
private static void Main()
{
while (true)
{
Console.Write("> ");
var exprText = Console.ReadLine();

if (string.IsNullOrWhiteSpace(exprText))
break;

var inputStream = new AntlrInputStream(new StringReader(exprText));
var lexer = new MathLexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
var parser = new MathParser(tokenStream);

try
{
var cst = parser.compileUnit();
var ast = new BuildAstVisitor().VisitCompileUnit(cst);
var value = new EvaluateExpressionVisitor().Visit(ast);

Console.WriteLine("= {0}", value);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}

Console.WriteLine();
}
}
}

And now we can finally play with it:

Sample Image

Extending simple ANTLR grammar to support input variables

You could create a Map<String, Double> memory in your parser and introduce a Identifier in your grammar:

Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;

Then your atomExp parser rule would look like this:

atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| i=Identifier {$value = memory.get($i.text);} // <- added!
| '(' exp=additionExp ')' {$value = $exp.value;}
;

Here's a small (complete) demo:

grammar Exp;

@parser::members {

private java.util.HashMap<String, Double> memory = new java.util.HashMap<String, Double>();

public static Double eval(String expression) throws Exception {
return eval(expression, new java.util.HashMap<String, Double>());
}

public static Double eval(String expression, java.util.Map<String, Double> vars) throws Exception}

Related Topics



Leave a reply



Submit