ANTLR - Writing a tree grammar for an AST

I have an AST outputted for some Lua code by my grammar file, which currently handles lexing and parsing for me. I want to add a tree grammar to this, but since I'm using C# I'm not sure how to do it. What's the basic process for generating tree grammar code when you already have a parser and lexer written?
UPDATE: I have the following grammar file:
tree grammar LuaGrammar;
options {
backtrack=true;
language=CSharp2;
//output=AST;
tokenVocab=Lua;
filter=true;
ASTLabelType=CommonTree;
}
@lexer::namespace{/*my namespace*/}
@parser::namespace{/*my namespace*/}
dummyRule
: ^('=' x=. y=.) {};
placed in the same directory as my main grammar file, which generates fine. However, when trying to compile this I get the following errors:
[02:54:06] error(143): C:\Users\RCIX\Desktop\AguaLua\Project\trunk\AguaLua\AguaLua\ANTLR Data\LuaGrammar.g:12:18: unknown or invalid action scope for tree grammar: lexer
[02:54:06] error(143): C:\Users\RCIX\Desktop\AguaLua\Project\trunk\AguaLua\AguaLua\ANTLR Data\LuaGrammar.g:13:19: unknown or invalid action scope for tree grammar: parser
Am I on the right track or totally off?

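Well, going back to my usual example of a calculator grammar :)
This is how you would declare your tree walker class (this is ANTLR 2 syntax; ANTLR 3 declares it with tree grammar and writes tree patterns as ^(...) instead of #(...)):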
class CalcTreeShaker extends TreeParser;
expr returns [float r]
{
float a,b;
r=0;
}
: #(PLUS a=expr b=expr) {r = a+b;}
| #(STAR a=expr b=expr) {r = a*b;}
| i:INT {r = Convert.ToSingle(i.getText());}
;
Here we have a tree rule called expr. Tree walkers are very similar to parser grammars.
The big difference is that while a parser grammar has to match its input exactly, a tree grammar only has to match a portion of the tree.
In the expr rule we can see that it matches any tree rooted at a PLUS, STAR, or INT token.
We can see we are matching trees because we are using ANTLR's tree syntax #(...).
The PLUS and STAR trees also each match two expr rules. Each expr result is assigned to a name so we can use it to evaluate the expression. As in parser grammars, we can put C# code into blocks delimited by {...}.
Also note that this example shows how to return a value from a tree walker rule: we use the syntax returns [...].
To call the tree walker, you create it and then call its top-level rule. I'll copy this from the ANTLR example :)
// Get the AST from your parser.
CommonAST t = (CommonAST)parser.getAST();
// Create the tree walker (the CalcTreeShaker class declared above).
CalcTreeShaker walker = new CalcTreeShaker();
CalcParser.initializeASTFactory(walker.getASTFactory());
// Pass the AST to the walker and call the top-level rule.
float r = walker.expr(t);

I have not encountered this error, but there are two things I would try.
1) Remove the @lexer and @parser namespace lines.
2) If they are necessary, move them to after the tokens {...} section of your grammar, i.e. just before the rules.

Related

how to parse an expression step by step in c# (preferably visitor pattern)

I am new to C# and have a question about parsing strings. Suppose I have a file that contains lines such as PC: SWITCH_A == ON, or PC: defined(SWITCH_B) && SWITCH_C == OFF. The operators (==, &&, defined) appear as plain text, and the switch names (SWITCH_A) and their values (OFF) are identifiers. How do I parse this kind of string? Do I first have to tokenize the input, splitting it on newlines or whitespace, and then build an abstract syntax tree? Do I also need to store all the identifiers in a dictionary first? I have no experience with parsing. Can anyone help and show me, with an example, how to do it and which methods and classes should be involved? Thanks.

Unfortunately, yes. You have to tokenize them if the syntax you are parsing is custom rather than a standard syntax for which a parser already exists.
You could take advantage of Expression Trees. They are part of the .NET Framework and are used for building and evaluating dynamic languages.
To start parsing the syntax you need a grammar document that describes all the possible cases of the syntax in each line. After that, you can start parsing the lines and building your expression tree.
Parsing any source code typically goes a character at a time, since each character might change the entire semantics of the piece that is being parsed.
So, I suggest you start with a grammar document for the syntax that you have and then start writing your parser.
Make sure that there isn't anything already out there for the syntax you are trying to parse, as these kinds of projects tend to be error-prone and time-consuming.
Now since your high-level grammar is
Expression ::= Identifier | IntegerValue | BooleanExpression
Identifier and IntegerValue are constant literals in the source, so you need to start looking for a BooleanExpression.
To find a BooleanExpression you need to look for either BooleanBinaryExpression, BooleanUnaryExpression, TrueExpression or FalseExpression.
You can detect a BooleanBinaryExpression by looking for the && or == operators and then taking the left and right operands.
To detect a BooleanUnaryExpression you need to look for the word defined and then parse the identifier in the parentheses.
And so on...
Notice that your grammar supports recursion in the syntax: look at the definitions of AndExpression and EqualsExpression, which point back to Expression:
AndExpression ::= Expression '&&' Expression
EqualsExpression ::= Expression '==' Expression
The String class in the .NET Framework has a number of methods to assist you in detecting and parsing your grammar.
Alternatively, you can look for a parser generator that targets C#. For example, see ANTLR.
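The tokenize-then-parse approach described above can be sketched by hand in C#. This is only a minimal illustration under stated assumptions, not a definitive implementation: the class name SwitchExpr, the dictionary of switch values, and the reading of defined(X) as "X is a key in the dictionary" are all hypothetical choices that the question does not specify.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: evaluates lines such as
// "PC: defined(SWITCH_B) && SWITCH_C == OFF" against known switch values.
public static class SwitchExpr
{
    // One token: "&&", "==", a parenthesis, or an identifier.
    // \G anchors the match at the position passed to Match().
    static readonly Regex TokenPattern =
        new Regex(@"\G\s*(&&|==|\(|\)|[A-Za-z_][A-Za-z0-9_]*)");

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        text = text.TrimEnd();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = TokenPattern.Match(text, pos);
            if (!m.Success)
                throw new FormatException("Unexpected character at position " + pos);
            tokens.Add(m.Groups[1].Value);
            pos += m.Length;
        }
        return tokens;
    }

    public static bool Evaluate(string line, IDictionary<string, string> switches)
    {
        // Drop the "PC:" prefix used in the question's examples, if present.
        string text = line.Trim();
        if (text.StartsWith("PC:", StringComparison.Ordinal)) text = text.Substring(3);
        List<string> tokens = Tokenize(text);
        int pos = 0;
        bool result = ParseAnd(tokens, ref pos, switches);
        if (pos != tokens.Count)
            throw new FormatException("Unexpected token: " + tokens[pos]);
        return result;
    }

    // AndExpression ::= Comparison ('&&' Comparison)*
    static bool ParseAnd(List<string> t, ref int pos, IDictionary<string, string> switches)
    {
        bool value = ParseComparison(t, ref pos, switches);
        while (pos < t.Count && t[pos] == "&&")
        {
            pos++;
            // Always parse the right operand so syntax errors are still reported.
            bool right = ParseComparison(t, ref pos, switches);
            value = value && right;
        }
        return value;
    }

    // Comparison ::= 'defined' '(' Identifier ')' | Identifier '==' Identifier
    static bool ParseComparison(List<string> t, ref int pos, IDictionary<string, string> switches)
    {
        string name = Next(t, ref pos);
        if (name == "defined")
        {
            Expect(t, ref pos, "(");
            string id = Next(t, ref pos);
            Expect(t, ref pos, ")");
            return switches.ContainsKey(id);
        }
        Expect(t, ref pos, "==");
        string expected = Next(t, ref pos);
        string actual;
        return switches.TryGetValue(name, out actual) && actual == expected;
    }

    static string Next(List<string> t, ref int pos)
    {
        if (pos >= t.Count) throw new FormatException("Unexpected end of input");
        return t[pos++];
    }

    static void Expect(List<string> t, ref int pos, string token)
    {
        if (Next(t, ref pos) != token) throw new FormatException("Expected " + token);
    }
}
```

With a dictionary that maps SWITCH_A to ON, SWITCH_B to ON and SWITCH_C to OFF, calling SwitchExpr.Evaluate("PC: defined(SWITCH_B) && SWITCH_C == OFF", switches) returns true. The sketch evaluates while parsing; building an explicit tree first is the more extensible design if more operators are added later.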

Antlr 4 Lexer rule ambiguity

So I'm building a grammar to parse C++ header files.
I have only written grammar for the header files, and I don't intend to write any for the implementation.
My problem arises when a method is implemented in the header rather than just declared.
Foo bar()
{
//some code
};
I just want to match the implementation of bar to
BLOCK
: '{' INTERNAL_BLOCK*? '}'
;
fragment INTERNAL_BLOCK
: BLOCK
| ~('}')
;
but then this interferes with any other grammar rule that includes { ... }, because it will always match whatever is between two braces. Is there any way to specify which token to use when there is an ambiguity?
P.S. I don't know if the grammar for BLOCK works, but you get the gist.

So, the significant parser rules would be:
method : mType mTypeName LPAREN RPAREN BLOCK ; // simplified
unknown : . ;
BLOCK tokens produced by the lexer that are not matched as part of the method rule will appear in the parse-tree in unknown context nodes. Analyze the method context nodes and ignore the unknown nodes.

C# grammar using ANTLR 3

I'm now writing C# grammar using Antlr 3 based on this grammar file.
But, I found some definitions I can't understand.
NUMBER:
Decimal_digits INTEGER_TYPE_SUFFIX? ;
// For the rare case where 0.ToString() etc is used.
GooBall
@after
{
CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
CommonToken dot = new CommonToken(DOT, ".");
CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
Emit(int_literal);
Emit(dot);
Emit(iden);
Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
:
dil = Decimal_integer_literal d = '.' s=GooBallIdentifier
;
fragment GooBallIdentifier
: IdentifierStart IdentifierPart* ;
The above fragments contain the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?

It's because that's a valid expression that's not handled by any of the other rules - I guess you'd call it something like an anonymous object, for lack of a better term. Similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values ala GetThing().Method(), or otherwise bare.

Sorry. I found the reason from the official FAQ pages.
Now if you want to add '..' range operator so 1..10 makes sense, ANTLR has trouble distinguishing 1. (start of the range) from 1. the float without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:

Parse expression (with custom functions and operations)

I have a string containing a custom expression that I have to parse and evaluate.
For example:
(FUNCTION_A(5,4,5) UNION FUNCTION_B(3,3))
INTERSECT (FUNCTION_C(5,4,5) UNION FUNCTION_D(3,3))
FUNCTION_X represent functions, which are implemented in C# and return ILists.
UNION or INTERSECT are custom functions which should be applied to the lists, which are returned from those functions.
Union and intersect are implemented via Enumerable.Intersect/Enumerable.Union.
How can the parsing and evaluating be implemented in an elegant and expandable manner?

It depends on how complex your expressions will become, how many different operators are going to be available, and a whole number of different variables. Whichever way you do it, you will probably need to first determine a grammar for your mini-language.
For simple grammars, you can just write a custom parser. In the case of many calculators and similar applications, a recursive descent parser is expressive enough to handle the grammar and is intuitive to write. The linked Wikipedia page gives a sample grammar and the implementation of a C parser for it. Eric White also has a blog post on building recursive descent parsers in C#.
For more complex grammars, you will likely want to skip the work of creating this yourself and use a lex/yacc-type lexer and parser toolset. Normally you give as input to these a grammar in EBNF or similar syntax, and they will produce the code necessary to parse the input for you. The parser will typically return a syntax tree which you can traverse, allowing you to apply logic for each token in the input stream (each node in the tree). For C#, I have worked with GPLex and GPPG, but others such as ANTLR are also available.
Basic Parsing Concepts
In general, you want to be able to split each item in the input into a meaningful token, and build a tree based on those tokens. Once the tree is built, you can traverse the tree and perform the necessary action at each node. A syntax tree for FUNCTION_A(5,4,5) UNION FUNCTION_B(3,3) might look like this, where the node types are in capital letters and their values are in parentheses:
                      PROGRAM
                         |
                         |
                       UNION
                         |
          ------------------------------
          |                            |
FUNCTION(FUNCTION_A)          FUNCTION(FUNCTION_B)
          |                            |
   ---------------                 ----------
   |      |      |                 |        |
INT(5) INT(4) INT(5)            INT(3)   INT(3)
The parser needs to be smart enough to know that when a UNION is found, it needs to be supplied with two items to union, etc. Given this tree, you would start at the root (PROGRAM) and do a depth-first traversal. At the UNION node, the action would be to first visit all children, and then union the results together. At a FUNCTION node, the action would be to first visit all of the children, find their values, and use those values as parameters to the function, and secondly to evaluate the function on those inputs and return the value.
This would continue for all tokens, for any expression you can come up with. In this way, if you spend the time to get the parser to produce the right tree and each node knows how to perform whatever action it needs to, your design is very extensible and can handle any input that matches the grammar it was designed for.
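For a grammar this small, the tree walk described above can be folded into a recursive-descent parser that evaluates as it goes. The following is a rough sketch rather than a definitive design: the class name SetExpressionEvaluator, the function registry, and the assumption that every function takes integer arguments and returns integers are hypothetical, since the question only says the functions return ILists. UNION and INTERSECT are treated as left-associative with equal precedence, which is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical evaluator for expressions such as
// "(FUNCTION_A(5,4,5) UNION FUNCTION_B(3,3)) INTERSECT FUNCTION_C(5,4,5)".
public class SetExpressionEvaluator
{
    // Registry of callable functions; the signature is an assumption.
    private readonly Dictionary<string, Func<int[], IEnumerable<int>>> functions;
    private List<string> tokens;
    private int pos;

    public SetExpressionEvaluator(Dictionary<string, Func<int[], IEnumerable<int>>> functions)
    {
        this.functions = functions;
    }

    public List<int> Evaluate(string text)
    {
        // Tokens: names/keywords, integers, parentheses and commas.
        // Note: characters that match none of these are silently skipped.
        tokens = Regex.Matches(text, @"[A-Za-z_]\w*|\d+|[(),]")
                      .Cast<Match>().Select(m => m.Value).ToList();
        pos = 0;
        List<int> result = ParseExpression().ToList();
        if (pos != tokens.Count)
            throw new FormatException("Unexpected token: " + tokens[pos]);
        return result;
    }

    // expression ::= operand (('UNION' | 'INTERSECT') operand)*
    private IEnumerable<int> ParseExpression()
    {
        IEnumerable<int> left = ParseOperand();
        while (pos < tokens.Count && (tokens[pos] == "UNION" || tokens[pos] == "INTERSECT"))
        {
            string op = tokens[pos++];
            IEnumerable<int> right = ParseOperand();
            left = op == "UNION" ? left.Union(right) : left.Intersect(right);
        }
        return left;
    }

    // operand ::= '(' expression ')' | NAME '(' [INT (',' INT)*] ')'
    private IEnumerable<int> ParseOperand()
    {
        string token = Next();
        if (token == "(")
        {
            IEnumerable<int> inner = ParseExpression();
            Expect(")");
            return inner;
        }
        Func<int[], IEnumerable<int>> function;
        if (!functions.TryGetValue(token, out function))
            throw new FormatException("Unknown function: " + token);
        Expect("(");
        var args = new List<int>();
        if (pos < tokens.Count && tokens[pos] != ")")
        {
            args.Add(int.Parse(Next()));
            while (pos < tokens.Count && tokens[pos] == ",")
            {
                pos++;
                args.Add(int.Parse(Next()));
            }
        }
        Expect(")");
        return function(args.ToArray());
    }

    private string Next()
    {
        if (pos >= tokens.Count) throw new FormatException("Unexpected end of input");
        return tokens[pos++];
    }

    private void Expect(string expected)
    {
        string actual = Next();
        if (actual != expected)
            throw new FormatException("Expected " + expected + " but found " + actual);
    }
}
```

New functions are added by registering another entry in the dictionary, and a new set operator needs only one more branch in ParseExpression. Once operators with different precedence are required, a separate parse method per precedence level (or a generated parser) is likely the cleaner route.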

How can I build a Truth Table Generator?

I'm looking to write a Truth Table Generator as a personal project.
There are several web-based online ones here and here.
(Example screenshot of an existing Truth Table Generator)
I have the following questions:
How should I go about parsing expressions like: ((P => Q) & (Q => R)) => (P => R)
Should I use a parser generator like ANTLR or YACC, or use straight regular expressions?
Once I have the expression parsed, how should I go about generating the truth table? Each section of the expression needs to be divided up into its smallest components and re-built from the left side of the table to the right. How would I evaluate something like that?
Can anyone provide me with tips concerning the parsing of these arbitrary expressions and eventually evaluating the parsed expression?

This sounds like a great personal project. You'll learn a lot about how the basic parts of a compiler work. I would skip trying to use a parser generator; if this is for your own edification, you'll learn more by doing it all from scratch.
The way such systems work is a formalization of how we understand natural languages. If I give you a sentence: "The dog, Rover, ate his food.", the first thing you do is break it up into words and punctuation. "The", "SPACE", "dog", "COMMA", "SPACE", "Rover", ... That's "tokenizing" or "lexing".
The next thing you do is analyze the token stream to see if the sentence is grammatical. The grammar of English is extremely complicated, but this sentence is pretty straightforward. SUBJECT-APPOSITIVE-VERB-OBJECT. This is "parsing".
Once you know that the sentence is grammatical, you can then analyze the sentence to actually get meaning out of it. For instance, you can see that there are three parts of this sentence -- the subject, the appositive, and the "his" in the object -- that all refer to the same entity, namely, the dog. You can figure out that the dog is the thing doing the eating, and the food is the thing being eaten. This is the semantic analysis phase.
Compilers then have a fourth phase that humans do not, which is they generate code that represents the actions described in the language.
So, do all that. Start by defining what the tokens of your language are: define a base class Token and a bunch of derived classes, one for each kind of token (IdentifierToken, OrToken, AndToken, ImpliesToken, RightParenToken...). Then write a method that takes a string and returns an IEnumerable<Token>. That's your lexer.
Second, figure out what the grammar of your language is, and write a recursive descent parser that breaks up an IEnumerable<Token> into an abstract syntax tree that represents grammatical entities in your language.
Then write an analyzer that looks at that tree and figures stuff out, like "how many distinct free variables do I have?"
Then write a code generator that spits out the code necessary to evaluate the truth tables. Spitting IL seems like overkill, but if you wanted to be really buff, you could. It might be easier to let the expression tree library do that for you; you can transform your parse tree into an expression tree, and then turn the expression tree into a delegate, and evaluate the delegate.
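The lexer step described above could look roughly like the following sketch. The token class names follow the ones suggested in the answer, but NotToken, LeftParenToken, and the concrete surface syntax (&, |, ~, =>, parentheses, alphanumeric identifiers) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Token hierarchy: one derived class per kind of token.
public abstract class Token { }

public sealed class IdentifierToken : Token
{
    public string Name { get; private set; }
    public IdentifierToken(string name) { Name = name; }
}

public sealed class AndToken : Token { }
public sealed class OrToken : Token { }
public sealed class NotToken : Token { }
public sealed class ImpliesToken : Token { }
public sealed class LeftParenToken : Token { }
public sealed class RightParenToken : Token { }

public static class Lexer
{
    // Lazily turns the input string into a sequence of tokens.
    // Errors are raised when the sequence is enumerated, not when Lex is called.
    public static IEnumerable<Token> Lex(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (c == '&') { i++; yield return new AndToken(); }
            else if (c == '|') { i++; yield return new OrToken(); }
            else if (c == '~') { i++; yield return new NotToken(); }
            else if (c == '(') { i++; yield return new LeftParenToken(); }
            else if (c == ')') { i++; yield return new RightParenToken(); }
            else if (c == '=' && i + 1 < text.Length && text[i + 1] == '>')
            {
                i += 2;
                yield return new ImpliesToken();
            }
            else if (char.IsLetter(c))
            {
                // An identifier is a letter followed by letters or digits.
                int start = i;
                while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
                yield return new IdentifierToken(text.Substring(start, i - start));
            }
            else
            {
                throw new FormatException("Unexpected character '" + c + "' at position " + i);
            }
        }
    }
}
```

Calling Lexer.Lex("((P => Q) & (Q => R)) => (P => R)") yields the token sequence that a recursive descent parser would then consume. Keeping the lexer separate from the parser means the parser never has to deal with whitespace or multi-character operators.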
Good luck!

I think a parser generator is overkill. You could use the idea of converting an expression to postfix and evaluating postfix expressions (or directly building an expression tree out of the infix expression and using that to generate the truth table) to solve this problem.

As Mehrdad mentions you should be able to hand roll the parsing in the same time as it would take to learn the syntax of a lexer/parser. The end result you want is some Abstract Syntax Tree (AST) of the expression you have been given.
You then need to build some input generator that creates the input combinations for the symbols defined in the expression.
Then iterate across the input set, generating the results for each input combo, given the rules (AST) you parsed in the first step.
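The input generator and the evaluation loop can be sketched as follows. This is a minimal illustration, assuming the parsed expression is available as a delegate that maps an assignment to a truth value; the TruthTable class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TruthTable
{
    // Yields every assignment of true/false to the given variables (2^N rows).
    public static IEnumerable<Dictionary<string, bool>> Assignments(IList<string> variables)
    {
        int n = variables.Count; // assumed small enough that 2^n fits in an int
        for (int row = 0; row < (1 << n); row++)
        {
            var assignment = new Dictionary<string, bool>();
            for (int i = 0; i < n; i++)
            {
                // The first variable maps to the most significant bit, so rows
                // come out in the usual F..F, F..T, ..., T..T order.
                assignment[variables[i]] = ((row >> (n - 1 - i)) & 1) == 1;
            }
            yield return assignment;
        }
    }

    // Evaluates `formula` (a stand-in for the parsed AST) on every row.
    public static List<bool> Evaluate(
        IList<string> variables,
        Func<IDictionary<string, bool>, bool> formula)
    {
        var results = new List<bool>();
        foreach (Dictionary<string, bool> assignment in Assignments(variables))
        {
            results.Add(formula(assignment));
        }
        return results;
    }
}
```

For example, TruthTable.Evaluate(new[] { "P", "Q" }, a => !a["P"] || a["Q"]) evaluates P => Q on all four rows; only the row with P true and Q false produces false.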
How I would do it:
I could imagine using lambda functions to express the AST/rules as you parse the tree, building a symbol table as you parse. You could then build the input set and pass the symbol table to the lambda expression tree to calculate the results.

If your goal is processing boolean expressions, a parser generator and all the machinery that goes with it is a waste of time, unless you want to learn how they work (then any of them would be fine).
But it is easy to build a recursive-descent parser by hand for boolean expressions that computes and returns the result of "evaluating" the expression. Such a parser could be used on a first pass to determine the number of unique variables, where "evaluation" means "count 1 for each new variable name".
Writing a generator to produce all possible truth values for N variables is trivial; for each set of values, simply call the parser again and use it to evaluate the expression, where evaluate means "combine the values of the subexpressions according to the operator".
You need a grammar:
formula = disjunction ;
disjunction = conjunction
| disjunction "or" conjunction ;
conjunction = term
| conjunction "and" term ;
term = variable
| "not" term
| "(" formula ")" ;
Yours can be more complicated, but for boolean expressions it can't be that much more complicated.
For each grammar rule, write 1 subroutine that uses a global "scan" index into the string being parsed:
int disjunction()
// returns -1 ==> "not a disjunction" (syntax error)
// in mode 1:
//   returns 0 if the disjunction is false
//   returns 1 if the disjunction is true
{ skipblanks(); // advance scan past blanks
  int temp1 = conjunction();
  if (temp1 == -1) return -1; // syntax error
  while (true)
  { skipblanks();
    if (!matchinput("or")) return temp1;
    int temp2 = conjunction();
    if (temp2 == -1) return -1; // syntax error: "or" must be followed by a conjunction
    temp1 = temp1 | temp2;
  }
}
int term()
{ skipblanks();
  if (inputmatchesvariablename())
  { variablename = getvariablenamefrominput();
    if (unique(variablename)) numberofvariables += 1;
    return lookupvariablename(variablename); // get truth table value for name
  }
  ...
}
Each of your parse routines will be about this complicated. Seriously.
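A C# rendering of the same idea, covering the whole grammar above, might look like the sketch below. It is one possible reading of the pseudocode rather than the answer's own code: it reports syntax errors with exceptions instead of returning -1, takes the variable values as a dictionary, and leaves out the variable-counting mode. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Recursive-descent evaluator for the grammar above ("or", "and", "not").
public class FormulaEvaluator
{
    private readonly string text;
    private readonly IDictionary<string, bool> values;
    private int scan; // index of the next unread character

    private FormulaEvaluator(string text, IDictionary<string, bool> values)
    {
        this.text = text;
        this.values = values;
    }

    public static bool Evaluate(string text, IDictionary<string, bool> values)
    {
        var parser = new FormulaEvaluator(text, values);
        bool result = parser.Disjunction();
        parser.SkipBlanks();
        if (parser.scan != text.Length)
            throw new FormatException("Unexpected input at position " + parser.scan);
        return result;
    }

    // disjunction = conjunction { "or" conjunction }
    private bool Disjunction()
    {
        bool value = Conjunction();
        while (MatchWord("or")) value |= Conjunction(); // |= never skips the parse
        return value;
    }

    // conjunction = term { "and" term }
    private bool Conjunction()
    {
        bool value = Term();
        while (MatchWord("and")) value &= Term();
        return value;
    }

    // term = variable | "not" term | "(" formula ")"
    private bool Term()
    {
        if (MatchWord("not")) return !Term();
        SkipBlanks();
        if (scan < text.Length && text[scan] == '(')
        {
            scan++;
            bool inner = Disjunction();
            SkipBlanks();
            if (scan >= text.Length || text[scan] != ')')
                throw new FormatException("Expected ')' at position " + scan);
            scan++;
            return inner;
        }
        string name = ReadWord();
        if (name.Length == 0)
            throw new FormatException("Expected a variable at position " + scan);
        return values[name]; // throws KeyNotFoundException for unknown names
    }

    private void SkipBlanks()
    {
        while (scan < text.Length && char.IsWhiteSpace(text[scan])) scan++;
    }

    // Reads a run of letters and digits starting at the scan position.
    private string ReadWord()
    {
        int start = scan;
        while (scan < text.Length && char.IsLetterOrDigit(text[scan])) scan++;
        return text.Substring(start, scan - start);
    }

    // Consumes `word` only if it appears as a whole word at the scan position.
    private bool MatchWord(string word)
    {
        SkipBlanks();
        int saved = scan;
        if (ReadWord() == word) return true;
        scan = saved;
        return false;
    }
}
```

With p set to true and q set to false, FormulaEvaluator.Evaluate("p or q and not p", values) returns true, because "and" binds tighter than "or". In this sketch the words or, and, and not are reserved and cannot be used as variable names.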

The source code of the pyttgen program, which generates truth tables for logical expressions, is available at http://code.google.com/p/pyttgen/source/browse/#hg/src
The code is based on the PLY library, so it's very simple :)
