C# grammar using ANTLR 3

I'm writing a C# grammar using ANTLR 3, based on this grammar file.
But I found some definitions I can't understand.
NUMBER
    :   Decimal_digits INTEGER_TYPE_SUFFIX?
    ;

// For the rare case where 0.ToString() etc is used.
GooBall
@after
{
    CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
    CommonToken dot = new CommonToken(DOT, ".");
    CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
    Emit(int_literal);
    Emit(dot);
    Emit(iden);
    Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
    :   dil = Decimal_integer_literal d = '.' s = GooBallIdentifier
    ;

fragment GooBallIdentifier
    :   IdentifierStart IdentifierPart*
    ;
The above fragments contain the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?

It's because that's a valid expression that isn't handled by any of the other rules - I guess you'd call it something like an anonymous object, for lack of a better term. It's similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values, a la GetThing().Method(), or otherwise bare.

Sorry, I found the reason on the official FAQ pages myself:
Now if you want to add a '..' range operator so that 1..10 makes sense, ANTLR has trouble distinguishing 1. (the start of the range) from 1. the float without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:
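In the style of the GooBall rule above, the trick looks roughly like this (a sketch, not the FAQ's exact code; the NUM_FLOAT and RANGE names and the predicate placement are illustrative):
NUM_FLOAT
    :   d = Decimal_integer_literal
        (   ('..') => '..'
            {
                // Emit the integer and the range operator as two
                // separate tokens instead of one float token.
                Emit(new CommonToken(NUMBER, $d.text));
                Emit(new CommonToken(RANGE, ".."));
            }
        |   '.' Decimal_digits   // an ordinary float literal
        )
    ;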

Why is ANTLR C# autogenerated parser not matching EOF?

I have a very simple grammar that (I think) should only allow additions of two elements, like 1+1 or 2+3:
grammar dumbCalculator;
expression: simple_add EOF;
simple_add: INT ADD INT;
INT:('0'..'9');
ADD : '+';
I generate my C# classes using the official ANTLR jar file:
java -jar "antlr-4.9-complete.jar" C:\Users\Me\source\repos\ConsoleApp2\ConsoleApp2\dumbCalculator.g4 -o C:\Users\Me\source\repos\ConsoleApp2\ConsoleApp2\Dumb -Dlanguage=CSharp -no-listener -visitor
No matter what I try, the parser keeps accepting the trailing elements even though they shouldn't be allowed.
For example, "1+1+1" gets parsed without complaint as an AST:
expression
    simple_add
        1
        +
        1
    +
    1
Although I specifically wrote that expression must be simple_add then EOF, and simple_add is just INT ADD INT, I have no idea why the rest is being accepted; I expected ANTLR to throw an exception on this.
This is how I test my parser:
var inputStream = new AntlrInputStream("1+1+1");
var lexer = new dumbCalculatorLexer(inputStream);
lexer.RemoveErrorListeners();
lexer.AddErrorListener(new ThrowExceptionErrorListener());
var commonTokenStream = new CommonTokenStream(lexer);
var parser = new dumbCalculatorParser(commonTokenStream);
parser.RemoveErrorListeners();
parser.AddErrorListener(new ThrowExceptionErrorListener());
var ex = parser.expression();
ExploreAST(ex);
Why is the rest of the input being accepted?
Classic scenario: I found my error 5 minutes after posting on Stack Overflow.
For anyone encountering a similar scenario: this happened because I did not explicitly set the ErrorHandler on my parser.
Naively, I expected the added error listeners to handle the errors, but there's a specific thing to do if you need the errors to be raised before visiting the tree.
I needed to add:
parser.ErrorHandler = new BailErrorStrategy();
After this, I indeed got the exceptions on wrong input strings.
This is probably not the right thing to do, so I'll let someone who knows ANTLR better comment on this.
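For reference, here is a minimal sketch of that setup (using the standard Antlr4.Runtime types; the try/catch is my own illustration). The underlying cause is that the default error strategy reports the syntax error to the listeners but then recovers and keeps parsing, while BailErrorStrategy aborts by throwing a ParseCanceledException:
var parser = new dumbCalculatorParser(commonTokenStream);
parser.ErrorHandler = new BailErrorStrategy();
try
{
    var tree = parser.expression(); // throws on "1+1+1"
}
catch (Antlr4.Runtime.Misc.ParseCanceledException ex)
{
    // ex.InnerException holds the underlying RecognitionException
    Console.Error.WriteLine("Invalid input: " + ex.InnerException?.Message);
}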

Increase Performance of Semantic Predicate

I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
    // Linear scan over the known user-defined names;
    // this.Text is the text of the candidate token.
    foreach (string function in listOfUserDefinedFunctions)
    {
        if (this.Text == function)
        {
            return true;
        }
    }
    return false;
}
However, I've found that just having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without, ~2 sec with). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the issue is in the parser. Is there any way to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user-defined function might begin with a language-defined keyword.
For example: given the language-defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction() as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor performance (2000-line input, ~2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good performance (2000-line input, ~20 milliseconds)
USER_FUNCTION
    : 'ActualUserKeyword'
    | 'AnotherActualUserKeyword'
    | 'MoreKeywords'
    ...
    ;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?
Edit: since there are no other identifiers in this language, I would take a different approach.
Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined (a sketch follows below).
This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree for the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
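A sketch of that validation pass (everything here is illustrative: MyLangBaseListener, FunctionCallContext and the USER_FUNCTION accessor stand in for whatever your grammar actually generates):
using System;
using System.Collections.Generic;

class FunctionValidationListener : MyLangBaseListener
{
    private readonly ISet<string> _knownFunctions;

    public FunctionValidationListener(ISet<string> knownFunctions)
    {
        _knownFunctions = knownFunctions;
    }

    public override void ExitFunctionCall(MyLangParser.FunctionCallContext context)
    {
        // Report calls to functions that were never defined.
        string name = context.USER_FUNCTION().GetText();
        if (!_knownFunctions.Contains(name))
            Console.Error.WriteLine(
                "line {0}: call to undefined function '{1}'", context.Start.Line, name);
    }
}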
Original answer, assuming that identifiers which are not a USER_FUNCTION should result in IDENTIFIER tokens:
The problem is that the predicate is executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):
tokens {
    USER_FUNCTION
}
Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.
public override IToken Emit() {
    if (_type == IDENTIFIER && IsUserDefinedFunction())
        _type = USER_FUNCTION;
    return base.Emit();
}
My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
Then the lexer defines:
USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';
In the parser listener, I strip the surrounding §'s from the function name.
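The preprocessing step itself might look roughly like this (a sketch; the method name and the longest-name-first ordering are my own, not the original code):
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static string MarkUserFunctions(string input, IEnumerable<string> userFunctions)
{
    // Try longer names first so GOTO20Something wins over a shorter GOTO20.
    string pattern = string.Join("|",
        userFunctions.OrderByDescending(f => f.Length).Select(Regex.Escape));
    // Wrap each occurrence in § (\u00A7) so the USER_FUNCTION rule can match it.
    return Regex.Replace(input, "(" + pattern + ")", "\u00A7$1\u00A7");
}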

.net System.Text.RegularExpressions nesting regexps in regexp

I would like to ask any skilled .NET developer whether it is possible to define a regular expression (using the System.Text.RegularExpressions namespace capabilities) that includes references to other regexp(s). I would like to describe grammar rules, each rule as a single regexp, with the final regexp being the grammar's start symbol.
Of course I could perform the expansion into a single-line regular expression, but readability would suffer. I also would not like to try each option included in the start symbol programmatically (like foreach(regexp r in line.regexps) {check if r.matches(input)}).
For example, take the following ini-like file grammar in regexp-like form (it does not follow Microsoft regexp rules, just general ones):
sp = \s*
allowed_char = [a-zA-Z0-9_]
key = <allowed_char>+
value = <allowed_char>((<allowed_char>|[ ])*<allowed_char>)?
comment = (;|(//)|#)(.*)
empty_line = ^<sp>$
line_comment = ^<sp><comment>$
section = ^<sp>\[<sp><value><sp>\]<sp>(<comment>)?$
item = ^<sp><key><sp>=<sp><value><sp>(<comment>)?$
line = <empty_line>|<line_comment>|<section>|<item>
I would like to:
Check if a sentence is part of the language (true/false) - seems trivial: matches the <line> start symbol.
Access the terminal-like symbol values (e.g. <section>, <key>, <value>, ...) - I suppose this could be achieved via named matching groups (or whatever exactly it is called - I still need to read the details on MSDN).
I do not expect you to write the code, just to give me some hints on whether it is possible (and how) or not, because I have not found this info yet. All the examples I have found are for single-regexp matching.
Thank you.
This is what I came up with when I was doing my own regex-based mathematical expression parser:
private static class Regexes {
    // omitted...
    private static readonly string
        strFunctionNames = "sin|ln|cos|tg|tan|abs",
        strReal = @"([\+-]?\d+([,\.]\d+)?(E[\+-]?\d+)?)|[\+-]Infinit(y|o)",
        strFunction = string.Format( @"(?<function>{0})(?<argument>{1})",
            strFunctionNames, strReal );
    // omitted...
    public static readonly Regex
        FunzioniLowerCase = new Regex( strFunctionNames ),
        RealNumber = new Regex( strReal ),
        Function = new Regex( strFunction );
}
This has the obvious disadvantage that there's some sort of repetition in the code, but you could use reflection to compile (and perhaps even create) those regexes in a static constructor.
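Applying that composition idea to the ini grammar from the question, with named groups to get at the symbol values (a sketch; only the item rule is shown):
using System.Text.RegularExpressions;

static class IniRegexes
{
    const string Sp = @"\s*";
    const string Key = @"(?<key>[a-zA-Z0-9_]+)";
    const string Value = @"(?<value>[a-zA-Z0-9_]([a-zA-Z0-9_ ]*[a-zA-Z0-9_])?)";
    const string Comment = @"(?<comment>(;|//|#).*)";

    public static readonly Regex Item =
        new Regex("^" + Sp + Key + Sp + "=" + Sp + Value + Sp + "(" + Comment + ")?$");
}

// usage:
// var m = IniRegexes.Item.Match("name = some value ; trailing comment");
// m.Groups["key"].Value   -> "name"
// m.Groups["value"].Value -> "some value"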

C-sharp math issue with square root

How do I evaluate the following expression from a string to an answer as an integer?
Expression:
√(7+74) + √(30+6)
Do I have to evaluate each of the parts separately, like Sqroot(7+74) and Sqroot(30+6), or is it possible to evaluate the whole expression? Any ideas?
If this string is user-supplied (or anyway only available at runtime), what you need is a mathematical expression parser (maybe replacing the √ character in the text with sqrt or whatever the parser likes before feeding it the string). There are many free ones available on the net; personally I have used info.lundin.math several times without any problem.
Quick example for your problem:
info.lundin.Math.ExpressionParser parser = new info.lundin.Math.ExpressionParser();
double result = parser.Parse("sqrt(7+74)+sqrt(30+6)", null);
(on the site you can find more complex examples with e.g. parameters that can be specified programmatically)
You can use NCalc for this purpose:
NCalc.Expression expr = new NCalc.Expression("Sqrt(7+74) + Sqrt(30+6)");
object result = expr.Evaluate();
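Combining the two suggestions, a minimal sketch (assuming the input uses the literal √ character as in the question, and relying on NCalc's built-in Sqrt function):
string input = "√(7+74) + √(30+6)";
// Normalize the √ character into a function name the parser knows.
var expr = new NCalc.Expression(input.Replace("√", "Sqrt"));
// Sqrt(81) + Sqrt(36) = 9 + 6
int result = Convert.ToInt32(expr.Evaluate()); // 15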

How can I build a Truth Table Generator?

I'm looking to write a Truth Table Generator as a personal project.
There are several web-based online ones here and here.
(Example screenshot of an existing Truth Table Generator)
I have the following questions:
How should I go about parsing expressions like: ((P => Q) & (Q => R)) => (P => R)
Should I use a parser generator like ANTLR or YACC, or use straight regular expressions?
Once I have the expression parsed, how should I go about generating the truth table? Each section of the expression needs to be divided up into its smallest components and re-built from the left side of the table to the right. How would I evaluate something like that?
Can anyone provide me with tips concerning the parsing of these arbitrary expressions and eventually evaluating the parsed expression?
This sounds like a great personal project. You'll learn a lot about how the basic parts of a compiler work. I would skip trying to use a parser generator; if this is for your own edification, you'll learn more by doing it all from scratch.
The way such systems work is a formalization of how we understand natural languages. If I give you a sentence: "The dog, Rover, ate his food.", the first thing you do is break it up into words and punctuation. "The", "SPACE", "dog", "COMMA", "SPACE", "Rover", ... That's "tokenizing" or "lexing".
The next thing you do is analyze the token stream to see if the sentence is grammatical. The grammar of English is extremely complicated, but this sentence is pretty straightforward. SUBJECT-APPOSITIVE-VERB-OBJECT. This is "parsing".
Once you know that the sentence is grammatical, you can then analyze the sentence to actually get meaning out of it. For instance, you can see that there are three parts of this sentence -- the subject, the appositive, and the "his" in the object -- that all refer to the same entity, namely, the dog. You can figure out that the dog is the thing doing the eating, and the food is the thing being eaten. This is the semantic analysis phase.
Compilers then have a fourth phase that humans do not, which is they generate code that represents the actions described in the language.
So, do all that. Start by defining what the tokens of your language are: define a base class Token and a bunch of derived classes for each (IdentifierToken, OrToken, AndToken, ImpliesToken, RightParenToken...). Then write a method that takes a string and returns an IEnumerable<Token>. That's your lexer.
Second, figure out what the grammar of your language is, and write a recursive descent parser that breaks up an IEnumerable<Token> into an abstract syntax tree that represents grammatical entities in your language.
Then write an analyzer that looks at that tree and figures stuff out, like "how many distinct free variables do I have?"
Then write a code generator that spits out the code necessary to evaluate the truth tables. Spitting IL seems like overkill, but if you wanted to be really buff, you could. It might be easier to let the expression tree library do that for you; you can transform your parse tree into an expression tree, and then turn the expression tree into a delegate, and evaluate the delegate.
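For the expression-tree route, a minimal sketch for a single hard-coded formula, P => Q encoded as !P || Q (building the tree from your own parse tree is the interesting part, and is omitted here):
using System;
using System.Linq.Expressions;

var p = Expression.Parameter(typeof(bool), "p");
var q = Expression.Parameter(typeof(bool), "q");
var body = Expression.OrElse(Expression.Not(p), q);
Func<bool, bool, bool> implies =
    Expression.Lambda<Func<bool, bool, bool>>(body, p, q).Compile();

foreach (bool pv in new[] { false, true })
    foreach (bool qv in new[] { false, true })
        Console.WriteLine("{0}\t{1}\t{2}", pv, qv, implies(pv, qv));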
Good luck!
I think a parser generator is overkill. You could use the idea of converting an expression to postfix and evaluating postfix expressions (or directly building an expression tree out of the infix expression and using that to generate the truth table) to solve this problem.
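The evaluation half of that idea is a classic stack machine over the postfix token stream (a sketch; the operator spellings are my own, and the infix-to-postfix conversion is omitted):
using System;
using System.Collections.Generic;

static bool EvalPostfix(IEnumerable<string> tokens, IDictionary<string, bool> vars)
{
    var stack = new Stack<bool>();
    foreach (string t in tokens)
    {
        switch (t)
        {
            case "!": stack.Push(!stack.Pop()); break;
            case "&": { bool b = stack.Pop(), a = stack.Pop(); stack.Push(a && b); break; }
            case "|": { bool b = stack.Pop(), a = stack.Pop(); stack.Push(a || b); break; }
            case "=>": { bool b = stack.Pop(), a = stack.Pop(); stack.Push(!a || b); break; }
            default: stack.Push(vars[t]); break; // a variable name
        }
    }
    return stack.Pop();
}

// "P => Q" in postfix is "P Q =>":
// EvalPostfix(new[] { "P", "Q", "=>" },
//     new Dictionary<string, bool> { ["P"] = true, ["Q"] = false }) // false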
As Mehrdad mentions, you should be able to hand-roll the parsing in about the same time it would take to learn the syntax of a lexer/parser generator. The end result you want is an abstract syntax tree (AST) of the expression you have been given.
You then need to build some input generator that creates the input combinations for the symbols defined in the expression.
Then iterate across the input set, generating the results for each input combo, given the rules (AST) you parsed in the first step.
How I would do it:
I could imagine using lambda functions to express the AST/rules as you parse the tree, building a symbol table as you parse; you could then build the input set, passing the symbol table to the lambda expression tree to calculate the results.
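For the input-set generator, one common sketch is to treat the row index as a bit mask over the variables found in the symbol table (names here are illustrative):
using System.Collections.Generic;

static IEnumerable<bool[]> InputRows(int variableCount)
{
    // Row i assigns variable k the k-th bit of i, so every one
    // of the 2^n combinations is produced exactly once.
    for (int i = 0; i < (1 << variableCount); i++)
    {
        var row = new bool[variableCount];
        for (int k = 0; k < variableCount; k++)
            row[k] = (i & (1 << k)) != 0;
        yield return row;
    }
}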
If your goal is processing boolean expressions, a parser generator and all the machinery that goes with it is a waste of time, unless you want to learn how they work (in which case any of them would be fine).
But it is easy to build a recursive-descent parser by hand for boolean expressions that computes and returns the results of "evaluating" the expression. Such a parser can be used on a first pass to determine the number of unique variables, where "evaluation" means "count 1 for each new variable name".
Writing a generator to produce all possible truth values for N variables is trivial; for each set of values, simply call the parser again and use it to evaluate the expression, where evaluate means "combine the values of the subexpressions according to the operator".
You need a grammar:
formula = disjunction ;
disjunction = conjunction
| disjunction "or" conjunction ;
conjunction = term
| conjunction "and" term ;
term = variable
| "not" term
| "(" formula ")" ;
Yours can be more complicated, but for boolean expressions it can't be that much more complicated.
For each grammar rule, write one subroutine that uses a global "scan" index into the string being parsed:
int disjunction()
// returns -1 ==> "not a disjunction"
// in mode 1:
//   returns 0 if disjunction is false
//   returns 1 if disjunction is true
{   skipblanks(); // advance scan past blanks (duh)
    temp1 = conjunction();
    if (temp1 == -1) return -1; // syntax error
    while (true)
    {   skipblanks();
        if (matchinput("or") == false) return temp1;
        temp2 = conjunction();
        if (temp2 == -1) return temp1;
        temp1 = temp1 | temp2;
    }
}

int term()
{   skipblanks();
    if (inputmatchesvariablename())
    {   variablename = getvariablenamefrominput();
        if (unique(variablename)) numberofvariables += 1;
        return lookupvariablename(variablename); // get truth table value for name
    }
    ...
}
Each of your parse routines will be about this complicated. Seriously.
You can get the source code of the pyttgen program at http://code.google.com/p/pyttgen/source/browse/#hg/src. It generates truth tables for logical expressions. The code is based on the ply library, so it's very simple :)
