Antlr grammar generating invalid C# code


I am trying to develop a C# code generator using ANTLR and the StringTemplate library. AntlrWorks generates the C# parser and lexer files without reporting any errors. However, the generated C# parser code is not valid and does not compile in Visual Studio.
Can anyone see what is wrong with the following grammar?
grammar StrucadShape;
options {
language=CSharp3 ;
output=template;
}
@header {using System;}
@lexer::header {using System;}
@lexer::members {const int HIDDEN = Hidden;}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
public shapedef: parameters_def
-> class_temp( parameters={$parameters_def.st} )
;
parameters_def : (PARAMETERS LPAREN (p+=param) (COMMA (p+=param))* RPAREN )
-> parameter_list(params={$p})
;
param : IDENTIFIER ->Parameter_decl(p={$IDENTIFIER.text});
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
fragment EOL:'\r'|'\n'|'\r\n' ;
WS : (' '
| '\t'
| EOL)
{ $channel = HIDDEN; } ;
PARAMETERS: 'PARAMETERS';
COMMA : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
fragment LETTER :('A'..'Z' | 'a'..'z');
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER : (DIGIT)+ ;
FLOAT : (DIGIT)+'.'(DIGIT)+;
fragment DIGIT : '0'..'9' ;
This results in the following lines of code in the generated parameters_def() method:
List<object> list_p = null;
...snipped some code
if (list_p==null) list_p=new List<StringTemplate>();
This fails on the assignment of a List<StringTemplate> to a variable of type List<object>.
The grammar works before I add the string template rules. The error is introduced when I add the (p+=param) syntax required for list processing in the StringTemplate library.
I'll add my StringTemplate file for completeness, but I don't think this could be causing an error as it is not loaded until runtime.
group StrucadShape;
Parameter_decl(p)::= "public double <p> { get; set; }"
parameter_list(params) ::=
<<
start expressions
<params; separator="\n">
end
>>
class_temp( parameters)::=
<<
public class name
{
<parameters; separator="\n>
}
>>
A sample input string: PARAMETERS( D,B,T)
ANTLR versions:
Antlr3.Runtime 3.4.1.9004
AntlrWorks 1.4.3

I found a related issue on the Antlr mailing list here.
The solution was to add an ASTLabelType to the grammar options:
options {
language=CSharp3;
output=template;
ASTLabelType = StringTemplate;
}
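The compile error in the question comes down to C# generic lists being invariant: a List of a derived type is not a List<object>, even though each element converts to object. A minimal sketch of that rule, using string as a stand-in for StringTemplate so that it needs no ANTLR reference (the variable names are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

// 'string' stands in for StringTemplate here (an assumption for illustration only).
List<object> listP;

// An assignment of this shape is what the generated parser contained.
// It does not compile, because List<T> is invariant in T:
// listP = new List<string>();

// Creating the list with the declared element type compiles, and the list
// can still hold template objects, since every reference type converts to object.
listP = new List<object>();
listP.Add("public double D { get; set; }");

// Covariance is only available on read-only interfaces such as IEnumerable<out T>.
IEnumerable<object> readOnlyView = new List<string> { "B", "T" };

Console.WriteLine(listP.Count);
```

So the declared label type and the type of the list that gets created have to agree, which is presumably what the ASTLabelType option brings about in the generated code.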

Related

ANTLR visitor unit test succeeds on one rule but fails on another

I'm trying to define unit tests for my ANTLR parser. The unit test successfully extracts the value of the first expr, but fails to extract the value of the first idEscape. This suggests that I am misunderstanding something fundamental about the way the parser works, or the way visitors work.
I'm writing a parser for calculations in FileMaker Pro. In FileMaker, it is technically valid for an identifier to contain whitespace as well as operators and other characters which would otherwise have a functional purpose in the calculation engine. In those cases, the identifier is escaped by surrounding it with '${' and '}'. While the parser successfully identifies '${abcdef + 123}' as a valid expression, I still need to be able to identify 'abcdef + 123' as a valid identifier. When I request the value of the first idEscape in a second unit test, I get an empty string.
If relevant, I'm using ANTLR4.Runtime.Standard.
What am I doing wrong? Any assistance in resolving my misunderstanding would be greatly appreciated. Thank you.
Grammar
grammar FileMakerCalc;
// PARSER RULES
calculation : expr;
expr : idEscExpr;
idEscExpr : LEFTESCAPE idEscape RIGHTESCAPE;
idEscape : (WORD|WS|OPERATOR|INT|FLOAT)*?;
// LEXER RULES
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
LEFTESCAPE : '${';
RIGHTESCAPE : '}';
OPERATOR : ('+'|'-'|'*'|'/'|'&'|'^'|'='|'≠'|'<>'|'>'|'<'|'≤'|'<='|'≥'|'>=' );
WORD : (LOWERCASE | UPPERCASE)+ ;
FLOAT : [0-9]+ '.' [0-9]+;
INT : [0-9]+ ;
NEWLINE : [\r\n]+ ;
WS : [ \t];
Visitor
public class FileMakerCalcVisitor : FileMakerCalcBaseVisitor<String>
{
public override string VisitExpr(FileMakerCalcParser.ExprContext context)
{
return context.GetText();
}
public override string VisitIdEscape(FileMakerCalcParser.IdEscapeContext context)
{
return context.GetText();
}
}
Unit Tests
namespace Antler_Tests
{
[TestFixture()]
public class ParserTest
{
private FileMakerCalcParser Setup(string text)
{
AntlrInputStream inputStream = new AntlrInputStream(text);
FileMakerCalcLexer lexer = new FileMakerCalcLexer(inputStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
FileMakerCalcParser parser = new FileMakerCalcParser(commonTokenStream);
return parser;
}
// This one successfully pulls '${abcdef + 123}' as the text of the first expr
[Test()]
public void EscapedID_CheckForExpr()
{
FileMakerCalcParser parser = Setup("${abcdef + 123}");
FileMakerCalcParser.ExprContext context = parser.expr();
FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
var testVal = visitor.VisitExpr(context);
Assert.AreEqual("${abcdef + 123}", testVal, testVal);
}
// This one does NOT successfully pull 'abcdef + 123' as the text of the first idEscape
[Test()]
public void EscapedID()
{
FileMakerCalcParser parser = Setup("${abcdef + 123}");
FileMakerCalcParser.IdEscapeContext context = parser.idEscape();
FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
var testVal = visitor.VisitIdEscape(context);
Assert.AreEqual("abcdef + 123", testVal);
}
}
}

${abcdef + 123} is not a valid idEscape because it starts with ${ and ends with }, neither of which the idEscape rule accepts. The way you've defined it, idEscape only matches the content between ${ and }, and idEscExpr is the rule that matches the whole thing.
So you'll want your test to either invoke the idEscExpr rule instead of idEscape, or change the string you're parsing to abcdef + 123 (or have one test for each).

Parsing csv using ANTLR in c#

I have created the following grammar in ANTLR for parsing a CSV file.
grammar CSV;
file returns [List<List<string>> data]
@init {$data = new List<List<string>>();}
: (row {$data.Add($row.list);})+ EOF
;
row returns [List<string> list]
@init {$list = new List<string>();}
: a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);}
)*
(LineBreak | EOF)
;
value returns [string val]
: SimpleValue {$val = $SimpleValue.text;}
| QuotedValue
{
System.Console.WriteLine($val);
$val = $QuotedValue.text;
$val = $val.Substring(1, $val.Length - 2); // drop both surrounding quotes
$val = $val.Replace("\"\"", "\"");
}
;
Comma :
( ' '* ',' ' '*);
LineBreak :
'\r'? '\n';
SimpleValue
: ~[,\r\n"]+
;
QuotedValue
: '"' ('""' | ~'"')* '"'
;
The grammar above parses the following CSV file without error.
a,b
1,2
3,4
But when I parse the following CSV file, it reports the error shown below it:
a,b
,2
3,4
line 2:0 extraneous input ',' expecting {<EOF>, SimpleValue, QuotedValue}
Can someone guide me on how to solve this problem?
Main Program
public List<List<string>> Parse()
{
string csvData = string.Empty;
if (string.IsNullOrEmpty(_path))
throw new ArgumentException("Path can not be empty");
try
{
csvData = File.ReadAllText(_path);
}
catch (Exception)
{
throw new FileNotFoundException(string.Format("{0} not found", _path));
}
// create an instance of the lexer
CSVLexer lexer = new CSVLexer(new AntlrInputStream(csvData));
// wrap a token-stream around the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create the parser
CSVParser parser = new CSVParser(tokens);
// invoke the entry point of our grammar
_results = parser.file().data;
return _results;
}
Update
As per Mike Lischke's answer, I have updated the row rule as shown below. Now I am not getting any error.
row returns [List<string> list]
@init {$list = new List<string>();}
: Comma? a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);
}
)*
(LineBreak | EOF)
;

Obviously your row rule is not flexible enough to handle missing values. You should use something like this instead:
row: Comma? value (Comma value)*;
which adds the possibility for a leading comma (actually a missing first value).
And a recommendation: don't use action code in your grammar to collect the values. Instead, create a parse listener and assign it to your parser; its methods are triggered during parsing and can do all the background work. This keeps the grammar a lot cleaner and allows it to be used independently of the actual target language.
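One thing worth checking with the Comma? value form: for the row ,2 the parser no longer reports an error, but the actions in the question add only "2" to the list, so the empty first column is not recorded. If column positions matter, the expected shape of the result can be sketched as below. Plain string.Split is only a stand-in for the grammar here (it ignores quoted values), and the variable names are made up for illustration.

```csharp
using System;

// A row with both values present.
string[] fullRow = "1,2".Split(',');

// A row whose first value is missing: Split keeps the empty field,
// so the second value stays in the second column.
string[] missingFirst = ",2".Split(',');

Console.WriteLine(fullRow.Length);
Console.WriteLine(missingFirst.Length);
Console.WriteLine(missingFirst[0].Length);
```

Both rows come back with two fields, and the first field of the second row is the empty string. A grammar that needs the same behaviour would have to add an empty string whenever a value is absent.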

ANTLR4 in C# catches only one token

g4 file:
grammar TestFlow;
options
{
language=CSharp4;
output=AST;
}
/*
* Parser Rules
*/
compileUnit : LC | BC ;
/*
* Lexer Rules
*/
BC : '/*' .*? '*/' ;
LC : '//' .*? [\r\n] ;
Code:
var input = " /*aaa*/ /// \n ";
var stream = new AntlrInputStream(input);
ITokenSource lexer = new TestFlowLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new TestFlowParser(tokens);
parser.BuildParseTree = true;
var tree = parser.compileUnit();
var n = tree.ChildCount;
var top = new List<string>();
for (int i = 0; i < n; i++) {
top.Add(tree.GetChild(i).GetText());
}
After running the code above, I get a single string in top: /*aaa*/. The single-line comment isn't caught.
What's wrong?

All parser/lexer generation errors & warnings are significant. Both options statements are invalid in the current version of Antlr4.
The runtime errors detail the root problem: unrecognizable input characters, specifically, the grammar does not handle whitespace. Add a lexer rule to fix:
WS: [ \r\n\t] -> skip ;
While not necessarily a problem, it is good form to require the parser to process all input. The lexer will generate an EOF token at the end of the source input. Fix the main rule to require the EOF:
compileUnit : ( LC | BC ) EOF ;
The correct way to allow for repetition is to use a * or + operator:
compileUnit : ( LC | BC )+ EOF ;

antlr grammar for tree construction from simple logic string

I want to create a parser using antlr for the following string:
"1 AND (2 OR (3 AND 4)) AND 5"
I want AND and OR operations that result in a tree once parsing succeeds. The input above should produce the following tree:
AND
- 1
- OR
  - 2
  - AND
    - 3
    - 4
- 5
I also want to reject ambiguous input like "1 AND 2 OR 3", as it is not clear how to construct the tree from that. It also seems that the parser "accepts" input with trailing characters, such as "1 AND 2asdf".
what i have so far is (not working as expected):
grammar code;
options {
language=CSharp3;
output=AST;
ASTLabelType=CommonTree;
//backtrack=true;
}
tokens {
ROOT;
}
@rulecatch {
catch {
throw;
}
}
@parser::namespace { Web.DealerNet.Areas.QueryBuilder.Parser }
@lexer::namespace { Web.DealerNet.Areas.QueryBuilder.Parser }
@lexer::members {
public override void ReportError(RecognitionException e) {
throw e;
}
}
public parse : exp EOF -> ^(ROOT exp);
exp
: atom
( And^ atom (And! atom)*
| Or^ atom (Or! atom)*
)?
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+
;
And
: 'AND' | 'and'
;
Or
: 'OR' | 'or'
;
WS : (' '|'\t'|'\f'|'\n'|'\r')+{ Skip(); };
Hope someone of you guys can help me get on the right track!
edit: and how can I achieve that "1 AND 2 AND 3" results in
AND
  1
  2
  3
instead of
AND
  AND
    1
    2
  3
EDIT:
thanks for the great solution, it works like a charm except for one thing: when i call the parse() method on the following term "1 AND (2 OR (1 AND 3) AND 4" (closing bracket missing) the parser still accepts the input as valid.
this is my code so far:
grammar code;
options {
language=CSharp3;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ROOT;
}
@rulecatch {
catch {
throw;
}
}
@lexer::members {
public override void ReportError(RecognitionException e) {
throw e;
}
}
public parse
: exp -> ^(ROOT exp)
;
exp
: atom
( And^ atom (And! atom)*
| Or^ atom (Or! atom)*
)?
;
atom
: Number
| '(' exp ')' -> exp
;
Number
: ('0'..'9')+
;
And
: 'AND' | 'and'
;
Or
: 'OR' | 'or'
;
WS : (' '|'\t'|'\f'|'\n'|'\r')+{ Skip(); };
edit2:
i just found another problem with my grammar:
when i have input like "1 AND 2 OR 3" the grammar gets parsed just fine, but it should fail because either the "1 AND 2" needs to be inside brackets or the "2 OR 3" part.
i dont understand why the parser runs through as in my opinion this grammar should really cover that case.
is there any sort of online-testing-environment or so to find the problem? (i tried antlrWorks but the errors given there dont lead me anywhere...)
edit3:
updated the code to represent the new grammar like suggested.
I still have the same problem: the following rule
public parse : exp EOF -> ^(ROOT exp);
doesn't parse to the end. The generated C# sources seem to just ignore the EOF. Can you provide any further guidance on how I could resolve the issue?
edit4
the problem seems to be in this part of the code:
EOF2=(IToken)Match(input,EOF,Follow._EOF_in_parse97);
stream_EOF.Add(EOF2);
When i add the following code (just a hack) it works...
if (EOF2.Text == "<missing EOF>") {
throw new Exception(EOF2.Text);
}
Can I change anything so that the parser gets generated correctly from the start?

This rule will disallow expressions containing both AND and OR without parentheses. It will also construct the parse tree you described by making the first AND or OR token the root of the AST, and then hiding the rest of the AND or OR tokens from the same expression.
exp
: atom
( 'AND'^ atom ('AND'! atom)*
| 'OR'^ atom ('OR'! atom)*
)?
;
Edit: The second problem is unrelated to this. If you don't instruct ANTLR to consume all input by including an explicit EOF symbol in one of your parser rules, then it is allowed to consume only a portion of the input in an attempt to successfully match something.
The original parse rule says "match some input as exp". The following modification to the parse rule says "match the entire input as exp".
public parse : exp EOF -> ^(ROOT exp);

ANTLR for C# behaves illogically

Is there ANY reason why ANTLR would ignore tokens? Here's the relevant code; I'm calling var_assign directly.
LABEL
: LETTER (LETTER | DIGIT | '_')*;
fragment LOWER_CASE
: 'a'..'z';
fragment UPPER_CASE
: 'A'..'Z';
fragment LETTER
: UPPER_CASE | LOWER_CASE;
public var_assign
: LABEL ':=' expression -> ^( VARIABLE_ASSIGNMENT LABEL expression )
;
expression is the standard chain of expressions ending with tokens like NUMBER and LABEL (for variables), etc.
Now the issue is that i can just type "anything anything" and the parser will recognize that as an assignment.
ANTLRStringStream Input = new ANTLRStringStream(input_to_process);
processor.lexer.ConsoleGrammarLexer Lexer = new processor.lexer.ConsoleGrammarLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
processor.parser.ConsoleGrammarParser Parser = new processor.parser.ConsoleGrammarParser(Tokens);
CommonTree start_rule_tree = Parser.var_assign().Tree;
//view the tree to help debug
processor_output = start_rule_tree.ToStringTree();
If i type "x 5", i get (VARIABLE_ASSIGNMENT x 5)).
If i type "x:=5", i get (BLOCK (VARIABLE_ASSIGNMENT x 5))
If i type "x*5", i get (BLOCK ,1:1], resync=x*5>)
This happens even if I pass a constant string directly into ANTLRStringStream.
I have managed to solve this by either replacing ':=' with (':=' | 'anythinghere') or (':=')*. But there are other odd behaviours.
I'm using CSharp3 as a language option and the newest .dlls.
What is going on, this makes absolutely no sense.
EDIT:
I've created a test grammar.
grammar testgrammar;
options {
language = CSharp3;
output = AST;
TokenLabelType = CommonToken;
ASTLabelType = CommonTree;
}
LABEL : 'a'..'z';
WS : ' ' {Skip();};
public start
: if_statement EOF!;
if_statement
: LABEL ':=' LABEL ->^(LABEL LABEL);
Typing "ff" produces (f f), typing f*f produces a run-time error, typing f:=f produces (f f). What. The. Hell.

The Java version gives:
/tmp $ java TestT
ff
line 1:2 no viable alternative at character '\n'
line 1:1 missing ':=' at 'f'
From:
InputStream is = System.in;
if ( inputFile!=null ) {
is = new FileInputStream(inputFile);
}
CharStream input = new ANTLRInputStream(is);
TLexer lex = new TLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lex);
TParser parser = new TParser(tokens);
parser.start();
Not sure what's up. CSharp3 should work too. I'm baffled. Start the debugger and set a break point. It's your only hope, Luke!
