Is there any reason why ANTLR would ignore tokens? Here's the relevant code; I'm calling var_assign directly.
LABEL
: LETTER (LETTER | DIGIT | '_')*;
fragment LOWER_CASE
: 'a'..'z';
fragment UPPER_CASE
: 'A'..'Z';
fragment LETTER
: UPPER_CASE | LOWER_CASE;
public var_assign
: LABEL ':=' expression -> ^( VARIABLE_ASSIGNMENT LABEL expression )
;
expression is the standard chain of expressions ending with tokens like NUMBER and LABEL (for variables), etc.
Now the issue is that I can just type "anything anything" and the parser will recognize that as an assignment.
ANTLRStringStream Input = new ANTLRStringStream(input_to_process);
processor.lexer.ConsoleGrammarLexer Lexer = new processor.lexer.ConsoleGrammarLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
processor.parser.ConsoleGrammarParser Parser = new processor.parser.ConsoleGrammarParser(Tokens);
CommonTree start_rule_tree = Parser.var_assign().Tree;
//view the tree to help debug
processor_output = start_rule_tree.ToStringTree();
If I type "x 5", I get (VARIABLE_ASSIGNMENT x 5)).
If I type "x:=5", I get (BLOCK (VARIABLE_ASSIGNMENT x 5))
If I type "x*5", I get (BLOCK ,1:1], resync=x*5>)
This happens even if I send a constant "string" directly into ANTLRStringStream.
I have managed to solve this by either replacing ':=' with (':=' | 'anythinghere') or (':=')*. But there are other odd behaviours.
I'm using CSharp3 as a language option and the newest .dlls.
What is going on? This makes absolutely no sense.
EDIT:
I've created a test grammar.
grammar testgrammar;
options {
language = CSharp3;
output = AST;
TokenLabelType = CommonToken;
ASTLabelType = CommonTree;
}
LABEL : 'a'..'z';
WS : ' ' {Skip();};
public start
: if_statement EOF!;
if_statement
: LABEL ':=' LABEL ->^(LABEL LABEL);
Typing "ff" produces (f f), typing f*f produces a run-time error, typing f:=f produces (f f). What. The. Hell.
The Java version gives:
/tmp $ java TestT
ff
line 1:2 no viable alternative at character '\n'
line 1:1 missing ':=' at 'f'
From:
InputStream is = System.in;
if ( inputFile!=null ) {
is = new FileInputStream(inputFile);
}
CharStream input = new ANTLRInputStream(is);
TLexer lex = new TLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lex);
TParser parser = new TParser(tokens);
parser.start();
Not sure what's up. CSharp3 should work too. I'm baffled. Start the debugger and set a break point. It's your only hope, Luke!
Related
I'm trying to define unit tests for my ANTLR parser. The unit test successfully extracts the value of the first expr, but fails to extract the value of the first idEscape. This suggests that I am misunderstanding something core to the way in which the parser works, or the way in which visitors work.
I'm writing a parser for calculations in FileMaker Pro. In FileMaker, it is technically valid for an identifier to contain whitespace as well as operators and other characters which would otherwise have a functional purpose in the calculation engine. In those cases, the identifier is escaped by surrounding it with a '${' and '}'. While the parser successfully identifies '${abcdef + 123}' as a valid expression,
I still need to be able to identify 'abcdef + 123' as a valid identifier. When I request the value of the first idEscape in a second unit test, I get an empty string.
If relevant, I'm using ANTLR4.Runtime.Standard.
What am I doing wrong? Any assistance in resolving my misunderstanding would be greatly appreciated. Thank you.
Grammar
grammar FileMakerCalc;
// PARSER RULES
calculation : expr;
expr : idEscExpr;
idEscExpr : LEFTESCAPE idEscape RIGHTESCAPE;
idEscape : (WORD|WS|OPERATOR|INT|FLOAT)*?;
// LEXER RULES
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
LEFTESCAPE : '${';
RIGHTESCAPE : '}';
OPERATOR : ('+'|'-'|'*'|'/'|'&'|'^'|'='|'≠'|'<>'|'>'|'<'|'≤'|'<='|'≥'|'>=' );
WORD : (LOWERCASE | UPPERCASE)+ ;
FLOAT : [0-9]+ '.' [0-9]+;
INT : [0-9]+ ;
NEWLINE : [\r\n]+ ;
WS : [ \t];
Visitor
public class FileMakerCalcVisitor : FileMakerCalcBaseVisitor<String>
{
public override string VisitExpr(FileMakerCalcParser.ExprContext context)
{
return context.GetText();
}
public override string VisitIdEscape(FileMakerCalcParser.IdEscapeContext context)
{
return context.GetText();
}
}
Unit Tests
namespace Antler_Tests
{
[TestFixture()]
public class ParserTest
{
private FileMakerCalcParser Setup(string text)
{
AntlrInputStream inputStream = new AntlrInputStream(text);
FileMakerCalcLexer lexer = new FileMakerCalcLexer(inputStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
FileMakerCalcParser parser = new FileMakerCalcParser(commonTokenStream);
return parser;
}
// This one successfully pulls '${abcdef + 123}' as the text of the first expr
[Test()]
public void EscapedID_CheckForExpr()
{
FileMakerCalcParser parser = Setup("${abcdef + 123}");
FileMakerCalcParser.ExprContext context = parser.expr();
FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
var testVal = visitor.VisitExpr(context);
Assert.AreEqual("${abcdef + 123}", testVal, testVal);
}
// This one does NOT successfully pull 'abcdef + 123' as the text of the first idEscape
[Test()]
public void EscapedID()
{
FileMakerCalcParser parser = Setup("${abcdef + 123}");
FileMakerCalcParser.IdEscapeContext context = parser.idEscape();
FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
var testVal = visitor.VisitIdEscape(context);
Assert.AreEqual("abcdef + 123", testVal);
}
}
}
${abcdef + 123} is not a valid idEscape because it starts with ${ and ends with }, neither of which the idEscape rule accepts. The way you've defined it, idEscape only matches the stuff between ${ and }, and idEscExpr is the rule that matches the whole thing.
So you'll want your test to either invoke the idEscExpr rule instead of idEscape, or change the string you're parsing to abcdef + 123 (or have one test for each).
I have created the following grammar in ANTLR for parsing a CSV file.
grammar CSV;
file returns [List<List<string>> data]
@init {$data = new List<List<string>>();}
: (row {$data.Add($row.list);})+ EOF
;
row returns [List<string> list]
@init {$list = new List<string>();}
: a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);}
)*
(LineBreak | EOF)
;
value returns [string val]
: SimpleValue {$val = $SimpleValue.text;}
| QuotedValue
{
System.Console.WriteLine($val);
$val = $QuotedValue.text;
$val = $val.Substring(1, $val.Length - 2); // strip both surrounding quotes
$val = $val.Replace("\"\"", "\"");
}
;
Comma :
( ' '* ',' ' '*);
LineBreak :
'\r'? '\n';
SimpleValue
: ~[,\r\n"]+
;
QuotedValue
: '"' ('""' | ~'"')* '"'
;
The above grammar parses the following CSV file without error.
a,b
1,2
3,4
But when I parse the following CSV file, it throws the following error:
a,b
,2
3,4
line 2:0 extraneous input ',' expecting {<EOF>, SimpleValue, QuotedValue}
Can someone guide me on how to solve this problem?
Main Program
public List<List<string>> Parse()
{
string csvData = string.Empty;
if (string.IsNullOrEmpty(_path))
throw new ArgumentException("Path can not be empty");
try
{
csvData = File.ReadAllText(_path);
}
catch (Exception)
{
throw new FileNotFoundException(string.Format("{0} not found", _path));
}
// create an instance of the lexer
CSVLexer lexer = new CSVLexer(new AntlrInputStream(csvData));
// wrap a token-stream around the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create the parser
CSVParser parser = new CSVParser(tokens);
// invoke the entry point of our grammar
_results = parser.file().data;
return _results;
}
Update
As per Mike Lischke's answer, I have updated the row rule as below. Now I am not getting any error.
row returns [List<string> list]
@init {$list = new List<string>();}
: Comma? a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);
}
)*
(LineBreak | EOF)
;
Obviously your row rule is not flexible enough to handle missing values. You should use something like this instead:
row: Comma? value (Comma value)*;
which adds the possibility for a leading comma (actually a missing first value).
And a recommendation: don't use action code in your grammar to collect the values. Instead, create a parse listener and assign it to your parser; its methods are triggered during parsing and can do all the background work. This keeps the grammar a lot cleaner and allows you to use it independently of the actual target language.
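As a rough sketch of that recommendation, the grammar from the question might be reduced to the following once the embedded actions are removed. Nothing here is new: the rules are the ones already shown, with the questioner's (LineBreak | EOF) row terminator kept. The values would then be collected by a listener derived from the generated base listener, which is assumed here and not shown.

```antlr
grammar CSV;

// No embedded actions: values are collected by a parse listener instead.
file  : row+ EOF ;

// The optional leading comma stands in for a missing first value.
row   : Comma? value (Comma value)* (LineBreak | EOF) ;

value : SimpleValue
      | QuotedValue
      ;

Comma       : ' '* ',' ' '* ;
LineBreak   : '\r'? '\n' ;
SimpleValue : ~[,\r\n"]+ ;
QuotedValue : '"' ('""' | ~'"')* '"' ;
```

Note that Comma? only covers a missing first value; empty values in the middle or at the end of a row would presumably need further changes to the row rule.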
So I'm stuck, and nothing that I've tried has been working. I am trying to parse text between matching symbols, in this case, equal signs. I got this to work in a different parser I was testing, but have since deleted it. I tried to replicate what I could, and my attempt is shown in the code below.
QUESTION: How do I parse text between matching symbols, and/or what am I doing wrong with my current implementation?
Also, secondary to this, is there a way to get an output of all tokens found, with their names and text values? I haven't searched for this yet, so I'm sure I could find out, but I've been stuck on this first problem, so I haven't been able to test options.
All of this is being run with Antlr4, Visual Studio 2013, and Windows 10.
TestGrammar.g4:
grammar TestGrammar;
start
: title EOF
;
title
: EQUALS title EQUALS
| EQUALS ANY EQUALS // using NOTEQUALS didn't work either
| EQUALS ' ' ANY ' ' EQUALS
;
EQUALS: '=' ;
ANY : .+ ;
NOTEQUALS: ~[\r\n=]+ ;
Program.cs:
class Program
{
private static void Main(string[] args)
{
string[] testStrings =
{
"= asdf =",
"== asdf ==",
"=== asdf ===",
"=asdf=",
"==asdf==",
"===asdf==="
};
foreach (string s in testStrings)
{
AntlrInputStream inputStream = new AntlrInputStream(s);
TestGrammarLexer wikiLexer = new TestGrammarLexer(inputStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(wikiLexer);
TestGrammarParser wikiParser = new TestGrammarParser(commonTokenStream);
TestGrammarParser.StartContext startContext = wikiParser.start();
TestGrammarVisitor visitor = new TestGrammarVisitor();
visitor.VisitStart(startContext);
}
}
}
TestGrammarVisitor.cs
class TestGrammarVisitor : TestGrammarBaseVisitor<object>
{
public override object VisitStart([NotNull] TestGrammarParser.StartContext context)
{
Console.WriteLine("TestGrammarVisitor VisitStart");
context.children.OfType<TerminalNodeImpl>().ToList().ForEach(child => Visit(child));
return null;
}
private void Visit(TerminalNodeImpl node)
{
Console.WriteLine(" Visit Symbol='{0}'", node.Symbol.Text);
Console.WriteLine();
}
}
Result:
line 1:0 no viable alternative at input '=== asdf ==='
TestGrammarVisitor VisitStart
Visit Symbol='<EOF>'
g4 file:
grammar TestFlow;
options
{
language=CSharp4;
output=AST;
}
/*
* Parser Rules
*/
compileUnit : LC | BC ;
/*
* Lexer Rules
*/
BC : '/*' .*? '*/' ;
LC : '//' .*? [\r\n] ;
Code:
var input = " /*aaa*/ /// \n ";
var stream = new AntlrInputStream(input);
ITokenSource lexer = new TestFlowLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new TestFlowParser(tokens);
parser.BuildParseTree = true;
var tree = parser.compileUnit();
var n = tree.ChildCount;
var top = new List<string>();
for (int i = 0; i < n; i++) {
top.Add(tree.GetChild(i).GetText());
}
After running the above code, I get a single string in top: /*aaa*/. The single-line comment isn't caught.
What's wrong?
All parser/lexer generation errors & warnings are significant. Both options statements are invalid in the current version of Antlr4.
The runtime errors detail the root problem: unrecognizable input characters, specifically, the grammar does not handle whitespace. Add a lexer rule to fix:
WS: [ \r\n\t] -> skip ;
While not necessarily a problem, it is good form to require the parser to process all input. The lexer will generate an EOF token at the end of the source input. Fix the main rule to require the EOF:
compileUnit : ( LC | BC ) EOF ;
The correct way to allow for repetition is to use a * or + operator:
compileUnit : ( LC | BC )+ EOF ;
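Putting those changes together, the whole grammar might look like the sketch below. It assumes the invalid options block is simply dropped, so the target language would have to be selected outside the grammar (an assumption about the build setup); everything else comes from the question and the fixes above.

```antlr
grammar TestFlow;

/*
 * Parser Rules
 */
// One or more comments, followed by the end of input.
compileUnit : ( LC | BC )+ EOF ;

/*
 * Lexer Rules
 */
BC : '/*' .*? '*/' ;
LC : '//' .*? [\r\n] ;
WS : [ \r\n\t] -> skip ;   // discard whitespace between comments
```

With the input from the question, the leading space is skipped, /*aaa*/ is matched as BC, and the /// line up to the newline is matched as LC, so compileUnit should end up with both comments plus the EOF token as children.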
I am trying to develop a C# code generator using ANTLR and the StringTemplate library. AntlrWorks can generate the C# parser and lexer files without reporting any errors. However, the C# parser code is not valid and cannot be compiled in Visual Studio.
Can anyone see what is wrong with the following grammar?
grammar StrucadShape;
options {
language=CSharp3 ;
output=template;
}
@header {using System;}
@lexer::header {using System;}
@lexer::members {const int HIDDEN = Hidden;}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
public shapedef: parameters_def
-> class_temp( parameters={$parameters_def.st} )
;
parameters_def : (PARAMETERS LPAREN (p+=param) (COMMA (p+=param))* RPAREN )
-> parameter_list(params={$p})
;
param : IDENTIFIER ->Parameter_decl(p={$IDENTIFIER.text});
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
fragment EOL:'\r'|'\n'|'\r\n' ;
WS : (' '
| '\t'
| EOL)
{ $channel = HIDDEN; } ;
PARAMETERS: 'PARAMETERS';
COMMA : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
fragment LETTER :('A'..'Z' | 'a'..'z');
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER : (DIGIT)+ ;
FLOAT : (DIGIT)+'.'(DIGIT)+;
fragment DIGIT : '0'..'9' ;
This results in the following lines of code in generated parameters_def() method
List<object> list_p = null;
...snipped some code
if (list_p==null) list_p=new List<StringTemplate>();
This fails on the assignment of a List<StringTemplate> to a variable of type List<object>.
The grammar works before I add the string template rules. The error is introduced when I add the (p+=param) syntax required for list processing in the StringTemplate library.
I'll add my StringTemplate file for completeness, but I don't think this could be causing an error as it is not loaded until runtime.
group StrucadShape;
Parameter_decl(p)::= "public double <p> { get; set; }"
parameter_list(params) ::=
<<
start expressions
<params; separator="\n">
end
>>
class_temp( parameters)::=
<<
public class name
{
<parameters; separator="\n">
}
>>
A sample input string PARAMETERS( D,B,T)
Antlr Versions
Antlr3.Runtime 3.4.1.9004
AntlrWorks 1.4.3
I found a related issue on the ANTLR mailing list.
The solution was to add an ASTLabelType to the grammar options:
options {
language=CSharp3;
output=template;
ASTLabelType = StringTemplate;
}