Parsing: ANTLR for .NET - c#

I am trying to parse the following text:
<<! notes, Test!>>
Grammar:
grammar Hello;
prog: stat+;
stat: DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr: NOTES value=VAR_VALUE # delim_body ;
VAR_VALUE : [ a-Z A-Z 0-9 ! ];
NOTES : 'notes,'
| ' notes,';
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
Error:
line 1:12 token recognition error at: '>'
line 1:13 token recognition error at: '>'
line 1:10 mismatched input ' !' expecting VAR_VALUE
(NOTE: Added DELIMITER defs since I forgot them earlier)

Try this:
grammar Hello;
prog : stat+ EOF ;
stat : DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr : NOTES COMMA value=VAR_VALUE # delim_body ;
NOTES : 'notes' ; // listed before VAR_VALUE so the keyword wins the equal-length tie
VAR_VALUE : ANBang* AlphaNum ;
COMMA : ',' ;
WS : [ \t\r\n]+ -> skip ;
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
fragment ANBang : AlphaNum | Bang ;
fragment AlphaNum : [a-zA-Z0-9] ;
fragment Bang : '!' ;
Ideally, the lexer rules should be mutually unambiguous. Here VAR_VALUE is redefined so that it can never end with a !, which prevents the lexer from consuming the trailing ! as part of VAR_VALUE instead of leaving it for DELIMETER_CLOSE. Of course, that presumes the redefinition is acceptable; if not, a more involved solution will be required.
Also, as a general principle, skip anything that is not syntactically significant to the parse (here, whitespace).
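To see why the trailing ! matters, here is a minimal Python sketch (not ANTLR itself) that mimics how a longest-match lexer would split Test!>>. The GREEDY pattern is a hypothetical variant that allows ! anywhere; TRIMMED mirrors the ANBang* AlphaNum shape. The names GREEDY, TRIMMED and split_value are made up for this illustration.

```python
import re

# Hypothetical greedy variant: '!' may appear anywhere, including at the end.
GREEDY = re.compile(r"[A-Za-z0-9!]+")
# Mirrors VAR_VALUE : ANBang* AlphaNum ; the match must end on a letter or digit.
TRIMMED = re.compile(r"[A-Za-z0-9!]*[A-Za-z0-9]")

def split_value(pattern, text):
    """Return (lexeme, remaining input) after matching pattern at the start of text."""
    end = pattern.match(text).end()
    return text[:end], text[end:]

print(split_value(GREEDY, "Test!>>"))   # ('Test!', '>>')  -> nothing left to form '!>>'
print(split_value(TRIMMED, "Test!>>"))  # ('Test', '!>>')  -> '!>>' is still available
```

With the greedy pattern the remaining input is just >>, which matches no rule, and that is consistent with the two token recognition errors at '>' in the question.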

Related

How to avoid the spaces between two texts while parsing using antlr

Please help me with the following case. I have a line containing several pieces of text, and I need to parse each word in the line according to some rules. Below is my example input line:
## KEYWORD = MyName MyAliasName
Below are my parsing rules:
rule1:
Keyword name = identifier{ $name.str;} (' '* diffName = identifierTest { $diffName.str; })?
;
identifier
returns [string str]
@init{$str="";}:
i=Word{$str+=$i.text;} (i=(Number | Word ) {$str+=$i.text;})*
;
Keyword: SPACE* START SPACE* 'KEYWORD' SPACE* EQUAL SPACE*;
Number:DIGIT+;
Word:LETTER+;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: [0-9];
fragment SPACE: ' ' | '\t';
fragment START: '##';
fragment EQUAL: '=';
The "rule1" rule says that the MyName text is mandatory and MyAliasName is optional.
The "identifier" rule says that a name can only start with a letter or an underscore.
The Problem
If there is exactly one space between MyName and MyAliasName, the above rules work fine. If there is more than one space between them, the first identifier rule reads both texts together as MyNameMyAliasName (the spaces are removed automatically). Why? I don't know what I'm doing wrong!
Whenever the optional text is present, I have to overwrite the name with the alias name. Please help, and thanks in advance.
This grammar should solve your problem:
grammar TestGrammar;
rule1:
keyword name=IDENT{ System.out.println($name.text);} ( diffName = IDENT { System.out.println($diffName.text); })?
;
keyword: START KEYWORD EQUAL;
KEYWORD : 'KEYWORD' ;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: '0'..'9';
IDENT : LETTER (LETTER|DIGIT)*;
START : '##';
EQUAL : '=';
SPACE : [ \t]+ -> skip;
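The idea behind that grammar is that whitespace is dropped in the lexer, so the parser only ever sees START KEYWORD EQUAL IDENT and an optional second IDENT, however many spaces separate them. A rough Python sketch of the same flow, assuming a hand-written tokenizer in place of the generated lexer (tokenize and effective_name are hypothetical helper names, and the lookahead on KEYWORD stands in for ANTLR's longest-match rule):

```python
import re

# Token patterns in the same priority order as the lexer rules above.
TOKEN = re.compile(
    r"(?P<KEYWORD>KEYWORD(?![A-Za-z0-9_]))"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<START>##)"
    r"|(?P<EQUAL>=)"
    r"|(?P<SPACE>[ \t]+)"
)

def tokenize(line):
    """Split a line into (kind, text) pairs, dropping whitespace like '-> skip'."""
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character at column {pos}")
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def effective_name(line):
    """Return the alias if one is present, otherwise the name (rule1 above)."""
    tokens = tokenize(line)
    kinds = [kind for kind, _ in tokens]
    head = ["START", "KEYWORD", "EQUAL", "IDENT"]
    if kinds not in (head, head + ["IDENT"]):
        raise ValueError("line does not match rule1")
    return tokens[-1][1]  # the last IDENT is the alias when present, else the name

print(effective_name("## KEYWORD = MyName MyAliasName"))
print(effective_name("## KEYWORD = MyName      MyAliasName"))
print(effective_name("## KEYWORD = MyName"))
```

The first two calls give the same result because the number of spaces never reaches the parsing step.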

ANTLR C# missing X at X

I'm using ANTLR with C# to create a simple parser for C-like structs. The runtime version is 4.7.
The grammar looks like this:
structDef : STRUCT ID OPENBLOCK (fieldDef)+ CLOSEBLOCK ;
fieldDef : (namespaceQualifier)+ ID ID SEMICOLON ;
namespaceQualifier : ID DOT ;
/*
* Lexer Rules
*/
ID : [a-zA-Z_] [a-zA-Z0-9_]* ;
STRUCT : 'struct' ;
NAMESPACE : 'namespace' ;
OPENBLOCK : '{' ;
CLOSEBLOCK : '}' ;
DOT : '.' ;
SEMICOLON : ';' ;
WHITESPACE : (' '|'\t')+ -> skip;
Now when I run the parser like this:
var test = "struct Stest { type name; }";
var lexer = new OdefGrammarLexer(new AntlrInputStream(test));
var tokenStream = new CommonTokenStream(lexer);
var parser = new OdefGrammarParser(tokenStream);
var ctx = parser.structDef();
Console.Out.WriteLine(ctx.ToString());
I get an error output:
line 1:0 missing 'struct' at 'struct'
line 1:7 extraneous input 'Stest' expecting '{'
line 1:20 missing '.' at 'name'
line 1:24 mismatched input ';' expecting '.'
The first error in the output is particularly interesting: it seems that the parser fails to find a match where it should. I suspect a problem with string locale/encoding, but I'm not sure how to tackle that in ANTLR.
Any help is much appreciated.
The ID rule must come after the STRUCT and NAMESPACE rules (and after any other rule it might collide with): when an input matches several lexer rules with the same length, the rule defined first wins.
Your character-set notation for ID is valid in ANTLR 4; the equivalent form using explicit ranges is:
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A' .. 'Z' | '0'..'9' | '_')* ;
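The tie-breaking rule can be illustrated with a small Python simulation. This is only a sketch of the documented behaviour (longest match first, then rule order), not the ANTLR runtime; first_token is a hypothetical helper.

```python
import re

def first_token(rules, text):
    """Return (rule name, lexeme) for the first token of text.

    The longest match wins; on a tie, the rule listed earlier wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and len(m.group()) > best_len:  # strict '>' keeps the earlier rule on ties
            best_name, best_len = name, len(m.group())
    return best_name, text[:best_len]

ID = ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*")
STRUCT = ("STRUCT", r"struct")

print(first_token([ID, STRUCT], "struct Stest"))  # ('ID', 'struct'): keyword is shadowed
print(first_token([STRUCT, ID], "struct Stest"))  # ('STRUCT', 'struct')
print(first_token([STRUCT, ID], "structs"))       # ('ID', 'structs'): longer match wins
```

The first call corresponds to the grammar in the question: the parser receives an ID where it expects STRUCT, which would explain the odd "missing 'struct' at 'struct'" message.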

Antlr4 Grammar issue (not entirely parsing)

I am new to ANTLR and I am trying to get this grammar working:
grammar TemplateGrammar;
//Parser Rules
start
: block
| statement
| expression
| parExpression
| primary
;
block
: LBRACE statement* RBRACE
;
statement
: block
| IF parExpression statement (ELSE statement)?
| expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: primary #PRIMARY
| number op=('*'|'/') number #MULDIV
| number op=('+'|'-') number #ADDSUB
| number op=('>='|'<='|'>'|'<') number #GRLWOREQUALS
| expression op=('='|'!=') expression #EQDIFF
;
primary
: parExpression
| literal
;
literal
: number #NumberLiteral
| string #StringLiteral
| columnName #ColumnNameLiteral
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
string
: '"' StringChars? '"'
;
columnName
: '[' StringChars? ']'
;
// Lexer Rules
//Integers
DecimalIntegerLiteral
: DecimalNumeral
;
fragment
DecimalNumeral
: '0'
| NonZeroDigit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitOrUnderscore* Digit)?
;
fragment
Digit
: '0'
| NonZeroDigit
;
fragment
NonZeroDigit
: [1-9]
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
//Floating point
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
//Strings
StringChars
: StringChar+
;
fragment
StringChar
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
;
//Separators
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
COMMA : ',';
DOT : '.';
//Keywords
IF : 'IF';
ELSE : 'ELSE';
THEN : 'THEN';
//Operators
PLUS : '+';
MINUS : '-';
MULTIPLY : '*';
DIVIDE : '/';
EQUALS : '=';
DIFFERENT : '!=';
GRTHAN : '>';
GROREQUALS : '>=';
LWTHAN : '<';
LWOREQUALS : '<=';
AND : '&';
OR : '|';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;
When I put "Test" in the input, it works and returns the string "Test".
Here is what I get in the IParseTree when I put "Test" in the input:
"(start (statement (expression (primary (literal (string \" Test \"))))))"
But when I put [Test] (which is almost the same as "Test", but with brackets instead of quotes), the parser does not recognize the token.
Here is the IParseTree I get when I put [Test]:
"(start [Test])"
The same goes for numbers: it recognizes lone numbers such as 1, 123, 12.5, etc., but not expressions like 1+2.
Do you have any idea why the parser isn't recognizing the columnName rule but works well with the string rule?
Probably because "StringChar" is defined incorrectly for your purpose: it does not exclude "]", so the closing bracket gets consumed as part of the StringChars token.
Perhaps you want to define StringChar as:
fragment
StringChar
: ~["\\\]]
| EscapeSequence
;
If it were my grammar, I'd define a QuotedStringChar as you have for quoted strings, and define BracketStringChar as ~[\]\\] to use for your bracket column names.
Welcome to debugging grammars at the lexical level, and defining different types of "quotes" for different types of strings. This is pretty common. (You should see Ruby, where you can define the string quote at the beginning of the string, ick.)
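A quick way to check this kind of lexical problem outside ANTLR is to try the character classes as regular expressions. The sketch below assumes Python's re module as a stand-in for the lexer; STRING_CHARS transcribes the StringChars rule from the question, and BRACKET_STRING is a hypothetical whole-token version of the bracket-specific rule suggested above.

```python
import re

# StringChars from the question: anything except '"' and '\', or an escape sequence.
STRING_CHARS = re.compile(r"""(?:[^"\\]|\\[btnfr"'\\])+""")
# Hypothetical bracket-specific token: '[' ... ']' with ']' and '\' excluded inside.
BRACKET_STRING = re.compile(r"""\[(?:[^\]\\]|\\[btnfr"'\\])*\]""")

# The original rule swallows the brackets, so '[' and ']' never become separate tokens.
print(STRING_CHARS.match("[Test]").group())          # [Test]
# A leading quote is excluded, which is why the quoted form happened to work.
print(STRING_CHARS.match('"Test"'))                  # None
# The bracket-specific token stops at the first closing bracket.
print(BRACKET_STRING.match("[Test] = 1").group())    # [Test]
```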
I finally got it working by putting:
QuotedStringChars
: '"' ~[\"]+ '"'
;
BracketStringChars
: '[' ~[\]]+ ']'
;
This takes any characters between quotes or brackets. Then:
primary
: literal #PrimLiteral
| number #PrimNumber
;
literal
: QuotedStringChars #OneString
| BracketStringChars #ColumnName
| number #NUMBER
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
The literal rule helps to distinguish quoted strings, bracketed strings and numbers.
There is a duplication of number in the primary and literal rules because I need different behavior in my application for each one.
I managed this thanks to the good advice of Ira Baxter :)
Hope this will help other newbies to ANTLR like me to get a better understanding :)

how to resolve simple ambiguity

I just started using Antlr and am stuck. I have the below grammar and am trying to resolve the ambiguity to parse input like Field:ValueString.
expression : Field ':' ValueString;
Field : Letter LetterOrDigit*;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
Suppose a:b is passed to the grammar: a and b are both identified as Field. How do I resolve this in ANTLR4 (C#)?
You can use a semantic predicate in your lexer rules to perform lookahead (or lookbehind) without consuming characters (see "ANTLR4 negative lookahead in lexer").
In your case, to remove the ambiguity, you can check whether the character after the Field rule is : or whether the character before the ValueString is :.
In the first case:
expression : Field ':' ValueString;
Field : Letter LetterOrDigit* {_input.LA(1) == ':'}?;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
In the second one (please note that the order of Field and ValueString has been swapped):
expression : Field ':' ValueString;
ValueString : {_input.LA(-1) == ':'}? ~[:];
Field : Letter LetterOrDigit*;
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
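To get a feel for what the lookahead predicate in the first variant does, here is a rough Python analogue in which a regex lookahead plays the role of {_input.LA(1) == ':'}?. It is a sketch under simplifying assumptions: whitespace is not handled, the token name COLON is made up, and tokenize is a hypothetical helper rather than anything generated by ANTLR.

```python
import re

# Field only matches when the next character is ':' (the predicate's condition).
FIELD = re.compile(r"[A-Za-z][A-Za-z0-9]*(?=:)")
# ValueString : ~[:] ; a single character that is not ':'.
VALUE = re.compile(r"[^:]")

def tokenize(text):
    """Tokenize input such as 'a:b' into (kind, text) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == ":":
            tokens.append(("COLON", ":"))
            pos += 1
            continue
        # Field is tried first: it is never shorter than the one-character
        # ValueString match, and it is also the earlier rule.
        m, kind = FIELD.match(text, pos), "Field"
        if m is None:
            m, kind = VALUE.match(text, pos), "ValueString"
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens

print(tokenize("a:b"))
# [('Field', 'a'), ('COLON', ':'), ('ValueString', 'b')]
```

Note that ValueString still matches a single character, so a longer value such as bc would come out as two ValueString tokens in this sketch.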
Also consider using the fragment keyword for Letter and LetterOrDigit:
fragment Letter : [a-zA-Z];
fragment LetterOrDigit : [a-zA-Z0-9];
"[With fragment keyword] You can also define rules that are not tokens but rather aid in the recognition of tokens. These fragment rules do not result in tokens visible to the parser." (source https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules)
A way to resolve this without lookahead is to simply define a parser rule that accepts either of those two lexer tokens:
expression : Field ':' value;
value : Field | ValueString;
Field : Letter LetterOrDigit*;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
The parser will work as expected and the grammar is kept simple, but you may need to add a method to your visitor or listener implementation.

ANTLR grammar matches incompatible rule instead of throwing NoViableAltException

I have the following ANTLR grammar that forms part of a larger expression parser:
grammar ProblemTest;
atom : constant
| propertyname;
constant: (INT+ | BOOL | STRING | DATETIME);
propertyname
: IDENTIFIER ('/' IDENTIFIER)*;
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
INT
: '0'..'9'+;
BOOL : ('true' | 'false');
DATETIME
: 'datetime\'' '0'..'9'+ '-' '0'..'9'+ '-' + '0'..'9'+ 'T' '0'..'9'+ ':' '0'..'9'+ (':' '0'..'9'+ ('.' '0'..'9'+)*)* '\'';
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
If I invoke this in the interpreter from within ANTLRWorks with
'Hello\\World'
then this is interpreted as a propertyname instead of a constant. The same happens if I compile this in C# and run it in a test harness, so it is not a problem with the dodgy interpreter.
I'm sure I am missing something really obvious... but why is this happening? It's clear there is a problem with the string matcher, but I would have thought at the very least that the fact that IDENTIFIER does not match the ' character would mean that this would throw a NoViableAltException instead of just falling through?
First, neither ANTLRWorks nor antlr-3.5-complete.jar can be used to generate code for the C# targets. They might produce files ending in .cs, and those files might even compile, but those files will not be the same as the files produced by the C# port of the ANTLR Tool (Antlr3.exe) or the recommended MSBuild integration. Make sure you are producing your generated parser by one of the tested methods.
Second, INT will never be matched. Since IDENTIFIER appears before INT in the grammar, and every sequence matching '0'..'9'+ matches both IDENTIFIER and INT, the lexer will always take the option that appears first (IDENTIFIER).
