ANTLR 4 lexer rule ambiguity - C#

So I'm building a grammar to parse C++ header files. I have only written a grammar for the header files, and I don't intend to write one for the implementation.
My problem arises when a method is implemented in the header rather than just declared:
Foo bar()
{
//some code
};
I just want to match the implementation of bar to
BLOCK
: '{' INTERNAL_BLOCK*? '}'
;
fragment INTERNAL_BLOCK
: BLOCK
| ~('}')
;
but then this interferes with any other grammar rule that includes { ... }, because BLOCK will always match everything between two braces. Is there any way to specify which token to use when there is an ambiguity?
P.S. I don't know if the grammar for BLOCK works, but you get the gist.

So, the significant parser rules would be:
method : mType mTypeName LPAREN RPAREN BLOCK ; // simplified
unknown : . ;
BLOCK tokens produced by the lexer that are not matched as part of the method rule will appear in the parse-tree in unknown context nodes. Analyze the method context nodes and ignore the unknown nodes.
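For completeness, here is a minimal C# sketch of that post-processing step. The grammar name CppHeader, the entry rule file : (method | unknown)* ;, and the input string are all assumptions made for the example; the listener only reacts to method contexts, so the unknown nodes are ignored automatically.

using Antlr4.Runtime;
using Antlr4.Runtime.Tree;

// Generated type names (CppHeaderLexer, CppHeaderParser, CppHeaderBaseListener)
// follow from the assumed grammar name "CppHeader".
class MethodCollector : CppHeaderBaseListener
{
    // Invoked for every method context; unknown contexts simply never reach this method.
    public override void EnterMethod(CppHeaderParser.MethodContext context)
    {
        System.Console.WriteLine("method: " + context.GetText());
    }
}

class Demo
{
    static void Main()
    {
        var lexer = new CppHeaderLexer(new AntlrInputStream("Foo bar()\n{\n//some code\n};"));
        var parser = new CppHeaderParser(new CommonTokenStream(lexer));
        IParseTree tree = parser.file(); // assumed entry rule
        ParseTreeWalker.Default.Walk(new MethodCollector(), tree);
    }
}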

Antlr4 picks up wrong tokens and rules

I have something that goes along the lines of:
method_declaration : protection? expression identifier LEFT_PARENTHESES (method_argument (COMMA method_argument)*)? RIGHT_PARENTHESES method_block;
expression
: ...
| ...
| identifier
| kind
;
identifier : IDENTIFIER ;
kind : ... | ... | VOID_KIND; // void for example there are more
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*;
VOID_KIND : 'void';
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
*The other rules in the method_declaration are not relevant for this question.
What happens is that when I input something such as void Start() { }
and look at the ParseTree, it seems to think void is an identifier and not a kind, and treats it as such.
I tried changing the order in which kind and identifier are written in the .g4 file, but it doesn't seem to make any difference. Why does this happen and how can I fix it?
The order in which parser rules are defined makes no difference in ANTLR. The order in which token rules are defined does though. Specifically, when multiple token rules match on the current input and produce a token of the same length, ANTLR will apply the one that is defined first in the grammar.
So in your case, you'll want to move VOID_KIND (and any other keyword rules you may have) before IDENTIFIER. So pretty much what you already tried except with the lexer rules instead of the parser rules.
PS: I'm somewhat surprised that ANTLR doesn't produce a warning about VOID_KIND being unmatchable in this case. I'm pretty sure other lexer generators would produce such a warning in cases like this.
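For illustration, the relevant part of the lexer with the keyword rule moved up would look something like this (only the rules shown in the question are repeated here):

VOID_KIND : 'void'; // keyword rules first, so 'void' becomes VOID_KIND
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*; // everything else still becomes IDENTIFIER

fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];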

Increase Performance of Semantic Predicate

I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
    foreach (string function in listOfUserDefinedFunctions)
    {
        if (this.Text == function)
        {
            return true;
        }
    }
    return false;
}
However, I've found that just having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without, ~2 sec with). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the issue is in the parser. Is there any way to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user defined function might begin with a language defined keyword.
For example: Given the language defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction() as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
: 'ActualUserKeyword'
| 'AnotherActualUserKeyword'
| 'MoreKeywords'
...
;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?
Edit: Given that there are no other identifiers in this language, I would take a different approach.
Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined.
This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree for the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
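A rough sketch of what that validation pass could look like in C#. The grammar name MyLang, a parser rule userFunction : USER_FUNCTION ;, and the way the known names are supplied are all assumptions for the example.

using System.Collections.Generic;

// Assumes the grammar is called MyLang and wraps the USER_FUNCTION token in a parser
// rule named userFunction, so the generated base listener exposes EnterUserFunction.
class UserFunctionValidator : MyLangBaseListener
{
    private readonly HashSet<string> known;

    public UserFunctionValidator(IEnumerable<string> knownFunctions)
    {
        known = new HashSet<string>(knownFunctions);
    }

    public override void EnterUserFunction(MyLangParser.UserFunctionContext context)
    {
        string name = context.GetText();
        if (!known.Contains(name))
        {
            // Report the error with the position of the offending token.
            System.Console.Error.WriteLine(
                "line {0}:{1} unknown function '{2}'",
                context.Start.Line, context.Start.Column, name);
        }
    }
}

After parsing, you would run it over the tree with something like ParseTreeWalker.Default.Walk(new UserFunctionValidator(listOfUserDefinedFunctions), tree);.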
Original answer assuming that identifiers which are not USER_FUNCTION should result in IDENTIFIER tokens.
The problem is the predicate is getting executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring your USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):
tokens {
USER_FUNCTION
}
Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.
public override IToken Emit() {
    if (_type == IDENTIFIER && IsUserDefinedFunction())
        _type = USER_FUNCTION;
    return base.Emit();
}
My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
Then the lexer defines:
USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';
In the parser listener, I strip the surrounding §'s from the function name.
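The preprocessing step itself isn't shown above; a possible shape for it, assuming the user-defined names are available as plain strings (the class and method names are made up):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class UserFunctionPreprocessor
{
    public static string MarkUserFunctions(string source, IEnumerable<string> userFunctions)
    {
        // Longest names first, so GOTO20Something is wrapped before any shorter overlap could match.
        var alternatives = userFunctions
            .OrderByDescending(name => name.Length)
            .Select(Regex.Escape);
        string pattern = "(" + string.Join("|", alternatives) + ")";

        // Surround every occurrence with § so the USER_FUNCTION lexer rule can pick it up.
        return Regex.Replace(source, pattern, "\u00A7$1\u00A7");
    }
}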

C# grammar using ANTLR 3

I'm now writing a C# grammar using ANTLR 3, based on this grammar file.
But I found some definitions I can't understand.
NUMBER:
Decimal_digits INTEGER_TYPE_SUFFIX? ;
// For the rare case where 0.ToString() etc is used.
GooBall
@after
{
    CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
    CommonToken dot = new CommonToken(DOT, ".");
    CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
    Emit(int_literal);
    Emit(dot);
    Emit(iden);
    Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
    : dil = Decimal_integer_literal d = '.' s = GooBallIdentifier
    ;
fragment GooBallIdentifier
: IdentifierStart IdentifierPart* ;
The snippet above contains the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?
It's because that's a valid expression that isn't handled by any of the other rules - I guess you'd call it something like an anonymous object, for lack of a better term. Similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values, à la GetThing().Method(), or otherwise bare.
Sorry, I found the reason in the official FAQ pages:
Now if you want to add the '..' range operator so that 1..10 makes sense, ANTLR has trouble distinguishing 1. (the start of the range) from the float 1. without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:

ANTLR - basic grammar including unexpected characters?

I've got a really simple ANTLR grammar that I'm trying to get working, but failing miserably at the moment. Would really appreciate some pointers on this...
root : (keyword|ignore)*;
keyword : KEYWORD;
ignore : IGNORE;
KEYWORD : ABBRV|WORD;
fragment WORD : ALPHA+;
fragment ALPHA : 'a'..'z'|'A'..'Z';
fragment ABBRV : WORD?('.'WORD);
IGNORE : .{ Skip(); };
With the following test input:
"some ASP.NET and .NET stuff. that work."
I'm wanting a tree that is just a list of keyword nodes,
"some", "ASP.NET", "and", ".NET", "stuff", "that", "work"
At the moment I get
"some", "ASP.NET", "and", ".NET", "stuff. that",
(for some reason "." appears within the last keyword, and it misses "work").
If I change the ABBRV clause to
fragment ABBRV : ('.'WORD);
then that works fine, but I get keyword (asp) and keyword (.net) separately - whereas I need them as a single token.
Any help you can give would be much appreciated.
There are a couple of things. First, your ignore parser rule will never be triggered and does not even need to appear in this grammar (you can also leave it out of the root rule), because the IGNORE lexer rule calls Skip() and therefore never produces tokens for the parser to match. Of course, while you are debugging it is much easier to test with the ignore rule in place (and the Skip(); dropped from the IGNORE lexer rule).
Now to explain the test data: since none of the lexer rules match WORD '.' at the end of the input, the tail of your test data is dropped because of the period right after the text. If you place a space between 'work' and the period, the last word will appear and the period will not, which is what you want. The lexer does not know what to do with 'work.' when the input ends there. If you add another word at the end (with a space between the period and the new word), then 'work.' is passed out of the lexer as a single IGNORE token. I would have expected the word to be passed through and only the period to end up in the IGNORE token.
I decided to try to solve your problem with an ANTLR3 Grammar. This is what I came up with, with some strings attached:
Your spec does not contain many rules, and as a result, my grammar is not very thorough.
Consider adding to KEYW to match more tokens.
I don't have C# compatible ANTLR right now. Capitalize the 'skip()' to make it compatible.
grammar TestSplitter;
start: (KEYW DELIM!?)* ;
KEYW: ('a'..'z'|'A'..'Z'|'.')+ ;
DELIM: '.'? ' '+ ;
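For reference, driving a grammar like this from C# with the ANTLR 3 runtime usually looks roughly like the following. This assumes the grammar is generated with one of the C# targets (e.g. language=CSharp3), so TestSplitterLexer and TestSplitterParser are the generated class names and the entry method is named after the start rule.

using Antlr.Runtime;

class Program
{
    static void Main()
    {
        var input = new ANTLRStringStream("some ASP.NET and .NET stuff. that work.");
        var lexer = new TestSplitterLexer(input);
        var tokens = new CommonTokenStream(lexer);
        var parser = new TestSplitterParser(tokens);
        parser.start(); // entry rule from the grammar above
    }
}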

ANTLR - Writing a tree grammar for an AST

I have an AST output for some Lua code by my grammar file, which currently does the parsing and lexing for me. I want to add a tree grammar to this, but since I'm using C# I'm not sure how to do it. What's the basic process for generating tree grammar code when you already have a parser and lexer written?
UPDATE: I have the following grammar file:
tree grammar LuaGrammar;
options {
backtrack=true;
language=CSharp2;
//output=AST;
tokenVocab=Lua;
filter=true;
ASTLabelType=CommonTree;
}
@lexer::namespace{/*my namespace*/}
@parser::namespace{/*my namespace*/}
dummyRule
: ^('=' x=. y=.) {};
placed in the same directory as my main grammar file, which generates fine. However, when trying to compile this I get the following errors:
[02:54:06] error(143): C:\Users\RCIX\Desktop\AguaLua\Project\trunk\AguaLua\AguaLua\ANTLR Data\LuaGrammar.g:12:18: unknown or invalid action scope for tree grammar: lexer
[02:54:06] error(143): C:\Users\RCIX\Desktop\AguaLua\Project\trunk\AguaLua\AguaLua\ANTLR Data\LuaGrammar.g:13:19: unknown or invalid action scope for tree grammar: parser
Am I on the right track or totally off?
Well, going back to my usual example of a calculator grammar :)
This is how you would declare your tree walker class:
class CalcTreeShaker extends TreeParser;

expr returns [float r]
{
    float a, b;
    r = 0;
}
    : #(PLUS a=expr b=expr) {r = a+b;}
    | #(STAR a=expr b=expr) {r = a*b;}
    | i:INT {r = Convert.ToSingle(i.getText());}
    ;
Here we have a tree rule called expr. Tree walkers are very similar to parser grammars.
The big difference is that while a parser grammar has to match the input exactly, a tree grammar only has to match a portion of the tree.
In the expr rule we can see it matches any tree that has the tokens PLUS or STAR or INT.
We can see we are matching trees because we are using Antlr's Tree syntax #(...).
The PLUS and STAR trees also each match two expr rules. Each expr result is assigned to a name so we can use it to evaluate the expression. Similar to parser grammars, we can put C# code into blocks defined by {...}.
Also note that this example shows how to return a value from a tree walker rule, using the returns [...] syntax.
To call the tree walker you create it and then call its top-level rule. I'll copy this from the ANTLR example :)
// Get the ast from your parser.
CommonAST t = (CommonAST)parser.getAST();
// Create the tree shaker.
CalcTreeShaker walker = new CalcTreeShaker();
CalcParser.initializeASTFactory(walker.getASTFactory());
// pass the ast to the walker and call the top level rule.
float r = walker.expr(t);
I have not encountered this error, but there are two things I would try.
1) Remove the @lexer and @parser namespace lines.
2) If they are necessary, then move them to after the tokens {...} section of your grammar, i.e. just before the rules.
