Antlr4 picks up wrong tokens and rules - c#

I have a grammar that looks something like this:
method_declaration : protection? expression identifier LEFT_PARENTHESES (method_argument (COMMA method_argument)*)? RIGHT_PARENTHESES method_block;
expression
: ...
| ...
| identifier
| kind
;
identifier : IDENTIFIER ;
kind : ... | ... | VOID_KIND; // 'void', for example; there are more
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*;
VOID_KIND : 'void';
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
*The other rules in method_declaration are not relevant to this question.
What happens is that when I input something such as void Start() { }
and look at the ParseTree, it seems to think void is an identifier and not a kind, and treats it as such.
I tried changing the order in which kind and identifier are written in the .g4 file, but it doesn't seem to make any difference. Why does this happen, and how can I fix it?

The order in which parser rules are defined makes no difference in ANTLR. The order in which token rules are defined does though. Specifically, when multiple token rules match on the current input and produce a token of the same length, ANTLR will apply the one that is defined first in the grammar.
So in your case, you'll want to move VOID_KIND (and any other keyword rules you may have) before IDENTIFIER. So pretty much what you already tried except with the lexer rules instead of the parser rules.
PS: I'm somewhat surprised that ANTLR doesn't produce a warning about VOID_KIND being unmatchable in this case. I'm pretty sure other lexer generators would produce such a warning in cases like this.
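A minimal sketch of the reordered lexer section, using the rule names from the question (only the order changes):

```antlr
// Keyword rules first: when 'void' matches both VOID_KIND and IDENTIFIER
// at the same length, the rule defined first in the grammar wins.
VOID_KIND  : 'void';
// ...any other keyword rules...
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*;

fragment LETTER : [a-zA-Z];
fragment DIGIT  : [0-9];
```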

Related

Handle Chain Expressions (Antlr)

I am making an interpreter in Antlr and I am having trouble with the following:
One of my rules states:
expression
: { .. }
| {... }
| identifier
| expression DOT expression
| get_index
| invoke_method
| { ... }
;
I thought this way I'd be able to evaluate expressions such as "a.b[2].Run()" or whatever; the idea is being able to combine variables, indexing, and method calls. All three syntaxes are in place, but how do I chain them? Is my approach correct, or is there a better way? (The example above just throws those in with no particular order, for the sake of clarity; it's not the actual grammar, but I can assure you the other three rules (identifier, get_index, and invoke_method) are defined and proper. identifier is a single identifier, not a chain.)
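Not from the question itself, but a common shape for this problem: make the postfix operations left-recursive on expression, so that indexing and invocation apply to whatever expression precedes them rather than only to a bare identifier. A sketch, where the token names LBRACKET, RBRACKET, and the expression_list rule are assumptions:

```antlr
// Left-recursive postfix chain: each suffix applies to the whole
// expression on its left, so a.b[2].Run() parses as (((a).b)[2]).Run()
expression
    : expression DOT identifier                  // member access
    | expression LBRACKET expression RBRACKET    // indexing
    | expression LPAREN expression_list? RPAREN  // invocation
    | identifier
    ;
```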

Negate POSIX or Unicode character classes in ANTLR Lexer (C#)

ANTLR build system:
Visual Studio 2017, C#
NuGet packages: Antlr4.CodeGenerator 4.6.5-rc002, Antlr4.Runtime 4.6.5-rc002
I've got the following Flex rule which I'd like to convert to ANTLR 4:
NOT_NAME [^[:alpha:]_*\n]+
From what I've found so far, ANTLR doesn't support POSIX or Unicode character classes directly, but you can create fragments to include them in your lexer grammar.
In my attempt to translate the above rule I've already created the following fragments:
fragment ALPHA: L | Nl;
fragment L : Ll | Lm | Lo | Lt | Lu ;
fragment Ll : '\u0061'..'\u007A' ; /* rest omitted for brevity */
fragment Lm : '\u02B0'..'\u02C1' ; /* rest omitted for brevity */
fragment Lo : '\u00AA' | '\u00BA' ; /* rest omitted for brevity */
fragment Lt : '\u01C5' | '\u01C8' ; /* rest omitted for brevity */
fragment Lu : '\u0041'..'\u005A' ; /* rest omitted for brevity */
fragment Nl : '\u16EE'..'\u16F0' ; /* rest omitted for brevity */
The ANTLR rule I had thought would work was the following:
NOT_NAME: ~(ALPHA | '_' | '*' | '\n')+;
but it gives me the following error:
rule reference 'ALPHA' is not currently supported in a set
The problem seems to be the negation as rules without negation seem to work without problems.
I know that it works if I inline all the above fragments into one rule but this appears insanely complicated to me - especially given the pretty simple and straightforward Flex rule.
I must be missing some elegant trick that you will possibly point me to.
Unicode character-set support doesn't depend on the target runtime: the ANTLR4 tool itself converts the grammars and parses the charset definitions. You should be able to use any of the Unicode classes as laid out in the lexer documentation. I'm not sure, however, whether you can negate such a block with the tilde; at least there is the option to use \P{...} to negate a character class (also mentioned in that document).
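For illustration, a sketch assuming ANTLR 4.7 or later, where lexer character sets accept Unicode property escapes (whether ~[...] accepts a property escape should be verified against your tool version; \P{...} is the documented negated form):

```antlr
// \p{Alpha} plays the role of the Flex [:alpha:] class;
// ~[...] negates the whole set, mirroring the Flex [^...] form.
NOT_NAME : ~[\p{Alpha}_*\n]+ ;
```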

Antlr 4 Lexer rule ambiguity

So I'm building a grammar to parse c++ header files.
I have only written grammar for the header files, and I don't intend to write any for the implementation.
My problem arises when a method is implemented in the header rather than just declared:
Foo bar()
{
//some code
};
I just want to match the implementation of bar to
BLOCK
: '{' INTERNAL_BLOCK*? '}'
;
fragment INTERNAL_BLOCK
: BLOCK
| ~('}')
;
but then this interferes with any other grammar rule that includes { ... }, because BLOCK will always match whatever is between two braces. Is there any way to specify which token to use when there is an ambiguity?
P.S. I don't know if the grammar for BLOCK works, but you get the gist.
So, the significant parser rules would be:
method : mType mTypeName LPAREN RPAREN BLOCK ; // simplified
unknown : . ;
BLOCK tokens produced by the lexer that are not matched as part of the method rule will appear in the parse-tree in unknown context nodes. Analyze the method context nodes and ignore the unknown nodes.
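A minimal sketch of how those rules could fit together in a top-level rule (the name translationUnit is an assumption):

```antlr
// Every token is consumed: methods where they parse,
// single-token 'unknown' nodes everywhere else.
translationUnit : (method | unknown)* EOF ;
method  : mType mTypeName LPAREN RPAREN BLOCK ; // simplified, as above
unknown : . ;
```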

Increase Performance of Semantic Predicate

I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
foreach (string function in listOfUserDefinedFunctions)
{
if (this.Text == function)
{
return true;
}
}
return false;
}
However, I've found that just having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without it, ~2 sec with it). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the cost is in the predicate mechanism itself, not its body. Is there any way to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user defined function might begin with a language defined keyword.
For example: Given the language defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction() as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
: 'ActualUserKeyword'
| 'AnotherActualUserKeyword'
| 'MoreKeywords'
...
;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?
Edit: Given that there are no other identifiers in this language, I would take a different approach.
Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined.
This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree from the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
Original answer assuming that identifiers which are not USER_FUNCTION should result in IDENTIFIER tokens.
The problem is the predicate is getting executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring your USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):
tokens {
USER_FUNCTION
}
Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.
public override IToken Emit() {
if (_type == IDENTIFIER && IsUserDefinedFunction())
_type = USER_FUNCTION;
return base.Emit();
}
My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
Then the lexer defines:
USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';
In the parser listener, I strip the surrounding §'s from the function name.

C# grammar using ANTLR 3

I'm now writing a C# grammar using ANTLR 3, based on this grammar file.
But, I found some definitions I can't understand.
NUMBER:
Decimal_digits INTEGER_TYPE_SUFFIX? ;
// For the rare case where 0.ToString() etc is used.
GooBall
@after
{
CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
CommonToken dot = new CommonToken(DOT, ".");
CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
Emit(int_literal);
Emit(dot);
Emit(iden);
Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
:
dil = Decimal_integer_literal d = '.' s=GooBallIdentifier
;
fragment GooBallIdentifier
: IdentifierStart IdentifierPart* ;
The above fragments contain the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?
It's because that's a valid expression that's not handled by any of the other rules; I guess you'd call it something like an anonymous object, for lack of a better term. Similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values, as in GetThing().Method(), or otherwise bare.
Sorry. I found the reason from the official FAQ pages.
Now if you want to add '..' range operator so 1..10 makes sense, ANTLR has trouble distinguishing 1. (start of the range) from 1. the float without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:
