Negate POSIX or Unicode character classes in ANTLR Lexer (C#)

Negate POSIX or Unicode character classes in ANTLR Lexer (C#) - c#

ANTLR build system:
Visual Studio 2017, C#
NuGet packages: Antlr4.CodeGenerator 4.6.5-rc002, Antlr4.Runtime 4.6.5-rc002
I've got the following Flex rule which I'd like to convert to ANTLR 4:
NOT_NAME [^[:alpha:]_*\n]+
I think that I've already found out that ANTLR doesn't support POSIX or Unicode character classes but that you can create fragments to include them into your lexer grammar.
In my attempt to translate the above rule I've already created the following fragments:
fragment ALPHA: L | Nl;
fragment L : Ll | Lm | Lo | Lt | Lu ;
fragment Ll : '\u0061'..'\u007A' ; /* rest omitted for brevity */
fragment Lm : '\u02B0'..'\u02C1' ; /* rest omitted for brevity */
fragment Lo : '\u00AA' | '\u00BA' ; /* rest omitted for brevity */
fragment Lt : '\u01C5' | '\u01C8' ; /* rest omitted for brevity */
fragment Lu : '\u0041'..'\u005A' ; /* rest omitted for brevity */
fragment Nl : '\u16EE'..'\u16F0' ; /* rest omitted for brevity */
The ANTLR rule I had thought would work was the following:
NOT_NAME: ~(ALPHA | '_' | '*' | '\n')+;
but it gives me the following error:
rule reference 'ALPHA' is not currently supported in a set
The problem seems to be the negation as rules without negation seem to work without problems.
I know that it works if I inline all the above fragments into one rule but this appears insanely complicated to me - especially given the pretty simple and straightforward Flex rule.
I must be missing some elegant trick that you will possibly point me to.

The Unicode characterset support doesn't depend on the target runtime. The ANTLR4 tool itself converts the grammars and also parses the charset definitions. You should be able to use any of the Unicode classes as laid out in the lexer documentation. I'm not sure however if you can negate that block with the tilde. At least there is the option to use \P... to negate a char class (also mention in that document).

Related

Antlr4 picks up wrong tokens and rules

I have something that goes alongside:
method_declaration : protection? expression identifier LEFT_PARENTHESES (method_argument (COMMA method_argument)*)? RIGHT_PARENTHESES method_block;
expression
: ...
| ...
| identifier
| kind
;
identifier : IDENTIFIER ;
kind : ... | ... | VOID_KIND; // void for example there are more
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*;
VOID_KIND : 'void';
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
*The other rules on the method_declaration are not relavent for this question
What happens is that when I input something such as void Start() { }
and look at the ParseTree, it seems to think void is an identifier and not a kind, and treats it as such.
I tried changing the order in which kind and identifier are written in the .g4 file... but it doesn't quite seem to make any difference... why does this happen and how can I fix it?

The order in which parser rules are defined makes no difference in ANTLR. The order in which token rules are defined does though. Specifically, when multiple token rules match on the current input and produce a token of the same length, ANTLR will apply the one that is defined first in the grammar.
So in your case, you'll want to move VOID_KIND (and any other keyword rules you may have) before IDENTIFIER. So pretty much what you already tried except with the lexer rules instead of the parser rules.
PS: I'm somewhat surprised that ANTLR doesn't produce a warning about VOID_KIND being unmatchable in this case. I'm pretty sure other lexer generators would produce such a warning in cases like this.

Antlr 4 Lexer rule ambiguity

So I'm building a grammar to parse c++ header files.
I have only written grammar for the header files, and I don't intend to write any for the implementation.
My problem is that if a method is implemented in the header rather than just defined.
Foo bar()
{
//some code
};
I just want to match the implementation of bar to
BLOCK
: '{' INTERNAL_BLOCK*? '}'
;
fragment INTERNAL_BLOCK
: BLOCK
| ~('}')
;
but then this interferes for anything other grammar that includes { ... } because this will always match what is in between two braces. Is there anyway to specify which token to use when there is an ambiguity?
p.s. I don't know if the grammar for BLOCK works but you get the gist.

So, the significant parser rules would be:
method : mType mTypeName LPAREN RPAREN BLOCK ; // simplified
unknown : . ;
BLOCK tokens produced by the lexer that are not matched as part of the method rule will appear in the parse-tree in unknown context nodes. Analyze the method context nodes and ignore the unknown nodes.

how to connect ANTLRWorks output to c# project?

i have written a grammar for while in ANTLRWorks 1.5.2.
i also added some actions so when i debug my code with a while code it will show 3 address code in output of ANTLRWorks.
my grammar is like that:
NAME:
LETTER (LETTER | DIGIT | '_')*;
NUMBER:
DIGIT+; // just integers
fragment DIGIT:
'0'..'9';
fragment LETTER:
'A'..'Z' | 'a'..'z';
RELATION:
'<' | '<=' | '==' | '>=' | '>' | '!=' ;
WHITESPACE:
(' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; };
and i generate my grammar and i have whileParser.cs and whileLexer.cs in output folder.
now i want to add my grammar to a c# project.
i want to get input from user and then show output of my grammar to them.
and i dont know how to add .g file and output classes to a c# project.
i am using visual studio 2013.
can anybody help me?

You grammar contains Java code blocks, you need to translate them to C# first. Actually, it may be a good opportunity for you to use ANTLR 4 instead and/or to switch to a parse tree approach. I should mention there's an ANTLRWorks 2 version, mostly for ANTLR 4, should you need it.
Anyway, just install the ANTLR Visual Studio Plugin and let it handle that for you. It works with both ANTLR 3 and 4.
You'll then have to add the ANTLR runtime to your project. For this, you can install the ANTLR4 NuGet or the ANTLR3 version depending on what version you chose to use in the end.

Antlr3 (C# Target) Consistently Hits Phantom EOF Character

While writing an Antlr3 grammar in AntlrWorks (generating C#), I wrote the following set of lexer rules as follows:
array :
'[' properties? ']' -> ^(ARR properties?)
;
properties :
propertyName (','! propertyName)*
;
propertyName :
ID
| ESC_ID
;
ESC_ID :
'\'' ESC_STRING '\''
;
fragment
ESC_STRING
: ( ESCAPE_SEQ | ~('\u0000'..'\u001f' | '\\' | '\"' ) )*
;
However, whenever I try to parse any string where the ESC_ID rule is matched, I hit a phantom EOF character at the end of the string:
Input: ['testing 123']
<mismatched token: [#4,15:15='<EOF>',<-1>,1:15]
I know that the Java version of ANTLR's generated code is not thoroughly debugged, but I've managed to find my way around the quirks so far. Thoughts on how not to hit this error when matching this lexer rule?
UPDATE
I have now tried using the official C# port of Antlr3, and I still get the same error.

ANTLRWorks can't be used to generate code for the C# targets. You'll need to generate your C# code using the Antlr3.exe tool that's included in the C# port. The preferred method is using the MSBuild integration, which can either be done manually or (finally!) automatically using NuGet.
The latest official release is found here:
http://www.antlr.org/wiki/display/ANTLR3/Antlr3CSharpReleases
In addition to that, I have released an alpha build of ANTLR 3 on NuGet. If you enable the "Include Prereleases" in the NuGet package manager in Visual Studio 2010+, you'll find it listed as ANTLR 3 version 3.5.0.3-alpha002.

Quotation real usages

I faced with the 'quotation' term and I'm trying to figure out some real-life examples of usage of it. Ability of having AST for each code expression sounds awesome, but how to use it in real life?
Does anyone know such example?

F# and Nemerle quotations are both used for metaprogramming, but the approaches are different: Nemerle uses metaprogramming at compilation time to extend the language, while F# uses them at run time.
Nemerle
In Nemerle, quotations are used within macros for taking apart pieces of code and generating new ones. Much of the language itself is implemented this way. For example, here is an example from the official library — the macro implementing the when conditional construct. Nemerle does not have statements, so an if has to have an else part: when and unless macros provide shorthand for an if with an empty then and else parts, respectively. The when macro also has extended pattern-matching functionality.
macro whenmacro (cond, body)
syntax ("when", "(", cond, ")", body)
{
match (cond)
{
| <[ $subCond is $pattern ]> with guard = null
| <[ $subCond is $pattern when $guard ]> =>
match (pattern)
{
| PT.PExpr.Call when guard != null =>
// generate expression to replace 'when (expr is call when guard) body'
<[ match ($subCond) { | $pattern when $guard => $body : void | _ => () } ]>
| PT.PExpr.Call =>
// generate expression to replace 'when (expr is call) body'
<[ match ($subCond) { | $pattern => $body : void | _ => () } ]>
| _ =>
// generate expression to replace 'when (expr is pattern) body'
<[ match ($cond) { | true => $body : void | _ => () } ]>
}
| _ =>
// generate expression to replace 'when (cond) body'
<[ match ($cond : bool) { | true => $body : void | _ => () } ]>
}
}
The code uses quotation to handle patterns that look like some predefined templates and replace them with corresponding match expressions. For example, matching the cond expression given to the macro with:
<[ $subCond is $pattern when $guard ]>
checks whether it follows the x is y when z pattern and gives us the expressions composing it. If the match succeeds, we can generate a new expression from the parts we got using:
<[
match ($subCond)
{
| $pattern when $guard => $body : void
| _ => ()
}
]>
This converts when (x is y when z) body to a basic pattern-matching expression. All of this is automatically type-safe and produces reasonable compilation errors when used incorrectly. So, as you see quotation provides a very convenient and type-safe way of manipulating code.

Well, anytime you want to manipulate code programmatically, or do some metaprogramming, quotations make it more declarative, which is a good thing.
I've written two posts about how this makes life easier in Nemerle: here and here.
For real life examples, it's interesting to note that Nemerle itself defines many common statements as macros (where quotations are used). Some examples include: if, for, foreach, while, break, continue and using.

I think quotations have quite different uses in F# and Nemerle. In F#, you don't use quotations to extend the F# language itself, but you use them to take an AST (data representation of code) of some program written in standard F#.
In F#, this is done either by wrapping a piece of code in <# ..F# code.. #>, or by adding a special attribtue to a function:
[<ReflectedDefinition>]
let foo () =
// body of a function (standard F# code)
Robert already mentioned some uses of this mechanism - you can take the code and translate F# to SQL to query database, but there are several other uses. You can for example:
translate F# code to run on GPU
translate F# code to JavaScript using WebSharper

As Jordão has mentioned already quotations enable meta programming. One real world example of this is the ability to use quotations to translated F# into another language, like for example SQL. In this way Quotations server much the same purpose as expression trees do in C#: they enable linq queries to be translated into SQL (or other data-acess language) and executed against a data store.

Unquote is a real-life example of quotation usage.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Negate POSIX or Unicode character classes in ANTLR Lexer (C#) - c#

Related

Antlr4 picks up wrong tokens and rules

Antlr 4 Lexer rule ambiguity

how to connect ANTLRWorks output to c# project?

Antlr3 (C# Target) Consistently Hits Phantom EOF Character

Quotation real usages

Categories

Resources