Parse XPath Expressions

I am trying to create an 'AET' (Abstract Expression Tree) for XPath (as I am writing a WYSIWYG XSL editor). I have been hitting my head against the wall with the XPath BNF for the past three to four hours.
I have thought of another solution. I could write a class that implements IXPathNavigable and returns an XPathNavigator of my own when CreateNavigator is called. This XPathNavigator would always succeed on any method call and would keep track of those calls, e.g. we moved to the customers node and then the customer node. I could then (hopefully) use this information to create the 'AET' (so we would have customers/customer in an object model).
The only question is: how on earth do I run an IXPathNavigable through an XPathExpression?
I know this is excessively lazy, but has anyone else gone through the effort and written an XPath expression parser? I haven't yet POC'd my possible solution because I can't test it (I can't run the XPathExpression against an IXPathNavigable), so I don't know whether it will even work.

There is an ANTLR XPath grammar here. Since its license permits, I have copied the whole grammar below to avoid link rot.
grammar xpath;
/*
XPath 1.0 grammar. Should conform to the official spec at
http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar
rules have been kept as close as possible to those in the
spec, but some adjustments were unavoidable. These were
mainly removing left recursion (spec seems to be based on
LR), and to deal with the double nature of the '*' token
(node wildcard and multiplication operator). See also
section 3.7 in the spec. These rule changes should make
no difference to the strings accepted by the grammar.
Written by Jan-Willem van den Broek
Version 1.0
Do with this code as you will.
*/
/*
Ported to Antlr4 by Tom Everett <tom@khubla.com>
*/
main : expr
;
locationPath
: relativeLocationPath
| absoluteLocationPathNoroot
;
absoluteLocationPathNoroot
: '/' relativeLocationPath
| '//' relativeLocationPath
;
relativeLocationPath
: step (('/'|'//') step)*
;
step : axisSpecifier nodeTest predicate*
| abbreviatedStep
;
axisSpecifier
: AxisName '::'
| '@'?
;
nodeTest: nameTest
| NodeType '(' ')'
| 'processing-instruction' '(' Literal ')'
;
predicate
: '[' expr ']'
;
abbreviatedStep
: '.'
| '..'
;
expr : orExpr
;
primaryExpr
: variableReference
| '(' expr ')'
| Literal
| Number
| functionCall
;
functionCall
: functionName '(' ( expr ( ',' expr )* )? ')'
;
unionExprNoRoot
: pathExprNoRoot ('|' unionExprNoRoot)?
| '/' '|' unionExprNoRoot
;
pathExprNoRoot
: locationPath
| filterExpr (('/'|'//') relativeLocationPath)?
;
filterExpr
: primaryExpr predicate*
;
orExpr : andExpr ('or' andExpr)*
;
andExpr : equalityExpr ('and' equalityExpr)*
;
equalityExpr
: relationalExpr (('='|'!=') relationalExpr)*
;
relationalExpr
: additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*
;
additiveExpr
: multiplicativeExpr (('+'|'-') multiplicativeExpr)*
;
multiplicativeExpr
: unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?
| '/' (('div'|'mod') multiplicativeExpr)?
;
unaryExprNoRoot
: '-'* unionExprNoRoot
;
qName : nCName (':' nCName)?
;
functionName
: qName // Does not match nodeType, as per spec.
;
variableReference
: '$' qName
;
nameTest: '*'
| nCName ':' '*'
| qName
;
nCName : NCName
| AxisName
;
NodeType: 'comment'
| 'text'
| 'processing-instruction'
| 'node'
;
Number : Digits ('.' Digits?)?
| '.' Digits
;
fragment
Digits : ('0'..'9')+
;
AxisName: 'ancestor'
| 'ancestor-or-self'
| 'attribute'
| 'child'
| 'descendant'
| 'descendant-or-self'
| 'following'
| 'following-sibling'
| 'namespace'
| 'parent'
| 'preceding'
| 'preceding-sibling'
| 'self'
;
PATHSEP
:'/';
ABRPATH
: '//';
LPAR
: '(';
RPAR
: ')';
LBRAC
: '[';
RBRAC
: ']';
MINUS
: '-';
PLUS
: '+';
DOT
: '.';
MUL
: '*';
DOTDOT
: '..';
AT
: '@';
COMMA
: ',';
PIPE
: '|';
LESS
: '<';
MORE_
: '>';
LE
: '<=';
GE
: '>=';
COLON
: ':';
CC
: '::';
APOS
: '\'';
QUOT
: '\"';
Literal : '"' ~'"'* '"'
| '\'' ~'\''* '\''
;
Whitespace
: (' '|'\t'|'\n'|'\r')+ ->skip
;
NCName : NCNameStartChar NCNameChar*
;
fragment
NCNameStartChar
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
// Unfortunately, java escapes can't handle this conveniently,
// as they're limited to 4 hex digits. TODO.
// | '\U010000'..'\U0EFFFF'
;
fragment
NCNameChar
: NCNameStartChar | '-' | '.' | '0'..'9'
| '\u00B7' | '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;

I have both written an XPath parser and an implementation of IXPathNavigable (I used to be a developer for XMLPrime). Neither is easy; and I suspect that the IXPathNavigable is not going to be the cheap win you hope, as there is quite a lot of subtlety in the interactions between different methods - I suspect a full blown XPath parser will be simpler (and more reliable).
To answer your question though:
var results = xpathNavigable.CreateNavigator().Evaluate("/my/xpath[expression]");
You'd probably need to enumerate over results to cause the node to be navigated.
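If you always returned true, then all you'd learn from the following XPath is that it looks for bar children of foo. Because bar is always found, not(bar) is false and the rest of the path is never navigated: foo[not(bar)]/other/elements
If you always returned a fixed number of nodes, then you'd never see most of this XPath, because the 100th a would not exist and b and c would never be visited: a[100]/b/c/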
Essentially, this won't work.
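For a sense of what the parser route looks like, here is a minimal sketch in Python (chosen only to keep the example short and runnable). It handles a tiny, hypothetical subset of XPath: relative paths of name steps, an optional @ for the attribute axis, and predicates that are either a position number or a nested step. The names tokenize and parse_path and the dictionary layout are assumptions made for illustration; this is nowhere near XPath 1.0 conformance.

```python
import re

# Tokens for the tiny subset: names, integers, and the symbols / [ ] @
TOKEN = re.compile(r"\s*(?:(?P<name>[A-Za-z_][\w.-]*)|(?P<num>\d+)|(?P<sym>[/\[\]@]))")

def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        out.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return out

def parse_path(text):
    """Parse: step ('/' step)*  with  step := '@'? name ('[' (num | step) ']')*"""
    tokens = tokenize(text)
    i = 0

    def peek():
        return tokens[i] if i < len(tokens) else (None, None)

    def take(kind, value=None):
        nonlocal i
        k, v = peek()
        if k != kind or (value is not None and v != value):
            raise SyntaxError(f"expected {value or kind}, got {v!r}")
        i += 1
        return v

    def step():
        axis = "child"
        if peek() == ("sym", "@"):
            take("sym", "@")
            axis = "attribute"
        name = take("name")
        predicates = []
        while peek() == ("sym", "["):
            take("sym", "[")
            if peek()[0] == "num":
                predicates.append({"position": int(take("num"))})
            else:
                predicates.append({"exists": step()})
            take("sym", "]")
        return {"axis": axis, "name": name, "predicates": predicates}

    steps = [step()]
    while peek() == ("sym", "/"):
        take("sym", "/")
        steps.append(step())
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return steps

for s in parse_path("a[100]/b/c"):
    print(s)
```

Unlike the always-succeeding navigator, the parser sees every step of a[100]/b/c, including the positional predicate, because it reads the expression text instead of observing an evaluation.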

Related

Multiple nested expressions in ANTLR

The parser does not recognize the equality expression:
extraneous input '=' expecting {<EOF>, '~', '(', OPERATOR, IDENTIFIER, NUMBER, STRING}
The error is not even clear: it says it expects an operator, but = is a defined operator.
I also get 2 member access expressions instead of 3.
This is the grammar:
grammar xxx;
parse: expression+ EOF;
expression:
expression op=OPERATOR expression #binaryExpression
| op=OPERATOR expression #unaryPrefixExpression
| expression op=OPERATOR #unarPostfixExpression
| member_expression #memberExpression
| OPENING_PARENTHESIS expression CLOSING_PARENTHESIS #parenthesisExpression
| STRING #stringExpression
| NUMBER #numberExpression
| NEGATE expression #negationExpression
;
member_expression:
IDENTIFIER (DOT(IDENTIFIER DOT?))*
;
// operators
PLUS: '+' ;
MINUS: '-' ;
BIGGER_THAN: '>' ;
LESS_THAN: '<' ;
BIGGER_THAN_OR_EQUALS: '>=' ;
LESS_THAN_OR_EQUALS: '<=' ;
NEGATE: '~' ;
EQUALITY: '=' ;
OPENING_PARENTHESIS: '(' ;
CLOSING_PARENTHESIS: ')' ;
fragment LOGICAL_OPERATOR:
| EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
OPERATOR:
PLUS
| MINUS
| NEGATE
| LOGICAL_OPERATOR
;
DOT: '.' ;
IDENTIFIER: [a-zA-Z]+[a-zA-Z0-9_]* ;
// literals
NUMBER: [0-9] + ('.' [0-9] +)? ;
STRING : '"' .*? '"' ;
WS: [ \t\n]+ -> skip ;
ANY: . ;
This is the expression:
context.Previous.Output.previous_value2 = 123
Tree string:
([] ([6] ([16 6] context . Previous .)) ([6] ([16 6] Output . previous_value2)) = ([6] 123) <EOF>)
As you can see, there are two member access expressions, then an unrecognized equality operator, then a number expression.
I want to get:
3 separate member access expressions
1 equality expression

Lexer rules always match in one way: the lexer tries to match as many characters as possible, and when 2 (or more) rules match the same characters, the rule defined first wins. So take the rules PLUS and OPERATOR:
PLUS: '+' ;
...
OPERATOR:
PLUS
| MINUS
| NEGATE
| LOGICAL_OPERATOR
;
For the input string "+", the lexer will always produce a PLUS token, never an OPERATOR token.
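The two rules described above (longest match, then first rule defined) can be imitated with a small Python sketch. The lex helper and the regular expressions are stand-ins written for this illustration and are not how ANTLR is implemented; they only reproduce the tie-breaking behaviour.

```python
import re

def lex(rules, text):
    """Longest match wins; on equal length the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (rule name, match length)
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"token recognition error at: {text[pos]!r}")
        name, length = best
        if name != "WS":  # mimic '-> skip'
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens

RULES = [
    ("PLUS", r"\+"),
    ("EQUALITY", r"="),
    ("OPERATOR", r"\+|="),  # matches the same characters, but is defined later
    ("NUMBER", r"[0-9]+"),
    ("WS", r"[ \t\n]+"),
]

print(lex(RULES, "1 + 2 = 3"))
```

No OPERATOR token is ever produced, which is why a parser rule that asks for OPERATOR reports the = as extraneous input.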
The solution: change the OPERATOR and LOGICAL_OPERATOR lexer rules into parser rules:
grammar xxx;
parse: expression+ EOF;
expression:
expression op=operator expression #binaryExpression
| op=unary_operator expression #unaryPrefixExpression
| expression op=operator #unarPostfixExpression
| member_expression #memberExpression
| OPENING_PARENTHESIS expression CLOSING_PARENTHESIS #parenthesisExpression
| STRING #stringExpression
| NUMBER #numberExpression
| NEGATE expression #negationExpression
;
member_expression:
IDENTIFIER (DOT(IDENTIFIER DOT?))*
;
operator:
EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
| unary_operator
;
unary_operator:
PLUS
| MINUS
| NEGATE
;
// operators
PLUS: '+' ;
MINUS: '-' ;
BIGGER_THAN: '>' ;
LESS_THAN: '<' ;
BIGGER_THAN_OR_EQUALS: '>=' ;
LESS_THAN_OR_EQUALS: '<=' ;
NEGATE: '~' ;
EQUALITY: '=' ;
OPENING_PARENTHESIS: '(' ;
CLOSING_PARENTHESIS: ')' ;
DOT: '.' ;
IDENTIFIER: [a-zA-Z]+[a-zA-Z0-9_]* ;
// literals
NUMBER: [0-9] + ('.' [0-9] +)? ;
STRING : '"' .*? '"' ;
WS: [ \t\n]+ -> skip ;
ANY: . ;
Also, the following would match an empty string:
fragment LOGICAL_OPERATOR:
| EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
you probably meant:
fragment LOGICAL_OPERATOR:
EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
Btw, a better way to print the tree is to use toStringTree(Parser):
String source = "context.Previous.Output.previous_value2 = 123";
xxxLexer lexer = new xxxLexer(CharStreams.fromString(source));
xxxParser parser = new xxxParser(new CommonTokenStream(lexer));
ParseTree tree = parser.parse();
System.out.println(tree.toStringTree(parser));
which will print:
(parse (expression (member_expression context . Previous .)) (expression (expression (member_expression Output . previous_value2)) (operator =) (expression 123)) <EOF>)

Parsing: ANTLR for .NET

I am trying to parse the following text:
<<! notes, Test!>>
Grammar:
grammar Hello;
prog: stat+;
stat: DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr: NOTES value=VAR_VALUE # delim_body ;
VAR_VALUE : [ a-Z A-Z 0-9 ! ];
NOTES : 'notes,'
| ' notes,';
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
Error:
line 1:12 token recognition error at: '>'
line 1:13 token recognition error at: '>'
line 1:10 mismatched input ' !' expecting VAR_VALUE
(NOTE: Added DELIMITER defs since I forgot them earlier)

Try this:
grammar Hello;
prog : stat+ EOF ;
stat : DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr : NOTES COMMA value=VAR_VALUE # delim_body ;
NOTES : 'notes' ; // listed before VAR_VALUE: on an equal-length match the earlier rule wins
VAR_VALUE : ANBang* AlphaNum ;
COMMA : ',' ;
WS : [ \t\r\n]+ -> skip ;
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
fragment ANBang : AlphaNum | Bang ;
fragment AlphaNum : [a-zA-Z0-9] ;
fragment Bang : '!' ;
Ideally, the rules have to be mutually unambiguous. So, the VAR_VALUE rule is defined so that it can never end with a !. This prevents the ! from being consumed by VAR_VALUE in preference to DELIMETER_CLOSE. Of course, that presumes the redefinition is acceptable. If not, a more involved solution will be required.
Also, as a general principle, skip anything that is not syntactically significant to the parsing.
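To see why the trailing ! matters, the following Python sketch compares a VAR_VALUE that may end in ! with one that may not. The lex helper and the regular expressions are approximations written for this illustration (the real lexer is generated by ANTLR); NOTES is listed before VAR_VALUE so that it wins the equal-length match on "notes".

```python
import re

def lex(rules, text):
    """Longest match wins, ties go to the earlier rule; returns (tokens, errors)."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            errors.append(text[pos])  # like "token recognition error at: '>'"
            pos += 1
            continue
        name, length = best
        if name != "WS":
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens, errors

COMMON = [
    ("DELIMETER_OPEN", r"<<!"),
    ("DELIMETER_CLOSE", r"!>>"),
    ("NOTES", r"notes"),
    ("COMMA", r","),
    ("WS", r"[ \t\r\n]+"),
]
# VAR_VALUE may end with '!': it swallows the '!' of the closing delimiter.
GREEDY = COMMON + [("VAR_VALUE", r"[a-zA-Z0-9!]+")]
# VAR_VALUE must end with a letter or digit, like ANBang* AlphaNum.
FIXED = COMMON + [("VAR_VALUE", r"[a-zA-Z0-9!]*[a-zA-Z0-9]")]

source = "<<! notes, Test!>>"
print(lex(GREEDY, source))
print(lex(FIXED, source))
```

With the greedy rule the value becomes Test! and the two > characters are left over as recognition errors; with the restricted rule the value is Test and !>> is recognized as the closing delimiter.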

Antlr4 Grammar issue (not entirely parsing)

I am new to ANTLR and I am trying to get this grammar working:
grammar TemplateGrammar;
//Parser Rules
start
: block
| statement
| expression
| parExpression
| primary
;
block
: LBRACE statement* RBRACE
;
statement
: block
| IF parExpression statement (ELSE statement)?
| expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: primary #PRIMARY
| number op=('*'|'/') number #MULDIV
| number op=('+'|'-') number #ADDSUB
| number op=('>='|'<='|'>'|'<') number #GRLWOREQUALS
| expression op=('='|'!=') expression #EQDIFF
;
primary
: parExpression
| literal
;
literal
: number #NumberLiteral
| string #StringLiteral
| columnName #ColumnNameLiteral
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
string
: '"' StringChars? '"'
;
columnName
: '[' StringChars? ']'
;
// Lexer Rules
//Integers
DecimalIntegerLiteral
: DecimalNumeral
;
fragment
DecimalNumeral
: '0'
| NonZeroDigit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitOrUnderscore* Digit)?
;
fragment
Digit
: '0'
| NonZeroDigit
;
fragment
NonZeroDigit
: [1-9]
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
//Floating point
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
//Strings
StringChars
: StringChar+
;
fragment
StringChar
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
;
//Separators
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
COMMA : ',';
DOT : '.';
//Keywords
IF : 'IF';
ELSE : 'ELSE';
THEN : 'THEN';
//Operators
PLUS : '+';
MINUS : '-';
MULTIPLY : '*';
DIVIDE : '/';
EQUALS : '=';
DIFFERENT : '!=';
GRTHAN : '>';
GROREQUALS : '>=';
LWTHAN : '<';
LWOREQUALS : '<=';
AND : '&';
OR : '|';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;
When I put "Test" in the input, it is working and returning the String "Test".
Here is what I get in the IParseTree when I put "Test" in the input:
"(start (statement (expression (primary (literal (string \" Test \"))))))"
But when I put [Test] (which is almost the same as "Test" but with brackets instead of quotes), the parser does not recognize the token...
Here is the IParseTree I get when I put [Test]:
"(start [Test])"
Same with numbers: it recognizes lone numbers such as 1, 123, 12.5, etc., but not expressions like 1+2...
Do you have any idea why the parser isn't recognizing the columnName rule but works well with the string rule?

Probably because "StringChar" is defined incorrectly for your purpose? It does not exclude "]" (or "["), so StringChars consumes the brackets along with the text between them.
Perhaps you want to define StringChar as:
fragment
StringChar
: ~["\\\]]
| EscapeSequence
;
If it were my grammar, I'd define a QuotedStringChar as you have for quoted strings, and define BracketStringChar as ~[\]\\] to use for your bracket column names.
Welcome to debugging grammars at the lexical level, and defining different types of "quotes" for different types of strings. This is pretty common. (You should see Ruby, where you can define the string quote at the beginning of the string, ick.)
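A rough Python imitation of the lexer makes the diagnosis visible. The lex helper, the QUOTE rule name, and the regular expressions are assumptions made for this sketch (escape sequences are ignored); the second rule set is one possible way to split quoted and bracketed strings into separate tokens, not the only one.

```python
import re

def lex(rules, text):
    """Longest match wins; on equal length the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"token recognition error at: {text[pos]!r}")
        name, length = best
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens

# Approximation of the grammar in the question.
ORIGINAL = [
    ("LBRACK", r"\["),
    ("RBRACK", r"\]"),
    ("QUOTE", r'"'),
    ("PLUS", r"\+"),
    ("DecimalIntegerLiteral", r"0|[1-9][0-9]*"),
    ("StringChars", r'[^"\\]+'),  # everything except quote and backslash
]
# Quoted and bracketed strings as whole tokens; no catch-all StringChars.
SPLIT = [
    ("QuotedStringChars", r'"[^"]+"'),
    ("BracketStringChars", r"\[[^\]]+\]"),
    ("PLUS", r"\+"),
    ("DecimalIntegerLiteral", r"0|[1-9][0-9]*"),
]

for text in ('"Test"', "[Test]", "1+2"):
    print(text, lex(ORIGINAL, text), lex(SPLIT, text))
```

With the original rules, [Test] and 1+2 each come out as a single StringChars token, so the columnName and arithmetic alternatives never get the tokens they expect; "Test" only works because the quote character is excluded from StringChars.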

I finally got it working by adding:
QuotedStringChars
: '"' ~[\"]+ '"'
;
BracketStringChars
: '[' ~[\]]+ ']'
;
These rules match any characters between quotes or brackets. Then:
primary
: literal #PrimLiteral
| number #PrimNumber
;
literal
: QuotedStringChars #OneString
| BracketStringChars #ColumnName
| number #NUMBER
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
The literal rule helps to distinguish quoted strings, bracket strings and numbers.
number is duplicated in the primary and literal rules because I need a different behavior in my application for each one.
I managed this with the good advice of Ira Baxter :)
Hope this will help other newbies to ANTLR like me to have a better understanding :)

Antlr v4: What's wrong with this simple grammar for C# literals?

I decided to translate the official C# grammar to ANTLR v4. However, while testing I encountered the following problem: the given grammar doesn't match simple input like \n\ntrue\n\n<EOF>. It keeps saying mismatched input '\n\ntrue\n\n' expecting Literal. Even after I reduce the definition of Literal to Literal: BooleanLiteral; the input \n\ntrue\n\n<EOF> still doesn't get matched. I was expecting the grammar to skip the \ns and consume the true and <EOF>, but obviously this is not happening. I tried to debug but still haven't been able to find anything wrong. Any ideas?
grammar Test;
start: Literal EOF;
/**********
*
* Literals
*
**********/
Literal
: BooleanLiteral
| IntegerLiteral
| RealLiteral
| CharacterLiteral
| StringLiteral
| NullLiteral
;
BooleanLiteral
: 'true'
| 'false'
;
IntegerLiteral
: DecimalIntegerLiteral
| HexadecimalIntegerLiteral
;
DecimalIntegerLiteral
: DecimalDigits IntegerTypeSuffix?
;
DecimalDigits
: DecimalDigit+
;
DecimalDigit
: [0-9]
;
IntegerTypeSuffix
: 'U'
| 'u'
| 'L'
| 'l'
| 'UL'
| 'Ul'
| 'uL'
| 'ul'
| 'LU'
| 'Lu'
| 'lU'
| 'lu'
;
HexadecimalIntegerLiteral
: ('0x' | '0X') HexDigits IntegerTypeSuffix?
;
HexDigits
: HexDigit+
;
HexDigit
: [0-9A-Fa-f]
;
RealLiteral
: DecimalDigits '.' DecimalDigits ExponentPart? RealTypeSuffix?
| '.' DecimalDigits ExponentPart? RealTypeSuffix?
| DecimalDigits ExponentPart RealTypeSuffix?
| DecimalDigits RealTypeSuffix
;
ExponentPart
: ('e' | 'E') Sign? DecimalDigits
;
Sign
: '+'
| '-'
;
RealTypeSuffix
: 'F'
| 'f'
| 'D'
| 'd'
| 'M'
| 'm'
;
CharacterLiteral
: '\'' Character '\''
;
Character
: SingleCharacter
| SimpleEscapeSequence
| HexadecimalEscapeSequence
| UnicodeEscapeSequence
;
UnicodeEscapeSequence
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
| '\\' 'U' HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
;
SingleCharacter
: ~[\\\\\\\u000D\u000A\u0085\u2028\u2029]
;
SimpleEscapeSequence
: '\\\''
| '\\"'
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
HexadecimalEscapeSequence
: '\\x' HexDigit HexDigit? HexDigit? HexDigit?
;
StringLiteral
: RegularStringLiteral
| VerbatimStringLiteral
;
RegularStringLiteral
: '"' RegularStringLiteralCharacters? '"'
;
RegularStringLiteralCharacters
: RegularStringLiteralCharacter+
;
RegularStringLiteralCharacter
: SingleRegularStringLiteralCharacter
| SimpleEscapeSequence
| HexadecimalEscapeSequence
| UnicodeEscapeSequence
;
SingleRegularStringLiteralCharacter
: ~["\\\u000D\u000A\u0085\u2028\u2029]
;
VerbatimStringLiteral
: '@"' VerbatimStringLiteralCharacters? '"'
;
VerbatimStringLiteralCharacters
: VerbatimStringLiteralCharacter+
;
VerbatimStringLiteralCharacter
: SingleVerbatimStringLiteralCharacter
| QuoteEscapeSequence
;
SingleVerbatimStringLiteralCharacter
: ~["]
;
QuoteEscapeSequence
: '""'
;
NullLiteral
: 'null'
;
/**********
*
* Whitespaces and comments
*
**********/
WS : [ \t\r\n]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
EDIT:
Ok, I've managed to isolate the problem to this piece of code:
grammar Test;
start : VerbatimStringLiteral EOF ;
VerbatimStringLiteral
: '@"' VerbatimStringLiteralCharacter* '"'
;
VerbatimStringLiteralCharacter
: SingleVerbatimStringLiteralCharacter
| QuoteEscapeSequence
;
SingleVerbatimStringLiteralCharacter
: ~["]
;
QuoteEscapeSequence
: '""'
;
WS : [ \t\r\n]+ -> skip
;

Lexer rules which do not produce tokens themselves should be marked with the fragment modifier. For example, QuoteEscapeSequence is not a standalone token; it is just a part of the VerbatimStringLiteral token, so you should mark it with fragment. Here are some other rules which should be fragment rules:
VerbatimStringLiteralCharacter
SingleVerbatimStringLiteralCharacter
SingleRegularStringLiteralCharacter
RegularStringLiteralCharacter
RegularStringLiteralCharacters ← this one was the source of your errors for this particular input
SimpleEscapeSequence
There may be more, but this should give you an idea what the problem is and how to solve it.
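The effect of a missing fragment modifier can be sketched in Python with simplified stand-in rules. The lex helper, the rule name Chars, and the regular expressions are assumptions for this illustration and do not reproduce the full grammar; the point is only that a helper rule left as a real token competes for input, while a fragment exists only inside the tokens that use it.

```python
import re

def lex(rules, text):
    """Longest match wins; on equal length the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            if m and m.end() > 0 and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"token recognition error at: {text[pos]!r}")
        name, length = best
        if name != "WS":  # mimic '-> skip'
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens

# Helper rule exposed as a token: it can match on its own.
WITHOUT_FRAGMENT = [
    ("Literal", r"true|false|null"),
    ("Chars", r'[^"]+'),  # stand-in for a "...Characters" helper rule
    ("WS", r"[ \t\r\n]+"),
]
# Helper rule as a fragment: only usable inside the string literal.
WITH_FRAGMENT = [
    ("Literal", r'true|false|null|@"(?:[^"]|"")*"'),
    ("WS", r"[ \t\r\n]+"),
]

source = "\n\ntrue\n\n"
print(lex(WITHOUT_FRAGMENT, source))
print(lex(WITH_FRAGMENT, source))
```

In the first rule set the helper matches the whole input, newlines included, as one token, so the parser never sees a Literal; in the second, whitespace is skipped and true is the only token.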

ANTLR grammar matches incompatible rule instead of throwing NoViableAltException

I have the following ANTLR grammar that forms part of a larger expression parser:
grammar ProblemTest;
atom : constant
| propertyname;
constant: (INT+ | BOOL | STRING | DATETIME);
propertyname
: IDENTIFIER ('/' IDENTIFIER)*;
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
INT
: '0'..'9'+;
BOOL : ('true' | 'false');
DATETIME
: 'datetime\'' '0'..'9'+ '-' '0'..'9'+ '-' + '0'..'9'+ 'T' '0'..'9'+ ':' '0'..'9'+ (':' '0'..'9'+ ('.' '0'..'9'+)*)* '\'';
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
If I invoke this in the interpreter from within ANTLRWorks with
'Hello\\World'
then it is interpreted as a propertyname instead of a constant. The same happens if I compile this in C# and run it in a test harness, so it is not a problem with the dodgy interpreter.
I'm sure I am missing something really obvious... but why is this happening? It's clear there is a problem with the string matcher, but I would have thought at the very least that the fact that IDENTIFIER does not match the ' character would mean that this would throw a NoViableAltException instead of just falling through?

First, neither ANTLRWorks nor antlr-3.5-complete.jar can be used to generate code for the C# targets. They might produce files ending in .cs, and those files might even compile, but those files will not be the same as the files produced by the C# port of the ANTLR Tool (Antlr3.exe) or the recommended MSBuild integration. Make sure you are producing your generated parser by one of the tested methods.
Second, INT will never be matched. Since IDENTIFIER appears before INT in the grammar, and all sequences '0'..'9'+ match both IDENTIFIER and INT, the lexer will always take the first option which appears (IDENTIFIER).
