The parser does not recognize the equality expression:
extraneous input '=' expecting {<EOF>, '~', '(', OPERATOR, IDENTIFIER, NUMBER, STRING}
The error is not clear either: it says it expects an operator, but = is a defined operator.
Also, I get 2 member access expressions instead of 3.
This is the grammar:
grammar xxx;
parse: expression+ EOF;
expression:
expression op=OPERATOR expression #binaryExpression
| op=OPERATOR expression #unaryPrefixExpression
| expression op=OPERATOR #unaryPostfixExpression
| member_expression #memberExpression
| OPENING_PARENTHESIS expression CLOSING_PARENTHESIS #parenthesisExpression
| STRING #stringExpression
| NUMBER #numberExpression
| NEGATE expression #negationExpression
;
member_expression:
IDENTIFIER (DOT(IDENTIFIER DOT?))*
;
// operators
PLUS: '+' ;
MINUS: '-' ;
BIGGER_THAN: '>' ;
LESS_THAN: '<' ;
BIGGER_THAN_OR_EQUALS: '>=' ;
LESS_THAN_OR_EQUALS: '<=' ;
NEGATE: '~' ;
EQUALITY: '=' ;
OPENING_PARENTHESIS: '(' ;
CLOSING_PARENTHESIS: ')' ;
fragment LOGICAL_OPERATOR:
| EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
OPERATOR:
PLUS
| MINUS
| NEGATE
| LOGICAL_OPERATOR
;
DOT: '.' ;
IDENTIFIER: [a-zA-Z]+[a-zA-Z0-9_]* ;
// literals
NUMBER: [0-9] + ('.' [0-9] +)? ;
STRING : '"' .*? '"' ;
WS: [ \t\n]+ -> skip ;
ANY: . ;
This is the expression:
context.Previous.Output.previous_value2 = 123
Tree string:
([] ([6] ([16 6] context . Previous .)) ([6] ([16 6] Output . previous_value2)) = ([6] 123) <EOF>)
As you can see, there are two member access expressions, then an unrecognized equality operator, then a number expression.
I want to get:
3 separate member access expressions
1 equality expression
Lexer rules always match in one way: the lexer tries to match as many characters as possible, and when two (or more) rules match the same characters, the rule defined first wins. So take the rules PLUS and OPERATOR:
PLUS: '+' ;
...
OPERATOR:
PLUS
| MINUS
| NEGATE
| LOGICAL_OPERATOR
;
For the input string "+", the lexer will always produce a PLUS token, never an OPERATOR token.
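To make the two matching rules concrete, here is a minimal Python sketch of "longest match wins, ties go to the rule defined first". It is not ANTLR's implementation; the rule list is a hypothetical miniature of the grammar above and `next_token` is a name made up for this illustration.

```python
import re

# Hypothetical miniature of the lexer rules above, in definition order.
RULES = [
    ("PLUS", r"\+"),
    ("BIGGER_THAN", r">"),
    ("BIGGER_THAN_OR_EQUALS", r">="),
    ("EQUALITY", r"="),
    ("OPERATOR", r"\+|>=|>|="),  # overlaps with every rule above
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme): longest match wins, ties go to the earliest rule."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule keeps the token when the lengths are equal.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("+"))   # ('PLUS', '+')
print(next_token("="))   # ('EQUALITY', '=')
print(next_token(">="))  # ('BIGGER_THAN_OR_EQUALS', '>=')
```

In this model OPERATOR can never be produced, because every lexeme it matches is also matched, at the same length, by a rule defined before it.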
The solution: change the OPERATOR and LOGICAL_OPERATOR lexer rules into parser rules:
grammar xxx;
parse: expression+ EOF;
expression:
expression op=operator expression #binaryExpression
| op=unary_operator expression #unaryPrefixExpression
| expression op=operator #unaryPostfixExpression
| member_expression #memberExpression
| OPENING_PARENTHESIS expression CLOSING_PARENTHESIS #parenthesisExpression
| STRING #stringExpression
| NUMBER #numberExpression
| NEGATE expression #negationExpression
;
member_expression:
IDENTIFIER (DOT(IDENTIFIER DOT?))*
;
operator:
EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
| unary_operator
;
unary_operator:
PLUS
| MINUS
| NEGATE
;
// operators
PLUS: '+' ;
MINUS: '-' ;
BIGGER_THAN: '>' ;
LESS_THAN: '<' ;
BIGGER_THAN_OR_EQUALS: '>=' ;
LESS_THAN_OR_EQUALS: '<=' ;
NEGATE: '~' ;
EQUALITY: '=' ;
OPENING_PARENTHESIS: '(' ;
CLOSING_PARENTHESIS: ')' ;
DOT: '.' ;
IDENTIFIER: [a-zA-Z]+[a-zA-Z0-9_]* ;
// literals
NUMBER: [0-9] + ('.' [0-9] +)? ;
STRING : '"' .*? '"' ;
WS: [ \t\n]+ -> skip ;
ANY: . ;
Also, the following would match an empty string:
fragment LOGICAL_OPERATOR:
| EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
you probably meant:
fragment LOGICAL_OPERATOR:
EQUALITY
| BIGGER_THAN_OR_EQUALS
| LESS_THAN_OR_EQUALS
| LESS_THAN
| BIGGER_THAN
;
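The effect of the stray leading `|` can be reproduced with an ordinary regular expression. This is only an analogy in Python's `re` syntax, not ANTLR itself: an alternation that starts with an empty branch accepts the empty string, while the corrected one does not.

```python
import re

# A leading '|' adds an empty alternative, just like in the fragment rule above.
with_empty = re.compile(r"(?:|=|>=|<=|<|>)")
without_empty = re.compile(r"(?:=|>=|<=|<|>)")

print(with_empty.fullmatch("") is not None)     # True
print(without_empty.fullmatch("") is not None)  # False
```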
Btw, a better way to print the tree is to use toStringTree(Parser):
String source = "context.Previous.Output.previous_value2 = 123";
xxxLexer lexer = new xxxLexer(CharStreams.fromString(source));
xxxParser parser = new xxxParser(new CommonTokenStream(lexer));
ParseTree tree = parser.parse();
System.out.println(tree.toStringTree(parser));
which will print:
(parse (expression (member_expression context . Previous .)) (expression (expression (member_expression Output . previous_value2)) (operator =) (expression 123)) <EOF>)
I have a string which is a combination of variables and text which looks like this:
string str = $"\x0{str2}{str3}";
As you can see, I have a string escape sequence \x, which requires 1-4 hexadecimal characters. str2 is a hexadecimal character (e.g. D), while str3 is two decimal characters (e.g. 37). I expect the result to be the same as str = "\x0D37", so that str contains ഷ, but instead I get whitespace, as if str == "\x0". Why is that?
As per the specification, an interpolated string is split into separate tokens before parsing. To parse the \x escape sequence, it needs to be part of the same token, which in this case it is not. And in any case, there is simply no way the interpolated part would ever use escape sequences, as that is not defined for non-literals.
You are better off just generating a char value directly, although this is more complex:
// NumberStyles lives in System.Globalization, so this needs: using System.Globalization;
string str = ((char) (int.Parse(str2 + str3, NumberStyles.HexNumber))).ToString();
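The same idea, sketched in Python rather than C# purely for illustration: build the number from the hex digits at run time and convert it to a character, instead of relying on an escape sequence. The sample values for `str2` and `str3` are the ones from the question.

```python
str2 = "D"    # one hexadecimal digit, as in the question
str3 = "37"   # two more digits, as in the question

# Parse the concatenated digits as a base-16 number, then turn it into a character.
code_point = int("0" + str2 + str3, 16)
result = chr(code_point)

print(hex(code_point), result)
```

The leading "0" changes nothing numerically; it is kept only to mirror the `\x0` prefix of the original string.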
To answer why your example doesn't work:
The spec calls \x a "hexadecimal escape sequence".
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following "\x".
The grammar only permits literal hexadecimal digits to be used in this way. ANTLR grammar:
hexadecimal_escape_sequence
: '\\x' hex_digit hex_digit? hex_digit? hex_digit?;
hex_digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
| 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f';
meaning, "the sequence \x followed immediately by one, two, three, or four hexadecimal digits". Thus, the escape can only be used as a literal in source code, not one concatenated at runtime.
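As a rough illustration of that rule, the grammar fragment can be approximated by a regular expression and run against the raw source text. This Python sketch is an assumption-laden stand-in for the C# lexer, not how the compiler is implemented; it only shows where the escape stops.

```python
import re

# Approximation of hexadecimal_escape_sequence: \x followed by 1 to 4 hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{1,4}")

literal_source = r'"\x0D37"'                 # the escape written out in full
interpolated_source = r'$"\x0{str2}{str3}"'  # the source text from the question

print(HEX_ESCAPE.search(literal_source).group())       # \x0D37
print(HEX_ESCAPE.search(interpolated_source).group())  # \x0
```

In the interpolated source the escape ends at the `{`, so only `\x0` is ever seen as the escape sequence, whatever the holes evaluate to later.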
Charlieface provided a good alternative for your case, though depending on the context you may want to rethink your organization and just have a single string holding the characters instead of two.
I am new to ANTLR and I am trying to get this grammar working:
grammar TemplateGrammar;
//Parser Rules
start
: block
| statement
| expression
| parExpression
| primary
;
block
: LBRACE statement* RBRACE
;
statement
: block
| IF parExpression statement (ELSE statement)?
| expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: primary #PRIMARY
| number op=('*'|'/') number #MULDIV
| number op=('+'|'-') number #ADDSUB
| number op=('>='|'<='|'>'|'<') number #GRLWOREQUALS
| expression op=('='|'!=') expression #EQDIFF
;
primary
: parExpression
| literal
;
literal
: number #NumberLiteral
| string #StringLiteral
| columnName #ColumnNameLiteral
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
string
: '"' StringChars? '"'
;
columnName
: '[' StringChars? ']'
;
// Lexer Rules
//Integers
DecimalIntegerLiteral
: DecimalNumeral
;
fragment
DecimalNumeral
: '0'
| NonZeroDigit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitOrUnderscore* Digit)?
;
fragment
Digit
: '0'
| NonZeroDigit
;
fragment
NonZeroDigit
: [1-9]
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
//Floating point
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
//Strings
StringChars
: StringChar+
;
fragment
StringChar
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
;
//Separators
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
COMMA : ',';
DOT : '.';
//Keywords
IF : 'IF';
ELSE : 'ELSE';
THEN : 'THEN';
//Operators
PLUS : '+';
MINUS : '-';
MULTIPLY : '*';
DIVIDE : '/';
EQUALS : '=';
DIFFERENT : '!=';
GRTHAN : '>';
GROREQUALS : '>=';
LWTHAN : '<';
LWOREQUALS : '<=';
AND : '&';
OR : '|';
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;
When I enter "Test" as input, it works and returns the string "Test".
Here is the IParseTree I get for the input "Test":
"(start (statement (expression (primary (literal (string \" Test \"))))))"
But when I enter [Test] (which is almost the same as "Test" but with brackets instead of quotes), the parser does not recognize the token...
Here is the IParseTree I get when I enter [Test]:
"(start [Test])"
The same goes for numbers: it recognizes lone numbers such as 1, 123, 12.5, etc., but not expressions like 1+2...
Do you have any idea why the parser isn't recognizing the columnName rule but works well with the string rule?
Probably because "StringChar" is defined incorrectly for your purpose? It doesn't handle "]"
Perhaps you want to define StringChar as:
fragment
StringChar
: ~["\\\]]
| EscapeSequence
;
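To see why the two inputs behave differently, it helps to check what the character class can swallow from the start of the input. The sketch below uses Python regular expressions as a stand-in for the StringChars rule (an approximation that ignores EscapeSequence); since the lexer prefers the longest match, whatever the class can consume in one go ends up as a single StringChars token.

```python
import re

original_class = re.compile(r'[^"\\]+')      # ~["\\]   as in the question
suggested_class = re.compile(r'[^"\\\]]+')   # ~["\\\]] as suggested above

print(original_class.match("[Test]").group())   # [Test]
print(suggested_class.match("[Test]").group())  # [Test
print(original_class.match('"Test"'))           # None
```

With the original class the whole of `[Test]` is one chunk, so the parser never sees separate `[` and `]` tokens, whereas a leading quote cannot start the chunk at all. At least in this simplified model, excluding only `]` still lets the class consume the opening `[`, so the opening bracket would probably need to be excluded as well or made part of the token.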
If it were my grammar, I'd define a QuotedStringChar as you have for quoted strings, and define BracketStringChar as ~[\]\\] to use for your bracket column names.
Welcome to debugging grammars at the lexical level, and to defining different types of "quotes" for different types of strings. This is pretty common. (You should see Ruby, where you can define the string quote at the beginning of the string. Ick.)
I finally got it working by defining:
QuotedStringChars
: '"' ~[\"]+ '"'
;
BracketStringChars
: '[' ~[\]]+ ']'
;
These take any characters between quotes or brackets. Then:
primary
: literal #PrimLiteral
| number #PrimNumber
;
literal
: QuotedStringChars #OneString
| BracketStringChars #ColumnName
| number #NUMBER
;
number
: DecimalIntegerLiteral #DecimalIntegerLiteral
| DecimalFloatingPointLiteral #FloatLiteral
;
The literal rule helps to distinguish quoted strings, bracketed strings and numbers.
number is duplicated in the primary and literal rules because I need different behavior in my application for each one.
I managed this thanks to the good advice of Ira Baxter :)
Hope this helps other ANTLR newbies like me to get a better understanding :)
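For readers who think in regular expressions, the two delimited token rules above behave roughly like the patterns below. This is a Python approximation for illustration only; `classify` is a hypothetical helper, and its return values simply reuse the alternative labels from the literal rule.

```python
import re

QUOTED = re.compile(r'"[^"]+"')         # QuotedStringChars:  '"' ~["]+ '"'
BRACKETED = re.compile(r"\[[^\]]+\]")   # BracketStringChars: '[' ~[\]]+ ']'

def classify(text):
    """Return the label of the literal alternative that would match, or None."""
    if QUOTED.fullmatch(text):
        return "OneString"
    if BRACKETED.fullmatch(text):
        return "ColumnName"
    return None

print(classify('"Test"'))  # OneString
print(classify("[Test]"))  # ColumnName
print(classify('""'))      # None
```

Because both rules use `+`, an empty pair of delimiters is not accepted; switching to `*` would allow it if that is wanted.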
I'm trying to write a regular expression that will match all image tags apart from the first in an HTML file. E.g.:
<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>
So far I've only managed to create an expression that matches all of the image tags:
<img[^>]*>
Just use a real HTML parser like HtmlAgilityPack to parse the HTML:
var html = @"<html><body><img src=""foo""><span><img src=""bar""></span><img src=""foobar""></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var imgLinks = doc.DocumentNode
.Descendants("img")
.Skip(1)
.Select(x => x.Attributes["src"])
.ToList();
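The same approach carries over to other languages. As an illustration, here is a sketch using Python's standard-library `html.parser` instead of HtmlAgilityPack; `ImgCollector` is a name invented for this example.

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))

html = '<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>'
collector = ImgCollector()
collector.feed(html)
collector.close()

all_but_first = collector.sources[1:]  # skip the first image
print(all_but_first)  # ['bar', 'foobar']
```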
Don't do this:
var pattern = @"<img[^>]*>"; //your pattern in question
var imgs = Regex.Matches(html, pattern)
.Cast<Match>()
.Skip(1)
.Select(m => m.Value)
.ToList();
In this answer I'm going to demonstrate that tags can be matched with a regex, contrary to the belief in some comments that a tag cannot be identified without a complete HTML/XML parser.
For the demonstration I shall use the subset of the grammar rules from the www.w3.org specification for XML 1.1, extended to all the rules reachable from STag and EmptyElemTag, which are the tags we want to match. As none of these rules is recursive, this set of rules can be converted to a regexp that parses start and empty tags.
As XML allows characters beyond the range \u0000-\uffff, I need a notation for those characters in character classes, so I shall use a nonstandard extension of the \u notation with five hex digits instead of four (to cover the allowed characters in the range 0x10000-0xeffff). This simplifies the grammar-to-regexp conversion.
Borrowed from the xml specification for XML version 1.1 is the syntax for the start and empty element tags:
STag ::= '<' Name (S Attribute)* S? '>'
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
Name ::= (NameStartChar NameChar*)
NameChar ::= (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])
NameStartChar ::= ([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])
S ::= ([\u00020\u00009\u0000d\u0000a]+)
Attribute ::= (Name Eq AttValue)
Eq ::= (S? '=' S?)
AttValue ::= ( '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" )
Reference ::= (EntityRef | CharRef)
EntityRef ::= ('&' Name ';')
CharRef ::= ('&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')
To construct the regular expression that accepts start tags and empty tags, I begin with the above grammar and add a simple start rule that accepts either a start tag or an empty tag:
Start ::= STag | EmptyElemTag
Then I substitute each nonterminal by the (properly parenthesized) right side of its rule, until only terminal elements and regexp operators remain on the right side:
Start ::= '<' Name (S Attribute)* S? '>' | '<' Name (S Attribute)* S? '/>'
I can do some operations to group terms and get
Start ::= '<' Name (S Attribute)* S? '/'?'>'
Now substitute Attribute
Start ::= '<' Name (S Name Eq AttValue)* S? '/'? '>'
Now substitute AttValue
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" ))* S? '/'? '>'
Now substitute Reference
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | EntityRef | CharRef)* '"' | "'" ([^<&'] | EntityRef | CharRef)* "'" ))* S? '/'? '>'
Now substitute EntityRef
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | CharRef)* '"' | "'" ([^<&'] | '&' Name ';' | CharRef)* "'" ))* S? '/'? '>'
Now substitute CharRef
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'
Now Eq
Start ::= '<' Name (S Name S? '=' S? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'
Next S
Start ::= '<' Name (([\u00020\u00009\u0000d\u0000a]+) Name ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Now substitute Name
Start ::= '<' (NameStartChar NameChar*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar NameChar*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Now substitute NameChar
Start ::= '<' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
And last NameStartChar
Start ::= '<' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? 
('"' ([^<&"] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Finally, after replacing each quoted 'c' by c and eliminating the unwanted blank spaces, the regex becomes:
<(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)(([\u00020\u00009\u0000d\u0000a]+)(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)([\u00020\u00009\u0000d\u0000a]+)?=([\u00020\u00009\u0000d\u0000a]+)?(\"([^<&\"]|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\"|\'([^<&\']|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\'))*([\u00020\u00009\u0000d\u0000a]+)?/?>
Of course there are other regexps that match a start/empty tag, but this is one of the simplest I have been able to develop that copes with the scenarios pointed out in the comments.
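The substitution procedure used above is mechanical, so it can be automated. The following Python sketch does so for a deliberately reduced, ASCII-only version of the rules (that reduction, the `{Name}` placeholder syntax and the `expand` helper are all assumptions of this example, not part of the XML specification). Each nonterminal is replaced by its right side wrapped in a non-capturing group, which is the "properly parenthesized" step, until no nonterminal is left.

```python
import re

# ASCII-only simplification of the rules above (an assumption made for brevity).
RULES = {
    "Start": r"<{Name}(?:{S}{Attribute})*{S}?/?>",
    "Attribute": r"{Name}{Eq}{AttValue}",
    "Eq": r"{S}?={S}?",
    "AttValue": '"(?:[^<&"]|{Reference})*"' + "|" + "'(?:[^<&']|{Reference})*'",
    "Reference": r"{EntityRef}|{CharRef}",
    "EntityRef": r"&{Name};",
    "CharRef": r"&#[0-9]+;|&#x[0-9a-fA-F]+;",
    "Name": r"{NameStartChar}{NameChar}*",
    "NameChar": r"{NameStartChar}|[-.0-9]",
    "NameStartChar": r"[:A-Za-z_]",
    "S": r"[ \t\r\n]+",
}
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def expand(rule_name):
    """Substitute nonterminals until only regex syntax remains (rules are not recursive)."""
    pattern = RULES[rule_name]
    while PLACEHOLDER.search(pattern):
        # Wrap every substituted right side in a group so operators keep their scope.
        pattern = PLACEHOLDER.sub(lambda m: "(?:" + RULES[m.group(1)] + ")", pattern)
    return pattern

TAG = re.compile(expand("Start"))

print(bool(TAG.fullmatch('<img src="foo">')))                 # True
print(bool(TAG.fullmatch("<a href='x' title=\"a&amp;b\" />")))  # True
print(bool(TAG.fullmatch('<img src="foo>')))                  # False
```

The loop terminates only because no rule refers back to itself, which is the same property the derivation above relies on.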
A simpler one could be:
<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>
if you are not dealing with characters outside the range \u0000-\u007f (the ASCII range) and you know the HTML file is valid. (This last one may be wrong in places, so use it with care: I constructed it in my head and it may mishandle some unusual cases.)
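Applied to the original question, the simpler pattern can be combined with "skip the first match". A small Python sketch, using the sample HTML from the question (with its quoting repaired) as test data:

```python
import re

# The simpler pattern from above, unchanged apart from Python string quoting.
IMG_TAG = re.compile(r"""<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>""")

html = '<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>'

# finditer yields whole matches in document order; drop the first one.
tags = [m.group(0) for m in IMG_TAG.finditer(html)]
print(tags[1:])  # ['<img src="bar">', '<img src="foobar">']
```

`finditer` is used rather than `findall` because the pattern contains a capturing group, and `findall` would return the group contents instead of the whole tag.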
I decided to translate the official C# grammar to ANTLR v4. However, while testing I encountered the following problem: the grammar doesn't match simple input like \n\ntrue\n\n<EOF>. It keeps saying mismatched input '\n\ntrue\n\n' expecting Literal. Even after I reduce the definition of Literal to Literal: BooleanLiteral; the input \n\ntrue\n\n<EOF> still doesn't get matched. I was expecting the grammar to skip the \ns and consume the true and <EOF>, but obviously this is not happening. I tried to debug but still haven't been able to find anything wrong. Any ideas?
grammar Test;
start: Literal EOF;
/**********
*
* Literals
*
**********/
Literal
: BooleanLiteral
| IntegerLiteral
| RealLiteral
| CharacterLiteral
| StringLiteral
| NullLiteral
;
BooleanLiteral
: 'true'
| 'false'
;
IntegerLiteral
: DecimalIntegerLiteral
| HexadecimalIntegerLiteral
;
DecimalIntegerLiteral
: DecimalDigits IntegerTypeSuffix?
;
DecimalDigits
: DecimalDigit+
;
DecimalDigit
: [0-9]
;
IntegerTypeSuffix
: 'U'
| 'u'
| 'L'
| 'l'
| 'UL'
| 'Ul'
| 'uL'
| 'ul'
| 'LU'
| 'Lu'
| 'lU'
| 'lu'
;
HexadecimalIntegerLiteral
: ('0x' | '0X') HexDigits IntegerTypeSuffix?
;
HexDigits
: HexDigit+
;
HexDigit
: [0-9A-Fa-f]
;
RealLiteral
: DecimalDigits '.' DecimalDigits ExponentPart? RealTypeSuffix?
| '.' DecimalDigits ExponentPart? RealTypeSuffix?
| DecimalDigits ExponentPart RealTypeSuffix?
| DecimalDigits RealTypeSuffix
;
ExponentPart
: ('e' | 'E') Sign? DecimalDigits
;
Sign
: '+'
| '-'
;
RealTypeSuffix
: 'F'
| 'f'
| 'D'
| 'd'
| 'M'
| 'm'
;
CharacterLiteral
: '\'' Character '\''
;
Character
: SingleCharacter
| SimpleEscapeSequence
| HexadecimalEscapeSequence
| UnicodeEscapeSequence
;
UnicodeEscapeSequence
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
| '\\' 'U' HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
;
SingleCharacter
: ~[\\\\\\\u000D\u000A\u0085\u2028\u2029]
;
SimpleEscapeSequence
: '\\\''
| '\\"'
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
HexadecimalEscapeSequence
: '\\x' HexDigit HexDigit? HexDigit? HexDigit?
;
StringLiteral
: RegularStringLiteral
| VerbatimStringLiteral
;
RegularStringLiteral
: '"' RegularStringLiteralCharacters? '"'
;
RegularStringLiteralCharacters
: RegularStringLiteralCharacter+
;
RegularStringLiteralCharacter
: SingleRegularStringLiteralCharacter
| SimpleEscapeSequence
| HexadecimalEscapeSequence
| UnicodeEscapeSequence
;
SingleRegularStringLiteralCharacter
: ~["\\\u000D\u000A\u0085\u2028\u2029]
;
VerbatimStringLiteral
: '@"' VerbatimStringLiteralCharacters? '"'
;
VerbatimStringLiteralCharacters
: VerbatimStringLiteralCharacter+
;
VerbatimStringLiteralCharacter
: SingleVerbatimStringLiteralCharacter
| QuoteEscapeSequence
;
SingleVerbatimStringLiteralCharacter
: ~["]
;
QuoteEscapeSequence
: '""'
;
NullLiteral
: 'null'
;
/**********
*
* Whitespaces and comments
*
**********/
WS : [ \t\r\n]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
EDIT:
Ok, I've managed to isolate the problem to this piece of code:
grammar Test;
start : VerbatimStringLiteral EOF ;
VerbatimStringLiteral
: '@"' VerbatimStringLiteralCharacter* '"'
;
VerbatimStringLiteralCharacter
: SingleVerbatimStringLiteralCharacter
| QuoteEscapeSequence
;
SingleVerbatimStringLiteralCharacter
: ~["]
;
QuoteEscapeSequence
: '""'
;
WS : [ \t\r\n]+ -> skip
;
Lexer rules which do not produce tokens themselves should be marked with the fragment modifier. For example, QuoteEscapeSequence is not a standalone token; it is just a part of the VerbatimStringLiteral token, so you should mark it with fragment. Here are some other rules which should be fragment rules:
VerbatimStringLiteralCharacter
SingleVerbatimStringLiteralCharacter
SingleRegularStringLiteralCharacter
RegularStringLiteralCharacter
RegularStringLiteralCharacters ← this one was the source of your errors for this particular input
SimpleEscapeSequence
There may be more, but this should give you an idea what the problem is and how to solve it.
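A rough way to see how a helper rule steals the input is to compare how much each rule can match at the start of the text, since the lexer takes the longest match. The Python sketch below models only three rules with regular expressions and assumes the input contains real line breaks; it is an approximation, not ANTLR's algorithm, and `winner` is a made-up helper.

```python
import re

def winner(rules, text):
    """Return (rule_name, match_length) of the longest match at offset 0; first rule wins ties."""
    best = ("<no match>", 0)
    for name, pattern in rules:
        m = re.compile(pattern).match(text)
        if m and m.end() > best[1]:
            best = (name, m.end())
    return best

text = "\n\ntrue\n\n"
token_rules = [
    ("BooleanLiteral", r"true|false"),
    ("WS", r"[ \t\r\n]+"),
]
# A helper rule left as a normal token rule (no fragment modifier).
helper_rule = ("VerbatimStringLiteralCharacters", r'(?:""|[^"])+')

print(winner(token_rules + [helper_rule], text))  # ('VerbatimStringLiteralCharacters', 8)
print(winner(token_rules, text))                  # ('WS', 2)
```

Under these assumptions the helper swallows all eight characters as one token, which would produce the "mismatched input" error. For other inputs a different helper rule (such as the one flagged above) can win in the same way. Once the helpers are fragments they no longer compete, so WS is matched and skipped and `true` can be tokenized next.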
I am trying to create an 'AET' (Abstract Expression Tree) for XPath (as I am writing a WYSIWYG XSL editor). I have been hitting my head against the wall with the XPath BNF for the past three to four hours.
I have thought of another solution. I thought I could write a class that implements IXPathNavigable, which returns an XPathNavigator of my own when CreateNavigator is called. This XPathNavigator would always succeed on any method call, and would keep track of those calls, e.g. we moved to the customers node and then the customer node. I could then use this information (hopefully) to create the 'AET' (so we would have customers/customer in an object model now).
The only question is: how on earth do I run an IXPathNavigable through an XPathExpression?
I know this is excessively lazy. But has anyone else gone through the effort and written an XPath expression parser? I haven't yet POC'd my possible solution, because I can't test it (because I can't run the XPathExpression against an IXPathNavigable), so I don't know whether my solution will work at all.
There is an ANTLR XPath grammar here. Since its license permits, I copied the whole grammar here to avoid link rot in the future.
grammar xpath;
/*
XPath 1.0 grammar. Should conform to the official spec at
http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar
rules have been kept as close as possible to those in the
spec, but some adjustments were unavoidable. These were
mainly removing left recursion (spec seems to be based on
LR), and to deal with the double nature of the '*' token
(node wildcard and multiplication operator). See also
section 3.7 in the spec. These rule changes should make
no difference to the strings accepted by the grammar.
Written by Jan-Willem van den Broek
Version 1.0
Do with this code as you will.
*/
/*
Ported to Antlr4 by Tom Everett <tom@khubla.com>
*/
main : expr
;
locationPath
: relativeLocationPath
| absoluteLocationPathNoroot
;
absoluteLocationPathNoroot
: '/' relativeLocationPath
| '//' relativeLocationPath
;
relativeLocationPath
: step (('/'|'//') step)*
;
step : axisSpecifier nodeTest predicate*
| abbreviatedStep
;
axisSpecifier
: AxisName '::'
| '@'?
;
nodeTest: nameTest
| NodeType '(' ')'
| 'processing-instruction' '(' Literal ')'
;
predicate
: '[' expr ']'
;
abbreviatedStep
: '.'
| '..'
;
expr : orExpr
;
primaryExpr
: variableReference
| '(' expr ')'
| Literal
| Number
| functionCall
;
functionCall
: functionName '(' ( expr ( ',' expr )* )? ')'
;
unionExprNoRoot
: pathExprNoRoot ('|' unionExprNoRoot)?
| '/' '|' unionExprNoRoot
;
pathExprNoRoot
: locationPath
| filterExpr (('/'|'//') relativeLocationPath)?
;
filterExpr
: primaryExpr predicate*
;
orExpr : andExpr ('or' andExpr)*
;
andExpr : equalityExpr ('and' equalityExpr)*
;
equalityExpr
: relationalExpr (('='|'!=') relationalExpr)*
;
relationalExpr
: additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*
;
additiveExpr
: multiplicativeExpr (('+'|'-') multiplicativeExpr)*
;
multiplicativeExpr
: unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?
| '/' (('div'|'mod') multiplicativeExpr)?
;
unaryExprNoRoot
: '-'* unionExprNoRoot
;
qName : nCName (':' nCName)?
;
functionName
: qName // Does not match nodeType, as per spec.
;
variableReference
: '$' qName
;
nameTest: '*'
| nCName ':' '*'
| qName
;
nCName : NCName
| AxisName
;
NodeType: 'comment'
| 'text'
| 'processing-instruction'
| 'node'
;
Number : Digits ('.' Digits?)?
| '.' Digits
;
fragment
Digits : ('0'..'9')+
;
AxisName: 'ancestor'
| 'ancestor-or-self'
| 'attribute'
| 'child'
| 'descendant'
| 'descendant-or-self'
| 'following'
| 'following-sibling'
| 'namespace'
| 'parent'
| 'preceding'
| 'preceding-sibling'
| 'self'
;
PATHSEP
:'/';
ABRPATH
: '//';
LPAR
: '(';
RPAR
: ')';
LBRAC
: '[';
RBRAC
: ']';
MINUS
: '-';
PLUS
: '+';
DOT
: '.';
MUL
: '*';
DOTDOT
: '..';
AT
: '@';
COMMA
: ',';
PIPE
: '|';
LESS
: '<';
MORE_
: '>';
LE
: '<=';
GE
: '>=';
COLON
: ':';
CC
: '::';
APOS
: '\'';
QUOT
: '\"';
Literal : '"' ~'"'* '"'
| '\'' ~'\''* '\''
;
Whitespace
: (' '|'\t'|'\n'|'\r')+ ->skip
;
NCName : NCNameStartChar NCNameChar*
;
fragment
NCNameStartChar
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
// Unfortunately, java escapes can't handle this conveniently,
// as they're limited to 4 hex digits. TODO.
// | '\U010000'..'\U0EFFFF'
;
fragment
NCNameChar
: NCNameStartChar | '-' | '.' | '0'..'9'
| '\u00B7' | '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
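To give a feel for how a rule from this grammar maps onto an object model such as the customers/customer example in the question, here is a hand-written Python sketch of just the relativeLocationPath rule. It is a toy under heavy assumptions: a step is only a plain element name (no axes, predicates, wildcards or abbreviated steps), and `parse_relative_path` is a name invented for this illustration.

```python
import re

# Tokens for a tiny subset: '//' and '/' separators plus plain element names.
TOKEN = re.compile(r"//|/|[A-Za-z_][A-Za-z0-9_.-]*")

def parse_relative_path(text):
    """Parse `step (('/'|'//') step)*` where a step is just a name.

    Returns a list of (separator, name) pairs; the first separator is ''.
    """
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError("unsupported characters in: " + text)
    if not tokens or tokens[0] in ("/", "//"):
        raise ValueError("a relative path must start with a step")
    steps = [("", tokens[0])]
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError("path must not end with a separator")
    # The remaining tokens must alternate: separator, name, separator, name, ...
    for sep, name in zip(rest[0::2], rest[1::2]):
        if sep not in ("/", "//") or name in ("/", "//"):
            raise ValueError("expected a separator followed by a name")
        steps.append((sep, name))
    return steps

print(parse_relative_path("customers/customer"))
# [('', 'customers'), ('/', 'customer')]
```

A generated ANTLR parser would give the same kind of structure for the full language, with a visitor or listener building the tree nodes.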
I have both written an XPath parser and an implementation of IXPathNavigable (I used to be a developer for XMLPrime). Neither is easy; and I suspect that the IXPathNavigable is not going to be the cheap win you hope, as there is quite a lot of subtlety in the interactions between different methods - I suspect a full blown XPath parser will be simpler (and more reliable).
To answer your question though:
var results = xpathNavigable.CreateNavigator().Evaluate("/my/xpath[expression]");
You'd probably need to enumerate over results to cause the node to be navigated.
If you always returned true, then all you'd know about the following XPath is that it looks for bar children of foo: foo[not(bar)]/other/elements
If you always returned a fixed number of nodes, then you'd never learn about most of this XPath: a[100]/b/c/
Essentially, this won't work.