How to resolve a simple ambiguity - C#

I just started using Antlr and am stuck. I have the below grammar and am trying to resolve the ambiguity to parse input like Field:ValueString.
expression : Field ':' ValueString;
Field : Letter LetterOrDigit*;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
Suppose a:b is passed to the grammar: a and b are both identified as Field. How do I resolve this in ANTLR4 (C#)?

You can use a semantic predicate in your lexer rules to look ahead (or behind) without consuming characters (see "ANTLR4 negative lookahead in lexer").
In your case, to remove the ambiguity, you can check whether the character after the Field rule is :, or whether the character before the ValueString is :.
In the first case:
expression : Field ':' ValueString;
Field : Letter LetterOrDigit* {_input.LA(1) == ':'}?;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
In the second case (note that the order of Field and ValueString has been swapped):
expression : Field ':' ValueString;
ValueString : {_input.LA(-1) == ':'}? ~[:];
Field : Letter LetterOrDigit*;
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
Also consider using the fragment keyword for Letter and LetterOrDigit:
fragment Letter : [a-zA-Z];
fragment LetterOrDigit : [a-zA-Z0-9];
"[With fragment keyword] You can also define rules that are not tokens but rather aid in the recognition of tokens. These fragment rules do not result in tokens visible to the parser." (source https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules)

A way to resolve this without lookahead is to simply define a parser rule that accepts either of the two lexer tokens:
expression : Field ':' value;
value : Field | ValueString;
Field : Letter LetterOrDigit*;
ValueString : ~[:];
Letter : [a-zA-Z];
LetterOrDigit : [a-zA-Z0-9];
WS: [ \t\r\n\u000C]+ -> skip;
The parser will work as expected and the grammar is kept simple, but you may need to add a method to your visitor or listener implementation.
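To see why both a and b come out as Field in the first place, here is a minimal Python sketch of the lexer's matching policy: the longest match wins, and on a tie the rule listed first wins. It is only a simulation under those assumptions, not ANTLR itself; the names lex, RULES, COLON and matches_expression are made up for illustration, and whitespace skipping is left out.

```python
import re

# Lexer rules in grammar order. COLON stands in for the implicit ':' token.
RULES = [
    ("COLON", r":"),
    ("Field", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("ValueString", r"[^:]"),
]

def lex(text, rules=RULES):
    """Tokenize text: longest match wins, ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

def matches_expression(tokens):
    """Mimics: expression : Field ':' value;  value : Field | ValueString;"""
    kinds = [kind for kind, _ in tokens]
    return (len(kinds) == 3 and kinds[0] == "Field" and kinds[1] == "COLON"
            and kinds[2] in ("Field", "ValueString"))

print(lex("a:b"))  # both letters tie between Field and ValueString; Field wins
print(matches_expression(lex("a:b")), matches_expression(lex("a:7")))
```

Because the value rule accepts either token type, the parser no longer cares which of the two the lexer picked for the right-hand side.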

Related

How to avoid the spaces between two texts while parsing using antlr

Please help me with the following case. I have a line with multiple words in it, and I need to parse each word in the line according to some rules. Below is my example input line:
## KEYWORD = MyName MyAliasName
Below are my parsing rules.
rule1:
Keyword name = identifier{ $name.str;} (' '* diffName = identifierTest { $diffName.str; })?
;
identifier
returns [string str]
@init{$str="";}:
i=Word{$str+=$i.text;} (i=(Number | Word ) {$str+=$i.text;})*
;
Keyword: SPACE* START SPACE* 'KEYWORD' SPACE* EQUAL SPACE*;
Number:DIGIT+;
Word:LETTER+;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: [0-9];
fragment SPACE: ' ' | '\t';
fragment START: '##';
fragment EQUAL: '=';
The "rule1" rule says that the MyName text is mandatory and MyAliasName is optional.
The "identifier" rule says that a name can start only with a letter or an underscore.
The Problem
If I put exactly one space between MyName and MyAliasName, the above rules work fine. If there is more than one space between MyName and MyAliasName, the first identifier rule reads both texts together as MyNameMyAliasName (it removes the spaces automatically). Why? I don't know what I'm doing wrong!
Whenever the optional text is present, I have to overwrite the name with the alias name. Please help, and thanks in advance.
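This grammar should solve your problem. It skips whitespace in the lexer (the SPACE rule), so the parser never has to deal with the spaces between the two identifiers: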
grammar TestGrammar;
rule1:
keyword name=IDENT{ System.out.println($name.text);} ( diffName = IDENT { System.out.println($diffName.text); })?
;
keyword: START KEYWORD EQUAL;
KEYWORD : 'KEYWORD' ;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: '0'..'9';
IDENT : LETTER (LETTER|DIGIT)*;
START : '##';
EQUAL : '=';
SPACE : [ \t]+ -> skip;

first character could be string second character should be numeric and rest should be alphanumeric

I have a string whose first character may be a digit, a letter, or a hyphen. If the first character is a digit, the second character must be a letter or a hyphen, and the rest may be any combination of digits, letters, and hyphens.
I tried with :
([A-Za-z-]{1})(?![A-Za-z-]{1})([A-Za-z-]{61})
String is valid:
if only alphabet in string and one character in string.
if only hyphen in string and one character in string.
if first character is numeric then second character should be alphabet or hyphen and rest could be alphabet, numeric or hyphen.
No special character or tab or space only hyphen is allowed.
Max length of string is 63 characters.
for example:
1 : invalid
11 : invalid
;1 : invalid
1; : invalid
a; : invalid
;a : invalid
- : valid
a : valid
aa : valid
a1 : valid
1a : valid
1- : valid
-1 : valid
a- : valid
-a : valid
11testisgoingon : invalid
;1testingisgoingon : invalid
1;testingisgoingon : invalid
a;testingisgoingon : invalid
;atestingisgoingon : invalid
-testingisgoingon : valid
atestingisgoingon : valid
aatestingisgoingon : valid
a1testingisgoingon : valid
1atestingisgoingon : valid
1-testingisgoingon : valid
-1testingisgoingon : valid
a-testingisgoingon : valid
-atestingisgoingon : valid
([A-Za-z-])(?![A-Za-z-])
This works for the first two characters, but it is incorrect once there are more than two.
Try
^(([0-9][a-z-][a-z0-9-]{0,60})?(-?[a-z0-9])?([a-z0-9]-?)?([a-z-][a-z0-9-]{0,62})?){1}$
with flags "Multiline, IgnoreCase" for
11testisgoingoninvalid
;1testingisgoingonnvalid
1;testingisgoingoninvalid
a;testingisgoingoninvalid
;atestingisgoingoninvalid
-testingisgoingonvalid
atestingisgoingonvalid
aatestingisgoingonvalid
a1testingisgoingonvalid
1atestingisgoingonvalid
1-testingisgoingonvalid
-1testingisgoingonvalid
a-testingisgoingonvalid
-atestingisgoingon
in https://regexr.com/.
Explanation:
Add the IgnoreCase flag to the regex, or add A-Z to every [a-z].
^((...)?(...)?(...)?(...)?){1}$
-> We set up an outer capturing group containing
4 inner ones, each of which may or may not occur. In total
only 1 must match.
The capturing group must fill the whole line, and
the inner capturing groups model your requirements:
([0-9][a-z-][a-z0-9-]{0,60})?
-> STARTS with numeric,
followed by alphabetic or hyphen,
followed by anything
up to 63 characters in total
(-?[a-z0-9])?
([a-z0-9]-?)?
-> one character prefixed/followed by hyphen
([a-z-][a-z0-9-]{0,62})?
-> generic long text not started by numeric
followed by anything
up to 63 characters long
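Note that because all four inner groups are optional, the pattern above also accepts the empty string and a lone digit such as 1. As an alternative, here is a minimal Python sketch that states the rules more directly: a negative lookahead rejects a leading digit that is followed by another digit or by the end of the string, and a single character class enforces the alphabet and the 1 to 63 length limit. The names LABEL and is_valid are made up for illustration, and the rule set is my reading of the question's examples.

```python
import re

# (?![0-9](?:[0-9]|\Z))  -> not "digit then digit" and not a lone digit
# [A-Za-z0-9-]{1,63}     -> 1 to 63 letters, digits, or hyphens
LABEL = re.compile(r"(?![0-9](?:[0-9]|\Z))[A-Za-z0-9-]{1,63}")

def is_valid(s):
    """Return True if s satisfies the rules listed in the question."""
    return LABEL.fullmatch(s) is not None

for sample in ["1", "11", ";1", "1;", "-", "a", "a1", "1a", "1-", "-1",
               "11testisgoingon", "1-testingisgoingon"]:
    print(repr(sample), "valid" if is_valid(sample) else "invalid")
```

The same pattern should carry over to .NET with ^ and $ anchors around it, since it only uses a lookahead and character classes.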

Parsing: ANTLR for .NET

I am trying to parse the following text:
<<! notes, Test!>>
Grammar:
grammar Hello;
prog: stat+;
stat: DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr: NOTES value=VAR_VALUE # delim_body ;
VAR_VALUE : [ a-Z A-Z 0-9 ! ];
NOTES : 'notes,'
| ' notes,';
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
Error:
line 1:12 token recognition error at: '>'
line 1:13 token recognition error at: '>'
line 1:10 mismatched input ' !' expecting VAR_VALUE
(NOTE: Added DELIMITER defs since I forgot them earlier)
Try this:
grammar Hello;
prog : stat+ EOF ;
stat : DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr : NOTES COMMA value=VAR_VALUE # delim_body ;
NOTES : 'notes' ; // listed before VAR_VALUE: on equal-length matches the first lexer rule wins
VAR_VALUE : ANBang* AlphaNum ;
COMMA : ',' ;
WS : [ \t\r\n]+ -> skip ;
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
fragment ANBang : AlphaNum | Bang ;
fragment AlphaNum : [a-zA-Z0-9] ;
fragment Bang : '!' ;
Ideally, the rules should be mutually unambiguous. So, the VAR_VALUE rule is defined so that it cannot end with a !. This prevents the ! from being consumed by VAR_VALUE in preference to DELIMETER_CLOSE. Of course, that presumes the redefinition is acceptable. If not, a more involved solution will be required.
Also, as a general principle, skip anything that is not syntactically significant to the parsing.

Match whole word containing numbers and underscores not between quotes

I am trying to find the right way to get this regex:
\b([A-Za-z]|_)+\b
To not match whole words between quotes (' AND "), so:
example_ _where // matches: example_, _where
this would "not be matched" // matches: this and would
while this would // matches: while this would
while 'this' would // matches: while and would
Additionally I am trying to find out how to include words containing numbers, but NOT only numbers, so again:
this is number 0 // matches: this, is and number
numb3r f1ve // matches: numb3r and f1ve
example_ _where // matches: example_ and _where
this would "not be 9 matched" // matches: this and would
while 'this' would // matches: while and would
The goal is to try and match only words that would be valid variable names in most common programming languages, without matching anything in a string.
This should work:
"[a-zA-Z_0-9\s]+"|([a-zA-Z0-9_]+)
The idea here is that a quoted string is consumed by the first alternative without being captured, so only words outside double quotes end up in group 1.
Use the below regex and then get the string you want from group index 1.
@"""[^""]*""|'[^']*'|\b([a-zA-Z_\d]*[a-zA-Z_][a-zA-Z_\d]*)\b"
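The "match the quoted string, capture only the word" idea can be tried out with a minimal Python sketch. It uses the same alternation as the pattern above, written as a Python raw string instead of a C# verbatim string; the names PATTERN and identifiers are made up for illustration.

```python
import re

# Alternatives 1 and 2 swallow quoted strings without capturing anything.
# Alternative 3 captures a word that contains at least one letter or underscore.
PATTERN = re.compile(
    r'''"[^"]*"|'[^']*'|\b([A-Za-z_\d]*[A-Za-z_][A-Za-z_\d]*)\b'''
)

def identifiers(text):
    """Return the group-1 captures, skipping matches that were quoted strings."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(identifiers('this would "not be 9 matched"'))
print(identifiers("while 'this' would"))
print(identifiers("this is number 0"))
```

Matches where group 1 is empty are the quoted strings, so filtering on group 1 is what keeps them out of the result.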
Since .NET allows variable length lookbehind you can use this regex:
\b\w+\b(?!(?<="[^"\n]*)[^"\n]*")
This matches a word unless it is both preceded and followed by a double quote on the same line.
To simplify things \w matches alphanumerics plus underscore.
So \b\w+\b will match one or more such characters with word boundaries.
Avoiding the quotes cannot be done simply with [^'"]\b\w+…, because that fails when no character precedes the target string (e.g. at the beginning of the input); a negative lookbehind does not have this problem. A negative lookahead handles the quote after:
(?<!['`"])\b\w+\b(?!['`"])
(Because those are negative groups, do not negate the character classes.)
To not match all numbers, again a lookahead can be used:
(?<!['`"])\b(?!\d+\b)\w+\b(?!['`"])
Explanation:
The character before is not ['"].
A word boundary.
The following characters up to the next word boundary are not all digits.
"Word" characters (this will be the match).
A word boundary.
Not followed by a quote.
Test, in PowerShell:
PS> [regex]::Matches("foo 'asas' bar 123456 ba55z 1xyzzy", "(?<!['`"])\b(?!\d+\b)\w+\b(?!['`"])")
Groups : {foo}
Success : True
Captures : {foo}
Index : 0
Length : 3
Value : foo
Groups : {bar}
Success : True
Captures : {bar}
Index : 11
Length : 3
Value : bar
Groups : {ba55z}
Success : True
Captures : {ba55z}
Index : 22
Length : 5
Value : ba55z
Groups : {1xyzzy}
Success : True
Captures : {1xyzzy}
Index : 28
Length : 6
Value : 1xyzzy
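The same lookaround pattern can be checked with a minimal Python sketch (the lookbehind is fixed-width, so it does not depend on .NET). The names WORD and words are made up for illustration. One limitation worth knowing: the lookarounds only inspect the characters directly next to the word, so a word in the middle of a multi-word quoted string is still matched.

```python
import re

# (?<!['"])   no quote directly before the word
# (?!\d+\b)   the word is not made up of digits only
# (?!['"])    no quote directly after the word
WORD = re.compile(r"""(?<!['"])\b(?!\d+\b)\w+\b(?!['"])""")

def words(text):
    """Return all words accepted by the lookaround pattern."""
    return WORD.findall(text)

print(words("foo 'asas' bar 123456 ba55z 1xyzzy"))
# The inner word "be" touches no quote, so it is reported as well:
print(words('this would "not be 9 matched"'))
```

If quoted strings with several words must be excluded entirely, the group-capturing approach from the earlier answer is the safer choice.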

Need pattern for a complex Regex Split

I'd like to split the following string
// Comments
KeyA : SomeType { SubKey : SubValue } KeyB:'This\'s a string'
KeyC : [ 1 2 3 ] // array value
into
KeyA
:
SomeType
{ SubKey : SubValue }
KeyB
:
This's a string
KeyC
:
[ 1 2 3 ]
(: and blank spaces are the delimiters although : is kept in the result; comments are ignored; no splitting between {}, [], or '')
Can I achieve that with Regex Split or Match? If so, what would be the right pattern? Comments to the pattern string would be appreciated.
Moreover, it's also desirable to throw exception or return an error message if the input string is not valid (see the comment below).
Thanks.
You can use this pattern...
string pattern = @"(\w+)\s*:\s*((?>[^\w\s""'{[:]+|\w+\b(?!\s*:)|\s(?!\w+\s*:|$)|\[[^]]*]|{[^}]*}|""(?>[^""\\]|\\.)*""|'(?>[^'\\]|\\.)*')+)\s*";
... in two ways:
with Match method which will give you what you are looking for with keys in group 1 and values in group 2
with Split method, but you must remove all the empty results.
How is the second part of the pattern (after the :) built?
The idea is, first of all, to avoid the problematic characters: [^\w\s\"'{[:]+
Then you allow each of these characters but in a specific situation:
\w+\b(?!\s*:) a word that is not the key
\s(?!\w+\s*:|$) spaces that are not at the end of the value (to trim them)
\[[^]]*] content surrounded by square brackets
{[^}]*} the same with curly brackets
"(?>[^"\\]|\\\\|\\.)*" content between double quotes (with escaped double quotes allowed)
'(?>[^'\\]|\\\\|\\.)*' the same with single quotes
Note that the problem with colon inside brackets or quotes is avoided.
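A different way to get the requested list, including an error for invalid input, is to match tokens one after another instead of splitting. Here is a minimal Python sketch of that idea. It is not the pattern above but a simpler alternation over comments, brace groups, bracket groups, quoted strings, colons and bare words; the names TOKEN and tokenize are made up for illustration, and nested braces or brackets are not handled.

```python
import re

TOKEN = re.compile(r"""
    //[^\n]*              # line comment (dropped later)
  | \{[^}]*\}             # { ... } kept as one token
  | \[[^\]]*\]            # [ ... ] kept as one token
  | '(?:[^'\\]|\\.)*'     # single-quoted string, backslash escapes allowed
  | :                     # the delimiter, kept in the result
  | [^\s:{}\[\]']+        # bare word
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens; raise ValueError where nothing matches."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"invalid input at position {pos}")
        tok, pos = m.group(), m.end()
        if tok.startswith("//"):
            continue  # comments are ignored
        if tok.startswith("'"):
            tok = re.sub(r"\\(.)", r"\1", tok[1:-1])  # drop quotes, unescape
        tokens.append(tok)
    return tokens

sample = r"""// Comments
KeyA : SomeType { SubKey : SubValue } KeyB:'This\'s a string'
KeyC : [ 1 2 3 ] // array value
"""
print("\n".join(tokenize(sample)))
```

An unterminated brace, bracket or quote matches none of the alternatives, which is what triggers the ValueError.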
I'm not quite sure what you're looking for when you get to KeyC. How do you know when the string value for KeyB ends and the string for KeyC begins? Is there a colon after 'this\'s is a string' or a line break? Here's a sample to get you started:
[TestMethod]
public void SplitString()
{
string splitMe = "KeyA : SubComponent { SubKey : SubValue } KeyB:This's is a string";
string pattern = "^(.*):(.*)({.*})(.*):(.*)";
Match match = Regex.Match(splitMe, pattern);
Assert.IsTrue(match.Success);
Assert.AreEqual(6, match.Groups.Count); // 1st group is the entire match
Assert.AreEqual("KeyA", match.Groups[1].Value.Trim());
Assert.AreEqual("SubComponent", match.Groups[2].Value.Trim());
Assert.AreEqual("{ SubKey : SubValue }", match.Groups[3].Value.Trim());
Assert.AreEqual("KeyB", match.Groups[4].Value.Trim());
Assert.AreEqual("This's is a string", match.Groups[5].Value.Trim());
}
This regex pattern should work for you
\s*:\s*(?![^\[]*\])(?![^{]*})(?=(([^"]*"[^"]*){2})*$|[^"]+$)
when replaced with
\n$0\n
