How to avoid the spaces between two texts while parsing using antlr

Please help me with the following case. I have a line with multiple texts in it. Based on some rules I need to parse each word in the line. Below is my example input line:
## KEYWORD = MyName MyAliasName
Below are my parsing rules:
rule1:
Keyword name = identifier{ $name.str;} (' '* diffName = identifierTest { $diffName.str; })?
;
identifier
returns [string str]
@init{$str="";}:
i=Word{$str+=$i.text;} (i=(Number | Word ) {$str+=$i.text;})*
;
Keyword: SPACE* START SPACE* 'KEYWORD' SPACE* EQUAL SPACE*;
Number:DIGIT+;
Word:LETTER+;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: [0-9];
fragment SPACE: ' ' | '\t';
fragment START: '##';
fragment EQUAL: '=';
The "rule1" rule defines that the MyName text is mandatory and MyAliasName is optional.
The "identifier" rule defines that a name can only start with a letter or an underscore.
The Problem
If I put exactly one space between MyName and MyAliasName, the above rules work fine. If there is more than one space between MyName and MyAliasName, the first identifier rule reads both texts together as MyNameMyAliasName (it removes the spaces automatically). Why? I don't know what I'm doing wrong!
Whenever the optional text is available, I have to overwrite the name with the alias name. Please help, and thanks in advance.

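This grammar should solve your problem. It skips whitespace in the lexer (SPACE : [ \t]+ -> skip) and matches each name as a single IDENT token, so the number of spaces between MyName and MyAliasName no longer matters: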
grammar TestGrammar;
rule1:
keyword name=IDENT{ System.out.println($name.text);} ( diffName = IDENT { System.out.println($diffName.text); })?
;
keyword: START KEYWORD EQUAL;
KEYWORD : 'KEYWORD' ;
fragment LETTER: 'A'..'Z' | 'a'..'z' | '_';
fragment DIGIT: '0'..'9';
IDENT : LETTER (LETTER|DIGIT)*;
START : '##';
EQUAL : '=';
SPACE : [ \t]+ -> skip;
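The effect of skipping whitespace in the lexer can be imitated in plain C#. The following is only a rough sketch and not ANTLR-generated code: the class name SkipWhitespaceSketch and the token pattern are assumptions made for illustration, and unlike a real lexer this version silently steps over any character it does not recognise.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipWhitespaceSketch
{
    // Hypothetical token pattern mirroring START, EQUAL and IDENT from the grammar.
    // Whitespace is not part of any alternative, so Regex.Matches steps over it,
    // which is roughly what "-> skip" achieves in the lexer.
    private static readonly Regex Token = new Regex(@"##|=|[A-Za-z_][A-Za-z_0-9]*");

    public static List<string> Tokenize(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(line))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // Several spaces between the two names still yield two separate tokens.
        Console.WriteLine(string.Join("|", Tokenize("## KEYWORD = MyName     MyAliasName")));
        // prints: ##|KEYWORD|=|MyName|MyAliasName
    }
}
```

Because each name arrives as its own token, the parser rule can simply expect one IDENT followed by an optional second IDENT.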

Related

Parsing: ANTLR for .NET

I am trying to parse the following text:
<<! notes, Test!>>
Grammar:
grammar Hello;
prog: stat+;
stat: DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr: NOTES value=VAR_VALUE # delim_body ;
VAR_VALUE : [ a-Z A-Z 0-9 ! ];
NOTES : 'notes,'
| ' notes,';
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
Error:
line 1:12 token recognition error at: '>'
line 1:13 token recognition error at: '>'
line 1:10 mismatched input ' !' expecting VAR_VALUE
(NOTE: Added DELIMITER defs since I forgot them earlier)

Try this:
grammar Hello;
prog : stat+ EOF ;
stat : DELIMETER_OPEN expr DELIMETER_CLOSE ;
expr : NOTES COMMA value=VAR_VALUE # delim_body ;
VAR_VALUE : ANBang* AlphaNum ;
NOTES : 'notes' ;
COMMA : ',' ;
WS : [ \t\r\n]+ -> skip ;
DELIMETER_OPEN : '<<!';
DELIMETER_CLOSE : '!>>';
fragment ANBang : AlphaNum | Bang ;
fragment AlphaNum : [a-zA-Z0-9] ;
fragment Bang : '!' ;
Ideally, the rules have to be mutually unambiguous. So, the VAR_VALUE rule is defined so that a value can contain a ! but cannot end with one. This prevents the ! from being consumed by VAR_VALUE in preference to DELIMETER_CLOSE. Of course, that presumes the redefinition is acceptable. If not, a more involved solution will be required.
Also, as a general principle, skip anything that is not syntactically significant to the parsing (here, the whitespace matched by the WS rule).

C# regular expression for finding a certain pattern in a text

I'm trying to write a program that can replace bible verses within a document with any desired translation. This is useful for older books that contain a lot of KJV referenced verses. The most difficult part of the process is coming up with a way to extract the verses within a document.
I find that most books that place bible verses within the text use a structure like "N"(BookName chapter#:verse#s), where N is the verse text, the quotations are literal and the parens are also literal. I've been having problems coming up with a regular expression to match these in a text.
The latest regular expression I'm trying to use is this: \"(.+)\"\s*\(([\w. ]+[0-9\s]+[:][\s0-9\-]+.*)\). I'm having trouble where it won't find all the matches.
Here is the regex101 of it with a sample. https://regex101.com/r/eS5oT8/1
Is there any way to solve this using a regular expression? Any help or suggestions would be greatly appreciated.

It's worth mentioning that the site you were using to test this relies on JavaScript regular expressions, which need the g modifier set explicitly to find every match. C# has no such flag: Regex.Matches() returns all matches by default.
You can adjust your expression slightly and make sure the double quotes are escaped properly:
// Updated expression with escaped double-quotes and other minor changes
var regex = new Regex(@"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");
And then use the Regex.Matches() method to find all of the matches in your string :
// Find each of the matches and output them
foreach (Match m in regex.Matches(input))
{
    // Output each match here (using Console Example)
    Console.WriteLine(m.Value);
}

Use the "g" modifier.
g modifier: global. All matches (don't return on first match)

You can try the example given on MSDN. Here is the link:
https://msdn.microsoft.com/en-us/library/0z2heewz(v=vs.110).aspx
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string input = "ablaze beagle choral dozen elementary fanatic " +
                       "glaze hunger inept jazz kitchen lemon minus " +
                       "night optical pizza quiz restoration stamina " +
                       "train unrest vertical whiz xray yellow zealous";
        // Match every whole word that contains at least one 'z'
        string pattern = @"\b\w*z+\w*\b";
        Match m = Regex.Match(input, pattern);
        while (m.Success) {
            Console.WriteLine("'{0}' found at position {1}", m.Value, m.Index);
            m = m.NextMatch();
        }
    }
}
// The example displays the following output:
// 'ablaze' found at position 0
// 'dozen' found at position 21
// 'glaze' found at position 46
// 'jazz' found at position 65
// 'pizza' found at position 104
// 'quiz' found at position 110
// 'whiz' found at position 157
// 'zealous' found at position 174

How about starting with this as a guide:
(?<quote>"".+"") # a series of any characters in quotes
\s + # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s] *) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)
Using the options:
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase
The expression above will give you named captures of every element in the match for easy parsing (e.g., you'll be able to pick out quote, book, chapter and verse) by looking at, e.g., match.Groups["verse"].
Full code:
var input = @"Jesus said, ""'Love your neighbor as yourself.'
There is no commandment greater than these"" (Mark 12:31).";
var bibleQuotesRegex =
@"(?<quote>"".+"") # a series of any characters in quotes
\s + # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s] *) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)";
foreach (Match match in Regex.Matches(input, bibleQuotesRegex, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase))
{
    var bibleQuote = new
    {
        Quote = match.Groups["quote"].Value,
        Book = match.Groups["book"].Value,
        Chapter = int.Parse(match.Groups["chapter"].Value),
        Verse = int.Parse(match.Groups["verse"].Value)
    };
    //do something with it.
}

After you've added "g", also be careful if there are multiple verses without any '\n' character in between, because "(.*)" will treat them as one long match instead of multiple verses. You will want something like "([^"]*)" to prevent that.
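The difference between a greedy dot and a negated character class can be checked with a small C# sketch. The class name QuoteMatchingSketch, the simplified patterns, and the sample sentence are assumptions chosen for illustration; they are not the asker's full expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteMatchingSketch
{
    // Greedy: .* can run past a closing quote and swallow the next verse too.
    public const string Greedy = "\"(.*)\"\\s*\\(([^)]*)\\)";

    // Negated class: [^\"]* stops at the first closing quote.
    public const string Negated = "\"([^\"]*)\"\\s*\\(([^)]*)\\)";

    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Two quoted verses on one line, with no newline between them.
        string input = "\"Jesus wept.\" (John 11:35) and \"Rejoice always.\" (1 Thess. 5:16)";
        Console.WriteLine(Count(Greedy, input));  // prints: 1
        Console.WriteLine(Count(Negated, input)); // prints: 2
    }
}
```

With the greedy pattern the single match spans from the first opening quote to the last closing parenthesis, which is why both verses are reported as one.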

ANTLR grammar matches incompatible rule instead of throwing NoViableAltException

I have the following ANTLR grammar that forms part of a larger expression parser:
grammar ProblemTest;
atom : constant
| propertyname;
constant: (INT+ | BOOL | STRING | DATETIME);
propertyname
: IDENTIFIER ('/' IDENTIFIER)*;
IDENTIFIER
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
INT
: '0'..'9'+;
BOOL : ('true' | 'false');
DATETIME
: 'datetime\'' '0'..'9'+ '-' '0'..'9'+ '-' + '0'..'9'+ 'T' '0'..'9'+ ':' '0'..'9'+ (':' '0'..'9'+ ('.' '0'..'9'+)*)* '\'';
STRING
: '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
If I invoke this in the interpreter from within ANTLRWorks with
'Hello\\World'
then this is interpreted as a propertyname instead of a constant. The same happens if I compile this in C# and run it in a test harness, so it is not a problem with the dodgy interpreter.
I'm sure I am missing something really obvious... but why is this happening? It's clear there is a problem with the string matcher, but I would have thought at the very least that the fact that IDENTIFIER does not match the ' character would mean that this would throw a NoViableAltException instead of just falling through?

First, neither ANTLRWorks nor antlr-3.5-complete.jar can be used to generate code for the C# targets. They might produce files ending in .cs, and those files might even compile, but those files will not be the same as the files produced by the C# port of the ANTLR Tool (Antlr3.exe) or the recommended MSBuild integration. Make sure you are producing your generated parser by one of the tested methods.
Second, INT will never be matched. Since IDENTIFIER appears before INT in the grammar, and all sequences '0'..'9'+ match both IDENTIFIER and INT, the lexer will always take the first option which appears (IDENTIFIER).
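The ordering point can be illustrated with a toy classifier in C#. This is a simplified model and not how ANTLR's generated lexer is implemented: it assumes the longest match wins and that a tie goes to the rule listed first. The class name RuleOrderSketch and the regex equivalents of the two rules are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RuleOrderSketch
{
    // Returns the name of the winning rule for the start of the text:
    // the longest match wins, and on a tie the rule listed first is kept.
    public static string Classify(string text, params (string Name, string Pattern)[] rules)
    {
        string winner = "(none)";
        int longest = 0;
        foreach (var rule in rules)
        {
            Match m = Regex.Match(text, "^(?:" + rule.Pattern + ")");
            if (m.Success && m.Length > longest)
            {
                winner = rule.Name;
                longest = m.Length;
            }
        }
        return winner;
    }

    public static void Main()
    {
        var identifier = ("IDENTIFIER", "[a-zA-Z0-9_]+");
        var integer = ("INT", "[0-9]+");

        // Same order as the grammar: IDENTIFIER shadows INT for pure digits.
        Console.WriteLine(Classify("123", identifier, integer)); // prints: IDENTIFIER

        // Listing INT first lets it win the tie.
        Console.WriteLine(Classify("123", integer, identifier)); // prints: INT
    }
}
```

Under this model, either moving INT above IDENTIFIER or preventing IDENTIFIER from starting with a digit would let digit-only input be recognised as INT.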

How to capture white spaces in a string in ANTLR?

This should be simple, but I am pulling my hair out trying to solve it.
I am trying to grab a quoted string, including the white space (space and tab) in the string, and record the value in a C# string. However, I would also like to ignore the same whitespace (via the lexer) when it appears outside the quoted string. I have the typical WS lexer rule included, but the WS rule is taking out the white space when I want it (when it's in a quoted string). If I remove the {channel=HIDDEN} from the WS rule, I lose all the other whitespace handling and have to manually add WS everywhere between the tokens. Any help would be greatly appreciated! Here is my grammar:
program returns [KeyValuePair<string, string> kvp]
:
ident=IDENT {kvp.Key = ident.Text;}
'='
quote=quoted_ident {kvp.Value = quote.ret;}
;
quoted_ident returns [string ret]
:
'"'
(
(ident=IDENT|ident=DOUBLE) {$ret += ident.Text;}
|
ws=WS {$ret += ws.Text;}
)+
'"'
;
WS :
(
' '
|
'\t'
)
{ $channel = HIDDEN; }
;
fragment DIGIT: '0'..'9';
fragment LETTER: ('a'..'z' | 'A'..'Z');
fragment DOT:'.';
DOUBLE : ((DIGIT)+(DOT(DIGIT)+)?)|(DOT(DIGIT)+);
IDENT : (LETTER|DIGIT|DOT|':'|'\''|'/'|'\\'|'_'|'#'|';'|'?'|'-'|'#'|'$'|'%'|'^'|'&'|'*')+;
Examples:
input: ' Name = " My Name " '
Expected Value for kvp.Value ' My Name '
Actual Value 'MyName'
I want to ignore all spaces and tabs outside of the quotes, but capture them within the quotes.

Overlapping rules in regex with named groups

I'm experiencing problems with a regex that parses custom phone numbers:
A value matching "wtvCode" group is optional;
A value matching "countryCode" group is optional;
The countryCode rule overlaps with areaCityCode rule for some values. In such cases, when countryCode is missing, its expression captures the areaCityCode value instead.
Code example is below.
Regex regex = new Regex(string.Concat(
"^(",
"(?<wtvCode>[A-Z]{3}|)",
"([-|/|#| |]|)",
"(?<countryCode>[2-9+]{2,5}|)",
"([-|/|#| |]|)",
"(?<areaCityCode>[0-9]{2,3}|)",
"([-|/|#| |]|))",
"(?<phoneNumber>(([0-9]{8,18})|([0-9]{3,4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4}([-|/|#| |]|)[0-9]{1,5})))",
"([-|/|#| |]|)",
"(?<foo>((A)|(B)))",
"([-|/|#| |]|)",
"(?<bar>(([1-9]{1,2})|)",
")$"
));
string[] validNumbers = new[] {
"11-1234-5678-27-A-2", // missing wtvCode and countryCode
"48-1234-5678-27-A-2", // missing wtvCode and countryCode
"55-48-1234-5678-27-A-2" // missing wtvCode
};
foreach (string number in validNumbers) {
Console.WriteLine("countryCode: {0}", regex.Match(number).Groups["countryCode"].Value);
Console.WriteLine("areaCityCode: {0}", regex.Match(number).Groups["areaCityCode"].Value);
Console.WriteLine("phoneNumber: {0}", regex.Match(number).Groups["phoneNumber"].Value);
}
The output for that is:
// First number
// countryCode: <- correct
// areaCityCode: 11 <- correct, but that's because "11" is never a countryCode
// phoneNumber: 1234-5678-27 <- correct
// Second number
// countryCode: 48 <- wrong, should be ""
// areaCityCode: <- wrong, should be "48"
// phoneNumber: 1234-5678-27 <- correct
// Third number
// countryCode: 55 <- correct
// areaCityCode: 48 <- correct
// phoneNumber: 1234-5678-27 <- correct
So far I've failed to fix this regular expression so that it covers all my constraints and doesn't mix up countryCode and areaCityCode when a value matches both rules. Any ideas?
Thanks in advance.
Update
The correct regex pattern for phone country codes can be found here: https://stackoverflow.com/a/6967885/136381

First I recommend using the ? quantifier to make things optional instead of the empty alternatives you're using now. And in the case of the country code, add another ? to make it non-greedy. That way it will try initially to capture the first bunch of digits in the areaCityCode group. Only if the overall match fails will it go back and use the countryCode group instead.
Regex regex = new Regex(
@"^
( (?<wtvCode>[A-Z]{3}) [-/# ] )?
( (?<countryCode>[2-9+]{2,5}) [-/# ] )??
( (?<areaCityCode>[0-9]{2,3}) [-/# ] )?
(?<phoneNumber> [0-9]{8,18} | [0-9]{3,4}[-/# ][0-9]{4}([-/# ][0-9]{1,5})? )
( [-/# ] (?<foo>A|B) )
( [-/# ] (?<bar>[1-9]{1,2}) )?
$",
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
As you can see, I've made a few other changes to your code, the most important being the switch from ([-|/|#| |]|) to [-/# ]. The pipes inside the brackets would just match a literal |, which I'm pretty sure you don't want. And the last pipe made the separator optional; I hope they don't really have to be optional, because that would make this job a lot more difficult.
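The effect of the lazy ?? quantifier on an optional group can be seen in a cut-down C# sketch. The two-digit codes, the four-digit number, and the class name LazyOptionalSketch are simplifications assumed for illustration; they are not the full phone-number pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyOptionalSketch
{
    // Greedy optional country code: it grabs the first digits whenever it can.
    public const string Greedy = @"^(?:(?<cc>\d{2})-)?(?:(?<area>\d{2})-)?(?<num>\d{4})$";

    // Lazy optional country code: it is tried only if the match fails without it.
    public const string Lazy = @"^(?:(?<cc>\d{2})-)??(?:(?<area>\d{2})-)?(?<num>\d{4})$";

    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return string.Format("cc='{0}' area='{1}' num='{2}'",
            m.Groups["cc"].Value, m.Groups["area"].Value, m.Groups["num"].Value);
    }

    public static void Main()
    {
        Console.WriteLine(Describe(Greedy, "48-1234"));  // prints: cc='48' area='' num='1234'
        Console.WriteLine(Describe(Lazy, "48-1234"));    // prints: cc='' area='48' num='1234'
        Console.WriteLine(Describe(Lazy, "55-48-1234")); // prints: cc='55' area='48' num='1234'
    }
}
```

In the last call the engine first tries to skip the country code, fails to match the rest, and only then backtracks to fill the cc group.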

There are two things overlooked by yourself and the other responders.
The first is that it makes more sense to work in reverse, in other words right to left, because there are more required fields at the end of the text than at the beginning. By removing the doubt about the WTV and the country code, it becomes much easier for the regex parser to work (though intellectually harder for the person writing the pattern).
The second is the use of the if conditional in regex (? () | () ). That allows us to test a scenario and apply one match pattern over another. I describe the if conditional on my blog, in a post entitled Regular Expressions and the If Conditional. The pattern below tests whether both the WTV and the country code are present; if so it matches them, and if not it checks for an optional country code.
Also, instead of concatenating the pattern, why not use IgnorePatternWhitespace to allow commenting the pattern, as I show below:
string pattern = @"
^
(?([A-Z][^\d]?\d{2,5}(?:[^\d])) # If WTV & Country Code (CC)
(?<wtvCode>[A-Z]{3}) # Get WTV & CC
(?:[^\d]?)
(?<countryCode>\d{2,5})
(?:[^\d]) # Required Break
| # else maybe a CC
(?<countryCode>\d{2,5})? # Optional CC
(?:[^\d]?) # Optional Break
)
(?<areaCityCode>\d\d\d?) # Required area city
(?:[^\d]?) # Optional break (OB)
(?<PhoneStart>\d{4}) # Default Phone # begins
(?:[^\d]?) # OB
(?<PhoneMiddle>\d{4}) # Middle
(?:[^\d]?) # OB
(?<PhoneEnd>\d\d) # End
(?:[^\d]?) # OB
(?<foo>[AB]) # Foo?
(?:[^AB]+)
(?<bar>\d)
$
";
var validNumbers = new List<string>() {
"11-1234-5678-27-A-2", // missing wtvCode and countryCode
"48-1234-5678-27-A-2", // missing wtvCode and countryCode
"55-48-1234-5678-27-A-2", // missing wtvCode
"ABC-501-48-1234-5678-27-A-2" // Calling Belize (501)
};
validNumbers.ForEach( nm =>
{
// IgnorePatternWhitespace only allows us to comment the pattern; does not affect processing
var result = Regex.Match(nm, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.RightToLeft).Groups;
Console.WriteLine (Environment.NewLine + nm);
Console.WriteLine("\tWTV code : {0}", result["wtvCode"].Value);
Console.WriteLine("\tcountryCode : {0}", result["countryCode"].Value);
Console.WriteLine("\tareaCityCode: {0}", result["areaCityCode"].Value);
Console.WriteLine("\tphoneNumber : {0}{1}{2}", result["PhoneStart"].Value, result["PhoneMiddle"].Value, result["PhoneEnd"].Value);
}
);
Results:
11-1234-5678-27-A-2
WTV code :
countryCode :
areaCityCode: 11
phoneNumber : 1234567827
48-1234-5678-27-A-2
WTV code :
countryCode :
areaCityCode: 48
phoneNumber : 1234567827
55-48-1234-5678-27-A-2
WTV code :
countryCode : 55
areaCityCode: 48
phoneNumber : 1234567827
ABC-501-48-1234-5678-27-A-2
WTV code : ABC
countryCode : 501
areaCityCode: 48
phoneNumber : 1234567827
Notes:
If there is no divider between the country code and the city code, there is no way a parser can determine what is city and what is country.
Your original country code pattern, [2-9], failed for any country code with a 0 in it. Hence I changed it to [2-90].
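The two techniques named in this answer, RegexOptions.RightToLeft and the if conditional, can each be tried in isolation. The following C# sketch uses toy patterns and a hypothetical class name (RightToLeftSketch); it assumes nothing about the phone-number format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftSketch
{
    // Returns the first run of digits the engine finds, scanning in the
    // direction selected by the options.
    public static string FirstDigits(string input, RegexOptions options)
    {
        return Regex.Match(input, @"\d+", options).Value;
    }

    // Conditional: if the next character is a digit, require exactly three
    // digits; otherwise require exactly two lowercase letters.
    public static bool IsCode(string input)
    {
        return Regex.IsMatch(input, @"^(?(?=\d)\d{3}|[a-z]{2})$");
    }

    public static void Main()
    {
        Console.WriteLine(FirstDigits("a1b22c333", RegexOptions.None));        // prints: 1
        Console.WriteLine(FirstDigits("a1b22c333", RegexOptions.RightToLeft)); // prints: 333
        Console.WriteLine(IsCode("123")); // prints: True
        Console.WriteLine(IsCode("ab"));  // prints: True
        Console.WriteLine(IsCode("12"));  // prints: False
    }
}
```

Writing the condition as an explicit lookahead, (?(?=...)yes|no), avoids any ambiguity with the form that tests whether a named or numbered group has matched.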
