ANTLR4 in C# catches only one token - c#

g4 file:
grammar TestFlow;
options
{
language=CSharp4;
output=AST;
}
/*
* Parser Rules
*/
compileUnit : LC | BC ;
/*
* Lexer Rules
*/
BC : '/*' .*? '*/' ;
LC : '//' .*? [\r\n] ;
Code:
var input = " /*aaa*/ /// \n ";
var stream = new AntlrInputStream(input);
ITokenSource lexer = new TestFlowLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new TestFlowParser(tokens);
parser.BuildParseTree = true;
var tree = parser.compileUnit();
var n = tree.ChildCount;
var top = new List<string>();
for (int i = 0; i < n; i++) {
top.Add(tree.GetChild(i).GetText());
}
After running above code I get single string in top: /*aaa*/. The single-line comment isn't caught.
What's wrong?

All parser/lexer generation errors and warnings are significant. Both options statements are invalid in the current version of ANTLR 4.
The runtime errors point to the root problem: unrecognizable input characters. Specifically, the grammar does not handle whitespace. Add a lexer rule to fix that:
WS: [ \r\n\t] -> skip ;
While not necessarily a problem, it is good form to require the parser to process all input. The lexer will generate an EOF token at the end of the source input. Fix the main rule to require the EOF:
compileUnit : ( LC | BC ) EOF ;
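As written, compileUnit matches exactly one LC or BC token, which is why only the first comment ends up in the tree. The correct way to allow for repetition is to use a * or + operator: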
compileUnit : ( LC | BC )+ EOF ;
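To see the effect of the WS rule and the repetition without regenerating the parser, here is a minimal sketch in plain C#. It does not use ANTLR: the class name CommentScanner and the regular expressions are assumptions that only approximate the BC, LC and WS lexer rules above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CommentScanner
{
    // Approximation of the lexer rules BC, LC and WS as one .NET regex.
    // \G anchors each match at the position where the search starts.
    static readonly Regex Token = new Regex(
        @"\G(?:(?<BC>/\*.*?\*/)|(?<LC>//.*?[\r\n])|(?<WS>[ \r\n\t]))",
        RegexOptions.Singleline);

    // Returns the text of every comment token, skipping whitespace
    // the way "WS -> skip" does.
    public static List<string> Scan(string input)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            Match m = Token.Match(input, pos);
            if (!m.Success)
                throw new FormatException($"unrecognized character at index {pos}");
            if (!m.Groups["WS"].Success)
                result.Add(m.Value);   // keep BC and LC, drop WS
            pos += m.Length;           // every alternative consumes at least one character
        }
        return result;
    }
}
```

For the input from the question, " /*aaa*/ /// \n ", Scan returns two entries: the block comment and the line comment. Removing the WS alternative makes the very first character fail, which mirrors the runtime errors reported by the generated lexer.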

Related

Regex Replace variable number of ****** with predefined digits

var panmaskednumber = "543034******0243";
Console.WriteLine(panmaskednumber.Count(x => x == '*'));
var pattern = "\\*";
var replace = "123456789";
Regex reg = new Regex(pattern);
var newnumber = reg.Replace(panmaskednumber, replace,panmaskednumber.Count(x => x == '*'));
Console.WriteLine(newnumber);
I'm trying to replace the * characters in var panmaskednumber (which comes from the DB with a symmetric key).
I'd rather not use the Contains approach, where I specify the number of * (6 or 7) in multiple if/else-if branches, since the count can vary (6, 7, up to 9).
With my approach above, each * character is replaced with the whole of var replace.
A LINQ approach, if there is one, would be highly appreciated.
Expected result, something like: 5430341234567890243
You need to use a \*+ pattern that will match 1 or more asterisk symbols:
var panmaskednumber = "543034******0243";
var replace = "123456789";
var res = Regex.Replace(panmaskednumber, @"\*+", replace);
// res => 5430341234567890243
See the C# demo.
If the number of asterisks to be replaced depends on the replace length, you may pass the match value to the match evaluator and perform the necessary manipulations there:
var panmaskednumber = "543034*****0243";
var replace = "123";
var res = Regex.Replace(panmaskednumber, @"\*+", m =>
m.Value.Length <= replace.Length ?
replace.Substring(0, m.Value.Length) :
$"{replace}{m.Value.Substring(replace.Length)}"
);
Console.Write(res);
// "543034***0243" / "123456789" -> 543034 123 0243
// "543034*****0243" / "123" -> 543034 123** 0243
See another C# demo.
You can use a Regex, but you can also do without one. Let's go for a simpler solution. Try it online.
var panmaskednumber = "543034******0243";
var count = panmaskednumber.Count(x => x == '*');
var start = panmaskednumber.IndexOf('*');
var replace = "123456789";
// output 5430341234567890243 (543034 123456789 0243)
Console.WriteLine(panmaskednumber.Remove(start) // get head
+ replace // add replace
+ panmaskednumber.Substring(start + count)); // add tail
// output 5430341234560243 (543034 123456 0243)
Console.WriteLine(panmaskednumber.Remove(start) // get head
+ replace.Remove(count) // add replace with count respect
+ panmaskednumber.Substring(start + count)); // add tail
replace = "123";
// output 543034123***0243 (543034 123*** 0243)
Console.WriteLine(panmaskednumber.Remove(start) // get head
+ replace // add replace
+ new string('*', count - replace.Length) // fill with missing *
+ panmaskednumber.Substring(start + count)); // add tail
"I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
Law of the instrument
Don't use Regex if you don't have to. For this problem, plain C#/.NET is enough. :)
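The three non-regex variants above can be folded into a single helper. This is only a sketch: the names PanMask and Fill are hypothetical, and it assumes the masked string contains one contiguous run of * and that the result should keep the original length.

```csharp
using System;

static class PanMask
{
    // Hypothetical helper, not part of the original answers.
    // Fills the first run of '*' with characters from `replace`:
    //  - if `replace` is longer than the run, only the first `count` characters are used;
    //  - if it is shorter, the remaining positions stay masked with '*'.
    public static string Fill(string masked, string replace)
    {
        int start = masked.IndexOf('*');
        if (start < 0) return masked;              // nothing to replace

        int end = start;
        while (end < masked.Length && masked[end] == '*') end++;

        int count = end - start;                   // length of the '*' run
        int used = Math.Min(count, replace.Length);

        return masked.Substring(0, start)          // head
             + replace.Substring(0, used)          // replacement, cut to the run length
             + new string('*', count - used)       // leftover mask, if any
             + masked.Substring(end);              // tail
    }
}
```

With the values from the answer, PanMask.Fill("543034******0243", "123456789") gives 5430341234560243 and PanMask.Fill("543034******0243", "123") gives 543034123***0243. If the full replacement should be inserted regardless of the run length, the Regex.Replace call with the \*+ pattern shown earlier is the shorter route.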

Parsing csv using ANTLR in c#

I have created the following grammar in ANTLR for parsing a CSV file.
grammar CSV;
file returns [List<List<string>> data]
@init {$data = new List<List<string>>();}
: (row {$data.Add($row.list);})+ EOF
;
row returns [List<string> list]
@init {$list = new List<string>();}
: a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);}
)*
(LineBreak | EOF)
;
value returns [string val]
: SimpleValue {$val = $SimpleValue.text;}
| QuotedValue
{
System.Console.WriteLine($val);
$val = $QuotedValue.text;
$val = $val.Substring(1, $val.Length-2); // strip both enclosing quotes
$val = $val.Replace("\"\"", "\"");
}
;
Comma :
( ' '* ',' ' '*);
LineBreak :
'\r'? '\n';
SimpleValue
: ~[,\r\n"]+
;
QuotedValue
: '"' ('""' | ~'"')* '"'
;
The above grammar parses the following CSV file without error:
a,b
1,2
3,4
But when I parse the following CSV file, it throws the error shown below it:
a,b
,2
3,4
line 2:0 extraneous input ',' expecting {<EOF>, SimpleValue, QuotedValue}
Can someone guide me on how to solve this problem?
Main Program
public List<List<string>> Parse()
{
string csvData = string.Empty;
if (string.IsNullOrEmpty(_path))
throw new ArgumentException("Path can not be empty");
try
{
csvData = File.ReadAllText(_path);
}
catch (Exception)
{
throw new FileNotFoundException(string.Format("{0} not found", _path));
}
// create an instance of the lexer
CSVLexer lexer = new CSVLexer(new AntlrInputStream(csvData));
// wrap a token-stream around the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create the parser
CSVParser parser = new CSVParser(tokens);
// invoke the entry point of our grammar
_results = parser.file().data;
return _results;
}
Update
As per Mike Lischke's answer, I have updated the row rule as below. Now I am not getting any errors.
row returns [List<string> list]
@init {$list = new List<string>();}
: Comma? a=value {
$list.Add($a.val);
}
(Comma b=value {
$list.Add($b.val);
}
)*
(LineBreak | EOF)
;
Obviously your row rule is not flexible enough to handle missing values. You should use something like this instead:
row: Comma? value (Comma value)*;
which adds the possibility of a leading comma (actually a missing first value). Note that the missing value is skipped rather than recorded, so for the row ,2 the list contains only 2.
And a recommendation: don't use action code in your grammar to collect the values. Instead, create a parse listener and attach it to your parser; its methods are triggered during parsing and can do all the background work. That keeps the grammar a lot cleaner and allows it to be used independently of the actual target language.

Antlr4 with C# - Parse text between matching symbols

So I'm stuck, and nothing that I've tried has been working. I am trying to parse text between matching symbols, in this case, equal signs. I've gotten this to work in a different parser I was testing, but have since deleted. I tried to replicate what I could and my attempt is shown in the code below.
QUESTION: How do I parse text between matching symbols, and/or what am I doing wrong with my current implementation.
Also, secondary to this, is there a way to get an output of all tokens found - their names and text values. I haven't searched for this yet so I'm sure I could find out, but I've been stuck on this first problem so I haven't been able to test options.
All of this is being run with Antlr4, Visual Studio 2013, and Windows 10.
TestGrammar.g4:
grammar TestGrammar;
start
: title EOF
;
title
: EQUALS title EQUALS
| EQUALS ANY EQUALS // using NOTEQUALS didn't work either
| EQUALS ' ' ANY ' ' EQUALS
;
EQUALS: '=' ;
ANY : .+ ;
NOTEQUALS: ~[\r\n=]+ ;
Program.cs:
class Program
{
private static void Main(string[] args)
{
string[] testStrings =
{
"= asdf =",
"== asdf ==",
"=== asdf ===",
"=asdf=",
"==asdf==",
"===asdf==="
};
foreach (string s in testStrings)
{
AntlrInputStream inputStream = new AntlrInputStream(s);
TestGrammarLexer wikiLexer = new TestGrammarLexer(inputStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(wikiLexer);
TestGrammarParser wikiParser = new TestGrammarParser(commonTokenStream);
TestGrammarParser.StartContext startContext = wikiParser.start();
TestGrammarVisitor visitor = new TestGrammarVisitor();
visitor.VisitStart(startContext);
}
}
}
TestGrammarVisitor.cs
class TestGrammarVisitor : TestGrammarBaseVisitor<object>
{
public override object VisitStart([NotNull] TestGrammarParser.StartContext context)
{
Console.WriteLine("TestGrammarVisitor VisitStart");
context.children.OfType<TerminalNodeImpl>().ToList().ForEach(child => Visit(child));
return null;
}
private void Visit(TerminalNodeImpl node)
{
Console.WriteLine(" Visit Symbol='{0}'", node.Symbol.Text);
Console.WriteLine();
}
}
Result:
line 1:0 no viable alternative at input '=== asdf ==='
TestGrammarVisitor VisitStart
Visit Symbol='<EOF>'

ANTLR for C# behaves illogically

Is there ANY reason why ANTLR would ignore tokens? Here's the relevant code; I'm calling var_assign directly.
LABEL
: LETTER (LETTER | DIGIT | '_')*;
fragment LOWER_CASE
: 'a'..'z';
fragment UPPER_CASE
: 'A'..'Z';
fragment LETTER
: UPPER_CASE | LOWER_CASE;
public var_assign
: LABEL ':=' expression -> ^( VARIABLE_ASSIGNMENT LABEL expression )
;
expression is the standard chain of expressions ending with tokens like NUMBER and LABEL (for variables), etc.
Now the issue is that I can just type "anything anything" and the parser will recognize that as an assignment.
ANTLRStringStream Input = new ANTLRStringStream(input_to_process);
processor.lexer.ConsoleGrammarLexer Lexer = new processor.lexer.ConsoleGrammarLexer(Input);
CommonTokenStream Tokens = new CommonTokenStream(Lexer);
processor.parser.ConsoleGrammarParser Parser = new processor.parser.ConsoleGrammarParser(Tokens);
CommonTree start_rule_tree = Parser.var_assign().Tree;
//view the tree to help debug
processor_output = start_rule_tree.ToStringTree();
If I type "x 5", I get (VARIABLE_ASSIGNMENT x 5)).
If I type "x:=5", I get (BLOCK (VARIABLE_ASSIGNMENT x 5))
If I type "x*5", I get (BLOCK ,1:1], resync=x*5>)
This happens even if I send a constant "string" directly into ANTLRStringStream.
I have managed to work around this by replacing ':=' with either (':=' | 'anythinghere') or (':=')*. But there are other odd behaviours.
I'm using CSharp3 as the language option and the newest .dlls.
What is going on? This makes absolutely no sense.
EDIT:
I've created a test grammar.
grammar testgrammar;
options {
language = CSharp3;
output = AST;
TokenLabelType = CommonToken;
ASTLabelType = CommonTree;
}
LABEL : 'a'..'z';
WS : ' ' {Skip();};
public start
: if_statement EOF!;
if_statement
: LABEL ':=' LABEL ->^(LABEL LABEL);
Typing "ff" produces (f f), typing f*f produces a run-time error, typing f:=f produces (f f). What. The. Hell.
The Java version gives:
/tmp $ java TestT
ff
line 1:2 no viable alternative at character '\n'
line 1:1 missing ':=' at 'f'
From:
InputStream is = System.in;
if ( inputFile!=null ) {
is = new FileInputStream(inputFile);
}
CharStream input = new ANTLRInputStream(is);
TLexer lex = new TLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lex);
TParser parser = new TParser(tokens);
parser.start();
Not sure what's up. CSharp3 should work too. I'm baffled. Start the debugger and set a break point. It's your only hope, Luke!

Antlr grammar generating invalid C# code

I am trying to develop a C# code generator using ANTLR and the StringTemplate library. AntlrWorks can generate the C# parser and lexer files without reporting any errors. However, the generated C# parser code is not valid and cannot be compiled in Visual Studio.
Can anyone see what is wrong with the following grammar?
grammar StrucadShape;
options {
language=CSharp3 ;
output=template;
}
@header {using System;}
@lexer::header {using System;}
@lexer::members {const int HIDDEN = Hidden;}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
public shapedef: parameters_def
-> class_temp( parameters={$parameters_def.st} )
;
parameters_def : (PARAMETERS LPAREN (p+=param) (COMMA (p+=param))* RPAREN )
-> parameter_list(params={$p})
;
param : IDENTIFIER ->Parameter_decl(p={$IDENTIFIER.text});
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
fragment EOL:'\r'|'\n'|'\r\n' ;
WS : (' '
| '\t'
| EOL)
{ $channel = HIDDEN; } ;
PARAMETERS: 'PARAMETERS';
COMMA : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
fragment LETTER :('A'..'Z' | 'a'..'z');
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER : (DIGIT)+ ;
FLOAT : (DIGIT)+'.'(DIGIT)+;
fragment DIGIT : '0'..'9' ;
This results in the following lines of code in the generated parameters_def() method:
List<object> list_p = null;
...snipped some code
if (list_p==null) list_p=new List<StringTemplate>();
This fails on the assignment of a List<StringTemplate> to a variable of type List<object>.
The grammar works before I add the string template rules. The error is introduced when I add the (p+=param) syntax required for list processing in the StringTemplate library.
I'll add my StringTemplate file for completeness, but I don't think this could be causing an error as it is not loaded until runtime.
group StrucadShape;
Parameter_decl(p)::= "public double <p> { get; set; }"
parameter_list(params) ::=
<<
start expressions
<params; separator="\n">
end
>>
class_temp( parameters)::=
<<
public class name
{
<parameters; separator="\n">
}
>>
A sample input string PARAMETERS( D,B,T)
Antlr Versions
Antlr3.Runtime 3.4.1.9004
AntlrWorks 1.4.3
I found a related issue on the Antlr mailing list here.
The solution was to add an ASTLabelType to the grammar options:
options {
language=CSharp3;
output=template;
ASTLabelType = StringTemplate;
}
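The compile error in the generated code comes from C# generics: List<T> is invariant, so a List<StringTemplate> is not a List<object> even though every StringTemplate is an object. A minimal sketch of that rule, using a hypothetical Template class as a stand-in for StringTemplate (no ANTLR types involved):

```csharp
using System;
using System.Collections.Generic;

static class InvarianceDemo
{
    // Hypothetical stand-in for StringTemplate, used only for illustration.
    public sealed class Template
    {
        public string Text = "";
    }

    public static int CountThroughObjectView()
    {
        // The next line is what the generated parser effectively does.
        // It does not compile, because List<T> is invariant in T:
        // List<object> list_p = new List<Template>();

        var typed = new List<Template> { new Template { Text = "p1" } };

        // IEnumerable<out T> is covariant for reference types, so this is allowed.
        IEnumerable<object> view = typed;

        int n = 0;
        foreach (object item in view) n++;
        return n;
    }
}
```

Setting ASTLabelType = StringTemplate presumably makes the generator declare list_p with the matching element type, so the declaration and the assignment agree.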
