Antlr4 with C# - Parse text between matching symbols

So I'm stuck, and nothing that I've tried has been working. I am trying to parse text between matching symbols, in this case equals signs. I got this to work in a different parser I was testing, but I have since deleted it. I tried to replicate what I could, and my attempt is shown in the code below.
QUESTION: How do I parse text between matching symbols, and/or what am I doing wrong with my current implementation?
Also, secondary to this, is there a way to get an output of all tokens found, with their names and text values? I haven't searched for this yet, so I'm sure I could find out, but I've been stuck on this first problem and haven't been able to test options.
All of this is being run with Antlr4, Visual Studio 2013, and Windows 10.
TestGrammar.g4:
grammar TestGrammar;
start
    : title EOF
    ;
title
    : EQUALS title EQUALS
    | EQUALS ANY EQUALS // using NOTEQUALS didn't work either
    | EQUALS ' ' ANY ' ' EQUALS
    ;
EQUALS: '=' ;
ANY : .+ ;
NOTEQUALS: ~[\r\n=]+ ;
Program.cs:
class Program
{
    private static void Main(string[] args)
    {
        string[] testStrings =
        {
            "= asdf =",
            "== asdf ==",
            "=== asdf ===",
            "=asdf=",
            "==asdf==",
            "===asdf==="
        };
        foreach (string s in testStrings)
        {
            AntlrInputStream inputStream = new AntlrInputStream(s);
            TestGrammarLexer wikiLexer = new TestGrammarLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(wikiLexer);
            TestGrammarParser wikiParser = new TestGrammarParser(commonTokenStream);
            TestGrammarParser.StartContext startContext = wikiParser.start();
            TestGrammarVisitor visitor = new TestGrammarVisitor();
            visitor.VisitStart(startContext);
        }
    }
}
TestGrammarVisitor.cs
class TestGrammarVisitor : TestGrammarBaseVisitor<object>
{
    public override object VisitStart([NotNull] TestGrammarParser.StartContext context)
    {
        Console.WriteLine("TestGrammarVisitor VisitStart");
        context.children.OfType<TerminalNodeImpl>().ToList().ForEach(child => Visit(child));
        return null;
    }
    private void Visit(TerminalNodeImpl node)
    {
        Console.WriteLine(" Visit Symbol='{0}'", node.Symbol.Text);
        Console.WriteLine();
    }
}
Result:
line 1:0 no viable alternative at input '=== asdf ==='
TestGrammarVisitor VisitStart
Visit Symbol='<EOF>'

Related

ANTLR visitor unit test succeeds on one rule but fails on another

I'm trying to define unit tests for my ANTLR parser. The unit test successfully extracts the value of the first expr, but fails to extract the value of the first idEscape. This suggests that I am misunderstanding something core to the way in which the parser works, or the way in which visitors work.
I'm writing a parser for calculations in FileMaker Pro. In FileMaker, it is technically valid for an identifier to contain whitespace as well as operators and other characters which would otherwise have a functional purpose in the calculation engine. In those cases, the identifier is escaped by surrounding it with a '${' and '}'. While the parser successfully identifies '${abcdef + 123}' as a valid expression, I still need to be able to identify 'abcdef + 123' as a valid identifier. When I request the value of the first idEscape in a second unit test, I get an empty string.
If relevant, I'm using ANTLR4.Runtime.Standard.
What am I doing wrong? Any assistance in resolving my misunderstanding would be greatly appreciated. Thank you.
Grammar
grammar FileMakerCalc;
// PARSER RULES
calculation : expr;
expr : idEscExpr;
idEscExpr : LEFTESCAPE idEscape RIGHTESCAPE;
idEscape : (WORD|WS|OPERATOR|INT|FLOAT)*?;
// LEXER RULES
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
LEFTESCAPE : '${';
RIGHTESCAPE : '}';
OPERATOR : ('+'|'-'|'*'|'/'|'&'|'^'|'='|'≠'|'<>'|'>'|'<'|'≤'|'<='|'≥'|'>=' );
WORD : (LOWERCASE | UPPERCASE)+ ;
FLOAT : [0-9]+ '.' [0-9]+;
INT : [0-9]+ ;
NEWLINE : [\r\n]+ ;
WS : [ \t];
Visitor
public class FileMakerCalcVisitor : FileMakerCalcBaseVisitor<String>
{
    public override string VisitExpr(FileMakerCalcParser.ExprContext context)
    {
        return context.GetText();
    }
    public override string VisitIdEscape(FileMakerCalcParser.IdEscapeContext context)
    {
        return context.GetText();
    }
}
Unit Tests
namespace Antler_Tests
{
    [TestFixture()]
    public class ParserTest
    {
        private FileMakerCalcParser Setup(string text)
        {
            AntlrInputStream inputStream = new AntlrInputStream(text);
            FileMakerCalcLexer lexer = new FileMakerCalcLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
            FileMakerCalcParser parser = new FileMakerCalcParser(commonTokenStream);
            return parser;
        }
        // This one successfully pulls '${abcdef + 123}' as the text of the first expr
        [Test()]
        public void EscapedID_CheckForExpr()
        {
            FileMakerCalcParser parser = Setup("${abcdef + 123}");
            FileMakerCalcParser.ExprContext context = parser.expr();
            FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
            var testVal = visitor.VisitExpr(context);
            Assert.AreEqual("${abcdef + 123}", testVal, testVal);
        }
        // This one does NOT successfully pull 'abcdef + 123' as the text of the first idEscape
        [Test()]
        public void EscapedID()
        {
            FileMakerCalcParser parser = Setup("${abcdef + 123}");
            FileMakerCalcParser.IdEscapeContext context = parser.idEscape();
            FileMakerCalcVisitor visitor = new FileMakerCalcVisitor();
            var testVal = visitor.VisitIdEscape(context);
            Assert.AreEqual("abcdef + 123", testVal);
        }
    }
}

${abcdef + 123} is not a valid idEscape because it starts with ${ and ends with }, neither of which the idEscape rule accepts. The way you've defined it, idEscape only matches the stuff between ${ and }, and idEscExpr is the rule that matches the whole thing.
So you'll want your test to either invoke the idEscExpr rule instead of idEscape, or change the string you're parsing to abcdef + 123 (or have one test for each).

Superpower: match a string with parser only if it begins a line

When parsing in superpower, how to match a string only if it is the first thing in a line?
For example, I need to match the A colon in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n"

Using your example here, I would change your ActorParser and NodeParser definitions to this:
public readonly static TokenListParser<Tokens, Node> ActorParser =
    from name in NameParser
    from colon in Token.EqualTo(Tokens.Colon)
    from text in TextParser
    select new Node {
        Actor = name + colon.ToStringValue(),
        Text = text
    };
public readonly static TokenListParser<Tokens, Node> NodeParser =
    from node in ActorParser.Try()
        .Or(TextParser.Select(text => new Node { Text = text }))
    select node;
I feel like there is a bug in Superpower: I'm not sure why I had to put a Try() on the first parser in NodeParser when chaining it with an Or(), but it threw an error when I left it out.
Also, your validation when checking input[1] is incorrect (probably just a copy-paste issue). It should be checking against "Goodbye A: Hello" and not "Hello A: Goodbye".

Unless RegexOptions.Multiline is set, ^ matches the beginning of a string regardless of whether it is at the beginning of a line.
You can probably use inline (?m) to turn on multiline:
static TextParser<Unit> Actor { get; } =
    from start in Span.Regex(@"(?m)^[A-Za-z][A-Za-z0-9_]+:")
    select Unit.Value;
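As a rough illustration of the anchoring behaviour described above, here is a minimal sketch that uses only System.Text.RegularExpressions, not Superpower. The class and method names are made up for the example, and the pattern uses `*` rather than `+` so that a single-letter name such as "A" can match; that change is an assumption about what the question needs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineStartDemo
{
    // Hypothetical helper: reports whether a "Name:" label sits at a position
    // where ^ is allowed to match under the chosen options.
    public static bool HasActorAtLineStart(string input, bool multiline)
    {
        RegexOptions options = multiline ? RegexOptions.Multiline : RegexOptions.None;
        return Regex.IsMatch(input, @"^[A-Za-z][A-Za-z0-9_]*:", options);
    }

    public static void Main()
    {
        string text = "Goodbye\nA: Hello";

        // Without Multiline, ^ only matches at the very start of the string.
        Console.WriteLine(HasActorAtLineStart(text, multiline: false)); // False

        // With Multiline, ^ also matches right after each newline.
        Console.WriteLine(HasActorAtLineStart(text, multiline: true));  // True

        // A label in the middle of a line is never at a line start.
        Console.WriteLine(HasActorAtLineStart("Goodbye A: Hello", multiline: true)); // False
    }
}
```

This only demonstrates how .NET's regex engine treats `^`; whether Span.Regex applies the pattern the same way inside a Superpower parser is not verified here.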

I have actually done something similar, but I do not use a Tokenizer.
private static string _keyPlaceholder;
private static TextParser<MyClass> Actor { get; } =
    Span.Regex("^[A-Za-z][A-Za-z0-9_]*:")
        .Then(x =>
        {
            _keyPlaceholder = x.ToStringValue();
            return Character.AnyChar.Many();
        })
        .Select(value => new MyClass { Key = _keyPlaceholder, Value = new string(value) });
I have not tested this, just wrote it out by memory. The above parser should have the following:
myClass.Key = "A:"
myClass.Value = " Hello Goodbye"

Parsing csv using ANTLR in c#

I have created the following grammar in ANTLR for parsing a CSV file.
grammar CSV;
file returns [List<List<string>> data]
@init {$data = new List<List<string>>();}
    : (row {$data.Add($row.list);})+ EOF
    ;
row returns [List<string> list]
@init {$list = new List<string>();}
    : a=value {
        $list.Add($a.val);
      }
      (Comma b=value {
        $list.Add($b.val);}
      )*
      (LineBreak | EOF)
    ;
value returns [string val]
    : SimpleValue {$val = $SimpleValue.text;}
    | QuotedValue
      {
        System.Console.WriteLine($val);
        $val = $QuotedValue.text;
        // drop the opening and closing quote
        $val = $val.Substring(1, $val.Length-2);
        $val = $val.Replace("\"\"", "\"");
      }
    ;
Comma :
    ( ' '* ',' ' '*);
LineBreak :
    '\r'? '\n';
SimpleValue
    : ~[,\r\n"]+
    ;
QuotedValue
    : '"' ('""' | ~'"')* '"'
    ;
The above grammar parses the following CSV file without error:
a,b
1,2
3,4
but when I parse the following CSV file, it reports this error:
a,b
,2
3,4
line 2:0 extraneous input ',' expecting {<EOF>, SimpleValue, QuotedValue}
Can someone guide me on how to solve this problem?
Main Program
public List<List<string>> Parse()
{
    string csvData = string.Empty;
    if (string.IsNullOrEmpty(_path))
        throw new ArgumentException("Path can not be empty");
    try
    {
        csvData = File.ReadAllText(_path);
    }
    catch (Exception)
    {
        throw new FileNotFoundException(string.Format("{0} not found", _path));
    }
    // create an instance of the lexer
    CSVLexer lexer = new CSVLexer(new AntlrInputStream(csvData));
    // wrap a token-stream around the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    // create the parser
    CSVParser parser = new CSVParser(tokens);
    // invoke the entry point of our grammar
    _results = parser.file().data;
    return _results;
}
Update
Following Mike Lischke's answer, I have updated the row rule as shown below. Now I am not getting any error.
row returns [List<string> list]
@init {$list = new List<string>();}
    : Comma? a=value {
        $list.Add($a.val);
      }
      (Comma b=value {
        $list.Add($b.val);
      }
      )*
      (LineBreak | EOF)
    ;

Obviously your row rule is not flexible enough to handle missing values. You should use something like this instead:
row: Comma? value (Comma value)*;
which adds the possibility of a leading comma (in effect, a missing first value).
And a recommendation: don't use action code in your grammar to collect the values. Instead, create a parse listener and assign it to your parser; its methods are triggered during parsing and can do all the background work. That keeps the grammar a lot cleaner and allows you to use it independently of the actual target language.

match names with unicode chars

Can somebody help me match strings such as "BEREŽALINS" and "GŽIBOVSKIS" in C# and JS? I've tried
\A\w+\z
(?>\P{M}\p{M}*)+
^[-a-zA-Z\p{L}']{2,50}$
and so on, but nothing works.
Thanks

Just wrote a little console app to do it:
private static void Main(string[] args) {
    var list = new List<string> {
        "BEREŽALINS",
        "GŽIBOVSKIS",
        "TEST"
    };
    var pat = new Regex(@"[^\u0000-\u007F]");
    foreach (var name in list) {
        Console.WriteLine(string.Concat(name, " = ", pat.IsMatch(name) ? "Match" : "Not a Match"));
    }
    Console.ReadLine();
}
Works with the two examples you gave me, but not sure about all scenarios :)

Can you give an example of what it should not match?
Reading your question, it sounds like you just want to match any string (perhaps one per line). If that's the case, just use
^.*$
In C# this becomes
foundMatch = Regex.IsMatch(SubjectString, "^.*$", RegexOptions.Multiline);
And in javascript this is
if (/^.*$/m.test(subject)) {
    // Successful match
} else {
    // Match attempt failed
}
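If the goal is to accept names made of letters from any script, one option is to match on Unicode categories instead of ASCII ranges. The following is a minimal sketch under that assumption; the class name, the helper, and the decision to also allow apostrophes and hyphens are illustrative choices, not requirements stated in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameMatchDemo
{
    // One or more units, each either a letter (\p{L}) optionally followed by
    // combining marks (\p{M}), or an apostrophe or hyphen.
    private static readonly Regex NamePattern =
        new Regex(@"^(?:\p{L}\p{M}*|['-])+$");

    // Hypothetical helper used for the demonstration below.
    public static bool IsName(string s)
    {
        return NamePattern.IsMatch(s);
    }

    public static void Main()
    {
        string[] samples = { "BEREŽALINS", "GŽIBOVSKIS", "TEST", "R2D2" };
        foreach (string name in samples)
        {
            // The first three consist only of letters; "R2D2" contains digits.
            Console.WriteLine(name + " = " + (IsName(name) ? "Match" : "Not a Match"));
        }
    }
}
```

The `\p{M}*` part is there so that a letter stored as a base character plus a combining caron is still accepted. In JavaScript, `\p{L}` and `\p{M}` are only recognised when the regex is created with the `u` flag.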

C#: How should I convert the following?

Using C#, how would you go about converting a String which also contains newline characters and tabs (4 spaces) from the following format
A {
    B {
        C = D
        E = F
    }
    G = H
}
into the following
A.B.C = D
A.B.E = F
A.G = H
Note that A to H are just placeholders for String values which will not contain '{', '}', or '=' characters. The above is just an example; the actual String to convert can be nested arbitrarily deep and can contain any number of "? = ?" lines.

You probably want to parse this, and then generate the desired format. Trying to do regex transforms isn't going to get you anywhere.
Tokenize the string, then go through the tokens and build up a syntax tree. Then walk the tree, generating the output.
Alternatively, push each "namespace" onto a stack as you encounter it, and pop it off when you encounter the close brace.

Not very pretty, but here's an implementation that uses a stack:
static string Rewrite(string input)
{
    var builder = new StringBuilder();
    var stack = new Stack<string>();
    string[] lines = input.Split('\n');
    foreach (var s in lines)
    {
        if (s.Contains("{") || s.Contains("="))
        {
            stack.Push(s.Replace("{", String.Empty).Trim());
        }
        if (s.Contains("="))
        {
            builder.Append(string.Join(".", stack.Reverse().ToArray()));
            builder.Append(Environment.NewLine);
        }
        if (s.Contains("}") || s.Contains("="))
        {
            stack.Pop();
        }
    }
    return builder.ToString();
}

Pseudocode for the stack method:
function do_processing(Stack stack)
    add this namespace to the stack;
    for each sub namespace of the current namespace
        do_processing(sub namespace)
    end
    for each variable declaration in the current namespace
        make_variable_declaration(stack, variable declaration)
    end
end
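The recursive idea in the pseudocode above might look roughly like this in C#. This is a sketch under a few assumptions: every block opener, assignment, and closing brace sits on its own line, as in the question's sample, and the class and method names are invented for the example. Unlike the pseudocode, it emits assignments in source order rather than after all sub-namespaces, and it carries the dotted prefix as a string argument instead of an explicit stack.

```csharp
using System;
using System.Collections.Generic;

public static class NestedFlattener
{
    // Entry point: returns one "A.B.C = D" style line per assignment.
    public static List<string> Flatten(string input)
    {
        string[] lines = input.Split('\n');
        var output = new List<string>();
        int index = 0;
        ProcessBlock(lines, ref index, "", output);
        return output;
    }

    // Consumes lines until the closing brace of the current block
    // (or the end of the input for the outermost call).
    private static void ProcessBlock(string[] lines, ref int index,
                                     string prefix, List<string> output)
    {
        while (index < lines.Length)
        {
            string line = lines[index++].Trim();
            if (line.EndsWith("{", StringComparison.Ordinal))
            {
                // A nested namespace: recurse with an extended prefix.
                string name = line.TrimEnd('{').Trim();
                ProcessBlock(lines, ref index, prefix + name + ".", output);
            }
            else if (line == "}")
            {
                return; // end of the current namespace
            }
            else if (line.Contains("="))
            {
                output.Add(prefix + line);
            }
            // Blank or unrecognised lines are skipped.
        }
    }

    public static void Main()
    {
        string input = "A {\n    B {\n        C = D\n        E = F\n    }\n    G = H\n}";
        foreach (string line in Flatten(input))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the sample in the question this prints A.B.C = D, A.B.E = F and A.G = H, one per line. The recursion depth equals the nesting depth, so extremely deep input would be better served by the explicit-stack version.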

You can do this with regular expressions, it's just not the most efficient way to do it as you need to scan the string multiple times.
while (s.Contains("{")) {
    s = Regex.Replace(s, @"([^\s{}]+)\s*\{([^{}]+)\}", match => {
        return Regex.Replace(match.Groups[2].Value,
            @"\s*(.*\n)",
            match.Groups[1].Value + ".$1");
    });
}
Result:
A.B.C = D
A.B.E = F
A.G = H
I still think using a parser and/or stack based approach is the best way to do this, but I just thought I'd offer an alternative.
