I have a very simple grammar that (I think) should only allow additions of two elements like 1+1 or 2+3
grammar dumbCalculator;
expression: simple_add EOF;
simple_add: INT ADD INT;
INT:('0'..'9');
ADD : '+';
I generate my C# classes using the official ANTLR jar file
java -jar "antlr-4.9-complete.jar" C:\Users\Me\source\repos\ConsoleApp2\ConsoleApp2\dumbCalculator.g4 -o C:\Users\Me\source\repos\ConsoleApp2\ConsoleApp2\Dumb -Dlanguage=CSharp -no-listener -visitor
No matter what I try, the parser keeps adding the trailing elements although they shouldn't be allowed.
For example "1+1+1" gets parsed properly as an AST :
expression
simple_add
1
+
1
+
1
Although I specifically wrote that expression must be simple_add then EOF and simple_add is just INT ADD INT. I have no idea why the rest is being accepted, I expect ANTLR to throw an exception on this.
This is how I test my parser :
var inputStream = new AntlrInputStream("1+1+1");
var lexer = new dumbCalculatorLexer(inputStream);
lexer.RemoveErrorListeners();
lexer.AddErrorListener(new ThrowExceptionErrorListener());
var commonTokenStream = new CommonTokenStream(lexer);
var parser = new dumbCalculatorParser(commonTokenStream);
parser.RemoveErrorListeners();
parser.AddErrorListener(new ThrowExceptionErrorListener());
var ex = parser.expression();
ExploreAST(ex);
Why is the rest of the output being accepted ?
Classical scenario, I find my error 5 minutes after posting on Stack Overflow.
For anyone encountering a similar scenario, this happened because I did not explicitly set the ErrorHandler on my parser.
Naively, I expected all the AddErrorListener to handle the errors, but somehow there's a specific thing to do if you need the errors to be handled before visiting the tree.
I needed to add
parser.ErrorHandler = new BailErrorStrategy();
After this, I indeed got the exceptions on wrong input strings.
This is probably not the right thing to do, I'll let someone who knows ANTLR better to comment on this.
Related
I need to figure out a good way using C# to parse an XML file for (NULL) and remove it from the tags and replace it with the word BAD.
For example:
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
should be replaced with
<GC5_BAD DIRTY="False"></GC5_BAD>
Part of the problem is I have no control over the original XML, I just need to fix it once I receive it. The second problem is that the (NULL) can appear in zero, one, or many tags. It appears to be an issue with users filling in additional fields or not. So I might get
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
or
<MH_OTHSECTION_TXT_(NULL) DIRTY="False"></MH_OTHSECTION_TXT_(NULL)>
or
<LCDATA_(NULL) DIRTY="False"></LCDATA_(NULL)>
I am a newbie to C# and programming.
EDIT:
So I have come up with the following function that while not pretty, so far work.
public static string CleanInvalidXmlChars(string fileText)
{
List<char> charsToSubstitute = new List<char>();
charsToSubstitute.Add((char)0x19);
charsToSubstitute.Add((char)0x1C);
charsToSubstitute.Add((char)0x1D);
foreach (char c in charsToSubstitute)
fileText = fileText.Replace(Convert.ToString(c), string.Empty);
StringBuilder b = new StringBuilder(fileText);
b.Replace("", string.Empty);
b.Replace("", string.Empty);
b.Replace("<(null)", "<BAD");
b.Replace("(null)>", "BAD>");
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String result = nullMatch.Replace(b.ToString(), "<$1_BAD$2>");
result = result.Replace("(NULL)", "BAD");
return result;
}
I have only been able to find 6 or 7 bad XML files to test this code on, but it has worked on each of them and not removed good data. I appreciate the feedback and your time.
In general, regular expressions are not the right way of handling XML files. There's a range of solutions to handle XML files correctly - you can read up on System.Xml.Linq for a good start. If you're a newbie, it's certainly something you should learn at some point. As Ed Plunkett pointed out in the comments, though, your XML is not actually XML: ( and ) characters are not allowed in XML element names.
Since you will have to do it as an operation on a string, Corak's comment to use
contentOfXml.Replace("(NULL)", "BAD");
may be a good idea, but will break if any elements can contain the string (NULL) as anything other than their name.
If you want a regex approach, this might work decently, but I'm not sure if it's not missing any edge cases:
var regex = new Regex(#"(<\/?[^_]*_)\(NULL\)([^>]*>)");
var result = regex.Replace(contentOfXml, "$1BAD$2");
Will it be suitable for you to read this XML as a string and perform a regex replacement? Like:
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String processedXmlString = nullMatch.Replace(originalXmlString, "<$1_BAD$2>");
I am building a simple command-style grammar using Sprache. I am trying to find out if there's a way to get better error reporting when missing a closing character (e.g. ], ), }).
If a closing character is missing my grammar correctly reports an error. However, the messaging leads to difficulty in appreciating the real problem. Given the following string to be parsed:
sum 10 [multiply 5 4
Sprache reports the following error:
Sprache.ParseException : Parsing failure: unexpected '['; expected newline or end of input (Line 1, Column 8); recently consumed: sum 10
What appears to be happening is that the parser tries to match on my CommandSubstitution and fails to find a closing ']'. This causes the parser to fallback and try an alternate. As it can't match any more Things for the command, it tries to match on the CommandTerminator. As that can't match a '[' it reports the error complaining about the expected newline or end of input instead of saying, "Hey, buddy, you didn't match your brace!"
Are there any workarounds or recommendations of how the grammar could be improved to make the reporting better using a parsing library like Sprache?
public static readonly Parser<Word> Word = Parse.Char(IsWordChar, "word character").AtLeastOnce().Text()
.Select(str => new Word(str));
public static readonly Parser<CommandSubstitution> CommandSubstitution = from open in Parse.Char('[').Once()
from body in Parse.Ref(() => Things)
from close in Parse.Char(']').Once()
select new CommandSubstitution(body.ToList());
public static readonly Parser<Thing> Thing = CommandSubstitution.Or<Thing>(Word);
public static readonly Parser<IEnumerable<Thing>> Things = (from ignoreBefore in WordSeparator.Optional()
from thing in Thing
from ignoreAfter in WordSeparator.Optional()
select thing).Many();
public static readonly Parser<IEnumerable<Thing>> Command = from things in Things
from terminator in CommandTerminator
select things;
It sounds like the overall problem is that Sprache is failing, trying alternatives, and failing again, when it ought to just give up after the first failure instead.
You're defining the Things parser using the Parse.Many extension method. The thing about the Parse.Many parser is that it always succeeds regardless of whether its inner parser succeeds or fails. If the inner parser fails, Parse.Many simply assumes that there is no more input that it needs to consume.
This is what seems to be happening here. First, Parse.Many consumes the fragment "sum 10 ". Then it tries to parse more input, but fails. Since it has failed to parse any more input, it assumes that there is no more input it needs to consume. But then an error results, because the fragment [multiply 5 4 has not been consumed.
To fix this, use Parse.XMany instead of Parse.Many. If the inner parser for Parse.XMany fails after consuming at least one character, then Parse.XMany will give up immediately and report the failure.
I'm now writing C# grammar using Antlr 3 based on this grammar file.
But, I found some definitions I can't understand.
NUMBER:
Decimal_digits INTEGER_TYPE_SUFFIX? ;
// For the rare case where 0.ToString() etc is used.
GooBall
#after
{
CommonToken int_literal = new CommonToken(NUMBER, $dil.text);
CommonToken dot = new CommonToken(DOT, ".");
CommonToken iden = new CommonToken(IDENTIFIER, $s.text);
Emit(int_literal);
Emit(dot);
Emit(iden);
Console.Error.WriteLine("\tFound GooBall {0}", $text);
}
:
dil = Decimal_integer_literal d = '.' s=GooBallIdentifier
;
fragment GooBallIdentifier
: IdentifierStart IdentifierPart* ;
The above fragments contain the definition of 'GooBall'.
I have some questions about this definition.
Why is GooBall needed?
Why does this grammar define lexer rules to parse '0.ToString()' instead of parser rules?
It's because that's a valid expression that's not handled by any of the other rules - I guess you'd call it something like an anonymous object, for lack of a better term. Similar to "hello world".ToUpper(). Normally method calls are only valid on variable identifiers or return values ala GetThing().Method(), or otherwise bare.
Sorry. I found the reason from the official FAQ pages.
Now if you want to add '..' range operator so 1..10 makes sense, ANTLR has trouble distinguishing 1. (start of the range) from 1. the float without backtracking. So, match '1..' in NUM_FLOAT and just emit two non-float tokens:
How do i Evaluate the following expression from a string to a answer as an integer?
Expression:
√(7+74) + √(30+6)
Do i have to evaluate each one of the parameters like Sqroot(7+74) and Sqroot(30+6) or is it possible to evaluate the whole expression. Any ideas?
If this string is user-supplied (or anyway available only at runtime) what you need is a mathematical expressions parser (maybe replacing the √ character in the text with sqrt or whatever the parser likes before feeding the string to it). There are many free ones available on the net, personally I used info.lundin.math several times without any problem.
Quick example for your problem:
info.lundin.Math.ExpressionParser parser = new info.lundin.Math.ExpressionParser();
double result = parser.Parse("sqrt(7+74)+sqrt(30+6)", null);
(on the site you can find more complex examples with e.g. parameters that can be specified programmatically)
You can use NCalc for this purpose
NCalc.Expression expr = new NCalc.Expression("Sqrt(7+74) + Sqrt(30+6)");
object result = expr.Evaluate();
I am trying to parse out some information from Google's geocoding API but I am having a little trouble with efficiently getting the data out of the xml. See link for example
All I really care about is getting the short_name from address_component where the type is administrative_area_level_1 and the long_name from administrative_area_level_2
However with my test program my XPath query returns no results for both queries.
public static void Main(string[] args)
{
using(WebClient webclient = new WebClient())
{
webclient.Proxy = null;
string locationXml = webclient.DownloadString("http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
using(var reader = new StringReader(locationXml))
{
var doc = new XPathDocument(reader);
var nav = doc.CreateNavigator();
Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_1]/short_name").InnerXml);
Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_2]/long_name").InnerXml);
}
}
}
Can anyone help me find what I am doing wrong, or recommending a better way?
You need to put the value of the node you're looking for in quotes:
".../address_component[type='administrative_area_level_1']/short_name"
↑ ↑
I'd definitely recommend using LINQ to XML instead of XPathNavigator. It makes XML querying a breeze, in my experience. In this case I'm not sure exactly what's wrong... but I'll come up with a LINQ to XML snippet instead.
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;
class Test
{
public static void Main(string[] args)
{
using(WebClient webclient = new WebClient())
{
webclient.Proxy = null;
string locationXml = webclient.DownloadString
("http://maps.google.com/maps/api/geocode/xml?address=1600"
+ "+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
XElement root = XElement.Parse(locationXml);
XElement result = root.Element("result");
Console.WriteLine(result.Elements("address_component")
.Where(x => (string) x.Element("type") ==
"administrative_area_level_1")
.Select(x => x.Element("short_name").Value)
.First());
Console.WriteLine(result.Elements("address_component")
.Where(x => (string) x.Element("type") ==
"administrative_area_level_2")
.Select(x => x.Element("long_name").Value)
.First());
}
}
}
Now this is more code1... but I personally find it easier to get right than XPath, because the compiler is helping me more.
EDIT: I feel it's worth going into a little more detail about why I generally prefer code like this over using XPath, even though it's clearly longer.
When you use XPath within a C# program, you have two different languages - but only one is in control (C#). XPath is relegated to the realm of strings: Visual Studio doesn't give an XPath expression any special handling; it doesn't understand that it's meant to be an XPath expression, so it can't help you. It's not that Visual Studio doesn't know about XPath; as Dimitre points out, it's perfectly capable of spotting errors if you're editing an XSLT file, just not a C# file.
This is the case whenever you have one language embedded within another and the tool is unaware of it. Common examples are:
SQL
Regular expressions
HTML
XPath
When code is presented as data within another language, the secondary language loses a lot of its tooling benefits.
While you can context switch all over the place, pulling out the XPath (or SQL, or regular expressions etc) into their own tooling (possibly within the same actual program, but in a separate file or window) I find this makes for harder-to-read code in the long run. If code were only ever written and never read afterwards, that might be okay - but you do need to be able to read code afterwards, and I personally believe the readability suffers when this happens.
The LINQ to XML version above only ever uses strings for pure data - the names of elements etc - and uses code (method calls) to represent actions such as "find elements with a given name" or "apply this filter". That's more idiomatic C# code, in my view.
Obviously others don't share this viewpoint, but I thought it worth expanding on to show where I'm coming from.
Note that this isn't a hard and fast rule of course... in some cases XPath, regular expressions etc are the best solution. In this case, I'd prefer the LINQ to XML, that's all.
1 Of course I could have kept each Console.WriteLine call on a single line, but I don't like posting code with horizontal scrollbars on SO. Note that writing the correct XPath version with the same indentation as the above and avoiding scrolling is still pretty nasty:
Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/" +
"address_component[type='administrative_area_level_1']" +
"/short_name").InnerXml);
In general, long lines work a lot better in Visual Studio than they do on Stack Overflow...
I would recommend just typing the XPath expression as part of an XSLT file in Visual Studio. You'll get error messages "as you type" -- this is an excellent XML/XSLT/XPath editor.
For example, I am typing:
<xsl:apply-templates select="#* | node() x"/>
and immediately get in the Error List window the following error:
Error 9 Expected end of the expression, found 'x'. #* | node() -->x<--
XSLTFile1.xslt 9 14 Miscellaneous Files
Only when the XPath expression does not raise any errors (I might also test that it selects the intended nodes, too), would I put this expression into my C# code.
This ensures that I will have no XPath -- syntax and semantic -- errors when I run the C# program.
dtb's response is accurate. I wanted to add that you can use xpath testing tools like the link below to help find the correct xpath:
http://www.bit-101.com/xpath/
string url = #"http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false";
string value = "administrative_area_level_1";
using(WebClient client = new WebClient())
{
string wcResult = client.DownloadString(url);
XDocument xDoc = XDocument.Parse(wcResult);
var result = xDoc.Descendants("address_component")
.Where(p=>p.Descendants("type")
.Any(q=>q.Value.Contains(value))
);
}
The result is an enumeration of "address_component"s that have at least one "type" node that has contains the value you're searching for. The result of the query above is an XElement that contains the following data.
<address_component>
<long_name>California</long_name>
<short_name>CA</short_name>
<type>administrative_area_level_1</type>
<type>political</type>
</address_component>
I would really recommend spending a little time learning LINQ in general because its very useful for manipulating and querying in-memory objects, querying databases and tends to be easier than using XPath when working with XML. My favorite site to reference is http://www.hookedonlinq.com/