Parser Generator: How to use GPLEX and GPPG together? - c#

After looking through posts for good C# parser generators, I stumbled across GPLEX and GPPG. I'd like to use GPLEX to generate tokens for GPPG to parse and create a tree (similar to the lex/yacc relationship). However, I can't seem to find an example on how these two interact together. With lex/yacc, lex returns tokens that are defined by yacc, and can store values in yylval. How is this done in GPLEX/GPPG (it is missing from their documentation)?
Attached is the lex code I would like to convert over to GPLEX:
%{
#include <stdio.h>
#include "y.tab.h"
%}
%%
[Oo][Rr] return OR;
[Aa][Nn][Dd] return AND;
[Nn][Oo][Tt] return NOT;
[A-Za-z][A-Za-z0-9_]* yylval=yytext; return ID;
%%
Thanks!
Andrew

First, add a reference to QUT.ShiftReduceParser.dll to your project; it is provided in the GPLEX download package. Run gplex.exe over the .lex file and gppg.exe over the .y file to generate the two C# source files (gppg also has a /gplex option that makes the generated parser expect a GPLEX-built scanner), then compile them into your project.
Sample code for the main program:
using System;
using System.IO;
using QUT.Gppg;

namespace NCParser
{
    class Program
    {
        static void Main(string[] args)
        {
            string pathTXT = @"C:\temp\testFile.txt";
            FileStream file = new FileStream(pathTXT, FileMode.Open);

            // The namespace and the class share a name in this sample
            // ("Scanner.Scanner", "Parser.Parser"), so the types are qualified here.
            Scanner.Scanner scanner = new Scanner.Scanner();
            scanner.SetSource(file, 0);
            Parser.Parser parser = new Parser.Parser(scanner);
            parser.Parse();   // run the parse; grammar actions fire as rules reduce
        }
    }
}
Sample code for the GPLEX input file:
%using Parser;           //import the namespace of the generated parser class
%namespace Scanner       //names the namespace of the generated scanner class
%visibility public       //visibility of the types "Tokens", "ScanBase" and "Scanner"
%scannertype Scanner     //names the scanner class "Scanner"
%scanbasetype ScanBase   //names the scan-base class "ScanBase"
%tokentype Tokens        //names the token enumeration "Tokens"
%option codePage:65001 out:Scanner.cs /* see the GPLEX documentation for further options */
%{ //user-specified code is copied into the output file
%}
OR         [Oo][Rr]
AND        [Aa][Nn][Dd]
Identifier [A-Za-z][A-Za-z0-9_]*
%% //Rules section
%{ //user code that will be executed before getting the next token
%}
{OR}         { return (int)Tokens.kwOR; }
{AND}        { return (int)Tokens.kwAND; }
{Identifier} { yylval = yytext; return (int)Tokens.ID; }
%% //User-code section
Note that the assignment yylval = yytext above only compiles if the parser's semantic value type is string. GPPG's default value type is int, so declare the value type accordingly in the GPPG file (GPPG supports %union as well as a directive for naming the semantic value type; see the GPPG documentation). This is the GPLEX/GPPG counterpart of lex's yylval mechanism.
Sample code for the GPPG input file:
%using Scanner           //import the namespace of the scanner class
%output=Parser.cs        //names the output file
%namespace Parser        //names the namespace of the parser class
%parsertype Parser       //names the parser class "Parser"
%scanbasetype ScanBase   //names the scan-base class "ScanBase"
%tokentype Tokens        //names the token enumeration "Tokens"
%token kwAND "AND", kwOR "OR"   //the tokens received from GPLEX
%token ID
%% //Grammar rules section
program    : /* nothing */
           | Statements
           ;
Statements : EXPR "AND" EXPR
           | EXPR "OR" EXPR
           ;
EXPR       : ID
           ;
%% //User-code section
// Don't forget to declare the parser constructor; the scanner type is
// qualified because the namespace and the class are both named "Scanner".
public Parser(Scanner.Scanner scnr) : base(scnr) { }
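To see how the two generated classes talk to each other: the GPPG runtime (QUT.Gppg) defines a scanner base class with a yylex() method and a yylval field, and the parser pulls tokens through that interface. Below is a hedged, self-contained sketch of that handshake, not the verbatim QUT.Gppg source (the real base classes are generic over the semantic-value and location types), and the names ScannerBase, MiniScanner and Tok are made up for illustration:

using System;
using System.Collections.Generic;

enum Tok { EOF = 0, AND = 1, OR = 2, ID = 3 }

// Stand-in for the role played by the GPLEX-generated ScanBase/Scanner pair.
abstract class ScannerBase
{
    public string yylval;            // semantic value, like lex's yylval
    public abstract int yylex();     // returns the next token id, 0 (EOF) at end of input
}

class MiniScanner : ScannerBase
{
    private readonly Queue<string> words;

    public MiniScanner(string text)
    {
        words = new Queue<string>(
            text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    public override int yylex()
    {
        if (words.Count == 0) return (int)Tok.EOF;
        string w = words.Dequeue();
        if (w.Equals("and", StringComparison.OrdinalIgnoreCase)) return (int)Tok.AND;
        if (w.Equals("or", StringComparison.OrdinalIgnoreCase)) return (int)Tok.OR;
        yylval = w;                  // store the identifier's text, as in the lex rule above
        return (int)Tok.ID;
    }
}

A GPPG-generated parser does essentially this in a loop: it calls yylex() for each token, and whenever a grammar action references a token's value ($1, $2, ...), it reads the value the scanner left in yylval.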

I had a similar issue - not knowing how to use my output from GPLEX with GPPG due to an apparent lack of documentation. I think the problem stems from the fact that the GPLEX distribution includes gppg.exe along with gplex.exe, but only the documentation for GPLEX.
If you go to the GPPG homepage and download that distribution, you'll get the documentation for GPPG, which describes the requirements for the input file, how to construct your grammar, etc. Oh, and you'll also get both binaries again - gppg.exe and gplex.exe.
It almost seems like it would be simpler to just include everything in one package. It could definitely clear up some confusion, especially for those who may be new to lexical analysis (tokenization) and parsing (and may not yet be 100% familiar with the differences between the two).
So anyway, for those who may be doing this for the first time:
GPLEX http://gplex.codeplex.com - used for tokenization/scanning/lexical analysis (three names for the same thing)
GPPG http://gppg.codeplex.com/ - takes the output from a tokenizer as input to parse. For example, parsers use grammars and can do things a simple tokenizer cannot, like detect whether sets of parentheses match up.

Some time ago I had the same need to use GPLEX and GPPG together, and to make the job much easier I created a NuGet package for using GPPG and GPLEX together in Visual Studio.
This package can be installed in C# projects based on the .NET Framework and adds some cmdlets to the Package Manager Console in Visual Studio. These cmdlets help you configure the C# project to integrate GPPG and GPLEX into the build process. Essentially, in your project you edit YACC and LEX files as source code, and during the build the parser and the scanner are generated. In addition, the cmdlets add to the project the files needed for customizing the parser and the scanner.
You can find it here:
https://www.nuget.org/packages/YaccLexTools/
And here is a link to the blog post that explains how to use it:
http://ecianciotta-en.abriom.com/2013/08/yacclex-tools-v02.html

Have you considered using Roslyn? (This isn't a proper answer but I don't have enough reputation to post this as a comment)

Ironically, when I jumped into parsers in C# (about a year ago) I started from exactly those two tools. Back then the lexer had a tiny bug (easy to fix):
http://gplex.codeplex.com/workitem/11308
but the parser had a more severe one:
http://gppg.codeplex.com/workitem/11344
The lexer should be fixed by now (the release is dated June 2013), but the parser probably still has its bug (the last release is dated May 2012).
So I wrote my own suite :-) https://sourceforge.net/projects/naivelangtools/ and have used and developed it ever since.
Your example translates (in NLT) to:
/[Oo][Rr]/ -> OR;
/[Aa][Nn][Dd]/ -> AND;
/[Nn][Oo][Tt]/ -> NOT;
// by default text is returned as value
/[A-Za-z][A-Za-z0-9_]*/ -> ID;
The entire suite is similar to lex/yacc; where possible it avoids relying on side effects, so you return the appropriate value instead of setting a global.

Related

C99 grammar in Irony - declaration/statement conflicts

I'm trying to use Irony to parse C99, and I found a grammar online to guide me.
I'm having difficulty with conflicts between declarations and statements. The following rule fails to detect a pointer declaration with an initializer.
blockItemList.Rule = MakePlusRule(blockItemList, blockItem);
blockItem.Rule = declaration | statement;
The type of line it's failing on would be:
MyType *x = foo();
When I remove labeledStatement and expressionStatement from statement's rule (both may start with an identifier), this type of declaration is recognized correctly.
What's the best way to force Irony to exhaust the declaration rule before trying statement? Or can I add to the grammar as Irony parses, so that it registers MyType as a terminal rather than an identifier?
I remember having similar problems with function calls and identifiers. I don't think you've done anything particularly wrong; it's just the way grammars work, and you need to fine-tune them for Irony. As far as I know, Irony is an LALR(1) parser, i.e. it looks only one symbol ahead when making decisions. This might mean that you need to do more work than just transcribe the given grammar.
I had a case where my grammar had conflicts, and I fixed it by lowering the "precision" of the grammar. The actual precision was later restored through AST nodes.
P.S. You can also use Irony's GrammarExplorer to see what conflicts your grammar has. You can sometimes fix the conflicts with PreferShiftHere() or ReduceHere(), as in the sketch below.
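For example, here is a minimal, self-contained sketch (not the asker's C99 grammar) of how an inline hint resolves the classic shift/reduce conflict, the dangling "else". PreferShiftHere() tells the LALR(1) builder to shift the "else" token rather than reduce, binding each "else" to the nearest "if":

using Irony.Parsing;

public class DanglingElseGrammar : Grammar
{
    public DanglingElseGrammar()
    {
        var id = new IdentifierTerminal("id");
        var stmt = new NonTerminal("stmt");
        var ifStmt = new NonTerminal("ifStmt");

        // Without the hint, the parser cannot decide on "else": shift it
        // (attach to the inner "if") or reduce the short form first?
        ifStmt.Rule = ToTerm("if") + id + ToTerm("then") + stmt
                    | ToTerm("if") + id + ToTerm("then") + stmt + PreferShiftHere() + "else" + stmt;
        stmt.Rule = ifStmt | id;

        Root = stmt;
    }
}

The same idea applies to the declaration-versus-statement conflict: place the hint at the decision point that GrammarExplorer reports.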
And a few links that I think are interesting to read:
http://irony.codeplex.com/discussions/400830
http://irony.codeplex.com/discussions/80134
https://irony.codeplex.com/discussions/551074
Understanding context-free grammars is not enough - you have to know something about parsing methods like LR, LALR(1), LL, etc. Irony is LALR(1), while Antlr is LL. Grammar rules should be fine-tuned for a specific method. When Irony 'insists' on something 'wrong', it means that it took one of two equally possible alternatives arising from the ambiguities (conflicts!) that it reports. So there is no point trying to parse anything before you fix the conflicts. To do this, read more about LALR grammars.

Antlr3 (C# Target) Consistently Hits Phantom EOF Character

While writing an Antlr3 grammar in ANTLRWorks (generating C#), I wrote the following parser and lexer rules:
array :
'[' properties? ']' -> ^(ARR properties?)
;
properties :
propertyName (','! propertyName)*
;
propertyName :
ID
| ESC_ID
;
ESC_ID :
'\'' ESC_STRING '\''
;
fragment
ESC_STRING
: ( ESCAPE_SEQ | ~('\u0000'..'\u001f' | '\\' | '\"' ) )*
;
However, whenever I try to parse any string where the ESC_ID rule is matched, I hit a phantom EOF character at the end of the string:
Input: ['testing 123']
<mismatched token: [#4,15:15='<EOF>',<-1>,1:15]
I know that the Java version of ANTLR's generated code is not thoroughly debugged, but I've managed to find my way around the quirks so far. Thoughts on how not to hit this error when matching this lexer rule?
UPDATE
I have now tried using the official C# port of Antlr3, and I still get the same error.
ANTLRWorks can't be used to generate code for the C# targets. You'll need to generate your C# code using the Antlr3.exe tool that's included in the C# port. The preferred method is using the MSBuild integration, which can either be done manually or (finally!) automatically using NuGet.
The latest official release is found here:
http://www.antlr.org/wiki/display/ANTLR3/Antlr3CSharpReleases
In addition to that, I have released an alpha build of ANTLR 3 on NuGet. If you enable the "Include Prereleases" in the NuGet package manager in Visual Studio 2010+, you'll find it listed as ANTLR 3 version 3.5.0.3-alpha002.

Is there a C# utility for matching patterns in (syntactic parse) trees?

I'm working on a Natural Language Processing (NLP) project in which I use a syntactic parser to create a syntactic parse tree out of a given sentence.
Example Input: I ran into Joe and Jill and then we went shopping
Example Output: [TOP [S [S [NP [PRP I]] [VP [VBD ran] [PP [IN into] [NP [NNP Joe] [CC and] [NNP Jill]]]]] [CC and] [S [ADVP [RB then]] [NP [PRP we]] [VP [VBD went] [NP [NN shopping]]]]]]
I'm looking for a C# utility that will let me do complex queries like:
Get the first VBD related to 'Joe'
Get the NP closest to 'Shopping'
Here's a Java utility that does this, I'm looking for a C# equivalent.
Any help would be much appreciated.
There are at least two .NET NLP frameworks:
SharpNLP (NOTE: project inactive since 2006)
Proxem
And here you can find instructions for using a Java NLP library in .NET:
Using OpenNLP in .NET project
That page is about using Java OpenNLP, but it could apply to the Java library you mentioned in your post.
Or use NLTK, following these guidelines:
Open Source NLP in C# 3.5 using NLTK
One option would be to parse the output in C# and encode it as XML, turning every node into string.Format("<{0}>", this.Name) and string.Format("</{0}>", this.Name), with all the child nodes emitted recursively in between.
After you do this, I would use a tool for querying XML/HTML to query the tree. Thousands of people already use CSS selectors and jQuery to query tree-like structures based on the relations between nodes. I think this is far superior to TRegex or other outdated and unmaintained Java utilities.
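To make that concrete, here is a minimal sketch of such an encoder, assuming a hypothetical ParseNode class (a name made up here) holding the parser's output; the d in the snippets below would be the root node of such a tree:

using System.Collections.Generic;
using System.Linq;

public class ParseNode
{
    public string Name;                                  // e.g. "NP", "VBD", or a leaf word like "Joe"
    public List<ParseNode> Children = new List<ParseNode>();

    // Serialize the tree to XML: each inner node becomes an element,
    // each leaf becomes plain text, and children are emitted recursively.
    public string ToXml()
    {
        if (Children.Count == 0)
            return Name;
        var inner = string.Concat(Children.Select(c => c.ToXml()));
        return string.Format("<{0}>{1}</{0}>", Name, inner);
    }
}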
For example, this is to answer your first example:
var xml = CQ.Create(d.ToXml());
//this could be simpler with CSS selectors, but Linq is probably easier to follow
//Find Joe - in our case, the node whose text is 'Joe'
var joe = xml["*"].First(x => x.InnerHTML.Equals("Joe"));
//Find the last (deepest) element that meets the criteria of having both "Joe" and a VBD in it -
//in our case, the VP
var closestToVbd = xml["*"].Last(x => x.Cq().Has(joe).Has("VBD").Any());
Console.WriteLine("Closest node to VBD:\n " + closestToVbd.OuterHTML);
//If we want the VBD itself, we can just find the VBD in that element
Console.WriteLine("\n\n VBD itself is " + closestToVbd.Cq().Find("VBD")[0].OuterHTML);
Here is your second example:
//Now for the NP closest to 'shopping': find the element with the text 'shopping', then find its closest NP
var closest = xml["*"].First(x => x.InnerHTML.Equals("shopping")).Cq()
.Closest("NP")[0].OuterHTML;
Console.WriteLine("\n\n NP closest to shopping is: " + closest);

Checking C# Syntax from the Command Line

Does anyone know of a way in the Microsoft .NET framework to check the syntax, and only the syntax, of a given C# file?
For a little background, what I am interested in doing is setting up syntastic to check the syntax of .cs files. Out of the box, syntastic uses the Mono C# compiler with its --parse flag to perform this operation but I can find no equivalent in the Microsoft .NET framework.
My first attempt to get this to work was to use csc /target:library /nologo in place of mcs --parse, but the problem is that this is called on a per-file basis. As a result, it reports missing namespaces (which exist in the full project build) instead of only syntactic errors.
You can do this via the Roslyn CTP. It allows you to parse the .cs file entirely, and walk the full tree, looking for errors.
For details, I recommend downloading the Walkthrough: Getting Started with Syntax Analysis for C#, as it shows you the basic approach to looking at syntax trees in a C# file.
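The CTP-era API has since been superseded by the open-source Roslyn packages. Here is a minimal sketch against the current API (Microsoft.CodeAnalysis.CSharp on NuGet) rather than the CTP the answer refers to; because it parses the file without resolving any references, it reports only syntactic errors, which is exactly what the syntastic use case needs:

using System;
using System.IO;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

class SyntaxCheck
{
    static int Main(string[] args)
    {
        // Parse the single file named on the command line.
        var tree = CSharpSyntaxTree.ParseText(File.ReadAllText(args[0]));

        // Tree-level diagnostics come from the parser alone; missing
        // namespaces are a binding concern and are never reported here.
        bool hasErrors = false;
        foreach (var diag in tree.GetDiagnostics())
        {
            Console.Error.WriteLine(diag);
            if (diag.Severity == DiagnosticSeverity.Error)
                hasErrors = true;
        }
        return hasErrors ? 1 : 0;
    }
}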
I've used NRefactory before, from the icsharpcode IDE (SharpDevelop). It's quick and easy for basic stuff.
See this article:
Using NRefactory for analyzing C# code
I use it for creating VB.NET examples from C# examples. The method that does this is really straightforward and can easily be adapted to your needs:
private static void ConvertLanguage(TextReader input, TextWriter output, SupportedLanguage language, Action<string> onError)
{
    using (IParser parser = ParserFactory.CreateParser(language, input))
    {
        parser.Parse();
        // Specials are comments, preprocessor directives, etc.
        var specials = parser.Lexer.SpecialTracker.RetrieveSpecials();
        var result = parser.CompilationUnit;

        //if (parser.Errors.Count > 0)
        //    MessageBox.Show(parser.Errors.ErrorOutput, "Parse errors");

        // Emit the tree in the "other" language: C# in => VB out, VB in => C# out.
        IOutputAstVisitor outputVisitor;
        if (language == SupportedLanguage.CSharp)
            outputVisitor = new VBNetOutputVisitor();
        else
            outputVisitor = new CSharpOutputVisitor();

        outputVisitor.Options.IndentationChar = ' ';
        outputVisitor.Options.IndentSize = 4;
        outputVisitor.Options.TabSize = 4;

        // Re-insert the comments and directives into the generated code.
        using (SpecialNodesInserter.Install(specials, outputVisitor))
            result.AcceptVisitor(outputVisitor, null);

        if (outputVisitor.Errors.Count > 0 && onError != null)
            onError(outputVisitor.Errors.ErrorOutput);

        output.Write(outputVisitor.Text);
    }
}
Note: The preceding code is from an older version and may not compile against the latest version of the NRefactory library.
I think I may have a solution to your question. If you are trying to check the syntax of your code without being in the debugger, you can use an online compiler such as compilr.
If you wish to output the results, you can use the HTML Agility Pack to grab the results off the online compiler with ease. Hope this helped!

How do I display only the class name in doxygen class diagrams?

Using doxygen and graphviz with my C# project, I can generate class diagrams in the documentation pages. These diagrams have the full class names and namespaces in them, e.g.
Acme.MyProduct.MyClasses.MyClass
Is it possible to configure doxygen to cut this down a bit to just the class name?
MyClass
The fully qualified paths make even simple diagrams rather wide and unwieldy. I'd like to minimize the need for horizontal scrolling.
I suspect that you've already solved this as it is a year old, but an answer might be useful for anyone else searching for this (as I just did). You can use the "HIDE_SCOPE_NAMES" option. Setting it to YES (or checking it in the doxywizard GUI) will hide namespaces. From my doxygen file:
# If the HIDE_SCOPE_NAMES tag is set to NO (the default) then Doxygen
# will show members with their full class and namespace scopes in the
# documentation. If set to YES the scope will be hidden.
HIDE_SCOPE_NAMES = YES
HIDE_SCOPE_NAMES works great, but it only hides the scope in the class diagrams, not in the caller/callee graphs for each method.
To reduce the width of those diagrams to a readable size, you can rewrite the scope using an input filter. This does not remove the namespace, but it reduces it to a more readable width.
For example, to rename the namespace "COMPANY_NAMESPACE" to "sf", use:
# The INPUT_FILTER tag can be used to specify a program that doxygen should
# invoke to filter for each input file. Doxygen will invoke the filter program
# by executing (via popen()) the command <filter> <input-file>, where <filter>
# is the value of the INPUT_FILTER tag, and <input-file> is the name of an
# input file. Doxygen will then use the output that the filter program writes
# to standard output. If FILTER_PATTERNS is specified, this tag will be
# ignored.
INPUT_FILTER = "sed 's,COMPANY_NAMESPACE,sf,'"
