I need to parse a simple language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser.
Things had been going pretty well, until I ran into this construct in the language. (This is a simplified version, but it does show the problem):
EOF -> #"^\s*$";
[Skip] WHITESPACE -> #"\s+";
LIST -> "LIST";
END -> "END";
IDENTIFIER -> #"[a-zA-Z_][a-zA-Z0-9_]*";
Expr -> LIST IDENTIFIER+ END;
Start -> (Expr)+ EOF;
The resulting parser cannot parse this:
LIST foo BAR Baz END
because it greedily lexes END as an IDENTIFIER, instead of properly as the END keyword.
So, here are my questions:
Is this grammar ambiguous or wrong for LL(1) parsing? Or is this a bug in TinyPG?
Is there any way to redesign the grammar such that TinyPG will properly parse the example line?
Are there any other suggestions for a simple parser that outputs code in C# and doesn't require additional libraries? I've looked at LLLPG and ANTLR4, but found them much more troublesome than TinyPG.
You might be the same person I answered on GitHub, since the issue looks identical, but here it is again for anyone who finds this through Google.
Here is an example from the Simple-CIL-compiler project. The identifier has to match single words except the ones listed, which means you have to build the exception tokens into the identifier pattern:
IDENTIFIER-> #"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|else|do|while|for|true|false|return|to|incby|global|or|and|not|write|readnum|readstr|call))(?!\w)";
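Applied to the grammar in the question, an untested sketch of the same trick (only the LIST and END keywords need excluding) would be:
IDENTIFIER -> #"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(LIST|END))(?!\w)";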
Hope that helps.
(Link to Original post)
Related
I am wondering whether there are any examples (googling, I haven't found any) of TAB auto-complete solutions for a Command Line Interface (console) that use ANTLR4 grammars for predicting the next term (as in a REPL model).
I've written a PL/SQL grammar for an open source database, and now I would like to implement a command line interface to the database that lets the user complete statements according to the grammar, or discover the proper database object name to use (e.g. a table name, a trigger name, the name of a column, etc.).
Thanks for pointing me in the right direction.
Actually it is possible! (Depending, of course, on the complexity of your grammar.) The problem with auto-completion and ANTLR is that you do not have a complete expression, yet you want to parse it. If you had the complete expression, it would be no big problem to know what kind of element sits at each place and what can be used there. But you do not have a complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in some wrapper/helper that completes the expression to create a parseable one. Notice that nothing added just to complete the expression matters to you - you will only ask for members up to the last character the user actually wrote.
So:
A) Create the wrapper that will change this (excel formula) '=If(' into '=If()'
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
It actually works; I have built complete IntelliSense editors for several simple languages. There is much more infrastructure than this, but the basic idea is as I wrote it. Just be careful: writing the wrapper is hard, if not impossible, when the grammar is really complex. In that case look at the Papa Carlo project. http://lakhin.com/projects/papa-carlo/
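As a rough illustration of step (A), here is a minimal C# sketch; the helper is hypothetical and only balances unclosed parentheses, while a real wrapper also has to handle string literals, dangling operators, and so on:
static class InputWrapper
{
    // Append enough ')' characters to balance unclosed '(' so the
    // fragment becomes parseable. Anything appended is throwaway:
    // completion only asks for members up to the caret.
    public static string Wrap(string fragment)
    {
        int open = 0;
        foreach (char c in fragment)
        {
            if (c == '(') open++;
            else if (c == ')' && open > 0) open--;
        }
        return fragment + new string(')', open);
    }
}
With this, Wrap("=If(") returns "=If()", which is exactly the transformation from step (A).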
As already mentioned, auto-completion is based on the follow set at a given position, simply because that is what we defined in the grammar to be valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one), and this information is independent of the parser. And since a parser is made to parse valid input (while during auto-completion you have invalid input most of the time), it's not the right tool for this task.
Knowing which token can follow at a given position is useful for controlling the overall process (e.g. you don't want to show suggestions if only a string can appear), but it is most of the time not what you actually want to suggest (keywords excepted). If an ID is possible at the current position, that doesn't tell you which IDs are actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:
1. A symbol table that provides you with all possible names, sorted by scope. Creating this depends heavily on the parsed language, but it is a task where a parser is very helpful. You may want to cache this info, as running this analysis step is time consuming.
2. Determine in which scope you are when invoking auto-completion. You could use a parser here as well (maybe in conjunction with step 1).
3. Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all the necessary information (the follow set), but as mentioned above that's not true (keywords aside).
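To make step 1 a bit more concrete, here is a minimal C# sketch of such a scope-based symbol table; the class and the name-to-kind mapping are my own illustration, not from any particular tool:
using System.Collections.Generic;

// Hypothetical symbol table for step 1: each scope maps a name to its
// kind ("table", "column", "variable", ...). Lookups walk from the
// innermost scope outward, so inner names shadow outer ones.
class Scope
{
    public Scope Parent;
    public readonly Dictionary<string, string> Symbols = new Dictionary<string, string>();

    public IEnumerable<KeyValuePair<string, string>> VisibleSymbols()
    {
        var seen = new HashSet<string>();
        for (var scope = this; scope != null; scope = scope.Parent)
            foreach (var entry in scope.Symbols)
                if (seen.Add(entry.Key))   // inner scopes win on name clashes
                    yield return entry;
    }
}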
In my blog post Universal Code Completion using ANTLR3 I addressed especially the 3rd step. There I don't use a parser but simulate one, except that I don't stop where a parser would, but when the caret position is reached (so it is essential that the input be valid syntax up to that point). After reaching the caret, the collection process starts, which not only collects terminal nodes (for keywords) but also looks at the rule names to learn what else needs to be collected. Using specific rule names is my way of putting context into the grammar: when the collection code finds a rule table_ref, it knows it doesn't need to go further down the rule chain (to the ultimate ID token), but can instead use this information to provide a list of tables as suggestions.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially does what I do manually in my implementation (with the ANTLR3 backend).
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how the method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")
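To illustrate: for a table-driven bottom-up parser, the core of that computation is just a lookup in the action table. A hedged C# sketch (the table layout here is invented for illustration; real generators store this more compactly):
using System.Collections.Generic;
using System.Linq;

static class ParserStateInfo
{
    // In an LALR action table, each state maps terminals to shift/reduce
    // actions. The terminals with an entry are exactly the tokens that
    // may legally come next in that state.
    public static IEnumerable<string> ExpectedTokens(
        IReadOnlyDictionary<int, IReadOnlyDictionary<string, int>> actionTable,
        int state)
    {
        return actionTable.TryGetValue(state, out var row)
            ? row.Keys
            : Enumerable.Empty<string>();
    }
}
Chasing what can follow after a reduce requires walking the goto table as well; that is the state-table walk mentioned above.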
Has anyone come across a tool to report on commented-out code in a .NET app? I'm talking about patterns like:
//var foo = "This is dead";
And
/*
var foo = "This is dead";
*/
This won't be found by tools like ReSharper or FxCop which look for unreferenced code. There are obvious implications around distinguishing commented code from commented text but it doesn't seem like too great a task.
Is there any existing tool out there which can pick this up? Even if it was just reporting of occurrences by file rather than full IDE integration.
Edit 1: I've also logged this as a StyleCop feature request. Seems like a good fit for the tool.
Edit 2: Yes, there's a good reason why I'd like to do this and it relates to code quality. See my comment below.
You can get an approximate answer by using a regexp that recognizes comments that end with ";" or "}".
For a more precise scheme, see this answer:
Tool to find commented out VHDL code
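If you just want a quick report, a C# sketch of that approximate approach might look like this (the heuristic and file filter are only an example; expect false positives):
using System;
using System.IO;
using System.Text.RegularExpressions;

class CommentedCodeScan
{
    // Heuristic: a // comment whose text ends in ';' or '}' is more
    // likely commented-out code than prose.
    static readonly Regex Suspect = new Regex(@"^\s*//.*[;}]\s*$");

    static void Main(string[] args)
    {
        foreach (var file in Directory.EnumerateFiles(args[0], "*.cs", SearchOption.AllDirectories))
        {
            int lineNumber = 0;
            foreach (var line in File.ReadLines(file))
            {
                lineNumber++;
                if (Suspect.IsMatch(line))
                    Console.WriteLine("{0}({1}): {2}", file, lineNumber, line.Trim());
            }
        }
    }
}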
I've been down this road for the same reason. I more or less did what Ira Baxter suggested (though I focused on variable_type variable = value and specifically looked for lines consisting of zero or more whitespace characters, followed by //, followed by code; to handle /* */, I wrote a preprocessor that converted those blocks into //'s). I tweaked the regexp to cut down on false positives and also did a manual inspection just to be safe; fortunately, there were very few cases where the comment was doing pseudo-code-like things, as drachenstern suggests above; YMMV. I'd love to find a tool that could do this, but some false positives from valid but possibly overly detailed pseudo code are going to be really hard to rule out, especially if the authors use literate programming techniques to make the code as "readable" as pseudo code.
(I also did this for VB6 code; on the one hand, the lack of ;'s made it harder to write an "easy" regexp; on the other hand, the code used far fewer classes, so it was easier to match on variable types, which tend not to appear in pseudo code.)
Another option I looked at, but didn't have available, was to mine the revision control logs for lines that were code in version X and then //same line in version Y... this of course assumes that A) they are using revision control, B) you can view it, and C) they didn't write the code and comment it out in the same revision. (And it gets a little trickier if they use /* */ comments.)
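For what it's worth, the /* */ preprocessing step mentioned above boils down to something like this C# sketch (naive by design: it ignores comment markers inside string literals):
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CommentPreprocessor
{
    // Rewrite each /* ... */ block as a run of // lines so a single
    // line-oriented regexp can handle both comment styles.
    public static string BlockToLineComments(string source)
    {
        return Regex.Replace(
            source,
            @"/\*(.*?)\*/",
            m => string.Join(Environment.NewLine,
                m.Groups[1].Value
                 .Split('\n')
                 .Select(l => "//" + l.TrimEnd('\r'))),
            RegexOptions.Singleline);
    }
}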
There is another option for this: Sonar. It is actually a Java-centric app, but there is a plugin that can handle C# code. Currently it can report on:
StyleCop errors
FxCop errors
Gendarme errors
Duplicated code
Commented code
Unit test results
Code coverage (using NCover, OpenCover)
It does take a while to scan, mostly due to the duplication checks (AFAIK it uses text matching rather than C# syntax trees), and if you use the internal default Derby database it can fail on large code bases. However, it is very useful for the code-base metrics you gain, and its snapshot features let you see how things have changed and (hopefully) improved over time.
Since StyleCop is not actively maintained (see https://github.com/StyleCop/StyleCop#considerations), I decided to roll my own, dead-csharp:
https://github.com/mristin/dead-csharp
Dead-csharp uses heuristics to find code patterns in the comments. The comments starting with /// are intentionally ignored (so that you can write code in structured comments).
StyleCop will catch the first pattern. It suggests you use //// as a comment for code so that it will ignore the rule.
Seeing as you mentioned NDepend: it can also tell you the percentage of comments (http://www.ndepend.com/Metrics.aspx#PercentageComment). This is defined for applications, assemblies, namespaces, types and methods.
I am researching ways, tools and techniques to parse code files in order to support syntax highlighting and IntelliSense in an editor written in C#.
Does anyone have any ideas/patterns & practices/tools/techniques for that?
EDIT: A nice source of info for anyone interested:
Parsing beyond Context-free grammars
ISBN 978-3-642-14845-3
My favourite parser for C# is Irony: http://irony.codeplex.com/ - I have used it a couple of times with great success.
Here is a wikipedia page listing many more: http://en.wikipedia.org/wiki/Compiler-compiler
There are two basic approaches:
1) Parse the entire solution and everything it references so you understand all the types involved in the code
2) Parse locally and do your best to guess what types etc are.
The trouble with (2) is that you have to guess, and in some circumstances you just can't tell from a code snippet exactly what everything is. But if you're happy with the sort of syntax highlighting shown on (e.g.) Stack Overflow, then this approach is easy and quite effective.
To do (1), you need one of the following (in decreasing order of difficulty):
Parse all the source code. Not possible if you reference 3rd party assemblies.
Use reflection on the compiled code to garner type information you can use when parsing the source (a sketch follows this list).
Use the host IDE's code element interfaces (if available - so not applicable in your case!) to provide the information you need.
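As an illustration of the reflection option, here is a small sketch (the filtering and output are placeholders; a real editor would build a lookup structure instead of printing):
using System;
using System.Reflection;

static class ApiDump
{
    // Pull the public type and member names out of a referenced assembly;
    // a highlighter/completer can use these instead of parsing sources
    // it doesn't have.
    public static void DumpPublicApi(string assemblyPath)
    {
        var assembly = Assembly.LoadFrom(assemblyPath);
        foreach (var type in assembly.GetExportedTypes())
        {
            Console.WriteLine(type.FullName);
            foreach (var member in type.GetMembers())
                Console.WriteLine("  " + member.Name);
        }
    }
}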
You could take a look at how http://www.icsharpcode.net/ did it. They wrote a book about doing just that, Dissecting a C# Application: Inside SharpDevelop. It even has a chapter called "Implement a parser to provide syntax highlighting and auto-completion as users type".
I often type string in C# when actually I want to type String.
I know that string is an alias of String and I am really just being pedantic, but I wish to outlaw string to force myself to write String.
Can this be done in either Visual Studio IntelliSense or in ReSharper, and how?
I've not seen it done before, but you may be able to achieve this with an Intellisense extension. A good place to start would be to look at the source for this extension on CodePlex.
Would be good to hear if you have any success with this.
I have always read in "Best Coding Practices" for C# to prefer string, int, float, double to String, Int32, Single, Double. I think it is mostly to make C# look less like VB.NET and more like C, but it works for me.
Also, you can go the other way and add the following at the top of every file:
using S = System.String;
..
S msg = "I don't like string.";
You may laugh at this, but I have found it invaluable when I have two similar source files with different underlying data types. I usually have using num = System.Single; or using num = System.Double; and the rest of the code is identical, so I can copy and paste from one file to the other and keep the single-precision and double-precision libraries in sync.
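For example (a contrived sketch of the trick above), only the alias line differs between the two copies of the file:
using num = System.Single;   // the other copy says: using num = System.Double;

static class Stats
{
    // Identical in both copies; 'num' resolves to float or double.
    public static num Mean(num[] values)
    {
        num sum = 0;
        foreach (var v in values) sum += v;
        return sum / values.Length;
    }
}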
I think ReSharper can do this!
Here is an extract from the documentation:
ReSharper 5 provides Structural Search and Replace to find custom code constructs and replace them with other code constructs. What's even more exciting is that it's able to continuously monitor your solution for your search patterns, highlight code that matches them, and provide quick-fixes to replace the code according to your replace patterns. That essentially means that you can extend ReSharper's own 900+ code inspections with your custom inspections. For example, if you're migrating to a newer version of a framework, you can create search patterns to find usages of its older API and replace patterns to introduce an updated API.
cheers,
Chris
Background
I have written a very simple BBCode parser in C# which transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid, regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode to HTML.
Question
Does a standard "BBCode to HTML" specification exist?
How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
How should I handle the input "[b][i][u]zzz[/b][/i][/u]"? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for such a case, but I wonder whether that approach is "too smart" or not.
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, use tons of regular expressions and do not produce the markup I expect. Ideally, I want to receive XHTML at the output. To infer the "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that, in my opinion, is intuitively correct. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the closing-tag order). Firebug of course shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such improperly ordered closing tags, but I am in doubt:
Should I rely on this browser feature and not try to produce XHTML?
Maybe "[b][i]zzz[/b]ccc[/i]" must be understood as "<b>[i]zzz</b>ccc[/i]" - looks logically for such improper formatting, but is in conflict with popular forums BBCode outputs (*zzz****ccc*, not **[i]zzzccc[/i])
Thanks.
On your first question: I don't think relying on browsers to correct any kind of mistake is a good idea, regardless of the scope of your project (well, except maybe when you're actually running bug tests on the browser itself). Some browsers might do an awesome job of it while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it to the browser with correct syntax in the first place.
Regarding your second question, since you're trying to convert correct BBCode to correct HTML: if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, because you would no longer be writing just a converter, but also a syntax checker/corrector. I wrote a similar script in PHP for a rather weird game-engine scripting language, but the logic can easily be applied to your case. Basically, I set a flag for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
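The PHP original is long gone, but the flag/position check boiled down to something like this C# sketch (a reconstruction of the idea, not the actual script):
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BbcodeChecker
{
    // Walk the tags with a stack; a closing tag must match the most
    // recently opened one. A corrector would act on the failure
    // instead of just reporting it.
    public static bool IsProperlyNested(string input)
    {
        var stack = new Stack<string>();
        foreach (Match m in Regex.Matches(input, @"\[(/?)(b|i|u)\]"))
        {
            bool closing = m.Groups[1].Value == "/";
            string tag = m.Groups[2].Value;
            if (!closing)
                stack.Push(tag);
            else if (stack.Count == 0 || stack.Pop() != tag)
                return false;   // e.g. [b][i]zzz[/b] fails here
        }
        return stack.Count == 0;   // leftover open tags are also an error
    }
}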
If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.
At the end of a forum post (or whatever) if there are still-open tags, simply close them. If the user puts in invalid bbcode, it may look strange for the duration of their post, but it won't be disastrous.
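A minimal C# sketch of that counter idea (deliberately naive; the tag set and the counting are the whole trick):
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagCloser
{
    // +1 per opening tag, -1 per closing tag; whatever is still
    // positive at the end of the post gets closed in one go.
    public static string CloseDanglingTags(string post)
    {
        var open = new Dictionary<string, int> { { "b", 0 }, { "i", 0 }, { "u", 0 } };
        foreach (Match m in Regex.Matches(post, @"\[(/?)(b|i|u)\]"))
        {
            string tag = m.Groups[2].Value;
            open[tag] += m.Groups[1].Value == "/" ? -1 : 1;
        }
        foreach (var entry in open)
            for (int i = 0; i < entry.Value; i++)
                post += "[/" + entry.Key + "]";
        return post;
    }
}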
Regarding invalid user-submitted markup, you have at least three options:
1. Strip it out
2. Print it literally, i.e. don't convert it to HTML
3. Attempt to fix it.
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers", "compiler generators", or "parser generators". Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.