I've been using LeMP to great effect to auto-generate code that is identical across variants except for the argument types. However, the classes I'm working on also contain methods authored by hand, with no LeMP involvement. The challenge is that LeMP seems to throw away many of the newlines in the original code, making the generated C# much harder to read (which I still need to do when stepping through it in a debugger, etc.).
There seem to be two cases:
DllImport method prototypes lose their newlines altogether -- looking at the output in a hex editor, the newlines have been transformed into spaces.
Methods with actual bodies retain some newlines, but there is no newline between the closing curly brace and the next method's 'public T ...' signature, for instance.
Some methods seem to be untouched, which is what I'd like to see for everything that isn't generated by a macro.
What's the best way to get LeMP's output to retain as much of the original formatting in the code as possible?
OK, it seems pretty likely that this is caused by a bug. I've filed two issues with some more details.
I need to parse a simple language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser.
Things had been going pretty well, until I ran into this construct in the language. (This is a simplified version, but it does show the problem):
EOF -> @"^\s*$";
[Skip] WHITESPACE -> @"\s+";
LIST -> "LIST";
END -> "END";
IDENTIFIER -> @"[a-zA-Z_][a-zA-Z0-9_]*";
Expr -> LIST IDENTIFIER+ END;
Start -> (Expr)+ EOF;
The resulting parser cannot parse this:
LIST foo BAR Baz END
because it greedily lexes END as an IDENTIFIER, instead of properly as the END keyword.
So, here are my questions:
Is this grammar ambiguous or wrong for LL(1) parsing? Or is this a bug in TinyPG?
Is there any way to redesign the grammar such that TinyPG will properly parse the example line?
Are there any other suggestions for a simple parser that outputs code in C# and doesn't require additional libraries? I've looked at LLLPG and ANTLR4, but found them much more troublesome than TinyPG.
You might be the same person who asked this on GitHub, since the issue looks identical, but here is the answer again for people who find this via Google.
Here is an example from the Simple-CIL-compiler project.
The IDENTIFIER terminal has to catch single words except the reserved ones, which means you have to exclude those keyword tokens within the IDENTIFIER pattern itself:
IDENTIFIER -> @"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|else|do|while|for|true|false|return|to|incby|global|or|and|not|write|readnum|readstr|call))(?!\w)";
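To see why this works, here is a small standalone check of the pattern against .NET's Regex class directly (a sketch with the keyword list shortened for brevity; the mechanism is the same):

```csharp
using System;
using System.Text.RegularExpressions;

class IdentifierPatternDemo
{
    static void Main()
    {
        // Shortened keyword list for brevity; the idea is identical.
        // The negative lookbehind rejects a match that is exactly a keyword,
        // and (?!\w) prevents the engine from backtracking into a shorter
        // partial match like "en" out of "end".
        var identifier = new Regex(
            @"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|while|for|return))(?!\w)");

        Console.WriteLine(identifier.IsMatch("end"));      // False: bare keyword
        Console.WriteLine(identifier.IsMatch("endpoint")); // True: keyword is only a prefix
        Console.WriteLine(identifier.IsMatch("foo"));      // True: plain identifier
    }
}
```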
Hope that helps.
(Link to Original post)
I am working with Irony.Net (https://irony.codeplex.com/) and have been using its SQL grammar. I now have the parser working so that my statements are parsed correctly (I had to add parameter support to the default grammar).
Now my question is simple: after I have manipulated the ParseTree, I want to rebuild the statement from that tree.
Does Irony have a method for rebuilding the original parsed text from the tree, or do I need to write my own system for this?
I am fine writing my own system, but if it is already in place I would rather use that.
After quite some time working with the Irony.Net parser, it seems relatively difficult to rebuild the original parsed string after you have manipulated the ParseTree.
The reason is that the parse tree automatically drops whitespace and some punctuation unless you explicitly preserve them.
However, part of the parse tree does give you the "span" of characters where each token/term existed in the original string.
Given those span details, you could essentially rebuild the statement by replacing characters in the original statement at the span markers.
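A rewrite along those lines might look like this (a minimal sketch; the Edit values stand in for the start/length figures you would read off each manipulated node's span):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class SpanRewriter
{
    // Start/Length describe a region of the original text; NewText is what
    // the manipulated tree node should render as. In practice these values
    // would come from the parse tree's span information.
    public record Edit(int Start, int Length, string NewText);

    public static string Apply(string original, IEnumerable<Edit> edits)
    {
        var sb = new StringBuilder(original);
        // Apply edits back-to-front so earlier spans keep their offsets.
        foreach (var e in edits.OrderByDescending(e => e.Start))
        {
            sb.Remove(e.Start, e.Length);
            sb.Insert(e.Start, e.NewText);
        }
        return sb.ToString();
    }
}
```

For example, `SpanRewriter.Apply("SELECT a FROM t", new[] { new SpanRewriter.Edit(7, 1, "b") })` yields "SELECT b FROM t", and all the whitespace and punctuation between the spans survives untouched because it was never removed from the original string.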
After much discussion, it was found that although the Irony.Net project is fantastic at parsing your statements into ASTs, it is not well suited for manipulating the parsed tree.
With that being said we are still using the Irony.Net project for other problems.
Long Version
In the web app that I work on, we put all our translations into .resx files that we then refer to by calling Resources.FileName.KeyName (as specified in the "To retrieve global resources using strong typing" section of http://msdn.microsoft.com/en-us/library/ms227982%28v=vs.100%29.aspx).
In some places we retrieve the value directly, but in a lot of cases we retrieve the value to be used in JavaScript, so we need to do something like this: HttpUtility.JavaScriptStringEncode(Resources.FileName.KeyName)
The problem is that there are thousands of these kinds of lines that need to be wrapped with a call to HttpUtility.JavaScriptStringEncode retroactively.
There has to be a better way to do this rather than going through the entire source code and manually wrapping each reference to the resources.
TL;DR Version
I need a better way of wrapping each Resources.FileName.KeyName call with HttpUtility.JavaScriptStringEncode() than manually going to each reference and adding it myself.
I was unable to come up with a true solution for this so I decided to solve this using regular expressions.
Now, the problem is that we are still using Visual Studio 2010, which has its own subset of regex syntax that does not support a lot of advanced trickery, so lookaround assertions were not an option.
Instead, I just found all instances of <Resources\.Thread.{[a-zA-Z0-9_]+} and replaced it with HttpUtility.JavaScriptStringEncode( Resources.Thread.\1 ).
This can create duplicate calls, though, if some instances were already properly calling JavaScriptStringEncode, so unfortunately I then had to find all instances of HttpUtility.JavaScriptStringEncode\(HttpUtility.JavaScriptStringEncode\( {[a-zA-Z0-9_ \.\)]+}\)\) and replace them with HttpUtility.JavaScriptStringEncode( \1).
The lack of a space at the end of the replacement string, inside the parentheses, is intentional for formatting.
There are other variants to sort out (spaces between parentheses), but this is the baseline. After a few more searches to clean up duplicate calls, it was done.
Not my best work, but the best I could come up with short of upgrading to Visual Studio 2012/2013 and using real regex with assertions to do it all in one shot (which is obviously recommended if you can do it).
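For reference, with a regex engine that does support lookaround (the full .NET Regex class, or the VS2012+ find/replace), it can indeed be done in one shot. A sketch, reusing the Resources.Thread prefix from above:

```csharp
using System;
using System.Text.RegularExpressions;

class WrapResources
{
    static void Main()
    {
        string source =
            "var a = Resources.Thread.Title;\n" +
            "var b = HttpUtility.JavaScriptStringEncode( Resources.Thread.Body );";

        // Wrap any Resources.Thread.* reference that is not already preceded
        // by a JavaScriptStringEncode( call. $0 is the whole match.
        string result = Regex.Replace(
            source,
            @"(?<!JavaScriptStringEncode\(\s*)Resources\.Thread\.[A-Za-z0-9_]+",
            "HttpUtility.JavaScriptStringEncode($0)");

        // The first reference gets wrapped; the already-wrapped one is left alone.
        Console.WriteLine(result);
    }
}
```

The negative lookbehind is what avoids the duplicate-call cleanup pass entirely, since already-wrapped references never match in the first place.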
I use the @ prefix with my inline strings quite often, to support multi-line strings or to make strings with quotes a little more readable. Having to double up the embedded quotes is still somewhat of a pain, though, which made me wonder whether there is another option in .NET that would allow strings to keep their double quotes without requiring some form of delimiting -- something like a CDATA section in XML? I've searched a bit and didn't find anything, but thought I'd ask here in case I've overlooked some .NET feature (perhaps even a recent one in version 4 or 4.5).
Update: I've found that VB.NET has "XML Literals", which allow defining XML snippets directly inline in the source. This looks pretty close to what I'd like C# to do...
If there were something that would do what you want, then we wouldn't need to "escape" double quotes.
I like to use @ when writing dynamic HTML in code. But static strings belong in resources -- even ones that have dynamic values, for example "Application error. Error Message: {0}". Then you use string.Format to form the output.
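For example (the resource name here is hypothetical; in practice the template would come from a .resx entry such as Resources.Errors.AppError):

```csharp
using System;

class FormatDemo
{
    static void Main()
    {
        // Stand-in for a string pulled from a .resx resource file.
        string template = "Application error. Error Message: {0}";

        // The dynamic part is supplied at runtime via string.Format.
        Console.WriteLine(string.Format(template, "File not found"));
        // Application error. Error Message: File not found
    }
}
```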
Background
I have written a very simple BBCode parser in C# which transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid regardless of what the user has typed, and I cannot find a strict specification of how to transform BBCode to HTML.
Question
Does standard "BBCode to HTML" specification exist?
How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
How should I handle "[b][i][u]zzz[/b][/i][/u]"? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for such a case, but I wonder whether that is "too smart" an approach, or not?
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, use tons of regular expressions and do not produce the markup I expect. Ideally, I want to receive XHTML as the output. For inferring the "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct, in my opinion. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the closing tag order). Firebug of course shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such incorrectly ordered closing tags, but I am in doubt:
Should I rely on this browser behavior and not try to produce XHTML?
Maybe "[b][i]zzz[/b]ccc[/i]" should be understood as "<b>[i]zzz</b>ccc[/i]" -- that looks logical for such improper formatting, but it conflicts with the output of popular forums' BBCode (which render zzz in bold italic and ccc in italic, rather than showing the [i] tags literally).
Thanks.
On your first question: I don't think relying on browsers to correct mistakes is a good idea, regardless of the scope of your project (except perhaps when you're actually testing the browser itself). Some browsers might do an awesome job of it while others fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it to the browser with correct syntax in the first place.
Regarding your second question: since you're trying to convert correct BBCode into correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, as you would no longer be writing just a converter but also a syntax checker/corrector. I have written a similar script in PHP for a rather odd game-engine scripting language, but the logic can easily be applied to your case: basically, I kept a flag for each opening tag and checked whether its closing tag was in the right position. This gives limited functionality, of course, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
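In C#, that flag/position check amounts to a small stack-based validator. A sketch for just the three tags from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BbCodeChecker
{
    // Returns true when every [b]/[i]/[u] tag is closed in the right
    // position, i.e. the tags are properly nested.
    public static bool IsProperlyNested(string input)
    {
        var stack = new Stack<string>();
        foreach (Match m in Regex.Matches(input, @"\[(/?)(b|i|u)\]"))
        {
            bool closing = m.Groups[1].Value == "/";
            string tag = m.Groups[2].Value;
            if (!closing)
                stack.Push(tag);
            else if (stack.Count == 0 || stack.Pop() != tag)
                return false; // closing tag in the wrong position
        }
        return stack.Count == 0; // no unclosed tags left over
    }
}
```

IsProperlyNested("[b][i]zzz[/i][/b]") is true, while "[b][i]zzz[/b]ccc[/i]" fails, so you can fall back to printing the markup literally (or rejecting it) instead of emitting invalid XHTML.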
If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.
At the end of a forum post (or whatever) if there are still-open tags, simply close them. If the user puts in invalid bbcode, it may look strange for the duration of their post, but it won't be disastrous.
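A sketch of that counter idea for the three tags. Note it only appends the missing closing tags at the end of the post, so badly interleaved input can still come out imperfectly nested -- the "may look strange but not disastrous" trade-off described above:

```csharp
using System;
using System.Text.RegularExpressions;

static class BbCodeAutoClose
{
    public static string Convert(string input)
    {
        string[] tags = { "b", "i", "u" };
        string html = input;
        foreach (var t in tags)
            html = html.Replace("[" + t + "]", "<" + t + ">")
                       .Replace("[/" + t + "]", "</" + t + ">");

        // Counter step: +1 per open, -1 per close; append a closing tag
        // for whatever is still open at the end of the post.
        foreach (var t in tags)
        {
            int open = Regex.Matches(html, "<" + t + ">").Count;
            int close = Regex.Matches(html, "</" + t + ">").Count;
            for (int i = close; i < open; i++)
                html += "</" + t + ">";
        }
        return html;
    }
}
```

So Convert("[b]bold") produces "<b>bold</b>", and properly closed input passes through unchanged.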
Regarding invalid user-submitted markup, you have at least three options:
Strip it out
Print it literally, i.e. don't convert it to HTML
Attempt to fix it.
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers", "compiler generators", or "parser generators". Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.