I am working with Irony.Net (https://irony.codeplex.com/) and its SQL grammar. The parser now handles my statements correctly (I had to add parameter support to the default grammar).
My question is simple: after I have manipulated the ParseTree, I want to rebuild the statement from it.
Does Irony have a way to rebuild the original parsed text from the tree, or do I need to write my own system for this?
I am fine writing my own system, but if it is already in place I would rather use that.
After quite some time working with the Irony.Net parser, it seems relatively difficult to rebuild the original parsed string after you have manipulated the ParseTree.
The reason is that, unless you explicitly preserve whitespace and punctuation, the parse tree drops those entries automatically.
Part of the parse tree does give you the "span" of characters where each token or term appeared in the original string.
Given the span details, you could essentially rebuild the statement by removing or replacing characters in the original statement at the span positions.
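The span idea can be sketched roughly as follows. This does not call into Irony at all: SpanEdit and SpanRewriter are hypothetical names, and the position and length of each edit are assumed to be taken from the span of the node you changed. Edits are assumed not to overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical edit record: replace `Length` characters starting at `Position`.
public sealed class SpanEdit
{
    public int Position { get; }
    public int Length { get; }
    public string Replacement { get; }

    public SpanEdit(int position, int length, string replacement)
    {
        Position = position;
        Length = length;
        Replacement = replacement;
    }
}

public static class SpanRewriter
{
    // Applies edits from the end of the text towards the start, so that
    // the positions of edits nearer the start are still valid when reached.
    public static string Apply(string original, IEnumerable<SpanEdit> edits)
    {
        var result = new StringBuilder(original);
        foreach (var edit in edits.OrderByDescending(e => e.Position))
        {
            result.Remove(edit.Position, edit.Length);
            result.Insert(edit.Position, edit.Replacement);
        }
        return result.ToString();
    }
}
```

Because only the edited spans are touched, the whitespace and punctuation that the parse tree dropped are carried over from the original text unchanged.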
After much discussion, we concluded that although the Irony.Net project is fantastic at parsing statements into ASTs, it is not well suited to manipulating the parsed tree.
With that being said we are still using the Irony.Net project for other problems.
I am translating a string like this, and I am wondering whether there is built-in functionality for highlighting a word so that you can later surround that specific word with an element, as in the example below.
What is shown when users translate
Read our Terms of Service here.
What is shown on the website
Read our Terms of Service here.
What you describe doesn't make sense in general, because any given word or phrase in the source language may get translated into a different number of words that may not even be contiguous.
The way this is typically done is that you simply include the markup in your string, and your translators deal with it. Any competent translation service can handle HTML markup.
Another option is to make the entire sentence the clickable text. The basic principle is to provide complete segments for translation. Attempting to deconstruct or reconstruct from parts is doomed to fail under localization.
I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.
An XML file won't scale well, especially if accessed concurrently. You would be better off using a database engine for such a task.
Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most-likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument class and the System.Xml namespace. I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.
Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access, and that means a database. Hitting an XML file repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);
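As a rough sketch of the in-memory approach, here is one way to check each word against a set. WordFilter and Censor are hypothetical names, and the blocked list is assumed to have been loaded already (from XML, a text file or a database). It uses a regex to find words instead of splitting on spaces, so punctuation next to a word does not hide it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Replaces every whole word found in `blocked` with `replacement`.
    public static string Censor(string text, ISet<string> blocked, string replacement)
    {
        // \w+ matches runs of letters, digits and underscores, so a word
        // followed by punctuation ("darn!") is still looked up on its own.
        return Regex.Replace(text, @"\w+",
            match => blocked.Contains(match.Value) ? replacement : match.Value);
    }
}
```

Building the set with StringComparer.OrdinalIgnoreCase, for example new HashSet<string>(words, StringComparer.OrdinalIgnoreCase), makes the lookup case-insensitive. This only catches exact whole words; words hidden inside other strings would need a more elaborate pattern.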
I use the @ prefix with my inline strings quite often, to support multi-line strings or to make strings with quotes a little more readable. Having to double up the inline quotes is still somewhat of a pain, so this made me wonder whether there is another option in .NET that would allow strings to keep their double quotes without requiring some form of escaping. Something like a CDATA section in XML? I've searched a bit and didn't find anything, but thought I'd ask here in case I've overlooked some .NET feature (perhaps even a recent one in version 4 or 4.5).
update: I've found that vb.net has "XML Literals" that allow defining xml snippets directly inline with the source. This looks pretty close to what I'd like c# to do...
If there were something that would do what you want, then we wouldn't need to "escape" double quotes.
I like to use @ when writing dynamic HTML in code. But static strings belong in resources, even ones that have dynamic values, for example "Application error. Error Message: {0}". Then you use string.Format to form the output.
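To make the trade-off concrete, here is a small sketch. QuotedStrings is a hypothetical class and "terms.html" is a made-up URL; the template is declared as a constant here only for illustration, where a real project would load it from a resource file.

```csharp
using System;

public static class QuotedStrings
{
    // Verbatim string: backslashes and line breaks are literal,
    // but each double quote still has to be written twice.
    public const string Verbatim = @"<a href=""terms.html"">Terms</a>";

    // Regular string: the same text with backslash-escaped quotes.
    public const string Escaped = "<a href=\"terms.html\">Terms</a>";

    // Template kept separate from the dynamic value, as a resource string would be.
    public const string ErrorTemplate = "Application error. Error Message: {0}";

    public static string FormatError(string message)
    {
        return string.Format(ErrorTemplate, message);
    }
}
```

Both constants hold exactly the same characters; the only difference is how the quotes are written in source.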
I'm trying to read a structure of a text file in a certain way. The text file is kind of a user-friendly configuration file.
Current structure of file (structure can be changed if necessary):
info1=exampleinfo
info2=exampleinfo2
info3="example","example2","example3"
info4="example","example2","example3"
There is no real difficulty in getting the first two lines, but the latter two are more difficult. I need to put them in two separate string arrays that I can use. I could split the string, but the problem is that in the info4 array the values can contain commas (this is all user input).
How to go about solving this?
The reason you're having trouble writing a parser is that you're not starting with a good definition of the file format. Instead of asking how you should parse it if there are commas, you should be deciding how to properly encode values with commas. Then parsing is simple.
If this file is written by non-technical users who can't be trusted with a complex format (like JSON), consider a format like:
info1=exampleinfo
info2=exampleinfo2
info3=example
      example2
      example3
info4=example
      example2
      example3
That is, don't mess around with quotes and commas. Users understand line breaks and spaces pretty well.
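A minimal sketch of a reader for such a format might look like this. ConfigParser is a hypothetical name, and the rule for spotting extra values is my assumption: a line that starts with whitespace or contains no '=' belongs to the most recent key. An unindented value that itself contains '=' would be mistaken for a new key, so indenting the extra values is the safer convention.

```csharp
using System;
using System.Collections.Generic;

public static class ConfigParser
{
    // Parses "key=value" lines; a line that starts with whitespace or has
    // no '=' is treated as a further value of the most recent key.
    public static Dictionary<string, List<string>> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, List<string>>();
        List<string> current = null;

        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue; // skip blank lines
            }

            int separator = line.IndexOf('=');
            bool isContinuation = char.IsWhiteSpace(line[0]) || separator < 0;

            if (isContinuation)
            {
                if (current == null)
                {
                    throw new FormatException("Value found before any key: " + line);
                }
                current.Add(line.Trim());
            }
            else
            {
                string key = line.Substring(0, separator).Trim();
                current = new List<string> { line.Substring(separator + 1).Trim() };
                result[key] = current;
            }
        }

        return result;
    }
}
```

Calling ConfigParser.Parse(File.ReadAllLines(path)) gives every key a list, so info1 simply ends up with a single entry and commas inside values need no special handling.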
I'm 100% in favor of @DavidHeffernan's solutions, JSON would be great. And @ScottMermelstein's solution of a program that builds the output - that's probably your best bet if possible, not allowing the user to make a mistake even if they wanted to.
However, if you need them to build the textfile, and you're working with users who can't be trusted to put together valid JSON, since it is a picky format, maybe try a delimiter that won't be used by the user, to separate values.
For example, pipes are always good, since practically nobody uses them:
info1=exampleinfo
info2=exampleinfo2
info3=example|example2|example3
info4=example|exam,ple2|example3
All you'd need is a rule that says their data cannot contain pipes. More than likely, the users would be ok with that.
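With that rule in place, reading a line comes down to two splits. A minimal sketch, where PipeList and ParseValues are hypothetical names:

```csharp
using System;

public static class PipeList
{
    // Splits "key=a|b|c" into the key and its pipe-separated values.
    public static string[] ParseValues(string line, out string key)
    {
        int separator = line.IndexOf('=');
        if (separator < 0)
        {
            throw new FormatException("Expected key=value: " + line);
        }

        key = line.Substring(0, separator);
        // Only the first '=' separates key from values, so values may contain '='.
        return line.Substring(separator + 1).Split('|');
    }
}
```

Commas pass through untouched, so the info4 line above yields three values with "exam,ple2" as the second one.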
Background
I have written a very simple BBCode parser in C# which transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode to HTML.
Question
Does a standard "BBCode to HTML" specification exist?
How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for such a case, but I wonder whether that approach is "too smart".
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, they use tons of regular expressions and do not produce the markup I expect. Ideally, I want to receive XHTML as the output. For inferring "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct in my opinion. The only thing I dislike is that it does not produce XHTML. For example "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the order of the closing tags). Firebug of course shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such cases of wrongly ordered closing tags, but I am in doubt:
Should I rely on this browser feature and not try to produce XHTML?
Maybe "[b][i]zzz[/b]ccc[/i]" should be understood as "<b>[i]zzz</b>ccc[/i]". That looks logical for such improperly nested input, but it conflicts with the output of popular forums, which render zzz in bold italic followed by ccc in italic rather than a bold "[i]zzz" followed by a plain "ccc[/i]".
Thanks.
On your first question, I don't think that relying on browsers to correct any kind of mistake is a good idea regardless of the scope of your project (well, maybe except when you're actually doing bug tests on the browser itself). Some browsers might do an awesome job of that while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it with correct syntax to the browser in the first place.
Regarding your second question, since you're trying to have correct BBCode converted to correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, as you would no longer be writing just a converter but also a syntax checker/corrector. I have written a similar script in PHP for a rather weird game engine scripting language, but the logic could easily be applied to your case. Basically, I had a flag set for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.
At the end of a forum post (or whatever) if there are still-open tags, simply close them. If the user puts in invalid bbcode, it may look strange for the duration of their post, but it won't be disastrous.
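Here is a rough sketch of that idea with one deliberate change, which is my assumption and not part of the suggestion above: the open tags are kept in one ordered list in place of three separate counters, so closing tags can always be written in a valid order. SimpleBbCode and ToHtml are hypothetical names. A closing tag with no opener is printed literally, and anything still open at the end is closed.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class SimpleBbCode
{
    private static readonly HashSet<string> Supported =
        new HashSet<string> { "b", "i", "u" };

    public static string ToHtml(string input)
    {
        var output = new StringBuilder();
        var text = new StringBuilder();   // pending plain text, not yet escaped
        var open = new List<string>();    // currently open tags, outermost first
        int pos = 0;

        while (pos < input.Length)
        {
            string name;
            bool closing;
            int length = ReadTag(input, pos, out name, out closing);
            if (length == 0)
            {
                text.Append(input[pos]);
                pos++;
                continue;
            }

            Flush(output, text);
            int index = closing ? open.LastIndexOf(name) : -1;
            if (!closing)
            {
                open.Add(name);
                output.Append('<').Append(name).Append('>');
            }
            else if (index < 0)
            {
                // No matching opener: keep the tag as literal text.
                text.Append(input, pos, length);
            }
            else
            {
                // Close everything opened after the match, close the match,
                // then reopen the others so the nesting stays valid.
                for (int i = open.Count - 1; i >= index; i--)
                    output.Append("</").Append(open[i]).Append('>');
                open.RemoveAt(index);
                for (int i = index; i < open.Count; i++)
                    output.Append('<').Append(open[i]).Append('>');
            }
            pos += length;
        }

        Flush(output, text);
        // Close whatever the user left open.
        for (int i = open.Count - 1; i >= 0; i--)
            output.Append("</").Append(open[i]).Append('>');
        return output.ToString();
    }

    // Escapes pending text so user input cannot inject HTML, then clears it.
    private static void Flush(StringBuilder output, StringBuilder text)
    {
        output.Append(WebUtility.HtmlEncode(text.ToString()));
        text.Clear();
    }

    // Returns the length of a supported tag at `pos`, or 0 if there is none.
    private static int ReadTag(string input, int pos, out string name, out bool closing)
    {
        name = null;
        closing = false;
        if (input[pos] != '[') return 0;
        int end = input.IndexOf(']', pos);
        if (end < 0) return 0;

        string body = input.Substring(pos + 1, end - pos - 1).ToLowerInvariant();
        closing = body.StartsWith("/", StringComparison.Ordinal);
        if (closing) body = body.Substring(1);
        if (!Supported.Contains(body)) return 0;
        name = body;
        return end - pos + 1;
    }
}
```

For "[b][i]zzz[/b]ccc[/i]" this yields "<b><i>zzz</i></b><i>ccc</i>". Input whose closing tags come in exactly reversed order can leave empty elements behind, which is valid markup but not pretty.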
Regarding invalid user-submitted markup, you have at least three options:
Strip it out
Print it literally, i.e. don't convert it to HTML
Attempt to fix it.
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers," "compiler generators," or "parser generators." Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.