I'd like to parse formatted basic values and a few custom strings from a TextReader - essentially like scanf allows.
My input might not have line breaks, so ReadLine+Regex isn't an option. I could use some other way of chunking the text input, but the problem is that I don't know the delimiter at compile time (so that's tricky), and that the delimiter might be localization-dependent. For instance, a float followed by a comma might be "1.5," or "1,5,", but in both cases attempting to parse the float should be "greedy".
To be safe, I'd like to assume my input is actively hostile (say, streaming in from a network stream): i.e. intentionally missing chunking delimiters.
I'd like to avoid custom Regexes: int.Parse and double.Parse work well and are localization-aware. Don't get me started on DateTimes - I might need a few custom patterns anyhow, but writing Regexes to cover that scenario doesn't sound like fun.
For a concrete example, let's say I have a TextReader and that I know the next value should be a double - how can I extract that double and possibly a limited amount of lookahead without reading the entire stream and without manually writing a localizable double-parser?
Similar Questions
There's a previous question "Looking for C# equivalent of scanf" which sounds similar but the Q+A focus on readline+regex (which I'd like to avoid). How can I use Regex against a TextReader? didn't find an answer (beyond chunking), and in any case I'd like to avoid writing my own Regexes.
Based on that lack of answers, and still not having found anything myself, it seems that:
There is no means to use localized parsing directly from Streams (or TextReaders) in .NET, nor is there a way to know how much of the stream corresponds to a parseable prefix in a systematic way.
There is no means to apply regular expressions to Streams (or TextReaders) in .NET, so there's no easy way of implementing something like this yourself.
If you really need something like this, the easiest option is a full-fledged parser generator. ANTLR works well for this; it has a lot of existing grammars you can copy-paste for the basics, it comes with a GUI to help you understand your grammar, and it makes parsers for .NET, Java, C and a host of other languages. It's developer-friendly and fast... but way too powerful and flexible for what I need; like shooting a bug with a shotgun - I'm not thrilled with this solution.
Related
This question sounds trivial but let me explain my scenario.
I am working in an object oriented programming language (C#) and most of the actual execution code is procedural, i.e. series of statements, sometimes branches and loops. Fairly standard.
Now I am presented with a task to deal with a textual format (PGN, but it could be anything else, such as vCard or some custom format). At least for me, the "standard" way to work with it would be to use a mix of:
regular expressions
if / switch statements
for-loops
storing regexp matches into some custom structure and / or outputting it to some result format
However, I don't like this procedural approach at all - regular expressions are prone to errors, the code is usually quite hard to understand and debug, it usually tends to have quite a high cyclomatic complexity etc.
Simply put, I'd like it to be declarative but I don't know what tools or libraries to use.
I remember that when I saw demos of the "M" language I thought that it was exactly what I was looking for. There was a simple way to declare the syntax of my textual format, and the tool would then automatically parse an input string into an in-memory representation of the textual DSL. I think it was also possible to transform the format into another, etc.
I have also been in touch with the people behind JetBrains MPS, which is another tool for working with DSLs, but my scenario doesn't seem to be a perfect match for what they are trying to provide.
So if anyone has any idea about how to elegantly deal with textual formats in otherwise procedural code base, I'd be happy to learn about the options.
Check out my open source project meta#. I think it sounds like exactly what you're looking for.
I am creating a scripting language to be used to create web pages, but don't know exactly where to begin.
I have a file that looks like this:
mylanguagename(main) {
    OnLoad(protected) {
        Display(img, text, link);
    }
    Canvas(public) {
        Image img: "Images\my_image.png";
        img.Name: "img";
        img.Border: "None";
        img.BackgroundColor: "Transparent";
        img.Position: 10, 10;
        Text text: "This is a multiline str#ning. The #n creates a new line.";
        text.Name: text;
        text.Position: 10, 25;
        Link link: "Click here to enlarge img.";
        link.Name: "link";
        link.Position: 10, 60;
        link.Event: link.Clicked;
    }
    link.Clicked(sender, link, protected) {
        Image img: from Canvas.FindElement(img);
        img.Size: 300, 300;
    }
}
... and I need to be able to make the text above target the Windows Script Host. I know this can be done, because there used to be a lot of docs on it around the net a while back, but I cannot seem to find them now.
Can somebody please help, or get me started in the right direction?
Thanks
You're making a domain-specific language which does not exist yet, and you want to translate it to another language. You will need a proper scanner and parser. You've probably been told to look at ANTLR, yacc/bison, or GOLD. What went wrong with that?
And as an FYI, it's a fun exercise to make new languages, but before you do for something like this, you might ask a good solid "why? What does my new language provide that I couldn't get any other (reasonable) way?"
The thing to understand about parsing and language creation is that writing a compiler/interpreter is primarily about a set of data transformations done to an input text.
Generally, from an input text you will first translate it into a series of tokens, each token representing a concept in your language or a literal value.
From the token stream, you will generally then create an intermediate structure, typically some kind of tree structure describing the code that was written.
This tree structure can then be validated or modified for various reasons, including optimization.
Once that's done, you'll typically write the tree out to some other form - assembly instructions or even a program in another language. In fact, the earliest versions of C++ wrote out straight C code, which was then compiled by a regular C compiler that had no knowledge of C++ at all. So while skipping the assembly generation step might seem like cheating, it has a long and proud tradition behind it :)
I deliberately haven't gotten into any suggestions for specific libraries, as understanding the overall process is probably much more important than choosing a specific parser technology, for instance. Whether you use lex/yacc or ANTLR or something else is pretty unimportant in the long run. They'll all (basically) work, and have all been used successfully in various projects.
Even doing your own parsing by hand isn't a bad idea, as it will help you to learn the patterns of how parsing is done, and so then using a parser generator will tend to make more sense rather than being a black box of voodoo.
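The stages described above (text to tokens, tokens to tree, tree to another form) can be sketched end to end for a toy language. This is only an illustration under stated assumptions: the "language" is unsigned integers with `+` and `*`, the output form is prefix notation, and every name here (`tokenize`, `Node`, `Parser`, `emit`) is made up for the example rather than taken from any library.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stage 1: text -> tokens (unsigned integers and the operators + and *).
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isdigit(c)) {
            const std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back(src.substr(start, i - start));
        } else if (c == '+' || c == '*') {
            tokens.push_back(std::string(1, src[i++]));
        } else {
            throw std::runtime_error("unexpected character");
        }
    }
    return tokens;
}

// Stage 2: tokens -> tree. Leaves hold numbers; inner nodes hold an operator.
struct Node {
    std::string value;
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> make_node(std::string value, std::unique_ptr<Node> left = nullptr,
                                std::unique_ptr<Node> right = nullptr) {
    auto node = std::make_unique<Node>();
    node->value = std::move(value);
    node->left = std::move(left);
    node->right = std::move(right);
    return node;
}

class Parser {
public:
    explicit Parser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    std::unique_ptr<Node> parse() {
        auto tree = sum();
        if (pos_ != tokens_.size()) throw std::runtime_error("unexpected trailing token");
        return tree;
    }

private:
    bool next_is(const std::string& op) const {
        return pos_ < tokens_.size() && tokens_[pos_] == op;
    }
    // sum := product ('+' product)*
    std::unique_ptr<Node> sum() {
        auto left = product();
        while (next_is("+")) {
            ++pos_;
            auto right = product();
            left = make_node("+", std::move(left), std::move(right));
        }
        return left;
    }
    // product := number ('*' number)*
    std::unique_ptr<Node> product() {
        auto left = number();
        while (next_is("*")) {
            ++pos_;
            auto right = number();
            left = make_node("*", std::move(left), std::move(right));
        }
        return left;
    }
    std::unique_ptr<Node> number() {
        if (pos_ >= tokens_.size() ||
            !std::isdigit(static_cast<unsigned char>(tokens_[pos_][0])))
            throw std::runtime_error("expected a number");
        return make_node(tokens_[pos_++]);
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};

// Stage 3: tree -> text in another notation (here: prefix form).
std::string emit(const Node& node) {
    if (!node.left) return node.value;
    return "(" + node.value + " " + emit(*node.left) + " " + emit(*node.right) + ")";
}
```

Feeding "1 + 2 * 3" through all three stages gives "(+ 1 (* 2 3))". A real translator would add a validation or optimization pass between stages 2 and 3, but the overall shape stays the same.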
Languages similar to C# are not easy to parse - there are some naturally left-recursive rules. So you have to use a parser generator that can deal with them properly. ANTLR fits well.
If PEG fits better, try this: http://www.meta-alternative.net/mbase.html
So you want to translate C# programs to JavaScript? Script# can do this for you.
Rather than write your own language and then run a translator to convert it into JavaScript, why not extend JavaScript to do what you want it to do?
Take a look at jQuery - it extends JavaScript in many powerful ways with a very natural and fluent syntax. It's almost as good as having your own language. Take a look at the many extensions people have created for it too, especially jQuery UI.
Assuming you are really dedicated to do this, here is the way to go. This is normally what you should do: source -> SCANNER -> tokens -> PARSER -> syntax tree
1) Create a scanner/ parser to parse your language. You need to write a grammar to generate a parser that can scan/parse your syntax, to tokenize/validate them.
I think the easiest way here is to go with Irony, that'll make creating a parser quick and easy. Here is a good starting point
http://www.codeproject.com/KB/recipes/Irony.aspx
2) Build a syntax tree - In this case, I suggest you build a simple XML representation instead of an actual syntax tree, so that you can later walk the XML representation of your DOM to spit out VBScript/JavaScript. If your requirements are complex (for example, you want to compile it), you can create a DLR Expression Tree or use the CodeDOM - but here I guess we are talking about a translator, and not about a compiler.
But hey, wait - if it is not for educational purposes, consider representing your 'script' as XML right from the beginning, so that you can avoid a scanner/parser in between, before spitting out some VBScript/JavaScript/HTML from it.
I don't want to be rude... but why are you doing this?
Creating a parser for a full language is a non-trivial task. Just don't do it.
Why don't you just use HTML, JavaScript and CSS (and jQuery, as someone above suggested)?
If you don't know where to begin, then you probably don't have any experience of this kind, and you probably don't have a good reason to do this.
I want to save you the pain. Forget it. It's probably a BAD IDEA!
M.
Check out Constructing Language Processors for Little Languages. It's a very good intro I believe. In fact I just consulted my copy 2 days ago when I was having trouble with my template language parser.
Use XML if at all possible. You don't want to fiddle with a lexer and parser by hand if you want this thing in production. I've made this mistake a few times. You end up supporting code that you really shouldn't be. It seems that your language is mainly a templating language. XML would work great there. Just as ASPX files are XML. Your server side blocks can be written in Javascript, modified if necessary. If this is a learning exercise then do it all by hand, by all means.
I think writing your own language is a great exercise. So is taking a college level compiler writing class. Good luck.
You obviously need machinery designed to translate languages: parsing, tree building, pattern matching, target-language tree building, target-language prettyprinting.
You can try to do all of this with YACC (or equivalents), but you'll discover that parsing is only a small part of a full translator. This means there's a lot more work to do than just parsing, and that takes time and effort.
Our DMS Software Reengineering Toolkit is a commercial solution to building full translators for relatively modest costs.
If you want to do it on your own from the ground up as an exercise, that's fine. Just be prepared for the effort it really takes.
One last remark: designing a complete language is hard if you want to get a nice result.
Personally I think that every self-imposed challenge is good. I do agree with the other opinions that if what you want is a real solution to a real-life problem, it's probably better to stick with proven solutions. However, if, as you said yourself, you have an academic interest in solving this problem, then I encourage you to keep on. If this is the case, I might offer a couple of tips to get you on track.
Parsing is not really an easy task; that is why we take at least a semester of it. However, it can be learned. I would recommend starting with Terence Parr's book on language implementation patterns. There are many great books about compiling and parsing, probably the most loved and hated being the Dragon Book.
This is pretty heavy stuff, but if you are really into this, and have the time, you should definitely take a look. This would be the Robinson Crusoe "I'll make it all by myself" approach. I have recently written an LR parser generator and it took me no more than a long weekend, but that was after reading a lot and taking a full two-semester course on compilers.
If you don't have the time or simply don't want to learn to make a parser "like men do", then you can always try a commercial or academic parser generator. ANTLR is just fine, but you have to learn its meta-language. Personally I think that Irony is a great tool, especially because it stays inside C# and you can take a look at the source code and learn for yourself. Since we are here, and I'm not trying to advertise at all, I have posted a tiny tool on CodePlex that could be useful for this task. Take a look for yourself; it's open-source and free.
As a final tip, don't get scared if someone tells you it cannot be done. Parsing is a difficult theoretical problem, but it's nothing that can't be learned, and it really is a great tool to have in your portfolio. I think it speaks very well of a developer that he can write a recursive-descent parser by hand, even if he never has to. If you want to pursue this goal to its end, take a college-level compilers course; you'll thank me in a year.
Background
I have written a very simple BBCode parser in C# which transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid, regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode to HTML.
Question
Does standard "BBCode to HTML" specification exist?
How should I handle "[b][b][/b][/b]"? For now parser yields "<b>[b][/b]</b>".
How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" output for such a case, but I wonder whether this approach is "too smart" or not.
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, they use tons of regular expressions and do not produce the markup I expect. Ideally, I want XHTML as output. To infer the "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct in my opinion. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the order of the closing tags). Firebug, of course, shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such wrongly ordered closing tags, but I have two doubts:
Should I rely on this browser feature and not try to produce XHTML?
Maybe "[b][i]zzz[/b]ccc[/i]" must be understood as "<b>[i]zzz</b>ccc[/i]" - that looks logical for such improper formatting, but it conflicts with the output of popular forums (bold italic zzz followed by italic ccc, not bold [i]zzz followed by plain ccc[/i]).
Thanks.
On your first question, I don't think that relying on browsers to correct any kind of mistake is a good idea, regardless of the scope of your project (well, maybe except when you're actually doing bug tests on the browser itself). Some browsers might do an awesome job of that while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it to the browser with correct syntax in the first place.
Regarding your second question, since you're trying to have correct BBCode converted to correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, as you would not be writing just a converter anymore, but also a syntax checker/corrector. I have written a similar script in PHP for a rather weird game engine scripting language, but the logic could easily be applied to your case. Basically, I had a flag set for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
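One way such a checker/corrector could work is to keep a stack of open tags instead of a single flag per tag. The sketch below is in C++ rather than C# or PHP and is not the script mentioned above; `bbcode_to_xhtml` is a hypothetical name. It assumes only [b], [i] and [u] exist, prints unmatched closing tags literally, and leaves out HTML escaping of the text itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical converter: always emits properly nested tags by closing and
// reopening whatever is still open when a tag is closed "out of order".
std::string bbcode_to_xhtml(const std::string& input) {
    std::string out;
    std::vector<char> open;  // currently open tags, innermost last
    auto write = [&out](bool closing, char name) {
        out += closing ? "</" : "<";
        out += name;
        out += '>';
    };
    std::size_t i = 0;
    while (i < input.size()) {
        // Recognise "[x]" or "[/x]" with x in {b, i, u} at position i.
        const bool closing = input.compare(i, 2, "[/") == 0;
        const std::size_t name_pos = i + (closing ? 2 : 1);
        const bool is_tag = input[i] == '[' && name_pos + 1 < input.size() &&
                            std::string("biu").find(input[name_pos]) != std::string::npos &&
                            input[name_pos + 1] == ']';
        if (!is_tag) {
            out += input[i++];  // ordinary text
            continue;
        }
        const char name = input[name_pos];
        const std::size_t tag_len = closing ? 4 : 3;
        if (!closing) {
            open.push_back(name);
            write(false, name);
        } else if (std::find(open.begin(), open.end(), name) == open.end()) {
            out.append(input, i, tag_len);  // nothing to close: keep it literal
        } else {
            // Close inner tags down to the match, then reopen them.
            std::vector<char> reopen;
            while (open.back() != name) {
                write(true, open.back());
                reopen.push_back(open.back());
                open.pop_back();
            }
            write(true, name);
            open.pop_back();
            for (auto it = reopen.rbegin(); it != reopen.rend(); ++it) {
                write(false, *it);
                open.push_back(*it);
            }
        }
        i += tag_len;
    }
    // Close anything the author forgot to close.
    while (!open.empty()) {
        write(true, open.back());
        open.pop_back();
    }
    return out;
}
```

For "[b][i]zzz[/b]ccc[/i]" this produces "<b><i>zzz</i></b><i>ccc</i>", which renders the same as the HTML suggested above but may leave empty elements behind for more tangled input.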
If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.
At the end of a forum post (or whatever) if there are still-open tags, simply close them. If the user puts in invalid bbcode, it may look strange for the duration of their post, but it won't be disastrous.
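The counter idea could look roughly like the following. It is a sketch in C++ under a few assumptions of its own: `bbcode_counted` is a hypothetical name, a closing tag with no open counterpart is printed literally, and leftover tags are closed in a fixed order at the end. As the answer says, the result is not guaranteed to be well nested; it only guarantees that nothing stays open.

```cpp
#include <cstddef>
#include <string>

// Hypothetical converter using one counter per tag instead of a stack.
std::string bbcode_counted(const std::string& input) {
    const std::string names = "biu";
    int depth[3] = {0, 0, 0};  // open count for b, i, u
    std::string out;
    std::size_t i = 0;
    while (i < input.size()) {
        const bool closing = input.compare(i, 2, "[/") == 0;
        const std::size_t name_pos = i + (closing ? 2 : 1);
        std::size_t which = std::string::npos;
        if (input[i] == '[' && name_pos + 1 < input.size() && input[name_pos + 1] == ']')
            which = names.find(input[name_pos]);
        if (which == std::string::npos) {
            out += input[i++];  // ordinary text
            continue;
        }
        const std::size_t tag_len = closing ? 4 : 3;
        if (!closing) {
            ++depth[which];  // +1 each time the tag is opened
            out += '<';
            out += names[which];
            out += '>';
        } else if (depth[which] > 0) {
            --depth[which];  // -1 each time it is closed
            out += "</";
            out += names[which];
            out += '>';
        } else {
            out.append(input, i, tag_len);  // nothing open: keep it literal
        }
        i += tag_len;
    }
    // At the end of the post, close whatever is still open (u, then i, then b).
    for (std::size_t t = names.size(); t-- > 0;) {
        for (; depth[t] > 0; --depth[t]) {
            out += "</";
            out += names[t];
            out += '>';
        }
    }
    return out;
}
```

So "[b]bold" becomes "<b>bold</b>", while crossed tags such as "[b][i]x[/b][/i]" pass through in their original, badly nested order.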
Regarding invalid user-submitted markup, you have at least three options:
Strip it out
Print it literally, i.e. don't convert it to HTML
Attempt to fix it.
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers," "compiler generators," or "parser generators." Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.
I am looking for something that will take a complex search string and allow me to test it against some text to determine whether the text meets the search criteria.
I would like to support query syntax similar to google/twitter (i.e. support for: and, or, not, exact string, wildcards, etc) and would also like it to handle plurals of words (maybe synonyms if I could have my cake and eat it). I guess what I want is the analysis and query aspects of a search engine without building and maintaining an index.
I really would like to avoid developing this, and thought that it seems like it might be a fairly common requirement. But I have been unable to identify anything in the .net world that specifically meets my needs.
I thought I might be able to use elements of Lucene.net to do this, but have no experience with it. So I would like to know if anybody out there has any ideas that might help or if they have done this before (and what they used). Would be happy to consider non-.NET solutions if integration is possible.
Any input is much appreciated.
Regards
Allen
Regex is exactly your solution.
The only things you mentioned that it doesn't support are synonyms and plurals, obviously, because those are language-dependent. But I guess you can easily get a list of synonyms or irregular plurals in English or something like that, and then write your own Regex builder for those (really easy).
Regex is short for Regular Expressions, and is a well-known engine that exists in the libraries of a lot of languages.
A nice site you can learn Regex from is http://www.regular-expressions.info/.
In .NET, all the Regex-related classes are in System.Text.RegularExpressions. You can guess quite easily by yourself how to use them... (or just google "C# Regex" or something)
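As a rough idea of what such a "Regex builder" might produce for a single search term, here is a sketch in C++ using `std::regex` (the .NET classes mentioned above would be used in much the same spirit). `contains_term` is a hypothetical helper, the plural rule is a deliberately naive "s or es" suffix, and the term is assumed to consist of plain letters or digits so that it needs no escaping. Operators such as and, or and not would have to be layered on top by combining several such checks.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` contains `word`, or a naive English
// plural of it, as a whole word (case-insensitive).
bool contains_term(const std::string& text, const std::string& word) {
    // Assumes `word` has no regex metacharacters; real code would escape it.
    const std::regex pattern("\\b" + word + "(s|es)?\\b",
                             std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(text, pattern);
}
```

With this, "Two cats sat" matches the term "cat", while "concatenate" does not, because the word boundaries keep the term from matching inside a longer word.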
I'm writing a scanner as part of a compiler.
I'm having a major headache trying to write this one portion:
I need to be able to parse a stream of tokens and push them one by one into a vector, ignoring whitespace and tokenizing special symbols (as a simple case, let's just consider parentheses and braces).
Example:
int main(){ }
should parse into 6 different tokens:
int
main
(
)
{
}
How would you go about solving this? I'm writing this in C++, but a Java/C# solution would be appreciated as well.
Some points:
and no, I can't use Boost, I can't guarantee that the libraries will be available to me. (don't ask...)
I don't want to use lex, or any other special tools. I've never done this before and just want to try this once to say I've done it.
Stroustrup's book, The C++ Programming Language, has a great example in it about building a lexer/parser for a simple calculator program. It should serve as a good starting point to learn how to do what you want.
Buy a copy of Compilers: Principles, Techniques, and Tools (the Dragon Book). What you're attempting to write is usually called a lexer; "scanner" is another name for the same thing.
Why write your own - look at Lex.
If you must have your own, you just read the input character by character and maintain some minimum state to accumulate identifiers.
The problem itself is not hard. If you can't solve it, you must be burned out, you just need a rest. Look at it again in the morning.
If you really want to learn something from this exercise, just start coding. It doesn't demand a lot of code, so you can fail repeatedly without blowing more than an afternoon.
At this point you'll have a good feel for the problem.
Then look in any random compilers book to see what the "usual" methods are, and you'll grok them immediately.
Umm... I'd just do a while loop with iterators, testing each character for its type, and on an alpha to non-alpha change, dump the string if it's non-empty. If it's a non-alpha, non-whitespace character, I'd just push it onto the token stack. This is really a trivial parsing task. Shoot, I've been meaning to learn lex/yacc, but the level of parsing you want is really easy. I wrote an HTML tokenizer once, which is more complicated than this. I mean, you are just looking for names, whitespace and single non-alphanumeric characters... just do it.
If you want to write this from scratch, you could look into writing a finite state machine (states in an enum, a big switch/case block for state switching). You'd have to push the state to a stack since everything can be nested.
I know that this is not the ideal method; I'm just trying to directly address the question.