Creating a scripting language to be used to create web pages - c#

I am creating a scripting language to be used to create web pages, but don't know exactly where to begin.
I have a file that looks like this:
mylanguagename(main) {
OnLoad(protected) {
Display(img, text, link);
}
Canvas(public) {
Image img: "Images\my_image.png";
img.Name: "img";
img.Border: "None";
img.BackgroundColor: "Transparent";
img.Position: 10, 10;
Text text: "This is a multiline str#ning. The #n creates a new line.";
text.Name: text;
text.Position: 10, 25;
Link link: "Click here to enlarge img.";
link.Name: "link";
link.Position: 10, 60;
link.Event: link.Clicked;
}
link.Clicked(sender, link, protected) {
Image img: from Canvas.FindElement(img);
img.Size: 300, 300;
}
}
... and I need to be able to make that text above target the Windows Scripting Host. I know this can be done, because there used to be a lot of Docs on it around the net a while back, but I cannot seem to find them now.
Can somebody please help, or get me started in the right direction?
Thanks

You're making a domain-specific language which does not exist. You want to translate to another language. You will need a proper scanner and parser. You've probably been told to look at antlr. yacc/bison, or gold. What went wrong with that?
And as an FYI, it's a fun exercise to make new languages, but before you do for something like this, you might ask a good solid "why? What does my new language provide that I couldn't get any other (reasonable) way?"

The thing to understand about parsing and language creation is that writing a compiler/interpreter is primarily about a set of data transformations done to an input text.
Generally, from an input text you will first translate it into a series of tokens, each token representing a concept in your language or a literal value.
From the token stream, you will generally then create an intermediate structure, typically some kind of tree structure describing the code that was written.
This tree structure can then be validated or modified for various reasons, including optimization.
Once that's done, you'll typically write the tree out to some other form - assembly instructions or even a program in another language - in fact, the earliest versions of C++ wrote out straight C code, which were then compiled by a regular C compiler that had no knowledge of C++ at all. So while skipping the assembly generation step might seem like cheating, it has a long and proud tradition behind it :)
I deliberately haven't gotten into any suggestions for specific libraries, as understanding the overall process is probably much more important than choosing a specific parser technology, for instance. Whether you use lex/yacc or ANTLR or something else is pretty unimportant in the long run. They'll all (basically) work, and have all been used successfully in various projects.
Even doing your own parsing by hand isn't a bad idea, as it will help you to learn the patterns of how parsing is done, and so then using a parser generator will tend to make more sense rather than being a black box of voodoo.

Languages similar to C# are not easy to parse - there are some naturally left-recursive rules. So you have to use a parser generator that can deal with them properly. ANTLR fits well.
If PEG fits better, try this: http://www.meta-alternative.net/mbase.html

So you want to translate C# programs to JavaScript? Script# can do this for you.

Rather than write your own language and then run a translator to convert it into Javascript, why not extend Javascript to do what you want it to do?
Take a look at jQuery - it extends Javascript in many powerful ways with a very natural and fluent syntax. It's almost as good as having your own language. Take a look at the many extensions people have created for it too, especially jQuery UI.

Assuming you are really dedicated to do this, here is the way to go. This is normally what you should do: source -> SCANNER -> tokens -> PARSER -> syntax tree
1) Create a scanner/ parser to parse your language. You need to write a grammar to generate a parser that can scan/parse your syntax, to tokenize/validate them.
I think the easiest way here is to go with Irony, that'll make creating a parser quick and easy. Here is a good starting point
http://www.codeproject.com/KB/recipes/Irony.aspx
2) Build a syntax tree - In this case, I suggest you to build a simple XML representation instead of an actual syntax tree, so that you can later walk the XML representation of your DOM to spit out VB/Java Script. If your requirements are complex (like you want to compile it or so), you can create a DLR Expression Tree or use the Code DOM - but here I guess we are talking about a translator, and not about a compiler.
But hey wait - if it is not for educational purposes, consider representing your 'script' as an xml right from the beginning, so that you can avoid a scanner/parser in between, before spitting out some VB/Java script/Html out of that.

I don't wan to be rude... but why are you doing this?
Creating a parser for a regular language is a non-trivial task. Just don't do it.
Why don't you just use html, javascript and css (and jquery as someone above suggested)
If you don't know where to begin, then you probably don't have any experience of this kind and probably you don't have a good reason, why to do this.
I want to save you the pain. Forget it. It's probably a BAD IDEA!
M.

Check out Constructing Language Processors for Little Languages. It's a very good intro I believe. In fact I just consulted my copy 2 days ago when I was having trouble with my template language parser.
Use XML if at all possible. You don't want to fiddle with a lexer and parser by hand if you want this thing in production. I've made this mistake a few times. You end up supporting code that you really shouldn't be. It seems that your language is mainly a templating language. XML would work great there. Just as ASPX files are XML. Your server side blocks can be written in Javascript, modified if necessary. If this is a learning exercise then do it all by hand, by all means.
I think writing your own language is a great exercise. So is taking a college level compiler writing class. Good luck.

You obviously need machinery designed to translate langauges: parsing, tree building, pattern matching, target-language tree building, target-language prettyprinting.
You can try to do all of this with YACC (or equivalents), but you'll discover that parsing
is only a small part of a full translator. This means there's a lot more work
to do than just parsing, and that takes time and effort.
Our DMS Software Reengineering Toolkit is a commercial solution to building full translators for relatively modest costs.
If you want to do it on your own from the ground up as an exercise, that's fine. Just be prepared for the effort it really takes.
One last remark: designing a complete language is hard if you want to get a nice result.

Personally I think that every self-imposed challenge is good. I do agree with the other opinions that if what you want is a real solution to a real life problem, it's probably better to stick with proved solutions. However, if as you said yourself, you have an academic interest into solving this problem, then I encourage you to keep on. If this is the case, I might point a couple of tips to get you on the track.
Parsing is not really an easy task, that is way we take at least a semester of it. However, it can be learned. I would recommend starting with Terrence Parr's book on language implementation patterns. There are many great books about compiling and parsing, probably the most loved and hated been the Dragon Book.
This is pretty heavy stuff, but if you are really into this, and have the time, you should definitely take a look. This would be the Robisson Crusoe's "i'll make it all by myself approach". I have recently written an LR parser generator and it took me no more than a long weekend, but that after reading a lot and taking a full two-semesters course on compilers.
If you don't have the time or simply don't want to learn to make a parser "like men do", then you can always try a commercial or academic parser generator. ANTLR is just fine, but you have to learn its meta-language. Personally I think that Irony is a great tool, specially because it stays inside C# and you can take a look at the source code and learn for yourself. Since we are here, and I'm not trying to make any advertisement at all, I have posted a tiny tool in CodePlex that could be useful for this task. Take a look for yourself, it's open-source and free.
As a final tip, don't get scared if someone tells you it cannot be done. Parsing is a difficult theoretical problem but it's nothing that can't be learned, and it really is a great tool to have in your portfolio. I think it speaks very good of a developer that he can write an descent-recursive parser by hand, even if he never has to. If you want to pursuit this goal to its end, take a college-level compilers course, you'll thank me in a year.

Related

Implement Language Auto-Completion based on ANTLR4 Grammar

I am wondering if are there any examples (googling I haven't found any) of TAB auto-complete solutions for Command Line Interface (console), that use ANTLR4 grammars for predicting the next term (like in a REPL model).
I've written a PL/SQL grammar for an open source database, and now I would like to implement a command line interface to the database that provides the user the feature of completing the statements according to the grammar, or eventually discover the proper database object name to use (eg. a table name, a trigger name, the name of a column, etc.).
Thanks for pointing me to the right direction.
Actually it is possible! (Of course, based on the complexity of your grammar.) Problem with auto-completion and ANTLR is that you do not have complete expression and you want to parse it. If you would have complete expression, it wont be any big problem to know what kind of element is at what place and to know what can be used at such a place. But you do not have complete expression and you cannot parse the incomplete one. So what you need to do is to wrap the input into some wrapper/helper that will complete the expression to create a parse-able one. Notice that nothing that is added only to complete the expression is important to you - you will only ask for members up to last really written character.
So:
A) Create the wrapper that will change this (excel formula) '=If(' into '=If()'
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
It actually works, I have completed intellisense editor for several simple languages. There is much more infrastructure than this, but the basic idea is as I wrote it. Only be careful, writing the wrapper is not easy if not impossible if the grammar is really complex. In that case look at Papa Carlo project. http://lakhin.com/projects/papa-carlo/
As already mentioned auto completion is based on the follow set at a given position, simply because this is what we defined in the grammar to be valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one). And this information is independent of the parser. And since a parser is made to parse valid input (and during auto completion you have most of the time invalid input), it's not the right tool for this task.
Knowing what token can follow at a given position is useful to control the entire process (e.g. you don't want to show suggestions if only a string can appear), but is most of the time not what you actually want to suggest (except for keywords). If an ID is possible at the current position, it doesn't tell you what ID is actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:
A symbol table that provides you with all possible names sorted by scope. Creating this depends heavily on the parsed language. But this is a task where a parser is very helpful. You may want to cache this info as it is time consuming to run this analysis step.
Determine in which scope you are when invoking auto completion. You could use a parser as well here (maybe in conjunction with step 1).
Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all necessary information (the follow set). But as mentioned above that's not true (keywords aside).
In my blog post Universal Code Completion using ANTLR3 I especially addressed the 3rd step. There I don't use a parser, but simulate one, only that I don't stop when a parser would, but when the caret position is reached (so it is essential that the input must be valid syntax up to that point). After reaching the caret the collection process starts, which not only collects terminal nodes (for keywords) but looks at the rule names to learn what needs to be collected too. Using specific rule names is my way there to put context into the grammar, so when the collection code finds a rule table_ref it knows that it doesn't need to go further down the rule chain (to the ultimate ID token), but instead can use this information to provide a list of tables as suggestion.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially doing what I do manually in my implementation (with the ANTLR3 backend).
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. Its sort of an interesting idea, but I do not understand how his method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")

How to work with textual formats in otherwise procedural code?

This question sounds trivial but let me explain my scenario.
I am working in an object oriented programming language (C#) and most of the actual execution code is procedural, i.e. series of statements, sometimes branches and loops. Fairly standard.
Now I am presented with a task to deal with a textual format (PGN, but it could be anything other like VCard or some custom format). At least for me, the "standard" way to work with it would be to use a mix of:
regular expressions
if / switch statements
for-loops
storing regexp matches into some custom structure and / or outputting it to some result format
However, I don't like this procedural approach at all - regular expressions are prone to errors, the code is usually quite hard to understand and debug, it usually tends to have quite a high cyclomatic complexity etc.
Simply put, I'd like it to be declarative but I don't know what tools or libraries to use.
I remember that when I saw demos of the "M" language I thought that that was exactly I was looking for. There was a simple way to declare syntax of my textual format, the tool would then automatically parse input string into an in-memory representation of the textual DSL, I think that it was also possible to transform the format into another etc.
I have been also in touch with the people behind JetBrains MPS which is another tool for working with DSLs but my scenario doesn't seem to be a perfect match for what they are trying to provide.
So if anyone has any idea about how to elegantly deal with textual formats in otherwise procedural code base, I'd be happy to learn about the options.
Check out my open source project meta#. I think it sounds like exactly what you're looking for.

Dynamic user control over variables (embedded language?)

I'm creating a piece of software (written in C#, will be a windows application) and I ran into this problem-
I've got a set of variables, and I need to allow the user to define a wide range of mathematical functions on those variables.
But my users don't necessarily have to have any prior knowledge about programming.
The options I've considered are:
Create some sort of GUI for defining the mathematical "functions". But that is very limiting.
Implement a very simple embedded language, that will offer flexibility while remaining relatively easy to understand. I looked at Lua, but the problem with that is that you pretty much need to have prior knowledge in programming. I was thinking about something more readable (somewhat similar to SQL), for example "assign 3 to X;"
Other ideas are welcome.
I'm basically looking for the best way to go here, under the assumption that my users don't have any knowledge in programming.
However, note that this is not the main feature of my software, so I'm assuming that if a user wants/needs to use this feature, he will take the time to look at the manual for a few minutes and learn how to do so, as long as it's not too complicated.
Thanks, Malki :)
What you want is a domain specific language. I see you've tried Lua and didn't find that acceptable--I'll assume that most pre-built scripting languages are out then.
Depending on your expected function complexity, I would recommend that you give a shot at implementing a small recursive-descent parser so that you can exactly specify your language. This way you can realize something like:
assign 3 to X
show sin(X * 5)
If this is a bit beyond what you're willing to do, you can get some parsing assistance from a library such as Irony; this will let you focus on using the abstract syntax tree rather than playing with tokenizing/lexing for some time.
If you want, you can even look at FLEE, which will parse and evaluate some pretty complex expressions right out of the gate.
ANTLR is a greate parser if you want to make your own language

C# Interpreted Language

I am looking to write an interpreted language in C#, where should I start? I know how I would do it using fun string parsing, but what is the correct way?
It can be a pretty difficult endeavour to do right.
If you don't have much knowledge in compiler theory you should probably first start reading about it.
Just using "fun string parsing", if I understand that term correctly, isn't going to get you very far at all.
The first basic step is to write your language grammar that defines the valid syntax for the language.
A tool like ANTLR will help you get the pieces together, but I would suggest reading the Dragon book as it is the canonical starting point to get up to speed on the subject.
If you want to build an interpreted language on .NET, the DLR is the way to go - check out Martin Maly's LOLCODE sample at http://www.iunknown.com/2007/11/lolcode-on-dlr.html
Edit: Here's another link with more information from Scott Hanselman: http://www.hanselman.com/blog/TheWeeklySourceCode11LOLCodeDLREdition.aspx
Checkout the Phoenix compiler from Microsoft. This will provide many of the tools you will need to build a compiler targeting native or managed environments. Among these tools us a optimizing back end.
I second Cycnus' suggestion on reading Aho Sethi and Ullman's "Dragon Book" (Wikipedia, Amazon).
RGR

Looking for a configurable pretty printer for C# code

I work on a team with about 10 developers. Some of the developers have very exacting formatting needs. I would like to find a pretty printer that I could configure to these specifications and then add to the build processes. In this way no matter how badly other people mess up the format when it is pulled down from source control it will look acceptable.
The easiest solution is for the team lead to mandate a format and everyone use it. The VS defaults are pretty good.
Jeff Atwood did that to us here on Stack Overflow and while I rebelled at first, I got over it :) Makes everything much easier!
Coding standards are definitely something we have. The coding formatting I am talking about is imposed by a grizzled architect that is, lets say, set in his ways and extremely particular. Lets just pretend that we can not address the human factor. I was looking for a way to circumvent the whole human processes.
The visual studio defaults sadly do not address line breaks very well. I am just making this line chopping style up but....
ServiceLocator.Logger.WriteDefault(string.format("{0}{1}"
,foo
,bar)
,Logging.SuperDuper);
another example of formatting visual studio is not too hot at....
if( foo
&& ( bar
|| baz
|| apples
|| oranges)
&& IsFoo()
&& IsBar() ){
}
Visual studio does not play well at all will stuff like this. We are currently using ReSharper to allow for more granularity with formating but it sadly falls sort in many areas.
Don't get me wrong though coding standards are great. The goal of the pretty printer as part of the build process is to get 'perfect' looking code no matter how well people are paying attention or counting their spaces.
The edge cases around code formatting are very solvable since it is a well defined grammar.
As far as the VS defaults go I can only say: BSD style or die!
So all that brings me full circle back to: Is there a configurable pretty printer for C#? As much as lexical analysis and parsing fascinate I have about had my fill making a YAML C# tool chain.
Your issue was the primary intent for creating NArrange (beta). It allows configurable reformatting of C# code and you can use one common configuration file to be shared by the entire team. Since its focus is primarily on reordering members in classes and controlling regions, it is still lacking many necessary formatting options (especially formatting within member code lines).
The normal usage scenario is for each developer to run the tool prior to check-in. I'm not aware of any one running it is part of their build process, but there is no reason why you couldn't, since it is a command-line tool. One idea that I've contemplated is running NArrange on files as part of a pre-commit step. If the original file contents being checked in don't match the NArrange formatted output on the source repository server, then the developer didn't reformat to the rules and a check-in error can be raised.
For more information, see my CodeProject article on Using NArrange to Organize C# Code.
Update 2023-02-25: NArrange appears to have moved to Github. The NArrange site (referenced above) is no longer available although there are copies in web.archive.org
I second Jarrod's answer. If you have 2 developers with conflicting coding preferences, then get the rest of the team to vote, and then get the boss to back the majority decision.
Additionally, the problem with trying to automatically apply a pretty printer like that, is that there will always be exceptional cases where your blanket coding standard is not the best or most readable solution, and you will lose out by squashing them with an automated tool.
Coding Standards are just that, standards. They don't call them Coding Laws or Coding Rules, and there's a good reason for that.

Categories