Building Interpreter Of a Document Format

I'm going to start developing my own document format (like PDF, XPS, DOC, RTF...), and I'd like to know where I can find tutorials and how-tos. I don't want code: this is a project where I want to learn how to build it myself, not lean on someone else's experience.
PS: I want to make it like a XML file:
[Command Argument="Define it" Argument2="Something"]
It's like PDF, but this syntax will be interpreted by a program that I will build using C#, just like HTML and your browser ;)
Remember that my question is about the program that will interpret this code, but a tutorial on interpreting XML would be a good place to start ;)

I assume you're doing this for the sake of learning how to do it. If that's the case, it is a worthwhile venture and I understand.
You'll want to start out by learning LL parsers and grammars. That will help you interpret the document that has been read from a file into a document object model (DOM). From there you can create routines to manipulate or render that document tree.
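To make the "read text into a document object model" step concrete, here is a minimal hand-written sketch for the bracket syntax shown in the question. The CommandNode and CommandParser names are hypothetical, and the sketch assumes one command per input string, letters and digits in names, and no escaped quotes inside values. It is an illustration under those assumptions, not a full LL parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical DOM node for one [Command Arg="value"] element.
public sealed class CommandNode
{
    public string Name { get; }
    public Dictionary<string, string> Arguments { get; } = new Dictionary<string, string>();

    public CommandNode(string name) { Name = name; }
}

public static class CommandParser
{
    // Parses a single command such as:
    // [Command Argument="Define it" Argument2="Something"]
    public static CommandNode Parse(string text)
    {
        int pos = 0;
        SkipSpaces(text, ref pos);
        Expect(text, ref pos, '[');
        var node = new CommandNode(ReadName(text, ref pos));
        SkipSpaces(text, ref pos);

        // Read Name="value" pairs until the closing bracket.
        while (pos < text.Length && text[pos] != ']')
        {
            string key = ReadName(text, ref pos);
            Expect(text, ref pos, '=');
            Expect(text, ref pos, '"');
            int start = pos;
            while (pos < text.Length && text[pos] != '"') pos++;
            string value = text.Substring(start, pos - start);
            Expect(text, ref pos, '"');
            node.Arguments[key] = value;
            SkipSpaces(text, ref pos);
        }

        Expect(text, ref pos, ']');
        return node;
    }

    private static void SkipSpaces(string s, ref int pos)
    {
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
    }

    // Consumes the expected character or reports where parsing failed.
    private static void Expect(string s, ref int pos, char c)
    {
        if (pos >= s.Length || s[pos] != c)
            throw new FormatException($"Expected '{c}' at position {pos}.");
        pos++;
    }

    private static string ReadName(string s, ref int pos)
    {
        int start = pos;
        while (pos < s.Length && char.IsLetterOrDigit(s[pos])) pos++;
        if (pos == start)
            throw new FormatException($"Expected a name at position {pos}.");
        return s.Substring(start, pos - start);
    }
}
```

Calling CommandParser.Parse on the line from the question should give a node named Command with two arguments; a renderer could then walk a list of such nodes.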
Good luck!

I'm confused as to what you're asking, but if you need your own format like an XML file, why not just use XML to describe the format?
Edit: Okay, I think I understand now. If you're doing this for fun and for learning (which is great), then there are lots of approaches to take. In fact, it may even be better to not do any research, try to come up with a solution on your own and see if it works, what you need to do to make it better, etc.

Sounds like a good learning project and you've got some good pointers here already. I would just add that you should remember that there is a difference between a document file language and a document format.
Consider OOXML, it is a document format that is built on top of XML (what I'd describe as the file language). If your purpose is to learn about building your own document format then I'd highly recommend starting with XML so that you don't have to reinvent a language parser. This will let you focus on the concerns around building the format.
That said, good on you if you want to play around with creating your own language; just wanted to make sure you realized that they are different beasts.
Here are some links that will help you get started using XML in C#:
Xml Tutorial (video)
XML Document overview
Reading Xml data with an XmlReader
Writing Xml data with an XmlWriter
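As a rough illustration of the XmlReader approach named in the list above, the sketch below walks an XML string and collects element names in document order. The XmlReaderDemo class and its method name are made up for this example; a real document format would dispatch on each element instead of just listing it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderDemo
{
    // Returns the names of all elements in the order the reader encounters them.
    public static List<string> ElementNames(string xml)
    {
        var names = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            // Read() advances node by node; we only keep element start tags.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.Name);
            }
        }
        return names;
    }
}
```

Because XmlReader is forward-only, this style suits large files; for small documents, loading everything into memory is usually simpler.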

Far be it from me to forbid you from re-inventing the wheel for the sake of learning something new. Good for you for trying this out. However, if you are going to ask questions about how to do it you are going to need to specify your questions a little more.
Are you looking for help on:
Designing your framework / format
Planning your time / Estimating deadlines
Working with XML
Working with C#
Building a web-based C# application
Building a PC-based C# application
Other aspects of development entirely
There are many people here who want to help -- but the best answers are given to focused questions (not necessarily specific, but always focused.)

There are a couple of ways to approach this. One way would be to define the format of the file first, then use a parser generator to create C# code that can read that format. A Google search for "c# parser generator" will get you links to a number of different libraries you can use.
Alternatively, you could code your own parser, from scratch. This will be more work than using a parser generation tool, but might be more educational in the end.
The define-a-grammar approach may be total overkill for a simple format. Another way to approach the problem is to design the object tree that you'll use in-app first, then write serialization and de-serialization routines to save and load the contents from a file. The serialization interface in C# is pretty flexible, and you can serialize to binary or XML files easily.
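A minimal sketch of the "object tree first" idea, using XmlSerializer from System.Xml.Serialization. The Document and Paragraph classes are hypothetical stand-ins for whatever tree your application actually needs, and the sketch serializes to a string rather than a file to keep it short.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical in-app object tree for a very small document format.
public class Paragraph
{
    [XmlAttribute] public string Style { get; set; }
    [XmlText] public string Text { get; set; }
}

public class Document
{
    public string Title { get; set; }
    public List<Paragraph> Paragraphs { get; set; } = new List<Paragraph>();
}

public static class DocumentStorage
{
    // Serializes the object tree to an XML string.
    public static string Save(Document doc)
    {
        var serializer = new XmlSerializer(typeof(Document));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, doc);
            return writer.ToString();
        }
    }

    // Rebuilds the object tree from XML produced by Save.
    public static Document Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(Document));
        using (var reader = new StringReader(xml))
        {
            return (Document)serializer.Deserialize(reader);
        }
    }
}
```

Swapping StringWriter and StringReader for file streams would give the save and load routines described above.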
I think it should be relatively straightforward to create your own serializer to create a file formatted however you like, but MSDN is not being my friend today, so I can't find the relevant documentation.

Read from XML write another XML

I want to write software that reads something from one XML file and writes something else to another XML file. For example:
From here I want the software to read all the values between the tags, i.e. <tag>[value]</tag>:
<quest>
<id>1</id>
<reward_exp1>1848</reward_exp1>
<reward_gold1>560</reward_gold1>
</quest>
And write something else like this
<quest id="1"><reward gold="560" exp="1848" /></quest>
Can I find a tutorial or something?

One way to do this would be to use LINQ to XML.
Here are some links to get you started.
http://msdn.microsoft.com/en-us/library/bb387044.aspx
http://www.dreamincode.net/forums/topic/218979-linq-to-xml/
There are other options, e.g. an XSLT transform or the XML DOM.
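For the quest example in the question, a LINQ to XML version might look roughly like this. QuestConverter is a hypothetical name, and the sketch assumes every quest has the id, reward_gold1 and reward_exp1 child elements; a missing one would make the XAttribute constructor throw.

```csharp
using System.Xml.Linq;

public static class QuestConverter
{
    // Turns the element-based <quest> layout into the attribute-based one.
    public static XElement Convert(XElement quest)
    {
        return new XElement("quest",
            new XAttribute("id", (string)quest.Element("id")),
            new XElement("reward",
                new XAttribute("gold", (string)quest.Element("reward_gold1")),
                new XAttribute("exp", (string)quest.Element("reward_exp1"))));
    }
}
```

Typical usage would be XElement.Parse or XDocument.Load for the input, Convert for each quest element, and Save on the result.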

What you're looking to do is called XML Transformation, it's a common problem with many different ways to approach a solution.
If you're new to coding, you may want to look at XSLT. The XSLT 'language' can be a bit tricky for complex problems, but I suspect it can handle yours with minimal effort: it would only take a few lines of XSLT 'code' plus a few lines of whichever language you use to run the transform (e.g. Java, C#, VB).
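Here is a sketch of what "a few lines of XSLT plus a few lines of host code" could look like in C#, using XslCompiledTransform. The stylesheet is my own guess at a suitable transform for the quest example, and QuestXslt is a hypothetical name; the stylesheet is embedded as a string only to keep the example self-contained.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class QuestXslt
{
    // Attribute value templates ({id} etc.) copy child element text into attributes.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='quest'>
    <quest id='{id}'>
      <reward gold='{reward_gold1}' exp='{reward_exp1}'/>
    </quest>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string inputXml)
    {
        var xslt = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        using (var input = XmlReader.Create(new StringReader(inputXml)))
        using (var output = new StringWriter())
        {
            // No stylesheet parameters are needed, so the argument list is null.
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

In a real project the stylesheet would more likely live in its own .xslt file so it can be edited without recompiling.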

Making data in an XML not able to be edited via a text editor

I'm currently following a tutorial series for a Tile Engine which uses XML files to store conversations between NPCs. A topic it doesn't appear to cover (I have only quickly glanced through the subsequent videos) is how to prevent the user from either altering or knowing in advance what the NPC is going to say by opening the XML file easily with a generic text editor.
The 2nd point of being able to read future conversations is not a real issue but something I wanted to think about, so if that's hard to implement I am not too fussed at this point.
How would I go about making the XML uneditable? I know vaguely about CRC32 checksums, which can check file integrity and may be useful, and I also think there might be better ways to go about that (i.e. not with a CRC32).
The most extreme action I can think of would be to create my own arbitrary encoding for the conversation data, but the usefulness of XML files deters me from that slightly, and with the tutorials I'm following teaching me a lot of things I don't know, I would prefer not to stray too far from them!
Just looking for a direction really, thanks!

XML is fundamentally an open format, so there is no way to make an XML file uneditable.
But you can keep a copy of the XML document (or some fingerprint of it) on your server (or at the endpoints of the NPC conversation) and then compare against it to see whether the document was edited.
If the document was edited, you can replace it with the backup version or tell the endpoints that the XML document was corrupted...
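One way to compute such a fingerprint is a cryptographic hash. The sketch below uses SHA-256, which is my choice for illustration rather than anything the answer prescribes, and the Fingerprint class name is hypothetical. Keep in mind that a fingerprint stored next to the game files can be edited just as easily as the XML itself.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Fingerprint
{
    // Returns the SHA-256 hash of the content as a lowercase hex string.
    public static string Compute(string content)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Compares the current content against a fingerprint recorded earlier.
    public static bool IsUnmodified(string content, string expectedFingerprint)
    {
        return string.Equals(Compute(content), expectedFingerprint,
            StringComparison.OrdinalIgnoreCase);
    }
}
```

At load time the game would read the XML text, call IsUnmodified with the stored fingerprint, and fall back to a backup copy when the check fails.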

Historically, many games wrap multiple resources into a single binary file.
You might put it in a ZIP file (and maybe change the file extension). That would allow you to avoid having an XML file with an obvious name as a temptation for your users :).
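A small sketch of the ZIP idea using System.IO.Compression. ResourcePack and the entry names are hypothetical, and the archive is built in memory so the example stands alone; a game would read the pack from disk instead. This hides the file from casual browsing but is not protection against a determined user.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ResourcePack
{
    // Builds a one-entry ZIP archive in memory and returns its bytes.
    public static byte[] Create(string entryName, string content)
    {
        using (var buffer = new MemoryStream())
        {
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            using (var writer = new StreamWriter(zip.CreateEntry(entryName).Open(), Encoding.UTF8))
            {
                writer.Write(content);
            }
            // The archive is disposed at this point, so the ZIP directory has been written.
            return buffer.ToArray();
        }
    }

    // Reads one entry back out of the archive as text.
    public static string Read(byte[] pack, string entryName)
    {
        using (var zip = new ZipArchive(new MemoryStream(pack), ZipArchiveMode.Read))
        {
            var entry = zip.GetEntry(entryName) ?? throw new FileNotFoundException(entryName);
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```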
Ultimately, you're asking something similar to the DRM question. I don't know whether your platform has an answer to that. (E.g., "using RSA encryption" is not secure as such; your program still has to decrypt the data at some point using the appropriate key, etc).

Acord Standard for Insurance. Has anybody dealt with this mess?

We need to implement a WCF Webservice using the ACORD Standard.
However, I don't know where to start with this since this standard is HUMONGOUS and very convoluted. A total chaos to my eyes.
I am trying to use WSCF.Blue to extract the classes from the multiple XSD I have but so far all I get is a bunch of crap: A .cs file with 50,000+ lines of code that freezes my VS2010 all the time.
Has anybody walked already thru the Valley of Death (ACORD Standard) and made it? I really would appreciate some help.

I wrote an ACORD to C# class library converter which was then used in several large commercial insurance products. It featured a very nice mapping of all of the ACORD XML into nice, concise, extendable C# classes. So I know from whence you come!
Once you dig into it, it's not so bad, but I maintain the average coder will not 'get it' for about 3-4 months if they work at it full time (assuming anything but inquiry-style messages). The real problem comes when trying to do mapping from a backend database and to/from another ACORD WS. All of the carriers, vendors, and agencies have custom rules.
My best suggestion is to find working code examples (I have tons if you need them) and maybe even a vendor or carrier who will let you hook up to an ACORD WS in a test environment.

It sounds like you are heading down the right path but are lost in the forest.
The ACORD Standard is huge and intentionally so, as it provides support for hundreds of different messages. Just as you do not download all of Wikipedia to get just a few articles, you do not need all of the classes in the ACORD Standard to support an implementation of a few messages. If you know what messages you need to support then you can generate a subset of the full XSD that will be quite manageable.
As mentioned in Hugh’s response, for any one message only a fraction of the full XSD is used. How you go about generating that subset will depend on the specifics of your project. If you are looking for ideas on how to generate a subset of the full XSD, try reaching out to the ACORD staff for help at PCS#acord.org. They should be able to offer you some help in getting started.

I have worked with the ACORD PCS exposure reporting standards and yes, it was a nightmare. I have also worked with other large standards like FpML and SportsML.
You need to work out exactly which types from the schema are needed. How you do this is up to you, but the VS schema viewer should be able to handle it. If not, try XmlSpy or just go through it by hand if you have to. Make sure you have a good BA to hand...
Chances are you will find that you can meet your requirements by using around 1% of the types available in the standard.
What you'll probably find is that you can express the core objects with a very minimal set of values, as most nodes will be minOccurs=0 or nillable.
Then you can use the /element switch on xsd.exe to generate the code for just the types you need.
As one commenter says there is no easy pill to swallow here. The irony is that standards are supposed to make everyone's lives easier.

If you are looking to read/write ACORD documents using .NET, I just stumbled across the "IVC Software Factory for ACORD Standards" on CodePlex at http://ivc.codeplex.com.
From the limited documentation it appears as if this library can convert objects to ACORD XML documents, and vice-versa. The source code comes with different "providers" i.e. different ACORD transaction types, like 103 or 121.
Hope this helps.

I would recommend not creating a model for the entire standard. One could just pass XML and not serialize into a model but instead load it into XDocument/XElement and use Linq to query it and update the DOM using Linq to Xml. So, one is not loading the XML to a strongly typed model, but just loading the XML. There is no model, just an XML document.
From there, one can pick the data off of the XML as needed.
Using this approach, the code will be ugly and have little context since XElements will be passed everywhere, and there will be tons of magic strings of XPaths to query and define elements, but it can work. Also, everything is a string so there will be utility conversion methods to convert to numbers, date times, etc.
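A sketch of the kind of utility conversion methods this approach ends up needing. The XmlValues class, the XPaths and the element names in the usage note are invented for illustration and are not real ACORD names; the date format is also an assumption.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XmlValues
{
    // Returns the element's text as a decimal, or null when the element is absent.
    public static decimal? GetDecimal(XNode root, string xpath)
    {
        XElement element = root.XPathSelectElement(xpath);
        if (element == null) return null;
        return decimal.Parse(element.Value, CultureInfo.InvariantCulture);
    }

    // Returns the element's text as a date, assuming a yyyy-MM-dd format.
    public static DateTime? GetDate(XNode root, string xpath)
    {
        XElement element = root.XPathSelectElement(xpath);
        if (element == null) return null;
        return DateTime.ParseExact(element.Value, "yyyy-MM-dd", CultureInfo.InvariantCulture);
    }
}
```

For a made-up message such as <Message><Policy><Premium>1250.50</Premium></Policy></Message>, the call XmlValues.GetDecimal(doc, "/Message/Policy/Premium") would pick the premium off the document without any model classes. Keeping the XPath strings together as constants helps contain the magic-string problem.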
From my perspective, I have modeled part of ACORD into an object model using the XmlSerializer, but it's well over 500 classes. The model was not generated from the XSD or by any other tool, but crafted manually, and it took some time. Tooling will produce monster unusable classes (as you have mentioned) and/or flat out crash. As an example, I tried to load the XSD into Stylus Studio and it crashed several times.
So, your best bet if you're strapped for time is loading into an XDocument as opposed to trying to map out everything in a model. I know that sucks, but ACORD in general is basically a huge hot mess of data.

What is the best way to read and write cXML documents in C#?

I know this is a vague open ended question. I'm hoping to get some general direction.
I need to add cXML punchout to an ASP.NET C# site / application. This is replacing something that I wrote years ago in ColdFusion.
I'm a reasonably experienced C# developer but I haven't done much with XML. There seems to be lots of different options for processing XML in .NET.
Here's the open-ended question: assuming that I have an XML document in some form, e.g. a file or a string, what is the best way to read it into my code? I want to get the data and then query databases etc. The cXML document size and our traffic volumes are easily small enough that loading a cXML document into memory is not a problem.
Should I:
1) Manually build classes based on the DTD and use the XmlSerializer?
2) Use a tool to generate classes? There are sample cXML files downloadable from Ariba.com.
I tried xsd.exe to generate an XSD and then xsd.exe /c to generate classes. When I try to deserialize I get errors because there seems to be "confusion" around whether some elements should be single values or arrays.
I tried the CodeXS online tool but that gives errors in its log and errors if I try to deserialize a sample document.
3) Create a DataSet and ReadXml()?
4) Create a typed DataSet and ReadXml()?
5) Use LINQ to XML? I often use LINQ to Objects so I'm familiar with LINQ in general, but I'm struggling to see what it gives me in this situation.
6) Some other means?
I guess I need to improve my understanding of XML in general but even so ... am I missing some obvious way of doing this? In the old ColdFusion site I found a free component ("tag") which basically ignored any schema and read the XML into a "structure" which is essentially a series of nested hash tables which was then easy to read in code. That was probably quite sloppy but it worked.
I also need to generate XML files from my C# objects. Maybe Linq to XML will be good for that. I could start with a default "template" document and manipulate it before saving.
Thanks for any pointers ...

If you need to generate arbitrary XML in an exact format, you should generate it manually using LINQ-to-XML.

Creating a scripting language to be used to create web pages

I am creating a scripting language to be used to create web pages, but don't know exactly where to begin.
I have a file that looks like this:
mylanguagename(main) {
    OnLoad(protected) {
        Display(img, text, link);
    }
    Canvas(public) {
        Image img: "Images\my_image.png";
        img.Name: "img";
        img.Border: "None";
        img.BackgroundColor: "Transparent";
        img.Position: 10, 10;
        Text text: "This is a multiline str#ning. The #n creates a new line.";
        text.Name: text;
        text.Position: 10, 25;
        Link link: "Click here to enlarge img.";
        link.Name: "link";
        link.Position: 10, 60;
        link.Event: link.Clicked;
    }
    link.Clicked(sender, link, protected) {
        Image img: from Canvas.FindElement(img);
        img.Size: 300, 300;
    }
}
... and I need to be able to make that text above target the Windows Scripting Host. I know this can be done, because there used to be a lot of Docs on it around the net a while back, but I cannot seem to find them now.
Can somebody please help, or get me started in the right direction?
Thanks

You're making a domain-specific language which does not exist yet, and you want to translate it to another language. You will need a proper scanner and parser. You've probably been told to look at ANTLR, yacc/bison, or GOLD. What went wrong with that?
And as an FYI, it's a fun exercise to make new languages, but before you do for something like this, you might ask a good solid "why? What does my new language provide that I couldn't get any other (reasonable) way?"

The thing to understand about parsing and language creation is that writing a compiler/interpreter is primarily about a set of data transformations done to an input text.
Generally, from an input text you will first translate it into a series of tokens, each token representing a concept in your language or a literal value.
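The text-to-tokens step can be sketched with a small hand-written lexer. The token kinds below are my guess at what the language in the question would need (identifiers, integers, double-quoted strings, and single-character symbols); Token, TokenKind and Lexer are hypothetical names, and comments or escape sequences are not handled.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Identifier, Number, String, Symbol }

public readonly struct Token
{
    public TokenKind Kind { get; }
    public string Text { get; }

    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }

    public override string ToString() => $"{Kind}:{Text}";
}

public static class Lexer
{
    public static List<Token> Tokenize(string source)
    {
        var tokens = new List<Token>();
        int pos = 0;
        while (pos < source.Length)
        {
            char c = source[pos];
            if (char.IsWhiteSpace(c)) { pos++; continue; }

            int start = pos;
            if (char.IsLetter(c) || c == '_')
            {
                // Identifier: a letter or underscore followed by letters, digits, underscores.
                while (pos < source.Length && (char.IsLetterOrDigit(source[pos]) || source[pos] == '_')) pos++;
                tokens.Add(new Token(TokenKind.Identifier, source.Substring(start, pos - start)));
            }
            else if (char.IsDigit(c))
            {
                // Number: a run of digits (integers only in this sketch).
                while (pos < source.Length && char.IsDigit(source[pos])) pos++;
                tokens.Add(new Token(TokenKind.Number, source.Substring(start, pos - start)));
            }
            else if (c == '"')
            {
                // String: everything up to the next double quote, quotes excluded.
                pos++;
                while (pos < source.Length && source[pos] != '"') pos++;
                if (pos >= source.Length)
                    throw new FormatException("Unterminated string literal.");
                tokens.Add(new Token(TokenKind.String, source.Substring(start + 1, pos - start - 1)));
                pos++;
            }
            else
            {
                // Anything else becomes a one-character symbol such as . : , ; ( ) { }
                tokens.Add(new Token(TokenKind.Symbol, c.ToString()));
                pos++;
            }
        }
        return tokens;
    }
}
```

A parser would then consume this token list instead of raw characters, which keeps whitespace and quoting concerns out of the grammar logic.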
From the token stream, you will generally then create an intermediate structure, typically some kind of tree structure describing the code that was written.
This tree structure can then be validated or modified for various reasons, including optimization.
Once that's done, you'll typically write the tree out to some other form - assembly instructions or even a program in another language - in fact, the earliest versions of C++ wrote out straight C code, which were then compiled by a regular C compiler that had no knowledge of C++ at all. So while skipping the assembly generation step might seem like cheating, it has a long and proud tradition behind it :)
I deliberately haven't gotten into any suggestions for specific libraries, as understanding the overall process is probably much more important than choosing a specific parser technology, for instance. Whether you use lex/yacc or ANTLR or something else is pretty unimportant in the long run. They'll all (basically) work, and have all been used successfully in various projects.
Even doing your own parsing by hand isn't a bad idea, as it will help you to learn the patterns of how parsing is done, and so then using a parser generator will tend to make more sense rather than being a black box of voodoo.

Languages similar to C# are not easy to parse - there are some naturally left-recursive rules. So you have to use a parser generator that can deal with them properly. ANTLR fits well.
If PEG fits better, try this: http://www.meta-alternative.net/mbase.html

So you want to translate C# programs to JavaScript? Script# can do this for you.

Rather than write your own language and then run a translator to convert it into Javascript, why not extend Javascript to do what you want it to do?
Take a look at jQuery - it extends Javascript in many powerful ways with a very natural and fluent syntax. It's almost as good as having your own language. Take a look at the many extensions people have created for it too, especially jQuery UI.

Assuming you are really dedicated to do this, here is the way to go. This is normally what you should do: source -> SCANNER -> tokens -> PARSER -> syntax tree
1) Create a scanner/ parser to parse your language. You need to write a grammar to generate a parser that can scan/parse your syntax, to tokenize/validate them.
I think the easiest way here is to go with Irony, that'll make creating a parser quick and easy. Here is a good starting point
http://www.codeproject.com/KB/recipes/Irony.aspx
2) Build a syntax tree - In this case, I suggest you to build a simple XML representation instead of an actual syntax tree, so that you can later walk the XML representation of your DOM to spit out VB/Java Script. If your requirements are complex (like you want to compile it or so), you can create a DLR Expression Tree or use the Code DOM - but here I guess we are talking about a translator, and not about a compiler.
But hey wait - if it is not for educational purposes, consider representing your 'script' as an xml right from the beginning, so that you can avoid a scanner/parser in between, before spitting out some VB/Java script/Html out of that.

I don't want to be rude... but why are you doing this?
Creating a parser for a real language is a non-trivial task. Just don't do it.
Why don't you just use HTML, JavaScript and CSS (and jQuery, as someone above suggested)?
If you don't know where to begin, then you probably don't have any experience of this kind, and you probably don't have a good reason to do this.
I want to save you the pain. Forget it. It's probably a BAD IDEA!
M.

Check out Constructing Language Processors for Little Languages. It's a very good intro I believe. In fact I just consulted my copy 2 days ago when I was having trouble with my template language parser.
Use XML if at all possible. You don't want to fiddle with a lexer and parser by hand if you want this thing in production. I've made this mistake a few times. You end up supporting code that you really shouldn't be. It seems that your language is mainly a templating language. XML would work great there. Just as ASPX files are XML. Your server side blocks can be written in Javascript, modified if necessary. If this is a learning exercise then do it all by hand, by all means.
I think writing your own language is a great exercise. So is taking a college level compiler writing class. Good luck.

You obviously need machinery designed to translate languages: parsing, tree building, pattern matching, target-language tree building, target-language prettyprinting.
You can try to do all of this with YACC (or equivalents), but you'll discover that parsing is only a small part of a full translator. This means there's a lot more work to do than just parsing, and that takes time and effort.
Our DMS Software Reengineering Toolkit is a commercial solution to building full translators for relatively modest costs.
If you want to do it on your own from the ground up as an exercise, that's fine. Just be prepared for the effort it really takes.
One last remark: designing a complete language is hard if you want to get a nice result.

Personally I think that every self-imposed challenge is good. I do agree with the other opinions that if what you want is a real solution to a real life problem, it's probably better to stick with proved solutions. However, if as you said yourself, you have an academic interest into solving this problem, then I encourage you to keep on. If this is the case, I might point a couple of tips to get you on the track.
Parsing is not really an easy task, which is why we take at least a semester of it. However, it can be learned. I would recommend starting with Terence Parr's book on language implementation patterns. There are many great books about compiling and parsing, probably the most loved and hated being the Dragon Book.
This is pretty heavy stuff, but if you are really into this, and have the time, you should definitely take a look. This would be the Robinson Crusoe "I'll make it all by myself" approach. I have recently written an LR parser generator and it took me no more than a long weekend, but that was after reading a lot and taking a full two-semester course on compilers.
If you don't have the time or simply don't want to learn to make a parser "like men do", then you can always try a commercial or academic parser generator. ANTLR is just fine, but you have to learn its meta-language. Personally I think that Irony is a great tool, especially because it stays inside C# and you can take a look at the source code and learn for yourself. Since we are here, and I'm not trying to advertise at all, I have posted a tiny tool on CodePlex that could be useful for this task. Take a look for yourself, it's open-source and free.
As a final tip, don't get scared if someone tells you it cannot be done. Parsing is a difficult theoretical problem but it's nothing that can't be learned, and it really is a great tool to have in your portfolio. I think it speaks very well of a developer that they can write a recursive-descent parser by hand, even if they never have to. If you want to pursue this goal to its end, take a college-level compilers course; you'll thank me in a year.
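For a feel of what a hand-written recursive-descent parser looks like, here is a small sketch for arithmetic expressions. It evaluates while parsing instead of building a syntax tree, purely to stay short; the grammar in the comments and the ExpressionParser name are my own choices for the example. Each grammar rule becomes one method, and the left-recursive rules are written as loops.

```csharp
using System;
using System.Globalization;

public sealed class ExpressionParser
{
    private readonly string _text;
    private int _pos;

    private ExpressionParser(string text) { _text = text; }

    public static double Evaluate(string text)
    {
        var parser = new ExpressionParser(text);
        double value = parser.ParseExpression();
        parser.SkipSpaces();
        if (parser._pos != text.Length)
            throw new FormatException($"Unexpected '{text[parser._pos]}' at position {parser._pos}.");
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double ParseExpression()
    {
        double value = ParseTerm();
        while (true)
        {
            if (Accept('+')) value += ParseTerm();
            else if (Accept('-')) value -= ParseTerm();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double ParseTerm()
    {
        double value = ParseFactor();
        while (true)
        {
            if (Accept('*')) value *= ParseFactor();
            else if (Accept('/')) value /= ParseFactor();
            else return value;
        }
    }

    // factor := '-' factor | '(' expression ')' | number
    private double ParseFactor()
    {
        if (Accept('-')) return -ParseFactor();
        if (Accept('('))
        {
            double inner = ParseExpression();
            if (!Accept(')')) throw new FormatException("Expected ')'.");
            return inner;
        }

        SkipSpaces();
        int start = _pos;
        while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (start == _pos)
            throw new FormatException($"Expected a number at position {_pos}.");
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    // Consumes the character if it is next (ignoring whitespace) and reports whether it did.
    private bool Accept(char c)
    {
        SkipSpaces();
        if (_pos < _text.Length && _text[_pos] == c) { _pos++; return true; }
        return false;
    }

    private void SkipSpaces()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }
}
```

Operator precedence falls out of the call structure: ParseExpression calls ParseTerm, which calls ParseFactor, so multiplication binds tighter than addition without any precedence table.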
