Learning XML, what are the next steps? (navigating a document elegantly) - c#

The XML field seems filled with jargon (well, to new XML users it's jargon): DTD, DOM, and SGML, just to name a few.
I've read up on what an XML document is and what makes a document valid. What I need are the next steps, or how to actually use an XML document. For the .NET platform there seems to be a plethora of ways to traverse an XML document: XPath, XmlReader (from System.Xml), DataSets, and even the lowly StreamReader.
What is the best approach? Where can I find more "advanced beginner" material? Most of the material I find is either about the differences between XML parsing approaches (performance and other advanced topics that assume XML experience) or explains XML in general terms for non-programmers (how it's platform-independent, human-readable, etc.).
Thanks!
Also, for specifics, I'm using C# (so .NET). I've tinkered around with XML in VBA, but I've run into the same problems. The practical application here is getting an iOS application to dump info into a SQL Server database.

Download LINQPad and its samples. It has quite a large library of LINQ to XML examples that you might find very useful.
http://www.linqpad.net/
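To give a flavour of what that looks like, here is a minimal LINQ to XML sketch; the file name and element names (books.xml, book, year, title) are invented for illustration:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class LinqToXmlExample
    {
        static void Main()
        {
            // Load the whole document into an in-memory tree.
            XDocument doc = XDocument.Load("books.xml");

            // Query it declaratively, much like a SQL statement.
            var titles =
                from book in doc.Descendants("book")
                where (int)book.Element("year") > 2000
                select (string)book.Element("title");

            foreach (var title in titles)
                Console.WriteLine(title);
        }
    }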

It's hard to do this without some idea of the problem you want to solve.
You need to make a decision whether you want to process the XML using procedural languages like C#, or declarative languages like XSLT and XQuery. For many tasks, the declarative languages will make your life much easier, but there is more of a learning curve, and a lot depends on where you are coming from in terms of previous experience. Generally working at the C# level is appropriate if your application is 10% XML processing and 90% other things, while XSLT/XQuery are more appropriate if it's 90% XML manipulation and 10% other things.
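If you do go the declarative route, you can still drive it from C#. Here is a minimal sketch using the framework's built-in XSLT processor (the file names are placeholders):

    using System.Xml.Xsl;

    class TransformExample
    {
        static void Main()
        {
            // Compile the stylesheet once; the instance is reusable.
            var xslt = new XslCompiledTransform();
            xslt.Load("transform.xslt");

            // The declarative transform does the XML-heavy work;
            // the surrounding C# handles everything else.
            xslt.Transform("input.xml", "output.xml");
        }
    }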

Learn the two primary (early) methods of processing an XML document: SAX and DOM. Then learn how to use one of the new "pull" parsers.
Without learning how to parse XML, you are in danger of designing XML that poorly supports the task(s) at hand.
Recommended reading, even if you are working in C#:
Java & XML, 2nd Edition (O'Reilly)
Java & XML Data Binding (O'Reilly)
SAX and DOM are universal enough that the language differences between C# and Java are not the hardest part of using XML effectively. Perhaps there are C# equivalents of the above, if so then use them.
As far as the "best" means of using XML is concerned, it depends heavily on the task at hand. There's no "best" way of using a text document, either! If you are processing very large streams, SAX works great until you need to cross reference. DOM is great for "whole document in memory" processing, but due to it's nature suffers when the documents get "too big".
The "right" solution is to tailor your XML to exploit the strengths of the means by which it will be processed and transformed into useful work, while avoiding the pitfalls that accompany the chosen processing methodology. That's pretty vague, but there's more than one way to skin this proverbial cat.

Related

Proper factoring when generating files with XSLT

In a particular segment of a system I'm working on, we are generating PDF and HTML files using XSLT (for email, print, and display). The business model being printed is in code (C#).
When designing the schema, I made special considerations for the requirements of the printed documents, as XSLT is much more difficult (possibly just for me?) to work with than C#. For example, I generate aggregate values and tables from the business model for display on the document. These decisions don't translate well to other areas where similar XML might be used.
I'm now facing the problem of others using the XML as well, thereby breaking SoC (separation of concerns).
I'm leaning towards taking a snapshot of the XML they originally latched on to and giving them a new method. I personally don't see a problem with this (in the face of DRY), but others might have a hard time understanding the trade-off. Is my reasoning flawed? Is there a better approach?
I know it's a personal thing, but my choice would always be to put as much of the logic as possible in the XSLT code rather than the C# code - the opposite of what you are doing. It means you're working in a higher-level declarative language, and one that is expressly designed for manipulating XML. There is a learning curve, of course, but at the top of the learning curve you will find sunlit uplands. And don't be put off by the limitations of XSLT 1.0: 2.0 leaves all those problems behind; you just have to be prepared to ditch Microsoft and use third-party technology (Microsoft stopped doing anything new in the XML space about a decade ago, but that doesn't mean you have to stay stuck in the past).

Advantage XPath Evaluation over XSLT

I'm currently programming an application that uses WPF.
Therefore I'm planning to load the GUI dynamically via XAML, based upon given XML.
As I see it, I have two choices:
Evaluate the XML myself with XPath and create the GUI elements myself.
Generate XAML through an XSLT transformation and load that file.
So, the question is: which way is more suitable? Or is there no difference, and it's just a question of which way I prefer?
XSLT sounds like a bad choice:
As soon as things get a bit harder, you start hacking around. On top of that, the .NET Framework's built-in processor supports an older version of XSLT than the latest, meaning you have far fewer capabilities available unless you bring in a third-party library for XSL transformations.
It forces developers to learn a new technology which you can easily avoid. Imagine a new developer with no XSLT experience taking over your work. I imagine the code will be hard to read even for experienced developers.
With plain XML, it's pretty straightforward. However, XPath can also become quite a mess once you start nesting and nesting.
Define an XML format, use XML-to-object deserialization, and build the UI from the objects. Don't bother with XPath; use XmlSerializer for the "parsing".
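A rough sketch of that deserialization approach; the Window and Button classes and the XML shape are invented for illustration:

    using System;
    using System.IO;
    using System.Xml.Serialization;

    // Hypothetical classes mirroring the XML format you define.
    public class Window
    {
        [XmlAttribute] public string Title { get; set; }
        [XmlElement("Button")] public Button[] Buttons { get; set; }
    }

    public class Button
    {
        [XmlAttribute] public string Label { get; set; }
    }

    class DeserializeExample
    {
        static void Main()
        {
            string xml = @"<Window Title=""Main"">
                             <Button Label=""OK"" />
                             <Button Label=""Cancel"" />
                           </Window>";

            var serializer = new XmlSerializer(typeof(Window));
            using (var reader = new StringReader(xml))
            {
                // One call replaces all the hand-rolled XPath evaluation.
                var window = (Window)serializer.Deserialize(reader);
                Console.WriteLine(window.Title + ": " +
                                  window.Buttons.Length + " buttons");
            }
        }
    }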

Programmatically filter XML in a streaming fashion (XmlWrappingReader/Writer alternatives?)

I'm working with some .NET services that have the potential to process significantly large XML documents, and I need to ensure that all processing is done in a streaming / pipelining fashion. I'm already using the XmlReader and XmlWriter classes. My question is, what is the best way to programmatically provide a filter into the reader and writer (either, depending upon the flow)?
(I am not looking for XSLT. I already do a lot with XSLT, and many of the things I'm looking to do are outside the scope of XSLT - or at least, implementing within XSLT would not be ideal.)
In Java & SAX, this would best be handled through a XMLFilterImpl. I do not see that .NET provides anything similar for working with a XmlReader. I did find this blog post, "On creating custom XmlReaders/XmlWriters in .NET 2.0, Part 2", which includes the following (I've fixed the first link from a broken link from the original post):
Here is the idea - have a utility wrapper class which wraps
XmlReader/XmlWriter and does nothing else. Then derive from this class
and override the methods you are interested in. These utility wrappers
are called XmlWrappingReader and XmlWrappingWriter. They are part of
the System.Xml namespace, but unfortunately they are internal -
the Microsoft XML team considered making them public, but in the
Whidbey release rush decided to postpone the issue. Happily, these
classes, being pure wrappers, have no logic whatsoever, so anybody who
needs them can indeed create them in 10 minutes. But to save you
those 10 minutes, I post these wrappers here. I will include
XmlWrappingReader and XmlWrappingWriter in the next Mvp.Xml library
release.
These two classes (XmlWrappingReader and XmlWrappingWriter) from the Mvp.Xml library are currently meeting my needs nicely. (As an added bonus, it is a free and open-source library, BSD licensed.) However, due to the stale status of this project, I do have some concerns about including these classes in a contracted, commercial development project that will be handed off. The last release of Mvp.Xml was 4.5 years ago, in July 2007. Additionally, there is this comment from a "project coordinator" in response to this project discussion:
Anyway, this is not really a supported project anymore. All devs moved
out. But it's open source, you are on your own.
I've also found a SAX equivalent in .NET, SAXDotNet, but it doesn't seem to be in any better shape, with its last release being in 2006.
I'm well aware that a stale project doesn't necessarily mean it is any less usable, and I will be moving forward with the two wrapper classes from the Mvp.Xml library - at least for now.
Are there any better alternatives that I should be considering? (Again, any solution must not require the entire XML to exist in-memory at any one time - whether as a DOM, a string, or otherwise.) Are there any other libraries available (preferably something from a more active project), or maybe something within the LINQ features that would meet these requirements?
Personally I find that writing a pipeline of filters works much better with a push model than a pull model, although both are possible. With a pull model, a filter that needs to generate multiple output events in response to a single input event is quite tricky to program, though of course it can be done by keeping track of the state. So I think that looking for a SAX-like approach makes sense.
I would look again at SAXDotNet or equivalents. Be prepared to look at the source code and bend it to your needs; consider contributing back your improvements. Intrinsically the job it is doing is very simple: a loop that reads events from the (pull) input and writes events to the (push) output. In fact, it's so simple that perhaps the reason it hasn't changed since 2006 is that it doesn't need to.
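As a sketch of that loop using only the built-in classes, here is a filter that copies input to output while dropping one element's subtree (the internalNote element name is hypothetical); WriteShallowNode follows the well-known MSDN node-copying pattern:

    using System.Xml;

    class StreamingFilterExample
    {
        static void Main()
        {
            using (XmlReader reader = XmlReader.Create("input.xml"))
            using (XmlWriter writer = XmlWriter.Create("output.xml"))
            {
                while (!reader.EOF)
                {
                    // Drop the whole subtree without materializing it.
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "internalNote")
                    {
                        reader.Skip(); // lands on the node after the subtree
                        continue;
                    }
                    WriteShallowNode(reader, writer);
                    if (!reader.Read()) break;
                }
            }
        }

        // Copies only the current node (not its children) to the writer.
        static void WriteShallowNode(XmlReader reader, XmlWriter writer)
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    writer.WriteStartElement(reader.Prefix, reader.LocalName,
                                             reader.NamespaceURI);
                    writer.WriteAttributes(reader, true);
                    if (reader.IsEmptyElement) writer.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                // Processing instructions, doctype, etc. omitted for brevity.
            }
        }
    }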

How to speed up generation of Word files from C#?

I'm working on an application that generates a relatively large amount of Word output. Currently, we're using Word Interop services to do the document creation, but it's quite slow, especially in older (pre-2007) versions of Office. We'd like to speed up the generation.
I haven't done a lot of profiling yet, but I'm pretty confident that the problem is that we're making tons of COM calls. I'm hoping that profiling will yield a subset of calls that are slower than the others, but my gut tells me that it's probably a question of COM overhead (or Word Interop overhead), and not just a few slow calls.
Also, the product can generate HTML output, and that process (a) is very fast, and (b) uses pretty much the same codepaths, just with a different subclass for the HTML-specific pieces of functionality. So I'm pretty sure that our algorithm isn't fundamentally slow.
So... I'm looking for suggestions for alternate ways to accelerate the generation of Word files.
We can't just rename the generated HTML files to .doc, and we can't generate RTF instead - in both cases, important formatting information gets lost, and in the RTF case, inlined graphics don't work robustly.
One of the approaches we're evaluating is programmatically generating and opening a Word file (via interop) from a template that has a macro that knows how to consume a flat file and create the requisite output. We're interested in feedback about that approach, as well as any other ideas for speeding things up.
If you can afford it, I'd recommend the Aspose.Words product. It's very fast, and Word does not need to be installed.
It's also much easier to use than Office interop.
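A rough sketch of what that looks like with Aspose.Words' DocumentBuilder API (the content and file path are placeholders):

    using Aspose.Words;

    class AsposeExample
    {
        static void Main()
        {
            // No Word installation or COM round-trips involved.
            var doc = new Document();
            var builder = new DocumentBuilder(doc);

            builder.Font.Bold = true;
            builder.Writeln("Quarterly Report");
            builder.Font.Bold = false;
            builder.Writeln("Generated without Office installed.");

            doc.Save("report.doc");
        }
    }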
Your macro approach is exactly how we sped up slow Excel interop (using version 2003, I think).
We found (at least with Excel) that much of the slowness was due to repeated individual calls via the interop layer. We started to batch commands (i.e. format large ranges, then change specific cells as required, rather than formatting each cell individually), and logically moved on to macros.
I think the macro + template approach would translate happily to Word.
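To illustrate the batching point with Excel interop (sheet contents and ranges are illustrative; error handling omitted):

    using Excel = Microsoft.Office.Interop.Excel;

    class BatchedInteropExample
    {
        static void Fill(Excel.Worksheet sheet)
        {
            // Slow: one COM round-trip per cell.
            // for (int r = 1; r <= 1000; r++)
            //     sheet.Cells[r, 1] = "row " + r;

            // Fast: build the data in memory, push it in one COM call.
            object[,] data = new object[1000, 1];
            for (int r = 0; r < 1000; r++)
                data[r, 0] = "row " + (r + 1);

            Excel.Range range = sheet.Range["A1", "A1000"];
            range.Value2 = data; // a single interop call for 1000 cells
        }
    }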

Is there a Transformation engine or library using .NET?

We're looking for a transformation library or engine which can read any input: EDIFACT files, CSV, XML, and the like; that is, files (or web service results) containing data that must be transformed into a known business object structure. This data should be transformed into existing business objects using custom rules. XSLT is both too complex (to learn) and too simple (not enough features).
Can anybody recommend a C# library or engine? I have seen Altova MapForce, but I would like something I can send out to dozens of people who will build and design their own transformations, without having to pay for dozens of Altova licenses.
If you think XSLT is too difficult, you can try LINQ to XML for parsing XML files. It is integrated into the .NET Framework, and you can use C# (or, even better, VB.NET 9.0, because of its XML literals) instead of learning another language. You can integrate it with an existing application without much effort and without the paradigm mismatch between the language and the file handling that occurs with XSLT.
Microsoft LINQ to XML
Sure, it's not a framework or library for parsing files, but neither is XSLT, so...
XSLT is not going to work for EDI and CSV. If you want a completely generic transformation engine, you might have to shell out some cash. I have used Symphonia for dealing with EDI, and it worked, but it is not free.
The thing is, the problem you are describing sounds "enterprisey" (I am sure nobody uses EDI for fun), so there's no open-source/free tooling for dealing with this stuff.
I wouldn't be so quick to dismiss XSLT as being too complex or as lacking the features you require.
There are plenty of books and websites out there that describe everything you need to know about XSLT. Yes, there is a bit of a learning curve, but it doesn't take much to get into it, and there's always a great community like Stack Overflow to turn to if you need help ;-)
As for the lack of features, you can always extend XSLT and call .NET assemblies from the stylesheet using the XsltArgumentList.AddExtensionObject() method, which would give you the power you need.
MSDN has a great example of using this here
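A minimal sketch of that mechanism; the namespace URI and the FormatDate helper are invented for illustration:

    using System;
    using System.Xml;
    using System.Xml.Xsl;

    // Hypothetical helper exposed to the stylesheet.
    public class XsltHelpers
    {
        public string FormatDate(string isoDate)
        {
            return DateTime.Parse(isoDate).ToString("dd MMM yyyy");
        }
    }

    class ExtensionObjectExample
    {
        static void Main()
        {
            var xslt = new XslCompiledTransform();
            xslt.Load("transform.xslt");

            var args = new XsltArgumentList();
            // The stylesheet declares xmlns:ext="urn:my-helpers" and can
            // then call ext:FormatDate(...) like a native XSLT function.
            args.AddExtensionObject("urn:my-helpers", new XsltHelpers());

            using (var writer = XmlWriter.Create("output.xml"))
            {
                xslt.Transform("input.xml", args, writer);
            }
        }
    }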
It's true that the MapForce and BizTalk applications make creating XSLT very easy, but they also cost a bit. Also, depending on your user base (assuming non-developers), I think you'll find that these applications have their own learning curves and are often too feature-rich for what you need.
I'd recommend considering building and distributing your own custom mapping tool, specific to your users' needs.
Also, if you need a library to assist with file conversions, I'd recommend FileHelpers at SourceForge.
DataDirect Technologies has a product that does exactly this.
At http://www.xmlconverters.com/ there is a library called XmlConverters which converts EDI to XML and vice versa. There are also converters for CSV, JSON, and other formats.
The libraries are available as 100% .NET managed code, with a parallel port in 100% Java.
The .net side supports XmlReader and XmlWriter, while the Java side supports SAX, StAX and DOM. Both also support stream and reader/writer I/O.
DataDirect also has an XQuery engine optimized for merging relational data with EDI and XML, but it is Java only.
Microsoft BizTalk Server does a very good job of this.
