Converting XML between schemas - XSLT or Objects?

Given:
- Two similar and complex schemas; let's call them XmlA and XmlB.
- We want to convert from XmlA to XmlB.
- Not all the information required to produce XmlB is contained within XmlA (a database lookup will be required).
Can I use XSLT for this given that I'll need to reference additional data in the database? If so, what are the arguments in favour of using XSLT rather than plain old object mapping and conversion? I'm thinking that the following criteria might influence this decision:
- Performance/speed
- Memory usage
- Code reuse/complexity
The project will be C# based.
Thanks.

With C# you can always provide extension objects to XSLT transforms, so that's a non-issue.
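For illustration, here's a minimal sketch of the extension-object mechanism (the DbLookup type, the file names, and the urn:db-lookup namespace are invented for the example):

    using System.Xml;
    using System.Xml.Xsl;

    // Hypothetical helper exposing a database lookup to the stylesheet.
    public class DbLookup
    {
        public string LookupName(string id)
        {
            // ... query the database here ...
            return "name-for-" + id;
        }
    }

    public static class TransformRunner
    {
        public static void Run()
        {
            var xslt = new XslCompiledTransform();
            xslt.Load("AtoB.xslt");

            var args = new XsltArgumentList();
            // The stylesheet binds the same namespace, e.g. xmlns:db="urn:db-lookup",
            // and can then call db:LookupName(@id) like an ordinary function.
            args.AddExtensionObject("urn:db-lookup", new DbLookup());

            using (var writer = XmlWriter.Create("XmlB.xml"))
            {
                xslt.Transform("XmlA.xml", args, writer);
            }
        }
    }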
It's hard to say definitively without having the schemas and XML to hand, but I imagine a compiled transform will be faster than object mapping, since with the latter you'll have to do a fair amount of wheel reinventing.
Further, one of the huge benefits of XSLT is its maintainability and portability. You'll be able to adapt the XSLT doc really quickly when the schemas change, and on the fly without any rebuilds or takedowns if you're monitoring the file.
Could go either way based on what you've given us though.

My question is: how likely is the set of transformations to change?
If it won't change much, I favor doing it all in one body of source code -- here that would be C#. I would use xsd.exe (the .NET XSD tool) to generate serialization classes and use them in conjunction with data layers for this kind of thing.
On the other hand, if the set of transformations is likely to change -- or perhaps needs to be 'corrected' post-installation -- then I would favor a combination of XSLT and C# extensions to XSLT. The extension mechanism is straightforward, and if you use the XslCompiledTransform type the performance is quite good.

If the data isn't in the XML, then XSLT will be a pain. You can pull in additional documents with the document() function, or you can use XSLT extension objects (but those are not well supported between vendors). So unless you are dead set on XSLT, it doesn't sound like a good option on this occasion (although I'm a big fan of XSLT when used correctly).
So: I would probably use regular imperative code - streaming (IEnumerable<T>) if possible. Of course, unless you have a lot of data, such nuances are moot.
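As a rough sketch of the streaming approach (the Person type and the element/attribute names are made up):

    using System.Collections.Generic;
    using System.Xml;

    public class Person
    {
        public string Id { get; set; }
        public string Name { get; set; }
    }

    public static class PersonReader
    {
        // Yields one Person at a time; the full document is never held in memory.
        public static IEnumerable<Person> ReadPeople(string path)
        {
            using (var reader = XmlReader.Create(path))
            {
                while (reader.ReadToFollowing("person"))
                {
                    yield return new Person
                    {
                        Id = reader.GetAttribute("id"),
                        Name = reader.GetAttribute("name")
                    };
                }
            }
        }
    }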

Related

Acord Standard for Insurance. Has anybody dealt with this mess?

We need to implement a WCF Webservice using the ACORD Standard.
However, I don't know where to start with this since this standard is HUMONGOUS and very convoluted. A total chaos to my eyes.
I am trying to use WSCF.Blue to extract the classes from the multiple XSDs I have, but so far all I get is a bunch of crap: a .cs file with 50,000+ lines of code that freezes my VS2010 all the time.
Has anybody already walked through the Valley of Death (the ACORD Standard) and made it? I really would appreciate some help.
I wrote an ACORD-to-C# class library converter which was then used in several large commercial insurance products. It featured a very nice mapping of all of the ACORD XML into concise, extendable C# classes. So I know from whence you come!
Once you dig into it, it's not so bad, but I maintain the average coder will not 'get it' for about 3-4 months if they work at it full time (assuming anything but inquiry-style messages). The real problem comes when trying to map from a backend database and to/from another ACORD WS. All of the carriers, vendors, and agencies have custom rules.
My best suggestion is to find working code examples (I have tons if you need them) and maybe even a vendor or carrier who will let you hook up to an ACORD WS in a test environment.
It sounds like you are heading down the right path but are lost in the forest.
The ACORD Standard is huge and intentionally so, as it provides support for hundreds of different messages. Just as you do not download all of Wikipedia to get just a few articles, you do not need all of the classes in the ACORD Standard to support an implementation of a few messages. If you know what messages you need to support then you can generate a subset of the full XSD that will be quite manageable.
As mentioned in Hugh's response, for any one message only a fraction of the full XSD is used. How you go about doing that will depend on the specifics of your project. If you are looking for ideas on how to generate a subset of the full XSD, try reaching out to the ACORD staff for help at PCS#acord.org. They should be able to offer you some help in getting started.
I have worked with the ACORD PCS exposure reporting standards and yes, it was a nightmare. I have also worked with other large standards like FpML and SportsML.
You need to work out exactly which types from the schema are needed. How you do this is up to you, but the VS schema viewer should be able to handle it. If not, try XMLSpy, or just go through it by hand if you have to. Make sure you have a good BA to hand...
Chances are you will find that you can meet your requirements by using around 1% of the types available in the standard.
What you'll probably find is that you can express the core objects with a very minimal set of values, as most nodes will be minOccurs=0 or nillable.
Then you can use the /element switch on xsd.exe to generate the code for just the types you need.
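For example (the schema file and root element name here are placeholders; substitute the message you actually need):

    xsd.exe acord.xsd /classes /element:HomePolicyQuoteInqRq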
As one commenter says there is no easy pill to swallow here. The irony is that standards are supposed to make everyone's lives easier.
If you are looking to read/write ACORD documents using .NET, I just stumbled across the "IVC Software Factory for ACORD Standards" on CodePlex at http://ivc.codeplex.com.
From the limited documentation it appears as if this library can convert objects to ACORD XML documents, and vice-versa. The source code comes with different "providers" i.e. different ACORD transaction types, like 103 or 121.
Hope this helps.
I would recommend not creating a model for the entire standard. Rather than serializing into a strongly typed model, one can load the XML into an XDocument/XElement, use LINQ to query it, and update the DOM using LINQ to XML. There is no model, just an XML document.
From there, one can pick the data off of the XML as needed.
Using this approach, the code will be ugly and have little context, since XElements will be passed everywhere and there will be tons of magic-string XPaths to query and define elements, but it can work. Also, everything is a string, so there will be utility methods to convert values to numbers, date-times, etc.
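A minimal sketch of that style (the XPath and element names are invented for illustration, not real ACORD paths):

    using System.Xml.Linq;
    using System.Xml.XPath;

    public static class QuoteReader
    {
        public static decimal ReadPremium(string path)
        {
            XDocument doc = XDocument.Load(path);

            // Magic-string XPath instead of a strongly typed model.
            XElement amt = doc.XPathSelectElement("//PersPolicy/CurrentTermAmt/Amt");

            // Everything comes back as a string, so conversion utilities are needed.
            return amt != null ? decimal.Parse(amt.Value) : 0m;
        }
    }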
From my perspective: I have modeled part of ACORD into an object model using the XmlSerializer, but it's well over 500 classes. The model was not generated from the XSD or other tooling, but crafted manually, and it took some time. Tooling will produce monstrous, unusable classes (as you have mentioned) and/or flat out crash. As an example, I tried to load the XSD into Stylus Studio and it crashed several times.
So, your best bet if you're strapped for time is loading into an XDocument as opposed to trying to map out everything in a model. I know that sucks, but ACORD in general is basically a huge data hot mess.

What is faster in xml parsing: elements or attributes?

I am writing code that parses XML.
I would like to know what is faster to parse: elements or attributes.
This will have a direct effect over my XML design.
Please target the answers to C# and the differences between LINQ and XmlReader.
Thanks.
Design your XML schema so that the representation of the information actually makes sense. Usually, the decision between making something an attribute or an element will not affect performance.
Performance problems with XML are in most cases related to large amounts of data being represented in a very verbose XML dialect. A typical countermeasure is to zip the XML data when storing it or transmitting it over the wire.
If that is not sufficient then switching to another format such as JSON, ASN.1 or a custom binary format might be the way to go.
Addressing the second part of your question: the main difference between the XDocument (LINQ) and XmlReader classes is that the XDocument class builds a full document object model (DOM) in memory, which might be an expensive operation, whereas the XmlReader class gives you a tokenized stream over the input document.
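A small sketch of the practical difference (file and element names are assumptions):

    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    public static class PriceSummer
    {
        // DOM approach: the whole document is built in memory first.
        public static decimal SumWithXDocument(string path)
        {
            return XDocument.Load(path)
                            .Descendants("price")
                            .Sum(p => (decimal)p);
        }

        // Streaming approach: one node at a time, roughly constant memory.
        public static decimal SumWithXmlReader(string path)
        {
            decimal sum = 0;
            using (var reader = XmlReader.Create(path))
            {
                while (reader.ReadToFollowing("price"))
                    sum += reader.ReadElementContentAsDecimal();
            }
            return sum;
        }
    }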
With XML, speed is dependent on a lot of factors.
With regards to attributes or elements, pick the one that more closely matches the data. As a guideline, we use attributes for, well, attributes of an object; and elements for contained sub objects.
Depending on the amount of data you are talking about, using attributes can save you a bit on the size of your XML streams. For example, <person id="123" /> is smaller than <person><id>123</id></person>. This doesn't really impact the parsing, but it will impact the speed of sending the data across a network wire or loading it from disk... If we are talking about thousands of such records then it may make a difference to your application.
Of course, if that actually does make a difference then using JSON or some binary representation is probably a better way to go.
The first question you need to ask is whether XML is even required. If it doesn't need to be human readable then binary is probably better. Heck, a CSV or even a fixed-width file might be better.
With regards to LINQ vs XmlReader, this is going to boil down to what you do with the data as you are parsing it. Do you need to instantiate a bunch of objects and handle them that way or do you just need to read the stream as it comes in? You might even find that just doing basic string manipulation on the data might be the easiest/best way to go.
Point is, you will probably need to examine the strengths of each approach beyond just "what parses faster".
Without having any hard numbers to prove it, I know that the WCF team at Microsoft chose to make the DataContractSerializer their standard for WCF. It's limited in that it doesn't support XML attributes, but it is indeed up to 10-15% faster than the XmlSerializer.
From that information, I would assume that using XML attributes will be slower to parse than if you use only XML elements.

LinqToXml OR Xml?

I am trying to do some processing on XML and I am tired of working with the Xml namespace classes.
I find them really hard.
So I am trying to do the same things with the LINQ to XML classes, and I see that much more is possible (or at least easier) with LINQ to XML than with the Xml classes. So I am really confused about which I should use.
When and why would you prefer the Xml classes over the LINQ to XML classes?
Edit:
What is possible with the Xml classes that is not possible with LINQ to XML?
General methods for manipulating and processing XML data have been available since the first version of the .NET Framework. LINQ has only been available since .NET Framework 3.5. That's one reason to use the plain Xml classes: if you target an older version of the .NET Framework.
LINQ to XML also takes a different approach. It's more about querying XML data than reading it. So in some situations LINQ to XML will be the easiest way to do things, but not in every situation. The ease of LINQ to XML may also be a bad thing in the hands of an unskilled developer: it is, IMHO, much easier to produce code with huge performance issues with LINQ than without.
Working exclusively in the LINQ to XML space is good for querying (the Q in LINQ), and projecting the original XML data into new structures, but it's unwieldy in some cases and impossible in others to manipulate the XML directly.
One can make an argument that in many cases this is still a good thing, since immutability does reduce the chance of some kinds of bugs, but it definitely can have costs - not perfectly aligning with every situation, and sometimes performance.

Any reason not to use XmlSerializer?

I just learned about the XmlSerializer class in .Net. Before I had always parsed and written my XML using the standard classes. Before I dive into this, I am wondering if there are any cases where it is not the right option.
EDIT: By standard classes I mean XmlDocument, XmlElement, XmlAttribute...etc.
There are many constraints when you use the XmlSerializer:
- You must have a parameterless constructor (as mentioned by idlewire in the comments, it doesn't have to be public)
- Only public properties are serialized
- Interface types can't be serialized
- and a few others...
These constraints often force you to make certain design decisions that are not the ones you would have made in other situations... and a tool that forces you to make bad design decisions is usually not a good thing ;)
That being said, it can be very handy when you need a quick way to store simple objects in XML format. I also like the fact that you have pretty good control over the generated schema.
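As a quick illustration of those constraints and of the control you get over the output (the Person type and all names are invented):

    using System;
    using System.IO;
    using System.Xml.Serialization;

    public class Person
    {
        public Person() { }            // a parameterless constructor is required

        [XmlAttribute("id")]           // serialized as an attribute
        public int Id { get; set; }    // only public members are serialized

        [XmlElement("FullName")]       // control over the generated element name
        public string Name { get; set; }
    }

    public static class Demo
    {
        public static void Main()
        {
            var serializer = new XmlSerializer(typeof(Person));
            using (var writer = new StringWriter())
            {
                serializer.Serialize(writer, new Person { Id = 1, Name = "Ada" });
                // Produces something like <Person id="1"><FullName>Ada</FullName></Person>
                // (plus the XML declaration and namespace attributes).
                Console.WriteLine(writer.ToString());
            }
        }
    }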
Well, it doesn't give you quite as much control over the output, obviously. Personally I find LINQ to XML makes it sufficiently easy to write this by hand that I'm happy to do it that way, at least for reasonably small projects. If you're using .NET 3.5 or 4 but not using LINQ to XML, look into it straight away - it's much much nicer than the old DOM.
Sometimes it's nice to be able to take control over serialization and deserialization... especially when you change the layout of your data. If you're not in that situation and don't anticipate being in it, then the built-in XML serialization would probably be fine.
EDIT: I don't think XML serialization supports constructing genuinely immutable types, whereas this is obviously feasible from hand-built construction. As I'm a fan of immutability, that's definitely something I'd be concerned about. If you implement IXmlSerializable I believe you can make do with public immutability, but you still have to be privately mutable. Of course, I could be wrong - but it's worth checking.
The XmlSerializer can save you a lot of trouble if you are regularly serializing and deserializing the same types, and if you need the serialized representations of those types to be consumable by different platforms (e.g. Java, JavaScript, etc.). I do recommend using the XmlSerializer when you can, as it can alleviate a considerable amount of hassle in managing the conversion from object graph to XML yourself.
There are some scenarios where use of XmlSerializer is not the best approach. Here are a few cases:
- When you need fast, forward-only processing of large volumes of XML data: use an XmlReader instead
- When you need to perform repeated searches within an XML document using XPath (one option for this case is sketched below)
- When the XML document structure is rather arbitrary, and does not regularly conform to a known object model
- When the XmlSerializer imposes requirements that do not satisfy your design mandates:
  - Don't use it when you can't have a default public constructor
  - Don't use it when the serializer attributes can't define the variant element and attribute names needed to conform to the required XML schema
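For the repeated-XPath case, one option (not the only one) is XPathDocument, a read-only in-memory store optimized for XPath queries; the file name and query here are illustrative:

    using System;
    using System.Xml.XPath;

    public static class OrderFinder
    {
        public static void Run()
        {
            var doc = new XPathDocument("orders.xml");
            XPathNavigator nav = doc.CreateNavigator();

            // The same in-memory document can be queried repeatedly without re-parsing.
            XPathNodeIterator open = nav.Select("//order[@status='open']");
            while (open.MoveNext())
                Console.WriteLine(open.Current.GetAttribute("id", ""));
        }
    }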
I find the major drawbacks of the XmlSerializer are:
1) For complex object graphs involving collections, sometimes it is hard to get exactly the XML schema you want by using the serialization control attributes.
2) If you change the class definitions between one version of the app and the next, your files will become unreadable.
Yes, I personally use automatic XML serialization, although I use the DataContractSerializer (initially brought in because of WCF) instead, since the ability to serialize types without any attributes at all is very helpful, and it doesn't embed type information in the output. You do, of course, therefore need to know the type of object you are deserializing when loading it back in.
The big problem with that approach is that it's difficult to serialize to XML attributes without implementing IXmlSerializable on the type whose data you want written that way, or exposing some other types that the serializer can handle natively.
I guess the biggest gotcha with this is that you can't serialise interfaces automatically, because the DCS wants to be able to construct instances again when it receives the XML back. Standard collection interfaces, however, are supported natively.
All in all, though, I've found the DCS route to be the fastest and most pain-free way.
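A minimal sketch of the DCS route (the Order type is invented for the example):

    using System;
    using System.IO;
    using System.Runtime.Serialization;

    public class Order               // no attributes needed for a simple POCO
    {
        public int Id { get; set; }
        public string Customer { get; set; }
    }

    public static class DcsDemo
    {
        public static void Run()
        {
            var dcs = new DataContractSerializer(typeof(Order));
            using (var stream = new MemoryStream())
            {
                dcs.WriteObject(stream, new Order { Id = 7, Customer = "Contoso" });
                stream.Position = 0;

                // The type must be known up front when reading back.
                var order = (Order)dcs.ReadObject(stream);
                Console.WriteLine(order.Customer);
            }
        }
    }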
As an alternative, you could also investigate using LINQ to XML to read and write the XML if you want total control -- but you'll still have to process types on a member-by-member basis with this.
I've been looking at that recently (having avoided it like the plague because I couldn't see the point) after having read about it in the early access of Jon Skeet's new book. I have to say, I'm most impressed with how easy it makes working with XML.
I've used XmlSerializer a lot in the past and will probably continue to use it. However, the greatest pitfall is one already mentioned above:
The constraints on the serializer (such as restriction to public members) either 1) impose design constraints on the class that have nothing to do with its primary function, or 2) force an increase in complexity in working around these constraints.
Of course, other methods of Xml serialization also increase the complexity.
So I guess my answer is that there's no right or wrong answer that fits all situations; choosing a serialization method is just one design consideration among many others.
There are some scenarios:
You have to deal with a LOT of XML data - the serializer may overload your memory. I had that once with a simple schema that contained a database dump of 2000 or so tables. Only a handful of classes, but in the end serialization did not work - I had to use a streaming (SAX-style) parser.
Besides that, I do not see any under normal circumstances. It is much easier to deal with the XmlSerializer than with the lower-level parsers, especially for more complex data.

When you want to transmit a lot of data and you have very limited resources.

Why would I choose XSLT or XQuery over the other for html document generation?

I was researching alternatives to using Microsoft's XslCompiledTransform, and everything seemed to point primarily towards Saxon and secondarily towards XQSharp. As I started to look at the documentation for Saxon, I saw that XQuery could do the equivalent of my XSLTs, which are nowhere near as terse as XQuery's syntax.
What advantages do XSLTs offer over XQuery to justify the much more verbose syntax?
Would it be the templating functionality that can be created?
In general, there's a lot of overlap; both are rooted in an underlying XPath implementation. As for whether to use XSLT or XQuery, the proof is in the pudding: XSLT is better at transforms, and XQuery is better at queries.
So, use XQuery for:
- smaller, less complicated things
- highly structured data (XQuery is very strongly typed and prone to complaining)
- data from a database
- extracting a small piece of information

Conversely, use XSLT for:
- copying a document with only incremental changes
- larger, more complicated things
- loosely (or badly) structured data
XSLT is designed to take one XML document and transform it into something else, e.g. CSV, HTML, or a different XML format such as XHTML.
XQuery is designed to extract information from one or more xml documents, and combine the result into a new xml document.
Both XQuery and XSLT rely heavily on XPath. If your transformation is based on one input XML document and one output XML document, the two can pretty much be interchanged.
The FLWOR syntax of XQuery is quite intuitive if you have an SQL background. IMO, XSLT is the more powerful language when dealing with one-input/one-output situations, especially if the output will not be XML.
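For instance, a tiny FLWOR expression (the document and element names are invented):

    for $b in doc("books.xml")//book
    where $b/price > 30
    order by $b/title
    return $b/title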
Personally, I find the XML-based syntax and the declarative nature of XSLT slightly difficult to read and maintain.
It really boils down to choice, although using XQuery for "simple" formatting is slightly unusual. If your input is based on more than one XML document, you are pretty much stuck with XQuery; if your output is not XML based, you are pretty much stuck with XSLT.
The biggest reason to move away from XslCompiledTransform is that it is merely an XSLT 1.0 processor.
The majority of the functionality of XSLT 2.0 and XQuery 1.0 overlaps, and for the most part they are similar languages with different syntax (a little like C# and VB).
XSLT is a lot more verbose, but its templating features add a lot of functionality that can be fairly cumbersome to replicate in XQuery, particularly making small changes to node trees. The main XSLT features that are particularly cumbersome to reproduce in XQuery are things like <xsl:template match="..." /> and <xsl:copy>...</xsl:copy>.
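The classic example is the XSLT identity transform, which copies a document unchanged until a more specific template overrides part of it; replicating this push-style behaviour in XQuery takes noticeably more code:

    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>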
XQuery has a much cleaner syntax (IMHO) as long as the templating features are not needed, and I find it is a lot better for more advanced computations, and retrieving data from large documents.
XQuery is often viewed as a database language. Whilst a lot of databases use it this way, that is not its only use. Some implementations of the language in databases are very restricted. Another commenter claims that XQuery is "very strongly typed". Unless you are using the static typing feature, XQuery is no more strongly typed than XSLT. Whilst some database implementations force you to use the static typing features, most other implementations are moving away from this.
He also claims that XQuery is not very good for "larger, more complicated things". I would have argued exactly the opposite. The conciseness and flavour of the syntax makes it far easier to write complicated functions and computations in XQuery. I have written a raytracer in XQuery, which feels really quite natural; I think it would be a lot harder (certainly more verbose) to write something this computationally complex in XSLT.
In summary:
XSLT is more natural for transformation. It is better if you have a document with roughly the right structure and you want to transform the components, for example rendering an HTML version of an XML file.
XQuery is more natural for building a new XML document from various sources, or for changing the structure of a document.
Note that the overlap is rather large and there is often no "right" choice, but if you make the wrong choice then you tend to find yourself working against the grain.
XSLT and XQuery do two different things. XSLT, as the name suggests, is used to transform data from one form into another (e.g. from XML to HTML). On the other hand, XQuery is a language used to find and extract certain XML nodes from an XML document, which can then be used for any number of purposes.
XSLT actually shares much of its underlying functionality with XQuery (both are built on XPath); take a look at the tutorials on www.w3schools.com. When used correctly, they are both very powerful technologies.
