What would be the best way to validate XML?

What would be the best way to validate XML? - c#

I been looking at XML Serialization for C# and it looks interesting. I was reading this tutorial
http://www.switchonthecode.com/tutorials/csharp-tutorial-xml-serialization
and of course you can de serialize it back to a list of objects. So I am wondering would it be better to de serialize it back to to a list of objects and then go through each object and validate it or validate it by using a schema then de serializing it and doing stuff with it?
http://support.microsoft.com/kb/307379
Thanks

I guess it would depend a bit on what you want to validate, and for what purpose. If it is intended for interop to other systems, then validating via xsd is a reasonable idea not least because you can use xsd.exe to write your classes for you from the xsd (you can also generate xsd from xml or dll, but it isn't as accurate). Likewise you can use XmlReader (appropriately configured) to check against xsd,
If you just want valid .NET objects, I'd be tempted to leave the serialized form as an implementation detail, and write some C# validation code - perhaps implementing IDataErrorInfo, or using data-annotations.

You can create an XmlValidatingReader and pass that into your serializer. That way you can read the file in one pass and validate it at the same time.
I believe the same technique will work even if you are using hand rolled XML classes (for extremely large XML files) so you might find it worth a look.
Edit:
Sorry just reread some of my code, XmlValidatingReader is obsolete, you can do what you need with the XmlReader.
See XmlReader Settings

For speed I would do it in C#, however for completeness you might want to do it using an XSD. The issue with that is you have to learn the verbose and cumbersome XSD syntax, which from experience takes a lot of trial and error, is time consuming and holds not a lot of reward for serialization. Particularly with constants where you have to map them in C# and also in the XSD.
You'll always be writing the XML as C#. Anything not known when read back in is simply ignored. If you aren't editing the XML with a text editor you can guarantee that it will come back in the right way, in which case XSD is definitely not needed.

If you validate the XML, you can only prove that it's structurally correct. An attempt to deserialize from the XML will tell you the same thing.
Typically business objects can implement business logic/rules/conditions that go beyond a valid schema. That type of knowledge should stay with the business objects themselves, rather than being duplicated in some sort of external validation routine (otherwise, if you change a business rule, you have to update the validator at the same time).

Related

C# Xml Serialization big purpose?

I was just wondering about XML Serialization. If i understand correctly, the main reason for using it is that it lets you transport your object data more easily, am I right? Also, i tried serializing data using a constructor but it says that that you can only serialize data that are "parameterless". The thing is I like constructors because it allows me to have for example a Player class, and adding a new player with all properties is much more productive than having to set all properties one by one.
So the big question here is, what's the BIG purpose of XML serialization, what are the ways to use it? the way I see it is that it adds another level of complexity to my code, because i now need a class to serialize my data. Can someone shed some light?!

If you're talking about the overall purpose of serialization, strictly speaking, serialization (note that I said "serialization," not "XML Serialization" - more on that in a second) doesn't just make transporting objects easier, it's the only way you could transport an object.
As indicated in Pablo Santa Cruz's answer, XML is one of many ways you can serialize data. If you're going to save or send data somewhere, by definition you must first have some way to represent it. Serialization basically means that you represent your object state in some specified format. Deserialization is the opposite - given some representation of an object state, reconstruct what the original object state was.
In that sense, XML serialization, saving an object state to a database somehow, saving it as JSON, saving it in some binary format, and saving in some XML format are all examples of serialization (because you're representing the object state in a pre-defined format for later use).
While any defined format can technically be serialization, there are several standard ways of doing that. XML and JSON are by far the most common formats because they're standardized, easy to parse, easy to constrain (e.g. with XML Schema), are widely supported by libraries, can be relatively human-readable (which makes debugging easier), and they're widely used.
In case the last point sounds a little odd (they're widely used because they're widely used), standards by their very nature tend to have a strong network effect. In other words, the more people adapt them the more useful they are; for example, it's only useful to have email if you can actually use it to contact other people - it wouldn't be even slightly useful to have email if you were the only one using it.
A lot of standards and technologies will win out over competitors more because they have more early adapters than because they're necessarily technically superior. For example, even if someone could clearly prove that OS X is a "better" operating system than Windows, it wouldn't matter because there's vastly more software developed for Windows and it would be prohibitively expensive for people to try to switch to OS X. (You could make a similar argument for Token Ring vs. Ethernet).

Serialization is for storing object representation somehow (on a disk file, on the wire {network transportation}, on a HTTP session, on a database). XML Serialization is just one type of serialization.
The reason you need a parameter-less constructor to support serialization, is that the AUTO DESERIALIZER needs to create an EMPTY (with no o little data) class before start populating it with the corresponding data.
You don't need to use ONE WAY or THE OTHER, because you can have a class with multiple constructor (the parameter-less one will be used on deserialization, and you can use the other one wherever you need in your code).

Easiest way to serialize and store objects in c#?

Im looking for a simple solution to serialize and store objects that contain configuration, application state and data. Its a simple application, its not alot of data. Speed is no issue. I want it to be in-process. I want it to be more easy-to-edit in a texteditor than xml.
I cant find any document database for .net that can handle it in-process.
Simply serializing to xml Im not sure I want to do because its... xml.
Serializing to JSON seems very javascript specific, and I wont use this data in javascript.
I figure there's very neat ways to do this, but atm im leaning to using JSON despite its javascript inclenation.

Just because "JSON" it's an acronym for JavaScript Object Notation, has no relevance on if it fits your needs or not as a data format. JSON is lightweight, text based, easily human readable / editable and it's a language agnostic format despite the name.
I'd definitely lean toward using it, as it sounds pretty ideal for your situation.

I will give a couple of choices :
Binary serialization: depends on content of your objects, if you have complicated dependecy tree it can create a problems on serializing. Also it's not very flexible, as standart binary serialization provided by Microsoft stores saving type information too. That means if you save a type in binary file, and after one month decide to reorganize your code and let's say move the same class to another namespace, on desirialization from binary file previously saved it will fail, as the type is not more the same. There are several workarrounds on that, but I personally try to avoid that kind of serialization as much as I can.
ORM mapping and storing it into small database. SQLite is awesome choice for this kind of stuff as it small (single file) and full ACID support database. You need a mapper, or you need implement mapper by yourself.
I'm sure that you will get some other choice from the folks in a couple of minutes.
So choice is up to you.
Good luck.

Can a DataContractSerializer be setup to ignore errors in a file rather then just fail entirely?

I'm using DataContractSerializer to save a large number of different classes which make up a tree structures to XML files. I'm in the initial stages of writing this software so at this point all the different components are changing around quite a bit. Yet every time I make a change to a class I end up breaking my programs ability to open previously saved files.
My tree structures will still be functional if components are missing. Is there some way to tell DataContractSerializer to skip over data it has a problem deserializing and continue on rather then just quitting at the first problem it has?
I know one answer would be to write my own serialization class, but I'd rather not spend the time to do that. I was hopping to still be able to take advantage of DataContractSerializer, but with out it being an all or nothing situation.

I think what you're looking for is IExtensibleDataObject. This way, any unexpected elements get read into a name-value dictionary maintained internally, and can even be serialized back later. See the following resources for help.
Blog post -- WCF Extensibility – Other Serialization Extensions
Forward-Compatible Data Contracts
Data Contract Versioning

What is the best approach to generalize and aggregate XML dumps in C#?

Here is the business part of the issue:
Several different companies send a
XML dump of the information to be
processed.
The information sent by the companies
are similar ... not exactly same.
Several more companies would be soon
enlisted and would start sending
information
Now, the technical part of the problem is I want to write a generic solution in C# to accommodate this information for processing. I would be transforming the XML in my C# class(es) to fit in to my database model.
Is there any pattern or solution for this issue to be handled generically without needing to change my solution in case of addition of many companies later?
What would be the best approach to write my parser/transformer?

This is how I have done something similar in the past.
As long as each company has its own fixed format which they use for their XML dump,
Have an specific XSLT for each company.
Have a way of indicating which dump is sourced from where (maybe different DUMP folders for each company )
In your program, based on 2, select 1 and apply it to the DUMP
All the XSLT's will transform the XML to your one standard database schema
Save this to your DB
Each new company addition is at the most a new XSLT
In cases where the schema is very similar, the XSLT's can be just re-used and then specific changes made to them.
Drawback to this approach: Debugging XSLT's can be a bit more painful if you do not have the right tools. However a LOT of XML Editors (eg XML Spy etc) have excellent XSLT debugging capabilities.

Sounds to me like you are just asking for a design pattern (or set of patterns) that you could use to do this in a generic, future-proof manner, right?
Ideally some of the attributes that you probably want
Each "transformer" is decoupled from one another.
You can easily add new "transformers" without having to rewrite your main "driver" routine.
You don't need to recompile / redeploy your entire solution every time you modify a transformer, or at least add a new one.
Each "transformer" should ideally implement a common interface that your driver routine knows about - call it IXmlTransformer. The responsibility of this interface is to take in an XML file and to return whatever object model / dataset that you use to save to the database. Each of your transformers would implement this interface. For common logic that is shared by all transformers you could either create a based class that all inherit from, or (my preferred choice) have a set of helper methods which you can call from any of them.
I would start by using a Factory to create each "transformer" from your main driver routine. The factory could use reflection to interrogate all assemblies it can see that, or something like MEF which could do a lot of the work for you. Your driver logic should use the factory to create all the transformers and store them.
Then you need some logic and mechanism to "lookup" each XML file received to a given Transformer - perhaps each XML file has a header that you could use to identify or something similar. Again, you want to keep these decoupled from your main logic so that you can easily add new transformers without modification of the driver routine. You could e.g. supply the XML file to each transformer and ask it "can you transform this file", and it is up to each transformer to "take responsibility" for a given file.
Every time your driver routine gets a new XML file, it looks up the appropriate transformer, and runs it through; the result gets sent to the DB processing area. If no transformer can be found, you dump the file in a directory for interrogation later.
I would recommend reading a book like Agile Principles, Patterns and Practices by Robert Martin (http://www.amazon.co.uk/Agile-Principles-Patterns-Practices-C/dp/0131857258), which gives good examples of appropriate design patterns for situations like yours e.g. Factory and DIP etc.
Hope that helps!

Solution proposed by InSane is likley the most straigh forward and definitely XML friendly approach.
If you looking for writing your own code to do conversion of different data formats than implementing multiple reader entities that would read data from each distinct format and transform to unified format, than your main code would work with this entities in unified way, i.e. by saving to the database.
Search for ETL - (Extract-Trandform-Load) to get more information - What model/pattern should I use for handling multiple data sources? , http://en.wikipedia.org/wiki/Extract,_transform,_load

Using XSLT as proposed in the currently most upvoted answer, is just moving the problem, from c# to xslt.
You are still changing the pieces that process the xml, and you are still exposed to how good/poor is the code structured / whether it is in c# or rules in the xslt.
Regardless if you keep it in c# or go xslt for those bits, the key is to separate the transformation of the xml you receive from the various companies into a unique format, whether that's an intermediate xml or a set of classes where you load the data you are processing.
Whatever you do avoid getting clever and trying to define your own generic transformation layer, if that's what you want Do use XSLT since that's what's for. If you go with c#, keep it simple with a transformation class for each company that implements the simplest interface.
On the c# way, keep any reuse you may have between the transformations to composition, don't even think of inheritance to do so ... this is one of the areas where it gets very ugly quickly if you go that way.

Have you considered BizTalk server?

Just playing the fence here and offering another solution for other readers.
The easiest way to get the data into your models within C# is to use XSLT to convert each companies data into a serialized form of your models. These are the basic steps I would take:
Create a complete model of all your data and use XmlSerializer to write out the model.
Create an XSLT that takes Company A's data and converts it into a valid serialized xml model of your data. Use the previously created XML file as a reference.
Use Deserialize on the new XML you just created. You will now have a reference to your model object containing all the data from the company.

Any reason not to use XmlSerializer?

I just learned about the XmlSerializer class in .Net. Before I had always parsed and written my XML using the standard classes. Before I dive into this, I am wondering if there are any cases where it is not the right option.
EDIT: By standard classes I mean XmlDocument, XmlElement, XmlAttribute...etc.

There are many constraints when you use the XmlSerializer:
You must have a public parameterless constructor (as mentioned by idlewire in the comments, it doesn't have to be public)
Only public properties are serialized
Interface types can't be serialized
and a few others...
These constraints often force you to make certain design decisions that are not the ones you would have made in other situations... and a tool that forces you to make bad design decisions is usually not a good thing ;)
That being said, it can be very handy when you need a quick way to store simple objects in XML format. I also like that fact that you have a pretty good control over the generated schema.

Well, it doesn't give you quite as much control over the output, obviously. Personally I find LINQ to XML makes it sufficiently easy to write this by hand that I'm happy to do it that way, at least for reasonably small projects. If you're using .NET 3.5 or 4 but not using LINQ to XML, look into it straight away - it's much much nicer than the old DOM.
Sometimes it's nice to be able to take control over serialization and deserialization... especially when you change the layout of your data. If you're not in that situation and don't anticipate being in it, then the built-in XML serialization would probably be fine.
EDIT: I don't think XML serialization supports constructing genuinely immutable types, whereas this is obviously feasible from hand-built construction. As I'm a fan of immutability, that's definitely something I'd be concerned about. If you implement IXmlSerializable I believe you can make do with public immutability, but you still have to be privately mutable. Of course, I could be wrong - but it's worth checking.

The XmlSerializer can save you a lot of trouble if you are regularly serializing and deserializing the same types, and if you need the serialized representations of those types to be consumable by different platforms (i.e. Java, Javascript, etc.) I do recommend using the XmlSerializer when you can, as it can alleviate a considerable amount of hassle trying to manage conversion from object graph to XML yourself.
There are some scenarios where use of XmlSerializer is not the best approach. Here are a few cases:
When you need to quickly, forward-only process large volumes of xml data
Use an XmlReader instead
When you need to perform repeated searches within an xml document using XPath
When the xml document structure is rather arbitrary, and does not regularly conform to a known object model
When the XmlSerializer imposes requirements that do not satisfy your design mandates:
Don't use it when you can't have a default public constructor
You can't use the xml serializer attributes to define xml variants of element and attribute names to conform to the necessary Xml schema

I find the major drawbacks of the XmlSerializer are:
1) For complex object graphs involving collections, sometimes it is hard to get exactly the XML schema you want by using the serialization control attributes.
2) If you change the class definitions between one version of the app and the next, your files will become unreadable.

Yes, I personally use automatic XML serialization - although I use DataContractSerializer initially brought in because of WCF instead (ability to serialize types without attributes at all is very helpful) as it doesn't embed types in there. Of course, you therefore need to know the type of object you are deserializing when loading back in.
The big problem with that is it's difficult to serialize to attributes as well without implementing IXmlSerializable on the type whose data you might want to be written so, or exposing some other types that the serializer can handle natively.
I guess the biggest gotcha with this is that you can't serialise interfaces automatically, because the DCS wants to be able to construct instances again when it receives the XML back. Standard collection interfaces, however, are supported natively.
All in all, though, I've found the DCS route to be the fastest and most pain-free way.
As an alternative, you could also investigate using Linq to XML to read and write the XML if you want total control - but you'll still have to process types on a member by member basis with this.
I've been looking at that recently (having avoided it like the plague because I couldn't see the point) after having read about it the early access of Jon Skeet's new book. Have to say - I'm most impressed with how easy it makes it to work with XML.

I've used XmlSerializer a lot in the past and will probably continue to use it. However, the greatest pitfall is one already mentioned above:
The constraints on the serializer (such as restriction to public members) either 1) impose design constraints on the class that have nothing to do with its primary function, or 2) force an increase in complexity in working around these constraints.
Of course, other methods of Xml serialization also increase the complexity.
So I guess my answer is that there's no right or wrong answer that fits all situations; chosing a serialization method is just one design consideration among many others.

Thera re some scenarios.
You have to deal with a LOT of XML data -the serializer may overlaod your memory. I had that once for a simple schema that contained a database dump for 2000 or so tables. Only a handfull of classes, but in the end serialization did not work - I had to use a SAX streaming parser.
Besides that - I do not see any under normal circumstances. It is a much easier way to deal with the XML Serializer than to use the lower level parser, especially for more complex data.

When You want to transmit lot of data and You have very limited resources.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.