LINQ to XML or the Xml classes? - C#

I am trying to do some processing on XML and I am getting tired of working with the Xml namespace classes.
I find them really hard to use.
So I am trying to do the same things with the LINQ to XML classes, and I see that many things are possible with LINQ to XML that I used to do with the Xml classes. Now I am confused about which one I should use.
When and why do you prefer the Xml classes instead of the LINQ to XML classes?
Edit:
What is not possible to do with LINQ to XML that is possible with the Xml classes?

The general classes used to manipulate and process XML data have been available since the first version of the .NET Framework. LINQ is only available since .NET Framework 3.5. That's one reason to use the plain Xml classes: you may be targeting an older version of the .NET Framework.
LINQ to XML also takes a different approach. It's more about querying XML data than reading it, so in some situations LINQ to XML will be the easiest way to do things, but not in every situation. The ease of LINQ to XML can also be a bad thing in the hands of an unskilled developer: it is, IMHO, much easier to produce code with huge performance issues with LINQ than without.
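To give a feel for the two styles, here is a minimal sketch (the document and element names are invented for illustration) reading the same value with the classic XmlDocument API and with LINQ to XML:

    using System;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    class Demo
    {
        static void Main()
        {
            // Invented document, purely for illustration.
            const string xml = "<catalog><book id=\"1\"><title>CLR via C#</title></book></catalog>";

            // Classic XmlDocument (available since .NET 1.0): build a DOM, then XPath into it.
            var doc = new XmlDocument();
            doc.LoadXml(xml);
            XmlNode node = doc.SelectSingleNode("/catalog/book/title");
            if (node != null) Console.WriteLine(node.InnerText);

            // LINQ to XML (since .NET 3.5): the same lookup expressed as a query over the tree.
            XDocument xdoc = XDocument.Parse(xml);
            string title = xdoc.Root
                               .Elements("book")
                               .Select(b => (string)b.Element("title"))
                               .FirstOrDefault();
            Console.WriteLine(title);
        }
    }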

Working exclusively in the LINQ to XML space is good for querying (the Q in LINQ), and projecting the original XML data into new structures, but it's unwieldy in some cases and impossible in others to manipulate the XML directly.
One can make an argument that in many cases this is still a good thing, since immutability does reduce the chance of some kinds of bugs, but it definitely can have costs: it doesn't align perfectly with every situation, and sometimes there is a performance price.
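As a small illustration of that query-and-project style (the input shape is invented), this builds a brand-new tree from the source instead of modifying it:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class ProjectionDemo
    {
        static void Main()
        {
            // Invented input shape, purely for illustration.
            XElement orders = XElement.Parse(
                "<orders>" +
                "  <order id=\"1\" customer=\"Ann\" total=\"12.50\" />" +
                "  <order id=\"2\" customer=\"Bob\" total=\"8.00\" />" +
                "</orders>");

            // Project the source data into a brand-new tree rather than
            // mutating the original document.
            XElement summary = new XElement("customers",
                from o in orders.Elements("order")
                select new XElement("customer",
                    new XAttribute("name", (string)o.Attribute("customer")),
                    new XAttribute("total", (decimal)o.Attribute("total"))));

            Console.WriteLine(summary);
        }
    }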

Related

Retrieving data LINQ vs Reflection

I was hoping someone could tell me which is the more efficient and/or correct way to retrieve some data.
I have some XML files coming from a 3rd party and their attached DTDs. So I've converted the DTD into a C# Class so I can deserialize the XML into the classes. I now need to map that data to match the way my data structures are set up.
The question, ultimately, is: should I use reflection or LINQ? The format of the XML is somewhat generic by design, where things are held in Items [Array] or Item [Object].
I've done the following:
TheirClass theirClass = theirMessage.Items.Where(n => n.GetType() == typeof(TheirClass)).First() as TheirClass;
MyObject.Param1 = ConversionHelperClass.Convert(theirClass.Obj1);
MyObject.Param2 = ConversionHelperClass.Convert(theirClass.Obj2);
I can also do some stuff with Reflection where I pass in the names of the Classes and Attributes I'm trying to snag.
Trying to do things the right way here.
As a general rule I'd suggest avoiding reflection unless it is absolutely necessary! It introduces a performance overhead AND means that you miss out on all of the lovely compile time checks that the compiler team have worked so hard to give us.
LINQ to Objects essentially queries against an in-memory data set, so it can be very fast.
If your ultimate goal is to parse information from an xml document, I'd suggest checking out the XDocument class. It provides a very nice abstraction for querying xml documents.
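For example, assuming the theirMessage, ConversionHelperClass and MyObject types from the question, OfType<T>() does the filter and cast in one step, with compile-time checking and no reflection:

    TheirClass source = theirMessage.Items.OfType<TheirClass>().First();
    MyObject.Param1 = ConversionHelperClass.Convert(source.Obj1);
    MyObject.Param2 = ConversionHelperClass.Convert(source.Obj2);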

How/Can I use linq to xml to query huge xml files with reasonable memory consumption?

I've not done much with linq to xml, but all the examples I've seen load the entire XML document into memory.
What if the XML file is, say, 8GB, and you really don't have the option?
My first thought is to use the XElement.Load Method (TextReader) in combination with an instance of the FileStream Class.
QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?
Note: high performance isn't required. I'm basically trying to get LINQ to XML to do the work of a program I could write that loops through every line of my big file and gathers the data up; since LINQ is "loop centric", I'd expect this to be possible...
Using XElement.Load will load the whole file into memory. Instead, use XmlReader with the XNode.ReadFrom function, which lets you selectively load nodes found by XmlReader as XElement objects for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
If you just need to search the xml document, XmlReader alone will suffice and will not load the whole document into the memory.
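For reference, a minimal sketch of that XmlReader + XNode.ReadFrom combination (the file name and element name are placeholders); it streams one matching element at a time, so memory use stays flat no matter how big the file is:

    using System;
    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    static class BigXmlDemo
    {
        // Streams one matching element at a time; only the current subtree is
        // materialized as an XElement. "huge.xml" and "item" are placeholder names.
        static IEnumerable<XElement> StreamElements(string path, string elementName)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                    {
                        // ReadFrom materializes this subtree and advances the
                        // reader past it, so no extra Read() is needed here.
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }

        static void Main()
        {
            foreach (XElement item in StreamElements("huge.xml", "item"))
            {
                // Each fragment is a normal, fully queryable XElement.
                Console.WriteLine((string)item.Attribute("id"));
            }
        }
    }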
Gabriel,
Dude, this isn't exactly answering your ACTUAL question (how to read big xml docs using LINQ), but you might want to check out my old question What's the best way to parse big XML documents in C-Sharp. The last "answer" (timewise) was a "note to self" on what ACTUALLY WORKED. It turns out that a hybrid approach - XmlReader for the document, XmlSerializer for each doclet - is fast (enough) AND flexible.
BUT note that I was dealing with docs of up to only 150MB. If you REALLY have to handle docs as big as 8GB, then I guess you're likely to encounter all sorts of problems, including issues with the O/S's LARGE_FILE (>2GB) handling... in which case I strongly suggest you keep things as-primitive-as-possible... and XmlReader is the most primitive (and, according to my testing, THE fastest) XML parser available in the Microsoft namespace.
Also: I've just noticed a belated comment in my old thread suggesting that I check out VTD-XML... I had a quick look at it just now... It "looks promising", even if the author seems to have contracted a terminal case of FIGJAM. He claims it'll handle docs of up to 256GB; to which I reply "Yeah, have you TESTED it? In WHAT environment?" It sounds like it should work though... I've used this same technique to implement "hyperlinks" in a textual help-system, back before HTML.
Anyway good luck with this, and your overall project. Cheers. Keith.
I realize that this answer might be considered non-responsive and possibly annoying, but I would say that if you have an XML file which is 8GB, then at least some of what you are trying to do in XML should be done by the file system or database.
If you have huge chunks of text in that file, you could store them as individual files and store the metadata and the filenames separately. If you don't, you must have many levels of structured data, probably with a lot of repetition of the structures. If you can decide what counts as an individual 'record' which can be stored as a smaller XML file or in a column of a database, then you can structure your database based on the levels of nesting above that. XML is great for small and dirty jobs, and it's also good for quite unstructured data, since it is self-structuring. But if you have 8GB of data that you are going to do something meaningful with, you must (usually) be able to count on some predictable structure somewhere in it.
Storing XML (or JSON) in a database, and querying and searching both for XML records, and within the XML is well supported nowadays both by SQL stuff and by the NoSQL paradigm.
Of course you might not have the choice of not using XML files this big, or you might have some situation where they are really the best solution. But for some people reading this it could be helpful to look at this alternative.

What is faster in xml parsing: elements or attributes?

I am writing code that parses XML.
I would like to know what is faster to parse: elements or attributes.
This will have a direct effect over my XML design.
Please target the answers to C# and the differences between LINQ and XmlReader.
Thanks.
Design your XML schema so that it represents the information in a way that actually makes sense. Usually, the decision between making something an attribute or an element will not affect performance.
Performance problems with XML are in most cases related to large amounts of data being represented in a very verbose XML dialect. A typical countermeasure is to zip the XML data when storing it or transmitting it over the wire.
If that is not sufficient then switching to another format such as JSON, ASN.1 or a custom binary format might be the way to go.
Addressing the second part of your question: The main difference between the XDocument (LINQ) and the XmlReader class is that the XDocument class builds a full document object model (DOM) in memory, which might be an expensive operation, whereas the XmlReader class gives you a tokenized stream on the input document.
With XML, speed is dependent on a lot of factors.
With regards to attributes or elements, pick the one that more closely matches the data. As a guideline, we use attributes for, well, attributes of an object, and elements for contained sub-objects.
Depending on the amount of data you are talking about, using attributes can save you a bit on the size of your xml streams. For example, <person id="123" /> is smaller than <person><id>123</id></person>. This doesn't really impact the parsing, but it will impact the speed of sending the data across a network wire or loading it from disk... If we are talking about thousands of such records then it may make a difference to your application.
Of course, if that actually does make a difference then using JSON or some binary representation is probably a better way to go.
The first question you need to ask is whether XML is even required. If it doesn't need to be human readable then binary is probably better. Heck, a CSV or even a fixed-width file might be better.
With regards to LINQ vs XmlReader, this is going to boil down to what you do with the data as you are parsing it. Do you need to instantiate a bunch of objects and handle them that way or do you just need to read the stream as it comes in? You might even find that just doing basic string manipulation on the data might be the easiest/best way to go.
Point is, you will probably need to examine the strengths of each approach beyond just "what parses faster".
Without having any hard numbers to prove it, I know that the WCF team at Microsoft chose to make the DataContractSerializer their standard for WCF. It's limited in that it doesn't support XML attributes, but it is indeed up to 10-15% faster than the XmlSerializer.
From that information, I would assume that using XML attributes will be slower to parse than if you use only XML elements.

Any reason not to use XmlSerializer?

I just learned about the XmlSerializer class in .Net. Before I had always parsed and written my XML using the standard classes. Before I dive into this, I am wondering if there are any cases where it is not the right option.
EDIT: By standard classes I mean XmlDocument, XmlElement, XmlAttribute...etc.
There are many constraints when you use the XmlSerializer:
You must have a parameterless constructor (as mentioned by idlewire in the comments, it doesn't have to be public)
Only public properties are serialized
Interface types can't be serialized
and a few others...
These constraints often force you to make certain design decisions that are not the ones you would have made in other situations... and a tool that forces you to make bad design decisions is usually not a good thing ;)
That being said, it can be very handy when you need a quick way to store simple objects in XML format. I also like the fact that you have pretty good control over the generated schema.
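For reference, a minimal round-trip with an invented Person type that satisfies those constraints:

    using System;
    using System.IO;
    using System.Xml.Serialization;

    // Invented DTO that satisfies the serializer's constraints:
    // a parameterless constructor and public read/write properties.
    public class Person
    {
        public string Name { get; set; }
        public int Age { get; set; }
    }

    class SerializerDemo
    {
        static void Main()
        {
            var serializer = new XmlSerializer(typeof(Person));
            var person = new Person { Name = "Ada", Age = 36 };

            // Serialize to a string (a FileStream or network stream works the same way).
            string xml;
            using (var writer = new StringWriter())
            {
                serializer.Serialize(writer, person);
                xml = writer.ToString();
            }

            // ...and deserialize it back.
            using (var reader = new StringReader(xml))
            {
                var copy = (Person)serializer.Deserialize(reader);
                Console.WriteLine(copy.Name + ", " + copy.Age);
            }
        }
    }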
Well, it doesn't give you quite as much control over the output, obviously. Personally I find LINQ to XML makes it sufficiently easy to write this by hand that I'm happy to do it that way, at least for reasonably small projects. If you're using .NET 3.5 or 4 but not using LINQ to XML, look into it straight away - it's much much nicer than the old DOM.
Sometimes it's nice to be able to take control over serialization and deserialization... especially when you change the layout of your data. If you're not in that situation and don't anticipate being in it, then the built-in XML serialization would probably be fine.
EDIT: I don't think XML serialization supports constructing genuinely immutable types, whereas this is obviously feasible from hand-built construction. As I'm a fan of immutability, that's definitely something I'd be concerned about. If you implement IXmlSerializable I believe you can make do with public immutability, but you still have to be privately mutable. Of course, I could be wrong - but it's worth checking.
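A rough sketch of that "publicly immutable, privately mutable" compromise, using an invented Point type and IXmlSerializable:

    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.Schema;
    using System.Xml.Serialization;

    // "Publicly immutable": no public setters, but ReadXml still has to mutate
    // private state, so the type remains privately mutable.
    public class Point : IXmlSerializable
    {
        private int _x, _y;

        public Point() { }                          // the serializer still needs this
        public Point(int x, int y) { _x = x; _y = y; }

        public int X { get { return _x; } }
        public int Y { get { return _y; } }

        public XmlSchema GetSchema() { return null; }

        public void ReadXml(XmlReader reader)
        {
            _x = int.Parse(reader.GetAttribute("x"));
            _y = int.Parse(reader.GetAttribute("y"));
            reader.Skip();                          // consume the element
        }

        public void WriteXml(XmlWriter writer)
        {
            writer.WriteAttributeString("x", _x.ToString());
            writer.WriteAttributeString("y", _y.ToString());
        }
    }

    class ImmutableDemo
    {
        static void Main()
        {
            var serializer = new XmlSerializer(typeof(Point));
            using (var sw = new StringWriter())
            {
                serializer.Serialize(sw, new Point(3, 4));
                Console.WriteLine(sw.ToString());   // roughly: <Point x="3" y="4" />
            }
        }
    }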
The XmlSerializer can save you a lot of trouble if you are regularly serializing and deserializing the same types, and if you need the serialized representations of those types to be consumable by different platforms (i.e. Java, Javascript, etc.) I do recommend using the XmlSerializer when you can, as it can alleviate a considerable amount of hassle trying to manage conversion from object graph to XML yourself.
There are some scenarios where use of XmlSerializer is not the best approach. Here are a few cases:
When you need fast, forward-only processing of large volumes of XML data
Use an XmlReader instead
When you need to perform repeated searches within an XML document using XPath (see the sketch after this list)
When the XML document structure is rather arbitrary, and does not regularly conform to a known object model
When the XmlSerializer imposes requirements that do not satisfy your design mandates:
You can't have a default public constructor
You can't use the XML serializer attributes to define XML variants of element and attribute names to conform to the necessary XML schema
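For the repeated-XPath-search scenario above, a loaded DOM plus SelectNodes is usually the simpler tool; here is a small sketch with an invented document:

    using System;
    using System.Xml;

    class XPathDemo
    {
        static void Main()
        {
            // Invented document, purely for illustration.
            var doc = new XmlDocument();
            doc.LoadXml("<orders><order status=\"open\" /><order status=\"closed\" /></orders>");

            // Arbitrary, repeated searches over a loaded DOM - no object model needed.
            XmlNodeList open = doc.SelectNodes("/orders/order[@status='open']");
            Console.WriteLine(open.Count);   // 1
        }
    }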
I find the major drawbacks of the XmlSerializer are:
1) For complex object graphs involving collections, sometimes it is hard to get exactly the XML schema you want by using the serialization control attributes (see the sketch below).
2) If you change the class definitions between one version of the app and the next, your files will become unreadable.
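To illustrate the first drawback, the control attributes let you pin down element and attribute names for a collection, but only up to a point. The types and names below are invented for the sketch:

    using System.Collections.Generic;
    using System.Xml.Serialization;

    // Invented types. The control attributes fix the shape roughly as:
    //   <Order id="1"><lines><line sku="ABC" /></lines></Order>
    // Renaming the wrapper and the items is easy; flattening the wrapper away
    // or reshaping deeper graphs often means switching to [XmlElement] on the
    // list or hand-written IXmlSerializable, which is where it gets fiddly.
    public class Order
    {
        [XmlAttribute("id")]
        public int Id { get; set; }

        [XmlArray("lines"), XmlArrayItem("line")]
        public List<OrderLine> Lines { get; set; }
    }

    public class OrderLine
    {
        [XmlAttribute("sku")]
        public string Sku { get; set; }
    }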
Yes, I personally use automatic XML serialization - although I use the DataContractSerializer (initially brought in because of WCF) instead, since the ability to serialize types without any attributes at all is very helpful, and it doesn't embed type information in the output. Of course, you therefore need to know the type of object you are deserializing when loading it back in.
The big problem with that is that it's difficult to serialize to attributes as well, without implementing IXmlSerializable on the type whose data you want written that way, or exposing some other types that the serializer can handle natively.
I guess the biggest gotcha with this is that you can't serialise interfaces automatically, because the DCS wants to be able to construct instances again when it receives the XML back. Standard collection interfaces, however, are supported natively.
All in all, though, I've found the DCS route to be the fastest and most pain-free way.
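A minimal DataContractSerializer round-trip along those lines (the Customer type is invented; note the element-only output):

    using System;
    using System.IO;
    using System.Runtime.Serialization;

    // Invented POCO; with DataContractSerializer no attributes are required
    // (since .NET 3.5 SP1), but members become XML elements, never attributes.
    public class Customer
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    class DcsDemo
    {
        static void Main()
        {
            var dcs = new DataContractSerializer(typeof(Customer));

            using (var ms = new MemoryStream())
            {
                dcs.WriteObject(ms, new Customer { Id = 7, Name = "Ann" });

                ms.Position = 0;
                Console.WriteLine(new StreamReader(ms).ReadToEnd());

                ms.Position = 0;
                var copy = (Customer)dcs.ReadObject(ms);
                Console.WriteLine(copy.Name);
            }
        }
    }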
As an alternative, you could also investigate using Linq to XML to read and write the XML if you want total control - but you'll still have to process types on a member by member basis with this.
I've been looking at that recently (having avoided it like the plague because I couldn't see the point) after having read about it in the early access of Jon Skeet's new book. Have to say - I'm most impressed with how easy it makes it to work with XML.
I've used XmlSerializer a lot in the past and will probably continue to use it. However, the greatest pitfall is one already mentioned above:
The constraints on the serializer (such as restriction to public members) either 1) impose design constraints on the class that have nothing to do with its primary function, or 2) force an increase in complexity in working around these constraints.
Of course, other methods of Xml serialization also increase the complexity.
So I guess my answer is that there's no right or wrong answer that fits all situations; choosing a serialization method is just one design consideration among many others.
There are some scenarios.
You have to deal with a LOT of XML data - the serializer may overload your memory. I had that once with a simple schema that contained a database dump for 2000 or so tables. Only a handful of classes, but in the end serialization did not work - I had to use a SAX streaming parser.
Besides that, I do not see any problems under normal circumstances. It is much easier to deal with the XmlSerializer than with the lower-level parsers, especially for more complex data.
When you want to transmit a lot of data and you have very limited resources.

Converting XML between schemas - XSLT or Objects?

Given:
Two similar and complex schemas; let's call them XmlA and XmlB.
We want to convert from XmlA to XmlB
Not all the information required to produce XmlB is contained within XmlA (a database lookup will be required)
Can I use XSLT for this, given that I'll need to reference additional data in the database? If so, what are the arguments in favour of using XSLT rather than plain old object mapping and conversion? I'm thinking that the following criteria might influence this decision:
Performance/speed
Memory usage
Code reuse/complexity
The project will be C# based.
Thanks.
With C# you can always provide extension objects to XSLT transforms, so that's a non-issue.
It's hard to qualitatively say without having the schemas and XML to hand, but I imagine a compiled transform will be faster than object mapping since you'll have to do a fair amount of wheel reinventing.
Further, one of the huge benefits of XSLT is its maintainability and portability. You'll be able to adapt the XSLT doc really quickly with schema changes, and on the fly without having to do any rebuilds and takedowns if you're monitoring the file.
Could go either way based on what you've given us though.
My question is: how likely is the set of transformations to change?
If it won't change much, I favor doing it all in one body of source code -- here that would be C#. I would use XSD.exe (the .NET XSD tool) to generate serialization classes, in conjunction with data layers, for this kind of thing.
On the other hand, if the set of transformations is likely to change -- or perhaps needs to be 'corrected' post installation -- then I would favor a combination of XSLT and C# extensions to XSLT. The extension mechanism is straightforward, and if you use the XslCompiledTransform type the performance is quite good.
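A sketch of that extension mechanism; the namespace URI, file names, stylesheet and lookup class are all invented for illustration:

    using System.Xml;
    using System.Xml.Xsl;

    // Hypothetical lookup object exposed to the stylesheet.
    public class DbLookup
    {
        public string CustomerName(string id)
        {
            // A real implementation would hit the database here.
            return "Customer " + id;
        }
    }

    class XsltDemo
    {
        static void Main()
        {
            var xslt = new XslCompiledTransform();
            xslt.Load("AtoB.xslt");                 // assumed stylesheet converting XmlA to XmlB

            var args = new XsltArgumentList();
            args.AddExtensionObject("urn:my-lookup", new DbLookup());

            using (var writer = XmlWriter.Create("output-b.xml"))
            {
                xslt.Transform("input-a.xml", args, writer);
            }
            // In the stylesheet: declare xmlns:lu="urn:my-lookup" and call
            // select="lu:CustomerName(@customerId)" wherever the extra data is needed.
        }
    }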
If the data isn't in the xml, then xslt will be a pain. You can pull in additional documents with the document() function, or you can use xslt extension methods (but that is not well supported between vendors). So unless you are dead set on xslt, it doesn't sound like a good option on this occasion (although I'm a big fan of xslt when used correctly).
So: I would probably use regular imperative code - streaming (IEnumerable<T>) if possible. Of course, unless you have a lot of data, such nuances are moot.
