I am writing code that parses XML.
I would like to know what is faster to parse: elements or attributes.
This will have a direct effect on my XML design.
Please target the answers to C# and the differences between LINQ and XmlReader.
Thanks.
Design your XML schema so that the representation of the information actually makes sense. Usually, the decision between making something an attribute or an element will not affect performance.
Performance problems with XML are in most cases related to large amounts of data represented in a very verbose XML dialect. A typical countermeasure is to zip the XML data when storing it or transmitting it over the wire.
If that is not sufficient then switching to another format such as JSON, ASN.1 or a custom binary format might be the way to go.
Addressing the second part of your question: The main difference between the XDocument (LINQ to XML) class and the XmlReader class is that the XDocument class builds a full document object model (DOM) in memory, which might be an expensive operation, whereas the XmlReader class gives you a tokenized stream over the input document.
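To make that concrete, here is a minimal sketch of both approaches; the people.xml file name and the <person id="..."> shape are assumptions made up for this example, not something from your question.

// XDocument vs. XmlReader over the same (hypothetical) file.
using System;
using System.Xml;
using System.Xml.Linq;

class XmlApiComparison
{
    static void Main()
    {
        // XDocument: builds the whole tree in memory, then lets you query it.
        XDocument doc = XDocument.Load("people.xml");
        foreach (XElement person in doc.Descendants("person"))
            Console.WriteLine((string)person.Attribute("id"));

        // XmlReader: forward-only, one node at a time, never holds the whole document.
        using (XmlReader reader = XmlReader.Create("people.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "person")
                    Console.WriteLine(reader.GetAttribute("id"));
            }
        }
    }
}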
With XML, speed is dependent on a lot of factors.
With regards to attributes or elements, pick the one that more closely matches the data. As a guideline, we use attributes for, well, attributes of an object, and elements for contained sub-objects.
Depending on the amount of data you are talking about, using attributes can save you a bit on the size of your XML streams. For example, <person id="123" /> is smaller than <person><id>123</id></person>. This doesn't really impact the parsing, but it will impact the speed of sending the data across a network wire or loading it from disk. If we are talking about thousands of such records, then it may make a difference to your application.
Of course, if that actually does make a difference then using JSON or some binary representation is probably a better way to go.
The first question you need to ask is whether XML is even required. If it doesn't need to be human readable then binary is probably better. Heck, a CSV or even a fixed-width file might be better.
With regards to LINQ vs XmlReader, this is going to boil down to what you do with the data as you are parsing it. Do you need to instantiate a bunch of objects and handle them that way or do you just need to read the stream as it comes in? You might even find that just doing basic string manipulation on the data might be the easiest/best way to go.
Point is, you will probably need to examine the strengths of each approach beyond just "what parses faster".
Without having any hard numbers to prove it, I know that the WCF team at Microsoft chose to make the DataContractSerializer their standard for WCF. It's limited in that it doesn't support XML attributes, but it is indeed up to 10-15% faster than the XmlSerializer.
From that information, I would assume that using XML attributes will be slower to parse than if you use only XML elements.
I'm trying to understand and to decide the best approach to my problem.
I have an XSD that represents the schema of the information that I agreed on with a client.
Now, in my application (C#, .NET 3.5) I use and consume an object that has been deserialized from an XML file created according to the XSD schema.
As soon as I fill the object with data, I want to pass it to another application and also store it in a db. I have two questions:
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Unfortunately, in the DB I have a limited-size field to store the info, so I need some sort of compression of the serialized object. Does binary serialization create smaller data than XML serialization, or do I need to compress the data in any case? If yes, how?
Thanks!
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Neither is specific enough; binary can be good or bad, and XML can be good or bad. Generally speaking, binary is smaller and faster to process, but switching to it will make the data unusable from code that expects XML.
Does binary serialization create smaller data than XML serialization, or do I need to compress the data in any case?
It can be smaller; or it can be larger; indeed, compression can make things smaller or larger too.
If space is your primary concern, I would suggest running it through something like protobuf-net (a binary serializer without the versioning issues common to BinaryFormatter), and then speculatively try compressing it with GZipStream. If the compressed version is smaller: store that (and a marker - perhaps a preamble - that says "I'm compressed"). If the compressed version gets bigger than the original version, store the original (again with a preamble).
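Here is a rough sketch of that "compress only if it helps" idea using protobuf-net and GZipStream; the Person type, its members, and the 0x00/0x01 marker bytes are assumptions for illustration only.

using System.IO;
using System.IO.Compression;
using ProtoBuf;

[ProtoContract]
public class Person
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
}

public static class BlobHelper
{
    public static byte[] ToBlob(Person value)
    {
        // Binary serialization via protobuf-net.
        byte[] raw;
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, value);
            raw = ms.ToArray();
        }

        // Speculatively compress the payload.
        byte[] zipped;
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress, true))
                gz.Write(raw, 0, raw.Length);
            zipped = ms.ToArray();
        }

        // Keep whichever is smaller; the first byte is a preamble saying which one was stored.
        if (zipped.Length < raw.Length)
            return Prepend(0x01, zipped);   // 0x01 = "compressed"
        return Prepend(0x00, raw);          // 0x00 = "not compressed"
    }

    private static byte[] Prepend(byte marker, byte[] data)
    {
        var result = new byte[data.Length + 1];
        result[0] = marker;
        data.CopyTo(result, 1);
        return result;
    }
}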
Here's a recent breakdown of the performance (speed and size) of the common .NET serializers: http://theburningmonk.com/2013/09/binary-and-json-serializer-benchmarks-updated/
I'm looking for a simple solution to serialize and store objects that contain configuration, application state, and data. It's a simple application, and it's not a lot of data. Speed is no issue. I want it to be in-process, and I want it to be easier to edit in a text editor than XML.
I can't find any document database for .NET that can handle it in-process.
Simply serializing to XML is something I'm not sure I want to do, because it's... XML.
Serializing to JSON seems very JavaScript-specific, and I won't use this data in JavaScript.
I figure there are very neat ways to do this, but at the moment I'm leaning toward using JSON despite its JavaScript inclination.
Just because "JSON" it's an acronym for JavaScript Object Notation, has no relevance on if it fits your needs or not as a data format. JSON is lightweight, text based, easily human readable / editable and it's a language agnostic format despite the name.
I'd definitely lean toward using it, as it sounds pretty ideal for your situation.
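As a minimal sketch of what that could look like with Json.NET (one of several JSON libraries for .NET); the AppSettings type and the settings.json file name are made up for the example.

using System.IO;
using Newtonsoft.Json;

public class AppSettings
{
    public string Theme { get; set; }
    public int WindowWidth { get; set; }
}

public static class SettingsStore
{
    public static void Save(AppSettings settings)
    {
        // Indented output keeps the file easy to read and edit by hand.
        File.WriteAllText("settings.json",
            JsonConvert.SerializeObject(settings, Formatting.Indented));
    }

    public static AppSettings Load()
    {
        return JsonConvert.DeserializeObject<AppSettings>(File.ReadAllText("settings.json"));
    }
}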
I will give you a couple of choices:
Binary serialization: this depends on the content of your objects; if you have a complicated dependency tree, it can create problems when serializing. It's also not very flexible, as the standard binary serialization provided by Microsoft stores type information too. That means that if you save a type to a binary file and a month later decide to reorganize your code and, say, move the same class to another namespace, deserialization from the previously saved binary file will fail, because the type is no longer the same. There are several workarounds for that, but I personally try to avoid this kind of serialization as much as I can.
ORM mapping and storing it in a small database. SQLite is an awesome choice for this kind of stuff, as it is small (a single file) and a full ACID-compliant database. You need a mapper, or you can implement one yourself; a rough sketch of the raw ADO.NET route follows below.
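This sketch uses the System.Data.SQLite provider with plain ADO.NET instead of a mapper; the Settings table, column names, and connection string are assumptions, not a prescribed schema.

using System.Data.SQLite;

public static class SettingsDb
{
    public static void Save(string key, string value)
    {
        using (var conn = new SQLiteConnection("Data Source=app.db"))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                // Create the table on first use; the whole database lives in one file.
                cmd.CommandText = "CREATE TABLE IF NOT EXISTS Settings (Key TEXT PRIMARY KEY, Value TEXT)";
                cmd.ExecuteNonQuery();

                // Upsert the setting.
                cmd.CommandText = "INSERT OR REPLACE INTO Settings (Key, Value) VALUES (@k, @v)";
                cmd.Parameters.AddWithValue("@k", key);
                cmd.Parameters.AddWithValue("@v", value);
                cmd.ExecuteNonQuery();
            }
        }
    }
}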
I'm sure that you will get some other choices from the folks here in a couple of minutes.
So choice is up to you.
Good luck.
I've not done much with LINQ to XML, but all the examples I've seen load the entire XML document into memory.
What if the XML file is, say, 8GB, and you really don't have the option?
My first thought is to use the XElement.Load(TextReader) method in combination with an instance of the FileStream class.
QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?
Note: high performance isn't required. I'm trying to get LINQ to XML to basically do the work of the program I could write that loops through every line of my big file and gathers it up, but since LINQ is "loop centric" I'd expect this to be possible.
Using XElement.Load will load the whole file into memory. Instead, use XmlReader with the XNode.ReadFrom function, where you can selectively load nodes found by XmlReader into XElements for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
If you just need to search the xml document, XmlReader alone will suffice and will not load the whole document into the memory.
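For reference, a sketch of that streaming pattern: read with XmlReader and materialize only the elements you care about as XElements. The Record element name here is just an example.

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class BigXml
{
    public static IEnumerable<XElement> StreamRecords(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
                {
                    // ReadFrom loads only this element (and its children) into memory,
                    // and leaves the reader positioned on the following node.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}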
Gabriel,
Dude, this isn't exactly answering your ACTUAL question (how to read big XML docs using LINQ), but you might want to check out my old question What's the best way to parse big XML documents in C-Sharp. The last "answer" (timewise) was a "note to self" on what ACTUALLY WORKED. It turns out that a hybrid document-XmlReader & doclet-XmlSerializer approach is fast (enough) AND flexible.
BUT note that I was dealing with docs of up to only 150 MB. If you REALLY have to handle docs as big as 8 GB, then I guess you're likely to encounter all sorts of problems, including issues with the O/S's large-file (>2 GB) handling... in which case I strongly suggest you keep things as primitive as possible, and XmlReader is the most primitive (and, according to my testing, THE fastest) XML parser available in the Microsoft namespace.
Also: I've just noticed a belated comment in my old thread suggesting that I check out VTD-XML... I had a quick look at it just now... It "looks promising", even if the author seems to have contracted a terminal case of FIGJAM. He claims it'll handle docs of up to 256 GB, to which I reply "Yeah, have you TESTED it? In WHAT environment?" It sounds like it should work, though... I've used this same technique to implement "hyperlinks" in a textual help system, back before HTML.
Anyway good luck with this, and your overall project. Cheers. Keith.
I realize that this answer might be considered non-responsive and possibly annoying, but I would say that if you have an XML file which is 8GB, then at least some of what you are trying to do in XML should be done by the file system or database.
If you have huge chunks of text in that file, you could store them as individual files and store the metadata and the filenames separately. If you don't, you must have many levels of structured data, probably with a lot of repetition of the structures. If you can decide what is considered an individual 'record' which can be stored as a smaller XML file or in a column of a database, then you can structure your database based on the levels of nesting above that. XML is great for small and dirty, it's also good for quite unstructured data since it is self-structuring. But if you have 8GB of data which you are going to do something meaningful with, you must (usually) be able to count on some predictable structure somewhere in it.
Storing XML (or JSON) in a database, and querying and searching both for XML records, and within the XML is well supported nowadays both by SQL stuff and by the NoSQL paradigm.
Of course, you might not have a choice about using XML files this big, or you might have some situation where they really are the best solution. But for some people reading this, it could be helpful to look at this alternative.
I just learned about the XmlSerializer class in .Net. Before I had always parsed and written my XML using the standard classes. Before I dive into this, I am wondering if there are any cases where it is not the right option.
EDIT: By standard classes I mean XmlDocument, XmlElement, XmlAttribute...etc.
There are many constraints when you use the XmlSerializer:
You must have a public parameterless constructor (as mentioned by idlewire in the comments, it doesn't have to be public)
Only public properties are serialized
Interface types can't be serialized
and a few others...
These constraints often force you to make certain design decisions that are not the ones you would have made in other situations... and a tool that forces you to make bad design decisions is usually not a good thing ;)
That being said, it can be very handy when you need a quick way to store simple objects in XML format. I also like the fact that you have pretty good control over the generated schema.
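For that simple case, a quick sketch of round-tripping an object with XmlSerializer; the Settings class and the paths are placeholders for the example.

using System.IO;
using System.Xml.Serialization;

public class Settings
{
    // Public parameterless constructor and public properties:
    // exactly the constraints listed above.
    public string Theme { get; set; }
    public int Timeout { get; set; }
}

public static class SettingsFile
{
    public static void Save(Settings value, string path)
    {
        var serializer = new XmlSerializer(typeof(Settings));
        using (var writer = new StreamWriter(path))
            serializer.Serialize(writer, value);
    }

    public static Settings Load(string path)
    {
        var serializer = new XmlSerializer(typeof(Settings));
        using (var reader = new StreamReader(path))
            return (Settings)serializer.Deserialize(reader);
    }
}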
Well, it doesn't give you quite as much control over the output, obviously. Personally I find LINQ to XML makes it sufficiently easy to write this by hand that I'm happy to do it that way, at least for reasonably small projects. If you're using .NET 3.5 or 4 but not using LINQ to XML, look into it straight away - it's much much nicer than the old DOM.
Sometimes it's nice to be able to take control over serialization and deserialization... especially when you change the layout of your data. If you're not in that situation and don't anticipate being in it, then the built-in XML serialization would probably be fine.
EDIT: I don't think XML serialization supports constructing genuinely immutable types, whereas this is obviously feasible from hand-built construction. As I'm a fan of immutability, that's definitely something I'd be concerned about. If you implement IXmlSerializable I believe you can make do with public immutability, but you still have to be privately mutable. Of course, I could be wrong - but it's worth checking.
The XmlSerializer can save you a lot of trouble if you are regularly serializing and deserializing the same types, and if you need the serialized representations of those types to be consumable by different platforms (e.g. Java, JavaScript, etc.). I do recommend using the XmlSerializer when you can, as it can alleviate a considerable amount of hassle trying to manage conversion from object graph to XML yourself.
There are some scenarios where use of XmlSerializer is not the best approach. Here are a few cases:
When you need fast, forward-only processing of large volumes of XML data
Use an XmlReader instead
When you need to perform repeated searches within an XML document using XPath
When the XML document structure is rather arbitrary and does not regularly conform to a known object model
When the XmlSerializer imposes requirements that do not satisfy your design mandates:
When you can't have a default public constructor
When you can't use the XML serializer attributes to define XML variants of element and attribute names to conform to the necessary XML schema
I find the major drawbacks of the XmlSerializer are:
1) For complex object graphs involving collections, sometimes it is hard to get exactly the XML schema you want by using the serialization control attributes (see the sketch after this list).
2) If you change the class definitions between one version of the app and the next, your existing serialized files may become unreadable.
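To illustrate what those control attributes look like (and where they start to get fiddly once collections are involved), a small hypothetical example:

using System.Xml.Serialization;

[XmlRoot("person")]
public class Person
{
    [XmlAttribute("id")]            // serialized as an attribute: <person id="123">
    public int Id { get; set; }

    [XmlElement("name")]            // serialized as a child element: <name>...</name>
    public string Name { get; set; }

    [XmlArray("phones")]            // wrapper element around the collection
    [XmlArrayItem("phone")]         // element name used for each item
    public string[] Phones { get; set; }
}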
Yes, I personally use automatic XML serialization, although I use the DataContractSerializer (initially brought in because of WCF) instead; the ability to serialize types without any attributes at all is very helpful, and it doesn't embed type information in the output. Of course, you therefore need to know the type of object you are deserializing when loading it back in.
The big problem with it is that it's difficult to serialize to attributes as well, without implementing IXmlSerializable on the type whose data you want written that way, or exposing some other type that the serializer can handle natively.
I guess the biggest gotcha with this is that you can't serialise interfaces automatically, because the DCS wants to be able to construct instances again when it receives the XML back. Standard collection interfaces, however, are supported natively.
All in all, though, I've found the DCS route to be the fastest and most pain-free way.
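A bare-bones sketch of that DataContractSerializer round trip; the Person type and the file paths are placeholders. Note that, as described above, everything comes out as elements rather than attributes, and you must know the type when you load it back.

using System.IO;
using System.Runtime.Serialization;

public class Person   // no serialization attributes required, as mentioned above
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class DcsExample
{
    public static void Save(Person value, string path)
    {
        var serializer = new DataContractSerializer(typeof(Person));
        using (var stream = File.Create(path))
            serializer.WriteObject(stream, value);
    }

    public static Person Load(string path)
    {
        // The XML doesn't carry the type, so we have to supply it here.
        var serializer = new DataContractSerializer(typeof(Person));
        using (var stream = File.OpenRead(path))
            return (Person)serializer.ReadObject(stream);
    }
}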
As an alternative, you could also investigate using Linq to XML to read and write the XML if you want total control - but you'll still have to process types on a member by member basis with this.
I've been looking at that recently (having avoided it like the plague because I couldn't see the point) after having read about it in the early access of Jon Skeet's new book. Have to say - I'm most impressed with how easy it makes it to work with XML.
I've used XmlSerializer a lot in the past and will probably continue to use it. However, the greatest pitfall is one already mentioned above:
The constraints on the serializer (such as restriction to public members) either 1) impose design constraints on the class that have nothing to do with its primary function, or 2) force an increase in complexity in working around these constraints.
Of course, other methods of Xml serialization also increase the complexity.
So I guess my answer is that there's no right or wrong answer that fits all situations; choosing a serialization method is just one design consideration among many others.
There are some scenarios:
You have to deal with a LOT of XML data - the serializer may overload your memory. I had that once with a simple schema that contained a database dump for 2000 or so tables. There were only a handful of classes, but in the end serialization did not work - I had to use a SAX-style streaming parser.
Besides that, I do not see any under normal circumstances. It is much easier to deal with the XmlSerializer than to use the lower-level parser, especially for more complex data.
When you want to transmit a lot of data and you have very limited resources.
Background
We have a project that was started in .NET 1.1, moved to .NET 2.0, and recently moved again to .NET 3.5. The project is extremely data-driven and utilizes XML for many of its data files. Some of these XML files are quite large and I would like to take the opportunity I currently have to improve the application's interaction with them. If possible, I want to avoid having to hold them entirely in memory at all times, but on the other hand, I want to make accessing their data fast.
The current setup uses XmlDocument and XPathDocument (depending on when it was written and by whom). The data is looked up when first requested and cached in an internal data structure (rather than as XML, which would take up more memory in most scenarios). In the past, this was a nice model as it had fast access times and low memory footprint (or at least, satisfactory memory footprint). Now, however, there is a feature that queries a large proportion of the information in one go, rather than the nicely spread out requests we previously had. This causes the XML loading, validation, and parsing to be a visible bottleneck in performance.
Question
Given a large XML file, what is the most efficient and responsive way to query its contents (such as, "does element A with id=B exist?") repeatedly without having the XML in memory?
Note that the data itself can be in memory, just not in its more bloated XML form if we can help it. In the worst case, we could accept a single file being loaded into memory to be parsed and then unloaded again to free resources, but I'd like to avoid that if at all possible.
Considering that we're already caching data where we can, this question could also be read as "which is faster and uses less memory; XmlDocument, XPathDocument, parsing based on XmlReader, or XDocument/LINQ-to-XML?"
Edit: Even simpler, can we randomly access the XML on disk without reading in the entire file at once?
Example
An XML file has some records:
<MyXml>
<Record id='1'/>
<Record id='2'/>
<Record id='3'/>
</MyXml>
Our user interface wants to know if a record exists with an id of 3. We want to find out without having to parse and load every record in the file, if we can. So, if it is in our cache, there's no XML interaction, if it isn't, we can just load that record into the cache and respond to the request.
Goal
To have a scalable, fast way of querying and caching XML data files so that our user interface is responsive without resorting to multiple threads or the long-term retention of entire XML files in memory.
I realize that there may well be a blog or MSDN article on this somewhere and I will be continuing to Google after I've posted this question, but if anyone has some data that might help, or some examples of when one approach is better or faster than another, that would be great.
Update
The XML Team published a blog post today that gives great advice on when to use the various XML APIs in .NET. It looks like something based on XmlReader and IEnumerable would be my best option for the scenario I gave here.
With XML, I only know of two ways:
XmlReader -> stream the large XML data in
or use the XML DOM object model and read the entire XML into memory at once.
If the XML is big (we have XML files in the 80 MB range and up), reading the whole XML into memory is a performance hit. There is no real way to "merge" the two ways of dealing with XML documents. Sorry.
I ran across this white paper a while ago when I was trying to stream XML: API-based XML streaming with FLWOR power and functional updates. The paper tries to work with in-memory XML but leverage LINQ-style access.
Maybe someone will find it interesting.
This might sound stupid.
But if you have simple things to query, you can use a regex over the XML files (the way people grep in Unix/Linux).
I apologize if it doesn't make any sense.
The first part of your question sounds like schema validation would work best. If you have access to the XSDs, or can create them, you could use an approach similar to this:
public void ValidateXmlToXsd(string xsdFilePath, string xmlFilePath)
{
    XmlSchema schema = ValidateXsd(xsdFilePath);
    XmlDocument xmlData = new XmlDocument();

    // Configure a validating reader with the compiled schema.
    XmlReaderSettings validationSettings = new XmlReaderSettings();
    validationSettings.Schemas.Add(schema);
    validationSettings.Schemas.Compile();
    validationSettings.ValidationFlags = XmlSchemaValidationFlags.ProcessInlineSchema;
    validationSettings.ValidationType = ValidationType.Schema;
    validationSettings.ValidationEventHandler += new ValidationEventHandler(ValidationHandler);

    // Validation happens as the document is read.
    using (XmlReader xmlFile = XmlReader.Create(xmlFilePath, validationSettings))
    {
        xmlData.Load(xmlFile);
    }
}

private XmlSchema ValidateXsd(string xsdFilePath)
{
    using (StreamReader schemaFile = new StreamReader(xsdFilePath))
    {
        XmlSchema schema = XmlSchema.Read(schemaFile, new ValidationEventHandler(ValidationHandler));
        schema.Compile(new ValidationEventHandler(ValidationHandler));
        return schema;
    }
}

private void ValidationHandler(object sender, ValidationEventArgs e)
{
    // Treat any validation warning or error as a failure.
    throw new XmlSchemaException(e.Message);
}
If the XML fails to validate, an XmlSchemaException is thrown.
As for LINQ, I personally prefer to use XDocument over XmlDocument whenever I can. Your goal is somewhat subjective, and without seeing exactly what you're doing I can't say go this way or go that way with any certainty that it would help you. You can use XPath with XDocument (see the example below). I would have to say that you should use whichever suits your needs best. There's no issue with using XPath sometimes and LINQ other times. It really depends on your comfort level, along with scalability and readability. What will benefit the team, so to speak.
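A tiny illustration of using XPath over an XDocument via the System.Xml.XPath extensions; the Record/id shape matches the example XML in the question.

using System.Xml.Linq;
using System.Xml.XPath;

public static class RecordLookup
{
    public static bool RecordExists(XDocument doc, int id)
    {
        // XPathSelectElement is an extension method on XNode from System.Xml.XPath.
        return doc.XPathSelectElement("/MyXml/Record[@id='" + id + "']") != null;
    }
}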
An XmlReader will use less memory than an XmlDocument because it doesn't need to load the entire XML into memory at one time.
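For the "does a Record with id=3 exist?" example from the question, a forward-only check with XmlReader might look like the sketch below (element and attribute names taken from the example XML).

using System.Xml;

public static class RecordChecker
{
    public static bool RecordExists(string path, string id)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "Record" &&
                    reader.GetAttribute("id") == id)
                {
                    return true;   // stop as soon as we find it; the rest of the file isn't read
                }
            }
        }
        return false;
    }
}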
Just a thought on the comments from JMarsch: even if the XML generation by your process is not up for discussion, have you considered a DB (or a subset of XML files acting as indexes) as an intermediary? This would obviously only be of benefit if the XML files aren't updated more than once or twice a day. I guess this would need to be weighed up against your existing caching mechanism.
I can't speak to speed, but I prefer XDocument/LINQ because of the syntax.
Rich