Processing large XML files using .NET 3.5 - c#

What is the "recommended" approach for processing very large XML files in .NET 3.5?
For writing, I want to generate an element at a time then append to a file.
For reading, I would likewise want to read an element at a time (in the same order as written).
I have a few ideas about how to do it using strings and File.Append, but does .NET 3.5 provide XML APIs for dealing with arbitrarily large XML files?

Without going into specifics this isn't easy to answer. .NET offers different methods to process XML files:
XmlDocument creates a DOM and supports XPath queries, but it loads the entire XML file into memory.
XElement/XDocument supports LINQ to XML and also reads the entire XML file into memory.
XmlReader is a forward-only reader; it does not read the entire file into memory.
XmlWriter is the forward-only counterpart of XmlReader, but for writing.
Based on what you describe, an XmlReader/XmlWriter combination seems like the best approach.
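For example, a minimal sketch of that combination (the file and element names are made up for illustration): write one element at a time with XmlWriter, then read the elements back in document order with XmlReader, without ever holding the whole file in memory.
// using System; using System.Xml;

// Write one <Record> element at a time; only the current element is buffered.
XmlWriterSettings writerSettings = new XmlWriterSettings { Indent = true };
using (XmlWriter writer = XmlWriter.Create("records.xml", writerSettings))
{
    writer.WriteStartElement("Records");
    for (int i = 1; i <= 3; i++)
    {
        writer.WriteStartElement("Record");
        writer.WriteAttributeString("id", i.ToString());
        writer.WriteEndElement();
    }
    writer.WriteEndElement();
}

// Read the elements back in the same order, again without loading the whole file.
using (XmlReader reader = XmlReader.Create("records.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
        {
            Console.WriteLine(reader.GetAttribute("id"));
        }
    }
}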

As Dirk said, using an XmlWriter/XmlReader combo sounds like the best approach. It can be very verbose, though, and if your XML file is fairly complex it gets very unwieldy. I had to do something similar recently under some strict memory constraints; my SO question might come in handy.
But personally, I found this method here on the MSDN blogs very easy to implement, and it neatly handles appending to the end of the XML file without fragments.
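For reference, here is a rough sketch of one common way to append to an existing file without rewriting it. This is not necessarily the exact technique from that blog post; it assumes the file is UTF-8 and ends exactly with the closing root tag (here </Records>, which is an illustrative name).
// using System.IO; using System.Text; using System.Xml;

string closingTag = "</Records>";   // assumed root element name
using (FileStream stream = new FileStream("records.xml", FileMode.Open, FileAccess.ReadWrite))
{
    // Overwrite the closing root tag, write the new element as a fragment,
    // then put the closing tag back.
    stream.Seek(-closingTag.Length, SeekOrigin.End);

    XmlWriterSettings settings = new XmlWriterSettings
    {
        ConformanceLevel = ConformanceLevel.Fragment,   // no XML declaration, no single-root check
        Encoding = new UTF8Encoding(false)              // avoid a BOM in the middle of the file
    };
    using (XmlWriter writer = XmlWriter.Create(stream, settings))
    {
        writer.WriteStartElement("Record");
        writer.WriteAttributeString("id", "4");
        writer.WriteEndElement();
    }

    byte[] closing = Encoding.UTF8.GetBytes(closingTag);
    stream.Write(closing, 0, closing.Length);
}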

Try to make an *.xsd file out of your *.xml. You can then generate a *.cs file from the *.xsd and load your *.xml file into the resulting object; the objects should take less memory than the whole XML file.
There is a plugin for VS2010 called XSD2Code that can generate a *.cs file from an *.xsd, and it has an option to decorate the properties for serialization. For an *.xsd file named Settings you would get Settings.cs. You would then do something like this:
StreamReader str = new StreamReader("SomeFolder\\YourFile.xml");
XmlSerializer xmlSer = new XmlSerializer(typeof(Settings));
Settings m_settings = (Settings)xmlSer.Deserialize(str);
You can then query your list of objects with LINQ.
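For example, something along these lines, where Records and Id are hypothetical members of the generated class, shown just to illustrate the idea:
// using System.Linq;
var record = m_settings.Records.FirstOrDefault(r => r.Id == 3);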

Related

Using XDocument.Load(xmlreader) method?

I heard that normally, when using XDocument's Load or Parse methods, the entire file is loaded into memory, which is why parsing large files this way is not recommended... but what if I use the following overload to read an XML file?
XDocument xml = XDocument.Load(XmlReader.Create(@"C:\OP\file.xml", settings), LoadOptions.None);
Does it still load the entire file into memory? If so, what is this overload good for?
Yes, that still loads the whole file's content into an in-memory representation. It's less useful than the XElement.Load(XmlReader) method which can be really useful to load just part of a document into memory at one time.
I'd view the XDocument.Load(XmlReader) method as mostly present for consistency - but I could see it being useful in cases where other APIs provide an XmlReader rather than the raw data. For example, you could have some data structure which provides "fake" XML access by allowing you to create an XmlReader from it. That way it would never need to serialize to the real XML which would then need parsing again.
Another use case would be where you want to use some aspects of XmlReaderSettings which aren't available in LoadOptions, such as ignoring comments or using a specific name table.
But no, you shouldn't use XDocument.Load(XmlReader) if you're concerned that the document doesn't fit into memory.
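For completeness, a sketch of the partial-load pattern mentioned above, which keeps only one element in memory at a time (the element name "Order" is made up for illustration):
// using System; using System.Xml; using System.Xml.Linq;

using (XmlReader reader = XmlReader.Create(@"C:\OP\file.xml"))
{
    // Skip forward without buffering, then materialize just this one element.
    if (reader.ReadToFollowing("Order"))
    {
        XElement order = XElement.Load(reader.ReadSubtree());
        Console.WriteLine(order);
    }
}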

Faster XML File reading than xsd generated classes

I am processing lots and lots of XML files that contain HL7 Info.
The structure of these XML files is described in several complex XSD files. They are a hierarchy of XSD files. like this:
Messages.xsd
batch.xsd
datatypes.xsd
Fields.xsd
MoreFiles.xsd
Fields.xsd
That is not the exact usage, but it helps convey the idea of how they work.
Now I can run
xsd .\messages.xsd /classes
and it generates a file called messages.cs that is over 240,000 lines long.
Note: Despite the complexity of the XSD, the actual xml files average around 250 lines of XML with about 25 chars per line (Not really huge).
I can use that file to deserialize my xml files like this:
var bytes = Encoding.ASCII.GetBytes(message.Message);
var memoryStream = new MemoryStream(bytes);
var ormMessage = ormSerializer.Deserialize(memoryStream);   // result renamed so it doesn't collide with 'message'
That all works great and fast.
When it comes time to pull the data out of the xml structure it is too slow.
Is there another way to access my xml data that would be faster? Should I use XPathDocument and XPathNavigator? Can XPathNavigator use all the XSD files so I don't have to re-create it for each xml file I process (Not all XML Nodes are in all XML Files)?
Any other ideas to get XML Data out fast?
The technology you are using (automatic mapping of XML to Java or C# classes) is called "data binding" and it works beautifully when the schema is simple and small. For something as big and ugly as HL7, I would have said it is a non-starter.
What kind of processing are you doing? Is there any good reason why you can't do it in XSLT or XQuery? These languages are designed to process XML and they avoid the "impedance mismatch" that you get when you have to convert data from the XML model to the data model of a programming language such as Java or C#.
Have you looked at something like XStreamingReader? It allows you to use LINQ to XML while streaming over large XML documents. I looked at this in the past and was able to stream over XML, identify chunks and deserialize them into objects. If you mess with this and need examples, I can dig up the code.
http://xstreamingreader.codeplex.com/
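The same general idea (stream with XmlReader, deserialize one chunk at a time) can also be sketched with just the built-in APIs; Record below is a stand-in for one of the generated classes, and the element name is illustrative:
// using System.Xml; using System.Xml.Serialization;

public class Record               // stand-in for a generated class
{
    [XmlAttribute("id")]
    public int Id { get; set; }
}

XmlSerializer serializer = new XmlSerializer(typeof(Record));
using (XmlReader reader = XmlReader.Create("messages.xml"))
{
    while (reader.ReadToFollowing("Record"))
    {
        // ReadSubtree scopes a reader to just this element, so only one
        // chunk is materialized and deserialized at a time.
        using (XmlReader subtree = reader.ReadSubtree())
        {
            Record record = (Record)serializer.Deserialize(subtree);
            // process record ...
        }
    }
}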

Reading large XML documents in .net

I need to read large XML files using .NET; they can easily be several GB in size.
I tried to use XDocument, but it just throws a System.OutOfMemoryException when I try to load the document.
What is the most performant way to read such large XML files?
You basically have to use the "pull" model here - XmlReader and friends. That will allow you to stream the document rather than loading it all into memory in one go.
Note that if you know you're at the start of a "small enough" element, you can create an XElement from an XmlReader, deal with it using the glory of LINQ to XML, and then move on to the next element.
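A sketch of that pattern, assuming a large file made of repeated <Item> elements (the names are illustrative): an iterator hands back one XElement at a time, so LINQ to XML can be used without holding the whole document.
// using System.Collections.Generic; using System.Linq;
// using System.Xml; using System.Xml.Linq;

static IEnumerable<XElement> StreamElements(string path, string elementName)
{
    using (XmlReader reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // Materialize only this element and its children, then continue.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// Usage: one element in memory at a time, even for a multi-gigabyte file.
var ids = StreamElements("huge.xml", "Item").Select(e => (string)e.Attribute("id"));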
The following page makes an interesting read, providing a means to mine data from XML file without loading it in memory. It allows you to combine the speed of XmlReader with the flexibility of Linq:
http://msdn.microsoft.com/en-us/library/bb387035.aspx
And quite an interesting article based on this technique:
http://blogs.msdn.com/b/xmlteam/archive/2007/03/24/streaming-with-linq-to-xml-part-2.aspx
You could try using an XmlTextReader instance.
http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx

How to build xml element in memory and then save to file?

I want to make an xml file but add certain elements selectively. I would like to build those elements in memory and be able to choose whether to be written to the xml file or not. Is this possible? I'm using C# WinForms. I already looked at XmlDocument which will allow me to build an entire xml in memory but I don't want this since it's not good for large data and also takes too much memory.
If large data volume is your main concern, XmlWriter may be ideal. The API is perhaps not as elegant as XElement etc, but it is the most direct and efficient mechanism, and is designed for firehosing a one-way stream of data.
It has a twin, XmlReader, but that is much harder to get right - due to the complexities of processing incoming xml and accounting for child trees appropriately.
If you have a stream, named, say, stream, then you can create an XmlWriter with
XmlWriter.Create(stream);
Then you can create your elements (which are of type XmlElement) and call the WriteTo method on each element you want to write out, passing the XmlWriter created by XmlWriter.Create as the argument.
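Here is a small sketch of that approach. Since the asker wants to avoid XmlDocument, this version uses XElement (which can be built standalone, without a containing document) and its WriteTo method; the element names and the skip condition are purely illustrative.
// using System.Collections.Generic; using System.Xml; using System.Xml.Linq;

// Build candidate elements in memory first.
List<XElement> pending = new List<XElement>
{
    new XElement("Record", new XAttribute("id", 1)),
    new XElement("Record", new XAttribute("id", 2)),
    new XElement("Record", new XAttribute("id", 3))
};

// Then write only the ones you decide to keep.
XmlWriterSettings settings = new XmlWriterSettings { Indent = true };
using (XmlWriter writer = XmlWriter.Create("output.xml", settings))
{
    writer.WriteStartElement("Records");
    foreach (XElement element in pending)
    {
        if ((int)element.Attribute("id") != 2)   // selective: skip whatever you don't want
        {
            element.WriteTo(writer);
        }
    }
    writer.WriteEndElement();
}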

Optimizing XML in C#

Background
We have a project that was started in .NET 1.1, moved to .NET 2.0, and recently moved again to .NET 3.5. The project is extremely data-driven and utilizes XML for many of its data files. Some of these XML files are quite large and I would like to take the opportunity I currently have to improve the application's interaction with them. If possible, I want to avoid having to hold them entirely in memory at all times, but on the other hand, I want to make accessing their data fast.
The current setup uses XmlDocument and XPathDocument (depending on when it was written and by whom). The data is looked up when first requested and cached in an internal data structure (rather than as XML, which would take up more memory in most scenarios). In the past, this was a nice model as it had fast access times and low memory footprint (or at least, satisfactory memory footprint). Now, however, there is a feature that queries a large proportion of the information in one go, rather than the nicely spread out requests we previously had. This causes the XML loading, validation, and parsing to be a visible bottleneck in performance.
Question
Given a large XML file, what is the most efficient and responsive way to query its contents (such as, "does element A with id=B exist?") repeatedly without having the XML in memory?
Note that the data itself can be in memory, just not in its more bloated XML form if we can help it. In the worst case, we could accept a single file being loaded into memory to be parsed and then unloaded again to free resources, but I'd like to avoid that if at all possible.
Considering that we're already caching data where we can, this question could also be read as "which is faster and uses less memory: XmlDocument, XPathDocument, XmlReader-based parsing, or XDocument/LINQ to XML?"
Edit: Even simpler, can we randomly access the XML on disk without reading in the entire file at once?
Example
An XML file has some records:
<MyXml>
  <Record id='1'/>
  <Record id='2'/>
  <Record id='3'/>
</MyXml>
Our user interface wants to know if a record exists with an id of 3. We want to find out without having to parse and load every record in the file, if we can. So, if it is in our cache, there's no XML interaction, if it isn't, we can just load that record into the cache and respond to the request.
Goal
To have a scalable, fast way of querying and caching XML data files so that our user interface is responsive without resorting to multiple threads or the long-term retention of entire XML files in memory.
I realize that there may well be a blog or MSDN article on this somewhere and I will be continuing to Google after I've posted this question, but if anyone has some data that might help, or some examples of when one approach is better or faster than another, that would be great.
Update
The XML Team published a blog post today that gives great advice on when to use the various XML APIs in .NET. It looks like something based on XmlReader and IEnumerable would be my best option for the scenario I gave here.
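For the example above, a rough sketch of what such a check might look like (illustrative only): XmlReader keeps only the current node in memory and stops at the first match.
// using System.Xml;

static bool RecordExists(string path, string id)
{
    using (XmlReader reader = XmlReader.Create(path))
    {
        while (reader.ReadToFollowing("Record"))
        {
            if (reader.GetAttribute("id") == id)
            {
                return true;   // stop as soon as a match is found
            }
        }
    }
    return false;
}

// e.g. RecordExists("MyXml.xml", "3")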
With XML I only know of two ways:
XmlReader -> stream the large XML data in, or
use the XML DOM object model and read the entire XML into memory at once.
If the XML is big (we have XML files in the 80 MB range and up), reading the XML into memory is a performance hit. There is no real way to "merge" the two ways of dealing with XML documents. Sorry.
I ran across this white paper a while ago when I was trying to stream XML: "API-based XML streaming with FLWOR power and functional updates". The paper tries to work with in-memory XML but leverage LINQ-style access.
Maybe someone will find it interesting.
This might sound stupid.
But, if you have simple things to query, you can use a regex over the XML files (the way grep is used on Unix/Linux).
I apologize if it doesn't make any sense.
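For instance, a check like the one in the question could be done this way (fragile as it is, since it depends on the exact markup), scanning line by line rather than loading the file:
// using System.IO; using System.Text.RegularExpressions;

bool found = false;
using (StreamReader reader = new StreamReader("MyXml.xml"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Matches <Record id='3' ...> only if it is written exactly like this.
        if (Regex.IsMatch(line, @"<Record\s+id='3'"))
        {
            found = true;
            break;
        }
    }
}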
The first part of your question sounds like schema validation would work best. If you have access to the XSDs, or can create them, you could use an algorithm similar to this:
// Requires: using System.IO; using System.Xml; using System.Xml.Schema;

public void ValidateXmlToXsd(string xsdFilePath, string xmlFilePath)
{
    XmlSchema schema = ValidateXsd(xsdFilePath);
    XmlDocument xmlData = new XmlDocument();

    // Configure a validating reader with the compiled schema.
    XmlReaderSettings validationSettings = new XmlReaderSettings();
    validationSettings.Schemas.Add(schema);
    validationSettings.Schemas.Compile();
    validationSettings.ValidationFlags = XmlSchemaValidationFlags.ProcessInlineSchema;
    validationSettings.ValidationType = ValidationType.Schema;
    validationSettings.ValidationEventHandler += new ValidationEventHandler(ValidationHandler);

    // Loading the document through the validating reader triggers validation.
    XmlReader xmlFile = XmlReader.Create(xmlFilePath, validationSettings);
    xmlData.Load(xmlFile);
    xmlFile.Close();
}

private XmlSchema ValidateXsd(string xsdFilePath)
{
    // Load and compile the schema, reporting problems through ValidationHandler.
    StreamReader schemaFile = new StreamReader(xsdFilePath);
    XmlSchema schema = XmlSchema.Read(schemaFile, new ValidationEventHandler(ValidationHandler));
    schema.Compile(new ValidationEventHandler(ValidationHandler));
    schemaFile.Close();
    schemaFile.Dispose();
    return schema;
}

private void ValidationHandler(object sender, ValidationEventArgs e)
{
    throw new XmlSchemaException(e.Message);
}
If the xml fails to validate the XmlSchemaException is thrown.
As for LINQ, I personally prefer to use XDocument whenever I can over XmlDocument. Your goal is somewhat subjective and without seeing exactly what you're doing I can't say go this way or go that way with any certainty that it would help you. You can use XPath with XDocument. I would have to say that you should use whichever suits your needs best. There's no issue with using XPath sometimes and LINQ other times. It really depends on your comfort level along with scalability and readability. What will benefit the team, so to speak.
An XmlReader will use less memory than an XmlDocument because it doesn't need to load the entire XML into memory at one time.
Just a thought on the comments from JMarsch. Even if the XML generation by your process is not up for discussion, have you considered a DB (or a subset of XML files acting as indexes) as an intermediary? This would obviously only be of benefit if the XML files aren't updated more than once or twice a day. I guess this would need to be weighed up against your existing caching mechanism.
I can't speak to speed, but I prefer XDocument/LINQ because of the syntax.
Rich
