I'm kind of stuck having to use .NET 2.0, so LINQ to XML isn't available, although I would be interested in how it would compare...
I had to write an internal program to download, extract, and compare some large XML files (about 10 megs each) that are essentially build configurations. I first attempted using libraries, such as Microsoft's XML diff/patch, but comparing the files was taking 2-3 minutes, even with ignoring whitespace, namespaces, etc. (I tested each ignore option one at a time to try and figure out which was speediest). Then I tried to implement my own ideas - lists of nodes from XmlDocument objects, dictionaries of keys of the root's direct descendants (45,000 children, by the way) that pointed to ints indicating the node's position in the XML document... all took at least 2 minutes to run.
My final implementation finishes in 1-2 seconds - I made a system process call to diff with a few lines of context and saved those results to display (our development machines include Cygwin, thank goodness).
I can't help but think there is a better, XML-specific way to do this that would be just as fast as a plain-text diff - especially since all I'm really interested in is the Name element that is the child of each direct descendant, and I could throw away 4/5 of the file for my purposes (we only need to know which files were included, not anything else involving language or version).
So, as popular as XML is, I'm sure somebody out there has had to do something similar. What is a fast, efficient way to compare these large XML files? (preferably open source or free)
edit: a sample of the nodes - I only need to find missing Name elements (there are over 45k nodes as well)
<file>
<name>SomeFile</name>
<version>10.234</version>
<countries>CA,US</countries>
<languages>EN</languages>
<types>blah blah</types>
<internal>N</internal>
</file>
XmlDocument source = new XmlDocument();
source.Load("source.xml");
Dictionary<string, XmlNode> files = new Dictionary<string, XmlNode>();
foreach (XmlNode file in source.SelectNodes("//file"))
    files.Add(file.SelectSingleNode("./name").InnerText, file);

XmlDocument source2 = new XmlDocument();
source2.Load("source2.xml");
XmlNode value;
foreach (XmlNode file in source2.SelectNodes("//file"))
{
    if (files.TryGetValue(file.SelectSingleNode("./name").InnerText, out value))
    {
        // This file is both in source and source2.
    }
    else
    {
        // This file is only in source2.
    }
}
I am not sure exactly what you want, I hope that this example will help you in your quest.
Diffing XML can be done many ways. You're not being very specific regarding the details, though. What does come through is that the files are large and you need only a fraction (about 1/5) of the information.
Well, then the algorithm is as follows:
Normalize and reduce the documents to the information that matters.
Save the results.
Compare the results.
And the implementation:
Use the XmlReader API, which is efficient, to produce plain-text representations of your information. Why a plain-text representation? Because diff tools are predicated on the assumption that their input is plain text. And so are our eyeballs. Why XmlReader? You could use SAX, which is memory-efficient, but XmlReader is more efficient. As for the precise spec of that plain-text file... you're just not including enough information.
Save the plain text files to some temp directory.
Use a command-line diff utility like GnuWin32 diff to get some diff output. Yeah, I know, not pure and proper, but works out of the box and there's no coding to be done. If you are familiar with some C# diff API (I am not), well, then use that API instead, of course.
Delete the temp files. (Or optionally keep them if you're going to reuse them.)
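To make the first two steps concrete, here is a minimal sketch of the reduction pass, assuming the <file>/<name> layout shown in the question; the XmlReducer name is illustrative, not an established API:

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlReducer
{
    // Stream through the document with XmlReader and emit one line per <name>
    // value. The resulting plain-text files can then be fed to an ordinary
    // line-based diff tool.
    public static void Reduce(TextReader xmlIn, TextWriter textOut)
    {
        using (XmlReader reader = XmlReader.Create(xmlIn))
        {
            while (reader.ReadToFollowing("name"))
                textOut.WriteLine(reader.ReadElementContentAsString());
        }
    }
}
```

Run it once per document (e.g. into a.txt and b.txt in a temp directory), then hand the two text files to your diff utility.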
Related
I think my problem is general enough to ask the question, but specific enough so nobody asked it yet - or I wasn't able to find it.
The task is the following: an application's configuration is stored inside an XML document. The application has since evolved, and the current configuration is stored inside a different XML document. The new XML is different in structure as well, so simply copying the old config will break the application.
Our goal is simple: We want to move the configuration changes from the old XML document into the new one. We have access to:
Non-modified old XML
Modified old XML
Non-modified new XML
From this information above I have to create the new XML document, where the modifications inside the old xml are re-applied.
I've figured out the algorithm already, but I don't want to reinvent the wheel here. The high level algorithm is the following:
Compare each node inside the non-modified XML with the modified XML.
If the node is not modified, move forward.
If a node is removed, copy its XPath location to the removed-nodes collection.
If a node is added, move its XPath location and its content into the added-nodes collection.
If a node is modified, move the modified atomic content / attribute to the modified-nodes collection.
If a node is not atomic, apply the above algorithm to every descendant node (recursive algorithm).
Open the new XML, go to each modification location, and
Remove the deleted nodes
Add the added nodes
Apply the modifications
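The compare phase of the steps above can be sketched in C# along these lines. This is a simplified illustration, not the asker's implementation: it matches child elements by name only (no positional or keyed matching, and no attribute diffing), and writes a reduced form of the diff vocabulary described below:

```csharp
using System;
using System.Xml;

static class XmlDiffSketch
{
    // Walks the old and modified trees in parallel, matching child elements by
    // name, and emits removed-node / added-node / modified-node records with
    // their paths. Assumes element names are unique among siblings.
    public static void Compare(XmlNode oldNode, XmlNode newNode, string path, XmlWriter diff)
    {
        foreach (XmlNode oldChild in oldNode.ChildNodes)
        {
            if (oldChild.NodeType != XmlNodeType.Element) continue;
            string childPath = path + "/" + oldChild.Name;
            XmlNode match = newNode.SelectSingleNode(oldChild.Name);
            if (match == null)
            {
                diff.WriteStartElement("removed-node");
                diff.WriteAttributeString("location", childPath);
                diff.WriteEndElement();
            }
            else if (IsAtomic(oldChild) && IsAtomic(match))
            {
                if (oldChild.InnerText != match.InnerText)
                {
                    diff.WriteStartElement("modified-node");
                    diff.WriteAttributeString("location", childPath);
                    diff.WriteString(match.InnerText); // the new atomic content
                    diff.WriteEndElement();
                }
            }
            else
            {
                Compare(oldChild, match, childPath, diff); // recurse into non-atomic nodes
            }
        }
        foreach (XmlNode newChild in newNode.ChildNodes)
        {
            if (newChild.NodeType != XmlNodeType.Element) continue;
            if (oldNode.SelectSingleNode(newChild.Name) == null)
            {
                diff.WriteStartElement("added-node");
                diff.WriteAttributeString("location", path + "/" + newChild.Name);
                diff.WriteRaw(newChild.OuterXml); // copy the added subtree verbatim
                diff.WriteEndElement();
            }
        }
    }

    static bool IsAtomic(XmlNode n)
    {
        foreach (XmlNode c in n.ChildNodes)
            if (c.NodeType == XmlNodeType.Element) return false;
        return true;
    }
}
```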
The XML layout of the collected modifications would look something like this:
<xml-diff>
<added-node location="/xpath/to/the/node">
<content-of-the-added-node with="attributes">
<and-sub-elements-as-well/>
</content-of-the-added-node>
</added-node>
<removed-node location="/xpath/to/the/node"/>
<modified-node location="/xpath/to/the/node">
<content-of-the-modified-node>modified-atomic-content</content-of-the-modified-node>
</modified-node>
<added-attribute location="/xpath/to/the/#attribute">value</added-attribute>
<removed-attribute location="/xpath/to/the/#attribute" />
<modified-attribute location="/xpath/to/the/#attribute">new-value</modified-attribute>
</xml-diff>
content-of-the-added-node and content-of-the-modified-node are both nodes inside the modified old xml.
After locating the modifications inside the old XML, the task is pretty straightforward. I can re-apply the modifications inside the new XML wherever the XPath hasn't changed. I also have a mapping which describes what old XPath value has been changed to what new XPath value, so the configuration changes can be applied correctly. (For example, /root/node1 has moved to /root/collections/node1, etc.)
I know that XSLT is used for transforming one XML document into another. The tricky part here is to detect what the modifications - the transformations - are. Sadly, processing XML is a bit tricky since the order of the nodes is not always kept, but the documents can still mean the same thing nonetheless.
My questions are:
Is XSLT the right path to approach this problem, or should I use something else?
If XSLT is, what is the right transformation algorithm to detect these changes recursively?
If XSLT isn't the answer what is?
Can you provide me a simple XSLT where I can begin my work with?
Please note that I'm totally new in XSLT. I'm familiar with XML, and have some basic understanding of XSD.
I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for the XML is over 2 GB and I get an Out of Memory exception. I understand that the limit for CLR objects is 2 GB, but in my case I need to handle this scenario. Presently I just display a message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just a gist of the operations I need to perform on the generated XML.
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While the XmlReader stream is a good idea, I cannot perform these operations by that method alone. While data validation can be done by Excel itself, the other things cannot be done there.
Using XmlTextReader and XmlTextWriter and creating a custom method for each of the steps is a solution I had thought of. But to go through the gist above, it requires the XML document to be processed four times. This is just not efficient.
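For what it's worth, the four steps don't have to mean four passes: a single XmlReader-to-XmlWriter loop can drop fields, stamp IDs, and rewrite values in one streaming pass. A minimal sketch, with placeholder element names (row, obsolete) standing in for the real schema:

```csharp
using System;
using System.Xml;

static class XmlPipeline
{
    // One streaming pass over the input: drops <obsolete> elements, stamps an
    // id attribute on each <row>, and copies everything else through.
    public static void Process(XmlReader reader, XmlWriter writer)
    {
        int rowId = 0;
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "obsolete")
            {
                reader.Skip(); // step 1: drop this field and everything inside it
                continue;      // Skip already advanced the reader
            }
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    string name = reader.Name;
                    bool isEmpty = reader.IsEmptyElement;
                    writer.WriteStartElement(name);
                    writer.WriteAttributes(reader, false); // copy existing attributes
                    reader.MoveToElement();
                    if (name == "row")
                        writer.WriteAttributeString("id", (++rowId).ToString()); // step 2: add an ID
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value); // steps 3-4: modify or validate values here
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
            }
            reader.Read();
        }
    }
}
```

The memory footprint stays flat regardless of document size, because only the current node is ever held.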
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures, you'll probably be better off implementing a custom XmlReader (or XmlWriter) which works at the stream level. See this question for some similar advice - What is the best way to parse large XML (size of 1GB) in C#?
I guess there is no other way than using a 64-bit OS and framework if you really need to hold the whole thing in RAM, but processing the data some other way, as suggested by Stuart, may be the better way to go...
What you need to do is use "stream chaining", i.e. you open up an input stream which reads from your Excel file and an output stream that writes to your XML file. Then your conversion class/method takes the two streams as input and reads sufficient data from the input stream to be able to write to the output.
Edit: very simple minimal Example
Converting from file:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
    using (StreamReader from = new StreamReader(fromStream))
    using (StreamWriter to = new StreamWriter(toStream))
    {
        to.WriteLine("<List>");
        while (!from.EndOfStream)
        {
            string bulk = from.ReadLine(); // in this case, a single line is sufficient
            // some code to parse the bulk or clean it up, e.g. remove '\r\n'
            to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
        }
        to.WriteLine("</List>");
    }
}

Convert(File.OpenRead("source.xls"), File.OpenWrite("source.xml"));
Of course you could do this in a much more elegant, abstract manner, but this is only to show my point.
I've searched a lot, but I couldn't find a proper solution for my problem. I wrote an XML file containing all the episode information of a TV show. It's 38 KB and contains attributes and strings for about 680 variables. At first I simply read it with the help of XmlTextReader, which worked fine on my quad-core. But my wife's five-year-old laptop took about 30 seconds to read it. So I thought about multithreading, but I get an exception because the file is already opened.
Thread start looks like this
while (reader.Read())
{
    ...
    else if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name.Equals("Season1"))
        {
            current.seasonNr = 0;
            current.currentSeason = season[0];
            current.reader = reader;
            seasonThread[0].Start(current);
        }
        else if (reader.Name.Equals("Season2"))
        {
            current.seasonNr = 1;
            current.currentSeason = season[1];
            current.reader = reader;
            seasonThread[1].Start(current);
        }
And the parsing method like this
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
    reader.MoveToFirstAttribute();
    tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
    ...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position. But I also have no clue if this could work at all.
Please help!
EDIT:
Guys, where did I write about IE?? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT2:
I did some stopwatch research and figured out that parsing the XML file only takes about 200 ms on my PC and 800 ms on my wife's laptop. Is it WPF being so slow? What can I do?
I agree with most everyone's comments. Reading a 38 KB file should not take so long. Do you have something else running on the machine, antivirus / etc., that could be interfering with the processing?
The amount of time it would take you to create a thread will be far greater than the amount of time spent reading the file. If you could post the actual code used to read the file and the file itself, it might help analyze performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file, you need to know everything that comes before it, if nothing else, to know at what level you are.
Your code, if it worked, would do something like this:
main      season1   season2
read
read
skip      read
skip      read
read
skip                read
skip                read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an XML database.
If that's correct, I would recommend breaking your XML into different .xml files, with an indexed .xml document. I would think you could then query - using LINQ to XML - a set of XML data from a specific .xml source.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files, and you would also be able to (although it's highly discouraged) asynchronously load .xml document objects.
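The split-plus-index idea can be sketched like this; the file names, element names, and the ShowIndex helper are all illustrative, not part of any existing API:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ShowIndex
{
    // Looks a season up in a small index.xml that maps each season number to
    // its own file, so only the file that is actually needed gets loaded.
    // Assumes every <season> entry in the index carries number and file attributes.
    public static XDocument LoadSeason(XDocument index, int seasonNumber)
    {
        var entry = index.Descendants("season")
                         .FirstOrDefault(e => (int)e.Attribute("number") == seasonNumber);
        if (entry == null) return null; // season not in the index
        return XDocument.Load((string)entry.Attribute("file")); // e.g. "season2.xml"
    }
}
```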
Your XML schema doesn't lend itself to parallelism, since you seem to have node names (Season1, Season2) that contain the same kind of data but must be parsed individually. You could redesign your schema to use the same node name (i.e. Season) with attributes that express the differences in the data (i.e. a Number attribute to indicate the season number). Then you can parallelize, e.g. using LINQ to XML and PLINQ:
XDocument doc = XDocument.Load(@"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
                    .AsParallel()
                    .Select(x => new Season()
                    {
                        Number = (int)x.Attribute("Number"),
                        Description = x.Value
                    }).ToList();
How's everyone doing this morning?
I'm writing a program that will parse a(several) xml files.
This stage of the program is going to be focusing on adding/editing skills/schools/abilities/etc for a tabletop rpg (L5R). What I learn by this one example should carry me through the rest of the program.
So I've got the xml reading set up using XMLReader. The file I'm reading looks like...
<skills>
<skill>
<name>some name</name>
<description>a skill</description>
<type>high</type>
<stat>perception</stat>
<page>42</page>
<availability>all</availability>
</skill>
</skills>
I set up a Skill class, which holds the data, and a SkillEdit class which reads in the data, and will eventually have methods for editing and adding.
I'm currently able to read everything in correctly, but I had the thought that since the description can vary in length, once I write the edit method, the best way to ensure no data is overwritten would be to just append the edited skill to the end of the file and wipe out its previous entry.
In order for me to do that, I would need to know the file offset of the <skill> tag and of the </skill> tag. I can't seem to find any way of getting those offsets, though.
Is there a way to do that, or can you guys suggest a better implementation for editing an already existing skill?
If you read your XML into LINQ to XML's XDocument (or XElement), everything could become very easy. You can read, edit, add stuff, etc. to XML files using a simple interface.
e.g.,
var xmlStr = @"<skills>
<skill>
<name>some name</name>
<description>a skill</description>
<type>high</type>
<stat>perception</stat>
<page>42</page>
<availability>all</availability>
</skill>
</skills>
";
var doc = XDocument.Parse(xmlStr);
// find the skill "some name"
var mySkill = doc
.Descendants("skill") // out of all skills
.Where(e => e.Element("name").Value == "some name") // that has the element name "some name"
.SingleOrDefault(); // select it
if (mySkill != null) // if found...
{
var skillType = mySkill.Element("type").Value; // read the type
var skillPage = (int)mySkill.Element("page"); // read the page (as an int)
mySkill.Element("description").Value = "an AWESOME skill"; // change the description
// etc...
}
No need to calculate offsets, manual, step-by-step reading or maintaining other state, it is all taken care of for you.
Don't do it! In general, you can't reliably know anything about physical offsets in the serialized XML because of possible character encoding differences, entity references, embedded comments and a host of other things that can cause the physical and logical layers to have a complex relationship.
If your XML is just sitting on the file system, your safest option is to have a method in your skill class which serializes to XML (you already have one to read XML already), and re-serialize whole objects when you need to.
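A sketch of that whole-object approach with XmlSerializer, assuming a Skill shape matching the sample file (the class and member names here are guesses at the asker's model, not their actual code):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// A minimal shape matching the sample <skills>/<skill> layout.
[XmlRoot("skills")]
public class SkillList
{
    [XmlElement("skill")]
    public List<Skill> Skills = new List<Skill>();
}

public class Skill
{
    [XmlElement("name")] public string Name;
    [XmlElement("description")] public string Description;
    [XmlElement("type")] public string Type;
    [XmlElement("stat")] public string Stat;
    [XmlElement("page")] public int Page;
    [XmlElement("availability")] public string Availability;
}

static class SkillStore
{
    static readonly XmlSerializer serializer = new XmlSerializer(typeof(SkillList));

    // Deserialize the whole list, edit objects in memory, then re-serialize
    // the whole list. No offsets, no appending, no partial rewrites.
    public static SkillList Load(TextReader reader) { return (SkillList)serializer.Deserialize(reader); }
    public static void Save(SkillList skills, TextWriter writer) { serializer.Serialize(writer, skills); }
}
```

Editing a skill then becomes: load, change the property, save the whole file back.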
Tyler,
Umm, sounds like you're suffering from a textbook case of premature optimization... Have you PROVEN that reading and writing the COMPLETE skill list to/from the XML file is TOO slow? No? Well, until it's been proven that there IS a performance issue, there IS NO performance issue, right? So we just write the simplest code that works (i.e. does what we want, without worrying too much about performance), and then move on directly to the next bit of tricky functionality... testing as we go.
Iff (which is short for if-and-only-if) I had a PROVEN performance problem, then-and-only-then would I consider writing each skill to an individual XML file, to avert the necessity of rewriting a potentially large list of skills each time a single skill was modified... But this is "reference data", right? I mean, you wouldn't de/serialize your (volatile) game data to/from an XML file, would you? Because an RDBMS is known to be much better at that job, right? So you're NOT going to be rewriting this file often?
Cheers. Keith.
What is the difference in processing speed between executing a process using XML manipulation and using an object-oriented representation? In general, is it faster to maximize or minimize the reliance on XML for a process? Let it be assumed that the code is highly optimized in either circumstance.
A simple example of what I am asking is which of the following would execute faster, when called from a C# web application, if the Thing in question were to represent the same qualified entity.
// XSL CODE FRAGMENT
<NewThings>
<xsl:for-each select="Things/Thing">
<xsl:copy-of select="." />
</xsl:for-each>
</NewThings>
or
// C# Code Fragment
void iterate(List<Thing> things)
{
    List<Thing> newThings = new List<Thing>();
    things.ForEach(t => newThings.Add(t));
}
A complex example of might be whether it is faster to manipulate a system of objects and functions in C# or a system of xml documents in an XProc pipeline.
Thanks a lot.
Generally speaking, if you're only going to be using the source document's tree once, you're not going to gain much of anything by deserializing it into some specialized object model. The cost of admission - parsing the XML - is likely to dwarf the cost of using it, and any increase in performance that you get from representing the parsed XML in something more efficient than an XML node tree is going to be marginal.
If you're using the data in the source document over and over again, though, it can make a lot of sense to parse that data into some more efficiently-accessible structure. This is why XSLT has the xsl:key element and key() function: looking an XML node up in a hash table can be so much faster than performing a linear search on a list of XML nodes that it was worth putting the capability into the language.
To address your specific example, iterating over a List<Thing> is going to perform at the same speed as iterating over a List<XmlNode>. What will make the XSLT slower is not the iteration. It's the searching, and what you do with the found nodes. Executing the XPath query Things/Thing iterates through the child elements of the current node, does a string comparison to check each element's name, and if the element matches, it iterates through that element's child nodes and does another string comparison for each. (Actually, I don't know for a fact that it's doing a string comparison. For all I know, the XSLT processor has hashed the names in the source document and the XPath and is doing integer comparisons of hash values.) That's the expensive part of the operation, not the actual iteration over the resulting node set.
Additionally, most anything that you do with the resulting nodes in XSLT is going to involve linear searches through a node set. Accessing an object's property in C# doesn't. Accessing MyThing.MyProperty is going to be faster than getting at it via <xsl:value-of select='MyProperty'/>.
Generally, that doesn't matter, because parsing XML is expensive whether you deserialize it into a custom object model or an XmlDocument. But there's another case in which it may be relevant: if the source document is very large, and you only need a small part of it.
When you use XSLT, you essentially traverse the XML twice. First you create the source document tree in memory, and then the transform processes this tree. If you have to execute some kind of nasty XPath like //*[@some-attribute='some-value'] to find 200 elements in a million-element document, you're basically visiting each of those million nodes twice.
That's a scenario where it can be worth using an XmlReader instead of XSLT (or before you use XSLT). You can implement a method that traverses the stream of XML and tests each element to see if it's of interest, creating a source tree that contains only your interesting nodes. Or, if you want to get really crazy, you can implement a subclass of XmlReader that skips over uninteresting nodes, and pass that as the input to XslCompiledTransform.Transform(). (I suspect, though, that if you knew enough about how XmlReader works to subclass it, you probably wouldn't have needed to ask this question in the first place.) This approach allows you to visit 1,000,200 nodes instead of 2,000,000. It's also a king-hell pain in the ass, but sometimes art demands sacrifice from the artist.
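The pre-filtering variant can be sketched with ReadSubtree(), lifting only the matching elements into a small in-memory document that a transform can then work on (the element and attribute names here are illustrative):

```csharp
using System;
using System.Xml;

static class XmlFilter
{
    // Single streaming pass: copy only elements carrying the interesting
    // attribute value into a small document under a synthetic root. Matches
    // nested inside an already-matched subtree are copied with their parent,
    // not revisited.
    public static XmlDocument ExtractInteresting(XmlReader reader, string attribute, string value)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("interesting");
        doc.AppendChild(root);
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.GetAttribute(attribute) == value)
            {
                using (XmlReader sub = reader.ReadSubtree())
                {
                    sub.Read(); // position the subtree reader on the element
                    root.AppendChild(doc.ReadNode(sub)); // materialize just this subtree
                }
            }
        }
        return doc;
    }
}
```

The returned document (or a reader over it) is then small enough to feed to the transform without the double full-tree traversal.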
All other things being equal, it's generally fastest to:
read the XML only once (disk I/O is slow)
build a document tree of nodes entirely in memory,
perform the transformations,
and generate the result.
That is, if you can represent the transformations as code operations on the in-node tree rather than having to read them from an XSLT description, that will definitely be faster. Either way, you'll have to generate some code that does the transformations you want, but with XSLT you have the extra step of "read in the transformations from this document and then transform the instructions into code", which tends to be a slow operation.
Your mileage may vary. You'll need to be more specific about the individual circumstances before a more precise answer can be given.