I have written a C# application to parse very large (100MB+) XML files.
The way I accomplished this was to traverse the file using a System.Xml.XmlReader and then, once I get to the final nodes I need to collect values from, convert each of those very small elements into a System.Xml.Linq.XElement and perform various XPath statements via XElement.XPathEvaluate to get the data I need.
This worked very well and efficiently, but I hit a snag: I was sometimes getting bad data because XPathEvaluate only supports XPath 1.0 while my statements were XPath 2.0 (question posted here).
My original code for doing this looks roughly as follows:
void parseNode_Old(XmlReader rdr, List<string> xPathsToExtract)
{
    // Enter the node:
    rdr.Read();
    // Load it as an XElement so as to be able to evaluate XPaths:
    var nd = XElement.Load(rdr);
    // Loop through the XPaths related to that node and evaluate them:
    foreach (var xPath in xPathsToExtract)
    {
        var xPathVal = nd.XPathEvaluate(xPath);
        // Do whatever with the extracted value(s)
    }
}
Following the suggestions given in my previous question, I decided the best solution would be to move from System.Xml to Saxon.Api (which does support XPath 2.0) and my current updated code looks as follows:
void parseNode_Saxon(XmlReader rdr, List<string> xPathsToExtract)
{
    // Set up the Saxon XPath processors:
    Processor processor = new Processor(false);
    XPathCompiler compiler = processor.NewXPathCompiler();
    XdmNode nd = processor.NewDocumentBuilder().Build(rdr);
    // Loop through the XPaths related to that node and evaluate them:
    foreach (var xPath in xPathsToExtract)
    {
        var xPathVal = compiler.EvaluateSingle(xPath, nd);
        // Do whatever with the extracted value(s)
    }
}
This is working (with a few other changes to my XPaths), but it has become about 5-10 times slower.
This is my first time working with the Saxon.Api library, and this is what I came up with. I'm hoping there's a better way to accomplish this and make the execution speed comparable; or, if anyone has other ideas on how to evaluate XPath 2.0 statements without a substantial rewrite, I'd love to hear them!
Any help would be greatly appreciated!!
Thanks!!
UPDATE:
In trying to fix this myself, I moved the following 2 statements to my constructor:
Processor processor = new Processor(false);
XPathCompiler compiler = processor.NewXPathCompiler();
rather than re-creating them on every call to this method. This has helped substantially, but the process is still about 3 times slower than the native System.Xml.Linq version. Any other ideas / thoughts on ways to implement this parser?
This may be the best you can do with this set-up.
Saxon on .NET is often 3-5 times slower than Saxon on Java, for reasons we have never got to the bottom of. We're currently exploring the possibility of rebuilding it using Excelsior JET rather than IKVMC to see if this can speed things up.
Saxon is much slower on a third-party DOM implementation than on its own native tree representation, but it seems you have changed your code to use the native tree model.
Since you're parsing each XPath expression every time it is executed, your performance may be dominated by XPath compilation time (even if you're searching a large XML document). Until recently Saxon's compile-time performance received very little attention, since we reckoned that doing more work at compile time to save effort at run-time was always worthwhile; but in this kind of scenario this clearly isn't the case. It may be worth splitting the compile and run-time and measuring both separately, just to see if that gives any insights. It might suggest, for example, switching off some of the optimization options. Obviously if you can cache and reuse compiled XPath expressions that will help.
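Concretely, caching might look like the sketch below. It uses the same Saxon.Api classes as the code above, assuming the Saxon 9 .NET API where XPathCompiler.Compile returns a reusable XPathExecutable and Load() produces a lightweight per-evaluation XPathSelector; class and member names other than those are hypothetical:

using System.Collections.Generic;
using System.Xml;
using Saxon.Api;

class NodeParser
{
    // Created once and reused; re-creating these per call is expensive.
    private readonly Processor processor = new Processor(false);
    private readonly XPathCompiler compiler;
    private readonly Dictionary<string, XPathExecutable> compiledXPaths =
        new Dictionary<string, XPathExecutable>();

    public NodeParser()
    {
        compiler = processor.NewXPathCompiler();
    }

    public void ParseNode(XmlReader rdr, List<string> xPathsToExtract)
    {
        // Build the small per-node tree using Saxon's native model:
        XdmNode nd = processor.NewDocumentBuilder().Build(rdr);
        foreach (var xPath in xPathsToExtract)
        {
            XPathExecutable exe;
            if (!compiledXPaths.TryGetValue(xPath, out exe))
            {
                exe = compiler.Compile(xPath); // compile once per distinct XPath
                compiledXPaths.Add(xPath, exe);
            }
            // Loading a selector from a compiled expression is cheap:
            XPathSelector selector = exe.Load();
            selector.ContextItem = nd;
            var xPathVal = selector.EvaluateSingle();
            // Do whatever with the extracted value(s)
        }
    }
}

With this structure, the per-call cost is reduced to building the small tree and running the already-compiled expressions, which is exactly the separation of compile time and run time suggested above.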
Related
I'm kind of stuck having to use .NET 2.0, so LINQ to XML isn't available, although I would be interested in how it would compare...
I had to write an internal program to download, extract, and compare some large XML files (about 10 MB each) that are essentially build configurations. I first attempted to use libraries such as Microsoft's XML diff/patch, but comparing the files was taking 2-3 minutes, even when ignoring whitespace, namespaces, etc. (I tested each ignore option one at a time to try to figure out what was speediest). Then I tried to implement my own ideas: lists of nodes from XmlDocument objects, dictionaries keyed on the root's direct descendants (45,000 children, by the way) pointing to ints indicating the node's position in the XML document... all took at least 2 minutes to run.
My final implementation finishes in 1-2 seconds - I made a system process call to diff with a few lines of context and saved those results to display (our development machines include cygwin, thank goodness).
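That kind of system call might look roughly like the following sketch (not the actual code; the file names are hypothetical, and it assumes a diff binary such as cygwin's or GnuWin32's is on the PATH):

using System.Diagnostics;

// Run "diff" with a few lines of unified context and capture its output.
var psi = new ProcessStartInfo("diff", "-U 3 old_names.txt new_names.txt")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var p = Process.Start(psi))
{
    string diffOutput = p.StandardOutput.ReadToEnd();
    p.WaitForExit();
    // Save or display diffOutput.
}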
I can't help but think there is a better, XML-specific way to do this that would be just as fast as a plain-text diff, especially since all I'm really interested in is the Name element that is the child of each direct descendant, and I could throw away 4/5 of the file for my purposes (we only need to know which files were included, not anything else involving language or version).
So, as popular as XML is, I'm sure somebody out there has had to do something similar. What is a fast, efficient way to compare these large XMLs? (preferably open source or free)
Edit: here is a sample of the nodes; I only need to find missing name elements (there are over 45k nodes as well):
<file>
    <name>SomeFile</name>
    <version>10.234</version>
    <countries>CA,US</countries>
    <languages>EN</languages>
    <types>blah blah</types>
    <internal>N</internal>
</file>
XmlDocument source = new XmlDocument();
source.Load("source.xml");

Dictionary<string, XmlNode> files = new Dictionary<string, XmlNode>();
foreach (XmlNode file in source.SelectNodes("//file"))
    files.Add(file.SelectSingleNode("./name").InnerText, file);

XmlDocument source2 = new XmlDocument();
source2.Load("source2.xml");

XmlNode value;
foreach (XmlNode file in source2.SelectNodes("//file"))
{
    if (files.TryGetValue(file.SelectSingleNode("./name").InnerText, out value))
    {
        // This file is both in source and source2.
    }
    else
    {
        // This file is only in source2.
    }
}
I am not sure exactly what you want; I hope that this example will help you in your quest.
Diffing XML can be done many ways. You're not being very specific regarding the details, though. What does transpire is that the files are large and you need only a fraction of the information.
Well, then the algorithm is as follows:
Normalize and reduce the documents to the information that matters.
Save the results.
Compare the results.
And the implementation:
Use the XmlReader API, which is efficient, to produce a plain-text representation of your information (a minimal sketch follows this list). Why a plain-text representation? Because diff tools are predicated on the assumption that the input is plain text. And so are our eyeballs. Why XmlReader? You could use SAX, which is memory-efficient, but XmlReader is more efficient. As for the precise spec of that plain-text file... you're just not including enough information.
Save the plain text files to some temp directory.
Use a command-line diff utility like GnuWin32 diff to get some diff output. Yeah, I know, it's not pure and proper, but it works out of the box and there's no coding to be done. If you are familiar with some C# diff API (I am not), well, then use that API instead, of course.
Delete the temp files. (Or optionally keep them if you're going to reuse them.)
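As a minimal sketch of the first step, assuming the <file>/<name> structure shown in the question: stream the big document with XmlReader and write one name per line, ready for a line-based diff.

using System.IO;
using System.Xml;

static void ExtractNames(string xmlPath, string textPath)
{
    using (var reader = XmlReader.Create(xmlPath))
    using (var writer = new StreamWriter(textPath))
    {
        while (reader.Read())
        {
            // Keep only the information that matters: each <name> value.
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "name")
                writer.WriteLine(reader.ReadElementContentAsString());
        }
    }
}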
What is the difference in processing speed between executing a process using XML manipulation and using an object-oriented representation? In general, is it faster to maximize or to minimize the reliance on XML for a process? Assume that the code is highly optimized in either case.
A simple example of what I am asking is which of the following would execute faster, when called from a C# web application, if the Thing in question represented the same qualified entity.
// XSL CODE FRAGMENT
<NewThings>
  <xsl:for-each select="Things/Thing">
    <xsl:copy-of select="." />
  </xsl:for-each>
</NewThings>
or
// C# Code Fragment
void iterate(List<Thing> things)
{
    List<Thing> newThings = new List<Thing>();
    things.ForEach(t => newThings.Add(t));
}
A complex example might be whether it is faster to manipulate a system of objects and functions in C# or a system of XML documents in an XProc pipeline.
Thanks a lot.
Generally speaking, if you're only going to be using the source document's tree once, you're not going to gain much of anything by deserializing it into some specialized object model. The cost of admission - parsing the XML - is likely to dwarf the cost of using it, and any increase in performance that you get from representing the parsed XML in something more efficient than an XML node tree is going to be marginal.
If you're using the data in the source document over and over again, though, it can make a lot of sense to parse that data into some more efficiently-accessible structure. This is why XSLT has the xsl:key element and key() function: looking an XML node up in a hash table can be so much faster than performing a linear search on a list of XML nodes that it was worth putting the capability into the language.
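The C# analogue of that hash-lookup idea, as a rough sketch (the element names are hypothetical, and it assumes the key values are unique):

using System.Linq;
using System.Xml.Linq;

var doc = XDocument.Load("things.xml");

// Index the nodes once, like xsl:key...
var thingsByName = doc.Descendants("Thing")
                      .ToDictionary(t => (string)t.Element("Name"));

// ...then each lookup is a hash probe, like key(), not a linear scan.
XElement hit;
if (thingsByName.TryGetValue("some-name", out hit))
{
    // Use hit.
}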
To address your specific example, iterating over a List<Thing> is going to perform at the same speed as iterating over a List<XmlNode>. What will make the XSLT slower is not the iteration. It's the searching, and what you do with the found nodes. Executing the XPath query Things/Thing iterates through the child elements of the current node, does a string comparison to check each element's name, and if the element matches, it iterates through that element's child nodes and does another string comparison for each. (Actually, I don't know for a fact that it's doing a string comparison. For all I know, the XSLT processor has hashed the names in the source document and the XPath and is doing integer comparisons of hash values.) That's the expensive part of the operation, not the actual iteration over the resulting node set.
Additionally, most anything that you do with the resulting nodes in XSLT is going to involve linear searches through a node set. Accessing an object's property in C# doesn't. Accessing MyThing.MyProperty is going to be faster than getting at it via <xsl:value-of select='MyProperty'/>.
Generally, that doesn't matter, because parsing XML is expensive whether you deserialize it into a custom object model or an XmlDocument. But there's another case in which it may be relevant: if the source document is very large, and you only need a small part of it.
When you use XSLT, you essentially traverse the XML twice. First you create the source document tree in memory, and then the transform processes this tree. If you have to execute some kind of nasty XPath like //*[@some-attribute='some-value'] to find 200 elements in a million-element document, you're basically visiting each of those million nodes twice.
That's a scenario where it can be worth using an XmlReader instead of XSLT (or before you use XSLT). You can implement a method that traverses the stream of XML and tests each element to see if it's of interest, creating a source tree that contains only your interesting nodes. Or, if you want to get really crazy, you can implement a subclass of XmlReader that skips over uninteresting nodes, and pass that as the input to XslCompiledTransform.Transform(). (I suspect, though, that if you knew enough about how XmlReader works to subclass it, you probably wouldn't have needed to ask this question in the first place.) This approach allows you to visit 1,000,200 nodes instead of 2,000,000. It's also a king-hell pain in the ass, but sometimes art demands sacrifice from the artist.
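A hedged sketch of the first variant, reusing the attribute test from the hypothetical XPath above: stream the large document with XmlReader and copy only the interesting subtrees into a small in-memory tree, which the transform can then process cheaply.

using System.Xml;
using System.Xml.Linq;

static XDocument ReduceDocument(string path)
{
    var root = new XElement("interesting");
    using (var reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.GetAttribute("some-attribute") == "some-value")
            {
                // ReadFrom consumes the whole subtree and advances the reader past it.
                root.Add(XNode.ReadFrom(reader));
            }
            else
            {
                reader.Read();
            }
        }
    }
    // Feed the reduced tree to the transform instead of the full document.
    return new XDocument(root);
}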
All other things being equal, it's generally fastest to:
read the XML only once (disk I/O is slow)
build a document tree of nodes entirely in memory,
perform the transformations,
and generate the result.
That is, if you can represent the transformations as code operations on the in-memory node tree rather than having to read them from an XSLT description, that will definitely be faster. Either way, you'll have to produce some code that performs the transformations you want, but with XSLT you have the extra step of "read in the transformations from this document and then turn those instructions into code", which tends to be slow.
Your mileage may vary. You'll need to be more specific about the individual circumstances before a more precise answer can be given.
I'm using StringTemplate to generate some XML files from datasets. Sometimes I have more than 100,000 records in the dataset that is enumerated by a loop in a template. It goes very slowly (15-20 seconds per operation), so performance is not good enough for me.
This is an example how I use ST to render a report:
using (var sw = new StringWriter())
{
    st.Write(new StringTemplateWriter(sw));
    return sw.ToString();
}
StringTemplateWriter is a simple writer class implementing IStringTemplateWriter, without indentation.
By the way, in the debug output I see a lot of messages like this:
"A first chance exception of type 'antlr.NoViableAltException' occurred in StringTemplate.DLL"
Digging into this in the debugger, I found that ST parses my template recursively, and if something fails (I don't know exactly what), it throws a NoViableAltException to return from deep in the stack back to the surface, so I guess the problem is the heavy use of try-catch-throw.
Google found nothing useful on this.
Main question: how can I decrease the number of exceptions (short of rewriting the ST code) and improve the performance of template rendering?
ST parses ST templates and groups with ANTLR. If you are getting syntax errors, your template(s) have errors. All bets are off for performance, as it throws an exception for each one. ANTLR/ST are not at fault here ;)
Terence
NoViableAltException sounds like a parser error. I am not sure why ANTLR is being used (except that they come from the same author), but my only guess is that the template language itself is parsed using ANTLR. Maybe the template contains errors? Anyway, ANTLR's error handling is really slow (for one thing, it uses exceptions), so that's probably why your template expansion is slow.
I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4GB file with some field values easily having several million characters).
While this seems a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the FileHelpers API. It's .NET and open source. It's extremely high performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a String.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily evaluated, so don't close or alter the StreamReader until you've iterated (or forced a full load with ToList/ToArray or the like; given your file size, however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file wherever the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O Completion Ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text goes, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding records / text lines (or whatever) back as an IEnumerable.
You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parsing the current buffer and keeping track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
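And a hedged sketch of what the state-machine core behind such a reader might look like, assuming the þ-quoted format from the question (this version only yields text fields; a real EddReader would also expose the streaming variant and handle the record delimiter):

using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EddFieldReader
{
    public static IEnumerable<string> ReadFields(TextReader input, char quote = 'þ')
    {
        var field = new StringBuilder();
        bool inField = false; // the two states of the machine
        int c;
        while ((c = input.Read()) != -1)
        {
            char ch = (char)c;
            if (!inField)
            {
                // Outside quotes: skip delimiter characters until a field opens.
                if (ch == quote)
                {
                    inField = true;
                    field.Length = 0;
                }
            }
            else if (ch == quote)
            {
                // Closing quote: the field is complete.
                inField = false;
                yield return field.ToString();
            }
            else
            {
                field.Append(ch);
            }
        }
    }
}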
While this doesn't help address the large-input issue, a possible solution to the parsing issue might be a custom parser that uses the strategy pattern to supply a delimiter.
In a project that I'm working on I have to work with a rather weird data source. I can give it a "query" and it will return me a DataTable. But the query is not a traditional string. It's more like... a set of method calls that define the criteria that I want. Something along these lines:
var tbl = MySource.GetObject("TheTable");
tbl.AddFilterRow(new FilterRow("Column1", 123, FilterRow.Expression.Equals));
tbl.AddFilterRow(new FilterRow("Column2", 456, FilterRow.Expression.LessThan));
var result = tbl.GetDataTable();
In essence, it supports all the standard stuff (boolean operators, parentheses, a few functions, etc.), but the syntax for writing it is quite verbose and uncomfortable for everyday use.
I wanted to make a little parser that would parse a given expression (like "Column1 = 123 AND Column2 < 456") and convert it to the above function calls. Also, it would be nice if I could add parameters there, so I would be protected against injection attacks. The last little piece of sugar on the top would be if it could cache the parse results and reuse them when the same query is to be re-executed on another object.
So I was wondering: are there any existing solutions that I could use for this, or will I have to roll my own expression parser? It's not too complicated, but if I can save myself two or three days of coding and a heapload of bugs to fix, it would be worth it.
Try out Irony. Though the documentation is lacking, the samples will get you up and running very quickly. Irony is a project for parsing code and building abstract syntax trees, but you might have to write a little logic to create a form that suits your needs. The DLR may be the complement for this, since it can dynamically generate / execute code from abstract syntax trees (it's used for IronPython and IronRuby). The two should make a good pair.
Oh, and they're both first-class .NET solutions and open source.
Bison or JavaCC or the like will generate a parser from a grammar. You can then augment the nodes of the tree with your own code to transform the expression.
OP comments:
I really don't want to ship 3rd party executables with my soft. I want it to be compiled in my code.
Both tools generate source code, which you link with.
I wrote a parser for exactly this usage and complexity level by hand. It took about two days. I'm glad I did it, but I wouldn't do it again. I'd use ANTLR or F#'s fslex.
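For a sense of what the hand-rolled route involves, here is a minimal sketch covering only "name op number" terms joined by AND. It mirrors the FilterRow API from the question; the table type and the GreaterThan member are assumptions, since the question only shows Equals and LessThan:

using System;
using System.Text.RegularExpressions;

static class FilterParser
{
    private static readonly Regex Term = new Regex(
        @"^\s*(?<col>\w+)\s*(?<op>=|<|>)\s*(?<val>\d+)\s*$");

    // Translates e.g. "Column1 = 123 AND Column2 < 456" into AddFilterRow calls.
    public static void Apply(dynamic tbl, string query)
    {
        foreach (var term in query.Split(new[] { " AND " }, StringSplitOptions.None))
        {
            var m = Term.Match(term);
            if (!m.Success)
                throw new FormatException("Unrecognized term: " + term);

            var op = m.Groups["op"].Value == "=" ? FilterRow.Expression.Equals
                   : m.Groups["op"].Value == "<" ? FilterRow.Expression.LessThan
                   : FilterRow.Expression.GreaterThan; // assumed to exist

            tbl.AddFilterRow(new FilterRow(
                m.Groups["col"].Value,
                int.Parse(m.Groups["val"].Value),
                op));
        }
    }
}

A real version would add a proper tokenizer (to handle parentheses, OR, and string literals safely) and could cache the parsed terms per query string, but even this shape shows why a generator like ANTLR quickly pays for itself.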