XML: seek to specific elements as efficiently as possible - C#

I'm working on an application where I have to read a specific XML node (the 'Progress' node) out of several large (3 MB-ish) files.
I'm doing that via TextReader and XDocument, as shown below:
TextReader reader = new StreamReader(Filename);
XDocument objDoc = XDocument.Load(reader);
var progressElement = objDoc.Root.Element("Progress");
var lastAccessTime = progressElement.Element("LastTimeAccessed").Value;
var user = progressElement.Element("LastUserAccessed").Value;
var lastOpCode = progressElement.Element("LastOpCodeCompleted").Value;
var step = progressElement.Element("StepsCompleted").Value;
XDocument, I believe, loads the entire file into memory before doing anything else. However, I don't need that: I know the node is going to be the first node in the file.
Is there any kind of 'seeking' XML parser that doesn't cache the entire file first?
It's taking about 15 seconds to parse 10 files for the attributes mentioned above (terrible wireless here).

XmlReader is your best option if all you want is speed. It reads a node at a time, starting at the beginning of the document. The big limitation is that you can't go backward or use any random access to the XML document.
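A minimal sketch of how that could look for the question's file (the element names are taken from the code above; Filename again holds the path):

using System.Xml;
using System.Xml.Linq;

// Pull-parse forward to the Progress element, materialize just that
// subtree, and stop; the rest of the 3 MB file is never touched.
using (XmlReader reader = XmlReader.Create(Filename))
{
    if (reader.ReadToFollowing("Progress"))
    {
        // XNode.ReadFrom consumes only the current element.
        var progressElement = (XElement)XNode.ReadFrom(reader);
        var lastAccessTime = progressElement.Element("LastTimeAccessed").Value;
        var user = progressElement.Element("LastUserAccessed").Value;
        var lastOpCode = progressElement.Element("LastOpCodeCompleted").Value;
        var step = progressElement.Element("StepsCompleted").Value;
    }
}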

Yes. You can use a SAX parser, which works differently from XDocument. Basically, a SAX parser works its way through the input XML, firing events back at callback code. (You write these callback handlers.) The main advantages:
The entire document need not be read into a memory model (a DOM).
You can stop the processing when you have what you want.
Have a look at http://www.ibm.com/developerworks/library/x-tipsaxstop/
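Note for C#: the .NET base class library doesn't ship a SAX implementation. XmlReader is the usual .NET counterpart, a pull parser rather than a push parser, and it gives you the same two advantages (no in-memory DOM, and you can stop reading early).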

Related

Read XML from stream, extract node, base64 decode it and write to file with minimum internal memory required

I have a WebResponse.GetResponseStream which can deliver me a string (an XML file). Now I need only one node of the XML, the one carrying the data (OK, I would need some other nodes too). I could save the stream as an XML file, then parse it, extract the data from the node and save it as a file again, then base64 decode that file and save the real file.
For all that, I want to use the minimum possible amount of internal memory (RAM). Doing everything on the fly eats memory very fast on bigger files and could also lead to an out-of-memory problem. In the idea described above, I'd write multiple files just to extract the data. Can this be done in one step, so that only one file is written at all? The main requirement is to be as memory (RAM) efficient as possible; perhaps one small buffer is necessary, but no doubling or tripling of memory use.
Use an XmlReader:
string xml = ""; //put your string input here
StringReader reader = new StringReader(xml);
XmlReader xreader = XmlReader.Create(reader);
Using an XmlReader is correct... but you do NOT want to load the XML into a string to do it. XmlReader is designed to read from a stream just fine: use XmlReader.Create(yourStream).
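A minimal sketch of the whole pipeline under that advice; ExtractBase64Node and its parameters are illustrative names, and it assumes the payload sits in a single element whose name you know:

using System.IO;
using System.Xml;

// Streams the base64 content of one element straight into a file.
// Only 'buffer' (8 KB here) is ever held in memory.
static void ExtractBase64Node(Stream xmlStream, string elementName, string outputPath)
{
    byte[] buffer = new byte[8192];
    using (XmlReader reader = XmlReader.Create(xmlStream))
    using (FileStream output = File.Create(outputPath))
    {
        if (reader.ReadToFollowing(elementName))
        {
            int bytesRead;
            // ReadElementContentAsBase64 decodes incrementally as it reads.
            while ((bytesRead = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, bytesRead);
            }
        }
    }
}

Fed directly with WebResponse.GetResponseStream(), this decodes and writes the payload in one pass, with no intermediate string or temp file.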

Parsing an XML file that comes in as one object per line

I haven't been here in so long, I forgot my prior account! Anyway, I am working on parsing an XML document that comes in ugly. It is for banking statements. Each line is a <statement>all tags</statement>. What I need to do is read this file in and parse the XML document at the same time, while also formatting it to be more human-readable. Point being:
Original input looks like this:
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
I need the final output to be as follows:
<statement>
<name></name>
<address></address>
</statement>
This is fine and dandy. I am using the following, which is very slow: considering 5.1 million lines, a 254k data file, and about 60k statements, it takes around 8 minutes.
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}
Then, when the file is formatted, this is what sucks: I need to check every single tag in the transaction elements, and if a tag that could be there is missing, I have to fill it in. Our designer software will default in prior values if a tag is possible but the current object does not have it: it defaults in the value of a prior one that was not null. "I know, and they swear up and down it is not a bug... ok?"
So that is also taking about 5 to 10 minutes. I need to break all this down and find a faster method for working with the initial XML. This is a preprocess action and cannot take that long if not necessary. It just seems redundant.
Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write to a temp file, and then read that file back in to the output file, inserting the missing tags. Two I/O runs for one process. Yuck.
You can start by trying a modified loop to see if it speeds things up for you:
XElement root = new XElement("Statements");
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    root.Add(xElement);
}
sr.WriteLine(root.ToString().Trim());
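This builds the whole tree and writes it once, so the writer is hit a single time instead of once per statement, and the output is one well-formed document with a single Statements root; the trade-off is that the entire tree now lives in memory.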
Well, I'm not sure if this will help with the memory issues. If it works, you'll get multiple XML files:
int fileCount = 1;
int count = 0;
XElement root = null; // initialized so the Save lambda can capture it
Action Save = () => root.Save(string.Format("statements{0}.xml", fileCount++));

while (count < lines.Length) // or lines.Count
{
    try
    {
        root = new XElement("Statements");
        foreach (String item in lines.Skip(count))
        {
            XElement xElement = XElement.Parse(item);
            root.Add(xElement);
            count++;
        }
        Save();
    }
    catch (OutOfMemoryException)
    {
        // Flush what has been collected so far, release it, and
        // resume from the current position on the next pass.
        Save();
        root = null;
        GC.Collect();
    }
}
xmllint --format file-as-one-line > output.xml

Read/Write Large XML Files in C#

I am developing an application with an XML database. I have large XML files from which I have to read and write data.
The problem is that I do not want to load the whole XML file into memory, and I also do not want to loop through the whole file, because of performance. Loading the whole file into memory will hurt the application's performance and may crash the application because of a memory leak.
I need an efficient way to read and write XML to the file that does not hurt performance or memory.
Any help will be appreciated.
If this XML decision is not yours and you have to deal with it, you can stream fragments with XmlReader plus XElement (see the whole MSDN sample at http://msdn.microsoft.com/en-us/library/bb387013.aspx):
static IEnumerable<XElement> StreamCustomerItem(string uri)
{
    using (XmlReader reader = XmlReader.Create(uri))
    {
        XElement name = null;
        XElement item = null;

        reader.MoveToContent();

        // Loop through the Customer elements without ever
        // materializing the whole document.
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element
                && reader.Name == "Customer")
            {
                // Move to the Name element and copy it out.
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element
                        && reader.Name == "Name")
                    {
                        name = XElement.ReadFrom(reader) as XElement;
                        break;
                    }
                }

                // Loop through this customer's Item elements.
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.EndElement)
                        break;
                    if (reader.NodeType == XmlNodeType.Element
                        && reader.Name == "Item")
                    {
                        item = XElement.ReadFrom(reader) as XElement;
                        if (item != null)
                        {
                            // Parent the item under a temporary root together
                            // with the customer name, so the caller can reach
                            // the header through item.Parent.
                            XElement tempRoot = new XElement("Root",
                                new XElement(name));
                            tempRoot.Add(item);
                            yield return item;
                        }
                    }
                }
            }
        }
    }
}
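A hedged usage sketch (customers.xml is a placeholder; the layout is whatever the MSDN sample expects):

foreach (XElement item in StreamCustomerItem("customers.xml"))
{
    // item.Parent is the temporary root, which also carries the
    // customer's Name element copied out by the iterator.
    Console.WriteLine(item.Parent.Element("Name").Value);
}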
BUT, if you control the decision, PLEASE forget about XML. There are several options that will help you and your application work properly without too much hassle:
SQL Server Compact. A nice and easy SQL approach from Microsoft that does not require a SQL Server instance. http://www.microsoft.com/en-us/sqlserver/editions/2012-editions/compact.aspx
SQLite. Works with .NET and even Windows 8 applications; easy and pretty stable. http://system.data.sqlite.org/index.html/doc/trunk/www/index.wiki
You could even use MySQL, MariaDB, or anything similar!
Have a look at this; it will give you some idea about reading XML quickly: http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx
There are already some threads about writing XML files on Stack Overflow:
How to write (big) XML to a file in C#?
However, I think if you are looking for really good performance, a database solution such as SQL Server or MongoDB might be a better option.
Use XmlReader. It is a nice alternative that lets us keep only the current record in memory, which can hugely improve performance.
Edit: Never use the Load method. It will load the whole XML file into memory, and if this file is quite big, not only might the query take a long time to execute, it might fail by running out of memory.
To a certain extent, performance depends on the .NET version your application is running on. Another quick reference is the Microsoft Patterns and Practices article.
There are four ways: XmlDocument, XPathNavigator, XmlTextReader, and LINQ to XML. I think the differences between them are worth knowing!
XmlDocument:
It represents the contents of an XML file. When loading it from a file, you read the entire file into memory. Generally speaking, XML parsing is much slower if you're using XmlDocument, which is geared towards loading the whole DOM into RAM; your application's memory consumption might grow just like how a caterpillar moves!!
Using the DOM-model and the XmlDocument or XPathDocument classes to parse large XML documents can place significant demands on memory. These demands may severely limit the scalability of server-side Web applications.
XPath or LINQ-To-XML:
If your concern is mainly performance, I'd personally not recommend using XPath or LINQ-to-XML queries. XPathNavigator provides a cursor model for navigating and editing XML data.
XmlReader:
It can achieve better performance than XmlDocument, as others have already suggested. XmlReader is an abstract class that provides an API for fast, forward-only, read-only parsing of an XML data stream. It can read from a file, from an internet location, or from any other stream of data. When reading from a file, you don't load the entire document at once, and that is where it shines.
XmlTextReader:
XmlTextReader is an implementation of XmlReader. Use XmlTextReader to process XML data quickly in a forward-only, read-only manner, without validation, XPath, or XSLT services.
EOL normalization is always on in readers created by XmlReader.Create, which affects XDocument. Normalization is off by default in XmlTextReader, which affects XmlDocument and XmlNodeReader; it can be turned on via the Normalization property.
Design Considerations
Consider validating large documents
Use streaming interfaces
Consider hard-coded transformations
Consider element and attribute name lengths (!)
Consider using XmlNameTable : https://msdn.microsoft.com/en-us/library/system.xml.xmlnametable%28v=vs.110%29.aspx
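A minimal sketch of that last point (the file path and element name are placeholders): names added to an XmlNameTable are atomized, so element-name checks inside a tight read loop become reference comparisons instead of string comparisons.

using System.Xml;

// Atomize the names we care about up front.
NameTable nameTable = new NameTable();
object progressName = nameTable.Add("Progress"); // placeholder element name

XmlReaderSettings settings = new XmlReaderSettings { NameTable = nameTable };
using (XmlReader reader = XmlReader.Create("data.xml", settings))
{
    while (reader.Read())
    {
        // Reference equality is enough: both strings come from the same table.
        if (reader.NodeType == XmlNodeType.Element
            && (object)reader.LocalName == progressName)
        {
            // ... handle the element ...
        }
    }
}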
Benchmark Test
https://blogs.msdn.microsoft.com/codejunkie/2008/10/08/xmldocument-vs-xelement-performance/

Diffing large XML files in C# (.net 2.0)

I'm kind of stuck having to use .NET 2.0, so LINQ to XML isn't available, although I would be interested in how it would compare...
I had to write an internal program to download, extract, and compare some large XML files (about 10 MB each) that are essentially build configurations. I first attempted using libraries such as Microsoft's XML diff/patch, but comparing the files took 2-3 minutes, even when ignoring whitespace, namespaces, etc. (I tested each ignore option one at a time to try and figure out what was speediest). Then I tried to implement my own ideas: lists of nodes from XmlDocument objects, dictionaries of keys of the root's direct descendants (45,000 children, by the way) pointing to ints indicating the node position in the XML document... all took at least 2 minutes to run.
My final implementation finishes in 1-2 seconds: I made a system process call to diff with a few lines of context and saved those results to display (our development machines include Cygwin, thank goodness).
I can't help but think there is a better, XML-specific way to do this that would be just as fast as a plain-text diff, especially since all I'm really interested in is the name element that is the child of each direct descendant, and I could throw away 4/5 of the file for my purposes (we only need to know which files were included, nothing else involving language or version).
So, as popular as XML is, I'm sure somebody out there has had to do something similar. What is a fast, efficient way to compare these large XML files? (preferably open source or free)
Edit: a sample of the nodes. I only need to find missing name elements (there are over 45k nodes as well):
<file>
<name>SomeFile</name>
<version>10.234</version>
<countries>CA,US</countries>
<languages>EN</languages>
<types>blah blah</types>
<internal>N</internal>
</file>
XmlDocument source = new XmlDocument();
source.Load("source.xml");

Dictionary<string, XmlNode> files = new Dictionary<string, XmlNode>();
foreach (XmlNode file in source.SelectNodes("//file"))
    files.Add(file.SelectSingleNode("./name").InnerText, file);

XmlDocument source2 = new XmlDocument();
source2.Load("source2.xml");

XmlNode value;
foreach (XmlNode file in source2.SelectNodes("//file"))
{
    if (files.TryGetValue(file.SelectSingleNode("./name").InnerText, out value))
    {
        // This file is both in source and source2.
    }
    else
    {
        // This file is only in source2.
    }
}
I am not sure exactly what you want; I hope this example helps you in your quest. Note that the loop above only catches files missing from source; running the same dictionary check in the other direction would also catch files that exist only in source.
Diffing XML can be done in many ways. You're not being very specific regarding the details, though. What does transpire is that the files are large and you only need 4/5 of the information.
Well, then the algorithm is as follows:
Normalize and reduce the documents to the information that matters.
Save the results.
Compare the results.
And the implementation:
Use the XmlReader API, which is efficient, to produce plain-text representations of your information; a rough sketch follows this list. Why a plain-text representation? Because diff tools are predicated on the assumption that their input is plain text. And so are our eyeballs. Why XmlReader? You could use SAX, which is memory-efficient, but XmlReader is more efficient. As for the precise spec of that plain-text file... you're just not including enough information.
Save the plain text files to some temp directory.
Use a command-line diff utility like GnuWin32 diff to get the diff output. Yeah, I know, not pure and proper, but it works out of the box and there's no coding to be done. If you are familiar with some C# diff API (I am not), then use that API instead, of course.
Delete the temp files. (Or optionally keep them if you're going to reuse them.)
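For step 1, a minimal sketch under the question's schema (a name element inside each file entry, as in the sample above; DumpFileNames is an illustrative name). The two resulting text files then go straight into the diff of step 3:

using System.IO;
using System.Xml;

// Writes the text of every <name> element to a plain-text file, one per
// line, without ever building an in-memory document (works on .NET 2.0).
static void DumpFileNames(string xmlPath, string textPath)
{
    using (XmlReader reader = XmlReader.Create(xmlPath))
    using (StreamWriter writer = new StreamWriter(textPath))
    {
        while (reader.ReadToFollowing("name"))
        {
            writer.WriteLine(reader.ReadElementContentAsString());
        }
    }
}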

Reading an XML file multithreaded

I've searched a lot but I couldn't find a proper solution to my problem. I wrote an XML file containing all the episode information of a TV show. It's 38 KB and contains attributes and strings for about 680 variables. At first I simply read it with the help of XmlTextReader, which worked fine on my quad-core, but my wife's five-year-old laptop took about 30 seconds to read it. So I thought about multithreading, but I get an exception because the file is already opened.
Thread start looks like this
while (reader.Read())
{
    ...
    else if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name.Equals("Season1"))
        {
            current.seasonNr = 0;
            current.currentSeason = season[0];
            current.reader = reader;
            seasonThread[0].Start(current);
        }
        else if (reader.Name.Equals("Season2"))
        {
            current.seasonNr = 1;
            current.currentSeason = season[1];
            current.reader = reader;
            seasonThread[1].Start(current);
        }
And the parsing method like this
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
    reader.MoveToFirstAttribute();
    tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
    ...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position. But I also have no clue whether this can work at all.
Please help!
EDIT:
Guys, where did I write about IE?? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT2:
I did some Stopwatch research and figured out that parsing the XML file only takes about 200 ms on my PC and 800 ms on my wife's laptop. Is it WPF being so slow? What can I do?
I agree with most everyone's comments. Reading a 38 KB file should not take so long. Do you have something else running on the machine, antivirus etc., that could be interfering with the processing?
The amount of time it takes to create a thread will be far greater than the time spent reading the file. If you could post the actual code used to read the file, and the file itself, it might help analyze the performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file you need to know everything that comes before it, if nothing else to know at what level you are.
Your code, if it worked, would do something like this:
main    season1    season2
read
read
skip    read
skip    read
read
skip               read
skip               read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an .xml database.
If that is correct, I would recommend breaking your XML into different .xml files with an indexed .xml document. You could then query a set of .xml data from a specific .xml source using LINQ to XML.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files, and you would be able to (although it is highly discouraged) asynchronously load .xml document objects.
Your XML schema doesn't lend itself to parallelism, since you seem to have node names (Season1, Season2) that contain the same data but must be parsed individually. You could redesign your schema to use the same node name (i.e. Season) and an attribute that expresses the differences in the data (i.e. Number to indicate the season number). Then you can parallelize, e.g. using LINQ to XML and PLINQ:
XDocument doc = XDocument.Load(@"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
                    .AsParallel()
                    .Select(x => new Season()
                    {
                        Number = (int)x.Attribute("Number"),
                        Description = x.Value
                    }).ToList();
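Season here is the answerer's own DTO; an assumed minimal shape that would make the snippet compile (not from the original answer):

// Hypothetical DTO matching the projection above.
class Season
{
    public int Number { get; set; }
    public string Description { get; set; }
}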
