How to set the start point for reading an xml file? - c#

I have a large XML document (111 MB) and want to jump to a specific node (by index) very fast. The document has about 1,000,000 nodes like this:
<Kt>
  <PLZ>01067</PLZ>
  <Ort>Dresden</Ort>
  <OT>NULL</OT>
  <Strasse>Potthoffstr.</Strasse>
</Kt>
I want to "jump", for example, to the one-millionth node in the document and start reading from there. All nodes before it should be skipped. I've already tried this with XmlReader, but it always starts reading from the first node.
int i = 0; // v----------- index of the node where I want to go!
while (reader.Read() && i < 1000000)
{
    if (reader.Name == "PLZ")
    {
        textBox1.Text = reader.ReadString();
    }
    if (reader.Name == "Ort")
    {
        textBox2.Text = reader.ReadString();
    }
    if (reader.Name == "OT")
    {
        textBox3.Text = reader.ReadString();
    }
    if (reader.Name == "Strasse")
    {
        textBox4.Text = reader.ReadString();
        i++;
    }
}
This is what the structure of the XML document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Kt.xsd" generated="2014-10-21T18:20:30">
<Kt>
  <PLZ>01...</PLZ>
  <Ort>Dresden</Ort>
  <OT>NULL</OT>
  <Strasse>NULL</Strasse>
</Kt>
<Kt>
  <PLZ>01067</PLZ>
  <Ort>Dresden</Ort>
  <OT>Innere Altstadt</OT>
  <Strasse>Marienstr.</Strasse>
</Kt>
<Kt>
  <PLZ>01067</PLZ>
  <Ort>Dresden</Ort>
  <OT>NULL</OT>
  <Strasse>Potthoffstr.</Strasse>
</Kt>
In other words: what are the options for loading part of a large XML file without reading the complete file?

You will have to read all the data up to that point, because xml (in common with most text-based deserialization formats) does not lend itself to skipping data. XmlReader has some helper methods to assist with this, like ReadToNextSibling and ReadToFollowing. Basically, that's the best you'll do unless you pre-index the file (separately) with the byte offsets of various elements (say, every 100th or 1000th element). And doing that means you'd be working in fragment (rather than document) mode, and you'd need to be very careful about namespaces (particularly: aliases declared on the document root).
Basically, what you are doing seems about right, if we start with the premise of having a 111MB, multi-million-element xml file. Frankly, my advice would be don't do that in the first place. Xml is not a good choice for huge data, unless it is purely as a dead-drop, perhaps to be bulk-loaded again later. It does not allow for efficient random access.

If you need to do this often, then you're doing the wrong thing. The data should be in a database, or at the very least, stored in smaller chunks.
If you're not doing it often, then is it really a problem? I would expect it to be doable in 5 seconds or so.
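The pre-indexing idea from the answer above can be sketched like this. This is a sketch under several assumptions: the file is UTF-8, the start tags are literally <Kt> with no attributes, and that byte sequence never occurs inside character data. The class and method names (KtIndex, BuildIndex, ReadKtAt) are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class KtIndex
{
    // Pass 1: scan the raw bytes once and record the byte offset of every
    // step-th "<Kt>" start tag. (A buffered scan would be faster in practice;
    // ReadByte is used here for brevity.)
    public static List<long> BuildIndex(string path, int step = 1000)
    {
        var offsets = new List<long>();
        byte[] tag = Encoding.UTF8.GetBytes("<Kt>");
        using var fs = File.OpenRead(path);
        int b, matched = 0, count = 0;
        while ((b = fs.ReadByte()) != -1)
        {
            if (b == tag[matched])
            {
                if (++matched == tag.Length)
                {
                    if (count % step == 0)
                        offsets.Add(fs.Position - tag.Length);
                    count++;
                    matched = 0;
                }
            }
            else
            {
                matched = (b == tag[0]) ? 1 : 0;
            }
        }
        return offsets;
    }

    // Pass 2 (possibly in a later run): seek straight to a saved offset and
    // parse one <Kt> element from there in fragment mode, since the stream
    // no longer starts at the document root.
    public static string ReadKtAt(string path, long offset)
    {
        using var fs = File.OpenRead(path);
        fs.Seek(offset, SeekOrigin.Begin);
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using var reader = XmlReader.Create(fs, settings);
        reader.ReadToFollowing("Kt");
        return reader.ReadOuterXml();
    }
}
```

As the answer warns, this only works because the sample document declares no namespace aliases on the root that the fragments depend on; with namespaced content the fragments would not parse on their own.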

Related

Parsing xml file that comes in as one object per line

I haven't been here in so long, I forgot my prior account! Anyway, I am working on parsing an XML document that comes in ugly. It is for banking statements. Each line is a <statement>all tags</statement>. Now, what I need to do is read this file in and parse the XML document at the same time, while also formatting it to be more human-readable.
Original input looks like this:
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
I need the final output to be as follows:
<statement>
  <name></name>
  <address></address>
</statement>
This is fine and dandy. I am using the following, which is very slow: with 5.1 million lines, a 254k data file, and about 60k statements, it takes around 8 minutes.
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}
Then, once the file is formatted, comes the part that sucks. I need to check every single tag in the transaction elements, and if a tag that could be there is missing, I have to fill it in. Our designer software will default in prior values if a tag is possible but the current object does not have it: it defaults in the value of a prior one that was not NULL. "I know, and they swear up and down it is not a bug... ok?"
So that is also taking about 5 to 10 minutes. I need to break all this down and find a faster method for working with the initial XML. This is a preprocessing step and cannot take that long if not necessary. It just seems redundant.
Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write to a temp file, and then read that file in, writing to the output file and inserting the missing tags. Two I/O passes for one process. Yuck.
You can start by trying a modified for loop to see if this speeds it up for you:
XElement root = new XElement("Statements");
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    root.Add(xElement);
}
sr.WriteLine(root.ToString().Trim());
Well, I'm not sure if this will help with memory issues. If it works, you'll get multiple xml files.
int fileCount = 1;
int count = 0;
XElement root = null;
Action Save = () => root.Save(string.Format("statements{0}.xml", fileCount++));
while (count < lines.Length) // or lines.Count
{
    try
    {
        root = new XElement("Statements");
        foreach (String item in lines.Skip(count))
        {
            XElement xElement = XElement.Parse(item);
            root.Add(xElement);
            count++;
        }
        Save();
    }
    catch (OutOfMemoryException)
    {
        Save(); // write out what we collected so far, then start a new file
        root = null;
        GC.Collect();
    }
}
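A lower-memory variant, not taken from the answers above but worth a sketch: LINQ to XML's XStreamingElement evaluates its content lazily while saving, so the one-statement-per-line input can be wrapped in a root element and written out in a single pass, with only one parsed <statement> in memory at a time. ReformatStatements is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Wrap the one-statement-per-line input in a root element and write it out
// in one pass. XStreamingElement streams its content query while saving,
// so the whole tree is never materialized in memory.
static void ReformatStatements(IEnumerable<string> lines, string outputPath)
{
    var root = new XStreamingElement("Statements",
        lines.Select(line => XElement.Parse(line)));
    root.Save(outputPath); // indented ("human readable") output by default
}
```

Note that, unlike the OutOfMemoryException approach, this never needs to split the output into multiple files, because memory use stays flat regardless of input size.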
xmllint --format file-as-one-line > output.xml

XML Seek to specific elements, as efficiently as possible

I'm working on an application where I have to read a specific XML node (the 'progress' node) out of several large (3 MB-ish) files.
I'm doing that via TextReader and XDocument, as shown below:
TextReader reader = new StreamReader(Filename);
XDocument objDoc = XDocument.Load(reader);
var progressElement = objDoc.Root.Element("Progress");
var lastAccessTime = progressElement.Element("LastTimeAccessed").Value;
var user = progressElement.Element("LastUserAccessed").Value;
var lastOpCode = progressElement.Element("LastOpCodeCompleted").Value;
var step = progressElement.Element("StepsCompleted").Value;
XDocument, I believe, loads the entire file into memory before doing anything else. However, I don't need that: I know the node is going to be the first node in the file.
Is there any kind of 'seek' XML parser that doesn't cache the entire file first?
It's taking about 15 seconds to parse 10 files for the attributes mentioned above (terrible wireless here).
XmlReader is your best option if all you want is speed. It reads a node at a time, starting at the beginning. The big limitation is that you can't go backward or use any random-access to the XML document.
Yes. You can use a SAX parser, which works differently from XDocument. Basically, a SAX parser works its way through the input XML, firing events back at callback code (you write these callback handlers). The main advantages:
The entire document need not be read into a memory model (a DOM).
You can stop the processing when you have what you want.
Have a look at http://www.ibm.com/developerworks/library/x-tipsaxstop/
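.NET doesn't ship a SAX parser, but the same stop-early behavior is available with XmlReader, and the question's scenario (the Progress node comes first in the file) suits it well. A minimal sketch, assuming the element names shown in the question; ReadLastAccessTime is a hypothetical helper.

```csharp
using System.Xml;

// Read only the leading <Progress> element and stop: the rest of the file
// is never parsed or downloaded into a DOM. Element names are taken from
// the question's code.
static string ReadLastAccessTime(string filename)
{
    using var reader = XmlReader.Create(filename);
    if (!reader.ReadToFollowing("Progress"))
        return null; // no Progress element at all
    if (!reader.ReadToDescendant("LastTimeAccessed"))
        return null;
    return reader.ReadElementContentAsString(); // returning here abandons the rest of the file
}
```

Over a slow connection this matters even more, since disposing the reader early means the remaining bytes of each 3 MB file are never pulled across the wire from a stream-backed source.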

Fast way to get number of elements in a xml document

Is there a best practice for getting the number of elements in an XML document, for progress-reporting purposes?
I have a 2 GB XML file containing flights which I need to process, and my idea is to first get the number of all elements in the file and then use a counter to show "x of x flights are imported to our database".
For the file processing we are using the XmlTextReader in .NET (C#) to get the data without reading the whole document into memory (similar to SAX parsing).
So the question is: how can I get the number of those elements very quickly? Is there a best practice, or should I go through the whole document first and do something like i++;?
Thanks!
You certainly can just read the document twice: once to simply count the elements (using XmlReader.ReadToFollowing, for example, or possibly ReadToNextSibling), increasing a counter as you go:
int count = 0;
while (reader.ReadToFollowing(name))
{
    count++;
}
However, that does mean reading the file twice...
An alternative is to find the length of the file, and as you read through the file once, report the percentage of the file processed so far, based on the position of the underlying stream. This will be less accurate, but far more efficient. You'll need to create the XmlReader directly from a Stream so that you can keep checking the position though.
int count = 0;
using (XmlReader xmlReader = XmlReader.Create(path)) // read from the file, not from a string held in memory
{
    while (xmlReader.Read())
    {
        if (xmlReader.NodeType == XmlNodeType.Element &&
            xmlReader.Name.Equals("Flight"))
            count++;
    }
}
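The single-pass, stream-position alternative described above can be sketched as follows. ImportFlights is a hypothetical name, "Flight" is the element name from the question, and the percentage is approximate because the reader buffers ahead of the position we observe.

```csharp
using System;
using System.IO;
using System.Xml;

// Report progress from the underlying stream's position: one pass over the
// file, no pre-counting. We keep our own reference to the FileStream so we
// can query Position while the XmlReader consumes it.
static int ImportFlights(string path, Action<int, int> report)
{
    using var stream = File.OpenRead(path);
    using var reader = XmlReader.Create(stream);
    long length = stream.Length;
    int imported = 0;
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Flight")
        {
            imported++; // a real import would read the element's content here
            report(imported, (int)(100 * stream.Position / length));
        }
    }
    return imported;
}
```

For a 2 GB file this halves the I/O compared with counting first, at the cost of reporting "percent of file" instead of "x of x flights".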

Reading a xml file multithreaded

I've searched a lot but I couldn't find a proper solution to my problem. I wrote an XML file containing all the episode information of a TV show. It's 38 KB and contains attributes and strings for about 680 variables. At first I simply read it with XmlTextReader, which worked fine on my quad-core, but my wife's five-year-old laptop took about 30 seconds to read it. So I thought about multithreading, but I get an exception because the file is already opened.
The thread start looks like this:
while (reader.Read())
{
    ...
    else if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name.Equals("Season1"))
        {
            current.seasonNr = 0;
            current.currentSeason = season[0];
            current.reader = reader;
            seasonThread[0].Start(current);
        }
        else if (reader.Name.Equals("Season2"))
        {
            current.seasonNr = 1;
            current.currentSeason = season[1];
            current.reader = reader;
            seasonThread[1].Start(current);
        }
And the parsing method like this
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
    reader.MoveToFirstAttribute();
    tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
    ...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position, but I have no clue whether this can work at all.
Please help!
EDIT:
Guys, where did I write about IE? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT2:
I did some Stopwatch research and figured out that parsing the XML file only takes about 200 ms on my PC and 800 ms on my wife's laptop. Is it WPF being so slow? What can I do?
I agree with most everyone's comments. Reading a 38 KB file should not take so long. Do you have something else running on the machine (antivirus, etc.) that could be interfering with the processing?
The amount of time it would take you to create a thread will be far greater than the amount of time spent reading the file. If you could post the actual code used to read the file and the file itself, it might help analyze performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file, you need to know everything that comes before it, if nothing else, to know at what level you are.
Your code, if it worked, would do something like this:
main    season1    season2
read
        read
skip    read
skip    read
read
skip               read
skip               read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an .xml database.
If correct, I would recommend breaking your XML into different .xml files, with an indexed .xml document. I would think you can then query, using LINQ to XML, a set of .xml data from a specific .xml source.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files and you would be able to, although highly discouraged, asynchronously load .xml document objects.
Your XML schema doesn't lend itself to parallelism, since you seem to have node names (Season1, Season2) that contain the same data but must be parsed individually. You could redesign your schema to have the same node names (i.e. Season) and attributes that express the differences in the data (i.e. Number to indicate the season number). Then you can parallelize, e.g. using LINQ to XML and PLINQ:
XDocument doc = XDocument.Load(@"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
                    .AsParallel()
                    .Select(x => new Season()
                    {
                        Number = (int)x.Attribute("Number"),
                        Description = x.Value
                    }).ToList();

getting element offset from XMLReader

how's everyone doing this morning?
I'm writing a program that will parse several XML files.
This stage of the program is going to focus on adding/editing skills/schools/abilities/etc. for a tabletop RPG (L5R). What I learn from this one example should carry me through the rest of the program.
So I've got the XML reading set up using XmlReader. The file I'm reading looks like...
<skills>
  <skill>
    <name>some name</name>
    <description>a skill</description>
    <type>high</type>
    <stat>perception</stat>
    <page>42</page>
    <availability>all</availability>
  </skill>
</skills>
I set up a Skill class, which holds the data, and a SkillEdit class, which reads in the data and will eventually have methods for editing and adding.
I'm currently able to read everything in correctly, but it occurred to me that since the description can vary in length, once I write the edit method, the best way to ensure no data is overwritten would be to append the edited skill to the end of the file and wipe out its previous entry.
In order to do that, I would need to know the file offset of <skill> and of </skill>, but I can't seem to find any way of getting those offsets.
Is there a way to do that, or can you guys suggest a better implementation for editing an already existing skill?
If you read your XML into LINQ to XML's XDocument (or XElement), everything could become very easy. You can read, edit, add stuff, etc. to XML files using a simple interface.
e.g.,
var xmlStr = @"<skills>
  <skill>
    <name>some name</name>
    <description>a skill</description>
    <type>high</type>
    <stat>perception</stat>
    <page>42</page>
    <availability>all</availability>
  </skill>
</skills>
";
var doc = XDocument.Parse(xmlStr);
// find the skill "some name"
var mySkill = doc
    .Descendants("skill")                               // out of all skills
    .Where(e => e.Element("name").Value == "some name") // that have the name "some name"
    .SingleOrDefault();                                 // select it
if (mySkill != null) // if found...
{
    var skillType = mySkill.Element("type").Value; // read the type
    var skillPage = (int)mySkill.Element("page");  // read the page (as an int)
    mySkill.Element("description").Value = "an AWESOME skill"; // change the description
    // etc...
}
No need to calculate offsets, manual, step-by-step reading or maintaining other state, it is all taken care of for you.
Don't do it! In general, you can't reliably know anything about physical offsets in the serialized XML because of possible character encoding differences, entity references, embedded comments and a host of other things that can cause the physical and logical layers to have a complex relationship.
If your XML is just sitting on the file system, your safest option is to have a method in your skill class which serializes to XML (you already have one to read XML already), and re-serialize whole objects when you need to.
Tyler,
Umm, sounds like you're suffering from a text-book case of premature optimization... Have you PROVEN that reading and writing the COMPLETE skill list to/from the XML file is TOO slow? No? Well then, until it's been proven, there IS NO PERFORMANCE ISSUE, right? So we just write the simplest code that works (i.e. does what we want, without worrying too much about performance), and then move straight on to the next bit of tricky functionality... testing as we go.
Iff (which is short for if-and-only-if) I had a PROVEN performance problem, then-and-only-then would I consider writing each skill to an individual XML file, to avert the necessity of rewriting a potentially large list of skills each time a single skill was modified... But this is "reference data", right? I mean, you wouldn't de/serialize your (volatile) game data to/from an XML file, would you? Because an RDBMS is known to be much better at that job, right? So you're NOT going to be rewriting this file often?
Cheers. Keith.
