Fast way to get the number of elements in an XML document - C#

Is there a best practice for getting the number of elements in an XML document, for progress-reporting purposes?
I have a 2 GB XML file containing flights that I need to process, and my idea is to first get the total number of elements in the file and then use a counter to show that x of y flights have been imported into our database.
For the file processing we are using XmlTextReader in .NET (C#) to get the data without reading the whole document into memory (similar to SAX parsing).
So the question is: how can I get the number of those elements very quickly? Is there a best practice, or should I go through the whole document first and do something like i++?
Thanks!

You certainly can just read the document twice - once simply to count the elements (using XmlReader.ReadToFollowing, for example, or possibly ReadToNextSibling), increasing a counter as you go:
int count = 0;
while (reader.ReadToFollowing(name))
{
    count++;
}
However, that does mean reading the file twice...
An alternative is to find the length of the file and, as you read through the file once, report the percentage of the file processed so far, based on the position of the underlying stream. This will be less accurate, but far more efficient. You'll need to create the XmlReader directly from a Stream so that you can keep checking the position, though.
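For example, a minimal sketch of that idea (the file name and the "Flight" element name are assumptions):

int processed = 0;
using (FileStream stream = File.OpenRead("flights.xml"))
using (XmlReader reader = XmlReader.Create(stream))
{
    while (reader.ReadToFollowing("Flight"))
    {
        // ... import this flight ...
        processed++;

        // stream.Position runs slightly ahead of the parser because of
        // read-ahead buffering, so treat the percentage as approximate.
        double percent = 100.0 * stream.Position / stream.Length;
        Console.WriteLine("Imported {0} flights ({1:F1}%)", processed, percent);
    }
}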

int count = 0;
using (XmlReader xmlReader = new XmlTextReader(new StringReader(text)))
{
    while (xmlReader.Read())
    {
        if (xmlReader.NodeType == XmlNodeType.Element &&
            xmlReader.Name.Equals("Flight"))
            count++;
    }
}

Related

How to set the start point for reading an XML file?

I have a large XML document (111 MB) and want to jump to a specific node (by index) very quickly. The document has about 1,000,000 nodes like this:
<Kt>
  <PLZ>01067</PLZ>
  <Ort>Dresden</Ort>
  <OT>NULL</OT>
  <Strasse>Potthoffstr.</Strasse>
</Kt>
I want to "jump", for example to the one millionth node in the document and start from this to read. All nodes behind of this must be ignore. I've tried it already with the XMLReader but these start always to read from the first node.
int i = 0; // index of the node I want to reach
while (reader.Read() && i < 1000000)
{
    if (reader.Name == "PLZ")
    {
        textBox1.Text = reader.ReadString();
    }
    if (reader.Name == "Ort")
    {
        textBox2.Text = reader.ReadString();
    }
    if (reader.Name == "OT")
    {
        textBox3.Text = reader.ReadString();
    }
    if (reader.Name == "Strasse")
    {
        textBox4.Text = reader.ReadString();
        i++;
    }
}
This is what the structure of the XML document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Kt.xsd" generated="2014-10-21T18:20:30">
  <Kt>
    <PLZ>01...</PLZ>
    <Ort>Dresden</Ort>
    <OT>NULL</OT>
    <Strasse>NULL</Strasse>
  </Kt>
  <Kt>
    <PLZ>01067</PLZ>
    <Ort>Dresden</Ort>
    <OT>Innere Altstadt</OT>
    <Strasse>Marienstr.</Strasse>
  </Kt>
  <Kt>
    <PLZ>01067</PLZ>
    <Ort>Dresden</Ort>
    <OT>NULL</OT>
    <Strasse>Potthoffstr.</Strasse>
  </Kt>
In other words: what are the options for loading part of a large XML file without reading the whole file?
You will have to read all the data up to that point, because XML (in common with most text-based serialization formats) does not lend itself to skipping data. XmlReader has some helper methods to assist with this, like ReadToNextSibling and ReadToFollowing. Basically, that's the best you'll do unless you pre-index the file (separately) with the byte offsets of various elements (say, every 100th or 1000th element). And doing that means you'd be working in fragment (rather than document) mode, and you'd need to be very careful about namespaces (particularly: aliases declared on the document root).
Basically, what you are doing seems about right, if we start with the premise of having a 111 MB, multi-million-element XML file. Frankly, my advice would be: don't do that in the first place. XML is not a good choice for huge data, unless it is purely a dead-drop, perhaps to be bulk-loaded again later. It does not allow for efficient random access.
If you need to do this often, then you're doing the wrong thing: the data should be in a database, or at the very least stored in smaller chunks.
If you're not doing it often, then is it really a problem? I would expect it to be doable in 5 seconds or so.
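If you do stick with forward-only reading, here is a minimal sketch of the skip-forward approach (the file name and target index are assumptions; Kt is the element name from the sample):

int target = 1000000; // 1-based index of the element we want
using (XmlReader reader = XmlReader.Create("Kt.xml"))
{
    if (reader.ReadToFollowing("Kt")) // position on the first <Kt>
    {
        int index = 1;
        while (index < target && reader.ReadToNextSibling("Kt"))
            index++; // still parses (rather than skips) every element it passes

        if (index == target)
        {
            string record = reader.ReadOuterXml(); // just the one <Kt> record
        }
    }
}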

Parsing an XML file that comes in as one object per line

I haven't been here in so long, I forgot my prior account! Anyway, I am working on parsing an XML document that comes in ugly. It is for banking statements. Each line is a <statement>all tags</statement>. What I need to do is read this file in and parse the XML document at the same time, while also formatting it to be more human-readable. Point being,
Original input looks like this:
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
I need the final output to be as follows:
<statement>
<name></name>
<address></address>
</statement>
This is fine and dandy. I am using the following, which is very slow: with 5.1 million lines, a 254k data file, and about 60k statements, it takes around 8 minutes.
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}
Then, once the file is formatted, comes the part that sucks: I need to check every single tag in the transaction elements, and if a tag that could be there is missing, I have to fill it in. Our designer software will default in prior values if a tag is possible but the current object does not have it; it defaults in the value of the prior one that was not null. "I know, and they swear up and down it is not a bug... ok?"
So that is also taking about 5 to 10 minutes. I need to break all this down and find a faster method for working with the initial XML. This is a preprocessing action and cannot take that long if not necessary. It just seems redundant.
Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write to a temp file, and then read that file back in to the output file, inserting the missing tags. Two I/O runs for one process. Yuck.
You can start by trying a modified loop to see if this speeds it up for you:
XElement root = new XElement("Statements");
foreach (String item in lines)
{
    XElement xElement = XElement.Parse(item);
    root.Add(xElement);
}
sr.WriteLine(root.ToString().Trim());
Well, I'm not sure whether this will help with memory issues. If it works, you'll get multiple XML files.
int fileCount = 1;
int count = 0;
XElement root = null; // must be initialized so the lambda below compiles
Action Save = () => root.Save(string.Format("statements{0}.xml", fileCount++));

while (count < lines.Length) // or lines.Count
{
    try
    {
        root = new XElement("Statements");
        foreach (string item in lines.Skip(count))
        {
            XElement xElement = XElement.Parse(item);
            root.Add(xElement);
            count++;
        }
        Save();
    }
    catch (OutOfMemoryException)
    {
        // Save what we have so far, release it, and continue with the rest.
        Save();
        root = null;
        GC.Collect();
    }
}
Or just use xmllint from the command line:
xmllint --format file-as-one-line > output.xml

Reading an XML file multithreaded

I've searched a lot, but I couldn't find a proper solution to my problem. I wrote an XML file containing all the episode information of a TV show. It's 38 KB and contains attributes and strings for about 680 variables. At first I simply read it with the help of XmlTextReader, which worked fine on my quad-core, but my wife's five-year-old laptop took about 30 seconds to read it. So I thought about multithreading, but I get an exception because the file is already opened.
The thread start looks like this:
while (reader.Read())
{
    ...
    else if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name.Equals("Season1"))
        {
            current.seasonNr = 0;
            current.currentSeason = season[0];
            current.reader = reader;
            seasonThread[0].Start(current);
        }
        else if (reader.Name.Equals("Season2"))
        {
            current.seasonNr = 1;
            current.currentSeason = season[1];
            current.reader = reader;
            seasonThread[1].Start(current);
        }
And the parsing method looks like this:
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
    reader.MoveToFirstAttribute();
    tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
    ...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position, but I also have no clue whether this could work at all.
Please help!
EDIT:
Guys, where did I write about IE?? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT 2:
I did some Stopwatch research and figured out that parsing the XML file only takes about 200 ms on my PC and 800 ms on my wife's laptop. Is it WPF being so slow? What can I do?
I agree with most everyone's comments. Reading a 38 KB file should not take so long. Do you have something else running on the machine, antivirus etc., that could be interfering with the processing?
The amount of time it would take you to create a thread will be far greater than the amount of time spent reading the file. If you could post the actual code used to read the file and the file itself, it might help analyze performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file, you need to know everything that comes before it, if nothing else, to know at what level you are.
Your code, if it worked, would do something like this:
main   season1   season2
read
read
skip   read
skip   read
read
skip              read
skip              read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an XML database.
If so, I would recommend breaking your XML into different .xml files, with an indexed .xml document. I would think you could then query - using LINQ to XML - a set of XML data from a specific .xml source.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files, and you would be able to, although it's highly discouraged, asynchronously load .xml document objects.
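For instance, a rough sketch of that layout (the index file, element names, and attributes are all hypothetical):

// index.xml maps each season to its own small file:
// <index><season number="1" file="season1.xml"/> ... </index>
XDocument index = XDocument.Load("index.xml");
string file = index.Root.Elements("season")
                   .Where(s => (int)s.Attribute("number") == 2)
                   .Select(s => (string)s.Attribute("file"))
                   .FirstOrDefault();
if (file != null)
{
    // Load only the small file that holds the data you actually need.
    XDocument season = XDocument.Load(file);
}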
Your XML schema doesn't lend itself to parallelism, since you have node names (Season1, Season2) that contain the same kind of data but must be parsed individually. You could redesign your schema to use the same node name (i.e. Season) with an attribute that expresses the differences in the data (i.e. Number to indicate the season number). Then you can parallelize, e.g. using LINQ to XML and PLINQ:
XDocument doc = XDocument.Load(@"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
                    .AsParallel()
                    .Select(x => new Season()
                    {
                        Number = (int)x.Attribute("Number"),
                        Description = x.Value
                    }).ToList();

What is the BEST way to replace text in a File using C# / .NET?

I have a text file that is being written to as part of a very large data extract. The first line of the text file is the number of "accounts" extracted.
Because of the nature of this extract, that number is not known until the very end of the process, but the file can be large (a few hundred megs).
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
IMPORTANT NOTE: - I do not need to replace a "fixed amount of bytes" - that would be easy. The problem here is that the data that needs to be inserted at the top of the file is variable.
IMPORTANT NOTE 2: - A few people have asked about / mentioned simply keeping the data in memory and then replacing it... however, that's completely out of the question. The reason this process is being updated is that it sometimes crashes when loading a few gigs into memory.
If you can, you should insert a placeholder which you overwrite at the end with the actual number padded with spaces.
If that is not an option, write your data to a cache file first. When you know the actual number, create the output file and append the data from the cache.
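A minimal sketch of that cache-file approach (the file names and the record source are assumptions):

int accountCount = 0;
string cachePath = Path.GetTempFileName();

// Pass 1: stream the records to a cache file, counting as we go.
using (StreamWriter cache = new StreamWriter(cachePath))
{
    foreach (string record in ExtractRecords()) // hypothetical record source
    {
        cache.WriteLine(record);
        accountCount++;
    }
}

// Pass 2: write the count line first, then append the cached data.
using (StreamWriter output = new StreamWriter("extract.txt"))
{
    output.WriteLine(accountCount);
    output.Flush();
    using (FileStream cached = File.OpenRead(cachePath))
        cached.CopyTo(output.BaseStream);
}
File.Delete(cachePath);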
BEST is very subjective. For any smallish file, you can easily open the entire file in memory and replace what you want using a string replace and then re-write the file.
Even for largish files, it would not be that hard to load into memory. In the days of multi-gigs of memory, I would consider hundreds of megabytes to still be easily done in memory.
Have you tested this naive approach? Have you seen a real issue with it?
If this is a really large file (gigabytes in size), I would consider writing all of the data first to a temp file and then write the correct file with the header line going in first and then appending the rest of the data. Since it is only text, I would probably just shell out to DOS:
TYPE temp.txt >> outfile.txt
I do not need to replace a "fixed amount of bytes"
Are you sure?
If you write a big number to the first line of the file (UInt32.MaxValue or UInt64.MaxValue), then when you find the correct actual number, you can replace that number of bytes with the correct number, left-padded with zeros so that it's still a valid integer.
e.g.
Replace 999999 - your "large number placeholder"
With 000100 - the actual number of accounts
That seems right to me, if I understand the question correctly?
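A sketch of overwriting such a fixed-width placeholder in place (the file name and width are assumptions):

long actualCount = 100; // the number discovered at the end of the extract
const int Width = 6;    // must match the width of the placeholder (999999)

using (FileStream fs = new FileStream("extract.txt", FileMode.Open, FileAccess.ReadWrite))
{
    // The zero-padded count has the same byte length as the placeholder,
    // so nothing after the first line moves.
    byte[] bytes = Encoding.ASCII.GetBytes(actualCount.ToString().PadLeft(Width, '0'));
    fs.Seek(0, SeekOrigin.Begin);
    fs.Write(bytes, 0, bytes.Length);
}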
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
How about placing a token {UserCount} at the top of the file when it is first created.
Then use a TextReader to read the file line by line. If it is the first line, look for {UserCount} and replace it with your value. Write out each line you read using a TextWriter.
Example:
int lineNumber = 1;
int userCount = 1234;
string line = null;

using (TextReader tr = File.OpenText("OriginalFile"))
using (TextWriter tw = File.CreateText("ResultFile"))
{
    while ((line = tr.ReadLine()) != null)
    {
        if (lineNumber == 1)
        {
            line = line.Replace("{UserCount}", userCount.ToString());
        }
        tw.WriteLine(line);
        lineNumber++;
    }
}
If the extracted file is only a few hundred megabytes, then you can easily keep all of the text in-memory until the extraction is complete. Then, you can write your output file as the last operation, starting with the record count.
OK, earlier I suggested an approach that would be better when dealing with existing files.
However, in your situation you want to create the file and, during the create process, go back to the top and write out the user count. This will do just that.
Here is one way to do it that avoids having to write a temporary file.
private void WriteUsers()
{
    string userCountString = null;
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] userCountBytes = null;
    int userCounter = 0;

    using (StreamWriter sw = File.CreateText("myfile.txt"))
    {
        // Reserve a line of spaces that will later hold our user count.
        // It must be at least as long as the text we overwrite it with.
        sw.WriteLine(new string(' ', 20));

        // Write out the records and keep track of the count.
        for (int i = 1; i < 100; i++)
        {
            sw.WriteLine("User" + i);
            userCounter++;
        }

        // Flush the writer's buffer before touching the underlying stream,
        // then rewind and overwrite the reserved first line.
        sw.Flush();
        sw.BaseStream.Position = 0;
        userCountString = "User Count: " + userCounter;
        userCountBytes = enc.GetBytes(userCountString);
        sw.BaseStream.Write(userCountBytes, 0, userCountBytes.Length);
    }
}

.NET C# - Random access in text files - no easy way?

I've got a text file that contains several 'records' inside of it. Each record contains a name and a collection of numbers as data.
I'm trying to build a class that will read through the file, present only the names of all the records, and then allow the user to select which record data he/she wants.
The first time I go through the file, I only read header names, but I can keep track of the 'position' in the file where the header is. I need random access to the text file to seek to the beginning of each record after a user asks for it.
I have to do it this way because the file is too large to be read in completely in memory (1GB+) with the other memory demands of the application.
I've tried using the .NET StreamReader class to accomplish this (it provides very easy-to-use ReadLine functionality), but there is no way to capture the true position of the file (the position in the BaseStream property is skewed due to the buffer the class uses).
Is there no easy way to do this in .NET?
There are some good answers provided, but I couldn't find any source code that would work in my very simplistic case. Here it is, in the hope that it'll save someone else the hour I spent searching around.
The "very simplistic case" I refer to is: the text encoding is fixed-width, and the line-ending characters are the same throughout the file. This code works well in my case (where I'm parsing a log file and sometimes have to seek ahead in the file, then come back). I implemented just enough to do what I needed (e.g. only one constructor, and only overriding ReadLine()), so most likely you'll need to add code... but I think it's a reasonable starting point.
public class PositionableStreamReader : StreamReader
{
    public PositionableStreamReader(string path)
        : base(path)
    {}

    private int myLineEndingCharacterLength = Environment.NewLine.Length;

    public int LineEndingCharacterLength
    {
        get { return myLineEndingCharacterLength; }
        set { myLineEndingCharacterLength = value; }
    }

    public override string ReadLine()
    {
        string line = base.ReadLine();
        if (null != line)
            myStreamPosition += line.Length + myLineEndingCharacterLength;
        return line;
    }

    private long myStreamPosition = 0;

    public long Position
    {
        get { return myStreamPosition; }
        set
        {
            myStreamPosition = value;
            this.BaseStream.Position = value;
            this.DiscardBufferedData();
        }
    }
}
Here's an example of how to use the PositionableStreamReader:
PositionableStreamReader sr = new PositionableStreamReader("somepath.txt");

// read some lines
while (something)
    sr.ReadLine();

// bookmark the current position
long streamPosition = sr.Position;

// read some lines
while (something)
    sr.ReadLine();

// go back to the bookmarked position
sr.Position = streamPosition;

// read some lines
while (something)
    sr.ReadLine();
FileStream has the Seek() method.
You can use a System.IO.FileStream instead of a StreamReader. If you know exactly what the file contains (the encoding, for example), you can do all the same operations as with StreamReader.
If you're flexible with how the data file is written and don't mind it being a little less text editor-friendly, you could write your records with a BinaryWriter:
using (BinaryWriter writer =
    new BinaryWriter(File.Open("data.txt", FileMode.Create)))
{
    writer.Write("one,1,1,1,1");
    writer.Write("two,2,2,2,2");
    writer.Write("three,3,3,3,3");
}
Then, initially reading each record is simple because you can use the BinaryReader's ReadString method:
using (BinaryReader reader = new BinaryReader(File.OpenRead("data.txt")))
{
    string line = null;
    long position = reader.BaseStream.Position;
    while (reader.PeekChar() > -1)
    {
        line = reader.ReadString();
        // parse the name out of the line here...
        Console.WriteLine("{0},{1}", position, line);
        position = reader.BaseStream.Position;
    }
}
The BinaryReader isn't buffered so you get the proper position to store and use later. The only hassle is parsing the name out of the line, which you may have to do with a StreamReader anyway.
Is the encoding a fixed-size one (e.g. ASCII or UCS-2)? If so, you could keep track of the character index (based on the number of characters you've seen) and find the binary index based on that.
Otherwise, no - you'd basically need to write your own StreamReader implementation which lets you peek at the binary index. It's a shame that StreamReader doesn't implement this, I agree.
I think the FileHelpers library's runtime records feature might help you: http://filehelpers.sourceforge.net/runtime_classes.html
A couple of items that may be of interest:
1) If the lines are a fixed number of characters in length, that is not necessarily useful information if the character set has variable-size characters (like UTF-8). So check your character set.
2) You can ascertain the exact position of the file cursor from StreamReader by using the BaseStream.Position value IF you Flush() the buffers first (which will force the current position to be where the next read will begin - one byte after the last byte read).
3) If you know in advance that the exact length of each record will be the same number of characters, and the character set uses fixed-width characters (so each line is the same number of bytes long), then you can use FileStream with a fixed buffer size matching the size of a line, and the position of the cursor at the end of each read will be, perforce, the beginning of the next line.
4) Is there any particular reason why, if the lines are the same length (assuming bytes here), you don't simply use line numbers and calculate the byte offset in the file from line size x line number? (See the sketch after this list.)
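A sketch of that offset arithmetic for point 4 (the file name, line length, and encoding are assumptions):

int lineNumber = 1000;   // zero-based index of the line to jump to
int bytesPerLine = 64;   // fixed record length in bytes, including the line terminator

using (FileStream fs = File.OpenRead("records.txt"))
{
    // When every line is the same length, line N starts at N * bytesPerLine.
    fs.Seek((long)lineNumber * bytesPerLine, SeekOrigin.Begin);
    byte[] buffer = new byte[bytesPerLine];
    int read = fs.Read(buffer, 0, buffer.Length);
    string line = Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
}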
Are you sure that the file is "too large"? Have you tried it that way and has it caused a problem?
If you allocate a large amount of memory, and you aren't using it right now, Windows will just swap it out to disk. Hence, by accessing it from "memory", you will have accomplished what you want -- random access to the file on disk.
This exact question was asked in 2006 here: http://www.devnewsgroups.net/group/microsoft.public.dotnet.framework/topic40275.aspx
Summary:
"The problem is that the StreamReader buffers data, so the value returned in
BaseStream.Position property is always ahead of the actual processed line."
However, "if the file is encoded in a text encoding which is fixed-width, you could keep track of how much text has been read and multiply that by the width"
and if not, you can just use the FileStream and read a char at a time and then the BaseStream.Position property should be correct
Starting with .NET 6, the methods in the System.IO.RandomAccess class are the official and supported way to randomly read and write a file. These APIs work with Microsoft.Win32.SafeHandles.SafeFileHandle, which can be obtained with the new System.IO.File.OpenHandle function, also introduced in .NET 6.
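For example (a minimal sketch; the file name and offsets are assumptions):

using Microsoft.Win32.SafeHandles;

// Read 100 bytes starting at byte offset 4096 - no stream, no Seek().
using SafeFileHandle handle = File.OpenHandle("data.txt");
byte[] buffer = new byte[100];
int bytesRead = RandomAccess.Read(handle, buffer, fileOffset: 4096);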
