Loading multiple xml files with c# XmlDocument leading to memory leakage - c#

I am having serious troubles with processing a big number of xml files using XmlDocument. The idea is to extract from about 5000 .xml docs (approx. 20 MB each) a certain data, which is saved in text format and then imported into MySQL DB. This task is supposed to be done every day.
My problem is that after processing of each xml file, the system memory is not releasing it. So, all the documents are pilling up, until all the RAM is occupied and the application starts to run very slowly (once the hard drive starts helping the system memory).
I am using already created source code, so it is not possible to change to other classes like XmlReader and so on, so I am stuck with XmlDocument.
The function for xml loading is called like this:
foreach (string s in xmlFileNames)
{
i++;
if (mytest.LoadXml(s))
mytest.loadToExchangeTables();
}
The function looks like this:
public bool LoadXml(string fileName)
{
XmlDocument myXml = new XmlDocument();
myXml.Load(fileName);
.............
//searching for needed data
.............
}
Any ideas what might be the problem? And why garbage collection is not done?
Thank you very much in advance!

Try to comment that part with // searching for needed data and run test once again, it may be so you don't free something IDisposable (with the use of using or directly), same goes for loadToExchangeTables().

Related

how to read a file in batches/parts/1000 lines at a time

I am currently trying to make a method that can handle reading in large XML files. All i need is a method that would loading in say 1000 lines at a time or in small batches.
I have been looking at streamreaders, xmlreaders and filestreams, I have seen some mentions of just keeping the stream open while processing data to get what i need but i cant seem to get my head round it.
I have spent a long time checking the similar questions but can seem to find anything that will help me.
ps. first thought i was thinking of doing a for loop around the readline to a counter of 1000 but cant seem to figure out how to continue from that 1000 lines to reading another 1000 etc until the end of the file.
My feeling is that his will require a custom XML reader implementation.
For example - if your structure looks something like:
root
item
stuff
/item
item
stuff
/item
item
stuff
/item
item
stuff
/item
/root
You'll have to write code that reads a number 'item' blocks (as many as yuo wish to process in a batch), and then converts them into a valid XML doc for further processing.
If, however, your XML doc is one massive sprawling entity - I don't think there's any elegant way you can process it piece-meal.

Write as building table in HttpHandler?

I have to export very large files as an "excel export". Since .NET can't export excel files, I went with simple html tables.
It works fine, but it's slow.
Is it possible to context.response.write each line as they're being created instead of building some super huge string and trying to export the whole thing once it's done?
I could care less what function is used to do this, but I hope you know what I mean. I don't want to build a string into memory and then try to send it all at once. I'd rather export as I build the table.
Is this possible?
Thanks in advance!
Yes, using context.Response.Write on each line is just fine. If the reason for not wanting to build a large string is server memory use, then you'll need to turn off response buffering like so:
context.Response.BufferOutput = false;
Otherwise, .NET will just buffer your writes in memory until the end, anyway.
If the reason is execution time, then you may be experiencing performance hits from multiple string concatenations. In that case, you could use the StringBuilder class to construct the table instead.
For example...
StringBuilder sb = new StringBuilder();
for each (<row in database>) {
sb.AppendLine(<current table row>);
}
context.Response.Write(sb.ToString());
More info on response buffering:
http://msdn.microsoft.com/en-us/library/system.web.httpresponse.bufferoutput.aspx
More info on the StringBuilder class:
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder(v=vs.110).aspx

Handling strings more than 2 GB

I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for XML is over 2 GB and I get a Out of Memory exception. I understand that the limit for CLR objects is 2 GB. But in my case I need to handle this scenario. Presently I just message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just a jist of the operation I need to do on the generated XML.
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While the XMLReader stream is a good idea, I cannot perform these operations by that method. While data validation can be done by Excel itself, the other things cannot be done here.
Using XMLTextReader and XMLTextWriter and creating a custom method for each of the step is a solution I had thought of. But to go through the jist above, it requires the XML document to be gone through or processed 4 times. This is just not efficient.
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures, you'll probably be better off implementing a custom XMLReader (or XMLWriter) which works at the stream level. See this question for some similar advice - What is the best way to parse large XML (size of 1GB) in C#?
I guess there is no other way then using x64-OS and FX if you really need to hold the whole thing in RAM, but using some other way to process the data like suggested by Stuart may is the better way to go...
What you need to do is to use "stream chaining", i.e. you open up an input stream which reads from your excel file and an output stream that writes to your xml file. Then your conversion class/method will take the two streams as input and read sufficient data from the input stream to be able to write to the output.
Edit: very simple minimal Example
Converting from file:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
using(StreamReader from= new StreamReader(fromStream))
using(StreamWriter to = new StreamWriter(toStream))
{
to.WriteLine("<List>");
while(!from.EndOfStream)
{
string bulk = from.ReadLine(); //in this case, a single line is sufficient
//some code to parse the bulk or clean it up, e.g. remove '\r\n'
to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
}
to.WriteLine("</List>");
}
}
Convert(File.OpenRead("source.xls"), File.OpenWrite("source.xml"));
Of course you could do this in much more elegent, abstract manner but this is only to show my point

Reading a xml file multithreaded

I've searched a lot but I couldn't find a propper solution for my problem. I wrote a xml file containing all episode information of a TV-Show. It's 38 kb and contains attributes and strings for about 680 variables. At first I simply read it with the help of XMLTextReader which worked fine with my quadcore. But my wifes five year old laptop took about 30 seconds to read it. So I thought about multithreading but I get an exception because the file is already opened.
Thread start looks like this
while (reader.Read())
{
...
else if (reader.NodeType == XmlNodeType.Element)
{
if (reader.Name.Equals("Season1"))
{
current.seasonNr = 0;
current.currentSeason = season[0];
current.reader = reader;
seasonThread[0].Start(current);
}
else if (reader.Name.Equals("Season2"))
{
current.seasonNr = 1;
current.currentSeason = season[1];
current.reader = reader;
seasonThread[1].Start(current);
}
And the parsing method like this
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
reader.MoveToFirstAttribute();
tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position. But I also have no clue if this could work at all.
Please help!
EDIT:
Guys where did I wrote about IE?? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT2:
I did some stopwatch research and figured out that parsing the xml file only takes about 200ms on my PC and 800ms on my wifes laptop. Is it WPF beeing so slow? What can I do?
I agree with most everyone's comments. Reading a 38Kb file should not take so long. Do you have something else running on the machine, antivirus / etc, that could be interfering with the processing?
The amount of time it would take you to create a thread will be far greater than the amount of time spent reading the file. If you could post the actual code used to read the file and the file itself, it might help analyze performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file, you need to know everything that comes before it, if nothing else, to know at what level you are.
Your code, if tit worked, would do something like this:
main season1 season2
read
read
skip read
skip read
read
skip read
skip read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an .xml database.
If correct, I would recommend breaking your Xml into different .xml files, with an indexed .xml document. I would think you can then query - using Linq-2-Xml - a set of .xml data from a specific .xml source.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files and you would be able to, although highly discouraged, asynchronously load .xml document objects.
Your XML schema doesn't lend itself to parallelism since you seem to have node names (Season1, Season2) that contain the same data but must be parsed individually. You could redesign you schema to have the same node names (i.e. Season) and attributes that express the differences in the data (i.e. Number to indicate the season number). Then you can parallelize i.e. using Linq to XML and PLinq:
XDocument doc = XDocument.Load(#"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
.AsParallel()
.Select(x => new Season()
{
Number = (int)x.Attribute("Number"),
Descripton = x.Value
}).ToList();

getting element offset from XMLReader

how's everyone doing this morning?
I'm writing a program that will parse a(several) xml files.
This stage of the program is going to be focusing on adding/editing skills/schools/abilities/etc for a tabletop rpg (L5R). What I learn by this one example should carry me through the rest of the program.
So I've got the xml reading set up using XMLReader. The file I'm reading looks like...
<skills>
<skill>
<name>some name</name>
<description>a skill</description>
<type>high</type>
<stat>perception</stat>
<page>42</page>
<availability>all</availability>
</skill>
</skills>
I set up a Skill class, which holds the data, and a SkillEdit class which reads in the data, and will eventually have methods for editing and adding.
I'm currently able to read in everything right, but I had the thought that since description can vary in length, once I write the edit method the best way to ensure no data is overwritten would be to just append the edited skill to the end of the file and wipe out its previous entry.
In order for me to do that, I would need to know where skill's file offset is, and where /skill's file offset is. I can't seem to find any way of getting those offsets though.
Is there a way to do that, or can you guys suggest a better implementation for editing an already existing skill?
If you read your XML into LINQ to XML's XDocument (or XElement), everything could become very easy. You can read, edit, add stuff, etc. to XML files using a simple interface.
e.g.,
var xmlStr = #"<skills>
<skill>
<name>some name</name>
<description>a skill</description>
<type>high</type>
<stat>perception</stat>
<page>42</page>
<availability>all</availability>
</skill>
</skills>
";
var doc = XDocument.Parse(xmlStr);
// find the skill "some name"
var mySkill = doc
.Descendants("skill") // out of all skills
.Where(e => e.Element("name").Value == "some name") // that has the element name "some name"
.SingleOrDefault(); // select it
if (mySkill != null) // if found...
{
var skillType = mySkill.Element("type").Value; // read the type
var skillPage = (int)mySkill.Element("page"); // read the page (as an int)
mySkill.Element("description").Value = "an AWESOME skill"; // change the description
// etc...
}
No need to calculate offsets, manual, step-by-step reading or maintaining other state, it is all taken care of for you.
Don't do it! In general, you can't reliably know anything about physical offsets in the serialized XML because of possible character encoding differences, entity references, embedded comments and a host of other things that can cause the physical and logical layers to have a complex relationship.
If your XML is just sitting on the file system, your safest option is to have a method in your skill class which serializes to XML (you already have one to read XML already), and re-serialize whole objects when you need to.
Tyler,
Umm, sounds like you're suffering from a text-book case of premature optimization... Have you PROVEN that reading and writing the COMPLETE skill list to/from the xml-file is TOO slow? No? Well until it's been proven that there IS NO PERFORMANCE ISSUE, right? So we just write the simplest code that works (i.e. does what we want, without worrying too much about performance), and then move on directly to the next bit of trick functionality... testing as we go.
Iff (which is short for if-and-only-if) I had a PROVEN performance problem then-and-only-then I'd consider writing each skill to individual XML-file, to avert the necessisity for rewriting a potentially large list of skills each time a single skill was modified... But this is "reference data", right? I mean you wouldn't de/serialize your (volatile) game data to/from an XML file, would you? Because an RDBMS is known to be much better at that job, right? So you're NOT going to be rewriting this file often?
Cheers. Keith.

Categories