I have a requirement for a low memory footprint while processing XML files upwards of 400-500 MB in size. This means that I can have the file loaded in memory only once at any point in time (e.g. in a string object). The data structure is such that the elements nest only a few levels deep, but are many in number (i.e. many rows of data, grouped into only a couple of levels).
During processing, I need to forward some of the data directly (i.e. exactly as read from the file, Unicode character for character) to another stream. In other parts of the file, I need to remove/add information (usually in the form of attribute values) and possibly forward the result in a byte-consistent way to another stream (i.e. removing or adding data the same way will always produce the same result).
I've looked into XmlReader and XmlTextReader but they don't provide a way to get the exact text of the node that was Read(). Am I missing something?
I know this sounds kind of confusing, but I was wondering if there is a way to maintain the structure of the file while editing it, even if that means adding data at some part of the file or editing a value at a certain position.
What I do right now to edit binary files is to code the parser with the BinaryReader class (in C#), reading a certain structure with reader.ReadSingle, ReadInt32, and so on.
Then I write the exact same thing with BinaryWriter, which seems kind of inefficient, and I might make mistakes that introduce differences between the reader and the writer, making the format inconsistent.
Is there any way to define the file structure once and have the whole reading and writing process driven by that single format definition? Or to open a file, edit some of its values (or add new ones, since it's not a fixed format, so reading it implies some for loops, for example), and save those changes?
I hope I explained myself in a reasonably understandable way.
If you want to insert new data into a binary file, you have three options:
Move everything from that point forward down a bit so that you make space for the new data.
Somehow mark the existing data as no longer relevant (i.e. a deleted flag), and add the new data at the end of the file.
Replace the existing data with a pointer to another location in the file (typically the end of the file) where the new data is stored.
The first method requires rewriting the entire file.
The second method can work well if it's a file of records, for example, and if you don't depend on the order of records in the file. It becomes more difficult if the file has a complex structure of nested records, etc. It has the drawback of leaving a lot of empty space in the file.
The third method is similar to the second, but works well if you're using random access rather than sequential access. It still ends up wasting space in the file.
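To make the first option concrete, here is a minimal sketch (mine, not from the answer; the path, offset, and in-memory buffering of the tail are placeholder choices, and a very large file would need a temporary file instead):

```csharp
// A hedged sketch of option 1: insert bytes at an offset by rewriting the
// tail of the file. The tail is buffered in memory for simplicity.
using System.IO;

static class BinaryInsert
{
    public static void InsertBytes(string path, long offset, byte[] newData)
    {
        using var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite);

        // Read everything after the insertion point.
        fs.Seek(offset, SeekOrigin.Begin);
        var tail = new byte[fs.Length - offset];
        int read = 0;
        while (read < tail.Length)
        {
            int n = fs.Read(tail, read, tail.Length - read);
            if (n == 0) break;
            read += n;
        }

        // Write the new data at the insertion point, then re-append the tail.
        fs.Seek(offset, SeekOrigin.Begin);
        fs.Write(newData, 0, newData.Length);
        fs.Write(tail, 0, read);
    }
}
```

This is exactly why the first option amounts to rewriting the file from the insertion point onwards: every byte after the offset has to be moved.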
I have a C# tool that parses a collection of csv files to construct a List<MyObject>. This collection can be as small as 20 files or as large as 10000+ files. MyObject itself has about 20 properties, most of them strings. Each file can create sometimes up to 4 items in the list and sometimes as many as 300.
After the parsing is done, I first save the list to a csv file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and there are multiple pivots of the dataset the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally the user can save all of this information to another csv file.
I ran into OOMs (out-of-memory exceptions) when the files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the csv file, which was sometimes close to 255 characters. I changed it to only save the filename and things improved slightly. I then discovered a suggestion to compile to x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to this data set.
Some of the options I've considered are:
When parsing the files, save to the intermediate.csv file after each file is parsed and don't keep the list in memory. This will work for me to avoid hitting an OOM before I even get to save the intermediate.csv file.
The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
Some of the properties on MyObject are similar for a whole collection of files. So I've considered refactoring the single object into multiple objects, which should reduce the number of items in the List object: essentially refactoring to a List<MyTopLevelDetailsObject>, with MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should, in theory, be reduced. I can then output this to csv by doing some translation to make it appear like a single object. (A rough sketch of this shape appears right after this list of options.)
Move the data to a db like MongoDB internally and load the data to summarize to the db logic.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also requiring me to learn MongoDB. :)
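For option 2, the rough shape I have in mind looks something like the sketch below (the property names here are invented for illustration; only MyObject and MyTopLevelDetailsObject come from the description above):

```csharp
// A hedged sketch of the refactored shape for option 2.
using System.Collections.Generic;

class MyTopLevelDetailsObject
{
    // Properties shared by every row that came from the same file,
    // stored once instead of being repeated on each MyObject.
    public string FileName { get; set; }
    public string Category { get; set; }   // hypothetical shared property

    public List<MyObject> Items { get; } = new List<MyObject>();
}

class MyObject
{
    // Only the per-row properties remain here (~20 in the real code,
    // mostly strings).
    public string Value { get; set; }
    public string Annotation { get; set; }
}
```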
I'm looking for some guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
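As a hedged illustration of the SQLite route (assuming the Microsoft.Data.Sqlite package; the database file, table, and column names are all made up), parsed rows could be streamed straight into a database file instead of being held in one big list:

```csharp
// A minimal sketch: write parsed rows to SQLite as they arrive so the
// full data set never has to live in memory.
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

static class SqliteSink
{
    public static void Save(string dbPath, IEnumerable<(string File, string Value, string Note)> rows)
    {
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        using (var create = connection.CreateCommand())
        {
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS items (file_name TEXT, value TEXT, annotation TEXT)";
            create.ExecuteNonQuery();
        }

        // One transaction per batch keeps the inserts fast.
        using var tx = connection.BeginTransaction();
        using var insert = connection.CreateCommand();
        insert.Transaction = tx;
        insert.CommandText =
            "INSERT INTO items (file_name, value, annotation) VALUES ($file, $value, $note)";
        var pFile = insert.Parameters.Add("$file", SqliteType.Text);
        var pValue = insert.Parameters.Add("$value", SqliteType.Text);
        var pNote = insert.Parameters.Add("$note", SqliteType.Text);

        foreach (var (file, value, note) in rows)
        {
            pFile.Value = file;
            pValue.Value = value ?? "";
            pNote.Value = note ?? "";
            insert.ExecuteNonQuery();
        }
        tx.Commit();
    }
}
```

The pivots and summaries can then be expressed as SQL queries, so only the query results need to be materialized in the WPF layer.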
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider loading the data dynamically. The user probably can't see all of the data at one moment in time, so you can load a part of the .csv into memory and show it to the user; then, if the user makes some annotations/edits, you save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is a common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software to operate on a huge database (tables with more than 10M rows; they can't fit into a mediocre office PC's memory).
And yes, it's too much work to do this with .csv manually. It's easier to use some database to handle the saving of edited parts and the composition of the final output.
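If you do stay with plain .csv, the chunked loading can be as simple as the sketch below (mine, not from the answer; it assumes a header row, and note that Skip still reads through the preceding lines each time, so it trades CPU for memory):

```csharp
// A hedged sketch of paging through a large .csv on demand rather than
// loading it all at once.
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvPager
{
    // Returns one "page" of raw lines; File.ReadLines streams lazily,
    // so only the requested page is materialized in memory.
    public static List<string> LoadPage(string csvPath, int pageIndex, int pageSize)
    {
        return File.ReadLines(csvPath)
                   .Skip(1)                       // skip the assumed header row
                   .Skip(pageIndex * pageSize)
                   .Take(pageSize)
                   .ToList();
    }
}
```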
I am starting to work on an application (in C#) which shall convert different file types from one to another. The files are all GPS track files of some kind, e.g. gpx or tcx files, or also less widely distributed formats. Common to all is that they are basically some kind of XML format, or in some cases less structured text/csv files. The problem - but also the reason for writing this - is that the information/data within the files is stored quite differently; e.g. in gpx the position data (lon/lat) of a point is stored as an attribute, while in tcx it is attached to the point as a subnode. Another example is that in some formats the distances are stored as absolute values, in others as relative ones. Further, I want to handle a large variety of information here, e.g. power, climbing speeds, distances up/downhill or power zones, just to name a few. The most complete format, for example, stores about 30 values per track point. It also seems notable that for some formats, some of the values stored in others first have to be derived by calculation, e.g. the relative distance between two geoposition entries.
Now for my question: I wonder what would be the best approach to deal with the described task?
My approach so far is to use a DataSet as internal storage, read the file I want to import into another (source) DataSet using standard XML readers or text readers, and then fill the target (internal) DataSet by going through the source DataSet row by row, table by table. I would therefore write a variety of functions to do all the necessary calculations to transfer the values from source to target as required by my internal DataSet's structure.
For writing the data to a file of any format, I would then more or less reverse that approach, perhaps limited to the data fields truly relevant for a certain file type.
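To make the attribute-vs-subnode difference concrete, here is a small sketch (mine, not part of the approach above) of reading both layouts into one internal point type; the element names are written from memory of the gpx/tcx formats, namespaces are ignored for brevity, and only lat/lon out of the ~30 possible values are shown. The same idea applies whether the internal store is a DataSet or a plain class.

```csharp
// A hedged sketch: gpx keeps lat/lon as attributes of <trkpt>, tcx keeps
// them as child elements under <Position>.
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

record TrackPoint(double Latitude, double Longitude);

static class TrackReaders
{
    public static IEnumerable<TrackPoint> FromGpx(XDocument doc) =>
        doc.Descendants().Where(e => e.Name.LocalName == "trkpt")
           .Select(e => new TrackPoint(
               (double)e.Attribute("lat"),
               (double)e.Attribute("lon")));

    public static IEnumerable<TrackPoint> FromTcx(XDocument doc) =>
        doc.Descendants().Where(e => e.Name.LocalName == "Trackpoint")
           // Skip track points that carry no position information.
           .SelectMany(e => e.Elements().Where(p => p.Name.LocalName == "Position"))
           .Select(p => new TrackPoint(
               (double)p.Elements().First(c => c.Name.LocalName == "LatitudeDegrees"),
               (double)p.Elements().First(c => c.Name.LocalName == "LongitudeDegrees")));
}
```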
Is that the proper way to achieve what I'm aiming at? Is there another way to do it that I am not aware of?
Any hint is very much appreciated. I would like to make sure to use a suitable approach from the start, since going through every data value requires a lot of calculations to be programmed, even if each single one is rather small/simple.
Thanks for your thoughts!
I am trying to do a merge sort on sorted chunks of XML files on disk. There is no chance that they all fit in memory. My XML files consist of records.
Say I have n XML files. If I had enough memory, I would read the entire contents of each file into a corresponding Queue, one queue for each file, compare the timestamp on each item in each queue, and output the one with the smallest timestamp to another file (the merge file). This way, I merge all the little files into one big file with all the entries time-sorted.
The problem is that I don't have enough memory to read all the XML with .ReadToEnd and later pass it to the .Parse method of XDocument.
Is there a clean way to read just enough records to keep each of the Queues filled for the next pass that compares their XElement attribute "TimeStamp", remembering which XElement from disk it has already read?
Thank you.
An XmlReader is what you are looking for.
Represents a reader that provides fast, non-cached, forward-only access to XML data.
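For example, here is a minimal sketch (mine, not from the answer) of streaming one record at a time with XmlReader plus XNode.ReadFrom, so each file only ever contributes its current head element to memory; the element name "Record" is an assumption, while the "TimeStamp" attribute comes from the question.

```csharp
// A hedged sketch: lazily yield one record at a time from a large XML file.
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class RecordStream
{
    public static IEnumerable<XElement> Read(string path)
    {
        using var reader = XmlReader.Create(path);
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
            {
                // Materializes just this one element and advances the
                // reader past it.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

The merge itself then needs only one enumerator per file instead of a full queue: keep each stream's current head, repeatedly write out the head with the smallest (DateTime)head.Attribute("TimeStamp") to an XmlWriter on the output file, and advance only the stream you just emitted from.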
So it has fallen out of fashion, but this is exactly the problem solved by SAX. It is the Simple API for XML and is based on callbacks. You launch a read operation, and your code gets called back for each record. This may be an option, as it does not require the program to load the entire XML file (a la XmlDocument). Google SAX.
If you like the LINQ to XML API, this CodePlex project may suit your needs.
I want to use the powerful DataContractSerializer to write or read data to the XML file.
But as far as I understand it, DataContractSerializer can only read or write an entire structure or list of structures.
My use case is described below... I cannot figure out how to optimize performance using this API.
I have a structure named "Information" and a List<Information> with an unpredictable number of elements in this list.
The user may update or add new elements to this list very often.
Per operation (Add or Update), I must serialize all the elements in the list to the same XML file.
So I will write the same data into the XML again even when it has not been modified. It does not make sense, but I cannot find any approach to avoid this happening.
Due to the tombstoning mechanism, I must save all the information within 10 seconds.
I'm afraid of the performance cost, which may make the UI lag...
Is there any workaround to partially update or add an information item to the XML file using DataContractSerializer?
DataContractSerializer can be used to serialize selected items - what you need to do is come up with a scheme to identify changed data and a way to serialize it efficiently. For example, one way could be:
Start by serializing the entire list of structures to a file.
Whenever some object is added/updated/removed from the list, create a diff object that identifies the kind of change and the object changed. Then serialize this diff object to XML and append that XML to the file.
While reading the file, you have to apply similar logic: first read the list and then apply the diffs one after another.
Because you want to continuously append to the file, you shouldn't have a root element in it. In other words, the file with the diff info will not be a valid XML document; it will contain a series of XML fragments. To read it, you have to enclose these fragments in an XML declaration and a root element.
You may use some background task to periodically write out the entire list and generate a valid XML file; at that point, you can discard your diff file. The idea is to mimic a transactional system - one data structure holding the serialized/saved info and another structure containing the changes (akin to a transaction log).
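A hedged sketch of the append-a-diff idea follows. The Diff type, its members, and the file name are my own inventions; Information is the structure from the question and is assumed to be a data contract; and instead of hand-wrapping the fragments in a root element, this version reads them back with ConformanceLevel.Fragment, which is one alternative way to handle the not-a-valid-document issue.

```csharp
// A minimal sketch of a diff log: each change is serialized as a standalone
// XML fragment appended to a file and read back with a fragment-tolerant reader.
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Xml;

[DataContract]
public class Diff
{
    [DataMember] public string Kind { get; set; }       // e.g. "Add", "Update", "Remove"
    [DataMember] public Information Item { get; set; }  // Information is the question's type
}

static class DiffLog
{
    static readonly DataContractSerializer serializer =
        new DataContractSerializer(typeof(Diff));

    public static void Append(string path, Diff diff)
    {
        using var stream = new FileStream(path, FileMode.Append);
        using var writer = XmlWriter.Create(stream,
            new XmlWriterSettings { ConformanceLevel = ConformanceLevel.Fragment });
        serializer.WriteObject(writer, diff);
    }

    public static IEnumerable<Diff> ReadAll(string path)
    {
        using var stream = File.OpenRead(path);
        using var reader = XmlReader.Create(stream,
            new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment });
        // MoveToContent skips whitespace between fragments and returns None at EOF.
        while (reader.MoveToContent() == XmlNodeType.Element)
            yield return (Diff)serializer.ReadObject(reader);
    }
}
```

On startup (or when the background task runs), read the last full snapshot, replay the diffs from ReadAll in order, write a fresh snapshot, and truncate the diff file.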
If performance is a concern, then consider using something other than DataContractSerializer.
There is a good comparison of the options at http://blogs.claritycon.com/kevinmarshall/2010/11/03/wp7-serialization-comparison/
If the size of the list is a concern, you could try breaking it into smaller lists. The most appropriate way to do this will depend on the data in your list and typical usage/edit/addition patterns.
Depending on the frequency with which the data changes, you could try saving it whenever it changes. This would remove the need to save it all in the time available for deactivation.