Does LINQ to XML load the whole XML document during a query? - c#

I have a large XML file which contains a database.
Its size is 400 MB.
It was created using LINQ itself, and that took only 10 minutes. Great result!
But reading a particular piece of information from that XML file using LINQ takes 20 minutes or more.
Just imagine: reading a small amount of information takes more time than writing a large amount!
During the read process I have to call XDocument.Load(@"C:\400mb.xml"), which is not IDisposable.
So it loads the whole XML document, and after it gets my small piece of information the memory is not released.
My goal is to read this:
XDocument XD1 = XDocument.Load(@"C:\400mb.xml");
string s1 = XD1.Root.Attribute("AnyAttribute").Value;
As you can see, I need to get an attribute of the root element.
That means the data I need might be on the first line of the XML file, so the query should be very quick.
But instead it loads the whole document and only then returns that information.
So the question is: how can I read that small amount of information from a large XML file?
Would the System.Threading.Tasks namespace be useful? Or asynchronous operations?
Or is there any technique that can work on the XML file as if it were a binary file?
I don't know. Please help!

XDocument.Load is not the best approach here, because XDocument.Load loads the whole file into memory. According to MSDN, memory usage will be proportional to the size of the file. If you are just planning to search the XML document, use XmlReader instead; see the XmlReader documentation on MSDN.
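For the specific goal above (reading one attribute of the root element), a streaming XmlReader only has to parse as far as the first start tag. A minimal sketch, assuming the attribute really does sit on the root element as in the question:

using System;
using System.Xml;

class ReadRootAttribute
{
    static void Main()
    {
        // Stream the file instead of loading it all with XDocument.Load.
        using (XmlReader reader = XmlReader.Create(@"C:\400mb.xml"))
        {
            reader.MoveToContent();                          // positions the reader on the root element
            string s1 = reader.GetAttribute("AnyAttribute"); // null if the attribute is missing
            Console.WriteLine(s1);
        }
    }
}

Because the reader stops as soon as it has the attribute, only the start of the file is parsed; it does not need to walk the remaining 400 MB.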

Related

The best way to save a huge amount of forex financial tick data

I have a lot of forex tick data to save. My question is: what is the best way?
Here is an example: I collect only 1 month of data for the EURUSD pair. It originally comes in a CSV file that is 136 MB and has 2465671 rows. I use the library from http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader, and it took around 30 seconds to read all the ticks and store them in 2465671 objects. First of all, is that fast enough?
Secondly, is there anything better than CSV? For example, a binary file might be faster; and do you have a recommendation for a database? I tried db4o but it was not very impressive; I think there is some overhead in saving the data as object properties when 2465671 objects have to be saved in db4o's .yap file.
I've thought about this before, and if I were collecting this data, I would break up the process:
Collect data from the feed, form a line (I'd use fixed width), and append it to a text file.
I would create a new text file every minute and name it something like rawdata.yymmddhhmm.txt
Then I would have another process working in the background reading these files and pushing them into a database via a parameterized insert query.
I would probably use text over a binary file because I know that would append without any problems, but I'd also look into opening a binary file for append as well. This might actually be a bit better.
Also, you want to open the file in append mode since that's the fastest way to write to a file. This will obviously need to be super fast.
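A minimal sketch of the append step; the tick fields (time, bid, ask) and their widths are illustrative, since the question doesn't give the exact layout:

using System;
using System.IO;

class TickWriter
{
    public static void AppendTick(DateTime time, decimal bid, decimal ask)
    {
        // One file per minute: rawdata.yymmddhhmm.txt
        string path = string.Format("rawdata.{0:yyMMddHHmm}.txt", time);

        // Fixed-width line: timestamp plus bid/ask padded to known widths.
        string line = string.Format("{0:yyyyMMddHHmmssfff}{1,12:F5}{2,12:F5}", time, bid, ask);

        // FileMode.Append opens (or creates) the file and always writes at the end.
        using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var writer = new StreamWriter(stream))
        {
            writer.WriteLine(line);
        }
    }
}

The collector stays fast because it only ever appends; the background importer can safely pick up any minute file that is no longer being written to.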
Perhaps look at this product:
http://kx.com/kdb+.php
It seems to be made for that purpose.
One way to save data space (and hopefully time) is to save numbers as numbers and not as text, which is what CSV does.
You could make an object out of each row and then make reading and writing each object a serialization problem, which C# has good support for.
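For example, a plain binary layout with BinaryWriter/BinaryReader stores each number in its native width instead of as text. The Tick fields below are illustrative, not taken from the question:

using System.IO;

// Illustrative tick record; the real fields depend on the CSV layout.
struct Tick
{
    public long TimestampTicks;
    public double Bid;
    public double Ask;

    // Numbers are written as numbers (8 bytes each), not as text.
    public void Write(BinaryWriter writer)
    {
        writer.Write(TimestampTicks);
        writer.Write(Bid);
        writer.Write(Ask);
    }

    public static Tick Read(BinaryReader reader)
    {
        return new Tick
        {
            TimestampTicks = reader.ReadInt64(),
            Bid = reader.ReadDouble(),
            Ask = reader.ReadDouble()
        };
    }
}

That is 24 bytes per tick, versus the roughly 55 bytes per row implied by the 136 MB / 2465671-row CSV in the question, and no string parsing is needed when reading the data back.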
Kx's kdb database would be a great off-the-shelf package if you had a few million to spare. However, you could write your own column-oriented database to store and analyse high-frequency data for optimal performance.
I save terabytes as compressed binary files (GZIP) that I decompress on the fly using C#/.NET's built-in gzip compression/decompression streams.
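The built-in GZipStream covers both directions; a minimal sketch, with the file paths as placeholders:

using System.IO;
using System.IO.Compression;

static class GzipStorage
{
    // Compress a raw binary file to .gz on disk.
    public static void Compress(string inputPath, string gzPath)
    {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(gzPath))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }
    }

    // Open a compressed file for streaming reads without unpacking it to disk first.
    public static Stream OpenCompressed(string gzPath)
    {
        return new GZipStream(File.OpenRead(gzPath), CompressionMode.Decompress);
    }
}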
HDF5 is widely used for big data, including by some financial firms. Unlike kdb it's free to use, and there are plenty of libraries that sit on top of it, such as the .NET wrapper.
This SO question might help you get started.
HDF5 homepage

How expensive is XSD validation of XML?

I want to validate large XML files against XSD schemas in C#.
For a file of 1000 lines of XML, validation takes a long time.
Are there any tips and tricks to validate faster?
Can you post code examples that validate large XML faster?
Edit 1: I validate like this:
Validating XML with XSD
Edit 2: For large files it takes more than 10 seconds, and I need validation to be very fast, under a second.
Edit 3: The file size is greater than 10 MB.
Edit 4: I am also considering storing both the XML file and the XSD in a database.
You are currently loading the entire document into memory, which is expensive regardless of validation. A better option is to just parse via a reader, i.e. as shown here on MSDN. The key points from the example on that page:
it never loads the entire document
the while (reader.Read()) loop just enumerates the entire file at the node level
validation is enabled via the XmlReaderSettings
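A minimal sketch of that pattern (the schema and file names are placeholders):

using System;
using System.Xml;
using System.Xml.Schema;

class StreamingValidation
{
    static void Main()
    {
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, "schema.xsd");   // placeholder schema path
        settings.ValidationEventHandler += (sender, e) =>
            Console.WriteLine(e.Severity + ": " + e.Message);

        // The reader streams node by node; the document is never held in memory as a whole.
        using (XmlReader reader = XmlReader.Create("large.xml", settings))
        {
            while (reader.Read())
            {
                // Validation happens as each node is read.
            }
        }
    }
}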
It's reasonable to expect parsing a document with validation to take about twice as long as parsing without validation. But that ratio will vary a great deal depending on your schema. For example if every attribute is controlled by a regular expression, and the regex is complex, then the overhead of validation could be far higher than this rule-of-thumb suggests.
Also, this doesn't allow for the cost of building a complex schema. If you have a big schema defining hundreds of element types, compiling the schema could take longer than using it to validate a few megabytes of data.
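If schema compilation is the dominant cost, you can compile the XmlSchemaSet once and reuse it across validation runs; a sketch, with the schema path as a placeholder:

using System.Xml;
using System.Xml.Schema;

static class SchemaCache
{
    // Compile the schema set once and reuse it, so the compilation cost
    // is not paid again for every document that gets validated.
    public static readonly XmlSchemaSet Schemas = Build();

    private static XmlSchemaSet Build()
    {
        var set = new XmlSchemaSet();
        set.Add(null, "schema.xsd");   // placeholder path; target namespace is read from the schema
        set.Compile();
        return set;
    }
}

// Usage: assign settings.Schemas = SchemaCache.Schemas; before calling XmlReader.Create.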

Reading XML from disk one record at a time with limited memory

I am trying to do a merge sort on sorted chunks of XML files on disk. There is no chance that they all fit in memory. My XML files consist of records.
Say I have n XML files. If I had enough memory, I would read the entire contents of each file into a corresponding Queue, one queue per file, compare the timestamp of the item at the head of each queue, and output the one with the smallest timestamp to another file (the merge file). This way, I merge all the little files into one big file with all the entries sorted by time.
The problem is that I don't have enough memory to read all the XML with .ReadToEnd and then pass it to XDocument.Parse.
Is there a clean way to read just enough records to keep each queue filled for the next pass that compares their XElement "TimeStamp" attribute, remembering how far into each file on disk it has read?
Thank you.
An XmlReader is what you are looking for.
Represents a reader that provides fast, non-cached, forward-only access to XML data.
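Combined with XNode.ReadFrom, you can still get XElement objects for the timestamp comparison, one record at a time. A sketch, assuming each record is an element named "Record" (use whatever element name your files actually contain):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class RecordStreamer
{
    // Lazily yields one record at a time, so only the current record
    // (not the whole chunk file) is materialized in memory.
    public static IEnumerable<XElement> StreamRecords(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Record")
                {
                    // ReadFrom consumes the element and leaves the reader on the next node.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}

Give each chunk file its own enumerator; after writing out the smallest head, pull the next record only from that file's enumerator and compare timestamps via element.Attribute("TimeStamp").Value.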
It has fallen out of fashion, but this is exactly the problem SAX was designed to solve. SAX is the Simple API for XML and is based on callbacks: you launch a read operation, and your code gets called back for each record. This may be an option, as it does not require the program to load the entire XML file (à la XmlDocument). Google SAX.
If you like the LINQ to XML API, this CodePlex project may suit your needs.

Should I use XML serialization with large files?

I have a set of very large XML files with an XSD. One XML file might be up to 300 MB.
I need to move data from the XML into SQL Server.
I found that Microsoft has a serialization library to map XML into objects:
http://msdn.microsoft.com/en-us/library/182eeyhh.aspx
The problem I am worried about is: when it maps the XML into objects, will it load all the data into memory? If it does, it seems I cannot use it.
So is XmlTextReader the best way for my case, i.e. read the file piece by piece and store the data in the database?
Yes, in .NET, XML serialization reads everything into memory at one time.
A more memory-efficient approach is to use a System.Xml.XmlReader to read the content line-by-line.
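A sketch of that shape: stream records with XmlReader and insert each one with a parameterized command. The element name, attribute names, table, and columns below are illustrative, not taken from the question:

using System.Data;
using System.Data.SqlClient;
using System.Xml;

class XmlToSqlImport
{
    public static void Import(string xmlPath, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var reader = XmlReader.Create(xmlPath))
        {
            connection.Open();
            using (var command = new SqlCommand(
                "INSERT INTO Items (Id, Name) VALUES (@id, @name)", connection))
            {
                command.Parameters.Add("@id", SqlDbType.Int);
                command.Parameters.Add("@name", SqlDbType.NVarChar, 200);

                // Stream one <Item> at a time; memory use stays flat regardless of file size.
                while (reader.ReadToFollowing("Item"))
                {
                    command.Parameters["@id"].Value = int.Parse(reader.GetAttribute("Id"));
                    command.Parameters["@name"].Value = reader.GetAttribute("Name");
                    command.ExecuteNonQuery();
                }
            }
        }
    }
}

For very large imports, batching the rows or using SqlBulkCopy would cut down on round trips, but the streaming shape stays the same.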

How to use DataContractSerializer efficiently with this use case?

I want to use the powerful DataContractSerializer to write data to, and read data from, an XML file.
But as I understand it, DataContractSerializer can only read or write an entire structure or list of structures at once.
My use case is described below; I cannot figure out how to optimize performance using this API.
I have a structure named "Information" and a List<Information> with an unpredictable number of elements.
The user may update or add new elements to this list very often.
Per operation (add or update), I must serialize all the elements in the list to the same XML file.
So I write the same, unmodified data to the XML file again and again. It does not make sense, but I cannot find any way to avoid it.
Due to the tombstoning mechanism, I must save all the information within 10 seconds.
I'm worried about the performance and possible UI lag.
Is there any workaround to partially update or add information to the XML file with DataContractSerializer?
DataContractSerializer can be used to serialize selected items - what you need to do is come up with a scheme to identify changed data and a way to serialize it efficiently. For example, one way could be:
You start by serializing the entire list of structures to a file.
Whenever an object is added/updated/removed from the list, you create a diff object that identifies the kind of change and the object changed. Then you serialize this diff object to XML and append the XML to the file.
While reading the file, you apply similar logic: first read the list, then apply the diffs one after another.
Because you want to continuously append to the file, you shouldn't have a root element in it. In other words, the file with the diff info will not be a valid XML document; it will contain a series of XML fragments. To read it, you have to enclose these fragments in an XML declaration and a root element.
You may use a background task to periodically write out the entire list and generate a valid XML file; at that point you can discard your diff file. The idea is to mimic a transactional system - one structure holding the serialized/saved info and another structure containing the changes (akin to a transaction log).
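A minimal sketch of the append step, with an illustrative diff type (the real one would carry whatever identifies an Information record):

using System.IO;
using System.Runtime.Serialization;
using System.Xml;

[DataContract]
public class InformationDiff              // illustrative diff type, not from the question
{
    [DataMember] public string ChangeKind { get; set; }  // "Add", "Update" or "Remove"
    [DataMember] public int Index { get; set; }          // which list entry was affected
    [DataMember] public string Payload { get; set; }     // serialized Information data
}

public static class DiffLog
{
    // Appends one diff as an XML fragment. The log is deliberately not a
    // valid XML document on its own, exactly as described above.
    public static void Append(string logPath, InformationDiff diff)
    {
        var serializer = new DataContractSerializer(typeof(InformationDiff));
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            ConformanceLevel = ConformanceLevel.Fragment
        };
        using (var stream = new FileStream(logPath, FileMode.Append))
        using (var writer = XmlWriter.Create(stream, settings))
        {
            serializer.WriteObject(writer, diff);
        }
    }
}

To read the log back, wrap the fragments in a dummy root element (or use an XmlReader whose ConformanceLevel is Fragment) and replay the diffs over the last full snapshot.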
If performance is a concern, then consider using something other than DataContractSerializer.
There is a good comparison of the options at
http://blogs.claritycon.com/kevinmarshall/2010/11/03/wp7-serialization-comparison/
If the size of the list is a concern, you could try breaking it into smaller lists. The most appropriate way to do this will depend on the data in your list and typical usage/edit/addition patterns.
Depending on the frequency with which the data is changed you could try saving it whenever it is changed. This would remove the need to save it in the time available for deactivation.
