How expensive is an XSD validation of XML? - C#

I want to validate large XML files against XSD schemas in C#.
For a file of 1,000 lines of XML, validation takes a long time.
Are there any tips and tricks to validate faster?
Can you post some code examples that validate large XML faster?
Edit 1: I validate like this:
Validating XML with XSD
Edit 2: For large files, validation takes more than 10 seconds, and I need it to finish in under a second.
Edit 3: The file size is greater than 10 MB.
Edit 4: I am also considering storing both the XML file and the XSD in a database.

You are currently loading the entire document into memory, which is expensive regardless of validation. A better option is to parse via a reader, as shown here on MSDN. The key points from the example on that page:
it never loads the entire document
the while (reader.Read()) loop just enumerates the entire file at the node level
validation is enabled via the XmlReaderSettings (a minimal sketch follows below)
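Here is a minimal sketch of that streaming approach; the schema.xsd and large.xml paths are placeholders, not from the original answer:

using System;
using System.Xml;
using System.Xml.Schema;

class StreamingValidation
{
    static void Main()
    {
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, "schema.xsd");   // hypothetical schema path
        settings.ValidationEventHandler += (sender, e) =>
            Console.WriteLine(e.Severity + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create("large.xml", settings))
        {
            // Streams node by node; the full document is never held in memory.
            while (reader.Read()) { }
        }
    }
}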

It's reasonable to expect parsing a document with validation to take about twice as long as parsing without validation. But that ratio will vary a great deal depending on your schema. For example if every attribute is controlled by a regular expression, and the regex is complex, then the overhead of validation could be far higher than this rule-of-thumb suggests.
Also, this doesn't allow for the cost of building a complex schema. If you have a big schema defining hundreds of element types, compiling the schema could take longer than using it to validate a few megabytes of data.
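If the same schema is used for many validation passes, one mitigation (a sketch under that assumption, not part of the original answer) is to compile the XmlSchemaSet once and reuse it across documents:

using System.Xml;
using System.Xml.Schema;

// Compile once, paying the schema-compilation cost up front.
var schemas = new XmlSchemaSet();
schemas.Add(null, "schema.xsd");   // hypothetical path
schemas.Compile();

// Reuse the compiled set for every subsequent document.
var settings = new XmlReaderSettings
{
    ValidationType = ValidationType.Schema,
    Schemas = schemas
};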

Related

Does LINQ to XML load the whole XML document during a query?

I have a large XML file which contains a database!
Its size is 400 MB.
It was created using LINQ itself, and that took 10 minutes! Great result!
But reading a particular piece of information from that XML file using LINQ takes 20 minutes or more!
Just imagine: reading a small amount of information takes more time than writing a large amount!
During the read process it has to call XDocument.Load(@"C:\400mb.xml"), which is not IDisposable.
So it loads the whole XML document, and after it has fetched my small piece of information, the memory is not freed!
My target is to read:
XDocument XD1 = XDocument.Load(@"C:\400mb.xml");
string s1 = XD1.Root.Attribute("AnyAttribute").Value;
As you can see, I need to get an attribute of the root element.
This means that the data I need might be on the first line of the XML file, so the query should complete very quickly!
But instead it loads the whole document and only then returns that information!
So the question is: how do I read that small amount of information from a large XML file, using anything?
Would the System.Threading.Tasks namespace be useful? Or creating asynchronous operations?
Or is there any technique that will work on that XML file as if it were a binary file?
I don't know! Help me please!
XDocument.Load is not the best approach, the reason being that XDocument.Load loads the whole file into memory; according to MSDN, memory usage will be proportional to the size of the file. You can use XmlReader (check here) instead if you are just planning to search the XML document. Read the XmlReader documentation on MSDN.
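For the root-attribute case specifically, a minimal sketch with XmlReader, which streams rather than loading the document:

using System.Xml;

// Stream to the root element and read one attribute;
// the 400 MB body is never materialized.
using (XmlReader reader = XmlReader.Create(@"C:\400mb.xml"))
{
    reader.MoveToContent();                          // positions the reader on the root element
    string s1 = reader.GetAttribute("AnyAttribute");
}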

How to use DataContractSerializer efficiently with this use case?

I want to use the powerful DataContractSerializer to write data to or read data from an XML file.
But as I understand it, DataContractSerializer can only read or write an entire structure or list of structures.
My use case is described below... I cannot figure out how to optimize performance with this API.
I have a structure named "Information" and a List<Information> with an unpredictable number of elements.
The user may update or add new elements to this list very often.
Per operation (add or update), I must serialize all the elements in the list to the same XML file.
So I will write even unmodified data to XML again. That makes no sense, but I cannot find any way to avoid it.
Due to the tombstoning mechanism, I must save all the information within 10 seconds.
I'm worried about performance and possible UI lag...
Is there any workaround to partially update or add information to the XML file with DataContractSerializer?
DataContractSerializer can be used to serialize selected items - what you need to do is come up with a scheme to identify changed data and a way to serialize it efficiently. For example, one way could be:
Start by serializing the entire list of structures to a file.
Whenever an object is added/updated/removed from the list, create a diff object that identifies the kind of change and the object changed. Then serialize this object to XML and append the XML to the file.
While reading the file, apply similar logic: first read the list, then apply the diffs one after another.
Because you want to continuously append to the file, you shouldn't have a root element in it. In other words, the file with the diff info will not be a valid XML document; it will contain a series of XML fragments. To read it, you have to enclose these fragments in an XML declaration and a root element.
You may use some background task to periodically write the entire list out, generating a valid XML file; at that point you can discard your diff file. The idea is to mimic a transactional system - one data structure holding the serialized/saved info and another structure containing the changes (akin to a transaction log). A sketch of the append step follows.
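A minimal sketch of appending one diff as an XML fragment, assuming the question's Information type; the InformationDiff class and the file layout are illustrative, not a fixed format (on Windows Phone the FileStream would be an IsolatedStorageFileStream):

using System.IO;
using System.Runtime.Serialization;
using System.Xml;

[DataContract]
public class InformationDiff
{
    [DataMember] public string ChangeKind { get; set; }   // "Add", "Update" or "Remove"
    [DataMember] public Information Item { get; set; }    // the question's Information type
}

public static void AppendDiff(string path, InformationDiff diff)
{
    var serializer = new DataContractSerializer(typeof(InformationDiff));
    var settings = new XmlWriterSettings
    {
        OmitXmlDeclaration = true,                 // the file is a fragment series, not a document
        ConformanceLevel = ConformanceLevel.Fragment
    };
    using (var stream = new FileStream(path, FileMode.Append))
    using (var writer = XmlWriter.Create(stream, settings))
    {
        serializer.WriteObject(writer, diff);
    }
}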
If performance is a concern, then consider using something other than DataContractSerializer.
There is a good comparison of the options at
http://blogs.claritycon.com/kevinmarshall/2010/11/03/wp7-serialization-comparison/
If the size of the list is a concern, you could try breaking it into smaller lists. The most appropriate way to do this will depend on the data in your list and typical usage/edit/addition patterns.
Depending on the frequency with which the data is changed you could try saving it whenever it is changed. This would remove the need to save it in the time available for deactivation.

Reading and Writing XML as relational data - best practices

I'm supposed to do the following:
1) read a huge (700MB ~ 10 million elements) XML file;
2) parse it preserving order;
3) create a text(one or more) file with SQL insert statements to bulk load it on the DB;
4) take the relational tuples and write them back out as XML.
I'm here to exchange some ideas about the best (== fast fast fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe that XmlTextReader is a good start, but I do not know if it can handle such a huge file. Does it load the whole file when instantiated, or does it hold just the current node in memory? I suppose I can do a while (reader.Read()) loop and that should be fine.
What is the best way to write the text files? As I must preserve the ordering of the XML (adopting some numbering scheme), I will have to hold some parts of the tree in memory to do the calculations, etc... Should I iterate with a StringBuilder?
I will have two scenarios: one where every node (element, attribute or text) will be in the same table (i.e., will be the same object), and another where for each type of node (just these three types, no comments etc.) I will have a table in the DB and a class to represent this entity.
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter... I really don't know.
I'm testing all this stuff... but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
If your XML has no schema, you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. You will need to add the mappings to your relational model yourself, however. This blog has some posts on how to do that.
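Invoking the component from C# might look like this rough sketch; it assumes SQLXML 4.0 is installed, the ProgID and property names follow the MSDN VBScript samples, and the connection string and file paths are hypothetical:

using System;

// C# 4.0 dynamic keeps this free of an interop assembly reference.
Type bulkLoadType = Type.GetTypeFromProgID("SQLXMLBulkLoad.SQLXMLBulkload.4.0");
dynamic bulkLoad = Activator.CreateInstance(bulkLoadType);
bulkLoad.ConnectionString =
    "Provider=SQLOLEDB;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
bulkLoad.ErrorLogFile = @"C:\bulkload-errors.log";
bulkLoad.Execute(@"C:\mapping.xsd", @"C:\data.xml");   // annotated XSD + the 700 MB data file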
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or other-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.

Storing settings: XML vs. SQLite?

I am currently writing an IRC client and I've been trying to figure out a good way to store the server settings. Basically a big list of networks and their servers as most IRC clients have.
I had decided on using SQLite but then I wanted to make the list freely available online in XML format (and perhaps definitive), for other IRC apps to use. So now I may just store the settings locally in the same format.
I have very little experience with either ADO.NET or XML so I'm not sure how they would compare in a situation like this.
Is one easier to work with programmatically? Is one faster? Does it matter?
It's a vaguer question than you realize. "Settings" can encompass an awful lot of things.
There's a good .NET infrastructure for handling application settings in configuration files. These, generally, are exposed to your program as properties of a global Settings object; the classes in the System.Configuration namespace take care of reading and persisting them, and there are tools built into Visual Studio to auto-generate the code for dealing with them. One of the data types that this infrastructure supports is StringCollection, so you could use that to store a list of servers.
But for a large list of servers, this wouldn't be my first choice, for a couple of reasons. I'd expect that the elements in your list are actually tuples (e.g. host name, port, description), not simple strings, in which case you'll end up having to format and parse the data to get it into a StringCollection, and that is generally a sign that you should be doing something else. Also, application settings are read-only (under Vista, at least), and while you can give a setting user scope to make it persistable, that leads you down a path that you probably want to understand before committing to.
So, another thing I'd consider: Is your list of servers simply a list, or do you have an internal object model representing it? In the latter case, I might consider using XML serialization to store and retrieve the objects. (The only thing I'd keep in the application configuration file would be the path to the serialized object file.) I'd do this because serializing and deserializing simple objects into XML is really easy; you don't have to be concerned with designing and testing a proper serialization format because the tools do it for you.
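A sketch of that serialization approach; the ServerEntry and ServerList types here are hypothetical stand-ins for the client's object model:

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class ServerEntry
{
    public string Host { get; set; }
    public int Port { get; set; }
    public string Description { get; set; }
}

public class ServerList
{
    public List<ServerEntry> Servers = new List<ServerEntry>();
}

public static class SettingsStore
{
    // Persist the whole object model in one call; reading it back is just as short.
    public static void Save(string path, ServerList list)
    {
        var serializer = new XmlSerializer(typeof(ServerList));
        using (var stream = File.Create(path))
            serializer.Serialize(stream, list);
    }

    public static ServerList Load(string path)
    {
        var serializer = new XmlSerializer(typeof(ServerList));
        using (var stream = File.OpenRead(path))
            return (ServerList)serializer.Deserialize(stream);
    }
}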
The primary reason I look at using a database is if my program performs a bunch of operations whose results need to be atomic and durable, or if for some reason I don't want all of my data in memory at once. If every time X happens, I want a permanent record of it, that's leading me in the direction of using a database. You don't want to use XML serialization for something like that, generally, because you can't realistically serialize just one object if you're saving all of your objects to a single physical file. (Though it's certainly not crazy to simply serialize your whole object model to save one change. In fact, that's exactly what my company's product does, and it points to another circumstance in which I wouldn't use a database: if the data's schema is changing frequently.)
I would personally use XML for settings - .NET is already built to do this and as such has many built-in facilities for storing your settings in XML configuration files.
If you want to use a custom schema (be it XML or DB) for storing settings then I would say that either XML or SQLite will work just as well since you ought to be using a decent API around the data store.
Every tool has its place.
There is plenty of hype around XML, I know. But you should see that XML is basically an exchange format - not a storage format (unless you use a native XML database, which gives you more options - but might also add some headaches).
When your configuration is rather small (say, fewer than 10,000 records), you might use XML and be fine. You will load the whole thing into memory and access the entries there. Done.
But when your configuration is so big that you don't want to load it completely, then you should rethink your decision and stay with SQLite, which gives you the option to dynamically load just those parts of the configuration you need.
You could also provide a little tool to create an XML file from the DB content - creating XML from a DB is a rather simple task.
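A sketch of that "load only what you need" pattern, assuming the System.Data.SQLite ADO.NET provider; the table and column names are hypothetical:

using System.Data.SQLite;

using (var conn = new SQLiteConnection("Data Source=settings.db"))
{
    conn.Open();
    using (var cmd = new SQLiteCommand(
        "SELECT Host FROM Servers WHERE Network = @network LIMIT 1", conn))
    {
        cmd.Parameters.AddWithValue("@network", "ExampleNet");
        // Pulls one value; the rest of the list stays on disk.
        string host = (string)cmd.ExecuteScalar();
    }
}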
Looks like you have two separate applications here: a web server and a desktop client (because that is traditionally where these things run), each with its own storage needs.
On the server side: go with a relational data store, not Xml. Basically at some point you need to keep user data separate from other user data on the server. XML is not a good store for that.
On the client: it doesn't really matter. Xml will probably be easier for you to manipulate. And don't think that because you are using one technology in one setting, you have to use it in the other.

What is the best way to generate XML from the data in the database?

If I have thousands of hierarchical records to pull from the database and generate XML from, what is the best way to do it with good performance and low CPU utilization?
You can output XML directly from SQL Server 2005 using FOR XML: the results of a query are returned as an XML document. It must be used with one of the three options RAW, AUTO or EXPLICIT:
RAW - each row in the result set is an XML element with a generic identifier as the element tag.
AUTO - results are returned in a simple nested XML tree; an element is generated for each table field in the SELECT clause.
EXPLICIT - specifies the shape of the resulting XML tree explicitly; the query must be written in a particular way so that the additional information about the nesting is specified.
Two further directives shape the output:
XMLDATA - returns the schema, but does not add the root element to the result.
ELEMENTS - specifies that the columns are returned as child elements of the table element; if not specified, they are mapped as attributes.
You can generate an inline XSD schema at the same time using XMLSCHEMA, handle null values in records using XSINIL, and also return the data in binary form.
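From C#, a FOR XML result can be streamed straight to a file via ExecuteXmlReader; a rough sketch, where the connection string, table and column names are hypothetical:

using System.Data.SqlClient;
using System.Xml;

string connectionString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT Id, Name FROM Category FOR XML AUTO, ELEMENTS, ROOT('Categories')", conn))
{
    conn.Open();
    using (XmlReader reader = cmd.ExecuteXmlReader())
    using (XmlWriter writer = XmlWriter.Create("categories.xml"))
    {
        // Copies the streamed XML to disk without building the whole document in memory.
        writer.WriteNode(reader, true);
    }
}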
You might want to have a look on MSDN for XML support in SQL Server 2005, for technologies such as XQuery, XML data type, etc.
That depends - if your application and database servers are on separate machines, then you need to specify which CPU you want to reduce the load on. If your database is already loaded up, you might be better off doing the XML transform on your application server; otherwise go ahead and use SQL Server's FOR XML capabilities.
Oracle has tools for that, so I guess SQL Server does too, but you'll need a schema. Personally, for small sets I use a PHP script I have around, but big jobs that need customization are another story.
