Transform (large) XML files into relational SQL - C#

I've been tasked with importing a set of XML files, transforming them, uploading them to an SQL database, and then re-transforming them into a different XML format.
The XML files are rather large, and some of them are a little complex, so I'm unsure of the best way to do this. I'd of course like to automate this process somehow, and was actually hoping there'd be some kind of Entity Framework-esque solution to this.
I'm quite new to handling and dealing with XML in .NET, so I don't really know what my options are. I've read about XSLT, but that seems to me to be a "language" I need to learn first, which makes it not really a solution for me.
Just to set a bit of context: the final solution actually needs to import new/updated versions of the XML on a weekly basis, upload the new data to SQL, and re-export it in the other XML format.
If anyone could give me any ideas as to how to proceed, I'd be much obliged.
My first instinct was to use something like XSD2DB or XML SPY to first create the database structure, but I don't really see how I'm supposed to proceed from there either.
I'm quite blank in fact :)

XSLT is a language used by XML processors to transform an XML document in one format into an XML document in another format. XSLT would be your choice if you didn't also need to store the data in a database.
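From C#, applying an XSLT transform requires very little code beyond loading the stylesheet; here is a minimal sketch using XslCompiledTransform, with placeholder file names:

```csharp
using System.Xml.Xsl;

class XsltDemo
{
    static void Main()
    {
        // Compile the stylesheet once; the same instance can transform many inputs.
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load("transform.xslt");   // placeholder stylesheet path

        // Transform the source document and write the result to disk.
        xslt.Transform("input.xml", "output.xml");
    }
}
```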
Tools like XSD2DB or XML SPY will create some database schema for you, but the quality of that schema will depend heavily on the quality of the XML document and XSD (do you have an XSD, or are you going to generate it from a sample XML file?). The generated database will probably not be very useful for EF.
If you have an XSD, you can use the xsd.exe tool shipped with Visual Studio to generate classes representing the data of your XML files in .NET code. You will then be able to use XmlSerializer to deserialize the XML documents into your generated classes. The problem is that some XSD constructs, like choice, are modeled in .NET code in a very ugly way. Another problem can be performance: if your XML files are really huge, deserialization must read all the data at once. The last problem is again EF: classes generated by xsd.exe will most probably not be usable as entities, and you will not be able to map them.
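As a rough illustration of that workflow, assuming you ran xsd.exe yourschema.xsd /classes and it produced a root type (here a hypothetical stand-in called Catalog):

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for the root class xsd.exe generates.
[XmlRoot("catalog")]
public class Catalog
{
    [XmlElement("item")]
    public string[] Items;
}

public static class ImportDemo
{
    public static Catalog LoadCatalog(string path)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Catalog));
        using (FileStream stream = File.OpenRead(path))
        {
            // Note: this materializes the whole document at once, which is
            // the memory concern mentioned above for very large files.
            return (Catalog)serializer.Deserialize(stream);
        }
    }
}
```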
So either use EF, in which case you will have to analyze the XSD, create custom entities and a mapping to your own designed database, and fill your classes from XmlReader (best performance), XmlDocument, or XDocument; or use some tool that helps you create classes or a database from the XML, and in that case use direct SQL to work with the database.
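A minimal sketch of the XmlReader route, streaming one element at a time into hand-written entity classes; the element and attribute names here are hypothetical:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Hypothetical entity; in practice this would be your hand-designed EF entity.
public class Product
{
    public string Name;
    public decimal Price;
}

public static class StreamingImport
{
    // Reads <product> elements one at a time without loading the whole file.
    public static IEnumerable<Product> Read(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
                {
                    Product p = new Product();
                    p.Name = reader.GetAttribute("name");
                    p.Price = decimal.Parse(reader.GetAttribute("price"),
                                            CultureInfo.InvariantCulture);
                    yield return p;
                }
            }
        }
    }
}
```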
The reverse operation will again require a custom approach. You will have data represented either by your custom EF entities or by some autogenerated classes, and you will have to transform them into the new format. You can again use xsd.exe to get classes for the new format and write custom .NET code filling the new classes from the old ones (and use XmlSerializer to persist the new structure to XML), or you can use XmlWriter, XDocument, or XmlDocument to build the target XML document directly.
Data migration in any form is not an easy task with a ready-to-use solution. For really huge data sets you can use tools like SQL Server Integration Services, where you will interact with XML and SQL directly and process data in batches.

Have a look at SQLXML 4.0. It does exactly what you want (for the upload part).


Easiest Method to retrieve large sums of data from SQL Server in C#

In my situation, I have a C# DLL I wrote myself that has been registered in a SQL Server database containing sales/customer data. As of now, I'm a bit stuck.
The DLL makes a call to a remote server to obtain a token. The token is then added to the database. Ideally, the next step is to retrieve data from the SQL server into the DLL, and then build and post a JSON file to a remote server using the token the DLL retrieved.
Where I am stuck is that there are 134 elements, with different data types, in the receipt section of my JSON file alone. I will need to be able to handle all of that data in my C# DLL, and in the future I may need to pull a lot more data into this JSON file to be posted. I've done some research, and a user-defined type (UDT) wouldn't quite work; from what I can tell, it's an option I should stay away from. The other two options I know of would be either to export to XML and parse it in my DLL, or to create and read in 134+ variables.
My question is: is there a simpler way to do this besides XML/hard-coding? It would be ideal if there were a way to use an array or an object, but neither seems to be supported according to what I've read here.
Thank you.
Important note: because of the database and the JSON library I'm using, I'm working in .NET Framework 2.0.
I would recommend using XML serialization on the C# side. You create an object model that mirrors your database schema.
Since you are using .NET 2.0, you already have a good set of base classes to model your database schema in an object-oriented way. Even nullable columns can be mapped to nullable types to save memory and network space.
On the SQL side you use the FOR XML clause, which changes the output of your query from tabular to XML. You just have to write one good stored procedure that creates XML in the exact hierarchy of your C# objects.
This XML has to match the names and the casing of the classes and properties of your C# class(es).
Then you can deserialize this XML on the C# side in no more than ten lines of code, no matter how big or how complex the data hierarchy is, and you will instantly have in-memory objects that you can immediately serialize into JSON again.
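A rough, .NET 2.0-compatible sketch of that round trip; the table, columns, and class shape are hypothetical, and the field names must match what your FOR XML query emits:

```csharp
using System.Data.SqlClient;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical shape; member names must match the XML emitted by FOR XML.
[XmlRoot("Receipt")]
public class Receipt
{
    public int Id;
    public decimal Total;
}

public static class ReceiptLoader
{
    public static Receipt Load(string connectionString, int id)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT Id, Total FROM dbo.Receipts WHERE Id = @id " +
            "FOR XML PATH('Receipt')", conn))   // hypothetical table
        {
            cmd.Parameters.AddWithValue("@id", id);
            conn.Open();
            // ExecuteXmlReader streams the FOR XML result directly
            // into the serializer, with no intermediate string.
            using (XmlReader reader = cmd.ExecuteXmlReader())
            {
                XmlSerializer serializer = new XmlSerializer(typeof(Receipt));
                return (Receipt)serializer.Deserialize(reader);
            }
        }
    }
}
```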
Let me know if you need some good examples of how to achieve this. And please clarify whether you are running inside the SQL Server CLR execution context, as you might need special permissions to serialize/deserialize data.
I guess it's a very primitive way of achieving what Entity Framework does, but it works.
You should probably stick with using XML, as your data is semi-structured, especially if you know your schema will be changing over time. SQL Server is not yet an OODBMS.

Reading and Writing XML as relational data - best practices

I'm supposed to do the following:
1) read a huge (700 MB, ~10 million elements) XML file;
2) parse it, preserving order;
3) create one or more text files with SQL insert statements to bulk load it into the DB;
4) read the relational tuples back and write them out as XML.
I'm here to exchange some ideas about the best (== fast fast fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe that XmlTextReader is a good start. But I do not know if it can handle such a huge file. Does it load the whole file when it is instantiated, or does it hold just the current reading position in memory? I suppose I can do a while(reader.Read()) and that should be fine.
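For reference, a minimal sketch of that while(reader.Read()) pattern, with a placeholder file name; XmlReader is a forward-only pull parser, so it holds only the current node in memory rather than the whole document:

```csharp
using System;
using System.Xml;

class StreamCheck
{
    static void Main()
    {
        // Streams through the file node by node; a 700 MB input never
        // needs to fit in memory at once.
        using (XmlReader reader = XmlReader.Create("huge.xml")) // placeholder path
        {
            long elements = 0;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    elements++;
            }
            Console.WriteLine("{0} elements", elements);
        }
    }
}
```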
What is the best way to write the text files? Since I should preserve the ordering of the XML (adopting some numbering scheme), I will have to hold some parts of the tree in memory to do the calculations, etc. Should I iterate with a StringBuilder?
I will have two scenarios: one where every node (element, attribute, or text) will be in the same table (i.e., will be the same object), and another where for each type of node (just these three types, no comments etc.) I will have a table in the DB and a class to represent the entity.
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter... I really don't know.
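A rough sketch of the "chunks from the database with an XmlWriter" alternative, with a hypothetical table and connection string; only one row is materialized at a time, unlike DataSet.WriteXml:

```csharp
using System.Data.SqlClient;
using System.Xml;

public static class ChunkedExport
{
    // Streams rows straight from a data reader into an XmlWriter,
    // so memory usage stays flat regardless of row count.
    public static void Export(string connectionString, string outputPath)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT Id, Name FROM dbo.Nodes ORDER BY Id", conn)) // hypothetical table
        {
            conn.Open();
            using (SqlDataReader rows = cmd.ExecuteReader())
            using (XmlWriter xml = XmlWriter.Create(outputPath))
            {
                xml.WriteStartElement("nodes");
                while (rows.Read())
                {
                    xml.WriteStartElement("node");
                    xml.WriteAttributeString("id", rows.GetInt32(0).ToString());
                    xml.WriteString(rows.GetString(1));
                    xml.WriteEndElement();
                }
                xml.WriteEndElement();
            }
        }
    }
}
```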
I'm testing all this stuff, but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
If your XML has no schema, you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. You will need to add the mappings to your relational model yourself, however. This blog has some posts on how to do that.
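If you drive the component from C#, it is exposed as a COM library; a minimal sketch, assuming SQLXML 4.0 is installed and the project references the "Microsoft SQLXML Bulkload 4.0 Type Library" (all paths and the connection string are placeholders):

```csharp
// Requires a COM reference to the SQLXML 4.0 Bulkload type library.
using SQLXMLBULKLOADLib;

class BulkLoadDemo
{
    static void Main()
    {
        SQLXMLBulkLoad4Class loader = new SQLXMLBulkLoad4Class();
        loader.ConnectionString =
            "Provider=SQLOLEDB;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI";
        loader.ErrorLogFile = "bulkload-errors.xml";

        // schema.xsd is the annotated mapping schema; data.xml is the source file.
        loader.Execute("schema.xsd", "data.xml");
    }
}
```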
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or otherwise-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl, on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.

.NET Dual persistence architecture

I'm faced with the challenge of writing an object persistence mechanism that serializes/deserializes to a SQL database and XML files.
For the sake of illustration, imagine I have a graph of objects with a single root object. Maybe a "tree", for example, which has all manner of child objects: leaves, branches, nuts, squirrels, birds, and the like.
I need a suggestion for an architecture that seamlessly moves between loading & saving a "tree" from a file and/or database. It needs to be able to load a "tree" from a file and save it to a database, or the other way around.
I'm currently using Entity Framework for my SQL persistence, and I'm happy enough with it. For the XML I'm using XDocument, which I also like a lot, but I'm wondering if there isn't some framework out there that already does all this.
Unless you want to query your objects in SQL Server (or there are other sources that may update/manage the relational data), using EF to convert into a relational schema is a bit of overkill. If all you want is to persist your object graph to different mediums, then you should consider runtime serialization or DataContractSerializer. Essentially, you will get binary data or XML that you can dump into any storage medium, including SQL Server. This will free you from changing the relational schema in SQL Server when your object structure changes. However, you must consider versioning your objects if you go the serialization route.
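A minimal sketch of the DataContractSerializer approach, with a hypothetical object graph; the resulting XML can be written to a file or stored in a SQL Server column interchangeably:

```csharp
using System.IO;
using System.Runtime.Serialization;

// Hypothetical graph root; only [DataMember] members get persisted.
[DataContract]
public class Tree
{
    [DataMember] public string Species;
    [DataMember] public Branch[] Branches;
}

[DataContract]
public class Branch
{
    [DataMember] public int LeafCount;
}

public static class TreeStore
{
    public static void Save(Tree tree, string path)
    {
        DataContractSerializer serializer = new DataContractSerializer(typeof(Tree));
        using (FileStream stream = File.Create(path))
            serializer.WriteObject(stream, tree);
    }

    public static Tree Load(string path)
    {
        DataContractSerializer serializer = new DataContractSerializer(typeof(Tree));
        using (FileStream stream = File.OpenRead(path))
            return (Tree)serializer.ReadObject(stream);
    }
}
```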
You can try using the older, yet very nice XmlSerializer.
P.S. You need to watch out for anything Entity Framework may require from you when loading an object you serialized to an XML file.
Are there any strict requirements around the entities being saved in XML format? If not, another option could be to use SQLite (http://sqlite.phxsoftware.com/) with Entity Framework when you need local/filesystem persistence.

Create Valid XML from XSD Loaded at Runtime (without xsd.exe)

Possible Duplicate:
Programmatically Create XML File From XSD
XML instance generation from XML schema (xsd)
How to generate sample XML documents from their DTD or XSD?
Here's the scenario: I've created an application that hooks into a commercial CRM product using their web service API, which unfortunately has a different schema for every installation, based on how the users create their custom fields. This schema can also be modified at any time. This application will be installed at the customer location, and will need to function even when they change their field structure.
In order to insert or update a record, I first call their Project.GetSchema() method, which returns the XSD file based on the current set of fields, and then I can call the Project.AddProject() method, passing in an XML file containing the project data.
My question is: What's the best way to generate the XML from the XSD file at runtime? I need to be able to check for the existence of fields, and fill them out only if they exist (for instance, if the customer deleted or renamed some fields).
I really don't want to have the application attempting to recompile classes on the fly using xsd.exe. There simply must be a better way.
[update] My current solution, which I'm working on, is basically to parse the XSD file myself, since the majority of the schema is going to be the same for each installation. It's an ugly solution, though, and I was hoping there was a better way. The biggest problem I have is that their schema uses xsd:sequence, so putting things in a different order always breaks validation.
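For what it's worth, the System.Xml.Schema types can at least do the parsing for you. A sketch, assuming the schema returned by Project.GetSchema() declares a global root element whose content model is an xsd:sequence; it collects the declared field names so you can test for existence before emitting a field, and seq.Items preserves the order the sequence demands:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaInspector
{
    // Compiles the XSD text fetched at runtime and collects the element
    // names declared in the root element's xsd:sequence, in schema order.
    public static List<string> GetFieldNames(string xsdText)
    {
        XmlSchemaSet set = new XmlSchemaSet();
        set.Add(null, XmlReader.Create(new StringReader(xsdText)));
        set.Compile();

        List<string> names = new List<string>();
        foreach (XmlSchemaElement root in set.GlobalElements.Values)
        {
            XmlSchemaComplexType type = root.ElementSchemaType as XmlSchemaComplexType;
            XmlSchemaSequence seq = type == null ? null
                : type.ContentTypeParticle as XmlSchemaSequence;
            if (seq == null) continue;

            foreach (XmlSchemaObject item in seq.Items)
            {
                XmlSchemaElement el = item as XmlSchemaElement;
                if (el != null) names.Add(el.Name);
            }
        }
        return names;
    }
}
```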

MS Access database to XML, in .NET

I am looking to achieve two things:
a) To find a free XML database, whether open-source or not, that is simple to use; and
b) To access an MS Access DB and convert it to the XML database. If this can be done automatically, so much the better :) Otherwise, what would be the easiest way to do this?
Thanks!
Since you are already working with MS Access, my recommendation is MS SQL Express, though this may not be simple enough for your needs. SQL Express supports the FOR XML syntax to emit XML, and it should support the native XML type.
You can use the built-in export option in Access 2007 to export a database into XML. It will even export child tables into the same file. It will also produce the schema of the data for you (XSD). There is also an option to produce a presentation file (XSL), but I have not really used that feature.
So when you export the database, you get both the XML and the schema produced for you (XML + XSD file).
So the ability to export and convert an Access database to XML is built in, and no additional tools or third-party software are required.
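If "automatically" matters, the same export can be scripted through the Access object model; a sketch, assuming C# 4 (for optional COM arguments), a COM reference to the Microsoft Access Object Library, and placeholder paths and table name:

```csharp
// Requires a COM reference to the "Microsoft Access xx.0 Object Library".
using Access = Microsoft.Office.Interop.Access;

class AccessExportDemo
{
    static void Main()
    {
        Access.Application app = new Access.Application();
        try
        {
            app.OpenCurrentDatabase(@"C:\data\sales.accdb", false, "");

            // Exports the table's data plus its schema (XML + XSD),
            // matching what the Access 2007 UI export produces.
            app.ExportXML(
                Access.AcExportXMLObjectType.acExportTable,
                "Customers",
                @"C:\data\customers.xml",
                @"C:\data\customers.xsd");
        }
        finally
        {
            app.CloseCurrentDatabase();
            app.Quit();
        }
    }
}
```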
Access 2003 also has the ability to export XML.
It goes without saying that MS Access can also consume (import) XML.
Not sure if it is "easy", but IBM's DB2 Express-C (for Community edition) is free and can work with native XML:
"The lowest priced hybrid XML and relational data server designed to meet the needs of small and medium businesses.."
I frankly never used it, and very few people seem to care about it, but it seems to be an interesting product.
