I have a field in the database that is XML because it represents a class used in C#/VB.Net. The problem is that after the initial manipulation, most (but not all) of the work is done in SQL Server, which means the XML field is converted on the fly.
As we add more fields and more records, this operation is getting slow. I think the slowdown comes from converting all of those XML fields to other data types.
So to speed it up I was thinking of a couple of ways:
Have a set of tables that represent the different pieces of the XML data. I would make these tables read-only using an Insert/Update trigger that rejects any changes. When my 'main' table updates the XML, it would disable the triggers, update the shadow tables with the new values, then re-enable the triggers.
The only real reason we use the XML is because it's really easy to convert it to the class in C#/VB.Net. But I'm getting to the point where I may end up writing a routine that will take all the bits and pieces and convert it to a class, and also a function to go the other way (class -> tables).
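For reference, the conversion that makes the XML approach so convenient is just a few lines with XmlSerializer - a minimal sketch, using a hypothetical Widget class standing in for the real one:

    using System.IO;
    using System.Xml.Serialization;

    public class Widget   // hypothetical stand-in for the real class
    {
        public string Name { get; set; }
        public int Quantity { get; set; }
    }

    public static class XmlRoundTrip
    {
        static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Widget));

        // Class -> XML string, suitable for storing in an xml column.
        public static string ToXml(Widget w)
        {
            using (var writer = new StringWriter())
            {
                Serializer.Serialize(writer, w);
                return writer.ToString();
            }
        }

        // XML string -> class: the direction that is "really easy" today.
        public static Widget FromXml(string xml)
        {
            using (var reader = new StringReader(xml))
                return (Widget)Serializer.Deserialize(reader);
        }
    }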
Can anybody give any ideas on a better way to do this? I'm not tied to the idea of using the XML structure. My concern is that if we have separate tables to speed up SQL processing and somebody changes a field value in one of those tables, we have to make sure the XML is updated - or else not allow the person to update it.
TIA - Jeff.
What is the purpose of the objects you are saving? If it is anything other than persistence of state, you are not doing yourself any favors and you are not properly separating concerns. If it is persistence of state, then at minimum make columns out of the properties and fields (this can include private ones, as long as you leave an internal method to set the values when you reconstitute).
Disregarding the wisdom of what you're doing, you might look into creating an XML index. This should help you get started: http://msdn.microsoft.com/en-us/library/ms345121%28v=sql.90%29.aspx
The basic idea is that the right index can 'pre-shred' your XML and automatically build the sort of tables you are thinking of creating 'manually'. A downside is that this can really explode your storage requirements if you are storing lots of XML.
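For illustration, creating such an index is a one-time DDL step; here is a minimal sketch run from C# - the table, column and index names are made up, and the table needs a clustered primary key before a primary XML index can be created:

    using System.Data.SqlClient;

    static void CreateXmlIndexes(string connectionString)
    {
        // Hypothetical names; the primary XML index does the 'pre-shredding',
        // the optional secondary PATH index speeds up exist()/value() queries.
        const string ddl = @"
            CREATE PRIMARY XML INDEX PXML_Main_XmlData
                ON dbo.MainTable (XmlData);
            CREATE XML INDEX IXML_Main_XmlData_Path
                ON dbo.MainTable (XmlData)
                USING XML INDEX PXML_Main_XmlData FOR PATH;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ddl, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }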
I'm going to code a housekeeping book.
I create some properties in code, like Name and Category, and some others need to be created at run-time.
So how should I save that in a human-readable form in a SQL Server database?
My suggestion is to create a table called Properties with two columns (Id, Name); in that table I could store all my properties, but then it wouldn't be human-readable anymore.
I'm also not sure it would be wise to create a column for each property in one big table.
I could also create an XML "file" and store it in my DB, but I don't think that's a good idea either.
Any advice is greatly appreciated
There are basically three approaches to this:
A column for every value
The one you are suggesting, which is called an Entity-Attribute-Value (EAV) model
Or the one you discounted, which would be XML (or serialised objects)
They all have pros and cons, and some of the cons can get quite severe.
A column for every value means you have to change your db and model every time you want to store more data, which makes it very fragile and high maintenance.
EAV can easily lead to the queries becoming huge joins, and trying to impose data integrity on it is a hiding to nothing.
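To illustrate, even fetching one entity's properties already needs a join plus a client-side pivot - a minimal C# sketch, with hypothetical table names:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static Dictionary<string, string> LoadProperties(string connectionString, int entityId)
    {
        // Hypothetical EAV schema: Properties(Id, Name), PropertyValues(EntityId, PropertyId, Value).
        const string sql = @"
            SELECT p.Name, v.Value
            FROM dbo.PropertyValues v
            JOIN dbo.Properties p ON p.Id = v.PropertyId
            WHERE v.EntityId = @id;";

        var props = new Dictionary<string, string>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@id", entityId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    props[reader.GetString(0)] = reader.GetString(1);   // pivot happens in code
        }
        return props;
    }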
Object based can also lead to significant optimisation and maintenance issues, having to open every object to see if something is in it, for instance.
Now any one of these might be the best of a bad lot at the time you make the decision (they are all fragile in one respect or another), IF you insist on using a relational database.
Look at one of the NoSQL alternatives, they are designed for this sort of data.
I have a situation where I need to store some data that just won't ...really fit into a database table. It is a little too abstract, and I don't know enough to break it apart in such a way that it could be split into tables and columns. The object in question is a System.Linq.Expressions.Expression<T>.
I have discovered a means of serializing one to XML using MetaLinq, and it works pretty well, although the XML it generates is excessively obese. I somewhat expected as much from something as complicated as an Expression; a modest expression turns out to be around 19 KB.
So my thought was to use gzip compression on the file. This works well; it gets it down to about 2 KB.
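The compression step is only a few lines in .NET - a sketch with GZipStream:

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static byte[] Compress(string xml)
    {
        byte[] raw = Encoding.UTF8.GetBytes(xml);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray();   // ~19 KB of XML comes out around 2 KB
        }
    }

    static string Decompress(byte[] blob)
    {
        using (var input = new MemoryStream(blob))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
            return reader.ReadToEnd();
    }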
So then, my actual question is this: is it bad practice or 'dangerous' practice to use a table column to reference a filename for deserializing an object? I would have a table for expressions holding a filename; when that expression is needed, the code would gunzip the file, deserialize it, and return the object.
This seems like the ideal solution but it requires a lot of File I/O and a lot of various compression/zipping/serialization. I'm wondering if I could get the opinion of more experienced database admins out there. I am using Fluent nHibernate as my ORM mapper.
MetaLinq on codeplex
Not an experienced DBA, but I would store the serialized data in a BLOB field in the database. Database backups do no good if the files your data is depending on go away or vice versa. I think it would simplify things to just keep it all together. And the blob works fine since the data you are storing does not need to be queried.
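A sketch of what that looks like with plain ADO.NET, assuming a hypothetical Expressions(Id, Payload) table with a varbinary(max) column (an NHibernate mapping would work just as well):

    using System.Data;
    using System.Data.SqlClient;

    static void SaveExpression(string connectionString, byte[] compressedBytes)
    {
        const string sql = "INSERT INTO dbo.Expressions (Payload) VALUES (@payload);";
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            // Size -1 maps the parameter to varbinary(max).
            cmd.Parameters.Add("@payload", SqlDbType.VarBinary, -1).Value = compressedBytes;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }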
Depends on the size of the data.
SQL Server has an XML data type for table columns now. So you could serialize the object and insert the whole thing into the column, again depending on size.
But if you must use the file system I would store a path and the file name in the column.
In your program's app.config, keep the root of the drive, like \\MyDrive or D:\
That way, if the information moves, you just change the app.config, as long as the folder/file structure stays the same.
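A sketch of that layout (the key name is made up):

    using System.Configuration;   // reference System.Configuration.dll
    using System.IO;

    // app.config:
    //   <appSettings>
    //     <add key="BlobRoot" value="\\MyDrive\Expressions" />
    //   </appSettings>
    static byte[] LoadBlob(string relativePathFromDb)
    {
        // Only the relative path is stored in the database, so the root can move.
        string root = ConfigurationManager.AppSettings["BlobRoot"];
        return File.ReadAllBytes(Path.Combine(root, relativePathFromDb));
    }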
Edit:
Along with NerdFury's suggestion, you could use a binary serializer if you do not need to "see" the data in the database. XML serialization at least makes it readable.
I'm wondering if the following DB schema would have repercussions later. Let's say I'm writing a place entity. I'm not certain what properties of place will be stored in the DB. I'm thinking of making two tables: one to hold the required (or common) info, and one to hold additional info.
Table 1 - Place
PK PlaceId
Name
Lat
Lng
etc... (all the common fields)
Table 2 - PlaceData
PK DataId
PK FieldName
PK FK PlaceId
FieldData
Usage Scenario
I want certain visitors to have the capability of entering custom fields about a place. For example, a restaurant is a place that may have the following fields: HasParking, HasDriveThru, RequiresReservation, etc... but a car dealer is also a place, and those fields wouldn't make sense for a car dealer.
I want to support any type of place, from a single table (well, 2nd table has custom fields), because I don't know the number of types of places that will eventually be added to my site.
Overall goal
On my ASP.NET MVC (C#/Razor) site, where I display a place, it will show the attributes as an unordered list populated by: SELECT * FROM PlaceData WHERE PlaceId = #0.
This way, I wouldn't need to show empty field names on the view (or do a string.IsNullOrWhiteSpace() check for each and every field, which I would be forced to do if every attribute were a column on the table).
I'm assuming this scenario is quite common, but are there better ways to do it? Particularly from a performance perspective? What are the major drawbacks of this schema?
Your idea is referred to as an Entity-Attribute-Value table and is generally bad news in an RDBMS. RDBMSes are geared toward highly structured data.
The overall options are:
Model the db further in an RDBMS, which is most likely if someone is holding back specs from you.
Stick with the RDBMS, using XML columns for the data whose structure is variable. This makes the most sense if a relatively small portion of your data storage schema is semi- or un-structured. Speaking from a MS SQL Server perspective, this data can be indexed and you can perform checks that your data complies with an XML schema definition.
Move to a non-relational DB such as MongoDB, Cassandra, CouchDB, etc. This is what a lot of social sites and I suspect blog sites run with. Also, it is within reason to use a combination of RDBMS and non-relational stores if that's what your needs call for.
EAV gets to be a mess because you're creating a database within a database and lose all of the benefits an RDBMS can provide (foreign keys, data type enforcement, etc.), and the SQL code needed to reconstruct your objects goes from lasagna to fettuccine to spaghetti in the blink of an eye.
Given the information that's been added to the question, it would seem a good fit to create a PlaceDetails column of type XML in the Place table. You could also split that column into another table with a 1:1 relationship if performance requirements dictate it.
The upside to doing it that way is that you can retrieve the data using very simple SQL code, even using the xml data type's methods for searching the data. But that approach also allows you to do the more complex presentation-oriented data parsing in C#, which is better suited to that purpose than T-SQL is.
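For example, a search against that column could use the xml type's value() and exist() methods - a sketch, where PlaceDetails and the element names are invented:

    using System.Data;
    using System.Data.SqlClient;

    static DataTable FindPlacesWithParkingInfo(string connectionString)
    {
        // Hypothetical column dbo.Place.PlaceDetails of type xml.
        const string sql = @"
            SELECT p.PlaceId, p.Name,
                   p.PlaceDetails.value('(/Place/HasParking)[1]', 'bit') AS HasParking
            FROM dbo.Place p
            WHERE p.PlaceDetails.exist('/Place/HasParking') = 1;";

        using (var conn = new SqlConnection(connectionString))
        using (var da = new SqlDataAdapter(sql, conn))
        {
            var table = new DataTable();
            da.Fill(table);   // Fill opens and closes the connection itself
            return table;
        }
    }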
If you want your application to be able to create its own custom fields, this is a fine model. The Mantis Bugtracker uses this as well to allow Admins to add custom fields to their tickets.
If, in any case, it's going to be the programmer creating the field, I have to agree with pst that this is more of a premature optimization.
At any given time you can add new columns to the database (always keeping an eye on third normal form), so you should go with what you want and only create a second table if it's needed, or if such columns would break any of the normal forms.
What would be the best database/technique to use if I'd like to create a database that can "add", "remove" and "edit" tables and columns?
I'd like it to be scaleable and fast.
Should I use one table with five columns for this (Id, Table, Column, Type, Value)? Are there any good articles about this, or are there other solutions?
Maybe three tables: One that holds the tables, one that holds the columns and one for the values?
Maybe someone already has created a db for this purpose?
My requirement is that I'm using .NET (I guess the database doesn't have to be on Windows, but I would prefer that).
Since (in comments on the question) you are aware of the pitfalls of the "inner platform effect", it is also true that this is a very common requirement - in particular to store custom user-defined columns. And indeed, most teams have needed this. Having tried various approaches, the one which I have found most successful is to keep the extra data in-line with the record - in particular, this makes it simple to obtain the data without requiring extra steps like a second complex query on an external table, and it means that all the values share things like timestamp/rowversion for concurrency.
In particular, I've found a CustomValues column (for example text or binary; typically json / xml, but could be more exotic) a very effective way to work, acting as a property-bag for the additional data. And you don't have to parse it (or indeed, SELECT it) until you know you need the extra data.
All you then need is a way to tie named keys to expected types, but you need that metadata anyway.
I will, however, stress the importance of making the data portable; don't (for example) store any specific platform-bespoke serialization (for example, BinaryFormatter for .NET) - things like xml / json are fine.
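A sketch of that property-bag approach with JSON - Json.NET is assumed as the serializer here, but any portable one will do:

    using System.Collections.Generic;
    using Newtonsoft.Json;   // Json.NET; any portable JSON serializer works

    static string BuildCustomValues()
    {
        var custom = new Dictionary<string, object>
        {
            { "HasParking", true },
            { "MaxOccupancy", 120 }
        };
        // Portable, human-readable payload for the CustomValues column.
        return JsonConvert.SerializeObject(custom);
    }

    static Dictionary<string, object> ParseCustomValues(string customValues)
    {
        // Only parse when the extra data is actually needed.
        return JsonConvert.DeserializeObject<Dictionary<string, object>>(customValues);
    }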
Finally, your RDBMS may also work with this column; for example, SQL Server has the xml data type that allows you to run specific queries and other operations on xml data. You must make your own decision whether that is a help or a hindrance ;p
If you also need to add tables, I wonder if you are truly using the RDBMS as an RDBMS; at that point I would consider switching from an RDBMS to a document database such as CouchDB or Raven DB.
Currently, I'm sitting on an ugly business application written in Access that takes a spreadsheet on a bi-daily basis and imports it into an MDB. I am currently converting a major project that includes this into SQL Server and .NET, specifically in C#.
To house this information there are two tables (alias names here) that I will call Master_Prod and Master_Sheet, joined on ProdID, an identity key on the parent Master_Prod table. There are also two more tables to store history, History_Prod and History_Sheet. There are more tables that extend off of Master_Prod, but I'm keeping this limited to two tables for explanation purposes.
Since this was written in Access, the subroutine that handles this file is littered with manually coded 'triggers' to deal with history, which have been a constant pain to keep up with - one reason why I'm glad this is moving to a database server rather than a RAD tool. I am writing triggers to handle history tracking.
My plan is/was to create an object modeling the spreadsheet, parse the data into it and use LINQ to do some checks client side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (unless none exists, then it's new). If any of the fields have been altered I want to send the update.
Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list, since I'll have the spreadsheet in this form already, but I've recently learned this is going to be paired with a rather important database server that I am very concerned about bogging down.
Is this worth putting a CLR stored procedure in for? There are other points of entry where data enters and if I could build a procedure to handle them given the objects passed in then I could take a lot of business rule away from the application at the expense of potential database performance.
Basically I want to take the update checking away from the client and put it on the database so the data system manages whether or not the table should be updated so the history trigger can fire off.
Thoughts on a better way to implement this along the same direction?
Use SSIS. Use Excel Source to read the spreadsheets, perhaps use a Lookup Transformation to detect new items and finally use a SQL Server Destination to insert the stream of missing items into SQL.
SSIS is a much better fit for these kinds of jobs than writing something from scratch, no matter how much fun LINQ is. SSIS packages are easier to debug, maintain and refactor than some DLL with forgotten sources. Besides, you will not be able to match the refinements SSIS has in managing its buffers for high-throughput Data Flows.
"Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list, since I'll have the spreadsheet in this form already, but I've recently learned this is going to be paired with a rather important database server that I am very concerned about bogging down."
That does not work. Any input into a C#-written CLR procedure still has to follow normal SQL semantics. All that can change is the internal setup. Any communication up from the client has to be done in SQL, which means executions / method calls. There is no way to directly pass in an enumerable of objects.
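What does work (on SQL Server 2008+) is streaming the rows to the server as a table-valued parameter, so the server side still sees a table rather than objects - a sketch, where the row class, column names and procedure/type names are all hypothetical:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    class SheetRow { public int ProdId; public string Value; }   // hypothetical row shape

    static IEnumerable<SqlDataRecord> AsRecords(IEnumerable<SheetRow> rows)
    {
        var meta = new[]
        {
            new SqlMetaData("ProdID", SqlDbType.Int),
            new SqlMetaData("Value", SqlDbType.NVarChar, 256)
        };
        foreach (var row in rows)
        {
            var rec = new SqlDataRecord(meta);
            rec.SetInt32(0, row.ProdId);
            rec.SetString(1, row.Value);
            yield return rec;   // streamed to the server, never materialized as objects there
        }
    }

    // Usage against a hypothetical procedure dbo.UpsertSheet(@rows dbo.SheetRowType):
    //   var p = cmd.Parameters.AddWithValue("@rows", AsRecords(sheetRows));
    //   p.SqlDbType = SqlDbType.Structured;
    //   p.TypeName = "dbo.SheetRowType";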
"My plan is/was to create an object modeling the spreadsheet, parse the data into it and use LINQ to do some checks client side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (unless none exists, then it's new). If any of the fields have been altered I want to send the update."
You probably need to pick a "centricity" for your approach - i.e. data-centric or object-centric.
I would probably model the data appropriately first. This is because relational databases (or even non-normalized models represented in relational databases) will often outlive client tools/libraries/applications. I would probably start trying to model in a normal form and think about the triggers to maintain audit/history, as you mention, during this time as well.
I would typically then think of the data coming in (not an object model or an entity, really). So then I focus on the format and semantics of the inputs and see if there is a misfit with my data model - perhaps there were assumptions in my data model which were incorrect. Note that I'm not thinking of making an object model which validates the spreadsheet, even though spreadsheets are notoriously fickle input sources. Like Remus, I would simply use SSIS to bring it in - perhaps to a staging table, and then do some more validation before applying it to production tables with some T-SQL.
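For instance, applying the staged rows could be a single MERGE that only touches genuinely changed rows, so the history trigger fires only for real changes - a sketch, where the staging table and the SheetKey/Value columns are invented:

    using System.Data.SqlClient;

    static void ApplyStagedRows(string connectionString)
    {
        // Runs after SSIS has loaded dbo.Staging_Prod; column names are hypothetical.
        const string merge = @"
            MERGE dbo.Master_Prod AS t
            USING dbo.Staging_Prod AS s
                ON t.SheetKey = s.SheetKey
            WHEN MATCHED AND t.Value <> s.Value THEN
                UPDATE SET Value = s.Value      -- history trigger fires only here
            WHEN NOT MATCHED BY TARGET THEN
                INSERT (SheetKey, Value) VALUES (s.SheetKey, s.Value);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(merge, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }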
Then I would think about a client tool which had an object model based on my good solid data model.
Alternatively, the object approach would mean modeling the spreadsheet, but also an object model which needs to be persisted to the database - and perhaps you now have two object models (spreadsheet and full business domain) and a database model (storage persistence), if the spreadsheet object model is not as complete as the system's business domain object model.
I can think of an example where I had a throwaway external object model kind of like this. It read a "master file" which was a layout file describing an input file. This object model allowed the program to build SSIS packages (and BCP and SQL scripts) to import/export/do other operations on these files. Effectively it was a throwaway object model - it was not used as the actual model for the data in the rows or any kind of navigation between parent and child rows, etc., but simply an internal representation for internal purposes - it didn't necessarily correspond to a "domain" entity.