Generic Batch Update For Multiple Tables - C#

In my application I am running a process that will run update commands (sometimes more than 10000) on over 100 different tables. I am using Entity Framework, which can be incredibly slow for updating - on the order of 40+ minutes to update 13000 records by updating the entities and then calling SaveChanges() after a sizeable batch of updates.
A MERGE command won't work because I would need a temp table for every table being updated, and a stored procedure doesn't seem feasible either. So I started looking at the DataAdapter's UpdateCommand, passing it a DataTable, and I am having two problems. The first is this:
da.UpdateCommand.Parameters.Add("@YourField", SqlDbType.SmallDateTime).SourceColumn = "YourField";
There is no way to determine the DbType of the destination column generically. So how do I map the columns into the parameters of the update? Secondly, I don't want to upsert; I just want to update, and if there is no matching record, ignore it and proceed. I know I can keep doing the updates on failure using
da.ContinueUpdateOnError = true;
but I can't seem to find a way to prevent it from inserting when a record is not found. Any help is greatly appreciated. Thank you!
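For illustration only (this is not from the original thread), one direction that sidesteps the type-mapping problem is to let SqlCommandBuilder derive the UpdateCommand and its parameter types from a SELECT on each table. It is a minimal sketch, assuming every target table has a primary key, that connectionString/tableName/table are placeholders you supply, and that the DataTable contains only rows that were loaded and then modified in memory (no Added rows, so the adapter never generates INSERTs):

// requires System.Data and System.Data.SqlClient
static void UpdateTable(string connectionString, string tableName, DataTable table)
{
    using (var conn = new SqlConnection(connectionString))
    using (var da = new SqlDataAdapter("SELECT * FROM [" + tableName + "]", conn))
    using (var builder = new SqlCommandBuilder(da))  // derives the UpdateCommand and parameter DbTypes from the table schema
    {
        da.ContinueUpdateOnError = true;  // an UPDATE that matches no row just records a RowError and moves on
        da.Update(table);                 // only rows with RowState == Modified produce UPDATE statements
    }
}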

If SQL replication is not an option, you can use our SQL Data Compare command line tool: you can easily configure it to ignore the rows that exist on only one side and only sync (that is, update) the rows that exist on both sides. You can use the GUI if you wish to review the process, or the command line utility if you want to schedule it to happen automatically.

Related

How to efficiently perform SQL Server database updates from my C# application

I am trying to figure out the best way to design my C# application which utilizes data from a SQL Server backend.
My application periodically has to update 55K rows, one at a time, from a loop. Before it does an update it needs to check whether the record to be updated exists.
If it exists it updates one field.
If not it performs an insert of 4 fields.
The table to be updated has 600K rows.
What is the most efficient way to handle these updates/inserts from my application?
Should I create a data dictionary in C# and load the 600K records, then query the dictionary first instead of the database?
Is this a faster approach?
Should I use a stored procedure?
What’s the best way to achieve maximum performance based on this scenario?
You could use SqlBulkCopy to upload to a temp table then have a SQL Server job do the merge.
You should try to avoid "update 55K rows each one at a time from a loop". That will be very slow.
Instead, try to find a way to batch the updates (n at a time). Look into SQL Server table value parameters as a way to send a set of data to a stored procedure.
Here's an article on updating multiple rows with TVPs: http://www.sqlmag.com/article/sql-server-2008/using-table-valued-parameters-to-update-multiple-rows
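To make the TVP suggestion concrete, here is a rough sketch of passing a DataTable to a stored procedure as a table-valued parameter. The names dbo.IdValueType, dbo.UpdateRowsBatch and connectionString are made up; you would create a matching user-defined table type and a procedure that joins it to the target table and applies the updates:

// requires System.Data and System.Data.SqlClient
var rows = new DataTable();
rows.Columns.Add("Id", typeof(int));
rows.Columns.Add("Value", typeof(string));
// ... fill rows with the keys and new values ...

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.UpdateRowsBatch", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.Add("@rows", SqlDbType.Structured);  // TVP parameter
    p.TypeName = "dbo.IdValueType";                             // the user-defined table type
    p.Value = rows;
    conn.Open();
    cmd.ExecuteNonQuery();
}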
What if you did something like this, instead?
By some means, get those 55,000 rows of data into the database, if they're not already there. If you're currently getting those rows from some query, arrange instead for the query results to be stored in a temporary table on that database. (This might be a proper application for a stored procedure.)
Now, you could express the operations that you need to perform, perhaps, as two separate SQL queries: one to do the updates, and one or more others to do the inserts. The first query might use a clause such as "WHERE FOO IN (SELECT BAR FROM #TEMP_TABLE ...)" to identify the rows to be updated. The others might use "WHERE FOO NOT IN (...)"
This is, to be precise, exactly the sort of thing I would expect to use a stored procedure for, because, if you think about it, the SQL server itself is precisely the right party to be doing the work: it is the only one that already has the data you intend to manipulate on hand, and it doesn't have to transmit those 55,000 rows anywhere. Perfect.
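As a hedged illustration of the two-query idea (written here with a JOIN and NOT EXISTS, which is equivalent to the IN / NOT IN clauses described above), assuming the incoming rows have already been bulk-loaded into a staging table; dbo.Staging, dbo.TargetTable, the column names and connectionString are all made up for the sketch:

// requires System.Data.SqlClient
const string sql = @"
    -- rows that already exist: update the one field
    UPDATE t SET t.SomeField = s.SomeField
    FROM dbo.TargetTable AS t
    JOIN dbo.Staging AS s ON s.[Key] = t.[Key];

    -- rows that don't exist yet: insert the 4 fields
    INSERT INTO dbo.TargetTable ([Key], ColA, ColB, SomeField)
    SELECT s.[Key], s.ColA, s.ColB, s.SomeField
    FROM dbo.Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.[Key] = s.[Key]);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}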

C# and SQL Server 2008 - Batch Update

I have a requirement where I need to update thousands of records in a live database table, and although there are many columns in this table, I only need to update 2-3 of them.
Further, I can't hit the database thousands of times just for updates; that work can be done as a batch update using a SQL Server table-valued parameter. But then again, I shouldn't update all of the thousands of records in one go either, for better error handling; instead I want to update records in batches of x*100.
So, below is my approach; please give your valuable input on any alternatives or any change to the proposed process -
1. Fetch the required records from the database into List<T> MainCollection
2. Save this collection to an XML file with each element's Status = Pending
3. Take the first n elements with Status = Pending from the XML file and add them to a new List<T> SubsetCollection
4. Loop over List<T> SubsetCollection and make the required changes to each T
5. Convert List<T> SubsetCollection to a DataTable
6. Call the update stored procedure and pass the above DataTable as a TVP
7. Update Status = Processed for the XML elements corresponding to List<T> SubsetCollection
8. If more records with Pending status exist in the XML file, go to step 3.
Please guide for a better approach or any enhancement in above process.
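Purely for illustration, here is a compact sketch of steps 3-7 above. It keeps the Pending/Processed status in memory rather than in the intermediate XML file, and ApplyRequiredChanges, ToDataTable and UpdateViaTvp are hypothetical helpers standing in for your own mapping code and the TVP call:

// requires System.Linq and System.Data
int n = 500;  // batch size, the x*100 from the question
var pending = MainCollection.Where(item => item.Status == "Pending").ToList();
for (int i = 0; i < pending.Count; i += n)
{
    var batch = pending.Skip(i).Take(n).ToList();              // step 3
    foreach (var item in batch) ApplyRequiredChanges(item);    // step 4 (hypothetical)
    DataTable tvp = ToDataTable(batch);                        // step 5 (hypothetical helper)
    UpdateViaTvp(tvp);                                         // step 6: stored procedure with a TVP parameter
    batch.ForEach(item => item.Status = "Processed");          // step 7
}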
I would take a database-only approach if possible and, if that is not possible, eliminate the parts that will be the slowest. If you are unable to do all the work in a stored procedure, then retrieve all the records and make the changes.
The next step is to write the changes to a staging table with SqlBulkCopy. This is a fast bulk loader that will copy thousands of records in seconds. You will store the primary key and the columns to be updated, as well as a batch number. The batch number is assigned to each batch of records, allowing another batch to be loaded without conflicting with the first.
Use a stored procedure on the server to process the records in batches of 100 or 1000 depending on performance. Pass the batch number to the stored procedure.
We use such a method to load and update millions of records in batches. The best speed is obtained by eliminating the network and allowing the database server to handle the bulk of the work.
I hope this might provide you with an alternate solution to evaluate.
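A rough sketch of that staging-table-plus-batch-number approach (dbo.UpdateStaging, dbo.ProcessUpdateBatch, connectionString and the shape of stagingTable are made-up placeholders; the DataTable's columns are assumed to line up with the staging table):

// requires System.Data and System.Data.SqlClient
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.UpdateStaging";  // holds the primary key, the new values and a BatchNumber
    bulk.WriteToServer(stagingTable);                 // loads thousands of rows in seconds
}

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.ProcessUpdateBatch", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@BatchNumber", batchNumber);  // the server-side proc applies just this batch
    conn.Open();
    cmd.ExecuteNonQuery();
}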
It may not be best practice, but you could embed some logic inside a SQL Server CLR function. This function could be called by a query, a stored procedure, or a schedule to run at a certain time.
The only issue I can see is getting step 4 to make the required changes on T. Embedding that logic in the database could be detrimental to maintenance, but this is no different from people who embed massive amounts of business logic in stored procedures.
Either way, SQL Server CLR functions may be the way to go. You can create them in Visual Studio 2008 or 2010 (check the new database project types).
Tutorial : http://msdn.microsoft.com/en-us/library/w2kae45k(v=vs.80).aspx
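For reference, a minimal SQL CLR stored procedure looks roughly like this; a CLR function is written similarly with [SqlFunction]. It would be compiled into an assembly and registered with CREATE ASSEMBLY / CREATE PROCEDURE ... EXTERNAL NAME, and the table and column names in the body are made up for the sketch:

using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class BatchRoutines
{
    [SqlProcedure]
    public static void ProcessPendingBatch(int batchSize)
    {
        // "context connection=true" runs against the database that invoked the procedure
        using (var conn = new SqlConnection("context connection=true"))
        using (var cmd = new SqlCommand(
            "UPDATE TOP (@n) dbo.TargetTable SET Status = 'Processed' WHERE Status = 'Pending'", conn))
        {
            cmd.Parameters.AddWithValue("@n", batchSize);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}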

Auditing record changes in SQL Server databases

Using only Microsoft-based technologies (MS SQL Server, C#, EAB, etc.), if you needed to keep track of changes made to a record in a database, which strategy would you use? Triggers, AOP on the DAL, something else? And how would you display the collected data? Is there a pattern for this? Is there a tool or a framework that helps implement this kind of solution?
The problem with Change Data Capture is that it isn't flexible enough for real auditing. You can't add the columns you need. It also discards the records after three days by default (you can change this, but I don't think you can store them forever), so you have to have a job dumping the records to a real audit table if you need to keep the data for a long time, which is typical of the need to audit records (we never purge our audit records).
I prefer the trigger approach. You have to be careful when you write the triggers to ensure that they will capture the data if multiple records are changed. We have two tables for each table audited, one to store the datetime and id of the user or process that took the action and one to store the old and new data. Since we do a lot of multiple record processes this is critical for us. If someone reports one bad record, we want to be able to see if it was a process that made the change and if so, what other records might have been affected as well.
At the time you create the audit process, create the scripts to restore a set of audited data to the old values. It's a lot easier to do this when under the gun to fix things, if you already have this set up.
SQL Server 2008 R2 has this built in - look up Change Data Capture in Books Online.
This is probably not a popular opinion, but I'm going to throw it out there anyhow.
I prefer stored procedures for all database writes. If auditing is required, it's right there in the stored procedure. There's no magic happening outside the code, everything that happens is documented right at the point where writes occur.
If, in the future, a table needs to change, one has to go to the stored procedure to make the change. The need to update the audit is documented right there. And because we used a stored procedure, it's simpler to "version" both the table and its audit table.

Faster way to update 250k rows with SQL

I need to update about 250k rows on a table and each field to update will have a different value depending on the row itself (not calculated based on the row id or the key but externally).
I tried with a parametrized query but it turns out to be slow (I can still try with a table-valued parameter, SqlDbType.Structured, in SQL Server 2008, but I'd like to have a general way to do it on several databases including MySQL, Oracle and Firebird).
Making a huge concat of individual updates is also slow (BUT about 2 times faster than making thousands of individual calls (roundtrips!) using parametrized queries)
What about creating a temp table and running an update joining my table and the tmp one? Will it work faster?
How slow is "slow"?
The main problem with this is that it would create an enormous entry in the database's log file (if there's a power failure halfway through the update, the database needs to have logged each action so that it can roll back). This is most likely where the "slowness" is coming from, more than anything else. Obviously, with such a large number of rows there are other ways to make the thing inefficient (doing one DB round trip per update would be unbearably slow, for example); I'm just saying that once you eliminate the obvious things, you'll still find it's pretty slow.
There's a few ways you can do it more efficiently. One would be to do the update in chunks, 1,000 rows at a time, say. That way, the database writes lots of small log entries, rather than one really huge one.
Another way would be to turn off, or turn "down", the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "simple" or "bulk-logged", which would speed it up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
Edit Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
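A sketch of the chunked variant, assuming the new values are already in a staging table; dbo.NewValues, dbo.BigTable, the column names and connectionString are invented for illustration. Each iteration produces a small log entry instead of one enormous one:

// requires System.Data.SqlClient
const string chunkSql = @"
    UPDATE TOP (1000) t
    SET t.SomeColumn = n.NewValue
    FROM dbo.BigTable AS t
    JOIN dbo.NewValues AS n ON n.Id = t.Id
    WHERE t.SomeColumn <> n.NewValue;";  // only touch rows that still differ

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(chunkSql, conn))
{
    conn.Open();
    int affected;
    do
    {
        affected = cmd.ExecuteNonQuery();  // rows updated in this chunk
    } while (affected > 0);                // stop when nothing is left to update
}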
call a stored procedure if possible
If the columns updated are part of indexes you could
drop these indexes
do the update
re-create the indexes.
If you need these indexes to retrieve the data, well, it doesn't help.
You should use SqlBulkCopy with the SqlBulkCopyOptions.KeepIdentity flag set.
As part of a SqlTransaction do a query to SELECT all the records that need updating and then DELETE THEM, returning those selected (and now removed) records. Read them into C# in a single batch. Update the records on the C# side in memory, now that you've narrowed the selection and then SqlBulkCopy those updated records back, keys and all. And don't forget to commit the transaction. It's more work, but it's very fast.
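Sketched out, that pattern might look like the following. DELETE ... OUTPUT deleted.* is one way to select and remove the rows in a single statement, and dbo.BigTable, dbo.KeysToUpdate, connectionString and ComputeNewValue are made-up placeholders (error handling omitted):

// requires System.Data and System.Data.SqlClient
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        // 1. pull out and remove the rows to change in one round trip
        var rows = new DataTable();
        using (var cmd = new SqlCommand(
            @"DELETE FROM dbo.BigTable
              OUTPUT deleted.*
              WHERE Id IN (SELECT Id FROM dbo.KeysToUpdate);", conn, tx))
        using (var reader = cmd.ExecuteReader())
        {
            rows.Load(reader);  // the deleted rows, column for column
        }

        // 2. apply the new values in memory
        foreach (DataRow r in rows.Rows)
            r["SomeColumn"] = ComputeNewValue(r);  // hypothetical

        // 3. push the modified rows back, keeping the original identity values
        using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.KeepIdentity, tx))
        {
            bulk.DestinationTableName = "dbo.BigTable";
            bulk.WriteToServer(rows);
        }

        tx.Commit();  // without this, the whole operation rolls back when the transaction is disposed
    }
}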
Here's what I would do:
Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
Calculate/produce those changes
Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
Run SQL on the server to copy new values from the temporary table into the production table.
Pros:
Running the final step server-side is faster than running tons and tons of individual SQL, so you're going to lock the table in question for a shorter time
Bulk insert like this is fast
Cons:
Requires extra space in your database for the temporary table
Produces more log data, logging both the bulk insert and the changes to the production table
Here are things that can make your updates slow:
executing updates one by one through parametrized queries
solution: do the update in one statement
a large transaction creates a big log entry
see codeka's answer
updating indexes (the RDBMS will update the index after each row; changing an indexed column can be very costly on a large table)
if you can, drop the indexes before the update and recreate them after
updating a field that has a foreign key constraint - for each updated record the RDBMS has to look up the referenced key
if you can, disable foreign key constraints before the update and enable them after the update
triggers and row-level checks
if you can, disable triggers before the update and enable them after (see the sketch after this list)
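A hedged sketch of the disable/re-enable idea from the last three items. The index, constraint and trigger names and connectionString are invented; only do this if you can tolerate the table being in that state for the duration, and remember that REBUILD and WITH CHECK can themselves take a while on a big table:

// requires System.Data.SqlClient
string[] before =
{
    "ALTER INDEX IX_BigTable_SomeColumn ON dbo.BigTable DISABLE",
    "ALTER TABLE dbo.BigTable NOCHECK CONSTRAINT FK_BigTable_Parent",
    "DISABLE TRIGGER trg_BigTable_Audit ON dbo.BigTable"
};
string[] after =
{
    "ALTER INDEX IX_BigTable_SomeColumn ON dbo.BigTable REBUILD",
    "ALTER TABLE dbo.BigTable WITH CHECK CHECK CONSTRAINT FK_BigTable_Parent",
    "ENABLE TRIGGER trg_BigTable_Audit ON dbo.BigTable"
};

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    foreach (var sql in before)
        using (var cmd = new SqlCommand(sql, conn)) cmd.ExecuteNonQuery();

    // ... run the single big set-based UPDATE here ...

    foreach (var sql in after)
        using (var cmd = new SqlCommand(sql, conn)) cmd.ExecuteNonQuery();
}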

NHibernate bulk insert or update

Hi, I'm working on a project where we need to process several XML files once a day and populate a database with the information contained in those files.
Each file is roughly 1 MB and contains about 1000 records; we usually need to process between 12 and 25 of these files. I've seen some information regarding bulk inserts using NHibernate, but our problem is somewhat trickier since the XML files contain new records mixed with updated records.
In the XML there is a flag that tells us whether a specific record is a new one or an update to an existing record, but not what information has changed. The XML records do not contain our DB identifier, but we can use an identifier from the XML record to uniquely locate a record in our DB.
Our strategy so far has been to identify if the current record is an insert or an update and based on that we either perform an insert on the DB or we do a search, then we update the information on the object with the information coming from the xml record and finally we do an update on the DB.
The problem with our current approach is that we are having issues with DB locks and our performance degrades really fast. We have thought about some alternatives, like having separate tables for the distinct operations or even separate DBs, but such a move would mean a big effort, so before any decision I would like to ask for the community's opinion on this matter. Thanks in advance.
A couple of ideas:
Always try to use IStatelessSession for bulk operations (a sketch follows at the end of this answer).
If you're still not happy with the performance, just skip NHibernate and use a stored procedure or parameterized query specific to this, or use IQuery.ExecuteUpdate()
If you're using SQL Server, you could convert your xml format to BCPFORMAT xml then run BULK INSERT on it (only for insertions)
If you're having too many DB locks, try grouping the operations (i.e. first find out what needs to be inserted and what updated, then get PKs for the updates, then run BULK INSERT for insertions, then run updates)
If parsing the source files is a performance issue (i.e. it maxes out a CPU core), try doing it in parallel (you could use Parallel Extensions)
This might help: http://ideas-net.blogspot.com/2009/03/nhibernate-update-performance-issue.html
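As a minimal sketch of the IStatelessSession suggestion above; sessionFactory, MyEntity, ExternalId, parsedXmlRecords, MapToEntity and ApplyXmlValues are placeholders standing in for your own types and mapping code:

using NHibernate;
using NHibernate.Criterion;

using (IStatelessSession session = sessionFactory.OpenStatelessSession())
using (ITransaction tx = session.BeginTransaction())
{
    foreach (var record in parsedXmlRecords)
    {
        if (record.IsNew)
        {
            session.Insert(MapToEntity(record));  // no first-level cache, no dirty tracking
        }
        else
        {
            // locate the existing row by the identifier carried in the xml
            var entity = session.CreateCriteria<MyEntity>()
                .Add(Restrictions.Eq("ExternalId", record.ExternalId))
                .UniqueResult<MyEntity>();
            if (entity != null)
            {
                ApplyXmlValues(entity, record);   // copy the xml values onto the entity
                session.Update(entity);
            }
        }
    }
    tx.Commit();
}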
