I need to update about 250k rows in a table, and each field to update will have a different value depending on the row itself (not calculated from the row id or the key, but externally).
I tried a parameterized query, but it turns out to be slow. (I could still try a table-valued parameter, SqlDbType.Structured, in SQL Server 2008, but I'd like a general way to do it across several databases, including MySQL, Oracle and Firebird.)
Making a huge concatenation of individual updates is also slow, although about 2 times faster than making thousands of individual calls (roundtrips!) with parameterized queries.
What about creating a temp table and running an update that joins my table and the temp one? Would that work faster?
How slow is "slow"?
The main problem with this is that it creates an enormous entry in the database's log file (in case there's a power failure half-way through the update, the database needs to log each action so that it can roll back). That is most likely where the "slowness" comes from, more than anything else. (Obviously, with such a large number of rows there are other ways to make the thing inefficient, e.g. doing one DB roundtrip per update would be unbearably slow; I'm just saying that once you've eliminated the obvious things, you'll still find it's pretty slow.)
There are a few ways you can do it more efficiently. One would be to do the update in chunks, say 1,000 rows at a time. That way the database writes lots of small log entries rather than one really huge one.
Another way would be to turn off, or turn "down", the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "Simple" or "Bulk-logged", which would speed it up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
Edit: Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
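Just as an illustration, here is a minimal sketch of the chunked variant, assuming the new values have already been bulk-loaded into a temp table #NewValues (Id, NewValue) on the same open connection, and that Value is never NULL (rows drop out of the WHERE clause once updated, so each pass picks up the next chunk). All table and column names here are made up:

using System.Data.SqlClient;

var chunkedUpdate = @"
    UPDATE TOP (1000) t
    SET    t.Value = s.NewValue
    FROM   dbo.MyTable t
    JOIN   #NewValues s ON s.Id = t.Id
    WHERE  t.Value <> s.NewValue;";

int affected;
do
{
    // 'connection' is the open SqlConnection on which #NewValues was created
    using (var cmd = new SqlCommand(chunkedUpdate, connection))
        affected = cmd.ExecuteNonQuery();   // each pass is its own small transaction with a small log entry
} while (affected > 0);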
call a stored procedure if possible
If the columns being updated are part of indexes, you could:
drop these indexes
do the update
re-create the indexes.
If you need these indexes to retrieve the data, well, it doesn't help.
You should use SqlBulkCopy with the KeepIdentity option set.
As part of a SqlTransaction, do a query to SELECT all the records that need updating and then DELETE them, returning those selected (and now removed) records. Read them into C# in a single batch. Update the records on the C# side in memory, now that you've narrowed the selection, and then SqlBulkCopy those updated records back, keys and all. Don't forget to commit the transaction. It's more work, but it's very fast.
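A loose sketch of that flow, assuming SQL Server (the DELETE ... OUTPUT clause returns the removed rows in one statement); the table name, the NeedsUpdate filter and ComputeNewValue are placeholders for whatever identifies and produces your changes:

using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        // select + delete in one statement, returning the removed rows
        var rows = new DataTable();
        using (var cmd = new SqlCommand(
            "DELETE FROM dbo.MyTable OUTPUT DELETED.* WHERE NeedsUpdate = 1;", conn, tx))
        {
            new SqlDataAdapter(cmd).Fill(rows);
        }

        // apply the changes in memory
        foreach (DataRow row in rows.Rows)
            row["Value"] = ComputeNewValue(row);   // hypothetical client-side logic

        // push the modified rows back, keeping the original key values
        using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.KeepIdentity, tx))
        {
            bulk.DestinationTableName = "dbo.MyTable";
            bulk.WriteToServer(rows);
        }

        tx.Commit();
    }
}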
Here's what I would do:
Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
Calculate/produce those changes
Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
Run SQL on the server to copy the new values from the temporary table into the production table (a rough sketch of these last two steps follows below, after the pros and cons).
Pros:
Running the final step server-side is faster than running tons and tons of individual SQL statements, so you're going to lock the table in question for a shorter time
Bulk insert like this is fast
Cons:
Requires extra space in your database for the temporary table
Produces more log data, logging both the bulk insert and the changes to the production table
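A rough sketch of steps 3 and 4, assuming a target table dbo.MyTable(Id, Value) and changes computed client-side as (Id, NewValue) pairs; all names here are illustrative:

using System.Data;
using System.Data.SqlClient;

var changes = new DataTable();
changes.Columns.Add("Id", typeof(int));
changes.Columns.Add("NewValue", typeof(string));
// ... fill 'changes' with the key + new value for every row you intend to change ...

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // the temp table lives for the lifetime of this connection
    new SqlCommand("CREATE TABLE #Changes (Id int PRIMARY KEY, NewValue nvarchar(100));", conn)
        .ExecuteNonQuery();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Changes" })
        bulk.WriteToServer(changes);                     // step 3: fast bulk load

    new SqlCommand(@"UPDATE t
                     SET    t.Value = c.NewValue
                     FROM   dbo.MyTable t
                     JOIN   #Changes c ON c.Id = t.Id;", conn)
        .ExecuteNonQuery();                              // step 4: one server-side statement
}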
Here are things that can make your updates slow:
executing updates one by one through a parameterized query
solution: do the update in one statement
a large transaction creates a big log entry
see codeka's answer
updating indexes (the RDBMS will update the index after each row; if you change an indexed column, it can be very costly on a large table)
if you can, drop the indexes before the update and recreate them after
updating a field that has a foreign key constraint - for each modified record the RDBMS will go and look up the corresponding key
if you can, disable foreign key constraints before the update and re-enable them after
triggers and row-level checks
if you can, disable triggers before the update and re-enable them after (SQL Server syntax for all three of these toggles is sketched after this list)
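To make the last three concrete, here is what those toggles look like in SQL Server syntax, driven from C#; the index, table and trigger names are hypothetical, and MySQL, Oracle and Firebird each have their own equivalents:

using System.Data.SqlClient;

// 'conn' is an open SqlConnection
void Run(SqlConnection c, string sql) => new SqlCommand(sql, c).ExecuteNonQuery();

Run(conn, "ALTER INDEX IX_MyTable_Value ON dbo.MyTable DISABLE;");      // stop per-row index maintenance
Run(conn, "ALTER TABLE dbo.MyTable NOCHECK CONSTRAINT ALL;");           // stop per-row FK lookups
Run(conn, "DISABLE TRIGGER ALL ON dbo.MyTable;");                       // stop triggers firing

// ... run the one-statement update here ...

Run(conn, "ENABLE TRIGGER ALL ON dbo.MyTable;");
Run(conn, "ALTER TABLE dbo.MyTable WITH CHECK CHECK CONSTRAINT ALL;");  // re-validate the data against the FKs
Run(conn, "ALTER INDEX IX_MyTable_Value ON dbo.MyTable REBUILD;");      // rebuilding also re-enables the index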
Related
I am trying to figure out the best way to design my C# application which utilizes data from a SQL Server backend.
My application periodically has to update 55K rows, one at a time, from a loop. Before it does an update, it needs to check whether the record to be updated exists.
If it exists it updates one field.
If not it performs an insert of 4 fields.
The table to be updated has 600K rows.
What is the most efficient way to handle these updates/inserts from my application?
Should I create a dictionary in C#, load the 600K records, and query the dictionary first instead of the database?
Is this a faster approach?
Should I use a stored procedure?
What’s the best way to achieve maximum performance based on this scenario?
You could use SqlBulkCopy to upload to a temp table then have a SQL Server job do the merge.
You should try to avoid "update 55K rows each one at a time from a loop". That will be very slow.
Instead, try to find a way to batch the updates (n at a time). Look into SQL Server table-valued parameters as a way to send a set of data to a stored procedure.
Here's an article on updating multiple rows with TVPs: http://www.sqlmag.com/article/sql-server-2008/using-table-valued-parameters-to-update-multiple-rows
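For reference, a hedged sketch of the C# side of a TVP call; the table type dbo.RowUpdate, the procedure dbo.UpdateRows and the column names are assumptions and would need to exist on the server roughly as shown in the comment:

using System.Data;
using System.Data.SqlClient;

// assumed to exist on the server:
//   CREATE TYPE dbo.RowUpdate AS TABLE (Id int PRIMARY KEY, NewValue nvarchar(100));
//   CREATE PROCEDURE dbo.UpdateRows @rows dbo.RowUpdate READONLY AS
//       UPDATE t SET t.Value = r.NewValue
//       FROM dbo.MyTable t JOIN @rows r ON r.Id = t.Id;

var rows = new DataTable();
rows.Columns.Add("Id", typeof(int));
rows.Columns.Add("NewValue", typeof(string));
// ... add one row per record to update ...

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.UpdateRows", conn) { CommandType = CommandType.StoredProcedure })
{
    var p = cmd.Parameters.AddWithValue("@rows", rows);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.RowUpdate";

    conn.Open();
    cmd.ExecuteNonQuery();   // one round trip for the whole batch
}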
What if you did something like this, instead?
By some means, get those 55,000 rows of data into the database, if they're not already there. (If you're currently getting those rows from some query, arrange instead for the query results to be stored in a temporary table on that database. This might be a proper application for a stored procedure.)
Now, you could express the operations that you need to perform, perhaps, as two separate SQL queries: one to do the updates, and one or more others to do the inserts. The first query might use a clause such as "WHERE FOO IN (SELECT BAR FROM #TEMP_TABLE ...)" to identify the rows to be updated. The others might use "WHERE FOO NOT IN (...)"
This is, to be precise, exactly the sort of thing I would expect to need a stored procedure for, because, if you think about it, the SQL server itself is precisely the right party to be doing the work: it's the only one that already has on hand the data you intend to manipulate. It alone doesn't have to "transmit" those 55,000 rows anywhere. Perfect.
I have a requirement where I need to update thousands of records in a live database table, and although there are many columns in this table, I only need to update 2-3 of them.
Further, I can't hit the database thousands of times just for updates; that can be done in one batch using a SQL Server table-valued parameter. But I also shouldn't update all the records in one go, for better error handling; instead I want to update records in batches of x*100.
So, below is my approach. Please give your valuable inputs on any other alternatives or any change in the proposed process -
1 Fetch required records from database to List<T> MainCollection
2 Save this collection to XML file with each element Status = Pending
3 Take first 'n' elements from XML file with Status = Pending and add them to new List<T> SubsetCollection
4 Loop over List<T> SubsetCollection - make required changes to T
5 Convert List<T> SubsetCollection to DataTable
6 Call Update Stored Procedure and pass above DataTable as TVP
7 Update Status = Processed for XML Elements corresponding to List<T> SubsetCollection
8 If more records with Pending status exist in the XML file, go to Step #3.
Please guide for a better approach or any enhancement in above process.
I would do a database-only approach if possible, and if that's not possible, eliminate the parts that will be the slowest. If you are unable to do all the work in a stored procedure, then retrieve all the records and make the changes in the application.
The next step is to write the changes to a staging table with SqlBulkCopy. This is a fast bulk loader that will copy thousands of records in seconds. You will store the primary key and the columns to be updated, as well as a batch number. The batch number is assigned to each batch of records, allowing another batch to be loaded without conflicting with the first.
Use a stored procedure on the server to process the records in batches of 100 or 1,000, depending on performance. Pass the batch number to the stored procedure.
We use such a method to load and update millions of records in batches. The best speed is obtained by eliminating the network and allowing the database server to handle the bulk of the work.
I hope this might provide you with an alternate solution to evaluate.
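A loose sketch of that flow; the staging table dbo.UpdateStaging, the procedure dbo.ProcessUpdateBatch, the 'changes' list and the column names are all illustrative assumptions:

using System.Data;
using System.Data.SqlClient;

var staging = new DataTable();
staging.Columns.Add("Id", typeof(int));
staging.Columns.Add("NewValue", typeof(string));
staging.Columns.Add("BatchNumber", typeof(int));

int batchNumber = 0;
for (int i = 0; i < changes.Count; i++)
{
    if (i % 1000 == 0) batchNumber++;                    // 1,000 rows per batch
    staging.Rows.Add(changes[i].Id, changes[i].NewValue, batchNumber);
}

// one fast bulk load of every change, tagged with its batch number ('conn' is an open SqlConnection)
using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.UpdateStaging" })
    bulk.WriteToServer(staging);

// the server then applies one batch at a time
for (int b = 1; b <= batchNumber; b++)
{
    using (var cmd = new SqlCommand("dbo.ProcessUpdateBatch", conn) { CommandType = CommandType.StoredProcedure })
    {
        cmd.Parameters.AddWithValue("@BatchNumber", b);
        cmd.ExecuteNonQuery();
    }
}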
It may not be best practice, but you could embed some logic inside a SQL Server CLR function. This function could be called by a query, a stored proc, or a schedule set to run at a certain time.
The only issue I can see is getting step 4 to make the required changes on T. Embedding that logic into the database could be detrimental to maintenance, but this is no different from people who embed massive amounts of business logic into stored procs.
Either way, SQL Server CLR functions may be the way to go. You can create them in Visual Studio 2008 or 2010 (check the database project types under new project).
Tutorial : http://msdn.microsoft.com/en-us/library/w2kae45k(v=vs.80).aspx
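For what it's worth, a hedged sketch of that idea, written here as a CLR stored procedure rather than a function (data modification is easier from a procedure); the table, staging table and all names are made up:

using Microsoft.SqlServer.Server;
using System.Data.SqlClient;

public class BatchUpdates
{
    [SqlProcedure]
    public static void ProcessUpdateBatch(int batchNumber)
    {
        // the context connection runs inside the calling session - no extra round trip
        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            var cmd = new SqlCommand(
                @"UPDATE t
                  SET    t.Value = s.NewValue
                  FROM   dbo.MyTable t
                  JOIN   dbo.UpdateStaging s ON s.Id = t.Id
                  WHERE  s.BatchNumber = @batch;", conn);
            cmd.Parameters.AddWithValue("@batch", batchNumber);
            cmd.ExecuteNonQuery();
        }
    }
}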
I'm using a SQL Server and I have a specific table that can contain ~1 million to ~10 million records max.
For each record I retrieve, I do some checks (I run a few simple lines of code), and then I want to mark that the record was checked at DateTime.Now.
So what I do is retrieve a record, check some stuff, run an 'update' query to set the 'last_checked_time' field to DateTime.Now, and then move to the next record.
I can then get all the records ordered by their 'last_checked_time' field (ascending), and iterate over them in order of their check time.
Is this good practice? Can it still remain speedy as long as I have no more than 10 million records in that table?
I've read somewhere that every 'update' query is actually a deletion and a creation of a new record.
I'd also like to mention that these records will be frequently retrieved by my ASP.NET website.
I was thinking of writing the 'last_checked_time' to a local txt/binary file, but I'm guessing that would mean implementing something the database can already do for you.
If you need that "last checked time" value, then the best, most efficient place to hold it is on the row in the table. It doesn't matter how many rows there are in the table; each update will affect just the row(s) you updated.
How an update is implemented is up to the DBMS, but it is not generally done by deleting and re-inserting the row.
I would recommend retrieving your data, or a portion of it, doing your checks on all of them, and sending the updates back in transactions to let the database operate more effectively. This means fewer round trips.
As to whether this is good practice, I would say yes, especially since you are using it in your queries. Definitely do not store the last checked time in a file and try to match it up after you load your database data. The RDBMS is designed to handle this for you efficiently. Don't reinvent the wheel.
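As a rough illustration of batching those check-time updates ('conn', 'checkedIds' and the table/column names are assumptions):

using System;
using System.Data;
using System.Data.SqlClient;

// 'conn' is an open SqlConnection; 'checkedIds' is a List<int> of ids you have just checked
const int batchSize = 500;
for (int start = 0; start < checkedIds.Count; start += batchSize)
{
    using (var tx = conn.BeginTransaction())
    using (var cmd = new SqlCommand(
        "UPDATE dbo.Records SET last_checked_time = @now WHERE id = @id;", conn, tx))
    {
        cmd.Parameters.Add("@now", SqlDbType.DateTime);
        cmd.Parameters.Add("@id", SqlDbType.Int);

        int end = Math.Min(start + batchSize, checkedIds.Count);
        for (int i = start; i < end; i++)
        {
            cmd.Parameters["@now"].Value = DateTime.Now;
            cmd.Parameters["@id"].Value = checkedIds[i];
            cmd.ExecuteNonQuery();
        }
        tx.Commit();   // one commit per batch rather than one per row
    }
}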
Personally, I see no issues with it. It seems perfectly reasonable to store the last checked time in the database, especially since it might be used in queries (for example, to find records that haven't been checked in over a week).
Maybe (just maybe) you could create a new table containing two columns: the id of the row in the first table and the checked date.
That way you wouldn't alter the original table, but depending on how you use the data and the check date, you would be forced to make a joined query, which may be something you also don't want to do.
It makes sense to store the 'checked time' as part of the row you're updating, rather than in a separate file or even a separate table in the database. This approach should provide optimal performance and help to maintain consistency. Solutions involving more than one table or external data stores may introduce a requirement for distributed or multi-table transactional updates that involve significant locking, which can negatively impact performance and make it much more difficult to guarantee consistency.
In general, solutions that minimize the scope of transactions and, by extension, locking, are worth striving for. Also, simplicity itself is a useful goal.
I need to find the best way to insert or update data in a database using SQL Server and ASP.NET. It is a standard scenario: if the data exists it is updated, if not it is inserted. I know there are many topics here about that, but no one has answered what I need to know.
So my problem is that there is really no problem when you update/insert 5k-10k rows, but what about 50k and more?
My first idea was to use the SQL Server 2008 MERGE command, but I have some performance concerns if it will be 50k+ rows. Also, I don't know if I can merge data this way based not on the primary id key (int) but on another unique key in the table (to be precise, a product serial number that will not change over time).
My second idea was to first get all the product serials, compare the new data serials with those, divide the data into rows to insert and rows to update, and then do one bulk insert and one bulk update.
I just don't know which will be better. With MERGE I don't know what the performance will be, and it is supported only by SQL Server 2008, but it looks quite simple; the second option doesn't need SQL 2008 and the batches should be fast, but selecting all the serials first and dividing the data based on them could have some performance penalties.
What is your opinion? What should I choose?
MERGE performs way better, because "one of the most important advantages of the MERGE statement is that all the data is read and processed only once".
You don't need a primary key; you can join on one or more fields that make your records unique.
There should be no problem performing the merge on the serial number as you've described it. You may want to read Optimizing MERGE Statement Performance for Microsoft's recommended best practices when using MERGE.
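For illustration, a hedged sketch of what such a MERGE might look like, assuming the new data has first been bulk-loaded into a staging table; every table and column name here is made up:

using System.Data.SqlClient;

var merge = @"
    MERGE dbo.Products AS target
    USING dbo.ProductStaging AS source
        ON target.SerialNumber = source.SerialNumber
    WHEN MATCHED THEN
        UPDATE SET target.Name = source.Name, target.Price = source.Price
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (SerialNumber, Name, Price)
        VALUES (source.SerialNumber, source.Name, source.Price);";

// 'conn' is an open SqlConnection to the SQL Server 2008 database
new SqlCommand(merge, conn).ExecuteNonQuery();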
I have a C# project that uses SQL Server Compact Edition and Entity Framework for data access. I need to insert or update a large number of rows, 5000+ or more, in the db: if the key exists, update the record, if not, insert it. I cannot find a way to do this with Compact Edition and EF without horrible performance, i.e. taking 2 minutes plus on a Core i7 computer.
I have tried searching for the record to see if it exists and then inserting if not or updating if it does; the search is the killer there. I have tried compiling the search query, and that only gave a small improvement. Another thing I've tried is inserting the record in a try/catch and updating if it fails, but that forces me to SaveChanges on every record to get the exception (as opposed to at the end), which is a performance killer. Obviously I can't use stored procedures, since it is Compact Edition. I've also looked at just executing T-SQL directly on the db somehow, but the lack of procedural statements in Compact seems to rule that out.
I've searched the world over and am out of ideas. I really want to use Compact if I can, over Express, for the deployment benefits and the ability to prevent the user from digging around in the db. Any suggestions would be appreciated.
Thanks
When we're using SQL CE (and SQL 2005 Express, for that matter) we always call an update first, and then call an insert if the update gives a row count of 0. This is very simple to implement and does not require expensive try..catch blocks for control flow.
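In ADO.NET terms that looks roughly like the following (shown with SqlCommand for brevity; SqlCeCommand works the same way, and the table/column names are placeholders):

using System.Data.SqlClient;

// 'conn' is an open connection; 'item' is the record being upserted
var update = new SqlCommand("UPDATE Items SET Value = @value WHERE Id = @id;", conn);
update.Parameters.AddWithValue("@id", item.Id);
update.Parameters.AddWithValue("@value", item.Value);

if (update.ExecuteNonQuery() == 0)        // nothing matched, so the row doesn't exist yet
{
    var insert = new SqlCommand("INSERT INTO Items (Id, Value) VALUES (@id, @value);", conn);
    insert.Parameters.AddWithValue("@id", item.Id);
    insert.Parameters.AddWithValue("@value", item.Value);
    insert.ExecuteNonQuery();
}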
Maybe you could obtain the result you seek by using simple queries.
Let's say the table you want to insert into or update is like this:
CREATE TABLE original (
    id integer,
    value char(100)
)
First, you could create a temporary table with the new values (you can use SELECT INTO or other ways to create it):
CREATE TABLE temp (
    id integer,
    value char(100)
)
Now you need to do two things: update the rows in original, and then insert the new values:
UPDATE original
SET original.value = temp.value
FROM original, temp
WHERE original.id = temp.id
INSERT INTO original
SELECT * FROM temp
WHERE temp.id NOT IN (SELECT o.id FROM original o)
Given your problem statement, I'm going to guess that this software assumes a relatively beefy environment. Have you considered taking the task of determining what exists off of SQL CE and doing it yourself? Essentially, grab a sorted list of all the IDs (keys?) from the relevant table and check every object's key against that list before queueing it for insertion.
This makes a few assumptions that would be bad news with a typical DB, but that you can probably get away with in SQL CE. E.g., it assumes that rows won't be inserted or significantly modified by a different user while you're performing this insert.
If the list of keys is too long to reasonably hold in memory for such a check, I'm afraid I'd say that SQL CE just might not be the right tool for the job. :(
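A small sketch of that client-side check, assuming the key is an int Id and 'items' is whatever collection you're trying to upsert (all names are made up):

using System.Collections.Generic;
using System.Data.SqlServerCe;
using System.Linq;

// 'conn' is an open SqlCeConnection
var existing = new HashSet<int>();
using (var cmd = new SqlCeCommand("SELECT Id FROM Products;", conn))
using (var reader = cmd.ExecuteReader())
    while (reader.Read())
        existing.Add(reader.GetInt32(0));

// split the work before touching the database again
var toUpdate = items.Where(i => existing.Contains(i.Id)).ToList();
var toInsert = items.Where(i => !existing.Contains(i.Id)).ToList();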
I'm not sure if this is feasible or not, as I haven't used the Entity Framework, but have you tried running the update first and checking the rowcount -- inserting if no rows were updated? This may be faster than catching exceptions. It's generally a bad practise to use exceptions for control flow, and often slows things down dramatically.
If you can write the SQL directly, then the fastest way to do it would be to get all the data into a temporary table and then update what exists and insert the rest (as in Andrea Bertani's example above). You should get slightly better results by using a left join to the original table in the select of your insert, keeping only the rows where the original table's key comes back NULL:
INSERT INTO original
SELECT temp.*
FROM temp
LEFT JOIN original ON original.id = temp.id
WHERE original.id IS NULL
I would recommend using SqlCeResultSet directly. You lose the nice type-safety of EF, but performance is incredibly fast. We switched from ADO.NET 2.0-style typed DataSets to SqlCeResultSet and SqlCeDataReader and saw 20 to 50 times increases in speed.
See SqlCeResultSet. For a .NET CF project I removed almost all SQL code in favor of this class.
Just search for "SqlCeResultSet" here and on MSDN.
A quick overview:
Open the result set.
If you need to seek (for the existence check), you will have to provide an index for the result set.
Seek on the result set and read to check whether you found the row. This is extremely fast even on tables with tens of thousands of rows (because the seek uses the index).
Insert or update the record (see SqlCeResultSet.CreateRecord).
We have successfully developed a project with a SQL CE database whose main product table has over 65,000 rows (read/write, with 4 indexes).
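Putting the overview above into code, roughly (the table name, index name, columns and the 'itemsToUpsert' collection are assumptions):

using System.Data;
using System.Data.SqlServerCe;

using (var conn = new SqlCeConnection(connectionString))
{
    conn.Open();
    var cmd = conn.CreateCommand();
    cmd.CommandType = CommandType.TableDirect;
    cmd.CommandText = "Products";            // the table, opened directly
    cmd.IndexName = "PK_Products";           // the index the seek will use

    using (var rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable | ResultSetOptions.Scrollable))
    {
        foreach (var item in itemsToUpsert)  // hypothetical collection of changes
        {
            if (rs.Seek(DbSeekOptions.FirstEqual, item.Id) && rs.Read())
            {
                // found: update the existing row in place
                rs.SetValue(rs.GetOrdinal("Price"), item.Price);
                rs.Update();
            }
            else
            {
                // not found: insert a new row
                var rec = rs.CreateRecord();
                rec.SetValue(rs.GetOrdinal("Id"), item.Id);
                rec.SetValue(rs.GetOrdinal("Price"), item.Price);
                rs.Insert(rec);
            }
        }
    }
}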
SQL Server Compact Edition is pretty early in its development at this point. Also, depending on your device, memory/disk access can be pretty slow, and the SQL CE plus .NET type-safety overhead is fairly heavy. It works best with a fairly static data store.
I suggest you either use a lighter-weight API or consider SQLite.