I have a CSV file with 22,000 rows. The first time, I read all the rows from the CSV file and add them to the database using Entity Framework; db.SaveChanges() takes 45 seconds. On subsequent runs it reads all the records, adds only the new ones and updates the existing ones, and this takes 2 minutes...
What is the best way to insert records in bulk?
When saving with EF, there are three distinct actions that can occur, and they differ greatly in speed. Inserts and deletes are done in a single call to SQL Server, but updates are done one record at a time so that EF can check for concurrency issues and report back which entities failed and why.
If you want to stick with EF, you should INSERT everything first and then do a MERGE (a SQL command) on the server. You can also issue the MERGE from a SqlCommand in C# if you don't have the ability to create a stored procedure.
The best and fastest option is to use BULK INSERT, if you can.
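For example, after bulk inserting the new rows into a staging table, the MERGE can be issued straight from C# like this. This is only a rough sketch: dbo.Customer, dbo.Customer_Staging, the column names and connectionString are placeholders for your own schema.

```csharp
// requires: using System.Data.SqlClient;
const string mergeSql = @"
    MERGE dbo.Customer AS target
    USING dbo.Customer_Staging AS source
        ON target.CustomerId = source.CustomerId
    WHEN MATCHED THEN
        UPDATE SET target.Name = source.Name, target.Email = source.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, Email)
        VALUES (source.CustomerId, source.Name, source.Email);";

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(mergeSql, conn))
{
    cmd.CommandTimeout = 300;   // give the server time for a large merge
    conn.Open();
    cmd.ExecuteNonQuery();      // one round trip, set-based on the server
}
```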
I have a list of records that I need to insert or update in a SQL database, depending on whether each record is already present in the database.
The current flow is that I process each record one by one and call a stored procedure from my C# code that does the insert or update.
This process is very inefficient. Can I use SqlBulkCopy to insert all of these into the SQL database at once?
Will that improve performance?
SqlBulkCopy can only insert. If you need to upsert, you might want to SqlBulkCopy into a staging table (a separate table off to one side that isn't part of the main model), and then do the merge in TSQL. You might also want to think about concurrency (how many people can be using the staging table at once, etc).
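A rough sketch of that idea follows. The staging table dbo.Customer_Staging and the stored procedure dbo.Upsert_Customers are placeholder names; the proc would contain the MERGE/upsert T-SQL, and sourceTable is whatever DataTable or IDataReader holds your records.

```csharp
// requires: using System.Data; using System.Data.SqlClient;
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();

    // 1. Push the incoming rows into the staging table in bulk.
    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "dbo.Customer_Staging";
        bulk.BatchSize = 5000;               // stream the rows in batches
        bulk.WriteToServer(sourceTable);     // a DataTable or IDataReader with your records
    }

    // 2. Do the actual upsert server-side in one set-based operation.
    using (SqlCommand cmd = new SqlCommand("dbo.Upsert_Customers", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.ExecuteNonQuery();
    }
}
```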
So I'm upgrading an old parser right now. It's written in C# and uses SQL to insert records into a database.
Currently it reads and parses a few thousand lines of data from a file, then inserts the new data into a database containing over a million records.
Sometimes it can take over 10 minutes just to add a few thousand lines.
I've come to the conclusion that the performance bottleneck is a SQL command that uses an IF NOT EXISTS check to determine whether the row being inserted already exists, and inserts the record only if it doesn't.
I believe the problem is that it simply takes too long to run IF NOT EXISTS for every single row in the new data.
Is there a faster way to determine whether the data already exists or not?
I was thinking of inserting all of the records first using the SqlBulkCopy class anyway, then running a stored procedure to remove the duplicates.
Does anyone else have any suggestions or methods to do this as efficiently and quickly as possible? Anything would be appreciated.
EDIT: To clarify, I'd run a stored procedure (on the large table) after copying the new data into the large table
large table = 1,000,000+ rows
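For concreteness, what I have in mind is something like this, except using a staging table so the duplicates never land in the large table at all. The table names, key columns and connectionString are just placeholders.

```csharp
// requires: using System.Data.SqlClient;
const string insertNewRowsSql = @"
    INSERT INTO dbo.BigTable (Key1, Key2, Payload)
    SELECT s.Key1, s.Key2, s.Payload
    FROM   dbo.BigTable_Staging AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.BigTable AS t
                       WHERE  t.Key1 = s.Key1 AND t.Key2 = s.Key2);

    TRUNCATE TABLE dbo.BigTable_Staging;";

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(insertNewRowsSql, conn))
{
    cmd.CommandTimeout = 0;
    conn.Open();
    cmd.ExecuteNonQuery();   // one set-based statement instead of one IF NOT EXISTS per row
}
```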
1. Create an IDataReader to loop over your source data.
2. Place the values into a strongly typed DataSet.
3. Every N rows, send the DataSet (via .GetXml()) to a stored procedure. Let's say 1,000 for the heck of it.
4. Have the stored procedure shred the XML.
5. Do your INSERT/UPDATE based on this shredded XML.
6. Return from the procedure, keep looping until you're done.
Here is an older example:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
The key is that you are doing "bulk" operations instead of row-by-row ones, and you can pick a sweet-spot batch size (1,000, for example) that gives you the best performance. Something along these lines:
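(In this sketch the stored procedure dbo.UpsertFromXml is hypothetical; it would shred the XML with .nodes() and do the INSERT/UPDATE. reader is your IDataReader over the source data, and connectionString is your connection string.)

```csharp
// requires: using System; using System.Data; using System.Data.SqlClient;
const int batchSize = 1000;

DataSet ds = new DataSet("Batch");
DataTable dt = ds.Tables.Add("Row");
dt.Columns.Add("Key", typeof(int));
dt.Columns.Add("Value", typeof(string));

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();

    Action sendBatch = delegate
    {
        if (dt.Rows.Count == 0) return;
        using (SqlCommand cmd = new SqlCommand("dbo.UpsertFromXml", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add("@xml", SqlDbType.Xml).Value = ds.GetXml();
            cmd.ExecuteNonQuery();
        }
        dt.Clear();
    };

    while (reader.Read())                       // 'reader' is your IDataReader over the source data
    {
        dt.Rows.Add(reader.GetInt32(0), reader.GetString(1));
        if (dt.Rows.Count == batchSize)
            sendBatch();                        // ~1,000 rows per round trip instead of one
    }
    sendBatch();                                // send the final partial batch
}
```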
I have over 1 million rows that I check for changes and then update.
I completed my program; it goes over each record and then updates the database. But this operation takes a couple of hours to complete, even with multithreading. I have optimized the queries, inserts and checks to minimize the database load. That gave much better results, but it is still very slow.
Is there any way to maintain a DataTable with the correct records in memory, and then upload the whole data structure as a 'virtual table' to SQL Server in one update and let SQL Server handle the updates?
I have seen something similar done in the past via a function on a PostgreSQL server (without involving C#). I need my program to finish in minutes, not a couple of hours.
Either insert your new data into a temp table with bulk copy, or use a table-valued parameter (TVP); then use the SQL MERGE command to update the rows in the existing table.
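Here is a rough sketch of the TVP variant. The table type dbo.RowUpdate, the target dbo.Target and its columns are made-up names; the type would be created once on the server with something like CREATE TYPE dbo.RowUpdate AS TABLE (Id INT PRIMARY KEY, Value NVARCHAR(100)).

```csharp
// requires: using System.Data; using System.Data.SqlClient;
DataTable updates = new DataTable();
updates.Columns.Add("Id", typeof(int));
updates.Columns.Add("Value", typeof(string));
// ... fill 'updates' with the changed rows ...

const string mergeSql = @"
    MERGE dbo.Target AS t
    USING @rows AS s ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Value = s.Value
    WHEN NOT MATCHED THEN
        INSERT (Id, Value) VALUES (s.Id, s.Value);";

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(mergeSql, conn))
{
    SqlParameter p = cmd.Parameters.AddWithValue("@rows", updates);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.RowUpdate";          // the user-defined table type
    conn.Open();
    cmd.ExecuteNonQuery();
}
```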
Have you looked at the SqlBulkCopy Class?
System.Data.SqlClient.SqlBulkCopy
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
I have a requirement where I need to update thousands of records in a live database table, and although there are many columns in this table, I only need to update 2-3 of them.
Further, I can't hit the database thousands of times just for this update, which can instead be done as a batch update using a SQL Server table-valued parameter. But I also shouldn't update all of the records in one go, for better error handling; instead I want to update them in batches of x*100.
So, below is my approach; please give your input on any alternatives or changes to the proposed process:
1. Fetch the required records from the database into List<T> MainCollection
2. Save this collection to an XML file, with each element set to Status = Pending
3. Take the first 'n' elements with Status = Pending from the XML file and add them to a new List<T> SubsetCollection
4. Loop over List<T> SubsetCollection and make the required changes to each T
5. Convert List<T> SubsetCollection to a DataTable
6. Call the update stored procedure and pass the above DataTable as a TVP
7. Update Status = Processed for the XML elements corresponding to List<T> SubsetCollection
8. If more records with Pending status exist in the XML file, go to step 3.
Please suggest a better approach or any enhancement to the above process.
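For concreteness, here is a rough sketch of steps 3-6 without the XML file, just splitting the collection into batches and sending each one as a TVP. The proc dbo.UpdateRecords, the table type dbo.RecordUpdateType, and the MyRecord type with its properties are placeholders.

```csharp
// requires: using System.Collections.Generic; using System.Data;
//           using System.Data.SqlClient; using System.Linq;
const int n = 500;                                   // batch size
for (int i = 0; i < mainCollection.Count; i += n)
{
    List<MyRecord> subset = mainCollection.Skip(i).Take(n).ToList();

    DataTable table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("ColumnA", typeof(string));
    table.Columns.Add("ColumnB", typeof(decimal));
    foreach (MyRecord item in subset)
        table.Rows.Add(item.Id, item.ColumnA, item.ColumnB);   // only the 2-3 columns to update

    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlCommand cmd = new SqlCommand("dbo.UpdateRecords", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter p = cmd.Parameters.AddWithValue("@records", table);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.RecordUpdateType";
        conn.Open();
        cmd.ExecuteNonQuery();           // if this batch fails, earlier batches stay committed
    }
}
```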
I would take a database-only approach if possible and, if that is not possible, eliminate the parts that will be the slowest. If you are unable to do all the work in a stored procedure, then retrieve all the records and make the changes.
The next step is to write the changes to a staging table with SqlBulkCopy. This is a fast bulk loader that will copy thousands of records in seconds. You store the primary key and the columns to be updated, as well as a batch number. The batch number is assigned to each batch of records, allowing another batch to be loaded without conflicting with the first.
Use a stored procedure on the server to process the records in batches of 100 or 1000 depending on performance. Pass the batch number to the stored procedure.
We use such a method to load and update millions of records in batches. The best speed is obtained by eliminating the network and allowing the database server to handle the bulk of the work.
I hope this might provide you with an alternate solution to evaluate.
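A rough sketch of what that looks like on the client side. The staging table dbo.Update_Staging and the procedure dbo.ProcessUpdateBatch are placeholder names; the proc works through the rows for its batch number in chunks of 100-1,000.

```csharp
// requires: using System; using System.Data; using System.Data.SqlClient;
int batchNumber = Environment.TickCount;            // or take the next value from a SEQUENCE

DataTable staging = new DataTable();
staging.Columns.Add("BatchNumber", typeof(int));
staging.Columns.Add("Id", typeof(int));             // primary key of the row to update
staging.Columns.Add("NewValue", typeof(string));    // the column(s) to update
// ... fill 'staging', stamping every row with batchNumber ...

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();

    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "dbo.Update_Staging";
        bulk.WriteToServer(staging);                 // thousands of rows land in seconds
    }

    using (SqlCommand cmd = new SqlCommand("dbo.ProcessUpdateBatch", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@BatchNumber", batchNumber);
        cmd.ExecuteNonQuery();                       // server applies only this batch's rows
    }
}
```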
It may not be best practice, but you could embed some of the logic inside a SQL Server CLR function. This function could be called from a query or a stored procedure, or scheduled to run at a certain time.
The only issue I can see is getting step 4 to make the required changes on T. Embedding that logic in the database could be detrimental to maintenance, but this is no different from embedding massive amounts of business logic in stored procedures.
Either way, SQL Server CLR functions may be the way to go. You can create them in Visual Studio 2008 and 2010 (check the new database project types).
Tutorial : http://msdn.microsoft.com/en-us/library/w2kae45k(v=vs.80).aspx
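A minimal sketch of what such a SQLCLR procedure looks like. The table dbo.Target and the set-based statement are placeholders for your actual step-4 logic, and the compiled assembly still has to be registered with CREATE ASSEMBLY / CREATE PROCEDURE on the server.

```csharp
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class BatchUpdater
{
    [SqlProcedure]
    public static void ProcessPendingRows()
    {
        // "context connection=true" runs on the caller's own connection - no network hop.
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "UPDATE dbo.Target SET Status = 'Processed' WHERE Status = 'Pending'", conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```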
I need to update about 250k rows in a table, and each field to update will have a different value depending on the row itself (not calculated from the row id or the key, but determined externally).
I tried a parameterized query, but it turns out to be slow (I could still try a table-valued parameter, SqlDbType.Structured, in SQL Server 2008, but I'd like a general way to do it across several databases, including MySQL, Oracle and Firebird).
Concatenating the individual updates into one huge batch is also slow (but about two times faster than making thousands of individual calls (round trips!) with parameterized queries).
What about creating a temp table and running an UPDATE that joins my table and the temp one? Would that be faster?
How slow is "slow"?
The main problem with this is that it would create an enormous entry in the database's log file (in case there's a power failure halfway through the update, the database needs to log each action so that it can roll back in the event of failure). This is most likely where the "slowness" is coming from, more than anything else (though obviously with such a large number of rows there are other ways to make the thing inefficient [e.g. doing one DB round trip per update would be unbearably slow]; I'm just saying that once you eliminate the obvious things, you'll still find it's pretty slow).
There are a few ways you can do it more efficiently. One would be to do the update in chunks, say 1,000 rows at a time. That way, the database writes lots of small log entries rather than one really huge one.
Another way would be to turn off, or turn "down", the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "simple" or "bulk-logged", which would speed it up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
Edit Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
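To make the "chunks of 1,000" part concrete, here is one way to drive it from C# once the new values are sitting in a staging table. dbo.NewValues, dbo.Target and their columns are placeholder names.

```csharp
// requires: using System.Data.SqlClient;
const string chunkSql = @"
    DECLARE @batch TABLE (Id INT PRIMARY KEY, NewValue NVARCHAR(100));

    -- Pull (and remove) up to 1,000 staged rows...
    DELETE TOP (1000) FROM dbo.NewValues
    OUTPUT DELETED.Id, DELETED.NewValue INTO @batch;

    -- ...and apply them to the real table as one small, cheaply logged statement.
    UPDATE t
    SET    t.Value = b.NewValue
    FROM   dbo.Target AS t
    JOIN   @batch     AS b ON b.Id = t.Id;";

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    int affected;
    do
    {
        using (SqlCommand cmd = new SqlCommand(chunkSql, conn))
        {
            cmd.CommandTimeout = 0;
            affected = cmd.ExecuteNonQuery();   // 0 once dbo.NewValues is empty
        }
    } while (affected > 0);
}
```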
call a stored procedure if possible
If the columns being updated are part of indexes, you could:
drop these indexes
do the update
re-create the indexes.
If you need these indexes to retrieve the data, well, it doesn't help.
You should use SqlBulkCopy with the SqlBulkCopyOptions.KeepIdentity flag set.
As part of a SqlTransaction, do a query to SELECT all the records that need updating and then DELETE them, returning the selected (and now removed) records. Read them into C# in a single batch. Update the records on the C# side in memory, now that you've narrowed the selection, and then SqlBulkCopy those updated records back, keys and all. And don't forget to commit the transaction. It's more work, but it's very fast.
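Roughly like this; dbo.Target, its NeedsUpdate filter and the Transform call are placeholders for your own schema and logic.

```csharp
// requires: using System.Data; using System.Data.SqlClient;
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (SqlTransaction tx = conn.BeginTransaction())
    {
        // Delete the rows that need updating and capture them in one round trip.
        DataTable rows = new DataTable();
        using (SqlCommand cmd = new SqlCommand(
            "DELETE FROM dbo.Target OUTPUT DELETED.* WHERE NeedsUpdate = 1", conn, tx))
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            rows.Load(reader);
        }

        // Update the records in memory on the C# side.
        foreach (DataRow row in rows.Rows)
        {
            row["Value"] = Transform((string)row["Value"]);   // 'Transform' is your own logic
        }

        // Bulk copy them back, keeping the original identity/key values.
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.KeepIdentity, tx))
        {
            bulk.DestinationTableName = "dbo.Target";
            bulk.WriteToServer(rows);
        }

        tx.Commit();
    }
}
```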
Here's what I would do:
Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
Calculate/produce those changes
Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
Run SQL on the server to copy new values from the temporary table into the production table.
Pros:
Running the final step server-side is faster than running tons and tons of individual SQL, so you're going to lock the table in question for a shorter time
Bulk insert like this is fast
Cons:
Requires extra space in your database for the temporary table
Produces more log data, logging both the bulk insert and the changes to the production table
Here are things that can make your updates slow:
executing updates one by one through parameterized queries
solution: do the update in one statement
a large transaction creates a big log entry
see codeka's answer
updating indexes (the RDBMS updates the index after each row; if you change an indexed column, it can be very costly on a large table)
if you can, drop the indexes before the update and recreate them after
updating a field that has a foreign key constraint - for each affected record the RDBMS will go and look up the appropriate key
if you can, disable foreign key constraints before the update and enable them after
triggers and row-level checks
if you can, disable the triggers before the update and enable them after (see the sketch below)
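A sketch of how the last three items can be wrapped around the update. dbo.Target, dbo.NewValues and IX_Target_Value are placeholder names, and only disable non-clustered indexes (disabling the clustered index makes the table unreadable).

```csharp
// requires: using System; using System.Data.SqlClient;
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    Action<string> exec = sql =>
    {
        using (SqlCommand cmd = new SqlCommand(sql, conn)) { cmd.CommandTimeout = 0; cmd.ExecuteNonQuery(); }
    };

    exec("ALTER INDEX IX_Target_Value ON dbo.Target DISABLE;");          // index on the updated column
    exec("ALTER TABLE dbo.Target NOCHECK CONSTRAINT ALL;");              // skip FK/check lookups
    exec("DISABLE TRIGGER ALL ON dbo.Target;");                          // skip triggers
    try
    {
        exec(@"UPDATE t SET t.Value = s.NewValue
               FROM dbo.Target AS t
               JOIN dbo.NewValues AS s ON s.Id = t.Id;");
    }
    finally
    {
        exec("ENABLE TRIGGER ALL ON dbo.Target;");
        exec("ALTER TABLE dbo.Target WITH CHECK CHECK CONSTRAINT ALL;"); // re-validate the data
        exec("ALTER INDEX IX_Target_Value ON dbo.Target REBUILD;");
    }
}
```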