I am facing an issue that I hope to get solved here. I have 3 different tables in a DataSet and I want to insert them into the database tables.
I know I can do this using SqlBulkCopy, but there is a catch: if the data already exists in the database, I want it to be updated instead of inserted.
If the data doesn't exist in the database table, I want to insert it. Any help on this would be appreciated.
I know I can iterate through each record and call a stored procedure that checks whether the row exists, then updates it or inserts it. But the data set is huge and iterating through each record would be time-consuming, so I don't want to use this approach.
Regards
Disclaimer: I'm the owner of the project Bulk Operations
This project lets you BulkInsert, BulkUpdate, BulkDelete, and BulkMerge (Upsert).
Under the hood, it does almost what @marc_s has suggested (use SqlBulkCopy into a temporary table and perform a merge statement to insert or update depending on the primary key).
var bulk = new BulkOperation(connection);
bulk.BulkMerge(dt);
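For anyone who wants to do it by hand, the merge step described above could look roughly like this in T-SQL (SQL Server 2008 or later). This is only a sketch of the general "bulk copy into a temp table, then MERGE" pattern, not the library's actual code. The names dbo.Customer, #CustomerStaging, CustomerId, Name and Email are made up for illustration, and it assumes SqlBulkCopy has already filled the temp table on the same connection.

```sql
-- Hypothetical names: dbo.Customer (target), #CustomerStaging (temp table
-- already filled by SqlBulkCopy), CustomerId (primary key), Name, Email.
MERGE dbo.Customer AS target
USING #CustomerStaging AS source
    ON target.CustomerId = source.CustomerId
-- Key already exists in the target: overwrite the data columns.
WHEN MATCHED THEN
    UPDATE SET target.Name  = source.Name,
               target.Email = source.Email
-- Key is missing from the target: insert the row.
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, Email)
    VALUES (source.CustomerId, source.Name, source.Email);
```

Each of the three tables in the DataSet would need its own temp table and its own MERGE statement.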
I am trying to insert a huge amount of data into SQL Server. My destination table has a unique index called "Hash".
I would like to replace my SqlDataAdapter implementation with SqlBulkCopy. SqlDataAdapter has a property called "ContinueUpdateOnError"; when it is set to true, adapter.Update(table) inserts all the rows it can and tags the failed rows through their RowError property.
The question is how can I use SqlBulkCopy to insert data as quickly as possible while keeping track of which rows got inserted and which rows did not (due to the unique index)?
Here is the additional information:
The process is iterative, often set on a schedule to repeat.
The source and destination tables can be huge, sometimes millions of rows.
Even though it is possible to check for the hash values first, that requires two transactions per row (first to select the hash from the destination table, then to perform the insertion). I think in the adapter.Update(table) case, it is faster to check RowError than to check for hash hits per row.
SqlBulkCopy has very limited error handling facilities; by default it doesn't even check constraints.
However, it's fast, really really fast.
If you want to work around the duplicate key issue and identify which rows are duplicates in a batch, one option is:
start tran
Grab a TABLOCKX on the table, select all current "Hash" values and chuck them in a HashSet.
Filter out the duplicates and report.
Insert the data
commit tran
This process will work effectively if you are inserting huge sets and the size of the initial data in the table is not too huge.
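On the SQL side, the steps above boil down to something like the following sketch. dbo.Destination is a made-up table name, and the HashSet filtering happens in client code that is not shown here.

```sql
-- Hypothetical name: dbo.Destination (table with the unique index on [Hash]).
BEGIN TRANSACTION;

-- TABLOCKX takes an exclusive table lock that is held until the transaction
-- ends, so no other session can add hashes between this read and the insert.
SELECT [Hash]
FROM dbo.Destination WITH (TABLOCKX);

-- Client side (not shown): load the returned values into a HashSet, drop and
-- report the source rows whose hash is already present, then bulk insert the
-- remaining rows on this same connection and transaction.

COMMIT TRANSACTION;
```

For the lock to protect the insert, the SqlBulkCopy has to be created with the same connection and transaction (it has a constructor overload that accepts an external SqlTransaction).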
Can you please expand your question to include the rest of the context of the problem.
EDIT
Now that I have some more context here is another way you can go about it:
Do the bulk insert into a temp table.
start serializable tran
Select all temp rows that are already in the destination table ... report on them
Insert the data in the temp table into the real table, performing a left join on hash and including all the new rows.
commit the tran
That process is very light on round trips, and considering your specs should end up being really fast.
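A rough T-SQL sketch of the temp table steps above, under a few assumptions: #Incoming has already been filled by the bulk insert, it contains no duplicate hashes within itself, and the names dbo.Destination, #Incoming and Payload are invented for the example.

```sql
-- Hypothetical names: dbo.Destination (real table), #Incoming (temp table
-- already filled by the bulk insert), [Hash] (unique key), Payload (data).
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Rows that already exist in the destination: return them for reporting.
SELECT i.[Hash]
FROM #Incoming AS i
INNER JOIN dbo.Destination AS d ON d.[Hash] = i.[Hash];

-- Rows with no match in the destination: insert only these.
INSERT INTO dbo.Destination ([Hash], Payload)
SELECT i.[Hash], i.Payload
FROM #Incoming AS i
LEFT JOIN dbo.Destination AS d ON d.[Hash] = i.[Hash]
WHERE d.[Hash] IS NULL;

COMMIT TRANSACTION;
```

Both statements run on the server, so only the duplicate hashes travel back to the client.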
Slightly different approach than already suggested; Perform the SqlBulkCopy and catch the SqlException thrown:
Violation of PRIMARY KEY constraint 'PK_MyPK'. Cannot insert duplicate
key in object 'dbo.MyTable'. **The duplicate key value is (17)**.
You can then remove all items from your source from ID 17, the first record that was duplicated. I'm making assumptions here that apply to my circumstances and possibly not yours; i.e. that the duplication is caused by the exact same data from a previously failed SqlBulkCopy due to SQL/Network errors during the upload.
Note: This is a recap of Sam's answer with slightly more details
Thanks to Sam for the answer. I have put it in an answer due to comment's space constraints.
Deriving from your answer I see two possible approaches:
Solution 1:
start tran
grab all possible hit "hash" values by doing "select hash from destinationtable where hash in (val1, val2, ...)"
filter out duplicates and report
insert data
commit tran
solution 2:
Create a temp table to mirror the schema of the destination table
bulk insert into the temp table
start serializable transaction
Get duplicate rows: "select tempTable.hash from tempTable inner join destinationTable on tempTable.hash = destinationTable.hash"
report on duplicate rows
Insert the data in the temp table into the destination table: "insert into destinationTable select tempTable.* from tempTable left join destinationTable on tempTable.hash = destinationTable.hash where destinationTable.hash is null"
commit the tran
Since we have two approaches, it comes down to which approach is the most optimized? Both approaches have to retrieve the duplicate rows and report while the second approach requires extra:
temp table creation and delete
one more sql command to move data from temp to destination table
depending on the percentage of hash collisions, it also transfers a lot of unnecessary data across the wire
If these are the only solutions, it seems to me that the first approach wins. What do you guys think? Thanks!
I have used the BulkCopy command to transfer about 3 to 5 million rows from one table to another. Now I want to update these rows.
Is there any BulkUpdate command similar to the BulkCopy command? I'm using ASP.NET with C#.
No, there isn't.
Q: What's an "lac"?
This might help:
http://itknowledgeexchange.techtarget.com/itanswers/bulk-update-in-sql-server-2005/
Assuming that you have a column with distinct values to show you which
rows are which between the two tables this can be done with a simple
update statement.
UPDATE TableA
SET TableA.A1 = TableB.B1,
TableA.A2 = TableB.B2
FROM TableB
WHERE TableA.A3 = TableB.B3
If you are worried about creating one massive transaction you can
batch the operation into smaller chunks. This is done via the TOP
keyword.
UPDATE TOP (1000) TableA
SET TableA.A1 = TableB.B1,
TableA.A2 = TableB.B2
FROM TableB
WHERE TableA.A3 = TableB.B3
-- OR, not AND: a row needs updating when either column differs
AND (TableA.A1 <> TableB.B1
OR TableA.A2 <> TableB.B2)
You can put that into a loop...
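Such a loop might look like the sketch below. It keeps running the batched update until a pass touches no rows. It assumes B3 is unique in TableB and that A1, A2, B1 and B2 are not nullable (NULLs never satisfy <>, so those rows would be skipped). The filter uses OR so that a row differing in either column is picked up; that filter is also what lets the loop terminate, since updated rows stop matching.

```sql
DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Update at most 1000 rows that still differ from TableB.
    UPDATE TOP (1000) TableA
    SET TableA.A1 = TableB.B1,
        TableA.A2 = TableB.B2
    FROM TableB
    WHERE TableA.A3 = TableB.B3
      AND (TableA.A1 <> TableB.B1 OR TableA.A2 <> TableB.B2);

    -- Number of rows changed by this batch; 0 means nothing is left to do.
    SET @rows = @@ROWCOUNT;
END;
```

Outside an explicit transaction each batch commits on its own, which keeps the individual transactions small.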
Here's another link (with basically the same solution):
http://www.sqlusa.com/bestpractices2005/hugeupdate/
A common approach here is:
bulk-load (SqlBulkCopy) into an empty staging table - meaning: a table with the same columns/types as the actual data, but not part of the main transactional system
now do an update joining the real data to the staging data, to update the values in the real data
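As a minimal sketch, the update step could look like this once the staging table has been loaded. The names dbo.Orders, dbo.Orders_Staging, OrderId, Status and Amount are invented for the example.

```sql
-- Hypothetical names: dbo.Orders (real table), dbo.Orders_Staging (staging
-- table filled by SqlBulkCopy), OrderId (key), Status and Amount (data).
UPDATE o
SET o.Status = s.Status,
    o.Amount = s.Amount
FROM dbo.Orders AS o
INNER JOIN dbo.Orders_Staging AS s ON s.OrderId = o.OrderId;

-- Empty the staging table so it is ready for the next load.
TRUNCATE TABLE dbo.Orders_Staging;
```

An index on the staging table's key column usually helps the join when millions of rows are involved.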
Disclaimer: I'm the owner of the project Bulk Operations
The Bulk Operations Library lets you insert, delete, update and merge millions of rows in a few seconds.
It's very easy to learn and use if you already know the SqlBulkCopy class.
var bulk = new BulkOperation(connection);
// ... Mappings ....
bulk.BulkUpdate(dt);
I have a DataSet with two TableAdapters (1 to many relationship) that was created using visual studio 2010's Configuration Wizard.
I make a call to an external source and populate a Dictionary with the results. These results should be all of the entries in the database. To synchronize the DB I don't want to just clear all of the tables and then repopulate them like dropping the tables and creating them with new data in sql.
Is there a clean way, possibly using the TableAdapter.Fill() method, or do I have to loop through the two tables row by row, decide whether each row stays or gets deleted, and then add the new entries? What is the best approach to make the data in the dictionary the only data in my two tables in the DataSet?
First question: if it's the same DB, why do you have 2 tables with the same information?
To the question at hand: that largely depends on the sizes. If the tables are not big, then use a transaction, clear the table (DELETE FROM TABLE or whatever) and write your data in there again.
If the tables are big, on the other hand, the question is: can you load all this into your dictionary?
Of course you have to ask yourself what happens to inconsistent data (another user/app changed the data while you had it in your dictionary).
If this takes too long you could remember what you did to the data. That means: flag the changed data, remember the deleted keys and newly inserted rows, and make your updates based on that.
Both can be achieved by remembering the filled DataTable and using it as a backing field, or by implementing your own mechanisms.
In any case I would recommend thinking about the problem: do you really need the dictionary? Why not make queries against the database to get the data? Or only cache a part of the data for quick access?
PS: the Update method on your DataAdapter will do all the work (changing the changed, removing the deleted and inserting the new DataRows), but it will update the DataTable/DataSet, so this will only work once.
It could be that it is quicker to repopulate the entire table than to iterate through and decide which records go or stay. Could you not decide whether a record is deleted via an SQL statement (delete from table where active = false)? If you want the records to stay in the database but not in the DataSet, filter when selecting instead (select * from table where active = true).
You could have a date field and select all records that have been added since the date you last 'polled' the database (select * from table where active = true and date-added > #12:30#)
This is my first post.. I have 2 SQL Server databases located on different servers..
Let's call the source data table SDT (in source database SDB) and the destination data table DDT (in database DDB).
I'm using C# for bulk copying from SDT to DDT..
My code is something like this:
sqlcommand = "Delete from DDT where locID = @LocIDParam" // @LocIDParam is the parameter for a specific location //
then bulk copy "Select * from SDT where locID = @LocIDParam" // the steps are well known..
I just don't want to go for useless details..
However, my SDT holds a huge amount of data, so bulk copying the whole table causes high traffic.
Is there any way to bulk copy only the updated records from SDT to DDT, as well as insert the new ones?
Do you think using an SQL trigger for updated and newly inserted data is the best idea for this kind of scenario? (A trigger that inserts the primary key value of each new or updated row into a single-column table, then deleting from and inserting into DDT based on that.)
PS. I don't want to use SQL replication for that since it has a lot of problems..
Thank you in advance
From the date I suppose you have already found your solution. In case not, here is how we deal with a somewhat similar situation.
On the source table we have a column that shows whether the data has to be sent to the destination. We use a boolean, but you could also use a datetime field that holds the last update date.
Then our pull process does the following:
Pull all the flagged data into a temporary table on the destination server
Update records that exist in both tables
Insert all records from the temporary table that don't exist in the destination table
Drop the temporary table
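Once the flagged rows have been pulled, the remaining steps might be sketched like this. The names #Pulled, Id and Payload are assumptions for the example; only DDT and locID come from the question.

```sql
-- Hypothetical names: #Pulled (temp table holding the flagged source rows),
-- Id (primary key), Payload (data column). DDT and locID are from the question.

-- Update records that exist in both tables.
UPDATE d
SET d.locID   = p.locID,
    d.Payload = p.Payload
FROM DDT AS d
INNER JOIN #Pulled AS p ON p.Id = d.Id;

-- Insert records from the temporary table that are missing in the destination.
INSERT INTO DDT (Id, locID, Payload)
SELECT p.Id, p.locID, p.Payload
FROM #Pulled AS p
WHERE NOT EXISTS (SELECT 1 FROM DDT AS d WHERE d.Id = p.Id);

-- Drop the temporary table.
DROP TABLE #Pulled;
```

Running the update before the insert avoids touching the freshly inserted rows a second time.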
If you use SQL Server 2008, there is a MERGE statement that I am not familiar with. Here is a link that explains it:
SQL 2008 MERGE command
Hope this helps if you still need it.