I need to update a column in a large table (over 30 million rows) that has no primary key. Each row has a unique email address column. The update involves generating a value (which must happen in C#) and appending it to a column value, so each row must be read, the column value updated, and the row written back out.
I was hoping there was a concept of cursoring in ADO.NET, but I don't see one. I can read the rows quickly enough, but the update call, using a WHERE clause on the email address, takes forever. After researching this, most answers seem to be "put in a primary key!", but that is not an option here. Any thoughts?
For a 30-million-row heap there aren't many options; without any index there is basically nothing you can do to speed it up.
One thing to check is the fragmentation of the heap. You could add a clustered index to fix the fragmentation and then drop it immediately. But if you cannot change that table in any way, it could be faster to move all the data into a new table :-)
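A minimal sketch of the index-first approach, assuming SQL Server. All names here (BigTable, Email, Payload) and the connection string are placeholders, not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch only -- table/column names (BigTable, Email, Payload) and the
// connection string are placeholders, not from the question.
string connectionString = "Server=.;Database=MyDb;Integrated Security=true";
// (email, suffix) pairs produced by the C# read/generate pass
var workItems = new List<(string Email, string Suffix)>();

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // A temporary index on Email turns each UPDATE's WHERE lookup from a
    // full heap scan into an index seek. Drop it afterwards if you must
    // leave the table exactly as you found it.
    using (var ix = new SqlCommand(
        "CREATE NONCLUSTERED INDEX IX_BigTable_Email ON BigTable (Email)", conn))
    {
        ix.CommandTimeout = 0; // building the index on 30M rows takes a while
        ix.ExecuteNonQuery();
    }

    using (var update = new SqlCommand(
        "UPDATE BigTable SET Payload = Payload + @suffix WHERE Email = @email", conn))
    {
        update.Parameters.Add("@email", SqlDbType.NVarChar, 256);
        update.Parameters.Add("@suffix", SqlDbType.NVarChar, 256);
        foreach (var item in workItems)
        {
            update.Parameters["@email"].Value = item.Email;
            update.Parameters["@suffix"].Value = item.Suffix;
            update.ExecuteNonQuery();
        }
    }
}
```

The one-time cost of building the index is paid back across the 30 million seek-based updates that follow.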
I have a SQL database where one of the columns is a varchar value. This value is always unique; it's not decided by me but by a 3rd-party application that supplies the data, its length is undefined, and it's a mixture of numbers and letters. I should add that it's not declared as unique in the database because, to my knowledge, you can't do that for a varchar type?
Each week I run an import of this data from a csv file, however, the only way I know how to check if I'm importing a unique value is to loop through each row in the database and compare it to each line in the csv file to check if the corresponding value is unique.
Obviously this is very inefficient and is only going to get worse over time as the database gets bigger.
I've tried checking Google, but to no avail; it could be that I'm looking for the wrong thing, though.
Any pointers would be much appreciated.
Application is written in C#
Look at running a MERGE command in SQL instead of an INSERT; it lets you explicitly specify the action to take on a duplicate.
Note that if the unique field has a unique index, then searching for a value is O(log n) and not O(n). This means that the overall cost of inserting N values is O(N log N) and not O(N²). As N gets large, this is a substantial performance improvement.
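As a sketch of the MERGE suggestion (all table and column names here are assumptions, not from the question), the weekly CSV data could be staged into a table and merged in one statement:

```sql
-- Hypothetical names: Imports is a staging table loaded from the CSV file,
-- Records is the destination, ExternalId is the unique varchar value.
MERGE INTO Records AS target
USING Imports AS source
    ON target.ExternalId = source.ExternalId
WHEN MATCHED THEN
    UPDATE SET target.Payload = source.Payload
WHEN NOT MATCHED THEN
    INSERT (ExternalId, Payload)
    VALUES (source.ExternalId, source.Payload);
```

One round trip handles both the new and the already-present values, with no per-row looping in C#.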
Index the table on the unique field.
Do an 'if exists' on the unique key field's value. If it returns true, the row exists: update the row. If it returns false, this is a new row: insert the row.
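A minimal sketch of that check-then-write pattern in T-SQL, assuming a hypothetical Records table whose unique field is ExternalId (the parameters would be supplied from C#):

```sql
-- Hypothetical names; @ExternalId and @Payload are parameters from C#.
IF EXISTS (SELECT 1 FROM Records WHERE ExternalId = @ExternalId)
    UPDATE Records SET Payload = @Payload WHERE ExternalId = @ExternalId;
ELSE
    INSERT INTO Records (ExternalId, Payload) VALUES (@ExternalId, @Payload);
```

With an index on ExternalId, the EXISTS check is a seek rather than a scan, so this stays fast as the table grows.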
I am trying to insert huge amount of data into SQL server. My destination table has an unique index called "Hash".
I would like to replace my SqlDataAdapter implementation with SqlBulkCopy. In SqlDataAdapter there is a property called "ContinueUpdateOnError"; when set to true, adapter.Update(table) will insert all the rows it can and tag the error rows with the RowError property.
The question is how can I use SqlBulkCopy to insert data as quickly as possible while keeping track of which rows got inserted and which rows did not (due to the unique index)?
Here is the additional information:
The process is iterative, often set on a schedule to repeat.
The source and destination tables can be huge, sometimes millions of rows.
Even though it is possible to check the hash values first, that requires two round trips per row (one to select the hash from the destination table, then one to perform the insertion). I think that in adapter.Update(table)'s case it is faster to check the RowError than to check for hash hits row by row.
SqlBulkCopy has very limited error-handling facilities; by default it doesn't even check constraints.
However, it's fast. Really, really fast.
If you want to work around the duplicate key issue and identify which rows in a batch are duplicates, one option is:
start tran
Grab a TABLOCKX on the table, select all current "Hash" values, and put them in a HashSet.
Filter out the duplicates and report.
Insert the data
commit tran
This process will work effectively if you are inserting huge sets and the size of the initial data in the table is not too huge.
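The steps above can be sketched in C# as follows. The table dbo.Dest, its unique Hash column, and the connection string are assumptions; 'source' stands for the DataTable of rows to insert:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch of the steps above. dbo.Dest, its unique Hash column, and the
// connection string are assumptions; 'source' is the DataTable to insert.
string connectionString = "Server=.;Database=MyDb;Integrated Security=true";
DataTable source = new DataTable(); // populated by the caller

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tran = conn.BeginTransaction())
    {
        // TABLOCKX keeps other writers out while we read the existing hashes.
        var existing = new HashSet<string>();
        using (var cmd = new SqlCommand(
            "SELECT Hash FROM dbo.Dest WITH (TABLOCKX)", conn, tran))
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                existing.Add(reader.GetString(0));

        // Filter out and report duplicates. HashSet.Add returns false for a
        // value already seen, so this also catches duplicates within the batch.
        var duplicates = new List<DataRow>();
        foreach (DataRow row in source.Rows)
            if (!existing.Add((string)row["Hash"]))
                duplicates.Add(row);
        foreach (DataRow dup in duplicates)
        {
            Console.WriteLine("Duplicate hash skipped: " + dup["Hash"]);
            source.Rows.Remove(dup);
        }

        using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.Default, tran))
        {
            bulk.DestinationTableName = "dbo.Dest";
            bulk.WriteToServer(source);
        }
        tran.Commit();
    }
}
```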
Could you please expand your question to include the rest of the context of the problem?
EDIT
Now that I have some more context, here is another way you can go about it:
Do the bulk insert into a temp table.
start serializable tran
Select all temp rows that are already in the destination table ... report on them
Insert the data in the temp table into the real table, performing a left join on hash and including all the new rows.
commit the tran
That process is very light on round trips and, considering your specs, should end up being really fast.
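The edited steps could look like this in T-SQL. Table and column names are assumptions, and #Staging stands for the temp table that the bulk insert already loaded:

```sql
-- Hypothetical names: #Staging mirrors dbo.Dest and was loaded by SqlBulkCopy.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Report the rows that already exist in the destination.
SELECT s.*
FROM #Staging AS s
JOIN dbo.Dest AS d ON d.Hash = s.Hash;

-- Insert only the new rows.
INSERT INTO dbo.Dest (Hash, Payload)
SELECT s.Hash, s.Payload
FROM #Staging AS s
LEFT JOIN dbo.Dest AS d ON d.Hash = s.Hash
WHERE d.Hash IS NULL;

COMMIT TRANSACTION;
```

The serializable isolation level keeps another writer from inserting a conflicting hash between the report and the insert.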
A slightly different approach than those already suggested: perform the SqlBulkCopy and catch the SqlException thrown:
Violation of PRIMARY KEY constraint 'PK_MyPK'. Cannot insert duplicate
key in object 'dbo.MyTable'. **The duplicate key value is (17)**.
You can then remove all items from your source starting at ID 17, the first record that was duplicated. I'm making assumptions here that apply to my circumstances and possibly not yours, i.e. that the duplication is caused by the exact same data from a previously failed SqlBulkCopy, due to SQL/network errors during the upload.
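A small sketch of that idea. The helper below pulls the duplicate key value out of the exception message quoted above; it assumes the English-language SQL Server error text, so a localized server would need a different pattern:

```csharp
using System;
using System.Text.RegularExpressions;

// Extract the duplicate key value from the SqlException message quoted
// above. Assumes the English SQL Server error text.
static string ExtractDuplicateKey(string message)
{
    var match = Regex.Match(message, @"The duplicate key value is \((.+)\)");
    return match.Success ? match.Groups[1].Value : null;
}

// Usage sketch around the bulk copy (error 2627 = PK violation;
// a unique index violation raises 2601 instead):
//
// try { bulkCopy.WriteToServer(table); }
// catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
// {
//     var key = ExtractDuplicateKey(ex.Message); // e.g. "17"
//     // drop the already-inserted rows up to that key, then retry the rest
// }
```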
Note: This is a recap of Sam's answer with slightly more details
Thanks to Sam for the answer. I have put this in an answer of its own due to the comments' space constraints.
Deriving from your answer I see two possible approaches:
Solution 1:
start tran
grab all possible hit "hash" values by doing SELECT Hash FROM destinationTable WHERE Hash IN (val1, val2, ...)
filter out duplicates and report
insert data
commit tran
Solution 2:
Create a temp table mirroring the schema of the destination table
bulk insert into the temp table
start serializable transaction
Get duplicate rows: SELECT tempTable.Hash FROM tempTable JOIN destinationTable ON tempTable.Hash = destinationTable.Hash
report on duplicate rows
Insert the data from the temp table into the destination table: INSERT INTO destinationTable SELECT tempTable.* FROM tempTable LEFT JOIN destinationTable ON tempTable.Hash = destinationTable.Hash WHERE destinationTable.Hash IS NULL
commit the tran
Since we have two approaches, it comes down to which one is more optimized. Both approaches have to retrieve and report the duplicate rows, while the second additionally requires:
temp table creation and deletion
one more SQL command to move the data from the temp table to the destination table
depending on the percentage of hash collisions, transferring a lot of unnecessary data across the wire
If these are the only solutions, it seems to me that the first approach wins. What do you guys think? Thanks!
In my database, there is a table which essentially contains questions with their options and answers. The first field is questionid and is the primary key, as expected (I've disabled AUTO_INCREMENT for now). It's possible that my client will want to delete some questions. This leaves me with two options:
All subsequent questions move up so that there is no empty row. This option implies that those questions will have their question ids changed.
Leave it as it is, so there will be empty rows. If there's a new entry, it should fill the first empty row.
How do I go about implementing any of them? I prefer the second, actually, but if anyone has a different opinion, it's welcome.
I'm using a MySQL database and C#.
You are using a database so you don't have to worry about these issues.
There is no concept of "empty" row in a SQL table (well, one could say if all the columns are NULL then the row is empty, but that is not relevant here). Rows in a SQL table are not inherently ordered.
The rows themselves are stored on pages, which may or may not have extra space for more rows. This may be what you are thinking of when you think of an empty row.
When a row is deleted, the data is not rearranged. There is just some additional space on the page in case a new row is added later. If you add a new row with a primary key between two existing rows, and the page is full, then the database "splits" the page into two. Both resulting pages then have extra space.
The important point, though, is not how this works. One reason you are using a relational database for your application is so you can add and delete rows without having to worry about their actual physical storage.
If you have a database that has lots of transactions -- deletions and insertions -- then you may want to periodically rearrange the data so it fits better on the pages. Such optimizations though are usually necessary only when there is a high volume of such transactions.
One thing, though: your application should not depend on the primary keys being sequential, so that it can handle deletes correctly.
I am not sure how you have implemented it. I would have done it this way:
questions
    question_id - pk
    question
answers
    answer_id - pk
    answer
question_answer
    question_id
    answer_id
This gives you more flexibility: many questions can share the same answer, and if a question is deleted, you delete it along with its rows in the question_answer table.
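A sketch of that layout as MySQL DDL (all names are illustrative):

```sql
-- Illustrative schema for the suggestion above.
CREATE TABLE questions (
    question_id INT PRIMARY KEY,
    question    TEXT NOT NULL
);

CREATE TABLE answers (
    answer_id INT PRIMARY KEY,
    answer    TEXT NOT NULL
);

CREATE TABLE question_answer (
    question_id INT NOT NULL,
    answer_id   INT NOT NULL,
    PRIMARY KEY (question_id, answer_id),
    FOREIGN KEY (question_id) REFERENCES questions (question_id),
    FOREIGN KEY (answer_id)   REFERENCES answers (answer_id)
);

-- Deleting a question then only touches its own rows:
-- DELETE FROM question_answer WHERE question_id = 42;
-- DELETE FROM questions WHERE question_id = 42;
```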
I have a C# app which allows the user to update some columns in a DB. My problem is that I have 300,000 records in the DB, and updating just 50,000 of them took 30 minutes. Can I do something to speed things up?
My update query looks like this:
UPDATE SET UM = 'UM', Code = 'Code' WHERE Material = 'MaterialCode'.
My only unique constraint is Material. I read the file the user selects, put the data in a DataTable, and then go row by row, updating the corresponding material in the DB.
Limit the number of indexes in your database, especially if your application updates data very frequently. Each index takes up disk space and slows the adding, deleting, and updating of rows, so you should create new indexes only after analyzing how the data is used, the types and frequencies of the queries performed, and how your queries will use the new indexes.
In many cases, the speed advantages of creating new indexes outweigh the disadvantages of the additional space used and slower row modification. However, avoid redundant indexes and create them only when necessary. For a read-only table, the number of indexes can be increased.
Use a non-clustered index on the table if updates are frequent.
Use a clustered index on the table if updates/inserts are not frequent.
The C# code may not be the problem; your update statement is what matters. The WHERE clause of the update statement is the place to look: you need an indexed column there.
Another thing: is the field Material indexed? Also, does the WHERE clause need to be on a varchar field, or could it be an integer-valued field?
Performance will be better if you filter on integer fields rather than strings. Not sure if this is possible for you.
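Putting those suggestions together, a sketch under assumptions (the table name Materials and the connection string are placeholders; 'table' is the DataTable built from the user's file):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

// Sketch under assumptions: the table is called Materials, Material is the
// unique column, and 'table' is the DataTable built from the user's file.
string connectionString = "Server=.;Database=MyDb;Integrated Security=true";
DataTable table = new DataTable(); // filled from the selected file

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // One-time: an index on Material lets each UPDATE seek instead of scan.
    using (var ix = new SqlCommand(
        "CREATE UNIQUE INDEX IX_Materials_Material ON Materials (Material)", conn))
        ix.ExecuteNonQuery();

    // One transaction and one prepared command for all rows, instead of
    // 50,000 individually parsed and committed statements.
    using (var tran = conn.BeginTransaction())
    using (var cmd = new SqlCommand(
        "UPDATE Materials SET UM = @um, Code = @code WHERE Material = @material",
        conn, tran))
    {
        cmd.Parameters.Add("@um", SqlDbType.NVarChar, 50);
        cmd.Parameters.Add("@code", SqlDbType.NVarChar, 50);
        cmd.Parameters.Add("@material", SqlDbType.NVarChar, 50);
        cmd.Prepare();

        foreach (DataRow row in table.Rows)
        {
            cmd.Parameters["@um"].Value = row["UM"];
            cmd.Parameters["@code"].Value = row["Code"];
            cmd.Parameters["@material"].Value = row["Material"];
            cmd.ExecuteNonQuery();
        }
        tran.Commit();
    }
}
```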