I'm doing some work that involves inserting a batch of records into a SQL database. The size of the batch will vary, but for argument's sake we can say 5000 records every 5 seconds. It is likely to be less, though. Multiple processes will be writing to this table; nothing is reading from it.
What I have noticed during a quick test is that using a SqlTransaction around this whole batch insert seems to improve performance.
e.g.
SqlTransaction trans = Connection.BeginTransaction();
myStoredProc.Transaction = trans;
sampleData.ForEach(ExecuteNonQueryAgainstDB);
trans.Commit();
I'm not interested in having the ability to roll back my changes, so I wouldn't really have considered using a transaction except that it seems to improve performance. If I remove this transaction code my inserts go from taking 300 ms up to around 800 ms!
What is the logic for this? My understanding is that the transaction still writes the data to the DB but locks the records until it is committed, so I would have expected it to add overhead...
What I am looking for is the fastest way to do this insert.
The commit is what costs time. Without your explicit transaction, you have one transaction per query executed. With the explicit transaction, no additional transaction is created for your queries. So, you have one transaction vs. multiple transactions. That's where the performance improvement comes from.
If you are looking for a fast way to insert/load data, have a look at the SqlBulkCopy class.
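A minimal sketch of how that can look (the destination table dbo.SampleData, the batch size, and the DataTable are placeholders for whatever your real schema uses):

using System.Data;
using System.Data.SqlClient;

static void BulkInsert(string connectionString, DataTable sampleData)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.SampleData";
            bulkCopy.BatchSize = 5000;          // send rows to the server in chunks
            bulkCopy.WriteToServer(sampleData); // one streamed load instead of 5000 INSERT statements
        }
    }
}

WriteToServer streams the rows as a bulk load, so you avoid the per-statement (and per-commit) overhead entirely.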
What you're getting is perfectly normal.
If you're working with a usual isolation level (say, Read Committed or Snapshot), then when you don't use transactions the database engine has to check for conflicts every time you make an insert. That is, it has to make sure that whenever someone reads from that table (with a SELECT *, for example) they don't get dirty reads; in other words, it has to protect each insertion so that while the insert itself is taking place no one else is reading.
That means: lock, insert row, unlock, lock, insert row, unlock, and so on.
When you wrap all of that in a transaction, what you're effectively achieving is reducing that series of "lock" and "unlock" operations to just one, in the commit phase.
I've just finished writing a blog post on the performance gains you can get by explicitly specifying where transactions start and finish.
With Dapper I have observed transactions cutting batch insert times down to 1/2 of the original and batch update times down to 1/3 of the original.
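For illustration, a rough sketch of the pattern, assuming a dbo.SampleData table with Id and Name columns and a sampleData collection whose items expose matching properties:

using System.Data.SqlClient;
using Dapper;

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var transaction = connection.BeginTransaction())
    {
        // Dapper runs the INSERT once per item in sampleData,
        // but every statement shares the single transaction and commit.
        connection.Execute(
            "INSERT INTO dbo.SampleData (Id, Name) VALUES (@Id, @Name)",
            sampleData,
            transaction);
        transaction.Commit();
    }
}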
Does a transaction lock my table when I'm running multiple queries?
Example: if another user tries to send data at the same time that I am using a transaction, what will happen?
Also, how can I avoid this while still being sure that all the data has been inserted into the database successfully?
BEGIN TRAN;
INSERT INTO Customers (Name) VALUES (@name1);
UPDATE CustomerTrans
SET CustomerName = @name2;
COMMIT;
You have to implement the transaction smartly. Below are some performance-related points:
Locking, optimistic vs. pessimistic. With pessimistic locking the whole table may be locked, but with optimistic locking only the specific rows are locked.
Isolation level, Read Committed vs. Read Uncommitted. When the table is locked, it depends on your business scenario: if dirty reads are acceptable, you can read using WITH (NOLOCK).
Use a WHERE clause in your updates and do proper indexing. For any heavy query, check the query plan.
Keep the transaction timeout short, so that if the table is locked an error is thrown quickly and you can retry in the catch block (a rough retry sketch follows below).
These are a few things you can do.
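A rough sketch of that last point; the stored procedure name dbo.UpdateCustomer, the timeout, the retry count, and the back-off are all made up for illustration:

using System.Data;
using System.Data.SqlClient;
using System.Threading;

const int maxRetries = 3;
for (int attempt = 1; ; attempt++)
{
    try
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.UpdateCustomer", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.CommandTimeout = 5; // seconds - fail fast if the table is locked
            connection.Open();
            command.ExecuteNonQuery();
        }
        break; // success
    }
    catch (SqlException) when (attempt < maxRetries)
    {
        Thread.Sleep(200 * attempt); // back off a little before retrying
    }
}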
You cannot avoid multiple users loading data into the database. It is neither feasible nor clever to lock the table every time a single user requests it. Actually, you do not have to worry about it, because the DB itself provides mechanisms to avoid such issues. I would recommend reading up on the ACID properties:
Atomicity
Consistency
Isolation
Durability
What may happen is that you suffer a ghost read, which basically means that you cannot read the data until the user who is inserting it commits. And even if that user has finished inserting the data but has not committed, there is a fair chance that you will not see the changes.
DDL operations such as creation, removal, etc. are committed automatically at the end. DML operations such as update, insert, delete, etc., however, are not committed automatically.
I'm using Entity Framework 6 with a PostgreSQL database (with the Npgsql connector). Everything works fine, except for the poor performance of this setup. When I try to insert a not-so-large number of objects into the database (about 20k records), it takes much more time than it should. As this is my first time using Entity Framework, I was rather confused why inserting 20k records into a database on my local machine would take more than a minute.
In order to optimize the inserts I followed every tip I found. I tried setting AutoDetectChangesEnabled to false, calling SaveChanges() every 100 or 1000 records, re-creating the database context object, and using DbContextTransaction objects (by calling dbContext.Database.BeginTransaction() and committing the transaction at the end of the operation or every 100/1000 records). Nothing improved insert performance even a little bit.
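Roughly, my batching attempts looked something like this (MyDbContext, its Records set, and the records collection stand in for my real types):

using (var dbContext = new MyDbContext())
{
    dbContext.Configuration.AutoDetectChangesEnabled = false;
    using (var transaction = dbContext.Database.BeginTransaction())
    {
        int count = 0;
        foreach (var record in records)
        {
            dbContext.Records.Add(record);
            if (++count % 1000 == 0)
                dbContext.SaveChanges(); // flush a batch of pending inserts
        }
        dbContext.SaveChanges();         // flush the remainder
        transaction.Commit();
    }
}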
By logging the SQL queries generated by Entity I was finally able to discover that, no matter what I do, every object is inserted separately and every insert takes 2-4 ms. Without re-creating DB context objects and without transactions, there is just one commit after over 20k inserts. When I use transactions and commit every few records, there are more commits and new transaction creations (the same when I re-create the DB context object, just with the connection being re-established as well). If I use transactions and commit them every few records, I should notice a performance boost, no? But in the end there is no difference in performance whether I use multiple transactions or not. I know transactions won't improve performance drastically, but they should help at least a little bit. Instead, every insert still takes at least 2 ms to execute against my local DB.
A database on my local machine is one thing, but performing the creation of 20k objects against a remote database takes much, much, MUCH longer than one minute - the logs indicate that a single insert can take even 30 ms (!), with transactions being committed and created again every 100 or 1000 records. On the other hand, if I execute a single insert manually (taking it straight from the log), it takes less than 1 ms. It seems like Entity takes its sweet time inserting every single object into the database, even though it uses transactions to wrap larger numbers of inserts together. I don't really get it...
What can I do to speed it up for real?
In case anyone's interested, I found a solution to my problem. Entity Framework 6 is unable to provide fast bulk inserts without additional third-party libraries (as mentioned in the comments to my question), which are either expensive or don't support databases other than SQL Server. Entity Framework Core, on the other hand, is another story. It supports fast bulk insertion and can replace EF 6 in a project with just a handful of code changes: https://learn.microsoft.com/pl-pl/ef/core/index
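For anyone curious, the insert code itself can stay very simple under EF Core (MyDbContext and records are placeholders here); SaveChanges batches the generated INSERTs into far fewer round trips than EF 6 did:

using (var dbContext = new MyDbContext())
{
    dbContext.AddRange(records); // 20k tracked entities
    dbContext.SaveChanges();     // sent to the server in batched commands
}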
My scenario is common:
I have a stored procedure that needs to update multiple tables.
If one of the updates fails, all the updates should be rolled back.
The straightforward answer is to include all the updates in one transaction and just roll that back. However, in a system like ours, this causes concurrency issues.
When we break the updates into multiple short transactions, we get a throughput of ~30 concurrent executions per second before deadlocking issues start to emerge.
If we put it all into one transaction that spans all of them, we get ~2 concurrent executions per second before deadlocks show up.
In our case, we place a try-catch block after every short transaction and manually DELETE/UPDATE back the changes from the previous ones, so essentially we mimic the transaction behavior in a very expensive way...
It works all right, since it's well written and doesn't get many "rollbacks"...
One thing this approach cannot resolve at all is the case of a command timeout from the web server / client.
I have read extensively in many forums and blogs and scanned through MSDN, and I cannot find a good solution. Many have presented the problem, but I have yet to see a good solution.
The question is this: is there ANY solution to this issue that allows a stable rollback of updates to multiple tables, without requiring an exclusive lock on all of the rows for the entire duration of the long transaction?
Assume that it is not an optimization issue. The tables are probably close to maximally optimized and can give very high throughput as long as deadlocks don't hit them. There are no table locks/page locks etc., only row locks on updates - but when you have so many concurrent sessions, some of them need to update the same row...
It can be via SQL, client-side C#, or server-side C# (extend the SQL Server?).
Is there such a solution in any book/blog that I have not found?
We are using SQL Server 2008 R2, with a .NET client/web server connecting to it.
Code example:
CREATE PROCEDURE sptest
AS
BEGIN TRANSACTION
UPDATE table1
UPDATE table2
COMMIT TRANSACTION
In this case, if sptest is run twice, the second instance cannot update table1 until instance 1 has committed.
Compared to this
CREATE PROCEDURE sptest2
AS
UPDATE table1
UPDATE table2
sptest2 has a much higher throughput - but it has a chance of corrupting the data.
This is what we are trying to solve. Is there even a theoretical solution to this?
Thanks,
JS
I would say that you should dig deeper to find out why the deadlocks occur. Possibly you should change the order of the updates to avoid them. Maybe some index is "guilty".
You cannot roll back changes if other transactions can change the data, so you need to hold update locks on those rows. But you can use the snapshot isolation level to allow consistent reads before the update commits.
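For example, the reading side can run under a snapshot transaction via System.Transactions (a sketch only, and it assumes snapshot isolation has already been enabled on the database with ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON):

using System.Data.SqlClient;
using System.Transactions;

var options = new TransactionOptions { IsolationLevel = IsolationLevel.Snapshot };
using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    // ... run the SELECTs here; they see a consistent snapshot and don't block on the writers ...
    scope.Complete();
}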
For all inner-joined tables that are mostly static, or where dirty data is highly unlikely to affect the query, you can apply:
INNER JOIN LookupTable lut WITH (NOLOCK) ON lut.ID = SomeOtherTableID
This tells the query that you do not care about in-flight updates to LookupTable.
This can reduce your issue in most cases. For more difficult deadlocks, I have implemented a deadlock graph that is generated and emailed when a deadlock occurs; it contains all the detailed information about the deadlock.
I need to update several rows of one of my tables as an atomic operation.
The update concerns incrementing some values in int columns of certain rows. I need to increment values in several rows as a single action.
What would be the best way to do this?
Answering this question for me comes down to answering the following two:
If I use LINQ to SQL, how do I achieve the atomicity of the increment operation (do I use a transaction, or is there a better way)?
Are stored procedures executed atomically (in case I invoke the procedure on the DB)?
I am working in C# with SQL Server.
In SQL Server, atomicity across different operations is achieved by using explicit transactions: you explicitly start a transaction with the keywords BEGIN TRANSACTION, and once all the operations are done without any errors you commit the transaction with the keywords COMMIT TRANSACTION; in case of an error/exception you can undo the work anywhere in the ongoing transaction with the keywords ROLLBACK TRANSACTION.
Write Ahead Strategy
SQL Server uses a write-ahead strategy to ensure the atomicity of transactions and the durability of data. When we make any changes/updates to the data, SQL Server takes the following steps:
Loads data pages into a buffer cache.
Updates the copy in the buffer.
Creates a log record in a log cache.
Writes the log record to disk (the log is always hardened before the data, hence "write ahead").
Writes the modified data pages to disk later (for example, during a checkpoint).
So if, anywhere in these steps, you decide to ROLLBACK the transaction, your actual data on disk is left unchanged.
My Suggestion
BEGIN TRY
    BEGIN TRANSACTION
        ------ Your code here ------
        ---- If everything goes fine (no errors / no exceptions)
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION -- this will roll back any half-done operations
    -- Your error-handling code here
END CATCH
I found my answer: the increment cannot be realized through LINQ to SQL directly. However, stored procedures can be called from LINQ, and the increment can be realized there.
My solution was to create a stored procedure that executes the necessary updates within a single while loop in a transaction. This way all the updates are executed as a single, atomic operation.
The UPDATE statement is atomic by itself.
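So if going through a stored procedure isn't required, a single parameterized UPDATE issued from C# already gives you the atomic increment; a sketch, with the Counters table and its columns invented for illustration:

using System.Data.SqlClient;

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "UPDATE Counters SET Value = Value + 1 WHERE CounterId IN (@a, @b, @c)",
    connection))
{
    command.Parameters.AddWithValue("@a", 1);
    command.Parameters.AddWithValue("@b", 2);
    command.Parameters.AddWithValue("@c", 3);
    connection.Open();
    command.ExecuteNonQuery();
}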
I have roughly 30M rows to insert/update in SQL Server per day. What are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data is flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when stuff arrives.
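As a bare-bones sketch of that idea (the drop directory, the file filter, and the LoadFile helper are hypothetical):

using System.IO;

var watcher = new FileSystemWatcher(@"C:\incoming", "*.csv");
watcher.Created += (sender, e) =>
{
    // The file may still be mid-transfer; real code should wait until the upload completes.
    LoadFile(e.FullPath); // empty the work table, bulk copy, transform, merge into the real tables
};
watcher.EnableRaisingEvents = true;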
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst of all, after everything, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE and an INSERT statement. Either way, you want to compare your source table to your holding table to see which action to perform.
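A sketch of the holding-table plus MERGE variant (the staging/target table names and the Id/Value columns are illustrative):

using System.Data.SqlClient;

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    // Load the incoming rows into the holding table first.
    using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.StagingRows" })
    {
        bulkCopy.WriteToServer(dataTable);
    }

    // Then compare against the real table and insert or update accordingly.
    const string mergeSql = @"
        MERGE dbo.TargetRows AS t
        USING dbo.StagingRows AS s ON t.Id = s.Id
        WHEN MATCHED THEN UPDATE SET t.Value = s.Value
        WHEN NOT MATCHED THEN INSERT (Id, Value) VALUES (s.Id, s.Value);";

    using (var command = new SqlCommand(mergeSql, connection))
    {
        command.ExecuteNonQuery();
    }
}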