I have various large data modification operations in a project built on c# and Fluent NHibernate.
The DB is sqlite (on disk rather than in memory as I'm interested in performance)
I wanted to check performance of these so I created some tests to feed in large amounts of data and let the processes do their thing. The results from 2 of these processes have got me pretty confused.
The first is a fairly simple case of taking data supplied in an XML file doing some light processing and importing it. The XML contains around 172,000 rows and the process takes a total of around 60 seconds to run with the actual inserts taking around 40 seconds.
In the next process, I do some processing on the same set of data. So I have a DB with approx 172,000 rows in one table. The process then works through this data, doing some heavier processing and generating a whole bunch of DB updates (inserts and updates to the same table).
In total, this results in around 50,000 rows inserted and 80,000 updated.
In this case, the processing takes around 30 seconds, which is fine, but saving the changes to the DB takes over 30 mins! and it crashes before it finishes with an sqlite 'disk or i/o error'
So the question is: why are the inserts/updates in the second process so much slower? They are working on the same table of the same database with the same connection. In both cases, IStatelessSession is used and ado.batch_size is set to 1000.
In both cases, the code looks that does the update like this:
BulkDataInsert((IStatelessSession session) =>
{
foreach (Transaction t in transToInsert) { session.Insert(t); }
foreach (Transaction t in transToUpdate) { session.Update(t); }
});
(although the first process has no 'transToUpdate' line as it's only inserts - Removing the update line and just doing the inserts still takes almost 10 minutes.)
The transTo* variables are List with the objects to be updated/inserted.
BulkDataInsert creates the session and handles the DB transaction.
I didn't understand your second process. However, here are some things to consider:
Are there any clustered or non-clustered indexes on the table?
How many disk drives do you have?
How many threads are writing to the DB in the second test?
It seems that you are experiencing IO bottlenecks that can be resolved by having more disks, more threads, indexes, etc.
So, assuming a lot of things, here is what I "think" is happening:
In the first test your table probably has no indexes, and since you are just inserting data, it is a sequential insert in a single thread which can be pretty fast - especially if you are writing to one disk.
Now, in the second test, you are reading data and then updating data. Your SQL instance has to find the record that it needs to update. If you do not have any indexes this "find" action is basically a table scan, which will happen for each one of those 80,000 row updates. This will make your application really really slow.
The simplest thing you could probably do is add a clustered index on the table for a unique key, and the best option is to use the columns that you are using in the where clause to "update" those rows.
Hope this helps.
DISCLAIMER: I made quite a few assumptions
The problem was due to my test setup.
As is pretty common with nhibernate based projects, I had been using in-memory sqlite databases for unit testing. These work great but one downside is that if you close the session, it destroys the database.
Consequently, my unit of work implementation contains a 'PreserveSession' property to keep the session alive and just create new transactions when needed.
My new performance tests are using on-disk databases but they still use the common code for setting up test databases and so have PreserveSession set to true.
It seems that having several sessions all left open (even though they're not doing anything) starts to cause problems after a while including the performance drop off and the disk IO error.
I re-ran the second test with PreserveSession set to false and immediately I'm down from over 30 minutes to under 2 minutes. Which is more where I'd expect it to be.
Related
I'm using Entity 6 with PostgreSQL database (with Npgsql connector). Everything works fine, except for poor performance of this setup. When I try to insert not-so-large amount of objects to database (about 20k records), it takes much more time than it should. As this is my first time using Entity Framework, I was rather confused why inserting 20k records to database on my local machine would take more than 1 minute.
In order to optimize inserts I followed every tip I found. I tried to set AutoDetectChangesEnabled to false, call SaveChanges() every 100 or 1000 records, re-creating database context object and use DbContextTransaction objects (by calling dbContext.Database.BeginTransaction() and commiting transaction at the end of the operation or every 100/1000 records). Nothing improved inserts performance even a little bit.
By loging SQL queries generated by Entity I was finally able to discover that no matter what I do, every object is inserted separatedly and every insert takes 2-4 ms. Without re-creating DB context objects and without transactions, there is just one commit after over 20k inserts. When I use transactions and commit every few records, there are more commits and new transaction creations (same when I re-create DB context object, just with connection being re-established as well). If I use transactions and commit them every a few records, I should notice a performance boost, no? But in the end there is no difference in performance, no matter if I use multiple transactions or not. I know transactions won't improve performance drastically, but they should help at least a little bit. Instead, every insert still takes at least 2ms to execute on my local DB.
Database on local machine is one thing, but performing creation of 20k objects on remote database takes much, much, MUCH longer than one minute - logs indicate that single insert can take even 30ms (!), with transactions being commited and created again every 100 or 1000 records. On the other hand, if I execute a single insert manually (taking it straight from the log), it takes less than 1ms to execute. It seems like Entity takes its sweet time inserting every single object to database, even though it uses transactions to wrap larger amount of inserts together. I don't really get it...
What can I do to speed it up for real?
In case anyone's interested, I found a solution to my problem. Entity Framework 6 is unable to provide fast bulk inserts without additional third-party libraries (as mentioned in comments to my question), which are either expensive or not supporting other databases than SQL Server. Entity Framework Core, on the other hand, is another story. It supports fast bulk insertions and can replace EF 6 in project with just a bunch of changes in code: https://learn.microsoft.com/pl-pl/ef/core/index
Background
I need to write some integration tests in C# (about 120 of them) for a C#/SQL Server application. Now, initially before any test, the database will already be there, the reason it will be there is because lot of scripts will be run to set it up (about 20 minutes running time). Now when I run my test, few tables will be updated (CRUD operations). For e.g. in 10-11 tables, few rows will be added and say in 15-16 tables, few rows will be updated and in 4-5 tables few rows will be deleted.
Problem
After every test is run, the database needs to be reset to it's original state. How can I achieve that?
Bad Solution
After every run of a test, re-run the database creation scripts (20 minutes of running time). Since there will be around 120 tests, this comes to 40 hours which is not an option. Secondly there is a process that has several connections open against this database so the database cannot be dropped/re-created.
Good Solution?
I would like to know if there is any other way of solving this problem? Another problem I have is that, for each of those tests, I don't even know what tables will be updated and I will have to manually go and check to see what tables were updated anyway if I were to delete, revert the database to it's original state manually by writing queries.
You should take a look at the possibilities MSSQl gives you with taking a snapshot of your database. It is potentially a lot faster than reverting to a backup or recreating the database.
Managing a test database
In a testing environment, it can be useful when repeatedly running a test protocol for the database to contain identical data at the start of each round of testing. Before running the first round, an application developer or tester can create a database snapshot on the test database. After each test run, the database can be quickly returned to its prior state by reverting the database snapshot.
I am working with aws sqs queue. The queue may having massive messages i.e if i do not process there will be more than a million mesasge per hour.
I am processing all the messages and putting them into a mysql table. Innodb with 22 columns. Insert on Duplicate Key Update. I have a primary key and unique key.
I am working with C# where i ran 80 threads in order to pull messages from sqs.
I applied transaction in c# run the query like "insert on duplicate key update"
at the same time i am using lock in c# so only single thread can update the table. if id do not use C# lock then an exception is thrown from mysql dead lock occured.
Problem is here i can see there are a lot of threads are waiting before C# lock and this time gradually increasing. Can any body suggest me what is the best way to do this..
Note, i have 8GB RAM intell xeon 2.53 with 1GE internet speed. please suggest me in this regard.
If I were to do it, the C# program would primarily be creating the CSV file to empty your SQS queue. Or at least a significant chunk of it. The file would then be used for bulk insert into an empty non-indexed in anyway worktable. I would steer for non-temporary but whatever. I see no reason to add temporary to the mix when this is recurring, and when done the worktable is truncated anyway.
The bulk insert would be achieved through LOAD DATA FROM INFILE fired off from the c# program. Alternatively, a value in a new row in some other table could be written with an incrementer saying file2 is ready, file3 is ready, and the LOAD happens in an event triggered, say every n minutes. An event that was put together with mysql Create Event. Six of one, half dozen of another.
But the benefits of a sentinal, a mutex, might be of value, as this whole thing happens in batches. And the next batch(es) to be processed need to be suspended while this occurs. Let's call this concept The Blocker, and the one being worked on is row N.
Ok, now your data is in the worktable. And it is safe from being stomped on until processed. Let's say you have 250k rows. Other batches shortly to follow. If you have special processing to have happen, you may wish to create indexes. But at this moment there are none.
You perform a normal insert on duplicate key update (IODKU) to the REAL table using this worktable. It would, in that IODKU follow a normal insert into select pattern, where the select part comes from the worktable.
At the end of that statement, the worktable is truncated, any indexes dropped, row N has its status set to complete, and The Blocker is free to work on row N+1 when it appears.
The indexes are dropped to facilitate the next round of bulk insert, where maintaining indexes is of least importance. And indexes on the worktable may very well be overhead baggage unnecessary during IODKU.
In this manner, you get the best of both worlds
LOAD DATA FROM INFILE
IODKU
And the focus is taken off of multi-threading, a good thing to take one's focus off of.
Here is a nice article on performance and strategies titled Testing the Fastest Way to Import a Table into MySQL. Don't let the mysql version of the title or inside the article scare you away. Jumping to the bottom and picking up some conclusions:
The fastest way you can import a table into MySQL without using raw
files is the LOAD DATA syntax. Use parallelization for InnoDB for
better results, and remember to tune basic parameters like your
transaction log size and buffer pool. Careful programming and
importing can make a >2-hour problem became a 2-minute process. You
can disable temporarily some security features for extra performance
I would separate the C# routine entirely from the actual LOAD DATA and IODKU update effort and leave that to the event mentioned with Create Event for several reasons. Mainly better design. As such the C# program is only dealing with SQS and writing out files with incrementing file #'s.
I have roughly 30M rows to Insert Update in SQL Server per day what are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data is flat files get dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when stuff arrives.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst, after all that happens, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
Bulk insert into a holding table then perform either a single Merge statement or an Update and an Insert statement. Either way you want to compare your source table to your holding table to see which action to perform
Here I am dealing with a database containing tens of millions of records. I have an application which connects to the database, gets all the data from a single column in a table and does some operation on it and updates it (for SQL Server - using cursors).
For millions of records it is taking very very ... long time to update. So I want to make it faster by
using multiple threads with an independent connection for each thread.
or
by using a single connection throughout all the threads to fire the update queries.
Which one is faster, or if you have any other ideas plz explain.
I need a solution which is independent of database type , or even if you know specific solutions for each type of db, please reply.
The speedup you're trying to achieve won't work. To the contrary, it will slow down the overall processing as the database now has also to keep multiple connections/sessions/transactions in sync.
Keep with as few connections/transactions as possible for repetitive and comparable operations.
If it takes too long for your taste, maybe try to analyze if the queries can be optimized somehow. Also have a look at database-specific extensions (ie bulk operations) suitable for your problem.
All depends on the database, and the hardware it is running on.
If the database can make use of concurrent processing, and avoids contention on shared resources (e.g. page base locks would span multiple records, record based would not). Shared resources in this case include hardware, a single core box will not be able to execute multiple CPU intensive activities (e.g. parsing SQL) truely in parallel.
Network latency is something you might help alleviate with concurrent inserts even if the database is itself not able to exploit concurrency.
As with any question of performance there is substitute for testing in your specific scenario.
If possible try to use the Stored procedure the do all the processing and update the records.