Firstly, I am not much of an expert in multi-threading and parallel programming.
I am trying to optimize the performance of a legacy application (.Net 4, NHibernate 2.1).
So far, upgrading NHibernate is not a priority, but it is in the pipeline.
Over time, performance has become a nightmare as the data has grown. One item I have seen is a Parallel.ForEach statement that calls a method that fetches and updates a complex entity (with multiple relationships: properties and collections).
The piece of code has the following form (simplified for clarity):
void SomeMethod(ICollection<TheClass> itemsToProcess)
{
    // Each item is fetched and updated on a thread-pool thread.
    Parallel.ForEach(itemsToProcess, item => ProcessItem(item));
}

TheClass ProcessItem(TheClass i)
{
    var temp = NHibernateRepository.SomeFetchMethod(i);
    var result = NHibernateRepository.Update(temp);
    return result;
}
SQL Server intermittently reports database lock errors with the following error:
Transaction (Process ID 20) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction
I suspect a race condition is leading to the deadlock, even though the ISessions are separate.
The ICollection<TheClass> can have up to 1000 items, each with properties and sub-collections that are processed, generating many SELECT and UPDATE statements (confirmed using 'NHibernate Profiler').
Is there a better way to handle this in a parallel way, or shall I refactor the code to a traditional loop?
I do know that I can alternatively implement my code using:
A foreach loop in the same ISession context
With a Stateless Session
With Environment.BatchSize set to a reasonable value
OR
Using SQL BulkCopy
I have also read quite a bit of good info about SQL Server deadlocks and Parallel.ForEach being an easy pitfall:
SQL Transaction was deadlocked
Using SQL Bulk Copy as an alternative
Potential Pitfalls in Data and Task Parallelism
Multi threading C# application with SQL Server database calls
This is a very complicated topic. There's one strategy that is guaranteed to be safe and probably will result in a speedup:
Retry in case of deadlock.
Since a deadlock rolls back the transaction you can safely retry the entire transaction. If the deadlock rate is low the parallelism speedup will be high.
The nice thing about retry is that you can make a simple code change in a central place.
Since it's not apparent from the code posted: make sure that threads do not share the session or entities. Neither is thread-safe.
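To make the "retry in a central place" idea concrete, here is a minimal sketch of a retry wrapper. The names DeadlockRetry, Execute and AnyInChain are made up for illustration, and the helper knows nothing about NHibernate: the caller supplies a predicate that recognizes a deadlock. With SQL Server that predicate would typically check for a SqlException whose Number is 1205 somewhere in the InnerException chain. The sketch assumes that each attempt opens a fresh ISession and transaction inside the delegate, since a session that has thrown should not be reused.

```csharp
using System;
using System.Threading;

public static class DeadlockRetry
{
    // Runs `work` (one complete unit of work: open session, begin transaction,
    // fetch, update, commit) and reruns it when `isDeadlock` recognizes the failure.
    public static T Execute<T>(Func<T> work, Func<Exception, bool> isDeadlock,
                               int maxAttempts = 3, int baseDelayMs = 50)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return work();
            }
            catch (Exception ex)
            {
                // Give up on the last attempt or on anything that is not a deadlock.
                if (attempt >= maxAttempts || !isDeadlock(ex)) throw;
                // The server has already rolled back the victim's transaction,
                // so the whole unit of work can be rerun after a short pause.
                Thread.Sleep(baseDelayMs * attempt);
            }
        }
    }

    // Walks the InnerException chain, since ORMs usually wrap the provider exception.
    public static bool AnyInChain(Exception ex, Func<Exception, bool> predicate)
    {
        for (var e = ex; e != null; e = e.InnerException)
            if (predicate(e)) return true;
        return false;
    }
}
```

Inside the Parallel.ForEach body, ProcessItem(item) would then be wrapped as DeadlockRetry.Execute(() => ProcessItem(item), isDeadlock), provided ProcessItem creates its own session on every call.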
How can I run the code below as fast as possible? What is the best practice?
public ActionResult ExampleAction()
{
    // 200K items
    var results = dbContext.Results.ToList();
    foreach (var result in results)
    {
        // 10 - 40 items
        result.Kazanim = JsonConvert.SerializeObject(
            dbContext.SubTables // 2,5M items
                .Where(x => x.FooId == result.FooId)
                .Select(select => new
                {
                    BarId = select.BarId,
                    State = select.State,
                }).ToList());
        dbContext.Entry(result).State = EntityState.Modified;
        dbContext.SaveChanges();
    }
    return Json(true, JsonRequestBehavior.AllowGet);
}
This process takes an average of 500 ms when run synchronously. I have about 2M records. The process is done 200K times.
How should I write this asynchronously?
How can I make it faster and simpler with an async method?
Here are two suggestions that can improve performance by multiple orders of magnitude:
Do work in batches:
Make the client send a page of data to process; and/or
In the web server code, add items to a queue and process them separately.
Use SQL instead of EF:
Write efficient SQL; and/or
Use a stored procedure to do the work inside the database rather than moving data between the database and the code.
There's nothing you can do with that code asynchronously for improving its performance. But there's something that can certainly make it faster.
If you call dbContext.SaveChanges() inside the loop, EF will write back the changes to the database for every single entity as a separate transaction.
Move your dbContext.SaveChanges() after the loop. This way EF will write back all your changes at once, in a single transaction.
Always try to have as few calls to .SaveChanges() as possible. One call with 50 changes is much better, faster and more efficient than 50 calls for 1 change each.
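With 200K entities, a single SaveChanges() at the very end may itself become heavy, so a middle ground is to save once per batch. Below is a minimal sketch of a chunking helper; BatchSketch and InBatches are hypothetical names, and the batch size of 1000 in the comment is only a starting point to benchmark. The commented usage also assumes that the needed SubTables columns fit in memory, so they can be loaded once and grouped by FooId instead of being queried once per result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchSketch
{
    // Splits `items` into consecutive chunks of at most `batchSize` elements.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> items, int batchSize)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>(batchSize);
        foreach (var item in items)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;   // trailing partial batch
    }
}

// Hypothetical usage with the question's DbContext (not compiled here):
//
// var subRows = dbContext.SubTables
//     .Select(x => new { x.FooId, x.BarId, x.State })
//     .ToList()                      // one query instead of 200K
//     .ToLookup(x => x.FooId);
// foreach (var batch in BatchSketch.InBatches(results, 1000))
// {
//     foreach (var r in batch)
//         r.Kazanim = JsonConvert.SerializeObject(
//             subRows[r.FooId].Select(x => new { x.BarId, x.State }));
//     dbContext.SaveChanges();       // one call per 1000 changed rows
// }
```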
Hello, and welcome.
There's quite a lot I see incorrect in terms of asynchronicity, but I guess it only matters if there are concurrent users calling your server. This has to do with scalability and the thread pool in charge of spinning up threads to take care of your incoming HTTP requests.
You see, if you occupy a thread pool thread for a long time, that thread will not contribute to dequeueing incoming HTTP requests. This pretty much puts you in a position where you can spin up a maximum of around 2 new thread pool threads per second. If your incoming HTTP request rate is faster than the pool's ability to produce threads, all of your HTTP requests will start seeing increased response times (slowness).
So as a general rule, when doing I/O intensive work, always go async. There are asynchronous versions of most (or all) of the materializing methods like .ToList(): ToListAsync(), CountAsync(), AnyAsync(), etc. There is also a SaveChangesAsync(). First thing I would do is use these under normal circumstances. Yours don't seem to be, so I mentioned this for completeness only.
I think that you must, at the very least, run this heavy process outside the thread pool. Use Task.Factory.StartNew() with the TaskCreationOptions.LongRunning but run synchronous code so you don't fall in the trap of awaiting the returned task in vain.
Now, all that just to have a "proper" skeleton. We haven't really talked about how to make this run faster. Let's do that.
Personally, I think you need some benchmarking between different methods. It looks like you have benchmarked this code. Now listen to @tymtam and see if a stored procedure version runs faster. My hunch, just like @tymtam's, is that it will definitely be faster.
If for whatever reason you insist on running this with C#, I would parallelize the work. The problem with this is Entity Framework. As per usual, my very popular, yet unfriendly ORM, is giving us a big but. EF's DB context works with a single connection and disallows multiple simultaneous queries. So you cannot parallelize this with EF. I would then move to my good, amazing friend, Dapper. Using Dapper, you could divide the workload among threads, and each thread would open an independent DB connection and, through that connection, take care of a portion of the 200K result set you obtain at the beginning.
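The "one connection per thread" split could look roughly like the sketch below. It uses the Parallel.ForEach overload with thread-local state, so each worker opens its connection once and reuses it for every item it is handed. PerWorkerConnection and Run are hypothetical names, and the connection type is left generic: in practice openConnection would create and open a SqlConnection, and processItem would run the Dapper calls on it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PerWorkerConnection
{
    // Processes `items` in parallel. Each worker task opens its own connection
    // (via `openConnection`), reuses it for its items, and disposes it when done.
    public static void Run<TItem, TConn>(
        IEnumerable<TItem> items,
        int maxWorkers,
        Func<TConn> openConnection,
        Action<TConn, TItem> processItem) where TConn : IDisposable
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        Parallel.ForEach<TItem, TConn>(
            items,
            options,
            openConnection,                   // localInit: runs once per worker task
            (item, loopState, conn) =>
            {
                processItem(conn, item);
                return conn;                  // pass the same connection to the next item
            },
            conn => conn.Dispose());          // localFinally: runs once per worker task
    }
}
```

MaxDegreeOfParallelism matters here: the database, not the CPU, is the likely bottleneck, so the worker count is something to tune by measurement rather than leave to the default.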
Thanks for the valuable information you provided.
I decided to use hangfire in line with your suggestions.
I used Hangfire with in-memory storage. Inside the foreach, I enqueue a function as a Hangfire job. I fetch the relevant values before the foreach starts and pass them to the function as parameters; the function does the calculation and saves the result to the database. I won't go into more detail.
A job that took 30 minutes on average now takes 3 minutes with Hangfire. Maybe it's still not ideal, but it works for me for now. Instead of making the user wait, I can show the action as currently in progress. When the last thread finishes, I end the process with a notification that the job has completed successfully.
I haven't used Dapper here yet, but I have used it for another task. It really has tremendous performance compared to Entity Framework.
Thanks again.
session.StartTransaction();
await mongo.Collection1.UpdateOneAsync(session, filter1, update1);
await mongo.Collection2.BulkWriteAsync(session, updatesToDifferentDocs);
await mongo.Collection3.UpdateOneAsync(session, filter2, update2);
await session.CommitTransactionAsync();
The above code is running concurrently on multiple threads. The final update for Collection3 has a high chance of writing on the same document by multiple threads. I wanted the transactions across the 3 collections to be atomic which is why I put them in one session, which is what I thought session is essentially used for, however, I'm not familiar with the details of its inner-workings.
Even without knowing much about Mongo's built-in features, it's pretty obvious why this is giving me a write conflict: I simply can't write to the same document in Collection3 at the same time from multiple threads.
However, I tried Googling a bit, and it seems that Mongo >= 3.2 uses the WiredTiger storage engine by default, which has document-level locks that don't need to be managed by the developer. I've read that it automatically retries the transaction if the document was initially locked.
I don't really know if I'm using session incorrectly here, or I just have to manually implement some kind of lock/semaphore/queue system. Another option would be to manually check for write conflict and re-attempt the entire session. But it feels like I'm just reinventing the wheel here if Mongo is already supposed to have concurrency support.
I should have updated this thread earlier, but nonetheless, here is what I ended up doing to solve my problem. While MongoDB does have automatic retries for transactions along with some locking mechanisms, I couldn't find a clean way to leverage them for my specific problematic session. I kept getting write conflicts even though I thought I'd acquired locks on all the colliding documents at the start of each session.
Having to maintain atomicity for a session that reads and writes across multiple collections, not just documents, I thought it was cleaner to simply wrap it in custom retry logic. I followed the example at the bottom of the page here and used a timeout that I thought was reasonable for my use case.
I decided to use a timeout-based retry logic because I knew most of the collisions would be tightly packed together temporally. For less predictable collisions, maybe some queue-based mechanism would be better.
The automatic retry behavior is different for transactions in MongoDB. By default, a transactional write operation waits 5 ms and aborts the transaction if a lock cannot be acquired.
You could try increasing the 5 ms timeout by running the following admin command on your MongoDB instance:
db.adminCommand( { setParameter: 1, maxTransactionLockRequestTimeoutMillis: 100 } )
More details about that can be found here.
Alternatively, you could manually implement some retry logic for failed transactions.
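A timeout-based retry like the one described in this thread could be sketched as follows. TransactionRetry and RunAsync are hypothetical names, and the helper is driver-agnostic: the caller decides which exceptions count as transient. With the MongoDB C# driver, a plausible predicate (an assumption to check against your driver version) is a MongoException carrying the "TransientTransactionError" label. Each attempt is expected to start, run and commit a fresh transaction inside the delegate, aborting it on failure.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TransactionRetry
{
    // Reruns `runTransaction` until it succeeds, a non-transient error occurs,
    // or waiting `delay` again would exceed `timeout` measured from the first attempt.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> runTransaction,
        Func<Exception, bool> isTransient,
        TimeSpan timeout,
        TimeSpan delay)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return await runTransaction();
            }
            catch (Exception ex)
            {
                // Rethrow errors that retrying cannot fix, or when time has run out.
                if (!isTransient(ex) || clock.Elapsed + delay > timeout) throw;
            }
            await Task.Delay(delay);
        }
    }
}
```

A fixed delay suits collisions that are tightly packed in time, as described above; for less predictable contention, a growing delay with some jitter, or a queue in front of the hot document, may behave better.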
How many concurrent statements does C# SqlConnection support?
Let's say I am working on Windows service running 10 threads. All threads use the same SqlConnection object but different SqlCommand object and perform operations like select, insert, update and delete on either different tables or same table but different data. Will it work? Will a single SqlConnection object be able to handle 10 simultaneous statements?
How many concurrent statements does C# SqlConnection support?
You can technically have multiple "in-flight" statements, but only one actually executing.
A single SqlConnection maps to a single connection and session in SQL Server. In SQL Server, a session can only have a single request active at a time. If you enable MultipleActiveResultSets you can start a new query before the previous one is finished, but the statements are interleaved, never run in parallel.
MARS enables the interleaved execution of multiple requests within a
single connection. That is, it allows a batch to run, and within its
execution, it allows other requests to execute. Note, however, that
MARS is defined in terms of interleaving, not in terms of parallel
execution.
And
execution can only be switched at well defined points.
https://learn.microsoft.com/en-us/sql/relational-databases/native-client/features/using-multiple-active-result-sets-mars?view=sql-server-ver15
So you can't even guarantee that another statement will run whenever one becomes blocked. So if you want to run statements in parallel, you need to use multiple SqlConnections.
Note also that a single query might use a parallel execution plan, and have multiple tasks running in parallel.
David Browne answered what you asked, but there might be something else you need to know:
Let's say I am working on Windows service running 10 threads. All threads use the same SqlConnection object but different SqlCommand object and perform operations like select, insert, update and delete on either different tables or same table but different data.
This design just seems wrong on several fronts:
You keep a disposable resource around and open. My rule for disposable stuff is: "Create. Use. Dispose. All in the same piece of code, ideally using a using block." Keeping disposable stuff around or even sharing it between threads is just not worth the danger of forgetting to close it.
There is no performance advantage: SqlConnection uses connection pooling internally, without any side effects. And even if there were a relevant speed advantage, it would not be worth the dangers.
You are using multithreading with database access. Multithreading is one way to implement multitasking, but not one you should use until you need it. Multithreading is only useful with CPU-bound work. Otherwise you should generally be using async/await or similar approaches. DB operations are either disk or network bound.
There is one exception to this rule, and that is if your application is a server. Servers are the rare example of something being pleasingly parallel. So having a large thread pool to process incoming requests in parallel is very common. It is rather rare that you write one of those, however. Mostly you just run your code in an existing server infrastructure that deals with that.
If you do have heavy CPU work, chances are you are retrieving too much. It is a very common beginner's mistake to retrieve a lot, then do filtering in C# code. Do not do that. Do as much filtering and processing as possible in the query. You will not be able to beat the speed of the DB server, and at best you tie up your network pointlessly.
I have a FileShare crawler (getting permissions and dropping them somewhere for later Audit). Currently it is starting multiple threads to crawl the same folder (to speed up the process).
In C#, each SqlConnection object has its own SqlTransaction, initiated by the SqlConnection.BeginTransaction() call.
Here is the pseudo code of the current solution:
Get list of folders
For each folder get list of sub-folders
For each sub folder start a thread to collect file shares
Each thread will save collected data to database
Run Audit reports on the database
The problem arises when one of the sub-folder threads fails. We end up with a partial folder scan which "cannot be detected easily". The main reason is that each thread is running on a separate connection.
I would like to have each folder to be committed in the same transaction rather than having incomplete scanning (current situation, when some threads fail). No transaction concept is implemented but I am evaluating the options.
Based on the comments of this answer, the producer/consumer queue would be an option but unfortunately memory is a limit (due to the number of started threads). In case the producer/consumer space is committed to disk to overcome the RAM limit, the execution time will go up (due to the very limited disk I/O compared to memory I/O). I guess I am stuck with a memory/time compromise. Any other suggestions?
It is possible to share the same transaction on multiple connections with SQL Server using the obsolete bind transaction feature. I have never used it and I wouldn't base new development on it. It also seems unnecessary here.
Can't you just have all the producers use the same connection and transaction? Put a lock around it. This obviously bottlenecks the process but it might still be fast enough.
You say you execute INSERT statements. For bulk inserts you can use the SqlBulkCopy class, which is much faster. Batch up the rows and only execute a bulk insert when you have >>1000 rows buffered.
I don't even see the need for producer/consumer here. It would indeed benefit performance by pipelining production with consumption, but it also introduces far more complex threading. If you want to go this route you should probably give SqlBulkCopy an IDataReader over the produced rows, so that they stream directly into it without intermediate buffering (SqlBulkCopy does not accept an IEnumerable<SqlDataRecord>; that type is what table-valued parameters take).
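The "one shared connection behind a lock, flushed in batches" suggestion could be sketched like this. BufferedWriter is a hypothetical class; the flush delegate stands in for whatever writes a batch, for example filling a DataTable and passing it to SqlBulkCopy.WriteToServer on the shared connection and transaction. The threshold is an assumption to tune.

```csharp
using System;
using System.Collections.Generic;

public sealed class BufferedWriter<T>
{
    private readonly object _gate = new object();
    private readonly int _threshold;
    private readonly Action<IList<T>> _flush;
    private List<T> _buffer = new List<T>();

    public BufferedWriter(int threshold, Action<IList<T>> flush)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException("threshold");
        if (flush == null) throw new ArgumentNullException("flush");
        _threshold = threshold;
        _flush = flush;
    }

    // Called by any crawler thread. The lock serializes access to the buffer
    // and to the single shared connection used inside `flush`.
    public void Add(T row)
    {
        lock (_gate)
        {
            _buffer.Add(row);
            if (_buffer.Count >= _threshold) FlushLocked();
        }
    }

    // Call once after all threads for a folder have finished, before committing.
    public void Complete()
    {
        lock (_gate)
        {
            if (_buffer.Count > 0) FlushLocked();
        }
    }

    // Must only be called while holding the lock.
    private void FlushLocked()
    {
        var full = _buffer;
        _buffer = new List<T>();
        _flush(full);
    }
}
```

Because producers block while a flush is in progress, memory use stays bounded by the threshold, at the cost of crawler threads occasionally waiting on the database. If any thread fails, the caller can roll back the shared transaction instead of calling Complete() and committing.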
I have code that carries out data retrieval - basically executes anything from 3 to 12 SQL (oracle) read statements to retrieve data about an object.
Unfortunately it's running slowly (no SQL statement in particular; it's just the fact that I have so many of them, and they take around 0.2 seconds per statement, which can mean over 2 seconds for the code to complete).
I am looking into ways of improving the performance. One way is to merge some of the tables into a single query (which can reduce the combined time by 0.5 seconds). However, it doesn't make sense to merge the rest, since there will only be data there under certain circumstances, and trying to determine when there is data there to marshal could get tricky.
I am considering introducing threading into my program, so after the initial query, I would spawn a thread for each of the other queries, so they are executed at the same time. However, I have never used threading and am wary of introducing deadlocks or other pitfalls.
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads though different sections/fields within the object?). Would it be better to return the results and marshal into the object after all the threads have finished?
I know these types of questions are hard to answer since it's more general advice, but I would appreciate it if anyone thought it was a good idea, or had other suggestions?
If you are only reading (select from), don't worry about deadlocks. Reads in Oracle do not block (mostly). The biggest problem with threading queries to Oracle would be how to deal with connections. Creating a connection, running one query and closing the connection again is very, very bad. Connections are expensive. They are also limited, so you don't want to create one million connections to execute your logic.
As a result, you would use some sort of connection pool and put your queries in a queue.
Also, I hope you are using bind variables and not string concatenation to pass queries to oracle.
In general, I would collect all the data (better in one query) and only then update the object. You could also consider breaking your object into its sections.
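"Collect all the data and only then update the object" could look like the sketch below. Customer and ParallelLoad are hypothetical; each delegate stands for one independent read that takes its own connection from the pool. The queries run concurrently, and a single thread fills the object once all of them have returned, so no field is ever written from two threads.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical target object with three independently loaded sections.
public sealed class Customer
{
    public string Name;
    public int OrderCount;
    public decimal Balance;
}

public static class ParallelLoad
{
    public static Customer Load(
        Func<string> readName,
        Func<int> readOrderCount,
        Func<decimal> readBalance)
    {
        // Start the three reads concurrently on thread-pool threads.
        Task<string> name = Task.Factory.StartNew(readName);
        Task<int> orders = Task.Factory.StartNew(readOrderCount);
        Task<decimal> balance = Task.Factory.StartNew(readBalance);

        // Block until every query has finished; a failed query
        // surfaces here as an AggregateException.
        Task.WaitAll(name, orders, balance);

        // Only this thread touches the object, after all results are in.
        return new Customer
        {
            Name = name.Result,
            OrderCount = orders.Result,
            Balance = balance.Result
        };
    }
}
```

With 3 to 12 statements of roughly 0.2 seconds each, the total should approach the duration of the slowest query rather than the sum, assuming the pool has enough free connections.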
Threading works perfectly. Two years ago I did a project that used a multi-stage, multi-threaded approach to push data into an Oracle database (and pull some data out of it for updates).
I basically used a staged approach (a request would go through multiple stages, get consumed there, and new data would be pushed to the next stage), and every stage used a configurable thread pool, which would take a message, process it and post the new messages.
At that time we used, I think, close to 200 threads to process about a million SQL statements per minute (hitting an Oracle Exadata that was really getting some work out of that).
So, multithreading "just works", obviously only if you know how to do it, and you have to get your architecture and the SQL statements nice and non-blocking. Databases in general are perfectly capable of handling multiple threads.
Now, for details: THAT DEPENDS.
Example:
Currently the other queries marshal the results into different
sections of the SAME object. Would this cause any issues (i.e. since
we are accessing/updating the same object in different threads though
different sections/fields within the object?)
Absolutely no problem as long as:
You make sure all updates are finished before moving the object to the next phase and
The updates do not overlap or depend on each other's order (1 must finish for 2 to have the required data).
These are implementation details and it is really hard to make a generic answer for those (totally impossible). Especially as this is multi threading 101 - and has nothing to do with any database access.
In general, you will also have to tune the number of threads. .NET cannot do that itself, as it will see the CPU not busy and spin up more threads, even if the database server is the bottleneck. This is why we went with multiple stages, so we could tune the number of threads depending on what they do (the last stage used bulk inserting to insert the aggregated data into temporary staging tables with a small number of threads, moving a lot of data in every statement; this requires some tuning options so as not to totally overload the database side).
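A staged design with a separately tunable worker count per stage can be sketched with BlockingCollection. This is not the code from the project described above, just one way to get the same shape; Stage and Start are hypothetical names. Each stage runs a fixed number of dedicated threads, so the thread pool's own heuristics do not decide how hard the database gets hit.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class Stage
{
    // Starts `workers` long-running tasks that take items from `input`,
    // transform them, and add the results to `output`. `output` is marked
    // complete once `input` is drained and all workers have stopped.
    public static Task Start<TIn, TOut>(
        BlockingCollection<TIn> input,
        BlockingCollection<TOut> output,
        int workers,
        Func<TIn, TOut> transform)
    {
        var tasks = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                // Blocks while `input` is empty; ends when it is completed and drained.
                foreach (var item in input.GetConsumingEnumerable())
                    output.Add(transform(item));
            }, TaskCreationOptions.LongRunning);
        }
        return Task.Factory.ContinueWhenAll(tasks, done =>
        {
            output.CompleteAdding();   // lets the next stage finish
            Task.WaitAll(done);        // rethrows worker exceptions, if any
        });
    }
}
```

Creating the collections with a bounded capacity, for example new BlockingCollection<int>(1000), makes a fast stage wait for a slow one instead of filling memory. A database-heavy last stage would then get a small worker count, and a cheap parsing stage a larger one.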