Re-creating a blocked environment with many threads and high concurrency - C#

We are experiencing an issue where several hundred threads are trying to update a table ID, similar to this post, and sometimes encountering errors such as:
Cannot insert duplicate key in object dbo.theTable. The duplicate key value is (100186).
The method that is executed hundreds of times in parallel runs several stored procedures:
using (var createTempTableCommand = new SqlCommand())
{
    createTempTableCommand.CommandText = createTempTableScript;
    createTempTableCommand.Connection = omniaConnection;
    createTempTableCommand.ExecuteNonQuery();
}

foreach (var command in listOfSqlCommands)
{
    using (var da = new SqlDataAdapter(command))
    {
        da.Fill(dtResults);
    }
}
In order to recreate such an environment/scenario, is it advisable to simply record a trace and then replay it?
How do we recreate an environment with high concurrency?

1. You can avoid all deadlocks/dirty reads only by rewriting your solution to run sequentially instead of in parallel.
2. You can accept some errors and create appropriate error handling. A blocked run, or a run that fails with a duplicate key, can simply be started again.
3. You can try to rewrite your solution so that multiple threads never touch the same rows at the same time. You would have to change your transaction isolation level (https://msdn.microsoft.com/en-us/library/ms709374(v=vs.85).aspx) and switch to row-level locking (probably a combination of the ROWLOCK and UPDLOCK hints). This will minimize your errors, but it cannot handle all of them.
So I recommend option 2. In some solutions it is better to run the command without a transaction - you can then handle failures without blocking other threads and enforce relations in a later step.
And for the "similar post" - the same applies. Error handling is better done in your app. Avoid cursor-based solutions like the one in the similar post, because that goes against database fundamentals. Collect data into sets and work with sets.
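As a rough illustration of option 2, here is a minimal retry sketch (hedged: the work delegate stands in for the code from the question; 2627 is SQL Server's duplicate-key error number and 1205 the deadlock-victim number):

using System;
using System.Data.SqlClient;
using System.Threading;

// Sketch: retry the whole unit of work when SQL Server reports a duplicate
// key (error 2627) or picks this session as a deadlock victim (error 1205).
static void RunWithRetry(Action work, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            work(); // e.g. () => RunStoredProcedures(omniaConnection)
            return;
        }
        catch (SqlException ex) when ((ex.Number == 2627 || ex.Number == 1205)
                                      && attempt < maxAttempts)
        {
            // Another thread won the race; back off briefly and try again.
            Thread.Sleep(TimeSpan.FromMilliseconds(50 * attempt));
        }
    }
}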

I don't think tracing is a good approach to reproducing a high concurrency environment, because the cost of tracing will itself skew the results and it's not really designed for that purpose. Playback won't necessarily be faithful to the timing of the incoming events.
I think you're better off creating specific load tests that hopefully exercise the problems you're encountering: rent some virtual machines and beat the heck out of a load-test database.
Having said that, tracing is a good way to discover what the actual workload is. Sometimes, you're not seeing all the activity that's coming against your database. Maybe there are some "oh yeah" jobs running when the particular problems present themselves. Hundreds of possibilities I'm afraid - and not something that can be readily diagnosed without a lot more clues.
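If you do go the load-test route, even a crude harness can surface these races. A minimal sketch, assuming a placeholder procedure name and connection string:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

// Crude load harness: hammer the same code path from many parallel workers
// to provoke the duplicate-key race.
static async Task HammerAsync(string connectionString, int workers, int iterations)
{
    var tasks = new Task[workers];
    for (var w = 0; w < workers; w++)
    {
        tasks[w] = Task.Run(() =>
        {
            for (var i = 0; i < iterations; i++)
            {
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand("dbo.TheProcUnderTest", conn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    conn.Open();
                    try { cmd.ExecuteNonQuery(); }
                    catch (SqlException ex) { Console.WriteLine(ex.Number); } // count/log errors
                }
            }
        });
    }
    await Task.WhenAll(tasks);
}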

Related

Mongo throwing write conflicts for concurrent sessions

// session is assumed to be an IClientSessionHandle obtained from client.StartSessionAsync()
session.StartTransaction();
await mongo.Collection1.UpdateOneAsync(session, filter1, update1);
await mongo.Collection2.BulkWriteAsync(session, updatesToDifferentDocs);
await mongo.Collection3.UpdateOneAsync(session, filter2, update2);
await session.CommitTransactionAsync();
The above code is running concurrently on multiple threads. The final update, for Collection3, has a high chance of writing to the same document from multiple threads. I wanted the transactions across the 3 collections to be atomic, which is why I put them in one session; that is what I thought a session is essentially used for, but I'm not familiar with the details of its inner workings.
Without knowing much about the built-in features of Mongo, it's pretty obvious why this is giving me a write conflict: I simply can't write to the same document in Collection3 at the same time from multiple threads.
However, I tried Googling a bit, and it seems that Mongo >= 3.2 uses the WiredTiger storage engine by default, which has document-level locks that don't need to be managed by the developer. I've read that it automatically retries the operation if the document was initially locked.
I don't really know if I'm using session incorrectly here, or whether I just have to manually implement some kind of lock/semaphore/queue system. Another option would be to manually check for write conflicts and re-attempt the entire session. But it feels like I'm just reinventing the wheel here if Mongo already has concurrency support.
I should have updated this thread earlier, but nonetheless, here is what I ended up doing to solve my problem. While MongoDB does have automatic retries for transactions, along with some locking mechanisms, I couldn't find a clean way to leverage them for my specific problematic session. I kept getting write conflicts even though I thought I'd acquired locks on all the colliding documents at the start of each session.
Since I have to maintain atomicity for a session that reads and writes across multiple collections, not just documents, I thought it was cleaner to simply wrap it in custom retry logic. I followed the example at the bottom of the page here and used a timeout that I thought was reasonable for my use case.
I decided on timeout-based retry logic because I knew most of the collisions would be tightly packed together temporally. For less predictable collisions, some queue-based mechanism might be better.
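For reference, a minimal version of such a retry wrapper might look like this (a sketch, not the exact code I used; it leans on the driver's IClientSessionHandle API and the TransientTransactionError label, and the overall timeout is illustrative):

using System;
using System.Threading.Tasks;
using MongoDB.Driver;

// Sketch: retry the whole multi-collection transaction on transient
// write conflicts, giving up once an overall timeout is exceeded.
static async Task RunSessionWithRetryAsync(
    IMongoClient client, Func<IClientSessionHandle, Task> work, TimeSpan timeout)
{
    var deadline = DateTime.UtcNow + timeout;
    while (true)
    {
        using (var session = await client.StartSessionAsync())
        {
            session.StartTransaction();
            try
            {
                await work(session);
                await session.CommitTransactionAsync();
                return;
            }
            catch (MongoException ex) when (ex.HasErrorLabel("TransientTransactionError")
                                            && DateTime.UtcNow < deadline)
            {
                await session.AbortTransactionAsync();
                // Another thread hit the same document; run the whole session again.
            }
        }
    }
}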
The automatic retry behavior is different for transactions in MongoDB. By default, a transactional write operation will wait 5 ms and abort the transaction if a lock cannot be acquired.
You could try increasing the 5 ms timeout by running the following admin command on your MongoDB instance:
db.adminCommand( { setParameter: 1, maxTransactionLockRequestTimeoutMillis: 100 } )
More details about that can be found here.
Alternatively, you could manually implement some retry logic for failed transactions.

How many concurrent statements does SqlConnection support

How many concurrent statements does C# SqlConnection support?
Let's say I am working on Windows service running 10 threads. All threads use the same SqlConnection object but different SqlCommand object and perform operations like select, insert, update and delete on either different tables or same table but different data. Will it work? Will a single SqlConnection object be able to handle 10 simultaneous statements?
How many concurrent statements does C# SqlConnection support?
You can technically have multiple "in-flight" statements, but only one actually executing.
A single SqlConnection maps to a single connection and session in SQL Server, and a session can only have a single request active at a time. If you enable MultipleActiveResultSets (MARS), you can start a new query before the previous one has finished, but the statements are interleaved, never run in parallel.
MARS enables the interleaved execution of multiple requests within a single connection. That is, it allows a batch to run, and within its execution, it allows other requests to execute. Note, however, that MARS is defined in terms of interleaving, not in terms of parallel execution.
And
execution can only be switched at well defined points.
https://learn.microsoft.com/en-us/sql/relational-databases/native-client/features/using-multiple-active-result-sets-mars?view=sql-server-ver15
So you can't even guarantee that another statement will run whenever one becomes blocked. If you want to run statements in parallel, you need to use multiple SqlConnections.
Note also that a single query might use a parallel execution plan, and have multiple tasks running in parallel.
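To make the "one connection per parallel statement" point concrete, here is a minimal sketch (the statements and connection string are placeholders; connection pooling makes opening a connection per task cheap):

using System.Data.SqlClient;
using System.Threading.Tasks;

// Each task opens its own pooled connection, so the statements really run
// in parallel instead of being interleaved on a single session.
static Task RunInParallelAsync(string connectionString, string[] statements)
{
    var tasks = new Task[statements.Length];
    for (var i = 0; i < statements.Length; i++)
    {
        var sql = statements[i];
        tasks[i] = Task.Run(async () =>
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                await conn.OpenAsync();
                await cmd.ExecuteNonQueryAsync();
            }
        });
    }
    return Task.WhenAll(tasks);
}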
David Browne gave you the answer to the question you asked, but there might be something else you need to know:
Let's say I am working on Windows service running 10 threads. All threads use the same SqlConnection object but different SqlCommand object and perform operations like select, insert, update and delete on either different tables or same table but different data.
This design just seems wrong on several fronts:
You keep a disposable resource around and open. My rule for disposable stuff is: "Create. Use. Dispose. All in the same piece of code, ideally using a using block." Keeping disposable stuff around, or even sharing it between threads, is just not worth the danger of forgetting to close it (see the sketch after this list).
There is no performance advantage: SqlConnection uses internal connection pooling without any side effects. And even if there were a relevant speed advantage, it would not be worth the dangers.
You are using multithreading with database access. Multithreading is one way to implement multitasking, but not one you should use until you need it. Multithreading is only useful for CPU-bound work; otherwise you should generally be using async/await or similar approaches. DB operations are either disk- or network-bound.
There is one exception to this rule, and that is if your application is a server. Servers are the rare example of something being pleasingly parallel, so having a large thread pool to process incoming requests in parallel is very common. It is rather rare that you write one of those yourself, however; mostly you just run your code in an existing server infrastructure that deals with that.
If you do have heavy CPU work, chances are you are retrieving too much. It is a very common beginner's mistake to retrieve a lot of data and then filter it in C# code. Do not do that. Do as much filtering and processing as possible in the query. You will not be able to beat the speed of the DB server, and at best you tie up your network pointlessly.
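A minimal sketch of the "Create. Use. Dispose." rule from the first point (the table name and query are invented for illustration; the pool keeps the physical connection alive, so this is cheap even per operation):

using System.Data.SqlClient;

// Open, use, and dispose the connection in one place. Connection pooling
// means this does not pay the cost of a new physical connection each time.
static int CountRows(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT COUNT(*) FROM dbo.SomeTable", conn))
    {
        conn.Open();
        return (int)cmd.ExecuteScalar();
    }
}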

Use of multi threading to execute SQL statements

I have code that carries out data retrieval - basically it executes anything from 3 to 12 SQL (Oracle) read statements to retrieve data about an object.
Unfortunately it's running slowly. No SQL statement is slow in particular; it's just that I have so many of them, and at around 0.2 seconds per statement that can mean over 2 seconds for the code to complete.
I am looking into ways of improving the performance. One way is to merge some of the tables into a single query (which can reduce the combined time by 0.5 seconds). However, it doesn't make sense to merge the rest, since there will only be data there under certain circumstances, and trying to determine when there is data there to marshal could get tricky.
I am considering introducing threading into my program, so that after the initial query I would spawn a thread for each of the other queries and they would execute at the same time. However, I have never used threading and am wary of introducing deadlocks or other pitfalls.
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object from different threads, though through different sections/fields within the object)? Would it be better to return the results and marshal them into the object after all the threads have finished?
I know these types of questions are hard to answer since it's more about general advice, but I would appreciate it if anyone thought it was a good idea, or had other suggestions.
If you are only reading (SELECT from), don't worry about deadlocks: Oracle reads are (mostly) not blockable. The biggest problem with threading queries to Oracle is how to deal with connections. Creating a connection, running a query and closing the connection is very, very, very bad. Connections are expensive. They are also limited, so you don't want to create a million connections to execute your logic.
As a result, you would use some sort of connection pool and put your queries in a queue.
Also, I hope you are using bind variables and not string concatenation to pass queries to Oracle.
In general, I would collect all the data (preferably in one query) and only then update the object. You could also consider breaking your object into sections.
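A minimal sketch of that shape, assuming the Oracle managed ADO.NET provider (Oracle.ManagedDataAccess.Client) and queries that each take a single :id bind variable; the provider's connection pool is used simply by opening and closing connections, and the object is only updated after every query has finished:

using System.Data;
using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

// Run the read queries in parallel on pooled connections, collect the
// results, and marshal them into the target object only afterwards.
static async Task<DataTable[]> FetchAllAsync(string connString, string[] queries, int objectId)
{
    var tasks = new Task<DataTable>[queries.Length];
    for (var i = 0; i < queries.Length; i++)
    {
        var sql = queries[i];
        tasks[i] = Task.Run(() =>
        {
            using (var conn = new OracleConnection(connString))
            using (var cmd = new OracleCommand(sql, conn))
            {
                cmd.Parameters.Add(new OracleParameter("id", objectId)); // bind variable, no concatenation
                conn.Open();
                var table = new DataTable();
                using (var reader = cmd.ExecuteReader())
                {
                    table.Load(reader);
                }
                return table;
            }
        });
    }
    return await Task.WhenAll(tasks); // marshal into the object after this line
}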
Threading works perfectly. Two years ago I did a project that used a multi-stage/multi-threading approach to push data into an Oracle database (and pull some data out of it for updates).
I basically used a staged approach (a request would go through multiple stages, get consumed there, and new data would be pushed to the next stage), and every stage used a configurable thread pool that would take a message, process it, and post the new messages.
We used, I think, close to 200 threads at the time to process about a million SQL statements per minute (hitting an Oracle Exadata that was really getting some work out of that).
So, multithreading "just works" - obviously if you know how to do it, and you have to get your architecture and the SQL statements nice and non-blocking. Databases in general are perfectly capable of handling multiple threads.
Now, for details: THAT DEPENDS.
Example:
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads though different sections/fields within the object?)
Absolutely no problem as long as:
you make sure all updates are finished before moving the object to the next phase, and
the updates do not overlap or depend on one another (1 must finish for 2 to have the required data).
These are implementation details, and it is really hard to give a generic answer for them (totally impossible). Especially as this is multithreading 101 - and has nothing to do with any database access.
In general, you will also have to tune the number of threads. .NET cannot do that itself, as it will see that the CPU is not busy and spawn more threads, even if the database server is the bottleneck. This is why we went with multiple stages - so we could tune the number of threads depending on what they do (the last stage used bulk inserting to put the aggregated data into temporary staging tables with a small number of threads, moving a lot of data in every statement - this requires some tuning possibilities so as not to totally overload the database side).
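A minimal sketch of that staged shape using BlockingCollection; Transform and Persist are stand-ins for the real work, and the per-stage worker counts are exactly the knobs being tuned:

using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Two-stage pipeline: stage 1 processes requests and posts results to the
// next stage's queue; stage 2 drains that queue (e.g. batching bulk inserts).
static void RunPipeline(BlockingCollection<string> requests,
                        int stage1Workers, int stage2Workers)
{
    var stage2Queue = new BlockingCollection<string>(boundedCapacity: 1000);

    var stage1 = Enumerable.Range(0, stage1Workers).Select(_ => Task.Run(() =>
    {
        foreach (var item in requests.GetConsumingEnumerable())
            stage2Queue.Add(Transform(item));
    })).ToArray();

    var stage2 = Enumerable.Range(0, stage2Workers).Select(_ => Task.Run(() =>
    {
        foreach (var item in stage2Queue.GetConsumingEnumerable())
            Persist(item);
    })).ToArray();

    Task.WaitAll(stage1);
    stage2Queue.CompleteAdding(); // stage 1 is done; let stage 2 drain and exit
    Task.WaitAll(stage2);
}

static string Transform(string item) => item.ToUpperInvariant(); // stand-in work
static void Persist(string item) { /* e.g. collect into batches and bulk insert */ }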

Best Option for Queuing Status Updates .NET

I'm looking for some feedback in regards to the best option for a problem I am working on.
To give you some background: I recently inherited a broken business application (our project was using it, so we gained the responsibility to fix it). I come from a SharePoint development background, so I know a little C#, ASP.NET and SQL.
Currently we have an issue with the application where we continually receive timeout errors, I have narrowed it down to the web application calling a bunch of stored procedures to update status fields in other tables when something changes that might affect the status of other objects.
Without completely overhauling this application I have determined our best option is to offload these stored procedures to run in the background and not be tied to the UI. I've looked at a couple of options including:
Creating a separate thread to handle the execution. (Still times out)
Using BackgroundWorker (still times out, obviously it shouldn't but I can't seem to find out what is causing it to wait for the BackgroundWorker to finish)
Moving the Stored Proc execution to a job, which I then call from another SP. (This works, but the limitation is that I can only have one job running at once, and if multiple users update objects they then receive an exception because the job won't start)
Right now we have moved these stored procedures into a twice a day script, which updates all objects, however this is only a temporary fix.
I have two options that I'm looking at, and I'm hoping to get some guidance on the implementation of whatever you consider to be the best option:
Continue using the job and have the executing stored proc queue up items in a db which the job will loop through until empty. The executing stored proc will have to check if the job is running when it adds a new entry and then act accordingly.
It's been recommended that I look at using Service Broker, but I am not familiar with its use at all. I understand that it would likely be a better overall solution, as it allows me to queue up these updates in a more transactional way.
I think both these options are viable, although I need some help in understanding the implementation of the second option. My other dilemma is that, with these stored procedures running anywhere from 45 seconds to 20 minutes, how can I notify the user who changed the object that his/her updates have been made? This is where I fall back to using the job, because I could simply add a user field to the 'queue' and have the stored proc send a quick email at the end.
Thoughts, suggestions? Maybe I'm over-thinking this?
If you are on .NET 4.5 and C# 5.0, use async/await; if you are on .NET 4.0, use the TPL. They share (almost) the same underlying machinery, and the async feature is built on top of the TPL (with some extra internals).
In either case the TPL would be a proper choice.
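As a rough sketch of what that offloading could look like (the procedure name and parameter are placeholders; note that work queued this way dies with the host process, so logging inside the task matters):

using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

// Kick the slow status-update procedure off the request thread so the UI
// returns immediately.
static void QueueStatusUpdate(string connectionString, int objectId)
{
    Task.Run(() =>
    {
        try
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("dbo.UpdateStatuses", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.CommandTimeout = 0; // these runs can take many minutes
                cmd.Parameters.AddWithValue("@ObjectId", objectId);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex); // nothing else observes this task's exceptions
        }
    });
}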
Sounds like Service Broker would be an excellent solution to this problem. It's true that there is a bit of a learning curve to climb to get your head round how it works, but it's fundamentally pretty simple especially when your implementation is in a single database.
There's a good (and mercifully short) intro to how it works at http://msdn.microsoft.com/en-US/library/ms345108(v=SQL.90).aspx
Have a look at Asynchronous Procedure Execution. But I would look first if the updates can be improved, perhaps a simple index can eliminate the timeouts, and/or try to leverage snapshot isolation. These would be much simpler to try out w/o committing to 'major overhaul' of the application code.
I must also urge you to read Waits and Queues. This is a SQL Server methodology for identifying performance bottlenecks, and it is a great way of narrowing down 'timeout' problems to something more actionable (blocking, IO, indexes etc.).

Scalability and availability

I am quite confused on which approach to take and what is best practice.
Let's say I have a C# application which does the following:
sends emails from a queue. The emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB and they would only pull their own records - but this I consider to be cheap, bad practice and unrealistic.
Having a DB table row lock as well, is not something I would do due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be threading per item to process, or batching items up per thread for X threads.
How should I approach this, and what do you recommend for making a C# application which is scalable and highly available? The aim is to have X servers, each running the same application, each able to pull records and process them, with the workload shared amongst the servers so that if one server or service fails, the others can take on that load until another server is brought back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot, and losing sleep trying to come up with a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back X records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods, to ensure there is no overlap of apps processing the same records (a sketch of an atomic claim step follows below).
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which are In Progress but have gone beyond a time threshold without being set to Complete. They would then be ready for reprocessing by another app later on.
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as unsophisticated and work just fine. The best things work with the least complexity.
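One way to make that claim step atomic without long table locks is a single UPDATE ... OUTPUT with the READPAST hint, so competing servers simply skip rows another instance has already locked. A hedged sketch, with invented table and column names:

using System.Collections.Generic;
using System.Data.SqlClient;

// Atomically claim a batch of pending rows. READPAST lets competing servers
// skip rows locked by someone else, so no two apps claim the same record.
static List<int> ClaimBatch(string connectionString, int batchSize)
{
    const string sql = @"
        UPDATE TOP (@BatchSize) dbo.EmailQueue WITH (ROWLOCK, READPAST)
        SET Status = 'InProgress', ClaimedAt = SYSUTCDATETIME()
        OUTPUT inserted.Id
        WHERE Status = 'Pending';";

    var ids = new List<int>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@BatchSize", batchSize);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                ids.Add(reader.GetInt32(0));
    }
    return ids;
}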
