I'm looking for some strategic help here, since I am new to TPL.
Situation
I have an application that coordinates data between 2 disparate LOB systems, ones that do not talk to each other. So, it looks a bit like:
[ System 1 ] < ----- [ App ] ----- > [ System 2 ]
During its processing, the app performs the following tasks:
App creates connections to System 1 and System 2, verifying each one is available. (The connection to System 1 must screen-scrape a web application.)
App requests a list of IDs from System 1.
The app then works through this list item by item. For each item:
App requests data from System 1. This system provides no service interface, so the app uses WebRequest to issue both GET and POST requests. In addition to the scraped web page data, a file may also be downloaded.
With the data from System 1, the app submits data to System 2 via web service calls. Several calls may be made per item, and a file may be uploaded.
There are often tens of thousands of items in the loop. There is no dependency between these items, so they seem to be a good candidate for Task-based processing.
However, at most, there can be about 20 connections to System 1 and about 10 connections to System 2. So, the simple idea of just creating and destroying sessions for each item in the loop (like you might do in a simple Parallel.ForEach Task) would be prohibitively costly. Rather, I want to share the connections, in effect, creating a connection pool of sorts. That pool would be created before the tasks started up. When each Task starts its work, it would basically wait until it could get a connection from the pool. Once the task is complete, the connection would be released, and another Task could get ahold of it. In this case, the Scheduler limit is not just the CPUs; it's also the maximum number of connections to System 2.
Desire
I'm looking for the approach. I don't mind doing the work to figure out the implementation, but I need the best strategic approach.
How do I get the task loop to work with a limited number of these connections? Or do I have to go back to the old style of thread allocation, and manually pass the freed-up connections to new threads as the running ones complete their tasks? Some kind of mutex array? If so, how will the Tasks grab an open connection? Some type of concurrent bag? Or am I just going the wrong way?
Any help would be greatly appreciated.
I think a BlockingCollection for each connection pool will work well. If a thread attempts to get a connection from an empty pool, that thread will be blocked until another thread returns a connection to the pool.
You should also set MaxDegreeOfParallelism to the size of the bigger pool, so that you don't create unnecessarily many threads, most of which would just be waiting to get a connection from the pool.
With that, your code could look like this:
var connection = serviceAConnections.Take();
// use the connection
serviceAConnections.Add(connection);
But a better approach might be to add a level of abstraction over that:
using (var connectionHolder = serviceAConnections.Get())
{
    var connection = connectionHolder.Connection;
    // use the connection
}
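A minimal sketch of that abstraction, assuming the pool simply wraps a BlockingCollection<T> (ConnectionPool<T> and ConnectionHolder<T> are illustrative names, not framework types):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class ConnectionPool<T>
{
    private readonly BlockingCollection<T> connections = new BlockingCollection<T>();

    public ConnectionPool(IEnumerable<T> initialConnections)
    {
        foreach (var connection in initialConnections)
            connections.Add(connection);
    }

    // Blocks until a connection becomes available.
    public ConnectionHolder<T> Get()
    {
        return new ConnectionHolder<T>(this, connections.Take());
    }

    internal void Return(T connection)
    {
        connections.Add(connection);
    }
}

// Disposable wrapper so a 'using' block returns the connection
// to the pool even if the work throws.
public sealed class ConnectionHolder<T> : IDisposable
{
    private readonly ConnectionPool<T> pool;

    internal ConnectionHolder(ConnectionPool<T> pool, T connection)
    {
        this.pool = pool;
        Connection = connection;
    }

    public T Connection { get; private set; }

    public void Dispose()
    {
        pool.Return(Connection);
    }
}

With this, Get() blocks in Take() until a connection is free, and disposing the holder returns it, which is exactly the wait/release behavior described in the question. Combine it with ParallelOptions.MaxDegreeOfParallelism as noted above.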
Related
I have to refactor a fairly time-consuming process in one of my applications, and after doing some research I think it's a perfect match for the TPL. I wanted to clarify my understanding of it and ask whether there are any other issues I should take into account.
In short, I have a Windows service which runs overnight and sends out emails with data updates to around 10000 users. At present, the whole process takes around 8 hrs to complete. I would like to reduce it to 2 hrs max.
Application workflow follows steps below:
1. Iterate through all users list
2. Check if this user has to be notified
3. If so, create an email body by calling external service
4. Send an email
Analysis of the code has shown that step 3 is the most time-consuming and takes around 3.5 sec to complete. This means that when processing 10000 users, my application waits well over 6 hrs in total for responses from the external service! I think this is a good enough reason to try to introduce some asynchronous and parallel processing.
So, my plan is to use the Parallel class and its ForEach method to iterate through the users in step 1. As I understand it, this should distribute the processing of each user onto separate threads, making them run in parallel. The processes are completely independent of each other, and none of them returns a value; any exception thrown will be persisted to the logs database. As for step 3, I would like to convert the call to the external service into an async call. As I understand it, this would release the resources on the thread so it could be reused by the Parallel class to start processing the next user in the list.
I had a read through the MS documentation regarding the TPL, especially the Potential Pitfalls in Data and Task Parallelism document, and the only point I'm not sure about is "Avoid Writing to Shared Memory Locations". I am using a local integer to count the total number of emails processed. As for all the rest, I'm quite positive they're not applicable to my scenario.
My question, before I start any implementation: is what I'm trying to achieve possible (especially the async/await part for the external service call)? Should I be aware of any other obstacles that might affect my implementation? Is there any better way of improving the workflow?
Just to clarify, I'm using .NET v4.0.
Yes, you can use the TPL for your problem. If you cannot influence your external service, this might be the best way.
However, you can make the biggest gains if you can get your external source to accept batches, because that is where the performance can really be optimized. Right now you have the overhead of 10000 messages to serialize, send, work on, receive, and deserialize; that is work which could be done once. In addition, your external source might be able to optimize its own work if it knows it will receive multiple records.
So the bottom line is: if you need to optimize locally, the TPL is fine. If you want to optimize your whole process for actual gains, try to find out if your external source can help you, because that is where you can make some real progress.
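As a rough sketch of the batching idea (IEmailBodyService and its BuildEmailBodies method are hypothetical; the real gain depends on your external service actually offering a batch endpoint):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical batch interface -- substitute the real service client.
public interface IEmailBodyService
{
    // One round trip returns the bodies for a whole batch of users.
    IList<string> BuildEmailBodies(IList<User> batch);
}

public class User
{
    public string Email;
}

public static class BatchMailer
{
    public static void Run(IList<User> users, IEmailBodyService service, Action<User, string> send)
    {
        const int batchSize = 100; // 10000 users -> 100 round trips instead of 10000

        for (int i = 0; i < users.Count; i += batchSize)
        {
            var batch = users.Skip(i).Take(batchSize).ToList();
            var bodies = service.BuildEmailBodies(batch);

            for (int j = 0; j < batch.Count; j++)
                send(batch[j], bodies[j]);
        }
    }
}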
You didn't show any code, and I'm assuming that step 4 (send an e-mail) is not that fast either.
In the presented case, unless your external service from step 3 (create an email body by calling external service) processes requests in parallel and supports a good load of simultaneous requests, you will not gain much from this refactor.
In other words, test the external service and the e-mail server first for:
Parallel request execution
The way to test this is to send at least 2 simultaneous requests and observe how long it takes to process them.
If it takes about double the time of a single request, the requests are being processed serially to some degree: either they're queued or some broad lock is being taken.
Load test
Go up to 4, 8, 12, 16, 20, etc., and see where it starts to degrade.
You should set a limit on the number of simultaneous requests: something that keeps execution time above e.g. 80% of the time it takes to process a single request, assuming you're the sole consumer.
Or set it a few requests below the point where degradation starts (e.g. divide by the number of consumers) to leave the external service available for other consumers.
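As a rough illustration of such a probe (callService stands in for one request to the external service; the concurrency steps are arbitrary):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class LoadProbe
{
    public static void Run(Action callService)
    {
        foreach (var concurrency in new[] { 1, 2, 4, 8, 12, 16, 20 })
        {
            var stopwatch = Stopwatch.StartNew();

            // Fire 'concurrency' simultaneous requests and wait for all of them.
            var tasks = Enumerable.Range(0, concurrency)
                .Select(_ => Task.Factory.StartNew(callService))
                .ToArray();
            Task.WaitAll(tasks);

            stopwatch.Stop();

            // If elapsed time grows roughly linearly with concurrency,
            // the service is processing requests serially.
            Console.WriteLine("{0} parallel requests: {1} ms", concurrency, stopwatch.ElapsedMilliseconds);
        }
    }
}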
Only then can you decide if the refactor is worth it. If you can't change the external service or the e-mail server, you must weigh whether they offer enough parallel capability without degrading.
Even so, be realistic. Don't let your service push the external service and the e-mail server to their limits in production.
I have a Windows Service that has code similar to the following:
List<Buyer> buyers = GetBuyers();
var results = new List<Result>();

Parallel.ForEach(buyers, buyer =>
{
    // do some prep work, log some data, etc.
    // call out to an external service that can take up to 15 seconds each to return
    // (note: List<T>.Add is not thread-safe; a ConcurrentBag<Result> or a lock
    // would be needed to make this collection step safe)
    results.Add(Bid(buyer));
});

// The Parallel.ForEach must have completed by the time this code executes
foreach (var result in results)
{
    // do some work
}
This is all fine and good and it works, but I think we're suffering from a scalability issue. We average 20-30 inbound connections per minute and each of those connections fire this code. The "buyers" collection for each of those inbound connections can have from 1-15 buyers in it. Occasionally our inbound connection count sees a spike to 100+ connections per minute and our server grinds to a halt.
CPU usage is only around 50% on each server (two load balanced 8 core servers) but the thread count continues to rise (spiking up to 350 threads on the process) and our response time for each inbound connection goes from 3-4 seconds to 1.5-2 minutes.
I suspect the above code is responsible for our scalability problems. Given this usage scenario (parallelism for I/O operations) on a Windows Service (no UI), is Parallel.ForEach the best approach? I don't have a lot of experience with async programming and am looking forward to using this opportunity to learn more about it, figured I'd start here to get some community advice to supplement what I've been able to find on Google.
Parallel.ForEach has a terrible design flaw: it is prone to consume all available thread-pool resources over time. The number of threads it will spawn is practically unlimited. You can get up to 2 new ones per second, driven by heuristics that nobody understands. The CoreCLR has a hill-climbing algorithm built into it that just doesn't work.
"call out to an external service"
You should probably find out the right degree of parallelism for calling that service. You need to find that out by testing different amounts.
Then, you need to restrict Parallel.ForEach to spawn only as many threads as you want at a maximum. You can do that using a TaskScheduler with fixed concurrency (or, more simply, ParallelOptions.MaxDegreeOfParallelism).
Or, you can change this to use async I/O with SemaphoreSlim.WaitAsync, so that no threads are blocked at all. That solves the pool exhaustion, and the overloading of the external service as well.
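A minimal sketch of that approach (BidAsync is a hypothetical async version of the Bid call from the question, and the limit of 10 is a placeholder you would tune against the external service):

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-ins for the types in the question's code.
public class Buyer { }
public class Result { }

public static class BidProcessor
{
    // Caps how many external calls are in flight at once.
    private static readonly SemaphoreSlim throttle = new SemaphoreSlim(10);

    public static async Task<Result[]> ProcessAsync(IEnumerable<Buyer> buyers)
    {
        var tasks = buyers.Select(async buyer =>
        {
            await throttle.WaitAsync(); // waits without blocking a thread
            try
            {
                return await BidAsync(buyer);
            }
            finally
            {
                throttle.Release();
            }
        });

        return await Task.WhenAll(tasks);
    }

    // Placeholder for the real external service call.
    private static async Task<Result> BidAsync(Buyer buyer)
    {
        await Task.Delay(100); // simulate I/O latency
        return new Result();
    }
}

Because the throttle, not the thread pool, limits how many calls run at once, the thread count stays flat even when inbound connections spike.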
We are running an HTTP API and want to be able to set a limit on the number of requests a user can make per time unit. When this limit has been reached, we don't want the users to receive errors such as HTTP 429. Instead, we want to increase the response times. The result is that users can continue to work, just more slowly, and can then choose whether or not to upgrade their paying plan. This could quite easily be implemented using Thread.Sleep (or something similar) for x number of seconds on all requests of a user who has passed the limit.
We think that in the worst case there might be a problem with the number of possible connections to a single server: as long as we keep delaying the response, we keep a connection open, limiting the number of other possible connections.
All requests to the API run asynchronously. The server itself is built to be scalable and runs behind a load balancer; we can start up additional servers if necessary.
When searching for this type of throttling, we found very few examples of this way of limiting users, and the examples we found seemed not at all concerned about connections running out. So we wonder: is this not a problem?
Are there any downsides to this that we are missing, or is this a feasible solution? How many connections can we have open simultaneously before we start running into problems? Can our vision be achieved in another way, without giving errors to the user?
Thread.Sleep() is pretty much the worst possible thing you can do on a web server. It doesn't matter that you are running things asynchronously, because that only applies to I/O-bound operations, where the thread is freed to do more work.
By using a Sleep() command, you will effectively be taking that thread out of commission for the time it sleeps.
ASP.Net App Pools have a limited number of threads available to them, and therefore in the worst case scenario, you will max out the total number of connections to your server at 40-50 (whatever the default is), if all of them are sleeping at once.
Secondly
This opens up a major attack vector for denial of service (DoS). If I am an attacker, I could easily take out your entire server by spinning up 100 or 1000 connections, all using the same API key. Using this approach, the server will dutifully start putting all the threads to sleep, and then it's game over.
UPDATE
So you could use Task.Delay() in order to insert an arbitrary amount of latency in the response. Under the hood it uses a Timer which is much lighter weight than using a thread.
await Task.Delay(numberOfMilliseconds);
However...
This only takes care of one side of the equation. You still have an open connection to your server for the duration of the delay. Because connections are a limited resource, this still leaves you vulnerable to a DoS attack that wouldn't normally exist.
This may be an acceptable risk for you, but you should at least be aware of the possibility.
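For example, a sketch of what that could look like in a Web API action (RateLimiter and its PenaltyFor method are hypothetical placeholders):

using System.Threading.Tasks;
using System.Web.Http;

public class DataController : ApiController
{
    private static readonly RateLimiter rateLimiter = new RateLimiter();

    public async Task<IHttpActionResult> Get(string apiKey)
    {
        var penaltyMs = rateLimiter.PenaltyFor(apiKey);
        if (penaltyMs > 0)
        {
            // Timer-based wait: the connection stays open,
            // but the thread goes back to the pool meanwhile.
            await Task.Delay(penaltyMs);
        }

        return Ok("data");
    }
}

// Placeholder: a real implementation would track requests per key per time window.
public class RateLimiter
{
    public int PenaltyFor(string apiKey)
    {
        return 0;
    }
}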
Why not simply add a "Please Wait..." indicator on the client to make it artificially look like it's processing? Adding artificial delays on the server costs you: it leaves connections, and possibly threads, tied up unnecessarily.
We've built this app that needs to have some calculations done on a remote machine (actually a MatLab server). We're using web services to connect to the MatLab server and perform the calculations.
In order to speed things up, we've used Parallel.ForEach() to have multiple service calls going at the same time. If we're very conservative and set ParallelOptions.MaxDegreeOfParallelism (DOP) to 4 or so, everything works fine.
However, if we let the framework decide on the DOP, it will spawn so many threads that it forces the remote machine to its knees, and timeouts start occurring (> 10 minutes).
How can we solve this issue? What I would LOVE to be able to do is use the response time to throttle the calls: if response time is less than 30 sec, keep adding threads; as soon as it's over 30 sec, use fewer. Any suggestions?
N.B. Related to the response in this question: https://stackoverflow.com/a/20192692/896697
The simplest way would be to tune for the best number of concurrent requests and hardcode it, as you have done so far; however, there are some nicer options if you are willing to put in some effort.
You could move from Parallel.ForEach to a thread pool. That way, as things come back from the remote server, you can manually or programmatically tune the number of available threads, reducing or increasing it as things slow down or speed up, or even killing threads if needed.
You could also do a variant of the above using Tasks, which are the newer way of doing parallel/async work in .NET.
Another option would be to use a timer and/or jobs model to schedule jobs every x milliseconds, which could then be throttled or relaxed as results return from the server. The easiest way to get started would be Quartz.Net.
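As a sketch of the response-time-driven idea (the 30-second threshold and slot counts mirror the question; callService stands in for one web-service call, and the adjustment logic is deliberately crude):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public class AdaptiveThrottle
{
    private readonly SemaphoreSlim slots = new SemaphoreSlim(4); // conservative start
    private readonly object adjustLock = new object();
    private int extraSlots = 0;
    private const int MaxExtraSlots = 12;

    public async Task InvokeAsync(Func<Task> callService)
    {
        await slots.WaitAsync();
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await callService();
        }
        finally
        {
            stopwatch.Stop();
            slots.Release();
            Adjust(stopwatch.Elapsed);
        }
    }

    private void Adjust(TimeSpan responseTime)
    {
        lock (adjustLock)
        {
            if (responseTime < TimeSpan.FromSeconds(30) && extraSlots < MaxExtraSlots)
            {
                slots.Release(); // fast response: allow one more concurrent call
                extraSlots++;
            }
            else if (responseTime > TimeSpan.FromSeconds(30) && extraSlots > 0 && slots.Wait(0))
            {
                extraSlots--; // slow response: permanently retire one slot
            }
        }
    }
}

A production version would smooth response times over a window rather than reacting to every single call.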
I am doing a project that needs to communicate with 20 small computer boards. I will need to keep track of their connections, and they will also return some data to me. So my aim is to build a control/monitoring system for these boards.
I will be using Visual Studio 2010 and C# WPF.
My idea/plan would be like this:
On the main thread:
There will be only one control window, so a main thread will be created mainly to update the data to be displayed. Data for each board will be displayed and refreshed at an interval of 1 s. The source of the data will be a database in which the main thread will look for the latest data (I have not decided which kind of database to use yet).
There will be control buttons on the control window too. I already have a .dll library, so I will only need to call the functions inside to direct the boards to action (by starting another thread).
There will be two services:
(Timer service) One will be a scheduled timer to turn the boards on/off at specific times. Users would be able to change the on/off times, which the service would read from the database.
(Connection service) The other will be responsible for asking for and receiving information/status from each board every 30 s or less. The work would include connecting to the board through the internet, asking for data, receiving the data, and then writing the data to the database, as well as logging the exceptions thrown if the internet connection fails.
My questions:
1) For the connection service, I am wondering if I should start 20 threads to do this, one thread per connection to a board. If the connections were made by only one thread, each board connection would have to wait for the previous one to finish, which may take 1-2 minutes per board, so I would need around 20-40 minutes to get all the data back. But if I separate the connections across 20 threads, will it make a big difference in performance? The 20 threads never die; they keep asking for data every 30 s if possible. Besides, does that mean I will have to have 20 databases, or would the database clash if 20 threads write to it at the same time?
2) For updating the display of data on the main thread every 1 s, should I also start a service to do this? And as the connection service is accessing the same database, will this clash with the database?
There will be more than 100 boards to control and monitor in the future, so I would like to make the program as light as possible.
Thank you very much! Comments and ideas very much appreciated!
Starting 20 threads would be the best bet (or, as Ralf said, use a thread when needed; in your specific case it would probably reach 20 at some point). Most databases are thread-safe, meaning you can write to them from separate threads. If you use a "real" database, this isn't an issue at all.
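A minimal sketch of the per-board polling threads (PollBoard and LogError are placeholders for the .dll calls and the logging from the question):

using System;
using System.Threading;

public static class BoardMonitor
{
    public static void Start(int boardCount)
    {
        for (int i = 0; i < boardCount; i++)
        {
            int boardId = i; // capture a copy for the closure

            var thread = new Thread(() =>
            {
                while (true)
                {
                    try
                    {
                        PollBoard(boardId); // connect, request data, write to DB
                    }
                    catch (Exception ex)
                    {
                        LogError(boardId, ex); // e.g. internet connection failed
                    }

                    Thread.Sleep(TimeSpan.FromSeconds(30));
                }
            });

            thread.IsBackground = true; // don't keep the process alive on shutdown
            thread.Start();
        }
    }

    private static void PollBoard(int boardId) { /* placeholder */ }
    private static void LogError(int boardId, Exception ex) { /* placeholder */ }
}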
No, use a Timer on the main thread to update your UI. The UI can easily read from the DB. As long as the update action itself is not taking a lot of time, it is OK to do it on the UI thread.
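In WPF the natural choice is a DispatcherTimer, which ticks on the UI thread. A sketch (boardsGrid and LoadLatestReadings are hypothetical; substitute your own control and database query):

using System;
using System.Collections.Generic;
using System.Windows;
using System.Windows.Threading;

public partial class ControlWindow : Window
{
    private readonly DispatcherTimer refreshTimer = new DispatcherTimer();

    public ControlWindow()
    {
        InitializeComponent();

        refreshTimer.Interval = TimeSpan.FromSeconds(1);
        refreshTimer.Tick += (sender, e) =>
        {
            // Tick fires on the UI thread, so controls can be updated directly.
            // boardsGrid would be defined in the window's XAML.
            boardsGrid.ItemsSource = LoadLatestReadings();
        };
        refreshTimer.Start();
    }

    // Placeholder for the query that reads the newest rows from the database.
    private List<BoardReading> LoadLatestReadings()
    {
        return new List<BoardReading>();
    }
}

public class BoardReading { }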
1) Why not use threads as needed? You can use one DBMS; they are built to process large amounts of information.
2) Not sure what you mean by starting a service for the UI thread. As with 1), database management systems are built to process data.