I'm trying to implement a .NET 4 helper/utility class that retrieves HTML page sources based on a URL list, for a web-testing tool. The solution should be scalable and have high performance.
I have been researching and trying different solutions for many days now, but cannot find a proper solution.
Based on my understanding, the best way to achieve my goal would be to use asynchronous web requests running in parallel using TPL.
In order to have full control over headers etc. I'm using HttpWebRequest/HttpWebResponse instead of WebClient (which wraps them). In some cases the output should be chained to other tasks, so using TPL tasks could make sense.
What I have achieved so far after many different trials/approaches:
Implemented basic synchronous, asynchronous (APM), and parallel (using TPL tasks) solutions to compare the performance levels of the different approaches.
To see the performance of an asynchronous parallel solution I used the APM approach (BeginGetResponse and BeginRead) and ran it inside Parallel.ForEach. Everything works fine and I'm happy with the performance, but somehow I feel that a plain Parallel.ForEach is not the way to go; for example, I don't know how I would use task chaining with it.
Then I tried a more sophisticated system, wrapping the APM calls in tasks using TaskCompletionSource and an iterator that iterates through the APM flow. I believe this solution could be what I'm looking for, but there is a strange delay, somewhere between 6 and 10 seconds, which happens 2-3 times when running a 500-URL list.
Based on the logs, when the delay happens execution has gone back to the thread that calls the async fetch in a loop. The delay doesn't occur every time execution moves back to the loop, just 2-3 times; the rest of the time it works fine. It looks as though the looping thread creates a set of tasks that get processed by other threads, and once most or all of those tasks are completed there is a delay (6-8 s) before the loop continues creating the remaining tasks and the other threads become active again.
The principle of the iterator inside the loop is:
IEnumerable<Task> DoExample(string input)
{
    var aResult = DoAAsync(input);
    yield return aResult;                    // suspend until A completes
    var bResult = DoBAsync(aResult.Result);
    yield return bResult;                    // suspend until B completes
    var cResult = DoCAsync(bResult.Result);
    yield return cResult;                    // suspend until C completes
    // ...
}

Task t = Iterate(DoExample("42"));
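For reference, a minimal sketch of the kind of APM-to-Task wrapper I mean; on .NET 4, TaskFactory<TResult>.FromAsync does the TaskCompletionSource plumbing for you:

Task<WebResponse> GetResponseAsync(HttpWebRequest request)
{
    // Wraps BeginGetResponse/EndGetResponse into a Task<WebResponse>;
    // FromAsync handles the TaskCompletionSource work internally.
    return Task<WebResponse>.Factory.FromAsync(
        request.BeginGetResponse,
        request.EndGetResponse,
        null);
}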
I'm raising the connection limit via System.Net.ServicePointManager.DefaultConnectionLimit and implementing timeouts using ThreadPool.RegisterWaitForSingleObject.
My question is simply: what would be the best approach to implement a helper/utility class for retrieving HTML pages that would:
be scalable and have high performance
use webrequests
be easily chained to other tasks
be able to use timeout
use .NET 4 framework
If you think that the solution using APM, TaskCompletionSource, and an iterator, which I presented above, is fine, I would appreciate any help in solving the delay problem.
I'm totally new to C# and Windows development, so please don't mind if something I'm trying doesn't make much sense.
Any help would be highly appreciated, as without getting this solved I'll have to drop my test tool development.
Thanks
Using iterators was a great solution in pre-TPL .NET (e.g., the Coordination and Concurrency Runtime (CCR) out of MS Robotics made heavy use of them, and helped inspire TPL). One problem is that iterators alone aren't going to give you what you need: you also need a scheduler to effectively distribute the workload. That's almost done by the Stephen Toub snippet you linked to, but note that one line:
enumerator.Current.ContinueWith(recursiveBody, TaskContinuationOptions.ExecuteSynchronously);
I think the intermittent problems you're seeing might be linked to forcing "ExecuteSynchronously" - it could be causing an uneven distribution of work across the available cores/threads.
Take a look at some of the other alternatives that Stephen proposes in his blog article. In particular, see what just doing a simple chaining of ContinueWith() calls will do (if necessary, followed by matching Unwrap() calls). The syntax won't be the prettiest, but it's the simplest and interferes as little as possible with the underlying work-stealing runtime, so you'll hopefully get better results.
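For example, a minimal sketch of that chaining, assuming the DoAAsync/DoBAsync/DoCAsync methods from the question each return Task<string>:

Task<string> result =
    DoAAsync("42")
        .ContinueWith(a => DoBAsync(a.Result)).Unwrap()   // Task<Task<string>> -> Task<string>
        .ContinueWith(b => DoCAsync(b.Result)).Unwrap();

It's not pretty, but there's no custom scheduler or iterator involved; each continuation lands wherever the default work-stealing scheduler puts it.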
I'm learning C#/.NET; two of the main reasons are its incredible speed compared to Node.js and its OO syntax.
Now the tutorial I'm following has all of a sudden introduced async, and that's cool, but I could have done that with Node.js as well, so I feel a little disappointed.
My thought was that maybe we could take this to the next level with multithreading, but a lot of questions came up, such as discrepancies in the database (thread one expects to get data that thread two updated, but thread two did not execute before thread one read, so thread one is working with outdated data).
Searching for this seems to return very little information; mostly it's people misunderstanding multithreading and asynchronous programming.
So I'm guessing you would not want to mix an API with multithreading?
Yes, it's a thing, and you're already doing it with async tasks.
.NET has a Task Scheduler that assigns your tasks to available threads from the Thread Pool. The default behavior is to create a pool with roughly one thread per available CPU.
Clarification: this doesn't mean 1 task : 1 thread. There's a large collection of work to be done by a number of workers. Scheduler hands a worker a job, worker works until it's done or an 'await' is reached.
From the perspective of a regular async method, it can be hard to see where the 'multi-threading' comes into play. There isn't an obvious difference between Get() and await GetAsync() when your code has to sit and wait either way.
But it's not always about your code. This example might make it more clear.
// Assuming an HttpClient instance is available:
var http = new HttpClient();

var work = new List<Task>();
foreach (var uri in uriList)
{
    // Start each request immediately; nothing is awaited yet
    work.Add(http.GetAsync(uri));
}
// Asynchronously wait for all of the requests to complete
await Task.WhenAll(work);
This code will start all of those GetAsync calls at once and let them run concurrently.
The framework making your API work is doing something similar. It would be pretty silly if the whole server was tied up because a single user requested a big file over dialup.
Async/await can involve multi-threading, but it is not used only for multi-threading.
I have not personally used/seen multi-threading in APIs, only in console jobs. Using TPL in console jobs has improved efficiency by more than 100% for me.
Async/await is powerful and should be used for asynchronous processing in APIs too.
Please go through Shiv's videos https://www.youtube.com/watch?v=iMcycFie-nk
In this article: https://blog.stephencleary.com/2013/11/taskrun-etiquette-examples-dont-use.html , it is advised against using Task.Run. However, there are a lot of libraries that provide methods ending in Async, and I expect those methods to return a running task that I can await (which, however, is not guaranteed, since those libraries could decide to complete the task synchronously).
The context is an ASP.NET application. How am I supposed to make a method run in parallel?
What I understand is that async calls are executed in parallel if they contain at least one "await" operator inside; the problem is that the innermost call should be parallel to achieve that, and to do that I have to somehow resort to Task.Run.
I have also seen some examples using TaskCompletionSource; is it necessary to implement the "innermost async method" this way to run a method in parallel in an ASP.NET application?
In an ASP.NET application we tend to value requests/s over individual response times¹, certainly if we're directly trading one off against the other. So we don't try to focus more CPU power on satisfying a single request.
And really, focusing more CPU power on a task is what Task.Run is for: it's for when you have a distinct chunk of work to be done, you can't do it on the current thread (because it's got its own work to do), and you're free to use as much CPU as possible.
In ASP.NET, where async shines is when we're dealing with I/O: nasty slow things like accessing the file system or talking to a database across the network. And wonderfully, at the lowest level, the Windows I/O system is already async, so we don't have to devote a thread just to waiting for things to finish.
So, you won't be using Task.Run. Instead you'll be looking for I/O-related objects that expose Async methods, and those methods themselves will not, as above, be using Task.Run. What this does allow us to do is stop using any threads to service our particular request while there's no work to be done, and so improve our requests/s metric.
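As a hedged sketch of what that looks like (assuming an ASP.NET MVC controller and an EF6-style context; db.Orders and FindAsync stand in for whatever async data call your stack exposes):

public async Task<ActionResult> Details(int id)
{
    // No request thread sits blocked while the database round-trip is in flight
    var order = await db.Orders.FindAsync(id);
    return View(order);
}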
¹ This is a generalization, but single-user/request ASP.NET sites are rare in my experience.
I have a .NET 4.5.1 WCF service that handles synchronization from an app that will be used by thousands of users. I currently use Task.WaitAll as shown below and it works fine, but I've read that this is bad and can cause deadlocks, etc. I believe I tried WhenAll in the past and it didn't work; I don't recall the issues, as I'm returning to this for review just to make sure I'm doing it right. My concern is whether the blocking is needed and preferred in this use (a WCF service method), which may be why WaitAll appears to work without issue.
I have about a dozen methods that each update an entity in Entity Framework 6, processing the incoming data against the existing data and making the necessary changes. Each of these methods can be expensive, so I would like to use parallelism, mainly to get all the methods working at the same time on this powerful 24-core server. Each method returns a Task, as it wraps its contents in Task.Run. The DoSync method creates a new List<Task> and adds each of these sync methods to the list. I then call Task.WaitAll(taskList.ToArray()) and all works great.
Is this the right way of doing this? I want to make sure this method will scale well, not cause problems, and work properly in a WCF service scenario.
In high-scale services it is often a good idea to use async IO (which you are not using; you use Task.Run). "High scale" is very loosely defined. The benefit of async IO on the server is that it does not block threads. This leads to less memory usage and less context switching. That is all there is to it.
If you do not need these benefits you can use sync IO and blocking all you like. Nothing bad will happen. Understand, that running 10 queries on background threads and waiting for them will temporarily block 11 threads. This might be fine, or not, depending on the number of concurrent operations you expect.
I suggest you do a little research regarding the scalability benefits of async IO so that you better understand when to use it. Remember that there is a cost to going async: Slower development and more concurrency bugs.
Understand, that async IO is different from just using the thread-pool (Task.Run). The thread-pool is not thread-less while async IO does not use any threads at all. Not even "invisible" threads managed by the runtime.
What I often find is: If you have to ask, you don't need it.
Task.WhenAll is the non-blocking equivalent of Task.WaitAll, and without seeing your code I can't think of any reason why it wouldn't work and wouldn't be preferable. But note that Task.WhenAll itself returns a Task which you must await. Did you do that?
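For reference, a minimal sketch of the non-blocking version (the Update*Async names are placeholders for your dozen entity-update methods):

public async Task DoSyncAsync()
{
    var taskList = new List<Task>
    {
        UpdateCustomersAsync(),   // placeholder names for your update methods
        UpdateOrdersAsync(),
        // ...the rest of the dozen...
    };
    // Awaiting the combined task frees the thread instead of blocking it
    await Task.WhenAll(taskList);
}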
I recently read an article about C# 5 and its new, nice asynchronous programming features. I can see it works great in Windows applications. The question that came to me is: can this feature increase ASP.NET performance?
Consider these two pieces of pseudo code:
public T GetData()
{
var d = GetSomeData();
return d;
}
and
public async T GetData2()
{
var d = await GetSomeData();
return d;
}
Is there any difference between these two pieces of code in an ASP.NET application?
Thanks
Well, for a start, your second piece of code would have to return Task<T> rather than T. The ultimate answer is "it depends".
If your page needs to access multiple data sources, it may make it simpler to parallelize access to those sources, using the result of each access only where necessary. So for example, you may want to start making a long-running data fetch as the first part of your page handling, then only need the result at the end. It's obviously possible to do this without using async/await, but it's a lot simpler when the language is helping you out.
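For instance, a hedged sketch of that "start early, use late" pattern (GetUserAsync and GetReportAsync are hypothetical data calls):

// Kick off both fetches up front; nothing is awaited yet
Task<User> userTask = GetUserAsync(id);
Task<Report> reportTask = GetReportAsync(id);

// ...do the rest of the page handling here...

// Only now (asynchronously) wait for the results
User user = await userTask;
Report report = await reportTask;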
Additionally, asynchrony can be used to handle a huge number of long-running requests on a small number of threads, if most of those requests will be idle for a lot of the time - for example in a long-polling scenario. I can see the async abilities being used as an alternative to SignalR in some scenarios.
The benefits of async on the server side are harder to pin down than on the client side because there are different ways in which it helps - whereas the "avoid doing work on the UI thread" side is so obvious, it's easy to see the benefit.
Don't forget that there can be a lot more to server-side coding than just the front-end. In my experience, async is most likely to be useful when implementing RPC services, particularly those which talk to multiple other RPC services.
As Pasi says, it's just syntactic sugar - but I believe that it's sufficiently sweet sugar that it may well make the difference between proper asynchronous handling being feasible and it being simply too much effort and complexity.
Define 'performance'.
Ultimately the application is going to do the same amount of work as it would have done synchronously; it's just that in the asynchronous version the calling thread waits for the operation to complete on another thread, whereas in the synchronous model the same thread performs the task.
Ultimately, in both cases, the client will wait the same amount of time before seeing a response from the web server and therefore won't notice any difference in performance.
If the web request is being handled via an asynchronous handler then, again, the response will still take the same amount of time to return; however, you can decrease the pressure on the thread pool, making the web server itself more responsive in accepting requests (see this other SO question for more details on that).
Since the code is executed on the server and the user still has to wait for the response, the question amounts to: does an async call go faster than a sync call?
Well, that depends mostly on the server implementation. For IIS and a large number of users, too many threads would be spawned per user (even without async) and async would be inefficient. But with a small number of users it should be faster.
One way to find out is to try it.
No. These are purely syntactic sugar and make your job as a programmer easier by allowing you to write simple code, and to read it simply when you fix bugs.
It does allow a simpler approach to loading data without blocking the UI, so in a way yes, but not really.
It would only increase performance if you needed to do multiple things that can all be done without the need for any other information. Otherwise you may as well just do them in sequence.
In terms of your example, the answer is no. The page needs to wait for each call regardless.
My goal is to write a program that handles an arbitrary number of tasks based on given user input.
Let's say the number of tasks is 1000 in this case.
Now, I'd like to be able to have a dynamic number of threads that are spawned and start handling the tasks one by one.
I assume I need to use a "synchronous" method, as opposed to an "asynchronous" one, so that if one task has a problem it doesn't slow down the completion of the rest.
What method would I use to accomplish the above? Semaphores? ThreadPools? And how do I make sure that a thread does not try to start a task that is already being handled by another thread? Would a "lock" handle this?
Code examples and/or links to sites that will point me in the right direction will be appreciated.
edit: The problem with the MSDN Fibonacci example is that the WaitAll method can only handle up to 64 waits, and I need more than that because of the 1000 tasks. How can I fix that situation without creating deadlocks?
Are these tasks independent? If so, you basically want a producer/consumer queue or a custom threadpool, which are effectively different views on the same thing. You need to be able to place tasks in a queue, and have multiple threads be able to read from that queue.
I have a custom threadpool in MiscUtil or there's a simple (nongeneric due to age) producer/consumer queue in my threading tutorial (about half way down this page).
If these tasks are reasonably long-running, I wouldn't use the system threadpool for this - it will spawn more threads than you probably want. If you're using .NET 4.0 beta 1 you could use Parallel Extensions though.
I'm not quite sure about your comment on WaitAll... are you trying to work out when everything's finished? In the producer/consumer queue case, that would probably involve having some sort of "stop" entry in the queue (e.g. null references which the consuming threads would understand to mean "quit") and then add a "WaitUntilEmpty" method (which should be fairly easy to implement). Note that you wouldn't need to wait until the last items had been processed, as they'd all be stop signals... by the time the queue has emptied, all the real work items will definitely have been processed anyway.
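For reference, a minimal sketch of such a queue (Monitor-based, pre-TPL; a null item serves as the "stop" signal described above):

using System;
using System.Collections.Generic;
using System.Threading;

class WorkQueue
{
    private readonly Queue<Action> queue = new Queue<Action>();
    private readonly object padlock = new object();

    public void Enqueue(Action item)
    {
        lock (padlock)
        {
            queue.Enqueue(item);
            Monitor.Pulse(padlock);   // wake one waiting consumer
        }
    }

    public Action Dequeue()
    {
        lock (padlock)
        {
            // Block until a producer has queued something
            while (queue.Count == 0)
                Monitor.Wait(padlock);
            return queue.Dequeue();
        }
    }
}

Each consumer thread then loops with for (Action job; (job = queue.Dequeue()) != null; ) job(); and exits cleanly when it dequeues the null stop signal (enqueue one null per consumer).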
You'll probably want to use the ThreadPool to manage this.
I recommend reading up on MSDN on how to use the ThreadPool in C#. It covers many aspects of this, including firing off tasks and simple synchronization.
Using Threading in C# is the main section, and will cover other options.
If you happen to be using VS 2010 beta and targeting .NET 4, the Task Parallel Library is a very good option for this; it simplifies some of these patterns.
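If you do go the TPL route, a hedged sketch (DoTask here is a placeholder for your per-item work; note that Task.WaitAll, unlike WaitHandle.WaitAll, has no 64-item limit):

var tasks = new Task[1000];
for (int i = 0; i < 1000; i++)
{
    int n = i;   // capture the loop variable for the closure
    tasks[n] = Task.Factory.StartNew(() => DoTask(n));
}
Task.WaitAll(tasks);   // waits for all 1000 tasks, no 64-wait restriction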
You can't use it (yet) but the new Task class in .NET 4 would be ideal for this kind of situation.
Until then, the ThreadPool is your best bet. It has a (very) limited form of load-balancing. Note that if you try to start 1000 Threads you will probably get an Out Of Memory exception. The ThreadPool will handle that with ease.
Your synchronization problem can be handled with a simple (Interlocked) counter, if the timing is such that you can tolerate a Sleep(1) loop in the main thread; see the sketch below. The ThreadPool is missing a more convenient way to do this.
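A minimal sketch of that counter approach (DoTask is a placeholder for the real work):

int remaining = 1000;
for (int i = 0; i < 1000; i++)
{
    ThreadPool.QueueUserWorkItem(state =>
    {
        DoTask((int)state);                     // placeholder work method
        Interlocked.Decrement(ref remaining);   // signal completion
    }, i);
}
while (Thread.VolatileRead(ref remaining) > 0)
    Thread.Sleep(1);   // the tolerated Sleep(1) loop in the main thread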
A simple strategy to avoid a task being picked up by two or more threads is a synchronized collection (protected with a mutex, for example).
See this http://msdn.microsoft.com/en-us/library/yy12yx1f.aspx
Perhaps you can use the BackgroundWorker class. It creates a nice abstraction on top of the thread pool. You can even subclass it if you want to setup many similar jobs.
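A minimal sketch (DoTask stands in for the real work):

// using System.ComponentModel;
var worker = new BackgroundWorker();
worker.DoWork += (s, e) => e.Result = DoTask((int)e.Argument);        // runs on a thread-pool thread
worker.RunWorkerCompleted += (s, e) => Console.WriteLine(e.Result);   // raised back on the starting context
worker.RunWorkerAsync(42);   // the argument flows into e.Argument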
As has been mentioned, .NET 4 features the excellent Task Parallel Library. But you can use the June 2008 CTP of it on .NET 3.5 just fine. I've been doing this for some hobby projects myself, but if this is a commercial project you should check whether there are legal issues.