Concurrent requests with HttpClient take longer than expected - c#

I have a webservice which receives multiple requests at the same time. For each request, I need to call another webservice (authentication things). The problem is, if multiple (>20) requests happen at the same time, the response time suddenly gets a lot worse.
I made a sample to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

namespace CallTest
{
    public class Program
    {
        private static readonly HttpClient _httpClient = new HttpClient(new HttpClientHandler { Proxy = null, UseProxy = false });

        static void Main(string[] args)
        {
            ServicePointManager.DefaultConnectionLimit = 100;
            ServicePointManager.Expect100Continue = false;

            // warmup
            CallSomeWebsite().GetAwaiter().GetResult();
            CallSomeWebsite().GetAwaiter().GetResult();

            RunSequentiell().GetAwaiter().GetResult();
            RunParallel().GetAwaiter().GetResult();
        }

        private static async Task RunParallel()
        {
            var tasks = new List<Task>();
            for (var i = 0; i < 300; i++)
            {
                tasks.Add(CallSomeWebsite());
            }
            await Task.WhenAll(tasks);
        }

        private static async Task RunSequentiell()
        {
            for (var i = 0; i < 300; i++)
            {
                await CallSomeWebsite();
            }
        }

        private static async Task CallSomeWebsite()
        {
            var watch = Stopwatch.StartNew();
            using (var result = await _httpClient.GetAsync("http://example.com").ConfigureAwait(false))
            {
                // more work here, like checking success etc.
                Console.WriteLine(watch.ElapsedMilliseconds);
            }
        }
    }
}
Sequential calls are no problem. They take a few milliseconds each, and the response time is fairly consistent.
However, parallel requests start taking longer and longer the more requests are sent; sometimes it even takes a few seconds. I tested it on .NET Framework 4.6.1 and on .NET Core 2.0 with the same results.
What is even stranger: I traced the HTTP requests with Wireshark, and on the wire they always take around the same time. But the sample program reports much higher values for parallel requests than Wireshark does.
How can I get the same performance for parallel requests? Is this a thread pool issue?

This behaviour has been fixed with .NET Core 2.1. I think the problem was the underlying Windows WinHTTP handler, which was used by HttpClient.
In .NET Core 2.1, they rewrote the HttpClientHandler (see https://blogs.msdn.microsoft.com/dotnet/2018/04/18/performance-improvements-in-net-core-2-1/#user-content-networking):
In .NET Core 2.1, HttpClientHandler has a new default implementation implemented from scratch entirely in C# on top of the other System.Net libraries, e.g. System.Net.Sockets, System.Net.Security, etc. Not only does this address the aforementioned behavioral issues, it provides a significant boost in performance (the implementation is also exposed publicly as SocketsHttpHandler, which can be used directly instead of via HttpClientHandler in order to configure SocketsHttpHandler-specific properties).
This turned out to remove the bottlenecks mentioned in the question.
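If you want to opt into the new handler explicitly, a minimal sketch looks like this (the property values are illustrative, not tuned recommendations):
var handler = new SocketsHttpHandler
{
    // Recycle pooled connections periodically (the 2-minute value is an arbitrary example).
    PooledConnectionLifetime = TimeSpan.FromMinutes(2),
    // Cap concurrent connections per server if you want a limit; SocketsHttpHandler
    // is otherwise effectively unlimited by default.
    MaxConnectionsPerServer = 100
};
var client = new HttpClient(handler);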
On .NET Core 2.0, I get the following numbers (in milliseconds):
Fetching URL 500 times...
Sequentiell Total: 4209, Max: 35, Min: 6, Avg: 8.418
Parallel Total: 822, Max: 338, Min: 7, Avg: 69.126
But on .NET Core 2.1, the individual parallel HTTP requests seem to have improved a lot:
Fetching URL 500 times...
Sequentiell Total: 4020, Max: 40, Min: 6, Avg: 8.040
Parallel Total: 795, Max: 76, Min: 5, Avg: 7.972

In the question's RunParallel() function, a stopwatch is started for all 300 calls within the first second of the program running, and stopped as each HTTP request completes.
Therefore these times can't really be compared to the sequential iterations.
For smaller numbers of parallel tasks, e.g. 50, if you measure the wall time that the sequential and parallel methods take, you should find that the parallel method is faster due to it pipelining as many GetAsync tasks as it can.
That said, when running the code for 300 iterations I did find a repeatable several-second stall, but only when running outside the debugger:
Debug build, in debugger: Sequential 27.6 seconds, parallel 0.6 seconds
Debug build, without debugger: Sequential 26.8 seconds, parallel 3.2 seconds
[Edit]
There's a similar scenario described in this question, though it's possibly not relevant to your problem anyway.
This problem gets worse the more tasks are run, and disappears when:
Swapping the GetAsync work for an equivalent delay
Running against a local server
Slowing the rate of task creation / running fewer concurrent tasks (a sketch of this follows below)
The watch.ElapsedMilliseconds diagnostic shows the stall affecting all connections at once, indicating that every connection is hit by the throttling.
It seems to be some sort of (anti-SYN-flood?) throttling in the host or network that simply halts the flow of packets once a certain number of sockets start connecting.
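For example, to test the third point above, task creation can be staggered so the sockets don't all start connecting at once; a minimal sketch (the 10 ms delay is an arbitrary value, not a tuned one):
private static async Task RunParallelStaggered()
{
    var tasks = new List<Task>();
    for (var i = 0; i < 300; i++)
    {
        tasks.Add(CallSomeWebsite());
        await Task.Delay(10); // slow the rate of task creation so connects don't burst
    }
    await Task.WhenAll(tasks);
}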

It sounds like, for whatever reason, you're hitting a point of diminishing returns at around 20 concurrent tasks. So your best option might be to throttle your parallelism. TPL Dataflow is a great library for achieving this. To follow your pattern, add a method like this:
// requires: using System.Threading.Tasks.Dataflow;
private static Task RunParallelThrottled()
{
    var throttler = new ActionBlock<int>(i => CallSomeWebsite(),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
    for (var i = 0; i < 300; i++)
    {
        throttler.Post(i);
    }
    throttler.Complete();
    return throttler.Completion;
}
You might need to experiment with MaxDegreeOfParallelism until you find the sweet spot. Note that this is more efficient than doing batches of 20. In that scenario, all 20 in the batch would need to complete before the next batch begins. With TPL Dataflow, as soon as one completes, another is allowed to begin.

The reason you are having issues is that .NET does not resume Tasks in the order they are awaited; an awaited Task is only resumed when the calling function cannot otherwise resume execution, and Task is not for parallel execution.
If you make a few modifications so that you pass in i to the CallSomeWebsite function and call Console.WriteLine("All loaded"); after you add all the tasks to the list, you will get something like this: (RequestNumber: Time)
All loaded
0: 164
199: 236
299: 312
12: 813
1: 837
9: 870
15: 888
17: 905
5: 912
10: 952
13: 952
16: 961
18: 976
19: 993
3: 1061
2: 1061
Do you notice how every Task is created before any of the times are printed to the screen? The entire loop of creating Tasks completes before any of the Tasks resume execution after awaiting the network call.
Also, see how request 199 completed before request 1? .NET resumes Tasks in the order it deems best (this is surely more complicated in reality, but I am not exactly sure how .NET decides which Task to continue).
One thing that I think you might be confusing is asynchronous and parallel. They are not the same, and Task is used for asynchronous execution. That means all of these tasks are running on the same thread (probably; .NET can start a new thread for tasks if needed), so they are not running in parallel. If they were truly parallel, they would all be running on different threads, and the execution times would not keep increasing with each completion.
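To see which threads are actually involved, here is a small sketch of mine (not from the original code) that prints the thread ID before and after an await:
// requires: using System.Threading;
private static async Task ShowThreads()
{
    Console.WriteLine($"Before await: thread {Thread.CurrentThread.ManagedThreadId}");
    await Task.Delay(100);
    // The continuation may resume on a different thread-pool thread.
    Console.WriteLine($"After await: thread {Thread.CurrentThread.ManagedThreadId}");
}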
Updated functions:
private static async Task RunParallel()
{
    var tasks = new List<Task>();
    for (var i = 0; i < 300; i++)
    {
        tasks.Add(CallSomeWebsite(i));
    }
    Console.WriteLine("All loaded");
    await Task.WhenAll(tasks);
}

private static async Task CallSomeWebsite(int i)
{
    var watch = Stopwatch.StartNew();
    using (var result = await _httpClient.GetAsync("https://www.google.com").ConfigureAwait(false))
    {
        // more work here, like checking success etc.
        Console.WriteLine($"{i}: {watch.ElapsedMilliseconds}");
    }
}
As for why the time printed is longer for the asynchronous execution than for the synchronous execution: your current method of tracking time does not account for the time spent between execution halting and continuing. That is why all of the reported execution times increase over the set of completed requests. If you want an accurate time, you will need to find a way of subtracting the time spent between the await occurring and execution continuing. The issue isn't that it is taking longer; it is that you have an inaccurate reporting method. If you sum the times for all the synchronous calls, the total is actually significantly more than the max time of the asynchronous calls:
Sync: 27965
Max Async: 2341
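A fairer comparison is to time the whole batch with a single stopwatch and compare wall-clock totals; a minimal sketch (mine, not the original poster's code):
private static async Task RunParallelTimed()
{
    // One stopwatch around the whole batch measures wall-clock time,
    // instead of per-request timers that include time spent waiting to be resumed.
    var watch = Stopwatch.StartNew();
    var tasks = new List<Task>();
    for (var i = 0; i < 300; i++)
    {
        tasks.Add(CallSomeWebsite(i));
    }
    await Task.WhenAll(tasks);
    Console.WriteLine($"Total wall time: {watch.ElapsedMilliseconds} ms");
}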

Related

Why async controller method is executing just like a sync method?

I've created a sample .NET Core WebApi application to test how async methods can increase throughput. The app is hosted on IIS 10.
Here is a code of my controller:
[HttpGet("sync")]
public IEnumerable<string> Get()
{
return this.GetValues().Result;
}
[HttpGet("async")]
public async Task<IEnumerable<string>> GetAsync()
{
return await this.GetValues();
}
[HttpGet("isresponding")]
public Task<bool> IsResponding()
{
return Task.FromResult(true);
}
private async Task<IEnumerable<string>> GetValues()
{
await Task.Delay(TimeSpan.FromSeconds(10)).ConfigureAwait(false);
return new string[] { "value1", "value2" };
}
There are three methods:
Get() - gets the result synchronously
GetAsync() - gets the result asynchronously
IsResponding() - checks that the server can serve requests
Then I created a sample console app which makes 100 requests to the sync or async method of the controller (without waiting for the results). Then I call the IsResponding() method to check whether the server is available.
Console app code is:
using (var httpClient = new HttpClient())
{
    var methodUrl = $"http://localhost:50001/api/values/{apiMethod}";
    Console.WriteLine($"Executing {methodUrl}");
    //var result1 = httpClient.GetAsync($"http://localhost:50001/api/values/{apiMethod}").Result.Content.ReadAsStringAsync().Result;
    Parallel.For(0, 100, ((i, state) =>
    {
        httpClient.GetAsync(methodUrl);
    }));
    var sw = Stopwatch.StartNew();
    var isAlive = httpClient.GetAsync($"http://localhost:50001/api/values/isresponding").Result.Content;
    Console.WriteLine($"{sw.Elapsed.TotalSeconds} sec.");
    Console.ReadKey();
}
where {apiMethod} is "sync" or "async", depending on user input.
In both cases the server does not respond for a long time (about 40 seconds).
I expected that in the async case the server should continue serving requests quickly, but it doesn't.
UPDATE 1:
I've changed client code like this:
Parallel.For(0, 10000, ((i, state) =>
{
    var httpClient = new HttpClient();
    httpClient.GetAsync($"http://localhost:50001/api/values/{apiMethod}");
}));

using (var httpClient = new HttpClient())
{
    var sw = Stopwatch.StartNew();
    // this method should evaluate fast when we called the async version,
    // and slowly when we called the sync method (due to busy ThreadPool threads)
    var isAlive = httpClient.GetAsync($"http://localhost:50001/api/values/isresponding").Result.Content;
    Console.WriteLine($"{sw.Elapsed.TotalSeconds} sec.");
}
and the IsResponding() method still executes for a very long time.
UPDATE 2
Yes, I know how async methods work. Yes, I know how to use HttpClient. It's just a sample to prove a theory.
UPDATE 3
As mentioned by StuartLC in one of the comments, IIS is somehow throttling or blocking requests. When I started my WebApi as self-hosted, it started working as expected:
The executing time of the "isresponding" method after a bunch of requests to the ASYNC method is very fast, at about 0.02 sec.
The executing time of the "isresponding" method after a bunch of requests to the SYNC method is very slow, at about 35 sec.
You don't seem to understand async. It doesn't make the response return faster. The response cannot be returned until everything the action is doing is complete, async or not. If anything, async is actually slower, if only slightly, because there's additional overhead involved in asynchronous processing not necessary for synchronous processing.
What async does do is potentially allow the active thread servicing the request to be returned to the pool to service other requests. In other words, async is about scale, not performance. You'll only see benefits when your server is slammed with requests. Then, where incoming requests would normally have been queued, you'll process additional requests as some of the async tasks forfeit their threads to the cause. Additionally, there is no guarantee that the thread will be freed at all. If the async task completes immediately or near immediately, the thread will be held, just as with sync.
EDIT
You should also realize that IIS Express is single-threaded. As such, it's not a good gauge for performance tuning. If you're running 1000 simultaneous requests, 999 are instantly queued. Then, you're not doing any asynchronous work - just returning a completed task. As such, the thread will never be released, so there is literally no difference between sync and async in this scenario. Therefore, you're down to just how long it takes to process through the queue of 999 requests (plus your status check at the end). You might have better luck teasing out a difference if you do something like:
await Task.Delay(500);
instead of just returning Task.FromResult. That way, there's actual idle time on the thread that may allow it to be returned to the pool.
I'm not sure this will yield any major improvement, but you should call ConfigureAwait(false) on every awaited call in the server, including in GetAsync.
It should make better use of the threads.
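For instance, a sketch of what that suggestion looks like applied to the controller (my illustration, not the answerer's code):
[HttpGet("async")]
public async Task<IEnumerable<string>> GetAsync()
{
    // ConfigureAwait(false) lets the continuation run on any available
    // thread-pool thread rather than a particular captured context.
    return await this.GetValues().ConfigureAwait(false);
}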

Async/await performance

I'm working on performance optimization of a program which makes heavy use of the async/await feature. Generally speaking, it downloads thousands of JSON documents through HTTP in parallel, parses them, and builds some response using this data. We experience some performance issues: when we handle many requests simultaneously (e.g. downloading 1000 JSONs), a simple HTTP request can take a few minutes.
I wrote a small console app to test it on a simplified example:
class Program
{
    static void Main(string[] args)
    {
        for (int i = 0; i < 100000; i++)
        {
            Task.Run(IoBoundWork);
        }
        Console.ReadKey();
    }

    private static async Task IoBoundWork()
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(1000);
        Console.WriteLine(sw.Elapsed);
    }
}
And I can see similar behavior here.
The question is why "await Task.Delay(1000)" eventually takes 23 seconds.
Task.Delay isn't broken, but you're performing 100,000 tasks which each take some time. It's the call to Console.WriteLine that is causing the problem in this particular case. Each call is cheap, but they're accessing a shared resource, so they aren't very highly parallelizable.
If you remove the call to Console.WriteLine, all the tasks complete very quickly. I changed your code to return the elapsed time that each task observes, and then print just a single line of output at the end - the maximum observed time. On my computer, without any Console.WriteLine call, I see output of about 1.16 seconds, showing very little inefficiency:
using System;
using System.Linq;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        ThreadPool.SetMinThreads(50000, 50000);
        var tasks = Enumerable.Repeat(0, 100000)
            .Select(_ => Task.Run(IoBoundWork))
            .ToArray();
        Task.WaitAll(tasks);
        var maxTime = tasks.Max(t => t.Result);
        Console.WriteLine($"Max: {maxTime}");
    }

    private static async Task<double> IoBoundWork()
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(1000);
        return sw.Elapsed.TotalSeconds;
    }
}
You can then modify IoBoundWork to do different tasks, and see the effect. Examples of work to try:
CPU work (do something actively "hard" for the CPU, but briefly)
Synchronous sleeping (so the thread is blocked, but the CPU isn't)
Synchronous IO which doesn't have any shared bottlenecks (although that's generally hard, given that the disk or network is likely to end up being a shared resource bottleneck even if you're writing to different files etc)
Synchronous IO with a shared bottleneck such as Console.WriteLine
Asynchronous IO (await foo.WriteAsync(...) etc)
You can also try removing the call to Task.Delay(1000) or changing it. I found that by removing it entirely, the result was very small - whereas replacing it with Task.Yield was very similar to Task.Delay. It's worth remembering that as soon as your async method has to actually "pause" you're effectively doubling the task scheduling problem - instead of scheduling 100,000 operations, you're scheduling 200,000.
You'll see a different pattern in each case. Fundamentally, you're starting 100,000 tasks, asking them all to wait for a second, then asking them all to do something. That causes issues in terms of continuation scheduling that's async/await specific, but also plain resource management of "Performing 100,000 tasks each of which needs to write to the console is going to take a while."
If your problem is performance, async/await is the wrong solution.
async/await is all about availability: availability to handle the screen and user input, availability to handle HTTP requests, etc.
The synchronization work behind async/await will use more resources and take more time than simply blocking until the operation completes.
Your HTTP server will handle more requests because fewer threads will be blocked waiting for operations to complete, but each individual request may take slightly longer.

How can I make sure a dataflow block only creates threads on an on-demand basis?

I've written a small pipeline using the TPL Dataflow API which receives data from multiple threads and processes it.
Setup 1
When I configure it to use MaxDegreeOfParallelism = Environment.ProcessorCount (which comes to 8 in my case) for each block, I notice it fills up buffers in multiple threads, and processing in the second block doesn't start until ±1700 elements have been received across all threads. You can see this in action here.
Setup 2
When I set MaxDegreeOfParallelism = 1, I notice all elements are received on a single thread, and processing in the second block already starts after ±40 elements have been received. Data here.
Setup 3
When I set MaxDegreeOfParallelism = 1 and introduce a delay of 1000 ms before sending each input, I notice elements get sent on as soon as they are received, and every received element is put on a separate thread. Data here.
So far the setup. My questions are the following:
When I compare setups 1 & 2, I notice that processing elements starts much faster when done serially than when done in parallel (even after accounting for the fact that parallel has 8x as many threads). What causes this difference?
Since this will be run in an ASP.NET environment, I don't want to spawn unnecessary threads, since they all come from a single thread pool. As shown in setup 3, it will still spread itself over multiple threads even when there is only a handful of data. This is also surprising because, from setup 1, I would assume that data is spread sequentially over threads (notice how the first 50 elements all go to thread 16). Can I make sure it only creates new threads on an on-demand basis?
There is another concept called the BufferBlock<T>. If the TransformBlock<T> already queues input, what would be the practical difference of swapping the first step in my pipeline (ReceiveElement) for a BufferBlock?
class Program
{
    static void Main(string[] args)
    {
        var dataflowProcessor = new DataflowProcessor<string>();
        var amountOfTasks = 5;
        var tasks = new Task[amountOfTasks];
        for (var i = 0; i < amountOfTasks; i++)
        {
            tasks[i] = SpawnThread(dataflowProcessor, $"Task {i + 1}");
        }
        foreach (var task in tasks)
        {
            task.Start();
        }
        Task.WaitAll(tasks);
        Console.WriteLine("Finished feeding threads"); // Needs to use async main
        Console.Read();
    }

    private static Task SpawnThread(DataflowProcessor<string> dataflowProcessor, string taskName)
    {
        return new Task(async () =>
        {
            await FeedData(dataflowProcessor, taskName);
        });
    }

    private static async Task FeedData(DataflowProcessor<string> dataflowProcessor, string threadName)
    {
        foreach (var i in Enumerable.Range(0, short.MaxValue))
        {
            await Task.Delay(1000); // Only used for the delayedSerialProcessing test
            dataflowProcessor.Process($"Thread name: {threadName}\t Thread ID:{Thread.CurrentThread.ManagedThreadId}\t Value:{i}");
        }
    }
}

public class DataflowProcessor<T>
{
    private static readonly ExecutionDataflowBlockOptions ExecutionOptions = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };

    private static readonly TransformBlock<T, T> ReceiveElement = new TransformBlock<T, T>(element =>
    {
        Console.WriteLine($"Processing received element in thread {Thread.CurrentThread.ManagedThreadId}");
        return element;
    }, ExecutionOptions);

    private static readonly ActionBlock<T> SendElement = new ActionBlock<T>(element =>
    {
        Console.WriteLine($"Processing sent element in thread {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine(element);
    }, ExecutionOptions);

    static DataflowProcessor()
    {
        ReceiveElement.LinkTo(SendElement);
        // Propagate completion and faults downstream to SendElement.
        ReceiveElement.Completion.ContinueWith(x =>
        {
            if (x.IsFaulted)
            {
                ((IDataflowBlock) SendElement).Fault(x.Exception);
            }
            else
            {
                SendElement.Complete();
            }
        });
    }

    public void Process(T newElement)
    {
        ReceiveElement.Post(newElement);
    }
}
Before you deploy your solution to the ASP.NET environment, I suggest you change your architecture: IIS can suspend threads in ASP.NET for its own use after the request is handled, so your task could be left unfinished. A better approach is to create a separate Windows service daemon which handles your dataflow.
Now back to the TPL Dataflow.
I love the TPL Dataflow library, but its documentation is a real mess.
The only useful document I've found is Introduction to TPL Dataflow.
There are some clues in it which can be helpful, especially the ones about configuration settings (I suggest you investigate implementing your own TaskScheduler using your own ThreadPool implementation, and the MaxMessagesPerTask option) if you need:
The built-in dataflow blocks are configurable, with a wealth of control provided over how and where blocks perform their work. Here are some key knobs available to the developer, all of which are exposed through the DataflowBlockOptions class and its derived types (ExecutionDataflowBlockOptions and GroupingDataflowBlockOptions), instances of which may be provided to blocks at construction time.
TaskScheduler customization, as @i3arnon mentioned:
By default, dataflow blocks schedule work to TaskScheduler.Default, which targets the internal workings of the .NET ThreadPool.
MaxDegreeOfParallelism
It defaults to 1, meaning only one thing may happen in a block at a time. If set to a value higher than 1, that number of messages may be processed concurrently by the block. If set to DataflowBlockOptions.Unbounded (-1), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block. Note that MaxDegreeOfParallelism is a maximum, not a requirement.
MaxMessagesPerTask
TPL Dataflow is focused on both efficiency and control. Where there are necessary trade-offs between the two, the system strives to provide a quality default but also enable the developer to customize behavior according to a particular situation. One such example is the trade-off between performance and fairness. By default, dataflow blocks try to minimize the number of task objects that are necessary to process all of their data. This provides for very efficient execution; as long as a block has data available to be processed, that block’s tasks will remain to process the available data, only retiring when no more data is available (until data is available again, at which point more tasks will be spun up). However, this can lead to problems of fairness. If the system is currently saturated processing data from a given set of blocks, and then data arrives at other blocks, those latter blocks will either need to wait for the first blocks to finish processing before they’re able to begin, or alternatively risk oversubscribing the system. This may or may not be the correct behavior for a given situation. To address this, the MaxMessagesPerTask option exists.
It defaults to DataflowBlockOptions.Unbounded (-1), meaning that there is no maximum. However, if set to a positive number, that number will represent the maximum number of messages a given block may use a single task to process. Once that limit is reached, the block must retire the task and replace it with a replica to continue processing. These replicas are treated fairly with regards to all other tasks scheduled to the scheduler, allowing blocks to achieve a modicum of fairness between them. In the extreme, if MaxMessagesPerTask is set to 1, a single task will be used per message, achieving ultimate fairness at the potential expense of more tasks than may otherwise have been necessary.
MaxNumberOfGroups
The grouping blocks are capable of tracking how many groups they’ve produced, and automatically complete themselves (declining further offered messages) after that number of groups has been generated. By default, the number of groups is DataflowBlockOptions.Unbounded (-1), but it may be explicitly set to a value greater than one.
CancellationToken
This token is monitored during the dataflow block’s lifetime. If a cancellation request arrives prior to the block’s completion, the block will cease operation as politely and quickly as possible.
Greedy
By default, target blocks are greedy and want all data offered to them.
BoundedCapacity
This is the limit on the number of items the block may be storing and have in flight at any one time.
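As an illustration of how these knobs fit together, here is a sketch with arbitrary values (not recommendations):
// requires: using System.Threading; using System.Threading.Tasks.Dataflow;
var cts = new CancellationTokenSource();
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4, // at most 4 messages processed concurrently
    MaxMessagesPerTask = 1,     // one task per message: maximum fairness, more overhead
    BoundedCapacity = 100,      // at most 100 items stored or in flight
    CancellationToken = cts.Token
};
var block = new ActionBlock<string>(s => Console.WriteLine(s), options);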

Best way to work on 15000 work items that need 1-2 I/O calls each

I have a C#/.NET 4.5 application that does work on around 15,000 items that are all independent of each other. Each item has a relatively small amount of CPU work to do (no more than a few milliseconds) and 1-2 I/O calls to WCF services implemented in .NET 4.5 with a SQL Server 2008 backend (I assume they will queue concurrent requests that they can't process quickly enough). These I/O operations can take anywhere from a few milliseconds to a full second. The work item then has a little more CPU work (less than 100 milliseconds), and it is done.
I am running this on a quad-core machine with hyper-threading. Using the Task Parallel Library, I am trying to get the best performance out of the machine that I can, with as little waiting on I/O as possible, by running those operations asynchronously and doing the CPU work in parallel.
Synchronously, with no parallel processing and no async operations, the application takes around 9 hours to run. I believe I can speed this up to an hour or less, but I am not sure if I am going about this the right way.
What is the best way to do the work per item in .NET? Should I make 15,000 threads and have them doing all the work with context switching? Or should I just make 8 threads (the number of logical cores I have) and go about it that way? Any help would be greatly appreciated.
My usual suggestion is TPL Dataflow.
You can use an ActionBlock with an async operation and set the parallelism as high as you need it to be:
var block = new ActionBlock<WorkItem>(async wi =>
{
    DoWork(wi);
    await Task.WhenAll(DoSomeWorkAsync(wi), DoOtherWorkAsync(wi));
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1000 });

foreach (var workItem in workItems)
{
    block.Post(workItem);
}
block.Complete();
await block.Completion;
That way you can test and tweak MaxDegreeOfParallelism until you find the number that best fits your specific situation.
For CPU-intensive work, having higher parallelism than your core count doesn't help, but for I/O (and other async operations) it definitely does, so if your CPU-intensive work is short, I would go with at least 1000.
You definitely don't want to kick off 15000 threads and let them all thrash it out. If you can make your I/O methods completely async - meaning I/O completion ports based - then you can get some very nice controlled parallelism going on.
If you have to tie up threads whilst waiting for I/O you're going to be massively limiting your ability to process the items.
TaskFactory taskFactory = new TaskFactory(new WorkStealingTaskScheduler(Environment.ProcessorCount));

public Job[] GetJobs() { return new Job[15000]; }

public async Task ProcessJobs(Job[] jobs)
{
    var jobTasks = jobs.Select(j => StartJob(j));
    await Task.WhenAll(jobTasks);
}

private async Task StartJob(Job j)
{
    var initialCpuResults = await taskFactory.StartNew(() => j.DoInitialCpuWork());
    var wcfResult = await DoIOCalls(initialCpuResults);
    await taskFactory.StartNew(() => j.DoLastCpuWork(wcfResult));
}

private async Task<bool> DoIOCalls(Result r)
{
    // Sequential...
    await myWcfClientProxy.DoIOAsync(...); // These MUST be fully IO completion port based methods [not Task.Run etc] to achieve good throughput
    await mySQLServerClient.DoIOAsync(...);

    // or in parallel...
    // await Task.WhenAll(myWcfClientProxy.DoIOAsync(...), mySQLServerClient.DoIOAsync(...));

    return true;
}

Multithreading a web scraper?

I've been thinking about making my web scraper multithreaded, not with normal threads (e.g. Thread scrape = new Thread(Function);) but something like a thread pool where there can be a very large number of threads.
My scraper works by using a for loop to scrape pages.
for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)
So how could I multithread the function (that contains the loop) with something like a thread pool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.
I've modified my loop into this:
int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;

ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;

Parallel.For(min, max, pOptions, i =>
{
    //Scraping
});
Would that work or have I got something wrong?
The problem with using pool threads is that they spend most of their time waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism.
I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.
The main thread creates the Semaphore, like this:
Semaphore _requestsSemaphore = new Semaphore(20, 20);
The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution and, on average, it takes about 50 ms. At least, it did in my environment. 20 concurrent requests was the absolute maximum. 15 is probably more reasonable.
The main thread essentially loops, like this:
while (true)
{
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl(); // however you do that
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}
The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.
void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}
The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See Is this really asynchronous? for more information.
This is simple to implement and works quite well. It's possible to get even more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.
You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
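For illustration, an async/await version of the same throttling idea might look like the sketch below (my code, not the original answer's: SemaphoreSlim replaces the Semaphore, HttpClient replaces HttpWebRequest, and the response processing is a placeholder):
private static readonly SemaphoreSlim _requestsSemaphore = new SemaphoreSlim(20, 20);
private static readonly HttpClient _httpClient = new HttpClient();

private static async Task CrawlAsync(string urlToCrawl)
{
    await _requestsSemaphore.WaitAsync(); // limit concurrent requests, as above
    try
    {
        using (var response = await _httpClient.GetAsync(urlToCrawl))
        {
            // process the response here
        }
    }
    finally
    {
        _requestsSemaphore.Release(); // allow another request to start
    }
}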
It's better to go with the TPL, namely Parallel.ForEach, using an overload with a Partitioner. It manages the workload automatically.
FYI: you should understand that more threads doesn't mean faster. I'd advise you to run some tests comparing an unparameterized Parallel.ForEach with a user-defined one.
Update
public void ParallelScraper(int fromInclusive, int toExclusive,
    Action<int> scrape, int desiredThreadsCount)
{
    int chunkSize = (toExclusive - fromInclusive +
        desiredThreadsCount - 1) / desiredThreadsCount;
    ParallelOptions pOptions = new ParallelOptions
    {
        MaxDegreeOfParallelism = desiredThreadsCount
    };

    Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, chunkSize),
        rng =>
        {
            for (int i = rng.Item1; i < rng.Item2; i++)
                scrape(i);
        });
}
Note: you could do better with async in your situation.
If your web scraper is built around a for loop, you could have a look at Parallel.ForEach(), which is similar to a foreach loop; however, it iterates over enumerable data. Parallel.ForEach uses multiple threads to invoke the loop body.
For more details, see Parallel loops
Update:
Parallel.For() is very similar to Parallel.ForEach(); which one to use depends on the context, just as with the for and foreach loops.
This is a perfect scenario for TPL Dataflow's ActionBlock. You can easily configure it to limit concurrency. Here is one of the examples from the documentation:
var downloader = new ActionBlock<string>(async url =>
{
    byte[] imageData = await DownloadAsync(url);
    Process(imageData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

downloader.Post("http://msdn.com/concurrency");
downloader.Post("http://blogs.msdn.com/pfxteam");
You can read about ActionBlock (including the referenced example) by downloading Introduction to TPL Dataflow.
During the tests for our "Crawler-Lib Framework" I found that parallel, TPL, or threading attempts won't get you the throughput you want. You get stuck at 300-500 requests per second on a local machine. If you want to execute thousands of requests in parallel, you must execute them with the async pattern and process the results in parallel. Our Crawler-Lib Engine (a workflow-enabled request processor) does this with about 10,000-20,000 requests/second on a local machine. If you want a fast scraper, don't try to use TPL. Instead, use the async pattern (Begin.../End...) and start all your requests in one thread.
If many of your requests tend to time out, let's say after 30 seconds, the situation is even worse. In this case, the TPL-based solutions will get an ugly throughput of maybe 5, or even 1, requests per second. The async pattern gives you at least 100-300 requests per second. The Crawler-Lib Engine handles this well and gets the maximum possible requests. Let's say your TCP/IP stack is configured to allow 60,000 outbound connections (65,535 is the maximum, because every connection needs an outbound port); then you will get a throughput of 60,000 connections / 30 seconds timeout = 2,000 requests/second.
