Multithreading a web scraper? - c#

I've been thinking about making my web scraper multithreaded, not with ordinary threads (e.g. Thread scrape = new Thread(Function);) but with something like a thread pool, where there can be a very large number of threads.
My scraper works by using a for loop to scrape pages.
for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)
So how could I multithread the function (that contains the loop) with something like a threadpool? I've never used threadpools before and the examples I've seen have been quite confusing or obscure to me.
I've modified my loop into this:
int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
Parallel.For(min, max, pOptions, i =>{
//Scraping
});
Would that work or have I got something wrong?

The problem with using pool threads is that they spend most of their time waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism.
I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.
The main thread creates the Semaphore, like this:
Semaphore _requestsSemaphore = new Semaphore(20, 20);
The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution and, on average, it takes about 50 ms. At least, it did in my environment. 20 concurrent requests was the absolute maximum. 15 is probably more reasonable.
The main thread essentially loops, like this:
while (true)
{
_requestsSemaphore.WaitOne();
string urlToCrawl = DequeueUrl(); // however you do that
var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
// set request properties as appropriate
// and then do an asynchronous request
request.BeginGetResponse(ResponseCallback, request);
}
The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.
void ResponseCallback(IAsyncResult ir)
{
try
{
var request = (HttpWebRequest)ir.AsyncState;
// you'll want exception handling here
using (var response = (HttpWebResponse)request.EndGetResponse(ir))
{
// process the response here.
}
}
finally
{
// release the semaphore so that another request can be made
_requestsSemaphore.Release();
}
}
The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See Is this really asynchronous? for more information.
This is simple to implement and works quite well. It's possible to get even more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.
You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
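As a rough idea of what that simplification might look like, here is a minimal sketch that swaps the Begin/End callbacks for Task-based code and the Semaphore for SemaphoreSlim, whose WaitAsync does not block a thread while waiting. The names AsyncScraper, ScrapeAllAsync and the fetch delegate are hypothetical and not part of the answer above; the limit of 15 in the usage line is simply the figure mentioned earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: a Task-based take on the semaphore-throttled loop above.
public static class AsyncScraper
{
    // Runs fetch for every URL, allowing at most maxConcurrency calls in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> ScrapeAllAsync(
        IEnumerable<string> urls,
        int maxConcurrency,
        Func<string, Task<string>> fetch)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency, maxConcurrency))
        {
            // ToList forces the lazy Select, so every task is created here.
            var tasks = urls.Select(url => FetchOneAsync(url, throttle, fetch)).ToList();
            return await Task.WhenAll(tasks);
        }
    }

    private static async Task<string> FetchOneAsync(
        string url, SemaphoreSlim throttle, Func<string, Task<string>> fetch)
    {
        await throttle.WaitAsync(); // waits for a free slot without blocking a thread
        try
        {
            return await fetch(url);
        }
        finally
        {
            throttle.Release(); // free the slot so another request can start
        }
    }
}
```

With a shared HttpClient named client (an assumption), a call could look like `var pages = await AsyncScraper.ScrapeAllAsync(urls, 15, url => client.GetStringAsync(url));`. Passing the download step in as a delegate keeps the throttling logic independent of the HTTP API used.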

It's better to go with the TPL, namely Parallel.ForEach using an overload with a Partitioner. It manages workload automatically.
FYI: you should understand that more threads doesn't mean faster. I'd advise you to run some tests comparing an unparameterized Parallel.ForEach against a user-defined configuration.
Update
public void ParallelScraper(int fromInclusive, int toExclusive,
Action<int> scrape, int desiredThreadsCount)
{
int chunkSize = (toExclusive - fromInclusive +
desiredThreadsCount - 1) / desiredThreadsCount;
ParallelOptions pOptions = new ParallelOptions
{
MaxDegreeOfParallelism = desiredThreadsCount
};
// Pass pOptions so that MaxDegreeOfParallelism actually applies.
Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, chunkSize), pOptions,
rng =>
{
for (int i = rng.Item1; i < rng.Item2; i++)
scrape(i);
});
}
Note: you might be better off with async in your situation.

If your web scraper is built around a for loop, have a look at Parallel.ForEach(). It is similar to a foreach loop in that it iterates over enumerable data, but it uses multiple threads to invoke the loop body.
For more details, see Parallel loops
Update:
Parallel.For() is very similar to Parallel.ForEach(). Which one to use depends on the context, just as when choosing between a for and a foreach loop.
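One detail worth keeping in mind when converting an inclusive for loop (i <= max) to Parallel.For: the second argument of Parallel.For is exclusive. A minimal sketch, assuming a hypothetical scrapePage callback and hypothetical names PageLoop and ScrapePages:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper around Parallel.For for an inclusive page range.
public static class PageLoop
{
    // Invokes scrapePage once for every page in [minPage, maxPage], both inclusive.
    public static void ScrapePages(
        int minPage, int maxPage, int maxThreads, Action<int> scrapePage)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // Parallel.For stops before its upper bound, so pass maxPage + 1
        // to match a loop written as "i <= maxPage".
        Parallel.For(minPage, maxPage + 1, options, scrapePage);
    }
}
```

Because the body may run on several threads at once, anything it touches (lists, counters, UI controls) has to be thread-safe.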

This is a perfect scenario for TPL Dataflow's ActionBlock. You can easily configure it to limit concurrency. Here is one of the examples from the documentation:
var downloader = new ActionBlock<string>(async url =>
{
byte [] imageData = await DownloadAsync(url);
Process(imageData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
downloader.Post("http://msdn.com/concurrency");
downloader.Post("http://blogs.msdn.com/pfxteam");
You can read about ActionBlock (including the referenced example) by downloading Introduction to TPL Dataflow.

During the tests for our "Crawler-Lib Framework" I found that Parallel, TPL, or threading approaches won't give you the throughput you want. You get stuck at 300-500 requests per second on a local machine. If you want to execute thousands of requests in parallel, you must execute them with the async pattern and process the results in parallel. Our Crawler-Lib Engine (a workflow-enabled request processor) does this at about 10,000 - 20,000 requests / second on a local machine. If you want a fast scraper, don't try to use the TPL. Instead use the Async Pattern (Begin... End...) and start all your requests in one thread.
If many of your requests tend to time out, let's say after 30 seconds, the situation is even worse. In this case the TPL-based solutions will get an ugly throughput of 5? 1? requests per second. The async pattern gives you at least 100-300 requests per second. The Crawler-Lib Engine handles this well and gets the maximum possible number of requests. Let's say your TCP/IP stack is configured to allow 60000 outbound connections (65535 is the maximum, because every connection needs an outbound port); then you will get a throughput of 60000 connections / 30 seconds timeout = 2000 requests / second.

Related

Making bulk REST calls using HttpClient without waiting for previous request to finish

I have a bunch of independent REST calls to make (say 1000) , each call has differing body. How to make these calls in the least amount of time?
I am using a Parallel.ForEach loop to make the calls, but doesn't each call wait for the previous call to finish (on a single thread)? Is there any callback-style mechanism to prevent this and make the process faster?
Parallel.foreach(...){
(REST call)
HttpResponseMessage response = this.client.PostAsync(new Uri(url), content).Result;
}
Using await also gives almost same results.
Make all the calls with PostAsync:
var urlAndContentArray = ...
// The fast part
List<Task<HttpResponseMessage>> tasks = urlAndContentArray.Select
(x => this.client.PostAsync(new Uri(x.Url), x.Content))
.ToList(); // Select is lazy; ToList makes sure every request is started here
// IF you care about results: here is the slow part:
var responses = await Task.WhenAll(tasks);
Note that this sends all the calls very quickly, as requested, but the time it takes to get results mostly depends on other things: the number of outgoing requests .NET will run in parallel, and how fast the servers reply (and whether they have any DoS protection or throttling).
The simplest way to do many async actions in parallel is to start them without waiting, capture the tasks in a collection, and then wait until all tasks have completed.
For example
var httpClient = new HttpClient();
var payloads = new List<string>(); // this is 1000 payloads
var tasks = payloads.Select(p => httpClient.PostAsync("https://addresss", new StringContent(p)));
await Task.WhenAll(tasks);
That should be enough for a start, but mind two things:
There is still a per-host connection limit whose default depends on the runtime. You can use SocketsHttpHandler (its MaxConnectionsPerServer property) to control it.
It will really start all 1000 items in parallel, which is sometimes not what you want. To cap the number of parallel items, have a look at ActionBlock.
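For the first point, a minimal sketch of configuring the per-server connection limit, assuming .NET Core 2.1 or later (where SocketsHttpHandler is available). The names HttpClientSetup and CreateClient are hypothetical, and the right limit is something to find by measurement rather than a recommended value:

```csharp
using System.Net.Http;

// Hypothetical factory showing where the per-server connection limit is set.
public static class HttpClientSetup
{
    public static HttpClient CreateClient(int maxConnectionsPerServer)
    {
        var handler = new SocketsHttpHandler
        {
            // Upper bound on simultaneous connections to a single server.
            MaxConnectionsPerServer = maxConnectionsPerServer
        };
        return new HttpClient(handler);
    }
}
```

The client should be created once and reused for all requests, so that the connection pool is actually shared. On .NET Framework the corresponding knob is ServicePointManager.DefaultConnectionLimit.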

Concurrent requests with HttpClient take longer than expected

I have a webservice which receives multiple requests at the same time. For each request, I need to call another webservice (authentication things). The problem is, if multiple (>20) requests happen at the same time, the response time suddenly gets a lot worse.
I made a sample to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
namespace CallTest
{
public class Program
{
private static readonly HttpClient _httpClient = new HttpClient(new HttpClientHandler { Proxy = null, UseProxy = false });
static void Main(string[] args)
{
ServicePointManager.DefaultConnectionLimit = 100;
ServicePointManager.Expect100Continue = false;
// warmup
CallSomeWebsite().GetAwaiter().GetResult();
CallSomeWebsite().GetAwaiter().GetResult();
RunSequentiell().GetAwaiter().GetResult();
RunParallel().GetAwaiter().GetResult();
}
private static async Task RunParallel()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
tasks.Add(CallSomeWebsite());
}
await Task.WhenAll(tasks);
}
private static async Task RunSequentiell()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
await CallSomeWebsite();
}
}
private static async Task CallSomeWebsite()
{
var watch = Stopwatch.StartNew();
using (var result = await _httpClient.GetAsync("http://example.com").ConfigureAwait(false))
{
// more work here, like checking success etc.
Console.WriteLine(watch.ElapsedMilliseconds);
}
}
}
}
Sequential calls are no problem. They take a few milliseconds to finish and the response time is mostly the same.
However, parallel requests start taking longer and longer the more requests are sent. Sometimes a request even takes a few seconds. I tested it on .NET Framework 4.6.1 and on .NET Core 2.0 with the same results.
What is even stranger: I traced the HTTP requests with WireShark and they always take around the same time. But the sample program reports much higher values for parallel requests than WireShark.
How can I get the same performance for parallel requests? Is this a thread pool issue?
This behaviour has been fixed with .NET Core 2.1. I think the problem was the underlying Windows WinHTTP handler, which was used by the HttpClient.
In .NET Core 2.1, they rewrote the HttpClientHandler (see https://blogs.msdn.microsoft.com/dotnet/2018/04/18/performance-improvements-in-net-core-2-1/#user-content-networking):
In .NET Core 2.1, HttpClientHandler has a new default implementation implemented from scratch entirely in C# on top of the other System.Net libraries, e.g. System.Net.Sockets, System.Net.Security, etc. Not only does this address the aforementioned behavioral issues, it provides a significant boost in performance (the implementation is also exposed publicly as SocketsHttpHandler, which can be used directly instead of via HttpClientHandler in order to configure SocketsHttpHandler-specific properties).
This turned out to remove the bottlenecks mentioned in the question.
On .NET Core 2.0, I get the following numbers (in milliseconds):
Fetching URL 500 times...
Sequentiell Total: 4209, Max: 35, Min: 6, Avg: 8.418
Parallel Total: 822, Max: 338, Min: 7, Avg: 69.126
But on .NET Core 2.1, the individual parallel HTTP requests seem to have improved a lot:
Fetching URL 500 times...
Sequentiell Total: 4020, Max: 40, Min: 6, Avg: 8.040
Parallel Total: 795, Max: 76, Min: 5, Avg: 7.972
In the question's RunParallel() function, a stopwatch is started for all 300 calls in the first second of the program running, and ended when each http request completes.
Therefore these times can't really be compared to the sequential iterations.
For smaller numbers of parallel tasks e.g. 50, if you measure the wall time that the sequential and parallel methods take you should find that the parallel method is faster due to it pipelining as many GetAsync tasks as it can.
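To compare like with like, the stopwatch can be put around the whole run instead of around each call. A minimal sketch, with hypothetical names (WallTime, MeasureSequentialAsync, MeasureConcurrentAsync) and a delegate standing in for the question's CallSomeWebsite:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helpers that measure total wall time rather than per-call time.
public static class WallTime
{
    // Awaits the calls one after another and returns the total elapsed milliseconds.
    public static async Task<long> MeasureSequentialAsync(Func<Task> call, int count)
    {
        var watch = Stopwatch.StartNew();
        for (var i = 0; i < count; i++)
        {
            await call();
        }
        return watch.ElapsedMilliseconds;
    }

    // Starts all calls first, awaits them together, and returns the total elapsed milliseconds.
    public static async Task<long> MeasureConcurrentAsync(Func<Task> call, int count)
    {
        var watch = Stopwatch.StartNew();
        var tasks = Enumerable.Range(0, count).Select(_ => call()).ToList();
        await Task.WhenAll(tasks);
        return watch.ElapsedMilliseconds;
    }
}
```

Calling both with the same delegate, for example `() => CallSomeWebsite()`, gives two totals that can be compared directly.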
That said, when running the code for 300 iterations I did find a repeatable several-second stall when running outside the debugger only:
Debug build, in debugger: Sequential 27.6 seconds, parallel 0.6 seconds
Debug build, without debugger: Sequential 26.8 seconds, parallel 3.2 seconds
[Edit]
There's a similar scenario described in this question, though it's possibly not relevant to your problem.
This problem gets worse the more tasks are run, and disappears when:
Swapping the GetAsync work for an equivalent delay
Running against a local server
Slowing the rate of tasks creation / running less concurrent tasks
The watch.ElapsedMilliseconds diagnostic stops for all connections, indicating that all connections are affected by the throttling.
Seems to be some sort of (anti-syn-flood?) throttling in the host or network, that just halts the flow of packets once a certain number of sockets start connecting.
It sounds like for whatever reason, you're hitting a point of diminishing returns at around 20 concurrent Tasks. So, your best option might be to throttle your parallelism. TPL Dataflow is a great library for achieving this. To follow your pattern, add a method like this:
private static Task RunParallelThrottled()
{
var throttler = new ActionBlock<int>(i => CallSomeWebsite(),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
for (var i = 0; i < 300; i++)
{
throttler.Post(i);
}
throttler.Complete();
return throttler.Completion;
}
You might need to experiment with MaxDegreeOfParallelism until you find the sweet spot. Note that this is more efficient than doing batches of 20. In that scenario, all 20 in the batch would need to complete before the next batch begins. With TPL Dataflow, as soon as one completes, another is allowed to begin.
You are having issues for three reasons: .NET does not resume Tasks in the order in which they are awaited; an awaited Task is only resumed when the calling function cannot continue executing; and Task is not meant for parallel execution.
If you make a few modifications so that you pass in i to the CallSomeWebsite function and call Console.WriteLine("All loaded"); after you add all the tasks to the list, you will get something like this: (RequestNumber: Time)
All loaded
0: 164
199: 236
299: 312
12: 813
1: 837
9: 870
15: 888
17: 905
5: 912
10: 952
13: 952
16: 961
18: 976
19: 993
3: 1061
2: 1061
Do you notice how every Task is created before any of the times are printed out to the screen? The entire loop of creating Tasks completes before any of the Tasks resume execution after awaiting the network call.
Also, see how request 199 is completed before request 1? .NET will resume Tasks in the order that it deems best (This is guaranteed to be more complicated but I am not exactly sure how .NET decides which Task to continue).
One thing that I think you might be confusing is Asynchronous and Parallel. They are not the same, and Task is used for Asynchronous execution. What that means is that all of these tasks are running on the same thread (Probably. .NET can start a new thread for tasks if needed), so they are not running in Parallel. If they were truly Parallel, they would all be running in different threads, and the execution times would not be increasing for each execution.
Updated functions:
private static async Task RunParallel()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
tasks.Add(CallSomeWebsite(i));
}
Console.WriteLine("All loaded");
await Task.WhenAll(tasks);
}
private static async Task CallSomeWebsite(int i)
{
var watch = Stopwatch.StartNew();
using (var result = await _httpClient.GetAsync("https://www.google.com").ConfigureAwait(false))
{
// more work here, like checking success etc.
Console.WriteLine($"{i}: {watch.ElapsedMilliseconds}");
}
}
As for why the time printed is longer for the asynchronous execution than for the synchronous execution: your current method of tracking time does not take into account the time spent between execution halting and continuing. That is why the reported execution times increase over the set of completed requests. If you want an accurate time, you will need to find a way of subtracting the time spent between the await occurring and execution continuing. The issue isn't that it is taking longer; it is that you have an inaccurate reporting method. If you sum the time for all the synchronous calls, it is actually significantly more than the max time of the asynchronous call:
Sync: 27965
Max Async: 2341

Is there any value in using async await functionality when pulling JSON and store it in database?

I recently started programming in C# (after having some experience with PHP and JavaScript), and I built a simple console program that downloads a JSON string and stores certain values in a database. The data in question is approx. 70,000 sets (converted into rows in my database). Due to a limitation on the server I download this JSON from (Quandl), it was recommended to download 100 datasets per request, so I have 700 requests to make.
With every request, I download the JSON string, deserialize it, and loop through it 100 times to store the respective values in the database. I am using WebClient to make the request and JSON.net for the deserialization.
Currently, with the setup I have, every request takes approx. 7 seconds and, including inserting the data into the database, the whole run takes about an hour and a half to finish.
The question then becomes: is there any way to speed this up with async/await? Everything I read is more on the UI side of things (i.e. the UI is not frozen while a request is processed), but I was wondering whether it would be possible to start the requests simultaneously (or, say, 10 at a time). For completeness, I have added a sanitized version of my code (made a bit shorter, but no logic has been removed).
https://dotnetfiddle.net/S0fnBc
async/await are for asynchronous operations. Asynchronous execution does not equal parallel execution. Asynchronous execution does not block the caller, and parallel execution allows for concurrent execution. You need parallel execution. To do this, you can use the Task Parallel Library. There is also a patterns and practices book that is a great read. Here is a simplified implementation:
var httpClient = new HttpClient();
httpClient.BaseAddress = new Uri("/path/to/data");
var tasks = new Task<Task<HttpResponseMessage>>[5];
for (var i = 0; i < tasks.Length; i++)
{
tasks[i] = Task<Task<HttpResponseMessage>>.Factory.StartNew(async () => await httpClient.GetAsync("?updatedFilterParams"));
}
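await Task.WhenAll(tasks); // wait for the outer tasks to complete
foreach (var task in tasks)
{
var response = await task.Result; // task.Result is the inner Task<HttpResponseMessage>
var data = await response.Content.ReadAsStringAsync();
// do something
}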
Some things to note: WebClient is not capable of concurrent requests so you'll either have to new up another one for every request or use HttpClient as I have. Also, there are multiple things in between your code and the data that can and often do impose limits on concurrent requests for the same origin, so you'll want to throttle how many requests you fire off at a time.

How can I make sure a dataflow block only creates threads on an on-demand basis?

I've written a small pipeline using the TPL Dataflow API which receives data from multiple threads and performs handling on them.
Setup 1
When I configure it to use MaxDegreeOfParallelism = Environment.ProcessorCount (comes to 8 in my case) for each block, I notice it fills up buffers in multiple threads and processing the second block doesn't start until +- 1700 elements have been received across all threads. You can see this in action here.
Setup 2
When I set MaxDegreeOfParallelism = 1 then I notice all elements are received on a single thread and processing the sending already starts after +- 40 elements are received. Data here.
Setup 3
When I set MaxDegreeOfParallelism = 1 and I introduce a delay of 1000ms before sending each input, I notice elements get sent as soon as they are received and every received element is put on a separate thread. Data here.
So far the setup. My questions are the following:
When I compare setups 1 & 2 I notice that processing elements starts much faster when done in serial compared to parallel (even after accounting for the fact that parallel has 8x as many threads). What causes this difference?
Since this will be run in an ASP.NET environment, I don't want to spawn unnecessary threads since they all come from a single threadpool. As shown in setup 3 it will still spread itself over multiple threads even when there is only a handful of data. This is also surprising because from setup 1 I would assume that data is spread sequentially over threads (notice how the first 50 elements all go to thread 16). Can I make sure it only creates new threads on an on-demand basis?
There is another concept called the BufferBlock<T>. If the TransformBlock<T> already queues input, what would be the practical difference of swapping the first step in my pipeline (ReceiveElement) for a BufferBlock?
class Program
{
static void Main(string[] args)
{
var dataflowProcessor = new DataflowProcessor<string>();
var amountOfTasks = 5;
var tasks = new Task[amountOfTasks];
for (var i = 0; i < amountOfTasks; i++)
{
tasks[i] = SpawnThread(dataflowProcessor, $"Task {i + 1}");
}
foreach (var task in tasks)
{
task.Start();
}
Task.WaitAll(tasks);
Console.WriteLine("Finished feeding threads"); // Needs to use async main
Console.Read();
}
private static Task SpawnThread(DataflowProcessor<string> dataflowProcessor, string taskName)
{
return new Task(async () =>
{
await FeedData(dataflowProcessor, taskName);
});
}
private static async Task FeedData(DataflowProcessor<string> dataflowProcessor, string threadName)
{
foreach (var i in Enumerable.Range(0, short.MaxValue))
{
await Task.Delay(1000); // Only used for the delayedSerialProcessing test
dataflowProcessor.Process($"Thread name: {threadName}\t Thread ID:{Thread.CurrentThread.ManagedThreadId}\t Value:{i}");
}
}
}
public class DataflowProcessor<T>
{
private static readonly ExecutionDataflowBlockOptions ExecutionOptions = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
private static readonly TransformBlock<T, T> ReceiveElement = new TransformBlock<T, T>(element =>
{
Console.WriteLine($"Processing received element in thread {Thread.CurrentThread.ManagedThreadId}");
return element;
}, ExecutionOptions);
private static readonly ActionBlock<T> SendElement = new ActionBlock<T>(element =>
{
Console.WriteLine($"Processing sent element in thread {Thread.CurrentThread.ManagedThreadId}");
Console.WriteLine(element);
}, ExecutionOptions);
static DataflowProcessor()
{
ReceiveElement.LinkTo(SendElement);
ReceiveElement.Completion.ContinueWith(x =>
{
if (x.IsFaulted)
{
((IDataflowBlock) ReceiveElement).Fault(x.Exception);
}
else
{
ReceiveElement.Complete();
}
});
}
public void Process(T newElement)
{
ReceiveElement.Post(newElement);
}
}
Before you deploy your solution to an ASP.NET environment, I suggest you change your architecture: IIS can suspend threads in ASP.NET for its own use after the request has been handled, so your task could be left unfinished. A better approach is to create a separate Windows service daemon that handles your dataflow.
Now back to the TPL Dataflow.
I love the TPL Dataflow library, but its documentation is a real mess.
The only useful document I've found is Introduction to TPL Dataflow.
There are some clues in it that can be helpful, especially the ones about Configuration Settings (if you need to, I suggest you investigate implementing your own TaskScheduler on top of your own ThreadPool implementation, and the MaxMessagesPerTask option):
The built-in dataflow blocks are configurable, with a wealth of control provided over how and where blocks perform their work. Here are some key knobs available to the developer, all of which are exposed through the DataflowBlockOptions class and its derived types (ExecutionDataflowBlockOptions and GroupingDataflowBlockOptions), instances of which may be provided to blocks at construction time.
TaskScheduler customization, as @i3arnon mentioned:
By default, dataflow blocks schedule work to TaskScheduler.Default, which targets the internal workings of the .NET ThreadPool.
MaxDegreeOfParallelism
It defaults to 1, meaning only one thing may happen in a block at a time. If set to a value higher than 1, that number of messages may be processed concurrently by the block. If set to DataflowBlockOptions.Unbounded (-1), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block. Note that MaxDegreeOfParallelism is a maximum, not a requirement.
MaxMessagesPerTask
TPL Dataflow is focused on both efficiency and control. Where there are necessary trade-offs between the two, the system strives to provide a quality default but also enable the developer to customize behavior according to a particular situation. One such example is the trade-off between performance and fairness. By default, dataflow blocks try to minimize the number of task objects that are necessary to process all of their data. This provides for very efficient execution; as long as a block has data available to be processed, that block’s tasks will remain to process the available data, only retiring when no more data is available (until data is available again, at which point more tasks will be spun up). However, this can lead to problems of fairness. If the system is currently saturated processing data from a given set of blocks, and then data arrives at other blocks, those latter blocks will either need to wait for the first blocks to finish processing before they’re able to begin, or alternatively risk oversubscribing the system. This may or may not be the correct behavior for a given situation. To address this, the MaxMessagesPerTask option exists.
It defaults to DataflowBlockOptions.Unbounded (-1), meaning that there is no maximum. However, if set to a positive number, that number will represent the maximum number of messages a given block may use a single task to process. Once that limit is reached, the block must retire the task and replace it with a replica to continue processing. These replicas are treated fairly with regards to all other tasks scheduled to the scheduler, allowing blocks to achieve a modicum of fairness between them. In the extreme, if MaxMessagesPerTask is set to 1, a single task will be used per message, achieving ultimate fairness at the potential expense of more tasks than may otherwise have been necessary.
MaxNumberOfGroups
The grouping blocks are capable of tracking how many groups they’ve produced, and automatically complete themselves (declining further offered messages) after that number of groups has been generated. By default, the number of groups is DataflowBlockOptions.Unbounded (-1), but it may be explicitly set to a value greater than one.
CancellationToken
This token is monitored during the dataflow block’s lifetime. If a cancellation request arrives prior to the block’s completion, the block will cease operation as politely and quickly as possible.
Greedy
By default, target blocks are greedy and want all data offered to them.
BoundedCapacity
This is the limit on the number of items the block may be storing and have in flight at any one time.
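A minimal sketch of how several of these knobs might be combined on a single block. The names BoundedBlockExample and RunAsync and the specific values are assumptions chosen for illustration, not recommendations:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical example of configuring a block through ExecutionDataflowBlockOptions.
public static class BoundedBlockExample
{
    // Feeds itemCount integers through a configured ActionBlock
    // and returns how many of them were processed.
    public static async Task<int> RunAsync(int itemCount)
    {
        var processed = 0;
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 2, // at most two messages processed at once
            MaxMessagesPerTask = 1,     // retire the underlying task after each message
            BoundedCapacity = 10        // at most ten items buffered or in flight
        };

        var block = new ActionBlock<int>(
            item => { Interlocked.Increment(ref processed); }, options);

        for (var i = 0; i < itemCount; i++)
        {
            // SendAsync waits while the block is full; Post would instead
            // return false and the item would not be accepted.
            await block.SendAsync(i);
        }

        block.Complete();       // signal that no more items will arrive
        await block.Completion; // wait until everything queued has been processed
        return processed;
    }
}
```

With a bounded block, SendAsync gives the producer back-pressure: it slows down to the pace of the consumer instead of filling an unbounded input queue.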

Limiting the number of threadpool threads

I am using ThreadPool in my application. I have first set the limit of the thread pool by using the following:
ThreadPool.SetMaxThreads(m_iThreadPoolLimit,m_iThreadPoolLimit);
m_Events = new ManualResetEvent(false);
and then I have queued up the jobs using the following
WaitCallback objWcb = new WaitCallback(abc);
ThreadPool.QueueUserWorkItem(objWcb, m_objThreadData);
Here abc is the name of the function that I am calling.
After this I am doing the following so that all my threads come to 1 point and the main thread takes over and continues further
m_Events.WaitOne();
My thread limit is 3. The problem I am facing is that, in spite of the thread pool limit being set to 3, my application is processing more than 3 files at the same time, whereas it was supposed to process only 3 files at a time. Please help me solve this issue.
What kind of computer are you using?
From MSDN
You cannot set the number of worker threads or the number of I/O completion threads to a number smaller than the number of processors in the computer.
If you have 4 cores, then the smallest you can have is 4.
Also note:
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
If this is a web site hosted by IIS then you cannot change the thread pool size either.
A better solution involves the use of a Semaphore, which can throttle concurrent access to a resource [1]. In your case the resource would simply be a block of code that processes work items.
var finished = new CountdownEvent(1); // Used to wait for the completion of all work items.
var throttle = new Semaphore(3, 3); // Used to throttle the processing of work items.
foreach (WorkItem item in workitems)
{
finished.AddCount();
WorkItem capture = item; // Needed to safely capture the loop variable.
ThreadPool.QueueUserWorkItem(
(state) =>
{
throttle.WaitOne();
try
{
ProcessWorkItem(capture);
}
finally
{
throttle.Release();
finished.Signal();
}
}, null);
}
finished.Signal();
finished.Wait();
In the code above WorkItem is a hypothetical class that encapsulates the specific parameters needed to process your tasks.
The Task Parallel Library makes this pattern a lot easier. Just use the Parallel.ForEach method and specify a ParallelOptions.MaxDegreeOfParallelism that throttles the concurrency.
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = 3;
Parallel.ForEach(workitems, options,
(item) =>
{
ProcessWorkItem(item);
});
[1] I should point out that I do not like blocking ThreadPool threads using a Semaphore or any other blocking device. It basically wastes the threads. You might want to rethink your design entirely.
You should use a Semaphore object to limit the number of concurrent threads.
You say the files are open: are they actually being actively processed, or just left open?
If you're leaving them open: been there, done that! Relying on connections and resources (it was a DB connection in my case) to close at the end of scope should work, but it can take a while for the dispose / garbage collection to kick in.
