What is the most recommended .NET custom threadpool that can have separate instances i.e more than one threadpool per application?
I need an unlimited queue size (building a crawler), and need to run a separate threadpool in parallel for each site I am crawling.
Edit :
I need to mine these sites for information as fast as possible, using a separate threadpool for each site would give me the ability to control the number of threads working on each site at any given time. (no more than 2-3)
Thanks
Roey
I believe Smart Thread Pool can do this. It's ThreadPool class is instantiated so you should be able to create and manage your separate site specific instances as you require.
Ami bar wrote an excellent Smart thread pool that can be instantiated.
take a look here
Ask Jon Skeet: http://www.yoda.arachsys.com/csharp/miscutil/
Parallel extensions for .Net (TPL) should actually work much better if you want a large number of parallel running tasks.
Using BlockingCollection can be used as a queue for the threads.
Here is an implementation of it.
Updated at 2018-04-23:
public class WorkerPool<T> : IDisposable
{
BlockingCollection<T> queue = new BlockingCollection<T>();
List<Task> taskList;
private CancellationTokenSource cancellationToken;
int maxWorkers;
private bool wasShutDown;
int waitingUnits;
public WorkerPool(CancellationTokenSource cancellationToken, int maxWorkers)
{
this.cancellationToken = cancellationToken;
this.maxWorkers = maxWorkers;
this.taskList = new List<Task>();
}
public void enqueue(T value)
{
queue.Add(value);
waitingUnits++;
}
//call to signal that there are no more item
public void CompleteAdding()
{
queue.CompleteAdding();
}
//create workers and put then running
public void startWorkers(Action<T> worker)
{
for (int i = 0; i < maxWorkers; i++)
{
taskList.Add(new Task(() =>
{
string myname = "worker " + Guid.NewGuid().ToString();
try
{
while (!cancellationToken.IsCancellationRequested)
{
var value = queue.Take();
waitingUnits--;
worker(value);
}
}
catch (Exception ex) when (ex is InvalidOperationException) //throw when collection is closed with CompleteAdding method. No pretty way to do this.
{
//do nothing
}
}));
}
foreach (var task in taskList)
{
task.Start();
}
}
//wait for all workers to be finish their jobs
public void await()
{
while (waitingUnits >0 || !queue.IsAddingCompleted)
Thread.Sleep(100);
shutdown();
}
private void shutdown()
{
wasShutDown = true;
Task.WaitAll(taskList.ToArray());
}
//case something bad happen dismiss all pending work
public void Dispose()
{
if (!wasShutDown)
{
queue.CompleteAdding();
shutdown();
}
}
}
Then use like this:
WorkerPool<int> workerPool = new WorkerPool<int>(new CancellationTokenSource(), 5);
workerPool.startWorkers(value =>
{
log.Debug(value);
});
//enqueue all the work
for (int i = 0; i < 100; i++)
{
workerPool.enqueue(i);
}
//Signal no more work
workerPool.CompleteAdding();
//wait all pending work to finish
workerPool.await();
You can have as many polls has you like simply creating new WorkPool objects.
This free nuget library here: CodeFluentRuntimeClient has a CustomThreadPool class that you can reuse. It's very configurable, you can change pool threads priority, number, COM apartment state, even name (for debugging), and also culture.
Another approach is to use a Dataflow Pipeline. I added these later answer because i find Dataflows a much better approach for these kind of problem, the problem of having several thread pools. They provide a more flexible and structured approach and can easily scale vertically.
You can broke your code into one or more blocks, link then with Dataflows and let then the Dataflow engine allocate threads according to CPU and memory availability
I suggest to broke into 3 blocks, one for preparing the query to the site page , one access site page, and the last one to Analise the data.
This way the slow block (get) may have more threads allocated to compensate.
Here how would look like the Dataflow setup:
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
prepareBlock.LinkTo(get, linkOptions);
getBlock.LinkTo(analiseBlock, linkOptions);
Data will flow from prepareBlock to getBlock and then to analiseBlock.
The interfaces between blocks can be any class, just have to bee the same. See the full example on Dataflow Pipeline
Using the Dataflow would be something like this:
while ...{
...
prepareBlock.Post(...); //to send data to the pipeline
}
prepareBlock.Complete(); //when done
analiseBlock.Completion.Wait(cancellationTokenSource.Token); //to wait for all queues to empty or cancel
Related
I know that asynchronous programming has seen a lot of changes over the years. I'm somewhat embarrassed that I let myself get this rusty at just 34 years old, but I'm counting on StackOverflow to bring me up to speed.
What I am trying to do is manage a queue of "work" on a separate thread, but in such a way that only one item is processed at a time. I want to post work on this thread and it doesn't need to pass anything back to the caller. Of course I could simply spin up a new Thread object and have it loop over a shared Queue object, using sleeps, interrupts, wait handles, etc. But I know things have gotten better since then. We have BlockingCollection, Task, async/await, not to mention NuGet packages that probably abstract a lot of that.
I know that "What's the best..." questions are generally frowned upon so I'll rephrase it by saying "What is the currently recommended..." way to accomplish something like this using built-in .NET mechanisms preferably. But if a third party NuGet package simplifies things a bunch, it's just as well.
I considered a TaskScheduler instance with a fixed maximum concurrency of 1, but seems there is probably a much less clunky way to do that by now.
Background
Specifically, what I am trying to do in this case is queue an IP geolocation task during a web request. The same IP might wind up getting queued for geolocation multiple times, but the task will know how to detect that and skip out early if it's already been resolved. But the request handler is just going to throw these () => LocateAddress(context.Request.UserHostAddress) calls into a queue and let the LocateAddress method handle duplicate work detection. The geolocation API I am using doesn't like to be bombarded with requests which is why I want to limit it to a single concurrent task at a time. However, it would be nice if the approach was allowed to easily scale to more concurrent tasks with a simple parameter change.
To create an asynchronous single degree of parallelism queue of work you can simply create a SemaphoreSlim, initialized to one, and then have the enqueing method await on the acquisition of that semaphore before starting the requested work.
public class TaskQueue
{
private SemaphoreSlim semaphore;
public TaskQueue()
{
semaphore = new SemaphoreSlim(1);
}
public async Task<T> Enqueue<T>(Func<Task<T>> taskGenerator)
{
await semaphore.WaitAsync();
try
{
return await taskGenerator();
}
finally
{
semaphore.Release();
}
}
public async Task Enqueue(Func<Task> taskGenerator)
{
await semaphore.WaitAsync();
try
{
await taskGenerator();
}
finally
{
semaphore.Release();
}
}
}
Of course, to have a fixed degree of parallelism other than one simply initialize the semaphore to some other number.
Your best option as I see it is using TPL Dataflow's ActionBlock:
var actionBlock = new ActionBlock<string>(address =>
{
if (!IsDuplicate(address))
{
LocateAddress(address);
}
});
actionBlock.Post(context.Request.UserHostAddress);
TPL Dataflow is robust, thread-safe, async-ready and very configurable actor-based framework (available as a nuget)
Here's a simple example for a more complicated case. Let's assume you want to:
Enable concurrency (limited to the available cores).
Limit the queue size (so you won't run out of memory).
Have both LocateAddress and the queue insertion be async.
Cancel everything after an hour.
var actionBlock = new ActionBlock<string>(async address =>
{
if (!IsDuplicate(address))
{
await LocateAddressAsync(address);
}
}, new ExecutionDataflowBlockOptions
{
BoundedCapacity = 10000,
MaxDegreeOfParallelism = Environment.ProcessorCount,
CancellationToken = new CancellationTokenSource(TimeSpan.FromHours(1)).Token
});
await actionBlock.SendAsync(context.Request.UserHostAddress);
Actually you don't need to run tasks in one thread, you need them to run serially (one after another), and FIFO. TPL doesn't have class for that, but here is my very lightweight, non-blocking implementation with tests. https://github.com/Gentlee/SerialQueue
Also have #Servy implementation there, tests show it is twice slower than mine and it doesn't guarantee FIFO.
Example:
private readonly SerialQueue queue = new SerialQueue();
async Task SomeAsyncMethod()
{
var result = await queue.Enqueue(DoSomething);
}
Use BlockingCollection<Action> to create a producer/consumer pattern with one consumer (only one thing running at a time like you want) and one or many producers.
First define a shared queue somewhere:
BlockingCollection<Action> queue = new BlockingCollection<Action>();
In your consumer Thread or Task you take from it:
//This will block until there's an item available
Action itemToRun = queue.Take()
Then from any number of producers on other threads, simply add to the queue:
queue.Add(() => LocateAddress(context.Request.UserHostAddress));
I'm posting a different solution here. To be honest I'm not sure whether this is a good solution.
I'm used to use BlockingCollection to implement a producer/consumer pattern, with a dedicated thread consuming those items. It's fine if there are always data coming in and consumer thread won't sit there and do nothing.
I encountered a scenario that one of the application would like to send emails on a different thread, but total number of emails is not that big.
My initial solution was to have a dedicated consumer thread (created by Task.Run()), but a lot of time it just sits there and does nothing.
Old solution:
private readonly BlockingCollection<EmailData> _Emails =
new BlockingCollection<EmailData>(new ConcurrentQueue<EmailData>());
// producer can add data here
public void Add(EmailData emailData)
{
_Emails.Add(emailData);
}
public void Run()
{
// create a consumer thread
Task.Run(() =>
{
foreach (var emailData in _Emails.GetConsumingEnumerable())
{
SendEmail(emailData);
}
});
}
// sending email implementation
private void SendEmail(EmailData emailData)
{
throw new NotImplementedException();
}
As you can see, if there are not enough emails to be sent (and it is my case), the consumer thread will spend most of them sitting there and do nothing at all.
I changed my implementation to:
// create an empty task
private Task _SendEmailTask = Task.Run(() => {});
// caller will dispatch the email to here
// continuewith will use a thread pool thread (different to
// _SendEmailTask thread) to send this email
private void Add(EmailData emailData)
{
_SendEmailTask = _SendEmailTask.ContinueWith((t) =>
{
SendEmail(emailData);
});
}
// actual implementation
private void SendEmail(EmailData emailData)
{
throw new NotImplementedException();
}
It's no longer a producer/consumer pattern, but it won't have a thread sitting there and does nothing, instead, every time it is to send an email, it will use thread pool thread to do it.
My lib, It can:
Run random in queue list
Multi queue
Run prioritize first
Re-queue
Event all queue completed
Cancel running or cancel wait for running
Dispatch event to UI thread
public interface IQueue
{
bool IsPrioritize { get; }
bool ReQueue { get; }
/// <summary>
/// Dont use async
/// </summary>
/// <returns></returns>
Task DoWork();
bool CheckEquals(IQueue queue);
void Cancel();
}
public delegate void QueueComplete<T>(T queue) where T : IQueue;
public delegate void RunComplete();
public class TaskQueue<T> where T : IQueue
{
readonly List<T> Queues = new List<T>();
readonly List<T> Runnings = new List<T>();
[Browsable(false), DefaultValue((string)null)]
public Dispatcher Dispatcher { get; set; }
public event RunComplete OnRunComplete;
public event QueueComplete<T> OnQueueComplete;
int _MaxRun = 1;
public int MaxRun
{
get { return _MaxRun; }
set
{
bool flag = value > _MaxRun;
_MaxRun = value;
if (flag && Queues.Count != 0) RunNewQueue();
}
}
public int RunningCount
{
get { return Runnings.Count; }
}
public int QueueCount
{
get { return Queues.Count; }
}
public bool RunRandom { get; set; } = false;
//need lock Queues first
void StartQueue(T queue)
{
if (null != queue)
{
Queues.Remove(queue);
lock (Runnings) Runnings.Add(queue);
queue.DoWork().ContinueWith(ContinueTaskResult, queue);
}
}
void RunNewQueue()
{
lock (Queues)//Prioritize
{
foreach (var q in Queues.Where(x => x.IsPrioritize)) StartQueue(q);
}
if (Runnings.Count >= MaxRun) return;//other
else if (Queues.Count == 0)
{
if (Runnings.Count == 0 && OnRunComplete != null)
{
if (Dispatcher != null && !Dispatcher.CheckAccess()) Dispatcher.Invoke(OnRunComplete);
else OnRunComplete.Invoke();//on completed
}
else return;
}
else
{
lock (Queues)
{
T queue;
if (RunRandom) queue = Queues.OrderBy(x => Guid.NewGuid()).FirstOrDefault();
else queue = Queues.FirstOrDefault();
StartQueue(queue);
}
if (Queues.Count > 0 && Runnings.Count < MaxRun) RunNewQueue();
}
}
void ContinueTaskResult(Task Result, object queue_obj) => QueueCompleted((T)queue_obj);
void QueueCompleted(T queue)
{
lock (Runnings) Runnings.Remove(queue);
if (queue.ReQueue) lock (Queues) Queues.Add(queue);
if (OnQueueComplete != null)
{
if (Dispatcher != null && !Dispatcher.CheckAccess()) Dispatcher.Invoke(OnQueueComplete, queue);
else OnQueueComplete.Invoke(queue);
}
RunNewQueue();
}
public void Add(T queue)
{
if (null == queue) throw new ArgumentNullException(nameof(queue));
lock (Queues) Queues.Add(queue);
RunNewQueue();
}
public void Cancel(T queue)
{
if (null == queue) throw new ArgumentNullException(nameof(queue));
lock (Queues) Queues.RemoveAll(o => o.CheckEquals(queue));
lock (Runnings) Runnings.ForEach(o => { if (o.CheckEquals(queue)) o.Cancel(); });
}
public void Reset(T queue)
{
if (null == queue) throw new ArgumentNullException(nameof(queue));
Cancel(queue);
Add(queue);
}
public void ShutDown()
{
MaxRun = 0;
lock (Queues) Queues.Clear();
lock (Runnings) Runnings.ForEach(o => o.Cancel());
}
}
I know this thread is old, but it seems all the present solutions are extremely onerous. The simplest way I could find uses the Linq Aggregate function to create a daisy-chained list of tasks.
var arr = new int[] { 1, 2, 3, 4, 5};
var queue = arr.Aggregate(Task.CompletedTask,
(prev, item) => prev.ContinueWith(antecedent => PerformWorkHere(item)));
The idea is to get your data into an IEnumerable (I'm using an int array), and then reduce that enumerable to a chain of tasks, starting with a default, completed, task.
I want to loop through a list of URLs and check each URL if the website is down or not using multiple threads.
My approach:
while (_lURLs.Count > 0)
{
while (_iRunningThreads < _iNumThreads)
{
Thread t = new Thread(new ParameterizedThreadStart(CheckWebsite));
string strUrl = GetNextURL();
if (!string.IsNullOrEmpty(strUrl))
{
t.Start(strUrl);
_iRunningThreads++;
}
else
{
break;
}
}
}
private string GetNextURL()
{
lock (_lURLs)
{
if (_lURLs.Count > 0)
{
string strRetVal = _lURLs[0];
_lURLs.RemoveAt(0);
return strRetVal;
}
else
{
return string.Empty;
}
}
}
When a thread is finished the _iRunningThreads property gets decremented.
My problem is: The outer while loop blocks everything "while (_lURLs.Count > 0)".
Adding a Application.DoEvents() in the outer while loop helps but I want to use the code in a c# library where Application.DoEvents() is not available.
Thank you for you help.
Instead of managing the threads yourself, you can use the TPL.
Also, if you're using .Net Framework 4.5 you can even add async/await and the WhenAll method to prevent blocking...
Here is a small example:
private async Task CheckUrl()
{
List<Task> tasks = new List<Task>();
string url = GetNextUrl();
while (!String.IsNullOrEmpty(url))
{
tasks.Add(Task.Run(() => CheckWebSite(url)));
url = GetNextUrl();
}
await Task.WhenAll(tasks);
// All tasks have finished...
}
I think using the .NET ThreadPool would be a good idea in this case, if the tasks take quite a short time to complete.
Check out: http://msdn.microsoft.com/en-us/library/4yd16hza.aspx
This allows you to simplify your code a bit as the ThreadPool automatically manages the count of the worker threads. You just have to call ThreadPool.QueueUserWorkItem for each URL you have and increment a running task counter. Queuing items into the ThreadPool won't block the UI thread.
Have the ThreadPool tasks decrement the counter (as you have now) and when the counter gets to zero (all tasks have been ran) call a callback function so that your main code knows when all the URLs have been processed. You can update the UI or what ever else you want to do from that callback.
I need to implement a sort of task buffer. Basic requirements are:
Process tasks in a single background thread
Receive tasks from multiple threads
Process ALL received tasks i.e. make sure buffer is drained of buffered tasks after a stop signal is received
Order of tasks received per thread must be maintained
I was thinking of implementing it using a Queue like below. Would appreciate feedback on the implementation. Are there any other brighter ideas to implement such a thing?
public class TestBuffer
{
private readonly object queueLock = new object();
private Queue<Task> queue = new Queue<Task>();
private bool running = false;
public TestBuffer()
{
}
public void start()
{
Thread t = new Thread(new ThreadStart(run));
t.Start();
}
private void run()
{
running = true;
bool run = true;
while(run)
{
Task task = null;
// Lock queue before doing anything
lock (queueLock)
{
// If the queue is currently empty and it is still running
// we need to wait until we're told something changed
if (queue.Count == 0 && running)
{
Monitor.Wait(queueLock);
}
// Check there is something in the queue
// Note - there might not be anything in the queue if we were waiting for something to change and the queue was stopped
if (queue.Count > 0)
{
task = queue.Dequeue();
}
}
// If something was dequeued, handle it
if (task != null)
{
handle(task);
}
// Lock the queue again and check whether we need to run again
// Note - Make sure we drain the queue even if we are told to stop before it is emtpy
lock (queueLock)
{
run = queue.Count > 0 || running;
}
}
}
public void enqueue(Task toEnqueue)
{
lock (queueLock)
{
queue.Enqueue(toEnqueue);
Monitor.PulseAll(queueLock);
}
}
public void stop()
{
lock (queueLock)
{
running = false;
Monitor.PulseAll(queueLock);
}
}
public void handle(Task dequeued)
{
dequeued.execute();
}
}
You can actually handle this with the out-of-the-box BlockingCollection.
It is designed to have 1 or more producers, and 1 or more consumers. In your case, you would have multiple producers and one consumer.
When you receive a stop signal, have that signal handler
Signal producer threads to stop
Call CompleteAdding on the BlockingCollection instance
The consumer thread will continue to run until all queued items are removed and processed, then it will encounter the condition that the BlockingCollection is complete. When the thread encounters that condition, it just exits.
You should think about ConcurrentQueue, which is FIFO, in fact. If not suitable, try some of its relatives in Thread-Safe Collections. By using these you can avoid some risks.
I suggest you take a look at TPL DataFlow. BufferBlock is what you're looking for, but it offers so much more.
Look at my lightweight implementation of threadsafe FIFO queue, its a non-blocking synchronisation tool that uses threadpool - better than create own threads in most cases, and than using blocking sync tools as locks and mutexes. https://github.com/Gentlee/SerialQueue
Usage:
var queue = new SerialQueue();
var result = await queue.Enqueue(() => /* code to synchronize */);
You could use Rx on .NET 3.5 for this. It might have never come out of RC, but I believe it is stable* and in use by many production systems. If you don't need Subject you might find primitives (like concurrent collections) for .NET 3.5 you can use that didn't ship with the .NET Framework until 4.0.
Alternative to Rx (Reactive Extensions) for .net 3.5
* - Nit picker's corner: Except for maybe advanced time windowing, which is out of scope, but buffers (by count and time), ordering, and schedulers are all stable.
Let's say I have a business object that is very expensive to instantiate, and I would never want to create more than say 10 instances of that object in my application. So, that would mean I would never want to have more than 10 concurrent worker threads running at one time.
I'd like to use the new System.Threading.Tasks to create a task like this:
var task = Task.Factory.StartNew(() => myPrivateObject.DoSomethingProductive());
Is there a sample out there that would show how to:
create an 'object pool' for use by the TaskFactory?
limit the TaskFactory to a specified number of threads?
lock an instance in the object pool so it can only be used by one task at a time?
Igby's answer led me to this excellent blog post from Justin Etheridge. which then prompted me to write this sample:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
namespace MyThreadedApplication
{
class Program
{
static void Main(string[] args)
{
// build a list of 10 expensive working object instances
var expensiveStuff = new BlockingCollection<ExpensiveWorkObject>();
for (int i = 65; i < 75; i++)
{
expensiveStuff.Add(new ExpensiveWorkObject(Convert.ToChar(i)));
}
Console.WriteLine("{0} expensive objects created", expensiveStuff.Count);
// build a list of work to be performed
Random r = new Random();
var work = new ConcurrentQueue<int>();
for (int i = 0; i < 1000; i++)
{
work.Enqueue(r.Next(10000));
}
Console.WriteLine("{0} items in work queue", work.Count);
// process the list of work items in fifteen threads
for (int i = 1; i < 15; i++)
{
Task.Factory.StartNew(() =>
{
while (true)
{
var expensiveThing = expensiveStuff.Take();
try
{
int workValue;
if (work.TryDequeue(out workValue))
{
expensiveThing.DoWork(workValue);
}
}
finally
{
expensiveStuff.Add(expensiveThing);
}
}
});
}
}
}
}
class ExpensiveWorkObject
{
char identity;
public void DoWork(int someDelay)
{
System.Threading.Thread.Sleep(someDelay);
Console.WriteLine("{0}: {1}", identity, someDelay);
}
public ExpensiveWorkObject(char Identifier)
{
identity = Identifier;
}
}
So, I'm using the BlockingCollection as an object pool, and the worker threads don't check the queue for available work until they have an exclusive control over one of the expensive object instances. I think this meets my requirements, but I would really like feedback from people who know this stuff better than I do...
Two thoughts:
Limited Concurrency Scheduler
You can use a custom task scheduler which limits the number of concurrent tasks. Internally it will allocate up to n Task instances. If you pass it more tasks than it has available instances, it will put them in a queue. Adding custom schedulers like this is a design feature of the TPL.
Here is a good example of such a scheduler. I have sucessfully used a modified version of this.
Object Pool
Another option is to use an object pool. It's a very similar concept except that instead of putting the limitation at the task level, you put it on the number of object instances, and force tasks to wait for a free instance to become available. This has the benefit of reducing the overhead of object creation, but you need to ensure the object is written in a way that allows instances of it to be recycled. You could create an object pool around a concurrent producer-consumer collection such as ConcurrentStack where the consumer adds the instance back to the collection when it's finished.
I've got application which has two main task: encoding, processing video.
These tasks are independant.
Each task I would like run with configurable number of threads.
For this reason for one task I usually use ThreadPool and SetMaxThreads. But now I've got two tasks and would like "two configurable(number of threads) threapool for each task".
Well, ThreadPool is a static class. So how can I implement my strategy(easy configurable number of threads for each task).
Thanks
You will probably want your own thread pool. If you are using .NET 4.0 then it is actually fairly easy to roll your own if you use the BlockingCollection class.
public class CustomThreadPool
{
private BlockingCollection<Action> m_WorkItems = new BlockingCollection<Action>();
public CustomThreadPool(int numberOfThreads)
{
for (int i = 0; i < numberOfThreads; i++)
{
var thread = new Thread(
() =>
{
while (true)
{
Action action = m_WorkItems.Take();
action();
}
});
thread.IsBackground = true;
thread.Start();
}
}
public void QueueUserWorkItem(Action action)
{
m_WorkItems.Add(action);
}
}
That is really all there is to it. You would create a CustomThreadPool for each actual pool you want to control. I posted the minimum amount of code to get a crude thread pool going. Naturally, you might want to tweak and expand this implementation to suit your specific need.