Parallel.For & getting the next free thread/worker - c#

I have the following parallelised code. What I'm not sure of is how to set the workerIndex variable:
// Initializing Worker takes time & must be done before the actual work
Worker[] w = new Worker[3]; // I would like to limit the parallelism to 3
for (int i = 0; i < 3; i++)
w[i] = new Worker();
...
element[] elements = GetArrayOfElements(); // elements.Length > 3
ParallelOptions options = new ParallelOption();
options.MaxDegreeOfParallelism = 3;
Parallel.For(0, elements.Length, options, i =>
{
element e = elements[i];
w[workerIndex].Work(e); // how to set "workerIndex"?
});
Is there some mechanism that says which worker thread id is free next?

How about you just new Worker() in the Parallel.For loop, and add them to w (you would need to change w to a concurrent list).
Possibly (if it is not too complex) move the contents of the .Work(e) method to the body of the loop, eliminating the need for the Worker class.
Edit:
If you change your Worker array to a IEnumerable (i.e. a List<Worker>) you can use .AsParallel() to make it parallel. Then you can use .ForAll(worker -> worker.Work()) to do the work in parallel. This would require you to pass the element into the worker through the constructor.

Looks like Object pool pattern is best for you. You can write the code like this:
const int Limit = 3;
using (var pool = new QueueObjectPool<Worker>(a => new Worker(), Limit)) {
element[] elements = GetArrayOfElements();
var options = new ParallelOptions { MaxDegreeOfParallelism = Limit };
Parallel.For(0, elements.Length, options, i => {
element e = elements[i];
Worker worker = null;
try {
worker = pool.Acquire();
worker.Work(e);
} finally {
pool.Release(worker);
}
});
}
During start every element will wait for available worker and only three workers will be initialized form beginning. Here is simplified queue base object pool implementation:
public sealed class QueueObjectPool<TObject> : IDisposable {
private readonly Queue<TObject> _poolQueue;
private readonly Func<QueueObjectPool<TObject>, TObject> _factory;
private readonly int _capacity;
private readonly SemaphoreSlim _throttler;
public QueueObjectPool(Func<QueueObjectPool<TObject>, TObject> factory, int capacity) {
_factory = factory;
_capacity = capacity;
_throttler = new SemaphoreSlim(initialCount: capacity, maxCount: capacity);
_poolQueue = CreatePoolQueue();
}
public TObject Acquire() {
_throttler.Wait();
lock (_poolQueue) {
return _poolQueue.Dequeue();
}
}
public void Release(TObject poolObject) {
lock (_poolQueue) {
_poolQueue.Enqueue(poolObject);
}
_throttler.Release();
}
private Queue<TObject> CreatePoolQueue() {
var queue = new Queue<TObject>(_capacity);
int itemsLeft = _capacity;
while (itemsLeft > 0) {
TObject queueObject = _factory(this);
queue.Enqueue(queueObject);
itemsLeft -= 1;
}
return queue;
}
public void Dispose() {
throw new NotImplementedException();
}
}
This code is for demo purposes. In real work it is better to use async/await based logic which is easily achieved using SemaphoreSlim.WaitAsync and you can replace Parallel.For with simple loop.

Related

How to span MaxDegreeOfParallelism across multiple TPL Dataflow blocks?

I want to limit the total number of queries that I submit to my database server across all Dataflow blocks to 30. In the following scenario, the throttling of 30 concurrent tasks is per block so it always hits 60 concurrent tasks during execution. Obviously I could limit my parallelism to 15 per block to achieve a system wide total of 30 but this wouldn't be optimal.
How do I make this work? Do I limit (and block) my awaits using SemaphoreSlim, etc, or is there an intrinsic Dataflow approach that works better?
public class TPLTest
{
private long AsyncCount = 0;
private long MaxAsyncCount = 0;
private long TaskId = 0;
private object MetricsLock = new object();
public async Task Start()
{
ExecutionDataflowBlockOptions execOption
= new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 30 };
DataflowLinkOptions linkOption = new DataflowLinkOptions()
{ PropagateCompletion = true };
var doFirstIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doCPUWork = new TransformBlock<Data, Data>(
data => DoCPUBoundWork(data));
var doSecondIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doProcess = new TransformBlock<Data, string>(
i => $"Task finished, ID = : {i.TaskId}");
var doPrint = new ActionBlock<string>(
s => Debug.WriteLine(s));
doFirstIOWorkAsync.LinkTo(doCPUWork, linkOption);
doCPUWork.LinkTo(doSecondIOWorkAsync, linkOption);
doSecondIOWorkAsync.LinkTo(doProcess, linkOption);
doProcess.LinkTo(doPrint, linkOption);
int taskCount = 150;
for (int i = 0; i < taskCount; i++)
{
await doFirstIOWorkAsync.SendAsync(new Data() { Delay = 2500 });
}
doFirstIOWorkAsync.Complete();
await doPrint.Completion;
Debug.WriteLine("Max concurrent tasks: " + MaxAsyncCount.ToString());
}
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
lock(MetricsLock)
{
AsyncCount++;
if (AsyncCount > MaxAsyncCount)
MaxAsyncCount = AsyncCount;
}
if (data.TaskId <= 0)
data.TaskId = Interlocked.Increment(ref TaskId);
await Task.Delay(data.Delay);
lock (MetricsLock)
AsyncCount--;
return data;
}
private Data DoCPUBoundWork(Data data)
{
data.Step = 1;
return data;
}
}
Data Class:
public class Data
{
public int Delay { get; set; }
public long TaskId { get; set; }
public int Step { get; set; }
}
Starting point:
TPLTest tpl = new TPLTest();
await tpl.Start();
Why don't you marshal everything to an action block that has the actual limitation?
var count = 0;
var ab1 = new TransformBlock<int, string>(l => $"1:{l}");
var ab2 = new TransformBlock<int, string>(l => $"2:{l}");
var doPrint = new ActionBlock<string>(
async s =>
{
var c = Interlocked.Increment(ref count);
Console.WriteLine($"{c}:{s}");
await Task.Delay(5);
Interlocked.Decrement(ref count);
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 15 });
ab1.LinkTo(doPrint);
ab2.LinkTo(doPrint);
for (var i = 100; i > 0; i--)
{
if (i % 3 == 0) await ab1.SendAsync(i);
if (i % 5 == 0) await ab2.SendAsync(i);
}
ab1.Complete();
ab2.Complete();
await ab1.Completion;
await ab2.Completion;
This is the solution I ended up going with (unless I can figure out how to use a single generic DataFlow block for marshalling every type of database access):
I defined a SemaphoreSlim at the class level:
private SemaphoreSlim ThrottleDatabaseQuerySemaphore = new SemaphoreSlim(30, 30);
I modified the I/O class to call a throttling class:
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
if (data.TaskId <= 0)
data.TaskId = Interlocked.Increment(ref TaskId);
Task t = Task.Delay(data.Delay); ;
await ThrottleDatabaseQueryAsync(t);
return data;
}
The throttling class: (I also have a generic version of the throttling routine because I couldn't figure out how to write one routine to handle both Task and Task<TResult>)
private async Task ThrottleDatabaseQueryAsync(Task task)
{
await ThrottleDatabaseQuerySemaphore.WaitAsync();
try
{
lock (MetricsLock)
{
AsyncCount++;
if (AsyncCount > MaxAsyncCount)
MaxAsyncCount = AsyncCount;
}
await task;
}
finally
{
ThrottleDatabaseQuerySemaphore.Release();
lock (MetricsLock)
AsyncCount--;
}
}
}
The simplest solution to this problem is to configure all your blocks with a limited-concurrency TaskScheduler:
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxConcurrencyLevel: 30).ConcurrentScheduler;
ExecutionDataflowBlockOptions execOption = new()
{
TaskScheduler = scheduler,
MaxDegreeOfParallelism = scheduler.MaximumConcurrencyLevel,
};
TaskSchedulers can only limit the concurrency of work done on threads. They can't throttle asynchronous operations that are not running on threads. So in order to enforce the MaximumConcurrencyLevel policy, unfortunately you must pass synchronous delegates to all the Dataflow blocks. For example:
TransformBlock<Data, Data> doFirstIOWorkAsync = new(data =>
{
return DoIOBoundWorkAsync(data).GetAwaiter().GetResult();
}, execOption);
This change will increase the demand for ThreadPool threads, so you'd better increase the number of threads that the ThreadPool creates instantly on demand to a higher value than the default Environment.ProcessorCount:
ThreadPool.SetMinThreads(100, 100); // At the start of the program
I am proposing this solution not because it is optimal, but because it is easy to implement. My understanding is that wasting some RAM on ~30 threads that are going to be blocked most of the time, won't have any measurable negative effect on the type of application that you are working with.

Async Producer / Consumer with throttled duration and batched consumption

I am trying to build a service that provides a queue for many asynchronous clients to make requests and await a response. I need to be able to throttle the queue processing by X requests per Y duration. For example: 50 web requests per second. It is for a 3rd party REST Service where I can only issue X requests per second.
Found many SO questions, it is lead me down the path of using TPL Dataflow, I've used a TranformBlock to provide my custom throttling and then X number of ActionBlocks to complete the tasks in parallel. The implementation of the Action seems a bit clunky, so wondering if there is a better way for me to pass Tasks into the pipeline that notify the callers once completed.
I'm wondering if there is there a better or more optimal/simpler way to do what I want? Is there any glaring issues with my implementation? I know it is missing cancellation and exception handing and I'll be doing this next, but your comments are most welcomed.
I've Extended Stephen Cleary's example for my Dataflow pipeline and used
svick's concept of a time throttled TransformBlock. I am wondering if what I've built could be easily achieved with a pure SemaphoreSlim design, its the time based throttling with max operations that I think will complicate things.
Here is the latest implementation. FIFO queue async queue where I can pass in custom actions.
public class ThrottledProducerConsumer<T>
{
private class TimerState<T1>
{
public SemaphoreSlim Sem;
public T1 Value;
}
private BufferBlock<T> _queue;
private IPropagatorBlock<T, T> _throttleBlock;
private List<Task> _consumers;
private static IPropagatorBlock<T1, T1> CreateThrottleBlock<T1>(TimeSpan Interval, Int32 MaxPerInterval)
{
SemaphoreSlim _sem = new SemaphoreSlim(MaxPerInterval);
return new TransformBlock<T1, T1>(async (x) =>
{
var sw = new Stopwatch();
sw.Start();
//Console.WriteLine($"Current count: {_sem.CurrentCount}");
await _sem.WaitAsync();
sw.Stop();
var now = DateTime.UtcNow;
var releaseTime = now.Add(Interval) - now;
//-- Using timer as opposed to Task.Delay as I do not want to await or wait for it to complete
var tm = new Timer((s) => {
var state = (TimerState<T1>)s;
//Console.WriteLine($"RELEASE: {state.Value} was released {DateTime.UtcNow:mm:ss:ff} Reset Sem");
state.Sem.Release();
}, new TimerState<T1> { Sem = _sem, Value = x }, (int)Interval.TotalMilliseconds,
-1);
/*
Task.Delay(delay).ContinueWith((t)=>
{
Console.WriteLine($"RELEASE(FAKE): {x} was released {DateTime.UtcNow:mm:ss:ff} Reset Sem");
//_sem.Release();
});
*/
//Console.WriteLine($"{x} was tramsformed in {sw.ElapsedMilliseconds}ms. Will release {now.Add(Interval):mm:ss:ff}");
return x;
},
//new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
//
new ExecutionDataflowBlockOptions { BoundedCapacity = 5, MaxDegreeOfParallelism = 10 });
}
public ThrottledProducerConsumer(TimeSpan Interval, int MaxPerInterval, Int32 QueueBoundedMax = 5, Action<T> ConsumerAction = null, Int32 MaxConsumers = 1)
{
var consumerOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1, };
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true, };
//-- Create the Queue
_queue = new BufferBlock<T>(new DataflowBlockOptions { BoundedCapacity = QueueBoundedMax, });
//-- Create and link the throttle block
_throttleBlock = CreateThrottleBlock<T>(Interval, MaxPerInterval);
_queue.LinkTo(_throttleBlock, linkOptions);
//-- Create and link the consumer(s) to the throttle block
var consumerAction = (ConsumerAction != null) ? ConsumerAction : new Action<T>(ConsumeItem);
_consumers = new List<Task>();
for (int i = 0; i < MaxConsumers; i++)
{
var consumer = new ActionBlock<T>(consumerAction, consumerOptions);
_throttleBlock.LinkTo(consumer, linkOptions);
_consumers.Add(consumer.Completion);
}
//-- TODO: Add some cancellation tokens to shut this thing down
}
/// <summary>
/// Default Consumer Action, just prints to console
/// </summary>
/// <param name="ItemToConsume"></param>
private void ConsumeItem(T ItemToConsume)
{
Console.WriteLine($"Consumed {ItemToConsume} at {DateTime.UtcNow}");
}
public async Task EnqueueAsync(T ItemToEnqueue)
{
await this._queue.SendAsync(ItemToEnqueue);
}
public async Task EnqueueItemsAsync(IEnumerable<T> ItemsToEnqueue)
{
foreach (var item in ItemsToEnqueue)
{
await this._queue.SendAsync(item);
}
}
public async Task CompleteAsync()
{
this._queue.Complete();
await Task.WhenAll(_consumers);
Console.WriteLine($"All consumers completed {DateTime.UtcNow}");
}
}
The test method
public class WorkItem<T>
{
public TaskCompletionSource<T> tcs;
//public T respone;
public string url;
public WorkItem(string Url)
{
tcs = new TaskCompletionSource<T>();
url = Url;
}
public override string ToString()
{
return $"{url}";
}
}
public static void TestQueue()
{
Console.WriteLine("Created the queue");
var defaultAction = new Action<WorkItem<String>>(async i => {
var taskItem = ((WorkItem<String>)i);
Console.WriteLine($"Consuming: {taskItem.url} {DateTime.UtcNow:mm:ss:ff}");
//-- Assume calling another async method e.g. await httpClient.DownloadStringTaskAsync(url);
await Task.Delay(5000);
taskItem.tcs.SetResult($"{taskItem.url}");
//Console.WriteLine($"Consumed: {taskItem.url} {DateTime.UtcNow}");
});
var queue = new ThrottledProducerConsumer<WorkItem<String>>(TimeSpan.FromMilliseconds(2000), 5, 2, defaultAction);
var results = new List<Task>();
foreach (var no in Enumerable.Range(0, 20))
{
var workItem = new WorkItem<String>($"http://someurl{no}.com");
results.Add(queue.EnqueueAsync(workItem));
results.Add(workItem.tcs.Task);
results.Add(workItem.tcs.Task.ContinueWith(response =>
{
Console.WriteLine($"Received: {response.Result} {DateTime.UtcNow:mm:ss:ff}");
}));
}
Task.WhenAll(results).Wait();
Console.WriteLine("All Work Items Have Been Processed");
}
Since asking, I have created a ThrottledConsumerProducer class based on TPL Dataflow. It was tested over a number of days which included concurrent producers which were queued and completed in order, approx 281k without any problems, however there my be bugs I've not discovered.
I am using a BufferBlock as an asynchronous queue, this is linked to:
A TransformBlock which provides the throttling and blocking I need. It is used in conjunction with a SempahoreSlim to control the max requests. As each item is passed through the block, it increments the semaphore and schedules a task to run X duration later to release the semaphore by one. This way I have a sliding window of X requests per duration; exactly what I wanted. Because of TPL I am also leveraging parallelism to the connected:
ActionBlock(s) which are responsible for performing the task I need.
The classes are generic, so it might be useful to others if they need something similar. I have not written cancellation or error handling, but thought I should just mark this as answered to move it along. I would be quite happy to see some alternatives and feedback, rather than mark mine as an accepted answer. Thanks for reading.
NOTE: I removed the Timer from the original implementation as it was doing weird stuff causing the semaphore to release more than the maximum, I am assuming it is dynamic context error, it occurred when I started running concurrent requests. I worked around it using Task.Delay to schedule a release of a semaphore lock.
Throttled Producer Consumer
public class ThrottledProducerConsumer<T>
{
private BufferBlock<T> _queue;
private IPropagatorBlock<T, T> _throttleBlock;
private List<Task> _consumers;
private static IPropagatorBlock<T1, T1> CreateThrottleBlock<T1>(TimeSpan Interval,
Int32 MaxPerInterval, Int32 BlockBoundedMax = 2, Int32 BlockMaxDegreeOfParallelism = 2)
{
SemaphoreSlim _sem = new SemaphoreSlim(MaxPerInterval, MaxPerInterval);
return new TransformBlock<T1, T1>(async (x) =>
{
//Log($"Transform blk: {x} {DateTime.UtcNow:mm:ss:ff} Semaphore Count: {_sem.CurrentCount}");
var sw = new Stopwatch();
sw.Start();
//Console.WriteLine($"Current count: {_sem.CurrentCount}");
await _sem.WaitAsync();
sw.Stop();
var delayTask = Task.Delay(Interval).ContinueWith((t) =>
{
//Log($"Pre-RELEASE: {x} {DateTime.UtcNow:mm:ss:ff} Semaphore Count {_sem.CurrentCount}");
_sem.Release();
//Log($"PostRELEASE: {x} {DateTime.UtcNow:mm:ss:ff} Semaphoere Count {_sem.CurrentCount}");
});
//},TaskScheduler.FromCurrentSynchronizationContext());
//Log($"Transformed: {x} in queue {sw.ElapsedMilliseconds}ms. {DateTime.Now:mm:ss:ff} will release {DateTime.Now.Add(Interval):mm:ss:ff} Semaphoere Count {_sem.CurrentCount}");
return x;
},
//-- Might be better to keep Bounded Capacity in sync with the semaphore
new ExecutionDataflowBlockOptions { BoundedCapacity = BlockBoundedMax,
MaxDegreeOfParallelism = BlockMaxDegreeOfParallelism });
}
public ThrottledProducerConsumer(TimeSpan Interval, int MaxPerInterval,
Int32 QueueBoundedMax = 5, Action<T> ConsumerAction = null, Int32 MaxConsumers = 1,
Int32 MaxThrottleBuffer = 20, Int32 MaxDegreeOfParallelism = 10)
{
//-- Probably best to link MaxPerInterval and MaxThrottleBuffer
// and MaxConsumers with MaxDegreeOfParallelism
var consumerOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1, };
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true, };
//-- Create the Queue
_queue = new BufferBlock<T>(new DataflowBlockOptions { BoundedCapacity = QueueBoundedMax, });
//-- Create and link the throttle block
_throttleBlock = CreateThrottleBlock<T>(Interval, MaxPerInterval);
_queue.LinkTo(_throttleBlock, linkOptions);
//-- Create and link the consumer(s) to the throttle block
var consumerAction = (ConsumerAction != null) ? ConsumerAction : new Action<T>(ConsumeItem);
_consumers = new List<Task>();
for (int i = 0; i < MaxConsumers; i++)
{
var consumer = new ActionBlock<T>(consumerAction, consumerOptions);
_throttleBlock.LinkTo(consumer, linkOptions);
_consumers.Add(consumer.Completion);
}
//-- TODO: Add some cancellation tokens to shut this thing down
}
/// <summary>
/// Default Consumer Action, just prints to console
/// </summary>
/// <param name="ItemToConsume"></param>
private void ConsumeItem(T ItemToConsume)
{
Log($"Consumed {ItemToConsume} at {DateTime.UtcNow}");
}
public async Task EnqueueAsync(T ItemToEnqueue)
{
await this._queue.SendAsync(ItemToEnqueue);
}
public async Task EnqueueItemsAsync(IEnumerable<T> ItemsToEnqueue)
{
foreach (var item in ItemsToEnqueue)
{
await this._queue.SendAsync(item);
}
}
public async Task CompleteAsync()
{
this._queue.Complete();
await Task.WhenAll(_consumers);
Console.WriteLine($"All consumers completed {DateTime.UtcNow}");
}
private static void Log(String messageToLog)
{
System.Diagnostics.Trace.WriteLine(messageToLog);
Console.WriteLine(messageToLog);
}
}
- Example Usage -
A Generic WorkItem
public class WorkItem<Toutput,Tinput>
{
private TaskCompletionSource<Toutput> _tcs;
public Task<Toutput> Task { get { return _tcs.Task; } }
public Tinput InputData { get; private set; }
public Toutput OutputData { get; private set; }
public WorkItem(Tinput inputData)
{
_tcs = new TaskCompletionSource<Toutput>();
InputData = inputData;
}
public void Complete(Toutput result)
{
_tcs.SetResult(result);
}
public void Failed(Exception ex)
{
_tcs.SetException(ex);
}
public override string ToString()
{
return InputData.ToString();
}
}
Creating the action block executed in the pipeline
private Action<WorkItem<Location,PointToLocation>> CreateProcessingAction()
{
return new Action<WorkItem<Location,PointToLocation>>(async i => {
var sw = new Stopwatch();
sw.Start();
var taskItem = ((WorkItem<Location,PointToLocation>)i);
var inputData = taskItem.InputData;
//Log($"Consuming: {inputData.Latitude},{inputData.Longitude} {DateTime.UtcNow:mm:ss:ff}");
//-- Assume calling another async method e.g. await httpClient.DownloadStringTaskAsync(url);
await Task.Delay(500);
sw.Stop();
Location outData = new Location()
{
Latitude = inputData.Latitude,
Longitude = inputData.Longitude,
StreetAddress = $"Consumed: {inputData.Latitude},{inputData.Longitude} Duration(ms): {sw.ElapsedMilliseconds}"
};
taskItem.Complete(outData);
//Console.WriteLine($"Consumed: {taskItem.url} {DateTime.UtcNow}");
});
}
Test Method
You'll need to provide your own implementation for PointToLocation and Location. Just an example of how you'd use it with your own classes.
int startRange = 0;
int nextRange = 1000;
ThrottledProducerConsumer<WorkItem<Location,PointToLocation>> tpc;
private void cmdTestPipeline_Click(object sender, EventArgs e)
{
Log($"Pipeline test started {DateTime.Now:HH:mm:ss:ff}");
if(tpc == null)
{
tpc = new ThrottledProducerConsumer<WorkItem<Location, PointToLocation>>(
//1010, 2, 20000,
TimeSpan.FromMilliseconds(1010), 45, 100000,
CreateProcessingAction(),
2,45,10);
}
var workItems = new List<WorkItem<Models.Location, PointToLocation>>();
foreach (var i in Enumerable.Range(startRange, nextRange))
{
var ptToLoc = new PointToLocation() { Latitude = i + 101, Longitude = i + 100 };
var wrkItem = new WorkItem<Location, PointToLocation>(ptToLoc);
workItems.Add(wrkItem);
wrkItem.Task.ContinueWith(t =>
{
var loc = t.Result;
string line = $"[Simulated:{DateTime.Now:HH:mm:ss:ff}] - {loc.StreetAddress}";
//txtResponse.Text = String.Concat(txtResponse.Text, line, System.Environment.NewLine);
//var lines = txtResponse.Text.Split(new string[] { System.Environment.NewLine},
// StringSplitOptions.RemoveEmptyEntries).LongCount();
//lblLines.Text = lines.ToString();
//Log(line);
});
//}, TaskScheduler.FromCurrentSynchronizationContext());
}
startRange += nextRange;
tpc.EnqueueItemsAsync(workItems);
Log($"Pipeline test completed {DateTime.Now:HH:mm:ss:ff}");
}

How to pass different instances while multithreading?

I am building a scraper. My goal is to start X browsers (where X is number of threads) and proceed to scrape a list of URLs with each of them by splitting that list in X parts.
I decide to use 3 threads (3 browsers) with list of 10 URLs.
Question: How to separate each task between the browsers like this:
Browser1 scrapes items in the list from 0 to 3
Browser2 scrapes items in the list from 4 to 7
Browser3 scrapes items in the list from 8 to 10
All browsers should be working at the same time scraping the passed list of URLs.
I already have this BlockingCollection:
BlockingCollection<Action> _taskQ = new BlockingCollection<Action>();
public Multithreading(int workerCount)
{
// Create and start a separate Task for each consumer:
for (int i = 0; i < workerCount; i++)
Task.Factory.StartNew(Consume);
}
public void Dispose() { _taskQ.CompleteAdding(); }
public void EnqueueTask(Action action) { _taskQ.Add(action); }
void Consume()
{
// This sequence that we’re enumerating will block when no elements
// are available and will end when CompleteAdding is called.
foreach (Action action in _taskQ.GetConsumingEnumerable())
action(); // Perform task.
}
public int ItemsCount()
{
return _taskQ.Count;
}
It can be used like this:
Multithreading multithread = new Multithreading(3); //3 threads
foreach(string url in urlList){
multithread.EnqueueTask(new Action(() =>
{
startScraping(browser1); //or browser2 or browser3
}));
}
I need to create the browsers instances before scraping, because I do not want to start a new browser with every thread.
Taking Henk Holtermans comment into account that you may want maximum speed, i.e. keep browsers busy as much as possible, use this:
private static void StartScraping(int id, IEnumerable<Uri> urls)
{
// Construct browser here
foreach (Uri url in urls)
{
// Use browser to process url here
Console.WriteLine("Browser {0} is processing url {1}", id, url);
}
}
in main:
int nrWorkers = 3;
int nrUrls = 10;
BlockingCollection<Uri> taskQ = new BlockingCollection<Uri>();
foreach (int i in Enumerable.Range(0, nrWorkers))
{
Task.Run(() => StartScraping(i, taskQ.GetConsumingEnumerable()));
}
foreach (int i in Enumerable.Range(0, nrUrls))
{
taskQ.Add(new Uri(String.Format("http://Url{0}", i)));
}
taskQ.CompleteAdding();
I think the usual approach is to have a single blocking queue, a provider thread and an arbitrary pool of workers.
The provider thread is responsible for adding URLs to the queue. It blocks when there are none to add.
A worker thread instantiates a browser, and then retrieves a single URL from the queue, scrapes it and then loops back for more. It blocks when the queue is empty.
You can start as many workers as you like, and they just sort it out between them.
The mainline starts all the threads and retires to the sidelines. It looks after the UI, if there is one.
Multithreading can be really hard to debug. You might want to look at using Tasks for at least part of the job.
You could give some Id to the tasks and also Workers. Then you'll have BlockingCollection[] instead of just BlockingCollection. Every consumer will consume from its own BlockingCollection from the array. Our job is to find the right consumer and post the job.
BlockingCollection<Action>[] _taskQ;
private int taskCounter = -1;
public Multithreading(int workerCount)
{
_taskQ = new BlockingCollection<Action>[workerCount];
for (int i = 0; i < workerCount; i++)
{
int workerId = i;//To avoid closure issue
_taskQ[workerId] = new BlockingCollection<Action>();
Task.Factory.StartNew(()=> Consume(workerId));
}
}
public void EnqueueTask(Action action)
{
int value = Interlocked.Increment(ref taskCounter);
int index = value / 4;//Your own logic to find the index here
_taskQ[index].Add(action);
}
void Consume(int workerId)
{
foreach (Action action in _taskQ[workerId].GetConsumingEnumerable())
action();// Perform task.
}
A simple solution using background workers can limit the number of threads:
public class Scraper : IDisposable
{
private readonly BlockingCollection<Action> tasks;
private readonly IList<BackgroundWorker> workers;
public Scraper(IList<Uri> urls, int numberOfThreads)
{
for (var i = 0; i < urls.Count; i++)
{
var url = urls[i];
tasks.Add(() => Scrape(url));
}
for (var i = 0; i < numberOfThreads; i++)
{
var worker = new BackgroundWorker();
worker.DoWork += (sender, args) =>
{
Action task;
while (tasks.TryTake(out task))
{
task();
}
};
workers.Add(worker);
worker.RunWorkerAsync();
}
}
public void Scrape(Uri url)
{
Console.WriteLine("Scraping url {0}", url);
}
public void Dispose()
{
throw new NotImplementedException();
}
}

What is the best scenario for one fast producer multiple slow consumers?

I'm looking for the best scenario to implement one producer multiple consumer multithreaded application.
Currently I'm using one queue for shared buffer but it's much slower than the case of one producer one consumer.
I'm planning to do it like this:
Queue<item>[] buffs = new Queue<item>[N];
object[] _locks = new object[N];
static void Produce()
{
int curIndex = 0;
while(true)
{
// Produce item;
lock(_locks[curIndex])
{
buffs[curIndex].Enqueue(curItem);
Monitor.Pulse(_locks[curIndex]);
}
curIndex = (curIndex+1)%N;
}
}
static void Consume(int myIndex)
{
item curItem;
while(true)
{
lock(_locks[myIndex])
{
while(buffs[myIndex].Count == 0)
Monitor.Wait(_locks[myIndex]);
curItem = buffs[myIndex].Dequeue();
}
// Consume item;
}
}
static void main()
{
int N = 100;
Thread[] consumers = new Thread[N];
for(int i = 0; i < N; i++)
{
consumers[i] = new Thread(Consume);
consumers[i].Start(i);
}
Thread producer = new Thread(Produce);
producer.Start();
}
Use a BlockingCollection
BlockingCollection<item> _buffer = new BlockingCollection<item>();
static void Produce()
{
while(true)
{
// Produce item;
_buffer.Add(curItem);
}
// eventually stop producing
_buffer.CompleteAdding();
}
static void Consume(int myIndex)
{
foreach (var curItem in _buffer.GetConsumingEnumerable())
{
// Consume item;
}
}
static void main()
{
int N = 100;
Thread[] consumers = new Thread[N];
for(int i = 0; i < N; i++)
{
consumers[i] = new Thread(Consume);
consumers[i].Start(i);
}
Thread producer = new Thread(Produce);
producer.Start();
}
If you don't want to specify number of threads from start you can use Parallel.ForEach instead.
static void Consume(item curItem)
{
// consume item
}
void Main()
{
Thread producer = new Thread(Produce);
producer.Start();
Parallel.ForEach(_buffer.GetConsumingPartitioner(), Consumer)
}
Using more threads won't help. It may even reduce performance. I suggest you try to use ThreadPool where every work item is one item created by the producer. However, that doesn't guarantee the produced items to be consumed in the order they were produced.
Another way could be to reduce the number of consumers to 4, for example and modify the way they work as follows:
The producer adds the new work to the queue. There's only one global queue for all worker threads. It then sets a flag to indicate there is new work like this:
ManualResetEvent workPresent = new ManualResetEvent(false);
Queue<item> workQueue = new Queue<item>();
static void Produce()
{
while(true)
{
// Produce item;
lock(workQueue)
{
workQueue.Enqueue(newItem);
workPresent.Set();
}
}
}
The consumers wait for work to be added to the queue. Only one consumer will get to do its job. It then takes all the work from the queue and resets the flag. The producer will not be able to add new work until that is done.
static void Consume()
{
while(true)
{
if (WaitHandle.WaitOne(workPresent))
{
workPresent.Reset();
Queue<item> localWorkQueue = new Queue<item>();
lock(workQueue)
{
while (workQueue.Count > 0)
localWorkQueue.Enqueue(workQueue.Dequeue());
}
// Handle items in local work queue
...
}
}
}
That outcome of this, however, is a bit unpredictable. It could be that one thread is doing all the work and the others do nothing.
I don't see why you have to use multiple queues. Just reduce the amount of locking. Here is an sample where you can have a large number of consumers and they all wait for new work.
public class MyWorkGenerator
{
ConcurrentQueue<object> _queuedItems = new ConcurrentQueue<object>();
private object _lock = new object();
public void Produce()
{
while (true)
{
_queuedItems.Enqueue(new object());
Monitor.Pulse(_lock);
}
}
public object Consume(TimeSpan maxWaitTime)
{
if (!Monitor.Wait(_lock, maxWaitTime))
return null;
object workItem;
if (_queuedItems.TryDequeue(out workItem))
{
return workItem;
}
return null;
}
}
Do note that Pulse() will only trigger one consumer at a time.
Example usage:
static void main()
{
var generator = new MyWorkGenerator();
var consumers = new Thread[20];
for (int i = 0; i < consumers.Length; i++)
{
consumers[i] = new Thread(DoWork);
consumers[i].Start(generator);
}
generator.Produce();
}
public static void DoWork(object state)
{
var generator = (MyWorkGenerator) state;
var workItem = generator.Consume(TimeSpan.FromHours(1));
while (workItem != null)
{
// do work
workItem = generator.Consume(TimeSpan.FromHours(1));
}
}
Note that the actual queue is hidden in the producer as it's imho an implementation detail. The consumers doesn't really have to know how the work items are generated.

A producer consumer queue with an additional thread for a periodic backup of data

I'm trying to implement a concurrent producer/consumer queue with multiple producers and one consumer: the producers add some data to the Queue, and the consumer dequeues these data from the queue in order to update a collection. This collection must be periodically backed up to a new file. For this purpose I created a custom serializable collection: serialization could be performed by using the DataContractSerializer.
The queue is only shared between the consumer and the producers, so access to this queue must be managed to avoid race conditions.
The custom collection is shared between the consumer and a backup thread.
The backup thread could be activated periodically using a System.Threading.Timer object: it may initially be scheduled by the consumer, and then it would be scheduled at the end of every backup procedure.
Finally, a shutdown method should stop the queuing by producers, then stop the consumer, perform the last backup and dispose the timer.
The dequeuing of an item at a time may not be efficient, so I thought of using two queues: when the first queue becomes full, the producers notify the consumer by invoking Monitor.Pulse. As soon as the consumer receives the notification, the queues are swapped, so while producers enqueue new items, the consumer can process the previous ones.
The sample that I wrote seems to work properly. I think it is also thread-safe, but I'm not sure about that. The following code, for simplicity I used a Queue<int>. I also used (again for simplicity) an ArrayList instead of collection serializable.
public class QueueManager
{
private readonly int m_QueueMaxSize;
private readonly TimeSpan m_BackupPeriod;
private readonly object m_SyncRoot_1 = new object();
private Queue<int> m_InputQueue = new Queue<int>();
private bool m_Shutdown;
private bool m_Pulsed;
private readonly object m_SyncRoot_2 = new object();
private ArrayList m_CustomCollection = new ArrayList();
private Thread m_ConsumerThread;
private Timer m_BackupThread;
private WaitHandle m_Disposed;
public QueueManager()
{
m_ConsumerThread = new Thread(Work) { IsBackground = true };
m_QueueMaxSize = 7;
m_BackupPeriod = TimeSpan.FromSeconds(30);
}
public void Run()
{
m_Shutdown = m_Pulsed = false;
m_BackupThread = new Timer(DoBackup);
m_Disposed = new AutoResetEvent(false);
m_ConsumerThread.Start();
}
public void Shutdown()
{
lock (m_SyncRoot_1)
{
m_Shutdown = true;
Console.WriteLine("Worker shutdown...");
Monitor.Pulse(m_SyncRoot_1);
}
m_ConsumerThread.Join();
WaitHandle.WaitAll(new WaitHandle[] { m_Disposed });
if (m_InputQueue != null) { m_InputQueue.Clear(); }
if (m_CustomCollection != null) { m_CustomCollection.Clear(); }
Console.WriteLine("Worker stopped!");
}
public void Enqueue(int item)
{
lock (m_SyncRoot_1)
{
if (m_InputQueue.Count == m_QueueMaxSize)
{
if (!m_Pulsed)
{
Monitor.Pulse(m_SyncRoot_1); // it notifies the consumer...
m_Pulsed = true;
}
Monitor.Wait(m_SyncRoot_1); // ... and waits for Pulse
}
m_InputQueue.Enqueue(item);
Console.WriteLine("{0} \t {1} >", Thread.CurrentThread.Name, item.ToString("+000;-000;"));
}
}
private void Work()
{
m_BackupThread.Change(m_BackupPeriod, TimeSpan.FromMilliseconds(-1));
Queue<int> m_SwapQueueRef, m_WorkerQueue = new Queue<int>();
Console.WriteLine("Worker started!");
while (true)
{
lock (m_SyncRoot_1)
{
if (m_InputQueue.Count < m_QueueMaxSize && !m_Shutdown) Monitor.Wait(m_SyncRoot_1);
Console.WriteLine("\nswapping...");
m_SwapQueueRef = m_InputQueue;
m_InputQueue = m_WorkerQueue;
m_WorkerQueue = m_SwapQueueRef;
m_Pulsed = false;
Monitor.PulseAll(m_SyncRoot_1); // all producers are notified
}
Console.WriteLine("Worker\t < {0}", String.Join(",", m_WorkerQueue.ToArray()));
lock (m_SyncRoot_2)
{
Console.WriteLine("Updating custom dictionary...");
foreach (int item in m_WorkerQueue)
{
m_CustomCollection.Add(item);
}
Thread.Sleep(1000);
Console.WriteLine("Custom dictionary updated successfully!");
}
if (m_Shutdown)
{
// schedule last backup
m_BackupThread.Change(0, Timeout.Infinite);
return;
}
m_WorkerQueue.Clear();
}
}
private void DoBackup(object state)
{
try
{
lock (m_SyncRoot_2)
{
Console.WriteLine("Backup...");
Thread.Sleep(2000);
Console.WriteLine("Backup completed at {0}", DateTime.Now);
}
}
finally
{
if (m_Shutdown) { m_BackupThread.Dispose(m_Disposed); }
else { m_BackupThread.Change(m_BackupPeriod, TimeSpan.FromMilliseconds(-1)); }
}
}
}
Some objects are initialized in the Run method to allow you to restart this QueueManager after it is stopped, as shown in the code below.
public static void Main(string[] args)
{
QueueManager queue = new QueueManager();
var t1 = new Thread(() =>
{
for (int i = 0; i < 50; i++)
{
queue.Enqueue(i);
Thread.Sleep(1500);
}
}) { Name = "t1" };
var t2 = new Thread(() =>
{
for (int i = 0; i > -30; i--)
{
queue.Enqueue(i);
Thread.Sleep(3000);
}
}) { Name = "t2" };
t1.Start(); t2.Start(); queue.Run();
t1.Join(); t2.Join(); queue.Shutdown();
Console.ReadLine();
var t3 = new Thread(() =>
{
for (int i = 0; i < 50; i++)
{
queue.Enqueue(i);
Thread.Sleep(1000);
}
}) { Name = "t3" };
var t4 = new Thread(() =>
{
for (int i = 0; i > -30; i--)
{
queue.Enqueue(i);
Thread.Sleep(2000);
}
}) { Name = "t4" };
t3.Start(); t4.Start(); queue.Run();
t3.Join(); t4.Join(); queue.Shutdown();
Console.ReadLine();
}
I would suggest using the BlockingCollection for a producer/consumer queue. It was designed specifically for that purpose. The producers add items using Add and the consumers use Take. If there are no items to take then it will block until one is added. It is already designed to be used in a multithreaded environment, so if you're just using those methods there's no need to explicitly use any locks or other synchronization code.

Categories