Consuming blocking collection with multiple tasks/consumers


I have the following code, which populates users from a source; for the sake of example it is shown below. What I want to do is consume the BlockingCollection with multiple consumers.
Is the code below the right way to do that? Also, what would be the best number of threads? I realize this depends on hardware, memory, etc. Is there a better way to do it?
Also, would the implementation below ensure that I process everything in the collection until it is empty?
class Program
{
public static readonly BlockingCollection<User> users = new BlockingCollection<User>();
static void Main(string[] args)
{
for (int i = 0; i < 100000; i++)
{
var u = new User {Id = i, Name = "user " + i};
users.Add(u);
}
Run();
}
static void Run()
{
for (int i = 0; i < 100; i++)
{
Task.Factory.StartNew(Process, TaskCreationOptions.LongRunning);
}
}
static void Process()
{
foreach (var user in users.GetConsumingEnumerable())
{
Console.WriteLine(user.Id);
}
}
}
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}

A few small things:
You never call CompleteAdding. Without it, your consuming foreach loops never complete and hang forever. Fix that by calling users.CompleteAdding() after the initial for loop.
You never wait for the work to finish. Run() spins up your 100 threads (which is likely WAY too many unless your real processing involves a lot of waiting on uncontended resources). Because tasks run on background threads rather than foreground threads, they will not keep your program open when Main exits. You need something like a CountdownEvent to track when everything is done.
You don't start your consumers until after your producer has finished all of its work. You should either spin off the producer into a separate thread or start the consumers first, so they are ready to work while you populate the collection on the main thread.
Here is an updated version of the code with the fixes:
class Program
{
private const int MaxThreads = 100; //way too high for this example.
private static readonly CountdownEvent cde = new CountdownEvent(MaxThreads);
public static readonly BlockingCollection<User> users = new BlockingCollection<User>();
static void Main(string[] args)
{
Run();
for (int i = 0; i < 100000; i++)
{
var u = new User {Id = i, Name = "user " + i};
users.Add(u);
}
users.CompleteAdding();
cde.Wait();
}
static void Run()
{
for (int i = 0; i < MaxThreads; i++)
{
Task.Factory.StartNew(Process, TaskCreationOptions.LongRunning);
}
}
static void Process()
{
foreach (var user in users.GetConsumingEnumerable())
{
Console.WriteLine(user.Id);
}
cde.Signal();
}
}
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}
As for the "best number of threads": like I said earlier, it really depends on what you are waiting on.
If what you are processing is CPU bound, the optimum number of threads is likely Environment.ProcessorCount.
If you are waiting on an external resource, but new requests do not affect old requests (for example, asking 20 different servers for information, where the load on server n does not affect the load on server n+1), I would let Parallel.ForEach choose the number of threads for you.
If you are waiting on a contended resource (for example, reading from or writing to a hard disk), you will not want to use very many threads at all (perhaps even only one). I just posted an answer to another question about that: when reading from a hard disk, you should use only one thread at a time so the drive is not jumping around all over trying to complete all the reads at once.
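For the CPU-bound case, the advice above could be sketched roughly as follows. This is a minimal sketch, not the code from the question: the ConsumerDemo class, the Run method, and the capacity of 1000 are names and values made up for illustration, and Task.WaitAll is used here in place of the CountdownEvent as one alternative way to wait for the consumers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConsumerDemo
{
    // Starts consumerCount consumers, feeds itemCount integers through a
    // bounded BlockingCollection, and returns how many items were consumed.
    public static int Run(int itemCount, int consumerCount)
    {
        int processed = 0;
        // The capacity of 1000 is an arbitrary value chosen for this sketch.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 1000))
        {
            // Consumers start first so they drain the queue while it is being filled.
            Task[] consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (int item in queue.GetConsumingEnumerable())
                    {
                        Interlocked.Increment(ref processed); // stand-in for real work
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            for (int i = 0; i < itemCount; i++)
            {
                queue.Add(i); // blocks while the queue is full
            }
            queue.CompleteAdding(); // lets the consuming loops end

            Task.WaitAll(consumers); // also rethrows any consumer exception
        }
        return processed;
    }

    public static void Main()
    {
        int consumed = Run(100000, Environment.ProcessorCount);
        Console.WriteLine("Consumed " + consumed + " items");
    }
}
```

The bounded capacity keeps a fast producer from filling memory when the consumers are slower; whether that matters depends on the real workload.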

Related

Threads monitoring a Queue<Actions>

I am doing a small project to map a network (routers only) using SNMP. To speed things up, I'm trying to have a pool of threads do the jobs I need, apart from the first job, which is done by the main thread.
At this time I have two jobs; one takes a parameter and the other doesn't:
UpdateDeviceInfo(NetworkDevice nd)
UpdateLinks() *not defined yet
What I'm trying to achieve is to have the worker threads wait for a job to appear on a Queue<Action> and wait while it is empty. The main thread will add the first job and then wait for all workers, which might add more jobs, to finish before adding the second job and waking up the sleeping threads.
My problems/questions are:
How do I define the Queue<Action> so that I can insert the methods and their parameters, if any? If that is not possible, I could make all functions accept the same parameter.
How do I launch the worker threads so that they run indefinitely? I'm not sure where I should create the for(;;).
This is my code so far:
public enum DatabaseState
{
Empty = 0,
Learning = 1,
Updating = 2,
Stable = 3,
Exiting = 4
};
public class NetworkDB
{
public Dictionary<string, NetworkDevice> database;
private Queue<Action<NetworkDevice>> jobs;
private string _community;
private string _ipaddress;
private Object _statelock = new Object();
private DatabaseState _state = DatabaseState.Empty;
private readonly int workers = 4;
private Object _threadswaitinglock = new Object();
private int _threadswaiting = 0;
public Dictionary<string, NetworkDevice> Database { get => database; set => database = value; }
public NetworkDB(string community, string ipaddress)
{
_community = community;
_ipaddress = ipaddress;
database = new Dictionary<string, NetworkDevice>();
jobs = new Queue<Action<NetworkDevice>>();
}
public void Start()
{
NetworkDevice nd = SNMP.GetDeviceInfo(new IpAddress(_ipaddress), _community);
if (nd.Status > NetworkDeviceStatus.Unknown)
{
database.Add(nd.Id, nd);
_state = DatabaseState.Learning;
nd.Update(this); // The first job is done by the main thread
for (int i = 0; i < workers; i++)
{
Thread t = new Thread(JobRemove);
t.Start();
}
lock (_statelock)
{
if (_state == DatabaseState.Learning)
{
Monitor.Wait(_statelock);
}
}
lock (_statelock)
{
if (_state == DatabaseState.Updating)
{
Monitor.Wait(_statelock);
}
}
foreach (KeyValuePair<string, NetworkDevice> n in database)
{
using (System.IO.StreamWriter file = new System.IO.StreamWriter(n.Value.Name + ".txt"))
{
file.WriteLine(n);
}
}
}
}
public void JobInsert(Action<NetworkDevice> func, NetworkDevice nd)
{
lock (jobs)
{
jobs.Enqueue(item);
if (jobs.Count == 1)
{
// wake up any blocked dequeue
Monitor.Pulse(jobs);
}
}
}
public void JobRemove()
{
Action<NetworkDevice> item;
lock (jobs)
{
while (jobs.Count == 0)
{
lock (_threadswaitinglock)
{
_threadswaiting += 1;
if (_threadswaiting == workers)
Monitor.Pulse(_statelock);
}
Monitor.Wait(jobs);
}
lock (_threadswaitinglock)
{
_threadswaiting -= 1;
}
item = jobs.Dequeue();
item.Invoke();
}
}
public bool NetworkDeviceExists(NetworkDevice nd)
{
try
{
Monitor.Enter(database);
if (database.ContainsKey(nd.Id))
{
return true;
}
else
{
database.Add(nd.Id, nd);
Action<NetworkDevice> action = new Action<NetworkDevice>(UpdateDeviceInfo);
jobs.Enqueue(action);
return false;
}
}
finally
{
Monitor.Exit(database);
}
}
//Job1 - Learning -> Update device info
public void UpdateDeviceInfo(NetworkDevice nd)
{
nd.Update(this);
try
{
Monitor.Enter(database);
nd.Status = NetworkDeviceStatus.Self;
}
finally
{
Monitor.Exit(database);
}
}
//Job2 - Updating -> After Learning, create links between neighbours
private void UpdateLinks()
{
}
}

Your best bet is probably to use a BlockingCollection instead of the Queue class. Both behave effectively the same in terms of FIFO ordering, but a BlockingCollection lets each of your threads block until an item can be taken, by calling GetConsumingEnumerable or Take. Here is a complete example:
http://mikehadlow.blogspot.com/2012/11/using-blockingcollection-to-communicate.html?m=1
As for including the parameters, you could use a closure to capture the NetworkDevice itself and then enqueue a plain Action instead of an Action<NetworkDevice>.
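A rough sketch of that idea, assuming a simplified Device class as a stand-in for the question's NetworkDevice (Device, JobQueue, and JobQueueDemo are hypothetical names, not part of the original code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for the NetworkDevice class from the question.
public class Device
{
    public string Id;
    public bool Updated;
}

public class JobQueue
{
    private readonly BlockingCollection<Action> _jobs = new BlockingCollection<Action>();
    private readonly Thread[] _workers;

    public JobQueue(int workerCount)
    {
        _workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            _workers[i] = new Thread(Work);
            _workers[i].Start();
        }
    }

    // Callers wrap the method and its arguments in a lambda.
    public void Enqueue(Action job) { _jobs.Add(job); }

    // Stops accepting jobs and waits for the workers to drain the queue.
    public void Shutdown()
    {
        _jobs.CompleteAdding();
        foreach (Thread t in _workers) t.Join();
    }

    private void Work()
    {
        // Blocks while the queue is empty; the loop ends after CompleteAdding.
        foreach (Action job in _jobs.GetConsumingEnumerable())
        {
            job();
        }
    }
}

public static class JobQueueDemo
{
    public static void UpdateDeviceInfo(Device d) { d.Updated = true; }

    public static void Main()
    {
        var queue = new JobQueue(4);
        var device = new Device { Id = "router-1" };
        queue.Enqueue(() => UpdateDeviceInfo(device));           // job with a parameter
        queue.Enqueue(() => Console.WriteLine("links updated")); // job without one
        queue.Shutdown();
        Console.WriteLine(device.Id + " updated: " + device.Updated);
    }
}
```

The foreach over GetConsumingEnumerable takes the place of the hand-written for(;;) and Monitor.Wait logic. Note that once Shutdown has called CompleteAdding, a job that tries to enqueue further jobs will get an exception, so the "wait until all workers are idle" phase from the question would still need its own tracking.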

C# Multi-threading, wait for all task to complete in a situation when new tasks are being constantly added

I have a situation where new tasks are constantly being generated and added to a ConcurrentBag<Task>.
I need to wait for all tasks to complete.
Waiting for all the tasks in the ConcurrentBag via WaitAll is not enough, as the number of tasks may have grown by the time the previous wait completes.
At the moment I am waiting in the following way:
private void WaitAllTasks()
{
while (true)
{
int countAtStart = _tasks.Count();
Task.WaitAll(_tasks.ToArray());
int countAtEnd = _tasks.Count();
if (countAtStart == countAtEnd)
{
break;
}
#if DEBUG
if (_tasks.Count() > 100)
{
tokenSource.Cancel();
break;
}
#endif
}
}
I am not very happy with the while(true) solution.
Can anyone suggest a better, more efficient way to do this (without having to poll constantly with a while(true))?
Additional context as requested in the comments, though I don't think it is relevant to the question.
This piece of code is used in a web crawler. The crawler scans page content and looks for two types of pages: data pages and link pages. Data pages are scanned and data is collected from them; link pages are scanned and more links are collected from them.
As each task carries on its activities and finds more links, it adds the links to an EventList. There is an OnAdd event on the list (code below) that is used to trigger other tasks to scan the newly added URLs, and so forth.
The job is complete when there are no more running tasks (so no more links will be added) and all items have been processed.
public IEventList<ISearchStatus> CurrentLinks { get; private set; }
public IEventList<IDataStatus> CurrentData { get; private set; }
public IEventList<System.Dynamic.ExpandoObject> ResultData { get; set; }
private readonly ConcurrentBag<Task> _tasks = new ConcurrentBag<Task>();
private readonly CancellationTokenSource tokenSource = new CancellationTokenSource();
private readonly CancellationToken token;
public void Search(ISearchDefinition search)
{
CurrentLinks.OnAdd += UrlAdded;
CurrentData.OnAdd += DataUrlAdded;
var status = new SearchStatus(search);
CurrentLinks.Add(status);
WaitAllTasks();
_exporter.Export(ResultData as IList<System.Dynamic.ExpandoObject>);
}
private void DataUrlAdded(object o, EventArgs e)
{
var item = o as IDataStatus;
if (item == null)
{
return;
}
_tasks.Add(Task.Factory.StartNew(() => ProcessObjectSearch(item), token));
}
private void UrlAdded(object o, EventArgs e)
{
var item = o as ISearchStatus;
if (item==null)
{
return;
}
_tasks.Add(Task.Factory.StartNew(() => ProcessFollow(item), token));
_tasks.Add(Task.Factory.StartNew(() => ProcessData(item), token));
}
public class EventList<T> : List<T>, IEventList<T>
{
public EventHandler OnAdd { get; set; }
private readonly object locker = new object();
public new void Add(T item)
{
//lock (locker)
{
base.Add(item);
}
OnAdd?.Invoke(item, null);
}
public new bool Contains(T item)
{
//lock (locker)
{
return base.Contains(item);
}
}
}

I think this can be done with the TPL Dataflow library with a very basic setup. You'll need a BufferBlock<ParseTask>, a TransformManyBlock<ParseTask, DataTask>, and an ActionBlock<DataTask> (maybe more than one) for the actual data processing, like this:
// queue for a new urls to parse
var buffer = new BufferBlock<ParseTask>();
// parser itself, returns many data tasks from one url
// similar to LINQ.SelectMany method
var transform = new TransformManyBlock<ParseTask, DataTask>(task =>
{
// get all the additional urls to parse
var parsedLinks = GetLinkTasks(task);
// get all the data to parse
var parsedData = GetDataTasks(task);
// setup additional links to be parsed
foreach (var parsedLink in parsedLinks)
{
buffer.Post(parsedLink);
}
// return all the data to be processed
return parsedData;
});
// actual data processing
var consumer = new ActionBlock<DataTask>(s => ProcessData(s));
After that you need to link the blocks to each other:
buffer.LinkTo(transform, new DataflowLinkOptions { PropagateCompletion = true });
transform.LinkTo(consumer, new DataflowLinkOptions { PropagateCompletion = true });
Now you have a nice pipeline which executes in the background. Once you know that everything you need has been parsed, you simply call the Complete method on a block so that it stops accepting new messages. After the buffer becomes empty, it propagates completion down the pipeline to the transform block, which propagates it down to the consumer(s), and you wait on the Completion task:
// no additional links would be accepted
buffer.Complete();
// after all the tasks are done, this will get fired
await consumer.Completion;
You can detect the moment of completion by checking, for example, whether the buffer's Count property, the transform's InputCount, and the transform's CurrentDegreeOfParallelism (an internal property of TransformManyBlock) are all equal to 0.
However, I suggest you implement some additional logic of your own to track the number of running transformers, as relying on internal state isn't a great solution. As for cancelling the pipeline, you can create each TPL block with a CancellationToken, either one shared by all blocks or a dedicated one per block, which gives you cancellation out of the box.
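One possible shape for that "additional logic" is a counter of in-flight work items that signals when it drops to zero. The sketch below deliberately uses plain tasks rather than Dataflow blocks, and WorkTracker and TrackerDemo are hypothetical names; in a Dataflow pipeline the same counter could be what decides when to call buffer.Complete(). It relies on one assumption: new work is only ever scheduled from inside an item that is still running (or as the initial seed), so the count cannot reach zero early.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class WorkTracker
{
    private int _pending;
    private readonly TaskCompletionSource<bool> _done = new TaskCompletionSource<bool>();

    // Completes when every scheduled item, including items scheduled
    // by other items, has finished.
    public Task Completion { get { return _done.Task; } }

    public void Schedule(Action work)
    {
        Interlocked.Increment(ref _pending); // count the item before it starts
        Task.Run(() =>
        {
            try { work(); }
            finally
            {
                // The last item to finish signals completion.
                if (Interlocked.Decrement(ref _pending) == 0)
                    _done.TrySetResult(true);
            }
        });
    }
}

public static class TrackerDemo
{
    // Visits a binary tree of the given depth; every node schedules its
    // children, much like a crawled page that yields more links.
    public static int CountNodes(int depth)
    {
        var tracker = new WorkTracker();
        int visited = 0;
        Action<int> visit = null;
        visit = level =>
        {
            Interlocked.Increment(ref visited);
            if (level < depth)
            {
                tracker.Schedule(() => visit(level + 1));
                tracker.Schedule(() => visit(level + 1));
            }
        };
        tracker.Schedule(() => visit(1)); // the seed item
        tracker.Completion.Wait();        // no polling loop needed
        return visited;
    }

    public static void Main()
    {
        Console.WriteLine(CountNodes(4)); // levels 1..4: 1 + 2 + 4 + 8 = 15 nodes
    }
}
```

Exceptions thrown by a work item are not surfaced in this sketch; a real crawler would want to record them or fault the Completion task.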

Why not write one function that yields your tasks as necessary, as they are created? This way you can just use Task.WhenAll to wait for them to complete. Or have I missed the point? See this working here.
using System;
using System.Threading.Tasks;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
try
{
Task.WhenAll(GetLazilyGeneratedSequenceOfTasks()).Wait();
Console.WriteLine("Finished.");
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
}
public static IEnumerable<Task> GetLazilyGeneratedSequenceOfTasks()
{
var random = new Random();
var finished = false;
while (!finished)
{
var n = random.Next(1, 2001);
if (n < 50)
{
finished = true;
}
if (n > 499)
{
yield return Task.Delay(n);
}
Task.Delay(20).Wait();
}
yield break;
}
}
Alternatively, if your question is not as trivial as my answer may suggest, I'd consider a mesh with TPL Dataflow. The combination of a BufferBlock and an ActionBlock would get you very close to what you need. You could start here.
Either way, I'd suggest you want to include a provision for accepting a CancellationToken or two.

How to monitor/wait on array of objects?

I am using a ConcurrentBag to store a set of objects. I want to implement something like:
if (an object is present)
return it
else wait until one becomes free; if none becomes free within a specific time, throw an exception.
if (an object has been returned)
add it to the bag
I was thinking of using monitors, but a monitor can only wait on one specific object. I want to wait until any of them is free. How can I implement this?

Extending the MSDN example found here:
public class FiniteObjectPool<T>: IDisposable
{
System.Threading.AutoResetEvent m_Wait = new System.Threading.AutoResetEvent(false);
private ConcurrentBag<T> _objects;
public FiniteObjectPool()
{
_objects = new ConcurrentBag<T>();
}
public T GetObject()
{
T item;
while(!_objects.TryTake(out item))
{
m_Wait.WaitOne(); //an object was not available, wait until one is
}
return item;
}
public void PutObject(T item)
{
_objects.Add(item);
m_Wait.Set(); //signal a waiting thread that object may now be available
}
public void Dispose()
{
m_Wait.Dispose();
}
}
EDIT - example usage with 'Context' idiom wrapper
class Program
{
public class FiniteObjectPoolContext<T>: IDisposable
{
private readonly FiniteObjectPool<T> m_Pool; // assigned in the constructor; no throwaway pool needed
public T Value { get; set; }
public FiniteObjectPoolContext(FiniteObjectPool<T> pool)
{
m_Pool = pool;
Value = pool.GetObject(); //take an object out - this will block if none is available
}
public void Dispose()
{
m_Pool.PutObject(Value); //put the object back because this context is finished
}
}
static void Main(string[] args)
{
FiniteObjectPool<int> pool = new FiniteObjectPool<int>();
for (int i = 0; i < 10; i++)
{
pool.PutObject(i);
}
List<Task> tasks = new List<Task>();
for (int i = 0; i < 20; i++)
{
int id = i;
tasks.Add(Task.Run(() =>
{
Console.WriteLine("Running task " + id);
using (var con = new FiniteObjectPoolContext<int>(pool))
{
Console.WriteLine("Task " + id + " got object from pool: " + con.Value);
System.Threading.Thread.Sleep(5000);
Console.WriteLine("Task " + id + " is finished with pool object: " + con.Value);
}
}));
}
Task.WaitAll(tasks.ToArray());
Console.WriteLine("DONE");
Console.ReadLine();
}
}
Notice the latency injected by the thread synchronization mechanisms.

Try semaphores for what you want to do. .NET has two semaphore implementations, Semaphore and SemaphoreSlim; both can be used to coordinate many threads trying to access a pool of resources.
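As a rough illustration of the semaphore approach, including the timeout the question asks for, something like the following might work. TimedObjectPool and PoolDemo are hypothetical names invented for this sketch; SemaphoreSlim.Wait(TimeSpan) returns false when the timeout elapses, which is what triggers the exception.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class TimedObjectPool<T>
{
    private readonly ConcurrentBag<T> _items;
    private readonly SemaphoreSlim _available;

    public TimedObjectPool(IEnumerable<T> items)
    {
        _items = new ConcurrentBag<T>(items);
        // The semaphore count mirrors the number of free objects.
        _available = new SemaphoreSlim(_items.Count);
    }

    // Waits up to the timeout for any object to become free.
    public T Take(TimeSpan timeout)
    {
        if (!_available.Wait(timeout))
            throw new TimeoutException("No pooled object became free in time.");
        T item;
        _items.TryTake(out item); // succeeds: the count never exceeds the bag size
        return item;
    }

    public void Return(T item)
    {
        _items.Add(item);     // add first so a woken waiter finds the item
        _available.Release();
    }
}

public static class PoolDemo
{
    public static void Main()
    {
        var pool = new TimedObjectPool<string>(new[] { "a", "b" });
        string first = pool.Take(TimeSpan.FromSeconds(1));
        string second = pool.Take(TimeSpan.FromSeconds(1));
        try
        {
            pool.Take(TimeSpan.FromMilliseconds(100)); // pool is empty, so this times out
        }
        catch (TimeoutException ex)
        {
            Console.WriteLine(ex.Message);
        }
        pool.Return(first);
        Console.WriteLine("Got one back: " + pool.Take(TimeSpan.FromSeconds(1)));
        pool.Return(second);
    }
}
```

Unlike an AutoResetEvent, the semaphore counts every returned object, so several Return calls in a row wake several waiters rather than just one.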

Performance of Parallelism with Dynamic Objects

The following code runs in roughly 2.5 seconds:
static void Main(string[] args)
{
var service = new Service();
Parallel.For(0, 100, i => {
dynamic user = new ExpandoObject();
user.data = new ExpandoObject();
user.data.id = i;
user.data.name = "User Name";
var parsed = service.Parse(user);
});
}
public class Service
{
public User Parse(dynamic dynamicUser)
{
if (dynamicUser.data != null)
{
return new User
{
Id = dynamicUser.data.id,
Name = dynamicUser.data.name
};
}
return null;
}
}
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}
However, if I change the Parallel.For() loop to a simple for loop, it runs in about 200 milliseconds:
for (var i = 0; i < 100; i++)
So my question is, why is this much slower when run in parallel?
My theory is that there is some overhead in parsing the dynamic object that is done once per thread. In the simple loop, the DLR does its thing the first time and then doesn't need to for each subsequent call.
But in parallel, the overhead of the DLR happens in each call.
Is this a correct assumption, or am I way off base?

I suspect you're being misled by your diagnostics. In particular, if running a loop 100 times takes 2.5 seconds, that's really, really slow. Is this under the debugger, by any chance?
Here are the results on my box for code compiled with /o+ and then run in the console. Note that I'm running 1,000,000 loop iterations in each test.
Void ExecuteParallel(): 00:00:00.7311773
Void ExecuteSerial(): 00:00:02.0514120
Void ExecuteParallel(): 00:00:00.6897816
Void ExecuteSerial(): 00:00:02.0389325
Void ExecuteParallel(): 00:00:00.6754025
Void ExecuteSerial(): 00:00:02.0653801
Void ExecuteParallel(): 00:00:00.7136330
Void ExecuteSerial(): 00:00:02.0477593
Void ExecuteParallel(): 00:00:00.6742260
Void ExecuteSerial(): 00:00:02.0476146
It's not as much faster in parallel as you might expect from a quad-core i7, but I suspect that's due to the context switches etc mentioned by Servy - and also possibly contention on the execution cache in the DLR. Still, it's faster than running in series.
Try the code yourself, and see what you get on your box - but not under a debugger.
Code:
using System;
using System.Diagnostics;
using System.Dynamic;
using System.Threading.Tasks;
class Test
{
const int Iterations = 1000000;
static void Main(string[] args)
{
for (int i = 0; i < 5; i++)
{
RunTest(ExecuteParallel);
RunTest(ExecuteSerial);
}
}
static void RunTest(Action action)
{
var sw = Stopwatch.StartNew();
action();
sw.Stop();
Console.WriteLine("{0}: {1}", action.Method, sw.Elapsed);
}
static void ExecuteParallel()
{
var service = new Service();
Parallel.For(0, Iterations, i => {
dynamic user = new ExpandoObject();
user.data = new ExpandoObject();
user.data.id = i;
user.data.name = "User Name";
var parsed = service.Parse(user);
});
}
static void ExecuteSerial()
{
var service = new Service();
for (int i = 0; i < Iterations; i++)
{
dynamic user = new ExpandoObject();
user.data = new ExpandoObject();
user.data.id = i;
user.data.name = "User Name";
var parsed = service.Parse(user);
}
}
}
public class Service
{
public User Parse(dynamic dynamicUser)
{
if (dynamicUser.data != null)
{
return new User
{
Id = dynamicUser.data.id,
Name = dynamicUser.data.name
};
}
return null;
}
}
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}

Here the tasks that you're doing are so simple, take so little time, and there are few enough of them, that the overhead of creating threads, breaking up the tasks, scheduling them, dealing with context switches, memory barriers, and all of that, is significant in comparison to the amount of productive work that you're doing. If you were doing work that took longer, then the overhead of parallelizing it would be much less in comparison.

How to ensure thread safe ASP.net page to access static list of objects

In my web application I have the following common object list shared by all online users:
public static List<MyClass> myObjectList = new List<MyClass>();
When multiple online users try to read data from myObjectList, is there any chance of a thread synchronization issue?
In another scenario, multiple users are reading from myObjectList and a few of them are also writing, but every user writes to a different index of the list. Any user may also add a new item to the list. So now I think there is a chance of a synchronization issue.
How can I write a thread-safe utility class that reads and writes data in this object safely?
Suggestions are highly welcome.
The code suggested by Angelo looks like this:
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;
namespace ObjectPoolExample
{
public class ObjectPool<T>
{
private ConcurrentBag<T> _objects;
private Func<T> _objectGenerator;
public ObjectPool(Func<T> objectGenerator)
{
if (objectGenerator == null) throw new ArgumentNullException("objectGenerator");
_objects = new ConcurrentBag<T>();
_objectGenerator = objectGenerator;
}
public T GetObject()
{
T item;
if (_objects.TryTake(out item)) return item;
return _objectGenerator();
}
public void PutObject(T item)
{
_objects.Add(item);
}
}
class Program
{
static void Main(string[] args)
{
CancellationTokenSource cts = new CancellationTokenSource();
// Create an opportunity for the user to cancel.
Task.Factory.StartNew(() =>
{
char key = Console.ReadKey().KeyChar; // read once; calling ReadKey twice would wait for two key presses
if (key == 'c' || key == 'C')
cts.Cancel();
});
ObjectPool<MyClass> pool = new ObjectPool<MyClass> (() => new MyClass());
// Create a high demand for MyClass objects.
Parallel.For(0, 1000000, (i, loopState) =>
{
MyClass mc = pool.GetObject();
Console.CursorLeft = 0;
// This is the bottleneck in our application. All threads in this loop
// must serialize their access to the static Console class.
Console.WriteLine("{0:####.####}", mc.GetValue(i));
pool.PutObject(mc);
if (cts.Token.IsCancellationRequested)
loopState.Stop();
});
Console.WriteLine("Press the Enter key to exit.");
Console.ReadLine();
}
}
// A toy class that requires some resources to create.
// You can experiment here to measure the performance of the
// object pool vs. ordinary instantiation.
class MyClass
{
public int[] Nums {get; set;}
public double GetValue(long i)
{
return Math.Sqrt(Nums[i]);
}
public MyClass()
{
Nums = new int[1000000];
Random rand = new Random();
for (int i = 0; i < Nums.Length; i++)
Nums[i] = rand.Next();
}
}
}
I think I can go with this approach.

If you are using .NET 4.0, you are better off switching to one of the thread-safe collections already provided by the runtime, for example a ConcurrentBag.
ConcurrentBag, however, does not support access by index, so you may need to resort to a ConcurrentDictionary if you need to access an object by a given key.
If .NET 4.0 is not an option, you should read the following blog post:
Why are thread safe collections so hard?
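For the .NET 4.0 route, a small sketch of keyed access with ConcurrentDictionary might look like this. Item is a hypothetical stand-in for the question's MyClass, and SharedItemsDemo is made up for the example; in the web application the dictionary would take the place of the static List<MyClass>.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in for the MyClass type from the question.
public class Item
{
    public int Value;
}

public static class SharedItemsDemo
{
    // Adds one item per writer in parallel, then updates key 0 once per
    // writer. Returns { number of items, final Value stored under key 0 }.
    public static int[] Run(int writers)
    {
        var items = new ConcurrentDictionary<int, Item>();

        // Concurrent adds under distinct keys need no explicit locking.
        Parallel.For(0, writers, i => items.TryAdd(i, new Item { Value = i }));

        // AddOrUpdate retries on contention, so the update delegate should
        // be side-effect free: it builds a new Item instead of mutating the old one.
        Parallel.For(0, writers, i =>
            items.AddOrUpdate(0, new Item { Value = 1 },
                (key, old) => new Item { Value = old.Value + 1 }));

        return new[] { items.Count, items[0].Value };
    }

    public static void Main()
    {
        int[] result = Run(1000);
        Console.WriteLine("Items: " + result[0] + ", value at key 0: " + result[1]);
    }
}
```

The dictionary only protects its own structure. If the stored objects are mutated in place by several requests at once, those objects still need their own synchronization, which is why the update above replaces the item rather than changing it.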
