I'm trying to write a Windows service whose producers and consumers work like this:
Producer: At scheduled times, get all unprocessed items (Processed = 0 on their row in the db) and add each one that isn't already in the work queue to the queue
Consumer: Constantly pull items from the work queue, process them, and update the db (Processed = 1 on their row)
I've tried to look for examples of this exact data flow in C#/.NET so I can leverage existing libraries, but so far I haven't found exactly that.
I see on https://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html the example
private static void Produce(BufferBlock<int> queue, IEnumerable<int> values)
{
    foreach (var value in values)
    {
        queue.Post(value);
    }
    queue.Complete();
}

private static async Task<IEnumerable<int>> Consume(BufferBlock<int> queue)
{
    var ret = new List<int>();
    while (await queue.OutputAvailableAsync())
    {
        ret.Add(await queue.ReceiveAsync());
    }
    return ret;
}
Here's the "idea" of what I'm trying to modify that to do:
while (true)
{
    if (await WorkQueue.OutputAvailableAsync())
    {
        ProcessItem(await WorkQueue.ReceiveAsync());
    }
    else
    {
        await Task.Delay(5000);
    }
}
...would be how the Consumer works, and
MyTimer.Elapsed += Produce;

static async void Produce(object source, ElapsedEventArgs e)
{
    IEnumerable<Item> items = GetUnprocessedItemsFromDb();
    foreach (var item in items)
        if (!WorkQueue.Contains(w => w.Id == item.Id))
            WorkQueue.Enqueue(item);
}
...would be how the Producer works.
That's a rough idea of what I'm trying to do. Can any of you show me the right way to do it, or link me to the proper documentation for solving this type of problem?
Creating a custom BufferBlock<T> that rejects duplicate messages is anything but trivial. The TPL Dataflow components do not expose their internal state for the purpose of customization. You can see here an attempt to circumvent this limitation, by creating a custom ActionBlock<T> with an exposed IEnumerable<T> InputQueue property. The code is lengthy and obscure, and creating a custom BufferUniqueBlock<T> might need double the amount of code, because this class implements the ISourceBlock<T> interface too.
My suggestion is to find some other way to avoid processing an Item twice, instead of preventing duplicates from entering the queue. Maybe you could make the Consumer responsible for querying the database and checking whether the currently received item is still unprocessed, before actually processing it, as in the sketch below.
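For example, here is a minimal sketch of that consumer-side check. The IsProcessed and MarkProcessed data-access helpers are hypothetical stand-ins (not a real API); ProcessItem, Item and the BufferBlock<Item> are from your question:
private static async Task ConsumeAsync(BufferBlock<Item> workQueue)
{
    while (await workQueue.OutputAvailableAsync())
    {
        Item item = await workQueue.ReceiveAsync();

        // Re-check the db so a duplicate that slipped into the queue
        // is skipped instead of being processed twice.
        if (IsProcessed(item.Id)) continue;

        ProcessItem(item);
        MarkProcessed(item.Id); // sets Processed = 1 on the item's row
    }
}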
Related
I have an observer module which takes care of subscriptions to a reactive stream I have created from Kafka. Sadly I need to poll in order to receive messages from Kafka, so I need to dedicate one background thread for that. My first solution was this one:
public void Poll()
{
    if (Interlocked.Exchange(ref _state, POLLING) == NOTPOLLING)
    {
        Task.Run(() =>
        {
            while (CurrentSubscriptions.Count != 0)
            {
                _consumer.Poll(TimeSpan.FromSeconds(1));
            }
            _state = NOTPOLLING;
        });
    }
}
Now my reviewer suggested that I should use Task because it has statuses and can be checked to see whether it is running or not. This led to this code:
public void Poll()
{
    // checks for statuses: WaitingForActivation, WaitingToRun, Running
    if (_runningStatuses.Contains(_pollingTask.Status)) return;

    _pollingTask.Start(); // this obviously throws an exception once the Task has already completed and I want to start it again
}
The Task remained pretty much the same, but the check changed. Since my logic is that I want to start polling when I have subscriptions and stop when I don't, I need to sort of re-use the Task; but since I can't, I am wondering: do I need to go back to my first implementation, or is there any other neat way of doing this that I am missing?
Your first implementation looks fine. You might use a ManualResetEventSlim instead of an enum and Interlocked.Exchange, but that's essentially the same... as long as you have just two states.
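If you do stay with the Interlocked approach, here is a minimal sketch of a variant (assuming the POLLING/NOTPOLLING constants and the fields from your question) that uses CompareExchange and restores the state even if Poll throws:
private const int NOTPOLLING = 0;
private const int POLLING = 1;
private int _state = NOTPOLLING;

public void Poll()
{
    // Only transition NOTPOLLING -> POLLING; a no-op when already polling.
    if (Interlocked.CompareExchange(ref _state, POLLING, NOTPOLLING) == NOTPOLLING)
    {
        Task.Run(() =>
        {
            try
            {
                while (CurrentSubscriptions.Count != 0)
                {
                    _consumer.Poll(TimeSpan.FromSeconds(1));
                }
            }
            finally
            {
                // Publish the state change even if Poll throws.
                Interlocked.Exchange(ref _state, NOTPOLLING);
            }
        });
    }
}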
I think I made a compromise: I replaced the Interlocked API with [MethodImpl(MethodImplOptions.Synchronized)]. It lets me have a simple method body without Interlocked calls that could confuse a newcomer or inexperienced colleague.
[MethodImpl(MethodImplOptions.Synchronized)]
public void Poll()
{
    if (!_polling)
    {
        _polling = true;

        new Task(() =>
        {
            while (_currentSubscriptions.Count != 0)
            {
                _consumer.Poll(TimeSpan.FromSeconds(1));
            }
            _polling = false;
        }, TaskCreationOptions.LongRunning).Start();
    }
}
I have 30 sub companies and every one has implemented its own web service (with different technologies).
I need to implement a web service to aggregate them. For example, all the sub company web services have a web method named GetUserPoint(int nationalCode), and I need to implement my web service so that it calls all of them and collects the responses (for example, a sum of points).
This is my base class:
public abstract class BaseClass
{ // all same attributes and methods
    public abstract long GetPoint(int nationalCode);
}
For each sub company's web service, I implement a class that inherits this base class and defines its own GetPoint method.
public class Company1 : BaseClass
{
    // implement own GetPoint method (call a web service).
}
to
public class CompanyN : BaseClass
{
    // implement own GetPoint method (call a web service).
}
so, this is my web method:
[WebMethod]
public long MyCollector(int nationalCode)
{
    BaseClass[] Clients = new BaseClass[] { new Company1(), /* ... */ new CompanyN() };
    long Result = 0;

    foreach (var item in Clients)
    {
        long ResultTemp = item.GetPoint(nationalCode);
        Result += ResultTemp;
    }

    return Result;
}
OK, it works, but it's so slow, because every sub company's web service is hosted on a different server (on the internet).
I can use parallel programming like this (is this called parallel programming!?):
foreach (var item in Clients)
{
    Tasks.Add(Task.Run(() =>
    {
        Result.AddRange(item.GetPoint(MasterLogId, mobileNumber));
    }));
}
I think parallel programming (and threading) isn't good for this solution, because my solution is IO bound (not CPU intensive)!
Calling every external web service is so slow, am I right? Many threads would just be pending on a response!
I think async programming is the best way, but I am new to async programming and parallel programming.
What is the best way? (Parallel.ForEach - async TAP - async APM - async EAP - threading)
Please write an example for me.
It's refreshing to see someone who has done their homework.
First things first, as of .NET 4 (and this is still very much the case today) TAP is the preferred technology for async workflow in .NET. Tasks are easily composable, and for you to parallelise your web service calls is a breeze if they provide true Task<T>-returning APIs. For now you have "faked" it with Task.Run, and for the time being this may very well suffice for your purposes. Sure, your thread pool threads will spend a lot of time blocking, but if the server load isn't very high you could very well get away with it even if it's not the ideal thing to do.
You just need to fix a potential race condition in your code (more on that towards the end).
If you want to follow the best practices though, you go with true TAP. If your APIs provide Task-returning methods out of the box, that's easy. If not, it's not game over as APM and EAP can easily be converted to TAP. MSDN reference: https://msdn.microsoft.com/en-us/library/hh873178(v=vs.110).aspx
I'll also include some conversion examples here.
APM (taken from another SO question):
MessageQueue does not provide a ReceiveAsync method, but we can get it to play ball via Task.Factory.FromAsync:
public static Task<Message> ReceiveAsync(this MessageQueue messageQueue)
{
    // Note: the end method must be EndReceive - the original snippet's
    // EndPeek pairs with BeginPeek, not BeginReceive.
    return Task.Factory.FromAsync(messageQueue.BeginReceive(), messageQueue.EndReceive);
}
...
Message message = await messageQueue.ReceiveAsync().ConfigureAwait(false);
If your web service proxies have BeginXXX/EndXXX methods, this is the way to go.
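As a hedged sketch, assuming your generated proxy exposes BeginGetPoint/EndGetPoint (the actual names will vary per proxy), the conversion would look something like:
public static class PointWebServiceApmExtensions
{
    // Task.Factory.FromAsync pairs the Begin/End methods into a Task<long>.
    public static Task<long> GetPointTaskAsync(this PointWebService webService, int nationalCode)
    {
        return Task.Factory.FromAsync(
            webService.BeginGetPoint, webService.EndGetPoint, nationalCode, null);
    }
}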
EAP
Assume you have an old web service proxy derived from SoapHttpClientProtocol, with only event-based async methods. You can convert them to TAP as follows:
// (This extension method must live in a static class.)
public static Task<long> GetPointAsyncTask(this PointWebService webService, int nationalCode)
{
    TaskCompletionSource<long> tcs = new TaskCompletionSource<long>();

    webService.GetPointAsyncCompleted += (s, e) =>
    {
        if (e.Cancelled)
        {
            tcs.SetCanceled();
        }
        else if (e.Error != null)
        {
            tcs.SetException(e.Error);
        }
        else
        {
            tcs.SetResult(e.Result);
        }
    };

    webService.GetPointAsync(nationalCode);

    return tcs.Task;
}
...
using (PointWebService service = new PointWebService())
{
    long point = await service.GetPointAsyncTask(123).ConfigureAwait(false);
}
Avoiding races when aggregating results
With regards to aggregating parallel results, your TAP loop code is almost right, but you need to avoid mutating shared state inside your Task bodies as they will likely execute in parallel. Shared state being Result in your case - which is some kind of collection. If this collection is not thread-safe (i.e. if it's a simple List<long>), then you have a race condition and you may get exceptions and/or dropped results on Add (I'm assuming AddRange in your code was a typo, but if not - the above still applies).
A simple async-friendly rewrite that fixes your race would look like this:
List<Task<long>> tasks = new List<Task<long>>();
foreach (BaseClass item in Clients) {
    tasks.Add(item.GetPointAsync(MasterLogId, mobileNumber));
}
long[] results = await Task.WhenAll(tasks).ConfigureAwait(false);
If you decide to be lazy and stick with the Task.Run solution for now, the corrected version will look like this:
List<Task<long>> tasks = new List<Task<long>>();
foreach (BaseClass item in Clients)
{
    Task<long> dodgyThreadPoolTask = Task.Run(
        () => item.GetPoint(MasterLogId, mobileNumber)
    );

    tasks.Add(dodgyThreadPoolTask);
}
long[] results = await Task.WhenAll(tasks).ConfigureAwait(false);
You can create an async version of GetPoint:
public abstract class BaseClass
{ // all same attributes and methods
    public abstract long GetPoint(int nationalCode);

    public Task<long> GetPointAsync(int nationalCode)
    {
        // Wrap the synchronous call in a thread-pool task; awaiting a plain
        // long (as the original snippet did) would not compile.
        return Task.Run(() => GetPoint(nationalCode));
    }
}
Then, collect the tasks for each client call. After that, execute all tasks using Task.WhenAll. This will execute them all in parallel. Also, as pointed out by Kirill, you can await the results of each task:
var tasks = Clients.Select(x => x.GetPointAsync(nationalCode));
long[] results = await Task.WhenAll(tasks);
If you do not want to make the aggregating method async, you can collect the results by calling .Result instead of awaiting (beware that blocking like this can deadlock when a synchronization context is present, e.g. in UI or classic ASP.NET code), like so:
long[] results = Task.WhenAll(tasks).Result;
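Putting it together, a minimal sketch of the whole aggregation (assuming the GetPointAsync wrapper above and the company classes from the question):
public async Task<long> CollectPointsAsync(int nationalCode)
{
    BaseClass[] clients = { new Company1(), /* ... */ new CompanyN() };

    // Start all calls, then await them together and sum the results.
    long[] results = await Task.WhenAll(clients.Select(c => c.GetPointAsync(nationalCode)));
    return results.Sum();
}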
I'm hoping to find some advice on the best way to achieve fetching a bunch of id values (like database Identity values) before I need them. I have a number of classes that require a unique id (int), and what I'd like to do is fetch the next available id (per class, per server) and have it cached locally, ready. When an id is taken I want to get the next one ready, etc.
I've produced some code to demonstrate what I am trying to do. The code is terrible (it should contain locks, etc.) but I think it gets the point across. Losing the odd id is not a problem - a duplicate id is a problem. I'm happy with the guts of GetNextIdAsync - it calls a proc
this.Database.SqlQuery<int>("EXEC EntityNextIdentityValue @Key",
    new SqlParameter("@Key", key)).First();
on SQL Server that uses sp_getapplock to ensure each return value is unique (and incremental).
static class ClassId
{
    static private Dictionary<string, int> _ids = new Dictionary<string, int>();
    static private Dictionary<string, Thread> _threads = new Dictionary<string, Thread>();

    static ClassId()
    {
        //get the first NextId for all known classes
        StartGetNextId("Class1");
        StartGetNextId("Class2");
        StartGetNextId("Class3");
    }

    static public int NextId(string key)
    {
        //wait for a current call for nextId to finish
        while (_threads.ContainsKey(key)) { }

        //get the current nextId
        int nextId = _ids[key];

        //start the call for the next nextId
        StartGetNextId(key);

        //return the current nextId
        return nextId;
    }

    static private void StartGetNextId(string key)
    {
        _threads.Add(key, new Thread(() => GetNextIdAsync(key)));
        _threads[key].Start();
    }

    static private void GetNextIdAsync(string key)
    {
        //call the long running task to get the next available value
        Thread.Sleep(1000);

        if (_ids.ContainsKey(key)) _ids[key] += 1;
        else _ids.Add(key, 1);

        _threads.Remove(key);
    }
}
My question is - what is the best way to always have the next value I'm going to need before I need it? How should the class be arranged and where should the locks be? E.g. lock inside GetNextIdAsync() add the new thread but don't start it and change StartGetNextId() to call .Start()?
You should have your database generate the identity values, by marking that column appropriately. You can retrieve that value with SCOPE_IDENTITY or similar.
The main failings of your implementation are the busy wait in NextId and accessing the Dictionary simultaneously from multiple threads. The simplest solution would be to use a BlockingCollection like ohadsc suggests below. You'll need to anticipate the case where your database goes down and you can't get more id's - you don't want to deadlock your application. So you would want to use the Take() overload that accepts a CancellationToken, which you would cancel in the event that accessing the database fails.
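A minimal sketch of that cancellable consume side, assuming the _queue buffer from ohadsc's answer below:
int GetNextId(CancellationToken token)
{
    // Take blocks until an id is available; if the producer is stuck because
    // the database is down, cancelling the token makes this throw
    // OperationCanceledException instead of deadlocking the caller.
    return _queue.Take(token);
}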
This seems like a good application for a producer-consumer pattern.
I'm thinking something like:
private ConcurrentDictionary<string, int> _ids;
private ConcurrentDictionary<string, Thread> _threads;
private Task _producer;
private Task _consumer;
private CancellationTokenSource _cancellation;

private void StartProducer()
{
    _producer = Task.Factory.StartNew(() =>
    {
        while (_cancellation.Token.IsCancellationRequested == false)
        {
            // GetNextKeyValuePair stands in for the long-running fetch.
            var pair = GetNextKeyValuePair();
            _ids.TryAdd(pair.Key, pair.Value);
        }
    });
}

private void StartConsumer()
{
    _consumer = Task.Factory.StartNew(() =>
    {
        while (_cancellation.Token.IsCancellationRequested == false)
        {
            // "key" is a placeholder for however the next id is chosen.
            UseNextId(_ids[key]);
            int removed;
            _ids.TryRemove(key, out removed);
        }
    });
}
A few things to point out...
Firstly, and you probably know this already, it's very important to use thread-safe collections like ConcurrentDictionary or BlockingCollection instead of plain Dictionary or List. If you don't do this, bad things will happen, people will die and babies will cry.
Second, you might need something a little less hamfisted than the basic CancellationTokenSource, that's just what I'm used to from my service programming. The point is to have some way to cancel these things so you can shut them down gracefully.
Thirdly, consider throwing sleeps in there to keep it from pounding the processor too hard.
The particulars of this will vary based on how fast you can generate these things as opposed to how fast you can consume them. My code gives absolutely no guarantee that you will have the ID you want before the consumer asks for it, if the consumer is running at a much higher speed than the producer. However, this is a decent, albeit basic way to organize preparing of this sort of data concurrently.
You could use a BlockingCollection for this. Basically you'll have a thread pumping new IDs into a buffer:
BlockingCollection<int> _queue = new BlockingCollection<int>(BufferSize);

void Init()
{
    Task.Factory.StartNew(PopulateIdBuffer, TaskCreationOptions.LongRunning);
}

void PopulateIdBuffer()
{
    int id = 0;
    while (true)
    {
        Thread.Sleep(1000); //Simulate long retrieval
        _queue.Add(id++);
    }
}

void SomeMethodThatNeedsId()
{
    var nextId = _queue.Take();
    ....
}
Why will Parallel.ForEach not finish executing a series of tasks until MoveNext returns false?
I have a tool that monitors a combination of MSMQ and Service Broker queues for incoming messages. When a message is found, it hands that message off to the appropriate executor.
I wrapped the check for messages in an IEnumerable, so that I could hand the Parallel.ForEach method the IEnumerable plus a delegate to run. The application is designed to run continuously, with IEnumerator.MoveNext looping until it's able to get work and IEnumerator.Current then giving it the next item.
Since MoveNext will never return until I set the CancelToken to true, this should continue to process forever. Instead, what I'm seeing is that once the Parallel.ForEach has picked up all the messages and MoveNext is no longer returning "true", no more tasks are processed. It seems like the MoveNext thread is the only thread given any work while the loop waits for it to return, and the other threads (including waiting and scheduled ones) do not do any work.
Is there a way to tell the Parallel to keep working while it waits for a response from the MoveNext?
If not, is there another way to structure the MoveNext to get what I want? (having it return true and then the Current returning a null object spawns a lot of bogus Tasks)
Bonus Question: Is there a way to limit how many messages the Parallel pulls off at once? It seems to pull off and schedule a lot of messages at once (the MaxDegreeOfParallelism only seems to limit how much work it does at once, it doesn't stop it from pulling off a lot of messages to be scheduled)
Here is the IEnumerator for what I've written (w/o some extraneous code):
public class DataAccessEnumerator : IEnumerator<TransportMessage>
{
    public TransportMessage Current
    {
        get { return _currentMessage; }
    }

    public bool MoveNext()
    {
        while (_cancelToken.IsCancellationRequested == false)
        {
            TransportMessage current;

            foreach (var task in _tasks)
            {
                if (task.QueueType.ToUpper() == "MSMQ")
                    current = _msmq.Get(task.Name);
                else
                    current = _serviceBroker.Get(task.Name);

                if (current != null)
                {
                    _currentMessage = current;
                    return true;
                }
            }

            WaitHandle.WaitAny(new[] { _cancelToken.WaitHandle }, 500);
        }

        return false;
    }

    public DataAccessEnumerator(IDataAccess<TransportMessage> serviceBroker, IDataAccess<TransportMessage> msmq, IList<JobTask> tasks, CancellationToken cancelToken)
    {
        _serviceBroker = serviceBroker;
        _msmq = msmq;
        _tasks = tasks;
        _cancelToken = cancelToken;
    }

    private readonly IDataAccess<TransportMessage> _serviceBroker;
    private readonly IDataAccess<TransportMessage> _msmq;
    private readonly IList<JobTask> _tasks;
    private readonly CancellationToken _cancelToken;
    private TransportMessage _currentMessage;
}
Here is the Parallel.ForEach call where _queueAccess is the IEnumerable that holds the above IEnumerator and RunJob processes a TransportMessage that is returned from that IEnumerator:
var parallelOptions = new ParallelOptions
{
    CancellationToken = _cancelTokenSource.Token,
    MaxDegreeOfParallelism = 8
};

Parallel.ForEach(_queueAccess, parallelOptions, x => RunJob(x));
It sounds to me like Parallel.ForEach isn't really a good match for what you want to do. I suggest you use BlockingCollection<T> to create a producer/consumer queue instead - create a bunch of threads/tasks to service the blocking collection, and add work items to it as and when they arrive.
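A minimal sketch of that shape, assuming RunJob, _queueAccess and a cancelToken from your question; the bounded capacity also addresses your bonus question, since the producer blocks once the buffer is full:
var workQueue = new BlockingCollection<TransportMessage>(boundedCapacity: 100);

// Consumers: a fixed pool of tasks draining the queue.
Task[] consumers = Enumerable.Range(0, 8)
    .Select(_ => Task.Run(() =>
    {
        foreach (TransportMessage message in workQueue.GetConsumingEnumerable(cancelToken))
            RunJob(message);
    }))
    .ToArray();

// Producer: the polling enumerator feeds the queue as messages arrive.
foreach (TransportMessage message in _queueAccess)
    workQueue.Add(message, cancelToken);

workQueue.CompleteAdding();
Task.WaitAll(consumers);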
Your problem might be to do with the Partitioner being used.
In your case, the TPL will choose the Chunk Partitioner, which will take multiple items from the enumerable before passing them on to be processed. The number of items taken in each chunk will increase with time.
When your MoveNext method blocks, the TPL is left waiting for the next item and won't process the items that it has already taken.
You have a couple of options to fix this:
1) Write a Partitioner that always returns individual items. Not as tricky as it sounds (a built-in alternative is sketched at the end of this answer).
2) Use the TPL instead of Parallel.ForEach:
foreach ( var item in _queueAccess )
{
    var capturedItem = item;
    Task.Factory.StartNew( () => RunJob( capturedItem ) );
}
The second solution changes the behaviour a bit. The foreach loop will complete when all the Tasks have been created, not when they have finished. If this is a problem for you, you can add a CountdownEvent:
var ce = new CountdownEvent( 1 );

foreach ( var item in _queueAccess )
{
    ce.AddCount();
    var capturedItem = item;
    Task.Factory.StartNew( () => { RunJob( capturedItem ); ce.Signal(); } );
}

ce.Signal();
ce.Wait();
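Regarding option 1: .NET 4.5 (which may postdate this answer) added a built-in single-item partitioner, so you may not need to write one at all. A sketch reusing parallelOptions from the question:
// NoBuffering makes the partitioner hand items to worker threads one at a
// time, so a blocking MoveNext no longer strands already-taken items.
var singleItemPartitioner = Partitioner.Create(
    _queueAccess, EnumerablePartitionerOptions.NoBuffering);

Parallel.ForEach(singleItemPartitioner, parallelOptions, x => RunJob(x));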
I haven't gone to the effort to make sure of this, but the impression I'd received from discussions of Parallel.ForEach was that it would pull all the items out of the enumerable, then make appropriate decisions about how to divide them across threads. Based on your problem, that seems correct.
So, to keep most of your current code, you should probably pull the blocking code out of the iterator and place it into a loop around the call to Parallel.ForEach (which uses the iterator).
I have multiple producers and a single consumer. However, if there is something in the queue that is not yet consumed, a producer should not queue it again. (I want a unique, no-duplicates blocking collection that uses the default concurrent queue.)
if (!myBlockingColl.Contains(item))
    myBlockingColl.Add(item);
However, the blocking collection does not have a Contains method, nor does it provide any kind of TryPeek()-like method. How can I access the underlying concurrent queue so I can do something like
if (!myBlockingColl.myConcurQ.TryPeek(item))
    myBlockingColl.Add(item);
In a tail spin?
This is an interesting question. This is the first time I have seen someone ask for a blocking queue that ignores duplicates. Oddly enough, I could find nothing like what you want that already exists in the BCL. I say this is odd because BlockingCollection can accept an IProducerConsumerCollection as the underlying collection, and that interface has a TryAdd method that is advertised as being able to fail when duplicates are detected. The problem is that I see no concrete implementation of IProducerConsumerCollection that prevents duplicates. At least we can write our own.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
    // TODO: You will need to fully implement IProducerConsumerCollection.

    private Queue<T> queue = new Queue<T>();

    public bool TryAdd(T item)
    {
        lock (queue)
        {
            if (!queue.Contains(item))
            {
                queue.Enqueue(item);
                return true;
            }
            return false;
        }
    }

    public bool TryTake(out T item)
    {
        lock (queue)
        {
            if (queue.Count > 0)
            {
                item = queue.Dequeue();
                return true;
            }
            // default(T) instead of null, so this compiles for value types too.
            item = default(T);
            return false;
        }
    }
}
Now that we have our IProducerConsumerCollection that does not accept duplicates we can use it like this:
public class Example
{
    private BlockingCollection<object> queue = new BlockingCollection<object>(new NoDuplicatesConcurrentQueue<object>());

    public Example()
    {
        new Thread(Consume).Start();
    }

    public void Produce(object item)
    {
        bool unique = queue.TryAdd(item);
    }

    private void Consume()
    {
        while (true)
        {
            object item = queue.Take();
        }
    }
}
You may not like my implementation of NoDuplicatesConcurrentQueue. You are certainly free to implement your own using ConcurrentQueue or whatever if you think you need the low-lock performance that the TPL collections provide.
Update:
I was able to test the code this morning. There is some good news and bad news. The good news is that this will technically work. The bad news is that you probably will not want to do this because BlockingCollection.TryAdd intercepts the return value from the underlying IProducerConsumerCollection.TryAdd method and throws an exception when false is detected. Yep, that is right. It does not return false like you would expect and instead generates an exception. I have to be honest, this is both surprising and ridiculous. The whole point of the TryXXX methods is that they should not throw exceptions. I am deeply disappointed.
In addition to the caveat Brian Gideon mentioned in his update, his solution suffers from these performance issues:
O(n) operations on the queue (queue.Contains(item)) have a severe impact on performance as the queue grows
locks limit concurrency (which he does mention)
The following code improves on Brian's solution by
using a hash set to do O(1) lookups
combining 2 data structures from the System.Collections.Concurrent namespace
N.B. As there is no ConcurrentHashSet, I'm using a ConcurrentDictionary, ignoring the values.
In this rare case it is luckily possible to simply compose a more complex concurrent data structure out of multiple simpler ones, without adding locks. The order of operations on the 2 concurrent data structures is important here.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
    private readonly ConcurrentDictionary<T, bool> existingElements = new ConcurrentDictionary<T, bool>();
    private readonly ConcurrentQueue<T> queue = new ConcurrentQueue<T>();

    public bool TryAdd(T item)
    {
        if (existingElements.TryAdd(item, false))
        {
            queue.Enqueue(item);
            return true;
        }
        return false;
    }

    public bool TryTake(out T item)
    {
        if (queue.TryDequeue(out item))
        {
            bool _;
            existingElements.TryRemove(item, out _);
            return true;
        }
        return false;
    }

    ...
}
N.B. Another way of looking at this problem: you want a set that preserves the insertion order.
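Keeping in mind Brian's caveat above (BlockingCollection.TryAdd turns the underlying false into an InvalidOperationException), a hedged usage sketch of either no-duplicates queue would have to catch that exception:
var collection = new BlockingCollection<string>(new NoDuplicatesConcurrentQueue<string>());

bool added;
try
{
    added = collection.TryAdd("item-1");
}
catch (InvalidOperationException)
{
    // BlockingCollection surfaces the underlying TryAdd's false as an
    // exception, so a duplicate lands here rather than returning false.
    added = false;
}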
I would suggest wrapping your operations in a lock so that you don't read and write the collection in a way that corrupts it, making them atomic. For example, with any IEnumerable:
object bcLocker = new object();

// ...

lock (bcLocker)
{
    bool foundTheItem = false;

    foreach (someClass nextItem in myBlockingColl)
    {
        if (nextItem.Equals(item))
        {
            foundTheItem = true;
            break;
        }
    }

    if (foundTheItem == false)
    {
        // Add here
    }
}
How to access the underlying default concurrent queue of a blocking collection?
The BlockingCollection<T> is backed by a ConcurrentQueue<T> by default. In other words if you don't specify explicitly its backing storage, it will create a ConcurrentQueue<T> behind the scenes. Since you want to have direct access to the underlying storage, you can create manually a ConcurrentQueue<T> and pass it to the constructor of the BlockingCollection<T>:
ConcurrentQueue<Item> queue = new();
BlockingCollection<Item> collection = new(queue);
Unfortunately the ConcurrentQueue<T> collection doesn't have a TryPeek method with an input parameter, so what you intend to do is not possible:
if (!queue.TryPeek(item)) // Compile error, missing out keyword
collection.Add(item);
Also be aware that the queue is now owned by the collection. If you attempt to mutate it directly (by issuing Enqueue or TryDequeue commands), the collection will throw exceptions.
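What the queue reference does safely allow is read-only inspection, since enumerating a ConcurrentQueue<T> works on a snapshot. A hedged sketch:
// Peek at the head element (read-only, does not disturb the collection).
Item head;
if (queue.TryPeek(out head))
    Console.WriteLine($"Next item to be consumed: {head}");

// Searching for a specific item means enumerating a snapshot (O(n) via
// LINQ's Contains), and the check-then-add below is still racy if there
// are multiple producers.
if (!queue.Contains(item))
    collection.Add(item);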