I have a set of IDs on which I do some operations:
Queue<string> queue = new Queue<string>();
queue.Enqueue("1");
queue.Enqueue("2");
...
queue.Enqueue("10");
foreach (string id in queue)
{
DoSomeWork(id);
}
static void DoSomeWork(string id)
{
// Do some work and oooo there are new ids which should also be processed :)
foreach(string newID in newIDs)
{
if(!queue.Contains(newID)) queue.Enqueue(newID);
}
}
Is it possible to add new items to the queue in DoSomeWork() so that they are also processed by the main foreach loop?
What you're doing is iterating over a collection while it changes. This is bad practice, and many collections, Queue<T> among them, throw an InvalidOperationException when they are modified during enumeration.
Use the following approach, which does use new items as well:
while (queue.Count > 0)
{
DoSomeWork(queue.Dequeue());
}
Use Dequeue instead of a foreach loop. Most enumerators become invalid whenever the underlying container is changed, and Enqueue/Dequeue are the natural operations on a Queue. Otherwise you could use List<T> or HashSet<T>.
while(queue.Count>0)
{
var value=queue.Dequeue();
...
}
To check if an item has already been processed a HashSet<T> is a fast solution. I typically use a combination of HashSet and Queue in those cases. The advantage of this solution is that it's O(n) because checking and adding to a HashSet is O(1). Your original code was O(n^2) since Contains on a Queue is O(n).
Queue<string> queue=new Queue<string>();
HashSet<string> allItems=new HashSet<string>();
void Add(string item)
{
if(allItems.Add(item))
queue.Enqueue(item);
}
void DoWork()
{
while(queue.Count>0)
{
var value=queue.Dequeue();
...
}
}
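The Queue plus HashSet combination above can be put together into a small, self-contained sketch. The names Worklist, ProcessAll and discover are made up for illustration; discover stands in for whatever DoSomeWork does to find new IDs.

```csharp
using System;
using System.Collections.Generic;

public static class Worklist
{
    // Processes every seed, and every ID discovered along the way, exactly once.
    // 'discover' is a hypothetical callback: it returns the new IDs found
    // while handling one ID (the role DoSomeWork plays in the question).
    public static List<string> ProcessAll(
        IEnumerable<string> seeds,
        Func<string, IEnumerable<string>> discover)
    {
        var queue = new Queue<string>();
        var seen = new HashSet<string>();
        var processed = new List<string>();

        foreach (var seed in seeds)
        {
            if (seen.Add(seed)) queue.Enqueue(seed);
        }

        // Dequeue instead of foreach, so the queue may grow while we work.
        while (queue.Count > 0)
        {
            var id = queue.Dequeue();
            processed.Add(id);

            foreach (var next in discover(id))
            {
                // HashSet.Add returns false for an item it already contains,
                // so each ID is enqueued at most once.
                if (seen.Add(next)) queue.Enqueue(next);
            }
        }

        return processed;
    }
}
```

Because seen is never cleared, an ID that was already processed is not queued again even if it is rediscovered later.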
It is common for loop iterations to add more work; just pass the queue into the method as an argument and adding to it should work fine.
The problem is that you should be using Dequeue:
while(queue.Count>0) {
DoSomeWork(queue.Dequeue());
}
What I'm trying to do is add an existing list from another class/thread (which is constantly growing) to a new list that contains values to be validated. But I'm not sure how to do this without processing the same values over and over. I would just like to process the most recently added values. See the code below:
public static void ParsePhotos()
{
int tmprow = 0;
string checkthis = "";
List<String> PhotoCheck = new List<String>();
while (Core.Hashtag.PhotoUrls.Count > PhotoCheck.Count)
{
foreach (string photourl in Core.Hashtag.PhotoUrls)
{
PhotoCheck.Add(photourl);
}
checkthis = PhotoCheck[tmprow];
//validate checkthis here
//add checkthis to new list if valid here
tmprow++;
Thread.Sleep(10000);
}
}
Core.Hashtag.PhotoUrls (a HashSet<String>) is being updated every few seconds in another thread.
There is no practical way to do it like that, since HashSet is an unordered collection. Regular collections are also not thread-safe and should not be shared between threads without locking.
A typical way to do it would be to use a concurrent queue, where the validation removes items from the queue and adds them to another collection. The thread that produces the items should be modified to add items to the concurrent queue instead of, or in addition to, the HashSet.
public static Task ValidateOnWorkerThread<T>(
BlockingCollection<T> queue,
Func<T, bool> validateMethod,
ConcurrentBag<T> validatedItems,
CancellationToken cancel)
{
return Task.Run(ProcessInternal, cancel);
void ProcessInternal()
{
foreach (var item in queue.GetConsumingEnumerable(cancel))
{
if (validateMethod(item))
{
validatedItems.Add(item);
}
}
}
}
This also has the advantage of processing items in real time instead of needing to sleep all the time.
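For the producer side, here is a minimal sketch of how items might be fed in and how the consumer loop ends. PhotoValidationSketch, Run and isValid are hypothetical names, and the producer runs on the calling thread only to keep the example short; in the question it would be the thread that currently fills PhotoUrls.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class PhotoValidationSketch
{
    // Pushes 'urls' through a BlockingCollection and returns those accepted
    // by 'isValid' (a hypothetical validation callback), sorted for stable output.
    public static string[] Run(string[] urls, Func<string, bool> isValid)
    {
        using (var pending = new BlockingCollection<string>())
        {
            var valid = new ConcurrentBag<string>();

            // Consumer: blocks while the queue is empty and leaves the loop
            // once CompleteAdding() has been called and the queue is drained.
            Task consumer = Task.Run(() =>
            {
                foreach (var url in pending.GetConsumingEnumerable())
                {
                    if (isValid(url)) valid.Add(url);
                }
            });

            // Producer: each new URL is added once, so nothing is re-validated.
            foreach (var url in urls) pending.Add(url);
            pending.CompleteAdding();

            consumer.Wait();
            return valid.OrderBy(u => u, StringComparer.Ordinal).ToArray();
        }
    }
}
```

ConcurrentBag does not preserve insertion order, which is why the result is sorted before it is returned.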
I'm trying to write a windows service whose producers and consumers work like this:
Producer: At scheduled times, get all unprocessed items (Processed = 0 on their row in the db) and add to the work queue each one that isn't already in it
Consumer: Constantly pull items from the work queue and process them and update the db (Processed = 1 on their row)
I've tried to look for examples of this exact data flow in C#.NET so I can leverage the existing libraries. But so far I haven't found exactly that.
I see on https://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html the example
private static void Produce(BufferBlock<int> queue, IEnumerable<int> values)
{
foreach (var value in values)
{
queue.Post(value);
}
queue.Complete();
}
private static async Task<IEnumerable<int>> Consume(BufferBlock<int> queue)
{
var ret = new List<int>();
while (await queue.OutputAvailableAsync())
{
ret.Add(await queue.ReceiveAsync());
}
return ret;
}
Here's the "idea" of what I'm trying to modify that to do:
while(true)
{
if(await WorkQueue.OutputAvailableAsync())
{
ProcessItem(await WorkQueue.ReceiveAsync());
}
else
{
await Task.Delay(5000);
}
}
...would be how the Consumer works, and
MyTimer.Elapsed += Produce;
static async void Produce(object source, ElapsedEventArgs e)
{
IEnumerable<Item> items = GetUnprocessedItemsFromDb();
foreach(var item in items)
if(!WorkQueue.Contains(w => w.Id == item.Id))
WorkQueue.Enqueue(item);
}
...would be how the Producer works.
That's a rough idea of what I'm trying to do. Can any of you show me the right way to do it, or link me to the proper documentation for solving this type of problem?
Creating a custom BufferBlock<T> that rejects duplicate messages is anything but trivial. The TPL Dataflow components do not expose their internal state for the purpose of customization. You can see here an attempt to circumvent this limitation, by creating a custom ActionBlock<T> with an exposed IEnumerable<T> InputQueue property. The code is lengthy and obscure, and creating a custom BufferUniqueBlock<T> might need double the amount of code, because this class implements the ISourceBlock<T> interface too.
My suggestion is to find some other way to avoid processing an Item twice, instead of preventing duplicates from entering the queue. Maybe you could give the Consumer the responsibility of querying the database and checking that the currently received item is still unprocessed before actually processing it.
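A rough sketch of that idea, assuming a single consumer. The dictionary below is only a stand-in for the database; RecheckingConsumer, Consume and Handled are hypothetical names, and a real service would query and update the Processed column instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class RecheckingConsumer
{
    // Stand-in for the database: item id -> Processed flag (assumption for
    // illustration only).
    private readonly ConcurrentDictionary<int, bool> processedFlags =
        new ConcurrentDictionary<int, bool>();

    // Ids that were actually worked on, in order.
    public List<int> Handled { get; } = new List<int>();

    // Called by the single consumer for each id received from the work queue.
    public void Consume(int id)
    {
        // A duplicate that slipped into the queue is skipped here.
        if (processedFlags.TryGetValue(id, out bool done) && done) return;

        Handled.Add(id);            // the real work would happen here
        processedFlags[id] = true;  // equivalent of setting Processed = 1
    }
}
```

With more than one consumer the check and the update would have to be made atomic, for example with a conditional UPDATE in the database, since two consumers could otherwise both see the item as unprocessed.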
I have a ConcurrentQueue with a list of URLs whose source I need to fetch. When I use Parallel.ForEach with the ConcurrentQueue object as the input parameter, the Pop method doesn't return anything (it should return a string).
I'm using Parallel with MaxDegreeOfParallelism set to four. I really need to limit the number of concurrent threads. Is using a queue with Parallelism redundant?
Thanks in advance.
// On the main class
var items = await engine.FetchPageWithNumberItems(result);
// Enqueue List of items
itemQueue.EnqueueList(items);
var crawl = Task.Run(() => { engine.CrawlItems(itemQueue); });
// On the Engine class
public void CrawlItems(ItemQueue itemQueue)
{
Parallel.ForEach(
itemQueue,
new ParallelOptions {MaxDegreeOfParallelism = 4},
item =>
{
var worker = new Worker();
// Pop doesn't return anything
worker.Url = itemQueue.Pop();
/* Some work */
});
}
// Item Queue
class ItemQueue : ConcurrentQueue<string>
{
private ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
public string Pop()
{
string value = String.Empty;
if(this.queue.Count == 0)
throw new Exception();
this.queue.TryDequeue(out value);
return value;
}
public void Push(string item)
{
this.queue.Enqueue(item);
}
public void EnqueueList(List<string> list)
{
list.ForEach(this.queue.Enqueue);
}
}
You don't need ConcurrentQueue<T> if all you're going to do is to first add items to it from a single thread and then iterate it in Parallel.ForEach(). A normal List<T> would be enough for that.
Also, your implementation of ItemQueue is very suspicious:
It inherits from ConcurrentQueue<string> and also contains another ConcurrentQueue<string>. That doesn't make much sense, is confusing and inefficient.
The methods on ConcurrentQueue<T> were designed very carefully to be thread-safe. Your Pop() isn't thread-safe. What could happen is that you check Count, notice it's 1, then call TryDequeue() and not get any value (i.e. value will be null), because another thread removed the item from the queue in the time between the two calls.
The issue is with CrawlItems method, since you shouldn't call Pop in the action provided to the ForEach method. The reason is that the action is being called on each popped item, hence the item was already popped. This is the reason that the action has an 'item' argument.
I assume that you're getting null because all of the items have already been popped by the ForEach method on the other threads.
Therefore, your code should look like this:
public void CrawlItems(ItemQueue itemQueue)
{
Parallel.ForEach(
itemQueue,
new ParallelOptions {MaxDegreeOfParallelism = 4},
item =>
{
var worker = new Worker();
// Use the item handed to the delegate instead of calling Pop().
worker.Url = item;
/* Some work */
});
}
I have multiple producers and a single consumer. However if there is something in the queue that is not yet consumed a producer should not queue it again. (unique no duplicates blocking collection that uses the default concurrent queue)
if (!myBlockingColl.Contains(item))
myBlockingColl.Add(item);
However the blocking collection does not have a contains method nor does it provide any kind of TryPeek() like method. How can I access the underlying concurrent queue so I can do something like
if (!myBlockingColl.myConcurQ.trypeek(item))
myBlockingColl.Add(item);
In a tail spin?
This is an interesting question. This is the first time I have seen someone ask for a blocking queue that ignores duplicates. Oddly enough, I could find nothing like what you want already in the BCL. I say this is odd because BlockingCollection can accept an IProducerConsumerCollection as the underlying collection, and that interface has a TryAdd method that is advertised as being able to fail when duplicates are detected. The problem is that I see no concrete implementation of IProducerConsumerCollection that prevents duplicates. At least we can write our own.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
// TODO: You will need to fully implement IProducerConsumerCollection.
private Queue<T> queue = new Queue<T>();
public bool TryAdd(T item)
{
lock (queue)
{
if (!queue.Contains(item))
{
queue.Enqueue(item);
return true;
}
return false;
}
}
public bool TryTake(out T item)
{
lock (queue)
{
if (queue.Count > 0)
{
item = queue.Dequeue();
return true;
}
// Nothing to take. T may be a value type, so use default(T) rather than null.
item = default(T);
return false;
}
}
}
Now that we have our IProducerConsumerCollection that does not accept duplicates we can use it like this:
public class Example
{
private BlockingCollection<object> queue = new BlockingCollection<object>(new NoDuplicatesConcurrentQueue<object>());
public Example()
{
new Thread(Consume).Start();
}
public void Produce(object item)
{
bool unique = queue.TryAdd(item);
}
private void Consume()
{
while (true)
{
object item = queue.Take();
}
}
}
You may not like my implementation of NoDuplicatesConcurrentQueue. You are certainly free to implement your own using ConcurrentQueue or whatever if you think you need the low-lock performance that the TPL collections provide.
Update:
I was able to test the code this morning. There is some good news and bad news. The good news is that this will technically work. The bad news is that you probably will not want to do this, because BlockingCollection.TryAdd intercepts the return value from the underlying IProducerConsumerCollection.TryAdd method and throws an InvalidOperationException when false is detected. Yep, that is right. It does not return false like you would expect and instead generates an exception. I have to be honest, this is both surprising and ridiculous. The whole point of the TryXXX methods is that they should not throw exceptions. I am deeply disappointed.
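One way to live with that behaviour is to catch the exception in a small wrapper. The following is only a sketch: UniqueQueue and TryAddUnique are illustrative names, and the collection is a simple lock-based one with a HashSet for the duplicate check. The exception filter assumes that InvalidOperationException is raised either because adding was completed or because the underlying collection refused the item.

```csharp
using System;
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Minimal lock-based collection that rejects duplicates (illustrative name).
public class UniqueQueue<T> : IProducerConsumerCollection<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly HashSet<T> members = new HashSet<T>();
    private readonly object gate = new object();

    public bool TryAdd(T item)
    {
        lock (gate)
        {
            if (!members.Add(item)) return false; // already queued
            queue.Enqueue(item);
            return true;
        }
    }

    public bool TryTake(out T item)
    {
        lock (gate)
        {
            if (queue.Count == 0) { item = default(T); return false; }
            item = queue.Dequeue();
            members.Remove(item); // the item may be queued again from now on
            return true;
        }
    }

    public int Count { get { lock (gate) return queue.Count; } }
    public T[] ToArray() { lock (gate) return queue.ToArray(); }
    public void CopyTo(T[] array, int index) { lock (gate) queue.CopyTo(array, index); }
    void ICollection.CopyTo(Array array, int index) { lock (gate) ((ICollection)queue).CopyTo(array, index); }
    bool ICollection.IsSynchronized { get { return false; } }
    object ICollection.SyncRoot { get { throw new NotSupportedException(); } }
    // Enumerate a snapshot so callers never observe a concurrent modification.
    public IEnumerator<T> GetEnumerator() { return ((IEnumerable<T>)ToArray()).GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class UniqueAdd
{
    // Returns false instead of throwing when the item is already queued.
    public static bool TryAddUnique<T>(this BlockingCollection<T> collection, T item)
    {
        try
        {
            return collection.TryAdd(item);
        }
        catch (InvalidOperationException) when (!collection.IsAddingCompleted)
        {
            // The underlying TryAdd returned false, i.e. a duplicate.
            return false;
        }
    }
}
```

A producer would then call queue.TryAddUnique(item) on a BlockingCollection<T> constructed over a UniqueQueue<T>. Using an exception for an expected outcome is not cheap, so this fits best when duplicates are rare.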
In addition to the caveat Brian Gideon mentioned after Update, his solution suffers from these performance issues:
O(n) operations on the queue (queue.Contains(item)) have a severe impact on performance as the queue grows
locks limit concurrency (which he does mention)
The following code improves on Brian's solution by
using a hash set to do O(1) lookups
combining 2 data structures from the System.Collections.Concurrent namespace
N.B. As there is no ConcurrentHashSet, I'm using a ConcurrentDictionary, ignoring the values.
In this rare case it is luckily possible to simply compose a more complex concurrent data structure out of multiple simpler ones, without adding locks. The order of operations on the 2 concurrent data structures is important here.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
private readonly ConcurrentDictionary<T, bool> existingElements = new ConcurrentDictionary<T, bool>();
private readonly ConcurrentQueue<T> queue = new ConcurrentQueue<T>();
public bool TryAdd(T item)
{
if (existingElements.TryAdd(item, false))
{
queue.Enqueue(item);
return true;
}
return false;
}
public bool TryTake(out T item)
{
if (queue.TryDequeue(out item))
{
bool _;
existingElements.TryRemove(item, out _);
return true;
}
return false;
}
...
}
N.B. Another way of looking at this problem: You want a set that preserves the insertion order.
I would suggest implementing your operations with lock so that you don't read and write the item in a way that corrupts it, making them atomic. For example, with any IEnumerable:
object bcLocker = new object();
// ...
lock (bcLocker)
{
bool foundTheItem = false;
foreach (someClass nextItem in myBlockingColl)
{
if (nextItem.Equals(item))
{
foundTheItem = true;
break;
}
}
if (foundTheItem == false)
{
// Add here
}
}
How to access the underlying default concurrent queue of a blocking collection?
The BlockingCollection<T> is backed by a ConcurrentQueue<T> by default. In other words if you don't specify explicitly its backing storage, it will create a ConcurrentQueue<T> behind the scenes. Since you want to have direct access to the underlying storage, you can create manually a ConcurrentQueue<T> and pass it to the constructor of the BlockingCollection<T>:
ConcurrentQueue<Item> queue = new();
BlockingCollection<Item> collection = new(queue);
Unfortunately the ConcurrentQueue<T> collection doesn't have a TryPeek method with an input parameter, so what you intend to do is not possible:
if (!queue.TryPeek(item)) // Compile error, missing out keyword
collection.Add(item);
Also be aware that the queue is now owned by the collection. If you attempt to mutate it directly (by issuing Enqueue or TryDequeue commands), the collection will throw exceptions.
Hi, everyone.
I'm using BlockingCollection in the traditional producer-consumer scenario. To process items in the collection one by one, I have to write this code:
while (...)
{
var item = collection.Take(cancellationTokenSource.Token);
ProcessItem(item);
}
But how to process a batch of N items (wait until collection has less than N items)?
My solution is using some temporary buffer:
var buffer = new List<MyType>(N);
while (...)
{
var item = collection.Take(cancellationTokenSource.Token);
buffer.Add(item);
if (buffer.Count == N)
{
// Iterate the buffer; a distinct name avoids clashing with the outer 'item'.
foreach (var buffered in buffer)
{
ProcessItem(buffered);
}
buffer.Clear();
}
}
But it seems to me very ugly... Is there any better approach?
[UPDATE]:
Here's extension method's prototype, which makes the solution more readable. Maybe, someone will find it useful:
public static class BlockingCollectionExtensions
{
public static IEnumerable<T> TakeBuffer<T>(this BlockingCollection<T> collection,
CancellationToken cancellationToken, Int32 bufferSize)
{
var buffer = new List<T>(bufferSize);
while (buffer.Count < bufferSize)
{
try
{
buffer.Add(collection.Take(cancellationToken));
}
catch (OperationCanceledException)
{
// we need to handle the rest of buffer,
// even if the task has been cancelled.
break;
}
}
return buffer;
}
}
And usage:
foreach (var item in collection.TakeBuffer(cancellationTokenSource.Token, 5))
{
// TODO: process items here...
}
Of course, this is not a complete solution: for example, I would add timeout support, so that if there are not enough items but the time has elapsed, we stop waiting and process the items already added to the buffer.
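The timeout idea could look roughly like the sketch below, which swaps Take for the TryTake overload that accepts a timeout. TakeBatch, batchSize and maxWait are names chosen for illustration, and maxWait is assumed to be small enough to fit in an int number of milliseconds.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class BlockingCollectionBatchExtensions
{
    // Collects up to 'batchSize' items, but stops waiting once 'maxWait' has
    // elapsed in total, and returns whatever was gathered so far.
    public static List<T> TakeBatch<T>(this BlockingCollection<T> collection,
        int batchSize, TimeSpan maxWait, CancellationToken cancellationToken)
    {
        var batch = new List<T>(batchSize);
        var clock = Stopwatch.StartNew();
        try
        {
            while (batch.Count < batchSize)
            {
                TimeSpan remaining = maxWait - clock.Elapsed;
                if (remaining < TimeSpan.Zero) remaining = TimeSpan.Zero;

                // TryTake returns false when the wait times out, or when the
                // collection is completed and empty.
                if (!collection.TryTake(out T item, (int)remaining.TotalMilliseconds, cancellationToken))
                    break;

                batch.Add(item);
            }
        }
        catch (OperationCanceledException)
        {
            // Hand back the items collected before cancellation.
        }
        return batch;
    }
}
```

A caller would loop on collection.TakeBatch(5, TimeSpan.FromSeconds(1), token) and treat an empty list as "nothing arrived in time".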
I don't find that solution all that ugly. The batch processing is an orthogonal requirement to what the blocking collection does and should be treated as such. I would encapsulate the batch processing behaviour in a BatchProcessor class with a clean interface but other than that I don't really see a problem with that approach.
You may find the lock-free implementation of a queue together with a blocking collection to be a premature optimization. You might be able to write cleaner code if you take a step back and use Queue with Monitor-based locks.
First of all I'm not sure if your logic is correct. You say you want to wait until collection has less than N items - isn't it the other way around? You want the collection to have N or more items, in order to process N items. Or perhaps I'm misunderstanding.
Then I also suggest you process items one by one if there are less than N items, or you may find that your application seems to hang at N-1 items. Of course if this is a steady stream of data, processing only when buffer.Count >= N could be good enough.
I'd suggest going for a queue and Monitor like GregC says.
Something like this:
public object Dequeue() {
// Monitor.Wait must be called while holding the lock on _queue.
lock (_queue) {
while (_queue.Count < N) {
Monitor.Wait(_queue);
}
return _queue.Dequeue();
}
}
public void Enqueue( object q )
{
lock (_queue)
{
_queue.Enqueue(q);
if (_queue.Count == N)
{
// wake up any blocked dequeue call(s)
Monitor.PulseAll(_queue);
}
}
}