Iterate through a growing list while remembering position - C#

What I'm trying to do is add values from an existing collection in another class/thread (which is constantly growing) to a new list of values to be validated. But I'm not sure how to do this without processing the same values over and over; I would like to process only the newly added values. See the code below:
public static void ParsePhotos()
{
    int tmprow = 0;
    string checkthis = "";
    List<String> PhotoCheck = new List<String>();
    while (Core.Hashtag.PhotoUrls.Count > PhotoCheck.Count)
    {
        foreach (string photourl in Core.Hashtag.PhotoUrls)
        {
            PhotoCheck.Add(photourl);
        }
        checkthis = PhotoCheck[tmprow];
        //validate checkthis here
        //add checkthis to new list if valid here
        tmprow++;
        Thread.Sleep(10000);
    }
}
Core.Hashtag.PhotoUrls is a HashSet<String> that is being updated every few seconds in another thread.

There is no practical way to do it that way, since HashSet is an unordered collection. Regular collections are also not thread-safe and should not be used without locking when multiple threads are involved.
A typical approach would be to use a concurrent queue, where the validation thread removes items from the queue and adds them to another collection. The thread that produces the items should be modified to add items to the concurrent queue instead of, or in addition to, the HashSet.
public static Task ValidateOnWorkerThread<T>(
    BlockingCollection<T> queue,
    Func<T, bool> validateMethod,
    ConcurrentBag<T> validatedItems,
    CancellationToken cancel)
{
    return Task.Run(ProcessInternal, cancel);

    void ProcessInternal()
    {
        // Blocks until items arrive; completes when CompleteAdding() is called.
        foreach (var item in queue.GetConsumingEnumerable(cancel))
        {
            if (validateMethod(item))
            {
                validatedItems.Add(item);
            }
        }
    }
}
This also has the advantage of processing items in real time instead of sleeping in a polling loop.
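For completeness, here is a rough sketch of how the producer side might be wired up (the validation lambda and someNewPhotoUrl are placeholders, not code from the question):
var photoQueue = new BlockingCollection<string>();
var validated = new ConcurrentBag<string>();
var cts = new CancellationTokenSource();

// Start the consumer defined above.
Task consumer = ValidateOnWorkerThread(photoQueue, url => url.StartsWith("http"), validated, cts.Token);

// The scraping thread adds each new URL exactly once,
// instead of (or in addition to) putting it in the HashSet.
photoQueue.Add(someNewPhotoUrl);

// When no more URLs will ever arrive, let GetConsumingEnumerable() complete.
photoQueue.CompleteAdding();
consumer.Wait();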

Related

ConcurrentQueue and Parallel.ForEach

I have a ConcurrentQueue with a list of URLs that I need to get the source of. When using Parallel.ForEach with the ConcurrentQueue object as the input parameter, the Pop method doesn't return anything (it should return a string).
I'm using Parallel with MaxDegreeOfParallelism set to four. I really need to limit the number of concurrent threads. Is using a queue with parallelism redundant?
Thanks in advance.
// On the main class
var items = await engine.FetchPageWithNumberItems(result);
// Enqueue List of items
itemQueue.EnqueueList(items);
var crawl = Task.Run(() => { engine.CrawlItems(itemQueue); });
// On the Engine class
public void CrawlItems(ItemQueue itemQueue)
{
    Parallel.ForEach(
        itemQueue,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        item =>
        {
            var worker = new Worker();
            // Pop doesn't return anything
            worker.Url = itemQueue.Pop();
            /* Some work */
        });
}
// Item Queue
class ItemQueue : ConcurrentQueue<string>
{
    private ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

    public string Pop()
    {
        string value = String.Empty;
        if (this.queue.Count == 0)
            throw new Exception();
        this.queue.TryDequeue(out value);
        return value;
    }

    public void Push(string item)
    {
        this.queue.Enqueue(item);
    }

    public void EnqueueList(List<string> list)
    {
        list.ForEach(this.queue.Enqueue);
    }
}
You don't need ConcurrentQueue<T> if all you're going to do is to first add items to it from a single thread and then iterate it in Parallel.ForEach(). A normal List<T> would be enough for that.
Also, your implementation of ItemQueue is very suspicious:
It inherits from ConcurrentQueue<string> and also contains another ConcurrentQueue<string>. That doesn't make much sense, is confusing and inefficient.
The methods on ConcurrentQueue<T> were designed very carefully to be thread-safe. Your Pop() isn't thread-safe: you could check Count, see that it's 1, then call TryDequeue() and get no value back (value will be null), because another thread removed the item from the queue between the two calls.
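In other words, the safe pattern is to rely solely on TryDequeue's return value, never on a prior Count check. A minimal sketch of what a thread-safe Pop replacement would look like:
// Thread-safe: TryDequeue checks for an item and removes it in one atomic call.
public bool TryPop(out string value)
{
    return this.queue.TryDequeue(out value);
}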
The issue is with the CrawlItems method: you shouldn't call Pop in the action provided to the ForEach method. The reason is that the action is called on each item as it is consumed from the queue, so the item has already been popped. That is why the action has an 'item' argument.
I assume that you're getting nothing back because all of the items were already popped by the other threads inside the ForEach method.
Therefore, your code should look like this:
public void CrawlItems(ItemQueue itemQueue)
{
    Parallel.ForEach(
        itemQueue,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        item =>
        {
            var worker = new Worker();
            worker.Url = item;
            /* Some work */
        });
}

Multithreading with method containing 2 parameters to return dictionary

I wrote a method that goes through a list of files, extracts values from each file, stores them in a dictionary, and returns the dictionary. This method goes through a large number of files and I receive a ContextSwitchDeadlock error because of it. I have looked into this error and it seems I need to use a thread to fix it. I am brand new to threads and would like some help with threading.
I create a new thread and use a delegate to pass the parameters dictionary and fileNames into the method getValuesNEW(). I am wondering how I can return the dictionary. I have attached the method that I would like to call, as well as the code in the main program that creates the new thread. Any suggestions to improve my code will be greatly appreciated!
//dictionary and fileNames are manipulated a bit before use in thread
Dictionary<string, List<double>> dictionary = new Dictionary<string, List<double>>();
List<string> fileNames = new List<string>();
...
Thread thread = new Thread(delegate()
{
    getValuesNEW(dictionary, fileNames);
});
thread.Start();
//This is the method that I am calling
public Dictionary<string, List<double>> getValuesNEW(Dictionary<string, List<double>> dictionary, List<string> fileNames)
{
    foreach (string name in fileNames)
    {
        // Dispose the reader when done with each file.
        using (XmlReader reader = XmlReader.Create(name))
        {
            var collectValues = false;
            string ertNumber = null;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.Name == "ChannelID" && reader.HasAttributes)
                    {
                        if (dictionary.ContainsKey(sep(reader.GetAttribute("EndPointChannelID"))))
                        {
                            //collectValues = sep(reader.GetAttribute("EndPointChannelID")) == ertNumber;
                            collectValues = true;
                            ertNumber = sep(reader.GetAttribute("EndPointChannelID"));
                        }
                        else
                        {
                            collectValues = false;
                        }
                    }
                    else if (collectValues && reader.Name == "Reading" && reader.HasAttributes)
                    {
                        dictionary[ertNumber].Add(Convert.ToDouble(reader.GetAttribute("Value")));
                    }
                }
            }
        }
    }
    return dictionary;
}
Others have explained why the current approach isn't getting you anywhere. If you're using .NET 4, you can use ConcurrentDictionary and Parallel.ForEach:
private List<double> GetValuesFromFile(string fileName)
{
    //TBD
}

private void RetrieveAllFileValues()
{
    IEnumerable<string> files = ...;
    ConcurrentDictionary<string, List<double>> dict = new ConcurrentDictionary<string, List<double>>();
    Parallel.ForEach(files, file =>
    {
        var values = GetValuesFromFile(file);
        dict.TryAdd(file, values);
    });
}
You don't need to return the dictionary: the main thread already has a reference to it and will see the changes your thread makes. All the main thread has to do is wait until the worker thread is done (for example using thread.Join()).
However, doing things this way, you get no benefit from multithreading because nothing is done in parallel. What you can do is have multiple threads, and multiple dictionaries (one per thread). When everyone's done, the main thread can put all those dictionaries together.
The reason you don't want multiple threads accessing the same dictionary is that the Dictionary class is not thread-safe: its behavior is undefined if more than one thread uses it at the same time. However, you could use a ConcurrentDictionary, which is thread-safe. What this means is that access is synchronized internally, so concurrent reads and writes cannot corrupt the dictionary.
Which of the two techniques is faster depends on how often your threads would be accessing the shared dictionary: if they access it rarely then a ConcurrentDictionary will work well. If they access it very often then it might be preferable to use multiple dictionaries and merge in the end. In your case since there is file I/O involved, I suspect that the ConcurrentDictionary approach will work best.
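For illustration, a sketch of the merge-at-the-end alternative (the partials list is hypothetical; each worker thread would have filled its own plain Dictionary):
var merged = new Dictionary<string, List<double>>();
foreach (var partial in partials) // one dictionary per finished worker thread
{
    foreach (var pair in partial)
    {
        List<double> existing;
        if (merged.TryGetValue(pair.Key, out existing))
            existing.AddRange(pair.Value);   // key seen before: append the values
        else
            merged[pair.Key] = pair.Value;   // first time this key appears
    }
}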
So, in short, change getValuesNEW to this:
//This is the method that I am calling
public void getValuesNEW(ConcurrentDictionary<string, List<double>> dictionary, List<string> fileNames)
{
    foreach (string name in fileNames)
    {
        // (code in there is unchanged)
    }
    // no need to return the dictionary
    //return dictionary;
}
If you want to wait for your thread to finish, you can use Thread.Join after Thread.Start. To get the result of your thread, create a class variable or something similar that both your main program and your thread can use. However, I don't see the point of a thread here unless you want to process all the files in parallel.
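For example, a minimal sketch of that approach (the results field is illustrative):
// Field visible to both the main program and the worker thread.
private Dictionary<string, List<double>> results;

Thread thread = new Thread(() => { results = getValuesNEW(dictionary, fileNames); });
thread.Start();
// ... other work on the main thread ...
thread.Join(); // blocks until the worker finishes; 'results' is then safe to read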

Most Efficient way to create list of all existing items in ConcurrentQueue at certain timestamp

I have to maintain information logs. These logs can be written from many threads concurrently, but when I need them I use only one thread to dequeue them, with a break of around 5 seconds between dequeue passes over the collection.
Following is the code I've written to Dequeue it.
if (timeNotReached)
{
    InformationLogQueue.Enqueue(informationLog);
}
else
{
    int currentLogCount = InformationLogQueue.Count;
    var informationLogs = new List<InformationLog>();
    for (int i = 0; i < currentLogCount; i++)
    {
        InformationLog informationLog1;
        InformationLogQueue.TryDequeue(out informationLog1);
        informationLogs.Add(informationLog1);
    }
    WriteToDatabase(informationLogs);
}
After dequeueing I pass the logs to LINQ's insert method, which requires a List<InformationLog> to insert into the database.
Is this the correct way or is there any other efficient way to do this?
You could use the ConcurrentQueue<T> directly in a LINQ statement via an extension method like this:
static IEnumerable<T> DequeueExisting<T>(this ConcurrentQueue<T> queue)
{
    T item;
    while (queue.TryDequeue(out item))
        yield return item;
}
This would save you from having to continuously allocate new List<T> and ConcurrentQueue<T> objects.
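For example, the consuming thread's batch write then becomes (WriteToDatabase is the method from the question; ToList() needs System.Linq):
// Drain everything currently queued and hand the batch to the database writer.
List<InformationLog> batch = InformationLogQueue.DequeueExisting().ToList();
if (batch.Count > 0)
    WriteToDatabase(batch);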
You should probably be using the ConcurrentQueue<T> via a BlockingCollection<T> as described here.
Something like this:
private BlockingCollection<InformationLog> informationLogs =
    new BlockingCollection<InformationLog>(new ConcurrentQueue<InformationLog>());
Then on your consumer thread you can do
foreach (var log in this.informationLogs.GetConsumingEnumerable())
{
    // process consumer logs 1 by 1.
}
Okay, here is an answer for consuming multiple items. On the consuming thread, do this:
InformationLog nextLog;
while (this.informationLogs.TryTake(out nextLog, -1))
{
    var workToDo = new List<InformationLog>();
    workToDo.Add(nextLog);
    while (this.informationLogs.TryTake(out nextLog, 50))
    {
        workToDo.Add(nextLog);
    }
    // process workToDo, then go back to the queue.
}
The first while loop takes items from the queue with an infinite wait time. I'm assuming that once adding is complete on the queue, i.e. CompleteAdding has been called, this call will return false without delay once the queue is empty.
The inner while loop takes items with a 50 millisecond timeout; this could be adjusted for your needs. Once the queue is empty it will return false, and then the batch of work can be processed.
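For completeness, the producer side only needs Add, plus a CompleteAdding call at shutdown so that the outer TryTake(..., -1) can return false (a minimal sketch):
// Any producer thread:
this.informationLogs.Add(informationLog);

// Once, when no more logs will ever be produced (e.g. at shutdown):
this.informationLogs.CompleteAdding();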

How to access the underlying default concurrent queue of a blocking collection?

I have multiple producers and a single consumer. However, if there is something in the queue that has not yet been consumed, a producer should not queue it again (a unique, no-duplicates blocking collection that uses the default concurrent queue).
if (!myBlockingColl.Contains(item))
    myBlockingColl.Add(item)
However, the blocking collection has neither a Contains method nor any kind of TryPeek()-like method. How can I access the underlying concurrent queue so I can do something like
if (!myBlockingColl.myConcurQ.trypeek(item)
    myBlockingColl.Add(item)
In a tail spin?
This is an interesting question. This is the first time I have seen someone ask for a blocking queue that ignores duplicates. Oddly enough, I could find nothing like what you want that already exists in the BCL. I say this is odd because BlockingCollection can accept an IProducerConsumerCollection as the underlying collection, which has the TryAdd method that is advertised as being able to fail when duplicates are detected. The problem is that I see no concrete implementation of IProducerConsumerCollection that prevents duplicates. At least we can write our own.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
    // TODO: You will need to fully implement IProducerConsumerCollection.
    private Queue<T> queue = new Queue<T>();

    public bool TryAdd(T item)
    {
        lock (queue)
        {
            if (!queue.Contains(item))
            {
                queue.Enqueue(item);
                return true;
            }
            return false;
        }
    }

    public bool TryTake(out T item)
    {
        lock (queue)
        {
            item = default(T);
            if (queue.Count > 0)
            {
                item = queue.Dequeue();
                return true;
            }
            return false;
        }
    }
}
Now that we have our IProducerConsumerCollection that does not accept duplicates we can use it like this:
public class Example
{
    private BlockingCollection<object> queue =
        new BlockingCollection<object>(new NoDuplicatesConcurrentQueue<object>());

    public Example()
    {
        new Thread(Consume).Start();
    }

    public void Produce(object item)
    {
        bool unique = queue.TryAdd(item);
    }

    private void Consume()
    {
        while (true)
        {
            object item = queue.Take();
        }
    }
}
You may not like my implementation of NoDuplicatesConcurrentQueue. You are certainly free to implement your own using ConcurrentQueue or whatever if you think you need the low-lock performance that the TPL collections provide.
Update:
I was able to test the code this morning. There is some good news and bad news. The good news is that this will technically work. The bad news is that you probably will not want to do this, because BlockingCollection.TryAdd intercepts the return value from the underlying IProducerConsumerCollection.TryAdd method and throws an exception when false is detected. Yep, that is right: it does not return false as you would expect and instead generates an exception. I have to be honest, this is both surprising and ridiculous. The whole point of the TryXXX methods is that they should not throw exceptions. I am deeply disappointed.
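If you still want bool semantics, one workaround is to catch that exception at the call site (a sketch; the documented behavior is an InvalidOperationException when the underlying collection rejects the item):
public static bool TryAddUnique<T>(BlockingCollection<T> queue, T item)
{
    try
    {
        return queue.TryAdd(item);
    }
    catch (InvalidOperationException)
    {
        // The underlying NoDuplicatesConcurrentQueue rejected the duplicate.
        return false;
    }
}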
In addition to the caveat Brian Gideon mentioned after his update, his solution suffers from these performance issues:
- O(n) operations on the queue (queue.Contains(item)) have a severe impact on performance as the queue grows
- locks limit concurrency (which he does mention)
The following code improves on Brian's solution by:
- using a hash set to do O(1) lookups
- combining 2 data structures from the System.Collections.Concurrent namespace
N.B. As there is no ConcurrentHashSet, I'm using a ConcurrentDictionary, ignoring the values.
In this rare case it is luckily possible to simply compose a more complex concurrent data structure out of multiple simpler ones, without adding locks. The order of operations on the 2 concurrent data structures is important here.
public class NoDuplicatesConcurrentQueue<T> : IProducerConsumerCollection<T>
{
    private readonly ConcurrentDictionary<T, bool> existingElements = new ConcurrentDictionary<T, bool>();
    private readonly ConcurrentQueue<T> queue = new ConcurrentQueue<T>();

    public bool TryAdd(T item)
    {
        if (existingElements.TryAdd(item, false))
        {
            queue.Enqueue(item);
            return true;
        }
        return false;
    }

    public bool TryTake(out T item)
    {
        if (queue.TryDequeue(out item))
        {
            bool _;
            existingElements.TryRemove(item, out _);
            return true;
        }
        return false;
    }

    ...
}
N.B. Another way of looking at this problem: you want a set that preserves insertion order.
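A usage sketch of the composed queue on its own (note that Brian's caveat still applies if you wrap it in a BlockingCollection: a duplicate TryAdd will surface there as an exception, not as false):
var queue = new NoDuplicatesConcurrentQueue<string>();
queue.TryAdd("a");       // true - enqueued
queue.TryAdd("a");       // false - already present
string item;
queue.TryTake(out item); // true, item == "a"
queue.TryAdd("a");       // true again - "a" left the queue, so it may re-enter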
I would suggest wrapping your operations in a lock so that you don't read and write the item in a way that corrupts it, i.e. so the check-then-add is atomic. For example, with any IEnumerable:
object bcLocker = new object();
// ...
lock (bcLocker)
{
    bool foundTheItem = false;
    foreach (someClass nextItem in myBlockingColl)
    {
        if (nextItem.Equals(item))
        {
            foundTheItem = true;
            break;
        }
    }
    if (foundTheItem == false)
    {
        // Add here
    }
}
How to access the underlying default concurrent queue of a blocking collection?
The BlockingCollection<T> is backed by a ConcurrentQueue<T> by default. In other words if you don't specify explicitly its backing storage, it will create a ConcurrentQueue<T> behind the scenes. Since you want to have direct access to the underlying storage, you can create manually a ConcurrentQueue<T> and pass it to the constructor of the BlockingCollection<T>:
ConcurrentQueue<Item> queue = new();
BlockingCollection<Item> collection = new(queue);
Unfortunately the ConcurrentQueue<T> collection doesn't have a TryPeek method with an input parameter, so what you intend to do is not possible:
if (!queue.TryPeek(item)) // Compile error, missing out keyword
collection.Add(item);
Also be aware that the queue is now owned by the collection. If you attempt to mutate it directly (by issuing Enqueue or TryDequeue commands), the collection will throw exceptions.
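What does still work is peeking at the head of the queue, since ConcurrentQueue<T>.TryPeek only reads (a sketch; it tells you what is at the front, not whether an arbitrary item is somewhere in the queue):
Item head;
if (queue.TryPeek(out head))
{
    // Inspect the next item without removing it from the queue.
}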

Is there anything like an expandable Queue in C#?

I have a set of IDs on which I do some operations:
Queue<string> queue = new Queue<string>();
queue.Enqueue("1");
queue.Enqueue("2");
...
queue.Enqueue("10");

foreach (string id in queue)
{
    DoSomeWork(id);
}

static void DoSomeWork(string id)
{
    // Do some work and oooo there are new ids which should also be processed :)
    foreach (string newID in newIDs)
    {
        if (!queue.Contains(newID))
            queue.Enqueue(newID);
    }
}
Is it possible to add new items to the queue in DoSomeWork() so that they are also processed by the main foreach loop?
What you're doing is using an iterator over a changing collection. This is bad practice: most collections will throw an exception if they are modified during enumeration.
Use the following approach instead, which picks up new items as well:
while (queue.Count > 0)
{
    DoSomeWork(queue.Dequeue());
}
Use Dequeue instead of a foreach loop. Most enumerators become invalid whenever the underlying container is changed, and En-/Dequeue are the natural operations on a Queue. Otherwise you could use List<T> or HashSet<T>.
while (queue.Count > 0)
{
    var value = queue.Dequeue();
    ...
}
To check if an item has already been processed a HashSet<T> is a fast solution. I typically use a combination of HashSet and Queue in those cases. The advantage of this solution is that it's O(n) because checking and adding to a HashSet is O(1). Your original code was O(n^2) since Contains on a Queue is O(n).
Queue<string> queue = new Queue<string>();
HashSet<string> allItems = new HashSet<string>();

void Add(string item)
{
    if (allItems.Add(item))
        queue.Enqueue(item);
}

void DoWork()
{
    while (queue.Count > 0)
    {
        var value = queue.Dequeue();
        ...
    }
}
It is common for loop iterations to add more work; just pass the queue into the method as an argument, and adding to it will work fine, as sketched below.
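A minimal sketch of that, with the queue passed explicitly (newIDs is the placeholder from the question):
static void DoSomeWork(Queue<string> queue, string id)
{
    // ... do the work for id, discovering newIDs along the way ...
    foreach (string newID in newIDs)
    {
        if (!queue.Contains(newID))
            queue.Enqueue(newID);
    }
}

// Caller: items enqueued inside DoSomeWork are picked up by this same loop.
while (queue.Count > 0)
    DoSomeWork(queue, queue.Dequeue());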
The problem is that you should be using Dequeue:
while (queue.Count > 0)
{
    DoSomeWork(queue.Dequeue());
}
