ConcurrentQueue and Parallel.ForEach - c#

I have a ConcurrentQueue with a list of URLs whose source I need to fetch. When I use Parallel.ForEach with the ConcurrentQueue object as the input parameter, the Pop method doesn't return anything (it should return a string).
I'm using Parallel with MaxDegreeOfParallelism set to four. I really need to limit the number of concurrent threads. Is using a queue with Parallel redundant?
Thanks in advance.
// On the main class
var items = await engine.FetchPageWithNumberItems(result);
// Enqueue List of items
itemQueue.EnqueueList(items);
var crawl = Task.Run(() => { engine.CrawlItems(itemQueue); });
// On the Engine class
public void CrawlItems(ItemQueue itemQueue)
{
Parallel.ForEach(
itemQueue,
new ParallelOptions {MaxDegreeOfParallelism = 4},
item =>
{
var worker = new Worker();
// Pop doesn't return anything
worker.Url = itemQueue.Pop();
/* Some work */
});
}
// Item Queue
class ItemQueue : ConcurrentQueue<string>
{
private ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
public string Pop()
{
string value = String.Empty;
if(this.queue.Count == 0)
throw new Exception();
this.queue.TryDequeue(out value);
return value;
}
public void Push(string item)
{
this.queue.Enqueue(item);
}
public void EnqueueList(List<string> list)
{
list.ForEach(this.queue.Enqueue);
}
}

You don't need ConcurrentQueue<T> if all you're going to do is to first add items to it from a single thread and then iterate it in Parallel.ForEach(). A normal List<T> would be enough for that.
Also, your implementation of ItemQueue is very suspicious:
It inherits from ConcurrentQueue<string> and also contains another ConcurrentQueue<string>. That doesn't make much sense, is confusing and inefficient.
The methods on ConcurrentQueue<T> were designed very carefully to be thread-safe. Your Pop() isn't thread-safe. What could happen is that you check Count, notice it's 1, then call TryDequeue() and not get any value (i.e. value will be null), because another thread removed the item from the queue in the time between the two calls.
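A minimal sketch of the race-free pattern this points to: rely on the Boolean result of TryDequeue instead of checking Count first. The SafeItemQueue class and its TryPop method are hypothetical names introduced only for illustration; they are not part of the question's code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical wrapper: it holds one queue instead of also inheriting from one.
public sealed class SafeItemQueue
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();

    public void Push(string item) => _queue.Enqueue(item);

    public void EnqueueList(IEnumerable<string> items)
    {
        foreach (var item in items)
            _queue.Enqueue(item);
    }

    // The emptiness check and the removal happen in one atomic call,
    // so another thread cannot take the item between the two steps.
    public bool TryPop(out string item) => _queue.TryDequeue(out item);
}
```

A caller would then loop with `while (queue.TryPop(out var url)) { /* work */ }` and simply stop when the method returns false.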

The issue is in the CrawlItems method: you shouldn't call Pop inside the action passed to ForEach. ForEach invokes the action once for each element of the source and hands that element to the action, which is why the action has an 'item' argument.
I assume that you're getting null because the items have already been consumed by the ForEach method on other threads.
Therefore, your code should look like this:
public void CrawlItems(ItemQueue itemQueue)
{
Parallel.ForEach(
itemQueue,
new ParallelOptions {MaxDegreeOfParallelism = 4},
item =>
{
var worker = new Worker();
worker.Url = item;
/* Some work */
});
}
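As a rough sketch of the simpler setup, a plain List<string> can be passed to Parallel.ForEach while MaxDegreeOfParallelism still caps the concurrency at four. The CrawlSketch class and the fetch delegate are assumptions standing in for the real Worker and download logic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CrawlSketch
{
    // 'fetch' is a hypothetical stand-in for whatever downloads a page.
    public static ConcurrentBag<string> CrawlItems(List<string> urls, Func<string, string> fetch)
    {
        // Results are written from several threads, so a thread-safe bag is used.
        var results = new ConcurrentBag<string>();

        Parallel.ForEach(
            urls,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            url => results.Add(fetch(url)));

        return results;
    }
}
```

The input list is only read inside the loop, which is why it needs no synchronization; only the output collection has to be thread-safe.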

Related

Iterate through growing list and remembering position

What I'm trying to do is add the items of an existing list from another class/thread (which is constantly growing) to a new list that contains values to be validated. But I'm not sure how to do this without processing the same values over and over. I would just like to process the newly added values. See the code below.
public static void ParsePhotos()
{
int tmprow = 0;
string checkthis = "";
List<String> PhotoCheck = new List<String>();
while (Core.Hashtag.PhotoUrls.Count > PhotoCheck.Count)
{
foreach (string photourl in Core.Hashtag.PhotoUrls)
{
PhotoCheck.Add(photourl);
}
checkthis = PhotoCheck[tmprow];
//validate checkthis here
//add checkthis to new list if valid here
tmprow++;
Thread.Sleep(10000);
}
}
HashSet<String> Core.Hashtag.PhotoUrls is being updated every few seconds in another thread.
There is no practical way to do it that way, since HashSet is an unordered collection. Regular collections are also not thread-safe and should not be used from multiple threads without locking.
A typical way to do it would be to use a concurrent queue, where the validation removes items from the queue and adds them to another collection. The thread that produces the items should be modified to add items to the concurrent queue instead of, or in addition to, the HashSet.
public static Task ValidateOnWorkerThread<T>(
BlockingCollection<T> queue,
Func<T, bool> validateMethod,
ConcurrentBag<T> validatedItems,
CancellationToken cancel)
{
return Task.Run(ProcessInternal, cancel);
void ProcessInternal()
{
foreach (var item in queue.GetConsumingEnumerable(cancel))
{
if (validateMethod(item))
{
validatedItems.Add(item);
}
}
}
}
This also has the advantage of processing items in real time instead of needing to sleep all the time.
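For illustration, here is a self-contained sketch of the same producer/consumer idea from end to end. The PhotoValidationSketch class and the "starts with https://" rule are assumptions made up for the example; the real validation would go in their place.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PhotoValidationSketch
{
    public static ConcurrentBag<string> Run(string[] incomingUrls)
    {
        var pending = new BlockingCollection<string>();
        var valid = new ConcurrentBag<string>();

        // Consumer: blocks while the queue is empty and ends once adding is completed.
        var consumer = Task.Run(() =>
        {
            foreach (var url in pending.GetConsumingEnumerable())
            {
                // Placeholder validation rule, assumed for the example.
                if (url.StartsWith("https://", StringComparison.Ordinal))
                    valid.Add(url);
            }
        });

        // Producer: in the real program this would be the thread that discovers photo URLs.
        foreach (var url in incomingUrls)
            pending.Add(url);

        pending.CompleteAdding(); // tells the consumer that no more items will arrive
        consumer.Wait();
        return valid;
    }
}
```

Each URL is handed to the consumer exactly once, so no position counter is needed to avoid reprocessing old values.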

Proper data access in Multithreading

I have a method that is used by multiple threads at the same time. Each of these threads calls another method to receive the data it needs from a List (each one should get different data, not the same).
I wrote this code to get data from a list and use it in the threads.
public static List<string> ownersID;
static int idIdx = 0;
public static string[] GetUserID()
{
if (idIdx < ownersID.Count-1)
{
string[] ret = { ownersID[idIdx], idIdx.ToString() };
idIdx++;
return ret;
}
else if (idIdx >= ownersID.Count)
{
string[] ret = { "EndOfThat" };
return ret;
}
return new string[0];
}
Then each thread uses this code to receive the data and remove it from the list:
string[] arrOwner = GetUserID();
string id = arrOwner[0];
ownersID.RemoveAt(Convert.ToInt32(arrOwner[1]));
But sometimes two or more threads get the same data.
Is there a better way to do this?
If you want to do it with a List, just add a little bit of locking:
private object _lock = new object();
private List<string> _list = new List<string>();
public void Add(string someStr)
{
lock(_lock)
{
if (_list.Any(s => s == someStr)) // already added (inside lock)
return;
_list.Add(someStr);
}
}
public void Remove(string someStr)
{
lock(_lock)
{
if (!_list.Any(s => s == someStr)) // already removed (inside lock)
return;
_list.Remove(someStr);
}
}
With that, no thread will be adding/removing anything while another thread does the same. Your list will be protected from multi-thread access. And you make sure that you only have 1 of the kind. However, you can achieve this using ConcurrentDictionary<T1, T2>
Update: I removed pre-lock check due to this MSDN thread safety statement
It is safe to perform multiple read operations on a List (read - multithreading), but issues can occur if the collection is modified while it's being read.
In a larger application you can use a thread-safe .NET queue, such as ConcurrentQueue<T>, to communicate between threads.
The benefit of using such a queue is that you don't need to lock the object yourself, which decreases latency. Data is added and received through the queue between the main thread and threads A, B, and C, with no explicit locking.
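A small sketch of how a queue gives each thread a different ID, assuming ConcurrentQueue<T> is the queue in question. The OwnerIdSketch class and the HandOut method are hypothetical; the point is that TryDequeue removes and returns an item in one atomic step, so no two workers can receive the same one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class OwnerIdSketch
{
    // Returns every ID that some worker received, in no particular order.
    public static List<string> HandOut(IEnumerable<string> ownerIds, int workerCount)
    {
        var pending = new ConcurrentQueue<string>(ownerIds);
        var taken = new ConcurrentBag<string>();

        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                // Each successful TryDequeue yields an ID no other worker will see.
                while (pending.TryDequeue(out var id))
                    taken.Add(id);
            }))
            .ToArray();

        Task.WaitAll(workers);
        return taken.ToList();
    }
}
```

When TryDequeue returns false the queue is empty, which replaces the "EndOfThat" sentinel from the question.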

Enumerable foreach extend

I created an extension method on IEnumerable to execute an action as quickly as possible. I have a list of objects, and in this method I loop over them; if an object executes the action within a certain timeout, I return.
Now I want to make the output generic, because the action's output will differ. Any advice on what to do?
The IEnumerable holds processes, and it works like load balancing: if the first one does not respond, the second one should. I want to return the output of the input Action.
public static class EnumerableExtensions
{
public static void ForEach<T>(this IEnumerable<T> source, Action action, int timeOut)
{
foreach (T element in source)
{
lock (source)
{
// Loop for all connections and get the fastest responsive proxy
foreach (var mxAccessProxy in source)
{
try
{
// check for the health
Task executionTask = Task.Run(action);
if (executionTask.Wait(timeOut))
{
return ;
}
}
catch
{
//ignore
}
}
}
}
}
}
This code is called like this:
_proxies.ForEach(certainaction, timeOut);
This will improve performance and code readability.
No, it definitely won't :) Moreover, this code introduces more problems, such as redundant locking and exception swallowing, and it doesn't actually execute anything in parallel.
It seems like you want to get the fastest possible call for your Action using some sort of proxy objects. You need to run the tasks asynchronously, not sequentially with .Wait().
Something like this could be helpful for you:
public static class TaskExtensions
{
public static TReturn ParallelSelectReturnFastest<TPoolObject, TReturn>(this TPoolObject[] pool,
Func<TPoolObject, CancellationToken, TReturn> func,
int? timeout = null)
{
var ctx = new CancellationTokenSource();
// for every object in pool schedule a task
Task<TReturn>[] tasks = pool
.Select(poolObject =>
{
ctx.Token.ThrowIfCancellationRequested();
return Task.Factory.StartNew(() => func(poolObject, ctx.Token), ctx.Token);
})
.ToArray();
// not sure if Cast is actually needed,
// just to get rid of co-variant array conversion
int firstCompletedIndex = timeout.HasValue
? Task.WaitAny(tasks.Cast<Task>().ToArray(), timeout.Value, ctx.Token)
: Task.WaitAny(tasks.Cast<Task>().ToArray(), ctx.Token);
// we need to cancel token to avoid unnecessary work to be done
ctx.Cancel();
if (firstCompletedIndex == -1) // no objects in pool managed to complete action in time
throw new NotImplementedException(); // custom exception goes here
return tasks[firstCompletedIndex].Result;
}
}
Now, you can use this extension method to call a specific action on any pool of objects and get the first executed result:
var pool = new[] { 1, 2, 3, 4, 5 };
var result = pool.ParallelSelectReturnFastest((x, token) => {
Thread.Sleep(x * 200);
token.ThrowIfCancellationRequested();
Console.WriteLine("calculate");
return x * x;
}, 1000); // the timeout must exceed the fastest task's 200 ms, otherwise WaitAny returns -1
Console.WriteLine(result);
It outputs:
calculate
1
This is because the first task completes its work in 200 ms and its result is returned, while all other tasks are cancelled through the cancellation token.
In your case it will be something like:
var actionResponse = proxiesList.ParallelSelectReturnFastest((proxy, token) => {
token.ThrowIfCancellationRequested();
return proxy.SomeAction();
});
Some things to mention:
Make sure that your actions are safe to run more than once. You can't rely on how many of them will actually reach execution. If the action is CreateItem, you can end up with many items being created through different proxies.
It cannot guarantee that all of these actions run in parallel, because it is up to the TPL to choose the optimal number of running tasks.
I have implemented this in the old-fashioned TPL way, because your original question used it. If possible, you should switch to async/await. In that case your Func will return tasks, and you need to use await Task.WhenAny(tasks) instead of Task.WaitAny().
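A possible shape for the async/await variant mentioned in the last point, offered as a sketch rather than a drop-in replacement. The FastestSketch class, the SelectFastestAsync name, and the use of TimeoutException are assumptions for the example. Note that if the first task to finish has faulted, awaiting it rethrows its exception.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class FastestSketch
{
    public static async Task<TReturn> SelectFastestAsync<TPoolObject, TReturn>(
        this IEnumerable<TPoolObject> pool,
        Func<TPoolObject, CancellationToken, Task<TReturn>> func,
        TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource())
        {
            // Start one asynchronous call per pool object.
            List<Task<TReturn>> tasks = pool.Select(p => func(p, cts.Token)).ToList();

            // A delay task plays the role of the timeout.
            Task timeoutTask = Task.Delay(timeout, cts.Token);

            Task first = await Task.WhenAny(tasks.Cast<Task>().Append(timeoutTask));

            // Ask the remaining calls to stop doing unnecessary work.
            cts.Cancel();

            if (first == timeoutTask)
                throw new TimeoutException("No pool object completed in time.");

            return await (Task<TReturn>)first;
        }
    }
}
```

With this version no thread is blocked while waiting; the caller simply awaits the result, for example `var response = await proxies.SelectFastestAsync((proxy, token) => proxy.SomeActionAsync(token), TimeSpan.FromSeconds(2));`, where SomeActionAsync is a hypothetical async counterpart of the proxy call.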

Async Producer/Consumer

I have an instance of a class that is accessed from several threads. This class takes these calls and adds a tuple to a database. I need this to be done serially because, due to some DB constraints, parallel threads could leave the database in an inconsistent state.
As I am new to parallelism and concurrency in C#, I did this:
private BlockingCollection<Task> _tasks = new BlockingCollection<Task>();
public void AddDData(string info)
{
Task t = new Task(() => { InsertDataIntoBase(info); });
_tasks.Add(t);
}
private void InsertWorker()
{
Task.Factory.StartNew(() =>
{
while (!_tasks.IsCompleted)
{
Task t;
if (_tasks.TryTake(out t))
{
t.Start();
t.Wait();
}
}
});
}
AddDData is the method that is called by multiple threads, and InsertDataIntoBase is a very simple insert that should take a few milliseconds.
The problem is that, for some reason that my lack of knowledge doesn't allow me to figure out, sometimes a task is called twice! It always goes like this:
T1
T2
T3
T1 <- PK error.
T4
...
Did I understand .Take() completely wrong, am I missing something, or is my producer/consumer implementation really bad?
Best Regards,
Rafael
UPDATE:
As suggested, I made a quick sandbox test implementation with this architecture and as I was suspecting, it does not guarantee that a task will not be fired before the previous one finishes.
So the question remains: how to properly queue tasks and fire them sequentially?
UPDATE 2:
I simplified the code:
private BlockingCollection<Data> _tasks = new BlockingCollection<Data>();
public void AddDData(Data info)
{
_tasks.Add(info);
}
private void InsertWorker()
{
Task.Factory.StartNew(() =>
{
while (!_tasks.IsCompleted)
{
Data info;
if (_tasks.TryTake(out info))
{
InsertIntoDB(info);
}
}
});
}
Note that I got rid of Tasks, as I'm relying on the synchronous InsertIntoDB call (since it is inside a loop), but still no luck... The generation is fine and I'm absolutely sure that only unique instances are going into the queue. But no matter what I try, sometimes the same object is used twice.
I think this should work:
private static BlockingCollection<string> _itemsToProcess = new BlockingCollection<string>();
static void Main(string[] args)
{
InsertWorker();
GenerateItems(10, 1000);
_itemsToProcess.CompleteAdding();
}
private static void InsertWorker()
{
Task.Factory.StartNew(() =>
{
while (!_itemsToProcess.IsCompleted)
{
string t;
if (_itemsToProcess.TryTake(out t))
{
// Do whatever needs doing here
// Order should be guaranteed since BlockingCollection
// uses a ConcurrentQueue as a backing store by default.
// http://msdn.microsoft.com/en-us/library/dd287184.aspx#remarksToggle
Console.WriteLine(t);
}
}
});
}
private static void GenerateItems(int count, int maxDelayInMs)
{
Random r = new Random();
string[] items = new string[count];
for (int i = 0; i < count; i++)
{
items[i] = i.ToString();
}
// Simulate many threads adding items to the collection
items
.AsParallel()
.WithDegreeOfParallelism(4)
.WithExecutionMode(ParallelExecutionMode.ForceParallelism)
.Select((x) =>
{
Thread.Sleep(r.Next(maxDelayInMs));
_itemsToProcess.Add(x);
return x;
}).ToList();
}
This does mean that the consumer is single threaded, but allows for multiple producer threads.
From your comment
"I simplified the code shown here, as the data is not a string"
I assume that the info parameter passed into AddDData is a mutable reference type. Make sure that the caller is not reusing the same info instance for multiple calls, since that reference is captured in the Task lambda.
Based on the trace that you provided the only logical possibility is that you have called InsertWorker twice (or more). There are thus two background threads waiting for items to appear in the collection and occasionally they both manage to grab an item and begin executing it.
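One way to rule out a second consumer is to start the worker exactly once, for example in a constructor. The sketch below assumes string items and a hypothetical SerialInserter class; the insert delegate stands in for the real database call.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class SerialInserter : IDisposable
{
    private readonly BlockingCollection<string> _items = new BlockingCollection<string>();
    private readonly Task _worker;

    public SerialInserter(Action<string> insert)
    {
        // Started once, here, so there is never more than one consumer.
        _worker = Task.Factory.StartNew(() =>
        {
            // Blocks while the collection is empty; ends after CompleteAdding.
            foreach (var item in _items.GetConsumingEnumerable())
                insert(item);
        }, TaskCreationOptions.LongRunning);
    }

    // Safe to call from any number of producer threads.
    public void Add(string info) => _items.Add(info);

    public void Dispose()
    {
        _items.CompleteAdding();
        _worker.Wait(); // let the queued inserts finish
        _items.Dispose();
    }
}
```

GetConsumingEnumerable also avoids the busy loop around TryTake, because the consumer sleeps until an item is available.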

How to add to a List while using Multi-Threading?

I'm kinda new to multi-threading and have only played around with it in the past. But I'm curious whether it is possible to have a List of byte arrays on a main thread and still be able to add to that List while creating the new byte array in a separate thread. Also, I'll be using a for-each loop that will go through a list of forms that will be parsed into the byte array. So basically the pseudo-code would be like this...
reports = new List();
foreach (form in forms)
{
newReport = new Thread(ParseForm(form));
reports.Add(newReport);
}
void ParseForm(form)
{
newArray = new byte[];
newArray = Convert.ToBytes(form);
return newArray;
}
Hopefully the pseudo-code above makes some sense. If anyone could tell me if this is possible and point me in the direction of a good example, I'm sure I can figure out the actual code.
If you need to access a collection from multiple threads, you should either use synchronization, or use a SynchronizedCollection if your .NET version is 3.0 or higher.
Here is one way to make the collection accessible to your thread:
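var reports = new SynchronizedCollection<byte[]>();
foreach (var form in forms) {
// Copy the loop variable: before C# 5 every lambda would capture the same variable
var currentForm = form;
var reportThread = new Thread(() => ParseForm(currentForm, reports));
reportThread.Start();
}
void ParseForm(Form form, SynchronizedCollection<byte[]> reports) {
// Convert.ToBytes is the question's placeholder for the real conversion logic
byte[] newArray = Convert.ToBytes(form);
reports.Add(newArray);
}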
If you are on .NET 4 or later, a much better alternative to managing your threads manually is presented by various classes of the System.Threading.Tasks namespace. Consider exploring this alternative before deciding on your threading implementation.
This was written before we realized the question targets .NET 3.5; it is kept for reference for .NET 4.
If you don't need any order within the list, an easy "fix" is to use the ConcurrentBag<T> class instead of a list. If you need ordering, there is also the ConcurrentQueue<T> collection.
If you really need something more custom, you can implement your own blocking collection using BlockingCollection<T>. Here's a good article on the topic.
You can also use Parallel.ForEach to avoid the explicit thread creation:
private void ParseForms()
{
var reports = new ConcurrentBag<byte[]>();
Parallel.ForEach(forms, (form) =>
{
reports.Add(ParseForm(form));
});
}
private byte[] ParseForm(Form form)
{
// Convert.ToBytes is the question's placeholder for the real conversion logic
byte[] newArray = Convert.ToBytes(form);
return newArray;
}
Why is enumerate files returning the same file more than once?
Check that out. It shows I think exactly what you want to do.
It creates a list on the main thread then adds to it from a different thread.
You're going to need:
using System.Threading.Tasks;
Files.Clear(); //List<string>
Task.Factory.StartNew( () =>
{
this.BeginInvoke( new Action(() =>
{
Files.Add("Hi");
}));
});
Below is a simple Blocking Collection (as a queue only) that I just whipped up now since you don't have access to C# 4.0. It's most likely less efficient than the 4.0 concurrent collections, but it should work well enough. I didn't re-implement all of the Queue methods, just enqueue, dequeue, and peek. If you need others and can't figure out how they would be implemented just mention it in the comments.
Once you have the working blocking collection you can simply add to it from the producer threads and remove from it using the consumer threads.
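public class MyBlockingQueue<T>
{
private readonly Queue<T> queue = new Queue<T>();
private readonly object padLock = new object();
// Monitor.Wait/PulseAll are used instead of an AutoResetEvent:
// waiting on an event while holding padLock would block Enqueue forever.
public void Enqueue(T item)
{
lock (padLock)
{
queue.Enqueue(item);
// Wake up any consumers blocked in Peek() or Dequeue()
Monitor.PulseAll(padLock);
}
}
public T Peek()
{
lock (padLock)
{
// Monitor.Wait releases padLock while waiting, so producers can still enqueue
while (queue.Count < 1)
{
Monitor.Wait(padLock);
}
return queue.Peek();
}
}
public T Dequeue()
{
lock (padLock)
{
while (queue.Count < 1)
{
Monitor.Wait(padLock);
}
return queue.Dequeue();
}
}
}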
