I've already read previous questions here about ConcurrentBag but did not find an actual sample of implementation in multi-threading.
"ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag."
This is the current usage in my code (simplified, not the actual code):
private void MyMethod()
{
    List<Product> products = GetAllProducts(); // Get list of products
    ConcurrentBag<Product> myBag = new ConcurrentBag<Product>();

    // Products are simply added to the ConcurrentBag here to simplify the code;
    // the actual code processes each product before adding it to the bag.
    Parallel.ForEach(
        products,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        product => myBag.Add(product));

    ProcessBag(myBag); // method to process each item in the ConcurrentBag
}
My questions:
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
For me, I think a simple List<Product> and a manual lock will do better. The reason for this is that the scenario above already breaks the "same thread will be both producing and consuming data stored in the bag" rule.
I also found out that the ThreadLocal storage created for each thread in the parallel loop will still exist after the operation (even if the thread is reused; is this right?), which may cause an undesired memory leak.
Am I right on this one? Or is a simple clear or empty method to remove the items from the ConcurrentBag enough?
This looks like an ok use of ConcurrentBag. The thread local variables are members of the bag, and will become eligible for garbage collection at the same time the bag is (clearing the contents won't release them). You are right that a simple List with a lock would suffice for your case. If the work you are doing in the loop is at all significant, the type of thread synchronization won't matter much to the overall performance. In that case, you might be more comfortable using what you are familiar with.
Another option would be to use ParallelEnumerable.Select, which matches what you are trying to do more closely. Again, any performance difference you are going to see is likely going to be negligible and there's nothing wrong with sticking with what you know.
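For illustration, here is a minimal sketch of the ParallelEnumerable.Select approach. Process is a hypothetical per-product transformation standing in for the real work, and ProcessResults is a hypothetical consumer:
List<Product> products = GetAllProducts();

Product[] results = products
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(product => Process(product)) // hypothetical per-product work
    .ToArray();

ProcessResults(results); // hypothetical consumer of the processed items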
As always, if the performance of this is critical there's no substitute for trying it and measuring.
It seems to me that bmm6o's answer is not correct. The ConcurrentBag instance internally contains mini-bags for each thread that adds items to it, so item insertion does not involve any thread locks, and thus all Environment.ProcessorCount threads may get into full swing without being stuck waiting and without any thread context switches. Thread synchronization may be required when iterating over the collected items, but in the original example the iteration is done by a single thread after all insertions are done. Moreover, if the ConcurrentBag uses Interlocked techniques as the first layer of thread synchronization, then it is possible not to involve Monitor operations at all.
On the other hand, using a usual List<T> instance and wrapping each of its Add() calls in a lock statement will hurt the performance a lot. First, due to the constant Monitor.Enter() and Monitor.Exit() calls, each of which may require stepping deep into kernel mode and working with Windows synchronization primitives. Second, one thread may occasionally be blocked by another thread because that thread has not finished its addition yet.
As for me, the code above is a really good example of the right usage of the ConcurrentBag class.
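If in doubt, a quick measurement settles it. Here is a rough sketch of such a comparison (not a rigorous benchmark; numbers will vary by machine and workload):
var sw = System.Diagnostics.Stopwatch.StartNew();
var bag = new ConcurrentBag<int>();
Parallel.For(0, 1_000_000, i => bag.Add(i));
Console.WriteLine($"ConcurrentBag: {sw.ElapsedMilliseconds} ms");

sw.Restart();
var list = new List<int>();
var gate = new object(); // dedicated lock object
Parallel.For(0, 1_000_000, i => { lock (gate) list.Add(i); });
Console.WriteLine($"List + lock:   {sw.ElapsedMilliseconds} ms");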
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
No, for multiple reasons:
This is not the intended usage scenario for this collection. The ConcurrentBag<T> is intended for mixed producer-consumer scenarios, meaning that each thread is expected to add and take items from the bag. Your scenario is nothing like this. You have many threads that add items, and zero threads that take items. The main application for the ConcurrentBag<T> is for making object-pools (pools of reusable objects that are expensive to create or destroy). And given the availability of the ObjectPool<T> class in the Microsoft.Extensions.ObjectPool package, even this niche application for this collection is contested.
It doesn't preserve the insertion order. Even if preserving the insertion order is not important, getting a shuffled output makes debugging more difficult.
It creates garbage that has to be collected by the GC. It creates one WorkStealingQueue (internal class) per thread, each containing an expandable array, so the more threads you have the more objects you allocate. Also, each time it is enumerated it copies all the items into an array, and it exposes an IEnumerator<T> GetEnumerator() method that is boxed on each foreach.
There are better options available, offering both better performance and better ordering behavior.
In your scenario you can store the results of the parallel execution in a simple array. Just create an array with length equal to the products.Count, switch from the Parallel.ForEach to the Parallel.For, and assign the result directly to the corresponding slot of the results array without doing any synchronization at all:
List<Product> products = GetAllProducts(); // Get list of products
Product[] results = new Product[products.Count];
Parallel.For(0, products.Count,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    i => results[i] = products[i]);
ProcessResults(results);
This way you'll get the results with perfect ordering, stored in a container that has the most compact size and the fastest enumeration of all .NET collections, doing only a single object allocation.
In case you are concerned about the thread-safety of the above operation, there is nothing to worry about. Each thread writes on different slots in the results array. After the completion of the parallel execution the current thread has full visibility of all the values that are stored in the array, because the TPL includes the appropriate barriers when tasks are queued, and at the beginning/end of task execution (citation).
(I have posted more thoughts about the ConcurrentBag<T> in this answer.)
If a List<T> is used with a lock around the Add() method, it will make threads wait and will reduce the performance gain of using Parallel.ForEach().
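For completeness, the pattern being described looks roughly like this (a sketch; the lock object and names are illustrative):
List<Product> products = GetAllProducts();
var results = new List<Product>();
var gate = new object(); // dedicated lock object

Parallel.ForEach(products, product =>
{
    // every Add serializes on the lock, so threads may wait here
    lock (gate)
    {
        results.Add(product);
    }
});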
I have a list of table names (student, exam, school).
I use a Parallel.ForEach loop to iterate over the table names and do processing for each table, with MaxDegreeOfParallelism = 8.
My problem is that my Parallel.ForEach doesn't always engage in work stealing. For example, when two tables are left to process, they may be processed one after another instead of in parallel. I'm trying to improve performance and increase throughput.
I tried to do this by creating a custom TaskScheduler; however, for my implementation I need a sorted list of tasks with the easiest tasks ordered first, so that they aren't held up by longer-running tables. I can't seem to do this by sorting the list passed to Parallel.ForEach (List<string>) because the tasks are enqueued by the TaskScheduler out of order. Therefore, I need a way to sort a list of tasks inside my CustomTaskScheduler, which is based on https://psycodedeveloper.wordpress.com/2013/06/28/a-custom-taskscheduler-in-c/
How can I control the order in which tasks are passed by the Parallel.ForEach to the TaskScheduler to be enqueued?
The Parallel.ForEach method employs two different partitioning strategies depending on the type of the source. If the source is an array or a List, it is partitioned statically (upfront). If the source is an honest-to-goodness¹ IEnumerable, it is partitioned dynamically (on the go). The dynamic partitioning has the desirable behavior of work-stealing, but has more overhead. In your case the overhead is not important, because the granularity of your workload is very low.
To ensure that the partitioning is dynamic, the easiest way is to wrap your source with the Partitioner.Create method:
string[] tableNames;
Parallel.ForEach(Partitioner.Create(tableNames), tableName =>
{
    // Process table
});
¹ (The expression is borrowed from a comment in the source code)
I would recommend looking up partitioners. Managing threads in a Parallel loop has some overhead, so there is some built-in logic to try to keep this overhead small while still balancing the work across all cores properly. This is done by dividing the list into chunks and adjusting the chunk size to hit some sweet spot.
I would guess that ordering the tasks smallest-first will work against the partitioner's balancing. I would try ordering the work largest-first if balancing is the goal. Another thing I would try is to partition the work items with some constant chunk size and see if that helps, as sketched below. Or perhaps even write your own partitioner.
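As a sketch of the constant chunk-size idea, a range partitioner can be created with an explicit range size. GetTableNames and ProcessTable are hypothetical stand-ins for your own code:
string[] tableNames = GetTableNames(); // hypothetical source of table names
var chunks = Partitioner.Create(0, tableNames.Length, 2); // chunk size of 2

Parallel.ForEach(chunks, range =>
{
    // each worker receives a (fromInclusive, toExclusive) range
    for (int i = range.Item1; i < range.Item2; i++)
    {
        ProcessTable(tableNames[i]); // hypothetical per-table processing
    }
});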
I'm not sure it is a great idea to try to enforce some execution order. Since you do not control the OS scheduler there cannot be any guaranteed ordering. And even if you can make it more ordered, it would probably be at the cost of throughput.
Also, if you are spending lots of time optimizing the parallelization, are you sure the rest of the code is optimized?
I have a method which reads a text file containing an int value per line. To make reading faster, I used Parallel.ForEach, but the behaviour I am seeing is unexpected. I have 800 lines in the file, but when I run this method, it returns a different count from the HashSet every time. From what I have read after searching, Parallel.ForEach spawns multiple threads and returns the result when all threads have completed their work, but my code's execution contradicts that. Or am I missing something important here?
Here is my method:
private HashSet<int> GetKeyItemsProcessed()
{
    HashSet<int> keyItems = new HashSet<int>();
    if (!File.Exists(TrackingFilePath))
        return keyItems;

    // normal foreach works fine
    //foreach (var keyItem in File.ReadAllLines(TrackingFilePath))
    //{
    //    keyItems.Add(int.Parse(keyItem));
    //}

    // this does not return the right number of HashSet rows
    Parallel.ForEach(File.ReadAllLines(TrackingFilePath).AsParallel(), keyItem =>
    {
        keyItems.Add(int.Parse(keyItem));
    });

    return keyItems;
}
HashSet.Add is NOT thread safe.
From MSDN:
Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
The unpredictability of multithreaded timing could be, and seems to be, causing issues.
You could wrap the access in a synchronization construct, which is sometimes faster than a concurrent collection, but may not speed anything up in some cases. As others have mentioned, another option is to use a thread-safe collection like ConcurrentDictionary or ConcurrentQueue, though those may have additional memory overhead.
Be sure to benchmark any results you get with regards to timing. The raw power of single-threaded access can sometimes be faster than dealing with the overhead of threading. It may not be worth it at all to thread this code.
The final word, though, is that HashSet alone, without synchronization, is simply unacceptable for multithreaded operations.
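For example, the asker's method could be made safe with a private lock object. A sketch (whether the parallelism is worth it here still needs measuring):
private HashSet<int> GetKeyItemsProcessed()
{
    var keyItems = new HashSet<int>();
    if (!File.Exists(TrackingFilePath))
        return keyItems;

    var gate = new object();
    Parallel.ForEach(File.ReadAllLines(TrackingFilePath), keyItem =>
    {
        int value = int.Parse(keyItem); // parse outside the lock
        lock (gate)
        {
            keyItems.Add(value); // only one thread mutates the set at a time
        }
    });
    return keyItems;
}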
So my problem is as follows: I have a list of items to process and I'd like to process the items in parallel then commit the processed items.
The Barrier class in C# will allow me to do this: I can run threads in parallel to process the list of items, and when SignalAndWait is called and all participants hit the barrier, I can commit the processed items.
The Task class will also allow me to do this: with the Task.WaitAll call I can wait for all tasks to complete and then commit the processed items. If I understand correctly, each task will run on its own thread, not a bunch of tasks in parallel on the same thread.
Is my understanding correct on both usages for the problem?
Is there any advantage between one over the other?
Is there any way a hybrid solution is better (barrier and tasks?).
Is my understanding correct on both usages for the problem?
I think you have a misunderstanding of the Barrier class. The docs say:
A barrier is a user-defined synchronization primitive that enables multiple threads (known as participants) to work concurrently on an algorithm in phases.
A barrier is a synchronization primitive. Comparing it to a unit of work which may be computed in parallel such as a Task isn't correct.
A barrier can signal all threads to wait until all others have completed some work and check upon that work. By itself, it has no parallel computation capabilities and no threading model behind it.
Is there any advantage between one over the other?
Given the answer to question 1, you can see this comparison is irrelevant.
Is there any way a hybrid solution is better (barrier and tasks?).
In your case, I'm not sure it's needed at all. If you simply want to do CPU-bound computation in parallel on a collection of items, you have Parallel.ForEach exactly for that purpose. It will partition the enumerable, invoke the work in parallel, and block until the entire collection has been computed.
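A sketch of that approach, assuming hypothetical Process and Commit methods and a ProcessedItem result type:
// Process items in parallel, then commit once the loop returns;
// Parallel.ForEach blocks until every item has been handled.
var processed = new ConcurrentQueue<ProcessedItem>();

Parallel.ForEach(items, item => processed.Enqueue(Process(item)));

Commit(processed); // safe: no worker is still running at this point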
I'm not directly answering your question because I think that working with barriers and tasks is just making your code more complex than it needs to be.
I'd suggest using Microsoft's Reactive Framework for this - NuGet "Rx-Main" - as it just makes the whole problem super simple.
Here's the code:
// requires the Rx-Main package: using System.Reactive.Linq;
var query =
    from item in items.ToObservable()
    from processed in Observable.Start(() => processItem(item))
    select new { item, processed };

query
    .ToArray()
    .Subscribe(processedItems =>
    {
        /* commit the processed items */
    });
The query turns a list of items into an observable and then processes each item using Observable.Start(...). This optimally fires off new threads as needed. The .ToArray() takes the sequence of individual results and turns it into a single array of results. The .Subscribe(...) method then allows you to process the results.
The code is much simpler than using tasks or barriers.
Ok, I have read Thread safe collections in .NET and Why lock Thread safe collections?.
The former question, being Java-centred, doesn't answer my question, and the answer to the latter tells me that I don't need to lock the collection because it is supposed to be thread-safe (which is what I thought).
Now coming to my question,
A lot of developers I see (on GitHub and in my organisation) have started using the new thread-safe collections. However, they often don't remove the lock around read & write operations.
I don't understand this. Isn't a thread-safe collection ... well, completely thread-safe?
What could be the implications involved in not locking a thread-safe collection ?
EDIT: PS: here's my case,
I have a lot of classes, and some of them have an attribute on them. Very often I need to check if a given type has that attribute or not (using reflection, of course). This could be expensive performance-wise, so I decided to create a cache using a ConcurrentDictionary<string, bool>, the string being the type name and the bool specifying whether it has the attribute. At first the cache is empty; the plan was to keep adding to it as and when required. I came across the GetOrAdd() method of ConcurrentDictionary, and my question is about the same: should I call this method without locking?
The remarks on MSDN say:
If you call GetOrAdd simultaneously on different threads, addValueFactory may be called multiple times, but its key/value pair might not be added to the dictionary for every call.
You should not lock a thread-safe collection; it exposes methods to update the collection that are already synchronized. Use them as intended.
The thread safe collection may not match your needs for instance if you want to prevent modification while an enumerator is opened on the collection (the provided thread safe collections allow modifications). If that's the case you'd better use a regular collection and lock it everywhere. The internal locks of the thread safe collections aren't publicly available.
It's hard to answer about the implications of not locking a thread-safe collection. You don't need to lock a thread-safe collection, but you may have to lock your code that does multiple things. Hard to tell without seeing the code.
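To illustrate the "regular collection, locked everywhere" option mentioned above, one common pattern is to copy under the lock and enumerate the copy. A sketch:
private readonly object _sync = new object();
private readonly List<int> _items = new List<int>();

public void Add(int item)
{
    lock (_sync) { _items.Add(item); }
}

public List<int> Snapshot()
{
    // copy under the lock; the caller can enumerate the copy freely
    lock (_sync) { return new List<int>(_items); }
}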
Yes the method is thread safe but it might call the AddValueFactory multiple times if you hit an Add for the same key at the same time. In the end only one of the values will be added, the others will be discarded. It might not be an issue... you'll have to check how often you may reach this situation but I think it's not common and you can live with the performance penalty in an edge case that may never occur.
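Applied to the asker's cache, the lock-free usage would look roughly like this (MyAttribute is a hypothetical marker attribute standing in for the real one):
private static readonly ConcurrentDictionary<string, bool> AttributeCache =
    new ConcurrentDictionary<string, bool>();

public static bool HasMyAttribute(Type type)
{
    // the value factory may run more than once under contention,
    // but only one result is ever stored for a given key
    return AttributeCache.GetOrAdd(
        type.FullName,
        _ => Attribute.IsDefined(type, typeof(MyAttribute)));
}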
You could also build your dictionary in a static ctor or before you need it. This way, the dictionary is filled once and you don't ever write to it. The dictionary is then read-only, and you need neither a lock nor a thread-safe collection.
A method of a class typically changes the object from state A to state B. However, another thread may also change the state of the object during the execution of that method, potentially leaving the object in an unstable state.
For instance, a list may want to check if its underlying data buffer is large enough before adding a new item:
void Add(object item)
{
    int requiredSpace = Count + 1;
    if (buffer.Length < requiredSpace)
    {
        // increase underlying buffer
    }
    buffer[Count] = item;
    Count = requiredSpace; // publish the new count
}
Now if a list has buffer space for only one more item, and two threads attempt to add an item at the same time, they may both decide that no additional buffer space is required, potentially causing an IndexOutOfRangeException on one of these threads.
Thread-safe classes ensure that this does not happen.
This does not mean that using a thread-safe class makes your code thread-safe:
int count = myConcurrentCollection.Count;
myConcurrentCollection.Add(item);
count++;
if (myConcurrentCollection.Count != count)
{
    // some other thread has added or removed an item
}
So although the collection is thread safe, you still need to consider thread-safety for your own code. The enumerator example Guillaume mentioned is a perfect example of where threading issues might occur.
In regards to your comment, the documentation for ConcurrentDictionary mentions:
All these operations are atomic and are thread-safe with regards to all other operations on the ConcurrentDictionary class. The only exceptions are the methods that accept a delegate, that is, AddOrUpdate and GetOrAdd. For modifications and write operations to the dictionary, ConcurrentDictionary uses fine-grained locking to ensure thread safety. (Read operations on the dictionary are performed in a lock-free manner.) However, delegates for these methods are called outside the locks to avoid the problems that can arise from executing unknown code under a lock. Therefore, the code executed by these delegates is not subject to the atomicity of the operation.
So yes, these overloads (the ones that take a delegate) are exceptions.
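If running the factory more than once is actually a problem (say, it has side effects or is very expensive), a common mitigation is to cache Lazy<T> values, so the expensive work itself runs at most once per key. A sketch, where Foo and CreateExpensive are hypothetical:
var cache = new ConcurrentDictionary<string, Lazy<Foo>>();

Foo value = cache
    .GetOrAdd(key, k => new Lazy<Foo>(() => CreateExpensive(k)))
    .Value; // Lazy<T> runs its inner factory at most once per instance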
If I have an array that can/will be accessed by multiple threads at any given point in time, what exactly causes it to be non-thread safe, and what would be the steps taken to ensure that the array would be thread safe in most situations?
I have looked extensively around on the internet and have found little to no information on this subject; everything seems to be specific scenarios (e.g. is this array, that is being accessed like this by these two threads, thread-safe; and on, and on). I would really like it if someone could either answer the questions I laid out at the top, or point me towards a good document explaining said items.
EDIT:
After looking around on MSDN, I found the ArrayList class. When you use the Synchronized method, it returns a thread-safe wrapper for a given list. When setting data in the list (i.e. list1[someNumber] = anotherNumber;), does the wrapper automatically take care of locking the list, or do you still need to lock it?
When two threads are accessing the exact same resource (e.g., not local copies, but actually the same copy of the same resource), a number of things can happen. In the most obvious scenario, if Thread #1 is accessing a resource and Thread #2 changes it mid-read, some unpredictable behavior can happen. Even with something as simple as an integer, you could have logic errors arise, so try to imagine the horrors that can result from improperly using something more complicated, like a database access class that's declared as static.
The classical way of handling this problem is to put a lock on the sensitive resources so only one thread can use it at a time. So in the above example, Thread #1 would request a lock to a resource and be granted it, then go in to read what it needs to read. Thread #2 would come along mid-read and request a lock to the resource, but be denied and told to wait because Thread #1 is using it. When Thread #1 finishes, it releases the lock and it's OK for Thread #2 to proceed.
There are other situations, but this illustrates one of the most basic problems and solutions. In C#, you may:
1) Use specific .NET objects that are managed as lockable by the framework (like Scorpion-Prince's link to SynchronizedCollection)
2) Use [MethodImpl(MethodImplOptions.Synchronized)] to dictate that a specific method that does something dangerous should only be used by one thread at a time
3) Use the lock statement to isolate specific lines of code that are doing something potentially dangerous
What approach is best is really up to your situation.
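A minimal sketch of the third option, assuming a shared array guarded by a private lock object (names are illustrative):
private readonly object _sync = new object();
private readonly int[] _shared = new int[100];

public void Increment(int index)
{
    lock (_sync)
    {
        // the read-modify-write happens atomically under the lock
        _shared[index] = _shared[index] + 1;
    }
}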
If I have an array that can/will be accessed by multiple threads at any given point in time, what exactly causes it to be non-thread safe, and what would be the steps taken to ensure that the array would be thread safe in most situations?
In general terms, the array is not thread-safe in the sense that two or more threads could be modifying its contents at the same time if you do not synchronize access to it.
Speaking generally, for example, let's suppose you have thread 1 doing this work:
for (int i = 0; i < array.Length; i++)
{
    array[i] = "Hello";
}
And thread 2 doing this work (on the same shared array)
for (int i = 0; i < array.Length; i++)
{
    array[i] = "Goodbye";
}
There isn't anything synchronizing the threads, so your results will depend on which thread wins the race. Each element could be "Hello" or "Goodbye", in some random order, but it will always be exactly 'Hello' or 'Goodbye', never a mixture of the two.
The actual write of the string 'Hello' or 'Goodbye' is guaranteed by the CLR to be atomic. That is to say, the writing of the value 'Hello' cannot be interrupted by a thread trying to write 'Goodbye'. One must occur before or after the other, never in between.
So you need to create some kind of synchronization mechanism to prevent the threads from stepping on each other. You can accomplish this by using a lock statement in C#.
.NET Framework 3.0 and above provide a generic collection class called SynchronizedCollection<T>, which "provides a thread-safe collection that contains objects of a type specified by the generic parameter as elements."
An array is thread safe only as far as its public static members are concerned; instance members are not guaranteed to be thread safe. System.Array implements the ICollection interface, which defines the SyncRoot and IsSynchronized members to support synchronization.
However, enumerating through the array's items is not safe; the developer should use a lock statement to make sure there is no change to the array during the enumeration.
EX:
Array arrThreadSafe = new string[] { "We", "are", "safe" };
lock (arrThreadSafe.SyncRoot)
{
    foreach (string item in arrThreadSafe)
    {
        Console.WriteLine(item);
    }
}