Parallel.ForEach returning inconsistent result - c#

I have a method that reads a text file containing one int value per line. To make the reading faster I used Parallel.ForEach, but the behaviour I am seeing is unexpected: the file has 800 lines, yet every time I run this method it returns a HashSet with a different count. From what I have read, Parallel.ForEach spawns multiple threads and returns only when all of them have completed their work, but my code contradicts that. Am I missing something important here?
Here is my method:
private HashSet<int> GetKeyItemsProcessed()
{
    HashSet<int> keyItems = new HashSet<int>();
    if (!File.Exists(TrackingFilePath))
        return keyItems;

    // normal foreach works fine
    //foreach (var keyItem in File.ReadAllLines(TrackingFilePath))
    //{
    //    keyItems.Add(int.Parse(keyItem));
    //}

    // this does not return right number of hashset rows
    Parallel.ForEach(File.ReadAllLines(TrackingFilePath).AsParallel(), keyItem =>
    {
        keyItems.Add(int.Parse(keyItem));
    });
    return keyItems;
}

HashSet.Add is NOT thread safe.
From MSDN:
Any public static (Shared in Visual Basic) members of this type are
thread safe. Any instance members are not guaranteed to be thread
safe.
The unpredictability of multithread timing could be, and seems to be, causing issues here.
You could wrap the access in a synchronization construct, which is sometimes faster than a concurrent collection but may not speed anything up at all in some cases. As others have mentioned, another option is to use a thread-safe collection like ConcurrentDictionary or ConcurrentQueue, though those may have additional memory overhead.
Be sure to benchmark any timing results you get. The raw speed of single-threaded access can sometimes beat the overhead of threading; it may not be worth threading this code at all.
The final word, though, is that HashSet alone, without synchronization, is simply unacceptable for multithreaded operations.
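For illustration, here is a minimal sketch of the lock option, reusing the method name and TrackingFilePath from the question (a sketch, not a drop-in replacement):
private HashSet<int> GetKeyItemsProcessed()
{
    HashSet<int> keyItems = new HashSet<int>();
    object gate = new object();

    if (!File.Exists(TrackingFilePath))
        return keyItems;

    Parallel.ForEach(File.ReadAllLines(TrackingFilePath), keyItem =>
    {
        int value = int.Parse(keyItem);   // parse outside the lock to keep the critical section small
        lock (gate)
        {
            keyItems.Add(value);          // HashSet<int>.Add now runs on one thread at a time
        }
    });
    return keyItems;
}
A ConcurrentDictionary<int, byte> whose keys act as the set would avoid the explicit lock, at the cost of some extra memory per entry.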

Related

Threads operating on the one instance of a method

I'm creating an app with a 50x50 map. On this map I can add dots, each of which is a new instance of the class "dot". Every dot has its own thread, and every thread connected with a specific dot runs the "explore" method of the class. Inside that method there is another method, "check_place(x,y)", which is responsible for checking whether some place on the map has already been discovered; if not, the static class variable "num_discovered" should be incremented. This single method "check_place(x,y)" must be accessible in real time by every thread started in the app.
Constructor:
public dot(Form1 F)
{
    //...
    thread = new System.Threading.Thread(new System.Threading.ThreadStart(explore)); // thread that runs the explore method of the robot class
    thread.Start();
}
check_place(x,y) method:
static void check_place(int x, int y)
{
    lock (ob)
    {
        if (discovered[x, y] == false)
        {
            discovered[x, y] = true;
            num_discovered += 1;
        }
    }
}
In the explore method I'm invoking method "check_place(x,y)" like this:
dot.check_place(x, y);
Is it enough to achieve a situation where at any one time only one dot can check whether a place has already been discovered?
Is it enough to achieve a situation where at any one time only one dot can check whether a place has already been discovered?
Yes. But what's the point?
If threads are spending all of their time waiting on other threads, what have you gained from being multi-threaded?
There are three (sometimes overlapping) reasons to spawn more threads:
To make use of more than one core at the same time: overall throughput increases.
To have work done while another thread is waiting on something else (typically I/O from file, DB or network): overall throughput increases.
To respond to user interaction while work is being done: overall throughput decreases, but the application feels faster to the user because their input is reacted to promptly.
Here the last doesn't apply.
If your "checking" involved I/O then the second might apply, and this strategy might make sense.
The first could well apply, but because all the threads are spending most of their time waiting on other threads, you don't gain an improvement in throughput.
Indeed, because there is overhead involved in setting up threads and switching between them, this code will be slower than just having one thread do everything: If only one thread can work at a time, then only have one thread!
So your use of a lock here is correct in that it prevents corruption and errors, but pointless in that it makes everything too slow.
What to do about this:
If your real case involves I/O or other reasons why the threads in fact spend most of their time out of each others' way, then what you have is fine.
Otherwise you've got two options.
Easy: Just use one thread.
Hard: Have finer locking.
One way to have finer locking would be to do double-checking:
static void check_place(int x, int y)
{
if (!discovered[x, y])
lock (ob)
if (!discovered[x, y])
{
discovered[x, y] = true;
num_discovered += 1;
}
}
Now at the very least some threads will skip past some cases where discovered[x, y] is true without holding up the other threads.
This is useful when a thread is going to get a result at the end of the locked period. It's still not good enough here, though, because the thread will just move on quickly to a case where it fights for the lock again.
If our lookup of discovered were itself thread-safe and that thread-safety was finely grained, then we could make some progress:
static void check_place(int x, int y)
{
    if (discovered.SetIfFalse(x, y))
        Interlocked.Increment(ref num_discovered);
}
So far though we've just moved the problem around; how do we make SetIfFalse thread-safe without using a single lock and causing the same problem?
There are a few approaches. We could use striped locks, or low-locking concurrent collections.
It seems that you have a fixed-size structure of 50×50, in which case this isn't too hard:
private class DotMap
{
    //ints because we can't use interlocked with bools
    private int[][] _map = new int[50][];

    public DotMap()
    {
        for (var i = 0; i != 50; ++i)
            _map[i] = new int[50];
    }

    public bool SetIfFalse(int x, int y)
    {
        return Interlocked.CompareExchange(ref _map[x][y], 1, 0) == 0;
    }
}
Now our advantages are:
All of our locking is much lower-level (but note that Interlocked operations will still slow down in the face of contention, albeit not as much as lock).
Much of our locking is out of the way of other locking. Specifically, the locking in SetIfFalse allows separate areas to be checked without getting in each other's way at all.
This is neither a panacea though (such approaches still suffer in the face of contention, and also bring their own costs) nor easy to generalise to other cases (changing SetIfFalse to something that does anything more than check and change that single value is not easy). It's still quite likely that even on a machine with a lot of cores this would be slower than the single-threaded approach.
Another possibility is to not make SetIfFalse thread-safe at all, but to ensure that the threads are partitioned from each other so that they never hit the same values, and that the structure is safe under that kind of multi-threaded access (fixed arrays are safe in this respect when threads only ever hit different indices; most mutable structures where one can Add and/or Remove are not). A rough sketch of this idea follows.
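Assuming the 50×50 map from the question, the partitioning could look roughly like this (the row split per worker is purely illustrative):
const int Size = 50;
bool[,] discovered = new bool[Size, Size];
int numDiscovered = 0;
int workers = Environment.ProcessorCount;

Parallel.For(0, workers, worker =>
{
    // each worker owns a disjoint set of rows, so no cell is ever touched by two threads
    for (int x = worker; x < Size; x += workers)
    {
        for (int y = 0; y < Size; y++)
        {
            if (!discovered[x, y])
            {
                discovered[x, y] = true;                  // safe: only this worker writes row x
                Interlocked.Increment(ref numDiscovered); // the shared counter still needs Interlocked
            }
        }
    }
});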
In all, you've got the right idea about how to use lock to keep threads from causing errors, and that is the approach to use 98% of the time when something lends itself well to multithreading because it involves threads waiting on something else. Your example though hits that lock too much to benefit from multiple cores, and creating code that does is not trivial.
Your performance on this could potentially be pretty bad - I recommend using Task.Run here to increase efficiency when you need to run your explore method on multiple threads in parallel.
As far as locking and thread safety, if the lock in check_place is the only place you're setting bools in the discovered variable and setting the num_discovered variable, the existing code will work. If you start setting them from somewhere else in the code, you will need to use locks there as well.
Also, when reading from these variables elsewhere, you should read the values into local variables inside a lock on the same lock object, to maintain thread safety there as well.
I have other suggestions but those are the two most basic things you need here.
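A minimal sketch of the Task.Run suggestion above, assuming a dots collection exists and explore is callable from outside the constructor:
// start one pooled task per dot instead of a dedicated Thread per dot
var exploreTasks = dots.Select(d => Task.Run(() => d.explore())).ToArray();
Task.WaitAll(exploreTasks);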

What is the correct usage of ConcurrentBag?

I've already read previous questions here about ConcurrentBag but did not find an actual sample of implementation in multi-threading.
"ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag."
This is the current usage in my code (simplified, not the actual code):
private void MyMethod()
{
    List<Product> products = GetAllProducts(); // Get list of products
    ConcurrentBag<Product> myBag = new ConcurrentBag<Product>();

    // products are simply added to the ConcurrentBag here to simplify the code;
    // the actual code processes each product before adding it to the bag
    Parallel.ForEach(
        products,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        product => myBag.Add(product));

    ProcessBag(myBag); // method to process each item in the ConcurrentBag
}
My questions:
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
Personally, I think a simple List<Product> and a manual lock would do better. The reason is that the scenario above already breaks the "same thread will be both producing and consuming data stored in the bag" rule.
I also found out that the ThreadLocal storage created for each thread in the parallel loop will still exist after the operation (even if the thread is reused; is this right?), which may cause an undesired memory leak.
Am I right on this one, guys? Or is a simple Clear or empty method to remove the items from the ConcurrentBag enough?
This looks like an ok use of ConcurrentBag. The thread local variables are members of the bag, and will become eligible for garbage collection at the same time the bag is (clearing the contents won't release them). You are right that a simple List with a lock would suffice for your case. If the work you are doing in the loop is at all significant, the type of thread synchronization won't matter much to the overall performance. In that case, you might be more comfortable using what you are familiar with.
Another option would be to use ParallelEnumerable.Select, which matches what you are trying to do more closely. Again, any performance difference you are going to see is likely going to be negligible and there's nothing wrong with sticking with what you know.
As always, if the performance of this is critical there's no substitute for trying it and measuring.
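A sketch of the ParallelEnumerable.Select route mentioned above; ProcessProduct is a placeholder for whatever per-product work the real code does:
List<Product> products = GetAllProducts();

List<Product> processed = products
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(product => ProcessProduct(product))   // placeholder for the real per-product processing
    .ToList();                                    // PLINQ collects the results; no shared bag needed

ProcessBag(processed);   // assumes the consumer can take an ordinary collection instead of a ConcurrentBag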
It seems to me that bmm6o's answer is not correct. The ConcurrentBag instance internally contains mini-bags for each thread that adds items to it, so item insertion does not involve any thread locks, and thus all Environment.ProcessorCount threads may get into full swing without being stuck waiting and without any thread context switches. Thread synchronization may be required when iterating over the collected items, but in the original example the iteration is done by a single thread after all insertions are done. Moreover, if the ConcurrentBag uses Interlocked techniques as the first layer of thread synchronization, it may not involve Monitor operations at all.
On the other hand, using a plain List<T> instance and wrapping each Add() call with the lock keyword will hurt performance a lot: first, because of the constant Monitor.Enter() and Monitor.Exit() calls, each of which may have to fall back to kernel-mode Windows synchronization primitives; second, because occasionally one thread will be blocked simply because another thread has not finished its addition yet.
As for me, the code above is a really good example of the right usage of the ConcurrentBag class.
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
No, for multiple reasons:
This is not the intended usage scenario for this collection. The ConcurrentBag<T> is intended for mixed producer-consumer scenarios, meaning that each thread is expected to add and take items from the bag. Your scenario is nothing like this. You have many threads that add items, and zero threads that take items. The main application for the ConcurrentBag<T> is for making object-pools (pools of reusable objects that are expensive to create or destroy). And given the availability of the ObjectPool<T> class in the Microsoft.Extensions.ObjectPool package, even this niche application for this collection is contested.
It doesn't preserve the insertion order. Even if preserving the insertion order is not important, getting a shuffled output makes the debugging more difficult.
It creates garbage that has to be collected by the GC. It creates one WorkStealingQueue (an internal class) per thread, each containing an expandable array, so the more threads you have the more objects you allocate. Also, each time it is enumerated it copies all the items into an array, and its GetEnumerator() method returns an IEnumerator<T> that is allocated on each foreach.
There are better options available, offering both better performance and better ordering behavior.
In your scenario you can store the results of the parallel execution in a simple array. Just create an array with length equal to the products.Count, switch from the Parallel.ForEach to the Parallel.For, and assign the result directly to the corresponding slot of the results array without doing any synchronization at all:
List<Product> products = GetAllProducts(); // Get list of products
Product[] results = new Product[products.Count];
Parallel.For(0, products.Count,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    i => results[i] = products[i]);
ProcessResults(results);
This way you'll get the results with perfect ordering, stored in a container that has the most compact size and the fastest enumeration of all .NET collections, doing only a single object allocation.
In case you are concerned about the thread-safety of the above operation, there is nothing to worry about. Each thread writes on different slots in the results array. After the completion of the parallel execution the current thread has full visibility of all the values that are stored in the array, because the TPL includes the appropriate barriers when tasks are queued, and at the beginning/end of task execution (citation).
(I have posted more thoughts about the ConcurrentBag<T> in this answer.)
If List<T> is used with a lock around the Add() method, it will make threads wait and will reduce the performance gain of using Parallel.ForEach().

Is this use of the generic List thread safe

I have a System.Collections.Generic.List<T> to which I only ever add items in a timer callback. The timer is restarted only after the operation completes.
I have a System.Collections.Concurrent.ConcurrentQueue<T> which stores indices of added items in the list above. This store operation is also always performed in the same timer callback described above.
Is a read operation that iterates the queue and accesses the corresponding items in the list thread safe?
Sample code:
private List<Object> items;
private ConcurrentQueue<int> queue;
private Timer timer;

private void callback(object state)
{
    int index = items.Count;
    items.Add(new object());
    if (true) // some condition here
        queue.Enqueue(index);
    timer.Change(TimeSpan.FromMilliseconds(500), TimeSpan.FromMilliseconds(-1));
}

// This can be called from any thread
public IEnumerable<object> AccessItems()
{
    foreach (var index in queue)
    {
        yield return items[index];
    }
}
My understanding:
Even if the list is resized when it is being indexed, I am only accessing an item that already exists, so it does not matter whether it is read from the old array or the new array. Hence this should be thread-safe.
Is a read operation that iterates the queue and accesses the corresponding items in the list thread safe?
Is it documented as being thread safe?
If no, then it is foolish to treat it as thread safe, even if it is in this implementation by accident. Thread safety should be by design.
Sharing memory across threads is a bad idea in the first place; if you don't do it then you don't have to ask whether the operation is thread safe.
If you have to do it then use a collection designed for shared memory access.
If you can't do that then use a lock. Locks are cheap if uncontended.
If you have a performance problem because your locks are contended all the time then fix that problem by changing your threading architecture rather than trying to do dangerous and foolish things like low-lock code. No one writes low-lock code correctly except for a handful of experts. (I am not one of them; I don't write low-lock code either.)
Even if the list is resized when it is being indexed, I am only accessing an item that already exists, so it does not matter whether it is read from the old array or the new array.
That's the wrong way to think about it. The right way to think about it is:
If the list is resized then the list's internal data structures are being mutated. It is possible that the internal data structure is mutated into an inconsistent form halfway through the mutation, that will be made consistent by the time the mutation is finished. Therefore my reader can see this inconsistent state from another thread, which makes the behaviour of my entire program unpredictable. It could crash, it could go into an infinite loop, it could corrupt other data structures, I don't know, because I'm running code that assumes a consistent state in a world with inconsistent state.
Big edit
The ConcurrentQueue is only safe with regard to the Enqueue and TryDequeue operations.
You're doing a foreach on it and that doesn't get synchronized at the required level.
The biggest problem in your particular case is that enumerating the queue (which is a collection in its own right) might throw the well-known "Collection has been modified" exception. Why is that the biggest problem? Because you are adding things to the queue after you've added the corresponding objects to the list (the list also badly needs to be synchronized, but that and the biggest problem can be solved with a single "bullet"). While enumerating a collection, it is not easy to tolerate another thread modifying it (even if, at the microscopic level, the modification is safe; ConcurrentQueue does just that).
Therefore you absolutely need to synchronize access to the queues (and the central List while you're at it) using another means of synchronization (which also means you can forget about ConcurrentQueue and use a simple Queue or even a List, since you never Dequeue anything).
So just do something like:
public void Writer(object toWrite) {
    this.rwLock.EnterWriteLock();
    try {
        int tailIndex = this.list.Count;
        this.list.Add(toWrite);
        if (..condition1..)
            this.queue1.Enqueue(tailIndex);
        if (..condition2..)
            this.queue2.Enqueue(tailIndex);
        if (..condition3..)
            this.queue3.Enqueue(tailIndex);
        ..etc..
    } finally {
        this.rwLock.ExitWriteLock();
    }
}
and in the AccessItems:
public IEnumerable<object> AccessItems(int queueIndex) {
    Queue<int> whichQueue = null;
    switch (queueIndex) {
        case 1: whichQueue = this.queue1; break;
        case 2: whichQueue = this.queue2; break;
        case 3: whichQueue = this.queue3; break;
        ..etc..
        default: throw new NotSupportedException("Invalid queue disambiguating params");
    }

    List<object> results = new List<object>();
    this.rwLock.EnterReadLock();
    try {
        foreach (var index in whichQueue)
            results.Add(this.list[index]);
    } finally {
        this.rwLock.ExitReadLock();
    }
    return results;
}
And, based on my entire understanding of the cases in which your app accesses the List and the various Queues, it should be 100% safe.
End of big edit
First of all: What is this thing you call Thread-Safe ? by Eric Lippert
In your particular case, I guess the answer is no.
It is not that inconsistencies might arise in the global context (the actual list).
Instead, it is possible that the actual readers (who might very well "collide" with the unique writer) end up with inconsistencies in themselves (their very own stacks, meaning the local variables of all methods and their parameters, and also their logically isolated portion of the heap).
The possibility of such "per-thread" inconsistencies (the Nth thread wants to learn the number of elements in the List and finds out that the value is 39404999 although in reality you only added 3 values) is enough to declare that, generally speaking, this architecture is not thread-safe (even though you don't actually change the globally accessible List, you are simply reading it in a flawed manner).
I suggest you use the ReaderWriterLockSlim class.
I think you will find it fits your needs:
private ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim(LockRecursionPolicy.SupportsRecursion);
private List<Object> items;
private ConcurrentQueue<int> queue;
private Timer timer;

private void callback(object state)
{
    this.rwLock.EnterWriteLock();
    try {
        // in this place, right here, there can be only ONE writer
        // and while the writer is between EnterWriteLock and ExitWriteLock
        // there can exist no readers in the following method (between EnterReadLock
        // and ExitReadLock)
        // we add the item to the List
        // AND do the enqueue "atomically" (as loose a term as thread-safe)
        int index = items.Count;
        items.Add(new object());
        if (true) // some condition here
            queue.Enqueue(index);
    } finally {
        this.rwLock.ExitWriteLock();
    }
    timer.Change(TimeSpan.FromMilliseconds(500), TimeSpan.FromMilliseconds(-1));
}

// This can be called from any thread
public IEnumerable<object> AccessItems()
{
    List<object> results = new List<object>();
    this.rwLock.EnterReadLock();
    try {
        // in this place there can exist a thousand readers
        // (doing these actions right here, between EnterReadLock and ExitReadLock)
        // all at the same time, but NO writers
        foreach (var index in queue)
        {
            results.Add(this.items[index]);
        }
    } finally {
        this.rwLock.ExitReadLock();
    }
    return results; // or yield return each item if you like that more :)
}
No because you are reading and writing to/from the same object concurrently. This is not documented to be safe so you can't be sure it is safe. Don't do it.
The fact that it is in fact unsafe as of .NET 4.0 means nothing, btw. Even if it was safe according to Reflector it could change anytime. You can't rely on the current version to predict future versions.
Don't try to get away with tricks like this. Why not just do it in an obviously safe way?
As a side note: Two timer callbacks can execute at the same time, so your code is doubly broken (multiple writers). Don't try to pull off tricks with threads.
It is thread-safish. The foreach statement uses the ConcurrentQueue.GetEnumerator() method. Which promises:
The enumeration represents a moment-in-time snapshot of the contents of the queue. It does not reflect any updates to the collection after GetEnumerator was called. The enumerator is safe to use concurrently with reads from and writes to the queue.
Which is another way of saying that your program isn't going to blow up randomly with an inscrutable exception message, like the kind you'll get when you use the Queue class. Beware of the consequences though: implicit in this guarantee is that you may well be looking at a stale version of the queue. Your loop will not be able to see any elements that were added by another thread after your loop started executing. That kind of magic doesn't exist and is impossible to implement in a consistent way. Whether or not that makes your program misbehave is something you will have to think about; it can't be guessed from the question. It is pretty rare that you can completely ignore it.
Your usage of the List<> is however utterly unsafe.
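A small standalone illustration of that snapshot behaviour (not the question's code):
var queue = new ConcurrentQueue<int>();
for (int i = 0; i < 3; i++)
    queue.Enqueue(i);

// another thread keeps enqueuing while we enumerate
var writer = Task.Run(() =>
{
    for (int i = 3; i < 1000; i++)
        queue.Enqueue(i);
});

// no "collection was modified" exception; the loop sees a moment-in-time snapshot
foreach (var item in queue)
    Console.WriteLine(item);

writer.Wait();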

Array and Safe Access by Threads

If I have an array that can/will be accessed by multiple threads at any given point in time, what exactly causes it to be non-thread safe, and what would be the steps taken to ensure that the array would be thread safe in most situations?
I have looked extensively around on the internet and have found little to no information on this subject; everything seems to be specific scenarios (e.g. is this array, accessed like this by these two threads, thread-safe, and so on). I would really like it if someone could either answer the questions I laid out at the top, or point me towards a good document explaining these items.
EDIT:
After looking around on MSDN, I found the ArrayList class. When you use the Synchronized method, it returns a thread-safe wrapper for a given list. When setting data in the list (i.e. list1[someNumber] = anotherNumber;), does the wrapper automatically take care of locking the list, or do you still need to lock it?
When two threads are accessing the exact same resource (e.g., not local copies, but actually the same copy of the same resource), a number of things can happen. In the most obvious scenario, if Thread #1 is accessing a resource and Thread #2 changes it mid-read, some unpredictable behavior can happen. Even with something as simple as an integer, you could have logic errors arise, so try to imagine the horrors that can result from improperly using something more complicated, like a database access class that's declared as static.
The classical way of handling this problem is to put a lock on the sensitive resources so only one thread can use it at a time. So in the above example, Thread #1 would request a lock to a resource and be granted it, then go in to read what it needs to read. Thread #2 would come along mid-read and request a lock to the resource, but be denied and told to wait because Thread #1 is using it. When Thread #1 finishes, it releases the lock and it's OK for Thread #2 to proceed.
There are other situations, but this illustrates one of the most basic problems and solutions. In C#, you may:
1) Use specific .NET objects that are managed as lockable by the framework (like Scorpion-Prince's link to SynchronizedCollection)
2) Use [MethodImpl(MethodImplOptions.Synchronized)] to dictate that a specific method that does something dangerous should only be used by one thread at a time
3) Use the lock statement to isolate specific lines of code that are doing something potentially dangerous
What approach is best is really up to your situation.
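For instance, option 3 might look roughly like this (the array, lock object, and methods below are illustrative assumptions, not code from the question):
private readonly object _arrayLock = new object();
private readonly int[] _values = new int[100];

public void SetValue(int index, int value)
{
    lock (_arrayLock)           // only one thread may touch _values at a time
    {
        _values[index] = value;
    }
}

public int GetValue(int index)
{
    lock (_arrayLock)
    {
        return _values[index];
    }
}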
If I have an array that can/will be accessed by multiple threads at
any given point in time, what exactly causes it to be non-thread safe,
and what would be the steps taken to ensure that the array would be
thread safe in most situations?
In general terms, the fact that the array is not thread-safe is the notion that two or more threads could be modifying the contents of the array if you do not synchronize access to it.
Speaking generally, for example, let's suppose you have thread 1 doing this work:
for (int i = 0; i < array.Length; i++)
{
    array[i] = "Hello";
}
And thread 2 doing this work (on the same shared array)
for (int i = 0; i < array.Length; i++)
{
    array[i] = "Goodbye";
}
There isn't anything synchronizing the threads, so your results will depend on which thread wins the race. The array could end up filled with "Hello" and "Goodbye" in some arbitrary mix, but each element will always be either 'Hello' or 'Goodbye'.
The actual write of the string 'Hello' or 'Goodbye' is guaranteed by the CLR to be atomic. That is to say, the writing of the value 'Hello' cannot be interrupted by a thread trying to write 'Goodbye'. One must occur before or after the other, never in between.
So you need to create some kind of synchronization mechanism to prevent the threads from stepping on each other. You can accomplish this by using a lock statement in C#.
.NET 3.0 and above provide a generic collection class called SynchronizedCollection, which "provides a thread-safe collection that contains objects of a type specified by the generic parameter as elements."
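A brief sketch of SynchronizedCollection<T> in use (it ships in the System.ServiceModel assembly; the values here are only illustrative):
var shared = new SynchronizedCollection<string>();

Parallel.For(0, 1000, i =>
{
    shared.Add(i % 2 == 0 ? "Hello" : "Goodbye");   // each Add takes the collection's internal lock
});

Console.WriteLine(shared.Count);   // always 1000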
Array's public static members are thread safe; instance members are not guaranteed to be. System.Array implements the ICollection interface, which defines the SyncRoot and IsSynchronized members to support a synchronization mechanism.
However, enumerating the array's items is not safe; the developer should use a lock statement to make sure there is no change to the array during the enumeration.
EX:
Array arrThreadSafe = new string[] { "We", "are", "safe" };
lock (arrThreadSafe.SyncRoot)
{
    foreach (string item in arrThreadSafe)
    {
        Console.WriteLine(item);
    }
}

Is this use of Parallel.ForEach() thread safe?

Essentially, I am working with this:
var data = input.AsParallel();
List<String> output = new List<String>();

Parallel.ForEach<String>(data, line => {
    String outputLine = "";
    // ** Do something with "line" and store result in "outputLine" **
    // Additionally, there are some this.Invoke statements for updating UI
    output.Add(outputLine);
});
Input is a List<String> object. The ForEach() statement does some processing on each value, updates the UI, and adds the result to the output List. Is there anything inherently wrong with this?
Notes:
Output order is unimportant
Update:
Based on feedback I've gotten, I've added a manual lock to the output.Add statement, as well as to the UI updating code.
Yes; List<T> is not thread safe, so adding to it ad-hoc from arbitrary threads (quite possibly at the same time) is doomed. You should use a thread-safe list instead, or add locking manually. Or maybe there is a Parallel.ToList.
Also, if it matters: insertion order will not be guaranteed.
This version is safe, though:
var output = new string[input.Count];
Parallel.ForEach<String>(data, (line, state, index) =>
{
    String outputLine = index.ToString();
    // ** Do something with "line" and store result in "outputLine" **
    // Additionally, there are some this.Invoke statements for updating UI
    output[index] = outputLine;
});
Here we are using index to write to a different array slot on each parallel call.
Is there anything inherently wrong with this?
Yes, everything. None of this is safe. Lists are not safe for updating on multiple threads concurrently, and you can't update the UI from any thread other than the UI thread.
The documentation says the following about the thread safety of List<T>:
Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
A List(Of T) can support multiple readers concurrently, as long as the collection is not modified. Enumerating through a collection is intrinsically not a thread-safe procedure. In the rare case where an enumeration contends with one or more write accesses, the only way to ensure thread safety is to lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
Thus, output.Add(outputLine) is not thread-safe and you need to ensure thread safety yourself, for example, by wrapping the add operation in a lock statement.
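For example (a sketch; Transform is a placeholder for the real per-line work):
List<String> output = new List<String>();
object outputLock = new object();

Parallel.ForEach(input, line =>
{
    String outputLine = Transform(line);   // placeholder for the real processing of "line"
    lock (outputLock)
    {
        output.Add(outputLine);            // only one thread adds at a time
    }
});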
When you want the results of a parallel operation, the PLINQ is more convenient than the Parallel class. You started well by converting your input to a ParallelQuery<T>:
ParallelQuery<string> data = input.AsParallel();
...but then you fed the data to the Parallel.ForEach, which treats it as a standard IEnumerable<T>. So the AsParallel() was wasted. It didn't provide any parallelization, only overhead. Here is the correct way to use PLINQ:
List<string> output = input
    .AsParallel()
    .Select(line =>
    {
        string outputLine = "";
        // ** Do something with "line" and store result in "outputLine" **
        return outputLine;
    })
    .ToList();
A few differences that you should have in mind:
The Parallel runs the code on the ThreadPool by default, but it's configurable. The PLINQ uses exclusively the ThreadPool.
The Parallel by default has unlimited parallelism (it uses all the available threads of the ThreadPool). The PLINQ uses by default at most Environment.ProcessorCount threads.
Regarding the order of the results, PLINQ doesn't preserve the order by default. In case you want to preserve the order, you can attach the AsOrdered operator.
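For example (same placeholder work as above):
List<string> output = input
    .AsParallel()
    .AsOrdered()   // results come back in the input's order
    .Select(line =>
    {
        string outputLine = "";
        // ** Do something with "line" and store result in "outputLine" **
        return outputLine;
    })
    .ToList();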
