I have this test:
public void Run()
{
    var result = new List<int>();
    int i = 0;
    Parallel.For(0, 100000, new Action<int>((counter) =>
    {
        i++;
        if (counter == 99999)
        {
            Trace.WriteLine("i is " + i);
        }
    }));
}
Now why does the output print seemingly random numbers in the range of 50000 to 99999?
I expected the output to always be 99999. Have I misunderstood the parallel for loop implementation?
If I run the loop for only 100 iterations, the program outputs 100, as expected. FYI, I have an 8-core CPU.
UPDATE:
Of course! I missed the thread-safety aspect of it :) thanks! Now let's see which one is faster: using lock, declaring the variable as volatile, or using Interlocked.
Your problem is probably due to i++ not being thread-safe, which puts your tasks in a race condition.
Further explanation about i++ not being thread-safe you can find here: Are incrementers / decrementers (var++, var--) etc thread safe?
A quote of the answer given by Michael Burr in the aforementioned linked thread (upvote it there):
You can use something like InterlockedIncrement() depending on your
platform. On .NET you can use the Interlocked class methods
(Interlocked.Increment() for example).
As Rob Kennedy mentioned, even if the operation is implemented in terms
of a single INC instruction, as far as the memory is concerned a
read/increment/write set of steps is performed. There is the
opportunity on a multi-processor system for corruption.
There's also the volatile issue, which would be a necessary part of
making the operation thread-safe - however, marking the variable
volatile is not sufficient to make it thread-safe. Use the interlocked
support the platform provides.
This is true in general, and on x86/x64 platforms certainly.
The race to Trace.WriteLine()
Between the time you do ++i and output i, other parallel tasks might have changed/incremented i several times.
Imagine your first task, which increments i so that it becomes 1. However, depending on your runtime environment and the weather of the day, i might be incremented twenty more times by other parallel tasks before the first task outputs the variable -- which would now be 21 (not 1 anymore). To prevent this, use a local variable that remembers the value of the incremented i for that particular task, for later processing/output:
int remember = Interlocked.Increment(ref i);
...
Trace.WriteLine("i of this task is: " + remember);
Because your code is not thread-safe: i++ is a read-modify-write operation, which is not atomic.
Use Interlocked.Increment instead, or get a lock around it.
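For example, a minimal sketch of the corrected test (assuming the same Trace output as the question): increment atomically inside the loop and print once after it returns, since Parallel.For blocks until all iterations have finished:

int i = 0;
Parallel.For(0, 100000, counter =>
{
    Interlocked.Increment(ref i); // atomic read-modify-write
});
Trace.WriteLine("i is " + i); // reliably 100000 every run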
Is the ++ operator thread safe?
Parallel.For runs the iterations in parallel, so it does not mean the loop runs in order from 0 to 100000; it can start by running the delegate with 99999 first. That is why you get an arbitrary value of i.
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
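A quick way to see both effects (out-of-order iterations and work spread across threads) is a small sketch like the following, which just logs the iteration index together with the managed thread id:

// Sketch: observe which thread runs which iteration.
Parallel.For(0, 10, i =>
    Console.WriteLine("iteration " + i + " on thread " +
                      Thread.CurrentThread.ManagedThreadId));
// Typical output interleaves several thread ids, and iteration 9
// may well print before iteration 0.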
Related
I am trying to understand how Parallel.Invoke creates and reuses threads.
I ran the following example code (from MSDN, https://msdn.microsoft.com/en-us/library/dd642243(v=vs.110).aspx):
using System;
using System.Threading;
using System.Threading.Tasks;
class ThreadLocalDemo
{
    static void Main()
    {
        // Thread-Local variable that yields a name for a thread
        ThreadLocal<string> ThreadName = new ThreadLocal<string>(() =>
        {
            return "Thread" + Thread.CurrentThread.ManagedThreadId;
        });

        // Action that prints out ThreadName for the current thread
        Action action = () =>
        {
            // If ThreadName.IsValueCreated is true, it means that we are not the
            // first action to run on this thread.
            bool repeat = ThreadName.IsValueCreated;
            Console.WriteLine("ThreadName = {0} {1}", ThreadName.Value, repeat ? "(repeat)" : "");
        };

        // Launch eight of them. On 4 cores or less, you should see some repeat ThreadNames
        Parallel.Invoke(action, action, action, action, action, action, action, action);

        // Dispose when you are done
        ThreadName.Dispose();
    }
}
As I understand it, Parallel.Invoke tries to create 8 threads here - one for each action. So it creates the first thread, runs the first action, and by that gives a ThreadName to the thread. Then it creates the next thread (which gets a different ThreadName) and so on.
If it cannot create a new thread, it will reuse one of the threads created before. In this case, the value of repeat will be true and we can see this in the console output.
Is this correct until here?
The second-last comment ("Launch eight of them. On 4 cores or less, you should see some repeat ThreadNames") implies that the threads created by Invoke correspond to the available cpu threads of the processor: on 4 cores we have 8 cpu threads, at least one is busy (running the operating system and stuff), so Invoke can only use 7 different threads, so we must get at least one "repeat".
Is my interpretation of this comment correct?
I ran this code on my PC which has an Intel® Core™ i7-2860QM processor (i.e. 4 cores, 8 cpu threads). I expected to get at least one "repeat", but I didn't. When I changed the Invoke to take 10 instead of 8 actions, I got this output:
ThreadName = Thread6
ThreadName = Thread8
ThreadName = Thread6 (repeat)
ThreadName = Thread5
ThreadName = Thread3
ThreadName = Thread1
ThreadName = Thread10
ThreadName = Thread7
ThreadName = Thread4
ThreadName = Thread9
So I have at least 9 different threads in the console application. This contradicts the fact that my processor only has 8 threads.
So I guess some of my reasoning from above is wrong. Does Parallel.Invoke work differently than what I described above? If yes, how?
If you pass fewer than 10 items to Parallel.Invoke, and you don't specify MaxDegreeOfParallelism in the options (so - your case), it will just run them all in parallel on the thread pool scheduler using roughly the following code:
var actions = new [] { action, action, action, action, action, action, action, action };
var tasks = new Task[actions.Length];
for (int index = 1; index < tasks.Length; ++index)
    tasks[index] = Task.Factory.StartNew(actions[index]);
tasks[0] = new Task(actions[0]);
tasks[0].RunSynchronously();
Task.WaitAll(tasks);
So it is just a regular Task.Factory.StartNew. If you look at the maximum number of threads in the thread pool:
int th, io;
ThreadPool.GetMaxThreads(out th, out io);
Console.WriteLine(th);
You will see some big number, like 32767. So the number of threads on which Parallel.Invoke will execute (in your case) is not limited to the number of CPU cores at all. Even on a 1-core CPU it might run 8 threads in parallel.
You might then ask: why are some threads reused at all? Because when work on a thread pool thread is done, that thread is returned to the pool and is ready to accept new work. The actions from your example do basically no work and complete very fast. So sometimes the first thread started via Task.Factory.StartNew has already completed your action and been returned to the pool before all the subsequent threads have started. That thread is then reused.
By the way, you can see (repeat) in your example with 8 actions, and even with 7 if you try hard enough, on an 8-core (16 logical cores) processor.
UPDATE to answer your comment. The thread pool scheduler will not necessarily create new threads immediately. There are min and max numbers of threads in the thread pool; how to see the max I already showed above. To see the min number:
int th, io;
ThreadPool.GetMinThreads(out th, out io);
This number will usually be equal to the number of cores (so, for example, 8). Now, when you request a new action to be performed on a thread pool thread and the number of threads in the pool is less than the minimum, a new thread is created immediately. However, if the number of available threads is greater than the minimum, a certain delay is introduced before a new thread is created (unfortunately I don't remember exactly how long, but about 500 ms).
I highly doubt the statement you added in your comment executes in 2-3 seconds; for me it executes in 0.3 seconds at most. So when the first 8 threads are created by the thread pool, there is that ~500 ms delay before creating the 9th. During that delay, some (or all) of the first 8 threads have completed their job and are available for new work, so there is no need to create a new thread and they can be reused.
To verify this, introduce bigger delay:
static void Main()
{
    // Thread-Local variable that yields a name for a thread
    ThreadLocal<string> ThreadName = new ThreadLocal<string>(() =>
    {
        return "Thread" + Thread.CurrentThread.ManagedThreadId;
    });

    // Action that prints out ThreadName for the current thread
    Action action = () =>
    {
        // If ThreadName.IsValueCreated is true, it means that we are not the
        // first action to run on this thread.
        bool repeat = ThreadName.IsValueCreated;
        Console.WriteLine("ThreadName = {0} {1}", ThreadName.Value, repeat ? "(repeat)" : "");
        Thread.Sleep(1000000); // hold the thread so it cannot be reused
    };

    int th, io;
    ThreadPool.GetMinThreads(out th, out io);
    Console.WriteLine("cpu:" + Environment.ProcessorCount);
    Console.WriteLine(th);

    // Enumerable.Repeat requires "using System.Linq;"
    Parallel.Invoke(Enumerable.Repeat(action, 100).ToArray());

    // Dispose when you are done
    ThreadName.Dispose();
    Console.ReadKey();
}
You will see that now the thread pool has to create new threads every time (many more than there are cores), because it cannot reuse previous threads while they are busy.
You can also increase the minimum number of threads in the thread pool, like this:
int th, io;
ThreadPool.GetMinThreads(out th, out io);
ThreadPool.SetMinThreads(100, io);
This will remove the delay (until 100 threads are created), and in the above example you will notice the difference.
Behind the scenes, threads are organized by (and owned by) the task scheduler. The primary purpose of the task scheduler is to keep all CPU cores as busy as possible with useful work.
Under the hood, the scheduler uses the thread pool, and the size of the thread pool is the knob for fine-tuning how usefully the CPU cores are occupied.
Now this requires some analysis. For instance, thread switching costs CPU cycles and is not useful work. On the other hand, while one thread executes one task on a core, all other tasks are stalled and make no progress on that core. I believe that is the core reason why the scheduler usually starts two threads per core, so that at least some movement is visible in case one task takes longer to complete (like several seconds).
There are corollaries to this basic mechanism. When some tasks take a long time to complete, the scheduler starts new threads to compensate. That means the long-running task will now have to compete for the core with short-running tasks. In that way, short tasks are completed one after another, and the long task slowly progresses to its completion as well.
The bottom line is that your observations about threads are generally correct, but not entirely true in specific situations. In a concrete execution of a number of tasks, the scheduler might choose to spin up more threads, or to keep going with the default. That is why you will sometimes notice that the number of threads differs.
Remember the goal of the game: utilize CPU cores with useful work as much as possible, while at the same time keeping all tasks moving, so that the application doesn't look frozen. Historically, people tried to reach these goals with many different techniques. Analysis showed that many of those techniques were applied haphazardly and didn't really increase CPU utilization. That analysis led to the introduction of task schedulers in .NET, so that the fine-tuning could be coded once and done well.
So I have at least 9 different threads in the console application. This contradicts the fact that my processor only has 8 threads.
A thread is a very much overloaded term. It can mean, at the very least: (1) something you sew with, (2) a bunch of code with associated state, that is represented by an OS handle, and (3) an execution pipeline of a CPU. The Thread.CurrentThread refers to (2), the "processor thread" that you mentioned refers to (3).
The existence of a (2)-thread is not predicated on the existence of (3)-thread, and the number of (2)-threads that exist on any particular system is pretty much limited by available memory and OS design. The existence of (2)-thread doesn't imply execution of (2)-thread at any given time (unless you use APIs that guarantee that).
Furthermore, if a (2)-thread executes at some point - implying a temporary 1:1 binding between (2)-thread and (3)-thread, there is no implication that the thread will continue executing in general, and of course neither is there an implication that the thread will continue executing on the same (3)-thread if it continues executing at all.
So, even if you have "caught" the execution of a (2)-thread on a (3)-thread by some side effect, e.g. console output, as you did, that doesn't necessarily imply anything about any other (2)-threads and (3)-threads at that point.
On to your code:
// If ThreadName.IsValueCreated is true, it means that we are not the
// first action to run on this thread. <-- this refers to (2)-thread, NOT (3)-thread.
Parallel.Invoke is not precluded (in terms of its specification) from creating as many new (2)-threads as there are arguments passed to it. The actual number of (2)-threads created may be anywhere from zero on up, since to call Parallel.Invoke there must already be an existing (2)-thread executing the code that calls this API. So, no new (2)-threads need to be created at all, for example. Whether the (2)-threads created by Parallel.Invoke execute on any particular number of (3)-threads concurrently is beyond your control as well.
So that explains the behavior you see. You conflated (2)-threads with (3)-threads, and assumed that Parallel.Invoke does something specific that it is in fact not guaranteed to do. Citing the documentation:
No guarantees are made about the order in which the operations execute or whether they execute in parallel.
This implies that Invoke is free to run the actions on dedicated (2)-threads if it so wishes. And that is what you observed.
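To make the distinction concrete, here is a minimal sketch showing that the number of (2)-threads is not bounded by the number of (3)-threads; it deliberately starts far more managed threads than any desktop machine has logical cores:

Console.WriteLine("Logical cores: " + Environment.ProcessorCount);
var threads = new List<Thread>();
for (int i = 0; i < 100; i++)
    threads.Add(new Thread(() => Thread.Sleep(1000))); // 100 (2)-threads
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
Console.WriteLine("100 managed threads ran to completion.");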
I am building an optimization program using genetic algorithms. I used Parallel.For in order to decrease the run time, but it caused a problem, which the code below reproduces:
class Program
{
    static void Main(string[] args)
    {
        int j = 0;
        Parallel.For(0, 10000000, i =>
        {
            j++;
        });
        Console.WriteLine(j);
        Console.ReadKey();
    }
}
Every time I run the program above, it writes a different value of j between 0 and 10000000. I guess it doesn't wait for all iterations to finish and just passes on to the next line.
How am I supposed to solve this problem? Any help will be appreciated. Thanks.
Edit:
The Interlocked.Increment(ref j); statement solves the unexpected results, but this operation takes about 10 times longer than a normal for loop.
You could use the Interlocked.Increment(Int32) method, which would probably be the easiest fix.
Using Parallel.For will create multiple threads which execute the same lambda expression; in this case, all it does is j++.
j++ compiles to something like j = j + 1, which is a separate read and a write. This can cause unwanted behavior.
Say that j = 50.
Thread 1 executes the read for j++, gets 50, and adds 1 to it. Before that thread can finish the write to j, a second thread performs its own read and also gets 50. The first thread then finishes its write, making j equal to 51; but the second thread still has 50 in memory as the value of j, adds 1 to it, and writes 51 back to j again.
Using the Interlocked class makes sure that the whole increment happens atomically.
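If the atomic increment itself becomes the bottleneck (as the asker's edit notes), a sketch using the thread-local overload of Parallel.For keeps each worker's subtotal private and synchronizes only once per worker thread:

int j = 0;
Parallel.For(0, 10000000,
    () => 0,                                 // localInit: per-thread subtotal
    (i, state, local) => local + 1,          // body: no shared writes at all
    local => Interlocked.Add(ref j, local)); // localFinally: one merge per thread
Console.WriteLine(j); // 10000000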
Your access to j is not synchronized. Please read a basic book or tutorial on multi-threading and synchronization.
Parallel.For does wait for all iterations.
Using synchronization (and thereby defeating the purpose of the parallel for):
class Program
{
    static void Main(string[] args)
    {
        object sync = new object();
        int j = 0;
        Parallel.For(0, 10000000, i =>
        {
            lock (sync)
            {
                j++;
            }
        });
        Console.WriteLine(j);
        Console.ReadKey();
    }
}
Parallel.For does wait for all iterations to finish. The reason you're seeing unexpected values in your variable is different - and it is expected.
Basically, Parallel.For dispatches the iterations to multiple threads (as you would expect). However, multiple threads can't share the same writable memory without some kind of guarding mechanism; if they do, you have a data race and the result is nondeterministic. This applies in all programming languages and is the fundamental caveat of multithreading.
There are many kinds of guards you can put in place, depending on your use case. The fundamental way they work is through atomic operations, which are accessible to you through the Interlocked helper class. Higher-level guards include the Monitor class, the related lock language construct and classes like ReaderWriterLock (and its siblings).
I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx) {
    Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
    startIdx = endIdx; endIdx = jobs.Count;
}
This means that the Parallel.For needs to synchronize multiple times. Consider a breadth-first graph algorithm; the number of synchronizations would be quite large. A waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job>();
queue.Enqueue(firstJob);
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
    threads.Add(new Thread(new ThreadStart(delegate
    {
        while (!queue.IsEmpty || numBusy > 0)
        {
            if (queue.IsEmpty)
                // numBusy > 0 implies more data may arrive
                waitHandle.WaitOne();
            Job job;
            if (queue.TryDequeue(out job))
            {
                Interlocked.Increment(ref numBusy);
                WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
                Interlocked.Decrement(ref numBusy);
            }
        }
        // others are possibly waiting for us to enable more work which won't happen
        waitHandle.Set();
    })));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster as well! Is the task scheduler just that good? The synchronizations were eliminated, there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
The Parallel.ForEach() will use a limited number of threads and re-use them. The second sample starts out way behind by having to create a number of threads, which takes time.
And what is the value of maxThreads? It is very critical; in Parallel.ForEach() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And TPL uses work-stealing and other adaptive technologies. You'll have a hard time to do any better.
Parallel.For doesn't actually break the items into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. Each thread then synchronously processes its batch (possibly using work stealing, or saving some extra items to load-balance near the end). With this approach the worker threads virtually never wait on each other, while your threads are constantly waiting on each other due to the heavy synchronization you use before/after every single iteration.
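A hedged sketch of the same batching idea done by hand, reusing jobs and WorkJob from the question, with the built-in range partitioner (Partitioner lives in System.Collections.Concurrent):

Parallel.ForEach(Partitioner.Create(0, jobs.Count), range =>
{
    // Each worker owns a whole contiguous sub-range, so there is no
    // per-iteration synchronization.
    for (int i = range.Item1; i < range.Item2; i++)
        WorkJob(jobs[i]);
});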
On top of that, since it uses thread pool threads, many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The thread pool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, and so on. It's possible that you're using too many threads, or not enough.
Also, since the number of items that you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation that you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
job => job.DoWork());
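One way the wiring might look if WorkJob itself discovers new jobs; this is only a sketch, and it assumes WorkJob returns whatever follow-up jobs it finds (the question's WorkJob does not):

var queue = new BlockingCollection<Job>();
int pending = 1; // jobs queued or currently running
queue.Add(firstJob);

Parallel.ForEach(queue.GetConsumingEnumerable(), job =>
{
    foreach (Job child in WorkJob(job)) // hypothetical: returns new jobs
    {
        Interlocked.Increment(ref pending);
        queue.Add(child);
    }
    if (Interlocked.Decrement(ref pending) == 0)
        queue.CompleteAdding(); // nothing queued, nothing running: done
});

On .NET 4.5 and later, wrapping the enumerable with Partitioner.Create(queue.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering) also stops the default chunking partitioner from holding dequeued items back.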
You're creating a bunch of new threads, while Parallel.For uses a thread pool. You'd see better performance if you used the thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.
I have this simple code (which I run in LINQPad):
void Main()
{
    for (int i = 0; i < 10; i++)
    {
        int tmp = i;
        new Thread(() => doWork(tmp)).Start();
    }
}

public void doWork(int h)
{
    h.Dump();
}
The int tmp = i; line is for variable capture, so that each iteration has its own value.
Two problems:
1) The numbers are not sequential, even though the threads are started sequentially!
2) Sometimes I get fewer than 10 numbers!
Here are some execution outputs:
Questions:
1) Why is case 1 happening, and how can I solve it?
2) Why is case 2 happening, and how can I solve it?
It should not be expected that they are sequential. Each thread gets priority as the kernel chooses. It might happen that they look sequential, purely by the nature of when each is started, but that is pure chance.
To ensure that they all complete, mark each new thread with IsBackground = false so that it keeps the executable alive. For example:
new Thread(() => doWork(tmp)) { IsBackground = false }.Start();
Threads execute in unpredictable order, and if the main thread finishes before the others, you won't get all the numbers (Dump() will not execute). If you mark your threads with IsBackground = false you'll get them all. There's no real solution for the first one except not using threads (or joining the threads, which is the same thing really).
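For example, a sketch of the joining variant, reusing the question's doWork; Main blocks until every worker has finished:

var threads = new List<Thread>();
for (int i = 0; i < 10; i++)
{
    int tmp = i;
    var t = new Thread(() => doWork(tmp));
    t.Start();
    threads.Add(t);
}
threads.ForEach(t => t.Join()); // wait for all ten before Main returns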
You shouldn't expect any ordering between threads.
If you start a new thread, it is merely added to the operating system's management structures. Eventually the thread scheduler will come around and allocate a time slice for the thread. It may do this in a round-robin fashion, pick a random one, or use heuristics to determine which one looks most important (e.g. one that owns a window in the foreground), and so on.
If the order of the outputs is relevant, you can either sort them afterwards or - if you know the ordering before the work begins - use an array where each thread is given an index into which it writes its result.
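A sketch of the array approach: each thread writes only to its own slot, so no locking is needed and the results come out in index order (the squaring is just placeholder work):

int[] results = new int[10];
var threads = new List<Thread>();
for (int i = 0; i < 10; i++)
{
    int idx = i;
    var t = new Thread(() => results[idx] = idx * idx); // own slot only
    t.Start();
    threads.Add(t);
}
threads.ForEach(t => t.Join());
foreach (int r in results)
    Console.WriteLine(r); // index order, regardless of scheduling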
Creating new threads the way your example does is also very slow. For micro tasks, using the thread pool is at least one order of magnitude faster.
The nature of thread scheduling is random. You can solve both tasks, but the overhead is too big.
The problem appears to be that multiple threads contend on the console (or whatever you use for Dump); overriding the synchronization mechanism is possible but complicated, and it will reduce performance.
You exit before all threads are invoked (see the answer by @Marc Gravell).
If ordering is important, you may want to use a shared queue and a semaphore to ensure that only one thread operates on the head of the queue at a time.
You can order thread execution, but it has to be done specifically by you for the specific problem with a specific solution.
E.g.: you would like threads 1, 2, 3 to complete phase 1 of your code, and then proceed to the next phase in the order of their IDs (IDs that you have assigned).
You can use semaphores to achieve this behavior; search for block synchronization, mutual exclusion, and the test-and-set method.
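A sketch of that phase ordering, using a Barrier for the end of phase 1 and a chain of SemaphoreSlim gates for the ID-ordered phase 2 (the three-worker setup is an assumption for illustration):

var barrier = new Barrier(3); // all workers must finish phase 1 first
var gates = new[] { new SemaphoreSlim(1), new SemaphoreSlim(0), new SemaphoreSlim(0) };
for (int id = 0; id < 3; id++)
{
    int myId = id;
    new Thread(() =>
    {
        Console.WriteLine("phase 1 done: " + myId); // any order
        barrier.SignalAndWait();                    // wait for the others
        gates[myId].Wait();                         // wait for my turn
        Console.WriteLine("phase 2: " + myId);      // strictly 0, 1, 2
        if (myId + 1 < gates.Length)
            gates[myId + 1].Release();              // unblock the next ID
    }).Start();
}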
Today I tried some optimization of a foreach statement that works on an XDocument.
Before optimization:
foreach (XElement elem in xDoc.Descendants("APSEvent").ToList())
{
    //some operations
}
After optimization:
Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(), elem =>
{
    //same operations
});
I saw that in Parallel.ForEach(...) .NET opened ONLY one thread! As a result, the Parallel version took more time than the standard foreach.
Why do you think .NET opened only one thread? Because of locking of the file?
Thanks
It's by design that Parallel.ForEach may use fewer threads than requested to achieve better performance. According to MSDN [link]:
By default, the Parallel.ForEach and Parallel.For methods can use a variable number of tasks. That's why, for example, the ParallelOptions class has a MaxDegreeOfParallelism property instead of a "MinDegreeOfParallelism" property. The idea is that the system can use fewer threads than requested to process a loop.
The .NET thread pool adapts dynamically to changing workloads by allowing the number of worker threads for parallel tasks to change over time. At run time, the system observes whether increasing the number of threads improves or degrades overall throughput and adjusts the number of worker threads accordingly.
From the problem description, there is nothing that explains why the TPL is not spawning more threads.
There is no evidence in the question that this is even the problem, and that can be checked quite easily: log the thread id before you enter the loop, and as the first thing you do inside the loop.
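For instance, a sketch of that diagnostic applied to the loop from the question:

Console.WriteLine("outside: thread " + Thread.CurrentThread.ManagedThreadId);
Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(), elem =>
{
    Console.WriteLine("inside: thread " + Thread.CurrentThread.ManagedThreadId);
    //same operations
});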
If it is always the same number, the TPL is failing to spawn threads. You should then try different versions of your code to see what change triggers the TPL to serialize everything. One reason could be a small number of elements in your list: the TPL partitions your collection, and if you have only a few items, you might end up with only one batch. This behavior is configurable, by the way.
If instead you are inadvertently taking a lock in the loop, you will see lots of different thread ids but no speedup. In that case, simplify the code until the problem vanishes.
The parallel way is not always faster than the "old-fashioned way".
http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/c860cf3f-f7a6-46b5-8a07-ca2f413258dd
use it like this:
int ParallelThreads = 10;
Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(),
    new ParallelOptions() { MaxDegreeOfParallelism = ParallelThreads },
    (myXDOC, i, j) =>
    {
        //do whatever you want here
    });
Yes, exactly: XDocument.Load(...) locks the file, and due to resource contention between threads the TPL is unable to use the power of multiple threads. Try loading the XML into a Stream first and then use Parallel.For(...).
Do you happen to have a single processor? TPL may limit the number of threads to one in that case. The same thing may happen if the collection is very small. Try a bigger collection.
See this answer for more details on how the degree of parallelism is determined.