I wanted to stress test my new CPU. I made this in about 2 mins.
When I add more threads, efficiency decreases dramatically. These are the results (please note I set the priority in Task Manager to High):
1 Thread: After one minute on 1 thread(s), you got to the number/prime 680263811
2 Threads: After one minute on 2 thread(s), you got to the number/prime 360252913
4 Threads: After one minute on 4 thread(s), you got to the number/prime 216150449
There are problems with the code; I just made it as a test, so please don't bash me for it being written horribly... I've kind of had a bad day
static void Main(string[] args)
{
    Console.Write("Stress test, how many OS threads?: ");
    int thr = int.Parse(Console.ReadLine());
    Thread[] t = new Thread[thr];
    Stopwatch s = new Stopwatch();
    s.Start();
    UInt64 it = 0;
    UInt64 prime = 0;
    for (int i = 0; i < thr; i++)
    {
        t[i] = new Thread(delegate()
        {
            while (s.Elapsed.TotalMinutes < 1)
            {
                it++;
                if (it % 2 != 0) // im 100% sure that x % 2 does not give primes, but it uses up the cpu, so idc
                {
                    prime = it;
                }
            }
            Console.WriteLine("After one minute on " + t.Length + " thread(s), you got to the number/prime " + prime.ToString()); // idc if this prints 100 times, this is just a test
        });
        t[i].Start();
    }
    Console.ReadLine();
}
Question: Can someone explain these unexpected results?
Your threads are incrementing it without any synchronization, so you're going to get weird results like this. Worse, you're also assigning prime without any synchronization.
thread 1: reads 0 from it, then gets unscheduled by the OS for whatever reason
thread 2: reads 0 from it, then increments to 1
thread 2: does work, assigns 1 to prime
... thread 2 repeats for a while. thread 2 is now up to 7, and is about to check if (it % 2 != 0)
thread 1: regains the CPU
thread 1: increments it to 1
thread 2: assigns 1 to prime --- wat?
The possibilities get even worse once a bit in the high half of it is changing, because 64-bit reads and writes are not atomic either (on a 32-bit runtime). These numbers are a little larger than in the question, as if after running for longer, but wildly variable results would be possible. Consider:
After some time it = 0x00000000_FFFFFFFF
thread 1 reads it (both words)
thread 2 reads the higher word 0x00000000_????????
thread 1 calculates it + 1 0x00000001_00000000
thread 1 writes it (both halves)
thread 2 reads the lower word (and puts this with the already-read half) 0x00000000_00000000
While thread 1 was incrementing to 4294967296, thread 2 managed to read 0.
Note that applying the volatile keyword would not fix this. C# does not even allow it here: volatile is only valid on fields (not on locals captured by a delegate), and not on 64-bit integer fields at all. Even where it is allowed, volatile only guarantees visibility; it++ remains a non-atomic read-modify-write. Use Interlocked.Increment instead.
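For reference, here is a minimal sketch of the same kind of loop with a thread-safe counter (thread and iteration counts are arbitrary choices for illustration). Interlocked.Increment makes the read-modify-write atomic, and it is a full 64-bit operation, so it also avoids the torn reads described above:

```csharp
using System;
using System.Threading;

class SafeCounter
{
    static long it;

    public static long Run(int threadCount, int perThread)
    {
        it = 0;
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < perThread; n++)
                {
                    // Atomic read-modify-write; also atomic as a single
                    // 64-bit access, so no torn reads on 32-bit runtimes.
                    Interlocked.Increment(ref it);
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return Interlocked.Read(ref it);
    }

    static void Main()
    {
        Console.WriteLine(Run(4, 1000000)); // always 4000000
    }
}
```

Note this makes the counter correct, not fast: every thread now contends for the same cache line, which is also the reason the original test slows down as threads are added. A real stress test should give each thread its own private counter.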
I have the following program (which I got from http://blogs.msdn.com/b/csharpfaq/archive/2010/06/01/parallel-programming-in-net-framework-4-getting-started.aspx) that splits a task up using a Parallel.For loop:
class Program
{
    static void Main(string[] args)
    {
        var watch = Stopwatch.StartNew();
        Parallel.For(2, 20, (i) =>
        {
            var result = SumRootN(i);
            Console.WriteLine("root {0} : {1} ", i, result);
        });
        Console.WriteLine(watch.ElapsedMilliseconds);
        Console.ReadLine();
    }

    public static double SumRootN(int root)
    {
        double result = 0;
        for (int i = 1; i < 10000000; i++)
        {
            result += Math.Exp(Math.Log(i) / root);
        }
        return result;
    }
}
When I run this test several times I get times of:
1992, 2140, 1783, 1863 ms etc etc.
My first question is: why are the times always different? I am doing the exact same calculations each time, yet the times vary.
Now when I add in the following code to make use of all the available processors on my CPU:
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount // on my CPU this is 8
};
Parallel.For(2, 20, parallelOptions, (i) =>
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
});
I notice that the execution times actually increase! The times are now:
2192, 3192, 2603, 2245 ms etc etc.
Why does this cause the times to increase? Am I using this wrong?
From http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx
By default, For and ForEach will utilize however many threads the underlying scheduler provides. Changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
This means that setting MaxDegreeOfParallelism to the number of processors can actually prevent the Parallel.For loop from using the optimal number of threads for the workload. For example, I have a migration job that uses close to 60 threads on around 600 iterations of long-running code, far more than the one-thread-per-processor limit that you're trying to set.
MaxDegreeOfParallelism or ThreadPool.SetMaxThreads should only be used if you explicitly need to prevent more than a given number of threads from executing. For example, if using an Access database, I would set it to 64, because that's the maximum number of concurrent connections that can be handled by Access for a single process.
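If you want to see the cap take effect (a sketch with arbitrary numbers, not a benchmark), you can count how many loop bodies are ever in flight at once:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class MdopDemo
{
    // Runs a Parallel.For with the given cap and reports the highest
    // number of iterations that were ever executing simultaneously.
    public static int PeakConcurrency(int maxDegree)
    {
        int current = 0, peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };
        Parallel.For(0, 100, options, i =>
        {
            int now = Interlocked.Increment(ref current);
            InterlockedMax(ref peak, now);
            Thread.Sleep(5); // simulate a little work
            Interlocked.Decrement(ref current);
        });
        return peak;
    }

    // Lock-free "store value into target if it is a new maximum".
    static void InterlockedMax(ref int target, int value)
    {
        int snapshot;
        while (value > (snapshot = Volatile.Read(ref target)) &&
               Interlocked.CompareExchange(ref target, value, snapshot) != snapshot) { }
    }

    static void Main()
    {
        Console.WriteLine(PeakConcurrency(2)); // never exceeds 2
    }
}
```

With no ParallelOptions at all, the same probe typically reports a number that grows toward whatever the scheduler decides to use.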
I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?
Here is my test code, sequential:
var loops = 200;
var perloop = 10000000;
var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
    var sumk = 0.0;
    for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
    sum += sumk;
}
and parallel:
sum = 0.0;
Parallel.For(0, loops,
    k =>
    {
        var sumk = 0.0;
        for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
        sum += sumk;
    });
The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.
In Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.
In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set
ParallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount
but to no avail. It is not clear to me why all of the CPU power is not assigned to the parallel calculation.
I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.
UPDATE
It was pointed out in the comments that my CPU has only 4 physical cores, which are visible to the system as 8 logical cores when hyper-threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.
With hyper-threading disabled, my calculations are now faster, both the parallel and also the (what I thought was) sequential for loop. CPU utilization during the for loop is up to approx. 45% (!!!) and 100% during the Parallel.For loop.
Computation time for the for loop 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (25% better than when hyper-threading is enabled). Performance ratio with Parallel.For is now only 2.5, running on 4 real cores.
So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand, it is intriguing that CPU utilization is so high during the for loop. Could there be some kind of internal parallelization going on in this loop as well?
Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable each core will have to get access to the same place in system memory, or wait for the other core to finish before accessing it.
You can avoid corruption without locks by using the lighter Interlocked.Add method, which adds a value to the sum atomically via an interlocked CPU instruction, but you will still get delays due to contention.
The proper way to do this is to have each thread accumulate a thread-local partial sum and add all the partial sums to a single global sum at the end. Parallel.For has an overload that does just this. MSDN even has an example using summation: How to: Write a Parallel.For Loop with Thread-Local Variables.
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

// Use type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
    {
        subtotal += nums[j];
        return subtotal;
    },
    (x) => Interlocked.Add(ref total, x)
);
Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.
Parallel.For and Parallel.ForEach will use a degree of parallelism that it feels is appropriate, balancing the cost to setup and tear down threads and the work it expects each thread will perform. .NET 4.5 made several improvements to performance (including more intelligent decisions on the number of threads to spin up) compared to previous .NET versions.
Note that, even if it were to spin up one thread per core, context switches, false sharing issues, resource locks, and other issues may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).
I think the computation gain is so low because your code is "too easy": Parallel.For creates a new task for each iteration, so there is overhead in servicing all of those tasks on threads. I would write it like this:
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
Parallel.ForEach(
    Partitioner.Create(0, nums.Length),
    () => 0L, // the local sum must be a long, or it overflows
    (part, loopState, partSum) =>
    {
        for (int i = part.Item1; i < part.Item2; i++)
        {
            partSum += nums[i];
        }
        return partSum;
    },
    (partSum) =>
    {
        Interlocked.Add(ref total, partSum);
    }
);
The Partitioner will create an optimally sized chunk of the job for each task, so less time is spent servicing tasks on threads. If you can, please benchmark this solution and tell us whether it speeds things up.
foreach vs Parallel.ForEach, an example:
for (int i = 0; i < 10; i++)
{
    int[] array = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 };
    Stopwatch watch = new Stopwatch();
    watch.Start();

    // Parallel foreach (same inner iteration count as the plain foreach below,
    // otherwise the comparison is meaningless)
    Parallel.ForEach(array, line =>
    {
        for (int x = 0; x < 10000000; x++)
        {
        }
    });
    watch.Stop();
    Console.WriteLine("Parallel.ForEach {0}", watch.ElapsedMilliseconds);

    watch = new Stopwatch();

    // foreach
    watch.Start();
    foreach (int item in array)
    {
        for (int z = 0; z < 10000000; z++)
        {
        }
    }
    watch.Stop();
    Console.WriteLine("ForEach {0}", watch.ElapsedMilliseconds);
    Console.WriteLine("####");
}
Console.ReadKey();
My CPU
Intel® Core™ i7-620M Processor (4M Cache, 2.66 GHz)
I'm trying to understand the basics of multi-threading, so I built a little program that raised a few questions, and I'll be thankful for any help :)
Here is the little program:
class Program
{
    public static int count;
    public static int max;

    static void Main(string[] args)
    {
        int t = 0;
        DateTime Result;
        Console.WriteLine("Enter Max Number : ");
        max = int.Parse(Console.ReadLine());
        Console.WriteLine("Enter Thread Number : ");
        t = int.Parse(Console.ReadLine());
        count = 0;
        Result = DateTime.Now;
        List<Thread> MyThreads = new List<Thread>();
        for (int i = 1; i < 31; i++)
        {
            Thread Temp = new Thread(print);
            Temp.Name = i.ToString();
            MyThreads.Add(Temp);
        }
        foreach (Thread th in MyThreads)
            th.Start();
        while (count < max)
        {
        }
        Console.WriteLine("Finish , Took : " + (DateTime.Now - Result).ToString() + " With : " + t + " Threads.");
        Console.ReadLine();
    }

    public static void print()
    {
        while (count < max)
        {
            Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
            count++;
        }
    }
}
I checked this with some test runs:
I made the maximum number 100, and it seems that the fastest execution time is with 2 threads, which is 80% faster than with 10 threads.
Questions:
1) Threads 4-10 don't print even once; how can that be?
2) Shouldn't more threads be faster?
I made the maximum number 10000 and disabled printing.
With this configuration, 5 threads seems to be fastest.
Why is there a change compared to the first check?
And also in this configuration (with printing) all the threads print a few times. Why is that different from the first run where only a few threads printed?
Is there a way to make all the threads print one by one? In a line, or something like that?
Thank you very much for your help :)
Your code is certainly a first step into the world of threading, and you've just experienced the first (of many) headaches!
To start with, static may enable you to share a variable among the threads, but it does not do so in a thread safe manner. This means your count < max expression and count++ are not guaranteed to be up to date or an effective guard between threads. Look at the output of your program when max is only 10 (t set to 4, on my 8 processor workstation):
T0 - 0
T0 - 1
T0 - 2
T0 - 3
T1 - 0 // wait T1 got count = 0 too!
T2 - 1 // and T2 got count = 1 too!
T2 - 6
T2 - 7
T2 - 8
T2 - 9
T0 - 4
T3 - 1 // and T3 got count = 1 too!
T1 - 5
To your question about each thread printing one-by-one, I assume you're trying to coordinate access to count. You can accomplish this with synchronization primitives (such as the lock statement in C#). Here is a naive modification to your code which will ensure only max increments occur:
static object countLock = new object();

public static void printWithLock()
{
    // loop forever
    while (true)
    {
        // protect access to count using a static object
        // now only 1 thread can use 'count' at a time
        lock (countLock)
        {
            if (count >= max) return;
            Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
            count++;
        }
    }
}
This simple modification makes your program logically correct, but also slow. The sample now exhibits a new problem: lock contention. Every thread is now vying for access to countLock. We've made our program thread safe, but without any benefits of parallelism!
Threading and parallelism is not particularly easy to get right, but thankfully recent versions of .Net come with the Task Parallel Library (TPL) and Parallel LINQ (PLINQ).
The beauty of the library is how easy it would be to convert your current code:
var sw = new Stopwatch();
sw.Start();
Enumerable.Range(0, max)
    .AsParallel()
    .ForAll(number =>
        Console.WriteLine("T{0}: {1}",
            Thread.CurrentThread.ManagedThreadId,
            number));
Console.WriteLine("{0} ms elapsed", sw.ElapsedMilliseconds);
// Sample output from max = 10
//
// T9: 3
// T9: 4
// T9: 5
// T9: 6
// T9: 7
// T9: 8
// T9: 9
// T8: 1
// T7: 2
// T1: 0
// 30 ms elapsed
The output above is an interesting illustration of why threading produces "unexpected results" for newer users. When threads execute in parallel, they may complete chunks of code at different points in time or one thread may be faster than another. You never really know with threading!
Your print function is far from thread safe; that's why threads 4-10 don't print. All the threads share the same max and count variables.
The likely reason more threads slow you down is the context switching that takes place each time the processor changes focus between threads.
Also, when you create a lot of threads, the system has to allocate each one, and creating a distinct new thread is rather expensive. These days it is usually advisable to use Tasks instead, as they are drawn from a system-managed thread pool and thus don't necessarily require a fresh allocation.
Take a look here anyhow: http://msdn.microsoft.com/en-us/library/aa645740(VS.71).aspx
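To sketch that suggestion (the workload below is a stand-in), Task.Run draws threads from the managed pool instead of allocating a new Thread per job:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class TaskDemo
{
    public static long Run()
    {
        // Tasks are scheduled on the thread pool, so there is no
        // per-thread allocation cost as with new Thread(...).
        var tasks = Enumerable.Range(0, 4)
            .Select(i => Task.Run(() => (long)i * 1000))
            .ToArray();
        Task.WaitAll(tasks);
        return tasks.Sum(t => t.Result);
    }

    static void Main()
    {
        Console.WriteLine(Run()); // (0+1+2+3)*1000 = 6000
    }
}
```

Each Task also carries its result back safely via Result, which sidesteps the shared-counter problems from the original program.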
Look carefully:
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
    Thread Temp = new Thread(print);
    Temp.Name = i.ToString();
    MyThreads.Add(Temp);
}
I think you forgot to use the variable t: the loop always creates 30 threads (i < 31), regardless of the number you entered.
You should read up on parallel and multithreaded programming before writing more code, because a programming language is just a tool. Good luck!
Does it make sense to use a Parallel.ForEach loop in place of every normal foreach?
When should I start using Parallel.ForEach? Only when iterating over 1,000,000 items?
No, it doesn't make sense for every foreach. Some reasons:
Your code may not actually be parallelizable (for example, if you're using the "results so far" for the next iteration and the order is important).
If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly
If your work will complete very fast anyway, there's no benefit, and it may well slow things down
Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (It will be harder for things like debugging.) TPL is great, but it's no free lunch.
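To illustrate the aggregation point, here is a deliberate (not blind) parallel sum, sketched with PLINQ rather than Parallel.ForEach; its Sum operator keeps per-thread partial results and combines them at the end, so no manual locking is needed:

```csharp
using System;
using System.Linq;

class AggDemo
{
    public static long Run()
    {
        // AsParallel() partitions the range across worker threads;
        // Sum() merges the per-thread partial sums for us.
        return Enumerable.Range(1, 1000)
            .AsParallel()
            .Sum(x => (long)x);
    }

    static void Main()
    {
        Console.WriteLine(Run()); // 500500
    }
}
```

For work this cheap the parallel version is usually slower than a plain loop, which is exactly the "measure the impact" caveat above.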
No, you should definitely not do that. The important point here is not really the number of iterations, but the work to be done. If your work is really simple, executing 1000000 delegates in parallel will add a huge overhead and will most likely be slower than a traditional single threaded solution. You can get around this by partitioning the data, so you execute chunks of work instead.
E.g. consider the situation below:
Input = Enumerable.Range(1, Count).ToArray();
Result = new double[Count];
Parallel.ForEach(Input, (value, loopState, index) => { Result[index] = value*Math.PI; });
The operation here is so simple, that the overhead of doing this in parallel will dwarf the gain of using multiple cores. This code runs significantly slower than a regular foreach loop.
By using a partition we can reduce the overhead and actually observe a gain in performance.
Parallel.ForEach(Partitioner.Create(0, Input.Length), range =>
{
    for (var index = range.Item1; index < range.Item2; index++)
    {
        Result[index] = Input[index] * Math.PI;
    }
});
The moral of the story here is that parallelism is hard, and you should only employ it after looking closely at the situation at hand. Additionally, you should profile the code both before and after adding parallelism.
Remember that regardless of any potential performance gain parallelism always adds complexity to the code, so if the performance is already good enough, there's little reason to add the complexity.
The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can.
Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.
See also Does Parallel.ForEach limits the number of active threads? and Does Parallel.For use one Task per iteration?
The long answer:
We can classify loops by how they fall on two axes:
Few iterations through to many iterations.
Each iteration is fast through to each iteration is slow.
A third factor is if the tasks vary in duration very much – for instance if you are calculating points on the Mandelbrot set, some points are quick to calculate, some take much longer.
When there are few, fast iterations it's probably not worth using parallelisation in any way, most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application so optimise for readability not performance.
Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:
var tasks = new List<Task>(actions.Length);
foreach (var action in actions)
{
    tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());
Where there are many iterations, Parallel.ForEach is in its element.
The Microsoft documentation states that
When a parallel loop runs, the TPL partitions the data source so that
the loop can operate on multiple parts concurrently. Behind the
scenes, the Task Scheduler partitions the task based on system
resources and workload. When possible, the scheduler redistributes
work among multiple threads and processors if the workload becomes
unbalanced.
This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.
I ran some code.
The test results below show a machine with nothing else running on it, and no other threads from the .Net Thread Pool in use. This is not typical (in fact in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.
The test code is:
namespace ParallelTests
{
    class Program
    {
        private static int Fibonacci(int x)
        {
            if (x <= 1)
            {
                return 1;
            }
            return Fibonacci(x - 1) + Fibonacci(x - 2);
        }

        private static void DummyWork()
        {
            var result = Fibonacci(10);
            // inspect the result so it is not optimised away.
            // We know that the exception is never thrown. The compiler does not.
            if (result > 300)
            {
                throw new Exception("failed to do it");
            }
        }

        private const int TotalWorkItems = 2000000;

        private static void SerialWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            for (int index1 = 0; index1 < outerWorkItems; index1++)
            {
                InnerLoop(innerLoopLimit);
            }
        }

        private static void InnerLoop(int innerLoopLimit)
        {
            for (int index2 = 0; index2 < innerLoopLimit; index2++)
            {
                DummyWork();
            }
        }

        private static void ParallelWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            var outerRange = Enumerable.Range(0, outerWorkItems);
            Parallel.ForEach(outerRange, index1 =>
            {
                InnerLoop(innerLoopLimit);
            });
        }

        private static void TimeOperation(string desc, Action operation)
        {
            Stopwatch timer = new Stopwatch();
            timer.Start();
            operation();
            timer.Stop();
            string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed);
            Console.WriteLine(message);
        }

        static void Main(string[] args)
        {
            TimeOperation("serial work: 1", () => Program.SerialWork(1));
            TimeOperation("serial work: 2", () => Program.SerialWork(2));
            TimeOperation("serial work: 3", () => Program.SerialWork(3));
            TimeOperation("serial work: 4", () => Program.SerialWork(4));
            TimeOperation("serial work: 8", () => Program.SerialWork(8));
            TimeOperation("serial work: 16", () => Program.SerialWork(16));
            TimeOperation("serial work: 32", () => Program.SerialWork(32));
            TimeOperation("serial work: 1k", () => Program.SerialWork(1000));
            TimeOperation("serial work: 10k", () => Program.SerialWork(10000));
            TimeOperation("serial work: 100k", () => Program.SerialWork(100000));
            TimeOperation("parallel work: 1", () => Program.ParallelWork(1));
            TimeOperation("parallel work: 2", () => Program.ParallelWork(2));
            TimeOperation("parallel work: 3", () => Program.ParallelWork(3));
            TimeOperation("parallel work: 4", () => Program.ParallelWork(4));
            TimeOperation("parallel work: 8", () => Program.ParallelWork(8));
            TimeOperation("parallel work: 16", () => Program.ParallelWork(16));
            TimeOperation("parallel work: 32", () => Program.ParallelWork(32));
            TimeOperation("parallel work: 64", () => Program.ParallelWork(64));
            TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000));
            TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000));
            TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000));
            Console.WriteLine("done");
            Console.ReadLine();
        }
    }
}
the results on a 4-core Windows 7 machine are:
serial work: 1 took 00:02.31
serial work: 2 took 00:02.27
serial work: 3 took 00:02.28
serial work: 4 took 00:02.28
serial work: 8 took 00:02.28
serial work: 16 took 00:02.27
serial work: 32 took 00:02.27
serial work: 1k took 00:02.27
serial work: 10k took 00:02.28
serial work: 100k took 00:02.28
parallel work: 1 took 00:02.33
parallel work: 2 took 00:01.14
parallel work: 3 took 00:00.96
parallel work: 4 took 00:00.78
parallel work: 8 took 00:00.84
parallel work: 16 took 00:00.86
parallel work: 32 took 00:00.82
parallel work: 64 took 00:00.80
parallel work: 1k took 00:00.77
parallel work: 10k took 00:00.78
parallel work: 100k took 00:00.77
done
Running the code compiled against .NET 4 and .NET 4.5 gives much the same results.
The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.
The parallel work with 1 iteration is slightly longer than no parallelism at all; 2 items is shorter, so is 3, and with 4 or more iterations it is all about 0.8 seconds.
It is using all cores, but not with 100% efficiency. If the serial work was divided 4 ways with no overhead it would complete in 0.57 seconds (2.28 / 4 = 0.57).
In other scenarios I saw no speed-up at all with 2-3 parallel iterations. You do not have fine-grained control over that with Parallel.ForEach, and the algorithm may decide to "partition" them into just 1 chunk and run it on 1 core if the machine is busy.
There is no lower limit for doing parallel operations. If you have only 2 items to work on but each one will take a while, it might still make sense to use Parallel.ForEach. On the other hand if you have 1000000 items but they don't do very much, the parallel loop might not go any faster than the regular loop.
For example, I wrote a simple program to time nested loops where the outer loop ran both with a for loop and with Parallel.ForEach. I timed it on my 4-CPU (dual-core, hyperthreaded) laptop.
Here's a run with only 2 items to work on, but each takes a while:
2 outer iterations, 100000000 inner iterations:
for loop: 00:00:00.1460441
ForEach : 00:00:00.0842240
Here's a run with millions of items to work on, but they don't do very much:
100000000 outer iterations, 2 inner iterations:
for loop: 00:00:00.0866330
ForEach : 00:00:02.1303315
The only real way to know is to try it.
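The harness itself isn't shown above; a rough sketch along the same lines (the nested-loop sizes and the dummy work are assumptions) could look like this:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class NestedLoopBench
{
    static long Work(int innerIterations)
    {
        long checksum = 0;
        for (int i = 0; i < innerIterations; i++) checksum += i;
        return checksum; // consumed by the caller so the loop is not optimised away
    }

    // Returns { serial, parallel } elapsed times for outer x inner iterations.
    public static TimeSpan[] Time(int outer, int inner)
    {
        long sink = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < outer; i++) sink += Work(inner);
        TimeSpan serial = sw.Elapsed;

        sw.Restart();
        Parallel.For(0, outer, i => Work(inner));
        TimeSpan parallel = sw.Elapsed;

        GC.KeepAlive(sink);
        return new[] { serial, parallel };
    }

    static void Main()
    {
        TimeSpan[] t = Time(2, 100000000); // few items, each slow
        Console.WriteLine("for loop: {0}", t[0]);
        Console.WriteLine("ForEach : {0}", t[1]);
    }
}
```

Swapping the outer and inner counts reproduces the two scenarios above: few slow items parallelise well, while many trivial items mostly measure scheduling overhead.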
In general, once you go above a thread per core, each extra thread involved in an operation will make it slower, not faster.
However, if part of each operation will block (the classic example being waiting on disk or network I/O, another being producers and consumers that are out of synch with each other) then more threads than cores can begin to speed things up again, because tasks can be done while other threads are unable to make progress until the I/O operation returns.
For this reason, when single-core machines were the norm, the only real justifications in multi-threading were when either there was blocking of the sort I/O introduces or else to improve responsiveness (slightly slower to perform a task, but much quicker to start responding to user-input again).
Still, these days single-core machines are increasingly rare, so it would appear that you should be able to make everything at least twice as fast with parallel processing.
This will still not be the case if order is important, or something inherent to the task forces it to have a synchronised bottleneck, or if the number of operations is so small that the increase in speed from parallel processing is outweighed by the overheads involved in setting up that parallel processing. It may or may not be the case if a shared resource requires threads to block on other threads performing the same parallel operation (depending on the degree of lock contention).
Also, if your code is inherently multithreaded to begin with, you can be in a situation where you are essentially competing for resources with yourself (a classic case being ASP.NET code handling simultaneous requests). Here the advantage of parallel operation may mean that a single test operation on a 4-core machine approaches 4 times the performance, but once the number of requests needing the same task performed reaches 4, then since each of those requests is trying to use every core, it becomes little better than if they had a core each (perhaps slightly better, perhaps slightly worse). The benefit of parallel operation hence disappears as the use changes from a single-request test to a real-world multitude of requests.
You shouldn't blindly replace every single foreach loop in your application with a parallel foreach. More threads don't necessarily mean that your application will work faster. You need to slice the task into smaller tasks that can run in parallel if you want to really benefit from multiple threads. If your algorithm is not parallelizable, you won't get any benefit.
No. You need to understand what the code is doing and whether it is amenable to parallelization. Dependencies between your data items can make it hard to parallelize: if a thread uses the value calculated for the previous element, it has to wait until that value is calculated anyway and can't run in parallel. You also need to understand your target architecture, though you will typically have a multicore CPU on just about anything you buy these days. Even on a single core you can get some benefit from more threads, but only if you have some blocking tasks. You should also keep in mind that there is overhead in creating and organizing the parallel threads; if this overhead is a significant fraction of (or more than) the time your task takes, you could slow it down.
These are my benchmarks showing pure serial is slowest, along with various levels of partitioning.
class Program
{
    static void Main(string[] args)
    {
        NativeDllCalls(true, 1, 400000000, 0);   // Seconds: 0.67 |) 595,203,995.01 ops
        NativeDllCalls(true, 1, 400000000, 3);   // Seconds: 0.91 |) 439,052,826.95 ops
        NativeDllCalls(true, 1, 400000000, 4);   // Seconds: 0.80 |) 501,224,491.43 ops
        NativeDllCalls(true, 1, 400000000, 8);   // Seconds: 0.63 |) 635,893,653.15 ops
        NativeDllCalls(true, 4, 100000000, 0);   // Seconds: 0.35 |) 1,149,359,562.48 ops
        NativeDllCalls(true, 400, 1000000, 0);   // Seconds: 0.24 |) 1,673,544,236.17 ops
        NativeDllCalls(true, 10000, 40000, 0);   // Seconds: 0.22 |) 1,826,379,772.84 ops
        NativeDllCalls(true, 40000, 10000, 0);   // Seconds: 0.21 |) 1,869,052,325.05 ops
        NativeDllCalls(true, 1000000, 400, 0);   // Seconds: 0.24 |) 1,652,797,628.57 ops
        NativeDllCalls(true, 100000000, 4, 0);   // Seconds: 0.31 |) 1,294,424,654.13 ops
        NativeDllCalls(true, 400000000, 0, 0);   // Seconds: 1.10 |) 364,277,890.12 ops
    }

    static void NativeDllCalls(bool useStatic, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        if (useStatic)
        {
            Iterate<string, object>(
                (msg, cntxt) => {
                    ServiceContracts.ForNativeCall.SomeStaticCall(msg);
                }
                , "test", null, nonParallelIterations, parallelIterations, maxParallelism);
        }
        else
        {
            var instance = new ServiceContracts.ForNativeCall();
            Iterate(
                (msg, cntxt) => {
                    cntxt.SomeCall(msg);
                }
                , "test", instance, nonParallelIterations, parallelIterations, maxParallelism);
        }
    }

    static void Iterate<T, C>(Action<T, C> action, T testMessage, C context, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        var start = DateTime.UtcNow;
        if (nonParallelIterations == 0)
            nonParallelIterations = 1; // normalize values
        if (parallelIterations == 0)
            parallelIterations = 1;

        if (parallelIterations > 1)
        {
            ParallelOptions options;
            if (maxParallelism == 0) // default max parallelism
                options = new ParallelOptions();
            else
                options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

            if (nonParallelIterations > 1)
            {
                Parallel.For(0, parallelIterations, options, (j) => {
                    for (int i = 0; i < nonParallelIterations; ++i)
                    {
                        action(testMessage, context);
                    }
                });
            }
            else
            { // no nonParallel iterations
                Parallel.For(0, parallelIterations, options, (j) => {
                    action(testMessage, context);
                });
            }
        }
        else
        {
            for (int i = 0; i < nonParallelIterations; ++i)
            {
                action(testMessage, context);
            }
        }

        var end = DateTime.UtcNow;
        Console.WriteLine("\tSeconds: {0,8:0.00} |) {1,16:0,000.00} ops",
            (end - start).TotalSeconds, Math.Max(parallelIterations, 1) * nonParallelIterations / (end - start).TotalSeconds);
    }
}