I wrote a simple console app to test the performance of Parallel.Invoke, based on Microsoft's example on MSDN:
public static void TestParallelInvokeSimple()
{
    ParallelOptions parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = 1 }; // 1 to disable threads, -1 to enable them
    Parallel.Invoke(parallelOptions,
        () =>
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Console.WriteLine("Begin first task...");
            List<string> objects = new List<string>();
            for (int i = 0; i < 10000000; i++)
            {
                if (objects.Count > 0)
                {
                    string tempstr = string.Join("", objects.Last().Take(6).ToList());
                    objects.Add(tempstr + i);
                }
                else
                {
                    objects.Add("START!");
                }
            }
            sw.Stop();
            Console.WriteLine("End first task... {0} seconds", sw.Elapsed.TotalSeconds);
        },
        () =>
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Console.WriteLine("Begin second task...");
            List<string> objects = new List<string>();
            for (int i = 0; i < 10000000; i++)
            {
                objects.Add("abc" + i);
            }
            sw.Stop();
            Console.WriteLine("End second task... {0} seconds", sw.Elapsed.TotalSeconds);
        },
        () =>
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Console.WriteLine("Begin third task...");
            List<string> objects = new List<string>();
            for (int i = 0; i < 20000000; i++)
            {
                objects.Add("abc" + i);
            }
            sw.Stop();
            Console.WriteLine("End third task... {0} seconds", sw.Elapsed.TotalSeconds);
        }
    );
}
The ParallelOptions argument is just an easy way to enable/disable threading.
When I disable threading I get the following output:
Begin first task...
End first task... 10.034647 seconds
Begin second task...
End second task... 3.5326487 seconds
Begin third task...
End third task... 6.8715266 seconds
done!
Total elapsed time: 20.4456563 seconds
Press any key to continue . . .
When I enable threading by setting MaxDegreeOfParallelism to -1 I get:
Begin third task...
Begin first task...
Begin second task...
End second task... 5.9112167 seconds
End third task... 13.113622 seconds
End first task... 19.5815043 seconds
done!
Total elapsed time: 19.5884057 seconds
Which is practically the same speed as sequential processing. Since task 1 takes the longest (about 10 seconds), I would expect the threaded run to take around 10 seconds total for all 3 tasks. So what gives? Why is Parallel.Invoke running my tasks in parallel, yet slower individually?
BTW, I've seen the exact same results when using Parallel.Invoke in a real app performing many different tasks at the same time (most of which are running queries).
If you think it's my PC, think again... it's 1 year old, with 8 GB of RAM, Windows 8.1, and an Intel Core i7 2.7 GHz 8-core CPU. My PC is not overloaded; I watched its performance while running my tests over and over again. It never maxed out, but it obviously showed CPU and memory usage increase while the tests ran.
I haven't profiled this, but the majority of the time here is probably being spent doing memory allocation for those lists and tiny strings. These "tasks" aren't actually doing anything other than growing the lists with minimal input and almost no processing time.
Consider that:
objects.Add("abc" + i);
is essentially just creating a new string and then adding it to a list. Creating a small string like this is largely just a memory allocation exercise since strings are stored on the heap. Furthermore, the memory allocated for the List is going to fill up rapidly - each time it does the list will re-allocate more memory for its own storage.
Now, heap allocations are largely serialized within a process - multiple threads inside one process cannot all allocate memory at the same time. Requests for memory allocation are processed mostly in sequence, since the shared heap is like any other shared resource that needs to be protected from concurrent mayhem.
So what you have are three extremely memory-hungry threads that are probably spending most of their time waiting for each other to finish getting new memory.
Fill those methods with CPU-intensive work (i.e., do some math) and you'll see the results are very different. The lesson is that not all tasks are efficiently parallelizable, and not all in the ways that you might think. The above, for example, could be sped up by running each task within its own process; with its own private memory space, each task would face no contention for memory allocation.
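To illustrate, here's a minimal sketch of what I mean (same harness as the question's code, with allocation-free, CPU-bound bodies); toggle MaxDegreeOfParallelism between 1 and -1 and the parallel run should finish in roughly the time of the longest task:
// A sketch, not the original code: same structure, pure arithmetic bodies.
public static void TestParallelInvokeCpuBound()
{
    var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = -1 }; // 1 to disable threads
    Parallel.Invoke(parallelOptions,
        () =>
        {
            double acc = 0;
            for (int i = 1; i < 100000000; i++)
                acc += Math.Sqrt(i); // no heap allocations, no shared state
            Console.WriteLine("First task done: {0}", acc);
        },
        () =>
        {
            double acc = 0;
            for (int i = 1; i < 100000000; i++)
                acc += Math.Sqrt(i);
            Console.WriteLine("Second task done: {0}", acc);
        },
        () =>
        {
            double acc = 0;
            for (int i = 1; i < 200000000; i++)
                acc += Math.Sqrt(i);
            Console.WriteLine("Third task done: {0}", acc);
        });
}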
Edit: As per the discussion in the comments, I was overestimating how much having many threads would help. I have gone back to Parallel.ForEach with a reasonable MaxDegreeOfParallelism, and I just have to wait it out.
I have a 2D array data structure, and perform work on slices of the data. There will only ever be around 1000 threads required to work on all the data simultaneously. Basically there are around 1000 "days" worth of data for all ~7000 data points, and I would like to process the data for each day in a new thread in parallel.
My issue is that doing work in the child threads dramatically slows down how quickly the main thread can start them. If I have no work being done in the child threads, the main thread starts them all essentially instantly. In my example below, with just a bit of work, it takes ~65 ms to start all the threads. In my real use case, the worker threads will take around 5-10 seconds to compute everything they need, but I would like them all to start instantly; otherwise, I am basically running the work in sequence. I do not understand why their work slows the main thread down in starting them.
How the data is set up shouldn't matter (I hope). The way it's set up might look weird; I was just simulating exactly how I receive the data. What's important is that if you comment out the foreach loop in the DoThreadWork method, the time it takes to start the threads is way lower.
I have the for (var i = 0; i < 4; i++) loop just to run the simulation multiple times and see 4 sets of timing results, to make sure it wasn't just slow the first time.
Here is a code snippet to simulate my real code:
public static void Main(string[] args)
{
    var fakeData = Enumerable
        .Range(0, 7000)
        .Select(_ => Enumerable.Range(0, 400).ToArray())
        .ToArray();

    const int offset = 100;
    var dataIndices = Enumerable
        .Range(offset, 290)
        .ToArray();

    for (var i = 0; i < 4; i++)
    {
        var s = Stopwatch.StartNew();
        var threads = dataIndices
            .Select(n =>
            {
                var thread = new Thread(() =>
                {
                    foreach (var fake in fakeData)
                    {
                        var sliced = new ArraySegment<int>(fake, n - offset, n - (n - offset));
                        DoThreadWork(sliced);
                    }
                });
                return thread;
            })
            .ToList();

        foreach (var thread in threads)
        {
            thread.Start();
        }
        Console.WriteLine($"Before Join: {s.ElapsedMilliseconds}");

        foreach (var thread in threads)
        {
            thread.Join();
        }
        Console.WriteLine($"After Join: {s.ElapsedMilliseconds}");
    }
}
private static void DoThreadWork(ArraySegment<int> fakeData)
{
    // Commenting out this foreach loop will dramatically increase the speed
    // at which all the threads start
    var a = 0;
    foreach (var fake in fakeData)
    {
        // Simulate thread work
        a += fake;
    }
}
Use the thread/task pool and limit the thread/task count to at most 2 × (CPU cores). Creating more threads doesn't magically get more work done, since you need hardware threads to run them (1 per core for non-SMT CPUs, 2 per core with Intel Hyper-Threading or AMD's SMT implementation). Running hundreds to thousands of threads that aren't passively awaiting asynchronous callbacks (i.e. I/O) makes everything far less efficient, because it thrashes the CPU with context switches for no reason.
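For the example above, that might look like the following sketch (reusing fakeData, dataIndices, offset, and DoThreadWork from the question's code; note that the original slice length n - (n - offset) simplifies to offset):
// A sketch, not the original code: let the thread pool schedule the slices
// instead of creating 290 dedicated threads.
Parallel.ForEach(
    dataIndices,
    new ParallelOptions { MaxDegreeOfParallelism = 2 * Environment.ProcessorCount },
    n =>
    {
        foreach (var fake in fakeData)
        {
            DoThreadWork(new ArraySegment<int>(fake, n - offset, offset));
        }
    });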
class Program
{
    static void Main(string[] args)
    {
        new Program();
    }

    private static int prepTotal;
    private static readonly object Lock = new object();

    public Program()
    {
        var sw = new Stopwatch();
        sw.Start();
        Parallel.For((long) 0, 10, new ParallelOptions { MaxDegreeOfParallelism = 1 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"1 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long) 0, 10, new ParallelOptions { MaxDegreeOfParallelism = 3 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"3 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long) 0, 10, new ParallelOptions { MaxDegreeOfParallelism = 1 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"1 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long) 0, 10, new ParallelOptions { MaxDegreeOfParallelism = 3 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"3 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        Console.ReadLine();
    }

    private static void DoIt()
    {
        var sw2 = new Stopwatch();
        sw2.Start();
        using (var bmp = new Bitmap(3000, 3000))
        {
        }
        sw2.Stop();
        lock (Lock)
        {
            prepTotal += (int) sw2.ElapsedMilliseconds;
        }
    }
}
When I run my test code (derived from the original, really complex code) I get the following results. As you can see, the code running on more threads is almost 3 times slower. Does the Bitmap constructor do some blocking, or what?
1 thread sum time is 125 ms. Total time is 132 ms.
3 thread sum time is 360 ms. Total time is 132 ms.
1 thread sum time is 121 ms. Total time is 127 ms.
3 thread sum time is 364 ms. Total time is 128 ms.
Well, I just used a profiler to see if my guess was correct, and indeed, new Bitmap(3000, 3000) is almost entirely memory bound. So unless you have a server machine with multiple independent memory systems, adding more CPUs doesn't help at all. The bottleneck is memory.
The second most important part happens in the Dispose, which is again... almost entirely memory bound.
Multi-threading only helps with CPU-bound code. Since the CPU is much faster than any memory you may have in your system, the CPU is only really saturated when it can avoid working with memory (and other I/O devices). Your case is pretty much exactly the opposite - there's very little CPU work, and where there is CPU work, it's mostly synchronized (e.g. requesting and freeing virtual memory). Not a lot of opportunities for parallelization.
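One way to see this for yourself (a sketch for comparison, not part of the original test): swap the Bitmap allocation in DoIt for pure arithmetic, and the 3-thread sum time should stay close to the 1-thread sum time instead of roughly tripling.
// A hypothetical CPU-bound stand-in for DoIt, for comparison only:
// no large allocations, so the threads don't contend for the memory system.
private static void DoItCpuBound()
{
    var sw2 = Stopwatch.StartNew();
    double acc = 0;
    for (int i = 1; i < 10000000; i++)
    {
        acc += Math.Sqrt(i); // arithmetic only
    }
    sw2.Stop();
    lock (Lock)
    {
        prepTotal += (int) sw2.ElapsedMilliseconds;
    }
    if (acc < 0) Console.WriteLine(acc); // defeat dead-code elimination
}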
I need some advice. I have an application that processes trade information from a real-time data feed from the stock exchanges. My processing is falling behind.
Since I'm running on a 3 GHz Intel i7 with 32 GB of main memory, I should have enough power for this application. The Parse routine stores trade information in a SQL Server 2014 database, running on a Windows 2012 R2 Server.
I put the following timing information in the main processing loop:
invokeTime.Restart();
Parallel.Invoke(() => parserObj.Parse(julian, data));
invokeTime.Stop();
var milliseconds = invokeTime.ElapsedMilliseconds;
if (milliseconds > maxMilliseconds) {
    maxMilliseconds = milliseconds;
    messageBar.SetText("Invoke: " + milliseconds);
}
I'm getting as much as 1122 milliseconds for the Parallel.Invoke call. A similar timing test shows that the Parse routine itself only takes 7 milliseconds (max).
Is there a better way of processing the data, other than doing the Parallel.Invoke?
Any suggestions will be greatly appreciated.
Charles
Have you tried
Task.Factory.StartNew(() => {
    parserObj.Parse(julian, data);
});
What does your Parse method look like? Maybe the bottleneck is in there...
In Stephen Toub's article http://blogs.msdn.com/b/pfxteam/archive/2011/10/24/10229468.aspx he writes: "Task.Run can and should be used for the most common cases of simply offloading some work to be processed on the ThreadPool". That's exactly what I want to do: offload the Parse routine to a background thread. So I changed:
Parallel.Invoke(() => parserObj.Parse(julian, data));
to:
Task.Run(() => parserObj.Parse(julian, data));
I also increased the number of threads in the ThreadPool from 8 to 80 by doing:
int minWorker, minIOC;
ThreadPool.GetMinThreads(out minWorker, out minIOC);
var newMinWorker = 10 * minWorker;
var newMinIOC = 10 * minIOC;
if (ThreadPool.SetMinThreads(newMinWorker, newMinIOC)) {
    textBox.AddLine("The minimum no. of worker threads is now: " + newMinWorker);
} else {
    textBox.AddLine("Drat! The minimum no. of worker threads could not be changed.");
}
The parsing loop, which runs for 6 1/2 hours/day, looks like:
var stopWatch = new Stopwatch();
var maxMilliseconds = 0L;
while ((data = GetDataFromIQClient()) != null) {
    if (MarketHours.IsMarketClosedPlus2()) {
        break;
    }
    stopWatch.Restart();
    Task.Run(() => parserObj.Parse(julian, data));
    stopWatch.Stop();
    var milliseconds = stopWatch.ElapsedMilliseconds;
    if (milliseconds > maxMilliseconds) {
        maxMilliseconds = milliseconds;
        messageBar.SetText("Task.Run: " + milliseconds);
    }
}
Now, the maximum time spent to call Task.Run was 96 milliseconds, and the maximum time spent in parser was 18 milliseconds. I'm now keeping up with the data transmission.
Charles
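Worth noting as an alternative (a sketch only; parserObj, julian, GetDataFromIQClient, and MarketHours come from the code above, and the element type of the queue is an assumption): a single consumer draining a bounded BlockingCollection also keeps Parse off the receive loop, while preserving message order and providing backpressure, which fire-and-forget Task.Run calls don't.
// A sketch of a bounded producer/consumer alternative.
var queue = new BlockingCollection<string>(boundedCapacity: 100000);

// One long-running consumer drains the queue and parses in arrival order.
var consumer = Task.Factory.StartNew(() => {
    foreach (var item in queue.GetConsumingEnumerable()) {
        parserObj.Parse(julian, item);
    }
}, TaskCreationOptions.LongRunning);

string data;
while ((data = GetDataFromIQClient()) != null) {
    if (MarketHours.IsMarketClosedPlus2()) {
        break;
    }
    queue.Add(data); // blocks when full instead of queueing work without bound
}
queue.CompleteAdding();
consumer.Wait();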
I have a ConcurrentBag<UrlInfo> urls whose items are being processed in parallel (nothing is being written back to the collection):
urls.AsParallel<UrlInfo>().WithDegreeOfParallelism(17).ForAll(item =>
{
    UrlInfo info = MakeSynchronousWebRequest(item);
    (myProgress as IProgress<UrlInfo>).Report(info);
});
I have the timeout set to 30 seconds in the web request. When a url that is very slow to respond is encountered, all of the parallel processing grinds to a halt. Is this expected behavior, or should I be searching out some problem in my code?
Here's the progress :
myProgress = new Progress<UrlInfo>(info =>
{
    Action action = () =>
    {
        Interlocked.Increment(ref itested);
        if (info.status == UrlInfo.UrlStatusCode.dead)
        {
            Interlocked.Increment(ref idead);
            this.BadUrls.Add(info);
        }
        dead.Content = idead.ToString();
        tested.Content = itested.ToString();
    };
    try
    {
        Dispatcher.BeginInvoke(action);
    }
    catch (Exception ex)
    {
    }
});
It's the expected behavior. AsParallel doesn't return until all the operations are finished. Since you're making synchronous requests, you've got to wait until your slowest one is finished. However note that even if you've got one really slow task hogging up a thread, the scheduler continues to schedule new tasks as old ones finish on the remaining threads.
Here's a rather instructive example. It creates 101 tasks. The first task hogs one thread for 5000 ms; the 100 others churn on the remaining 20 threads for 1000 ms each. So it schedules 20 of those tasks at a time, and they run for one second each, going through that cycle 5 times to get through all 100 tasks, for a total of 5000 ms. However, if you change the 101 to 102, you've got 101 tasks churning on the 20 threads, which will end up taking 6000 ms; that 101st task just doesn't have a thread to churn on until the 5-second mark. And if you change the 101 to, say, 2, you'll see it still takes 5000 ms, because you have to wait for the slow task to complete.
static void Main()
{
    ThreadPool.SetMinThreads(21, 21);
    var sw = new Stopwatch();
    sw.Start();
    Enumerable.Range(0, 101).AsParallel().WithDegreeOfParallelism(21).ForAll(i => Thread.Sleep(i == 0 ? 5000 : 1000));
    Console.WriteLine(sw.ElapsedMilliseconds);
}
I tried a very minimal example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Collections.Concurrent;
using System.Diagnostics;

namespace TPLExample {
    class Program {
        static void Main(string[] args) {
            int[] dataItems = new int[100];
            double[] resultItems = new double[100];

            for (int i = 0; i < dataItems.Length; ++i) {
                dataItems[i] = i;
            }

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Reset();
            stopwatch.Start();

            Parallel.For(0, dataItems.Length, (index) => {
                resultItems[index] = Math.Pow(dataItems[index], 2);
            });

            stopwatch.Stop();
            Console.WriteLine("TPL Time elapsed: {0}", stopwatch.Elapsed);

            stopwatch.Reset();
            stopwatch.Start();

            for (int i = 0; i < dataItems.Length; ++i) {
                resultItems[i] = Math.Pow(dataItems[i], 2);
            }

            stopwatch.Stop();
            Console.WriteLine("Sequential Time elapsed: {0}", stopwatch.Elapsed);

            WaitForEnterKey();
        }

        public static void WaitForEnterKey() {
            Console.WriteLine("Press enter to finish");
            Console.ReadLine();
        }

        public static void PrintMessage() {
            Console.WriteLine("Message printed");
        }
    }
}
The output was:
TPL Time elapsed: 00:00:00.0010670
Sequential Time elapsed: 00:00:00.0000178
Press enter to finish
The sequential loop is way faster than TPL! How is this possible? From my understanding, the calculation within the Parallel.For will be executed in parallel, so shouldn't it be faster?
Simply put: when you're only iterating over a hundred items and performing a small mathematical operation, spawning new threads and waiting for them to complete produces more overhead than just running through the loop does.
From my understanding, the calculation within the Parallel.For will be executed in parallel, so shouldn't it be faster?
As generally happens when people make sweeping statements about computer performance, there are far more variables at play here, and you can't really make that assumption. For example, inside your for loop you are doing nothing more than Math.Pow, which the processor can perform very quickly. If this were an I/O-intensive operation, requiring each thread to wait a long time, or even if it were a series of processor-intensive operations, you would get more out of parallel processing (assuming you have a multi-core processor). But as it is, the overhead of creating and synchronizing these threads is far greater than any advantage that parallelism might give you.
Parallel loop processing is beneficial when the operation performed within the loop is relatively costly. All you're doing in your example is calculating an exponent, which is trivial. The overhead of multithreading is far outweighing the gains that you're getting in this case.
This code example is practical proof of the really nice answers above.
I've simulated an intensive processor operation by simply blocking the thread with Thread.Sleep.
The output was:
Sequential Loop - 00:00:09.9995500
Parallel Loop - 00:00:03.0347901
class Program
{
    static void Main(string[] args)
    {
        const int a = 10;
        Stopwatch sw = new Stopwatch();
        sw.Start();

        //for (long i = 0; i < a; i++)
        //{
        //    Thread.Sleep(1000);
        //}

        Parallel.For(0, a, i =>
        {
            Thread.Sleep(1000);
        });

        sw.Stop();
        Console.WriteLine(sw.Elapsed);
        Console.ReadLine();
    }
}
The overhead of parallelization is far greater than simply running Math.Pow 100 times sequentially. The others have said this.
More importantly, though, the memory access is trivial in the sequential version, but in the parallel version the threads have to share memory (resultItems), and that kind of contention (false sharing of cache lines, for instance) can really hurt you even if you have a million items.
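For larger workloads, a sketch of one mitigation (reusing the question's dataItems and resultItems; Partitioner lives in System.Collections.Concurrent, which the example already imports): hand each worker one contiguous index range, so threads rarely write to the same cache line of resultItems. For 100 trivial items the sequential loop still wins; this only pays off when the arrays and the per-item work grow.
// A sketch: range partitioning over the question's arrays, one contiguous
// chunk of indices per worker.
Parallel.ForEach(Partitioner.Create(0, dataItems.Length), range => {
    for (int i = range.Item1; i < range.Item2; ++i) {
        resultItems[i] = Math.Pow(dataItems[i], 2);
    }
});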
See page 44 of this excellent Microsoft whitepaper on parallel programming:
http://www.microsoft.com/en-us/download/details.aspx?id=19222. Here is an MSDN magazine article on the subject: http://msdn.microsoft.com/en-us/magazine/cc872851.aspx