Why does this threaded code run so much slower? - c#

I'm writing an N-Body simulation, and for computational simplification I've divided the whole space into a number of uniformly-sized regions.
For each body, I compute the force of all other bodies in the same region, and for the other regions I aggregate the mass and distances together so there's less work to be done.
I have a List<Region> and Region defines public void Index() which sums the total mass at this iteration.
I have two variants of my Space.Tick() function:
public void Tick()
{
    foreach (Region r in Regions)
        r.Index();
}
This is very quick. For 20x20x20 = 8000 regions with 100 bodies each = 800000 bodies in total, it only takes about 0.1 seconds to do this. The CPU graph shows 25% utilisation on my quad-core, which is exactly what I would expect.
Now I write this multi-threaded variant:
public void Tick()
{
    Thread[] threads = new Thread[Environment.ProcessorCount];
    foreach (Region r in Regions)
        while (true)
        {
            bool queued = false;
            for (int i = 0; i < threads.Length; i++)
                if (threads[i] == null || !threads[i].IsAlive)
                {
                    Region s = r;
                    threads[i] = new Thread(s.Index);
                    threads[i].Start();
                    queued = true;
                    break;
                }
            if (queued)
                break;
        }
}
So, a quick explanation in case it's not obvious: threads is an array of 4, in the case of my CPU. It starts off as four nulls. For each region, I loop through all 4 Thread slots (which could be null). When I find one that is either null or not IsAlive, I queue up that Region's Index() and Start() it. I set queued to true so that I can tell that the region has started indexing.
This code takes about 7 seconds. That's 70x slower. I understand that there's a bit of overhead involved with setting up the threads, finding a thread that's vacant, etc. But I would still expect that I would have at least some sort of performance gain.
What am I doing wrong?

Why not try PLINQ?
Regions.AsParallel().ForAll(x => x.Index());
PLINQ is usually super fast for me, and it scales depending on your environment. If it shouldn't run in parallel, it uses a single thread.
So, if you had a multidimensional array coming into the function, you could just do this:
Regions.AsParallel().Cast<Region>().ForAll(x => x.Index());
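For comparison, note that the threaded Tick() in the question creates a brand-new Thread for every one of the 8,000 regions each tick and busy-waits for a free slot, which is where most of the 7 seconds goes. A minimal hedged sketch of the same loop on pooled threads (assuming the same Regions list as the question):
public void Tick()
{
    // The TPL partitions Regions across pooled workers instead of
    // spawning one short-lived Thread per region.
    System.Threading.Tasks.Parallel.ForEach(Regions, r => r.Index());
}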

Related

Why are 1000 threads faster than a few?

I have a simple program that searches linearly in an array of 2D points. I do 1000 searches into an array of 1,000,000 points.
The curious thing is that if I spawn 1000 threads, the program works as fast as when I spawn only as many threads as I have CPU cores, or when I use Parallel.For. This is contrary to everything I know about creating threads. Creating and destroying threads is expensive, but obviously not in this case.
Can someone explain why?
Note: this is a methodological example; the search algorithm is deliberately not optimal. The focus is on threading.
Note 2: I tested on a 4-core i7 and a 3-core AMD; the results follow the same pattern!
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

/// <summary>
/// We search for closest points.
/// For every point in array searchData, we search into inputData for the closest point,
/// and store it at the same position into array resultData;
/// </summary>
class Program
{
    class Point
    {
        public double X { get; set; }
        public double Y { get; set; }

        public double GetDistanceFrom(Point p)
        {
            double dx, dy;
            dx = p.X - X;
            dy = p.Y - Y;
            return Math.Sqrt(dx * dx + dy * dy);
        }
    }

    const int inputDataSize = 1_000_000;
    static Point[] inputData = new Point[inputDataSize];

    const int searchDataSize = 1000;
    static Point[] searchData = new Point[searchDataSize];
    static Point[] resultData = new Point[searchDataSize];

    static void GenerateRandomData(Point[] array)
    {
        Random rand = new Random();
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = new Point()
            {
                X = rand.NextDouble() * 100_000,
                Y = rand.NextDouble() * 100_000
            };
        }
    }

    private static void SearchOne(int i)
    {
        var searchPoint = searchData[i];
        foreach (var p in inputData)
        {
            if (resultData[i] == null)
            {
                resultData[i] = p;
            }
            else
            {
                double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
                double newDistance = searchPoint.GetDistanceFrom(p);
                if (newDistance < oldDistance)
                {
                    resultData[i] = p;
                }
            }
        }
    }

    static void AllThreadSearch()
    {
        List<Thread> threads = new List<Thread>();
        for (int i = 0; i < searchDataSize; i++)
        {
            var thread = new Thread(
                obj =>
                {
                    int index = (int)obj;
                    SearchOne(index);
                });
            thread.Start(i);
            threads.Add(thread);
        }
        foreach (var t in threads) t.Join();
    }

    static void FewThreadSearch()
    {
        int threadCount = Environment.ProcessorCount;
        int workSize = searchDataSize / threadCount;
        List<Thread> threads = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(
                obj =>
                {
                    int[] range = (int[])obj;
                    int from = range[0];
                    int to = range[1];
                    for (int index = from; index < to; index++)
                    {
                        SearchOne(index);
                    }
                }
            );
            int rangeFrom = workSize * i;
            int rangeTo = workSize * (i + 1);
            thread.Start(new int[] { rangeFrom, rangeTo });
            threads.Add(thread);
        }
        foreach (var t in threads) t.Join();
    }

    static void ParallelThreadSearch()
    {
        System.Threading.Tasks.Parallel.For(0, searchDataSize,
            index =>
            {
                SearchOne(index);
            });
    }

    static void Main(string[] args)
    {
        Console.Write("Generating data... ");
        GenerateRandomData(inputData);
        GenerateRandomData(searchData);
        Console.WriteLine("Done.");
        Console.WriteLine();

        Stopwatch watch = new Stopwatch();

        Console.Write("All thread searching... ");
        watch.Restart();
        AllThreadSearch();
        watch.Stop();
        Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");

        Console.Write("Few thread searching... ");
        watch.Restart();
        FewThreadSearch();
        watch.Stop();
        Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");

        Console.Write("Parallel thread searching... ");
        watch.Restart();
        ParallelThreadSearch();
        watch.Stop();
        Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");

        Console.WriteLine();
        Console.WriteLine("Press ENTER to quit.");
        Console.ReadLine();
    }
}
EDIT: Please make sure to run the app outside the debugger. VS Debugger slows down the case of multiple threads.
EDIT 2: Some more tests.
To make it clear, here is updated code that guarantees we do have 1000 running at once:
public static void AllThreadSearch()
{
ManualResetEvent startEvent = new ManualResetEvent(false);
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
startEvent.WaitOne();
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
startEvent.Set();
foreach (var t in threads) t.Join();
}
Testing with a smaller array - 100K elements, the results are:
1000 vs 8 threads
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|----------:|-------:|
AllThreadSearch | 323.0 ms | 7.307 ms | 21.546 ms | 1.00 |
FewThreadSearch | 164.9 ms | 3.311 ms | 5.251 ms | 1.00 |
ParallelThreadSearch | 141.3 ms | 1.503 ms | 1.406 ms | 1.00 |
Now, 1000 threads is much slower, as expected. Parallel.For still bests them all, which is also logical.
However, growing the array to 500K (i.e. the amount of work every thread does), things start to look weird:
1000 vs 8, 500K
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|---------:|-------:|
AllThreadSearch | 890.9 ms | 17.74 ms | 30.61 ms | 1.00 |
FewThreadSearch | 712.0 ms | 13.97 ms | 20.91 ms | 1.00 |
ParallelThreadSearch | 714.5 ms | 13.75 ms | 12.19 ms | 1.00 |
Looks like context switching has negligible cost. Thread-creation costs are also relatively small. The only significant cost of having too many threads is the loss of memory (address space), which, alone, is bad enough.
So, are thread-creation costs really that small after all? We've been universally told that creating threads is very expensive and that context switches are evil.
You may want to consider how the application is accessing memory. In the maximum threads scenario you are effectively accessing memory sequentially, which is efficient from a caching point of view. The approach using a small number of threads is more random, causing cache misses. Depending on the CPU there are performance counters that allow you to measure L1 and L2 cache hits/misses.
I think the real issue (other than memory use) with too many threads is that the CPU may have a hard time optimizing itself because it is switching tasks all the time. In the OP's original benchmark, the threads are all working on the same task and so you aren't seeing that much of a cost for the extra threads.
To simulate threads working on different tasks, I modified Jodrell's reformulation of the original code (labeled "Normal" in the data below). First I optimized memory access by ensuring all the threads work in the same block of memory at the same time, with the block sized to fit in the cache (4 MB), using the method from this cache blocking techniques article. Then I "reversed" that to ensure each set of 4 threads works in a different block of memory. The results for my machine (in ms):
Intel Core i7-5500U CPU 2.40GHz (Max: 2.39GHz) (Broadwell), 1 CPU, 4 logical and 2 physical cores
inputDataSize = 1_000_000; searchDataSize = 1000; blocks used for O/D: 10

Threads              1     2     3     4     6     8    10    18    32    56   100   178   316   562  1000
Normal (N)        5722  3729  3599  2909  3485  3109  3422  3385  3138  3220  3196  3216  3061  3012  3121
Optimized (O)     5480  2903  2978  2791  2826  2806  2820  2796  2778  2775  2775  2805  2833  2866  2988
De-optimized (D)  5455  3373  3730  3849  3409  3350  3297  3335  3365  3406  3455  3553  3692  3719  3900
For O, all the threads worked in the same block of cacheable memory at the same time (where 1 block = 1/10 of inputData). For D, for every set of 4 threads, no thread worked in the same block of memory at the same time. So basically, in the former case access of inputData was able to make use of the cache whereas in the latter case for 4 threads access of inputData was forced to use main memory.
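The answer's exact code isn't reproduced here, but the blocking idea can be sketched as follows (a hedged reconstruction using the benchmark's names, with BlockCount = 10 as above; not the answer's actual implementation). The outer loop walks inputData one cache-sized block at a time, so all threads scan the same cache-resident block before anyone moves on:
const int BlockCount = 10;

static void BlockedSearch()
{
    int blockSize = inputData.Length / BlockCount;
    for (int b = 0; b < BlockCount; b++)
    {
        int start = b * blockSize;
        int end = (b == BlockCount - 1) ? inputData.Length : start + blockSize;
        // Every thread scans the same (cache-resident) block of inputData at once
        System.Threading.Tasks.Parallel.For(0, searchDataSize, i =>
        {
            var searchPoint = searchData[i];
            for (int j = start; j < end; j++)
            {
                if (resultData[i] == null ||
                    searchPoint.GetDistanceFrom(inputData[j]) < searchPoint.GetDistanceFrom(resultData[i]))
                {
                    resultData[i] = inputData[j]; // best candidate so far persists across blocks
                }
            }
        });
    }
}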
It's easier to see in charts. These charts have the thread-creation cost subtracted out; note that the x-axis is logarithmic and the y-axis is truncated to better show the shape of the data. Also, the value for 1 thread has been halved to show the theoretical best multi-threaded performance:
A quick glance above shows the optimized data (O) is indeed faster than the others. It is also more consistent (smoother) because, compared to N, it is not having to deal with cache misses. As suggested by Jodrell, there appears to be a sweet spot around 100 threads, which is the number that, on my system, allows a thread to complete its work within one time-slice. After that, the time increases linearly with the number of threads (remember, the x-axis has a logarithmic scale on the chart).
Comparing the normal and optimized data, the former is quite jagged whereas the latter is smooth. This answer suggested more threads would be more efficient from a caching point of view compared to fewer threads, where the memory access could be more "random". The chart below seems to confirm this (note 4 threads is optimal for my machine as it has 4 logical cores):
The de-optimized version is the most interesting. The worst case is with 4 threads, as they have been forced to work in different areas of memory, preventing effective caching. As the number of threads increases, the system is able to cache, because threads share blocks of memory. But, as the number of threads increases further, presumably the context switching makes it harder for the system to cache again, and the results tend back to the worst case:
I think this last chart is what shows the real cost of context-switching. In the original (N) version, the threads are all doing the same task. As a result there is limited competition for resources other than CPU time and the CPU is able to optimize itself for the workload (i.e. cache effectively.) If the threads are all doing different things, then the CPU isn't able to optimize itself and a severe performance hit results. So it's not directly the context switching that causes the problem, but the competition for resources.
In this case, the difference for 4 Threads between O (2909ms) and D (3849ms) is 940ms. This represents a 32% performance hit. Because my machine has a shared L3 cache, this performance hit shows up even with only 4 threads.
I took the liberty of rearranging your code to run under BenchmarkDotNet. It looks like this:
using System;
using System.Collections.Generic;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace Benchmarks
{
    public class Point
    {
        public double X { get; set; }
        public double Y { get; set; }

        public double GetDistanceFrom(Point p)
        {
            double dx, dy;
            dx = p.X - X;
            dy = p.Y - Y;
            return Math.Sqrt(dx * dx + dy * dy);
        }
    }

    [ClrJob(baseline: true)]
    public class SomeVsMany
    {
        [Params(1000)]
        public static int inputDataSize = 1000;

        [Params(10)]
        public static int searchDataSize = 10;

        static Point[] inputData = new Point[inputDataSize];
        static Point[] searchData = new Point[searchDataSize];
        static Point[] resultData = new Point[searchDataSize];

        [GlobalSetup]
        public static void Setup()
        {
            GenerateRandomData(inputData);
            GenerateRandomData(searchData);
        }

        [Benchmark]
        public static void AllThreadSearch()
        {
            List<Thread> threads = new List<Thread>();
            for (int i = 0; i < searchDataSize; i++)
            {
                var thread = new Thread(
                    obj =>
                    {
                        int index = (int)obj;
                        SearchOne(index);
                    });
                thread.Start(i);
                threads.Add(thread);
            }
            foreach (var t in threads) t.Join();
        }

        [Benchmark]
        public static void FewThreadSearch()
        {
            int threadCount = Environment.ProcessorCount;
            int workSize = searchDataSize / threadCount;
            List<Thread> threads = new List<Thread>();
            for (int i = 0; i < threadCount; i++)
            {
                var thread = new Thread(
                    obj =>
                    {
                        int[] range = (int[])obj;
                        int from = range[0];
                        int to = range[1];
                        for (int index = from; index < to; index++)
                        {
                            SearchOne(index);
                        }
                    }
                );
                int rangeFrom = workSize * i;
                int rangeTo = workSize * (i + 1);
                thread.Start(new int[] { rangeFrom, rangeTo });
                threads.Add(thread);
            }
            foreach (var t in threads) t.Join();
        }

        [Benchmark]
        public static void ParallelThreadSearch()
        {
            System.Threading.Tasks.Parallel.For(0, searchDataSize,
                index =>
                {
                    SearchOne(index);
                });
        }

        private static void GenerateRandomData(Point[] array)
        {
            Random rand = new Random();
            for (int i = 0; i < array.Length; i++)
            {
                array[i] = new Point()
                {
                    X = rand.NextDouble() * 100_000,
                    Y = rand.NextDouble() * 100_000
                };
            }
        }

        private static void SearchOne(int i)
        {
            var searchPoint = searchData[i];
            foreach (var p in inputData)
            {
                if (resultData[i] == null)
                {
                    resultData[i] = p;
                }
                else
                {
                    double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
                    double newDistance = searchPoint.GetDistanceFrom(p);
                    if (newDistance < oldDistance)
                    {
                        resultData[i] = p;
                    }
                }
            }
        }
    }

    public class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<SomeVsMany>();
        }
    }
}
When I run the benchmark I get these results,
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.14393.2485 (1607/AnniversaryUpdate/Redstone1)
Intel Core i7-7600U CPU 2.80GHz (Max: 2.90GHz) (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
Frequency=2835938 Hz, Resolution=352.6170 ns, Timer=TSC
  [Host] : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
Job=Clr  Runtime=Clr
Method                 inputDataSize  searchDataSize         Mean       Error       StdDev
AllThreadSearch        1000           10              1,276.53 us  51.0605 us  142.3364 us
FewThreadSearch        1000           10                547.72 us  24.8199 us   70.0049 us
ParallelThreadSearch   1000           10                 36.54 us   0.6973 us    0.8564 us
These are the kind of results I'd expect and different to what you are claiming in the question. However, as you correctly identify in the comment, this is because I have reduced the values of inputDataSize and searchDataSize.
If I rerun the test with the original values I get results like this,
Method                 inputDataSize  searchDataSize     Mean     Error    StdDev
AllThreadSearch        1000000        1000            2.872 s  0.0554 s  0.0701 s
FewThreadSearch        1000000        1000            2.384 s  0.0471 s  0.0612 s
ParallelThreadSearch   1000000        1000            2.449 s  0.0368 s  0.0344 s
These results support your question.
FWIW I did another test run,
Method                 inputDataSize  searchDataSize     Mean     Error    StdDev
AllThreadSearch        20000000       40              1.972 s  0.0392 s  0.1045 s
FewThreadSearch        20000000       40              1.718 s  0.0501 s  0.1477 s
ParallelThreadSearch   20000000       40              1.978 s  0.0454 s  0.0523 s
This may help distinguish the cost of context switching versus thread creation but ultimately, there must be an element of both.
There is a little speculation here but, based on our aggregated results, here are a few assertions and a conclusion.
Creating a Thread incurs some fixed overhead. When the work is large, the overhead becomes insignificant.
The operating system and processor architecture can only run a certain number of CPU threads at a time. Some amount of CPU time is reserved for the many operations that keep the computer running behind the scenes, and a chunk of it is consumed by background processes and services unrelated to this test.
Even if we have an 8-core CPU and spawn just 2 threads, we cannot expect both threads to progress through the program at exactly the same rate.
Accepting the points above, whether or not the threads are serviced via the .NET ThreadPool, only a finite number can run concurrently. Even if all instantiated threads progress to some semaphore, they did not all get there at once, and they will not all proceed at once. If we have more threads than available cores, some threads will have to wait before they can progress at all.
Each thread will proceed for a certain time-slice, or until it is waiting for a resource.
This is where the speculation comes in: when inputDataSize is small, the threads will tend to complete their work within one time-slice, requiring little or no context switching.
When inputDataSize becomes sufficiently large, the work cannot be completed within one time-slice, which makes context switching more likely.
So, given a large fixed size for searchDataSize we have three scenarios. The boundaries of these scenarios will depend on the characteristics of the test platform.
inputDataSize is small
Here, the cost of thread creation is significant, AllThreadSearch is massively slower. ParallelThreadSearch tends to win because it minimizes the cost of thread creation.
inputDataSize is medium
The cost of thread creation is insignificant. Crucially, the work can be completed in one time slice. AllThreadSearch makes use of OS-level scheduling and avoids the small but significant overhead of both Parallel.For and the bucket looping in FewThreadSearch. Somewhere in this area is the sweet spot for AllThreadSearch; for some combinations it may even be the fastest option.
inputDataSize is large
Crucially, the work cannot be completed in one time slice. Both the OS scheduler and the ThreadPool fail to anticipate the cost of context switching. Without some expensive heuristics how could they? FewThreadSearch wins out because it avoids the context switching, the cost of which outweighs the cost of bucket looping.
As ever, if you care about performance it pays to benchmark, on a representative system, with a representative workload, with a representative configuration.
First you have to understand the difference between a process and a thread to dive into the benefits of concurrency for achieving faster results than sequential programming.
Process - we can think of it as an instance of a program in execution. The operating system creates processes while executing an application; an application can have one or more processes. Process creation is a somewhat costly job for the operating system, as it needs to provide several resources at creation time: memory, registers, open handles to system objects, a security context, and so on.
Thread - the entity within a process that can be scheduled for execution (it runs a part of your code). Unlike process creation, thread creation is not costly or time-consuming, because threads share the virtual address space and system resources of the process they belong to. This improves the performance of the OS, as it does not need to provide separate resources for each thread it creates.
As threads share resources and are concurrent in nature, they can run in parallel and produce improved results. If your application needs to be highly parallel, you can use a thread pool (a collection of worker threads) to execute asynchronous callbacks efficiently.
And to correct your final assumption/question: creating/destroying threads is not as costly as creating/destroying processes, so "properly handled threading code" will always benefit the performance of the application.
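As a hedged illustration of the thread-pool point above (the RunOnPool helper and the CountdownEvent are my choices, not from the answer):
using System;
using System.Threading;

static void RunOnPool(Action[] workItems)
{
    // Queue each item to the shared .NET thread pool and wait for all to finish
    using (var done = new CountdownEvent(workItems.Length))
    {
        foreach (var work in workItems)
        {
            var w = work; // capture a copy for the closure
            ThreadPool.QueueUserWorkItem(_ => { w(); done.Signal(); });
        }
        done.Wait();
    }
}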
It is simply because you can't create more threads than your CPU has the capacity to run ... so actually, in both cases, you are running the same number of threads: your CPU's max ...

Multithreading iterative-intensive work

I'm creating a game with loads of asteroids floating in a huge space. To accommodate the heavy collision detection, I've split the world into "AsteroidChunks", where only the closest chunks are used for collision detection. The work to split the asteroids into the chunks is extremely repetitive, with something like 1.2 billion iterations made every time the game starts up. However, I've noticed that the game is not actually using that much CPU power. Not a single core was running at 100%! (The game is based on MonoGame for Windows.)
My idea is that by running the most iteration-heavy stuff on multiple threads (something like giving each thread 1000 iterations to complete) I would be able to get the job done quicker, at the expense of CPU power of course. Am I on the right track, and how should I proceed?
Is there any other way of speeding up the iterations, allowing for a larger 3D space?
This is what I've tried so far. It's not working, but it gives an idea of what I want to accomplish (asteroidChunks is a List with approx. 16,000 entries, but could easily be > 100,000):
// Note: the array is sized asteroidChunks.Count, but the loop below fills only
// every 1000th slot, leaving the rest null, so Start() and Task.WaitAll will
// throw. The loop variable i is also captured by the lambdas while it changes.
Task[] populateChunkTasks = new Task[asteroidChunks.Count];
for (int i = 0; i < asteroidChunks.Count; i += 1000)
    populateChunkTasks[i] = new Task(() =>
    {
        for (int x = i; x < asteroidChunks.Count && x < i + 1000; x++)
            PopulateChunk(asteroidChunks[x]);
    });
for (int i = 0; i < populateChunkTasks.Length; i++)
    populateChunkTasks[i].Start();
Task.WaitAll(populateChunkTasks);
As you can see I want to wait for the threads to complete at the end. However, the operation itself is in its own Task, which is why I can't wait for all tasks - in this case - to complete before proceeding.
PopulateChunk (asteroidList is a list with approx. 75,000 entries, but could easily be > 1 million):
public static AsteroidChunk PopulateChunk(AsteroidChunk chunk)
{
    foreach (Asteroid asteroid in asteroidList)
        if (chunk.box.Contains(asteroid.boundingSphere) == ContainmentType.Contains)
            chunk.asteroids.Add(asteroid);
    return chunk;
}
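One hedged way to repair the task setup above (assuming .NET 4.5's Task.Run is available): size the task list by the number of 1000-item batches rather than by asteroidChunks.Count, and capture a copy of the loop variable so each task keeps its own range.
var populateChunkTasks = new List<Task>();
for (int i = 0; i < asteroidChunks.Count; i += 1000)
{
    int start = i; // copy: each task gets a stable range start
    populateChunkTasks.Add(Task.Run(() =>
    {
        for (int x = start; x < asteroidChunks.Count && x < start + 1000; x++)
            PopulateChunk(asteroidChunks[x]);
    }));
}
Task.WaitAll(populateChunkTasks.ToArray()); // no null entries to trip over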

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:
// testSamples and trainSamples consist of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();

    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }

    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without them, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here) which may be hitting you in the Parallel.For callback.
Similarly, there is a small performance penalty for indexing through an array via its IList interface. You might consider declaring the array arguments to GetDistance() explicitly.
How large is K compared to the size of the training array? You are completely sorting the distances array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance a SortedSet from which you evict the largest element whenever the set size exceeds K. A sketch combining this with the outer-loop suggestion follows below.
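A hedged sketch of those two suggestions together (assuming C# 7 tuples; the bounded SortedSet is my choice of selection structure, not the only option): parallelize over test samples, and keep only the K best candidates instead of sorting all of the distances.
Parallel.For(0, testNumber, tst =>
{
    // Per-iteration scratch state, so nothing is shared between threads
    var best = new SortedSet<(double Distance, int Index)>();
    for (int trn = 0; trn < trainNumber; trn++)
    {
        best.Add((GetDistance(testSamples[tst], trainSamples[trn]), trn));
        if (best.Count > K)
            best.Remove(best.Max); // evict the current worst (largest distance)
    }
    // Majority vote over the K nearest neighbours
    int yea = 0;
    foreach (var (_, index) in best)
        if (trainClasses[index] == 1) yea++;
    testResults[tst] = (yea * 2 > K) ? 1 : 0;
});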

No performance gains with Parallel.ForEach and Regex?

I have a program which color-codes a returned results set a certain way depending on what the results are. Due to the length of time it takes to color-code the results (currently being done with Regex and RichTextBox.Select + .SelectionColor), I cut off color-coding at 400 results. At around that number it takes about 20 seconds, which is just about the maximum of what I'd consider reasonable.
To try to improve performance, I rewrote the Regex part to use a Parallel.ForEach loop to iterate through the MatchCollection, but the time was about the same (18-19 seconds vs 20)! Is this just not a job that lends itself to parallel programming very well? Should I try something different? Any advice is welcome. Thanks!
PS: Thought it was a bit strange that my CPU utilization never went above 14%, with or without Parallel.ForEach.
Code
MatchCollection startMatches = Regex.Matches(tempRTB.Text, startPattern);
object locker = new object();

System.Threading.Tasks.Parallel.ForEach(startMatches.Cast<Match>(), m =>
{
    int i = 0;
    foreach (Group g in m.Groups)
    {
        if (i > 0 && i < 5 && g.Length > 0)
        {
            tempRTB.Invoke(new Func<bool>(
                delegate
                {
                    lock (locker)
                    {
                        tempRTB.Select(g.Index, g.Length);
                        if ((i & 1) == 0) // Even number
                            tempRTB.SelectionColor = Namespace.Properties.Settings.Default.ValueColor;
                        else // Odd number
                            tempRTB.SelectionColor = Namespace.Properties.Settings.Default.AttributeColor;
                        return true;
                    }
                }));
        }
        else if (i == 5 && g.Length > 0)
        {
            var result = tempRTB.Invoke(new Func<string>(
                delegate
                {
                    lock (locker)
                    {
                        return tempRTB.Text.Substring(g.Index, g.Length);
                    }
                }));
            MatchCollection subMatches = Regex.Matches((string)result, pattern);
            foreach (Match subMatch in subMatches)
            {
                int j = 0;
                foreach (Group subGroup in subMatch.Groups)
                {
                    if (j > 0 && subGroup.Length > 0)
                    {
                        tempRTB.Invoke(new Func<bool>(
                            delegate
                            {
                                lock (locker)
                                {
                                    tempRTB.Select(g.Index + subGroup.Index, subGroup.Length);
                                    if ((j & 1) == 0) // Even number
                                        tempRTB.SelectionColor = Namespace.Properties.Settings.Default.ValueColor;
                                    else // Odd number
                                        tempRTB.SelectionColor = Namespace.Properties.Settings.Default.AttributeColor;
                                    return true;
                                }
                            }));
                    }
                    j++;
                }
            }
        }
        i++;
    }
});
Virtually no aspect of your program is actually able to run in parallel.
The generation of the matches needs to be done sequentially. It can't find the second match until it has already found the first. Parallel.ForEach will, at best, allow you to process the results of the sequence in parallel, but they are still generated sequentially. This is where the majority of your time consuming work seems to be, and there are no gains there.
On top of that, you aren't really processing the results in parallel either. The majority of code run in the body of your loop is all inside an invoke to the UI thread, which means it's all being run by a single thread.
In short, only a tiny, tiny bit of your program is actually run in parallel, and parallelization in general adds some overhead; it sounds like you're just barely getting more back than that overhead. There isn't really much that you did wrong; the operation just inherently doesn't lend itself to parallelization, unless there is an effective way of breaking up the initial string into several smaller chunks that the regex can parse individually (in parallel).
The most time in your code is most likely spent in the part that actually selects the text in the richtext box and sets the color.
This code is impossible to execute in parallel, because it has to be marshalled to the UI thread - which you do via tempRTB.Invoke.
Furthermore, you explicitly make sure that the highlighting is not executed in parallel but sequentially by using the lock statement. This is unnecessary, because all of that code is run on the single UI thread anyway.
You could try to improve your performance by suspending layout of your UI while you select and color the text in the RTB:
tempRTB.SuspendLayout();
// your loop
tempRTB.ResumeLayout();
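Combining both answers into one hedged sketch (the span-collection approach and names are mine, not from the original; only the question's first branch is shown, and the sub-match branch would be handled the same way): do the group inspection in parallel with no UI calls, collect plain spans, then apply them in a single Invoke.
var spans = new System.Collections.Concurrent.ConcurrentBag<(int Start, int Length, System.Drawing.Color Color)>();

System.Threading.Tasks.Parallel.ForEach(startMatches.Cast<Match>(), m =>
{
    for (int i = 1; i < m.Groups.Count && i < 5; i++)
    {
        Group g = m.Groups[i];
        if (g.Length > 0)
        {
            var color = (i & 1) == 0
                ? Namespace.Properties.Settings.Default.ValueColor
                : Namespace.Properties.Settings.Default.AttributeColor;
            spans.Add((g.Index, g.Length, color)); // no UI work on worker threads
        }
    }
});

tempRTB.Invoke(new Action(() =>
{
    tempRTB.SuspendLayout();
    foreach (var span in spans)
    {
        tempRTB.Select(span.Start, span.Length);
        tempRTB.SelectionColor = span.Color;
    }
    tempRTB.ResumeLayout();
}));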

Should I always use Parallel.Foreach because more threads MUST speed up everything?

Does it make sense to replace every normal foreach with a Parallel.ForEach loop?
When should I start using Parallel.ForEach? Only when iterating 1,000,000 items?
No, it doesn't make sense for every foreach. Some reasons:
Your code may not actually be parallelizable. For example, if you're using the "results so far" for the next iteration and the order is important.
If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly (see the sketch after this answer).
If your work will complete very fast anyway, there's no benefit, and it may well slow things down.
Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (It will be harder for things like debugging.) TPL is great, but it's no free lunch.
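As a hedged sketch of the aggregation point above (source stands for any IEnumerable<int>; this uses the thread-local overload of Parallel.ForEach):
long total = 0;
Parallel.ForEach(
    source,
    () => 0L,                                   // per-thread running total
    (value, state, local) => local + value,     // accumulate without locking
    local => Interlocked.Add(ref total, local)  // merge each thread's total once
);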
No, you should definitely not do that. The important point here is not really the number of iterations, but the work to be done. If your work is really simple, executing 1000000 delegates in parallel will add a huge overhead and will most likely be slower than a traditional single threaded solution. You can get around this by partitioning the data, so you execute chunks of work instead.
E.g. consider the situation below:
Input = Enumerable.Range(1, Count).ToArray();
Result = new double[Count];

Parallel.ForEach(Input, (value, loopState, index) => { Result[index] = value * Math.PI; });
The operation here is so simple, that the overhead of doing this in parallel will dwarf the gain of using multiple cores. This code runs significantly slower than a regular foreach loop.
By using a partition we can reduce the overhead and actually observe a gain in performance.
Parallel.ForEach(Partitioner.Create(0, Input.Length), range =>
{
    for (var index = range.Item1; index < range.Item2; index++)
    {
        Result[index] = Input[index] * Math.PI;
    }
});
The moral of the story here is that parallelism is hard, and you should only employ it after looking closely at the situation at hand. Additionally, you should profile the code both before and after adding parallelism.
Remember that regardless of any potential performance gain parallelism always adds complexity to the code, so if the performance is already good enough, there's little reason to add the complexity.
The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can.
Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
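To illustrate the point about break (items, ShouldStop and Process are hypothetical placeholders): inside Parallel.ForEach you signal through the loop state rather than using the break keyword.
Parallel.ForEach(items, (item, state) =>
{
    if (ShouldStop(item))
        state.Break(); // stops scheduling iterations beyond this index
    else
        Process(item);
});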
Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.
See also Does Parallel.ForEach limits the number of active threads? and Does Parallel.For use one Task per iteration?
The long answer:
We can classify loops by how they fall on two axes:
Few iterations through to many iterations.
Each iteration is fast through to each iteration is slow.
A third factor is if the tasks vary in duration very much – for instance if you are calculating points on the Mandelbrot set, some points are quick to calculate, some take much longer.
When there are few, fast iterations it's probably not worth using parallelisation in any way, most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application so optimise for readability not performance.
Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:
var tasks = new List<Task>(actions.Length);
foreach (var action in actions)
{
    tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());
Where there are many iterations, Parallel.ForEach is in its element.
The Microsoft documentation states that:
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.
I ran some code.
The test results below show a machine with nothing else running on it, and no other threads from the .Net Thread Pool in use. This is not typical (in fact in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.
The test code is:
namespace ParallelTests
{
    class Program
    {
        private static int Fibonacci(int x)
        {
            if (x <= 1)
            {
                return 1;
            }
            return Fibonacci(x - 1) + Fibonacci(x - 2);
        }

        private static void DummyWork()
        {
            var result = Fibonacci(10);
            // inspect the result so it is not optimised away.
            // We know that the exception is never thrown. The compiler does not.
            if (result > 300)
            {
                throw new Exception("failed to do it");
            }
        }

        private const int TotalWorkItems = 2000000;

        private static void SerialWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            for (int index1 = 0; index1 < outerWorkItems; index1++)
            {
                InnerLoop(innerLoopLimit);
            }
        }

        private static void InnerLoop(int innerLoopLimit)
        {
            for (int index2 = 0; index2 < innerLoopLimit; index2++)
            {
                DummyWork();
            }
        }

        private static void ParallelWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            var outerRange = Enumerable.Range(0, outerWorkItems);
            Parallel.ForEach(outerRange, index1 =>
            {
                InnerLoop(innerLoopLimit);
            });
        }

        private static void TimeOperation(string desc, Action operation)
        {
            Stopwatch timer = new Stopwatch();
            timer.Start();
            operation();
            timer.Stop();
            string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed);
            Console.WriteLine(message);
        }

        static void Main(string[] args)
        {
            TimeOperation("serial work: 1", () => Program.SerialWork(1));
            TimeOperation("serial work: 2", () => Program.SerialWork(2));
            TimeOperation("serial work: 3", () => Program.SerialWork(3));
            TimeOperation("serial work: 4", () => Program.SerialWork(4));
            TimeOperation("serial work: 8", () => Program.SerialWork(8));
            TimeOperation("serial work: 16", () => Program.SerialWork(16));
            TimeOperation("serial work: 32", () => Program.SerialWork(32));
            TimeOperation("serial work: 1k", () => Program.SerialWork(1000));
            TimeOperation("serial work: 10k", () => Program.SerialWork(10000));
            TimeOperation("serial work: 100k", () => Program.SerialWork(100000));
            TimeOperation("parallel work: 1", () => Program.ParallelWork(1));
            TimeOperation("parallel work: 2", () => Program.ParallelWork(2));
            TimeOperation("parallel work: 3", () => Program.ParallelWork(3));
            TimeOperation("parallel work: 4", () => Program.ParallelWork(4));
            TimeOperation("parallel work: 8", () => Program.ParallelWork(8));
            TimeOperation("parallel work: 16", () => Program.ParallelWork(16));
            TimeOperation("parallel work: 32", () => Program.ParallelWork(32));
            TimeOperation("parallel work: 64", () => Program.ParallelWork(64));
            TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000));
            TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000));
            TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000));
            Console.WriteLine("done");
            Console.ReadLine();
        }
    }
}
the results on a 4-core Windows 7 machine are:
serial work: 1 took 00:02.31
serial work: 2 took 00:02.27
serial work: 3 took 00:02.28
serial work: 4 took 00:02.28
serial work: 8 took 00:02.28
serial work: 16 took 00:02.27
serial work: 32 took 00:02.27
serial work: 1k took 00:02.27
serial work: 10k took 00:02.28
serial work: 100k took 00:02.28
parallel work: 1 took 00:02.33
parallel work: 2 took 00:01.14
parallel work: 3 took 00:00.96
parallel work: 4 took 00:00.78
parallel work: 8 took 00:00.84
parallel work: 16 took 00:00.86
parallel work: 32 took 00:00.82
parallel work: 64 took 00:00.80
parallel work: 1k took 00:00.77
parallel work: 10k took 00:00.78
parallel work: 100k took 00:00.77
done
Running the code compiled against .NET 4 and .NET 4.5 gives much the same results.
The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.
The parallel work with 1 iteration is slightly longer than no parallelism at all. 2 items is shorter, so is 3, and with 4 or more iterations it is all about 0.8 seconds.
It is using all cores, but not with 100% efficiency. If the serial work were divided 4 ways with no overhead, it would complete in 0.57 seconds (2.28 / 4 = 0.57).
In other scenarios I saw no speed-up at all with parallel 2-3 iterations. You do not have fine-grained control over that with Parallel.ForEach, and the algorithm may decide to "partition" them into just 1 chunk and run it on 1 core if the machine is busy.
There is no lower limit for doing parallel operations. If you have only 2 items to work on but each one will take a while, it might still make sense to use Parallel.ForEach. On the other hand if you have 1000000 items but they don't do very much, the parallel loop might not go any faster than the regular loop.
For example, I wrote a simple program to time nested loops where the outer loop ran both with a for loop and with Parallel.ForEach. I timed it on my 4-CPU (dual-core, hyperthreaded) laptop.
Here's a run with only 2 items to work on, but each takes a while:
2 outer iterations, 100000000 inner iterations:
for loop: 00:00:00.1460441
ForEach : 00:00:00.0842240
Here's a run with millions of items to work on, but they don't do very much:
100000000 outer iterations, 2 inner iterations:
for loop: 00:00:00.0866330
ForEach : 00:00:02.1303315
The only real way to know is to try it.
In general, once you go above a thread per core, each extra thread involved in an operation will make it slower, not faster.
However, if part of each operation will block (the classic example being waiting on disk or network I/O, another being producers and consumers that are out of synch with each other) then more threads than cores can begin to speed things up again, because tasks can be done while other threads are unable to make progress until the I/O operation returns.
For this reason, when single-core machines were the norm, the only real justifications in multi-threading were when either there was blocking of the sort I/O introduces or else to improve responsiveness (slightly slower to perform a task, but much quicker to start responding to user-input again).
Still, these days single-core machines are increasingly rare, so it would appear that you should be able to make everything at least twice as fast with parallel processing.
This will still not be the case if order is important, or if something inherent to the task forces it to have a synchronised bottleneck, or if the number of operations is so small that the increase in speed from parallel processing is outweighed by the overhead of setting up that parallel processing. It may or may not be the case if a shared resource requires threads to block on other threads performing the same parallel operation (depending on the degree of lock contention).
Also, if your code is inherently multithreaded to begin with, you can end up essentially competing for resources with yourself (a classic case being ASP.NET code handling simultaneous requests). Here the advantage of parallel operation may mean that a single test operation on a 4-core machine approaches 4 times the performance, but once the number of requests needing the same task performed reaches 4, then, since each of those requests is trying to use every core, it becomes little better than if they each had a core to themselves (perhaps slightly better, perhaps slightly worse). The benefit of parallel operation hence disappears as the use changes from a single-request test to a real-world multitude of requests.
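A small hedged demonstration of the blocking point above (Thread.Sleep stands in for disk or network latency; the helper and numbers are illustrative, not from the answer). Because a sleeping thread does not occupy a core, 16 threads finish roughly 4x sooner than 4 threads even on a 4-core machine:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

static TimeSpan TimeBlockingWork(int threadCount, int totalItems)
{
    var watch = Stopwatch.StartNew();
    var threads = new List<Thread>();
    for (int t = 0; t < threadCount; t++)
    {
        var thread = new Thread(() =>
        {
            for (int i = 0; i < totalItems / threadCount; i++)
                Thread.Sleep(10); // blocked on simulated I/O; the core is free
        });
        thread.Start();
        threads.Add(thread);
    }
    foreach (var t in threads) t.Join();
    watch.Stop();
    return watch.Elapsed;
}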
You shouldn't blindly replace every single foreach loop in your application with the parallel foreach. More threads doesn't necessary mean that your application will work faster. You need to slice the task into smaller tasks which could run in parallel if you want to really benefit from multiple threads. If your algorithm is not parallelizable you won't get any benefit.
No. You need to understand what the code is doing and whether it is amenable to parallelization. Dependencies between your data items can make it hard to parallelize: if a thread uses the value calculated for the previous element, it has to wait until that value is calculated anyway and can't run in parallel. You also need to understand your target architecture, though you will typically have a multicore CPU on just about anything you buy these days. Even on a single core you can get some benefit from more threads, but only if you have some blocking tasks. You should also keep in mind that there is overhead in creating and organizing the parallel threads. If this overhead is a significant fraction of (or more than) the time your task takes, you could slow it down.
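A one-line illustration of the dependency point (values and GetValues are hypothetical): a running prefix sum can't be parallelized naively, because each element needs the one before it.
double[] values = GetValues(); // hypothetical input
for (int i = 1; i < values.Length; i++)
    values[i] += values[i - 1]; // each iteration depends on the previous one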
These are my benchmarks showing pure serial is slowest, along with various levels of partitioning.
class Program
{
    static void Main(string[] args)
    {
        NativeDllCalls(true, 1, 400000000, 0);   // Seconds: 0.67 |) 595,203,995.01 ops
        NativeDllCalls(true, 1, 400000000, 3);   // Seconds: 0.91 |) 439,052,826.95 ops
        NativeDllCalls(true, 1, 400000000, 4);   // Seconds: 0.80 |) 501,224,491.43 ops
        NativeDllCalls(true, 1, 400000000, 8);   // Seconds: 0.63 |) 635,893,653.15 ops
        NativeDllCalls(true, 4, 100000000, 0);   // Seconds: 0.35 |) 1,149,359,562.48 ops
        NativeDllCalls(true, 400, 1000000, 0);   // Seconds: 0.24 |) 1,673,544,236.17 ops
        NativeDllCalls(true, 10000, 40000, 0);   // Seconds: 0.22 |) 1,826,379,772.84 ops
        NativeDllCalls(true, 40000, 10000, 0);   // Seconds: 0.21 |) 1,869,052,325.05 ops
        NativeDllCalls(true, 1000000, 400, 0);   // Seconds: 0.24 |) 1,652,797,628.57 ops
        NativeDllCalls(true, 100000000, 4, 0);   // Seconds: 0.31 |) 1,294,424,654.13 ops
        NativeDllCalls(true, 400000000, 0, 0);   // Seconds: 1.10 |) 364,277,890.12 ops
    }

    static void NativeDllCalls(bool useStatic, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        if (useStatic)
        {
            Iterate<string, object>(
                (msg, cntxt) =>
                {
                    ServiceContracts.ForNativeCall.SomeStaticCall(msg);
                }
                , "test", null, nonParallelIterations, parallelIterations, maxParallelism);
        }
        else
        {
            var instance = new ServiceContracts.ForNativeCall();
            Iterate(
                (msg, cntxt) =>
                {
                    cntxt.SomeCall(msg);
                }
                , "test", instance, nonParallelIterations, parallelIterations, maxParallelism);
        }
    }

    static void Iterate<T, C>(Action<T, C> action, T testMessage, C context, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        var start = DateTime.UtcNow;
        if (nonParallelIterations == 0)
            nonParallelIterations = 1; // normalize values
        if (parallelIterations == 0)
            parallelIterations = 1;

        if (parallelIterations > 1)
        {
            ParallelOptions options;
            if (maxParallelism == 0) // default max parallelism
                options = new ParallelOptions();
            else
                options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

            if (nonParallelIterations > 1)
            {
                Parallel.For(0, parallelIterations, options
                    , (j) =>
                    {
                        for (int i = 0; i < nonParallelIterations; ++i)
                        {
                            action(testMessage, context);
                        }
                    });
            }
            else
            { // no nonParallel iterations
                Parallel.For(0, parallelIterations, options
                    , (j) =>
                    {
                        action(testMessage, context);
                    });
            }
        }
        else
        {
            for (int i = 0; i < nonParallelIterations; ++i)
            {
                action(testMessage, context);
            }
        }
        var end = DateTime.UtcNow;
        Console.WriteLine("\tSeconds: {0,8:0.00} |) {1,16:0,000.00} ops",
            (end - start).TotalSeconds, (Math.Max(parallelIterations, 1) * nonParallelIterations / (end - start).TotalSeconds));
    }
}
