I have a simple program that searches linearly in an array of 2D points. I do 1000 searches into an array of 1 000 000 points.
The curious thing is that if I spawn 1000 threads, the program runs as fast as when I spawn only as many threads as I have CPU cores, or when I use Parallel.For. This is contrary to everything I know about creating threads. Creating and destroying threads is expensive, but apparently not in this case.
Can someone explain why?
Note: this is a methodological example; the search algorithm is deliberately not meant to be optimal. The focus is on threading.
Note 2: I tested on a 4-core i7 and a 3-core AMD; the results follow the same pattern!
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
/// <summary>
/// We search for closest points.
/// For every point in array searchData, we search into inputData for the closest point,
/// and store it at the same position into array resultData;
/// </summary>
class Program
{
class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom (Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
const int inputDataSize = 1_000_000;
static Point[] inputData = new Point[inputDataSize];
const int searchDataSize = 1000;
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
static void GenerateRandomData (Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
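int rangeFrom = workSize * i;
// The last thread also takes the remainder when searchDataSize is not divisible by threadCount.
int rangeTo = (i == threadCount - 1) ? searchDataSize : workSize * (i + 1);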
thread.Start(new int[]{ rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For (0, searchDataSize,
index =>
{
SearchOne(index);
});
}
static void Main(string[] args)
{
Console.Write("Generating data... ");
GenerateRandomData(inputData);
GenerateRandomData(searchData);
Console.WriteLine("Done.");
Console.WriteLine();
Stopwatch watch = new Stopwatch();
Console.Write("All thread searching... ");
watch.Restart();
AllThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Few thread searching... ");
watch.Restart();
FewThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Parallel thread searching... ");
watch.Restart();
ParallelThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.WriteLine();
Console.WriteLine("Press ENTER to quit.");
Console.ReadLine();
}
}
EDIT: Please make sure to run the app outside the debugger. VS Debugger slows down the case of multiple threads.
EDIT 2: Some more tests.
To make it clear, here is updated code that guarantees we do have 1000 threads running at once:
public static void AllThreadSearch()
{
ManualResetEvent startEvent = new ManualResetEvent(false);
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
startEvent.WaitOne();
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
startEvent.Set();
foreach (var t in threads) t.Join();
}
Testing with a smaller array - 100K elements, the results are:
1000 vs 8 threads
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|----------:|-------:|
AllThreadSearch | 323.0 ms | 7.307 ms | 21.546 ms | 1.00 |
FewThreadSearch | 164.9 ms | 3.311 ms | 5.251 ms | 1.00 |
ParallelThreadSearch | 141.3 ms | 1.503 ms | 1.406 ms | 1.00 |
Now, 1000 threads is much slower, as expected. Parallel.For still bests them all, which is also logical.
However, when the array grows to 500K elements (i.e. more work for every thread), things start to look weird:
1000 vs 8, 500K
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|---------:|-------:|
AllThreadSearch | 890.9 ms | 17.74 ms | 30.61 ms | 1.00 |
FewThreadSearch | 712.0 ms | 13.97 ms | 20.91 ms | 1.00 |
ParallelThreadSearch | 714.5 ms | 13.75 ms | 12.19 ms | 1.00 |
It looks like context switching has negligible cost. Thread-creation costs are also relatively small. The only significant cost of having too many threads is the loss of memory (address space), which alone is bad enough.
Now, are thread-creation costs really that small? We've been universally told that creating threads is very bad and context switches are evil.
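One way to put a number on thread-creation cost is to time it in isolation. The following is only a rough sketch: ThreadCost and MicrosecondsPerThread are made-up names, not part of the code above, and the result will vary with the machine, the runtime, and whether a debugger is attached.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ThreadCost
{
    // Hypothetical helper: average wall-clock microseconds needed to
    // create, start and join one thread that does no work at all.
    public static double MicrosecondsPerThread(int count)
    {
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            var thread = new Thread(() => { });
            thread.Start();
            thread.Join();
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds * 1000.0 / count;
    }
}
```

Calling ThreadCost.MicrosecondsPerThread(1000) and comparing the result with the time one SearchOne call takes gives a feel for how much of the total run is creation overhead.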
You may want to consider how the application is accessing memory. In the maximum threads scenario you are effectively accessing memory sequentially, which is efficient from a caching point of view. The approach using a small number of threads is more random, causing cache misses. Depending on the CPU there are performance counters that allow you to measure L1 and L2 cache hits/misses.
I think the real issue (other than memory use) with too many threads is that the CPU may have a hard time optimizing itself because it is switching tasks all the time. In the OP's original benchmark, the threads are all working on the same task and so you aren't seeing that much of a cost for the extra threads.
To simulate threads working on different tasks, I modified Jodrell's reformulation of the original code (labeled "Normal" in the data below). First I optimized memory access by ensuring all the threads work in the same block of memory at the same time, with the block sized to fit in the cache (4 MB), using the method from this cache blocking techniques article. Then I "reversed" that to ensure each set of 4 threads works in a different block of memory. The results for my machine (in ms):
Intel Core i7-5500U CPU 2.40GHz (Max: 2.39GHz) (Broadwell), 1 CPU, 4 logical and 2 physical cores
inputDataSize = 1_000_000; searchDataSize = 1000; blocks used for O/D: 10
Threads 1 2 3 4 6 8 10 18 32 56 100 178 316 562 1000
Normal(N) 5722 3729 3599 2909 3485 3109 3422 3385 3138 3220 3196 3216 3061 3012 3121
Optimized(O) 5480 2903 2978 2791 2826 2806 2820 2796 2778 2775 2775 2805 2833 2866 2988
De-optimized(D) 5455 3373 3730 3849 3409 3350 3297 3335 3365 3406 3455 3553 3692 3719 3900
For O, all the threads worked in the same block of cacheable memory at the same time (where 1 block = 1/10 of inputData). For D, for every set of 4 threads, no thread worked in the same block of memory at the same time. So basically, in the former case access of inputData was able to make use of the cache whereas in the latter case for 4 threads access of inputData was forced to use main memory.
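A minimal sketch of the "optimized" (O) access pattern, assuming flat coordinate arrays instead of the question's Point objects. BlockedSearch.Nearest and its parameters are hypothetical, and this is not the code that produced the numbers above; it only illustrates keeping every thread inside the same block of input until all of them have finished with it.

```csharp
using System;
using System.Threading;

public static class BlockedSearch
{
    // Returns, for each query point, the index of the nearest input point.
    // The input is swept block by block; a Barrier stops any thread from
    // moving on to the next block before the others are done with this one.
    public static int[] Nearest(double[] xs, double[] ys, double[] qx, double[] qy,
                                int threadCount, int blockCount)
    {
        int n = xs.Length, m = qx.Length;
        var best = new int[m];
        var bestDist = new double[m];
        for (int q = 0; q < m; q++) { best[q] = -1; bestDist[q] = double.MaxValue; }

        using (var barrier = new Barrier(threadCount))
        {
            var threads = new Thread[threadCount];
            for (int t = 0; t < threadCount; t++)
            {
                int id = t; // copy, so the lambda does not capture the loop variable
                threads[t] = new Thread(() =>
                {
                    // Each thread owns a contiguous slice of the queries,
                    // so no two threads ever write the same result slot.
                    int qFrom = (int)((long)m * id / threadCount);
                    int qTo = (int)((long)m * (id + 1) / threadCount);
                    for (int b = 0; b < blockCount; b++)
                    {
                        int from = (int)((long)n * b / blockCount);
                        int to = (int)((long)n * (b + 1) / blockCount);
                        for (int q = qFrom; q < qTo; q++)
                        {
                            for (int i = from; i < to; i++)
                            {
                                double dx = xs[i] - qx[q], dy = ys[i] - qy[q];
                                double d = dx * dx + dy * dy; // squared distance is enough for ranking
                                if (d < bestDist[q]) { bestDist[q] = d; best[q] = i; }
                            }
                        }
                        barrier.SignalAndWait();
                    }
                });
                threads[t].Start();
            }
            foreach (var thread in threads) thread.Join();
        }
        return best;
    }
}
```

The barrier adds synchronization cost of its own, so whether this wins over the plain version depends on block size and cache size; it is a sketch to experiment with, not a tuned implementation.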
It's easier to see in charts. These charts have the thread-creation cost subtracted out and note the x-axis is logarithmic and y-axis is truncated to better show the shape of the data. Also, the value for 1 thread has been halved to show the theoretical best multi-threaded performance:
A quick glance above shows the optimized data (O) is indeed faster than the others. It is also more consistent (smoother) because compared to N it is not having to deal with cache-misses. As suggested by Jodrell, there appears to be a sweet spot around 100 threads, which is the number on my system which would allow a thread to complete its work within 1 time-slice. After that, the time increases linearly with number of threads (remember, the x-axis has a logarithmic scale on the chart.)
Comparing the normal and optimized data, the former is quite jagged whereas the latter is smooth. This answer suggested more threads would be more efficient from a caching point of view compared to fewer threads where the memory access could be more "random". The chart below seems to confirm this (note 4 threads is optimal for my machine as it has 4 logical cores):
The de-optimized version is the most interesting. The worst case is with 4 threads, as they have been forced to work in different areas of memory, preventing effective caching. As the number of threads increases, the system is able to cache as threads share blocks of memory. But as the number of threads increases further, presumably the context switching makes it harder for the system to cache again and the results tend back to the worst case:
I think this last chart is what shows the real cost of context-switching. In the original (N) version, the threads are all doing the same task. As a result there is limited competition for resources other than CPU time and the CPU is able to optimize itself for the workload (i.e. cache effectively.) If the threads are all doing different things, then the CPU isn't able to optimize itself and a severe performance hit results. So it's not directly the context switching that causes the problem, but the competition for resources.
In this case, the difference for 4 threads between N (2909 ms) and D (3849 ms) is 940 ms. This represents a 32% performance hit. Because my machine has a shared L3 cache, this performance hit shows up even with only 4 threads.
I took the liberty of rearranging your code to run using BenchmarkDotNet. It looks like this:
using System;
using System.Collections.Generic;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace Benchmarks
{
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom(Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
[ClrJob(baseline: true)]
public class SomeVsMany
{
[Params(1000)]
public static int inputDataSize = 1000;
[Params(10)]
public static int searchDataSize = 10;
static Point[] inputData = new Point[inputDataSize];
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
[GlobalSetup]
public static void Setup()
{
GenerateRandomData(inputData);
GenerateRandomData(searchData);
}
[Benchmark]
public static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
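int rangeFrom = workSize * i;
// The last thread also takes the remainder when searchDataSize is not divisible by threadCount.
int rangeTo = (i == threadCount - 1) ? searchDataSize : workSize * (i + 1);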
thread.Start(new int[] { rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For(0, searchDataSize,
index =>
{
SearchOne(index);
});
}
private static void GenerateRandomData(Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
}
public class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<SomeVsMany>();
}
}
}
When I run the benchmark I get these results,
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.14393.2485 (1607/AnniversaryUpdate/Redstone1)
Intel Core i7-7600U CPU 2.80GHz (Max: 2.90GHz) (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
Frequency=2835938 Hz, Resolution=352.6170 ns, Timer=TSC
[Host] : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
Job=Clr Runtime=Clr
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000 10 1,276.53us 51.0605us 142.3364us
FewThreadSearch 1000 10 547.72us 24.8199us 70.0049us
ParallelThreadSearch 1000 10 36.54us 0.6973us 0.8564us
These are the kind of results I'd expect, and they differ from what you describe in the question. However, as you correctly identify in the comment, this is because I have reduced the values of inputDataSize and searchDataSize.
If I rerun the test with the original values I get results like this,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000000 1000 2.872s 0.0554s 0.0701s
FewThreadSearch 1000000 1000 2.384s 0.0471s 0.0612s
ParallelThreadSearch 1000000 1000 2.449s 0.0368s 0.0344s
These results support your question.
FWIW I did another test run,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 20000000 40 1.972s 0.0392s 0.1045s
FewThreadSearch 20000000 40 1.718s 0.0501s 0.1477s
ParallelThreadSearch 20000000 40 1.978s 0.0454s 0.0523s
This may help distinguish the cost of context switching versus thread creation but ultimately, there must be an element of both.
There is a little speculation here, but below are a few assertions and a conclusion based on our aggregated results.
Creating a Thread incurs some fixed overhead. When the work is large, the overhead becomes insignificant.
The operating system and processor architecture can only run a certain number of CPU threads at a time. Some amount of CPU time will be reserved for the many operations that keep the computer running behind the scenes. A chunk of that CPU time will be consumed by the background processes and services, not related to this test.
Even if we have an 8-core CPU and spawn just 2 threads, we cannot expect both threads to progress through the program at exactly the same rate.
Accepting the points above, whether or not the threads are serviced via a .Net ThreadPool, only a finite number can be serviced concurrently. Even if all instantiated threads are progressed to some semaphore, they did not all get there at once and they will not all proceed at once. If we have more threads than available cores, some threads will have to wait before they can progress at all.
Each thread will proceed for a certain time-slice or until it is waiting for a resource.
This is where the speculation comes in: when inputDataSize is small, the threads will tend to complete their work within one time-slice, requiring little or no context switching.
When inputDataSize becomes sufficiently large, the work cannot be completed within one time-slice, which makes context switching more likely.
So, given a large fixed size for searchDataSize we have three scenarios. The boundaries of these scenarios will depend on the characteristics of the test platform.
inputDataSize is small
Here, the cost of thread creation is significant, AllThreadSearch is massively slower. ParallelThreadSearch tends to win because it minimizes the cost of thread creation.
inputDataSize is medium
The cost of thread creation is insignificant. Crucially, the work can be completed in one time slice. AllThreadSearch makes use of OS-level scheduling and avoids the reasonable but significant overhead of both the Parallel.For and the bucket looping in FewThreadSearch. Somewhere in this area is the sweet spot for AllThreadSearch; for some combinations AllThreadSearch may even be the fastest option.
inputDataSize is large
Crucially, the work cannot be completed in one time slice. Both the OS scheduler and the ThreadPool fail to anticipate the cost of context switching. Without some expensive heuristics how could they? FewThreadSearch wins out because it avoids the context switching, the cost of which outweighs the cost of bucket looping.
As ever, if you care about performance it pays to benchmark, on a representative system, with a representative workload, with a representative configuration.
First you have to understand the difference between a process and a thread before diving into how concurrency can give faster results than sequential programming.
Process - an instance of a program in execution. The operating system creates processes when it runs an application, and an application can have one or more processes. Creating a process is a somewhat costly job for the operating system, because it has to provide several resources, such as memory, registers, open handles to system objects, a security context, etc.
Thread - the entity within a process that can be scheduled for execution (it can be a part of your code). Unlike process creation, thread creation is comparatively cheap, because threads share the virtual address space and system resources of the process they belong to. This helps OS performance, as it does not need to provide those resources for each thread it creates.
The diagram below illustrates this better than my words.
Because threads share resources and are concurrent by nature, they can run in parallel and produce improved results. If your application needs to be highly parallel, you can use a ThreadPool (a collection of worker threads) to execute asynchronous callbacks efficiently.
And to correct your final assumption/question: creating/destroying threads is less costly than creating/destroying processes, so "properly handled threading code" will always benefit the performance of the application.
It is simply because you can't run more threads at the same time than your CPU has capacity for ... so in both cases the same number of threads is actually executing at any moment: your CPU's maximum ...
Related
I am running the Hello World example of hybridizer-basic-samples, but execution takes more time on the GPU than on the CPU.
[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
Parallel.For(0, N, i => { a[i] += b[i]; });
}
static void Main(string[] args)
{
int N = 1024 * 1024 * 16;
double[] acuda = new double[N];
double[] adotnet = new double[N];
double[] b = new double[N];
Random rand = new Random();
for (int i = 0; i < N; ++i)
{
acuda[i] = rand.NextDouble();
adotnet[i] = acuda[i];
b[i] = rand.NextDouble();
}
cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);
HybRunner runner = HybRunner.Cuda().SetDistrib(prop.multiProcessorCount * 16, 128);
dynamic wrapped = runner.Wrap(new Program());
// run the method on GPU
var watch = System.Diagnostics.Stopwatch.StartNew();
wrapped.Run(N, acuda, b);
watch.Stop();
Console.WriteLine($"Execution Time: {watch.ElapsedMilliseconds} ms");
// run .Net method
var watch2 = System.Diagnostics.Stopwatch.StartNew();
Run(N, adotnet, b);
watch2.Stop();
Console.WriteLine($"Execution Time: {watch2.ElapsedMilliseconds} ms");
}
When I run the program, the execution time of Run() on the GPU is always higher than that of the .NET method. For example, the GPU execution took 818 ms, but the CPU took 89 ms. Can anyone please explain the reason?
As mentioned by @InBetween, it is likely that you are measuring compiler overhead. It is good practice to do a warmup pass to let all code compile first, or to use something like BenchmarkDotNet that does that for you.
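A warmup pass can be as simple as calling the code once or twice before starting the stopwatch. The helper below is a hypothetical sketch (Timing and MeasureAfterWarmup are illustrative names, not part of Hybridizer or BenchmarkDotNet), and a single timed run is still far noisier than a proper benchmark harness.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action warmupRuns times untimed, so that one-off costs
    // (JIT compilation, lazy initialisation) are paid up front,
    // then times a single further run.
    public static TimeSpan MeasureAfterWarmup(Action action, int warmupRuns = 1)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            action();
        }
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

In the question's program this would look like Timing.MeasureAfterWarmup(() => wrapped.Run(N, acuda, b)), keeping in mind that Run modifies its first array on every call.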
Another possible reason is overhead. When running things on a GPU the system would need to copy the input data to GPU memory, and copy the result back again. There will probably also be other costs involved. Adding numbers together is a very simple operation, so it is likely the processor can run at max theoretical speed.
Let's do some back-of-the-envelope calculations. Assume the CPU can do 4 adds per clock (i.e. what 256-bit AVX can do). 4 * 8 bytes per double = 32 bytes per clock, at 4*10^9 clocks per second. This gives 128 GB/s in processing speed. This is significantly higher than the PCIe 3 x16 bandwidth of 16 GB/s. You will probably not reach this speed due to other limitations, but it shows that the limiting factor is probably not the processor itself, so using a GPU will probably not improve things.
GPU processing should show better gains when using more complicated algorithms that do more processing for each data-item.
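The same back-of-the-envelope estimate can be applied to the question's array size. The sketch below assumes that both input arrays are copied to the GPU and one is copied back (three transfers of N doubles); that transfer pattern is an assumption, not something confirmed for Hybridizer, and TransferEstimate is a made-up name. The rates are the ones quoted in this answer.

```csharp
using System;

public static class TransferEstimate
{
    public const double CpuBytesPerSecond = 4 * sizeof(double) * 4e9; // 4 adds/clock * 8 bytes * 4 GHz = 128 GB/s
    public const double PcieBytesPerSecond = 16e9;                    // PCIe 3 x16, as quoted in the answer

    // Seconds needed to move or process `bytes` at a sustained rate.
    public static double Seconds(long bytes, double bytesPerSecond)
    {
        return bytes / bytesPerSecond;
    }

    // Bytes touched for N doubles, assuming a and b go in and a comes back.
    public static long BytesMoved(long n)
    {
        return 3 * n * sizeof(double);
    }
}
```

For N = 1024 * 1024 * 16 this is about 403 MB, which works out to roughly 25 ms over PCIe versus roughly 3 ms at the idealised CPU rate. Neither figure is a measurement, but the ratio suggests the transfer alone can outweigh such a simple computation.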
I’m working on a Genetic Machine Learning project developed in .Net (as opposed to Matlab – My Norm). I’m no pro .net coder so excuse any noobish implementations.
The project itself is huge so I won’t bore you with the full details, but basically a population of Artificial Neural Networks (like decision trees) are each evaluated on a problem domain that in this case uses a stream of sensory inputs. The top performers in the population are allowed to breed and produce offspring (that inherit tendencies from both parents) and the poor performers are killed off or bred out of the population. Evolution continues until an acceptable solution is found. Once found, the final evolved ‘Network’ is extracted from the lab and placed in a light-weight real-world application. The technique can be used to develop very complex control solutions that would be almost impossible or too time consuming to program normally, like automated car driving, mechanical stability control, datacentre load balancing, etc.
Anyway, the project has been a huge success so far and is producing amazing results, but the only problem is the very slow performance once I move to larger datasets. I’m hoping it is just my code, so I would really appreciate some expert help.
In this project, convergence to a solution close to an ideal can often take around 7 days of processing! Just making a little tweak to a parameter and waiting for results is just too painful.
Basically, multiple parallel threads need to read sequential sections of a very large dataset (the data does not change once loaded). The dataset consists of around 300 to 1000 Doubles in a row and anything over 500k rows. As the dataset can exceed the .Net object limit of 2GB, it can’t be stored in a normal 2D array, so the simplest way round this was to use a generic List of one-dimensional arrays.
The parallel scalability seems to be a big limiting factor, as running the code on a beast of a server with 32 Xeon cores that normally eats big datasets for breakfast does not yield much of a performance gain over a Core i3 desktop!
Performance gains quickly dwindle away as the number of cores increases.
From profiling the code (with my limited knowledge) I get the impression that there is a huge amount of contention reading the dataset from multiple threads.
I’ve tried experimenting with different dataset implementations using Jagged arrays and various concurrent collections but to no avail.
I’ve knocked up a quick and dirty bit of code for benchmarking that is similar to the core implementation of the original and still exhibits the similar read performance issues and parallel scalability issues.
Any thoughts or suggestions would be much appreciated or confirmation that this is the best I’m going to get.
Many thanks
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
//Benchmark script to time how long it takes to read dataset per iteration
namespace Benchmark_Simple
{
class Program
{
public static TrainingDataSet _DataSet;
public static int Features = 100; //Real test will require 300+
public static int Rows = 200000; //Real test will require 500K+
public static int _PopulationSize = 500; //Real test will require 1000+
public static int _Iterations = 10;
public static List<NeuralNetwork> _NeuralNetworkPopulation = new List<NeuralNetwork>();
static void Main()
{
Stopwatch _Stopwatch = new Stopwatch();
//Create Dataset
Console.WriteLine("Creating Training DataSet");
_DataSet = new TrainingDataSet(Features, Rows);
Console.WriteLine("Finished Creating Training DataSet");
//Create Neural Network Population
for (int i = 0; i <= _PopulationSize - 1; i++)
{
_NeuralNetworkPopulation.Add(new NeuralNetwork());
}
//Main Loop
for (int i = 0; i <= _Iterations - 1; i++)
{
_Stopwatch.Restart();
Parallel.ForEach(_NeuralNetworkPopulation, _Network => { EvaluateNetwork(_Network); });
//######## Removed for simplicity ##########
//Run Evolutionary Genetic Algorithm on population - I.E. Breed the strong, kill off the weak
//##########################################
//Repeat until acceptable solution is found
Console.WriteLine("Iteration time: {0}", _Stopwatch.ElapsedMilliseconds / 1000);
_Stopwatch.Stop();
}
Console.ReadLine();
}
private static void EvaluateNetwork(NeuralNetwork Network)
{
//Evaluate network on 10% of the Training Data at a random starting point
double Score = 0;
Random Rand = new Random();
int Count = (Rows / 100) * 10;
int RandonStart = Rand.Next(0, Rows - Count);
//The data must be read sequentially
for (int i = RandonStart; i <= RandonStart + Count; i++)
{
double[] NetworkInputArray = _DataSet.GetDataRow(i);
//####### Dummy Evaluation - just give it something to do for the sake of it
double[] Temp = new double[NetworkInputArray.Length + 1];
for (int j = 0; j <= NetworkInputArray.Length - 1; j++)
{
Temp[j] = Math.Log(NetworkInputArray[j] * Rand.NextDouble());
}
Score += Rand.NextDouble();
//##################
}
Network.Score = Score;
}
public class TrainingDataSet
{
//Simple demo class of fake data for benchmarking
private List<double[]> DataList = new List<double[]>();
public TrainingDataSet(int Features, int Rows)
{
Random Rand = new Random();
for (int i = 1; i <= Rows; i++)
{
double[] NewRow = new double[Features];
for (int j = 0; j <= Features - 1; j++)
{
NewRow[j] = Rand.NextDouble();
}
DataList.Add(NewRow);
}
}
public double[] GetDataRow(int Index)
{
return DataList[Index];
}
}
public class NeuralNetwork
{
//Simple Class to represent a dummy Neural Network -
private double _Score;
public NeuralNetwork()
{
}
public double Score
{
get { return _Score; }
set { _Score = value; }
}
}
}
}
The first thing is that the only way to answer any performance questions is by profiling the application. I'm using the VS 2012 builtin profiler - there are others https://stackoverflow.com/a/100490/19624
From an initial read through the code, i.e. a static analysis the only thing that jumped out at me was the continual reallocation of Temp inside the loop; this is not efficient and if possible needs moving outside of the loop.
With a profiler you can see what's happening:
I profiled first using the code you posted, (top marks to you for posting a full compilable example of the problem, if you hadn't I wouldn't be answering this now).
This shows me that the bulk of the time is spent inside the loop, so I moved the allocation out to the Parallel.ForEach loop.
Parallel.ForEach(_NeuralNetworkPopulation, _Network =>
{
double[] Temp = new double[Features + 1];
EvaluateNetwork(_Network, Temp);
});
So what I can see from the above is that there is 4.4% wastage on the reallocation; but the probably unsurprising thing is that it is the inner loop that is taking 87.6%.
This takes me to my first rule of optimisation, which is to review your algorithm first rather than optimizing the code. A poor implementation of a good algorithm is usually faster than a highly optimized poor algorithm.
Removing the repeated allocation of Temp changes the picture slightly:
Also worth tuning a bit by specifying the parallelism; I've found that Parallel.ForEach is good enough for what I use it for, but again you may get better results from manually partitioning the work up into queues.
Parallel.ForEach(_NeuralNetworkPopulation,
new ParallelOptions { MaxDegreeOfParallelism = 32 },
_Network =>
{
double[] Temp = new double[Features + 1];
EvaluateNetwork(_Network, Temp);
});
Whilst running I'm getting what I'd expect in terms of CPU usage: although my machine was also running another lengthy process which was taking the base level (the peak in the chart below is when profiling this program).
So to summarize:
1. Review the most frequently executed part and come up with a new algorithm if possible.
2. Profile on the target machine.
3. Only when you're sure about (1) above is it then worth looking at optimising the algorithm, considering the following:
a) Code optimisations
b) Memory tuning / partitioning of data to keep as much in cache
c) Improvements to threading usage
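The scratch-buffer idea from this answer can be taken one step further by allocating one buffer per worker instead of one per network, using the thread-local overload of Parallel.For. The sketch below is an illustration under assumptions: PerWorkerBuffers, ScoreAll and the dummy scoring formula are invented for the example and are not the question's real evaluation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PerWorkerBuffers
{
    // Scores `networkCount` dummy networks against every row.
    // localInit creates a scratch array once per worker; the body
    // receives it, reuses it, and hands it on to the next iteration.
    public static double[] ScoreAll(IReadOnlyList<double[]> rows, int networkCount, int features)
    {
        var scores = new double[networkCount];
        Parallel.For(0, networkCount,
            () => new double[features],
            (k, state, scratch) =>
            {
                double score = 0;
                foreach (var row in rows)
                {
                    // Dummy work: scale the row into the scratch buffer, then sum it.
                    for (int j = 0; j < features; j++) scratch[j] = row[j] * (k + 1);
                    for (int j = 0; j < features; j++) score += scratch[j];
                }
                scores[k] = score; // each iteration writes only its own slot
                return scratch;
            },
            scratch => { }); // nothing to merge when a worker finishes
        return scores;
    }
}
```

Whether this helps measurably should be checked with the profiler, as the rest of this answer suggests; it mainly reduces garbage-collector pressure from short-lived arrays.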
I'm writing an N-Body simulation, and for computational simplification I've divided the whole space into a number of uniformly-sized regions.
For each body, I compute the force of all other bodies in the same region, and for the other regions I aggregate the mass and distances together so there's less work to be done.
I have a List<Region> and Region defines public void Index() which sums the total mass at this iteration.
I have two variants of my Space.Tick() function:
public void Tick()
{
    foreach (Region r in Regions)
        r.Index();
}
This is very quick. For 20x20x20 = 8000 regions with 100 bodies each = 800000 bodies in total, it only takes about 0.1 seconds to do this. The CPU graph shows 25% utilisation on my quad-core, which is exactly what I would expect.
Now I write this multi-threaded variant:
public void Tick()
{
    Thread[] threads = new Thread[Environment.ProcessorCount];
    foreach (Region r in Regions)
        while (true)
        {
            bool queued = false;
            for (int i = 0; i < threads.Length; i++)
                if (threads[i] == null || !threads[i].IsAlive)
                {
                    Region s = r;
                    threads[i] = new Thread(s.Index);
                    threads[i].Start();
                    queued = true;
                    break;
                }
            if (queued)
                break;
        }
}
So a quick explanation in case it's not obvious: threads is an array of 4, in the case of my CPU. It starts off being 4xnull. For each region, I loop through all 4 Thread objects (which could be null). When I find one that's either null or isn't IsAlive, I queue up the Index() of that Region and Start() it. I set queued to true so that I can tell that the region has started indexing.
This code takes about 7 seconds. That's 70x slower. I understand that there's a bit of overhead involved with setting up the threads, finding a thread that's vacant, etc. But I would still expect that I would have at least some sort of performance gain.
What am I doing wrong?
Why not try PLINQ?
Regions.AsParallel().ForAll(x=>x.Index());
PLINQ is usually SUPER fast for me, and it scales depending on your environment. If the query shouldn't run in parallel, it runs single-threaded.
So, if you had to have a multidimensional array come into the function, you could just do this:
Regions.AsParallel().Cast<Region>().ForAll(x=>x.Index());
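The hand-rolled version in the question creates a new thread for every one of the 8000 regions and spins while waiting for a free slot, which is a likely source of the slowdown. Besides PLINQ, another option is to give each worker a contiguous range of regions via a range partitioner. The sketch below is generic and hypothetical (ChunkedTick and ForEachChunked are illustrative names); it assumes the per-item action is independent of the other items, as Index() appears to be.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ChunkedTick
{
    // Applies `action` to every item, handing each worker a whole
    // index range so that the per-item scheduling overhead stays small.
    public static void ForEachChunked<T>(IList<T> items, Action<T> action)
    {
        if (items.Count == 0) return; // Partitioner.Create rejects an empty range

        Parallel.ForEach(Partitioner.Create(0, items.Count), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                action(items[i]);
            }
        });
    }
}
```

With the question's types this would be called as ChunkedTick.ForEachChunked(Regions, r => r.Index()). For work as small as 0.1 seconds in total, the parallel version may still not beat the plain loop by much.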
I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?
Here is my test code, sequential:
var loops = 200;
var perloop = 10000000;
var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
}
and parallel:
sum = 0.0;
Parallel.For(0, loops,
k =>
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
});
The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.
In Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.
In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set
ParallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount
but to no avail. It is not clear to me why not all CPU power is assigned to the parallel calculation?
I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.
UPDATE
It was pointed out to me in some of the comments that the CPU I am using only has 4 physical cores, which are visible to the system as 8 cores if hyper-threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.
With hyper-threading disabled, my calculations are now faster, both the parallel and also the (what I thought was) sequential for loop. CPU utilization during the for loop is up to approx. 45% (!!!) and 100% during the Parallel.For loop.
Computation time for the for loop is 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (25% better than when hyper-threading is enabled). The performance ratio with Parallel.For is now only 2.5, running on 4 real cores.
So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand, it is intriguing that CPU utilization is so high during the for loop. Could there be some kind of internal parallelization going on in this loop as well?
Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable each core will have to get access to the same place in system memory, or wait for the other core to finish before accessing it.
You can avoid corruption without locks by using the lighter Interlocked.Add method, which adds a value to the sum atomically using an atomic CPU instruction instead of a lock, but you will still get delays due to contention.
The proper way to do this is to accumulate partial sums in a thread-local variable and add them all to a single global sum at the end. Parallel.For has an overload that does just this. MSDN even has an example using summation at How To: Write a Parallel.For Loop that has Thread Local Variables
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
// Use type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
{
subtotal += nums[j];
return subtotal;
},
(x) => Interlocked.Add(ref total, x)
);
Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.
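Applied to the loop from the question, the same overload could look roughly like this. It is only a sketch: LocalSumDemo and ParallelSum are made-up names, the inner loop starts at 1 as an assumption (with i = 0 the original term (1.0 / i) * i is NaN), and the final merge uses a lock because Interlocked.Add has no overload for double.

```csharp
using System;
using System.Threading.Tasks;

static class LocalSumDemo
{
    // Runs `loops` outer iterations of `perLoop` terms each. Every worker keeps
    // its own subtotal and merges it into the shared total once, when it finishes.
    public static double ParallelSum(int loops, int perLoop)
    {
        double sum = 0.0;
        object sumLock = new object();   // guards the final merge into sum

        Parallel.For<double>(0, loops,
            () => 0.0,                   // initial subtotal for each worker
            (k, loopState, subtotal) =>
            {
                double sumk = 0.0;
                // Assumption: start at 1 to avoid (1.0 / 0) * 0 = NaN.
                for (int i = 1; i <= perLoop; ++i) sumk += (1.0 / i) * i;
                return subtotal + sumk;  // no shared state touched here
            },
            subtotal => { lock (sumLock) { sum += subtotal; } });

        return sum;
    }

    static void Main()
    {
        Console.WriteLine(ParallelSum(200, 1000));
    }
}
```

The lock is taken once per worker rather than once per iteration, so contention on sum should be negligible compared with the original sum += sumk inside the loop body.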
Parallel.For and Parallel.ForEach will use a degree of parallelism that they deem appropriate, balancing the cost of setting up and tearing down threads against the work they expect each thread to perform. .NET 4.5 made several improvements to performance (including more intelligent decisions on the number of threads to spin up) compared to previous .NET versions.
Note that, even if it were to spin up one thread per core, context switches, false sharing issues, resource locks, and other issues may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).
I think the speedup is so low because of per-iteration overhead: Parallel.For invokes the body delegate once for every iteration, and scheduling that work onto threads takes time. When each iteration does little work, this overhead dominates. I would write it like this:
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
Parallel.ForEach(
Partitioner.Create(0, nums.Length),
() => 0L, // long seed: an int partial sum would overflow for this data
(part, loopState, partSum) =>
{
for (int i = part.Item1; i < part.Item2; i++)
{
partSum += nums[i];
}
return partSum;
},
(partSum) =>
{
Interlocked.Add(ref total, partSum);
}
);
Partitioner splits the range into chunks, so each task processes a whole chunk in a plain loop and less time is spent on servicing tasks. If you can, please benchmark this solution and tell us whether it gives a better speedup.
An example comparing foreach and Parallel.ForEach:
for (int i = 0; i < 10; i++)
{
int[] array = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 };
Stopwatch watch = new Stopwatch();
watch.Start();
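//Parallel foreach
//(note: this inner loop runs 1,000,000 iterations while the sequential one below
// runs 10,000,000, so the two timings are not directly comparable)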
Parallel.ForEach(array, line =>
{
for (int x = 0; x < 1000000; x++)
{
}
});
watch.Stop();
Console.WriteLine("Parallel.ForEach {0}", watch.ElapsedMilliseconds); // total ms, not just the ms component
watch = new Stopwatch();
//foreach
watch.Start();
foreach (int item in array)
{
for (int z = 0; z < 10000000; z++)
{
}
}
watch.Stop();
Console.WriteLine("ForEach {0}", watch.ElapsedMilliseconds); // total ms, not just the ms component
Console.WriteLine("####");
}
Console.ReadKey();
My CPU
Intel® Core™ i7-620M Processor (4M Cache, 2.66 GHz)
I'm trying to understand the basics of multi-threading, so I built a little program that raised a few questions. I'd be thankful for any help :)
Here is the little program:
class Program
{
public static int count;
public static int max;
static void Main(string[] args)
{
int t = 0;
DateTime Result;
Console.WriteLine("Enter Max Number : ");
max = int.Parse(Console.ReadLine());
Console.WriteLine("Enter Thread Number : ");
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
foreach (Thread th in MyThreads)
th.Start();
while (count < max)
{
}
Console.WriteLine("Finish , Took : " + (DateTime.Now - Result).ToString() + " With : " + t + " Threads.");
Console.ReadLine();
}
public static void print()
{
while (count < max)
{
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
I checked this with some test runs:
I made the maximum number 100, and the fastest execution time seems to be with 2 threads, which is 80% faster than with 10 threads.
Questions:
1) Threads 4-10 don't print even one time, how can it be?
2) Shouldn't more threads be faster?
I made the maximum number 10000 and disabled printing.
With this configuration, 5 threads seems to be fastest.
Why is there a change compared to the first check?
Also, in this configuration (with printing) all the threads print a few times. Why is that different from the first run, where only a few threads printed?
Is there a way to make all the threads print one by one? In a line or something like that?
Thank you very much for your help :)
Your code is certainly a first step into the world of threading, and you've just experienced the first (of many) headaches!
To start with, static may enable you to share a variable among the threads, but it does not do so in a thread safe manner. This means your count < max expression and count++ are not guaranteed to be up to date or an effective guard between threads. Look at the output of your program when max is only 10 (t set to 4, on my 8 processor workstation):
T0 - 0
T0 - 1
T0 - 2
T0 - 3
T1 - 0 // wait T1 got count = 0 too!
T2 - 1 // and T2 got count = 1 too!
T2 - 6
T2 - 7
T2 - 8
T2 - 9
T0 - 4
T3 - 1 // and T3 got count = 1 too!
T1 - 5
To your question about each thread printing one-by-one, I assume you're trying to coordinate access to count. You can accomplish this with synchronization primitives (such as the lock statement in C#). Here is a naive modification to your code which will ensure only max increments occur:
static object countLock = new object();
public static void printWithLock()
{
// loop forever
while(true)
{
// protect access to count using a static object
// now only 1 thread can use 'count' at a time
lock (countLock)
{
if (count >= max) return;
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
This simple modification makes your program logically correct, but also slow. The sample now exhibits a new problem: lock contention. Every thread is now vying for access to countLock. We've made our program thread safe, but without any benefits of parallelism!
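One way to shrink the contended section, sketched here under assumptions: each thread claims the next number with Interlocked.Increment and then does its work outside any lock. ClaimDemo and Run are made-up names, and the ConcurrentBag only records which numbers were claimed, standing in for real per-item work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

static class ClaimDemo
{
    // Runs threadCount threads that together claim 0..max-1, each number exactly once.
    public static int[] Run(int max, int threadCount)
    {
        int next = 0;                              // next unclaimed number
        var claimed = new ConcurrentBag<int>();    // stand-in for real work
        var threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                while (true)
                {
                    // Atomically take the next number; no two threads get the same one.
                    int n = Interlocked.Increment(ref next) - 1;
                    if (n >= max) return;
                    claimed.Add(n);
                }
            });
            threads[t].Start();
        }

        foreach (Thread th in threads) th.Join();  // wait for all workers

        int[] result = claimed.ToArray();
        Array.Sort(result);
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Run(1000, 4).Length);    // prints 1000
    }
}
```

This only pays off when the per-item work is more than a console write, since console output is itself serialized between threads.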
Threading and parallelism are not particularly easy to get right, but thankfully recent versions of .NET come with the Task Parallel Library (TPL) and Parallel LINQ (PLINQ).
The beauty of the library is how easy it would be to convert your current code:
var sw = new Stopwatch();
sw.Start();
Enumerable.Range(0, max)
.AsParallel()
.ForAll(number =>
Console.WriteLine("T{0}: {1}",
Thread.CurrentThread.ManagedThreadId,
number));
Console.WriteLine("{0} ms elapsed", sw.ElapsedMilliseconds);
// Sample output from max = 10
//
// T9: 3
// T9: 4
// T9: 5
// T9: 6
// T9: 7
// T9: 8
// T9: 9
// T8: 1
// T7: 2
// T1: 0
// 30 ms elapsed
The output above is an interesting illustration of why threading produces "unexpected results" for newer users. When threads execute in parallel, they may complete chunks of code at different points in time or one thread may be faster than another. You never really know with threading!
Your print function is far from thread safe, which is why threads 4-10 don't print. All threads share the same max and count variables.
The reason more threads slow you down is most likely the context switch that takes place each time the processor switches from one thread to another.
Also, when you create a lot of threads, the system needs to allocate each new one. Most of the time it is now advisable to use Tasks instead, as they run on threads pulled from a system-managed thread pool and so a new thread doesn't necessarily have to be allocated. Creating a distinct new thread is rather expensive.
Take a look here anyhow: http://msdn.microsoft.com/en-us/library/aa645740(VS.71).aspx
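A rough sketch of the Task-based approach, with made-up names (TaskDemo, SumWithTasks) and a simple summation standing in for real work. Each task keeps its own local result, so nothing is shared while the tasks run:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class TaskDemo
{
    // Sums 0..max-1 using workerCount tasks scheduled on the thread pool.
    public static long SumWithTasks(int max, int workerCount)
    {
        Task<long>[] tasks = Enumerable.Range(0, workerCount)
            .Select(w => Task.Run(() =>
            {
                long local = 0;
                // Worker w handles w, w + workerCount, w + 2 * workerCount, ...
                for (int n = w; n < max; n += workerCount) local += n;
                return local;
            }))
            .ToArray();

        Task.WaitAll(tasks);                 // block until every task has finished
        return tasks.Sum(t => t.Result);     // combine the per-task results
    }

    static void Main()
    {
        Console.WriteLine(SumWithTasks(10000, 4));   // prints 49995000
    }
}
```

Task.Run queues the work to the thread pool, so existing pool threads can be reused instead of a dedicated thread being created per work item.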
Look carefully:
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
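I think you forgot to use the variable t: the loop bound is hard-coded (i < 31), so 30 threads are always created regardless of the number you enter.
You should read a few books on parallel and multithreaded programming before writing code, because a programming language is just a tool. Good luck!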
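For illustration, here is a sketch of the thread setup with the loop bounded by the requested thread count and with Join in place of the busy-wait loop. ThreadCountDemo and Run are made-up names, console input is replaced by parameters, and the increment is guarded by a lock as an assumption about what the program is meant to do.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

static class ThreadCountDemo
{
    static int count;
    static readonly object countLock = new object();

    // Starts exactly threadCount threads, waits for all of them, returns the final count.
    public static int Run(int max, int threadCount)
    {
        count = 0;
        var threads = new List<Thread>();
        for (int i = 1; i <= threadCount; i++)     // bounded by threadCount, not by 31
        {
            var th = new Thread(() =>
            {
                while (true)
                {
                    lock (countLock)
                    {
                        if (count >= max) return;  // leaving the block releases the lock
                        count++;
                    }
                }
            });
            th.Name = i.ToString();
            threads.Add(th);
        }

        foreach (Thread th in threads) th.Start();
        foreach (Thread th in threads) th.Join();  // wait without spinning on count
        return count;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        int finalCount = Run(10000, 4);
        Console.WriteLine("Finished at {0}, took {1} ms", finalCount, sw.ElapsedMilliseconds);
    }
}
```

Join lets the main thread sleep until the workers finish, whereas the empty while (count < max) loop keeps a core busy and competes with the worker threads.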