I have an array like the one below. I know it should be filled using Thread, but I don't understand how to do it. Do I need to split the array into parts, or can it be done directly?
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int[] a = new int[10000];
Random rand = new Random();
for (int i = 0; i < a.Length; i++)
{
a[i] = rand.Next(-100, 100);
}
foreach (var p in a)
Console.WriteLine(p);
TimeSpan ts = stopWatch.Elapsed;
stopWatch.Stop();
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Another approach, compared to John Wu's, is to use a custom partitioner. I think that it is a little more readable.
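using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
int[] a = new int[10000];
int batchSize = 1000;
Parallel.ForEach(Partitioner.Create(0, a.Length, batchSize), range =>
{
    // Random is not thread-safe, so each range gets its own instance.
    // Seeding from a Guid avoids identical time-based seeds on older runtimes.
    Random rand = new Random(Guid.NewGuid().GetHashCode());
    for (int i = range.Item1; i < range.Item2; i++)
    {
        a[i] = rand.Next(-100, 100);
    }
});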
In modern C#, you should almost never have to use Thread objects directly. They are fraught with peril, and there are other language features that will do the job just as well (see async and the TPL). I'll show you a way to do it with the TPL.
Note: Due to the problem of false sharing, you need to arrange things so that the different threads are working on different memory areas. Otherwise you will see no gain in performance; indeed, performance could get considerably worse. In this example I divide the array into blocks of 4,000 bytes (1,000 elements) each and process each block as a separate parallel work item.
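using System;
using System.Linq;
using System.Threading.Tasks;
var array = new int[10000];
var offsets = Enumerable.Range(0, 10).Select(x => x * 1000);
Parallel.ForEach(offsets, offset =>
{
    // One Random per block: Random is not thread-safe.
    var random = new Random(Guid.NewGuid().GetHashCode());
    for (int i = 0; i < 1000; i++)
    {
        array[offset + i] = random.Next(-100, 100);
    }
});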
That said, I doubt you'll see much of a gain in performance in this example: the array is much too small to be worth the additional overhead.
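Since the question asks about Thread specifically, here is a minimal sketch of the same block-splitting idea with explicit Thread objects. The class name ThreadedFill, the ceiling-division chunking and the tick-based seeds are illustrative assumptions, not part of either answer above; the TPL versions remain the simpler choice.

```csharp
using System;
using System.Threading;

public static class ThreadedFill
{
    // Fills a new array with values in [-100, 100) using explicit Thread objects.
    // Each thread owns one contiguous slice, so no two threads write the same element.
    public static int[] Fill(int length, int threadCount)
    {
        var data = new int[length];
        var threads = new Thread[threadCount];
        int chunk = (length + threadCount - 1) / threadCount; // ceiling division

        for (int t = 0; t < threadCount; t++)
        {
            // Locals declared inside the loop, so each lambda captures its own values.
            int start = Math.Min(t * chunk, length);
            int end = Math.Min(start + chunk, length);
            int seed = Environment.TickCount + t; // assumed seeding scheme, distinct per thread

            threads[t] = new Thread(() =>
            {
                var rand = new Random(seed); // Random is not thread-safe: one instance per thread
                for (int i = start; i < end; i++)
                {
                    data[i] = rand.Next(-100, 100);
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join(); // wait until every slice has been filled
        }
        return data;
    }
}
```

A call such as ThreadedFill.Fill(10000, 4) returns the filled array once all four threads have joined. With only 10,000 elements the thread start-up cost will most likely outweigh any gain.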
I have to create a HashSet containing the integers from 1 to N+1, where N is a large number (1M).
For example, if N = 5, the HashSet will contain the integers { 1, 2, 3, 4, 5, 6 }.
The only way I have found is:
HashSet<int> numbers = new HashSet<int>(N);
for (int i = 1; i <= (N + 1) ; i++)
{
numbers.Add(i);
}
Is there a faster (more efficient) way to do it?
6 is a tiny number of items so I suspect the real problem is adding a few thousand items. The delays in this case are caused by buffer reallocations, not the speed of Add itself.
The solution is to specify a capacity, even an approximate one, when constructing the HashSet:
var set=new HashSet<int>(1000);
If, and only if, the input implements ICollection<T>, the HashSet<T>(IEnumerable<T>) constructor will check the size of input collection and use it as its capacity:
if (collection is ICollection<T> coll)
{
int count = coll.Count;
if (count > 0)
{
Initialize(count);
}
}
Explanation
Most containers in .NET use buffers internally to store data. This is far faster than implementing containers using pointers, nodes etc due to CPU cache and RAM access delays. Accessing the next item in the CPU's cache is far faster than chasing a pointer in RAM in all CPUs.
The downside is that each time the buffer is full, a new one has to be allocated. Typically, the new buffer is twice the size of the original. Adding N items one by one can therefore result in about log2(N) reallocations. This works fine for a moderate number of items but can leave a lot of orphaned buffers when adding, e.g., 1000 items one by one. All those temporary buffers will have to be garbage collected at some point, causing additional delays.
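The reallocation behaviour described above can be observed directly. The sketch below uses List<int> as a stand-in, because it exposes its buffer size through the Capacity property; HashSet<T> grows in a similar spirit, but its exact sizes differ. The class and method names are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class GrowthDemo
{
    // Counts how many times List<int> replaces its internal buffer while
    // itemCount items are added one by one, starting from initialCapacity.
    public static int CountReallocations(int itemCount, int initialCapacity)
    {
        var list = new List<int>(initialCapacity);
        int lastCapacity = list.Capacity;
        int reallocations = 0;

        for (int i = 0; i < itemCount; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity) // Capacity changes only when a new buffer is allocated
            {
                reallocations++;
                lastCapacity = list.Capacity;
            }
        }
        return reallocations;
    }
}
```

Calling GrowthDemo.CountReallocations(1000, 0) reports several buffer replacements, while GrowthDemo.CountReallocations(1000, 1000) reports none, which is the effect that preallocating the capacity is meant to achieve.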
Here's the code to test the three options:
var N = 1000000;
var trials = new List<(int method, TimeSpan duration)>();
for (var i = 0; i < 100; i++)
{
var sw = Stopwatch.StartNew();
HashSet<int> numbers1 = new HashSet<int>(Enumerable.Range(1, N + 1));
sw.Stop();
trials.Add((1, sw.Elapsed));
sw = Stopwatch.StartNew();
HashSet<int> numbers2 = new HashSet<int>(N);
for (int n = 1; n < N + 1; n++)
numbers2.Add(n);
sw.Stop();
trials.Add((2, sw.Elapsed));
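sw = Stopwatch.StartNew(); // restart the timer; otherwise option 3 just reports option 2's time again
HashSet<int> numbers3 = new HashSet<int>(N);
foreach (int n in Enumerable.Range(1, N + 1))
numbers3.Add(n);
sw.Stop();
trials.Add((3, sw.Elapsed));
}
for (int j = 1; j <= 3; j++)
Console.WriteLine(trials.Where(x => x.method == j).Average(x => x.duration.TotalMilliseconds));
Typical output is this:
31.314788
16.493208
16.493208
It is nearly twice as fast to preallocate the capacity of the HashSet<int>.
The identical second and third figures come from a run in which the stopwatch was not restarted before the third option, so they do not show that the traditional loop and the LINQ foreach perform the same. Rerun with the restart in place to compare those two.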
To build on @Enigmativity's answer, here's a proper benchmark using BenchmarkDotNet:
public class Benchmark
{
private const int N = 1000000;
[Benchmark]
public HashSet<int> EnumerableRange() => new HashSet<int>(Enumerable.Range(1, N + 1));
[Benchmark]
public HashSet<int> NoPreallocation()
{
var result = new HashSet<int>();
for (int n = 1; n < N + 1; n++)
{
result.Add(n);
}
return result;
}
[Benchmark]
public HashSet<int> Preallocation()
{
var result = new HashSet<int>(N);
for (int n = 1; n < N + 1; n++)
{
result.Add(n);
}
return result;
}
}
public class Program
{
public static void Main(string[] args)
{
BenchmarkRunner.Run(typeof(Program).Assembly);
}
}
With the results:
| Method          | Mean     | Error    | StdDev   |
|-----------------|----------|----------|----------|
| EnumerableRange | 29.17 ms | 0.743 ms | 2.179 ms |
| NoPreallocation | 23.96 ms | 0.471 ms | 0.775 ms |
| Preallocation   | 11.68 ms | 0.233 ms | 0.665 ms |
As we can see, using LINQ is a bit slower than not using LINQ (as expected), and preallocating saves a significant amount of time.
I have a question concerning parallel for loops. I have the following code:
public static void MultiplicateArray(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
}
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
}
}
Now I am trying to add parallel versions:
public static void MultiplicateArray(double[] array, double factor)
{
Parallel.For(0, array.Length, i =>
{
array[i] = array[i] * factor;
});
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
});
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
});
}
The issue is that I want to save time, not waste it. With the standard for loop the computation takes about 2 minutes, but with the parallel for loop it takes 3 minutes. Why?
Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code, each iteration is very short (basically, just a few CPU instructions), this overhead can become prominent.
Because of this, I think Parallel.For() is not the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.
To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!
First, I'll try to improve the common case and I'll come to those spikes later. We want to process the array on N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item in an array is one of the most optimized constructs. Can't we do something to help the compiler? Computing the loop bound once, outside the loop, could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here; memory bandwidth is.
Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does, why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But, if one thread finishes early, it will try to help processing the range of another thread that hasn't finished yet. If that happens, you will get a lot of false sharing, which could slow down the code a lot. And my own test with forcing false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of the Parallel.For() seems to help with the spikes a little.
Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.
The code I used:
static void Main()
{
double[] array = new double[200 * 1000 * 1000];
for (int i = 0; i < array.Length; i++)
array[i] = 1;
for (int i = 0; i < 5; i++)
{
Stopwatch sw = Stopwatch.StartNew();
Serial(array, 2);
Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelFor(array, 2);
Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelForDegreeOfParallelism(array, 2);
Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallel(array, 2);
Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMax(array, 2);
Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMaxHalfParallelism(array, 2);
Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelFalseSharing(array, 2);
Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
}
}
static void Serial(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
static void ParallelFor(double[] array, double factor)
{
Parallel.For(
0, array.Length, i => { array[i] = array[i] * factor; });
}
static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
Parallel.For(
0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
i => { array[i] = array[i] * factor; });
}
static void CustomParallel(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMax(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount / 2;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelFalseSharing(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
int i = -1;
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
int j = Interlocked.Increment(ref i);
while (j < array.Length)
{
array[j] = array[j] * factor;
j = Interlocked.Increment(ref i);
}
});
}
Task.WaitAll(tasks);
}
Example output:
Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Svick already provided a great answer but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For() but that you have to process larger chunks of data.
This can still be done using Parallel.For() like this:
static void My(double[] array, double factor)
{
int degreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, degreeOfParallelism, workerId =>
{
var max = array.Length * (workerId + 1) / degreeOfParallelism;
for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
array[i] = array[i] * factor;
});
}
which does the same thing as svick's CustomParallelExtractedMax() but is shorter, simpler and (on my machine) even performs slightly faster:
Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s
Btw, the keyword for this which is missing from all the other answers is granularity.
See Custom Partitioners for PLINQ and TPL:
In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.
In your loop body, you are performing a single multiplication, and the overhead of the delegate call will be very noticeable.
Try this:
public static void MultiplicateArray(double[] array, double factor)
{
var rangePartitioner = Partitioner.Create(0, array.Length);
Parallel.ForEach(rangePartitioner, range =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
array[i] = array[i] * factor;
}
});
}
See also: Parallel.ForEach documentation and Partitioner.Create documentation.
Parallel.For involves more complex memory management. The result could vary depending on CPU specs, such as the number of cores and the L1 and L2 caches.
Please take a look at this interesting article:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
From http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx and http://msdn.microsoft.com/en-us/library/dd537608.aspx:
You are not creating three threads/processes that each execute one of your for loops. Instead, the iterations of each for loop are executed in parallel, so even with only one for loop you are using multiple threads/processes.
This means that the iterations with index = 0 and index = 1 may be executed at the same time.
You are probably forcing the use of too many threads/processes, and the overhead of creating and running them is bigger than the speed gain.
Try using three normal for loops, each in its own thread/process. If your system is multicore (at least 3 cores), it should take less than one minute.
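The suggestion above (three ordinary loops, each on its own thread) could look like the following sketch. It assumes the three calls work on three different arrays, which the question does not state; the class name CoarseParallel and the method MultiplyAll are illustrative. Parallel.Invoke is used instead of raw threads to keep the example short.

```csharp
using System;
using System.Threading.Tasks;

public static class CoarseParallel
{
    // Plain serial loop, as in the question.
    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    // Runs three independent serial loops concurrently, one per array.
    // This is safe only because no array is shared between the three calls.
    public static void MultiplyAll(double[] a, double[] b, double[] c, double factor)
    {
        Parallel.Invoke(
            () => MultiplicateArray(a, factor),
            () => MultiplicateArray(b, factor),
            () => MultiplicateArray(c, factor));
    }
}
```

Here the delegate cost is paid three times in total rather than once per element, which is the granularity point made in the other answers. Whether it actually helps depends on how independent the real calls are and on memory bandwidth.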
I have a numerically intensive application and, after looking up GFLOPS figures on the internet, I decided to write my own little benchmark. I ran a single-threaded matrix multiplication thousands of times to get about a second of execution. This is the inner loop:
for (int i = 0; i < SIZEA; i++)
for (int j = 0; j < SIZEB; j++)
vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];
It's been years since I dealt with FLOPS, so I expected to get something around 3 to 6 cycles per FLOP. But I am getting 30 (100 MFLOPS). Surely I will get more if I parallelize this, but I just did not expect that. Could this be a problem with .NET, or is this really the CPU performance?
Here is a fiddle with the full benchmark code.
EDIT: It takes longer when run from Visual Studio, even in release mode; the executable run by itself takes 12 cycles per FLOP (250 MFLOPS). Still, is there any VM impact?
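The figures in the question can be cross-checked with a little arithmetic. The sketch below assumes two floating-point operations per inner-loop iteration (one multiply, one add) and a 3 GHz clock; the clock rate is not stated in the question and is only inferred from 30 cycles per FLOP at 100 MFLOPS. The names are illustrative.

```csharp
using System;

public static class FlopMath
{
    // One pass over the loop nest performs one multiply and one add per (i, j) pair.
    public static long FlopsPerPass(int sizeA, int sizeB)
    {
        return 2L * sizeA * sizeB;
    }

    // MFLOPS from a total operation count and the elapsed wall-clock seconds.
    public static double MegaFlops(long flops, double seconds)
    {
        return flops / seconds / 1e6;
    }

    // Cycles spent per floating-point operation at a given clock rate in Hz.
    public static double CyclesPerFlop(double clockHz, double megaFlops)
    {
        return clockHz / (megaFlops * 1e6);
    }
}
```

Under the assumed 3 GHz clock, FlopMath.CyclesPerFlop(3e9, 100) gives 30 and FlopMath.CyclesPerFlop(3e9, 250) gives 12, matching the two figures quoted in the question.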
Your benchmark doesn't really measure FLOPS; it does some floating-point operations and looping in C#.
However, even if you can isolate your code to a repetition of just floating-point operations, you still have some problems.
Your code should include some "pre-cycles" to allow the JIT compiler (the "jitter") to warm up, so that you are not measuring compile time.
Then, even if you do that, you need to compile in release mode with optimizations on and execute your test from the command line on a known, consistent platform.
Fiddle here
Here is my alternative benchmark,
using System;
using System.Linq;
using System.Diagnostics;
class Program
{
static void Main()
{
const int Flops = 10000000;
var random = new Random();
var output = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var left = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var right = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
for (var i = 0; i < Flops - 1; i++)
{
output[i] = random.NextDouble();
}
timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
Console.WriteLine(
"MFLOPS: {0}",
(double)Flops / timer.ElapsedMilliseconds / 1000.0);
}
}
On my VM I get results like
ms: 73
MFLOPS: 136.986301...
Note, I had to increase the number of operations significantly to get over 1 millisecond.
I spent the last few days creating a parallel version of some code (college work), but I have come to a dead end (at least for me): the parallel version is nearly twice as slow as the sequential one, and I have no clue why. Here is the code:
Variables.GetMatrix();
int ThreadNumber = Environment.ProcessorCount/2;
int SS = Variables.PopSize / ThreadNumber;
//GeneticAlgorithm GA = new GeneticAlgorithm();
Stopwatch stopwatch = new Stopwatch(), st = new Stopwatch(), st1 = new Stopwatch();
List<Thread> ThreadList = new List<Thread>();
//List<Task> TaskList = new List<Task>();
GeneticAlgorithm[] SubPop = new GeneticAlgorithm[ThreadNumber];
Thread t;
//Task t;
ThreadVariables Instance = new ThreadVariables();
stopwatch.Start();
st.Start();
PopSettings();
InitialPopulation();
st.Stop();
//Lots of attributions...
int SPos = 0, EPos = SS;
for (int i = 0; i < ThreadNumber; i++)
{
int temp = i, StartPos = SPos, EndPos = EPos;
t = new Thread(() =>
{
SubPop[temp] = new GeneticAlgorithm(Population, NumSeq, SeqSize, MaxOffset, PopFit, Child, Instance, StartPos, EndPos);
SubPop[temp].RunGA();
SubPop[temp].ShowPopulation();
});
t.Start();
ThreadList.Add(t);
SPos = EPos;
EPos += SS;
}
foreach (Thread a in ThreadList)
a.Join();
double BestFit = SubPop[0].BestSol;
string BestAlign = SubPop[0].TV.Debug;
for (int i = 1; i < ThreadNumber; i++)
{
if (BestFit < SubPop[i].BestSol)
{
BestFit = SubPop[i].BestSol;
BestAlign = SubPop[i].TV.Debug;
Variables.ResSave = SubPop[i].TV.ResSave;
Variables.NumSeq = SubPop[i].TV.NumSeq;
}
}
Basically, the code creates an array of objects, instantiates and runs the algorithm in each position of the array, and collects the best value from the object array at the end. This type of algorithm works on a three-dimensional data array, and in the parallel version I assign each thread one range of the array to process, avoiding concurrent access to the data. Still, I'm getting slow timings... Any ideas?
I'm using a Core i5, which has four logical cores (two physical plus two from hyperthreading), but any number of threads greater than one makes the code run slower.
What I can explain about the code I'm running in parallel is:
The second method called in the code I posted performs about 10,000 iterations, and in each iteration it calls one function. This function may or may not call others (spread across two different objects for each thread) and performs lots of calculations, depending on a number of factors that are particular to the algorithm. All these methods, for a given thread, work in an area of a data array that isn't accessed by the other threads.
With System.Linq you can simplify this a lot:
int ThreadNumber = Environment.ProcessorCount/2;
int SS = Variables.PopSize / ThreadNumber;
int numberOfTotalIterations = // I don't know what goes here.
var doneAlgorithms = Enumerable.Range(0, numberOfTotalIterations)
.AsParallel() // Makes the whole thing running in parallel
.WithDegreeOfParallelism(ThreadNumber) // We don't need this line if you want the system to manage the number of parallel processings.
.Select(index=> _runAlgorithmAndReturn(index,SS))
.ToArray(); // This is unnecessary if you only need doneAlgorithms to determine the best one.
// If not, keep it to prevent multiple enumerations.
// The original loop keeps the largest BestSol, so sort descending and take the first one.
// OrderByDescending causes a full enumeration, which is why ToArray() is unnecessary in that case.
GeneticAlgorithm best = doneAlgorithms.OrderByDescending(algo => algo.BestSol).First();
BestFit = best.BestSol;
BestAlign = best.TV.Debug;
Variables.ResSave = best.TV.ResSave;
Variables.NumSeq = best.TV.NumSeq;
And declare a method to make it a bit more readable
/// <summary>
/// Runs a single algorithm and returns it
/// </summary>
private GeneticAlgorithm _runAlgorithmAndReturn(int index, int SS)
{
int startPos = index * SS;
int endPos = startPos + SS;
var algo = new GeneticAlgorithm(Population, NumSeq, SeqSize, MaxOffset, PopFit, Child, Instance, startPos, endPos);
algo.RunGA();
algo.ShowPopulation();
return algo;
}
There is a big overhead in creating threads.
Instead of creating new threads, use the ThreadPool, as shown below:
Variables.GetMatrix();
int ThreadNumber = Environment.ProcessorCount / 2;
int SS = Variables.PopSize / ThreadNumber;
//GeneticAlgorithm GA = new GeneticAlgorithm();
Stopwatch stopwatch = new Stopwatch(), st = new Stopwatch(), st1 = new Stopwatch();
List<WaitHandle> WaitList = new List<WaitHandle>();
//List<Task> TaskList = new List<Task>();
GeneticAlgorithm[] SubPop = new GeneticAlgorithm[ThreadNumber];
//Task t;
ThreadVariables Instance = new ThreadVariables();
stopwatch.Start();
st.Start();
PopSettings();
InitialPopulation();
st.Stop();
//lots of attributions...
int SPos = 0, EPos = SS;
for (int i = 0; i < ThreadNumber; i++)
{
int temp = i, StartPos = SPos, EndPos = EPos;
ManualResetEvent wg = new ManualResetEvent(false);
WaitList.Add(wg);
ThreadPool.QueueUserWorkItem((unused) =>
{
SubPop[temp] = new GeneticAlgorithm(Population, NumSeq, SeqSize, MaxOffset, PopFit, Child, Instance, StartPos, EndPos);
SubPop[temp].RunGA();
SubPop[temp].ShowPopulation();
wg.Set();
});
SPos = EPos;
EPos += SS;
}
ManualResetEvent.WaitAll(WaitList.ToArray());
double BestFit = SubPop[0].BestSol;
string BestAlign = SubPop[0].TV.Debug;
for (int i = 1; i < ThreadNumber; i++)
{
if (BestFit < SubPop[i].BestSol)
{
BestFit = SubPop[i].BestSol;
BestAlign = SubPop[i].TV.Debug;
Variables.ResSave = SubPop[i].TV.ResSave;
Variables.NumSeq = SubPop[i].TV.NumSeq;
}
}
Note that instead of using Join to wait for the threads to finish, I'm using WaitHandles.
You're creating the threads yourself, so there's some extreme overhead there. Parallelise like the comments suggested. Also make sure the time a single work-unit takes is long enough. A single thread/workunit should be alive for at least ~20 ms.
Pretty basic things really. I'd suggest you really read up on how multi-threading in .NET works.
I see you don't create too many threads. But the optimal threadcount can't be determined just from the processor count. The built-in Parallel class has advanced algorithms to reduce the overall time.
Partitioning and threading are pretty complex things that require a lot of knowledge to get right, so unless you REALLY know what you're doing, rely on the Parallel class to handle it for you.
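As a rough illustration of letting the Parallel class handle the scheduling, the sketch below runs one work item per slice and then picks the largest result, as the loop in the question does. RunSlice is a hypothetical stand-in for constructing a GeneticAlgorithm and calling RunGA(); its body, like the class name, is an assumption made only so the example is runnable.

```csharp
using System;
using System.Threading.Tasks;

public static class SubPopulationRunner
{
    // Hypothetical stand-in for GeneticAlgorithm.RunGA(): returns a fitness for one slice.
    public static double RunSlice(int startPos, int endPos)
    {
        double fitness = 0;
        for (int i = startPos; i < endPos; i++)
        {
            fitness += Math.Sqrt(i);
        }
        return fitness;
    }

    // Runs every slice through Parallel.For and returns the best (largest) fitness.
    public static double BestFitness(int popSize, int sliceCount)
    {
        int sliceSize = popSize / sliceCount;
        var results = new double[sliceCount]; // each iteration writes only its own slot

        Parallel.For(0, sliceCount, index =>
        {
            int startPos = index * sliceSize;
            results[index] = RunSlice(startPos, startPos + sliceSize);
        });

        double best = results[0];
        for (int i = 1; i < sliceCount; i++)
        {
            if (best < results[i]) best = results[i];
        }
        return best;
    }
}
```

Each iteration writes to a different element of results, so no locking is needed. The real GeneticAlgorithm instances would also have to avoid sharing mutable state (such as the Instance object in the question) for this to scale.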
Basically, I would like to know if there is more efficient code than the one shown below:
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; i++)
{
result[i] = value;
}
return result;
}
where size can vary from 10 to 150000. For small arrays this is not an issue, but there should be a better way to do the above.
I am using VS2010(.NET 4.0)
C#/CLR does not have a built-in way to initialize an array with non-default values.
Your code is as efficient as it can get if you measure in operations per item.
You can potentially get faster initialization if you initialize chunks of a huge array in parallel. This approach will need careful tuning due to the non-trivial cost of multithreaded operations.
Much better results can be obtained by analyzing your needs and potentially removing the whole initialization altogether. For example, if the array normally contains a constant value, you can implement some sort of COW (copy-on-write) approach, where your object initially has no backing array and simply returns the constant value, and on a write to an element it creates a (potentially partial) backing array for the modified segment.
Slower but more compact code (that is potentially easier to read) would use Enumerable.Repeat. Note that ToArray will cause a significant amount of memory to be allocated for large arrays (which may also end up as allocations on the LOH); see "High memory consumption with Enumerable.Range?".
var result = Enumerable.Repeat(value, size).ToArray();
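The copy-on-write idea mentioned above can be sketched as a small wrapper class. LazyFilledArray is a hypothetical name, and this version materializes the whole backing array on the first real write rather than a partial segment; it is not thread-safe and is meant only to show the shape of the approach.

```csharp
using System;

// Hypothetical wrapper: behaves like an int array pre-filled with one value,
// but allocates the backing array only on the first write of a different value.
public sealed class LazyFilledArray
{
    private readonly int _length;
    private readonly int _defaultValue;
    private int[] _backing; // null until the first real write

    public LazyFilledArray(int size, int defaultValue)
    {
        if (size < 0) throw new ArgumentOutOfRangeException("size");
        _length = size;
        _defaultValue = defaultValue;
    }

    public int Length { get { return _length; } }

    public bool IsMaterialized { get { return _backing != null; } }

    public int this[int index]
    {
        get
        {
            if (index < 0 || index >= _length) throw new IndexOutOfRangeException();
            return _backing == null ? _defaultValue : _backing[index];
        }
        set
        {
            if (index < 0 || index >= _length) throw new IndexOutOfRangeException();
            if (_backing == null)
            {
                if (value == _defaultValue) return; // nothing to store yet
                _backing = new int[_length];
                for (int i = 0; i < _backing.Length; i++) _backing[i] = _defaultValue;
            }
            _backing[index] = value;
        }
    }
}
```

Reads cost one extra null check, so this only pays off when many such arrays are created and few of them are ever modified.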
One way that you can improve speed is by utilizing Array.Copy. It is able to work at a lower level, bulk-assigning larger sections of memory.
By batching the assignments, you can fill the first section normally and then copy that section of the array into the remaining sections of the same array.
On top of that, the batches themselves can be parallelized quite effectively.
Here is my initial attempt. On my machine (which only has two cores), with a sample array of 10 million items, I was getting a speedup of 15% or so. You'll need to play around with the batch size (try to stay in multiples of your page size to keep it efficient) to tune it to the number of items that you have. For smaller arrays it'll end up almost identical to your code, as it won't get past filling the first batch, but it also won't be (noticeably) worse in those cases.
private const int batchSize = 1048576;
private static int[] GetDefaultSeriesArray2(int size, int value)
{
int[] result = new int[size];
//fill the first batch normally
int end = Math.Min(batchSize, size);
for (int i = 0; i < end; i++)
{
result[i] = value;
}
int numBatches = size / batchSize;
Parallel.For(1, numBatches, batch =>
{
Array.Copy(result, 0, result, batch * batchSize, batchSize);
});
//handle partial leftover batch
for (int i = numBatches * batchSize; i < size; i++)
{
result[i] = value;
}
return result;
}
Another way to improve performance is with a pretty basic technique: loop unrolling.
I have written some code to initialize an array with 20 million items. This is repeated 100 times and an average is calculated. Without unrolling the loop, this takes about 44 ms. With the loop unrolled by a factor of 10, the process finishes in 23 ms.
private void Looper()
{
int repeats = 100;
float avg = 0;
ArrayList times = new ArrayList();
for (int i = 0; i < repeats; i++)
times.Add(Time());
Console.WriteLine(GetAverage(times)); //44
times.Clear();
for (int i = 0; i < repeats; i++)
times.Add(TimeUnrolled());
Console.WriteLine(GetAverage(times)); //22
}
private float GetAverage(ArrayList times)
{
long total = 0;
foreach (var item in times)
{
total += (long)item;
}
return (float)total / times.Count; // cast first to avoid integer division
}
private long Time()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i++)
{
result[i] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
private long TimeUnrolled()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i += 10)
{
result[i] = 5;
result[i + 1] = 5;
result[i + 2] = 5;
result[i + 3] = 5;
result[i + 4] = 5;
result[i + 5] = 5;
result[i + 6] = 5;
result[i + 7] = 5;
result[i + 8] = 5;
result[i + 9] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
Enumerable.Repeat(value, size).ToArray();
From what I have read, Enumerable.Repeat is 20 times slower than the OP's standard for loop, and the only thing I found that might improve the loop's speed is:
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; ++i)
{
result[i] = value;
}
return result;
}
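NOTE: i++ is changed to ++i. Conceptually, i++ copies i, increments i, and returns the original value, while ++i just returns the incremented value. Because the result is discarded here, the compiler is expected to generate the same code for both, so measure before relying on this.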
As someone already mentioned, you can leverage parallel processing like this:
int[] result = new int[size];
// Index into the array: a lambda such as x => x = value would only assign to its own parameter.
Parallel.For(0, size, i => result[i] = value);
return result;
Sorry I had no time to do performance testing on this (don't have VS installed on this machine) but if you can do it and share the results it would be great.
EDIT: As per the comment, the code above now uses the parallel for loop. The earlier Parallel.ForEach(result, x => x = value) version assigned only to the lambda parameter, so it never modified the array.