How many cycles to multiply a float in C#

I have a numerically intensive application and, after looking up GFLOPS figures on the internet, I decided to do my own little benchmark. I just did a single-threaded matrix multiplication thousands of times to get about a second of execution. This is the inner loop:
for (int i = 0; i < SIZEA; i++)
for (int j = 0; j < SIZEB; j++)
vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];
It's been years since I dealt with FLOPS, so I expected to get something around 3 to 6 cycles per FLOP, but I am getting 30 (100 MFLOPS). Surely I would get more if I parallelized this, but I just did not expect numbers that low. Could this be a problem with .NET, or is this really the CPU's performance?
Here is a fiddle with the full benchmark code.
EDIT: Running under Visual Studio, even in Release mode, takes longer; run on its own, the executable takes 12 cycles per FLOP (250 MFLOPS). Still, is there any VM impact?

Your benchmark doesn't really measure FLOPS; it does some floating-point operations and some looping in C#.
However, if you can isolate your code to a repetition of just floating point operations you still have some problems.
Your code should include some "pre-cycles" to let the JIT warm up, so you are not measuring compile time.
Even then, you need to compile in Release mode with optimizations on and run your test from the command line on a known, consistent platform.
Fiddle here
Here is my alternative benchmark:
using System;
using System.Linq;
using System.Diagnostics;
class Program
{
static void Main()
{
const int Flops = 10000000;
var random = new Random();
var output = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var left = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var right = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
for (var i = 0; i < Flops - 1; i++)
{
output[i] = random.NextDouble();
}
timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
Console.WriteLine(
"MFLOPS: {0}",
(double)Flops / timer.ElapsedMilliseconds / 1000.0);
}
}
On my VM I get results like
ms: 73
MFLOPS: 136.986301...
Note that I had to increase the number of operations significantly to get the elapsed time over 1 millisecond.
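If you would rather not inflate the operation count, Stopwatch also exposes tick-level resolution. A minimal sketch, reusing the timer and Flops from the benchmark above; Stopwatch.ElapsedTicks and Stopwatch.Frequency are standard .NET members:
// ElapsedTicks are raw high-resolution timer ticks; Frequency gives
// ticks per second, so this works even for sub-millisecond runs.
double seconds = (double)timer.ElapsedTicks / Stopwatch.Frequency;
Console.WriteLine("MFLOPS: {0}", Flops / seconds / 1e6);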

Related

How to fill an array in multiple threads?

I have an array like the one below, and I know I'm supposed to use Thread for this, but I don't understand how to do it. Do I need to split the array into parts, or can I do it some more direct way?
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int[] a = new int[10000];
Random rand = new Random();
for (int i = 0; i < a.Length; i++)
{
a[i] = rand.Next(-100, 100);
}
foreach (var p in a)
Console.WriteLine(p);
TimeSpan ts = stopWatch.Elapsed;
stopWatch.Stop();
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Another approach, compared to John Wu's, is to use a custom partitioner. I think that it is a little more readable.
using System.Collections.Concurrent;
using System.Threading.Tasks;
int[] a = new int[10000];
int batchSize = 1000;
Random rand = new Random(); // caution: System.Random is not thread-safe if shared across threads
Parallel.ForEach(Partitioner.Create(0, a.Length, batchSize), range =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
a[i] = rand.Next(-100, 100);
}
});
In modern C#, you should almost never have to use Thread objects directly; they are fraught with peril, and other language features will do the job just as well (see async and the TPL). I'll show you a way to do it with the TPL.
Note: due to the problem of false sharing, you need to rig things so that the different threads work on different memory areas; otherwise you will see no gain in performance, and indeed performance could get considerably worse. In this example I divide the array into blocks of 4,000 bytes (1,000 elements) each and work on each block in a separate thread.
using System;
using System.Linq;
using System.Threading.Tasks;
var random = new Random(); // caution: not thread-safe when shared across threads; see the note below
var array = new int[10000];
var offsets = Enumerable.Range(0, 10).Select( x => x * 1000 );
Parallel.ForEach( offsets, offset => {
for ( int i=0; i<1000; i++ )
{
array[offset + i] = random.Next( -100,100 );
}
});
That all being said, I doubt you'll see much of a gain in performance in this example; the array is much too small to be worth the additional overhead.
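One caveat on both Random-based examples above: System.Random is not thread-safe, and sharing one instance across threads can silently corrupt its internal state (a corrupted instance tends to start returning 0). A minimal sketch of a per-range fix, using the same partitioner idea as above; the Guid-based seed is just one way to keep seeds distinct across near-simultaneous threads:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
var array = new int[10000];
Parallel.ForEach(Partitioner.Create(0, array.Length, 1000), range =>
{
    // One Random per range, so no instance is shared between threads.
    var random = new Random(Guid.NewGuid().GetHashCode());
    for (int i = range.Item1; i < range.Item2; i++)
    {
        array[i] = random.Next(-100, 100);
    }
});
(On .NET 6 and later you could instead use the built-in, thread-safe Random.Shared.)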

C# vs C++ for loop performance measurement

For kicks, I wanted to see how the speed of a C# for-loop compares with that of a C++ for-loop. My test simply runs an empty for-loop of 100,000 iterations, repeats that measurement 100,000 times, and averages the results.
Here is my C# implementation:
static void Main(string[] args) {
var numberOfMeasurements = 100000;
var numberOfLoops = 100000;
var measurements = new List < long > ();
var stopwatch = new Stopwatch();
for (var i = 0; i < numberOfMeasurements; i++) {
stopwatch.Start();
for (int j = 0; j < numberOfLoops; j++) {}
measurements.Add(stopwatch.ElapsedMilliseconds);
}
Console.WriteLine("Average runtime = " + measurements.Average() + " ms.");
Console.Read();
}
Result: Average runtime = 10301.92929 ms.
Here is my C++ implementation:
void TestA()
{
auto numberOfMeasurements = 100000;
auto numberOfLoops = 100000;
std::vector<long> measurements;
for (size_t i = 0; i < numberOfMeasurements; i++)
{
auto start = clock();
for (size_t j = 0; j < numberOfLoops; j++){}
auto duration = start - clock();
measurements.push_back(duration);
}
long avg = std::accumulate(measurements.begin(), measurements.end(), 0.0) / measurements.size();
std::cout << "TestB: Time taken in milliseconds: " << avg << std::endl;
}
int main()
{
TestA();
return 0;
}
Result: TestA: Time taken in milliseconds: 0
When I had a look at what was in measurements, I noticed that it was filled with zeros... So, what is it, what is the problem here? Is it clock? Is there a better/correct way to measure the for-loop?
There is no "problem". Being able to optimize away useless code is one of the key features of C++. As the inner loop does nothing, it should be removed by every sane compiler.
Tip of the day: Only profile meaningful code that does something.
If you want to learn something about micro-benchmarks, you might be interested in this.
As "Baum mit Augen" already said the compiler will remove code that doesn't do anything. That is a common mistake when "benchmarking" C++ code. The same thing will happen if you create some kind of benchmark function which just calculates some things that will never used (won't be returned or used otherwise in code) - the compiler will just remove it.
You can avoid this behavior by not using optimize flags like O2, Ofast and so on. Since nobody would do that with real code it won't display the real performance of C++.
TL;DR Just benchmark real production code.
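A common alternative to disabling optimizations is to give the measured loop an observable side effect, so the compiler cannot legally delete it. A minimal C# sketch of the idea (the same principle applies to the C++ version):
using System;
using System.Diagnostics;
class Program
{
    static void Main()
    {
        const int iterations = 100000000;
        long sink = 0; // accumulate something the program later prints
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += i; // trivial, but observable, work
        }
        sw.Stop();
        // Printing sink makes the loop's result observable, so an optimizing
        // compiler can no longer remove the loop outright (though it may
        // still transform it).
        Console.WriteLine("sink = {0}, elapsed = {1} ms", sink, sw.ElapsedMilliseconds);
    }
}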

How to execute a method with parameters as different threads independently in C#? [duplicate]

I have a question concerning parallel for loops. I have the following code:
public static void MultiplicateArray(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
}
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
}
}
Now I try to add parallel function:
public static void MultiplicateArray(double[] array, double factor)
{
Parallel.For(0, array.Length, i =>
{
array[i] = array[i] * factor;
});
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
});
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
});
}
The issue is that I want to save time, not waste it. With the standard for loop the computation takes about 2 minutes, but with the parallel for loop it takes 3 minutes. Why?
Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since each iteration in your code is very short (basically just a few CPU instructions), this overhead can become prominent.
Because of this, I don't think Parallel.For() is the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.
To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!
First, I'll try to improve the common case, and I'll come back to those spikes later. We want to process the array on N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item of an array is one of the most optimized constructs there is; can't we do something to help the compiler? Hoisting the computation of the loop's upper bound out of the loop could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here; memory bandwidth is.
Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array; but if one thread finishes early, it will try to help process the range of a thread that hasn't finished yet. If that happens, you get a lot of false sharing, which can slow the code down a lot. My own test with forced false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of Parallel.For() seems to help with the spikes a little.
Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.
The code I used:
static void Main()
{
double[] array = new double[200 * 1000 * 1000];
for (int i = 0; i < array.Length; i++)
array[i] = 1;
for (int i = 0; i < 5; i++)
{
Stopwatch sw = Stopwatch.StartNew();
Serial(array, 2);
Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelFor(array, 2);
Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelForDegreeOfParallelism(array, 2);
Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallel(array, 2);
Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMax(array, 2);
Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMaxHalfParallelism(array, 2);
Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelFalseSharing(array, 2);
Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
}
}
static void Serial(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
static void ParallelFor(double[] array, double factor)
{
Parallel.For(
0, array.Length, i => { array[i] = array[i] * factor; });
}
static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
Parallel.For(
0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
i => { array[i] = array[i] * factor; });
}
static void CustomParallel(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMax(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount / 2;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelFalseSharing(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
int i = -1;
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
int j = Interlocked.Increment(ref i);
while (j < array.Length)
{
array[j] = array[j] * factor;
j = Interlocked.Increment(ref i);
}
});
}
Task.WaitAll(tasks);
}
Example output:
Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
svick already provided a great answer, but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For(), but that you have to process larger chunks of data.
This can still be done using Parallel.For() like this:
static void My(double[] array, double factor)
{
int degreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, degreeOfParallelism, workerId =>
{
var max = array.Length * (workerId + 1) / degreeOfParallelism;
for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
array[i] = array[i] * factor;
});
}
which does the same thing as svick's CustomParallelExtractedMax() but is shorter, simpler, and (on my machine) performs even slightly faster:
Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s
By the way, the keyword here, which is missing from all the other answers, is granularity.
See Custom Partitioners for PLINQ and TPL:
In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.
In your loop body, you are performing a single multiplication, and the overhead of the delegate call will be very noticeable.
Try this:
public static void MultiplicateArray(double[] array, double factor)
{
var rangePartitioner = Partitioner.Create(0, array.Length);
Parallel.ForEach(rangePartitioner, range =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
array[i] = array[i] * factor;
}
});
}
See also: Parallel.ForEach documentation and Partitioner.Create documentation.
Parallel.For involves more complex memory management, and the result can vary depending on CPU specs such as the number of cores and the L1 and L2 cache sizes.
Please take a look at this interesting article:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
From http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx and http://msdn.microsoft.com/en-us/library/dd537608.aspx:
You are not creating three threads/processes that each execute your for loop; rather, the iterations of the for loop itself are executed in parallel, so even with only one loop you are using multiple threads.
This means that the iterations with index = 0 and index = 1 may be executed at the same time.
You are probably forcing too many threads to be used, so the overhead of creating and scheduling them exceeds the speed gain. Try using three normal for loops, but on three different threads: if your system is multicore (3x at least), it should take less than a minute.
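For what it's worth, a minimal sketch of that last suggestion; array1, array2, array3, factor, multiplication, multiArray and dimension are assumed to come from the surrounding code in the question:
using System.Threading.Tasks;
// Run each overload on its own task and wait for all three to finish.
var t1 = Task.Run(() => MultiplicateArray(array1, factor));
var t2 = Task.Run(() => MultiplicateArray(array2, multiplication));
var t3 = Task.Run(() => MultiplicateArray(array3, multiArray, dimension));
Task.WaitAll(t1, t2, t3);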

Summation algorithm for a parallel program

I am trying to write a parallel algorithm that is three times faster than a sequential algorithm doing essentially the same thing. Please see the pastebin.
http://pastebin.com/3DDyxfPP
Pasted:
Hello everyone. I'm doing an assignment for class and have the majority of it done; however, I am having some problems with the math. I am trying to calculate the expression:
∑_{i=1}^{100000000} (9999999/10000000)^i * i^2
that is, i runs from 1 to 100 million.
A fast sequential algorithm is given:
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 10000000; i++)
{
sum += (fact1 * i * i);
fact1 *= 0.9999999;
}
We are supposed to implement it and verify that it works, as well as time it in release mode. I already have this done and working properly. The time is then displayed on the console.
DateTime t = DateTime.Now;
long saveticks = t.Ticks;
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 100000000; i++)
{
sum += (fact1 * i * i);
fact1 *= 0.9999999;
}
t = DateTime.Now;
We then have to write a timed parallel algorithm that will beat the time, and are supposed to model it after an example parallel program. It must be at least 3 times faster than the sequential algorithm. We are to use 4 processing elements for the parallel program.
There is a hint, "After you figure out the work each processing element will do, you may need to start off the processing element with the time consuming Pow function".
for example:
Math.Pow(x,y)
"Don't use the pow function on each iteration for the parallel code, because it wont beat the time."
Here is my code for the parallel program. This does both the sequential algorithm and the parallel one and times them both.
const int numPEs = 4;
const int size = 100000000;
static double pSum;
static int numThreadsDone;
static int nextid;
static object locker1 = new object();
static object locker2 = new object();
static long psaveticks;
static DateTime pt;
static void Main(string[] args)
{
DateTime t = DateTime.Now;
long saveticks = t.Ticks;
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 100000000; i++)
{
sum += (fact1 * (i * i));
fact1 *= 0.9999999;
}
t = DateTime.Now;
Console.WriteLine("sequential: " + ((t.Ticks - saveticks) / 100000000.0) + " seconds");
Console.WriteLine("sum is " + sum);
// time it
pt = DateTime.Now;
psaveticks = pt.Ticks;
for (int i = 0; i < numPEs; i++)
new Thread(countThreads).Start();
Console.ReadKey();
}
static void countThreads()
{
int id;
double localcount = 0;
lock (locker1)
{
id = nextid;
nextid++;
}
// assumes array is evenly divisible by the number of threads
int granularity = size / numPEs;
int start = granularity * id;
for (int i = start; i < start + granularity; i++)
localcount += (Math.Pow(0.9999999, i) * (i * i));
lock (locker2)
{
pSum += localcount;
numThreadsDone++;
if (numThreadsDone == numPEs)
{
pt = DateTime.Now;
Console.WriteLine("parallel: " + ((pt.Ticks - psaveticks) / 10000000.0) + " seconds");
Console.WriteLine("parallel count is " + pSum);
}
}
}
My problem is that my sequential program is way faster than the parallel one. There has got to be a problem with the algorithm I'm using.
Can anyone help?
Console.WriteLine("sequential: " + ((t.Ticks - saveticks) / 100000000.0) + " seconds");
There are 10,000,000 ticks in one second. In the above line, you're dividing by an extra order of magnitude, 100,000,000, making your sequential execution appear to be 10 times faster than it actually is. To avoid these errors, use the appropriate fields from the .NET Framework itself; in this case, TimeSpan.TicksPerSecond.
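In other words, the line could read (TimeSpan.TicksPerSecond is the framework constant, equal to 10,000,000):
Console.WriteLine("sequential: "
    + ((t.Ticks - saveticks) / (double)TimeSpan.TicksPerSecond) + " seconds");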
The main reason you're getting a slow-down is that your parallel code is much more computationally-demanding than your sequential one.
// Inner loop of sequential code:
sum += (fact1 * (i * i));
fact1 *= 0.9999999;
// Inner loop of parallel code:
localcount += (Math.Pow(0.9999999, i) * (i * i));
From a mathematical perspective, you're justified in assuming that exponentiation would be equivalent to repeated multiplication. However, from a computational perspective, the Math.Pow operation is much more expensive than a simple multiplication.
A way of mitigating these expensive Math.Pow calls would be to perform the exponentiation just once at the beginning of each thread, and then revert to using plain multiplication (like in your sequential case):
double fact1 = Math.Pow(0.9999999, start + 1);
for (int i = start + 1; i <= start + granularity; i++)
{
localcount += (fact1 * (i * i));
fact1 *= 0.9999999;
}
On an Intel Core i7, this gives a speedup of around 3x for your problem size.
Obligatory reminders:
Don't use DateTime.Now for measuring brief time intervals. Use the Stopwatch class instead.
Don't take cross-thread time measurements. Wait for your worker threads to complete from your main thread, and take the final reading there (see the sketch below).
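Putting both reminders together, a minimal sketch of the timing harness, assuming the numPEs constant and the countThreads worker from the question:
using System;
using System.Diagnostics;
using System.Threading;
var sw = Stopwatch.StartNew();
var threads = new Thread[numPEs];
for (int i = 0; i < numPEs; i++)
{
    threads[i] = new Thread(countThreads);
    threads[i].Start();
}
// Join from the main thread so the final reading is not taken inside
// a worker while other workers may still be running.
foreach (var thread in threads)
    thread.Join();
sw.Stop();
Console.WriteLine("parallel: " + sw.Elapsed.TotalSeconds + " seconds");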

How to initialize integer array in C# [duplicate]

Possible Duplicate:
c# Leaner way of initializing int array
Basically, I would like to know if there is more efficient code than what is shown below,
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; i++)
{
result[i] = value;
}
return result;
}
where size can vary from 10 to 150,000. For small arrays this is not an issue, but there should be a better way to do the above.
I am using VS2010 (.NET 4.0).
C#/the CLR has no built-in way to initialize an array with non-default values.
Your code is as efficient as it can get if you measure in operations per item.
You can potentially get faster initialization by initializing chunks of the huge array in parallel, but this approach needs careful tuning due to the non-trivial cost of multithreaded operations.
Much better results can be obtained by analyzing your needs and potentially removing the initialization altogether: if the array normally contains a constant value, you can implement some sort of COW (copy-on-write) approach, where your object initially has no backing array and simply returns the constant value, and a write to an element creates a (potentially partial) backing array for the modified segment.
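A minimal sketch of that copy-on-write idea, using a hypothetical wrapper class (this simple version allocates the whole backing array on the first write; a partial, segmented version would be more involved):
// Hypothetical COW wrapper: reads return the constant until the first
// write forces allocation of a real backing array.
class CowArray
{
    private readonly int _size;
    private readonly int _defaultValue;
    private int[] _backing; // stays null until the first write

    public CowArray(int size, int defaultValue)
    {
        _size = size;
        _defaultValue = defaultValue;
    }

    public int this[int index]
    {
        get { return _backing == null ? _defaultValue : _backing[index]; }
        set
        {
            if (_backing == null)
            {
                _backing = new int[_size];
                for (int i = 0; i < _size; i++)
                    _backing[i] = _defaultValue;
            }
            _backing[index] = value;
        }
    }
}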
Slower but more compact code (which is potentially easier to read) would be to use Enumerable.Repeat. Note that ToArray will cause a significant amount of memory to be allocated for large arrays (which may also end up with allocations on the LOH) - see High memory consumption with Enumerable.Range?.
var result = Enumerable.Repeat(value, size).ToArray();
One way that you can improve speed is by utilizing Array.Copy. It works at a lower level, bulk-copying larger sections of memory.
By batching the assignments, you can end up copying the array from one section to itself.
On top of that, the batches themselves can be parallelized quite effectively.
Here is my initial cut at the code. On my machine (which only has two cores), with a sample array of 10 million items, I was getting a speedup of about 15%. You'll need to play around with the batch size (try to stay in multiples of your page size to keep it efficient) to tune it to the size of items that you have. For smaller arrays it will end up almost identical to your code, as it won't get past filling up the first batch, but it won't be (noticeably) worse in those cases either.
private const int batchSize = 1048576;
private static int[] GetDefaultSeriesArray2(int size, int value)
{
int[] result = new int[size];
//fill the first batch normally
int end = Math.Min(batchSize, size);
for (int i = 0; i < end; i++)
{
result[i] = value;
}
int numBatches = size / batchSize;
Parallel.For(1, numBatches, batch =>
{
Array.Copy(result, 0, result, batch * batchSize, batchSize);
});
//handle partial leftover batch
for (int i = numBatches * batchSize; i < size; i++)
{
result[i] = value;
}
return result;
}
Another way to improve performance is with a pretty basic technique: loop unrolling.
I have written some code that initializes an array of 20 million items; this is done 100 times and the average is calculated. Without unrolling the loop, it takes about 44 ms. With the loop unrolled by a factor of 10, the process finishes in 23 ms.
private void Looper()
{
int repeats = 100;
float avg = 0;
ArrayList times = new ArrayList();
for (int i = 0; i < repeats; i++)
times.Add(Time());
Console.WriteLine(GetAverage(times)); //44
times.Clear();
for (int i = 0; i < repeats; i++)
times.Add(TimeUnrolled());
Console.WriteLine(GetAverage(times)); //22
}
private float GetAverage(ArrayList times)
{
long total = 0;
foreach (var item in times)
{
total += (long)item;
}
return total / times.Count;
}
private long Time()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i++)
{
result[i] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
private long TimeUnrolled()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i += 10)
{
result[i] = 5;
result[i + 1] = 5;
result[i + 2] = 5;
result[i + 3] = 5;
result[i + 4] = 5;
result[i + 5] = 5;
result[i + 6] = 5;
result[i + 7] = 5;
result[i + 8] = 5;
result[i + 9] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
Enumerable.Repeat(value, size).ToArray();
Reading up on it, Enumerable.Repeat is about 20 times slower than the OP's standard for loop, and the only thing I found that might improve its speed is:
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; ++i)
{
result[i] = value;
}
return result;
}
NOTE: i++ is changed to ++i. i++ copies i, increments i, and returns the original value, while ++i just returns the incremented value. (In a for loop like this, the JIT will usually generate identical code for both, so measure before relying on it.)
As someone already mentioned, you can leverage parallel processing like this:
int[] result = new int[size];
// caution: this assigns to the lambda's local copy of x, so it never
// actually writes to the array elements (hence the edit below)
Parallel.ForEach(result, x => x = value);
return result;
Sorry I had no time to do performance testing on this (don't have VS installed on this machine) but if you can do it and share the results it would be great.
EDIT: As per the comments, while I still think they are equivalent in terms of performance, you can try the parallel for loop instead:
Parallel.For(0, size, i => result[i] = value);
