For kicks, I wanted to see how the speed of a C# for-loop compares with that of a C++ for-loop. My test is to simply iterate over a for-loop 100000 times, 100000 times, and average the result.
Here is my C# implementation:
static void Main(string[] args) {
var numberOfMeasurements = 100000;
var numberOfLoops = 100000;
var measurements = new List < long > ();
var stopwatch = new Stopwatch();
for (var i = 0; i < numberOfMeasurements; i++) {
stopwatch.Start();
for (int j = 0; j < numberOfLoops; j++) {}
measurements.Add(stopwatch.ElapsedMilliseconds);
}
Console.WriteLine("Average runtime = " + measurements.Average() + " ms.");
Console.Read();
}
Result: Average runtime = 10301.92929 ms.
Here is my C++ implementation:
void TestA()
{
auto numberOfMeasurements = 100000;
auto numberOfLoops = 100000;
std::vector<long> measurements;
for (size_t i = 0; i < numberOfMeasurements; i++)
{
auto start = clock();
for (size_t j = 0; j < numberOfLoops; j++){}
auto duration = start - clock();
measurements.push_back(duration);
}
long avg = std::accumulate(measurements.begin(), measurements.end(), 0.0) / measurements.size();
std::cout << "TestB: Time taken in milliseconds: " << avg << std::endl;
}
int main()
{
TestA();
return 0;
}
Result: TestA: Time taken in milliseconds: 0
When I had a look at what was in measurements, I noticed that it was filled with zeros... So, what is it, what is the problem here? Is it clock? Is there a better/correct way to measure the for-loop?
There is no "problem". Being able to optimize away useless code is one of the key features of C++. As the inner loop does nothing, it should be removed by every sane compiler.
Tip of the day: Only profile meaningful code that does something.
If you want to learn something about micro-benchmarks, you might be interested in this.
As "Baum mit Augen" already said the compiler will remove code that doesn't do anything. That is a common mistake when "benchmarking" C++ code. The same thing will happen if you create some kind of benchmark function which just calculates some things that will never used (won't be returned or used otherwise in code) - the compiler will just remove it.
You can avoid this behavior by not using optimize flags like O2, Ofast and so on. Since nobody would do that with real code it won't display the real performance of C++.
TL;DR Just benchmark real production code.
Related
Given the following code:
public float[] weights;
public void Input(Neuron[] neurons)
{
float output = 0;
for (int i = 0; i < neurons.Length; i++)
output += neurons[i].input * weights[i];
}
Is it possible to perform all the calculations in a single execution? For example that would be 'neurons[0].input * weights[0].value + neurons[1].input * weights[1].value...'
Coming from this topic - How to sum up an array of integers in C#, there is a way for simpler caclulations, but the idea of my code is to iterate over the first array, multiply each element by the element in the same index in the second array and add that to a sum total.
Doing perf profiling, the line where the output is summed is very heavy on I/O and consumes 99% of my processing power. The stack should have enough memory for this, I am not worried about stack overflow, I just want to see it work faster for the moment (even if accuracy is sacrificed).
I think you are looking for AVX in C#
So you can actually calculate several values in one command.
Thats SIMD for CPU cores. Take a look at this
Here an example from the website:
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
var simdLength = Vector<int>.Count;
var result = new int[lhs.Length];
var i = 0;
for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
{
var va = new Vector<int>(lhs, i);
var vb = new Vector<int>(rhs, i);
(va + vb).CopyTo(result, i);
}
for (; i < lhs.Length; ++i)
{
result[i] = lhs[i] + rhs[i];
}
return result;
}
You can also combine it with the parallelism you already use.
Here I'm using the mathematical term for linear. If we look at, for example, the definition of averaging, we know that:
That's what I mean by linear. In C#, suppose I want to do the following:
for (int i = 0; i<N; i++)
{
someVar[i] += i;
someOtherVar[i] += i;
}
Does this cost the same overhead as:
for (int i = 0; i<N; i++)
{
someVar[i] += i;
}
for (int i = 0; i<N; i++)
{
someOtherVar[i] += i;
}
Would the difference change if the operation in the for loop is more complicated, like:
for (int i = 0; i<N; i++)
{
someOtherVar[i] *= Math.Cos(2.0 * Math.Pi * i / N);
}
Assume N is large, 16384 entries.
As far as complexity goes, both a one and two loop algorithm are linear O(n) operations. But as far as actual performance (overhead), the two loop version is most likely slower. The compiler or runtime may opt to do optimizations, but two loops generally translate to:
set i to 0;
a:
if i < N goto done;
set someVar[i] = someVar[i] + i;
set i = i + 1
goto a;
done:
set i to 0;
b:
if i < N goto done;
set someOtherVar[i] = someOtherVar[i] + i;
set i = i + 1;
goto b;
...which is clearly more steps than one loop.
I did an IL dump, and using one loop is faster than using two loops as there's the overhead incurred by incrementing "i".
That said, the difference will be minimal.
If you're dealing with a much larger data set and need to optimize this further, look into SIMD (single instruction, multiple data) operations.
https://habr.com/en/post/467689/
I have a numeric intensive application and after looking for GFLOPS on the internet, I decided to do my own little benchmark. I just did a single thread matrix multiplication thousands of times to get about a second of execution. This is the inner loop.full
for (int i = 0; i < SIZEA; i++)
for (int j = 0; j < SIZEB; j++)
vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];
It's been years since I dealt with FLOPS, so I expected to get something around 3 to 6 cycles per FLOP. But I am getting 30 (100 MFLOPS), surely if I parallelize this I will get more but I just did not expect that. Could this be a problem with dot NET. or is this really the CPU performance?
Here is a fiddle with the full benchmark code.
EDIT: Visual studio even in release mode takes longer to run, the executable by itself it runs in 12 cycles per FLOP (250 MFLOPS). Still is there any VM impact?
Your bench mark doesn't really measure FLOPS, it does some floating point operations and looping in C#.
However, if you can isolate your code to a repetition of just floating point operations you still have some problems.
Your code should include some "pre-cycles" to allow the "jitter to warm-up", so you are not measuring compile time.
Then, even if you do that,
You need to compile in release mode with optimizations on and execute your test from the commmand-line on a known consistent platform.
Fiddle here
Here is my alternative benchmark,
using System;
using System.Linq;
using System.Diagnostics;
class Program
{
static void Main()
{
const int Flops = 10000000;
var random = new Random();
var output = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var left = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var right = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
for (var i = 0; i < Flops - 1; i++)
{
output[i] = random.NextDouble();
}
timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
Console.WriteLine(
"MFLOPS: {0}",
(double)Flops / timer.ElapsedMilliseconds / 1000.0);
}
}
On my VM I get results like
ms: 73
MFLOPS: 136.986301...
Note, I had to increase the number of operations significantly to get over 1 millisecond.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
c# Leaner way of initializing int array
Basically I would like to know if there is a more efficent code than the one shown below
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; i++)
{
result[i] = value;
}
return result;
}
where size can vary from 10 to 150000. For small arrays is not an issue, but there should be a better way to do the above.
I am using VS2010(.NET 4.0)
C#/CLR does not have built in way to initalize array with non-default values.
Your code is as efficient as it could get if you measure in operations per item.
You can get potentially faster initialization if you initialize chunks of huge array in parallel. This approach will need careful tuning due to non-trivial cost of mutlithread operations.
Much better results can be obtained by analizing your needs and potentially removing whole initialization alltogether. I.e. if array is normally contains constant value you can implement some sort of COW (copy on write) approach where your object initially have no backing array and simpy returns constant value, that on write to an element it would create (potentially partial) backing array for modified segment.
Slower but more compact code (that potentially easier to read) would be to use Enumerable.Repeat. Note that ToArray will cause significant amount of memory to be allocated for large arrays (which may also endup with allocations on LOH) - High memory consumption with Enumerable.Range?.
var result = Enumerable.Repeat(value, size).ToArray();
One way that you can improve speed is by utilizing Array.Copy. It's able to work at a lower level in which it's bulk assigning larger sections of memory.
By batching the assignments you can end up copying the array from one section to itself.
On top of that, the batches themselves can be quite effectively paralleized.
Here is my initial code up. On my machine (which only has two cores) with a sample array of size 10 million items, I was getting a 15% or so speedup. You'll need to play around with the batch size (try to stay in multiples of your page size to keep it efficient) to tune it to the size of items that you have. For smaller arrays it'll end up almost identical to your code as it won't get past filling up the first batch, but it also won't be (noticeably) worse in those cases either.
private const int batchSize = 1048576;
private static int[] GetDefaultSeriesArray2(int size, int value)
{
int[] result = new int[size];
//fill the first batch normally
int end = Math.Min(batchSize, size);
for (int i = 0; i < end; i++)
{
result[i] = value;
}
int numBatches = size / batchSize;
Parallel.For(1, numBatches, batch =>
{
Array.Copy(result, 0, result, batch * batchSize, batchSize);
});
//handle partial leftover batch
for (int i = numBatches * batchSize; i < size; i++)
{
result[i] = value;
}
return result;
}
Another way to improve performance is with a pretty basic technique: loop unrolling.
I have written some code to initialize an array with 20 million items, this is done repeatedly 100 times and an average is calculated. Without unrolling the loop, this takes about 44 MS. With loop unrolling of 10 the process is finished in 23 MS.
private void Looper()
{
int repeats = 100;
float avg = 0;
ArrayList times = new ArrayList();
for (int i = 0; i < repeats; i++)
times.Add(Time());
Console.WriteLine(GetAverage(times)); //44
times.Clear();
for (int i = 0; i < repeats; i++)
times.Add(TimeUnrolled());
Console.WriteLine(GetAverage(times)); //22
}
private float GetAverage(ArrayList times)
{
long total = 0;
foreach (var item in times)
{
total += (long)item;
}
return total / times.Count;
}
private long Time()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i++)
{
result[i] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
private long TimeUnrolled()
{
Stopwatch sw = new Stopwatch();
int size = 20000000;
int[] result = new int[size];
sw.Start();
for (int i = 0; i < size; i += 10)
{
result[i] = 5;
result[i + 1] = 5;
result[i + 2] = 5;
result[i + 3] = 5;
result[i + 4] = 5;
result[i + 5] = 5;
result[i + 6] = 5;
result[i + 7] = 5;
result[i + 8] = 5;
result[i + 9] = 5;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
return sw.ElapsedMilliseconds;
}
Enumerable.Repeat(value, size).ToArray();
Reading up Enumerable.Repeat is 20 times slower than the ops standard for loop and the only thing I found which might improve its speed is
private static int[] GetDefaultSeriesArray(int size, int value)
{
int[] result = new int[size];
for (int i = 0; i < size; ++i)
{
result[i] = value;
}
return result;
}
NOTE the i++ is changed to ++i. i++ copies i, increments i, and returns the original value. ++i just returns the incremented value
As someone already mentioned, you can leverage parallel processing like this:
int[] result = new int[size];
Parallel.ForEach(result, x => x = value);
return result;
Sorry I had no time to do performance testing on this (don't have VS installed on this machine) but if you can do it and share the results it would be great.
EDIT: As per comment, while I still think that in terms of performance they are equivalent, you can try the parallel for loop:
Parallel.For(0, size, i => result[i] = value);
I have two very simple code in C++ and C#
in c#
for (int counter = 0; counter < 100000; counter ++)
{
String a = "";
a = "xyz";
a = a + 'd';
a = a + 'c';
a = a + 'h';
}
in c++
for (int counter = 0; counter < 100000; counter ++)
{
string a = "";
a.append("xyz");
a = a + 'd';
a = a + 'c';
a = a + 'h';
}
the strange thing is the c# code took 1/20 time for execution than the c++ code.
could you please help me to find why this happened? and how can I change my c++ code to become faster.
It's probably a quirk of the implementations. For example, one optimizer might have figured out that the result of the operations isn't used. Or one might happen to allocate a string large enough to add three extra characters without reallocating while the other didn't. Or it could be a million other things.
Benchmarking with "toy" code really isn't helpful. I wouldn't assume the results apply to any realistic situation.
There are so many obvious optimizations to this code, for example:
string a;
for (int counter = 0; counter < 100000; counter ++)
{
a = "xyz";
a.append(1, 'd');
a.append(1, 'c');
a.append(1, 'h');
}
That may make a huge difference by reusing the buffer and avoiding extra allocate/copy/free cycles.
For large string manipulation use StringBuilder