Weird performance behavior

Weird performance behavior - c#

So I have this 2 methods which suppose to multiply a 1000 items long array of integers by 2.
The first method:
[MethodImpl(MethodImplOptions.NoOptimization)]
Power(int[] arr)
{
for (int i = 0; i < arr.Length; i++)
{
arr[i] = arr[i] + arr[i];
}
}
The second method:
[MethodImpl(MethodImplOptions.NoOptimization)]
PowerNoLoop(int[] arr)
{
int i = 0;
arr[i] = arr[i] + arr[i];
i++;
arr[i] = arr[i] + arr[i];
i++;
arr[i] = arr[i] + arr[i];
i++;
............1000 Times........
arr[i] = arr[i] + arr[i];
}
Note that I use this code only for performance research and that's why it looks so disgusting.
The surprising result is that Power is faster by almost 50% than PowerNoLoop even though I have checked the decompiled IL source of both of them and the content of the for loop is exactly the same as each line in PowerNoLoop.
How can it be?

A sample measurement from my machine, running the test 10 times, PowerNoLoop is first:
00:00:00.0277138 00:00:00.0001553
00:00:00.0000142 00:00:00.0000057
00:00:00.0000106 00:00:00.0000053
00:00:00.0000084 00:00:00.0000053
00:00:00.0000080 00:00:00.0000053
00:00:00.0000075 00:00:00.0000053
00:00:00.0000080 00:00:00.0000057
00:00:00.0000080 00:00:00.0000053
00:00:00.0000080 00:00:00.0000053
00:00:00.0000075 00:00:00.0000053
Yes, about 50% slower. Notable is the jitter overhead in the first pass through the test, obviously it burns a lot more core trying to get that huge method compiled. Keep in mind that the measurement is vastly different when you don't disable the optimizer, the no-loop version is then ~800% slower.
First place to always look for an explanation is the generated machine code, you can see it with Debug > Windows > Disassembly. The primary troublespot is the prologue of the PowerNoLoop() method. Looks like this in x86 code:
067E0048 push ebp ; setup stack frame
067E0049 mov ebp,esp
067E004B push edi ; preserve registers
067E004C push esi
067E004D sub esp,0FA8h ; stack frame size = 4008 bytes
067E0053 mov esi,ecx
067E0055 lea edi,[ebp-0ACCh] ; temp2 variables
067E005B mov ecx,2B1h ; initialize 2756 bytes
067E0060 xor eax,eax ; set them to 0
067E0062 rep stos dword ptr es:[edi]
Note the very large stack size, 4008 bytes. Far too much for a method with only one local variable, it should only require 8 bytes. The extra 4000 of them are temporary variables, I named them temp2. They are initialized to 0 by the rep stos instruction, that takes a while. I can't explain 2756.
The individual adds are a very plodding affair in the non-optimized code. I'll spare you the machine code dump and write it in equivalent C# code:
if (i >= arr.Length) goto throwOutOfBoundsException
var temp1 = arr[i];
if (i >= arr.Length) goto throwOutOfBoundsException
var temp2 = temp1 + arr[i];
if (i >= arr.Length) goto throwOutOfBoundsException
arr[i] = temp2
Repeated over and over again, a thousand times total. The temp2 variable is the troublemaker, there's one each for each individual statement. Thus adding 4000 bytes to the stack frame size. If anybody has a guess at 2756 then I'd love to hear it in a comment.
Having to set them all to 0 before the method can start running is, roughly, what produces the 50% slow-down. There is probably some instruction fetch and decode overhead as well, it can't be isolated easily from the measurement.
Notable as well is that they are not eliminated when you remove the [MethodImpl] attribute and allow the optimizer to do its job. The method is in fact not optimized at all, surely because it doesn't want to tackle such a large chunk of code.
Conclusion you should draw is to always leave it up to the jitter optimizer to unroll loops for you. It knows better.

Because the c# jit compiler is optimized to eliminate bounds checks if it can deduce that the variable will not go outside the range of the for loop.
The case with the for (int i = 0; i < arr.Length; i++) is caught by the optimizer, the other case not.
Here is a blog post about it, it's long but worth the read: http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx

Hans Passant seems to have hit the main issues on the head, but has missed some points.
Firstly as Mark Jansen says, the code generator (in the JIT) has a special case to remove bonds checking for simple array access in simple for loops. It is very likely that [MethodImpl(MethodImplOptions.NoOptimization)] does not effect this. Your unrolled loop has to do this check 3000 times!
The next issue is that it takes much longer to read data (or code) from memory then it does to run an instruction that is already in the processor 1st level cache. There is also limited bandwidth from the CPU to RAM, so whenever the CPU is reading an instruction from memory, it can’t be reading from (or updating) the array. Once the loop in Power has executed the first time, all processor instructions will be in the 1st level cache – they may even be stored in a partly decoded form.
Updating 1000 different tempN variables, will put load on the CPU cache and maybe even RAM (as the CPU does not know they are not going to be read again, so must save them to RAM) (Without MethodImplOptions.NoOptimization, the JIT may combine the tempN variables into a few variables that will then fit in registers.)
These days most CPUs can run many instructions at the same time (Superscalar), therefore it is very likely that all the loop checks (1 < arr.Length) etc are being executed at the same time as the store/load from the array. Even the conditional GoTo at the end of the loop with be hidden by Speculative execution (and/or Out-of-order execution).
Any reasonable CPU will be able to run your loop in about the time it takes to read/write the value from memory.
If you had done the same test on 20 years ago on a PC then it is likely you would have got the result you expected.

I am not seeing these results in my tests. I suspect that your tests may be being distorted by garbage collection.
My test results for a release build are as follows (using Visual Studio 2015, .Net 4.6, Windows 10):
x64:
Power() took 00:00:01.5277909
PowerNoLoop() took 00:00:01.4462461
Power() took 00:00:01.5403739
PowerNoLoop() took 00:00:01.4038312
Power() took 00:00:01.5327902
PowerNoLoop() took 00:00:01.4318121
Power() took 00:00:01.5451933
PowerNoLoop() took 00:00:01.4252743
x86:
Power() took 00:00:01.1769501
PowerNoLoop() took 00:00:00.9933677
Power() took 00:00:01.1557201
PowerNoLoop() took 00:00:01.0033348
Power() took 00:00:01.1119558
PowerNoLoop() took 00:00:00.9588702
Power() took 00:00:01.1167853
PowerNoLoop() took 00:00:00.9553292
And the code:
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main()
{
Stopwatch sw = new Stopwatch();
int count = 200000;
var test = new int[1000];
for (int trial = 0; trial < 4; ++trial)
{
sw.Restart();
for (int i = 0; i < count; ++i)
Power(test);
Console.WriteLine("Power() took " + sw.Elapsed);
sw.Restart();
for (int i = 0; i < count; ++i)
PowerNoLoop(test);
Console.WriteLine("PowerNoLoop() took " + sw.Elapsed);
}
}
[MethodImpl(MethodImplOptions.NoOptimization)]
public static void Power(int[] arr)
{
for (int i = 0; i < arr.Length; i++)
{
arr[i] = arr[i] + arr[i];
}
}
[MethodImpl(MethodImplOptions.NoOptimization)]
public static void PowerNoLoop(int[] arr)
{
int i = 0;
arr[i] = arr[i] + arr[i];
++i;
<snip> Previous two lines repeated 1000 times.
}
}
}

Related

Video rate image construction from binary data performance

First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like following C-style struct:
struct {
unsigned TimeTag :16;
unsigned Channel :12;
unsigned Route :2;
unsigned Valid :1;
unsigned Reserved :1; } TTTRrecord;
Whether we are dealing with a photon record or a sync event, time
tag will always hold the time of the event relative to the start of
the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give either a value indicating start of frame or end of scan line in a frame.
The last relevant bit of info is that Timetag is 16 bit and thus obviously limited. If the Timetag counter rolls over, the rollover counter is incremented. This rollover (overflow) count can easily be obtained from channel overflow = Channel & 2048.
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can look over all available data, find all overflows, which allows me to reconstruct the sequential macro time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
// Will hold the pixels obviously
int[] linePixels = new int[pixelCount];
// Calculate everything (sync, overflow, ...) from the raw records
int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();
int[] absTime = new int[overflow.Length];
absTime[0] = 0;
Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();
long lineStartTime = absTime[0];
int tempIdx = 0;
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
for (int i = tempIdx; i < someRecords.Length; i++)
{
if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
{
count++;
}
}
// Avoid checking records in the raw data that were already binned to a pixel.
linePixels[j] = count;
tempIdx += count;
}
return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
lineStartTime += pixelduration;
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
}
// Doing it here is faster.
linePixels[j] = count;
tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now, my programming dealt with rather trivial things so I lack some experience but things I think I might consider:
Matlab is/seems very efficient with vectored operations. Could I achieve similar things in C#, i.e. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
else
{
break;
}
}

How come this algorithm in Ruby runs faster than in Parallel'd C#?

The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s

Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.

Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you have integer overfill and start checking negative values. The same collatz method should operate with long values.

I experienced something like that. And I figured out that's because each of your loop iterations need to start other thread and this takes some time, and in this case it's comparable (I think it's more time) than the operations you acctualy do in the loop body.
There is an alternative for that: You can get how many CPU cores you have and than use a parallelism loop with the same number of iterations you have cores, each loop will evaluate part of the acctual loop you want, it's done by making an inner for loop that depends on the parallel loop.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I've been using AForge implementation of Parallel for loop, and it seems like, it fires a thread for each iteration of the loop, so, that's why if the loop takes relatively a small amount of time to execute, you end up with a inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need to use this kind of "bogus".

Most probable bits in random integer

I've made such experiment - made 10 million random numbers from C and C#. And then counted how much times each bit from 15 bits in random integer is set. (I chose 15 bits because C supports random integer only up to 0x7fff).
What i've got is this:
I have two questions:
Why there are 3 most probable bits ? In C case bits 8,10,12 are most probable. And
in C# bits 6,8,11 are most probable.
Also seems that C# most probable bits is mostly shifted by 2 positions then compared to C most probable bits. Why is this ? Because C# uses other RAND_MAX constant or what ?
My test code for C:
void accumulateResults(int random, int bitSet[15]) {
int i;
int isBitSet;
for (i=0; i < 15; i++) {
isBitSet = ((random & (1<<i)) != 0);
bitSet[i] += isBitSet;
}
}
int main() {
int i;
int bitSet[15] = {0};
int times = 10000000;
srand(0);
for (i=0; i < times; i++) {
accumulateResults(rand(), bitSet);
}
for (i=0; i < 15; i++) {
printf("%d : %d\n", i , bitSet[i]);
}
system("pause");
return 0;
}
And test code for C#:
static void accumulateResults(int random, int[] bitSet)
{
int i;
int isBitSet;
for (i = 0; i < 15; i++)
{
isBitSet = ((random & (1 << i)) != 0) ? 1 : 0;
bitSet[i] += isBitSet;
}
}
static void Main(string[] args)
{
int i;
int[] bitSet = new int[15];
int times = 10000000;
Random r = new Random();
for (i = 0; i < times; i++)
{
accumulateResults(r.Next(), bitSet);
}
for (i = 0; i < 15; i++)
{
Console.WriteLine("{0} : {1}", i, bitSet[i]);
}
Console.ReadKey();
}
Very thanks !! Btw, OS is Windows 7, 64-bit architecture & Visual Studio 2010.
EDIT
Very thanks to #David Heffernan. I made several mistakes here:
Seed in C and C# programs was different (C was using zero and C# - current time).
I didn't tried experiment with different values of Times variable to research reproducibility of results.
Here's what i've got when analyzed how probability that first bit is set depends on number of times random() was called:
So as many noticed - results are not reproducible and shouldn't be taken seriously.
(Except as some form of confirmation that C/C# PRNG are good enough :-) ).

This is just common or garden sampling variation.
Imagine an experiment where you toss a coin ten times, repeatedly. You would not expect to get five heads every single time. That's down to sampling variation.
In just the same way, your experiment will be subject to sampling variation. Each bit follows the same statistical distribution. But sampling variation means that you would not expect an exact 50/50 split between 0 and 1.
Now, your plot is misleading you into thinking the variation is somehow significant or carries meaning. You'd get a much better understanding of this if you plotted the Y axis of the graph starting at 0. That graph looks like this:
If the RNG behaves as it should, then each bit will follow the binomial distribution with probability 0.5. This distribution has variance np(1 − p). For your experiment this gives a variance of 2.5 million. Take the square root to get the standard deviation of around 1,500. So you can see simply from inspecting your results, that the variation you see is not obviously out of the ordinary. You have 15 samples and none are more than 1.6 standard deviations from the true mean. That's nothing to worry about.
You have attempted to discern trends in the results. You have said that there are "3 most probable bits". That's only your particular interpretation of this sample. Try running your programs again with different seeds for your RNGs and you will have graphs that look a little different. They will still have the same quality to them. Some bits are set more than others. But there won't be any discernible patterns, and when you plot them on a graph that includes 0, you will see horizontal lines.
For example, here's what your C program outputs for a random seed of 98723498734.
I think this should be enough to persuade you to run some more trials. When you do so you will see that there are no special bits that are given favoured treatment.

You know that the deviation is about 2500/5,000,000, which comes down to 0,05%?

Note that the difference of frequency of each bit varies by only about 0.08% (-0.03% to +0.05%). I don't think I would consider that significant. If every bit were exactly equally probable, I would find the PRNG very questionable instead of just somewhat questionable. You should expect some level of variance in processes that are supposed to be more or less modelling randomness...

Off By One errors and Mutation Testing

In the process of writing an "Off By One" mutation tester for my favourite mutation testing framework (NinjaTurtles), I wrote the following code to provide an opportunity to check the correctness of my implementation:
public int SumTo(int max)
{
int sum = 0;
for (var i = 1; i <= max; i++)
{
sum += i;
}
return sum;
}
now this seems simple enough, and it didn't strike me that there would be a problem trying to mutate all the literal integer constants in the IL. After all, there are only 3 (the 0, the 1, and the ++).
WRONG!
It became very obvious on the first run that it was never going to work in this particular instance. Why? Because changing the code to
public int SumTo(int max)
{
int sum = 0;
for (var i = 0; i <= max; i++)
{
sum += i;
}
return sum;
}
only adds 0 (zero) to the sum, and this obviously has no effect. Different story if it was the multiple set, but in this instance it was not.
Now there's a fairly easy algorithm for working out the sum of integers
sum = max * (max + 1) / 2;
which I could have fail the mutations easily, since adding or subtracting 1 from either of the constants there will result in an error. (given that max >= 0)
So, problem solved for this particular case. Although it did not do what I wanted for the test of the mutation, which was to check what would happen when I lost the ++ - effectively an infinite loop. But that's another problem.
So - My Question: Are there any trivial or non-trivial cases where a loop starting from 0 or 1 may result in a "mutation off by one" test failure that cannot be refactored (code under test or test) in a similar way? (examples please)
Note: Mutation tests fail when the test suite passes after a mutation has been applied.
Update: an example of something less trivial, but something that could still have the test refactored so that it failed would be the following
public int SumArray(int[] array)
{
int sum = 0;
for (var i = 0; i < array.Length; i++)
{
sum += array[i];
}
return sum;
}
Mutation testing against this code would fail when changing the var i=0 to var i=1 if the test input you gave it was new[] {0,1,2,3,4,5,6,7,8,9}. However change the test input to new[] {9,8,7,6,5,4,3,2,1,0}, and the mutation testing will fail. So a successful refactor proves the testing.

I think with this particular method, there are two choices. You either admit that it's not suitable for mutation testing because of this mathematical anomaly, or you try to write it in a way that makes it safe for mutation testing, either by refactoring to the form you give, or some other way (possibly recursive?).
Your question really boils down to this: is there a real life situation where we care about whether the element 0 is included in or excluded from the operation of a loop, and for which we cannot write a test around that specific aspect? My instinct is to say no.
Your trivial example may be an example of lack of what I referred to as test-drivenness in my blog, writing about NinjaTurtles. Meaning in the case that you have not refactored this method as far as you should.

One natural case of "mutation test failure" is an algorithm for matrix transposition. To make it more suitable for a single for-loop, add some constraints to this task: let the matrix be non-square and require transposition to be in-place. These constraints make one-dimensional array most suitable place to store the matrix and a for-loop (starting, usually, from index '1') may be used to process it. If you start it from index '0', nothing changes, because top-left element of the matrix always transposes to itself.
For an example of such code, see answer to other question (not in C#, sorry).
Here "mutation off by one" test fails, refactoring the test does not change it. I don't know if the code itself may be refactored to avoid this. In theory it may be possible, but should be too difficult.
The code snippet I referenced earlier is not a perfect example. It still may be refactored if the for loop is substituted by two nested loops (as if for rows and columns) and then these rows and columns are recalculated back to one-dimensional index. Still it gives an idea how to make some algorithm, which cannot be refactored (though not very meaningful).
Iterate through an array of positive integers in the order of increasing indexes, for each index compute its pair as i + i % a[i], and if it's not outside the bounds, swap these elements:
for (var i = 1; i < a.Length; i++)
{
var j = i + i % a[i];
if (j < a.Length)
Swap(a[i], a[j]);
}
Here again a[0] is "unmovable", refactoring the test does not change this, and refactoring the code itself is practically impossible.
One more "meaningful" example. Let's implement an implicit Binary Heap. It is usually placed to some array, starting from index '1' (this simplifies many Binary Heap computations, compared to starting from index '0'). Now implement a copy method for this heap. "Off-by-one" problem in this copy method is undetectable because index zero is unused and C# zero-initializes all arrays. This is similar to OP's array summation, but cannot be refactored.
Strictly speaking, you can refactor the whole class and start everything from '0'. But changing only 'copy' method or the test does not prevent "mutation off by one" test failure. Binary Heap class may be treated just as a motivation to copy an array with unused first element.
int[] dst = new int[src.Length];
for (var i = 1; i < src.Length; i++)
{
dst[i] = src[i];
}

Yes, there are many, assuming I have understood your question.
One similar to your case is:
public int MultiplyTo(int max)
{
int product = 1;
for (var i = 1; i <= max; i++)
{
product *= i;
}
return product;
}
Here, if it starts from 0, the result will be 0, but if it starts from 1 the result should be correct. (Although it won't tell the difference between 1 and 2!).

Not quite sure what you are looking for exactly, but it seems to me that if you change/mutate the initial value of sum from 0 to 1, you should fail the test:
public int SumTo(int max)
{
int sum = 1; // Now we are off-by-one from the beginning!
for (var i = 0; i <= max; i++)
{
sum += i;
}
return sum;
}
Update based on comments:
The loop will only not fail after mutation when the loop invariant is violated in the processing of index 0 (or in the absence of it). Most such special cases can be refactored out of the loop, but consider a summation of 1/x:
for (var i = 1; i <= max; i++) {
sum += 1/i;
}
This works fine, but if you mutate the initial bundary from 1 to 0, the test will fail as 1/0 is invalid operation.

which approach is good while using strings [duplicate]

I have always wondered if, in general, declaring a throw-away variable before a loop, as opposed to repeatedly inside the loop, makes any (performance) difference?
A (quite pointless) example in Java:
a) declaration before loop:
double intermediateResult;
for(int i=0; i < 1000; i++){
intermediateResult = i;
System.out.println(intermediateResult);
}
b) declaration (repeatedly) inside loop:
for(int i=0; i < 1000; i++){
double intermediateResult = i;
System.out.println(intermediateResult);
}
Which one is better, a or b?
I suspect that repeated variable declaration (example b) creates more overhead in theory, but that compilers are smart enough so that it doesn't matter. Example b has the advantage of being more compact and limiting the scope of the variable to where it is used. Still, I tend to code according example a.
Edit: I am especially interested in the Java case.

Which is better, a or b?
From a performance perspective, you'd have to measure it. (And in my opinion, if you can measure a difference, the compiler isn't very good).
From a maintenance perspective, b is better. Declare and initialize variables in the same place, in the narrowest scope possible. Don't leave a gaping hole between the declaration and the initialization, and don't pollute namespaces you don't need to.

Well I ran your A and B examples 20 times each, looping 100 million times.(JVM - 1.5.0)
A: average execution time: .074 sec
B: average execution time : .067 sec
To my surprise B was slightly faster.
As fast as computers are now its hard to say if you could accurately measure this.
I would code it the A way as well but I would say it doesn't really matter.

It depends on the language and the exact use. For instance, in C# 1 it made no difference. In C# 2, if the local variable is captured by an anonymous method (or lambda expression in C# 3) it can make a very signficant difference.
Example:
using System;
using System.Collections.Generic;
class Test
{
static void Main()
{
List<Action> actions = new List<Action>();
int outer;
for (int i=0; i < 10; i++)
{
outer = i;
int inner = i;
actions.Add(() => Console.WriteLine("Inner={0}, Outer={1}", inner, outer));
}
foreach (Action action in actions)
{
action();
}
}
}
Output:
Inner=0, Outer=9
Inner=1, Outer=9
Inner=2, Outer=9
Inner=3, Outer=9
Inner=4, Outer=9
Inner=5, Outer=9
Inner=6, Outer=9
Inner=7, Outer=9
Inner=8, Outer=9
Inner=9, Outer=9
The difference is that all of the actions capture the same outer variable, but each has its own separate inner variable.

The following is what I wrote and compiled in .NET.
double r0;
for (int i = 0; i < 1000; i++) {
r0 = i*i;
Console.WriteLine(r0);
}
for (int j = 0; j < 1000; j++) {
double r1 = j*j;
Console.WriteLine(r1);
}
This is what I get from .NET Reflector when CIL is rendered back into code.
for (int i = 0; i < 0x3e8; i++)
{
double r0 = i * i;
Console.WriteLine(r0);
}
for (int j = 0; j < 0x3e8; j++)
{
double r1 = j * j;
Console.WriteLine(r1);
}
So both look exactly same after compilation. In managed languages code is converted into CL/byte code and at time of execution it's converted into machine language. So in machine language a double may not even be created on the stack. It may just be a register as code reflect that it is a temporary variable for WriteLine function. There are a whole set optimization rules just for loops. So the average guy shouldn't be worried about it, especially in managed languages. There are cases when you can optimize manage code, for example, if you have to concatenate a large number of strings using just string a; a+=anotherstring[i] vs using StringBuilder. There is very big difference in performance between both. There are a lot of such cases where the compiler cannot optimize your code, because it cannot figure out what is intended in a bigger scope. But it can pretty much optimize basic things for you.

This is a gotcha in VB.NET. The Visual Basic result won't reinitialize the variable in this example:
For i as Integer = 1 to 100
Dim j as Integer
Console.WriteLine(j)
j = i
Next
' Output: 0 1 2 3 4...
This will print 0 the first time (Visual Basic variables have default values when declared!) but i each time after that.
If you add a = 0, though, you get what you might expect:
For i as Integer = 1 to 100
Dim j as Integer = 0
Console.WriteLine(j)
j = i
Next
'Output: 0 0 0 0 0...

I made a simple test:
int b;
for (int i = 0; i < 10; i++) {
b = i;
}
vs
for (int i = 0; i < 10; i++) {
int b = i;
}
I compiled these codes with gcc - 5.2.0. And then I disassembled the main ()
of these two codes and that's the result:
1º:
0x00000000004004b6 <+0>: push rbp
0x00000000004004b7 <+1>: mov rbp,rsp
0x00000000004004ba <+4>: mov DWORD PTR [rbp-0x4],0x0
0x00000000004004c1 <+11>: jmp 0x4004cd <main+23>
0x00000000004004c3 <+13>: mov eax,DWORD PTR [rbp-0x4]
0x00000000004004c6 <+16>: mov DWORD PTR [rbp-0x8],eax
0x00000000004004c9 <+19>: add DWORD PTR [rbp-0x4],0x1
0x00000000004004cd <+23>: cmp DWORD PTR [rbp-0x4],0x9
0x00000000004004d1 <+27>: jle 0x4004c3 <main+13>
0x00000000004004d3 <+29>: mov eax,0x0
0x00000000004004d8 <+34>: pop rbp
0x00000000004004d9 <+35>: ret
vs
2º
0x00000000004004b6 <+0>: push rbp
0x00000000004004b7 <+1>: mov rbp,rsp
0x00000000004004ba <+4>: mov DWORD PTR [rbp-0x4],0x0
0x00000000004004c1 <+11>: jmp 0x4004cd <main+23>
0x00000000004004c3 <+13>: mov eax,DWORD PTR [rbp-0x4]
0x00000000004004c6 <+16>: mov DWORD PTR [rbp-0x8],eax
0x00000000004004c9 <+19>: add DWORD PTR [rbp-0x4],0x1
0x00000000004004cd <+23>: cmp DWORD PTR [rbp-0x4],0x9
0x00000000004004d1 <+27>: jle 0x4004c3 <main+13>
0x00000000004004d3 <+29>: mov eax,0x0
0x00000000004004d8 <+34>: pop rbp
0x00000000004004d9 <+35>: ret
Which are exaclty the same asm result. isn't a proof that the two codes produce the same thing?

It is language dependent - IIRC C# optimises this, so there isn't any difference, but JavaScript (for example) will do the whole memory allocation shebang each time.

I would always use A (rather than relying on the compiler) and might also rewrite to:
for(int i=0, double intermediateResult=0; i<1000; i++){
intermediateResult = i;
System.out.println(intermediateResult);
}
This still restricts intermediateResult to the loop's scope, but doesn't redeclare during each iteration.

In my opinion, b is the better structure. In a, the last value of intermediateResult sticks around after your loop is finished.
Edit:
This doesn't make a lot of difference with value types, but reference types can be somewhat weighty. Personally, I like variables to be dereferenced as soon as possible for cleanup, and b does that for you,

I suspect a few compilers could optimize both to be the same code, but certainly not all. So I'd say you're better off with the former. The only reason for the latter is if you want to ensure that the declared variable is used only within your loop.

As a general rule, I declare my variables in the inner-most possible scope. So, if you're not using intermediateResult outside of the loop, then I'd go with B.

A co-worker prefers the first form, telling it is an optimization, preferring to re-use a declaration.
I prefer the second one (and try to persuade my co-worker! ;-)), having read that:
It reduces scope of variables to where they are needed, which is a good thing.
Java optimizes enough to make no significant difference in performance. IIRC, perhaps the second form is even faster.
Anyway, it falls in the category of premature optimization that rely in quality of compiler and/or JVM.

There is a difference in C# if you are using the variable in a lambda, etc. But in general the compiler will basically do the same thing, assuming the variable is only used within the loop.
Given that they are basically the same: Note that version b makes it much more obvious to readers that the variable isn't, and can't, be used after the loop. Additionally, version b is much more easily refactored. It is more difficult to extract the loop body into its own method in version a. Moreover, version b assures you that there is no side effect to such a refactoring.
Hence, version a annoys me to no end, because there's no benefit to it and it makes it much more difficult to reason about the code...

Well, you could always make a scope for that:
{ //Or if(true) if the language doesn't support making scopes like this
double intermediateResult;
for (int i=0; i<1000; i++) {
intermediateResult = i;
System.out.println(intermediateResult);
}
}
This way you only declare the variable once, and it'll die when you leave the loop.

I think it depends on the compiler and is hard to give a general answer.

I've always thought that if you declare your variables inside of your loop then you're wasting memory. If you have something like this:
for(;;) {
Object o = new Object();
}
Then not only does the object need to be created for each iteration, but there needs to be a new reference allocated for each object. It seems that if the garbage collector is slow then you'll have a bunch of dangling references that need to be cleaned up.
However, if you have this:
Object o;
for(;;) {
o = new Object();
}
Then you're only creating a single reference and assigning a new object to it each time. Sure, it might take a bit longer for it to go out of scope, but then there's only one dangling reference to deal with.

My practice is following:
if type of variable is simple (int, double, ...) I prefer variant b (inside).
Reason: reducing scope of variable.
if type of variable is not simple (some kind of class or struct) I prefer variant a (outside).
Reason: reducing number of ctor-dtor calls.

I had this very same question for a long time. So I tested an even simpler piece of code.
Conclusion: For such cases there is NO performance difference.
Outside loop case
int intermediateResult;
for(int i=0; i < 1000; i++){
intermediateResult = i+2;
System.out.println(intermediateResult);
}
Inside loop case
for(int i=0; i < 1000; i++){
int intermediateResult = i+2;
System.out.println(intermediateResult);
}
I checked the compiled file on IntelliJ's decompiler and for both cases, I got the same Test.class
for(int i = 0; i < 1000; ++i) {
int intermediateResult = i + 2;
System.out.println(intermediateResult);
}
I also disassembled code for both the case using the method given in this answer. I'll show only the parts relevant to the answer
Outside loop case
Code:
stack=2, locals=3, args_size=1
0: iconst_0
1: istore_2
2: iload_2
3: sipush 1000
6: if_icmpge 26
9: iload_2
10: iconst_2
11: iadd
12: istore_1
13: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
16: iload_1
17: invokevirtual #3 // Method java/io/PrintStream.println:(I)V
20: iinc 2, 1
23: goto 2
26: return
LocalVariableTable:
Start Length Slot Name Signature
13 13 1 intermediateResult I
2 24 2 i I
0 27 0 args [Ljava/lang/String;
Inside loop case
Code:
stack=2, locals=3, args_size=1
0: iconst_0
1: istore_1
2: iload_1
3: sipush 1000
6: if_icmpge 26
9: iload_1
10: iconst_2
11: iadd
12: istore_2
13: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
16: iload_2
17: invokevirtual #3 // Method java/io/PrintStream.println:(I)V
20: iinc 1, 1
23: goto 2
26: return
LocalVariableTable:
Start Length Slot Name Signature
13 7 2 intermediateResult I
2 24 1 i I
0 27 0 args [Ljava/lang/String;
If you pay close attention, only the Slot assigned to i and intermediateResult in LocalVariableTable is swapped as a product of their order of appearance. The same difference in slot is reflected in other lines of code.
No extra operation is being performed
intermediateResult is still a local variable in both cases, so there is no difference access time.
BONUS
Compilers do a ton of optimization, take a look at what happens in this case.
Zero work case
for(int i=0; i < 1000; i++){
int intermediateResult = i;
System.out.println(intermediateResult);
}
Zero work decompiled
for(int i = 0; i < 1000; ++i) {
System.out.println(i);
}

From a performance perspective, outside is (much) better.
public static void outside() {
double intermediateResult;
for(int i=0; i < Integer.MAX_VALUE; i++){
intermediateResult = i;
}
}
public static void inside() {
for(int i=0; i < Integer.MAX_VALUE; i++){
double intermediateResult = i;
}
}
I executed both functions 1 billion times each.
outside() took 65 milliseconds. inside() took 1.5 seconds.

I tested for JS with Node 4.0.0 if anyone is interested. Declaring outside the loop resulted in a ~.5 ms performance improvement on average over 1000 trials with 100 million loop iterations per trial. So I'm gonna say go ahead and write it in the most readable / maintainable way which is B, imo. I would put my code in a fiddle, but I used the performance-now Node module. Here's the code:
var now = require("../node_modules/performance-now")
// declare vars inside loop
function varInside(){
for(var i = 0; i < 100000000; i++){
var temp = i;
var temp2 = i + 1;
var temp3 = i + 2;
}
}
// declare vars outside loop
function varOutside(){
var temp;
var temp2;
var temp3;
for(var i = 0; i < 100000000; i++){
temp = i
temp2 = i + 1
temp3 = i + 2
}
}
// for computing average execution times
var insideAvg = 0;
var outsideAvg = 0;
// run varInside a million times and average execution times
for(var i = 0; i < 1000; i++){
var start = now()
varInside()
var end = now()
insideAvg = (insideAvg + (end-start)) / 2
}
// run varOutside a million times and average execution times
for(var i = 0; i < 1000; i++){
var start = now()
varOutside()
var end = now()
outsideAvg = (outsideAvg + (end-start)) / 2
}
console.log('declared inside loop', insideAvg)
console.log('declared outside loop', outsideAvg)

A) is a safe bet than B).........Imagine if you are initializing structure in loop rather than 'int' or 'float' then what?
like
typedef struct loop_example{
JXTZ hi; // where JXTZ could be another type...say closed source lib
// you include in Makefile
}loop_example_struct;
//then....
int j = 0; // declare here or face c99 error if in loop - depends on compiler setting
for ( ;j++; )
{
loop_example loop_object; // guess the result in memory heap?
}
You are certainly bound to face problems with memory leaks!. Hence I believe 'A' is safer bet while 'B' is vulnerable to memory accumulation esp working close source libraries.You can check usinng 'Valgrind' Tool on Linux specifically sub tool 'Helgrind'.

It's an interesting question. From my experience there is an ultimate question to consider when you debate this matter for a code:
Is there any reason why the variable would need to be global?
It makes sense to only declare the variable once, globally, as opposed to many times locally, because it is better for organizing the code and requires less lines of code. However, if it only needs to be declared locally within one method, I would initialize it in that method so it is clear that the variable is exclusively relevant to that method. Be careful not to call this variable outside the method in which it is initialized if you choose the latter option--your code won't know what you're talking about and will report an error.
Also, as a side note, don't duplicate local variable names between different methods even if their purposes are near-identical; it just gets confusing.

this is the better form
double intermediateResult;
int i = byte.MinValue;
for(; i < 1000; i++)
{
intermediateResult = i;
System.out.println(intermediateResult);
}
1) in this way declared once time both variable, and not each for cycle.
2) the assignment it's fatser thean all other option.
3) So the bestpractice rule is any declaration outside the iteration for.

Tried the same thing in Go, and compared the compiler output using go tool compile -S with go 1.9.4
Zero difference, as per the assembler output.

I use (A) when I want to see the contents of the variable after exiting the loop. It only matters for debugging. I use (B) when I want the code more compact, since it saves one line of code.

Even if I know my compiler is smart enough, I won't like to rely on it, and will use the a) variant.
The b) variant makes sense to me only if you desperately need to make the intermediateResult unavailable after the loop body. But I can't imagine such desperate situation, anyway....
EDIT: Jon Skeet made a very good point, showing that variable declaration inside a loop can make an actual semantic difference.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.