Related
First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like following C-style struct:
struct {
unsigned TimeTag :16;
unsigned Channel :12;
unsigned Route :2;
unsigned Valid :1;
unsigned Reserved :1; } TTTRrecord;
Whether we are dealing with a photon record or a sync event, time
tag will always hold the time of the event relative to the start of
the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give either a value indicating start of frame or end of scan line in a frame.
The last relevant bit of info is that Timetag is 16 bit and thus obviously limited. If the Timetag counter rolls over, the rollover counter is incremented. This rollover (overflow) count can easily be obtained from channel overflow = Channel & 2048.
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can look over all available data, find all overflows, which allows me to reconstruct the sequential macro time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
// Will hold the pixels obviously
int[] linePixels = new int[pixelCount];
// Calculate everything (sync, overflow, ...) from the raw records
int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();
int[] absTime = new int[overflow.Length];
absTime[0] = 0;
Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();
long lineStartTime = absTime[0];
int tempIdx = 0;
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
for (int i = tempIdx; i < someRecords.Length; i++)
{
if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
{
count++;
}
}
// Avoid checking records in the raw data that were already binned to a pixel.
linePixels[j] = count;
tempIdx += count;
}
return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
lineStartTime += pixelduration;
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
}
// Doing it here is faster.
linePixels[j] = count;
tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now, my programming dealt with rather trivial things so I lack some experience but things I think I might consider:
Matlab is/seems very efficient with vectored operations. Could I achieve similar things in C#, i.e. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
else
{
break;
}
}
The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you have integer overfill and start checking negative values. The same collatz method should operate with long values.
I experienced something like that. And I figured out that's because each of your loop iterations need to start other thread and this takes some time, and in this case it's comparable (I think it's more time) than the operations you acctualy do in the loop body.
There is an alternative for that: You can get how many CPU cores you have and than use a parallelism loop with the same number of iterations you have cores, each loop will evaluate part of the acctual loop you want, it's done by making an inner for loop that depends on the parallel loop.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I've been using AForge implementation of Parallel for loop, and it seems like, it fires a thread for each iteration of the loop, so, that's why if the loop takes relatively a small amount of time to execute, you end up with a inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need to use this kind of "bogus".
In the process of writing an "Off By One" mutation tester for my favourite mutation testing framework (NinjaTurtles), I wrote the following code to provide an opportunity to check the correctness of my implementation:
public int SumTo(int max)
{
int sum = 0;
for (var i = 1; i <= max; i++)
{
sum += i;
}
return sum;
}
now this seems simple enough, and it didn't strike me that there would be a problem trying to mutate all the literal integer constants in the IL. After all, there are only 3 (the 0, the 1, and the ++).
WRONG!
It became very obvious on the first run that it was never going to work in this particular instance. Why? Because changing the code to
public int SumTo(int max)
{
int sum = 0;
for (var i = 0; i <= max; i++)
{
sum += i;
}
return sum;
}
only adds 0 (zero) to the sum, and this obviously has no effect. Different story if it was the multiple set, but in this instance it was not.
Now there's a fairly easy algorithm for working out the sum of integers
sum = max * (max + 1) / 2;
which I could have fail the mutations easily, since adding or subtracting 1 from either of the constants there will result in an error. (given that max >= 0)
So, problem solved for this particular case. Although it did not do what I wanted for the test of the mutation, which was to check what would happen when I lost the ++ - effectively an infinite loop. But that's another problem.
So - My Question: Are there any trivial or non-trivial cases where a loop starting from 0 or 1 may result in a "mutation off by one" test failure that cannot be refactored (code under test or test) in a similar way? (examples please)
Note: Mutation tests fail when the test suite passes after a mutation has been applied.
Update: an example of something less trivial, but something that could still have the test refactored so that it failed would be the following
public int SumArray(int[] array)
{
int sum = 0;
for (var i = 0; i < array.Length; i++)
{
sum += array[i];
}
return sum;
}
Mutation testing against this code would fail when changing the var i=0 to var i=1 if the test input you gave it was new[] {0,1,2,3,4,5,6,7,8,9}. However change the test input to new[] {9,8,7,6,5,4,3,2,1,0}, and the mutation testing will fail. So a successful refactor proves the testing.
I think with this particular method, there are two choices. You either admit that it's not suitable for mutation testing because of this mathematical anomaly, or you try to write it in a way that makes it safe for mutation testing, either by refactoring to the form you give, or some other way (possibly recursive?).
Your question really boils down to this: is there a real life situation where we care about whether the element 0 is included in or excluded from the operation of a loop, and for which we cannot write a test around that specific aspect? My instinct is to say no.
Your trivial example may be an example of lack of what I referred to as test-drivenness in my blog, writing about NinjaTurtles. Meaning in the case that you have not refactored this method as far as you should.
One natural case of "mutation test failure" is an algorithm for matrix transposition. To make it more suitable for a single for-loop, add some constraints to this task: let the matrix be non-square and require transposition to be in-place. These constraints make one-dimensional array most suitable place to store the matrix and a for-loop (starting, usually, from index '1') may be used to process it. If you start it from index '0', nothing changes, because top-left element of the matrix always transposes to itself.
For an example of such code, see answer to other question (not in C#, sorry).
Here "mutation off by one" test fails, refactoring the test does not change it. I don't know if the code itself may be refactored to avoid this. In theory it may be possible, but should be too difficult.
The code snippet I referenced earlier is not a perfect example. It still may be refactored if the for loop is substituted by two nested loops (as if for rows and columns) and then these rows and columns are recalculated back to one-dimensional index. Still it gives an idea how to make some algorithm, which cannot be refactored (though not very meaningful).
Iterate through an array of positive integers in the order of increasing indexes, for each index compute its pair as i + i % a[i], and if it's not outside the bounds, swap these elements:
for (var i = 1; i < a.Length; i++)
{
var j = i + i % a[i];
if (j < a.Length)
Swap(a[i], a[j]);
}
Here again a[0] is "unmovable", refactoring the test does not change this, and refactoring the code itself is practically impossible.
One more "meaningful" example. Let's implement an implicit Binary Heap. It is usually placed to some array, starting from index '1' (this simplifies many Binary Heap computations, compared to starting from index '0'). Now implement a copy method for this heap. "Off-by-one" problem in this copy method is undetectable because index zero is unused and C# zero-initializes all arrays. This is similar to OP's array summation, but cannot be refactored.
Strictly speaking, you can refactor the whole class and start everything from '0'. But changing only 'copy' method or the test does not prevent "mutation off by one" test failure. Binary Heap class may be treated just as a motivation to copy an array with unused first element.
int[] dst = new int[src.Length];
for (var i = 1; i < src.Length; i++)
{
dst[i] = src[i];
}
Yes, there are many, assuming I have understood your question.
One similar to your case is:
public int MultiplyTo(int max)
{
int product = 1;
for (var i = 1; i <= max; i++)
{
product *= i;
}
return product;
}
Here, if it starts from 0, the result will be 0, but if it starts from 1 the result should be correct. (Although it won't tell the difference between 1 and 2!).
Not quite sure what you are looking for exactly, but it seems to me that if you change/mutate the initial value of sum from 0 to 1, you should fail the test:
public int SumTo(int max)
{
int sum = 1; // Now we are off-by-one from the beginning!
for (var i = 0; i <= max; i++)
{
sum += i;
}
return sum;
}
Update based on comments:
The loop will only not fail after mutation when the loop invariant is violated in the processing of index 0 (or in the absence of it). Most such special cases can be refactored out of the loop, but consider a summation of 1/x:
for (var i = 1; i <= max; i++) {
sum += 1/i;
}
This works fine, but if you mutate the initial bundary from 1 to 0, the test will fail as 1/0 is invalid operation.
Good morning, afternoon or night,
Up until today, I thought comparison was one of the basic processor instructions, and so that it was one of the fastest operations one can do in a computer... On the other hand I know multiplication in sometimes trickier and involves lots of bits operations. However, I was a little shocked to look at the results of the following code:
Stopwatch Test = new Stopwatch();
int a = 0;
int i = 0, j = 0, l = 0;
double c = 0, d = 0;
for (i = 0; i < 32; i++)
{
Test.Start();
for (j = Int32.MaxValue, l = 1; j != 0; j = -j + ((j < 0) ? -1 : 1), l = -l)
{
a = l * j;
}
Test.Stop();
Console.WriteLine("Product: {0}", Test.Elapsed.TotalMilliseconds);
c += Test.Elapsed.TotalMilliseconds;
Test.Reset();
Test.Start();
for (j = Int32.MaxValue, l = 1; j != 0; j = -j + ((j < 0) ? -1 : 1), l = -l)
{
a = (j < 0) ? -j : j;
}
Test.Stop();
Console.WriteLine("Comparison: {0}", Test.Elapsed.TotalMilliseconds);
d += Test.Elapsed.TotalMilliseconds;
Test.Reset();
}
Console.WriteLine("Product: {0}", c / 32);
Console.WriteLine("Comparison: {0}", d / 32);
Console.ReadKey();
}
Result:
Product: 8558.6
Comparison: 9799.7
Quick explanation: j is an ancillary alternate variable which goes like (...), 11, -10, 9, -8, 7, (...) until it reaches zero, l is a variable which stores j's sign, and a is the test variable, which I want always to be equal to the modulus of j. The goal of the test was to check whether it is faster to set a to this value using multiplication or the conditional operator.
Can anyone please comment on these results?
Thank you very much.
Your second test it's not a mere comparison, but an if statement.
That's probably translated in a JUMP/BRANCH instruction in CPU, involving branch prediction (with possible blocks of the pipeline) and then is likely slower than a simple multiplication (even if not so much).
It can often be very difficult to make such assertions about an optimising compiler. They do lots of tricks that make simple cases different from real code. That said, you aren't just doing a comparison, you're doing a compare/assign in a very tight loop. The thread you're working on may have to pause many times at the branch; the multiplication can assign as many times as it likes, as long as the last assignment is still last, so many multiplies can be going on at once.
As is the general rule, make your code clear and ignore minor timing issues unless they become a problem.
If you do have a speed problem a good tracing/timing tool will guide you much better than knowing if one operation is faster than other in a specific case.
I guess one comment I would make is that you are doing a lot more in the second operation:
a = (j < 0) ? -j : j;
Not only are you doing a comparison, but also effectivly a "if..else.." with the ? operator and a negation of j.
You should try to run this test a 1000 or so times and use avrage to compare you never now what CLR is doing in background
I need to resample big sets of data (few hundred spectra, each containing few thousand points) using simple linear interpolation.
I have created interpolation method in C# but it seems to be really slow for huge datasets.
How can I improve the performance of this code?
public static List<double> interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
double[] interpolated = new double[breaks.Count];
int id = 1;
int x = 0;
while(breaks[x] < xItems[0])
{
interpolated[x] = yItems[0];
x++;
}
double p, w;
// left border case - uphold the value
for (int i = x; i < breaks.Count; i++)
{
while (breaks[i] > xItems[id])
{
id++;
if (id > xItems.Count - 1)
{
id = xItems.Count - 1;
break;
}
}
System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));
if (id <= xItems.Count - 1)
{
if (id == xItems.Count - 1 && breaks[i] > xItems[id])
{
interpolated[i] = yItems[yItems.Count - 1];
}
else
{
w = xItems[id] - xItems[id - 1];
p = (breaks[i] - xItems[id - 1]) / w;
interpolated[i] = yItems[id - 1] + p * (yItems[id] - yItems[id - 1]);
}
}
else // right border case - uphold the value
{
interpolated[i] = yItems[yItems.Count - 1];
}
}
return interpolated.ToList();
}
Edit
Thanks, guys, for all your responses. What I wanted to achieve, when I wrote this questions, were some general ideas where I could find some areas to improve the performance. I haven't expected any ready solutions, only some ideas. And you gave me what I wanted, thanks!
Before writing this question I thought about rewriting this code in C++ but after reading comments to Will's asnwer it seems that the gain can be less than I expected.
Also, the code is so simple, that there are no mighty code-tricks to use here. Thanks to Petar for his attempt to optimize the code
It seems that all reduces the problem to finding good profiler and checking every line and soubroutine and trying to optimize that.
Thank you again for all responses and taking your part in this discussion!
public static List<double> Interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
var a = xItems.ToArray();
var b = yItems.ToArray();
var aLimit = a.Length - 1;
var bLimit = b.Length - 1;
var interpolated = new double[breaks.Count];
var total = 0;
var initialValue = a[0];
while (breaks[total] < initialValue)
{
total++;
}
Array.Copy(b, 0, interpolated, 0, total);
int id = 1;
for (int i = total; i < breaks.Count; i++)
{
var breakValue = breaks[i];
while (breakValue > a[id])
{
id++;
if (id > aLimit)
{
id = aLimit;
break;
}
}
double value = b[bLimit];
if (id <= aLimit)
{
var currentValue = a[id];
var previousValue = a[id - 1];
if (id != aLimit || breakValue <= currentValue)
{
var w = currentValue - previousValue;
var p = (breakValue - previousValue) / w;
value = b[id - 1] + p * (b[id] - b[id - 1]);
}
}
interpolated[i] = value;
}
return interpolated.ToList();
}
I've cached some (const) values and used Array.Copy, but I think these are micro optimization that are already made by the compiler in Release mode. However You can try this version and see if it will beat the original version of the code.
Instead of
interpolated.ToList()
which copies the whole array, you compute the interpolated values directly in the final list (or return that array instead). Especially if the array/List is big enough to qualify for the large object heap.
Unlike the ordinary heap, the LOH is not compacted by the GC, which means that short lived large objects are far more harmful than small ones.
Then again: 7000 doubles are approx. 56'000 bytes which is below the large object threshold of 85'000 bytes (1).
Looks to me you've created an O(n^2) algorithm. You are searching for the interval, that's O(n), then probably apply it n times. You'll get a quick and cheap speed-up by taking advantage of the fact that the items are already ordered in the list. Use BinarySearch(), that's O(log(n)).
If still necessary, you should be able to do something speedier with the outer loop, what ever interval you found previously should make it easier to find the next one. But that code isn't in your snippet.
I'd say profile the code and see where it spends its time, then you have somewhere to focus on.
ANTS is popular, but Equatec is free I think.
few suggestions,
as others suggested, use profiler to understand better where time is used.
the loop
while (breaks[x] < xItems[0])
could cause exception if x grows bigger than number of items in "breaks" list. You should use something like
while (x < breaks.Count && breaks[x] < xItems[0])
But you might not need that loop at all. Why treat the first item as special case, just start with id=0 and handle the first point in for(i) loop. I understand that id might start from 0 in this case, and [id-1] would be negative index, but see if you can do something there.
If you want to optimize for speed then you sacrifice memory size, and vice versa. You cannot usually have both, except if you make really clever algorithm. In this case, it would mean to calculate as much as you can outside loops, store those values in variables (extra memory) and use them later. For example, instead of always saying:
id = xItems.Count - 1;
You could say:
int lastXItemsIndex = xItems.Count-1;
...
id = lastXItemsIndex;
This is the same suggestion as Petar Petrov did with aLimit, bLimit....
next point, your loop (or the one Petar Petrov suggested):
while (breaks[i] > xItems[id])
{
id++;
if (id > xItems.Count - 1)
{
id = xItems.Count - 1;
break;
}
}
could probably be reduced to:
double currentBreak = breaks[i];
while (id <= lastXIndex && currentBreak > xItems[id]) id++;
and the last point I would add is to check if there is some property in your samples that is special for your problem. For example if xItems represent time, and you are sampling in regular intervals, then
w = xItems[id] - xItems[id - 1];
is constant, and you do not have to calculate it every time in the loop.
This is probably not often the case, but maybe your problem has some other property which you could use to improve performance.
Another idea is this: maybe you do not need double precision, "float" is probably faster because it is smaller.
Good luck
System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));
I hope it's release build without DEBUG defined?
Otherwise, it might depend on what exactly are those IList parameters. May be useful to store Count value instead of accessing property every time.
This is the kind of problem where you need to move over to native code.