I'm creating a game with loads of asteroids floating in a huge space. To accommodate the heavy collision detection, I've split the world into "AsteroidChunks", where only the closest chunks are used for collision detection. The work to split the asteroids into chunks is extremely repetitive, with something like 1.2 billion iterations every time the game starts up. However, I've noticed that the game is not actually using that much CPU power; not a single core was running at 100%! (The game is based on MonoGame for Windows.)
My idea is that by running the most iteration-heavy work on multiple threads (giving each thread something like 1,000 iterations to complete) I would be able to get the job done quicker, at the expense of CPU power of course. Am I on the right track, and how should I proceed?
Is there any other way of speeding up the iterations, allowing for a larger 3D space?
This is what I've tried so far. It's not working, but it gives an idea of what I want to accomplish (asteroidChunks is a List with approx. 16,000 entries, but could easily be > 100,000):
Task[] populateChunkTasks = new Task[asteroidChunks.Count];

for (int i = 0; i < asteroidChunks.Count; i += 1000)
    populateChunkTasks[i] = new Task(() =>
    {
        for (int x = i; x < asteroidChunks.Count && x < i + 1000; x++)
            PopulateChunk(asteroidChunks[x]);
    });

for (int i = 0; i < populateChunkTasks.Length; i++)
    populateChunkTasks[i].Start();

Task.WaitAll(populateChunkTasks);
As you can see I want to wait for the threads to complete at the end. However, the operation itself is in its own Task, which is why I can't wait for all tasks - in this case - to complete before proceeding.
PopulateChunk (asteroidList is a list with approx. 75,000 entries, but could easily be > 1 million):
public static AsteroidChunk PopulateChunk(AsteroidChunk chunk)
{
    foreach (Asteroid asteroid in asteroidList)
        if (chunk.box.Contains(asteroid.boundingSphere) == ContainmentType.Contains)
            chunk.asteroids.Add(asteroid);

    return chunk;
}
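For reference, here is a minimal sketch of the same partitioning idea that avoids two pitfalls in the snippet above: only every 1000th slot of the task array is assigned (so most entries stay null and Start()/WaitAll fail), and the lambda captures the loop variable i rather than a per-task copy. It assumes the same asteroidChunks list and PopulateChunk method as above.

const int batchSize = 1000;
int taskCount = (asteroidChunks.Count + batchSize - 1) / batchSize;
Task[] populateChunkTasks = new Task[taskCount];

for (int t = 0; t < taskCount; t++)
{
    // Copy the range bounds into locals so the lambda does not capture the loop variable.
    int start = t * batchSize;
    int end = Math.Min(start + batchSize, asteroidChunks.Count);

    populateChunkTasks[t] = Task.Factory.StartNew(() =>
    {
        for (int x = start; x < end; x++)
            PopulateChunk(asteroidChunks[x]);
    });
}

// Blocks until every batch has been populated, as asked for above.
Task.WaitAll(populateChunkTasks);

Parallel.For(0, asteroidChunks.Count, i => PopulateChunk(asteroidChunks[i])) would achieve much the same thing with less bookkeeping, since the TPL picks its own partition sizes.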
First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like the following C-style struct:
struct {
    unsigned TimeTag  :16;
    unsigned Channel  :12;
    unsigned Route    :2;
    unsigned Valid    :1;
    unsigned Reserved :1;
} TTTRrecord;
Whether we are dealing with a photon record or a sync event, TimeTag always holds the time of the event relative to the start of the experiment (macro-time).
If a record is a photon, Valid == 1.
If a record is a sync signal or something else, Valid == 0.
If a record is a sync signal, sync type = Channel & 7 gives a value indicating either the start of a frame or the end of a scan line in a frame.
The last relevant bit of info is that TimeTag is 16 bits and thus obviously limited. Whenever the TimeTag counter rolls over, a rollover counter is incremented; whether a record marks such a rollover (overflow) can easily be obtained from overflow = Channel & 2048.
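To make the layout concrete, here is a small decoding sketch based purely on the bit widths listed above (the helper name and out-parameter style are mine, not code from the repo):

// Decodes one raw Int32 record according to the bit layout described above.
static void DecodeRecord(int record, out int timeTag, out int channel, out int route, out int valid)
{
    timeTag = record & 0xFFFF;          // bits 0-15  : TimeTag
    channel = (record >> 16) & 0xFFF;   // bits 16-27 : Channel
    route   = (record >> 28) & 0x3;     // bits 28-29 : Route
    valid   = (record >> 30) & 0x1;     // bit 30     : Valid (bit 31 is Reserved)
}

// For a sync record (valid == 0):
//   syncType = channel & 7      -> start-of-frame / end-of-line marker
//   overflow = channel & 2048   -> non-zero when the 16-bit TimeTag has rolled over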
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can loop over all the available data and find all overflows, which allows me to reconstruct the sequential macro-time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
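As an aside, since the pixels are just equal-width time bins starting at the line's Tstart, the same histogram can also be built in a single pass that computes each photon's bin index directly, instead of re-scanning records per pixel. A rough sketch, assuming the absTime, valid, lineStartTime, pixelduration and pixelCount values defined in the function further down:

int[] linePixels = new int[pixelCount];
for (int i = 0; i < absTime.Length; i++)
{
    if (valid[i] != 1)
        continue;                                   // skip sync/overflow records

    long bin = (absTime[i] - lineStartTime) / pixelduration;
    if (bin >= 0 && bin < pixelCount)
        linePixels[(int)bin]++;                     // one increment per photon
}

This touches every record exactly once, so the cost per line is proportional to the number of records rather than pixels times records.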
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
    // Will hold the pixels obviously
    int[] linePixels = new int[pixelCount];

    // Calculate everything (sync, overflow, ...) from the raw records
    int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
    int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
    int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
    int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();

    int[] absTime = new int[overflow.Length];
    absTime[0] = 0;
    Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
    absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();

    long lineStartTime = absTime[0];
    int tempIdx = 0;

    for (int j = 0; j < linePixels.Length; j++)
    {
        int count = 0;
        for (int i = tempIdx; i < someRecords.Length; i++)
        {
            if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
            {
                count++;
            }
        }
        // Avoid checking records in the raw data that were already binned to a pixel.
        linePixels[j] = count;
        tempIdx += count;
    }
    return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
    int count = 0;
    lineStartTime += pixelduration;

    for (int i = tempIdx; i < absTime.Length; i++)
    {
        //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
        // Seems quicker to calculate the boundary before...
        //if (valid[i] == 1 && lineStartTime >= absTime[i] )
        // Quicker still...
        if (lineStartTime > absTime[i] && valid[i] == 1)
        {
            // Slow... looking into linePixels[] each iteration is a bad idea.
            //linePixels[j]++;
            count++;
        }
    }
    // Doing it here is faster.
    linePixels[j] = count;
    tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now my programming has dealt with rather trivial things, so I lack some experience, but these are things I think I might consider:
Matlab is/seems very efficient with vectorized operations. Could I achieve similar things in C#, e.g. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM, or should I definitely move to real HW? (See the sketch after this list.)
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
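On the SIMD point: the released package for this kind of thing is System.Numerics.Vectors (Microsoft.Bcl.Simd was, as far as I know, an early preview of it), and the hot counting comparison maps onto it fairly directly. A sketch under that assumption, counting the records below a pixel boundary; whether the vectorized code paths actually light up inside a VM is something only a benchmark will tell:

using System.Numerics;

// Counts records with absTime[i] < boundary and valid[i] == 1 in the range [start, end).
static int CountInPixel(int[] absTime, int[] valid, int start, int end, int boundary)
{
    int count = 0;
    int width = Vector<int>.Count;
    var boundaryVec = new Vector<int>(boundary);

    int i = start;
    for (; i <= end - width; i += width)
    {
        var t = new Vector<int>(absTime, i);
        var v = new Vector<int>(valid, i);

        // Comparison lanes hold -1 when the condition is true, 0 otherwise.
        var hit = Vector.BitwiseAnd(Vector.LessThan(t, boundaryVec),
                                    Vector.Equals(v, Vector<int>.One));

        count -= Vector.Dot(hit, Vector<int>.One);   // each matching lane contributes -1
    }

    for (; i < end; i++)                             // scalar tail
        if (absTime[i] < boundary && valid[i] == 1)
            count++;

    return count;
}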
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo; I am still in the quick-and-dirty testing stage... Nonetheless, criticism is welcome if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
    //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
    // Seems quicker to calculate the boundary before...
    //if (valid[i] == 1 && lineStartTime >= absTime[i] )
    // Quicker still...
    if (lineStartTime > absTime[i] && valid[i] == 1)
    {
        // Slow... looking into linePixels[] each iteration is a bad idea.
        //linePixels[j]++;
        count++;
    }
    else
    {
        break;
    }
}
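Taking the sortedness one step further: because absTime is ordered, the index of the first record at or past the current pixel boundary can be found with a binary search, so each pixel costs a logarithmic lookup plus the records that actually fall inside it. A sketch with a hand-rolled lower-bound helper (my naming, not code from the repo), used inside the per-pixel loop in place of the linear scan:

// First index in [lo, hi) whose absTime is >= boundary (absTime must be sorted).
static int LowerBound(int[] absTime, int lo, int hi, long boundary)
{
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (absTime[mid] < boundary) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Inside the pixel loop:
int next = LowerBound(absTime, tempIdx, absTime.Length, lineStartTime);
int count = 0;
for (int i = tempIdx; i < next; i++)     // only the records inside this pixel
    if (valid[i] == 1)
        count++;
linePixels[j] = count;
tempIdx = next;                          // the next pixel starts where this one ended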
I have nested for loops as follows:
// This loop cannot be parallel because results of the next
// two loops will be used for the next t
for (int t = 0; t < 6000000; t++)
    // Calculations in the following two loops account for 75% of all time spent.
    for (int i = 0; i < 1000; i++)
    {
        for (int j = 0; j < 1000; j++)
        {
            if (Vxcal) { V.x = ............. // some calculations }
            if (Vycal) { V.y = ............. // some calculations }
            // Vbar is a two-dimensional array
            Vbar = V;
        }
    }
I changed the above code to:
// This loop cannot be parallel because results of the next
// two loops will be used for the next t
for (int t = 0; t < 6000000; t++)
    // Calculations in the following two loops account for 75% of all time spent.
    Parallel.For(0, 1000, i =>
    {
        Parallel.For(0, 1000, j =>
        {
            if (Vxcal) { V.x = ............. // some calculations }
            if (Vycal) { V.y = ............. // some calculations }
            // Vbar is a two-dimensional array
            Vbar = V;
        });
    });
When I run the code, the results are not correct and it takes hours instead of 10 minutes. My questions are:
Are these kinds of for loops suitable for Parallel.For?
These loops just have some mathematical calculations.
How can I make these parallel for loops safe?
I found the lock keyword, which can help me have a safe loop, but which part of this loop is unsafe?
I see you modified your question to put in some other numbers. So your inner loop is now executed 6,000,000 * 1,000 * 1,000, or 6,000,000,000,000 times. At 4 billion calculations per second (which your computer can't do), that's going to take 1,500 seconds, or 25 minutes. If you get perfect parallelism with Parallel.For, you can cut that down to 6.25 minutes on a 4-core machine. If your calculations are long and complicated, there's no surprise that it takes hours to complete.
Your code is slow because it's doing a lot of work. You need a better algorithm!
Original answer
Consider your nested loops:
for (int t = 0; t < 5000000000; t++)
    // Calculations in the following two loops includes 75% of all time spend for.
    for (int i = 0; i < 5000000000; i++)
    {
        for (int j = 0; j < 5000000000; j++)
The inner loop (your calculation) is being executed 5,000,000,000*5,000,000,000*5,000,000,000 times. 125,000,000,000,000,000,000,000,000,000 is a huge number. Even if your computer could do that loop 4 billion times per second (it can't--not even close), it would take 31,250,000,000,000,000,000 seconds, or about 990 billion years to complete. Using multiple threads on a four-core machine could cut that down to only 250 billion years.
I don't know what you're trying to do, but you'll need a much better algorithm, a computer that's about 500 billion times faster than the one you have, or several hundred billion processors if you want it to finish in your lifetime.
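As for the safety part of the question: the shared V (and the unindexed write to Vbar) is the likely source of the wrong results, because every parallel iteration reads and writes the same variable. If the work is parallelized at all, the usual pattern is to parallelize only the outer of the two inner loops and keep the per-cell value local. A hedged sketch, with placeholder types and calculations standing in for the question's V, Vbar, Vxcal and Vycal:

using System.Threading.Tasks;

// Placeholders standing in for the question's V, Vbar, Vxcal and Vycal.
struct Vec { public double x, y; }
static Vec[,] Vbar = new Vec[1000, 1000];
static bool Vxcal = true, Vycal = true;

static void Step()
{
    // The outer time loop stays sequential; only one level of Parallel.For inside it.
    Parallel.For(0, 1000, i =>
    {
        for (int j = 0; j < 1000; j++)      // keep the inner loop sequential
        {
            Vec v = Vbar[i, j];             // local copy: nothing is shared between threads
            if (Vxcal) { v.x = i * 0.5; }   // placeholder calculation
            if (Vycal) { v.y = j * 0.5; }
            Vbar[i, j] = v;                 // each (i, j) writes only its own slot
        }
    });
}

Nesting a second Parallel.For inside the first mostly adds scheduling overhead, since the outer level already keeps every core busy; no lock is needed as long as each iteration touches only its own element of Vbar.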
I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consist of about 20k vectors, each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();

    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }

    return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this Stack Overflow discussion about it, but it is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here), which may be hitting you in the Parallel.For callback.
Similarly, there is a small performance penalty for indexing through an array via its IList interface. You might consider declaring the arguments to GetDistance() explicitly as arrays.
How large is K compared to the size of the training array? You are completely sorting the distances array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the largest element when the set size exceeds K (see the sketch below).
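On that last point, here is a sketch of the bounded-selection idea (hypothetical helper, not from the post): keep at most K (distance, index) pairs in a SortedSet and evict the current worst whenever it grows too large.

// Returns the K smallest (distance, index) pairs without sorting the whole array.
static List<Tuple<double, int>> TopK(double[][] distances, int k)
{
    // Tuples compare by distance first, then index, so equal distances don't collide.
    var best = new SortedSet<Tuple<double, int>>();
    foreach (var d in distances)
    {
        best.Add(Tuple.Create(d[0], (int)d[1]));
        if (best.Count > k)
            best.Remove(best.Max);   // evict the current worst (largest-distance) candidate
    }
    return best.ToList();
}

The OrderBy(...).Take(K) line could then become TopK(distances, K), with the voting loop reading Item1/Item2 instead of voter[0]/voter[1]; the per-sample cost drops from an O(n log n) sort to roughly O(n log K), which matters when K is small compared to 20,000 training samples.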
I'm writing an N-Body simulation, and for computational simplification I've divided the whole space into a number of uniformly-sized regions.
For each body, I compute the force of all other bodies in the same region, and for the other regions I aggregate the mass and distances together so there's less work to be done.
I have a List<Region> and Region defines public void Index() which sums the total mass at this iteration.
I have two variants of my Space.Tick() function:
public void Tick()
{
    foreach (Region r in Regions)
        r.Index();
}
This is very quick. For 20x20x20 = 8000 regions with 100 bodies each = 800000 bodies in total, it only takes about 0.1 seconds to do this. The CPU graph shows 25% utilisation on my quad-core, which is exactly what I would expect.
Now I write this multi-threaded variant:
public void Tick()
{
    Thread[] threads = new Thread[Environment.ProcessorCount];

    foreach (Region r in Regions)
        while (true)
        {
            bool queued = false;
            for (int i = 0; i < threads.Length; i++)
                if (threads[i] == null || !threads[i].IsAlive)
                {
                    Region s = r;
                    threads[i] = new Thread(s.Index);
                    threads[i].Start();
                    queued = true;
                    break;
                }
            if (queued)
                break;
        }
}
So a quick explanation in case it's not obvious: threads is an array of 4, in the case of my CPU. It starts off being 4xnull. For each region, I loop through all 4 Thread objects (which could be null). When I find one that's either null or isn't IsAlive, I queue up the Index() of that Region and Start() it. I set queued to true so that I can tell that the region has started indexing.
This code takes about 7 seconds. That's 70x slower. I understand that there's a bit of overhead involved with setting up the threads, finding a thread that's vacant, etc. But I would still expect that I would have at least some sort of performance gain.
What am I doing wrong?
Why not try PLINQ?
Regions.AsParallel().ForAll(x=>x.Index());
PLINQ is usually SUPER fast for me, and it scales depending on your environment. If it shouldn't be parallel, it runs single-threaded.
So, if you had to have a multidimensional array come into the function, you could just do this:
Regions.AsParallel().Cast<Region>().ForAll(x=>x.Index());
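Parallel.ForEach from the Task Parallel Library would behave much the same way; unlike the hand-rolled version above, it schedules the work onto pooled threads instead of creating (and polling) a new Thread per region, which is most likely where the 7 seconds go:

Parallel.ForEach(Regions, r => r.Index());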
I manually adjust the thread count:
if (items.Count == 0) { threads = 0; }
else if (items.Count < 1 * hundred) { threads = 1; }
else if (items.Count < 3 * hundred) { threads = 2; }
else if (items.Count < 5 * hundred) { threads = 4; }
else if (items.Count < 10 * hundred) { threads = 8; }
else if (items.Count < 20 * hundred) { threads = 11; }
else if (items.Count < 30 * hundred) { threads = 15; }
else if (items.Count < 50 * hundred) { threads = 30; }
else threads = 40;
I need a function that returns the necessary/optimized thread count.
Ok, now forget the above. I need a curve to plot: I give the coordinates, and a function plots the curve. Imagine the points (0,0) and (5,5) in (x,y) form. That should be a straight line, so I can then measure x for y = 3.
What happens if I give the points (0,0), (2,3), (8,10), (15,30) and (30,50)? It will be something like a curve. Can I then calculate x for a given y, or vice versa?
I think you get the idea. Should I use Matlab, or could it be done in C#?
You're looking for curve fitting, or the derivation of a function describing a curve from a set of data points. If you're looking to do this once, from a constant set of data, Matlab would do the job just fine. If you want to do this dynamically, there are libraries and algorithms out there.
Review the Wikipedia article on linear regression. The least squares approach mentioned in that article is pretty common. Look around, and you'll find libraries and code samples using that approach.
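For the dynamic case, the least-squares fit of a straight line y = a*x + b is only a few lines of C#; a self-contained sketch (the helper name is mine):

// Least-squares fit of y = a*x + b through the given sample points.
static void FitLine(double[] x, double[] y, out double a, out double b)
{
    int n = x.Length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++)
    {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    b = (sy - a * sx) / n;
}

Feeding it the item-count / thread-count pairs from the if-chain above yields roughly the kind of slope and intercept that a later answer plugs into Math.Ceiling.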
You can probably make that run faster by reordering the tests (and using nested ifs), but that's not a smooth function, and there's not likely to be any simpler description of it.
Or are you trying to find a smooth function that passes near those points?
You could use a linear regression; fitting a straight line to your sample points gives roughly threads = 0.0056 * count + 0.5.
So I would probably encode it in C# like this:
int threads = (int) Math.Ceiling(0.0056*items.Count + 0.5);
I used Math.Ceiling to ensure that you don’t get 0 when the input isn’t 0. Of course, this function gives you 1 even if the input is 0; if that matters, you can always catch that as a special case, or use Math.Round instead.
However, this means the number of threads will go up continuously. It will not level out at 40. If that’s what you want, you might need to research different kinds of regression.
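If leveling out at 40 is all that's needed, a simpler option than a different kind of regression would be to clamp the fitted line (same assumed coefficients as above):

int threads = Math.Min(40, (int)Math.Ceiling(0.0056 * items.Count + 0.5));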