Summation algorithm for a parallel program - C#

I am trying to write a parallel algorithm to be three times faster than a sequential algorithm that does essentially the same thing. Please see the pastebin.
http://pastebin.com/3DDyxfPP
Pasted:
Hello everyone. I'm doing an assignment for class and have the majority of it done, however I am having some problems with the math. I am trying to calculate the expression:
100000000
   ∑   (9999999/10000000)^i * i^2
 i = 1
That is, i runs from 1 to 100 million, and 9999999/10000000 = 0.9999999.
A fast sequential algorithm is given:
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 100000000; i++)
{
sum += (fact1 * i * i);
fact1 *= 0.9999999;
}
We are supposed to implement it and verify that it works, as well as time it in release mode. I already have this done and working properly. The time is then displayed on the console.
DateTime t = DateTime.Now;
long saveticks = t.Ticks;
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 100000000; i++)
{
sum += (fact1 * i * i);
fact1 *= 0.9999999;
}
t = DateTime.Now;
We then have to write a timed parallel algorithm that will beat the time, and are supposed to model it after an example parallel program. It must be at least 3 times faster than the sequential algorithm. We are to use 4 processing elements for the parallel program.
There is a hint, "After you figure out the work each processing element will do, you may need to start off the processing element with the time consuming Pow function".
for example:
Math.Pow(x,y)
"Don't use the Pow function on each iteration in the parallel code, because it won't beat the time."
Here is my code for the parallel program. This does both the sequential algorithm and the parallel one and times them both.
const int numPEs = 4;
const int size = 100000000;
static double pSum;
static int numThreadsDone;
static int nextid;
static object locker1 = new object();
static object locker2 = new object();
static long psaveticks;
static DateTime pt;
static void Main(string[] args)
{
DateTime t = DateTime.Now;
long saveticks = t.Ticks;
double sum = 0.0;
double fact1 = 0.9999999;
for (int i = 1; i <= 100000000; i++)
{
sum += (fact1 * (i * i));
fact1 *= 0.9999999;
}
t = DateTime.Now;
Console.WriteLine("sequential: " + ((t.Ticks - saveticks) / 100000000.0) + " seconds");
Console.WriteLine("sum is " + sum);
// time it
pt = DateTime.Now;
psaveticks = pt.Ticks;
for (int i = 0; i < numPEs; i++)
new Thread(countThreads).Start();
Console.ReadKey();
}
static void countThreads()
{
int id;
double localcount = 0;
lock (locker1)
{
id = nextid;
nextid++;
}
// assumes array is evenly divisible by the number of threads
int granularity = size / numPEs;
int start = granularity * id;
for (int i = start; i < start + granularity; i++)
localcount += (Math.Pow(0.9999999, i) * (i * i));
lock (locker2)
{
pSum += localcount;
numThreadsDone++;
if (numThreadsDone == numPEs)
{
pt = DateTime.Now;
Console.WriteLine("parallel: " + ((pt.Ticks - psaveticks) / 10000000.0) + " seconds");
Console.WriteLine("parallel count is " + pSum);
}
}
}
My problem is that my sequential program is way faster than the parallel one. There has got to be a problem with the algorithm I'm using.
Can anyone help?

Console.WriteLine("sequential: " + ((t.Ticks - saveticks) / 100000000.0) + " seconds");
There are 10,000,000 ticks in one second. In the above line, you're dividing by an extra order of magnitude, 100,000,000, making your sequential execution appear to be 10 times faster than it actually is. To avoid these errors, use the appropriate fields from the .NET Framework itself; in this case, TimeSpan.TicksPerSecond.
The main reason you're getting a slow-down is that your parallel code is much more computationally demanding than your sequential code.
// Inner loop of sequential code:
sum += (fact1 * (i * i));
fact1 *= 0.9999999;
// Inner loop of parallel code:
localcount += (Math.Pow(0.9999999, i) * (i * i));
From a mathematical perspective, you're justified in assuming that exponentiation would be equivalent to repeated multiplication. However, from a computational perspective, the Math.Pow operation is much more expensive than a simple multiplication.
A way of mitigating these expensive Math.Pow calls is to perform the exponentiation just once at the beginning of each thread, and then revert to plain multiplication (as in your sequential case). One more subtlety: in (i * i) the product is evaluated in 32-bit integer arithmetic and silently overflows once i exceeds 46340, so multiply by the double first:
double fact1 = Math.Pow(0.9999999, start + 1);
for (int i = start + 1; i <= start + granularity; i++)
{
localcount += fact1 * i * i; // (fact1 * i) is a double, so the second multiply is also done in double
fact1 *= 0.9999999;
}
On an Intel Core i7, this gives a speedup of around 3x for your problem size.
Obligatory reminders:
Don't use DateTime.Now for measuring brief time intervals. Use the Stopwatch class instead.
Don't take cross-thread time measurements. Wait for your worker threads to complete from your main thread, and take the final reading from there.
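Putting those two reminders together with the per-thread Pow seeding, a corrected skeleton might look like the sketch below. This is illustrative (the class and member names are mine, not from the original code), but it shows the pattern: seed each thread with one Math.Pow call, join all workers before reading the clock, and keep one partial-sum slot per thread so no lock is needed.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public class ParallelSummation
{
    public const int NumPEs = 4;
    public const int Size = 100000000;            // assumes Size % NumPEs == 0
    public static double[] PartialSums = new double[NumPEs];

    public static void Worker(int id)
    {
        int granularity = Size / NumPEs;
        int start = granularity * id + 1;          // sequential version starts at i = 1
        double fact1 = Math.Pow(0.9999999, start); // one Pow call to seed this thread
        double local = 0.0;
        for (int i = start; i < start + granularity; i++)
        {
            local += fact1 * i * i;                // (fact1 * i) is a double, so no int overflow
            fact1 *= 0.9999999;
        }
        PartialSums[id] = local;                   // one slot per thread, so no lock needed
    }

    public static void Main()
    {
        var sw = Stopwatch.StartNew();             // Stopwatch, not DateTime.Now
        var threads = new Thread[NumPEs];
        for (int id = 0; id < NumPEs; id++)
        {
            int myId = id;                          // capture the loop variable for the closure
            threads[id] = new Thread(() => Worker(myId));
            threads[id].Start();
        }
        foreach (var t in threads) t.Join();       // read the clock only after ALL workers finish
        sw.Stop();

        double sum = 0.0;
        foreach (double p in PartialSums) sum += p;
        Console.WriteLine("parallel: " + sw.Elapsed.TotalSeconds + " seconds, sum = " + sum);
    }
}
```

Note that i * i never happens in pure int arithmetic here: with i up to 100,000,000 the product of two ints overflows, so the double multiply comes first.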

Related

Is it possible to multiply two arrays as a single command for code performance?

Given the following code:
public float[] weights;
public void Input(Neuron[] neurons)
{
float output = 0;
for (int i = 0; i < neurons.Length; i++)
output += neurons[i].input * weights[i];
}
Is it possible to perform all the calculations in a single operation? For example, that would be neurons[0].input * weights[0] + neurons[1].input * weights[1] + ...
Coming from this topic - How to sum up an array of integers in C# - there is a way to do it for simpler calculations, but the idea of my code is to iterate over the first array, multiply each element by the element at the same index in the second array, and add that to a running total.
Profiling shows that the line where the output is summed is very hot and consumes 99% of my processing time. The stack should have enough memory for this, and I am not worried about stack overflow; I just want to see it run faster for the moment (even if accuracy is sacrificed).
I think you are looking for AVX-style SIMD in C#, so you can actually calculate several values in one instruction. That's SIMD for CPU cores; in .NET it is exposed through the System.Numerics.Vector<T> types.
Here is an example from the website:
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
var simdLength = Vector<int>.Count;
var result = new int[lhs.Length];
var i = 0;
for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
{
var va = new Vector<int>(lhs, i);
var vb = new Vector<int>(rhs, i);
(va + vb).CopyTo(result, i);
}
for (; i < lhs.Length; ++i)
{
result[i] = lhs[i] + rhs[i];
}
return result;
}
You can also combine it with the parallelism you already use.
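The example above adds two arrays; the question actually needs a multiply-accumulate (a dot product). Below is a sketch of the same pattern adapted to that case — an adaptation on my part, using the System.Numerics.Vector<T> API (on older frameworks this requires the System.Numerics.Vectors package, and hardware acceleration needs a release build under RyuJIT):

```csharp
using System;
using System.Numerics;

public static class DotProduct
{
    // Computes sum(a[i] * b[i]) using SIMD where possible, scalar for the tail.
    public static float Simd(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("length mismatch");
        int simdLength = Vector<float>.Count;
        var acc = Vector<float>.Zero;
        int i = 0;
        for (; i <= a.Length - simdLength; i += simdLength)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            acc += va * vb;                              // element-wise multiply, accumulate per lane
        }
        float sum = Vector.Dot(acc, Vector<float>.One);  // horizontal sum of the accumulator lanes
        for (; i < a.Length; i++)                        // remaining elements, if any
            sum += a[i] * b[i];
        return sum;
    }
}
```

For the Neuron case you would first copy the neurons[i].input values into a plain float[], so the Vector<float>(array, index) constructor can load them contiguously.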

How many cycles to multiply a float in C#

I have a numerically intensive application, and after looking up GFLOPS figures on the internet I decided to do my own little benchmark. I just did a single-threaded matrix multiplication thousands of times to get about a second of execution. This is the inner loop:
for (int i = 0; i < SIZEA; i++)
for (int j = 0; j < SIZEB; j++)
vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];
It's been years since I dealt with FLOPS, so I expected to get something around 3 to 6 cycles per FLOP, but I am getting 30 (100 MFLOPS). Surely if I parallelize this I will get more, but I just did not expect that. Could this be a problem with .NET, or is this really the CPU performance?
Here is a fiddle with the full benchmark code.
EDIT: Visual Studio takes longer to run it even in release mode; the executable by itself runs at 12 cycles per FLOP (250 MFLOPS). Still, is there any VM impact?
Your benchmark doesn't really measure FLOPS; it does some floating point operations and looping in C#.
However, even if you isolate your code to a repetition of just floating point operations, you still have some problems.
Your code should include some "pre-cycles" to allow the JIT to warm up, so that you are not measuring compile time.
Even if you do that, you need to compile in release mode with optimizations on and execute your test from the command line on a known, consistent platform.
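For example, with the classic .NET Framework toolchain that might look like the following (command names assumed; with the modern SDK, dotnet build -c Release plays the same role):

```shell
csc /optimize+ /debug- Bench.cs
Bench.exe
```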
Fiddle here
Here is my alternative benchmark,
using System;
using System.Linq;
using System.Diagnostics;
class Program
{
static void Main()
{
const int Flops = 10000000;
var random = new Random();
var output = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var left = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var right = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
for (var i = 0; i < Flops - 1; i++)
{
output[i] = random.NextDouble();
}
timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
Console.WriteLine(
"MFLOPS: {0}",
(double)Flops / timer.ElapsedMilliseconds / 1000.0);
}
}
On my VM I get results like
ms: 73
MFLOPS: 136.986301...
Note, I had to increase the number of operations significantly to get over 1 millisecond.

Why is Console.WriteLine speeding up my application?

Ok so this is kind of weird. I have an algorithm to find the highest possible numerical palindrome that is a multiple of two factors who each have K digits.
The method I'm using to find the highest valid palindrome is to look at the highest possible palindrome for the number set (i.e. if k=3, the highest possible is 999999, then 998899, etc). Then I check if that palindrome has two factors with K digits.
For debugging, I thought it would be a good idea to print each palindrome I was checking to the console (to make sure I was getting them all). To my surprise, adding
Console.WriteLine(palindrome.ToString());
to each iteration of finding a palindrome dropped my runtime a whopping 10 seconds from ~24 to ~14.
To verify, I ran the program several times, then commented out the Console command and ran that several times, and every time it was shorter with the Console command.
This just seems weird, any ideas?
Here's the source if anyone wants to take a whack at it:
static double GetHighestPalindromeBench(int k)
{
//Because the result of k == 1 is a known quantity, and results in aberrant behavior in the algorithm, handle as separate case
if (k == 1)
{
return 9;
}
/////////////////////////////////////
//These variables will be used in HasKDigitFactors(), no need to reprocess them each time the function is called
double kTotalSpace = 10;
for (int i = 1; i < k; i++)
{
kTotalSpace *= 10;
}
double digitConstant = kTotalSpace; //digitConstant is used in HasKDigits() to determine if a factor has the right number of digits
double kFloor = kTotalSpace / 10; //kFloor is the lowest number that has k digits (e.g. k = 5, kFloor = 10000)
double digitConstantFloor = kFloor - digitConstant; //also used in HasKDigits()
kTotalSpace--; //kTotalSpace is the highest number that has k digits (e.g. k = 5, kTotalSpace = 99999)
/////////////////////////////////////////
double totalSpace = 10;
double halfSpace = 10;
int reversionConstant = k;
for (int i = 1; i < k * 2; i++)
{
totalSpace *= 10;
}
double floor = totalSpace / 100;
totalSpace--;
for (int i = 1; i < k; i++)
{
halfSpace *= 10;
}
double halfSpaceFloor = halfSpace / 10; //10000
double halfSpaceStart = halfSpace - 1; //99999
for (double i = halfSpaceStart; i > halfSpaceFloor; i--)
{
double value = i;
double palindrome = i;
//First generate the full palindrome
for (int j = 0; j < reversionConstant; j++)
{
int digit = (int)value % 10;
palindrome = palindrome * 10 + digit;
value = value / 10;
}
Console.WriteLine(palindrome.ToString());
//palindrome should be ready
//Now we check the factors of the palindrome to see if they match k
//We only need to check possible factors between our k floor and ceiling, other factors do not solve
if (HasKDigitFactors(palindrome, kTotalSpace, digitConstant, kFloor, digitConstantFloor))
{
return palindrome;
}
}
return 0;
}
static bool HasKDigitFactors(double palindrome, double totalSpace, double digitConstant, double floor, double digitConstantFloor)
{
for (double i = floor; i <= totalSpace; i++)
{
if (palindrome % i == 0)
{
double factor = palindrome / i;
if (HasKDigits(factor, digitConstant, digitConstantFloor))
{
return true;
}
}
}
return false;
}
static bool HasKDigits(double value, double digitConstant, double digitConstantFloor)
{
//if (Math.Floor(Math.Log10(value) + 1) == k)
//{
// return true;
//}
if (value - digitConstant > digitConstantFloor && value - digitConstant < 0)
{
return true;
}
return false;
}
Note that I have the Math.Floor operation in HasKDigits commented out. This all started when I was trying to determine if my digit check operation was faster than the Math.Floor operation. Thanks!
EDIT: Function call
I'm using StopWatch to measure processing time. I also used a physical stopwatch to verify the results of StopWatch.
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
double palindrome = GetHighestPalindromeBench(6);
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}:{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
Console.WriteLine();
Console.WriteLine(palindrome.ToString());
Console.WriteLine();
Console.WriteLine(elapsedTime);
I have tested your code. My system is an i7-3770 3.40 GHz, quad-core with hyperthreading, so 8 cores available.
As a Debug build, with and without the Console.WriteLine statement (commented out or not), the times vary from about 8.7 to 9.8 sec. As a Release build it comes down to about 6.8-7.0 sec either way. The figures were the same inside VS and from the command line. So your observation is not reproduced.
On the performance monitor, with no console output, I see one core at 100%, though the load switches between cores 1, 4, 5 and 8. With console output there is activity on other cores as well. Max CPU usage never exceeds 18%.
In my judgment your figure with console output is probably consistent with mine, and represents the true value. So your question should read: why is your system so slow when it's not doing console output?
The answer is: because there is something different about your computer or your project which we don't know about. I've never seen this before, but something is soaking up cycles and you should be able to find out what it is.
I've written this as an answer although it isn't really an answer. If you get more facts and update your question, hopefully I can provide a better answer.

How to limit the number of cycles of a loop under some condition?

I make a loop like this :
int total;
total = ((toVal - fromVal) + 1) * 2;
RadProgressContext progress = RadProgressContext.Current;
progress.Speed = "N/A";
finYear = fromVal;
for (int i = 0; i < total; i++)
{
decimal ratio = (i * 100 / total);
progress.PrimaryTotal = total;
progress.PrimaryValue = total;
progress.PrimaryPercent = 100;
progress.SecondaryTotal = 100; // total;
progress.SecondaryValue = ratio;//i ;
progress.SecondaryPercent = ratio; //i;
progress.CurrentOperationText = "Step " + i.ToString();
if (!Response.IsClientConnected)
{
//Cancel button was clicked or the browser was closed, so stop processing
break;
}
progress.TimeEstimated = (total - i) * 100;
//Stall the current thread for 0.1 seconds
System.Threading.Thread.Sleep(100);
}
Now I want to run a specific method, according to toVal & fromVal, inside the previous loop, but not with the same number of cycles.
I want to run it in a loop like this:
for (fromVal; fromVal < toVal ; fromVal++)
{
PrepareNewEmployees(calcYear, fromVal);
}
for example :
fromVal = 2014
toVal = 2015
so I want it to run twice, not 4 times, like this:
PrepareNewEmployees(calcYear, 2014);
PrepareNewEmployees(calcYear, 2015);
but inside the previous loop, for (int i = 0; i < total; i++).
You're missing the point of progress bar updating. You're not supposed to run 4 iterations and do some work every 2 iterations, but the opposite. Do a loop like:
for (int i = fromVal; i <= toVal; i++)
{
PrepareNewEmployees(...);
double ratio = ((double)(toVal - i) / (toVal - fromVal)) * 100;
// Some other things that need to be done in each iteration
}
Because you are using Thread's already, consider to implement following:
public void ResetProgress()
{
SetProgress(0);
}
public void SetProgress(int percents)
{
// set progress bar to a given percents/ratio
// you will have to use Invoke and blablabla
}
Then any of your jobs will look like this:
ResetProgress();
// note: you need to remember from which value you start to be able to calculate progress
for (int i = startVal; i < toVal ; i++)
{
PrepareNewEmployees(calcYear, i);
SetProgress(100 * (i - startVal) / (toVal - startVal)); // in percents [0-100]
}
// optional, required if you exit loop or use suggestion below
SetProgress(100);
You can also optimize it so that it does not update the progress after every step, but only after a certain number of steps. For example, instead of calling SetProgress every time you do
if(i % 10 == 0)
SetProgress();
This calls SetProgress ten times less often. Of course, there are some assumptions, like: i starts from 0, and if you want to see 100% at the end, the number of steps should be divisible by 10. Just an idea to start from.
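Since the index-to-percent mapping is easy to get wrong, it may be worth pulling it into a small helper. A minimal sketch (the names are mine, not from the code above):

```csharp
using System;

public static class Progress
{
    // Map a loop index in [startVal, toVal] to a percentage in [0, 100].
    public static int Percent(int i, int startVal, int toVal)
    {
        if (toVal <= startVal) return 100;                 // degenerate range: report done
        return 100 * (i - startVal) / (toVal - startVal);  // integer math is fine at this resolution
    }
}
```

With the asker's example, Percent(2014, 2014, 2015) gives 0 and Percent(2015, 2014, 2015) gives 100.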

Math.Log vs multiplication complexity in terms of geometric average which is better?

I want to find the geometric average of some data, and performance matters.
Which one should I pick?
1. Keep the multiplication running in a single variable and take the Nth root at the end of the calculation:
X = MUL(x[i])^(1/N)
That is, O(N) multiplications + O(1) Nth root.
2. Use logarithms:
X = e^( (1/N) * SUM(log(x[i])) )
That is, O(N) logarithms + O(1) division + O(1) exponential.
3. A specialized algorithm for the geometric average. Please tell me if there is one.
I thought I would try to benchmark this and get a comparison, here is my attempt.
Comparing was difficult since the list of numbers needed to be large enough to make timing it reasonable, so N is large. In my test N = 50,000,000 elements.
However, multiplying together lots of numbers that are greater than 1 overflows the double storing the product, while multiplying together numbers less than 1 gives a product so small that taking the Nth root just returns zero.
Just a couple more things: make sure none of your elements are zero, and note that the log approach doesn't work for negative elements.
(The multiply would work without overflow if C# had a BigDecimal class with an Nth-root function.)
Anyway, in my code each element is between 1 and 1.00001, so neither approach runs into trouble.
The log approach, on the other hand, had no problems with overflow or underflow at all.
Here's the code:
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Starting...");
Console.WriteLine("");
Stopwatch watch1 = new Stopwatch();
Stopwatch watch2 = new Stopwatch();
List<double> list = getList();
double prod = 1;
double mean1 = -1;
for (int c = 0; c < 2; c++)
{
watch1.Reset();
watch1.Start();
prod = 1;
foreach (double d in list)
{
prod *= d;
}
mean1 = Math.Pow(prod, 1.0 / (double)list.Count);
watch1.Stop();
}
double mean2 = -1;
for (int c = 0; c < 2; c++)
{
watch2.Reset();
watch2.Start();
double sum = 0;
foreach (double d in list)
{
double logged = Math.Log(d, 2);
sum += logged;
}
sum *= 1.0 / (double)list.Count;
mean2 = Math.Pow(2.0, sum);
watch2.Stop();
}
Console.WriteLine("First way gave: " + mean1);
Console.WriteLine("Other way gave: " + mean2);
Console.WriteLine("First way took: " + watch1.ElapsedMilliseconds + " milliseconds.");
Console.WriteLine("Other way took: " + watch2.ElapsedMilliseconds + " milliseconds.");
Console.WriteLine("");
Console.WriteLine("Press enter to exit");
Console.ReadLine();
}
private static List<double> getList()
{
List<double> result = new List<double>();
Random rand = new Random();
for (int i = 0; i < 50000000; i++)
{
result.Add( rand.NextDouble() / 100000.0 + 1);
}
return result;
}
}
My computer's output shows that both geometric means are the same, but:
Multiply way took: 466 milliseconds
Logarithm way took: 3245 milliseconds
So, the multiply appears to be faster.
But multiply is very problematic with overflow and underflow, so I would recommend the Log approach, unless you can guarantee the product won't overflow and that the product won't get too close to zero.
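If you want to keep the fast multiply while avoiding the overflow/underflow problem, one compromise — my own sketch, not part of the benchmark above, and assuming all inputs are strictly positive — is to renormalize the running product whenever it drifts far from 1, tracking the stripped-out power of two separately:

```csharp
using System;

public static class GeoMean
{
    // Multiply-based geometric mean that avoids overflow/underflow by
    // periodically pulling the binary exponent out of the running product.
    // Assumes all inputs are strictly positive.
    public static double Compute(double[] xs)
    {
        double mantissa = 1.0;
        long exponent = 0;                               // running power-of-two exponent
        foreach (double x in xs)
        {
            mantissa *= x;
            if (mantissa > 1e150 || mantissa < 1e-150)   // rare: renormalize toward 1
            {
                int e = (int)Math.Floor(Math.Log(mantissa, 2));
                exponent += e;
                mantissa /= Math.Pow(2.0, e);            // dividing by a power of two is exact
            }
        }
        double n = xs.Length;
        // (mantissa * 2^exponent)^(1/n) = mantissa^(1/n) * 2^(exponent/n)
        return Math.Pow(mantissa, 1.0 / n) * Math.Pow(2.0, exponent / n);
    }
}
```

The renormalization branch runs rarely (only when the product leaves [1e-150, 1e150]), so the per-element cost stays close to one multiply, and because dividing by a power of two is exact, no precision is lost in the renormalization itself.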
