I have a Parallel.For and a regular for loop doing some simple arithmetic, just to benchmark Parallel.For
My conclusion is that, the regular for is faster on my i5 notebook processor.
This is my code
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
int Iterations = int.MaxValue / 1000;
DateTime StartTime = DateTime.MinValue;
DateTime EndTime = DateTime.MinValue;
StartTime = DateTime.Now;
Parallel.For(0, Iterations, i =>
{
OperationDoWork(i);
});
EndTime = DateTime.Now;
Console.WriteLine(EndTime.Subtract(StartTime).ToString());
StartTime = DateTime.Now;
for (int i = 0; i < Iterations; i++)
{
OperationDoWork(i);
}
EndTime = DateTime.Now;
Console.WriteLine(EndTime.Subtract(StartTime).ToString());
StartTime = DateTime.Now;
Parallel.For(0, Iterations, i =>
{
OperationDoWork(i);
});
EndTime = DateTime.Now;
Console.WriteLine(EndTime.Subtract(StartTime).ToString());
StartTime = DateTime.Now;
for (int i = 0; i < Iterations; i++)
{
OperationDoWork(i);
}
EndTime = DateTime.Now;
Console.WriteLine(EndTime.Subtract(StartTime).ToString());
}
private static void OperationDoWork(int i)
{
int a = 0;
a += i;
i = a;
a *= 2;
a = a * a;
a = i;
}
}
}
And these are my results. Which on repetition do not change much:
00:00:03.9062234
00:00:01.7971028
00:00:03.2231844
00:00:01.7781017
So why ever use Parallel.For ?
Parallel processing has organization overhead. Think of it in terms of having 100 tasks and 10 people to do them. It's not easy to have 10 people working for you, just organizing who does what costs time in addition to actually doing the 100 tasks.
So if you want to do something in parallel, make sure it's so much work that the workload of organizing the parallelism is so small compared to the actual workload that it makes sense to do it.
One of the most common mistakes one does, when first delving into multithreading, is the belief, that multithreading is a Free Lunch.
In truth, splitting your operation into multiple smaller operations, which can then run in parallel, is going to take some extra time. If badly synchronized, your tasks may well be spending even more time, waiting for other tasks to release their locks.
As a result; parallelizing is not worth the time/trouble, when each task is going to do little work, which is the case with OperationDoWork.
Edit:
Consider trying this out:
private static void OperationDoWork(int i)
{
double a = 101.1D * i;
for (int k = 0; k < 100; k++)
a = Math.Pow(a, a);
}
According to my benchmark, for will average to 5.7 seconds, while Parallel.For will take 3.05 seconds on my Core2Duo CPU (speedup == ~1.87).
On my Quadcore i7, I get an average of 5.1 seconds with for, and an average of 1.38 seconds with Parallel.For (speedup == ~3.7).
This modified code scales very well to the number of physical cores available. Q.E.D.
Related
There is such an array, I know what is needed through Thread, but I don’t understand how to do it. Do you need to split the array into parts, or can you do something right away?
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int[] a = new int[10000];
Random rand = new Random();
for (int i = 0; i < a.Length; i++)
{
a[i] = rand.Next(-100, 100);
}
foreach (var p in a)
Console.WriteLine(p);
TimeSpan ts = stopWatch.Elapsed;
stopWatch.Stop();
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Another approach, compared to John Wu's, is to use a custom partitioner. I think that it is a little more readable.
using System.Collections.Concurrent;
using System.Threading.Tasks;
int[] a = new int[10000];
int batchSize = 1000;
Random rand = new Random();
Parallel.ForEach(Partitioner.Create(0, a.Length, batchSize), range =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
a[i] = rand.Next(-100, 100);
}
});
In modern c#, you should almost never have to use Thread objects themselves-- they are fraught with peril, and there are other language features that will do the job just as well (see async and TPL). I'll show you a way to do it with TPL.
Note: Due to the problem of false sharing, you need to rig things so that the different threads are working on different memory areas. Otherwise you will see no gain in performance-- indeed, performance could get considerably worse. In this example I divide the array into blocks of 4,000 bytes (1,000 elements) each and work on each block in a separate thread.
using System.Threading.Tasks;
var array = new int[10000];
var offsets = Enumerable.Range(0, 10).Select( x => x * 1000 );
Parallel.ForEach( offsets, offset => {
for ( int i=0; i<1000; i++ )
{
array[offset + i] = random.Next( -100,100 );
}
});
That all being said, I doubt you'll see much of a gain in performance in this example-- the array is much too small to be worth the additional overhead.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace Threads
{
class Program
{
static void Main(string[] args)
{
Action<int> TestingDelegate = (x321) => { Console.WriteLine(x321); };
int x123 = Environment.ProcessorCount;
MyParallelFor(0, 8, TestingDelegate);
Console.Read();
}
public static void MyParallelFor(int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body)
{
int size = exclusiveUpperBound - inclusiveLowerBound;
int numProcs = Environment.ProcessorCount;
int range = size / numProcs;
var threads = new List<Task>(numProcs);
for(int p = 0; p < numProcs; p++)
{
int start = p * range + inclusiveLowerBound;
int end = (p == numProcs - 1) ? exclusiveUpperBound : start + range;
Task.Factory.StartNew(() =>
{
for (int i = start; i < end; i++) body(i);
});
}
Task.WaitAll(threads.ToArray());
Console.WriteLine("Done!");
}
}
}
Hi all, I implemented this code from the Patterns of Parallel Programming book and they do it using threads, I decided to rewrite it using the TPL library. The output below is what I get (of course it's random) however... I expect "Done!" to always be printed last. For some reason it is not doing that though. Why is it not blocking?
Done!
1
0
2
6
5
4
3
7
You did not assign any tasks to the threads list on which you are calling WaitAll, your tasks are started independently. you would create tasks and put the tasks in threads collection before you call WaitAll. You can find more how you would add the tasks in tasks list you have created in this MSDN documentation for Task.WaitAll Method (Task[])
You code would be something like
threads.Add(Task.Factory.StartNew(() =>
{
for (int i = 0; i < 10; i++) ;
}));
you are not adding task to your threads collection. So threads collection is empty. So there is no Tasks to wait for. Change code like this
threads.Add(Task.Factory.StartNew(() =>
{
for (int i = start; i < end; i++) body(i);
}));
The reasons is quite simple: You are never adding anything to the threads List. You declare it and allocate space for numProcs entries, but you never call threads.Add.
Therefore the list is still empty and hence the Task.WaitAll doesn't wait on anything.
I have a numeric intensive application and after looking for GFLOPS on the internet, I decided to do my own little benchmark. I just did a single thread matrix multiplication thousands of times to get about a second of execution. This is the inner loop.full
for (int i = 0; i < SIZEA; i++)
for (int j = 0; j < SIZEB; j++)
vector_out[i] = vector_out[i] + vector[j] * matrix[i, j];
It's been years since I dealt with FLOPS, so I expected to get something around 3 to 6 cycles per FLOP. But I am getting 30 (100 MFLOPS), surely if I parallelize this I will get more but I just did not expect that. Could this be a problem with dot NET. or is this really the CPU performance?
Here is a fiddle with the full benchmark code.
EDIT: Visual studio even in release mode takes longer to run, the executable by itself it runs in 12 cycles per FLOP (250 MFLOPS). Still is there any VM impact?
Your bench mark doesn't really measure FLOPS, it does some floating point operations and looping in C#.
However, if you can isolate your code to a repetition of just floating point operations you still have some problems.
Your code should include some "pre-cycles" to allow the "jitter to warm-up", so you are not measuring compile time.
Then, even if you do that,
You need to compile in release mode with optimizations on and execute your test from the commmand-line on a known consistent platform.
Fiddle here
Here is my alternative benchmark,
using System;
using System.Linq;
using System.Diagnostics;
class Program
{
static void Main()
{
const int Flops = 10000000;
var random = new Random();
var output = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var left = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var right = Enumerable.Range(0, Flops)
.Select(i => random.NextDouble())
.ToArray();
var timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
for (var i = 0; i < Flops - 1; i++)
{
output[i] = random.NextDouble();
}
timer = Stopwatch.StartNew();
for (var i = 0; i < Flops - 1; i++)
{
unchecked
{
output[i] += left[i] * right[i];
}
}
timer.Stop();
Console.WriteLine("ms: {0}", timer.ElapsedMilliseconds);
Console.WriteLine(
"MFLOPS: {0}",
(double)Flops / timer.ElapsedMilliseconds / 1000.0);
}
}
On my VM I get results like
ms: 73
MFLOPS: 136.986301...
Note, I had to increase the number of operations significantly to get over 1 millisecond.
I read in books that thread creation is expensive (not so expensive like process creation but nevertheless it is) and we should avoid it. I write the test code and I was shocked how fast thread creation is.
using System;
using System.Diagnostics;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
namespace ConsoleApplication1
{
class Program
{
static int testVal = 0;
static void Main(string[] args)
{
const int ThreadsCount = 10000;
var watch = Stopwatch.StartNew();
for (int i = 0; i < ThreadsCount; i++)
{
var myThread = new Thread(MainVoid);
myThread.Start();
}
watch.Stop();
Console.WriteLine("Test value ={0}", testVal);
Console.WriteLine("Ended in {0} miliseconds", watch.ElapsedMilliseconds);
Console.WriteLine("{0} miliseconds per thread ", (double)watch.ElapsedMilliseconds / ThreadsCount);
}
static void MainVoid()
{
Interlocked.Increment(ref testVal);
}
}
}
Output:
Test value =10000
Ended in 702 miliseconds
0,0702 miliseconds per thread.
Is my code wrong or thread creation is so fast and advices in books are wrong? (I see only some extra memory consumption per thread but no creation time.)
Thread creation is pretty slow. Consider this bit of code that contrasts the speed of doing things inline, and doing things with multiple threads:
private void DoStuff()
{
const int ThreadsCount = 10000;
var sw = Stopwatch.StartNew();
int testVal = 0;
for (int i = 0; i < ThreadsCount; ++i)
{
Interlocked.Increment(ref testVal);
}
sw.Stop();
Console.WriteLine(sw.ElapsedTicks);
sw = Stopwatch.StartNew();
testVal = 0;
for (int i = 0; i < ThreadsCount; ++i)
{
var myThread = new Thread(() =>
{
Interlocked.Increment(ref testVal);
});
myThread.Start();
}
sw.Stop();
Console.WriteLine(sw.ElapsedTicks);
}
On my system, doing it inline requires 200 ticks. With threads it's almost 2 million ticks. So using threads here takes approximately 10,000 times as long. I used ElapsedTicks here rather than ElapsedMilliseconds because with ElapsedMilliseconds the output for the inline code was 0. The threads version takes around 700 milliseconds. Context switches are expensive.
In addition, your test is fundamentally flawed because you don't explicitly wait for all of the threads to finish before harvesting the result. It's quite possible that you could output the value of testVal before the last thread finishes incrementing it.
When timing code, by the way, you should be sure to run it in release mode without the debugger attached. In Visual Studio, use Ctrl+F5 (start without debugging).
I am profiling a C# application and it looks like two threads each calling Dictionary<>.ContainsKey() 5000 time each on two separate but identical dictionaries (with only two items) is twice as slow as one thread calling Dictionary<>.ContainsKey() on a single dictionary 10000 times.
I am measuring the "thread time" using a tool called JetBrains dotTrace. I am explicitly using copies of the same data, so there are no synhronization primitives that I am using. Is it possible that .NET is doing some synchronization behind the scenes?
I have a dual core machine, and there are three threads running: one is blocked using Semaphore.WaitAll() while the work is done on two new threads whose priority is set to ThreadPriority.Highest.
Obvious culprits like, not actually running the code in parallel, and not using a release build has been ruled out.
EDIT:
People want the code. Alright then:
private int ReduceArrayIteration(VM vm, HeronValue[] input, int begin, int cnt)
{
if (cnt <= 1)
return cnt;
int cur = begin;
for (int i=0; i < cnt - 1; i += 2)
{
// The next two calls are effectively dominated by a call
// to dictionary ContainsKey
vm.SetVar(a, input[begin + i]);
vm.SetVar(b, input[begin + i + 1]);
input[cur++] = vm.Eval(expr);
}
if (cnt % 2 == 1)
{
input[cur++] = input[begin + cnt - 1];
}
int r = cur - begin;
Debug.Assert(r >= 1);
Debug.Assert(r < cnt);
return r;
}
// From VM
public void SetVar(string s, HeronValue o)
{
Debug.Assert(o != null);
frames.Peek().SetVar(s, o);
}
// From Frame
public bool SetVar(string s, HeronValue o)
{
for (int i = scopes.Count; i > 0; --i)
{
// Scope is a derived class of Dictionary
Scope tbl = scopes[i - 1];
if (tbl.HasName(s))
{
tbl[s] = o;
return false;
}
}
return false;
}
Now here is the thread spawning code, which might be retarded:
public static class WorkSplitter
{
static WaitHandle[] signals;
public static void ThreadStarter(Object o)
{
Task task = o as Task;
task.Run();
}
public static void SplitWork(List<Task> tasks)
{
signals = new WaitHandle[tasks.Count];
for (int i = 0; i < tasks.Count; ++i)
signals[i] = tasks[i].done;
for (int i = 0; i < tasks.Count; ++i)
{
Thread t = new Thread(ThreadStarter);
t.Priority = ThreadPriority.Highest;
t.Start(tasks[i]);
}
Semaphore.WaitAll(signals);
}
}
Even if there was any locking in Dictionary (there isn't), it could not affect your measurements since each thread is using a separate one. Running this test 10,000 times is not enough to get reliable timing data, ContainsKey() only takes 20 nanoseconds or so. You'll need at least several million times to avoid scheduling artifacts.