Here is a piece of code in C# which applies an operation over each row of a matrix of doubles (suppose 200x200).
for (int i = 0; i < 200; i++)
{
    result = process(row[i]);
    DoSomething(result);
}
process is a static method. I have a Core i5 CPU, Windows XP, and I'm using .NET Framework 3.5. To gain performance, I tried to process each row on a separate thread (using asynchronous delegates), so I rewrote the code as follows:
List<Func<double[], double>> myMethodList = new List<Func<double[], double>>();
List<IAsyncResult> myCookieList = new List<IAsyncResult>();
for (int i = 0; i < 200; i++)
{
    Func<double[], double> myMethod = process;
    IAsyncResult myCookie = myMethod.BeginInvoke(row[i], null, null);
    myMethodList.Add(myMethod);
    myCookieList.Add(myCookie);
}
for (int j = 0; j < 200; j++)
{
    result = myMethodList[j].EndInvoke(myCookieList[j]);
    DoSomething(result);
}
This code is called for 1000 matrices in one run. When I tested it, surprisingly, I didn't get any performance improvement. So this raises the question: in which cases does multi-threading actually benefit performance, and is my code logical?
At first glance, your code looks OK. Maybe the CPU isn't the bottleneck.
Can you confirm that process() and DoSomething() are independent and don't do any I/O or locking for shared resources?
The point here is that you'll have to start measuring.
And of course, .NET 4 with the TPL makes this kind of thing easier to write and usually more efficient.
You could achieve more parallelism (in the result processing, specifically) by calling BeginInvoke with an AsyncCallback - this will do the result processing in a ThreadPool thread, instead of inline as you have it currently.
See the last section of the async programming docs here.
Before you do anything to modify the code, you should profile it to find out where the program is spending its time.
Your code is going a little overboard. Look at the loops: for each of 200 iterations, you are starting a new asynchronous call, which can leave your process with as many as 201 active threads. There is a law of diminishing returns: at roughly double the number of threads as the machine has "execution units" (the number of CPUs, times the number of cores per CPU, times 2 if the cores are hyper-threaded), your computer starts spending more time scheduling threads than running them. A state-of-the-art server with 4 quad-core HT CPUs has about 32 EUs; 200 actively executing threads will make even that server break down and cry.
If the order of processing doesn't matter, I would implement a MergeSort-like algorithm: break the array in half, process the left half, process the right half. Each "left half" can be processed by a new thread, but process the "right half" on the current thread. Then implement some thread-safe means of limiting the thread count to about 1.25 times the number of execution units; if the limit has been reached, continue processing linearly without creating a new thread, as sketched below.
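For illustration, a rough sketch of that idea (a sketch only: the double[][] row layout is assumed from the question, process and DoSomething are the question's methods, and DoSomething must be thread-safe since results now arrive out of order):

static int activeThreads = 1; // the current (main) thread counts as one
static readonly int maxThreads =
    (int)(Environment.ProcessorCount * 1.25); // ~1.25x the execution units

static void ProcessRange(double[][] rows, int start, int count)
{
    if (count == 1)
    {
        DoSomething(process(rows[start])); // leaf: one row
        return;
    }

    int half = count / 2;
    Thread left = null;

    // Spawn a thread for the left half only while we are under the cap.
    if (Interlocked.Increment(ref activeThreads) <= maxThreads)
    {
        left = new Thread(() => ProcessRange(rows, start, half));
        left.Start();
    }
    else
    {
        Interlocked.Decrement(ref activeThreads); // over the cap: process inline
        ProcessRange(rows, start, half);
    }

    ProcessRange(rows, start + half, count - half); // right half on the current thread

    if (left != null)
    {
        left.Join();
        Interlocked.Decrement(ref activeThreads);
    }
}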
It looks like you aren't gaining any performance because of the way you are handling the EndInvoke calls. Since you are calling process via BeginInvoke, those calls return immediately, so the first loop probably finishes in no time at all. However, EndInvoke blocks until the call it belongs to has finished processing, and you are still invoking those sequentially. As Steve said, you should use an AsyncCallback so that each completion event is handled on its own thread.
You are not seeing much gain because you are not actually parallelizing the work. Yes, you are doing async calls, but that just means your loop doesn't wait for one calculation before going to the next step. Use Parallel.For instead of the for loop and see if you get any gain on your multi-core box, as in the example below.
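A minimal sketch, reusing process, row, and DoSomething from the question (this requires .NET 4's TPL, so you would need to upgrade from 3.5):

// DoSomething must be safe to call concurrently, since the calls now overlap.
Parallel.For(0, 200, i =>
{
    double result = process(row[i]);
    DoSomething(result);
});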
If you are going to use async delegates, this would be the way to do it to ensure the callbacks happen on a ThreadPool thread:
// Requires: using System.Runtime.Remoting.Messaging; (for AsyncResult)
static readonly Random randy = new Random(); // placeholder result source (note: Random is not thread-safe)

internal static void Do()
{
    AsyncCallback cb = Complete;
    List<double[]> row = CreateList();
    for (int i = 0; i < 200; i++)
    {
        Func<double[], double> myMethod = Process;
        myMethod.BeginInvoke(row[i], cb, null);
    }
}

static double Process(double[] vals)
{
    // your implementation
    return randy.NextDouble();
}

static void Complete(IAsyncResult token)
{
    // Recover the original delegate so we can call EndInvoke on it.
    Func<double[], double> callBack = (Func<double[], double>)((AsyncResult)token).AsyncDelegate;
    double res = callBack.EndInvoke(token);
    Console.WriteLine("complete res {0}", res);
    DoSomething(res);
}
Related
I posted another SO question here, and as a follow-up, my colleague did a test, seen below, as some form of "counter" to the argument for async/await/Tasks.
(I am aware that the lock on resultList isn't needed, disregard that)
I am aware that async/await and Tasks are not made to handle CPU-intensive work but rather I/O operations that are handled by the OS. The benchmark below is a CPU-intensive task, so the test is flawed from the start.
However, as I understand it, using new Task(...).Start() will schedule the operation on the ThreadPool and execute the test code on different ThreadPool threads. Wouldn't that mean that the first and second tests are more or less the same? (I'm guessing not; please explain.)
Why, then, the big difference between them?
some form of "counter" to the argument for async/await/Tasks.
The posted code has absolutely nothing to do with async or await. It's comparing three different kinds of parallelism:
Dynamic Task Parallelism.
Direct threadpool access.
Manual multithreading with manual partitioning.
The first two are somewhat comparable. Of course, direct threadpool access will be faster than Dynamic Task Parallelism. But what these tests don't show is that direct threadpool access is much harder to do correctly. In particular, when you are running real-world code and need to handle exceptions and return values, you have to add in boilerplate code and object instances to the direct threadpool access code that slows it down.
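To make the boilerplate point concrete, here is a rough sketch of what "direct threadpool access" needs just to collect results and propagate exceptions — things Task does for you (ProcessItem and items are hypothetical names):

var results = new List<string>();
var errors = new List<Exception>();
using (var done = new CountdownEvent(items.Count))
{
    foreach (var item in items)
    {
        ThreadPool.QueueUserWorkItem(state =>
        {
            try
            {
                var r = ProcessItem((string)state); // the actual work
                lock (results) results.Add(r);      // a Task would just return this
            }
            catch (Exception ex)
            {
                lock (errors) errors.Add(ex);       // a Task would capture this for us
            }
            finally
            {
                done.Signal();                      // Task.WaitAll would replace this
            }
        }, item);
    }
    done.Wait(); // block until every work item has signaled
}
if (errors.Count > 0) throw new AggregateException(errors);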
The third one is not comparable at all. It just uses 10 manual threads. Again, this example ignores the additional complexity necessary in real-world code; specifically, the need to handle exceptions and return values. It also assumes a partition size, which is problematic; real-world code does not have that luxury. If you're managing your own set of threads, then you have to decide things like how quickly you should increase the number of threads when the queue has many items, and how quickly you should end threads when the queue is empty. These are all difficult questions that add lots of code to the #3 test before you're really comparing the same thing.
And that's not even to say anything about the cost of maintenance. In my experience (i.e., as an application developer), micro-optimizations are just not worth it. Even if you took the "worst" (#1) approach, you're losing about 7 microseconds per item. That is an unimaginably small amount of savings. As a general rule, developer time is far more valuable to your company than user time. If your users have to process a hundred thousand items, the difference would barely be perceptible. If you were to adopt the "best" (#3) approach, the code would be much less maintainable, particularly considering the boilerplate and thread management code necessary in production code and not shown here. Going with #3 would probably cost your company far more in terms of developer time just writing or reading the code than it would ever save in terms of user time.
Oh, and the funniest part of all this is that with all these different kinds of parallelism compared, they didn't even include the one that is most suitable for this test: PLINQ.
// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.Linq; using System.Threading;
static void Main(string[] args)
{
    TaskParallelLibrary();
    ManualThreads();
    Console.ReadKey();
}

static void ManualThreads()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var resultList = new List<string>();
    var stopwatch = Stopwatch.StartNew();
    var counter = 0;
    for (int i = 0; i != 10; ++i)
    {
        new Thread(() =>
        {
            while (true)
            {
                var t = "";
                lock (queue)
                {
                    if (counter >= queue.Count)
                        break;
                    t = queue[counter];
                    ++counter;
                }
                t = t.Substring(0, 5);
                string t2 = t.Substring(0, 2) + t;
                lock (resultList)
                    resultList.Add(t2);
            }
        }).Start();
    }
    while (resultList.Count < queue.Count)
        Thread.Sleep(1);
    stopwatch.Stop();
    Console.WriteLine($"Manual threads: Processed {resultList.Count} in {stopwatch.Elapsed}");
}

static void TaskParallelLibrary()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var stopwatch = Stopwatch.StartNew();
    var resultList = queue.AsParallel().Select(t =>
    {
        t = t.Substring(0, 5);
        return t.Substring(0, 2) + t;
    }).ToList();
    stopwatch.Stop();
    Console.WriteLine($"Parallel: Processed {resultList.Count} in {stopwatch.Elapsed}");
}
On my machine, after running this code several times, I find that the PLINQ code outperforms the Manual Threads by about 30%. Sample output on .NET Core 3.0 preview5-27626-15, built for Release, run standalone:
Parallel: Processed 1000000 in 00:00:00.3629408
Manual threads: Processed 1000000 in 00:00:00.5119985
And, of course, the PLINQ code is:
Shorter
More maintainable
More robust (handles exceptions and return types)
Less awkward (no need to poll for completion)
More portable (partitions based on number of processors)
More flexible (automatically adjusts the thread pool as necessary based on amount of work)
I have the following code:
var factory = new TaskFactory();
for (int i = 0; i < 100; i++)
{
    var i1 = i;
    factory.StartNew(() => foo(i1));
}

static void foo(int i)
{
    Thread.Sleep(1000);
    Console.WriteLine($"foo{i} - on thread {Thread.CurrentThread.ManagedThreadId}");
}
I can see it only does 4 threads at a time (based on observation). My questions:
What determines the number of threads used at a time?
How can I retrieve this number?
How can I change this number?
P.S. My box has 4 cores.
P.P.S. I needed to have a specific number of tasks (and no more) that are concurrently processed by the TPL and ended up with the following code:
private static int count = 0; // keep track of how many concurrent tasks are running

private static void SemaphoreImplementation()
{
    var s = new Semaphore(20, 20); // allow 20 tasks at a time
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i;
        Task.Factory.StartNew(() =>
        {
            try
            {
                s.WaitOne();
                Interlocked.Increment(ref count);
                foo(i1);
            }
            finally
            {
                s.Release();
                Interlocked.Decrement(ref count);
            }
        }, TaskCreationOptions.LongRunning);
    }
}
static void foo(int i)
{
    Thread.Sleep(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
When you are using a Task in .NET, you are telling the TPL to schedule a piece of work (via TaskScheduler) to be executed on the ThreadPool. Note that the work will be scheduled at its earliest opportunity and however the scheduler sees fit. This means that the TaskScheduler will decide how many threads will be used to run n number of tasks and which task is executed on which thread.
The TPL is very well tuned and continues to adjust its algorithm as it executes your tasks. So, in most cases, it tries to minimize contention. What this means is that if you are running 100 tasks and only have 4 cores (which you can get via Environment.ProcessorCount), it would not make sense to execute more than 4 threads at any given time, since otherwise the pool would need to do more context switching. Now there are times when you want to explicitly override this behaviour, say when you need to wait for some sort of IO to finish, but that is a whole different story.
In summary, trust the TPL. But if you are adamant about spawning a thread per task (not always a good idea!), you can use:
Task.Factory.StartNew(
    () => /* your piece of work */,
    TaskCreationOptions.LongRunning);
This tells the default TaskScheduler to explicitly spawn a new thread for that piece of work.
You can also use your own scheduler and pass it in to the TaskFactory. You can find a whole bunch of schedulers here.
Note that another alternative is PLINQ, which by default analyses your query and decides whether parallelizing it would yield any benefit. In cases such as blocking IO, where you are certain that starting multiple threads will result in better execution, you can force parallelism by using WithExecutionMode(ParallelExecutionMode.ForceParallelism). You can then use WithDegreeOfParallelism to give hints on how many threads to use (a sketch follows the quote below), but remember there is no guarantee you will get that many threads; as MSDN says:
Sets the degree of parallelism to use in a query. Degree of
parallelism is the maximum number of concurrently executing tasks that
will be used to process the query.
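A sketch of how those hints combine, assuming a blocking-IO workload where forcing parallelism is known to help (urls and FetchUrl are hypothetical names):

var pages = urls
    .AsParallel()
    .WithExecutionMode(ParallelExecutionMode.ForceParallelism) // skip PLINQ's "is it worth it?" analysis
    .WithDegreeOfParallelism(16)  // a hint; you may get fewer threads
    .Select(url => FetchUrl(url)) // blocking IO, so more threads than cores can pay off
    .ToList();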
Finally, I highly recommend reading this great series of articles on Threading and TPL.
If you increase the number of tasks to, for example, 1000000, you will see a lot more threads spawned over time. The TPL tends to inject roughly one new thread every 500 ms.
The TPL threadpool does not understand IO-bound workloads (a sleep blocks a thread just like IO does). It's not a good idea to rely on the TPL to pick the right degree of parallelism in these cases. The TPL is completely clueless here: it injects more threads based on vague guesses about throughput, and also to avoid deadlocks.
Here, the TPL policy clearly is not useful because the more threads you add the more throughput you get. Each thread can process one item per second in this contrived case. The TPL has no idea about that. It makes no sense to limit the thread count to the number of cores.
What determines the number of threads used at a time?
Barely documented TPL heuristics. They frequently go wrong. In particular, they will spawn an unlimited number of threads over time in this case. Use Task Manager to see for yourself: let this run for an hour and you'll have thousands of threads.
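A rough sketch for watching this happen (the count includes non-pool threads, so treat it as a trend, not an exact number; requires using System.Diagnostics;):

for (int i = 0; i < 1000000; i++)
    Task.Factory.StartNew(() => Thread.Sleep(10000)); // blocking "work" keeps the queue full

while (true)
{
    Console.WriteLine($"Threads: {Process.GetCurrentProcess().Threads.Count}");
    Thread.Sleep(1000); // the number keeps creeping upward as the TPL injects threads
}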
How can I retrieve this number? How can I change this number?
You can retrieve some of these numbers but that's not the right way to go. If you need a guaranteed DOP you can use AsParallel().WithDegreeOfParallelism(...) or a custom task scheduler. You also can manually start LongRunning tasks. Do not mess with process global settings.
I would suggest using SemaphoreSlim: it doesn't use a Windows kernel object (so it can be used in Linux C# microservices), and it has a SemaphoreSlim.CurrentCount property that tells you how many slots remain, so you don't need the Interlocked.Increment / Interlocked.Decrement pair. One correction to the original code, though: the per-iteration copy var i1 = i; is still needed, because the lambda captures the loop variable itself, not its value, and every iteration shares the same i:
private static void SemaphoreImplementation()
{
    var maxTasksCount = 20; // allow 20 tasks at a time
    var semaphoreSlim = new SemaphoreSlim(maxTasksCount, maxTasksCount);
    var taskFactory = new TaskFactory();
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i; // copy: the lambda captures the loop variable, not its value
        taskFactory.StartNew(async () =>
        {
            await semaphoreSlim.WaitAsync();
            try
            {
                // CurrentCount tells how many slots remain free
                var count = maxTasksCount - semaphoreSlim.CurrentCount;
                await foo(i1, count);
            }
            finally
            {
                semaphoreSlim.Release();
            }
        }, TaskCreationOptions.LongRunning); // note: LongRunning only applies up to the first await
    }
}

static async Task foo(int i, int count)
{
    await Task.Delay(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
First, I have a thread-waiting example, and it works perfectly. Its job is to have 100 threads each wait 3 seconds and then produce output:
for (int i = 0; i < 100; ++i)
{
    int index = i;
    Thread t = new Thread(() =>
    {
        Caller c = new Caller();
        c.DoWaitCall();
    }) { IsBackground = true };
    t.Start();
}
The Caller.DoWaitCall() method looks like this:
public void DoWaitCall()
{
    Thread.Sleep(3000);
    Console.WriteLine("done");
}
In this case, all the threads wait 3 seconds and print their output at almost the same time.
But when I try to use an async callback to do the Console.WriteLine:
public void DoWaitCall()
{
    MyDel del = () => { Thread.Sleep(3000); };
    del.BeginInvoke(CallBack, del);
}

private void CallBack(IAsyncResult r)
{
    Console.WriteLine("done");
}
each thread waits for a different amount of time, and they produce their output one by one, slowly.
Is there a good way to achieve async callbacks in parallel?
The effect you're seeing is the ThreadPool ramping up gradually, basically. The idea is that it's relatively expensive to create (and then keep) threads, and the ThreadPool is designed for short-running tasks. So if it receives a bunch of tasks in a short space of time, it makes sense to batch them up, only starting new threads when it spots that there are still tasks waiting a little bit later.
You can force it to keep a minimum number of threads around using ThreadPool.SetMinThreads. For real systems you shouldn't normally need to do this, but it makes sense for a demo or something similar.
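For example, something like this (a sketch; the numbers are arbitrary, and the IO-completion minimum is passed through unchanged):

int workerMin, ioMin;
ThreadPool.GetMinThreads(out workerMin, out ioMin);

// Keep at least 100 worker threads ready, so a burst of 100 queued
// callbacks can start immediately instead of waiting for the ramp-up.
ThreadPool.SetMinThreads(100, ioMin);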
The first time, you spawned many threads that do the job in parallel. The second time, you used the thread pool, which has a limited number of threads. As Jon noted, you can raise the pool's minimum thread count.
But why do you need to make an async call from your already-parallel threads? This will not improve performance at all: your work is already being done in parallel, and you are introducing another split (through the thread pool), which adds latency from extra thread context switches. There is no need to do that.
I'm studying C# right now and currently learning threading.
Here is a simple example of adding 1 to a variable multiple times from different threads.
The book suggested I can use Interlocked.Increment(ref number) to replace number += 1 in the AddOne method, so that the increment is performed atomically. The output should then be 1000, 2000, ..., 10000 as expected. But my output is still 999, 1999, 2999, ..., 9999.
Only after I uncomment the Thread.Sleep(1000) line is the output correct, and then it is correct even without Interlocked being used.
Can anyone explain what's happening here?
static void Main(string[] args)
{
    myNum n = new myNum();
    for (int i = 0; i < 10; Interlocked.Increment(ref i))
    {
        for (int a = 1; a <= 1000; Interlocked.Increment(ref a))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        //Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class myNum
{
    public int number = 0;
    public void AddOne()
    {
        //number += 1;
        Interlocked.Increment(ref number);
    }
}
You are printing out the value before all of the threads have finished executing. You need to join all of the threads before printing.
for (int a = 0; a < 1000; a++)
{
    t[a].Join();
}
You'll need to store the threads in an array or list, as in the sketch below. Also, you don't need the Interlocked instruction in any of the for loops: they all run on only one thread (the main thread). Only the code in AddOne runs on multiple threads and hence needs to be synchronized.
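A corrected inner loop based on the question's code might look like this (a sketch):

var threads = new List<Thread>();
for (int a = 1; a <= 1000; a++)
{
    Thread t = new Thread(new ThreadStart(n.AddOne));
    t.Start();
    threads.Add(t);
}

// Wait for every thread to finish before reading the result.
foreach (Thread t in threads)
    t.Join();

Console.WriteLine(n.number); // now reliably a multiple of 1000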
It's a bit strange to me what you're trying to achieve with this code. You are using Interlocked.Increment everywhere without any real need for it.
Interlocked.Increment is required for values that can be accessed from different threads. In your code that is only number, so you don't need it for i and a; just use the usual i++ and a++.
The actual problem is that you don't wait for all the threads you started to complete their job. Take a look at the Thread.Join() method: you have to wait until all of the threads you started have finished their work.
In your simple test, the Thread.Sleep(1000); achieves a similar wait, but it's not correct to assume that all threads complete within 1000 ms, so just use Thread.Join() instead.
If you modify your AddOne() method so it takes longer to execute (e.g. add a Thread.Sleep(1000) to it), you'll notice that the outer Thread.Sleep(1000); no longer helps.
I suggest reading more about ThreadPool vs. Threads. Also take a look at Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4.
I am trying to use ThreadPool.RegisterWaitForSingleObject to add a timer to a set of threads. I create 9 threads and am trying to give each of them an equal chance to run, because at the moment there seems to be a little starvation going on if I just add them to the thread pool. I am also trying to implement a manual reset event, as I want all 9 threads to exit before continuing.
What is the best way to ensure that each thread in the threadpool gets an equal chance at running? The function I am calling has a loop, and it seems that whichever thread runs first gets stuck in it while the others don't get a chance to run.
resetEvents = new ManualResetEvent[table_seats];
//Spawn 9 threads
for (int i = 0; i < table_seats; i++)
{
    resetEvents[i] = new ManualResetEvent(false);
    //AutoResetEvent ev = new AutoResetEvent(false);
    RegisteredWaitHandle handle = ThreadPool.RegisterWaitForSingleObject(autoEvent, ObserveSeat, (object)i, 100, false);
}

//wait for threads to exit
WaitHandle.WaitAll(resetEvents);
However, it doesn't matter whether I use resetEvents[] or ev; neither seems to work properly. Can I implement this, or am I (probably) misunderstanding how they should work?
Thanks, R.
I would not use the RegisterWaitForSingleObject for this purpose. The patterns I am going to describe here require the Reactive Extensions download since you are using .NET v3.5.
First, to wait for all work items from the ThreadPool to complete use the CountdownEvent class. This is a lot more elegant and scalable than using multiple ManualResetEvent instances. Plus, the WaitHandle.WaitAll method is limited to 64 handles.
var finished = new CountdownEvent(1);
for (int i = 0; i < table_seats; i++)
{
    finished.AddCount();
    ThreadPool.QueueUserWorkItem(
        (state) =>
        {
            try
            {
                ObserveSeat(state);
            }
            finally
            {
                finished.Signal(); // always signal, even if ObserveSeat throws
            }
        }, i);
}
finished.Signal(); // remove the initial count
finished.Wait();
Second, you could try calling Thread.Sleep(0) after several iterations of the loop to force a context switch so that the current thread yields to another. If you want a considerably more complex coordination strategy, then use the Barrier class. Add another parameter to your ObserveSeat function which accepts this synchronization mechanism. You could supply it by capturing it in the lambda expression, as sketched after the method below.
public void ObserveSeat(object state, Barrier barrier)
{
    barrier.AddParticipant();
    try
    {
        for (int i = 0; i < NUM_ITERATIONS; i++)
        {
            if (i % AMOUNT == 0)
            {
                // Let the other threads know we are done with this phase and wait
                // for them to catch up.
                barrier.SignalAndWait();
            }
            // Perform your work here.
        }
    }
    finally
    {
        barrier.RemoveParticipant();
    }
}
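Wiring it up could look like this, capturing the barrier in the lambda as suggested (a sketch, reusing the finished countdown from the first snippet):

var barrier = new Barrier(0); // participants add and remove themselves in ObserveSeat
for (int i = 0; i < table_seats; i++)
{
    finished.AddCount();
    ThreadPool.QueueUserWorkItem(state =>
    {
        try { ObserveSeat(state, barrier); }
        finally { finished.Signal(); }
    }, i);
}
finished.Signal(); // remove the initial count
finished.Wait();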
Note that although this approach would certainly prevent the starvation issue it might limit the throughput of the threads. Calling SignalAndWait too much might cause a lot of unnecessary context switching, but calling it too little might cause a lot of unnecessary waiting. You would probably have to tune AMOUNT to get the optimal balance of throughput and starvation. I suspect there might be a simple way to do the tuning dynamically.