I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx) {
Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
startIdx = endIdx; endIdx = jobs.Count;
}
This means that there are multiple times where the Parallel.For needs to synchronize. Consider a bread-first graph algorithm algorithm; the number of synchronizations would be quite large. Waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job> { firstJob };
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
threads.Add(new Thread(new ThreadStart(delegate {
while (!queue.IsEmpty || numBusy > 0) {
if (queue.IsEmpty)
// numbusy > 0 implies more data may arrive
waitHandle.WaitOne();
Job job;
if (queue.TryDequeue(out job)) {
Interlocked.Increment(ref numBusy);
WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
Interlocked.Decrement(ref numBusy);
}
}
// others are possibly waiting for us to enable more work which won't happen
waitHandle.Set();
})));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend, it's even faster as well! Is the task scheduler just that good? The synchronizations were eleminated, there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
The Parallel.ForEach() will use a limited amount of threads and re-use them. The 2nd sample is already starting way behind by having to create a number of threads. That takes time.
And what is the value of maxThreads ? Very critical, in Parallel.ForEach() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And TPL uses work-stealing and other adaptive technologies. You'll have a hard time to do any better.
Parallel.For doesn't actually break up the items into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. Then has each thread synchronously process that batch (possibly using work stealing or saving some extra items to load-balance near the end). By using this approach the worker threads are virtually never waiting on each other, while your threads are constantly waiting on each other due to the heavy synchronization you're using before/after every single iteration.
On top of that since it's using thread pool threads many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The threadpool has a lot of very good algorithms and heuristics to help it determine how many threads are need at that instant in time, based on the current hardware, the load from other applications, etc. It's possible that you're using too many, or not enough threads.
Also, since the number of items that you have isn't known before you start I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation that you're in, so it's heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
job => job.DoWork());
Your creating a bunch of new threads and the Parallel.For is using a Threadpool. You'll see better performance if you were utilizing the C# threadpool but there really is no point in doing that.
I would shy away from rolling out your own solution; if there is a corner case where you need customization use the TPL and customize..
Related
I am trying to learn about asynchronous programming and how I can benefit from it.
My hope is that I can use it to improve performance whenever I'm looping over a method that takes an significant time to complete, like the following method.
string AddStrings()
{
string result = "";
for (int i = 0; i < 10000; i++)
{
result += "hi";
}
return result;
}
Obviously this method doesn't have much value, and I purposely made it ineffecient, in order to test the benefits of asynchronous programming. The test is done by looping over the method 100 times, first synchronously and then asynchronously.
Stopwatch watch = new Stopwatch();
watch.Start();
List<string> results = WorkSync();
//List<string> results = await WorkAsyncParallel();
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
List<string> WorkSync()
{
var stringList = new List<string>();
for (int i = 0; i < 100; i++)
{
stringList.Add(AddStrings());
}
return stringList;
}
async Task<List<string>> WorkAsyncParallel()
{
var taskList = new List<Task<string>>();
for (int i = 0; i < 100; i++)
{
taskList.Add(Task.Run(() => AddStrings()));
}
var results = (await Task.WhenAll(taskList)).ToList();
return results;
}
Super optimistically (naively), I was hoping that the asynchronous loop would be 100 times as fast as the synchronous loop, as all the tasks are running at the same time. While that didn't exactly turn out to be the case, the time of the loop was decreased by more than two thirds, from around 5000 miliseconds to 1500 miliseconds!
Now my questions are:
What makes the asynchronous loop faster than the synchronous loop, but not nearly 100 times faster? I'm guessing each of the 100 tasks are fighthing for a limited amount of CPU?
Is this a valid method to improve performance when looping methods?
Thank you in advance.
My hope is that I can use it to improve performance
Not really. Concurrency (parallel or asynchronous) can improve performance, but asynchronous code on its own is more about freeing up threads. Asynchronous code provides two main benefits:
Server-side apps get better scalability. By using fewer threads, asynchronous code can scale further and faster than synchronous code.
UI apps get better responsiveness. By freeing up the UI thread, asynchronous code provides a better user experience.
Neither of these have much to do with performance. E.g., when comparing a synchronous server-side request handler with its basic asynchronous counterpart, the asynchronous one is usually slower (slightly), but the server as a whole scales better.
That said, asynchronous code can enable natural concurrency (e.g., Task.WhenAll). And if you do end up adding concurrency, then you can see some performance benefits - sometimes quite drastic ones.
The test is done by looping over the method 100 times, first synchronously and then asynchronously.
Technically, you're comparing single-threaded and parallel code, here.
As Jeroen pointed out in the comments, Parallel or PLINQ are the proper tools to use if you have a parallel problem. While you can use Task.Run, it's a very low-level tool for parallel programming. Task.Run is more commonly used for "shove this one thing to the thread pool", not "shove these 100 things to the thread pool".
What makes the asynchronous loop faster than the synchronous loop, but not nearly 100 times faster? I'm guessing each of the 100 tasks are fighthing for a limited amount of CPU?
Yes. Most machines these days have multi-core CPUs, and each core can do one thing at a time. So if you have 4 or 8 cores, that's the limit of your parallelism.
Is this a valid method to improve performance when looping methods?
If you have a lot of work to do in parallel, then parallel programming is acceptable. Again, I recommend using higher-level constructs like Parallel or PLINQ, which have better built-in partitioning strategies and other optimizations, which will be more efficient than throwing a bunch of tasks at the thread pool.
Parallel programming does have its caveats:
If your work is too fine-grained, then parallel work can end up being slower, as the overhead of partitioning, queueing, and scheduling erases the gains from concurrency.
You generally want to avoid parallelism in some situations such as handling a request on the server side. It's just usually not a good idea to allow one request to consume all the CPU resources of the whole server.
Now, if you want to test out asynchronous benefits, then I recommend using an operation that is inherently asynchronous (e.g., a client web request). Say, if your code was hitting 100 URLs. That would be something more naturally asynchronous that doesn't require thread pool threads.
I am new to threading and I need a clarification for the below scenario.
I am working on apple push notification services. My application demands to send notifications to 30k users when a new deal is added to the website.
can I split the 30k users into lists, each list containing 1000 users and start multiple threads or can use task?
Is the following way efficient?
if (lstDevice.Count > 0)
{
for (int i = 0; i < lstDevice.Count; i += 2)
{
splitList.Add(lstDevice.Skip(i).Take(2).ToList<DeviceHelper>());
}
var tasks = new Task[splitList.Count];
int count=0;
foreach (List<DeviceHelper> lst in splitList)
{
tasks[count] = Task.Factory.StartNew(() =>
{
QueueNotifications(lst, pMessage, pSubject, pNotificationType, push);
},
TaskCreationOptions.None);
count++;
}
QueueNotification method will just loop through each list item and creates a payload like
foreach (DeviceHelper device in splitList)
{
if (device.PlatformType.ToLower() == "ios")
{
push.QueueNotification(new AppleNotification()
.ForDeviceToken(device.DeviceToken)
.WithAlert(pMessage)
.WithBadge(device.Badge)
);
Console.Write("Waiting for Queue to Finish...");
}
}
push.StopAllServices(true);
Technically it is sure possible to split a list and then start threads that runs your List in parallel. You can also implement everything yourself, as you already have done, but this isn't a good approach. At first splitting a List into chunks that gets processed in parallel is already what Parallel.For or Parallel.ForEach does. There is no need to re-implement everything yourself.
Now, you constantly ask if something can run 300 or 500 notifications in parallel. But actually this is not a good question because you completly miss the point of running something in parallel.
So, let me explain you why that question is not good. At first, you should ask yourself why do you want to run something in parallel? The answer to that is, you want that something runs faster by using multiple CPU-cores.
Now your simple idea is probably that spawning 300 or 500 threads is faster, because you have more threads and it runs more things "in parallel". But that is not exactly the case.
At first, creating a thread is not "free". Every thread you create has some overhead, it takes some CPU-time to create a thread, and also it needs some memory. On top of that, if you create 300 threads it doesn't mean 300 threads run in parallel. If you have for example an 8 core CPU only 8 threads really can run in parallel. Creating more threads can even hurt your performance. Because now your program needs to switch constanlty between threads, that also cost CPU-performance.
The result of all that is. If you have something lightweight some small code that don't do a lot of computation it ends that creating a lot of threads will slow down your application instead of running faster, because the managing of your threads creates more overhead than running it on (for example) 8 cpu-cores.
That means, if you have a list of 30,000 of somewhat. It usally end that it is faster to just split your list in 8 chunks and work through your list in 8 threads as creating 300 Threads.
Your goal should never be: Can it run xxx things in parallel?
The question should be like: How many threads do i need, and how much items should every thread process to get my work as fastest done.
That is an important difference because just spawning more threads doesn't mean something ends up beeing fast.
So how many threads do you need, and how many items should every thread process? Well, you can write a lot of code to test it. But the amount changes from hardware to hardware. A PC with just 4 cores have another optimum than a system with 8 cores. If what you are doing is IO bound (for example read/write to disk/network) you also don't get more speed by increasing your threads.
So what you now can do is test everything, try to get the correct thread number and do a lot of benchmarking to find the best numbers.
But actually, that is the whole purpose of the TPL library with the Task<T> class. The Task<T> class already looks at your computer how many cpu-cores it have. And when you are running your Task it automatically tries to create as much threads needed to get the maximum out of your system.
So my suggestion is that you should use the TPL library with the Task<T> class. In my opinion you should never create Threads directly yourself or doing partition yourself, because all of that is already done in TPL.
I think the Task-Class is a good choise for your aim, becuase you have an easy handling over the async process and don't have to deal with Threads directly.
Maybe this help: Task vs Thread differences
But to give you a better answer, you should improve your question an give us more details.
You should be careful with creating to much parallel threads, because this can slow down your application. Read this nice article from SO: How many threads is too many?. The best thing is you make it configurable and than test some values.
I agree Task is a good choice however creating too many tasks also bring risks to your system and for failures, your decision is also a factor to come up a solution. For me I prefer MSQueue combining with thread pool.
If you want parallelize the creation of the push notifications and maximize the performance by using all CPU's on the computer you should use Parallel.ForEach:
Parallel.ForEach(
devices,
device => {
if (device.PlatformType.ToUpperInvariant() == "IOS") {
push.QueueNotification(
new AppleNotification()
.ForDeviceToken(device.DeviceToken)
.WithAlert(message)
.WithBadge(device.Badge)
);
}
}
);
push.StopAllServices(true);
This assumes that calling push.QueueNotification is thread-safe. Also, if this call locks a shared resource you may see lower than expected performance because of lock contention.
To avoid this lock contention you may be able to create a separate queue for each partition that Parallel.ForEach creates. I am improvising a bit here because some details are missing from the question. I assume that the variable push is an instance of the type Push:
Parallel.ForEach(
devices,
() => new Push(),
(device, _, push) => {
if (device.PlatformType.ToUpperInvariant() == "IOS") {
push.QueueNotification(
new AppleNotification()
.ForDeviceToken(device.DeviceToken)
.WithAlert(message)
.WithBadge(device.Badge)
);
}
return push;
},
push.StopAllServices(true);
);
This will create a separate Push instance for each partition that Parallel.ForEach creates and when the partition is complete it will call StopAllServices on the instance.
This approach should perform no worse than splitting the devices into N lists where N is the number of CPU's and and starting either N threads or N tasks to process each list. If one thread or task "gets behind" the total execution time will be the execution time of this "slow" thread or task. With Parallel.ForEach all CPU's are used until all devices have been processed.
I am having a Windows Service that needs to pick the jobs from database and needs to process it.
Here, each job is a scanning process that would take approx 10 mins to complete.
I am very new to Task Parallel Library. I have implemented in the following way as sample logic:
Queue queue = new Queue();
for (int i = 0; i < 10000; i++)
{
queue.Enqueue(i);
}
for (int i = 0; i < 100; i++)
{
Task.Factory.StartNew((Object data ) =>
{
var Objdata = (Queue)data;
Console.WriteLine(Objdata.Dequeue());
Console.WriteLine(
"The current thread is " + Thread.CurrentThread.ManagedThreadId);
}, queue, TaskCreationOptions.LongRunning);
}
Console.ReadLine();
But, this is creating lot of threads. Since loop is repeating 100 times, it is creating 100 threads.
Is it right approach to create that many number of parallel threads ?
Is there any way to limit the number of threads to 10 (concurrency level)?
An important factor to remember when allocating new Threads is that the OS has to allocate a number of logical entities in order for that current thread to run:
Thread kernel object - an object for describing the thread,
including the thread's context, cpu registers, etc
Thread environment block - For exception handling and thread local
storage
User-mode stack - 1MB of stack
Kernel-mode stack - For passing arguments from user mode to kernel
mode
Other than that, the number of concurrent Threads that may run depend on the number of cores your machine is packing, and creating an amount of threads that is larger than the number of cores your machine owns will start causing Context Switching, which in the long run may slow your work down.
So after the long intro, to the good stuff. What we actually want to do is limit the number of threads running and reuse them as much as possible.
For this kind of job, i would go with TPL Dataflow which is based on the Producer-Consumer pattern. Just a small example of what can be done:
// a BufferBlock is an equivalent of a ConcurrentQueue to buffer your objects
var bufferBlock = new BufferBlock<object>();
// An ActionBlock to process each object and do something with it
var actionBlock = new ActionBlock<object>(obj =>
{
// Do stuff with the objects from the bufferblock
});
bufferBlock.LinkTo(actionBlock);
bufferBlock.Completion.ContinueWith(t => actionBlock.Complete());
You may pass each Block a ExecutionDataflowBlockOptions which may limit the Bounded Capacity (The number of objects inside the BufferBlock) and MaxDegreeOfParallelism which tells the block the number of maximum concurrency you may want.
There is a good example here to get you started.
Glad you asked, because you're right in the sense that - this is not the best approach.
The concept of Task should not be confused with a Thread. A Thread can be compared to a chef in a kitchen, while a Task is a dish ordered by a customer. You have a bunch of chefs, and they process the dish orders in some ordering (usually FIFO). A chef finishes a dish then moves on to the next. The concept of Thread Pool is the same. You create a bunch of Tasks to be completed, but you do not need to assign a new thread to each task.
Ok so the actual bits to do it. There are a few. The first one is ThreadPoll.QueueUserWorkItem. (http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem(v=vs.110).aspx). Using the Parallel library, Parallel.For can also be used, it will automatically spawn threads based on the number of actual CPU cores available in the system.
Parallel.For(0, 100, i=>{
//here, this method will be called 100 times, and i will be 0 to 100
WaitForGrassToGrow();
Console.WriteLine(string.Format("The {0}-th task has completed!",i));
});
Note that there is no guarantee that the method called by Parallel.For is called in sequence (0,1,2,3,4,5...). The actual sequence depends on the execution.
I used Task like below but there is no performance gain. I checked my method which executes in 0-1 seconds but with Task(30 Tasks), it takes 5-12 seconds. Can anyone guide if I have done any mistake. I want to run 30 parallel and expect 30 done in max 2 seconds.
Here is my code:
Task[] tasks = new Task[30];
for (int p = 0; p <= dstable.Tables[0].Rows.Count - 1; p++)
{
MethodParameters newParameter = new MethodParameters();
newParameter.Name = dstable.Tables[0].Rows[p]["Name"].ToString();
tasks[p] = Task.Factory.StartNew(() => ParseUri(newParameter));
Application.DoEvents();
}
try
{
Task.WaitAll(tasks);
//Console.Write("task completed");
}
catch (AggregateException ae)
{
throw ae.Flatten();
}
There are some major problems in your thinking.
does your PC have 30 Cores, so that every core can exactly takes one task? I don't think so
starting a seperate thread also takes some time.
with every concurrent thread more overhead is generated.
Can your problem be solved faster by starting more threads? This is only the case, when all threads do different tasks, like reading from disk, quering a database, computing something, etc.. 10 threads that do all "high-performance" tasks on the cpu, won't give an boost, quite contrary to, because every thread needs to clean up his mess, before he can give some cpu time to the next thread, and that one needs to clean up his mess too.
Check this link out
http://msdn.microsoft.com/en-us/library/ms810437.aspx
You can use the TPL
http://msdn.microsoft.com/en-us/library/dd460717.aspx
they try to guaranty the maximum effect from parallel threads.
Also I recommend this book
http://www.amazon.com/The-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916
When you really want to solve your problem in under 2 seconds, buy more CPU power ;)
I think you may have missing the main point of using thread.
Creating many threads may lead to more complexity by increasing OS over-head. (Context switching)
Also it will be much harder to manage your execution of code and harder to find and fix bugs if there is any.
Usage of thread may give you advantage when there is
Task need to be perform simultaneously
Building responsive UI
I have a function, that processes a list of 6100 list items. The code used to work when the list was just 300 items. But instantly crashes with 6100. Is there a way I can loop through these 6100 items say 30 at a time and execute a new thread per item?
for (var i = 0; i < ListProxies.Items.Count; i++)
{
var s = ListProxies.Items[i] as string;
var thread = new ParameterizedThreadStart(ProxyTest.IsAlive);
var doIt = new Thread(thread) { Name = "CheckProxy# " + i };
doIt.Start(s);
}
Any help would be greatly appreciated.
Do you really need to spawn a new thread for each work item? Unless there is a genuine need for this (if so, please tell us why), I would strongly recommend you use the Managed Thread Pool instead. This will give you the concurrency benefits you require, but without the resource requirements (as well as the creation, destruction and massive context-switching costs) of running thousands of threads. If you are on .NET 4.0, you might also want to consider using the Task Parallel Library.
For example:
for (var i = 0; i < ListProxies.Items.Count; i++)
{
var s = ListProxies.Items[i] as string;
ThreadPool.QueueUserWorkItem(ProxyTest.IsAlive, s);
}
On another note, I would seriously consider renaming the IsAlive method (which looks like a boolean property or method) since:
It clearly has a void IsAlive(object) signature.
It has observable side-effects (from your comment that it "increment a progress bar and add a 'working' proxy to a new list").
There is a limit on the number of threads you can spawn. 6100 threads does seem quite a bit excesive.
I agree win Ani, you should look into a ThreadPool or even a Producer / Consumer process depending on what you are trying to accomplish.
There are quite a few processes for handling multi threaded applications but without knowing what you are doing in the start there really is no way to recommend any approach other than a ThreadPool or Producer / Consumer process (Queues with SyncEvents).
At any rate you really should try to keep the number of threads to a minimum otherwise you run the risk of thread locks, spin locks, wait locks, dead locks, race conditions, who knows what, etc...
If you want good information on threading with C# check out the book Concurrent Programming on Windows By Joe Duffy it is really helpful.