My goal is to write a program that handles an arbitrary number of tasks based on given user input.
Let's say the # of tasks are 1000 in this case.
Now, I'd like to be able to have a dynamic number of threads that are spawned and start handling the tasks one by one.
I would assume I need to use a "synchronous" method, as opposed to a "asynchronous" one, so that in case one tasks has a problem, I wouldn't want it to slow down the completion of the rest.
What method would I use to accomplish the above? Semaphores? ThreadPools? And how do I make sure that a thread does not try to start a task that is already being handled by another thread? Would a "lock" handle this?
Code examples and/or links to sites that will point me in the right direction will be appreciated.
edit: The problem with the MSDN Fibonacci example is that the waitall method can only handle up to 64 waits. I need more than that due to the 1000 tasks. How to fix that situation without creating deadlocks?
Are these tasks independent? If so, you basically want a producer/consumer queue or a custom threadpool, which are effectively different views on the same thing. You need to be able to place tasks in a queue, and have multiple threads be able to read from that queue.
I have a custom threadpool in MiscUtil or there's a simple (nongeneric due to age) producer/consumer queue in my threading tutorial (about half way down this page).
If these tasks are reasonably long-running, I wouldn't use the system threadpool for this - it will spawn more threads than you probably want. If you're using .NET 4.0 beta 1 you could use Parallel Extensions though.
I'm not quite sure about your comment on WaitAll... are you trying to work out when everything's finished? In the producer/consumer queue case, that would probably involve having some sort of "stop" entry in the queue (e.g. null references which the consuming threads would understand to mean "quit") and then add a "WaitUntilEmpty" method (which should be fairly easy to implement). Note that you wouldn't need to wait until the last items had been processed, as they'd all be stop signals... by the time the queue has emptied, all the real work items will definitely have been processed anyway.
You'll probably want to use the ThreadPool to manage this.
I recommend reading up on MSDN on How to use the ThreadPool in C#. It covers many aspects of this, including firing tasks, and simple synchronization.
Using Threading in C# is the main section, and will cover other options.
If you happen to be using VS 2010 beta, and targetting .NET 4, the Task Parallel Library is a very good option for this - it simplifies some of these patterns.
You can't use it (yet) but the new Task class in .NET 4 would be ideal for this kind of situation.
Until then, the ThreadPool is your best bet. It has a (very) limited form of load-balancing. Note that if you try to start 1000 Threads you will probably get an Out Of Memory exception. The ThreadPool will handle that with ease.
Your sync problem can be handled with a simple (Interlocked) counter, if the timing is such that you can tolerate a Sleep(1) loop in the main thread. The ThreadPool is missing a more convenient way to do this.
A simple strategy to avoid a task is get by two or more thread is a syncronized (with a mutext for example) vector.
See this http://msdn.microsoft.com/en-us/library/yy12yx1f.aspx
Perhaps you can use the BackgroundWorker class. It creates a nice abstraction on top of the thread pool. You can even subclass it if you want to setup many similar jobs.
As has been mentioned, .NET 4 features the excellent Task Parallel Library. But you can use the June 2008 CTP of it in .NET 3.5 just fine. I've been doing this for some hobby projects myself, but if this is a commercial project, you should check out if there are legal issues.
Related
What is a multithreading program and how does it work exactly? I read some documents but I'm confused. I know that code is executed line by line, but I can't understand how the program manages this.
A simple answer would be appreciated.c# example please (only animation!)
What is a multi-threading program and how does it work exactly?
Interesting part about this question is complete books are written on the topic, but still it is elusive to lot of people. I will try to explain in the order detailed underneath.
Please note this is just to provide a gist, an answer like this can never do justice to the depth and detail required. Regarding videos, best that I have come across are part of paid subscriptions (Wintellect and Pluralsight), check out if you can listen to them on trial basis, assuming you don't already have the subscription:
Wintellect by Jeffery Ritcher (from his Book, CLR via C#, has same chapter on Thread Fundamentals)
CLR Threading by Mike Woodring
Explanation Order
What is a thread ?
Why were threads introduced, main purpose ?
Pitfalls and how to avoid them, using Synchronization constructs ?
Thread Vs ThreadPool ?
Evolution of Multi threaded programming API, like Parallel API, Task API
Concurrent Collections, usage ?
Async-Await, thread but no thread, why they are best for IO
What is a thread ?
It is software implementation, which is purely a Windows OS concept (multi-threaded architecture), it is bare minimum unit of work. Every process on windows OS has at least one thread, every method call is done on the thread. Each process can have multiple threads, to do multiple things in parallel (provided hardware support).
Other Unix based OS are multi process architecture, in fact in Windows, even the most complex piece of software like Oracle.exe have single process with multiple threads for different critical background operations.
Why were threads introduced, main purpose ?
Contrary to the perception that concurrency is the main purpose, it was robustness that lead to the introduction of threads, imagine every process on Windows is running using same thread (in the initial 16 bit version) and out of them one process crash, that simply means system restart to recover in most of the cases. Usage of threads for concurrent operations, as multiple of them can be invoked in each process, came in picture down the line. In fact it is even important to utilize the processor with multiple cores to its full ability.
Pitfalls and how to avoid using Synchronization constructs ?
More threads means, more work completed concurrently, but issue comes, when same memory is accessed, especially for Write, as that's when it can lead to:
Memory corruption
Race condition
Also, another issue is thread is a very costly resource, each thread has a thread environment block, Kernel memory allocation. Also for scheduling each thread on a processor core, time is spent for context switching. It is quite possible that misuse can cause huge performance penalty, instead of improvement.
To avoid Thread related corruption issues, its important to use the Synchronization constructs, like lock, mutex, semaphore, based on requirement. Read is always thread safe, but Write needs appropriate Synchronization.
Thread Vs ThreadPool ?
Real threads are not the ones, we use in C#.Net, that's just the managed wrapper to invoke Win32 threads. Challenge remain in user's ability to grossly misuse, like invoking lot more than required number of threads, assigning the processor affinity, so isn't it better that we request a standard pool to queue the work item and its windows which decide when the new thread is required, when an already existing thread can schedule the work item. Thread is a costly resource, which needs to be optimized in usage, else it can be bane not boon.
Evolution of Multi threaded programming, like Parallel API, Task API
From .Net 4.0 onward, variety of new APIs Parallel.For, Parallel.ForEach for data paralellization and Task Parallelization, have made it very simple to introduce concurrency in the system. These APIs again work using a Thread pool internally. Task is more like scheduling a work for sometime in the future. Now introducing concurrency is like a breeze, though still synchronization constructs are required to avoid memory corruption, race condition or thread safe collections can be used.
Concurrent Collections, usage ?
Implementations like ConcurrentBag, ConcurrentQueue, ConcurrentDictionary, part of System.Collections.Concurrent are inherent thread safe, using spin-wait and much easier and quicker than explicit Synchronization. Also much easier to manage and work. There's another set API like ImmutableList System.Collections.Immutable, available via nuget, which are thread safe by virtue of creating another copy of data structure internally.
Async-Await, thread but no thread, why they are best for IO
This is an important aspect of concurrency meant for IO calls (disk, network), other APIs discussed till now, are meant for compute based concurrency so threads are important and make it faster, but for IO calls thread has no use except waiting for the call to return, IO calls are processed on hardware based queue IO Completion ports
A simple analogy might be found in the kitchen.
You've probably cooked using a recipe before -- start with the specified ingredients, follow the steps indicated in the recipe, and at the end you (hopefully) have a delicious dish ready to eat. If you do that, then you have executed a traditional (non-multithreaded) program.
But what if you have to cook a full meal, which includes a number of different dishes? The simple way to do it would be to start with the first recipe, do everything the recipe says, and when it's done, put the finished dish (and the first recipe) aside, then start on the second recipe, do everything it says, put the second dish (and second recipe) aside, and so on until you've gone through all of the recipes one after another. That will work, but you might end up spending 10 hours in the kitchen, and of course by the time the last dish is ready to eat, the first dish might be cold and unappetizing.
So instead you'd probably do what most chefs do, which is to start working on several recipes at the same time. For example, you might put the roast in the oven for 45 minutes, but instead of sitting in front of the oven waiting 45 minutes for the roast to cook, you'd spend the 45 minutes chopping the vegetables. When the oven timer rings, you put down your vegetable knife, pull the cooked roast out of the oven and let it cool, then go back to chopping vegetables, and so on. If you can do that, then you are successfully multitasking several recipes/programs. That is, you aren't literally working on multiple recipes at once (you still have only two hands!), but you are jumping back and forth from following one recipe to following another whenever necessary, and thereby making progress on several tasks rather than twiddling your thumbs a lot. Do this well and you can have the whole meal ready to eat in a much shorter amount of time, and everything will be hot and fresh at about the same time too. If you do this, you are executing a simple multithreaded program.
Then if you wanted to get really fancy, you might hire a few other chefs to work in the kitchen at the same time as you, so that you can get even more food prepared in a given amount of time. If you do this, your team is doing multiprocessing, with each chef taking one part of the total work and all of them working simultaneously. Note that each chef may well be working on multiple recipes (i.e. multitasking) as described in the previous paragraph.
As for how a computer does this sort of thing (no more analogies about chefs), it usually implements it using a list of ready-to-run threads and a timer. When the timer goes off (or when the thread that is currently executing has nothing to do for a while, because e.g. it is waiting to load data from a slow hard drive or something), the operating system does a context switch, in which pauses the current thread (by putting it into a list somewhere and no longer executing instructions from that thread's code anymore), then pulls another ready-to-run thread from the list of ready-to-run threads and starts executing instructions from that thread's code instead. This repeats for as long as necessary, often with context switches happening every few milliseconds, giving the illusion that multiple programs are running "at the same time" even on a single-core CPU. (On a multi-core CPU it does this same thing on each core, and in that case it's no longer just an illusion; multiple programs really are running at the same time)
Why don't you refer to Microsoft's very own documentation of the .net class System.Threading.Thread?
It has a handfull of simple example programs written in C# (at the bottom of the page) just as you asked for:
Thread Examples
actually multi thread is do multiple process at the same time together . and you can complete process parallel .
it's actually multi thread is do multiple process at the same time together . and you can complete process parallel . you can take task from your main thread then execute some other way and done .
I see a lot of people in blog posts and here on SO either avoiding or advising against the usage of the Thread class in recent versions of C# (and I mean of course 4.0+, with the addition of Task & friends). Even before, there were debates about the fact that a plain old thread's functionality can be replaced in many cases by the ThreadPool class.
Also, other specialized mechanisms are further rendering the Thread class less appealing, such as Timers replacing the ugly Thread + Sleep combo, while for GUIs we have BackgroundWorker, etc.
Still, the Thread seems to remain a very familiar concept for some people (myself included), people that, when confronted with a task that involves some kind of parallel execution, jump directly to using the good old Thread class. I've been wondering lately if it's time to amend my ways.
So my question is, are there any cases when it's necessary or useful to use a plain old Thread object instead of one of the above constructs?
The Thread class cannot be made obsolete because obviously it is an implementation detail of all those other patterns you mention.
But that's not really your question; your question is
are there any cases when it's necessary or useful to use a plain old Thread object instead of one of the above constructs?
Sure. In precisely those cases where one of the higher-level constructs does not meet your needs.
My advice is that if you find yourself in a situation where existing higher-abstraction tools do not meet your needs, and you wish to implement a solution using threads, then you should identify the missing abstraction that you really need, and then implement that abstraction using threads, and then use the abstraction.
Threads are a basic building block for certain things (namely parallelism and asynchrony) and thus should not be taken away. However, for most people and most use cases there are more appropriate things to use which you mentioned, such as thread pools (which provide a nice way of handling many small jobs in parallel without overloading the machine by spawning 2000 threads at once), BackgroundWorker (which encapsulates useful events for a single shortlived piece of work).
But just because in many cases those are more appropriate as they shield the programmer from needlessly reinventing the wheel, doing stupid mistakes and the like, that does not mean that the Thread class is obsolete. It is still used by the abstractions named above and you would still need it if you need fine-grained control over threads that is not covered by the more special classes.
In a similar vein, .NET doesn't forbid the use of arrays, despite List<T> being a better fit for many cases where people use arrays. Simply because you may still want to build things that are not covered by the standard lib.
Task and Thread are different abstractions. If you want to model a thread, the Thread class is still the most appropriate choice. E.g. if you need to interact with the current thread, I don't see any better types for this.
However, as you point out .NET has added several dedicated abstractions which are preferable over Thread in many cases.
The Thread class is not obsolete, it is still useful in special circumstances.
Where I work we wrote a 'background processor' as part of a content management system: a Windows service that monitors directories, e-mail addresses and RSS feeds, and every time something new shows up execute a task on it - typically to import the data.
Attempts to use the thread pool for this did not work: it tries to execute too much stuff at the same time and trash the disks, so we implemented our own polling and execution system using directly the Thread class.
The new options make direct use and management of the (expensive) threads less frequent.
people that, when confronted with a task that involves some kind of parallel execution, jump directly to using the good old Thread class.
Which is a very expensive and relatively complex way of doing stuff in parallel.
Note that the expense matters most: You cannot use a full thread to do a small job, it would be counterproductive. The ThreadPool combats the costs, the Task class the complexities (exceptions, waiting and canceling).
To answer the question of "are there any cases when it's necessary or useful to use a plain old Thread object", I'd say a plain old Thread is useful (but not necessary) when you have a long running process that you won't ever interact with from a different thread.
For example, if you're writing an application that subscribes to receive messages from some sort of message queue and you're application is going to do more than just process those messages then it would be useful to use a Thread because the thread will be self-contained (i.e. you aren't waiting on it to get done), and it isn't short-lived. Using the ThreadPool class is more for queuing up a bunch of short-lived work items and allowing the ThreadPool class manage efficiently processing each one as a new Thread is available. Tasks can be used where you would use Thread directly, but in the above scenario I don't think they would buy you much. They help you interact with the thread more easily (which the above scenario doesn't need) and they help determine how many Threads actually should be used for the given set of tasks based on the number of processors you have (which isn't what you want, so you'd tell the Task your thing is LongRunning in which case in the current 4.0 implementation it would simply create a separate non-pooled Thread).
Probably not the answer you were expecting, but I use Thread all the time when coding against the .NET Micro Framework. MF is quite cut down and doesn't include higher level abstractions and the Thread class is super flexible when you need to get the last bit of performance out of a low MHz CPU.
You could compare the Thread class to ADO.NET. It's not the recommended tool for getting the job done, but its not obsolete. Other tools build on top of it to ease the job.
Its not wrong to use the Thread class over other things, especially if those things don't provide a functionality that you need.
It's not definitely obsolete.
The problem with multithreaded apps is that they are very hard to get right (often indeterministic behavior, input, output and also internal state is important), so a programmer should push as much work as possible to framework/tools. Abstract it away. But, the mortal enemy of abstraction is performance.
So my question is, are there any cases when it's necessary or useful
to use a plain old Thread object instead of one of the above
constructs?
I'd go with Threads and locks only if there will be serious performance problems, high performance goals.
I've always used the Thread class when I need to keep count and control over the threads I've spun up. I realize I could use the threadpool to hold all of my outstanding work, but I've never found a good way to keep track of how much work is currently being done or what the status is.
Instead, I create a collection and place the threads in them after I spin them up - the very last thing a thread does is remove itself from the collection. That way, I can always tell how many threads are running, and I can use the collection to ask each what it's doing. If there's a case when I need to kill them all, normally you'd have to set some kind of "Abort" flag in your application, wait for every thread to notice that on its own and self-terminate - in my case, I can walk the collection and issue a Thread.Abort to each one in turn.
In that case, I haven't found a better way that working directly with the Thread class. As Eric Lippert mentioned, the others are just higher-level abstractions, and it's appropriate to work with the lower-level classes when the available high-level implementations don't meet your need. Just as you sometimes need to do Win32 API calls when .NET doesn't address your exact needs, there will always be cases where the Thread class is the best choice despite recent "advancements."
I have an application in C# with a list of work to do. I'm looking to do as much of that work as possible in parallel. However I need to be able to control the maximum amount of parallel tasks.
From what I understand this is possible with a ThreadPool or with Tasks. Is there an difference in which one I use? My main concern is being able to control how many threads are active at one time.
Please take a look at ParallelOptions.MaxDegreeOfParallelism for Tasks.
I would advise you to use Tasks, because they provide a higher level abstraction than the ThreadPool.
A very good read on the topic can be found here. Really, a must-have book and it's free on top of that :)
In TPL you can use the WithDegreeOfParallelism on a ParallelEnumerable or ParallelOptions.MaxDegreeOfParallism
There is also the CountdownEvent which may be a better option if you are just using custom threads or tasks.
In the ThreadPool, when you use SetMaxThreads its global for the AppDomain so you could potentially be limiting unrelated code unnecessarily.
You cannot set the number of worker threads or the number of I/O completion threads to a number smaller than the number of processors in the computer.
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
Use caution when changing the maximum number of threads in the thread pool. While your code might benefit, the changes might have an adverse effect on code libraries you use.
Setting the thread pool size too large can cause performance problems. If too many threads are executing at the same time, the task switching overhead becomes a significant factor.
I agree with the other answer that you should use TPL over the ThreadPool as its a better abstraction of multi-threading, but its possible to accomplish what you want in both.
In this article on msdn, they explain why they recommend Tasks instead of ThreadPool for Parallelism.
Task have a very charming feature to me, you can build chains of tasks. Which are executed on certain results of the task before.
A feature I often use is following: Task A is running in background to do some long running work. I chain Task B after it, only executing when Task A has finished regulary and I configure it to run in the foreground, so I can easily update my controls with the result of long running Task A.
You can also create a semaphore to control how many threads can execute at a single time. You can create a new semaphore and in the constructor specify how many simultaneous threads are able to use that semaphore at a single time. Since I don't know how you are going to be using the threads, this would be a good starting point.
MSDN Article on the Semaphore class
-Wesley
Is it possible to purge a ThreadPool?
Remove items from the ThreadPool?
Anything like that?
ThreadPool.QueueUserWorkItem(GetDataThread);
RegisteredWaitHandle Handle = ThreadPool.RegisterWaitForSingleObject(CompletedEvent, WaitProc, null, 10000, true);
Any thoughts?
I recommend using the Task class (added in .NET 4.0) if you need this kind of behaviour. It supports cancellation, and you can have any number of tasks listening to the same cancellation token, which enables you to cancel them all with a single method call.
Updated (non-4.0 solution):
You really only have two choices. One: implement your own event demultiplexer (this is far more complex than it appears, due to the 64-handle wait limitation); I can't recommend this - I had to do it once (in unmanaged code), and it was hideous.
That leaves the second choice: Have a signal to cancel the tasks. Naturally, RegisteredWaitHandle.Unregister can cancel the RWFSO part. The QUWI is more complex, but can be done by making the action aware of a "token" value. When the action executes, it first checks the token value against its stored token value; if they are different, then it shouldn't do anything.
One major thing to consider is race conditions. Just keep in mind that there is a race condition between cancelling an action and the ThreadPool executing it, so it is possible to see actions running after cancellation.
I have a blog post on this concept, which I call "asynchronous callback contexts". The CallbackContext type mentioned in the blog post is available in the Nito.Async library.
There's no interface for removing a queued item. However, nothing stops you from "poisoning" the delegate so that it returns immediately.
edit
Based on what Paul said, I'm thinking you might also want to consider a pipelined architecture, where you have a fixed number of threads reading from a blocking queue (like .NET 4.0's BlockingCollection on a ConcurrentQueue). This way, if you want to cancel items, you can just access the queue yourself.
Having said that, Stephen's advice about Task is likely better, in that it gives you all the control you would realistically want, without all the hard work that rolling your own pipelines involves. I mention this only for completion.
The ThreadPool exists to help you manage your threads. You should not have to worry about purging it at all since it will make the best performance decisions on your behalf.
If you think you need tighter control over your threads then you could consider creating your own thread management class (similar to ThreadPool) but it would take a lot of work to match and exceed the functionality that ThreadPool has built in.
Take a look here at some of the ThreadPool optimizations and the ideas behind it.
For my second point, I found an article on Code Project that implements a "Cancelable Threadpool", probably for some of your own similar reasons. It would be a good place to start looking if you're going to write your own.
I used multiple threads in a few programs, but still don't feel very comfortable about it.
What multi-threading libraries for C#/.NET are out there and which advantages does one have over the other?
By multi-threading libraries I mean everything which helps make programming with multiple threads easier.
What .NET integratet (i.e. like ThreadPool) do you use periodically?
Which problems did you encounter?
There are various reasons for using multiple threads in an application:
UI responsiveness
Concurrent operations
Parallel speedup
The approach one should choose depends on what you're trying to do. For UI responsiveness, consider using BackgroundWorker, for example.
For concurrent operations (e.g. a server: something that doesn't have to be parallel, but probably does need to be concurrent even on a single-core system), consider using the thread pool or, if the tasks are long-lived and you need a lot of them, consider using one thread per task.
If you have a so-called embarrassingly parallel problem that can be easily divided up into small subproblems, consider using a pool of worker threads (as many threads as CPU cores) that pull tasks from a queue. The Microsoft Task Parallel Library (TPL) may help here. If the job can be easily expressed as a monadic stream computation (i.e. with a query in LINQ with work in transformations and aggregations etc.), Parallel LINQ (same link) which runs on top of TPL may help.
There are other approaches, such as Actor-style parallelism as seen in Erlang, which are harder to implement efficiently in .NET because of the lack of a green threading model or means to implement same, such as CLR-supported continuations.
I like this one
http://www.codeplex.com/smartthreadpool
Check out the Power Threading library.
I have written a lot of threading code in my days, even implemented my own threading pool & dispatcher. A lot of it is documented here:
http://web.archive.org/web/20120708232527/http://devplanet.com/blogs/brianr/default.aspx
Just realize that I wrote these for very specific purposes and tested them in those conditions, and there is no real silver-bullet.
My advise would be to get comfortable with the thread pool before you move to any other libraries. A lot of the framework code uses the thread pool, so even if you happen to find The Best Threads Library(TM), you will still have to work with the thread pool, so you really need to understand that.
You should also keep in mind that a lot of work has been put into implementing the thread pool and tuning it. The upcoming version of .NET has numerous improvements triggered by the development the parallel libraries.
In my point of view many of the "problems" with the current thread pool can be amended by knowing its strengths and weaknesses.
Please keep in mind that you really should be closing threads (or allowing the threadpool to dispose) when you no longer need them, unless you will need them again soon. The reason I say this is that each thread requires stack memory (usually 1mb), so when you have applications sitting on threads but not using them, you are wasting memory.
For exmaple, Outlook on my machine right now has 20 threads open and is using 0% CPU. That is simply a waste of (a least) 20mb of memory. Word is also using another 10 threads with 0% CPU. 30mb may not seem like much, but what if every application was wasting 10-20 threads?
Again, if you need access to a threadpool on a regular basis then you don't need to close it (creating/destroying threads has an overhead).
You don't have to use the threadpool explicitly, you can use BeginInvoke-EndInvoke if you need async calls. It uses the threadpool behind the scenes. See here: http://msdn.microsoft.com/en-us/library/2e08f6yc.aspx
You should take a look at the Concurrency & Coordination Runtime. The CCR can be a little daunting at first as it requires a slightly different mind set. This video has a fairly good job of explanation of its workings...
In my opinion this would be the way to go, and I also hear that it will use the same scheduler as the TPL.
For me the builtin classes of the Framework are more than enough. The Threadpool is odd and lame, but you can write your own easily.
I often used the BackgroundWorker class for Frontends, cause it makes life much easier - invoking is done automatically for the eventhandlers.
I regularly start of threads manually and safe them in an dictionary with a ManualResetEvent to be able to examine who of them has ended already. I use the WaitHandle.WaitAll() Method for this. Problem there is, that WaitHandle.WaitAll does not acceppt Arrays with more than 64 WaitHandles at once.
You might want to look at the series of articles about threading patterns. Right now it has sample codes for implementing a WorkerThread and a ThreadedQueue.
http://devpinoy.org/blogs/jakelite/archive/tags/Threading+Patterns/default.aspx