For one of my projects thats kind of a content-aggregator i'd like to introduce concurrency and if possible parallelism. At first hand this may seem pointless because concurrency and parallelism take different approaches. (Concurrency via threads introduces immediate concurrency, where as parallelism provides a potential).
So to better explain my problem, let me summarize my problem set.
As my project is a content-aggregator (that aggregates feeds,podcasts and similar stuff) it basically reads the data from web, parses them to return the meaningful data.
So as of right now i took a very simplistic sequential approach. Let's say that we've some amount of feeds we have to parse.
foreach(feed in feeds)
{
read_from_web(feed)
parse(feed)
}
So with sequential approach time taken parse all feeds and process them greatly depends on not only the parser code but time needed to get the xml source from web. We all know that it may take variable time to get read the source from web (because of the network conditions and similar issues).
So to speed up the code i can take an approach of worker threads which will introduce an immediate concurrency;
So a defined number of worker threads can take a feed & parse concurrently (which will for sure speed up the whole the process - as we'll see lesser impact of waiting for data over the net).
This is all okay until the point that, my target audience of the project mostly runs multi-core cpus -- because of the fact that they're gamers --.
I want to also utilize these cores while processing the content so started reading on the potential parallelism http://oreilly.com/catalog/0790145310262. I've still not finished reading it yet and don't know if this is already discusses but i'm quite a bit obsessed with this and wanted to ask over stackoverflow to get an overall idea.
So as the book describes potential parallelism: Potential Parallelism means that your program is written so that it runs faster when parallel hardware is available and roughly the same as an equivalent sequential program when it's not.
So the real question is, while i'm using worker threads for concurrency, can i still use possible parallelism? (running my feed parsers on worker threads and still distributing them to cpu cores -- if the cpu supports multi-cores of course)
I think it's more useful to think about IO-bound work and CPU-bound work; threads can help with both.
For IO-bound work you are presumably waiting for external resources (in your case, feeds to be read). If you must wait on multiple external resources then it only makes sense to wait on them in parallel rather than wait on them one after the other. This is best done by spinning up threads which block on the IO.
For CPU-bound work you want to use all of your cores to maximize the throughput of completing that work. To do that, you should create a pool of worker threads roughly the same size as your number of cores and break up and distribute the work across them. [How you break up and distribute the work is itself an interesting problem.]
In practice, I find that most applications have both of these problems and it makes sense to use threads to solve both kinds of problems.
Okay it seems i was greatly mistaken by the books description on possible parallelism. Thanks to answers i was able to figure out things;
From msdn: http://msdn.microsoft.com/en-us/library/dd460717(VS.100).aspx
The Task Parallel Library (TPL) is a
set of public types and APIs in the
System.Threading and
System.Threading.Tasks namespaces in
the .NET Framework version 4. The
purpose of the TPL is to make
developers more productive by
simplifying the process of adding
parallelism and concurrency to
applications. The TPL scales the
degree of concurrency dynamically to
most efficiently use all the
processors that are available. In
addition, the TPL handles the
partitioning of the work, the
scheduling of threads on the
ThreadPool, cancellation support,
state management, and other low-level
details. By using TPL, you can
maximize the performance of your code
while focusing on the work that your
program is designed to accomplish.
So basically it means TPL can handle all the details of concurrency via threading and also supports parallelism on multi-cores.
Related
I'm implementing several small services, each of which uses entity-framework to store certain (but little) data. They also have a fair bit of business-logic so it makes sense to separate them from one another.
I'm certainly aware that async-methods and the async-await pattern itself can solve many problems in regards to performance especially when it comes to any I/O or cpu-intensive operations.
I'm uncertain wether to use the async-methods of entity-framework logic (e.g. SaveChangesAsync or FirstOrDefaultAsync) because I can't find metrics that say "now you do it, and now you don't" besides from "Is it I/O or CPU-Intensive or not?".
What I've found when researching this topic (not limited to this but these are showing the problem):
not using it can lead to your application stopping to respond because the threads (not the ones of the cpu, but virtual threads of the os) can run out because of the in that case blocking i/o calls to the database.
using it bloats your code and decreases performance because of the context-switches at every method. Especially when I apply those to entity-framework calls it means that I have at least three context switches for one call from controller to business-logic to the repository to the database.
What I don't know, and that's what I would like to know from you:
How many virtual os threads are there? Or to be more precise: If I expect my application and server to be able to handle 100 requests to this service within five seconds (and I don't expect them to be more, 100 is already exagerated), should I back away from using async/await there?
What are the precise metrics that I could look at to answer this question for any of my services?
Or should I rather always use async-methods for I/O calls because they are already there and it could always happen that the load-situation on my server changes and there's so much going on that the async-methods would help me a great deal with that?
I'm certainly aware that async-methods and the async-await pattern itself can solve many problems in regards to performance especially when it comes to any I/O or cpu-intensive operations.
Sort of. The primary benefit of asynchronous code is that it frees up threads. UI apps (i.e., desktop/mobile) manifest this benefit in more responsive user interfaces. Services such as the ones you're writing manifest this benefit in better scalability - the performance benefits are only visible when under load. Also, services only receive this benefit from I/O operations; CPU-bound operations require a thread no matter what, so using await Task.Run on service applications doesn't help at all.
not using it can lead to your application stopping to respond because the threads (not the ones of the cpu, but virtual threads of the os) can run out because of the in that case blocking i/o calls to the database.
Yes. More specifically, the thread pool has a limited injection rate, so it can only grow so far so quickly. Asynchrony (freeing up threads) helps your service handle bursty traffic and heavy load. Quote:
Bear in mind that asynchronous code does not replace the thread pool. This isn’t thread pool or asynchronous code; it’s thread pool and asynchronous code. Asynchronous code allows your application to make optimum use of the thread pool. It takes the existing thread pool and turns it up to 11.
Next question:
using it bloats your code and decreases performance because of the context-switches at every method.
The main performance drawback to async is usually memory related. There's additional structures that need to be allocated to keep track of ongoing asynchronous work. In the synchronous world, the thread stack itself has this information.
What I don't know, and that's what I would like to know from you: [when should I use async?]
Generally speaking, you should use async for any new code doing I/O-based operations (including all EF operations). The metrics-based arguments are more about cost/benefit analysis of converting to async - i.e., given an existing old synchronous codebase, at what point is it worth investing the time to convert it to async.
TLDR: Should I use async? YES!
You seem to have fallen for the most common mistake when trying to understand async/await. Async is orthogonal to multi-threading.
To answer your question, when should you the async method?
If currentContext.IsAsync && method.HasAsyncVersion
return UseAsync.Yes;
Else
return UseAsync.No;
That above is the short version.
Async/Await actually solves a few problems
Unblock UI thread
M:N threading
Multithreaded scheduling and synchronization
Interupt/Event based asynchronous scheduling
Given the large number of different use cases for async/await, the "assumptions" you state only apply to certain cases.
For example, context switching, only happens with Multi-Threading. Single-Threaded Interupt based Async actually reduces context switching by reducing blocking times and keeping the OS thread well fed with work.
Finally, your question on OS threads, is fundimentally wrong.
Firstly, OS threads each require creation of a stack (4MB of continous RAM, 100 threads means 400MB of RAM before any work is even done).
Secondly, unless you have 100 physical cores on your PC, your CPUs will have to context switch between each OS thread, resulting in the CPU stalling, whilst it loads that thread. By using M:N threading, you can keep the CPU running, by reducing the number of OS threads and instead using Green Threads (Task in dotnet).
Thirdly, not all "await" results in "async" behavior. Tasks are able to synchronously return, short-circuiting all of the "bloat".
In short, without digging really deep, it is hard to find optimization opportunities by switching from async to sync methods.
I am looking to clarify my understanding of .NET multithreading, and in particular, which .NET methods create threads which may potentially execute at the same time on different processors or cores in a multi-processor/core system.
In the .NET TPL framework you can use the methods Parallel.Invoke, or Task.Factory.StartNew to achieve some kind of parallelism.
My understanding is that in both cases .NET creates new Tasks (behind the scenes for Parallel.Invoke), which the .NET environment then allocates to managed threads behind the scenes, which are then assigned to threads, which the CPU may allocate to different cores or processors depending on the workload. The main difference between the two methods is semantics - Parallel.Invoke executes multiple tasks and waits for them to complete; Task.Factory.StartNew starts a new task in the background. In both cases, the actual work may be done on different cores or processors. As per Task Parallel Library (TPL).
I have a colleague who is convinced that only the Parallel.Invoke method allows the threads to be executed on different cores/processors, and that Task.Factory.StartNew starts a new thread but that thread will only be scheduled on one core/processor - so doesn't actually give parallelism.
I can't find any documentation or articles which explicitly state whether this is the case or not. My colleague refers me to the same articles that I am looking at, such as Task-based Asynchronous Programming, which I think validate my understanding but my colleague thinks validates his.
The documentation sometimes uses the term "parallel processing" with reference to Parallel.Invoke and "asynchronous tasks" with reference to "Task.Factory.StartNew", but as far as I understand the same thing happens in the background with regards to allocation to multiple processors/cores.
Can anyone please help to clarify the situation, if possible with links to documentation/articles.
I know this sounds like seeking resolution to an argument with a colleague, but I am genuinely looking to clarify whether or not I understand this correctly.
It's actually pretty easy to answer.
Task.Run()
Queues the specified work to run on the ThreadPool ....
Task Parallel Library
... In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, ....
Using the same ThreadPool how is it possible for the ThreadPool to determine the type of task in order to limit the CPU? Either they both run on all Processors or the they all run one a Single Processor.
Extra Credit:
This begs the question, Is the ThreadPool Multi-Core aware?
The answer is surprisingly, it doesn't care. The ThreadPool asks the operating system (just like any c# application that uses new Thread()) for a Thread, it actually the responsibility of the OS. I think it would be pretty clear by now that with all the abstraction that even suggesting that C# can by default limit how threads are used is a pretty ridiculous assertion. (Yes you can run a thread on whatever core you want etc etc, but that is not how the ThreadPool works by default).
I highly recommend reading StartNew is Dangerous... TLDR? Use Task.Run().
Although operating systems sometimes provide for "processor affinity," this is an edge-case and its use (or availability) is quite rare. So far as I am aware, .NET does not make any use of such things.
Your foundation assumption must always be: "a runnable thread/process will run where it damn well pleases," and it might switch from one CPU resource to another at any time. The .NET framework makes things a whole lot "nicer" for you in a lot of ways, but the underlying scheduling decisions are still being made – exclusively – by the host operating system.
What is a multithreading program and how does it work exactly? I read some documents but I'm confused. I know that code is executed line by line, but I can't understand how the program manages this.
A simple answer would be appreciated.c# example please (only animation!)
What is a multi-threading program and how does it work exactly?
Interesting part about this question is complete books are written on the topic, but still it is elusive to lot of people. I will try to explain in the order detailed underneath.
Please note this is just to provide a gist, an answer like this can never do justice to the depth and detail required. Regarding videos, best that I have come across are part of paid subscriptions (Wintellect and Pluralsight), check out if you can listen to them on trial basis, assuming you don't already have the subscription:
Wintellect by Jeffery Ritcher (from his Book, CLR via C#, has same chapter on Thread Fundamentals)
CLR Threading by Mike Woodring
Explanation Order
What is a thread ?
Why were threads introduced, main purpose ?
Pitfalls and how to avoid them, using Synchronization constructs ?
Thread Vs ThreadPool ?
Evolution of Multi threaded programming API, like Parallel API, Task API
Concurrent Collections, usage ?
Async-Await, thread but no thread, why they are best for IO
What is a thread ?
It is software implementation, which is purely a Windows OS concept (multi-threaded architecture), it is bare minimum unit of work. Every process on windows OS has at least one thread, every method call is done on the thread. Each process can have multiple threads, to do multiple things in parallel (provided hardware support).
Other Unix based OS are multi process architecture, in fact in Windows, even the most complex piece of software like Oracle.exe have single process with multiple threads for different critical background operations.
Why were threads introduced, main purpose ?
Contrary to the perception that concurrency is the main purpose, it was robustness that lead to the introduction of threads, imagine every process on Windows is running using same thread (in the initial 16 bit version) and out of them one process crash, that simply means system restart to recover in most of the cases. Usage of threads for concurrent operations, as multiple of them can be invoked in each process, came in picture down the line. In fact it is even important to utilize the processor with multiple cores to its full ability.
Pitfalls and how to avoid using Synchronization constructs ?
More threads means, more work completed concurrently, but issue comes, when same memory is accessed, especially for Write, as that's when it can lead to:
Memory corruption
Race condition
Also, another issue is thread is a very costly resource, each thread has a thread environment block, Kernel memory allocation. Also for scheduling each thread on a processor core, time is spent for context switching. It is quite possible that misuse can cause huge performance penalty, instead of improvement.
To avoid Thread related corruption issues, its important to use the Synchronization constructs, like lock, mutex, semaphore, based on requirement. Read is always thread safe, but Write needs appropriate Synchronization.
Thread Vs ThreadPool ?
Real threads are not the ones, we use in C#.Net, that's just the managed wrapper to invoke Win32 threads. Challenge remain in user's ability to grossly misuse, like invoking lot more than required number of threads, assigning the processor affinity, so isn't it better that we request a standard pool to queue the work item and its windows which decide when the new thread is required, when an already existing thread can schedule the work item. Thread is a costly resource, which needs to be optimized in usage, else it can be bane not boon.
Evolution of Multi threaded programming, like Parallel API, Task API
From .Net 4.0 onward, variety of new APIs Parallel.For, Parallel.ForEach for data paralellization and Task Parallelization, have made it very simple to introduce concurrency in the system. These APIs again work using a Thread pool internally. Task is more like scheduling a work for sometime in the future. Now introducing concurrency is like a breeze, though still synchronization constructs are required to avoid memory corruption, race condition or thread safe collections can be used.
Concurrent Collections, usage ?
Implementations like ConcurrentBag, ConcurrentQueue, ConcurrentDictionary, part of System.Collections.Concurrent are inherent thread safe, using spin-wait and much easier and quicker than explicit Synchronization. Also much easier to manage and work. There's another set API like ImmutableList System.Collections.Immutable, available via nuget, which are thread safe by virtue of creating another copy of data structure internally.
Async-Await, thread but no thread, why they are best for IO
This is an important aspect of concurrency meant for IO calls (disk, network), other APIs discussed till now, are meant for compute based concurrency so threads are important and make it faster, but for IO calls thread has no use except waiting for the call to return, IO calls are processed on hardware based queue IO Completion ports
A simple analogy might be found in the kitchen.
You've probably cooked using a recipe before -- start with the specified ingredients, follow the steps indicated in the recipe, and at the end you (hopefully) have a delicious dish ready to eat. If you do that, then you have executed a traditional (non-multithreaded) program.
But what if you have to cook a full meal, which includes a number of different dishes? The simple way to do it would be to start with the first recipe, do everything the recipe says, and when it's done, put the finished dish (and the first recipe) aside, then start on the second recipe, do everything it says, put the second dish (and second recipe) aside, and so on until you've gone through all of the recipes one after another. That will work, but you might end up spending 10 hours in the kitchen, and of course by the time the last dish is ready to eat, the first dish might be cold and unappetizing.
So instead you'd probably do what most chefs do, which is to start working on several recipes at the same time. For example, you might put the roast in the oven for 45 minutes, but instead of sitting in front of the oven waiting 45 minutes for the roast to cook, you'd spend the 45 minutes chopping the vegetables. When the oven timer rings, you put down your vegetable knife, pull the cooked roast out of the oven and let it cool, then go back to chopping vegetables, and so on. If you can do that, then you are successfully multitasking several recipes/programs. That is, you aren't literally working on multiple recipes at once (you still have only two hands!), but you are jumping back and forth from following one recipe to following another whenever necessary, and thereby making progress on several tasks rather than twiddling your thumbs a lot. Do this well and you can have the whole meal ready to eat in a much shorter amount of time, and everything will be hot and fresh at about the same time too. If you do this, you are executing a simple multithreaded program.
Then if you wanted to get really fancy, you might hire a few other chefs to work in the kitchen at the same time as you, so that you can get even more food prepared in a given amount of time. If you do this, your team is doing multiprocessing, with each chef taking one part of the total work and all of them working simultaneously. Note that each chef may well be working on multiple recipes (i.e. multitasking) as described in the previous paragraph.
As for how a computer does this sort of thing (no more analogies about chefs), it usually implements it using a list of ready-to-run threads and a timer. When the timer goes off (or when the thread that is currently executing has nothing to do for a while, because e.g. it is waiting to load data from a slow hard drive or something), the operating system does a context switch, in which pauses the current thread (by putting it into a list somewhere and no longer executing instructions from that thread's code anymore), then pulls another ready-to-run thread from the list of ready-to-run threads and starts executing instructions from that thread's code instead. This repeats for as long as necessary, often with context switches happening every few milliseconds, giving the illusion that multiple programs are running "at the same time" even on a single-core CPU. (On a multi-core CPU it does this same thing on each core, and in that case it's no longer just an illusion; multiple programs really are running at the same time)
Why don't you refer to Microsoft's very own documentation of the .net class System.Threading.Thread?
It has a handfull of simple example programs written in C# (at the bottom of the page) just as you asked for:
Thread Examples
actually multi thread is do multiple process at the same time together . and you can complete process parallel .
it's actually multi thread is do multiple process at the same time together . and you can complete process parallel . you can take task from your main thread then execute some other way and done .
I have an application in C# with a list of work to do. I'm looking to do as much of that work as possible in parallel. However I need to be able to control the maximum amount of parallel tasks.
From what I understand this is possible with a ThreadPool or with Tasks. Is there an difference in which one I use? My main concern is being able to control how many threads are active at one time.
Please take a look at ParallelOptions.MaxDegreeOfParallelism for Tasks.
I would advise you to use Tasks, because they provide a higher level abstraction than the ThreadPool.
A very good read on the topic can be found here. Really, a must-have book and it's free on top of that :)
In TPL you can use the WithDegreeOfParallelism on a ParallelEnumerable or ParallelOptions.MaxDegreeOfParallism
There is also the CountdownEvent which may be a better option if you are just using custom threads or tasks.
In the ThreadPool, when you use SetMaxThreads its global for the AppDomain so you could potentially be limiting unrelated code unnecessarily.
You cannot set the number of worker threads or the number of I/O completion threads to a number smaller than the number of processors in the computer.
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
Use caution when changing the maximum number of threads in the thread pool. While your code might benefit, the changes might have an adverse effect on code libraries you use.
Setting the thread pool size too large can cause performance problems. If too many threads are executing at the same time, the task switching overhead becomes a significant factor.
I agree with the other answer that you should use TPL over the ThreadPool as its a better abstraction of multi-threading, but its possible to accomplish what you want in both.
In this article on msdn, they explain why they recommend Tasks instead of ThreadPool for Parallelism.
Task have a very charming feature to me, you can build chains of tasks. Which are executed on certain results of the task before.
A feature I often use is following: Task A is running in background to do some long running work. I chain Task B after it, only executing when Task A has finished regulary and I configure it to run in the foreground, so I can easily update my controls with the result of long running Task A.
You can also create a semaphore to control how many threads can execute at a single time. You can create a new semaphore and in the constructor specify how many simultaneous threads are able to use that semaphore at a single time. Since I don't know how you are going to be using the threads, this would be a good starting point.
MSDN Article on the Semaphore class
-Wesley
I used multiple threads in a few programs, but still don't feel very comfortable about it.
What multi-threading libraries for C#/.NET are out there and which advantages does one have over the other?
By multi-threading libraries I mean everything which helps make programming with multiple threads easier.
What .NET integratet (i.e. like ThreadPool) do you use periodically?
Which problems did you encounter?
There are various reasons for using multiple threads in an application:
UI responsiveness
Concurrent operations
Parallel speedup
The approach one should choose depends on what you're trying to do. For UI responsiveness, consider using BackgroundWorker, for example.
For concurrent operations (e.g. a server: something that doesn't have to be parallel, but probably does need to be concurrent even on a single-core system), consider using the thread pool or, if the tasks are long-lived and you need a lot of them, consider using one thread per task.
If you have a so-called embarrassingly parallel problem that can be easily divided up into small subproblems, consider using a pool of worker threads (as many threads as CPU cores) that pull tasks from a queue. The Microsoft Task Parallel Library (TPL) may help here. If the job can be easily expressed as a monadic stream computation (i.e. with a query in LINQ with work in transformations and aggregations etc.), Parallel LINQ (same link) which runs on top of TPL may help.
There are other approaches, such as Actor-style parallelism as seen in Erlang, which are harder to implement efficiently in .NET because of the lack of a green threading model or means to implement same, such as CLR-supported continuations.
I like this one
http://www.codeplex.com/smartthreadpool
Check out the Power Threading library.
I have written a lot of threading code in my days, even implemented my own threading pool & dispatcher. A lot of it is documented here:
http://web.archive.org/web/20120708232527/http://devplanet.com/blogs/brianr/default.aspx
Just realize that I wrote these for very specific purposes and tested them in those conditions, and there is no real silver-bullet.
My advise would be to get comfortable with the thread pool before you move to any other libraries. A lot of the framework code uses the thread pool, so even if you happen to find The Best Threads Library(TM), you will still have to work with the thread pool, so you really need to understand that.
You should also keep in mind that a lot of work has been put into implementing the thread pool and tuning it. The upcoming version of .NET has numerous improvements triggered by the development the parallel libraries.
In my point of view many of the "problems" with the current thread pool can be amended by knowing its strengths and weaknesses.
Please keep in mind that you really should be closing threads (or allowing the threadpool to dispose) when you no longer need them, unless you will need them again soon. The reason I say this is that each thread requires stack memory (usually 1mb), so when you have applications sitting on threads but not using them, you are wasting memory.
For exmaple, Outlook on my machine right now has 20 threads open and is using 0% CPU. That is simply a waste of (a least) 20mb of memory. Word is also using another 10 threads with 0% CPU. 30mb may not seem like much, but what if every application was wasting 10-20 threads?
Again, if you need access to a threadpool on a regular basis then you don't need to close it (creating/destroying threads has an overhead).
You don't have to use the threadpool explicitly, you can use BeginInvoke-EndInvoke if you need async calls. It uses the threadpool behind the scenes. See here: http://msdn.microsoft.com/en-us/library/2e08f6yc.aspx
You should take a look at the Concurrency & Coordination Runtime. The CCR can be a little daunting at first as it requires a slightly different mind set. This video has a fairly good job of explanation of its workings...
In my opinion this would be the way to go, and I also hear that it will use the same scheduler as the TPL.
For me the builtin classes of the Framework are more than enough. The Threadpool is odd and lame, but you can write your own easily.
I often used the BackgroundWorker class for Frontends, cause it makes life much easier - invoking is done automatically for the eventhandlers.
I regularly start of threads manually and safe them in an dictionary with a ManualResetEvent to be able to examine who of them has ended already. I use the WaitHandle.WaitAll() Method for this. Problem there is, that WaitHandle.WaitAll does not acceppt Arrays with more than 64 WaitHandles at once.
You might want to look at the series of articles about threading patterns. Right now it has sample codes for implementing a WorkerThread and a ThreadedQueue.
http://devpinoy.org/blogs/jakelite/archive/tags/Threading+Patterns/default.aspx