Suppose I have a list of items that are currently processed in a normal foreach loop. Assume the number of items is significantly larger than the number of cores. How much time should each item take, as a rule of thumb, before I should consider refactoring the for-loop into a Parallel.ForEach?
This is one of the core problems of parallel programming. For an accurate answer you would still have to measure in your exact situation.
The big advantage of the TPL, however, is that the threshold is a lot smaller than it used to be, and that you're not punished (as much) when your work items are too small.
I once made a demo with 2 nested loops and I wanted to show that only the outer one should be made to run in parallel. But the demo failed to show a significant disadvantage of turning both into a Parallel.For().
So if the code in your loop is independent, go for it.
The #items / #cores ratio is not very relevant; the TPL will partition the ranges and use the 'right' number of threads.
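As an illustration, here is a minimal before/after sketch; the item collection and the ProcessItem method are made up for this example, and only the loop construct matters:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // Hypothetical stand-in for your per-item work.
    static void ProcessItem(int item) { /* CPU-bound work here */ }

    static void Main()
    {
        var items = Enumerable.Range(0, 100_000).ToList();

        // Sequential baseline:
        // foreach (var item in items) ProcessItem(item);

        // Parallel version -- safe only if ProcessItem shares no mutable state:
        Parallel.ForEach(items, ProcessItem);
    }
}
```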
On a large data-processing project I'm working on, any loop that contained more than two or three statements benefited greatly from Parallel.ForEach. If the data your loop is working on is atomic, then I see very little downside compared to the tremendous benefit the Parallel library offers.
I have some basic questions about Parallel.ForEach with the partition approach, and I'm facing some problems with it, so I'd like to understand how this code works and what its flow is.
Code sample
var result = new StringBuilder();
Parallel.ForEach(Enumerable.Range(1, 5), () => new StringBuilder(), (x, option, sb) =>
{
    sb.Append(x);
    return sb;
}, sb =>
{
    lock (result)
    {
        result.Append(sb.ToString());
    }
});
Questions related to the code above:
Is some partitioning work being done inside Parallel.ForEach?
When I debug the code, I can see that the code executes more than 5 times, but as I understand it, it is supposed to fire only 5 times - Enumerable.Range(1, 5).
When will this code be fired? In both Parallel.ForEach and Parallel.For there are two blocks separated by {}. How do these two blocks execute and interact with each other?
lock (result)
{
result.Append(sb.ToString());
}
Bonus Q:
See this block of code, where 5 iterations are not occurring; rather, more iterations are taking place when I use Parallel.For instead of Parallel.ForEach. See the code and tell me where I made the mistake.
var result = new StringBuilder();
Parallel.For(1, 5, () => new StringBuilder(), (x, option, sb) =>
{
    sb.Append("line " + x + System.Environment.NewLine);
    MessageBox.Show("aaa" + x.ToString());
    return sb;
}, sb =>
{
    lock (result)
    {
        result.Append(sb.ToString());
    }
});
There are several misunderstandings regarding how Parallel.XYZ works.
A couple of great points and suggestions have been mentioned in the comments, so I won't repeat them. Rather, I would like to share some thoughts about parallel programming.
The Parallel Class
Whenever we are talking about parallel programming we usually distinguish two kinds: data parallelism and task parallelism. The former executes the same function(s) over chunks of data in parallel. The latter executes several independent functions in parallel.
(There is also a 3rd model, called pipeline, which is kind of a mixture of these two. I won't spend time on it; if you are interested in it, I would suggest searching for the Task Parallel Library's Dataflow or System.Threading.Channels.)
The Parallel class supports both models. For and ForEach are designed for data parallelism, while Invoke is for task parallelism.
Partitioning
In the case of data parallelism, the tricky part is how you slice your data to get the best throughput/performance. You have to take into account the size of the data collection, the structure of the data, the processing logic and the available cores (and many other aspects as well). So there is no one-rule-for-all suggestion.
The main concern about partitioning is to not under-use the resources (some cores are idle while others are working hard) nor over-use them (there are way more waiting jobs than available cores, so the synchronization overhead can be significant).
Let's suppose your processing logic is firmly stable (in other words, varying input data will not significantly change the processing time). In this case you can load-balance the data between the executors. If an executor finishes, it can grab a new piece of data to process.
The way you choose which data should go to which executor can be defined by a Partitioner. By default .NET supports range, chunk, hash and striped partitioning. Some are static (the partitioning is done before any processing) and some are dynamic (depending on the processing speed, some executors might receive more data than others).
The following two excellent articles can give you better insight into how each kind of partitioning works:
Dixin's blog
Nima's blog
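As a sketch of one of these strategies, range partitioning can be requested explicitly via Partitioner.Create; each worker then receives a contiguous [from, to) slice, which cuts per-element delegate overhead. The summing workload here is just a stand-in:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PartitionDemo
{
    static void Main()
    {
        // Static range partitioning over [0, 1_000_000).
        var rangePartitioner = Partitioner.Create(0, 1_000_000);

        long total = 0;
        Parallel.ForEach(rangePartitioner, range =>
        {
            // Each partition sums its own slice without synchronization...
            long localSum = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                localSum += i;
            // ...and synchronizes only once per slice.
            Interlocked.Add(ref total, localSum);
        });

        Console.WriteLine(total); // 499999500000
    }
}
```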
Thread Safety
If each executor can execute its processing task without needing to interact with the others, then they are considered independent. If you can design your algorithm to have independent processing units, then you minimize the synchronization.
In the case of For and ForEach, each partition can have its own partition-local storage. That means the computations are independent, because the intermediate results are stored in partition-aware storage. But as usual, you want to merge these into a single collection or even into a single value.
That's the reason why these Parallel methods have body and localFinally parameters. The former is used to define the individual processing, while the latter is the aggregate-and-merge function. (It is somewhat similar to the Map-Reduce approach.) In the latter you have to take care of thread safety yourself.
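A minimal sketch of this body/localFinally shape, mirroring the question's code but summing numbers so the Map-Reduce roles are easier to see; only the localFinally step needs synchronization:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class LocalStateDemo
{
    static void Main()
    {
        long total = 0;

        Parallel.For(0, 1_000_000,
            () => 0L,                                    // localInit: one accumulator per partition
            (i, state, local) => local + i,              // body: lock-free, works on partition-local state
            local => Interlocked.Add(ref total, local)); // localFinally: merge once per partition

        Console.WriteLine(total); // 499999500000
    }
}
```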
PLINQ
I don't want to explore this topic, which is outside the scope of the question. But I would like to give you a pointer on where to get started:
MS Whitepaper about when to use Parallel and when to use PLINQ
Common pitfalls of PLINQ
Useful resources
Joe Albahari's Parallel Programming
BlackWasp's Parallel Programming
EDIT: How to decide whether it's worth running in parallel?
There is no single formula (at least to my knowledge) that will tell you when it makes sense to use parallel execution. As I tried to highlight in the Partitioning section, it is quite a complex topic, so several experiments and some fine-tuning are needed to find the optimal solution.
I highly encourage you to measure and try several different settings.
Here is my guideline how you should tackle this:
Try to understand the current characteristics of your application
Perform several different measurements to spot the execution bottleneck
Capture the current solution's performance metrics as your baseline
If it is possible, try to extract that piece of code from the code base to ease the fine-tuning
Try to tackle the same problem from several different angles and with various inputs
Measure them and compare them to your baseline
If you are satisfied with the result then put that piece of code into your code base and measure again under different workloads
Try to capture as many relevant metrics as you can
If it is possible, consider executing both (sequential and parallel) solutions and comparing their results.
If you are satisfied then get rid of the sequential code
Details
There are several really good tools that can give you insight into your application. For .NET profiling I would encourage you to give CodeTrack a try. Concurrency Visualizer is also a good tool if you don't need custom metrics.
By several measurements I mean that you should measure several times with several different tools to exclude special circumstances. If you measure only once, you can get a false positive result. So, measure twice, cut once.
Your sequential processing should serve as the baseline. Because over-parallelization can cause a certain overhead, it makes sense to be able to compare your shiny new solution with the current one. Under-utilization can also cause significant performance degradation.
If you can extract your problematic code, then you can perform micro-benchmarks. I encourage you to take a look at the awesome BenchmarkDotNet tool to create benchmarks.
The same problem can be solved in many ways. So try to find several different approaches (for example, Parallel and PLINQ can be used for more or less the same problems).
As I said earlier: measure, measure and measure. You should also keep in mind that .NET tries to be smart. For example, AsParallel does not give you a guarantee that the query will run in parallel: .NET analyzes your solution and data structures and decides how to run it. On the other hand, you can enforce parallel execution if you are certain that it will help.
There are libraries like Scientist.NET which can help you perform this sort of parallel run-and-compare process.
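For example, a PLINQ query can be forced to run in parallel with WithExecutionMode, overriding PLINQ's own cost analysis; the square-computation workload is made up for illustration:

```csharp
using System;
using System.Linq;

class ForceParallelDemo
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 1000);

        // PLINQ may decide to run sequentially if it judges the query too cheap.
        // WithExecutionMode overrides that analysis and forces parallel execution.
        var squares = numbers
            .AsParallel()
            .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
            .Select(n => n * n)
            .ToArray();

        Console.WriteLine(squares.Length); // 1000
    }
}
```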
Enjoy :D
I have a persistent B+ tree; multiple threads are reading different chunks of the tree and performing some operations on the read data. The interesting part: each thread produces a set of results, and as the end user I want to see all the results in one place. What I do: one ConcurrentDictionary, and all threads write to it.
Everything works smoothly this way. But the application is time-critical; one extra second means total dissatisfaction. Because of the thread-safety overhead, ConcurrentDictionary is intrinsically slow compared to Dictionary.
I can use Dictionary; then each thread will write its results to a distinct dictionary. But then I'll have the problem of merging the different dictionaries.
My Questions:
Are concurrent collections a good decision for my scenario?
If not (1), how would I optimally merge the different dictionaries? Given that (a) copying items one by one and (b) LINQ are known solutions and are not as optimal as expected :)
If not (2) ;-) what would you suggest instead?
A quick info:
#Threads = processorCount. The application can run on a standard laptop (i.e., 4 threads) or a high-end server (i.e., up to 32 threads).
Item Count. The tree usually holds more than 1.0E+12 items.
From your timings it seems that the locking/building of the result dictionary is taking 3700ms per thread with the actual processing logic taking just 300ms.
I suggest that as an experiment you let each thread create its own local dictionary of results. Then you can see how much time is spent building the dictionary compared to how much is the effect of locking across threads.
If building the local dictionary adds more than 300ms, then it will not be possible to meet your time limit, because even without any locking or any attempt to merge the results it will already have taken too long.
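A sketch of that experiment, with a hypothetical per-chunk computation standing in for the real tree processing: each thread fills its own plain Dictionary with no locking, and the merge happens once, single-threaded, at the end, so the two costs can be timed separately:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class MergeDemo
{
    static void Main()
    {
        int threadCount = Environment.ProcessorCount;
        var localResults = new Dictionary<long, long>[threadCount];

        // Each worker fills its own plain Dictionary -- no locking during processing.
        Parallel.For(0, threadCount, t =>
        {
            var local = new Dictionary<long, long>();
            for (long i = t; i < 1_000_000; i += threadCount)
                local[i] = i * 2; // stand-in for the real per-chunk computation
            localResults[t] = local; // distinct slot per thread, so this is safe
        });

        // Single-threaded post-processing merge; time this separately.
        var merged = new Dictionary<long, long>();
        foreach (var local in localResults)
            foreach (var kv in local)
                merged[kv.Key] = kv.Value;

        Console.WriteLine(merged.Count); // 1000000
    }
}
```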
Update
It seems that you can either pay the merge price as you go along, with the locking causing the threads to sit idle for a significant percentage of time, or pay the price in a post-processing merge. But the core problem is that the locking means you are not fully utilising the available CPU.
The only real solution for getting maximum performance from your cores is to use a non-blocking dictionary implementation that is also thread-safe. I could not find a .NET implementation, but I did find a research paper detailing an algorithm which indicates that it is possible.
Implementing such an algorithm correctly is not trivial but would be fun!
Scalable and Lock-Free Concurrent Dictionaries
Have you considered asynchronous persistence?
Is it allowed in your scenario?
You can push the results onto a queue handled by a separate thread pool (creating a thread pool avoids the overhead of creating a (sub)thread for each request), and there you can handle the merging logic without affecting response time.
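A minimal sketch of this idea using BlockingCollection (the batch contents are made up): workers enqueue finished result batches, and a single consumer performs the merge, so the merge cost never blocks the processing threads:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class AsyncMergeDemo
{
    static void Main()
    {
        var queue = new BlockingCollection<Dictionary<int, int>>();
        var merged = new Dictionary<int, int>();

        // Single consumer: the only thread touching 'merged' until we Wait() below.
        var consumer = Task.Run(() =>
        {
            foreach (var batch in queue.GetConsumingEnumerable())
                foreach (var kv in batch)
                    merged[kv.Key] = kv.Value;
        });

        // Producers: build local batches, hand them off, keep processing.
        Parallel.For(0, 4, t =>
        {
            var batch = new Dictionary<int, int>();
            for (int i = t * 1000; i < (t + 1) * 1000; i++)
                batch[i] = i; // stand-in workload
            queue.Add(batch);
        });

        queue.CompleteAdding();
        consumer.Wait();
        Console.WriteLine(merged.Count); // 4000
    }
}
```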
I'm using a parallel for loop in my code to run a long running process on a large number of entities (12,000).
The process parses a string and goes through a number of input files (I've read that, given the amount of IO involved, the benefits of threading could be questionable, but it seems to have sped things up elsewhere), then outputs a matched result.
Initially, the process goes quite quickly; however, it ends up slowing to a crawl. It's possible that it has just hit a number of particularly tricky pieces of input data, but this seems unlikely on closer inspection.
Within the loop, I added some debug code that prints "Started Processing: " and "Finished Processing: " when it begins/ends an iteration and then wrote a program that pairs a start and a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched IDs, it looks like the program is processing in excess of 400 different entities at once. This seems like, given the large amount of IO, it could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched IDs properly, or is there some clever stuff going on behind the scenes that I'm missing, or even something obvious?
If you agree that what I've spotted is correct, how can I limit the number it spins off and runs at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.
Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the thread processing that data is being de-prioritized. This could be due to IO or to the thread pool trying to optimize; however, if you are strongly IO-bound then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of concurrent tasks to a reasonable number. I would suggest trial and error to determine the optimal degree of parallelism, starting around the number of processor cores you have.
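A sketch of such a cap; the per-entity work is elided, and ProcessorCount is just a starting point for the trial-and-error tuning mentioned above:

```csharp
using System;
using System.Threading.Tasks;

class ThrottleDemo
{
    static void Main()
    {
        var options = new ParallelOptions
        {
            // Cap concurrency so the loop can't flood the disk with IO.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 12_000, options, i =>
        {
            // ProcessEntity(i); // stand-in for the IO-heavy per-entity work
        });
    }
}
```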
Good luck!
Let me start by confirming that it is indeed a very bad idea to read 2 files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone whatever number your whole process is using.
The use of parallelism serves to optimize processing using an actually parallelizable resource, which is CPU power. If your parallelized process reads from a hard drive, you're losing most of the benefit.
And even then, CPU power is not open to infinite parallelization. A normal desktop CPU can run somewhere around 10 threads at the same time (it depends on the model, obviously, but that's the order of magnitude).
So, two things:
First, I am going to assume that your entities use all of your files, but that your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e., into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache to avoid rereading your files every time you need them, and your hard drive's cache is far smaller than your memory (roughly 1000-fold).
Second, you shouldn't be running an unbounded Parallel.For over 12,000 items. Parallel.For will queue up 12,000 work items, and when each item blocks on IO the scheduler ends up with far more in-flight work than your roughly 10 hardware threads can usefully run, so the parallelization overhead hurts you while the CPU gains nothing.
You should probably use a more efficient method: the IEnumerable<T>.AsParallel() extension (it comes with .NET 4.0). At runtime it will determine the optimal number of threads to run, then divide your enumerable into that many batches. Basically, it does the job for you; but it creates overhead too, so it's only useful if the processing of one element is actually costly for the CPU.
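A sketch combining both suggestions: do the disk IO sequentially up front, then parallelize only the CPU-bound part over the in-memory data. The file pattern and the word-count workload are made up for illustration:

```csharp
using System;
using System.IO;
using System.Linq;

class ReadThenParallelize
{
    static void Main()
    {
        // Do all disk IO sequentially up front (hypothetical paths)...
        string[] paths = Directory.GetFiles(".", "*.txt");
        var contents = paths.Select(File.ReadAllText).ToArray();

        // ...then parallelize only the CPU-bound part over in-memory data.
        int totalWords = contents
            .AsParallel()
            .Sum(text => text.Split(new[] { ' ', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries).Length);

        Console.WriteLine(totalWords);
    }
}
```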
From my experience, using anything parallel should always be evaluated against not using it in real-life, i.e. by actually profiling your application. Don't assume it's going to work better.
I have a nested for loop.
I have replaced the first For with a Parallel.For() and the speed of calculation increased.
My question is about replacing the second (inner) for with a Parallel.For(). Will it increase the speed? Or is there no difference? Or will it be slower?
Edit:
Since the number of cores is not unlimited (there are usually 2 to 8), when the outer loop runs in parallel the inner loop is already being executed in parallel across those cores. So, if I also change the inner for to a Parallel.For(), it runs in parallel again, but I'm not sure how that changes the performance and speed.
From the "Too fine-grained, too coarse-grained" subsection of the "Anti-patterns" section in the "Patterns of Parallel Programming" book by the .NET parallel computing team:
The answer is that the best balance is found through performance testing. If the overheads of parallelization are minimal as compared to the work being done, parallelize as much as possible: in this case, that would mean parallelizing both loops. If the overheads of parallelizing the inner loop would degrade performance on most systems, think twice before doing so, as it'll likely be best only to parallelize the outer loop.
Take a look at that subsection; it is self-contained, with detailed examples from a parallel ray-tracing application. Its suggestion of flattening the loops to get a better degree of parallelism may be helpful for you too.
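A minimal sketch of that flattening idea: instead of nesting two Parallel.For loops, map a single index range onto the two dimensions so the partitioner sees Rows*Cols units of work. The grid and the per-cell computation are made up:

```csharp
using System;
using System.Threading.Tasks;

class FlattenDemo
{
    const int Rows = 1000, Cols = 1000;

    static void Main()
    {
        var grid = new double[Rows, Cols];

        // One flat index range instead of two nested parallel loops.
        Parallel.For(0, Rows * Cols, index =>
        {
            int i = index / Cols; // recover the row...
            int j = index % Cols; // ...and the column
            grid[i, j] = Math.Sqrt(i * j); // stand-in for the real cell computation
        });
    }
}
```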
It again depends on several factors:
The number of parallel threads your CPU can run.
The number of iterations.
If your CPU is a single-core processor, you will not get any benefit.
If the number of iterations is large, you will get some improvement.
If there are just a few iterations, it will be slower, as parallelization involves extra overhead.
It depends a lot on the data and the functions you use inside the for, and on the machine. I have been experimenting lately with Parallel.For and Parallel.ForEach and found that they made my apps even slower... (on a 4-core machine; if you have a 24-core server it's probably another story)
I think that managing the threads means too much overhead...
Even MS in their documentation (here is a very long PDF about it on MSDN: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=19222) admits it doesn't automatically make apps run faster. You have to try it every time; if it works, great, and if not, bad luck.
You should try it with both the outer and the inner for, but at least in the apps I tried, neither of them made the app faster. Outer or inner didn't matter much; I was just getting the same execution times or even worse.
Maybe if you use concurrent collections too, you'll get better performance. But again, without trying there is no way to tell.
EDIT:
I just found a nice link on MSDN that proved very useful (in my case) for improving Parallel.ForEach performance:
http://msdn.microsoft.com/en-us/library/dd560853.aspx
I've got a piece of code that loops through an array and looks for similar and identical strings in it, marking each one according to whether it's unique or not.
loop over array X for I
(
    loop over array X for Y
    (
        if X is a prefix of Y, do something; else if X is the same length as Y and is a prefix of it, do something else.
    )
    here is the code to finalize everything for I and the corresponding (found/not found) matches in Y.
)
I'd like to multithread this for a dual-core machine. To my knowledge it is not possible, but it's quite probable that you may have some ideas.
If the array sizes are considerable you could get some benefit from parallelizing; perhaps split the array in two and process each half in parallel. You should check out the Parallel Framework for doing the multithreading.
I understand your question to mean that you would like to know how to parallelize this code.
I think you would get a much bigger speedup from using a better algorithm: e.g., sort the array (you could do this in parallel using merge sort) and only compare adjacent entries. You can then also do the comparison easily in parallel by processing each half of the array in a separate thread.
If you need more details just let us know ...
A parallel algorithm might look like this:
Sort the list of terms alphabetically using a parallel sort algorithm
Divide the sorted list of terms into chunks, such that all interesting terms are in the same chunk. If I'm not mistaken, so long as each chunk starts with the same character, you'll never have interesting matches across two chunks (matches and prefixes will always have the same first character, right?)
For each chunk, find terms that are prefixes and/or matches, and take appropriate action. This should be easy, since matching terms will be right next to each other (since the big list is sorted, each chunk will be too).
Some notes:
This requires a parallel sort algorithm. Apparently these exist, but I don't know much else about them, since I've never had to use one directly. Your Mileage May Vary.
The second step (split the workload into chunks) doesn't appear to itself be parallelizable. You could implement it with modified binary searches to find where the first character changes, so hopefully this part is cheap, but it might not be, and you probably won't know for sure until you measure.
If you end up with many chunks, and one is by far the largest, your performance will suck.
Have you considered keeping the algorithm single-threaded, but changing it so that the first step sorts the list?
Currently the algorithm described in the question is O(n^2), as it loops through the list once per element in the list. If the list is sorted, then duplicates can be found in one pass through the list (duplicates will be right next to each other); including the sort, this is a total cost of O(n log n). For large data sets this will be much, much faster. Hopefully it will be fast enough that you can avoid multiple threads, which would be a lot of work.
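A sketch of that single-threaded sort-then-scan approach (the sample terms are made up; note that the adjacent-pair check finds all duplicates, but a prefix match is only detected against the immediately preceding term):

```csharp
using System;
using System.Linq;

class DuplicateScanDemo
{
    static void Main()
    {
        var terms = new[] { "apple", "banana", "app", "apple", "band" };

        // O(n log n): sort once, then duplicates and prefixes sit next to each other.
        var sorted = terms.OrderBy(t => t, StringComparer.Ordinal).ToArray();

        for (int i = 1; i < sorted.Length; i++)
        {
            if (sorted[i] == sorted[i - 1])
                Console.WriteLine($"duplicate: {sorted[i]}");
            else if (sorted[i].StartsWith(sorted[i - 1], StringComparison.Ordinal))
                Console.WriteLine($"prefix: {sorted[i - 1]} -> {sorted[i]}");
        }
        // prints: prefix: app -> apple
        //         duplicate: apple
    }
}
```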
I am not sure whether Parallel Extensions for .NET is your answer.
You can check it out from the download page and the project's blog.
In general the idea would be to have one thread handle half the data and the other thread handle the other half, e.g. thread one does odd indices and thread two does even ones. Unfortunately there's not really enough information about your problem to give any sort of reasonable answer, since we have no idea whether there are any dependencies between the various actions. Say, for instance, that finding a prefix match means you want to modify the next element in the array to remove its prefix. Clearly, that dependency would break the naive parallel implementation. If your actions on the data are independent, though, this should be reasonably easy to parallelize by simply dividing the work.
If the middle check is a long-running process, you might run it as a separate thread. Since you would end up with so many threads, use a thread pool with a limit of 2 threads: you shouldn't launch them all at once, but rather run some, wait for one to finish, then launch a new one, and so on.
At the end, just join() all the threads, and that's it.