I have an IEnumerable of actions, ordered in descending order by the time they will take to execute. Now I want all of them to be executed in parallel. Is there a better solution than this one?
IEnumerable<WorkItem> workItemsOrderedByTime = myFactory.WorkItems.DecendentOrderedBy(t => t.ExecutionTime);
Parallel.ForEach(workItemsOrderedByTime,
                 new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
                 t => t.Execute());
So my idea is to first execute the tasks that are most expensive in terms of the time they need to complete.
EDIT: The question is whether there is a better solution to get them all done in the minimum amount of time.
To solve your XY problem of:
Because otherwise it can happen that 9 of 10 tasks are finished, the last one is executing on one core, and all the other cores are doing nothing.
What you need to do is tell Parallel.ForEach to only take one item from the source list at a time. That way when you are down to the last items you won't have a bunch of slow work items all in a single core's queue.
This can be done by using Partitioner.Create and passing in EnumerablePartitionerOptions.NoBuffering
Parallel.ForEach(Partitioner.Create(workItems, EnumerablePartitionerOptions.NoBuffering),
                 new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
                 t => t.Execute());
1. By default there is no execution-order guarantee in Parallel.ForEach.
2. That is why your call to DecendentOrderedBy does not do anything good. It might even do something bad: if the default partitioner decides to do a range partition, dividing, say, 12 WorkItems into 4 groups of 3 items in the order of the IEnumerable, then the first core gets much more work to do, which creates exactly the problem you are trying to avoid.
3. The easy fix to (2) is explained in the answer by Scott: if Parallel.ForEach takes just one item at a time, you naturally get some load balancing. In most cases this will work fine.
4. The optimal solution (in most cases) for an ordered IEnumerable like yours would be striped partitioning, with the number of stripes equal to the number of cores. AFAIK you don't get this out of the box in .NET, but you can provide a custom OrderablePartitioner that partitions the data this way (a rough sketch of the idea follows below).
I am sorry to say it but: "No free lunch"
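If you want to approximate striped partitioning without writing a full OrderablePartitioner, here is a rough, hedged sketch of the idea: worker s handles items s, s + N, s + 2N, ... (where N is the number of cores), so the expensive items at the front of the descending-ordered list get spread across cores. It assumes the work items can be materialized into a list first.

var items = workItemsOrderedByTime.ToList();   // materialize the descending-ordered items
int stripes = Environment.ProcessorCount;

Parallel.For(0, stripes, stripe =>
{
    // each stripe walks the list with a step equal to the number of stripes
    for (int i = stripe; i < items.Count; i += stripes)
        items[i].Execute();
});

Unlike a real partitioner this gives no work stealing between stripes, so it is only a crude approximation; a proper OrderablePartitioner is still the cleaner solution.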
I need help creating a custom partitioner for PLINQ that will let me iterate over an IEnumerable whose length is greater than Int32.MaxValue.
This is in reference to this question, where the answer given to me was to write a custom partitioner:
PLINQ query giving overflow exception
I have tried messing around with the examples from this Dr. Dobb's tutorial, but I'm not sure what I need to override to use a long for the index:
http://www.drdobbs.com/windows/custom-parallel-partitioning-with-net-4/224600406?pgno=3
Maybe I'm not going about this in the right way.
I can use Parallel.ForEach all day, no matter how large the IEnumerable I'm iterating over gets.
But it's much slower than using PLINQ, which makes sense: every iteration is a set of calculations on a combination of letters/numbers, it's the same calculation for every string combination, and I've read that Parallel.ForEach isn't best suited for this. PLINQ uses chunk partitioning, and I believe that's why it gets through the iterations so much faster.
Is there a way I can tune Parallel.ForEach to give me the same speed as PLINQ but at the same time allow me to iterate over sizes greater than 2.14 billion?
ETA: Is there no solution to this? I'm starting to think so after seeing that the custom partitioner examples have nothing to do with the int-to-long issue.
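Not an answer to the custom-partitioner question itself, but one hedged workaround sketch: split the long range into int-sized chunks and let PLINQ chew through each chunk, which keeps PLINQ's chunk partitioning while never indexing past Int32.MaxValue. Compute() stands in for the per-combination calculation and the constants are made up.

const long Total = 5000000000;       // more items than int.MaxValue
const int ChunkSize = 100000000;

for (long start = 0; start < Total; start += ChunkSize)
{
    int count = (int)Math.Min(ChunkSize, Total - start);
    long chunkStart = start;         // avoid capturing the loop variable in the lambda

    ParallelEnumerable.Range(0, count)
        .ForAll(i => Compute(chunkStart + i));   // Compute is a placeholder for your work
}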
I'm investigating the Parallelism Break in a For loop.
After reading this and this I still have a question:
I'd expect this code:
Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Break();
});
to yield at most 6 numbers (0..5).
Not only does it not do that, but the output length differs between runs:
02351486
013542
0135642
Very annoying. (where the hell is Break() {after 5} here ??)
So I looked at MSDN:
Break may be used to communicate to the loop that no other iterations after the current iteration need be run.
If Break is called from the 100th iteration of a for loop iterating in
parallel from 0 to 1000, all iterations less than 100 should still be
run, but the iterations from 101 through to 1000 are not necessary.
Question #1:
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
Question #2:
Let's assume we are using Parallel + range partitioning (since there is no per-element variation in CPU cost), so it divides the data among the threads. If we have 4 cores (and a perfect division among them):
core #1 got 0..250
core #2 got 251..500
core #3 got 501..750
core #4 got 751..1000
So the thread on core #1 will hit value = 100 at some point and will break; that will be its 100th iteration. But the thread on core #4 got more quanta and is at 900 by now, way beyond its 100th iteration. It has no index less than 100 left to be stopped, so it will show them all.
Am I right? Is that the reason why I get more than 6 elements in my example?
Question #3:
How can I truly break when (i == 5)?
P.S. I mean, come on! When I do Break(), I want the loop to stop, exactly as it does in a regular for loop.
to yield at most 6 numbers (0..5).
The problem is that this won't yield at most 6 numbers.
What happens is that when you hit the iteration with index 5, you send the "break" request. Break() causes the loop to stop starting any iterations with values > 5, but it still processes all values < 5.
However, any iterations with values greater than 5 that were already started will still get processed. Since the various indices are running in parallel, they're no longer ordered, so you get runs where some values > 5 (such as 8 in your example) still get executed.
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
It is the index being passed into Parallel.For. Break() won't necessarily prevent items from being processed; it guarantees that all items with an index below 100 get processed, while items above 100 may or may not get processed.
Am I right? Is that the reason why I get more than 6 elements in my example?
Yes. If you use a partitioner like the one you've described, then as soon as you call Break(), items beyond the one where you broke will no longer get scheduled. However, anything already scheduled (which may be an entire partition) will be processed fully. In your example, this means you're likely to end up processing all 1000 items.
How can I truly break when (i == 5)?
You are - but when you run in Parallel, things change. What is the actual goal here? If you only want to process the first 6 items (0-5), you should restrict the items before you loop through them via a LINQ query or similar. You can then process the 6 items in Parallel.For or Parallel.ForEach without a Break() and without worry.
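As a hedged illustration of that suggestion (the range and the cut-off are just the ones from the question):

// Restrict the work up front, then parallelize without needing Break() at all.
var firstSix = Enumerable.Range(0, 1000).Where(i => i <= 5);   // or simply Enumerable.Range(0, 6)

Parallel.ForEach(firstSix, i =>
{
    Console.WriteLine(i);   // stand-in for the real per-item work
});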
I mean, come on! When I do Break(), I want the loop to stop, exactly as it does in a regular for loop.
You should use Stop() instead of Break() if you want things to stop as quickly as possible. This will not interrupt items that are already running, but no further items will be scheduled (including ones at lower indices or earlier in the enumeration than your current position).
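For example, a minimal sketch of the Stop() variant of the code from the question:

Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Stop();   // no further iterations are scheduled, lower or higher
});

Iterations that were already running when Stop() was called still run to completion, so the output can still contain a few extra numbers.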
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000
The 100th iteration of the loop is not necessarily (in fact probably not) the one with the index 99.
Your threads can and will run in an indeterminate order. When the .Break() instruction is encountered, no further loop iterations will be started. Exactly when that happens depends on the specifics of thread scheduling for a particular run.
I strongly recommend reading Patterns of Parallel Programming (a free PDF from Microsoft) to understand the design decisions and design tradeoffs that went into the TPL.
Which iterations? The overall iteration counter, or per thread?
Of all the iterations scheduled (or yet to be scheduled).
Remember the delegates may run out of order; there is no guarantee that the iteration with i == 5 will be the sixth to execute. In fact, that is unlikely to be the case except in rare circumstances.
Q2: Am I right?
No, the scheduling is not that simplistic. All the tasks are queued up and then the queues are processed, but each thread uses its own queue until it is empty and only then steals work from other threads. As a result, there is no way to predict which thread will process which delegate.
If the delegates are sufficiently trivial it might all be processed on the original calling thread (no other thread gets a chance to steal work).
Q3: How can I truly break when (i == 5)?
Don't use concurrency if you want strictly sequential processing.
The Break method is there to support speculative execution: try various ways and stop as soon as any one completes.
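As a hedged sketch of that speculative-search pattern (items and Matches are placeholder names, not from the question):

// Find the lowest index whose element satisfies a predicate. Iterations above a
// Break() that have not started yet will not be started.
ParallelLoopResult result = Parallel.For(0, items.Length, (i, state) =>
{
    if (Matches(items[i]))
        state.Break();
});

if (result.LowestBreakIteration.HasValue)
    Console.WriteLine("First match at index " + result.LowestBreakIteration.Value);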
This is probably a very common problem with a lot of existing answers; I was not able to find one because I am not sure how to search for it.
I have two collections of objects, both of which come from the database, and in some cases those collections are of the same object type. Further, I need to do some operations for every combination of elements from those collections. So, for example:
foreach (var a in collection1)
{
    foreach (var b in collection2)
    {
        if (a.Name == b.Name && a.Value != b.Value)
        {
            // do something with this combination
        }
        else
        {
            // do something else
        }
    }
}
This is very inefficient, and it gets slower as the number of objects in both collections grows.
What is the best way to solve this type of problem?
EDIT:
I am using .NET 4 at the moment so I am also interested in suggestions using Parallelism to speed that up.
EDIT 2:
I have added above an example of the business rules that need to be performed on each combination of objects. However, the business rules defined in the example can vary.
EDIT 3:
For example, inside the loop the following will be done:
If the business rules are satisfied (see above), a record will be created in the database with a reference to object A and object B. This is one of the operations I need to do. (Operations will be configurable from child classes using this class.)
If you really have to process every item in list b for each item in list a, then it's going to take time proportional to a.Count * b.Count. There's nothing you can do to prevent it. Adding parallel processing will give you a linear speedup, but that's not going to make a dent in the processing time if the lists are even moderately large.
How large are these lists? Do you really have to check every combination of a and b? Can you give us some more information about the problem you're trying to solve? I suspect that there's a way to bring a more efficient algorithm to bear, which would reduce your processing time by orders of magnitude.
Edit after more info posted
I know that the example you posted is just an example, but it shows that you can find a better algorithm for at least some of your cases. In this particular example, you could sort a and b by name, and then do a straight merge. Or, you could sort b into an array or list, and use binary search to look up the names. Either of those two options would perform much better than your nested loops. So much better, in fact, that you probably wouldn't need to bother with parallelizing things.
Look at the numbers. If your a has 4,000 items in it and b has 100,000 items in it, your nested loop will do 400 million comparisons (a.Count * b.Count). But sorting is only n log n, and the merge is linear. So sorting and then merging will be approximately (a.Count * 12) + (b.Count * 17) + a.Count + b.Count, or in the neighborhood of 2 million comparisons. So that's approximately 200 times faster.
Compare that to what you can do with parallel processing: only a linear speedup. If you have four cores and you get a pure linear speedup, you'll only cut your time by a factor of four. The better algorithm cut the time by a factor of 200, with a single thread.
You just need to find better algorithms.
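To make the sort-and-merge idea concrete, here is a hedged sketch; it assumes Name is a string and that names are unique within each collection (duplicate runs would need a grouped merge).

var sortedA = collection1.OrderBy(x => x.Name, StringComparer.Ordinal).ToList();
var sortedB = collection2.OrderBy(x => x.Name, StringComparer.Ordinal).ToList();

int i = 0, j = 0;
while (i < sortedA.Count && j < sortedB.Count)
{
    int cmp = string.CompareOrdinal(sortedA[i].Name, sortedB[j].Name);
    if (cmp < 0) i++;
    else if (cmp > 0) j++;
    else
    {
        if (sortedA[i].Value != sortedB[j].Value)
        {
            // matching names, differing values: apply the business rule here
        }
        i++; j++;
    }
}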
LINQ might also provide a good solution. I'm not an expert with LINQ, but it seems like it should be able to make quick work of something like this.
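For instance, a hedged sketch using a LINQ join, which builds a lookup of collection2 internally, so the overall work stays roughly linear rather than a.Count * b.Count:

var matches =
    from a in collection1
    join b in collection2 on a.Name equals b.Name
    where a.Value != b.Value
    select new { A = a, B = b };

foreach (var pair in matches)
{
    // do something with pair.A and pair.B (e.g. create the database record)
}

If the per-pair work is heavy, using collection1.AsParallel() and collection2.AsParallel() turns this into a PLINQ query with no other changes.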
If you need to check all the combinations one by one, you can't do anything better. BUT you can parallelize the loops. For example, if you are using C# 4.0 you can use a parallel foreach loop.
You can find an example here: http://msdn.microsoft.com/en-us/library/dd460720.aspx
foreach (var a in collection1)
{
    Parallel.ForEach(collection2, b =>
    {
        // do something with a and b
    });
}
In the same way you can parallelize the outer loop as well.
First of all, there is presumably a reason you are searching for a value from the first collection in the second collection.
For example, if you want to know whether a value exists in the second collection, you should put the second collection in a HashSet; this will allow you to do a fast lookup. Building the HashSet once and then querying it is O(1) per lookup, versus O(n) for looping over the collection.
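A minimal sketch of that idea, assuming the lookup is by Name:

// O(1) membership tests instead of scanning collection2 for every element of collection1.
var namesInSecond = new HashSet<string>(collection2.Select(b => b.Name));

foreach (var a in collection1)
{
    if (namesInSecond.Contains(a.Name))
    {
        // an item with the same name exists in the second collection
    }
}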
Parallel.ForEach(a, currentA => Parallel.ForEach(b, currentB =>
{
// do something with currentA and currentB
}));
When an IEnumerable needs both to be sorted and to have elements removed, are there advantages/drawbacks to performing the stages in a particular order? My performance tests appear to indicate that it's irrelevant.
A simplified (and somewhat contrived) example of what I mean is shown below:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    List<DataItem> result = this.GetDataItems().ToList();
    result.Sort(sortOrder);
    result.RemoveAll(item => !item.Display);
    return result.Take(maximum);
}
If your tests indicate it's irrelevant, then why worry about it? Don't optimize before you need to, only when it becomes a problem. If you find a problem with performance, and have used a profiler, and have found that this method is the hotspot, then you can worry more about it.
On second thought, have you considered using LINQ? Those calls could be replaced with a call to Where and OrderBy, both of which are deferred, and then calling Take, like you have in your example. The LINQ libraries should find the best way of doing this for you, and if your data size expands to the point where it takes a noticeable amount of time to process, you can use PLINQ with a simple call to AsParallel.
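A hedged sketch of that LINQ suggestion, reusing the Display flag and comparer from the question; the AsParallel() call is optional and only worth it for large inputs:

var result = this.GetDataItems()
    .AsParallel()
    .Where(item => item.Display)
    .OrderBy(item => item, sortOrder)   // the OrderBy overload that takes an IComparer<T>
    .Take(maximum);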
You might as well RemoveAll before sorting so that you'll have fewer elements to sort.
I think the Sort() method will usually have a complexity of O(n log n), and RemoveAll() just O(n), so in general it is probably better to remove the items first.
You'd want something like this:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    IEnumerable<DataItem> result = this.GetDataItems();
    return result
        .Where(item => item.Display)
        .OrderBy(item => item, sortOrder)
        .Take(maximum);
}
There are two answers that are correct, but won't teach you anything:
It doesn't matter.
You should probably do RemoveAll first.
The first is correct because you said your performance tests showed it's irrelevant. The second is correct because it will have an effect on larger datasets.
There's a third answer that also isn't very useful: Sometimes it's faster to do removals afterwards.
Again, it doesn't actually tell you anything, but "sometimes" always means there is more to learn.
There's also only so much value in saying "profile first". What if profiling shows that 90% of the time is spent doing x.Foo(), which it does in a loop? Is the problem with Foo(), with the loop or with both? Obviously if we can make both more efficient we should, but how do we reason about that without knowledge outside of what a profiler tells us?
When something happens over multiple items (which is true of both RemoveAll and Sort) there are five things (I'm sure there are more I'm not thinking of now) that will affect the performance impact:
The per-set constant costs (both time and memory). How much it costs to do things like calling the function that we pass a collection to, etc. These are almost always negligible, but there could be some nasty high cost hidden there (often because of a mistake).
The per-item constant costs (both time and memory). How much it costs to do something that we do on some or all of the items. Because this happens multiple times, there can be an appreciable win in improving them.
The number of items. As a rule the more items, the more the performance impact. There are exceptions (next item), but unless those exceptions apply (and we need to consider the next item to know when this is the case), then this will be important.
The complexity of the operation. Again, this is a matter of both time-complexity and memory-complexity, but here the chances that we might choose to improve one at the cost of another. I'll talk about this more below.
The number of simultaneous operations. This can be a big difference between "works on my machine" and "works on the live system". If a super time-efficient approach that uses 0.5GB of memory is tested on a machine with 2GB of memory available, it'll work wonderfully, but when you move it to a machine with 8GB of memory available and multiple concurrent users, it'll hit a bottleneck at 16 simultaneous operations, and suddenly what was beating other approaches in your performance measurements becomes the application's hotspot.
To talk about complexity a bit more. Time complexity is a measure of how the time taken to do something relates to the number of items it is done with, while memory complexity is a measure of how the memory used relates to that same number of items. Obtaining an item from a dictionary is O(1), or constant, because it takes the same amount of time however large the dictionary is (not strictly true; strictly it "approaches" O(1), but it's close enough for most thinking). Finding something in an already sorted list is O(log₂ n), or logarithmic. Filtering through a list is linear, or O(n). Sorting something using a quicksort (which is what Sort uses) tends to be linearithmic, or O(n log₂ n), but in its worst case - against a list already sorted - it will be quadratic, O(n²).
Considering these, with a set of 8 items, an O(1) operation will take 1k seconds to do something, where k is a constant amount of time; O(log₂ n) means 3k seconds, O(n) means 8k, O(n log₂ n) means 24k and O(n²) means 64k. These are the most commonly found, though there are plenty of others like O(n·m), which is affected by two different sizes, or O(n!), which would be 40320k.
Obviously, we want as low a complexity as possible, though since k will be different in each case, sometimes the best solution for a small set has a high complexity (but a low k constant), while a lower-complexity approach will beat it on larger input.
So. Let's go back to the cases you are considering, viz filtering followed by sorting vs. sorting followed by filtering.
Per-set constants. Since we are moving two operations around but still doing both, this will be the same either way.
Per-item constants. Again, we're still doing the same things per item in either case, so no effect.
Number of items. Filtering reduces the number of items. Therefore the sooner we filter items, the more efficient the rest of the operation. Therefore doing RemoveAll first wins in this regard.
Complexity of the operation. It's either an O(n) operation followed by an average-case O(n log₂ n), worst-case O(n²) operation, or the latter followed by the former. Same either way.
Number of simultaneous cases. Total memory pressure will be relieved the sooner we remove some items (a slight win for RemoveAll first).
So, we've got two reasons to consider RemoveAll first as likely to be more efficient and none to consider it likely to be less efficient.
We should not assume that we are 100% guaranteed to be correct here. For a start, we could simply have made a mistake in our reasoning. For another, there could be other factors we've dismissed as irrelevant that were actually pertinent. It is still true that we should profile before optimising, but reasoning about the sort of things I've mentioned above will both make us more likely to write performant code in the first place (not the same as optimising, but a matter of picking between options when readability, clarity and correctness are equal either way) and make it easier to find likely ways to improve the things that profiling has found to be troublesome.
For a slightly different but relevant case, consider if the criteria sorted on matched those removed on. E.g. if we were to sort by date and remove all items after a given date.
In this case, if the list deallocates on every removal, it'll still be O(n), but with a much smaller constant. Alternatively, if it just moved the "last-item" pointer*, it becomes O(1). Finding the pointer is O(log₂ n), so here there are reasons to think both that filtering first will be faster (the reasons given above) and that sorting first will be faster (because the removal can be made a much faster operation than it was before). With this sort of case it becomes possible to tell only by extending our profiling. It is also true that the performance will be affected by the type of data sent, so we need to profile with realistic data rather than artificial test data, and we may even find that what was the more performant choice becomes the less performant choice months later when the dataset it is used on changes. Here the ability to reason becomes even more important, because we should note the possibility that changes in real-world use may flip the result, and know that it is something we need to keep an eye on throughout the project's life.
(*Note, List<T> does not just move a last-item pointer for a RemoveRange that covers the last item, but another collection could.)
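A hedged sketch of that sort-then-truncate case, assuming a List<DataItem> named items with a Date property and a cutoff value (all hypothetical names):

// Sort by the same key we remove on, find the cut point, then drop the tail in one call.
items.Sort((x, y) => x.Date.CompareTo(y.Date));

int lo = 0, hi = items.Count;              // binary search: first index with Date > cutoff
while (lo < hi)
{
    int mid = (lo + hi) / 2;
    if (items[mid].Date <= cutoff) lo = mid + 1;
    else hi = mid;
}

items.RemoveRange(lo, items.Count - lo);   // O(log n) search plus a single range removal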
It would probably be better to do the RemoveAll first, although it would only make much of a difference if your sorting comparison were expensive to compute.
I'm experimenting with the new System.Threading.Parallel methods like parallel for and foreach.
They seem to work nicely, but I need a way to increase the number of concurrent threads being executed, which is currently 8 (I have a quad core).
I know there is a way; I just can't find where they hid the damn property.
Gilad.
quote:
var query = from item in source.AsParallel().WithDegreeOfParallelism(10)
where Compute(item) > 42
select item;
In cases where a query is performing a significant amount of non-compute-bound work such as File I/O, it might be beneficial to specify a degree of parallelism greater than the number of cores on the machine.
from: MSDN
If you are using Parallel.For or Parallel.ForEach you can specify a ParallelOptions object, which has a MaxDegreeOfParallelism property. Unfortunately this is just an upper limit, as the name suggests, and does not provide a lower-bound guarantee. For the relationship to WithDegreeOfParallelism see this blog post.
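For example, a minimal sketch (the limit of 16 and the Process/workItems names are just placeholders):

var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };   // an upper bound, not a guarantee

Parallel.ForEach(workItems, options, item =>
{
    Process(item);   // stand-in for the real per-item work
});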
MAY NOT - enough said. Blindy commented it correctly