I am evaluating the Polly library in terms of features and flexibility, and as part of the evaluation process I am trying to combine the WaitAndRetryPolicy with the BulkheadPolicy policies, to achieve a combination of resiliency and throttling. The problem is that the resulting behavior of this combination does not match my expectations and preferences. What I would like is to prioritize the retrying of failed operations over executing fresh/unprocessed operations.
The rationale is that (from my experience) a failed operation has greater chances of failing again. So if all failed operations get pushed to the end of the whole process, that last part of the whole process will be painfully slow and unproductive. Not only because these operations may fail again, but also because of the required delay between each retry, that may need to be progressively longer after each failed attempt. So what I want is that each time the BulkheadPolicy has room for starting a new operation, to choose a retry operation if there is one in its queue.
Here is an example that demonstrates the undesirable behavior I would like to fix. 10 items need to be processed. All fail on their first attempt and succeed on their second attempt, resulting in a total of 20 executions. The waiting period before retrying an item is one second. Only 2 operations should be active at any moment:
var policy = Policy.WrapAsync
(
    Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1)),
    Policy.BulkheadAsync(
        maxParallelization: 2, maxQueuingActions: Int32.MaxValue)
);

var tasks = new List<Task>();
foreach (var item in Enumerable.Range(1, 10))
{
    int attempt = 0;
    tasks.Add(policy.ExecuteAsync(async () =>
    {
        attempt++;
        Console.WriteLine($"{DateTime.Now:HH:mm:ss} Starting #{item}/{attempt}");
        await Task.Delay(1000);
        if (attempt == 1) throw new HttpRequestException();
    }));
}
await Task.WhenAll(tasks);
Output (actual):
09:07:12 Starting #1/1
09:07:12 Starting #2/1
09:07:13 Starting #3/1
09:07:13 Starting #4/1
09:07:14 Starting #5/1
09:07:14 Starting #6/1
09:07:15 Starting #8/1
09:07:15 Starting #7/1
09:07:16 Starting #10/1
09:07:16 Starting #9/1
09:07:17 Starting #2/2
09:07:17 Starting #1/2
09:07:18 Starting #4/2
09:07:18 Starting #3/2
09:07:19 Starting #5/2
09:07:19 Starting #6/2
09:07:20 Starting #7/2
09:07:20 Starting #8/2
09:07:21 Starting #10/2
09:07:21 Starting #9/2
The expected output should be something like this (I wrote it by hand):
09:07:12 Starting #1/1
09:07:12 Starting #2/1
09:07:13 Starting #3/1
09:07:13 Starting #4/1
09:07:14 Starting #1/2
09:07:14 Starting #2/2
09:07:15 Starting #3/2
09:07:15 Starting #4/2
09:07:16 Starting #5/1
09:07:16 Starting #6/1
09:07:17 Starting #7/1
09:07:17 Starting #8/1
09:07:18 Starting #5/2
09:07:18 Starting #6/2
09:07:19 Starting #7/2
09:07:19 Starting #8/2
09:07:20 Starting #9/1
09:07:20 Starting #10/1
09:07:22 Starting #9/2
09:07:22 Starting #10/2
For example, at the 09:07:14 mark the 1-second wait period of the failed item #1 has expired, so its second attempt should be prioritized over the first attempt of item #5.
An unsuccessful attempt to solve this problem is to reverse the order of the two policies. Unfortunately, putting the BulkheadPolicy before the WaitAndRetryPolicy results in reduced parallelization. What happens is that the BulkheadPolicy considers all retries of an item to be a single operation, and so the "wait" phase between two retries counts towards the parallelization limit. Obviously I don't want that. The documentation also makes it clear that the order of the two policies in my example is correct:
BulkheadPolicy: Usually innermost unless wraps a final TimeoutPolicy. Certainly inside any WaitAndRetry. The Bulkhead intentionally limits the parallelization. You want that parallelization devoted to running the delegate, not occupied by waits for a retry.
Is there any way to achieve the behavior I want, while staying in the realm of the Polly library?
I found a simple but not perfect solution to this problem. The solution is to include a second BulkheadPolicy positioned before the WaitAndRetryPolicy (in an "outer" position). This extra Bulkhead will serve only for reprioritizing the workload (by serving as an outer queue), and should have a substantially larger capacity (x10 or more) than the inner Bulkhead that controls the parallelization. The reason is that the outer Bulkhead could otherwise also affect (reduce) the parallelization in an unpredictable way, and we don't want that. This is why I consider this solution imperfect: neither is the prioritization optimal, nor is it guaranteed that the parallelization will not be affected.
Here is the combined policy of the original example, enhanced with an outer BulkheadPolicy. Its capacity is only 2.5 times larger, which is suitable for this contrived example, but too small for the general case:
var policy = Policy.WrapAsync
(
    Policy.BulkheadAsync( // For improving prioritization
        maxParallelization: 5, maxQueuingActions: Int32.MaxValue),
    Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1)),
    Policy.BulkheadAsync( // For controlling parallelization
        maxParallelization: 2, maxQueuingActions: Int32.MaxValue)
);
And here is the output of the execution:
12:36:02 Starting #1/1
12:36:02 Starting #2/1
12:36:03 Starting #3/1
12:36:03 Starting #4/1
12:36:04 Starting #2/2
12:36:04 Starting #5/1
12:36:05 Starting #1/2
12:36:05 Starting #3/2
12:36:06 Starting #6/1
12:36:06 Starting #4/2
12:36:07 Starting #8/1
12:36:07 Starting #5/2
12:36:08 Starting #9/1
12:36:08 Starting #7/1
12:36:09 Starting #10/1
12:36:09 Starting #6/2
12:36:10 Starting #7/2
12:36:10 Starting #8/2
12:36:11 Starting #9/2
12:36:11 Starting #10/2
Although this solution is not perfect, I believe it should do more good than harm in the general case, and should result in better performance overall.
I have just created a sample for multithreading using This Link, like below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 threads before Parallel.For, and after it, it gives me 17 threads. So only 2 threads are occupied by Parallel.For.
Then I created another sample using This Link, like below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options, (i, state) =>
{
    count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In the above code, I have set MaxDegreeOfParallelism to 40, but Parallel.For is still using the same number of threads.
So how can I increase the number of running threads for Parallel.For?
I am facing a problem where some numbers are skipped inside the Parallel.For when I perform some heavy and complex functionality inside it. So I want to increase the maximum number of threads and overcome the skipping issue.
What you're saying is something like: "My car is shaking when driving too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what you're actually doing in the loop. The code you showed is obviously a placeholder, but even that's wrong. So I think what you should do first is to learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make it efficient. What you need is to lock only for a short amount of time each iteration.
There are other options for achieving thread safety, including using Interlocked, overloads of Parallel.For that use thread-local data, and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
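For example, here is a minimal sketch of the Interlocked option applied to the counter from the question (a sketch, not the only correct fix):

using System;
using System.Threading;
using System.Threading.Tasks;

int count = 0;
Parallel.For(0, 50000, i =>
{
    // Atomic increment: unlike count++, no updates are lost.
    Interlocked.Increment(ref count);
});
Console.WriteLine(count); // always prints 50000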
Only after you have made sure your code is thread-safe is it time to worry about things like the number of threads. And regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will actually usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
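If you want a more direct measurement, here is a sketch that counts the distinct managed threads the loop body actually ran on, rather than counting all process threads (which include pre-existing pool threads):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

var threadIds = new ConcurrentDictionary<int, bool>();
Parallel.For(0, 50000, i =>
{
    // Record each distinct managed thread that ran at least one iteration.
    threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
});
Console.WriteLine("Distinct threads used: {0}", threadIds.Count);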
Parallel loops use hardware CPU cores. If your CPU has 2 cores, this is the maximum degree of parallelism that you can get on your machine.
Taken from MSDN:

What to Expect

By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have, the faster your loop executes, until you reach the point of diminishing returns that Amdahl's Law predicts. How much faster depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you the wrong result for summation operations without locks, since the result of each iteration depends on the single variable 'count', and the value of 'count' in a parallel loop is not predictable. However, using locks in parallel loops does not achieve actual parallelism, so you should test the parallel loop with something other than a summation.
We have an application, wherein we have a materialized array of items which we are going to process through a Reactive pipeline. It looks a little like this
EventLoopScheduler eventLoop = new EventLoopScheduler();
IScheduler concurrency = new TaskPoolScheduler(
    new TaskFactory(
        new LimitedConcurrencyLevelTaskScheduler(threadCount)));

IEnumerable<int> numbers = Enumerable.Range(1, itemCount);

// 1. transform on single thread
IConnectableObservable<byte[]> source =
    numbers.Select(Transform).ToObservable(eventLoop).Publish();

// 2. naive parallelization, restricts parallelization to Work
// only; chunk up sequence into smaller sequences and process
// in parallel, merging results
IObservable<int> final = source.
    Buffer(10).
    Select(
        batch =>
            batch.
            ToObservable(concurrency).
            Buffer(10).
            Select(
                concurrentBatch =>
                    concurrentBatch.
                    Select(Work).
                    ToArray().
                    ToObservable(eventLoop)).
            Merge()).
    Merge();

final.Subscribe();
source.Connect();
Await(final).Wait();
If you are really curious to play with this, the stand-in methods look like
private static async Task Await(IObservable<int> final)
{
    await final.LastOrDefaultAsync();
}

private static byte[] Transform(int number)
{
    if (number == itemCount)
    {
        Console.WriteLine("numbers exhausted.");
    }
    byte[] buffer = new byte[1000000];
    Buffer.BlockCopy(bloat, 0, buffer, 0, bloat.Length);
    return buffer;
}

private static int Work(byte[] buffer)
{
    Console.WriteLine("t {0}.", Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(50);
    return 1;
}
A little explanation. Range(1, itemCount) simulates raw inputs, materialized from a data-source. Transform simulates an enrichment process each input must go through, and results in a larger memory footprint. Work is a "lengthy" process which operates on the transformed input.
Ideally, we want to minimize the number of transformed inputs held concurrently by the system, while maximizing throughput by parallelizing Work. The number of transformed inputs in memory should be batch size (10 above) times concurrent work threads (threadCount).
So for 5 threads, we should retain 50 Transform items at any given time; and if, as here, the transform is a 1MB byte buffer, then we would expect memory consumption to be at about 50MB throughout the run.
What I find is quite different. Namely, Rx is eagerly consuming all the numbers and Transforming them up front (as evidenced by the "numbers exhausted." message), resulting in a massive memory spike up front (~1GB for an itemCount of 1000).
My basic question is: Is there a way to achieve what I need (ie minimized consumption, throttled by multi-threaded batching)?
UPDATE: Sorry for the reversal, James; at first I did not think paulpdaniels' and Enigmativity's composition of Work(Transform) applied (this has to do with the nature of our actual implementation, which is more complex than the simple scenario provided above); however, after some further experimentation, I may be able to apply the same principles: i.e. defer Transform until the batch executes.
You have made a couple of mistakes in your code that throw off all of your conclusions.
First up, you've done this:
IEnumerable<int> numbers = Enumerable.Range(1, itemCount);
You've used Enumerable.Range which means that when you call numbers.Select(Transform) you are going to burn through all of the numbers as fast as a single thread can take it. Rx hasn't even had a chance to do any work because up till this point your pipeline is entirely enumerable.
The next issue is in your subscriptions:
final.Subscribe();
source.Connect();
Await(final).Wait();
Because you call final.Subscribe() & Await(final).Wait(); you are creating two separate subscriptions to the final observable.
Since there is a source.Connect() in the middle the second subscription may be missing out on values.
So, let's try to remove all of the cruft that's going on here and see if we can work things out.
If you go down to this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .Select(bs => Work(bs));
Things work well. The numbers get exhausted right at the end, and processing 20 items on my machine takes about 1 second.
But this is processing everything in sequence. And the Work step provides back-pressure on Transform to slow down the speed at which it consumes the numbers.
Let's add concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs)));
This processes 20 items in 0.284 seconds, and the numbers exhaust themselves after 5 items are processed. There is no longer any back-pressure on the numbers. Basically the scheduler is handing all of the work to the Observable.Start so it is ready for the next number immediately.
Let's reduce the concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs), concurrency));
Now the 20 items get processed in 0.5 seconds. Only two get processed before the numbers are exhausted. This makes sense as we've limited concurrency to two threads. But still there's no back pressure on the consumption of the numbers so they get chewed up pretty quickly.
Having said all of this, I tried to construct a query with the appropriate back pressure, but I couldn't find a way. The crux comes down to the fact that Transform(...) performs far faster than Work(...) so it completes far more quickly.
So then the obvious move for me was this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .SelectMany(n => Observable.Start(() => Work(Transform(n)), concurrency));
This doesn't complete the numbers until the end, and it limits processing to two threads. It appears to do the right thing for what you want, except that I've had to do Work(Transform(...)) together.
The very fact that you want to limit the amount of work you are doing suggests you should be pulling data, not having it pushed at you. I would forget using Rx in this scenario, as fundamentally, what you have described is not a reactive application. Also, Rx is best suited to processing items serially; it uses sequential event streams.
Why not just keep your data source enumerable and use PLINQ, Parallel.ForEach, or TPL Dataflow? All of those sound better suited to your problem.
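For instance, a minimal PLINQ sketch of that pull-based approach, reusing the question's numbers/Transform/Work/threadCount (an illustration of the idea, not a drop-in replacement):

// Workers pull numbers on demand, so only a bounded number of
// transformed buffers are in flight at any time.
int[] results = numbers
    .AsParallel()
    .WithDegreeOfParallelism(threadCount)
    .Select(n => Work(Transform(n)))
    .ToArray();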
As @JamesWorld said, it may very well be that you want to use PLINQ to perform this task; it really depends on whether you are actually reacting to data in your real scenario or just iterating through it.
If you choose to go the Reactive route you can use Merge to control the level of parallelization occurring:
var source = numbers
    .Select(n =>
        Observable.Defer(() => Observable.Start(() => Work(Transform(n)), concurrency)))
    // Maximum concurrency
    .Merge(10)
    // Schedule all the output back onto the event loop scheduler
    .ObserveOn(eventLoop);
The above code will consume all the numbers first (sorry, no way to avoid that); however, by wrapping the processing in a Defer and following it up with a Merge that limits parallelization, only x number of items can be in flight at a time. Start() takes a scheduler as its second argument, which it uses to execute the provided method. Finally, since you are basically just pushing the values of Transform into Work, I composed them within the Start method.
As a side note, you can await an Observable and it will be equivalent to the code you have, i.e.:
await source; //== await source.LastAsync();
I have a multi-threaded application, and in a certain section of code I use a Stopwatch to measure the time of an operation:
MatchCollection matches = regex.Matches(text); // lazy evaluation
Int32 matchCount;

// inside this block the program should not context switch
{
    // start the timer
    MyStopwatch matchDuration = MyStopwatch.StartNew();

    // actually evaluate the regex
    matchCount = matches.Count;

    // add the time the regex took to a list
    durations.AddDuration(matchDuration.Stop());
}
Now, the problem is if the program switches control to another thread somewhere else while the stopwatch is started, then the timed duration will be wrong. The other thread could have done any amount of work before the context switches back to this section.
Note that I am not asking about locking, these are all local variables so there is no need for that. I just want the timed section to execute continuously.
Edit: another solution could be to subtract the context-switched time to get the actual time spent doing work in the timed section. I don't know if that's possible.
You can't do that. Otherwise it would be very easy for any application to get complete control over the CPU timeslices assigned to it.
You can, however, give your process a high priority to reduce the probability of a context-switch.
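For example, a minimal sketch of raising the priority of the current process (High rather than RealTime, so the rest of the system stays responsive):

using System.Diagnostics;

// Fewer preemptions become likely; context switches are not eliminated.
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;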
Here is another thought:
Assuming that you don't measure the execution time of a regular expression just once but multiple times, you should not see the average execution time as an absolute value but as a relative value compared to the average execution times of other regular expressions.
With this thinking you can compare the average execution times of different regular expressions without knowing the times lost to context switches. The time lost to context switches would be about the same in every average, assuming the environment is relatively stable with regards to CPU utilization.
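A sketch of that idea, using a hypothetical helper that averages over many runs, so the roughly constant context-switch overhead affects every regex's average similarly:

using System.Diagnostics;
using System.Text.RegularExpressions;

static double AverageMilliseconds(Regex regex, string text, int runs)
{
    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < runs; i++)
    {
        // Force full evaluation of the lazy MatchCollection.
        int matchCount = regex.Matches(text).Count;
    }
    stopwatch.Stop();
    return stopwatch.Elapsed.TotalMilliseconds / runs;
}

// Compare the relative values, e.g. AverageMilliseconds(regexA, text, 100)
// versus AverageMilliseconds(regexB, text, 100).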
I don't think you can do that.
A "best effort", for me, would be to put your method in a separate thread, and use
Thread.CurrentThread.Priority = ThreadPriority.Highest;
to avoid context switching as much as possible.
If I may ask, why do you need such a precise measurement, and why can't you extract the function and benchmark it in its own program, if that's the point?
Edit : Depending on the use case it may be useful to use
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2); // Or whatever core you want to stick to
to avoid switch between cores.
I need to process several lines from a database (could be millions) in parallel in C#. The processing is quite quick (50 to 150 ms/line), but I cannot know this speed before runtime as it depends on hardware/network.
The ThreadPool, or the newer Task Parallel Library, seems to be what fits my needs, as I am new to threading and want the most efficient way to process the data.
However, these methods do not provide a way to control the execution speed of my tasks (lines/minute): I want to be able to set a maximum speed limit for the processing, or run it at full speed.
Please note that setting the number of threads of the ThreadPool/TaskFactory does not provide sufficient accuracy for my needs, as I would like to be able to set a speed limit below the 'one thread speed'.
Using a custom scheduler for the TPL seems to be a way to do that, but I did not find a way to implement it.
Furthermore, I'm worried about the efficiency cost that such a setup would incur.
Could you suggest a way, or advice on how to achieve this?
Thanks in advance for your answers.
The TPL provides a convenient programming abstraction on top of the Thread Pool. I would always select TPL when that is an option.
If you wish to throttle the total processing speed, there's nothing built-in that would support that.
You can measure the total processing speed as you proceed through the file and regulate speed by introducing (non-spinning) delays in each thread. The size of the delay can be dynamically adjusted in your code based on observed processing speed.
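A minimal sketch of that idea; the names maxLinesPerMinute, ProcessLine and ProcessLineThrottled are placeholders, not part of any library:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var stopwatch = Stopwatch.StartNew();
long processedLines = 0;
double maxLinesPerMinute = 1000; // the configurable speed limit

void ProcessLine(string line) { /* the actual per-line work goes here */ }

async Task ProcessLineThrottled(string line)
{
    ProcessLine(line);

    long done = Interlocked.Increment(ref processedLines);
    // How long we *should* have taken so far to stay under the limit.
    double targetElapsedMs = done / maxLinesPerMinute * 60_000;
    double aheadByMs = targetElapsedMs - stopwatch.ElapsedMilliseconds;
    if (aheadByMs > 0)
        await Task.Delay((int)aheadByMs); // non-spinning delay
}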
I am not seeing the advantage of limiting the speed, but I suggest you look into limiting the max degree of parallelism of the operation. That can be done via MaxDegreeOfParallelism in the ParallelOptions passed to Parallel.ForEach as the code works over the disparate lines of data. That way you can control the slots, for lack of a better term, which can be expanded or subtracted depending on the criteria you are working under.
Here is an example using a ConcurrentBag to process lines of disparate data with 2 parallel tasks.
var myLines = new List<string> { "Alpha", "Beta", "Gamma", "Omega" };
var stringResult = new ConcurrentBag<string>();

ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 2;

Parallel.ForEach(myLines, parallelOptions, line =>
{
    if (line.Contains("e"))
        stringResult.Add(line);
});

Console.WriteLine(string.Join(" | ", stringResult));
// Outputs Beta | Omega
Note that ParallelOptions also has a TaskScheduler property with which you can refine the processing further. Finally, for more control, maybe you want to cancel the processing when a specific threshold is reached? If so, look into the CancellationToken property to exit the process early.
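For example, a hedged sketch of cancelling the loop above once a hypothetical threshold is reached:

using System;
using System.Threading;
using System.Threading.Tasks;

var cts = new CancellationTokenSource();
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 2,
    CancellationToken = cts.Token
};
try
{
    Parallel.ForEach(myLines, options, line =>
    {
        if (line.Contains("e"))
        {
            stringResult.Add(line);
            if (stringResult.Count >= 2) // hypothetical threshold
                cts.Cancel(); // no further iterations will be scheduled
        }
    });
}
catch (OperationCanceledException)
{
    // Expected when the token is cancelled mid-loop.
}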
I'm investigating the Parallelism Break in a For loop.
After reading this and this I still have a question:
I'd expect this code:
Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Break();
});
To yield at most 6 numbers (0..6).
Not only does it not do that, but the result length also differs between runs:
02351486
013542
0135642
Very annoying. (Where the hell is Break() {after 5} here??)
So I looked at MSDN:
Break may be used to communicate to the loop that no other iterations after the current iteration need be run. If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000, all iterations less than 100 should still be run, but the iterations from 101 through to 1000 are not necessary.
Question #1:
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
Question #2:
Let's assume we are using Parallel + range partitioning (since there is no CPU-cost variation between elements), so it divides the data among threads. So if we have 4 cores (and a perfect division among them):
core #1 got 0..250
core #2 got 251..500
core #3 got 501..750
core #4 got 751..1000
So the thread on core #1 will meet value=100 at some point and will break; this will be its 100th iteration. But the thread on core #4 got more quanta and is on 900 now; it is way beyond its 100th iteration. It doesn't have any index less than 100 left to be stopped, so it will show them all.
Am I right? Is that the reason why I get more than 5 elements in my example?
Question #3:
How can I truly break when (i == 5)?
P.S.
I mean, come on! When I do Break(), I want the loop to stop, exactly as it does in a regular for loop.
To yield at most 6 numbers (0..6).
The problem is that this won't yield at most 6 numbers.
What happens is, when you hit a loop with an index of 5, you send the "break" request. Break() will cause the loop to no longer process any values >5, but process all values <5.
However, any values greater than 5 which were already started will still get processed. Since the various indices are running in parallel, they're no longer ordered, so you get various runs where some values >5 (such as 8 in your example) are still being executed.
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
This is the index being passed into Parallel.For. Break() won't prevent items from being processed, but provides a guarantee that all items up to 100 get processed, but items above 100 may or may not get processed.
Am I right? Is that the reason why I get more than 5 elements in my example?
Yes. If you use a partitioner like you've shown, as soon as you call Break(), items beyond the one where you break will no longer get scheduled. However, items (which is the entire partition) already scheduled will get processed fully. In your example, this means you're likely to always process all 1000 items.
How can I truly break when (i == 5)?
You are - but when you run in Parallel, things change. What is the actual goal here? If you only want to process the first 6 items (0-5), you should restrict the items before you loop through them via a LINQ query or similar. You can then process the 6 items in Parallel.For or Parallel.ForEach without a Break() and without worry.
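A minimal sketch of that restriction, assuming the goal is only items 0-5:

using System;
using System.Linq;
using System.Threading.Tasks;

// Restrict the sequence first; no Break() is needed afterwards.
Parallel.ForEach(Enumerable.Range(0, 10).Take(6), i =>
{
    Console.WriteLine(i); // processes exactly 0..5, in no particular order
});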
I mean, come on! When I do Break(), I want the loop to stop, exactly as it does in a regular for loop.
You should use Stop() instead of Break() if you want things to stop as quickly as possible. This will not prevent items already running from stopping, but will no longer schedule any items (including ones at lower indices or earlier in the enumeration than your current position).
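A minimal sketch of the Stop() variant; iterations that are already running still finish, so the output can vary between runs:

using System;
using System.Threading.Tasks;

Parallel.For(0, 10, (i, state) =>
{
    if (i == 5)
    {
        state.Stop(); // stop scheduling any further iterations
        return;
    }
    Console.WriteLine(i);
});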
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000
The 100th iteration of the loop is not necessarily (in fact probably not) the one with the index 99.
Your threads can and will run in an indeterminate order. When the .Break() instruction is encountered, no further loop iterations will be started. Exactly when that happens depends on the specifics of thread scheduling for a particular run.
I strongly recommend reading
Patterns of Parallel Programming
(free PDF from Microsoft)
to understand the design decisions and design tradeoffs that went into the TPL.
Which iterations? The overall iteration counter, or per thread?
Of all the iterations scheduled (or yet to be scheduled).
Remember the delegates may run out of order; there is no guarantee that iteration i == 5 will be the sixth to execute. In fact, this is unlikely to be the case except in rare circumstances.
Q2: Am I right?
No, the scheduling is not that simplistic. Rather, all the tasks are queued up and then the queue is processed. Each thread uses its own queue until it is empty, at which point it steals work from other threads. This leaves no way to predict which thread will process which delegate.
If the delegates are sufficiently trivial, it might all be processed on the original calling thread (no other thread gets a chance to steal work).
Q3: How can I truly break when (i == 5)?
Don't use concurrency if you want linear (sequential) processing.
The Break method is there to support speculative execution: try various ways and stop as soon as any one completes.