Change stop condition during Parallel.For loop - C#

I want to explore some locations in parallel. On the master thread, before the parallel loop, I explore the first location and get new locations to explore (for example, 50 new locations). But I want to explore 100 locations (if there are that many new locations), so I have to change the stop condition of the loop.
Parallel.For(1, Locations.Count, new ParallelOptions { MaxDegreeOfParallelism = 4 },
    (index, state) =>
    {
        var newLocations = FindSomethingInLocation(Locations[index]);
        locationsInIndex(index, newLocations); // -> dictionary
        if (Locations.Count < 100)
        {
            Locations.Add(newLocations);
        }
    });
I want to explore 100 locations and get the elements for every explored location.
How can I do that? Is it possible? Currently I am doing something like the example above, but if Locations has 50 elements at the start of the loop, the loop explores only 50 locations. Even when I add new elements to the Locations list, the stop condition does not change.
Maybe someone has a better idea for this?
One more question about this. Do you know how Parallel.For schedules iterations? Does the main thread divide the problem among N threads at the start? Or is it more dynamic: when a thread finishes an iteration, it gets new work from the main thread?
Thank you all!
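Parallel.For reads its upper bound once, before the loop starts, so growing the list inside the body cannot extend the loop. One possible workaround, sketched below, is to explore in waves: run a parallel loop over the current batch, then build the next batch from what was found, until the limit is reached. The names Explorer and Explore, and the stand-in body of FindSomethingInLocation (each location yields two new ones), are assumptions made only for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Explorer
{
    // Hypothetical stand-in for the question's FindSomethingInLocation:
    // every location "contains" two new locations.
    public static IEnumerable<int> FindSomethingInLocation(int location)
    {
        return new[] { location * 2 + 1, location * 2 + 2 };
    }

    // Explores locations wave by wave until 'limit' locations have been scheduled.
    public static Dictionary<int, List<int>> Explore(int start, int limit)
    {
        var results = new ConcurrentDictionary<int, List<int>>();
        var seen = new HashSet<int> { start };   // every location scheduled so far
        var batch = new List<int> { start };     // locations explored in the current wave
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        while (batch.Count > 0)
        {
            // The batch is fixed while the parallel loop runs.
            Parallel.ForEach(batch, options, location =>
            {
                results[location] = FindSomethingInLocation(location).ToList();
            });

            // Build the next wave on the calling thread, so 'seen' needs no locking.
            var next = new List<int>();
            foreach (var location in batch)
                foreach (var found in results[location])
                    if (seen.Count < limit && seen.Add(found))
                        next.Add(found);
            batch = next;
        }
        return new Dictionary<int, List<int>>(results);
    }

    public static void Main()
    {
        var explored = Explore(0, 100);
        Console.WriteLine("Explored locations: " + explored.Count);
    }
}
```

The trade-off of this design is that every wave waits for its slowest location before the next wave starts. In exchange, the shared collections are only modified at well-defined points.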

Related

Increase in time taken to run LINQ query .asparallel when creating new separate tasks in C#

In a LINQ Query, I have used .AsParallel as follows:
var completeReservationItems = from rBase in reservation.AsParallel()
    join rRel in relationship.AsParallel() on rBase.GroupCode equals rRel.SourceGroupCode
    join rTarget in reservation.AsParallel() on rRel.TargetCode equals rTarget.GroupCode
    where rRel.ProgramCode == programCode && rBase.StartDate <= rTarget.StartDate && rBase.EndDate >= rTarget.EndDate
    select new Object
    {
        //Initialize based on the query
    };
Then, I have created two separate Tasks and was running them in parallel, passing the same Lists to both the methods as follows:
Task getS1Status = Task.Factory.StartNew(
    () =>
    {
        RunLinqQuery(params);
    });
Task getS2Status = Task.Factory.StartNew(
    () =>
    {
        RunLinqQuery(params);
    });
Task.WaitAll(getS1Status, getS2Status);
I was capturing the timings and was surprised to see that the timings were as follows:
Above scenario: 6 sec (6000 ms)
Same code, running sequentially instead of 2 Tasks: 50 ms
Same code, but without .AsParallel() in the LINQ: 50 ms
I wanted to understand why this is taking so long in the above scenario.
Posting this as an answer only because I have some code to show.
First, I don't know how many threads AsParallel() will create. The documentation doesn't say anything about it: https://msdn.microsoft.com/en-us/library/dd413237(v=vs.110).aspx
Consider the following code:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
How many thread IDs will we see? For me, the number of threads differs on each run. Example output:
30 // only one thread!
Next time:
27 // several threads
13
38
10
43
30
I think the number of threads depends on the current scheduler. We can always set the maximum number of threads by calling the WithDegreeOfParallelism method (https://msdn.microsoft.com/en-us/library/dd383719(v=vs.110).aspx), for example:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .WithDegreeOfParallelism(2)
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
Now the output will contain at most 2 threads:
7
40
Why is this important? As I said, the number of threads can directly influence performance.
But that is not the only problem. In your first scenario, you create new tasks (which run on the thread pool and can add overhead) and then call Task.WaitAll. Take a look at its source code: https://referencesource.microsoft.com/#mscorlib/system/threading/Tasks/Task.cs,72b6b3fa5eb35695 . I'm sure the per-task for loop there adds overhead, and if AsParallel takes too many threads inside the first task, the second task may be delayed. Moreover, this may or may not happen on a given run, so if you run your first scenario 1000 times, you will probably get very different results.
So my last point is that you are trying to measure parallel code, and that is very hard to do right. I don't recommend using parallel constructs wherever you can, because they can degrade performance if you don't know exactly what you are doing.
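To make such measurements a little less noisy, one common approach is to run a warm-up pass first and then report the median of several runs instead of a single timing. The following is only a sketch of that idea. The helper name MedianMilliseconds and the sample workload are assumptions, not part of the original question.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class Timing
{
    // Runs 'action' once as a warm-up (JIT, thread pool start-up),
    // then 'runs' more times, and returns the median duration in milliseconds.
    public static double MedianMilliseconds(Action action, int runs)
    {
        action();
        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }
        Array.Sort(samples);
        return samples[runs / 2];
    }

    public static void Main()
    {
        int[] data = Enumerable.Range(0, 100000).ToArray();

        double sequential = MedianMilliseconds(
            () => { int n = data.Count(x => x % 3 == 0); }, 21);
        double parallel = MedianMilliseconds(
            () => { int n = data.AsParallel().Count(x => x % 3 == 0); }, 21);

        // Actual numbers depend on the machine and on the current load.
        Console.WriteLine("Sequential median: " + sequential + " ms");
        Console.WriteLine("AsParallel median: " + parallel + " ms");
    }
}
```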

C# delay in Parallel.ForEach

I use 2 Parallel.ForEach nested loops to retrieve information quickly from a url. This is the code:
while (searches.Count > 0)
{
    Parallel.ForEach(searches, (search, loopState) =>
    {
        Parallel.ForEach(search.items, item =>
        {
            RetrieveInfo(item);
        });
    });
}
The outer ForEach has a list of, for example 10, whilst the inner ForEach has a list of 5. This means that I'm going to query the url 50 times, however I query it 5 times simultaneously (inner ForEach).
I need to add a delay for the inner loop so that after it queries the url, it waits for x seconds - the time taken for the inner loop to complete the 5 requests.
Using Thread.Sleep is not a good idea because it will block the complete thread and possibly the other parallel tasks.
Is there an alternative that might work?
To my understanding, you have 50 tasks and you wish to process 5 of them at a time.
If so, you should look into ParallelOptions.MaxDegreeOfParallelism to process 50 tasks with a maximum degree of parallelism at 5. When one task stops, another task is permitted to start.
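A minimal sketch of this first option, assuming the nested lists can be flattened into one sequence of items: a single Parallel.ForEach with MaxDegreeOfParallelism = 5 then keeps at most five calls in flight. The class and method names are hypothetical, and the retrieveInfo delegate stands in for the question's RetrieveInfo. The peak counter is there only to make the limit observable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRetrieval
{
    // Runs retrieveInfo for every item of every search, at most maxParallel at a time.
    // Returns the highest number of calls observed running at the same moment.
    public static int ProcessAll(
        IEnumerable<IEnumerable<string>> searches,
        Action<string> retrieveInfo,
        int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        var gate = new object();
        int running = 0, peak = 0;

        // Flatten the nested lists into one sequence of work items.
        List<string> items = searches.SelectMany(search => search).ToList();

        Parallel.ForEach(items, options, item =>
        {
            lock (gate) { running++; peak = Math.Max(peak, running); }
            try { retrieveInfo(item); }
            finally { lock (gate) { running--; } }
        });
        return peak;
    }

    public static void Main()
    {
        // 10 searches with 5 items each, as in the question.
        var searches = Enumerable.Range(0, 10)
            .Select(s => Enumerable.Range(0, 5).Select(i => "search" + s + "/item" + i).ToList())
            .ToList();

        // Thread.Sleep here only simulates the duration of a request.
        int peak = ProcessAll(searches, item => Thread.Sleep(20), 5);
        Console.WriteLine("Peak concurrency: " + peak);
    }
}
```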
If you wish to have tasks processed in chunks of five, followed by another chunk of five (as in, you wish to process chunks in serial), then you would want code similar to
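foreach (var setOfFive in setsOfFive) // the sets are processed one after another
{
    // parallelOptions is optional
    Parallel.ForEach(setOfFive, parallelOptions, action);
}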

ThreadPool with speed execution control

I need to process several lines from a database (possibly millions) in parallel in C#. The processing is quite quick (50 to 150 ms per line), but I cannot know this speed before runtime because it depends on the hardware and network.
The ThreadPool or the newer Task Parallel Library seems to fit my needs, as I am new to threading and want the most efficient way to process the data.
However, these do not provide a way to control the execution speed of my tasks (lines per minute): I want to be able to set a maximum speed for the processing or run it at full speed.
Please note that setting the number of threads of the ThreadPool/TaskFactory is not accurate enough for my needs, as I would like to be able to set a speed limit below the 'one thread speed'.
Using a custom scheduler for the TPL seems to be a way to do that, but I did not find a way to implement it.
Furthermore, I'm worried about the efficiency cost of such a setup.
Could you suggest a way to achieve this, or give me some advice?
Thanks in advance for your answers.
The TPL provides a convenient programming abstraction on top of the Thread Pool. I would always select TPL when that is an option.
If you wish to throttle the total processing speed, there's nothing built-in that would support that.
You can measure the total processing speed as you proceed through the file and regulate speed by introducing (non-spinning) delays in each thread. The size of the delay can be dynamically adjusted in your code based on observed processing speed.
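A simplified sketch of the delay idea: instead of adjusting delays from the observed speed, this variant hands out evenly spaced time slots and lets each worker sleep (without spinning) until its slot arrives. This caps the overall rate regardless of how many threads are running, and also works below the 'one thread speed'. The RateLimiter class and its members are hypothetical names, not an existing TPL API.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public sealed class RateLimiter
{
    private readonly object _gate = new object();
    private readonly Stopwatch _clock = Stopwatch.StartNew();
    private readonly long _intervalTicks;   // Stopwatch ticks between two slots
    private long _nextSlotTicks;            // start time of the next free slot

    public RateLimiter(double itemsPerSecond)
    {
        _intervalTicks = (long)(Stopwatch.Frequency / itemsPerSecond);
    }

    // Blocks the calling thread until its time slot arrives.
    public void WaitForSlot()
    {
        long waitTicks;
        lock (_gate)
        {
            long now = _clock.ElapsedTicks;
            long slot = Math.Max(now, _nextSlotTicks);
            _nextSlotTicks = slot + _intervalTicks;
            waitTicks = slot - now;
        }
        if (waitTicks > 0)
            Thread.Sleep(TimeSpan.FromSeconds((double)waitTicks / Stopwatch.Frequency));
    }
}

public static class RateLimiterDemo
{
    public static void Main()
    {
        var limiter = new RateLimiter(100);   // at most about 100 lines per second
        var sw = Stopwatch.StartNew();
        Parallel.For(0, 50, i =>
        {
            limiter.WaitForSlot();
            // Process line i here.
        });
        Console.WriteLine("Elapsed: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

Sleeping inside a Parallel loop blocks thread pool threads, so for long delays the pool may inject extra threads. Setting MaxDegreeOfParallelism alongside the limiter keeps that bounded.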
I don't see the advantage of limiting the speed, but I suggest you look into limiting the maximum degree of parallelism of the operation. That can be done via the MaxDegreeOfParallelism property of the ParallelOptions passed to Parallel.ForEach as the code works over the separate lines of data. That way you control the number of slots, for lack of a better term, which can be increased or decreased depending on your criteria.
Here is an example that uses a ConcurrentBag to collect results while processing lines of disparate data with 2 parallel tasks.
var myLines = new List<string> { "Alpha", "Beta", "Gamma", "Omega" };
var stringResult = new ConcurrentBag<string>();
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 2;
Parallel.ForEach( myLines, parallelOptions, line =>
{
    if (line.Contains( "e" ))
        stringResult.Add( line );
} );
Console.WriteLine( string.Join( " | ", stringResult ) );
// Outputs Beta and Omega; a ConcurrentBag is unordered, so the order may vary
Note that ParallelOptions also has a TaskScheduler property, with which you can refine the processing further. Finally, for more control, maybe you want to cancel the processing when a specific threshold is reached? If so, look into the CancellationToken property to exit the process early.
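As a rough illustration of the cancellation option, the sketch below cancels a Parallel.For once a threshold of processed items is reached. The loop then stops scheduling new iterations and throws OperationCanceledException. The CancelDemo class and ProcessUntil method are made-up names, and the threshold logic is just an example condition.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancelDemo
{
    // Processes up to 'total' items, requesting cancellation once 'threshold'
    // items have been handled. Returns how many items were actually processed.
    public static int ProcessUntil(int total, int threshold)
    {
        using (var cts = new CancellationTokenSource())
        {
            var options = new ParallelOptions
            {
                MaxDegreeOfParallelism = 2,
                CancellationToken = cts.Token
            };
            int processed = 0;
            try
            {
                Parallel.For(0, total, options, i =>
                {
                    // Real per-item work would go here.
                    if (Interlocked.Increment(ref processed) >= threshold)
                        cts.Cancel();
                });
            }
            catch (OperationCanceledException)
            {
                // Expected when the loop observes the cancellation request.
            }
            return processed;
        }
    }

    public static void Main()
    {
        int processed = ProcessUntil(1000000, 10);
        // Iterations already running are allowed to finish,
        // so the count can be somewhat higher than the threshold.
        Console.WriteLine("Processed before stopping: " + processed);
    }
}
```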

Parallel.For and Break() misunderstanding?

I'm investigating Break in a parallel For loop.
After reading this and this I still have a question:
I'd expect this code :
Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Break();
});
To yield at most 6 numbers (0..6).
Not only does it not do that, but the result length differs between runs:
02351486
013542
0135642
Very annoying. (where the hell is Break() {after 5} here ??)
So I looked at msdn
Break may be used to communicate to the loop that no other iterations after the current iteration need be run.
If Break is called from the 100th iteration of a for loop iterating in
parallel from 0 to 1000, all iterations less than 100 should still be
run, but the iterations from 101 through to 1000 are not necessary.
Question #1 :
Which iterations ? the overall iteration counter ? or per thread ? I'm pretty sure it is per thread. please approve.
Question #2 :
Let's assume we are using Parallel with range partitioning (because the CPU cost does not vary between elements), so it divides the data among the threads. So if we have 4 cores (and a perfect division among them):
core #1 got 0..250
core #2 got 251..500
core #3 got 501..750
core #4 got 751..1000
So the thread on core #1 will reach value=100 at some point and will break.
This will be its iteration number 100.
But the thread on core #4 got more quanta and is at 900 now. It is way beyond its 100th iteration.
It has no index below 100 at which to be stopped!! So it will show them all.
Am I right ? is that is the reason why I get more than 5 elements in my example ?
Question #3 :
How can I truly break when (i == 5) ?
p.s.
I mean , come on ! when I do Break() , I want things the loop to stop.
excactly as I do in regular For loop.
To yield at most 6 numbers (0..6).
The problem is that this won't yield at most 6 numbers.
What happens is, when you hit a loop with an index of 5, you send the "break" request. Break() will cause the loop to no longer process any values >5, but process all values <5.
However, any values greater than 5 which were already started will still get processed. Since the various indices are running in parallel, they're no longer ordered, so you get various runs where some values >5 (such as 8 in your example) are still being executed.
Which iterations ? the overall iteration counter ? or per thread ? I'm pretty sure it is per thread. please approve.
This is the index being passed into Parallel.For. Break() won't prevent items from being processed, but provides a guarantee that all items up to 100 get processed, but items above 100 may or may not get processed.
Am I right ? is that is the reason why I get more than 5 elements in my example ?
Yes. If you use a partitioner like you've shown, as soon as you call Break(), items beyond the one where you break will no longer get scheduled. However, items (which is the entire partition) already scheduled will get processed fully. In your example, this means you're likely to always process all 1000 items.
How can I truly break when (i == 5) ?
You are - but when you run in Parallel, things change. What is the actual goal here? If you only want to process the first 6 items (0-5), you should restrict the items before you loop through them via a LINQ query or similar. You can then process the 6 items in Parallel.For or Parallel.ForEach without a Break() and without worry.
I mean , come on ! when I do Break() , I want things the loop to stop. excactly as I do in regular For loop.
You should use Stop() instead of Break() if you want things to stop as quickly as possible. This will not interrupt items that are already running, but no further items will be scheduled (including ones at lower indices or earlier in the enumeration than your current position).
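The guarantee that Break() gives can be checked with a small experiment. In this sketch (the BreakDemo class is an illustrative name), every visited index is recorded, and the ParallelLoopResult is inspected afterwards: indices 0 through 5 are always present, while the number of higher indices that also ran differs from run to run.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class BreakDemo
{
    // Records every index the loop body visits and calls Break() at index 5.
    public static ParallelLoopResult Run(ConcurrentBag<int> visited)
    {
        return Parallel.For(0, 1000, (i, state) =>
        {
            visited.Add(i);
            if (i == 5) state.Break();
        });
    }

    public static void Main()
    {
        var visited = new ConcurrentBag<int>();
        ParallelLoopResult result = Run(visited);

        Console.WriteLine("IsCompleted: " + result.IsCompleted);
        Console.WriteLine("LowestBreakIteration: " + result.LowestBreakIteration);
        // Everything at or below the break index is guaranteed to have run.
        Console.WriteLine("0..5 all visited: " +
            Enumerable.Range(0, 6).All(i => visited.Contains(i)));
        // How many higher indices also ran varies between runs.
        Console.WriteLine("Indices above 5 that ran: " + visited.Count(i => i > 5));
    }
}
```

With Stop() instead of Break(), LowestBreakIteration stays null and there is no such guarantee for the lower indices.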
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000
The 100th iteration of the loop is not necessarily (in fact probably not) the one with the index 99.
Your threads can and will run in an indeterminate order. When the .Break() instruction is encountered, no further loop iterations beyond that index will be started. Exactly when that happens depends on the specifics of thread scheduling for a particular run.
I strongly recommend reading
Patterns of Parallel Programming
(free PDF from Microsoft)
to understand the design decisions and design tradeoffs that went into the TPL.
Which iterations ? the overall iteration counter ? or per thread ?
Of all the iterations scheduled (or yet to be scheduled).
Remember that the delegate may be run out of order. There is no guarantee that iteration i == 5 will be the sixth to execute; in fact, that is unlikely except in rare cases.
Q2: Am I right ?
No, the scheduling is not that simple. Rather, all the tasks are queued up and then the queue is processed. But each thread uses its own queue until it is empty, and then steals work from the other threads. So there is no way to predict which thread will process which delegate.
If the delegates are sufficiently trivial it might all be processed on the original calling thread (no other thread gets a chance to steal work).
Q3: How can I truly break when (i == 5) ?
Don't use concurrency if you want linear processing (in a specific order).
The Break method is there to support speculative execution: try various ways and stop as soon as any one completes.

Implement Batch Process Using Producer Consumer

Batch
1. read text from file or SQL
2. parse the text into words
3. load the words into SQL
Today
.NET 4.0
Step 1 is very fast.
Steps 2 and 3 are about the same length (avg 0.1 second) for the same size file.
Step 3 inserts using a BackgroundWorker and waits for the previous insert to complete.
Everything else is on the main thread.
On a big load will do this several million times.
Need step 3 to be serial and in the same order as 1.
This is to keep the SQL table PK index from fracturing.
Tried step 3 in parallel and fracturing the index killed it.
This data is fed sorted by the PK.
Other indexes are dropped at the start of the load then rebuilt at the end of the load.
Where this process is not effective is when the size of text changes.
And the size of the text from file to file does change drastically.
What I would like is to queue 1 and 2 so 3 is kept as busy as possible.
Need step 3 to dequeue the files in order they were enqueued in 1 (even if it waits).
Need a maximum queue size for memory management (like 4-10).
Would like to have step 2 parallel with up to 4 concurrent.
Moving to .NET 4.5.
Asking for general guidance on how to implement this?
I am learning that this is a producer consumer pattern.
If this is not a producer consumer pattern please let me know so I can change the title.
I think TPL Dataflow would be a good way to do this:
For step 2, you would use a TransformBlock with MaxDegreeOfParallelism set to 4 and BoundedCapacity also set to 4, so that its queues are empty when working. It will produce the items in the same order as they came in, you don't have to do anything special for that. For step 3, use an ActionBlock, with BoundedCapacity set to your limit. Then link the two together and start sending items to the TransformBlock, ideally using something like await stepTwoBlock.SendAsync(…), to asynchronously wait if the queue is full.
In code, it would look something like:
async Task ProcessData()
{
    var stepTwoBlock = new TransformBlock<OriginalText, ParsedText>(
        text => Parse(text),
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,
            BoundedCapacity = 4
        });
    var stepThreeBlock = new ActionBlock<ParsedText>(
        text => LoadIntoDatabase(text),
        new ExecutionDataflowBlockOptions { BoundedCapacity = 10 });
    stepTwoBlock.LinkTo(
        stepThreeBlock, new DataflowLinkOptions { PropagateCompletion = true });
    // this is step one:
    foreach (var id in IdsToProcess)
    {
        OriginalText text = ReadText(id);
        await stepTwoBlock.SendAsync(text);
    }
    stepTwoBlock.Complete();
    await stepThreeBlock.Completion;
}
