I use two nested Parallel.ForEach loops to retrieve information quickly from a URL. This is the code:
while (searches.Count > 0)
{
    Parallel.ForEach(searches, (search, loopState) =>
    {
        Parallel.ForEach(search.items, item =>
        {
            RetrieveInfo(item);
        });
    });
}
The outer ForEach iterates over a list of, for example, 10 items, whilst the inner ForEach iterates over a list of 5. This means I'm going to query the URL 50 times in total, but only 5 times simultaneously (the inner ForEach).
I need to add a delay to the inner loop so that after it queries the URL, it waits for x seconds minus the time the inner loop took to complete the 5 requests.
Using Thread.Sleep is not a good idea because it will block the whole thread and possibly the other parallel tasks.
Is there an alternative that might work?
To my understanding, you have 50 tasks and you wish to process 5 of them at a time.
If so, you should look into ParallelOptions.MaxDegreeOfParallelism to process the 50 tasks with a maximum degree of parallelism of 5. When one task finishes, another is permitted to start.
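For example, here is a minimal sketch (searches, items, and RetrieveInfo come from the question; the flattening is illustrative):
// Flatten the 10 x 5 work items into one sequence, then let the TPL keep
// at most five RetrieveInfo calls in flight at any time.
var allItems = searches.SelectMany(search => search.items);
var options = new ParallelOptions { MaxDegreeOfParallelism = 5 };
Parallel.ForEach(allItems, options, item => RetrieveInfo(item));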
If you wish to have the tasks processed in chunks of five, followed by another chunk of five (that is, processing the chunks in series), then you would want code similar to this:
for (int i = 0; i < items.Count; i += 5)
{
    Parallel.ForEach(
        items.Skip(i).Take(5),  // the next chunk of 5
        /* parallelOptions, */  // optional ParallelOptions
        action);
}
I am trying to optimize a data collection process in C#. I would like to understand why a certain method of parallelism I am trying is not working as expected (more details below; see the "Question" section at the very bottom).
BACKGROUND
I have an external .NET Framework DLL, which is an API for an external data source; I can only consume the API, I do not have access to what goes on behind the scenes.
The API provides a function like: GetInfo(string partID, string fieldValue). Using this function, I can get information about a specific part, filtered for a single field/criteria. One single call (for just one part ID and one field value) takes around 20 milliseconds in an optimal case.
A part can have many values for the field criteria. So in order to get all the info for a part, I have to enumerate through all possible field values (13 in this case). And to get all the info for many parts (~260), I have to enumerate through all the part IDs and all the field values.
I already have all the part IDs and possible field values. The problem is performance. Using a serial approach (2 nested for-loops) is too slow (takes ~70 seconds). I would like to get the time down to ~5 seconds.
WHAT I HAVE TRIED
For different mediums of parallelizing work, I have tried:
calling the API in parallel via Tasks within a single main application.
calling the API in parallel via Parallel.ForEach within a single main application.
wrapping the API call with a WCF service and running multiple WCF service instances (just like having multiple Tasks, but with multiple processes instead); the single main client application calls the API in parallel through these services.
For different logic of parallelizing work, I have tried:
experiment 0 has 2 nested for-loops; this is the base case without any parallel calls (so that's ~260 part IDs * 13 field values = ~3400 API calls in series).
experiment 1 has 13 parallel branches, and each branch deals with smaller 2 nested for-loops; essentially, it is dividing experiment 0 into 13 parallel branches (so rather than iterating over ~260 part IDs * 13 field values, each branch will iterate over ~20 part IDs * all 13 field values = ~260 API calls in series per branch).
experiment 2 has 13 parallel branches, and each branch deals with ALL part IDs but only 1 specific field value for each branch (so each branch will iterate over ~260 part IDs * 1 field value = ~260 API calls in series per branch).
experiment 3 has 1 for-loop which iterates over the part IDs in series, but inside the loop makes 13 parallel calls (for the 13 field values); only when all 13 pieces of info are retrieved for one part ID does the loop move on to the next part ID.
I have tried experiments 1, 2, and 3 combined with each of the mediums (Tasks, Parallel.ForEach, separate processes via WCF services), so there is a total of 9 combinations, plus the base case experiment 0, which is just the 2 nested for-loops (no parallelizing there).
I also ran each combination 4 times (each time with a different set of ~260 part IDs), to test for repeatability.
In every experiment/medium combination, I am timing only the direct API call using Stopwatch; so the time is not affected by any other parts of the code (like Task creation, etc.).
Here is how I am wrapping the API call in WCF service (also shows how I am timing the API call):
public async Task<Info[]> GetInfosAsync(string[] partIDs, string[] fieldValues)
{
    Info[] infos = new Info[partIDs.Length * fieldValues.Length];
    await Task.Run(() =>
    {
        for (int i = 0; i < partIDs.Length; i++)
        {
            for (int j = 0; j < fieldValues.Length; j++)
            {
                Stopwatch timer = Stopwatch.StartNew(); // time only the API call itself
                infos[i * fieldValues.Length + j] = api.GetInfo(partIDs[i], fieldValues[j]);
                timer.Stop();
                // log timer.ElapsedMilliseconds to file (each parallel branch writes to its own file)
            }
        }
    });
    return infos;
}
And to better illustrate the 3 different experiments, here is how they are structured. These are run from the main application. I am only including how the experiments were done using inter-process communication (GetInfosAsync defined above), as that gave me the most significant results (as explained under "Results" further below).
// experiment 1
Task<Info[]>[] tasks = new Task<Info[]>[numBranches]; // numBranches = 13
for (int k = 0; k < numBranches; k++)
{
    // each service/branch gets partIDsForBranch[k] (a subset of ~20 partIDs used only by branch k) and all 13 fieldValues
    tasks[k] = services[k].GetInfosAsync(partIDsForBranch[k], fieldValues);
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]

// experiment 2
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int j = 0; j < fieldValues.Length; j++)
{
    // each service/branch gets all ~260 partIDs and only 1 unique fieldValue
    tasks[j] = services[j].GetInfosAsync(partIDs, new string[] { fieldValues[j] });
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]

// experiment 3
for (int i = 0; i < partIDs.Length; i++)
{
    Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
    for (int j = 0; j < fieldValues.Length; j++)
    {
        // each branch/service gets the currently iterated partID and only 1 unique fieldValue
        tasks[j] = services[j].GetInfosAsync(new string[] { partIDs[i] }, new string[] { fieldValues[j] });
    }
    Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]
}
RESULTS
For experiments 1 and 2...
Task (within the same application) and Parallel.ForEach (within the same application) perform almost exactly like the base case experiment (approximately 70 to 80 seconds).
inter-process communication (i.e. making parallel calls to multiple WCF services separate from the main application) performs significantly better than Task/Parallel.ForEach. This made sense to me (I've read about how multi-process can potentially be faster than multi-threaded). Experiment 2 performs better than experiment 1, with the best experiment 2 run taking around 8 seconds.
For experiment 3...
Task and Parallel.ForEach (within the same application) perform close to their experiment 1 and 2 counterparts (but around 10 to 20 seconds slower).
inter-process communication was significantly worse than all the other experiments, taking around 200 to 300 seconds in total. This is the result I don't understand (see the "What I Expected" section further below).
The graphs below give a visual representation of these results. Except for the bar chart summary, I only included the charts for the inter-process communication results, since those gave the most significantly good/bad results.
Figure 1 (above). Elapsed times of each individual API call for experiments 1, 2, and 3 for a particular run, for inter-process communication; experiment 0 is also included (top-left).
Figure 2 (above). Summary for each method/experiment for all 4 runs (top-left). And aggregate versions for the experiment graphs above (for experiments 1 and 2, this is the sum of each branch, and the total time would be the max of these sums; for experiment 3, this is the max of each loop, and the total time would be the sum of all these maxes). So in experiment 3, almost every iteration of the outer loop is taking around 1 second, meaning there is one parallel API call in every iteration that is taking 1 second...
WHAT I EXPECTED
The best performance I got was experiment 2 with inter-process communication (the best run was around 8 seconds in total). Since the experiment 2 runs were better than the experiment 1 runs, perhaps there is some optimization behind the scenes on the field value: experiment 1 could have different branches clash by calling the same field value at the same point in time, whereas each branch in experiment 2 calls its own unique field value at any point in time.
I understand that the backend will be restricted to a certain number of calls per time period, so the spikes I see in experiments 1 and 2 make sense (and explain why there are almost no spikes in experiment 0).
That's why I thought that, in experiment 3 with inter-process communication, I am only making 13 API calls in parallel at any single point in time (for a single part ID, each branch having its own field value), and not proceeding to the next part ID until all 13 are done. This seemed like fewer API calls per time period than experiment 2, which continuously makes calls on each branch. So for experiment 3, I expected few spikes and all 13 calls to complete in roughly the time of a single API call (~20 ms).
But what actually happened is that experiment 3 took the most time, and the majority of API call times spike significantly (i.e. each iteration has one call taking around 1 second).
I also understand that experiments 1 and 2 only have 13 long-lasting parallel branches that last throughout the lifetime of a single run, whereas experiment 3 creates 13 new short-lived parallel branches ~260 times (so there would be around ~3400 short-lived parallel branches created throughout the lifetime of the run). If I were timing the task creation, I would understand the increased time due to overhead, but since I am timing the API call directly, how does this impact the API call itself?
QUESTION
Is there a possible explanation for why experiment 3 behaved this way? Or is there a way to profile/investigate this? I understand that this is not much to go on without knowing what happens behind the scenes of the API... But what I am asking is how to go about investigating this, if possible.
This may be because your experiment 3 uses two loops, and the objects created inside the loop have to be re-created on every iteration, increasing the workload.
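If you want to test that hypothesis, one option (a hypothetical sketch, not from the original post) is to time experiment 3's dispatch pattern with an empty delegate in place of the API call, so that any time you measure comes purely from task creation and synchronization:
// Time experiment 3's dispatch pattern with a no-op instead of the API call.
// If this alone is slow, task creation/synchronization overhead is a suspect.
var sw = Stopwatch.StartNew();
for (int i = 0; i < 260; i++)               // ~260 part IDs
{
    var tasks = new Task[13];               // 13 field values
    for (int j = 0; j < 13; j++)
        tasks[j] = Task.Run(() => { });     // no API call, just dispatch
    Task.WaitAll(tasks);
}
sw.Stop();
Console.WriteLine($"Dispatch overhead alone: {sw.ElapsedMilliseconds} ms");
If this alone accounts for a significant fraction of the 200 to 300 seconds, the per-iteration task creation is a strong suspect; if not, the cause is more likely on the API or WCF side.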
I'm writing a batching pipeline that processes X outstanding operations every Y seconds. It feels like System.Reactive would be a good fit for this, but I'm not able to get the subscriber to execute in parallel. My code looks like this:
var subject = new Subject<int>();
var concurrentCount = 0;
using var reader = subject
    .Buffer(TimeSpan.FromSeconds(1), 100)
    .Subscribe(list =>
    {
        var c = Interlocked.Increment(ref concurrentCount);
        if (c > 1) Console.WriteLine("Executing {0} simultaneous batches", c); // This never gets printed, because Subscribe is only ever called on a single thread.
        Interlocked.Decrement(ref concurrentCount);
    });
Parallel.For(0, 1_000_000, i =>
{
    subject.OnNext(i);
});
subject.OnCompleted();
Is there an elegant way to read from this buffered Subject, in a concurrent manner?
The Rx subscription code is always synchronous¹. What you need to do is to remove the processing code from the Subscribe delegate, and make it a side-effect of the observable sequence. Here is how it can be done:
Subject<int> subject = new();
int concurrentCount = 0;
Task processor = subject
    .Buffer(TimeSpan.FromSeconds(1), 100)
    .Select(list => Observable.Defer(() => Observable.Start(() =>
    {
        int c = Interlocked.Increment(ref concurrentCount);
        if (c > 1) Console.WriteLine($"Executing {c} simultaneous batches");
        Interlocked.Decrement(ref concurrentCount);
    })))
    .Merge(maxConcurrent: 2)
    .DefaultIfEmpty() // Prevents exception in corner case (empty source)
    .ToTask(); // or RunAsync (either one starts the processor)
for (int i = 0; i < 1_000_000; i++)
{
    subject.OnNext(i);
}
subject.OnCompleted();
processor.Wait();
The Select+Observable.Defer+Observable.Start combination converts the source sequence to an IObservable<IObservable<Unit>>. It's a nested sequence, with each inner sequence representing the processing of one list. When the delegate of the Observable.Start completes, the inner sequence emits a Unit value and then completes. The wrapping Defer operator ensures that the inner sequences are "cold", so that they are not started before they are subscribed. Then follows the Merge operator, which unwraps the outer sequence to a flat IObservable<Unit> sequence. The maxConcurrent parameter configures how many of the inner sequences will be subscribed concurrently. Every time an inner sequence is subscribed by the Merge operator, the corresponding Observable.Start delegate starts running on a ThreadPool thread.
If you set the maxConcurrent too high, the ThreadPool may run out of workers (in other words it may become saturated), and the concurrency of your code will then become dependent on ThreadPool availability. If you wish, you can increase the number of workers that the ThreadPool creates instantly on demand by using the ThreadPool.SetMinThreads method. But if your workload is CPU-bound and you increase the worker threads above the Environment.ProcessorCount value, then most probably your CPU will be saturated instead.
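For example (the numbers here are illustrative, not a recommendation):
// Ask the ThreadPool to create up to 16 workers on demand without the usual
// ramp-up delay; the method returns false if the values are out of range.
bool ok = ThreadPool.SetMinThreads(workerThreads: 16, completionPortThreads: 16);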
If your workload is asynchronous, you can replace the Observable.Defer+Observable.Start combo with the Observable.FromAsync operator, as shown here.
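A sketch of that substitution (ProcessAsync is a hypothetical async handler standing in for the work done per list):
Task processor = subject
    .Buffer(TimeSpan.FromSeconds(1), 100)
    .Select(list => Observable.FromAsync(() => ProcessAsync(list))) // hypothetical handler
    .Merge(maxConcurrent: 2)
    .DefaultIfEmpty()
    .ToTask();
No Defer is needed here, because Observable.FromAsync does not start the async method until the inner sequence is subscribed.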
¹ An unpublished library exists, AsyncRx.NET, that plays with the idea of asynchronous subscriptions. It is based on the new interfaces IAsyncObservable<T> and IAsyncObserver<T>.
You say this:
// This never gets printed, because Subscribe is only ever called on a single thread.
It's just not true. The reason nothing gets printed is that the code in the Subscribe runs in a locked manner: only one thread at a time executes inside a Subscribe, so you increment the value and then decrement it almost immediately. And since it starts at zero, it never has a chance to rise above 1.
Now that's just because of the Rx contract. Only one thread in subscribe at once.
We can fix that.
Try this code:
using var reader = subject
    .Buffer(TimeSpan.FromSeconds(1), 100)
    .SelectMany(list =>
        Observable
            .Start(() =>
            {
                var c = Interlocked.Increment(ref concurrentCount);
                Console.WriteLine("Starting {0} simultaneous batches", c);
            })
            .Finally(() =>
            {
                var c = Interlocked.Decrement(ref concurrentCount);
                Console.WriteLine("Ending {0} simultaneous batches", c);
            }))
    .Subscribe();
Now when I run it (with fewer than the 1_000_000 iterations that you set) I get output like this:
Starting 1 simultaneous batches
Starting 4 simultaneous batches
Ending 3 simultaneous batches
Ending 2 simultaneous batches
Starting 3 simultaneous batches
Starting 3 simultaneous batches
Ending 1 simultaneous batches
Ending 2 simultaneous batches
Starting 4 simultaneous batches
Starting 5 simultaneous batches
Ending 3 simultaneous batches
Starting 2 simultaneous batches
Starting 2 simultaneous batches
Ending 2 simultaneous batches
Starting 3 simultaneous batches
Ending 0 simultaneous batches
Ending 4 simultaneous batches
Ending 1 simultaneous batches
Starting 1 simultaneous batches
Starting 1 simultaneous batches
Ending 0 simultaneous batches
Ending 0 simultaneous batches
I want to explore some locations in parallel. In the master thread, before the parallel loop, I explored the first location and got new locations to explore (for example, 50 new locations). But I want to explore 100 locations (if there are that many new ones), so I have to change the stop condition of the loop.
Parallel.For(1, Locations.Count, new ParallelOptions { MaxDegreeOfParallelism = 4 },
    (index, state) =>
    {
        var newLocations = FindSomethingInLocation(Locations[index]);
        locationsInIndex(index, newLocations); // -> dictionary
        if (Locations.Count < 100)
        {
            Locations.Add(newLocations);
        }
    });
I want to explore 100 locations and get the elements for every explored location.
How can I do that? Is it even possible? Currently I am doing something like the example above, but if Locations has 50 elements at the start of the loop, the loop will explore only 50 locations; even though I add new elements to the Locations list, the stop condition does not change.
Maybe someone has a better idea for this?
And one other question about this: do you know how Parallel.For schedules iterations? Does the main thread divide the problem among N threads at the start? Or does it work more dynamically, so that when a thread finishes an iteration, it gets new work from the main thread?
Thank you all!
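One common pattern for a work list that grows while it is being processed (a sketch under stated assumptions: FindSomethingInLocation and the 100-location cap come from the question, everything else is illustrative, and FindSomethingInLocation is assumed to return the newly found locations) is to drain a thread-safe queue with a fixed set of workers instead of giving Parallel.For a fixed range:
var queue = new ConcurrentQueue<Location>(startLocations);
int claimed = 0;   // locations claimed for exploration, capped at 100
int inFlight = 0;  // locations currently being explored

var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    while (Volatile.Read(ref claimed) < 100)
    {
        if (queue.TryDequeue(out var location))
        {
            if (Interlocked.Increment(ref claimed) > 100) return; // cap reached
            Interlocked.Increment(ref inFlight);
            foreach (var found in FindSomethingInLocation(location))
                queue.Enqueue(found); // new work is picked up by any free worker
            Interlocked.Decrement(ref inFlight);
        }
        else if (Volatile.Read(ref inFlight) == 0)
        {
            return; // queue is empty and nobody is producing: all done
        }
        else
        {
            Thread.Yield(); // another worker may still enqueue new locations
        }
    }
})).ToArray();

Task.WaitAll(workers);
Because the workers pull from the queue rather than iterating over a pre-sliced range, the stop condition is re-evaluated as the list grows.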
In a LINQ Query, I have used .AsParallel as follows:
var completeReservationItems = from rBase in reservation.AsParallel()
                               join rRel in relationship.AsParallel() on rBase.GroupCode equals rRel.SourceGroupCode
                               join rTarget in reservation.AsParallel() on rRel.TargetCode equals rTarget.GroupCode
                               where rRel.ProgramCode == programCode && rBase.StartDate <= rTarget.StartDate && rBase.EndDate >= rTarget.EndDate
                               select new
                               {
                                   // initialize based on the query
                               };
Then, I created two separate Tasks and ran them in parallel, passing the same lists to both methods as follows:
Task getS1Status = Task.Factory.StartNew(() =>
{
    RunLinqQuery(params);
});
Task getS2Status = Task.Factory.StartNew(() =>
{
    RunLinqQuery(params);
});
Task.WaitAll(getS1Status, getS2Status);
I was capturing the timings and was surprised to see that the timings were as follows:
Above scenario: 6 sec (6000 ms)
Same code, running sequentially instead of 2 Tasks: 50 ms
Same code, but without .AsParallel() in the LINQ: 50 ms
I wanted to understand why this is taking so long in the above scenario.
Posting this as an answer only because I have some code to show.
Firstly, I don't know how many threads will be created with AsParallel(). The documentation doesn't say anything about it: https://msdn.microsoft.com/en-us/library/dd413237(v=vs.110).aspx
Imagine the following code:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
How many thread IDs will we see? For me, each run shows a different number of threads. Example output:
30 // only one thread!
Next time
27 // several threads
13
38
10
43
30
I think the number of threads depends on the current scheduler. We can always cap the number of threads by calling the WithDegreeOfParallelism method (https://msdn.microsoft.com/en-us/library/dd383719(v=vs.110).aspx), for example:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .WithDegreeOfParallelism(2)
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
Now the output will contain at most 2 thread IDs.
7
40
Why is this important? As I said, the number of threads can directly influence performance.
But this is not the only problem. In your first scenario, you are creating new tasks (which run on the thread pool and can add additional overhead), and then you are calling Task.WaitAll. Take a look at its source code: https://referencesource.microsoft.com/#mscorlib/system/threading/Tasks/Task.cs,72b6b3fa5eb35695 . I'm sure the loop over the tasks there adds additional overhead, and in a situation where AsParallel takes too many threads inside the first task, the next task may only be able to start later. Moreover, this CAN happen, so if you run your first scenario 1000 times, you will probably get very different results.
Finally, my last point is that you are trying to measure parallel code, and it is very hard to do that right. I don't recommend using parallel constructs as much as you can, because they can cause performance degradation if you don't know exactly what you are doing.
I'm investigating the Parallelism Break in a For loop.
After reading this and this I still have a question:
I'd expect this code:
Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Break();
});
to yield at most 6 numbers (0..5).
Not only does it not do that, but the results also have different lengths:
02351486
013542
0135642
Very annoying. (Where the hell is the Break() {after 5} here??)
So I looked at MSDN:
Break may be used to communicate to the loop that no other iterations after the current iteration need be run.
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000, all iterations less than 100 should still be run, but the iterations from 101 through to 1000 are not necessary.
Question #1:
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
Question #2:
Let's assume we are using Parallel + range partitioning (since the CPU cost doesn't vary between elements), so it divides the data among the threads. So if we have 4 cores (and perfect division among them):
core #1 got 0..250
core #2 got 251..500
core #3 got 501..750
core #4 got 751..1000
so the thread on core #1 will meet value=100 at some point and will break.
That will be its iteration number 100.
But the thread on core #4 got more quanta and is on 900 now; it is way beyond its 100th iteration.
It doesn't have any index less than 100 left to be stopped, so it will show them all.
Am I right? Is that the reason why I get more elements than expected in my example?
Question #3:
How can I truly break when (i == 5)?
P.S.
I mean, come on! When I call Break(), I want the loop to stop, exactly as it does in a regular for loop.
To yield at most 6 numbers (0..5).
The problem is that this won't yield at most 6 numbers.
What happens is, when you hit the iteration with index 5, you send the "break" request. Break() will cause the loop to no longer process any values > 5, but to process all values < 5.
However, any values greater than 5 which were already started will still get processed. Since the various indices are running in parallel, they're no longer ordered, so you get various runs where some values >5 (such as 8 in your example) are still being executed.
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread. Please confirm.
This is the index being passed into Parallel.For. Break() won't prevent already-started items from being processed, but it provides a guarantee that all items up to 100 get processed, while items above 100 may or may not get processed.
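If you want to see that guarantee in action, a small illustrative check (not from the original answer) is the loop result's LowestBreakIteration property:
// LowestBreakIteration reports the lowest iteration index from which Break()
// was called; all iterations below it are guaranteed to have run.
ParallelLoopResult result = Parallel.For(0, 1000, (i, state) =>
{
    if (i == 100) state.Break();
});
Console.WriteLine(result.LowestBreakIteration); // 100 (null if Break was never called)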
Am I right? Is that the reason why I get more elements than expected in my example?
Yes. If you use a partitioner like you've shown, then as soon as you call Break(), items beyond the one where you broke will no longer get scheduled. However, partitions that were already scheduled (each one being an entire range of items) will be processed fully. In your example, this means you're likely to always process all 1000 items.
How can I truly break when (i == 5)?
You are - but when you run in parallel, things change. What is the actual goal here? If you only want to process the first 6 items (0-5), you should restrict the items before you loop over them, via a LINQ query or similar. You can then process the 6 items with Parallel.For or Parallel.ForEach without a Break() and without worry.
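For instance, a minimal sketch of that suggestion:
// Restrict the input up front, then parallelize freely: no Break() needed.
var firstSix = Enumerable.Range(0, 10).Take(6); // the items 0..5
Parallel.ForEach(firstSix, i => Console.WriteLine(i));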
I mean, come on! When I call Break(), I want the loop to stop, exactly as it does in a regular for loop.
You should use Stop() instead of Break() if you want things to stop as quickly as possible. It will not interrupt items that are already running, but it will no longer schedule any items (including ones at lower indices or earlier in the enumeration than your current position).
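A short illustration of the difference, reusing the question's numbers:
Parallel.For(0, 10, (i, state) =>
{
    if (state.IsStopped) return;  // iterations already running can bail out early
    Console.WriteLine(i);
    if (i == 5) state.Stop();     // unlike Break(), no guarantee about lower indices
});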
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000
The 100th iteration of the loop is not necessarily (in fact probably not) the one with the index 99.
Your threads can and will run in an indeterminate order. When the .Break() instruction is encountered, no further loop iterations will be started. Exactly when that happens depends on the specifics of thread scheduling for a particular run.
I strongly recommend reading Patterns of Parallel Programming (a free PDF from Microsoft) to understand the design decisions and design tradeoffs that went into the TPL.
Which iterations? The overall iteration counter, or per thread?
Of all the iterations scheduled (or yet to be scheduled).
Remember the delegates may run out of order; there is no guarantee that iteration i == 5 will be the sixth to execute. In fact, that is unlikely to be the case except in rare circumstances.
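A quick demo of that (the output varies from run to run):
// The iteration order is not the index order; a typical run might print
// something like 0 5 1 6 2 7 3 8 4 9, or any other interleaving.
Parallel.For(0, 10, i => Console.Write(i + " "));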
Q2: Am I right?
No, the scheduling is not so simplistic. Rather, all the tasks are queued up and then the queue is processed. Each thread uses its own queue until it is empty, at which point it steals work from other threads. This leaves no way to predict which thread will process which delegate.
If the delegates are sufficiently trivial, it might all be processed on the original calling thread (no other thread gets a chance to steal work).
Q3: How can I truly break when (i == 5)?
Don't use concurrency if you want linear (strictly ordered) processing.
The Break method is there to support speculative execution: try various ways and stop as soon as any one completes.