Batch process:
1. Read text from a file or SQL.
2. Parse the text into words.
3. Load the words into SQL.
Today this runs on .NET 4.0.
Step 1 is very fast.
Steps 2 and 3 take about the same time (avg 0.1 seconds) for the same size file.
In step 3, the insert runs on a BackgroundWorker, and I wait for the last one to complete before starting the next.
Everything else runs on the main thread.
A big load will do this several million times.
Step 3 needs to be serial and in the same order as step 1.
This is to keep the SQL table's PK index from fragmenting.
I tried running step 3 in parallel, and the resulting index fragmentation killed it.
The data is fed sorted by the PK.
Other indexes are dropped at the start of the load and rebuilt at the end.
Where this process loses efficiency is when the size of the text changes, and the text size does change drastically from file to file: a long parse leaves step 3 idle, and a long insert leaves parsed files waiting.
What I would like is to queue steps 1 and 2 so that step 3 is kept as busy as possible:
Step 3 must dequeue the files in the order they were enqueued in step 1 (even if it has to wait).
There needs to be a maximum queue size for memory management (something like 4-10).
Step 2 should run in parallel, with up to 4 concurrent parses.
Moving to .NET 4.5.
I am asking for general guidance on how to implement this.
I am learning that this looks like a producer-consumer pattern; if it is not, please let me know so I can change the title.
I think TPL Dataflow would be a good way to do this:
For step 2, you would use a TransformBlock with MaxDegreeOfParallelism set to 4 and BoundedCapacity also set to 4, so that it never builds up a backlog of unparsed files. It will produce the items in the same order as they came in; you don't have to do anything special for that. For step 3, use an ActionBlock with BoundedCapacity set to your limit. Then link the two together and start sending items to the TransformBlock, ideally using something like await stepTwoBlock.SendAsync(…), to asynchronously wait if the queue is full.
In code, it would look something like:
async Task ProcessData()
{
    // step 2: parse in parallel (up to 4 at once); output order is preserved
    var stepTwoBlock = new TransformBlock<OriginalText, ParsedText>(
        text => Parse(text),
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,
            BoundedCapacity = 4
        });

    // step 3: serial inserts, with a bounded queue for memory management
    var stepThreeBlock = new ActionBlock<ParsedText>(
        text => LoadIntoDatabase(text),
        new ExecutionDataflowBlockOptions { BoundedCapacity = 10 });

    stepTwoBlock.LinkTo(
        stepThreeBlock, new DataflowLinkOptions { PropagateCompletion = true });

    // this is step one:
    foreach (var id in IdsToProcess)
    {
        OriginalText text = ReadText(id);
        await stepTwoBlock.SendAsync(text); // waits while the parse queue is full
    }

    stepTwoBlock.Complete();
    await stepThreeBlock.Completion;
}
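Because both blocks have a BoundedCapacity and step 1 uses SendAsync, backpressure propagates all the way back to the reading loop: when the ActionBlock's queue is full, the TransformBlock stops accepting new items, and the foreach awaits. That gives you the small fixed memory footprint you asked for, since steps 1 and 2 never run ahead of step 3 by more than the configured capacities.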
I am trying to optimize a data collection process in C#. I would like to understand why a certain method of parallelism I am trying is not working as expected (more details below; see "Question" section at the very bottom)
BACKGROUND
I have an external .NET Framework DLL, which is an API for an external data source; I can only consume the API, I do not have access to what goes on behind the scenes.
The API provides a function like: GetInfo(string partID, string fieldValue). Using this function, I can get information about a specific part, filtered for a single field/criteria. One single call (for just one part ID and one field value) takes around 20 milliseconds in an optimal case.
A part can have many values for the field criteria. So in order to get all the info for a part, I have to enumerate through all possible field values (13 in this case). And to get all the info for many parts (~260), I have to enumerate through all the part IDs and all the field values.
I already have all the part IDs and possible field values. The problem is performance: the serial approach (2 nested for-loops) is too slow, taking ~70 seconds (which is consistent with ~260 part IDs * 13 field values ≈ 3,400 calls at ~20 ms each ≈ 68 seconds). I would like to get the time down to ~5 seconds.
WHAT I HAVE TRIED
For different mediums of parallelizing work, I have tried:
calling API in parallel via Tasks within a single main application.
calling API in parallel via Parallel.ForEach within a single main application.
wrapping the API call with a WCF service and running multiple WCF service instances (just like having multiple Tasks, but with multiple processes instead); the single main client application then calls the API in parallel through these services.
For different logic of parallelizing work, I have tried:
experiment 0 has 2 nested for-loops; this is the base case without any parallel calls (so that's ~260 part IDs * 13 field values = ~3400 API calls in series).
experiment 1 has 13 parallel branches, and each branch deals with smaller 2 nested for-loops; essentially, it is dividing experiment 0 into 13 parallel branches (so rather than iterating over ~260 part IDs * 13 field values, each branch will iterate over ~20 part IDs * all 13 field values = ~260 API calls in series per branch).
experiment 2 has 13 parallel branches, and each branch deals with ALL part IDs but only 1 specific field value for each branch (so each branch will iterate over ~260 part IDs * 1 field value = ~260 API calls in series per branch).
experiment 3 has 1 for-loop which iterates over the part IDs in series, but inside the loop makes 13 parallel calls (for 13 field values); only when all 13 info is retrieved for one part ID will the loop move on to the next part ID.
I have tried experiments 1, 2, and 3 combined with the different mediums (Tasks, Parallel.ForEach, separate processes via WCF Services); so there is a total of 9 combinations. Plus the base case experiment 0 which is just 2 nested for-loops (no parallelizing there).
I also ran each combination 4 times (each time with a different set of ~260 part IDs), to test for repeatability.
In every experiment/medium combination, I am timing only the direct API call using Stopwatch; so the time is not affected by any other parts of the code (like Task creation, etc.).
Here is how I am wrapping the API call in WCF service (also shows how I am timing the API call):
public async Task<Info[]> GetInfosAsync(string[] partIDs, string[] fieldValues)
{
    Info[] infos = new Info[partIDs.Length * fieldValues.Length];
    await Task.Run(() =>
    {
        for (int i = 0; i < partIDs.Length; i++)
        {
            for (int j = 0; j < fieldValues.Length; j++)
            {
                Stopwatch timer = new Stopwatch();
                timer.Restart();
                infos[i * fieldValues.Length + j] = api.GetInfo(partIDs[i], fieldValues[j]);
                timer.Stop();
                // log timer.ElapsedMilliseconds to file (each parallel branch writes to its own file)
            }
        }
    });
    return infos;
}
And to better illustrate the 3 different experiments, here is how they are structured. These are run from the main application. I am only including how the experiments were done using the inter-process communication (GetInfosAsync defined above), as that gave me the most significant results (as explained under "Results" further below).
// experiment 1
Task<Info[]>[] tasks = new Task<Info[]>[numBranches]; // numBranches = 13
for (int k = 0; k < numBranches; k++)
{
    // each service/branch gets partIDsForBranch[k] (a subset of ~20 partIDs used only by branch k) and all 13 fieldValues
    tasks[k] = services[k].GetInfosAsync(partIDsForBranch[k], fieldValues);
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]

// experiment 2
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int j = 0; j < fieldValues.Length; j++)
{
    // each service/branch gets all ~260 partIDs and only 1 unique fieldValue
    tasks[j] = services[j].GetInfosAsync(partIDs, new string[] { fieldValues[j] });
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]

// experiment 3
for (int i = 0; i < partIDs.Length; i++)
{
    Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
    for (int j = 0; j < fieldValues.Length; j++)
    {
        // each branch/service gets the currently iterated partID and only 1 unique fieldValue
        tasks[j] = services[j].GetInfosAsync(new string[] { partIDs[i] }, new string[] { fieldValues[j] });
    }
    Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]
}
RESULTS
For experiments 1 and 2...
Task (within the same application) and Parallel.ForEach (within the same application) perform almost the same as the base case experiment (approximately 70 to 80 seconds).
inter-process communication (i.e. making parallel calls to multiple WCF services separate from the main application) performs significantly better than Task/Parallel.ForEach. This made sense to me (I've read about how multi-process could potentially be faster than multi-thread). Experiment 2 performs better than experiment 1, with the best experiment 2 run being around 8 seconds.
For experiment 3...
Task and Parallel.ForEach (within the same application) perform close to their experiment 1 and 2 counterparts (but around 10 to 20 seconds slower).
inter-process communication was significantly worse compared to all other experiments, taking around 200 to 300 seconds in total. This is the result I don't understand (see "What I Expected" section further below).
The graphs below give a visual representation of these results. Except for the bar chart summary, I only included the charts for the inter-process communication results, since those gave the most significant (good and bad) results.
Figure 1 (above). Elapsed times of each individual API call for experiments 1, 2, and 3 for a particular run, for inter-process communication; experiment 0 is also included (top-left).
Figure 2 (above). Summary for each method/experiment across all 4 runs (top-left), plus aggregate versions of the experiment graphs above (for experiments 1 and 2 this is the sum over each branch, and the total time is the max of these sums; for experiment 3 this is the max within each loop iteration, and the total time is the sum of these maxes). So in experiment 3, almost every iteration of the outer loop takes around 1 second, meaning every iteration has one parallel API call that takes about 1 second...
WHAT I EXPECTED
The best performance I got was experiment 2 with inter-process communication (the best run was around 8 seconds in total). Since experiment 2 runs were better than experiment 1 runs, perhaps there is some behind-the-scenes optimization on the field value: experiment 1 could have different branches clash by querying the same field value at the same time, whereas in experiment 2 each branch queries its own unique field value at any point in time.
I understand that the backend will be restricted to a certain number of calls per time period, so the spikes I see in experiments 1 and 2 make sense (as does the near absence of spikes in experiment 0).
That's why I thought that, for experiment 3 with inter-process communication, I am only making 13 API calls in parallel at any single point in time (for a single part ID, each branch having its own field value), not proceeding to the next part ID until all 13 are done. That seemed like fewer API calls per time period than experiment 2, which continuously makes calls on each branch. So for experiment 3 I expected few spikes, with all 13 calls completing in roughly the time of a single API call (~20 ms).
But what actually happened is that experiment 3 took the most time, with the majority of API call times spiking significantly (i.e. each iteration has a call taking around 1 second).
I also understand that experiments 1 and 2 have only 13 long-lived parallel branches that last for the lifetime of a single run, whereas experiment 3 creates 13 new short-lived parallel branches ~260 times (so around ~3400 short-lived branches over the lifetime of the run). If I were timing the task creation, I would understand the increased time due to overhead; but since I am timing the API call directly, how does this impact the API call itself?
QUESTION
Is there a possible explanation for why experiment 3 behaved this way? Or is there a way to profile/investigate this? I understand this is not much to go on without knowing what happens behind the scenes of the API, but what I am asking is how to go about investigating it, if possible.
This may be because your experiment 3 uses two loops: the task array and argument objects are recreated on every iteration of the outer loop, so each iteration pays that setup cost again, increasing the workload.
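For illustration, this is what that would look like applied to the experiment 3 code from your question, with the task array allocated once outside the outer loop (just a sketch reusing the services/partIDs/fieldValues names from the question; I have not verified that this removes the spikes):
// experiment 3 with the task array hoisted out of the outer loop,
// so each of the ~260 iterations reuses the same array
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int i = 0; i < partIDs.Length; i++)
{
    for (int j = 0; j < fieldValues.Length; j++)
    {
        tasks[j] = services[j].GetInfosAsync(
            new string[] { partIDs[i] }, new string[] { fieldValues[j] });
    }
    Task.WaitAll(tasks); // wait for all 13 field values of this part ID
}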
I am using the TPL a lot and have a large dataflow pipeline structure.
As part of the pipeline network I want to write some data to Azure blob storage. We have a lot of data, so we have 4 storage accounts, and we want to distribute the data evenly between them.
I wanted to keep using the dataflow pipeline pattern, so I want to implement a source block that, when linked to several target blocks, sends messages to them round-robin. BufferBlock is not good enough because it sends each message to the first block that accepts it, and assuming all the target blocks have large bounded capacities, all the messages will go to the first target block. BroadcastBlock is not good either because I don't want duplicates.
Any recommendations? Implementing the ISourceBlock interface with round-robin behavior does not look simple, and I wondered if there are simpler solutions out there, or any extensions to TPL that I am not familiar with?
Are you aware of the possibility of linking blocks with a predicate? Here is a very simple and not well tested solution as a sample:
var buffer = new BufferBlock<int>();
var consumer1 = new ActionBlock<int>(i => Console.WriteLine($"First: {i}"));
var consumer2 = new ActionBlock<int>(i => Console.WriteLine($"Second: {i}"));
var consumer3 = new ActionBlock<int>(i => Console.WriteLine($"Third: {i}"));
var consumer4 = new ActionBlock<int>(i => Console.WriteLine($"Fourth: {i}"));

// round-robin counter: a message is accepted by consumer n when count % 4 == n
var count = 0;
bool Predicate(int consumerIndex)
{
    if (count % 4 != consumerIndex)
        return false;
    count++;
    return true;
}

buffer.LinkTo(consumer1, i => Predicate(0));
buffer.LinkTo(consumer2, i => Predicate(1));
buffer.LinkTo(consumer3, i => Predicate(2));
buffer.LinkTo(consumer4, i => Predicate(3));
buffer.LinkTo(DataflowBlock.NullTarget<int>()); // fallback so no message gets stuck

for (var i = 0; i < 10; ++i)
{
    buffer.Post(i);
}

buffer.Complete();
buffer.Completion.Wait(); // without Complete() this would wait forever
One of the outputs:
Third: 2
First: 0
Fourth: 3
Second: 1
Second: 5
Second: 9
Third: 6
Fourth: 7
First: 4
First: 8
What is going on here is that you maintain a count of operations, and when the current count makes a message suitable for a consumer, you accept it and increment the count. Note that you should still link the block without any predicate at least once, to avoid memory issues (messages that no predicate accepts would otherwise sit in the buffer forever); it is also a good idea to test the round-robin with a block that monitors for lost messages.
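If the predicate trick feels too fragile, another option is a small forwarding block that explicitly sends each message to the next target in turn. A rough sketch (this CreateRoundRobinBlock helper is my own invention, not something built into TPL Dataflow):
// forwards each incoming message to targets[0], targets[1], ... in turn;
// SendAsync waits if the chosen target is at its bounded capacity.
// ActionBlock processes one item at a time by default, so 'next' needs no locking
ITargetBlock<T> CreateRoundRobinBlock<T>(IReadOnlyList<ITargetBlock<T>> targets)
{
    int next = 0;
    return new ActionBlock<T>(async item =>
    {
        var target = targets[next];
        next = (next + 1) % targets.Count;
        await target.SendAsync(item);
    });
}
Link your pipeline to the block this helper returns, and pass it the four blob-writing target blocks.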
I want to explore some locations in parallel. In the master thread, before the parallel loop, I explore the first location and get new locations to explore (for example, 50 new locations). But I want to explore 100 locations (if that many new locations turn up), so I have to change the stop condition of the loop.
Parallel.For(1, Locations.Count, new ParallelOptions { MaxDegreeOfParallelism = 4 },
    (index, state) =>
    {
        var newLocations = FindSomethingInLocation(Locations[index]);
        locationsInIndex(index, newLocations); // -> dictionary
        if (Locations.Count < 100)
        {
            Locations.AddRange(newLocations);
        }
    });
I want to explore 100 locations and get the elements for every explored location.
How can I do that while the loop is running? Is it even possible? Currently I am doing something like the example above, but if Locations has 50 elements at the start of the loop, the loop will explore only 50 locations; even though I add new elements to the Locations list, the stop condition does not change.
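The closest I can come up with is to process the list in waves, something like the sketch below (initialLocations stands in for my starting list, and I assume FindSomethingInLocation returns the newly found locations), but I do not know if this is the right approach:
// process the list in waves, so locations discovered in one wave are
// picked up by the next wave, stopping once 100 locations are explored
var locations = new List<Location>(initialLocations);
int explored = 0;
while (explored < locations.Count && explored < 100)
{
    int from = explored;
    int to = Math.Min(locations.Count, 100);
    var discovered = new ConcurrentBag<Location>();
    Parallel.For(from, to, new ParallelOptions { MaxDegreeOfParallelism = 4 }, index =>
    {
        foreach (var found in FindSomethingInLocation(locations[index]))
            discovered.Add(found);
    });
    explored = to;
    locations.AddRange(discovered); // safe: the parallel loop has finished
}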
Maybe someone has a better idea for this solution?
And one more question about this: do you know how Parallel.For schedules its iterations? Does the main thread divide the problem between N threads at the start, or is it more dynamic, with a thread getting new work once it finishes an iteration?
Thank you all!
In a LINQ Query, I have used .AsParallel as follows:
var completeReservationItems = from rBase in reservation.AsParallel()
                               join rRel in relationship.AsParallel() on rBase.GroupCode equals rRel.SourceGroupCode
                               join rTarget in reservation.AsParallel() on rRel.TargetCode equals rTarget.GroupCode
                               where rRel.ProgramCode == programCode && rBase.StartDate <= rTarget.StartDate && rBase.EndDate >= rTarget.EndDate
                               select new Object
                               {
                                   // initialize based on the query
                               };
Then, I have created two separate Tasks and was running them in parallel, passing the same Lists to both the methods as follows:
Task getS1Status = Task.Factory.StartNew(() =>
{
    RunLinqQuery(params);
});

Task getS2Status = Task.Factory.StartNew(() =>
{
    RunLinqQuery(params);
});

Task.WaitAll(getS1Status, getS2Status);
I was capturing the timings and was surprised to see that the timings were as follows:
Above scenario: 6 sec (6000 ms)
Same code, running sequentially instead of 2 Tasks: 50 ms
Same code, but without .AsParallel() in the LINQ: 50 ms
I wanted to understand why this is taking so long in the above scenario.
Posting this as an answer only because I have some code to show.
Firstly, I don't know how many threads will be created with AsParallel(). The documentation doesn't say anything about it: https://msdn.microsoft.com/en-us/library/dd413237(v=vs.110).aspx
Imagine the following code:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
How many thread IDs will we see? For me it is a different number of threads each time; example output:
30 // only one thread!
Next time
27 // several threads
13
38
10
43
30
I think the number of threads depends on the current scheduler. We can always cap the number of threads by calling the WithDegreeOfParallelism method (https://msdn.microsoft.com/en-us/library/dd383719(v=vs.110).aspx), for example:
void RunMe()
{
    foreach (var threadId in Enumerable.Range(0, 100)
        .AsParallel()
        .WithDegreeOfParallelism(2)
        .Select(x => Thread.CurrentThread.ManagedThreadId)
        .Distinct())
        Console.WriteLine(threadId);
}
Now the output will contain at most 2 threads:
7
40
Why is this important? As I said, the number of threads can directly influence performance.
But that is not the only problem. In your first scenario, you are creating new tasks (which run on the thread pool and can add overhead) and then calling Task.WaitAll. Take a look at its source code: https://referencesource.microsoft.com/#mscorlib/system/threading/Tasks/Task.cs,72b6b3fa5eb35695 . I'm sure the loop over the tasks there adds some overhead, and in a situation where AsParallel takes too many threads inside the first task, the next task can be delayed. Moreover, this CAN happen but won't always, so if you run your first scenario 1000 times, you will probably get very different results.
So my last point is that you are trying to measure parallel code, and it is very hard to do that right. I do not recommend using parallel constructs everywhere you can, because they can cause performance degradation if you don't know exactly what you are doing.
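If you do want to compare the scenarios anyway, measure each one many times and look at the spread, not at a single number. A sketch of what I mean (the scenario delegate stands in for your RunLinqQuery call):
// runs a scenario repeatedly and prints the spread of timings
static void Measure(string name, Action scenario, int runs = 100)
{
    var timings = new List<long>();
    for (int i = 0; i < runs; i++)
    {
        var sw = Stopwatch.StartNew();
        scenario();
        sw.Stop();
        timings.Add(sw.ElapsedMilliseconds);
    }
    Console.WriteLine($"{name}: min={timings.Min()} ms, max={timings.Max()} ms, avg={timings.Average():F1} ms");
}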
I use 2 nested Parallel.ForEach loops to retrieve information quickly from a URL. This is the code:
while (searches.Count > 0)
{
    Parallel.ForEach(searches, (search, loopState) =>
    {
        Parallel.ForEach(search.items, item =>
        {
            RetrieveInfo(item);
        });
    });
}
The outer ForEach has a list of, for example, 10, whilst the inner one has a list of 5. This means I am going to query the URL 50 times, but I query it 5 times simultaneously (in the inner ForEach).
I need to add a delay to the inner loop so that after it queries the URL, it waits for x seconds: the time taken for the inner loop to complete its 5 requests.
Using Thread.Sleep is not a good idea because it will block the complete thread and possibly the other parallel tasks.
Is there an alternative that might work?
To my understanding, you have 50 tasks and you wish to process 5 of them at a time.
If so, you should look into ParallelOptions.MaxDegreeOfParallelism, to process the 50 tasks with a maximum degree of parallelism of 5. When one task finishes, another is permitted to start.
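For example, a minimal sketch (assuming the searches, items, and RetrieveInfo names from your question), flattening the nested loops so at most 5 requests are in flight at any moment:
// flatten the nested lists, then let Parallel.ForEach keep at most
// 5 RetrieveInfo calls running at any one time
var allItems = searches.SelectMany(search => search.items).ToList();
Parallel.ForEach(
    allItems,
    new ParallelOptions { MaxDegreeOfParallelism = 5 },
    item => RetrieveInfo(item));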
If you wish to have tasks processed in chunks of five, followed by another chunk of five (that is, processing the chunks serially), then you would want code similar to:
// allItems is the flattened list from above
for (int i = 0; i < allItems.Count; i += 5)
{
    Parallel.ForEach(
        allItems.Skip(i).Take(5), // the next set of 5
        // optionally pass a ParallelOptions argument here as well
        item => RetrieveInfo(item));
}