In my code I have a method such as:
void PerformWork(List<Item> items)
{
HostingEnvironment.QueueBackgroundWorkItem(async cancellationToken =>
{
foreach (var item in items)
{
await itemHandler.PerformIndividualWork(item);
}
});
}
Where Item is just a known model and itemHandler just does some work based off of the model (the ItemHandler class is defined in a separately maintained code base as nuget pkg I'd rather not modify).
The purpose of this code is to have work done for a list of items in the background but synchronously.
As part of the work, I would like to create a unit test to verify that when this method is called, the items are handled synchronously. I'm pretty sure the issue can be simplified down to this:
await MyTask(1);
await MyTask(2);
Assert.IsTrue(/* MyTask with arg 1 was completed before MyTask with arg 2 */);
The first part of this code I can easily unit test is that the sequence is maintained. For example, using NSubstitute I can check method call order on the library code:
Received.InOrder(() =>
{
itemHandler.PerformIndividualWork(Arg.Is<Item>(arg => arg.Name == "First item"));
itemHandler.PerformIndividualWork(Arg.Is<Item>(arg => arg.Name == "Second item"));
itemHandler.PerformIndividualWork(Arg.Is<Item>(arg => arg.Name == "Third item"));
});
But I'm not quite sure how to ensure that they aren't run in parallel. I've had several ideas which seem bad like mocking the library to have an artificial delay when PerformIndividualWork is called and then either checking a time elapsed on the whole background task being queued or checking the timestamps of the itemHandler received calls for a minimum time between the calls. For instance, if I have PerformIndividualWork mocked to delay 500 milliseconds and I'm expecting three items, then I could check elapsed time:
stopwatch.Start();
// I have an interface instead of directly calling HostingEnvironment, so I can access the task being queued here
backgroundTask.Invoke(...);
stopwatch.Stop();
Assert.IsTrue(stopwatch.ElapsedMilliseconds > 1500);
But that doesn't feels right and could lead to false positives. Perhaps the solution lies in modifying the code itself; however, I can't think of a way of meaningfully changing it to make this sort of unit test (testing tasks are run in order) possible. We'll definitely have system/integration testing to ensure the issue caused by asynchronous performance of the individual items doesn't happen, but I would like to hit testing here at this level as well.
Not sure if this is a good idea, but one approach could be to use an itemHandler that will detect when items are handled in parallel. Here is a quick and dirty example:
public class AssertSynchronousItemHandler : IItemHandler
{
private volatile int concurrentWork = 0;
public List<Item> Items = new List<Item>();
public Task PerformIndividualWork(Item item) =>
Task.Run(() => {
var result = Interlocked.Increment(ref concurrentWork);
if (result != 1) {
throw new Exception($"Expected 1 work item running at a time, but got {result}");
}
Items.Add(item);
var after = Interlocked.Decrement(ref concurrentWork);
if (after != 0) {
throw new Exception($"Expected 0 work items running once this item finished, but got {after}");
}
});
}
There are probably big problems with this, but the basic idea is to check how many items are already being handled when we enter the method, then decrement the counter and check there are still no other items being handled. With threading stuff I think it is very hard to make guarantees about things from tests alone, but with enough items processed this can give us a little confidence that it is working as expected:
[Fact]
public void Sample() {
var handler = new AssertSynchronousItemHandler();
var subject = new Subject(handler);
var input = Enumerable.Range(0, 100).Select(x => new Item(x.ToString())).ToList();
subject.PerformWork(input);
// With the code from the question we don't have a way of detecting
// when `PerformWork` finishes. If we can't change this we need to make
// sure we wait "long enough". Yes this is yuck. :)
Thread.Sleep(1000);
Assert.Equal(input, handler.Items);
}
If I modify PerformWork to do things in parallel I get the test failing:
public void PerformWork2(List<Item> items) {
Task.WhenAll(
items.Select(item => itemHandler.PerformIndividualWork(item))
).Wait(2000);
}
// ---- System.Exception : Expected 1 work item running at a time, but got 4
That said, if it is very important to run synchronously and it is not apparent from glancing at the implementation with async/await then maybe it is worth using a more obviously synchronous design, like a queue serviced by only one thread, so that you're guaranteed synchronous execution by design and people won't inadvertently change it to async during refactoring (i.e. it is deliberately synchronous and documented that way).
Related
Consider this situation:
class Product { }
interface IWorker
{
Task<Product> CreateProductAsync();
}
I am now given an IEnumerable<IWorker> workers and am supposed to create an IEnumerable<Product> from it that I have to pass to some other function that I cannot alter:
void CheckProducts(IEnumerable<Product> products);
This methods needs to have access to the entire IEnumerable<Product>. It is not possible to subdivide it and call CheckProducts on multiple subsets.
One obvious solution is this:
CheckProducts(workers.Select(worker => worker.CreateProductAsync().Result));
But this is blocking, of course, and hence it would only be my last resort.
Syntactically, I need precisely this, just without blocking.
I cannot use await inside of the function I'm passing to Select() as I would have to mark it as async and that would require it to return a Task itself and I would have gained nothing. In the end I need an IEnumerable<Product> and not an IEnumerable<Task<Product>>.
It is important to know that the order of the workers creating their products does matter, their work must not overlap. Otherwise, I would do this:
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
var tasks = workers.Select(worker => worker.CreateProductAsync());
return await Task.WhenAll(tasks);
}
But unfortunately, Task.WhenAll() executes some tasks in parallel while I need them executed sequentially.
Here is one possibility to implement it if I had an IReadOnlyList<IWorker> instead of an IEnumerable<IWorker>:
async Task<IEnumerable<Product>> CreateProductsAsync(IReadOnlyList<IWorker> workers)
{
var resultList = new Product[workers.Count];
for (int i = 0; i < resultList.Length; ++i)
resultList[i] = await workers[i].CreateProductAsync();
return resultList;
}
But I must deal with an IEnumerable and, even worse, it is usually quite huge, sometimes it is even unlimited, yielding workers forever. If I knew that its size was decent, I would just call ToArray() on it and use the method above.
The ultimate solution would be this:
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
foreach (var worker in workers)
yield return await worker.CreateProductAsync();
}
But yield and await are incompatible as described in this answer. Looking at that answer, would that hypothetical IAsyncEnumerator help me here? Does something similar meanwhile exist in C#?
A summary of the issues I'm facing:
I have a potentially endless IEnumerable<IWorker>
I want to asynchronously call CreateProductAsync() on each of them in the same order as they are coming in
In the end I need an IEnumerable<Product>
A summary of what I already tried, but doesn't work:
I cannot use Task.WhenAll() because it executes tasks in parallel.
I cannot use ToArray() and process that array manually in a loop because my sequence is sometimes endless.
I cannot use yield return because it's incompatible with await.
Does anybody have a solution or workaround for me?
Otherwise I will have to use that blocking code...
IEnumerator<T> is a synchronous interface, so blocking is unavoidable if CheckProducts enumerates the next product before the next worker has finished creating the product.
Nevertheless, you can achieve parallelism by creating products on another thread, adding them to a BlockingCollection<T>, and yielding them on the main thread:
static IEnumerable<Product> CreateProducts(IEnumerable<IWorker> workers)
{
var products = new BlockingCollection<Product>(3);
Task.Run(async () => // On the thread pool...
{
foreach (IWorker worker in workers)
{
Product product = await worker.CreateProductAsync(); // Create products serially.
products.Add(product); // Enqueue the product, blocking if the queue is full.
}
products.CompleteAdding(); // Notify GetConsumingEnumerable that we're done.
});
return products.GetConsumingEnumerable();
}
To avoid unbounded memory consumption, you can optionally specify the capacity of the queue as a constructor argument to BlockingCollection<T>. I used 3 in the code above.
The Situation:
Here you're saying you need to do this synchronously, because IEnumerable doesn't support async and the requirements are you need an IEnumerable<Product>.
I am now given an IEnumerable workers and am supposed to
create an IEnumerable from it that I have to pass to some
other function that I cannot alter:
Here you say the entire product set needs to be processed at the same time, presumably making a single call to void CheckProducts(IEnumerable<Product> products).
This methods needs to check the entire Product set as a whole. It is
not possible to subdivide the result.
And here you say the enumerable can yield an indefinite number of items
But I must deal with an IEnumerable and, even worse, it is usually
quite huge, sometimes it is even unlimited, yielding workers forever.
If I knew that its size was decent, I would just call ToArray() on it
and use the method above.
So lets put these together. You need to do asynchronous processing of an indefinite number of items within a synchronous environment and then evaluate the entire set as a whole... synchronously.
The Underlying Problems:
1: To evaluate a set as a whole, it must be completely enumerated. To completely enumerate a set, it must be finite. Therefore it is impossible to evaluate an infinite set as a whole.
2: Switching back and forth between sync and async forces the async code to run synchronously. that might be ok from a requirements perspective, but from a technical perspective it can cause deadlocks (maybe unavoidable, I don't know. Look that up. I'm not the expert).
Possible Solutions to Problem 1:
1: Force the source to be an ICollection<T> instead of IEnumerable<T>. This enforces finiteness.
2: Alter the CheckProducts algorithm to process iteratively, potentially yielding intermediary results while still maintaining an ongoing aggregation internally.
Possible Solutions to Problem 2:
1: Make the CheckProducts method asynchronous.
2: Make the CreateProduct... method synchronous.
Bottom Line
You can't do what you're asking how you're asking, and it sounds like someone else is dictating your requirements. They need to change some of the requirements, because what they're asking for is (and I really hate using this word) impossible. Is it possible you have misinterpreted some of the requirements?
Two ideas for you OP
Multiple call solution
If you are allowed to call CheckProducts more than once, you could simply do this:
foreach (var worker in workers)
{
var product = await worker.CreateProductAsync();
CheckProducts(new [] { product } );
}
If it adds value, I'm pretty sure you could work out a way to do it in batches of, say, 100 at a time, too.
Thread pool solution
If you are not allowed to call CheckProducts more than once, and not allowed to modify CheckProducts, there is no way to force it to yield control and allow other continuations to run. So no matter what you do, you cannot force asynchronousness into the IEnumerable that you pass to it, not just because of the compiler checking, but because it would probably deadlock.
So here is a thread pool solution. The idea is to create one separate thread to process the products in series; the processor is async, so a call to CreateProductAsync() will still yield control to anything else that has been posted to the synchronization context, as needed. However it can't magically force CheckProduct to give up control, so there is still some possibility that it will block occasionally if it is able to check products faster than they are created. In my example I'm using Monitor.Wait() so the O/S won't schedule the thread until there is something waiting for it. You'll still be using up a thread resource while it blocks, but at least you won't be wasting CPU time in a busy-wait loop.
public static IEnumerable<Product> CreateProducts(IEnumerable<Worker> workers)
{
var queue = new ConcurrentQueue<Product>();
var task = Task.Run(() => ConvertProducts(workers.GetEnumerator(), queue));
while (true)
{
while (queue.Count > 0)
{
Product product;
var ok = queue.TryDequeue(out product);
if (ok) yield return product;
}
if (task.IsCompleted && queue.Count == 0) yield break;
Monitor.Wait(queue, 1000);
}
}
private static async Task ConvertProducts(IEnumerator<Worker> input, ConcurrentQueue<Product> output)
{
while (input.MoveNext())
{
var current = input.Current;
var product = await current.CreateProductAsync();
output.Enqueue(product);
Monitor.Pulse(output);
}
}
From your requirements I can put together the following:
1) Workers processed in order
2) Open to receive new Workers at any time
So using the fact that a dataflow TransformBlock has a built in queue and processes items in order. Now we can accept Workers from the producer at any time.
Next we make the result of the TransformBlockobservale so that the consumer can consume Products on demand.
Made some quick changes and started the consumer portion. This simply takes the observable produced by the Transformer and maps it to an enumerable that yields each product. For background here is the ToEnumerable().
The ToEnumerator operator returns an enumerator from an observable sequence. The enumerator will yield each item in the sequence as it is produced
Source
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
namespace ClassLibrary1
{
public class WorkerProducer
{
public async Task ProduceWorker()
{
//await ProductTransformer_Transformer.SendAsync(new Worker())
}
}
public class ProductTransformer
{
public IObservable<Product> Products { get; private set; }
public TransformBlock<Worker, Product> Transformer { get; private set; }
private Task<Product> CreateProductAsync(Worker worker) => Task.FromResult(new Product());
public ProductTransformer()
{
Transformer = new TransformBlock<Worker, Product>(wrk => CreateProductAsync(wrk));
Products = Transformer.AsObservable();
}
}
public class ProductConsumer
{
private ThirdParty ThirdParty { get; set; } = new ThirdParty();
private ProductTransformer Transformer { get; set; }
public ProductConsumer()
{
ThirdParty.CheckProducts(Transformer.Products.ToEnumerable());
}
public class Worker { }
public class Product { }
public class ThirdParty
{
public void CheckProducts(IEnumerable<Product> products)
{
}
}
}
Unless I misunterstood something, I don't see why you don't simply do it like this:
var productList = new List<Product>(workers.Count())
foreach(var worker in workers)
{
productList.Add(await worker.CreateProductAsync());
}
CheckProducts(productList);
What about if you simply keep clearing a List of size 1?
var productList = new List<Product>(1);
var checkTask = Task.CompletedTask;
foreach(var worker in workers)
{
await checkTask;
productList.Clear();
productList.Add(await worker.CreateProductAsync());
checkTask = Task.Run(CheckProducts(productList));
}
await checkTask;
You can use Task.WhenAll, but instead of returning result of Task.WhenAll, return collection of tasks transformed to the collection of results.
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
var tasks = workers.Select(worker => worker.CreateProductAsync()).ToList();
await Task.WhenAll(tasks);
return tasks.Select(task => task.Result);
}
Order of tasks will be persisted.
And seems like should be ok to go with just return await Task.WhenAll()
From docs of Task.WhenAll Method (IEnumerable>)
The Task.Result property of the returned task will be set to
an array containing all of the results of the supplied tasks in the
same order as they were provided...
If workers need to be executed one by one in the order they were created and based on requirement that another function need whole set of workers results
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
var products = new List<product>();
foreach (var worker in workers)
{
product = await worker.CreateProductAsync();
products.Add(product);
}
return products;
}
You can do this now with async, IEnumerable and LINQ but every method in the chain after the async would be a Task<T>, and you need to use something like await Task.WhenAll at the end. You can use async lambdas in the LINQ methods, which return Task<T>. You don't need to wait synchronously in these.
The Select will start your tasks sequentially i.e. they won't even exist as tasks until the select enumerates each one, and won't keep going after you stop enumerating. You could also run your own foreach over the enumerable of tasks if you want to await them all individually.
You can break out of this like any other foreach without it starting all of them, so this will also work on an infinite enumerable.
public async Task Main()
{
// This async method call could also be an async lambda
foreach (var task in GetTasks())
{
var result = await task;
Console.WriteLine($"Result is {result}");
if (result > 5) break;
}
}
private IEnumerable<Task<int>> GetTasks()
{
return GetNumbers().Select(WaitAndDoubleAsync);
}
private async Task<int> WaitAndDoubleAsync(int i)
{
Console.WriteLine($"Waiting {i} seconds asynchronously");
await Task.Delay(TimeSpan.FromSeconds(i));
return i * 2;
}
/// Keeps yielding numbers
private IEnumerable<int> GetNumbers()
{
var i = 0;
while (true) yield return i++;
}
Outputs, the following, then stops:
Waiting 0 seconds asynchronously
Result is 0
Waiting 1 seconds asynchronously
Result is 2
Waiting 2 seconds asynchronously
Result is 4
Waiting 3 seconds asynchronously
Result is 6
The important thing is that you can't mix yield and await in the same method, but you can yield Tasks returned from a method that uses await absolutely fine, so you can use them together just by splitting them into separate methods. Select is already a method that uses yield, so you may not need to write your own method for this.
In your post you were looking for a Task<IEnumerable<Product>>, but what you can actually use is a IEnumerable<Task<Product>>.
You can go even further with this e.g. if you had something like a REST API where one resource can have links to other resources, like if you just wanted to get a list of users of a group, but stop when you found the user you were interested in:
public async Task<IEnumerable<Task<User>>> GetUserTasksAsync(int groupId)
{
var group = await GetGroupAsync(groupId);
return group.UserIds.Select(GetUserAsync);
}
foreach (var task in await GetUserTasksAsync(1))
{
var user = await task;
...
}
There is no solution to your problem. You can't transform a deferred IEnumerable<Task<Product>> to a deferred IEnumerable<Product>, such that the consuming thread will not get blocked while enumerating the IEnumerable<Product>. The IEnumerable<T> is a synchronous interface. It returns an enumerator with a synchronous MoveNext method. The MoveNext returns bool, which is not an awaitable type. An asynchronous interface IAsyncEnumerable<T> exists, whose enumerator has an asynchronous MoveNextAsync method, with a return type of ValueTask<bool>. But you have explicitly said that you can't change the consuming method, so you are stuck with the IEnumerable<T> interface. No solution then.
try
workers.ForEach(async wrkr =>
{
var prdlist = await wrkr.CreateProductAsync();
//Remaing tasks....
});
My C# application stops responding for a long time, as I break the Debug it stops on a function.
foreach (var item in list)
{
xmldiff.Compare(item, secondary, output);
...
}
I guess the running time of this function is long or it hangs. Anyway, I want to wait for a certain time (e.g. 5 seconds) for the execution of this function, and if it exceeds this time, I skip it and go to the next item in the loop. How can I do it? I found some similar question but they are mostly for processes or asynchronous methods.
You can do it the brutal way: spin up a thread to do the work, join it with timeout, then abort it, if the join didn't work.
Example:
var worker = new Thread( () => { xmlDiff.Compare(item, secondary, output); } );
worker.Start();
if (!worker.Join( TimeSpan.FromSeconds( 1 ) ))
worker.Abort();
But be warned - aborting threads is not considered nice and can make your app unstable. If at all possible try to modify Compare to accept a CancellationToken to cancel the comparison.
I would avoid directly using threads and use Microsoft's Reactive Extensions (NuGet "Rx-Main") to abstract away the management of the threads.
I don't know the exact signature of xmldiff.Compare(item, secondary, output) but if I assume it produces an integer then I could do this with Rx:
var query =
from item in list.ToObservable()
from result in
Observable
.Start(() => xmldiff.Compare(item, secondary, output))
.Timeout(TimeSpan.FromSeconds(5.0), Observable.Return(-1))
select new { item, result };
var subscription =
query
.Subscribe(x =>
{
/* do something with `x.item` and/or `x.result` */
});
This automatically iterates through each item and starts a background computation of xmldiff.Compare, but only allows each computation to take as much as 5.0 seconds before returning a default value of -1.
The subscription variable is an IDisposable, so if you want to abort the entire query before it completes just call .Dispose().
I skip it and go to the next item in the loop
By "skip it", do you mean "leave it there" or "cancel it"? The two scenarios are quite different. But for both two I suggest you use Task.
//generate 10 example tasks
var tasks = Enumerable
.Range(0, 10)
.Select(n => new Task(() => DoSomething(n)))
.ToList();
var maxExecutionTime = TimeSpan.FromSeconds(5);
foreach (var task in tasks)
{
if (task.Wait(maxExecutionTime))
{
//the task is finished in time
}
else
{
// the task is over time
// just leave it there
// the loop continues
// if you want to cancel it, see
// http://stackoverflow.com/questions/4783865/how-do-i-abort-cancel-tpl-tasks
}
}
One thing to improve is "do you really need to run your tasks one by one?" If they are independent you can run them in parallel.
I have the following code:
var tasks = await taskSeedSource
.Select(taskSeed => GetPendingOrRunningTask(taskSeed, createTask, onFailed, onSuccess, sem))
.ToList()
.ToTask();
if (tasks.Count == 0)
{
return;
}
if (tasks.Contains(null))
{
tasks = tasks.Where(t => t != null).ToArray();
if (tasks.Count == 0)
{
return;
}
}
await Task.WhenAll(tasks);
Where taskSeedSource is a Reactive Observable. It could be that this code have many problems, but I see at least two:
I am collecting tasks whereas I could do without it.
Somehow, the returned tasks list may contain nulls, even though GetPendingOrRunningTask is an async method and hence never returns null. I failed to understand why it happens, so I had to defend against it without understanding the cause of the problem.
I would like to use the AsyncCountdownEvent from the AsyncEx framework instead of collecting the tasks and then awaiting on them.
So, I can pass the countdown event to GetPendingOrRunningTask which will increment it immediately and signal before returning after awaiting for the completion of its internal logic. However, I do not understand how to integrate the countdown event into the monad (that is the Reactive jargon, isn't it?).
What is the right way to do it?
EDIT
Guys, let us forget about the mysterious nulls in the returned list. Suppose everything is green and the code is
var tasks = await taskSeedSource
.Select(taskSeed => GetPendingOrRunningTask(taskSeed, ...))
.ToList()
.ToTask();
await Task.WhenAll(tasks);
Now the question is how do I do it with the countdown event? So, suppose I have:
var c = new AsyncCountdownEvent(1);
and
async Task GetPendingOrRunningTask<T>(AsyncCountdownEvent c, T taskSeed, ...)
{
c.AddCount();
try
{
await ....
}
catch (Exception exc)
{
// The exception is handled
}
c.Signal();
}
My problem is that I no longer need the returned task. These tasks where collected and awaited to get the moment when all the work items are over, but now the countdown event can be used to indicate when the work is over.
My problem is that I am not sure how to integrate it into the Reactive chain. Essentially, the GetPendingOrRunningTask can be async void. And here I am stuck.
EDIT 2
Strange appearance of a null entry in the list of tasks
#Servy is correct that you need to solve the null Task problem at the source. Nobody wants to answer a question about how to workaround a problem that violates the contracts of a method that you've defined yourself and yet haven't provided the source for examination.
As for the issue about collecting tasks, it's easy to avoid with Merge if your method returns a generic Task<T>:
await taskSeedSource
.Select(taskSeed => GetPendingOrRunningTask(taskSeed, createTask, onFailed, onSuccess, sem))
.Where(task => task != null) // According to you, this shouldn't be necessary.
.Merge();
However, unfortunately there's no official Merge overload for the non-generic Task but that's easy enough to define:
public static IObservable<Unit> Merge(this IObservable<Task> sources)
{
return sources.Select(async source =>
{
await source.ConfigureAwait(false);
return Unit.Default;
})
.Merge();
}
I'm trying to write my own scheduler; the rationale behind it is that all the submitted actions will be executed in order, according to a delay. For example, if at time 0 I schedule action A with delay 5 and at time 1 I schedule action B with delay 2, then B should be executed first at time 3 and A should be executed second, at time 5.
Basically, what I am trying to do is something like:
public class MyScheduler
{
Task _task = new Task(() => { });
public MyScheduler()
{
_task.Start();
}
public void Schedule(Action action, long delay)
{
Task.Delay(TimeSpan.FromTicks(delay)).ContinueWith(_ =>
lock(_task) {
_task = _task.ContinueWith(task => action())
}
);
}
}
A relevant test for this code would be:
var waiter = new Waiter(3);
int _count = 0;
mysched = new MyScheduler();
mysched.Schedule(() => { _count++; waiter.Signal(); });
mysched.Schedule(() => { Task.Delay(100).Wait(); _count *= 3; waiter.Signal(); });
mysched.Schedule(() => { _count++; waiter.Signal(); });
waiter.Await();
Assert.AreEqual(4, _count);
In the above code, Waiter is a class with an internal variable initialized in the constructor; the Signal method decrements that internal variable and the Await method loops (and sleeps 10 ms on each iteration) until the internal variable is less than or equal to zero.
The aim of the test is to show that the scheduled actions have been performed in order.
Most of the times this is true and the test passes, but on few occasions the resulting value for _count is 2 instead of 4. I have spent a lot of time trying to figure out why this happens, but I can't seem to figure it out and my lack of experience with C# is not helping either.
Does anyone have any suggestions?
For one thing, _count is not synchronized for access from different threads.
I recommend that you not use ContinueWith at all; it is a very low-level method and is very easy to get the details wrong (for example, the default scheduler is TaskScheduler.Current, which is almost never what you want). Your general logic code should use await instead of ContinueWith.
Regarding the scheduler, these days it is almost impossible to make a good use case for developing your own. There are better ones available that are developed by geniuses and extremely well-tested. Consider Reactive Extensions: they provide several schedulers, and they all support scheduling.
I have multiple enumerators that enumerate over flat files. I originally had each enumerator in a Parallel Invoke and each Action was adding to a BlockingCollection<Entity> and that collections was returning a ConsumingEnumerable();
public interface IFlatFileQuery
{
IEnumerable<Entity> Run();
}
public class FlatFile1 : IFlatFileQuery
{
public IEnumerable<Entity> Run()
{
// loop over a flat file and yield each result
yield return Entity;
}
}
public class Main
{
public IEnumerable<Entity> DoLongTask(ICollection<IFlatFileQuery> _flatFileQueries)
{
// do some other stuff that needs to be returned first:
yield return Entity;
// then enumerate and return the flat file data
foreach (var entity in GetData(_flatFileQueries))
{
yield return entity;
}
}
private IEnumerable<Entity> GetData(_flatFileQueries)
{
var buffer = new BlockingCollection<Entity>(100);
var actions = _flatFileQueries.Select(fundFileQuery => (Action)(() =>
{
foreach (var entity in fundFileQuery.Run())
{
buffer.TryAdd(entity, Timeout.Infinite);
}
})).ToArray();
Task.Factory.StartNew(() =>
{
Parallel.Invoke(actions);
buffer.CompleteAdding();
});
return buffer.GetConsumingEnumerable();
}
}
However after a bit of testing it turns out that the code change below is about 20-25% faster.
private IEnumerable<Entity> GetData(_flatFileQueries)
{
return _flatFileQueries.AsParallel().SelectMany(ffq => ffq.Run());
}
The trouble with the code change is that it waits till all flat file queries are enumerated before it returns the whole lot that can then be enumerated and yielded.
Would it be possible to yield in the above bit of code somehow to make it even faster?
I should add that at most the combined results of all the flat file queries might only be 1000 or so Entities.
Edit:
Changing it to the below doesn't make a difference to the run time. (R# even suggests to go back to the way it was)
private IEnumerable<Entity> GetData(_flatFileQueries)
{
foreach (var entity in _flatFileQueries.AsParallel().SelectMany(ffq => ffq.Run()))
{
yield return entity;
}
}
The trouble with the code change is that it waits till all flat file queries are enumerated before it returns the whole lot that can then be enumerated and yielded.
Let's prove that it's false by a simple example. First, let's create a TestQuery class that will yield a single entity after a given time. Second, let's execute several test queries in parallel and measure how long it took to yield their result.
public class TestQuery : IFlatFileQuery {
private readonly int _sleepTime;
public IEnumerable<Entity> Run() {
Thread.Sleep(_sleepTime);
return new[] { new Entity() };
}
public TestQuery(int sleepTime) {
_sleepTime = sleepTime;
}
}
internal static class Program {
private static void Main() {
Stopwatch stopwatch = Stopwatch.StartNew();
var queries = new IFlatFileQuery[] {
new TestQuery(2000),
new TestQuery(3000),
new TestQuery(1000)
};
foreach (var entity in queries.AsParallel().SelectMany(ffq => ffq.Run()))
Console.WriteLine("Yielded after {0:N0} seconds", stopwatch.Elapsed.TotalSeconds);
Console.ReadKey();
}
}
This code prints:
Yielded after 1 seconds
Yielded after 2 seconds
Yielded after 3 seconds
You can see with this output that AsParallel() will yield each result as soon as its available, so everything works fine. Note that you might get different timings depending on the degree of parallelism (such as "2s, 5s, 6s" with a degree of parallelism of 1, effectively making the whole operation not parallel at all). This output comes from an 4-cores machine.
Your long processing will probably scale with the number of cores, if there is no common bottleneck between the threads (such as a shared locked resource). You might want to profile your algorithm to see if there are slow parts that can be improved using tools such as dotTrace.
I don't think there is a red flag in your code anywhere. There are no outrageous inefficiencies. I think it comes down to multiple smaller differences.
PLINQ is very good at processing streams of data. Internally, it works more efficiently than adding items to a synchronized list one-by-one. I suspect that your calls to TryAdd are a bottleneck because each call requires at least two Interlocked operations internally. Those can put enormous load on the inter-processor memory bus because all threads will compete for the same cache line.
PLINQ is cheaper because internally, it does some buffering. I'm sure it doesn't output items one-by-one. Probably it batches them and amortizes sycnhronization cost that way over multiple items.
A second issue would be the bounded capacity of the BlockingCollection. 100 is not a lot. This might lead to a lot of waiting. Waiting is costly because it requires a call to the kernel and a context switch.
I make this alternative that works good for me in any scenario:
This works for me:
In a Task in a Parallel.Foreach Enqueue in a ConcurrentQueue the item
transformed to be processed.
The task has a continue that marks a
flag with that task ends.
In the same thread of execution with tasks
ends a while dequeue and yields
Fast and excellent results for me:
Task.Factory.StartNew (() =>
{
Parallel.ForEach<string> (TextHelper.ReadLines(FileName), ProcessHelper.DefaultParallelOptions,
(string currentLine) =>
{
// Read line, validate and enqeue to an instance of FileLineData (custom class)
});
}).
ContinueWith
(
ic => isCompleted = true
);
while (!isCompleted || qlines.Count > 0)
{
if (qlines.TryDequeue (out returnLine))
{
yield return returnLine;
}
}
By default the ParallelQuery class, when is working on IEnumerable<T> sources, employs a partitioning strategy known as "chunk partitioning". With this strategy each worker thread grabs a progressively larger number of items each time. This means that it has an input buffer. Then the results are accumulated into an output buffer, having a size chosen by the system, before they are available to the consumer of the query. You can disable both buffers by using the configuration options EnumerablePartitionerOptions.NoBuffering and ParallelMergeOptions.NotBuffered.
private IEnumerable<Entity> GetData(ICollection<IFlatFileQuery> flatFileQueries)
{
return Partitioner
.Create(flatFileQueries, EnumerablePartitionerOptions.NoBuffering)
.AsParallel()
.AsOrdered()
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.SelectMany(ffq => ffq.Run());
}
This way each worker thread will grab only one item at a time, and will propagate the result as soon as it is computed.
NoBuffering: Create a partitioner that takes items from the source enumerable one at a time and does not use intermediate storage that can be accessed more efficiently by multiple threads. This option provides support for low latency (items will be processed as soon as they are available from the source) and provides partial support for dependencies between items (a thread cannot deadlock waiting for an item that the thread itself is responsible for processing).
NotBuffered: Use a merge without output buffers. As soon as result elements have been computed, make that element available to the consumer of the query.