I have a computation that I'm parallelizing using PLINQ as follows:
A source IEnumerable<T> source provides objects read from a file.
I have a heavyweight computation HeavyComputation I need to do on
each T, and I want these farmed out across threads, so I am
using PLINQ like: AsParallel().Select(HeavyComputation)
Here's where it gets interesting: due to constraints on the file
reader type that provides the source, I need source to be
enumerated on the initial thread, not on the parallel workers. I need
the full evaluation of the source to be bound to the main
thread. However, it seems the source is actually enumerated on worker threads.
My question is: Is there a straightforward way to modify this code to
bind the enumeration of the source to the initial thread, while
farming out the heavy work to the parallel workers? Keep in mind that
just doing an eager .ToList() before the AsParallel() is not an option here,
as the data stream coming from the file is massive.
Here is some example code that demonstrates the problem as I see it:
using System.Threading;
using System.Collections.Generic;
using System.Linq;
using System;
public class PlinqTest
{
private static string FormatItems<T>(IEnumerable<T> source)
{
return String.Format("[{0}]", String.Join(";", source));
}
public static void Main()
{
var expectedThreadIds = new[] { Thread.CurrentThread.ManagedThreadId };
var threadIds = Enumerable.Range(1, 1000)
.Select(x => Thread.CurrentThread.ManagedThreadId) // (1)
.AsParallel()
.WithDegreeOfParallelism(8)
.WithExecutionMode(ParallelExecutionMode.ForceParallelism)
.AsOrdered()
.Select(x => x) // (2)
.ToArray();
// In the computation above, the lambda in (1) is a
// stand in for the file-reading operation that we
// want to be bound to the main thread, while the
// lambda in (2) is a stand-in for the "expensive
// computation" that we want to be farmed out to the
// parallel worker threads. In fact, (1) is being
// executed on all threads, as can be seen from the
// output.
Console.WriteLine("Expected thread IDs: {0}",
FormatItems(expectedThreadIds));
Console.WriteLine("Found thread IDs: {0}",
FormatItems(threadIds.Distinct()));
}
}
Example output I get is:
Expected thread IDs: [1]
Found thread IDs: [7;4;8;6;11;5;10;9]
This is fairly straightforward (although perhaps not as concise) if you abandon PLINQ and just use the Task Parallel Library explicitly:
// Limits the parallelism of the "expensive task"
var semaphore = new SemaphoreSlim(8);
var tasks = Enumerable.Range(1, 1000)
.Select(x => Thread.CurrentThread.ManagedThreadId)
.Select(async x =>
{
await semaphore.WaitAsync();
var result = await Task.Run(() => Tuple.Create(x, Thread.CurrentThread.ManagedThreadId));
semaphore.Release();
return result;
});
return Task.WhenAll(tasks).Result;
Note that I'm using Tuple.Create to record both the thread ID coming from the main thread and the thread ID coming from the spawned task. From my test, the former is always the same for every tuple, while the latter varies, which is as it should be.
The semaphore makes sure that the degree of parallelism never goes above 8 (although with the inexpensive task of creating a tuple this isn't very likely anyway). If you get to 8, any new tasks will wait until there are spots available on the semaphore.
You could use the OffloadQueryEnumeration method below, which ensures that the enumeration of the source sequence will occur on the same thread that enumerates the resulting IEnumerable<TResult>. The querySelector is a delegate that converts a proxy of the source sequence to a ParallelQuery<T>. This query is enumerated internally on a ThreadPool thread, but the output values are surfaced back on the current thread.
/// <summary>
/// Enumerates the source sequence on the current thread, and enumerates
/// the projected query on a ThreadPool thread.
/// </summary>
public static IEnumerable<TResult> OffloadQueryEnumeration<TSource, TResult>(
this IEnumerable<TSource> source,
Func<IEnumerable<TSource>, IEnumerable<TResult>> querySelector)
{
ArgumentNullException.ThrowIfNull(source);
ArgumentNullException.ThrowIfNull(querySelector);
object locker = new();
(TSource Value, bool HasValue) input = default; bool inputCompleted = false;
(TResult Value, bool HasValue) output = default; bool outputCompleted = false;
using IEnumerator<TSource> sourceEnumerator = source.GetEnumerator();
IEnumerable<TSource> GetSourceProxy()
{
while (true)
{
TSource sourceItem;
lock (locker)
{
while (true)
{
if (inputCompleted || outputCompleted) yield break;
if (input.HasValue) break;
Monitor.Wait(locker);
}
sourceItem = input.Value;
input = default; Monitor.PulseAll(locker);
}
yield return sourceItem;
}
}
IEnumerable<TResult> query = querySelector(GetSourceProxy());
Task outputReaderTask = Task.Run(() =>
{
try
{
foreach (TResult result in query)
{
lock (locker)
{
while (true)
{
if (outputCompleted) return;
if (!output.HasValue) break;
Monitor.Wait(locker);
}
output = (result, true); Monitor.PulseAll(locker);
}
}
}
finally
{
lock (locker) { outputCompleted = true; Monitor.PulseAll(locker); }
}
});
// Main loop
List<Exception> exceptions = new();
while (true)
{
TResult resultItem;
lock (locker)
{
// Inner loop
while (true)
{
if (output.HasValue)
{
resultItem = output.Value;
output = default; Monitor.PulseAll(locker);
goto yieldResult;
}
if (outputCompleted) goto exitMainLoop;
if (!inputCompleted && !input.HasValue)
{
// Fill the empty input slot, by reading the enumerator.
try
{
if (sourceEnumerator.MoveNext())
input = (sourceEnumerator.Current, true);
else
inputCompleted = true;
}
catch (Exception ex)
{
exceptions.Add(ex);
inputCompleted = true;
}
Monitor.PulseAll(locker); continue;
}
Monitor.Wait(locker);
}
}
yieldResult:
bool yieldOK = false;
try { yield return resultItem; yieldOK = true; }
finally
{
if (!yieldOK)
{
// The consumer stopped enumerating prematurely
lock (locker) { outputCompleted = true; Monitor.PulseAll(locker); }
Task.WhenAny(outputReaderTask).Wait();
}
}
}
exitMainLoop:
// Propagate possible exceptions
try { outputReaderTask.GetAwaiter().GetResult(); }
catch (OperationCanceledException) { throw; }
catch (AggregateException aex) { exceptions.AddRange(aex.InnerExceptions); }
if (exceptions.Count > 0)
throw new AggregateException(exceptions);
}
This method uses the Monitor.Wait/Monitor.Pulse mechanism to synchronize the transfer of values from one thread to the other.
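As a minimal illustration of that mechanism (not part of the method above), a single-item hand-off slot between a producer and a consumer can be sketched like this:
class HandOffSlot<T>
{
    private readonly object _locker = new object();
    private (T Value, bool HasValue) _slot;

    public void Produce(T value)
    {
        lock (_locker)
        {
            while (_slot.HasValue) Monitor.Wait(_locker); // Wait until the slot is empty
            _slot = (value, true);
            Monitor.PulseAll(_locker); // Wake up the consumer
        }
    }

    public T Consume()
    {
        lock (_locker)
        {
            while (!_slot.HasValue) Monitor.Wait(_locker); // Wait until the slot is filled
            T value = _slot.Value;
            _slot = default;
            Monitor.PulseAll(_locker); // Wake up the producer
            return value;
        }
    }
}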
Usage example:
int[] threadIds = Enumerable
.Range(1, 1000)
.Select(x => Thread.CurrentThread.ManagedThreadId)
.OffloadQueryEnumeration(proxy => proxy
.AsParallel()
.AsOrdered()
.WithDegreeOfParallelism(8)
.WithExecutionMode(ParallelExecutionMode.ForceParallelism)
.Select(x => x)
)
.ToArray();
Online demo.
The OffloadQueryEnumeration is a fairly intricate method. It constantly juggles three threads:
The current thread that both enumerates the source sequence, and consumes the PLINQ-generated elements, alternating between the two operations.
The ThreadPool thread (outputReaderTask) that enumerates the PLINQ-generated sequence.
The worker thread that is tasked by the PLINQ machinery to fetch the next item from the GetSourceProxy() iterator. This thread is not the same all the time, but at any given moment only one worker thread at most is assigned this task.
So lots of things are going on, and there are lots of opportunities for hidden bugs to slip through undetected. This is the kind of API that would require writing a dozen tests to assert the correctness of the numerous possible scenarios (for example failure in the source sequence, failure in the PLINQ operators, failure in the consumer, cancellation, abandoned enumeration etc.). I have tested some of these scenarios manually, but I haven't written any tests, so use this method with caution.
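To illustrate what one such test might look like (a hypothetical xUnit test, not part of the original answer), it could assert that the source sequence is enumerated only on the thread that consumes the resulting sequence:
[Fact]
public void Source_Is_Enumerated_On_The_Consuming_Thread()
{
    int consumerThreadId = Thread.CurrentThread.ManagedThreadId;
    var sourceThreadIds = new ConcurrentBag<int>();

    int[] results = Enumerable.Range(1, 100)
        .Select(x => { sourceThreadIds.Add(Thread.CurrentThread.ManagedThreadId); return x; })
        .OffloadQueryEnumeration(proxy => proxy.AsParallel().AsOrdered().Select(x => x))
        .ToArray();

    Assert.Equal(Enumerable.Range(1, 100), results); // Order is preserved
    Assert.Equal(new[] { consumerThreadId }, sourceThreadIds.Distinct()); // Single thread
}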
Related
I am using a ConcurrentBag for scraping URLs. Right now it's working fine for 500 / 100 URLs, but when I try to scrape 8000 URLs, not all of them get processed and some items remain pending in inputQueue.
But I am using while (!inputQueue.IsEmpty), so the loop should keep running as long as any items exist in inputQueue.
I want to run only 100 threads max. So I first create 100 threads and call the "Run()" method, and inside that method I run a loop that takes items while any exist in inputQueue and adds them to the output queue after scraping the URLs.
public ConcurrentBag<Data> inputQueue = new ConcurrentBag<Data>();
public ConcurrentBag<Data> outPutQueue = new ConcurrentBag<Data>();
public List<Data> Scrapes(List<Data> scrapeRequests)
{
ServicePointManager.ServerCertificateValidationCallback += (sender, cert, chain, sslPolicyErrors) => true;
string proxy_session_id = new Random().Next().ToString();
numberOfRequestSent = 0;
watch.Start();
foreach (var sRequest in scrapeRequests)
{
inputQueue.Add(sRequest);
}
//inputQueue.CompleteAdding();
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) //create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
Task.WaitAll(taskList.ToArray()); //Waiting
//print result
Console.WriteLine("Number Of URLs Found - {0}", scrapeRequests.Count);
Console.WriteLine("Number Of Request Sent - {0}", numberOfRequestSent);
Console.WriteLine("Input Queue - {0}", inputQueue.Count);
Console.WriteLine("OutPut Queue - {0}", outPutQueue.ToList().Count);
Console.WriteLine("Success - {0}", outPutQueue.ToList().Where(x=>x.IsProxySuccess==true).Count().ToString());
Console.WriteLine("Failed - {0}", outPutQueue.ToList().Where(x => x.IsProxySuccess == false).Count().ToString());
Console.WriteLine("Process Time In - {0}", watch.Elapsed);
return outPutQueue.ToList();
}
async Task<string> Run()
{
while (!inputQueue.IsEmpty)
{
var client = new Client(super_proxy_ip, "US");
if (!client.have_good_super_proxy())
client.switch_session_id();
if (client.n_req_for_exit_node == switch_ip_every_n_req)
client.switch_session_id();
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
try
{
numberOfRequestSent++;
// Console.WriteLine("Sending request for - {0}", scrapeRequest.URL);
scrapeRequest.HTML = client.DownloadString((string)scrapeRequest.URL);
//Console.WriteLine("Response done for - {0}", scrapeRequest.URL);
scrapeRequest.IsProxySuccess = true;
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response();
}
catch (WebException e)
{
Console.WriteLine("Failed");
scrapeRequest.IsProxySuccess = false;
Console.WriteLine(e.Message);
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response(e);
}
client.clean_connection_pool();
client.Dispose();
}
return await Task.Run(() => "Done");
}
There are multiple problems here, but none of them seems to be the cause of inputQueue.Count having a non-zero value at the end. In any case I would like to point out the problems I can see.
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) // create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
The method Task.Factory.StartNew doesn't understand async delegates, so when it is called with an async lambda as argument it returns a nested task. In this case it returns a Task<Task<string>>. You store this nested task in List<Task> collection, which is possible because the type Task<TResult> inherits from the type Task, but doing so you lose the ability to await for the completion (and get the result) of the inner task. You only hold a reference to the outer task. Miraculously this is not a problem in this case (it usually is) since the outer task does all the work, and the inner task does essentially nothing (other than using a thread-pool thread to return a "Done" string that is not really needed anywhere).
You also don't attach any continuations to the outer tasks, so the flag TaskCreationOptions.RunContinuationsAsynchronously seems redundant.
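For reference, a sketch of how the tasks could be started instead: Task.Run understands async delegates and returns the unwrapped inner task, so waiting on it actually waits for the Run() method to complete.
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++)
{
    taskList.Add(Task.Run(() => Run())); // Task.Run unwraps the Task<Task<string>>
}
Task.WaitAll(taskList.ToArray());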
// create 100 threads only
You don't create 100 threads, you create 100 tasks. These tasks are scheduled in the ThreadPool, which will be immediately starved because the tasks are long-running, and will start injecting one new thread every 500 msec until all scheduled tasks have been assigned to a thread.
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
Here you instantiate an object of type ProductResearch_ProData that is immediately discarded and becomes eligible for garbage collection in the very next line. The TryTake method will either set the out argument to an object removed from the bag, or to null if the bag is empty. You ignore the boolean return value of TryTake, which may well be false because the bag may have been emptied by another worker in the meantime, and then proceed with a scrapeRequest that is possibly null, resulting in that case in a NullReferenceException.
Worth noting that you extract an object of type ProductResearch_ProData from a ConcurrentBag<Data>, so either the class Data inherits from
the base class ProductResearch_ProData, or there is a transcription error in the code.
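A sketch of a safer dequeue pattern, where the boolean result of TryTake is honored and the loop exits when the bag is empty instead of proceeding with a possibly null item:
while (inputQueue.TryTake(out Data scrapeRequest))
{
    // ... scrape scrapeRequest.URL, set IsProxySuccess, add it to outPutQueue ...
}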
I am getting items from an upstream API which is quite slow. I try to speed this up by using TPL Dataflow to create multiple connections and bring these together, like this:
class Stuff
{
int Id { get; }
}
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task<IEnumerable<Stuff>> GetLotsOfStuff(IEnumerable<int> ids)
{
var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
bagOfStuff.Add(await GetStuffById(id));
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await processor.Completion;
return bagOfStuff.ToArray();
}
The problem is that I have to wait until I have finished querying the entire collection of Stuff before I can return it to the caller. What I would prefer is that, whenever any of the multiple parallel queries returns an item, I return that item in a yield return fashion. Therefore I don't need to return an async Task<IEnumerable<Stuff>>; I can just return an IEnumerable<Stuff> and the caller advances the iteration as soon as any items are returned.
I tried doing it like this:
IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
yield return await GetStuffById(id);
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
processor.Completion.Wait();
yield break;
}
But I get an error
The yield statement cannot be used inside an anonymous method or lambda expression
How can I restructure my code?
You can return an IEnumerable, but to do so you must block your current thread. You need a TransformBlock to process the ids, and a feeder-task that will asynchronously feed the TransformBlock with ids. Finally, the current thread will enter a blocking loop, waiting for produced stuff to yield:
static IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
using var completionCTS = new CancellationTokenSource();
var processor = new TransformBlock<int, Stuff>(async id =>
{
return await GetStuffById(id);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
BoundedCapacity = 50, // Avoid buffering millions of ids
CancellationToken = completionCTS.Token
});
var feederTask = Task.Run(async () =>
{
try
{
foreach (int id in ids)
if (!await processor.SendAsync(id)) break;
}
finally { processor.Complete(); }
});
try
{
while (processor.OutputAvailableAsync().Result)
while (processor.TryReceive(out var stuff))
yield return stuff;
}
finally // This runs when the caller exits the foreach loop
{
completionCTS.Cancel(); // Cancel the TransformBlock if it's still running
}
Task.WaitAll(feederTask, processor.Completion); // Propagate all exceptions
}
No ConcurrentBag is needed, since the TransformBlock has an internal output buffer. The tricky part is dealing with the case that the caller will abandon the enumeration of the IEnumerable<Stuff> by breaking early, or by being obstructed by an exception. In this case you don't want the feeder-task to keep pumping the IEnumerable<int> with the ids till the end. Fortunately there is a solution. Enclosing the yielding loop in a try/finally block allows a notification of this event to be received, so that the feeder-task can be terminated in a timely manner.
An alternative implementation could remove the need for a feeder-task by combining pumping the ids, feeding the block, and yielding stuff in a single loop. In this case you would want a lag between pumping and yielding. To achieve it, MoreLinq's Lag (or Lead) extension method could be handy.
Update: Here is a different implementation, that enumerates and yields in the same loop. To achieve the desired lagging, the source enumerable is right-padded with some dummy elements, equal in number with the degree of concurrency.
This implementation accepts generic types, instead of int and Stuff.
public static IEnumerable<TResult> Transform<TSource, TResult>(
IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
int degreeOfConcurrency)
{
var processor = new TransformBlock<TSource, TResult>(async item =>
{
return await taskFactory(item);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = degreeOfConcurrency
});
var paddedSource = source.Select(item => (item, true))
.Concat(Enumerable.Repeat((default(TSource), false), degreeOfConcurrency));
int index = -1;
bool completed = false;
foreach (var (item, hasValue) in paddedSource)
{
index++;
if (hasValue) { processor.Post(item); }
else if (!completed) { processor.Complete(); completed = true; }
if (index >= degreeOfConcurrency)
{
if (!processor.OutputAvailableAsync().Result) break; // Blocking call
if (!processor.TryReceive(out var result))
throw new InvalidOperationException(); // Should never happen
yield return result;
}
}
processor.Completion.Wait();
}
Usage example:
IEnumerable<Stuff> lotsOfStuff = Transform(ids, GetStuffById, 5);
Both implementations can be modified trivially to return an IAsyncEnumerable instead of IEnumerable, to avoid blocking the calling thread.
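For illustration, here is a rough sketch of how the second (padded) implementation could look as an async iterator; the names are illustrative, and the blocking calls are simply replaced with awaits:
public static async IAsyncEnumerable<TResult> TransformAsync<TSource, TResult>(
    IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
    int degreeOfConcurrency)
{
    var processor = new TransformBlock<TSource, TResult>(item => taskFactory(item),
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = degreeOfConcurrency
        });

    // Right-pad the source so that the last in-flight items can still be received.
    var paddedSource = source.Select(item => (item, true))
        .Concat(Enumerable.Repeat((default(TSource), false), degreeOfConcurrency));

    int index = -1;
    bool completed = false;
    foreach (var (item, hasValue) in paddedSource)
    {
        index++;
        if (hasValue) { processor.Post(item); }
        else if (!completed) { processor.Complete(); completed = true; }
        if (index >= degreeOfConcurrency)
        {
            if (!await processor.OutputAvailableAsync()) break; // Asynchronous wait
            if (!processor.TryReceive(out var result))
                throw new InvalidOperationException(); // Should never happen
            yield return result;
        }
    }
    await processor.Completion; // Propagate possible exceptions
}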
There's probably a few different ways you can handle this based on your specific use case. But to handle items as they come through in terms of TPL Dataflow, you'd change your source block to a TransformBlock<,> and flow the items to another block that processes your items. Note that you can now get rid of the collecting ConcurrentBag, and be sure to set EnsureOrdered to false if you don't care what order you receive your items in. Also link the blocks and propagate completion to ensure your pipeline finishes once all items are retrieved and subsequently processed.
class Stuff
{
int Id { get; }
}
public class GetStuff
{
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task GetLotsOfStuff(IEnumerable<int> ids)
{
//var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
EnsureOrdered = false
};
var processor = new TransformBlock<int, Stuff>(id => GetStuffById(id), options);
var handler = new ActionBlock<Stuff>(s => throw new NotImplementedException());
processor.LinkTo(handler, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await handler.Completion;
}
}
Other options could be making your method return an observable that streams out of the TransformBlock, or using IAsyncEnumerable to yield return from an async get method.
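For example, a rough sketch of the observable option (not part of the answer above): the DataflowBlock.AsObservable extension exposes the TransformBlock's output as an IObservable<Stuff> that callers can subscribe to.
IObservable<Stuff> GetLotsOfStuffObservable(IEnumerable<int> ids)
{
    var processor = new TransformBlock<int, Stuff>(id => GetStuffById(id),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5, EnsureOrdered = false });

    foreach (int id in ids) processor.Post(id);
    processor.Complete();

    // Completion and failures propagate as OnCompleted/OnError to the subscribers.
    return processor.AsObservable();
}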
The purpose is to do some async work on a scarce resource in an Rx operator, Select for example. Issues arise when observable notifications arrive faster than the async operation takes to complete.
Now I have actually solved the problem. My question is: what is the correct terminology for this particular kind of issue? Does it have a name? Is it backpressure? The research I have done so far indicates that this is some kind of pressure problem, but not necessarily backpressure as I understand it. The most relevant resources I found are these:
https://github.com/ReactiveX/RxJava/wiki/Backpressure-(2.0)
http://reactivex.io/documentation/operators/backpressure.html
Now to the actual code. Suppose there is a scarce resource and its consumer. In this case an exception is thrown when the resource is already in use. Please note that this code should not be changed.
public class ScarceResource
{
private static bool inUse = false;
public async Task<int> AccessResource()
{
if (inUse) throw new Exception("Resource is already in use");
var result = await Task.Run(() =>
{
inUse = true;
Random random = new Random();
Thread.Sleep(random.Next(1, 2) * 1000);
inUse = false;
return random.Next(1, 10);
});
return result;
}
}
public class ResourceConsumer
{
public IObservable<int> DoWork()
{
var resource = new ScarceResource();
return resource.AccessResource().ToObservable();
}
}
Now here is the problem with a naive implementation that consumes the resource. An error is thrown because notifications arrive at a faster rate than the consumer can process them.
private static void RunIntoIssue()
{
var numbers = Enumerable.Range(1, 10);
var observableSequence = numbers
.ToObservable()
.SelectMany(n =>
{
Console.WriteLine("In observable: {0}", n);
var resourceConsumer = new ResourceConsumer();
return resourceConsumer.DoWork();
});
observableSequence.Subscribe(n => Console.WriteLine("In observer: {0}", n));
}
With the following code the problem is solved. I slow down processing by using a BehaviorSubject (completed) in conjunction with the Zip operator. Essentially what this code does is take a sequential approach instead of a parallel one.
private static void RunWithZip()
{
var completed = new BehaviorSubject<bool>(true);
var numbers = Enumerable.Range(1, 10);
var observableSequence = numbers
.ToObservable()
.Zip(completed, (n, c) =>
{
Console.WriteLine("In observable: {0}, completed: {1}", n, c);
var resourceConsumer = new ResourceConsumer();
return resourceConsumer.DoWork();
})
.Switch()
.Select(n =>
{
completed.OnNext(true);
return n;
});
observableSequence.Subscribe(n => Console.WriteLine("In observer: {0}", n));
Console.Read();
}
Question
Is this backpressure, and if not does it have another terminology associated?
You're basically implementing a form of locking, or a mutex. Your code can cause backpressure; it's not really handling it.
Imagine if your source wasn't a generator function, but rather a series of data pushes. The data pushes arrive at a constant rate of one per millisecond. It takes you 10 milliseconds to process each one, and your code forces serial processing. This causes backpressure: Zip will queue up the unprocessed data pushes indefinitely until you run out of memory.
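For comparison, a common Rx idiom (not from the answer above, and still subject to the same unbounded buffering) is to project each element to a deferred inner observable and concatenate, which serializes the work without an explicit subject:
var observableSequence = Enumerable.Range(1, 10)
    .ToObservable()
    .Select(n => Observable.Defer(() =>
    {
        Console.WriteLine("In observable: {0}", n);
        return new ResourceConsumer().DoWork(); // Not invoked until its turn comes
    }))
    .Concat(); // Subscribes to the inner observables strictly one at a time

observableSequence.Subscribe(n => Console.WriteLine("In observer: {0}", n));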
So here is the scenario:
I have to take a group of data, process it, build an object and then insert those objects into a database.
In order to increase performance, I am multi-threading the processing of the data using a parallel loop and storing the objects in a ConcurrentBag.
That part works fine. However, the issue here is I now need to take that list, convert it into a DataTable object and insert the data into the database. It's very ugly and I feel like I'm not doing this in the best way possible (pseudo below):
ConcurrentBag<FinalObject> bag = new ConcurrentBag<FinalObject>();
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.ForEach(allData, parallelOptions, dataObj =>
{
.... Process data ....
bag.Add(theData);
Thread.Sleep(100);
});
DataTable table = createTable();
foreach(FinalObject moveObj in bag) {
table.Rows.Add(moveObj.x);
}
This is a good candidate for PLINQ (or Rx - I'll focus on PLINQ since it's part of the Base Class Library).
IEnumerable<FinalObject> bag = allData
.AsParallel()
.WithDegreeOfParallelism(Environment.ProcessorCount)
.Select(dataObj =>
{
FinalObject theData = Process(dataObj);
Thread.Sleep(100);
return theData;
});
DataTable table = createTable();
foreach (FinalObject moveObj in bag)
{
table.Rows.Add(moveObj.x);
}
Realistically, instead of throttling the loop via Thread.Sleep, you should be limiting the maximum degree of parallelism further until you get the CPU usage down to the desired level.
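For example, a minimal sketch of that idea, assuming the work is CPU-bound and the Thread.Sleep was only there as a throttle:
IEnumerable<FinalObject> bag = allData
    .AsParallel()
    .WithDegreeOfParallelism(Math.Max(1, Environment.ProcessorCount / 2)) // e.g. half the cores
    .Select(dataObj => Process(dataObj));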
Disclaimer: all of the below is meant for entertainment only, although it does actually work.
Of course you can always kick it up a notch and produce a full-on async Parallel.ForEach implementation that allows you to process input in parallel and do your throttling asynchronously, without blocking any thread pool threads.
async Task ParallelForEachAsync<TInput, TResult>(IEnumerable<TInput> input,
int maxDegreeOfParallelism,
Func<TInput, Task<TResult>> body,
Action<TResult> onCompleted)
{
Queue<TInput> queue = new Queue<TInput>(input);
if (queue.Count == 0) {
return;
}
List<Task<TResult>> tasksInFlight = new List<Task<TResult>>(maxDegreeOfParallelism);
do
{
while (tasksInFlight.Count < maxDegreeOfParallelism && queue.Count != 0)
{
TInput item = queue.Dequeue();
Task<TResult> task = body(item);
tasksInFlight.Add(task);
}
Task<TResult> completedTask = await Task.WhenAny(tasksInFlight).ConfigureAwait(false);
tasksInFlight.Remove(completedTask);
TResult result = completedTask.GetAwaiter().GetResult(); // We know the task has completed. No need for await.
onCompleted(result);
}
while (queue.Count != 0 || tasksInFlight.Count != 0);
}
Usage (full Fiddle here):
async Task<DataTable> ProcessAllAsync(IEnumerable<InputObject> allData)
{
DataTable table = CreateTable();
int maxDegreeOfParallelism = Environment.ProcessorCount;
await ParallelForEachAsync(
allData,
maxDegreeOfParallelism,
// Loop body: these Tasks will run in parallel, up to {maxDegreeOfParallelism} at any given time.
async dataObj =>
{
FinalObject o = await Task.Run(() => Process(dataObj)).ConfigureAwait(false); // Thread pool processing.
await Task.Delay(100).ConfigureAwait(false); // Artificial throttling.
return o;
},
// Completion handler: these will be executed one at a time, and can safely mutate shared state.
moveObj => table.Rows.Add(moveObj.x)
);
return table;
}
struct InputObject
{
public int x;
}
struct FinalObject
{
public int x;
}
FinalObject Process(InputObject o)
{
// Simulate synchronous work.
Thread.Sleep(100);
return new FinalObject { x = o.x };
}
Same behaviour, but without Thread.Sleep and ConcurrentBag<T>.
Sounds like you've complicated things quite a bit by trying to make everything run in parallel, but if you store DataRow objects in your bag instead of plain objects, at the end you can use DataTableExtensions to create a DataTable from a generic collection quite easily:
var dataTable = bag.CopyToDataTable();
Just add a reference to System.Data.DataSetExtensions in your project.
I think something like this should give better performance; object[] looks like a better option than DataRow, since you need a DataTable to create a DataRow object.
ConcurrentBag<object[]> bag = new ConcurrentBag<object[]>();
Parallel.ForEach(allData,
new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
dataObj =>
{
object[] row = new object[colCount];
//do processing
bag.Add(row);
Thread.Sleep(100);
});
DataTable table = createTable();
foreach (object[] row in bag)
{
table.Rows.Add(row);
}
I've come up with some code to consume all waiting items from a queue. Rather than processing the items one by one, it makes sense to process all waiting items as a set.
I've declared my queue like this.
private BlockingCollection<Item> items =
new BlockingCollection<Item>(new ConcurrentQueue<Item>());
Then, on a consumer thread, I plan to read the items in batches like this,
Item nextItem;
while (this.items.TryTake(out nextItem, -1))
{
var workToDo = new List<Item>();
workToDo.Add(nextItem);
while(this.items.TryTake(out nextItem))
{
workToDo.Add(nextItem);
}
// process workToDo, then go back to the queue.
}
This approach lacks the utility of GetConsumingEnumerable and I can't help wondering if I've missed a better way, or if my approach is flawed.
Is there a better way to consume a BlockingCollection<T> in batches?
A solution is to use the BufferBlock<T> from System.Threading.Tasks.Dataflow (which is included in .NET Core 3+). It does not use GetConsumingEnumerable(), but it still gives you the same utility, mainly:
allows parallel processing w/ multiple (symmetrical and/or asymmetrical) consumers and producers
thread safe (allowing for the above) - no race conditions to worry about
can be cancelled by a cancellation token and/or collection completion
consumers block until data is available, avoiding wasting CPU cycles on polling
There is also a BatchBlock<T>, but that limits you to fixed sized batches.
var buffer = new BufferBlock<Item>();
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
//process items
}
Here is a working example, which demos the following:
multiple symmetrical consumers which process variable length batches in parallel
multiple symmetrical producers (not truly operating in parallel in this example)
ability to complete the collection when the producers are done
to keep the example short, I did not demonstrate the use of a CancellationToken
ability to wait until the producers and/or consumers are done
ability to call from an area that doesn't allow async, such as a constructor
the Thread.Sleep() calls are not required, but help simulate some processing time that would occur in more taxing scenarios
both the Task.WaitAll() and the Thread.Sleep() can optionally be converted to their async equivalents
no need to use any external libraries
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
static class Program
{
static void Main()
{
var buffer = new BufferBlock<string>();
// Kick off consumer task(s)
List<Task> consumers = new List<Task>();
for (int i = 0; i < 3; i++)
{
consumers.Add(Task.Factory.StartNew(async () =>
{
// need to copy this due to lambda variable capture
var num = i;
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
Console.WriteLine($"Consumer {num}: " +
items.Aggregate((a, b) => a + ", " + b));
// real life processing would take some time
await Task.Delay(500);
}
Console.WriteLine($"Consumer {num} complete");
}));
// give consumer tasks time to activate for a better demo
Thread.Sleep(100);
}
// Kick off producer task(s)
List<Task> producers = new List<Task>();
for (int i = 0; i < 3; i++)
{
producers.Add(Task.Factory.StartNew(() =>
{
for (int j = 0 + (1000 * i); j < 500 + (1000 * i); j++)
buffer.Post(j.ToString());
}));
// space out the producers for a better demo
Thread.Sleep(10);
}
// may also use the async equivalent
Task.WaitAll(producers.ToArray());
Console.WriteLine("Finished waiting on producers");
// demo being able to complete the collection
buffer.Complete();
// may also use the async equivalent
Task.WaitAll(consumers.ToArray());
Console.WriteLine("Finished waiting on consumers");
Console.ReadLine();
}
}
Here is a modernised and simplified version of the code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
class Program
{
private static async Task Main()
{
var buffer = new BufferBlock<string>();
// Kick off consumer task(s)
var consumers = new List<Task>();
for (var i = 0; i < 3; i++)
{
var id = i;
consumers.Add(Task.Run(() => StartConsumer(id, buffer)));
// give consumer tasks time to activate for a better demo
await Task.Delay(100);
}
// Kick off producer task(s)
var producers = new List<Task>();
for (var i = 0; i < 3; i++)
{
var pid = i;
producers.Add(Task.Run(() => StartProducer(pid, buffer)));
// space out the producers for a better demo
await Task.Delay(10);
}
// may also use the async equivalent
await Task.WhenAll(producers);
Console.WriteLine("Finished waiting on producers");
// demo being able to complete the collection
buffer.Complete();
// may also use the async equivalent
await Task.WhenAll(consumers);
Console.WriteLine("Finished waiting on consumers");
Console.ReadLine();
}
private static async Task StartConsumer(
int id,
IReceivableSourceBlock<string> buffer)
{
while (await buffer.OutputAvailableAsync())
{
if (buffer.TryReceiveAll(out var items))
{
Console.WriteLine($"Consumer {id}: " +
items.Aggregate((a, b) => a + ", " + b));
}
// real life processing would take some time
await Task.Delay(500);
}
Console.WriteLine($"Consumer {id} complete");
}
private static Task StartProducer(int pid, ITargetBlock<string> buffer)
{
for (var j = 0 + (1000 * pid); j < 500 + (1000 * pid); j++)
{
buffer.Post(j.ToString());
}
return Task.CompletedTask;
}
}
While not as good as ConcurrentQueue<T> in some ways, my own LLQueue<T> allows for a batched dequeue with an AtomicDequeueAll method, where all items currently on the queue are taken from it in a single (atomic and thread-safe) operation and are then held in a non-thread-safe collection for consumption by a single thread. This method was designed precisely for the scenario where you want to batch the read operations.
This isn't blocking, though it could be used to create a blocking collection easily enough:
public class BlockingBatchedQueue<T>
{
private readonly AutoResetEvent _are = new AutoResetEvent(false);
private readonly LLQueue<T> _store = new LLQueue<T>();
public void Add(T item)
{
_store.Enqueue(item);
_are.Set();
}
public IEnumerable<T> Take()
{
_are.WaitOne();
return _store.AtomicDequeueAll();
}
public bool TryTake(out IEnumerable<T> items, int millisecTimeout)
{
if(_are.WaitOne(millisecTimeout))
{
items = _store.AtomicDequeueAll();
return true;
}
items = null;
return false;
}
}
That's a starting point that doesn't do the following:
Deal with a pending waiting reader upon disposal.
Worry about a potential race with multiple readers both being triggered by a write happening while one was reading (it just considers the occasional empty result enumerable to be okay).
Place any upper-bound on writing.
All of which could be added, but I wanted to keep this to the minimum that has some practical use and hopefully isn't buggy within the limitations defined above.
No, there is no better way. Your approach is basically correct.
You could wrap the "consume-in-batches" functionality in an extension method, for ease of use. The implementation below uses the same List<T> as a buffer during the whole enumeration, with the intention of preventing the allocation of a new buffer on each iteration. It also includes a maxSize parameter that allows limiting the size of the emitted batches:
/// <summary>
/// Consumes the items in the collection in batches. Each batch contains all
/// the items that are immediately available, up to a specified maximum number.
/// </summary>
public static IEnumerable<T[]> GetConsumingEnumerableBatch<T>(
this BlockingCollection<T> source, int maxSize,
CancellationToken cancellationToken = default)
{
ArgumentNullException.ThrowIfNull(source);
if (maxSize < 1) throw new ArgumentOutOfRangeException(nameof(maxSize));
if (source.IsCompleted) yield break;
var buffer = new List<T>();
while (source.TryTake(out var item, Timeout.Infinite, cancellationToken))
{
Debug.Assert(buffer.Count == 0);
buffer.Add(item);
while (buffer.Count < maxSize && source.TryTake(out item))
buffer.Add(item);
T[] batch = buffer.ToArray();
int batchSize = batch.Length;
buffer.Clear();
yield return batch;
if (batchSize < buffer.Capacity >> 2)
buffer.Capacity = buffer.Capacity >> 1; // Shrink oversized buffer
}
}
Usage example:
foreach (Item[] batch in this.items.GetConsumingEnumerableBatch(Int32.MaxValue))
{
// Process the batch
}
The buffer is shrunk to half its size every time an emitted batch is smaller than a quarter of the buffer's capacity. This keeps the buffer's size under control, in case it has become oversized at some point during the enumeration.
The intention of the if (source.IsCompleted) yield break line is to replicate the behavior of the built-in GetConsumingEnumerable method, when it is supplied with an already canceled token, and the collection is empty and completed.
In case of cancellation, no buffered messages are in danger of being lost. The cancellationToken is checked only when the buffer is empty.
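A hypothetical usage with cancellation (the batch size and timeout values are arbitrary):
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
try
{
    foreach (Item[] batch in this.items.GetConsumingEnumerableBatch(100, cts.Token))
    {
        // Process the batch
    }
}
catch (OperationCanceledException)
{
    // The enumeration was canceled; batches that were already emitted have been processed.
}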
A simpler implementation without memory management features, can be found in the first revision of this answer.