My use case is this: send 100,000+ web requests to our application server and wait for the results. Most of the delay is IO-bound rather than CPU-bound, so I understand the Dataflow libraries may not be the best tool for this. I've managed to use them with a lot of success and have set MaxDegreeOfParallelism to the number of requests that I trust the server to be able to handle; however, since this is only a maximum number of tasks, there's no guarantee that this will actually be the number of tasks running at any given time.
The only bit of information I could find in the documentation is this:
Because the MaxDegreeOfParallelism property represents the maximum degree of parallelism, the dataflow block might execute with a lesser degree of parallelism than you specify. The dataflow block can use a lesser degree of parallelism to meet its functional requirements or to account for a lack of available system resources. A dataflow block never chooses a greater degree of parallelism than you specify.
This explanation is quite vague about how it actually determines when to spin up a new task. My hope was that it would recognize that the task is blocked on IO, not on any system resource, and would stay at the maximum degree of parallelism for the entire duration of the operation.
However, after monitoring a network capture, it seems to be MUCH quicker at the beginning and slower near the end. I can see from the capture that at the beginning it does reach the maximum as specified. The TPL Dataflow library doesn't have any built-in way to monitor the current number of active tasks, so I'm not really sure of the best way to investigate further on that end.
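For what it's worth, one rough way I can think of to observe the actual concurrency is to wrap the per-item delegate with a counter (sketch only; SendRequestAsync is just a placeholder for the real request, not part of my implementation):
int active = 0;
var block = new TransformBlock<string, string?>(async url =>
{
    // Interlocked counter just to log how many items are in flight at once.
    int now = Interlocked.Increment(ref active);
    Console.WriteLine($"active = {now}");
    try
    {
        return await SendRequestAsync(url); // placeholder for the real request
    }
    finally
    {
        Interlocked.Decrement(ref active);
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100, EnsureOrdered = false });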
My implementation:
internal static ExecutionDataflowBlockOptions GetDefaultBlockOptions(int maxDegreeOfParallelism,
CancellationToken token) => new()
{
MaxDegreeOfParallelism = maxDegreeOfParallelism,
CancellationToken = token,
SingleProducerConstrained = true,
EnsureOrdered = false
};
private static async ValueTask<T?> ReceiveAsync<T>(this ISourceBlock<T?> block, bool configureAwait, CancellationToken token)
{
try
{
return await block.ReceiveAsync(token).ConfigureAwait(configureAwait);
} catch (InvalidOperationException)
{
return default;
}
}
internal static async IAsyncEnumerable<T> YieldResults<T>(this ISourceBlock<T?> block, bool configureAwait,
[EnumeratorCancellation]CancellationToken token)
{
while (await block.OutputAvailableAsync(token).ConfigureAwait(configureAwait))
if (await block.ReceiveAsync(configureAwait, token).ConfigureAwait(configureAwait) is T result)
yield return result;
// by the time OutputAvailableAsync returns false, the block is guaranteed to be complete. However,
// we want to await it anyway, since this will propagate any exception thrown to the consumer.
// we don't simply await the completed task, because that wouldn't return all aggregate exceptions,
// just the last one to occur
if (block.Completion.Exception != null)
throw block.Completion.Exception;
}
public static IAsyncEnumerable<TResult> ParallelSelectAsync<T, TResult>(this IEnumerable<T> source, Func<T, Task<TResult?>> body,
int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded, TaskScheduler? scheduler = null, CancellationToken token = default)
{
var options = GetDefaultBlockOptions(maxDegreeOfParallelism, token);
if (scheduler != null)
options.TaskScheduler = scheduler;
var block = new TransformBlock<T, TResult?>(body, options);
foreach (var item in source)
block.Post(item);
block.Complete();
return block.YieldResults(scheduler != null && scheduler != TaskScheduler.Default, token);
}
So, basically, my question is this: when an IO-bound action is executed in a TPL Dataflow block, how can I ensure the block stays at the MaxDegreeOfParallelism that is set?
On the contrary, Dataflow is great at IO work and perfect for this scenario. Dataflow architectures work by creating pipelines similar to Bash or PowerShell pipelines, with each block acting as a separate command, reading messages from its input queue and passing them to the next block through its output queue. That's why the default DOP is 1 - parallelism and concurrency come from using multiple commands/blocks, not from one fat block with a high DOP.
This is a simplified example of what I use at work to request daily sales reports from about a hundred airlines (BSPs, for those who know about air tickets), parse the reports and then download individual ticket records, before importing everything into the database.
In this case the head block downloads content with a DOP of 10, then the parser block parses the responses one at a time. The downloader is IO-bound, so it can make a lot more requests than there are cores - as many as the services allow, or as many as the application wants to handle.
The parser on the other hand is CPU-bound. A high DOP would lock up a lot of cores, which would harm not just the application, but other processes as well.
// Create the blocks
var dlOptions = new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism=10
};
var downloader=new TransformBlock<string,string>(
url => _client.GetStringAsync(url,cancellationToken),
dlOptions);
var parser=new TransformBlock<string,Something>(ParseIntoSomething);
var importer=new ActionBlock<Something>(ImportInDb);
// Link the blocks
var linkOptions = new DataflowLinkOptions {PropagateCompletion = true};
downloader.LinkTo(parser,linkOptions);
parser.LinkTo(importer,linkOptions);
After building this 3-step pipeline I post URLs at the front and expect the tail block to complete:
foreach(var url in urls)
{
downloader.Post(url);
}
downloader.Complete();
await importer.Completion;
There are a lot of improvements to this. Right now, if the downloader is faster than the parser, all the content will be buffered in memory. In a long running pipeline this can easily take up all available memory.
A simple way to avoid this is to add BoundedCapacity=N to the parser block options. If the parser's input buffer is full, upstream blocks - in this case the downloader - will pause and wait until a slot becomes available:
var parserOptions = new ExecutionDataflowBlockOptions {
BoundedCapacity=2,
MaxDegreeOfParallelism=2,
};
var parser=new TransformBlock<string,Something>(ParseIntoSomething, parserOptions);
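One related note: Post returns false immediately if a bounded block's input buffer is full. If you ever bound the head block as well, awaiting SendAsync instead lets the producer wait for a free slot (sketch, using the same downloader as above):
// SendAsync asynchronously waits for buffer space instead of rejecting the message.
foreach (var url in urls)
{
    await downloader.SendAsync(url);
}
downloader.Complete();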
I have a server app (C# with .Net 5) that exposes a gRPC bi-directional endpoint. This endpoint takes in a binary stream in which the server analyzes and produces responses that are sent back to the gRPC response stream.
Each file being sent over gRPC is a few megabytes, and it takes a few minutes for the gRPC call to complete streaming (without latency). With latencies, this time sometimes increases by 50%.
On the client, I have 2 tasks (Task.Run) running: one streams the file from the client's file system using FileStream, the other reads responses from the server (gRPC).
On the server I also have 2 tasks running: one reads messages from the gRPC request stream and pushes them into a queue (a Dataflow BufferBlock<byte[]>), the other processes messages from the queue and writes responses to gRPC.
The problem:
If I disable (comment out) all the server processing code, and simply read and log messages from gRPC, there's almost 0 latency from client to server.
When the server has processing enabled, the clients see latencies while writing to grpcClient.
With just 10 active parallel sessions (gRPC Calls) these latencies can go up to 10-15 seconds.
PS: this only happens when I have more than one client running, a higher number of concurrent clients means higher latency.
The client code looks a bit like the below:
FileStream fs = new(audioFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024, true);
byte[] buffer = new byte[10_000];
GrpcClient client = new GrpcClient(_singletonChannel); // using single channel since only 5-10 clients are there right now
BiDiCall call = client.BiDiService(headers: null, deadline: null, CancellationToken.None);
var writeTask = Task.Run(async () => {
while (await fs.ReadAsync(buffer, 0, buffer.Length) > 0)
{
await call.RequestStream.WriteAsync(new() { Chunk = ByteString.CopyFrom(buffer) });
}
await call.RequestStream.CompleteAsync();
});
var readTask = Task.Run(async () => {
while (await call.ResponseStream.MoveNext())
{
// write to log call.ResponseStream.Current
}
});
await Task.WhenAll(writeTask, readTask);
await call;
Server code looks like:
readonly BufferBlock<MessageRequest> messages = new();
MessageProcessor _processor = new();
public override async Task BiDiService(IAsyncStreamReader<MessageRequest> requestStream,
IServerStreamWriter<MessageResponse> responseStream,
ServerCallContext context)
{
var readTask = Task.Factory.StartNew(async () => {
while (await requestStream.MoveNext())
{
messages.Post(requestStream.Current); // add to queue
}
messages.Complete();
}, TaskCreationOptions.LongRunning).Unwrap();
var processTask = Task.Run(async () => {
while (await messages.OutputAvailableAsync())
{
var message = await messages.ReceiveAsync(); // pick from queue
// if I comment out below line and run with multiple clients = latency disappears
var result = await _processor.Process(message); // takes some time to process
if (result.IsImportantForClient())
await responseStream.WriteAsync(result.Value);
}
});
await Task.WhenAll(readTask, processTask);
}
So, as it turned out, the problem was due to a delay in the ThreadPool spawning new worker threads.
The ThreadPool was taking time to spawn the threads needed to process these tasks, causing gRPC reads to lag significantly.
This was fixed by increasing the minimum thread count using ThreadPool.SetMinThreads. MSDN reference
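For reference, the change was along these lines (the numbers here are illustrative, not the values actually used):
// Raise the minimum number of worker/IOCP threads the ThreadPool creates on demand,
// before its slower thread-injection heuristics kick in.
ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);
ThreadPool.SetMinThreads(Math.Max(workerThreads, 100), Math.Max(completionPortThreads, 100));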
There have been a number of promising comments on the initial SO question, but I wanted to paraphrase what I thought was important: there's an outer async method that calls into two Task.Run()s (with the TaskCreationOptions.LongRunning option) that wrap async loops, and finally a Task.WhenAll() that rejoins the two Tasks.
Alois Kraus offers that the OS task scheduler is in charge, and its scheduling could be abstracting away what you might think is more efficient. If that is the case,
I would offer the suggestion to try removing the asynchronous processing and see what kind of difference various sync/async blends make for your particular scenario.
One thing to remember is that async/await logically blocks at the expense of automatic thread management - this is great for single-path-ish I/O-bound processing (e.g. needing to call a DB/web service before moving on to the next step of execution) and becomes less beneficial as you move toward compute-bound processing (execution that needs to be explicitly re-joined - async/await implicitly takes care of the Task re-join).
I am somewhat new to parallel programming in C# (when I started my project I worked through the MSDN examples for the TPL) and would appreciate some input on the following example code.
It is one of several background worker tasks. This specific task pushes status messages to a log.
var uiCts = new CancellationTokenSource();
var globalMsgQueue = new ConcurrentQueue<string>();
var backgroundUiTask = new Task(
() =>
{
while (!uiCts.IsCancellationRequested)
{
while (globalMsgQueue.Count > 0)
ConsumeMsgQueue();
Thread.Sleep(backgroundUiTimeOut);
}
},
uiCts.Token);
// Somewhere else entirely
backgroundUiTask.Start();
Task.WaitAll(backgroundUiTask);
I'm asking for professional input after reading several topics like Alternatives to using Thread.Sleep for waiting, Is it always bad to use Thread.Sleep()?, When to use Task.Delay, when to use Thread.Sleep?, and Continuous polling using Tasks,
which prompt me to use Task.Delay instead of Thread.Sleep as a first step and to introduce TaskCreationOptions.LongRunning.
But I wonder what other caveats I might be missing? Is polling the MsgQueue.Count a code smell? Would a better version rely on an event instead?
First of all, there's no reason to use Task.Start or use the Task constructor. Tasks aren't threads, they don't run themselves. They are a promise that something will complete in the future and may or may not produce any results. Some of them will run on a threadpool thread. Use Task.Run to create and run the task in a single step when you need to.
I assume the actual problem is how to create a buffered background worker. .NET already offers classes that can do this.
ActionBlock<T>
The ActionBlock class already implements this and a lot more - it allows you to specify how big the input buffer is, how many tasks will process incoming messages concurrently, supports cancellation and asynchronous completion.
A logging block could be as simple as this :
_logBlock=new ActionBlock<string>(msg=>File.AppendAllText("myLog.txt",msg));
The ActionBlock class itself takes care of buffering the inputs, feeding new messages to the worker function as they arrive, potentially blocking senders if the buffer gets full, etc. There's no need for polling.
Other code can use Post or SendAsync to send messages to the block :
_block.Post("some message");
When we are done, we can tell the block to Complete() and await for it to process any remaining messages :
_block.Complete();
await _block.Completion;
Channels
A newer, lower-level option is to use Channels. You can think of channels as a kind of asynchronous queue, although they can be used to implement complex processing pipelines. If ActionBlock was written today, it would use Channels internally.
With channels, you need to provide the "worker" task yourself. There's no need for polling though, as the ChannelReader class allows you to read messages asynchronously or even use await foreach.
The writer method could look like this :
public ChannelWriter<string> LogIt(string path,CancellationToken token=default)
{
var channel=Channel.CreateUnbounded<string>();
var writer=channel.Writer;
_=Task.Run(async ()=>{
await foreach(var msg in channel.Reader.ReadAllAsync(token))
{
File.AppendAllText(path,msg);
}
},token).ContinueWith(t=>writer.TryComplete(t.Exception));
return writer;
}
....
_logWriter=LogIt(somePath);
Other code can send messages by using WriteAsync or TryWrite, eg :
_logWriter.TryWrite(someMessage);
When we're done, we can call Complete() or TryComplete() on the writer :
_logWriter.TryComplete();
The line
.ContinueWith(t=>writer.TryComplete(t.Exception));
is needed to ensure the channel is closed even if an exception occurs or the cancellation token is signaled.
This may seem too cumbersome at first, but channels allow us to easily run initialization code or carry state from one message to the next. We could open a stream before the loop starts and use it instead of reopening the file each time we call File.AppendAllText, eg:
public ChannelWriter<string> LogIt(string path,CancellationToken token=default)
{
var channel=Channel.CreateUnbounded<string>();
var writer=channel.Writer;
_=Task.Run(async ()=>{
//***** Can't do this with an ActionBlock ****
using(var streamWriter=File.AppendText(path))
{
await foreach(var msg in channel.Reader.ReadAllAsync(token))
{
streamWriter.WriteLine(msg);
//Or
//await streamWriter.WriteLineAsync(msg);
}
}
},token).ContinueWith(t=>writer.TryComplete(t.Exception));
return writer;
}
Definitely Task.Delay is better than Thread.Sleep, because you will not be blocking a thread from the pool, and while waiting that pool thread will be available to handle other tasks. Also, you don't need to make your task long-running; long-running tasks run on a dedicated thread, in which case Task.Delay is meaningless.
Instead, I would recommend a different approach: just use System.Threading.Timer and make your life simple. Timers are kernel objects that will run their callback on the thread pool, and you will not have to worry about delay or sleep.
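A minimal sketch of that, reusing the names from the question (and assuming backgroundUiTimeOut is a millisecond interval):
// The callback runs on a ThreadPool thread each period; no dedicated thread,
// no sleep loop to manage. Dispose the timer to stop polling.
var timer = new System.Threading.Timer(_ =>
{
    while (globalMsgQueue.Count > 0)
        ConsumeMsgQueue();
}, null, 0, backgroundUiTimeOut);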
The TPL Dataflow library is the preferred tool for this kind of job. It allows building efficient producer-consumer pairs quite easily, and more complex pipelines as well, while offering a complete set of configuration options. In your case using a single ActionBlock should be enough.
A simpler solution you might consider is to use a BlockingCollection. It has the advantage of not requiring the installation of any package (because it is built-in), and it's also much easier to learn. You don't have to learn more than the methods Add, CompleteAdding, and GetConsumingEnumerable. It also supports cancellation. The drawback is that it's a blocking collection, so it blocks the consumer thread while waiting for new messages to arrive, and the producer thread while waiting for available space in the internal buffer (only if you specify a boundedCapacity in the constructor).
var uiCts = new CancellationTokenSource();
var globalMsgQueue = new BlockingCollection<string>();
var backgroundUiTask = new Task(() =>
{
foreach (var item in globalMsgQueue.GetConsumingEnumerable(uiCts.Token))
{
ConsumeMsgQueueItem(item);
}
}, uiCts.Token);
The BlockingCollection uses a ConcurrentQueue internally as a buffer.
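For completeness, the producer side just adds items and marks the collection complete when done (same names as above):
backgroundUiTask.Start();

// Producers, from any thread:
globalMsgQueue.Add("some message");

// When no more messages will be produced; GetConsumingEnumerable then
// drains what's left and the background task exits on its own:
globalMsgQueue.CompleteAdding();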
I am processing PDFs of vastly varying sizes (from simple 2MB files to high-DPI scans of a few hundred MB) via a Parallel.ForEach and am occasionally getting an OutOfMemoryException - understandably, due to the process being 32-bit and the threads spawned by the Parallel.ForEach doing an unknown amount of memory-consuming work.
Restricting MaxDegreeOfParallelism does work, though the throughput for the times when there is a large (10k+) batch of small PDFs to work with is not sufficient, as there could be more threads working due to the small memory footprint of those threads. This is a CPU-heavy process, with Parallel.ForEach easily reaching 100% CPU before hitting the occasional group of large PDFs and getting an OutOfMemoryException. Running the Performance Profiler backs this up.
From my understanding, having a partitioner for my Parallel.ForEach won't improve my performance.
This leads me to using a custom TaskScheduler passed to my Parallel.ForEach with a MemoryFailPoint check. Searching around it seems there is scarce information on creating custom TaskScheduler objects.
Looking between Specialized Task Schedulers in .NET 4 Parallel Extensions Extras, A custom TaskScheduler in C# and various answers here on Stackoverflow, I've created my own TaskScheduler and have my QueueTask method as such:
protected override void QueueTask(Task task)
{
lock (tasks) tasks.AddLast(task);
try
{
using (MemoryFailPoint memFailPoint = new MemoryFailPoint(600))
{
if (runningOrQueuedCount < maxDegreeOfParallelism)
{
runningOrQueuedCount++;
RunTasks();
}
}
}
catch (InsufficientMemoryException e)
{
// somehow return thread to pool?
Console.WriteLine("InsufficientMemoryException");
}
}
While the try/catch is a little expensive, my goal here is to catch when the probable maximum-size PDF (plus a little extra memory overhead) of 600MB would throw an OutOfMemoryException. This solution though seems to kill off the thread attempting to do the work when I catch the InsufficientMemoryException. With enough large PDFs my code ends up being a single-threaded Parallel.ForEach.
Other questions found on Stackoverflow on Parallel.ForEach and OutOfMemoryExceptions don't appear to suit my use case of maximum throughput with dynamic memory usage on threads and often just leverage MaxDegreeOfParallelism as a static solution, E.g.:
Parallel.For System.OutOfMemoryException
Parallel.ForEach can cause a “Out Of Memory” exception if working with a enumerable with a large object
So to have maximum throughput for variable working memory sizes, either:
How do I return a thread back into the threadpool when it has been denied work via the MemoryFailPoint check?
How/where do I safely spawn new threads to pick up work again when there is free memory?
Edit:
The PDF size on disk may not linearly represent size in memory due to the rasterization and rasterized image manipulation component which is dependent on the PDF content.
Using LimitedConcurrencyLevelTaskScheduler from Samples for Parallel Programming with the .NET Framework I was able to make a minor adjustment to get something that looked about what I wanted. The following is the NotifyThreadPoolOfPendingWork method of the LimitedConcurrencyLevelTaskScheduler class after modification:
private void NotifyThreadPoolOfPendingWork()
{
ThreadPool.UnsafeQueueUserWorkItem(_ =>
{
// Note that the current thread is now processing work items.
// This is necessary to enable inlining of tasks into this thread.
_currentThreadIsProcessingItems = true;
try
{
// Process all available items in the queue.
while (true)
{
Task item;
lock (_tasks)
{
// When there are no more items to be processed,
// note that we're done processing, and get out.
if (_tasks.Count == 0)
{
--_delegatesQueuedOrRunning;
break;
}
// Get the next item from the queue
item = _tasks.First.Value;
_tasks.RemoveFirst();
}
// Execute the task we pulled out of the queue
//base.TryExecuteTask(item);
try
{
using (MemoryFailPoint memFailPoint = new MemoryFailPoint(650))
{
base.TryExecuteTask(item);
}
}
catch (InsufficientMemoryException e)
{
Thread.Sleep(500);
lock (_tasks)
{
_tasks.AddLast(item);
}
}
}
}
// We're done processing items on the current thread
finally { _currentThreadIsProcessingItems = false; }
}, null);
}
Let's look at the catch, but in reverse. We add the task we were going to work on back to the list of tasks (_tasks), which triggers an event to get an available thread to pick up that work. But we sleep the current thread first so that it doesn't pick up the work straight away and run back into a failed MemoryFailPoint check.
The idea of a memory-aware TaskScheduler that is based on the MemoryFailPoint class is pretty neat. Here is another idea. You could limit the parallelism based on the known size of each PDF file, by using a SemaphoreSlim. Before processing a file you could acquire the semaphore as many times as the size of the file in megabytes, and after the processing is completed you could release the semaphore an equal number of times.
The tricky part is that SemaphoreSlim doesn't have an API that acquires it atomically more than once, and acquiring it multiple times non-atomically in parallel could easily result in a deadlock. One way to synchronize the acquisition of the semaphore could be to use a second new SemaphoreSlim(1, 1) as an asynchronous mutex. Another way is to move the acquisition one step back, to the enumeration phase of the source sequence. The implementation below demonstrates the second approach. It is a variant of the .NET 6 API Parallel.ForEachAsync which, on top of the existing features, is equipped with two additional parameters: sizeSelector and maxConcurrentSize:
public static Task ParallelForEachAsync_LimitedBySize<TSource>(
IEnumerable<TSource> source,
ParallelOptions parallelOptions,
Func<TSource, CancellationToken, ValueTask> body,
Func<TSource, int> sizeSelector,
int maxConcurrentSize)
{
ArgumentNullException.ThrowIfNull(source);
ArgumentNullException.ThrowIfNull(parallelOptions);
ArgumentNullException.ThrowIfNull(body);
ArgumentNullException.ThrowIfNull(sizeSelector);
if (maxConcurrentSize < 1)
throw new ArgumentOutOfRangeException(nameof(maxConcurrentSize));
SemaphoreSlim semaphore = new(maxConcurrentSize, maxConcurrentSize);
async IAsyncEnumerable<(TSource, int)> Iterator()
{
foreach (TSource item in source)
{
int size = sizeSelector(item);
size = Math.Clamp(size, 0, maxConcurrentSize);
for (int i = 0; i < size; i++)
await semaphore.WaitAsync().ConfigureAwait(false);
yield return (item, size);
}
}
return Parallel.ForEachAsync(Iterator(), parallelOptions, async (entry, ct) =>
{
(TSource item, int size) = entry;
try { await body(item, ct).ConfigureAwait(false); }
finally { if (size > 0) semaphore.Release(size); }
});
}
Internally it calls the Parallel.ForEachAsync overload that has a source of type IAsyncEnumerable<T>.
Usage example:
ParallelOptions options = new() { MaxDegreeOfParallelism = 10 };
await ParallelForEachAsync_LimitedBySize(paths, options, async (path, ct) =>
{
// Process the file
await Task.CompletedTask;
}, sizeSelector: path =>
{
// Return the size of the file in MB
return (int)(new FileInfo(path).Length / 1_000_000);
}, maxConcurrentSize: 2_000);
The paths will be processed with a maximum parallelism of 10, and a maximum concurrent size of 2 GB.
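For comparison, the first approach mentioned earlier (serializing the multi-acquisition behind a second SemaphoreSlim(1, 1) acting as an asynchronous mutex) could be sketched like this; AcquireSizeAsync is a hypothetical helper, not part of the API above:
int maxConcurrentSize = 2_000; // e.g. 2 GB, as in the example above
SemaphoreSlim sizeSemaphore = new(maxConcurrentSize, maxConcurrentSize);
SemaphoreSlim acquireMutex = new(1, 1); // asynchronous mutex

async Task AcquireSizeAsync(int size)
{
    // Only one caller at a time may run the multi-acquire loop; otherwise two
    // callers holding partial permits could deadlock each other.
    await acquireMutex.WaitAsync().ConfigureAwait(false);
    try
    {
        for (int i = 0; i < size; i++)
            await sizeSemaphore.WaitAsync().ConfigureAwait(false);
    }
    finally { acquireMutex.Release(); }
}

// Releasing after processing doesn't need the mutex:
// sizeSemaphore.Release(size);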
I've written a small pipeline using the TPL Dataflow API which receives data from multiple threads and performs handling on them.
Setup 1
When I configure it to use MaxDegreeOfParallelism = Environment.ProcessorCount (which comes to 8 in my case) for each block, I notice it fills up buffers in multiple threads and the second block doesn't start processing until +- 1700 elements have been received across all threads. You can see this in action here.
Setup 2
When I set MaxDegreeOfParallelism = 1, I notice all elements are received on a single thread and processing by the second block already starts after +- 40 elements have been received. Data here.
Setup 3
When I set MaxDegreeOfParallelism = 1 and I introduce a delay of 1000ms before sending each input, I notice elements get sent as soon as they are received and every received element is put on a separate thread. Data here.
So far the setup. My questions are the following:
When I compare setups 1 & 2 I notice that processing elements starts much faster when done in serial compared to parallel (even after accounting for the fact that parallel has 8x as many threads). What causes this difference?
Since this will be run in an ASP.NET environment, I don't want to spawn unnecessary threads since they all come from a single threadpool. As shown in setup 3 it will still spread itself over multiple threads even when there is only a handful of data. This is also surprising because from setup 1 I would assume that data is spread sequentially over threads (notice how the first 50 elements all go to thread 16). Can I make sure it only creates new threads on an on-demand basis?
There is another concept called the BufferBlock<T>. If the TransformBlock<T> already queues input, what would be the practical difference of swapping the first step in my pipeline (ReceiveElement) for a BufferBlock?
class Program
{
static void Main(string[] args)
{
var dataflowProcessor = new DataflowProcessor<string>();
var amountOfTasks = 5;
var tasks = new Task[amountOfTasks];
for (var i = 0; i < amountOfTasks; i++)
{
tasks[i] = SpawnThread(dataflowProcessor, $"Task {i + 1}");
}
foreach (var task in tasks)
{
task.Start();
}
Task.WaitAll(tasks);
Console.WriteLine("Finished feeding threads"); // Needs to use async main
Console.Read();
}
private static Task SpawnThread(DataflowProcessor<string> dataflowProcessor, string taskName)
{
return new Task(async () =>
{
await FeedData(dataflowProcessor, taskName);
});
}
private static async Task FeedData(DataflowProcessor<string> dataflowProcessor, string threadName)
{
foreach (var i in Enumerable.Range(0, short.MaxValue))
{
await Task.Delay(1000); // Only used for the delayedSerialProcessing test
dataflowProcessor.Process($"Thread name: {threadName}\t Thread ID:{Thread.CurrentThread.ManagedThreadId}\t Value:{i}");
}
}
}
public class DataflowProcessor<T>
{
private static readonly ExecutionDataflowBlockOptions ExecutionOptions = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
private static readonly TransformBlock<T, T> ReceiveElement = new TransformBlock<T, T>(element =>
{
Console.WriteLine($"Processing received element in thread {Thread.CurrentThread.ManagedThreadId}");
return element;
}, ExecutionOptions);
private static readonly ActionBlock<T> SendElement = new ActionBlock<T>(element =>
{
Console.WriteLine($"Processing sent element in thread {Thread.CurrentThread.ManagedThreadId}");
Console.WriteLine(element);
}, ExecutionOptions);
static DataflowProcessor()
{
ReceiveElement.LinkTo(SendElement);
ReceiveElement.Completion.ContinueWith(x =>
{
if (x.IsFaulted)
{
((IDataflowBlock) ReceiveElement).Fault(x.Exception);
}
else
{
ReceiveElement.Complete();
}
});
}
public void Process(T newElement)
{
ReceiveElement.Post(newElement);
}
}
Before you deploy your solution to the ASP.NET environment, I suggest you change your architecture: IIS can suspend threads in ASP.NET for its own use after the request is handled, so your task could be left unfinished. A better approach is to create a separate Windows service daemon which handles your dataflow.
Now back to the TPL Dataflow.
I love the TPL Dataflow library but its documentation is a real mess.
The only useful document I've found is Introduction to TPL Dataflow.
There are some clues in it which can be helpful, especially the ones about Configuration Settings (I suggest you investigate implementing your own TaskScheduler using your own ThreadPool implementation, and the MaxMessagesPerTask option) if you need them:
The built-in dataflow blocks are configurable, with a wealth of control provided over how and where blocks perform their work. Here are some key knobs available to the developer, all of which are exposed through the DataflowBlockOptions class and its derived types (ExecutionDataflowBlockOptions and GroupingDataflowBlockOptions), instances of which may be provided to blocks at construction time.
TaskScheduler customization, as #i3arnon mentioned:
By default, dataflow blocks schedule work to TaskScheduler.Default, which targets the internal workings of the .NET ThreadPool.
MaxDegreeOfParallelism
It defaults to 1, meaning only one thing may happen in a block at a time. If set to a value higher than 1, that number of messages may be processed concurrently by the block. If set to DataflowBlockOptions.Unbounded (-1), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block. Note that MaxDegreeOfParallelism is a maximum, not a requirement.
MaxMessagesPerTask
TPL Dataflow is focused on both efficiency and control. Where there are necessary trade-offs between the two, the system strives to provide a quality default but also enable the developer to customize behavior according to a particular situation. One such example is the trade-off between performance and fairness. By default, dataflow blocks try to minimize the number of task objects that are necessary to process all of their data. This provides for very efficient execution; as long as a block has data available to be processed, that block’s tasks will remain to process the available data, only retiring when no more data is available (until data is available again, at which point more tasks will be spun up). However, this can lead to problems of fairness. If the system is currently saturated processing data from a given set of blocks, and then data arrives at other blocks, those latter blocks will either need to wait for the first blocks to finish processing before they’re able to begin, or alternatively risk oversubscribing the system. This may or may not be the correct behavior for a given situation. To address this, the MaxMessagesPerTask option exists.
It defaults to DataflowBlockOptions.Unbounded (-1), meaning that there is no maximum. However, if set to a positive number, that number will represent the maximum number of messages a given block may use a single task to process. Once that limit is reached, the block must retire the task and replace it with a replica to continue processing. These replicas are treated fairly with regards to all other tasks scheduled to the scheduler, allowing blocks to achieve a modicum of fairness between them. In the extreme, if MaxMessagesPerTask is set to 1, a single task will be used per message, achieving ultimate fairness at the potential expense of more tasks than may otherwise have been necessary.
MaxNumberOfGroups
The grouping blocks are capable of tracking how many groups they’ve produced, and automatically complete themselves (declining further offered messages) after that number of groups has been generated. By default, the number of groups is DataflowBlockOptions.Unbounded (-1), but it may be explicitly set to a value greater than one.
CancellationToken
This token is monitored during the dataflow block’s lifetime. If a cancellation request arrives prior to the block’s completion, the block will cease operation as politely and quickly as possible.
Greedy
By default, target blocks are greedy and want all data offered to them.
BoundedCapacity
This is the limit on the number of items the block may be storing and have in flight at any one time.
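Putting a few of these knobs together, a block configured for bounded memory and fairness might look roughly like this (the values, the Handle method, and cancellationToken are illustrative, not from the question):
var options = new ExecutionDataflowBlockOptions
{
    TaskScheduler = TaskScheduler.Default, // or a custom scheduler
    MaxDegreeOfParallelism = 4,            // at most 4 messages processed concurrently
    MaxMessagesPerTask = 1,                // retire the task after each message (maximum fairness)
    BoundedCapacity = 100,                 // postpone/block producers once 100 items are buffered
    CancellationToken = cancellationToken  // stop as politely and quickly as possible
};
var worker = new ActionBlock<string>(msg => Handle(msg), options);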
I have a C#/.NET 4.5 application that does work on around 15,000 items that are all independent of each other. Each item has relatively little CPU work to do (no more than a few milliseconds) and 1-2 I/O calls to WCF services implemented in .NET 4.5 with a SQL Server 2008 backend (I assume they will queue concurrent requests that they can't process quickly enough). These I/O operations can take anywhere from a few milliseconds to a full second. The work item then has a little more CPU work (less than 100 milliseconds) and it is done.
I am running this on a quad-core machine with hyper-threading. Using the Task Parallel Library, I am trying to get the best performance out of the machine that I can, with as little waiting on I/O as possible, by running those operations asynchronously and doing the CPU work in parallel.
Synchronously, with no parallel processes and no async operations, the application takes around 9 hours to run. I believe I can speed this up to under an hour or less but I am not sure if I am going about this the right way.
What is the best way to do the work per item in .NET? Should I make 15000 threads and have them doing all the work with context switching? Or should I just make 8 threads (how many logical cores I have) and go about it that way? Any help on this would be greatly appreciated.
My usual suggestion is TPL Dataflow.
You can use an ActionBlock with an async operation and set the parallelism as high as you need it to be:
var block = new ActionBlock<WorkItem>(async wi =>
{
DoWork(wi);
await Task.WhenAll(DoSomeWorkAsync(wi), DoOtherWorkAsync(wi));
},
new ExecutionDataflowBlockOptions{ MaxDegreeOfParallelism = 1000 });
foreach (var workItem in workItems)
{
block.Post(workItem);
}
block.Complete();
await block.Completion;
That way you can test and tweak MaxDegreeOfParallelism until you find the number that fits your specific situation the most.
For CPU intensive work having higher parallelism than your cores doesn't help, but for I/O (and other async operations) it definitely does so if your CPU intensive work is short then I would go with at least a 1000.
You definitely don't want to kick off 15000 threads and let them all thrash it out. If you can make your I/O methods completely async - meaning I/O completion ports based - then you can get some very nice controlled parallelism going on.
If you have to tie up threads whilst waiting for I/O you're going to be massively limiting your ability to process the items.
TaskFactory taskFactory = new TaskFactory(new WorkStealingTaskScheduler(Environment.ProcessorCount));
public Job[] GetJobs() { return new Job[15000]; }
public async Task ProcessJobs(Job[] jobs)
{
var jobTasks = jobs.Select(j => StartJob(j));
await Task.WhenAll(jobTasks);
}
private async Task StartJob(Job j)
{
var initialCpuResults = await taskFactory.StartNew(() => j.DoInitialCpuWork());
var wcfResult = await DoIOCalls(initialCpuResults);
await taskFactory.StartNew(() => j.DoLastCpuWork(wcfResult));
}
private async Task<bool> DoIOCalls(Result r)
{
// Sequential...
await myWcfClientProxy.DoIOAsync(...); // These MUST be fully IO completion port based methods [not Task.Run etc] to achieve good throughput
await mySQLServerClient.DoIOAsync(...);
// or in Parallel...
// await Task.WhenAll(myWcfClientProxy.DoIOAsync(...), mySQLServerClient.DoIOAsync(...));
return true;
}
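Tying it together, the entry point is then just the following (WorkStealingTaskScheduler above comes from the Parallel Extensions Extras samples):
// Build the job list and process everything with controlled CPU and IO parallelism.
await ProcessJobs(GetJobs());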