Parallel.ForEach with a custom TaskScheduler to prevent OutOfMemoryException - c#

I am processing PDFs of vastly varying sizes (from simple 2MB files to high-DPI scans of a few hundred MB) via a Parallel.ForEach and am occasionally getting an OutOfMemoryException - understandably so, since the process is 32-bit and the threads spawned by the Parallel.ForEach take on an unknown amount of memory-consuming work.
Restricting MaxDegreeOfParallelism does work, though throughput suffers when there is a large (10k+) batch of small PDFs to work with, since more threads could be working given the small memory footprint of each. This is a CPU-heavy process, with Parallel.ForEach easily reaching 100% CPU before hitting the occasional group of large PDFs and getting an OutOfMemoryException. Running the Performance Profiler backs this up.
From my understanding, having a partitioner for my Parallel.ForEach won't improve my performance.
This leads me to using a custom TaskScheduler, passed to my Parallel.ForEach, with a MemoryFailPoint check. Searching around, information on creating custom TaskScheduler objects seems scarce.
Looking between Specialized Task Schedulers in .NET 4 Parallel Extensions Extras, A custom TaskScheduler in C# and various answers here on Stack Overflow, I've created my own TaskScheduler with its QueueTask method as follows:
protected override void QueueTask(Task task)
{
    lock (tasks) tasks.AddLast(task);
    try
    {
        using (MemoryFailPoint memFailPoint = new MemoryFailPoint(600))
        {
            if (runningOrQueuedCount < maxDegreeOfParallelism)
            {
                runningOrQueuedCount++;
                RunTasks();
            }
        }
    }
    catch (InsufficientMemoryException e)
    {
        // somehow return thread to pool?
        Console.WriteLine("InsufficientMemoryException");
    }
}
While the try/catch is a little expensive, my goal here is to catch when the probable maximum PDF size (plus a little extra memory overhead) of 600MB would throw an OutOfMemoryException. This solution, though, seems to kill off the thread attempting to do the work whenever I catch the InsufficientMemoryException. With enough large PDFs my code ends up as a single-threaded Parallel.ForEach.
Other questions found on Stack Overflow about Parallel.ForEach and OutOfMemoryExceptions don't appear to suit my use case of maximum throughput with dynamic memory usage per thread, and often just leverage MaxDegreeOfParallelism as a static solution, e.g.:
Parallel.For System.OutOfMemoryException
Parallel.ForEach can cause a “Out Of Memory” exception if working with a enumerable with a large object
So to have maximum throughput for variable working memory sizes, either:
How do I return a thread back into the threadpool when it has been denied work via the MemoryFailPoint check?
How/where do I safely spawn new threads to pick up work again when there is free memory?
Edit:
The PDF size on disk may not linearly represent size in memory due to the rasterization and rasterized image manipulation component which is dependent on the PDF content.

Using LimitedConcurrencyLevelTaskScheduler from Samples for Parallel Programming with the .NET Framework, I was able to make a minor adjustment to get something that behaves roughly as I wanted. The following is the NotifyThreadPoolOfPendingWork method of the LimitedConcurrencyLevelTaskScheduler class after modification:
private void NotifyThreadPoolOfPendingWork()
{
    ThreadPool.UnsafeQueueUserWorkItem(_ =>
    {
        // Note that the current thread is now processing work items.
        // This is necessary to enable inlining of tasks into this thread.
        _currentThreadIsProcessingItems = true;
        try
        {
            // Process all available items in the queue.
            while (true)
            {
                Task item;
                lock (_tasks)
                {
                    // When there are no more items to be processed,
                    // note that we're done processing, and get out.
                    if (_tasks.Count == 0)
                    {
                        --_delegatesQueuedOrRunning;
                        break;
                    }

                    // Get the next item from the queue
                    item = _tasks.First.Value;
                    _tasks.RemoveFirst();
                }

                // Execute the task we pulled out of the queue
                //base.TryExecuteTask(item);
                try
                {
                    using (MemoryFailPoint memFailPoint = new MemoryFailPoint(650))
                    {
                        base.TryExecuteTask(item);
                    }
                }
                catch (InsufficientMemoryException e)
                {
                    Thread.Sleep(500);
                    lock (_tasks)
                    {
                        _tasks.AddLast(item);
                    }
                }
            }
        }
        // We're done processing items on the current thread
        finally { _currentThreadIsProcessingItems = false; }
    }, null);
}
We'll look at the catch, but in reverse. We add the task we were about to work on back to the list of tasks (_tasks), which triggers an event to get an available thread to pick up that work. But we sleep the current thread first so that it doesn't pick the work up straight away and run back into a failed MemoryFailPoint check.

The idea of a memory-aware TaskScheduler that is based on the MemoryFailPoint class is pretty neat. Here is another idea. You could limit the parallelism based on the known size of each PDF file, by using a SemaphoreSlim. Before processing a file you could acquire the semaphore as many times as the size of the file in megabytes, and after the processing is completed you could release the semaphore an equal number of times.
The tricky part is that SemaphoreSlim doesn't have an API that acquires it atomically more than once, and acquiring it multiple times non-atomically in parallel could easily result in a deadlock. One way to synchronize the acquisition of the semaphore could be to use a second new SemaphoreSlim(1, 1) as an asynchronous mutex (a sketch of this appears at the end of this answer). Another way is to move the acquisition one step back, to the enumeration phase of the source sequence. The implementation below demonstrates the second approach. It is a variant of the .NET 6 API Parallel.ForEachAsync that, on top of the existing features, is equipped with two additional parameters, sizeSelector and maxConcurrentSize:
public static Task ParallelForEachAsync_LimitedBySize<TSource>(
    IEnumerable<TSource> source,
    ParallelOptions parallelOptions,
    Func<TSource, CancellationToken, ValueTask> body,
    Func<TSource, int> sizeSelector,
    int maxConcurrentSize)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(parallelOptions);
    ArgumentNullException.ThrowIfNull(body);
    ArgumentNullException.ThrowIfNull(sizeSelector);
    if (maxConcurrentSize < 1)
        throw new ArgumentOutOfRangeException(nameof(maxConcurrentSize));

    SemaphoreSlim semaphore = new(maxConcurrentSize, maxConcurrentSize);

    async IAsyncEnumerable<(TSource, int)> Iterator()
    {
        foreach (TSource item in source)
        {
            int size = sizeSelector(item);
            size = Math.Clamp(size, 0, maxConcurrentSize);
            for (int i = 0; i < size; i++)
                await semaphore.WaitAsync().ConfigureAwait(false);
            yield return (item, size);
        }
    }

    return Parallel.ForEachAsync(Iterator(), parallelOptions, async (entry, ct) =>
    {
        (TSource item, int size) = entry;
        try { await body(item, ct).ConfigureAwait(false); }
        finally { if (size > 0) semaphore.Release(size); }
    });
}
Internally it calls the Parallel.ForEachAsync overload that has a source of type IAsyncEnumerable<T>.
Usage example:
ParallelOptions options = new() { MaxDegreeOfParallelism = 10 };

await ParallelForEachAsync_LimitedBySize(paths, options, async (path, ct) =>
{
    // Process the file
    await Task.CompletedTask;
}, sizeSelector: path =>
{
    // Return the size of the file in MB
    return (int)(new FileInfo(path).Length / 1_000_000);
}, maxConcurrentSize: 2_000);
The paths will be processed with a maximum parallelism of 10, and a maximum concurrent size of 2 GB.
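For completeness, here is a rough sketch of the first approach mentioned earlier: a second SemaphoreSlim(1, 1) acting as an asynchronous mutex, so that the non-atomic multi-count acquisition of the main semaphore is serialized and cannot deadlock. The helper names are illustrative only, and maxConcurrentSize comes from the surrounding context:
SemaphoreSlim sizeSemaphore = new(maxConcurrentSize, maxConcurrentSize);
SemaphoreSlim acquisitionMutex = new(1, 1); // async mutex guarding the multi-acquire

async Task AcquireSizeAsync(int size)
{
    await acquisitionMutex.WaitAsync().ConfigureAwait(false);
    try
    {
        // Only one caller at a time performs the non-atomic multi-acquire,
        // so two callers can't interleave and deadlock each other.
        for (int i = 0; i < size; i++)
            await sizeSemaphore.WaitAsync().ConfigureAwait(false);
    }
    finally { acquisitionMutex.Release(); }
}

void ReleaseSize(int size)
{
    if (size > 0) sizeSemaphore.Release(size);
}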

Related

Microsoft TPL Dataflow Library | MaxDegreeOfParallelism and IO-bound work

My use case is this: send 100,000+ web requests to our application server and wait for the results. Here, most of the delay is IO-bound, not CPU-bound, so I understand the Dataflow libraries may not be the best tool for this. I've managed to use it with a lot of success and have set the MaxDegreeOfParallelism to the number of requests that I trust the server to be able to handle; however, since this is the maximum number of tasks, it's no guarantee that this will actually be the number of tasks running at any time.
The only bit of information I could find in the documentation is this:
Because the MaxDegreeOfParallelism property represents the maximum degree of parallelism, the dataflow block might execute with a lesser degree of parallelism than you specify. The dataflow block can use a lesser degree of parallelism to meet its functional requirements or to account for a lack of available system resources. A dataflow block never chooses a greater degree of parallelism than you specify.
This explanation is quite vague on how it actually determines when to spin up a new task. My hope was that it would recognize that the task is blocked due to IO, not any system resources, and it would basically stay at the maximum degree of parallelism for the entire duration of the operation.
However, after monitoring a network capture, it seems to be MUCH quicker in the beginning and slower near the end. I can see from the capture that at the beginning it does reach the maximum as specified. The TPL library doesn't have any built-in way to monitor the current number of active threads, so I'm not really sure of the best way to investigate further on that end.
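For what it's worth, one rough way to observe how many items a block is actually processing at once is to wrap the block's delegate with an Interlocked counter. This is only a diagnostic sketch (the HttpClient named client is hypothetical, and System.Threading plus System.Threading.Tasks.Dataflow are assumed):
int active = 0;

var block = new TransformBlock<string, string>(async url =>
{
    int now = Interlocked.Increment(ref active);
    Console.WriteLine($"Concurrently executing delegates: {now}");
    try
    {
        return await client.GetStringAsync(url); // hypothetical IO-bound work
    }
    finally
    {
        Interlocked.Decrement(ref active);
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });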
My implementation:
internal static ExecutionDataflowBlockOptions GetDefaultBlockOptions(int maxDegreeOfParallelism,
    CancellationToken token) => new()
    {
        MaxDegreeOfParallelism = maxDegreeOfParallelism,
        CancellationToken = token,
        SingleProducerConstrained = true,
        EnsureOrdered = false
    };
private static async ValueTask<T?> ReceiveAsync<T>(this ISourceBlock<T?> block, bool configureAwait, CancellationToken token)
{
    try
    {
        return await block.ReceiveAsync(token).ConfigureAwait(configureAwait);
    }
    catch (InvalidOperationException)
    {
        return default;
    }
}
internal static async IAsyncEnumerable<T> YieldResults<T>(this ISourceBlock<T?> block, bool configureAwait,
    [EnumeratorCancellation] CancellationToken token)
{
    while (await block.OutputAvailableAsync(token).ConfigureAwait(configureAwait))
        if (await block.ReceiveAsync(configureAwait, token).ConfigureAwait(configureAwait) is T result)
            yield return result;

    // By the time OutputAvailableAsync returns false, the block is guaranteed to be complete. However,
    // we want to await it anyway, since this will propagate any exception thrown to the consumer.
    // We don't simply await the completed task, because that wouldn't return all aggregate exceptions,
    // just the last to occur.
    if (block.Completion.Exception != null)
        throw block.Completion.Exception;
}
public static IAsyncEnumerable<TResult> ParallelSelectAsync<T, TResult>(this IEnumerable<T> source, Func<T, Task<TResult?>> body,
    int maxDegreeOfParallelism = DataflowBlockOptions.Unbounded, TaskScheduler? scheduler = null, CancellationToken token = default)
{
    var options = GetDefaultBlockOptions(maxDegreeOfParallelism, token);
    if (scheduler != null)
        options.TaskScheduler = scheduler;

    var block = new TransformBlock<T, TResult?>(body, options);
    foreach (var item in source)
        block.Post(item);
    block.Complete();

    return block.YieldResults(scheduler != null && scheduler != TaskScheduler.Default, token);
}
So, basically, my question is this: when an IO-bound action is executed in a TPL Dataflow block, how can I ensure the block stays at the MaxDegreeOfParallelism that is set?
On the contrary, Dataflow is great at IO work and perfect for this scenario. Dataflow architectures work by creating pipelines similar to Bash or PowerShell pipelines. Each block acts as a separate command, reading messages from its input queue and passing them to the next block through its output queue. That's why the default DOP is 1 - parallelism and concurrency come from using multiple commands/blocks, not a fat block with a high DOP.
This is a simplified example of what I use at work to request daily sales reports from about a hundred airlines (BSPs, for those that know about air tickets), parse the reports and then download individual ticket records, before importing everything into the database.
In this case the head block downloads content with a DOP=10, then the parser block parses the responses one at a time. The downloader is IO-bound so it can make a lot more requests than there are cores, as many as the services allow, or the application wants to handle.
The parser on the other hand is CPU bound. A high DOP would lock a lot of cores, which would harm not just the application, but other processes as well.
// Create the blocks
var dlOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10
};
var downloader = new TransformBlock<string, string>(
    url => _client.GetStringAsync(url, cancellationToken),
    dlOptions);
var parser = new TransformBlock<string, Something>(ParseIntoSomething);
var importer = new ActionBlock<Something>(ImportInDb);

// Link the blocks
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
downloader.LinkTo(parser, linkOptions);
parser.LinkTo(importer, linkOptions);
After building this 3-step pipeline I post URLs at the front and expect the tail block to complete:
foreach (var url in urls)
{
    downloader.Post(url);
}
downloader.Complete();

await importer.Completion;
There are a lot of possible improvements to this. Right now, if the downloader is faster than the parser, all the content will be buffered in memory. In a long running pipeline this can easily take up all available memory.
A simple way to avoid this is to add BoundedCapacity=N to the parser block options. If the parser's input buffer is full, upstream blocks, in this case the downloader, will pause and wait until a slot becomes available:
var parserOptions = new ExecutionDataflowBlockOptions
{
    BoundedCapacity = 2,
    MaxDegreeOfParallelism = 2,
};

var parser = new TransformBlock<string, Something>(ParseIntoSomething, parserOptions);
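One related detail: once a block has a BoundedCapacity, Post returns false immediately when its input buffer is full, so when the head block itself is bounded it is usually fed with SendAsync, which waits asynchronously for a free slot instead of dropping the message. A small sketch along those lines:
foreach (var url in urls)
{
    // SendAsync awaits until the (bounded) downloader block has room,
    // whereas Post would return false when its buffer is full.
    await downloader.SendAsync(url);
}
downloader.Complete();
await importer.Completion;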

How to ensure parallel tasks dequeue unique entries from ConcurrentQueue<T>?

Hi, I have a ConcurrentQueue that is loaded with files from a database. These files are to be processed by parallel tasks that dequeue the files. However, I run into issues where, after some time, I start getting tasks that dequeue the same file at the same time (which leads to "used by another process" errors on the file). And I also get more tasks than are supposed to be allocated. I have even seen 8 tasks running at once, which should not be happening; the active tasks limit is 5.
Rough code:
private void ParseQueuedTDXFiles()
{
    while (_signalParseQueuedFilesEvent.WaitOne())
    {
        Task.Run(() => SetParsersTask());
    }
}
The _signalParseQueuedFilesEvent is set on a timer in a Windows Service
The above function then calls SetParsersTask. This is why I use a ConcurrentDictionary to track how many active tasks there are, and make sure they stay below _ActiveTasksLimit:
private void SetParsersTask()
{
    if (_ConcurrentqueuedTdxFilesToParse.Count > 0)
    {
        if (_activeParserTasksDict.Count < _ActiveTasksLimit) // ConcurrentDictionary used to control how many tasks should run
        {
            int parserCountToStart = _ActiveTasksLimit - _activeParserTasksDict.Count;
            Parallel.For(0, parserCountToStart, parserToStart =>
            {
                lock (_concurrentQueueLock)
                    Task.Run(() => PrepTdxParser());
            });
        }
    }
}
Which then calls this function which dequeues the Concurrent Queue:
private void PrepTdxParser()
{
    TdxFileToProcessData fileToProcess;
    lock (_concurrentQueueLock)
        _ConcurrentqueuedTdxFilesToParse.TryDequeue(out fileToProcess);

    if (!string.IsNullOrEmpty(fileToProcess.TdxFileName))
    {
        LaunchTDXParser(fileToProcess);
    }
}
I even put a lock on _ConcurrentqueuedTdxFilesToParse even though I know it doesn't need one. All to make sure that I never run into a situation where two Tasks are dequeuing the same file.
This function is where I add and remove Tasks as well as launch the file parser for the dequeued file:
private void LaunchTDXParser(TdxFileToProcessData fileToProcess)
{
    string fileName = fileToProcess.TdxFileName;
    Task startParserTask = new Task(() => ConfigureAndStartProcess(fileName));
    _activeParserTasksDict.TryAdd(fileName, startParserTask);
    startParserTask.Start();
    Task.WaitAll(startParserTask);
    _activeParserTasksDict.TryRemove(fileName, out Task taskToBeRemoved);
}
Can you guys help me understand why I am getting the same file dequeued in two different Tasks? And why I am getting more Tasks than the _ActiveTasksLimit?
There are a number of red flags in this¹ code:
Using a WaitHandle. This tool is too primitive. I've never seen a problem solved with WaitHandles that can't be solved in a simpler way without them.
Launching Task.Run tasks in a fire-and-forget fashion.
Launching a Parallel.For loop without configuring the MaxDegreeOfParallelism. This practically guarantees that the ThreadPool will get saturated.
Protecting a queue (_queuedTdxFilesToParse) with a lock (_concurrentQueueLock) only partially. If the queue is a Queue<T>, you must protect it on each and every operation, otherwise the behavior of the program is undefined. If the queue is a ConcurrentQueue<T>, there is no need to protect it because it is thread-safe by itself.
Calling Task.Factory.StartNew and Task.Start without configuring the scheduler argument.
So I am not surprised that your code is not working as expected. I can't point to a specific error that needs to be fixed. For me the whole approach is dubious, and needs to be reworked/scrapped. Some concepts and tools that you might want to research before attempting to rewrite this code:
The producer-consumer pattern.
The BlockingCollection<T> class.
The TPL Dataflow library.
Optionally you could consider familiarizing yourself with asynchronous programming. It can help at reducing the number of threads that your program uses while running, resulting in a more efficient and scalable program. Two powerful asynchronous tools are the Channel<T> class and the Parallel.ForEachAsync API (available from .NET 6 and later).
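As a rough illustration of the Channel<T> plus Parallel.ForEachAsync combination (requires .NET 6+ and System.Threading.Channels; the batch size of 100 and the ProcessFileAsync method are placeholders, not part of the original code):
Channel<TdxFileToProcessData> channel = Channel.CreateUnbounded<TdxFileToProcessData>();

// Producer: enqueue files as they are discovered in the database.
Task producer = Task.Run(async () =>
{
    foreach (var file in ServiceDBHelper.GetTdxFilesToEnqueueForProcessingFromDB(100))
        await channel.Writer.WriteAsync(file);
    channel.Writer.Complete();
});

// Consumers: at most 5 files are parsed concurrently.
ParallelOptions options = new() { MaxDegreeOfParallelism = 5 };
Task consumers = Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options,
    async (file, ct) =>
    {
        await ProcessFileAsync(file, ct); // placeholder for the actual parsing
    });

await Task.WhenAll(producer, consumers);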
¹ This answer was intended for a related question that is now deleted.
So I fixed my problem. The solution was first to not add more parallelism than need be. I was trying to create a situation where private void SetParsersTask() would not be held up by tasks that still needed to finish processing a file. So I foolishly threw in Parallel.For in addition to Task.Start, which is already parallel. I fixed this by generating fire-and-forget tasks in a normal for loop as opposed to Parallel.For:
private void SetParsersTask()
{
    if (_queuedTdxFilesToParse.Count > 0)
    {
        if (_activeParserTasksDict.Count < _tdxParsersInstanceCount)
        {
            int parserCountToStart = _tdxParsersInstanceCount - _activeParserTasksDict.Count;
            _queuedTdxFilesToParse = new ConcurrentQueue<TdxFileToProcessData>(_queuedTdxFilesToParse.Distinct());
            for (int i = 0; i < parserCountToStart; i++)
            {
                Task.Run(() => PrepTdxParser());
            }
        }
    }
}
After that I was still getting the occasional duplicate file, so I moved the queue loading to another long-running thread. For that thread I use an AutoResetEvent so that the queue is only populated once at any instant in time, as opposed to another task potentially loading it with duplicate files. It could be that both my enqueue and dequeue were responsible, and now it's addressed:
var _loadQueueTask = Task.Factory.StartNew(() => LoadQueue(), TaskCreationOptions.LongRunning);

private void LoadQueue()
{
    while (_loadConcurrentQueueEvent.WaitOne())
    {
        if (_queuedTdxFilesToParse.Count < _tdxParsersInstanceCount)
        {
            int numFilesToGet = _tdxParsersInstanceCount - _activeParserTasksDict.Count;
            var filesToAdd = ServiceDBHelper.GetTdxFilesToEnqueueForProcessingFromDB(numFilesToGet);
            foreach (var fileToProc in filesToAdd)
            {
                ServiceDBHelper.UpdateTdxFileToProcessStatusAndUpdateDateTime(fileToProc.TdxFileName, 1, DateTime.Now);
                _queuedTdxFilesToParse.Enqueue(fileToProc);
            }
        }
    }
}
Thanks to Theo for pointing me to additional tools and making me look closer in my parallel loops

Thread WaitReason.UserRequest

A Windows Service uses too many threads. I added some logging to find out more. Sadly, there's little support from the .Net framework.
ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads); starts with some 32760 workerThreads and 1000 completionPortThreads, respectively.
After a few hours, available workerThreads went down to 31817, i.e. almost 1000 managed threads are in use.
What are they doing? There's no way to find out (you may find some workaround where you place the threads you create into some collection, and later analyze that collection, but that fails when you also use Parallel.ForEach or Task.Run).
Well, there is another possibility. Try ProcessThreadCollection currentThreads = Process.GetCurrentProcess().Threads; That will give you a list of non-managed threads (their number is also shown in Windows Task Manager).
My Windows Service starts with some 20 of them. After a few hours, I detect 3828, i.e. about 4 non-managed threads for each managed thread...
Now I can ask each of them when it started, what its priority is, what it is doing currently, and why it is waiting. Yes, for almost all of them the current state is Wait. And the WaitReason is in most cases UserRequest.
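A minimal sketch of that kind of inspection (using System.Diagnostics; note that WaitReason is only valid while the thread is in the Wait state, and StartTime/PriorityLevel can throw for threads that have already exited, so treat this as illustrative):
foreach (ProcessThread thread in Process.GetCurrentProcess().Threads)
{
    // WaitReason is only meaningful while ThreadState is Wait.
    string waitReason = thread.ThreadState == System.Diagnostics.ThreadState.Wait
        ? thread.WaitReason.ToString()
        : "n/a";

    Console.WriteLine(
        $"Id={thread.Id} Started={thread.StartTime} Priority={thread.PriorityLevel} " +
        $"State={thread.ThreadState} Wait={waitReason}");
}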
So my question is: what are those threads actually doing? There is no User Interface, even no command line associated with that executable: it is a Windows Service...
Also, I'd like to know how to get rid of them: many threads are created and should also run to completion in a short time (within seconds) - but some are "waiting" for hours.
I solved that issue by using the rationale that a thread which is not created cannot hang around uselessly.
I removed some calls to Parallel.ForEach(collection, item => { item.DoSomething(parameters); });. Now implementations of IItem.DoSomething(parameters) just enqueue the parameters for later processing, and IItems have a thread for that processing (Active Object pattern). Consequently, a "common" foreach can be used.
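A minimal sketch of that Active Object idea, assuming a BlockingCollection<T> as the backing queue (ActiveItem, Parameters and Process are placeholder names, not the original implementation):
public class ActiveItem : IDisposable
{
    private readonly BlockingCollection<Parameters> _queue = new BlockingCollection<Parameters>();
    private readonly Thread _worker;

    public ActiveItem()
    {
        // A single dedicated thread drains the queue and does the actual work.
        _worker = new Thread(() =>
        {
            foreach (Parameters p in _queue.GetConsumingEnumerable())
                Process(p);
        });
        _worker.IsBackground = true;
        _worker.Start();
    }

    // DoSomething only enqueues the parameters and returns immediately.
    public void DoSomething(Parameters p) => _queue.Add(p);

    private void Process(Parameters p) { /* the real work happens here */ }

    public void Dispose()
    {
        _queue.CompleteAdding(); // lets the worker loop end once the queue is drained
        _worker.Join();
    }
}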
When results need to be collected, the pattern is more complicated:
private List<IResult> CollectResults(IEnumerable<IItem> collection, int maximumProcessingMilliseconds)
{
    List<IResult> results = new List<IResult>();
    CancellationTokenSource cts = new CancellationTokenSource();
    cts.CancelAfter(maximumProcessingMilliseconds);
    var tasks = new List<Task<IResult>>();
    foreach (IItem item in collection)
    {
        IItem localItem = item;
        tasks.Add(Task.Run(() => localItem.GetResult(cts.Token), cts.Token));
    }
    Task[] tasksArray = tasks.ToArray();
    try
    {
        Task.WaitAll(tasksArray, TimeSpan.FromMilliseconds(maximumProcessingMilliseconds));
        Task.WaitAll(tasksArray);
    }
    catch (AggregateException ex)
    {
        Logger.LogException(ex);
    }
    foreach (Task<IResult> task in tasks)
    {
        if (task.Status == TaskStatus.RanToCompletion)
        {
            results.Add(task.Result);
        }
    }
    return results;
}
This is a terrible lot of boilerplate code which obfuscates the actual meaning of the function.
In the end, the number of (unmanaged) threads used by the application hardly ever grows beyond 100, and if it does, it returns to lower values quickly.

How can I make sure a dataflow block only creates threads on a on-demand basis?

I've written a small pipeline using the TPL Dataflow API which receives data from multiple threads and performs handling on them.
Setup 1
When I configure it to use MaxDegreeOfParallelism = Environment.ProcessorCount (which comes to 8 in my case) for each block, I notice it fills up buffers in multiple threads and processing in the second block doesn't start until +- 1700 elements have been received across all threads. You can see this in action here.
Setup 2
When I set MaxDegreeOfParallelism = 1, I notice all elements are received on a single thread and processing/sending already starts after +- 40 elements are received. Data here.
Setup 3
When I set MaxDegreeOfParallelism = 1 and I introduce a delay of 1000ms before sending each input, I notice elements get sent as soon as they are received and every received element is put on a separate thread. Data here.
So far the setup. My questions are the following:
When I compare setups 1 & 2 I notice that processing elements starts much faster when done in serial compared to parallel (even after accounting for the fact that parallel has 8x as many threads). What causes this difference?
Since this will be run in an ASP.NET environment, I don't want to spawn unnecessary threads since they all come from a single thread pool. As shown in setup 3, it will still spread itself over multiple threads even when there is only a handful of data. This is also surprising because from setup 1 I would assume that data is spread sequentially over threads (notice how the first 50 elements all go to thread 16). Can I make sure it only creates new threads on an on-demand basis?
There is another concept called the BufferBlock<T>. If the TransformBlock<T> already queues input, what would be the practical difference of swapping the first step in my pipeline (ReceiveElement) for a BufferBlock?
class Program
{
    static void Main(string[] args)
    {
        var dataflowProcessor = new DataflowProcessor<string>();
        var amountOfTasks = 5;
        var tasks = new Task[amountOfTasks];

        for (var i = 0; i < amountOfTasks; i++)
        {
            tasks[i] = SpawnThread(dataflowProcessor, $"Task {i + 1}");
        }

        foreach (var task in tasks)
        {
            task.Start();
        }

        Task.WaitAll(tasks);
        Console.WriteLine("Finished feeding threads"); // Needs to use async main
        Console.Read();
    }

    private static Task SpawnThread(DataflowProcessor<string> dataflowProcessor, string taskName)
    {
        return new Task(async () =>
        {
            await FeedData(dataflowProcessor, taskName);
        });
    }

    private static async Task FeedData(DataflowProcessor<string> dataflowProcessor, string threadName)
    {
        foreach (var i in Enumerable.Range(0, short.MaxValue))
        {
            await Task.Delay(1000); // Only used for the delayedSerialProcessing test
            dataflowProcessor.Process($"Thread name: {threadName}\t Thread ID:{Thread.CurrentThread.ManagedThreadId}\t Value:{i}");
        }
    }
}

public class DataflowProcessor<T>
{
    private static readonly ExecutionDataflowBlockOptions ExecutionOptions = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };

    private static readonly TransformBlock<T, T> ReceiveElement = new TransformBlock<T, T>(element =>
    {
        Console.WriteLine($"Processing received element in thread {Thread.CurrentThread.ManagedThreadId}");
        return element;
    }, ExecutionOptions);

    private static readonly ActionBlock<T> SendElement = new ActionBlock<T>(element =>
    {
        Console.WriteLine($"Processing sent element in thread {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine(element);
    }, ExecutionOptions);

    static DataflowProcessor()
    {
        ReceiveElement.LinkTo(SendElement);
        ReceiveElement.Completion.ContinueWith(x =>
        {
            if (x.IsFaulted)
            {
                ((IDataflowBlock)ReceiveElement).Fault(x.Exception);
            }
            else
            {
                ReceiveElement.Complete();
            }
        });
    }

    public void Process(T newElement)
    {
        ReceiveElement.Post(newElement);
    }
}
Before you deploy your solution to the ASP.NET environment, I suggest you change your architecture: IIS can suspend threads in ASP.NET for its own use after the request is handled, so your task could be left unfinished. A better approach is to create a separate Windows Service daemon which handles your dataflow.
Now back to the TPL Dataflow.
I love the TPL Dataflow library but its documentation is a real mess.
The only useful document I've found is Introduction to TPL Dataflow.
There are some clues in it which can be helpful, especially the ones about Configuration Settings (I suggest you investigate implementing your own TaskScheduler using your own ThreadPool implementation, and the MaxMessagesPerTask option) if you need them:
The built-in dataflow blocks are configurable, with a wealth of control provided over how and where blocks perform their work. Here are some key knobs available to the developer, all of which are exposed through the DataflowBlockOptions class and its derived types (ExecutionDataflowBlockOptions and GroupingDataflowBlockOptions), instances of which may be provided to blocks at construction time.
TaskScheduler customization, as #i3arnon mentioned:
By default, dataflow blocks schedule work to TaskScheduler.Default, which targets the internal workings of the .NET ThreadPool.
MaxDegreeOfParallelism
It defaults to 1, meaning only one thing may happen in a block at a time. If set to a value higher than 1, that number of messages may be processed concurrently by the block. If set to DataflowBlockOptions.Unbounded (-1), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block. Note that MaxDegreeOfParallelism is a maximum, not a requirement.
MaxMessagesPerTask
TPL Dataflow is focused on both efficiency and control. Where there are necessary trade-offs between the two, the system strives to provide a quality default but also enable the developer to customize behavior according to a particular situation. One such example is the trade-off between performance and fairness. By default, dataflow blocks try to minimize the number of task objects that are necessary to process all of their data. This provides for very efficient execution; as long as a block has data available to be processed, that block’s tasks will remain to process the available data, only retiring when no more data is available (until data is available again, at which point more tasks will be spun up). However, this can lead to problems of fairness. If the system is currently saturated processing data from a given set of blocks, and then data arrives at other blocks, those latter blocks will either need to wait for the first blocks to finish processing before they’re able to begin, or alternatively risk oversubscribing the system. This may or may not be the correct behavior for a given situation. To address this, the MaxMessagesPerTask option exists.
It defaults to DataflowBlockOptions.Unbounded (-1), meaning that there is no maximum. However, if set to a positive number, that number will represent the maximum number of messages a given block may use a single task to process. Once that limit is reached, the block must retire the task and replace it with a replica to continue processing. These replicas are treated fairly with regards to all other tasks scheduled to the scheduler, allowing blocks to achieve a modicum of fairness between them. In the extreme, if MaxMessagesPerTask is set to 1, a single task will be used per message, achieving ultimate fairness at the potential expense of more tasks than may otherwise have been necessary.
MaxNumberOfGroups
The grouping blocks are capable of tracking how many groups they’ve produced, and automatically complete themselves (declining further offered messages) after that number of groups has been generated. By default, the number of groups is DataflowBlockOptions.Unbounded (-1), but it may be explicitly set to a value greater than one.
CancellationToken
This token is monitored during the dataflow block’s lifetime. If a cancellation request arrives prior to the block’s completion, the block will cease operation as politely and quickly as possible.
Greedy
By default, target blocks are greedy and want all data offered to them.
BoundedCapacity
This is the limit on the number of items the block may be storing and have in flight at any one time.
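A sketch of how these knobs are typically wired up on a block; all of the values below are purely illustrative, not recommendations:
var cts = new CancellationTokenSource();

var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4,           // up to 4 messages processed concurrently
    MaxMessagesPerTask = 10,              // retire each task after 10 messages, for fairness
    BoundedCapacity = 100,                // at most 100 items buffered/in flight
    CancellationToken = cts.Token,        // cooperative cancellation
    TaskScheduler = TaskScheduler.Default // where the block's tasks are scheduled
};

var worker = new ActionBlock<string>(item => Console.WriteLine(item), options);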

Multi-threading potentially long running operations

I am writing a Windows Service. I have a 'backlog' (in SQL) of records that have to be processed by the service. The backlog might also be empty. The record processing is a potentially very long running operation (3+ minutes).
I have a class with a method in it which goes to SQL, chooses a record and processes it, if there are any records to process. Then the method will exit and that's it. Note: I can't know in advance which records will be processed - the class method decides this as part of its logic.
I want to achieve parallel processing. I want to have X workers (where X is the optimal number for the host PC) at any time. While the backlog is empty, those workers finish their jobs and exit pretty quickly (~50-100ms, maybe). I want any 'freed' worker to start over again (i.e. re-run).
I have done some reading and I deduce that ThreadPool is not a good option for long-running operations. The .NET 4.0+ parallel library is not a good option either, as I don't want to wait for all workers to finish and I don't want to predefine/declare the tasks in advance.
In layman's terms, I want X workers that query the data source for items; when one of them finds an item it operates on it, while the rest continue to look for newly pushed items in the backlog.
What would be the best approach? I think I will have to manage the threads entirely by myself? i.e. first step - determine the optimum number of threads (perhaps by checking the Environment.ProcessorCount) and then start the X threads. Monitor for IsAlive on each thread and restart it? This seems awfully unprofessional.
Any suggestions?
You can start one task per core. As tasks finish, start new ones. You can set numOfThreads based on ProcessorCount or a specific number.
int numOfThreads = System.Environment.ProcessorCount;
// int numOfThreads = X;
var tasks = new List<Task>();

for (int i = 0; i < numOfThreads; i++)
    tasks.Add(Task.Factory.StartNew(() => { /* do work */ }));

while (tasks.Count > 0) // wait for tasks to finish
{
    int index = Task.WaitAny(tasks.ToArray());
    tasks.RemoveAt(index);
    if (moreWorkAvailable) // placeholder condition for remaining backlog items
        tasks.Add(Task.Factory.StartNew(() => { /* do work */ }));
}
or
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = System.Environment.ProcessorCount;
Parallel.For(0, N, options, i => { /* long running computation */ });
or
You can implement the Producer-Consumer pattern with BlockingCollection.
This topic is excellently taught by Dr. Joe Hummel in his Pluralsight course "Async and parallel programming: Application design".
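A minimal BlockingCollection<T> producer/consumer sketch along those lines (Record, ProcessRecord and ReadBacklogFromSql are placeholders for your own types and data access):
var backlog = new BlockingCollection<Record>(boundedCapacity: 100);
int workerCount = Environment.ProcessorCount;

// X long-running consumers keep pulling records while the producer fills the backlog.
Task[] consumers = Enumerable.Range(0, workerCount)
    .Select(_ => Task.Factory.StartNew(() =>
    {
        foreach (Record record in backlog.GetConsumingEnumerable())
            ProcessRecord(record); // potentially 3+ minutes of work
    }, TaskCreationOptions.LongRunning))
    .ToArray();

// Producer: poll SQL and enqueue; Add blocks when the collection is full.
foreach (Record record in ReadBacklogFromSql())
    backlog.Add(record);

backlog.CompleteAdding(); // consumers exit once the backlog is drained
Task.WaitAll(consumers);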
Consider using ActionBlock<T> from the TPL Dataflow library. It can be configured to process multiple messages concurrently using all available CPU cores.
ActionBlock<QueueItem> _processor;
Task _completionTask;
bool _done;

async Task ReadQueueAsync(int pollingInterval)
{
    while (!_done)
    {
        // Get a list of items to process from SQL database
        var list = ...;

        // Schedule the work
        foreach (var item in list)
        {
            _processor.Post(item);
        }

        // Give SQL server time to re-fill the queue
        await Task.Delay(pollingInterval);
    }

    // Signal the processor that we are done
    _processor.Complete();
}

void ProcessItem(QueueItem item)
{
    // Do your work here
}

void Setup()
{
    // Configure action block to process items concurrently
    // using all available CPU cores
    _processor = new ActionBlock<QueueItem>(new Action<QueueItem>(ProcessItem),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

    _done = false;
    var queueReaderTask = ReadQueueAsync(QUEUE_POLL_INTERVAL);
    _completionTask = Task.WhenAll(queueReaderTask, _processor.Completion);
}

void Complete()
{
    _done = true;
    _completionTask.Wait();
}
Per MaxDegreeOfParallelism's documentation: "Generally, you do not need to modify this setting. However, you may choose to set it explicitly in advanced usage scenarios such as these:
When you know that a particular algorithm you're using won't scale beyond a certain number of cores. You can set the property to avoid wasting cycles on additional cores.
When you're running multiple algorithms concurrently and want to manually define how much of the system each algorithm can utilize. You can set a MaxDegreeOfParallelism value for each.
When the thread pool's heuristics is unable to determine the right number of threads to use and could end up injecting too many threads. For example, in long-running loop body iterations, the thread pool might not be able to tell the difference between reasonable progress or livelock or deadlock, and might not be able to reclaim threads that were added to improve performance. In this case, you can set the property to ensure that you don't use more than a reasonable number of threads."
If you do not have an advanced usage scenario like the 3 cases above, you may want to hand your list of items or tasks to be run to the Task Parallel Library and let the framework handle the processor count.
List<InfoObject> infoList = GetInfo();
ConcurrentQueue<ResultObject> output = new ConcurrentQueue<ResultObject>();

await Task.Run(() =>
{
    Parallel.ForEach<InfoObject>(infoList, item =>
    {
        ResultObject result = ProcessInfo(item);
        output.Enqueue(result);
    });
});

foreach (var resultObj in output)
{
    ReportOnResultObject(resultObj);
}
OR
List<InfoObject> infoList = GetInfo();
List<Task<ResultObject>> tasks = new List<Task<ResultObject>>();

foreach (var item in infoList)
{
    tasks.Add(Task.Run(() => ProcessInfo(item)));
}

var results = await Task.WhenAll(tasks);

foreach (var resultObj in results)
{
    ReportOnResultObject(resultObj);
}
H/T to IAmTimCorey tutorials:
https://www.youtube.com/watch?v=2moh18sh5p4
https://www.youtube.com/watch?v=ZTKGRJy5P2M
