I need to process a large number of files overnight, with a defined start and end time to avoid disrupting users. I've been investigating but there are so many ways of handling threading now that I'm not sure which way to go. The files come into an Exchange inbox as attachments.
My current attempt, based on some examples from here and a bit of experimentation, is:
while (DateTime.Now < dtEndTime.Value)
{
var finished = new CountdownEvent(1);
for (int i = 0; i < numThreads; i++)
{
object state = offset;
finished.AddCount();
ThreadPool.QueueUserWorkItem(delegate
{
try
{
StartProcessing(state);
}
finally
{
finished.Signal();
}
});
offset += numberOfFilesPerPoll;
}
finished.Signal();
finished.Wait();
}
It's running in a WinForms app at the moment for ease, but the core processing is in a DLL, so I can spawn the class I need from a Windows service, from a console app running under a scheduler, whichever is easiest. I do have a Windows service set up with a Timer object that kicks off the processing at a time set in the config file.
So my question is: in the above code I initialise a bunch of threads (currently 10), then wait for them all to finish. My ideal would be a fixed number of threads where, as one finishes, I fire off another, and then when I reach the end time I just wait for all running threads to complete.
The reason for this is that the files I'm processing are variable sizes - some might take seconds to process and some might take hours, so I don't want the whole application to wait while one thread completes if I can have it ticking along in the background.
(Edit) As it stands, each thread instantiates a class and passes it an offset. The class then gets the next x emails from the inbox, starting at the offset (using the Exchange Web Services paging functionality). As each file is processed, it's moved to a separate folder. From some of the replies so far, I'm wondering if I should actually grab the e-mails in the outer loop and spawn threads as needed.
To cloud the issue, I currently have a backlog of e-mails that I'm trying to process through. Once the backlog has been cleared, it's likely that the nightly run will have a significantly lower load.
On average there are around 1000 files to process each night.
Update
I've rewritten large chunks of my code so that I can use the Parallel.Foreach and I've come up against an issue with thread safety. The calling code now looks like this:
public bool StartProcessing()
{
FindItemsResults<Item> emails = GetEmails();
var source = new CancellationTokenSource(TimeSpan.FromHours(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions { MaxDegreeOfParallelism = 8, CancellationToken = source.Token };
try
{
Parallel.ForEach(emails, opts, processAttachment);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
catch (Exception err)
{
WriteToLogFile(err.Message + "\r\n");
WriteToLogFile(err.StackTrace + "\r\n");
}
return true;
}
So far so good (excuse the temporary error handling). I now have a new issue: the properties of the "Item" object, which is an e-mail, are not thread-safe. So, for example, when I start processing an e-mail I move it to a "processing" folder so that another process can't grab it - but it turns out that several of the threads might be trying to process the same e-mail at a time. How do I guarantee that this doesn't happen? I know I need to add a lock; can I add it in the ForEach, or should it be in the processAttachment method?
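One hedged sketch of what such a guard could look like inside processAttachment, assuming the EWS Managed API (where Item.Id.UniqueId is a string) and the processAttachment(Item) delegate used in the loop above: claim each e-mail in a ConcurrentDictionary before touching it, so a duplicate produced by the paging step is simply skipped rather than processed twice.
// requires System.Collections.Concurrent; Item comes from the EWS Managed API (an assumption)
private readonly ConcurrentDictionary<string, byte> claimedEmails = new ConcurrentDictionary<string, byte>();

private void processAttachment(Item email)
{
    // TryAdd succeeds for exactly one caller per unique id, so only one thread "wins" each e-mail.
    if (!claimedEmails.TryAdd(email.Id.UniqueId, 0))
        return; // another thread already claimed this e-mail

    // ... move the e-mail to the "processing" folder and do the real work here ...
}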
Use the TPL:
Parallel.ForEach( EnumerateFiles(),
new ParallelOptions { MaxDegreeOfParallelism = 10 },
file => ProcessFile( file ) );
Make EnumerateFiles stop enumerating when your end time is reached, trivially like this:
IEnumerable<string> EnumerateFiles()
{
foreach (var file in Directory.EnumerateFiles( ".", "*.txt" ))
if (DateTime.Now < _endTime)
yield return file;
else
yield break;
}
You can use a combination of Parallel.ForEach() along with a cancellation token source which will cancel the operation after a set time:
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(i => i.ToString());
// For demo purposes, cancel after a few seconds.
var source = new CancellationTokenSource(TimeSpan.FromSeconds(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions { MaxDegreeOfParallelism = 8, CancellationToken = source.Token };
try
{
Parallel.ForEach(fileList, opts, processFile);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
As an alternative to using a cancellation token, you can write a method that returns IEnumerable<string> which returns the list of filenames, and stop returning them when time is up, for example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions {MaxDegreeOfParallelism = 8};
Parallel.ForEach(fileList(), opts, processFile);
}
static IEnumerable<string> fileList()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(x => x.ToString()).ToArray();
// Simulate finishing after a few seconds.
DateTime endTime = DateTime.Now + TimeSpan.FromSeconds(10);
int i = 0;
while (DateTime.Now <= endTime && i < fileList.Length)
yield return fileList[i++];
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
Note that you don't need the try/catch with this approach.
You should consider using Microsoft's Reactive Framework. It lets you use LINQ queries to compose multithreaded asynchronous processing in a very simple way.
Something like this:
var query =
from file in filesToProcess.ToObservable()
where DateTime.Now < stopTime
from result in Observable.Start(() => StartProcessing(file))
select new { file, result };
var subscription =
query.Subscribe(x =>
{
/* handle result */
});
Truly, that's all the code you need if StartProcessing is already defined.
Just NuGet "Rx-Main".
Oh, and to stop processing at any time just call subscription.Dispose().
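If you also want to cap how many files are being processed at once (the query above starts an Observable.Start per file as fast as they arrive), a hedged variation using the same filesToProcess, stopTime and StartProcessing names can use the Merge overload that takes a maximum concurrency, and TakeWhile to stop (rather than filter) once the end time passes:
var query =
    filesToProcess
        .ToObservable()
        .TakeWhile(_ => DateTime.Now < stopTime)
        .Select(file => Observable.Start(() => StartProcessing(file))
                                  .Select(result => new { file, result }))
        .Merge(10); // at most 10 files in flight at a time

var subscription = query.Subscribe(x => { /* handle result */ });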
This was a truly fascinating task, and it took me a while to get the code to a level that I was happy with it.
I ended up with a combination of the above.
The first thing worth noting is that I added the following lines to my web service call, because the operation timeout I was experiencing - which I thought was caused by exceeding some limit set on the endpoint - was actually due to a limit set by Microsoft way back in .NET 2.0:
ServicePointManager.DefaultConnectionLimit = int.MaxValue;
ServicePointManager.Expect100Continue = false;
See here for more information:
What to set ServicePointManager.DefaultConnectionLimit to
As soon as I added those lines of code, my processing increased from 10/minute to around 100/minute.
But I still wasn't happy with the looping, partitioning, etc. My service moved onto a physical server to minimise CPU contention, and I wanted to let the operating system dictate how fast it ran rather than have my code throttle it.
After some research, this is what I ended up with - arguably not the most elegant code I've written, but it's extremely fast and reliable.
List<XElement> elements = new List<XElement>();
while (XMLDoc.ReadToFollowing("ElementName"))
{
using (XmlReader r = XMLDoc.ReadSubtree())
{
r.Read();
XElement node = XElement.Load(r);
//do some processing of the node here...
elements.Add(node);
}
}
//And now pass the list of elements through PLinQ to the actual web service call, allowing the OS/framework to handle the parallelism
int failCount=0; //the method call below sets this per request; we log and continue
failCount = elements.AsParallel()
.Sum(element => IntegrationClass.DoRequest(element.ToString()));
It ended up fiendishly simple and lightning fast.
I hope this helps someone else trying to do the same thing!
I have a Windows service that reads data from the database and processes this data using multiple REST API calls.
Originally, this service ran on a timer where it would read unprocessed data from the database and process it using multiple threads limited using SemaphoreSlim. This worked well except that the database read had to wait for all processing to finish before reading again.
ServicePointManager.DefaultConnectionLimit = 10;
Original that works:
// Runs every 5 seconds on a timer
private void ProcessTimer_Elapsed(object sender, ElapsedEventArgs e)
{
var hasLock = false;
try
{
Monitor.TryEnter(timerLock, ref hasLock);
if (hasLock)
{
ProcessNewData();
}
else
{
log.Info("Failed to acquire lock for timer."); // This happens all of the time
}
}
finally
{
if (hasLock)
{
Monitor.Exit(timerLock);
}
}
}
public void ProcessNewData()
{
var unprocessedItems = GetDatabaseItems();
if (unprocessedItems.Count > 0)
{
var downloadTasks = new Task[unprocessedItems.Count];
var maxThreads = new SemaphoreSlim(semaphoreSlimMinMax, semaphoreSlimMinMax); // semaphoreSlimMinMax = 10 is the max thread count
for (var i = 0; i < unprocessedItems.Count; i++)
{
maxThreads.Wait();
var iClosure = i;
downloadTasks[i] =
Task.Run(async () =>
{
try
{
await ProcessItemsAsync(unprocessedItems[iClosure]);
}
catch (Exception ex)
{
// handle exception
}
finally
{
maxThreads.Release();
}
});
}
Task.WaitAll(downloadTasks);
}
}
To improve efficiency, I rewrote the service to run GetDatabaseItems on a separate thread from the rest, so that there is a ConcurrentDictionary of unprocessed items between them, which GetDatabaseItems fills and ProcessNewData empties.
The problem is that while 10 unprocessed items are sent to ProcessItemsAsync, they are processed two at a time instead of all 10.
The code inside of ProcessItemsAsync calls var response = await client.SendAsync(request); where the delay occurs. All 10 threads make it to this code but come out of it two at a time. None of this code changed between the old version and the new.
Here is the code in the new version that did change:
public void Start()
{
ServicePointManager.DefaultConnectionLimit = maxSimultaneousThreads; // 10
// Start getting unprocessed data
getUnprocessedDataTimer.Interval = getUnprocessedDataInterval; // 5 seconds
getUnprocessedDataTimer.Elapsed += GetUnprocessedData; // writes data into a ConcurrentDictionary
getUnprocessedDataTimer.Start();
cancellationTokenSource = new CancellationTokenSource();
// Create a new thread to process data
Task.Factory.StartNew(() =>
{
try
{
ProcessNewData(cancellationTokenSource.Token);
}
catch (Exception ex)
{
// error handling
}
}, TaskCreationOptions.LongRunning
);
}
private void ProcessNewData(CancellationToken token)
{
// Check if task has been canceled.
while (!token.IsCancellationRequested)
{
if (unprocessedDictionary.Count > 0)
{
try
{
var throttler = new SemaphoreSlim(maxSimultaneousThreads, maxSimultaneousThreads); // maxSimultaneousThreads = 10
var tasks = unprocessedDictionary.Select(async item =>
{
await throttler.WaitAsync(token);
try
{
if (unprocessedDictionary.TryRemove(item.Key, out var value))
{
await ProcessItemsAsync(value);
}
}
catch (Exception ex)
{
// handle error
}
finally
{
throttler.Release();
}
});
Task.WhenAll(tasks).GetAwaiter().GetResult();
}
catch (OperationCanceledException)
{
break;
}
}
Thread.Sleep(1000);
}
}
Environment
.NET Framework 4.7.1
Windows Server 2016
Visual Studio 2019
Attempts to fix:
I tried the following with the same bad result (two await client.SendAsync(request) completing at a time):
Set Max threads and ServicePointManager.DefaultConnectionLimit to 30
Manually create threads using Thread.Start()
Replace async/await pattern with sync HttpClient calls
Call data processing using Task.Run(async () => and Task.WaitAll(downloadTasks);
Replace the new long-running thread for ProcessNewData with a timer
What I want is to run GetUnprocessedData and ProcessNewData concurrently with an HttpClient connection limit of 10 (set in config) so that 10 requests are processed at the same time.
Note: the issue is similar to HttpClient.GetAsync executes only 2 requests at a time?, but the DefaultConnectionLimit is increased and the service runs on a Windows Server. It also creates more than 2 connections when the original code runs.
Update
I went back to the original project to make sure it still worked; it did. I added a new timer to perform some unrelated operations and the HttpClient issue came back. I removed the timer and everything worked. I added a new thread to do parallel processing and the problem came back.
This is not a direct answer to your question, but a suggestion for simplifying your service that could make the debugging of any problem easier. My suggestion is to implement the producer-consumer pattern using an iterator for producing the unprocessed items, and a parallel loop for consuming them. Ideally the parallel loop would have async delegates, but since you are targeting the .NET Framework you don't have access to the .NET 6 method Parallel.ForEachAsync. So I will suggest the slightly wasteful approach of using a synchronous parallel loop that blocks threads. You could use either the Parallel.ForEach method, or the PLINQ like in the example below:
private IEnumerable<Item> Iterator(CancellationToken token)
{
while (true)
{
Task delayTask = Task.Delay(5000, token);
foreach (Item item in GetDatabaseItems()) yield return item;
delayTask.GetAwaiter().GetResult();
}
}
public void Start()
{
//...
ThreadPool.SetMinThreads(degreeOfParallelism, Environment.ProcessorCount);
new Thread(() =>
{
try
{
Partitioner
.Create(Iterator(token), EnumerablePartitionerOptions.NoBuffering)
.AsParallel()
.WithDegreeOfParallelism(degreeOfParallelism)
.WithCancellation(token)
.ForAll(item => ProcessItemAsync(item).GetAwaiter().GetResult());
}
catch (OperationCanceledException) { } // Ignore
}).Start();
}
The Iterator fetches unprocessed items from the database in batches, and yields them one by one. The database won't be hit more frequently than once every 5 seconds.
The PLINQ query is going to fetch a new item from the Iterator each time it has a worker available, according to the WithDegreeOfParallelism policy. The setting EnumerablePartitionerOptions.NoBuffering ensures that it won't try to fetch more items in advance.
The ThreadPool.SetMinThreads call is used to boost the availability of ThreadPool threads, since PLINQ is going to use lots of them. Without it the ThreadPool will not be able to satisfy the demand immediately, although it will gradually inject more threads and eventually catch up. But since you already know how many threads you'll need, you can configure the ThreadPool from the start.
In case you dislike the idea of blocking threads, you can find a simple substitute for Parallel.ForEachAsync here, based on the TPL Dataflow library. It requires installing a NuGet package.
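For reference, a minimal sketch of such a Dataflow-based substitute, reusing the Iterator, ProcessItemsAsync and maxSimultaneousThreads members from this thread (the option values and method shape are illustrative assumptions, not the linked implementation):
// requires the System.Threading.Tasks.Dataflow NuGet package
private async Task ProcessNewDataAsync(CancellationToken token)
{
    var block = new ActionBlock<Item>(
        async item =>
        {
            try { await ProcessItemsAsync(item); }
            catch (Exception ex) { /* handle error */ }
        },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxSimultaneousThreads, // e.g. 10
            BoundedCapacity = maxSimultaneousThreads,        // makes SendAsync wait while the block is saturated
            CancellationToken = token
        });

    foreach (Item item in Iterator(token))
        await block.SendAsync(item, token);

    block.Complete();
    await block.Completion;
}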
The issue turned out to be the place where ServicePointManager.DefaultConnectionLimit is set.
In the version where HttpClient was only doing two requests at a time, ServicePointManager.DefaultConnectionLimit was being set before the threads were being created but after the HttpClient was initialized.
Once I moved it into the constructor before the HttpClient is initialized, everything started working.
Thank you very much to @Theodor Zoulias for the help.
TL;DR: Set ServicePointManager.DefaultConnectionLimit before initializing the HttpClient.
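A minimal sketch of the ordering that worked (the class and field names here are illustrative):
public class DataProcessingService
{
    private readonly HttpClient httpClient;

    public DataProcessingService()
    {
        // Set the connection limit before the HttpClient is initialized, as described above.
        ServicePointManager.DefaultConnectionLimit = 10;

        httpClient = new HttpClient();
    }
}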
I'm using FileSystemWatcher to catch every created, changed, deleted, and renamed event for any file in a folder.
For these changes I need to compute a simple checksum of the contents of each file. I just open a FileStream and pass it to the MD5 class:
private byte[] calculateChecksum(string frl)
{
using (FileStream stream = File.Open(frl, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
return this.md5.ComputeHash(stream);
}
}
The problem is the number of files I need to handle. For example, imagine that 200 files were created over time in a folder, and then I copy all of them and paste them into the same folder. This action is going to cause 200 events and 200 calculateChecksum() calls.
How can I solve this kind of problem?
In the FileSystemWatcher handler, put tasks into a queue that is processed by a worker. The worker can process the checksum tasks at a targeted speed and/or frequency. A single worker is probably better, because many concurrent readers can slow down the HDD with read seeks.
Try reading about BlockingCollection:
https://msdn.microsoft.com/ru-ru/library/dd997371(v=vs.110).aspx
and Producer-Consumer Dataflow Pattern
https://msdn.microsoft.com/ru-ru/library/hh228601(v=vs.110).aspx
var workerCount = 2;
BlockingCollection<string>[] filesQueues = new BlockingCollection<string>[workerCount];
for (int i = 0; i < workerCount; i++)
{
    filesQueues[i] = new BlockingCollection<string>(500);
    int queueId = i; // capture a copy of the loop variable for the lambda
    // Worker
    Task.Run(() =>
    {
        while (!filesQueues[queueId].IsCompleted)
        {
            string url = null;
            try
            {
                url = filesQueues[queueId].Take();
            }
            catch (InvalidOperationException) { } // thrown once the collection has been completed
            if (!string.IsNullOrWhiteSpace(url))
            {
                calculateChecksum(url);
            }
        }
    });
}
// inside of FileSystemWatcher handler
var queueIndex = Math.Abs(fileName.GetHashCode()) % workerCount;
// Warning!!
// Add blocks if the queue already holds 500 items (its bounded capacity)
filesQueues[queueIndex].Add(fileName);
// when shutting down (not per file), complete the queues so the workers exit
foreach (var queue in filesQueues) queue.CompleteAdding();
Also, you can make multiple consumers; just call Take or TryTake concurrently - each item will only be consumed by a single consumer. But take into account that in that case files will be read by several workers at once, and multiple concurrent HDD readers can slow the disk down.
Update: in the case of multiple workers, it would be better to make multiple BlockingCollections and push each file into the queue whose index is derived from the file name, as in the code above.
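A short sketch of that shared-queue variant, reusing the calculateChecksum method from the question: several workers drain one bounded BlockingCollection, and each file is taken by exactly one of them.
var sharedQueue = new BlockingCollection<string>(500);

for (int w = 0; w < 2; w++)
{
    Task.Run(() =>
    {
        // GetConsumingEnumerable blocks until items arrive and ends once CompleteAdding is called.
        foreach (string file in sharedQueue.GetConsumingEnumerable())
            calculateChecksum(file);
    });
}

// inside the FileSystemWatcher handler:
sharedQueue.Add(e.FullPath);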
I've sketched a consumer-producer pattern to solve this, and I've tried to use a thread pool to smooth out the large amount of work, sharing a BlockingCollection.
BlockingCollection & ThreadPool:
private BlockingCollection<Index.ResourceIndexDocument> documents;
this.pool = new SmartThreadPool(SmartThreadPool.DefaultIdleTimeout, 4);
this.documents = new BlockingCollection<Index.ResourceIndexDocument>();
As you can see, I've created a thread pool with concurrency set to 4, so only 4 threads will work at the same time regardless of whether there are more than 4 work units queued in the pool.
Producer:
public void warn(string channel, string frl)
{
this.pool.QueueWorkItem<string, string>(
(file) => this.files.Add(file),
channel,
frl
);
}
Consumer:
Task.Factory.StartNew(() =>
{
Index.ResourceIndexDocument document = null;
while (this.documents.TryTake(out document, TimeSpan.FromSeconds(1)))
{
IEnumerable<Index.ResourceIndexDocument> documents = this.documents.Take(this.documents.Count);
Index.IndexEngine.Instance.index(documents);
}
},
TaskCreationOptions.LongRunning
);
I am trying to create an application that downloads images from a website using multiple threads, as an introduction to threading (I've never used threading properly before).
But currently it seems to create 1000+ threads and I am not sure where they are coming from.
I first queue a work item onto the thread pool; for starters I only have one job in the Jobs array:
foreach (Job j in Jobs)
{
ThreadPool.QueueUserWorkItem(Download, j);
}
This starts void Download(object obj) on a pool thread, where it loops through a certain number of pages (images needed / 42 images per page):
for (var i = 0; i < pages; i++)
{
var downloadLink = new System.Uri("http://www." + j.Provider.ToString() + "/index.php?page=post&s=list&tags=" + j.Tags + "&pid=" + i * 42);
using (var wc = new WebClient())
{
try
{
wc.DownloadStringAsync(downloadLink);
wc.DownloadStringCompleted += (sender, e) =>
{
response = e.Result;
ProcessPage(response, false, j);
};
}
catch (System.Exception e)
{
// Unity editor equivalent of console.writeline
Debug.Log(e);
}
}
}
Correct me if I am wrong, but the next method gets called on the same thread:
void ProcessPage(string response, bool secondPass, Job j)
{
var wc = new WebClient();
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
And finally it passes on to the last function, which downloads the actual image:
bool DownloadImage(string target, int id)
{
var url = new System.Uri(target);
var fi = new System.IO.FileInfo(url.AbsolutePath);
var ext = fi.Extension;
if (!string.IsNullOrEmpty(ext))
{
using (var wc = new WebClient())
{
try
{
wc.DownloadFileAsync(url, id + ext);
return true;
}
catch(System.Exception e)
{
if (DEBUG) Debug.Log(e);
}
}
}
else
{
Debug.Log("Returned Without a extension: " + url + " || " + fi.FullName);
return false;
}
return true;
}
I am not sure how I am starting this many threads, but would love to know.
Edit
The goal of this program is to download the different jobs in Jobs at the same time (a maximum of 5), each downloading a maximum of 42 images at a time, so at most 210 images should be downloading at any given moment.
First of all, how did you measure the thread count? Why do you think you have thousands of them in your application? You are using the ThreadPool, so you don't create them yourself, and the ThreadPool wouldn't create such a great number of them for its needs.
Second, you are mixing synchronous and asynchronous operations in your code. As you can't use the TPL and async/await, let's go through your code and count the units of work you are creating so you can minimize them. After you do this, the number of queued items in the ThreadPool will decrease and your application will gain the performance you need.
You don't call the ThreadPool.SetMaxThreads method in your application, so, according to MSDN:
Maximum Number of Thread Pool Threads
The number of operations that can be queued to the thread pool is limited only by available memory;
however, the thread pool limits the number of threads that can be
active in the process simultaneously. By default, the limit is 25
worker threads per CPU and 1,000 I/O completion threads.
So you should set the maximum to 5.
I can't find a place in your code where you check the limit of 42 images per Job; you are only incrementing the value in the ProcessPage method.
Check the ManagedThreadId in the handler for WebClient.DownloadStringCompleted - does it execute on a different thread or not?
You are already adding an item to the ThreadPool queue, so why are you using the asynchronous operation for downloading? Use a synchronous overload, like this:
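For example, something along these lines (note that SetMaxThreads returns false and leaves the limits unchanged if the requested value is below the processor count or the current minimum):
int workerThreads, completionPortThreads;
ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);
ThreadPool.SetMaxThreads(5, completionPortThreads);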
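For example, a quick check using the Debug.Log from the question:
wc.DownloadStringCompleted += (sender, e) =>
{
    Debug.Log("Completed on thread " + System.Threading.Thread.CurrentThread.ManagedThreadId);
    // ...
};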
ProcessPage(wc.DownloadString(downloadLink), false, j);
This will not create another item in the ThreadPool queue, and you won't have a synchronization context switch here.
In ProcessPage, your wc variable is never disposed, so you aren't freeing all your resources here. Add a using statement:
void ProcessPage(string response, bool secondPass, Job j)
{
using (var wc = new WebClient())
{
LinkItem[] linkResponse = LinkFinder.Find(response).ToArray();
foreach (LinkItem i in linkResponse)
{
if (secondPass)
{
if (string.IsNullOrEmpty(i.Href))
continue;
else if (i.Href.Contains("http://loreipsum."))
{
if (DownloadImage(i.Href, ID(i.Href)))
j.Downloaded++;
}
}
else
{
if (i.Href.Contains(";id="))
{
var alterResponse = wc.DownloadString("http://www." + j.Provider.ToString() + "/index.php?page=post&s=view&id=" + ID(i.Href));
ProcessPage(alterResponse, true, j);
}
}
}
}
}
In the DownloadImage method you also use an asynchronous download. This also adds an item to the ThreadPool queue, and I think you can avoid this and use the synchronous overload too:
wc.DownloadFile(url, id + ext);
return true;
So, in general, avoid the context-switching operations and dispose your resources properly.
Your wc WebClient will go out of scope and may be garbage collected before the async callback fires. Also, on all async calls you have to allow for both the immediate return and the actual delegate's return, so ProcessPage will have to be in two places. Also, the j in the original loop may be going out of scope depending on where Download in the original loop is declared.
At work one of our processes uses a SQL database table as a queue. I've been designing a queue reader to check the table for queued work, update the row status when work starts, and delete the row when the work is finished. I'm using Parallel.Foreach to give each process its own thread and setting MaxDegreeOfParallelism to 4.
When the queue reader starts up, it checks for any unfinished work and loads it into a list, then it Concats that list with a method that returns an IEnumerable running an infinite loop that checks for new work to do. The idea is that the unfinished work should be processed first and then the new work can be picked up as threads become available. However, what I'm seeing is that FetchQueuedWork changes dozens of rows in the queue table to 'Processing' immediately but only works on a few items at a time.
What I expected to happen was that FetchQueuedWork would only get new work and update the table when a slot opened up in the Parallel.Foreach. What's really odd to me is that it behaves exactly as I would expect when I run the code in my local developer environment, but in production I get the above problem.
I'm using .Net 4. Here is the code:
public void Go()
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = 4 }, DoWork);
}
private IEnumerable<WorkData> FetchQueuedWork()
{
while (true)
{
var workUnit = WorkData.GetQueuedWorkAndSetStatusToProcessing();
yield return workUnit;
}
}
private void DoWork(WorkData workUnit)
{
if (!workUnit.Loaded)
{
System.Threading.Thread.Sleep(5000);
return;
}
Work();
}
I suspect that the default (Release mode?) behaviour is to buffer the input. You might need to create your own partitioner and pass it the NoBuffering option:
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
var partitioner = Partitioner.Create(work, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, options, DoWork);
Blorgbeard's solution is correct when it comes to .NET 4.5 - hands down.
If you are constrained to .NET 4, you have a few options:
Replace your Parallel.ForEach with work.AsParallel().WithDegreeOfParallelism(4).ForAll(DoWork). PLINQ is more conservative when it comes to buffering items, so this should do the trick (a sketch of this option follows the code below).
Write your own enumerable partitioner (good luck).
Create a grotty semaphore-based hack such as this:
(Side-effecting Select used for the sake of brevity)
public void Go()
{
const int MAX_DEGREE_PARALLELISM = 4;
using (var semaphore = new SemaphoreSlim(MAX_DEGREE_PARALLELISM, MAX_DEGREE_PARALLELISM))
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork
.Concat(FetchQueuedWork())
.Select(w =>
{
// Side-effect: bad practice, but easier
// than writing your own IEnumerable.
semaphore.Wait();
return w;
});
// You still need to specify MaxDegreeOfParallelism
// here so as not to saturate your thread pool when
// Parallel.ForEach's load balancer kicks in.
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = MAX_DEGREE_PARALLELISM }, workUnit =>
{
try
{
this.DoWork(workUnit);
}
finally
{
semaphore.Release();
}
});
}
}
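A sketch of the first (PLINQ) option, assuming the same Go, FetchQueuedWork and DoWork members as in the question:
public void Go()
{
    List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();

    unfinishedWork
        .Concat(FetchQueuedWork())
        .AsParallel()
        .WithDegreeOfParallelism(4)
        .ForAll(DoWork);
}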
I am having this issue when I ran something like this:
Parallel.ForEach(dataTable.AsEnumerable(), row =>
{
//do processing
});
Assume there are 500+ records, say 870. Once Parallel.ForEach completes about 850, it seems to run sequentially, i.e. only one operation at a time. It completed 850 operations very fast, but when it gets close to the end of the iteration it becomes very slow and seems to perform like a regular foreach. I even tried with 2000 records.
Is something wrong in my code? Please give suggestions.
Below is the code I am using
Sorry I just posted the wrong example. This is the correct code:
Task newTask = Task.Factory.StartNew(() =>
{
Parallel.ForEach(dtResult.AsEnumerable(), dr =>
{
string extractQuery = "";
string downLoadFileFullName = "";
lock (foreachObject)
{
string fileName = extractorConfig.EncodeFileName(dr);
extractQuery = extractorConfig.GetExtractQuery(dr);
if (string.IsNullOrEmpty(extractQuery)) throw new Exception("Extract Query not found. Please check the configuration");
string newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
//create folder if it doesn't exist
if (!Directory.Exists(newDownLoadPath)) Directory.CreateDirectory(newDownLoadPath);
downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
}
Interlocked.Increment(ref index);
ExtractorClass util = new ExtractorClass(SourceDbConnStr);
util.LoadToFile(extractQuery, downLoadFileFullName);
Interlocked.Increment(ref uiTimerIndex);
});
});
My guess:
This looks to have a high degree of potential IO from:
Database+Disk
Network communication to DB and back
Writing results to disk
Therefore a lot of time is going to be spent waiting for IO. My guess is that the waiting is only getting worse as more threads are being added to the mix and IO is being further stressed. For instance a disk only has one set of heads, so you cannot write to it concurrently. If you have a large number of threads trying to write concurrently, performance degrades.
Try limiting the maximum number of threads you are using:
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
//Do stuff
});
Update
After your code edit, I would suggest the following which has a couple of changes:
Reduce maximum number of threads - this can be experimented with.
Only perform directory check and creation once.
Code:
private static bool isDirectoryCreated;
//...
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
string fileName, extractQuery, newDownLoadPath;
lock (foreachObject)
{
fileName = extractorConfig.EncodeFileName(dr);
extractQuery = extractorConfig.GetExtractQuery(dr);
if (string.IsNullOrEmpty(extractQuery))
throw new Exception("Extract Query not found. Please check the configuration");
newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
if (!isDirectoryCreated)
{
if (!Directory.Exists(newDownLoadPath))
Directory.CreateDirectory(newDownLoadPath);
isDirectoryCreated = true;
}
}
string downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
Interlocked.Increment(ref index);
ExtractorClass util = new ExtractorClass(SourceDbConnStr);
util.LoadToFile(extractQuery, downLoadFileFullName);
Interlocked.Increment(ref uiTimerIndex);
});
It’s hard to give details without the relevant code but in general this is the expected behaviour. .NET tries to schedule the tasks such that every processor is evenly busy.
But this can only ever be approximated since not all of the tasks take the same amount of time. At the end some processors will be done working and some won't, and re-distributing the work is costly and not always beneficial.
I don't know the details of the load balancing used by PLINQ, but the bottom line is that this behaviour can never be fully prevented.
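If part of the tail-end slowdown comes from static range partitioning handing each worker a large fixed chunk up front, one hedged mitigation is to wrap the source in a load-balancing partitioner so items are handed out in small batches on demand:
// Partitioner lives in System.Collections.Concurrent;
// true = dynamic (load-balancing) partitioning instead of static range partitioning
var partitioner = Partitioner.Create(dtResult.AsEnumerable().ToList(), true);

Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = 2 }, dr =>
{
    // ... process dr as in the code above ...
});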