TPL vs Multithreading - C#

I am new to threading and need clarification on the scenario below.
I am working on Apple push notification services. My application needs to send notifications to 30k users whenever a new deal is added to the website.
Can I split the 30k users into lists of 1,000 users each and start multiple threads, or should I use tasks?
Is the following approach efficient?
if (lstDevice.Count > 0)
{
    for (int i = 0; i < lstDevice.Count; i += 2)
    {
        splitList.Add(lstDevice.Skip(i).Take(2).ToList<DeviceHelper>());
    }

    var tasks = new Task[splitList.Count];
    int count = 0;
    foreach (List<DeviceHelper> lst in splitList)
    {
        tasks[count] = Task.Factory.StartNew(() =>
        {
            QueueNotifications(lst, pMessage, pSubject, pNotificationType, push);
        },
        TaskCreationOptions.None);
        count++;
    }
}
The QueueNotifications method just loops through the list it is given and creates a payload for each item:
foreach (DeviceHelper device in lst) // the list passed into QueueNotifications
{
    if (device.PlatformType.ToLower() == "ios")
    {
        push.QueueNotification(new AppleNotification()
            .ForDeviceToken(device.DeviceToken)
            .WithAlert(pMessage)
            .WithBadge(device.Badge)
        );
        Console.Write("Waiting for Queue to Finish...");
    }
}
push.StopAllServices(true);

Technically it is certainly possible to split a list and start threads that process the chunks in parallel. You can also implement everything yourself, as you have already done, but that is not a good approach: splitting a list into chunks that get processed in parallel is exactly what Parallel.For and Parallel.ForEach already do. There is no need to re-implement everything yourself.
Now, you keep asking whether something can run 300 or 500 notifications in parallel. That is actually not a good question, because it misses the point of running something in parallel.
So let me explain why. First, ask yourself why you want to run something in parallel at all. The answer is: you want it to run faster by using multiple CPU cores.
Your intuition is probably that spawning 300 or 500 threads is faster, because more threads means more things running "in parallel". But that is not the case.
First of all, creating a thread is not free. Every thread you create has some overhead: it takes CPU time to create, and it needs memory. On top of that, creating 300 threads does not mean 300 threads run in parallel. If you have, for example, an 8-core CPU, only 8 threads can truly run in parallel. Creating more threads can even hurt performance, because the program must constantly switch between threads, which also costs CPU time.
The upshot of all this: if the per-item work is lightweight code that doesn't do a lot of computation, creating lots of threads will slow your application down instead of speeding it up, because managing the threads creates more overhead than simply running the work on (for example) 8 CPU cores.
That means with a list of 30,000 of something, it usually ends up faster to split the list into 8 chunks and work through it on 8 threads than to create 300 threads.
Your goal should never be: can it run xxx things in parallel?
The question should be: how many threads do I need, and how many items should each thread process, to get my work done as fast as possible?
That is an important difference, because just spawning more threads doesn't mean anything ends up being fast.
So how many threads do you need, and how many items should each thread process? Well, you can write a lot of code to test it, but the optimum changes from machine to machine. A PC with just 4 cores has a different optimum than a system with 8 cores. And if what you are doing is I/O-bound (for example, reading/writing to disk or network), you also don't get more speed by adding threads.
So what you could do now is test everything, try to get the correct thread count, and do a lot of benchmarking to find the best numbers.
But that is exactly what the TPL already does for you with the Task class. The TPL looks at how many CPU cores your computer has, and when you run your tasks it automatically tries to create as many threads as are needed to get the maximum out of your system.
So my suggestion is that you use the TPL with the Task class, or let Parallel.ForEach do the partitioning for you. In my opinion you should never create threads directly yourself or do the partitioning yourself, because all of that is already done in the TPL.
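As an illustration, here is a minimal sketch of that suggestion applied to the question's code. QueueNotificationForDevice is a hypothetical per-device version of the QueueNotifications method; the other names come from the question:

// Sketch only: let the TPL partition the 30k devices itself instead of
// splitting the list by hand and counting threads.
Parallel.ForEach(lstDevice, device =>
{
    QueueNotificationForDevice(device, pMessage, pSubject, pNotificationType, push);
});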

I think the Task class is a good choice for your goal, because it gives you easy control over the async process and you don't have to deal with threads directly.
Maybe this helps: Task vs Thread differences
But to get a better answer, you should improve your question and give us more details.
You should be careful about creating too many parallel threads, because that can slow down your application. Read this nice article from SO: How many threads is too many? The best approach is to make the thread count configurable and then test some values.

I agree that Task is a good choice; however, creating too many tasks also brings risks to your system, and how you handle failures is also a factor in the design. Personally, I prefer a message queue (MSMQ) combined with a thread pool.

If you want to parallelize the creation of the push notifications and maximize performance by using all the CPUs on the computer, you should use Parallel.ForEach:
Parallel.ForEach(
    devices,
    device =>
    {
        if (device.PlatformType.ToUpperInvariant() == "IOS")
        {
            push.QueueNotification(
                new AppleNotification()
                    .ForDeviceToken(device.DeviceToken)
                    .WithAlert(message)
                    .WithBadge(device.Badge));
        }
    });
push.StopAllServices(true);
This assumes that calling push.QueueNotification is thread-safe. Also, if this call locks a shared resource you may see lower than expected performance because of lock contention.
To avoid this lock contention you may be able to create a separate queue for each partition that Parallel.ForEach creates. I am improvising a bit here because some details are missing from the question; I assume that the variable push is an instance of a type named Push. Note that the final argument is the localFinally delegate, which runs once per partition:
Parallel.ForEach(
    devices,
    () => new Push(), // create one Push instance per partition
    (device, _, push) =>
    {
        if (device.PlatformType.ToUpperInvariant() == "IOS")
        {
            push.QueueNotification(
                new AppleNotification()
                    .ForDeviceToken(device.DeviceToken)
                    .WithAlert(message)
                    .WithBadge(device.Badge));
        }
        return push;
    },
    push => push.StopAllServices(true)); // called once per partition when it completes
This will create a separate Push instance for each partition that Parallel.ForEach creates and when the partition is complete it will call StopAllServices on the instance.
This approach should perform no worse than splitting the devices into N lists, where N is the number of CPUs, and starting either N threads or N tasks to process each list. If one thread or task "gets behind", the total execution time becomes the execution time of that slow thread or task. With Parallel.ForEach, all CPUs are used until all devices have been processed.

Related

What does the Parallel.Foreach do behind the scenes?

So I just can't grasp the concept here.
I have a method that uses the Parallel class with the ForEach method.
But the thing I don't understand is: does it create new threads so it can run the function faster?
Let's take this as an example.
I do a normal foreach loop.
private static void DoSimpleWork()
{
    foreach (var item in collection)
    {
        //DoWork();
    }
}
What that will do is take the first item in the list, run DoWork() for it, and wait until it finishes. Simple, plain, and it works.
Now... there are three cases I am curious about.
If I do this.
Parallel.ForEach(stringList, simpleString =>
{
    DoMagic(simpleString);
});
Will that split up the ForEach into, let's say, 4 chunks?
So what I think is happening is that it takes the first 4 strings in the list, assigns each string to a "thread" (assuming Parallel creates 4 threads), does the work, and then starts on the next 4 in the list?
If that is wrong, please correct me; I really want to understand how this works.
And then we have this, which is essentially the same but with a new parameter:
Parallel.ForEach(stringList, new ParallelOptions() { MaxDegreeOfParallelism = 32 }, simpleString =>
{
    DoMagic(simpleString);
});
What I am curious about is this
new ParallelOptions() { MaxDegreeOfParallelism = 32 }
Does that mean it will take the first 32 strings from that list (if there even are that many in the list) and then do the same thing as I was talking about above?
And for the last one.
Task.Factory.StartNew(() =>
{
    Parallel.ForEach(stringList, simpleString =>
    {
        DoMagic(simpleString);
    });
});
Would that create a new task, assigning each "chunk" to its own task?
Do not mix async code with parallel code. Task is for async operations: querying a DB, reading a file, awaiting some comparatively cheap operation so that your UI won't be blocked and unresponsive.
Parallel is different. It is designed for (1) multi-core systems and (2) computation-intensive operations. I won't go into the details of how it works; that kind of information can be found in the MS documentation. Long story short, Parallel.For will most probably make its own decision about what exactly to run, when, and how. It might disobey your parameters, such as MaxDegreeOfParallelism. The whole idea is to provide the best possible parallelization and thus complete your operation as fast as possible.
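To make the distinction concrete, here is a minimal sketch; the file path and the transform are placeholders:

using System;
using System.IO;
using System.Threading.Tasks;

class AsyncVsParallelDemo
{
    // I/O-bound: use a Task so the caller isn't blocked while waiting on the disk.
    static async Task<string> ReadFileAsync(string path)
    {
        using (var reader = new StreamReader(path))
            return await reader.ReadToEndAsync();
    }

    // CPU-bound: use Parallel.For to spread the computation across cores.
    static void TransformAll(double[] items)
    {
        Parallel.For(0, items.Length, i => items[i] = Math.Sqrt(items[i]) * Math.PI);
    }
}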
Parallel.ForEach performs the equivalent of a C# foreach loop, but with the iterations executing in parallel instead of sequentially. There is no guaranteed ordering; whether an iteration runs immediately depends on whether the scheduler can find an available thread.
MaxDegreeOfParallelism
By default, For and ForEach will use as many threads as the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks the application will use.
You do not need to modify this parameter in general, but you may choose to change it in advanced scenarios (a sketch follows this list):

- When you know that a particular algorithm you're using won't scale beyond a certain number of cores. You can set the property to avoid wasting cycles on additional cores.
- When you're running multiple algorithms concurrently and want to manually define how much of the system each algorithm can utilize.
- When the thread pool's heuristics are unable to determine the right number of threads to use and could end up injecting too many threads. For example, with long-running loop-body iterations the thread pool might not be able to tell the difference between reasonable progress and livelock or deadlock, and might not be able to reclaim threads that were added to improve performance. In that case you can set the property to ensure that you don't use more than a reasonable number of threads.
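A minimal sketch of that option, reusing stringList and DoMagic from the question:

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(stringList, options, simpleString =>
{
    // At most 4 iterations run concurrently; which strings land on which
    // thread, and in what order, is left to the partitioner.
    DoMagic(simpleString);
});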
Task.Factory.StartNew is usually used when you require fine-grained control over a long-running, compute-bound task, and, as @Сергей Боголюбов mentioned, do not mix them up.
It creates a new task, and that task runs the parallel loop asynchronously on thread pool threads.
You may find this ebook useful: http://www.albahari.com/threading/#_Introduction
does the work and then starts with the next 4 in that list?
This depends on your machine's hardware and how busy its cores are with other processes/apps.
Does that mean it will take the first 32 strings from that list (if there even are that many in the list) and then do the same thing as I was talking about above?
No, there is no guarantee that it will take the first 32; it could be fewer. It will vary each time you execute the same code.
Task.Factory.StartNew creates a new task, but it will not create a new one for each chunk as you expect.
Putting a Parallel.ForEach inside a new Task will not further reduce the time taken by the parallel work itself.
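A small sketch of what the wrapping actually buys you, again with the names from the question; the outer task is one unit of work, and the loop inside it still does its own partitioning:

Task outer = Task.Factory.StartNew(() =>
{
    // One task; the loop inside still fans out over thread pool threads.
    Parallel.ForEach(stringList, simpleString => DoMagic(simpleString));
});
outer.Wait(); // without this, the program may move on before the loop finishes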

De-queue Items with worker threads

I have been trying to figure out how to solve a requirement I have, but for the life of me I just can't come up with a solution.
I have a database of items that is stored as a kind of queue.
(The database has already been implemented, and other processes will be adding items to this queue.)
The items require a lot of work/time to "process", so I need to be able to:

- Constantly de-queue items from the database.
- For each item, start a new thread, process the item, and then return true/false depending on whether it was processed successfully (this will be used to decide whether to re-add it to the database queue).
- Only do this while the current number of active threads (one per item being processed) is less than a maximum-number-of-threads parameter.
- Once the maximum number of threads has been reached, stop de-queuing items from the database until the number of active threads drops below the maximum, at which point de-queuing continues.
It feels like this should be something I can come up with but it is just not coming to me.
To clarify: I only need to implement the threading; the database has already been implemented.
One really easy way to do this is with a Semaphore. You have one thread that dequeues items and creates threads to process them. For example:
const int MaxThreads = 4;
Semaphore sem = new Semaphore(MaxThreads, MaxThreads);

while (Queue.HasItems())
{
    sem.WaitOne();
    var item = Queue.Dequeue();
    ThreadPool.QueueUserWorkItem(ProcessItem, item); // see below
}

// When the queue is empty, you have to wait for all processing
// threads to complete. If you can acquire the semaphore MaxThreads
// times, all workers are done.
int count = 0;
while (count < MaxThreads)
{
    sem.WaitOne();
    ++count;
}

// the code to process an item
void ProcessItem(object item)
{
    // cast the item to whatever type you need, and process it.
    // when done processing, release the semaphore
    sem.Release();
}
The above technique works quite well. It's simple to code, easy to understand, and very effective.
One change is that you might want to use the Task API rather than ThreadPool.QueueUserWorkItem. Task gives you more control over the asynchronous processing, including cancellation. I used QueueUserWorkItem in my example because I'm more familiar with it; I would use Task in a production program.
Although this does use N+1 threads (where N is the number of items you want processed concurrently), that extra thread isn't doing much most of the time. The only time it's running is when it's assigning work to worker threads; otherwise it's doing a non-busy wait on the semaphore.
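For reference, a rough Task-based sketch of the same pattern, assuming the same Queue and ProcessItem as above (SemaphoreSlim and Task.Run require .NET 4.5):

const int MaxConcurrent = 4;
var slots = new SemaphoreSlim(MaxConcurrent, MaxConcurrent);
var running = new List<Task>();

while (Queue.HasItems())
{
    slots.Wait();                         // block until a slot is free
    var item = Queue.Dequeue();
    running.Add(Task.Run(() =>
    {
        try { ProcessItem(item); }
        finally { slots.Release(); }      // free the slot even if processing throws
    }));
}
Task.WaitAll(running.ToArray());          // wait for all workers to complete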
Do you just not know where to start?
Consider a thread pool with a max number of threads. http://msdn.microsoft.com/en-us/library/y5htx827.aspx
Consider spinning up your max number of threads immediately and monitoring the DB. http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem.aspx is convenient.
Remember that you can't guarantee your process will end safely; crashes happen. Consider logging the processing state.
Remember that your select and remove-from-queue operations should be atomic.
Ok, so the architecture of the solution is going to depend on one thing: does the processing time per queue item vary according to the item's data?
If not, then you can have something that merely round-robins between the processing threads. This will be fairly simple to implement.
If the processing time does vary, then you're going to need something with more of a "next available" feel to it, so that whichever of your threads happens to be free first gets the job of processing the data item.
Having worked that out, you're then going to have the usual run-around over how to synchronise between the queue reader and the processing threads. The difference between "next-available" and "round-robin" is in how you do that synchronisation.
I'm not overly familiar with C#, but I've heard tell of a beast called a background worker. That is likely to be an acceptable means of bringing this about.
For round-robin, just start up a background worker per queue item, storing the workers' references in an array and limiting yourself to, say, 16 in-progress background workers. The idea is that having started 16, you then wait for the first to complete before starting the 17th, and so on. I believe background workers actually run as jobs on the thread pool, so that will automatically limit the number of threads actually running at any one time to something appropriate for the underlying hardware. To wait for a background worker, see this. Having waited for a background worker to complete, you then handle its result and start another one up.
For the next-available approach it's not so different. Instead of waiting for the first to complete, you use WaitAny() to wait for any of the workers to complete. You handle the return from whichever one completed, then start another one up and go back to WaitAny().
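In .NET terms, a minimal sketch of that next-available loop using tasks, assuming a ConcurrentQueue<Job> named queue and with ProcessItem standing in for the per-item work:

const int MaxWorkers = 16;
var workers = new List<Task>();
Job item;
while (queue.TryDequeue(out item))
{
    if (workers.Count == MaxWorkers)
    {
        // wait for whichever worker finishes first, then reuse its slot
        int finished = Task.WaitAny(workers.ToArray());
        workers.RemoveAt(finished);
    }
    var current = item;                   // capture the current item for the lambda
    workers.Add(Task.Run(() => ProcessItem(current)));
}
Task.WaitAll(workers.ToArray());          // drain the remaining workers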
The general philosophy of both approaches is to keep a number of threads on the boil all the time. A feature of the next-available approach is that the order in which results are emitted is not necessarily the same as the order of the input items. If that matters, then the round-robin approach with more background workers than CPU cores will be reasonably efficient (the thread pool will just hold commissioned but not-yet-running workers anyway), though latency will vary with the processing time.
BTW, 16 is an arbitrary number, chosen based on how many cores you expect on the PC running the software. More cores, bigger number.
Of course, in the seemingly restless and ever changing world of .NET there may now be a better way of doing this.
Good luck!

Parallel.For vs regular threads

I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx)
{
    Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
    startIdx = endIdx;
    endIdx = jobs.Count;
}
This means the Parallel.For needs to synchronize multiple times. Consider a breadth-first graph algorithm; the number of synchronizations would be quite large. A waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job>();
queue.Enqueue(firstJob);

var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
    threads.Add(new Thread(new ThreadStart(delegate
    {
        while (!queue.IsEmpty || numBusy > 0)
        {
            if (queue.IsEmpty)
                // numBusy > 0 implies more data may arrive
                waitHandle.WaitOne();

            Job job;
            if (queue.TryDequeue(out job))
            {
                Interlocked.Increment(ref numBusy);
                WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
                Interlocked.Decrement(ref numBusy);
            }
        }
        // others are possibly waiting for us to enable more work, which won't happen
        waitHandle.Set();
    })));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I can't comprehend is that it's even faster as well! Is the task scheduler just that good? The synchronizations were eliminated, there's no busy waiting, and yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers; I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
Parallel.ForEach() uses a limited number of threads and re-uses them. The second sample starts way behind by having to create a number of threads first; that takes time.
And what is the value of maxThreads? That is very critical; in Parallel.ForEach() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And the TPL uses work-stealing and other adaptive techniques. You'll have a hard time doing any better.
Parallel.For doesn't actually break the items into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. It then has each thread synchronously process its batch (possibly using work-stealing or saving some extra items to load-balance near the end). With this approach the worker threads virtually never wait on each other, while your threads constantly wait on each other due to the heavy synchronization you're doing before and after every single iteration.
On top of that, since it uses thread pool threads, many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least within its own code).
Then of course there is the issue of the number of threads. The thread pool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at a given instant, based on the current hardware, the load from other applications, and so on. It's possible that you're using too many threads, or not enough.
Also, since the number of items isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is designed exactly for the situation you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();

//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run

Parallel.ForEach(queue.GetConsumingEnumerable(),
                 job => job.DoWork());
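The producer side of that sketch might look like this (WorkArrives is a placeholder for however new jobs show up in your scenario):

Task producer = Task.Run(() =>
{
    foreach (Job job in WorkArrives())    // placeholder job source
        queue.Add(job);
    queue.CompleteAdding();               // lets GetConsumingEnumerable() finish
});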
You're creating a bunch of new threads, while Parallel.For uses the thread pool. You'd see better performance if you used the C# thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.

Performance of running Parallel.Foreach on several threads

I have 3 main processing threads, each of them performing operations on the values of ConcurrentDictionaries by means of Parallel.ForEach. The dictionaries vary in size from 1,000 to 250,000 elements:
TaskFactory factory = new TaskFactory();
Task t1 = factory.StartNew(() =>
{
    Parallel.ForEach(dict1.Values, item => ProcessItem(item));
});
Task t2 = factory.StartNew(() =>
{
    Parallel.ForEach(dict2.Values, item => ProcessItem(item));
});
Task t3 = factory.StartNew(() =>
{
    Parallel.ForEach(dict3.Values, item => ProcessItem(item));
});
t1.Wait();
t2.Wait();
t3.Wait();
I compared the total execution time of this construct with just running the three Parallel.ForEach calls one after another on the main thread, and the main-thread version performed much better (its execution time was roughly 5 times shorter).
My questions are:

- Is there something wrong with the approach above? If yes, what is it and how can it be improved?
- What is the reason for the different execution times?
- What is a good way to debug/analyze such a situation?
EDIT: To further clarify the situation: I am mocking the client calls on a WCF service, where each call comes in on a separate thread (hence the Tasks). I also tried ThreadPool.QueueUserWorkItem instead of Task, without a performance improvement. The objects in the dictionaries have between 20 and 200 properties (just decimals and strings) and there is no I/O activity.
I ended up solving the problem by queuing the processing requests in a BlockingCollection and processing them one at a time.
You're probably over-parallelizing.
You don't need to create 3 tasks if you already have good (and balanced) parallelization inside each one of them.
Parallel.ForEach already tries to use the right number of threads to exploit the full CPU potential without saturating it; by creating additional tasks that each run a Parallel.ForEach, you're probably saturating it.
(EDIT: as Henk said, they probably have trouble coordinating the number of threads to spawn when run in parallel, and at the very least this leads to bigger overhead.)
Have a look here for some hints.
First of all, a Task is not a Thread.
Your Parallel.ForEach() calls are run by a scheduler that uses the ThreadPool and should try to optimize thread usage. ForEach applies a Partitioner. When you run these in parallel, they cannot coordinate very well.
Only if there is a performance problem should you consider helping with extra tasks or DegreeOfParallelism directives, and then always profile and analyze first.
An explanation of your results is difficult; it could be caused by many factors (I/O, for example), but the advantage of the "single main task" approach is that the scheduler has more control and the CPU and cache are used better (locality).
The dictionaries vary widely in size and, by the looks of it (given everything finishes in under 5 s), the amount of processing work is small. Without knowing more it's hard to say what's actually going on. How big are your dictionary items? The main-thread scenario you're comparing this to looks like this, right?
Parallel.ForEach(dict1.Values, item => ProcessItem(item));
Parallel.ForEach(dict2.Values, item => ProcessItem(item));
Parallel.ForEach(dict3.Values, item => ProcessItem(item));
By adding the Tasks around each ForEach you're adding more overhead to manage the tasks, and probably causing memory contention as dict1, dict2 and dict3 all try to be in memory and hot in cache at the same time. Remember: CPU cycles are cheap, cache misses are not.

Multi threaded file processing with .NET

There is a folder that contains thousands of small text files. I aim to parse and process all of them while more files are being added to the folder. My intention is to multithread this operation, as the single-threaded prototype took six minutes to process 1,000 files.
I'd like to have reader and writer threads as follows: while the reader threads are reading the files, the writer threads process them. Once a reader starts reading a file, I'd like to mark it as being processed, for example by renaming it; once it's read, rename it to completed.
How do I approach such a multithreaded application?
Is it better to use a distributed hash table or a queue?
Which data structure do I use that would avoid locks?
Is there a better approach to this scheme?
Since there's curiosity in the comments about how .NET 4 handles this, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: this is not a highly scientific analysis, just a demonstration that there's a clear performance benefit; depending on hardware, your mileage may vary widely.
Here's a quick test (if you see a big mistake in this simple test, it's just an example; please comment, and we can fix it to be more useful/accurate). For this, I dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! Be sure to get LINQPad 4, though):
var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");
Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Keep in mind that some collections, for example our handy List<T>, are not thread-safe, so using one in a parallel scenario isn't a good idea :) Luckily, .NET 4 added concurrent collections that are thread-safe. Also keep in mind that if you're using a locking collection, it may become a bottleneck too, depending on the situation.
This uses the .AsParallel<T>(IEnumerable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (an internal class) which implements ParallelQuery<T>. This lets you use the parallel extension methods, in this case .ForAll().
.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of the threads after it's running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
The results (Computer 1 - physical hard disk):

Serial: 1288 - 1333 ms
Parallel: 461 - 503 ms

Computer specs, for comparison:

Quad-core i7 920 @ 2.66 GHz
12 GB RAM (DDR 1333)
300 GB 10k rpm WD VelociRaptor

The results (Computer 2 - solid-state drive):

Serial: 545 - 601 ms
Parallel: 248 - 278 ms

Computer specs, for comparison:

Core 2 Quad Q9100 @ 2.26 GHz
8 GB RAM (DDR 1333)
120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)

I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 laptop (here's a link to the M6500; Dell's own links to the 6400 are broken).
These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the overall min and max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does: it reads, processes, reads, processes, rinse, repeat. With the parallel approach you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see from the results above that we can get a bit more than one going at a time, giving us a healthy boost.
Another disclaimer: quad core + .NET 4 parallelism isn't going to give you four times the performance; it doesn't scale linearly. There are other considerations and bottlenecks in play.
I hope this was of interest in showing the approach and the possible benefits. Feel free to criticize or improve; this answer exists solely for those curious, as indicated in the comments :)
Design
The Producer/Consumer pattern will probably be the most useful for this situation. You should create enough threads to maximize the throughput.
Here are some questions about the Producer/Consumer pattern to give you an idea of how it works:
C# Producer/Consumer pattern
C# producer/consumer
You should use a blocking queue: the producer adds files to the queue while the consumers process files from it. The blocking queue requires no explicit locking on your part, so it's about the most efficient way to solve your problem.
If you're using .NET 4.0 there are several concurrent collections that you can use out of the box:
ConcurrentQueue: http://msdn.microsoft.com/en-us/library/dd267265%28v=VS.100%29.aspx
BlockingCollection: http://msdn.microsoft.com/en-us/library/dd267312%28VS.100%29.aspx
Threading
A single producer thread will probably be the most efficient way to load the files from disk and push them onto the queue; multiple consumers will then pop items off the queue and process them. I would suggest trying 2-4 consumer threads per core and taking some performance measurements to determine the optimum (i.e., the number of threads that gives you maximum throughput). I would not recommend using the ThreadPool for this specific example.
P.S. I don't understand the concern about a single point of failure and the use of distributed hash tables. I know DHTs sound like a really cool thing to use, but I would try the conventional methods first, unless you have a specific problem in mind that you're trying to solve.
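A minimal sketch of the design described above, assuming a ProcessFile method for your parsing logic (Task.Run requires .NET 4.5; on 4.0 use Task.Factory.StartNew):

var fileQueue = new BlockingCollection<string>(boundedCapacity: 100);

// Single producer: enumerate the folder and feed the queue.
var producer = Task.Run(() =>
{
    foreach (var path in Directory.EnumerateFiles(@"C:\folder"))  // hypothetical folder
        fileQueue.Add(path);              // blocks when the queue is full
    fileQueue.CompleteAdding();
});

// Multiple consumers: pop files off the queue and process them.
var consumers = Enumerable.Range(0, Environment.ProcessorCount * 2)
    .Select(_ => Task.Run(() =>
    {
        foreach (var path in fileQueue.GetConsumingEnumerable())
            ProcessFile(path);            // placeholder for the real work
    }))
    .ToArray();

Task.WaitAll(consumers);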
I recommend that you queue a thread for each file and keep track of the running threads in a dictionary, launching a new thread when a thread completes, up to a maximum limit. I prefer to create my own threads when they can be long-running, and to use callbacks to signal when they're done or have encountered an exception. In the sample below I use a dictionary to keep track of the running worker instances, so I can call into an instance if I want to stop work early. Callbacks can also be used to update a UI with progress and throughput. You can also dynamically throttle the running-thread limit for extra points.
The example code is an abbreviated demonstrator, but it does run.
class Program
{
    static void Main(string[] args)
    {
        Supervisor super = new Supervisor();
        super.LaunchWaitingThreads();

        while (!super.Done) { Thread.Sleep(200); }
        Console.WriteLine("\nDone");
        Console.ReadKey();
    }
}

public delegate void StartCallbackDelegate(int idArg, Worker workerArg);
public delegate void DoneCallbackDelegate(int idArg);

public class Supervisor
{
    Queue<Thread> waitingThreads = new Queue<Thread>();
    Dictionary<int, Worker> runningThreads = new Dictionary<int, Worker>();
    int maxThreads = 20;
    object locker = new object();

    public bool Done
    {
        get
        {
            lock (locker)
            {
                return ((waitingThreads.Count == 0) && (runningThreads.Count == 0));
            }
        }
    }

    public Supervisor()
    {
        // queue up a thread for each file
        Directory.GetFiles("C:\\folder").ToList().ForEach(n => waitingThreads.Enqueue(CreateThread(n)));
    }

    Thread CreateThread(string fileNameArg)
    {
        Thread thread = new Thread(new Worker(fileNameArg, WorkerStart, WorkerDone).ProcessFile);
        thread.IsBackground = true;
        return thread;
    }

    // called when a worker starts
    public void WorkerStart(int threadIdArg, Worker workerArg)
    {
        lock (locker)
        {
            // update with worker instance
            runningThreads[threadIdArg] = workerArg;
        }
    }

    // called when a worker finishes
    public void WorkerDone(int threadIdArg)
    {
        lock (locker)
        {
            runningThreads.Remove(threadIdArg);
        }
        Console.WriteLine(string.Format("  Thread {0} done", threadIdArg.ToString()));
        LaunchWaitingThreads();
    }

    // launches workers until max is reached
    public void LaunchWaitingThreads()
    {
        lock (locker)
        {
            while ((runningThreads.Count < maxThreads) && (waitingThreads.Count > 0))
            {
                Thread thread = waitingThreads.Dequeue();
                runningThreads.Add(thread.ManagedThreadId, null); // placeholder so the count is accurate
                thread.Start();
            }
        }
    }
}

public class Worker
{
    string fileName;
    StartCallbackDelegate startCallback;
    DoneCallbackDelegate doneCallback;

    public Worker(string fileNameArg, StartCallbackDelegate startCallbackArg, DoneCallbackDelegate doneCallbackArg)
    {
        fileName = fileNameArg;
        startCallback = startCallbackArg;
        doneCallback = doneCallbackArg;
    }

    public void ProcessFile()
    {
        startCallback(Thread.CurrentThread.ManagedThreadId, this);
        Console.WriteLine(string.Format("Reading file {0} on thread {1}", fileName, Thread.CurrentThread.ManagedThreadId.ToString()));
        File.ReadAllBytes(fileName);
        doneCallback(Thread.CurrentThread.ManagedThreadId);
    }
}
Generally speaking, 1,000 small files (how small, by the way?) should not take six minutes to process. As a quick test, do a find "foobar" * in the directory containing the files (the first argument in quotes doesn't matter; it can be anything) and see how long it takes to process every file. If it takes more than one second, I'll be disappointed.
Assuming this test confirms my suspicion, the process is CPU-bound, and you'll get no improvement from separating the reading into its own thread. You should:

- Figure out why it takes more than 350 ms, on average, to process one small input, and hopefully improve the algorithm.
- If there's no way to speed up the algorithm and you have a multicore machine (almost everyone does, these days), use a thread pool to assign 1,000 tasks the job of reading one file each (see the sketch below).
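A sketch of that second option using the TPL, with ParseFile standing in for the CPU-heavy algorithm:

var files = Directory.GetFiles(@"C:\input");  // hypothetical input folder
Parallel.ForEach(files, file =>
{
    var bytes = File.ReadAllBytes(file);      // the cheap part: reading
    ParseFile(bytes);                         // the expensive part: the algorithm
});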
You could have a central queue; the reader threads would need write access while pushing the in-memory contents onto the queue, and the processing threads would need read access to pop off the next memory stream to be processed. That way you minimize the time spent holding locks and don't have to deal with the complexities of lock-free code.
EDIT: Ideally, you'd handle all exceptions/error conditions (if any) gracefully, so you don't have points of failure.
As an alternative, you can have multiple threads that each "claim" a file by renaming it before processing; the file system then becomes the implementation of locked access. No clue whether this is any more performant than my original answer; only testing would tell.
You might consider a queue of files to process. Populate the queue once by scanning the directory when you start and have the queue updated with a FileSystemWatcher to efficiently add new files to the queue without constantly re-scanning the directory.
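A sketch of that idea (paths only; error handling omitted; C:\incoming is a placeholder):

var pending = new BlockingCollection<string>();

// Seed the queue with the files that already exist.
foreach (var path in Directory.GetFiles(@"C:\incoming"))
    pending.Add(path);

// Let the watcher enqueue new files as they appear, instead of re-scanning.
var watcher = new FileSystemWatcher(@"C:\incoming");
watcher.Created += (sender, e) => pending.Add(e.FullPath);
watcher.EnableRaisingEvents = true;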
If at all possible, read and write to different physical disks. That will give you maximum IO performance.
If you have an initial burst of many files to process, followed by an uneven pace of new files being added, and this all happens on the same disk (read/write), you could consider buffering the processed files in memory until one of two conditions applies:

- There are (temporarily) no new files.
- You have buffered so many files that you don't want to use more memory for buffering (ideally, a configurable threshold).
If your actual processing of the files is CPU-intensive, you could consider having one processing thread per CPU core. However, for "normal" processing, CPU time will be trivial compared to I/O time, and the complexity would not be worth any minor gains.
