How do I spawn threads on different CPU cores? - c#

Let's say I had a program in C# that did something computationally expensive, like encoding a list of WAV files into MP3s. Ordinarily I would encode the files one at a time, but let's say I wanted the program to figure out how many CPU cores I had and spin up an encoding thread on each core. So, when I run the program on a quad-core CPU, it detects the four available cores and spawns four encoding threads, each running on its own separate CPU. How would I do this?
And would this be any different if the cores were spread out across multiple physical CPUs? As in, if I had a machine with two quad core CPUs on it, are there any special considerations or are the eight cores across the two dies considered equal in Windows?

Don't bother doing that.
Instead, use the thread pool. The thread pool is a mechanism (actually a class) of the framework that you can query for a new thread.
When you ask for a new thread, it will either give you one or enqueue the work until a thread gets freed. That way the framework is in charge of deciding whether or not to create more threads, depending on the number of CPUs present.
Edit: In addition, as has already been mentioned, the OS is in charge of distributing the threads among the different CPUs.
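For the WAV-to-MP3 scenario, a minimal sketch of the thread-pool approach could look like the following. EncodeWavToMp3 and the input folder are hypothetical placeholders, and CountdownEvent requires .NET 4; the pool, not you, decides how many threads actually run concurrently:

    using System;
    using System.IO;
    using System.Threading;

    class Encoder
    {
        static void Main()
        {
            string[] files = Directory.GetFiles(@"C:\wavs", "*.wav"); // hypothetical input folder

            using (var done = new CountdownEvent(files.Length))
            {
                foreach (string file in files)
                {
                    // Queue one work item per file; the pool schedules them
                    // across the available cores.
                    ThreadPool.QueueUserWorkItem(state =>
                    {
                        EncodeWavToMp3((string)state); // hypothetical encoder
                        done.Signal();
                    }, file);
                }
                done.Wait(); // block until every file has been encoded
            }
        }

        static void EncodeWavToMp3(string path)
        {
            // placeholder for the actual encoding work
        }
    }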

It is not necessarily as simple as using the thread pool.
By default, the thread pool allocates multiple threads per CPU. Since every thread that gets involved in the work you are doing has a cost (task-switching overhead, use of the CPU's very limited L1, L2 and maybe L3 caches, etc.), the optimal number of threads to use is <= the number of available CPUs - unless each thread is requesting services from other machines, as in a highly scalable web service. In some cases, particularly those which involve more hard disk reading and writing than CPU activity, you can actually be better off with 1 thread than with multiple threads.
For most applications, and certainly for WAV and MP3 encoding, you should limit the number of worker threads to the number of available CPUs. Here is some C# code to find the number of CPUs:
    int processors = 1;
    string processorsStr = System.Environment.GetEnvironmentVariable("NUMBER_OF_PROCESSORS");
    if (processorsStr != null)
        processors = int.Parse(processorsStr);
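On .NET 2.0 and later you can get the same number directly, without parsing the environment variable:

    int processors = System.Environment.ProcessorCount;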
Unfortunately, it is not as simple as limiting yourself to the number of CPUs. You also have to take into account the performance of the hard disk controller(s) and disk(s).
The only way you can really find the optimal number of threads is trial and error. This is particularly true when you are using hard disks, web services and such. With hard disks, you might be better off not using all four cores of your quad-core CPU. On the other hand, with some web services, you might be better off making 10 or even 100 requests per CPU.

Although I agree with most of the answers here, I think it's worth adding a new consideration: SpeedStep technology.
When running a CPU-intensive, single-threaded job on a multi-core system - in my case a Xeon E5-2430 with 6 real cores (12 with HT) under Windows Server 2012 - the job got spread out among all 12 cores, using around 8.33% of each core and never triggering a speed increase. The CPU remained at 1.2 GHz.
When I set the thread affinity to a specific core, it used ~100% of that core, causing the CPU to max out at 2.5 GHz and more than doubling the performance.
This is the program I used, which just loops incrementing a variable. When called with -a, it will set the affinity to core 1. The affinity part was based on this post.
using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

namespace Esquenta
{
    class Program
    {
        private static int numThreads = 1;
        static bool affinity = false;

        static void Main(string[] args)
        {
            if (args.Contains("-a"))
            {
                affinity = true;
            }
            if (args.Length < 1 || !int.TryParse(args[0], out numThreads))
            {
                numThreads = 1;
            }
            Console.WriteLine("numThreads:" + numThreads);
            for (int j = 0; j < numThreads; j++)
            {
                var param = new ParameterizedThreadStart(EsquentaP);
                var thread = new Thread(param);
                thread.Start(j);
            }
        }

        static void EsquentaP(object numero_obj)
        {
            int i = 0;
            DateTime ultimo = DateTime.Now;
            if (affinity)
            {
                Thread.BeginThreadAffinity();
                CurrentThread.ProcessorAffinity = new IntPtr(1);
            }
            try
            {
                while (true)
                {
                    i++;
                    if (i == int.MaxValue)
                    {
                        i = 0;
                        var lps = int.MaxValue / (DateTime.Now - ultimo).TotalSeconds / 1000000;
                        Console.WriteLine("Thread " + numero_obj + " " + lps.ToString("0.000") + " M loops/s");
                        ultimo = DateTime.Now;
                    }
                }
            }
            finally
            {
                Thread.EndThreadAffinity();
            }
        }

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentThreadId();

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentProcessorNumber();

        // Maps the current managed thread to its ProcessThread so that
        // ProcessorAffinity can be set on it.
        private static ProcessThread CurrentThread
        {
            get
            {
                int id = GetCurrentThreadId();
                return Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Single(x => x.Id == id);
            }
        }
    }
}
And the results: the processor speed shown by Task Manager (similar to what CPU-Z reports) stayed at 1.2 GHz without affinity and reached 2.5 GHz with it. (Screenshots omitted.)

In the case of managed threads, the complexity of doing this is a degree greater than for native threads, because CLR threads are not directly tied to a native OS thread: the CLR can switch a managed thread from one native thread to another as it sees fit. The function Thread.BeginThreadAffinity is provided to place a managed thread in lock-step with a native OS thread. At that point, you could experiment with using native APIs to give the underlying native thread processor affinity. As everyone suggests here, this isn't a very good idea. In fact, there is documentation suggesting that threads can receive less processing time if they are restricted to a single processor or core.
You can also explore the System.Diagnostics.Process class. There you can enumerate a process's threads as a collection of ProcessThread objects, and that class lets you set ProcessorAffinity or even a preferred processor (the IdealProcessor property, which is a scheduling hint rather than a hard restriction).
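A minimal sketch of that route (subject to the caveat above: without Thread.BeginThreadAffinity, the CLR may move managed work between native threads):

    using System;
    using System.Diagnostics;

    class ProcessThreadAffinity
    {
        static void Main()
        {
            foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
            {
                pt.ProcessorAffinity = new IntPtr(0x1); // restrict to the first core
                pt.IdealProcessor = 0;                  // the "preferred processor" hint
            }
        }
    }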
Disclaimer: I've experienced a similar problem where I thought the CPU(s) were under-utilized and researched a lot of this stuff; however, based on all that I read, it appeared that it wasn't a very good idea, as evidenced by the comments posted here as well. It's still interesting, though, and a learning experience to experiment with.

You can definitely do this by writing the routine inside your program.
However, you should not try to do it, since the operating system is the best candidate to manage this stuff; a user-mode program should not try to do it.
Sometimes, however, it can be done (by a really advanced user) to achieve load balancing, or even to reproduce a true multi-thread, multi-core problem (data races, cache coherence, ...), since different threads would truly be executing on different processors.
Having said that, if you still want to do it, it can be done in the following way. This is pseudo code for Windows, but the same could easily be done on Linux as well.
    #define MAX_CORE 256
    processor_mask[MAX_CORE] = {0};
    core_number = 0;

    Call GetLogicalProcessorInformation();
    // From here we calculate core_number and populate the processor_mask[] array,
    // which is used below to make different threads run on different cores.

    for (j = 0; j < THREAD_POOL_SIZE; j++)
        Call SetThreadAffinityMask(hThread[j], processor_mask[j % core_number]);
    // hThread is the array of thread handles.
    // Taking j modulo core_number wraps the assignment around once the number
    // of threads exceeds the actual number of cores.
After the above routine is called, the threads would always be executing in the following manner:
Thread1-> Core1
Thread2-> Core2
Thread3-> Core3
Thread4-> Core4
Thread5-> Core5
Thread6-> Core6
Thread7-> Core7
Thread8-> Core8
Thread9-> Core1
Thread10-> Core2
...............
For more information, please refer to the manual/MSDN for more about these concepts.
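For completeness, a hedged C# sketch of the affinity call itself, using the standard kernel32 signatures (GetCurrentThread returns a pseudo-handle for the calling thread, and Thread.BeginThreadAffinity keeps the CLR from migrating the managed thread off the native one):

    using System;
    using System.Runtime.InteropServices;
    using System.Threading;

    static class Affinity
    {
        [DllImport("kernel32.dll")]
        static extern IntPtr GetCurrentThread();

        [DllImport("kernel32.dll")]
        static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

        // Pin the calling thread to one core (0-based; up to 31 with a 32-bit mask).
        public static void PinToCore(int core)
        {
            Thread.BeginThreadAffinity();
            SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u << core));
        }
    }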

You shouldn't have to worry about doing this yourself. I have multithreaded .NET apps running on dual-quad machines, and no matter how the threads are started, whether via the ThreadPool or manually, I see a nice even distribution of work across all cores.

Where each thread goes is generally handled by the OS itself, so if you generate 4 threads on a 4-core system, the OS will decide which core to run each on - usually 1 thread per core.

It is the operating system's job to split threads across different cores, and it will do so when automatically when your threads are using a lot of CPU time. Don't worry about that. As for finding out how many cores your user has, try Environment.ProcessorCount in C#.

You cannot do this, as only the operating system has the privileges to do it. If applications decided it themselves, they would become much harder to code: each one would have to take care of inter-processor communication and critical sections, and would have to create its own semaphores or mutexes - problems to which the operating system gives a common solution by handling scheduling itself.

One of the reasons you should not (as has been said) try to allocate this sort of stuff yourself is that you just don't have enough information to do it properly, particularly going into the future with NUMA, etc.
If you have a thread ready to run and there's a core idle, the kernel will run your thread; don't worry.

Related

Odd behavior when trying to delay loop with Thread.Sleep(Timespan) [duplicate]

I want to call thread sleep with less than 1 millisecond.
I read that neither Thread.Sleep nor the Windows OS supports that.
What's the solution for that?
For all those who wonder why I need this:
I'm doing a stress test, and want to know how many messages my module can handle per second.
So my code is:
    // Set the relative part of a second that will be allocated for each message.
    // For example: 5 messages - every message will get 200 milliseconds.
    var quantum = 1000 / numOfMessages;
    for (var i = 0; i < numOfMessages; i++)
    {
        _bus.Publish(new MyMessage());
        if (rate != 0)
            Thread.Sleep(quantum);
    }
I'll be glad to get your opinion on that.
You can't do this. A single sleep call will typically block for far longer than a millisecond (it's OS and system dependent, but in my experience, Thread.Sleep(1) tends to block for somewhere between 12-15ms).
Windows, in general, is not designed as a real-time operating system. This type of control is typically impossible to achieve on normal (desktop/server) versions of Windows.
The closest you can get is typically to spin and eat CPU cycles until you've achieved the wait time you want (measured with a high performance counter). This, however, is pretty awful - you'll eat up an entire CPU, and even then, you'll likely get preempted by the OS at times and effectively "sleep" for longer than 1ms...
The code below will most definitely offer a more precise way of blocking than calling Thread.Sleep(x) (although this method will block the thread, not put it to sleep). Here we use the Stopwatch class to measure how long we need to keep looping and blocking the calling thread.
    using System.Diagnostics;

    private static void NOP(double durationSeconds)
    {
        var durationTicks = Math.Round(durationSeconds * Stopwatch.Frequency);
        var sw = Stopwatch.StartNew();
        while (sw.ElapsedTicks < durationTicks)
        {
        }
    }
Example usage:
    private static void Main()
    {
        NOP(5); // Wait 5 seconds.
        Console.WriteLine("Hello World!");
        Console.ReadLine();
    }
Why?
Usually there are a very limited number of CPUs and cores on one machine - you get just a small number of independent execution units.
On the other hand, there are many processes and far more threads. Each thread requires some processor time, which is assigned internally by the Windows kernel. Usually Windows blocks all threads, gives a certain amount of CPU core time to particular threads, and then switches context to other threads.
When you call Thread.Sleep, no matter how small the value, you give up the whole time slice Windows gave to the thread, since there is no reason to simply wait for it and the context is switched straight away. It can then take a few ms before Windows gives your thread CPU time again.
What to use?
Alternatively, you can spin the CPU. Spinning is not a terrible thing to do and can be very useful; it is used a lot in the System.Collections.Concurrent namespace with its non-blocking collections, e.g.:
SpinWait sw = new SpinWait();
sw.SpinOnce();
Most of the legitimate reasons for using Thread.Sleep(1) or Thread.Sleep(0) involve fairly advanced thread synchronization techniques. Like Reed said, you will not get the desired resolution using conventional techniques. I do not know for sure what it is you are trying to accomplish, but I think I can assume that you want to cause an action to occur at 1 millisecond intervals. If that is the case then take a look at multimedia timers. They can provide resolution down to 1ms. Unfortunately, there is no API built into the .NET Framework (that I am aware of) that taps into this Windows feature. But you can use the interop layer to call directly into the Win32 APIs. There are even examples of doing this in C# out there.
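As a hedged sketch of that interop route: timeBeginPeriod raises the system-wide timer resolution (which is also what makes Sleep(1) behave closer to 1 ms); the multimedia-timer callbacks themselves come from timeSetEvent. Only the resolution part is shown here:

    using System;
    using System.Runtime.InteropServices;

    static class TimerResolution
    {
        [DllImport("winmm.dll")]
        static extern uint timeBeginPeriod(uint uMilliseconds);

        [DllImport("winmm.dll")]
        static extern uint timeEndPeriod(uint uMilliseconds);

        public static void Run(Action work)
        {
            timeBeginPeriod(1); // request 1 ms system timer resolution
            try
            {
                work();
            }
            finally
            {
                timeEndPeriod(1); // always restore the default resolution
            }
        }
    }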
In the good old days, you would use the QueryPerformanceCounter API of Win32 when sub-millisecond resolution was needed.
There is more info on the subject over on CodeProject: http://www.codeproject.com/KB/cs/highperformancetimercshar.aspx
This won't allow you to Sleep() at that resolution, though, as pointed out by Reed Copsey; it only lets you measure it.
Edit:
As pointed out by Reed Copsey and Brian Gideon, QueryPerformanceCounter has been superseded by Stopwatch in .NET (which wraps it).
I was looking for the same thing as the OP, and managed to find an answer that works for me. I'm surprised that none of the other answers mentioned this.
When you call Thread.Sleep(), you can use one of two overloads: an int with the number of milliseconds, or a TimeSpan.
A TimeSpan's constructor, in turn, has a number of overloads. One of them takes a single long denoting the number of ticks the TimeSpan represents. One tick is much less than 1 ms; in fact, 10,000 ticks make up 1 ms.
Therefore, I think the closest answer to the question is that if you want Thread.Sleep for less than 1 ms, you would create a TimeSpan with less than 1 ms worth of ticks and pass that to Thread.Sleep().

Task.Delay(<ms>).Wait(); sometimes causing a 15ms delay in messaging system [duplicate]


Thread priority (how to get fixed order)

In the console, because the threads sleep for random durations, the threads finish in an arbitrary order:
3, 2, 1 or 1, 2, 3 or ...
How can I get a fixed order?
And why, when I set the priority, does it not affect the code?
// ThreadTester.cs
// Multiple threads printing at different intervals.
using System;
using System.Threading;

namespace threadTester
{
    // class ThreadTester demonstrates basic threading concepts
    class ThreadTester
    {
        static void Main(string[] args)
        {
            // Create and name each thread. Use MessagePrinter's
            // Print method as argument to ThreadStart delegate.
            MessagePrinter printer1 = new MessagePrinter();
            Thread thread1 = new Thread(new ThreadStart(printer1.Print));
            thread1.Name = "thread1";

            MessagePrinter printer2 = new MessagePrinter();
            Thread thread2 = new Thread(new ThreadStart(printer2.Print));
            thread2.Name = "thread2";

            MessagePrinter printer3 = new MessagePrinter();
            Thread thread3 = new Thread(new ThreadStart(printer3.Print));
            thread3.Name = "thread3";

            Console.WriteLine("Starting threads");

            // call each thread's Start method to place each
            // thread in the Started state
            thread1.Priority = ThreadPriority.Lowest;
            thread2.Priority = ThreadPriority.Normal;
            thread3.Priority = ThreadPriority.Highest;

            thread1.Start();
            thread2.Start();
            thread3.Start();

            Console.WriteLine("Threads started\n");
            Console.ReadLine();
        } // end method Main
    } // end class ThreadTester

    // Print method of this class used to control threads
    class MessagePrinter
    {
        private int sleepTime;
        private static Random random = new Random();

        // constructor to initialize a MessagePrinter object
        public MessagePrinter()
        {
            // pick random sleep time between 0 and 5 seconds
            sleepTime = random.Next(5001);
        }

        // method Print controls thread that prints messages
        public void Print()
        {
            // obtain reference to currently executing thread
            Thread current = Thread.CurrentThread;

            // put thread to sleep for sleepTime amount of time
            Console.WriteLine(current.Name + " going to sleep for " + sleepTime);
            Thread.Sleep(sleepTime);

            // print thread name
            Console.WriteLine(current.Name + " done sleeping");
        } // end method Print
    } // end class MessagePrinter
}
You use threads precisely because you do not care about having things happen in a particular order; you want things to happen either:
At the same time, if there are enough cores to allow them to happen together.
With some making progress while others are waiting for something.
Interleaved with paying attention to I/O or user input, so as to continue being responsive.
In each of these cases, you just don't care that you don't know just which bit of what will happen when.
However:
You may still care about the order of certain sequences. In the simplest case, you just have these things happen in sequence within the same thread, while other things happen in other threads. More complicated cases can be served by chaining tasks together.
You may want the results from different threads to finally be put into a different order. The simplest approach is to put them all into order after they've all finished, though you can also sort results as they come (tricky though).
For ideal performance, there should be one thread running on each core (or possibly two on a hyperthreaded core, but that has further complications) at all times. Let's say you have a machine with 4 cores and 8 tasks you need done.
If the tasks involve a lot of waiting on I/O, then four will start, each will reach a point where it's waiting on that I/O, and allow one of the other tasks to make some progress. Chances are that even with the number of tasks being twice the number of cores, it'll still end up with plenty of idle time. If each task was going to take 20 seconds, then doing them on different threads will probably have them all done in just a little over 20 seconds, since all of them were spending most of their 20 seconds waiting on something else.
If you are doing tasks that keep the CPU busy all the time (not much waiting for memory and certainly not for I/O), then you will be able to have four such tasks going at a time, while the others wait for them to either finish or give up their slice of time. Here, if each took 20 seconds, the best you could hope for is a total time of about 40 seconds (and that's assuming no other thread from any process on the system wants the CPU, a perfect lack of overhead in setting up the threads, etc.).
In cases where there is more work to do (active work, rather than waiting for I/O to complete, for another thread to release a lock, etc.) than cores, the OS's scheduler will swap around between the different threads that want to be active. The exact details differ from OS to OS (different Windows versions, including some important differences between desktop and server setups, take different approaches; different Linux versions, with some particularly big changes from 2.4 to 2.6; and different Unixes all have different strategies).
One thing they all have in common is the common goal of making sure stuff gets done.
Thread priorities and process priorities are ways to influence this scheduling. With Windows, whenever there's more threads waiting to work than cores to work, those of the highest priority get given CPU time in a round-robin fashion. Should there be no threads of that priority, then those of the next lowest are given CPU time, then the next and so on.
This is a great way to grind things to a halt. It can lead to complications where a thread that was given high priority (presumably because its work is considered particularly crucial) is waiting on a thread given low priority (presumably because its work is considered less important and one wants it to always cede time to the others), and the low-priority thread keeps not being given CPU time, because there are always more threads of higher priority than available cores. Hence the supposedly high-priority thread gets no CPU time at all.
To fix this situation, Windows will occasionally promote threads that haven't run in a long time. This fixes things, but now means you've got the supposedly low-priority threads bursting along at super-high priority, to the detriment not just of the rest of the application but also of the rest of the system.
(One of the best things about having a multi-core system, is it means your computing experience is less affected by people who set the priority of threads!)
If you use a debugger to stop a multi-threaded .NET application and examine the threads, you'll probably find that all of them are at Normal except for one at Highest. This one is the finalizer thread, and its running at highest priority is one of the reasons it's important that finalizers not take a long time to execute - having work done at highest priority is a bad thing, and while it is justified in this case, it must end as soon as possible.
At least 95% of all other cases where someone sets the priority of a thread are a logical bug - it'll do nothing most of the time and allows things to get very messed up the rest. Priorities can be used well (or we wouldn't have the ability at all), but they should definitely be put in the "advanced techniques" category. (I like to spend my free time experimenting with multi-threading techniques that would count as excessive and premature optimisation most of the time, and I still hardly ever touch priorities.)
In your example, priority will have little effect, because each thread spends most of its time sleeping, so whichever thread does want CPU time can get it for the few nanoseconds it needs to run. What it could do, though, is cause the whole thing to become needlessly slower should you run it on a machine whose cores are also busy with other normal threads. In that case thread1 wouldn't get any CPU time at first (because there's always a higher-priority thread that wants the CPU); then, after 3 seconds, the scheduler would realise it has been starved for an eternity in terms of CPU speeds (9 billion CPU cycles or so) and give it a burst to highest priority, for long enough to let it mess with the timing of vital Windows services! Luckily it then sleeps and only does a minute amount of work before finishing, so it does no harm; but if it was doing anything real, it could have some really nasty effects on the entire system's performance.
You can't guarantee when Windows will execute a particular thread. You can make suggestions to the OS (i.e. the priority level), but ultimately Windows will decide when, what and where.
If you want to ensure that 1 starts before 2, which in turn starts before 3, you should make thread 1 start thread 2 and thread 2 start thread 3, as in the sketch below.
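A minimal sketch of that chaining (the Console.WriteLine calls stand in for the real work):

    using System;
    using System.Threading;

    class OrderedThreads
    {
        static void Main()
        {
            // Each thread starts the next one, so the 1-2-3 order is guaranteed.
            new Thread(() =>
            {
                Console.WriteLine("thread1");
                new Thread(() =>
                {
                    Console.WriteLine("thread2");
                    new Thread(() => Console.WriteLine("thread3")).Start();
                }).Start();
            }).Start();
        }
    }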
Threads are considered lightweight processes, in that they run completely independently of each other. If your task relies heavily on the order in which threads execute, you probably shouldn't be using threads.
Otherwise, you need to look at the thread synchronization constructs that the .NET framework provides.
You cannot synchronize threads like this. If you need the work done in a certain order, don't use separate threads, or use reset events or something similar.
Thread scheduling is never guaranteed. Order is never preserved unless you explicitly force it through your code via locks/etc.

Multi threaded file processing with .NET

There is a folder that contains thousands of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation, as the single-threaded prototype took six minutes to process 1000 files.
I'd like to have reader and writer threads as follows: while the reader threads are reading the files, the writer threads process them. Once a reader has started reading a file, I'd like to mark it as being processed, such as by renaming it; once it's read, rename it to completed.
How do I approach such a multithreaded application?
Is it better to use a distributed hash table or a queue?
Which data structure do I use that would avoid locks?
Is there a better approach to this scheme?
Since there's curiosity in the comments about how .NET 4 handles this, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: this is not a highly scientific analysis, just a demonstration that there's a clear performance benefit; based on hardware, your mileage may vary widely.
Here's a quick test (if you see a big mistake in this simple test, it's just an example - please comment, and we can fix it to be more useful/accurate). For this, I dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad and you can play with it yourself, for free! - be sure to get LINQPad 4, though):
var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();
var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration
sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");
Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Something to keep in mind is that some collections, for example our handy List<T>, are not thread-safe, so using them in a parallel scenario isn't a good idea :) Luckily there are concurrent collections, added in .NET 4, that are thread-safe. Also keep in mind that if you're using a locking collection, it may become a bottleneck as well, depending on the situation.
This uses the .AsParallel<T>(IEnumerable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (an internal class) which implements ParallelQuery<T>. This allows you to use the parallel extension methods; in this case we're using .ForAll().
.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of the threads after it's done running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
The results (Computer 1 - Physical Hard Disk):
Serial: 1288 - 1333ms
Parallel: 461 - 503ms
Computer specs - for comparison:
Quad Core i7 920 @ 2.66 GHz
12 GB RAM (DDR 1333)
300 GB 10k rpm WD VelociRaptor
The results (Computer 2 - Solid State Drive):
Serial: 545 - 601 ms
Parallel: 248 - 278 ms
Computer specifications - for comparison:
Core 2 Quad Q9100 @ 2.26 GHz
8 GB RAM (DDR 1333)
120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)
I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 laptop (here's a link to the M6500; Dell's own links to the 6400 are broken).
These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the overall min/max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does: it reads, processes, reads, processes, rinse, repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see that we can get a bit more than one going at a time in the results above, giving us a healthy boost.
Another disclaimer: quad core + .NET 4 parallel isn't going to give you four times the performance; it doesn't scale linearly... there are other considerations and bottlenecks in play.
I hope this was of interest in showing the approach and the possible benefits. Feel free to criticize or improve... this answer exists solely for those curious, as indicated in the comments :)
Design
The Producer/Consumer pattern will probably be the most useful for this situation. You should create enough threads to maximize the throughput.
Here are some questions about the Producer/Consumer pattern to give you an idea of how it works:
C# Producer/Consumer pattern
C# producer/consumer
You should use a blocking queue: the producer adds files to the queue while the consumers process files from the queue. The blocking queue requires no explicit locking on your part, so it's about the most efficient way to solve your problem.
If you're using .NET 4.0 there are several concurrent collections that you can use out of the box:
ConcurrentQueue: http://msdn.microsoft.com/en-us/library/dd267265%28v=VS.100%29.aspx
BlockingCollection: http://msdn.microsoft.com/en-us/library/dd267312%28VS.100%29.aspx
Threading
A single producer thread will probably be the most efficient way to load the files from disk and push them onto the queue; subsequently, multiple consumers will pop items off the queue and process them. I would suggest that you try 2-4 consumer threads per core and take some performance measurements to determine the optimum (i.e. the number of threads that gives you maximum throughput). I would not recommend using the ThreadPool for this specific example. A sketch of the whole design follows.
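Here is a minimal .NET 4 sketch of that design; the folder path, the consumer count of 4, and ProcessFile are placeholder assumptions:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class Pipeline
    {
        static void Main()
        {
            var queue = new BlockingCollection<string>(100); // bounded so the producer can't run away

            // Single producer: push file paths onto the queue.
            Task.Factory.StartNew(() =>
            {
                foreach (var file in Directory.EnumerateFiles(@"C:\folder"))
                    queue.Add(file);
                queue.CompleteAdding(); // tell the consumers no more work is coming
            });

            // Multiple consumers: GetConsumingEnumerable blocks until an item
            // arrives and ends cleanly once CompleteAdding has been called.
            var consumers = new Task[4];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (var file in queue.GetConsumingEnumerable())
                        ProcessFile(file); // hypothetical per-file work
                });
            }

            Task.WaitAll(consumers);
        }

        static void ProcessFile(string path)
        {
            // placeholder: parse/process the file here
        }
    }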
P.S. I don't understand the concern with a single point of failure or the use of distributed hash tables. I know DHTs sound like a really cool thing to use, but I would try the conventional methods first, unless you have a specific problem in mind that you're trying to solve.
I recommend that you queue a thread for each file and keep track of the running threads in a dictionary, launching a new thread when a thread completes, up to a maximum limit. I prefer to create my own threads when they can be long-running, and use callbacks to signal when they're done or have encountered an exception. In the sample below I use a dictionary to keep track of the running worker instances, so that I can call into an instance if I want to stop work early. Callbacks can also be used to update a UI with progress and throughput. You can also dynamically throttle the running-thread limit for added points.
The example code is an abbreviated demonstrator, but it does run.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        Supervisor super = new Supervisor();
        super.LaunchWaitingThreads();

        while (!super.Done) { Thread.Sleep(200); }
        Console.WriteLine("\nDone");
        Console.ReadKey();
    }
}

public delegate void StartCallbackDelegate(int idArg, Worker workerArg);
public delegate void DoneCallbackDelegate(int idArg);

public class Supervisor
{
    Queue<Thread> waitingThreads = new Queue<Thread>();
    Dictionary<int, Worker> runningThreads = new Dictionary<int, Worker>();
    int maxThreads = 20;
    object locker = new object();

    public bool Done
    {
        get
        {
            lock (locker)
            {
                return ((waitingThreads.Count == 0) && (runningThreads.Count == 0));
            }
        }
    }

    public Supervisor()
    {
        // queue up a thread for each file
        Directory.GetFiles("C:\\folder").ToList().ForEach(n => waitingThreads.Enqueue(CreateThread(n)));
    }

    Thread CreateThread(string fileNameArg)
    {
        Thread thread = new Thread(new Worker(fileNameArg, WorkerStart, WorkerDone).ProcessFile);
        thread.IsBackground = true;
        return thread;
    }

    // called when a worker starts
    public void WorkerStart(int threadIdArg, Worker workerArg)
    {
        lock (locker)
        {
            // update with worker instance
            runningThreads[threadIdArg] = workerArg;
        }
    }

    // called when a worker finishes
    public void WorkerDone(int threadIdArg)
    {
        lock (locker)
        {
            runningThreads.Remove(threadIdArg);
        }
        Console.WriteLine(string.Format("  Thread {0} done", threadIdArg.ToString()));
        LaunchWaitingThreads();
    }

    // launches workers until max is reached
    public void LaunchWaitingThreads()
    {
        lock (locker)
        {
            while ((runningThreads.Count < maxThreads) && (waitingThreads.Count > 0))
            {
                Thread thread = waitingThreads.Dequeue();
                runningThreads.Add(thread.ManagedThreadId, null); // placeholder so the count is accurate
                thread.Start();
            }
        }
    }
}

public class Worker
{
    string fileName;
    StartCallbackDelegate startCallback;
    DoneCallbackDelegate doneCallback;

    public Worker(string fileNameArg, StartCallbackDelegate startCallbackArg, DoneCallbackDelegate doneCallbackArg)
    {
        fileName = fileNameArg;
        startCallback = startCallbackArg;
        doneCallback = doneCallbackArg;
    }

    public void ProcessFile()
    {
        startCallback(Thread.CurrentThread.ManagedThreadId, this);
        Console.WriteLine(string.Format("Reading file {0} on thread {1}", fileName, Thread.CurrentThread.ManagedThreadId.ToString()));
        File.ReadAllBytes(fileName);
        doneCallback(Thread.CurrentThread.ManagedThreadId);
    }
}
Generally speaking, 1000 small files (how small, btw?) should not take six minutes to process. As a quick test, do a find "foobar" * in the directory containing the files (the first argument in quotes doesn't matter; it can be anything) and see how long it takes to process every file. If it takes more than one second, I'll be disappointed.
Assuming this test confirms my suspicion, then the process is CPU-bound, and you'll get no improvement from separating the reading into its own thread. You should:
Figure out why it takes more than 350 ms, on average, to process a small input, and hopefully improve the algorithm.
If there's no way to speed up the algorithm and you have a multicore machine (almost everyone does, these days), use a thread pool to create 1000 tasks, each with the job of reading one file.
You could have a central queue; the reader threads would need write access while pushing the in-memory contents onto the queue, and the processing threads would need read access to pop off the next memory stream to be processed. That way you minimize the time spent in locks and don't have to deal with the complexities of lock-free code.
EDIT: Ideally, you'd handle all exceptions/error conditions (if any) gracefully, so you don't have points of failure.
As an alternative, you can have multiple threads where each one "claims" a file by renaming it before processing it; the filesystem thus becomes the implementation of locked access. No clue whether this is any more performant than my original answer; only testing would tell. A sketch of the claiming step:
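This is only a sketch of the rename-to-claim idea; the ".processing" suffix is an illustrative convention, and the assumption is that a rename on a single volume succeeds for exactly one caller:

    using System;
    using System.IO;

    static class FileClaim
    {
        // A worker "claims" a file by renaming it; if another worker got
        // there first, the move throws and this caller backs off.
        public static bool TryClaim(string path, out string claimedPath)
        {
            claimedPath = path + ".processing"; // hypothetical naming convention
            try
            {
                File.Move(path, claimedPath);
                return true;
            }
            catch (IOException)
            {
                return false; // another worker claimed it first
            }
        }
    }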
You might consider a queue of files to process. Populate the queue once by scanning the directory when you start, and have the queue updated by a FileSystemWatcher to efficiently add new files to the queue without constantly re-scanning the directory. Something along these lines:
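A hedged sketch of that combination (the folder path is an assumption, and the processing loop just prints; it runs until the process is stopped, since CompleteAdding is never called):

    using System;
    using System.Collections.Concurrent;
    using System.IO;

    class WatcherQueue
    {
        static void Main()
        {
            var queue = new BlockingCollection<string>();

            // Seed the queue with whatever is already in the folder...
            foreach (var file in Directory.GetFiles(@"C:\folder"))
                queue.Add(file);

            // ...then let a FileSystemWatcher enqueue new files as they appear,
            // instead of re-scanning the directory in a loop.
            var watcher = new FileSystemWatcher(@"C:\folder");
            watcher.Created += (s, e) => queue.Add(e.FullPath);
            watcher.EnableRaisingEvents = true;

            foreach (var file in queue.GetConsumingEnumerable())
                Console.WriteLine("processing " + file); // hypothetical processing
        }
    }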
If at all possible, read and write to different physical disks. That will give you maximum IO performance.
If you have an initial burst of many files to process, and then an uneven pace of new files being added, and this all happens on the same disk (read/write), you could consider buffering the processed files in memory until one of two conditions applies:
There are (temporarily) no new files.
You have buffered so many files that you don't want to use more memory for buffering (ideally a configurable threshold).
If your actual processing of the files is CPU-intensive, you could consider having one processing thread per CPU core. However, for "normal" processing, CPU time will be trivial compared to I/O time, and the complexity would not be worth any minor gains.

The relationship between cores and the number of threads I can spawn

I have an Intel quad-core CPU.
If I were to develop a WinForms application which would only be used on my machine (I use C#, btw), how many threads can I spawn?
Is there some sort of correlation between cores and the maximum number of threads I can have running at any one time? Would I need to find out how many threads are running at any one time, and if so, is this possible? (I know there are properties like min and max threads.) Would this depend on the thread pool (does the maximum number of threads in this pool change?). This is the C# part of this post/thread.
It all depends. If your threads are active (and not waiting for I/O) 100% of the time, then there is little point in having more than 1 thread per CPU. However, this is rarely the case unless you are performing complex numeric calculations.
.NET's thread pool has: http://msdn.microsoft.com/en-us/library/system.threading.threadpool.aspx
"The thread pool has a default size of 250 worker threads per available processor, and 1000 I/O completion threads."
So I would say there are very few recommendations anyone can give you, besides:
Measure, measure, measure.
At some point, when you add more threads, stuff will get slower due to context switching and synchronization.
You have to measure. That said, with N cores I usually get the best results by spawning between N+1 and 2N threads. But you have to measure.
While there is a loose correlation between threads and cores (the only way for threads to execute truly concurrently would be for them to run on separate cores, but that knowledge is of less value than you might think), the real work is done by the operating system scheduler, in this case the thread scheduler in Windows.
As to how many threads you can create, that will vary from system to system. The ThreadPool class does not place any restrictions on spawning your own threads; it has a, well, pool of threads that it manages itself internally. Those are the values you can see when inspecting the properties of the ThreadPool class. That is not to say, however, that you should spawn limitless threads; eventually the OS will be spending more time switching between your threads than it will spend actually allowing your threads to run. Figure out how many threads are appropriate for your application through benchmarking.
What exactly are you trying to do?
how many threads can I spawn?
Waaay, waaay more (hundreds or thousands of times) than you would want to spawn for optimal throughput.
The per-thread limitations (on Windows) I'm aware of are:
16-bit thread ID
4-8KB allocation for user-space stack (typically, much more)
Non-pageable kernel-space context and stack, something like 16KB
Dotnet probably adds a bunch of per-thread overhead for its own stuff. GC and the like.
One formula I like to use for a WAG is:
threads = 2 * (cpu cores + active disk spindles)
The optimal number is usually within a factor of two of that. The theory is that the threads needed are proportional to the CPU cores (for obvious reasons), but also that some threads will block on disk I/O. Multiplying by two gives the CPU something to do while other threads are blocked.
Anyway, start with that and measure it. The number of worker threads is the easiest part of the whole problem to adjust later, so don't worry about it too much now.
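In C# terms, that starting point might look like this; the spindle count is an assumption you have to supply for your own hardware:

    // Rough starting point per the formula above.
    int spindles = 1; // assumption: one data disk; adjust for your machine
    int workers = 2 * (Environment.ProcessorCount + spindles);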
I just wondered which limit would be hit first by starting many threads, so I wrote the following simple test program and tried it. Now I assume that memory is the limiting factor. I was able to run 1000 threads, but without the Thread.Sleep() the system became "a bit unresponsive". With 2000 threads I got an out-of-memory exception after starting around 1800 threads. (Notebook with an Intel Core 2 Duo T5800 @ 2.0 GHz, 3.0 GiB RAM and a "few" applications running, on Windows XP SP3 with .NET Framework 3.5 SP1.)
UPDATE
The out-of-memory exception is caused by the stacks of the threads. After specifying the stack size in the thread constructor (I used 64 kB, but probably got the minimum size; I don't know it at the moment), I was able to start 3500 threads (with Thread.Sleep()).
using System;
using System.Linq;
using System.Threading;

namespace GeneralTestApplication
{
    class Program
    {
        private static void Main()
        {
            Console.WriteLine("Enter the number of threads to start.");
            while (!Int32.TryParse(Console.ReadLine(), out Program.numberThreads)) { }

            Program.counters = new Int64[Program.numberThreads];
            Console.WriteLine("Starting {0} threads.", Program.numberThreads);
            for (Int32 threadNumber = 0; threadNumber < Program.numberThreads; threadNumber++)
            {
                // To start more threads, pass a smaller maximum stack size to
                // the constructor: new Thread(Program.ThreadMethod, 64 * 1024)
                new Thread(Program.ThreadMethod).Start(threadNumber);
            }

            Console.WriteLine("Press enter to perform work on all threads.");
            Console.ReadLine();
            Program.manualResetEvent.Set();

            Console.WriteLine("Press enter to stop all threads.");
            Console.ReadLine();
            Program.stop = true;

            Console.WriteLine("At least {0} threads ran.", Program.counters.Count(c => c > 0));
            Console.ReadLine();
        }

        private static Int32 numberThreads = 0;
        private static Int64[] counters = null;
        private static readonly ManualResetEvent manualResetEvent = new ManualResetEvent(false);
        private static volatile Boolean stop = false;

        public static void ThreadMethod(Object argument)
        {
            Int32 threadNumber = (Int32)argument;
            Program.manualResetEvent.WaitOne();
            while (!Program.stop)
            {
                Program.counters[threadNumber]++;
                // Remove the Sleep to keep every thread busy
                // (the system may become unresponsive).
                Thread.Sleep(10);
            }
        }
    }
}
