Multi-threaded file processing with .NET - C#

There is a folder that contains thousands of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation, as the single-threaded prototype took six minutes to process 1000 files.
I'd like to have reader and writer thread(s) as follows: while the reader thread(s) are reading the files, the writer thread(s) process them. Once a reader starts reading a file, I'd like to mark it as being processed, for example by renaming it; once it's read, rename it to completed.
How do I approach such a multithreaded application?
Is it better to use a distributed hash table or a queue?
Which data structure should I use to avoid locks?
Is there a better approach to this scheme?

Since there's curiosity in the comments about how .NET 4 handles this, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: this is not a highly scientific analysis, just a demonstration that there's a clear performance benefit; based on hardware, your mileage may vary widely.
Here's a quick test (if you see a big mistake in this simple test, it's just an example; please comment, and we can fix it to be more useful/accurate). For this, I just dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! Be sure to get LINQPad 4, though):
var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); // start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); // do work - serial
sw.Stop(); // stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); // display the duration

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); // parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");
Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Something to keep in mind is that some collections, for example our handy List&lt;T&gt;, are not thread-safe, so using them in a parallel scenario isn't a good idea :) Luckily, concurrent collections were added in .NET 4 that are thread-safe. Also keep in mind that if you're using a locking collection, it may become a bottleneck as well, depending on the situation.
This uses the .AsParallel&lt;T&gt;(IEnumerable&lt;T&gt;) and .ForAll&lt;T&gt;(ParallelQuery&lt;T&gt;) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable&lt;T&gt; in a ParallelEnumerableWrapper&lt;T&gt; (internal class) which implements ParallelQuery&lt;T&gt;. This now allows you to use the parallel extension methods; in this case, we're using .ForAll().
.ForAll() internally creates a ForAllOperator&lt;T&gt;(query, action) and runs it synchronously. This handles the threading and merging of the threads while it's running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
The results (Computer 1 - Physical Hard Disk):
Serial: 1288 - 1333ms
Parallel: 461 - 503ms
Computer specs - for comparison:
Quad-core i7 920 @ 2.66 GHz
12 GB RAM (DDR 1333)
300 GB 10k rpm WD VelociRaptor
The results (Computer 2 - Solid State Drive):
Serial: 545 - 601 ms
Parallel: 248 - 278 ms
Computer specifications - for comparison:
Core 2 Quad Q9100 @ 2.26 GHz
8 GB RAM (DDR 1333)
120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)
I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 laptop (here's a link to the M6500... Dell's own links to the M6400 are broken).
These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the overall min and max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does: it reads, processes, reads, processes, rinse, repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see that we can get a bit more than one going at a time in the results above, giving us a healthy boost.
Another disclaimer: a quad core + .NET 4 parallel isn't going to give you four times the performance; it doesn't scale linearly... There are other considerations and bottlenecks in play.
I hope this was of interest in showing the approach and possible benefits. Feel free to criticize or improve... This answer exists solely for those curious, as indicated in the comments :)

Design
The Producer/Consumer pattern will probably be the most useful for this situation. You should create enough threads to maximize the throughput.
Here are some questions about the Producer/Consumer pattern to give you an idea of how it works:
C# Producer/Consumer pattern
C# producer/consumer
You should use a blocking queue: the producer adds files to the queue while the consumers process files from it. The blocking queue requires no explicit locking on your part, so it's about the most efficient way to solve your problem.
If you're using .NET 4.0 there are several concurrent collections that you can use out of the box:
ConcurrentQueue: http://msdn.microsoft.com/en-us/library/dd267265%28v=VS.100%29.aspx
BlockingCollection: http://msdn.microsoft.com/en-us/library/dd267312%28VS.100%29.aspx
Threading
A single producer thread will probably be the most efficient way to load the files from disk and push them onto the queue; multiple consumers will then pop items off the queue and process them. I would suggest that you try 2-4 consumer threads per core and take some performance measurements to determine the optimum (i.e. the number of threads that gives you the maximum throughput). I would not recommend using the ThreadPool for this specific example.
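To make the shape of this concrete, here is a minimal sketch of the single-producer / multiple-consumer layout using BlockingCollection&lt;T&gt;; the folder path and the per-file "work" are placeholders, not part of the original answer:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // Bounded so the producer can't run arbitrarily far ahead of the consumers.
        var queue = new BlockingCollection<string>(100);

        // Single producer: enumerate files and push their paths onto the queue.
        var producer = Task.Factory.StartNew(() =>
        {
            foreach (var file in Directory.EnumerateFiles("C:\\folder")) // placeholder path
                queue.Add(file);
            queue.CompleteAdding(); // tells consumers no more items are coming
        });

        // Multiple consumers: pop paths off the queue and process them.
        int consumerCount = Environment.ProcessorCount * 2; // a starting point; measure and tune
        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Factory.StartNew(() =>
            {
                foreach (var file in queue.GetConsumingEnumerable())
                    File.ReadAllBytes(file); // placeholder for the real per-file work
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);
    }
}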
P.S. I don't understand the concern with a single point of failure and the use of distributed hash tables. I know DHTs sound like a really cool thing to use, but I would try the conventional methods first unless you have a specific problem in mind that you're trying to solve.

I recommend that you queue a thread for each file and keep track of the running threads in a dictionary, launching a new thread when a thread completes, up to a maximum limit. I prefer to create my own threads when they can be long-running, and use callbacks to signal when they're done or have encountered an exception. In the sample below I use a dictionary to keep track of the running worker instances, so I can call into an instance if I want to stop work early. Callbacks can also be used to update a UI with progress and throughput. You can also dynamically throttle the running thread limit for extra points.
The example code is an abbreviated demonstrator, but it does run.
class Program
{
    static void Main(string[] args)
    {
        Supervisor super = new Supervisor();
        super.LaunchWaitingThreads();

        while (!super.Done) { Thread.Sleep(200); }
        Console.WriteLine("\nDone");
        Console.ReadKey();
    }
}

public delegate void StartCallbackDelegate(int idArg, Worker workerArg);
public delegate void DoneCallbackDelegate(int idArg);

public class Supervisor
{
    Queue<Thread> waitingThreads = new Queue<Thread>();
    Dictionary<int, Worker> runningThreads = new Dictionary<int, Worker>();
    int maxThreads = 20;
    object locker = new object();

    public bool Done
    {
        get
        {
            lock (locker)
            {
                return ((waitingThreads.Count == 0) && (runningThreads.Count == 0));
            }
        }
    }

    public Supervisor()
    {
        // queue up a thread for each file
        Directory.GetFiles("C:\\folder").ToList().ForEach(n => waitingThreads.Enqueue(CreateThread(n)));
    }

    Thread CreateThread(string fileNameArg)
    {
        Thread thread = new Thread(new Worker(fileNameArg, WorkerStart, WorkerDone).ProcessFile);
        thread.IsBackground = true;
        return thread;
    }

    // called when a worker starts
    public void WorkerStart(int threadIdArg, Worker workerArg)
    {
        lock (locker)
        {
            // update with worker instance
            runningThreads[threadIdArg] = workerArg;
        }
    }

    // called when a worker finishes
    public void WorkerDone(int threadIdArg)
    {
        lock (locker)
        {
            runningThreads.Remove(threadIdArg);
        }
        Console.WriteLine(string.Format("  Thread {0} done", threadIdArg.ToString()));
        LaunchWaitingThreads();
    }

    // launches workers until max is reached
    public void LaunchWaitingThreads()
    {
        lock (locker)
        {
            while ((runningThreads.Count < maxThreads) && (waitingThreads.Count > 0))
            {
                Thread thread = waitingThreads.Dequeue();
                runningThreads.Add(thread.ManagedThreadId, null); // placeholder so count is accurate
                thread.Start();
            }
        }
    }
}

public class Worker
{
    string fileName;
    StartCallbackDelegate startCallback;
    DoneCallbackDelegate doneCallback;

    public Worker(string fileNameArg, StartCallbackDelegate startCallbackArg, DoneCallbackDelegate doneCallbackArg)
    {
        fileName = fileNameArg;
        startCallback = startCallbackArg;
        doneCallback = doneCallbackArg;
    }

    public void ProcessFile()
    {
        startCallback(Thread.CurrentThread.ManagedThreadId, this);
        Console.WriteLine(string.Format("Reading file {0} on thread {1}", fileName, Thread.CurrentThread.ManagedThreadId.ToString()));
        File.ReadAllBytes(fileName); // placeholder for the real per-file work
        doneCallback(Thread.CurrentThread.ManagedThreadId);
    }
}

Generally speaking, 1000 small files (how small, by the way?) should not take six minutes to process. As a quick test, run find "foobar" * in the directory containing the files (the first argument in quotes doesn't matter; it can be anything) and see how long it takes to touch every file. If it takes more than one second, I'll be disappointed.
Assuming this test confirms my suspicion, then the process is CPU-bound, and you'll get no improvement from separating the reading into its own thread. You should:
Figure out why it takes more than 350 ms, on average, to process a small input, and hopefully improve the algorithm.
If there's no way to speed up the algorithm and you have a multicore machine (almost everyone does, these days), use a thread pool to assign 1000 tasks, each with the job of reading one file (see the sketch after this list).
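A minimal sketch of that thread-pool idea; the folder path and the ProcessFile body are placeholders for whatever per-file work the OP actually does:

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class TaskPerFileSketch
{
    static void Main()
    {
        // One task per file; the pool schedules them across the available cores.
        var tasks = Directory.GetFiles("C:\\folder")
            .Select(f => Task.Factory.StartNew(() => ProcessFile(f)))
            .ToArray();
        Task.WaitAll(tasks);
    }

    static void ProcessFile(string path)
    {
        File.ReadAllBytes(path); // placeholder for the real CPU-bound work
    }
}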

You could have a central queue; the reader threads would need write access while pushing the in-memory contents onto the queue, and the processing threads would need read access to pop the next memory stream to be processed. This way you minimize the time spent in locks and don't have to deal with the complexities of lock-free code.
EDIT: Ideally, you'd handle all exceptions/error conditions (if any) gracefully, so you don't have points of failure.
As an alternative, you can have multiple threads that each "claim" a file by renaming it before processing; the filesystem then becomes the implementation of locked access. No clue whether this is any more performant than my original answer; only testing would tell.
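Here is a minimal, hedged sketch of that claim-by-rename idea (the file suffixes are invented for illustration); File.Move fails if another thread or process has already renamed the file, which is what makes the claim safe:

using System;
using System.IO;

static class FileClaim
{
    // Try to claim a file by renaming it; returns the claimed path,
    // or null if someone else got there first.
    public static string TryClaim(string path)
    {
        string claimed = path + ".processing"; // invented suffix
        try
        {
            File.Move(path, claimed);
            return claimed;
        }
        catch (IOException)
        {
            return null; // another thread/process already claimed it
        }
    }

    public static void MarkCompleted(string claimedPath)
    {
        File.Move(claimedPath, claimedPath.Replace(".processing", ".completed"));
    }
}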

You might consider a queue of files to process. Populate the queue once by scanning the directory when you start, and have the queue updated by a FileSystemWatcher to efficiently add new files without constantly re-scanning the directory (a sketch of this is below).
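A minimal sketch of that scan-then-watch setup, assuming a ConcurrentQueue&lt;string&gt; as the work queue (the path is a placeholder):

using System;
using System.Collections.Concurrent;
using System.IO;

class WatcherSketch
{
    static readonly ConcurrentQueue<string> workQueue = new ConcurrentQueue<string>();

    static void Main()
    {
        // Initial scan populates the queue once.
        foreach (var file in Directory.EnumerateFiles("C:\\folder"))
            workQueue.Enqueue(file);

        // The watcher keeps the queue topped up as new files arrive.
        // Note: a file created mid-scan could be enqueued twice; either
        // dedupe here or make the processing step tolerate reprocessing.
        var watcher = new FileSystemWatcher("C:\\folder");
        watcher.Created += (s, e) => workQueue.Enqueue(e.FullPath);
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching... press Enter to stop.");
        Console.ReadLine();
    }
}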
If at all possible, read and write to different physical disks. That will give you maximum IO performance.
If you have an initial burst of many files to process, followed by an uneven pace of new files being added, and this all happens on the same disk (read/write), you could consider buffering the processed files in memory until one of two conditions applies:
There are (temporarily) no new files
You have buffered so many files that you don't want to use more memory for buffering (ideally a configurable threshold)
If your actual processing of the files is CPU intensive, you could consider having one processing thread per CPU core. However, for "normal" processing CPU time will be trivial compared to IO time and the complexity would not be worth any minor gains.

Related

Issues with Multithreading and file writing

I'm going to start by describing my use case:
I have built an app which processes LARGE datasets, runs various transformations on them, and then spits them out. This process is very time-sensitive, so a lot of time has gone into optimising it.
The idea is to read a bunch of records at a time, process each one on a different thread, and write the results to file. But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end. This is so that we avoid memory write protection exceptions or bottlenecks (as much as possible).
To achieve that, we have an array of 10 fileUtils, 1 of which gets passed to a thread as it is initiated. There is a threadCountIterator which increments at each localInit, and is reset back to zero when that count reaches 10. That value determines which of the fileUtils objects gets passed to the record-processing object per thread. The idea is that each util class is responsible for collecting and writing to just one of the temp output files.
It's worth noting that each FileUtils object gathers about 100 records in a member outputBuildString variable before writing them out, hence having them exist separately and outside of the threading process, where objects' lifespans are limited.
The idea is to more or less evenly disperse the responsibility for collecting, storing and then writing the output data across multiple fileUtil objects, which means we can write more per second than if we were just writing to one file.
My problem is that this approach results in an Array Out Of Bounds exception, as my threadedOutputIterator jumps above the upper limit value, despite there being code that is supposed to reduce it when this happens:
//by default threadCount = 10
private void ProcessRecords()
{
    try
    {
        Parallel.ForEach(clientInputRecordList, new ParallelOptions { MaxDegreeOfParallelism = threadCount }, LocalInit, ThreadMain, LocalFinally);
    }
    catch (Exception e)
    {
        Console.WriteLine("The following error occurred: " + e);
    }
}

private SplitLineParseObject LocalInit()
{
    if (threadedOutputIterator >= threadCount)
    {
        threadedOutputIterator = 0;
    }
    // still somehow goes above 10, and this is where the exception hits,
    // since there are only 10 objects in the threadedFileUtils array
    SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount, ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
    if (threadedOutputIterator < threadCount)
    {
        threadedOutputIterator++;
    }
    return splitLineParseUtil;
}

private SplitLineParseObject ThreadMain(ClientInputRecord record, ParallelLoopState state, SplitLineParseObject threadLocalObject)
{
    threadLocalObject.clientInputRecord = record;
    threadLocalObject.ProcessRecord();
    recordsPassedToObject++;
    return threadLocalObject;
}

private void LocalFinally(SplitLineParseObject obj)
{
    obj = null;
}
As explained in the comment above, it still manages to jump above 10, and this is where the exception hits, since there are only 10 objects in the threadedFileUtils array. I understand that this is because multiple threads can increment that number at the same time before the code in either of those if statements is called, meaning there's still a chance it will fail in its current state.
How could I better approach this so that I avoid that exception, while still being able to take advantage of the read, store and write efficiency that having multiple fileUtils gives me?
Thanks!
But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end
That is probably not a great idea. If you can fit the data in memory, it is most likely better to keep it in memory, or to merge the data concurrently with its production.
To achieve that, we have an array of 10 fileUtils, 1 of which get passed to a thread as it is initiated. There is a threadCountIterator which increments at each localInit, and is reset back to zero when that count reaches 10
This does not sound safe to me. The parallel loop should guarantee that no more than 10 threads run concurrently (if that is your limit), and that localInit runs once for each thread that is used. As far as I know, it makes no guarantee that no more than 10 threads will be used in total, so it seems possible that thread #0 and thread #10 could run concurrently.
The correct usage would be to create a new fileUtils object in the localInit.
This more or less works and ends up being more efficient than if we are writing to just one file
Are you sure? Typically, IO does not scale very well with concurrency. While SSDs are absolutely better at this than HDDs, both tend to work best with sequential IO.
How could I better approach this?
My approach would be to use a single writing thread, and a BlockingCollection as a thread-safe buffer between the producers and the writer. This assumes that the order of items is not significant:
public async Task ProcessAndWriteItems(List<int> myItems)
{
    // BlockingCollection uses a ConcurrentQueue by default.
    // Can also set a max size, in case the writer cannot keep up with the producers.
    var writeQueue = new BlockingCollection<string>();
    var writeTask = Task.Run(() => Writer(writeQueue));

    Parallel.ForEach(
        myItems,
        item =>
        {
            writeQueue.Add(item.ToString());
        });

    writeQueue.CompleteAdding(); // signal the writer to stop once all items have been processed
    await writeTask;
}

private void Writer(BlockingCollection<string> queue)
{
    using var stream = new StreamWriter(myFilePath);
    foreach (var line in queue.GetConsumingEnumerable())
    {
        stream.WriteLine(line);
    }
}
There is also TPL Dataflow, which should be suitable for tasks like this, but I have not used it, so I cannot provide specific recommendations.
Note that multithreaded programming is difficult. While it can be made easier by proper use of modern programming techniques, you still need to know a fair bit about thread safety to understand the problems and what options and tools exist to solve them. You will not always be so lucky as to get actual exceptions; a more typical result of multithreading bugs is that your program just produces the wrong result. If you are unlucky, this only occurs in production, on a full moon, and only when processing important data.
LocalInit obviously is not thread-safe, so when invoked multiple times in parallel it will have all the multithreading problems caused by non-atomic operations. As a quick fix, you can lock the whole method:
private object locker = new object();

private SplitLineParseObject LocalInit()
{
    lock (locker)
    {
        if (threadedOutputIterator >= threadCount)
        {
            threadedOutputIterator = 0;
        }
        SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount,
            ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
        if (threadedOutputIterator < threadCount)
        {
            threadedOutputIterator++;
        }
        return splitLineParseUtil;
    }
}
Or maybe try a workaround with Interlocked for more fine-grained control and better performance (though making the whole scheme correct this way would not be very easy, if it's even possible).
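For what it's worth, the round-robin index itself can be computed safely with a single Interlocked call, as in this hedged sketch against the OP's fields; it only fixes the out-of-bounds index, not the overlapping-writes problem described next:

private int threadedOutputIterator = -1; // start at -1 so the first index is 0

private SplitLineParseObject LocalInit()
{
    // Interlocked.Increment is atomic; the modulo keeps the index inside
    // [0, threadCount). (Ignores Int32 overflow, which takes ~2 billion calls.)
    int index = Interlocked.Increment(ref threadedOutputIterator) % threadCount;
    return new SplitLineParseObject(parmUtils, ref recCount,
        ref threadedFileUtils[index], ref recordsPassedToFileUtils);
}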
Note that even if you implement this in the current code, there is still no guarantee that all previous writes have actually finished: with 10 files, there is a possibility that the one with index 0 is not yet finished while the next 9 are, and the 10th will try writing to the same file that the 0th is still writing to. Possibly you should consider another approach. (If you still want to write to multiple files, note that IO does not usually scale that well, so a blocking write with a queue into one file may be the way to go.) You could consider splitting your data into chunks and processing them in parallel (i.e. a "thread" per chunk), with every chunk writing to its own file, so there is no need for synchronization.
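A minimal sketch of that chunk-per-file idea; the chunking scheme, the file naming and the per-record "processing" are invented for illustration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class ChunkedWriter
{
    // Splits the records into 'chunkCount' chunks; each chunk runs in parallel
    // and writes to its own file, so no synchronization is needed.
    public static void ProcessInChunks(List<string> records, int chunkCount)
    {
        var chunks = records
            .Select((record, i) => new { record, i })
            .GroupBy(x => x.i % chunkCount, x => x.record);

        Parallel.ForEach(chunks, chunk =>
        {
            // Invented file naming; combine the parts afterwards if needed.
            using (var writer = new StreamWriter("output_" + chunk.Key + ".tmp"))
            {
                foreach (var record in chunk)
                    writer.WriteLine(record); // placeholder for the real transformation
            }
        });
    }
}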
Some potentially useful reading:
Overview of synchronization primitives
System.Threading.Channels
TPL Dataflow
Threading in C# by Joseph Albahari

Why does my Parallel.ForEach appear not to be using 10 threads?

There does not seem to be a MinDegreeOfParallelism. The following code only seems to use 1% CPU, so I suspect it is NOT using the cores properly:
Parallel.ForEach(lls, new ParallelOptions { MaxDegreeOfParallelism = 10 }, GetFileSizeFSO);
Is there a way to FORCE it to use 10 cores/threads?
Additional information:
private void GetFileSizeFSO(List<string> l)
{
    foreach (var dir in l)
    {
        var ds = GetDirectorySize3(dir);
        Interlocked.Add(ref _size, ds);
    }
}

public static long GetDirectorySize3(string parentDirectory)
{
    Scripting.FileSystemObject fso = new Scripting.FileSystemObject();
    Scripting.Folder folder = fso.GetFolder(parentDirectory);
    Int64 dirSize = (Int64)folder.Size;
    Marshal.ReleaseComObject(fso);
    return dirSize;
}
What does your function GetFileSizeFSO do? If it accesses files on disk, that must be your main time-consumer. The processor is simply too fast for the disk to keep up with, so the processor has plenty of time to spare while it waits for the HDD to complete its job.
If you need to optimize your code, you'd do better to look into accessing the files more efficiently than trying to load the processor to 100%.
It's called MaxDegreeOfParallelism, not MinDegreeOfParallelism. Parallel is designed for CPU-bound work - there's no point whatsoever in using more threads than you have CPUs. It sounds like your work is I/O bound, rather than CPU-bound, so Parallel simply isn't the right tool for the job.
Ideally, find an asynchronous API to do what you're trying to do - this is the best way to use the resources you have. If there's no asynchronous API, you'll have to spawn those threads yourself - don't expect to see CPU usage, though. And most importantly, measure - it's very much possible that parallelizing the workload doesn't improve throughput at all (for example, the I/O might already be saturated).
The simple answer is - you can't.
But why should you? .NET is pretty good at choosing the optimal number of threads. The purpose of MaxDegreeOfParallelism is to limit parallelism, not to force it, for example if you don't want to give all system resources to the loop.
As a side note, judging from your function name GetFileSizeFSO, I would guess that it reads file sizes from your persistent storage, which would explain why your CPU is not being fully used.
ManInMoon, your CPU usage is probably low because the meat of the work you're doing is bound by your storage mechanism. Ten cores hitting the same hard drive for file sizes may not be any faster than 2 cores, because a hit against a hard drive is a relatively (ridiculously) more expensive operation than the surrounding C# logic you have there.
So, you don't have a parallelism problem, you have an I/O problem.
Side note: perhaps don't use FSO; use .NET's FileInfo instead (a sketch is below).
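As a hedged illustration of that side note, here is a minimal managed replacement for GetDirectorySize3 using DirectoryInfo/FileInfo instead of the COM FileSystemObject (error handling kept deliberately simple):

using System.IO;
using System.Linq;

public static long GetDirectorySizeManaged(string parentDirectory)
{
    // Sums file lengths recursively; no COM interop or ReleaseComObject needed.
    return new DirectoryInfo(parentDirectory)
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .Sum(fi => fi.Length);
}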

TPL vs Multithreading

I am new to threading and I need clarification for the scenario below.
I am working with Apple Push Notification Services. My application needs to send notifications to 30k users when a new deal is added to the website.
Can I split the 30k users into lists, each list containing 1000 users, and start multiple threads, or can I use tasks?
Is the following way efficient?
if (lstDevice.Count > 0)
{
    for (int i = 0; i < lstDevice.Count; i += 2)
    {
        splitList.Add(lstDevice.Skip(i).Take(2).ToList<DeviceHelper>());
    }

    var tasks = new Task[splitList.Count];
    int count = 0;
    foreach (List<DeviceHelper> lst in splitList)
    {
        tasks[count] = Task.Factory.StartNew(() =>
        {
            QueueNotifications(lst, pMessage, pSubject, pNotificationType, push);
        },
        TaskCreationOptions.None);
        count++;
    }
}
The QueueNotifications method just loops through each list item and creates a payload, like:
foreach (DeviceHelper device in splitList)
{
    if (device.PlatformType.ToLower() == "ios")
    {
        push.QueueNotification(new AppleNotification()
            .ForDeviceToken(device.DeviceToken)
            .WithAlert(pMessage)
            .WithBadge(device.Badge)
        );
        Console.Write("Waiting for Queue to Finish...");
    }
}
push.StopAllServices(true);
Technically it is certainly possible to split a list and then start threads that run over your list in parallel. You can also implement everything yourself, as you already have done, but this isn't a good approach. First of all, splitting a list into chunks that get processed in parallel is already what Parallel.For or Parallel.ForEach does. There is no need to re-implement everything yourself.
Now, you keep asking whether something can run 300 or 500 notifications in parallel. But actually this is not a good question, because you are missing the point of running something in parallel.
So let me explain why that question is not a good one. First, you should ask yourself why you want to run something in parallel at all. The answer is that you want it to run faster by using multiple CPU cores.
Now, your simple idea is probably that spawning 300 or 500 threads is faster, because with more threads more things run "in parallel". But that is not exactly the case.
First, creating a thread is not free. Every thread you create has some overhead: it takes some CPU time to create a thread, and it also needs some memory. On top of that, creating 300 threads doesn't mean 300 threads run in parallel. If you have, for example, an 8-core CPU, only 8 threads can really run in parallel. Creating more threads can even hurt your performance, because your program then needs to switch constantly between threads, and that also costs CPU performance.
The result of all that is: if you have something lightweight, some small code that doesn't do a lot of computation, creating a lot of threads will slow down your application instead of making it faster, because managing your threads creates more overhead than running them on (for example) 8 CPU cores.
That means that if you have a list of 30,000 of something, it usually ends up being faster to split the list into 8 chunks and work through it with 8 threads than to create 300 threads.
Your goal should never be: can it run xxx things in parallel?
The question should be: how many threads do I need, and how many items should every thread process, to get my work done as fast as possible?
That is an important difference, because just spawning more threads doesn't mean something ends up being fast.
So how many threads do you need, and how many items should every thread process? Well, you can write a lot of code to test it, but the optimum changes from hardware to hardware. A PC with just 4 cores has a different optimum than a system with 8 cores. If what you are doing is IO-bound (for example reading/writing to disk/network), you also don't get more speed by increasing the thread count.
So what you can do is test everything, try to find the correct thread number, and do a lot of benchmarking to find the best numbers.
But actually, that is the whole purpose of the TPL library with the Task class. The Task class already looks at your computer to see how many CPU cores it has, and when you run your Tasks it automatically tries to create as many threads as needed to get the maximum out of your system.
So my suggestion is that you use the TPL library with the Task class. In my opinion you should never create threads directly yourself or do the partitioning yourself, because all of that is already done in TPL.
I think the Task class is a good choice for your aim, because it gives you easy handling of the async process and you don't have to deal with threads directly.
Maybe this helps: Task vs Thread differences
But to give you a better answer, you should improve your question and give us more details.
You should be careful with creating too many parallel threads, because this can slow down your application. Read this nice article from SO: How many threads is too many?. The best approach is to make it configurable and then test some values.
I agree Task is a good choice; however, creating too many tasks also brings risks to your system, and handling failures should factor into your decision. For me, I prefer MSQueue combined with a thread pool.
If you want to parallelize the creation of the push notifications and maximize performance by using all the CPUs on the computer, you should use Parallel.ForEach:
Parallel.ForEach(
    devices,
    device => {
        if (device.PlatformType.ToUpperInvariant() == "IOS") {
            push.QueueNotification(
                new AppleNotification()
                    .ForDeviceToken(device.DeviceToken)
                    .WithAlert(message)
                    .WithBadge(device.Badge)
            );
        }
    }
);
push.StopAllServices(true);
This assumes that calling push.QueueNotification is thread-safe. Also, if this call locks a shared resource you may see lower than expected performance because of lock contention.
To avoid this lock contention you may be able to create a separate queue for each partition that Parallel.ForEach creates. I am improvising a bit here because some details are missing from the question. I assume that the variable push is an instance of the type Push:
Parallel.ForEach(
    devices,
    () => new Push(),
    (device, _, push) => {
        if (device.PlatformType.ToUpperInvariant() == "IOS") {
            push.QueueNotification(
                new AppleNotification()
                    .ForDeviceToken(device.DeviceToken)
                    .WithAlert(message)
                    .WithBadge(device.Badge)
            );
        }
        return push;
    },
    push => push.StopAllServices(true)
);
This will create a separate Push instance for each partition that Parallel.ForEach creates and when the partition is complete it will call StopAllServices on the instance.
This approach should perform no worse than splitting the devices into N lists, where N is the number of CPUs, and starting either N threads or N tasks to process each list. If one thread or task "gets behind", the total execution time will be the execution time of this "slow" thread or task. With Parallel.ForEach, all CPUs are used until all devices have been processed.

The relationship between cores and the number of threads I can spawn

I have an Intel quad-core CPU.
If I were to develop a WinForms application which would only be used on my machine (I use C#, btw), how many threads can I spawn?
Is there some sort of correlation between cores and the max number of threads I can have running at any one time? Would I need to find out how many threads are running at any one time, and if so, is this possible? (I know there are properties like min and max threads.) Would this depend on the thread pool (does the max number of threads in this pool change?). This is the C# part of this post/thread.
It all depends: if your threads are active (and not waiting for IO) 100% of the time, then there is little point in having more than 1 thread per CPU. However, this is rarely the case unless you are performing complex numeric calculations.
.NET's thread pool: http://msdn.microsoft.com/en-us/library/system.threading.threadpool.aspx
"The thread pool has a default size of 250 worker threads per available processor, and 1000 I/O completion threads."
So, I would say there is very little advice anyone can give you, besides:
Measure, measure, measure.
At some point, when you add more threads, things will get slower due to context switching and synchronization.
You have to measure. That said, with N cores I usually get the best results by spawning between N+1 and 2N threads. But you have to measure.
While there is a loose correlation between threads and cores (the only way for threads to execute truly concurrently would be for them to run on separate cores, but that knowledge is of less value than you might think), the real work is done by the operating system scheduler, in this case the thread scheduler in Windows.
As to how many threads you can create, that will vary from system to system. The ThreadPool class does not place any restrictions on spawning your own threads; it has a, well, pool of threads that it manages itself internally. Those are the values you can see when inspecting the properties of the ThreadPool class. That is not to say, however, that you should spawn limitless threads; eventually the OS will be spending more time switching between your threads than it will spend actually allowing your threads to run. Figure out how many threads are appropriate for your application through benchmarking.
What exactly are you trying to do?
how many threads can I spawn?
Waaay, waaay more (hundreds or thousands of times) than you would want to spawn for optimal throughput.
The per-thread limitations (on Windows) I'm aware of are:
16-bit thread ID
4-8KB allocation for user-space stack (typically, much more)
Non-pageable kernel-space context and stack, something like 16KB
Dotnet probably adds a bunch of per-thread overhead for its own stuff. GC and the like.
One formula I like to use for a WAG (wild-ass guess) is:
threads = 2 * (cpu cores + active disk spindles)
The optimal number is usually within a factor of two of that. The theory is that the number of threads needed is proportional to the CPU cores (for obvious reasons), but also that some threads will block on disk I/O. Multiplying by two gives the CPU something to do while other threads are blocked. For example, on a quad-core machine with a single disk, that suggests 2 * (4 + 1) = 10 threads.
Anyway, start with that and measure it. The number of worker threads is the easiest part of the whole problem to adjust later, so don't worry about it too much now.
I just wondered which limit would be hit first by starting many threads, so I wrote the following simple test program and tried it. Now I assume that memory is the limiting factor. I was able to run 1000 threads, but without the Thread.Sleep() the system became "a bit unresponsive". With 2000 threads I got an out-of-memory exception after starting around 1800 threads. (Notebook with an Intel Core 2 Duo T5800 @ 2.0 GHz, 3.0 GiB RAM and a "few" applications running on Windows XP SP3 with .NET Framework 3.5 SP1.)
UPDATE
The out-of-memory exception is caused by the stacks of the threads. After specifying the stack size on the thread constructor (I used 64 kB, but probably got the minimum size; I don't know it at the moment) I was able to start 3500 threads (with Thread.Sleep()).
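For reference, the overload in question is the Thread constructor that takes a maximum stack size; a one-line sketch of how the thread creation in the program below would change:

// Request a 64 kB stack instead of the default (typically 1 MB per thread);
// the CLR may round this up to its minimum supported size.
new Thread(Program.ThreadMethod, 64 * 1024).Start(threadNumber);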
using System;
using System.Linq;
using System.Threading;

namespace GeneralTestApplication
{
    class Program
    {
        private static void Main()
        {
            Console.WriteLine("Enter the number of threads to start.");
            while (!Int32.TryParse(Console.ReadLine(), out Program.numberThreads)) { }

            Program.counters = new Int64[Program.numberThreads];

            Console.WriteLine("Starting {0} threads.", Program.numberThreads);
            for (Int32 threadNumber = 0; threadNumber < Program.numberThreads; threadNumber++)
            {
                new Thread(Program.ThreadMethod).Start(threadNumber);
            }

            Console.WriteLine("Press enter to perform work on all threads.");
            Console.ReadLine();
            Program.manualResetEvent.Set();

            Console.WriteLine("Press enter to stop all threads.");
            Console.ReadLine();
            Program.stop = true;

            Console.WriteLine("At least {0} threads ran.", Program.counters.Count(c => c > 0));
            Console.ReadLine();
        }

        private static Int32 numberThreads = 0;
        private static Int64[] counters = null;
        private static readonly ManualResetEvent manualResetEvent = new ManualResetEvent(false);
        private static volatile Boolean stop = false;

        public static void ThreadMethod(Object argument)
        {
            Int32 threadNumber = (Int32)argument;
            Program.manualResetEvent.WaitOne();
            while (!Program.stop)
            {
                Program.counters[threadNumber]++;
                // Remove this Sleep to simulate heavy work (busy spinning).
                Thread.Sleep(10);
            }
        }
    }
}

How do I spawn threads on different CPU cores?

Let's say I had a program in C# that did something computationally expensive, like encoding a list of WAV files into MP3s. Ordinarily I would encode the files one at a time, but let's say I wanted the program to figure out how many CPU cores I had and spin up an encoding thread on each core. So, when I run the program on a quad core CPU, the program figures out it's a quad core CPU, figures out there are four cores to work with, then spawns four threads for the encoding, each of which is running on its own separate CPU. How would I do this?
And would this be any different if the cores were spread out across multiple physical CPUs? As in, if I had a machine with two quad core CPUs on it, are there any special considerations or are the eight cores across the two dies considered equal in Windows?
Don't bother doing that.
Instead, use the thread pool. The thread pool is a mechanism (actually a class) of the framework that you can query for a new thread.
When you ask for a new thread, it will either give you a new one or enqueue the work until a thread gets freed. In that way the framework is in charge of deciding whether it should create more threads or not, depending on the number of CPUs present.
Edit: In addition, as has already been mentioned, the OS is in charge of distributing the threads among the different CPUs.
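A minimal sketch of queueing the per-file encoding work to the thread pool; the EncodeToMp3 method and the music folder are placeholders for whatever encoder and files you actually use:

using System;
using System.IO;
using System.Threading;

class PoolEncodingSketch
{
    static void Main()
    {
        string[] wavFiles = Directory.GetFiles("C:\\music", "*.wav"); // placeholder path

        using (var done = new CountdownEvent(wavFiles.Length))
        {
            foreach (var file in wavFiles)
            {
                // The pool decides how many threads actually run concurrently.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    EncodeToMp3((string)state); // placeholder for the real encoder
                    done.Signal();
                }, file);
            }
            done.Wait(); // block until every queued item has finished
        }
    }

    static void EncodeToMp3(string wavPath)
    {
        Console.WriteLine("Encoding {0} on pool thread {1}", wavPath, Thread.CurrentThread.ManagedThreadId);
    }
}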
It is not necessarily as simple as using the thread pool.
By default, the thread pool allocates multiple threads for each CPU. Since every thread which gets involved in the work you are doing has a cost (task switching overhead, use of the CPU's very limited L1, L2 and maybe L3 cache, etc...), the optimal number of threads to use is <= the number of available CPU's - unless each thread is requesting services from other machines - such as a highly scalable web service. In some cases, particularly those which involve more hard disk reading and writing than CPU activity, you can actually be better off with 1 thread than multiple threads.
For most applications, and certainly for WAV and MP3 encoding, you should limit the number of worker threads to the number of available CPU's. Here is some C# code to find the number of CPU's:
int processors = 1;
string processorsStr = System.Environment.GetEnvironmentVariable("NUMBER_OF_PROCESSORS");
if (processorsStr != null)
    processors = int.Parse(processorsStr);
Unfortunately, it is not as simple as limiting yourself to the number of CPU's. You also have to take into account the performance of the hard disk controller(s) and disk(s).
The only way you can really find the optimal number of threads is trial and error. This is particularly true when you are using hard disks, web services and the like. With hard disks, you might be better off not using all four processors on your quad-processor CPU. On the other hand, with some web services, you might be better off making 10 or even 100 requests per CPU.
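As a hedged sketch of "limit the workers to the CPU count": Environment.ProcessorCount returns the same number the code above reads from the environment variable, and Parallel.ForEach (from the later TPL) accepts it as a cap. The folder and the encode step are placeholders:

using System;
using System.IO;
using System.Threading.Tasks;

class BoundedEncodingSketch
{
    static void Main()
    {
        var options = new ParallelOptions
        {
            // One worker per available CPU; tune downward if the disk is the bottleneck.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(Directory.GetFiles("C:\\music", "*.wav"), options, wav =>
        {
            // Placeholder for the real WAV -> MP3 encoding step.
            Console.WriteLine("Encoding {0}", wav);
        });
    }
}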
Although I agree with most of the answers here, I think it's worth adding a new consideration: Speedstep technology.
When running a CPU-intensive, single-threaded job on a multi-core system, in my case a Xeon E5-2430 with 6 real cores (12 with HT) under Windows Server 2012, the job got spread out among all 12 cores, using around 8.33% of each core and never triggering a speed increase. The CPU remained at 1.2 GHz.
When I set the thread affinity to a specific core, it used ~100% of that core, causing the CPU to max out at 2.5 GHz, more than doubling the performance.
This is the program I used, which just loops, incrementing a variable. When called with -a, it will set the affinity to core 1. The affinity part was based on this post.
using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading;

namespace Esquenta
{
    class Program
    {
        private static int numThreads = 1;
        static bool affinity = false;

        static void Main(string[] args)
        {
            if (args.Contains("-a"))
            {
                affinity = true;
            }
            if (args.Length < 1 || !int.TryParse(args[0], out numThreads))
            {
                numThreads = 1;
            }
            Console.WriteLine("numThreads:" + numThreads);
            for (int j = 0; j < numThreads; j++)
            {
                var param = new ParameterizedThreadStart(EsquentaP);
                var thread = new Thread(param);
                thread.Start(j);
            }
        }

        static void EsquentaP(object numero_obj)
        {
            int i = 0;
            DateTime ultimo = DateTime.Now;
            if (affinity)
            {
                Thread.BeginThreadAffinity();
                CurrentThread.ProcessorAffinity = new IntPtr(1);
            }
            try
            {
                while (true)
                {
                    i++;
                    if (i == int.MaxValue)
                    {
                        i = 0;
                        var lps = int.MaxValue / (DateTime.Now - ultimo).TotalSeconds / 1000000;
                        Console.WriteLine("Thread " + numero_obj + " " + lps.ToString("0.000") + " M loops/s");
                        ultimo = DateTime.Now;
                    }
                }
            }
            finally
            {
                Thread.EndThreadAffinity();
            }
        }

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentThreadId();

        [DllImport("kernel32.dll")]
        public static extern int GetCurrentProcessorNumber();

        private static ProcessThread CurrentThread
        {
            get
            {
                int id = GetCurrentThreadId();
                return Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Single(x => x.Id == id);
            }
        }
    }
}
And the results: processor speed as shown by Task Manager, similar to what CPU-Z reports (screenshots omitted here).
In the case of managed threads, the complexity of doing this is a degree greater than that of native threads. This is because CLR threads are not directly tied to a native OS thread; in other words, the CLR can switch a managed thread from native thread to native thread as it sees fit. The function Thread.BeginThreadAffinity is provided to place a managed thread in lock-step with a native OS thread. At that point, you could experiment with using native APIs to give the underlying native thread processor affinity. As everyone suggests here, this isn't a very good idea. In fact, there is documentation suggesting that threads can receive less processing time if they are restricted to a single processor or core.
You can also explore the System.Diagnostics.Process class. There you can find a function to enumerate a process's threads as a collection of ProcessThread objects. This class has methods to set ProcessorAffinity or even set a preferred processor; I'm not sure what that is.
Disclaimer: I've experienced a similar problem where I thought the CPU(s) were underutilized, and I researched a lot of this stuff; however, based on all that I read, it appeared that it wasn't a very good idea, as evidenced by the comments posted here as well. However, it's still interesting, and a learning experience, to experiment.
You can definitely do this by writing the routine inside your program.
However, you should not try to do it, since the operating system is the best candidate to manage this stuff. I mean, a user-mode program should not try to do it.
However, it can sometimes be done (for really advanced users) to achieve load balancing, or even to expose a true multi-thread multi-core problem (data racing / cache coherence...), as different threads would truly be executing on different processors.
Having said that, if you still want to achieve this, we can do it in the following way. I am providing you pseudo code for Windows; however, it could easily be done on Linux as well.
#define MAX_CORE 256
processor_mask[MAX_CORE] = {0};
core_number = 0;

Call GetLogicalProcessorInformation();
// From here we calculate core_number and populate the processor_mask[] array,
// which is used later on to run different threads on different CORES.

for (j = 0; j < THREAD_POOL_SIZE; j++)
    Call SetThreadAffinityMask(hThread[j], processor_mask[j]);
// hThread is the array of thread handles.
// If your number of threads is higher than the actual number of cores,
// reset the counter (j) once it reaches core_number.
After the above routine is called, the threads would always be executing in the following manner:
Thread1-> Core1
Thread2-> Core2
Thread3-> Core3
Thread4-> Core4
Thread5-> Core5
Thread6-> Core6
Thread7-> Core7
Thread8-> Core8
Thread9-> Core1
Thread10-> Core2
...............
For more information, please refer to manual/MSDN to know more about these concepts.
You shouldn't have to worry about doing this yourself. I have multithreaded .NET apps running on dual quad-core machines, and no matter how the threads are started, whether via the ThreadPool or manually, I see a nice even distribution of work across all cores.
Where each thread goes is generally handled by the OS itself... so spawn 4 threads on a 4-core system and the OS will decide which cores to run each on, which will usually be 1 thread on each core.
It is the operating system's job to split threads across different cores, and it will do so automatically when your threads are using a lot of CPU time. Don't worry about that. As for finding out how many cores your user has, try Environment.ProcessorCount in C#.
You cannot do this, as only the operating system has the privileges to do it. If applications could decide it..... then it would be difficult to write applications, because you would also need to take care of inter-processor communication and critical sections; for each application you would have to create your own semaphores or mutexes...... to which the operating system gives a common solution by doing it itself.
One of the reasons you should not (as has been said) try to allocate this sort of stuff yourself is that you just don't have enough information to do it properly, particularly going into the future with NUMA, etc.
If you have a thread ready to run and there's a core idle, the kernel will run your thread; don't worry.
