I'm going to start by describing my use case:
I have built an app which processes LARGE datasets, runs various transformations on them and then spits them out. This process is very time sensitive, so a lot of time has gone into optimising.
The idea is to read a bunch of records at a time, process each one on different threads and write the results to file. But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end. This is so that we avoid memory write protection exceptions or bottlenecks (as much as possible).
To achieve that, we have an array of 10 fileUtils, one of which gets passed to each thread as it is initiated. There is a threadCountIterator which increments at each localInit and is reset back to zero when that count reaches 10. That value determines which of the fileUtils objects gets passed to the record processing object for each thread. The idea is that each util class is responsible for collecting and writing to just one of the temp output files.
It's worth noting that each FileUtils object gathers about 100 records in a member outputBuildString variable before writing it out, hence having them exist separately and outside of the threading process, where an object's lifespan is limited.
The idea is to more or less evenly disperse the responsibility for collecting, storing and then writing the output data across multiple fileUtil objects, which means we can write more per second than if we were just writing to one file.
My problem is that this approach results in an Array Out Of Bounds exception as my threadedOutputIterator jumps above the upper limit value, despite there being code that is supposed to reduce it when this happens:
//by default threadCount = 10
private void ProcessRecords()
{
try
{
Parallel.ForEach(clientInputRecordList, new ParallelOptions { MaxDegreeOfParallelism = threadCount }, LocalInit, ThreadMain, LocalFinally);
}
catch (Exception e)
{
Console.WriteLine("The following error occured: " + e);
}
}
private SplitLineParseObject LocalInit()
{
if (threadedOutputIterator >= threadCount)
{
threadedOutputIterator = 0;
}
//still somehow goes above 10, and this is where the exception hits since there are only 10 objects in the threadedFileUtils array
SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount, ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
if (threadedOutputIterator<threadCount)
{
threadedOutputIterator++;
}
return splitLineParseUtil;
}
private SplitLineParseObject ThreadMain(ClientInputRecord record, ParallelLoopState state, SplitLineParseObject threadLocalObject)
{
threadLocalObject.clientInputRecord = record;
threadLocalObject.ProcessRecord();
recordsPassedToObject++;
return threadLocalObject;
}
private void LocalFinally(SplitLineParseObject obj)
{
obj = null;
}
As explained in the above comment, it still manages to jump above 10, and this is where the exception hits, since there are only 10 objects in the threadedFileUtils array. I understand that this is because multiple threads can increment that number at the same time before either of those if statements takes effect, meaning there's still the chance it will fail in its current state.
How could I better approach this such that I avoid that exception, while still being able to take advantage of the read, store and write efficiency that having multiple fileUtils gives me?
Thanks!
But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end
That is probably not a great idea. If you can fit the data in memory it is most likely better to keep it in memory, or do the merging of data concurrently with the production of data.
To achieve that, we have an array of 10 fileUtils, 1 of which get passed to a thread as it is initiated. There is a threadCountIterator which increments at each localInit, and is reset back to zero when that count reaches 10
This does not sound safe to me. The parallel loop should guarantee that no more than 10 threads will run concurrently (if that is your limit), and that local init will run once for each thread that is used. As far as I know it makes no guarantee that no more than 10 threads will be used in total, so it seems possible that thread #0 and thread #10 could run concurrently.
The correct usage would be to create a new fileUtils object in the localInit.
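In sketch form it might look something like this (the FileUtils constructor, GetTempFilePath and Flush are placeholder names, not your actual API):
//minimal sketch: every worker thread gets its own freshly created state
//from the localInit delegate, so no shared index is needed at all
Parallel.ForEach(
    clientInputRecordList,
    new ParallelOptions { MaxDegreeOfParallelism = threadCount },
    //localInit: runs once per worker thread, however many are used in total
    () => new SplitLineParseObject(parmUtils, new FileUtils(GetTempFilePath())),
    (record, state, local) =>
    {
        local.clientInputRecord = record;
        local.ProcessRecord();
        return local;
    },
    //localFinally: flush and close this thread's own temp file
    local => local.Flush());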
This more or less works and ends up being more efficient than if we are writing to just one file
Are you sure? Typically, IO does not scale very well with concurrency. While SSDs are absolutely better than HDDs, both tend to work best with sequential IO.
How could I better approach this?
My approach would be to use a single writing thread, and a BlockingCollection as a thread-safe buffer between the producers and the writer. This assumes that the order of items is not significant:
public async Task ProcessAndWriteItems(List<int> myItems)
{
// BlockingCollection uses a ConcurrentQueue by default
// Can also set a max size, in case the writer cannot keep up with the producers
var writeQueue = new BlockingCollection<string>();
var writeTask = Task.Run(() => Writer(writeQueue));
Parallel.ForEach(
myItems,
item =>
{
writeQueue.Add(item.ToString());
});
writeQueue.CompleteAdding(); // signal the writer to stop once all items have been processed
await writeTask;
}
private void Writer(BlockingCollection<string> queue)
{
using var stream = new StreamWriter(myFilePath);
foreach (var line in queue.GetConsumingEnumerable())
{
stream.WriteLine(line);
}
}
There is also TPL Dataflow, which should be suitable for tasks like this. But I have not used it, so I cannot provide specific recommendations.
Note that multi threaded programming is difficult. While it can be made easier by proper use of modern programming techniques, you still need to know a fair bit about thread safety to understand the problems, and what options and tools exist to solve them. You will not always be so lucky as to get actual exceptions; a more typical result of multi threading bugs is that your program just produces the wrong result. If you are unlucky this only occurs in production, on a full moon, and only when processing important data.
LocalInit obviously is not thread safe, so when invoked multiple times in parallel it will have all the multithreading problems caused by non-atomic operations. As a quick fix you can lock the whole method:
private object locker = new object();
private SplitLineParseObject LocalInit()
{
lock (locker)
{
if (threadedOutputIterator >= threadCount)
{
threadedOutputIterator = 0;
}
SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount,
ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
if (threadedOutputIterator < threadCount)
{
threadedOutputIterator++;
}
return splitLineParseUtil;
}
}
Or try to work around the issue with Interlocked for more fine-grained control and better performance (though that would not be very easy, if it is even possible).
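For example, a minimal sketch of the Interlocked route, keeping your existing fields but starting the counter at -1:
private int threadedOutputIterator = -1;
private SplitLineParseObject LocalInit()
{
    //Interlocked.Increment is atomic, and the unsigned modulo keeps the slot
    //inside the array no matter how many threads race (it also survives overflow)
    int slot = (int)((uint)Interlocked.Increment(ref threadedOutputIterator) % (uint)threadCount);
    return new SplitLineParseObject(parmUtils, ref recCount,
        ref threadedFileUtils[slot], ref recordsPassedToFileUtils);
}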
Note that even if you implement this in the current code, there is still no guarantee that all previous writes have actually finished: with 10 files there is a possibility that the one at index 0 is not yet finished while the next 9 are, so the 10th thread would try writing to the same file the 0th is still writing to. Possibly you should consider another approach. If you still want to write to multiple files (though IO does not usually scale that well, so a blocking write with a queue into one file may be the way to go), you can split your data into chunks and process them in parallel (i.e. one "thread" per chunk), with every chunk writing to its own file, so there is no need for sync.
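A rough sketch of that chunk-per-file idea (chunkSize and ProcessRecord are placeholders, and the final merge is left as a comment):
//each parallel iteration owns one chunk and one temp file,
//so the threads share nothing and need no synchronization
var chunks = clientInputRecordList
    .Select((record, i) => new { record, i })
    .GroupBy(x => x.i / chunkSize, x => x.record)
    .Select(g => g.ToList())
    .ToList();
Parallel.ForEach(chunks, (chunk, state, chunkIndex) =>
{
    using (var writer = new StreamWriter("output_" + chunkIndex + ".tmp"))
    {
        foreach (var record in chunk)
            writer.WriteLine(ProcessRecord(record));
    }
});
//afterwards, concatenate output_0.tmp .. output_N.tmp into the final file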
Some potentially useful reading:
Overview of synchronization primitives
System.Threading.Channels
TPL Dataflow
Threading in C# by Joseph Albahari
Related
I'm not completely new to C#, but I'm not familiar enough with the language to know how to do what I need to do.
I have a file, call it File1.txt. File1.txt has 100,000 lines or so.
I will duplicate File1.txt and call it File1_untested.txt.
I will also create an empty file "Successes.txt"
For each line in the file:
Remove this line from File1_untested.txt
If this line passes the test, write it to Successes.txt
So, my question is, how can I multithread this?
My approach so far has been to create an object (LineChecker), give the object its line to check, and pass the object into a ThreadPool. I understand how to use ThreadPools for a few tasks with a CountdownEvent. However, it seems unreasonable to queue up 100,000 tasks all at once. How can I gradually feed the pool? Maybe 1000 lines at a time or something like that.
Also, I need to ensure that no two threads are adding to Successes.txt or removing from File1_untested.txt at the same time. I can handle this with lock(), right? What should I be passing into lock()? Can I use a static member of LineChecker?
I'm just trying to get a broad understanding of how something like this can be designed.
Since the test takes a relatively significant amount of time, it makes sense to utilize multiple CPU cores. However, such utilization should be done only for the relatively expensive test, not for reading/updating the file, because reading/updating the file is relatively cheap.
Here is some example code that you can use:
Assuming that you have a relatively expensive Test method:
private bool Test(string line)
{
//This test is expensive
}
Here is a code sample that can utilize multiple CPU for testing:
Here we limit the number of items in the collection to 10, so that the thread that is reading from the file will wait for the other threads to catch up before reading more lines from the file.
This input thread will read much faster than the other threads can test, so in the worst case we will have read 10 more lines than the testing threads have processed. This keeps memory consumption bounded.
CancellationTokenSource cancellation_token_source = new CancellationTokenSource();
CancellationToken cancellation_token = cancellation_token_source.Token;
BlockingCollection<string> blocking_collection = new BlockingCollection<string>(10);
using (StreamReader reader = new StreamReader(new FileStream(filename, FileMode.Open, FileAccess.Read)))
{
using (
StreamWriter writer =
new StreamWriter(new FileStream(success_filename, FileMode.OpenOrCreate, FileAccess.Write)))
{
var input_task = Task.Factory.StartNew(() =>
{
try
{
while (!reader.EndOfStream)
{
if (cancellation_token.IsCancellationRequested)
return;
blocking_collection.Add(reader.ReadLine());
}
}
finally //In all cases, even in the case of an exception, we need to make sure that we mark that we have done adding to the collection so that the Parallel.ForEach loop will exit. Note that Parallel.ForEach will not exit until we call CompleteAdding
{
blocking_collection.CompleteAdding();
}
});
try
{
Parallel.ForEach(blocking_collection.GetConsumingEnumerable(), (line) =>
{
bool test_result = Test(line);
if (test_result)
{
lock (writer)
{
writer.WriteLine(line);
}
}
});
}
catch
{
cancellation_token_source.Cancel(); //If Parallel.ForEach throws an exception, we inform the input thread to stop
throw;
}
input_task.Wait(); //This will make sure that exceptions thrown in the input thread will be propagated here
}
}
If your "test" was fast, then multithreading would not have given you any advantage whatsoever, because your code would be 100% disk-bound, and presumably you have all of your files on the same disk: you cannot improve the throughput of a single disk with multithreading.
But since your "test" will be waiting for a response from a webserver, this means that the test is going to be slow, so there is plenty of room for improvement by multithreading. Basically, the number of threads you need depends on how many requests the webserver can be servicing simultaneously without degrading the performance of the webserver. This number might still be low, so you might end up not gaining anything, but at least you can try.
If your file is not really huge, then you can read it all at once, and write it all at once. If each line is only 80 characters long, then this means that your file is only 8 megabytes, which is peanuts, so you can read all the lines into a list, work on the list, produce another list, and in the end write out the entire list.
This will allow you to create a structure, say, MyLine which contains the index of each line and the text of each line, so that you can sort all lines before writing them, so that you do not have to worry about out-of-order responses from the server.
Then, what you need to do is use a bounded blocking queue like BlockingCollection as @Paul suggested.
BlockingCollection accepts as a constructor parameter its maximum capacity. This means that once its maximum capacity has been reached, any further attempts to add to it are blocked (the caller sits there waiting) until some items are removed. So, if you want to have up to 10 simultaneously pending requests, you would construct it as follows:
var sourceCollection = new BlockingCollection<MyLine>(10);
Your main thread will be stuffing sourceCollection with MyLine objects, and you will have 10 threads which block waiting to read MyLines from the collection. Each thread sends a request to the server, waits for a response, saves the result into a thread-safe resultCollection, and attempts to fetch the next item from sourceCollection.
Instead of using multiple threads you could instead use the async features of C#, but I am not terribly familiar with them, so I cannot advise you on precisely how you would do that.
In the end, copy the contents of resultCollection into a List, sort the list, and write it to the output file. (The copy into a separate List is probably a good idea because sorting the thread-safe resultCollection will probably be much slower than sorting a non-thread-safe List. I said probably.)
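In sketch form, with MyLine, resultCollection and outputFile as the illustrative names from above:
class MyLine
{
    public int Index;    //position of the line in the input file
    public string Text;  //the processed text for that line
}
//copy out of the thread-safe collection, restore the original order, write
var results = resultCollection.ToList();
results.Sort((a, b) => a.Index.CompareTo(b.Index));
File.WriteAllLines(outputFile, results.Select(r => r.Text));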
I am learning to use Rx and tried this sample, but I could not fix the exception that happens in the highlighted while statement - while (!f.EndOfStream)
I want to read a huge file - line by line - and for every line of data I want to do some processing in a different thread (so I used ObserveOn).
I want the whole thing async. I want to use ReadLineAsync since it returns a Task, so I can convert that to observables and subscribe to it.
I guess the task thread which I create first gets in between the Rx threads. But even if I observe and subscribe using the current thread, I still cannot stop the exception. I wonder how to accomplish this neatly and async with Rx.
I am also wondering if the whole thing could be done even simpler?
static void Main(string[] args)
{
RxWrapper.ReadFileWithRxAsync();
Console.WriteLine("this should be called even before the file read begins");
Console.ReadLine();
}
public static async Task ReadFileWithRxAsync()
{
Task t = Task.Run(() => ReadFileWithRx());
await t;
}
public static void ReadFileWithRx()
{
string file = #"C:\FileWithLongListOfNames.txt";
using (StreamReader f = File.OpenText(file))
{
string line = string.Empty;
bool continueRead = true;
while (!f.EndOfStream) // <-- the highlighted line where the exception occurs
{
f.ReadLineAsync()
.ToObservable()
.ObserveOn(Scheduler.Default)
.Subscribe(t =>
{
Console.WriteLine("custom code to manipulate every line data");
});
}
}
}
The exception is an InvalidOperationException - I'm not intimately familiar with the internals of FileStream, but according to the exception message this is being thrown because there is an in-flight asynchronous operation on the stream. The implication is that you must wait for any ReadLineAsync() calls to finish before checking EndOfStream.
Matthew Finlay has provided a neat re-working of your code to solve this immediate problem. However, I think it has problems of its own - and that there is a bigger issue that needs to be examined. Let's look at the fundamental elements of the problem:
You have a very large file.
You want to process it asynchronously.
This suggests that you don't want the whole file in memory, you want to be informed when the processing is done, and presumably you want to process the file as fast as possible.
Both solutions are using a thread to process each line (the ObserveOn is passing each line to a thread from the thread pool). This is actually not an efficient approach.
Looking at both solutions, there are two possibilities:
A. It takes more time on average to read a file line than it does to process it.
B. It takes less time on average to read a file line than it does to process it.
A. File read of a line slower than processing a line
In the case of A, the system will basically spend most of its time idle while it waits for file IO to complete. In this scenario, Matthew's solution won't result in memory filling up - but it's worth seeing if using ReadLines directly in a tight loop produces better results due to less thread contention. (ObserveOn pushing the line to another thread will only buy you something if ReadLines isn't getting lines in advance of calling MoveNext - which I suspect it does - but test and see!)
B. File read of a line faster than processing a line
In the case of B (which I assume is more likely given what you have tried), all those lines will start to queue up in memory and, for a big enough file, you will end up with most of it in memory.
You should note that unless your handler is firing off asynchronous code to process a line, then all lines will be processed serially because Rx guarantees OnNext() handler invocations won't overlap.
The ReadLines() method is great because it returns an IEnumerable<string> and it's your enumeration of this that drives reading the file. However, when you call ToObservable() on this, it will enumerate as fast as possible to generate the observable events - there is no feedback (known as "backpressure" in reactive programs) in Rx to slow down this process.
The problem is not the ToObservable itself - it's the ObserveOn. ObserveOn doesn't block the OnNext() handler it is invoked from until its subscribers are done with the event - it queues up events as fast as possible against the target scheduler.
If you remove the ObserveOn, then - as long as your OnNext handler is synchronous - you'll see each line is read and processed one at a time because the ToObservable() is processing the enumeration on the same thread as the handler.
If this isn't what you want, and you attempt to mitigate it in pursuit of parallel processing by firing an async job in the subscriber - e.g. Task.Run(() => /* process line */) or similar - then things won't go as well as you hope.
Because it takes longer to process a line than read a line, you will create more and more tasks that aren't keeping pace with the incoming lines. The thread count will gradually increase and you will be starving the thread pool.
In this case, Rx isn't a great fit really.
What you probably want is a small number of worker threads (probably one per processor core) that each fetch a line at a time to work on, while limiting the number of lines of the file held in memory.
A simple approach could be this, which limits the number of lines in memory to a fixed number of workers. It's a pull-based solution, which is a much better design in this scenario:
private Task ProcessFile(string filePath, int numberOfWorkers)
{
var lines = File.ReadLines(filePath);
var parallelOptions = new ParallelOptions {
MaxDegreeOfParallelism = numberOfWorkers
};
return Task.Run(() =>
Parallel.ForEach(lines, parallelOptions, ProcessFileLine));
}
private void ProcessFileLine(string line)
{
/* Your processing logic here */
Console.WriteLine(line);
}
And use it like this:
static void Main()
{
var processFile = ProcessFile(
#"C:\Users\james.world\Downloads\example.txt", 8);
Console.WriteLine("Processing file...");
processFile.Wait();
Console.WriteLine("Done");
}
Final Notes
There are ways of dealing with back pressure in Rx (search around SO for some discussions) - but it's not something that Rx handles well, and I think the resulting solutions are less readable than the alternative above. There are also many other approaches that you can look at (actor-based approaches such as TPL Dataflow, or LMAX Disruptor-style ring buffers for high-performance lock-free approaches), but the core idea of pulling work from queues will be prevalent.
Even in this analysis, I am conveniently glossing over what you are doing to process the file, and tacitly assuming that the processing of each line is compute bound and truly independent. If there is work to merge the results and/or IO activity to store the output then all bets are off - you will need to examine the efficiency of this side of things carefully too.
In most cases where performing work in parallel as an optimization is under consideration, there are usually so many variables in play that it is best to measure the results of each approach to determine what is best. And measuring is a fine art - be sure to measure realistic scenarios, take averages of many runs of each test and properly reset the environment between runs (e.g. to eliminate caching effects) in order to reduce measurement error.
I haven't looked into what is causing your exception, but I think the neatest way to write this is:
File.ReadLines(file)
.ToObservable()
.ObserveOn(Scheduler.Default)
.Subscribe(Console.WriteLine);
Note: ReadLines differs from ReadAllLines in that it will start yielding without having read the entire file, which is the behavior that you want.
I would like to ask for help with my code. I am a newbie and want to implement safe multi threading in writing to a text file.
StreamWriter sw = new StreamWriter(#"C:\DailyLog.txt");
private void Update(){
var collection = Database.GetCollection<Entity>("products");
StreamReader sr = new StreamReader(#"C:\LUSTK.txt");
string[] line = sr.ReadLine().Split(new char[] { ';' });
while (!sr.EndOfStream)
{
line = sr.ReadLine().Split(new char[] { ';' });
ThreadStart t = delegate {
UpdateEach(Convert.ToInt32(line[5]));
};
new Thread(t).Start();
}
sr.Close();
}
private void UpdateEach(int stock)
{
sw.WriteLine(stock);
}
I got no errors from my code, but it seems not everything is written to my text file. I did not call sw.Close() because I know some threads are not finished yet. In addition, how can I call sw.Close() knowing that no thread is left unfinished? I have 5 million records in my LUSTK.txt that are to be read by the StreamReader, each creating a thread, and each thread accesses the same text file.
You aren't going to be able to concurrently write to the same writer from different threads. The object wasn't designed to support concurrent access.
Beyond that, the general idea of writing to the same file from multiple threads is flawed. You still only have one physical disk, and it can only spin so fast. Telling it to do things faster won't make it spin any faster.
Beyond that, you're not closing the writer, as you said, and as a result, the buffer isn't being flushed.
You also have a bug in that your anonymous method is closing over line, and all of the methods are closing over the same variable, which is changing. It's important that they each close over their own identifier that won't change. (This can be accomplished simply by declaring line inside of the while loop.) But since you shouldn't be using multiple threads to begin with, there's no real need to focus on this.
You can also use File.ReadLines and File.WriteAllLines to do your file IO; it results in much cleaner code:
var values = File.ReadLines(inputFile)
.Select(line => line.Split(';')[5]);
File.WriteAllLines(outputFile, values);
If you were to want to parallelize this process it would be because you're doing some CPU bound work on each item after you read the line and before you write the line. Parallelizing the actual file IO, as said before, is likely to be harmful, not helpful. In this case the CPU bound work is just splitting the line and grabbing one value, and that's likely to be amazingly fast compared to the file IO. If you needed to, for example, hit the database or do some expensive processing on each line, then you would consider parallelizing just that part of the work, while synchronizing the file IO through a single thread.
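For example, assuming a hypothetical expensive Transform step, a PLINQ version might look like this; only the CPU-bound part runs in parallel, while AsOrdered keeps the output in input order and the actual file IO stays sequential:
var values = File.ReadLines(inputFile)
    .AsParallel()
    .AsOrdered() //preserve the original line order in the output
    .Select(line => Transform(line.Split(';')[5]));
File.WriteAllLines(outputFile, values);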
A StreamWriter is simply not thread-safe; you would need to synchronize access to it via lock or similar (a minimal sketch follows the list below). However, I would advise rethinking your strategy generally:
starting lots of threads is a really bad idea - threads are actually pretty expensive, and should not be used for small items of work (a Task or the ThreadPool might be fine, though) - a low number of threads perhaps separately dequeuing from a thread-safe queue would be preferable
you will have no guarantee of order in terms of the output
frankly, I would expect IO to be your biggest performance issue here, and that isn't impacted by the number of threads (or worse: can be adversely impacted)
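For completeness, the locking mentioned above would look roughly like this; the gate object must be shared by every thread that writes:
private readonly object writeGate = new object();
private readonly StreamWriter sw = new StreamWriter(@"C:\DailyLog.txt");
private void WriteSafely(string line)
{
    lock (writeGate) //only one thread at a time may touch the writer
    {
        sw.WriteLine(line);
    }
}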
My trading software is pretty slow; I want to boost it. There are two bottlenecks.
First bottleneck:
When new bunch of data is received (new quotes, trades etc.) all strategies need to be updated asap. They need to recalculate their state/orders etc. When new bunch of data is ready to be read AllTablesUpdated method is called. This method calls then AllTablesUpdated method for each particular strategy.
public void AllTablesUpdated()
{
Console.WriteLine("Start updating all tables.");
foreach (Strategy strategy in strategies)
{
strategy.AllTablesUpdated();
}
Console.WriteLine("All tables updated in " + sw.ElapsedMilliseconds);
}
The results differ. Sometimes it takes 0 or 1 milliseconds (that's very good), sometimes it takes 8-20 milliseconds, but sometimes it takes 800 milliseconds.
There are two problems with the current implementation:
it uses one thread and so doesn't use multi-core processors
strategy.AllTablesUpdated() uses shared resources and may be blocked for a while. If some particular strategy is waiting for resources to be released, all other strategies wait too (can we instead somehow postpone the blocked strategy and start processing the other strategies?)
Second bottleneck is pretty similar:
private OrdersOrchestrator()
{
// A simple blocking consumer with no cancellation.
Task.Factory.StartNew(() =>
{
while (!dataItems.IsCompleted)
{
OrdersExecutor ordersExecutor = null;
// Blocks if number.Count == 0
// IOE means that Take() was called on a completed collection.
// Some other thread can call CompleteAdding after we pass the
// IsCompleted check but before we call Take.
// In this example, we can simply catch the exception since the
// loop will break on the next iteration.
try
{
ordersExecutor = dataItems.Take();
}
catch (InvalidOperationException) { }
if (ordersExecutor != null)
{
ordersExecutor.IssueOrders();
}
}
});
}
ordersExecutor may wait until some resources are released. If so, all other ordersExecutors are blocked.
In detail: each strategy contains one ordersExecutor, and they use shared resources. strategy.AllTablesUpdated() may wait for resources to be released by its ordersExecutor and vice versa. If this condition occurs, all other strategies/ordersExecutors are blocked too. There are more than 100 strategies.
How should I modify the code so that:
if one strategy or ordersExecutor is blocked, the others aren't blocked?
it uses the power of multi-core processors (and probably multi-processor platforms)?
Your question is rather broad; what you are basically asking is how to make use of parallelism in your application. You already have code which is broken up into discrete tasks, so using parallelism shouldn't be a big problem. I would recommend reading about PLINQ and the TPL; both provide easy-to-use APIs for this sort of thing:
http://www.codeproject.com/KB/dotnet/parallelism-in-net-4-0.aspx
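As a minimal sketch of the TPL route for your first bottleneck (safe only if Strategy.AllTablesUpdated does not mutate state shared between strategies):
public void AllTablesUpdated()
{
    //update all strategies concurrently instead of one after another
    Parallel.ForEach(strategies, strategy => strategy.AllTablesUpdated());
}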
There is a folder that contains 1000s of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation as the single threaded prototype took six minutes to process 1000 files.
I'd like to have reader and writer threads as follows. While the reader threads are reading the files, I'd like to have writer threads process them. Once a reader has started reading a file, I'd like to mark it as being processed, such as by renaming it. Once it's read, rename it to completed.
How do I approach such a multithreaded application?
Is it better to use a distributed hash table or a queue?
Which data structure do I use that would avoid locks?
Is there a better approach to this scheme?
Since there's curiosity on how .NET 4 works with this in comments, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: This is not a highly scientific analysis, just showing that there's a clear performance benefit. Based on hardware, your mileage may vary widely.
Here's a quick test (if you see a big mistake in this simple test, it's just an example. Please comment, and we can fix it to be more useful/accurate). For this, I just dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! - be sure to get LINQPad 4 though):
var files =
Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();
var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration
sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");
Slightly changing your loop to parallelize the query is all that's needed in
most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Something to keep in mind is that some collections, for example our handy List<T>, are not thread safe, so using them in a parallel scenario isn't a good idea :) Luckily there were concurrent collections added in .NET 4 that are thread safe. Also keep in mind that if you're using a locking collection, this may be a bottleneck as well, depending on the situation.
This uses the .AsParallel<T>(IEnumeable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (internal class) which implements ParallelQuery<T>. This now allows you to use the parallel extension methods, in this case we're using .ForAll().
.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and merging of the threads after it's running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
The results (Computer 1 - Physical Hard Disk):
Serial: 1288 - 1333ms
Parallel: 461 - 503ms
Computer specs - for comparison:
Quad Core i7 920 @ 2.66 GHz
12 GB RAM (DDR 1333)
300 GB 10k rpm WD VelociRaptor
The results (Computer 2 - Solid State Drive):
Serial: 545 - 601 ms
Parallel: 248 - 278 ms
Computer specifications - for comparison:
Core 2 Quad Q9100 @ 2.26 GHz
8 GB RAM (DDR 1333)
120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)
I don't have links for the CPU/RAM this time, these came installed. This is a Dell M6400 Laptop (here's a link to the M6500... Dell's own links to the 6400 are broken).
These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the original min/max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does. It reads, processes, reads, processes, rinse repeat. With the parallel approach, you are (even with a I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see that we can get a bit more than one going at a time in the results above, giving us a healthy boost.
Another disclaimer: Quad core + .NET 4 parallel isn't going to give you four times the performance, it doesn't scale linearly... There are other considerations and bottlenecks in play.
I hope this was of interest in showing the approach and possible benefits. Feel free to criticize or improve... This answer exists solely for those curious as indicated in the comments :)
Design
The Producer/Consumer pattern will probably be the most useful for this situation. You should create enough threads to maximize the throughput.
Here are some questions about the Producer/Consumer pattern to give you an idea of how it works:
C# Producer/Consumer pattern
C# producer/consumer
You should use a blocking queue: the producer adds files to the queue while the consumers process the files from the queue. The blocking queue requires no locking on your part, so it's about the most efficient way to solve your problem.
If you're using .NET 4.0 there are several concurrent collections that you can use out of the box:
ConcurrentQueue: http://msdn.microsoft.com/en-us/library/dd267265%28v=VS.100%29.aspx
BlockingCollection: http://msdn.microsoft.com/en-us/library/dd267312%28VS.100%29.aspx
Threading
A single producer thread will probably be the most efficient way to load the files from disk and push them onto the queue; subsequently, multiple consumers will pop items off the queue and process them. I would suggest that you try 2-4 consumer threads per core and take some performance measurements to determine which is most optimal (i.e. the number of threads that gives you the maximum throughput). I would not recommend using the ThreadPool for this specific example.
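A rough sketch of that shape (ProcessFile is a placeholder for your per-file work, and consumerCount is something you would tune by measurement):
var queue = new BlockingCollection<string>(100); //bounded so the producer can't race too far ahead
var producer = Task.Factory.StartNew(() =>
{
    foreach (var path in Directory.EnumerateFiles(inputFolder))
        queue.Add(path);
    queue.CompleteAdding(); //lets the consumers drain the queue and exit
});
var consumers = Enumerable.Range(0, consumerCount)
    .Select(i => Task.Factory.StartNew(() =>
    {
        foreach (var path in queue.GetConsumingEnumerable())
            ProcessFile(path);
    }))
    .ToArray();
producer.Wait();
Task.WaitAll(consumers);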
P.S. I don't understand what's the concern with a single point of failure and the use of distributed hash tables? I know DHTs sound like a really cool thing to use, but I would try the conventional methods first unless you have a specific problem in mind that you're trying to solve.
I recommend that you queue a thread for each file and keep track of the running threads in a dictionary, launching a new thread when a thread completes, up to a maximum limit. I prefer to create my own threads when they can be long-running, and use callbacks to signal when they're done or encountered an exception. In the sample below I use a dictionary to keep track of the running worker instances. This way I can call into an instance if I want to stop work early. Callbacks can also be used to update a UI with progress and throughput. You can also dynamically throttle the running thread limit for added points.
The example code is an abbreviated demonstrator, but it does run.
class Program
{
static void Main(string[] args)
{
Supervisor super = new Supervisor();
super.LaunchWaitingThreads();
while (!super.Done) { Thread.Sleep(200); }
Console.WriteLine("\nDone");
Console.ReadKey();
}
}
public delegate void StartCallbackDelegate(int idArg, Worker workerArg);
public delegate void DoneCallbackDelegate(int idArg);
public class Supervisor
{
Queue<Thread> waitingThreads = new Queue<Thread>();
Dictionary<int, Worker> runningThreads = new Dictionary<int, Worker>();
int maxThreads = 20;
object locker = new object();
public bool Done {
get {
lock (locker) {
return ((waitingThreads.Count == 0) && (runningThreads.Count == 0));
}
}
}
public Supervisor()
{
// queue up a thread for each file
Directory.GetFiles("C:\\folder").ToList().ForEach(n => waitingThreads.Enqueue(CreateThread(n)));
}
Thread CreateThread(string fileNameArg)
{
Thread thread = new Thread(new Worker(fileNameArg, WorkerStart, WorkerDone).ProcessFile);
thread.IsBackground = true;
return thread;
}
// called when a worker starts
public void WorkerStart(int threadIdArg, Worker workerArg)
{
lock (locker)
{
// update with worker instance
runningThreads[threadIdArg] = workerArg;
}
}
// called when a worker finishes
public void WorkerDone(int threadIdArg)
{
lock (locker)
{
runningThreads.Remove(threadIdArg);
}
Console.WriteLine(string.Format(" Thread {0} done", threadIdArg.ToString()));
LaunchWaitingThreads();
}
// launches workers until max is reached
public void LaunchWaitingThreads()
{
lock (locker)
{
while ((runningThreads.Count < maxThreads) && (waitingThreads.Count > 0))
{
Thread thread = waitingThreads.Dequeue();
runningThreads.Add(thread.ManagedThreadId, null); // place holder so count is accurate
thread.Start();
}
}
}
}
public class Worker
{
string fileName;
StartCallbackDelegate startCallback;
DoneCallbackDelegate doneCallback;
public Worker(string fileNameArg, StartCallbackDelegate startCallbackArg, DoneCallbackDelegate doneCallbackArg)
{
fileName = fileNameArg;
startCallback = startCallbackArg;
doneCallback = doneCallbackArg;
}
public void ProcessFile()
{
startCallback(Thread.CurrentThread.ManagedThreadId, this);
Console.WriteLine(string.Format("Reading file {0} on thread {1}", fileName, Thread.CurrentThread.ManagedThreadId.ToString()));
File.ReadAllBytes(fileName);
doneCallback(Thread.CurrentThread.ManagedThreadId);
}
}
Generally speaking, 1000 small files (how small, btw?) should not take six minutes to process. As a quick test, do a find "foobar" * in the directory containing the files (the first argument in quotes doesn't matter; it can be anything) and see how long it takes to process every file. If it takes more than one second, I'll be disappointed.
Assuming this test confirms my suspicion, then the process is CPU-bound, and you'll get no improvement from separating the reading into its own thread. You should:
Figure out why it takes more than 350 ms, on average, to process a small input, and hopefully improve the algorithm.
If there's no way to speed up the algorithm and you have a multicore machine (almost everyone does, these days), use a thread pool to assign 1000 tasks each the job of reading one file.
You could have a central queue, the reader threads would need write access during the push of the in-memory contents to the queue. The processing threads would need read access to this central queue to pop off the next memory stream to-be-processed. This way you minimize the time spent in locks and don't have to deal with the complexities of lock free code.
EDIT: Ideally, you'd handle all exceptions/error conditions (if any) gracefully, so you don't have points of failure.
As an alternative, you can have multiple threads, each one "claims" a file by renaming it before processing, thus the filesystem becomes the implementation for locked access. No clue if this is any more performant than my original answer, only testing would tell.
You might consider a queue of files to process. Populate the queue once by scanning the directory when you start and have the queue updated with a FileSystemWatcher to efficiently add new files to the queue without constantly re-scanning the directory.
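In sketch form (watchedFolder is a placeholder; consumers take paths from the same collection as in the other answers):
var pending = new BlockingCollection<string>();
//seed the queue with the files that already exist
foreach (var path in Directory.EnumerateFiles(watchedFolder))
    pending.Add(path);
//then let the watcher feed new arrivals in without re-scanning
var watcher = new FileSystemWatcher(watchedFolder);
watcher.Created += (s, e) => pending.Add(e.FullPath);
watcher.EnableRaisingEvents = true;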
If at all possible, read and write to different physical disks. That will give you maximum IO performance.
If you have an initial burst of many files to process and then an uneven pace of new files being added and this all happens on the same disk (read/write), you could consider buffering the processed files to memory until one of two conditions applies:
There are (temporarily) no new files
You have buffered so many files that you don't want to use more memory for buffering (ideally a configurable threshold)
If your actual processing of the files is CPU intensive, you could consider having one processing thread per CPU core. However, for "normal" processing CPU time will be trivial compared to IO time and the complexity would not be worth any minor gains.