I'd like to traverse a directory on my hard drive and search through all the files for a specific search string. This sounds like the perfect candidate for something that could (or should) be done in parallel since the IO is rather slow.
Traditionally, I would write a recursive function that finds and processes all files in the current directory and then recurses into each of its subdirectories. I'm wondering how I can modify this to be more parallel. At first I simply modified:
foreach (string directory in directories) { ... }
to
Parallel.ForEach(directories, (directory) => { ... })
but I feel that this might create too many tasks and get itself into knots, especially when trying to dispatch back onto a UI thread. I also feel that the number of tasks is unpredictable and that this might not be an efficient way to parallelize (is that a word?) this task.
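In rough form, what I have now looks something like this (just a sketch; the method names are illustrative, not my real code):

// usings assumed: System.IO, System.Threading.Tasks
void SearchDirectory(string path, string searchString)
{
    foreach (string file in Directory.GetFiles(path))
        SearchFile(file, searchString);                 // scan one file for the search string

    Parallel.ForEach(Directory.GetDirectories(path), directory =>
    {
        SearchDirectory(directory, searchString);       // recurse into each subdirectory in parallel
    });
}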
Has anyone successfully done something like this before? What advice do you have in doing so?
No, this doesn't sound like a good candidate for parallelism, precisely because the IO is slow. You're going to be disk-bound. Assuming you've only got one disk, you don't really want to be making it seek to multiple different places at the same time.
It's a bit like trying to attach several hoses to the same tap in order to get water out faster - or trying to run 16 CPU-bound threads on a single core :)
Wondering if you could clarify.
I am writing a tool whose only job is to retrieve data from a database (SQL Server) and create txt files.
I am talking about 500,000 txt files.
It's working and all is good.
However, I was wondering if using the Task Parallel Library could improve and speed up the time it takes to create these files.
I know (have read) that the TPL is not meant to be used for I/O-bound processing and that it will most likely perform the same as sequential code.
Is this true?
Also, in an initial attempt using a simple "parallel foreach", I was getting an error that the file cannot be accessed because it is in use.
Any advice?
You do not parallelize I/O-bound processes.
The reason is simple: the CPU is not the bottleneck. No matter how many threads you start, you only have ONE disk to write to, and that is the slowest part.
So what you need to do is simply iterate over every file and write it. You can start a separate worker thread to do this work, or use async I/O to get better UI responsiveness.
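For example, a minimal sketch of pushing the whole export onto one background thread (the Record type, its property names, and the output directory are placeholders, not the poster's actual code):

// usings assumed: System.Collections.Generic, System.IO, System.Threading.Tasks
Task ExportFiles(IEnumerable<Record> records, string outputDir)
{
    // one worker iterates the records and writes them sequentially;
    // the disk stays busy while the UI thread remains free
    return Task.Run(() =>
    {
        foreach (Record record in records)
        {
            File.WriteAllText(Path.Combine(outputDir, record.FileName + ".txt"), record.Text);
        }
    });
}

Writing sequentially from a single worker should also sidestep the "cannot access file because it is in use" error, since nothing competes for the same file.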
If you read and/or write from multiple disks, then parallelizing could improve speed. E.g. if you want to read all your files, run a hash on each, and store the hashes, you could create one thread per disk and you would see a significant speed-up. However, in your case it seems like tasks are unlikely to improve performance.
I'm using a parallel for loop in my code to run a long running process on a large number of entities (12,000).
The process parses a string, goes through a number of input files (I've read that, given the amount of IO involved, the benefits of threading could be questionable, but it seems to have sped things up elsewhere) and outputs a matched result.
Initially, the process goes quite quickly - however it ends up slowing to a crawl. It's possible that it's just hit a batch of particularly tricky input data, but this seems unlikely on closer inspection.
Within the loop, I added some debug code that prints "Started Processing: " and "Finished Processing: " when it begins/ends an iteration and then wrote a program that pairs a start and a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched IDs, it looks like the program is processing in excess of 400 different entities at once. This seems like, given the large amount of IO, it could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched IDs properly, or is there some clever stuff going on behind the scenes that I'm missing, or even something obvious?
If you'd agree what I've spotted is correct, how can I limit the number it spins off and does at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.
Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the thread processing that data is being de-prioritized. This could be due to IO or to the thread pool trying to optimize; however, if you are strongly IO-bound, then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of concurrent tasks to a reasonable number. I would suggest trial and error to determine the optimal degree, starting around the number of processor cores you have.
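For instance, a quick sketch (the entities collection and the Process method are placeholders for your own code):

// usings assumed: System, System.Threading.Tasks
var options = new ParallelOptions
{
    // cap the concurrency; tune this value by measuring
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(entities, options, entity =>
{
    Process(entity);    // the long-running per-entity work
});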
Good luck!
Let me start by confirming that it is indeed a very bad idea to read 2 files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone however many your whole process is using.
Parallelism is there to make better use of a resource that actually is parallelizable: CPU power. If your parallelized process reads from a hard drive, you lose most of the benefit.
And even then, CPU power is not amenable to unlimited parallelization. A normal desktop CPU can run up to about 10 threads at the same time (it depends on the model, obviously, but that's the order of magnitude).
So, two things:
First, I am going to assume that your entities use all your files, but that your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e. into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache not to reread your files every time you need them, and your hard drive's cache is far smaller than your memory (roughly 1000-fold).
Second, you shouldn't be throwing Parallel.For at 12,000 items as if that were free. Scheduling 12,000 work items that each block on IO is actually worse than running ~10 threads, because of the overhead the parallelization adds, and because your CPU gains nothing from it: it cannot run more than ~10 threads at a time anyway.
You should probably use a more efficient method: the IEnumerable<T>.AsParallel() extension (it comes with .NET 4.0). At runtime it determines a sensible number of threads to run and divides your enumerable into that many partitions. Basically, it does the job for you - but it adds overhead of its own, so it's only worthwhile if processing one element is actually costly for the CPU.
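A sketch of what that looks like (the entities collection and ProcessEntity method are placeholders):

// usings assumed: System, System.Linq
var results = entities
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)   // optional explicit cap
    .Select(entity => ProcessEntity(entity))
    .ToList();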
From my experience, anything parallel should always be evaluated against the non-parallel version in real life, i.e. by actually profiling your application. Don't assume it's going to work better.
I'm trying to calculate directory sizes in a way that divides the load so that the user can see counting progress. I thought a logical way to do this would be to first create the directory tree then do an operation counting the length of all the files.
What I find unexpected is that the bulk of the time (disk I/O) goes into creating the directory tree; going over the FileInfo[] afterwards completes nearly instantly with virtually no disk I/O.
I've tried both Directory.GetDirectories() (simply building a tree of directory-name strings) and using a DirectoryInfo object, and both methods spend the bulk of the I/O time (reading the MFT, of course) compared to going over FileInfo.Length for all the files in each directory.
I guess there's no way to reduce the I/O needed to build the tree; I'm just wondering why this operation takes significantly more time than going over the far more numerous files?
Also, I'd appreciate a recommendation for a non-recursive way to tally things up (since it seems I need to split up the enumeration and balance it in order to make the size tallying more responsive). Making a thread for each subdirectory off the base and letting scheduler competition balance things out would probably not be very good, would it?
EDIT: Repository for this code
You can use Parallel.ForEach to run the directory size calculation in parallel. Call GetDirectories and run Parallel.ForEach over each node. Use a shared variable to keep track of the running size and display it to the user; each parallel branch increments that same variable. If needed, use lock() to synchronize between the parallel executions.
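A rough sketch of that idea (the root path is a placeholder; Interlocked.Add is used here instead of lock() for the shared counter):

// usings assumed: System.IO, System.Threading, System.Threading.Tasks
long totalBytes = 0;

Parallel.ForEach(Directory.GetDirectories(rootPath), dir =>
{
    long branchBytes = 0;
    // walk this branch of the tree and add up file sizes
    foreach (string file in Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
    {
        branchBytes += new FileInfo(file).Length;
    }
    // fold the branch total into the shared counter that the UI reads
    Interlocked.Add(ref totalBytes, branchBytes);
});

(Files sitting directly in rootPath would still need a separate pass, and inaccessible directories will throw UnauthorizedAccessException, so real code needs error handling.)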
The goal is to periodically read files from a folder to which another program outputs them, and then feed the files into another part of my code.
How can I accomplish this with the best trade-off between performance and readable, simple code?
(I need to accomplish this in both C# and Java, hence the double tagging. And no, this is not homework :))
I/O is a bottleneck for most programs, but if you're going for performance there are a couple of things you can do to help. One is to only read when you need to; this can be accomplished by using FileSystemWatcher to tell you when a file has changed. The second is, if possible, to spawn a new thread to do the I/O so you can continue without waiting for the operation to complete.
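A minimal FileSystemWatcher sketch (the folder path, filter, and handler are placeholders):

// usings assumed: System.IO
var watcher = new FileSystemWatcher(@"C:\incoming", "*.txt");
watcher.Created += (sender, e) =>
{
    ProcessFile(e.FullPath);          // hand the new file to the rest of the pipeline
};
watcher.EnableRaisingEvents = true;   // start watching

One caveat: the Created event can fire while the producing program is still writing the file, so the handler may need to retry opening it.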
In Java you can use a watch service. I believe it uses the same underlying system calls as C# would.
http://docs.oracle.com/javase/tutorial/essential/io/notification.html
I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.
My first thoughts are to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory and these are fairly large files so I'm not so keen on that idea.
My next thought is to have some sort of controller that will read in a line and assign it to a thread to parse, but I'm not sure if the controller will end up being a bottleneck if the threads are processing the lines faster than it can read and assign them.
I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it drops below 20, start reading again; you'd need to do some testing to find the optimal thresholds). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer" mark has been hit and start reading again. While it's doing that, the other threads can work through the rest of the queue.
Or, if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this, so I don't know how hard it is.
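For what it's worth, a bounded BlockingCollection (available since .NET 4.0) gives you that back-pressure for free: the reader simply blocks whenever the queue is full. A sketch, with the file name, worker count, and ProcessLine as placeholders:

// usings assumed: System.Collections.Concurrent, System.IO, System.Linq, System.Threading.Tasks
var queue = new BlockingCollection<string>(boundedCapacity: 100);

Task reader = Task.Run(() =>
{
    foreach (string line in File.ReadLines("records.txt"))
        queue.Add(line);            // blocks while the queue is full, so memory stays bounded
    queue.CompleteAdding();         // tell the workers that no more lines are coming
});

Task[] workers = Enumerable.Range(0, 3)
    .Select(_ => Task.Run(() =>
    {
        foreach (string line in queue.GetConsumingEnumerable())
            ProcessLine(line);      // placeholder for the actual parsing
    }))
    .ToArray();

Task.WaitAll(workers);
reader.Wait();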
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
This will eliminate bottlenecks of having a single thread do the reading:
open file
for each thread n=0,1,2,3:
seek to file offset n * filesize / 4
scan to next complete line
process all lines in your part of the file
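A C# sketch of that pseudocode (the path, worker count, and per-line handler are placeholders; each worker skips the partial line at its starting boundary and finishes any line that crosses its ending boundary, so every line is handled exactly once; single-byte text is assumed for simplicity):

// usings assumed: System, System.IO, System.Text, System.Threading.Tasks
static void ProcessChunk(string path, long start, long end, Action<string> processLine)
{
    // processes every line whose first byte lies in [start, end)
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        long pos = start;
        if (start > 0)
        {
            fs.Seek(start - 1, SeekOrigin.Begin);
            int prev = fs.ReadByte();               // byte just before our chunk
            if (prev != '\n')
            {
                // we landed mid-line: the previous worker owns this line, skip to the next one
                int skipped;
                while ((skipped = fs.ReadByte()) != -1)
                {
                    pos++;
                    if (skipped == '\n') break;
                }
            }
        }

        var sb = new StringBuilder();
        while (pos < end)                           // the next line starts inside our chunk
        {
            sb.Length = 0;
            int b;
            bool readAnything = false;
            while ((b = fs.ReadByte()) != -1)       // FileStream buffers internally, so this is not one syscall per byte
            {
                readAnything = true;
                pos++;
                if (b == '\n') break;
                sb.Append((char)b);                 // assumes single-byte text
            }
            if (!readAnything) break;               // end of file
            processLine(sb.ToString().TrimEnd('\r'));
        }
    }
}

// driver: split the file into 4 byte ranges and process them in parallel
string path = "records.txt";
long size = new FileInfo(path).Length;
const int workerCount = 4;
Parallel.For(0, workerCount, n =>
{
    long chunkStart = n * size / workerCount;
    long chunkEnd = (n + 1) * size / workerCount;   // exclusive; the last chunk ends at the file size
    ProcessChunk(path, chunkStart, chunkEnd, line => { /* parse one record */ });
});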
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
Since the bottleneck will generally be in the processing and not the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking, I'd look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's Lock-Free List code.
#lomaxx
#Derek & Mark: I wish there was a way to accept 2 answers. I'm going to have to go with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions. However, if I were processing a file where each record was guaranteed to require an equal amount of processing, then I really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks, and for each chunk have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive, and handing this off to several worker threads can speed up the purely logical / semantic part of the code, since it no longer has to do the string lookups and comparisons itself.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
Also, you mention you are concerned about the file occupying a large amount of memory. There are a couple of things you could do to cut back on your memory budget.
Split the file into chunks and parse it chunk by chunk. Read in only as many chunks as you are working on at a time, plus a few for "read ahead", so you do not stall on disk when you finish processing a chunk before moving on to the next one.
Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x CPUs is a good number for demand-paging apps), the threads that stall on IO for the memory-mapped file will be halted automatically by the OS until their memory is ready, and the other threads will continue to process.
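In .NET, memory mapping is available through System.IO.MemoryMappedFiles (since .NET 4.0); a minimal single-threaded sketch (the path and the parsing are placeholders):

// usings assumed: System.IO, System.IO.MemoryMappedFiles
using (var mmf = MemoryMappedFile.CreateFromFile("large.txt", FileMode.Open))
using (var view = mmf.CreateViewStream(0, 0))       // a size of 0 maps to the end of the file; pages load on demand
using (var reader = new StreamReader(view))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // parse one record; in the multi-threaded version each thread would
        // create its own view over a different offset/length range of the file
    }
}

One caveat: the view's length is rounded up to the system page size, so real code should stop at the actual file length rather than reading trailing zero bytes.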