Multiple iterations through a file structure (C#)

I am writing a program which iterates through the file system multiple times using simple loops and recursion.
The problem is that, because I am iterating through multiple times, it is taking a long time because (I guess) the hard drive can only work at a certain pace.
Is there any way to optimize this process? Maybe by iterating through once, saving all the relevant information in a collection, and then referring to the collection when I need to?
I know I can cache my results like this but I have absolutely no idea how to go about it.
Edit:
There are three main pieces of information I am trying to obtain from a given directory:
The size of the directory (the sum of the size of each file within that directory)
The number of files within the directory
The number of folders within the directory
All of the above includes sub-directories too. Currently, I am performing an iteration of a given directory to obtain each piece of information, i.e. three iterations per directory.
My output is basically a spreadsheet listing those three values for each directory.

To improve performance, you could access the Master File Table (MFT) of the NTFS file system directly. There is an excellent code sample on the MSDN Social forums.
It seems that accessing the MFT is about 10x faster than enumerating the file system using FindFirstFile/FindNextFile.
Hope this helps.

Yes, anything you can do to minimize hard drive I/O will improve the performance. I would also suggest putting in a Stopwatch and measuring the time it takes, so you can get a sense of how your improvements affect the speed.
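For the cache-once idea in the question, a minimal sketch of a single recursive pass that records size, file count and folder count for every directory in a dictionary keyed by path might look like this (DirectoryStats and Walk are illustrative names, and there is no handling of access-denied folders):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

class DirectoryStats
{
    public long TotalSize;    // sum of file sizes, including sub-directories
    public int FileCount;     // number of files, including sub-directories
    public int FolderCount;   // number of sub-directories, recursively
}

class Walker
{
    // One pass over the tree; every directory's totals end up in the cache.
    static DirectoryStats Walk(string path, Dictionary<string, DirectoryStats> cache)
    {
        var stats = new DirectoryStats();
        foreach (var file in Directory.GetFiles(path))
        {
            stats.TotalSize += new FileInfo(file).Length;
            stats.FileCount++;
        }
        foreach (var dir in Directory.GetDirectories(path))
        {
            var child = Walk(dir, cache);           // recurse once per sub-directory
            stats.TotalSize += child.TotalSize;     // roll child totals into the parent
            stats.FileCount += child.FileCount;
            stats.FolderCount += child.FolderCount + 1;
        }
        cache[path] = stats;
        return stats;
    }

    static void Main(string[] args)
    {
        var cache = new Dictionary<string, DirectoryStats>(StringComparer.OrdinalIgnoreCase);
        var sw = Stopwatch.StartNew();
        Walk(args[0], cache);
        sw.Stop();
        Console.WriteLine("Scanned {0} directories in {1} ms", cache.Count, sw.ElapsedMilliseconds);
    }
}
After the single pass, filling the spreadsheet is just a dictionary lookup per directory instead of another walk.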

Related

Is File.Exists() suitable for big directories?

I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file on a network share. The folder the file is in can have 1 to 2 million images, and some of these images are 10 megabytes big. I have no control over this folder, so I can't restructure it. I am just providing the application to the customer to look for image files in this big folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and of course the network. Any reasonable file system will implement it in such a way that size won't matter.
However, the total number of files may affect performance, because of the indexing of a large number of entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so it should be negligible (even for 5 million files at log scale, the FS has to scan at most 23 entries, which is nothing). The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test you really should just benchmark it yourself. Creating a simple console application that can connect to the network folder and check for known existing files, and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
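If it helps, a throwaway benchmark along those lines could be as simple as the following; the UNC paths are placeholders for files you know exist and files you know are missing:
using System;
using System.Diagnostics;
using System.IO;

class ExistsBenchmark
{
    static void Main()
    {
        // Placeholder paths -- point these at real files on the network share.
        string[] probes =
        {
            @"\\server\images\known-existing.jpg",
            @"\\server\images\definitely-missing.jpg"
        };

        foreach (var path in probes)
        {
            var sw = Stopwatch.StartNew();
            bool exists = File.Exists(path);
            sw.Stop();
            Console.WriteLine("{0} -> {1} in {2} ms", path, exists, sw.ElapsedMilliseconds);
        }
    }
}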

C# Obtain Random Folder In Directory Over Network

I'm writing a little app to pull down a few valid samples of each particular type, from a much larger pile of samples.
The structure looks like:
ROOT->STATE->TYPE->SAMPLE
My program cruises through the states, and grabs each unique type, and the path to that type. Once all those are obtained, it goes through each type, and selects X random samples, with X supplied by the user.
The program works great locally, but over the network it's obviously much slower. I've taken measures to help this, but the last part I'm hung up on is getting the random samples from the TYPE directory fast.
Locally, I use
List<String> directories = Directory.GetDirectories(kvp.Value).ToList();
Which is the bottleneck when running this over the network. I have a feeling this may not be possible, but is there a way to grab, say, 5 random samples from the TYPE directory without first identifying all the samples?
Hopefully I have been clear enough, thank you.
Perhaps try using DirectoryInfo; when making lots of calls against a specific directory it's faster, because security isn't re-checked on every access.
You may find speed increases from using a DirectoryInfo object for the root and the sub-folders you want and listing directories that way. That will get you minor speed increases, as .NET's lazy-initialisation strategy means the static Directory methods you employ in your sample take more network round trips.
The next question, I suppose, is why speed is important. Have you considered maintaining an up-to-date index in a cache of your own design for speedy access, either using a FileSystemWatcher, a regular poll, or both?
I think you may also be interested in this link: Checking if folder has files
... it contains some information about limiting your network calls to the bare minimum by retrieving information about the entire directory structure from one call. This will no doubt increase your memory requirements however.
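Building on the DirectoryInfo suggestion, one way to cut memory and avoid a second pass is to stream EnumerateDirectories through a reservoir sample, picking X folders at random in a single pass without ever holding the full list; it still has to touch every entry once over the network, so treat this as a sketch rather than a drop-in replacement for your code:
using System;
using System.Collections.Generic;
using System.IO;

static class RandomSampler
{
    // Reservoir sampling: keeps 'count' directories chosen uniformly at random
    // while streaming the enumeration once, never materializing the full list.
    public static List<DirectoryInfo> TakeRandom(string typePath, int count, Random rng)
    {
        var reservoir = new List<DirectoryInfo>(count);
        int seen = 0;
        foreach (var sample in new DirectoryInfo(typePath).EnumerateDirectories())
        {
            seen++;
            if (reservoir.Count < count)
            {
                reservoir.Add(sample);
            }
            else
            {
                int slot = rng.Next(seen);       // uniform in 0 .. seen-1
                if (slot < count)
                    reservoir[slot] = sample;    // replace with probability count/seen
            }
        }
        return reservoir;
    }
}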
Is the name of each kind of file predictable? Would you have better luck randomly predicting some sample names and reading them directly?

optimizing streaming in of many small files

I have hundreds of thousands of small text files, between 0 and 8 KB each, on a LAN network share. I can use some interop calls with kernel32.dll and FindFirstFileEx to recursively pull a list of the fully qualified UNC path of each file and store the paths in memory in a collection class such as List<string>. Using this approach I was able to populate the List<string> fairly quickly (about 30 seconds per 50k file names, compared to 3 minutes with Directory.GetFiles).
However, once I've crawled the directories and stored the file paths in the List<string>, I want to make a pass over every path stored in my list, read the contents of each small text file, and perform some action based on the values read in.
As a test bed I iterated over each file path in a List<string> that stored 42,945 file paths to this LAN network share and performed the following lines on each FileFullPath:
StreamReader file = new StreamReader(FileFullPath);
file.ReadToEnd();
file.Close();
So with just these lines, it takes 13-15 minutes to run over all 42,945 file paths stored in my list.
Is there a more optimal way to load in many small text files via C#? Is there some interop I should consider? Or is this pretty much the best I can expect? It just seems like an awfully long time.
I would consider using Directory.EnumerateFiles, and then processing your files as you read them.
This would prevent the need to actually store the list of 42,945 files at once, as well as open up the potential of doing some of the processing in parallel using PLINQ (depending on the processing requirements of the files).
If the processing has a reasonably large CPU portion of the total time (and it's not purely I/O bound), this could potentially provide a large benefit in terms of complete time required.
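A rough sketch of that suggestion; the share path, file pattern, and per-file work are placeholders:
using System;
using System.IO;
using System.Linq;

class SmallFileProcessor
{
    static void Main()
    {
        string root = @"\\server\share";   // placeholder UNC root

        long total = Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .AsParallel()                            // PLINQ: overlap reads with processing
            .Select(path => File.ReadAllText(path))
            .Sum(text => (long)text.Length);         // stand-in for the real per-file work

        Console.WriteLine("Total characters read: {0:N0}", total);
    }
}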

Calculating directory sizes

I'm trying to calculate directory sizes in a way that divides the load so that the user can see counting progress. I thought a logical way to do this would be to first create the directory tree then do an operation counting the length of all the files.
What I find unexpected is that the bulk of the time (disk I/O) goes into creating the directory tree; going over the FileInfo[] afterwards is nearly instant, with virtually no disk I/O.
I've tried both Directory.GetDirectories(), simply creating a tree of strings of the directory names, and using a DirectoryInfo object, and both methods still take the bulk of the I/O time (reading the MFT, of course) compared to going over all the FileInfo.Length values for the files in each directory.
I guess there's no way to significantly reduce the I/O needed to build the tree; I'm just wondering why this operation takes so much more time than going over the far more numerous files?
Also, can anyone recommend a non-recursive way to tally things up (since it seems I need to split up the enumeration and balance it in order to make the size tallying more responsive)? Making a thread for each subdirectory off the base and letting scheduler competition balance things out would probably not be very good, would it?
EDIT: Repository for this code
You can use Parallel.ForEach to run the directory-size calculation in parallel. Get the top-level directories with GetDirectories and run Parallel.ForEach over each node. Keep a running total in a shared variable and display that to the user; since each parallel task increments the same variable, synchronize the updates with lock() or Interlocked.
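A rough sketch of that idea, using Interlocked.Add rather than lock() for the shared total (files sitting directly in the root are skipped for brevity, and there is no handling of access-denied folders):
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ParallelSizer
{
    static long _totalBytes;   // shared running total a progress UI could poll

    static void Main(string[] args)
    {
        Parallel.ForEach(Directory.GetDirectories(args[0]), dir =>
        {
            long subtotal = 0;
            foreach (var file in Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
            {
                subtotal += new FileInfo(file).Length;
            }
            Interlocked.Add(ref _totalBytes, subtotal);   // no lock() needed for a long counter
        });

        Console.WriteLine("Total: {0:N0} bytes", Interlocked.Read(ref _totalBytes));
    }
}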

Windows File system API to query large files

I have an HDD (say 1 TB) with FAT32 and NTFS partitions, and I don't have information about which files are stored on it, but when needed I want to quickly find large files, say over 500 MB. I don't want to scan my whole HDD, since that is very time consuming; I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can call directly - I mean, if they keep some metadata about the files that are stored, it would be quicker. I want to write my program in C++ and C#.
EDIT
If scanning the HDD is the only option, what can I do to ensure the best performance? For example, I could skip scanning system folders, since I am only interested in user data.
If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that tells which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas for the large files, you can read the FAT in as a number of raw sectors and scan through that data to find linked lists of more than X clusters (where X defines the lower limit for what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain is fixed-size records of a specified format. You have to start from the root directory and read through the directory tree to find file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) that contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult -- and writing data much more difficult. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on what partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\\.\D:" (where D is the letter of the drive you care about). You can then read raw sectors from that disk drive (assuming that opening it worked). This will let you see the raw data for the entire disk (or partition, as the case may be), starting from the boot sector and going through everything else that's there (FAT, root directory, subdirectories, etc. -- all as sectors of raw data). The system will let you read the raw data, but all the work to make any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility -- but it'll take a fair amount of work for FAT volumes, and considerably more than that for NTFS. Unless you really need extra speed like you've said, it's probably not even worth considering trying to do this.
If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here you can find information about the search indexer. The search indexer does index the file size so it may do what you want.
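For what it's worth, the Windows Search index can also be queried from C# through its OLE DB provider; this is a sketch that assumes the indexer service is running and that the volumes you care about are inside its indexing scope:
using System;
using System.Data.OleDb;

class LargeFileQuery
{
    static void Main()
    {
        // Windows Search OLE DB provider.
        const string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";

        // Files larger than 500 MB (500 * 1024 * 1024 bytes).
        const string query =
            "SELECT System.ItemPathDisplay, System.Size " +
            "FROM SYSTEMINDEX WHERE System.Size > 524288000";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}  ({1:N0} bytes)",
                        reader.GetString(0), Convert.ToInt64(reader[1]));
                }
            }
        }
    }
}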
Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.
Your only prayer is to latch onto a file indexer, otherwise you will have to iterate through all files. Depending on your computer you might be able to latch onto the native Microsoft indexer (searchindexer.exe) or if you have Google Desktop search you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer
