optimizing streaming in of many small files - c#

I have hundreds of thousands of small text files, each between 0 and 8 KB, on a LAN network share. I can use interop calls into kernel32.dll (FindFirstFileEx/FindNextFile) to recursively pull a list of the fully qualified UNC path of each file and store the paths in memory in a collection class such as List<string>. Using this approach I was able to populate the List<string> fairly quickly (about 30 seconds per 50k file names, compared to 3 minutes with Directory.GetFiles).
However, once I've crawled the directories and stored the file paths in the List<string>, I want to make a pass over every path in the list, read the contents of each small text file, and perform some action based on the values read in.
As a test bed I iterated over each file path in a List<string> that stored 42,945 file paths to this LAN network share and performed the following lines on each FileFullPath:
using (StreamReader file = new StreamReader(FileFullPath))
{
    file.ReadToEnd();
}
With just these lines, the run takes 13-15 minutes for all 42,945 file paths stored in my list.
Is there a more optimal way to load in many small text files via C#? Is there some interop I should consider? Or is this pretty much the best I can expect? It just seems like an awfully long time.

I would consider using Directory.EnumerateFiles, and then processing your files as you read them.
This would prevent the need to actually store the list of 42,945 files at once, as well as open up the potential of doing some of the processing in parallel using PLINQ (depending on the processing requirements of the files).
If the processing has a reasonably large CPU portion of the total time (and it's not purely I/O bound), this could potentially provide a large benefit in terms of complete time required.
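For illustration, here is a minimal sketch of that approach; the share path and the ProcessContents method are placeholders, and the degree of parallelism is just a starting point to tune:

using System;
using System.IO;
using System.Linq;

class SmallFileProcessor
{
    static void Main()
    {
        const string root = @"\\server\share"; // hypothetical share path

        Directory.EnumerateFiles(root, "*.txt", SearchOption.AllDirectories)
                 .AsParallel()               // PLINQ: overlap network I/O waits with CPU work
                 .WithDegreeOfParallelism(8) // tune for your network and CPU
                 .ForAll(path =>
                 {
                     string contents = File.ReadAllText(path);
                     ProcessContents(path, contents); // placeholder for your per-file action
                 });
    }

    static void ProcessContents(string path, string contents)
    {
        // whatever action is performed on the values read in
    }
}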

Is File.Exists() suitable for big directories?

I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file over a network. The folder the file is in is on a network share and can contain 1 to 2 million images, some of them as large as 10 MB. I have no control over this folder, so I can't restructure it; I am just providing the application the customer uses to look for image files in this big folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and, of course, on the network. Any reasonable file system will implement it in such a way that file size doesn't matter.
However, the total number of files may affect performance, because a large number of entries has to be indexed. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so it should be negligible: even for 5 million files a logarithmic lookup touches at most about 23 entries (log2(5,000,000) ≈ 22.3), which is nothing. The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test you really should just benchmark it yourself. Creating a simple console application that can connect to the network folder and check for known existing files, and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
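A minimal benchmark sketch along those lines; the share path and the two file names are placeholders for files you know do and do not exist:

using System;
using System.Diagnostics;
using System.IO;

class ExistsBenchmark
{
    static void Main()
    {
        string existing = @"\\server\images\known-existing.jpg"; // placeholder
        string missing  = @"\\server\images\known-missing.jpg";  // placeholder

        const int iterations = 100;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            File.Exists(existing);
            File.Exists(missing);
        }
        sw.Stop();

        double perCall = sw.Elapsed.TotalMilliseconds / (iterations * 2);
        Console.WriteLine("Average per File.Exists call: {0:F2} ms", perCall);
    }
}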

Concatenate large files using Win NT kernel API

I've been looking around for a way to concatenate large files (a few gigabytes) together without having to rewrite one of the files. I am sure the OS does this internally when manipulating the master file table. This is purely for an internal application where speed is critical, even at the cost of data integrity (which is why I'm willing to risk undocumented APIs). The app processes a large amount of high-bandwidth, multi-channel ethernet data, where a corrupt unit of work (a file, in this case) will not have a large impact on overall processing results.
At the moment, when combining files A and B, the effort involved is A[Read] + B[Read] + C[Write]. Would any of you NT gurus shed some light on how to work around this and get at the MFT directly?
I have not been able to find any clues as to which API to explore and would appreciate some pointers. Although the app is managed, I would gladly explore native APIs and even set up lightweight VMs for testing.
Thanks in advance.
If you are appending File B to File A, all you have to do is open File A for write/append, seek to the end of the file, then read from B and write to A.
If you want to create File C as the concatenation of File A and File B, then you are going to have to create File C and copy A to C, then B to C.
There aren't any shortcuts.
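For example, a minimal sketch of appending B onto A with streams (the paths are placeholders); note that it still costs a full read of B and a write of the same bytes to the end of A:

using System.IO;

class AppendFiles
{
    static void Main()
    {
        const string fileA = @"C:\data\A.bin"; // placeholder
        const string fileB = @"C:\data\B.bin"; // placeholder

        using (var dest = new FileStream(fileA, FileMode.Append, FileAccess.Write))
        using (var src = new FileStream(fileB, FileMode.Open, FileAccess.Read))
        {
            src.CopyTo(dest); // FileMode.Append positions the stream at the end of A
        }
    }
}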
That's not really something a file system would do. File systems allocate space for files in terms of clusters and blocks of data, not in terms of bytes. Concatenating two files like this would only work if they were both multiples of the cluster size, and the FS might have other assumptions about how blocks are allocated to files under the covers. You might be able to do this yourself to the file system if you dismounted it and wrote a tool to directly manipulate all the file system structures. But you're risking corrupting the whole disk if you do that, not just a single file.
I don't know your exact situation, but would it be possible to not append the files together at all? Just keep writing files into some directory as you receive data, and keep an index.
Then, as the data is needed, use the index to piece the data together into one new file.
That way you only ever do the expensive file merging on demand.
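A rough sketch of that idea, with the index kept as a plain text file listing the chunk files in arrival order (the names and layout here are assumptions, not part of the original suggestion):

using System.IO;

static class ChunkStore
{
    // Record each chunk file as it arrives.
    public static void AddChunk(string indexPath, string chunkPath)
    {
        File.AppendAllLines(indexPath, new[] { chunkPath });
    }

    // Merge all recorded chunks into a single file, only when it is actually needed.
    public static void MergeOnDemand(string indexPath, string outputPath)
    {
        using (var output = File.Create(outputPath))
        {
            foreach (string chunkPath in File.ReadLines(indexPath))
            {
                using (var chunk = File.OpenRead(chunkPath))
                {
                    chunk.CopyTo(output);
                }
            }
        }
    }
}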

When does it become worthwhile to spend the execution time to zip files?

We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I haven't written the actual file transfer logic yet, but I expect to use System.Net.WebClient to do this; I am also open to alternatives that save on execution time.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.
To determine whether it's useful to compress a file, you have to read the file anyway. While you're at it, you might as well zip it.
If you want to prevent useless zipping without reading the files, you could try to decide beforehand, based on other properties.
You could create an 'algorithm' that decides whether it's useful, for example based on file extension and size. So a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't be, regardless of its size. It is a fair amount of work to create such a list, though (you could also create a blacklist or whitelist and allow or deny all files not on it).
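A minimal sketch of such an extension-and-size rule; the extension list and the 1 KB threshold are only examples:

using System;
using System.Collections.Generic;
using System.IO;

static class ZipDecision
{
    // Formats that are already compressed and rarely shrink further.
    static readonly HashSet<string> AlreadyCompressed = new HashSet<string>(
        new[] { ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3", ".mp4" },
        StringComparer.OrdinalIgnoreCase);

    public static bool ShouldZip(string path)
    {
        var info = new FileInfo(path);
        if (info.Length < 1024)
            return false; // too small to be worth the CPU time
        if (AlreadyCompressed.Contains(info.Extension))
            return false; // unlikely to shrink enough to pay for itself
        return true;
    }
}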
You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can shrink the file, you will save on (disk and network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.
You can write a pretty simple heuristic analysis of your own and reuse it as each subsequent file is processed. The collected statistics should be persisted so the heuristic stays effective across restarts.
A basic interface:
enum FileContentType
{
    PlainText,
    OfficeDoc,
    OfficeXlsx
}

// The name is ugly, so find a better one
public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}
Then you can collect statistics by calling AddInfo(...) for each file you have just zipped, and, based on those statistics, decide whether it is worth zipping the next file by calling IsWorthToZip(...).
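A minimal sketch of one way to implement that interface, tracking the average compression ratio per content type; the 90% threshold is just an example, and persisting the statistics between restarts is left out:

using System.Collections.Generic;

public class HeuristicZipAnalyzer : IHeuristicZipAnalyzer
{
    // Running totals per content type: original bytes seen and zipped bytes produced.
    private readonly Dictionary<FileContentType, (long original, long zipped)> stats =
        new Dictionary<FileContentType, (long original, long zipped)>();

    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        if (!stats.TryGetValue(contentType, out var s) || s.original == 0)
            return true; // no data yet for this type: zip it once so we can learn

        double ratio = (double)s.zipped / s.original;
        return ratio < 0.9; // keep zipping only if we historically save at least ~10%
    }

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        stats.TryGetValue(contentType, out var s);
        stats[contentType] = (s.original + fileSizeInBytes, s.zipped + finalZipSize);
    }
}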

Multiple iterations through a file structure (C#)

I am writing a program which iterates through the file system multiple times using simple loops and recursion.
The problem is that, because I iterate through it multiple times, it takes a long time, presumably because the hard drive can only work at a certain pace.
Is there any way to optimize this process? Maybe by iterating though once, saving all the relevant information in a collection and then referring to the collection when I need to?
I know I can cache my results like this but I have absolutely no idea how to go about it.
Edit:
There are three main pieces of information I am trying to obtain from a given directory:
The size of the directory (the sum of the size of each file within that directory)
The number of files within the directory
The number of folders within the directory
All of the above includes sub-directories too. Currently, I perform a separate iteration over a given directory for each piece of information, i.e. three iterations per directory.
My output is basically a spreadsheet containing these three values for each directory.
To improve performance, you could access the Master File Table (MFT) of the NTFS file system directly. There is an excellent code sample on the MSDN social forum.
It seems that accessing the MFT is about 10x faster than enumerating the file system using FindFirstFile/FindNextFile.
Hope this helps.
Yes, anything you can do to minimize hard drive I/O will improve performance. I would also suggest putting in a Stopwatch and measuring the time it takes, so you can get a sense of how your improvements affect the speed.
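For instance, a single recursive pass can collect all three values at once and cache them per directory, so nothing is read from the disk twice (a minimal sketch; only inaccessible folders are handled):

using System;
using System.Collections.Generic;
using System.IO;

class DirectoryStats
{
    public long TotalSize;   // sum of file sizes, including sub-directories
    public int FileCount;    // number of files, including sub-directories
    public int FolderCount;  // number of sub-directories

    // Results cached so each directory tree is only walked once.
    static readonly Dictionary<string, DirectoryStats> Cache =
        new Dictionary<string, DirectoryStats>(StringComparer.OrdinalIgnoreCase);

    public static DirectoryStats Get(string path)
    {
        if (Cache.TryGetValue(path, out var cached))
            return cached;

        var stats = new DirectoryStats();
        try
        {
            foreach (var file in Directory.EnumerateFiles(path))
            {
                stats.FileCount++;
                stats.TotalSize += new FileInfo(file).Length;
            }
            foreach (var dir in Directory.EnumerateDirectories(path))
            {
                stats.FolderCount++;
                var child = Get(dir); // recurse once; later lookups hit the cache
                stats.TotalSize += child.TotalSize;
                stats.FileCount += child.FileCount;
                stats.FolderCount += child.FolderCount;
            }
        }
        catch (UnauthorizedAccessException)
        {
            // skip folders we are not allowed to read
        }

        Cache[path] = stats;
        return stats;
    }
}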

Windows File system API to query large files

I have an HDD (say 1 TB) with FAT32 and NTFS partitions, and I don't have information about which files are stored on it, but when needed I want to quickly find large files, say over 500 MB. I don't want to scan the whole HDD since that is very time consuming; I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can call directly - I mean, if they keep some metadata about the stored files, then the lookup would be quicker. I want to write my program in C++ and C#.
EDIT
If scanning the HDD is the only option, then what can I do to ensure the best performance? For example, I could skip scanning system folders, since I am only interested in user data.
If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that tells which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas for the large files, you can read the FAT in as a number of raw sectors, and scan through that data to find linked lists of more than X clusters (where X defines the lower limit for what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain are fixed-size records of a specified format. You have to start from the root directory and read through the directory tree to find file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) that contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult -- and writing data much more difficult. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on what partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\\.\D:" (where D is the letter of the drive you care about). You can then read raw sectors from that drive (assuming that opening it worked). This will let you see the raw data for the entire disk (or partition, as the case may be) starting from the boot sector, and going through everything else that's there (FAT, root directory, subdirectories, etc. -- all as sectors of raw data). The system will let you read the raw data, but all the work to make any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility -- but it'll take a fair amount of work for FAT volumes, and considerably more than that for NTFS. Unless you really need extra speed like you've said, it's probably not even worth considering.
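A minimal C# sketch of opening a volume that way and reading its boot sector; the drive letter is a placeholder, the program needs administrator rights, and volume reads must be done in sector-sized, sector-aligned chunks:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class RawVolumeRead
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint desiredAccess, uint shareMode, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x1, FILE_SHARE_WRITE = 0x2;
    const uint OPEN_EXISTING = 3;

    static void Main()
    {
        SafeFileHandle handle = CreateFile(@"\\.\D:", GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE, IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid)
            throw new IOException("CreateFile failed", Marshal.GetLastWin32Error());

        using (handle)
        using (var volume = new FileStream(handle, FileAccess.Read))
        {
            byte[] bootSector = new byte[512]; // one sector, starting at offset 0
            volume.Read(bootSector, 0, bootSector.Length);
            Console.WriteLine("First bytes of boot sector: {0:X2} {1:X2} {2:X2}",
                bootSector[0], bootSector[1], bootSector[2]);
        }
    }
}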
If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here, you can find information about the search indexer. The search indexer does index the file size, so it may do what you want.
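For illustration, a sketch of querying the Windows Search index for large files through its OLE DB provider; this only returns results for locations that are actually indexed, and the 500 MB threshold matches the question:

using System;
using System.Data.OleDb;

class IndexedLargeFiles
{
    static void Main()
    {
        // Windows Search OLE DB provider.
        const string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";

        const string query =
            "SELECT System.ItemPathDisplay, System.Size " +
            "FROM SYSTEMINDEX WHERE System.Size > 524288000"; // > 500 MB

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}  ({1} bytes)", reader[0], reader[1]);
            }
        }
    }
}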
Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.
Your only prayer is to latch onto a file indexer; otherwise you will have to iterate through all the files. Depending on your machine, you might be able to latch onto the native Microsoft indexer (SearchIndexer.exe), or if you have Google Desktop Search installed, you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer
