I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file in a folder on a network share. That folder can contain 1 to 2 million images, some of them around 10 MB each. I have no control over the folder, so I can't restructure it; I am just providing the customer with an application that looks up image files in this big folder.
I was going to use the C# File.Exists() method to check for the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and of course the network. Any reasonable file system will implement it in such a way that size won't matter.
However, the total number of files may affect performance, because of the indexing of a large number of entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so the effect should be negligible: even with 5 million files and a logarithmic lookup, the file system has to examine at most about 23 entries, which is nothing. The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test that you really should just benchmark it yourself. Creating a simple console application that connects to the network folder and checks for known existing files and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
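A minimal sketch of such a console benchmark, assuming a hypothetical UNC path and hypothetical file names (swap in ones you know exist and ones you know don't):

using System;
using System.Diagnostics;
using System.IO;

class FileExistsBenchmark
{
    static void Main()
    {
        // Hypothetical share and file names - replace with real ones.
        string folder = @"\\server\imageshare";
        string[] candidates =
        {
            "known-existing-1.jpg",
            "known-existing-2.jpg",
            "known-missing-1.jpg",
            "known-missing-2.jpg"
        };

        foreach (string name in candidates)
        {
            string path = Path.Combine(folder, name);
            var sw = Stopwatch.StartNew();
            bool exists = File.Exists(path);
            sw.Stop();
            Console.WriteLine("{0}: exists={1}, {2} ms", name, exists, sw.ElapsedMilliseconds);
        }
    }
}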
We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I haven't written the actual file-transfer logic yet. I expect to use System.Net.WebClient for this, but am open to alternatives that save on execution time as well.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.
To determine whether it's useful to compress a file, you have to read the file anyway. While you're at it, you might as well just zip it.
If you want to avoid pointless zipping without reading the files, you could try to decide beforehand, based on other properties.
You could create an 'algorithm' that decides whether zipping is worthwhile, for example based on file extension and size. So a .txt file of more than 1 KB would be zipped, but a .jpg file wouldn't, regardless of its size. It's a lot of work to create such a list, though (alternatively, you could keep a blacklist or whitelist and allow or deny every file not on it).
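A rough sketch of such a rule, using a hypothetical blacklist of extensions that are already compressed and a hypothetical 1 KB minimum size:

using System;
using System.Collections.Generic;
using System.IO;

static class ZipDecision
{
    // Extensions that are typically already compressed and rarely shrink further (assumption).
    static readonly HashSet<string> AlreadyCompressed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3", ".mp4" };

    const long MinSizeToZip = 1024;   // don't bother zipping files under 1 KB (assumption)

    public static bool ShouldZip(string path)
    {
        var info = new FileInfo(path);
        if (info.Length < MinSizeToZip)
            return false;
        return !AlreadyCompressed.Contains(info.Extension);
    }
}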
You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can shrink the file, you will save on (disk and network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.
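If you would rather measure than guess, a rough sketch is to compress a small leading sample of the file in memory and see whether it actually shrinks (the 4 KB sample size and 90% threshold below are arbitrary assumptions):

using System;
using System.IO;
using System.IO.Compression;

static class ShrinkTest
{
    // Compresses a leading sample of the file in memory and reports whether it
    // shrank below ~90% of the sample size (sample size and threshold are arbitrary).
    public static bool LooksCompressible(string path, int sampleSize = 4096)
    {
        byte[] sample;
        using (var fs = File.OpenRead(path))
        {
            int toRead = (int)Math.Min(sampleSize, fs.Length);
            sample = new byte[toRead];
            int read = fs.Read(sample, 0, toRead);
            if (read < toRead)
                Array.Resize(ref sample, read);
        }

        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(sample, 0, sample.Length);
            }
            return ms.Length < sample.Length * 0.9;
        }
    }
}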
You can write your own fairly simple heuristic analyzer and then reuse what it has learned when processing each subsequent file. The collected statistics should be persisted to keep the efficiency between restarts.
The basic interface:
enum FileContentType
{
    PlainText,
    OfficeDoc,
    OfficeXlsx
}

// The name is ugly, so feel free to find a better one
public interface IHeuristicZipAnalyzer
{
    // Decides, from the statistics gathered so far, whether zipping this file is likely to pay off.
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);

    // Records the result of a file that was actually zipped.
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}
Then you can collect statistics by recording each file you have just zipped with AddInfo(...), and based on those statistics determine whether it is worth zipping the next file by calling IsWorthToZip(...).
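A minimal in-memory implementation of that interface might look like the following (persisting the statistics between restarts, as mentioned above, is left out):

using System.Collections.Generic;

public class AverageRatioZipAnalyzer : IHeuristicZipAnalyzer
{
    // Running totals per content type: [0] = original bytes, [1] = zipped bytes.
    private readonly Dictionary<FileContentType, long[]> totals =
        new Dictionary<FileContentType, long[]>();

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        long[] t;
        if (!totals.TryGetValue(contentType, out t))
        {
            t = new long[2];
            totals[contentType] = t;
        }
        t[0] += fileSizeInBytes;
        t[1] += finalZipSize;
    }

    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        long[] t;
        // No statistics yet: zip once so we can learn the ratio (assumption).
        if (!totals.TryGetValue(contentType, out t) || t[0] == 0)
            return true;

        // Only zip if past files of this type shrank by at least 10% (arbitrary threshold).
        double ratio = (double)t[1] / t[0];
        return ratio < 0.9;
    }
}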
I am writing a program which iterates through the file system multiple times using simple loops and recursion.
The problem is that, because I am iterating through it multiple times, it is taking a long time, because (I guess) the hard drive can only work at a certain pace.
Is there any way to optimize this process? Maybe by iterating through once, saving all the relevant information in a collection and then referring to the collection when I need to?
I know I can cache my results like this but I have absolutely no idea how to go about it.
Edit:
There are three main pieces of information I am trying to obtain from a given directory:
The size of the directory (the sum of the size of each file within that directory)
The number of files within the directory
The number of folders within the directory
All of the above includes sub-directories too. Currently, I am performing an iteration of a given directory to obtain each piece of information, i.e. three iterations per directory.
My output is basically a spreadsheet with one row per directory showing those three values.
To improve performance, you could access the Master File Table (MFT) of the NTFS file system directly. There is an excellent code sample on the MSDN social forum.
It seems that accessing the MFT is about 10x faster than enumerating the file system using FindFirstFile/FindNextFile.
Hope this helps.
Yes anything you can do to minimize hard drive I/O will improve the performance. I would also suggest putting in a Stopwatch and measure the time it takes so you can get a sense of how your improvements are affecting the speed.
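For example, here is a minimal single-pass sketch that collects all three numbers in one traversal and times it with a Stopwatch (the root path is hypothetical and error handling for inaccessible folders is omitted):

using System;
using System.Diagnostics;
using System.IO;

class DirectoryStats
{
    public long SizeInBytes;
    public int FileCount;
    public int FolderCount;
}

class Program
{
    static DirectoryStats Scan(string path)
    {
        var stats = new DirectoryStats();
        foreach (var file in Directory.GetFiles(path))
        {
            stats.FileCount++;
            stats.SizeInBytes += new FileInfo(file).Length;
        }
        foreach (var dir in Directory.GetDirectories(path))
        {
            stats.FolderCount++;
            var child = Scan(dir);                 // recurse once and reuse the results
            stats.SizeInBytes += child.SizeInBytes;
            stats.FileCount += child.FileCount;
            stats.FolderCount += child.FolderCount;
        }
        return stats;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var stats = Scan(@"C:\SomeFolder");        // hypothetical root folder
        sw.Stop();
        Console.WriteLine("{0} bytes, {1} files, {2} folders in {3} ms",
            stats.SizeInBytes, stats.FileCount, stats.FolderCount, sw.ElapsedMilliseconds);
    }
}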
I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270 GB of data every 24 hours. I have a dedicated server to run this job and a LAN of 1 Gbps. Currently I am reading lines from a (text) file, copying them into a buffered stream and writing them to the destination.
My last test copied about 2.5 GB of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then copying with a buffered stream. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.
You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
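A short sketch of zipping to a stream with DotNetZip's Ionic.Zip API (the paths are made up, and this assumes the ZipFile.AddFile and Save(Stream) overloads behave as documented):

using System.IO;
using Ionic.Zip;   // DotNetZip

class ZipToStreamExample
{
    static void Main()
    {
        // Hypothetical source log and destination on the storage server.
        using (var zip = new ZipFile())
        using (var output = File.Create(@"\\storageserver\backups\logs.zip"))
        {
            zip.AddFile(@"C:\Logs\app.log", "");   // "" = put it in the root of the archive
            zip.Save(output);                      // Save(Stream) writes the archive to any stream
        }
    }
}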
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.
One of the solutions is what you mentioned: compress the files into one zip file and then transfer it over the network. This will be much faster, as you are transferring a single file, and one of the principal bottlenecks during file transfers is often the destination's security checks.
So if you use one zip file, there is only one check.
In short:
Compress
Transfer
Decompress (if you need)
This alone should bring you big benefits in terms of performance.
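A hedged sketch of that compress-transfer-decompress flow using the framework's System.IO.Compression.ZipFile class (.NET 4.5+; all paths are made up):

using System.IO;
using System.IO.Compression;   // requires the System.IO.Compression.FileSystem assembly

class CompressThenTransfer
{
    static void Main()
    {
        string source = @"C:\Logs\Server01";                       // hypothetical log folder
        string localZip = @"C:\Temp\Server01-logs.zip";
        string destination = @"\\storage\backups\Server01-logs.zip";

        // 1. Compress everything into a single archive.
        ZipFile.CreateFromDirectory(source, localZip);

        // 2. Transfer one file instead of thousands.
        File.Copy(localZip, destination, overwrite: true);

        // 3. Decompress on the destination side if needed:
        // ZipFile.ExtractToDirectory(destination, @"D:\Restored\Server01");
    }
}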
Compress the logs at the source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke it) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .NET?)
In any event, do not read your files line by line; read them in blocks (loop on FileStream.Read with, say, a 4 KB buffer until the read count is 0) and send that directly to the network pipe.
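Something along these lines (4 KB buffer, hypothetical paths; the destination stream could just as well be a NetworkStream):

using System.IO;

class BlockCopy
{
    static void Main()
    {
        // Hypothetical source and destination; dest could be a NetworkStream instead.
        using (var source = File.OpenRead(@"C:\Logs\app.log"))
        using (var dest = File.Create(@"\\storage\backups\app.log"))
        {
            var buffer = new byte[4096];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                dest.Write(buffer, 0, read);
            }
        }
    }
}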
Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "Premature optimisation is the root of all evil".
Once, in a similar scenario at work, I was given the task of optimising the process. After profiling, the bottleneck turned out to be a call to a sleep function (which was used for synchronisation between threads!).
I have an HDD (say 1 TB) with FAT32 and NTFS partitions, and I don't have information about which files are stored on it, but when needed I want to quickly find large files, say over 500 MB. I don't want to scan my whole HDD, since that is very time-consuming; I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can call directly - I mean, if they keep some metadata about the stored files, it would be quicker. I want to write my program in C++ or C#.
EDIT
If scanning the HDD is the only option, what can I do to ensure the best performance? For example, I could skip scanning system folders, since I am only interested in user data.
If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that tells which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas for the large files, you can read the FAT in as a number of raw sectors, and scan through that data to find linked lists of more than X clusters (where X defines the lower limit for what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain are fixed-size records of a specified format. You have to start from the root directory, and read through the directory tree to find file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) that contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult -- and writing data much more difficult. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on what partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\\.\D:" (where D is the letter of the disk you care about). You can then read raw sectors from that disk drive (assuming that opening it worked). This will let you see the raw data for the entire disk (or partition, as the case may be) starting from the boot sector, and going through everything else that's there (FAT, root directory, subdirectories, etc. -- all as sectors of raw data). The system will let you read the raw data, but all the work to make any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility -- but it'll take a fair amount of work for FAT volumes, and considerably more than that for NTFS. Unless you really need the extra speed as you've said, it's probably not even worth considering trying to do this.
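For what it's worth, opening the raw volume from C# typically goes through CreateFile via P/Invoke; a minimal read sketch follows (administrator rights are required, and reads should be in multiples of the sector size):

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class RawVolumeRead
{
    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint desiredAccess, uint shareMode, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x1;
    const uint FILE_SHARE_WRITE = 0x2;
    const uint OPEN_EXISTING = 3;

    static void Main()
    {
        // Open the raw D: volume (requires administrator rights).
        SafeFileHandle handle = CreateFile(@"\\.\D:", GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE, IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid)
            throw new IOException("CreateFile failed", Marshal.GetLastWin32Error());

        using (var volume = new FileStream(handle, FileAccess.Read))
        {
            var bootSector = new byte[512];        // reads should be sector-sized multiples
            volume.Read(bootSector, 0, bootSector.Length);
            Console.WriteLine("First bytes: {0:X2} {1:X2} {2:X2}",
                bootSector[0], bootSector[1], bootSector[2]);
        }
    }
}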
If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here you can find information about the search indexer. The search indexer does index the file size so it may do what you want.
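One way to query the index from C# is through the Windows Search OLE DB provider; a hedged sketch (the provider string and SYSTEMINDEX query syntax are from the Windows Search documentation, and 500 MB is just the figure from the question):

using System;
using System.Data.OleDb;

class BigFileQuery
{
    static void Main()
    {
        const string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
        // Files larger than 500 MB, according to the index.
        const string query =
            "SELECT System.ItemPathDisplay, System.Size FROM SYSTEMINDEX " +
            "WHERE System.Size > 524288000";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0} ({1} bytes)", reader[0], reader[1]);
                }
            }
        }
    }
}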
Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.
Your only prayer is to latch onto a file indexer, otherwise you will have to iterate through all files. Depending on your computer you might be able to latch onto the native Microsoft indexer (searchindexer.exe) or if you have Google Desktop search you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer
I need a test app that will create a large number of small files on disk as fast as possible.
Will async operations help with creating the files or just with writing them? Is there a way to speed up the whole process? (Writing everything to a single file is not an option.)
Wouldn't physical drive IO be the bottleneck here? You'll probably get different results if you write to a 4200rpm drive versus a 10,000rpm drive versus an ultrafast SSD.
It's hard for me to say without writing a test app myself, but disk access will be synchronized anyway, so it's not like you will have multiple threads writing to the disk at the same time. You could speed up performance by using threads if there were a fair amount of processing done before writing out each file.
If it's possible to test your app using a ramdisk it would probably speed up things considerably.
If possible, don't write them all in the same directory. Many filesystems slow down when dealing with directories containing large numbers of files. (I once brought our fileserver at work, which normally happily serves the whole office, to its knees by writing thousands of files to the same directory).
Instead, make a new directory for each 1000 files or so.
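A small sketch of that sharding idea (the counts, paths and file contents are just placeholders):

using System.IO;

class ShardedWriter
{
    static void Main()
    {
        const string root = @"C:\Temp\manyfiles";   // hypothetical output root
        const int filesPerFolder = 1000;
        const int totalFiles = 100000;

        for (int i = 0; i < totalFiles; i++)
        {
            // Put each batch of 1000 files into its own subdirectory.
            string folder = Path.Combine(root, (i / filesPerFolder).ToString("D4"));
            Directory.CreateDirectory(folder);       // no-op if it already exists
            File.WriteAllText(Path.Combine(folder, i + ".txt"), "test data");
        }
    }
}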