I'm not really sure what causes this so please forgive me if I couldn't find the information I needed in a search. Here is an example:
Let's say that we have a folder with 1,000,000 files. Running Directory.GetFiles() on that folder will take a few minutes. However, running it again right after will take only a few seconds. Why does this happen? Are the objects being cached somewhere? How can I make it take the original time again?
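(For reference, a rough way to observe the effect is to time two back-to-back calls; the path below is purely illustrative.)

using System;
using System.Diagnostics;
using System.IO;

class GetFilesTiming
{
    static void Main()
    {
        const string path = @"C:\HugeFolder"; // illustrative: a folder with ~1,000,000 files

        for (int run = 1; run <= 2; run++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            string[] files = Directory.GetFiles(path);
            sw.Stop();
            Console.WriteLine("Run {0}: {1} files in {2}", run, files.Length, sw.Elapsed);
        }
        // The second run is typically much faster because the directory
        // metadata is now cached; GetFiles itself has not changed.
    }
}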
Hard drives have internal caches that will help speed up subsequent reads. Try reading a bunch of other directory information in a completely different sector to clear the cache.
Related
At work I have come across an error where a service I am running times out when waiting for a file to be created. What I am wondering is if there is a way for me to measure, using an external application, how long files take to be completely created in a particular directory.
The accuracy does not need to be perfect; a general indication of how quickly files are being generated is enough. I was thinking that I could poll a directory for new files, record their size on first discovery and wait until the file size stops growing, but this seems cumbersome and far from an ideal or accurate solution.
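For what it's worth, a minimal sketch of that polling idea might look like this (the directory path and poll interval are just placeholders):

using System;
using System.IO;
using System.Threading;

class FileGrowthWatcher
{
    static void Main()
    {
        const string dir = @"C:\WatchedDirectory"; // placeholder path
        var watcher = new FileSystemWatcher(dir) { EnableRaisingEvents = true };

        watcher.Created += (sender, e) =>
        {
            DateTime start = DateTime.UtcNow;
            long lastSize = -1;

            // Poll until the size stops changing; crude, but gives a rough figure.
            while (true)
            {
                Thread.Sleep(250); // placeholder poll interval
                long size;
                try { size = new FileInfo(e.FullPath).Length; }
                catch (IOException) { continue; } // still locked by the writer
                if (size == lastSize) break;
                lastSize = size;
            }

            Console.WriteLine("{0}: roughly {1:F1}s until the size stopped growing",
                e.Name, (DateTime.UtcNow - start).TotalSeconds);
        };

        Console.ReadLine(); // keep watching until Enter is pressed
    }
}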
Suggestions would be greatly appreciated.
I'm writing a little app to pull down a few valid samples of each particular type, from a much larger pile of samples.
The structure looks like:
ROOT->STATE->TYPE->SAMPLE
My program cruises through the states, and grabs each unique type, and the path to that type. Once all those are obtained, it goes through each type, and selects X random samples, with X supplied by the user.
The program works great locally, but over the network it's obviously much slower. I've taken measures to help this, but the last part I'm hung up on is getting the random sample from the TYPE directory quickly.
Locally, I use
List<String> directories = Directory.GetDirectories(kvp.Value).ToList();
Which is the bottleneck when running this over the network. I have a feeling this may not be possible, but is there a way to grab, say, 5 random samples from the TYPE directory without first identifying all the samples?
Hopefully I have been clear enough, thank you.
Perhaps try using DirectoryInfo; when making lots of calls to a specific directory it's faster, as security is not checked on every access.
You may find speed increases from using a DirectoryInfo object for the root and the sub-folders you want and listing directories that way. That will get you minor speed increases, as .NET's lazy initialisation strategy means the static Directory methods you use in your sample take more network round trips.
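For illustration, something along these lines (the UNC path is a placeholder):

using System;
using System.IO;

class TypeLister
{
    static void Main()
    {
        // Reuse one DirectoryInfo per folder instead of repeated static Directory calls.
        var typeDir = new DirectoryInfo(@"\\server\share\STATE\TYPE"); // placeholder UNC path

        foreach (DirectoryInfo sample in typeDir.EnumerateDirectories())
        {
            Console.WriteLine(sample.FullName);
        }
    }
}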
The next question, I suppose, is: why is speed important? Have you considered doing something like maintaining an up-to-date index in a cache of your own design for speedy access? Either using a FileSystemWatcher, a regular poll, or both?
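A rough sketch of such a cache, assuming a FileSystemWatcher over the root share (all names are illustrative):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;

// Keeps an in-memory index of sub-directories and lets a FileSystemWatcher
// keep it up to date, so later look-ups never have to go back over the network.
class DirectoryIndex
{
    private readonly ConcurrentDictionary<string, byte> _dirs = new ConcurrentDictionary<string, byte>();
    private readonly FileSystemWatcher _watcher;

    public DirectoryIndex(string root)
    {
        foreach (string dir in Directory.EnumerateDirectories(root, "*", SearchOption.AllDirectories))
            _dirs.TryAdd(dir, 0);

        _watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true, EnableRaisingEvents = true };
        // A real implementation would check whether e.FullPath is actually a directory.
        _watcher.Created += (sender, e) => _dirs.TryAdd(e.FullPath, 0);
        _watcher.Deleted += (sender, e) => { byte ignored; _dirs.TryRemove(e.FullPath, out ignored); };
    }

    public ICollection<string> Directories
    {
        get { return _dirs.Keys; }
    }
}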
I think you may also be interested in this link: Checking if folder has files
... it contains some information about limiting your network calls to the bare minimum by retrieving information about the entire directory structure from one call. This will no doubt increase your memory requirements however.
Is the name of each kind of file predictable? Would you have better luck randomly predicting some sample names and reading them directly?
I have a problem with processing many binary files. I have many folders, each containing about 200 bin files.
I choose 2 of these directories, save the paths of all bin files from these 2 directories to a List, and do some filtering on this list. At the end of this, the list contains about 200 bin files.
Then I iterate over all the filtered files and read the first 4x8 bytes from each (I tried FileStream and BinaryReader). All these operations take about 2-6 seconds, but only the first time. The next time it's fast enough. If nothing happens with the files for a long time (about 30 minutes), the problem appears again.
So probably it's something about caching or what?
Can someone help me please? Thanks
It is very possible that the handles to the files are disposed, which is why the GC removes them after a while and it takes longer; or simply that the files are loaded into RAM by the OS, which then serves them to you from there, which is why it is faster. But that is not the issue: the process runs slowly because it is slow, and it isn't relevant that it is faster the second time, because you mustn't rely on that.
What I suggest is to parallelise the processing of those files as much as possible, to harness the full power of the hardware at hand.
Start by isolating the code that handles a file and then run the code within a Parallel.ForEach and see if that helps.
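A minimal sketch of that, assuming the per-file work is just the 4x8-byte read described in the question:

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class HeaderReader
{
    // filteredFiles would be the ~200 filtered paths from the question.
    public static void ReadHeaders(IEnumerable<string> filteredFiles)
    {
        Parallel.ForEach(filteredFiles, path =>
        {
            using (var reader = new BinaryReader(File.OpenRead(path)))
            {
                // Read the first 4 x 8 bytes (four 64-bit values) from each file.
                long a = reader.ReadInt64();
                long b = reader.ReadInt64();
                long c = reader.ReadInt64();
                long d = reader.ReadInt64();
                // ... process the values here ...
            }
        });
    }
}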
One possibility is that your drive is going to sleep (typically a drive will be configured to power down after 15-30 minutes). This can add a significant delay (5 seconds would be a typical figure) as the hard drive is spun back up to speed.
Luckily, this is an easy thing to test. Just set the power-down time to, say, 6 hours, and then test if the behaviour has changed.
I am writing a program which iterates through the file system multiple times using simple loops and recursion.
The problem is that, because I am iterating through multiple times, it is taking a long time because (I guess) the hard drive can only work at a certain pace.
Is there any way to optimize this process? Maybe by iterating through once, saving all the relevant information in a collection and then referring to the collection when I need to?
I know I can cache my results like this but I have absolutely no idea how to go about it.
Edit:
There are three main pieces of information I am trying to obtain from a given directory:
The size of the directory (the sum of the size of each file within that directory)
The number of files within the directory
The number of folders within the directory
All of the above includes sub-directories too. Currently, I am performing an iteration of a given directory to obtain each piece of information, i.e. three iterations per directory.
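For illustration, a single recursive pass that collects all three figures at once could look roughly like this (the type and method names are made up):

using System.IO;

class DirStats
{
    public long SizeBytes;
    public int FileCount;
    public int FolderCount;
}

static class DirectoryScanner
{
    // One walk of the tree gathers size, file count and folder count together,
    // instead of three separate iterations per directory.
    public static DirStats Scan(string path)
    {
        var stats = new DirStats();

        foreach (FileInfo file in new DirectoryInfo(path).EnumerateFiles())
        {
            stats.FileCount++;
            stats.SizeBytes += file.Length;
        }

        foreach (DirectoryInfo sub in new DirectoryInfo(path).EnumerateDirectories())
        {
            stats.FolderCount++;
            DirStats child = Scan(sub.FullName);
            stats.SizeBytes += child.SizeBytes;
            stats.FileCount += child.FileCount;
            stats.FolderCount += child.FolderCount;
        }

        return stats;
    }
}

The per-directory results could then be stored in a dictionary keyed by path and looked up later, rather than re-walking the disk.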
My output is basically a spreadsheet which looks like this:
To improve performance, you could access the Master File Table (MFT) of the NTFS file system directly. There is an excellent code sample on the MSDN social forum.
It seems that accessing the MFT is about 10x faster than enumerating the file system using FindFirstFile/FindNextFile.
Hope this helps.
Yes, anything you can do to minimize hard drive I/O will improve the performance. I would also suggest putting in a Stopwatch and measuring the time it takes, so you can get a sense of how your improvements are affecting the speed.
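For example, something as simple as this around the part you want to measure:

using System;
using System.Diagnostics;

class Timing
{
    static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();
        // ... the directory iteration you want to measure goes here ...
        sw.Stop();
        Console.WriteLine("Scan took " + sw.ElapsedMilliseconds + " ms");
    }
}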
A very similar question has also been asked here on SO in case you are interested, but as we will see, the accepted answer to that question does not always hold (and it never holds for my application's use pattern).
The performance-determining code consists of the FileStream constructor (to open a file) and a SHA1 hash computation (the .NET Framework implementation). The code is pretty much a C# version of what was asked in the question I've linked to above.
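For context, the relevant code is essentially the following (simplified; the buffer size is illustrative):

using System.IO;
using System.Security.Cryptography;

static class Hasher
{
    public static byte[] ComputeSha1(string path)
    {
        // Opening the file (the FileStream constructor) and hashing its contents
        // are the two pieces that show up in the profiler.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096))
        using (var sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(stream);
        }
    }
}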
Case 1: The application is started either for the first time or for the Nth time, but with a different target file set. The application is now told to compute the hash values of files that were never accessed before.
~50ms
80% FileStream constructor
18% hash computation
Case 2: The application is now fully terminated and started again, asked to compute the hash on the same files:
~8ms
90% hash computation
8% FileStream constructor
Problem
My application always falls under Case 1. It will never be asked to re-compute a hash on a file that was already visited once.
So my rate-determining step is the FileStream constructor! Is there anything I can do to speed up this use case?
Thank you.
P.S. Stats were gathered using JetBrains profiler.
... but with different target file set.
Key phrase: your app will not be able to take advantage of the file system cache, as it did in the second measurement. The directory info can't come from RAM because it wasn't read yet; the OS always has to fall back to the disk drive, and that is slow.
Only better hardware can speed it up. 50 msec is about the standard amount of time needed by a spindle drive; 20 msec is about as low as such drives can go. Reader-head seek time is the hard mechanical limit. That's easy to beat today: SSDs are widely available and reasonably affordable. The only problem is that once you get used to one, you never move back :)
The file system and/or disk controller will cache recently accessed files and sectors.
The rate-determining step is reading the file, not constructing a FileStream object, and it's completely normal that it will be significantly faster on the second run when data is in the cache.
Off-track suggestion, but this is something that I have done a lot, and it made our analyses 30%-70% faster:
Caching
Write another piece of code that will:
iterate over all the files;
compute the hash; and,
store it in another index file.
Now, don't call a FileStream constructor to compute the hash when your application starts. Instead, open the (expectedly much smaller) index file and read the precomputed hash from it.
Further, if these files are log files (or similar) that are freshly created each time before your application starts, add code to the file creator to also update the index file with the hash of the newly created file.
This way your application can always read the hash from the index file only.
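A bare-bones sketch of such an index (the one-line-per-file format and the names are entirely made up):

using System;
using System.Collections.Generic;
using System.IO;

static class HashIndex
{
    // Index format (made up): one "<path>|<hex hash>" pair per line.
    public static Dictionary<string, string> Load(string indexPath)
    {
        var index = new Dictionary<string, string>();
        foreach (string line in File.ReadLines(indexPath))
        {
            string[] parts = line.Split('|');
            if (parts.Length == 2)
                index[parts[0]] = parts[1];
        }
        return index;
    }

    public static void Append(string indexPath, string filePath, string hexHash)
    {
        // Called by whatever creates the data files, right after writing them.
        File.AppendAllText(indexPath, filePath + "|" + hexHash + Environment.NewLine);
    }
}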
I concur with @HansPassant's suggestion of using SSDs to make your disk reads faster. This answer and his answer are complementary. You can implement both to maximize the performance.
As stated earlier, the file system has its own caching mechanism which perturbs your measurement.
However, the FileStream constructor performs several tasks which are expensive the first time and require accessing the file system (and therefore data which might not be in the cache). For explanatory reasons, you can take a look at the code and see that the CompatibilitySwitches class is used to detect sub-feature usage. Together with this class, Reflection is heavily used, both directly (to access the current assembly) and indirectly (for CAS-protected sections and security link demands). The Reflection engine has its own cache and requires accessing the file system when that cache is empty.
It feels a little bit odd that the two measurements are so different. We currently have something similar on our machines equipped with antivirus software configured with real-time protection. In this case, the antivirus software sits in the middle, and the cache is hit or missed the first time depending on the implementation of that software.
The antivirus software might decide to aggressively check certain image files, like PNGs, due to known decode vulnerabilities. Such checks introduce additional slowdown, and the time is accounted to the outermost .NET class, i.e. the FileStream class.
Profiling using native symbols and/or kernel debugging should give you more insight.
Based on my experience, what you describe cannot be mitigated, as there are multiple hidden layers out of our control. Depending on your usage, which is not perfectly clear to me right now, you might turn the application into a service, so you could serve all the subsequent requests faster. Alternatively, you could batch multiple requests into one single call to achieve an amortized reduced cost.
You should try using the native FILE_FLAG_SEQUENTIAL_SCAN flag; you will have to P/Invoke CreateFile in order to get a handle and pass it to the FileStream constructor.
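A sketch of what that could look like (constants taken from the Win32 headers; treat this as a starting point rather than production code):

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class SequentialScanFile
{
    private const uint GENERIC_READ = 0x80000000;
    private const uint FILE_SHARE_READ = 0x00000001;
    private const uint OPEN_EXISTING = 3;
    private const uint FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern SafeFileHandle CreateFile(
        string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    public static FileStream OpenSequential(string path)
    {
        SafeFileHandle handle = CreateFile(path, GENERIC_READ, FILE_SHARE_READ,
            IntPtr.Zero, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, IntPtr.Zero);

        if (handle.IsInvalid)
            throw new IOException("CreateFile failed", Marshal.GetLastWin32Error());

        // Hand the native handle to a managed FileStream for reading.
        return new FileStream(handle, FileAccess.Read);
    }
}

Note that, depending on the framework version, passing FileOptions.SequentialScan to one of the managed FileStream constructor overloads may achieve the same effect without P/Invoke.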