Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, check whether there are subdirectories, then recurse into them.
I also tried another way to do this: using FSO (obj.GetFolder(path).Size). There isn't much difference in time between these two approaches.
Now the problem is, I have tens of thousands of files in a particular folder and it's taking at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 seconds). I think Windows is caching the file sizes.
Is there any way I can bring down the time taken the first time the program runs?
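For reference, a minimal sketch of the recursive approach described above (the class and method names are illustrative, not from the original code):

using System.IO;

static class DirSizeBaseline
{
    // Plain single-threaded recursion: sum file lengths, then descend into subfolders.
    // This is the baseline being asked about; it is I/O-bound, not CPU-bound.
    public static long GetDirectorySize(string path)
    {
        long size = 0;
        foreach (string file in Directory.GetFiles(path))
            size += new FileInfo(file).Length;
        foreach (string dir in Directory.GetDirectories(path))
            size += GetDirectorySize(dir);
        return size;
    }
}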
I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times faster on a quad core). I don't know if that holds in all cases, but give it a try...
.NET 4.0 code (or use 3.5 with the Task Parallel Library):
// Requires: using System.IO; using System.Threading; using System.Threading.Tasks;
private static long DirSize(string sourceDir, bool recurse)
{
    long size = 0;
    string[] fileEntries = Directory.GetFiles(sourceDir);

    foreach (string fileName in fileEntries)
    {
        Interlocked.Add(ref size, new FileInfo(fileName).Length);
    }

    if (recurse)
    {
        string[] subdirEntries = Directory.GetDirectories(sourceDir);

        // Recurse into each subdirectory on the thread pool; each partition keeps a
        // local subtotal, and the subtotals are folded into 'size' at the end.
        Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points (junctions/symlinks) to avoid cycles and double counting,
            // but keep the subtotal accumulated so far.
            if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
            {
                subtotal += DirSize(subdirEntries[i], true);
            }
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x));
    }

    return size;
}
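A quick way to try it (the path and timing code are just an example):

var sw = System.Diagnostics.Stopwatch.StartNew();
long bytes = DirSize(@"C:\SomeLargeFolder", true); // example path
sw.Stop();
Console.WriteLine("{0:N0} bytes in {1:N1} s", bytes, sw.Elapsed.TotalSeconds);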
Hard disks are an interesting beast: sequential access (reading a big contiguous file, for example) is super zippy, figure 80 MB/sec. Random access, however, is very slow, and that is what you're bumping into here: recursing into the folders won't read much data in terms of quantity, but it will require many random reads. The reason you're seeing zippy performance the second time around is that the MFT is still in RAM (you're correct about the caching).
The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is you read and parse the MFT in one linear pass building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.
Some good reading:
NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1
FWIW: this method is very complicated as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that the act of figuring out which folders/files are needed requires much head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.
The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes slower. Since it is much more common to write files than to read directory sizes, it is a reasonable tradeoff.
I am not sure what exact problem is being solved, but if it is file system monitoring, it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
Performance will suffer using any method when scanning a folder with tens of thousands of files.
Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.
Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.
How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.
You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.
I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using the BIOS).
Yes, the operating-system caches file access.
Suggestions
I wonder if BackupRead would help in your case?
What if you shell out to DIR, capture its output, and then parse it? (You are not really parsing, because each DIR line is fixed-width, so it is mostly a matter of calling Substring.) See the sketch below.
What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access.
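As an illustration of the first shell-out suggestion, here is a rough sketch that assumes an English locale: it runs DIR /S /A:-D /-C (no thousands separators) and pulls the grand-total byte count from the last "File(s) ... bytes" summary line. The parsing details are an assumption, not a definitive implementation.

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class DirShellOut
{
    // Runs "dir /s /a:-d /-c" and extracts the total byte count from the last
    // "n File(s) m bytes" line (the grand total printed after "Total Files Listed:").
    public static long GetDirectorySizeViaDir(string path)
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /a:-d /-c \"" + path + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        string output;
        using (var proc = Process.Start(psi))
        {
            output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
        }

        var matches = Regex.Matches(output, @"File\(s\)\s+(\d+)\s+bytes");
        return matches.Count > 0 ? long.Parse(matches[matches.Count - 1].Groups[1].Value) : 0;
    }
}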
Based on the answer by spookycoder, I found this variation (using DirectoryInfo) at least 2 times faster (and up to 10 times faster on complex folder structures!):
public static long CalcDirSize(string sourceDir, bool recurse = true)
{
    return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
}

private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
{
    long size = 0;
    FileInfo[] fiEntries = di.GetFiles();
    foreach (var fiEntry in fiEntries)
    {
        Interlocked.Add(ref size, fiEntry.Length);
    }

    if (recurse)
    {
        DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
        System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points (junctions/symlinks) but keep the subtotal accumulated so far.
            if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint)
                return subtotal;
            subtotal += _CalcDirSize(diEntries[i], true);
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x));
    }

    return size;
}
I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.
I don't think there's any really quick way of doing it, however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.
PInvoke has a sample of how to use the API.
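For illustration, a minimal P/Invoke sketch that sums file sizes with FindFirstFile/FindNextFile. The declarations follow the usual pinvoke.net pattern; error handling and long-path handling are omitted, so treat it as a sketch rather than a drop-in implementation.

using System;
using System.IO;
using System.Runtime.InteropServices;

static class Win32DirSize
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    public static long DirSize(string dir)
    {
        long size = 0;
        WIN32_FIND_DATA fd;
        IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out fd);
        if (handle == INVALID_HANDLE_VALUE) return 0;
        try
        {
            do
            {
                if (fd.cFileName == "." || fd.cFileName == "..") continue;
                if ((fd.dwFileAttributes & FileAttributes.Directory) != 0)
                {
                    // Skip reparse points to avoid cycles.
                    if ((fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0)
                        size += DirSize(Path.Combine(dir, fd.cFileName));
                }
                else
                {
                    size += ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
                }
            } while (FindNextFile(handle, out fd));
        }
        finally
        {
            FindClose(handle);
        }
        return size;
    }
}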
With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you to calculate the size, the files have changed and your data is already wrong.
So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.
It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)
Then, the only thing to look out for is that you will have multiple processes trying to access the resulting output file at the same time. So just make sure that you take that into account.
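A minimal sketch of that idea, assuming .NET 4 (the class and member names are illustrative): do one full scan at startup, then let FileSystemWatcher keep a per-file size index current.

using System;
using System.Collections.Concurrent;
using System.IO;

class DirectorySizeIndex
{
    private readonly ConcurrentDictionary<string, long> _sizes =
        new ConcurrentDictionary<string, long>(StringComparer.OrdinalIgnoreCase);
    private readonly FileSystemWatcher _watcher;

    public DirectorySizeIndex(string root)
    {
        // Initial full scan; after this, only deltas are applied.
        foreach (var file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            _sizes[file] = new FileInfo(file).Length;

        _watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true };
        _watcher.Created += (s, e) => Update(e.FullPath);
        _watcher.Changed += (s, e) => Update(e.FullPath);
        _watcher.Renamed += (s, e) => { long old; _sizes.TryRemove(e.OldFullPath, out old); Update(e.FullPath); };
        _watcher.Deleted += (s, e) => { long old; _sizes.TryRemove(e.FullPath, out old); };
        _watcher.EnableRaisingEvents = true;
    }

    private void Update(string path)
    {
        try { if (File.Exists(path)) _sizes[path] = new FileInfo(path).Length; }
        catch (IOException) { /* file may be locked mid-write; the next event will catch it */ }
    }

    public long TotalBytes
    {
        get { long total = 0; foreach (var kv in _sizes) total += kv.Value; return total; }
    }
}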
I gave up on the .NET implementations (for performance reasons) and used the Native function GetFileAttributesEx(...)
Try this:
[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}

public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}

public class NativeMethods
{
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
}
Now simply do the following:
WIN32_FILE_ATTRIBUTE_DATA data;
if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
{
    // Combine the two 32-bit halves into a 64-bit size: cast before shifting, and OR the halves together.
    long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
Related
I've got a pretty unusual request:
I would like to load all files from a specific folder (so far, easy). I need something with a very small memory footprint.
Now it gets complicated (at least for me). I DON'T need to store or use the content of the files - I just need to force the block-level caching mechanism to cache all the blocks used by that specific folder.
I know there are many different methods (BinaryReader, StreamReader etc.), but my case is quite special, since I don't care about the content...
Any idea what would be the best way how to achieve this?
Should I use a small buffer? But since it would fill quickly, wouldn't flushing the buffer actually slow down the operation?
Thanks,
Martin
I would perhaps memory map the files and then loop around accessing an element of each file at regular (block-spaced) intervals.
Assuming of course that you are able to use .Net 4.0.
In pseudocode you'd do something like:
// Requires: using System.IO; using System.IO.MemoryMappedFiles;
long fileSize = new FileInfo(path).Length;
const int blockSize = 4096; // typical cluster/page size

using (var mmf = MemoryMappedFile.CreateFromFile(path))
{
    for (long offset = 0; offset < fileSize; offset += blockSize)
    {
        // Touch one byte per block to pull that block into the cache.
        using (var acc = mmf.CreateViewAccessor(offset, 1))
        {
            acc.ReadByte(0); // position is relative to the start of this view
        }
    }
}
But at the end of the day, each method will have different performance characteristics so you might have to use a bit of trial and error to find out which is the most performant.
I would simply read those files. When you do that, the Windows Cache Manager caches them automatically and you don't have to care about anything else; that's exactly its role, and by reading the files you give it a hint that they should be cached.
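A minimal sketch of that approach, assuming .NET 4's Directory.EnumerateFiles: stream each file through a small reusable buffer and discard the data (the buffer size is arbitrary here).

using System.IO;

static void WarmCache(string folder)
{
    var buffer = new byte[64 * 1024]; // small, reused for every file
    foreach (var file in Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories))
    {
        using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, buffer.Length))
        {
            // Read and throw away; the point is only to pull the blocks into the system cache.
            while (fs.Read(buffer, 0, buffer.Length) > 0) { }
        }
    }
}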
I previously asked the question Get all files and directories in specific path fast in order to find files as fast as possible. I am using that solution to find the file names that match a regular expression.
I was hoping to show a progress bar because, with some really large and slow hard drives, it still takes about 1 minute to execute. The solution I posted at that link does not let me know how many files remain to be traversed, which I would need in order to show a progress bar.
One solution I was considering was to obtain the size of the directory I was planning to traverse. For example, when I right-click the folder C:\Users I am able to get an estimate of how big that directory is. If I know the size, then I can show progress by adding up the size of every file that I find. In other words: progress = (current sum of file sizes) / (directory size).
For some reason I have not been able to efficiently get the size of that directory.
Some of the questions on Stack Overflow use a similar approach.
But note that I get an exception and am not able to enumerate the files. I am curious to try that method on my C drive.
There I was trying to count the number of files in order to show progress. I am probably not going to be able to get the number of files efficiently using that approach. I was just trying some of the answers on Stack Overflow where people asked how to get the number of files in a directory, and also how to get the size of a directory.
Solving this is going to leave you with one of a few possibilities...
Not displaying progress at all
Using an up-front cost to compute (like Windows)
Performing the operation while computing the cost
If the speed is that important and you expect large directory trees I would lean to the last of these options. I've added an answer on the linked question Get all files and directories in specific path fast that demonstrates a faster means of counting files and sizes than you are currently using. To combine this into a multi-threaded piece of code for option #3, the following can be performed...
static void Main()
{
    const string directory = @"C:\Program Files";

    // First pass: an enumeration that simply counts the files we will want to process...
    long total = 0;
    var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    fcounter.RaiseOnAccessDenied = false;
    fcounter.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                Interlocked.Increment(ref total);
            }
        };

    // Start a high-priority thread to perform the accumulation
    Thread t = new Thread(fcounter.Find)
    {
        IsBackground = true,
        Priority = ThreadPriority.AboveNormal,
        Name = "file enum"
    };
    t.Start();

    // Allow the accumulator thread to get a head-start on us
    do { Thread.Sleep(100); }
    while (total < 100 && t.IsAlive);

    // Now we can process the files normally and update a percentage
    long count = 0, percentage = 0;
    var task = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    task.RaiseOnAccessDenied = false;
    task.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                ProcessFile(e.FullPath);
                // Update the percentage complete...
                long progress = ++count * 100 / Interlocked.Read(ref total);
                if (progress > percentage && progress <= 100)
                {
                    percentage = progress;
                    Console.WriteLine("{0}% complete.", percentage);
                }
            }
        };
    task.Find();
}
The FindFile class implementation can be found at FindFile.cs.
Depending on how expensive your file-processing task is (the ProcessFile function above) you should see a very clean progression of the progress on large volumes of files. If your file-processing is extremely fast, you may want to increase the lag between the start of enumeration and start of processing.
The event argument is of type FindFile.FileFoundEventArgs and is a mutable class, so be sure you don't keep a reference to the event argument, as its values will change.
Ideally you will want to add error handling and probably the ability to abort both enumerations. Aborting the enumeration can be done by setting "CancelEnumeration" on the event argument.
What you are asking may not be possible because of how the file system stores its data.
It is a file system limitation
There is no way to know the total size of a folder, nor the total file count inside a folder, without enumerating the files one by one. Neither piece of information is stored in the file system.
This is why Windows shows a message like "Calculating space" before copying folders with a lot of files... it is actually counting how many files there are inside the folder and summing their sizes, so that it can show a progress bar during the real copy operation. (It also uses that information to know whether the destination has enough space to hold all the data being copied.)
Also when you right-click a folder, and go to properties, note that it takes some time to count all files and to sum all the file sizes. That is caused by the same limitation.
To know how large a folder is, or how many files are there inside a folder, you must enumerate the files one-by-one.
Fast file enumeration
Of course, as you already know, there are a lot of ways of doing the enumeration itself... but none will be instantaneous. You could try using the USN Journal of the file system to do the scan. Take a look at this project on CodePlex: MFT Scanner in VB.NET (the code is actually in C#... don't know why the author says it is VB.NET)... it found all the files on my IDE SATA (not SSD) drive in less than 15 seconds, and found 311,000 files.
You will have to filter the files by path, so that only the files inside the path you are looking for are returned. But that is the easy part of the job!
Hope this helps in your project... good luck!
My C# 3.0 application needs to traverse folders and do some work within each. To show meaningful progress, I need to know the total folder count.
If I use Directory.GetDirectories with the AllDirectories option, this takes a very long time on my 2 TB hard drive with around 100K folders, and I would need to present progress even for that operation! The only meaningful thing I can do is to use recursive Directory.GetDirectories and present the user with the number of directories found so far. However, this takes even longer than the first approach.
I believe both approaches are too slow. Is there any way to get this number more quickly, e.g., by reading it from some file table using P/Invoke? Any other ideas?
My suggestion would be to simply show the user an indeterminate (infinitely scrolling) progress bar while you are getting all of the directories, and only then show the actual progress while your application does the work.
This way the user will know the application is working in the background while everything happens.
This sort of thing is hard to do. If you're just trying to make a rough estimate for a progress bar, you don't need much granularity, right? I would suggest manually traversing the directory tree only one or two levels deep to figure out how many 1st- and 2nd-level subdirectories there are. Then you can update your progress bar whenever you hit one of those subdirs. That ought to give you a meaningful progress bar without taking too much time to compute.
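A minimal sketch of that idea (names are illustrative): pre-count only the top two levels, then bump the progress bar each time the real traversal finishes one of those subdirectories.

using System;
using System.Collections.Generic;
using System.IO;

static class CoarseProgress
{
    // Count only 1st- and 2nd-level subdirectories; cheap compared to a full scan.
    public static List<string> GetProgressUnits(string root)
    {
        var units = new List<string>();
        foreach (var level1 in Directory.GetDirectories(root))
        {
            var level2 = Directory.GetDirectories(level1);
            if (level2.Length == 0) units.Add(level1);
            else units.AddRange(level2);
        }
        return units;
    }
}

// Usage: each unit counts as one "tick" of the progress bar.
// var units = CoarseProgress.GetProgressUnits(@"D:\Data");  // example path
// for (int i = 0; i < units.Count; i++)
// {
//     DoWorkOn(units[i]);                                   // your per-folder work
//     ReportProgress((i + 1) * 100 / units.Count);
// }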
If you implement this, you'll find that the first pre-scan is the slowest, but it will speed up the next (full) scan because the folder structure gets cached.
It may be an option to only count the folders in the first N (2..4) levels. That could still be slow, but it will allow for an estimated progress. Just assume all lower levels contain equal numbers of files.
Part 2, concerning the P/Invoke question
Your main cost here is true low-level I/O; the overhead of any API is negligible.
You will probably benefit from replacing GetFiles() with EnumerateFiles() (.NET 4), more so in your main loop than in the pre-scan.
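For illustration, a small sketch of the difference (assuming .NET 4): EnumerateDirectories streams results as they are found, so you can report a running count instead of waiting for the complete array.

using System;
using System.IO;

// GetDirectories builds the whole array before returning; EnumerateDirectories
// yields each directory as it is discovered, so progress can be reported early.
static int CountDirectoriesWithFeedback(string root)
{
    int found = 0;
    foreach (var dir in Directory.EnumerateDirectories(root, "*", SearchOption.AllDirectories))
    {
        found++;
        if (found % 1000 == 0)
            Console.WriteLine("{0} folders found so far...", found);
    }
    return found;
}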
Explore the FindFirstFile and FindNextFile APIs. I think they will work faster in your case.
I wrote a pretty simple enumeration of files. The progress is mathematically monotonic, i.e. it will never drop to a lower value later on. The estimate is based on the idea that all folders hold the same number of files and subfolders, which is obviously almost never the case, but it suffices to get a reasonable idea.
There is almost no caching, especially not of deep structures, so this should work almost as quickly as enumerating directly.
public static IEnumerable<Tuple<string, float>> EnumerateFiles (string root)
{
    var files = Directory.GetFiles (root);
    var dirs = Directory.GetDirectories (root);
    var fact = 1f / (float) (dirs.Length + 1); // this makes for a rough estimate

    for (int i = 0; i < files.Length; i++) {
        var file = files[i];
        var f = (float) i / (float) files.Length;
        f *= fact;
        yield return new Tuple<string, float> (file, f);
    }

    for (int i = 0; i < dirs.Length; i++) {
        var dir = dirs[i];
        foreach (var tuple in EnumerateFiles (dir)) {
            var f = tuple.Item2;
            f *= fact;
            f += (i + 1) * fact;
            yield return new Tuple<string, float> (tuple.Item1, f);
        }
    }
}
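A possible way to consume it (the path is an example):

foreach (var entry in EnumerateFiles (@"C:\SomeFolder"))
{
    string file = entry.Item1;
    float progress = entry.Item2; // 0.0 .. 1.0, never decreasing
    Console.WriteLine ("{0,6:P1}  {1}", progress, file);
}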
I'm downloading some files asynchronously into a large byte array, and I have a callback that fires off periodically whenever some data is added to that array. If I want to give developers the ability to use the last chunk of data that was added to array, then... well how would I do that? In C++ I could give them a pointer to somewhere in the middle, and then perhaps tell them the number of bytes that were added in the last operation so they at least know the chunk they should be looking at... I don't really want to give them a 2nd copy of that data, that's just wasteful.
I'm just thinking if people want to process this data before the file has completed downloading. Would anyone actually want to do that? Or is it a useless feature anyway? I already have a callback for when the buffer (entire byte array) is full, and then they can dump the whole thing without worrying about start and end points...
.NET has a struct that does exactly what you want:
System.ArraySegment<T>.
In any case, it's easy to implement it yourself too - just make a constructor that takes a base array, an offset, and a length. Then implement an indexer that offsets indexes behind the scenes, so your ArraySegment can be seamlessly used in the place of an array.
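A small sketch of using the built-in type for this case (the names are illustrative): the download callback hands out a segment over the shared buffer instead of a copy, and consumers read it through Array/Offset/Count.

using System;

// The downloader keeps one large buffer and, after each chunk arrives,
// exposes just that window of it without copying.
static void OnChunkReceived(byte[] buffer, int chunkStart, int chunkLength,
                            Action<ArraySegment<byte>> callback)
{
    var segment = new ArraySegment<byte>(buffer, chunkStart, chunkLength);
    callback(segment);
}

// A consumer walks the window via Array/Offset/Count:
static void PrintChunk(ArraySegment<byte> seg)
{
    for (int i = seg.Offset; i < seg.Offset + seg.Count; i++)
        Console.Write("{0:X2} ", seg.Array[i]);
    Console.WriteLine();
}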
You can't give them a pointer into the array, but you could give them the array and start index and length of the new data.
But I have to wonder what someone would use this for. Is this a known need, or are you just guessing that someone might want it someday? And if so, is there any reason why you couldn't wait to add the capability once someone actually needs it?
Whether this is needed or not depends on whether you can afford to accumulate all the data from a file before processing it, or whether you need to provide a streaming mode where you process each chunk as it arrives. This depends on two things: how much data there is (you probably would not want to accumulate a multi-gigabyte file), and how long it takes the file to completely arrive (if you are getting the data over a slow link you might not want your client to wait till it had all arrived). So it is a reasonable feature to add, depending on how the library is to be used. Streaming mode is usually a desirable attribute, so I would vote for implementing the feature.

However, the idea of putting the data into an array seems wrong, because it fundamentally implies a non-streaming design, and because it requires an additional copy. What you could do instead is to keep each chunk of arriving data as a discrete piece. These could be stored in a container for which adding at the end and removing from the front is efficient.
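A minimal sketch of that design, assuming .NET 4's ConcurrentQueue is available and that chunks are produced by a download callback and consumed by a worker (the type and member names are illustrative):

using System.Collections.Concurrent;

// Each arriving chunk is kept as its own array; nothing is copied into one big buffer.
// A queue is cheap to append to at the tail and dequeue from the head.
class ChunkPipeline
{
    private readonly ConcurrentQueue<byte[]> _chunks = new ConcurrentQueue<byte[]>();

    // Called by the downloader whenever a chunk arrives.
    public void OnChunkArrived(byte[] chunk)
    {
        _chunks.Enqueue(chunk);
    }

    // Called by the consumer; returns false when no chunk is currently available.
    public bool TryTakeChunk(out byte[] chunk)
    {
        return _chunks.TryDequeue(out chunk);
    }
}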
Copying a chunk of a byte array may seem "wasteful," but then again, object-oriented languages like C# tend to be a little more wasteful than procedural languages anyway. A few extra CPU cycles and a little extra memory consumption can greatly reduce complexity and increase flexibility in the development process. In fact, copying bytes to a new location in memory to me sounds like good design, as opposed to the pointer approach which will give other classes access to private data.
But if you do want to use pointers, C# does support them. Here is a decent-looking tutorial. The author is correct when he states, "...pointers are only really needed in C# where execution speed is highly important."
I agree with the OP: sometimes you just plain need to pay some attention to efficiency. I don't think the example of providing an API is the best, because that certainly calls for leaning toward safety and simplicity over efficiency.
However, a simple example is when processing large numbers of huge binary files that have zillions of records in them, such as when writing a parser. Without using a mechanism such as System.ArraySegment, the parser becomes a big memory hog, and is greatly slowed down by creating a zillion new data elements, copying all the memory over, and fragmenting the heck out of the heap. It's a very real performance issue. I write these kinds of parsers all the time for telecommunications stuff which generate millions of records per day in each of several categories from each of many switches with variable length binary structures that need to be parsed into databases.
Using the System.ArraySegment mechanism versus creating new structure copies for each record tremendously speeds up the parsing, and greatly reduces the peak memory consumption of the parser. These are very real advantages because the servers run multiple parsers, run them frequently, and speed and memory conservation = very real cost savings in not having to have so many processors dedicated to the parsing.
System.ArraySegment is very easy to use. Here's a simple example of providing a base way to track the individual records in a typical big binary file full of records with a fixed-length header and a variable-length record size (obvious exception control deleted):
public struct MyRecord
{
    public ArraySegment<byte> header;
    public ArraySegment<byte> data;
}

public class Parser
{
    const int HEADER_SIZE = 10;
    const int HDR_OFS_REC_TYPE = 0;
    const int HDR_OFS_REC_LEN = 4;

    byte[] m_fileData;
    List<MyRecord> records = new List<MyRecord>();

    bool Parse(FileStream fs)
    {
        int fileLen = (int)fs.Length;
        m_fileData = new byte[fileLen];
        fs.Read(m_fileData, 0, fileLen);
        fs.Close();
        fs.Dispose();

        int offset = 0;
        while (offset + HEADER_SIZE < fileLen)
        {
            int recType = (int)m_fileData[offset];
            switch (recType) { /* puke if not a recognized type */ }

            int varDataLen = ((int)m_fileData[offset + HDR_OFS_REC_LEN]) * 256
                           + (int)m_fileData[offset + HDR_OFS_REC_LEN + 1];
            if (offset + HEADER_SIZE + varDataLen > fileLen) { /* puke as file has odd bytes at end */ }

            MyRecord rec = new MyRecord();
            rec.header = new ArraySegment<byte>(m_fileData, offset, HEADER_SIZE);
            rec.data = new ArraySegment<byte>(m_fileData, offset + HEADER_SIZE, varDataLen);
            records.Add(rec);

            offset += HEADER_SIZE + varDataLen;
        }
        return true;
    }
}
The above example gives you a list with ArraySegments for each record in the file while leaving all the actual data in place in one big array per file. The only overhead are the two array segments in the MyRecord struct per record. When processing the records, you have the MyRecord.header.Array and MyRecord.data.Array properties which allow you to operate on the elements in each record as if they were their own byte[] copies.
I think you shouldn't bother.
Why on earth would anyone want to use it?
That sounds like you want an event.
public class ArrayChangedEventArgs : EventArgs
{
    public ArrayChangedEventArgs(byte[] array, int start, int length)
    {
        Array = array;
        Start = start;
        Length = length;
    }

    public byte[] Array { get; private set; }
    public int Start { get; private set; }
    public int Length { get; private set; }
}
// ...
// and in your class:
public event EventHandler<ArrayChangedEventArgs> ArrayChanged;

protected virtual void OnArrayChanged(ArrayChangedEventArgs e)
{
    // Using a temporary variable avoids a common potential multithreading issue
    // where the multicast delegate changes midstream.
    // Best practice is to grab a copy first, then test for null.
    EventHandler<ArrayChangedEventArgs> handler = ArrayChanged;
    if (handler != null)
    {
        handler(this, e);
    }
}

// Finally, your code that downloads a chunk just needs to call OnArrayChanged()
// with the appropriate args.
Clients hook into the event and get called when things change. This is what most client code in .NET expects to have in an API ("call me when something happens"). They can hook into the code with something as simple as:
yourDownloader.ArrayChanged += (sender, e) =>
Console.WriteLine(String.Format("Just downloaded {0} byte{1} at position {2}.",
e.Length, e.Length == 1 ? "" : "s", e.Start));
Does anyone know how to (natively) get the maximum allowed file size for a given drive/folder/directory? As in, for FAT16 it is ~2 GB, for FAT32 it was 4 GB as far as I remember, and for the newer NTFS versions it is something way beyond that... let alone Mono and the underlying OSes.
Is there anything I can read out / retrieve that might give me a hint on that? Basically I -know- my app will produce single files bigger than 2 GB, and I want to check for that when the user sets the corresponding output path(s)...
Cheers & thanks,
-J
This may not be the ideal solution, but I will suggest the following anyway:
// Returns the maximum file size in bytes on the filesystem type of the specified drive.
long GetMaximumFileSize(string drive)
{
    var driveInfo = new System.IO.DriveInfo(drive);
    switch (driveInfo.DriveFormat)
    {
        case "FAT16":
            return 1000; // replace with actual limit
        case "FAT32":
            return 1000; // replace with actual limit
        case "NTFS":
            return 1000; // replace with actual limit
        default:
            throw new NotSupportedException("Unknown filesystem: " + driveInfo.DriveFormat);
    }
}
// Examples:
var maxFileSize1 = GetMaximumFileSize("C"); // for the C drive
var maxFileSize2 = GetMaximumFileSize(absolutePath.Substring(0, 1)); // for whichever drive the given absolute path refers to
This page on Wikipedia contains a pretty comprehensive list of the maximum file sizes for various filesystems. Depending on the number of filesystems for which you want to check in the GetMaximumFileSize function, you may want to use a Dictionary object or even a simple data file rather than a switch statement.
Now, you may be able to retrieve the maximum file size directly using WMI or perhaps even the Windows API, but those solutions will of course only be compatible with Windows (i.e. no luck with Mono/Linux). However, I would consider this a reasonably nice purely managed solution: despite the use of a lookup table, it has the bonus of working reliably on all OSs.
Hope that helps.
How about using System.IO.DriveInfo.DriveFormat to retrieve the drive's file system (NTFS, FAT, etc.)? That ought to give you at least some idea of the supported file sizes.
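For example (the drive letter is just an illustration):

var drive = new System.IO.DriveInfo("C");
Console.WriteLine(drive.DriveFormat); // e.g. "NTFS" or "FAT32"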