Quickly estimate the number of subfolders - c#

My C# 3.0 application has to traverse a folder tree and do some work in each folder. To show meaningful progress, I need to know the total folder count.
If I use Directory.GetDirectories with the AllDirectories option, it takes a very long time on my 2 TB hard drive with around 100K folders, and I would have to show progress even for that operation! The only meaningful thing I can do is call Directory.GetDirectories recursively and show the user the number of directories found so far. However, this takes even longer than the first approach.
I believe both approaches are too slow. Is there any way to get this number more quickly, e.g. by reading it from the file system tables via P/Invoke? Any other ideas?

My suggestion would be to simply show the user an indeterminate (endlessly scrolling) progress bar while you are collecting all of the directories, and only then show the actual progress while your application does the work.
This way the user will know the application is working in the background while everything happens.

This sort of thing is hard to do. If you're just trying to make a rough estimate for a progress bar, you don't need much granularity, right? I would suggest manually traversing the directory tree only one or two levels deep to figure out how many 1st- and 2nd-level subdirectories there are. Then you can update your progress bar whenever you hit one of those subdirs. That ought to give you a meaningful progress bar without taking too much time to compute.
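The shallow pre-scan described above can be sketched as follows. This is a minimal sketch, not the answerer's code: the class name ShallowScan and the method CountDirectories are made up for illustration, and unreadable directories are simply skipped.

```csharp
using System;
using System.IO;

public static class ShallowScan
{
    // Counts the subdirectories of 'root' down to 'maxDepth' levels
    // (1 = direct children only). Hypothetical helper for illustration.
    public static int CountDirectories(string root, int maxDepth)
    {
        if (maxDepth <= 0) return 0;

        string[] children;
        try
        {
            children = Directory.GetDirectories(root);
        }
        catch (UnauthorizedAccessException) { return 0; } // no permission: skip
        catch (IOException) { return 0; }                 // vanished or unreadable: skip

        int count = children.Length;
        foreach (string child in children)
        {
            count += CountDirectories(child, maxDepth - 1);
        }
        return count;
    }
}
```

Calling CountDirectories(root, 2) gives the number of first- and second-level folders, which can serve as the progress bar maximum; the bar then advances by one each time the full scan enters one of those folders.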

If you implement this you'll find that your first pre-scan is the slowest, but it will speed up the next (full) scan because the folder structure gets cached.
It may be an option to count only the folders in the first N (2..4) levels. That could still be slow, but it will allow for an estimated progress. Just assume all lower levels contain equal numbers of files.
Part 2, concerning the P/Invoke question
Your main cost here is true low-level I/O; the overhead of the API (any API) is negligible.
You will probably benefit from replacing GetFiles() with EnumerateFiles() (.NET 4). More so for your main loop than for the pre-scan.
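To illustrate the lazy enumeration mentioned above, here is a minimal sketch that streams directories one at a time, so a "found so far" counter can be updated while the walk is still running. It assumes .NET 4 or later (Directory.EnumerateDirectories is not available in earlier versions); the class name LazyWalk is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LazyWalk
{
    // Lazily yields every directory below 'root', skipping unreadable ones.
    public static IEnumerable<string> Directories(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string current = pending.Pop();
            var children = new List<string>();
            try
            {
                // Only one directory level is materialized at a time.
                children.AddRange(Directory.EnumerateDirectories(current));
            }
            catch (UnauthorizedAccessException) { continue; }
            catch (IOException) { continue; }

            foreach (string child in children)
            {
                pending.Push(child);
                yield return child;
            }
        }
    }
}
```

A caller can iterate with foreach and increment a counter per item, updating the UI every few hundred directories rather than on each one.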

Explore the FindFirstFile and FindNextFile APIs. I think they will work faster in your case.

I wrote a pretty simple enumeration of files. The progress is monotonic, i.e. it will never drop to a lower value later on, no matter what. The estimate is based on the idea that all folders hold the same number of files and subfolders, which is obviously almost never the case, but it suffices to get a reasonable idea.
There is almost no caching, especially not of deep structures, so this should work almost as quickly as enumerating directly.
public static IEnumerable<Tuple<string, float>> EnumerateFiles (string root)
{
    var files = Directory.GetFiles (root);
    var dirs = Directory.GetDirectories (root);
    var fact = 1f / (float) (dirs.Length + 1); // this makes for a rough estimate
    for (int i = 0; i < files.Length; i++) {
        var file = files[i];
        var f = (float) i / (float) files.Length;
        f *= fact;
        yield return new Tuple<string, float> (file, f);
    }
    for (int i = 0; i < dirs.Length; i++) {
        var dir = dirs[i];
        foreach (var tuple in EnumerateFiles (dir)) {
            var f = tuple.Item2;
            f *= fact;
            f += (i + 1) * fact;
            yield return new Tuple<string, float> (tuple.Item1, f);
        }
    }
}

Related

Faster way to search through a big string?

I am currently trying to make a program that finds blocks of a specific color in a game save and moves their position. However, with some of the bigger saves, my method of searching for the blocks can take quite a while. My current fastest method takes about 42 seconds to search for and move every block in a string of about 1 MB. There are a lot of blocks in the save (roughly one every 50-300 characters in the string, with a total of around 7k), so I'm not sure whether string search algorithms would speed up or slow down this process.
So I was wondering if I could get any tips. If anyone has ideas on how to further speed up my code, I would be very grateful.
progressBar2.Maximum = blueprint.Length;
int i = 0;
while (i < blueprint.Length - 15)
{
    progressBar2.Value = i;
    try
    {
        if (!blueprint.Substring(i, 110).ToLower()
            .Contains("\"color\""))
        {
            i += 100;
        }
    }
    catch
    {
        return;
    }
    checkcolor(i, color, colortf, posset, axis);
    i++;
}
I am currently optimizing the method checkcolor, and it is the cause of most of the delay, but my current code runs it far more often than needed.
I've tried adding a second if to skip at an interval of 10 as well as 100, but that caused it to take over 2 minutes. I've also tried skip values other than 100, but 100 seems to be the fastest.
Edit: I was making 2 new temporary strings just to check for a small bit of text millions of times. It's a lot faster to use .IndexOf, which I did not know existed. Thanks for the help, and sorry if this was off topic.
I would try comparing the efficiency without creating a substring and without calling ToLower():
if (blueprint.IndexOf("\"color\"", StringComparison.OrdinalIgnoreCase) < 0)
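Building on the IndexOf idea, a minimal sketch of jumping directly from one marker to the next, instead of stepping through the string and testing a window at each position, could look like this. The class MarkerSearch and the method FindAll are hypothetical names; the original checkcolor logic is not reproduced.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerSearch
{
    // Returns the start index of every case-insensitive occurrence
    // of 'marker' in 'text', without allocating any substrings.
    public static List<int> FindAll(string text, string marker)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(marker)) return hits;

        int pos = text.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            hits.Add(pos);
            // Resume the search just past the match we found.
            pos = text.IndexOf(marker, pos + marker.Length, StringComparison.OrdinalIgnoreCase);
        }
        return hits;
    }
}
```

With roughly 7k markers in a 1 MB string, this calls the expensive per-block routine about 7k times rather than once per character position.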

Very slow performance with DotSpatial shapefile

I'm trying to read all of the feature data from a particular shapefile. In this case, I'm using DotSpatial to open the file, and I'm iterating through the features. This particular shapefile is only 9 MB in size, and the dbf file is 14 MB. There are roughly 75k features to loop through.
Note, this is all done programmatically through a console app, so there is no rendering or anything involved.
When loading the shapefile, I reproject, then I iterate. The loading and reprojecting is super quick. However, as soon as the code reaches my foreach block, it takes nearly 2 full minutes to load the data, and uses roughly 2 GB of memory when debugging in Visual Studio. This seems very, very excessive for what's a reasonably small data file.
I've run the same code outside of Visual Studio, from the command line; however, the time is still roughly 2 full minutes, and the process uses about 1.3 GB of memory.
Is there any way to speed this up at all?
Below is my code:
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);
// Gets slow here and takes forever to get to the first item
foreach(IFeature feature in indexMapFile.Features)
{
    // Once inside the loop, it's blazingly quick.
}
Interestingly, when I use the VS immediate window, it's super super fast, no delay at all...
I've managed to figure this out...
For some reason, calling foreach on the features is painfully slow.
However, as these files have a 1-1 mapping with features - data rows (each feature has a relevant data row), I've modified it slightly to the following. It's now very quick.. less than a second to start the iterations.
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);
// Get the map index from the Feature data
for(int i = 0; i < indexMapFile.DataTable.Rows.Count; i++)
{
    // Get the feature
    IFeature feature = indexMapFile.Features.ElementAt(i);
    // Now it's very quick to iterate through and work with the feature.
}
I wonder why this would be. I think I need to look at the iterator on the IFeatureList implementation.
Cheers,
Justin
This has the same problem for very large files (1.2 million features): populating the .Features collection never finishes.
But if you ask for each feature individually, you avoid the memory and delay overhead.
int lRows = fs.NumRows();
for (int i = 0; i < lRows; i++)
{
    // Get the feature
    IFeature pFeat = fs.GetFeature(i);
    StringBuilder sb = new StringBuilder();
    {
        sb.Append(Guid.NewGuid().ToString());
        sb.Append("|");
        sb.Append(pFeat.DataRow["MAPA"]);
        sb.Append("|");
        sb.Append(pFeat.BasicGeometry.ToString());
    }
    pLinesList.Add(sb.ToString());
    lCnt++;
    if (lCnt % 10 == 0)
    {
        pOld = Console.ForegroundColor;
        Console.ForegroundColor = ConsoleColor.DarkGreen;
        Console.Write("\r{0} de {1} ({2}%)", lCnt.ToString(), lRows.ToString(), (100.0 * ((float)lCnt / (float)lRows)).ToString());
        Console.ForegroundColor = pOld;
    }
}
Look for the GetFeature method.

Show progress when searching all files in a directory

I previously asked the question Get all files and directories in specific path fast in order to find files as fast as possible. I am using that solution to find the file names that match a regular expression.
I was hoping to show a progress bar, because with some really large and slow hard drives it still takes about 1 minute to execute. The solution I posted on the other link does not tell me how many files remain to be traversed, so I cannot show a progress bar.
One solution I was thinking about was to obtain the size of the directory that I am planning to traverse. For example, when I right-click on the folder C:\Users I am able to get an estimate of how big that directory is. If I know the size, then I can show the progress by adding up the size of every file that I find. In other words, progress = (current sum of file sizes) / directory size
For some reason I have not been able to get the size of that directory efficiently.
Some of the questions on Stack Overflow use the following approach:
But note that I get an exception and am not able to enumerate the files. I am curious to try that method on my C drive.
In that picture I was trying to count the number of files in order to show progress. I will probably not be able to get the number of files efficiently using that approach. I was just trying some of the answers on Stack Overflow where people asked how to get the number of files in a directory and how to get the size of a directory.
Solving this is going to leave you with one of a few possibilities...
1. Not displaying progress at all
2. Paying an up-front cost to compute the total (like Windows does)
3. Performing the operation while computing the cost
If the speed is that important and you expect large directory trees I would lean to the last of these options. I've added an answer on the linked question Get all files and directories in specific path fast that demonstrates a faster means of counting files and sizes than you are currently using. To combine this into a multi-threaded piece of code for option #3, the following can be performed...
static void Main()
{
    const string directory = @"C:\Program Files";
    // Create an enumeration of the files we will want to process that simply accumulates these values...
    long total = 0;
    var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    fcounter.RaiseOnAccessDenied = false;
    fcounter.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                Interlocked.Increment(ref total);
            }
        };
    // Start a high-priority thread to perform the accumulation
    Thread t = new Thread(fcounter.Find)
    {
        IsBackground = true,
        Priority = ThreadPriority.AboveNormal,
        Name = "file enum"
    };
    t.Start();
    // Allow the accumulator thread to get a head-start on us
    do { Thread.Sleep(100); }
    while (total < 100 && t.IsAlive);
    // Now we can process the files normally and update a percentage
    long count = 0, percentage = 0;
    var task = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    task.RaiseOnAccessDenied = false;
    task.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                ProcessFile(e.FullPath);
                // Update the percentage complete...
                long progress = ++count * 100 / Interlocked.Read(ref total);
                if (progress > percentage && progress <= 100)
                {
                    percentage = progress;
                    Console.WriteLine("{0}% complete.", percentage);
                }
            }
        };
    task.Find();
}
The FindFile class implementation can be found at FindFile.cs.
Depending on how expensive your file-processing task is (the ProcessFile function above) you should see a very clean progression of the progress on large volumes of files. If your file-processing is extremely fast, you may want to increase the lag between the start of enumeration and start of processing.
The event argument is of type FindFile.FileFoundEventArgs and is a mutable class, so be sure you don't keep a reference to the event argument, as its values will change.
Ideally you will want to add error handling and probably the ability to abort both enumerations. Aborting the enumeration can be done by setting "CancelEnumeration" on the event argument.
What you are asking may not be possible because of how the file system stores its data.
It is a file system limitation
There is no way to know the total size of a folder, or the total file count inside a folder, without enumerating the files one by one. Neither piece of information is stored in the file system.
This is why Windows shows a message like "Calculating space" before copying folders with a lot of files: it is actually counting how many files there are inside the folder and summing their sizes, so that it can show the progress bar while doing the real copy operation. (It also uses this information to check whether the destination has enough space to hold all the data being copied.)
Also when you right-click a folder, and go to properties, note that it takes some time to count all files and to sum all the file sizes. That is caused by the same limitation.
To know how large a folder is, or how many files are there inside a folder, you must enumerate the files one-by-one.
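A minimal sketch of such a one-by-one enumeration, collecting both the file count and the total size in a single pass, is shown below. It assumes .NET 4 or later for DirectoryInfo.EnumerateFiles, and the names FolderTotals and Measure are made up for illustration. Note that with SearchOption.AllDirectories the enumeration throws if it reaches a folder it is not allowed to read, so real code would need to handle that.

```csharp
using System;
using System.IO;

public static class FolderTotals
{
    // Walks 'root' once and reports how many files it contains
    // and their combined size in bytes.
    public static void Measure(string root, out long fileCount, out long totalBytes)
    {
        fileCount = 0;
        totalBytes = 0;
        var dir = new DirectoryInfo(root);
        foreach (FileInfo file in dir.EnumerateFiles("*", SearchOption.AllDirectories))
        {
            fileCount++;
            totalBytes += file.Length; // the size is taken from the FileInfo returned by the enumeration
        }
    }
}
```

Once totalBytes is known, the progress of the real operation is simply the running sum of processed file sizes divided by totalBytes.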
Fast files enumeration
Of course, as you already know, there are a lot of ways of doing the enumeration itself... but none will be instantaneous. You could try using the USN Journal of the file system to do the scan. Take a look at this project in CodePlex: MFT Scanner in VB.NET (the code is actually in C#... don't know why the author says it is VB.NET) ... it found all the files in my IDE SATA (not SSD) drive in less than 15 seconds, and found 311000 files.
You will have to filter the files by path, so that only the files inside the path you are looking are returned. But that is the easy part of the job!
Hope this helps in your project... good luck!

Directory file size calculation - how to make it faster?

Using C#, I am finding the total size of a directory. The logic is as follows: get the files inside the folder, sum up their sizes, check whether there are subdirectories, then recurse into them.
I tried another way to do this too: using FSO (obj.GetFolder(path).Size). There's not much difference in time between these two approaches.
Now the problem is that I have tens of thousands of files in a particular folder, and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it finishes very quickly (5 seconds). I think Windows is caching the file sizes.
Is there any way I can bring down the time taken when I run the program for the first time?
I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad-core). I don't know if that holds in all cases, but give it a try...
.NET4.0 Code (or use 3.5 with TaskParallelLibrary)
private static long DirSize(string sourceDir, bool recurse)
{
    long size = 0;
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
    }
    if (recurse)
    {
        string[] subdirEntries = Directory.GetDirectories(sourceDir);
        Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points (junctions, symlinks) to avoid loops and double counting
            if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
            {
                subtotal += DirSize(subdirEntries[i], true);
            }
            // Always return the running subtotal; returning 0 would discard what this thread has summed so far
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
Hard disks are an interesting beast: sequential access (reading a big contiguous file, for example) is super zippy, figure 80 MB/s. However, random access is very slow. This is what you're bumping into: recursing into the folders won't read much data (in terms of quantity), but it will require many random reads. The reason you're seeing zippy perf the second go-around is that the MFT is still in RAM (you're correct on the caching thought).
The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is that you read and parse the MFT in one linear pass, building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.
Some good reading:
NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1
FWIW: this method is very complicated, as there really isn't a great way to do this in Windows (or any OS I'm aware of). The problem is that the act of figuring out which folders/files are needed requires a lot of head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.
The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes slower. Since file writes are much more common than reads of directory sizes, this is a reasonable tradeoff.
I am not sure what exact problem is being solved but if it is file system monitoring it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
Performance will suffer using any method when scanning a folder with tens of thousands of files.
Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.
Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.
How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.
You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.
I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using BIOS).
Yes, the operating-system caches file access.
Suggestions
I wonder if BackupRead would help in your case?
What if you shell out to DIR and capture then parse its output? (You are not really parsing because each DIR line is fixed-width, so it is just a matter of calling substring.)
What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access.
Based on the answer by spookycoder, I found this variation (using DirectoryInfo) at least 2 times faster (and up to 10 times faster on complex folder structures!) :
public static long CalcDirSize(string sourceDir, bool recurse = true)
{
    return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
}
private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
{
    long size = 0;
    FileInfo[] fiEntries = di.GetFiles();
    foreach (var fiEntry in fiEntries)
    {
        Interlocked.Add(ref size, fiEntry.Length);
    }
    if (recurse)
    {
        DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
        System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            // Skip reparse points, but keep the subtotal this thread has accumulated so far
            if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint) return subtotal;
            subtotal += _CalcDirSize(diEntries[i], true);
            return subtotal;
        },
        (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.
I don't think there's any really quick way of doing it however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and to list the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.
PInvoke has a sample of how to use the API.
With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you calculate the size, the files have changed and your data is already wrong.
So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.
It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)
Then, the only thing to look out for would be that you would have multiple resources both trying to access the resulting output file. So just make sure that you take that into account.
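The monitoring idea above can be sketched roughly as follows. This is a minimal sketch under several assumptions: the class name DirectorySizeIndex is hypothetical, the index lives in memory rather than in a shared output file, paths are compared case-insensitively (a Windows assumption), and events lost to watcher buffer overflows or files removed along with a deleted parent folder are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class DirectorySizeIndex : IDisposable
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, long> _sizes =
        new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase); // Windows-style paths assumed
    private readonly FileSystemWatcher _watcher;
    private long _total;

    public DirectorySizeIndex(string root)
    {
        _watcher = new FileSystemWatcher(root);
        _watcher.IncludeSubdirectories = true;
        _watcher.NotifyFilter = NotifyFilters.FileName | NotifyFilters.Size | NotifyFilters.LastWrite;
        _watcher.Created += (s, e) => Refresh(e.FullPath);
        _watcher.Changed += (s, e) => Refresh(e.FullPath);
        _watcher.Deleted += (s, e) => Forget(e.FullPath);
        _watcher.Renamed += (s, e) => { Forget(e.OldFullPath); Refresh(e.FullPath); };
        _watcher.EnableRaisingEvents = true;

        // One full scan at startup; afterwards only changes are processed.
        foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            Refresh(file);
        }
    }

    public long TotalBytes
    {
        get { lock (_gate) { return _total; } }
    }

    // Records (or re-records) the current size of one file. Safe to call repeatedly.
    public void Refresh(string path)
    {
        long size;
        try
        {
            var info = new FileInfo(path);
            if (!info.Exists) return; // directories and vanished files are ignored
            size = info.Length;
        }
        catch (IOException) { return; }
        catch (UnauthorizedAccessException) { return; }

        lock (_gate)
        {
            long old;
            if (_sizes.TryGetValue(path, out old)) _total -= old;
            _sizes[path] = size;
            _total += size;
        }
    }

    // Removes one file from the index; its last known size is subtracted.
    public void Forget(string path)
    {
        lock (_gate)
        {
            long old;
            if (_sizes.TryGetValue(path, out old))
            {
                _total -= old;
                _sizes.Remove(path);
            }
        }
    }

    public void Dispose()
    {
        _watcher.Dispose();
    }
}
```

The per-file dictionary is what makes deletions workable: by the time a Deleted event arrives the file is gone, so its size has to come from the index.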
I gave up on the .NET implementations (for performance reasons) and used the Native function GetFileAttributesEx(...)
Try this:
[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}
public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}
public class NativeMethods {
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
}
Now simply do the following:
WIN32_FILE_ATTRIBUTE_DATA data;
if(NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data)) {
    // Widen to 64 bits before shifting, then OR the two halves together
    long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
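The 64-bit file size is the high 32-bit half times 2^32 plus the low 32-bit half. A small sketch of that arithmetic follows; FileSizeMath and Combine are hypothetical names. The cast matters: in C#, shifting a 32-bit uint by 32 uses only the low five bits of the shift count, so the value is left unchanged instead of being moved into the upper half.

```csharp
using System;

public static class FileSizeMath
{
    // Combines the two 32-bit halves reported by the Win32 API
    // into a single 64-bit size.
    public static long Combine(uint high, uint low)
    {
        // Widen 'high' to 64 bits first, then shift it into the upper half
        // and OR in the lower half.
        return ((long)high << 32) | low;
    }
}
```

For files smaller than 4 GB the high half is zero and the result equals the low half.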

how to increase speed of my execution

I am creating a project in C#/.NET and execution is very slow. I have also found the reason: in one method I copy the values from one list to another, and that list contains more than 3000 values for every row. How can I speed up this process? Can anybody help me?
for (int i = 0; i < rectTristrip.NofStrips; i++)
{
    VertexList verList = new VertexList();
    verList = rectTristrip.Strip[i];
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();
    for (int j = 0; j < rectPath4.PointCount; j++)
    {
        pointList.Add(rectPath4.PathPoints[j]);
    }
}
This is the code that slows down my process. rectTristrip consists of a lot of vertices, and each vertex has more than 3000 values.
A profiler will tell you exactly how much time is spent on which lines and which are most important to optimize. Red-gate makes a very good one.
http://www.red-gate.com/products/ants_performance_profiler/index.htm
As musicfreak already mentioned, you should profile your code to get reliable results on what's going on. But some processes simply take time.
You can't get rid of them; they must be done. The question is just: when are they necessary? Maybe you can move them into an initialization phase, or into another thread that computes the results for you while your GUI stays accessible to your users.
In one of my applications I run a big query against a SQL Server. This task takes a while (building up the connection, sending the query, waiting for the result, putting the result into a data table, doing some calculations of my own, presenting the results to the user). All of these steps are necessary and can't be made any faster. But they are done in another thread, while the user sees a 'Please wait' message with a progress bar in the result window. In the meantime the user can already change other settings in the UI (if he likes). So the UI is responsive, and the user has no big problem waiting a few seconds.
So this is not a real answer, but maybe it gives you some ideas on how to solve your problem.
You can split the load into a couple of worker threads, say 3 threads each dealing with 1000 elements.
You can synchronize it with AutoResetEvent
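A minimal sketch of that split is given below. It is an illustration only: summing integers stands in for the real per-element work, and the names ChunkedWork and SumInParallel are made up. Each worker handles one contiguous chunk and signals its own AutoResetEvent when it is done.

```csharp
using System;
using System.Threading;

public static class ChunkedWork
{
    // Splits 'items' into 'workerCount' contiguous chunks, processes each
    // chunk on its own thread, and waits for all of them to finish.
    public static long SumInParallel(int[] items, int workerCount)
    {
        var done = new AutoResetEvent[workerCount];
        var partial = new long[workerCount];
        int chunk = (items.Length + workerCount - 1) / workerCount; // round up

        for (int w = 0; w < workerCount; w++)
        {
            int index = w; // per-iteration copy for the closure
            int start = Math.Min(index * chunk, items.Length);
            int end = Math.Min(start + chunk, items.Length);
            done[index] = new AutoResetEvent(false);

            new Thread(() =>
            {
                long sum = 0;
                for (int i = start; i < end; i++) sum += items[i];
                partial[index] = sum; // each worker writes only its own slot
                done[index].Set();    // signal completion
            }).Start();
        }

        long total = 0;
        for (int w = 0; w < workerCount; w++)
        {
            done[w].WaitOne(); // block until worker w has finished
            total += partial[w];
            done[w].Dispose();
        }
        return total;
    }
}
```

With 3000 elements and 3 workers, each thread handles 1000 elements. Whether this helps depends on the work being CPU-bound and safe to run concurrently, which the profiler should confirm first.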
Some suggestions, even though I think the bulk of the work is in TristripToGraphicsPath():
// Use rectTristrip.Strip.Length instead of NofStrips
// to let the JIT eliminate bounds checking
// (.Count if it is a list instead of an array)
for (int i = 0; i < rectTristrip.Strip.Length; i++)
{
    VertexList verList = rectTristrip.Strip[i]; // Removed 'new'
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();
    // PathPoints builds a new array on every access, so read it only once
    PointF[] points = rectPath4.PathPoints;
    // Assuming pointList is in fact a List<PointF>, do this:
    pointList.AddRange(points);
    // Else do this instead (not both):
    // Use points.Length instead of PointCount
    // to let the JIT eliminate bounds checking
    // for (int j = 0; j < points.Length; j++)
    // {
    //     pointList.Add(points[j]);
    // }
}
And maybe verList = rectTristrip.Strip[i]; // Removed 'VertexList' to save some memory
Define the variable VertexList verList above the loop.
