I previously asked the question Get all files and directories in specific path fast in order to find files as fast as possible. I am using that solution to find the file names that match a regular expression.
I was hoping to show a progress bar, because with some really large and slow hard drives it still takes about a minute to execute. The solution I posted on the other link does not let me know how many files remain to be traversed, so I cannot show a progress bar.
One solution I was considering was to obtain the size of the directory I was planning to traverse. For example, when I right-click the folder C:\Users I get an estimate of how big that directory is. If I knew the size, I could show progress by adding up the size of every file that I find. In other words, progress = (current sum of file sizes) / directory size.
For some reason I have not been able to get the size of that directory efficiently.
Some of the questions on Stack Overflow use the following approach:
But note that I get an exception and am not able to enumerate the files. I am curious to try that method on my C drive.
In that attempt I was trying to count the number of files in order to show progress. I will probably not be able to get the number of files efficiently using that approach. I was just trying some of the answers on Stack Overflow where people asked how to get the number of files in a directory, and also how to get the size of a directory.
Solving this is going to leave you with one of a few possibilities...
Not displaying any progress
Using an up-front cost to compute (like Windows)
Performing the operation while computing the cost
If the speed is that important and you expect large directory trees I would lean to the last of these options. I've added an answer on the linked question Get all files and directories in specific path fast that demonstrates a faster means of counting files and sizes than you are currently using. To combine this into a multi-threaded piece of code for option #3, the following can be performed...
static void Main()
{
    const string directory = @"C:\Program Files";

    // Create an enumeration of the files we will want to process that simply accumulates these values...
    long total = 0;
    var fcounter = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    fcounter.RaiseOnAccessDenied = false;
    fcounter.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                Interlocked.Increment(ref total);
            }
        };

    // Start a high-priority thread to perform the accumulation
    Thread t = new Thread(fcounter.Find)
    {
        IsBackground = true,
        Priority = ThreadPriority.AboveNormal,
        Name = "file enum"
    };
    t.Start();

    // Allow the accumulator thread to get a head-start on us
    do { Thread.Sleep(100); }
    while (total < 100 && t.IsAlive);

    // Now we can process the files normally and update a percentage
    long count = 0, percentage = 0;
    var task = new CSharpTest.Net.IO.FindFile(directory, "*", true, true, true);
    task.RaiseOnAccessDenied = false;
    task.FileFound +=
        (o, e) =>
        {
            if (!e.IsDirectory)
            {
                ProcessFile(e.FullPath);
                // Update the percentage complete...
                long progress = ++count * 100 / Interlocked.Read(ref total);
                if (progress > percentage && progress <= 100)
                {
                    percentage = progress;
                    Console.WriteLine("{0}% complete.", percentage);
                }
            }
        };
    task.Find();
}
The FindFile class implementation can be found at FindFile.cs.
Depending on how expensive your file-processing task is (the ProcessFile function above) you should see a very clean progression of the progress on large volumes of files. If your file-processing is extremely fast, you may want to increase the lag between the start of enumeration and start of processing.
The event argument is of type FindFile.FileFoundEventArgs and is a mutable class, so be sure you don't keep a reference to the event argument, as its values will change.
Ideally you will want to add error handling and probably the ability to abort both enumerations. Aborting the enumeration can be done by setting "CancelEnumeration" on the event argument.
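For example, cancellation from inside the handler might look like the sketch below (it assumes the CancelEnumeration property mentioned above is settable from the handler, and that userRequestedAbort is a flag you maintain elsewhere):
// Hypothetical abort hook: stop enumerating once an external flag is set
// (CancelEnumeration is the property named in the answer above).
task.FileFound += (o, e) =>
{
    if (userRequestedAbort)           // assumed flag maintained elsewhere in your app
        e.CancelEnumeration = true;
};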
What you are asking may not be possible because of how the file system stores its data.
It is a file system limitation
There is no way to know the total size of a folder, or the total number of files inside a folder, without enumerating the files one by one. Neither piece of information is stored in the file system.
This is why Windows shows a message like "Calculating space" before copying folders with a lot of files... it is actually counting how many files are inside the folder and summing their sizes, so that it can show the progress bar while doing the real copy operation. (It also uses this information to know whether the destination has enough space to hold all the data being copied.)
Also when you right-click a folder, and go to properties, note that it takes some time to count all files and to sum all the file sizes. That is caused by the same limitation.
To know how large a folder is, or how many files there are inside a folder, you must enumerate the files one by one.
Fast file enumeration
Of course, as you already know, there are a lot of ways of doing the enumeration itself... but none will be instantaneous. You could try using the USN Journal of the file system to do the scan. Take a look at this project on CodePlex: MFT Scanner in VB.NET (the code is actually in C#... I don't know why the author says it is VB.NET)... it found all the files on my IDE SATA (not SSD) drive in less than 15 seconds, and found 311000 files.
You will have to filter the files by path, so that only the files inside the path you are searching are returned. But that is the easy part of the job!
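A minimal sketch of that filtering step, assuming the scanner hands back full paths as strings (the method and variable names here are hypothetical):
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keep only the scanner results that live under 'root'.
static IEnumerable<string> FilterToRoot(IEnumerable<string> allPaths, string root)
{
    // Normalize with a trailing slash so "C:\Data" does not also match "C:\Database\...".
    if (!root.EndsWith("\\")) root += "\\";
    return allPaths.Where(p => p.StartsWith(root, StringComparison.OrdinalIgnoreCase));
}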
Hope this helps in your project... good luck!
Related
In order to update a progress bar, I need the number of files to extract. My program goes over a list of Zip files and collects the number of files in them. The combined number is approximately 22000 files.
The code I am using:
foreach (string filepath in zipFiles)
{
    ZipArchive zip = ZipFile.OpenRead(filepath);
    archives.Add(zip);
    filesCounter += zip.Entries.Count;
}
However, it looks like zip.Entries.Count does some kind of traversal, and it takes ages for this count to complete (several minutes, and much, much more if the internet connection is not great).
To get some notion of how much this could improve, I compared the above to the performance of 7-Zip.
I took one of the zip files that contains ~11000 files and folders:
2 seconds to open the archive in 7-Zip.
1 second to get the file properties.
In the properties I can see 10016 files + 882 folders - meaning it takes 7-Zip ~3 seconds to know there are 10898 entries in the Zip file.
Any Idea, suggestion or any alternative method, that quickly counts the number of files, will be appreciated.
Using DotNetZip to count is actually much faster, but due to some internal bureaucratic issues, I can't use it.
I need to have a solution not involving third party libraries, I can still use Microsoft Standard Libraries.
My progress bar issue is solved by taking a new approach to the matter.
I simply accumulate all the ZIP file sizes, which serves as the max value. Now, for each individual file that is extracted, I add its compressed size to the progress. This way the progress bar does not show me the number of files; it shows me the progress in bytes (e.g. if, in total, I have 4 GB to extract, then when the progress bar is 1/4 green I know I have gone through 1 GB). It looks like a better representation of reality.
foreach (string filepath in zipFiles)
{
    ZipArchive zip = ZipFile.OpenRead(filepath);
    archives.Add(zip);

    // Accumulating the Zip files sizes.
    filesCounter += new FileInfo(filepath).Length;
}

// To utilize multiple processors it is possible to activate this loop
// in a thread for each ZipArchive -> currentZip!
// :
// :
foreach (ZipArchiveEntry entry in currentZip.Entries)
{
    // Doing my extract code here.
    // :
    // :

    // Accumulate the compressed size of each file.
    compressedFileSize += entry.CompressedLength;

    // Doing other stuff
    // :
    // :
}
So the issue of improving the performance of zip.Entries.Count is still open, and I am still interested in knowing how to solve this specific issue (what does 7-Zip do to be so quick? Maybe they use DotNetZip or other C++ libraries).
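For what it's worth, the ZIP format itself stores the total entry count in the End of Central Directory record at the very end of the archive, which is presumably how 7-Zip can report it almost instantly. Below is a minimal sketch that reads just that record; it assumes a plain (non-ZIP64) archive on a seekable local file and does no validation beyond locating the signature. Note that the count includes folder entries as well as files, just like the 7-Zip properties dialog.
using System;
using System.IO;

static int CountZipEntries(string zipPath)
{
    using (var fs = new FileStream(zipPath, FileMode.Open, FileAccess.Read))
    {
        // The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes.
        int tailLength = (int)Math.Min(fs.Length, 22 + 65535);
        var tail = new byte[tailLength];
        fs.Seek(-tailLength, SeekOrigin.End);
        fs.Read(tail, 0, tailLength);

        // Scan backwards for the EOCD signature "PK\x05\x06" (0x06054B50).
        for (int i = tailLength - 22; i >= 0; i--)
        {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B && tail[i + 2] == 0x05 && tail[i + 3] == 0x06)
            {
                // Offset 10 within the record: total number of central-directory entries (2 bytes).
                return BitConverter.ToUInt16(tail, i + 10);
            }
        }
    }
    throw new InvalidDataException("End of Central Directory record not found.");
}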
I need to process a large amount of text data and then save it to the hard drive in zip archives. The task is complicated by the fact that the processing must be multithreaded.
...
ZipSaver saver = new ZipSaver(10000); // 10000 is the number of items after which the archive is saved to the hard drive
Parallel.ForEach(source, item => {
    string workResult = ModifyItem(item);
    saver.AddItem(workResult);
});
Part of the ZipSaver class (it uses the Ionic ZipFile library):
private ConcurrentQueue<ZipFile> _pool;

public void AddItem(string src){
    ZipFile currentZipFile;
    if(_pool.TryDequeue(out currentZipFile) == false){
        currentZipFile = InitNewZipFile(); // if no archive is available in the pool, create a new one
    }
    currentZipFile.AddEntry(path, src);
    // if after an item is added to the archive, you have reached the maximum number of elements,
    // specified in the constructor, save this file to your hard drive,
    // else return the archive into a common pool
    if(currentZipFile.Entries.Count > _maxEntries){
        SaveZip(currentZipFile);
    }else{
        _pool.Enqueue(currentZipFile);
    }
}
Of course, I can play with the maximum number of items in the archive, but that depends on the size of the output file, which ideally should be configurable. Right now, because the collection processed in the loop has many items, many threads are created and, in practice, each of them ends up with its "own" ZipFile instance, which leads to running out of RAM.
How can I improve the saving mechanism? And sorry for my English =)
What about limiting the number of concurrent threads, which will limit the number of ZipFile instances you have in the queue? For example:
Parallel.ForEach(source,
    new ParallelOptions { MaxDegreeOfParallelism = 3 },
    item =>
    {
        string workResult = ModifyItem(item);
        saver.AddItem(workResult);
    });
It might also be that 10,000 items is too many. If the files you're adding are each 1 megabyte in size, then 10,000 of them is going to create a 10 gigabyte file. That's likely to make you run out of memory.
You need to limit the zip file by size rather than by number of files. I don't know if DotNetZip will let you see how many bytes are currently in the output buffer. If nothing else, you can estimate your compression ratio and use that to limit the size by counting up the uncompressed bytes. That is, if you expect a 50% compression ratio and you want to limit your output file sizes to 1 gigabyte, then you need to limit your total input to 2 gigabytes (i.e. 1 gb/0.5 = 2 gb).
It would be best if you could see the current output size. I'm not familiar with DotNetZip, so I can't say whether it has that capability.
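Building on that, a minimal sketch of the size-based flush is shown below. It is only an illustration: it assumes the pool is changed to hold a small wrapper (here called PooledArchive), that _maxOutputBytes and _estimatedRatio are fields you add (for example 1 GB and 0.5), and that InitNewZipFile and SaveZip are the methods from the question.
// Requires Ionic.Zip, System.Text and System.Collections.Concurrent.
// Hypothetical wrapper so each pooled archive carries its running input size.
private class PooledArchive
{
    public ZipFile Zip;
    public long UncompressedBytes;
}

private ConcurrentQueue<PooledArchive> _pool = new ConcurrentQueue<PooledArchive>();

public void AddItem(string entryName, string src)
{
    PooledArchive current;
    if (!_pool.TryDequeue(out current))
        current = new PooledArchive { Zip = InitNewZipFile() };

    current.Zip.AddEntry(entryName, src);
    current.UncompressedBytes += Encoding.UTF8.GetByteCount(src);

    // Flush once the *estimated* compressed size reaches the limit.
    if (current.UncompressedBytes * _estimatedRatio >= _maxOutputBytes)
        SaveZip(current.Zip);
    else
        _pool.Enqueue(current);
}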
I have a physical directory structure like this:
Root directory (X) -> many subdirectories inside the root (1, 2, 3, 4, ...) -> in each subdirectory, many files.
Photos(Root)
----
123456789(Child One)
----
1234567891_w.jpg (Child two)
1234567891_w1.jpg(Child two)
1234567891_w2.jpg(Child two)
1234567892_w.jpg (Child two)
1234567892_w1.jpg(Child two)
1234567892_w2.jpg(Child two)
1234567893_w.jpg(Child two)
1234567893_w1.jpg(Child two)
1234567893_w2.jpg(Child two)
-----Cont
232344343(Child One)
323233434(Child One)
232323242(Child One)
232324242(Child One)
----Cont..
In the database I have one table holding a huge number of names of the form "1234567891_w.jpg".
NOTE: Both the number of records in the database and the number of photos are in lakhs (hundreds of thousands).
I need an effective and fast way to check the presence of each name from the database table in the physical directory structure.
Ex: whether any file with the name "1234567891_w.jpg" is present in a physical folder inside Photos (Root).
Please let me know if I have missed any information that should be given here.
Update:
I know how to check whether a file name exists in a directory. But I am looking for an efficient way, as it would be too resource-intensive to check the existence of each file name (from lakhs of records) in more than 40 GB of data.
You can try to group the data from the database based on the directory they are in, sort them somehow (based on the file name, for instance), and then get the array of files within that directory with
string[] filePaths = Directory.GetFiles(@"c:\MyDir\");. Now you only have to compare strings.
It might sound funny, or maybe I was unclear or did not provide much information...
But from the directory pattern I found one nice way to handle it:
As the file name can exist in only one location, namely:
Root/SubDir/filename
I should be using:
File.Exists(Path.Combine(root, subDir, fileName));
i.e. Photos/123456789/1234567891_w.jpg
And I think this will be O(1).
It would seem the files are uniquely named. If that's the case, you can do something like this:
var fileNames = GetAllFileNamesFromDb();

// Enumerate every physical file once, from the root and including all subdirectories,
// keeping only the file-name part. (A single "*" pattern is used here because
// Directory.GetFiles does not accept a comma-separated list of patterns.)
var physicalFiles = Directory.GetFiles(rootDir, "*", SearchOption.AllDirectories)
                             .Select(f => Path.GetFileName(f));

// Create a HashSet for fast lookup.
var setOfFiles = new HashSet<string>(physicalFiles);

// Database names that have no matching physical file.
var notPresent = from name in fileNames
                 where !setOfFiles.Contains(name)
                 select name;
First get all the names of the files from the database.
Then enumerate all the files at once, searching from the root and including all subdirectories, to get all the physical files.
Create a HashSet for fast lookup.
Then match the file names against the set; those not in the set are selected.
The HashSet is basically just a set, that is, a collection that can only include an item once (i.e. there are no duplicates). Equality in the HashSet is based on the hash code, and the lookup to determine whether an item is in the set is O(1).
This approach requires you to store a potentially huge HashSet in memory, and depending on the size of that set it might affect the system to the point where it no longer speeds up the application but passes the optimum instead.
As is the case with most optimizations, they are all trade-offs, and the key is finding the balance between all the trade-offs in the context of the value the application is producing for the end user.
Unfortunately there is no magic bullet you could use to improve your performance. As always, it will be a trade-off between speed and memory. Also, there are two sides that could limit performance: the database side and the HDD I/O speed.
So to gain speed, I would first improve the performance of the database query to ensure that it can return the names for searching fast enough. Ensure that your query is fast and maybe uses (in the MS SQL case) keywords like READ SEQUENTIAL; in that case you will already retrieve the first results while the query is still running, and you don't have to wait until the query has finished and given you the names as one big block.
On the HDD side, you can either call Directory.GetFiles(), but this call blocks until it has iterated over all files and gives you back one big array containing all file names. This is the memory-consuming path and takes a while for the first search, but if you afterwards only work on that array, you get speed improvements for all consecutive searches. Another approach is to call Directory.EnumerateFiles(), which searches the drive on the fly with every call and so may gain speed for the first search, but nothing is kept in memory for the next search. That improves the memory footprint but costs speed, because there is no array in memory that could be searched. On the other hand, the OS will do some caching if it detects that you iterate over the same files over and over again, and some caching also happens at a lower level.
So for the check on the HDD side, use Directory.GetFiles() if the returned array won't blow up your memory, and do all your searches on this array (maybe put it into a HashSet to further improve performance; whether you store the file name only or the full path depends on what you get from your database). Otherwise use Directory.EnumerateFiles() and hope for the best regarding caching done by the OS.
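A minimal sketch of building the lookup set described above; the root path, the case-insensitive comparison, and the namesFromDatabase variable are assumptions:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Build the lookup once: file names only, compared case-insensitively (the NTFS default).
string root = @"D:\Photos";                          // hypothetical root directory
var onDisk = new HashSet<string>(
    Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
             .Select(p => Path.GetFileName(p)),
    StringComparer.OrdinalIgnoreCase);

// Each database name is then a constant-time lookup.
foreach (string name in namesFromDatabase)           // assumed to stream from your query
{
    bool exists = onDisk.Contains(name);
    // ... record or report the result ...
}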
Update
After re-reading your question and comments: as far as I understand, you have a name like 1234567891_w.jpg and you don't know which part of the name represents the directory part. In that case you need an explicit search, because iterating through all directories simply takes too much time. Here is some sample code, which should give you an idea of how to solve this as a first attempt:
string rootDir = @"D:\RootDir";

// Iterate over all files reported from the database
foreach (var filename in databaseResults)
{
    var fullPath = Path.Combine(rootDir, filename);

    // Check if the file exists within the root directory
    if (File.Exists(fullPath))
    {
        // Report that the file exists.
        DoFileFound(fullPath);
        // Fast exit to continue with next file.
        continue;
    }

    var directoryFound = false;
    // Use the filename as a directory
    var directoryCandidate = Path.GetFileNameWithoutExtension(filename);

    do
    {
        // Recompute the path for the current (possibly shortened) candidate.
        fullPath = Path.Combine(rootDir, directoryCandidate);

        // Check if a directory with the given name exists
        if (Directory.Exists(fullPath))
        {
            // Check if the filename within this directory exists
            if (File.Exists(Path.Combine(fullPath, filename)))
            {
                // Report that the file exists.
                DoFileFound(fullPath);
                directoryFound = true;
            }
            // Fast exit, because we looked into the directory.
            break;
        }

        // Is it possible that a shorter directory name
        // exists where this file exists??
        // If yes, we have to continue the search ...
        // (Alternative code to the above one)
        ////// Check if a directory with the given name exists
        ////if (Directory.Exists(fullPath))
        ////{
        ////    // Check if the filename within this directory exists
        ////    if (File.Exists(Path.Combine(fullPath, filename)))
        ////    {
        ////        // Report that the file exists.
        ////        DoFileFound(fullPath);
        ////        // Fast exit, because we found the file.
        ////        directoryFound = true;
        ////        break;
        ////    }
        ////}

        // Shorten the directory name for the next candidate
        directoryCandidate = directoryCandidate.Substring(0, directoryCandidate.Length - 1);
    } while (!directoryFound
             && !String.IsNullOrEmpty(directoryCandidate));

    // We did our best but we found nothing.
    if (!directoryFound)
        DoFileNotAvailable(filename);
}
The only further performance improvement I could think of would be putting the directories found into a HashSet and, before checking with Directory.Exists(), using this set to check for an existing directory. But maybe this wouldn't gain anything, because the OS already does some caching of directory lookups and would then be nearly as fast as your local cache. For these things you simply have to measure your concrete problem.
I have to remove duplicate strings from an extremely big text file (100 GB+).
Since in-memory duplicate removal is hopeless due to the size of the data, I have tried a Bloom filter, but it was of no use beyond something like 50 million strings.
The total number of strings is something like 1 trillion+.
I want to know what the ways to solve this problem are.
My initial attempt is dividing the file into a number of sub-files, sorting each file, and then merging all the files together...
If you have a better solution than this, please let me know.
Thanks..
The key concept you are looking for here is external sorting. You should be able to merge sort the whole file using the techniques described in that article and then run through it sequentially to remove duplicates.
If the article is not clear enough, have a look at the referenced implementations, such as this one.
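As a rough illustration of that approach, here is a sketch of a chunked sort followed by a duplicate-dropping merge. Everything here is an assumption for illustration (line-based records, ordinal comparison, chunk size chosen by the caller, temp files left behind), and the simple O(k) scan over the chunk heads should be replaced by a heap for a large number of chunks:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Phase 1: split the input into sorted chunks that fit in memory.
static List<string> SortChunks(string inputPath, int linesPerChunk)
{
    var chunkFiles = new List<string>();
    using (var reader = new StreamReader(inputPath))
    {
        var buffer = new List<string>(linesPerChunk);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            buffer.Add(line);
            if (buffer.Count == linesPerChunk)
                FlushChunk(buffer, chunkFiles);
        }
        if (buffer.Count > 0)
            FlushChunk(buffer, chunkFiles);
    }
    return chunkFiles;
}

static void FlushChunk(List<string> buffer, List<string> chunkFiles)
{
    buffer.Sort(StringComparer.Ordinal);
    string path = Path.GetTempFileName();
    File.WriteAllLines(path, buffer);
    chunkFiles.Add(path);
    buffer.Clear();
}

// Phase 2: merge the sorted chunks, writing each distinct line exactly once.
static void MergeDistinct(List<string> chunkFiles, string outputPath)
{
    var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
    var heads = readers.Select(r => r.ReadLine()).ToList();
    string previous = null;

    using (var writer = new StreamWriter(outputPath))
    {
        while (true)
        {
            // Pick the smallest current head line (simple O(k) scan; a heap scales better).
            int min = -1;
            for (int i = 0; i < heads.Count; i++)
                if (heads[i] != null &&
                    (min < 0 || StringComparer.Ordinal.Compare(heads[i], heads[min]) < 0))
                    min = i;
            if (min < 0) break;                      // every chunk is exhausted

            if (previous == null || heads[min] != previous)
            {
                writer.WriteLine(heads[min]);        // first time we see this string
                previous = heads[min];
            }
            heads[min] = readers[min].ReadLine();    // advance that chunk
        }
    }
    readers.ForEach(r => r.Dispose());
}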
You can make a second file which contains records; each record is a 64-bit CRC plus the offset of the string, and the file should be indexed for fast searching.
Something like this:
ReadFromSourceAndSort()
{
    offset = 0;
    while(!EOF)
    {
        string = ReadFromFile();
        crc64 = crc64(string);
        if(lookUpInCache(crc64))
        {
            skip;
        } else {
            WriteToCacheFile(crc64, offset);
            WriteToOutput(string);
        }
    }
}
How do you make a good cache file? It should be sorted by CRC64 to search fast. So you should structure this file like a binary search tree, but with fast insertion of new items without moving existing ones in the file. To improve speed you need to use memory-mapped files.
Possible answer:
memory = ReserveMemory(100 Mb);
mapfile = MapMemoryToFile(memory, "\\temp\\map.tmp"); // the file can be bigger; the mapping is just a window
currentWindowNumber = 0;
while(!EndOfFile)
{
    ReadFromSourceAndSort(); // but only for the first 100 Mb in memory
    currentWindowNumber++;
    MoveMapping(currentWindowNumber);
}
And the lookup function should not use the mapping (because each window switch saves 100 MB to the HDD and loads 100 MB of the next window). It just seeks in the 100 MB trees of CRC64 values, and if the CRC64 is found, the string is already stored.
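.NET has no built-in CRC-64, so here is a minimal table-driven sketch of one common reflected variant (the CRC-64/XZ parameters are assumed); any 64-bit hash with a sufficiently low collision rate would do for this cache file:
using System;
using System.Text;

static class Crc64
{
    // Reflected form of the ECMA-182 polynomial, as used by CRC-64/XZ.
    const ulong Poly = 0xC96C5795D7870F42UL;
    static readonly ulong[] Table = BuildTable();

    static ulong[] BuildTable()
    {
        var table = new ulong[256];
        for (int i = 0; i < 256; i++)
        {
            ulong crc = (ulong)i;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ Poly : crc >> 1;
            table[i] = crc;
        }
        return table;
    }

    public static ulong Compute(string s)
    {
        ulong crc = ulong.MaxValue;                      // init = all ones
        foreach (byte b in Encoding.UTF8.GetBytes(s))
            crc = Table[(byte)(crc ^ b)] ^ (crc >> 8);
        return crc ^ ulong.MaxValue;                     // final xor
    }
}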
Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, find whether there are subdirectories, and then do a recursive search.
I tried another way to do this too, using FSO (obj.GetFolder(path).Size). There is not much difference in time between these two approaches.
Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 secs). I think Windows is caching the file sizes.
Is there any way I can bring down the time taken when I run the program the first time?
I fiddled with it a while, trying to parallelize it, and surprisingly it sped things up here on my machine (up to 3 times on a quad-core). I don't know if it is valid in all cases, but give it a try...
.NET 4.0 code (or use 3.5 with the Task Parallel Library):
private static long DirSize(string sourceDir, bool recurse)
{
    long size = 0;
    string[] fileEntries = Directory.GetFiles(sourceDir);

    foreach (string fileName in fileEntries)
    {
        Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
    }

    if (recurse)
    {
        string[] subdirEntries = Directory.GetDirectories(sourceDir);

        Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
            {
                subtotal += DirSize(subdirEntries[i], true);
                return subtotal;
            }
            return 0;
        },
            (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
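For reference, a hypothetical call site (the path is only an example):
// Reparse points (junctions/symbolic links) are already skipped inside DirSize.
long bytes = DirSize(@"C:\Users\Public", true);
Console.WriteLine("{0:N0} bytes", bytes);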
Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, figure 80 megabytes/sec. Random access, however, is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy perf the second time around is because the MFT is still in RAM (you're correct about the caching).
The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is you read and parse the MFT in one linear pass building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.
some good reading:
NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1
FWIW: this method is very complicated as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that the act of figuring out which folders/files are needed requires much head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.
The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes slower. Since it is much more common to do file writes than to read directory sizes, this is a reasonable tradeoff.
I am not sure what exact problem is being solved but if it is file system monitoring it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
Performance will suffer using any method when scanning a folder with tens of thousands of files.
Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.
Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.
How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.
You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.
I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file allocation table (probably using the BIOS).
Yes, the operating-system caches file access.
Suggestions
I wonder if BackupRead would help in your case?
What if you shell out to DIR and capture, then parse, its output? (You are not really parsing, because each DIR line is fixed-width, so it is just a matter of calling Substring.) A minimal sketch appears after this list.
What if you shell out to DIR /B > NULL on a background thread then run your program? While DIR is running, you will benefit from the cached file access.
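Here is a sketch of the shell-out idea, purely illustrative: it uses DIR /S /B /A-D so each output line is one file, and simply counts lines instead of parsing the fixed-width listing. The folder path is an assumption.
using System;
using System.Diagnostics;

// Run DIR in bare recursive mode, files only, and count the lines it prints.
var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /b /a-d \"C:\\SomeFolder\"")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};

int fileCount = 0;
using (var process = Process.Start(psi))
{
    string line;
    while ((line = process.StandardOutput.ReadLine()) != null)
        fileCount++;                      // one line per file with /a-d
    process.WaitForExit();
}
Console.WriteLine("{0} files", fileCount);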
Based on the answer by spookycoder, I found this variation (using DirectoryInfo) at least 2 times faster (and up to 10 times faster on complex folder structures!):
public static long CalcDirSize(string sourceDir, bool recurse = true)
{
    return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
}

private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
{
    long size = 0;
    FileInfo[] fiEntries = di.GetFiles();
    foreach (var fiEntry in fiEntries)
    {
        Interlocked.Add(ref size, fiEntry.Length);
    }

    if (recurse)
    {
        DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
        System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
        {
            if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint) return 0;
            subtotal += _CalcDirSize(diEntries[i], true);
            return subtotal;
        },
            (x) => Interlocked.Add(ref size, x)
        );
    }
    return size;
}
I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.
I don't think there's any really quick way of doing it, however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.
PInvoke has a sample of how to use the API.
With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you calculate the size, the files have changed and your data is already wrong.
So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.
It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)
Then, the only thing to look out for would be that you would have multiple resources both trying to access the resulting output file. So just make sure that you take that into account.
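A minimal sketch of that watcher-plus-index idea follows. The root path, the initial full scan, and the error handling are all assumptions; a real service would also need to handle renames, watcher buffer overflows, and periodic full rescans:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;

class DirectorySizeIndex
{
    readonly ConcurrentDictionary<string, long> _sizes =
        new ConcurrentDictionary<string, long>(StringComparer.OrdinalIgnoreCase);
    readonly FileSystemWatcher _watcher;

    public DirectorySizeIndex(string root)
    {
        // One-off startup cost: seed the index with the current file sizes.
        foreach (var f in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
            _sizes[f] = new FileInfo(f).Length;

        _watcher = new FileSystemWatcher(root)
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.Size
        };
        _watcher.Created += (s, e) => Update(e.FullPath);
        _watcher.Changed += (s, e) => Update(e.FullPath);
        _watcher.Deleted += (s, e) => { long removed; _sizes.TryRemove(e.FullPath, out removed); };
        _watcher.EnableRaisingEvents = true;
    }

    void Update(string path)
    {
        try { _sizes[path] = new FileInfo(path).Length; }
        catch (IOException) { /* file may be gone or locked; ignore in this sketch */ }
    }

    // Answering "how big is the tree?" is now just a sum over the in-memory index.
    public long TotalSize { get { return _sizes.Values.Sum(); } }
}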
I gave up on the .NET implementations (for performance reasons) and used the Native function GetFileAttributesEx(...)
Try this:
[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}

public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}

public class NativeMethods
{
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
}
Now simply do the following:
WIN32_FILE_ATTRIBUTE_DATA data;
if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
{
    // Combine the high and low 32-bit halves into a 64-bit size.
    long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
}