I am iterating through a particular directory in my code using Directory.EnumerateFiles(). I only need the file name for each file present in the directory, but the number of files can be huge. MSDN and various other sources mention that "When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned. When you use GetFiles, you must wait for the whole array of names to be returned before you can access the array.", here.
So, I did a bit of experiment on my machine.
var files = Directory.EnumerateFiles(path);
foreach(var file in files)
{
}
I set a breakpoint before starting the iteration, i.e. at the foreach. I see that files (the returned enumerable of strings) contains all the entries in my directory. Why is this? Is this expected?
In some other sources I see that we need a separate task which does the enumeration while the enumerable is simultaneously iterated in another processing thread, example. I was looking for something like an enumerator that by default fetches, let's say, the first 200 files, and where successive calls to MoveNext(), i.e. when the 201st item is accessed, fetch the next batch of files.
The variable files here is an IEnumerable, lazily evaluated. If you hover over files in the debugger and click 'Results view' then the full evaluation will take place (just as if you'd called, say, ToArray()). Otherwise the files will only be fetched as you need them (i.e. one at a time by the foreach loop).
So when you say:
"I see the files (returned enumerable of strings) contains all the entries in my directory"
I think you are mistaken.
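A quick way to convince yourself of the deferred execution (path here is whatever large directory you were testing with):
var files = Directory.EnumerateFiles(path);   // nothing has been fetched from the file system yet
int shown = 0;
foreach (var file in files)                   // entries are pulled one at a time here
{
    Console.WriteLine(file);
    if (++shown == 5)
        break;                                // the remaining entries are never fetched
}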
Related
I have a folder of files and a spreadsheet with a list of file names. I need to go through each file in the folder and see if that file name exists on the spreadsheet. It seems I can either load all the information (file names and spreadsheet list) into lists and then search from there or I can just loop through the files, get the name as I go, then look through the spreadsheet itself.
As far as I can tell, the benefit to loading them first is that it may make the search code a bit cleaner, but if there are too many files it would be redundant and slower. Working directly with the files and spreadsheet would negate that intermediate step, but the search code would be a little bit messier.
Is there any other clear trade off that I am missing? Is there a best practice for this?
Thanks.
Be careful, as comparing two lists results in an O(n²) problem. This means that if you have 20 files, you will have to make 20 * 20 = 400 comparisons.
Therefore I suggest putting the file names from the spreadsheet into a HashSet<string>. It has a constant lookup time of O(1), which reduces your problem to O(n).
// Gather the file names from the spreadsheet and insert them in a HashSet.
// (This is just simulated here.)
var fileNamesOnSpreadsheet = new HashSet<string>(StringComparer.OrdinalIgnoreCase) {
    "filename 1", "filename 2", "filename 3", "another filename"
};

string folder = @"C:\Data";

foreach (string file in Directory.EnumerateFiles(folder)) {
    if (fileNamesOnSpreadsheet.Contains(file)) {
        // file found in spreadsheet
    } else {
        // file missing from spreadsheet
    }
}
Note that Directory.EnumerateFiles returns the file names including their paths and extensions. If you have the bare file names in the spreadsheet, you can remove the path with
string fileNameOnly = Path.GetFileName(file);
You can also remove the extension with
string fileNameOnly = Path.GetFileNameWithoutExtension(file);
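Putting those pieces together, a sketch of the same loop comparing bare names (this assumes the spreadsheet entries are file names with their extensions):
foreach (string file in Directory.EnumerateFiles(folder)) {
    // Compare the bare name, since the spreadsheet entries have no path.
    string fileNameOnly = Path.GetFileName(file);
    if (fileNamesOnSpreadsheet.Contains(fileNameOnly)) {
        // file found in spreadsheet
    } else {
        // file missing from spreadsheet
    }
}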
Note that this solution reads the files from the folder only once and gets the file names from the spreadsheet only once. Reading information from the file system is time-consuming, and so is extracting information from the spreadsheet.
Directory.EnumerateFiles does not even store the file names in a collection; instead it delivers them one by one as the foreach loop progresses.
So, this solution is very efficient.
See also:
Directory.EnumerateFiles Method
Big O notation - Wikipedia
For small numbers of names and a single search it takes so little time that optimizing the code is probably not worth it and you can do whatever is easiest for you.
For the sheet, it could make sense to load the names into a list* because you will search the list N times (once per file) and searching the list will be faster than searching the sheet. It might also make sense to have the list sorted so searches take O(log N) time instead of O(N).
*As #JonathanWillcock1 notes, other in-memory data structures such as a dictionary might work even better, hiding the details of sorting and searching from you and making your code cleaner.
For the file names you only look at each name once, so iterating through a directory-listing function is all that you need; copying it to a list would double the work.
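If you do try the sorted-list idea mentioned above, a minimal sketch, assuming sheetNames holds the bare file names read from the spreadsheet and folder is the directory to scan:
var sortedNames = new List<string>(sheetNames);
sortedNames.Sort(StringComparer.OrdinalIgnoreCase);

foreach (string file in Directory.EnumerateFiles(folder))
{
    string name = Path.GetFileName(file);
    // BinarySearch returns a non-negative index when the name is present: O(log N) per lookup.
    bool onSheet = sortedNames.BinarySearch(name, StringComparer.OrdinalIgnoreCase) >= 0;
    // ... act on onSheet ...
}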
I'm trying to get a list of files in a specific directory that contains over 20 million files ranging from 2 to 20 KB each.
The problem is that my program throws an OutOfMemoryException every time, while tools like robocopy do a good job copying the folder to another directory with no problem at all. Here's the code I'm using to enumerate files:
List<string> files = new List<string>(Directory.EnumerateFiles(searchDir));
What should I do to solve this problem?
Any help would be appreciated.
You are creating a list of 20 million objects in memory. I don't think you will ever use that, even if it were possible.
Instead, use Directory.EnumerateFiles(searchDir) and iterate over each item one by one,
like:
foreach (var file in Directory.EnumerateFiles(searchDir))
{
    // Copy to other location, or other stuff
}
With your current code, your program first loads 20 million objects into memory, and only then can it iterate over them or perform operations on them.
See: Directory.EnumerateFiles Method (String)
The EnumerateFiles and GetFiles methods differ as follows: When you
use EnumerateFiles, you can start enumerating the collection of
names before the whole collection is returned; when you use
GetFiles, you must wait for the whole array of names to be returned
before you can access the array. Therefore, when you are working with
many files and directories, EnumerateFiles can be more efficient.
The answer above covers one directory level. To be able to enumerate through multiple levels of directories, each having a large number of directories with a large number of files, one can do the following:
public IEnumerable<string> EnumerateFiles(string startingDirectoryPath) {
    var directoryEnumerables = new Queue<IEnumerable<string>>();
    directoryEnumerables.Enqueue(new string[] { startingDirectoryPath });

    while (directoryEnumerables.Any()) {
        var currentDirectoryEnumerable = directoryEnumerables.Dequeue();
        foreach (var directory in currentDirectoryEnumerable) {
            // Yield the files of the current directory one by one.
            foreach (var filePath in Directory.EnumerateFiles(directory)) {
                yield return filePath;
            }
            // Queue the subdirectories to be processed later.
            directoryEnumerables.Enqueue(Directory.EnumerateDirectories(directory));
        }
    }
}
The function will traverse a collection of directories through enumerators, so it will load the directory contents one by one. The only thing left to solve is the depth of the hierarchy...
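For completeness, a possible way to consume it (the hosting class name DirectoryWalker is made up here; the method above could live on any class). Note that Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories) offers a similar built-in recursive enumeration if you don't need to control the traversal yourself.
var walker = new DirectoryWalker();
foreach (var filePath in walker.EnumerateFiles(@"C:\StartingDir"))
{
    // Each file name arrives as the directories are walked, one at a time.
    Console.WriteLine(filePath);
}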
I've got an issue with some directory manipulation.
The problem is that I have an archive that needs to add or purge backup data based on a series of constraints. The constraint that is the issue is that the archive only needs to keep the backup from the previous week.
So when you chart out the steps you would assume:
Check if the directory exists.
Grab the files.
Then purge them.
Then move the following week into the directory.
The problem, though, is that when you try to keep the code and implementation simple, you end up with code that doesn't feel like proper practice.
string[] archiveFiles = Directory.GetFiles(
    Archive, @"*.*", SearchOption.TopDirectoryOnly);

foreach (string archive in archiveFiles)
    File.Delete(archive);
So if you attempt to grab the files with Directory.GetFiles() and it doesn't return a value, according to the documentation:
Return Value Type: System.String[] An array of the full names
(including paths) for the files in the specified directory that match
the specified search pattern and option, or an empty array if no files
are found.
If it returned a null in the array then the loop would actually iterate once, which would be an error. If it returns an array with no elements then the loop is skipped. The second is what I believe it does, which makes this approach feel incorrect.
The only thing I could do would be to use File.Copy(), as it can overwrite files, which would avoid this approach, but even that could be susceptible to the same dilemma of the empty array.
Is that right usage and approach for Directory.GetFiles() or is there a better way?
If it returns a null in the array then that would actually have the
loop iterate once, an error. If it returns an array with no elements
then it will ignore the loop. The second is what I believe it does,
which makes this approach feel incorrect.
If no files match, the array will be empty; there won't be any nulls (how many nulls should an empty directory return?).
So your delete code will not be executed. Makes sense to me.
If you need to delete old files and then copy the new ones, you may want to first move the old files somewhere safe, then copy the new ones, and only then delete the old files.
Maybe I didn't understand the problem here, but I don't see one. I hope the actual code has some try/catch blocks, though.
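A rough sketch of that ordering, with hypothetical paths and no error handling (the real code should wrap each step in try/catch):
// Hypothetical paths; adjust to the real archive layout.
string archiveDir = @"C:\Archive";
string holdingDir = @"C:\Archive_old";
string incomingDir = @"C:\Incoming";

Directory.CreateDirectory(holdingDir);

// 1. Move the old backups somewhere safe first.
foreach (string oldFile in Directory.GetFiles(archiveDir))
    File.Move(oldFile, Path.Combine(holdingDir, Path.GetFileName(oldFile)));

// 2. Copy the new week's files into the archive.
foreach (string newFile in Directory.GetFiles(incomingDir))
    File.Copy(newFile, Path.Combine(archiveDir, Path.GetFileName(newFile)), overwrite: true);

// 3. Only now delete the old backups.
foreach (string oldFile in Directory.GetFiles(holdingDir))
    File.Delete(oldFile);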
I have a physical Directory structure as :
Root directory (X) -> many subdirectories inside the root (1, 2, 3, 4, ...) -> in each subdirectory many files are present.
Photos(Root)
----
123456789(Child One)
----
1234567891_w.jpg (Child two)
1234567891_w1.jpg(Child two)
1234567891_w2.jpg(Child two)
1234567892_w.jpg (Child two)
1234567892_w1.jpg(Child two)
1234567892_w2.jpg(Child two)
1234567893_w.jpg(Child two)
1234567893_w1.jpg(Child two)
1234567893_w2.jpg(Child two)
-----Cont
232344343(Child One)
323233434(Child One)
232323242(Child One)
232324242(Child One)
----Cont..
In database I have one table having huge number of names of type "1234567891_w.jpg".
NOTE: Both the number of records in the database and the number of photos are in lakhs (hundreds of thousands).
I need an effective and fast way to check the presence of each name from the database table in the physical directory structure.
Ex: whether any file with the name "1234567891_w.jpg" is present in a physical folder inside Photos (Root).
Please let me know if I miss any information to be given here.
Update :
I know how to check whether a file name exists in a directory. But I am looking for an efficient way, as it would be far too resource-consuming to check the existence of each file name (from lakhs of records) in more than 40 GB of data.
You can try to group the data from the database based on the directory they are in. Sort them somehow (based on the file name, for instance) and then get the array of files within that directory with
string[] filePaths = Directory.GetFiles(@"c:\MyDir\");. Now you only have to compare strings.
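A rough sketch of that idea, purely illustrative: it assumes the database names are already in a collection called dbFileNames, that rootDir points at the Photos root, and that the subfolder can be derived from the file name (guessed here as its first 9 characters, based on the example layout above):
// ASSUMPTION: the subfolder name is the first 9 characters of the file name,
// e.g. "1234567891_w.jpg" lives under "123456789". Adjust to the real rule.
Func<string, string> folderFor = name => name.Substring(0, 9);

var missing = new List<string>();

// Group the database names by their (assumed) subfolder so each folder
// is listed from disk only once.
foreach (var group in dbFileNames.GroupBy(folderFor))
{
    string dir = Path.Combine(rootDir, group.Key);

    // Bare file names present in that subfolder (empty set if it doesn't exist).
    var onDisk = Directory.Exists(dir)
        ? new HashSet<string>(Directory.GetFiles(dir).Select(f => Path.GetFileName(f)),
                              StringComparer.OrdinalIgnoreCase)
        : new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    missing.AddRange(group.Where(name => !onDisk.Contains(name)));
}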
It might sound funny, or maybe I was unclear or did not provide enough information,
but from the directory pattern I found one nice way to handle it:
as the file name can only exist in one location, and that is:
Root/SubDir/filename
I should be using :
File.Exists(Root/SubDir/filename);
i.e - Photos/123456789/1234567891_w.jpg
And I think this will be O(1)
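For illustration, a minimal sketch of that check; the rule for deriving the subfolder from the name is an assumption (guessed here as the first 9 characters of the name):
string root = @"C:\Photos";           // assumed root directory
string name = "1234567891_w.jpg";     // one name from the database

// ASSUMPTION: subfolder = first 9 characters of the file name.
string subDir = name.Substring(0, 9);

bool exists = File.Exists(Path.Combine(root, subDir, name));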
It would seem the files are uniquely named. If that's the case, you can do something like this:
var fileNames = GetAllFileNamesFromDb();

var physicalFiles = Directory.GetFiles(rootDir, "*", SearchOption.AllDirectories)
                             .Select(f => Path.GetFileName(f));

var setOfFiles = new HashSet<string>(physicalFiles);

var notPresent = from name in fileNames
                 where !setOfFiles.Contains(name)
                 select name;
First get all the names of the files from the database.
Then search for all the files at once, starting from the root and including all subdirectories, to get all the physical files.
Create a HashSet for fast lookup.
Then match the file names against the set; those not in the set are selected.
The HashSet is basically just a set, that is, a collection that can include an item only once (i.e. there are no duplicates). Equality in the HashSet is based on the hash code, and the lookup to determine whether an item is in the set is O(1).
This approach requires you to store a potentially huge HashSet in memory, and depending on the size of that set it might affect the system to the extent that it's no longer optimizing the speed of the application but has passed the optimum instead.
As is the case with most optimizations, they are all trade-offs, and the key is finding the balance between those trade-offs in the context of the value the application is producing for the end user.
Unfortunately there is no magic bullet you could use to improve your performance. As always it will be a trade-off between speed and memory. There are also two sides which could be lacking in performance: the database side and the hard drive I/O speed.
So to gain speed I would, as a first step, improve the performance of the database query to ensure that it can return the names for searching fast enough. So make sure your query is fast, and maybe also use (in the MS SQL case) keywords like READ SEQUENTIAL; in that case you will already receive the first results while the query is still running, and you don't have to wait until the query has finished and handed you the names as one big block.
On the hard drive side you can call Directory.GetFiles(), but this call blocks until it has iterated over all files and then gives you back one big array containing all file names. This is the memory-consuming path and takes a while for the first search, but if you afterwards only work on that array, you get speed improvements for all consecutive searches. Another approach would be to call Directory.EnumerateFiles(), which searches the drive on the fly with every call and so may be faster for the first search, but nothing is kept in memory for the next search; that improves the memory footprint but costs speed, because there is no array in memory that could be searched. On the other hand, the OS will also do some caching if it detects that you iterate over the same files over and over again, and some caching happens at a lower level anyway.
So for the check on the hard drive side, use Directory.GetFiles() if the returned array won't blow up your memory, and do all your searches on it (maybe put it into a HashSet to further improve performance; whether it holds file names only or full paths depends on what you get from your database). In the other case use Directory.EnumerateFiles() and hope for the best regarding the caching done by the OS.
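As a sketch of that first option (assuming the database gives you bare file names and the whole set fits in memory; rootDir and databaseResults are the same placeholders used in the code further down):
// Build a set of every file name under the root once, then probe it per database row.
var namesOnDisk = new HashSet<string>(
    Directory.EnumerateFiles(rootDir, "*", SearchOption.AllDirectories)
             .Select(p => Path.GetFileName(p)),
    StringComparer.OrdinalIgnoreCase);

foreach (var dbName in databaseResults)
{
    bool found = namesOnDisk.Contains(dbName);
    // ... record found / not found ...
}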
Update
After re-reading your question and comments, as far as I understand you have a name like 1234567891_w.jpg and you don't know which part of the name represents the directory part. So in this case you need to make an explicit search, because iterating through all directories simply takes too much time. Here is some sample code which should give you an idea of how to solve this as a first shot:
string rootDir = @"D:\RootDir";

// Iterate over all files reported from the database
foreach (var filename in databaseResults)
{
    var fullPath = Path.Combine(rootDir, filename);

    // Check if the file exists directly within the root directory
    if (File.Exists(fullPath))
    {
        // Report that the file exists.
        DoFileFound(fullPath);
        // Fast exit to continue with the next file.
        continue;
    }

    var directoryFound = false;
    // Use the file name (without extension) as a directory candidate
    var directoryCandidate = Path.GetFileNameWithoutExtension(filename);

    do
    {
        // Recompute the path for the current (possibly shortened) candidate
        fullPath = Path.Combine(rootDir, directoryCandidate);

        // Check if a directory with the given name exists
        if (Directory.Exists(fullPath))
        {
            // Check if the file exists within this directory
            if (File.Exists(Path.Combine(fullPath, filename)))
            {
                // Report that the file exists.
                DoFileFound(fullPath);
                directoryFound = true;
            }
            // Fast exit, because we already looked into the directory.
            break;
        }

        // Is it possible that a shorter directory name
        // exists where this file exists??
        // If yes, we have to continue the search ...
        // (Alternative code to the above one)

        ////// Check if a directory with the given name exists
        ////if (Directory.Exists(fullPath))
        ////{
        ////    // Check if the file exists within this directory
        ////    if (File.Exists(Path.Combine(fullPath, filename)))
        ////    {
        ////        // Report that the file exists.
        ////        DoFileFound(fullPath);
        ////        // Fast exit, because we found the file.
        ////        directoryFound = true;
        ////        break;
        ////    }
        ////}

        // Shorten the directory name for the next candidate
        directoryCandidate = directoryCandidate.Substring(0, directoryCandidate.Length - 1);
    } while (!directoryFound
             && !String.IsNullOrEmpty(directoryCandidate));

    // We did our best but we found nothing.
    if (!directoryFound)
        DoFileNotAvailable(filename);
}
The only further performance improvement I could think of would be putting the directories found into a HashSet and, before checking with Directory.Exists(), using this set to check for an existing directory. But maybe this wouldn't gain anything, because the OS already does some caching of directory lookups and would then be nearly as fast as your local cache. For these things you simply have to measure your concrete problem.
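If you did want to try that directory cache, a rough, purely illustrative helper:
// Cache of directories already confirmed to exist on disk.
var knownDirectories = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

bool DirectoryExistsCached(string path)
{
    if (knownDirectories.Contains(path))
        return true;

    bool exists = Directory.Exists(path);
    if (exists)
        knownDirectories.Add(path);
    return exists;
}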
I've recently written a small program to rename a bunch of files located in 6 directories. The program loops through each directory from a list and then renames each file in that directory using the File.Move method. The files are renamed to cart_button_1.png, with the 1 incrementing by 1 each time.
public static int RenameFiles(DirectoryInfo d, StreamWriter sqlStreamWriter,
    int incrementer, int category, int size)
{
    FileInfo[] files = d.GetFiles("*.png");
    foreach (FileInfo fileInfo in files)
    {
        File.Move(fileInfo.FullName, d.FullName + "cart_button_" + incrementer + ".png");
        incrementer++;
    }
    return incrementer;
}
The problem I'm encountering is that when I run the program more than once, it runs fine up until it hits the folder containing the 100th record. The d.GetFiles method retrieves all the files with 100s in the name first, causing an IOException, because the file it is trying to rename already exists in the folder. The workaround I've found is just to select all the records with 100 in the file name and rename them all to 'z' or something so that they are all batched together. Any thoughts or ideas on how to fix this? Possibly some way to sort the GetFiles result to look at the others first.
Using LINQ:
var sorted = files.OrderBy(fi => fi.FullName).ToArray();
Note that the above will sort by the textual values, so you may want to change that to order by the numeric value:
files.OrderBy(fi => int.Parse(fi.Name.Split(new []{'_','.'})[2]))
The above assumes that splitting a file name by _ and . will result in an array whose third element is the numeric value.
The easiest workaround would be to check if the destination name exists prior to attempting the copy. Since you have the files array already, you can construct your destination name and if File.Exists() returns true, skip that numerical value.
I would also handle the exception that is thrown by File.Move (you want to test for existence first to avoid unnecessary exception throwing), because the file system is not frozen while your code works... so even testing for existence won't ensure the file isn't created in the meantime.
Finally, I think that running this code again against the same directory is going to rename all of the files again... probably not what is intended. I would filter the source file names and avoid processing those already matching your pattern.
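Putting those suggestions together, one possible shape for the loop; this is a sketch, with the unrelated parameters of the original method trimmed and the "already renamed" filter based on the cart_button_ prefix from the question:
public static int RenameFiles(DirectoryInfo d, int incrementer)
{
    // Skip files that already match the target pattern so re-runs don't rename them again.
    var files = d.GetFiles("*.png")
                 .Where(f => !f.Name.StartsWith("cart_button_", StringComparison.OrdinalIgnoreCase))
                 .OrderBy(f => f.Name, StringComparer.OrdinalIgnoreCase);

    foreach (FileInfo fileInfo in files)
    {
        // Skip numbers whose destination already exists instead of throwing.
        string destination;
        do
        {
            destination = Path.Combine(d.FullName, "cart_button_" + incrementer + ".png");
            incrementer++;
        } while (File.Exists(destination));

        try
        {
            File.Move(fileInfo.FullName, destination);
        }
        catch (IOException)
        {
            // The file system isn't frozen while the code runs; the destination
            // may still appear between the existence check and the move.
        }
    }
    return incrementer;
}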