I've got an issue with some directory manipulation.
The problem is that I have an archive of data where backup data needs to be added or purged based on a series of constraints. The constraint that is causing the issue is that the archive only needs to keep the backup from the previous week.
So when you chart out the steps you would assume:
Check if the directory exists.
Grab the files.
Then purge them.
Then move the following week's backup into the directory.
The problem, though, is that when you try to keep the code and its implementation simple, you end up with code that doesn't feel like proper practice.
string[] archiveFiles = Directory.GetFiles(
    Archive, @"*.*", SearchOption.TopDirectoryOnly);

foreach (string archive in archiveFiles)
    File.Delete(archive);
So what if you attempt to grab the files with Directory.GetFiles() and there is nothing to return? According to the documentation:
Return Value Type: System.String[] An array of the full names
(including paths) for the files in the specified directory that match
the specified search pattern and option, or an empty array if no files
are found.
If it returned an array containing a null, the loop would iterate once and File.Delete would throw an error. If it returns an array with no elements, the loop body is simply skipped. The second is what I believe it does, which makes this approach feel incorrect.
The only other thing I could do would be to use File.Copy(), since it can overwrite files, which would avoid this approach, but even that could become susceptible to the same dilemma of the empty array.
Is that the right usage and approach for Directory.GetFiles(), or is there a better way?
If it returned an array containing a null, the loop would iterate once and File.Delete would throw an error. If it returns an array with no elements, the loop body is simply skipped. The second is what I believe it does, which makes this approach feel incorrect.
If no files match, the array will be empty; there won't be any nulls (how many nulls should an empty directory return?).
So your delete code will not be executed. Makes sense to me.
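If you want to convince yourself, here is a quick sketch of the documented behaviour (the path is an assumption; point it at any directory with no matching files):

using System;
using System.IO;

// The path is an assumption -- any directory with no matching files will do.
string[] files = Directory.GetFiles(@"C:\EmptyArchive", "*.*", SearchOption.TopDirectoryOnly);

Console.WriteLine(files.Length);   // prints 0: an empty array, never null

foreach (string file in files)
    Console.WriteLine(file);       // the body never runs when the array is empty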
If you need to delete old files and then copy the new ones, you may want to first move the old files somewhere safe, then copy the new ones, and only then delete the old files.
Maybe I didn't understand the problem here, but I don't see one. I hope the actual code has some try/catch blocks, though.
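A minimal sketch of that move-then-copy-then-delete order (all folder names are assumptions, and the error handling is only illustrative):

using System;
using System.IO;

// Assumed layout for the sketch.
string archive  = @"C:\Backups\Archive";    // last week's backup
string incoming = @"C:\Backups\Incoming";   // the new week's backup
string safety   = @"C:\Backups\ArchiveOld"; // temporary safe copy of the old data

try
{
    if (Directory.Exists(safety))
        Directory.Delete(safety, true);                 // clear any stale safety copy

    if (Directory.Exists(archive))
        Directory.Move(archive, safety);                // 1. move old files somewhere safe

    Directory.CreateDirectory(archive);
    foreach (string file in Directory.GetFiles(incoming, "*.*", SearchOption.TopDirectoryOnly))
        File.Copy(file, Path.Combine(archive, Path.GetFileName(file)));   // 2. copy new ones

    if (Directory.Exists(safety))
        Directory.Delete(safety, true);                 // 3. only now delete the old files
}
catch (IOException ex)
{
    // If anything failed, the old backup is still intact in the safety folder.
    Console.WriteLine("Archive rotation failed: " + ex.Message);
}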
Related
I am iterating through a particular directory in my code using Directory.EnumerateFiles(). I only need the file name for each file present in the directory, but the number of files can be huge. Per MSDN and various other sources, it is mentioned that "When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned. When you use GetFiles, you must wait for the whole array of names to be returned before you can access the array.", here.
So, I did a bit of experiment on my machine.
var files = Directory.EnumerateFiles(path);
foreach(var file in files)
{
}
I set a breakpoint before starting the iteration, i.e. at the foreach. I see that files (the returned enumerable of strings) contains all the entries in my directory. Why is this? Is this expected?
In some other sources I see that we need a separate task doing the enumeration while the enumerable is iterated simultaneously on another processing thread, example. I was looking for something like an enumerator that by default fetches, say, the first 200 files, and on successive calls to MoveNext(), i.e. when the 201st item is accessed, fetches the next batch of files.
The variable files here is an IEnumerable, lazily evaluated. If you hover over files in the debugger and click 'Results view' then the full evaluation will take place (just as if you'd called, say, ToArray()). Otherwise the files will only be fetched as you need them (i.e. one at a time by the foreach loop).
So when you say:
"I see the files (returned enumerable of strings) contains all the entries in my directory"
I think you are mistaken.
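One way to see the lazy behaviour for yourself is to break out of the loop early; the remaining entries are never fetched from the file system. A small sketch (the folder path is an assumption; a large directory makes the effect visible):

using System;
using System.Collections.Generic;
using System.IO;

IEnumerable<string> files = Directory.EnumerateFiles(@"C:\SomeLargeFolder");
// Nothing has been read from the directory yet -- the query is only defined.

int count = 0;
foreach (string file in files)   // entries are pulled from the OS one MoveNext() at a time
{
    count++;
    if (count == 200)            // stop early: the rest of the directory is never enumerated
        break;
}
Console.WriteLine(count);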
I have a folder of files and a spreadsheet with a list of file names. I need to go through each file in the folder and see if that file name exists on the spreadsheet. It seems I can either load all the information (file names and spreadsheet list) into lists and then search from there or I can just loop through the files, get the name as I go, then look through the spreadsheet itself.
As far as I can tell, the benefit to loading them first is that it may make the search code a bit cleaner, but if there are too many files it would be redundant and slower. Working directly with the files and spreadsheet would negate that intermediate step, but the search code would be a little bit messier.
Is there any other clear trade off that I am missing? Is there a best practice for this?
Thanks.
Be careful, as comparing two lists results in an O(n²) problem. This means that if you have 20 files, you will have to make 20 * 20 = 400 comparisons.
Therefore I suggest putting the filenames from the spreadsheet into a HashSet<string>. It has a constant access time of O(1), which reduces your problem to an O(n) problem.
// Gather the file names from the spreadsheet and insert them in a HashSet.
// (This is just simulated here.)
var fileNamesOnSpreadsheet = new HashSet<string>(StringComparer.OrdinalIgnoreCase) {
    "filename 1", "filename 2", "filename 3", "another filename"
};

string folder = @"C:\Data";
foreach (string file in Directory.EnumerateFiles(folder)) {
    if (fileNamesOnSpreadsheet.Contains(file)) {
        // file found in spreadsheet
    } else {
        // file missing from spreadsheet
    }
}
Note that Directory.EnumerateFiles returns the filenames including their paths and extensions. If you only have the bare filenames in the spreadsheet, you can remove the path with
string fileNameOnly = Path.GetFileName(file);
You can also remove the extension with
string fileNameOnly = Path.GetFileNameWithoutExtension(file);
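Putting the two together, the lookup loop from above would then compare bare names (this assumes the spreadsheet stores names without extensions):

foreach (string file in Directory.EnumerateFiles(folder)) {
    string fileNameOnly = Path.GetFileNameWithoutExtension(file);
    if (fileNamesOnSpreadsheet.Contains(fileNameOnly)) {
        // file found in spreadsheet
    } else {
        // file missing from spreadsheet
    }
}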
Note that this solution reads the files from the folder only once and gets the file names from the spreadsheet only once. Reading information from the file system is time-consuming, and so is extracting information from the spreadsheet.
Directory.EnumerateFiles does not even store the filenames in a collection but instead delivers them continuously as the foreach-loop progresses.
So, this solution is very efficient.
See also:
Directory.EnumerateFiles Method
Big O notation - Wikipedia
For small numbers of names and a single search it takes so little time that optimizing the code is probably not worth it and you can do whatever is easiest for you.
For the sheet, it could make sense to load the names into a list* because you will search the list N times (once per file), and searching the list will be faster than searching the sheet. It might also make sense to keep the list sorted so searches take log N time instead of N (a sketch follows below).
* As @JonathanWillcock1 notes, other in-memory data structures such as a dictionary might work even better, hiding the details of sorting and searching from you and making your code cleaner.
For the file names you only look at each name once, so iterating through a directory listing function is all that you need; copying the names into a list would double the work.
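If you do go the sorted-list route for the sheet names, a rough sketch (all names and the folder path are invented for illustration):

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical names pulled from the spreadsheet, sorted once up front.
var sheetNames = new List<string> { "report_a", "report_b", "summary_2023" };
sheetNames.Sort(StringComparer.OrdinalIgnoreCase);

foreach (string file in Directory.EnumerateFiles(@"C:\Data"))
{
    string name = Path.GetFileNameWithoutExtension(file);
    // BinarySearch is O(log N) per lookup; a negative result means "not found".
    bool onSheet = sheetNames.BinarySearch(name, StringComparer.OrdinalIgnoreCase) >= 0;
    Console.WriteLine(name + (onSheet ? " is on the sheet" : " is missing"));
}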
I've recently written a small program to rename a bunch of files located in 6 directories. The program loops through each directory from a list and then renames each file in that directory using the File.Move method. The files are renamed to cart_buttons_1.png with the 1 incrementing by 1 each time.
public static int RenameFiles(DirectoryInfo d, StreamWriter sqlStreamWriter,
    int incrementer, int category, int size)
{
    FileInfo[] files = d.GetFiles("*.png");
    foreach (FileInfo fileInfo in files)
    {
        File.Move(fileInfo.FullName, d.FullName + "cart_button_" + incrementer + ".png");
        incrementer++;
    }
    return incrementer;
}
The problem I'm encountering is that when I run the program more than once, it runs fine up until it hits the folder containing the 100th record. The d.GetFiles method retrieves all the files with the 100s first, causing an IOException, because the file name it is trying to rename to already exists in the folder. The workaround I've found is just to select all the records with 100 in the filename and rename them all to 'z' or something so that it just batches them all together. Any thoughts or ideas on how to fix this? Possibly some way to sort the GetFiles result to look at the others first.
Using LINQ:
var sorted = files.OrderBy(fi => fi.FullName).ToArray();
Note that the above will sort by the textual values, so you may want to change that to order by the numeric value:
files.OrderBy(fi => int.Parse(fi.Name.Split(new []{'_','.'})[2]))
The above assumes that splitting a file name by _ and . results in an array whose third element is the numeric value.
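Plugged into the original method (with using System.Linq; at the top of the file), the sorted array simply replaces the plain GetFiles result; this assumes every file already follows the cart_button_<n>.png pattern:

// Order numerically so cart_button_2.png is processed before cart_button_100.png.
FileInfo[] files = d.GetFiles("*.png")
                    .OrderBy(fi => int.Parse(fi.Name.Split(new[] { '_', '.' })[2]))
                    .ToArray();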
The easiest workaround would be to check if the destination name exists prior to attempting the move. Since you already have the files array, you can construct your destination name, and if File.Exists() returns true, skip that numerical value.
I would also handle the exception thrown by File.Move (you want to test for existence first to avoid unnecessary exception throwing), because the file system is not frozen while your code works, so even testing for existence won't ensure the file isn't created in the meantime.
Finally, I think that running this code again against the same directory is going to rename all of the files again, which is probably not what is intended. I would filter the source filenames and avoid renaming those already matching your pattern.
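A sketch of that defensive version, keeping the counter approach from the question but dropping the unused parameters (it does not yet filter files that already match the pattern, per the last point above):

using System;
using System.IO;

public static int RenameFilesSafely(DirectoryInfo d, int incrementer)
{
    foreach (FileInfo fileInfo in d.GetFiles("*.png"))
    {
        string target = Path.Combine(d.FullName, "cart_button_" + incrementer + ".png");

        // Skip numerical values whose target file already exists.
        while (File.Exists(target))
        {
            incrementer++;
            target = Path.Combine(d.FullName, "cart_button_" + incrementer + ".png");
        }

        try
        {
            File.Move(fileInfo.FullName, target);
        }
        catch (IOException)
        {
            // The file system is not frozen: the target can still appear between
            // the Exists check and the Move, so decide here to retry, skip, or log.
        }

        incrementer++;
    }
    return incrementer;
}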
I can retrieve the names of the individual files, and I can also use a count in the looping code to get the required value, but is it possible to know this without the iteration? Perhaps a property is defined for that.
openFileDialog1.FileNames.Length
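For context, a minimal sketch as it might sit inside a WinForms handler (openFileDialog1 comes from the question; the rest is assumed, and it needs using System.Windows.Forms;):

// Inside a WinForms event handler.
openFileDialog1.Multiselect = true;
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
    // FileNames is already a string[], so the count is a property read -- no loop needed.
    int selectedCount = openFileDialog1.FileNames.Length;
    MessageBox.Show(selectedCount + " files selected");
}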
I have a large list of emails that I need to check to see if they contain a string. I only need to do this once. Originally I only needed to check whether each email matched any of the emails from a list of emails.
I was using if(ListOfEmailsToRemoveHashSet.Contains(email)) { Discard(email); }
This worked great, but now I need to check for partial matches, so I am trying to invert it. If I used the same method, I would be testing it like if (ListOfEmailsHashSet.Contains(badstring)). Obviously that tells me which string is being found, but not which entry in the HashSet contains the bad string.
I can't see any way of making this work while still being fast.
Does anyone know of a function I can use that will return the HashSet of matches, the index of a matched item, or any way around this?
I only need to do this once.
If this is the case, performance shouldn't really be a consideration. Something like this should work:
if(StringsToDisallow.Any(be => email.Contains(be))) {...}
On a side note, you may want to consider using Regular Expressions rather than a straight black-list of contained strings. They'll give you a much more powerful, flexible way to find matches.
If performance does turn out to be an issue after all, you'll have to find a data structure that works better for full-text searching. It might be best to leverage an existing tool like Lucene.NET.
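For the regular-expression route, one hedged sketch (the black-list entries are invented; email and Discard come from the question):

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Invented examples of disallowed substrings.
string[] stringsToDisallow = { "spam-domain.com", "noreply", "bounce" };

// Build one alternation pattern; Regex.Escape keeps characters like '.' literal.
string pattern = string.Join("|", stringsToDisallow.Select(Regex.Escape));
var badEmail = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

if (badEmail.IsMatch(email)) { Discard(email); }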
Just a note here: we had a program that was tasked with uploading in excess of 100,000 pdf/excel/doc files, and every time a file was uploaded an entry was made in a text file. Every night when the program ran it would read this file and load the records into a static HashSet<string> FilesVisited = new HashSet<string>(); via FilesVisited.Add(reader.ReadLine());.
When the program attempted to upload a file, we had to first scan through the HashSet to see if we had already worked on the file. What we found was that if (!FilesVisited.Contains(newFilePath))... would take a lot of time and would not give us the correct results (even when the file path was in there); alternately, FilesVisited.Any(m => m.Contains(newFilePath)) was also a slow operation.
The best way we found to be fast was the traditional way of
foreach (var item in FilesVisited)
{
    if (item.Contains(fileName)) {
        alreadyUploaded = true;
        break;
    }
}
Just thought I would share this....