How do I compare one collection of files to another in c#?


I am just learning C# (have been fiddling with it for about 2 days now) and I've decided that, for learning purposes, I will rebuild an old app I made in VB6 for syncing files (generally across a network).
When I wrote the code in VB 6, it worked approximately like this:
Create a Scripting.FileSystemObject
Create directory objects for the source and destination
Create file listing objects for the source and destination
Iterate through the source object, and check to see if it exists in the destination
if not, create it
if so, check to see if the source version is newer/larger, and if so, overwrite the other
So far, this is what I have:
private bool syncFiles(string sourcePath, string destPath) {
DirectoryInfo source = new DirectoryInfo(sourcePath);
DirectoryInfo dest = new DirectoryInfo(destPath);
if (!source.Exists) {
LogLine("Source Folder Not Found!");
return false;
}
if (!dest.Exists) {
LogLine("Destination Folder Not Found!");
return false;
}
FileInfo[] sourceFiles = source.GetFiles();
FileInfo[] destFiles = dest.GetFiles();
foreach (FileInfo file in sourceFiles) {
// check exists on file
}
if (optRecursive.Checked) {
foreach (DirectoryInfo subDir in source.GetDirectories()) {
// create-if-not-exists destination subdirectory
syncFiles(sourcePath + subDir.Name, destPath + subDir.Name);
}
}
return true;
}
I have read examples that seem to advocate using the FileInfo or DirectoryInfo objects to do checks with the "Exists" property, but I am specifically looking for a way to search an existing collection/list of files, and not live checks to the file system for each file, since I will be doing so across the network and constantly going back to a multi-thousand-file directory is slow slow slow.
Thanks in Advance.

The GetFiles() method will only get you files that do exist. It doesn't make up random files that don't exist. So all you have to do is check whether each file exists in the other list.
Something along these lines could work:
var sourceFiles = source.GetFiles();
var destFiles = dest.GetFiles();
foreach (var file in sourceFiles)
{
if(!destFiles.Any(x => x.Name == file.Name))
{
// Do whatever
}
}
Note: You have of course no guarantee that something hasn't changed after you have done the calls to GetFiles(). For example, a file could have been deleted or renamed if you try to copy it later.
This could perhaps be done more neatly with the Except method or something similar. For example:
var sourceFiles = source.GetFiles();
var destFiles = dest.GetFiles();
var sourceFilesMissingInDestination = sourceFiles.Except(destFiles, new FileNameComparer());
foreach (var file in sourceFilesMissingInDestination)
{
// Do whatever
}
Where the FileNameComparer is implemented like so:
public class FileNameComparer : IEqualityComparer<FileInfo>
{
public bool Equals(FileInfo x, FileInfo y)
{
return Equals(x.Name, y.Name);
}
public int GetHashCode(FileInfo obj)
{
return obj.Name.GetHashCode();
}
}
Untested though :p
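Since the question is about avoiding repeated lookups, another option is to index the destination listing by name once and then do every check in memory. The following is a minimal sketch under a few assumptions: names are compared case-insensitively (as on Windows, so two destination files differing only by case would make ToDictionary throw), "newer/larger" is read as a later LastWriteTimeUtc or a greater Length, and the names FileListComparer and FindFilesToCopy are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileListComparer
{
    // Returns the source files that are missing from the destination,
    // or that are newer or larger than the destination copy.
    public static List<FileInfo> FindFilesToCopy(DirectoryInfo source, DirectoryInfo dest)
    {
        // One directory listing per side; every check after this is an in-memory lookup.
        Dictionary<string, FileInfo> destByName = dest.GetFiles()
            .ToDictionary(f => f.Name, StringComparer.OrdinalIgnoreCase);

        var toCopy = new List<FileInfo>();
        foreach (FileInfo file in source.GetFiles())
        {
            FileInfo existing;
            if (!destByName.TryGetValue(file.Name, out existing))
            {
                toCopy.Add(file); // not in the destination at all
            }
            else if (file.LastWriteTimeUtc > existing.LastWriteTimeUtc
                     || file.Length > existing.Length)
            {
                toCopy.Add(file); // source is newer or larger
            }
        }
        return toCopy;
    }
}
```

The dictionary lookup replaces the linear Any() scan, so the comparison cost grows with the sum of the two listings rather than their product.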

One little detail, instead of
sourcePath + subDir.Name
I would use
System.IO.Path.Combine(sourcePath, subDir.Name)
Path performs reliable, OS-independent operations on file and folder names.
Also I notice optRecursive.Checked popping out of nowhere. As a matter of good design, make that a parameter:
bool syncFiles(string sourcePath, string destPath, bool checkRecursive)
And since you mention it may be used for large numbers of files, keep an eye out for .NET 4, it has an IEnumerable replacement for GetFiles() that will let you process this in a streaming fashion.
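The three suggestions above can be combined in one traversal routine. This is only a rough sketch, assuming .NET 4 or later (for DirectoryInfo.EnumerateFiles and EnumerateDirectories); SyncWalker.Walk and the visit callback are hypothetical names, and the actual copy decision is left to the caller.

```csharp
using System;
using System.IO;

public static class SyncWalker
{
    // Calls 'visit' with each source file and the destination path it maps to.
    // 'recursive' is passed in rather than read from a UI control.
    public static void Walk(string sourcePath, string destPath, bool recursive,
                            Action<FileInfo, string> visit)
    {
        var source = new DirectoryInfo(sourcePath);

        // EnumerateFiles yields entries lazily instead of building a full array first.
        foreach (FileInfo file in source.EnumerateFiles())
        {
            visit(file, Path.Combine(destPath, file.Name));
        }

        if (!recursive) return;

        foreach (DirectoryInfo subDir in source.EnumerateDirectories())
        {
            // Path.Combine inserts the directory separator when it is missing.
            Walk(subDir.FullName, Path.Combine(destPath, subDir.Name), recursive, visit);
        }
    }
}
```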

Related

Copy only new or modified files/directories in C#

I am trying to create a simple “directory/file copy" console application in C#. What I need is to copy all folders and files (keeping the original hierarchy) from one drive to another, like from drive C:\Data to drive E:\Data.
However, I only want it to copy any NEW or MODIFIED files from the source to the destination.
If the file on the destination drive is newer than the one on the source drive, then it does not copy.
(the problem)
In the code I have, it's comparing file "abc.pdf" in the source with file "xyz.pdf" in the destination and thus is overwriting the destination file with whatever is in the source even though the destination file is newer. I am trying to figure out how to make it compare "abc.pdf" in the source to "abc.pdf" in the destination.
This works if I drill the source and destination down to a specific file, but when I back out to the folder level, it overwrites the destination file with the source file, even though the destination file is newer.
(my solutions – that didn’t work)
I thought by putting the “if (file.LastWriteTime > destination.LastWriteTime)” after the “foreach” command, that it would compare the files in the two folders, File1 source to File1 destination, but it’s not.
It seems I’m missing something in either the “FileInfo[]”, “foreach” or “if” statements to make this a one-to-one comparison. I think maybe some reference to the “Path.Combine” statement or a “SearchOption.AllDirectories”, but I’m not sure.
Any suggestions?
As you can see from my basic code sample, I'm new to C# so please put your answer in simple terms.
Thank you.
Here is the code I have tried, but it’s not working.
class Copy
{
public static void CopyDirectory(DirectoryInfo source, DirectoryInfo destination)
{
if (!destination.Exists)
{
destination.Create();
}
// Copy files.
FileInfo[] files = source.GetFiles();
FileInfo[] destFiles = destination.GetFiles();
foreach (FileInfo file in files)
foreach (FileInfo fileD in destFiles)
// Copy only modified files
if (file.LastWriteTime > fileD.LastWriteTime)
{
file.CopyTo(Path.Combine(destination.FullName,
file.Name), true);
}
// Copy all new files
else
if (!fileD.Exists)
{
file.CopyTo(Path.Combine(destination.FullName, file.Name), true);
}
// Process subdirectories.
DirectoryInfo[] dirs = source.GetDirectories();
foreach (DirectoryInfo dir in dirs)
{
// Get destination directory.
string destinationDir = Path.Combine(destination.FullName, dir.Name);
// Call CopyDirectory() recursively.
CopyDirectory(dir, new DirectoryInfo(destinationDir));
}
}
}

You can just take the array of files in "source" and check for a matching name in "destination"
/// <summary>
/// checks whether the target file needs an update (if it doesn't exist: it needs one)
/// </summary>
public static bool NeedsUpdate(FileInfo localFile, DirectoryInfo localDir, DirectoryInfo backUpDir)
{
bool needsUpdate = false;
if (!File.Exists(Path.Combine(backUpDir.FullName, localFile.Name)))
{
needsUpdate = true;
}
else
{
FileInfo backUpFile = new FileInfo(Path.Combine(backUpDir.FullName, localFile.Name));
DateTime lastBackUp = backUpFile.LastWriteTimeUtc;
DateTime lastChange = localFile.LastWriteTimeUtc;
if (lastChange != lastBackUp)
{
needsUpdate = true;
}
else
{/*no change*/}
}
return needsUpdate;
}
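Applied to the loop in the question, the idea is to build the destination path for each source file instead of looping over every destination file. A minimal sketch, assuming "modified" means the source has a later LastWriteTimeUtc than the destination copy; CopyHelper.CopyNewOrModified is a name chosen for illustration, not part of the original code.

```csharp
using System.IO;

public static class CopyHelper
{
    // Copies files that are missing from, or newer than, the destination.
    // Returns the number of files copied.
    public static int CopyNewOrModified(DirectoryInfo source, DirectoryInfo destination)
    {
        if (!destination.Exists)
        {
            destination.Create();
        }

        int copied = 0;
        foreach (FileInfo file in source.GetFiles())
        {
            // Pair abc.pdf in the source with abc.pdf in the destination.
            var target = new FileInfo(Path.Combine(destination.FullName, file.Name));
            if (!target.Exists || file.LastWriteTimeUtc > target.LastWriteTimeUtc)
            {
                file.CopyTo(target.FullName, true);
                copied++;
            }
        }

        // Process subdirectories the same way.
        foreach (DirectoryInfo dir in source.GetDirectories())
        {
            string destinationDir = Path.Combine(destination.FullName, dir.Name);
            copied += CopyNewOrModified(dir, new DirectoryInfo(destinationDir));
        }
        return copied;
    }
}
```

Note that this still asks the file system about each destination file individually, which matters on slow network shares.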

Update:
I modified my code with the suggestions above and all went well. It did exactly as I expected.
However, the problem I ran into was the amount of time it took to run the application on a large folder (containing 6,000 files and 5 sub-folders).
On a small folder, (28 files in 5 sub-folders) it only took a few seconds to run. But, on the larger folder it took 35 minutes to process only 1,300 files.
Solution:
The code below will do the same thing but much faster. This new version processed 6,000 files in about 10 seconds. It processed 40,000 files in about 1 minute and 50 seconds.
What this new code does (and doesn’t do)
If the destination folder is empty, copy all from the source to the destination.
If the destination has some or all of the same files / folders as the source, compare and copy any new or modified files from the source to the destination.
If the destination file is newer than the source, don’t copy.
So, here’s the code to make it happen. Enjoy and share.
Thanks to everyone who helped me get a better understanding of this.
using System;
using System.IO;
namespace VSU1vFileCopy
{
class Program
{
static void Main(string[] args)
{
const string Src_FOLDER = @"C:\Data";
const string Dest_FOLDER = @"E:\Data";
string[] originalFiles = Directory.GetFiles(Src_FOLDER, "*", SearchOption.AllDirectories);
Array.ForEach(originalFiles, (originalFileLocation) =>
{
FileInfo originalFile = new FileInfo(originalFileLocation);
FileInfo destFile = new FileInfo(originalFileLocation.Replace(Src_FOLDER, Dest_FOLDER));
if (destFile.Exists)
{
if (originalFile.LastWriteTimeUtc > destFile.LastWriteTimeUtc) // copy only when the source is newer than the destination
{
originalFile.CopyTo(destFile.FullName, true);
}
}
else
{
Directory.CreateDirectory(destFile.DirectoryName);
originalFile.CopyTo(destFile.FullName, false);
}
});
}
}
}

Fastest way to delete million of files

I have a list of string which are relative paths. I also have a string which contains root path for those files. Now I am deleting them like this:
foreach (var rawDocumentPath in documents.Select(x => x.RawDocumentPath))
{
if (string.IsNullOrEmpty(rawDocumentPath))
{
continue;
}
string fileName = Path.Combine(storagePath, rawDocumentPath);
File.Delete(fileName);
}
The problem is that I call Path.Combine for every file, and it's rather slow.
How can I speed up this code? I can't delete whole folders, I cannot change current directory (because it affects a whole program)...
I need something like a class which can delete fast several files in specified directory.

If your disk can handle it, parallelizing should help a lot:
documents.AsParallel().ForAll(
document =>
{
if (!string.IsNullOrEmpty(document.RawDocumentPath))
{
string fileName = Path.Combine(storagePath, document.RawDocumentPath);
File.Delete(fileName);
}
});

Need elegant way to move orphan files from folder after paired files are moved with FileInfo class

I have a folder from which I'm moving pairs of related files (xml paired with pdf). Additional files could be deposited into this folder at any time, but the utility runs every 10 minutes or so. We could use the FileSystemWatcher class but for internal reasons we don't for this utility.
I'm using the System.IO.FileInfo class to read all the files in the folder (will only be xml and pdf) during each run. Once I have the files in the FileInfo object, I iterate through the files, moving matches to a working folder. Once that is done, I want to move any files that were not paired, but are in the FileInfo object, to a failure folder.
Since I can't seem to remove items from the FileInfo object (or I am missing something), would it be easier to (1) use a string array from Directory class .GetFiles, (2) create a Dictionary from the FileInfo object and remove values from that during iteration, or (3) is there a more elegant approach using LINQ or something else?
Here is the code so far:
internal static bool CompareXMLandPDFFileNames(FileInfo[] xmlFiles, FileInfo[] pdfFiles, string xmlFilePath)
{
string workingFilePath = xmlFilePath + @"\WORKING";
if (xmlFiles.Length > 0)
{
foreach (var xmlFile in xmlFiles)
{
string xfn = xmlFile.Name; //xml file name
string pdfName = xfn.Substring(0,xfn.IndexOf('_')) + ".pdf"; //parsed pdf file name contained in xml file name
foreach (var pdfFile in pdfFiles)
{
string pfn = pdfFile.Name; //pdf file name
if (pfn == pdfName)
{
//move xml and pdf files to working folder...
FileInfo xmlInfo = new FileInfo(xmlFilePath + xfn);
FileInfo pdfInfo = new FileInfo(xmlFilePath + pfn);
if (!File.Exists(workingFilePath + xfn))
{
xmlInfo.MoveTo(workingFilePath + xfn);
}
if (!File.Exists(workingFilePath + pfn))
{
pdfInfo.MoveTo(workingFilePath + pfn);
}
}
}
}
//all files in the file objects should now be moved to working folder, if not, fix orphans...
}
return true;
}

To be honest I think the question is a bit poor. The problem is stated in a very complicated fashion. I think the workflow could be designed to be more robust and deterministic (e.g. why not upload file pairs in zipped sets in the first place?).
(And no "Someone" most likely "must not have been here before")
Here are some random improvements:
using System;
using System.Linq;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections.Generic;
namespace O
{
static class X
{
private static readonly Regex _xml2pdf = new Regex(@"(_.*)\.xml$", RegexOptions.Compiled | RegexOptions.IgnoreCase); // dot escaped so it matches only a literal "."
internal static void MoveFileGroups(string uploadFolder)
{
string workingFilePath = Path.Combine(uploadFolder, "PROGRESS");
var groups = new DirectoryInfo(uploadFolder)
.GetFiles()
.GroupBy(fi => _xml2pdf.Replace(fi.Name, ".pdf"), StringComparer.InvariantCultureIgnoreCase)
.Where(group => group.Count() >1);
foreach (var group in groups)
{
if (!group.Any(fi => File.Exists(Path.Combine(workingFilePath, fi.Name))))
foreach (var file in group)
file.MoveTo(Path.Combine(workingFilePath, file.Name));
}
}
public static void Main(string[]args)
{
}
}
}
use readable names (say what you mean)
IndexOf returns -1 if filename contains no "_"; random upload filenames could make procedure fail
Handle filenames case insensitive on Windows
Don't manually do the path concats (you could accidentally manufacture UNC paths, and your code is less portable)
don't assume one xml will map to one pdf: the naming scheme implies that many xmls can map to the same pdf name. This implementation allows that (or you could detect the situation by rejecting groups.Where(g => g.Count()>2))
Move groups atomically only (!): if any one of the files in a group exists in the target dir, don't move any (or you will have a race condition, where part of a group gets moved before the last file was (completely) uploaded, and the rest will never get moved because the group is no longer detected)
Other items (todo)
Don't pass redundant parameters. You might pass a FI[] instead of the raw GetFiles() call if you want filtering.
Do error handling, notably:
handle IO exceptions
locking errors are expectable while uploads in progress (test it or end up with corrupted files); you need to atomically handle these (i.e. not move any files in a group unless all could be moved; this will be somewhat tricky)
test your code (none of my sample was tested; it just compiled on linux with mono)
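The original question also asked what to do with the unpaired files. With the same grouping idea, orphans are simply the groups that contain a single file. The sketch below assumes the naming scheme from the question (an xml named after its pdf plus an underscore suffix); PairSorter, SplitPairs and PairKey are illustrative names, it works on bare file names, and the actual moving to a working or failure folder is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PairSorter
{
    // Maps "report_001.xml" and "report.pdf" to the same key, "report".
    static string PairKey(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        bool isXml = string.Equals(Path.GetExtension(fileName), ".xml",
                                   StringComparison.OrdinalIgnoreCase);
        int underscore = stem.IndexOf('_');
        // Only xml names carry the "_suffix"; a missing '_' leaves the stem unchanged.
        return (isXml && underscore >= 0) ? stem.Substring(0, underscore) : stem;
    }

    // Splits file names into paired groups and orphans (groups of one).
    public static void SplitPairs(IEnumerable<string> fileNames,
                                  out List<string> paired, out List<string> orphans)
    {
        var groups = fileNames
            .GroupBy(name => PairKey(name), StringComparer.OrdinalIgnoreCase)
            .ToList();
        paired = groups.Where(g => g.Count() > 1).SelectMany(g => g).ToList();
        orphans = groups.Where(g => g.Count() == 1).SelectMany(g => g).ToList();
    }
}
```

As in the answer above, a group of two xml files with no pdf would still count as paired here, so a stricter check on extensions may be needed.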

C# How to loop a large set of folders and files recursively without using a huge amount of memory

I want to index all my music files and store them in a database.
I have this function that I call recursively, starting from the root of my music drive.
i.e.
start > ReadFiles(C:\music\);
ReadFiles(path){
foreach(file)
save to index;
foreach(directory)
ReadFiles(directory);
}
This works fine, but while running the program the amount of memory that is used grows and grows and.. finally my system runs out of memory.
Does anyone have a better approach that doesn't need 4GB of RAM to complete this task?
Best Regards, Tys

Alxandr's queue based solution should work fine.
If you're using .NET 4.0, you could also take advantage of the new Directory.EnumerateFiles method, which enumerates files lazily, without loading them all in memory:
void ReadFiles(string path)
{
IEnumerable<string> files =
Directory.EnumerateFiles(
path,
"*",
SearchOption.AllDirectories); // search recursively
foreach(string file in files)
SaveToIndex(file);
}

Did you check for the . and .. entries that show up in every directory except the root?
If you don't skip those, you'll have an infinite loop.

You can implement this as a queue. I think (but I'm not sure) that this will save memory. At least it will free up your stack. Whenever you find a folder you add it to the queue, and whenever you find a file you just read it. This prevents recursion.
Something like this:
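Queue<string> dirs = new Queue<string>();
dirs.Enqueue("basedir");
while (dirs.Count > 0) {
string current = dirs.Dequeue();
// queue subdirectories for later instead of recursing into them
foreach (string directory in Directory.GetDirectories(current))
dirs.Enqueue(directory);
// index the files in the current directory
foreach (string file in Directory.GetFiles(current))
SaveToIndex(file);
}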

Beware, though, that EnumerateFiles() will stop running if you don't have access to a file or if a path is too long or if some other exception occurs. This is what I use for the moment to solve those problems:
public static List<string> getFiles(string path, List<string> files)
{
IEnumerable<string> fileInfo = null;
IEnumerable<string> folderInfo = null;
try
{
fileInfo = Directory.EnumerateFiles(path);
}
catch
{
}
if (fileInfo != null)
{
files.AddRange(fileInfo);
//recurse through the subfolders
folderInfo = Directory.EnumerateDirectories(path);
foreach (string s in folderInfo)
{
try
{
getFiles(s, files);
}
catch
{
}
}
}
return files;
}
Example use:
List<string> files = new List<string>();
files = folder.getFiles(path, files);
My solution is based on the code at this page: http://msdn.microsoft.com/en-us/library/vstudio/bb513869.aspx.
Update: A MUCH faster method to get files recursively can be found at http://social.msdn.microsoft.com/Forums/vstudio/en-US/ae61e5a6-97f9-4eaa-9f1a-856541c6dcce/directorygetfiles-gives-me-access-denied?forum=csharpgeneral. Using Stack is new to me (I didn't even know it existed), but the method seems to work. At least it listed all files on my C and D partition with no errors.
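The stack-based approach mentioned in the update can be combined with lazy enumeration so that neither the call stack nor a result list grows with the size of the tree. This is a sketch and not the code from the linked thread; SafeWalker.EnumerateFilesSafe is a hypothetical name, and silently skipping unreadable directories is an assumption you may want to replace with logging.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SafeWalker
{
    // Lazily yields every file under root, skipping directories that cannot be read.
    public static IEnumerable<string> EnumerateFilesSafe(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string current = pending.Pop();
            string[] files;
            string[] subDirs;
            try
            {
                // Listing one directory at a time keeps memory bounded
                // by the largest single directory.
                files = Directory.GetFiles(current);
                subDirs = Directory.GetDirectories(current);
            }
            catch (UnauthorizedAccessException) { continue; }
            catch (PathTooLongException) { continue; }
            catch (DirectoryNotFoundException) { continue; }

            foreach (string dir in subDirs) pending.Push(dir);
            foreach (string file in files) yield return file;
        }
    }
}
```

A caller can then write foreach (string file in SafeWalker.EnumerateFilesSafe(path)) and index each file as it arrives, without first collecting everything in a list.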

It could be junction folders, which lead to an infinite loop during recursion, but I am not sure. Check this out and see for yourself. Link: https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/mklink

Quickest way in C# to find a file in a directory with over 20,000 files

I have a job that runs every night to pull xml files from a directory that has over 20,000 subfolders under the root. Here is what the structure looks like:
rootFolder/someFolder/someSubFolder/xml/myFile.xml
rootFolder/someFolder/someSubFolder1/xml/myFile1.xml
rootFolder/someFolder/someSubFolderN/xml/myFile2.xml
rootFolder/someFolder1
rootFolder/someFolderN
So looking at the above, the structure is always the same - a root folder, then two subfolders, then an xml directory, and then the xml file.
Only the name of the rootFolder and the xml directory are known to me.
The code below traverses through all the directories and is extremely slow. Any recommendations on how I can optimize the search especially if the directory structure is known?
string[] files = Directory.GetFiles(@"\\somenetworkpath\rootFolder", "*.xml", SearchOption.AllDirectories);

Rather than doing GetFiles and doing a brute force search you could most likely use GetDirectories, first to get a list of the "First sub folder", loop through those directories, then repeat the process for the sub folder, looping through them, lastly look for the xml folder, and finally searching for .xml files.
Now, as for performance the speed of this will vary, but searching for directories first, THEN getting to files should help a lot!
Update
Ok, I did a quick bit of testing and you can actually optimize it much further than I thought.
The following code snippet will search a directory structure and find ALL "xml" folders inside the entire directory tree.
string startPath = @"C:\Testing\Testing\bin\Debug";
string[] oDirectories = Directory.GetDirectories(startPath, "xml", SearchOption.AllDirectories);
Console.WriteLine(oDirectories.Length.ToString());
foreach (string oCurrent in oDirectories)
Console.WriteLine(oCurrent);
Console.ReadLine();
If you drop that into a test console app you will see it output the results.
Now, once you have this, just look in each of the found directories for your .xml files.
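That last step could look roughly like the following. It is a sketch that assumes the xml files sit directly inside the folders named "xml", as in the question; XmlFinder.FindXmlFiles is an illustrative name.

```csharp
using System.Collections.Generic;
using System.IO;

public static class XmlFinder
{
    // Finds *.xml files that sit directly inside any folder named "xml" under root.
    public static List<string> FindXmlFiles(string root)
    {
        var result = new List<string>();
        foreach (string xmlDir in Directory.GetDirectories(root, "xml", SearchOption.AllDirectories))
        {
            // TopDirectoryOnly: no need to descend further below the xml folder.
            result.AddRange(Directory.GetFiles(xmlDir, "*.xml", SearchOption.TopDirectoryOnly));
        }
        return result;
    }
}
```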

I created a recursive method GetFolders using a Parallel.ForEach to find all the folders named as the variable yourKeyword
List<string> returnFolders = new List<string>();
object locker = new object();
Parallel.ForEach(subFolders, subFolder =>
{
if (subFolder.ToUpper().EndsWith(yourKeyword))
{
lock (locker)
{
returnFolders.Add(subFolder);
}
}
else
{
lock (locker)
{
returnFolders.AddRange(GetFolders(Directory.GetDirectories(subFolder)));
}
}
});
return returnFolders;

Are there additional directories at the same level as the xml folder? If so, you could probably speed up the search if you do it yourself and eliminate that level from searching.
System.IO.DirectoryInfo root = new System.IO.DirectoryInfo(rootPath);
List<System.IO.FileInfo> xmlFiles=new List<System.IO.FileInfo>();
foreach (System.IO.DirectoryInfo subDir1 in root.GetDirectories())
{
foreach (System.IO.DirectoryInfo subDir2 in subDir1.GetDirectories())
{
System.IO.DirectoryInfo xmlDir = new System.IO.DirectoryInfo(System.IO.Path.Combine(subDir2.FullName, "xml"));
if (xmlDir.Exists)
{
xmlFiles.AddRange(xmlDir.GetFiles("*.xml"));
}
}
}

I can't think of anything faster in C#, but do you have indexing turned on for that file system?

The only way I can see to make much of a difference is to move away from a brute-force hunt and use a third-party or OS indexing routine to speed up the lookup. That way the search is done offline from your app.
But I would also suggest you should look at better ways to structure that data if at all possible.

Use P/Invoke on FindFirstFile/FindNextFile/FindClose and avoid overhead of creating lots of FileInfo instances.
But this will be hard work to get right (you will have to do all the handling of file vs. directory and recursion yourself). So try something simple (Directory.GetFiles(), Directory.GetDirectories()) to start with and get things working. If it is too slow look at alternatives (but always measure, too easy to make it slower).

Depending on your needs and configuration, you could utilize the Windows Search Index: https://msdn.microsoft.com/en-us/library/windows/desktop/bb266517(v=vs.85).aspx
Depending on your configuration this could increase performance greatly.

For file and directory searches, I would suggest a multithreaded .NET library that offers a wide range of search options.
You can find all the information about the library on GitHub: https://github.com/VladPVS/FastSearchLibrary
If you want to download it, you can do so here: https://github.com/VladPVS/FastSearchLibrary/releases
If you have any questions, please ask.
It works really fast. Check it yourself!
Here is one example of how you can use it:
class Searcher
{
private static object locker = new object();
private FileSearcher searcher;
List<FileInfo> files;
public Searcher()
{
files = new List<FileInfo>();
}
public void Startsearch()
{
CancellationTokenSource tokenSource = new CancellationTokenSource();
searcher = new FileSearcher(@"C:\", (f) =>
{
return Regex.IsMatch(f.Name, @".*[Dd]ragon.*.jpg$");
}, tokenSource);
searcher.FilesFound += (sender, arg) =>
{
lock (locker) // using a lock is mandatory
{
arg.Files.ForEach((f) =>
{
files.Add(f);
Console.WriteLine($"File location: {f.FullName}, \nCreation.Time: {f.CreationTime}");
});
if (files.Count >= 10)
searcher.StopSearch();
}
};
searcher.SearchCompleted += (sender, arg) =>
{
if (arg.IsCanceled)
Console.WriteLine("Search stopped.");
else
Console.WriteLine("Search completed.");
Console.WriteLine($"Quantity of files: {files.Count}");
};
searcher.StartSearchAsync();
}
}
Here is part of another example:
***
List<string> folders = new List<string>
{
@"C:\Users\Public",
@"C:\Windows\System32",
@"D:\Program Files",
@"D:\Program Files (x86)"
}; // list of search directories
List<string> keywords = new List<string> { "word1", "word2", "word3" }; // list of search keywords
FileSearcherMultiple multipleSearcher = new FileSearcherMultiple(folders, (f) =>
{
if (f.CreationTime >= new DateTime(2015, 3, 15) &&
(f.Extension == ".cs" || f.Extension == ".sln"))
foreach (var keyword in keywords)
if (f.Name.Contains(keyword))
return true;
return false;
}, tokenSource, ExecuteHandlers.InCurrentTask, true);
***
Moreover one can use simple static method:
List<FileInfo> files = FileSearcher.GetFilesFast(@"C:\Users", "*.xml");
Note that, unlike the standard .NET search methods, the methods of this library do not throw UnauthorizedAccessException.
Furthermore, the fast methods of this library run at least twice as fast as a simple single-threaded recursive algorithm if you use a multicore processor.

For those of you who want to search for a single file and you know your root directory then I suggest you keep it simple as possible. This approach worked for me.
private void btnSearch_Click(object sender, EventArgs e)
{
string userinput = txtInput.Text;
string sourceFolder = @"C:\mytestDir\";
string searchWord = txtInput.Text + ".pdf";
string filePresentCK = sourceFolder + searchWord;
if (File.Exists(filePresentCK))
{
pdfViewer1.LoadFromFile(sourceFolder+searchWord);
}
else if(! File.Exists(filePresentCK))
{
MessageBox.Show("Unable to Find file :" + searchWord);
}
txtInput.Clear();
}// end of btnSearch method
