How to determine if a Directory/Folder has been changed

I have a folder with the following structure
Parent/
    Child1/
        GrandChild1/
            File1.txt
I need to query Parent folder and find out if Child1 has changed.
Changed = a file was added, updated, or deleted.
The DateModified of the Child1 folder is not updated; only GrandChild1's date modified is updated when a change occurs. I am trying to avoid going down to the file level to determine whether the parent has changed, since there will be many folders and subfolders. I just need to know whether Child1 has changed.
I do not want to use FileSystemWatcher, since I am running this as a scheduled job and not watching it LIVE.

Use FileSystemWatcher. Remember to enable raising events, since forgetting it is a common mistake (watchfolder.EnableRaisingEvents = true;).
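For reference, a minimal sketch of such a watcher. The class name WatcherSketch, the Start method, and the onChange callback are made up for illustration; only the FileSystemWatcher members themselves come from the framework.

```csharp
using System;
using System.IO;

public static class WatcherSketch
{
    // Creates a watcher for 'path' and all of its subfolders.
    // 'onChange' is a hypothetical callback that receives a short description.
    public static FileSystemWatcher Start(string path, Action<string> onChange)
    {
        var watcher = new FileSystemWatcher(path)
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName
                         | NotifyFilters.DirectoryName
                         | NotifyFilters.LastWrite
        };

        watcher.Created += (s, e) => onChange("Created: " + e.FullPath);
        watcher.Changed += (s, e) => onChange("Changed: " + e.FullPath);
        watcher.Deleted += (s, e) => onChange("Deleted: " + e.FullPath);
        watcher.Renamed += (s, e) => onChange("Renamed: " + e.OldFullPath + " -> " + e.FullPath);

        // The step that is easy to forget: no events are raised until this is set.
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

The caller has to keep the returned watcher alive (and dispose it when done). This only sees changes while the process is running, which is why it does not fit a scheduled job by itself.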
The FileSystemWatcher may prove not to be optimal from a performance perspective. If that is an issue for you, you might implement a CRC check with a Timer to check for changes to the files and folders you are interested in.
Essentially, I would generate a CRC32 hash for the entire folder being watched and save it in a variable A. When it is time to check for changes, calculate a new CRC32 hash for the same folder into a variable B. Then compare A with B; if they don't match, something has changed. Really not that difficult.
Reference:
http://www.codeproject.com/Articles/26528/C-Application-to-Watch-a-File-or-Directory-using-F
http://social.msdn.microsoft.com/Forums/zh/netfxbcl/thread/b7612249-eb32-4005-9d6b-7f291c218326
http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
http://marknelson.us/1992/05/01/file-verification-using-crc-2/
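A rough sketch of the "hash the whole folder" idea from this answer, with two substitutions that are my assumptions and not part of the answer: it uses SHA-256 in place of CRC32 (CRC32 is not in the base class library), and it hashes each file's relative path, size and last write time rather than the file contents, to keep the scan cheap. FolderFingerprint and Compute are made-up names.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class FolderFingerprint
{
    // Returns a hex string that changes when any file under 'root' is
    // added, removed, renamed, resized, or gets a new last write time.
    public static string Compute(string root)
    {
        var sb = new StringBuilder();
        var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                             .OrderBy(p => p, StringComparer.Ordinal); // stable order

        foreach (var path in files)
        {
            var info = new FileInfo(path);
            sb.Append(path.Substring(root.Length)).Append('|')
              .Append(info.Length).Append('|')
              .Append(info.LastWriteTimeUtc.Ticks).Append('\n');
        }

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

A scheduled job could store the value of FolderFingerprint.Compute(@"C:\Parent\Child1") from the previous run (variable A in the answer) and compare it with the current one (variable B). Note that this still enumerates every file; it only avoids reading their contents.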

Have you tried the file system watcher?
You can monitor local drives for changes from a given path, and then if necessary, ignore or process the fact they changed.

You can use the FileSystemWatcher Class for this.
Alternatively, if you would rather schedule a Task to run, Weekly, for example, you might want to have a look at: http://taskscheduler.codeplex.com/ and http://www.emoreau.com/Entries/Articles/2004/08/Interfacing-the-Windows-Task-Scheduler.aspx
And here's a link to the Windows Task Scheduler API

Related

How can System.IO.FileSystemInfo.Refresh be used

There is an impressive lack of examples of the usage of Refresh.
I'm using the following method, which gets an inaccurate time
ViewBag.t1 = System.IO.File.GetLastAccessTime(@"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
I read that it's inaccurate because the OS hasn't performed a check and updated the file's read/write times.
I've tried
System.IO.FileSystemInfo.Refresh(@"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
But this does not work and I can't locate a resource giving similar examples of its usage.

FileSystemInfo.Refresh is not a static method. What you have shown for your example does not compile. You should create a FileInfo object initialized with the file name and then you can call Refresh on that. You should then be able to use the properties of the FileInfo object to get the last access time and other pertinent file details.
// Refresh() is an instance method and takes no arguments.
var info = new FileInfo(@"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
info.Refresh();
var lastAccess = info.LastAccessTime;
One last edit based on an answer at the above linked possible duplicate and CodeCaster's answer:
http://blogs.technet.com/b/filecab/archive/2006/11/07/disabling-last-access-time-in-windows-vista-to-improve-ntfs-performance.aspx
It indicates that in Vista this was disabled by default. I just checked the registry on my Win 8.1 box and, sure enough, the registry key is there and last access updates are disabled by default. So if you are on Vista or later, the code above won't really work. If you are on XP, then you should be golden!

FileSystemInfo is the abstract base class for FileInfo and DirectoryInfo, which cache the properties of a file/directory. If you keep, say, a FileInfo object around and keep testing its Exists property, then it becomes important that you call Refresh().
That has nothing to do with File.GetLastAccessTime(). The classes are entirely unrelated; the File class does no caching and always retrieves the last access time from the file system.
That value is unreliable if the file is opened by any program. The file system is just not in a hurry to update these attributes when a program is actively accessing the file. That's way too expensive: it can easily cost many dozens of milliseconds to send the disk drive write head to the MFT sector that stores these values. A program can access a file much faster than that. Documented in this MSDN article:
Not all file systems can record creation and last access times, and not all file systems record them in the same manner. For example, the resolution of create time on FAT is 10 milliseconds, while write time has a resolution of 2 seconds and access time has a resolution of 1 day, so it is really the access date. The NTFS file system delays updates to the last access time for a file by up to 1 hour after the last access.
The last sentence of the quote is the most relevant one; what you see is pretty much expected. You'll need to look for a different approach.
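To illustrate the caching this answer describes for FileInfo.Exists, here is a small sketch. RefreshDemo and Run are made-up names, and the path passed in is assumed not to exist yet.

```csharp
using System;
using System.IO;

public static class RefreshDemo
{
    // Returns (Exists as seen through the stale cache, Exists after Refresh)
    // for a file that is created after the FileInfo was first queried.
    public static Tuple<bool, bool> Run(string path)
    {
        var info = new FileInfo(path);

        // The first property access reads the file state and caches it.
        if (info.Exists)
            throw new InvalidOperationException("path must not exist yet");

        // Create the file behind the FileInfo object's back.
        File.WriteAllText(path, "x");

        bool stale = info.Exists;   // answered from the cached state
        info.Refresh();             // re-reads the state from the file system
        bool fresh = info.Exists;

        return Tuple.Create(stale, fresh);
    }
}
```

File.Exists(path) or File.GetLastAccessTime(path) would not need the Refresh call, since the static File methods query the file system each time.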

Uniquely identify file on Windows

I need to uniquely identify a file on Windows so I can always have a reference to that file even if it's moved or renamed. I did some research and found the question "Unique file identifier in windows", which describes a way that uses the GetFileInformationByHandle function from C++, but apparently that only works on NTFS partitions, not on FAT ones.
I need to program a behavior like the one in Dropbox: if you close it on your computer, rename a file and open it again, it detects that change and syncs correctly. I wonder what the technique is, and maybe how Dropbox does it, if you guys know.
FileSystemWatcher, for example, would work, but if the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,

The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
In Dropbox's case, though, I think there is a service or process running in the background that observes file system changes. It's the most reliable way, even if it ceases to work when you stop said service/process.
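A minimal sketch of the "file size plus SHA-256 of the contents" fallback mentioned above. ContentIdentity and Compute are made-up names; the GetFileInformationByHandle route for NTFS is not shown.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ContentIdentity
{
    // Returns "<size>:<sha256 hex>" for the file. Reads the whole file,
    // which is the cost the answer warns about.
    public static string Compute(string path)
    {
        using (var stream = File.OpenRead(path))
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream);
            return stream.Length + ":" + BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

This identity survives a move or rename but not an edit, and two files with identical contents get the same value, so it can only match up files whose contents stayed the same between two runs.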

What the user was looking for was most likely Windows change journals. They track changes such as file renames persistently, so there is no need to have a watcher running all the time to observe file system events. Instead, one simply needs to remember the position in the journal that was read last and continue reading from that point. At some point a file with an already known ID will have an event of type RENAME, and whoever is interested in that event can apply the same rename to its own version of that file. The important thing, of course, is to keep track of the IDs used for files.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals

Detect changes in directory

I need to monitor a folder and its subdirectories for any file manipulations (add/remove/rename). I've read about FileSystemWatcher but I'd like to monitor changes between each time the program is run or when the user presses the "check for changes" button (FSW seems more orientated to runtime detection). My first thought was to iterate through all the (sub)directories and hash each file. Then, concatenate all the hashes (which have been ordered) and hash that. When I want to check for changes, I repeat the process and check if the hashes are the same.
Is this an efficient way of doing it?
Also, once I've detected a change, how do I find out what file has been added, removed or renamed as quickly as possible?
As a side note, I don't mind using scripts to do this if they're faster as long as those scripts don't require end users to install anything and the scripts can notify my C# app of the changes.

We handle this by storing all found files in a database along with their last modification time.
On each pass through the files, we check the database for each file: if it doesn't exist in the DB, it is new and if it does exist, but the timestamp is different, it has changed.
There is also an option to handle deleted files by marking all of the files in the database as ToBeDeleted prior to the pass and clearing this flag if the file was found. Then, at the end of the process, we can just delete all of the records that are still marked as ToBeDeleted.
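The mark-and-sweep pass described in this answer could look roughly like the following. This is a sketch under assumptions: a Dictionary stands in for the database table, and FileRecord, FileIndex and Scan are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class FileRecord
{
    public DateTime LastWriteUtc;
    public bool ToBeDeleted;
}

public static class FileIndex
{
    // 'db' stands in for the database table, keyed by full path.
    // Paths found to be new, changed or deleted are appended to the lists.
    public static void Scan(string root, Dictionary<string, FileRecord> db,
                            List<string> added, List<string> changed, List<string> deleted)
    {
        // Mark: assume every known file is gone until the pass finds it.
        foreach (var rec in db.Values) rec.ToBeDeleted = true;

        foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            DateTime stamp = File.GetLastWriteTimeUtc(path);
            FileRecord rec;
            if (!db.TryGetValue(path, out rec))
            {
                db[path] = new FileRecord { LastWriteUtc = stamp };
                added.Add(path);
            }
            else
            {
                if (rec.LastWriteUtc != stamp)
                {
                    rec.LastWriteUtc = stamp;
                    changed.Add(path);
                }
                rec.ToBeDeleted = false; // found, so clear the mark
            }
        }

        // Sweep: whatever is still marked no longer exists on disk.
        foreach (var path in new List<string>(db.Keys))
        {
            if (db[path].ToBeDeleted)
            {
                db.Remove(path);
                deleted.Add(path);
            }
        }
    }
}
```

With a real database, the mark and sweep steps would each be a single UPDATE or DELETE statement rather than a loop.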

Obviously you need to make "snapshots" of the directory tree and compare them as required. What exactly goes into the snapshots would depend on your requirements. Keep in mind that:
You need to store filenames in order to detect "new" and "deleted" files
File sizes and last-modified times are a good and cheap indicator that a file has or has not changed, but do not provide a guarantee
Hashing the contents of files can be prohibitively expensive if the files can be large, but it's the only way to know they have changed with a near-perfect degree of accuracy (remember that hashes can collide as well, so if you want mathematical 100% certainty that's not going to be good enough either)
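A sketch of the snapshot idea using the cheap indicators (name, size, last write time). All names here (Snapshot, Take, Save, Load, Diff) are made up, and the tab-separated file format is just an assumption that works as long as paths contain no tab characters.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class Snapshot
{
    // Maps relative path -> "size|lastWriteTicks".
    public static Dictionary<string, string> Take(string root)
    {
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .ToDictionary(
                p => p.Substring(root.Length),
                p =>
                {
                    var f = new FileInfo(p);
                    return f.Length + "|" + f.LastWriteTimeUtc.Ticks;
                });
    }

    // Persists a snapshot so the next program run can compare against it.
    public static void Save(Dictionary<string, string> snap, string file)
    {
        File.WriteAllLines(file, snap.Select(kv => kv.Key + "\t" + kv.Value));
    }

    public static Dictionary<string, string> Load(string file)
    {
        return File.ReadAllLines(file)
            .Select(line => line.Split('\t'))
            .ToDictionary(parts => parts[0], parts => parts[1]);
    }

    public static void Diff(Dictionary<string, string> oldSnap, Dictionary<string, string> newSnap,
        out List<string> added, out List<string> removed, out List<string> changed)
    {
        added = newSnap.Keys.Where(k => !oldSnap.ContainsKey(k)).ToList();
        removed = oldSnap.Keys.Where(k => !newSnap.ContainsKey(k)).ToList();
        changed = newSnap
            .Where(kv => oldSnap.ContainsKey(kv.Key) && oldSnap[kv.Key] != kv.Value)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

The snapshot file should live outside the monitored folder, otherwise it shows up as a change itself. A rename appears as one removed and one added entry; pairing them up needs an extra step, for example matching on size and timestamp.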

directory monitoring

What is the best way for me to check for new files added to a directory? I don't think the FileSystemWatcher would be suitable, as this is not an always-on service but a method that runs when my program starts up.
There are over 20,000 files in the folder structure I am monitoring. At present I am checking each file individually to see if the file path is in my database table; however, this is taking around ten minutes and I would like to speed it up if possible.
I can store the date the folder was last checked. Is it easy to get all files with created date > last checked date?
Anyone got any ideas?
Thanks
Mark

Your approach is the only feasible one (i.e. a file system watcher lets you see changes as they happen, not check for them on start).
Find out what takes so long. 20,000 checks should not take 10 minutes, maybe 1 at most. Your program is slow. How do you test it?
Hint: do not query the database per file. Get a list of all files on disk into memory and a list of all files in the database, then compare them in memory. 20,000 SQL statements to the database are too slow; this way you need ONE to get the list.
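The hint could be sketched like this, assuming the known paths have already been loaded from the database with a single query. NewFileFinder and FindNew are made-up names, and the case-insensitive comparer is an assumption that fits Windows paths.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class NewFileFinder
{
    // knownPaths: every path already stored in the database, fetched once.
    public static List<string> FindNew(string root, IEnumerable<string> knownPaths)
    {
        // Hash set lookups are in memory, so there is no round trip per file.
        var known = new HashSet<string>(knownPaths, StringComparer.OrdinalIgnoreCase);

        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                        .Where(p => !known.Contains(p))
                        .ToList();
    }
}
```

The new paths can then be inserted into the database in one batch.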

10 minutes seems awfully long for 20,000 files. How are you going about doing the comparison? Your suggestion doesn't account for deleted files either. If you want to remove those from the database, you will have to do a full comparison.
Perhaps the problem is the database round trips. You can retrieve a known file list from the database in large chunks (or all at once), sorted alphabetically. Sort the local file list as well and walk the two lists, processing missing or new entries as you go along.
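Walking the two sorted lists side by side might look like this. SortedWalk and Compare are made-up names; the sketch assumes both lists were sorted with the same ordinal comparison, which a database's ORDER BY does not guarantee, so sorting again in memory may be safer.

```csharp
using System;
using System.Collections.Generic;

public static class SortedWalk
{
    // Both lists must be sorted ordinally. Entries only on disk go to 'added',
    // entries only in the database go to 'missing'.
    public static void Compare(IList<string> onDisk, IList<string> inDb,
                               List<string> added, List<string> missing)
    {
        int i = 0, j = 0;
        while (i < onDisk.Count && j < inDb.Count)
        {
            int c = string.CompareOrdinal(onDisk[i], inDb[j]);
            if (c == 0) { i++; j++; }                 // present in both
            else if (c < 0) added.Add(onDisk[i++]);   // on disk only
            else missing.Add(inDb[j++]);              // in database only
        }

        // Whatever remains in either list has no counterpart.
        while (i < onDisk.Count) added.Add(onDisk[i++]);
        while (j < inDb.Count) missing.Add(inDb[j++]);
    }
}
```

This handles deleted files in the same pass and works on chunks, since it only ever looks at the current element of each list.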

FileSystemWatcher is not reliable, so even if you could use a service, it would not necessarily work for you.
The two options I can see are:
Keep a list of files you know about and keep comparing to this list. This will allow you to see if files were added, deleted etc. Keep this list in memory, instead of querying the database for each file.
As you suggest, store a timestamp and compare to that.
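The timestamp option could be sketched as follows. CreatedSince and Find are made-up names; storing and loading the last checked time is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CreatedSince
{
    // Returns the full paths of all files under 'root' whose creation time
    // is later than the stored time of the previous check.
    public static List<string> Find(string root, DateTime lastCheckedUtc)
    {
        return new DirectoryInfo(root)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Where(f => f.CreationTimeUtc > lastCheckedUtc)
            .Select(f => f.FullName)
            .ToList();
    }
}
```

Two caveats: a file moved into the folder from elsewhere on the same volume keeps its original creation time and would be missed, and this approach cannot see deletions.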

You can store somewhere the timestamp of the most recently created file; it is simple and may work for you.

Can you write a service that runs on that machine? The service can then use FileSystemWatcher.

Having a FileSystemWatcher service like Kevin Jones suggests is probably the most pragmatic answer, but there are some other options.
You can watch the directory with inotify if you mount it with Samba on a Linux box. That of course assumes you don't mind fragmenting your platform, but that's what inotify is there for.
And then, more correctly but with correspondingly less chance of you getting a go-ahead: if you're sitting monitoring a directory with 20K files in it, it is probably time to evolve your system architecture. Without knowing much more about your application, it sounds like a message queue might be worth looking at.

Detecting moved files using FileSystemWatcher

I realise that FileSystemWatcher does not provide a Move event, instead it will generate a separate Delete and Create events for the same file. (The FilesystemWatcher is watching both the source and destination folders).
However how do we differentiate between a true file move and some random creation of a file that happens to have the same name as a file that was recently deleted?
Some sort of property of the FileSystemEventArgs class such as "AssociatedDeleteFile" that is assigned the deleted file path if it is the result of a move, or NULL otherwise, would be great. But of course this doesn't exist.
I also understand that the FileSystemWatcher is operating at the basic Filesystem level and so the concept of a "Move" may be only meaningful to higher level applications. But if this is the case, what sort of algorithm would people recommend to handle this situation in my application?
Update based on feedback:
The FileSystemWatcher class seems to see moving a file as simply 2 distinct events, a Delete of the original file, followed by a Create at the new location.
Unfortunately there is no "link" provided between these events, so it is not obvious how to differentiate between a file move and a normal Delete or Create. At the OS level, a move is treated specially, you can move say a 1GB file almost instantaneously.
A couple of answers suggested using a hash on files to identify them reliably between events, and I will probably take this approach. But if anyone knows how to detect a move more simply, please leave an answer.

According to the docs:
Common file system operations might raise more than one event. For example, when a file is moved from one directory to another, several OnChanged and some OnCreated and OnDeleted events might be raised. Moving a file is a complex operation that consists of multiple simple operations, therefore raising multiple events.
So if you're trying to be very careful about detecting moves, and having the same path is not good enough, you will have to use some sort of heuristic. For example, create a "fingerprint" using file name, size, last modified time, etc for files in the source folder. When you see any event that may signal a move, check the "fingerprint" against the new file.
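One way the fingerprint heuristic might be wired up is shown below. This is only a sketch: MoveDetector and its methods are made-up names, the fingerprint is file name plus size plus last write time as suggested above, and there is no time window or handling of a Created event that arrives before its Deleted event.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class MoveDetector
{
    private readonly Dictionary<string, string> _fingerprintByPath = new Dictionary<string, string>();
    private readonly Dictionary<string, string> _recentlyDeleted = new Dictionary<string, string>();

    public static string Fingerprint(string path)
    {
        var info = new FileInfo(path);
        return info.Name + "|" + info.Length + "|" + info.LastWriteTimeUtc.Ticks;
    }

    // Call once for every file that already exists in the watched folders.
    public void Track(string path)
    {
        _fingerprintByPath[path] = Fingerprint(path);
    }

    // Call from the Deleted handler. The file is gone by now, so the
    // fingerprint has to come from what was stored earlier.
    public void OnDeleted(string path)
    {
        string fp;
        if (_fingerprintByPath.TryGetValue(path, out fp))
        {
            _fingerprintByPath.Remove(path);
            _recentlyDeleted[fp] = path;
        }
    }

    // Call from the Created handler. Returns the old path if the new file
    // matches a recently deleted one, otherwise null.
    public string OnCreated(string path)
    {
        string fp = Fingerprint(path);
        _fingerprintByPath[path] = fp;

        string oldPath;
        if (_recentlyDeleted.TryGetValue(fp, out oldPath))
        {
            _recentlyDeleted.Remove(fp);
            return oldPath;
        }
        return null;
    }
}
```

It remains a heuristic: a new file with the same name, size and timestamp as a deleted one is reported as a move, and a file that is still being written when Created fires may not match yet.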

As far as I understand it, the Renamed event is for files being moved...?
My mistake - the docs specifically say that only files inside a moved folder are considered "renamed" in a cut-and-paste operation:
The operating system and FileSystemWatcher object interpret a cut-and-paste action or a move action as a rename action for a folder and its contents. If you cut and paste a folder with files into a folder being watched, the FileSystemWatcher object reports only the folder as new, but not its contents because they are essentially only renamed.
It also says about moving files:
Common file system operations might raise more than one event. For example, when a file is moved from one directory to another, several OnChanged and some OnCreated and OnDeleted events might be raised. Moving a file is a complex operation that consists of multiple simple operations, therefore raising multiple events.

As you already mentioned, there is no reliable way to do this with the default FileSystemWatcher class provided by C#. You can apply certain heuristics like filename, hashes, or unique file ids to map created and deleted events together, but none of these approaches will work reliably. In addition, you cannot easily get the hash or file id for the file associated with the deleted event, meaning that you have to maintain these values in some sort of database.
I think the only reliable approach for detecting file movements is to write your own file system watcher. There are different ways to do that. If you are only going to watch changes on NTFS file systems, one solution might be to read the NTFS change journal as described here. What's nice about this is that it even allows you to track changes that occurred while your app wasn't running.
Another approach is to create a minifilter driver that tracks file system operations and forwards them to your application. Using this you basically get all information about what is happening to your files and you'll be able to get information about moved files. A drawback of this approach is that you have to create a separate driver that needs to be installed on the target system. The good thing however is that you wouldn't need to start from scratch, because I already started to create something like this: https://github.com/CenterDevice/MiniFSWatcher
This allows you to simply track moved files like this:
var eventWatcher = new EventWatcher();
eventWatcher.OnRenameOrMove += (filename, oldFilename, process) =>
{
    Console.WriteLine("File " + oldFilename + " has been moved to " + filename + " by process " + process);
};
eventWatcher.Connect();
eventWatcher.WatchPath("C:\\Users\\MyUser\\*");
However, please be aware that this requires kernel code that needs to be signed in order to run on 64-bit versions of Windows (unless you disable signature checking for testing). At the time of writing, this code is also still in an early stage of development, so I would not use it on production systems yet. But even if you're not going to use this, it should still give you some information about how file system events might be tracked on Windows.

I'll hazard a guess 'move' indeed does not exist, so you're really just going to have to look for a 'delete' and then mark that file as one that could be 'possibly moved', and then if you see a 'create' for it shortly after, I suppose you can assume you're correct.
Do you have a case of random file creations affecting your detection of moves?

Might want to try the OnChanged and/or OnRenamed events mentioned in the documentation.

StorageLibrary class can track moves. The example from Microsoft:
StorageLibrary videosLib = await StorageLibrary.GetLibraryAsync(KnownLibraryId.Videos);
StorageLibraryChangeTracker videoTracker = videosLib.ChangeTracker;
videoTracker.Enable();
A complete example could be found here.
However, it looks like you can only track changes inside Windows "known libraries".
You can also try to get StorageLibraryChangeTracker using StorageFolder.TryGetChangeTracker(). But your folder must be under sync root, you can not use this method to get an arbitrary folder in file system.
