Detect changes in directory - C#

I need to monitor a folder and its subdirectories for any file manipulations (add/remove/rename). I've read about FileSystemWatcher but I'd like to monitor changes between runs of the program, or when the user presses the "check for changes" button (FSW seems more oriented toward runtime detection). My first thought was to iterate through all the (sub)directories and hash each file, then concatenate the ordered hashes and hash that. When I want to check for changes, I repeat the process and check whether the hashes are the same.
Is this an efficient way of doing it?
Also, once I've detected a change, how do I find out what file has been added, removed or renamed as quickly as possible?
As a side note, I don't mind using scripts to do this if they're faster as long as those scripts don't require end users to install anything and the scripts can notify my C# app of the changes.

We handle this by storing all found files in a database along with their last modification time.
On each pass through the files, we check the database for each file: if it doesn't exist in the DB, it is new and if it does exist, but the timestamp is different, it has changed.
There is also an option to handle deleted files by marking all of the files in the database as ToBeDeleted prior to the pass and clearing the flag if the file was found. Then, at the end of the process, we can simply delete all of the records that are still marked ToBeDeleted.
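A minimal sketch of that mark-and-sweep pass, with an in-memory dictionary standing in for the database table:

using System;
using System.Collections.Generic;
using System.IO;

class ChangeScanner
{
    // Stand-in for the database table: path -> (last write time, ToBeDeleted flag).
    static readonly Dictionary<string, (DateTime LastWrite, bool ToBeDeleted)> Db =
        new Dictionary<string, (DateTime, bool)>(StringComparer.OrdinalIgnoreCase);

    static void Scan(string root)
    {
        // Mark every known file as ToBeDeleted before the pass.
        foreach (var path in new List<string>(Db.Keys))
            Db[path] = (Db[path].LastWrite, true);

        foreach (var file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            DateTime lastWrite = File.GetLastWriteTimeUtc(file);
            if (!Db.TryGetValue(file, out var known))
                Console.WriteLine("New: " + file);
            else if (known.LastWrite != lastWrite)
                Console.WriteLine("Changed: " + file);
            Db[file] = (lastWrite, false);   // found on disk, so clear the mark
        }

        // Anything still marked was not found on disk, so it was deleted.
        foreach (var path in new List<string>(Db.Keys))
            if (Db[path].ToBeDeleted)
            {
                Console.WriteLine("Deleted: " + path);
                Db.Remove(path);
            }
    }
}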

Obviously you need to make "snapshots" of the directory tree and compare them as required. What exactly goes into the snapshots would depend on your requirements. Keep in mind that:
You need to store filenames in order to detect "new" and "deleted" files
File sizes and last-modified times are a good and cheap indicator that a file has or has not changed, but do not provide a guarantee
Hashing the contents of files can be prohibitively expensive if the files can be large, but it's the only way to know they have changed with a near-perfect degree of accuracy. (Remember that hashes can collide as well, so if you want mathematical 100% certainty, that's not going to be good enough either.)
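As a sketch, a snapshot could map each path to (size, last-write time), and diffing two snapshots yields the added, removed and probably-changed files (content hashing omitted for the cost reasons above):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class Snapshot
{
    // path -> (size, last write time): cheap change indicators, not a guarantee.
    public static Dictionary<string, (long Size, DateTime LastWrite)> Take(string root)
    {
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Select(p => new FileInfo(p))
            .ToDictionary(f => f.FullName, f => (f.Length, f.LastWriteTimeUtc));
    }

    public static void Diff(Dictionary<string, (long Size, DateTime LastWrite)> before,
                            Dictionary<string, (long Size, DateTime LastWrite)> after)
    {
        foreach (var path in after.Keys.Except(before.Keys))
            Console.WriteLine("Added: " + path);
        foreach (var path in before.Keys.Except(after.Keys))
            Console.WriteLine("Deleted: " + path);
        foreach (var path in after.Keys.Intersect(before.Keys))
            if (before[path] != after[path])
                Console.WriteLine("Probably changed: " + path);
    }
}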

Related

FIFO log file in C#

I am working on an evolution of a log file system that has been in place for a few builds of a service I develop. Previously I would open the file, append data, and before writing check whether the log file had grown over a predetermined size; if so, I would start a new log.
So say the log size limit was 100 MB: at that size I delete the file and start a new one, but I lose the history. Functional, but not the best model.
What I want to do is a FIFO model that would chop off the top and add to the end while keeping the file consistently no larger than 100 MB, with at least as much history as that represents.
The data is high speed in a failure prone industrial environment, so keeping it all in memory and writing the whole file at interval has proven unreliable. (SSD, fast enough to do it reasonably most of the time, spinners fail too often to tolerate)
Likewise the records are of greatly variable length (formatted as XML nodes, so parsing them back out accounts for this easily)
So the only workable model I have come up with thus far is to keep smaller slices (say, 10 MB chunks), create new ones as needed, and delete the oldest slice once the count reaches 10.
What I would prefer to do is be able to keep the file on disk and work with the tag ends.
Open to suggestions on how this might be best achieved in a reasonable manner, or is there no reasonable manner and the layered multi log approach will be the best option?
The biggest issue with expiring old log entries in a single file is that you have to rewrite the file's content in order to expire older entries. This isn't too bad for small files (up to a few MB in size), but once you get to the point where rewriting takes a significant period of time it becomes problematic.
One of the more common ways to retire logs is to rename the existing log file and/or start a new file. Lots of programs do it that way, with either dated log file names or by using a sequential numbering system - logfile, logfile.1, logfile.2, etc. with higher-numbered files being older. You can add compression to the process to further reduce the storage requirements for expired files, etc.
Another option is to use a more database-like format, or an out-and-out database like SQLite to store your log entries. The primary downside of this of course is that your log files become more difficult to read, since they're not just in plain text form. It's simple enough to write a dump-to-text program whose output can be piped to a log parser... but even this will probably require a change in the way your consumers are interfacing with the log file.
The problem as stated is unlikely to be realistically solvable, I suspect. On the one hand you have the limitations of file manipulation, and on the other the fact that your log consumers are many and varied and therefore changes to the logging structure will be an involved process.
About all I can suggest is that you trial a log aging process similar to this:
Rename current log file
Walk renamed file and copy desired contents to new log file
Discard or archive renamed log
Beware duplication or data loss.
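A rough sketch of those three steps, assuming a plain byte offset is an acceptable truncation point (in practice you would scan forward to the next XML record boundary before copying, since records are variable-length):

using System;
using System.IO;

static class LogAger
{
    const long MaxBytes = 100 * 1024 * 1024;   // 100 MB cap
    const long KeepBytes = 90 * 1024 * 1024;   // tail to carry over (arbitrary)

    public static void Age(string logPath)
    {
        if (new FileInfo(logPath).Length < MaxBytes) return;

        string old = logPath + ".aging";
        File.Move(logPath, old);                 // step 1: rename current log

        using (var src = File.OpenRead(old))
        using (var dst = File.Create(logPath))   // step 2: copy desired tail
        {
            src.Seek(src.Length - KeepBytes, SeekOrigin.Begin);
            src.CopyTo(dst);
        }

        File.Delete(old);                        // step 3: discard (or archive) the rest
    }
}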
I don't know why you need the ability to "chop off the top and add to the end while keeping it consistently no larger than 100 MB".
The general design approach is archiving: simply rename the oversized file, or move it somewhere else, then start a new file under the same name.
It's as simple as that.

Uniquely identify file on Windows

I need to uniquely identify a file on Windows so I can always have a reference for that file even if it's moved or renamed. I did some research and found the question Unique file identifier in windows, which uses the GetFileInformationByHandle method in C++, but apparently that only works for NTFS partitions, not FAT ones.
I need to program a behavior like the one in Dropbox: if you close it on your computer, rename a file and open it again, it detects that change and syncs correctly. I wonder what the technique is, and maybe how Dropbox does it, if you know.
FileSystemWatcher, for example, would work, but if the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,
The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
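A sketch of that size-plus-hash fingerprint (SHA-256 over the full contents):

using System;
using System.IO;
using System.Security.Cryptography;

static class FileFingerprint
{
    // Size plus a SHA-256 content hash; a collision on both is astronomically
    // unlikely, but this reads every byte, so it is expensive for large files.
    public static (long Size, string Hash) Compute(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = sha.ComputeHash(stream);
            return (stream.Length, BitConverter.ToString(hash).Replace("-", ""));
        }
    }
}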
In Dropbox's case I think, though, that there is a service or process running in the background observing file system changes. It's the most reliable way, even if it ceases to work when you stop said service/process.
What the user was most likely looking for is Windows Change Journals. These track changes such as file renames persistently, with no need to keep a watcher running that observes file system events. Instead, one simply records how far into the journal one last read and continues from that point the next time. At some point a file with an already-known ID will have an event of type RENAME, and whoever is interested in that event can apply the same rename to its own version of the file. The important thing, of course, is to keep track of the IDs used for the files.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals
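Querying the journal from C# takes some P/Invoke. A minimal sketch that opens the volume (administrator rights required) and reads the journal's metadata; reading the actual records, including renames, is done the same way via FSCTL_READ_USN_JOURNAL, persisting NextUsn between runs:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class UsnJournal
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 1, FILE_SHARE_WRITE = 2;
    const uint OPEN_EXISTING = 3;
    const uint FSCTL_QUERY_USN_JOURNAL = 0x000900F4;

    [StructLayout(LayoutKind.Sequential)]
    struct USN_JOURNAL_DATA
    {
        public ulong UsnJournalID;
        public long FirstUsn, NextUsn, LowestValidUsn, MaxUsn;
        public ulong MaximumSize, AllocationDelta;
    }

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(string name, uint access, uint share,
        IntPtr security, uint disposition, uint flags, IntPtr template);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle device, uint code,
        IntPtr inBuf, int inSize, out USN_JOURNAL_DATA outBuf, int outSize,
        out int returned, IntPtr overlapped);

    static void Main()
    {
        // Open the raw volume; this is why elevation is needed.
        using (var volume = CreateFile(@"\\.\C:", GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE, IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero))
        {
            if (volume.IsInvalid) throw new IOException("Cannot open volume; run elevated.");

            if (!DeviceIoControl(volume, FSCTL_QUERY_USN_JOURNAL, IntPtr.Zero, 0,
                    out USN_JOURNAL_DATA data, Marshal.SizeOf<USN_JOURNAL_DATA>(),
                    out _, IntPtr.Zero))
                throw new IOException("FSCTL_QUERY_USN_JOURNAL failed.");

            Console.WriteLine("Journal " + data.UsnJournalID.ToString("X") +
                              ", next USN " + data.NextUsn);
        }
    }
}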

Difference between file copy/delete and Move

What is the difference between
Copying a file and deleting it using File.Copy() and File.Delete()
Moving the file using File.Move()
In terms of the permissions required to do these operations, is there any difference? Any help much appreciated.
The File.Move method can be used to move a file from one path to another. It works across disk volumes, and it does not throw an exception if the source and destination are the same.
You cannot use the Move method to overwrite an existing file: if you attempt to replace a file by moving a file of the same name into that directory, you get an IOException. To overcome this you can use the combination of the Copy and Delete methods.
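A small helper combining the two, as a sketch:

using System.IO;

static class FileMover
{
    // Move, falling back to Copy + Delete when the destination already exists.
    public static void MoveOverwriting(string source, string destination)
    {
        if (File.Exists(destination))
        {
            File.Copy(source, destination, overwrite: true);  // replaces destination
            File.Delete(source);                              // then drop the source
        }
        else
        {
            File.Move(source, destination);
        }
    }
}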
Performance-wise, if source and destination are on the same file system, moving a file is (in simplified terms) just adjusting some internal structures of the file system itself (possibly adjusting some nodes in a red/black tree), without actually moving any data.
Imagine you have 180MiB to move, and you can write onto your disk at roughly 30MiB/s. Then with copy/delete, it takes approximately 6 seconds to finish. With a simple move [same file system], it goes so fast you might not even realise it.
(I once wrote some transactional file system helpers that would move or copy multiple files, all or none. To make the commit as fast as possible, I moved/copied everything into a temporary sub-folder first; the final commit would then move the existing data into another folder (to enable rollback) and the new data up to the target.)
I don't think there is any difference permission-wise, but I would personally prefer to use File.Move() since then you have both actions happening in the same "transaction". In other words if something on the move fails the whole operation fails. However, if you break it up in two steps (copy + delete) if copy worked and delete failed, you would have to reverse the "transaction" (delete the copy) manually.
Permissions for a file transfer are checked at two points: the source and the destination. If you don't have read permission on the source folder, or write permission on the destination, both of these methods throw an UnauthorizedAccessException. In other words, permission checking does not depend on which method you use.

Directory monitoring

What is the best way for me to check for new files added to a directory? I don't think FileSystemWatcher would be suitable, as this is not an always-on service but a method that runs when my program starts up.
There are over 20,000 files in the folder structure I am monitoring. At present I am checking each file individually to see if its path is in my database table, but this takes around ten minutes and I would like to speed it up if possible.
I can store the date the folder was last checked - is it easy to get all files with a created date later than the last-checked date?
Anyone got any ideas?
Thanks
Mark
Your approach is the only feasible one (a file system watcher lets you see changes as they happen; it cannot check on start-up).
Find out what takes so long: 20,000 checks should not take 10 minutes - maybe one minute at most. How are you testing it?
Hint: do not query the database once per file. Get a list of all files on disk into memory, and a list of all files in the database, and compare them in memory. 20,000 SQL statements to the database are too slow; this way you need ONE statement to get the list.
10 minutes seems awfully long for 20,000 files. How are you going about doing the comparison? Your suggestion doesn't account for deleted files either. If you want to remove those from the database, you will have to do a full comparison.
Perhaps the problem is the database round trips. You can retrieve a known file list from the database in large chunks (or all at once), sorted alphabetically. Sort the local file list as well and walk the two lists, processing missing or new entries as you go along.
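A sketch of that shape, using set operations rather than a sorted two-pointer walk (the walk scales better for very large lists, but sets are simpler), and assuming knownPaths comes back from a single database query:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FolderSync
{
    // One query for the known list, one directory walk, then set operations
    // in memory: no per-file database round trips.
    static void CompareOnce(string root, HashSet<string> knownPaths)
    {
        var onDisk = new HashSet<string>(
            Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories),
            StringComparer.OrdinalIgnoreCase);

        var added   = onDisk.Except(knownPaths).ToList();   // insert these
        var deleted = knownPaths.Except(onDisk).ToList();   // remove these

        Console.WriteLine(added.Count + " new, " + deleted.Count + " deleted");
    }
}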
FileSystemWatcher is not reliable, so even if you could use a service, it would not necessarily work for you.
The two options I can see are:
Keep a list of files you know about and keep comparing to this list. This will allow you to see if files were added, deleted etc. Keep this list in memory, instead of querying the database for each file.
As you suggest, store a timestamp and compare to that.
You can store the timestamp of the last check somewhere and compare each file's creation time against it; it is simple and can work for you.
Can you write a service that runs on that machine? The service can then use FileSystemWatcher.
Having a FileSystemWatcher service like Kevin Jones suggests is probably the most pragmatic answer, but there are some other options.
You can watch the directory with inotify if you mount it with Samba on a Linux box. That of course assumes you don't mind fragmenting your platform, but that's what inotify is there for.
More correctly, but with correspondingly less chance of getting a go-ahead: if you're monitoring a directory with 20,000 files in it, it is probably time to evolve your system architecture. Not knowing much more about your application, it sounds like a message queue might be worth looking at.

Detecting moved files using FileSystemWatcher

I realise that FileSystemWatcher does not provide a Move event; instead it generates separate Delete and Create events for the same file. (The FileSystemWatcher is watching both the source and destination folders.)
However how do we differentiate between a true file move and some random creation of a file that happens to have the same name as a file that was recently deleted?
Some sort of property of the FileSystemEventArgs class such as "AssociatedDeleteFile" that is assigned the deleted file path if it is the result of a move, or NULL otherwise, would be great. But of course this doesn't exist.
I also understand that the FileSystemWatcher is operating at the basic Filesystem level and so the concept of a "Move" may be only meaningful to higher level applications. But if this is the case, what sort of algorithm would people recommend to handle this situation in my application?
Update based on feedback:
The FileSystemWatcher class seems to see moving a file as simply 2 distinct events, a Delete of the original file, followed by a Create at the new location.
Unfortunately there is no "link" provided between these events, so it is not obvious how to differentiate between a file move and a normal Delete or Create. At the OS level, a move is treated specially; you can move, say, a 1 GB file almost instantaneously.
A couple of answers suggested using a hash on files to identify them reliably between events, and I will probably take this approach. But if anyone knows how to detect a move more simply, please leave an answer.
According to the docs:
Common file system operations might raise more than one event. For example, when a file is moved from one directory to another, several OnChanged and some OnCreated and OnDeleted events might be raised. Moving a file is a complex operation that consists of multiple simple operations, therefore raising multiple events.
So if you're trying to be very careful about detecting moves, and having the same path is not good enough, you will have to use some sort of heuristic. For example, create a "fingerprint" using file name, size, last modified time, etc for files in the source folder. When you see any event that may signal a move, check the "fingerprint" against the new file.
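A sketch of such a heuristic, pairing each Created event with a recently seen Deleted event whose fingerprint matches. Here the fingerprint is just name and size; note that the size of a deleted file must come from a snapshot taken before the delete, since the file is already gone when the event fires:

using System;
using System.Collections.Generic;
using System.IO;

class MoveDetector
{
    // Fingerprints of recently deleted files; a matching creation within the
    // window is treated as a probable move.
    readonly Dictionary<(string Name, long Size), DateTime> _recentlyDeleted =
        new Dictionary<(string, long), DateTime>();
    static readonly TimeSpan Window = TimeSpan.FromSeconds(2);

    // knownSize comes from a snapshot taken before the delete event.
    public void OnDeleted(string path, long knownSize)
    {
        _recentlyDeleted[(Path.GetFileName(path), knownSize)] = DateTime.UtcNow;
    }

    public void OnCreated(string path)
    {
        var key = (Path.GetFileName(path), new FileInfo(path).Length);
        if (_recentlyDeleted.TryGetValue(key, out var when) &&
            DateTime.UtcNow - when <= Window)
        {
            _recentlyDeleted.Remove(key);
            Console.WriteLine("Probable move detected: " + path);
        }
        else
        {
            Console.WriteLine("New file: " + path);
        }
    }
}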
As far as I understand it, the Renamed event is for files being moved...?
My mistake - the docs specifically say that only files inside a moved folder are considered "renamed" in a cut-and-paste operation:
The operating system and FileSystemWatcher object interpret a cut-and-paste action or a move action as a rename action for a folder and its contents. If you cut and paste a folder with files into a folder being watched, the FileSystemWatcher object reports only the folder as new, but not its contents because they are essentially only renamed.
It also says about moving files:
Common file system operations might raise more than one event. For example, when a file is moved from one directory to another, several OnChanged and some OnCreated and OnDeleted events might be raised. Moving a file is a complex operation that consists of multiple simple operations, therefore raising multiple events.
As you already mentioned, there is no reliable way to do this with the default FileSystemWatcher class provided by C#. You can apply certain heuristics like filename, hashes, or unique file ids to map created and deleted events together, but none of these approaches will work reliably. In addition, you cannot easily get the hash or file id for the file associated with the deleted event, meaning that you have to maintain these values in some sort of database.
I think the only reliable approach for detecting file movements is to create your own file system watcher, and there are different ways to do that. If you are only going to watch changes on NTFS file systems, one solution is to read the NTFS change journal as described here. What's nice about this is that it even allows you to track changes that occurred while your app wasn't running.
Another approach is to create a minifilter driver that tracks file system operations and forwards them to your application. Using this you basically get all information about what is happening to your files and you'll be able to get information about moved files. A drawback of this approach is that you have to create a separate driver that needs to be installed on the target system. The good thing however is that you wouldn't need to start from scratch, because I already started to create something like this: https://github.com/CenterDevice/MiniFSWatcher
This allows you to simply track moved files like this:
var eventWatcher = new EventWatcher();
eventWatcher.OnRenameOrMove += (filename, oldFilename, process) =>
{
    Console.WriteLine("File " + oldFilename + " has been moved to " + filename + " by process " + process);
};
eventWatcher.Connect();
eventWatcher.WatchPath("C:\\Users\\MyUser\\*");
However, please be aware that this requires kernel code that needs to be signed in order to run on 64-bit versions of Windows (unless you disable signature checking for testing). At the time of writing, this code is also still in an early stage of development, so I would not use it on production systems yet. But even if you're not going to use it, it should still give you some information about how file system events can be tracked on Windows.
I'll hazard a guess that 'move' indeed does not exist, so you're really just going to have to look for a 'delete', mark that file as possibly moved, and then if you see a 'create' for it shortly after, assume your guess was correct.
Do you have a case of random file creations affecting your detection of moves?
Might want to try the OnChanged and/or OnRenamed events mentioned in the documentation.
The StorageLibrary class can track moves. The example from Microsoft:
StorageLibrary videosLib = await StorageLibrary.GetLibraryAsync(KnownLibraryId.Videos);
StorageLibraryChangeTracker videoTracker = videosLib.ChangeTracker;
videoTracker.Enable();
A complete example can be found here.
However, it looks like you can only track changes inside Windows "known libraries".
You can also try to get a StorageLibraryChangeTracker using StorageFolder.TryGetChangeTracker(), but your folder must be under a sync root; you cannot use this method on an arbitrary folder in the file system.
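Reading the tracked changes then goes through a change reader. A sketch continuing Microsoft's snippet above (WinRT API, inside an async method):

// Read whatever changed since the last checkpoint.
StorageLibraryChangeReader reader = videoTracker.GetChangeReader();
IReadOnlyList<StorageLibraryChange> changes = await reader.ReadBatchAsync();

foreach (StorageLibraryChange change in changes)
{
    if (change.ChangeType == StorageLibraryChangeType.MovedOrRenamed)
        Console.WriteLine(change.PreviousPath + " -> " + change.Path);
}

// Accept the batch so the next read starts after this point.
await reader.AcceptChangesAsync();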
