Uniquely identify file on Windows - c#

I need to uniquely identify a file on Windows so I can always have a reference for that file even if it's moved or renamed. I did some research and found the question Unique file identifier in windows with a way that uses the method GetFileInformationByHandle with C++, but apparently that only works for NTFS partitions, but not for the FAT ones.
I need to program a behavior like the one on DropBox: if you close it on your computer, rename a file and open it again it detects that change and syncs correctly. I wonder whats the technique and maybe how DropBox does if you guys know.
FileSystemWatcher for example would work, but If the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,

The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
In Dropbox' case I think though, that there is a service or process running in background observing file system changes. It's the most reliable way, even if it ceases to work if you stop said service/process.

What the user was looking for was most likely Windows Change Journals. Those track changes like renames of files persistently, no need to have a watcher observing file system events running all the time. Instead, one simply needs to maintain when last looked at the log and continue looking again beginning at that point. At some point a file with an already known ID would have an event of type RENAME and whoever is interested in that event could do the same for its own version of that file. The important thing is to keep track of the used IDs for files of course.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals

Related

Does File.Move() deletes original on IoException?

I have a webservice that is writing files that are being read by a different program.
To keep the reader program from reading them before they're done writing, I'm writing them with a .tmp extension, then using File.Move to rename them to a .xml extension.
My problem is when we are running at volume - thousands of files in just a couple of minutes.
I've successfully written file "12345.tmp", but when I try to rename it, File.Move() throws an IOException:
File.Move("12345.tmp", "12345.xml")
Exception: The process cannot access the file because it is being used
by another process.
For my situation, I don't really care what the filenames are, so I retry:
File.Move("12345.tmp", "12346.xml")
Exception: Exception: Could not find file '12345.tmp'.
Is File.Move() deleting the source file, if it encounters an error in renaming the file?
Why?
Is there someway to ensure that the file either renames successfully or is left unchanged?
The answer is that it depends much on how the file system itself is implemented. Also, if the Move() is between two file systems (possibly even between two machines, if the paths are network shares) - then it also depends much on the O/S implementation of Move(). Therefore, the guarantees depend less on what System.IO.File does, and more about the underlying mechanisms: the O/S code, file-system drivers, file system structure etc.
Generally, in the vast majority of cases Move() will behave the way you expect it to: either the file is moved or it remains as it was. This is because a Move within a single file system is an act of removing the file reference from one directory (an on-disk data structure), and adding it to another. If something bad happens, then the operation is rolled back: the removal from the source directory is undone by an opposite insert operation. Most modern file systems have a built-in journaling mechanism, which ensures that the move operation is either carried out completely or rolled back completely, even in case the machine loses power in the midst of the operation.
All that being said, it still depends, and not all file systems provide these guarantees. See this study
If you are running over Windows, and the file system is local (not a network share), then you can use the Transacitonal File System (TxF) feature of Windows to ensure the atomicity of your move operation.

File move - How does the OS know whether to update a master file table or copy and delete?

After having read questions dealing with how to tell whether two files are on the same physical volume or not, and seeing that it's (almost) impossible (e.g. here), I'm wondering how the OS knows whether a file move operation should update a master file table (or its equivalent) or whether to copy and delete.
Does Windows delegate that to the drives somehow? (Or perhaps the OS does have information about every file, and it's just not accessible by programs? Unlikely.)
Or - Does Windows know only about certain types of drives (and copies and deletes in other cases)? In which case we could also assume the same. Which means allowing a file move without using a background thread, for example. (Because it will be near instantaneous.)
I'm trying to better understand this subject. If I'm making some basic incorrect assumption - please, correcting that in itself would be an answer.
If needed to limit the scope, let's concentrate on Windows 7 and up, and NTFS and FAT drives.
Of course the operating system knows which drive (and which partition on that drive) contains any particular local file; otherwise, how could it read the data? (For remote files, the operating system doesn't know about the drives, but it does know which server to contact. Moves between different servers are implemented as copy-and-delete; moves on the same server are either copy-and-delete or are delegated to that server, depending on the protocol in use.)
This information is also available to applications. You can use the GetFileInformationByHandle() function to obtain the serial number of the volume containing a particular file.
The OS does have information about every file, and it's just not as easily accessible to your program. Not in any portable way, that is.
See it this way: Those files are owned by the system. The system allocates the space, manages the volume and indexes. It's not going to copy and delete the file if it ends up in the same physical volume, as it is more efficient to move the file. It will only copy and delete if it needs to.
In C or C++ for Windows I first try to MoveFileEx without MOVEFILE_COPY_ALLOWED set. It will fail if the file can not be moved by renaming. If rename fails I know that it may take some time and show some progress bar or the like.
There are no such rename AFAIK in .NET and that System::IO::File::Move of .NET does not fail if you move between different volumes.
First, regarding Does Windows delegate that to the drives somehow. No. The OS is more like a central nervous system. It keeps track of whats going on centrally, and for its distributed assets (or devices) such as a drive. (internal or external)
It follows that the OS, has information about every file residing on a drive for which it has successfully enumerated. The most relevant part of the OS with respect to file access is the File System. There are several types. Knowledge of the following topics will help to understand issues surrounding file access:
1) File attribute settings
2) User Access Controls
3) File location (pdf) (related to User Access Controls)
4) Current state of file (i.e. is the file in use currently)
5) Access Control Lists
Regarding will be near instantaneous. This obviously is only a perception. No matter how fast, or seemingly simultaneous, file handling via standard programming libraries can be done in such a way as to be aware of file related errors, such as:
ENOMEM - insufficient memory.
EMFILE - FOPEN_MAX files open already.
EINVAL - filename is NULL or contains only whitespace.
EINVAL - invalid mode.
(these in relation to fopen) can be used to mitigate OS/file run-time issues. This being said, applications should always be written to comply with good programming methods to avoid bumping into OS related file access issues, thread safety included.

Detect changes in directory

I need to monitor a folder and its subdirectories for any file manipulations (add/remove/rename). I've read about FileSystemWatcher but I'd like to monitor changes between each time the program is run or when the user presses the "check for changes" button (FSW seems more orientated to runtime detection). My first thought was to iterate through all the (sub)directories and hash each file. Then, concatenate all the hashes (which have been ordered) and hash that. When I want to check for changes, I repeat the process and check if the hashes are the same.
Is this an efficient way of doing it?
Also, once I've detected a change, how do I find out what file has been added, removed or renamed as quickly as possible?
As a side note, I don't mind using scripts to do this if they're faster as long as those scripts don't require end users to install anything and the scripts can notify my C# app of the changes.
We handle this by storing all found files in a database along with their last modification time.
On each pass through the files, we check the database for each file: if it doesn't exist in the DB, it is new and if it does exist, but the timestamp is different, it has changed.
There is also an option to handle deleted files by marking all of the files in the database as ToBeDeleteed prior to the pass and clearing this if the file was found. Then, at the end of the process, we can just delete all of the records that are marked as ToBeDeleted.
Obviously you need to make "snapshots" of the directory tree and compare them as required. What exactly goes into the snapshots would depend on your requirements. Keep in mind that:
You need to store filenames in order to detect "new" and "deleted" files
File sizes and last-modified times are a good and cheap indicator that a file has or has not changed, but do not provide a guarantee
Hashing the contents of files can be prohibitively expensive if the files can be large, but it's the only way to know they have changed with a near-perfect degree of accuracy (remember that hashes can collide as well, so if you want mathematical 100% certainty that's not going to be good enough either)

Implement autosave feature of a XML

In order to give my application an autosave functionality, I'm looking at the best implementation that would optimise the 3 followings requirements:
safety: in order to reduce the risk of data corruption
user friendly: the user is not computer expert so the solution must be intuitive and friendly
quick to develop: I don't want to spend weeks over this implementation never
I have three solutions witch doesn't fit the 3 criteria and I'm looking for an alternative:
creating a simple shadow file so when the application crashes or the PC shutdown unexpectedly the application try to restore it
working the same way than above but storing several version of the file at different time in a temp folder
implement a true roll back system allowing the extend the undo/redo functionnality even the the application is restarted by keeping trace of the modification in a temp folder.
Does someone have anything to suggest?
For autosave, I'd simply have a background running thread that would run your Save() method silently (no popups) to a temp location (AppData system folder). You should probably keep a separate file for each session, so that you can always offer to return to a previous crashed session. On normal exit, you should delete the file to indicate the session has completed successfully.
I'd even keep 2 files for every session an alternate saving to each, so that if a crash happens during an autosave, it won't corrupt the previous autosave.

Detecting moved files using FileSystemWatcher

I realise that FileSystemWatcher does not provide a Move event, instead it will generate a separate Delete and Create events for the same file. (The FilesystemWatcher is watching both the source and destination folders).
However how do we differentiate between a true file move and some random creation of a file that happens to have the same name as a file that was recently deleted?
Some sort of property of the FileSystemEventArgs class such as "AssociatedDeleteFile" that is assigned the deleted file path if it is the result of a move, or NULL otherwise, would be great. But of course this doesn't exist.
I also understand that the FileSystemWatcher is operating at the basic Filesystem level and so the concept of a "Move" may be only meaningful to higher level applications. But if this is the case, what sort of algorithm would people recommend to handle this situation in my application?
Update based on feedback:
The FileSystemWatcher class seems to see moving a file as simply 2 distinct events, a Delete of the original file, followed by a Create at the new location.
Unfortunately there is no "link" provided between these events, so it is not obvious how to differentiate between a file move and a normal Delete or Create. At the OS level, a move is treated specially, you can move say a 1GB file almost instantaneously.
A couple of answers suggested using a hash on files to identify them reliably between events, and I will proably take this approach. But if anyone knows how to detect a move more simply, please leave an answer.
According to the docs:
Common file system operations might
raise more than one event. For
example, when a file is moved from one
directory to another, several
OnChanged and some OnCreated and
OnDeleted events might be raised.
Moving a file is a complex operation
that consists of multiple simple
operations, therefore raising multiple
events.
So if you're trying to be very careful about detecting moves, and having the same path is not good enough, you will have to use some sort of heuristic. For example, create a "fingerprint" using file name, size, last modified time, etc for files in the source folder. When you see any event that may signal a move, check the "fingerprint" against the new file.
As far as I understand it, the Renamed event is for files being moved...?
My mistake - the docs specifically say that only files inside a moved folder are considered "renamed" in a cut-and-paste operation:
The operating system and FileSystemWatcher object interpret a cut-and-paste action or a move action as a rename action for a folder and its contents. If you cut and paste a folder with files into a folder being watched, the FileSystemWatcher object reports only the folder as new, but not its contents because they are essentially only renamed.
It also says about moving files:
Common file system operations might raise more than one event. For example, when a file is moved from one directory to another, several OnChanged and some OnCreated and OnDeleted events might be raised. Moving a file is a complex operation that consists of multiple simple operations, therefore raising multiple events.
As you already mentioned, there is no reliable way to do this with the default FileSystemWatcher class provided by C#. You can apply certain heuristics like filename, hashes, or unique file ids to map created and deleted events together, but none of these approaches will work reliably. In addition, you cannot easily get the hash or file id for the file associated with the deleted event, meaning that you have to maintain these values in some sort of database.
I think the only reliable approach for detecting file movements is to create an own file system watcher. Therefore, you can use different approaches. If you are only going to watch changes on NTFS file systems, one solution might be to read out the NTFS change journal as described here. What's nice about this is that it even allows you to track changes that occurred while your app wasn't running.
Another approach is to create a minifilter driver that tracks file system operations and forwards them to your application. Using this you basically get all information about what is happening to your files and you'll be able to get information about moved files. A drawback of this approach is that you have to create a separate driver that needs to be installed on the target system. The good thing however is that you wouldn't need to start from scratch, because I already started to create something like this: https://github.com/CenterDevice/MiniFSWatcher
This allows you to simply track moved files like this:
var eventWatcher = new EventWatcher();
eventWatcher.OnRenameOrMove += (filename, oldFilename, process) =>
{
Console.WriteLine("File " + oldFilename + " has been moved to " + filename + " by process " + process );
};
eventWatcher.Connect();
eventWatcher.WatchPath("C:\\Users\\MyUser\\*");
However, please be aware that this requires kernel code that needs to be signed in order run on 64bit version of Windows (if you don't disable signature checking for testing). At time of writing, this code is also still in an early stage of development, so I would not use it on production systems yet. But even if you're not going to use this, it should still give you some information about how file system events might be tracked on Windows.
I'll hazard a guess 'move' indeed does not exist, so you're really just going to have to look for a 'delete' and then mark that file as one that could be 'possibly moved', and then if you see a 'create' for it shortly after, I suppose you can assume you're correct.
Do you have a case of random file creations affecting your detection of moves?
Might want to try the OnChanged and/or OnRenamed events mentioned in the documentation.
StorageLibrary class can track moves. The example from Microsoft:
StorageLibrary videosLib = await StorageLibrary.GetLibraryAsync(KnownLibraryId.Videos);
StorageLibraryChangeTracker videoTracker = videosLib.ChangeTracker;
videoTracker.Enable();
A complete example could be found here.
However, it looks like you can only track changes inside Windows "known libraries".
You can also try to get StorageLibraryChangeTracker using StorageFolder.TryGetChangeTracker(). But your folder must be under sync root, you can not use this method to get an arbitrary folder in file system.

Categories