directory monitoring - c#

What is the best way for me to check for new files added to a directory, I dont think the filesystemwatcher would be suitable as this is not an always on service but a method that runs when my program starts up.
there are over 20,000 files in the folder structure I am monitoring, at present I am checking each file individually to see if the filepath is in my database table, however this is taking around ten minutes and I would like to speed it up is possible,
I can store the date the folder was last checked - is it easy to get all files with createddate > last checked date.
anyone got any Ideas?
Thanks
Mark

Your approach is the only feasible (i.e. file system watcher allows you to see changes, not check on start).
Find out what takes so long. 20.000 checks should not take 10 minutes - maybe 1 maximum. Your program is written slowly. How do you test it?
Hint: do not ask the database, get a list of all files into memory, a list of all filesi n the database, check in memory. 20.000 SQL statements to the database are too slow, this way you need ONE to get the list.

10 minutes seems awfully long for 20,000 files. How are you going about doing the comparison? Your suggestion doesn't account for deleted files either. If you want to remove those from the database, you will have to do a full comparison.
Perhaps the problem is the database round trips. You can retrieve a known file list from the database in large chunks (or all at once), sorted alphabetically. Sort the local file list as well and walk the two lists, processing missing or new entries as you go along.

FileSystemWatcher is not reliable, so even if you could use a service, it would not necessarily work for you.
The two options I can see are:
Keep a list of files you know about and keep comparing to this list. This will allow you to see if files were added, deleted etc. Keep this list in memory, instead of querying the database for each file.
As you suggest, store a timestamp and compare to that.

You can write in somewhere the last timestamp that onfile was created, it is simple and can work for you.

Can you write a service that runs on that machine? The service can then use FileSystemWtcher

Having a FileSystemWatcher service like Kevin Jones suggests is probably the most pragmatic answer, but there are some other options.
You can watch the directory with inotify if you mount it with Samba on a linux box. That of course assumes you don't mind fragmenting your platform, but that's what inotify is there for.
And then more correctly but with correspondingly less chance of you getting a go-ahead, if you're sitting monitoring a directory with 20K files in it it is probably time to evolve your system architecture. Not knowing all that much more about your application, it sounds like a message queue might be worth looking at.

Related

Application architecture with data on a shared network, without a database on the server

I'm currently working on a C# project of an application we'd like to develop. We're brainstorming over the question of sharing the data between users. We'd like to be able to specify a folder where all the files of the application are going to be saved and we'd like to be able to save them on a shared folder (server, different PC or Mac, Nas, etc.).
The deployment would be like so :
Installation on the first PC, we choose a network drive, share, whatever and create all the files for the application in this location.
On the second PC we install the application and we choose the same location (on the network), the application doesn't create anything, it sees that it's already existing and it uses these files as the application's data
Same thing on the other clients
The application's files are going to be documents (most likely XML formatted documents) and when opening the application we want to show all the existing documents. The thing is, we don't only want to have the list of documents and be able to edit their content, we also would like to be able to edit the document's property, so in a way we'd like a file (Sqlite, XML, whatever) representing the list of all the documents and their attributes. Same thing for a list of addresses.
I know all that looks exactly like a client / server with database solution, but this solution is out of the question. I was first looking at SQLite for my data files, but I know concurrency can be a real problem and file lock doesn't work well. The thing is, I would have the same problem with simple XML files (refreshing the content when several users are working, accessing locked files).
So I guess my final question is : Is it feasable? Is there an alternative I didn't see which would allow us to do that more easily?
EDIT :
OK I'm not responding to every post or comment, because I'm currently testing concurrency with SQLite. What I did, and please correct me if the way I test this is wrong, is launch X BackgroundWorker which are all going to insert record in a sample database (which is recreated everytime I start the application). I tried launching 100 iterations of INSERT in the database via these backgroundWorkers.
Of course concurrency is working with one application running, it's simply waiting for the last BackgroundWorker to do it's job and then writing the next record. I also tried inserting at (almost) the same time, meaning I put a loop in every BackgroundWorker waiting for a modulo 5 timestamp (every 5 seconds, every BackgroundWorker runs). Again, it's waiting for the previous insert query to end before doing the next and everything's working fine. I even tried it with 500 BackgroundWorkers and it worked fine.
I then tried launching my app several times and running them simultaneously. When doing this I did have some issue. With two instances of my app it was still working fine, but when trying this with 4-5 instances, it got really buggy and I got two types of error : 1. database is locked 2. disk I/O failure. But mostyle locked databases.
What I did was pretty intensive, in the scenario of my application, it will never ever come to 5 processes trying to simultaneously insert 500 hunded rows at the same time (maybe I'll get a concurrency of two or three connections). But what really bugged me and what makes me think my testing method is not really a good one, is that I got these errors trying to work on a database on a shared network, on a NAS AND on my own HDD. Everytime it worked for maybe 30-40 queries then throwing me "database is locked" error.
Am I testing it wrong? Maybe I shouldn't be trying so hard to make this work, but I'm still not convinced that SQLite is not a good alternative to what I'm trying to do, since the concurrency is going to be really small.
With your optimistic/pessimistic locking, you are ultimately trying to build a database. Also, you WILL have issues with consistency while trying to keep multiple files in sync with each other. Think about if you update the "metadata" file, and the write fails half-way through because of a network blip. File corruption will ensue, and you will be left trying to reconstruct things from backups.
I would suggest a couple of likely solutions:
1) Host the content yourselves, and let them be pure clients (cloud based deployments are ideal for this). Most network/firewall issues can be circumvented by using HTTP as your transport (web services).
2) Have one of the workstations be the "server", which keeps it data files on the NFS. This will give you transactional integrity, incremental backups, etc. There are lots of good embedded database managements systems to help you manage this complexity. MS SQL Server even has some great options for this.
You right, Sqlite uses file locks on database file, so storing all data files in database would bring write-starvation problem for editing your documents.
May be it's better choice to implement simple optimistic/pessimistic locking by yourself on particular-file level? For example, in case of using pessimistic lock you just don't allow anyone to edit particular file, if somebody already in process of editing it. In this case you will hold lock just on one file, but not on the entire database. If possibility of conflict(editing particular file at the same time) is pretty low, it is better to go with optimistic locking.
Simple optimistic locking implementation:
When user get file for reading - it's OK, no problem here. If user get file for editing, you could calculate hash for this file(or get timestamp of last updated time of the file), and then, when user tries to save edited file, compare current(at the moment of saving) hash/timestamp to make sure that file has not been changed by somebody else. If file has not been changed then it's ok to save it. IF file has been changed, then current user is out of luck, you need to inform him about it. This optimistic scenario is nice when possibility of this "out of luck" is pretty low. Otherwise it's better to stick with pessimistic locking, when you do not allow user even to start file editing if somebody else is doing it.

How can System.IO.FileSystemInfo.Refresh be used

There is an impressive lack of examples of the usage of Refresh.
I'm using the following method, which gets an inaccurate time
ViewBag.t1 = System.IO.File.GetLastAccessTime(#"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
I read that it's inaccurate because the OS hasn't performed a check and updated the files read/write times.
I've tried
System.IO.FileSystemInfo.Refresh(#"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
But this does not work and I can't locate a resource giving similar examples of its usage.
FileSystemInfo.Refresh is not a static method. What you have shown for your example does not compile. You should create a FileInfo object initialized with the file name and then you can call Refresh on that. You should then be able to use the properties of the FileInfo object to get the last access time and other pertinent file details.
var info = new FileInfo(#"C:\Temp\a.txt");
info.Refresh(#"C:\BillingExport\BILLING_TABLE_FILE01_1.txt");
var lastAccess = info.LastAccessTime;
One last edit based on an answer at the above linked possible duplicate and CodeCaster's answer:
http://blogs.technet.com/b/filecab/archive/2006/11/07/disabling-last-access-time-in-windows-vista-to-improve-ntfs-performance.aspx
Indicates that in Vista this was disabled by default. I just checked the registry in my Win 8.1 box and sure enough, the registry key is there and Last Access update is disabled by default. So, if you are on Vista or above the above code won't really work. If you are on XP than you should be golden!
FileSystemInfo is the abstract base class for FileInfo and DirectoryInfo. Which cache the properties of a file/directory. If you keep, say, a FileInfo object around and keep testing its Exists property then it gets to be important that you call Refresh().
Which has nothing to do with File.GetLastAccessTime(). The classes are entirely unrelated, the File class does no caching and always retrieves the last access time from the file system.
Which is unreliable if the file is opened by any program. The file system is just not in a hurry to update these attributes when a program is actively accessing the file. That's way too expensive, that can easily cost many dozens of milliseconds to send the disk drive write head to the MFT sector that stores these values. A program can access a file much faster than that. Documented in this MSDN article:
Not all file systems can record creation and last access times, and not all file systems record them in the same manner. For example, the resolution of create time on FAT is 10 milliseconds, while write time has a resolution of 2 seconds and access time has a resolution of 1 day, so it is really the access date. The NTFS file system delays updates to the last access time for a file by up to 1 hour after the last access.
Most relevant phrase bolded, what you see is pretty much expected. You'll need to look for a different approach.

Uniquely identify file on Windows

I need to uniquely identify a file on Windows so I can always have a reference for that file even if it's moved or renamed. I did some research and found the question Unique file identifier in windows with a way that uses the method GetFileInformationByHandle with C++, but apparently that only works for NTFS partitions, but not for the FAT ones.
I need to program a behavior like the one on DropBox: if you close it on your computer, rename a file and open it again it detects that change and syncs correctly. I wonder whats the technique and maybe how DropBox does if you guys know.
FileSystemWatcher for example would work, but If the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,
The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
In Dropbox' case I think though, that there is a service or process running in background observing file system changes. It's the most reliable way, even if it ceases to work if you stop said service/process.
What the user was looking for was most likely Windows Change Journals. Those track changes like renames of files persistently, no need to have a watcher observing file system events running all the time. Instead, one simply needs to maintain when last looked at the log and continue looking again beginning at that point. At some point a file with an already known ID would have an event of type RENAME and whoever is interested in that event could do the same for its own version of that file. The important thing is to keep track of the used IDs for files of course.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals

Rename file extensions in a sequential order

Windows Service - C# - VS2010
I have multiple instances of a FileWatcher Service. Each one looks for a different extension in the directory. I have a separate Router service that monitors the directory for zip files and renames the extensions to one of the values that the services look at.
Example:
Directory in question (all FileWatcher Services monitor this directory) contains the following files:
a.zip, b.zip, c.zip
FileWatcher1 looks for extensions of *.000, FileWatcher2 looks for extensions of *.001, FileWatcher3 looks for extensions of *.002
The Router will see the .zip files and change the file extensions on the zip files, but it should keep in sequence in order to delegate the same amount of work to each FileWatcher.
Also, if there are two zip files dropped, it would change a.zip -> a.000, and b.zip -> b.001, but if 5 minutes go by and another batch of zip files are dropped, it should know to rename the next file to *.002.
I have everything working fine, but now I need to implement the sequential part to the Router and am not sure the best way of implementation (currently router is changing every extension to *.000 thus only one FileWatcher is getting the work). I know this might be considered a cheap way of doing this but it's all we really need at the moment. Any help would be appreciated.
Maybe a different way of looking at it. Have you thought about having a single watcher and then using a thread pool? The reason why I am suggesting this is that you will have to start looking at the sizes and complexities of the fields to adequately distribute the work. You might start pushing more work to .000 because it's next in line when it is still busy processing a large amount of data from the first job whereas .001 could be free as it was processing a small file.
If you really want to get around the problem of the next extension in line, why not just keep a static variable with the next extension number. I am not 100% sure if the Router Filewatcher will run multiple threads when it sees new files one after the other but I don't think so. If that does happen then you will need to put some thread safety code when accessing the static variable.
Can the Router just keep a counter and do a mod 3 (or N, where N is the number of watchers) operation for every new file?

Detect changes in directory

I need to monitor a folder and its subdirectories for any file manipulations (add/remove/rename). I've read about FileSystemWatcher but I'd like to monitor changes between each time the program is run or when the user presses the "check for changes" button (FSW seems more orientated to runtime detection). My first thought was to iterate through all the (sub)directories and hash each file. Then, concatenate all the hashes (which have been ordered) and hash that. When I want to check for changes, I repeat the process and check if the hashes are the same.
Is this an efficient way of doing it?
Also, once I've detected a change, how do I find out what file has been added, removed or renamed as quickly as possible?
As a side note, I don't mind using scripts to do this if they're faster as long as those scripts don't require end users to install anything and the scripts can notify my C# app of the changes.
We handle this by storing all found files in a database along with their last modification time.
On each pass through the files, we check the database for each file: if it doesn't exist in the DB, it is new and if it does exist, but the timestamp is different, it has changed.
There is also an option to handle deleted files by marking all of the files in the database as ToBeDeleteed prior to the pass and clearing this if the file was found. Then, at the end of the process, we can just delete all of the records that are marked as ToBeDeleted.
Obviously you need to make "snapshots" of the directory tree and compare them as required. What exactly goes into the snapshots would depend on your requirements. Keep in mind that:
You need to store filenames in order to detect "new" and "deleted" files
File sizes and last-modified times are a good and cheap indicator that a file has or has not changed, but do not provide a guarantee
Hashing the contents of files can be prohibitively expensive if the files can be large, but it's the only way to know they have changed with a near-perfect degree of accuracy (remember that hashes can collide as well, so if you want mathematical 100% certainty that's not going to be good enough either)

Categories