Best way to use SFTP folder as concurrent work queue

Best way to use SFTP folder as concurrent work queue - c#

I am writing a c# windows service which will be polling an SFTP folder for new files (one file = one job) and processing them. Multiple instances of the service may be running at the same time, so it is important that they do not step on each other.
I realize that an SFTP folder does not make an ideal queue, but that's what I have to work with. What do I need to do to either use this SFTP folder as a concurrent message queue, or safely represent it in a way that can be used concurrently?

Seems like your biggest problem would be dealing with multiple instances of the program stepping on each other and processing the same files.
The way I've handled this in the past is to have the program grab the first file and immediately rename it from say 'filename.txt' to 'filename.txt.processing'. The processes would be set up to ignore any file ending in '.processing' so that they don't step on each other. I don't think a file rename is perfectly atomic, but I've never had any problems with it.

Multiple instances of the service may
be running at the same time
On the same machine, or different ones?

Not sure if moving a file in Windows is an atomic operation.
If it is, then when a service chooses to work on a file, it should attempt to move the file to another folder.
If the move operation is successful, then it is safe to work on the file.

You could also leveragea datasbase to keep track of which files are being processed, have been processed or are awaiting processing.
This adds the complicatio of updating the table with new files.

Related

Best file mutex in .NET 3.5

I want to use some mutex on files, so any process won't touch certain files before other stop using them. How can I do it in .NET 3.5? Here are some details:
I have some service, which checks every period of time if there are any files/directories in certain folder and if there are, service's doing something with it.
My other process is responsible for moving files (and directories) into certain folder and everything works just fine.
But I'm worrying because there can be situation, when my copying process will copy the files to certain folder and in the same time (in the same milisecond) my service will check if there are some files, and will do something with them (but not with all of them, because it checked during the copying).
So my idea is to put some mutex in there (maybe one extra file can be used as a mutex?), so service won't check anything until copying is done.
How can I achieve something like that in possibly easy way?
Thanks for any help.

The canonical way to achieve this is the filename:
Process A copies the files to e.g. "somefile.ext.noprocess" (this is non-atomic)
Process B ignores all files with the ".noprocess" suffix
After Process B has finished copying, it renames the file to "somefile.ext"
Next time Process B checks, it sees the file and starts processing.
If you have more than one file, that have to be processd together (or none), you need to adapt this scheme to an additional transaction file containing the file names for the transaction: Only if this file exists and has the correct name, must process B read it and process the files mentioned in it.

Your problem really is not of mutual exclusion, but of atomicity. Copying multiple files is not an atomic operation, and so it is possible to observe the files in a half-copied state which you'd like to prevent.
To solve your problem, you could hinge your entire operation on a single atomic file system operation, for example renaming (or moving) of a folder. That way no one can observe an intermediate state. You can do it as follows:
Copy the files to a folder outside the monitored folder, but on the same drive.
When the copying operation is complete, move the folder inside the monitored folder. To any outside process, all the files would appear at once, and it would have no chance to see only part of the files.

More appropriate for my task: background worker or thread pool?

I have a simple web application module which basically accepts requests to save a zip file on PageLoad from a mobile client app.
Now, What I want to do is to unzip the file and read the file inside it and process it further..including making entries into a database.
Update: the zip file and its contents will be fairly smaller in size so the server shouldn't be burdened with much load.
Update 2: I just read about when IIS queues requests (at global/app level). So does that mean that I don't need to implement complex request handling mechanism and the IIS can take care of the app by itself?
Update 3: I am looking for offloading the processing of the downloaded zip not only for the sake of minimizing the overhead (in terms of performance) but also in order to avoid the problem of table-locking when the file is processed and records updated into the same table. In the scenario of multiple devices requesting the page and the background task processing database updateing in parallel would cause an exception.
As of now I have zeroed on two solutions:
To implement a concurrent/message queue
To implement the file processing code into a separate tool and schedule a job on the server to check for non-processed file(s) and process them serially.
Inclined towards a Queuing Mechanism I will try to implement is as it seems less dependent on config. v/s manually configuring the job/schedule at the server side.
So, what do you guys recommend me for this purpose?
Moreover after the zip file is requested and saved on server side, the client & server side connection is released after doing so. Not looking to burden my IIS.
Imagine a couple of hundred clients simultaneously requesting the page..
I actually haven't used neither of them before so any samples or how-to's will be more appreciated.

I'd recommend TPL and Rx Extensions: you make your unzipped file list an observable collection and for each item start a new task asynchronously.

I'd suggest a queue system.
When you received a file you'll save the path into a thread-synchronized queue. Meanwhile a background worker (or preferably another machine) will check this queue for new files and dequeue the entry to handle it.
This way you won't launch an unknown amount of threads (every zip file) and can handle the zip files in one location. This way you can also easier move your zip-handling code to another machine when the load gets too heavy. You just need to access a common queue.
The easiest would probably be to use a static Queue with a lock-object. It is the easiest to implement and does not require external resources. But this will result in the queue being lost when your application recycles.
You mentioned losing zip files was not an option, then this approach is not the best if you don't want to rely on external resources. Depending on your load it may be worth to utilize external resources - meaning upload the zip file to a common storage on another machine and add a message to an queue on another machine.
Here's an example with a local queue:
ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
void GotNewZip(string pathToZip)
{
queue.Enqueue(pathToZip); // Added a new work item to the queue
}
void MethodCalledByWorker()
{
while (true)
{
if (queue.IsEmpty)
{
// Supposedly no work to be done, wait a few seconds and check again (new iteration)
Thread.Sleep(TimeSpan.FromSeconds(5));
continue;
}
string pathToZip;
if (queue.TryDequeue(out pathToZip)) // If TryDeqeue returns false, another thread dequeue the last element already
{
HandleZipFile(pathToZip);
}
}
}
This is a very rough example. Whenever a zip arrives, you add the path to the queue. Meanwhile a background worker (or multiple, the example s threadsafe) will handle one zip after another, getting the paths from the queue. The zip files will be handled in the order they arrive.
You need to make sure that your application does not recycle meanwhile. But that's the case with all resources you have on the local machine, they'll be lost when your machine crashes.

I believe you are optimising prematurely.
You mentioned table-locking - what kind of db are you using? If you add new rows or update existing ones most modern databases in most configurations will:
use row-level locking; and
be fast enough without you needing to worry about
locking.
I suggest starting with a simple method
//Unzip
//Do work
//Save results to database
and get some proof it's too slow.

Rename file extensions in a sequential order

Windows Service - C# - VS2010
I have multiple instances of a FileWatcher Service. Each one looks for a different extension in the directory. I have a separate Router service that monitors the directory for zip files and renames the extensions to one of the values that the services look at.
Example:
Directory in question (all FileWatcher Services monitor this directory) contains the following files:
a.zip, b.zip, c.zip
FileWatcher1 looks for extensions of *.000, FileWatcher2 looks for extensions of *.001, FileWatcher3 looks for extensions of *.002
The Router will see the .zip files and change the file extensions on the zip files, but it should keep in sequence in order to delegate the same amount of work to each FileWatcher.
Also, if there are two zip files dropped, it would change a.zip -> a.000, and b.zip -> b.001, but if 5 minutes go by and another batch of zip files are dropped, it should know to rename the next file to *.002.
I have everything working fine, but now I need to implement the sequential part to the Router and am not sure the best way of implementation (currently router is changing every extension to *.000 thus only one FileWatcher is getting the work). I know this might be considered a cheap way of doing this but it's all we really need at the moment. Any help would be appreciated.

Maybe a different way of looking at it. Have you thought about having a single watcher and then using a thread pool? The reason why I am suggesting this is that you will have to start looking at the sizes and complexities of the fields to adequately distribute the work. You might start pushing more work to .000 because it's next in line when it is still busy processing a large amount of data from the first job whereas .001 could be free as it was processing a small file.
If you really want to get around the problem of the next extension in line, why not just keep a static variable with the next extension number. I am not 100% sure if the Router Filewatcher will run multiple threads when it sees new files one after the other but I don't think so. If that does happen then you will need to put some thread safety code when accessing the static variable.

Can the Router just keep a counter and do a mod 3 (or N, where N is the number of watchers) operation for every new file?

Queue file operations for later when file is locked

I am trying to implement file based autoincrement identity value (at int value stored in TXT file) and I am trying to come up with the best way to handle concurrency issues. This identity will be used for unique ID for my content. When saving new content this file gets opened, the value gets read, incremented, new content is saved and the incremented value is written back to the file (whether we store the next available ID or the last issued one doesn't really matter). While this is being done another process might come along and try to save new content. The previous process opens the file with FileShare.None so no other process will be able to read the file until it is released by the first process. While the odds of this happening are minimal it could still happen.
Now when this does happen we have two options:
wait for the file to become available -
Emulate waiting on File.Open in C# when file is locked
we are talking about miliseconds here, so I guess this wouldn't be an issue as long as something strange happens and file never becomes available, then this solution would result in an infinite loop, so not an ideal solution
implement some sort of a queue and run all operations on files within a queue. My user experience requirements are such that at the time of saving/modifying files user should never be informed about exceptions or that something went wrong - he would get informed about them through a very friendly user interface later when operations would fail on the queue too.
At the moment of writing this, the solution should work within ASP.NET MVC application (both synchronously and async thru AJAX) but, if possible, it should use the concepts that could also work in Silverlight or Windows Forms or WPF application.
With regards to those two options which one do you think is better and for the second option what are possible technologies to implement this?

The ReaderWriterLockSlim class seems like a good solution for synchronizing access to the shared resource.

Synchronize writing to a file at file-system level

I have a text file and multiple threads/processes will write to it (it's a log file).
The file gets corrupted sometimes because of concurrent writings.
I want to use a file writing mode from all of threads which is sequential at file-system level itself.
I know it's possible to use locks (mutex for multiple processes) and synchronize writing to this file but I prefer to open the file in the correct mode and leave the task to System.IO.
Is it possible ? what's the best practice for this scenario ?

Your best bet is just to use locks/mutexex. It's a simple approach, it works and you can easily understand it and reason about it.
When it comes to synchronization it often pays to start with the simplest solution that could work and only try to refine if you hit problems.

To my knowledge, Windows doesn't have what you're looking for. There is no file handle object that does automatic synchronization by blocking all other users while one is writing to the file.
If your logging involves the three steps, open file, write, close file, then you can have your threads try to open the file in exclusive mode (FileShare.None), catch the exception if unable to open, and then try again until success. I've found that tedious at best.
In my programs that log from multiple threads, I created a TextWriter descendant that is essentially a queue. Threads call the Write or WriteLine methods on that object, which formats the output and places it into a queue (using a BlockingCollection). A separate logging thread services that queue--pulling things from it and writing them to the log file. This has a few benefits:
Threads don't have to wait on each other in order to log
Only one thread is writing to the file
It's trivial to rotate logs (i.e. start a new log file every hour, etc.)
There's zero chance of an error because I forgot to do the locking on some thread
Doing this across processes would be a lot more difficult. I've never even considered trying to share a log file across processes. Were I to need that, I would create a separate application (a logging service). That application would do the actual writes, with the other applications passing the strings to be written. Again, that ensures that I can't screw things up, and my code remains simple (i.e. no explicit locking code in the clients).

you might be able to use File.Open() with a FileShare value set to None, and make each thread wait if it can't get access to the file.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.