I want to implement a Windows service that captures flat delimited files dropped into a folder and imports them into a database. What I originally envisioned was a FileSystemWatcher watching for new files and creating a new thread for each import.
I would like to know how to properly implement an algorithm for this and what technique to use. Am I going in the right direction?
I developed a product like this for a customer. The service monitored a number of folders for new files, and when files were discovered they were read, processed (printed on barcode printers), archived, and deleted.
We used a "discoverer" layer that discovered files using FileSystemWatcher or polling, depending on the environment (FileSystemWatcher is not reliable when monitoring e.g. Samba shares), a "file reader" layer, and a "processor" layer.
The "discoverer" layer discovered files and put the filenames in a list that the "file reader" layer processed. The "discoverer" layer signaled that there were new files to process by setting an event that the "file reader" layer was waiting on.
The "file reader" layer then read the files (using retry logic, since you may get notifications for new files before they have been completely written by the process that creates them).
After the "file reader" layer had read a file, a new "processor" work item was queued via ThreadPool.QueueUserWorkItem to process the file contents.
When a file had been processed, the original file was copied to an archive and deleted from the original location. The archive was cleaned up regularly to keep it from flooding the server, and it was great for troubleshooting.
This has now been used in production in a number of different environments for over two years and has proved to be very reliable.
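A minimal sketch of that layering; the class, names, and retry policy here are illustrative, not the actual product code:

using System.Collections.Generic;
using System.IO;
using System.Threading;

public class FileReaderLayer
{
    private readonly Queue<string> _pending = new Queue<string>();
    private readonly AutoResetEvent _newFiles = new AutoResetEvent(false);
    private readonly object _sync = new object();

    // Called by the "discoverer" layer (FileSystemWatcher or polling).
    public void EnqueueFile(string path)
    {
        lock (_sync) _pending.Enqueue(path);
        _newFiles.Set(); // signal the "file reader" thread
    }

    // Runs on the "file reader" thread.
    public void ReadLoop()
    {
        while (true)
        {
            _newFiles.WaitOne(); // wait for the discoverer's signal
            string path;
            while (TryDequeue(out path))
            {
                string contents = ReadWithRetry(path);
                // Hand the contents to a "processor" thread.
                ThreadPool.QueueUserWorkItem(state => ProcessContents((string)state), contents);
            }
        }
    }

    private bool TryDequeue(out string path)
    {
        lock (_sync)
        {
            if (_pending.Count > 0) { path = _pending.Dequeue(); return true; }
            path = null;
            return false;
        }
    }

    // Retry because the creating process may still have the file open.
    private static string ReadWithRetry(string path)
    {
        for (int attempt = 0; ; attempt++)
        {
            try { return File.ReadAllText(path); }
            catch (IOException)
            {
                if (attempt >= 10) throw;
                Thread.Sleep(500);
            }
        }
    }

    private static void ProcessContents(string contents)
    {
        // Print, archive, delete, etc.
    }
}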
I've fielded a service that does this as well. I poll via a timer whose elapsed event handler acts as a supervisor, adding new files to a queue and launching a configurable number of threads that consume the queue. Once the files are processed, it restarts the timer.
Each thread, including the timer's event handler, traps and reports all exceptions. The service is always running, and I use a separate UI app to tell the service to start and stop the timer. This approach has been rock solid, and the service has never crashed in several years of processing.
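For illustration, a rough sketch of that supervisor shape; the folder path, polling interval, and worker count are all assumptions:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

public class SupervisorService
{
    private readonly System.Timers.Timer _timer = new System.Timers.Timer(5000) { AutoReset = false };
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
    private const int WorkerCount = 4; // configurable

    public void StartTimer() { _timer.Elapsed += Supervise; _timer.Start(); }
    public void StopTimer() { _timer.Stop(); }

    // The elapsed event handler acts as the supervisor.
    private void Supervise(object sender, System.Timers.ElapsedEventArgs e)
    {
        try
        {
            foreach (var file in Directory.GetFiles(@"C:\Drop")) // hypothetical folder
                _queue.Enqueue(file);

            var workers = new Thread[WorkerCount];
            for (int i = 0; i < WorkerCount; i++)
                (workers[i] = new Thread(ConsumeQueue)).Start();
            foreach (var worker in workers)
                worker.Join();
        }
        catch (Exception) { /* trap and report; never let an exception escape */ }
        finally
        {
            _timer.Start(); // restart the timer only once the files are processed
        }
    }

    private void ConsumeQueue()
    {
        string file;
        while (_queue.TryDequeue(out file))
        {
            try { /* process the file */ }
            catch (Exception) { /* trap and report per thread */ }
        }
    }
}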
The traditional approach is to create a finite set of threads (could be as few as 1) and have them watch a blocking queue. The code in the FileSystemWatcher1 event handlers will enqueue work items while the worker thread(s) dequeue and process them. It might look like the following, which uses the BlockingCollection class, available in .NET 4.0 or as part of the Reactive Extensions download.
Note: The code is left short and concise for brevity. You will have to expand and harden it yourself.
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

public class Example
{
    private BlockingCollection<string> m_Queue = new BlockingCollection<string>();

    public Example()
    {
        var thread = new Thread(Process);
        thread.IsBackground = true;
        thread.Start();
    }

    private void FileSystemWatcher_Event(object sender, FileSystemEventArgs args)
    {
        // Enqueue the file path for the worker thread.
        m_Queue.Add(args.FullPath);
    }

    private void Process()
    {
        while (true)
        {
            // Take blocks until an item is available.
            string file = m_Queue.Take();
            // Process the file here.
        }
    }
}
You could take advantage of the Task class in the TPL for a more modern and ThreadPool-like approach. You would start a new task for each file (or perhaps batch them) that needs to be processed. The only gotcha I see with this approach is that it would be harder to control the number of database connections being opened simultaneously. It's definitely not a showstopper, and it might be of no concern.
1The FileSystemWatcher has been known to be a little flaky, so it is often advised to use a secondary method of discovering file changes in case they get missed by the FileSystemWatcher. Your mileage may vary on this issue.
Creating a thread per message will most likely be too expensive. If you can use .NET 4, you could start a Task for each message. That would run the code on a thread pool thread and thus reduce the overhead of creating threads.
You could also do something similar with asynchronous delegates if .NET 4 is not an option. However, the code gets a bit more complicated in that case. That would utilize the thread pool as well and save you the overhead of creating a new thread for each message.
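One-line sketches of both options; ProcessFile and path stand in for your own method and data:

// .NET 4: run each message on a thread pool thread via a Task.
Task.Factory.StartNew(() => ProcessFile(path));

// Pre-.NET 4: an asynchronous delegate also runs on the thread pool.
Action<string> handler = ProcessFile;
handler.BeginInvoke(path, ar => handler.EndInvoke(ar), null);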
I have multiple BackgroundWorkers that all want to write to log.txt, which results in the exception "The process cannot access the file 'C:\...\log.txt' because it is being used by another process." I know it's a long shot, but would it help if I used WriteAsync() instead, or would it have no effect at all?
(If there isn't a simple solution, I guess I have to implement the mutex object I've seen before.)
public static void WriteToLog(string text, bool append = true)
{
    try
    {
        using (var writer = new StreamWriter("log.txt", append))
        {
            writer.Write(text);
            // writer.WriteAsync(text); // Would this 'queue up' instead of trying to access the same file at the same time?
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"ERROR! Error in the log! {ex.Message}. {ex.StackTrace}");
    }
}
To actually answer your question: no, it won't save you from the locking issue. async is not a magic keyword that synchronizes all threads; on the contrary, it might even start its own thread, depending on the synchronization context.
If you were on a single-threaded model, then yes, the calls would queue up, since the synchronization context would have only one thread to work with and would have to queue all async calls with a context switch. But if you were on a single-threaded model, you wouldn't have this problem in the first place.
You can solve the problem in several ways.
Use a locking mechanism to synchronize access to the shared resource. One good option for this scenario is ReaderWriterLockSlim (see the sketch below).
Use a logging framework (there are a lot of good, very reliable libraries).
Personally, I would prefer going with a logging framework, as it has many features you will find useful (rolling file appenders, DB loggers, etc.) and offers a clean solution for logging with zero hacks and maintenance.
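A minimal sketch of the locking option, wrapping the question's WriteToLog in a ReaderWriterLockSlim write lock (since every caller here is a writer, a plain lock would serve equally well):

private static readonly ReaderWriterLockSlim LogLock = new ReaderWriterLockSlim();

public static void WriteToLog(string text, bool append = true)
{
    LogLock.EnterWriteLock(); // only one thread may hold the write lock at a time
    try
    {
        using (var writer = new StreamWriter("log.txt", append))
        {
            writer.Write(text);
        }
    }
    finally
    {
        LogLock.ExitWriteLock();
    }
}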
While using a logging framework is the best solution, to specifically address the issue...
The append mode requires the file to be locked, and when the lock can't be obtained you get the error you're receiving. You could synchronize all the threads, but then you'd be blocking them for a time. Using WriteAsync does not alleviate the problem.
A better solution is to enqueue your messages and then have a dedicated thread dequeue them and write to the log. Thus, you need no synchronization because all writes are done by a single thread.
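A sketch of that approach, assuming a BlockingCollection as the queue and a background thread as the dedicated writer:

using System.Collections.Concurrent;
using System.IO;
using System.Threading;

public static class QueuedLogger
{
    private static readonly BlockingCollection<string> Messages = new BlockingCollection<string>();

    static QueuedLogger()
    {
        var writerThread = new Thread(() =>
        {
            // All file access happens on this one thread, so no locking is needed.
            foreach (var message in Messages.GetConsumingEnumerable())
                File.AppendAllText("log.txt", message);
        });
        writerThread.IsBackground = true;
        writerThread.Start();
    }

    public static void WriteToLog(string text)
    {
        Messages.Add(text); // safe to call from any BackgroundWorker
    }
}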
I will warn again: use a logging framework.
I have a .NET Windows service that gets a list of image files from a folder, does some conversion, and sends the converted files to another directory. I want to achieve more throughput by adding another instance of the service watching the same folder. I want the two instances to process files independently, without any duplicate processing.
What patterns can be used?
Would file locking work for this?
I don't want to use a database or any other messaging platform.
I can use text files etc. to create synchronization if needed.
If using .NET, I would consider creating multiple threads (using the TPL) to process the files in parallel. This way you have a single process that controls the entire operation: no need to track which process (exe) is processing a file, no databases, no locking, etc.
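A sketch of that single-process option; the folder paths, file pattern, degree of parallelism, and ConvertFile are placeholders for your own code:

var files = Directory.EnumerateFiles(@"C:\Incoming", "*.img"); // hypothetical source folder
Parallel.ForEach(
    files,
    new ParallelOptions { MaxDegreeOfParallelism = 4 }, // tune for your hardware
    file =>
    {
        ConvertFile(file); // your existing conversion step
        File.Move(file, Path.Combine(@"C:\Converted", Path.GetFileName(file)));
    });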
However, if you wish to have multiple processes processing the files, then one option for synchronizing the processing is to use a named Mutex.
I would use this option along with the first one: use the TPL (multiple threads) in one service, and also use mutexes. That way you get the benefit of multiple threads and of multiple services. Hopefully this is what you are after.
https://msdn.microsoft.com/en-us/library/bwe34f1k(v=vs.110).aspx
Before processing any file, create a Mutex with a particular name (e.g. derived from the file name) and, if initial ownership was granted, continue processing the file. If ownership wasn't granted, you can safely assume that another process or another thread (within the same application) has already acquired the mutex, meaning the file is already being processed.
Sample code:
bool mutexWasCreated;
// Name the mutex after the file so each file gets its own lock.
var fileMutex = new Mutex(true, "File Name", out mutexWasCreated);
if (mutexWasCreated)
{
    // We created the mutex and hold initial ownership: start processing the file.
}
else
{
    // Some other process/thread already owns the mutex, so the file is
    // already being processed; nothing to do.
}
If one service (exe) goes down, its threads die, meaning the mutexes are released and those files become available for processing by the other process.
I just switched from .NET 1.1 to 3.5 on a Windows service that has been active for 10 years, with over 2 million files processed. I have an asynchronous class that prints graphics to a PDF printer, with a FileSystemWatcher event handler (now on its own STA thread) archiving the PDF files. The PDF creation is asynchronous because an existing client application method permits creating all missing PDFs in a DateTime interval.
(1) Without the event handler spun off on an STA thread, the service hangs.
(2) With only a few PDFs arriving within a few-second interval, it works fine. Increase that to 5 PDFs and inevitably one file doesn't get archived. Increase that to 15 PDFs and several don't get archived (all this in a test bed). Before moving a file, I check that it exists, requiring 2 successful detections (PDF printers tend to produce phantom file-creation events). I also check for exclusive access to the file. Update: I tried another STA thread-creation approach (via a parameterized class and method) in a different section of COM-interacting code, and had the same problem with unreliability (only about 50% of the threads complete).
For the PDFs, I was tempted to set up a Timer to archive abandoned files, but I am unclear when to start the Timer so as to avoid having multiple Timers trying to do the same archiving task (with the additional danger of Oracle concurrency problems); that design feels a bit like belt and suspenders (negative ugh factor).
This is such a common paradigm, it shouldn't be this difficult to make robust! Looking for enlightenment on (1) and help with making new STA threads complete reliably (2).
PSEUDOCODE
Test bed user interface:
// Process 20 instrument raw data files in a loop
// For each file:
//   1-2 s to set up processing and retrieve metadata from the database for the file
//   (A) spin off STA worker thread
//       call instrument vendor COM API to read data file
//       set up FileSystemWatcher for PDF files
//       create graphical image PDF
//   handle PDF_Created in a shell that ...
//   (B) spins off STA worker thread to
//       archive the PDF
Answering (2): I had to add code to linearize or resynchronize the new STA thread with the old MTA thread (e.g. block the parent thread until the worker thread completes):
thread.Join();
That worked well at point (A) in the pseudocode, but not at point (B), where I had some shared field variables that still need to be moved into thread parameters (a potential cause of not all the PDFs being created).
I confess to still not understanding why a FileSystemWatcher that archives files across the network needs to be handled on an STA thread (question (1)).
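For reference, a bare sketch of the linearization described above; the archiving method and its parameter are placeholders:

// Spin off the STA worker, then block the parent thread until it completes.
var worker = new Thread(() => ArchivePdf(pdfPath)); // ArchivePdf/pdfPath are hypothetical
worker.SetApartmentState(ApartmentState.STA); // must be set before Start
worker.Start();
worker.Join(); // resynchronize with the parent (MTA) thread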
I have a MonoTouch (MT) app that downloads content from the internet (e.g. lots of images, 10 KB to 5 MB). One download session can represent gigabytes of data. I have wrapped the download in a Parallel.ForEach loop and that works, but it doesn't seem to use more than one thread on the device for downloading (I would like at least two to reduce the download time).
Note: Parallel.ForEach does create multiple threads in the simulator. Should I just throw all the downloads as tasks into the thread pool? Should I spin up my own queue and threads and bypass the thread pool? I know the thread pool scales to match the device, so that might not be the best option.
When it comes to IO, only the application developer knows how much parallelism they want. Don't rely on the TPL for that; it knows nothing about IO.
Create the right amount of IO parallelism yourself: start the correct number of tasks manually, use PLINQ with an exact degree of parallelism, or use async IO (which is thread-less).
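For example, PLINQ lets you pin the degree of parallelism explicitly; urls and DownloadFile here are placeholders for your own collection and method:

urls.AsParallel()
    .WithDegreeOfParallelism(2) // at most two concurrent downloads
    .ForAll(url => DownloadFile(url));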
Are you downloading via HTTP? I've found the WebClient class to work well for the type of thing you're describing.
Something like:
WebClient client = new WebClient();
client.DownloadFileCompleted += new AsyncCompletedEventHandler(client_DownloadFileCompleted);
client.DownloadFileAsync(new Uri("http://stackoverflow.com"), "test.txt");

void client_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    // file finished downloading
}
This way there's no need to manage the threads yourself.
Also, if you want to read the data right away, you might just want to use DownloadDataAsync and save the file yourself.
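A sketch of that variant, assuming the same client instance and a hypothetical output file name:

client.DownloadDataCompleted += (s, e) => File.WriteAllBytes("test.dat", e.Result); // save the bytes yourself
client.DownloadDataAsync(new Uri("http://stackoverflow.com"));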
So I've been told what I'm doing here is wrong, but I'm not sure why.
I have a webpage that imports a CSV file with document numbers to perform an expensive operation on. I've put the expensive operation into a background thread to prevent it from blocking the application. Here's what I have in a nutshell.
protected void ButtonUpload_Click(object sender, EventArgs e)
{
    if (FileUploadCSV.HasFile)
    {
        string fileText;
        using (var sr = new StreamReader(FileUploadCSV.FileContent))
        {
            fileText = sr.ReadToEnd();
        }
        var documentNumbers = fileText.Split(new[] {',', '\n', '\r'}, StringSplitOptions.RemoveEmptyEntries);
        ThreadStart threadStart = () => AnotherClass.ExpensiveOperation(documentNumbers);
        var thread = new Thread(threadStart) { IsBackground = true };
        thread.Start();
    }
}
(obviously with some error checking & messages for users thrown in)
So my three-fold question is:
a) Is this a bad idea?
b) Why is this a bad idea?
c) What would you do instead?
A possible problem is that your background thread is running in your web site's application pool. IIS may decide to recycle the application pool, causing the expensive operation to be killed before it is done.
I would rather go for an option where I had a separate process, possibly a Windows service, that would receive the expensive operation requests and perform them outside the ASP.NET process. Not only would your expensive operation survive an application pool restart, it would also simplify your web application, since it wouldn't have to handle the processing.
Telling the service to perform the expensive process could be done using some sort of inter-process communication: the service could poll a database table or a file, or you could use a message queue that the service listens to.
There are many ways to do this, but my main point is that you should separate the expensive process from your web application if possible.
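As one concrete shape for that hand-off, a sketch using an MSMQ message queue; the queue path and label are hypothetical, and the Windows service would listen on the same queue (documentNumbers comes from the question's code):

// In the web application: enqueue the request and return immediately.
using (var queue = new System.Messaging.MessageQueue(@".\Private$\ExpensiveWork"))
{
    queue.Send(documentNumbers, "ExpensiveOperation");
}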
I recommend you use the BackgroundWorker class instead of using threads directly. This is because BackgroundWorker is designed specifically to perform background operations for a graphical application, and (among other things) provides mechanisms to communicate updates to the user interface.
a: Yes.
Use the ThreadPool ;) Queue a work item - it avoids the overhead of generating tons of threads.
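In the question's button handler, that could look like this (same documentNumbers variable):

// Replaces the new Thread(...) block; a pool thread runs the same work.
ThreadPool.QueueUserWorkItem(state => AnotherClass.ExpensiveOperation(documentNumbers));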