I have a .NET Windows service that gets a list of image files from a folder, does some conversion, and sends the converted files to another directory. I want to achieve more throughput by adding another instance of the service watching the same folder. I want the two instances to process files independently, without any duplicate processing.
What patterns can be used?
Would file locking work for this?
I don't want to use a database or any other messaging platform.
I can use text files etc. to create synchronization if needed.
If using .NET, I would consider creating multiple threads (using the TPL) to process the files in parallel. This way you have a single process that has control over the entire operation, so there is no need to track which process (exe) is processing a file: no databases, no locking, etc.
However, if you wish to have multiple processes processing the files, then one option for synchronizing the processing is to use a named Mutex.
I would use this option along with Solution 1.
That is, use the TPL (multiple threads) in one service, and also use mutexes. This way you get the benefit of multiple threads and multiple services. Hopefully this is what you are after.
https://msdn.microsoft.com/en-us/library/bwe34f1k(v=vs.110).aspx
Before processing any file, create a Mutex with a particular name (e.g. derived from the file name) and, if initial ownership has been granted, continue processing the file. If ownership hasn't been granted, you can safely assume that another process, or another thread within the same application, has already acquired the Mutex, meaning another process/thread is already processing the file.
Sample code:
bool mutexWasCreated;
var fileMutex = new Mutex(true, "File Name", out mutexWasCreated);
if (mutexWasCreated)
{
    // We were granted initial ownership: no other process/thread holds this
    // mutex, so process the file, then call fileMutex.ReleaseMutex() when done.
}
else
{
    // Some other process/thread already owns this mutex, meaning it is
    // already processing this file, so nothing to do.
}
If one service (exe) goes down, its threads die, the mutexes it held are released (abandoned), and those files become available for processing by another process.
I just switched from .NET 1.1 to 3.5 on a Windows service that has been active for 10 years, with over 2 million files processed. I have an asynchronous class that prints graphics to a PDFPrinter, with a FileSystemWatcher event handler, now on its own STA thread, archiving the PDF files. The PDF creation is asynchronous because an existing client application method permits creation of all missing PDFs in a DateTime interval.
(1) Without the event handler spun off on an STA thread, the service hangs.
(2) With only a few PDFs arriving within a few-second interval, it works fine. Increase that to 5 PDFs and inevitably one file doesn't get archived. Increase that to 15 PDFs and several don't get archived (all this in a test bed). Before moving a file, I check that it exists, requiring 2 successful detections (PDFPrinters tend to produce phantom file-creation events). I also check for exclusive access to the file. Update: I tried another STA thread-creation approach (via a parameterized class and method) in a different section of COM-interacting code, and had the same problem with unreliability (only about 50% of threads complete).
For the PDFs, I was tempted to set up a Timer to archive abandoned files, but am unclear when to start the Timer so as to avoid having multiple Timers trying to do the same archiving task (with the additional danger of Oracle concurrency problems); that design feels a bit like belt and suspenders (negative ugh factor).
This is such a common paradigm, it shouldn't be this difficult to make robust! Looking for enlightenment on (1) and help with making new STA threads complete reliably (2).
PSEUDOCODE
Test bed user interface:
// Process 20 instrument raw data files in a loop
// For each file:
// 1-2 s to setup processing and retrieve metadata from database on each file
// (A) spin off STA worker thread
// call instrument vendor COM API to read data file
// setup FileSystemWatcher for PDF files
// create graphical image PDF
// handle PDF_Created in a shell that ...
// (B) spins off STA worker thread to
// archive the PDF
Answering (2): I had to add code to linearize or resynch the new STA thread with the old MTA thread (e.g. block the parent thread until the worker thread completes).
thread.Join();
That worked well at point (A) in the pseudocode, but not at point (B), where I had some shared field variables that still need to be moved into thread parameters (a potential cause of not all PDFs being created).
I confess to still not understanding why a FileSystemWatcher that archives files across the network needs to be handled on an STA thread (question (1)).
I am using TvdbLib in a program. This library can use a cache for loading TV series quicker. To further improve the speed of the program, I do all my loading of TV series on separate threads. When two threads run simultaneously and try to read/write from the cache simultaneously, I will get the following error:
The process cannot access the file
'C:\BinaryCache\79349\series_79349.ser' because it is being used by
another process.
Does anyone know how to avoid this and still have the program running smoothly?
CacheProvider is not built for multi-threaded scenarios. Either use it from one thread only, lock on every access via a shared object, or supply every thread with its own CacheProvider and its own distinct _root directory (in the constructor).
You can use the lock statement to ensure only one thread is accessing the cache at the same time:
http://msdn.microsoft.com/en-us/library/c5kehkcz(v=vs.71).aspx
From the error I assume that TvdbLib does not support multiple concurrent threads accessing the same cache. As it is an open source project, you could get the source code and implement your own protection around the cache access, e.g., using the lock statement. Of course, you could lock within your own code before it calls TvdbLib, but because that lock sits at a higher level it will be held for longer, and you may not get the fine-grained concurrency that you want.
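If you do lock at the higher level, a small helper can keep the pattern in one place. This is only a sketch: `CacheGate`, `WithCacheLock`, and the way the cache call is passed in are illustrative names, not part of TvdbLib's API.

```csharp
using System;

public static class CacheGate
{
    // One shared lock object guards all cache access across threads.
    private static readonly object _cacheLock = new object();

    // Wraps any cache operation so only one thread touches the cache
    // files at a time. TResult is whatever the wrapped call returns.
    public static TResult WithCacheLock<TResult>(Func<TResult> cacheOperation)
    {
        lock (_cacheLock)
        {
            return cacheOperation();
        }
    }
}
```

Usage would then look like `var series = CacheGate.WithCacheLock(() => cacheProvider.LoadCachedSeries(79349));`, where `LoadCachedSeries` is a placeholder for whichever CacheProvider call you actually make.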
I am making use of the C# code located at the following links to implement a Ram-disk project.
Link to description of source code
Link to source code
As a summary, the code indicated above makes use of a simple tree structure to store the directories, sub-directories and files. At the root is a MemoryFolder object which stores zero or more MemoryFolder objects and/or MemoryFile objects. Each MemoryFolder object in turn stores zero or more MemoryFolder objects and/or MemoryFile objects, and so forth up to an unlimited depth.
However, the code is not thread safe. What is the most elegant way of implementing thread safety? In addition, how should the following non-exhaustive list of multithreading requirements for a typical file system be enforced by using the appropriate locking strategy?
1. The creation of two different folders (each by a different thread) under the same parent folder can occur concurrently if the thread-safe implementation allows it. Otherwise, some locking strategy should be implemented to allow only sequential creation.
2. None of the direct or indirect parent folders of the folder containing a specific file (that is currently being read by another thread), propagating all the way up to the root folder, can be moved or deleted by another thread until the ReadFile thread completes its execution.
3. With regard to each unique file, allow concurrent access for multiple ReadFile threads while restricting access to a single WriteFile thread.
4. If two separate ReadFile threads, fired almost simultaneously, each from a different application, attempt to create a folder with the same name (assuming the folder does not already exist before both threads are fired), the first thread that enters the Ram-Disk always succeeds while the second one always fails. In other words, the order of thread execution is deterministic.
5. The total disk space calculation method GetDiskFreeSpace, running under a separate thread, should not complete its execution until all WriteFile threads already in progress complete their execution. All subsequent WriteFile threads that have not begun executing are blocked until the GetDiskFreeSpace thread completes its execution.
The easiest way to do this would be to protect the entire tree with a ReaderWriterLockSlim. That allows concurrent access by multiple readers or exclusive access by a single writer. Any method that will modify the structure in any way will have to acquire the write lock, and no other threads will be allowed to read or write to the structure until that thread releases the write lock.
Any thread that wants to read the structure has to acquire the read lock. Multiple readers can acquire the read lock concurrently, but a thread that wants to acquire the write lock must wait until all existing read locks are released.
There might be a way to make that data structure lock-free. Doing so, however, could be quite difficult. The reader/writer lock will give you the functionality you want, and I suspect it would be fast enough.
If you want to share this across processes, that's another story. The ReaderWriterLockSlim doesn't work across processes. You could, however, implement something similar using a combination of the synchronization primitives, or create a device driver (or service) that serves the requests, thereby keeping it all in the same process.
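As a rough sketch of the reader/writer approach, assuming a minimal stand-in for the linked project's MemoryFolder class (the real class has more members), note that ReaderWriterLockSlim forbids recursive lock acquisition by default, so the recursive walk happens in a private method that assumes the lock is already held:

```csharp
using System.Collections.Generic;
using System.Threading;

public class MemoryFolder
{
    public string Name;
    public List<MemoryFolder> Folders = new List<MemoryFolder>();
}

public class RamDisk
{
    private readonly MemoryFolder _root = new MemoryFolder { Name = "" };
    // One lock for the whole tree: many concurrent readers OR one writer.
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public MemoryFolder Root { get { return _root; } }

    public void CreateFolder(MemoryFolder parent, string name)
    {
        _lock.EnterWriteLock();          // exclusive: the structure changes
        try
        {
            parent.Folders.Add(new MemoryFolder { Name = name });
        }
        finally { _lock.ExitWriteLock(); }
    }

    public int CountFolders(MemoryFolder folder)
    {
        _lock.EnterReadLock();           // shared: other readers may proceed
        try { return CountCore(folder); }
        finally { _lock.ExitReadLock(); }
    }

    // Recursive walk; must only be called while the read lock is held,
    // because ReaderWriterLockSlim disallows recursion by default.
    private int CountCore(MemoryFolder folder)
    {
        int count = folder.Folders.Count;
        foreach (var child in folder.Folders)
            count += CountCore(child);
        return count;
    }
}
```

This single coarse lock trivially satisfies requirements like "readers can't observe a half-moved tree", at the cost of serializing all writers; finer-grained per-folder locking is possible but much harder to get right.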
Hopefully this is a better question than my previous. I have an .exe to which I will pass different parameters (file paths), which it will then take in and parse. So I will have a loop going, iterating over the file paths in a list and passing each to this .exe file.
For this to be more efficient, I want to spread the execution across multiple cores which I think you do through threading.
My question is, should I use the threadpool, or multiple threads to run this .exe asynchronously?
Also, depending on which one of those you guys think is the best, if you can point me to a tutorial that will have some info on what I want to do. Thank you!
EDIT:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100,000 processes. So I am using threads to limit the number of executions at any one time to one per core. If there is another way (other than threads) to find out whether a processor is tied up in execution, or whether the .exe has finished, please explain.
But if there isn't another way, my FINAL question is how would I use a thread to call a parse method and then call back when that thread is no longer in use?
SECOND UPDATE (VERY IMPORTANT):
I went through what everyone told me, and found out a key element that I left out that I thought didn't matter. So I am using a GUI and I don't want it to be locked up. THAT is why I wanted to use threads. My main question now is, how do I send back information from a thread so I know when the execution is over?
As I said in my answer to your previous question, I think you don't understand the difference between processes and threads. Processes are incredibly "heavy" (*); each process can contain many threads. If you are spawning new processes from a parent process, that parent process doesn't need to create new threads; each process will have its own collection of threads.
Only create threads in the parent process if all the work is being done in the same process.
Think of a thread as a worker, and a process as a building containing one or more workers.
One strategy is "build a single building and populate it with ten workers who each do some amount of work". You get the expense of building one process and ten threads.
If your strategy is "build a building. Then have the one worker in that building order the construction of a thousand more buildings, each of which contains a worker that does their bidding", then you get the expense of building 1001 buildings and hiring 1001 workers.
The strategy you do not want to pursue is "build a building. Hire 1000 workers in that building. Then instruct each worker to build a building, which then has one worker to go do the real work." There is no point in making a thread whose sole job is creating a process that then creates a thread! You have 1001 buildings and 2001 workers, half of whom are immediately idle but still have to be paid.
Looking at your specific problem: the key question is "where is the bottleneck?" Spawning off new processes or new threads only helps when the performance problem is that the perf is gated on the processor. If the performance of your parser is gated not on how fast you can parse the file but rather on how fast you can get it off disk, then parallelizing it is going to make things far, far worse. You'll have a huge amount of system resources devoted to all hammering on the same disk controller at the same time, and the disk controller will get slower as more load piles up on it.
UPDATE:
I need to limit the number of executions of the .exe to ONE execution PER CORE. This is the most efficient because if I am parsing 100,000 files I can't just fire up 100,000 processes. So I am using threads to limit the number of executions at any one time to one per core. If there is another way (other than threads) to find out whether a processor is tied up in execution, or whether the .exe has finished, please explain.
This seems like an awfully complicated way to go about it. Suppose you have n processors. Your proposed strategy, as I understand it, is to fire up n threads and have each thread fire up one process, on the assumption that since the operating system will probably schedule one thread per CPU, the new thread in each new process will somehow magically also be scheduled on a different CPU?
That seems like a tortuous chain of reasoning that depends on implementation details of the operating system. This is craziness. If you want to set the processor affinity of a particular process, just set the processor affinity on the process! Don't be doing this crazy thing with threads and hope that it works out.
I say that if you want to have no more than n instances of an executable running, one per processor, don't mess around with threads at all. Rather, just have one thread sit in a loop, constantly monitoring what processes are running. If there are fewer than n copies of the executable running, spawn another and set its processor affinity to be the CPU you like best. If there are n or more copies of the executable running, go to sleep for a second (or a minute, or whatever makes sense), and when you wake up, check again. Keep doing that until you're done. That seems like a much easier approach.
(*) Threads are also heavy, but they are lighter than processes.
Spontaneously I would push your file paths into a thread-safe queue and then fire up a number of threads (say one per core). Each thread would repeatedly pop one item from the queue and process it accordingly. The work is done when the queue is empty.
Implementation suggestions (to answer some of the questions in comments):
Queue:
In C# you could have a look at the Queue Class and the Queue.Synchronized Method for the implementation of the queue:
"Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
To guarantee the thread safety of the Queue, all operations must be done through the wrapper returned by the Synchronized method.
Enumerating through a collection is intrinsically not a thread-safe procedure. Even when a collection is synchronized, other threads can still modify the collection, which causes the enumerator to throw an exception. To guarantee thread safety during enumeration, you can either lock the collection during the entire enumeration or catch the exceptions resulting from changes made by other threads."
Threading:
For the threading part, I suppose that any of the examples in the MSDN threading tutorial would do (the tutorial is a bit old, but should still be valid). You should not need to worry about synchronizing the threads, as they can work independently of each other. The queue above is the only common resource they need to access (hence the importance of its thread safety).
Start the external process (.exe):
The following code is borrowed (and tweaked) from How to wait for a shelled application to finish by using Visual C#. You need to edit for your own needs, but as a starter:
//How to Wait for a Shelled Process to Finish
//Create a new process info structure.
ProcessStartInfo pInfo = new ProcessStartInfo();
//Set the file name member of the process info structure.
pInfo.FileName = @"mypath\myfile.exe"; // verbatim string, so the backslash needs no escaping
//Start the process.
Process p = Process.Start(pInfo);
//Wait for the process to end.
p.WaitForExit();
Pseudo code:
Main thread:
    Create thread safe queue
    Populate the queue with all the file paths
    Create child threads and wait for them to finish
Child threads:
    While queue is not empty   << this section is critical: no more than one
        pop file from queue    << thread can check and pop at a time
        start external exe
        wait for it....
        end external exe
    end while
    Child thread exits
Main thread waits for all child threads to finish
Program finishes.
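The pseudocode above might look like this in C#. `WorkQueueRunner` and its delegate parameter are my own names; the `processFile` callback is where you would shell out to the .exe (Process.Start plus WaitForExit, as in the earlier snippet). The lock makes the "check and pop" step atomic, which is the critical section the pseudocode calls out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class WorkQueueRunner
{
    // Pops file paths from a shared queue on threadCount threads and
    // runs processFile on each. The work is done when the queue is empty.
    public static void Run(IEnumerable<string> files, int threadCount,
                           Action<string> processFile)
    {
        var queue = new Queue<string>(files);
        var sync = new object();
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(() =>
            {
                while (true)
                {
                    string file;
                    lock (sync)                 // check-and-pop must be atomic
                    {
                        if (queue.Count == 0) return;
                        file = queue.Dequeue();
                    }
                    processFile(file);          // e.g. start exe, WaitForExit
                }
            });
            t.Start();
            threads.Add(t);
        }

        foreach (var t in threads) t.Join();    // main thread waits for children
    }
}
```

You would call it with something like `threadCount: Environment.ProcessorCount` to get one worker per core.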
See this question for how to find out the number of cores.
Then use Parallel.ForEach with ParallelOptions with MaxDegreeOfParallelism set to the number of cores.
Parallel.ForEach(args, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, (element) => Console.WriteLine(element));
If you're targeting the .Net 4 framework the Parallel.For or Parallel.Foreach are extremely helpful. If those don't meet your requirements I've found the Task.Factory to be useful and straightforward to use as well.
To answer your revised question, you want processes. You just need to create the correct number of processes running the exe. Don't worry about forcing them onto specific cores. Windows will do that automatically.
How to do this:
You want to determine the number of cores on the machine. You may simply know it, and hardcode it, or you might want to use something like System.Environment.ProcessorCount.
Create a List<Process> object.
Then you want to start that many processes using System.Diagnostics.Process.Start. The return value will be a process object, which you will want to add to the List.
Now repeat the following until you are finished:
Call Thread.Sleep to wait for a while. Perhaps a minute or so.
Loop through each Process in the list but be sure to use a for loop rather than a foreach loop. For each process, call Refresh() then check the 'HasExited' property of each process, and if it is true, create a new process using Process.Start, and replace the exited process in the list with the newly created one.
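Those steps could be sketched as follows. `ProcessPool`, `RunAll`, and the `makeStartInfo` factory are illustrative names of my own; plug in your parser's actual path and arguments.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class ProcessPool
{
    // Runs one parser process per file, never more than maxParallel
    // (e.g. Environment.ProcessorCount) at once. makeStartInfo builds
    // the ProcessStartInfo for a given file path.
    public static void RunAll(Queue<string> files, int maxParallel,
                              Func<string, ProcessStartInfo> makeStartInfo)
    {
        var running = new List<Process>();

        while (files.Count > 0 || running.Count > 0)
        {
            // Remove exited processes. Index-based loop, since we
            // modify the list while iterating.
            for (int i = running.Count - 1; i >= 0; i--)
            {
                running[i].Refresh();
                if (running[i].HasExited)
                    running.RemoveAt(i);
            }

            // Top the pool back up to maxParallel.
            while (running.Count < maxParallel && files.Count > 0)
                running.Add(Process.Start(makeStartInfo(files.Dequeue())));

            Thread.Sleep(100);   // poll interval; tune to taste
        }
    }
}
```

A call might look like `ProcessPool.RunAll(queue, Environment.ProcessorCount, f => new ProcessStartInfo("parser.exe", "\"" + f + "\""));`, with `parser.exe` standing in for your actual executable.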
If you're launching a .exe, then you have no choice. You will be running this asynchronously in a separate process. For the program which does the launching, I would recommend that you use a single thread and keep a list of the processes you launched.
Each exe launched will occur in its own process. You don't need to use a threadpool or multiple threads; the OS manages the processes (and since they're processes and not threads, they're very independent; completely separate memory space, etc.).
I wanted to implement a windows service that captures dropped flat delimited files to a folder for import to the database. What I originally envision is to have a FileSystemWatcher looking over new files imported and creating a new thread for importing.
I wanted to know how I should properly implement an algorithm for this and what technique should I use? Am I going to the right direction?
I developed a product like this for a customer. The service monitored a number of folders for new files, and when files were discovered they were read, processed (printed on barcode printers), archived and deleted.
We used a "discoverer" layer that discovered files using FileSystemWatcher or polling, depending on the environment (since FileSystemWatcher is not reliable when monitoring e.g. Samba shares), a "file reader" layer, and a "processor" layer.
The "discoverer" layer discovered files and put the filenames in a list that the "file reader" layer processed. The "discoverer" layer signaled that there were new files to process by setting an event that the "file reader" layer was waiting on.
The "file reader" layer then read the files (using retry functionality, since you may get notifications for new files before a file has been completely written by the process that creates it).
After the "file reader" layer read a file, a new "processor" thread was created using ThreadPool.QueueUserWorkItem to process the file contents.
When a file had been processed, the original file was copied to an archive and deleted from the original location. The archive was also cleaned up regularly to keep it from flooding the server. The archive was great for troubleshooting.
This has now been used in production in a number of different environments for over two years and has proved to be very reliable.
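The retry functionality mentioned for the "file reader" layer might look roughly like this; the attempt count and delay are arbitrary starting points, not values from the original service:

```csharp
using System.IO;
using System.Threading;

public static class SafeFileReader
{
    // Tries to open the file exclusively, retrying while the writing
    // process still holds it. Returns the contents, or rethrows the
    // IOException after maxAttempts failures.
    public static string ReadWithRetry(string path, int maxAttempts = 10,
                                       int delayMs = 500)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                // FileShare.None fails while the writer still has the
                // file open, which is exactly what we want to detect.
                using (var stream = new FileStream(path, FileMode.Open,
                                                   FileAccess.Read, FileShare.None))
                using (var reader = new StreamReader(stream))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (IOException)
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(delayMs);   // writer likely still busy; wait and retry
            }
        }
    }
}
```

The same open-exclusively-and-retry trick also answers the "check for exclusive access" step mentioned in the PDF-archiving question above.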
I've fielded a service that does this as well. I poll via a timer whose elapsed event handler acts as a supervisor, adding new files to a queue and launching a configurable number of threads that consume the queue. Once the files are processed, it restarts the timer.
Each thread including the event handler traps and reports all exceptions. The service is always running, and I use a separate UI app to tell the service to start and stop the timer. This approach has been rock solid and the service has never crashed in several years of processing.
The traditional approach is to create a finite set of threads (could be as few as 1) and have them watch a blocking queue. The code in the FileSystemWatcher1 event handlers will enqueue work items while the worker thread(s) dequeue and process them. It might look like the following, which uses the BlockingCollection class, available in .NET 4.0 or as part of the Reactive Extensions download.
Note: The code is left short and concise for brevity. You will have to expand and harden it yourself.
public class Example
{
    private BlockingCollection<string> m_Queue = new BlockingCollection<string>();

    public Example()
    {
        var thread = new Thread(Process);
        thread.IsBackground = true;
        thread.Start();
    }

    private void FileSystemWatcher_Event(object sender, EventArgs args)
    {
        string file = GetFilePathFromEventArgs(args);
        m_Queue.Add(file);
    }

    private void Process()
    {
        while (true)
        {
            string file = m_Queue.Take();
            // Process the file here.
        }
    }
}
You could take advantage of the Task class in the TPL for a more modern and ThreadPool-like approach. You would start a new task for each file (or perhaps batch them) that needs to be processed. The only gotcha I see with this approach is that it would be harder to control the number of database connections being opened simultaneously. It's definitely not a showstopper, and it might be of no concern.
1The FileSystemWatcher has been known to be a little flaky so it is often advised to use a secondary method of discovering file changes in case they get missed by the FileSystemWatcher. Your mileage may vary on this issue.
Creating a thread per message will most likely be too expensive. If you can use .NET 4, you could start a Task for each message. That would run the code on a thread pool thread and thus reduce the overhead of creating threads.
You could also do something similar with asynchronous delegates if .NET 4 is not an option. However, the code gets a bit more complicated in that case. That would utilize the thread pool as well and save you the overhead of creating a new thread for each message.
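A minimal sketch of the Task-per-message approach on .NET 4 (names here are illustrative; `processMessage` stands in for your real handler):

```csharp
using System;
using System.Threading.Tasks;

public static class MessageProcessor
{
    // Starts one Task per message. Each Task runs on a thread-pool
    // thread, so no dedicated thread is created per message.
    public static Task[] StartAll(string[] messages, Action<string> processMessage)
    {
        var tasks = new Task[messages.Length];
        for (int i = 0; i < messages.Length; i++)
        {
            string msg = messages[i];   // capture a fresh variable per iteration
            tasks[i] = Task.Factory.StartNew(() => processMessage(msg));
        }
        return tasks;
    }
}
```

The caller can then use `Task.WaitAll(tasks)` (or check each Task) to find out when all messages have been handled.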