I need to create a service which is basically responsible for the following:
Watch a specific folder for any new files created.
If so, read that file, process it, and save the data in the DB.
For the above task, I am thinking of creating a multi-threaded service with either of the following approaches:
Option 1: In the main thread, create an instance of FileSystemWatcher and, as soon as a new file is created, add that file to the threadQueue. There will be N consumer threads running, each of which takes a file from the queue and processes it (i.e. step 2).
Option 2: Again in the main thread, create an instance of FileSystemWatcher and, as soon as a new file is created, read that file and add the data to MSMQ using a WCF MSMQ service. When the message is read by the WCF MSMQ service, it will be responsible for further processing.
I am a newbie when it comes to creating a multi-threaded service, so I am not sure which will be the best option. Please guide me.
Thanks,
First off, let me say that you have taken a wise approach to do a single producer - multiple consumer model. This is the best approach in this case.
I would go for option 1, using a ConcurrentQueue data structure, which provides you an easy way to queue tasks in a thread-safe manner. Alternatively, you can simply use the ThreadPool.QueueUserWorkItem method to send work directly to the built-in thread pool, without worrying about managing the workers or the queue explicitly.
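For illustration, here is a rough sketch of option 1 along those lines, assuming a BlockingCollection wrapped around a ConcurrentQueue so the consumers block until work arrives. The folder path, worker count, and the ProcessAndSaveToDatabase routine are placeholders, not part of the original question:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class FileProcessingService
{
    // BlockingCollection over a ConcurrentQueue: thread-safe, and blocks consumers when empty
    readonly BlockingCollection<string> _files =
        new BlockingCollection<string>(new ConcurrentQueue<string>());

    public void Start(string folderToWatch, int workerCount)
    {
        // Producer: the watcher enqueues the path of every newly created file
        var watcher = new FileSystemWatcher(folderToWatch);
        watcher.Created += (s, e) => _files.Add(e.FullPath);
        watcher.EnableRaisingEvents = true;

        // Consumers: N long-running workers take files off the queue and process them
        for (int i = 0; i < workerCount; i++)
        {
            Task.Factory.StartNew(() =>
            {
                foreach (var path in _files.GetConsumingEnumerable())
                {
                    ProcessAndSaveToDatabase(path); // hypothetical "step 2" routine
                }
            }, TaskCreationOptions.LongRunning);
        }
    }

    void ProcessAndSaveToDatabase(string path)
    {
        // read the file, process it, and save the data in the DB
    }
}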
Edit: Regarding the reliability of FileSystemWatcher, MSDN says:
The Windows operating system notifies your component of file changes
in a buffer created by the FileSystemWatcher. If there are many
changes in a short time, the buffer can overflow. This causes the
component to lose track of changes in the directory, and it will only
provide blanket notification. Increasing the size of the buffer with
the InternalBufferSize property is expensive, as it comes from
non-paged memory that cannot be swapped out to disk, so keep the
buffer as small yet large enough to not miss any file change events.
To avoid a buffer overflow, use the NotifyFilter and
IncludeSubdirectories properties so you can filter out unwanted change
notifications.
So it depends on how often changes will occur and how much buffer you are allocating.
I would also consider your demands for failure handling and sizes of the files you are sending.
Whether you decide for option 1 or 2 will depend on your specifications.
Option 2 has the advantage that, by using MSMQ, your data is persisted in a recoverable way, even if you need to restart your machine. Option 1 only holds your data in memory, where it might get lost.
On the other hand, option 2 has the disadvantage that the message size of MSMQ is limited to 4 MB per message (explanation in a Microsoft blog here), and therefore only half of that when working with Unicode characters, while the in-memory queues are capable of much bigger sizes.
[Edit]
Thinking a bit longer, I would prefer option 2.
In your comment, you mention that you want to move files around in the filesystem. This can be very expensive in terms of performance, and even worse if you move the files between different partitions.
I have used MSMQ in multiple projects at work and am convinced that it would work well for what you want to do. A big advantage here is that MSMQ supports transactional communication. That means that if for some reason a network or electricity or whatever failure occurs, neither your message nor your files get lost.
If any of those happen while you move a file around it could easily get corrupted.
The only thing that gives me grumbles in my stomach is the file sizes. To work around the 4 MB message size limitation (see the link added above), I would not put the file content into a message. Instead, I would only send an ID or a file path with it, so that the consuming service can find the file and read it when needed.
This keeps the message and queue sizes small and avoids using too much bandwidth or memory on your network and server(s).
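A loose sketch of that idea, assuming System.Messaging and a transactional private queue; the queue path and message label are made up for illustration:

using System.Messaging;

class FileNotifier
{
    const string QueuePath = @".\private$\newFiles"; // hypothetical queue name

    public void NotifyNewFile(string fullPath)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath, true); // create a transactional queue

        using (var queue = new MessageQueue(QueuePath))
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            queue.Send(fullPath, "NewFile", tx); // body is just the path, well under the 4 MB limit
            tx.Commit();
        }
    }
}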
I have a C# WCF service that holds 120 GB of data in memory in a Dictionary<File,byte[]> for very fast access to file contents, which has worked really well for me. Upon access, the file contents are wrapped in a MemoryStream and read.
This service needs to be restarted every day to load some static data from the database that can change on a daily basis. The restart takes a long time because of the huge amount of data that needs to be loaded into memory again.
So I decided to host this memory in a different process on the same machine and access it through sockets. The data process will always be up and running. TcpListener/TcpClient and NetworkStream were used in a fashion similar to the following:
memoryStream.Read(position.PositionData, 0, position.SizeOfData);
position.NetworkStream.Write(position.PositionData, 0, position.SizeOfData);
Problem is: this was 10 times slower than hosting the memory in the same process. Slowdown is expected, but a factor of 10 is too much.
I thought of MemoryMappedFiles, but those are more useful for random access to a specific view of the file. My file access is sequential from the beginning all the way to the end.
Is there a different technology or library that could be used in my case? Or is this simply to be expected?
I assume you are using SQL Server. If so, Service Broker and SQL query notifications (SqlDependency) may be your friends here. I presume you need more of a push messaging model, one that automatically propagates changes back to the service when something changes in the DB. That way you avoid restarting the memory/resource-intensive process, and there is no need to rebuild your heavyweight dictionary.
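A minimal sketch of the query-notification idea, assuming SqlDependency against SQL Server; the connection string, table, and column names are placeholders:

using System;
using System.Data.SqlClient;

class StaticDataCache
{
    const string ConnectionString = "..."; // placeholder connection string

    public void LoadAndSubscribe()
    {
        SqlDependency.Start(ConnectionString); // call once at startup

        using (var connection = new SqlConnection(ConnectionString))
        using (var command = new SqlCommand("SELECT Id, Payload FROM dbo.StaticData", connection))
        {
            // notifications are one-shot, so a new dependency is created on each (re)load
            var dependency = new SqlDependency(command);
            dependency.OnChange += OnStaticDataChanged;

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                // refresh only the static data in the in-memory dictionary here
            }
        }
    }

    void OnStaticDataChanged(object sender, SqlNotificationEventArgs e)
    {
        // the database pushed a change notification: reload and re-subscribe
        // instead of restarting the whole memory-intensive process
        LoadAndSubscribe();
    }
}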
I have a simple web application module which basically accepts requests to save a zip file on PageLoad from a mobile client app.
Now, what I want to do is to unzip the file, read the file inside it, and process it further, including making entries into a database.
Update: the zip file and its contents will be fairly small in size, so the server shouldn't be burdened with much load.
Update 2: I just read about when IIS queues requests (at global/app level). So does that mean that I don't need to implement complex request handling mechanism and the IIS can take care of the app by itself?
Update 3: I am looking to offload the processing of the downloaded zip not only to minimize the overhead (in terms of performance) but also to avoid the problem of table-locking when the file is processed and records are updated in the same table. With multiple devices requesting the page and the background task updating the database in parallel, this would cause an exception.
As of now I have zeroed on two solutions:
To implement a concurrent/message queue
To implement the file processing code into a separate tool and schedule a job on the server to check for non-processed file(s) and process them serially.
I am inclined towards a queuing mechanism and will try to implement it, as it seems less dependent on configuration compared with manually configuring the job/schedule on the server side.
So, what do you guys recommend me for this purpose?
Moreover, after the zip file is requested and saved on the server side, the client-server connection is released. I am not looking to burden my IIS.
Imagine a couple of hundred clients simultaneously requesting the page..
I actually haven't used either of them before, so any samples or how-tos will be much appreciated.
I'd recommend TPL and Rx Extensions: you make your unzipped file list an observable collection and for each item start a new task asynchronously.
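As a rough illustration of the TPL half of that suggestion (the Rx wiring is omitted), one task is started per extracted file; ProcessFile is a placeholder for your own processing code:

using System.Collections.Generic;
using System.Threading.Tasks;

static class ZipProcessing
{
    static void ProcessUnzippedFiles(IEnumerable<string> extractedFilePaths)
    {
        var tasks = new List<Task>();
        foreach (var path in extractedFilePaths)
        {
            var local = path; // capture the loop variable for the closure
            tasks.Add(Task.Run(() => ProcessFile(local)));
        }
        Task.WaitAll(tasks.ToArray()); // or await Task.WhenAll(tasks) in async code
    }

    static void ProcessFile(string path)
    {
        // parse the extracted file and write the entries to the database
    }
}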
I'd suggest a queue system.
When you receive a file, you save its path into a thread-synchronized queue. Meanwhile a background worker (or preferably another machine) checks this queue for new files and dequeues an entry to handle it.
This way you won't launch an unknown number of threads (one per zip file) and can handle the zip files in one location. It also makes it easier to move your zip-handling code to another machine when the load gets too heavy; you just need to access a common queue.
The easiest would probably be to use a static Queue with a lock-object. It is the easiest to implement and does not require external resources. But this will result in the queue being lost when your application recycles.
You mentioned that losing zip files is not an option; in that case this approach is not the best if you don't want to rely on external resources. Depending on your load it may be worth utilizing external resources - meaning upload the zip file to common storage on another machine and add a message to a queue on another machine.
Here's an example with a local queue:
ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

void GotNewZip(string pathToZip)
{
    queue.Enqueue(pathToZip); // Added a new work item to the queue
}

void MethodCalledByWorker()
{
    while (true)
    {
        if (queue.IsEmpty)
        {
            // No work to be done, wait a few seconds and check again (new iteration)
            Thread.Sleep(TimeSpan.FromSeconds(5));
            continue;
        }

        string pathToZip;
        if (queue.TryDequeue(out pathToZip)) // If TryDequeue returns false, another thread already dequeued the last element
        {
            HandleZipFile(pathToZip);
        }
    }
}
This is a very rough example. Whenever a zip arrives, you add the path to the queue. Meanwhile a background worker (or multiple; the example is thread-safe) will handle one zip after another, getting the paths from the queue. The zip files will be handled in the order they arrive.
You need to make sure that your application does not recycle in the meantime. But that's the case with all resources you have on the local machine: they'll be lost when your machine crashes.
I believe you are optimising prematurely.
You mentioned table-locking - what kind of DB are you using? If you add new rows or update existing ones, most modern databases in most configurations will:
use row-level locking; and
be fast enough without you needing to worry about locking.
I suggest starting with a simple method
//Unzip
//Do work
//Save results to database
and get some proof it's too slow.
I have a text file and multiple threads/processes will write to it (it's a log file).
The file gets corrupted sometimes because of concurrent writes.
I want to open the file from all threads in a writing mode that is sequential at the file-system level itself.
I know it's possible to use locks (a mutex for multiple processes) and synchronize writing to this file, but I prefer to open the file in the correct mode and leave the task to System.IO.
Is it possible? What's the best practice for this scenario?
Your best bet is just to use locks/mutexes. It's a simple approach, it works, and you can easily understand it and reason about it.
When it comes to synchronization it often pays to start with the simplest solution that could work and only try to refine if you hit problems.
To my knowledge, Windows doesn't have what you're looking for. There is no file handle object that does automatic synchronization by blocking all other users while one is writing to the file.
If your logging involves the three steps, open file, write, close file, then you can have your threads try to open the file in exclusive mode (FileShare.None), catch the exception if unable to open, and then try again until success. I've found that tedious at best.
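A rough sketch of that open-exclusively-and-retry pattern (the retry delay and helper name are arbitrary), which also shows why it tends to feel tedious:

using System;
using System.IO;
using System.Threading;

static class ExclusiveAppend
{
    static void AppendLineExclusive(string path, string line)
    {
        while (true)
        {
            try
            {
                // FileShare.None: fail if any other thread/process currently has the file open
                using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.None))
                using (var writer = new StreamWriter(stream))
                {
                    writer.WriteLine(line);
                    return;
                }
            }
            catch (IOException)
            {
                Thread.Sleep(50); // another writer holds the file; back off briefly and try again
            }
        }
    }
}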
In my programs that log from multiple threads, I created a TextWriter descendant that is essentially a queue. Threads call the Write or WriteLine methods on that object, which formats the output and places it into a queue (using a BlockingCollection). A separate logging thread services that queue, pulling items from it and writing them to the log file (a sketch follows the list below). This has a few benefits:
Threads don't have to wait on each other in order to log
Only one thread is writing to the file
It's trivial to rotate logs (i.e. start a new log file every hour, etc.)
There's zero chance of an error because I forgot to do the locking on some thread
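This is not the answerer's actual TextWriter descendant, just a minimal sketch of the queue-and-drain idea it describes, assuming a BlockingCollection and a single long-running writer task:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public sealed class QueuedLogger : IDisposable
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly Task _writerTask;

    public QueuedLogger(string logFilePath)
    {
        // the only thread that ever touches the file
        _writerTask = Task.Factory.StartNew(() =>
        {
            using (var writer = new StreamWriter(logFilePath, true))
            {
                foreach (var line in _queue.GetConsumingEnumerable()) // blocks until items arrive
                {
                    writer.WriteLine(line);
                    writer.Flush();
                }
            }
        }, TaskCreationOptions.LongRunning);
    }

    public void WriteLine(string message)
    {
        _queue.Add(DateTime.Now + " " + message); // safe to call from any thread
    }

    public void Dispose()
    {
        _queue.CompleteAdding(); // let the writer drain the queue and exit
        _writerTask.Wait();
    }
}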
Doing this across processes would be a lot more difficult. I've never even considered trying to share a log file across processes. Were I to need that, I would create a separate application (a logging service). That application would do the actual writes, with the other applications passing the strings to be written. Again, that ensures that I can't screw things up, and my code remains simple (i.e. no explicit locking code in the clients).
You might be able to use File.Open() with a FileShare value set to None, and make each thread wait if it can't get access to the file.
So we have this somewhat unusual need in our product. We have numerous processes running on the local host and need to construct a means of communication between them. The difficulty is that ...
There is no 'server' or master process
Messages will be broadcast to all listening nodes
Nodes are all Windows processes, but may be C++ or C#
Nodes will be running in both 32-bit and 64-bit simultaneously
Any node can jump in/out of the conversation at any time
A process abnormally terminating should not adversely affect other nodes
A process responding slowly should also not adversely affect other nodes
A node does not need to be 'listening' to broadcast a message
A few more important details...
The 'messages' we need to send are trivial in nature. A name of the type of message and a single string argument would suffice.
The communications are not necessarily secure and do not need to provide any means of authentication or access control; however, we want to group communications by Windows logon session. Perhaps of interest here is that a non-elevated process should be able to interact with an elevated process and vice versa.
My first question: is there an existing open-source library, or something that can be used to fulfill this with little effort? As of now I haven't been able to find anything :(
If a library doesn't exist for this, then... what technologies would you use to solve this problem? Sockets, named pipes, memory-mapped files, event handles? It seems like connection-based transports (sockets/pipes) would be a bad idea in a fully connected graph, since n nodes require n(n-1) connections. Using event handles and some form of shared storage seems the most plausible solution right now...
Updates
Does it have to be reliable and guaranteed? Yes, and no... Let's say that if I'm listening, and I'm responding in a reasonable time, then I should always get the message.
What are the typical message sizes? less than 100 bytes including the message identifier and argument(s). These are small.
What message rate are we talking about? Low throughput is acceptable, 10 per second would be a lot, average usage would be around 1 per minute.
What are the number of processes involved? I'd like it to handle between 0 and 50, with the average being between 5 and 10.
I don't know of anything that already exists, but you should be able to build something with a combination of:
Memory mapped files
Events
Mutex
Semaphore
This can be built in such a way that no "master" process is required, since all of those can be created as named objects that are then managed by the OS and not destroyed until the last client uses them. The basic idea is that the first process to start up creates the objects you need, and then all other processes connect to those. If the first process shuts down, the objects remain as long as at least one other process is maintaining a handle to them.
The memory mapped file is used to share memory among the processes. The mutex provides synchronization to prevent simultaneous updates. If you want to allow multiple readers or one writer, you can build something like a reader/writer lock using a couple of mutexes and a semaphore (see Is there a global named reader/writer lock?). And events are used to notify everybody when new messages are posted.
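A hand-wavy sketch of the broadcast side of that idea in C#: a named shared-memory region, a named mutex guarding writes, and a named manual-reset event signalling that a message has been posted. The object names, region size, and single-slot layout are assumptions for illustration (the session-local "Local\" prefix is one way to group by logon session):

using System;
using System.IO.MemoryMappedFiles;
using System.Text;
using System.Threading;

class BroadcastChannel
{
    // the first process to run creates the named objects; later processes open the same ones
    readonly MemoryMappedFile _map =
        MemoryMappedFile.CreateOrOpen(@"Local\MyApp.Broadcast", 4096);
    readonly Mutex _writeLock = new Mutex(false, @"Local\MyApp.Broadcast.Mutex");
    readonly EventWaitHandle _posted =
        new EventWaitHandle(false, EventResetMode.ManualReset, @"Local\MyApp.Broadcast.Posted");

    public void Post(string message)
    {
        byte[] payload = Encoding.UTF8.GetBytes(message);
        _writeLock.WaitOne();
        try
        {
            using (var view = _map.CreateViewAccessor())
            {
                view.Write(0, payload.Length);                  // length prefix
                view.WriteArray(4, payload, 0, payload.Length); // message bytes
            }
            _posted.Set(); // wake listeners; resetting this correctly is the hard part mentioned below
        }
        finally
        {
            _writeLock.ReleaseMutex();
        }
    }
}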
I've waved my hand over some significant technical detail. For example, knowing when to reset the event is kind of tough. You could instead have each app poll for updates.
But going this route will provide a connectionless way of sharing information. It doesn't require that a "server" process is always running.
For implementation, I would suggest implementing it in C++ and let the C# programs call it through P/Invoke. Or perhaps in C# and let the C++ apps call it through COM interop. That's assuming, of course, that your C++ apps are native rather than C++/CLI.
I've never tried this, but in theory it should work. As I mentioned in my comment, use a UDP port on the loopback device. Then all the processes can read from and write to this socket. As you say, the messages are small, so they should fit into each packet - maybe you can look at something like Google's protocol buffers to generate the structures, or simply memcpy the structure into the packet to send and cast it back at the other end. Given it's all on the local host, you don't have any alignment or network byte order issues to worry about. To support different types of messages, ensure a common header that can be checked for type so that you can stay backward compatible.
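The answer itself only says "a UDP port on the loopback device"; one way to let every listening process see every message is UDP multicast with address reuse, sketched below. The multicast group, port, and "type|argument" framing are assumptions, not part of the answer:

using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

static class UdpBroadcastNode
{
    static readonly IPAddress Group = IPAddress.Parse("239.255.0.1"); // arbitrary local multicast group
    const int Port = 45678;                                           // arbitrary port

    public static void Send(string messageType, string argument)
    {
        using (var client = new UdpClient())
        {
            byte[] payload = Encoding.UTF8.GetBytes(messageType + "|" + argument);
            client.Send(payload, payload.Length, new IPEndPoint(Group, Port));
        }
    }

    public static void Listen()
    {
        var client = new UdpClient();
        client.Client.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, true);
        client.Client.Bind(new IPEndPoint(IPAddress.Any, Port));
        client.JoinMulticastGroup(Group);

        while (true)
        {
            var sender = new IPEndPoint(IPAddress.Any, 0);
            byte[] data = client.Receive(ref sender); // blocks until a datagram arrives
            Console.WriteLine(Encoding.UTF8.GetString(data));
        }
    }
}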
2cents...
I think one more important consideration is performance: what message rate are we talking about, and how many processes?
Either way you are relying on a "master" that enables the communication, be it a custom service or something system-provided (pipes, message queues, and such).
If you don't need to keep track of and query past messages, I do think you should consider a dead simple service that opens a named pipe, allowing all other processes to either read from or write to it as pipe clients. If I am not mistaken, it ticks all the items on your list.
What you're looking for is mailslots!
See CreateMailslot:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365147(v=vs.85).aspx
I programmed a simple WCF service that stores messages sent by users and sends these messages to the intended user when asked for. For now, the persistence is implemented by creating username.xml files with the following structure:
<messages recipient="username">
  <message sender="otheruser">
    ...
  </message>
</messages>
It is possible for more than one user to send a message to the same recipient at the same time, possibly causing the xml file to be updated concurrently. The WCF service is currently implemented with basicHttp binding, without any provisions for concurrent access.
What concurrency risks are there? How should I deal with them? A ReadWrite lock on the xml file being accessed?
Currently the service runs with 5 users at most; this may grow to 50, but no more.
EDIT:
As stated above, the client will instantiate a new service class with every call it makes (InstanceContextMode is PerCall, ConcurrencyMode is irrelevant). This is inherent to the use of basicHttpBinding with default settings on the service.
The code below:
public class SomeWCFService : ISomeServiceContract
{
    ClassThatTriesToHoldSomeInfo useless;

    public SomeWCFService()
    {
        useless = new ClassThatTriesToHoldSomeInfo();
    }

    #region Implementation of ISomeServiceContract
    public void IncrementUseless()
    {
        useless.Counter++;
    }
    #endregion
}
behaves as if it were written:
public class SomeWCFService : ISomeServiceContract
{
    ClassThatTriesToHoldSomeInfo useless;

    public SomeWCFService()
    {
    }

    #region Implementation of ISomeServiceContract
    public void IncrementUseless()
    {
        useless = new ClassThatTriesToHoldSomeInfo();
        useless.Counter++;
    }
    #endregion
}
So concurrency is never an issue until you try to access some externally stored data, such as in a database or in a file.
The downside is that you cannot store any data between method calls of the service unless you store it externally.
If your WCF service is a singleton service and guaranteed to be that way, then you don't need to do anything. Since WCF will allow only one request at a time to be processed, concurrent access to the username files is not an issue unless the operation that serves that request spawns multiple threads that access the same file. However, as you can imagine, a singleton service is not very scalable and not something you want in your case I assume.
If your WCF service is not a singleton, then concurrent access to the same user file is a very realistic scenario and you must definitely address it. Multiple instances of your service may concurrently attempt to access the same file to update it and you will get a 'can not access file because it is being used by another process' exception or something like that. So this means that you need to synchronize access to user files. You can use a monitor (lock), ReaderWriterLockSlim, etc. However, you want this lock to operate on per file basis. You don't want to lock the updates on other files out when an update on a different file is going on. So you will need to maintain a lock object per file and lock on that object e.g.
// when a new user file is added, create a new sync object
fileLockDictionary.Add("user1file.xml", new object());

// when updating a file
lock (fileLockDictionary["user1file.xml"])
{
    // update file.
}
Note that that dictionary is also a shared resource that will require synchronized access.
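One hedged way to address that last point is a ConcurrentDictionary, whose GetOrAdd guarantees a single lock object per file name even when two requests race to create it (the field and method names here are illustrative):

using System.Collections.Concurrent;

static readonly ConcurrentDictionary<string, object> fileLocks =
    new ConcurrentDictionary<string, object>();

static void UpdateUserFile(string fileName)
{
    object fileLock = fileLocks.GetOrAdd(fileName, _ => new object());
    lock (fileLock)
    {
        // read, modify, and save that user's xml file here
    }
}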
Now, dealing with concurrency and ensuring synchronized access to shared resources at the appropriate granularity is very hard, not only in terms of coming up with the right solution but also in terms of debugging and maintaining that solution. Debugging a multi-threaded application is not fun, and problems are hard to reproduce. Sometimes you don't have an option, but sometimes you do. So, is there any particular reason why you're not using or considering a database-based solution? A database will handle concurrency for you; you don't need to do anything. If you are worried about the cost of purchasing a database, there are very good, proven open-source databases out there, such as MySQL and PostgreSQL, that won't cost you anything.
Another problem with the xml-file-based approach is that updating the files will be costly. You will be loading the xml from a user file into memory, creating a message element, and saving it back to the file. As that xml grows, the process will take longer, require more memory, etc. It will also hurt your scalability because the update process will hold onto the lock longer. Plus, I/O is expensive. There are also benefits that come with a database-based solution: transactions, backups, being able to easily query your data, replication, mirroring, etc.
I don't know your requirements and constraints but I do think that file-based solution will be problematic going forward.
You need to read the file before adding to it and writing to disk, so you do have a (fairly small) risk of attempting two overlapping operations - the second operation reads from disk before the first operation has written to disk, and the first message will be overwritten when the second message is committed.
A simple answer might be to queue your messages to ensure that they are processed serially. When the messages are received by your service, just dump the contents into an MSMQ queue. Have another single-threaded process which reads from the queue and writes the appropriate changes to the xml file. That way you can ensure you only write one file at a time and resolve any concurrency issues.
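A loose sketch of the single-threaded consumer side of that suggestion, assuming System.Messaging with a private queue and string message bodies; the queue path and body format are made up for illustration:

using System.Messaging;

class MessageFileWriter
{
    const string QueuePath = @".\private$\userMessages"; // hypothetical queue name

    public void Run()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            while (true)
            {
                Message msg = queue.Receive();  // blocks until the WCF service enqueues something
                string body = (string)msg.Body; // e.g. "sender|recipient|text"
                WriteMessageToXmlFile(body);    // only this thread ever touches the xml files
            }
        }
    }

    void WriteMessageToXmlFile(string body)
    {
        // parse the body and append the message to the recipient's xml file
    }
}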
The basic problem is when you access a global resource (like a static variable, or a file on the filesystem) you need to make sure you lock that resource or serialize access to it somehow.
My suggestion here (if you want to just get it done quickly without using a database or anything, which would be better) would be to insert your messages into a Queue structure in memory from your service code.
public class MyService : IMyService
{
    public static Queue queue = new Queue();

    public void SendMessage(string from, string to, string message)
    {
        Queue syncQueue = Queue.Synchronized(queue);
        syncQueue.Enqueue(new Message(from, to, message));
    }
}
Then somewhere else in your app you can create a background thread that reads from that queue and writes to the filesystem one update at a time.
void Main()
{
    // System.Timers.Timer; AutoReset is off so the handler re-arms it after each drain
    var timer = new System.Timers.Timer(1000);
    timer.AutoReset = false;
    timer.Elapsed += (o, e) =>
    {
        Queue syncQueue = Queue.Synchronized(MyService.queue);
        while (syncQueue.Count > 0)
        {
            Message message = syncQueue.Dequeue() as Message;
            WriteMessageToXMLFile(message);
        }
        timer.Start();
    };
    timer.Start();

    //Or whatever you do here
    StartupService();
}
It's not pretty (and I'm not 100% sure it compiles) but it should work. It sort of follows the "get it done with the tools I have, not the tools I want" kind of approach I think you are looking for.
The clients are also off the line as soon as possible, rather than waiting for the file to be written to the filesystem before they disconnect. This can also be bad... clients might not know their message didn't get delivered should your app go down after they disconnect and the background thread hasn't written their message yet.
Other approaches on here are just as valid... I wanted to post the serialization approach, rather than the locking approach others have suggested.
HTH,
Anderson
Well, it just so happens that I've done something almost exactly the same, except that it wasn't actually messages...
Here's how I'd handle it.
Your service itself talks to a central object (or objects), which can dispatch message requests based on the sender.
The object relating to each sender maintains an internal lock while updating anything. When it gets a new request for a modification, it then can read from disk (if necessary), update the data, and write to disk (if necessary).
Because different updates will be happening on different threads, the internal lock ensures they are serialized. Just be sure to release the lock before you call any 'external' objects, to avoid deadlock scenarios.
If I/O becomes a bottleneck, you can look at different strategies involving putting messages in one file, separate files, not immediately writing them to disk, etc. In fact, I'd think about storing the messages for each user in a separate folder for exactly that reason.
The biggest point is that each service instance acts, essentially, as an adapter to the central class, and that only one instance of one class will ever be responsible for reading/writing messages for a given recipient. Other classes may request a read/write, but they do not actually perform it (or even know how it's performed). This also means that their code is going to look like 'AddMessage(message)', not 'SaveMessages(GetMessages.Add(message))'.
That said, using a database is a very good suggestion, and will likely save you a lot of headaches.