WCF service with XML based storage. Concurrency issues?


I programmed a simple WCF service that stores messages sent by users and sends these messages to the intended user when asked for. For now, the persistence is implemented by creating username.xml files with the following structure:
<messages recipient="username">
<message sender="otheruser">
...
</message>
</messages>
It is possible for more than one user to send a message to the same recipient at the same time, possibly causing the xml file to be updated concurrently. The WCF service is currently implemented with basicHttp binding, without any provisions for concurrent access.
What concurrency risks are there? How should I deal with them? A ReadWrite lock on the xml file being accessed?
Currently the service runs with 5 users at the most, this may grow up to 50, but no more.
EDIT:
As stated above, a new instance of the service class is created for every call the client makes (InstanceContextMode is PerCall, so ConcurrencyMode is irrelevant). This is inherent to the use of basicHttpBinding with default settings on the service.
The code below:
public class SomeWCFService:ISomeServiceContract
{
ClassThatTriesToHoldSomeInfo useless;
public SomeWCFService()
{
useless=new ClassThatTriesToHoldSomeInfo();
}
#region Implementation of ISomeServiceContract
public void IncrementUseless()
{
useless.Counter++;
}
#endregion
}
behaves as if it were written:
public class SomeWCFService:ISomeServiceContract
{
ClassThatTriesToHoldSomeInfo useless;
public SomeWCFService()
{}
#region Implementation of ISomeServiceContract
public void IncrementUseless()
{
useless=new ClassThatTriesToHoldSomeInfo();
useless.Counter++;
}
#endregion
}
So concurrency is never an issue until you try to access some externally stored data as in a database or in a file.
The downside is that you cannot store any data between method calls of the service unless you store it externally.

If your WCF service is a singleton service and guaranteed to stay that way, then you don't need to do anything. With the default ConcurrencyMode.Single, WCF allows only one request at a time to be processed on that instance, so concurrent access to the username files is not an issue unless the operation that serves the request spawns multiple threads that access the same file. However, as you can imagine, a singleton service is not very scalable and, I assume, not something you want in your case.
If your WCF service is not a singleton, then concurrent access to the same user file is a very realistic scenario and you must definitely address it. Multiple instances of your service may concurrently attempt to access the same file to update it, and you will get a 'cannot access file because it is being used by another process' exception or something like that. This means that you need to synchronize access to user files. You can use a monitor (lock), ReaderWriterLockSlim, etc. However, you want this lock to operate on a per-file basis. You don't want to lock out updates on other files while an update on a different file is going on. So you will need to maintain a lock object per file and lock on that object, e.g.
//when a new userfile is added, create a new sync object
fileLockDictionary.Add("user1file.xml",new object());
//when updating a file
lock(fileLockDictionary["user1file.xml"])
{
//update file.
}
Note that that dictionary is also a shared resource that will require synchronized access.
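One way to cover both points (one lock object per file, plus safe access to the dictionary itself) is a ConcurrentDictionary. The following is only a minimal sketch: FileLocks and FileLocksDemo are hypothetical names, and the locks only protect callers inside a single process, so it does not help if several worker processes write the same files.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: hands out one lock object per file name.
public static class FileLocks
{
    // ConcurrentDictionary makes the lookup itself thread-safe,
    // so no extra lock is needed around the dictionary.
    private static readonly ConcurrentDictionary<string, object> Locks =
        new ConcurrentDictionary<string, object>(StringComparer.OrdinalIgnoreCase);

    // Returns the same object for the same file name on every call.
    public static object For(string fileName)
    {
        return Locks.GetOrAdd(fileName, _ => new object());
    }
}

public static class FileLocksDemo
{
    // Runs `writers` parallel updates under one file's lock and
    // returns how many of them were applied.
    public static int CountUnderLock(string fileName, int writers)
    {
        int counter = 0;
        Parallel.For(0, writers, i =>
        {
            lock (FileLocks.For(fileName))
            {
                counter++; // stands in for "load XML, append message, save"
            }
        });
        return counter;
    }
}
```

Updates to different files take different lock objects and do not block each other; updates to the same file are serialized.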
Now, dealing with concurrency and ensuring synchronized access to shared resources at the appropriate granularity is very hard, not only in terms of coming up with the right solution but also in terms of debugging and maintaining it. Debugging a multi-threaded application is not fun, and the problems are hard to reproduce. Sometimes you don't have an option, but sometimes you do. So, is there any particular reason why you're not using or considering a database-based solution? A database will handle concurrency for you; you don't need to do anything. If you are worried about the cost of purchasing a database, there are very good, proven open source databases out there, such as MySQL and PostgreSQL, that won't cost you anything.
Another problem with the XML file based approach is that updates will be costly. You will be loading the XML from a user file into memory, creating a message element, and saving it back to the file. As that XML grows, the process will take longer, require more memory, etc. It will also hurt your scalability because the update process will hold onto the lock longer. Plus, I/O is expensive. There are also benefits that come with a database-based solution: transactions, backups, being able to easily query your data, replication, mirroring, etc.
I don't know your requirements and constraints but I do think that file-based solution will be problematic going forward.

You need to read the file before adding to it and writing to disk, so you do have a (fairly small) risk of attempting two overlapping operations - the second operation reads from disk before the first operation has written to disk, and the first message will be overwritten when the second message is committed.
A simple answer might be to queue your messages to ensure that they are processed serially. When the messages are received by your service, just dump the contents into an MSMQ queue. Have another single-threaded process which reads from the queue and writes the appropriate changes to the xml file. That way you can ensure you only write one file at a time and resolve any concurrency issues.

The basic problem is when you access a global resource (like a static variable, or a file on the filesystem) you need to make sure you lock that resource or serialize access to it somehow.
My suggestion here (if you want to just get it done quick without using a database or anything, which would be better) would be to insert your messages into a Queue structure in memory from your service code.
public class MyService : IMyService
{
public static Queue queue = new Queue();
public void SendMessage(string from, string to, string message)
{
Queue syncQueue = Queue.Synchronized(queue);
syncQueue.Enqueue(new Message(from, to, message));
}
}
Then somewhere else in your app you can create a background thread that reads from that queue and writes to the filesystem one update at a time.
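void Main()
{
// System.Timers.Timer raises Elapsed on a thread-pool thread.
var timer = new System.Timers.Timer(1000); // polling interval in ms (arbitrary value)
timer.AutoReset = false; // re-armed manually below so two runs never overlap
timer.Elapsed += (o, e) =>
{
Queue syncQueue = Queue.Synchronized(MyService.queue);
while (syncQueue.Count > 0)
{
Message message = syncQueue.Dequeue() as Message;
WriteMessageToXMLFile(message);
}
timer.Start();
};
timer.Start();
//Or whatever you do here
StartupService();
}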
It's not pretty (and I'm not 100% sure it compiles) but it should work. It sort of follows the "get it done with the tools I have, not the tools I want" kind of approach I think you are looking for.
The clients are also off the line as soon as possible, rather than waiting for the file to be written to the filesystem before they disconnect. This can also be bad... clients might not know their message didn't get delivered should your app go down after they disconnect and the background thread hasn't written their message yet.
Other approaches on here are just as valid... I wanted to post the serialization approach, rather than the locking approach others have suggested.
HTH,
Anderson

Well, it just so happens that I've done something almost exactly the same, except that it wasn't actually messages...
Here's how I'd handle it.
Your service itself talks to a central object (or objects), which can dispatch message requests based on the sender.
The object relating to each sender maintains an internal lock while updating anything. When it gets a new request for a modification, it then can read from disk (if necessary), update the data, and write to disk (if necessary).
Because different updates will be happening on different threads, the internal lock serializes them. Just be sure to release the lock before you call any 'external' objects, to avoid deadlock scenarios.
If I/O becomes a bottleneck, you can look at different strategies involving putting messages in one file, separate files, not immediately writing them to disk, etc. In fact, I'd think about storing the messages for each user in a separate folder for exactly that reason.
The biggest point is, that each service instance acts as, essentially, an adapter to the central class, and that only one instance of one class will ever be responsible for reading/writing messages for a given recipient. Other classes may request a read/write, but they do not actually perform it (or even know how it's performed). This also means that their code is going to look like 'AddMessage(message)', not 'SaveMessages(GetMessages.Add(message))'.
That said, using a database is a very good suggestion, and will likely save you a lot of headaches.
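A minimal sketch of such a central class, assuming the username.xml layout from the question. MessageStore, AddMessage and CountMessages are hypothetical names, and the whole read-modify-write cycle runs inside the recipient's lock so that two senders cannot overwrite each other's message.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Xml.Linq;

// Hypothetical central class: the only code that touches the XML files.
public sealed class MessageStore
{
    private readonly string _folder;
    private readonly ConcurrentDictionary<string, object> _locks =
        new ConcurrentDictionary<string, object>();

    public MessageStore(string folder)
    {
        _folder = folder;
        Directory.CreateDirectory(folder);
    }

    // Load, modify and save all happen while holding the recipient's lock.
    public void AddMessage(string recipient, string sender, string text)
    {
        string path = Path.Combine(_folder, recipient + ".xml");
        lock (_locks.GetOrAdd(recipient, _ => new object()))
        {
            XDocument doc = File.Exists(path)
                ? XDocument.Load(path)
                : new XDocument(new XElement("messages",
                      new XAttribute("recipient", recipient)));
            doc.Root.Add(new XElement("message",
                new XAttribute("sender", sender), text));
            doc.Save(path);
        }
    }

    // Counts the stored messages for one recipient under the same lock.
    public int CountMessages(string recipient)
    {
        string path = Path.Combine(_folder, recipient + ".xml");
        lock (_locks.GetOrAdd(recipient, _ => new object()))
        {
            return File.Exists(path)
                ? XDocument.Load(path).Root.Elements("message").Count()
                : 0;
        }
    }
}
```

Each PerCall service instance would hold a reference to one shared MessageStore (for example through a static field) and only ever call AddMessage on it.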

Related

Does calling the same HttpRequestMethod wait for the existing one to finish

I have a work agent that runs every 60 seconds, checking processing tables for new work and processing it if any is available. During processing the agent uses a static TraceHelper class for logging purposes. At the point of writing to the log file I also send a WebRequest to an external API to ship the log entry to Logstash.
The WebRequest essentially sends off a JSON object for each Writeline. Obviously, for logging purposes, order is important, so my question is: even though I am only calling one POST HttpWebRequest, this is happening hundreds of times a minute. Should I be worried about syncing issues? Is it possible that the second Writeline request gets called and processed by the HttpWebRequest before the first Writeline has had a chance to send? Or am I looking at this the wrong way?
Note: below is semi pseudo code
Say I have the below
Tracehelper.Writeline("foo");
Tracehelper.Writeline("baa");
static Tracehelper(){}
public static void Writeline(string msg)
{
File.WriteToFile(msg);
WebProcessHelper.SendLog(msg);
}
static WebProcessHelper() {}
public static void SendLog(string msg)
{
SendHttpRequest(msg);
}
Is there a potential that "baa" is sent ahead of "foo"?

I don't know what classes you are actually using for the web request, but I'll assume it's the default async ones and answer based on that. If you are using something else, the answer may vary.
Yes and no. The requests will all be fired by the client in order. However if you are doing this hundreds of times per second, you may have the messages accepted by the server on the other end not in the same order since they may stack up and exceed the handling capacity of the server. Totally depends on the number of connections, and the server software used.
All that said, making hundreds of HTTP calls a second is an awful idea. If you need to log, use a logging framework that lets you buffer and batch up the logging into a request every second or two. This will greatly save you bandwidth, add scalability, and relieve CPU load on the server you are logging to.
I can heartily recommend either NLog or the Semantic Logging Application Block from MS. I have used both, and they are both really flexible and handle load well.

It's safe to assume that there's the possibility, especially if you're dealing with multiple threads running at once. A safe option would be to set up a separate thread with a BlockingCollection or something that simply pulls strings out of the BlockingCollection and sends them off to logstash.
That way you could have it such that anyone can write something to the BlockingCollection, and they'll all get sent to the remote server in the right order.
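A minimal sketch of that idea, assuming a single consumer thread. LogShipper is a hypothetical name, and the send delegate stands in for whatever actually posts to Logstash. BlockingCollection is backed by a FIFO queue by default, so one consumer sends entries in the order they were added.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shipper: many producers, one consumer that sends in queue order.
public sealed class LogShipper : IDisposable
{
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>();
    private readonly Task _worker;

    // `send` stands in for the real HTTP call.
    public LogShipper(Action<string> send)
    {
        _worker = Task.Factory.StartNew(() =>
        {
            // Blocks until an item arrives; the loop ends once CompleteAdding
            // has been called and the queue has been drained.
            foreach (string line in _queue.GetConsumingEnumerable())
            {
                send(line);
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Safe to call from any thread.
    public void Enqueue(string line)
    {
        _queue.Add(line);
    }

    // Stops accepting entries and waits for the backlog to be sent.
    public void Dispose()
    {
        _queue.CompleteAdding();
        _worker.Wait();
        _queue.Dispose();
    }
}
```

Writeline would then call Enqueue instead of sending the request itself, which also takes the HTTP round trip off the calling thread.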

Handling limitations in multithreaded server

In my client-server architecture I have a few API functions whose usage needs to be limited.
The server is written in C# (.NET) and runs on IIS.
Until now I didn't need to perform any synchronization. The code was written in such a way that even if a client sent the same request multiple times (e.g. a "create something" request), one call would end with success and all the others with an error (because of the server code + db structure).
What is the best way to enforce such limits? For example, I want no more than 1 call of the API method foo() per user per minute.
I thought about some SynchronizationTable which would have just one column, unique_text, and before computing the foo() call I'd write something like foo{userId}{date}{HH:mm} to this table. If the write succeeds, I know that there wasn't a foo call from that user in the current minute.
I think there is a much better way, probably in server code, without using the db for that. Of course, there could be thousands of users calling foo.
To clarify what I need: I think it could be some light DictionaryMutex.
For example:
private static DictionaryMutex FooLock = new DictionaryMutex();
FooLock.lock(User.GUID);
try
{
...
}
finally
{
FooLock.unlock(User.GUID);
}
EDIT:
Solution in which one user cannot call foo twice at the same time is also sufficient for me. By "at the same time" I mean that server started to handle second call before returning result for first call.
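For illustration only, both variants described above can be sketched in memory with ConcurrentDictionary.TryAdd, which succeeds for exactly one caller per key. PerUserGate and PerMinuteLimiter are hypothetical names; the state lives in one worker process, so it is lost on a recycle and not shared between servers, and the limiter never evicts old keys.

```csharp
using System;
using System.Collections.Concurrent;
using System.Globalization;

// Hypothetical guard: at most one running call per user.
public sealed class PerUserGate
{
    private readonly ConcurrentDictionary<Guid, byte> _inFlight =
        new ConcurrentDictionary<Guid, byte>();

    // True if the caller now owns the slot; false if a call is already running.
    public bool TryEnter(Guid userId)
    {
        return _inFlight.TryAdd(userId, 0);
    }

    // Must be called in a finally block by the caller that entered.
    public void Exit(Guid userId)
    {
        byte ignored;
        _inFlight.TryRemove(userId, out ignored);
    }
}

// Hypothetical limiter: at most one call per user per calendar minute.
// Mirrors the foo{userId}{date}{HH:mm} idea; old keys are never evicted here.
public sealed class PerMinuteLimiter
{
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();

    public bool TryAcquire(Guid userId, DateTime utcNow)
    {
        string key = userId.ToString("N") +
            utcNow.ToString("yyyyMMddHHmm", CultureInfo.InvariantCulture);
        return _seen.TryAdd(key, 0);
    }
}
```

A call that fails TryEnter or TryAcquire would be rejected with an error instead of waiting, which matches the "one succeeds, the others fail" behaviour described in the question.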

Note that keeping this state in memory in an IIS worker process opens up the possibility of losing all this data at any instant. Worker processes can restart for any number of reasons.
Also, you probably want to have two web servers for high availability. Keeping the state inside worker processes means the application is no longer clustering-ready. This is often a no-go.
Web apps really should be stateless, for many reasons. If you can help it, don't manage your own data structures as suggested in the question and comments.
Depending on how big the call volume is, I'd consider these options:
SQL Server. Your queries are extremely simple and easy to optimize for. Expect 1000s of such queries per second per CPU core. This can bear a lot of load. You can use SQL Server Express for free.
A specialized store like Redis. Stack Overflow is using Redis as a persistent, clustering-enabled cache. A good idea.
A distributed cache, like Microsoft Velocity. Or others.
This storage problem is rather easy because it fits a key/value store model well. And the data is nearly worthless, so you don't even need to back it up.
I think you're overestimating how costly this rate limitation will be. Your web-service is probably doing a lot more costly things than a single UPDATE by primary key to a simple table.

More appropriate for my task: background worker or thread pool?

I have a simple web application module which basically accepts requests to save a zip file on PageLoad from a mobile client app.
Now, what I want to do is unzip the file, read the file inside it and process it further, including making entries in a database.
Update: the zip file and its contents will be fairly small, so the server shouldn't be burdened with much load.
Update 2: I just read about when IIS queues requests (at global/app level). So does that mean that I don't need to implement a complex request handling mechanism and IIS can take care of the app by itself?
Update 3: I am looking to offload the processing of the downloaded zip not only to minimize the overhead (in terms of performance) but also to avoid table locking when the file is processed and records are updated in the same table. With multiple devices requesting the page and the background task updating the database in parallel, an exception would be thrown.
As of now I have zeroed on two solutions:
To implement a concurrent/message queue
To implement the file processing code into a separate tool and schedule a job on the server to check for non-processed file(s) and process them serially.
I am inclined towards a queuing mechanism and will try to implement it, as it seems less dependent on configuration than manually configuring the job/schedule on the server side.
So, what do you guys recommend me for this purpose?
Moreover after the zip file is requested and saved on server side, the client & server side connection is released after doing so. Not looking to burden my IIS.
Imagine a couple of hundred clients simultaneously requesting the page..
I haven't actually used either of them before, so any samples or how-tos would be much appreciated.

I'd recommend TPL and Rx Extensions: you make your unzipped file list an observable collection and for each item start a new task asynchronously.

I'd suggest a queue system.
When you receive a file, you save its path into a thread-synchronized queue. Meanwhile a background worker (or preferably another machine) checks this queue for new files and dequeues each entry to handle it.
This way you won't launch an unknown number of threads (one per zip file) and can handle the zip files in one location. It also makes it easier to move your zip-handling code to another machine when the load gets too heavy. You just need to access a common queue.
The easiest would probably be to use a static Queue with a lock object. It is the easiest to implement and does not require external resources. But this will result in the queue being lost when your application recycles.
You mentioned that losing zip files was not an option, so this approach is not the best one, even if you would rather not rely on external resources. Depending on your load, it may be worth using external resources: upload the zip file to common storage on another machine and add a message to a queue on another machine.
Here's an example with a local queue:
ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
void GotNewZip(string pathToZip)
{
queue.Enqueue(pathToZip); // Added a new work item to the queue
}
void MethodCalledByWorker()
{
while (true)
{
if (queue.IsEmpty)
{
// Supposedly no work to be done, wait a few seconds and check again (new iteration)
Thread.Sleep(TimeSpan.FromSeconds(5));
continue;
}
string pathToZip;
if (queue.TryDequeue(out pathToZip)) // If TryDequeue returns false, another thread has already dequeued the last element
{
HandleZipFile(pathToZip);
}
}
}
This is a very rough example. Whenever a zip arrives, you add the path to the queue. Meanwhile a background worker (or several, since the example is thread-safe) will handle one zip after another, getting the paths from the queue. With a single worker, the zip files are handled in the order they arrive.
You need to make sure that your application does not recycle meanwhile. But that's the case with all resources you have on the local machine, they'll be lost when your machine crashes.

I believe you are optimising prematurely.
You mentioned table-locking - what kind of db are you using? If you add new rows or update existing ones most modern databases in most configurations will:
use row-level locking; and
be fast enough without you needing to worry about locking.
I suggest starting with a simple method
//Unzip
//Do work
//Save results to database
and get some proof it's too slow.

Multithreading with Filesystem watcher and/or MSMQ wcf service

I need to create a service which is basically responsible for the following:
Watch a specific folder for any new files created.
If there is one, read that file, process it and save the data in the DB.
For the above task, I am thinking of creating a multi threaded service with either of the following approach:
In the main thread, create an instance of filesystem watcher and as soon as a new file is created, add that file in the threadQueue. There will be N no. of consumer threads running which should take a file from the queue and process it (i.e step 2).
Again in the main thread, create an instance of filesystem watcher and as soon as a new file is created, read that file and add the data to MSMQ using wcf MSMQ service. When the message is read by the wcf msmq service, it will be responsible for processing further
I am a newbie when it comes to creating a multi-threaded service, so I am not sure which will be the best option. Please guide me.
Thanks,

First off, let me say that you have taken a wise approach to do a single producer - multiple consumer model. This is the best approach in this case.
I would go for option 1, using a ConcurrentQueue data structure, which provides you an easy way to queue tasks in a thread-safe manner. Alternatively, you can simply use the ThreadPool.QueueUserWorkItem method to send work directly to the built-in thread pool, without worrying about managing the workers or the queue explicitly.
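A minimal sketch of option 1 follows, using a BlockingCollection (which wraps a ConcurrentQueue by default) so that the consumers can block instead of polling. FolderProcessor and the processFile delegate are hypothetical. Note that the Created event fires when the file first appears, which may be before the writer has finished writing it.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical single-producer / N-consumer pipeline for new files.
public sealed class FolderProcessor : IDisposable
{
    private readonly BlockingCollection<string> _files =
        new BlockingCollection<string>();
    private readonly Task[] _consumers;
    private FileSystemWatcher _watcher;

    public FolderProcessor(int consumerCount, Action<string> processFile)
    {
        _consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            _consumers[i] = Task.Factory.StartNew(() =>
            {
                // Each consumer blocks here until a path is available.
                foreach (string path in _files.GetConsumingEnumerable())
                {
                    processFile(path);
                }
            }, TaskCreationOptions.LongRunning);
        }
    }

    // Producer side: the watcher only enqueues, it never processes.
    public void Watch(string folder)
    {
        _watcher = new FileSystemWatcher(folder);
        _watcher.Created += (s, e) => Enqueue(e.FullPath);
        _watcher.EnableRaisingEvents = true;
    }

    public void Enqueue(string path)
    {
        try { _files.Add(path); }
        catch (InvalidOperationException) { /* shutting down, queue is closed */ }
    }

    // Stops watching, lets the consumers drain the queue, then cleans up.
    public void Dispose()
    {
        if (_watcher != null) _watcher.Dispose();
        _files.CompleteAdding();
        Task.WaitAll(_consumers);
        _files.Dispose();
    }
}
```

The number of consumers is fixed up front, so a burst of new files lengthens the queue rather than the thread count.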
Edit: Regarding the reliability of FileSystemWatcher, MSDN says:
The Windows operating system notifies your component of file changes
in a buffer created by the FileSystemWatcher. If there are many
changes in a short time, the buffer can overflow. This causes the
component to lose track of changes in the directory, and it will only
provide blanket notification. Increasing the size of the buffer with
the InternalBufferSize property is expensive, as it comes from
non-paged memory that cannot be swapped out to disk, so keep the
buffer as small yet large enough to not miss any file change events.
To avoid a buffer overflow, use the NotifyFilter and
IncludeSubdirectories properties so you can filter out unwanted change
notifications.
So it depends on how often changes will occur and how much buffer you are allocating.

I would also consider your demands for failure handling and sizes of the files you are sending.
Whether you decide for option 1 or 2 will be dependent on specifications.
Option 2 has the advantage that by using MSMQ you have your data persisted in a recoverable way, even if you need to restart your machine. Option 1 only has your data in memory, where it might get lost.
On the other hand, option 2 has the disadvantage that the message size of MSMQ is limited to 4 MB per message (explanation in a Microsoft blog here), and therefore only half of that when working with Unicode characters, while the in-memory queues are capable of much bigger sizes.
[Edit]
Thinking a bit longer, I would prefer option 2.
In your comment, you mention that you want to move files around in the filesystem. This can be very expensive in terms of performance, and even worse if you move the files between different partitions.
I have used MSMQ in multiple projects at work and am convinced that it would work well for what you want to do. A big advantage here is that MSMQ supports transactional communication. That means that if a network, power or other failure occurs, neither your message nor your files get lost.
If any of those happens while you are moving a file around, the file could easily get corrupted.
The only thing that worries me is the file sizes. To work around the message size limit of 4 MB (see the link above), I would not put the file content into a message. Instead, I would only send an ID or a file path so that the consuming service can find the file and read it when needed.
This keeps the message and queue sizes small and avoids using too much bandwidth or memory on the network and on your server(s).

Chat logs management, performance-wise

I've got an application receiving messages from different sources (chat rooms and private chats). Multiple instances of the application can be opened, and the final result should be something similar to the following scenario:
Currently, each application saves logs in a directory whose name is the account used to log on to the chat server. While that's not a problem for private chat sources (unique to each application instance), it's useless to have the same logs saved multiple times for common chat rooms. Logs are saved in plain text format, so they can be accessed and read without going through the application.
If I don't save the logs in separate folders, I might get I/O exceptions due to accessing the same file simultaneously from multiple processes, and I'd need to verify whether the line about to be saved hasn't already been written by other applications. I need to optimize the whole operation and try to maintain code readability.
Besides, my current approach for writing lines is the following:
public void Write(string message)
{
using (var writer = new StreamWriter(_fileName, File.Exists(_fileName)))
writer.WriteLine(message);
}
Which, considering that logs are constantly written, might not be the most efficient solution.
Summed up, my questions are:
How do I create a unique log folder/database, maintaining the format (plain text) but solving the aforementioned duplicates/access problem?
How do I improve, if possible, the writing method? Remember that logs need to be constantly written, but that closing the StreamWriter when the application exits would not be a proper solution, as the application is meant to run for a long time.
Thank you.

I would come up with a simple solution, which might be appropriate for your needs, not entirely sure though.
My approach would be to use a single file for each chat session/room. If such a session is started, the application tries to create/open that file and creates a write lock for that file. If it gets an IOException (because the file is locked) it can simply skip logging completely.
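A minimal sketch of that idea, assuming Windows file-sharing semantics: the first instance opens the file for writing with FileShare.Read, so a second writer's open fails with an IOException and that instance simply does not log. RoomLog and IsOwner are hypothetical names.

```csharp
using System;
using System.IO;

// Hypothetical per-room log: only the first process to open the file writes to it.
public sealed class RoomLog : IDisposable
{
    private readonly StreamWriter _writer; // null when another instance owns the file

    public RoomLog(string path)
    {
        try
        {
            // FileShare.Read: other handles may open the file for reading only,
            // so a second open for writing fails (on Windows).
            var stream = new FileStream(path, FileMode.Append,
                                        FileAccess.Write, FileShare.Read);
            _writer = new StreamWriter(stream) { AutoFlush = true };
        }
        catch (IOException)
        {
            _writer = null; // someone else is already logging this room
        }
    }

    public bool IsOwner { get { return _writer != null; } }

    // Silently skips the line when this instance is not the owner.
    public void Write(string message)
    {
        if (_writer != null) _writer.WriteLine(message);
    }

    public void Dispose()
    {
        if (_writer != null) _writer.Dispose();
    }
}
```

Keeping the stream open for the lifetime of the session also avoids reopening the file for every line. One gap in this approach: if the owning instance exits, the remaining instances would have to retry opening the file to take over.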

To be honest, if I were you I would be looking at existing open source frameworks, e.g. NLog. It's fast enough and supports asynchronous logging, so it should do exactly what you're looking for.

Not sure, if I should write this as an answer or a comment, but might need the room:
You mentioned your sketch showing the desired result, but as I said this will prevent you from deduping if you don't couple the instances. So here is what I would suggest:
You create two applications: LogWriter, which is a singleton and sits at the bottom of your sketch
LogProcessor, which is the application instances in your sketch.
Upon startup of a LogProcessor instance, it spawns a LogWriter or connects to it, if it is already running.
LogProcessor handles the incoming log requests, maybe preprocesses them if you need to do so, then sends them on to the LogWriter as a tuple of Timestamp, ChatroomID, UserID (need not be unique), text, and maybe a hash for easier deduping. Calculating the hash in the instances makes better use of multiple cores
The LogWriter keeps a hanging list sorted by timestamp, containing the hashes, so it is able to quickly discard the duplicate items
For the rest of the items, LogWriter determines the logfile path. If a stream is already open to that path, it writes the item out, updates the LastUsed Timestamp on that stream and is done
If no stream is open, LogWriter opens one, then writes.
If the max number of streams is reached, it closes the oldest stream (as by the above mentioned LastUsed Timestamp) and opens the needed new stream instead.
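The LogWriter side of this could be sketched roughly as follows. This is a simplification under stated assumptions: the class shape, the string hash and the Write signature are hypothetical, the set of seen hashes is never trimmed (the answer suggests a timestamp-sorted list so old entries can be dropped), and a counter stands in for the LastUsed timestamp.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical writer: drops duplicates and keeps a bounded set of open streams.
public sealed class LogWriter : IDisposable
{
    private sealed class OpenLog
    {
        public StreamWriter Writer;
        public long LastUsed;
    }

    private readonly object _sync = new object();
    private readonly HashSet<string> _seenHashes = new HashSet<string>();
    private readonly Dictionary<string, OpenLog> _open =
        new Dictionary<string, OpenLog>();
    private readonly int _maxStreams; // assumed to be at least 1
    private long _clock;              // stands in for the LastUsed timestamp

    public LogWriter(int maxStreams) { _maxStreams = maxStreams; }

    // Returns false when the hash was seen before and the item was discarded.
    public bool Write(string path, string hash, string text)
    {
        lock (_sync)
        {
            if (!_seenHashes.Add(hash)) return false;

            OpenLog log;
            if (!_open.TryGetValue(path, out log))
            {
                if (_open.Count >= _maxStreams)
                {
                    // Close the least recently used stream to make room.
                    var oldest = _open.OrderBy(p => p.Value.LastUsed).First();
                    oldest.Value.Writer.Dispose();
                    _open.Remove(oldest.Key);
                }
                log = new OpenLog { Writer = new StreamWriter(path, true) };
                _open[path] = log;
            }
            log.Writer.WriteLine(text);
            log.LastUsed = ++_clock;
            return true;
        }
    }

    public int OpenStreamCount
    {
        get { lock (_sync) { return _open.Count; } }
    }

    // Flushes and closes every stream that is still open.
    public void Dispose()
    {
        lock (_sync)
        {
            foreach (OpenLog log in _open.Values) log.Writer.Dispose();
            _open.Clear();
        }
    }
}
```

How the LogProcessor instances reach the single LogWriter (named pipes, sockets, and so on) is left out of the sketch.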

Perhaps change your application design in such a way that a logger is attached to a chat room not to a user.
When users enter a chat room, the chatroom will pass a logger object to the users.
This way all users will use the same logger. The problem then becomes: 1 consumer (the logger), and multiple producers (all those users who want to log).
See my reply to the post "Write to FileOutputStream from multiple threads in Java": https://stackoverflow.com/a/8422621/1007845
