I'm designing a system to process a large number of jobs, using MSMQ to scale out. Each job is processed, and the database record for that job Id is updated to mark it complete. On error, the job should go back to the queue, so I need transactional MSMQ. Now, the job could process successfully and yet the database update could fail for whatever reason. In that case too, I need the job back in the queue so it can be re-attempted and eventually recorded in the database as successful. This means I would need to enable MS DTC to manage a transaction spanning the queue and the database server.
I am reading the Pro MSMQ book, and it says: "For most applications, transactions are not needed at all. There is a tendency to overuse transactions and affect the performance of the entire system unnecessarily. Before deciding to use transactions, analyze the ACID properties requirement of the entire system and the performance impact."
I'm failing to understand which cases would not involve a database update. Wouldn't any record picked up from a queue need to be processed and recorded somewhere? I would think about 90% of systems need this type of functionality. I can see it for a middle-tier queue that just passes messages on to another queue or something like that, but for processing the last record in the pipeline, isn't MS DTC always required?
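For concreteness, here's roughly the flow I mean. This is only a sketch of a DTC-coordinated receive-and-update, assuming a transactional queue; the queue path, the Jobs table, and carrying the job Id in the message label are all made-up illustrations:

```csharp
using System;
using System.Data.SqlClient;
using System.Messaging;
using System.Transactions;

static class JobProcessor
{
    static void ProcessOneJob(string connectionString)
    {
        using (var queue = new MessageQueue(@".\private$\jobs")) // transactional queue
        using (var scope = new TransactionScope())               // escalates to MS DTC
        {
            // Automatic = enlist the receive in the ambient transaction.
            Message msg = queue.Receive(MessageQueueTransactionType.Automatic);

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "UPDATE Jobs SET Status = 'Complete' WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", msg.Label); // job Id in the label
                conn.Open();
                cmd.ExecuteNonQuery();
            }

            // If anything above throws, both the receive and the update roll
            // back, and the message reappears on the queue for a retry.
            scope.Complete();
        }
    }
}
```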
Thoughts?
Edit: the full text:
Transactional messaging offers a lot of benefits, such as message integrity and message order, over nontransactional messaging, but the performance price you pay for using transactional messaging is huge. Use internal transactions only if it is absolutely essential to maintain the order of messages in the queue. External transactions offer the benefit of propagating the transaction context across multiple resource managers. Such transactions are useful when there is a large-scale distributed system with multiple databases and message queues, and the transactional integrity between the messages exchanged between these resource managers is critical. The overhead incurred while using external transactions is significantly more than the one incurred by internal Message Queuing transactions. For most applications, transactions are not needed at all. There is a tendency to overuse transactions and affect the performance of the entire system unnecessarily. Before deciding to use transactions, analyze the ACID properties requirement of the entire system and the resulting performance impact.
As the book says, the main problem with enlisting in these transactions is that you sacrifice performance at high volumes by locking multiple resources. There are alternative ways to achieve your goals without losing consistency. There is always a trade-off between consistency, availability, and partition tolerance (read about the CAP theorem), and you need to decide which attributes of the system are needed to successfully meet the business requirements.
To tackle your problem without a transaction: instead of popping the message off the queue, you can Peek the message; if your processing succeeds, you pop the message and discard it, and if the processing fails, you move the message to an error queue. Error queues can be retried automatically (the failure could be a transitory issue). The error queue(s) should be monitored actively to ensure your system is processing correctly, i.e. you need alerts to fire if an error queue exceeds a threshold or is growing at an unusual rate.
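A minimal sketch of that peek-then-receive pattern, assuming private queues named jobs and jobs_error (the paths and the ProcessJob method are illustrative):

```csharp
using System;
using System.Messaging;

class PeekThenReceiveWorker
{
    private readonly MessageQueue _workQueue  = new MessageQueue(@".\private$\jobs");
    private readonly MessageQueue _errorQueue = new MessageQueue(@".\private$\jobs_error");

    public void ProcessNext()
    {
        // Peek leaves the message on the queue; nothing is lost if we crash here.
        Message peeked = _workQueue.Peek(TimeSpan.FromSeconds(5));

        try
        {
            ProcessJob(peeked);                 // your processing + db update
            _workQueue.ReceiveById(peeked.Id);  // success: now actually remove it
        }
        catch (Exception)
        {
            // Failure: pull the message off and park it on the error queue.
            _errorQueue.Send(_workQueue.ReceiveById(peeked.Id));
        }
    }

    private void ProcessJob(Message message) { /* domain logic goes here */ }
}
```

Note that ReceiveById assumes the peeked message is still in the queue, which holds in the one-processor-per-queue setup described next.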
Note that, as you have commented, this approach won't work for a single queue with multiple processors. It works where you partition your data and messages and bind one processor to each queue.
E.g. say I'm doing central sales processing for a chain of retailers. I might decide to process east coast retailers on queue QA and west coast retailers on queue QB. I have processor PA bound to queue QA (it can be multiple exes, or multiple threads in a single exe) and processor PB bound to QB. This way messages are processed in order for each distinct entity in the system.
The key is to pick the right data partitioning scheme so that work is spread evenly.
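As a sketch of such a partitioning scheme (the queue paths and the hash function are illustrative, not prescriptive), routing by a stable hash of the entity key keeps all messages for one retailer on one queue:

```csharp
using System;
using System.Messaging;

class PartitionedSender
{
    private readonly MessageQueue[] _queues;

    public PartitionedSender(params string[] queuePaths)
    {
        _queues = Array.ConvertAll(queuePaths, path => new MessageQueue(path));
    }

    public void Send(string entityKey, object body)
    {
        // Same key => same queue => one processor sees that entity's messages in order.
        _queues[Hash(entityKey) % _queues.Length].Send(body, entityKey);
    }

    // Deliberately simple, deterministic hash (string.GetHashCode can vary
    // between runtimes, which would break stable partitioning).
    private static int Hash(string key)
    {
        int h = 0;
        foreach (char c in key) h = h * 31 + c;
        return h & 0x7FFFFFFF;
    }
}
```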
Related
I have researched a lot and I haven't found anything that meets my needs. I'm hoping someone from SO can throw some insight into this.
I have an application where the expected load is thousands of jobs per customer, and I can have hundreds of customers. Currently it is 50 customers, with close to 1,000 jobs each. These jobs are time-sensitive (scheduled by the customer), and each can run up to 15 minutes.
In order to scale and match the schedules, I'm planning to run this multi-threaded on a single server. So far so good. But the business wants to scale further (as needed) by adding more servers into the mix. Currently, when jobs become ready in the database, a console application picks up the first 500 and uses the Task Parallel Library to spawn 10 threads, waiting until they complete. I can't scale this to another server, because that server could pick up the same records. I can't just mark the db record as being processed, because if the application crashes on one server, the job will be left in limbo.
I could use a message queue and have multiple machines pick from it. The problem is that the queue has to be transactional to survive crashes, and since a database is involved, MSMQ would require MS DTC transactions. I'm not really comfortable with DTC transactions, especially with multiple threads and multiple machines: too much maintenance and setup, and possibly unknown issues.
Is SQL Service Broker a good approach instead? Has anyone done something like this in a production environment? I also want to keep the transactions short (a job could run for 15-20 minutes, mostly streaming data from a service). The only reason I'm using a transaction at all is to preserve the message integrity of the queue: I need the job to be picked up again (reappear in the queue) if a processor crashes.
Any words of wisdom?
Why not have an application receive the jobs and insert them into a table that acts as the job queue? Each worker process can then pick up a set of jobs and set their status to processing, then complete the work and set the status to done. Other info, such as the name of the server that processed each job and start and end timestamps, could also be logged. Moreover, instead of using multiple threads, you could use independent worker processes so as to make your programming easier.
[EDIT]
SQL Server supports row-level locking, and lock escalation can also be prevented. See Is it possible to force row level locking in SQL Server?. Using such a mechanism, your worker processes can take exclusive locks on the jobs being processed and hold them until they are done or crash (thereby releasing the lock).
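As a sketch of what claiming a job could look like (the Jobs table, its columns, and the SQL are illustrative assumptions, not taken from the question), UPDLOCK plus READPAST lets concurrent workers each claim a different row instead of blocking on one another:

```csharp
using System;
using System.Data.SqlClient;

static class JobDequeuer
{
    // Returns the claimed job Id, or null when no job is ready.
    public static object TryClaimJob(string connectionString)
    {
        const string sql = @"
            UPDATE TOP (1) Jobs WITH (UPDLOCK, READPAST, ROWLOCK)
            SET    Status     = 'Processing',
                   ServerName = @server,
                   StartedAt  = SYSUTCDATETIME()
            OUTPUT inserted.Id
            WHERE  Status = 'Ready';";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@server", Environment.MachineName);
            conn.Open();
            return cmd.ExecuteScalar(); // READPAST skips rows other workers hold
        }
    }
}
```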
In my scenario, messages arrive in a predefined order. Once a message enters our system, one of multiple receivers picks it up from the incoming queue and processes it. Once processing is done, the processed messages should be sent out in the same order in which they arrived. In a scaled-up system, how do we ensure this?
The system requires high processing speed and throughput. In the .NET world, which queue would be ideal for this scenario?
It's a fundamental problem around ordered delivery - somewhere in your system you need to throttle everything down to a single thread. This is unavoidable.
Where you choose to do this can make a large difference to throughput. You could choose to make your entire message processor single-threaded. This would ensure order was maintained but at a cost of low throughput.
However, you can still process messages concurrently; you then need to somehow reassemble them in the correct order. There is an integration design pattern for this called Resequencer: http://eaipatterns.com/Resequencer.html.
The resequencer pattern relies on you being able to stamp each message with a timestamp or sequence number on its way into your system, if there is nothing already in your messages to indicate ordering.
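A minimal in-memory sketch of a resequencer, assuming messages carry such a sequence number (the buffering and release policy are simplified; a real one would also need a timeout or capacity bound):

```csharp
using System.Collections.Generic;

class Resequencer<T>
{
    private readonly SortedDictionary<long, T> _buffer = new SortedDictionary<long, T>();
    private long _nextExpected;

    public Resequencer(long firstSequence) { _nextExpected = firstSequence; }

    // Accepts a possibly out-of-order message; returns whatever can now be
    // released in the correct order (possibly nothing yet).
    public List<T> Accept(long sequence, T message)
    {
        _buffer[sequence] = message;

        var releasable = new List<T>();
        while (_buffer.TryGetValue(_nextExpected, out T next))
        {
            releasable.Add(next);
            _buffer.Remove(_nextExpected);
            _nextExpected++;
        }
        return releasable;
    }
}
```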
Additionally, is ordered delivery a requirement across the entire message set coming in from the queue? For example, it may be that only some of your messages need to be delivered in order.
Or it could be that you can group your messages into "sets" under a correlating identifier: within each set, order needs to be maintained, but you can still have concurrent processing on a per-set basis.
I'm doing a project with some timing constraints right now. The setup: a web service accepts (tiny) XML files, and I have to process these, fast.
The first and most naive idea was to handle the processing in the request dispatcher itself, but that didn't scale and was doomed from the start.
So now I'm looking at a varying load of incoming requests, each producing ~50 jobs on my side. The technologies available are limited by the customer's rules: if it's not SQL Server or MSMQ, it probably won't fly.
I thought about going down the MSMQ route (the web service just submitting messages, with multiple consumer processes later on), and small proof-of-concept modules worked like a charm.
There's one problem, though: the priority of these jobs might change a lot while they are in the queue. The system is fairly time-critical, so if we cannot, for whatever reason, process incoming jobs in a timely fashion, we need to prefer the latest ones.
Basically, the use case changes from reliable messaging in general to LIFO under (too) heavy load. Old entries still have to be processed, but they have just lost all of their priority.
Is there any manageable way to build something like this in MS MQ?
Expanding on the business side, as requested:
The processing of each incoming job is tied to tracks on which physical goods are moved around. If I cannot process the messages in time, the goods are "gone".
I still want the results for statistical purposes, but I really need to focus on the newer messages now.
Think of me being able to influence mechanical things and reroute items moving on a track, provided they haven't moved past point X yet.
So, if I understand this, you want to be able to switch between sorting the queue by priority OR by arrival time, depending on the situation. MSMQ can only sort the queue by priority AND by arrival time.
Although I understand what you are trying to do, I don't quite see the business justification for it. Can you expand on this?
I would propose using a service to move messages from the incoming queue to a number of work queues for processing. Under normal load, there would be several queues, each with a monitoring thread.
Under heavy load, new traffic would all go to just one "panic" queue until the load dropped. The threads on the other work queues could be paused if necessary.
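A rough sketch of that routing service (the queue paths, the load check, and the round-robin dispatch are all illustrative assumptions):

```csharp
using System;
using System.Messaging;

class QueueRouter
{
    private readonly MessageQueue _incoming = new MessageQueue(@".\private$\incoming");
    private readonly MessageQueue _panic    = new MessageQueue(@".\private$\panic");
    private readonly MessageQueue[] _work =
    {
        new MessageQueue(@".\private$\work0"),
        new MessageQueue(@".\private$\work1"),
    };
    private int _next;

    public void RouteOne(bool underHeavyLoad)
    {
        // Throws MessageQueueException if nothing arrives within the timeout.
        Message msg = _incoming.Receive(TimeSpan.FromSeconds(5));

        if (underHeavyLoad)
        {
            _panic.Send(msg); // newest traffic is concentrated for priority handling
        }
        else
        {
            var queue = _work[_next];            // round-robin across work queues
            _next = (_next + 1) % _work.Length;
            queue.Send(msg);
        }
    }
}
```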
Cheers,
John Breakwell
I need to design a real-time product stock management engine (C# & WCF), but I don't know how to proceed in order to handle concurrent access and data integrity.
Here are some of the features the engine should handle:
Stock Incoming products
Order preparation
Move products from one place to another
...
Should I use MSMQ to ensure a correct stock count (messages processed in order via message polling), or should I use application-level thread locking?
Note that my application has to be real-time: the preparer has to know at any moment how many products are in stock. If there is a shortage of products at picking time, he can send a "request" to an operator.
Use a SQL database. They are already designed with data integrity, concurrency and data storage in mind.
You should probably use a SQL database, as Lee says. If you use a transaction to, e.g., store an order and decrease the available product count (both in the same transaction), the database guarantees atomicity. You probably also want some kind of concurrency mechanism (like a row version) to prevent lost updates (the 1st process reads, the 2nd process updates the same value, then the 1st process updates too, overwriting the previous update based on outdated values).
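A minimal sketch of that row-version check, assuming a Stock table with a rowversion column (the table and column names are invented for illustration). Zero rows affected means someone else updated first, so the caller should re-read and retry:

```csharp
using System.Data.SqlClient;

static class StockRepository
{
    public static bool TryDecrementStock(string connectionString,
                                         int productId, int quantity,
                                         byte[] rowVersion)
    {
        const string sql = @"
            UPDATE Stock
            SET    Available = Available - @qty
            WHERE  ProductId = @id
              AND  Available >= @qty        -- never go negative
              AND  RowVersion = @version;"; // stale read => 0 rows affected

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@qty", quantity);
            cmd.Parameters.AddWithValue("@id", productId);
            cmd.Parameters.AddWithValue("@version", rowVersion);
            conn.Open();
            return cmd.ExecuteNonQuery() == 1; // false: re-read and retry
        }
    }
}
```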
The scenario you describe is generally one where you have to use a queue rather than persistent storage to meet the throughput needs. Searching the net, you can find a lot of case studies where people have employed queuing systems to improve system throughput. SQL Server simply cannot scale to those levels.
In the special cases where you need to make your queue persistent, dedicated techniques are used to mitigate the resulting performance impact. For example, Apache's ActiveMQ has its own file-based storage system, which performs much better than simply using MySQL for backend persistence. MSMQ probably provides a similar option, but I'm not sure.
I'm working on an application that may generate thousands of messages in a fairly tight loop on a client, to be processed on a server. The chain of events is something like:
Client processes item, places in local queue.
Local queue processing picks up messages and calls web service.
Web service creates message in service bus on server.
Service bus processes message to database.
The idea is that all communications are asynchronous, as there will be many clients for the web service. I know that MSMQ can do this directly, but we don't always have the admin capability on the clients to set up things like security, etc.
My question is about the granularity of the messages at each stage. The simplest method would mean that each item processed on the client generates one client message/web service call/service bus message. That's fine, but I know it's better to batch up the web service calls if possible, except there's a trade-off between large-granularity web service DTOs and short-running transactions on the database. This particular scenario does not require a "business transaction" where all items are processed or none are; I'm just looking for the best balance of message size vs. number of web service calls vs. database transactions.
Any advice?
Chatty interfaces (i.e. lots and lots of small messages) tend to have high overhead from dispatching each incoming message (and, on the client, each reply) to the correct code to process it; this is a fixed cost per message. Big messages, on the other hand, tend to use more resources in processing each message.
Additionally, a lot of web service calls in progress means a lot of TCP/IP connections to manage, and concurrency issues (including locking in a database) might become a problem.
But without some details of the processing of the message it is hard to be specific, other than the general advice against chatty interfaces because of the fixed overheads.
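One hypothetical middle ground, just to make the trade-off concrete: batch items on the client and flush on either a size or an age limit, so neither the DTOs nor the database transactions grow unbounded (all names and limits here are invented):

```csharp
using System;
using System.Collections.Generic;

class Batcher<T>
{
    private readonly List<T> _pending = new List<T>();
    private readonly int _maxItems;
    private readonly TimeSpan _maxWait;
    private readonly Action<IReadOnlyList<T>> _send; // e.g. one web service call
    private DateTime _batchStarted;

    public Batcher(int maxItems, TimeSpan maxWait, Action<IReadOnlyList<T>> send)
    {
        _maxItems = maxItems;
        _maxWait  = maxWait;
        _send     = send;
    }

    public void Add(T item)
    {
        if (_pending.Count == 0) _batchStarted = DateTime.UtcNow;
        _pending.Add(item);

        // Flush on whichever limit is hit first: batch size or batch age.
        if (_pending.Count >= _maxItems || DateTime.UtcNow - _batchStarted >= _maxWait)
            Flush();
    }

    public void Flush()
    {
        if (_pending.Count == 0) return;
        _send(_pending.ToArray()); // one call carries the whole batch
        _pending.Clear();
    }
}
```

Measuring with different maxItems values would then show where the sweet spot lies.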
Measure first, optimize later. Unless you can make a back-of-the-envelope estimate that shows that the simplest solution yields unacceptably high loads, try it, establish good supervisory measurements, see how it performs and scales. Then start thinking about how much to batch and where.
This approach, of course, requires that you can change the web service interface after deployment, so you need a versioning strategy to deal with clients that may not have been updated, supporting several WS versions in parallel. But not thinking about versioning almost always traps you in suboptimal interfaces anyway.
Abstract the message queue
and have a swappable message queue backend. This way you can test many backends and give yourself an easy bail-out should you pick the wrong one, or grow to like a new one that appears. The overhead of messaging is usually in packing and handling the request. Different systems are designed for different levels of traffic and different symmetries over time.
If you abstract out the basic features you can swap the mechanics in and out as your needs change, or are more accurately assessed.
You can also translate messages between differing queue types at various points along the application or message route as the recipients' stresses change, for example because at a higher level they are handling 1000 messages per second versus 10 per second.
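One possible shape for that abstraction (the interface and class names are made up; a SQL-table or other backend would implement the same interface):

```csharp
using System;
using System.Messaging;

public interface IMessageQueue
{
    void Send(string body);
    string Receive(TimeSpan timeout); // null when nothing arrives in time
}

// MSMQ-backed implementation; swapping transports means writing another
// implementation of IMessageQueue, not touching the application code.
public sealed class MsmqBackend : IMessageQueue
{
    private readonly MessageQueue _queue;

    public MsmqBackend(string path)
    {
        _queue = new MessageQueue(path)
        {
            Formatter = new XmlMessageFormatter(new[] { typeof(string) })
        };
    }

    public void Send(string body) => _queue.Send(body);

    public string Receive(TimeSpan timeout)
    {
        try { return (string)_queue.Receive(timeout).Body; }
        catch (MessageQueueException) { return null; } // treat timeout as "no message"
    }
}
```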
Good Luck