We use RabbitMQ to send messages to a server for processing.
We require the server to ack each message. That way, if the server happens to die whilst processing a message, the message will be retried when the server restarts, or by a different server.
The problem is that, on very rare occasions, we will get a message that deterministically crashes the server. This is because we call into some open-source native DLLs, those DLLs have bugs, and sometimes they simply cause the process to crash with no exception. Of course it would be ideal to fix those bugs, but we don't expect to fix all such issues in PDFium or OpenCV any time soon. We have to reckon with the fact that, whatever we do, we will eventually get such a message.
The result is that the message is retried, the server restarts, picks up the message, crashes, and so on ad infinitum. Nothing gets processed until we manually stop the server and purge the message. Not ideal.
What can we do to solve this problem?
What we don't want to do is create another service that monitors the RabbitMQ service, looks for such messages, and purges them, since that just leads to spiralling complexity. Instead we want to deal with this at the RabbitMQ client level. We would be perfectly happy to say that if a message has failed processing 3 times, we should just fail the message. We could do this by maintaining a database entry for each message we've processed, but ideally I wouldn't want to involve anything external, and would rather contain the solution to this problem in our RabbitMQ client library. I'm not sure how to do this though.
One method I have used in my event-driven architecture is dead letter exchanges (DLXs), or poison queues: if we see the same message multiple times due to service failure, it is pushed into the DLX instead of being re-queued into the original exchange. These messages then trigger a different type of process within our system to alert us that messages are stuck and failing to process, and we can then diagnose and fix the consumer. After a fix has been made, we trigger another process to move the poison messages back into the original exchange to be processed as normal.
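For reference, wiring this up in the .NET RabbitMQ client looks roughly like the sketch below; all of the exchange and queue names are made up for illustration:

using System.Collections.Generic;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "localhost" };
using (var connection = factory.CreateConnection())
using (var channel = connection.CreateModel())
{
    // The dead letter exchange and the poison queue bound to it.
    channel.ExchangeDeclare("work.dlx", ExchangeType.Fanout, durable: true);
    channel.QueueDeclare("work.poison", durable: true, exclusive: false,
                         autoDelete: false, arguments: null);
    channel.QueueBind("work.poison", "work.dlx", routingKey: "");

    // The working queue: anything rejected with requeue: false is
    // routed to work.dlx instead of being dropped.
    var args = new Dictionary<string, object>
    {
        { "x-dead-letter-exchange", "work.dlx" }
    };
    channel.QueueDeclare("work", durable: true, exclusive: false,
                         autoDelete: false, arguments: args);
}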
In your scenario, because your process crashes, there are two possible options for dealing with these messages:
1. If the message is marked as redelivered, clone it and add an attempt count to the body or as a header (x-attempt-count). The copy is then added to the back of the queue with the attempt count. When the copy is consumed, you can check whether it has hit the threshold and, if so, move the message into a DLX or store it in a database. The major drawback here is that it breaks the order in which messages are processed.
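A rough sketch of this first option with the .NET client; the queue name, the threshold of 3, and the Process placeholder are illustrative, not from the post:

using System.Collections.Generic;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

const int MaxAttempts = 3;          // illustrative threshold

var factory = new ConnectionFactory { HostName = "localhost" };
var connection = factory.CreateConnection();
var channel = connection.CreateModel();

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    // Attempt count stamped on any earlier copy (absent on first delivery).
    int attempts = 1;
    object raw;
    if (ea.BasicProperties.Headers != null &&
        ea.BasicProperties.Headers.TryGetValue("x-attempt-count", out raw))
    {
        attempts = (int)raw;
    }

    if (attempts >= MaxAttempts)
    {
        // Threshold hit: reject without requeue so the broker dead-letters it.
        channel.BasicNack(ea.DeliveryTag, multiple: false, requeue: false);
        return;
    }

    if (ea.Redelivered)
    {
        // A crash happened mid-processing: publish a copy to the back of
        // the queue with the counter incremented, then ack the original.
        var props = channel.CreateBasicProperties();
        props.Headers = new Dictionary<string, object>
        {
            { "x-attempt-count", attempts + 1 }
        };
        channel.BasicPublish(ea.Exchange, ea.RoutingKey, props, ea.Body);
        channel.BasicAck(ea.DeliveryTag, multiple: false);
        return;
    }

    Process(ea.Body);   // the work that may crash the whole process
    channel.BasicAck(ea.DeliveryTag, multiple: false);
};
channel.BasicConsume("work", autoAck: false, consumer: consumer);

void Process(object body)
{
    // placeholder for the real handler that calls into the native DLLs
}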
2. Use an external service to keep track of the number of delivery attempts. I would recommend something like Redis/Memcached, where you can increment a counter based on a unique message id. At the start of your processing, if the message has been marked as redelivered, look up the counter. If the message has reached the threshold, trigger a different process, again like moving it into a DLX.
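A sketch of the counter with StackExchange.Redis, assuming each message carries a stable unique id; the key prefix, TTL, and threshold are arbitrary:

using System;
using StackExchange.Redis;

var redis = ConnectionMultiplexer.Connect("localhost").GetDatabase();

bool ShouldDeadLetter(string messageId, bool redelivered)
{
    const int maxAttempts = 3;        // illustrative threshold
    if (!redelivered)
        return false;

    // INCR is atomic, so concurrent consumers cannot undercount.
    long attempts = redis.StringIncrement("attempts:" + messageId);

    // Let counters expire so the keys do not pile up forever.
    redis.KeyExpire("attempts:" + messageId, TimeSpan.FromDays(1));

    return attempts >= maxAttempts;
}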
We are running multiple instances of a Windows service that reads messages from a Topic, runs a report, converts the results into a PDF, and emails them to a user. In case of exceptions we simply log the exception and move on.
The use case we want to handle is: when the service is shut down, we want to preserve the jobs that are currently running so they can be reprocessed by another instance of the service, or when the service is restarted.
Is there a way of requeueing a message? The hacky solution would be to just republish the message from the consuming service, but there must be another way.
When incoming messages are processed, their data is put in an internal queue structure (not a message queue) and processed in batches by parallel threads, so the IBM MQ transaction stuff seems hard to implement. Is that what I should be using though?
Your requirement seems hard to implement unless you get rid of the "internal queue structure (not a message queue)", at least if it is not based on transaction-oriented middleware. An MQ queue/topic works well for multi-threaded consumers, so it is not apparent what you gain from the intermediate step of moving the data to just another queue. If you start your transaction by consuming the message from MQ, you can have it rolled back when something goes wrong.
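A minimal sketch of that transacted consume with the IBM MQ .NET classes; the queue manager and queue names and the RunReportAndEmail step are placeholders:

using IBM.WMQ;

var qmgr = new MQQueueManager("MY.QMGR");
var queue = qmgr.AccessQueue("THE.REPORTING.QUEUE",
    MQC.MQOO_INPUT_AS_Q_DEF | MQC.MQOO_FAIL_IF_QUIESCING);

var msg = new MQMessage();
var gmo = new MQGetMessageOptions();
gmo.Options |= MQC.MQGMO_SYNCPOINT;   // consume under a unit of work

queue.Get(msg, gmo);
try
{
    RunReportAndEmail(msg);           // hypothetical processing step
    qmgr.Commit();                    // only now is the message really gone
}
catch
{
    qmgr.Backout();                   // message returns to the queue
    throw;
}

void RunReportAndEmail(MQMessage m)
{
    // run the report, render the PDF, send the email
}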
If I understood your use case correctly, you can use Durable subscriptions:
Durable subscriptions continue to exist when a subscribing application's connection to the queue manager is closed.
The details are explained in DEFINE SUB (create a durable subscription). Example:
DEFINE QLOCAL(THE.REPORTING.QUEUE) REPLACE DEFPSIST(YES)
DEFINE TOPIC(THE.REPORTING.TOPIC) REPLACE +
TOPICSTR('/Path/To/My/Interesting/Thing') DEFPSIST(YES) DURSUB(YES)
DEFINE SUB(THE.REPORTING.SUB) REPLACE +
TOPICOBJ(THE.REPORTING.TOPIC) DEST(THE.REPORTING.QUEUE)
Your service instances can now consume from THE.REPORTING.QUEUE.
While I readily admit that my knowledge is shaky, from what I understood from IBM's [sketchy, inadequate, obtuse] documentation, there really is no good built-in solution. With transactions, the Queue Manager assumes all is well unless it receives a rollback request, and when it does, it rolls back to a syncpoint. So if you're trying to roll back one message but two other messages have completed in the meantime, it will roll back all three.
We ended up coding our own solution: updating the way we log messages and marking them as completed in the DB. Then, on both startup and shutdown, we find the uncompleted messages and programmatically publish them back to the queue, limiting the DB search by machine name so that multiple instances of the service won't duplicate message processing.
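Roughly sketched, the recovery step looks like this; the table, column, and Publish names below are reconstructed placeholders rather than our exact code:

using System;
using System.Data.SqlClient;

void RepublishUncompleted(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        var cmd = new SqlCommand(
            "SELECT Body FROM MessageLog " +
            "WHERE Completed = 0 AND MachineName = @machine", conn);
        cmd.Parameters.AddWithValue("@machine", Environment.MachineName);

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // Publish is whatever wrapper the service already
                // uses to send messages back to the topic.
                Publish((string)reader["Body"]);
            }
        }
    }
}

void Publish(string body)
{
    // existing send logic
}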
We have code sending two near-duplicate Tell messages (differing only by an integer), one directly after the other with no processing in between. These are normally processed in a few milliseconds, but we are hitting times when one is processed instantly and the next some 11 seconds later. Which is a lifetime.
The randomness and sporadic nature of this issue is making it difficult to diagnose, and the fact that 99% of these messages are processed blisteringly fast makes it a head-scratching problem.
Background: we have a very controlled/stable environment: a 64-bit Windows 10 machine and a dedicated Windows server running a self-hosted WebAPI using C# and Akka.NET services v1.3 (.NET Framework). No Akka remoting or clustering. As messages are posted in, actors and child actors break them down for processing by smaller and smaller actors; some are stateful, basically caching DB details about requests to save on DB round trips, since the prices behind requests fluctuate all the time and we aim to only post to the DB. None of these parent actors are misbehaving.
Currently, logging on entry and exit of the ProcessMessage methods provides the only real diagnostics for tracking behaviour.
It is the behaviour of the message queuing that we think is the issue.
Basically, there are two Tells to the same actor. These are very small messages (less than 1 KB) to a very small actor whose sole job is to send an HTTP message. The actor has no caching, DB requests, or IO (other than logging). Once the message hits the handler's ProcessMessage, it is processed in a millisecond or two.
If I understood your issue correctly, upon receiving a message it is forwarded to a downstream actor whose job is to call an external server? If that is the case, then Akka.NET is not at fault. Why?
Actors process messages sequentially and won't process the next message until the current message has been processed completely. The more time it takes to handle the current message, the more time it takes for the next message to be handled.
Probably the external server is overloaded and not sending responses quickly, or maybe there is rate limiting turned on at the external server's side. Or perhaps the HttpClient you use needs fine-tuning!
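One common remedy, sketched below with illustrative names (this goes beyond what the answer itself states): run the HTTP call asynchronously and pipe the result back to the actor, so a slow response never blocks the mailbox, and share a single HttpClient, since creating a client per request can exhaust sockets and cause exactly this kind of sporadic multi-second delay.

using System.Net.Http;
using Akka.Actor;

public class HttpSenderActor : ReceiveActor
{
    // One shared HttpClient for the actor's lifetime.
    private static readonly HttpClient Client = new HttpClient();

    public HttpSenderActor()
    {
        Receive<string>(payload =>
        {
            // Fire the request on the thread pool and pipe the result
            // back as a message, so the next Tell is handled immediately.
            Client.PostAsync("http://example.invalid/endpoint",
                             new StringContent(payload))
                  .PipeTo(Self);
        });

        Receive<HttpResponseMessage>(response =>
        {
            // handle / log the response here
        });
    }
}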
If you can post a sample of your code, it will help in understanding your issue better!
My application uses a local private MessageQueue(@".\private$\queuename") to sequence messages from its multiple threads, and has been doing so successfully for a long time. Recently, an error occurred that caused several of those messages to essentially disappear without a trace. From the app's internal logging and the eyewitness account, it seems that the Send(msg) method failed to place the message into the queue and raised no error. In a debugger, I simulated that scenario by having execution skip the Send() call, and the resulting log info matches what was logged in the actual error occurrence.
Most disturbing is that the error condition existed for 45 minutes, persisting through a computer reboot and application restart, and required a second restart after the reboot to finally clear it.
This unresolved post hints that the message might end up someplace other than the intended local queue. Unfortunately, all evidence of it that might have existed was gone by the time I was able to inspect the system where the error occurred. I considered using the TimeToReachQueue acknowledgement to detect failure of the Send() call, but MSDN asserts that "If the queue is local, the message always reaches the queue," although this event challenges that claim.
Recurrence of this error will be a serious problem, so if I can't prevent it, I need to be able to detect and report it. Not knowing what actually happened makes both options extremely difficult.
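For reference, the detection approach I'm considering would look roughly like this with System.Messaging; the queue paths and timeouts are placeholders:

using System;
using System.Messaging;

var queue = new MessageQueue(@".\private$\queuename");
var adminQueue = new MessageQueue(@".\private$\queuename_admin");

var msg = new Message("payload");
msg.AdministrationQueue = adminQueue;              // where MSMQ sends acks
msg.AcknowledgeType = AcknowledgeTypes.FullReachQueue;
msg.TimeToReachQueue = TimeSpan.FromSeconds(30);
queue.Send(msg);

// Elsewhere, watch the administration queue: Acknowledgment.ReachQueue
// means the send really landed; a timeout-class acknowledgment is the
// failure to detect and report. Receive throws if nothing arrives
// within the wait interval.
adminQueue.MessageReadPropertyFilter.Acknowledgment = true;
var ack = adminQueue.Receive(TimeSpan.FromSeconds(60));
Console.WriteLine(ack.Acknowledgment);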
Say I have a connection to RabbitMQ, and I've pulled 1000 messages but have not yet acked them, as they are being processed by a single thread out of a BlockingCollection.
Now suppose my connection dies and is auto-recovered. At this point all of these messages on the server will be requeued for delivery. But I still have copies of them locally, with the old delivery tags.
This leads me to believe I should handle connection or channel down events by clearing my local queue out.
Can you confirm this is true?
Yes that is the case. Those messages will be redelivered.
So in addition to clearing out your locally queued messages, you might want to consider your prefetch so that you don't have so many messages queued locally.
Is your strategy to pull 1000, process them all, then finally ack them all? I can see that for performance reasons you might do this so you can send a single ack with multiple=true, but it does introduce extra redelivery and duplicate-processing risks.
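For example, with the standard .NET client (values illustrative):

using System.Collections.Concurrent;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
var connection = factory.CreateConnection();
var channel = connection.CreateModel();

// The in-memory buffer from the question.
var localQueue = new BlockingCollection<BasicDeliverEventArgs>();

// Cap how many unacked deliveries the broker pushes at once.
channel.BasicQos(prefetchSize: 0, prefetchCount: 100, global: false);

// On connection loss the broker requeues everything unacked, so the
// local copies (and their delivery tags) are stale: drop them.
connection.ConnectionShutdown += (sender, args) =>
{
    BasicDeliverEventArgs stale;
    while (localQueue.TryTake(out stale)) { }
};

// After processing a batch, one call confirms every delivery up to and
// including the last tag processed:
// channel.BasicAck(lastDeliveryTag, multiple: true);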
You are right. If you are processing one message at a time, you can set the prefetch count to 1, and then you may not need to clear any messages locally either.
I have inherited an application that pulls messages out of an MSMQ queue, does some processing on them, and then adds some data to a database depending on what is in the message. The messages are pushed into the queue by a third-party application I do not control.
I do not know much about MSMQ, although I do have a basic understanding of how to use the APIs.
Anyway, I have noticed that the messages never get deleted; our client definitely never explicitly deletes them, and I can look in Computer Management and see messages going back to when the server was last rebooted.
Is this wrong? Will the messages start to automatically get deleted when the queue reaches some maximum size or will they just pile up there forever slowly taking up more memory?
Once a message has been processed, it is normal practice to remove it from a queue (transactionally or otherwise).
I'd suspect that while this isn't best practice, the queue is cleared on reboot, and as long as there's a sufficient amount of resources available, you'll never actually run into a problem.
That said, I'd opt for setting something up to periodically clean up the queue so you don't overwhelm the server. I'm not too familiar with MSMQ, but is there some way that you can tell if a message has been processed? Even if it's an additional service that runs, checks the messages in the queue and sees if they already appear in the database, and deletes them if they do? That way, you wouldn't need to modify the codebase you inherited, since it's working properly as-is.
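Something along these lines, perhaps; the queue path and the database check are placeholders:

using System;
using System.Messaging;

var queue = new MessageQueue(@".\private$\inherited_queue");
queue.MessageReadPropertyFilter.Id = true;

// GetAllMessages is a snapshot, so a message may be gone by the time
// we try to remove it; a real version would handle that exception.
foreach (Message msg in queue.GetAllMessages())
{
    if (AlreadyInDatabase(msg.Id))
    {
        // Removes exactly this message, leaving the rest untouched.
        queue.ReceiveById(msg.Id);
    }
}

bool AlreadyInDatabase(string messageId)
{
    // look the message up in the database the processor writes to
    return false;
}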
Once you decide on a solution, please post an update here - I'm interested to know how you end up dealing with this problem. Thanks!
"Anyway, I have noticed that the messages never get deleted, our client definately never explictly deletes them, and I can look in computer management and see the messages back to when the server was last rebooted."
Sounds like the messages are Express, if there are none around from before the last reboot. Express messages are only stored in RAM and not persisted to disk, so restarting the MSMQ service will destroy them. This is probably why the volume of messages has never reached a critical level.
As MSMQ uses kernel memory and disk space for message storage, eventually one of the two would give out and cause you server stability issues, so your plan to have a cleanup process is a good one.
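For contrast, a sender that wanted its messages to survive a restart would mark them Recoverable so MSMQ writes them to disk; a minimal sketch, with a made-up queue path:

using System.Messaging;

var queue = new MessageQueue(@".\private$\queuename");
var msg = new Message("payload") { Recoverable = true };
queue.Send(msg);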
Cheers,
John Breakwell (MSFT)