I'm in a situation similar to the one described here, but I cannot comment there because I just registered on this site.
A workaround for "pausing" with SetNumberOfWorkers(0) works in most cases. However, if SetNumberOfWorkers(0) is called during a lengthy message handler, I receive the following error at the end of the message handler:
An error occurred when attempting to complete the transaction context
Rebus.Exceptions.RebusApplicationException: Could not complete message with ID <...> and lock token <...> ---> Microsoft.Azure.ServiceBus.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue.
Note that "Worker Rebus 1 worker 1 stopped" messages are logged for all workers almost immediately after calling SetNumberOfWorkers(0), even though the handler is still running.
After bringing the number of workers back to normal, all further messages throw a similar error at the end of the handler.
Any advice on how to pause Rebus correctly?
(I need to pause because my microservice has to update some resources periodically, and handlers can't run during those updates.)
I'd like to write a parallel execution module based on Solace, and I use a request-reply scheme for this.
I have:
Multiple message consumers, which publish messages into the same queue.
Multiple message producers, which read queue and create reply messages.
Message execution time is between 10 seconds and 10 minutes.
Queue access type is non-exclusive (i.e. it does round-robin between all consumers).
Each producer and consumer is asynchronous, i.e. the Solace API blocks execution during the connection only.
What I'd like to have: if a producer is working on a message, it should not receive any other messages. This is extremely important, because some tasks block an executor for several minutes, while other executors can be free after a couple of seconds.
The scheme below could work, but it contains a blocking call, which I'd like to avoid.
while (true)
{
    var inputMessage = flow.ReceiveMsg(/* timeout 1s */ 1_000); // <--- blocking call, I'd like to avoid it
    flow.Ack(inputMessage.ADMessageId);
    var reply = await ProcessMessageAsync(inputMessage); // execute plus handle exceptions
    session.SendReply(inputMessage, reply);
}
Messages are only pushed to the consuming applications.
That being said, your desired behavior can be obtained by setting the "max-delivered-unacked-msgs-per-flow" on your queue to 1.
This means that each consumer bound to the queue is only allowed to have 1 outstanding unacknowledged message.
The next message is sent to the consumer only after it has acknowledged the previous one.
Details about this feature can be found here.
Do note that your code snippet does not appear to be valid.
IFlow.ReceiveMsg is only used in transacted sessions, which makes use of ITransactedSession.Commit to acknowledge messages.
I am running some tests that use Azure CloudQueue, and as setup/teardown I am calling CreateIfNotExistsAsync() and DeleteIfExistsAsync(). However, when I run my tests back to back I get a Microsoft.WindowsAzure.Storage.StorageException: "The remote server returned an error: (409) Conflict."
await cloudQueue.CreateIfNotExistsAsync();
// do work 1
await cloudQueue.DeleteIfExistsAsync();
await cloudQueue.CreateIfNotExistsAsync(); // throws exception
// do work 2
After taking a closer look at the server's response, I found the StatusDescription says "The specified queue is being deleted."
Is there a method that I can call so that once it returns, I know for sure the queue is already deleted?
=========================================================================
UPDATE Now that I think of it: if the Azure Queue server wanted to reply with the deletion result, it would have to keep track of unfinished incoming requests, which is obviously bad design (vulnerable to DoS attacks)...
Is there a method that I can call so that once it returns, I know for
sure the queue is already deleted?
Unfortunately no. Deleting a queue (or blob container/table/file share) is an asynchronous operation. When you send a request to delete a queue, Azure Storage marks that queue for deletion (so that no operations can be performed on it) and then actually deletes the queue through a background process. Based on the documentation, it can take up to 30 seconds to delete a queue. However, it may take longer depending on how much data the queue holds.
From the documentation:
When a queue is successfully deleted, the queue is immediately marked
for deletion and is no longer accessible to clients. The queue is
later removed from the Queue service during garbage collection.
Possible Workaround:
Since there's no method that you can call which will tell you for sure that a queue is already deleted, what you would need to do is try to create the queue using CreateIfNotExistsAsync and catch any error. If the HTTP status code is Conflict (409) and error code is QueueBeingDeleted, you should wait for some time and retry the operation. If you want, you can put incremental delay between retries.
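The retry idea above can be sketched roughly as follows. This is only a minimal, library-independent illustration: the Retry class, WithIncrementalDelayAsync and the simulated failure are hypothetical names made up for the example, and the storage call is replaced by a fake so the snippet runs on its own. In real code the operation would presumably be () => cloudQueue.CreateIfNotExistsAsync() and the predicate would check for a StorageException carrying HTTP status 409 and the QueueBeingDeleted error code.

```csharp
using System;
using System.Threading.Tasks;

// Fake "create queue" call: fails twice with a simulated conflict, then succeeds.
// (Stands in for CreateIfNotExistsAsync so the sketch needs no Azure packages.)
int calls = 0;
Func<Task> fakeCreate = () =>
{
    calls++;
    if (calls < 3) throw new InvalidOperationException("QueueBeingDeleted (simulated)");
    return Task.CompletedTask;
};

int attempts = await Retry.WithIncrementalDelayAsync(
    fakeCreate,
    isRetryable: ex => ex is InvalidOperationException, // assumption: real code would inspect the 409 / error code
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"Succeeded after {attempts} attempts");

static class Retry
{
    // Runs the operation and retries while isRetryable(ex) is true, waiting
    // baseDelay * attempt between tries (incremental delay).
    // Returns the number of attempts used; the last exception propagates
    // once maxAttempts is reached or the error is not retryable.
    public static async Task<int> WithIncrementalDelayAsync(
        Func<Task> operation, Func<Exception, bool> isRetryable, int maxAttempts, TimeSpan baseDelay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return attempt;
            }
            catch (Exception ex) when (attempt < maxAttempts && isRetryable(ex))
            {
                await Task.Delay(TimeSpan.FromTicks(baseDelay.Ticks * attempt));
            }
        }
    }
}
```

In a test setup the base delay would likely need to be in the range of seconds rather than milliseconds, since the deletion itself can take a while.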
The template code when you create a worker role with a queue client provides a message pump implementation. The code has a comment in it saying:
// Initiates the message pump and callback is invoked for each message that is received, calling close on the client will stop the pump.
sourceClient.OnMessage(received =>
{
//blah blah implementation
});
What actually happens when you call close() on the sourceClient? Do messages that are currently being processed continue? I.e. is this a graceful shutdown of the message pump? Or will calling close affect messages that are currently being processed by the message pump?
The documentation would lead me to believe it is, but there is this outstanding feedback item which would imply that there is no graceful shutdown mechanism for a message pump: https://feedback.azure.com/forums/216926-service-bus/suggestions/4345733-provide-gracefull-shutdown-feature-to-message-pump
So what does sourceClient.close() actually do?
In the full framework client (WindowsAzure.ServiceBus) QueueClient does not stop message pump gracefully. Messages in flight that were not completed will have their delivery count increased.
So what does sourceClient.close() actually do?
That client is a closed source project. Best guess would be to raise an issue for it here.
I understand that RabbitMQ with ack, by default, will re-queue the message if it detects that the consumer/worker has died.
What about the situation where the consumer/worker is still alive but its process has stalled out for too long and didn't ack?
I would like to set an explicit timeout so that if a message has been dispatched to a consumer, but that consumer has held the message without an ack for too long, the message gets re-queued.
I recognize that this might result in messages getting processed in duplicate but sometimes the consequence of that is not as bad as delayed message delivery.
It can also happen with errant exception handling: if something gets swallowed, the task terminates, and the message is never ack'd and never re-queued.
A timeout for a RabbitMQ consumer can be set explicitly on the consumer side. I think this is clear, but just to mention it: there must not be any automatic ACKs in this case. The solution would be a multithreaded consumer, with one thread doing the message processing and ACKing the message only after it has been processed, and the other thread being a timeout thread that would do one of the following:
terminate the connection to the broker once the timeout has expired, as a consequence of which the message would be requeued
ACK the received message and re-publish it (explicitly)
NACK the received message; based on the documentation (a NACK instructs the broker to either discard the messages or requeue them), the NACK has to tell the broker which of the two to do (basic.nack carries a requeue flag)
Now all this implies that at least some part of the process isn't stuck. If the whole process is stuck, perhaps the heartbeat between the broker and the consumer stops, and that is how the broker knows that the consumer died (honestly, I didn't test this situation, so I'm assuming). If this is not the case (or simply to be extra safe), you could add some kind of watchdog process that pings the consumer(s) and kills them if there's no reply, which again would lead to the messages not being ACKed and being requeued.
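As a rough illustration of the consumer-side timeout with the NACK option, here is a minimal sketch. It uses a task plus a timer instead of two explicit threads, and it is deliberately independent of any client library: TimedHandler, HandleAsync and the ack/requeue delegates are hypothetical names invented for this example. With the RabbitMQ .NET client they would presumably wrap BasicAck and BasicNack with requeue set to true, but that wiring is an assumption and is not shown.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Demo: the handler "stalls" for 5 seconds, the timeout is 100 ms.
string outcome = await TimedHandler.HandleAsync(
    process: ct => Task.Delay(TimeSpan.FromSeconds(5), ct), // simulated stalled handler
    timeout: TimeSpan.FromMilliseconds(100),
    ack: () => Console.WriteLine("ack"),
    requeue: () => Console.WriteLine("requeue"));

Console.WriteLine(outcome);

static class TimedHandler
{
    // Acks only if the handler finishes successfully within the timeout.
    // On timeout or on a handler failure the message is handed back via requeue,
    // so it can be delivered again (duplicates are possible, as noted in the question).
    public static async Task<string> HandleAsync(
        Func<CancellationToken, Task> process, TimeSpan timeout, Action ack, Action requeue)
    {
        using var cts = new CancellationTokenSource();
        Task work = process(cts.Token);
        Task finished = await Task.WhenAny(work, Task.Delay(timeout));

        if (finished == work && work.IsCompletedSuccessfully)
        {
            ack();
            return "acked";
        }

        if (finished == work)
        {
            // The handler failed or was cancelled: log instead of swallowing silently.
            Console.WriteLine($"handler failed: {work.Exception?.GetBaseException().Message}");
            requeue();
            return "failed-requeued";
        }

        cts.Cancel(); // cooperative only: a truly stuck handler will not notice this
        requeue();
        return "timed-out-requeued";
    }
}
```

Note that cancellation here is cooperative, so this covers a handler that is slow but still responsive. A handler that is completely stuck still needs the connection teardown or the external watchdog described above.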
We have an issue in our Rebus/RabbitMQ setup where Rebus suddenly stops retrieving/handling messages from RabbitMQ. This has happened two times in the last month and we're not really sure how to proceed.
Our RabbitMQ setup has two nodes on different servers, and the Rebus side is a windows service.
We see no errors in Rebus or in the eventlog on the server where Rebus runs. We also do not see errors on the RabbitMQ servers.
Rebus (and the windows service) keeps running as we do see other log messages, like the DueTimeOutSchedular and timeoutreplies. However it seems the worker thread stops running, but without any errors being logged.
It results in a RabbitMQ input queue that keeps growing :(, we're adding logging to monitor this so we get notified if it happens again.
But I'm looking for advice on how to continue the "investigation" and ideas on how to prevent this. Maybe some of you have experienced this before?
UPDATE
It seems that we actually did have a node crashing, at least the last time it happened. The master RabbitMQ node crashed (the server crashed) and the slave was promoted to master. As far as I can see from the RabbitMQ logs on the nodes, everything went according to plan. There are no other errors in the RabbitMQ logs.
At the time this happened Rebus was configured to connect only to the node that was the slave (then promoted to master) so Rebus did not experience the rabbitmq failure and thus no Rebus connection errors. However, it seems that Rebus stopped handling messages when the failure occurred.
We are actually experiencing this on a few queues, it seems, and some of them (but not all) seem to have ended up in an unsynchronized state.
UPDATE 2
I was able to reproduce the problem quite easily, so it might be a configuration issue in our setup. This is what we do to reproduce it:
Start two nodes in a cluster, e.g. rabbit1 (master) and rabbit2 (slave)
Rebus connects to rabbit2, the slave
Close rabbit1, the master. rabbit2 is promoted to master
The queues are mirrored
We have two small tests apps to reproduce this, a "sender" that sends a message every second and a "consumer" that handles the messages.
When rabbit1 is closed, the "consumer" stops handling messages, but the "sender" keeps sending the messages and the queue keeps growing.
Start rabbit1 again, it joins as slave
This has no effect and the "consumer" still does not handle messages.
Restart the "consumer" app
When the "consumer" is restarted it retrieves all the messages and handles them.
I think I have followed the setup guides correctly, but it might be a configuration issue on our part. I can't seem to find anything that would suggest what we have done wrong.
Rebus is still connected to RabbitMQ; we can see that in the connections tab on the management site. The "consumer's" sent/received B/s drops to about 2 B/s when it stops handling messages.
UPDATE 3
Ok, so I downloaded the Rebus source and attached to our process so I could see what happens in the "RabbitMqMessageQueue" class when it stops. When "rabbit1" is closed, the "BasicDeliverEventArgs" is null. This is the code:
BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
    return null;
}
// wtf??
if (ea == null)
{
    return null;
}
See: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L178
I like the "wtf ??" comment :)
That sounds very weird!
Whenever Rebus' RabbitMQ transport experiences an error on the connection, it will throw out the connection, wait a few seconds, and ensure that the connection is re-established again when it can.
You can see the relevant place in the source here: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L205
So I guess the question is whether the RabbitMQ client library can somehow enter a faulted state, silently, without throwing an exception when Rebus attempts to get the next message...?
When you experienced the error, did you check out the 'connections' tab in RabbitMQ management UI and see if the client was still connected?
Update:
Thanks for your thorough investigation :)
The "wtf??" is in there because I once experienced a hiccup when ea had apparently been null, which was unexpected at the time, thus causing a NullReferenceException later on and the vomiting of exceptions all over my logs.
According to the docs, Next will return true and set the result to null when it reaches "end-of-stream", which is apparently what happens when the underlying model is closed.
The correct behavior in that case for Rebus would be to throw a proper exception and let the connection be re-established - I'll implement that right away!
Sit tight, I'll have a fix ready for you in a few minutes!
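To make the intended behavior concrete, here is a small stand-alone sketch of the idea: treat "Next returned true but the result is null" as end-of-stream and throw, so the caller's error handling can drop and re-establish the connection. This is not the actual Rebus fix. TryNext, Receiver and ReceiveOrThrow are hypothetical names, and the subscription is faked with a delegate so the snippet runs without the RabbitMQ client.

```csharp
#nullable disable
using System;

// Fake subscription that behaves like a closed model:
// it reports success but leaves the result null (end-of-stream).
TryNext<string> closedSubscription = (int timeoutMs, out string result) =>
{
    result = null;
    return true;
};

bool reconnectRequested = false;
try
{
    Receiver.ReceiveOrThrow(closedSubscription, 100);
}
catch (InvalidOperationException ex)
{
    // In the transport this is where the connection would be thrown away and re-created.
    reconnectRequested = true;
    Console.WriteLine($"Reconnect needed: {ex.Message}");
}

// Same shape as a "bool Next(int timeout, out T result)" style call.
delegate bool TryNext<T>(int timeoutMs, out T result) where T : class;

static class Receiver
{
    public static T ReceiveOrThrow<T>(TryNext<T> next, int timeoutMs) where T : class
    {
        if (!next(timeoutMs, out T result))
        {
            return null; // timed out: nothing to receive right now
        }

        if (result == null)
        {
            // End-of-stream: surface it instead of silently returning null forever.
            throw new InvalidOperationException("Subscription reached end-of-stream (model closed?)");
        }

        return result;
    }
}
```

The difference to the snippet in the question is only the second branch: instead of returning null, which looks identical to "no message yet", the end-of-stream case becomes an error that the reconnect logic can react to.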