We have an issue in our Rebus/RabbitMQ setup where Rebus suddenly stops retrieving/handling messages from RabbitMQ. This has happened two times in the last month and we're not really sure how to proceed.
Our RabbitMQ setup has two nodes on different servers, and the Rebus side is a windows service.
We see no errors in Rebus or in the eventlog on the server where Rebus runs. We also do not see errors on the RabbitMQ servers.
Rebus (and the windows service) keeps running as we do see other log messages, like the DueTimeOutSchedular and timeoutreplies. However it seems the worker thread stops running, but without any errors being logged.
It results in a RabbitMQ input queue that keeps growing :(, we're adding logging to monitor this so we get notified if it happens again.
But I'm looking for advise on how to continue the "investigation" and ideas on how to prevent this. Maybe some of you have experienced this before?
UPDATE
It seems that we actually did have a node crashing, at least the last time it happened. The master RabbitMQ node crashed (the server crashed) and the slave was promoted to master. As far as I can see from the RabbitMQ logs on the nodes everything went according to planned. There are no other errors in the RabbitMQ logs.
At the time this happened Rebus was configured to connect only to the node that was the slave (then promoted to master) so Rebus did not experience the rabbitmq failure and thus no Rebus connection errors. However, it seems that Rebus stopped handling messages when the failure occurred.
We are actually experiencing this on a few queues it seems, and some of them, but not all seems to have ended up in an unsynchronized state.
UPDATE 2
I was able to reproduce the problem quite easily, so it might be a configuration issue in our setup. But this is what we do to reproduce it
Start two nodes in a cluster, ex. rabbit1 (master) and rabbit2 (slave)
Rebus connects to rabbit2, the slave
Close rabbit1, the master. rabbit2 is promoted to master
The queues are mirrored
We have two small tests apps to reproduce this, a "sender" that sends a message every second and a "consumer" that handles the messages.
When rabbit1 is closed, the "consumer" stops handling messages, but the "sender" keeps sending the messages and the queue keeps growing.
Start rabbit1 again, it joins as slave
This has no effect and the "consumer" still does not handle messages.
Restart the "consumer" app
When the "consumer" is restarted it retrieves all the messages and handles them.
I think I have followed the setup guides correctly, but it might be a configuration issue on our part. I can't seem to find anything that would suggest what we have done wrong.
Rebus is still connected to RabbitMQ, we see that in the connections tab on the management site, the "consumers" send/recieved B/s drop to about 2 B/s when it stops handling messages
UPDATE 3
Ok so I downloaded the Rebus source and attached to our process so I could see what happens in the "RabbitMqMessageQueue" class when it stops. When "rabbit1* is closed the "BasicDeliverEventArgs" is null, this is the code
BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
return null;
}
// wtf??
if (ea == null)
{
return null;
}
See: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L178
I like the "wtf ??" comment :)
That sounds very weird!
Whenever Rebus' RabbitMQ transport experiences an error on the connection, it will throw out the connection, wait a few seconds, and ensure that the connection is re-established again when it can.
You can see the relevant place in the source here: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L205
So I guess the question is whether the RabbitMQ client library can somehow enter a faulted state, silently, without throwing an exception when Rebus attemps to get the next message...?
When you experienced the error, did you check out the 'connections' tab in RabbitMQ management UI and see if the client was still connected?
Update:
Thanks for you thorough investigation :)
The "wtf??" is in there because I once experienced a hiccup when ea had apparently been null, which was unexpected at the time, thus causing a NullReferenceException later on and the vomiting of exceptions all over my logs.
According to the docs, Next will return true and set the result to null when it reaches "end-of-stream", which is apparently what happens when the underlying model is closed.
The correct behavior in that case for Rebus would be to throw a proper exception and let the connection be re-established - I'll implement that right away!
Sit tight, I'll have a fix ready for you in a few minutes!
Related
I develop a very simple app using RabbitMQ. One machine, multiple queues and exchanges, one publisher and one consumer. After reading further about Clustering and HA I connect a second machine to create a cluster, besides I mirrored queues to have at least one replica. Now when I want to publish some data into a queue, I use the first machine as my host and it works fine, but if RabbitMQ service of the first machine not running my app crashed. My question is how to know which machine is up for creating connection and how to change the host while publishing messages?
UPDATEI use one of CreateConnection overloads to pass all my hosts for creating a connection. OK, this will solve the problem of finding an available machine to create a connection. But the second question is still there, look at the code below:
for(int i = 0, i < 300, i++){
var message = string.Format("Message #{0}: {1}", i, Guid.NewGuid());
var messageBodyTypes = Encoding.UTF8.GetBytes(message);
channel.BasicPublish(ExchangeName, "123456", null, messageBodyBytes);
}
These lines of code is work perfect when the connection is OK, but assume that in the middle of publishing messages to an exchange, the service stopped unexpectedly, then in this case first System.IO.FileLoadException raised and if I continue the executation RabbitMQ.Client.Exceptions.AlreadyClosedException raised which is saying:
Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=320, text="CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", classId=0, methidId=0, cause=
I think there must be a way to change the host when the connection closed during publishing messages, but how, no IDEA!
These lines of code is work perfect when the connection is OK, but
assume that in the middle of publishing messages to an exchange, the
service stopped unexpectedly, then in this case first
System.IO.FileLoadException raised and if I continue the executation
RabbitMQ.Client.Exceptions.AlreadyClosedException raised which is
saying:
You must close the channel and current connection and open a new one of each mid-loop. That should use a different connection. You only have to do this when the exception is caught, not on every iteration of the loop.
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
I'm trying to add a member to a Group using SignalR 2.2. Every single time, I hit a 30 second timeout and get a "System.Threading.Tasks.TaskCanceledException: A task was canceled." error.
From a GroupSubscriptionController that I've written, I'm calling:
var hubContext = GlobalHost.ConnectionManager.GetHubContext<ProjectHub>();
await hubContext.Groups.Add(connectionId, groupName);
I've found this issue where people are periodically encountering this, but it happens to me every single time. I'm running the backend (ASP.NET 4.5) on one VS2015 launched localhost port, and the frontend (AngularJS SPA) on another VS 2015 launched localhost port.
I had gotten SignalR working to the point where messages were being broadcast to every connected client. It seemed so easy. Now, adding in the Groups part (so that people only get select messages from the server) has me pulling my hair out...
That task cancellation error could be being thrown because the connectionId can't be found in the SignalR registry of connected clients.
How are you getting this connectionId? You have multiple servers/ports going - is it possible that you're getting your wires crossed?
I know there is an accepted answer to this, but I came across this once for a different reason.
First off, do you know what Groups.Add does?
I had expected Groups.Add's task to complete almost immediately every time, but not so. Groups.Add returns a task that only completes, when the client (i.e. Javascript) acknowledges that it has been added to a group - this is useful for reconnecting so it can resubscribe to all its old groups. Note this acknowledgement is not visible to the developer code and nicely covered up for you.
The problem is that the client may not respond because they have disconnected (i.e. they've navigated to another page). This will mean that the await call will have to wait until the connection has disconnected (default timeout 30 seconds) before giving up by throwing a TaskCanceledException.
See http://www.asp.net/signalr/overview/guide-to-the-api/working-with-groups for more detail on groups
So...this one's got me baffled.
The target queue lives on ServerA where MSMQ is running in Workgroup mode. The queue is a non-transactional, private queue, with Full rights on pretty much the world (including NETWORK SERVICE, but EXCLUDING ANONYMOUS LOGON).
I'm specifying the queue address as such: FormatName:DIRECT=OS:ServerA\private$\targetqueue.
If I'm interested in sending "fire-and-forget"-style (no need for transaction as there is no other persistence going on), I would assume it would be fine to simply call:
Message message = ConstructMessageWithObjectPayload(serializableObject);
using (MessageQueue queue = new MessageQueue(queueAddress))
{
queue.Send(message);
}
But strangely, the message never arrives in the target queue and enabling negative source journaling (which interestingly enough causes the message to be sent to the Dead-letter messages queue on the target server) tells me that it is a "Nontransactional message".
Consequently, using
queue.Send(message, MessageQueueTransactionType.Single);
works! Having a hard time wrapping my head around this. What am I missing?
Also, I've seen a good number of posts by others where their similar problem was solved by giving ANONYMOUS LOGIN Full rights. In what scenario is this necessary? Giving NETWORK SERVICE access somewhat made sense because that is the account that MSMQ itself runs under. If running in Workgroup mode like I am, is it necessary at all to assign rights to Everyone or even the account that my process runs under?
Appreciate the help!
I have the following setup and problem with MSMQ. Based on previous experience with MSMQ I'm betting that it is something simple I'm missing but I just don't know what it is.
The Setup
I have 3 load-balanced web servers (lets call them Servers W1, W2 and W3) and 1 server which processes certain events/data away from web requests (which I'll call P). All 3 of the web servers, once a particular event occurs within the web application, will send a message to a remote private queue on Server P, which will then process each message from the queue and carry out some task.
The Problem
For the most part - at a guess 95% of the time - everything runs fine, but occasionally Server P does not receive messages from the web servers. This is either because W1, W2 or W3 are not sending them or they are not being received by P, I just can't tell. This means I'm missing vital events happening from the users on the web application but I cannot find any errors listed in my own logs.
The Details
Here are all the details I can think of which may help explain my setup and what I've figured out so far:
The private queue on Server P is non-transactional.
The private queue has permissions setup for Everyone to both Send and Receive Messages.
This is the code I use (C#) to send the message to the remote private queue:
var queue = new MessageQueue(#"FormatName:DIRECT=OS:ServerP\PRIVATE$\MyMessageQueue");
var defaultProperties = queue.DefaultPropertiesToSend;
defaultProperties.AcknowledgeType = AcknowledgeTypes.FullReachQueue | AcknowledgeTypes.FullReceive;
defaultProperties.Recoverable = true;
defaultProperties.UseDeadLetterQueue = true;
defaultProperties.UseJournalQueue = true;
queue.Send(requestData);
Sending the message using the code above does not appear to throw an exception - if it did my error handler in the web application would have caught and logged it, so I'm assuming it is sent.
There are outgoing queues on W1, W2 and W3 all pointing to the private queue on P - all these are empty.
On W1, W2 and W3 I cannot see any "dead-letter" messages.
On P the private queue is empty so messages are being processed (which I can verify from my database).
On P there are no "dead-letter" messages. There are journal messages but they don't seem to correspond to any recent date/times.
All servers are running Windows Server 2012.
Most of the time messages are sent, received and processed just fine but, without any pattern visible to me, sometimes they are not. Can anyone see what is going wrong? Or explain to me how I can try and figure out what is happening?
Are you sure that the receiver on P does not crash/lose the message somehow? Because your queue is not transactional, if somehow processing fails then that's one lost message.
Anyway, there are many possible causes why this could fail.
What kind of logging do you have (DEBUG/INFO levels)?
I think the following will help tracking down the issue:
When an event is generated in the web app.
Right before you send an event from the web app, via MSMQ.
In the receiver when you get a message from the queue.
This way you could at least match sent messages to received messages and to processed messages.
As a side note, when you check for dead-letter messages you do so on the source computer and on any intermediary hops, not on the destination one. If you don't have any hops, then they will be relayed to the non-transactional dead-letter queue on the web servers.
Feature Description
The NServiceBus gateway, http://docs.particular.net/nservicebus/gateway/, seems to be a way to achieve an internal webhook using the NServiceBus infrastructure.
We need to go further with this concept to open up a few event to any 3rd party subscriber that has access to register a webhook url in our system.
Review
We plan to create two initial window services
1) WebHookBatchService, that can be added as a subscriber to specific messages of interest.
<UnicastBusConfig>
<MessageEndpointMappings>
.......
<add Messages="MyMessages.MyImportantMessage, MyMessages" Endpoint="WebHookBatchService.Queue"/>
.......
</MessageEndpointMappings>
</UnicastBusConfig>
2) WebHookProcessService - actually processes 1 message sent by the WebHookBatchService.
Once messages are received on the WebHookBatchService.Queue our WebHookBatchService will look up all the subscribers for the specific tenant + message type and foreach send individual messages to WebHookProcessService.Queue for the WebHookProcessService (which we can make an instance of nservicebus loadbalancer to bridge the batch and actual processor) to actually process the real messages probably using http://restsharp.org/.
Questions
Are there any existing open source projects that do this today?
Now since we have no control of the durability of the subscribers how should we manage errors?
http://wiki.shopify.com/WebHook
A webhook will be deleted if there are 19 consecutive failures for the exact same webhook.
It doesn't mention any delays in the webhook.. What have people experienced with standard delay in retry logic?
Here are some other thoughts:
proposal 0: MaxRetries="1". Purge WebHookProcessService.ErrorQueue nightly. (no retry - guaranteed message loss if it fails the first time)
proposal 1:
MaxRetries="1" on exception catch send email containing xml version of the message that would have been delivered over http.
Purge WebHookProcessService.ErrorQueue nightly.
-- I see potential a spam issues.
proposal 2: The nservicebus MaxRetries retries right away without delay. So i would need to create (1hr - 24hr) bucket queues and use a RetrySchedulerService although I see this as difficult to maintain and confusing for subscribers when they all at once get 25 messages in a non DateCreated ordered fashion when there service endpoint begins to work.
Digging for ideas...
The Gateway is typically used for communication between physical sites over HTTP. Since you are exposing an endpoint to the world to accept callbacks, I'm thinking you could just use the built-in WCF hosting and expose your endpoint through the firewall to 3rd parties. The rest of your setup sounds appropriate to me.
As for errors, you are correct, NSB retries immediately, but if you using web call backs this may get you by in the cases there are small hiccups. You will need to determine how you want to process the error queues, we just build in a new endpoint to process the error queues with logic to determine the retries, delay etc. A nice way to accomplish this is to use a Saga, which includes a Timeout manager. This enables a workflow where you can retry a specified number of times, try another communication, log everything, and ultimately notify someone who can contact the 3rd party to let them know there stuff is busted.