I'm currently investigating this but thought I'd ask anyway. Will post an answer once I find out if not answered.
The problem is as follows:
An application calls RabbitHutch.CreateBus to create an instance of IBus/IAdvancedBus to publish messages to RabbitMQ. The instance is returned but the IsConnected flag is set to false (i.e. connection retry is done in the background). When the application serves a specific request, IAdvancedBus.PublishAsync is called to publish a message while the bus still isn't connected. Under significant load, requests to the application end up timing out as the bus was never able to connect to RabbitMQ.
Same behaviour is observed when connectivity to RabbitMQ is lost while processing requests.
The question is:
How is EasyNetQ handling attempts to publish messages while the bus is disconnected?
Are messages queued in memory until the connection can be established? If so, is it disposing of messages after it reaches some limit? Is this configurable?
Or is it forcing the bus to try to connect to RabbitMQ?
Or is it dumping the message altogether?
Does having PublisherConfirms switched on affect this behavior?
I haven't been able to test all scenarios described above, but it looks like before trying to publish to RabbitMQ, EasyNetQ is checking that the bus is connected. If it isn't, it is entering a connection loop more or less as described here: https://github.com/EasyNetQ/EasyNetQ/wiki/Error-Conditions#there-is-a-network-failure-between-my-subscriber-and-the-rabbitmq-broker
As we increase load, the connection loops appear to spiral out of control: none of them ever manages to connect to RabbitMQ because our infrastructure or configuration is broken. I have not yet identified why we are getting timeouts, but I suspect a concurrency issue when several connection loops attempt to connect simultaneously.
I also doubt that switching off PublisherConfirms would help at all, as we are not able to publish messages and are therefore not waiting for acknowledgements from RabbitMQ.
Our solution:
So why have I not got a clear answer to this question? The truth is, at this point in time the messages that we are trying to publish are not mission critical, strictly speaking. If our configuration is wrong, deployment will fail when running a health check and we'll essentially abort the deployment. If RabbitMQ becomes unavailable for some reason, we are OK with not having these messages published.
Also, to avoid timing out, we're wrapping message publishing in a circuit breaker that stops publishing when it detects that the circuit between our application and RabbitMQ is open. Roughly speaking, this works as follows:
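    // Created once at application start-up, not before each call.
    var bus = RabbitHutch.CreateBus(...).Advanced;
    var rabbitMqCircuitBreaker = new CircuitBreaker(...);

    rabbitMqCircuitBreaker.AttemptCall(() => {
        // A disconnected bus is reported to the breaker as a failure.
        if (!bus.IsConnected)
            throw new Exception(...);
        bus.Publish(...);
    });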
Notice that we are notifying our circuit breaker that there is a problem when the IsConnected flag is set to false by throwing an exception. If the exception is thrown X number of times over a configured period of time, the circuit will open and we will stop trying to publish messages for a configured amount of time. We think that this is acceptable as the connection should be really quick and available 99.xxx% of the time if RabbitMQ is available. Also worth noting that the bus is created when our application is starting up, not before each call, therefore the likelihood of checking the flag before it is actually set in a valid scenario is pretty low.
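For readers who want to see the mechanics, here is a minimal sketch of a breaker like the one described above, written in Python rather than C# to keep it self-contained. It is not the CircuitBreaker class used in the snippet (its implementation is not shown in this post); the names max_failures, window, open_for and publish_if_connected are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Opens after `max_failures` failures within `window` seconds,
    then rejects calls for `open_for` seconds."""

    def __init__(self, max_failures, window, open_for, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.open_for = open_for
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None means the circuit is closed

    def attempt_call(self, action):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.open_for:
                raise CircuitOpenError("circuit is open; call skipped")
            # Open period elapsed: close the circuit and start counting afresh.
            self.opened_at = None
            self.failures.clear()
        try:
            return action()
        except Exception:
            # Keep only failures that are still inside the window.
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if len(self.failures) >= self.max_failures:
                self.opened_at = now
            raise


def publish_if_connected(is_connected, publish):
    """Mirror of the snippet above: a disconnected bus counts as a failure."""
    if not is_connected():
        raise ConnectionError("bus is not connected")
    publish()
```

While the circuit is open, callers get CircuitOpenError immediately instead of waiting on a connection attempt, which is what keeps requests from timing out.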
Works for us at the moment, any additional information would be appreciated.
Related
RabbitMQ seems to get into a weird state. Our application installs RabbitMQ and Erlang, one of its prerequisites. When we try to send a message to the queue, it throws exceptions, so the queue just keeps filling up. We need to either reboot the PC or restart the RabbitMQ server before we can send messages again.
Note: I don't have any logs and don't know the exact exception, as this happened during installation at a customer site and we have no access to their machines. The issue has only been seen at customer sites, on many of the platforms.
I would like suggestions as to what may cause this. Is there a way I can test for this weird state of RabbitMQ from code and restart it when it occurs? Or is there any generic way to handle this from code?
We have recently decided to start using on-premise Service Fabric and have encountered a 'dependency' problem.
We have several guest executables with dependencies between them, and they can't recover from a restart of the service they depend on without being restarted themselves.
An example to make it clear:
In the chart below service B is dependent on service A.
If service A encounters an unexpected error and gets restarted, service B will go into an 'error' state (which won't be reported to the fabric). This means service B will report an OK health state although it's in an error state.
We were thinking of a solution around these lines:
Raise an independent service which monitors the health state events of all replicas/partitions/applications in the cluster and contains the entire dependency tree.
When the health state of a service changes, the monitor restarts the services that depend on it directly, which causes a domino effect of events -> restarts until the entire subtree has been reset (as shown in the Event -> Action flow chart below).
The problem is that the healthReport events don't get sent within short intervals of time (meaning my entire system could be down and I wouldn't know for a few minutes). I could monitor the health state instead, but I need to know the history (even if the state is healthy now, it doesn't mean it wasn't in an error state earlier).
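As a rough illustration of the domino effect described above, the following Python sketch computes which services would be restarted, and in what order, when one service restarts. The depends_on map and the restart_order function are hypothetical names for illustration; nothing here calls the Service Fabric API.

```python
from collections import deque


def restart_order(depends_on, failed):
    """Return the services to restart after `failed` restarts,
    nearest dependents first (breadth-first over the dependency tree).

    `depends_on` maps each service to the services it depends on.
    """
    # Invert the map: for each service, who depends on it directly?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    order, seen, queue = [], {failed}, deque([failed])
    while queue:
        current = queue.popleft()
        for child in sorted(dependents.get(current, [])):
            if child not in seen:   # restart each service once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Breadth-first order is enough for a tree. If a service can depend on several others (a diamond shape), a topological sort would be needed to guarantee that it restarts after all of its dependencies.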
Another problem is that the events could pop at any service level (replica/partition), and it would require me to aggregate all the events.
I would really appreciate any help on the matter. I am also completely open to any other suggestion for this problem, even if it's in a completely other direction.
Cascading failures in services can generally be avoided by introducing fault tolerance at the communication boundaries between services. A few strategies to achieve this:
Introduce retries for failed operations with a delay in between. The delay between retries may grow exponentially. This is an easy option to implement if you are currently doing a lot of remote procedure call (RPC) style communication between services. It may be very effective if the services you depend on don't take too long to restart. Polly is a well-known library for implementing retries.
Use circuit breakers to cut off communications with failing services. In this metaphor, a closed circuit is formed between two services communicating normally. The circuit breaker monitors the communications. If it detects some number of failed communications, it 'opens' the circuit, causing any further communications to fail immediately. The circuit breaker then sends periodic queries to the failing service to check its health, and closes the circuit once the failing service becomes operational. This is a little more involved than retry policies since you are responsible for preventing an open circuit from crashing your service, and also for deciding what constitutes a healthy service. Polly also supports circuit breakers.
Use queues to form fully asynchronous communication between services. Instead of communicating directly from service B to A, queue outbound operations to A in service B. Process the queue in its own thread - do not allow communication failures to escape the queue processor. You may also add an inbound queue to service A to receive messages from service B's outbound queue to completely isolate message processing from the network. This is probably the most durable but also the most complex as it requires a very different architecture from RPC, and you must also decide how to deal with messages which fail repeatedly. You might retry failed messages immediately, send them to the back of the queue after a delay, send them to a dead letter collection for manual processing, or drop the message altogether. Since you're using guest executables you don't have the luxury of reliable collections to help with this process, so a third party solution like RabbitMQ might be useful if you decide to go this way.
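To make the first and third strategies concrete, here is a minimal Python sketch of an outbound queue processor that retries with exponential backoff and moves repeatedly failing messages to a dead letter list. It is an in-process illustration only; process_outbound, send, max_attempts and base_delay are assumed names, and a real service would run this on its own thread with a blocking get() and a durable queue rather than draining an in-memory one.

```python
import queue
import time


def process_outbound(outbound, send, dead_letters, max_attempts=3,
                     base_delay=0.5, sleep=time.sleep):
    """Drain `outbound`, retrying failed sends with exponential backoff.

    Messages that still fail after `max_attempts` go to `dead_letters`.
    Send failures never escape this function.
    """
    while True:
        try:
            message = outbound.get_nowait()
        except queue.Empty:
            return
        for attempt in range(max_attempts):
            try:
                send(message)
                break
            except Exception:
                if attempt + 1 == max_attempts:
                    dead_letters.append(message)      # give up on this message
                else:
                    sleep(base_delay * 2 ** attempt)  # delay doubles each retry
```

Dead-lettering is only one of the options listed above; sending the message to the back of the queue or dropping it would replace the dead_letters.append call.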
I have an issue with using PUBSUB on Azure.
The Azure firewall will close connections that are idle for any length of time. The length of time is under much debate, but people think it is around 5 - 15 minutes.
I am using Redis as a Message Queue. To do this ServiceStack.Redis library provides a RedisMqServer which subscribes to the following channel:
mq:topic:in
On a background thread it blocks receiving data from a socket, waiting to receive a message from Redis. The problem is:
If the socket waiting on a Redis message is idle for any length of time the Azure firewall closes the connection silently. My application is not aware as it is now waiting on a closed connection (which as far as it's concerned is open). The background thread is effectively hung.
I had thought to implement some kind of Keep Alive which would be to wait for a message for a minute, but if one is not received then PING the server with two goals:
Keep the connection open by telling Azure this connection is still being used.
Check if the connection has been closed, and if so start over and resubscribe.
However, when I implemented this I found that I cannot use the PING command while subscribed. I'm not sure why this is, but does anyone have an alternative solution?
I do not want to unsubscribe and resubscribe regularly as I may miss messages.
I have read the following article: http://blogs.msdn.com/b/cie/archive/2014/02/14/connection-timeout-for-windows-azure-cloud-service-roles-web-worker.aspx which talks about how the Azure Load Balancer tears down connections after 4 minutes. But even if I can keep a connection alive I still need to achieve the second goal of restarting a subscription if the connection is killed for another reason (redis node goes down).
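The keep-alive loop described in this question can be sketched as follows, in Python and against a hypothetical connection object. The methods subscribe, receive and keep_alive are assumptions for illustration and are not the ServiceStack.Redis API; keep_alive is deliberately left abstract because, as noted above, a plain PING was rejected on a subscribed connection.

```python
def listen(connect, channel, handle, idle_timeout=60.0, should_stop=lambda: False):
    """Receive messages, probing the connection whenever it has been idle.

    `connect` is a hypothetical factory returning an object with
    subscribe(channel), receive(timeout) -> message or None, and
    keep_alive(), the last two raising ConnectionError on a dead socket.
    """
    conn = None
    while not should_stop():
        try:
            if conn is None:
                conn = connect()          # (re)connect and resubscribe
                conn.subscribe(channel)
            message = conn.receive(timeout=idle_timeout)
            if message is None:
                conn.keep_alive()         # idle: generate traffic, detect a dead socket
            else:
                handle(message)
        except ConnectionError:
            conn = None                   # start over on the next iteration
```

This covers both goals: the idle probe keeps traffic flowing, and any ConnectionError, whatever its cause, leads to a fresh connection and subscription. Messages published while the connection is down would still be missed.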
I just implemented PING support in Pub/Sub mode in the unstable branch of Redis in this commit: https://github.com/antirez/redis/commit/27839e5ecb562d9b79e740e2e20f7a6db7270a66
This will be backported in the next few days into Redis 2.8 stable.
This is an issue due to our handling of keepAlive packets when hosting Redis in Azure. We will have this fixed shortly.
Also, as suggested above, you can keep the connection alive manually by pinging. For a pub/sub connection, a hack you can use today is to unsubscribe from a random channel (this is what StackExchange.Redis does).
When a client subscribes, that connection is basically blocked for outgoing commands as it is used to listen for incoming messages. One possible workaround is to publish a periodic keepalive message on the channel.
I've recently started hosting a side project of mine on the new Azure VMs. The app uses Redis as an in-memory cache. Everything was working fine in my local environment but now that I've moved the code to Azure I'm seeing some weird exceptions coming out of Booksleeve.
When the app first fires up everything works fine. However, after about 5-10 minutes of inactivity the next request to the app experiences a network exception (I'm at work right now and don't have the exact error messages on me, so I will post them when I get home if people think they're germane to the discussion). This causes the internal MessageQueue to close, which results in every subsequent Enqueue() throwing an exception ("The Queue Is Closed").
So after some googling I found this SO post: Maintaining an open Redis connection using BookSleeve about a DIY connection manager. I can certainly implement something similar if that's the best course of action.
So, questions:
Is it normal for the RedisConnection to close periodically after a certain amount of time?
I've seen the conn.SetKeepAlive() method but I've tried many different values and none seem to make a difference. Is there more to this or am I barking up the wrong tree?
Is the connection manager idea from the post above the best way to handle this scenario?
Can anyone shed any additional light on why hosting my Redis instance in a new Azure VM causes this issue? I can also confirm that if I run my local environment against the Azure Redis VM I experience this issue.
Like I said, if it's unusual for a Redis connection to die after inactivity, I will post the stack traces and exceptions from my logs when I get home.
Thanks!
UPDATE
Didier pointed out in the comments that this may be related to the load balancer that Azure uses: http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/12/windows-azure-load-balancer-timeout-details.aspx
Assuming that's the case, what would be the best way to implement a connection manager that could account for this goofy problem? I assume I shouldn't create a connection per unit of work, right?
From other answers/comments, it sounds like this is caused by the azure infrastructure shutting down sockets that look idle. You could simply have a timer somewhere that performs some kind of operation periodically, but note that this is already built into Booksleeve: when it connects, it checks what the redis connection timeout is, and configures a heartbeat to prevent redis from closing the socket. You might be able to piggy-back this to prevent azure closing the socket too. For example, in a redis-cli session:
config set timeout 30
should configure redis (on the fly, without having to restart) to have a 30 second connection timeout. Booksleeve should then automatically take steps to ensure that there is a heartbeat shortly before 30 seconds. Note that if this is successful, you should also edit your configuration file so that this setting applies after the next restart too.
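If you want to roll your own heartbeat instead, the idea is just a background timer that fires comfortably before the idle timeout. The Python sketch below is a generic illustration and does not reflect how Booksleeve implements its heartbeat; the Heartbeat class, heartbeat_interval and the 5-second margin are assumptions.

```python
import threading


class Heartbeat:
    """Calls `ping` every `interval` seconds on a background thread until stopped."""

    def __init__(self, ping, interval):
        self._ping = ping
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._ping()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def heartbeat_interval(server_timeout, margin=5.0):
    """Pick an interval below the idle timeout (the margin is an assumption)."""
    return max(1.0, server_timeout - margin)
```

With a 30 second timeout, heartbeat_interval(30) gives 25 seconds, so a cheap command would be sent every 25 seconds while the application is otherwise idle.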
The Load Balancer in Windows Azure will close a connection after some amount of time that depends on the total connection load on the load balancer, and because of this you will get seemingly random timeouts on your connection.
As I am not familiar with Redis connections, I am unable to suggest how to implement this correctly; in general, however, the suggested workaround is to have a heartbeat pulse to keep your session alive. Have you had a chance to look at the workaround suggested in the blog and try implementing it for Redis, to see if that works for you?
I'm trying to write a durable WCF service, whereby clients can handle that the server is unavailable (due to internet connection, etc) gracefully.
All evidence points to using the MSMQ binding, but I can't do that because my "server" is the Azure cloud, which does not support MSMQ.
Does anyone have a recommended alternative for accomplishing durable messaging with Azure?
EDIT: To clarify, it's important that the client (which does not run on Azure) has durable messaging to the server. This means that if the internet connection is unavailable (which may happen often, as it is connected over 3G cellular), messages are stored locally for later delivery.
Azure Queuing makes no sense because if the internet was reliable enough to deliver the message to the Azure queue, it could have just as easily delivered to the service directly.
It turns out nothing like this exists, so I'm developing it on my own. I have some basic queueing implemented and I'll have more updates soon. Stay tuned!
I would suggest an implementation that uses Azure queues. Basically, put your "request" in a queue, read the queue, and try to make the request. If the request succeeds, delete the message from the queue; if not, don't delete it. The Azure queue has a setting called visibility timeout, which sets how long a message that has been read stays hidden from subsequent readers. So in the scenario above, if you set your visibility timeout to 5 minutes, your retries would occur every 5 minutes. See these links for more information:
http://wag.codeplex.com/
http://azuretoolkit.codeplex.com/
http://msdn.microsoft.com/en-us/library/dd179363.aspx
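The visibility-timeout retry described in this answer can be illustrated with a small in-memory stand-in, written here in Python. VisibilityQueue and process_one are hypothetical names and this is not the Azure Storage SDK; the sketch only shows why leaving a message undeleted results in a retry after the timeout.

```python
import time


class VisibilityQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.clock = clock
        self._messages = {}   # id -> [body, time at which it becomes visible]
        self._next_id = 0

    def put(self, body):
        self._messages[self._next_id] = [body, self.clock()]
        self._next_id += 1

    def get(self):
        """Return (id, body) of a visible message and hide it, or None."""
        now = self.clock()
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)


def process_one(q, make_request):
    """Try one message; delete it only if the request succeeds."""
    item = q.get()
    if item is None:
        return False
    msg_id, body = item
    try:
        make_request(body)
    except Exception:
        return False      # leave it; it reappears after the timeout
    q.delete(msg_id)
    return True
```

With visibility_timeout=300, a message whose request failed is invisible for five minutes and is then handed out again, which gives the five-minute retry cadence mentioned above.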