I have an issue with using PUBSUB on Azure.
The Azure firewall closes connections that have been idle for a period of time. The exact length is under much debate, but it is thought to be somewhere around 5-15 minutes.
I am using Redis as a message queue. To do this, the ServiceStack.Redis library provides a RedisMqServer which subscribes to the following channel:
mq:topic:in
On a background thread it blocks on a socket read, waiting to receive a message from Redis. The problem is:
If the socket waiting on a Redis message is idle for long enough, the Azure firewall closes the connection silently. My application is not aware of this, as it is still waiting on a closed connection (which, as far as it is concerned, is open). The background thread is effectively hung.
I had thought to implement some kind of keep-alive: wait for a message for a minute and, if one is not received, PING the server, with two goals:
1. Keep the connection open by telling Azure this connection is still being used.
2. Check whether the connection has been closed and, if so, start over and resubscribe.
However, when I implemented this I found that I cannot use the PING command whilst subscribed. I'm not sure why this is, but does anyone have an alternative solution?
I do not want to unsubscribe and resubscribe regularly as I may miss messages.
I have read the following article: http://blogs.msdn.com/b/cie/archive/2014/02/14/connection-timeout-for-windows-azure-cloud-service-roles-web-worker.aspx which talks about how the Azure Load Balancer tears down connections after 4 minutes. But even if I can keep a connection alive, I still need to achieve the second goal of restarting the subscription if the connection is killed for some other reason (e.g. the Redis node goes down).
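For that second goal, the rough shape of the retry/resubscribe loop I have in mind is sketched below (a sketch only, using ServiceStack.Redis's CreateSubscription API; the host name is a placeholder). Note that this only covers the case where the blocking subscribe call actually faults; it does not, by itself, detect the silent hang described above, which is why I still need some kind of keep-alive:

using System;
using System.Threading;
using ServiceStack.Redis;

// Sketch: re-create the subscription whenever the blocking subscribe
// call ends or faults (e.g. the Redis node went down).
public class ResilientSubscriber
{
    private readonly string _host;
    private readonly string _channel;

    public ResilientSubscriber(string host, string channel)
    {
        _host = host;
        _channel = channel;
    }

    public void Run(Action<string> onMessage, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                using (var client = new RedisClient(_host))
                using (var subscription = client.CreateSubscription())
                {
                    subscription.OnMessage = (channel, msg) => onMessage(msg);
                    // Blocks until the subscription ends or the socket faults.
                    subscription.SubscribeToChannels(_channel);
                }
            }
            catch (Exception)
            {
                // Connection was killed (idle timeout, Redis restart, ...):
                // wait a little and start over with a fresh connection.
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
        }
    }
}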
I just implemented PING support in Pub/Sub mode in the unstable branch of Redis in this commit: https://github.com/antirez/redis/commit/27839e5ecb562d9b79e740e2e20f7a6db7270a66
This will be backported into stable Redis 2.8 in the next few days.
This is an issue due to our handling of keepAlive packets when hosting Redis in Azure. We will have this fixed shortly.
Also, as suggested above, you can keep the connection alive manually by pinging. For a pub/sub connection, a hack you can use today is calling UNSUBSCRIBE on a random channel (this is what StackExchange.Redis does).
When a client subscribes, that connection is basically blocked for outgoing commands as it is used to listen for incoming messages. One possible workaround is to publish a periodic keepalive message on the channel.
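A minimal sketch of that workaround, assuming ServiceStack.Redis and a one-minute interval (the host and channel names are placeholders; the subscriber just has to ignore the "keepalive" payload):

using System;
using System.Threading;
using ServiceStack.Redis;

public static class RedisKeepAlive
{
    // Publishes a no-op "keepalive" message every minute on a separate
    // connection, so the subscriber's socket sees traffic and the Azure
    // load balancer does not consider it idle.
    public static Timer Start(string host, string channel)
    {
        return new Timer(_ =>
        {
            try
            {
                using (var client = new RedisClient(host))
                {
                    client.PublishMessage(channel, "keepalive");
                }
            }
            catch (Exception)
            {
                // Ignore failures here; the subscriber's own retry logic
                // has to deal with a dead Redis connection anyway.
            }
        }, null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
    }
}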
I'm currently investigating this, but thought I'd ask anyway. I will post an answer once I find out, if it hasn't been answered by then.
The problem is as follows:
An application calls RabbitHutch.CreateBus to create an instance of IBus/IAdvancedBus to publish messages to RabbitMQ. The instance is returned but the IsConnected flag is set to false (i.e. connection retry is done in the background). When the application serves a specific request, IAdvancedBus.PublishAsync is called to publish a message while the bus still isn't connected. Under significant load, requests to the application end up timing out as the bus was never able to connect to RabbitMQ.
Same behaviour is observed when connectivity to RabbitMQ is lost while processing requests.
The question is:
How is EasyNetQ handling attempts to publish messages while the bus is disconnected?
Are messages queued in memory until the connection can be established? If so, is it disposing of messages after it reaches some limit? Is this configurable?
Or is it forcing the bus to try to connect to RabbitMQ?
Or is it dumping the message altogether?
Does having PublisherConfirms switched on impact this behaviour?
I haven't been able to test all scenarios described above, but it looks like before trying to publish to RabbitMQ, EasyNetQ is checking that the bus is connected. If it isn't, it is entering a connection loop more or less as described here: https://github.com/EasyNetQ/EasyNetQ/wiki/Error-Conditions#there-is-a-network-failure-between-my-subscriber-and-the-rabbitmq-broker
As we increase the load, it looks as if the connection loops spiral out of control, since none of them ever manages to connect to RabbitMQ because our infrastructure or configuration is broken. I have not yet identified why we are getting timeouts, but I suspect there could be a concurrency issue when several connection loops attempt to connect simultaneously.
I also doubt that switching off PublisherConfirms would help at all, as we are not able to publish messages in the first place and are therefore never waiting for an acknowledgement from RabbitMQ.
Our solution:
So why have I not got a clear answer to this question? The truth is, at this point in time the messages that we are trying to publish are not mission critical, strictly speaking. If our configuration is wrong, deployment will fail when running a health check and we'll essentially abort the deployment. If RabbitMQ becomes unavailable for some reason, we are OK with not having these messages published.
Also, to avoid timing out, we're wrapping message publishing in a circuit breaker that stops publishing if we detect that the circuit between our application and RabbitMQ is open. Roughly speaking, this works as follows:
var bus = RabbitHutch.CreateBus(...).Advanced;
var rabbitMqCircuitBreaker = new CircuitBreaker(...);

rabbitMqCircuitBreaker.AttemptCall(() =>
{
    // Signal a failure to the circuit breaker if the bus isn't connected yet
    if (!bus.IsConnected)
        throw new Exception(...);

    bus.Publish(...);
});
Notice that when the IsConnected flag is false, we notify our circuit breaker that there is a problem by throwing an exception. If the exception is thrown X number of times over a configured period of time, the circuit will open and we will stop trying to publish messages for a configured amount of time. We think this is acceptable, as the connection should be established quickly and be available 99.xxx% of the time if RabbitMQ is up. It is also worth noting that the bus is created when our application starts up, not before each call, so the likelihood of checking the flag before it has actually been set in a valid scenario is pretty low.
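For completeness, the CircuitBreaker above is our own small class, not something that ships with EasyNetQ. A stripped-down sketch of the idea (thresholds and timings are purely illustrative) looks roughly like this:

using System;

// Sketch of a very small circuit breaker. After a configured number of
// consecutive failures the circuit "opens" and calls are skipped until a
// cool-down period has elapsed. Not thread-safe; a real implementation
// would need locking or Interlocked operations.
public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTime _openedAtUtc = DateTime.MinValue;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    private bool IsOpen =>
        _consecutiveFailures >= _failureThreshold &&
        DateTime.UtcNow - _openedAtUtc < _openDuration;

    public void AttemptCall(Action action)
    {
        if (IsOpen)
            return; // circuit is open: skip the call entirely

        try
        {
            action();
            _consecutiveFailures = 0; // a success closes the circuit again
        }
        catch (Exception)
        {
            // The exception (e.g. our "bus not connected" exception above)
            // only serves to signal a failure; log it in real code.
            _consecutiveFailures++;
            if (_consecutiveFailures >= _failureThreshold)
                _openedAtUtc = DateTime.UtcNow;
        }
    }
}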
Works for us at the moment, any additional information would be appreciated.
Imagine some spherical horse in a vacuum:
I lost control of my client application, perhaps because some error happened, and I tried to reconnect to the hub immediately.
Is it possible that OnConnected fires before OnDisconnected and I end up on the server twice?
Edited:
Sorry, I didn't say that I meant the SignalR library. I think that if my application doesn't call stop(), the server will wait about 30 seconds by default, and I can connect to the server again before OnDisconnected is called. Is that right?
You'll have to look at it from the client's side. Also note that if you're using TCP, the following takes place:
TCP ensures that your packets arrive in the order they were sent. So let's imagine that at the same moment the "horse" hit the vacuum and the connection broke, your server was sending the next packet that would check the connection (if you implemented your server well enough, that is).
Here, there are two things that may happen:
The client has already recovered and can respond in time, meaning the interval during which the connection had problems was small enough that the next packet from the server had not arrived yet. So, to answer your question, there is no disconnection in the first place.
The next packet from the server arrived but the client is not responding (the connection is severed). The server would instantly take note of this, raising the OnDisconnected event. If the client recovers virtually at the same time the server takes note, then it would initiate another connection (OnConnected).
So there's no chance that the client would turn up twice. If anything, the disconnection interval will be small enough that the server doesn't notice the problem in the first place.
Again, another protocol may behave differently, but TCP is designed to guarantee a well-established connection and reliable communication between a server and its clients.
It's worth mentioning that many communication frameworks (if not all) use TCP under the hood by default.
A client can connect a second time while the first connection is open (it will have a separate connection id though).
If the client doesn't manage to notify the server that it's closing the connection, the server will wait for a certain amount of time before removing the connection (DisconnectTimeout).
So in that case, if you restart the connection immediately, it will be a new logical connection to the server with a new connection id.
SignalR will also try to reconnect to the existing connection when it is lost, in which case it would retain its connection id once reconnected. I would recommend reading the entire article about SignalR connection lifetime events.
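To illustrate the separate-connection-id point, here is a rough sketch (using the SignalR 2.x hub API; the hub name and the per-user dictionary are just for illustration) of tracking connection ids per user, so a second connection from the same client is visible on the server:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNet.SignalR;

// Sketch: track connection ids per user so that a second connection from
// the same client (which gets a brand new connection id) shows up as an
// extra entry rather than a duplicate user.
public class ChatHub : Hub
{
    private static readonly ConcurrentDictionary<string, HashSet<string>> Connections =
        new ConcurrentDictionary<string, HashSet<string>>();

    public override Task OnConnected()
    {
        var user = Context.User?.Identity?.Name ?? "anonymous";
        var set = Connections.GetOrAdd(user, _ => new HashSet<string>());
        lock (set)
        {
            set.Add(Context.ConnectionId);
        }
        return base.OnConnected();
    }

    public override Task OnDisconnected(bool stopCalled)
    {
        // stopCalled == false means the client vanished and the server only
        // noticed after the DisconnectTimeout expired.
        var user = Context.User?.Identity?.Name ?? "anonymous";
        if (Connections.TryGetValue(user, out var set))
        {
            lock (set)
            {
                set.Remove(Context.ConnectionId);
            }
        }
        return base.OnDisconnected(stopCalled);
    }
}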
I've recently started hosting a side project of mine on the new Azure VMs. The app uses Redis as an in-memory cache. Everything was working fine in my local environment but now that I've moved the code to Azure I'm seeing some weird exceptions coming out of Booksleeve.
When the app first fires up everything works fine. However, after about 5-10 minutes of inactivity, the next request to the app experiences a network exception (I'm at work right now and don't have the exact error messages on me, so I will post them when I get home if people think they're germane to the discussion). This causes the internal MessageQueue to close, which results in every subsequent Enqueue() throwing an exception ("The Queue Is Closed").
So after some googling I found this SO post, Maintaining an open Redis connection using BookSleeve, about a DIY connection manager. I can certainly implement something similar if that's the best course of action.
So, questions:
Is it normal for the RedisConnection to close periodically after a certain amount of time?
I've seen the conn.SetKeepAlive() method but I've tried many different values and none seem to make a difference. Is there more to this or am I barking up the wrong tree?
Is the connection manager idea from the post above the best way to handle this scenario?
Can anyone shed any additional light on why hosting my Redis instance in a new Azure VM causes this issue? I can also confirm that if I run my local environment against the Azure Redis VM I experience this issue.
Like I said, if it's unusual for a Redis connection to die after inactivity, I will post the stack traces and exceptions from my logs when I get home.
Thanks!
UPDATE
Didier pointed out in the comments that this may be related to the load balancer that Azure uses: http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/12/windows-azure-load-balancer-timeout-details.aspx
Assuming that's the case, what would be the best way to implement a connection manager that accounts for this goofy problem? I assume I shouldn't create a connection per unit of work, right?
From other answers/comments, it sounds like this is caused by the Azure infrastructure shutting down sockets that look idle. You could simply have a timer somewhere that performs some kind of operation periodically, but note that this is already built into Booksleeve: when it connects, it checks what the Redis connection timeout is, and configures a heartbeat to prevent Redis from closing the socket. You might be able to piggy-back this to prevent Azure closing the socket too. For example, in a redis-cli session:
config set timeout 30
should configure redis (on the fly, without having to restart) to have a 30 second connection timeout. Booksleeve should then automatically take steps to ensure that there is a heartbeat shortly before 30 seconds. Note that if this is successful, you should also edit your configuration file so that this setting applies after the next restart too.
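For reference, the persisted equivalent is a single directive in redis.conf (the value should match whatever you set at runtime):

# Close the connection after a client is idle for N seconds (0 to disable)
timeout 30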
The Load Balancer in Windows Azure will close a connection after X amount of time, depending on the total connection load on the load balancer, and because of this you will get random timeouts on your connections.
As I am not very familiar with Redis connections, I am unable to suggest how to implement it correctly; however, in general the suggested workaround is to have a heartbeat pulse to keep your session alive. Have you had a chance to look at the workaround suggested in the blog and try to implement it in Redis, to see if it works for you?
I have one client subscribed to one channel. After a certain idle period (about 10 minutes), the client can no longer receive any messages, but the publish command still returns 1. I've tried the redis-py and ServiceStack.Redis clients. The only difference seems to be that the idle period can be a little longer with ServiceStack.Redis.
Any idea? Thanks in advance.
I had similar issues with an older version of Redis that was fixed by the latest version.
As an alternative you could try adding a separate thread that sends a "PING" command once in a while to keep the connection up.
I'm trying to write a durable WCF service, whereby clients can handle that the server is unavailable (due to internet connection, etc) gracefully.
All evidence points to using the MSMQ binding, but I can't do that because my "server" is the Azure cloud, which does not support MSMQ.
Does anyone have a recommended alternative for accomplishing durable messaging with Azure?
EDIT: To clarify, it's important that the client (which does not run on Azure) has durable messaging to the server. This means that if the internet connection is unavailable (which may happen often due to it being connected over 3G cellular), messages are stored locally for later delivery.
Azure queuing makes no sense here, because if the internet connection were reliable enough to deliver the message to the Azure queue, it could just as easily have delivered it to the service directly.
It turns out nothing like this exists, so I'm developing it on my own. I have some basic queueing implemented and I'll have more updates soon. Stay tuned!
I would suggest some implementation that uses Azure queues. Basically, just put your "request" in a queue, read the queue, and try to make the request; if the request succeeds, delete the message from the queue, and if not, leave the message there. The Azure queue has a setting called visibility timeout, which sets how long a message is hidden from potential future readers. So in the scenario above, if you set your visibility timeout to 5 minutes, your retries would occur every 5 minutes (see the rough sketch after the links below). See these links for more information:
http://wag.codeplex.com/
http://azuretoolkit.codeplex.com/
http://msdn.microsoft.com/en-us/library/dd179363.aspx
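Here is a rough sketch of that loop using the (classic) Microsoft.WindowsAzure.Storage client library; the queue name and the DoRequest call are placeholders:

using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

// Sketch: dequeue a "request" with a 5 minute visibility timeout. If
// processing succeeds the message is deleted; if it fails (or the worker
// dies), the message reappears after the visibility timeout and is retried.
public static class QueueRetryWorker
{
    public static void Run(string connectionString, CancellationToken token)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var queue = account.CreateCloudQueueClient().GetQueueReference("requests");
        queue.CreateIfNotExists();

        while (!token.IsCancellationRequested)
        {
            // The message becomes invisible to other readers for 5 minutes.
            var message = queue.GetMessage(TimeSpan.FromMinutes(5));
            if (message == null)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5)); // queue is empty
                continue;
            }

            try
            {
                DoRequest(message.AsString);   // placeholder for the real call
                queue.DeleteMessage(message);  // success: remove it for good
            }
            catch (Exception)
            {
                // Do nothing: the message reappears after the visibility
                // timeout and will be retried on the next pass.
            }
        }
    }

    private static void DoRequest(string payload)
    {
        // placeholder: make the actual request to the service here
    }
}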