Redis connection errors when using Booksleeve Redis client in Azure VM - C#

I've recently started hosting a side project of mine on the new Azure VMs. The app uses Redis as an in-memory cache. Everything was working fine in my local environment but now that I've moved the code to Azure I'm seeing some weird exceptions coming out of Booksleeve.
When the app first fires up, everything works fine. However, after about 5-10 minutes of inactivity, the next request to the app experiences a network exception (I'm at work right now and don't have the exact error messages on me, so I will post them when I get home if people think they're germane to the discussion). This causes the internal MessageQueue to close, which results in every subsequent Enqueue() throwing an exception ("The Queue Is Closed").
So after some googling I found this SO post, Maintaining an open Redis connection using BookSleeve, about a DIY connection manager. I can certainly implement something similar if that's the best course of action.
So, questions:
Is it normal for the RedisConnection to close periodically after a certain amount of time?
I've seen the conn.SetKeepAlive() method but I've tried many different values and none seem to make a difference. Is there more to this or am I barking up the wrong tree?
Is the connection manager idea from the post above the best way to handle this scenario?
Can anyone shed any additional light on why hosting my Redis instance in a new Azure VM causes this issue? I can also confirm that if I run my local environment against the Azure Redis VM I experience this issue.
Like I said, if it's unusual for a Redis connection to die after inactivity, I will post the stack traces and exceptions from my logs when I get home.
Thanks!
UPDATE
Didier pointed out in the comments that this may be related to the load balancer that Azure uses: http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/12/windows-azure-load-balancer-timeout-details.aspx
Assuming that's the case, what would be the best way to implement a connection manager that could account for this goofy problem? I assume I shouldn't create a connection per unit of work, right?
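Something along these lines is what I have in mind, based on that post: a single shared connection that is lazily recreated when it dies, rather than a connection per unit of work. A rough sketch (the host name is a placeholder and error handling is omitted):

using BookSleeve;

public static class RedisManager
{
    private static RedisConnection _conn;
    private static readonly object _lock = new object();

    public static RedisConnection GetConnection()
    {
        var conn = _conn;
        if (conn != null && conn.State == RedisConnectionBase.ConnectionState.Open)
            return conn;

        lock (_lock)
        {
            if (_conn == null || _conn.State != RedisConnectionBase.ConnectionState.Open)
            {
                _conn = new RedisConnection("myredis.cloudapp.net"); // placeholder host
                _conn.Wait(_conn.Open()); // block once while (re)connecting
            }
            return _conn;
        }
    }
}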

From other answers/comments, it sounds like this is caused by the Azure infrastructure shutting down sockets that look idle. You could simply have a timer somewhere that performs some kind of operation periodically, but note that this is already built into Booksleeve: when it connects, it checks what the Redis connection timeout is, and configures a heartbeat to prevent Redis from closing the socket. You might be able to piggy-back on this to prevent Azure closing the socket too. For example, in a redis-cli session:
config set timeout 30
should configure redis (on the fly, without having to restart) to have a 30 second connection timeout. Booksleeve should then automatically take steps to ensure that there is a heartbeat shortly before 30 seconds. Note that if this is successful, you should also edit your configuration file so that this setting applies after the next restart too.
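If you would rather force a heartbeat from the client side as well, a minimal sketch (host name and interval are assumptions; as I understand it, SetKeepAlive overrides the interval Booksleeve inferred from the server timeout):

using BookSleeve;

var conn = new RedisConnection("myredis.cloudapp.net"); // placeholder host
conn.Wait(conn.Open()); // block once at startup while connecting
conn.SetKeepAlive(15);  // heartbeat every 15 seconds, comfortably under the 30 second timeout above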

The load balancer in Windows Azure will close a connection after a certain amount of time, depending on the total connection load on the load balancer, which is why you are getting random timeouts on your connections.
As I'm not familiar with Redis connections I can't suggest how to implement it correctly, but in general the suggested workaround is to have a heartbeat pulse that keeps your session alive. Have you had a chance to look at the workaround suggested in the blog post and try to implement it with Redis, to see if that works for you?

Related

How to troubleshoot MaxConcurrentSessions exceeded in IIS hosted WCF Service

I'm way out of my comfort zone, so bear with me on providing the relevant information. We have just moved an IIS-hosted WCF service to a new server and clients calling this service started experiencing timeouts. It does OK for about 10 minutes after recycling the app pool and then everything begins timing out. We enabled WCF tracing, where I can see that it's saying the MaxConcurrentSessions has been exceeded. The documentation says that value defaults to 100 x [# of processors], so it should be 200 for us.
The server is behind a load balancer but is currently the only server. We notice connections hover at around 6 per second in Performance Monitor but climb to around 30 when the timeouts happen, and continue climbing from there.
The clients are connecting using a wsHttpBinding with TransportWithMessageCredential security. The service validates the credentials provided in the message using the ASP.NET membership provider in a custom UserNamePasswordValidator configured for use on the server binding behavior. The clients do not enable reliableSession on their bindings. The service uses the default SessionMode and InstanceContextMode, which I believe are Allowed and PerSession respectively? We do not call Close on the service proxies because in past investigation I'd found this only sets a flag on the proxy object preventing it from being re-used, and ours always go out of scope anyway... but we are now testing to see whether this does close the connection.
If I'm interpreting the WCF trace log correctly (and I don't understand the majority of what I'm reading there), it appears we are processing around 30-40 messages per minute and that each request is completed in less than 300 ms (usually much less; on rare occasions nearly 1 s). I determined the number of messages by counting the "Processing message n" entries over a few one-minute spans. So if we're getting 40 per minute and it takes 100 s for those connections/sessions to time out and close, we would still only have about 68 open at once before the first ones begin to time out. Not close to the 200 limit. Does the connection for a single client request get more than one session?
The strange thing is we didn't have any timeouts before, and we copied the service and web.config straight over to the new server. I believe the server and IIS versions were upgraded (Server 2016, IIS 10). Can you please help me identify and provide the relevant information to track down the problem causing these timeouts?
Edit:
From my reading, everything seems to indicate that the client must call Close, otherwise the server will leave the connection open until it times out. However, in our test, we see one connection created in Performance Monitor but it remains open after Close has been called anyway. So I can't determine if the need to call Close is a rumor or if we are misinterpreting our monitoring. The real test would be to call Close everywhere and see if it eliminates our timeouts.
After increasing our MaxConcurrentSessions to 400, we saw in Performance Monitor the number of concurrent sessions and instances steadily rise by about 1 per second up to about 225, where it finally leveled off, and it's hovering around there. So it seems like sessions are not being closed.
Well, we figured it out. There was nothing that just popped up and told us what the problem was, and it took a lot of brainstorming, but here's what we did:
Enabled WCF tracing. Went through the traces and was able to understand enough to basically see that the traffic didn't look out of the ordinary. All of the events seemed to be for the expected amount and types of service calls. Viewing them in SvcTraceViewer, it didn't seem to be a DoS attack or anything like that. We just used the default configuration from that link, but it looks like it can be heavily customized to provide the specific information you're after if you know what that is.
What really helped in this case was finding the WCF performance counters. Initially we were using the ASP.NET performance counters to look at open sessions, which was not the right metric. This CodeProject guide helped us enable the WCF performance counters to give us insight into the number of sessions and the limit in real time.
It also helped to brush up on how WCF sessions and instances are related as well as creation of a security context:
https://www.codeproject.com/Articles/188749/WCF-Sessions-Brief-Introduction
http://webservices20.blogspot.com/2009/01/wcf-performance-gearing-up-your-service.html
https://learn.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/hh273122(v=vs.100)
We were able to see the percentage of the max WCF sessions being used, and observed it climbing higher and higher towards the default limit of 200 (100 per processor) before eventually leveling off between 150 and 200. This leveling off, together with far more sessions existing at a given time than the average number of requests per minute seen in our WCF tracing, indicated that sessions were eventually being closed, but only when they timed out rather than as soon as the server completed the request.
Somewhere on Stack Overflow, that I've been unable to find, I once asked about the purpose of the ClientBase<TChannel>.Close method (a.k.a. the Close method of a WCF service proxy) and, somewhat incorrectly, came to the conclusion that all it did was set a flag on the proxy object marking it closed so that it couldn't be used again. The documentation's description of the method seems in line with that:
"Causes the ClientBase<TChannel> object to transition from its current state into the closed state."
Well, at the point that I would call Close, my references always just go out of scope anyway, allowing garbage collection to clean them up, so that seemed pointless. But I think a key factor was that that question was about basicHttpBindings, which are stateless. In this case we are using wsHttpBindings, which are stateful, which means the server keeps the session and leaves the connection open after it completes the request so that subsequent calls from the client can be made on the same connection. So, though I couldn't find any documentation or track down in the source code where it happens, it seems WCF clients must call Close on their service proxy after they make their last request in order to tell the server it can close the connection and free up that session slot. I didn't have the opportunity to look for a message sent to the server upon calling Close to do this, but we were able to observe, using the performance counters, the number of sessions dropping from 1 to 0 where before it would remain at 1 after our client called the service.
But are we saying a WCF client, which we may have no control over, is able to harm server performance and possibly create a denial of service if they aren't diligent in their coding and don't remember to call Close, and the server has no control over its own performance? That sounds like a recipe for disaster. Well, there are two things you can do on the server to mitigate this.
First, you can increase the max number of sessions. In our case we were hovering around 175 but occasionally, under traffic spikes, exceeding the 200. We bumped it up to 800 temporarily to ensure we wouldn't exceed the max. The trade-off is dedicating more server resources to holding sessions that will probably never be used again until they time out. Luckily, the server also controls the timeout. The service can control the length these sessions are held open using the ReceiveTimeout and the InactivityTimeout. Both default to 10 minutes, and the lesser of the two will be used. If you're thinking, "Receive timeout sounds wrong. That controls the amount of time the service can take to receive a large message," you're not alone. However, that's incorrect. On the server side:
ReceiveTimeout – used by the Service Framework Layer to initialize the session-idle timeout which controls how long a session can be idle before timing out.
And on the client side it is not used. So we set our ReceiveTimeout to 30 seconds and the number of sessions dropped significantly. That may actually have been too low, because some spots in code that do re-use the service proxy (making multiple calls in a loop, for instance, or doing some data processing in between calls) are now getting an error when trying to call the service after the session has been closed. So you will have to find the right balance. But best practice, it seems, is to close your connections.
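For illustration, here is a minimal self-hosted sketch showing both knobs; the service/contract names, address, and values are placeholders, and under IIS you would set the equivalents in web.config rather than in code:

using System;
using System.ServiceModel;
using System.ServiceModel.Description;

var binding = new WSHttpBinding(SecurityMode.TransportWithMessageCredential)
{
    ReceiveTimeout = TimeSpan.FromSeconds(30) // session-idle timeout on the server side
};

var host = new ServiceHost(typeof(MyService)); // placeholder service type
host.Description.Behaviors.Add(new ServiceThrottlingBehavior
{
    MaxConcurrentSessions = 400 // raised from the default of 100 per processor
});
host.AddServiceEndpoint(typeof(IMyService), binding, "https://localhost/MyService"); // placeholder address
host.Open();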
One gotcha to watch out for is using Dispose on your service proxy. I had always tried typing .dispo to see if IntelliSense would pop up the Dispose method on my proxy and found that it didn't, so I assumed it didn't implement IDisposable and didn't need to be closed or disposed. It turns out it does implement IDisposable, but it does so explicitly, so you'd have to cast it to IDisposable to call Dispose on it. But wait! Don't go putting your proxy in a using statement just yet. The implementation of Dispose simply calls Close on the proxy, which will throw an exception if the proxy is in the faulted state (i.e. if a service call threw an exception). So you can't safely do something like this:
using (MyWcfClient proxy = new MyWcfClient())
{
    try
    {
        proxy.Calculate();
    }
    catch (Exception)
    {
    }
}
because if Calculate throws an exception, the closing bracket of the using block will also throw an exception when it tries to dispose your proxy. Instead you just have to call Close after your last service method call. Evidently you can also call Abort in the catch, but I'm not sure if that actually communicates with the server to end the session.
MyWcfClient proxy = new MyWcfClient();
try
{
    proxy.Calculate();
    proxy.Close();
}
catch (Exception)
{
    proxy.Abort();
}
Addendum
We surmise that the reason we started experiencing this when moving servers, and were not experiencing it before, is that we were using Barracuda products before and are now using Oracle, and perhaps the old load balancer or firewall was closing open connections for us.

What happens when publishing messages with EasyNetQ and the bus is disconnected?

I'm currently investigating this but thought I'd ask anyway. I will post an answer once I find out, if no one answers first.
The problem is as follows:
An application calls RabbitHutch.CreateBus to create an instance of IBus/IAdvancedBus to publish messages to RabbitMQ. The instance is returned but the IsConnected flag is set to false (i.e. connection retry is done in the background). When the application serves a specific request, IAdvancedBus.PublishAsync is called to publish a message while the bus still isn't connected. Under significant load, requests to the application end up timing out as the bus was never able to connect to RabbitMQ.
Same behaviour is observed when connectivity to RabbitMQ is lost while processing requests.
The question is:
How is EasyNetQ handling attempts to publish messages while the bus is disconnected?
Are messages queued in memory until the connection can be established? If so, is it disposing of messages after it reaches some limit? Is this configurable?
Or is it forcing the bus to try to connect to RabbitMQ?
Or is it dumping the message altogether?
Does having PublisherConfirms switched on impact the behavior?
I haven't been able to test all scenarios described above, but it looks like before trying to publish to RabbitMQ, EasyNetQ is checking that the bus is connected. If it isn't, it is entering a connection loop more or less as described here: https://github.com/EasyNetQ/EasyNetQ/wiki/Error-Conditions#there-is-a-network-failure-between-my-subscriber-and-the-rabbitmq-broker
As we increase load, it looks as if connection loops are spiralling out of control, as none of them ever manage to connect to RabbitMQ because our infrastructure or configuration is broken. Why we are getting timeouts I have not yet identified, but I suspect there could be a concurrency issue when several connection loops attempt to connect simultaneously.
I also doubt that switching off PublisherConfirms would help at all as we are not able to publish messages and therefore not waiting for acknowledgement from RabbitMQ.
Our solution:
So why have I not got a clear answer to this question? The truth is, at this point in time the messages that we are trying to publish are not mission critical, strictly speaking. If our configuration is wrong, deployment will fail when running a health check and we'll essentially abort the deployment. If RabbitMQ becomes unavailable for some reason, we are OK with not having these messages published.
Also, to avoid timing out, we're wrapping message publishing in a circuit breaker that stops publishing if we detect that the circuit between our application and RabbitMQ is open. Roughly speaking, it works as follows:
var bus = RabbitHutch.CreateBus(...).Advanced;
var rabbitMqCircuitBreaker = new CircuitBreaker(...);

rabbitMqCircuitBreaker.AttemptCall(() =>
{
    if (!bus.IsConnected)
        throw new Exception(...);
    bus.Publish(...);
});
Notice that we are notifying our circuit breaker that there is a problem when the IsConnected flag is set to false by throwing an exception. If the exception is thrown X number of times over a configured period of time, the circuit will open and we will stop trying to publish messages for a configured amount of time. We think that this is acceptable as the connection should be really quick and available 99.xxx% of the time if RabbitMQ is available. Also worth noting that the bus is created when our application is starting up, not before each call, therefore the likelihood of checking the flag before it is actually set in a valid scenario is pretty low.
Works for us at the moment, any additional information would be appreciated.

Why can I not PING when Subscribed using PUBSUB?

I have an issue with using PUBSUB on Azure.
The Azure firewall will close connections that are idle for any length of time. The length of time is under much debate, but people think it is around 5 - 15 minutes.
I am using Redis as a message queue. To do this, the ServiceStack.Redis library provides a RedisMqServer which subscribes to the following channel:
mq:topic:in
On a background thread it blocks receiving data from a socket, waiting to receive a message from Redis. The problem is:
If the socket waiting on a Redis message is idle for any length of time, the Azure firewall closes the connection silently. My application is not aware, as it is now waiting on a closed connection (which as far as it's concerned is open). The background thread is effectively hung.
I had thought to implement some kind of keep-alive: wait for a message for a minute, and if one is not received, PING the server with two goals:
1. Keep the connection open by telling Azure this connection is still being used.
2. Check if the connection has been closed; if so, start over and resubscribe.
However, when I implemented this I found that I cannot use the PING command whilst subscribed. I'm not sure why this is, but does anyone have an alternative solution?
I do not want to unsubscribe and resubscribe regularly as I may miss messages.
I have read the following article: http://blogs.msdn.com/b/cie/archive/2014/02/14/connection-timeout-for-windows-azure-cloud-service-roles-web-worker.aspx which talks about how the Azure Load Balancer tears down connections after 4 minutes. But even if I can keep a connection alive, I still need to achieve the second goal of restarting a subscription if the connection is killed for another reason (e.g. the Redis node goes down).
I just implemented PING support in Pub/Sub mode in the unstable branch of Redis in this commit: https://github.com/antirez/redis/commit/27839e5ecb562d9b79e740e2e20f7a6db7270a66
This will be backported into Redis 2.8 stable in the coming days.
This is an issue due to our handling of keepAlive packets when hosting Redis in Azure. We will have this fixed shortly.
Also, as suggested above, you can keep the connection alive manually by pinging. For a sub/pub connection, a hack you can use today is calling UNSUBSCRIBE with a random channel name. (This is what StackExchange.Redis does.)
When a client subscribes, that connection is basically blocked for outgoing commands as it is used to listen for incoming messages. One possible workaround is to publish a periodic keepalive message on the channel.
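As a hedged sketch of that workaround (channel name, host, and interval are assumptions, and the subscriber has to recognise and skip the keepalive payload), a second ServiceStack.Redis connection could publish the heartbeat:

using System;
using System.Threading;
using ServiceStack.Redis;

var keepAlive = new Timer(_ =>
{
    using (var client = new RedisClient("myredis.cloudapp.net", 6379)) // placeholder host
    {
        client.PublishMessage("mq:topic:in", "KEEPALIVE"); // subscriber must ignore this payload
    }
}, null, TimeSpan.Zero, TimeSpan.FromMinutes(1)); // well under the ~4 minute Azure idle limit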

SOAP - Operation Timed Out - Rule out my server

I am using SOAP in C# .NET 3.5 to consume a web service from a video game company. I am getting lots of SoapExceptions with the error "Operation Timed Out".
While one process is timing out, others fly by with no problems. I would like to rule out a problem on my end, but I have no idea where to begin. My timeout is 5 minutes. For every 5,000 requests, maybe 500 fail.
Anyone have some advice for diagnosing web services failures? The web service owner will probably give no support to helping me on this, as it's a free service.
Thanks
I've had to do a lot of debugging connecting to a SOAP Service using PHP and timeouts are the worst problem. Normally the problem is the 'client' doesn't have a high enough timeout and bombs after something like 30s.
I test making the calls using SoapUI. I keep raising the client-side timeout there until I find a value that works, then apply that timeout to my client and re-test.
Your only solution may be to make sure your 'clients' have a high enough timeout that will work for everything. 5 minutes should be fine for most of your server-side timeouts.
OK, this is a huge question and there is a lot it could be.
Have you tackled the HTTP two-connection limit? http://madskristensen.net/post/Optimize-HTTP-requests-and-web-service-calls.aspx (see the sketch at the end of this answer for one way to raise it)
Have you got enough IO threads to cater for the load? Use performance monitoring to check this for your app pool; I think there is an IO threads counter. A quick google turned this up: http://www.guidanceshare.com/wiki/ASP.NET_2.0_Performance_Guidelines_-_Threading
Are you exhausting your bandwidth? Use performance monitoring again to check the usage of your network card.
This is a really hard subject to broach textually, as it is so dependent on environment, but I hope these might help.
This also looks interesting - http://software.intel.com/en-us/articles/how-to-tune-the-machineconfig-file-on-the-aspnet-platform/
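On the two-connection limit mentioned above, a minimal sketch of raising it in code (the multiplier is a commonly suggested starting point, not a verified optimum for this service):

using System;
using System.Net;

// Must run before the first outgoing request; the same limit can be set
// via <connectionManagement> in app.config instead.
ServicePointManager.DefaultConnectionLimit = 12 * Environment.ProcessorCount;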

Fast way to check if a server is accessible over the network in C#

I've got a project where I'm hitting a bunch of custom Windows Performance Counters on multiple servers and aggregating them into a database. If a server is down, I want to skip it, and just continue on with my day.
Currently I'm checking to see if a server is live by doing a DirectoryInfo on a share that I've got to look at later in the process anyway, then checking the .Exists property. This is my current code snippet for testing:
DirectoryInfo di = new DirectoryInfo(machine.Share_Path);
if (!di.Exists)
{
    log.Warn("Could not access " + machine.Name + "! Maybe it's down?");
    continue; // Skips to the next server in my loop where this snippet exists.
}
This works, but it's pretty slow. It takes about 68 seconds on average for the di.Exists bit to finish its work, and I ideally need to know within a second whether or not a server is accessible. Pinging also isn't an option since a server can be pingable but not "live" in our environment.
I'm still kind of fresh to the .NET world, so I'm open to any advice people can offer.
Thanks in advance.
-Weegee
Ping First, Ask Questions Later
Why not ping first, and then do the di.Exists if you get a response?
That would allow you to fail early in the case that it's not reachable, and not waste the time for machines that are down hard.
I have, in fact, used this method successfully before.
MSDN Ping Documentation
Parallelize
Another option you have is to parallelize the checking, and act on the servers as they are known to be available.
You could use the Parallel.ForEach() method, and use a thread-safe queue along with a simple consumer thread to do the required action. Combined with the checking method above, this could alleviate almost all of your bottleneck on the up/down checking.
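A rough sketch of that combination, assuming a ping as the first-pass check (server names are placeholders; swap IsServerUp for whatever probe you settle on):

using System.Collections.Concurrent;
using System.Net.NetworkInformation;
using System.Threading.Tasks;

static bool IsServerUp(string host)
{
    // cheap first-pass check with a 500 ms budget
    using (var ping = new Ping())
        return ping.Send(host, 500).Status == IPStatus.Success;
}

string[] servers = { "server01", "server02" }; // placeholder list
var reachable = new ConcurrentQueue<string>();

Parallel.ForEach(servers, server =>
{
    if (IsServerUp(server))
        reachable.Enqueue(server); // a consumer thread drains this queue and does the real work
});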
Knock on the Door
Yet another method would be to check if the required remote service is running (either by hitting its port directly or by querying it with WMI).
Since WMI is almost always running when a machine is up, your connection should be very quick to either succeed or fail.
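A hedged sketch of the WMI variant (the scope path is the standard \\server\root\cimv2 namespace; note that a firewalled or missing host may still take a while to fail):

using System;
using System.Management; // requires a reference to System.Management.dll

static bool WmiIsUp(string server)
{
    try
    {
        var scope = new ManagementScope(@"\\" + server + @"\root\cimv2");
        scope.Connect(); // throws if WMI on the remote box is unreachable
        return scope.IsConnected;
    }
    catch (Exception) // COMException / UnauthorizedAccessException, depending on the failure mode
    {
        return false;
    }
}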
The only "quick" way I think to see if it's up without relying on ping would be to create a socket, and see if you can actually connect to the port of the service you're trying to reach.
This would be the equivalent of telnet servername 135 to see if it's up.
Specifically...
Create a .NET TCP socket client (System.Net.Sockets.TcpClient)
Call BeginConnect() as an asynchronous operation, to connect to the server in question on one of the RPC ports that your directory exists code would use anyway (TCP 135, 139, or 445).
If you don't hear back from it within X milliseconds, call Close() to cancel the connection.
Disclaimer: I have no idea what effect this would have on any threat/firewall protection that may see this type of Connect / Disconnect with no data sent activity as a threat.
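To make those steps concrete, a minimal sketch (port choice and the 500 ms budget are assumptions to tune for your network):

using System;
using System.Net.Sockets;

static bool CanConnect(string host, int port, int timeoutMs)
{
    using (var client = new TcpClient())
    {
        IAsyncResult ar = client.BeginConnect(host, port, null, null);
        if (!ar.AsyncWaitHandle.WaitOne(timeoutMs))
        {
            client.Close(); // no response in time: treat the server as down
            return false;
        }
        try
        {
            client.EndConnect(ar); // throws if the connection was refused
            return true;
        }
        catch (SocketException)
        {
            return false;
        }
    }
}

// e.g. CanConnect(machine.Name, 135, 500)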
Opening a socket to a specific port usually does the trick. If you really want it to be fast, be sure to set the NoDelay property on the new socket (disabling the Nagle algorithm) so there is no buffering.
"Fast" will largely depend on latency, but this is probably the fastest way I know of to connect to an endpoint. It's pretty simple to parallelize using the async methods. How fast you can check will largely depend on your network topology, but in tests against 1000 servers (latency between 0-75 ms) I've been able to get connectivity state in ~30 seconds. Not scientific data at all, but it should give you the idea.
Also, don't ever do this through UNC file shares, because if the server no longer exists you will have a lot of dangling connections that take forever to time out. So if you have a lot of servers with invalid DNS records and you try to poll them, you will bring Windows down completely over time. Things like File.Exists and any other file access will cause this.
The "Full-Blown" option would be to install a monitoring tool like SCOM (System Center Operations Manager), this has an SDK you can use to query SCOM for (performance) and maintenance information avout machines being monitored. Might be a bridge to far though....
Telnet is another option. Try telnetting to the target machine to see if it responds.
Create a small Windows Service that you install on your target machine, and have the sysadmin stop it when they perform maintenance on the target machine (just use a batch file to net stop / net start the service).
