Detect and handle unresponsive RabbitMQ in .NET application - c#

We recently had an outage where one of our APIs became unresponsive because our RabbitMQ cluster was put under artificially high load. We were running out of threads in Mono (.NET) and requests to the API failed. Although this is unlikely to happen again, we would like to put some protection in against it. Ideally we would have calls to bus.Publish() time out after a set amount of time, but we can't work out how.
We then came across the blocked connections notification feature of RabbitMQ and thought this might help. However we can't figure out how to get at the connection object that is in the IServiceBus. So far we have tried
_serviceBus = serviceBus;
var connection =
((MassTransit.Transports.RabbitMq.RabbitMqEndpointAddress) _serviceBus.Endpoint.Address)
.ConnectionFactory.CreateConnection();
connection.ConnectionBlocked += Connection_ConnectionBlocked;
connection.ConnectionUnblocked += Connection_ConnectionUnblocked;
But when we do this we get a BrokerUnreachableException which I don't understand.
My questions are, is this the right approach to detect timeouts and fail (we have a backup mechanism to collect the data in the message and repost later) and if this is correct, how do we make it work?

I think you can manage this by using System.Timer or Observable.Timer to schedule checks, where each check uses request-response. The consumer for the request should be inside the same process. You can specify a cancellation token with a reasonable timeout for the Request call; if you get a timeout, your messaging infrastructure is down or too busy, or your endpoint is too busy.
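For illustration, a minimal sketch of that idea: a loop that periodically sends a ping request and treats a missed reply as "bus unhealthy". It uses the newer MassTransit IRequestClient API rather than the IServiceBus API from the question, so treat the exact calls as an assumption; PingMessage, PongMessage and the onUnhealthy action are made-up names.

using System;
using System.Threading;
using System.Threading.Tasks;
using MassTransit;

// Request/response messages used purely for the health check.
public class PingMessage { }
public class PongMessage { }

// Consumer hosted in the same process, so the round trip exercises the broker.
public class PingConsumer : IConsumer<PingMessage>
{
    public Task Consume(ConsumeContext<PingMessage> context)
    {
        return context.RespondAsync(new PongMessage());
    }
}

public class BusHealthCheck
{
    readonly IRequestClient<PingMessage> _client;
    readonly Action _onUnhealthy;

    public BusHealthCheck(IRequestClient<PingMessage> client, Action onUnhealthy)
    {
        _client = client;
        _onUnhealthy = onUnhealthy;
    }

    public async Task RunAsync(CancellationToken stopping)
    {
        while (!stopping.IsCancellationRequested)
        {
            using (var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(5)))
            {
                try
                {
                    await _client.GetResponse<PongMessage>(new PingMessage(), timeout.Token);
                }
                catch (Exception)
                {
                    // No reply within 5 seconds (or the request faulted): the broker
                    // or the endpoint is down or too busy, so switch to the backup path.
                    _onUnhealthy();
                }
            }

            await Task.Delay(TimeSpan.FromSeconds(30), stopping);
        }
    }
}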

Related

Real time data storage with Azure Tables using ASP.NET

I am using a Lab View application to simulate a test running, which would post a JSON string to my ASP.NET application. Within the ASP.NET application I format the data with the proper partition and row keys, then send it to Azure Table Storage.
The problem that I am having is that after what seems like a random amount of time (i.e. 5 minutes, 2 hours, 5 hours), the data fails to be saved into Azure. I am trying to catch any exceptions within the ASP.NET application and send the error message back to the Lab View app, and the Lab View app is also catching any exceptions it may encounter, so I can troubleshoot where the issue is occurring.
The only error that I am able to catch is a Timeout Error 56 in the Lab View program. My question is, does anyone have an idea of where I should be looking for the root cause of this? I do not know where to begin.
EDIT:
I am using a table storage writer that I found here to do batch operations with retries.
The constructor for exponential retry policy is below:
public ExponentialRetry(TimeSpan deltaBackoff, int maxAttempts)
When you (or, to be exact, the library you use) instantiate this as RetryPolicy = new ExponentialRetry(TimeSpan.FromMilliseconds(2), 100), you are basically setting the max attempts to 100, which means you may end up waiting up to around 2^100 milliseconds (there is some more math behind this, but simplifying) for each of your individual batch requests to fail on the client side before the SDK gives up retrying.
The other issue with that code is that it executes batch requests sequentially and synchronously. That has multiple bad effects: first, all subsequent batch requests are blocked by the current batch request; second, your cores are blocked waiting on I/O operations; third, it has no exception handling, so if one of the batch operations throws an exception, the method bails out and does not continue processing the other batch requests.
My recommendation: do not use that library; batch operations are fairly straightforward. The default retry policy, if you do not explicitly define one, is the exponential retry policy anyway, with sensible default parameters (3 retries), so you do not even need to define your own retry object. For best scalability and throughput, run your batch operations asynchronously (and concurrently), for example as sketched below.
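A rough sketch of what "async and concurrent" can look like with the classic Microsoft.WindowsAzure.Storage SDK (the same one that defines ExponentialRetry); MyEntity is a made-up entity type and the 100-entity chunking reflects the documented per-batch limit.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public class MyEntity : TableEntity
{
    public string Payload { get; set; }
}

public static class BatchWriter
{
    public static Task WriteAllAsync(CloudTable table, IEnumerable<MyEntity> entities)
    {
        // A single table batch may only touch one partition and at most 100 entities,
        // so group by partition key and chunk each group.
        var batchTasks = entities
            .GroupBy(e => e.PartitionKey)
            .SelectMany(group => group
                .Select((entity, index) => new { entity, index })
                .GroupBy(x => x.index / 100, x => x.entity))
            .Select(chunk =>
            {
                var batch = new TableBatchOperation();
                foreach (var entity in chunk)
                {
                    batch.InsertOrReplace(entity);
                }
                return table.ExecuteBatchAsync(batch);
            })
            .ToList();

        // Run the batches concurrently instead of one after another.
        return Task.WhenAll(batchTasks);
    }
}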
As to why things fail: when you write your own API, catch the StorageException and check the HTTP status code on the exception itself. Being throttled by Azure is one possibility, but it is hard to say without further debugging or without you providing the HTTP status code for the failed batch operations.
You need to check whether an exception is transient or not. As Peter said in his comment, the Azure Storage client already implements a retry policy. You can also wrap your code with another layer of retries (e.g. using Polly), or change the default policy associated with the Azure Storage client.
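A minimal sketch of the Polly approach, assuming the classic SDK's StorageException; the status codes and retry counts are illustrative, and this also shows the "check the HTTP status code on the exception" idea from the previous answer.

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;
using Polly;

public static class ResilientTableWriter
{
    public static Task ExecuteBatchWithRetriesAsync(CloudTable table, TableBatchOperation batch)
    {
        var retryPolicy = Policy
            // Only retry storage errors that look transient (throttling / server busy).
            .Handle<StorageException>(ex =>
                ex.RequestInformation.HttpStatusCode == 503 ||
                ex.RequestInformation.HttpStatusCode == 500)
            .WaitAndRetryAsync(
                3,
                attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                (exception, delay, attempt, context) =>
                    Console.WriteLine($"Batch attempt {attempt} failed, retrying in {delay}: {exception.Message}"));

        return retryPolicy.ExecuteAsync(() => table.ExecuteBatchAsync(batch));
    }
}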

WMQ: Distributing MQ readers over several machines

I am using WMQ to access an IBM WebSphere MQ on a mainframe - using c#.
We are considering spreading out our service on several machines, and we then need to make sure that two services on two different machines cannot read/get the same MQ message at the same time.
My code for getting messages is this:
var connectionProperties = new Hashtable();
const string transport = MQC.TRANSPORT_MQSERIES_CLIENT;
connectionProperties.Add(MQC.TRANSPORT_PROPERTY, transport);
connectionProperties.Add(MQC.HOST_NAME_PROPERTY, mqServerIP);
connectionProperties.Add(MQC.PORT_PROPERTY, mqServerPort);
connectionProperties.Add(MQC.CHANNEL_PROPERTY, mqChannelName);
_mqManager = new MQQueueManager(mqManagerName, connectionProperties);
var queue = _mqManager.AccessQueue(_queueName, MQC.MQOO_INPUT_SHARED + MQC.MQOO_FAIL_IF_QUIESCING);
var queueMessage = new MQMessage {Format = MQC.MQFMT_STRING};
var queueGetMessageOptions = new MQGetMessageOptions {Options = MQC.MQGMO_WAIT, WaitInterval = 2000};
queue.Get(queueMessage, queueGetMessageOptions);
queue.Close();
_mqManager.Commit();
return queueMessage.ReadString(queueMessage.MessageLength);
Is WebSphere MQ transactional by default, or is there something I need to change in my configuration to enable this?
Or - do I need to ask our mainframe guys to do some of their magic?
Thx
Unless you actively BROWSE the message (i.e. read it but leave it there with no locks), only one getter will ever be able to 'get' the message. Even without transactionality, MQ will still only deliver the message once... but once delivered, it's gone.
MQ is not transactional 'by default' - you need to get with GMO_SYNCPOINT (MQ transactions) and commit at the connection (MQQueueManager) level if you want transactionality (another option is to integrate with .NET transactions).
If you use syncpoint then one getter will get the message and the other will ignore it, but if you subsequently have an issue and roll back, then it is made available to any getter (as you would want). It is this scenario where you might see a message twice, but that's because you aborted the transaction and hence asked for it to be put back to how it was before the get.
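Building on the question's code, a sketch of the get under syncpoint (same IBM MQ .NET classes; ProcessMessage is a made-up placeholder for your own work):

var queue = _mqManager.AccessQueue(_queueName,
    MQC.MQOO_INPUT_SHARED + MQC.MQOO_FAIL_IF_QUIESCING);

var queueMessage = new MQMessage { Format = MQC.MQFMT_STRING };
var getOptions = new MQGetMessageOptions
{
    // MQGMO_SYNCPOINT makes the get part of a unit of work on this connection.
    Options = MQC.MQGMO_WAIT | MQC.MQGMO_SYNCPOINT,
    WaitInterval = 2000
};

try
{
    queue.Get(queueMessage, getOptions);
    var body = queueMessage.ReadString(queueMessage.MessageLength);

    ProcessMessage(body);   // your own processing; hypothetical method

    _mqManager.Commit();    // the message is now permanently removed
    return body;
}
catch
{
    _mqManager.Backout();   // the message goes back on the queue for another getter
    throw;
}
finally
{
    queue.Close();
}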
I wish I'd found this sooner because the accepted answer is incomplete. MQ provides once and only once delivery of messages as described in the other answer and IBM's documentation. If you have many clients listening on the same queue, MQ will deliver only one copy of the message. This is uncontested.
That said, MQ, or any other async messaging for that matter, must deal with session handling and ambiguous outcomes. The effect of these factors is that any async messaging application should be designed to gracefully handle duplicate messages.
Consider an application putting a message onto a queue. If the PUT call receives a 2009 Connection Broken response, it is unclear whether the connection failed before or after the channel agent received and acted on the API call. The application, having no way to tell the difference, must put the message again to assure it is received. Doing the PUT under syncpoint can result in a 2009 on the COMMIT (or equivalent return code in messaging transports other than MQ) and the app doesn't know if the COMMIT was successful or if the PUT will eventually be rolled back. To be safe it must PUT the message again.
Now consider the partner application receiving the messages. A GET issued outside of syncpoint that reaches the channel agent will permanently remove the message from the queue, even if the channel agent cannot then deliver it. So use of transacted sessions ensures that the message is not lost. But suppose that the message has been received and processed and the COMMIT returns a 2009 Connection Broken. The app has no way to know whether the message was removed during the COMMIT or will be rolled back and delivered again. At the very least the app can avoid losing messages by using transacted sessions to retrieve them, but can not guarantee to never receive a dupe.
This is of course endemic to all async messaging, not just MQ, which is why the JMS specification directly addresses it. The situation is addressed in all versions, but in the JMS 1.1 spec look in section 4.4.13 Duplicate Production of Messages, which states:
If a failure occurs between the time a client commits its work on a
Session and the commit method returns, the client cannot determine if
the transaction was committed or rolled back. The same ambiguity
exists when a failure occurs between the non-transactional send of a
PERSISTENT message and the return from the sending method.
It is up to a JMS application to deal with this ambiguity. In some
cases, this may cause a client to produce functionally duplicate
messages.
A message that is redelivered due to session recovery is not
considered a duplicate message.
If it is critical that the application receive one and only one copy of the message, use 2-Phase transactions. The transaction manager and XA protocol will provide very strong (but still not absolute) assurance that only one copy of the message will be processed by the application.
The behavior of the messaging transport in delivering one and only one copy of a given message is a measure of the reliability of the transport. By contrast, the behavior of an application which relies on receipt of one and only one copy of the message is a measure of the reliability of the application.
Any duplicate messages received from an IBM MQ transport are almost certainly going to be due to the application's failure to use XA to account for the ambiguous outcomes inherent in async messaging and not a defect in MQ. Please keep this in mind when the Production version of the application chokes on its first duplicate message.
On a related note, if Disaster Recovery is involved, the app must also gracefully recover from lost messages, or else find a way to violate the laws of relativity.
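Given that a defensively written receiver has to tolerate the occasional duplicate, a common pattern is to key off the MQ message id. A minimal sketch follows; the in-memory HashSet is purely illustrative, and a real application would keep the processed ids in a durable store updated in the same unit of work as the business change.

using System;
using System.Collections.Generic;
using IBM.WMQ;

public class DeDuplicatingReceiver
{
    readonly HashSet<string> _processed = new HashSet<string>();

    public void Handle(MQMessage message)
    {
        var id = Convert.ToBase64String(message.MessageId);
        if (!_processed.Add(id))
        {
            // Already seen: most likely a redelivery after an ambiguous commit, so skip it.
            return;
        }

        var body = message.ReadString(message.MessageLength);
        Process(body);  // your business logic; hypothetical method
    }

    void Process(string body) { }
}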

Unable to resolve DNS (sometimes?)

Given an application that in parallel requests 100 urls at a time for 10000 urls, I'll receive the following error for 50-5000 of them:
The remote name cannot be resolved 'www.url.com'
I understand that the error means the DNS Server was unable to resolve the url. However, for each run, the number of urls that cannot be resolved changes (ranging from 50 to 5000).
Am I making too many requests too fast? And can I even do that? - Running the same test on a much more powerful server shows that only 10 URLs could not be resolved - which sounds much more realistic.
The code that does the parallel requesting:
var semp = new SemaphoreSlim(100);
var uris = File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x));
foreach(var uri in uris)
{
Task.Run(async () =>
{
await semp.WaitAsync();
var result = await Web.TryGetPage(uri); // Using HttpWebRequest
semp.Release();
});
}
I'll bet that you didn't know that the DNS lookup of HttpWebRequest (which is the cornerstone of all .NET HTTP APIs) happens synchronously, even when making async requests (annoying, right?). This means that firing off many requests at once causes severe ThreadPool strain and a large amount of latency. This can lead to unexpected timeouts. If you really want to step things up, don't use the .NET DNS implementation. You can use a third-party library to resolve hosts and create your web request with an IP instead of a hostname, then manually set the Host header before firing off the request. You can achieve much higher throughput this way.
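For example, a sketch using the DnsClient NuGet package as one possible third-party resolver; treat the exact DnsClient calls as an assumption, and note that HTTPS/SNI needs extra care when you connect by IP.

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using DnsClient;
using DnsClient.Protocol;

public static class DirectIpRequester
{
    static readonly LookupClient Resolver = new LookupClient();

    public static async Task<string> GetPageAsync(Uri uri)
    {
        // Resolve the A record asynchronously instead of letting HttpWebRequest
        // do a synchronous lookup on a thread-pool thread.
        var dnsResult = await Resolver.QueryAsync(uri.Host, QueryType.A);
        var address = dnsResult.Answers.ARecords().First().Address;

        // Request the IP directly, but keep the original name in the Host header
        // so name-based virtual hosting still works.
        var ipUri = new UriBuilder(uri) { Host = address.ToString() }.Uri;
        var request = (HttpWebRequest)WebRequest.Create(ipUri);
        request.Host = uri.Host;

        using (var response = (HttpWebResponse)await request.GetResponseAsync())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return await reader.ReadToEndAsync();
        }
    }
}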
It does sound like you're swamping your local DNS server (in the jargon, your local recursive DNS resolver).
When your program issues a DNS resolution request, it sends a port 53 datagram to the local resolver. That resolver responds either by replying from its cache or recursively resending the request to some other resolver that's been identified as possibly having the record you're looking for.
So, your multithreaded program is causing a lot of datagrams to fly around. Internet Protocol hosts and routers handle congestion and overload by dropping datagram packets. It's like handling a traffic jam on a bridge by bulldozing cars off the bridge. In an overload situation, some packets just disappear.
So, it's up to endpoint software using datagram protocols to try again if their packets get lost. That's the purpose of TCP, and that's how it can provide the illusion of an error-free stream of data even though it can only communicate with datagrams.
So, your program will need to try again when you get a resolution failure on some of your DNS requests. You're a datagram endpoint, so you own the responsibility of retrying. I suspect the .NET library is giving you back a failure when some of your requests time out because your datagrams got dropped.
Now, here's the important thing. It is also the responsibility of a datagram endpoint program, like yours, to implement congestion control. TCP does this automatically using its sliding window system, with an algorithm called slow-start / exponential backoff. If TCP didn't do this all internet routers would be congested all the time. This algorithm was dreamed up by Van Jacobson, and you should go read about it.
In the meantime you should implement a simple form of it in your bulk DNS lookup program. Here's how you might do that.
Start with a batch size of, say, 5 lookups.
Every time you get the whole batch back successfully, increase your batch size by one for your next batch. This is slow-start. As long as you're not getting congestion, you increase the network load.
Every time you get a failure to resolve a name, reduce the size of the next batch by half. So, for example, if your batch size was 30 and you got a failure, your next batch size will be 15. This is exponential backoff. You respond to congestion by dramatically reducing the load you're putting on the network.
Implement a maximum batch size of something like 100 just to avoid being too much of a pig and looking like a crude denial-of-service attack to the DNS system.
I had a similar project a while ago and this strategy worked well for me.
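A minimal sketch of that slow-start / backoff scheme for the bulk lookups; the sizes and the use of Dns.GetHostAddressesAsync are illustrative, and a real implementation would also cap retries per host so genuinely bad names don't loop forever.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class AdaptiveDnsLookup
{
    public static async Task ResolveAllAsync(IEnumerable<string> hosts)
    {
        var pending = new Queue<string>(hosts);
        int batchSize = 5;            // slow-start: begin small
        const int maxBatchSize = 100; // stay polite to the resolver

        while (pending.Count > 0)
        {
            var batch = new List<string>();
            while (batch.Count < batchSize && pending.Count > 0)
            {
                batch.Add(pending.Dequeue());
            }

            var tasks = batch.Select(async host =>
            {
                try
                {
                    await Dns.GetHostAddressesAsync(host);
                    return (Host: host, Ok: true);
                }
                catch (SocketException)
                {
                    return (Host: host, Ok: false);
                }
            });

            var results = await Task.WhenAll(tasks);
            var failed = results.Where(r => !r.Ok).Select(r => r.Host).ToList();

            if (failed.Count == 0)
            {
                batchSize = Math.Min(maxBatchSize, batchSize + 1);  // grow while all is well
            }
            else
            {
                batchSize = Math.Max(1, batchSize / 2);             // back off on congestion
                foreach (var host in failed)
                {
                    pending.Enqueue(host);                          // retry the failures later
                }
            }
        }
    }
}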

nservicebus + webhooks + Errors + MaxRetries

Feature Description
The NServiceBus gateway, http://docs.particular.net/nservicebus/gateway/, seems to be a way to achieve an internal webhook using the NServiceBus infrastructure.
We need to go further with this concept and open up a few events to any 3rd party subscriber that has access to register a webhook URL in our system.
Review
We plan to create two initial Windows services:
1) WebHookBatchService, which can be added as a subscriber to specific messages of interest.
<UnicastBusConfig>
<MessageEndpointMappings>
.......
<add Messages="MyMessages.MyImportantMessage, MyMessages" Endpoint="WebHookBatchService.Queue"/>
.......
</MessageEndpointMappings>
</UnicastBusConfig>
2) WebHookProcessService - actually processes 1 message sent by the WebHookBatchService.
Once messages are received on WebHookBatchService.Queue, our WebHookBatchService will look up all the subscribers for the specific tenant + message type and, for each one, send an individual message to WebHookProcessService.Queue for the WebHookProcessService (which we can put behind an instance of the NServiceBus load balancer to bridge the batch service and the actual processor) to process the real messages, probably using http://restsharp.org/.
Questions
Are there any existing open source projects that do this today?
Now, since we have no control over the durability of the subscribers, how should we manage errors?
http://wiki.shopify.com/WebHook
A webhook will be deleted if there are 19 consecutive failures for the exact same webhook.
It doesn't mention any delay between retries. What have people experienced as a standard delay in retry logic?
Here are some other thoughts:
proposal 0: MaxRetries="1". Purge WebHookProcessService.ErrorQueue nightly. (no retry - guaranteed message loss if it fails the first time)
proposal 1:
MaxRetries="1"; on exception, catch it and send an email containing an XML version of the message that would have been delivered over HTTP.
Purge WebHookProcessService.ErrorQueue nightly.
-- I see potential spam issues.
proposal 2: The NServiceBus MaxRetries mechanism retries right away without delay, so I would need to create (1hr - 24hr) bucket queues and use a RetrySchedulerService. I see this as difficult to maintain and confusing for subscribers, who would get 25 messages all at once, not ordered by DateCreated, when their service endpoint begins to work again.
Digging for ideas...
The Gateway is typically used for communication between physical sites over HTTP. Since you are exposing an endpoint to the world to accept callbacks, I'm thinking you could just use the built-in WCF hosting and expose your endpoint through the firewall to 3rd parties. The rest of your setup sounds appropriate to me.
As for errors, you are correct, NSB retries immediately, but if you are using web callbacks this may get you by in cases where there are only small hiccups. You will need to determine how you want to process the error queues; we just built a new endpoint to process the error queues with logic to determine the retries, delay, etc. A nice way to accomplish this is to use a Saga, which includes a Timeout manager. This enables a workflow where you can retry a specified number of times, try another form of communication, log everything, and ultimately notify someone who can contact the 3rd party to let them know their stuff is busted.
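To make the Saga idea concrete, here is a rough sketch using the newer NServiceBus saga API (versions differ, so treat the exact signatures as assumptions); the message types, the one-hour delay, the five-attempt limit and PostToSubscriber are all made up for illustration.

using System;
using System.Threading.Tasks;
using NServiceBus;

public class WebhookDeliverySagaData : ContainSagaData
{
    public Guid WebhookId { get; set; }
    public string Url { get; set; }
    public string Payload { get; set; }
    public int Attempts { get; set; }
}

public class DeliverWebhook : ICommand
{
    public Guid WebhookId { get; set; }
    public string Url { get; set; }
    public string Payload { get; set; }
}

public class NotifyWebhookFailed : ICommand
{
    public Guid WebhookId { get; set; }
}

public class RetryDelivery { }

public class WebhookDeliverySaga :
    Saga<WebhookDeliverySagaData>,
    IAmStartedByMessages<DeliverWebhook>,
    IHandleTimeouts<RetryDelivery>
{
    const int MaxAttempts = 5;

    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<WebhookDeliverySagaData> mapper)
    {
        mapper.ConfigureMapping<DeliverWebhook>(m => m.WebhookId).ToSaga(s => s.WebhookId);
    }

    public Task Handle(DeliverWebhook message, IMessageHandlerContext context)
    {
        Data.WebhookId = message.WebhookId;
        Data.Url = message.Url;
        Data.Payload = message.Payload;
        return TryDeliver(context);
    }

    public Task Timeout(RetryDelivery state, IMessageHandlerContext context)
    {
        return TryDeliver(context);
    }

    async Task TryDeliver(IMessageHandlerContext context)
    {
        Data.Attempts++;
        if (await PostToSubscriber(Data.Url, Data.Payload))
        {
            MarkAsComplete();
            return;
        }

        if (Data.Attempts >= MaxAttempts)
        {
            // Give up and tell someone who can contact the 3rd party.
            await context.Send(new NotifyWebhookFailed { WebhookId = Data.WebhookId });
            MarkAsComplete();
            return;
        }

        // Wait before the next attempt instead of retrying immediately.
        await RequestTimeout<RetryDelivery>(context, TimeSpan.FromHours(1));
    }

    static Task<bool> PostToSubscriber(string url, string payload)
    {
        // Placeholder for the real HTTP POST (e.g. via RestSharp).
        return Task.FromResult(false);
    }
}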

WCF - Client callback vs. polling for "keep list of subscribers"

I want to create a simple client-server example in WCF. I did some testing with callbacks, and it works fine so far. I played around a little bit with the following interface:
[ServiceContract(SessionMode = SessionMode.Required, CallbackContract = typeof(IStringCallback))]
public interface ISubscribeableService
{
[OperationContract]
void ExecuteStringCallBack(string value);
[OperationContract]
ServerInformation Subscribe(ClientInformation c);
[OperationContract]
ServerInformation Unsubscribe(ClientInformation c);
}
It's a simple example, a little bit adjusted. You can ask the server to "execute a string callback", in which case the server reverses the string and calls all subscribed client callbacks.
Now, here comes the question: if I want to implement a system where all clients "register" with the server, and the server can "ask" the clients if they are still alive, would you implement this with callbacks (so instead of this "stringcallback", a kind of TellTheClientThatIAmStillHereCallback)? By checking the communication state on the callback I can also "know" if a client is dead. Something similar to this:
Subscribers.ForEach(delegate(IStringCallback callback)
{
if (((ICommunicationObject)callback).State == CommunicationState.Opened)
{
callback.StringCallbackFunction(new string(retVal));
}
else
{
Subscribers.Remove(callback);
}
});
My problem, put in another way:
The server might have 3 clients
Client A dies (I pull the plug of the laptop)
The server dies and comes back online
A new client comes up
So basically, would you use callbacks to verify the "still living" state of clients, or would you use polling and keep track of "how long I haven't heard from a client"...
You can detect most changes to the connection state via the Closed, Closing, and Faulted events of ICommunicationObject. You can hook them at the same time that you set up the callback. This is definitely better than polling.
IIRC, the Faulted event will only fire after you actually try to use the callback (unsuccessfully). So if the Client just disappears - for example, a hard reboot or power-off - then you won't be notified right away. But do you need to be? And if so, why?
A WCF callback might fail at any time, and you always need to keep this in the back of your mind. Even if both the client and server are fine, you might still end up with a faulted channel due to an exception or a network outage. Or maybe the client went offline sometime between your last poll and your current operation. The point is, as long as you code your callback operations defensively (which is good practice anyway), then hooking the events above is usually enough for most designs. If an error occurs for any reason - including a client failing to respond - the Faulted event will kick in and run your cleanup code.
This is what I would refer to as the passive/lazy approach and requires less coding and network chatter than polling or keep-alive approaches.
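As a sketch of that passive approach, hooking the events at subscribe time (the Subscribers list, the locking, and BuildServerInformation are assumptions about your service implementation):

public ServerInformation Subscribe(ClientInformation c)
{
    var callback = OperationContext.Current.GetCallbackChannel<IStringCallback>();
    var channel = (ICommunicationObject)callback;

    // Faulted fires once a call on the callback fails; Closing fires on an
    // orderly shutdown. In either case, drop the subscriber.
    EventHandler cleanup = (sender, e) =>
    {
        lock (Subscribers)
        {
            Subscribers.Remove(callback);
        }
    };
    channel.Faulted += cleanup;
    channel.Closing += cleanup;

    lock (Subscribers)
    {
        Subscribers.Add(callback);
    }

    return BuildServerInformation(c); // hypothetical helper
}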
If you enable reliable sessions, WCF internally maintains a keep-alive control mechanism. It regularly checks, via hidden infrastructure test messages, if the other end is still there. The time interval of these checks can be influenced via the ReliableSession.InactivityTimeout property. If you set the property to, say, 20 seconds, then the ICommunicationObject.Faulted event will be raised about 20 to 30 (maximum) seconds after a service breakdown has occurred on the other side.
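For example, on a session-capable binding (the values are illustrative):

var binding = new NetTcpBinding();
binding.ReliableSession.Enabled = true;
// Raise the Faulted event roughly 20-30 seconds after the other side disappears.
binding.ReliableSession.InactivityTimeout = TimeSpan.FromSeconds(20);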
If you want to be sure that client applications always remain "auto-connected", even after temporary service breakdowns, you may want to use a worker thread (from the thread pool) that repeatedly tries to create a new proxy instance on the client side, and calls a session-initiating operation, after the Faulted event has been raised there.
As a second approach, since you are implementing a worker thread mechanism anyway, you might also ignore the Faulted event and let the worker thread loop during the whole lifetime of the client application. You let the thread repeatedly check the proxy state, and try to do its repair work whenever the state is faulted.
Using the first or the second approach, you can implement a service bus architecture (mediator pattern), guaranteeing that all client application instances are constantly ready to receive "spontaneous" service messages whenever the service is running.
Of course, this only works if the reliable session "as such" is configured correctly to begin with (using a session-capable binding, and applying the ServiceContractAttribute.SessionMode, ServiceBehaviorAttribute.InstanceContextMode, OperationContractAttribute.IsInitiating, and OperationContractAttribute.IsTerminating properties in meaningful ways).
I had a similar situation using WCF and callbacks. I did not want to use polling, but I was using a "reliable" protocol, so if a client died, it would hang the server until it timed out and crashed.
I do not know if this is the most correct or elegant solution, but what I did was create a class in the service to represent the client proxy. Each instance of this class contained a reference to the client proxy, and would execute the callback function whenever the server set the "message" property of the class. By doing this, when a client disconnected, the individual wrapper class would get the timeout exception and remove itself from the server's list of listeners, but the service would not have to wait for it. This doesn't actually answer your question about determining if the client is alive, but it is another way of structuring the service to address the issue. If you needed to know when a client died, you would be able to pick up when the client wrapper removed itself from the listener list.
I have not tried to use WCF callbacks over the wire, but I have used them for interprocess communication. I was having a problem where all of the calls being sent were ending up on the same thread, making the service deadlock when there were calls that depended on the same thread.
This may apply to the problem you currently have, so here is what I had to do to fix it.
Put this attribute onto both the server and client WCF implementation classes:
[ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Multiple)]
public class WCFServerClass
The ConcurrencyMode.Multiple setting makes each call process on its own thread, which should help with the server locking up when a client dies, until it times out.
I also made sure to use a thread pool on the client side so that there were no threading issues there.
