I am using a LabVIEW application to simulate a running test, which posts a JSON string to my ASP.NET application. Within the ASP.NET application I format the data with the proper partition and row keys, then send it to Azure Table Storage.
The problem is that after what seems like a random amount of time (e.g. 5 minutes, 2 hours, 5 hours), the data fails to be saved to Azure. I am trying to catch any exceptions within the ASP.NET application and send the error message back to the LabVIEW app, and the LabVIEW app is also catching any exceptions it may encounter, so that I can troubleshoot where the issue is occurring.
The only error that I am able to catch is a Timeout Error 56 in the LabVIEW program. My question is, does anyone have an idea of where I should be looking for the root cause of this? I do not know where to begin.
EDIT:
I am using a table storage writer that I found here to do batch operations with retries.
The constructor for exponential retry policy is below:
public ExponentialRetry(TimeSpan deltaBackoff, int maxAttempts)
When you (or, to be exact, the library you use) instantiate this as RetryPolicy = new ExponentialRetry(TimeSpan.FromMilliseconds(2), 100), you are setting the maximum number of attempts to 100. Because the back-off grows exponentially with each attempt, a single batch request can keep retrying on the client side for a very long time before the SDK gives up. A naive reading gives a wait on the order of 2^100 milliseconds; the real math is more involved (as far as I know the SDK bounds each individual delay), but the point stands.
The other issue with that code is that it executes batch requests sequentially and synchronously, which has several bad effects. First, all subsequent batch requests are blocked by the current batch request. Second, your threads are blocked waiting on I/O operations. Third, it has no exception handling, so if one of the batch operations throws an exception, the method bails out and does not process any of the remaining batch requests.
My recommendation: do not use that library; batch operations are fairly straightforward. If you do not explicitly define a retry policy, the default is the exponential retry policy anyway, with sensible default parameters (it does 3 retries), so you do not even need to define your own retry object. For best scalability and throughput, run your batch operations asynchronously (and concurrently).
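A minimal sketch of what "async and concurrently, with exception handling" could look like. It is deliberately independent of the Azure SDK: BatchRunner, RunAllAsync and the sendBatchAsync delegate are hypothetical names, and the delegate stands in for whatever call actually submits one batch to table storage.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Sends every batch through 'sendBatchAsync', at most 'maxConcurrency' at a time.
    // Exceptions are collected and returned, so one failing batch does not stop the others.
    public static async Task<IReadOnlyList<Exception>> RunAllAsync<TBatch>(
        IEnumerable<TBatch> batches,
        Func<TBatch, Task> sendBatchAsync,
        int maxConcurrency)
    {
        var failures = new List<Exception>();
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = batches.Select(async batch =>
            {
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await sendBatchAsync(batch).ConfigureAwait(false);
                }
                catch (Exception ex)
                {
                    // Record the failure and keep going with the remaining batches.
                    lock (failures) { failures.Add(ex); }
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
        return failures;
    }
}
```

The returned list is where you would inspect each failure (for example, the HTTP status code carried by a storage exception) instead of losing it.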
As to why things fail: when you write your own API, catch the StorageException and check the HTTP status code on the exception itself. One possibility is that you are being throttled by Azure, but it is hard to say without further debugging or without you providing the HTTP status code of the failed batch operations.
You need to check whether an exception is transient or not. As Peter said in his comment, the Azure Storage client already implements a retry policy. You can also wrap your code in another retry layer (e.g. using Polly), or you can change the default policy associated with the Azure Storage client.
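To make "transient or not" concrete, here is a rough sketch of a hand-rolled check plus a capped exponential back-off. Everything in it is an assumption for illustration: the TransientRetry class is hypothetical, the set of status codes treated as transient is my own choice rather than anything the storage SDK defines, and the operation is modelled as a delegate returning an HTTP status code.

```csharp
using System;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Assumption: these status codes are considered worth retrying.
    public static bool IsTransient(int httpStatusCode)
    {
        return httpStatusCode == 408   // request timeout
            || httpStatusCode == 429   // too many requests
            || httpStatusCode == 500   // internal error / server-side timeout
            || httpStatusCode == 503   // server busy (often throttling)
            || httpStatusCode == 504;  // gateway timeout
    }

    // Delay before retry number 'attempt' (1-based): baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double factor = Math.Pow(2, attempt - 1);
        double ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    // Runs 'operation' until it returns a non-transient status or 'maxAttempts' is reached.
    public static async Task<int> ExecuteAsync(
        Func<Task<int>> operation, int maxAttempts, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            status = await operation().ConfigureAwait(false);
            if (!IsTransient(status) || attempt == maxAttempts) break;
            await Task.Delay(BackoffDelay(attempt, baseDelay, maxDelay)).ConfigureAwait(false);
        }
        return status;
    }
}
```

The cap is the important part: without it, a large attempt count turns the doubling delay into an effectively unbounded wait, which is the problem described above. A library such as Polly gives you the same behaviour with less code.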
I have a Service Fabric cluster hosting an 'Orchestrator'-type service which spins up and shuts down other Stateful services to do work, using FabricClient.ServiceManagementClient's CreateServiceAsync and DeleteServiceAsync methods.
The work involves processing messages which are stored for a short time within a ReliableConcurrentQueue.
I'm trying to handle the graceful shutdown of these services via the CancellationToken by ensuring that the queue is completely drained of messages before the service is deleted, but have found that the service's access to the ReliableConcurrentQueue is revoked once the CancellationToken is cancelled.
For example, calling StateManager.GetOrAddAsync<T>() from a callback registered with the CancellationToken, results in a FabricNotReadableException, containing the message "Primary state manager is currently not readable".
Reading around, it seems this is expected behaviour:
"In Service Fabric, when a Primary is demoted, one of the first things
that happens is that write access to the underlying state is revoked."
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle
Also, the answers to this question suggest that FabricNotReadableException is often a transient issue, and affected calls can be retried. This doesn't seem to be the case in this example; multiple retries at various frequencies/delays all seem to fail the same way.
Is there a way to guarantee that everything in the queue is processed using the combination of Stateful services, Reliable Collections and CancellationTokens? Or should I be looking into storage outside of what Service Fabric can provide?
Consider performing the queue item processing inside RunAsync.
Stopping / changing the role of a service causes the CancellationToken passed to RunAsync to be cancelled.
Once that happens, you need to make sure that you only exit that method when the queue depth is 0.
Also, once this cancellation is requested, you should probably stop allowing new items to be enqueued.
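The shape of that loop can be sketched as follows. This is only an in-memory illustration of the control flow: DrainingWorker is a hypothetical class, and a ConcurrentQueue stands in for the ReliableConcurrentQueue, so it says nothing about whether Service Fabric still grants read access to the reliable state at that point.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class DrainingWorker
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
    private volatile bool _accepting = true;

    public int Processed { get; private set; }

    // Rejects new items once shutdown has been requested.
    public bool TryEnqueue(string item)
    {
        if (!_accepting) return false;
        _queue.Enqueue(item);
        return true;
    }

    // Stand-in for RunAsync: returns only when cancellation was requested AND the queue is empty.
    public async Task RunAsync(CancellationToken cancellationToken)
    {
        using (cancellationToken.Register(() => _accepting = false))
        {
            while (true)
            {
                string item;
                if (_queue.TryDequeue(out item))
                {
                    Processed++; // real message processing would go here
                }
                else if (cancellationToken.IsCancellationRequested)
                {
                    return; // cancelled and drained
                }
                else
                {
                    await Task.Delay(10).ConfigureAwait(false); // idle, wait for more work
                }
            }
        }
    }
}
```

Note that the token is never passed into the processing or the idle delay, otherwise the loop would be torn down before the queue is empty. There is also a small window in which an item enqueued at the exact moment of cancellation could be missed; a real implementation would need to close that gap.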
We recently had an outage where one of our APIs became unresponsive because our RabbitMQ cluster was put under artificially high load. We were running out of threads in Mono (.NET) and requests to the API failed. Although this is unlikely to happen again, we would like to put some protection in place against it. Ideally, calls to bus.Publish() would time out after a set amount of time, but we can't work out how to do that.
We then came across the blocked connection notifications feature of RabbitMQ and thought this might help. However, we can't figure out how to get at the connection object inside the IServiceBus. So far we have tried:
_serviceBus = serviceBus;
var connection =
((MassTransit.Transports.RabbitMq.RabbitMqEndpointAddress) _serviceBus.Endpoint.Address)
.ConnectionFactory.CreateConnection();
connection.ConnectionBlocked += Connection_ConnectionBlocked;
connection.ConnectionUnblocked += Connection_ConnectionUnblocked;
But when we do this we get a BrokerUnreachableException which I don't understand.
My questions are, is this the right approach to detect timeouts and fail (we have a backup mechanism to collect the data in the message and repost later) and if this is correct, how do we make it work?
I think you can manage this by combining a timer (System.Timers.Timer or Observable.Timer) to schedule checks with a check that uses request-response. The consumer for the request should be inside the same process. You can specify a cancellation token with a reasonable timeout for the Request call, and if you get a timeout, either your messaging infrastructure is down or too busy, or your endpoint is too busy.
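The timeout part of such a check could look roughly like this. It is a sketch that uses only the base class library: HealthProbe and IsResponsiveAsync are hypothetical names, and the probe delegate stands in for the actual request-response round trip over the bus.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HealthProbe
{
    // Returns true if 'probe' completes successfully within 'timeout';
    // false if it times out, is cancelled, or throws.
    public static async Task<bool> IsResponsiveAsync(
        Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                Task probeTask = probe(cts.Token);
                // Completes (as cancelled) only when the timeout fires.
                Task timeoutTask = Task.Delay(Timeout.InfiniteTimeSpan, cts.Token);

                Task finished = await Task.WhenAny(probeTask, timeoutTask).ConfigureAwait(false);
                if (finished != probeTask) return false; // timed out

                await probeTask.ConfigureAwait(false);   // surfaces probe exceptions
                return true;
            }
            catch (Exception)
            {
                return false;
            }
        }
    }
}
```

A timer can call this periodically and flip a flag that the publishing code checks before deciding whether to publish or to fall back to the backup mechanism.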
I'm running a small WCF client application that connects to an IIS server every few minutes to download data. There are about 500 of these clients for 2 or 3 servers, and my basic code is something like this:
Client connection = null;
try
{
connection = new Client();
List<TPointer> objects = connection.GetList();
// Some work on List<T>
foreach (TPointer pointer in objects)
{
T data = GetDataFromStream(pointer, connection);
// Some additional processing on T
}
connection.SendMoreData();
// More work
}
catch (...)
{
// Exception handling for various exceptions
}
finally
{
// Handle Close() or Abort()
if (connection != null)
connection.Close();
}
When I simulate running all the clients at once for large amounts of TPointers, I start encountering the following error:
System.TimeoutException: The request channel timed out while waiting for a reply after 00:01:00.
That seems like one of those errors that can occur for any number of reasons. For all I know the server could just be swamped, or I could be requesting too large/too many objects and it's taking too long to download (a whole minute though?). Increasing the timeout is an option, but I'd like to understand the actual problem instead of fixing the symptom.
Given I have no control over the server, how can I streamline my client?
I'm not actually sure what the "request channel" mentioned in the timeout refers to. Does the timeout start ticking from when I create new Client() until I call Client.Close()? Or does each specific request I'm sending to the server (e.g. GetList or GetData) get another minute? Is it worth my while to close Client() in between each call to the server? (I'm hoping not... that would be ugly)
Would it be helpful to chunk up the amount of data I'm receiving? The GetList() call can be quite large (running into the thousands). I could try obtaining a few objects at a time and jobbing off the post-processing for later...
Edit:
Since a few people mentioned streaming:
The Client binding uses TransferMode.StreamedResponse.
GetDataFromStream() uses a Stream derived from TPointer, and SendMoreData()'s payload size is more or less negligible.
Only GetList() actually returns a non-stream object, but I'm unclear as to whether or not that affects the method of transfer.
Or does each specific request I'm sending to the server (e.g. GetList or GetData) get another minute?
The timeout property applies to each and every operation that you're doing. It's reset. If your timeout is one minute, then it starts the moment you invoke that method.
What I'd do is implement a retry policy, use an async version of the client's method, and use a CancellationToken or call Abort() on your client when it's taking too long. Alternatively, you can increase the operation timeout on the client's InnerChannel:
client.InnerChannel.OperationTimeout = TimeSpan.FromMinutes(10);
You can use that during your operation and in your retry policy you can abort entirely and reset your timeout after your retries have failed or succeeded.
Alternatively, you can try to stream your results and see if you can operate individually on them, but I don't know if keeping that connection open will trip the timeout. You'll have to hold off on operating on your collection until you have everything.
Also, set TransferMode = TransferMode.StreamedResponse in your binding.
I believe the timeout you are hitting is time to first response. In your scenario here first response is the whole response since you are returning the list, more data more time. You might want to consider streaming the data instead of returning a full list.
I suggest modifying both your web.config file (WCF side) and your app.config (client side), adding a bindings section like this (i.e. a timeout of 25 minutes instead of the default of 1 minute):
<bindings>
  <wsHttpBinding>
    <binding name="WSHttpBinding_IYourService"
             openTimeout="00:25:00"
             closeTimeout="00:25:00"
             sendTimeout="00:25:00"
             receiveTimeout="00:25:00">
    </binding>
  </wsHttpBinding>
</bindings>
Given I have no control over the server, how can I streamline my client?
Basically you can not do this when you only have control over the client. It seems like the operations return no Stream (unless the pointers are types which derive from Stream).
If you want to know more about how to generally achieve streaming just read up on this MSDN article.
Everything you can do on the client only scratches the surface of the problem. As @The Anathema proposed in his answer, you can create retry logic and/or set the timeout to a higher value. But to eradicate the root of the problem, you'd need to investigate the source of the service itself so that it can handle a higher number of requests, or have instances of the service running on multiple servers with a load balancer in front.
I ended up going with a combination of the answers here, so I'll just post an answer. I chunked GetList() to a certain size to avoid keeping the connection open so long (it also had a positive effect on the code in general, since I was keeping less in memory temporarily.) I already have a retry policy in place, but will also plan on messing with the timeout, as The Anathema and a couple others suggested.
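For anyone wanting to do the same, a generic chunking helper might look like the sketch below. It is an illustration rather than the code I actually used: ChunkingExtensions and InChunksOf are hypothetical names (newer .NET versions also ship a built-in Enumerable.Chunk), and how each chunk is then requested depends on what the service contract allows.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkingExtensions
{
    // Splits 'source' into consecutive lists of at most 'size' items each.
    public static IEnumerable<List<T>> InChunksOf<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        var chunk = new List<T>(size);
        foreach (var item in source)
        {
            chunk.Add(item);
            if (chunk.Count == size)
            {
                yield return chunk;
                chunk = new List<T>(size);
            }
        }

        // Emit the final, possibly shorter, chunk.
        if (chunk.Count > 0) yield return chunk;
    }
}
```

Each chunk can then be downloaded and post-processed before the next one is requested, which keeps both the time spent per call and the amount held in memory small.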
Given an application that in parallel requests 100 urls at a time for 10000 urls, I'll receive the following error for 50-5000 of them:
The remote name cannot be resolved 'www.url.com'
I understand that the error means the DNS Server was unable to resolve the url. However, for each run, the number of urls that cannot be resolved changes (ranging from 50 to 5000).
Am I making too many requests too fast? And can I even do that? - Running the same test on a much more powerful server, shows that only 10 urls could not be resolved - which sounds much more realistic.
The code that does the parallel requesting:
var semp = new SemaphoreSlim(100);
var uris = File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x));
foreach(var uri in uris)
{
Task.Run(async () =>
{
await semp.WaitAsync();
var result = await Web.TryGetPage(uri); // Using HttpWebRequest
semp.Release();
});
}
I'll bet that you didn't know that the DNS lookup of HttpWebRequest (which is the cornerstone of all .NET HTTP APIs) happens synchronously, even when making async requests (annoying, right?). This means that firing off many requests at once causes severe ThreadPool strain and a large amount of latency, which can lead to unexpected timeouts. If you really want to step things up, don't use the .NET DNS implementation. You can use a third-party library to resolve hosts, create your web request with an IP address instead of a hostname, and then manually set the Host header before firing off the request. You can achieve much higher throughput this way.
It does sound like you're swamping your local DNS server (in the jargon, your local recursive DNS resolver).
When your program issues a DNS resolution request, it sends a port 53 datagram to the local resolver. That resolver responds either by replying from its cache or recursively resending the request to some other resolver that's been identified as possibly having the record you're looking for.
So, your multithreaded program is causing a lot of datagrams to fly around. Internet Protocol hosts and routers handle congestion and overload by dropping datagram packets. It's like handling a traffic jam on a bridge by bulldozing cars off the bridge. In an overload situation, some packets just disappear.
So, it's up to endpoint software using datagram protocols to try again if their packets get lost. That's the purpose of TCP, and that's how it can provide the illusion of an error-free stream of data even though it can only communicate with datagrams.
So, your program will need to try again when you get a resolution failure on some of your DNS requests. You're a datagram endpoint, so you own the responsibility of retrying. I suspect the .NET library is giving you back failures when some of your requests time out because your datagrams got dropped.
Now, here's the important thing. It is also the responsibility of a datagram endpoint program, like yours, to implement congestion control. TCP does this automatically using its sliding window system, with an algorithm called slow-start / exponential backoff. If TCP didn't do this all internet routers would be congested all the time. This algorithm was dreamed up by Van Jacobson, and you should go read about it.
In the meantime you should implement a simple form of it in your bulk DNS lookup program. Here's how you might do that.
Start with a batch size of, say, 5 lookups.
Every time you get the whole batch back successfully, increase your batch size by one for your next batch. This is the slow-start idea (in TCP terms, growing by one per round is closer to additive increase). As long as you're not getting congestion, you increase the network load.
Every time you get a failure to resolve a name, reduce the size of the next batch by half. So, for example, if your batch size was 30 and you got a failure, your next batch size will be 15. This is the backoff part (multiplicative decrease). You respond to congestion by dramatically reducing the load you're putting on the network.
Implement a maximum batch size of something like 100 just to avoid being too much of a pig and looking like a crude denial-of-service attack to the DNS system.
I had a similar project a while ago and this strategy worked well for me.
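The batch-sizing rules above can be sketched as a small helper. AdaptiveBatchSize is a hypothetical name; the defaults (start at 5, cap at 100) are the numbers suggested in the list, and the floor of 1 is my own assumption so the batch size never reaches zero.

```csharp
using System;

public sealed class AdaptiveBatchSize
{
    private readonly int _min;
    private readonly int _max;

    // Number of lookups to issue in the next batch.
    public int Current { get; private set; }

    public AdaptiveBatchSize(int initial = 5, int min = 1, int max = 100)
    {
        _min = min;
        _max = max;
        Current = Math.Max(min, Math.Min(initial, max));
    }

    // The whole batch resolved: grow by one, up to the maximum.
    public void OnBatchSucceeded()
    {
        Current = Math.Min(Current + 1, _max);
    }

    // At least one lookup failed: halve the next batch, but never go below the minimum.
    public void OnBatchFailed()
    {
        Current = Math.Max(Current / 2, _min);
    }
}
```

The driving loop takes Current URLs, resolves them, calls one of the two methods depending on the outcome, and puts any names that failed back on the list to be retried in a later batch.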
Feature Description
The NServiceBus gateway, http://docs.particular.net/nservicebus/gateway/, seems to be a way to achieve an internal webhook using the NServiceBus infrastructure.
We need to go further with this concept and open up a few events to any 3rd-party subscriber that has access to register a webhook URL in our system.
Review
We plan to create two initial Windows services:
1) WebHookBatchService, that can be added as a subscriber to specific messages of interest.
<UnicastBusConfig>
  <MessageEndpointMappings>
    .......
    <add Messages="MyMessages.MyImportantMessage, MyMessages" Endpoint="WebHookBatchService.Queue"/>
    .......
  </MessageEndpointMappings>
</UnicastBusConfig>
2) WebHookProcessService - actually processes 1 message sent by the WebHookBatchService.
Once messages are received on WebHookBatchService.Queue, our WebHookBatchService will look up all the subscribers for the specific tenant + message type and, for each one, send an individual message to WebHookProcessService.Queue. The WebHookProcessService (which we can put behind an instance of the NServiceBus load balancer to bridge the batch service and the actual processor) then processes the real messages, probably using http://restsharp.org/.
Questions
Are there any existing open source projects that do this today?
Now since we have no control of the durability of the subscribers how should we manage errors?
http://wiki.shopify.com/WebHook
A webhook will be deleted if there are 19 consecutive failures for the exact same webhook.
It doesn't mention any delays between retries of the webhook. What have people experienced as a standard delay in retry logic?
Here are some other thoughts:
proposal 0: MaxRetries="1". Purge WebHookProcessService.ErrorQueue nightly. (no retry - guaranteed message loss if it fails the first time)
proposal 1:
MaxRetries="1" on exception catch send email containing xml version of the message that would have been delivered over http.
Purge WebHookProcessService.ErrorQueue nightly.
-- I see a potential spam issue.
proposal 2: The NServiceBus MaxRetries retries right away without delay, so I would need to create (1hr - 24hr) bucket queues and use a RetrySchedulerService. I see this as difficult to maintain, though, and confusing for subscribers when they get 25 messages all at once, not ordered by DateCreated, as soon as their service endpoint begins to work again.
Digging for ideas...
The Gateway is typically used for communication between physical sites over HTTP. Since you are exposing an endpoint to the world to accept callbacks, I'm thinking you could just use the built-in WCF hosting and expose your endpoint through the firewall to 3rd parties. The rest of your setup sounds appropriate to me.
As for errors, you are correct: NSB retries immediately, but if you are using web callbacks this may get you by in cases where there are small hiccups. You will need to determine how you want to process the error queues; we just built a new endpoint to process the error queues, with logic to determine the retries, delays, etc. A nice way to accomplish this is to use a Saga, which includes a Timeout manager. This enables a workflow where you can retry a specified number of times, try another communication channel, log everything, and ultimately notify someone who can contact the 3rd party to let them know their stuff is busted.
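The "logic to determine the retries and delays" could be as small as the sketch below. It is not NServiceBus code: WebhookRetrySchedule is a hypothetical helper, the doubling delays starting at one minute and the 24-hour cap are assumptions chosen for illustration, and the limit of 19 consecutive failures is simply borrowed from the Shopify rule quoted in the question. A saga would ask it how long to set its next timeout for.

```csharp
using System;

public static class WebhookRetrySchedule
{
    // Borrowed from the Shopify rule quoted in the question; adjust to taste.
    public const int MaxConsecutiveFailures = 19;

    // Given the number of consecutive failed deliveries so far (1 or more),
    // returns true and the delay before the next attempt, or false when the
    // webhook should be given up on and someone should be notified.
    public static bool TryGetNextDelay(int consecutiveFailures, out TimeSpan delay)
    {
        delay = TimeSpan.Zero;
        if (consecutiveFailures < 1)
            throw new ArgumentOutOfRangeException(nameof(consecutiveFailures));
        if (consecutiveFailures >= MaxConsecutiveFailures)
            return false;

        // 1, 2, 4, 8, ... minutes, capped at 24 hours (assumed values).
        double minutes = Math.Min(Math.Pow(2, consecutiveFailures - 1), 24 * 60);
        delay = TimeSpan.FromMinutes(minutes);
        return true;
    }
}
```

Spacing the retries out like this also avoids hammering a subscriber whose endpoint has only just come back up.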