Entity framework transient failure and command timeout - c#

I get sporadic transient errors when connecting from my Docker container to the database on the host machine.
For this reason, I configured my DB context to retry on failure. This solved the problem only partially: the client application still fails, because a retried request can take up to 2 minutes (30 seconds on average).
This is too long and makes for a bad user experience. I was trying to understand why it takes so long, and the only thing I can think of is that the timeout before the connection is declared failed is too long. I thought of making the command timeout smaller (by default it is 30 seconds, if I am not wrong), maybe 2-3 seconds, since most of my queries take less than 30 ms, but I don't know if this would create other problems.
When checking my logs I discovered that the problem doesn't lie in the retry logic, because it retries straight after the failure. What takes so long is the failure response itself.
This is my current configuration:
builder.Services.AddDbContext<AuthDbContext>(options =>
{
    options.UseNpgsql(EnvironmentVariables.GetEnvironmentVariable(EnvironmentVariables.DB_AUTH), conf =>
    {
        conf.EnableRetryOnFailure(5, TimeSpan.FromSeconds(5), new List<string> { "4060" });
        conf.CommandTimeout(2); // This is the command timeout that I want to add.
    });
    options.LogTo(
        filter: (eventId, level) => eventId.Id == CoreEventId.ExecutionStrategyRetrying,
        logger: (eventData) =>
        {
            var retryEventData = eventData as ExecutionStrategyEventData;
            var exceptions = retryEventData.ExceptionsEncountered;
            Log.Information("TRANSIENT ERROR Retry #{attemptNumber} with delay {delayMs} due to error: {errorMessage}", exceptions.Count, retryEventData.Delay, exceptions.Last().Message);
        });
});

There are a few points you may want to consider:
I am not sure which of the two timeouts you are hitting. See "What is the difference between SqlCommand.CommandTimeout and SqlConnection.ConnectionTimeout?"
Setting the connection timeout to 3 seconds will make errors more frequent. I suggest not going below 15 seconds.
I recently learned that EF Core doesn’t consider timeouts to be transient errors, so EnableRetryOnFailure doesn’t retry after a timeout. You can use a custom execution strategy if you want to retry on timeouts, e.g. as in https://github.com/dotnet/efcore/issues/27826#issuecomment-1177641624
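To make the distinction between the two timeouts concrete, here is a minimal sketch that sets both of them explicitly in the connection string. It assumes the Npgsql keyword names `Timeout` (time allowed to establish a connection) and `Command Timeout` (time allowed for a command to run), both in seconds; the host, database, and user values are placeholders, and `AuthDbConnectionString` is a hypothetical helper, not part of the original code.

```csharp
using System;
using System.Data.Common;

public static class AuthDbConnectionString
{
    // Builds a connection string that sets both timeouts explicitly.
    // "Timeout" and "Command Timeout" are assumed to be the Npgsql keyword
    // names; Host, Database and Username values are placeholders.
    public static string Build(int connectTimeoutSeconds, int commandTimeoutSeconds)
    {
        var builder = new DbConnectionStringBuilder
        {
            ["Host"] = "host.docker.internal",
            ["Database"] = "auth",
            ["Username"] = "app",
            // How long to wait while opening a connection.
            ["Timeout"] = connectTimeoutSeconds,
            // How long a single command may run before it is aborted.
            ["Command Timeout"] = commandTimeoutSeconds,
        };
        return builder.ConnectionString;
    }
}
```

With something like `AuthDbConnectionString.Build(15, 5)` the connection timeout stays at 15 seconds while commands are cut off earlier, so the two values can be tuned independently.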

Related

Move message to 'deadletter ' in azure servicebus

I have implemented exponential backoff retry. Basically, if there is any exception, I clone the message and resubmit it to the queue with some delay.
Now I am facing two issues: 1) the delivery count does not increase when I clone the message and resubmit it to the queue;
2) I want to move the message to the dead-letter queue once the max delivery count is reached.
Code:
catch (Exception ex)
{
    _logger.Error(ex, $"Failed to process request {requestId}");
    var clone = messageResult.Message.Clone();
    clone.ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddSeconds(45);
    await messageResult.ResendMessage(clone);
    if (retryCount == MaxAttempts)
    {
        //messageResult.dea
    }
    return new PdfResponse { Error = ex.ToString() };
}
Please help me with this.
When you clone a message it becomes a new message. System properties are not cloned, so the cloned message gets a fresh delivery count starting at 1 again. See also https://docs.azure.cn/zh-cn/dotnet/api/microsoft.azure.servicebus.message.clone?view=azure-dotnet
You can look into the PeekLock feature of Azure Service Bus. With PeekLock the message becomes invisible on the queue until you explicitly abandon it (which puts it back on the queue with the delivery count increased) or complete it once processing has worked out as expected. Another option is to explicitly dead-letter the message.
The feature is documented here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-transfers-locks-settlement#peeklock
The important point is that if you perform none of the above actions (and do not clone the message either), Azure Service Bus automatically makes the message visible again after a defined interval (the LockDuration property). The same happens immediately when you abandon it.
So to get a delayed retry and dead letter behaviour (when maximum delivery count has been reached) you can use the following options:
Option 1. Retry via Azure service bus auto-unlock
When the message cannot be processed at the moment for some reason, catch the exception and make sure none of the mentioned actions (abandon, complete, or dead-letter) is performed. The message stays invisible for the remaining lock time and becomes visible again once the configured lock duration has elapsed. Azure Service Bus also increases the delivery count, as expected.
Option 2. Implement your own retry policy
Apply your own retry policy in your code and retry processing of the message. Once your maximum number of retries has been reached, abandon the message, which makes it visible again for the next read from the queue. In this case the delivery count is increased as well.
Note: If you choose option 2, make sure your total retry period fits within the defined LockDuration, so that the message does not become visible on the queue again while you are still retrying it. You could also renew the lock between retries by calling the RenewLock() method on the message.
If you implement the retry policy in your code, I recommend looking into Polly for .NET, which already provides useful features such as Retry and Circuit Breaker policies. See https://github.com/App-vNext/Polly
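The decision logic behind both options can be sketched independently of the Service Bus client. The following is a minimal, library-free illustration, assuming you track the attempt number yourself (for example from the delivery count, or from a custom property if you keep cloning messages). `SettleAction` and `RetryPlanner` are hypothetical names, and the actual settlement calls are deliberately left out.

```csharp
using System;

// Hypothetical outcome of one processing attempt.
public enum SettleAction { Complete, RetryLater, DeadLetter }

public static class RetryPlanner
{
    // Decides how to settle a message after an attempt (attempt is 1-based).
    public static SettleAction Decide(bool succeeded, int attempt, int maxAttempts)
    {
        if (succeeded) return SettleAction.Complete;
        // Out of attempts: hand the message to the dead-letter queue.
        if (attempt >= maxAttempts) return SettleAction.DeadLetter;
        return SettleAction.RetryLater;
    }

    // Exponential backoff: baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var factor = Math.Pow(2, attempt - 1);
        var ticks = Math.Min(baseDelay.Ticks * factor, maxDelay.Ticks);
        return TimeSpan.FromTicks((long)ticks);
    }
}
```

If the retries happen inside a single lock period, the sum of the delays returned by `DelayFor` should stay below the queue's LockDuration, or the lock has to be renewed between attempts.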

Determine if message will be retried from observer context in MassTransit 3

I would like to track the number of message retries and redelivers that occur while using MassTransit 3. I have both retries and redeliveries configured:
config.UseDelayedRedelivery(r => r.Immediate(2));
config.UseRetry(r => r.Immediate(3));
I have set up an IConsumeObserver and an IReceiveObserver as described here, and I can inspect the ConsumeContext/ReceiveContext in PostConsume<T>(ConsumeContext<T> context)/PostReceive(ReceiveContext context).
But when inspecting the contexts I cannot see a difference between the context for a message that was consumed without an exception and one that threw an exception during consumption and will be redelivered.
How can I determine, in the PostConsume method of an IConsumeObserver or in an IReceiveObserver, whether the context represents a message that will be redelivered or one that has completed successfully?
You can do it. MassTransit keeps the redelivery count in the message headers, otherwise, it won't know when to stop redelivering, according to your policy.
If this line returns a non-zero value (or a non-null one, I am not sure), you are dealing with a redelivered message:
var redeliveryCount = context.Headers.Get(MessageHeaders.RedeliveryCount, default(int?));
If your message is being retried (not redelivered), check this answer from Chris: Get MassTransit message retries amount
The consumer can influence whether or not a message will be redelivered, but it doesn't have full control or knowledge of it.
For example, everything succeeds on the consuming side, but it just takes too long, the publisher will retry and the consumer has no simple way to know that this will happen.
It's often best to design your application so that consuming the same message multiple times has the same effect as consuming it one time.
Additionally, you can check the MessageId when consuming the message if you want to see whether you've consumed it before.
The ConsumeContext also has a RetryCount, but I don't believe it's incremented until the next time the consumer runs.
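The MessageId check mentioned above can be sketched as a small duplicate tracker. This is an in-memory illustration only, assuming message ids are GUIDs and that a single process handles the messages; `ProcessedMessageTracker` is a hypothetical name, and a real deployment would likely need a persistent store.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class ProcessedMessageTracker
{
    // Ids of messages that have already been handled (in memory only).
    private readonly ConcurrentDictionary<Guid, byte> _seen = new();

    // Returns true the first time an id is seen, false for any repeat.
    public bool TryMarkProcessed(Guid messageId) => _seen.TryAdd(messageId, 0);
}
```

A consumer could call `TryMarkProcessed` at the start of handling and skip the work when it returns false, which keeps repeated deliveries from having a repeated effect.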

Hangfire: huge latency when using MySQL

I'm using HangFire with MySQL backend. When I do simple
var jobId = BackgroundJob.Enqueue(
    () => Debug.WriteLine("Test"));
I see a delay of 5-10 seconds, even though I've set the polling rate to 1 sec:
app.UseHangfireServer(new BackgroundJobServerOptions()
{
    SchedulePollingInterval = TimeSpan.FromSeconds(1),
    ServerCheckInterval = TimeSpan.FromSeconds(1),
    HeartbeatInterval = TimeSpan.FromSeconds(1)
});
My setup is as simple as possible: server and client are on the same machine and the queue is empty. I was expecting delays of no more than 1 second.
I do not plan to use distributed servers. Can I somehow force the server to start the task immediately? I assume that if I switched to in-memory storage for Hangfire it would start tasks immediately. I found https://github.com/perrich/Hangfire.MemoryStorage, but it states that it should not be used in production. What are my options for reducing latency?

ArgumentException in asynchronous webservice calls

I'm running into a problem I haven't experienced before. I'm calling a method asynchronously, via the BeginXXX/EndXXX pattern, with an extra wrapper for some timing functionality, like so:
BeginGetStuffTimed(arguments)
{
    var timingstuff = TimingStuff;
    var callback = new AsyncResult(CallbackMethod);
    service.BeginGetStuff(CallbackMethod, timingstuff);
}
CallbackMethodTimed(IAsyncResult result)
{
    SaveTiming(result.AsyncState...);
    service.EndGetStuff;
}
Now every once in a while, an exception gets thrown:
Async End called on wrong channel.
Parameter name: result
Since it doesn't occur always this kind of puzzles me. I was thinking IIS couldn't keep up with requests and goes fubar, so I added a pause in my calling code and this does seem to work. The longer the pause, the less frequent the exception.
Of course this is no real solution, so I'm looking for some insights into this matter.
Edit: upon further inspection this seems to be unrelated to IIS, IIS can process up to 5000 (per CPU) simultaneous threads. I'm pushing nowhere near this limit.

Timeout for Web Request

What is a reasonable amount of time to wait for a web request to return? I know this is maybe a little loaded as a question, but all I am trying to do is verify if a web page is available.
Maybe there is a better way?
try
{
    // Create the web request
    HttpWebRequest request = WebRequest.Create(this.getUri()) as HttpWebRequest;
    if (request != null)
    {
        // Configure the request only after the null check
        request.Credentials = System.Net.CredentialCache.DefaultCredentials;
        // 2 minutes for timeout
        request.Timeout = 120 * 1000;
        // Get response
        response = request.GetResponse() as HttpWebResponse;
        connectedToUrl = processResponseCode(response);
    }
    else
    {
        logger.Fatal(getFatalMessage());
        string error = string.Empty;
    }
}
catch (WebException we)
{
    ...
}
catch (Exception e)
{
    ...
}
You need to consider how long the work behind the web service is going to take. For example, if you are connecting to a database-backed web server and you run a lengthy query, you need to make the web service timeout longer than the time the query will take. Otherwise, the web service will (erroneously) time out.
I also use something like (consumer time) + 10 seconds.
Offhand I'd allow 10 seconds, but it really depends on what kind of network connection the code will be running with. Try running some test pings over a period of a few days/weeks to see what the typical response time is.
I would measure how long it takes for pages that do exist to respond. If they all respond in about the same amount of time, then I would set the timeout period to approximately double that amount.
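As a rough illustration of the "measure, then double" suggestion, the sketch below derives a timeout from a set of observed response times. It assumes the median is a reasonable stand-in for the "typical" response time; `TimeoutEstimator` and the default multiplier of 2 are illustrative choices, not something prescribed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TimeoutEstimator
{
    // Returns multiplier * median of the observed response times.
    public static TimeSpan FromSamples(IReadOnlyList<TimeSpan> samples, double multiplier = 2.0)
    {
        if (samples.Count == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));

        var sorted = samples.OrderBy(s => s).ToList();
        // For an even number of samples this picks the upper of the two middle values.
        var median = sorted[sorted.Count / 2];
        return TimeSpan.FromTicks((long)(median.Ticks * multiplier));
    }
}
```

If the measured response times vary widely, a higher percentile than the median may be a safer basis.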
Just wanted to add that a lot of the time I'll use an adaptive timeout. Could be a simple metric like:
period += (numTimeouts/numRequests > .01 ? someConstant: 0);
checked whenever you hit a timeout to try and keep timeouts under 1% (for example). Just be careful about decrementing it too low :)
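The adaptive approach described above might look like the following sketch. It grows the timeout by a fixed step whenever the observed timeout ratio exceeds 1%, up to a ceiling. `AdaptiveTimeout`, the step, and the ceiling are hypothetical, and the class is not thread-safe. Note that the ratio is computed in floating point, since integer division would truncate it to 0.

```csharp
using System;

public sealed class AdaptiveTimeout
{
    private readonly TimeSpan _step;
    private readonly TimeSpan _max;
    private long _requests;
    private long _timeouts;

    public TimeSpan Current { get; private set; }

    public AdaptiveTimeout(TimeSpan initial, TimeSpan step, TimeSpan max)
    {
        Current = initial;
        _step = step;
        _max = max;
    }

    // Call after every request that completed in time.
    public void RecordSuccess() => _requests++;

    // Call after every request that timed out.
    public void RecordTimeout()
    {
        _requests++;
        _timeouts++;
        // Floating-point ratio: integer division would always yield 0 here.
        var ratio = (double)_timeouts / _requests;
        if (ratio > 0.01 && Current + _step <= _max)
            Current += _step;
    }
}
```

The ceiling keeps a long outage from inflating the timeout without bound. Shrinking the timeout again after a healthy period is left out of this sketch.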
The reasonable amount of time to wait for a web request may differ from one server to the next. If a server is at the far end of a high-delay link then clearly it will take longer to respond than when it is in the next room. But two minutes seems like it's more than ample time for a server to respond. The default timeout value for the PING command is expressed in seconds, not minutes. I suggest you look into the timeout values that are used by networking utilities like PING or TRACERT for inspiration.
I guess this depends on two things:
network speed/load (as others wrote, using ping might give you an idea about this)
the kind of page you are calling: e.g. is it a static HTML page or is it a page which might do some time-consuming operations (DB access, etc.)
Anyway, I think 2 minutes is a lot of time. I would definitely reduce the timeout to less than 30 seconds.
I realize this doesn't directly answer your question, but then an "answer" to this question is a little tough. Anyway, a tool I've used in the past is Gomez, which measures page load times from various parts of the world. It's free, and if you haven't done this kind of testing before it might help give you a firm idea of what typical page load times are for a given page from a given location.
I would only wait 30 seconds at most, probably closer to 15. It really depends on what you are doing and what the consequence of an unsuccessful connection is. As I am sure you know, there are lots of reasons why you could get a timeout...
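For the narrower goal of checking whether a page is available, one possible alternative to downloading the full response is a HEAD request with a shorter timeout. The sketch below uses HttpClient instead of the HttpWebRequest from the question. The 15-second timeout and the `PageProbe` name are assumptions, and some servers reject HEAD requests, in which case a GET would be needed.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageProbe
{
    // One shared client; the 15-second timeout is an illustrative choice.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(15)
    };

    // Returns true if the server answers a HEAD request with a 2xx status.
    public static async Task<bool> IsAvailableAsync(Uri uri)
    {
        try
        {
            using var request = new HttpRequestMessage(HttpMethod.Head, uri);
            using var response = await Client.SendAsync(request);
            return response.IsSuccessStatusCode;
        }
        catch (HttpRequestException)
        {
            // DNS failure, connection refused, TLS errors, and similar.
            return false;
        }
        catch (TaskCanceledException)
        {
            // HttpClient signals an elapsed Timeout this way.
            return false;
        }
    }
}
```

A caller would use it as `bool up = await PageProbe.IsAvailableAsync(new Uri("https://example.com/"));`.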
