Resiliency during SaveChanges in Entity Framework Core - C#

I want to ensure that when I do a context.SaveChanges(), this is retried because the database might be temporarily down.
So far all I've found involves writing a lot of code that I'd then need to maintain, so is there something ready, an out-of-the-box tool, that I can use for resiliency?

I've created a small library called ResilientSaveChanges.EFCore that allows resilient context.SaveChanges / SaveChangesAsync in Entity Framework Core, logging of long-running transactions and limiting of concurrent SaveChanges. It's straight to the point.
Available on GitHub and NuGet. Tried and tested in production on multiple private projects.

Yes, connection resiliency is available in EF Core. For MySQL, it's available through the Pomelo provider's EnableRetryOnFailure() option. The GitHub blame shows this was added 5 years ago, which is a bit surprising. An overload added 3 years ago allows specifying extra errors to retry.
This code taken from one of the integration tests shows how it's used:
services.AddDbContextPool<AppDb>(
    options => options.UseMySql(
        GetConnectionString(),
        AppConfig.ServerVersion,
        mysqlOptions =>
        {
            mysqlOptions.MaxBatchSize(AppConfig.EfBatchSize);
            mysqlOptions.UseNewtonsoftJson();

            if (AppConfig.EfRetryOnFailure > 0)
            {
                mysqlOptions.EnableRetryOnFailure(AppConfig.EfRetryOnFailure, TimeSpan.FromSeconds(5), null);
            }
        }
    ));
Without parameters, EnableRetryOnFailure() uses the default retry count and maximum delay, which are 6 retries and 30 seconds.
The third parameter is an ICollection<int> of additional error numbers to retry.
By default, the MySqlTransientExceptionDetector class specifies that only transient exceptions are retried, i.e. those that have the IsTransient property set, or timeout exceptions:
public static bool ShouldRetryOn([NotNull] Exception ex)
    => ex is MySqlException mySqlException
        ? mySqlException.IsTransient
        : ex is TimeoutException;
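As a sketch of the overload mentioned above, extra MySQL error numbers can be passed so that non-transient errors are also retried. The error numbers below (1205 = lock wait timeout, 1213 = deadlock) are illustrative choices, not a recommendation from the original answer:

```csharp
// Sketch: retry on two extra MySQL error numbers in addition to the
// transient ones detected by MySqlTransientExceptionDetector.
services.AddDbContextPool<AppDb>(
    options => options.UseMySql(
        GetConnectionString(),
        AppConfig.ServerVersion,
        mysqlOptions => mysqlOptions.EnableRetryOnFailure(
            maxRetryCount: 6,
            maxRetryDelay: TimeSpan.FromSeconds(30),
            errorNumbersToAdd: new[] { 1205, 1213 }))); // illustrative error numbers
```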

As @PanagiotisKanavos already pointed out, Pomelo has connection resiliency support.
The simplest way to use it is to enable the default strategy:
dbContextOptions.UseMySql(
    connectionString,
    serverVersion,
    mySqlOptions => mySqlOptions.EnableRetryOnFailure());
It will retry up to six times, waiting incrementally longer between retries (but never longer than 30 seconds).
If you want to configure the retry strategy, use the following overload instead:
dbContextOptions.UseMySql(
    connectionString,
    serverVersion,
    mySqlOptions => mySqlOptions
        .EnableRetryOnFailure(
            maxRetryCount: 3,
            maxRetryDelay: TimeSpan.FromSeconds(15),
            errorNumbersToAdd: null));
If you want full control over all aspects of the execution strategy, you can inject your own implementation (that either inherits from MySqlExecutionStrategy or directly implements IExecutionStrategy):
dbContextOptions.UseMySql(
    connectionString,
    serverVersion,
    mySqlOptions => mySqlOptions
        .ExecutionStrategy(dependencies => new YourExecutionStrategy(dependencies)));
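A minimal custom strategy might look like the sketch below. The constructor and override follow the EF Core ExecutionStrategy base class; the exact signatures can vary between Pomelo versions, and YourTransientException is a hypothetical exception type standing in for whatever extra condition you want to retry on:

```csharp
// Sketch: retry on a custom exception type in addition to what
// MySqlExecutionStrategy already considers transient.
public class YourExecutionStrategy : MySqlExecutionStrategy
{
    public YourExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies)
    {
    }

    protected override bool ShouldRetryOn(Exception exception)
        => exception is YourTransientException // hypothetical extra condition
           || base.ShouldRetryOn(exception);
}
```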

Polly not catching Nsubstitute's mocked exception

I am using Polly's retry policy for an unsuccessful call, but it is not catching the exception and retrying.
Using:
Polly 7.2.3
.NET6.0
Nsubstitute 4.2.2
Setup:
var delay = Backoff.DecorrelatedJitterBackoffV2(TimeSpan.FromMilliseconds(RetryDelay), RetryCount);

_retryPolicy = Policy.Handle<HttpRequestException>()
    .Or<CustomException>()
    .OrResult<string>(response => !string.IsNullOrEmpty(response))
    .WaitAndRetryAsync(delay);
Usage:
public async Task ProcessRequest()
{
    var errors = await _retryPolicy.ExecuteAsync(async () => await this.RetryProcessRequest());
}

private async Task<string> RetryProcessRequest()
{
    var token = await _tokenInfrastructure.GetTokenAsync();
    return await _timezoneInfrastructure.ProcessRequest(token);
}
Unit test:
[Fact]
public async Task ProcessRequest_Throws()
{
    string errors = _fixture.Create<string>();
    var token = _fixture.Create<string>();

    // Retry policy configured to retry 3 times on a failed call
    var expectedReceivedCalls = 4;

    // this is throwing, but Polly is not catching it and not retrying
    _tokenInfrastructure.GetTokenAsync().Returns(Task.FromException<string>(new HttpRequestException()));

    // these errors can be caught by Polly as configured, triggering retries
    _timezoneInfrastructure.ProcessRequest(token).Returns(errors);

    await _timezoneOrchestration.Awaiting(o => o.ProcessRequest()).Should()
        .ThrowAsync<HttpRequestException>();

    await _tokenInfrastructure.Received(expectedReceivedCalls).GetTokenAsync();
    await _timezoneInfrastructure.Received(expectedReceivedCalls).ProcessRequest(Arg.Any<string>());
}
After some rubber duck debugging I found my mistake. Polly was actually configured correctly and was retrying.
This line of code was never reached, because the line above it was throwing:
return await _timezoneInfrastructure.ProcessRequest(token);
Yet the unit test was expecting retry calls to it:
await _timezoneInfrastructure.Received(expectedReceivedCalls).ProcessRequest(Arg.Any<string>());
This post is not an answer to the OP's question (the problem has already been addressed here). It is more a set of suggestions (you can call it a code review if you wish).
Exponential backoff
I'm glad to see that you are using V2 of the backoff logic, which utilizes jitter properly.
My only concern is that, depending on the actual values of RetryDelay and RetryCount, the sleepDuration might explode: it can easily reach several minutes. I would suggest two solutions in that case:
Change the factor parameter of DecorrelatedJitterBackoffV2 from 2 (the default) to a lower number
Or cap the max sleepDuration; here I have detailed one way to do that
Combined Retry logic
Without knowing what RetryProcessRequest does, it seems like this _retryPolicy smashes two different policies into one. I might be wrong, so this section may suggest something that is not applicable to your code.
I assume this part decorates the _tokenInfrastructure.GetTokenAsync call
_retryPolicy = Policy.Handle<HttpRequestException>()
    .WaitAndRetryAsync(delay);
whereas this part decorates the _timezoneInfrastructure.ProcessRequest call
_retryPolicy = Policy.Handle<CustomException>()
    .OrResult<string>(response => !string.IsNullOrEmpty(response))
    .WaitAndRetryAsync(delay);
Based on your naming, I assume these are different downstream systems: tokenInfrastructure and timezoneInfrastructure. I would suggest creating separate policies for them. You might want to apply a different Timeout to each, or use separate Circuit Breakers.
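Splitting the combined policy could be sketched like this; the names and delay values are illustrative, not taken from the question:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Contrib.WaitAndRetry;

// Sketch: one policy per downstream dependency, so retry/timeout/breaker
// settings can differ independently.
var delays = Backoff.DecorrelatedJitterBackoffV2(
    medianFirstRetryDelay: TimeSpan.FromMilliseconds(200), retryCount: 3);

// Decorates only the token call.
var tokenRetry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(delays);

// Decorates only the timezone call, including the result-based condition.
var timezoneRetry = Policy
    .Handle<CustomException>()
    .OrResult<string>(response => !string.IsNullOrEmpty(response))
    .WaitAndRetryAsync(delays);

// Hypothetical usage inside RetryProcessRequest:
// var token  = await tokenRetry.ExecuteAsync(() => _tokenInfrastructure.GetTokenAsync());
// var errors = await timezoneRetry.ExecuteAsync(() => _timezoneInfrastructure.ProcessRequest(token));
```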
Naming
I know naming is hard, and I assume your method names (ProcessRequest, RetryProcessRequest, ProcessRequest_Throws) have been simplified for Stack Overflow. If not, please try to spend some time coming up with more expressive names.
Component testing
Your ProcessRequest_Throws test is not really a unit test; it is more like a component test. You are testing the integration between Polly's policy and the decorated code.
If you tested only the correctness of the policy setup, or only the decorated code (with a NoOpPolicy), then they would be unit tests.

Is it possible to add dynamic data to an MassTransit courier/routing slip custom event?

I have a MassTransit routing slip configured and working. For reference, the routing slip takes in an ID of an item in a MongoDB database and then creates a "version" of that document in a SQL database using EF Core. The activities (as commands) are:
Migrate document to SQL
Update audit info in MongoDB document
Update MongoDB document status (i.e. to published)
All of the above are write commands.
I have added a new 1st step which runs a query to make sure the MongoDB document is valid (e.g. name and description fields are completed) before running the migration. If this step fails, it throws a custom exception, which in turn fires a failed event which is then picked up and managed by my saga. Below is a snippet of my activity code followed by the routing slip builder code:
Activity code
var result = await _queryDispatcher.ExecuteAsync<SelectModuleValidationResultById, ModuleValidationResult>(query).ConfigureAwait(false);

if (!result.ModuleValidationMessages.Any())
{
    return context.Completed();
}

return context.Faulted(new ModuleNotValidException
{
    ModuleId = messageCommand.ModuleId,
    ModuleValidationMessages = result.ModuleValidationMessages
});
Routing slip builder code
builder.AddActivity(
    nameof(Step1ValidateModule),
    context.GetDestinationAddress(ActivityHelper.BuildQueueName<Step1ValidateModule>(ActivityQueueType.Execute)),
    new SelectModuleValidationResultById(
        context.Message.ModuleId,
        context.Message.UserId,
        context.Message.LanguageId)
);

builder.AddSubscription(
    context.SourceAddress,
    RoutingSlipEvents.ActivityFaulted,
    RoutingSlipEventContents.All,
    nameof(Step1ValidateModule),
    x => x.Send<IModuleValidationFailed>(new
    {
        context.Message.ModuleId,
        context.Message.LanguageId,
        context.Message.UserId,
        context.Message.DeploymentId,
    }));
Whilst all of this works and the event gets picked up by my saga, I would ideally like to add the ModuleValidationMessages (i.e. any failed validation messages) to the event being returned, but I can't figure out how, or even if that's possible (or, more fundamentally, whether it's the right thing to do).
It's worth noting that this is a last-resort check and that the validation is checked by the client before even trying the migration, so worst-case scenario I can just leave it as "Has validation issues", but ideally I would like to include the detail in the failed response.
Good use case, and yes, it's possible to add the details you need to the built-in routing slip events. Instead of throwing an exception, you can Terminate the routing slip, and include variables - such as an array of messages, which are added to the RoutingSlipTerminated event that will be published.
This way, it isn't a fault but more of a business decision to terminate the routing slip prematurely. It's a contextual difference, which is why it allows variables to be specified (versus Faulted, which is a full-tilt exception).
You can then pull the array from the variables and use those in your saga or consumer.
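Adapting the activity code from the question, terminating with variables might look like the sketch below; the ValidationMessages property name is an illustrative choice, and the saga would subscribe to RoutingSlipEvents.Terminated to receive it:

```csharp
// Sketch: terminate the routing slip instead of faulting, attaching the
// validation messages as routing slip variables.
var result = await _queryDispatcher
    .ExecuteAsync<SelectModuleValidationResultById, ModuleValidationResult>(query)
    .ConfigureAwait(false);

if (!result.ModuleValidationMessages.Any())
{
    return context.Completed();
}

// The anonymous object becomes variables on the RoutingSlipTerminated event.
return context.Terminate(new
{
    ValidationMessages = result.ModuleValidationMessages // illustrative name
});
```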

MassTransit saga with request/response timeouts when deployed to test server

I have a fully working MassTransit saga, which runs some commands and then executes a request/response call to query a database and then ultimately return a response to the calling controller.
Locally this all works now 99% of the time (thanks to a lot of support I've received on here). However, when deployed to my Azure VM, which has a local copy of RabbitMQ and the 2 ASP.NET Core services running on it, the first call to the saga goes through straight away but all subsequent calls timeout.
I feel like it might be related to the fact that I'm using an InMemorySagaRepository (which in theory should be fine for my use case).
The saga is configured initially like so:
InstanceState(s => s.CurrentState);

Event(() => RequestLinkEvent, x => x.CorrelateById(context => context.Message.LinkId));
Event(() => LinkCreatedEvent, x => x.CorrelateById(context => context.Message.LinkId));
Event(() => CreateLinkGroupFailedEvent, x => x.CorrelateById(context => context.Message.LinkId));
Event(() => CreateLinkFailedEvent, x => x.CorrelateById(context => context.Message.LinkId));
Event(() => RequestLinkFailedEvent, x => x.CorrelateById(context => context.Message.LinkId));

Request(() => LinkRequest, x => x.UrlRequestId, cfg =>
{
    cfg.ServiceAddress = new Uri($"{hostAddress}/{nameof(SelectUrlByPublicId)}");
    cfg.SchedulingServiceAddress = new Uri($"{hostAddress}/{nameof(SelectUrlByPublicId)}");
    cfg.Timeout = TimeSpan.FromSeconds(30);
});
It's worth noting that my LinkId is ALWAYS a unique Guid as it is created in the controller before the message is sent.
ALSO, when I restart the app pool it works again for the first call and then starts timing out again.
I feel like something might be locking somewhere but I can't reproduce it locally!
So I wanted to post the solution to my own problem here in the hope that it will aid others in the future.
I made 3 fundamental changes which, either in isolation or in combination, solved this issue; everything now flies and works 100% of the time whether I use an InMemorySagaRepository, Redis or MongoDB.
Issue 1
As detailed in another question I posted here:
MassTransit saga with Redis persistence gives Method Accpet does not have an implementation exception
In my SagaStateMachineInstance class I had mistakenly declared the CurrentState property as a State type when it should have been a string:
public string CurrentState { get; set; }
This was a fundamental issue and it came to light as soon as I started trying to add persistence so it may have been causing troubles when using the InMemorySagaRepository too.
Issue 2
In hindsight I suspect this was probably my main issue and I'm not completely convinced I've solved it in the best way but I'm happy with how things are.
I made sure my final event is handled in all states. I think what was happening was that my request/response was finishing before the CurrentState of the saga had been updated. I realised this by experimenting with MongoDB as my persistence and seeing sagas stuck, not completing, in the penultimate state.
Issue 3
This should be unnecessary but I wanted to add it as something to consider/try for those having issues.
I removed the request/response step from my saga and replaced it with a publish/subscribe. To do this I published an event to my consumer which when complete publishes an event with the CorrelationId (as suggested by #alexey-zimarev in my other issue). So in my consumer that does the query (i.e. reuqest) I do the following after it completes:
context.Publish(new LinkCreatedEvent { ... , CorrelationId = context.Message.CorrelationId })
Because the CorrelationId is in there my saga picks it up and handles the event as such:
When(LinkCreatedEvent)
    .ThenAsync(HandleLinkCreatedEventAsync)
    .TransitionTo(LinkCreated)
I'm really happy with how it all works now and feel confident about putting the solution live.

Should Polly Policies be singletons?

I have a query, IGetHamburgers, that calls an external API.
I've registered the implementation of IGetHamburgers in my DI container as a singleton. I'm using Polly as a circuit breaker: if two requests fail, the circuit will open.
My goal is that all calls to the Hamburger api should go through the same circuitbreaker, if GetHamburgers fails, then all other calls should fail as well.
How should I use my Policy? Should I register my Policy as a field like this:
private Policy _policy;

private Policy Policy
{
    get
    {
        if (this._policy != null)
        {
            return this._policy;
        }

        this._policy = Policy
            .Handle<Exception>()
            .CircuitBreaker(2, TimeSpan.FromMinutes(1));

        return this._policy;
    }
}

public object Execute(.......)
{
    return Policy.Execute(() => this.hamburgerQuery.GetHamburgers());
}
OR
public object Execute(.......)
{
    var breaker = Policy
        .Handle<Exception>()
        .CircuitBreaker(2, TimeSpan.FromMinutes(1));

    return breaker.Execute(() => this.hamburgerQuery.GetHamburgers());
}
I guess that the first option is the correct way since then the Policy object will always be the same and can keep track of the exception count and stuff like that.
My question is: will option number two work as well? I've found a lot of samples/examples in Polly's GitHub repo, but I can't find any "real world" examples where Polly is used together with DI and the like.
I guess that the first option is the correct way since then the Policy object will always be the same and can keep track of the exception count and stuff like that.
Correct. This is described in the Polly wiki here. In brief:
Share the same breaker policy instance across call sites when you want those call sites to break in common - for instance they have a common downstream dependency.
Don't share a breaker instance across call sites when you want those call sites to have independent circuit state and break independently.
See this Stack Overflow answer for a more extensive discussion of configuring policies separately from their usage, injecting them into usage sites via DI, and the effects of reusing the same instance (for example a singleton) versus using separate instances, across the full range (as of June 2017) of Polly policies.
will option number two work as well?
No (for the converse reason: each call creates a separate instance, so won't share circuit statistics/states with other calls).
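One way to wire the shared-instance option into DI could be sketched as follows; the HamburgerQuery class name is hypothetical, standing in for whatever implements IGetHamburgers:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;

// Sketch: register a single breaker instance so every consumer that
// resolves ISyncPolicy shares the same circuit state.
var services = new ServiceCollection();

services.AddSingleton<ISyncPolicy>(Policy
    .Handle<Exception>()
    .CircuitBreaker(2, TimeSpan.FromMinutes(1)));

// Hypothetical implementation of IGetHamburgers that takes ISyncPolicy
// in its constructor and calls _policy.Execute(...) around the API call.
// services.AddSingleton<IGetHamburgers, HamburgerQuery>();
```

Because the policy itself is the singleton, it no longer matters whether the consuming class is singleton or transient: all call sites break in common.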

EF Core and big traffic leads to max pool size was reached error

We're using Entity Framework Core to query our MSSQL database in our ASP.NET Core Web API app. Sometimes, under heavy traffic, querying the DB ends with this error:
Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
I wonder whether our pattern of using DbContext and querying is correct, or whether I am missing some using/dispose pattern and the error is caused by a memory leak (after some research I read that I should not use using, because the lifetime is managed by the framework). I am following the documentation...
My connectionString:
"myConnection": "Server=xxx;Database=xxx;user id=xxx;password=xxx;Max Pool Size=200;Timeout=200;"
My Startup.cs
public void ConfigureServices(IServiceCollection services)
{
    .....

    // scoped context
    services.AddDbContext<MyDbContext>(
        options => options.UseSqlServer(this.Configuration.GetConnectionString("myConnection")));
}
Then, in my controllers, I use the dbcontext via dependency injection:
public class MyController : Controller
{
    public MyController(MyDbContext context)
    {
        this.Context = context;
    }

    public ActionResult Get(int id)
    {
        // querying
        return this.Context.tRealty.Where(x => x.id == id).FirstOrDefault();
    }
}
Should I use something like:
using (var context = this.Context)
{
    return this.Context.tRealty.Where(x => x.id == id).FirstOrDefault();
}
But I think that this is bad pattern when I am using dependency injection of DbContext.
I think the problem was caused by storing objects from database context queries in the in-memory cache. I had one big LINQ query against the database context with some subqueries inside. I called FirstOrDefault() at the end of the main query, but not inside the subqueries. The controller was fine with that; it materializes queries by default.
return this.Context.tRealty.AsNoTracking()
    .Where(x => x.Id == id && x.RealtyProcess == RealtyProcess.Visible)
    .Select(s => new
    {
        .....
        // subquery
        videos = s.TVideo
            .Where(video => video.RealtyId == id && video.IsPublicOnYouTube)
            .Select(video => video.YouTubeId)
            .ToList(), // missing ToList()
        .....
    }).FirstOrDefault();
And there was the problem: the subqueries were holding a connection to the database context while they were stored in the in-memory cache. When I implemented a Redis distributed cache, it first failed with some strange errors. Adding ToList() or FirstOrDefault() to all my subqueries helped, because a distributed cache needs materialized objects.
Now I have all my queries materialized explicitly, and I no longer get the max pool size was reached error. So one must be careful when storing objects from database context queries in an in-memory cache: all queries need to be materialized, to avoid holding a connection open somewhere in memory.
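The fix described above can be condensed into a sketch; _cache and cacheKey are illustrative names, not part of the original code:

```csharp
// Sketch: materialize both the outer query (FirstOrDefault) and every inner
// collection (ToList) before handing the result to a cache, so no live
// IQueryable - and therefore no pooled connection - is ever stored.
var realty = this.Context.tRealty
    .AsNoTracking()
    .Where(x => x.Id == id)
    .Select(s => new
    {
        s.Id,
        Videos = s.TVideo
            .Where(v => v.IsPublicOnYouTube)
            .Select(v => v.YouTubeId)
            .ToList() // inner query materialized
    })
    .FirstOrDefault(); // outer query materialized

_cache.Set(cacheKey, realty); // safe: plain objects, no deferred query
```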
You can set the lifetime of the DbContext in your startup.cs, see if this helps:
services.AddDbContext<MyDbContext>(options => options
.UseSqlServer(connection), ServiceLifetime.Scoped);
Also if your query is a simple read you can remove tracking by using .AsNoTracking().
Another way to improve your throughput is to prevent locks by using a transaction block with IsolationLevel.ReadUncommitted for simple reads.
You can also use the Snapshot isolation level - which is slightly more restrictive - if you do not want dirty reads.
TransactionOptions transactionOptions = new TransactionOptions() { IsolationLevel = IsolationLevel.ReadUncommitted };

using (TransactionScope transactionScope = new TransactionScope(TransactionScopeOption.Required, transactionOptions))
{
    // insert magic here
}
Edit : As the author of the question mentioned, the above code is not (yet?) possible in EF Core.
A workaround can be found here using an explicit transaction:
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (var transaction = connection.BeginTransaction())
    {
        // transaction.Commit();
        // transaction.Rollback();
    }
}
I have not tested this.
Edit 2: Another untested snippet where you can have executed commands to set isolation level:
using (var c1 = new SqlConnection(connectionString))
{
    c1.Open();

    // set isolation level
    Exec(c1, "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;");
    Exec(c1, "BEGIN TRANSACTION;");

    // do your magic here
}
With Exec:
private static void Exec(SqlConnection c, string s)
{
    using (var m = c.CreateCommand())
    {
        m.CommandText = s;
        m.ExecuteNonQuery();
    }
}
Edit 3: According to that thread, transactions will be supported from .NET Core version 1.2 onwards:
@mukundabrt this is tracked by dotnet/corefx#2949. Note that
TransactionScope has already been ported to .NET Core but will only be
available in .NET Core 1.2.
I am adding an alternative answer, in case anyone lands here with a slightly different root cause, as was the case for my .NET Core MVC application.
In my scenario, the application was producing these "timeout expired... max pool size was reached" errors due to mixed use of async/await and Task.Result within the same controller.
I had done this in an attempt to reuse code by calling a certain asynchronous method in my constructor to set a property. Since constructors do not allow asynchronous calls, I was forced to use Task.Result. However, I was using async Task<IActionResult> methods to await database calls within the same controller. We engaged Microsoft Support, and an engineer helped explain why this happens:
Looks like we are making a blocking call to an Async method inside
[...] constructor.
...
So, basically something is going wrong in the call to above
highlighted async method and because of which all the threads listed
above are blocked.
Looking at the threads which are doing same operation and blocked:
...
85.71% of threads blocked (174 threads)
We should avoid mixing async and blocking code. Mixed async and
blocking code can cause deadlocks, more-complex error handling and
unexpected blocking of context threads.
https://msdn.microsoft.com/en-us/magazine/jj991977.aspx
https://blogs.msdn.microsoft.com/jpsanders/2017/08/28/asp-net-do-not-use-task-result-in-main-context/
Action Plan
Please engage your application team to revisit the application code of above mentioned method to understand what is going
wrong.
Also, I would appreciate if you could update your application logic to
not mix async and blocking code. You could use await Task instead of
Task.Wait or Task.Result.
So in our case, I pulled the Task.Result out of the constructor and moved it into a private async method where we could await it. Then, since I only want it to run the task once per use of the controller, I store the result to that local property, and run the task from within that method only if the property value is null.
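The refactoring described above could be sketched as follows; the controller, service, and property names are illustrative, not the actual application code:

```csharp
// Sketch of the fix: no blocking Task.Result in the constructor. The value
// is awaited lazily and cached per controller instance, so the task runs
// at most once per use of the controller.
public class MyController : Controller
{
    private readonly IMyService _service; // hypothetical dependency
    private string _settings;             // cached result, null until loaded

    public MyController(IMyService service)
    {
        _service = service; // constructor stays synchronous and non-blocking
    }

    private async Task<string> GetSettingsAsync()
    {
        // run the async call only if the cached value is still null
        return _settings ??= await _service.LoadSettingsAsync();
    }

    public async Task<IActionResult> Get()
    {
        var settings = await GetSettingsAsync(); // awaited, never .Result
        return Ok(settings);
    }
}
```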
In my defense, I expected the compiler would at least throw a warning if mixing async and blocking code is so problematic. However, it seems obvious enough to me, in hindsight!
Hopefully, this helps someone...
