I am using EF Core 7. It looks like, since EF Core 5, there is now Single vs Split Query execution.
I see that the default configuration still uses the Single Query execution though.
I noticed in my logs it was saying:
'Microsoft.EntityFrameworkCore.Query.MultipleCollectionIncludeWarning':
Compiling a query which loads related collections for more than one
collection navigation, either via 'Include' or through projection, but
no 'QuerySplittingBehavior' has been configured. By default, Entity
Framework will use 'QuerySplittingBehavior.SingleQuery', which can
potentially result in slow query performance.
Then I configured the warning on the DbContext to get more details:
services.AddDbContextPool<TheBestDbContext>(
options => options.UseSqlServer(configuration.GetConnectionString("TheBestDbConnection"))
.ConfigureWarnings(warnings => warnings.Throw(RelationalEventId.MultipleCollectionIncludeWarning))
);
Then I was able to specifically see which call was actually causing that warning.
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.SingleOrDefaultAsync(u => u.Id == userId);
So basically the same code would look like this:
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.AsSplitQuery() // <===
.SingleOrDefaultAsync(u => u.Id == userId);
with the split query option.
I went through the documentation, but I'm still not sure how to create a pattern out of it.
I would like to set the most common one as a default value across the project, and only use the other for specific scenarios.
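Presumably the project-wide default would be set where the context is registered; here is a sketch reusing the registration from above (UseQuerySplittingBehavior is the EF Core 5+ API for this):
services.AddDbContextPool<TheBestDbContext>(
    options => options.UseSqlServer(
        configuration.GetConnectionString("TheBestDbConnection"),
        // assumption: split as the project-wide default
        sqlOptions => sqlOptions.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery)));
Individual queries could then opt back in with .AsSingleQuery() for the specific scenarios.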
Based on the documentation, I have a feeling that "Split" should be used as the default in general, but with caution. I also noticed in their documentation on pagination that it says:
When using split queries with Skip/Take, pay special attention to making your query ordering fully unique; not doing so could cause incorrect data to be returned. For example, if results are ordered only by date, but there can be multiple results with the same date, then each one of the split queries could each get different results from the database. Ordering by both date and ID (or any other unique property or combination of properties) makes the ordering fully unique and avoids this problem. Note that relational databases do not apply any ordering by default, even on the primary key.
which completely makes sense as the query will be split.
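For instance, here is a sketch of that guidance applied to the Blogs/Posts example from Microsoft's docs (assuming Name is not unique on its own):
var page = await ctx.Blogs
    .Include(b => b.Posts)
    .OrderBy(b => b.Name) // not unique on its own
    .ThenBy(b => b.Id)    // unique tie-breaker keeps the split queries consistent
    .Skip(20)
    .Take(10)
    .AsSplitQuery()
    .ToListAsync();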
But if we are mainly fetching a single record from the database, regardless of how big or small the include list with its navigation properties is, should I always go with the "Split" approach?
I would love to hear if there are any best practices on that and when to use which approach.
But if we are mainly fetching a single record from the database, regardless of how big or small the include list with its navigation properties is, should I always go with the "Split" approach?
It depends. Let's examine your example with the single query approach:
var user = await _userManager.Users // 1 record via SingleOrDefault, but the server is asked for TOP(2)
.Include(x => x.UserRoles) // R roles
.ThenInclude(x => x.ApplicationRole) // 1 record
.ThenInclude(x => x.RoleClaims) // C claims
.SingleOrDefaultAsync(u => u.Id == userId);
As a result, the client receives RecordCount = 1 * R * 1 * C rows (for example, 5 roles with 10 claims each means 50 rows for one user). They are then deduplicated and placed into the appropriate collections.
If RecordCount is reasonably small, the single query can be the best approach.
Also, EF Core adds an ORDER BY to such queries, which may slow down execution, so it is better to examine the execution plan.
Side note: it is better to use FirstOrDefault/FirstOrDefaultAsync; it CAN be a lot faster than SingleOrDefault/SingleOrDefaultAsync when SQL Server fails to detect early that there is no second record in the result set.
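For illustration, here is the difference between the two calls on the query from the question (the TOP values are what EF Core sends to SQL Server):
// SingleOrDefaultAsync translates to TOP(2): the server must check for a second match
var single = await _userManager.Users.SingleOrDefaultAsync(u => u.Id == userId);

// FirstOrDefaultAsync translates to TOP(1): the server can stop at the first match
var first = await _userManager.Users.FirstOrDefaultAsync(u => u.Id == userId);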
The documentation at https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries outlines the considerations where split queries can have unintended consequences, particularly around isolation and ordering. As mentioned, when loading a single record with related details, single query execution is generally preferred. The warning appears because you have a one-to-many that contains another one-to-many, so EF is warning that this can potentially lead to a much larger cartesian product in a JOIN-based query. Since you are confident that the query is reasonable in size, you can specify .AsSingleQuery() explicitly and the warning should disappear.
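Applied to the query from the question, the explicit opt-in would be:
var user = await _userManager.Users
    .Include(x => x.UserRoles)
    .ThenInclude(x => x.ApplicationRole)
    .ThenInclude(x => x.RoleClaims)
    .AsSingleQuery() // explicit choice, so the warning goes away
    .SingleOrDefaultAsync(u => u.Id == userId);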
When working with object graphs like this, consider designing operations against the data state to be as atomic as possible. If you are editing a User that has Roles and Claims, rather than loading everything for a User and attempting to edit the entire graph in memory in one go, you might structure the application to perform actions like "AddRoleToUser", "RemoveRoleFromUser", "AddClaimToUserRole", etc. So instead of loading a User with Roles with Claims, these actions just load the Roles for a user, or the Claims for a UserRole, respectively, to alter this data.
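A minimal sketch of that idea, assuming a UserRoles join-table DbSet (the method and entity names here are illustrative, not from the question's code):
// Touches only the join table instead of loading User -> Roles -> Claims.
public async Task AddRoleToUserAsync(string userId, string roleId)
{
    var exists = await _context.UserRoles
        .AnyAsync(ur => ur.UserId == userId && ur.RoleId == roleId);

    if (!exists)
    {
        _context.UserRoles.Add(new UserRole { UserId = userId, RoleId = roleId });
        await _context.SaveChangesAsync();
    }
}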
After going through all of this to figure out whether there is a pattern for applying it, and with all the great content provided in the other answers, I was still not sure when to use split queries and when not to, so I have tried to summarize my understanding below.
I will use the same example that Microsoft shows in its Single vs Split Queries documentation:
var blogs = ctx.Blogs
.Include(b => b.Posts)
.Include(b => b.Contributors)
.ToList();
and here is the generated SQL for that:
SELECT [b].[Id], [b].[Name], [p].[Id], [p].[BlogId], [p].[Title], [c].[Id], [c].[BlogId], [c].[FirstName], [c].[LastName]
FROM [Blogs] AS [b]
LEFT JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id], [p].[Id]
Microsoft says:
In this example, since both Posts and Contributors are collection
navigations of Blog - they're at the same level - relational databases
return a cross product: each row from Posts is joined with each row
from Contributors. This means that if a given blog has 10 posts and 10
contributors, the database returns 100 rows for that single blog. This
phenomenon - sometimes called cartesian explosion - can cause huge
amounts of data to unintentionally get transferred to the client,
especially as more sibling JOINs are added to the query; this can be a
major performance issue in database applications.
However, what it doesn't clearly mention is that, beyond the sorting/ordering issues, split queries may easily mess up query performance too.
The first concern is that we are going to hit the database multiple times in that case.
Let's check this one:
using (var context = new BloggingContext())
{
var blogs = context.Blogs
.Include(blog => blog.Posts)
.AsSplitQuery()
.ToList();
}
And check out the generated SQL when .AsSplitQuery() is used.
SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]
SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
The query above kind of surprised me. It is interesting that when the split option is used, the second query still joins back to Blogs, even though it should only be pulling data from the Posts table. I'm pretty sure the EF Core folks had some idea behind that, but it just doesn't make sense to me. Then what is the point of having that foreign key over there?
It looks like Microsoft was mainly focused on solving the cartesian explosion problem, but that obviously doesn't mean "split queries" should be used as a best practice by default going forward. Definitely not!
Another possible problem I can think of is data inconsistency: since the queries are run separately, you can't guarantee data consistency (unless the data is completely locked).
I don't want to throw the feature away, of course. There are still some "good" scenarios for split queries, in my opinion (unless you are really worried about data consistency): if we are returning lots of columns across a relation and the size is pretty large, the split could be a real performance factor. Or when there isn't much parent data but there are tons of sibling navigation collections; then there is your cartesian explosion.
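For instance, here is a sketch of that last scenario (the entity names are made up for illustration):
var orders = ctx.Orders
    .Include(o => o.Lines)     // sibling collection 1
    .Include(o => o.Payments)  // sibling collection 2
    .Include(o => o.Shipments) // sibling collection 3
    .AsSplitQuery()            // avoids the Lines x Payments x Shipments row blow-up
    .ToList();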
PS: Note that cartesian explosion does not occur when the two JOINs aren't at the same level.
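For example, a nested chain like this joins one level deeper instead of adding a sibling JOIN, so the row count grows with the nested data rather than multiplying sibling collections together (Comments is an illustrative navigation, not from the docs sample):
var blogs = ctx.Blogs
    .Include(b => b.Posts)
        .ThenInclude(p => p.Comments) // one level deeper, not a sibling JOIN
    .ToList();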
Last but not least, personally, if I am really going to pull a heavy amount of data with a bunch of relations of relations of relations, I would still prefer those "good old" stored procedures. It never gets old!
I am using the https://github.com/VahidN/EFCoreSecondLevelCacheInterceptor package to cache EF Core query results in my ASP.NET Core app. According to its creator, it works like this:
The results of EF commands will be stored in the cache, so that the
same EF commands will retrieve their data from the cache rather than
executing them against the database again.
So this library returns cached results if the generated SQL is the same.
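For context, the wiring at startup looks roughly like this (recalled from the package's README, so treat names like AppDbContext and connectionString as placeholders and double-check against it):
// Register the cache services (memory cache provider as an example)...
services.AddEFSecondLevelCache(options => options.UseMemoryCacheProvider());

// ...and attach the interceptor so executed SQL and its results can be cached.
services.AddDbContextPool<AppDbContext>((serviceProvider, options) =>
    options.UseSqlServer(connectionString)
        .AddInterceptors(serviceProvider.GetRequiredService<SecondLevelCacheInterceptor>()));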
The problem is that I am using a lot of Include() calls while querying the database. This is needed because some pages show a lot of information that is stored in different related database tables.
Also, the query contains current-user-specific info, e.g. whether the current user liked the post or not. Caching user-specific info is not right, as I've heard, and results in redundant cache entries, since the only data that changes is the current user ID.
Example of a query:
var post = dbContext.Posts
.Where(p => p.Id == 1)
.Include(p => p.Comments)
.Include(p => p.VoteTracker
.Where(t => t.UserId == "{current user ID}"))
// ... other Include(...) calls
.ToList();
The generated SQL can be quite huge, with a lot of JOINs. So the problems with caching here are:
If Comments get changed, then the whole query above will be invalidated as well, since this library "watches for all of the CRUD operations using its interceptor and then invalidates the related cache entries automatically".
Having current-user-specific info in the query creates a lot of identical cache entries, which vary only by that current-user-specific info.
What is the better approach for caching here?
The first thing that comes to mind is to have separate queries that can be cached. For example:
// Post and Comments query will be cached
var post = dbContext.Posts.SingleOrDefault(p => p.Id == 1);
var comments = dbContext.Comments
.Where(c => c.PostId == 1)
.ToList();
// This one will be excluded from caching (because of user-specific info)
var voteTracker = dbContext.VoteTrackers
.SingleOrDefault(t => t.PostId == 1 && t.UserId == "{current user ID}");
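If this route is taken, the caching can be made explicit per query with the library's Cacheable() extension (the expiration mode and time below are just example values):
// Opt the shared, non-user-specific query into caching.
var post = dbContext.Posts
    .Where(p => p.Id == 1)
    .Cacheable(CacheExpirationMode.Sliding, TimeSpan.FromMinutes(5))
    .SingleOrDefault();

// The user-specific lookup is never marked Cacheable(), so it always
// hits the database and creates no per-user cache entries.
var voteTracker = dbContext.VoteTrackers
    .SingleOrDefault(t => t.PostId == 1 && t.UserId == "{current user ID}");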
Is this a good approach? And if it's not, which one is better?
I have been struggling with this a lot and am having a hard time finding the right solution. Thank you very much in advance! :)
I have a question about Entity Framework Core and LINQ. I would like to get details from other tables while accessing the Clients table. I can get them using the code below. There are around 10 tables in total that I need to join; in this case, is the approach below good, or is there another, better approach? ClientId is the foreign key in all of the tables.
Actually, I am getting the warning below:
[09:34:33 Warning] Microsoft.EntityFrameworkCore.Query
Compiling a query which loads related collections for more than one collection navigation either via 'Include' or through projection but no 'QuerySplittingBehavior' has been configured. By default Entity Framework will use 'QuerySplittingBehavior.SingleQuery' which can potentially result in slow query performance. See https://go.microsoft.com/fwlink/?linkid=2134277 for more information. To identify the query that's triggering this warning call 'ConfigureWarnings(w => w.Throw(RelationalEventId.MultipleCollectionIncludeWarning))'
Code:
var client = await _context.Clients
.Include(x => x.Address)
.Include(x => x.Properties)
.Include(x => x.ClientDetails)
-------------------
-------------------
-------------------
-------------------
.Where(x => x.Enabled == activeOnly && x.Id == Id).FirstOrDefaultAsync();
Actually, when you use eager loading (using Include()), EF uses LEFT JOINs (all needed data in one query) to fetch the data. That's the default EF behavior in EF 5.
You can add AsSplitQuery() to your query to split all the includes into separate queries, like:
var client = await _context.Clients
.Include(x => x.Address)
.Include(x => x.Properties)
.Include(x => x.ClientDetails)
-------------------
-------------------
-------------------
-------------------
.Where(x =>x.Id == Id).AsSplitQuery().FirstOrDefaultAsync()
This approach needs more database round trips, but that's nothing really important.
And as a final recommendation, I advise using AsNoTracking() on read-only queries for higher performance.
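For example, combined with the split query from above:
var client = await _context.Clients
    .AsNoTracking() // read-only: skips the change-tracker overhead
    .Include(x => x.Address)
    .Include(x => x.Properties)
    .Include(x => x.ClientDetails)
    .Where(x => x.Id == Id)
    .AsSplitQuery()
    .FirstOrDefaultAsync();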
I have 3 different approaches, depending on the version of EF Core you're using.
EF Core 5: as some have mentioned in previous answers, there is a new call which will simply break the query up into smaller queries and map all the relations at the end.
/*rest of the query here*/.AsSplitQuery();
If you are not able to migrate your EF version, you can still split the query manually:
var client = await _context.Clients.FirstOrDefaultAsync(t => t.Enabled /*other conditions*/);
var Address = await _context.Addresses.FirstOrDefaultAsync(t => t.ClientId == client.Id);
/// Because they are tracked, EF's entity tracker can map the
/// sub-queries to their correct relations under the hood.
/// In this case you should not use .AsNoTracking(),
/// unless you want to stitch the relations together yourself.
Another alternative is to write your query as a Select statement. This greatly improves performance but is a bit more of a hassle to construct.
var clientResult = await _context.Clients.Where(x => x.Id == id).Select(x => new
{
client = x,
x.Address,
Properties = x.Properties.Select(property => new
{
property.Name /*sub query for one to many related*/
}).ToList(),
x.ClientDetails
}).ToListAsync();
It doesn't take many includes to create a cartesian explosion.
You can read more about the problem at hand in this article: cartesian explosion in EF Core.
And a link about optimizing performance through EF Core can be found here: Maximizing Entity Framework Core Query Performance.
I have a query which looks like the following:
users = ctx.SearchedUsers.AsNoTracking()
.IncludeEFU(ctx,c=>c.SearchedUserItems)
.IncludeEFU(ctx,c=>c.UserTransactions)
.Where(y => y.InQueue == false &&
y.LastTimeSearchedAt.Value > inPast10Dayss)
.OrderBy(y => y.LastUpdatedAt)
.Take(100).ToList();
The IncludeEFU() method comes from the EFUtilities library, which I got from here:
https://github.com/MikaelEliasson/EntityFramework.Utilities
And another version of this query (using a regular Include) looks like this:
users = ctx.SearchedUsers.AsNoTracking()
.Include("SearchedUserItems")
.Include("UserTransactions")
.Where(y => y.InQueue == false &&
y.LastTimeSearchedAt.Value > inPast10Dayss)
.OrderBy(y => y.LastUpdatedAt)
.Take(100).ToList();
Now both of these queries give terrible performance results: taking 100 records from the SearchedUsers table, with approximately 2,000 to 5,000 records in the two neighboring tables SearchedUserItems and UserTransactions, takes anywhere between 2 and 5 minutes.
What I did earlier was pull each record from the SearchedUsers table and then pull its neighboring records via the FK (which is indexed in both tables for faster performance).
Now my question is: is there any faster way to include these records in one query, so I don't have to pull them one by one?
Can someone help me out?
I have a LINQ query on a DbSet that hits a table and grabs 65k rows. The query takes about 3 minutes, which seems like obviously too much to me. I don't have a baseline for comparison, but I'm certain this can be improved. I'm relatively new to EF and LINQ, so I suspect I may also be structuring my query in a way that is a big "NO".
I read that change tracking is where EF spends most of its time, and it is enabled on the entity in question, so perhaps I should turn that off (if so, how?).
Here's the code:
ReportTarget reportTarget = repository.GetById(reportTargetId);
if (reportTarget != null)
{
ReportsBundle targetBundle = reportTarget.SavedReportsBundles.SingleOrDefault(rb => rb.ReportsBundleId == target.ReportsBundleId);
if (targetBundle != null)
{
}
}
This next statement takes 3 minutes to execute (65k records):
IPoint[] pointsData = targetBundle.ReportEntries
.Where(e => ... a few conditions )
.Select((entry, i) => new
{
rowID = entry.EntryId,
x = entry.Profit,
y = i,
weight = target.HiddenPoints.Contains(entry.EntryId) ? 0 : 1,
group = 0
}.ActLike<IPoint>())
.ToArray();
Note: ActLike() is from the Impromptu Interface library, which uses the .NET DLR to make dynamic proxies of objects that implement an interface on the fly. I doubt this is the bottleneck.
How can I optimize performance for this particular DbSet (TradesReportEntries), as I'll often be querying this table for large data sets (IPoint[]s)?
Well, it looks like you're loading an entity object and then querying its navigation properties. When this occurs, EF loads all related entities FIRST (via lazy loading), and then your query is performed on the entire collection in memory. This may be why you're having performance issues.
Try querying against the collection by using the following:
context.Entry(targetBundle)
.Collection(p => p.TradesReportEntries)
.Query()
.Where( e => <your filter here> )
.Select( <your projection here> )
This allows you to specify a filter in addition to the behind-the-curtain filter that handles loading the nav property by default. Let us know how that works out.
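And since the question also asked how to turn off change tracking: for a read-only projection like this, you can add AsNoTracking() to the filtered query. A sketch (the filter placeholder is still yours to fill in):
var entries = context.Entry(targetBundle)
    .Collection(p => p.TradesReportEntries)
    .Query()
    .AsNoTracking() // results are not added to the change tracker
    .Where(e => /* your filter here */ true)
    .ToArray();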