Include() vs Select() performance

Include() vs Select() performance - c#

I have a parent entity with a navigation property to a child entity. The parent entity may not be removed as long as there are associated records in the child entity. The child entity can contain hundreds of thousands of records.
I'm wondering what will be the most efficient to do in Entity Framework to do this:
var parentRecord = _context.Where(x => x.Id == request.Id)
.Include(x => x.ChildTable)
.FirstOrDefault();
// check if parentRecord exists
if (parentRecord.ChildTable.Any()) {
// cannot remove
}
or
var parentRecord = _context.Where(x => x.Id == request.Id)
.Select(x => new {
ParentRecord = x,
HasChildRecords = x.ChildTable.Any()
})
.FirstOrDefault();
// check if parentRecord exists
if (parentRecord.HasChildRecords) {
// cannot remove
}
The first query may include thousands of records while the second query will not, however, the second one is more complex.
Which is the best way to do this?

I would say it depens. It depends on which DBMS you're using. it depends on how good the optimizer works etc.
So one single statement with a JOIN could be far faster than a lot of SELECT statements.
In general I would say when you need the rows from your Child table use .Include(). Otherwise don't include them.
Or in simple words, just read the data you need.

The answer depends on your database design. Which columns are indexed? How much data is in table?
Include() offloads work to your C# layer, but means a more simple query. It's probably the better choice here but you should consider extracting the SQL that is generated by entity framework and running each through an optimisation check.
You can output the sql generated by entity framework to your visual studio console as note here.
This example might create a better sql query that suites your needs.

Related

EF Core Single vs. Split Queries

I am using EF Core 7. It looks like, since EF Core 5, there is now Single vs Split Query execution.
I see that the default configuration still uses the Single Query execution though.
I noticed in my logs it was saying:
Microsoft.EntityFrameworkCore.Query.MultipleCollectionIncludeWarning':
Compiling a query which loads related collections for more than one
collection navigation, either via 'Include' or through projection, but
no 'QuerySplittingBehavior' has been configured. By default, Entity
Framework will use 'QuerySplittingBehavior.SingleQuery', which can
potentially result in slow query performance.
Then I configured a warning on db context to get more details:
services.AddDbContextPool<TheBestDbContext>(
options => options.UseSqlServer(configuration.GetConnectionString("TheBestDbConnection"))
.ConfigureWarnings(warnings => warnings.Throw(RelationalEventId.MultipleCollectionIncludeWarning))
);
Then I was able to specifically see which call was actually causing that warning.
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.SingleOrDefaultAsync(u => u.Id == userId);
So basically same code would be like:
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.AsSplitQuery() // <===
.SingleOrDefaultAsync(u => u.Id == userId);
with Split query option.
I went through the documentation, but I'm still not sure how to create a pattern out of it.
I would like to set the most common one as a default value across the project, and only use the other for specific scenarios.
Based on the documentation, I have a feeling that the "Split" should be used as default in general but with caution. I also noticed on their documentation specific to pagination, that it says:
When using split queries with Skip/Take, pay special attention to making your query ordering fully unique; not doing so could cause incorrect data to be returned. For example, if results are ordered only by date, but there can be multiple results with the same date, then each one of the split queries could each get different results from the database. Ordering by both date and ID (or any other unique property or combination of properties) makes the ordering fully unique and avoids this problem. Note that relational databases do not apply any ordering by default, even on the primary key.
which completely makes sense as the query will be split.
But if we are mainly fetching from database for a single record, regardless how big or small the include list with its navigation properties, should I always go with "Split" approach?
I would love to hear if there are any best practices on that and when to use which approach.

But if we are mainly fetching from database for a single record, regardless how big or small the include list with its navigation properties, should I always go with "Split" approach?
It depends, let's examine your example in Single query approach:
var user = await _userManager.Users // 1 records based on SingleOrDefault but to server goes TAKE 2
.Include(x => x.UserRoles) // R roles
.ThenInclude(x => x.ApplicationRole) // 1 record
.ThenInclude(x => x.RoleClaims) // C claims
.SingleOrDefaultAsync(u => u.Id == userId);
As result on the client will be returned RecordCount = 1 * R * 1 * C records. Then they will be deduplicated and placed in appropriate collections.
If RecordCount is approximately small Single query can be best approach.
Also EF Core adds ORDER BY for such query which may slowdown execution. So better examine execution plan.
Side note: Better to use FirstOrDefault/Async it CAN be a lot faster than SingleOrDefault/Async, when SQL server fails to detect that there no 2 records in recordset early.

The documentation at https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries outlines the considerations when Split Queries could have unintentional consequences, particularly around isolation and ordering. As mentioned when loading a single record with related details, a singlw query execution is generally perferred. The warning is appearing because you have a one-to-many, which contains a one-to-many, so it is warning that this can potentially lead to a much larger Cartesian Product in terms of a JOIN-based query. To avoid the warning as you are confident that the query is reasonable in size, you can specify .AsSingleQuery() explicitly and the warning should disappear.
When working with object graphs like this you can consider designing operations against the data state to be as atomic as possible. IF you are editing a User that has Roles & Claims, rather than loading everything for a User and attempting to edit the entire graph in memory in one go, you might structure the application to perform actions like "AddRoleToUser", "RemoveRoleFromUser", AddClaimToUserRole", etc. So instead of loading User /w Roles /w Claims, these actions just load Roles for a user, or Claims for a UserRole respectively to alter this data.

After searching through this to figure out if there is any pattern to apply this, and with all the great content provided at the bottom, I was still not sure as I was looking for "When to use split queries" and "when not to", so I tried the summarized my understanding at the bottom.
I will use the same example that Microsoft shows on Single vs Split Queries
var blogs = ctx.Blogs
.Include(b => b.Posts)
.Include(b => b.Contributors)
.ToList();
and here is the generated SQL for that:
SELECT [b].[Id], [b].[Name], [p].[Id], [p].[BlogId], [p].[Title], [c].[Id], [c].[BlogId], [c].[FirstName], [c].[LastName]
FROM [Blogs] AS [b]
LEFT JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id], [p].[Id]
Microsoft says:
In this example, since both Posts and Contributors are collection
navigations of Blog - they're at the same level - relational databases
return a cross product: each row from Posts is joined with each row
from Contributors. This means that if a given blog has 10 posts and 10
contributors, the database returns 100 rows for that single blog. This
phenomenon - sometimes called cartesian explosion - can cause huge
amounts of data to unintentionally get transferred to the client,
especially as more sibling JOINs are added to the query; this can be a
major performance issue in database applications.
However what it doesn't clearly mention is, other than sorting/ordering issues, this may easily mess up the performance of the queries.
First concern is, we are going to be hitting to database multiple times in that case.
Let's check this one:
using (var context = new BloggingContext())
{
var blogs = context.Blogs
.Include(blog => blog.Posts)
.AsSplitQuery()
.ToList();
}
And check out the generated SQL when .AsSplitQuery() is used.
SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]
SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
So above query was kind of surprised me. It is interesting that when it uses the split option, it still joins on the second query even though second query should only be pulling data from posts table. Pretty sure EF Core folks had some idea behind that but it just doesn't make sense to me. Then what is the point of having that foreign key over there?
Looks like Microsoft was mainly focused on a solution to avoid cartesian explosion problem but obviously it doesn't mean that "split queries" should be used as best practices by default going forward. Definitely not!
And another possible problem I can think of is data inconsistency, yet the queries are ran separate, you can't guarantee the data consistency. (unless completely locked)
I just don't want to throw away the feature of course. There are still some "good" scenarios to use Split Queries imo, (unless you are really worried about the data consistency) like if we are returning lots of columns with a relation and the size is pretty large, then this could be really performance factor. Or the parent data is not a lot, but tons of navigation sets, then there is your cartesian explosion.
PS: Note that cartesian explosion does not occur when the two JOINs aren't at the same level.
Last but not least, personally, if I am really going to be pulling some heavy amount of data with bunch of relation of relation of relation, I would still prefer those "good old" Stored Procedures. It never gets old!

How do I update multiple Entity models in one SQL statement?

I had the following:
List<Message> unreadMessages = this.context.Messages
.Where( x =>
x.AncestorMessage.MessageID == ancestorMessageID &&
x.Read == false &&
x.SentTo.Id == userID ).ToList();
foreach(var unreadMessage in unreadMessages)
{
unreadMessage.Read = true;
}
this.context.SaveChanges();
But there must be a way of doing this without having to do 2 SQL queries, one for selecting the items, and one for updating the list.
How do i do this?

Current idiomatic support in EF
As far as I know, there is no direct support for "bulk updates" yet in Entity Framework (there has been an ongoing discussion for bulk operation support for a while though, and it is likely it will be included at some point).
(Why) Do you want to do this?
It is clear that this is an operation that, in native SQL, can be achieved in a single statement, and provides some significant advantages over the approach followed in your question. Using the single SQL statement, only a very small amount of I/O is required between client and DB server, and the statement itself can be completely executed and optimized by the DB server. No need to transfer to and iterate through a potentially large result set client side, just to update one or two fields and send this back the other way.
How
So although not directly supported by EF, it is still possible to do this, using one of two approaches.
Option A. Handcode your SQL update statement
This is a very simple approach, that does not require any other tools/packages and can be performed Async as well:
var sql = "UPDATE TABLE x SET FIELDA = #fieldA WHERE FIELDB = #fieldb";
var parameters = new SqlParameter[] { ..., ... };
int result = db.Database.ExecuteSqlCommand(sql, parameters);
or
int result = await db.Database.ExecuteSqlCommandAsync(sql, parameters);
The obvious downside is, well breaking the nice linqy paradigm and having to handcode your SQL (possibly for more than one target SQL dialect).
Option B. Use one of the EF extension/utility packages
Since a while, a number of open source nuget packages are available that offer specific extensions to EF. A number of them do provide a nice "linqy" way to issue a single update SQL statement to the server. Two examples are:
Entity Framework Extended Library that allows performing a bulk update using a statement like:
context.Messages.Update(
x => x.Read == false && x.SentTo.Id == userID,
x => new Message { Read = true });
It is also available on github
EntityFramework.Utilities that allows performing a bulk update using a statement like:
EFBatchOperation
.For(context, context.Messages)
.Where(x => x.Read == false && x.SentTo.Id == userID)
.Update(x => x.Read, x => x.Read = true);
It is also available on github
And there are definitely other packages and libraries out there that provide similar support.

Even SQL has to do this in two steps in a sense, in that an UPDATE query with a WHERE clause first runs the equivalent of a SELECT behind the scenes, filtering via the WHERE clause, then applying the update. So really, I don't think you need to be worried about improving this.
Further, the reason why it's broken into two steps like this in LINQ is precisely for performance reasons. You want that "select" to be as minimal as possible, i.e. you don't want to load any more objects from the database into in memory objects than you have to. Only then do you alter objects (in the foreach).
If you really want to run a native UPDATE on the SQL side, you could use a System.Data.SqlClient.SqlCommand to issue the update, instead of having LINQ give you back objects that you then update. That will be faster, but then you conceptually move some of your logic out of your C# code object model space into the database model space (you are doing things in the database, not in your object space), even if the SqlCommand is being issued from your code.

Selecting one to many which one is better

In a one to many relationship situation which of the following has better performance.
1st approach
public Order GetOrder(long orderId) {
var orderDetails =
(from o in Orders
from d in OrderDetails
where d.OrderId = o.Id && o.Id = orderId
select new {
Order = o,
Detail = d
}).ToList();
var order = orderDetails.First().Order;
order.Details = orderDetails.Select(od => od.Detail).ToList();
return order;
}
2nd approach
public Order GetOrder(long orderId) {
var order = Orders.First(o => o.Id == orderId);
order.Details = OrderDetails.Where(od => od.OrderId = orderId).ToList();
return order;
}
The point I am trying to figure out (in terms of performance) is, in first approach there is single query but repeated data is being selected where, in second approach, there are two seperate queries but selecting only the data that is enough.
You can assume Orders and OrderDetails are IQueryable<T> of EntityFramework (dbContext.Set<T>()) or NHibernate (session.Query<T>()). I tried with both and they create very similar sql queries. Also as far as I know, these ORM's built in one to many queries use something like the first approach.
UPDATE, to clarify what I am asking: Which one (single query but repeated data or only required data but multiple queries) performs better under which circumstances? There may be many situations that I may not think of. That's why I am not trying benchmarking. As already stated in some answers column count or more joins were the kinds of answers that I expected. (There may be also something about row count of table and/or result set). Based on these kind of answers I may try benchmarking. And of course I am asking why? I am not trying to solve Order - OrderDetail problem or solve anything at all. I am trying to learn and understand when to use single query but repeated data or only required data but multiple queries.

A single one-to-many query is pretty straightforward for ORMs. It's when you need to make several interrelated one-to-many queries that performance considerations start making themselves known.

always measure performance for your particular case. if order table has few-small sized columns, getting all data in one round trip may be better. if order tables has too many or blob columns, issuing 2 seperate queries may outperform.

Using the EntityFramework, you should either call Include on the context
var order = context.Orders.Include(x => x.Details).First(x => x.Id == orderId);
Loading Related Objects

Entity Framework - How many trips to the database on DeleteObject

Is this entity framework call actually making two trips to the database?
var result = from p in entities.people where p.id == 6 select p;
entities.DeleteObject(result);
It strikes me that maybe DeleteObject would force the first trip to get results and THEN, having the object to work with, would execute the delete function.
If that is the case, how do I avoid the second trip? How does one do a remove-by-query in entity framework with a single database trip?
Thanks!
EDIT
My original example was misleading, because it was a query by primary key. I guess the real question is whether there is a way to have a single-trip function that can delete items from an IQueryable. For example:
var result = from p in entities.people where p.cityid == 6 select p;
foreach (var r in result)
{
entities.DeleteObject(r);
}
(Notice that the query is of a foreign key, so there may be multiple results).

You can do it like this:
entities.ExecuteStoreCommand("DELETE FROM people WHERE people.cityid={0}", 6);
this is one trip to the database for sure, and effective as it can be.
EDIT:
Also, take a look here, they suggest the same solution. And to answer the question, this is the only way to delete entities, not referenced by primary key, from a database using entity framework, without fetching these entities (and without writing some helper extension methods like suggested in this answer).

Direct delete:
var people = new people() { id = 6 };
entities.people.Attach(people);
entities.people.Remove(people);
entities.SaveChanges();
If you want to see it for yourself, fire up a profiler.
EDIT:
This will allow you to use Linq but it won't be one trip.
var peopleToDelete = entities.people.Where(p => p.id == 6);
foreach (var people in peopleToDelete )
entities.people.DeleteObject(people );
entities.SaveChanges();
There's no easy way to do that out of the box in EF, a big annoyance indeed (as long as one does not want to resort to using direct SQL, which personally I don't). One of the other posters links to an answer that in turn links to this article, which describes a way to make your own function for this, using ToTraceString.

How do I apply the LINQ to SQL Distinct() operator to a List<T>?

I have a serious(it's getting me crazy) problem with LINQ to SQL. I am developing an ASP.NET MVC3 application using c# and Razor in Visual Studio 2010.
I have two database tables, Product and Categories:
Product(Prod_Id[primary key], other attributes)
Categories((Dept_Id, Prod_Id) [primary keys], other attributes)
Obviously Prod_Id in Categories is a foreign key. Both classes are mapped using the Entity Framework (EF). I do not mention the context of the application for simplicity.
In Categories there are multiple rows containing the Prod_Id. I want to make a projection of all Distinct Prod_Id in Categories. I did it using plain (T)SQL in SQL Server MGMT Studio according to this (really simple) query:
SELECT DISTINCT Prod_Id
FROM Categories
and the result is correct. Now I need to make this query in my application so I used:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
I go to check the result of my query by using:
query.Select(m => m.Prod_Id);
or
foreach(var item in query)
{
item.Prod_Id;
//other instructions
}
and it does not work. First of all the Intellisense when I attempt to write query.Select(m => m. or item.shows just suggestions about methods (such as Equals, etc...) and not properties. I thought that maybe there was something wrong with Intellisense (I guess most of you many times hoped that Intellisense was wrong :-D) but when I launch the application I receive an error at runtime.
Before giving your answer keep in mind that;
I checked many forums, I tried the normal LINQ to SQL (without using lambdas) but it does not work. The fact that it works in (T)SQL means that there is something wrong with the LINQ to SQL instruction (other queries in my application work perfectly).
For application related reasons, I used a List<T> variable instead of _StoreDB.Categories and I thought that was the problem. If you can offer me a solution without using a List<T> is appreciated as well.

This line:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Your LINQ query most likely returns IEnumerable... of ints (judging by Select(m => m.Prod_Id)). You have list of integers, not list of entity objects. Try to print them and see what you got.

Calling _StoreDB.Categories.Select(m => m.Prod_Id) means that query will contain Prod_Id values only, not the entire entity. It would be roughly equivalent to this SQL, which selects only one column (instead of the entire row):
SELECT Prod_Id FROM Categories;
So when you iterate through query using foreach (var item in query), the type of item is probably int (or whatever your Prod_Id column is), not your entity. That's why Intellisense doesn't show the entity properties that you expect when you type "item."...
If you want all of the columns in Categories to be included in query, you don't even need to use .Select(m => m). You can just do this:
var query = _StoreDB.Categories.Distinct();
Note that if you don't explicitly pass an IEqualityComparer<T> to Distinct(), EqualityComparer<T>.Default will be used (which may or may not behave the way you want it to, depending on the type of T, whether or not it implements System.IEquatable<T>, etc.).
For more info on getting Distinct to work in situations similar to yours, take a look at this question or this question and the related discussions.

As has been explained by the other answers, the error that the OP ran into was because the result of his code was a collection of ints, not a collection of Categories.
What hasn't been answered was his question about how to use the collection of ints in a join or something in order to get at some useful data. I will attempt to do that here.
Now, I'm not really sure why the OP wanted to get a distinct list of Prod_Ids from Categories, rather than just getting the Prod_Ids from Projects. Perhaps he wanted to find out what Products are related to one or more Categories, thus any uncategorized Products would be excluded from the results. I'll assume this is the case and that the desired result is a collection of distinct Products that have associated Categories. I'll first answer the question about what to do with the Prod_Ids first, and then offer some alternatives.
We can take the collection of Prod_Ids exactly as they were created in the question as a query:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Then we would use join, like so:
var products = query.Join(_StoreDB.Products, id => id, p => p.Prod_Id,
(id,p) => p);
This takes the query, joins it with the Products table, specifies the keys to use, and finally says to return the Product entity from each matching set. Because we know that the Prod_Ids in query are unique (because of Distinct()) and the Prod_Ids in Products are unique (by definition because it is the primary key), we know that the results will be unique without having to call Distinct().
Now, the above will get the desired results, but it's definitely not the cleanest or simplest way to do it. If the Category entities are defined with a relational property that returns the related record from Products (which would likely be called Product), the simplest way to do what we're trying to do would be the following:
var products = _StoreDB.Categories.Select(c => c.Product).Distinct();
This gets the Product from each Category and returns a distinct collection of them.
If the Category entity doesn't have the Product relational property, then we can go back to using the Join function to get our Products.
var products = _StoreDB.Categories.Join(_StoreDB.Products, c => c.Prod_Id,
p => p.Prod_Id, (c,p) => p).Distinct();
Finally, if we aren't just wanting a simple collection of Products, then some more though would have to go into this and perhaps the simplest thing would be to handle that when iterating through the Products. Another example would be for getting a count for the number of Categories each Product belongs to. If that's the case, I would reverse the logic and start with Products, like so:
var productsWithCount = _StoreDB.Products.Select(p => new { Product = p,
NumberOfCategories = _StoreDB.Categories.Count(c => c.Prod_Id == p.Prod_Id)});
This would result in a collection of anonymous typed objects that reference the Product and the NumberOfCategories related to that Product. If we still needed to exclude any uncatorized Products, we could append .Where(r => r.NumberOfCategories > 0) before the semicolon. Of course, if the Product entity is defined with a relational property for the related Categories, you wouldn't need this because you could just take any Product and do the following:
int NumberOfCategories = product.Categories.Count();
Anyway, sorry for rambling on. I hope this proves helpful to anyone else that runs into a similar issue. ;)

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Include() vs Select() performance - c#

Related

EF Core Single vs. Split Queries

How do I update multiple Entity models in one SQL statement?

Selecting one to many which one is better

Entity Framework - How many trips to the database on DeleteObject

How do I apply the LINQ to SQL Distinct() operator to a List<T>?

Categories

Resources