I am using EF Core 7. It looks like, since EF Core 5, there is now Single vs Split Query execution.
I see that the default configuration still uses the Single Query execution though.
I noticed in my logs it was saying:
'Microsoft.EntityFrameworkCore.Query.MultipleCollectionIncludeWarning':
Compiling a query which loads related collections for more than one
collection navigation, either via 'Include' or through projection, but
no 'QuerySplittingBehavior' has been configured. By default, Entity
Framework will use 'QuerySplittingBehavior.SingleQuery', which can
potentially result in slow query performance.
Then I configured a warning on db context to get more details:
services.AddDbContextPool<TheBestDbContext>(
options => options.UseSqlServer(configuration.GetConnectionString("TheBestDbConnection"))
.ConfigureWarnings(warnings => warnings.Throw(RelationalEventId.MultipleCollectionIncludeWarning))
);
Then I was able to specifically see which call was actually causing that warning.
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.SingleOrDefaultAsync(u => u.Id == userId);
So basically the same code would look like this:
var user = await _userManager.Users
.Include(x => x.UserRoles)
.ThenInclude(x => x.ApplicationRole)
.ThenInclude(x => x.RoleClaims)
.AsSplitQuery() // <===
.SingleOrDefaultAsync(u => u.Id == userId);
with the split query option.
I went through the documentation, but I'm still not sure how to create a pattern out of it.
I would like to set the most common one as a default value across the project, and only use the other for specific scenarios.
Based on the documentation, I have a feeling that "Split" should be used as the default in general, but with caution. I also noticed that their documentation specific to pagination says:
When using split queries with Skip/Take, pay special attention to making your query ordering fully unique; not doing so could cause incorrect data to be returned. For example, if results are ordered only by date, but there can be multiple results with the same date, then each one of the split queries could each get different results from the database. Ordering by both date and ID (or any other unique property or combination of properties) makes the ordering fully unique and avoids this problem. Note that relational databases do not apply any ordering by default, even on the primary key.
which completely makes sense as the query will be split.
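For example, a paginated split query would need a unique tiebreaker in the ordering; a quick sketch (the entity and properties here are just placeholders):
var page = await ctx.Posts
    .OrderBy(p => p.Date)
    .ThenBy(p => p.Id) // unique tiebreaker so every split query sees the same order
    .Skip(20)
    .Take(10)
    .AsSplitQuery()
    .ToListAsync();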
But if we are mainly fetching a single record from the database, regardless of how big or small the include list with its navigation properties is, should I always go with the "Split" approach?
I would love to hear if there are any best practices on that and when to use which approach.
But if we are mainly fetching a single record from the database, regardless of how big or small the include list with its navigation properties is, should I always go with the "Split" approach?
It depends. Let's examine your example with the single-query approach:
var user = await _userManager.Users // 1 record via SingleOrDefault, but TOP(2) is requested from the server
.Include(x => x.UserRoles) // R roles
.ThenInclude(x => x.ApplicationRole) // 1 record
.ThenInclude(x => x.RoleClaims) // C claims
.SingleOrDefaultAsync(u => u.Id == userId);
As a result, RecordCount = 1 * R * 1 * C rows are returned to the client. They are then deduplicated and placed into the appropriate collections.
If RecordCount is reasonably small, a single query can be the best approach.
Also, EF Core adds an ORDER BY to such a query, which may slow down execution, so it is better to examine the execution plan.
Side note: it is better to use FirstOrDefault/FirstOrDefaultAsync; it CAN be a lot faster than SingleOrDefault/SingleOrDefaultAsync when SQL Server fails to detect early that there is no second record in the result set.
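To illustrate the side note, a sketch of what each call asks the SQL Server provider for (the SQL shapes are typical, abbreviated here):
// SingleOrDefaultAsync must check for a duplicate, so two rows are requested:
// SELECT TOP(2) ... WHERE [u].[Id] = @userId
var single = await _userManager.Users.SingleOrDefaultAsync(u => u.Id == userId);
// FirstOrDefaultAsync can stop at the first match:
// SELECT TOP(1) ... WHERE [u].[Id] = @userId
var first = await _userManager.Users.FirstOrDefaultAsync(u => u.Id == userId);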
The documentation at https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries outlines the considerations where split queries can have unintended consequences, particularly around isolation and ordering. As mentioned, when loading a single record with related details, single-query execution is generally preferred. The warning is appearing because you have a one-to-many which contains a one-to-many, so EF is warning that this can potentially lead to a much larger cartesian product in terms of a JOIN-based query. To avoid the warning, since you are confident that the query is reasonable in size, you can specify .AsSingleQuery() explicitly and the warning should disappear.
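If you decide one behavior should be the project-wide default, EF Core also lets you configure it once on the provider and override it per query; a sketch based on the registration from the question:
services.AddDbContextPool<TheBestDbContext>(
    options => options.UseSqlServer(
        configuration.GetConnectionString("TheBestDbConnection"),
        sqlOptions => sqlOptions.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery))
);
// ...then opt back into a single round trip where you know the result set is small:
var user = await _userManager.Users
    .Include(x => x.UserRoles)
    .AsSingleQuery()
    .SingleOrDefaultAsync(u => u.Id == userId);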
When working with object graphs like this, you can consider designing operations against the data state to be as atomic as possible. If you are editing a User that has Roles and Claims, rather than loading everything for a User and attempting to edit the entire graph in memory in one go, you might structure the application to perform actions like "AddRoleToUser", "RemoveRoleFromUser", "AddClaimToUserRole", etc. So instead of loading a User with Roles with Claims, these actions just load Roles for a user, or Claims for a UserRole respectively, to alter this data.
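A rough sketch of that idea (the entity and DbSet names here are illustrative, not taken from your model):
public async Task AddRoleToUser(int userId, int roleId)
{
    // Touch only the join table; no need to load the whole User graph.
    bool alreadyAssigned = await _context.UserRoles
        .AnyAsync(ur => ur.UserId == userId && ur.RoleId == roleId);
    if (!alreadyAssigned)
    {
        _context.UserRoles.Add(new UserRole { UserId = userId, RoleId = roleId });
        await _context.SaveChangesAsync();
    }
}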
After searching through this to figure out whether there is any pattern for applying it, and with all the great content provided below, I was still not sure, since I was looking for "when to use split queries" and "when not to", so I have summarized my understanding here.
I will use the same example that Microsoft shows for Single vs Split Queries:
var blogs = ctx.Blogs
.Include(b => b.Posts)
.Include(b => b.Contributors)
.ToList();
and here is the generated SQL for that:
SELECT [b].[Id], [b].[Name], [p].[Id], [p].[BlogId], [p].[Title], [c].[Id], [c].[BlogId], [c].[FirstName], [c].[LastName]
FROM [Blogs] AS [b]
LEFT JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id], [p].[Id]
Microsoft says:
In this example, since both Posts and Contributors are collection
navigations of Blog - they're at the same level - relational databases
return a cross product: each row from Posts is joined with each row
from Contributors. This means that if a given blog has 10 posts and 10
contributors, the database returns 100 rows for that single blog. This
phenomenon - sometimes called cartesian explosion - can cause huge
amounts of data to unintentionally get transferred to the client,
especially as more sibling JOINs are added to the query; this can be a
major performance issue in database applications.
However, what it doesn't clearly mention is that, beyond the sorting/ordering issues, split queries can easily hurt performance in other ways.
The first concern is that we are going to be hitting the database multiple times in that case.
Let's check this one:
using (var context = new BloggingContext())
{
var blogs = context.Blogs
.Include(blog => blog.Posts)
.AsSplitQuery()
.ToList();
}
And check out the generated SQL when .AsSplitQuery() is used.
SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]
SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
The query above kind of surprised me. It is interesting that with the split option, EF still joins Blogs in the second query, even though that query should only be pulling data from the Posts table. I'm pretty sure the EF Core folks had some idea behind that, but it just doesn't make sense to me. Then what is the point of having that foreign key over there?
It looks like Microsoft was mainly focused on a solution to the cartesian explosion problem, but that obviously doesn't mean "split queries" should be used as the default best practice going forward. Definitely not!
Another possible problem I can think of is data inconsistency: since the queries are run separately, you can't guarantee data consistency (unless everything is completely locked).
I just don't want to throw away the feature, of course. There are still some "good" scenarios for using split queries, imo (unless you are really worried about data consistency), e.g. if we are returning lots of columns with a relation and the size is pretty large, then this could be a real performance factor. Or if the parent data is not a lot, but there are tons of navigation sets, then there is your cartesian explosion.
PS: Note that cartesian explosion does not occur when the two JOINs aren't at the same level.
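To illustrate with the docs' Blog example (Comments here is a hypothetical collection on Post):
// Two collections at the same level: rows = Posts x Contributors per blog (cartesian product)
var explosion = ctx.Blogs
    .Include(b => b.Posts)
    .Include(b => b.Contributors)
    .ToList();
// Collections at different levels: row count grows with the number of leaf Comments, no cross product
var noExplosion = ctx.Blogs
    .Include(b => b.Posts)
    .ThenInclude(p => p.Comments)
    .ToList();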
Last but not least, personally, if I am really going to be pulling a heavy amount of data with a bunch of relations of relations of relations, I would still prefer those "good old" stored procedures. They never get old!
Related
I have an EF Core query like this:
var existingViolations = await _context.Parent
.Where(p => p.ProjectId == projectId)
.Include(p => p.Relation1)
.Include(p => p.Relation2)
.ThenInclude(r => r.Relation21)
.Include(p => p.Relation3)
.AsSplitQuery()
.ToListAsync();
This query usually takes between 55-65 seconds which can sometimes cause database timeouts. All the tables included in the query, including the parent table, contain anywhere from 30k-60k rows and 3-6 columns each. I have tried splitting it up into smaller queries using LoadAsync() like this:
_context.ChangeTracker.LazyLoadingEnabled = false;
_context.ChangeTracker.AutoDetectChangesEnabled = false;
await _context.Relation1.Where(r1 => r1.Parent.ProjectId == projectId).LoadAsync();
await _context.Relation2.Where(r2 => r2.Parent.ProjectId == projectId).Include(r2 => r2.Relation21).LoadAsync();
await _context.Relation3.Where(r3 => r3.Parent.ProjectId == projectId).LoadAsync();
var result = await _context.Parent.Where(p => p.ProjectId == projectId).ToListAsync();
That shaves about 5 seconds off the query time, so nothing to brag about. I've done some timings, and it's the last line (var result = await _context.Parent.Where(p => p.ProjectId == projectId).ToListAsync();) that takes by far the longest to complete, about 90% of the time spent.
How can I optimize this further?
EDIT: Here is the generated SQL query:
SELECT [v].[Id], [v].[Description], [v].[ProjectId], [v].[RuleId], [v].[StateStatus], [v0].[Id], [v0].[ElementId], [v0].[Role], [v0].[ParentId], [t].[Id], [t].[ActivatedDate], [t].[StateStatus], [t].[ParentId], [t].[Id0], [t].[RunId], [t].[SerializedState], [t].[StateId], [p].[Id], [p].[ActualValue], [p].[CurrentValue], [p].[ParameterId], [p].[ParentId]
FROM [Parent] AS [v]
LEFT JOIN [Relation1] AS [v0] ON [v].[Id] = [v0].[ParentId]
LEFT JOIN (
SELECT [s].[Id], [s].[ActivatedDate], [s].[StateStatus], [s].[ParentId], [s0].[Id] AS [Id0], [s0].[RunId], [s0].[SerializedState], [s0].[StateId]
FROM [Relation2] AS [s]
LEFT JOIN [Relation21] AS [s0] ON [s].[Id] = [s0].[StateId]
) AS [t] ON [v].[Id] = [t].[ParentId]
LEFT JOIN [Relation3] AS [p] ON [v].[Id] = [p].[ParentId]
WHERE [v].[ProjectId] = @__projectId_0
ORDER BY [v].[Id], [v0].[Id], [t].[Id], [t].[Id0], [p].[Id]
When running the SQL query directly in the database, it takes about 3-4 seconds to complete, so the problem seems to be with how EF processes the results.
Without seeing the real entities and how they might be configured, it's anyone's guess.
Generally speaking, when looking at performance issues like this, the first thing I would look to tackle is "what is this data being loaded for?" Typically when I see queries using a lot of Includes, it is something like a read operation loaded for a view, or a computation based on the selected data. Projection down to a simpler model can help significantly here if you really only need a few columns from each table. The benefit of projection, using a Select across the related data to fill either a DTO/ViewModel class for a view or an anonymous type for a computation, is that Include will want to pass back all columns for all eager-loaded tables in one go, where projection will only pass back the columns referenced. This can be critically important where tables contain things like large text/binary columns that you don't need at all or right away. It is also very important where the database server might be some distance from the consuming client or web server. Less data over the wire = faster performance, though the issue right now sounds like the DB query itself.
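As a hedged sketch of that projection idea against the query above (the selected members are placeholders; pick only the columns your view or computation actually needs):
var existingViolations = await _context.Parent
    .Where(p => p.ProjectId == projectId)
    .Select(p => new
    {
        p.Id,
        p.StateStatus,
        // pull just the needed columns from each relation, not whole entities
        Relation1Roles = p.Relation1.Select(r1 => new { r1.Id, r1.Role }).ToList(),
        Relation2States = p.Relation2.Select(r2 => new { r2.Id, r2.ActivatedDate }).ToList()
    })
    .ToListAsync();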
The next thing to check would be the relationships between all of the tables and any relevant configuration in EF vs. the table design. Waiting a minute to pull a few records from 30-60k rows is ridiculously long and I would be highly suspect of some flawed relationship mapping that isn't using FKs/indexes. Another place to look would be to run a profiler against the database to capture the exact SQL statement(s) being run, then execute those manually to investigate their execution plan which might reveal schema problems or some weirdness with the entity relationship mapping producing very inefficient queries.
The next thing to check would be to use a process of elimination to see if there is a bad relationship. Eliminate each of the eager-load Include statements one by one and see how long each query scenario takes. If there is a particular Include that is responsible for a drastic slow-down, drill down into that relationship to see why that might be.
That should give you a couple avenues to check. Consider revising your question with the actual entities and any further troubleshooting results.
I have a function in my asp.net core app which updates a bunch of records based on certain criteria I write in a where clause ... I read that ToList() has bad performance, so is there a better and faster way than using ToList and foreach?
This is my current way of doing it; I would appreciate it if someone provides a more efficient way:
public async Task UpdateCatalogOnTenantApproval(int tenantID)
{
var catalogQuery = GetQueryable();
var catalog = await catalogQuery.Where(x => x.IdTenant == tenantID).ToListAsync();
catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });
Context.UpdateRange(catalog);
await Context.SaveChangesAsync();
}
I read that ToList() has bad performance
That is wrong. ToList has as good a performance as you will get: submit a bad query, one which is overly complex and results in bad SQL that SQL Server takes ages to execute, and it is slow.
Also, many people think "ToList" is slow (as in: in the profiler). You see, you start with a db context, take a set of entities there, add some where clauses - all fast. Then you call ToList and it takes "long" (compared to the rest). Well, THAT is when the query is sent to the SQL server ;) Where(x => whatever) takes "no time" because all it does is add some nodes to the expression tree, not execute the query. THAT is what people mostly mix up - deferred execution, which executes only when the results are asked for.
And third, some people like "ToList().Where()" and complain about performance. Filter as much as possible on the DB.
All three reasons are why people think ToList is slow - but all they show is a lack of understanding of how LINQ and SQL operate.
Entity Framework does not handle bulk update operations by default -- hence your existing code. If you really want to do these bulk operations, then you have two options:
1. Write the SQL yourself and use the ExecuteSqlCommand() method to execute it; or
2. Look at 3rd party extensions, such as https://entityframework-extensions.net/
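A minimal sketch of the first option, assuming EF Core 3+ and a Catalogs table whose column names match the entity (both assumptions):
// Parameterizes tenantID automatically and bypasses change tracking entirely.
await Context.Database.ExecuteSqlInterpolatedAsync(
    $"UPDATE Catalogs SET IsApprovedByAdmin = 1, IsActive = 1 WHERE IdTenant = {tenantID}");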
We can reduce the query cost by selecting a subset of the data before attaching it for EF to track, and then updating.
However, it may just be pointless micro-optimization that does not perform significantly better unless you are processing a massive amount of records.
// select the PK for EF to track, plus the 2 fields to be modified
var catalog = await catalogQuery.Where(x => x.IdTenant == tenantID)
    .Select(x => new Catalog { CatalogId = x.CatalogId, IsApprovedByAdmin = x.IsApprovedByAdmin, IsActive = x.IsActive })
    .ToListAsync();
//next we attach range here to let EF track the list
Context.AttachRange(catalog);
//perform your update as usual, this will be flagged as modified if changed
catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });
//save and let EF update based on modified fields.
await Context.SaveChangesAsync();
Let me explain to you what you have done and what you are trying to do.
You are partially right about the performance issues related to ToList and ToListAsync, as they are mainly responsible for loading entities into memory and tracking them.
Based on that, if your request is expected to deal with light data, you are not required to enhance your code. If it is not, however, there are many open approaches, each with its pros and cons, and you have to weigh and balance them for each case where you do not want to use dual app-SQL requests.
Let's be more realistic by talking about your case:
1- We assume that your method is resource-consuming (loading a high volume of data, being intensively called, or both).
2- I see the modification is quite static, updating all of the rows with c.IsApprovedByAdmin = true; c.IsActive = true;
From (1) and (2) I suggest writing a stored procedure or using ExecuteSqlCommand (as Bryan Lewis suggested) to do this for you.
3- Be aware, though, that stored procedures, triggers, and all SQL-based operations are hard to maintain and highly prone to hidden exceptions. In your case, however, you are less likely to fall into that, as your code is quite basic, and you could further reduce the risk by constructing your query from dynamic elements such as nameof(YourClassName).YourProperty (where the class name is the table name) and the like ...
Anyway, this is an example to show that there is no ideal approach; you have to study each case on its own.
Finally, I do not agree with the 3rd-party extensions, as most of the freely provided ones are developed by non-professionals and tracking exceptions caused by them is a nightmare, while the paid versions are too expensive and not zero-exception extensions either. The 3rd-party extensions are more oriented toward complex bulk update/delete and/or huge data.
e.g.
await Context.UpdateAsync(e => new Catalog
{
    Archived = e.LastUpdate > DateTime.UtcNow.AddYears(-99) ? false : true
});
I'm an EF noob (as in I just started today, I've only used other ORMs), and I'm experiencing a baptism of fire.
I've been asked to improve the performance of this query created by another dev:
var questionnaires = await _myContext.Questionnaires
.Include("Sections")
.Include(q => q.QuestionnaireCommonFields)
.Include("Sections.Questions")
.Include("Sections.Questions.Answers")
.Include("Sections.Questions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Where(q => questionnaireIds.Contains(q.Id))
.ToListAsync().ConfigureAwait(false);
A quick web-surf tells me that Include() results in a cols * rows product and poor performance if you run multiple levels deep.
I've seen some helpful answers on SO, but they cover more limited, less complex examples, and I can't figure out the best approach for a rewrite of the above.
The multiple repeats of the part "Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers..." look suspicious to me, like they could be done separately with another query issued afterwards, but I don't know how to build this up or whether such an approach would even improve performance.
Questions:
How do I rewrite this query to something more sensible to improve performance, while ensuring that the eventual result set is the same?
Given the last line: .Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
Why do I need all the intermediate lines? (I guess it's because some of the joins may not be left joins?)
EF Version info: package id="EntityFramework" version="6.2.0" targetFramework="net452"
I realise this question is a bit rubbish, but I'm trying to resolve as fast as I can from a point of no knowledge.
Edit
After mulling over this for half a day and thanks to StuartLC's suggestions I came up with some options:
Poor - split the query so that it performs multiple round-trips to fetch the data. This is likely to provide a slightly slower experience for the user, but will stop the SQL timing out. (This is not much better than just increasing the EF command timeout).
Good - change the clustered indexing on child tables to be clustered by their parent's foreign key (assuming you don't have a lot of insert operations).
Good - change the code to only query the first few levels and lazy-load (separate db hit) anything below this, i.e. remove all but the top few Includes, then change the ICollections - Answers.SubQuestions, Answers.AnswerMetadatas, and Question.Answers - to all be virtual (see the sketch after this list). Presumably the downside to making these virtual is that if any (other) existing code in the app expects those ICollection properties to be eager-loaded, you may have to update that code (i.e. if you want/need them to load immediately within that code). I will be investigating this option further. Further edit - unfortunately this won't work if you need to serialize the response, due to self-referencing loops.
Non-trivial - Write a SQL stored proc/view manually and build a new EF object pointed at it.
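A minimal sketch of the lazy-loading option above; EF6 builds lazy-loading proxies when the collections are virtual (property types assumed from the question's include paths):
public class Answer
{
    public int Id { get; set; }
    public virtual ICollection<SubQuestion> SubQuestions { get; set; }
    public virtual ICollection<AnswerMetadata> AnswerMetadatas { get; set; }
}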
Longer term
The obvious, best, but most time-consuming option - rewrite the app design, so it doesn't need the whole data tree in a single api call, or go with the option below:
Rewrite the app to store the data in a NoSQL fashion (e.g. store the object tree as json so there are no joins). As Stuart mentioned this is not a good option if you need to filter the data in other ways (via something other than the questionnaireId), which you might need to do. Another alternative is to partially store NoSQL-style and partially relational as required.
First up, it must be said that this isn't a trivial query. Seemingly we have:
6 levels of recursion through a nested question-answer tree
A total of 20 tables are joined in this way via eager loaded .Include
I would first take the time to determine where this query is used in your app, and how often it is needed, with particular attention to where it is used most frequently.
YAGNI optimizations
The obvious place to start is to see where the query is used in your app; if you don't need the whole tree all the time, then I suggest you don't join in the nested question and answer tables when they are not needed in all usages of the query.
Also, it is possible to compose an IQueryable dynamically, so if there are multiple use cases for your query (e.g. a "Summary" screen which doesn't need the questions + answers, and a details tree which does need them), then you can do something like:
var questionnaireQuery = _myContext.Questionnaires
.Include(q => q.Sections)
.Include(q => q.QuestionnaireCommonFields);
// Conditionally extend the joins
if (mustIncludeQandA)
{
questionnaireQuery = questionnaireQuery
.Include(q => q.Sections.Select(s => s.Questions.Select(qn => qn.Answers..... etc);
}
// Execute + materialize the query
var questionnaires = await questionnaireQuery
.Where(q => questionnaireIds.Contains(q.Id))
.ToListAsync()
.ConfigureAwait(false);
SQL Optimizations
If you really have to fetch the whole tree all the time, then look at your SQL table design and indexing.
1) Filters
.Where(q => questionnaireIds.Contains(q.Id))
(I'm assuming SQL Server terminology here, but the concepts are applicable in most other RDBMs as well.)
I'm guessing Questionnaires.Id is a clustered primary key, so it will be indexed, but just check for sanity (it will look something like PK_Questionnaires CLUSTERED UNIQUE PRIMARY KEY in SSMS).
2) Ensure all child tables have indexes on their foreign keys back to the parent.
e.g. q => q.Sections means that table Sections has a foreign key back to Questionnaires.Id - make sure this has at least a non-clustered index on it - EF Code First should do this automagically, but again, check to be sure.
This would look like IX_QuestionnaireId NONCLUSTERED on column Sections(QuestionnaireId) - see the sketch after point 3.
3) Consider changing the clustered indexing on child tables to be clustered by their parent's foreign key, e.g. cluster Questions by Questions.SectionId. This will keep all child rows related to the same parent together, and reduce the number of pages of data that SQL needs to fetch. It isn't trivial to achieve in EF Code First, but your DBA can assist you in doing this, perhaps as a custom step.
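For point 2, a sketch of making the FK index explicit via the fluent API (this assumes EF 6.2+, where HasIndex was introduced, and a QuestionnaireId FK property on Section):
protected override void OnModelCreating(DbModelBuilder modelBuilder)
{
    // Non-clustered index on the FK back to the parent.
    modelBuilder.Entity<Section>()
        .HasIndex(s => s.QuestionnaireId);
}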
Other comments
If this query is only used to query data, not to update or delete, then adding .AsNoTracking() will marginally reduce EF's memory consumption and improve its in-memory performance.
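For example, a read-only sketch against the question's query:
var questionnaires = await _myContext.Questionnaires
    .AsNoTracking() // read-only: skips change-tracking overhead
    .Include(q => q.QuestionnaireCommonFields)
    .Where(q => questionnaireIds.Contains(q.Id))
    .ToListAsync()
    .ConfigureAwait(false);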
Unrelated to performance, but you've mixed the weakly typed ("Sections") and strongly typed .Include statements (q => q.QuestionnaireCommonFields). I would suggest moving to the strongly typed includes for the additional compile time safety.
Note that you only need to specify the include path for the longest chain(s) which are eager loaded - this will obviously force EF to include all higher levels too. i.e. You can reduce the 20 .Include statements to just 2. This will do the same job more efficiently:
.Include(q => q.QuestionnaireCommonFields)
.Include(q => q.Sections.Select(s => s.Questions.Select(qn => qn.Answers .... etc))
You'll need .Select any time there is a 1:Many relationship, but if the navigation is 1:1 (or N:1) then you don't need the .Select, e.g. City c => c.Country
Redesign
Last but not least, if data is only ever filtered from the top level (i.e. Questionnaires), and if the whole questionnaire 'tree' (Aggregate Root) is typically always added or updated all at once, then you might try to approach the data modelling of the question and answer tree in a NoSQL way, e.g. by simply modelling the whole tree as XML or JSON, and then treating the whole tree as a long string. This will avoid all the nasty joins altogether. You would need a custom deserialization step in your data tier. This latter approach won't be very useful if you need to filter on nodes in the tree (i.e. a query like "find me all questionnaires where the SubAnswer to Question 5 is 'Foo'" won't be a good fit).
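A rough sketch of that redesign (the type and property names here are hypothetical):
public class QuestionnaireDocument
{
    public int QuestionnaireId { get; set; }
    // The whole Sections/Questions/Answers tree serialized as one string;
    // deserialized in the data tier instead of JOINed in SQL.
    public string TreeJson { get; set; }
}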
I have the following:
var objectives = _objectivesRepository
.GetAll()
.Where(o => o.ExamId == examId || examId == 0)
.Include(o => o.ObjectiveDetails)
.ToList();
In a previous post, one of the users said that it was important to put the Where before the Include in a LINQ query.
Can someone let me know if this is correct? Does order matter? How about if there are many Wheres and Includes?
In Entity Framework yes it does matter, but only in certain scenarios. When using groupings or projections, it will fail to include the requested data.
See this blog post on the subject.
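For example (a sketch with hypothetical entities): when a query ends in a projection, the Include is silently discarded, because the final Select alone defines what is fetched:
var blogNames = ctx.Blogs
    .Include(b => b.Posts) // has no effect here
    .Select(b => b.Name)
    .ToList();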
The actual answer is that usually the order does not matter significantly. Following your example statement, I will describe the logical translation steps into a relational query:
Get all objects, with all their properties (in relational algebra the properties are considered attributes)
Restrict the retrieved rows based on your condition (the relational algebra selection operation)
Restrict the attributes of the retrieved rows to those which are eagerly loaded (the relational algebra projection operation)
In your specific query, steps 2 and 3 are interchangeable without altering the final outcome. As stated here, this is the default case. Nevertheless, even if the final outcome would not change, the performance could be significantly affected. This is the reason modern databases have query optimizers which create an execution plan to optimize the specific query.
Nevertheless, this is not always the case, so I suppose you could always find a case where the above does not apply. Regarding performance, no assumptions are safe; you should always measure. You can use the SQL Server Profiler to see the translation of your LINQ to Entities query into the final SQL query, and then use the SQL Server tools (like the query analyzer) to inspect its execution plan.
Hope I helped!
Is it the correct behaviour of Entity Framework to load all items with the given foreign key for a navigation property before querying/filtering?
For example:
myUser.Apples.First(a => a.Id == 1 && !a.Expires.HasValue);
Will load all apples associated with that user. (The SQL query doesn't filter on the Id or Expires fields.)
There are two other ways of doing it (which generate the correct SQL) but neither as clean as using the navigation properties:
myDbContext.Entry(myUser).Collection(u => u.Apples).Query().First(a => a.Id == 1 && !a.Expires.HasValue);
myDbContext.Apples.First(a => a.UserId == myUser.Id && a.Id == 1 && !a.Expires.HasValue);
Things I've Checked
Lazy load is enabled and is not disabled anywhere.
The navigation properties are virtual.
EDIT:
OK, based on your edit I think I had the wrong idea about what you were asking (it makes a lot more sense now). I'll leave the previous answer around, as I think it's probably a useful explanation, but it is much less relevant to your specific question as it stands.
From what you've posted your user object is enabled for lazy loading. EF enables lazy loading by default, however there is one requirement to lazy loading which is to mark navigation properties as virtual (which you have done).
Lazy loading works by attaching to the get method of a navigation property and performing a SQL query at that point to retrieve the foreign entity. Navigation properties are also not queryable collections, which means that when you execute the get method, your query is executed immediately.
In your example above, the apples collection on User is enumerated before you execute the .First call (which then runs as plain old LINQ to Objects). This means SQL will return all of the apples associated with the user and filter them in memory on the querying machine (as you have observed). It also means you need two queries to pull down the apples you are interested in (one for the user and one for the navigation property), which may not be efficient for you if all you want is apples.
A perhaps better way of doing this is to keep the whole expression as a query for as long as possible. An example of this would be something like the following:
myDbContext.Users
    .Where(u => u.Id == userId)
    .SelectMany(u => u.Apples)
    .Where(a => a.Id == 1 && !a.Expires.HasValue);
this should execute as a single SQL statement and only pull down the apples you care about.
HTH
OK, from what I can understand of your question, you are asking why EF appears to allow you to use navigation properties in a query even though they may be null in the result set.
In answer to your question, yes, this is expected behavior. Here's why:
When you write a query it is translated into SQL, for example something like
myDbContext.Apples.Where(a=>a.IsRed)
will turn into something like
Select * from Apples
where [IsRed] = 1
similarly something like the following will also be translated directly to SQL
myDbContext.Apples.Where(a=>a.Tree.Height > 100)
will turn into something like
Select a.* from Apples as a
inner join Tree as t on a.TreeId = t.Id
where t.Height > 100
However, it's a bit of a different story when we actually pull down the result sets.
To avoid pulling down too much data and making it slow, EF offers several mechanisms for specifying what comes back in the result set. One is lazy loading (which incidentally needs to be used carefully if you want to avoid performance issues) and the second is the Include syntax. These methods restrict what we pull back so that queries are quick and don't consume unneeded resources.
For example in the above you will note that only Apple fields are returned.
If we were to add an include to that as below you could get a different result:
myDbContext.Apples.Include(a=>a.Tree).Where(a=>a.Tree.Height > 100)
will translate to SQL similar to:
Select a.*, t.* from Apples as a
inner join Tree as t on a.TreeId = t.Id
where t.Height > 100
In your example above (which I'm fairly sure isn't syntactically correct, as myContext.Users should be a collection and therefore shouldn't have an .Apples) you are creating a query, therefore all variables are available. When you enumerate that query, you have to be explicit about what's returned.
For more details on navigation properties and how they work (and the .Include syntax) check out my blog: http://blog.staticvoid.co.nz/2012/07/entity-framework-navigation-property.html