We are having a problem navigating between entities, one of which is based on a view.
The problem is when we go
TableEntity.ViewEntity.Where(x => x.Id == Id).FirstOrDefault()
In the background it is loading all records in the view which is not what we want or expect.
However when we go
_objectContext.TableEntityView
.Where(x => x.TableObjectId == TableObjectId && x.Id == Id)
Then it just loads up the one row which is what we are expecting
In short, using the navigation properties causes a massive data load – it's as if the query is being materialised early.
We are using EF 4 with a SQL Server 2005 database. The views are used to provide aggregated information which EF couldn't easily produce without big data loads (ironically). We have manually constructed 1:many associations between the views.
Why then do we get the large data load in the first instance but not the second?
Many thanks for all/any help
That's how navigation collections work in EF: accessing the collection loads all entities, and any LINQ queries you run thereafter simply query against the objects in memory. I don't think there's anything you can do about it short of a custom query like you've already done.
FWIW I'm told NHibernate supports more fine-grained navigation loads, but that feature has yet to make its way into Entity Framework.
EDIT
This answer from Ladislav Mrnka shows a possible solution to your problem from the CTP days. Not sure if anything has changed since then. It uses the DbContext, so you still won't be able to just plow through the navigation property, but it's probably as close as you're going to get.
int count = context.Entry(myAccount)
                   .Collection(a => a.Orders).Query().Count();
or for your case, I'm guessing it would be
TableEntityView obj = context.Entry(TableEntity)
                             .Collection(a => a.ViewEntity)
                             .Query().FirstOrDefault(x => x.Id == Id);
I've had some issues with the way that Entity Framework generates SQL, so first of all I suggest that you use LinqPad together with one or more of the following: Entity Framework Profiler (paid-for software), SQL Profiler (assuming you are using SQL Server) and/or the EFTracingProvider.
I've had issues where the same table was inner-joined several times in some queries, so having those tools generally helps find out what's causing the issue.
The things that I've tried that have often made some queries run faster:
Writing a full LINQ query rather than using lambda expressions: they are often easier to read, they look a lot more like SQL, and it's easier to see the relationship between your code and the generated SQL (a sketch comparing the two styles follows below)
and
EntitySet.Include(x => x.Property)
This tells LINQ-to-Entities to include the property in the query.
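For illustration, a minimal sketch of the two query styles side by side (the Orders set and customerId here are hypothetical); both produce the same SQL:

var lambdaStyle = context.Orders.Where(o => o.CustomerId == customerId);

var queryStyle = from o in context.Orders
                 where o.CustomerId == customerId
                 select o;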
As part of a small .NET Core 3 project, I'm trying to use the data model based in Entity Framework, but I'm having some troubles related with queries on joined tables.
When looking for data matching a condition in a single table, the model is easy to understand
List<Element> listOfElements = context.Elements.Where(predicate).ToList();
However, when this query requires joined tables, I'm not sure how to do it efficiently. After some investigation, it seems that the Include (and ThenInclude) methods are the way to go, but I have the impression that the Where clause after the Include is not executed at the DB level but after all the data has been retrieved. This might work with small datasets, but I don't think it's a good idea for a production system with millions of rows.
List<Element> listOfElements = context.Elements.Include(x => x.SubElement)
                                               .Where(predicate)
                                               .ToList();
I've seen some examples using EF+ library, but I'm looking for a solution using the nominal EF Core. Is there any clean/elegant way to do it?
Thank you.
There are a few scenarios in which the data from the DB is populated:
Deferred query execution: this is when you try to access your query results, for example in a foreach statement.
Immediate query execution: this is when you call ToList() (or a conversion to another collection, like ToArray()).
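A minimal sketch of the difference, using the Elements set from the question:

var query = context.Elements.Where(predicate);  // deferred: no SQL has been executed yet

foreach (var element in query)                  // the SQL executes here, on first enumeration
{
    // ...
}

List<Element> list = context.Elements.Where(predicate).ToList(); // immediate: the SQL executes now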
I think the answer to your question:
... but I have the impression that the Where clause after the include is not executed at DB level but after all the data has been retrieved.
is that your assumption is wrong: because you are calling ToList() at the end, not before the Where method, the whole query (Include and Where) is translated into a single SQL statement.
For more information please also check here.
I have another suggestion as well: to be sure about what is exactly executing at the DB level, run SQL Server Profiler while executing your query.
Hope this will help ))
I'm currently working on a WPF application which was built using Entity Framework (database first) to access data in a SQL Server database.
In the past, the database was on an internal server and I did not notice any problem regarding the performance of the application, even though the database is very badly implemented (only tables; no views, indexes or stored procedures). I'm the one who created it, but it was my first job and I was not very good with databases, so I felt like Entity Framework was the best approach to focus mainly on code.
However, the database is now on another server which is waaay slower. As you can guess, the application now has big performance issues (more than 10 seconds to load a dozen rows, the same to insert new rows, ...).
Should I stay with Entity Framework but try to improve performance by altering the database, adding views and stored procedures?
Should I get rid of Entity Framework and use only "basic" code (and improve the database at the same time)?
Is there a simple ORM I could use instead of EF?
Time is not an issue here; I can use all the time I want to improve the application, but I can't seem to decide on the best way to make my application evolve.
The database is quite simple (around 10 tables); the only thing that could complicate things is that I store files in there. So I'm not sure I can really use whatever I want. And I don't know if it's important, but I need to display quite a few calculated fields. Any advice?
Feel free to ask any relevant questions.
For performance profiling, the first place I recommend looking is an SQL profiler. This can capture the exact SQL statements that EF is running, and help identify possible performance culprits. I cover a few of these here. The Schema issues are probably the most relevant place to start. The title targets MVC, but most of the items relate to WPF and any application.
A good, simple profiler that I use for SQL Server is ExpressProfiler. (https://github.com/OleksiiKovalov/expressprofiler)
With the move to a new server, and it now sending the data over the wire rather than pulling from a local database, the performance issues you're noticing will most likely fall under the category of "loading too much, too often". Now you won't only be waiting for the database to load the data, but also for it to package it up and send it over the wire. Also, does the new database represent the same data volume and serve only a single client, or does it now serve multiple clients? Another catch for developers is "works on my machine", where local testing databases are smaller and don't deal with concurrent queries from the server (where locks and such can impact performance).
From here, run a copy of the application with an isolated database server (no other clients hitting it to reduce "noise") with the profiler running against it. The things to look out for:
Lazy Loading - These are cases where you have queries to load data, but then see lots (dozens to hundreds) of additional queries being spun off. Your code may say "run this query and populate this data", which you expect to be 1 SQL query, but by touching lazy-loaded properties it can spin off a great many other queries.
The solution to lazy loading: if you need the extra data, eager-load it with .Include(). If you only need some of the data, look into using .Select() to select view models / DTOs of the data you need rather than relying on complete entities. This will eliminate lazy-load scenarios, but may require some significant changes to your code to work with view models/DTOs. Tools like AutoMapper can help greatly here. Read up on .ProjectTo() to see how AutoMapper can work with IQueryable to eliminate lazy-load hits.
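A hedged sketch of that projection approach (OrderDto and the entity names here are hypothetical):

var orders = context.Orders
    .Where(o => o.CustomerId == customerId)
    .Select(o => new OrderDto
    {
        Id = o.Id,
        CustomerName = o.Customer.Name,  // becomes a JOIN in the generated SQL, not a lazy load
        ItemCount = o.OrderItems.Count() // becomes a COUNT subquery
    })
    .ToList(); // a single SQL statement; no lazy loads can fire afterwards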
Reading too much - Loading entities can be expensive, especially if you don't need all of that data. Culprits for performance include excessive use of .ToList(), which will materialize entire entity sets where a subset of the data is needed, or where a simple exists check or count would suffice. For example, I've seen code that does stuff like this:
var data = context.MyObjects.SingleOrDefault(x => x.IsActive && x.Id == someId);
return (data != null);
This should be:
var isData = context.MyObjects.Where(x => x.IsActive && x.Id == someId).Any();
return isData;
The difference between the two is that in the first example, EF will effectively do a SELECT * operation, so in the case where data is present it will return back all columns into an entity, only to later check if the entity was present. The second statement will run a faster query to simply return back whether a row exists or not.
Another culprit to watch for is materializing entities too early and only then projecting:

var myDtos = context.MyObjects.Where(x => x.IsActive && x.ParentId == parentId)
    .ToList()
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = calculateBalance(x.OrderItems.ToList()),
        Children = x.Children.ToList()
            .Select(c => new ChildDto
            {
                Id = c.Id,
                Name = c.Name
            }).ToList()
    }).ToList();
Statements like this can go on and get rather complex, but the real problem is the .ToList() before the .Select(). Often these creep in because devs try to do something that EF doesn't understand, like calling a method (i.e. calculateBalance()), and it "works" by first calling .ToList(). The problem here is that you are materializing the entire entity at that point and switching to Linq2Objects. This means that any "touches" on related data, such as .Children, will now trigger lazy loads, and further .ToList() calls can saturate more data into memory which might otherwise be reduced in a query. The culprit to look out for is .ToList() calls; try removing them. Select simpler values before calling .ToList(), and then feed that data into view models where the view models can calculate the resulting data.
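For contrast, a hedged sketch of the same query reshaped so that a single SQL statement runs first and all computation happens afterwards in memory (it reuses ObjectDto, ChildDto and calculateBalance from the snippet above and assumes they work from the projected values):

var myDtos = context.MyObjects
    .Where(x => x.IsActive && x.ParentId == parentId)
    .Select(x => new
    {
        x.Id,
        x.FirstName,
        x.LastName,
        OrderItems = x.OrderItems.ToList(),
        Children = x.Children.Select(c => new { c.Id, c.Name }).ToList()
    })
    .ToList() // one SQL query; everything below runs in memory, with no lazy loads
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = calculateBalance(x.OrderItems),
        Children = x.Children.Select(c => new ChildDto { Id = c.Id, Name = c.Name }).ToList()
    })
    .ToList();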
The worst culprit like this I've seen was due to a developer wanting to use a function in a Where clause:
var data = context.MyObjects.ToList().Where(x => calculateBalance(x) > 0).ToList();
That first ToList() statement will attempt to saturate the whole table into entities in memory. A big performance impact beyond just the time/memory/bandwidth needed to load all of this data is the number of locks the database must take to reliably read/write data. The fewer rows you "touch" and the shorter you touch them for, the nicer your queries will play with concurrent operations from multiple clients. These problems magnify greatly as systems transition to being used by more users.
Provided you've eliminated extra lazy loads and unnecessary queries, the next thing to look at is query performance. For operations that seem slow, copy the SQL statement out of the profiler and run that in the database while reviewing the execution plan. This can provide hints about indexes you can add to speed up queries. Again, using .Select() can greatly increase query performance by using indexes more efficiently and reducing the amount of data the server needs to pull back.
For file storage: Are these stored as columns in a relevant table, or in a separate table that is linked to the relevant record? What I mean by this, is if you have an Invoice record, and also have a copy of an invoice file saved in the database, is it:
Invoices
    InvoiceId
    InvoiceNumber
    ...
    InvoiceFileData

or

Invoices
    InvoiceId
    InvoiceNumber
    ...

InvoiceFile
    InvoiceId
    InvoiceFileData
It is a better structure to keep large, seldom used data in separate tables rather than combined with commonly used data. This keeps queries to load entities small and fast, where that expensive data can be pulled up on-demand when needed.
If you are using GUIDs for keys (as opposed to ints/longs), are you leveraging newsequentialid()? (assuming SQL Server) Keys set to use newid(), or Guid.NewGuid() in code, will lead to index fragmentation and poor performance. If you populate the IDs via database defaults, switch them over to newsequentialid() to help reduce the fragmentation. If you populate IDs via code, have a look at writing a Guid generator that mimics newsequentialid() (SQL Server) or a pattern suited to your database. SQL Server and Oracle store/index GUID values differently, so placing the "static-like" part of the UUID bytes in the higher-order vs. lower-order bytes of the data will aid indexing performance. Also consider index maintenance and other database maintenance jobs to help keep the database server running efficiently.
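If you generate IDs in code, a hedged sketch of a COMB-style generator for SQL Server (it mimics the spirit of newsequentialid() rather than its exact output, and assumes SQL Server's uniqueidentifier ordering, which sorts on the last 6 bytes first):

static Guid NewSequentialGuid()
{
    byte[] guidBytes = Guid.NewGuid().ToByteArray();

    // Millisecond-resolution timestamp, converted to big-endian so it sorts chronologically.
    long timestamp = DateTime.UtcNow.Ticks / TimeSpan.TicksPerMillisecond;
    byte[] timestampBytes = BitConverter.GetBytes(timestamp);
    if (BitConverter.IsLittleEndian)
        Array.Reverse(timestampBytes);

    // SQL Server sorts uniqueidentifier values by the last 6 bytes first,
    // so the timestamp goes there to keep inserts roughly sequential.
    Array.Copy(timestampBytes, 2, guidBytes, 10, 6);
    return new Guid(guidBytes);
}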
When it comes to index tuning, database server reports are your friends. After you've eliminated most, or at least some, serious performance offenders from your code, the next thing is to look at real-world use of your system. The best way to learn where to target your code/index investigations is the most-used and most problematic queries that the database server identifies. Where these are EF queries, you can usually reverse-engineer which EF query is responsible based on the tables being hit. Grab these queries and feed them through the execution plan to see if there is an index that might help matters. Indexing is something that developers either forget or get prematurely concerned about. Too many indexes can be just as bad as too few. I find it's best to monitor real-world usage before deciding on what indexes to add.
This should hopefully give you a start on things to look for and kick the speed of that system up a notch. :)
First you need to run a performance profiler and find out what the bottleneck is; it could be the database, the Entity Framework configuration, the Entity Framework queries, and so on.
In my experience, Entity Framework is a good option for this kind of application, but you need to understand how it works.
Also, which version of Entity Framework are you using? The latest version is 6.2 and has some performance improvements that older ones do not have, so if you are using an old one I suggest updating it.
Based on the comments I am going to hazard a guess that it is mostly a bandwidth issue.
You had an application that was working fine when it was co-located, perhaps a single switch, gigabit ethernet and 200m of cabling.
Now that application is trying to send or retrieve data to/from a remote server, probably over the public internet through an unknown number of internal proxies in contention with who knows what other traffic, and it doesn't perform as well.
You also mention that you store files in the database, and your schema has fields like Attachment.data and Doc.file_content. This suggests that you could be trying to transmit large quantities (perhaps megabytes) of data for a simple query and that is where you are falling down.
Some general pointers:
Add indexes anywhere you are joining tables or on values you commonly query.
Be aware of the difference between lazy and eager loading in Entity Framework. There is no right or wrong answer, but you should know which approach you are using and why.
Split any file content into its own table, with the same primary key as the main table, or play with different EF classes to make sure you only retrieve files when you need to use them (a sketch of such a split follows below).
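A hedged sketch of that file-content split as EF classes (the Document and DocumentContent names are hypothetical):

public class Document
{
    public int Id { get; set; }
    public string Name { get; set; }

    // Lazily loaded 1:1 - the file bytes are only fetched when this is accessed.
    public virtual DocumentContent Content { get; set; }
}

public class DocumentContent
{
    public int Id { get; set; } // shares its primary key with Document
    public byte[] Data { get; set; }
}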
It's more of a technical (behind the scenes of EF) kind of question, to get a better understanding of Include for myself.
Does it make the query faster to Include another table when using a Select statement at the end?
ctx.tableOne.Include("tableTwo")
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(res => new {
        res.Value1,
        TwoValue1 = res.tableTwo.Value1, // explicit names avoid an anonymous-type member clash
        TwoValue2 = res.tableTwo.Value2,
        TwoValue3 = res.tableTwo.Value3,
        TwoValue4 = res.tableTwo.Value4
    });
Might it depend on the number of values included from the other table?
In the example above, 4 out of 5 values are from the included table. I wonder whether that has any performance impact, good or bad.
So, my question is: what is EF doing behind the scenes and is there any preferred way to use Include when knowing all the values I will select before?
In your case it doesn't matter whether you use Include(<relation-property-name>) or not, because you don't materialize the values before the Select(<mapping-expression>). If you use SQL Server Profiler (or another profiler) you can see that EF generates exactly the same query for both.
The reason for this is that the data is not materialized in memory before the Select – you are working on an IQueryable, which means EF will generate a SQL query at the end (before calling First(), Single(), FirstOrDefault(), SingleOrDefault(), ToList(), or using the collection in a foreach statement). If you were to use ToList() before the Select(), it would materialize the entities from the database into memory, and that is where Include() comes in handy to avoid making N+1 queries when accessing navigation properties to other tables.
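A minimal sketch of the two variants using the names from the question; profiling both should show identical SQL, because the projection is composed into the query before anything is materialized:

var withInclude = ctx.tableOne
    .Include("tableTwo")
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(res => new { res.Value1, TwoValue1 = res.tableTwo.Value1 });

var withoutInclude = ctx.tableOne
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(res => new { res.Value1, TwoValue1 = res.tableTwo.Value1 });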
It is about how you want EF to load your data. If you want a table's data to be pre-populated, use Include. It is handier when the included table is going to be used frequently, though it will be a little slower up front because EF has to load all the relevant data beforehand. Read up on the difference between lazy and eager loading: using Include gives you eager loading, where the data is pre-populated, while with lazy loading EF sends a call to the secondary table only when the navigation property is actually used.
I agree with @Karamfilov's general discussion, but in your example your query might not be the most performant. Performance can be affected by many factors, such as the indexes present on the table, but you must always help EF with the generation of SQL. The Include method can produce SQL that includes all columns of the table, so you should always check what SQL is generated and verify whether you can obtain a better query using a Join.
This article explains the techniques that can be used and what impact they have on performance: https://msdn.microsoft.com/it-it/library/bb896272(v=vs.110).aspx
I am trying to populate a dictionary object from a sales table that has 6 million records, because the original LINQ query times out. I am joining 3 or 4 tables to the sales table and it is just too much, and it times out. I thought I could get the sales table into memory to prevent the timeout from SQL Server.
My code is:
var sales = dc.Sales
.Where(c => c.Active == true)
.Select(s => s)
.ToDictionary(s => s.Id, s => s);
Does anyone know how to get my dictionary object to populate without raising an out of memory exception?
This question hints at a serious misunderstanding of how database querying works. Usually, when a query is timing out, you have to limit the data you are pulling out. If you really have to pull this much data (which I think is an error in the business requirements), then you should pull it in batches (see the sketch below).
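A hedged sketch of batching with Skip/Take, assuming the entity type is Sale with an int key (the batch size is arbitrary; dc.Sales comes from the question):

const int batchSize = 10000;
var sales = new Dictionary<int, Sale>();

for (int skip = 0; ; skip += batchSize)
{
    var batch = dc.Sales
        .Where(s => s.Active)
        .OrderBy(s => s.Id)  // paging requires a stable order
        .Skip(skip)
        .Take(batchSize)
        .ToList();

    if (batch.Count == 0)
        break;

    foreach (var s in batch)
        sales[s.Id] = s;
}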
we normally use linq to sql
Linq-to-SQL has been deprecated for years now. It is practically incompetent at creating usable SQL queries, which is probably why you are getting the timeouts. I would consider LINQ-to-SQL not only a bad choice but an outright risk – a bomb in your software that is ready to explode. You should either move to Entity Framework with LINQ-to-Entities or use hand-crafted SQL. This hints at a second problem:
we load data into dictionaries to speed up the application and just join the dictionaries together and extract the data
I have never heard of anyone who would use this approach or consider it acceptable. SQL does this easily and much faster. If your team really thinks this is the way to do things, then serious study of SQL is required.
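For illustration, a hedged sketch of letting the server do the join instead of joining dictionaries in memory (the Customer and Product navigation properties are hypothetical):

var results = dc.Sales
    .Where(s => s.Active)
    .Select(s => new
    {
        s.Id,
        CustomerName = s.Customer.Name,  // becomes a JOIN in the generated SQL
        ProductName = s.Product.Name
    })
    .ToList();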
Thanks everyone for the feedback. I found out what was causing the out of memory exception. I decided not to use a dictionary because it was just not going to work. I went to SQL Server, replicated my LINQ query, ran it, and only got 30,000 records back. So I knew that the LINQ query was not causing the out of memory. My original query was just a simple lookup to the database with a few .Where() clauses. It turns out that I was doing a .Distinct() on the collection that was populated with the 30,000 records, and this was causing the out of memory exception. I removed the .Distinct() call from the LINQ query and that fixed it. The out of memory exception makes sense, because it was trying to handle 30,000 records in the collection and distinct them, and it was just too much to compare.
I would suggest using ctx.Configuration.LazyLoadingEnabled = true; to enable lazy loading; then, when the data is required, the needed entities will be loaded on their own. The loaded entities can then remain in the context and be reused if you do not close the context and open it again.
I have tried the program with 100,000 entities, and using lazy loading allows the required object to be opened instantly.
When querying a large table where you need to access the navigation properties later on in code (I explicitly don't want to use lazy loading), what will perform better: .Include() or .Load()? And why use one over the other?
In this example the included tables all have only about 10 entries each, and Employees has about 200 entries; it can happen that most of those will be loaded anyway with Include because they match the Where clause.
Context.Measurements.Include(m => m.Product)
                    .Include(m => m.ProductVersion)
                    .Include(m => m.Line)
                    .Include(m => m.MeasureEmployee)
                    .Include(m => m.MeasurementType)
                    .Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
                    .ToList();
or
Context.Products.Load();
Context.ProductVersions.Load();
Context.Lines.Load();
Context.Employees.Load();
Context.MeasurementType.Load();
Context.Measurements.Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
                    .ToList();
It depends, try both
When using Include(), you get the benefit of loading all of your data in a single call to the underlying data store. If this is a remote SQL Server, for example, that can be a major performance boost.
The downside is that Include() queries tend to get really complicated, especially if you have any filters (Where() calls, for example) or try to do any grouping. EF will generate very heavily nested queries using sub-SELECT and APPLY statements to get the data you want. It is also much less efficient -- you get back a single row of data with every possible child-object column in it, so data for your top-level objects will be repeated a lot of times. (For example, a single parent object with 10 children will produce 10 rows, each with the same data for the parent object's columns.) I've had single EF queries get so complex they caused deadlocks when running at the same time as EF update logic.
The Load() method is much simpler. Each query is a single, easy, straightforward SELECT statement against a single table. These are much easier in every possible way, except you have to do many of them (possibly many times more). If you have nested collections of collections, you may even need to loop through your top level objects and Load their sub-objects. It can get out of hand.
Quick rule-of-thumb
Try to avoid having any more than three Include calls in a single query. I find that EF's queries get too ugly to recognize beyond that; it also matches my rule-of-thumb for SQL Server queries, that up to four JOIN statements in a single query works very well, but after that it's time to consider refactoring.
However, all of that is only a starting point.
It depends on your schema, your environment, your data, and many other factors.
In the end, you will just need to try it out each way.
Pick a reasonable "default" pattern to use, see if it's good enough, and if not, optimize to taste.
Include() will be written into the SQL as a JOIN: one database roundtrip.
Each Load() instruction "explicitly loads" the requested entities, so one database roundtrip per call.
Thus Include() will most probably be the more sensible choice in this case, but it depends on the database layout, how often this code is called and how long your DbContext lives. Why don't you try both ways and profile the queries and compare the timings?
See Loading Related Entities.
I agree with @MichaelEdenfield in his answer, but I did want to comment on the nested-collections scenario. You can get around having to do nested loops (and the many resulting calls to the database) by turning the query inside out.
Rather than looping down through a Customer's Orders collection and then performing another nested loop through each Order's OrderItems collection, say, you can query the OrderItems directly with a filter such as the following.
context.OrderItems.Where(x => x.Order.CustomerId == customerId);
You will get the same resulting data as the Loads within nested loops but with just a single call to the database.
Also, there is a special case that should be considered with Includes: if the relationship between the parent and the child is one-to-one, then the problem of the parent data being returned multiple times would not be an issue.
I am not sure what the effect would be if the majority case was one where no child existed - lots of nulls? Sparse children in a one-to-one relationship might be better suited to the direct query technique that I outlined above.
Include is an example of eager loading, whereby you load not only the entities you are querying for but also all related entities.
Load() is a manual override of lazy loading: even if lazy loading is disabled on the context, you can still explicitly load the related entities you ask for with .Load() (see the sketch below).
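A minimal sketch of that explicit loading, using the EF6-style DbContext API (the Order/OrderItems names are hypothetical):

context.Configuration.LazyLoadingEnabled = false;

var order = context.Orders.First();

// One extra roundtrip that loads the collection even though lazy loading is off.
context.Entry(order).Collection(o => o.OrderItems).Load();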
It's always hard to decide whether to go with Eager, Explicit or even Lazy Loading.
What I would recommend anyway is always to perform some profiling. That's the only way to be sure your request will be performant or not.
There are a lot of tools that will help you out. Have a look at this article from Julie Lerman, where she lists several different ways of doing profiling. One simple solution is to start profiling in SQL Server Management Studio.
Do not hesitate to talk with a DBA (if you have one near you) who will help you understand the execution plan.
You could also have a look at this presentation, where I wrote a section about loading data and performance.
One more thing to add to this thread: it depends on which database server you use. If you are working with SQL Server it's fine to use eager loading, but with SQLite you will have to use .Load() to avoid a cross-loading exception, because SQLite cannot deal with some Include statements that go deeper than one dependency level.
Updated answer: As of EF Core 5.0 you can use AsSplitQuery().
This is particularly useful, and I personally use it all the time, when I have many joins which would result in a possible cartesian explosion, or would just take more time to complete.
As the name implies, EF will execute separate queries for each Entity instead of using joins.
So, where you would use Explicit loading, you can now use Eager loading with split queries to achieve the same result, and it's definitely more readable imo.
See https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries
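Applied to the query from the question, a minimal sketch (EF Core 5+):

var measurements = Context.Measurements
    .Include(m => m.Product)
    .Include(m => m.ProductVersion)
    .Include(m => m.Line)
    .Include(m => m.MeasureEmployee)
    .Include(m => m.MeasurementType)
    .AsSplitQuery() // separate SQL queries instead of one multi-join statement
    .Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
    .ToList();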