When querying a large table where you need to access the navigation properties later on in code (I explicitly don't want to use lazy loading), which will perform better: .Include() or .Load()? And why use one over the other?
In this example the included tables all have only about 10 entries each, and Employees has about 200 entries; it can happen that most of those will be loaded anyway with Include, because they match the where clause.
Context.Measurements.Include(m => m.Product)
.Include(m => m.ProductVersion)
.Include(m => m.Line)
.Include(m => m.MeasureEmployee)
.Include(m => m.MeasurementType)
.Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
.ToList();
or
Context.Products.Load();
Context.ProductVersions.Load();
Context.Lines.Load();
Context.Employees.Load();
Context.MeasurementType.Load();
Context.Measurements.Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
.ToList();
It depends, try both
When using Include(), you get the benefit of loading all of your data in a single call to the underlying data store. If this is a remote SQL Server, for example, that can be a major performance boost.
The downside is that Include() queries tend to get really complicated, especially if you have any filters (Where() calls, for example) or try to do any grouping. EF will generate very heavily nested queries using sub-SELECT and APPLY statements to get the data you want. It is also much less efficient -- you get back a single row of data with every possible child-object column in it, so data for your top level objects will be repeated a lot of times. (For example, a single parent object with 10 children will produce 10 rows, each with the same data for the parent object's columns.) I've had single EF queries get so complex they caused deadlocks when running at the same time as EF update logic.
The Load() method is much simpler. Each query is a single, easy, straightforward SELECT statement against a single table. These are much easier in every possible way, except you have to do many of them (possibly many times more). If you have nested collections of collections, you may even need to loop through your top level objects and Load their sub-objects. It can get out of hand.
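For example, a minimal sketch of that pattern using the DbContext API (EF 4.1+); the Order/OrderItems names are only illustrative, not from the question's model:
// Load the parent rows first...
var orders = context.Orders
    .Where(o => o.OrderDate >= DateTime.Now.AddDays(-1))
    .ToList();

// ...then explicitly load each child collection: one simple SELECT per parent row.
foreach (var order in orders)
{
    context.Entry(order).Collection(o => o.OrderItems).Load();
}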
Quick rule-of-thumb
Try to avoid having any more than three Include calls in a single query. I find that EF's queries get too ugly to recognize beyond that; it also matches my rule-of-thumb for SQL Server queries, that up to four JOIN statements in a single query works very well, but after that it's time to consider refactoring.
However, all of that is only a starting point.
It depends on your schema, your environment, your data, and many other factors.
In the end, you will just need to try it out each way.
Pick a reasonable "default" pattern to use, see if it's good enough, and if not, optimize to taste.
Include() will be written to SQL as JOIN: one database roundtrip.
Each Load()-instruction is "explicitly loading" the requested entities, so one database roundtrip per call.
Thus Include() will most probably be the more sensible choice in this case, but it depends on the database layout, how often this code is called and how long your DbContext lives. Why don't you try both ways and profile the queries and compare the timings?
See Loading Related Entities.
I agree with @MichaelEdenfield in his answer, but I did want to comment on the nested collections scenario. You can get around having to do nested loops (and the many resulting calls to the database) by turning the query inside out.
Rather than loop down through a Customer's Orders collection and then performing another nested loop through the Order's OrderItems collection say, you can query the OrderItems directly with a filter such as the following.
context.OrderItems.Where(x => x.Order.CustomerId == customerId);
You will get the same resulting data as the Loads within nested loops but with just a single call to the database.
Also, there is a special case that should be considered with Includes. If the relationship between the parent and the child is one-to-one, then the problem with the parent data being returned multiple times would not be an issue.
I am not sure what the effect would be if, in the majority of cases, no child exists - lots of nulls? Sparse children in a one-to-one relationship might be better suited to the direct query technique that I outlined above.
Include is an example of eager loading, where you load not only the entities you are querying for but also all related entities.
Load is a manual override of that behaviour for explicit loading: even with lazy loading disabled (LazyLoadingEnabled = false), you can still load the related entities you ask for with .Load().
It's always hard to decide whether to go with Eager, Explicit or even Lazy Loading.
What I would recommend anyway is always to perform some profiling. That's the only way to be sure your request will be performant or not.
There're a lot of tools that will help you out. Have a look at this article from Julie Lerman where she lists several different ways to do profiling. One simple solution is to start profiling in your SQL Server Management Studio.
Do not hesitate to talk with a DBA (if you have one near you) who can help you understand the execution plan.
You could also have a look at this presentation where I wrote a section about loading data and performance.
One more thing to add to this thread: it depends on which database server you use. If you are working on SQL Server it's OK to use eager loading, but for SQLite you will have to use .Load() to avoid cross-loading exceptions, because SQLite cannot deal with some Include statements that go deeper than one dependency level.
Updated answer: As of EF Core 5.0 you can use AsSplitQuery().
This is particularly useful and I personally use it all the time when I have many joins which will result in a possible cartesian explosion, or will just take more time to complete.
As the name implies, EF will execute separate queries for each Entity instead of using joins.
So, where you would use Explicit loading, you can now use Eager loading with split queries to achieve the same result, and it's definitely more readable imo.
See https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries
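For example, applied to the query from the original question (a sketch; requires EF Core 5.0 or later):
Context.Measurements
    .Include(m => m.Product)
    .Include(m => m.ProductVersion)
    .Include(m => m.Line)
    .Include(m => m.MeasureEmployee)
    .Include(m => m.MeasurementType)
    .Where(m => m.MeasurementTime >= DateTime.Now.AddDays(-1))
    .AsSplitQuery() // one SELECT per included navigation instead of one big JOIN
    .ToList();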
I'm currently working on a WPF application which was built using Entity Framework (database first) to access data in a SQL Server database.
In the past, the database was on an internal server and I did not notice any performance problems, even though the database is very badly implemented (only tables; no views, indexes or stored procedures). I'm the one who created it, but it was my first job and I was not very good with databases, so I felt like Entity Framework was the best approach so I could focus mainly on code.
However, the database is now on another server which is way slower. As you can guess, the application now has big performance issues (more than 10 seconds to load a dozen rows, the same to insert new rows, ...).
Should I stay with Entity Framework but try to improve performance by altering the database, adding views and stored procedures?
Should I get rid of Entity Framework and use only "basic" code (and improve the database at the same time)?
Is there a simple ORM I could use instead of EF?
Time is not an issue here, I can use all the time I want to improve the application but I can't seem to make a decision about the best way to make my application evolved.
The database is quite simple (around 10 tables); the only thing that could complicate things is that I store files in there. So I'm not sure I can really use whatever I want. And I don't know if it's important, but I need to display quite a few calculated fields. Any advice?
Feel free to ask any relevant questions.
For performance profiling, the first place I recommend looking is an SQL profiler. This can capture the exact SQL statements that EF is running, and help identify possible performance culprits. I cover a few of these here. The Schema issues are probably the most relevant place to start. The title targets MVC, but most of the items relate to WPF and any application.
A good, simple profiler that I use for SQL Server is ExpressProfiler. (https://github.com/OleksiiKovalov/expressprofiler)
With the move to a new server, and it now sending the data over the wire rather than pulling from a local database, the performance issues you're noticing will most likely fall under the category of "loading too much, too often". Now you won't only be waiting for the database to load the data, but also for it to package it up and send it over the wire. Also, does the new database represent the same data volume and serve only a single client, or is it now serving multiple clients? Another catch for developers is "works on my machine", where local testing databases are smaller and aren't dealing with concurrent queries from the server (where locks and such can impact performance).
From here, run a copy of the application with an isolated database server (no other clients hitting it to reduce "noise") with the profiler running against it. The things to look out for:
Lazy Loading - These are cases where you have queries to load data, but then see lots (dozens to hundreds) of additional queries being spun off. Your code may say "run this query and populate this data", which you expect to be 1 SQL query, but by touching lazy-loaded properties it can spin off a great many other queries.
The solution to lazy loading: If you need the extra data, eager load it with .Include(). If you only need some of the data, look into using .Select() to select view models / DTO of the data you need rather than relying on complete entities. This will eliminate lazy load scenarios, but may require some significant changes to your code to work with view models/dtos. Tools like Automapper can help greatly here. Read up on .ProjectTo() to see how Automapper can work with IQueryable to eliminate lazy load hits.
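For instance, a minimal sketch of that projection approach (the Order/OrderDto names and properties are only illustrative, not from the original code):
public class OrderDto
{
    public int Id { get; set; }
    public DateTime OrderDate { get; set; }
    public string CustomerName { get; set; }
    public int ItemCount { get; set; }
}

// Select only the columns the screen needs; no entities are tracked and no lazy loads fire.
var orders = context.Orders
    .Where(o => o.CustomerId == customerId)
    .Select(o => new OrderDto
    {
        Id = o.Id,
        OrderDate = o.OrderDate,
        CustomerName = o.Customer.Name,   // becomes a JOIN in SQL, not a lazy load
        ItemCount = o.OrderItems.Count()  // becomes a sub-query/COUNT in SQL
    })
    .ToList();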
Reading too much - Loading entities can be expensive, especially if you don't need all of that data. Culprits for performance include excessive use of .ToList(), which will materialize entire entity sets where a subset of the data, a simple existence check, or a count would suffice. For example, I've seen code that does stuff like this:
var data = context.MyObjects.SingleOrDefault(x => x.IsActive && x.Id == someId);
return (data != null);
This should be:
var isData = context.MyObjects.Where(x => x.IsActive && x.Id == someId).Any();
return isData;
The difference between the two is that in the first example, EF will effectively do a SELECT * operation, so in the case where data is present it will return back all columns into an entity, only to later check if the entity was present. The second statement will run a faster query to simply return back whether a row exists or not.
var myDtos = context.MyObjects.Where(x => x.IsActive && x.ParentId == parentId)
    .ToList()
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = calculateBalance(x.OrderItems.ToList()),
        Children = x.Children.ToList()
            .Select(c => new ChildDto
            {
                Id = c.Id,
                Name = c.Name
            }).ToList()
    }).ToList();
Statements like this can go on and get rather complex, but the real problem is the .ToList() before the .Select(). Often these creep in because devs try to do something that EF doesn't understand, like calling a method (i.e. calculateBalance()), and it "works" by first calling .ToList(). The problem here is that you are materializing the entire entity at that point and switching to Linq2Object. This means that any "touches" on related data, such as .Children, will now trigger lazy loads, and further .ToList() calls can pull even more data into memory which might otherwise be reduced in the query. The culprit to look out for is .ToList() calls; try removing them. Select simpler values before calling .ToList() and then feed that data into view models where the view models can calculate the resulting data.
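For instance, a hedged sketch of how the query above might be restructured, assuming calculateBalance only needs each item's price and quantity (those property names are illustrative):
// Select only primitive values and small child projections first; the .ToList() in the
// middle runs one SQL query, and the balance is computed in memory afterwards.
var myDtos = context.MyObjects
    .Where(x => x.IsActive && x.ParentId == parentId)
    .Select(x => new
    {
        x.Id,
        x.FirstName,
        x.LastName,
        Items = x.OrderItems.Select(i => new { i.Price, i.Quantity }),
        Children = x.Children.Select(c => new { c.Id, c.Name })
    })
    .ToList()
    .Select(x => new ObjectDto
    {
        Id = x.Id,
        Name = x.FirstName + " " + x.LastName,
        Balance = x.Items.Sum(i => i.Price * i.Quantity), // or pass the projected items to a helper
        Children = x.Children.Select(c => new ChildDto { Id = c.Id, Name = c.Name }).ToList()
    })
    .ToList();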
The worst culprit like this I've seen was due to a developer wanting to use a function in a Where clause:
var data = context.MyObjects.ToList().Where(x => calculateBalance(x) > 0).ToList();
That first ToList() statement will attempt to saturate the whole table to entities in memory. A big performance impact beyond just the time/memory/bandwidth needed to load all of this data is simply the # of locks the database must make to reliably read/write data. The fewer rows you "touch" and the shorter you touch them, the nicer your queries will play with concurrent operations from multiple clients. These problems magnify greatly as systems transition to being used by more users.
Provided you've eliminated extra lazy loads and unnecessary queries, the next thing to look at is query performance. For operations that seem slow, copy the SQL statement out of the profiler and run that in the database while reviewing the execution plan. This can provide hints about indexes you can add to speed up queries. Again, using .Select() can greatly increase query performance by using indexes more efficiently and reducing the amount of data the server needs to pull back.
For file storage: Are these stored as columns in a relevant table, or in a separate table that is linked to the relevant record? What I mean by this, is if you have an Invoice record, and also have a copy of an invoice file saved in the database, is it:
Invoices
    InvoiceId
    InvoiceNumber
    ...
    InvoiceFileData
or
Invoices
    InvoiceId
    InvoiceNumber
    ...
InvoiceFile
    InvoiceId
    InvoiceFileData
It is a better structure to keep large, seldom used data in separate tables rather than combined with commonly used data. This keeps queries to load entities small and fast, where that expensive data can be pulled up on-demand when needed.
If you are using GUIDs for keys (as opposed to ints/longs), are you leveraging newsequentialid()? (assuming SQL Server) Keys set to use newid(), or Guid.NewGuid() in code, will lead to index fragmentation and poor performance. If you populate the IDs via database defaults, switch them over to use newsequentialid() to help reduce the fragmentation. If you populate IDs via code, have a look at writing a Guid generator that mimics newsequentialid() (SQL Server) or a pattern suited to your database. SQL Server and Oracle store/index GUID values differently, so having the "static-like" part of the UUID bytes in the higher-order vs. lower-order bytes of the data will aid indexing performance. Also consider index maintenance and other database maintenance jobs to help keep the database server running efficiently.
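A rough sketch of such a generator for SQL Server, modelled on the well-known COMB approach used by NHibernate's guid.comb generator (byte positions are an assumption you should verify against your server's ordering):
using System;

public static class SequentialGuid
{
    // SQL Server compares uniqueidentifiers with the last 6 bytes as most significant,
    // so writing a timestamp there keeps newly generated keys roughly in insert order.
    public static Guid NewComb()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        DateTime now = DateTime.UtcNow;
        TimeSpan days = now - new DateTime(1900, 1, 1);
        TimeSpan timeOfDay = now.TimeOfDay;

        byte[] daysBytes = BitConverter.GetBytes(days.Days);
        // SQL Server datetime has ~3.33ms resolution; match it so values stay comparable.
        byte[] msecsBytes = BitConverter.GetBytes((long)(timeOfDay.TotalMilliseconds / 3.333333));

        Array.Reverse(daysBytes);   // big-endian so later dates sort higher
        Array.Reverse(msecsBytes);

        Array.Copy(daysBytes, daysBytes.Length - 2, guidBytes, guidBytes.Length - 6, 2);
        Array.Copy(msecsBytes, msecsBytes.Length - 4, guidBytes, guidBytes.Length - 4, 4);

        return new Guid(guidBytes);
    }
}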
When it comes to index tuning, database server reports are your friends. After you've eliminated most, or at least some, serious performance offenders from your code, the next thing is to look at real-world use of your system. The best way to learn where to target your code/index investigations is to look at the most-used and most problematic queries that the database server identifies. Where these are EF queries, you can usually reverse-engineer which EF query is responsible based on the tables being hit. Grab these queries and feed them through the execution plan to see if there is an index that might help matters. Indexing is something that developers either forget or get prematurely concerned about. Too many indexes can be just as bad as too few. I find it's best to monitor real-world usage before deciding on what indexes to add.
This should hopefully give you a start on things to look for and kick the speed of that system up a notch. :)
First you need to run a performance profiler and find out what the bottleneck is; it could be the database, the Entity Framework configuration, the Entity Framework queries, and so on.
In my experience, Entity Framework is a good option for this kind of application, but you need to understand how it works.
Also, which version of Entity Framework are you using? The latest version is 6.2 and has some performance improvements that older ones do not have, so if you are using an old one I suggest you update it.
Based on the comments I am going to hazard a guess that it is mostly a bandwidth issue.
You had an application that was working fine when it was co-located, perhaps a single switch, gigabit ethernet and 200m of cabling.
Now that application is trying to send or retrieve data to/from a remote server, probably over the public internet through an unknown number of internal proxies in contention with who knows what other traffic, and it doesn't perform as well.
You also mention that you store files in the database, and your schema has fields like Attachment.data and Doc.file_content. This suggests that you could be trying to transmit large quantities (perhaps megabytes) of data for a simple query and that is where you are falling down.
Some general pointers:
Add indexes anywhere you join tables or on values you commonly query on.
Be aware of the difference between lazy and eager loading in Entity Framework. There is no right or wrong answer, but you should know which approach you are using and why.
Split any file content into its own table, with the same primary key as the main table, or play with different EF classes to make sure you only retrieve files when you need to use them.
It's more of a technical (behind-the-scenes of EF) kind of question, for my own better understanding of Include.
Does it make the query faster to Include another table when using a Select statement at the end?
ctx.tableOne.Include("tableTwo")
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(res => new {
        Value1 = res.Value1,
        TwoValue1 = res.tableTwo.Value1,
        TwoValue2 = res.tableTwo.Value2,
        TwoValue3 = res.tableTwo.Value3,
        TwoValue4 = res.tableTwo.Value4
    });
Might it depend on the number of values selected from the included table?
In the example above, 4 out of 5 values are from the included table. I wonder whether that has any performance impact, good or bad.
So, my question is: what is EF doing behind the scenes, and is there a preferred way to use Include when I already know all the values I will select?
In your case it doesn't matter whether you use Include(<relation-property-name>) or not, because you don't materialize the values before the Select(<mapping-expression>). If you use SQL Server Profiler (or another profiler) you can see that EF generates exactly the same query in both cases.
The reason for this is that the data is not materialized in memory before the Select - you are working on an IQueryable, which means EF will generate a single SQL query at the end (when you call First(), Single(), FirstOrDefault(), SingleOrDefault(), ToList(), or use the collection in a foreach statement). If you call ToList() before the Select(), it will materialize the entities from the database into memory, and that is where Include() comes in handy so you don't make N+1 queries when accessing navigation properties to other tables.
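A hedged sketch of the contrast, reusing the question's hypothetical tableOne/tableTwo names (Value2 is just an illustrative column):
// Stays an IQueryable until the end: one SQL query; Include makes no difference here.
var projected = ctx.tableOne
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .Select(t1 => new { t1.Value1, TwoValue2 = t1.tableTwo.Value2 })
    .ToList();

// Materializes full entities first: without the Include, touching t1.tableTwo afterwards
// would trigger a lazy load per row (or be null with lazy loading off), so here Include matters.
var materialized = ctx.tableOne.Include("tableTwo")
    .Where(t1 => t1.Value1 == "SomeValueFor")
    .ToList()
    .Select(t1 => new { t1.Value1, TwoValue2 = t1.tableTwo.Value2 })
    .ToList();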
It is about how you want EF to load your data. If you want a table's data to be pre-populated, then use Include. It is handier if the included table is going to be used frequently, and it will be a little slower up front because EF has to load all the relevant data beforehand. Read up on the difference between lazy and eager loading: with Include you get eager loading, where the data is populated up front; otherwise EF will send a call to the secondary table when the projection takes place, i.e. lazy loading.
I agree with @Karamfilov in his general discussion, but in your example your query might not be the most performant. Performance can be affected by many factors, such as the indexes present on the table, but you must always help EF with the SQL it generates. The Include method can produce SQL that includes all columns of the table; you should always check the generated SQL and verify whether you can obtain a better one using a JOIN.
This article explains the techniques that can be used and what impact they have on performance: https://msdn.microsoft.com/it-it/library/bb896272(v=vs.110).aspx
I was reading this SO question but still I am not clear about one specific thing.
If I use NHibernate, why do I need LINQ?
The question in my mind became even stronger when I learned that NHibernate also includes LINQ support.
LINQ to NHibernate?
WTF!
LINQ is a query language. It allows you to express queries in a way that is not tied in to your persistence layer.
You may be thinking about the LINQ 2 SQL ORM.
Using LINQ in naming the two causes unfortunate confusion like yours.
NHibernate, EF and LINQ to XML are all LINQ providers - they all allow you to query a data source using the LINQ syntax.
Well, you don't need Linq, you can always do without it, but you might want it.
Linq provides a way to express operations that behave on sets of data that can be queried and where we can then perform other operations based on the state of that data. It's deliberately written so as to be as agnostic as possible whether that data is in-memory collections, XML, database, etc. Ultimately it's always operating on some sort of in-memory object, with some means of converting between in-memory and the ultimate source, though some bindings go further than others in pushing some of the operations down to a different layer. E.g. calling .Count() can end up looking at a Count property, spinning through a collection and keeping a tally, sending a Count(*) query to a database or maybe something else.
ORMs provide a way to have in-memory objects and database rows reflect each other, with changes to one being reflected by changes to the other.
That fits nicely into the "some means of converting" bit above. Hence Linq2SQL, EF and Linq2NHibernate all fulfil both the ORM role and the Linq provider role.
Considering that Linq can work on collections you'd have to be pretty perverse to create an ORM that couldn't support Linq at all (you'd have to design your collections to not implement IEnumerable<T> and hence not work with foreach). More directly supporting it though means you can offer better support. At the very least it should make for more efficient queries. For example if an ORM gave us a means to get a Users object that reflected all rows in a users table, then we would always be able to do:
int uID = (from u in Users where u.Username == "Alice" select u.ID).FirstOrDefault();
Without direct support for Linq by making Users implement IQueryable<User>, then this would become:
SELECT * FROM Users
Followed by:
while(dataReader.Read())
yield return ConstructUser(dataReader);
Followed by:
foreach(var user in Users)
if(user.Username == "Alice")
return user.ID;
return 0;
Actually, it'd be just slightly worse than that. With direct support the SQL query produced would be:
SELECT TOP 1 id FROM Users WHERE username = 'Alice'
Then the C# becomes equivalent to
return dataReader.Read() ? dataReader.GetInt32(0) : 0;
It should be pretty clear how the greater built-in Linq support of a Linq provider should lead to better operation.
Linq is an in-language feature of C# and VB.NET and can also be used by any .NET language, though not always with the same in-language syntax. As such, every .NET developer should know it, and every C# and VB.NET developer should particularly know it (or they don't know C# or VB.NET), and that's the group NHibernate is designed to be used by, so they can depend on not needing to explain a whole bunch of operations by just implementing them the Linq way. Not supporting it in a .NET library that represents queryable data should be considered a lack of completeness at best; the whole point of an ORM is to make manipulating a database as close as possible to non-DB-related operations in the programming language in use. In .NET that means Linq support.
First of all, LINQ alone is not an ORM. It is a DSL for querying objects irrespective of the source they came from.
So it makes perfect sense that you can use LINQ with NHibernate too.
I believe you have confused LINQ to SQL with plain LINQ.
Common sense?
There is a difference between an ORM like NHibernate and a compiler-integrated way to express queries, which is useful in many more scenarios.
Or: usage of LINQ (not LINQ to SQL etc. - the language feature, which is what you are talking about, though I am not sure you meant what you said) means you don't have to deal with NHibernate's special query syntax.
Or: anyone NOT using LINQ - regardless of NHibernate or not - disqualifies themselves without a good explanation.
You don't need it, but you might find it useful. Bear in mind that Linq, as others have said, is not the same thing as Linq to SQL. Where I work, we write our own SQL queries to retrieve data, but it's quite common for us to use Linq to manipulate that data in order to serve a specific need. For instance, you might have a data access method that allows you to retrieve all dogs owned by Dave:
new DogOwnerDal().GetForOwner(id);
If you're only interested in Dave's dachshunds for one specific need, and performance isn't that much of an issue, you can use Linq to filter the response for all of Dave's dogs down to the specific data that you need:
new DogOwnerDal().GetForOwner(id).Where(d => d.Breed == DogBreeds.Dachshund);
If performance was crucial, you might want to write a specific data access method to retrieve dogs by owner and breed, but in many cases the effort expended to create the new data access method doesn't increase efficiency enough to be worth doing.
In your example, you might want to use NHibernate to retrieve a lump of data, and then use Linq to break that data into lots of individual subsets for some form of processing. It might well be cheaper to get the data once and use Linq to split it up, instead of repeatedly interrogating the database for different mixtures of the same data.
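For example, a hedged sketch (GetForOwner and the Breed property are the same hypothetical names as above):
// One database call, then LINQ to Objects splits the result into per-breed subsets.
var allDogs = new DogOwnerDal().GetForOwner(id);

var dogsByBreed = allDogs
    .GroupBy(d => d.Breed)
    .ToDictionary(g => g.Key, g => g.ToList());

// Each subset can now be processed without another round trip to the database.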
We are having a problem navigating between entities one of which is based on a view.
The problem is when we go
TableEntity.ViewEntity.Where(x => x.Id == Id).FirstOrDefault()
In the background it is loading all records in the view which is not what we want or expect.
However when we go
_objectContext.TableEntityView
.Where(x => x.TableObjectId == TableObjectId && x.Id == Id)
Then it just loads up the one row which is what we are expecting
In short using the navigation properties causes a massive data load – it’s like the query is being realised early.
We are using EF 4 with a SQL Server 2005 database. The views are used to provide aggregated information which EF couldn't easily produce without big data loads (ironically). We have manually constructed 1:many associations between the views.
Why then do we get the large data load in the first instance but not the second?
Many thanks for all/any help
That's how navigation collections work in EF: accessing the collection loads all entities, and any linq queries you run thereafter simply query against the objects in memory. I don't think there's anything you can do about it short of a custom query like you've already done.
FWIW I'm told NHibernate supports more fine-grained navigation loads, but that feature has yet to make its way into Entity Framework.
EDIT
This answer from Ladislav Mrnka shows a possible solution to your problem from the CTP days. Not sure if anything has changed since then. It uses the DbContext, so you still won't be able to just plow through the navigation property, but it's probably as close as you're going to get.
int count = context.Entry(myAccount)
.Collection(a => a.Orders).Query().Count();
or for your case, I'm guessing it would be
TableEntityView obj = context.Entry(TableEntity)
.Collection(a => a.ViewEntity)
.Query().FirstOrDefault(x => x.Id == Id);
I've had some issues with the way that Entity Framework generates SQL, so first of all I suggest that you use LINQPad and one or more of the following: Entity Framework Profiler (paid-for software), SQL Profiler (assuming you are using SQL Server) and/or the EFTracingProvider.
I've had issues where the same table was inner joined several times on some queries so having those tools generally helps find out what's causing the issue.
The things that I've tried that have often made some queries run faster:
Writing a full LINQ query rather than using lambda expressions: they are often easier to read and they look a lot more like SQL, so it's easier to see the relationship between your code and the generated SQL.
and
EntitySet.Include(x => x.Property)
This tells LINQ to Entities to include the property in the query. A sketch combining both follows below.
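A rough sketch of both suggestions together (the Orders/Customer names are only illustrative, not from the question):
// Query syntax plus an Include on the entity set.
// In EF6 the lambda-based Include needs 'using System.Data.Entity;'.
var recentOrders = (from o in context.Orders.Include(o => o.Customer)
                    where o.OrderDate >= DateTime.Now.AddDays(-7)
                    select o).ToList();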
I have been reading and looking around to find this answer in black and white.
Let's talk about the familiar
Customer and Order problem. Let's
say I load 100 orders, and each order
is linked to one and only one
customer.
Using Fluent NHibernate, I will use References() to link my Order to Customer and I will define my Not.LazyLoad() and Fetch.Join() as well.
Now I am thinking, hypothetically, NHibernate could simply join these two tables and it would be pretty easy to hydrate the entities. However, in my tests I always see N+1 queries instead (in fact, perhaps fetching only the unique IDs). I can share my code and tables but it might bore you, so:
Is it possible to overcome N+1 for Order->Customer (one-to-one, or rather many-to-one)? Or do I have to use batching or the Criteria API?
If possible, can you please point me to a Fluent NHibernate example?
Frequently there is the complaint that fetch="join" doesn't work. This is because it is not considered by HQL; you can declare it within the HQL itself.
I used fetch="join" hoping to improve performance but stopped using it in many cases. The problem was that joining too many tables could make SQL Server run into its maximum number of columns limit. In some cases you don't need the data at all, so it is not very useful to specify it globally in the mapping file.
So I would recommend to
either use explicit join fetching in HQL (sketched below), because there you know whether the data is actually used,
or, for any other case, use batches - they are a great solution because they are transparent (your code doesn't need to know about them), make use of lazy loading and reduce the N+1 problem at the same time.
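A minimal sketch of explicit join fetching in HQL (the Order/Customer names match the question's example; the OrderDate filter is only illustrative):
// One SELECT with a JOIN: each Order comes back with its Customer already hydrated.
var orders = session
    .CreateQuery("from Order o inner join fetch o.Customer where o.OrderDate >= :since")
    .SetParameter("since", DateTime.Now.AddDays(-1))
    .List<Order>();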
I wouldn't look at the mapping as much as the actual querying you are doing. I leave ALL of my mappings as LazyLoad by default and override as needed.
I use the Criteria API to query, and I use CreateAlias to join other tables as needed. NHProf is highly recommended to find and eliminate situations like this.
There are two ways you can address your problem.
a) Using a Criteria query
It will look something like this:
Session.CreateCriteria(typeof(Order))
    .Add(<Restrictions if any>)
    .SetFetchMode("Customer", FetchMode.Eager)
    .List<Order>();
b) Using HQL
Session.CreateQuery("select o from Order inner join fetch o.Customer where <conditionifany>").List<Order>();
Hope this helps..
What query API are you using?
If it's HQL, you can use join fetch to retrieve an association eagerly.
For LINQ and QueryOver, use .Fetch()
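For example, hedged sketches of both, assuming the same Order -> Customer mapping as in the question (session.Query<T>() needs 'using NHibernate.Linq;'):
// LINQ to NHibernate: fetch the Customer reference eagerly in the same SELECT.
var orders = session.Query<Order>()
    .Fetch(o => o.Customer)
    .ToList();

// QueryOver equivalent.
var ordersViaQueryOver = session.QueryOver<Order>()
    .Fetch(o => o.Customer).Eager
    .List();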