I have a very large amount of data that I need to gather for a report I am generating. All of this data comes from a Database that I am connected to via entity framework. For this query I have tried doing this a few different ways but no matter what I do it seems to be slow.
Overall I am curious if it is more efficient to have a LINQ query that has sub queries or is it better to do a foreach and then query for those values.
additional information for the DB a lot of the sub queries/loop iterations would be querying most of the largest tables in the DB.
Example code:
var b = (from brk in entities.Brokers
join pcy in Policies on brk.BrkId equals pcy.pcyBrkId
where pcy.DateStamp > twoYearsAgo
select new returnData
{
BroId = brk.brkId,
currentPrem = (from pcy in Policies
where pcy.PcyBrkID = brk.Brk.Id && pcy.InvDate > startDate && pcy.InvDate < endDate
select pcy.Premium).Sum(),
// 5 more similar subqueries
}).GroupBy(x=> x.BrkId).Select(x=> x.FirstOrDefault()).ToList();
OR
var b = (from brk in entities.Brokers
join pcy in Policies on brk.BrkId equals pcy.pcyBrkId
where pcy.DateStamp > twoYearsAgo
select new returnData
{
BroId = brk.brkId
}).GroupBy(x=> x.BrkId).Select(x=> x.FirstOrDefault()).ToList();
foreach( brk in b){
// grab data from subqueries here
}
One additional detail may be that I may be able to filter out some additional information if I grab the primary information reducing the results to go through in the foreach.
First of all, matters of performance always warrant profiling, no matter how reasonable or logical one or another solution might seem.
Saying that, usually, while working with database, less trips you do to database is better. Hence in your case it might be more efficient to have one single SQL query that retrieves big chunk of data over network, and after you process it locally with loops and whatnot. This guideline has to be an optimal solution for most cases.
All, obviously, depends on how big that data is, how big your network bandwidth is, and how fast and tuned your database is.
Side note: in general, if you work with big, or complex (intertwined) data, better to avoid using Entity Framework at all, especially when you're concerned about performance. Not sure if that might work for you.
Related
I have often found that if I have too many joins in a Linq query (whether using Entity Framework or NHibernate) and/or the shape of the resulting anonymous class is too complex, Linq takes a very long time to materialize the result set into objects.
This is a generic question, but here's a specific example using NHibernate:
var libraryBookIdsWithShelfAndBookTagQuery = (from shelf in session.Query<Shelf>()
join sbttref in session.Query<ShelfBookTagTypeCrossReference>() on
shelf.ShelfId equals sbttref.ShelfId
join bookTag in session.Query<BookTag>() on
sbttref.BookTagTypeId equals (byte)bookTag.BookTagType
join btbref in session.Query<BookTagBookCrossReference>() on
bookTag.BookTagId equals btbref.BookTagId
join book in session.Query<Book>() on
btbref.BookId equals book.BookId
join libraryBook in session.Query<LibraryBook>() on
book.BookId equals libraryBook.BookId
join library in session.Query<LibraryCredential>() on
libraryBook.LibraryCredentialId equals library.LibraryCredentialId
join lcsg in session
.Query<LibraryCredentialSalesforceGroupCrossReference>()
on library.LibraryCredentialId equals lcsg.LibraryCredentialId
join userGroup in session.Query<UserGroup>() on
lcsg.UserGroupOrganizationId equals userGroup.UserGroupOrganizationId
where
shelf.ShelfId == shelfId &&
userGroup.UserGroupId == userGroupId &&
!book.IsDeleted &&
book.IsDrm != null &&
book.BookFormatTypeId != null
select new
{
Book = book,
LibraryBook = libraryBook,
BookTag = bookTag
});
// add a couple of where clauses, then...
var result = libraryBookIdsWithShelfAndBookTagQuery.ToList();
I know it's not the query execution, because I put a sniffer on the database and I can see that the query is taking 0ms, yet the code is taking about a second to execute that query and bring back all of 11 records.
So yeah, this is an overly complex query, having 8 joins between 9 tables, and I could probably restructure it into several smaller queries. Or I could turn it into a stored procedure - but would that help?
What I'm trying to understand is, where is that red line crossed between a query that is performant and one that starts to struggle with materialization? What's going on under the hood? And would it help if this were a SP whose flat results I subsequently manipulate in memory into the right shape?
EDIT: in response to a request in the comments, here's the SQL emitted:
SELECT DISTINCT book4_.bookid AS BookId12_0_,
libraryboo5_.librarybookid AS LibraryB1_35_1_,
booktag2_.booktagid AS BookTagId15_2_,
book4_.title AS Title12_0_,
book4_.isbn AS ISBN12_0_,
book4_.publicationdate AS Publicat4_12_0_,
book4_.classificationtypeid AS Classifi5_12_0_,
book4_.synopsis AS Synopsis12_0_,
book4_.thumbnailurl AS Thumbnai7_12_0_,
book4_.retinathumbnailurl AS RetinaTh8_12_0_,
book4_.totalpages AS TotalPages12_0_,
book4_.lastpage AS LastPage12_0_,
book4_.lastpagelocation AS LastPag11_12_0_,
book4_.lexilerating AS LexileR12_12_0_,
book4_.lastpageposition AS LastPag13_12_0_,
book4_.hidden AS Hidden12_0_,
book4_.teacherhidden AS Teacher15_12_0_,
book4_.modifieddatetime AS Modifie16_12_0_,
book4_.isdeleted AS IsDeleted12_0_,
book4_.importedwithlexile AS Importe18_12_0_,
book4_.bookformattypeid AS BookFor19_12_0_,
book4_.isdrm AS IsDrm12_0_,
book4_.lightsailready AS LightSa21_12_0_,
libraryboo5_.bookid AS BookId35_1_,
libraryboo5_.libraryid AS LibraryId35_1_,
libraryboo5_.externalid AS ExternalId35_1_,
libraryboo5_.totalcopies AS TotalCop5_35_1_,
libraryboo5_.availablecopies AS Availabl6_35_1_,
libraryboo5_.statuschangedate AS StatusCh7_35_1_,
booktag2_.booktagtypeid AS BookTagT2_15_2_,
booktag2_.booktagvalue AS BookTagV3_15_2_
FROM shelf shelf0_,
shelfbooktagtypecrossreference shelfbookt1_,
booktag booktag2_,
booktagbookcrossreference booktagboo3_,
book book4_,
librarybook libraryboo5_,
library librarycre6_,
librarycredentialsalesforcegroupcrossreference librarycre7_,
usergroup usergroup8_
WHERE shelfbookt1_.shelfid = shelf0_.shelfid
AND booktag2_.booktagtypeid = shelfbookt1_.booktagtypeid
AND booktagboo3_.booktagid = booktag2_.booktagid
AND book4_.bookid = booktagboo3_.bookid
AND libraryboo5_.bookid = book4_.bookid
AND librarycre6_.libraryid = libraryboo5_.libraryid
AND librarycre7_.librarycredentialid = librarycre6_.libraryid
AND usergroup8_.usergrouporganizationid =
librarycre7_.usergrouporganizationid
AND shelf0_.shelfid = #p0
AND usergroup8_.usergroupid = #p1
AND NOT ( book4_.isdeleted = 1 )
AND ( book4_.isdrm IS NOT NULL )
AND ( book4_.bookformattypeid IS NOT NULL )
AND book4_.lightsailready = 1
EDIT 2: Here's the performance analysis from ANTS Performance Profiler:
It is often database "good" practice to place lots of joins or super common joins into views. ORMs don't let you ignore these facts nor do they supplement the decades of time spent fine tuning databases to do these kinds of things efficiently. Refactor those joins into a singular view or a couple views if that'd make more sense in the greater perspective of your application.
NHibernate should be optimizing the query down and reducing the data so that .Net only has to mess with the important parts. However, if those domain objects are just naturally large, that's still a lot of data. Also, if it's a really large result set in terms of rows returned, that's a lot of objects getting instantiated even if the DB is able to return the set quickly. Refactoring this query into a view that only returns the data you actually need would also reduce object instantiation overhead.
Another thought would be to not do a .ToList(). Return the enumerable and let your code lazily consume the data.
According to profiling information, the CreateQuery takes 45% of the total execution time. However as you mentioned the query took 0ms when you executed directly. But this alone is not enough to say there is a performance problem because,
You are running the query with the profiler which has significant impact on execution time.
When you use a profiler, it will affect every code is being profiled but not the sql execution time (because it happens in the SQL server), so you can see everything else is slower compared to SQL statement.
so ideal scenario is to measure how long it takes to execute entire code block, measure time for SQL query and calculate times, and if you do that you will probably end up with different values.
However, I'm not saying that the the NH Linq to SQL implementation is optimized for any query you come up with, but there are other ways in NHibernate to deal with those situations such as QueryOverAPI, CriteriaQueries, HQL and finally SQL.
Where is that red line crossed between a query that is performant and
one that starts to struggle with materialization. What's going on under the hood?
This one is pretty hard question and without having detail knowledge of NHibernate Linq to SQL provider it's hard to provide a accurate answer. You can always try different mechanisms provided and see which one is the best for given scenario.
And would it help if this were a SP whose flat results I subsequently
manipulate in memory into the right shape?
Yes, using a SP would help things to work pretty fast, but using SP would add more maintenance problems to your code base.
You have generic question, I'll tell you generic answer :)
If you query data for reading (not for update) try to use anonymous classes. The reason is - they are lighter to create, they have no navigatoin properties. And you select only data you need! It's very important rule. So, try to replace your select with smth like this:
select new
{
Book = new { book.Id, book.Name},
LibraryBook = new { libraryBook.Id, libraryBook.AnotherProperty},
BookTag = new { bookTag.Name}
}
Stored procedures are good, when query is complex and linq-provider generates not effective code, so, you can replace it with plain SQL or stored procedure. It's not offten case and, I think, it's not your situation
Run your sql-query. How many rows it returns? Is it the same value as result? Sometimes linq provider generates code, that select much more rows to select one entity. It happens, when entity has one to many relationship with another selecting entity. For example:
class Book
{
int Id {get;set;}
string Name {get;set;}
ICollection<Tag> Tags {get;set;}
}
class Tag
{
string Name {get;set;}
Book Book {get;set;}
}
...
dbContext.Books.Where(o => o.Id == 1).Select(o=>new {Book = o, Tags = o.Tags}).Single();
I Select only one book with Id = 1, but provider will generate code, that returns rows amount equals to Tags amount (entity framework does this).
Split complex query to set of simple and join in client side. Sometimes, you have complex query with many conditionals and resulting sql become terrible. So, you split you big query to more simple, get results of each and join/filter on client side.
At the end, I advice you to use anonymous class as result of select.
Don’t use Linq’s Join. Navigate!
in that post you can see:
As long as there are proper foreign key constraints in the database, the navigation properties will be created automatically. It is also possible to manually add them in the ORM designer. As with all LINQ to SQL usage I think that it is best to focus on getting the database right and have the code exactly reflect the database structure. With the relations properly specified as foreign keys the code can safely make assumptions about referential integrity between the tables.
I agree 100% with the sentiments expressed by everyone else (with regards to their being two parts to the optimisation here and the SQL execution being a big unknown, and likely cause of poor performance).
Another part of the solution that might help you get some speed is to pre-compile your LINQ statements. I remember this being a huge optimisation on a tiny project (high traffic) I worked on ages and ages ago... seems like it would contribute to the client side slowness you're seeing. Having said all that though I've not found a need to use them since... so heed everyone else's warnings first! :)
https://msdn.microsoft.com/en-us/library/vstudio/bb896297(v=vs.100).aspx
I was looking for some tips to improve my entity framework query performance and came accross this useful article.
The author of this article mentioned following:
09 Avoid using Views
Views degrade the LINQ query performance costly. These are slow in performance and impact the performance greatly. So avoid using views in LINQ to Entities.
I am just familiar with this meaning of view in the context of databases. And beacuse I don't understand this statement: Which views does he mean?
It depends, though rarely to a significant degree.
Let's say we've a view like:
CREATE VIEW TestView
AS
Select A.x, B.y, B.z
FROM A JOIN B on A.id = B.id
And we create an entity mapping for this.
Let's also assume that B.id is bound so that it is non-nullable and has a foreign key relationship with A.id - that is, whenever there's a B row, there is always at least one corresponding A.
Now, if we could do something like from t in context.TestView where t.x == 3 instead of from a in context.A join b in context.B on a.id equals b.id where a.x == 3 select new {a.x, b.y, b.z}.
We can expect the former to be converted to SQL marginally faster, because it's a marginally simpler query (from both the Linq and SQL perspective).
We can expect the latter to be converted from an SQL query to a SQLServer (or whatever) internal query marginally faster.
We can expect that internal query to be pretty much identical, unless something went a bit strange. As such, we'd expect the performance at that point to be identical.
In all, there isn't very much to choose between them. If I had to bet on one, I'd bet on that using the view being slightly faster especially on first call, but I wouldn't bet a lot on it.
Now lets consider (from t in context.TestView select t.z).Distinct(). vs (from b in context.B select b.z).Distinct().
Both of these should turn into a pretty simple SELECT DISTINCT z FROM ....
Both of these should turn into a table scan or index scan only of table B.
The first might not (flaw in the query plan), but that would be surprising. (A quick check on a similar view does find SQLServer ignoring the irrelevant table).
The first could take slightly longer to produce a query plan for, since the fact that the join on A.id is irrelevant would have to be deduced. But then database servers are good at that sort of thing; it's a set of computer science and problems that have had decades of work done on them.
If I had to bet on one, I'd bet on the view making things very slightly slower, though I'd bet more on it being so slight a difference that it disappears. An actual test with these two sorts of query found the two to be within the same margin of differences (i.e the range of different times for the two overlapped with each other).
The effect in this case on the production of the SQL from the linq query will be nil (they're effectively the same at that point, but with different names).
Lets consider if we had a trigger on that view, so that inserting or deleting carried out the equivalent inserts or deletes. Here we will gain slightly from using one SQL query rather than two (or more), and it's easier to ensure it happens in a single transaction. So a slight gain for views in this case.
Now, let's consider a much more complicated view:
CREATE VIEW Complicated
AS
Select A.x, B.x as y, C.z, COALESCE(D.f, D.g, E.h) as foo
FROM
A JOIN B on A.r = B.f + 2
JOIN C on COALESCE(A.g, B.x) = C.x
JOIN D on D.flag | C.flagMask <> 0
WHERE EXISTS (SELECT null from G where G.x + G.y = A.bar AND G.deleted = 0)
AND A.deleted = 0 AND B.deleted = 0
We could do all of this at the linq level. If we did, it would probably be a bit expensive as query production goes, though that is rarely the most expensive part of the overall hit on a linq query, though compiled queries may balance this out.
I'd lean toward the view being the more efficient approach, though I'd profile if that was my only reason for using the view.
Now lets consider:
CREATE VIEW AllAncestry
AS
WITH recurseAncetry (ancestorID, descendantID)
AS
(
SELECT parentID, childID
FROM Parentage
WHERE parentID IS NOT NULL
UNION ALL
SELECT ancestorID, childID
FROM recurseAncetry
INNER JOIN Parentage ON parentID = descendantID
)
SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID
FROM recurseAncetry
Conceptually, this view does a large number of selects; doing a select, and then recursively doing a select based on the result of that select and so on until it has all the possible results.
In actual execution, this is converted into two table scans and a lazy spool.
The linq-based equivalent would be much heavier; really you'd be better off either calling into the equivalent raw SQL, or loading the table into memory and then producing the full graph in C# (but note that this is going to be a waste on queries based on this that don't need everything).
In all, using a view here is going to be a big saving.
In summary; using views is generally of negligible performance impact, and that impact can go either way. Using views with triggers can give a slight performance win and make it easier to ensure data integrity, by forcing it to happen in a single transaction. Using views with a CTE can be a big performance win.
Non-performance reasons for using or avoiding views though are:
The use of views hides the relationship between the entities related to that view and the entities related to the underlying tables from your code. This is bad as your model is now incomplete in this regard.
If the views are used in other applications apart from yours, you will be more consistent with those other applications, take advantage of already tried-and-tested code, and automatically deal with changes to the view's implementation.
That's some pretty serious micro-optimisation in that article.
I wouldn't take it as gospel personally, having worked with EF quite a bit.
Sure those things can matter, but generally speaking, it's pretty quick.
If you've got a complicated view, and then you're performing further LINQ on that view, then sure, it could probably cause some slow performace, I wouldn't bet on it though.
The article doesn't even have any bench marks!
If performance is a serious issue for your program, narrow down which queries are slow and post them here, see if the SO community can help optimise the query for you. Much better solution than all the micro-optimisation if you ask me.
I have a little program that needs to do some calculation on a data range. The range maybe contain about half a millon of records. I just looked to my db and saw that a group by was executed.
I thought that the result was executed on the first line, and later I just worked with data in RAM. But now I think that the query builder combine the expression.
var Test = db.Test.Where(x => x > Date.Now.AddDays(-7));
var Test2 = (from p in Test
group p by p.CustomerId into g
select new { UniqueCount = g.Count() } );
In my real world app I got more subqueries that is based on the range selected by the first query. I think I just added a big overhead to let the DB make different selects.
Now I bascilly just call .ToList() after the first expression.
So my question is am I right about that the query builder combine different IQueryable when it builds the expression tree?
Yes, you are correct. LINQ expressions are lazily evaluated at the moment you evaluate them (via .ToList(), for example). At that point in time, Entity Framework will look at the total query and build an SQL statement to represent it.
In this particular case, it's probably wiser to not evaluate the first query, because the SQL database is optimized for performing set-based operations like grouping and counting. Rather than forcing the database to send all the Test objects across the wire, deserializing the results into in-memory objects, and then performing the grouping and counting locally, you will likely see better performance by having the SQL database just return the resulting Counts.
Is there a way to get count of resultset but return only top 5 records while making just one db hit instead of 2 (one for count and second for data)
There is not a particularly good way to do this in Entity Framework, at least as of v4. #Tobias writes a single LINQ query, but his suspicions are correct. You'll see multiple queries roll by in SQL Profiler.
Ignoring EF for a minute, this is a relatively complicated problem for SQL Server. Well, it's complicated once your data size gets large or your query gets complicated. You can get a flavor for what's involved here.
With that said, I wouldn't worry about it being 2 queries just yet. Don't optimize until you know it is an actual performance problem. You'll likely end up working around EF, maybe using the EF extensions and creating a stored proc that can take advantage of windowed functions and CTE's. Or maybe it will just return two result sets in a single procedure.
This little query should do the trick (I'm not sure if that's really just one physical query though, and it could be that the grouping is done in the code rather than in the DB), but it's definitely more convenient:
var obj = (from x in entities.SomeTable
let item = new { N = 1, x }
group item by item.N into g
select new { Count = g.Count(), First = g.Take(5) }).FirstOrDefault();
Nonetheless, just doing this in two queries will definitely be much faster (especially if you define them in one stored procedure, as proposed here).
The following (cut down) code excerpt is a Linq-To-Entities query that results in SQL (via ToTraceString) that is much slower than a hand crafted query. Am I doing anything stupid, or is Linq-to-Entities just bad at optimizing queries?
I have a ToList() at the end of the query as I need to execute it before using it to build an XML data structure (which was a whole other pain).
var result = (from mainEntity in entities.Main
where (mainEntity.Date >= today) && (mainEntity.Date <= tomorrow) && (!mainEntity.IsEnabled)
select new
{
Id = mainEntity.Id,
Sub =
from subEntity in mainEntity.Sub
select
{
Id = subEntity.Id,
FirstResults =
from firstResultEntity in subEntity.FirstResult
select new
{
Value = firstResultEntity.Value,
},
SecondResults =
from secondResultEntity in subEntity.SecondResult
select
{
Value = secondResultEntity.Value,
},
SubSub =
from subSubEntity in entities.SubSub
where (subEntity.Id == subSubEntity.MainId) && (subEntity.Id == subSubEntity.SubId)
select
new
{
Name = (from name in entities.Name
where subSubEntity.NameId == name.Id
select name.Name).FirstOrDefault()
}
}
}).ToList();
While working on this, I've also has some real problems with Dates. When I just tried to include returned dates in my data structure, I got internal error "1005".
Just as a general observation and not based on any practical experience with Linq-To-Entities (yet): having four nested subqueries inside a single query doesn't look like it's awfully efficient and speedy to begin with.
I think your very broad statement about the (lack of) quality of the SQL generated by Linq-to-Entities is not warranted - and you don't really back it up by much evidence, either.
Several well respected folks including Rico Mariani (MS Performance guru) and Julie Lerman (author of "Programming EF") have been showing in various tests that in general and overall, the Linq-to-SQL and Linq-to-Entities "engines" aren't really all that bad - they achieve overall at least 80-95% of the possible peak performance. Not every .NET app dev can achieve this :-)
Is there any way for you to rewrite that query or change the way you retrieve the bits and pieces that make up its contents?
Marc
Have you tried not materializing the result immediately by calling .ToList()? I'm not sure it will make a difference, but you might see improved performance if you iterate over the result instead of calling .ToList() ...
foreach( var r in result )
{
// build your XML
}
Also, you could try breaking up the one huge query into separate queries and then iterating over the results. Sending everything in one big gulp might be the issue.