I have the following LINQ query:
var queryGroups = (from p in db.cl_contact_event
                   select new Groups { inputFileName = p.input_file_name })
                  .Distinct();
Which translates to the following when run:
SELECT
[Distinct1].[C1] AS [C1],
[Distinct1].[input_file_name] AS [input_file_name]
FROM ( SELECT DISTINCT
[Extent1].[input_file_name] AS [input_file_name],
1 AS [C1]
FROM [mel].[cl_contact_event] AS [Extent1]
) AS [Distinct1]
Now I'm pretty sure that the reason there is a sub-select is that I have the base LINQ query surrounded by () and then call .Distinct() on it, but I don't know enough about LINQ to be sure. If that's indeed the case, is there a way to restructure the query so that a sub-select doesn't occur?
I know it probably seems that I'm just nit-picking here, but I'm curious.
In this case, I suspect that the actual root cause of the sub-query is the anonymous type constructor. Because you are not selecting a known entity, but rather an arbitrary object constructed from other entity values, the EF parser needs to make sure it can produce the exact set of fields -- whether from a single table, joined tables, calculated fields, or other sub-queries. The expression tree parser is very good at writing SQL statements out of LINQ queries whenever possible, but it's not omniscient. It processes queries in a systematic way that will always produce correct results (in the sense that you get what you asked for), though not always optimal ones.
As far as rewriting the query to eliminate the sub-select goes, first off: I don't see an obvious way to do so that eliminates the anonymous type and still produces correct results. More importantly, though, I wouldn't bother. Modern SQL servers like Sybase are very smart -- often smarter than the developer -- and very good at producing an optimal query plan from a query. Besides that, EF loves sub-queries, because they are a very good way to compose complex queries in an automated fashion. You will often find them even when your LINQ query did not appear to use them. Trying to eliminate them all from your queries will quickly become an exercise in futility.
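If you did want to experiment anyway, one variation worth trying is to run Distinct over the raw column and do the final shaping in memory. This is only a sketch; whether it removes the wrapper sub-select depends on the EF version and provider, so check the generated SQL before trusting it:

// Distinct over the scalar column often translates to a flat SELECT DISTINCT.
// The projection into Groups then happens client side, after AsEnumerable().
var queryGroups = db.cl_contact_event
    .Select(p => p.input_file_name)
    .Distinct()
    .AsEnumerable()
    .Select(name => new Groups { inputFileName = name });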
I wouldn't worry about this particular situation at all. SQL Server (and most likely any enterprise database) will optimize away the outer SELECT anyway. I would theorize that this SQL is generated because it is the most generic and reusable form of the statement. In my experience, this always happens on Distinct().
Related
I have often found that if I have too many joins in a LINQ query (whether using Entity Framework or NHibernate) and/or the shape of the resulting anonymous class is too complex, LINQ takes a very long time to materialize the result set into objects.
This is a generic question, but here's a specific example using NHibernate:
var libraryBookIdsWithShelfAndBookTagQuery = (from shelf in session.Query<Shelf>()
join sbttref in session.Query<ShelfBookTagTypeCrossReference>() on
shelf.ShelfId equals sbttref.ShelfId
join bookTag in session.Query<BookTag>() on
sbttref.BookTagTypeId equals (byte)bookTag.BookTagType
join btbref in session.Query<BookTagBookCrossReference>() on
bookTag.BookTagId equals btbref.BookTagId
join book in session.Query<Book>() on
btbref.BookId equals book.BookId
join libraryBook in session.Query<LibraryBook>() on
book.BookId equals libraryBook.BookId
join library in session.Query<LibraryCredential>() on
libraryBook.LibraryCredentialId equals library.LibraryCredentialId
join lcsg in session
.Query<LibraryCredentialSalesforceGroupCrossReference>()
on library.LibraryCredentialId equals lcsg.LibraryCredentialId
join userGroup in session.Query<UserGroup>() on
lcsg.UserGroupOrganizationId equals userGroup.UserGroupOrganizationId
where
shelf.ShelfId == shelfId &&
userGroup.UserGroupId == userGroupId &&
!book.IsDeleted &&
book.IsDrm != null &&
book.BookFormatTypeId != null
select new
{
Book = book,
LibraryBook = libraryBook,
BookTag = bookTag
});
// add a couple of where clauses, then...
var result = libraryBookIdsWithShelfAndBookTagQuery.ToList();
I know it's not the query execution, because I put a sniffer on the database and I can see that the query takes 0 ms, yet the code takes about a second to execute that query and bring back all of 11 records.
So yeah, this is an overly complex query, with 8 joins between 9 tables, and I could probably restructure it into several smaller queries. Or I could turn it into a stored procedure - but would that help?
What I'm trying to understand is, where is that red line crossed between a query that is performant and one that starts to struggle with materialization? What's going on under the hood? And would it help if this were a SP whose flat results I subsequently manipulate in memory into the right shape?
EDIT: in response to a request in the comments, here's the SQL emitted:
SELECT DISTINCT book4_.bookid AS BookId12_0_,
libraryboo5_.librarybookid AS LibraryB1_35_1_,
booktag2_.booktagid AS BookTagId15_2_,
book4_.title AS Title12_0_,
book4_.isbn AS ISBN12_0_,
book4_.publicationdate AS Publicat4_12_0_,
book4_.classificationtypeid AS Classifi5_12_0_,
book4_.synopsis AS Synopsis12_0_,
book4_.thumbnailurl AS Thumbnai7_12_0_,
book4_.retinathumbnailurl AS RetinaTh8_12_0_,
book4_.totalpages AS TotalPages12_0_,
book4_.lastpage AS LastPage12_0_,
book4_.lastpagelocation AS LastPag11_12_0_,
book4_.lexilerating AS LexileR12_12_0_,
book4_.lastpageposition AS LastPag13_12_0_,
book4_.hidden AS Hidden12_0_,
book4_.teacherhidden AS Teacher15_12_0_,
book4_.modifieddatetime AS Modifie16_12_0_,
book4_.isdeleted AS IsDeleted12_0_,
book4_.importedwithlexile AS Importe18_12_0_,
book4_.bookformattypeid AS BookFor19_12_0_,
book4_.isdrm AS IsDrm12_0_,
book4_.lightsailready AS LightSa21_12_0_,
libraryboo5_.bookid AS BookId35_1_,
libraryboo5_.libraryid AS LibraryId35_1_,
libraryboo5_.externalid AS ExternalId35_1_,
libraryboo5_.totalcopies AS TotalCop5_35_1_,
libraryboo5_.availablecopies AS Availabl6_35_1_,
libraryboo5_.statuschangedate AS StatusCh7_35_1_,
booktag2_.booktagtypeid AS BookTagT2_15_2_,
booktag2_.booktagvalue AS BookTagV3_15_2_
FROM shelf shelf0_,
shelfbooktagtypecrossreference shelfbookt1_,
booktag booktag2_,
booktagbookcrossreference booktagboo3_,
book book4_,
librarybook libraryboo5_,
library librarycre6_,
librarycredentialsalesforcegroupcrossreference librarycre7_,
usergroup usergroup8_
WHERE shelfbookt1_.shelfid = shelf0_.shelfid
AND booktag2_.booktagtypeid = shelfbookt1_.booktagtypeid
AND booktagboo3_.booktagid = booktag2_.booktagid
AND book4_.bookid = booktagboo3_.bookid
AND libraryboo5_.bookid = book4_.bookid
AND librarycre6_.libraryid = libraryboo5_.libraryid
AND librarycre7_.librarycredentialid = librarycre6_.libraryid
AND usergroup8_.usergrouporganizationid =
librarycre7_.usergrouporganizationid
AND shelf0_.shelfid = #p0
AND usergroup8_.usergroupid = #p1
AND NOT ( book4_.isdeleted = 1 )
AND ( book4_.isdrm IS NOT NULL )
AND ( book4_.bookformattypeid IS NOT NULL )
AND book4_.lightsailready = 1
EDIT 2: Here's the performance analysis from ANTS Performance Profiler:
It is often good database practice to place lots of joins, or super-common joins, into views. ORMs don't let you ignore these facts, nor do they replace the decades spent fine-tuning databases to do these kinds of things efficiently. Refactor those joins into a single view, or a couple of views if that makes more sense in the greater perspective of your application.
NHibernate should be optimizing the query down and reducing the data so that .NET only has to mess with the important parts. However, if those domain objects are just naturally large, that's still a lot of data. Also, if it's a really large result set in terms of rows returned, that's a lot of objects getting instantiated even if the DB is able to return the set quickly. Refactoring this query into a view that returns only the data you actually need would also reduce object-instantiation overhead.
Another thought would be to not do a .ToList(). Return the enumerable and let your code lazily consume the data.
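As a minimal sketch of that idea (LibraryBookDto is a hypothetical DTO; the mapping is assumed): return the enumerable and let the caller drive materialization. Note that the NHibernate session must stay open while the caller enumerates.

public IEnumerable<LibraryBookDto> GetLibraryBooks(ISession session, int bookId)
{
    // No ToList(): rows materialize only as the caller iterates,
    // so the session must outlive the enumeration.
    return session.Query<LibraryBook>()
                  .Where(lb => lb.BookId == bookId)
                  .Select(lb => new LibraryBookDto { LibraryBookId = lb.LibraryBookId });
}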
According to the profiling information, CreateQuery takes 45% of the total execution time. However, as you mentioned, the query took 0ms when you executed it directly. This alone is not enough to say there is a performance problem, because:
You are running the query under the profiler, which has a significant impact on execution time.
The profiler affects all of the code being profiled, but not the SQL execution time (because that happens inside SQL Server), so everything else looks slower compared to the SQL statement.
So the ideal scenario is to measure how long the entire code block takes, measure the time for the SQL query, and compare the two; if you do that, you will probably end up with different values.
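For example, a minimal sketch of that measurement using Stopwatch (the query variable comes from the question):

var sw = System.Diagnostics.Stopwatch.StartNew();
var result = libraryBookIdsWithShelfAndBookTagQuery.ToList(); // translate + execute + materialize
sw.Stop();
Console.WriteLine("Total: {0} ms", sw.ElapsedMilliseconds);
// Compare this against the bare SQL time from the sniffer; the difference is
// translation + network + materialization overhead, not query execution.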
However, I'm not saying that the NHibernate LINQ implementation is optimized for any query you come up with; there are other ways in NHibernate to deal with such situations, such as the QueryOver API, Criteria queries, HQL, and finally raw SQL.
Where is that red line crossed between a query that is performant and
one that starts to struggle with materialization? What's going on under the hood?
This is a pretty hard question, and without detailed knowledge of the NHibernate LINQ provider it's hard to give an accurate answer. You can always try the different mechanisms provided and see which one is best for a given scenario.
And would it help if this were a SP whose flat results I subsequently
manipulate in memory into the right shape?
Yes, using an SP would help things run pretty fast, but SPs add maintenance problems to your code base.
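If you do go that route, NHibernate can execute raw SQL or a stored procedure directly. A minimal sketch (dbo.GetLibraryBooks is a made-up procedure name):

var rows = session.CreateSQLQuery("EXEC dbo.GetLibraryBooks :shelfId, :userGroupId")
                  .SetInt32("shelfId", shelfId)
                  .SetInt32("userGroupId", userGroupId)
                  .List<object[]>(); // flat rows; shape them in memory afterwards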
You have a generic question, so I'll give you a generic answer :)
If you query data for reading (not for updating), try to use anonymous classes. The reason: they are lighter to create, and they have no navigation properties. And you select only the data you need! That's a very important rule. So, try to replace your select with something like this:
select new
{
Book = new { book.Id, book.Name},
LibraryBook = new { libraryBook.Id, libraryBook.AnotherProperty},
BookTag = new { bookTag.Name}
}
Stored procedures are good when the query is complex and the LINQ provider generates inefficient code; in that case you can replace it with plain SQL or a stored procedure. That's not often the case and, I think, it's not your situation.
Run your SQL query. How many rows does it return? Is it the same number as your result? Sometimes the LINQ provider generates SQL that selects many more rows than needed to materialize one entity. This happens when the entity has a one-to-many relationship with another entity being selected. For example:
class Book
{
    public int Id { get; set; }
    public string Name { get; set; }
    public ICollection<Tag> Tags { get; set; }
}

class Tag
{
    public string Name { get; set; }
    public Book Book { get; set; }
}
...
dbContext.Books.Where(o => o.Id == 1).Select(o => new { Book = o, Tags = o.Tags }).Single();
I select only one book with Id = 1, but the provider will generate SQL that returns as many rows as the book has tags (Entity Framework does this).
Split a complex query into a set of simple ones and join on the client side. Sometimes you have a complex query with many conditionals, and the resulting SQL becomes terrible. So you split your big query into simpler ones, get the results of each, and join/filter on the client side.
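A minimal sketch of that idea using the Book/Tag model above (the Tags set on the context is assumed):

// Two simple server-side queries...
var books = dbContext.Books.Select(b => new { b.Id, b.Name }).ToList();
var tags = dbContext.Tags.Select(t => new { BookId = t.Book.Id, t.Name }).ToList();

// ...joined on the client side with LINQ to Objects.
var tagsByBook = tags.ToLookup(t => t.BookId, t => t.Name);
var result = books.Select(b => new { b.Id, b.Name, Tags = tagsByBook[b.Id].ToList() });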
In the end, I advise you to use an anonymous class as the result of your select.
Don’t use Linq’s Join. Navigate!
In that post you can see:
As long as there are proper foreign key constraints in the database, the navigation properties will be created automatically. It is also possible to manually add them in the ORM designer. As with all LINQ to SQL usage I think that it is best to focus on getting the database right and have the code exactly reflect the database structure. With the relations properly specified as foreign keys the code can safely make assumptions about referential integrity between the tables.
I agree 100% with the sentiments expressed by everyone else (with regard to there being two parts to the optimisation here, and the SQL execution being a big unknown and likely cause of poor performance).
Another part of the solution that might help you get some speed is to pre-compile your LINQ statements. I remember this being a huge optimisation on a tiny (high-traffic) project I worked on ages and ages ago... it seems like it would address the client-side slowness you're seeing. Having said all that, though, I've not found a need to use them since... so heed everyone else's warnings first! :)
https://msdn.microsoft.com/en-us/library/vstudio/bb896297(v=vs.100).aspx
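For reference, here's a minimal sketch of a pre-compiled LINQ to SQL query (the model and context are hypothetical; the ObjectContext era of EF has an equivalent CompiledQuery in System.Data.Objects):

using System;
using System.Data.Linq;         // LINQ to SQL's CompiledQuery
using System.Data.Linq.Mapping;
using System.Linq;

[Table(Name = "Books")]
public class Book
{
    [Column(IsPrimaryKey = true)] public int BookId;
    [Column] public int ShelfId;
    [Column] public string Title;
}

public class LibraryDataContext : DataContext
{
    public LibraryDataContext(string connection) : base(connection) { }
    public Table<Book> Books { get { return GetTable<Book>(); } }
}

public static class BookQueries
{
    // Compiled once; the expression tree is translated to SQL a single time
    // instead of on every execution.
    public static readonly Func<LibraryDataContext, int, IQueryable<Book>> ByShelf =
        CompiledQuery.Compile((LibraryDataContext db, int shelfId) =>
            db.Books.Where(b => b.ShelfId == shelfId));
}

// Usage:
// using (var db = new LibraryDataContext(connectionString))
// {
//     var books = BookQueries.ByShelf(db, 42).ToList();
// }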
I built a dynamic LINQ to Entities query to support optional search parameters. It was quite a bit of work to get this producing performant SQL, and I am NEARLY there, but I've stumbled over a big issue with OrderBy: it gets translated into a kind of projection/sub-query containing the actual query, producing extremely slow SQL. I can't find a solution to get this right. Maybe someone can help me out :)
I'll spare you the complete query for now, as it is long and complex; instead, I've translated it into a simple sample for better understanding:
I'm doing something like this:
// Start with the base query
var query = from a in db.Articles
            where a.UserId == 1
            select a;

// Apply some optional conditions
if (tagParam != null)
    query = query.Where(a => a.Tag == tagParam);
if (authorParam != null)
    query = query.Where(a => a.Author == authorParam);
// ... and so on ...

// I only want the 50 most recent articles, so I finally apply OrderBy and Take
query = query.OrderByDescending(a => a.Published);
query = query.Take(50);
The resulting SQL strangely wraps the OrderBy in a container query:
select top 50 Id, Published, Title, Content
from (select Id, Published, Title, Content
      from Articles
      where UserId = 1
      and Author = #paramAuthor) as t
order by Published desc
Note also that the TOP 50 was moved to the outer query. If I used only Take(50), the TOP 50 statement would correctly be applied to the inner query above (the outer query wouldn't even exist). Only when I use OrderBy does LINQ to Entities take this container-query approach.
This causes a very bad execution plan, where the inner query reads all articles that match the parameters from disk and passes them to the outer query - and only there are the OrderBy and TOP processed. In my case, this can be hundreds of thousands of rows. I already tried moving the ORDER BY manually into the inner statement and executing that - it produces much better results, as the existing indexes allow SQL Server to easily find the top 50 rows in the right order without reading all rows from disk.
Is there any way I can get EF to append the order by clause to the inner query? Or any other trick to get this working right?
Any help would be greatly appreciated :)
Edit: As additional information, some tests with less complex queries showed that the optimizer normally handles such sub-query scenarios well. In my scenario the optimizer unfortunately fails and moves hundreds of thousands of rows through the query plan. Moving the OrderBy to the inner query solves it, and the optimizer gets it right.
Edit 2: After a couple more hours of testing, it seems that the bad execution plan is a SQL Server issue not caused by the generated container query. While moving the ORDER BY and TOP clauses into the inner query did fix the issue initially, I can't reproduce that anymore; SQL Server now uses the bad execution plan here too (while the data in the DB remained unchanged). Moving the ORDER BY clause may have caused SQL Server to take other statistics into account, but it seems it was not the better/cleaner query design itself. However, I still want to know why EF uses a container query here and whether I can influence this behavior. Even if it doesn't improve performance, debugging would at least be easier if the generated EF queries were more straightforward and less convoluted.
I am new to LINQ queries. I have read/researched all the advantages of LINQ queries over SQL, but I have one basic question: why do we need to use these queries when I feel their syntax is more complicated than traditional SQL queries?
For example look at below example for simple Left Outer Join
var q=(from pd in dataContext.tblProducts
join od in dataContext.tblOrders on pd.ProductID equals od.ProductID into t
from rt in t.DefaultIfEmpty()
orderby pd.ProductID
select new
{
//To handle null values do type casting as int?(NULL int)
//since OrderID is defined NOT NULL in tblOrders
OrderID=(int?)rt.OrderID,
pd.ProductID,
pd.Name,
pd.UnitPrice,
//no need to check for null since it is defined NULL in database
rt.Quantity,
rt.Price,
})
.ToList();
So, the point of LINQ (Language Integrated Query) is to provide easy ways of working with enumerable collections in memory. Contrast this with SQL, which is a language for determining what the user gets from a set of data in a database.
Because of the SQL-like syntax, it's easy to confuse LINQ code with SQL, and think that they're 'alike' - they're really not. SQL gets a subset of data from a superset; LINQ is 'syntactic sugar' that hides common operations involving foreach loops.
For instance, this is a common programming pattern:
foreach(Thing thing in things)
{
if(thing.SomeProperty() == "Some Value")
return true;
}
...this is done rather easily in LINQ:
return things.Any(t => t.SomeProperty() == "Some Value");
The two pieces of code are functionally the same, and I'm pretty sure they even compile to roughly the same IL. The difference is how it looks to you.
You don't have to use LINQ; you can choose to use a standard foreach, and there are times, such as complex loops, where it is useful to do so. Ultimately it is a question of readability - my counter-question to you is, is the LINQ version of your foreach loop more, or less, readable than the original foreach loop?
If the answer is 'less', then I suggest converting it back to a foreach.
I'm by no means a SQL or a LINQ expert; I use them both.
There is a trend to make LINQ either into something bad or into a silver bullet, depending on which side you're on.
You need to seriously consider your project requirements in order to choose. The choice is not mutually exclusive: take what is good from both.
Advantages
Quick turnaround for development
Queries can be composed dynamically
Tables are automatically turned into classes
Columns are automatically turned into properties
Relationships are automatically added to the classes
Lambda expressions are awesome
Data is easy to set up and use
Disadvantages
No clear outline for tiers
No good way of handling view permissions
For small data sets, building the query takes longer than executing it
There is an overhead for creating queries
When joins are moved from SQL to the application side, they are very slow
DBML concurrency issues
Advanced queries using Expressions are hard to understand
I've found that programmers used to SQL will have a hard time figuring out the tricks of LINQ, but programmers who know SQL without having done a ton of work in it will pick up LINQ quicker.
The main issue when people start using LINQ is that they keep thinking in the SQL way: they design the SQL query first and then translate it to LINQ. You need to learn how to think in the LINQ way, and your LINQ queries will become neater and simpler. For instance, in your LINQ you don't need joins; you should use associations/navigation properties instead. Check this post for more details.
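As a hedged sketch of what that looks like for the earlier tblProducts/tblOrders example (assuming the foreign key gives tblProduct a tblOrders collection navigation property), the left outer join collapses into a navigation:

var q = (from pd in dataContext.tblProducts
         from od in pd.tblOrders.DefaultIfEmpty() // left outer join via the association
         orderby pd.ProductID
         select new
         {
             OrderID = (int?)od.OrderID,
             pd.ProductID,
             pd.Name,
             pd.UnitPrice,
             od.Quantity,
             od.Price,
         }).ToList();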
I'm from the old school, where the DB had all data access encapsulated in views, procedures, etc. Now I'm forcing myself to use LINQ for most of the obvious queries.
What I'm wondering, though, is when to stop and what's practical? Today I needed to run a query like this:
SELECT D.DeviceKey, D.DeviceId, DR.DriverId, TR.TruckId, LP.Description
FROM dbo.MBLDevice D
LEFT OUTER JOIN dbo.DSPDriver DR ON D.DeviceKey = DR.DeviceKey
LEFT OUTER JOIN dbo.DSPTruck TR ON D.DeviceKey = TR.DeviceKey
LEFT OUTER JOIN
(
SELECT LastPositions.DeviceKey, P.Description, P.Latitude, P.Longitude, P.Speed, P.DeviceTime
FROM dbo.MBLPosition P
INNER JOIN
(
SELECT D.DeviceKey, MAX(P.PositionKey) LastPositionKey
FROM dbo.MBLPosition P
INNER JOIN dbo.MBLDevice D ON P.DeviceKey = D.DeviceKey
GROUP BY D.DeviceKey
) LastPositions ON P.PositionKey = LastPositions.LastPositionKey
) LP ON D.DeviceKey = LP.DeviceKey
WHERE D.IsActive = 1
Personally, I'm not able to write the corresponding LINQ. So I found a tool online and got back a two-page-long LINQ query. It works properly - I can see it in the profiler - but it's not maintainable, IMO. Another problem is that I'm doing a projection and getting an anonymous object back. Alternatively, I can manually create a class and project into that custom class.
At this point I wonder if it is better to create a view on SQL Server and add it to my model? That would break my "all SQL on the client side" mantra, but it would be easier to read and maintain. No?
I wonder where you stop with T-SQL vs. LINQ?
EDIT
Model description.
I have DSPTrucks, DSPDrivers and MBLDevices.
A device can be attached to a truck, to a driver, or to both.
I also have MBLPositions, which are basically pings from the devices (timestamp and GPS position).
What this query does: in one shot it returns all the device-truck-driver information, so I know what each device is attached to, and it also gets me the last GPS position for those devices. The response may look like so:
There is some redundant stuff but it's OK. I need to get it in one query.
In general, I would also default to LINQ for most simple queries.
However, when you get to a point where the corresponding LINQ query becomes harder to write and maintain, then what's the point, really? So I would simply leave that query in place. It works, after all. To make it easier to use, it's pretty straightforward to map a view or *cough* stored procedure in your EF model. Nothing wrong with that, really (IMO).
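If you take the view route with EF Code First, the mapping can be as simple as this sketch (vw_DeviceSummary and the class are hypothetical; EF treats a view mapped this way as a read-only table):

public class DeviceSummary
{
    public int DeviceKey { get; set; }
    public string DeviceId { get; set; }
    public int? DriverId { get; set; }
    public int? TruckId { get; set; }
    public string Description { get; set; }
}

// In DbContext.OnModelCreating:
// modelBuilder.Entity<DeviceSummary>()
//             .HasKey(d => d.DeviceKey)
//             .ToTable("vw_DeviceSummary"); // the view built from the SQL above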
You can, first of all, store LINQ queries in variables, which may help make them not only more readable but also reusable.
An example might look like the following:
var redCars = from c in cars
where c.Colour == "red"
select c;
var redSportsCars = from c in redCars
where c.Type == "Sports"
select c;
Queries are lazily executed and not composed until you compile them or iterate over them, so you'll notice in the profiler that this does produce an efficient query.
You will also benefit from defining relationships in the model and using navigation properties rather than the LINQ join syntax. This (again) makes the relationships reusable between queries, and more readable, because you don't have to spell out the relationships in the query the way the SQL above does.
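A sketch of what that buys you with the device query from this question (assuming MBLDevice exposes DSPTruck and DSPDriver navigation properties):

// No explicit joins: the relationships live in the model, not in the query.
var q = from d in context.MBLDevices
        where d.IsActive
        select new
        {
            d.DeviceKey,
            d.DeviceId,
            DriverId = (int?)d.DSPDriver.DriverId,
            TruckId = (int?)d.DSPTruck.TruckId,
        };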
Generally speaking your LINQ query will be shorter than the equivalent SQL, but I'd suggest trying to work it out by hand rather than using a conversion tool.
With the exception of CTEs (which I'm fairly sure you can't do in LINQ), I would write all queries in LINQ these days.
I find that when using LINQ it's best to ignore whatever SQL it generates, as long as it retrieves the right thing and is performant; only when one of those fails do I actually look at what it's generating.
In terms of maintainability, you shouldn't really worry about the SQL being maintainable, but rather about the LINQ query that generates the SQL.
In the end, if the SQL is not quite right, I believe there are various things you can do to make LINQ generate SQL more along the lines you want... to some extent.
AFAIK there isn't any inherent problem with getting anonymous objects back; however, if you are doing it in multiple places, you may want to create a class to keep things neater.
How can I return the first 100 records using LINQ?
I have a table with 40 million records.
This code works, but it's slow, because it will return all values before filtering:
var values = (from e in dataContext.table_sample
where e.x == 1
select e)
.Take(100);
Is there a way to return the rows already filtered, like the T-SQL TOP clause does?
No, that doesn't return all the values before filtering. The Take(100) will end up being part of the SQL sent to the server - quite possibly using TOP.
Of course, it makes more sense to do that when you've specified an orderby clause.
LINQ doesn't execute the query when it reaches the end of your query expression. It only sends SQL when you either call an aggregation operator (e.g. Count or Any) or start iterating through the results. Even calling Take doesn't actually execute the query - you might want to add more filtering afterwards, for instance, which could end up being part of the query.
When you start iterating over the results (typically with foreach) - that's when the SQL will actually be sent to the database.
(I think your where clause is a bit broken, by the way. If you've got problems with your real code it would help to see code as close to reality as possible.)
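To make the deferred execution concrete, here's a minimal sketch using the question's table (the foreach is what finally sends the SQL):

var values = dataContext.table_sample.Where(e => e.x == 1).Take(100);
// Nothing has hit the database yet; 'values' is just a query definition.

foreach (var v in values) // the SQL (with TOP) is sent on the first iteration
{
    Console.WriteLine(v);
}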
I don't think you are right about it returning all records before taking the top 100. I think LINQ decides what the SQL string is going to be at the time the query is executed (a.k.a. lazy loading), and your database server will optimize it out.
Have you compared a standard SQL query with your LINQ query? Which one is faster, and how significant is the difference?
I do agree with the above comments that your LINQ query is generally correct, but...
in your 'where' clause, the condition should probably be x == 1, not x = 1 (comparison instead of assignment)
'select e' will return all columns, whereas you probably need only some of them - be more precise with the select clause (list only the required columns); 'select *' is a waste of resources
make sure your database is well indexed, and try to make use of the indexed data
Anyway, a 40-million-record database is quite huge - do you need all that data all the time? Maybe some kind of partitioning can reduce it to the most commonly used records.
I agree with Jon Skeet, but just wanted to add:
The generated SQL will use TOP to implement Take().
If you're able to run SQL Profiler and step through your code in debug mode, you will be able to see exactly what SQL is generated and when it gets executed. If you find the time to do this, you will learn a lot about what happens underneath.
There is also a DataContext.Log property that you can assign a TextWriter to view the SQL generated, for example:
dbContext.Log = Console.Out;
Another option is to experiment with LINQPad. LINQPad allows you to connect to your data source and easily try different LINQ expressions. In the results panel, you can switch to see the SQL generated by the LINQ expression.
I'm going to go out on a limb and guess that you don't have an index on the column used in your where clause. If that's the case, then it's undoubtedly doing a table scan when the query is materialized, and that's why it's taking so long.