Does "where" position in LINQ query matter when joining in-memory?


Situation: Say we are executing a LINQ query that joins two in-memory lists (so no DbSets or SQL-query generation involved) and this query also has a where clause. This where only filters on properties included in the original set (the from part of the query).
Question: Does the LINQ query interpreter optimize this query so that it executes the where before it performs the join, regardless of whether I write the where before or after the join? That way it would not have to join elements that are filtered out later anyway.
Example: I have a categories list I want to join with a products list. However, I am only interested in the category with ID 1. Does the LINQ interpreter internally perform the exact same operations regardless of whether I write:
from category in categories
join prod in products on category.ID equals prod.CategoryID
where category.ID == 1 // <------ below join
select new { Category = category.Name, Product = prod.Name };
or
from category in categories
where category.ID == 1 // <------ above join
join prod in products on category.ID equals prod.CategoryID
select new { Category = category.Name, Product = prod.Name };
Previous research: I already saw this question, but its author stated that his/her question only targets non-in-memory cases with generated SQL. I am explicitly interested in LINQ executing a join on two in-memory lists.
Update: This is not a duplicate of the "Order execution of chain linq query" question, as the referenced question clearly refers to a DbSet and my question explicitly addresses a non-db scenario. (Moreover, although similar, I am not asking about inclusions based on navigational properties here but about "joins".)
Update2: Although very similar, this is also not a duplicate of "Is order of the predicate important when using LINQ?", as I am asking explicitly about in-memory situations and I cannot see the referenced question explicitly addressing this case. Moreover, that question is a bit old and I am actually interested in LINQ in the context of .NET Core (which didn't exist in 2012), so I updated the tag of this question to reflect this second point.
Please note: With this question I am aiming at whether the linq query interpreter somehow optimizes this query in the background and am hoping to get a reference to a piece of documentation or source code that shows how this is done by linq. I am not interested in answers such as "it does not matter because the performance of both queries is roughly the same".

The LINQ query syntax is compiled to a method chain. For details, see e.g. this question.
The first LINQ query will be compiled to the following method chain:
categories
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { category, prod })
.Where(t => t.category.ID == 1)
.Select(t => new { Category = t.category.Name, Product = t.prod.Name });
The second one:
categories
.Where(category => category.ID == 1)
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { Category = category.Name, Product = prod.Name });
As you can see, the second query causes fewer allocations (note that it uses only one anonymous type versus two in the first query, and consider how many instances of those anonymous types are created when the query runs).
Furthermore, it's clear that the first query performs the join on a lot more data than the second (already filtered) one.
There will be no additional query optimization in case of LINQ-to-objects queries.
So the second version is preferable.
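The claim that LINQ to Objects does not reorder the operators can be checked with a small experiment. The following is a minimal sketch, not taken from the answer above: the class WherePositionDemo, its nested types and the method CountOuterKeyCalls are hypothetical names introduced only for illustration. It counts how often Join evaluates the outer key selector, which happens once for every element that actually reaches the join.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WherePositionDemo
{
    // Hypothetical stand-ins for the question's categories and products.
    public class Category { public int ID; public string Name; }
    public class Product { public int CategoryID; public string Name; }

    // Returns how many times Join evaluated the outer key selector,
    // i.e. how many categories actually entered the join.
    public static int CountOuterKeyCalls(bool filterBeforeJoin)
    {
        var categories = Enumerable.Range(1, 1000)
            .Select(i => new Category { ID = i, Name = "Category " + i })
            .ToList();
        var products = new List<Product>
        {
            new Product { CategoryID = 1, Name = "Apple" },
            new Product { CategoryID = 2, Name = "Pear" },
        };

        int calls = 0;
        Func<Category, int> outerKey = c => { calls++; return c.ID; };

        IEnumerable<string> query = filterBeforeJoin
            // where above join: only the matching category reaches Join
            ? categories
                .Where(c => c.ID == 1)
                .Join(products, outerKey, p => p.CategoryID, (c, p) => p.Name)
            // where below join: every category is joined, then filtered
            : categories
                .Join(products, outerKey, p => p.CategoryID, (c, p) => new { c, p })
                .Where(t => t.c.ID == 1)
                .Select(t => t.p.Name);

        query.ToList(); // force execution, since LINQ to Objects is deferred
        return calls;
    }

    public static void Main()
    {
        Console.WriteLine(CountOuterKeyCalls(filterBeforeJoin: false)); // 1000
        Console.WriteLine(CountOuterKeyCalls(filterBeforeJoin: true));  // 1
    }
}
```

With the filter below the join, all 1000 categories pass through the join; with the filter above it, only one does. Both variants return the same result, so the difference is purely in the work performed.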

For in-memory lists (IEnumerables), no optimization is applied; the query is executed in the order in which the operators are chained.
I also tried converting the list to IQueryable first (AsQueryable) and then applying the filter, but the conversion overhead is apparently quite high for this big table.
I made a quick test for this case.
Console.WriteLine($"List Row Count = {list.Count()}");
Console.WriteLine($"JoinList Row Count = {joinList.Count()}");
var watch = Stopwatch.StartNew();
var result = list.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result.Dump();
watch.Stop();
Console.WriteLine($"Result1 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result2 = list
.Where(t => t.Prop3 == "Prop13")
.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result2.Dump();
watch.Stop();
Console.WriteLine($"Result2 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result3 = list.AsQueryable().Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result3.Dump();
watch.Stop();
Console.WriteLine($"Result3 Elapsed = {watch.ElapsedTicks}");
Findings:
List Count = 100
JoinList Count = 10
Result1 Elapsed = 27
Result2 Elapsed = 17
Result3 Elapsed = 591
List Count = 1000
JoinList Count = 10
Result1 Elapsed = 20
Result2 Elapsed = 12
Result3 Elapsed = 586
List Count = 100000
JoinList Count = 10
Result1 Elapsed = 603
Result2 Elapsed = 19
Result3 Elapsed = 1277
List Count = 1000000
JoinList Count = 10
Result1 Elapsed = 1469
Result2 Elapsed = 88
Result3 Elapsed = 3219

Related

What is the difference between group n by vs group n by into g in LINQ?

I notice that both LINQ queries produce the same output. What is the difference between these two queries for grouping? Is it because into can group by two elements?
var groupBy = from n in numbers
group n by n;
And:
var groupBy = from n in numbers
group n by n into g
select g;

The difference stands out immediately in method syntax:
var groupBy = numbers.GroupBy(n => n);
vs. (with into)
var groupBy = numbers.GroupBy(n => n).Select(g => g);
Now your example isn't too useful to demonstrate the practical differences because each group is just one item, so let's take this example:
var group = from c in Company
group c by c.City;
If this is all we need, listing companies by cities, we're done. But if we want to do anything with the results of the grouping we need into and select, for example:
var group = from c in Company
group c by c.City
into cg
select new
{
City = cg.Key,
NumberOfCompanies = cg.Count()
};
In method syntax:
var group = Company
.GroupBy(c => c.City)
.Select(cg => new
{
City = cg.Key,
NumberOfCompanies = cg.Count()
});
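To make the continuation runnable outside a database, here is a minimal in-memory sketch. The class GroupIntoDemo, the method CompaniesPerCity and the tuple shape used for a company are assumptions made for this illustration and are not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupIntoDemo
{
    // Groups companies by city and counts them, using "into" to continue
    // the query with the groups as the new range variable cg.
    public static Dictionary<string, int> CompaniesPerCity(
        IEnumerable<(string Name, string City)> companies)
    {
        var query = from c in companies
                    group c by c.City into cg
                    select new { City = cg.Key, NumberOfCompanies = cg.Count() };

        return query.ToDictionary(x => x.City, x => x.NumberOfCompanies);
    }

    public static void Main()
    {
        var companies = new[]
        {
            (Name: "A", City: "Oslo"),
            (Name: "B", City: "Oslo"),
            (Name: "C", City: "Bergen"),
        };

        foreach (var pair in CompaniesPerCity(companies))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

Without into, the query would have to end at the group clause, and the per-city count could not be projected in the same expression.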

https://codeblog.jonskeet.uk/2010/09/15/query-expression-syntax-continuations/
When "into" is used after either a "group x by y" or "select x"
clause, it’s called a query continuation. (Note that "join … into"
clauses are not query continuations; they’re very different.) A query
continuation effectively says, "I’ve finished one query, and I want to
do another one with the results… but all in one expression."
The into keyword makes your query a continuation: it effectively starts a new query with the results of the old one in a new range variable.
You can also see how they are Compiled.

Please help me write LINQ statement for following SQL query

My Db named MyDB has 5 tables: MSize, MModel, Met, MResult, SResult. They are connected as follows:
MSize has a common field MSizeId with MModel.
MModel links with Met with MModelId.
Met can be linked with MResult on basis of MId.
Similarly SResult can be linked with MResult on SResultId.
My aim is to get the average accuracy of all items (the Description field in the MSize table) with Acc (decimal data type) >= 70 and <= 130, grouped by description.
Here is my SQL query:
use MyDB;
SELECT a.[Description],AVG(CASE WHEN d.[Acc] >= 70 AND d.[Acc] <= 130 THEN d.[Acc] END)
FROM MSize a
INNER JOIN MModel b ON a.MSizeId = b.MSizeId
INNER JOIN Met c ON b.MModelId = c.MModelId
INNER JOIN MResult d ON c.MId = d.MId
INNER JOIN SResult e ON d.SResultId = e.SResultId
GROUP BY a.Description
This query gives me the correct result on SQL server.
I have been struggling to write a LINQ query for the same. The problem comes with the SQL CASE statement. I don't want to specify the false result of the CASE, meaning, if d.acc doesn't fall in the range specified in SQL query, discard it.
Assuming all Model classes and fields have the same name as these DBtables and columns. What can be the LINQ query for the given SQL statement?
You can fill up the code here in curly braces:
using (var db = new MyDBContext()){ }
here MyDBContext refers to Partial Class Data Model template generated by LINQ

You didn't bother to write the classes, and I'm not going to do that for you.
Apparently you have a sequence of MSizes, where every Msize has zero or more MModels. Every MModel has zero or more Mets, and every Met has zero or more MResults, and every MResult has an Acc.
You also forgot to write your requirements in words, so I had to extract them from your SQL query.
It seems that you want the Description of every MSize together with the average of all its Accs that have a value between 70 and 130.
If you use Entity Framework, you can use the virtual ICollection navigation properties, which makes life fairly easy. I'll do it in two steps, because below I do the same with a GroupJoin without using the ICollection. The second part is the same for both methods.
First I'll fetch the Description of every MSize, together with all its deeper Acc that are in the MResults of the Mets of the MModels of this MSize:
var descriptionsWithTheirAccs = dbContext.MSizes.Select(msize => new
{
    Description = msize.Description,
    // SelectMany the inner collections until you see the Accs
    Accs = msize.MModels
        // flatten the Mets of all MModels of this MSize
        .SelectMany(model => model.Mets)
        // flatten the MResults of all those Mets
        .SelectMany(met => met.MResults)
        // from every MResult take the Acc
        .Select(mResult => mResult.Acc),
});
Now that we have the Description of every MSize with all Accs that it has deep inside it,
we can throw away all Accs that we don't want and Average the remaining ones:
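var result = descriptionsWithTheirAccs.Select(descriptionWithItsAccs => new
{
    Description = descriptionWithItsAccs.Description,
    Average = descriptionWithItsAccs.Accs
        .Where(acc => 70 <= acc && acc <= 130)
        // cast to decimal? so that a Description without any Acc in range
        // yields null (like SQL AVG) instead of failing on an empty sequence
        .Select(acc => (decimal?)acc)
        // and the average of all remaining Accs
        .Average(),
});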
If you don't have access to the ICollections, you'll have to do the GroupJoin yourself, which looks pretty horrible if you have so many tables:
var descriptionsWithTheirAccs = dbContext.MSizes.GroupJoin(dbContext.MModels,
    msize => msize.MSizeId,
    mmodel => mmodel.MSizeId,
    (msize, mmodels) => new
    {
        Description = msize.Description,
        Accs = mmodels
            .GroupJoin(dbContext.Mets,
                mmodel => mmodel.MModelId,
                met => met.MModelId,
                (mmodel, metsOfThisModel) => metsOfThisModel
                    .GroupJoin(dbContext.MResults,
                        met => met.MId,
                        mresult => mresult.MId,
                        // result selector
                        (met, mresults) => mresults.Select(mresult => mresult.Acc))
                    // flatten to one sequence of Accs per MModel
                    .SelectMany(accs => accs))
            // flatten to one sequence of Accs per MSize
            .SelectMany(accs => accs),
    });
Now that you have the descriptionsWithTheirAccs, you can use the Select above to calculate the averages.
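The filter-then-average step can be tried on in-memory data without a DbContext. The following is only a sketch under stated assumptions: the class ConditionalAverageDemo, the method AverageAccInRange and the flat (Description, Acc) rows are hypothetical and stand in for the already joined tables. It mirrors the SQL AVG(CASE WHEN ... END) behaviour by averaging nullable values, so a description without any Acc in range yields null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConditionalAverageDemo
{
    // For every Description, averages only the Acc values in [70, 130].
    // Average over decimal? returns null for an empty sequence,
    // which matches what SQL AVG returns when no row qualifies.
    public static Dictionary<string, decimal?> AverageAccInRange(
        IEnumerable<(string Description, decimal Acc)> rows)
    {
        return rows
            .GroupBy(r => r.Description)
            .ToDictionary(
                g => g.Key,
                g => g.Where(r => r.Acc >= 70m && r.Acc <= 130m)
                      .Select(r => (decimal?)r.Acc)
                      .Average());
    }

    public static void Main()
    {
        var rows = new[]
        {
            (Description: "Small", Acc: 80m),
            (Description: "Small", Acc: 120m),
            (Description: "Small", Acc: 200m), // out of range, discarded
            (Description: "Large", Acc: 10m),  // out of range, discarded
        };

        foreach (var pair in AverageAccInRange(rows))
        {
            Console.WriteLine(pair.Key + ": " + (pair.Value?.ToString() ?? "null"));
        }
    }
}
```

Averaging a non-nullable sequence instead would throw in LINQ to Objects as soon as one description has no value in range, which is why the sketch casts to decimal? first.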

EF Creating business objects in Linq or in foreach

I am measuring differences in query execution and stumbled upon a case I have no explanation for. The query should retrieve 10000 customers with their main address (a customer can have many addresses). We used 2 different methods with Navigation Properties which differ greatly in execution time.
The first method retrieves the customers the way I usually write Linq queries: write the results directly to a business object and calling ToList(). This method takes 25 seconds to execute.
The second method retrieves the customers as a list of EF Entities first. The EF Entities are converted to business objects in a foreach loop. This method takes 2 seconds to execute.
Can someone explain the difference? And is it possible to modify the first method so the execution time is similar to the second?
private List<ICustomer> NavigationProperties_SO(int method)
{
using (Entities context = new Entities())
{
context.Database.Log = s => System.Diagnostics.Debug.WriteLine(s);
context.Configuration.ProxyCreationEnabled = false;
context.Configuration.AutoDetectChangesEnabled = false;
List<ICustomer> customerList = new List<ICustomer>();
if (method == 1)
{
// Execution time: 25 seconds
customerList = (from c in context.cust
.Include(o => o.AddressList)
.Include(o => o.AddressList.Select(p => p.ADDR))
let mainAddress = c.AddressList.Where(o => o.Main_addr == "1").FirstOrDefault()
select new Customer
{
cust = c,
mainAddress = mainAddress,
addr = mainAddress == null ? null : mainAddress.ADDR
}).AsNoTracking().ToList<ICustomer>();
}
else if (method == 2)
{
// Execution time: 2 seconds
var tempList = (from c in context.cust
.Include(o => o.AddressList)
.Include(o => o.AddressList.Select(p => p.ADDR))
select c).AsNoTracking().ToList();
foreach (var c in tempList)
{
ICustomer customer = new Customer();
var mainaddress = c.AddressList.Where(o => o.Main_addr == "1").FirstOrDefault();
customer.cust = c;
customer.mainAddress = mainaddress;
customer.addr = mainaddress == null ? null : mainaddress.ADDR;
customerList.Add(customer);
}
}
return customerList;
}
}
Edit
Here are the (simplified) queries generated by Entity Framework:
Method 1
SELECT
*
FROM [DBA].[CUST] AS [Extent1]
OUTER APPLY (SELECT TOP ( 1 )
*
FROM [DBA].[CUST_ADDR] AS [Extent2]
WHERE (([Extent1].[Id] = [Extent2].[Id]) AND (N'1' = [Extent2].[Main_addr]))
ORDER BY 'a' ) AS [Limit1]
LEFT OUTER JOIN [DBA].[ADDR] AS [Extent3] ON [Limit1].[Id] = [Extent3].[Id]
Method 2
SELECT
*
FROM ( SELECT
*
FROM [DBA].[CUST] AS [Extent1]
LEFT OUTER JOIN (SELECT *
FROM [DBA].[CUST_ADDR] AS [Extent2]
LEFT OUTER JOIN [DBA].[ADDR] AS [Extent3] ON [Extent2].[Id] = [Extent3].[Id] ) AS [Join1] ON ([Extent1].[Id] = [Join1].[Id])
) AS [Project1]
The difference is that the first method does the filtering in the query (let) while the second method retrieves all records and filters in the loop.

I suspect
let mainAddress = c.AddressList.Where(o => o.Main_addr == "1").FirstOrDefault()
is the culprit. Certain queries force EF to ask for all possible combinations to be returned. EF then spends a little time narrowing down the scope before it provides you with a reasonable result set. You can use SQL Server Profiler to look at the queries generated.
In any case, you can use LINQ, rather than a foreach, at the end of your second method (this won't help performance, but readability might improve):
return tempList.Select(c => new Customer{cust=c, mainAddress = c.AddressList.FirstOrDefault(o=>o.Main_addr=="1"), ...);

Answer related to comments... (but too long for a comment)
For the "how to choose the best syntax" part
I would say that it comes partially from "experience" (see, 9Rune5 and I both suspected the same problematic point before seeing the generated SQL), but experience may sometimes also lead to wrong conclusions ;)
So to be a little bit more pragmatic, I would suggest using tools/libs which help you look at the generated SQL and the time taken per query or page...
ANTS Performance profiler, Miniprofiler, Sql Server profiler, etc, it may depend on your technologies / needs...
By the way, if you want to keep a "linq" syntax, you could go for
var tempList = context.cust
.Include(o => o.AddressList)
.Include(o => o.AddressList.Select(p => p.ADDR))
.AsNoTracking()
.ToList();
var result = (from c in tempList
let mainAddress = c.AddressList.Where(o => o.Main_addr == "1").FirstOrDefault()
select new Customer
{
cust = c,
mainAddress = mainAddress,
addr = mainAddress == null ? null : mainAddress.ADDR
}).ToList<ICustomer>();
But not really less verbose than the foreach syntax...

sub linq query is making this take a very long time, how can I make this faster?

I have a list of employees that I build like this:
var employees = db.employees.Where(e => e.isActive == true).ToList();
var latestSales = db.employee_sales.Where(x => x.returned == false);
Now what I want is a result like this:
int employeeId
List<DateTime> lastSaleDates
So I tried this, but the query takes a very very long time to finish:
var result =
(from e in employees
select new EmployeeDetails
{
EmployeeId = e.employeeId,
LastSaleDates =
(from lsd in latestSales.Where(x => x.EmployeeId == e.EmployeeId)
.Select(x => x.SaleDate)
select lsd).ToList()
});
The above works, but literally takes 1 minute to finish.
What is a more efficient way to do this?

You can use join to get all data in single query
var result = from e in db.employees.Where(x => x.isActive)
join es in db.employee_sales.Where(x => !x.returned)
on e.EmployeeId equals es.EmployeeId into g
select new {
EmployeeId = e.employeeId,
LastSaleDates = g.Select(x => x.SaleDate)
};
Unfortunately you can't use the ToList() method with LINQ to Entities. So either map the anonymous objects manually to your EmployeeDetails or change the type of LastSaleDates to IEnumerable<DateTime>.
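The join ... into form above is a group join. Here is a minimal in-memory sketch of the same shape in method syntax; the class GroupJoinDemo, the method SaleDatesByEmployee and the tuple layout for a sale are hypothetical names chosen for this illustration, and no database is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupJoinDemo
{
    // Pairs every employee id with the dates of its non-returned sales.
    // GroupJoin keeps employees without sales (they get an empty list),
    // and the sales sequence is enumerated once rather than once per employee.
    public static Dictionary<int, List<DateTime>> SaleDatesByEmployee(
        IEnumerable<int> employeeIds,
        IEnumerable<(int EmployeeId, DateTime SaleDate, bool Returned)> sales)
    {
        return employeeIds
            .GroupJoin(
                sales.Where(s => !s.Returned),
                id => id,
                s => s.EmployeeId,
                (id, g) => new
                {
                    EmployeeId = id,
                    Dates = g.Select(s => s.SaleDate).ToList(),
                })
            .ToDictionary(x => x.EmployeeId, x => x.Dates);
    }

    public static void Main()
    {
        var sales = new[]
        {
            (EmployeeId: 1, SaleDate: new DateTime(2020, 1, 5), Returned: false),
            (EmployeeId: 1, SaleDate: new DateTime(2020, 2, 9), Returned: true),
            (EmployeeId: 2, SaleDate: new DateTime(2020, 3, 1), Returned: false),
        };

        foreach (var pair in SaleDatesByEmployee(new[] { 1, 2, 3 }, sales))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value.Count + " sale(s)");
        }
    }
}
```

In LINQ to Objects, GroupJoin builds a lookup over the inner sequence once, which avoids the per-employee rescans of the nested subquery in the question.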

Your calls to ToList are pulling things into memory. You should opt to build up a LINQ expression instead of pulling an entire query into memory. In your second query, you are issuing a new query for each employee, since you are then operating in the LINQ-to-Objects domain (as opposed to in EF). Try removing your calls to ToList.
You should also look into using Foreign Key Association Properties to make this query a lot nicer. Association properties are some of the most powerful and useful parts of EF. Read more about them here. If you have the proper association properties, your query can look as nice as this:
var result = from e in employees
select new EmployeeDetails
{
EmployeeId = e.employeeId,
LastSaleDates = e.AssociatedSales
};
You might also consider using a join instead. Read about Linq's Join method here.

Is there an association in your model between employees and latestSales? Have you checked SQL Profiler or other profiling tools to see the SQL that's generated? Make sure the ToList() isn't issuing a separate query for each employee.
If you can live with a result structure as IEnumerable<EmployeeId, IEnumerable<DateTime>>, you could consider modifying this to be:
var result = from e in employees
select new EmployeeDetails
{
EmployeeId = e.employeeId,
LastSaleDates = (from lsd in latestSales
where e.employeeId == lsd.EmployeeId
select lsd.SaleDate)
};
I have some more general recommendations at http://www.thinqlinq.com/Post.aspx/Title/LINQ-to-Database-Performance-hints to help track issues down.

Filtering by max and grouping by id with joins to other entities in LINQ to Entity Framework (C#)

The following snippet does work for what I need. I believe though that there must be a better practice? A more optimal way of doing this query?
What is needed is to get a list of employee objects that are the direct reports for employee/mgr x. The direct reports are listed in a history table that has multiple records for each employee, and so only one (the most recent) record should be returned from that table per each direct report (employee) and then the Employee table should be used to get the employee object where employee id is equal to employee id from each history record in this filtered resultset. I can get both halves with two separate LINQ to EF queries.
A problem occurs when trying to join on the employeeHistory object from the first result set. According to MSDN: Referencing Non-Scalar Closures is Not Supported [Referencing a non-scalar closure, such as an entity, in a query is not supported. When such a query executes, a NotSupportedException exception is thrown with a message that states "Unable to create a constant value of type 'Closure type'. Only primitive types ('such as Int32, String, and Guid') are supported in this context."]
So I run two queries and make the first a list of type int rather than a complex object. This does work, but seems contrived. Any suggestions as to a better way (I would like to do one query).
private List<BO.Employee> ListDirectReports(int mgrId)
{
IQueryable<BO.Employee> directRpts;
using(var ctx = new Entities())
{
//to get a list of direct rpts we perform two separate queries. linq to ef with linq to objects
//first one gets a list of emp ids for a direct mgr emp id from the history table
//this first qry uses grouping and a filter by empid and a filter by max(date)
//the second qry joins to the resultset from the first and goes to the employee table
//to get whole employee objects for everyone in the int emp id list from qry #1
//qry #1: just a list of integers (emp ids for those reporting to emp id of mgrId)
IEnumerable<int> directRptIDList =
from employeeHistory in ctx.EmployeeHistory
.Where(h => h.DirectManagerEmployeeID == mgrId).ToList()
group employeeHistory by employeeHistory.EmployeeID into grp
let maxDt = grp.Max(g => g.DateLastUpdated) from history in grp
where history.DateLastUpdated == maxDt
select history.EmployeeID;
//qry #2: a list of Employee objects from the Employee entity. filtered by results from qry #1:
directRpts = from emp in ctx.Employee
join directRptHist in directRptIDList.ToList()
on emp.EmployeeID equals directRptHist
select emp;
}
return directRpts.ToList();
}
Thank you.

2 things I can think of to improve your queries:
ToList is not deferred. Calling it on your Queryable collections is causing lots of extra trips to the DB. I also believe this call, along with the explicit declaration of IEnumerable<int>, was causing the closure error.
Use the relation between EmployeeHistory and Employee in your ObjectContext to join the queries. This will let the framework produce more efficient SQL. And when directRpts is evaluated on your ToList call, it should only make one trip to the DB.
Let me know if this helps.
private List<BO.Employee> ListDirectReports(int mgrId)
{
    using(var ctx = new Entities())
    {
        var directRptIDList =
            from employeeHistory in ctx.EmployeeHistory
                .Where(h => h.DirectManagerEmployeeID == mgrId)
            group employeeHistory by employeeHistory.EmployeeID into grp
            let maxDt = grp.Max(g => g.DateLastUpdated) from history in grp
            where history.DateLastUpdated == maxDt
            select history;
        var directRpts =
            from emp in ctx.Employee
            join directRptHist in directRptIDList
            on emp equals directRptHist.Employee
            select emp;
        // Evaluate inside the using block, where directRpts is in scope
        // and the context has not been disposed yet
        return directRpts.ToList();
    }
}

There are a number of issues here, not the least of which is that by doing your Where before you get the most recent history item, you're getting records that are no longer valid. Here's how I'd do it:
private List<BO.Employee> ListDirectReports(int mgrId)
{
using(var ctx = new Entities())
{
// First make sure we're only looking at the current employee information
var currentEntries =
from eh in ctx.EmployeeHistory
group eh by eh.EmployeeID into grp
select grp.OrderByDescending(eh => eh.DateLastUpdated).FirstOrDefault();
// Now filter by the manager's ID
var directRpts = currentEntries
.Where(eh => eh.DirectManagerEmployeeID == mgrId);
// This would be ideal, assuming your entity associations are set up right
var employees = directRpts.Select(eh => eh.Employee).Distinct();
// If the above won't work, this is the next-best thing
var employees2 = ctx.Employee.Where(
emp => directRpts.Any(
eh => eh.EmployeeID == emp.EmployeeID));
return employees.ToList();
}
}
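The point about filtering by manager only after picking the most recent history row can be illustrated in memory. This is a hedged sketch, not the answer's code: the class LatestHistoryDemo, its History type and the method CurrentDirectReports are hypothetical, and it runs on LINQ to Objects rather than Entity Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatestHistoryDemo
{
    // Hypothetical in-memory stand-in for an EmployeeHistory row.
    public class History
    {
        public int EmployeeID;
        public int ManagerID;
        public DateTime Updated;
    }

    // Takes the most recent row per employee first, then filters by manager,
    // so employees who have since moved to another manager are excluded.
    public static List<int> CurrentDirectReports(IEnumerable<History> history, int mgrId)
    {
        return history
            .GroupBy(h => h.EmployeeID)
            .Select(g => g.OrderByDescending(h => h.Updated).First())
            .Where(h => h.ManagerID == mgrId)
            .Select(h => h.EmployeeID)
            .OrderBy(id => id)
            .ToList();
    }

    public static void Main()
    {
        var history = new List<History>
        {
            // Employee 1 reported to manager 10, then moved to manager 20.
            new History { EmployeeID = 1, ManagerID = 10, Updated = new DateTime(2020, 1, 1) },
            new History { EmployeeID = 1, ManagerID = 20, Updated = new DateTime(2021, 1, 1) },
            new History { EmployeeID = 2, ManagerID = 10, Updated = new DateTime(2021, 6, 1) },
        };

        Console.WriteLine(string.Join(",", CurrentDirectReports(history, 10))); // 2
        Console.WriteLine(string.Join(",", CurrentDirectReports(history, 20))); // 1
    }
}
```

If the manager filter ran before the grouping, employee 1 would still be listed under manager 10, because the older row would be the only one left in its group.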

Thank you Sorax. The code I had posted did not error and did give me the results I needed, but as you pointed out, merging the two queries errored when including the ToList() method. Using your tip I merged both successfully (tested it) and have posted the improved single query method below. StriplingWarrior I tried yours as well, maybe I could massage it even more. Function evaluation times out on the first query, so I will stick with Sorax' suggestion for now. I appreciate the help and will revisit this.
private static List<BO.Employee> ListDirectReports(int mgrId)
{
IQueryable<BO.Employee> directRpts;
using(var ctx = new Entities())
{
directRpts =
from emp in ctx.Employee
join directRptHist in
(from employeeHistory in ctx.EmployeeHistory
.Where(h => h.DirectManagerEmployeeID == mgrId)
group employeeHistory by employeeHistory.EmployeeID into grp
let maxDt = grp.Max(g => g.DateLastUpdated) from history in grp
where history.DateLastUpdated == maxDt
select history)
on emp equals directRptHist.Employee
select emp;
}
return directRpts.ToList();
//IQueryable<BO.Employee> employees;
//using(var ctx = new Entities())
//{
// //function evaluation times out on this qry:
// // First make sure we're only looking at the current employee information
// IQueryable<BO.EmployeeHistory> currentEntries =
// from eh in ctx.EmployeeHistory
// group eh by eh.EmployeeID into grp
// select grp.OrderBy(eh => eh.DateLastUpdated).FirstOrDefault();
// // Now filter by the manager's ID
// var dirRpts = currentEntries
// .Where(eh => eh.DirectManagerEmployeeID == mgrId);
// // This would be ideal, assuming your entity associations are set up right
// employees = dirRpts.Select(eh => eh.Employee).Distinct();
// //// If the above won't work, this is the next-best thing
// //var employees2 = ctx.Employee.Where(
// // emp => directRpts.Any(
// // eh => eh.EmployeeId == emp.EmployeeId));
//}
//return employees.ToList();
}
