My database MyDB has 5 tables: MSize, MModel, Met, MResult, SResult. They are connected as follows:
MSize shares the field MSizeId with MModel.
MModel links to Met via MModelId.
Met can be linked to MResult via MId.
Similarly, SResult can be linked to MResult via SResultId.
My aim is to get the average accuracy of all the items (the Description field in the MSize table) with Acc (a decimal column) >= 70 and <= 130, grouped by Description.
Here is my SQL query:
use MyDB;
SELECT a.[Description], AVG(CASE WHEN d.[Acc] >= 70 AND d.[Acc] <= 130 THEN d.[Acc] END) AS AvgAcc
FROM MSize a
INNER JOIN MModel b ON a.MSizeId = b.MSizeId
INNER JOIN Met c ON b.MModelId = c.MModelId
INNER JOIN MResult d ON c.MId = d.MId
INNER JOIN SResult e ON d.SResultId = e.SResultId
GROUP BY a.Description
This query gives me the correct result on SQL Server.
I have been struggling to write a LINQ query for the same. The problem is the SQL CASE expression: I don't want to specify the false branch of the CASE, meaning that if d.Acc doesn't fall in the specified range, it should simply be discarded.
Assuming all model classes and fields have the same names as these DB tables and columns, what would the LINQ query for the given SQL statement be?
You can fill in the code inside the curly braces:
using (var db = new MyDBContext()) { }
Here MyDBContext refers to the partial data-model class generated by LINQ.
You didn't bother to write the classes, and I'm not going to do that for you.
Apparently you have a sequence of MSizes, where every MSize has zero or more MModels. Every MModel has zero or more Mets, every Met has zero or more MResults, and every MResult has an Acc.
You also forgot to state your requirements in words, so I had to extract them from your SQL query.
It seems that you want the Description of every MSize together with the average value of all the Accs it contains that have a value between 70 and 130.
If you use Entity Framework, you can use the virtual ICollection navigation properties, which make life fairly easy. I'll do it in two steps, because below I do the same with a GroupJoin without using the ICollections. The second part is the same for both methods.
First I'll fetch the Description of every MSize, together with all the deeper Accs that are in the MResults of the Mets of the MModels of this MSize:
var descriptionsWithTheirAccs = dbContext.MSizes.Select(msize => new
{
    Description = msize.Description,
    // SelectMany the inner collections until you reach the Accs:
    Accs = msize.MModels
        // collection selector: dig into the Mets of every MModel
        .SelectMany(model => model.Mets)
        // flatten the MResults of every Met; the result selector
        // takes the Acc of every MResult
        .SelectMany(met => met.MResults,
            (met, mResult) => mResult.Acc),
});
Now that we have the Description of every MSize with all the Accs that it has deep inside,
we can throw away all the Accs that we don't want and average the remaining ones:
var result = descriptionsWithTheirAccs.Select(descriptionWithItsAccs => new
{
    Description = descriptionWithItsAccs.Description,
    Average = descriptionWithItsAccs.Accs
        .Where(acc => 70 <= acc && acc <= 130)
        // and take the average of all remaining Accs
        .Average(),
});
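One caveat with the snippet above: Average on an empty sequence of non-nullable decimals throws an InvalidOperationException, so if an MSize can have no Acc in the 70 to 130 range you may want a defensive variant of this second step (use it instead of, not in addition to, the previous statement). A minimal sketch, assuming Acc is a non-nullable decimal, that lifts the values to decimal? so Average returns null instead of throwing:
var result = descriptionsWithTheirAccs.Select(descriptionWithItsAccs => new
{
    Description = descriptionWithItsAccs.Description,
    Average = descriptionWithItsAccs.Accs
        .Where(acc => 70 <= acc && acc <= 130)
        // lift to decimal? so an empty group yields null, not an exception
        .Select(acc => (decimal?)acc)
        .Average(),
});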
If you don't have access to the ICollections, you'll have to do the GroupJoin yourself, which looks pretty horrible with this many tables:
var descriptionsWithTheirAccs = dbContext.MSizes.GroupJoin(dbContext.MModels,
    msize => msize.MSizeId,
    mmodel => mmodel.MSizeId,
    (msize, mmodels) => new
    {
        Description = msize.Description,
        Accs = mmodels.GroupJoin(dbContext.Mets,
            mmodel => mmodel.MModelId,
            met => met.MModelId,
            (mmodel, metsOfThisModel) => metsOfThisModel
                .GroupJoin(dbContext.MResults,
                    met => met.MId,
                    mresult => mresult.MId,
                    // result selector: from every MResult take the Acc
                    (met, mresults) => mresults.Select(mresult => mresult.Acc)))
            // the nested GroupJoins yield sequences of sequences; flatten them
            .SelectMany(accsPerModel => accsPerModel)
            .SelectMany(accsPerMet => accsPerMet),
    });
Now that you have the descriptionsWithTheirAccs, you can use the Select above to calculate the averages.
I am using Entity Framework in a C# application, with lazy loading enabled. I am experiencing performance issues when calculating the sum of a property over a collection of elements. Let me illustrate it with a simplified version of my code:
public decimal GetPortfolioValue(Guid portfolioId) {
var portfolio = DbContext.Portfolios.FirstOrDefault( x => x.Id.Equals( portfolioId ) );
if (portfolio == null) return 0m;
return portfolio.Items
.Where( i =>
i.Status == ItemStatus.Listed
&&
_activateStatuses.Contains( i.Category.Status )
)
.Sum( i => i.Amount );
}
So I want to sum the value of all my items that have a certain status and whose parent category also has a specific status.
When logging the queries generated by EF I see it is first fetching my Portfolio (which is fine). Then it does a query to load all Item entities that are part of this portfolio. And then it starts fetching ALL Category entities for each Item one by one. So if I have a portfolio that contains 100 items (each with a category), it literally does 100 SELECT ... FROM categories WHERE id = ... queries.
So it seems like it's just fetching all the info, storing it in memory, and then calculating the sum. Why does it not do a simple join between my tables and calculate it that way?
Instead of doing 102 queries to calculate the sum of 100 items I would expect something along the lines of:
SELECT
i.id, i.amount
FROM
items i
INNER JOIN categories c ON c.id = i.category_id
WHERE
i.portfolio_id = @portfolioId
AND
i.status = 'listed'
AND
c.status IN ('active', 'pending', ...);
on which it could then calculate the sum (if it is not able to use the SUM directly in the query).
What is the problem and how can I improve the performance other than writing a pure ADO query instead of using Entity Framework?
To be complete, here are my EF entities:
public class ItemConfiguration : EntityTypeConfiguration<Item> {
    public ItemConfiguration() {
        ToTable("items");
        ...
        HasRequired(p => p.Portfolio);
    }
}
public class CategoryConfiguration : EntityTypeConfiguration<Category> {
    public CategoryConfiguration() {
        ToTable("categories");
        ...
        HasMany(c => c.Products).WithRequired(p => p.Category);
    }
}
EDIT based on comments:
I didn't think it was important, but _activeStatuses is an array of enum values.
private CategoryStatus[] _activeStatuses = new[] { CategoryStatus.Active, ... };
But probably more important is that I left out that the status in the database is a string ("active", "pending", ...) but I map them to an enum used in the application. And that is probably why EF cannot evaluate it? The actual code is:
... && _activateStatuses.Contains(CategoryStatusMapper.MapToEnum(i.Category.Status)) ...
EDIT2
Indeed the mapping is part of the problem, but the query itself seems to be the biggest issue. Why is the performance difference so big between these two queries?
// Slow query
var portfolio = DbContext.Portfolios.FirstOrDefault(p => p.Id.Equals(portfolioId));
var value = portfolio.Items.Where(i => i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status))
.Select(i => i.Amount).Sum();
// Fast query
var value = DbContext.Portfolios.Where(p => p.Id.Equals(portfolioId))
.SelectMany(p => p.Items.Where(i =>
i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status)))
.Select(i => i.Amount).Sum();
The first query does a LOT of small SQL queries, whereas the second one combines everything into one bigger query. I'd expect even the first query to run a single query to get the portfolio value.
Calling portfolio.Items will lazy-load the Items collection and then execute the subsequent calls, including the Where and Sum expressions, in memory. See also the Loading Related Entities article.
You need to execute the call directly on the DbContext so the Sum expression can be evaluated on the database server side.
var portfolioValue = DbContext.Portfolios
    .Where(x => x.Id.Equals(portfolioId))
    .SelectMany(x => x.Items
        .Where(i => i.Status == ItemStatus.Listed
                 && _activateStatuses.Contains(i.Category.Status))
        .Select(i => i.Amount))
    .Sum();
You also have to use the appropriate element type for the _activateStatuses instance, as the contained values must match the type persisted in the database. If the database persists string values, then you need to pass a list of string values:
var _activateStatuses = new string[] {"Active", "etc"};
You could use a LINQ expression to convert the enums to their string representations.
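A minimal sketch of that conversion (CategoryStatus.Pending is assumed from the question's 'pending' status; make sure the resulting casing matches what the database actually stores):
// Build the string array once from the enum values.
var _activateStatuses = new[] { CategoryStatus.Active, CategoryStatus.Pending }
    .Select(status => status.ToString())
    .ToArray();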
Notes
I would recommend you turn off lazy loading on your DbContext type. As soon as you do that, you will start to catch issues like this at run time via exceptions, and can then write more performant code.
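In EF6 that is a one-liner in the context constructor; a sketch, assuming a DbContext subclass named MyDbContext:
public class MyDbContext : DbContext
{
    public MyDbContext()
    {
        // With lazy loading off, navigation properties that were not
        // explicitly loaded stay null instead of silently issuing one
        // query per entity.
        this.Configuration.LazyLoadingEnabled = false;
    }
}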
I did not include error checking for if no portfolio was found but you could extend this code accordingly.
Yep, CategoryStatusMapper.MapToEnum cannot be converted to SQL, forcing EF to run the Where in .NET. Rather than mapping the status to the enum, _activeStatuses should contain the list of integer values from the enum so that the mapping is not required.
private int[] _activeStatuses = new[] { (int)CategoryStatus.Active, ... };
so that the Contains becomes
... && _activateStatuses.Contains(i.Category.Status) ...
and can all be converted to SQL.
UPDATE
Given that i.Category.Status is a string in the database, then
private string[] _activeStatuses = new[] { CategoryStatus.Active.ToString(), ... };
Situation: Say we are executing a LINQ query that joins two in-memory lists (so no DbSets or SQL-query generation involved), and this query also has a where clause. This where only filters on properties included in the original set (the from part of the query).
Question: Does the LINQ query interpreter optimize this query so that it executes the where before it performs the join, regardless of whether I write the where before or after the join? That way it would not have to perform the join on elements that are filtered out later anyway.
Example: For example, I have a categories list I want to join with a products list. However, I am just interested in the category with ID 1. Does the LINQ interpreter internally perform the exact same operations regardless of whether I write:
from category in categories
join prod in products on category.ID equals prod.CategoryID
where category.ID == 1 // <------ below join
select new { Category = category.Name, Product = prod.Name };
or
from category in categories
where category.ID == 1 // <------ above join
join prod in products on category.ID equals prod.CategoryID
select new { Category = category.Name, Product = prod.Name };
Previous research: I already saw this question, but the OP stated that the question is only targeting non-in-memory cases with generated SQL. I am explicitly interested in LINQ executing a join on two lists in-memory.
Update: This is not a duplicate of the "Order execution of chain linq query" question, as the referenced question clearly refers to a DbSet and my question explicitly addresses a non-DB scenario. (Moreover, although similar, I am not asking about inclusions based on navigational properties here, but about joins.)
Update 2: Although very similar, this is also not a duplicate of "Is order of the predicate important when using LINQ?", as I am asking explicitly about in-memory situations and I cannot see the referenced question explicitly addressing this case. Moreover, that question is a bit old, and I am actually interested in LINQ in the context of .NET Core (which didn't exist in 2012), so I updated the tags of this question to reflect this second point.
Please note: With this question I am asking whether the LINQ query interpreter somehow optimizes this query in the background, and I am hoping to get a reference to a piece of documentation or source code that shows how LINQ does this. I am not interested in answers such as "it does not matter because the performance of both queries is roughly the same".
The LINQ query syntax will be compiled to a method chain. For details, see e.g. this question.
The first LINQ query will be compiled to the following method chain:
categories
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { category, prod })
.Where(t => t.category.ID == 1)
.Select(t => new { Category = t.category.Name, Product = t.prod.Name });
The second one:
categories
.Where(category => category.ID == 1)
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { Category = category.Name, Product = prod.Name });
As you can see, the second query will cause fewer allocations (note only one anonymous type vs. two in the first query, and note how many instances of those anonymous types will be created while performing the query).
Furthermore, it's clear that the first query will perform the join operation on a lot more data than the second (already filtered) one.
There is no additional query optimization in the case of LINQ-to-Objects queries.
So the second version is preferable.
For in-memory lists (IEnumerables), no optimization is applied: query execution happens in the chained order.
I also tried the query by first casting the list to IQueryable and then applying the filter, but apparently the casting overhead is pretty high for a big list.
I made a quick test for this case.
Console.WriteLine($"List Row Count = {list.Count()}");
Console.WriteLine($"JoinList Row Count = {joinList.Count()}");
var watch = Stopwatch.StartNew();
var result = list.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result.Dump();
watch.Stop();
Console.WriteLine($"Result1 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result2 = list
.Where(t => t.Prop3 == "Prop13")
.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result2.Dump();
watch.Stop();
Console.WriteLine($"Result2 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result3 = list.AsQueryable().Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result3.Dump();
watch.Stop();
Console.WriteLine($"Result3 Elapsed = {watch.ElapsedTicks}");
Findings (elapsed times in ticks):

List Count | JoinList Count | Result1 Elapsed | Result2 Elapsed | Result3 Elapsed
100        | 10             | 27              | 17              | 591
1000       | 10             | 20              | 12              | 586
100000     | 10             | 603             | 19              | 1277
1000000    | 10             | 1469            | 88              | 3219
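The setup code for the test data was not shown; a minimal sketch that matches the property names used above (the Item shape and the value distribution are assumptions) could look like this:
// Hypothetical element type matching the Prop2/Prop3/Prop4 usage above.
class Item
{
    public string Prop2 { get; set; }
    public string Prop3 { get; set; }
    public string Prop4 { get; set; }
}

// list: N rows with ten distinct Prop3 keys, so about N/10 match "Prop13".
var list = Enumerable.Range(0, 100000)
    .Select(i => new Item { Prop2 = "P2-" + i, Prop3 = "Prop1" + (i % 10), Prop4 = "P4-" + i })
    .ToList();

// joinList: one row per distinct Prop3 key.
var joinList = Enumerable.Range(0, 10)
    .Select(i => new Item { Prop2 = "x", Prop3 = "Prop1" + i, Prop4 = "y" })
    .ToList();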
I have a query which gives me a result set, and based on a condition I want to take only 100 records. That means: I have a variable x, and if the value of x is 100 then I have to do .Take(100), else I need to get all the records.
var abc = (from st in Context.STopics
           where st.IsActive == true && st.StudentID == 123
           select new result()
           {
               name = st.name
           }).ToList().Take(100);
Because LINQ returns an IQueryable, which has deferred execution, you can create your query first, restrict it to the first 100 records if your condition is true, and only then materialize the results. That way, if your condition is false, you will get all the results.
var abc = from st in Context.STopics
          where st.IsActive && st.StudentID == 123
          select new result
          {
              name = st.name
          };

if (x == 100)
    abc = abc.Take(100);

var records = abc.ToList();
Note that it is important to do the Take before the ToList; otherwise you would retrieve all the records and then only keep the first 100. It is much more efficient to get only the records you need, especially if the query runs against a database table that could contain hundreds of thousands of rows.
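To make the difference concrete, compare the two orderings (a sketch reusing the abc query variable from above):
// Take composed into the SQL: the database returns at most 100 rows.
var first100 = abc.Take(100).ToList();

// ToList first: every matching row is materialized, then all but the
// first 100 are discarded in memory.
var wasteful = abc.ToList().Take(100).ToList();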
One of the most important concepts with the SQL TOP command is ORDER BY: you should not use TOP without an ORDER BY, because it may return different results in different situations.
The same concept applies to LINQ, too.
var results = Context.STopics
    .Where(st => st.IsActive && st.StudentID == 123)
    .Select(st => new result() { name = st.name })
    .OrderBy(r => r.name)
    .Take(100)
    .ToList();
Take and Skip operations are well defined only against ordered sets. More info
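The same OrderBy/Take combination generalizes to paging with Skip; a sketch with hypothetical page and pageSize variables:
int page = 2, pageSize = 100;

var pageOfResults = Context.STopics
    .Where(st => st.IsActive && st.StudentID == 123)
    .OrderBy(st => st.name)          // a stable order makes Skip/Take deterministic
    .Skip((page - 1) * pageSize)     // skip the earlier pages
    .Take(pageSize)                  // take just this page
    .Select(st => new result() { name = st.name })
    .ToList();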
Although the other users are correct in giving you the results you want...
This is NOT how you should be using Entity Framework.
This is the better way to use EF.
var query = from student in Context.Students
            where student.Id == 123
            from topic in student.Topics
            orderby topic.Name
            select topic;
Notice how the structure more closely follows the logic of the business requirements.
You can almost read the code in English.
I need to union these rows on two ids without using an IEqualityComparer, as those are not supported in LINQ to Entities.
In the result I need every unique combination of BizId and BazId, with the Value from foos if the id pair came from there; otherwise the value should be zero. This is a greatly simplified example, and in reality these tables are very large and these operations cannot be done in memory. Because of this, the query needs to work with LINQ to Entities so that it can be translated to valid SQL and executed on the database. I suspect this can be done with some combination of where, join, and DefaultIfEmpty() instead of the Union and Distinct(), but I am at a loss for now.
var foos = from f in Context.Foos where f.isActive select new { BizId = f.bizId, BazId = f.BazId, Value = f.Value };
var bars = from b in Context.Bars where b.isEnabled select new { BizId = b.bizId, BazId = b.BazId, Value = 0 };
var result = foos.Union(bars).Distinct(); //I need this to compare only BizId and BazId
You can group by the two fields and then take the first item of each group:
foos.Union(bars)
    .GroupBy(x => new { x.BizId, x.BazId })
    .Select(g => g.FirstOrDefault())
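Note that FirstOrDefault on an unordered group is not guaranteed to pick the foos row when both tables contain the same id pair. A variant that makes the preference explicit, assuming Value is never negative so the zero rows from bars can never win:
var result = foos.Union(bars)
    .GroupBy(x => new { x.BizId, x.BazId })
    .Select(g => new
    {
        g.Key.BizId,
        g.Key.BazId,
        // bars rows carry Value = 0, so Max picks the foos value when present
        Value = g.Max(x => x.Value)
    });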
I need to add a literal value to a query. My attempt:
var aa = new List<long>();
aa.Add(0);
var a = Products.Select(p => p.sku).Distinct().Union(aa);
a.ToList().Dump(); // LinqPad's way of showing the values
In the above example, I get an error:
"Local sequence cannot be used in LINQ to SQL implementation
of query operators except the Contains() operator."
If I am using Entity Framework 4 for example, what could I add to the Union statement to always include the "seed" ID?
I am trying to produce SQL code like the following:
select distinct ID
from product
union
select 0 as ID
So that later I can join the list to itself and find all values where the next highest value is not present (i.e., the lowest available ID in the set).
Edit: Original Linq Query to find lowest available ID
var skuQuery = Context.Products
.Where(p => p.sku > skuSeedStart &&
p.sku < skuSeedEnd)
.Select(p => p.sku).Distinct();
var lowestSkuAvailableList =
(from p1 in skuQuery
from p2 in skuQuery.Where(a => a == p1 + 1).DefaultIfEmpty()
where p2 == 0 // zero is default for long where it would be null
select p1).ToList();
var Answer = (lowestSkuAvailableList.Count == 0
? skuSeedStart :
lowestSkuAvailableList.Min()) + 1;
This code creates two SKU sets offset by one, then selects the SKU where the next highest doesn't exist. Afterward, it selects the minimum of that (lowest SKU where next highest is available).
For this to work, the seed must be in the set joined together.
Your problem is that your query is being turned entirely into a LINQ-to-SQL query, when what you need is a LINQ-to-SQL query with local manipulation on top of it.
The solution is to tell the compiler that you want to use LINQ-to-Objects after processing the query (in other words, change the extension method resolution to look at IEnumerable<T>, not IQueryable<T>). The easiest way to do this is to tack AsEnumerable() onto the end of your query, like so:
var aa = new List<long>();
aa.Add(0);
var a = Products.Select(p => p.sku).Distinct().AsEnumerable().Union(aa);
a.ToList().Dump(); // LinqPad's way of showing the values
Up front: not answering exactly the question you asked, but solving your problem in a different way.
How about this:
var a = Products.Select(p => p.sku).Distinct().ToList();
a.Add(0);
a.Dump(); // LinqPad's way of showing the values
You could create a database table for storing constant values and pass a query over this table to the Union operator.
For example, imagine a table "Defaults" with the fields "Name" and "Value", containing only one record ("SKU", 0).
Then you can rewrite your expression like this:
var zero = context.Defaults.Where(d => d.Name == "SKU").Select(d => d.Value);
var result = context.Products.Select(p => p.sku).Distinct().Union(zero).ToList();