Selecting one to many which one is better - c#

In a one-to-many relationship, which of the following approaches performs better?
1st approach
public Order GetOrder(long orderId) {
var orderDetails =
(from o in Orders
from d in OrderDetails
where d.OrderId == o.Id && o.Id == orderId
select new {
Order = o,
Detail = d
}).ToList();
var order = orderDetails.First().Order;
order.Details = orderDetails.Select(od => od.Detail).ToList();
return order;
}
2nd approach
public Order GetOrder(long orderId) {
var order = Orders.First(o => o.Id == orderId);
order.Details = OrderDetails.Where(od => od.OrderId == orderId).ToList();
return order;
}
What I am trying to figure out, in terms of performance, is this: the first approach issues a single query but selects repeated data, whereas the second issues two separate queries but selects only the data that is needed.
You can assume Orders and OrderDetails are IQueryable<T> of EntityFramework (dbContext.Set<T>()) or NHibernate (session.Query<T>()). I tried both and they generate very similar SQL queries. Also, as far as I know, these ORMs' built-in one-to-many queries use something like the first approach.
UPDATE, to clarify what I am asking: Which one (single query but repeated data or only required data but multiple queries) performs better under which circumstances? There may be many situations that I may not think of. That's why I am not trying benchmarking. As already stated in some answers column count or more joins were the kinds of answers that I expected. (There may be also something about row count of table and/or result set). Based on these kind of answers I may try benchmarking. And of course I am asking why? I am not trying to solve Order - OrderDetail problem or solve anything at all. I am trying to learn and understand when to use single query but repeated data or only required data but multiple queries.

A single one-to-many query is pretty straightforward for ORMs. It's when you need to make several interrelated one-to-many queries that performance considerations start making themselves known.

Always measure performance for your particular case. If the Orders table has a few small columns, getting all the data in one round trip may be better. If it has many columns or blob columns, issuing two separate queries may perform better.
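Before measuring, one rough way to reason about the trade-off is to count how many values each approach moves over the wire. The sketch below is only arithmetic: the column and row counts are made-up inputs, the class and method names are hypothetical, and it ignores latency, column sizes and query-plan cost, so it is no substitute for a real benchmark.

```csharp
using System;

public static class TransferEstimate
{
    // Single joined query: every detail row repeats all order columns.
    public static long JoinedCells(int orderColumns, int detailColumns, int detailRows)
        => (long)detailRows * (orderColumns + detailColumns);

    // Two queries: order columns are sent once, then only detail columns per row.
    public static long SplitCells(int orderColumns, int detailColumns, int detailRows)
        => orderColumns + (long)detailRows * detailColumns;

    public static void Main()
    {
        // Hypothetical shape: 20 order columns, 5 detail columns, 100 detail rows.
        Console.WriteLine(JoinedCells(20, 5, 100)); // 2500
        Console.WriteLine(SplitCells(20, 5, 100));  // 520
    }
}
```

With a narrow Orders table the two numbers stay close and the saved round trip probably matters more; with wide or blob-heavy order rows the repeated columns dominate.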

Using Entity Framework, you can call Include on the query to eager-load the details:
var order = context.Orders.Include(x => x.Details).First(x => x.Id == orderId);
Loading Related Objects

Related

EF Core query is extremely slow with many relations between tables

I have an EF Core query like this:
var existingViolations = await _context.Parent
.Where(p => p.ProjectId == projectId)
.Include(p => p.Relation1)
.Include(p => p.Relation2)
.ThenInclude(r => r.Relation21)
.Include(p => p.Relation3)
.AsSplitQuery()
.ToListAsync();
This query usually takes between 55-65 seconds which can sometimes cause database timeouts. All the tables included in the query, including the parent table, contain anywhere from 30k-60k rows and 3-6 columns each. I have tried splitting it up into smaller queries using LoadAsync() like this:
_context.ChangeTracker.LazyLoadingEnabled = false;
_context.ChangeTracker.AutoDetectChangesEnabled = false;
await _context.Relation1.Where(r1 => r1.Parent.ProjectId == projectId).LoadAsync();
await _context.Relation2.Where(r2 => r2.Parent.ProjectId == projectId).Include(r2 => r2.Relation21).LoadAsync();
await _context.Relation3.Where(r3 => r3.Parent.ProjectId == projectId).LoadAsync();
var result = await _context.Parent.Where(p => p.ProjectId == projectId).ToListAsync();
That shaves about 5 seconds off the query time, so nothing to brag about. I've done some timings, and it's the last line (var result = await _context.Parent.Where(p => p.ProjectId == projectId).ToListAsync();) that takes by far the longest to complete, about 90% of the spent time.
How can I optimize this further?
EDIT: Here is the generated SQL query:
SELECT [v].[Id], [v].[Description], [v].[ProjectId], [v].[RuleId], [v].[StateStatus], [v0].[Id], [v0].[ElementId], [v0].[Role], [v0].[ParentId], [t].[Id], [t].[ActivatedDate], [t].[StateStatus], [t].[ParentId], [t].[Id0], [t].[RunId], [t].[SerializedState], [t].[StateId], [p].[Id], [p].[ActualValue], [p].[CurrentValue], [p].[ParameterId], [p].[ParentId]
FROM [Parent] AS [v]
LEFT JOIN [Relation1] AS [v0] ON [v].[Id] = [v0].[ParentId]
LEFT JOIN (
SELECT [s].[Id], [s].[ActivatedDate], [s].[StateStatus], [s].[ParentId], [s0].[Id] AS [Id0], [s0].[RunId], [s0].[SerializedState], [s0].[StateId]
FROM [Relation2] AS [s]
LEFT JOIN [Relation21] AS [s0] ON [s].[Id] = [s0].[StateId]
) AS [t] ON [v].[Id] = [t].[ParentId]
LEFT JOIN [Relation3] AS [p] ON [v].[Id] = [p].[ParentId]
WHERE [v].[ProjectId] = @__projectId_0
ORDER BY [v].[Id], [v0].[Id], [t].[Id], [t].[Id0], [p].[Id]
When running the SQL query directly in the database, it takes about 3-4 seconds to complete, so the problem seems to be with how EF processes the results.
Without seeing the real entities and how they might be configured, it's anyone's guess.
Generally speaking, when looking at performance issues like this, the first thing I would tackle is "what is this data being loaded for?" Typically when I see queries using a lot of Includes, it is a read operation loading data for a view or for a computation based on the selected data. Projecting down to a simpler model can help significantly here if you really only need a few columns from each table. The benefit of projecting with a Select across the related data, to fill either a DTO/ViewModel class for a view or an anonymous type for a computation, is that Include will pull all columns for all eager-loaded tables in one go, whereas a projection only returns the columns referenced. This can be critically important where tables contain things like large text/binary columns that you don't need at all, or don't need right away. It is also very important where the database server is some distance from the consuming client or web server. Less data over the wire = faster performance, though the issue right now sounds like the DB query itself.
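A minimal sketch of what such a projection could look like. The entity shapes (Parent, Relation1Row), the ParentSummary read model and the chosen columns are all assumptions, since the real entities are not shown in the question. It runs against an in-memory IQueryable so it is self-contained; against a real DbSet the same Select would be translated by the provider, which should be verified for the EF Core version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shapes; the real ones are not shown in the question.
public class Relation1Row
{
    public int Id { get; set; }
    public string Role { get; set; } = "";
}

public class Parent
{
    public int Id { get; set; }
    public int ProjectId { get; set; }
    public string Description { get; set; } = "";
    public List<Relation1Row> Relation1 { get; set; } = new List<Relation1Row>();
}

// Flat read model holding only the columns the caller needs (assumed).
public record ParentSummary(int Id, string Description, List<string> Roles);

public static class ProjectionSketch
{
    // Projects to the read model instead of eager-loading whole entities.
    public static List<ParentSummary> Summaries(IQueryable<Parent> parents, int projectId) =>
        parents
            .Where(p => p.ProjectId == projectId)
            .Select(p => new ParentSummary(
                p.Id,
                p.Description,
                p.Relation1.Select(r => r.Role).ToList()))
            .ToList();

    public static void Main()
    {
        var data = new List<Parent>
        {
            new Parent { Id = 1, ProjectId = 7, Description = "first",
                         Relation1 = { new Relation1Row { Id = 10, Role = "owner" } } },
            new Parent { Id = 2, ProjectId = 8, Description = "other project" },
        };

        foreach (var s in Summaries(data.AsQueryable(), 7))
            Console.WriteLine($"{s.Id} {s.Description} [{string.Join(", ", s.Roles)}]");
    }
}
```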
The next thing to check would be the relationships between all of the tables and any relevant configuration in EF vs. the table design. Waiting a minute to pull a few records from 30-60k rows is ridiculously long, and I would strongly suspect some flawed relationship mapping that isn't using FKs/indexes. Another place to look would be to run a profiler against the database to capture the exact SQL statement(s) being run, then execute those manually and investigate their execution plans, which might reveal schema problems or some weirdness in the entity relationship mapping that produces very inefficient queries.
After that, use a process of elimination to see if there is a bad relationship. Remove the eager-load Include statements one by one and see how long the query takes in each scenario. If a particular Include is responsible for a drastic slow-down, drill down into that relationship to see why that might be.
That should give you a couple avenues to check. Consider revising your question with the actual entities and any further troubleshooting results.

LINQ Query with subqueries VS Query and foreach

I have a very large amount of data that I need to gather for a report I am generating. All of this data comes from a Database that I am connected to via entity framework. For this query I have tried doing this a few different ways but no matter what I do it seems to be slow.
Overall I am curious if it is more efficient to have a LINQ query that has sub queries or is it better to do a foreach and then query for those values.
Additional information: many of the subqueries/loop iterations would be querying the largest tables in the DB.
Example code:
var b = (from brk in entities.Brokers
join pcy in Policies on brk.BrkId equals pcy.pcyBrkId
where pcy.DateStamp > twoYearsAgo
select new returnData
{
BroId = brk.brkId,
currentPrem = (from p in Policies
where p.pcyBrkId == brk.BrkId && p.InvDate > startDate && p.InvDate < endDate
select p.Premium).Sum(),
// 5 more similar subqueries
}).GroupBy(x=> x.BrkId).Select(x=> x.FirstOrDefault()).ToList();
OR
var b = (from brk in entities.Brokers
join pcy in Policies on brk.BrkId equals pcy.pcyBrkId
where pcy.DateStamp > twoYearsAgo
select new returnData
{
BroId = brk.brkId
}).GroupBy(x=> x.BrkId).Select(x=> x.FirstOrDefault()).ToList();
foreach (var brk in b) {
// grab data from subqueries here
}
One additional detail may be that I may be able to filter out some additional information if I grab the primary information reducing the results to go through in the foreach.
First of all, matters of performance always warrant profiling, no matter how reasonable or logical one or another solution might seem.
That said, when working with a database, fewer round trips are usually better. Hence in your case it might be more efficient to have one single SQL query that retrieves a big chunk of data over the network, which you then process locally with loops and whatnot. This guideline should give a good solution in most cases.
It all depends, of course, on how big the data is, how much network bandwidth you have, and how fast and well tuned your database is.
Side note: in general, if you work with big or complex (intertwined) data, it is better to avoid using Entity Framework at all, especially when you're concerned about performance. Not sure if that might work for you.
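As an illustration of "fetch once, then process locally", here is a small sketch that computes the per-broker premium totals in a single pass over rows that are already in memory. PolicyRow and its fields are hypothetical stand-ins for whatever one query over Policies would return; the real schema is not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flat row, standing in for the result of one query over Policies.
public record PolicyRow(int BrokerId, DateTime InvDate, decimal Premium);

public static class LocalAggregation
{
    // Sums premiums per broker for invoices strictly inside (startDate, endDate).
    public static Dictionary<int, decimal> PremiumByBroker(
        IEnumerable<PolicyRow> rows, DateTime startDate, DateTime endDate) =>
        rows.Where(r => r.InvDate > startDate && r.InvDate < endDate)
            .GroupBy(r => r.BrokerId)
            .ToDictionary(g => g.Key, g => g.Sum(r => r.Premium));

    public static void Main()
    {
        var rows = new List<PolicyRow>
        {
            new PolicyRow(1, new DateTime(2023, 3, 1), 100m),
            new PolicyRow(1, new DateTime(2023, 6, 1), 50m),
            new PolicyRow(2, new DateTime(2023, 4, 1), 75m),
            new PolicyRow(2, new DateTime(2021, 1, 1), 999m), // outside the window
        };

        var totals = PremiumByBroker(rows, new DateTime(2023, 1, 1), new DateTime(2024, 1, 1));
        foreach (var kv in totals)
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}
```

The same GroupBy can often be pushed into the database query instead; which side should do the aggregation is exactly the kind of thing profiling has to decide.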

Include() vs Select() performance

I have a parent entity with a navigation property to a child entity. The parent entity may not be removed as long as there are associated records in the child entity. The child entity can contain hundreds of thousands of records.
I'm wondering what the most efficient way to do this in Entity Framework is:
var parentRecord = _context.Where(x => x.Id == request.Id)
.Include(x => x.ChildTable)
.FirstOrDefault();
// check if parentRecord exists
if (parentRecord.ChildTable.Any()) {
// cannot remove
}
or
var parentRecord = _context.Where(x => x.Id == request.Id)
.Select(x => new {
ParentRecord = x,
HasChildRecords = x.ChildTable.Any()
})
.FirstOrDefault();
// check if parentRecord exists
if (parentRecord.HasChildRecords) {
// cannot remove
}
The first query may include thousands of records while the second query will not, however, the second one is more complex.
Which is the best way to do this?
I would say it depends. It depends on which DBMS you're using, on how well the optimizer works, etc.
So one single statement with a JOIN could be far faster than a lot of SELECT statements.
In general I would say when you need the rows from your Child table use .Include(). Otherwise don't include them.
Or in simple words, just read the data you need.
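To make "just read the data you need" concrete, here is an in-memory analogy, not EF itself. ChildRows is a hypothetical iterator that counts how many rows are actually produced. Loading everything and then calling Any() reads every row, while calling Any() directly stops at the first one. With a database provider, an Any() inside a Select is typically translated into an EXISTS-style subquery, but the generated SQL should be checked for the provider in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExistenceCheck
{
    // Counts how many child rows were actually produced.
    public static int RowsRead;

    // Hypothetical stand-in for a large child table.
    public static IEnumerable<int> ChildRows(int total)
    {
        for (var i = 0; i < total; i++)
        {
            RowsRead++;
            yield return i;
        }
    }

    // Analogous to Include(): materialize all children, then ask Any().
    public static int RowsReadWhenLoadingAll(int total)
    {
        RowsRead = 0;
        var loaded = ChildRows(total).ToList();
        _ = loaded.Any();
        return RowsRead;
    }

    // Analogous to projecting HasChildRecords: ask Any() without materializing.
    public static int RowsReadWhenAskingAny(int total)
    {
        RowsRead = 0;
        _ = ChildRows(total).Any();
        return RowsRead;
    }

    public static void Main()
    {
        Console.WriteLine(RowsReadWhenLoadingAll(100000)); // 100000
        Console.WriteLine(RowsReadWhenAskingAny(100000));  // 1
    }
}
```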
The answer depends on your database design. Which columns are indexed? How much data is in table?
Include() offloads work to your C# layer, but means a simpler query. It's probably the better choice here, but you should consider extracting the SQL that is generated by Entity Framework and running each version through an optimisation check.
You can output the SQL generated by Entity Framework to your Visual Studio console as noted here.
This example might create a better SQL query that suits your needs.

How do I apply the LINQ to SQL Distinct() operator to a List<T>?

I have a serious (it's driving me crazy) problem with LINQ to SQL. I am developing an ASP.NET MVC3 application using C# and Razor in Visual Studio 2010.
I have two database tables, Product and Categories:
Product(Prod_Id[primary key], other attributes)
Categories((Dept_Id, Prod_Id) [primary keys], other attributes)
Obviously Prod_Id in Categories is a foreign key. Both classes are mapped using the Entity Framework (EF). I do not mention the context of the application for simplicity.
In Categories there are multiple rows containing the Prod_Id. I want to make a projection of all Distinct Prod_Id in Categories. I did it using plain (T)SQL in SQL Server MGMT Studio according to this (really simple) query:
SELECT DISTINCT Prod_Id
FROM Categories
and the result is correct. Now I need to make this query in my application so I used:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
I go to check the result of my query by using:
query.Select(m => m.Prod_Id);
or
foreach(var item in query)
{
item.Prod_Id;
//other instructions
}
and it does not work. First of all, when I attempt to write query.Select(m => m. or item. Intellisense shows only suggestions for methods (such as Equals, etc.) and not properties. I thought that maybe there was something wrong with Intellisense (I guess most of you have often hoped that Intellisense was wrong :-D), but when I launch the application I receive an error at runtime.
Before giving your answer, keep in mind that:
I checked many forums, I tried the normal LINQ to SQL (without using lambdas) but it does not work. The fact that it works in (T)SQL means that there is something wrong with the LINQ to SQL instruction (other queries in my application work perfectly).
For application related reasons, I used a List<T> variable instead of _StoreDB.Categories and I thought that was the problem. If you can offer me a solution without using a List<T> is appreciated as well.
This line:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Your LINQ query most likely returns IEnumerable... of ints (judging by Select(m => m.Prod_Id)). You have list of integers, not list of entity objects. Try to print them and see what you got.
Calling _StoreDB.Categories.Select(m => m.Prod_Id) means that query will contain Prod_Id values only, not the entire entity. It would be roughly equivalent to this SQL, which selects only one column (instead of the entire row):
SELECT Prod_Id FROM Categories;
So when you iterate through query using foreach (var item in query), the type of item is probably int (or whatever your Prod_Id column is), not your entity. That's why Intellisense doesn't show the entity properties that you expect when you type "item."...
If you want all of the columns in Categories to be included in query, you don't even need to use .Select(m => m). You can just do this:
var query = _StoreDB.Categories.Distinct();
Note that if you don't explicitly pass an IEqualityComparer<T> to Distinct(), EqualityComparer<T>.Default will be used (which may or may not behave the way you want it to, depending on the type of T, whether or not it implements System.IEquatable<T>, etc.).
For more info on getting Distinct to work in situations similar to yours, take a look at this question or this question and the related discussions.
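The points above can be sketched with plain LINQ to Objects. Category here is a hypothetical in-memory class mirroring the two key columns from the question, and ByProdId is an illustrative comparer, not something from the original code. As far as I know, the comparer overload of Distinct() applies to in-memory sequences and query providers generally cannot translate it to SQL, so treat this as an illustration of the semantics rather than a drop-in database query.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of a Categories row.
public class Category
{
    public int Dept_Id { get; set; }
    public int Prod_Id { get; set; }
}

// Treats two categories as equal when they refer to the same product.
public sealed class ByProdId : IEqualityComparer<Category>
{
    public bool Equals(Category? x, Category? y) =>
        ReferenceEquals(x, y) || (x is not null && y is not null && x.Prod_Id == y.Prod_Id);

    public int GetHashCode(Category c) => c.Prod_Id.GetHashCode();
}

public static class DistinctSketch
{
    public static List<Category> Sample() => new List<Category>
    {
        new Category { Dept_Id = 1, Prod_Id = 10 },
        new Category { Dept_Id = 2, Prod_Id = 10 },
        new Category { Dept_Id = 1, Prod_Id = 20 },
    };

    public static void Main()
    {
        var categories = Sample();

        // Projecting first yields ints, so each item has no Prod_Id property.
        IEnumerable<int> ids = categories.Select(c => c.Prod_Id).Distinct();
        Console.WriteLine(string.Join(", ", ids)); // 10, 20

        // Default comparer on a class without Equals overrides: reference equality.
        Console.WriteLine(categories.Distinct().Count()); // 3

        // Custom comparer: one Category per Prod_Id.
        Console.WriteLine(categories.Distinct(new ByProdId()).Count()); // 2
    }
}
```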
As has been explained by the other answers, the error that the OP ran into was because the result of his code was a collection of ints, not a collection of Categories.
What hasn't been answered was his question about how to use the collection of ints in a join or something in order to get at some useful data. I will attempt to do that here.
Now, I'm not really sure why the OP wanted to get a distinct list of Prod_Ids from Categories, rather than just getting the Prod_Ids from Products. Perhaps he wanted to find out which Products are related to one or more Categories, so that any uncategorized Products would be excluded from the results. I'll assume this is the case and that the desired result is a collection of distinct Products that have associated Categories. I'll answer the question about what to do with the Prod_Ids first, and then offer some alternatives.
We can take the collection of Prod_Ids exactly as they were created in the question as a query:
var query = _StoreDB.Categories.Select(m => m.Prod_Id).Distinct();
Then we would use join, like so:
var products = query.Join(_StoreDB.Products, id => id, p => p.Prod_Id,
(id,p) => p);
This takes the query, joins it with the Products table, specifies the keys to use, and finally says to return the Product entity from each matching set. Because we know that the Prod_Ids in query are unique (because of Distinct()) and the Prod_Ids in Products are unique (by definition because it is the primary key), we know that the results will be unique without having to call Distinct().
Now, the above will get the desired results, but it's definitely not the cleanest or simplest way to do it. If the Category entities are defined with a relational property that returns the related record from Products (which would likely be called Product), the simplest way to do what we're trying to do would be the following:
var products = _StoreDB.Categories.Select(c => c.Product).Distinct();
This gets the Product from each Category and returns a distinct collection of them.
If the Category entity doesn't have the Product relational property, then we can go back to using the Join function to get our Products.
var products = _StoreDB.Categories.Join(_StoreDB.Products, c => c.Prod_Id,
p => p.Prod_Id, (c,p) => p).Distinct();
Finally, if we want more than a simple collection of Products, then some more thought would have to go into this, and perhaps the simplest thing would be to handle that when iterating through the Products. Another example would be getting a count of the number of Categories each Product belongs to. If that's the case, I would reverse the logic and start with Products, like so:
var productsWithCount = _StoreDB.Products.Select(p => new { Product = p,
NumberOfCategories = _StoreDB.Categories.Count(c => c.Prod_Id == p.Prod_Id)});
This would result in a collection of anonymously typed objects that reference the Product and the NumberOfCategories related to that Product. If we still needed to exclude any uncategorized Products, we could append .Where(r => r.NumberOfCategories > 0) before the semicolon. Of course, if the Product entity is defined with a relational property for the related Categories, you wouldn't need this because you could just take any Product and do the following:
int NumberOfCategories = product.Categories.Count();
Anyway, sorry for rambling on. I hope this proves helpful to anyone else that runs into a similar issue. ;)

Linq is returning too many results when joined

In my schema I have two database tables. relationships and relationship_memberships. I am attempting to retrieve all the entries from the relationship table that have a specific member in it, thus having to join it with the relationship_memberships table. I have the following method in my business object:
public IList<DBMappings.relationships> GetRelationshipsByObjectId(int objId)
{
var results = from r in _context.Repository<DBMappings.relationships>()
join m in _context.Repository<DBMappings.relationship_memberships>()
on r.rel_id equals m.rel_id
where m.obj_id == objId
select r;
return results.ToList<DBMappings.relationships>();
}
_context is my generic repository using code based on the code outlined here.
The problem is I have 3 records in the relationships table, and 3 records in the memberships table, each membership tied to a different relationship. 2 membership records have an obj_id value of 2 and the other is 3. I am trying to retrieve a list of all relationships related to object #2.
When this linq runs, _context.Repository<DBMappings.relationships>() returns the correct 3 records and _context.Repository<DBMappings.relationship_memberships>() returns 3 records. However, when the results.ToList() executes, the resulting list has 2 issues:
1) The resulting list contains 6 records, all of type DBMappings.relationships(). Upon further inspection there are 2 for each real relationship record, both are an exact copy of each other.
2) All relationships are returned, even if m.obj_id == 3, even though objId variable is correctly passed in as 2.
Can anyone see what's going on because I've spent 2 days looking at this code and I am unable to understand what is wrong. I have joins in other linq queries that seem to be working great, and my unit tests show that they are still working, so I must be doing something wrong with this. It seems like I need an extra pair of eyes on this one :)
Edit: Ok so it seems like the whole issue was the way I designed my unit test, since the unit test didn't actually assign ID values to the records since it wasn't hitting sql (for unit testing).
Marking the answer below as the answer though as I like the way he joins it all together better.
Just try like this
public IList<DBMappings.relationships> GetRelationshipsByObjectId(int objId)
{
var results = (from m in _context.Repository<DBMappings.relationship_memberships>()
where m.obj_id == objId
select m.relationships).ToList();
return results.ToList<DBMappings.relationships>();
}
How about setting _context.Log = Console.Out just to see the generated SQL query? Share the output with us (maybe use a StreamWriter instead of Console.Out so that you can copy it easily and without mistakes).
I might have this backwards, but I don't think you need a join here. If you've setup your foreign keys correctly, this should work, right?
public IList<DBMappings.relationships> GetRelationshipsByObjectId(int objId)
{
var mems = _context.Repository<DBMappings.relationship_memberships>();
var results = mems.Where(m => m.obj_id == objId).Select(m => m.relationships);
return results.ToList<DBMappings.relationships>();
}
Here's the alternative (if I've reversed the mapping in my brain):
public IList<DBMappings.relationships> GetRelationshipsByObjectId(int objId)
{
var mems = _context.Repository<DBMappings.relationship_memberships>();
var results = mems.Where(m => m.obj_id == objId).SelectMany(m => m.relationships);
return results.ToList<DBMappings.relationships>();
}
Let me know if I'm way off with this, and I can take another stab at it.
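Which of the two variants applies depends on the shape of the navigation property: Select fits a single reference, SelectMany flattens a collection. A small in-memory sketch of the difference follows; the class and member names (MembershipSingle, MembershipMulti, Target, Targets) are hypothetical, since the real mappings are not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Relationship
{
    public int RelId { get; set; }
}

// Hypothetical mapping 1: each membership points at exactly one relationship.
public class MembershipSingle
{
    public int ObjId { get; set; }
    public Relationship Target { get; set; } = new Relationship();
}

// Hypothetical mapping 2: each membership exposes a collection of relationships.
public class MembershipMulti
{
    public int ObjId { get; set; }
    public List<Relationship> Targets { get; set; } = new List<Relationship>();
}

public static class NavigationSketch
{
    // Single reference: Select yields one relationship per membership.
    public static List<int> ViaSelect(IEnumerable<MembershipSingle> ms, int objId) =>
        ms.Where(m => m.ObjId == objId).Select(m => m.Target.RelId).ToList();

    // Collection: SelectMany flattens all collections into one sequence.
    public static List<int> ViaSelectMany(IEnumerable<MembershipMulti> ms, int objId) =>
        ms.Where(m => m.ObjId == objId).SelectMany(m => m.Targets).Select(r => r.RelId).ToList();

    public static void Main()
    {
        var single = new List<MembershipSingle>
        {
            new MembershipSingle { ObjId = 2, Target = new Relationship { RelId = 1 } },
            new MembershipSingle { ObjId = 2, Target = new Relationship { RelId = 2 } },
            new MembershipSingle { ObjId = 3, Target = new Relationship { RelId = 3 } },
        };
        Console.WriteLine(string.Join(", ", ViaSelect(single, 2))); // 1, 2

        var multi = new List<MembershipMulti>
        {
            new MembershipMulti { ObjId = 2, Targets = { new Relationship { RelId = 1 },
                                                         new Relationship { RelId = 2 } } },
            new MembershipMulti { ObjId = 3, Targets = { new Relationship { RelId = 3 } } },
        };
        Console.WriteLine(string.Join(", ", ViaSelectMany(multi, 2))); // 1, 2
    }
}
```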
