EF recursive hierarchy - C#

I've got several layers in my application, structured like this: a company has settlements, settlements have sections, sections have machines, machines produce items, to produce items a machine needs tools, ...
At the very end of this hierarchy there are entries recording how many items could be produced with a specific part of a tool (called a cutting tool). Based on that, a statistic can be calculated. On each layer, the statistic results of the next layer down are added up.
Take a look at this diagram: [hierarchy diagram omitted]
On each layer, a statistic is displayed. For example: the user navigates to the second layer (Items), which contains 10 items. The user sees a pie chart displaying the cost of each item. These costs are calculated by adding up all the costs of the item's tools (the next layer down). The costs of the tools are in turn calculated by adding up the costs of the tools' parts, and so on.
I know that's a bit complicated, so if there are any questions, just ask me for a more detailed explanation.
Now my problem: to calculate the cost of an item (the same statistic is provided for machines, tools, ..., i.e. for each layer in the diagram), I need to get all Lifetimes of the item. So I am using a recursive call to skip all the layers between the Item and the Lifetime.
That works quite well, BUT I am using far too many SelectMany LINQ calls. As a result, the performance is extremely bad.
I've thought about joins or stored procedures to speed that up, but I am by far not experienced with database techniques like these. So I want to ask you: what would you do?
Currently I am using something like this:
public IEnumerable<IHierachyEntity> GetLifetimes(IEnumerable<IHierachyEntity> entities)
{
    if (entities is IEnumerable<Lifetime>)
    {
        return entities;
    }
    else
    {
        // Recurse one layer down until we reach the Lifetime layer.
        return GetLifetimes(entities.SelectMany(x => x.Childs));
    }
}

Since this is probably a fairly fixed hierarchy at the heart of your application, I wouldn't mind writing a dedicated piece of code for it. Moreover, writing an efficient generic routine for hierarchical queries is impossible with LINQ against a database backend; the n+1 problem just can't be avoided.
So just do something like this:
public IQueryable<Lifetime> GetLifetimes<T>(IQueryable<T> entities)
{
    var machines = entities as IQueryable<Machine>;
    if (machines != null)
        return machines.SelectMany(m => m.Items)
                       .SelectMany(i => i.Tools)
                       .SelectMany(t => t.Parts)
                       .SelectMany(p => p.Lifetimes);

    var items = entities as IQueryable<Item>;
    if (items != null)
        return items.SelectMany(i => i.Tools)
                    .SelectMany(t => t.Parts)
                    .SelectMany(p => p.Lifetimes);

    var tools = entities as IQueryable<Tool>;
    if (tools != null)
        return tools.SelectMany(t => t.Parts)
                    .SelectMany(p => p.Lifetimes);

    var parts = entities as IQueryable<Part>;
    if (parts != null)
        return parts.SelectMany(p => p.Lifetimes);

    return Enumerable.Empty<Lifetime>().AsQueryable();
}
Repetitive code, yes, but it is crystal clear what happens, and it's probably among the most stable parts of the code base. Repetitive code is only a real problem when continuous maintenance is to be expected.
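For illustration, a hypothetical call site; the Cost property on Lifetime and the DbSet/property names are assumptions, not part of the original code:

// Hypothetical usage: everything stays IQueryable, so the flattening and the
// aggregation below are translated into a single SQL query on the server.
IQueryable<Item> items = db.Items.Where(i => i.SectionId == sectionId);
IQueryable<Lifetime> lifetimes = GetLifetimes(items);
decimal totalCost = lifetimes.Sum(l => l.Cost);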

As far as I understand, you are trying to pull a very long history of your actions. You need to create a routine which updates your statistics as changes happen. There is no "ultimate" solution; you have to figure out what fits. E.g. I have "in" and "out" stock transactions, and to find the current stock level for all items I would have to go through 20 years of history. To get around that, I can keep monthly summaries and only calculate the changes since the start of the month. Or I can use a database trigger to update my summaries as soon as changes happen (which could be costly for performance). Or I can have a service that updates them from time to time (possibly not 100% up to date). In other words, you need a table/class which keeps your aggregated results ready to use.
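A minimal sketch of that idea; ItemCostSummary, the Cost column on Lifetime, and the denormalized ItemId key are all hypothetical names, and the async calls assume an EF-style context:

// Hypothetical aggregate table: one pre-computed row per item.
public class ItemCostSummary
{
    public int ItemId { get; set; }
    public decimal TotalCost { get; set; }
    public DateTime LastUpdatedUtc { get; set; }
}

// Sketch: refresh only the summaries whose underlying Lifetimes changed,
// so reads never have to walk the whole hierarchy again.
public async Task RefreshSummariesAsync(MyContext db, IReadOnlyCollection<int> changedItemIds)
{
    foreach (var itemId in changedItemIds)
    {
        var total = await db.Lifetimes
            .Where(l => l.ItemId == itemId)   // assumes a denormalized ItemId on Lifetime
            .SumAsync(l => l.Cost);

        var summary = await db.ItemCostSummaries.FindAsync(itemId);
        if (summary == null)
        {
            summary = new ItemCostSummary { ItemId = itemId };
            db.ItemCostSummaries.Add(summary);
        }

        summary.TotalCost = total;
        summary.LastUpdatedUtc = DateTime.UtcNow;
    }

    await db.SaveChangesAsync();
}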

Related

I want a way to update a range of records based on a where clause in entity framework without using ToList() and foreach

I have a function in my ASP.NET Core app which updates a bunch of records matching a criterion I write in a where clause. I read that ToList() has bad performance, so is there a better and faster way than using ToList and foreach?
This is my current way of doing it; I would appreciate it if someone could provide a more efficient way:
public async Task UpdateCatalogOnTenantApproval(int tenantID)
{
    var catalogQuery = GetQueryable();
    var catalog = await catalogQuery.Where(x => x.IdTenant == tenantID).ToListAsync();
    catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });
    Context.UpdateRange(catalog);
    await Context.SaveChangesAsync();
}
I read that ToList() has bad performance
That is wrong. ToList has as good a performance as you will get. Submit a bad query, one which is overly complex and results in bad SQL that SQL Server takes ages to execute, and that is what is slow.
Also, many people think ToList is slow because that is where the time shows up in a profiler. You start with a db context, take a set of entities, add some where clauses: all fast. Then ToList, and it takes "long" (compared to the rest). Well, THAT is the point where the query is actually sent to the SQL server. Where(x => whatever) takes "no time" because all it does is add some nodes to the expression tree, not execute the query. THAT is what people mix up: deferred execution, which executes only when the results are asked for.
And third, some people write ToList().Where() and then complain about performance. Filter as much as possible on the DB.
All three reasons are why people think ToList is slow, but all they show is a lack of understanding of how LINQ and SQL operate.
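A small illustration of the deferred-execution point (Catalogs here is a hypothetical DbSet standing in for your GetQueryable()):

// Composing the query does not touch the database at all:
IQueryable<Catalog> query = Context.Catalogs
    .Where(x => x.IdTenant == tenantID)      // just adds a node to the expression tree
    .Where(x => !x.IsApprovedByAdmin);       // still no SQL sent

// Only here is SQL generated, executed on the server, and materialized,
// which is why a profiler attributes all the time to this line:
List<Catalog> results = await query.ToListAsync();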
Entity Framework does not handle bulk update operations by default -- hence your existing code. If you really want to do these bulk operations, then you have two options:
Write the SQL yourself and use the ExecuteSqlCommand() method to execute it (see the sketch below); or
Look at 3rd-party extensions, such as https://entityframework-extensions.net/
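A minimal sketch of the first option, assuming the table is named Catalogs and the columns match the property names (adjust to your actual schema; on EF Core 3+ the method is ExecuteSqlRawAsync):

// One set-based UPDATE on the server instead of loading and re-saving every row.
await Context.Database.ExecuteSqlCommandAsync(
    "UPDATE Catalogs SET IsApprovedByAdmin = 1, IsActive = 1 WHERE IdTenant = {0}",
    tenantID);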
We can reduce the query cost by selecting a subset of the data before attaching it for EF to track, and then updating.
However, this may be pointless micro-optimization that does not perform significantly better unless you are processing a massive number of records.
// Select only the PK (for EF to track) and the 2 fields to be modified.
var catalog = await catalogQuery
    .Where(x => x.IdTenant == tenantID)
    .Select(x => new Catalog
    {
        CatalogId = x.CatalogId,
        IsApprovedByAdmin = x.IsApprovedByAdmin,
        IsActive = x.IsActive
    })
    .ToListAsync();

// Next we attach the range so EF tracks the list.
Context.AttachRange(catalog);

// Perform your update as usual; changed properties are flagged as modified.
catalog.ForEach(c => { c.IsApprovedByAdmin = true; c.IsActive = true; });

// Save and let EF update based on the modified fields only.
await Context.SaveChangesAsync();
Let me explain what you have done and what you are trying to do.
You are partially right about the performance issues related to ToList and ToListAsync, as they are mainly responsible for loading entities into memory and tracking them.
Based on that: if your request is expected to deal only with light data, you are not required to change your code. If not, however, there are many possible approaches, each with its pros and cons, and you have to weigh them case by case if you want to avoid dual app-and-SQL round trips.
Let's be more realistic by talking about your case:
1- We assume that your method is resource-consuming (loading a high volume of data, being called intensively, or both).
2- I see the modification is static: every row is updated with c.IsApprovedByAdmin = true; c.IsActive = true;
From (1) and (2) I suggest writing a stored procedure, or using ExecuteSqlCommand (as Bryan Lewis suggested), to do this for you.
3- Beware, however, that stored procedures, triggers, and SQL-based operations in general are hard to maintain and highly prone to hidden exceptions. In your case you are less likely to fall into that, as your code is quite basic, and you could reduce the risk further by constructing your query from dynamic elements such as nameof(YourClassName) for the table name, nameof on your properties, and the like.
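A sketch of that nameof idea; it assumes the table name is the pluralized class name, which you would need to verify against your schema:

// Build the UPDATE from nameof(...) so that renaming a class or property
// breaks the build instead of silently breaking the SQL string.
var sql = $"UPDATE {nameof(Catalog)}s " +
          $"SET {nameof(Catalog.IsApprovedByAdmin)} = 1, {nameof(Catalog.IsActive)} = 1 " +
          $"WHERE {nameof(Catalog.IdTenant)} = {{0}}";
await Context.Database.ExecuteSqlCommandAsync(sql, tenantID);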
Anyway, this just goes to show that there is no ideal approach; you have to study each case on its own.
Finally, I do not agree with the third-party extensions: most of the free ones are developed by non-professionals and tracking down exceptions caused by them is a nightmare, while the paid versions are too expensive and not zero-exception either. The third-party extensions are more oriented towards complex bulk updates/deletes and/or huge data, e.g.:
await Context.UpdateAsync(e => new Catalog
{
    Archived = e.LastUpdate > DateTime.UtcNow.AddYears(-99) ? false : true
});

Optimize/faster choice than "not all" in LINQ

I'm using EF 6 and .NET Framework 4.6.1. I have a scenario where I need to exclude a parent record if all of its child records meet a certain condition.
This is a generic version of what I've done so far:
public IQueryable<ParentRecord> GetParentRecordsExceptWhereSpecificStringOnAllChildren(string aSpecificString)
{
    return ParentRecords
        .Where(parent => !parent.ChildRecords
            .Select(child => child.SomeStringProperty)
            .All(c => c.Equals(aSpecificString))
        );
}
This takes a little too much time to run (on the order of one second per child record), and the SQL generated by EF contains n-1 UNION ALL statements, where n is the number of child records.
I suspect I'm missing an obvious way to write this that would improve performance dramatically, but I'm not seeing it (I'm not a LINQ/EF master by any means).
I wrote a stored procedure that returns the same data much faster, though not in exactly the same layout (one flat row versus a row per child record). We're trying to avoid stored procedures, though, so I'm back at the grindstone figuring out how to make this LINQ faster.
Any suggestions would be greatly appreciated. If I haven't explained this clearly, please let me know. I tried to make it generic for the sake of re-use, in case anyone else is in this situation.
You can remove the Select of SomeStringProperty from your code. Note that !All(predicate) is logically equivalent to Any(!predicate), and EF translates Any into a correlated EXISTS subquery, which avoids the UNION ALL pattern you are seeing.
Using Any:
ParentRecords.Where(parent =>
    parent.ChildRecords.Any(child => !child.SomeStringProperty.Equals(aSpecificString)));
Using All:
ParentRecords.Where(parent =>
    !parent.ChildRecords.All(child => child.SomeStringProperty.Equals(aSpecificString)));

Improving performance of big EF multi-level Include

I'm an EF noob (as in, I just started today; I've only used other ORMs), and I'm experiencing a baptism of fire.
I've been asked to improve the performance of this query created by another dev:
var questionnaires = await _myContext.Questionnaires
.Include("Sections")
.Include(q => q.QuestionnaireCommonFields)
.Include("Sections.Questions")
.Include("Sections.Questions.Answers")
.Include("Sections.Questions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers")
.Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
.Where(q => questionnaireIds.Contains(q.Id))
.ToListAsync().ConfigureAwait(false);
A quick web-surf tells me that Include() produces a columns × rows blow-up and poor performance when it is nested multiple levels deep.
I've seen some helpful answers on SO, but their examples are less complex, and I can't figure out the best approach for rewriting the above.
The repeated part - "Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers..." - looks suspicious to me, like it could be fetched separately with a second query, but I don't know how to build that up or whether such an approach would even improve performance.
Questions:
How do I rewrite this query to something more sensible to improve performance, while ensuring that the eventual result set is the same?
Given the last line: .Include("Sections.Questions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.SubQuestions.Answers.AnswerMetadatas")
Why do I need all the intermediate lines? (I guess it's because some of the joins may not be left joins?)
EF Version info: package id="EntityFramework" version="6.2.0" targetFramework="net452"
I realise this question is a bit rubbish, but I'm trying to resolve it as fast as I can from a point of no knowledge.
Edit
After mulling over this for half a day and thanks to StuartLC's suggestions I came up with some options:
Poor - split the query so that it performs multiple round trips to fetch the data. This is likely to give the user a slightly slower experience, but will stop the SQL timing out. (This is not much better than just increasing the EF command timeout.)
Good - change the clustered indexing on child tables to be clustered by their parent's foreign key (assuming you don't have a lot of insert operations).
Good - change the code to only query the first few levels and lazy-load (a separate DB hit) anything below them, i.e. remove all but the top few Includes and mark the ICollections - Answers.SubQuestions, Answers.AnswerMetadatas, and Question.Answers - as virtual (see the sketch after these options). Presumably the downside to making these virtual is that if any other existing code in the app expects those ICollection properties to be eager-loaded, you may have to update that code (i.e. if you want/need them to load immediately there). I will be investigating this option further. Further edit - unfortunately this won't work if you need to serialize the response, due to the self-referencing loop.
Non-trivial - write a SQL stored proc/view manually and point a new EF object at it.
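A sketch of the lazy-loading option; the element types SubQuestion and AnswerMetadata are hypothetical stand-ins for whatever the real model uses (EF6 generates lazy-loading proxies for virtual navigation collections):

// EF6 sketch: eager-load only the top levels, and let everything below
// load on first access through lazy-loading proxies.
public class Answer
{
    public int Id { get; set; }

    // virtual => the EF6 proxy fetches these on first access (one extra DB hit each)
    public virtual ICollection<SubQuestion> SubQuestions { get; set; }
    public virtual ICollection<AnswerMetadata> AnswerMetadatas { get; set; }
}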
Longer term
The obvious, best, but most time-consuming option: rewrite the app design so it doesn't need the whole data tree in a single API call, or go with the option below.
Rewrite the app to store the data in a NoSQL fashion (e.g. store the object tree as JSON so there are no joins). As Stuart mentioned, this is not a good option if you need to filter the data in other ways (by something other than the questionnaireId), which you might need to do. Another alternative is to store the data partially NoSQL-style and partially relational, as required.
First up, it must be said that this isn't a trivial query. Seemingly we have:
6 levels of recursion through a nested question-answer tree
A total of 20 tables are joined in this way via eager-loaded .Include
I would first take the time to determine where this query is used in your app and how often each of those usages is actually hit, with particular attention to the most frequent ones.
YAGNI optimizations
The obvious place to start is to see where the query is used in your app; if you don't need the whole tree all the time, then don't join in the nested question and answer tables for the usages that don't need them.
Also, it is possible to compose an IQueryable dynamically, so if there are multiple use cases for your query (e.g. a "Summary" screen which doesn't need the questions + answers, and a details tree which does need them), then you can do something like:
var questionnaireQuery = _myContext.Questionnaires
    .Include(q => q.Sections)
    .Include(q => q.QuestionnaireCommonFields);

// Conditionally extend the joins
if (mustIncludeQandA)
{
    questionnaireQuery = questionnaireQuery
        .Include(q => q.Sections.Select(s => s.Questions.Select(qu => qu.Answers..... etc);
}

// Execute + materialize the query
var questionnaires = await questionnaireQuery
    .Where(q => questionnaireIds.Contains(q.Id))
    .ToListAsync()
    .ConfigureAwait(false);
SQL Optimizations
If you really have to fetch the whole tree all the time, then look at your SQL table design and indexing.
1) Filters
.Where(q => questionnaireIds.Contains(q.Id))
(I'm assuming SQL Server terminology here, but the concepts are applicable in most other RDBMs as well.)
I'm guessing Questionnaires.Id is a clustered primary key, so it will already be indexed, but check for sanity's sake (it will look something like PK_Questionnaires CLUSTERED UNIQUE PRIMARY KEY in SSMS).
2) Ensure all child tables have indexes on their foreign keys back to the parent.
e.g. q => q.Sections means that table Sections has a foreign key back to Questionnaires.Id - make sure this has at least a non-clustered index on it. EF Code First should do this automagically, but check to be sure.
This would look like IX_QuestionnaireId NONCLUSTERED on column Sections(QuestionnaireId).
3) Consider changing the clustered index on child tables to cluster on the parent's foreign key, e.g. cluster Questions on Questions.SectionId. This keeps all child rows belonging to the same parent physically together and reduces the number of data pages SQL needs to fetch. It isn't trivial to achieve in EF Code First, but your DBA can assist you, perhaps as a custom migration step.
Other comments
If this query is only used to read data, not to update or delete, then adding .AsNoTracking() will marginally reduce EF's memory consumption and in-memory overhead.
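For instance, a sketch of the same query with tracking disabled (only the relevant lines shown):

// Read-only query: EF skips change tracking for every materialized entity.
var questionnaires = await _myContext.Questionnaires
    .AsNoTracking()
    .Include(q => q.QuestionnaireCommonFields)
    // ... the remaining Includes as before ...
    .Where(q => questionnaireIds.Contains(q.Id))
    .ToListAsync()
    .ConfigureAwait(false);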
Unrelated to performance, but you've mixed the weakly typed ("Sections") and strongly typed (q => q.QuestionnaireCommonFields) .Include statements. I would suggest moving to the strongly typed includes for the additional compile-time safety.
Note that you only need to specify the include path for the longest chain(s) to be eager-loaded; this obviously forces EF to include all higher levels too, i.e. you can reduce the 20 .Include statements to just 2. This will do the same job more efficiently:
.Include(q => q.QuestionnaireCommonFields)
.Include(q => q.Sections.Select(s => s.Questions.Select(q => q.Answers .... etc))
You'll need a .Select any time there is a 1:many relationship, but if the navigation is 1:1 (or N:1) then you don't need the .Select, e.g. City c => c.Country.
Redesign
Last but not least, if data is only ever filtered from the top level (i.e. Questionnaires), and if the whole questionnaire 'tree' (aggregate root) is typically added or updated all at once, then you might try to model the question-and-answer tree in a NoSQL way, e.g. by modelling the whole tree as XML or JSON and treating it as one long string. This avoids all the nasty joins altogether. You would need a custom deserialization step in your data tier. This latter approach won't be very useful if you need to filter on nodes within the tree (i.e. a query like "find me all questionnaires where the sub-answer to question 5 is 'Foo'" won't be a good fit).
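A sketch of that redesign; QuestionnaireDocument and TreeJson are hypothetical names, and Newtonsoft.Json stands in for whatever serializer you prefer:

using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical document table: the whole tree lives in one string column.
public class QuestionnaireDocument
{
    public int QuestionnaireId { get; set; }   // still filterable from the top level
    public string TreeJson { get; set; }       // serialized Sections/Questions/Answers tree
}

public static class QuestionnaireSerializer
{
    // Write path: serialize the aggregate root in one go, no joins needed later.
    public static QuestionnaireDocument ToDocument(Questionnaire questionnaire) =>
        new QuestionnaireDocument
        {
            QuestionnaireId = questionnaire.Id,
            TreeJson = JsonConvert.SerializeObject(questionnaire.Sections)
        };

    // Read path: rebuild the tree from the stored string in the data tier.
    public static List<Section> FromDocument(QuestionnaireDocument doc) =>
        JsonConvert.DeserializeObject<List<Section>>(doc.TreeJson);
}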

Efficiency of C# Find on 1000+ records

I am trying to see whether entities exist in a local context and sort them accordingly. This function seems to be faster than others we have tried, running in about 50 seconds for 1000 items, but I am wondering if there is something I can do to improve the efficiency. I believe the Find here is slowing it down significantly: a simple foreach iteration over 1000 items takes milliseconds, and benchmarking shows the bottleneck is there. Any ideas would be helpful. Thank you.
Sample code:
foreach (var entity in entities)
{
    var localItem = db.Set<T>().Find(Key);
    if (localItem != null)
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
If this is a database (which, from the comments, I gather it is...)
you would be better off doing fewer queries.
list1.AddRange(db.Set<T>().Where(x => x.Key == Key));
list2.AddRange(db.Set<T>().Where(x => x.Key != Key));
This would be 2 queries instead of 1000+.
Also be aware that by adding each one to a List<T>, you're keeping 2 large collections in memory. So if 1000+ turns into 10,000,000, you're going to have interesting memory issues.
See this post on my blog for more information: http://www.artisansoftware.blogspot.com/2014/01/synopsis-creating-large-collection-by.html
If I understand correctly, the database seems to be the bottleneck? If you want to efficiently select data from a database relation whose attribute x must match an ==-criterion, you should consider creating a secondary access path for that attribute (an index structure). Depending on your database system and the value distribution in your table, this might be a hash index (especially good for == checks), a B+-tree (an all-rounder), or whatever else your system offers.
However, this only helps if... (see the sketch after this list)
you don't just fetch the full data set once and then live with it in your application;
adding (another) index to the relation is not out of the question (e.g. it may not be worth having one for a single need);
an index would actually be effective - e.g. it won't be if the attribute you are querying on has very few unique values.
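For example, in EF6 Code First a non-clustered index on the queried attribute can be declared with an attribute; MyEntity and its Key property are hypothetical stand-ins (in EF Core you would use the fluent HasIndex API instead):

using System.ComponentModel.DataAnnotations.Schema;

public class MyEntity
{
    public int Id { get; set; }

    // EF6 (6.1+) Code First: emits IX_MyEntity_Key on this column in the next migration.
    [Index("IX_MyEntity_Key")]
    public string Key { get; set; }
}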
I found your answers very helpful, but here is how I ultimately solved the problem. It seemed .Find was the bottleneck.
// One query to load everything, then O(1) dictionary lookups instead of Find.
var tableDictionary = db.Set<T>().ToDictionary(x => x.KeyValue, x => x);

foreach (var entity in entities)
{
    if (tableDictionary.ContainsKey(entity.KeyValue))
    {
        list1.Add(entity);
    }
    else
    {
        list2.Add(entity);
    }
}
This ran with 900+ rows in about a tenth of a second, which for our purposes was efficient enough.
Rather than querying the DB for each item, you can do just one query, get all of the data (since you eventually want all of the data from the DB anyway), and then group it in memory, which can be done (in this case) about as efficiently as in the database. By creating a lookup keyed on whether or not the key matches, we can easily get the two groups:
var lookup = db.Set<T>().ToLookup(item => item.Key == Key);
var list1 = lookup[true].ToList();
var list2 = lookup[false].ToList();
(You can use AddRange instead if the lists have previous values that should also be in them.)

Code efficiency/optimization

Not really sure if this is a proper question or not, but I figure that I'll give it a go and see what kind of answers pop up.
We're at the point in our development where we are going into User Acceptance Testing, and one of the things the users have found a little lacking is the speed at which tabs load after a search result is selected. I've implemented logging and have identified a few culprits - methods and data retrieval/manipulation steps that cause the perceived slowness. The method below is the biggest issue.
Its purpose is to select all payments received towards a policy or any sub-policies, group them by both due date and paid date, and return a GroupedPayments result that sums the amounts paid towards the whole policy. I'm wondering if there's any way this can be made more efficient. I've noticed that, working with this old UniVerse data, things tend to break if they aren't cast with .AsEnumerable() before being utilized:
var mc = new ModelContext();
var policy = mc.Polmasts.Find("N345348");

var payments = mc.Paymnts
    .Where(p => p.POLICY.Contains(policy.ID))
    .GroupBy(p => new { p.PAYDUE_, p.PAYPD_ })
    .Select(grp => new GroupedPayments
    {
        PAYPD_ = grp.Key.PAYPD_,
        PAYDUE_ = grp.Key.PAYDUE_,
        AMOUNT = grp.Sum(a => a.AMOUNT),
        SUSP = grp.Sum(a => a.SUSP)
    })
    .AsEnumerable()
    .OrderByDescending(g => g.PAYDUE_)
    .Take(3);
I've noticed that working with this old UniVerse data, things tend to break if they aren't cast .AsEnumerable() before being utilized
This goes to the root of your problems. By saying AsEnumerable, you are forcing all records in the sequence at that point to be brought down to the client before you sort and take the first three. Obviously, this gets slower and slower as the data grows.
Fixing this could be difficult, given what you say. In general, LINQ providers vary in how much of a query they can evaluate on the server. From your comment above, it sounds like LINQ-to-UniVerse doesn't do particularly well at pushing work to the server.
For example, I would expect any good database LINQ provider to be able to run (using made-up definitions)
context.Products.Where(p => p.Type == 4).OrderBy(p => p.Name)
on the server; however, your code above is more taxing. Try splitting it into smaller pieces and establishing whether the server can be made to do the sort and the Take(3). It might be that the best approach is one query (which can run on the server) to get the latest three PAYDUE_ values, then another to actually fetch the amounts for those dates, pulling all relevant records down to the client, as sketched below.
Assuming you're running against SQL Server, I would enable profiling. LINQ has a habit of not producing the SQL you'd like it to; it's much more likely that the slowdown comes from bad SQL than from in-memory operations.
