Refactor GroupBy to avoid slowing down operation on big dataset - c#

I have a big collection where I need to get the newest item based on two properties.
The first step is ordering the list based on the date prop. This is all fine and pretty quick.
Then I group the new list by two properties, and take the first item from each.
var one = Fisks.Where(s=>s.Havn.Id == 1).OrderByDescending(s=>s.Date);
var two = one.GroupBy(s=>new {s.Arter.Name, s.Sort});
var three = two.Select(s=>s.FirstOrDefault());
This works, but it is really slow on the large collection. How can I avoid using GroupBy but still get the same result?
Thanks!

Using LINQ only for the first step and then taking the first ones in a loop gives you more control over the process and avoids grouping altogether:
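// Sort by the grouping keys first so that items sharing a key are adjacent,
// then by date so that the newest item of each run comes first.
var query = Fisks
.Where(f => f.Havn.Id == 1)
.OrderBy(f => f.Arter.Name)
.ThenBy(f => f.Sort)
.ThenByDescending(f => f.Date);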
var list = new List<Fisk>();
foreach (Fisk fisk in query) {
if (list.Count == 0) {
list.Add(fisk);
} else {
Fisk last = list[list.Count - 1];
if (fisk.Sort != last.Sort || fisk.Arter.Name != last.Arter.Name) {
list.Add(fisk);
}
}
}
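As an alternative that skips both the full sort and the grouping, here is a minimal sketch that keeps only the newest item per key in a dictionary during a single pass. The tuple-based sample data is a made-up stand-in for the Fisk entities, and the sketch assumes the filtered collection is already in memory (it is not something LINQ to SQL would translate).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in data: (Name, Sort, Date, Title) tuples instead of the real Fisk entity.
var fisks = new[]
{
    (Name: "A", Sort: 1, Date: new DateTime(2020, 1, 1), Title: "A1"),
    (Name: "A", Sort: 1, Date: new DateTime(2020, 1, 2), Title: "A2"),
    (Name: "B", Sort: 1, Date: new DateTime(2020, 1, 1), Title: "B1"),
    (Name: "B", Sort: 1, Date: new DateTime(2020, 1, 3), Title: "B2"),
};

// One pass: remember the newest item seen so far for each (Name, Sort) key.
var newest = new Dictionary<(string Name, int Sort), (string Name, int Sort, DateTime Date, string Title)>();
foreach (var fisk in fisks)
{
    var key = (fisk.Name, fisk.Sort);
    if (!newest.TryGetValue(key, out var current) || fisk.Date > current.Date)
    {
        newest[key] = fisk;
    }
}

// Only the (much smaller) result set is sorted.
var result = newest.Values.OrderByDescending(f => f.Date).ToList();
foreach (var f in result)
{
    Console.WriteLine($"{f.Name}/{f.Sort}: {f.Title}");
}
```

When two items share a key and a date, this keeps whichever was seen first; change the comparison if you need a different tie-break.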

Generally I advise against ordering before doing something that possibly destroys that order (as GroupBy can in the SQL generated by LINQ2SQL). Also try to order only the data you are actually going to use. You can improve query performance if you limit the selection to only the required fields/properties. You can fiddle around with this sample and use your real backend instead:
var Fisks=new[]{
new {Havn=new{Id=1},Date=DateTime.MinValue,Arter=new{Name="A"},Sort=1,Title="A1"},
new {Havn=new{Id=1},Date=DateTime.MinValue.AddDays(1),Arter=new{Name="A"},Sort=1,Title="A2"},
new {Havn=new{Id=1},Date=DateTime.MinValue,Arter=new{Name="B"},Sort=1,Title="B1",},
new {Havn=new{Id=1},Date=DateTime.MinValue.AddDays(2),Arter=new{Name="B"},Sort=1,Title="B2",},
new {Havn=new{Id=1},Date=DateTime.MinValue.AddDays(2),Arter=new{Name="B"},Sort=1,Title="B3",},
};
var stopwatch=Stopwatch.StartNew();
var one = Fisks.Where(s=>s.Havn.Id == 1).OrderByDescending(s=>s.Date);
var two = one.GroupBy(s=>new {s.Arter.Name, s.Sort});
var three = two.Select(s=>s.FirstOrDefault());
var answer=three.ToArray();
stopwatch.Stop();
stopwatch.ElapsedTicks.Dump("elapsed Ticks");
answer.Dump();
stopwatch.Restart();
answer=Fisks
.Where(f=>f.Havn.Id.Equals(1))
.GroupBy(s=>new {s.Arter.Name, s.Sort},(k,g)=>new{
s=g.OrderByDescending(s=>s.Date).First()//TOP 1 -> quite fast
})
.Select(g=>g.s)
.OrderByDescending(s=>s.Date) // only fully order results
.ToArray();
stopwatch.Stop();
stopwatch.ElapsedTicks.Dump("elapsed Ticks");
answer.Dump();
If you're working against a SQL Server you should check the generated SQL in LINQPad. You don't want to end up with an N+1 query. Having an index on Havn.Id and Fisks.Date might also help.

Related

How to make a linq-query with multiple Contains()/Any() on possibly empty lists?

I am trying to make a query to a database view based on earlier user-choices. The choices are stored in lists of objects.
What I want to achieve is for a record to be added to the reportViewList if the stated value exists in one of the lists, but if for example the clientList is empty the query should overlook this statement and add all clients in the selected date-range. The user-choices are stored in temporary lists of objects.
The first condition is a time-range, this works fine. I understand why my current solution does not work, but I can not seem to wrap my head around how to fix it. This example works when both a client and a product is chosen. When the lists are empty the reportViewList is obviously also empty.
I have played with the idea of adding all the records in the date-range and then removing the ones that do not fit, but this would be a bad solution and not efficient.
Any help or feedback is much appreciated.
List<ReportView> reportViewList = new List<ReportView>();
using(var dbc = new localContext())
{
reportViewList = dbc.ReportViews.AsEnumerable()
.Where(x => x.OrderDateTime >= from && x.OrderDateTime <= to)
.Where(y => clientList.Any(x2 => x2.Id == y.ClientId))
.Where(z => productList.Any(x3 => x3.Id == z.ProductId)).ToList();
}
You should not call AsEnumerable() before you have added everything to your query. Calling AsEnumerable() here will cause your complete data to be loaded into memory and then filtered in your application.
Without AsEnumerable(), and before calling ToList() (or, better, ToListAsync()), you are working with an IQueryable<ReportView>. You can easily compose it and just call ToList() on your final query.
Entity Framework will then examine your IQueryable<ReportView> and generate an SQL expression out of it.
For your problem, you just need to check if the user has selected any filters and only add them to the query if they are present.
using var dbc = new localContext();
var reportViewQuery = dbc.ReportViews.AsQueryable(); // You could also write IQueryable<ReportView> reportViewQuery = dbc.ReportViews; but I prefer it this way as it is a little safer when you are refactoring.
// Assuming from and to are nullable and are null if the user has not selected them.
if (from.HasValue)
reportViewQuery = reportViewQuery.Where(r => r.OrderDateTime >= from);
if (to.HasValue)
reportViewQuery = reportViewQuery.Where(r => r.OrderDateTime <= to);
if(clientList is not null && clientList.Any())
{
var clientIds = clientList.Select(c => c.Id).ToHashSet();
reportViewQuery = reportViewQuery.Where(r => clientIds.Contains(r.ClientId));
}
if(productList is not null && productList.Any())
{
var productIds = productList.Select(p => p.Id).ToHashSet();
reportViewQuery = reportViewQuery.Where(r => productIds.Contains(r.ProductId));
}
var reportViews = await reportViewQuery.ToListAsync(); // You can also use ToList(), if you absolutely must, but I would not recommend it as it will block your current thread.
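To try the composition pattern without a database, here is a minimal sketch that runs against an in-memory IQueryable. The anonymous-type rows and the names fromDate, toDate, clientIds and productIds are made up for illustration; with Entity Framework the same Where calls would be translated to SQL instead of being evaluated in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory rows standing in for dbc.ReportViews.
var rows = new[]
{
    new { ClientId = 1, ProductId = 10, OrderDateTime = new DateTime(2023, 1, 5) },
    new { ClientId = 2, ProductId = 10, OrderDateTime = new DateTime(2023, 1, 6) },
    new { ClientId = 2, ProductId = 20, OrderDateTime = new DateTime(2023, 2, 1) },
}.AsQueryable();

DateTime? fromDate = new DateTime(2023, 1, 1);
DateTime? toDate = new DateTime(2023, 1, 31);
var clientIds = new List<int>();        // empty: the user picked no clients
var productIds = new List<int> { 10 };  // the user picked one product

// Start from the unfiltered query and add a filter only when the user supplied one.
var query = rows;
if (fromDate.HasValue) query = query.Where(r => r.OrderDateTime >= fromDate.Value);
if (toDate.HasValue) query = query.Where(r => r.OrderDateTime <= toDate.Value);
if (clientIds.Count > 0) query = query.Where(r => clientIds.Contains(r.ClientId));
if (productIds.Count > 0) query = query.Where(r => productIds.Contains(r.ProductId));

// Nothing is evaluated until the query is materialised here.
var filtered = query.ToList();
Console.WriteLine(filtered.Count);
```

Because the client list is empty, no client filter is added at all, so both January orders for product 10 come back regardless of client.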

How to Performance Test This and Suggestions to Make Faster?

I seem to have written some very slow piece of code which gets slower when I have to deal with EF Core.
Basically I have a list of items that store attributes in a Json string in the database as I am storing many different items with different attributes.
I then have another table that contains the display order for each attribute, so when I send the items to the client I order them based on that order.
It is kinda slow at doing 700 records in about 18-30 seconds (from where I start my timer, not the whole block of code).
var inventoryItemDtos = new List<InventoryItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId);
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
Stopwatch a = new Stopwatch();
a.Start();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
var specDtos = new List<SpecDto>();
foreach (var inventorySpecification in inventorySpecifications.OrderBy(x => x.DisplayOrder))
{
if (specs.ContainsKey(inventorySpecification.JsonKey))
{
var value = specs.GetValue(inventorySpecification.JsonKey);
var newSpecDto = new SpecDto()
{
Key = inventorySpecification.JsonKey,
Value = value.ToString()
};
specDtos.Add(newSpecDto);
}
}
var dto = new InventoryItemDto()
{
// create dto
};
inventoryItemDtos.Add(dto);
}
Now it goes crazy slow when I add some more columns that I need info from.
In the //create dto area I access some information from other tables
var dto = new InventoryItemDto()
{
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
};
Accessing these columns in the loop takes 6 minutes to process 700 rows.
I don't understand why it is so slow, it's the only change I really made and I made sure to eager load everything in.
To me it almost makes me think eager loading is not working, but I don't know how to verify if it is or not.
var inventoryItems = dbContext.InventoryItems.Include(x => x.Branch).ThenInclude(x => x.Company)
.Include(x => x.Branch).ThenInclude(x => x.Country)
.Include(x => x.Branch).ThenInclude(x => x.State)
.Include(x => x.Brand)
.Where(x => x.InventoryCategoryId == categoryId).ToList();
So I thought that, because of this, the speed would not be that much different than the original 18-30 seconds.
I would like to speed up the original code too but I am not really sure how to get rid of the dual foreach loops that is probably slowing it down.
First, loops inside loops are costly, so you should refactor that into a single loop where you can. This should not be a problem because inventorySpecifications is declared outside the loop.
Second, the line
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
should end with ToList(), because its enumeration happens within the inner foreach, which means the query runs once for each of the inventoryItems.
That should save you a good amount of time.
I'm no expert but this part of your second foreach raises a red flag: inventorySpecifications.OrderBy(x => x.DisplayOrder). Because this is getting called inside another foreach it's doing the .OrderBy call every time you iterate over inventoryItems.
Before your first foreach loop, try this: var orderedInventorySpecs = inventorySpecifications.OrderBy(x => x.DisplayOrder); and then use foreach (var inventorySpec in orderedInventorySpecs) and see if it makes a difference.
To help you better understand what EF is running behind the scenes add some logging in to expose the SQL being run which might help you see how/where your queries are going wrong. This can be extremely helpful to help determine if your queries are hitting the DB too often. As a very general rule you want to hit the DB as few times as possible and retrieve only the information you need via the use of .Select() to reduce what is being returned. The docs for the logging are: http://learn.microsoft.com/en-us/ef/core/miscellaneous/logging
I obviously cannot test this and I am a little unsure where your specDto's go once you have them but I assume they become part of the InventoryItemDto?
var itemDtos = new List<ItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId).Select(x => new InventoryItemDto() {
Attributes = x.Attributes,
//.....
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
}).ToList();
var inventorySpecifications = dbContext.InventoryCategorySpecifications
.Where(x => x.InventoryCategoryId == categoryId)
.Select(x => x.InventorySpecification)
.OrderBy(x => x.DisplayOrder) // DisplayOrder is read from InventorySpecification, as in the question
.ToList();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
// Assuming the specs become part of an inventory item?
item.specs = inventorySpecifications.Where(x => specs.ContainsKey(x.JsonKey)).Select(x => new SpecDto() { Key = x.JsonKey, Value = specs.GetValue(x.JsonKey).ToString() }).ToList();
}
The first call to the DB for inventoryItems should produce one SQL query that will pull all the information you need at once to construct your InventoryItemDto and thus only hits the DB once. Then it pulls the specs out and uses OrderBy() before materialising which means the OrderBy will be run as part of the SQL query rather than in memory. Both those results are materialised via .ToList() which will cause EF to pull the results into memory in one go.
Finally the loop goes over your constructed inventoryItems, parses the Json and then filters the specs based on that. I am unsure of where you were using the specDtos so I made an assumption that it was part of the model. I would recommend checking the performance of the Json work you are doing as that could be contributing to your slowdown.
A more integrated approach to using Json as part of your EF models can be seen at this answer: https://stackoverflow.com/a/51613611/621524 however you will still be unable to use those properties to offload execution to SQL as accessing properties that are defined within code will cause queries to fragment and run in several parts.
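The in-memory part of this (sort the specifications once, then reuse that list for every item) can be sketched without a database. This is only an illustration under assumptions: the sample JSON strings and keys are made up, and it uses System.Text.Json rather than the JObject API from the question so that it runs without extra packages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical data: each item's attributes as a JSON string, plus a display order per key.
var itemAttributes = new[]
{
    "{\"color\":\"red\",\"weight\":\"2kg\"}",
    "{\"weight\":\"5kg\",\"size\":\"L\"}",
};
var specifications = new[]
{
    (JsonKey: "weight", DisplayOrder: 2),
    (JsonKey: "color", DisplayOrder: 1),
    (JsonKey: "size", DisplayOrder: 3),
};

// Sort once, outside the loop, and materialise the result.
var orderedKeys = specifications
    .OrderBy(s => s.DisplayOrder)
    .Select(s => s.JsonKey)
    .ToList();

var itemSpecs = new List<List<KeyValuePair<string, string>>>();
foreach (var json in itemAttributes)
{
    using var doc = JsonDocument.Parse(json);
    var specs = new List<KeyValuePair<string, string>>();
    foreach (var key in orderedKeys)
    {
        // TryGetProperty does the existence check and the fetch in one step.
        if (doc.RootElement.TryGetProperty(key, out var value))
        {
            specs.Add(new KeyValuePair<string, string>(key, value.ToString()));
        }
    }
    itemSpecs.Add(specs);
}

foreach (var specs in itemSpecs)
{
    Console.WriteLine(string.Join(", ", specs.Select(s => $"{s.Key}={s.Value}")));
}
```

The inner loop is still there, but it now walks a small in-memory list instead of re-running and re-sorting a query for every item.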

Looping through fields to get sum

I know there are probably 100 much easier ways to do this but until I see it I can't comprehend how to go about it. I'm using LINQPad for this. Before I go to phase 2 I need to get this part to work!
I've connected to an SQL database.
I'm running a query to retrieve some desired records.
var DesiredSym = (from r in Symptoms
where r.Status.Equals(1) && r.Create_Date < TimespanSecs
select r).Take(5);
So, in this example, I retrieve 5 'records' essentially in my DesiredSym variable as IQueryable (LINQPad tells me this).
The DesiredSym contains a large number of fields, including a number of fields that each hold an int: Month1_Used, Month2_Used, Month3_Used .... Month12_Used.
So I want to loop through the DesiredSym and basically get the sum of all the Monthx_Used fields.
foreach (var MonthUse in DesiredSym)
{
// get sum of all fields where they start with MonthX_Used;
}
This is where I'm not clear on how to proceed or even if I'm on the right track. Thanks for getting me on the right track.
Since you've got a static number of fields, I'd recommend this:
var DesiredSym =
(from r in Symptoms
where r.Status.Equals(1) && r.Create_Date < TimespanSecs
select r)
.Take(5);
var sum = DesiredSym.Sum(s => s.Month1_Use + s.Month2_Use + ... + s.Month12_Use);
You could use reflection, but that would be significantly slower and require more resources, since you'd need to pull the whole result set into memory first. But just for the sake of argument, it would look something like this:
var t = DesiredSym.GetType().GenericTypeArguments[0];
var props = t.GetProperties().Where(p => p.Name.StartsWith("Month"));
var sum = DesiredSym.AsEnumerable()
.Sum(s => props.Sum(p => (int)p.GetValue(s, null)));
Or this, which is a more complicated use of reflection, but it has the benefit of still being executed on the database:
var t = DesiredSym.GetType().GenericTypeArguments[0];
var param = Expression.Parameter(t);
var exp = t.GetProperties()
.Where(p => p.Name.StartsWith("Month"))
.Select(p => (Expression)Expression.Property(param, p))
.Aggregate((x, y) => Expression.Add(x, y));
var lambda = Expression.Lambda(exp, param);
var sum = DesiredSym.Sum(lambda);
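Note that Queryable.Sum expects a typed Expression<Func<T, int>>, so the lambda has to be built with the generic Expression.Lambda overload. Here is a minimal sketch of that, assuming an in-memory IQueryable of made-up anonymous-type rows; BuildSumSelector is a hypothetical helper name, and a real LINQ provider may or may not translate the resulting expression.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Builds r => r.Month1_Used + r.Month2_Used + ... from every int property whose
// name starts with the prefix. The source argument is only used to infer T.
// Aggregate throws if no property matches the prefix.
static Expression<Func<T, int>> BuildSumSelector<T>(IQueryable<T> source, string prefix)
{
    var param = Expression.Parameter(typeof(T), "r");
    var body = typeof(T).GetProperties()
        .Where(p => p.PropertyType == typeof(int) && p.Name.StartsWith(prefix))
        .Select(p => (Expression)Expression.Property(param, p))
        .Aggregate((x, y) => Expression.Add(x, y));
    return Expression.Lambda<Func<T, int>>(body, param);
}

// Hypothetical rows with three monthly counters each.
var rows = new[]
{
    new { Id = 1, Month1_Used = 1, Month2_Used = 2, Month3_Used = 3 },
    new { Id = 2, Month1_Used = 10, Month2_Used = 20, Month3_Used = 30 },
}.AsQueryable();

var selector = BuildSumSelector(rows, "Month");
var total = rows.Sum(selector);
Console.WriteLine(total);
```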
Now, to use these methods (except the third) to calculate the sum in batches of 5, you can use MoreLINQ's Batch method (also available on NuGet):
var DesiredSym =
from r in Symptoms
where r.Status.Equals(1) && r.Create_Date < TimespanSecs
select r;
// first method
var batchSums = DesiredSym.Batch(5, b => b.Sum(s => s.Month1_Use ...));
// second method
var t = DesiredSym.GetType().GenericTypeArguments[0];
var props = t.GetProperties().Where(p => p.Name.StartsWith("Month"));
var batchSums = DesiredSym.Batch(5, b => b.Sum(s => props.Sum(p => (int)p.GetValue(s, null))));
Both these methods will be a bit slower and use more resources since all the processing has to be done in memory. For the same reason the third method will not work, since MoreLINQ does not support the IQueryable interface.
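If you are on .NET 6 or later and would rather not add a package, the built-in Enumerable.Chunk does the same in-memory batching. A minimal sketch, assuming made-up rows with only three monthly counters:

```csharp
using System;
using System.Linq;

// Hypothetical rows: three monthly counters per record, twelve records in total.
var rows = Enumerable.Range(1, 12)
    .Select(i => (Month1_Used: i, Month2_Used: i, Month3_Used: i))
    .ToList();

// Chunk(5) yields arrays of up to five rows; the last chunk may be shorter.
var chunkSums = rows
    .Chunk(5)
    .Select(batch => batch.Sum(r => r.Month1_Used + r.Month2_Used + r.Month3_Used))
    .ToList();

Console.WriteLine(string.Join(", ", chunkSums));
```

Like Batch, Chunk works on IEnumerable, so the rows are pulled into memory before they are summed.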

Identify items in one list not in another of a different type

I need to identify items from one list that are not present in another list. The two lists are of different entities (ToDo and WorkshopItem). I consider a workshop item to be in the todo list if the Name is matched in any of the todo list items.
The following does what I'm after but I find it awkward and hard to understand each time I revisit it. I use NHibernate QueryOver syntax to get the two lists and then a LINQ statement to filter down to just the Workshop items that meet the requirement (DateDue is in the next two weeks and the Name is not present in the list of ToDo items).
var allTodos = Session.QueryOver<ToDo>().List();
var twoWeeksTime = DateTime.Now.AddDays(14);
var workshopItemsDueSoon = Session.QueryOver<WorkshopItem>()
.Where(w => w.DateDue <= twoWeeksTime).List();
var matches = from wsi in workshopItemsDueSoon
where !(from todo in allTodos
select todo.TaskName)
.Contains(wsi.Name)
select wsi;
Ideally I'd like to have just one NHibernate query that returns a list of WorkshopItems that match my requirement.
I think I've managed to put together a Linq version of the answer put forward by @CSL and will mark that as the accepted answer as it put me in the direction of the following.
var twoWeeksTime = DateTime.Now.AddDays(14);
var subquery = NHibernate.Criterion.QueryOver.Of<ToDo>().Select(t => t.TaskName);
var matchingItems = Session.QueryOver<WorkshopItem>()
.Where(w => w.DateDue <= twoWeeksTime &&
w.IsWorkshopItemInProgress == true)
.WithSubquery.WhereProperty(x => x.Name).NotIn(subquery)
.Future<WorkshopItem>();
It returns the results I'm expecting and doesn't rely on magic strings. I'm hesitant because I don't fully understand the WithSubquery (and whether inlining it would be a good thing). It seems to equate to
WHERE WorkshopItem.Name NOT IN (subquery)
Also I don't understand the Future instead of List. If anyone would shed some light on those that would help.
I am not 100% sure how to achieve what you need using LINQ so to give you an option I am just putting up an alternative solution using NHibernate Criteria (this will execute in one database hit):
// Create a query
ICriteria query = Session.CreateCriteria<WorkshopItem>("wsi");
// Restrict to items due within the next 14 days
query.Add(Restrictions.Le("DateDue", DateTime.Now.AddDays(14)));
// Return all TaskNames from ToDos
DetachedCriteria allTodos = DetachedCriteria.For(typeof(ToDo)).SetProjection(Projections.Property("TaskName"));
// Filter workshop items for any that do not have a to-do item
query.Add(Subqueries.PropertyNotIn("Name", allTodos));
// Return results
var matchingItems = query.Future<WorkshopItem>().ToList();
I'd recommend
var workshopItemsDueSoon = Session.QueryOver<WorkshopItem>()
.Where(w => w.DateDue <= twoWeeksTime);
var allTodos = Session.QueryOver<ToDo>();
Instead of
var allTodos = Session.QueryOver<ToDo>().List();
var workshopItemsDueSoon = Session.QueryOver<WorkshopItem>()
.Where(w => w.DateDue <= twoWeeksTime).List();
So that the collection isn't iterated until you need it to be.
I've found that it's helpful to use LINQ extension methods to make subqueries more readable and less awkward.
For example:
var matches = from wsi in workshopItemsDueSoon
where !allTodos.Select(it=>it.TaskName).Contains(wsi.Name)
select wsi;
Personally, since the query is fairly simple, I'd prefer to do it like so:
var matches = workshopItemsDueSoon.Where(wsi => !allTodos.Select(it => it.TaskName).Contains(wsi.Name));
The latter seems less verbose to me.
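If both lists do end up in memory, a HashSet of the task names avoids rescanning the to-do list for every workshop item. A minimal sketch with made-up sample data (the item names and due dates are assumptions for illustration only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the ToDo task names and the WorkshopItem rows.
var todoTaskNames = new[] { "Fix brakes", "Replace tyre" };
var workshopItems = new[]
{
    (Name: "Fix brakes", DateDue: DateTime.Today.AddDays(3)),
    (Name: "Oil change", DateDue: DateTime.Today.AddDays(5)),
    (Name: "Respray", DateDue: DateTime.Today.AddDays(30)),
};

var twoWeeksTime = DateTime.Today.AddDays(14);

// Build the set once; each Contains call is then a hash lookup instead of a list scan.
var knownNames = new HashSet<string>(todoTaskNames);

var dueAndMissing = workshopItems
    .Where(w => w.DateDue <= twoWeeksTime && !knownNames.Contains(w.Name))
    .ToList();

foreach (var item in dueAndMissing)
{
    Console.WriteLine(item.Name);
}
```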

Linq select from list where property matches a condition

I have a collection of Videos that have a field TypeIdentifier that tells me if a video is a trailer, clip or interview.
I need to put them in 3 separate collections.
var trailers = myMediaObject.Videos.Where(type => type.TypeIdentifier == 1);
var clips = myMediaObject.Videos.Where(type => type.TypeIdentifier == 2);
var interviews = myMediaObject.Videos.Where(type => type.TypeIdentifier == 3);
Is there a more efficient way of doing this? I love using Linq here though.
How about:
var lookup = myMediaObject.Videos.ToLookup(type => type.TypeIdentifier);
var trailers = lookup[1];
var clips = lookup[2];
var interviews = lookup[3];
Note that this will materialize the results immediately, whereas your first version didn't. If you still want deferred execution, you might want to use GroupBy instead - although that will be slightly trickier later on. It really depends what you need to do with the results.
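A small runnable sketch of the lookup approach, using made-up video tuples in place of myMediaObject.Videos. One convenient property it shows: indexing a lookup with a key that has no entries returns an empty sequence rather than throwing.

```csharp
using System;
using System.Linq;

// Hypothetical videos: type 1 = trailer, 2 = clip, 3 = interview.
var videos = new[]
{
    (Title: "Teaser", TypeIdentifier: 1),
    (Title: "Scene 4", TypeIdentifier: 2),
    (Title: "Final trailer", TypeIdentifier: 1),
};

// The source is enumerated once, here, and bucketed by TypeIdentifier.
var lookup = videos.ToLookup(v => v.TypeIdentifier);

var trailers = lookup[1].ToList();
var clips = lookup[2].ToList();
var interviews = lookup[3].ToList(); // no interviews in the sample: empty, no exception

Console.WriteLine($"{trailers.Count} trailers, {clips.Count} clips, {interviews.Count} interviews");
```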
