Compare two List<POCO> to find differences being case insensitive

I have two collections:
private void ProcessCollectionsData(List<OrganizationUser> databaseUsers, List<OrganizationUser> importedUsers) { ... }
Each user has the properties UserIdentifier (String) and UserId (Int32). This is how I'm comparing them, which produces wrong results and heavy performance bottlenecks:
LogMessage(new LogEntry(" - Generating delta for new users...", true));
Task.WaitAll(Task.Run(() =>
{
newUsers = databaseUsers.Any() ? importedUsers.Where(x => !databaseUsers.Select(y => y.UserIdentifier.ToLower())
.ToList()
.Contains(x.UserIdentifier.ToLower()))
.ToList()
: importedUsers;
duplicates = newUsers.OrderByDescending(o1 => o1.UserId)
.GroupBy(s => s.UserIdentifier, StringComparer.InvariantCultureIgnoreCase)
.Where(y => y.Count() > 1);
foreach (var item in duplicates)
{
newUsers.RemoveAll(s => string.Equals(s.UserIdentifier, item.Key, StringComparison.OrdinalIgnoreCase));
newUsers.Add(item.First());
}
}));
LogMessage(new LogEntry(String.Format(" - Done. New users to be imported: {0}", newUsers.Count)));
The data in importedUsers comes from a CSV file and can contain duplicates, including duplicates that differ only in the case of the UserIdentifier field. databaseUsers is empty the first time around. After the first run, the import has written around 100,000 users to the database, so on the second and subsequent runs databaseUsers is loaded with 100,000 existing users and importedUsers also brings in somewhere between 99,990 and 100,100 rows (for example). I therefore need to generate delta collections so that I know which users to mark as deleted, which to add (new), and which remain (common) and need to be updated.
Can anyone suggest a faster way to do this?
I can see I'm making a mistake where I'm assigning to the newUsers collection by using ToLower().
Correction to the above statement: the resulting newUsers collection retains case information as desired. So performance is the real issue here now.

I think the crux of your performance problem is here:
!databaseUsers.Select(y => y.UserIdentifier.ToLower()).ToList()
Given that databaseUsers can contain 100,000 users, you certainly don't want to build that entire lower-cased list again for every imported user, which is what this expression does inside the Where predicate. Getting rid of the Select / ToList calls avoids those repeated allocations, which should make a difference:
importedUsers.Where(x => !databaseUsers.Any(y =>
y.UserIdentifier.ToLower() == x.UserIdentifier.ToLower())).ToList()
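Note that the Any version still compares each imported user against the database users one by one. A minimal sketch of a set-based alternative follows. It assumes OrganizationUser can be stood in for by a tuple with UserId and UserIdentifier (the tuple shape and sample values are illustrative, not from the original post), and it uses StringComparer.OrdinalIgnoreCase, which may differ from the culture-sensitive ToLower() for some characters:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for List<OrganizationUser>; only the two properties
// mentioned in the question are modelled.
var databaseUsers = new List<(int UserId, string UserIdentifier)>
{
    (1, "alice"), (2, "Bob"),
};
var importedUsers = new List<(int UserId, string UserIdentifier)>
{
    (0, "ALICE"), (5, "carol"), (7, "Carol"), (0, "dave"),
};

// Build the lookup once: case-insensitive membership tests in O(1) on average.
var existing = new HashSet<string>(
    databaseUsers.Select(u => u.UserIdentifier),
    StringComparer.OrdinalIgnoreCase);

// New users: not in the database. For case-insensitive duplicates keep the
// entry with the highest UserId, mirroring the OrderByDescending/First logic
// in the question.
var newUsers = importedUsers
    .Where(u => !existing.Contains(u.UserIdentifier))
    .GroupBy(u => u.UserIdentifier, StringComparer.OrdinalIgnoreCase)
    .Select(g => g.OrderByDescending(u => u.UserId).First())
    .ToList();

// Users to mark as deleted: in the database but absent from the import.
var imported = new HashSet<string>(
    importedUsers.Select(u => u.UserIdentifier),
    StringComparer.OrdinalIgnoreCase);
var deletedUsers = databaseUsers
    .Where(u => !imported.Contains(u.UserIdentifier))
    .ToList();

foreach (var u in newUsers) Console.WriteLine($"new: {u.UserIdentifier}");
foreach (var u in deletedUsers) Console.WriteLine($"deleted: {u.UserIdentifier}");
```

Each collection is walked a small, fixed number of times, so the work grows roughly linearly with the number of users instead of with the product of the two list sizes.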

Related

Adding items to the list inside foreach loop

public ActionResult ExistingPolicies()
{
if (Session["UserId"]==null)
{
return RedirectToAction("Login");
}
using(PMSDBContext dbo=new PMSDBContext())
{
List<Policy> viewpolicy = new List<Policy>();
var userid = Session["UserId"];
List<AddPolicy> policy= dbo.AddPolicies.Where(c => c.MobileNumber ==
(string)userid).ToList();
foreach(AddPolicy p in policy)
{
viewpolicy=dbo.Policies.Where(c => c.PolicyId ==p.PolicyId).ToList();
}
Session["Count"] = policy.Count;
return View(viewpolicy);
}
}
Here the policy list clearly has 2 items. But when I iterate through it with foreach, the viewpolicy list only takes the last item as its value. If break is used, it takes only the first item. How can I store both items in the viewpolicy list?
Regards
Surya.

You can iterate through the policies and add them one by one to the list with Add, but I would say that often (though not always) the better option is to retrieve the whole list from the DB in one query. Without knowing your entities, you can at least do something like this:
List<AddPolicy> policy = ...
viewpolicy = dbo.Policies
.Where(c => policy.Select(p => p.PolicyId).Contains(c.PolicyId))
.ToList();
But if you have correctly set up entities relations, you should be able to do something like this:
var viewpolicy = dbo.AddPolicies
.Where(c => c.MobileNumber == (string)userid)
.Select(p => p.Policy) //guessing name here, also can be .SelectMany(p => p.Policy)
.ToList();

Of course; instead of adding to the list, you replace it with a whole new one on each pass of the loop:
viewpolicy=dbo.Policies.Where(c => c.PolicyId ==p.PolicyId).ToList()
The code above searches all the policies for the policy with that ID, turns the result into a new List and assigns it to the viewpolicy variable. This way you never actually add anything to a list; you just make a new list each time and overwrite the old one with the latest.
Perhaps you need something like this:
viewpolicy.Add(dbo.Policies.Single(c => c.PolicyId ==p.PolicyId));
This takes the existing list, finds one policy by its ID number (there should be only one policy per ID, right? It's an ID, so I figured it's unique) and adds it to the list.
You could use a Where and skip the loop entirely if you wanted:
viewpolicy=dbo.Policies.Where(c => policy.Any(p => c.PolicyId == p.PolicyId)).ToList();
Do not do this in a loop; it doesn't need one. It works by asking LINQ to do the looping for you. It should be converted to an IN query and run by the DB, so it is generally more performant than dragging the policies out one by one (via ID). If the ORM doesn't understand how to turn it into SQL, you can simplify things for it by first extracting the IDs into a plain collection of their own:
var policyIds = policy.Select(p => p.PolicyId).ToList();
viewpolicy=dbo.Policies.Where(c => policyIds.Contains(c.PolicyId)).ToList();
Final point, I recommend you name your "collections of things" with a plural. You have a List<Policy> viewpolicy - this is a list that contains multiple policies so really we should call it viewPolicies. Same for the list of AddPolicy. It makes code read more nicely if things that are collections/lists/arrays are named in the plural
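The difference between overwriting and accumulating can be seen without a database. A minimal sketch, assuming hypothetical in-memory tuples in place of dbo.Policies and a plain list of the user's policy IDs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for dbo.Policies and the user's policy IDs.
var allPolicies = new List<(int PolicyId, string Name)>
{
    (1, "Home"), (2, "Car"), (3, "Life"),
};
var userPolicyIds = new List<int> { 1, 3 };

// Overwriting: each pass replaces the list, so only the last match survives.
var overwritten = new List<(int PolicyId, string Name)>();
foreach (var id in userPolicyIds)
{
    overwritten = allPolicies.Where(p => p.PolicyId == id).ToList();
}

// Accumulating: AddRange appends each pass's matches to the same list.
var accumulated = new List<(int PolicyId, string Name)>();
foreach (var id in userPolicyIds)
{
    accumulated.AddRange(allPolicies.Where(p => p.PolicyId == id));
}

// No loop: one filter over the whole ID collection.
var filtered = allPolicies.Where(p => userPolicyIds.Contains(p.PolicyId)).ToList();

Console.WriteLine(overwritten.Count);  // 1
Console.WriteLine(accumulated.Count);  // 2
Console.WriteLine(filtered.Count);     // 2
```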

Something like:
viewpolicy.AddRange(dbo.Policies.Where(c => c.PolicyId ==p.PolicyId));

Why is Entity Framework having performance issues when calculating a sum

I am using Entity Framework in a C# application and I am using lazy loading. I am experiencing performance issues when calculating the sum of a property in a collection of elements. Let me illustrate it with a simplified version of my code:
public decimal GetPortfolioValue(Guid portfolioId) {
var portfolio = DbContext.Portfolios.FirstOrDefault( x => x.Id.Equals( portfolioId ) );
if (portfolio == null) return 0m;
return portfolio.Items
.Where( i =>
i.Status == ItemStatus.Listed
&&
_activeStatuses.Contains( i.Category.Status )
)
.Sum( i => i.Amount );
}
So I want to sum the value of all my items that have a certain status and whose parent category has a specific status as well.
When logging the queries generated by EF I see it is first fetching my Portfolio (which is fine). Then it does a query to load all Item entities that are part of this portfolio. And then it starts fetching ALL Category entities for each Item one by one. So if I have a portfolio that contains 100 items (each with a category), it literally does 100 SELECT ... FROM categories WHERE id = ... queries.
So it seems like it's just fetching all info, storing it in its memory and then calculating the sum. Why does it not do a simple join between my tables and calculate it like that?
Instead of doing 102 queries to calculate the sum of 100 items I would expect something along the lines of:
SELECT
i.id, i.amount
FROM
items i
INNER JOIN categories c ON c.id = i.category_id
WHERE
i.portfolio_id = @portfolioId
AND
i.status = 'listed'
AND
c.status IN ('active', 'pending', ...);
on which it could then calculate the sum (if it is not able to use the SUM directly in the query).
What is the problem and how can I improve the performance other than writing a pure ADO query instead of using Entity Framework?
To be complete, here are my EF entities:
public class ItemConfiguration : EntityTypeConfiguration<Item> {
ToTable("items");
...
HasRequired(p => p.Portfolio);
}
public class CategoryConfiguration : EntityTypeConfiguration<Category> {
ToTable("categories");
...
HasMany(c => c.Products).WithRequired(p => p.Category);
}
EDIT based on comments:
I didn't think it was important but the _activeStatuses is a list of enums.
private CategoryStatus[] _activeStatuses = new[] { CategoryStatus.Active, ... };
But probably more important is that I left out that the status in the database is a string ("active", "pending", ...) but I map them to an enum used in the application. And that is probably why EF cannot evaluate it? The actual code is:
... && _activeStatuses.Contains(CategoryStatusMapper.MapToEnum(i.Category.Status)) ...
EDIT2
Indeed the mapping is a big part of the problem but the query itself seems to be the biggest issue. Why is the performance difference so big between these two queries?
// Slow query
var portfolio = DbContext.Portfolios.FirstOrDefault(p => p.Id.Equals(portfolioId));
var value = portfolio.Items.Where(i => i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status))
.Select(i => i.Amount).Sum();
// Fast query
var value = DbContext.Portfolios.Where(p => p.Id.Equals(portfolioId))
.SelectMany(p => p.Items.Where(i =>
i.Status == ItemStatusConstants.Listed &&
_activeStatuses.Contains(i.Category.Status)))
.Select(i => i.Amount).Sum();
The first query does a LOT of small SQL queries whereas the second one just combines everything into one bigger query. I'd expect even the first query to run one query to get the portfolio value.

Calling portfolio.Items lazy loads the whole Items collection, and the subsequent calls, including the Where and Sum expressions, are then executed in memory. See also the Loading Related Entities article.
You need to execute the call directly on the DbContext so that the Sum expression can be evaluated on the database server.
var portfolio = DbContext.Portfolios
.Where(x => x.Id.Equals(portfolioId))
.SelectMany(x => x.Items.Where(i => i.Status == ItemStatus.Listed && _activeStatuses.Contains(i.Category.Status)).Select(i => i.Amount))
.Sum();
You also have to use the appropriate type for the _activeStatuses instance, as the contained values must match the type persisted in the database. If the database persists string values, then you need to pass a list of string values:
var _activeStatuses = new string[] {"Active", "etc"};
You could use a LINQ expression to convert the enums to their string representation.
Notes
I would recommend you turn off lazy loading on your DbContext type. As soon as you do that you will start to catch issues like this at run time via Exceptions and can then write more performant code.
I did not include error checking for if no portfolio was found but you could extend this code accordingly.

Yes, CategoryStatusMapper.MapToEnum cannot be converted to SQL, which forces the Where to run in .NET. Rather than mapping the status to the enum, _activeStatuses should contain the list of integer values from the enum so that no mapping is required:
private int[] _activeStatuses = new[] { (int)CategoryStatus.Active, ... };
so that the Contains becomes
... && _activeStatuses.Contains(i.Category.Status) ...
and can all be converted to SQL.
UPDATE
Given that i.Category.Status is a string in the database, use the enum names instead:
private string[] _activeStatuses = new[] { CategoryStatus.Active.ToString(), ... };
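One detail to watch: the question says the column holds values such as "active" and "pending", while Enum.ToString() returns "Active". Whether that matters depends on the database collation. A minimal in-memory sketch, assuming a hypothetical CategoryStatus enum with a Pending member and lower-case stored values:

```csharp
using System;
using System.Linq;

// Build the string values once, outside the query, so that Contains receives
// a plain string[] that a LINQ provider can turn into an IN (...) clause.
// Assumption: the column stores lower-case names, hence the lower-casing.
CategoryStatus[] activeEnumStatuses = { CategoryStatus.Active, CategoryStatus.Pending };
string[] activeStatuses = activeEnumStatuses
    .Select(s => s.ToString().ToLowerInvariant())
    .ToArray();

// In-memory stand-in for items joined to their categories (hypothetical data).
var items = new[]
{
    (Amount: 10m, Status: "listed", CategoryStatusText: "active"),
    (Amount: 20m, Status: "listed", CategoryStatusText: "closed"),
    (Amount: 5m,  Status: "sold",   CategoryStatusText: "pending"),
    (Amount: 7m,  Status: "listed", CategoryStatusText: "pending"),
};

decimal total = items
    .Where(i => i.Status == "listed" && activeStatuses.Contains(i.CategoryStatusText))
    .Sum(i => i.Amount);

Console.WriteLine(total); // 17

// Hypothetical enum; the question only shows CategoryStatus.Active.
enum CategoryStatus { Active, Pending, Closed }
```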

Populate a large list of objects with details from another list

I have a large database query that returns around 100k records into an in-memory list. I need to link a list of related employees to each record (also around 100k records), but I'm struggling to get useable performance.
foreach (var detail in reportData.Details)
{
detail.Employees = employees
.Where(x => x.AccountingDocumentItemId == detail.AccountingDocumentItemId)
.Select(x => x.Employee)
.ToList();
detail.Employee = String.Join(", ", detail.Employees);
}
The above code takes over 8 minutes to complete. I've narrowed down the speed issue to the first line in the for loop where it finds the related employees. If I leave out the ToList() it's super fast, but then the next line immediately causes the issues where the String.Join causes the Where to execute.
I'm obviously approaching this from the wrong angle, but I've exhausted the options I think would work.

Your current code has O(n^2) time complexity (nested loops), so it has 1e5 * 1e5 = 1e10 (10 billion) operations to perform, which is why it takes 8 minutes to complete.
Let's extract a dictionary in order to have O(n) time complexity (only ~1e5 operations):
// Group the employees (not the details) by the key used for the lookup
var dict = employees
    .GroupBy(item => item.AccountingDocumentItemId,
             item => item.Employee)
    .ToDictionary(chunk => chunk.Key,
                  chunk => chunk.ToList());
foreach (var detail in reportData.Details) {
    detail.Employees = dict.TryGetValue(detail.AccountingDocumentItemId, out var list)
        ? list.ToList() // copy of the list
        : new List<MyClass>(); // put the actual element type instead of MyClass
detail.Employee = String.Join(", ", detail.Employees);
}
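As an alternative to GroupBy plus ToDictionary, Enumerable.ToLookup builds the same kind of index in one call and returns an empty sequence for a missing key. A minimal sketch, assuming hypothetical tuples in place of the employee rows and a plain array of detail IDs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins: employee rows keyed by AccountingDocumentItemId,
// and the IDs of the report details.
var employees = new List<(int AccountingDocumentItemId, string Employee)>
{
    (1, "Ann"), (2, "Raj"), (1, "Bo"),
};
var detailIds = new[] { 1, 2, 3 };

// One pass over employees builds the index; each later lookup is O(1) on average.
ILookup<int, string> byItem = employees.ToLookup(
    e => e.AccountingDocumentItemId,
    e => e.Employee);

var joined = new Dictionary<int, string>();
foreach (var id in detailIds)
{
    // A lookup yields an empty sequence for a missing key, so no TryGetValue.
    joined[id] = string.Join(", ", byItem[id]);
}

foreach (var id in detailIds) Console.WriteLine($"{id}: '{joined[id]}'");
```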

Replacing Include() calls to Select()

I'm trying to eliminate the use of the Include() calls in this IQueryable definition:
return ctx.timeDomainDataPoints.AsNoTracking()
.Include(dp => dp.timeData)
.Include(dp => dp.RecordValues.Select(rv => rv.RecordKind).Select(rk => rk.RecordAlias).Select(fma => fma.RecordAliasGroup))
.Include(dp => dp.RecordValues.Select(rv => rv.RecordKind).Select(rk => rk.RecordAlias).Select(fma => fma.RecordAliasUnit))
.Where(dp => dp.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
.Where(dp => dp.Source == 235235)
.Where(dp => dp.timeData.time >= start && dp.timeData.time <= end)
.OrderByDescending(cd => cd.timeData.time);
I have been having issues with the database where the run times are far too long, and the primary cause is that the Include() calls pull in everything.
This is evident from the table returned by the generated SQL query, which contains lots of unnecessary information.
One of the things that you learn, I guess.
The database has a large collection of data points, each of which has many recorded values.
Each recorded value is mapped to a Record Kind, which may have a Record Alias.
I have tried creating a Select() as an alternative, but I just can't figure out how to construct the right Select and also keep the entity hierarchy correctly loaded, i.e. the related entities end up being loaded with unnecessary calls to the DB.
Does anyone have alternative solutions that may jump-start me towards solving this problem?
I'll add more detail if needed.

You are right. One of the slower parts of a database query is the transport of the selected data from the DBMS to your local process. Hence it is wise to limit this.
Every TimeDomainDataPoint has a primary key. All RecordValues of this TimeDomainDataPoint have a foreign key TimeDomainDataPointId with a value equal to this primary key.
So if the TimeDomainDataPoint with Id 4 has a thousand RecordValues, then every one of those RecordValues will have a foreign key with the value 4. It would be a waste to transfer this value 4 a total of 1001 times when you only need it once.
When querying data, always use Select and select only the properties you actually plan to use. Only use Include if you plan to update the fetched included items.
The following will be much faster:
var result = dbContext.timeDomainDataPoints
// first limit the datapoints you want to select
.Where(datapoint => datapoint.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
.Where(datapoint => datapoint.Source == 235235)
.Where(datapoint => datapoint.timeData.time >= start
&& datapoint.timeData.time <= end)
.OrderByDescending(datapoint => datapoint.timeData.time)
// then select only the properties you actually plan to use
.Select(dataPoint => new
{
Id = dataPoint.Id,
RecordValues = dataPoint.RecordValues
.Where(recordValues => ...) // if you don't want all RecordValues
.Select(recordValue => new
{
// again: select only the properties you actually plan to use:
Id = recordValue.Id,
// not needed, you know the value: DataPointId = recordValue.DataPointId,
RecordKinds = recordValue.RecordKinds
.Where(recordKind => ...) // if you don't want all recordKinds
.Select(recordKind => new
{
... // only the properties you really need!
})
.ToList(),
...
})
.ToList(),
TimeData = dataPoint.TimeData.Select(...),
...
});
Possible improvement
The part:
.Where(datapoint => datapoint.RecordValues.Any(rv => rv.RecordKind.RecordAlias != null))
is used to fetch only datapoints that have recordValues with a non-null RecordAlias. If you are selecting the RecordAlias anyway, consider doing this Where after your select:
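.Select(...)
// assumes the projection keeps RecordValues together with their RecordKind.RecordAlias
.Where(dataPoint => dataPoint.RecordValues
    .Where(recordValue => recordValue.RecordKind.RecordAlias != null)
    .Any());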
I'm not really sure whether this is faster. If your database management system internally first creates a complete table with all columns of all joined tables and then throws away the columns that are not selected, then it won't make a difference. However, if it only creates a table with the columns it actually uses, then the internal table will be smaller. This could be faster.

Your problem is the hierarchy of joins in your query. To reduce it, create a separate query that fetches the results from the related table, as follows:
var items = ctx.timeDomainDataPoints.AsNoTracking().Include(dp => dp.timeData).Include(dp => dp.RecordValues);
var ids = items.SelectMany(item => item.RecordValues).Select(i => i.Id);
and in another request to the DB:
var otherItems = ctx.RecordAlias.AsNoTracking().Select(dp => dp.RecordAlias).Where(s => ids.Contains(s.RecordKindId)).SelectMany(s => s.RecordAliasGroup);
With this approach your query does not have internal joins.

How to Performance Test This and Suggestions to Make Faster?

I seem to have written a very slow piece of code, which gets even slower when EF Core is involved.
Basically I have a list of items that store their attributes as a JSON string in the database, as I am storing many different items with different attributes.
I then have another table that contains the display order for each attribute, so when I send the items to the client I order the attributes based on that.
It is rather slow: 700 records take about 18-30 seconds (measured from where I start my timer, not over the whole block of code).
var inventoryItemDtos = new List<InventoryItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId);
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
Stopwatch a = new Stopwatch();
a.Start();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
var specDtos = new List<SpecDto>();
foreach (var inventorySpecification in inventorySpecifications.OrderBy(x => x.DisplayOrder))
{
if (specs.ContainsKey(inventorySpecification.JsonKey))
{
var value = specs.GetValue(inventorySpecification.JsonKey);
var newSpecDto = new SpecDto()
{
Key = inventorySpecification.JsonKey,
Value = value.ToString()
};
specDtos.Add(newSpecDto);
}
}
var dto = new InventoryItemDto()
{
// create dto
};
inventoryItemDtos.Add(dto);
}
Now it gets crazy slow when I have EF load some more columns that I need info from.
In the //create dto area I access some information from other tables
var dto = new InventoryItemDto()
{
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
};
Accessing these columns in the loop makes it take 6 minutes to process 700 rows.
I don't understand why it is so slow; it's the only change I really made, and I made sure to eager load everything.
It almost makes me think eager loading is not working, but I don't know how to verify whether it is or not.
var inventoryItems = dbContext.InventoryItems.Include(x => x.Branch).ThenInclude(x => x.Company)
.Include(x => x.Branch).ThenInclude(x => x.Country)
.Include(x => x.Branch).ThenInclude(x => x.State)
.Include(x => x.Brand)
.Where(x => x.InventoryCategoryId == categoryId).ToList();
So I thought that, because of this, the speed would not be much different than the original 18-30 seconds.
I would like to speed up the original code too, but I am not really sure how to get rid of the nested foreach loops that are probably slowing it down.

First, loops inside loops are expensive; you should refactor the inner work out so that there is a single loop where possible. This should not be a problem, because inventorySpecifications is declared outside the loop.
Second, the line
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
should end with ToList(), because its enumeration happens within the inner foreach, which means that the query runs once for each of the inventoryItems.
That should save you a good amount of time.

I'm no expert but this part of your second foreach raises a red flag: inventorySpecifications.OrderBy(x => x.DisplayOrder). Because this is getting called inside another foreach it's doing the .OrderBy call every time you iterate over inventoryItems.
Before your first foreach loop, try this: var orderedInventorySpecs = inventorySpecifications.OrderBy(x => x.DisplayOrder); and then use foreach (var inventorySpec in orderedInventorySpecs) and see if it makes a difference.
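Both suggestions above come down to deferred execution: a LINQ query that has not been materialised is run again every time it is enumerated. A minimal sketch on in-memory data, assuming hypothetical spec tuples and a counter in the key selector to make the repeated work visible (with an EF IQueryable, each re-enumeration would mean another SQL round trip rather than just another sort):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int keyCalls = 0;
// Hypothetical stand-ins for the specifications and the inventory items.
var specs = new List<(string JsonKey, int DisplayOrder)>
{
    ("weight", 2), ("colour", 1),
};
var items = new[] { "item1", "item2", "item3" };

// Deferred: nothing is sorted until the query is enumerated, and it is
// sorted again on every enumeration (once per outer item here).
var deferred = specs.OrderBy(s => { keyCalls++; return s.DisplayOrder; });
foreach (var item in items)
{
    foreach (var spec in deferred) { /* use spec */ }
}
int deferredCalls = keyCalls;

// Materialised: ToList() runs the sort once; the loops reuse the result.
keyCalls = 0;
var ordered = specs.OrderBy(s => { keyCalls++; return s.DisplayOrder; }).ToList();
foreach (var item in items)
{
    foreach (var spec in ordered) { /* use spec */ }
}
int materialisedCalls = keyCalls;

Console.WriteLine($"deferred: {deferredCalls}, materialised: {materialisedCalls}");
```

The deferred variant invokes the key selector more often than the materialised one, and the gap grows with the number of outer items.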

To help you better understand what EF is running behind the scenes add some logging in to expose the SQL being run which might help you see how/where your queries are going wrong. This can be extremely helpful to help determine if your queries are hitting the DB too often. As a very general rule you want to hit the DB as few times as possible and retrieve only the information you need via the use of .Select() to reduce what is being returned. The docs for the logging are: http://learn.microsoft.com/en-us/ef/core/miscellaneous/logging
I obviously cannot test this and I am a little unsure where your specDto's go once you have them but I assume they become part of the InventoryItemDto?
var itemDtos = new List<ItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId).Select(x => new InventoryItemDto() {
Attributes = x.Attributes,
//.....
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
}).ToList();
var inventorySpecifications = dbContext.InventoryCategorySpecifications
.Where(x => x.InventoryCategoryId == categoryId)
.OrderBy(x => x.DisplayOrder)
.Select(x => x.InventorySpecification).ToList();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
// Assuming the specs become part of an inventory item?
item.specs = inventorySpecifications.Where(x => specs.ContainsKey(x.JsonKey)).Select(x => new SpecDto() { Key = x.JsonKey, Value = specs.GetValue(x.JsonKey).ToString() }).ToList();
}
The first call to the DB for inventoryItems should produce one SQL query that will pull all the information you need at once to construct your InventoryItemDto and thus only hits the DB once. Then it pulls the specs out and uses OrderBy() before materialising which means the OrderBy will be run as part of the SQL query rather than in memory. Both those results are materialised via .ToList() which will cause EF to pull the results into memory in one go.
Finally, the loop goes over your constructed inventoryItems, parses the JSON and then filters the specs based on that. I am unsure where you were using the specDtos, so I assumed they were part of the model. I would recommend checking the performance of the JSON work you are doing, as that could be contributing to your slowdown.
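The per-item JSON step can be tried in isolation. A minimal sketch, assuming the specification keys are already sorted by DisplayOrder and using System.Text.Json in place of the JObject API from the question so that the example needs no extra package (the keys and the JSON string are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical data: specification keys already sorted by DisplayOrder, and
// one item's Attributes JSON string.
var orderedKeys = new List<string> { "colour", "weight", "height" };
string attributes = "{\"weight\":\"2kg\",\"colour\":\"red\"}";

// Parse once per item, then probe each key with TryGetProperty, which
// combines the ContainsKey and GetValue calls of the original loop.
using JsonDocument doc = JsonDocument.Parse(attributes);
var specDtos = new List<(string Key, string Value)>();
foreach (var key in orderedKeys)
{
    if (doc.RootElement.TryGetProperty(key, out JsonElement value))
    {
        specDtos.Add((key, value.ToString()));
    }
}

foreach (var spec in specDtos) Console.WriteLine($"{spec.Key} = {spec.Value}");
```

Wrapping this loop in a Stopwatch for a few hundred items would show whether the JSON work or the database access dominates the runtime.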
A more integrated approach to using Json as part of your EF models can be seen at this answer: https://stackoverflow.com/a/51613611/621524 however you will still be unable to use those properties to offload execution to SQL as accessing properties that are defined within code will cause queries to fragment and run in several parts.
