LINQ Join/Update List of Objects from Database - c#

This issue is a new one to me in LINQ. And maybe I'm going about this wrong.
What I have is a list of objects in memory, which could number up to 100k, and I need to find in my database which objects represent an existing customer.
This search needs to be done across multiple object properties and all I have to go on are the name and address of the person - no unique identifier since this data comes from an outside source.
Is it possible to join my generic list of objects against my database context and then update the objects in the list, with data from the context, based on whether they are found in the join?
I thought I was getting close to the join working with the below code. And I think the join works .. maybe. But I can't even seem to loop through the records.
public void FindCustomerMatches(List<DocumentLine> lines)
{
IQueryable<DocumentLine> results = null;
var linesQuery = lines.AsQueryable();
using (var customerContext = new Entities())
{
customerContext.Configuration.LazyLoadingEnabled = false;
var dbCustomerQuery = customerContext.customers.Where(c => !c.customernumber.StartsWith("D"));
results = from c in dbCustomerQuery
from l in linesQuery
where c.firstname1 == l.CustomerFirstName
&& c.lastname1 == l.CustomerLastName
&& c.street_address1.Contains(l.CustomerAddress)
&& c.city == l.CustomerCity
&& c.state == l.CustomerState
&& c.zip == l.CustomerZip
select l;
foreach (var result in results)
{
// Do something with each record here, like update it.
}
}
}

It seems to me that you have two collections: a local collection of DocumentLines in variable lines, and a collection of Customers in a customerContext.Customers, probably in a database management system.
Every DocumentLine contains several properties that can also be found in a Customer. Alas you didn't say whether all DocumentLine properties can be found in a Customer.
From lines (the local collection of DocumentLines) you want to keep only those DocumentLines for which at least one Customer in your queryable collection of Customers matches all these properties.
So the result is a sequence of DocumentLines, a sub-collection of lines.
The problem is that you don't want to query a sub-collection of the database table Customers, but you want a sub-collection of your local lines.
Using AsQueryable doesn't transport your lines to your DBMS. I doubt whether the query you defined will be performed by the DBMS. I suspect that all Customers will be transported to your local process to perform the query.
If all properties of a DocumentLine are in a Customer then it is possible to extract the DocumentLines properties from every Customer and use Queryable.Contains to keep only those extracted DocumentLines that are in your lines:
IQueryable<DocumentLine> customerDocumentLines = dbContext.Customers
.Select(customer => new DocumentLine()
{
FirstName = customer.FirstName,
LastName = customer.LastName,
...
// etc, fill all DocumentLine properties
});
Note: the query is not executed yet! No communication with the DBMS has been performed.
Your requested result is all customerDocumentLines that are contained in lines, with the duplicates removed.
var result = customerDocumentLines // extract the document lines from all Customers
.Distinct() // remove duplicates
.Where(line => lines.Contains(line)); // keep only those lines that are in lines
This won't work if you can't extract a complete DocumentLine from a Customer. Also, if lines contains duplicates, the result won't show these duplicates.
If you can't extract all properties of a DocumentLine from a Customer, you'll have to move the values to check to local memory:
var valuesToCompare = dbContext.Customers
.Select(customer => new
{
FirstName = customer.FirstName,
LastName = customer.LastName,
...
// etc, fill all values you need to check
})
.Distinct() // remove duplicates
.AsEnumerable(); // make it IEnumerable,
// = efficiently move to local memory
Now you can use Enumerable.Contains to get the subset of lines. You'll need to compare by value, not by reference. Luckily, anonymous types compare for equality by value:
var result = lines
// extract the values to compare
.Select(line => new
{
Line = line,
ValuesToCompare = new
{
FirstName = line.FirstName,
LastName = line.LastName,
...
}
})
// keep only those lines that match valuesToCompare
.Where(line => valuesToCompare.Contains(line.ValuesToCompare));
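The in-memory comparison above can be sketched as a small self-contained example. This is only a sketch under assumptions: the Customer and DocumentLine classes below are hypothetical stand-ins with three properties each, IsExistingCustomer is an invented flag, and the customer values are copied once into a hash set so that each of the (possibly 100k) lines is checked with a single lookup. Note that in-memory string comparison is case-sensitive, which may differ from how your database compares strings, and that the partial address match from the question is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's types; only a few properties are shown.
public class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Zip { get; set; }
}

public class DocumentLine
{
    public string CustomerFirstName { get; set; }
    public string CustomerLastName { get; set; }
    public string CustomerZip { get; set; }
    public bool IsExistingCustomer { get; set; } // invented flag for illustration
}

public static class CustomerMatcher
{
    // Marks every line whose key values also occur in the customer sequence.
    public static void MarkMatches(IEnumerable<Customer> customers, List<DocumentLine> lines)
    {
        // Anonymous types implement Equals/GetHashCode by value,
        // so they can be used as hash set keys.
        var keys = customers
            .Select(c => new { First = c.FirstName, Last = c.LastName, Zip = c.Zip })
            .ToHashSet();

        foreach (var line in lines)
        {
            // Same property names, types and order as above, so this is the same anonymous type.
            var key = new { First = line.CustomerFirstName, Last = line.CustomerLastName, Zip = line.CustomerZip };
            line.IsExistingCustomer = keys.Contains(key);
        }
    }
}
```

With a real context you would pass something like the projected customer query as the first argument; it is enumerated exactly once while the hash set is built.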

Related

How to Performance Test This and Suggestions to Make Faster?

I seem to have written some very slow piece of code which gets slower when I have to deal with EF Core.
Basically I have a list of items that store attributes in a Json string in the database as I am storing many different items with different attributes.
I then have another table that contains the display order for each attribute, so when I send the items to the client I order them based on that order.
It is rather slow: 700 records take about 18-30 seconds (measured from where I start my timer, not over the whole block of code).
var itemDtos = new List<ItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId);
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
Stopwatch a = new Stopwatch();
a.Start();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
var specDtos = new List<SpecDto>();
foreach (var inventorySpecification in inventorySpecifications.OrderBy(x => x.DisplayOrder))
{
if (specs.ContainsKey(inventorySpecification.JsonKey))
{
var value = specs.GetValue(inventorySpecification.JsonKey);
var newSpecDto = new SpecDto()
{
Key = inventorySpecification.JsonKey,
Value = displaySpec.ToString()
};
specDtos.Add(newSpecDto);
}
}
var dto = new InventoryItemDto()
{
// create dto
};
inventoryItemDtos.Add(dto);
}
Now it gets extremely slow when I read some more columns through EF that I need info from.
In the //create dto area I access some information from other tables
var dto = new InventoryItemDto()
{
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
};
Accessing these columns in the loop makes it take 6 minutes to process 700 rows.
I don't understand why it is so slow, it's the only change I really made and I made sure to eager load everything in.
To me it almost makes me think eager loading is not working, but I don't know how to verify if it is or not.
var inventoryItems = dbContext.InventoryItems.Include(x => x.Branch).ThenInclude(x => x.Company)
.Include(x => x.Branch).ThenInclude(x => x.Country)
.Include(x => x.Branch).ThenInclude(x => x.State)
.Include(x => x.Brand)
.Where(x => x.InventoryCategoryId == categoryId).ToList();
So I thought that, because of this, the speed would not be much different than the original 18-30 seconds.
I would like to speed up the original code too, but I am not really sure how to get rid of the nested foreach loops that are probably slowing it down.
First, loops inside loops are costly; you should refactor the inner one out so that you are left with a single loop. This should not be a problem, because inventorySpecifications is declared outside the loop.
Second, the line
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
should end with ToList(), because its enumeration happens within the inner foreach, which means that the query runs once for each of the inventoryItems.
That should save you a good amount of time.
I'm no expert but this part of your second foreach raises a red flag: inventorySpecifications.OrderBy(x => x.DisplayOrder). Because this is getting called inside another foreach it's doing the .OrderBy call every time you iterate over inventoryItems.
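Before your first foreach loop, try this: var orderedInventorySpecs = inventorySpecifications.OrderBy(x => x.DisplayOrder).ToList(); and then use foreach (var inventorySpec in orderedInventorySpecs) and see if it makes a difference. The ToList() matters: without it the ordered query is still deferred and runs again every time it is enumerated.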
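To see why materializing the query once matters, here is a minimal sketch that counts how often a deferred sequence is enumerated. It uses plain LINQ to Objects, not EF: the local Source() iterator is a hypothetical stand-in for a database query, and the counter stands in for round trips to the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns how many times the source was enumerated for the given number of outer iterations.
    public static int CountEnumerations(int outerIterations, bool materializeFirst)
    {
        int enumerations = 0;

        // Stand-in for a database query: every enumeration bumps the counter,
        // much like every enumeration of an IQueryable runs the query again.
        IEnumerable<int> Source()
        {
            enumerations++;
            yield return 3;
            yield return 1;
            yield return 2;
        }

        // OrderBy is deferred: nothing has been enumerated yet.
        IEnumerable<int> ordered = Source().OrderBy(x => x);

        if (materializeFirst)
        {
            ordered = ordered.ToList(); // enumerates the source exactly once
        }

        int processed = 0;
        for (int i = 0; i < outerIterations; i++)
        {
            foreach (var value in ordered)
            {
                processed += value; // placeholder for the real per-item work
            }
        }

        return enumerations;
    }
}
```

Calling CountEnumerations with materializeFirst set to false enumerates the source once per outer iteration, while setting it to true enumerates it once in total, which is the effect the ToList() suggestions above are after.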
To help you better understand what EF is running behind the scenes add some logging in to expose the SQL being run which might help you see how/where your queries are going wrong. This can be extremely helpful to help determine if your queries are hitting the DB too often. As a very general rule you want to hit the DB as few times as possible and retrieve only the information you need via the use of .Select() to reduce what is being returned. The docs for the logging are: http://learn.microsoft.com/en-us/ef/core/miscellaneous/logging
I obviously cannot test this and I am a little unsure where your specDto's go once you have them but I assume they become part of the InventoryItemDto?
var itemDtos = new List<ItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId).Select(x => new InventoryItemDto() {
Attributes = x.Attributes,
//.....
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
}).ToList();
var inventorySpecifications = dbContext.InventoryCategorySpecifications
.Where(x => x.InventoryCategoryId == categoryId)
.Select(x => x.InventorySpecification)
.OrderBy(x => x.DisplayOrder) // the question's code orders by DisplayOrder on the specification
.ToList();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
// Assuming the specs become part of an inventory item?
item.specs = inventorySpecifications
.Where(x => specs.ContainsKey(x.JsonKey))
.Select(x => new SpecDto() { Key = x.JsonKey, Value = specs.GetValue(x.JsonKey).ToString() })
.ToList(); // materialise so the filter is not re-run on every access
}
The first call to the DB for inventoryItems should produce one SQL query that will pull all the information you need at once to construct your InventoryItemDto and thus only hits the DB once. Then it pulls the specs out and uses OrderBy() before materialising which means the OrderBy will be run as part of the SQL query rather than in memory. Both those results are materialised via .ToList() which will cause EF to pull the results into memory in one go.
Finally, the loop goes over your constructed inventoryItems, parses the Json and then filters the specs based on that. I am unsure of where you were using the specDtos, so I made the assumption that they are part of the model. I would recommend checking the performance of the Json work you are doing, as that could be contributing to your slowdown.
A more integrated approach to using Json as part of your EF models can be seen at this answer: https://stackoverflow.com/a/51613611/621524 however you will still be unable to use those properties to offload execution to SQL as accessing properties that are defined within code will cause queries to fragment and run in several parts.

ASP.NET MVC C# Select and Where Statements

I'm having trouble understanding .Select and .Where statements. What I want to do is select a specific column with "where" criteria based on another column.
For example, what I have is this:
var engineers = db.engineers;
var managers = db.ManagersToEngineers;
List<ManagerToEngineer> matchedManager = null;
Engineer matchedEngineer = null;
if (this.User.Identity.IsAuthenticated)
{
var userEmail = this.User.Identity.Name;
matchedEngineer = engineers.Where(x => x.email == userEmail).FirstOrDefault();
matchedManager = managers.Select(x => x.ManagerId).Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
}
if (matchedEngineer != null)
{
ViewBag.EngineerId = new SelectList(new List<Engineer> { matchedEngineer }, "PersonId", "FullName");
ViewBag.ManagerId = new SelectList(matchedManager, "PersonId", "FullName");
}
What I'm trying to do above is select from a table that matches Managers to Engineers and select a list of managers based on the engineer's id. This isn't working and when I go like:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
I don't get any errors, but I'm not selecting the right column. In fact, at the moment I'm not sure what I'm selecting. Plus I get the error:
Non-static method requires a target.
If you want to select a single manager, you need to use FirstOrDefault(), as you did one line above; but if multiple managers are expected, you will need a List<Manager>. Try something like:
Update:
Since matchedManager is already a List<T>, in this case it should be:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).ToList();
When you put Select(x => x.ManagerId) after the Where(), it returns a collection of int, not a collection of that type. Where() is self-descriptive: it filters the collection, as in SQL. Select() projects the collection onto the column you specify:
List<int> managerIds = managers.Where(x => x.EngineerId == matchedEngineer.PersonId)
.Select(x=>x.ManagerId).ToList();
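The filter-then-project order can be tried out in memory with a small sketch. Assumptions here: ManagerToEngineer is a hypothetical two-property version of the question's cross-reference class, and ManagerIdsFor is an invented helper that also guards against a missing engineer before filtering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one row in the manager/engineer cross-reference table.
public class ManagerToEngineer
{
    public int ManagerId { get; set; }
    public int EngineerId { get; set; }
}

public static class ManagerLookup
{
    // Returns the ids of all managers linked to the given engineer,
    // or an empty list when no engineer was found (engineerId is null).
    public static List<int> ManagerIdsFor(IEnumerable<ManagerToEngineer> links, int? engineerId)
    {
        if (engineerId == null)
        {
            return new List<int>();
        }

        return links
            .Where(link => link.EngineerId == engineerId.Value) // filter rows first
            .Select(link => link.ManagerId)                     // then pick the column
            .ToList();
    }
}
```

Swapping the two calls would not compile in this form, because after Select the elements are plain ints and no longer have an EngineerId to filter on.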
The easiest way to remember what the methods do is to remember that this is being translated to SQL.
A .Where() method will filter the rows returned.
A .Select() method will filter the columns returned.
However, there are a few ways to do that with the way you should have your objects set up.
First, you could get the Engineer, and access its Managers:
var engineer = context.Engineers.Find(engineerId);
return engineer.Managers;
However, that will first pull the Engineer out of the database, and then go back for all of the Managers. The other way would be to go directly through the Managers.
return context.Managers.Where(manager => manager.EngineerId == engineerId).ToList();
Although, by the look of the code in your question, you may have a cross-reference table (many to many relationship) between Managers and Engineers. In that case, my second example probably wouldn't work. In that case, I would use the first example.
You want to filter the data by matching the person Id and then select the manager Id, so you need to do the following:
matchedManager = managers.Where(x => x.EngineerId == matchedEngineer.PersonId).Select(x => x.ManagerId).ToList();
In your case, you select the ManagerId first, so you end up with a list of ints instead of managers, and there is no EngineerId left to filter on.
Update:
You also need to check that matchedEngineer is not null before retrieving the associated managers. This might be the cause of your error.
You use the "Select" lambda expression to get the field you want, and "Where" to filter the results.

Need to Include() related entities but no option to do so

I'm not sure how else to word the title of this question so let me explain.
I need to select most of one entity type from my database, using .Include to select its related entities, but at the same time select only the entities whose identifier is equal to one of the IDs in a string array.
My code as follows:
List<TSRCategory> electives = new List<TSRCategory>();
foreach (var i in client.Electives.Split('&'))
{
int id = Int32.Parse(i);
electives.Add(db.TSRCategories.Find(id));
}
This correctly selects the TSRCategories that are part of the Electives list of IDs, but does not include the related entities. I was using this code:
TSRCategories = db.TSRCategories.Include("Competencies.CompetencySkills").ToList();
but this does not select only the chosen Electives. What I am ideally looking for is something like this:
List<TSRCategory> electives = new List<TSRCategory>();
foreach (var i in client.Electives.Split('&'))
{
int id = Int32.Parse(i);
electives.Add(db.TSRCategories.Find(id));
}
TSRCategories = electives.Include("Competencies.CompetencySkills").ToList();
But of course this can't be done for whatever reason (I don't actually know what to search for online in terms of why this can't be done!). Electives is a string with the & as a delimiter to separate the IDs into an array. TSRCategories contains Competencies which contains CompetencySkills. Is there a way to actually do this efficiently and in few lines?
You will find that fetching the associated ids one by one will result in poor query performance. You can fetch them all in one go by first projecting a list of all the needed ids (I've assumed the key name ElectiveId here):
var electiveIds = client.Electives.Split('&')
.Select(i => Int32.Parse(i))
.ToArray();
var electives = db.TSRCategories
.Include(t => t.Competencies.Select(c => c.CompetencySkills))
.Where(tsr => electiveIds.Contains(tsr.ElectiveId))
.ToList();
One thing to mention, though, is that storing your ids in a single string field joined by a delimiter violates database normalization. Instead, you should create a new junction table, e.g. ClientElectives, which links the Electives associated with a Client in normalized fashion (ClientId, ElectiveId). This will also simplify your EF retrieval code.
Edit
According to the examples in the documentation, I should be using .Select for depth specification of the eager loading (not .SelectMany or other extension methods).
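The id-parsing step can be sketched on its own, without a database. This is a hedged example: ElectiveIds.Parse is an invented helper, and skipping unparsable parts and removing duplicates are assumptions about what you would want; the question's own code uses Int32.Parse, which throws on bad input instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ElectiveIds
{
    // Turns a string such as "3&17&42" into an int array.
    // Parts that are not valid integers are skipped and duplicates are removed.
    public static int[] Parse(string electives)
    {
        if (string.IsNullOrWhiteSpace(electives))
        {
            return new int[0];
        }

        return electives
            .Split('&')
            .Select(part => int.TryParse(part.Trim(), out var id) ? (int?)id : null)
            .Where(id => id.HasValue)
            .Select(id => id.Value)
            .Distinct()
            .ToArray();
    }
}
```

The resulting array can then be used in a single Where(... ids.Contains(...)) filter, as in the query above, so that all categories are fetched in one round trip.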
Try using this extension method (the lambda-based Include lives in the System.Data.Entity namespace):
using System.Data.Entity;
var query = from x in db.Z
.Include(z => z.Competencies.Select(c => c.CompetencySkills))
select x;
To search by the given list of ids:
int[] ids = new int[0]; // or List<int>
var filtered = from x in db.Z
where ids.Contains(x.Id)
select x;

remove nested values from object in linq

I have an object called copyAgencies which contains a class called Programs, which contains various information pertaining to a program (name, id, etc.).
I am trying to write a foreach loop to remove all programs that do not match a certain id parameter.
For example, copyAgencies could contain 11 different programs; passing in 3 ids means that the other 8 programs should be removed from the copyAgencies object.
I tried the following code, which fails. Could you help me making it work?
foreach (int id in chkIds)
{
//copyAgencies.Select(x => x.Programs.Select(b => b.ProgramId == id));
copyAgencies.RemoveAll(x => x.Programs.Any(b => b.ProgramId != id)); //removes all agencies
}
If you only have one agency like you said in your comment, and that's all you care about, try this:
copyAgencies[0].Programs.RemoveAll(x => !chkIds.Contains(x.ProgramId));
An easy way to filter values is to select the ones you're interested in instead of removing the ones you're not:
var interestingPrograms = Programs.Where(p => chkIds.Contains(p.ProgramId));
To apply this to your agencies, you can simply enumerate the agencies and filter the Programs property of each:
var chkIds = new List<int>() {1,2,3};
foreach (var a in agencies)
{
a.Programs = a.Programs.Where(p => chkIds.Contains(p.ProgramId)).ToList(); // ToList() assumes Programs is a List<T>
}
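Both suggestions can be combined into a small self-contained sketch. The Agency and AgencyProgram classes are hypothetical stand-ins for the question's types (Programs is assumed to be a List), and the ids to keep are copied into a HashSet so that each lookup stays cheap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's types.
public class AgencyProgram
{
    public int ProgramId { get; set; }
}

public class Agency
{
    public List<AgencyProgram> Programs { get; set; } = new List<AgencyProgram>();
}

public static class ProgramFilter
{
    // Removes, in place, every program whose id is not in idsToKeep.
    public static void KeepOnly(List<Agency> agencies, IEnumerable<int> idsToKeep)
    {
        var keep = new HashSet<int>(idsToKeep);

        foreach (var agency in agencies)
        {
            // One pass per agency; the predicate checks against all ids at once,
            // so there is no need for an outer loop over the ids.
            agency.Programs.RemoveAll(p => !keep.Contains(p.ProgramId));
        }
    }
}
```

The original attempt looped over the ids and removed whole agencies; here the loop runs over the agencies and only their Programs lists are trimmed.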

Detect entities which have the same children

I have two entities, Class and Student, linked in a many-to-many relationship.
When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.
For example:
{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }
{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }
There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.
I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:
var sb = new StringBuilder();
using (var ctx = new Ctx()) {
ctx.CommandTimeout = 10000; // Because the next line takes so long!
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
foreach (var c in allClasses) {
var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
foreach (var d in duplicates)
sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
}
}
lblResult.Text = sb.ToString();
This is no good because I get the error:
NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.
Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId in LINQ.
Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.
I'd really appreciate some advice.
The comparison of the SubjectId is not the problem because c.SubjectId is a value of a primitive type (int, I guess). The exception complains about Equals(c.Students). c.Students is a constant (with respect to the query duplicates) but not a primitive type.
I would also try to do the comparison in memory and not in the database. You are loading the whole data into memory anyway when you start your first foreach loop: It executes the query allClasses. Then inside of the loop you extend the IQueryable allClasses to the IQueryable duplicates which gets executed then in the inner foreach loop. This is one database query per element of your outer loop! This could explain the poor performance of the code.
So I would try to perform the content of the first foreach in memory. For the comparison of the Students list it is necessary to compare element by element, not the references to the Students collections because they are for sure different.
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
.ToList(); // executes query, allClasses is now a List, not an IQueryable
// everything from here runs in memory
foreach (var c in allClasses)
{
var duplicates = allClasses.Where(
o => o.SubjectId == c.SubjectId &&
o.Id != c.Id &&
o.Students.OrderBy(s => s.Name).Select(s => s.Name)
.SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));
// duplicates is an IEnumerable, not an IQueryable
foreach (var d in duplicates)
sb.Append(d.LongName)
.Append(" is a duplicate of ")
.Append(c.LongName)
.Append("<br />");
}
}
lblResult.Text = sb.ToString();
Ordering the sequences by name is necessary because, I believe, SequenceEqual compares length of the sequence and then element 0 with element 0, then element 1 with element 1 and so on.
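As an alternative to comparing every class with every other class, the duplicates can also be found by grouping on a computed key. This is a different technique from the code above, shown only as a sketch: SchoolClass is a hypothetical in-memory shape, and it assumes students can be identified by an integer id (StudentIds), which avoids treating two different students with the same name as equal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of a class after loading; StudentIds is an assumption.
public class SchoolClass
{
    public int Id { get; set; }
    public string Title { get; set; }
    public int SubjectId { get; set; }
    public List<int> StudentIds { get; set; } = new List<int>();
}

public static class DuplicateFinder
{
    // Builds one string key per class: the subject id plus the sorted student ids.
    public static string KeyFor(SchoolClass c)
    {
        return c.SubjectId + ":" + string.Join(",", c.StudentIds.OrderBy(id => id));
    }

    // Returns only the groups that contain more than one class,
    // each group ordered by class id.
    public static List<List<SchoolClass>> FindDuplicateGroups(IEnumerable<SchoolClass> classes)
    {
        return classes
            .GroupBy(c => KeyFor(c))
            .Where(g => g.Count() > 1)
            .Select(g => g.OrderBy(c => c.Id).ToList())
            .ToList();
    }
}
```

Each class is visited once to build its key, and every resulting group lists all classes that share a subject and the same set of students, which is also a convenient starting point for merging them.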
Edit To your comment that the first query is still slow.
If you have 1300 classes with 30 students each, the performance of eager loading (Include) could suffer from the multiplication of data that is transferred between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance? . The query is complex because it needs a JOIN between classes and students, and object materialization is complex as well because EF must filter out the duplicated data when the objects are created.
An alternative approach is to load only the classes without the students in the first query and then load the students one by one inside of a loop explicitly. It would look like this:
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
foreach (var c in allClasses)
{
// "Explicit loading": This is a new roundtrip to the DB
ctx.LoadProperty(c, "Students");
}
foreach (var c in allClasses)
{
// ... same code as above
}
}
lblResult.Text = sb.ToString();
You would have 1 + 1300 database queries in this example instead of only one, but you won't have the data multiplication which occurs with eager loading and the queries are simpler (no JOIN between classes and students).
Explicit loading is explained here:
http://msdn.microsoft.com/en-us/library/bb896272.aspx
For POCOs (works also for EntityObject derived entities): http://msdn.microsoft.com/en-us/library/dd456855.aspx
For EntityObject derived entities you can also use the Load method of EntityCollection: http://msdn.microsoft.com/en-us/library/bb896370.aspx
If you work with lazy loading, the first foreach with LoadProperty is not necessary, as each Students collection will be loaded the first time you access it. It should result in the same 1300 additional queries as explicit loading.
