Detect entities which have the same children - c#

I have two entities, Class and Student, linked in a many-to-many relationship.
When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.
For example:
{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }
{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }
There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.
I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:
var sb = new StringBuilder();
using (var ctx = new Ctx()) {
    ctx.CommandTimeout = 10000; // Because the next line takes so long!
    var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
    foreach (var c in allClasses) {
        var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
        foreach (var d in duplicates)
            sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
    }
}
lblResult.Text = sb.ToString();
This is no good because I get the error:
NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.
Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId in LINQ.
Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.
I'd really appreciate some advice.

The comparison of the SubjectId is not the problem, because c.SubjectId is a value of a primitive type (int, I guess). The exception complains about Equals(c.Students): c.Students is a constant with respect to the query duplicates, but not of a primitive type.
I would also do the comparison in memory and not in the database. You are loading all the data into memory anyway when you start your first foreach loop: it executes the query allClasses. Then, inside the loop, you extend the IQueryable allClasses into the IQueryable duplicates, which is executed in the inner foreach loop. That is one database query per element of your outer loop, which could explain the poor performance of the code.
So I would perform the body of the outer foreach in memory. For the Students lists it is necessary to compare element by element, not the references to the Students collections, because the references will certainly differ.
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
        .ToList(); // executes query; allClasses is now a List, not an IQueryable
    // everything from here runs in memory
    foreach (var c in allClasses)
    {
        var duplicates = allClasses.Where(
            o => o.SubjectId == c.SubjectId &&
                 o.Id != c.Id &&
                 o.Students.OrderBy(s => s.Name).Select(s => s.Name)
                     .SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));
        // duplicates is an IEnumerable, not an IQueryable
        foreach (var d in duplicates)
            sb.Append(d.LongName)
                .Append(" is a duplicate of ")
                .Append(c.LongName)
                .Append("<br />");
    }
}
lblResult.Text = sb.ToString();
Ordering the sequences by name is necessary because, I believe, SequenceEqual compares the sequence lengths and then element 0 with element 0, element 1 with element 1, and so on.
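If the pairwise comparison is still too slow in memory (it compares every class with every other class), grouping on a composite key is a possible alternative. This is a minimal sketch, not from the original answer, assuming Student exposes a Name property as the query above does:
var duplicateGroups = allClasses
    .GroupBy(c => new
    {
        c.SubjectId,
        // key on subject plus the sorted, joined student names
        StudentKey = string.Join("|", c.Students.Select(s => s.Name).OrderBy(n => n))
    })
    .Where(g => g.Count() > 1); // groups with more than one class are duplicate sets

foreach (var g in duplicateGroups)
    sb.Append(string.Join(", ", g.Select(c => c.LongName)))
      .Append(" are duplicates<br />");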
Edit: To your comment that the first query is still slow.
If you have 1300 classes with 30 students each, the performance of eager loading (Include) could suffer from the multiplication of data transferred between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance?. The query is complex because it needs a JOIN between classes and students, and object materialization is expensive as well because EF must filter out the duplicated data when the objects are created.
An alternative approach is to load only the classes (without the students) in the first query and then load the students explicitly, one class at a time, inside a loop. It would look like this:
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
    foreach (var c in allClasses)
    {
        // "Explicit loading": this is a new roundtrip to the DB
        ctx.LoadProperty(c, "Students");
    }
    foreach (var c in allClasses)
    {
        // ... same code as above
    }
}
lblResult.Text = sb.ToString();
You would have 1 + 1300 database queries in this example instead of only one, but you avoid the data multiplication that occurs with eager loading, and the queries are simpler (no JOIN between classes and students).
Explicit loading is explained here:
http://msdn.microsoft.com/en-us/library/bb896272.aspx
For POCOs (works also for EntityObject derived entities): http://msdn.microsoft.com/en-us/library/dd456855.aspx
For EntityObject derived entities you can also use the Load method of EntityCollection: http://msdn.microsoft.com/en-us/library/bb896370.aspx
If you work with lazy loading, the first foreach with LoadProperty is not necessary, as each Students collection will be loaded the first time you access it. It should result in the same 1300 additional queries as explicit loading.
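For completeness, a minimal sketch of turning lazy loading on, assuming Ctx derives from ObjectContext as the LoadProperty call above suggests (POCO entities additionally need virtual navigation properties for proxies to work):
using (var ctx = new Ctx())
{
    ctx.ContextOptions.LazyLoadingEnabled = true; // make sure lazy loading is on

    var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // no Include
    foreach (var c in allClasses)
    {
        // first access triggers one query per class, like explicit loading
        var studentCount = c.Students.Count;
    }
}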

Related

How to Performance Test This and Suggestions to Make Faster?

I seem to have written a very slow piece of code, and it gets even slower when EF Core is involved.
Basically I have a list of items that store their attributes as a JSON string in the database, since I am storing many different items with different attributes.
I then have another table that contains the display order for each attribute, so when I send the items to the client I order the attributes accordingly.
It is kind of slow: about 18-30 seconds for 700 records (measured from where I start my timer, not the whole block of code).
var inventoryItemDtos = new List<InventoryItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId);
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
Stopwatch a = new Stopwatch();
a.Start();
foreach (var item in inventoryItems)
{
    var specs = JObject.Parse(item.Attributes);
    var specDtos = new List<SpecDto>();
    foreach (var inventorySpecification in inventorySpecifications.OrderBy(x => x.DisplayOrder))
    {
        if (specs.ContainsKey(inventorySpecification.JsonKey))
        {
            var value = specs.GetValue(inventorySpecification.JsonKey);
            var newSpecDto = new SpecDto()
            {
                Key = inventorySpecification.JsonKey,
                Value = value.ToString()
            };
            specDtos.Add(newSpecDto);
        }
    }
    var dto = new InventoryItemDto()
    {
        // create dto
    };
    inventoryItemDtos.Add(dto);
}
Now it goes crazy slow when I ask EF for some more columns that I need info from.
In the // create dto area I access information from other tables:
var dto = new InventoryItemDto()
{
    // access brand columns
    // access company columns
    // access branch columns
    // access country columns
    // access state columns
};
Accessing these columns in the loop takes 6 minutes to process 700 rows.
I don't understand why it is so slow; it's the only change I really made, and I made sure to eager load everything.
It almost makes me think eager loading is not working, but I don't know how to verify whether it is:
var inventoryItems = dbContext.InventoryItems.Include(x => x.Branch).ThenInclude(x => x.Company)
    .Include(x => x.Branch).ThenInclude(x => x.Country)
    .Include(x => x.Branch).ThenInclude(x => x.State)
    .Include(x => x.Brand)
    .Where(x => x.InventoryCategoryId == categoryId).ToList();
Because of this, I thought the speed would not be much different from the original 18-30 seconds.
I would like to speed up the original code too, but I am not really sure how to get rid of the nested foreach loops that are probably slowing it down.
First, loops inside loops are a very bad thing; you should refactor that out and make it a single loop. This should not be a problem, because inventorySpecifications is declared outside the loop.
Second, the line
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
should end with ToList(), because its enumeration happens inside the inner foreach, which means the query runs once for every element of inventoryItems.
That should save you a good amount of time.
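For illustration, the materialised version might look like this (a sketch of the suggested change, keeping the names from the question):
var inventorySpecifications = dbContext.InventoryCategorySpecifications
    .Where(x => x.InventoryCategoryId == categoryId)
    .Select(x => x.InventorySpecification)
    .ToList(); // executes once; the inner foreach now iterates an in-memory list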
I'm no expert, but this part of your second foreach raises a red flag: inventorySpecifications.OrderBy(x => x.DisplayOrder). Because this is called inside another foreach, the .OrderBy call runs every time you iterate over inventoryItems.
Before your first foreach loop, try this: var orderedInventorySpecs = inventorySpecifications.OrderBy(x => x.DisplayOrder); and then use foreach (var inventorySpec in orderedInventorySpecs) and see if it makes a difference.
To help you better understand what EF is running behind the scenes, add some logging to expose the SQL being run; this might help you see how and where your queries are going wrong, and in particular whether they are hitting the DB too often. As a very general rule you want to hit the DB as few times as possible and retrieve only the information you need, via .Select(), to reduce what is being returned. The docs for the logging are: http://learn.microsoft.com/en-us/ef/core/miscellaneous/logging
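As a minimal sketch, assuming EF Core 5 or later (earlier versions wire up an ILoggerFactory via UseLoggerFactory instead); the context name and the SQL Server provider are assumptions:
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

public class InventoryContext : DbContext // hypothetical context name
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder
            .UseSqlServer("<connection string>")
            .LogTo(Console.WriteLine, LogLevel.Information); // print generated SQL to the console
}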
I obviously cannot test this, and I am a little unsure where your specDtos go once you have them, but I assume they become part of the InventoryItemDto?
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId).Select(x => new InventoryItemDto() {
    Attributes = x.Attributes,
    //.....
    // access brand columns
    // access company columns
    // access branch columns
    // access country columns
    // access state columns
}).ToList();
var inventorySpecifications = dbContext.InventoryCategorySpecifications
    .Where(x => x.InventoryCategoryId == categoryId)
    .OrderBy(x => x.DisplayOrder)
    .Select(x => x.InventorySpecification)
    .ToList();
foreach (var item in inventoryItems)
{
    var specs = JObject.Parse(item.Attributes);
    // Assuming the specs become part of an inventory item?
    item.Specs = inventorySpecifications
        .Where(x => specs.ContainsKey(x.JsonKey))
        .Select(x => new SpecDto() { Key = x.JsonKey, Value = specs.GetValue(x.JsonKey).ToString() })
        .ToList();
}
The first call to the DB for inventoryItems should produce one SQL query that pulls all the information you need at once to construct your InventoryItemDto, and thus hits the DB only once. The second pulls the specs and applies OrderBy() before materialising, which means the OrderBy runs as part of the SQL query rather than in memory. Both results are materialised via .ToList(), which causes EF to pull them into memory in one go.
Finally the loop goes over your constructed inventoryItems, parses the JSON and then filters the specs based on that. I was unsure where you were using the specDtos, so I assumed they were part of the model. I would recommend checking the performance of the JSON work you are doing, as that could be contributing to your slowdown.
A more integrated approach to using JSON as part of your EF models can be seen at this answer: https://stackoverflow.com/a/51613611/621524 - however, you will still be unable to use those properties to offload execution to SQL, as accessing properties that are defined within code causes queries to fragment and run in several parts.

LINQ Join/Update List of Objects from Database

This issue is a new one to me in LINQ. And maybe I'm going about this wrong.
What I have is a list of objects in memory, which could number up to 100k, and I need to find in my database which objects represent an existing customer.
This search needs to be done across multiple object properties and all I have to go on are the name and address of the person - no unique identifier since this data comes from an outside source.
Is it possible to join my generic of objects against my database context and then update the generic objects, with data from the context, based on whether they are found in the join?
I thought I was getting close to a working join with the code below. And I think the join works... maybe. But I can't even seem to loop through the records.
public void FindCustomerMatches(List<DocumentLine> lines)
{
    IQueryable<DocumentLine> results = null;
    var linesQuery = lines.AsQueryable();
    using (var customerContext = new Entities())
    {
        customerContext.Configuration.LazyLoadingEnabled = false;
        var dbCustomerQuery = customerContext.customers.Where(c => !c.customernumber.StartsWith("D"));
        results = from c in dbCustomerQuery
                  from l in linesQuery
                  where c.firstname1 == l.CustomerFirstName
                        && c.lastname1 == l.CustomerLastName
                        && c.street_address1.Contains(l.CustomerAddress)
                        && c.city == l.CustomerCity
                        && c.state == l.CustomerState
                        && c.zip == l.CustomerZip
                  select l;
        foreach (var result in results)
        {
            // Do something with each record here, like update it.
        }
    }
}
It seems to me that you have two collections: a local collection of DocumentLines in the variable lines, and a collection of Customers in customerContext.Customers, probably in a database management system.
Every DocumentLine contains several properties that can also be found in a Customer. Alas, you didn't say whether all DocumentLine properties can be found in a Customer.
From lines (the local collection of DocumentLines) you want to keep only those DocumentLines for which there is at least one Customer in your queryable collection of Customers that matches all these properties.
So the result is a sequence of DocumentLines, a sub-collection of lines.
The problem is that you don't want a sub-collection of the database table Customers; you want a sub-collection of your local lines.
Using AsQueryable doesn't transport your lines to your DBMS. I doubt the query you defined can be performed by the DBMS; I suspect all Customers would have to be transported to your local process to perform the query.
If all properties of a DocumentLine are in a Customer, then it is possible to extract the DocumentLine properties from every Customer and use Queryable.Contains to keep only those extracted DocumentLines that are in your lines:
IQueryable<DocumentLine> customerDocumentLines = dbContext.Customers
    .Select(customer => new DocumentLine()
    {
        FirstName = customer.FirstName,
        LastName = customer.LastName,
        ...
        // etc, fill all DocumentLine properties
    });
Note: the query is not executed yet! No communication with the DBMS has taken place.
Your requested result is all customerDocumentLines that are contained in lines, with duplicates removed.
var result = customerDocumentLines        // the document lines extracted from all Customers
    .Distinct()                           // remove duplicates
    .Where(line => lines.Contains(line)); // keep only those lines that are in lines
This won't work if you can't extract a complete DocumentLine from a Customer. And if lines contains duplicates, the result won't show those duplicates.
If you can't extract all properties of a DocumentLine, you'll have to move the values to check into local memory:
var valuesToCompare = dbContext.Customers
    .Select(customer => new
    {
        FirstName = customer.FirstName,
        LastName = customer.LastName,
        ...
        // etc, fill all values you need to check
    })
    .Distinct()       // remove duplicates
    .AsEnumerable();  // make it IEnumerable,
                      // = efficiently move to local memory
Now you can use Enumerable.Contains to get the subset of lines. You'll need to compare by value, not by reference. Luckily, anonymous types compare for equality by value:
var result = lines
    // pair each line with the values to compare
    .Select(line => new
    {
        Line = line,
        ValuesToCompare = new
        {
            FirstName = line.CustomerFirstName,
            LastName = line.CustomerLastName,
            ...
        }
    })
    // keep only those lines whose values match valuesToCompare
    .Where(pair => valuesToCompare.Contains(pair.ValuesToCompare))
    .Select(pair => pair.Line);
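A possible refinement, not part of the original answer: with up to 100k lines, Contains on a plain sequence is a linear scan per line, so putting the compared values into a HashSet makes each membership test O(1). ToHashSet needs .NET Framework 4.7.2 / .NET Core 2.0 or later; on older frameworks, fill a HashSet manually. A sketch under the same assumptions as above:
// Anonymous types hash and compare by value, so they work as HashSet elements.
var valueSet = valuesToCompare.ToHashSet();

var result = lines
    .Where(line => valueSet.Contains(new
    {
        FirstName = line.CustomerFirstName,
        LastName = line.CustomerLastName
        // etc, the same properties in the same order as in valuesToCompare
    }));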

Why futures are not cached in session 1st level cache?

I have "static" readonly entities which I simply load with QueryOver<T>().List<T>(). All their properties are not "lazy". So some of them have N+1 problem.
I tried to use Future to avoid N+1. But it looks like then NH considers entity properties as "lazy". And when I access them it even reloads entities from db one by one (leading to the same N+1 situation) despite that all entities were preloaded preliminary and should be cached in the session 1st level cache. Here is the code how I'm doing it:
var futures = new List<IEnumerable>();
futures.Add(s.QueryOver<DailyBonus>().Future<DailyBonus>());
futures.Add(s.QueryOver<DailyBonusChestContent>().Future<DailyBonusChestContent>());
// ... other entities ...

// all queries should be sent with the first enumeration,
// but I want to ensure everything is loaded
// before using lazy properties
foreach (IEnumerable future in futures)
{
    if (future.Cast<object>().Any(x => false)) break; // forces full enumeration
}
// now everything should be in cache, right?
// so I can travel the whole graph without accessing the db?
Serializer.Serialize(ms, futures); // wow, N+1 here!
I checked this behavior using the Hibernating Rhinos profiler.
So what is going wrong here?
The only correct way of using futures for loading an entity with its collections is with Fetch, which means a join in each query:
var q = session.Query<User>().Where(x => x.Id == id);
var lst = new List<IEnumerable>
{
    q.FetchMany(x => x.Characters).ToFuture(),
    q.Fetch(x => x.UpdateableData).ToFuture(),
    session.QueryOver<User>().Where(x => x.Id == id)
        .Fetch(x => x.Characters).Eager
        .Fetch(x => x.Characters.First().SmartChallengeTrackers).Eager
        .Future()
};
var r = session.QueryOver<User>().Where(x => x.Id == id)
    .TransformUsing(Transformers.DistinctRootEntity)
    .Future();
foreach (IEnumerable el in lst)
{
    foreach (object o in el)
    {
    }
}
return r.ToArray();
It's still better than joining everything in one query - NHibernate won't have to parse the thousands of rows introduced by join x join x join x join...
You can add normal select queries (without Fetch) to the same batch, but they won't be used for retrieving collections on another entity.
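For completeness, a minimal sketch of such batching, reusing the entities from the question (nothing hits the database until the first future is enumerated; then all queued queries go to the server as one batch):
// Queue several independent queries; none is executed yet.
var bonuses = s.QueryOver<DailyBonus>().Future<DailyBonus>();
var chests = s.QueryOver<DailyBonusChestContent>().Future<DailyBonusChestContent>();

// The first enumeration sends the whole batch in a single roundtrip.
var bonusList = bonuses.ToList();
var chestList = chests.ToList(); // already fetched, no extra query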

Something like a VLOOKUP

I'm attempting to merge two lists of different objects, where a specific field (employeeID) in one list is equal to a specific field ([0,0]) in the other. My code looks like this:
int i = Users.Count() - 1;
int i2 = oracleQuery.Count() - 1;
for (int c = 0; c <= i; c++)
{
    for (int d = 0; d <= i2; d++)
    {
        if (Users[c].getEmployeeID().ToString() == oracleQuery[d][0,0].ToString())
        {
            Users[c].setIDMStatus(oracleQuery[d][0,1].ToString());
        }
    }
}
This works... but it doesn't seem efficient. Any suggestions for more efficient code that will ultimately lead to the Users list containing the new information from the oracleQuery list?
You could use a join with Enumerable.Join:
var matches = Users.Join(oracleQuery,
    u => u.getEmployeeID().ToString(),
    oq => oq[0,0].ToString(),
    (u, oq) => new { User = u, Status = oq[0,1].ToString() });

foreach (var match in matches)
    match.User.setIDMStatus(match.Status);
Note that you could eliminate the ToString() calls if getEmployeeID() and the oracleQuery's [0,0] element are of the same type.
The only thing I notice as far as efficiency goes is the use of the Enumerable.Count() method, which enumerates the results once before you loop through them again explicitly in your for loops. I think the LINQ implementation will get rid of that extra pass to count the elements; Enumerable.Join also builds a hash lookup of the keys internally, so it avoids the quadratic nested scan.
I don't know how you feel about using LINQ query expressions, but this is what I like best:
var matched = from user in Users
              join item in oracleQuery on user.getEmployeeID().ToString() equals item[0,0].ToString()
              select new { user = user, IDMStatus = item[0,1].ToString() };

foreach (var pair in matched)
{
    pair.user.setIDMStatus(pair.IDMStatus);
}
You could also use nested foreach loops (if there can be multiple matches and set is called multiple times):
foreach (var user in Users)
{
    foreach (var match in oracleQuery.Where(item => user.getEmployeeID().ToString() == item[0,0].ToString()))
    {
        user.setIDMStatus(match[0,1].ToString());
    }
}
Or if there will only ever be one match:
foreach (var user in Users)
{
    var match = oracleQuery.SingleOrDefault(item => user.getEmployeeID().ToString() == item[0,0].ToString());
    if (match != null)
    {
        user.setIDMStatus(match[0,1].ToString());
    }
}
I don't think there is any real efficiency problem with what you've written, but you can benchmark it against the LINQ implementations. Using foreach or a LINQ query expression might make the code easier to read; you can also write the query using LINQ method syntax, as was done in another answer.
If the data comes from a database you could do the join there. Otherwise, you could sort the two lists and do a merge join, which would be faster than what you have now.
However, since C# introduced LINQ there are a lot of ways to do this in code. Just look up using linq to join/merge lists.
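As one more hedged sketch of the same idea: building a dictionary keyed on the employee id gives O(1) lookups per user instead of a scan of oracleQuery for each one. It assumes, as the loops above do, that [0,0] holds the id and [0,1] the status; if an id can appear more than once, the First() below picks an arbitrary match:
// Index the oracle rows by employee id once.
var statusById = oracleQuery
    .GroupBy(oq => oq[0,0].ToString())
    .ToDictionary(g => g.Key, g => g.First()[0,1].ToString());

foreach (var user in Users)
{
    // O(1) hash lookup instead of an inner loop.
    string status;
    if (statusById.TryGetValue(user.getEmployeeID().ToString(), out status))
        user.setIDMStatus(status);
}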

Using Linq to SQL, how do I Eager Load all child and any nested children results

I have 5 tables in an L2S Classes dbml: Global >> Categories >> ItemType >> Item >> ItemData. For the example below I have only gone as far as ItemType.
// cdc is my datacontext
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Global>(p => p.Category);
options.AssociateWith<Global>(p => p.Category.OrderBy(o => o.SortOrder));
options.LoadWith<Category>(p => p.ItemTypes);
options.AssociateWith<Category>(p => p.ItemTypes.OrderBy(o => o.SortOrder));
cdc.LoadOptions = options;

TraceTextWriter traceWriter = new TraceTextWriter();
cdc.Log = traceWriter;

var query =
    from g in cdc.Global
    where g.active == true && g.globalid == 41
    select g;
var globalList = query.ToList();
// In this case I have hardcoded an id while I figure this out,
// but I intend to find a way to include something like globalid in (#,#,#)
foreach (var g in globalList)
{
    // I only have one result set, but if I had multiple globals this would run
    // however many times and execute multiple queries, like it does farther
    // down in the hierarchy
    List<Category> categoryList = g.category.ToList<Category>();

    // Doing some processing that sticks the parent record into a hierarchical collection
    var categories = (from comp in categoryList
                      where comp.Type == i
                      select comp).ToList<Category>();
    foreach (var c in categories)
    {
        // Doing some processing that sticks child records into a hierarchical collection.
        // Here is where multiple queries are run for each type collection in the category.
        // I want to somehow run this above the loop, once, where I can get all the Items
        // for the categories, and just do a filter.
        List<ItemType> typeList = c.ItemTypes.ToList<ItemType>();
        var itemTypes = (from cat in typeList
                         where cat.itemLevel == 2
                         select cat).ToList<ItemType>();
        foreach (var t in itemTypes)
        {
            // Doing some processing that sticks child records into a hierarchical collection
        }
    }
}
"List typeList = c.ItemTypes.ToList();"
This line gets executed numerous times in the foreach, and a query is executed to fetch the results, and I understand why to an extent, but I thought it would eager load on Loadwith as an option, as in fetch everything with one query.
So basically I would have expected L2S, behind the scenes, to fetch the "global" records in one query, take any primary key values, and get the "category" children with one query. Take those results and stick them into collections linked to the global. Then take all the category keys and execute one query to fetch the itemtype children and link those into their associated collections. Something on the order of: Select * from ItemTypes Where CategoryID in (select categoryID from Categories where GlobalID in (#,#,#)).
I would like to know how to properly eager load associated children with minimal queries, and possibly how to accomplish my routine generically: not knowing how far down I need to build the hierarchy, but given a parent entity, grab all the associated child collections and then do what I need to do.
Linq to SQL has some limitations with respect to eager loading.
So Eager Load in Linq To SQL is only eager loading for one level at a time. As it is for lazy loading, with Load Options we will still issue one query per row (or object) at the root level, and this is something we really want to avoid to spare the database - which is kind of the point with eager loading. The way LINQ to SQL issues queries for the hierarchy will decrease the performance by log(n), where n is the number of root objects. Calling ToList won't change the behavior, but it will control when in time all the queries will be issued to the database.
For details see:
http://www.cnblogs.com/cw_volcano/archive/2012/07/31/2616729.html
I am sure this could be done better, but I got my code working with minimal queries - one per level. This is obviously not really eager loading using L2S, but if someone knows the right way, I would like to know for future reference.
var query =
    from g in cdc.Global
    where g.active == true && g.globalId == 41
    select g;
var globalList = query.ToList();
var g = globalList.Single(); // one global in this example (id 41)
List<Category> categoryList = g.category.ToList<Category>();

var categoryIds = from c in cdc.Category
                  where c.globalId == g.globalId
                  select c.categoryId;

var types = from t in cdc.ItemTypes
            where categoryIds.Any(i => i == t.categoryId)
            select t;
List<ItemType> typeList = types.ToList<ItemType>();

var items = from i in cdc.Items
            from d in cdc.ItemData
            where i.ItemId == d.ItemId && d.labelId == 1
            where types.Any(t => t.ItemTypeId == i.ItemTypeId) // property names approximate
            select new
            {
                i.Id,
                // A bunch more fields, shortened for brevity
                d.Data
            };
var ItemList = items.ToList();
// Keep going down the hierarchy if you need more child results

// Do your processing, pseudocode:
// for each item in list
//     filter child list
//     for each item in child list
//         .....
Wouldn't mind knowing how to do all this using generics and a recursive method, given the top-level table.
