Creating a Hierarchy of Nested Entities in Entity Framework - C#

I am trying to create a hierarchical representation in Entity Framework, and I can't seem to find much on the subject after searching around.
Premise: I am working on a backlink monitoring tool where I can paste in a bunch of URLs to see if they point to a specific domain. If so, I want to remove them from the list and store them as top-level (Tier 1) backlinks. After locating and removing all of the backlinks that link directly to the URL, I want to run through the remaining backlinks in the list to see if they point to any of the URLs in the newly-created top-level backlink list, and for the ones that point to the top-level backlinks, store them as Tier 2 backlinks. Then search for Tier 3 backlinks, and so on until the entire list has been checked.
I have a Website entity that contains the Url that is to be used for the first run through the list of imported backlinks. Those that are found are moved to a list, and their URLs are used when looping through the 2nd time, and so on.
I originally created a separate property in the Website entity for each "Tier" of links, but that doesn't seem very efficient: when rendering the hierarchy, the code has to loop through each Tier and re-match the URLs from the tiers below to recreate the actual linking structure.
So I instead believe I should create a single "Backlink" model, and have each backlink entity store a list of the backlinks below it, then when trying to view the backlink hierarchy, just do a simple loop through, and loop through each sub-backlink entity.
A sample of the backlink entity is as follows:
public class Backlink
{
    public int BacklinkID { get; set; }
    public string Url { get; set; }
    public string AnchorText { get; set; }
    public string LinksTo { get; set; }
    public int PageAuthority { get; set; }
    public int PageRank { get; set; }
    public virtual ICollection<Backlink> Backlinks { get; set; }
}
I have already written the code that goes through each backlink's HTML and checks whether it points to a given URL, so now I'm trying to figure out the best way to store the results.
Is creating an entity that stores a list of its same type of entity a smart approach, or am I going about this all wrong? Will doing something in this way hurt the performance when querying the database?
Ideally I would like to use lazy loading and show only the top-tier backlinks at first, then when a specific backlink is clicked, have EF make another call to go and fetch the sub-backlinks, and so on - so would this storage approach with lazy loading be smart, or should I scrap that idea and figure out a totally different schema?
I'm not great with EF yet, so any insights on the best approach would be greatly appreciated.

What you are trying to implement is called an Adjacency List. Just adding an ICollection<Backlink> Backlinks collection is fine (a proper model configuration is required, of course - a sketch follows after the two options below). However, an adjacency list is not a good friend of performance, particularly in the typical EF implementation (exactly what you suggested). There are two options:
Like you suggested, load links level-by-level on demand. In this case, the model itself actually works fine (each level is a very simple SELECT, as @Danexxtone mentioned). However, you will make a lot of requests to the app server / DB, hence probably a not-so-good user experience.
You may want to load the whole tree in order to show nodes to the user without any delay. Doing this through EF means recursion over navigation collections, and that's really the worst idea - far too many requests to the DB.
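Regarding the "proper model configuration": here is a minimal sketch using the EF6 fluent API. It assumes you add a virtual Backlink Parent navigation property and a nullable int? ParentBacklinkID foreign key to the entity (neither is in the original class):

using System.Data.Entity;

public class BacklinkContext : DbContext
{
    public DbSet<Backlink> Backlinks { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Self-referencing one-to-many: each Backlink optionally has a parent,
        // and a parent holds the collection of its child backlinks.
        modelBuilder.Entity<Backlink>()
            .HasOptional(b => b.Parent)              // assumed navigation property
            .WithMany(p => p.Backlinks)
            .HasForeignKey(b => b.ParentBacklinkID); // assumed nullable FK; null = Tier 1
    }
}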
It seems that EF doesn't have more options. But you can use plain SQL (through the EF data context, by the way)... and there are much more interesting approaches there:
CTE (like @Jon mentioned). It works over the adjacency list without any additional changes to the DB structure. Not a bad option, but not the best.
Tree path column. Let's number the root of the hierarchy "1", the level-1 links "2", "3", "4", and a level-2 link "5". Each node in the tree - each link - then has a unique string path like "1/2/5/". Just add one more "Path" column to the DB and you can extract a whole sub-tree using a simple LIKE expression (or even .StartsWith in EF - see the sketch after this list).
I assume that you're using an MS SQL Server DB. Then you have an even better option - the hierarchyid data type. It's not supported by EF, but it provides all the "tree path" functionality out of the box.
I wrote that CTE is not the best option. That's because of performance - queries using a string tree path are much more efficient (don't forget about indexes). hierarchyid performs a little better than a string tree path, and its advantage is the built-in API for tree manipulation.
One more interesting approach is Nested Sets. However, I wouldn't recommend it - inserting new nodes carries a huge overhead, and it's not easy to code.
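To illustrate the tree path option: a sketch of pulling a whole sub-tree in one query, assuming a string Path column has been added to the Backlink entity (the column and context names are hypothetical):

// Path of the node whose descendants we want, e.g. the link numbered "2".
string rootPath = "1/2/";

// EF translates StartsWith into LIKE '1/2/%', which can use an index on Path.
var subTree = context.Backlinks
    .Where(b => b.Path.StartsWith(rootPath))
    .OrderBy(b => b.Path)
    .ToList();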
Conclusion
If you are comfortable with SQL itself and with using plain SQL through EF - the best option could be hierarchyid.
If you want to code using only EF - the adjacency list is the only option. Just do not retrieve deep sub-trees by recursively traversing navigation collections - that can really hurt.
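For illustration, a sketch of the hierarchyid route through the EF data context's raw SQL support (EF6's Database.SqlQuery). The Node column name and the DTO are hypothetical; IsDescendantOf is hierarchyid's built-in API:

// Plain DTO; SqlQuery maps result columns to properties by name.
public class BacklinkRow
{
    public int BacklinkID { get; set; }
    public string Url { get; set; }
}

// Fetch the whole sub-tree under one backlink in a single query.
var subTree = context.Database.SqlQuery<BacklinkRow>(
    @"SELECT b.BacklinkID, b.Url
      FROM dbo.Backlinks AS b
      WHERE b.Node.IsDescendantOf(
            (SELECT p.Node FROM dbo.Backlinks AS p WHERE p.BacklinkID = @p0)) = 1",
    rootBacklinkId).ToList();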

Related

How to implement LinkedList in Entity Framework Core

I want to persist a linked list of objects in my ASP.NET Core application. For simplicity, I'll use blog and comments, although the real context is much more complex.
I scaffolded two tables and changed ICollection<Comment> to LinkedList<Comment>. However, if I create an initial migration and apply it to an empty database, I don't get anything "linked" in the database (no next or previous). Also, if I seed the data and then do something like this:
var comments = _context.blogs.First().Comments;
I get null. If I leave public virtual ICollection<Comment> Comments, I get the IEnumerable just fine.
I tried to use LinkedList<LinkedListNode<Comment>> instead, and it works nicely until I try to create a migration, which fails with this error:
No suitable constructor found for entity type 'LinkedListNode'. The following constructors had parameters that could not be bound to properties of the entity type: cannot bind 'value' in 'LinkedListNode(Comment value)'; cannot bind 'list', 'value' in 'LinkedListNode(LinkedList list, Comment value)'.
I couldn't find any guidance on how to implement a LinkedList in C#/.NET Core. (Obviously I can do it manually, with next and prev fields, but I would very much prefer to use framework capabilities if possible!)
I don't know much about MS SQL Server, but the whole next and prev approach won't work in MySQL either, because you can't map the keys it would require to track the fields properly: each link has to have an id in the database to link to. The only thing I know of would be to create the next/prev fields yourself and use some custom data persistence or data annotations (see the sketch below). I'm fairly sure the foreign key constraints on any relational database will prevent you from auto-persisting those kinds of fields. The reason is that tracking deletes, inserts, etc. would be a nightmare: if you remove the middle of the chain, the database has to guess where to link the ends.
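If you do build the next pointer yourself, a minimal EF Core sketch might look like this (all names hypothetical; this is the manual approach the answer describes, not framework LinkedList support):

using Microsoft.EntityFrameworkCore;

public class Comment
{
    public int Id { get; set; }
    public string Text { get; set; }

    // Manual "linked list": nullable FK pointing at the next comment.
    public int? NextId { get; set; }
    public Comment Next { get; set; }
}

public class BlogContext : DbContext
{
    public DbSet<Comment> Comments { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Self-referencing one-to-one: EF Core puts a unique index on NextId,
        // so at most one comment can point at any given successor.
        modelBuilder.Entity<Comment>()
            .HasOne(c => c.Next)
            .WithOne()
            .HasForeignKey<Comment>(c => c.NextId)
            .OnDelete(DeleteBehavior.Restrict); // re-link neighbours yourself on delete
    }
}

Walking the chain is then an explicit loop over Next, and removing a node means pointing its predecessor's NextId at its successor before deleting - exactly the bookkeeping described above.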

MongoDB data modelling - Indexes and PK

I'm currently transitioning from an RDBMS to a NoSQL solution, more specifically MongoDB. Consider the following tables in my database (the original solution is much more complex, but I include this so you have an idea):
User (PK_ID_User, FirstName, LastName, ...);
UserProfile: (PK_ID_UserProfile, ProfileName, FK_ID_User, ...);
The keys in these tables are GUIDs, but they are custom generated. For example:
User GUIDs have the following structure: US022d717e507f40a6b9551f11ebf2fcb4 (a US prefix plus random hex),
while UserProfile GUIDs have this format: UP0025f5804a30483b9b769c5707b02af6 (a UP prefix plus random hex).
Now, suppose I want to convert this RDBMS data model to NoSQL / MongoDB. For my application (which uses the C# driver), it is very important that all document properties in MongoDB keep the same names. That also goes for the ID fields: the names PK_ID_User and PK_ID_UserProfile, including the GUIDs, have to stay the same.
Now, MongoDB uses a standard unique indexed property _id for storing ids. The name of this _id field can of course not be changed, even though I really need my application to preserve the column / property names.
So I came up with the following document structures for my Users and User Profiles. Bear in mind that, for this case, I chose to use referenced data modeling over embeds for various reasons I won't explain here:
User document
{
    _id: ObjectId,                  - indexed
    PK_ID_User: custom GUID,        - indexed, as it needs to be unique
    FirstName: string,
    ...
}
UserProfile document
{
    _id: ObjectId,                  - indexed
    PK_ID_UserProfile: custom GUID, - indexed, as it needs to be unique
    ...
}
And here's the C# class:
public class User
{
    [BsonConstructor]
    public User() { }

    [BsonId] // the _id field
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id { get; set; }

    [BsonElement("PK_ID_User")]
    public string PK_ID_User { get; set; }

    //Other Mapper properties
}
The reason I chose this modelling strategy is the following: the current project consists of a whole web service, using an ORM and RDBMS, and a client side that more or less maps the database objects to client-side view objects. So it's really necessary to preserve the names of the Ids / PKs as much as possible. I decided it'd be best to let MongoDB use the ObjectIds internally (for CRUD operations), as they don't cause performance overhead, and to use the custom GUIDs so they stay compatible with the rest of my code. This way, minimal changes have to be made, MongoDB is happy and I am happy, as externally I can keep querying results by my GUID PKs, which will always be unique. Since my PK GUIDs are stored in MongoDB as unique strings, I think I don't have to worry about GUID overhead on the server side: the GUIDs are created by my C# application.
However, I have my doubts about performance: I now always have a minimum of 2 indexes per collection, and I have no idea how costly that is.
Is there a better approach for my problem, or should I stick to my current solution?
Kind regards.
I now always have a minimum of 2 indexes per document / collection, and have no idea how costly it is in terms of performance.
Indexes cost performance on inserts and updates, and you posted no info on the frequency of write operations or on your setup, so it's impossible to give a definite answer without measuring.
Then again, since this is a web application, the sheer network delay to your clients will be several orders of magnitude higher than the difference between, say, 1, 2 or 3 indexes, since these operations mostly hit RAM.
What's costly is writing to disk, not restructuring the B-tree in memory. Of course, having more and more indexes increases the likelihood that an insert leads to a costly restructuring of an index tree that has to hit the disk, but that also depends on the structure of the keys themselves.
If anything, I'd worry about poor cache coherence and the time-locality of GUIDs: if your data were very time-local (like logs), a GUID might hurt (high jitter at the beginning of the string), because updates would be more likely to rearrange entire sub-trees and a typical time-range query would grab items scattered throughout the tree. But since this appears to be about users and user profiles, such a query probably wouldn't make much sense anyway.
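For completeness, here's roughly how the unique index on the custom GUID field would be declared with the C# driver (a sketch assuming the 2.x driver, an IMongoDatabase instance named database, and a hypothetical collection name):

using MongoDB.Driver;

// One-time setup: enforce uniqueness of the custom GUID alongside _id.
var users = database.GetCollection<User>("users");

var indexModel = new CreateIndexModel<User>(
    Builders<User>.IndexKeys.Ascending(u => u.PK_ID_User),
    new CreateIndexOptions { Unique = true });

users.Indexes.CreateOne(indexModel);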

Can I dynamically/on the fly create a class from an interface, and will nHibernate support this practice?

I’ve done some Googling but I have yet to find a solution, or even a definitive answer to my problem.
The problem is simple. I want to dynamically create a table per instance of a dynamically named/created object. Each table would then contain records specific to that object. I am aware that this is essentially an anti-pattern, but these tables could theoretically become quite large, so having all of the data in one table could lead to performance issues.
A more concrete example:
I have a base class/interface ACCOUNT which contains a collection of transactions. For each company that uses my software I create a new concrete version of the class, BOBS_SUB_SHOP_ACCOUNT or SAMS_GARAGE_ACCOUNT, etc. So the identifying value for the class is the class name, not a field within the class.
I am using C# and Fluent nHibernate.
So my questions are:
Does this make sense, or do I need to clarify more? (Or am I trying to do something I REALLY shouldn't?)
Does this pattern have a name?
Does nHibernate support this?
Do you know of any documentation on the pattern I could read?
Edit: I thought about this a bit more and I realized that I don't REALLY need dynamic objects. All I need is a way to tie objects with some identifier to a table through NHibernate. For example:
//begin - just a brain dump
public class Account
{
    public virtual string AccountName { get; set; }
    public virtual IList Stuff { get; set; }
}

// ... somewhere else in code ...

//gets mapped to a table BobsGarageAccount (or something similar)
var BobsGarage = new Account { AccountName = "BobsGarage" };

//gets mapped to a table StevesSubShop (or something similar)
var StevesSubShop = new Account { AccountName = "StevesSubShop" };
//end
That should suffice for what I need, assuming NHibernate would allow it. I am trying to avoid a situation where one giant table would have the heck beaten out of it under high volume on the account tables. If all accounts were in one table... it could be ugly.
Thank you in advance.
Rather than creating a class on the fly, I would recommend a dynamic object. If you implement the right interfaces (one example is here, and in any case you can get there by inheriting from DynamicObject), you can write
dynamic bobsSubShopAccount = new DynamicAccount("BOBS_SUB_SHOP_ACCOUNT");
Console.WriteLine("Balance = {0}", bobsSubShopAccount.Balance);
in your client code. If you use the DLR to implement DynamicAccount, all these calls get intercepted and dispatched to your class at runtime. So, you could have the method
public override bool TryGetMember(GetMemberBinder binder, out object result)
{
    if (DatabaseConnection.TryGetField(binder.Name, out result))
        return true;

    // Log the database failure here
    result = null;
    return false; // The attempt to get the member fails at runtime
}
to read the data from the database using the name of the member requested by client code.
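To make that concrete, a bare-bones DynamicAccount might look like the following sketch; the dictionary is a stand-in for whatever database access you actually use:

using System.Collections.Generic;
using System.Dynamic;

public class DynamicAccount : DynamicObject
{
    private readonly string _accountName;

    // Stand-in for the real data store, keyed by field name.
    private readonly Dictionary<string, object> _fields =
        new Dictionary<string, object>();

    public DynamicAccount(string accountName)
    {
        _accountName = accountName;
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        // In real code: query the table for _accountName using binder.Name.
        return _fields.TryGetValue(binder.Name, out result);
    }

    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        // In real code: persist the field for _accountName here.
        _fields[binder.Name] = value;
        return true;
    }
}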
I haven't used NHibernate, so I can't comment with any authority on how NHibernate will play with dynamic objects.
Those classes seem awfully smelly to me; they attempt to solve what amounts to a storage-layer issue, not a domain issue. Sharding is essentially the term you are looking for.
If you are truly worried about the performance of the DB, and your loads will be that large, perhaps you should look at partitioning the table instead. Your domain objects could easily handle creating the partition key, and you don't have to do crazy voodoo with NHibernate. This will also make it easier to avoid nutty domain-level things if you change your persistence mechanism later. You can create collection filters in your maps, or map read-only objects to a view. The latter option would be a bit smelly in the domain, though.
If you absolutely insist on doing some voodoo, you might want to look at NHibernate.Shards; it was intended for easy database sharding. I can't say what its current dev state and compatibility are, but it's an option.

Custom Explicit Loading in Entity Framework - any way to do it?

I've got a list of Individual entity objects for an employee survey app - an Individual represents an employee or outside rater. The Individual has the parent objects Team and Team.Organization, and the child objects Surveys and Surveys.Responses. Responses, in turn, are related to Questions.
So usually, when I want to check the complete information about an Individual, I need to fetch Individuals.Include(Team.Organization).Include(Surveys.Responses.Question).
That's obviously a lot of includes, and has a performance cost, so when I fetch a list of Individuals and don't need their related objects, I don't bother with the Includes... but then the user wants to manipulate an Individual. So here's the challenge. I seem to have 3 options, all bad:
1) Modify the query that downloads the big list of Individuals to .Include(Team.Organization).Include(Surveys.Responses.Question). This gives it bad performance.
2) Individuals.Load(), TeamReference.Load(), OrganizationReference.Load(), Surveys.Load(), (and iterate through the list of Surveys and load their Responses and the Responses' Questions).
3) When a user wishes to manipulate an Individual, I drop that reference and fetch a whole brand new Individual from the database by its primary key. This works, but is ugly because it means I have two different kinds of Individuals, and I can never use one in place of the other. It also creates ugly problems if I'm iterating across a list repeatedly, as it's tricky to avoid loading and dropping the fully-included Individuals repeatedly, which is wasteful.
Is there any way to say
myIndividual.Include("Team.Organization").Include("Surveys.Responses.Question");
with an existing Individual entity, instead of taking approach (3)?
That is, is there any middle-ground between "fetch everything from the database up-front" and "late-load one relationship at a time"?
A possible solution I'm hoping to get insight on:
So there's no way to do a manually-implemented explicit load on a navigation property? No way to have the system interpret
Individual.Surveys = from survey in MyEntities.Surveys.Include("Responses.Question")
                     where survey.IndividualID == Individual.ID
                     select survey; //Individual.Surveys is the navigation collection property holding Surveys on the Individual.

Individual.Team = from team in MyEntities.Teams.Include("Organization")
                  where team.ID == Individual.TeamID
                  select team;
as just loading the Individual's related objects from the database, instead of treating it as an assignment/update operation? If this means no actual change to the underlying data, can I just do that?
I want a way to manually implement a lazy or explicit load that doesn't do it the dumb (one relation at a time) way. Really, the Teams and Organizations aren't the problem, but the Surveys.Responses.Questions are a massive buttload of database hits.
I'm using 3.5, but for the sake of others (and for when my project finally migrates to 4), I'm sure responses relevant to 4 would be appreciated. In that context, similar customization of lazy loading would be good to hear about too.
edit: Switched the alphabet soup to my problem domain, edited for clarity.
Thanks
The Include statement is designed to do exactly what you're hoping to do. Having multiple includes does indeed eager load the related entities.
Here is a good blog post about it:
http://thedatafarm.com/blog/data-access/the-cost-of-eager-loading-in-entity-framework/
In addition, you can use strongly typed "Includes" using some nifty ObjectContext extension methods. Here is an example:
http://blogs.microsoft.co.il/blogs/shimmy/archive/2010/08/06/say-goodbye-to-the-hard-coded-objectquery-t-include-calls.aspx
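For the middle ground the question asks about: the EF 4.1+ DbContext API (not available on 3.5) added exactly this kind of filtered explicit load via Entry(...).Collection(...).Query(). A sketch, where context is a DbContext and individual an attached entity:

using System.Data.Entity; // Include/Load extension methods

// Explicitly load one navigation collection of an attached Individual,
// eager-loading its children in the same round trip.
context.Entry(individual)
       .Collection(i => i.Surveys)
       .Query()
       .Include("Responses.Question")
       .Load();

// Reference navigations work the same way.
context.Entry(individual)
       .Reference(i => i.Team)
       .Query()
       .Include("Organization")
       .Load();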

DDD Value Object: How to persist entity objects without tons of SQL joins?

Obviously I am wrong in saying that DDD is similar to EAV/CR in usefulness, but the only difference I see so far is physical tables built for each entity, with lots of joins, rather than three tables and lots of joins.
This must be due to my lack of DDD understanding. How do you physically store these objects to the database without significant joins and complication when importing data? I know you can simply create objects that feed into your repository, but it's difficult to train tools like Microsoft SQL Server Integration Services to use your custom C# objects and framework. Maybe that should be my question: how do you use your DDD ASP.NET C# framework with Microsoft SQL Server Integration Services and Reporting Services? LOL.
In an EAV/CR database, we can set up a single Person table with different classes depending on the type of person: Vendor, Customer, Buyer, Representative, Company, Janitor, etc. Three tables, a few joins, attributes are always strings with validation before we insert - just like ModelValidation in MVC, where the object accepts any value but won't persist until it's valid.
In a standard relational model, we used to create a table for each type of entity, mixing in redundant data types like City.
Using Domain Driven Design, we use objects to represent each type of entity, nested objects for each type of ValueObject, and more nested objects still as necessary. In my understanding, this results in a table for each kind of entity and a table for each kind of information set (value object). With all these tables, I see a lot of joins. We also end up creating a physical table for each new contact type. Obviously there is a better way, so I must be incorrect in how I persist objects to a database.
My Vendor looks like this:
public class Vendor
{
    public int vendorID { get; set; }
    public Address vAddress { get; set; }
    public Representative vRep { get; set; }
    public Buyer vBuyer { get; set; }
}
My Buyer:
public class Buyer
{
    public int buyerID { get; set; }
    public Address bAddress { get; set; }
    public Email bEmail { get; set; }
    public Phone bPhone { get; set; }
    public Phone bFax { get; set; }
}
Do we really reference things like Vendor.vBuyer.bPhone.pAreaCode? I would think we would reference and store Vendor.BuyerPhoneNumber, and build the objects almost like aliases to these parts: Vendor.Address1, Vendor.Address2, Vendor.BuyerPhoneNumber ... etc.
The real answer is to match your SQL normalization strategy to your objects. If you have lots of duplicate addresses and you need to associate them together, then normalize the data to a separate table, thus creating the need for the value object.
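To make that concrete: in EF, a value object such as Address can be mapped as a complex type (NHibernate calls this a component), so its fields become columns on the owning entity's table and no join is involved. A sketch using EF's [ComplexType] attribute, with hypothetical fields:

using System.ComponentModel.DataAnnotations.Schema;

[ComplexType] // no key, no table of its own: folded into the owner's table
public class Address
{
    public string Street { get; set; }
    public string City { get; set; }
    public string PostalCode { get; set; }
}

public class Vendor
{
    public int VendorID { get; set; }

    // Persisted as vAddress_Street, vAddress_City, vAddress_PostalCode
    // columns on the Vendor table - read back with zero joins.
    public Address vAddress { get; set; }
}

Only when addresses genuinely need to be shared and deduplicated do you pay for a separate table and a join, which is exactly the normalization advice above.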
You could serialize your objects to XML and save them to an xml column in your SQL Server. After all, you are trying to represent a hierarchical data structure, and that's where XML excels.
Domain-driven design proponents often recommend keeping the data model as close to the object model as possible, but it isn't an ironclad rule.
You can still use an EAV/CR database design if you create mappings in your object-relational mapping layer to transform (project) your data into your objects.
Deciding how to design your objects and provide access to child values is really a separate question that you have to address on a case-by-case basis. Vendor.BuyerPhoneNumber or Vendor.vBuyer.bPhone.pAreaCode? The answer always depends, because it's rooted in your specific requirements.
One of the best ways to store Domain Objects is actually a document database. It works beautifully because the transactional boundary of the document matches perfectly the consistency boundary of the Aggregate Root. You don't have to worry about JOINs, or eager/lazy loading issues. That's not strictly necessary, though, if you apply CQRS (which I've written about below).
The downside is often with querying. If you wish to query directly the persisted data behind your Domain Objects, you can get into knots. However that is a complexity that CQRS aims to solve for you, where you have different parts of your application doing the queries than the parts loading/validating/storing Domain Objects.
You might have a complex "Command" implementation that loads Domain Objects, invokes the behaviour on them (remembering that Domain Objects must have behaviour and encapsulate their data, or else risk becoming "anaemic"), before saving them and optionally even publishing events about what happened.
You might then use those events to update some other "read store", though you don't have to. The point is you have a totally different vertical slice implemented in your application that doesn't have to bother with that complex object model/ORM business, and instead goes straight to the data, loads exactly what it needs and returns it for displaying to the user.
CQRS is not hard, and is not complex. All it does is instruct you to have separate code for:
Handling commands (which mutate state and therefore need the business rules/invariants involved).
Executing queries (which do not mutate state and therefore don't need all the complex business rules/invariants involved, and so can "just" go get the data in an efficient way).
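A minimal sketch of that split (all names hypothetical; the query side here uses Dapper just as an example of going straight to the data):

using System;
using System.Collections.Generic;
using System.Data;
using Dapper;

public class Vendor
{
    public int VendorID { get; private set; }
    public string Name { get; private set; }

    public void Rename(string newName)
    {
        // The invariant lives with the behaviour, on the command side.
        if (string.IsNullOrWhiteSpace(newName))
            throw new ArgumentException("Name is required.", nameof(newName));
        Name = newName;
    }
}

public interface IVendorRepository
{
    Vendor GetById(int id);
    void Save(Vendor vendor);
}

// Command side: loads the aggregate, invokes behaviour, saves.
public class RenameVendorHandler
{
    private readonly IVendorRepository _repository;

    public RenameVendorHandler(IVendorRepository repository)
    {
        _repository = repository;
    }

    public void Handle(int vendorId, string newName)
    {
        var vendor = _repository.GetById(vendorId);
        vendor.Rename(newName); // business rules enforced here
        _repository.Save(vendor);
    }
}

// Query side: no domain objects, just the flat data a screen needs.
public class VendorListQuery
{
    public IEnumerable<VendorListItem> Execute(IDbConnection connection)
    {
        return connection.Query<VendorListItem>(
            "SELECT VendorID, Name, City FROM Vendor");
    }
}

public class VendorListItem
{
    public int VendorID { get; set; }
    public string Name { get; set; }
    public string City { get; set; }
}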
