MongoDB data modelling - Indexes and PK - c#

I'm currently transitioning from an RDBMS to a NoSQL solution, more specifically MongoDB. Consider the following tables in my database (the original solution is much more complex, but I include this so you have an idea):
User (PK_ID_User, FirstName, LastName, ...);
UserProfile (PK_ID_UserProfile, ProfileName, FK_ID_User, ...);
The keys in these tables are GUIDs, but custom generated. For example:
User GUIDs have the following structure: US022d717e507f40a6b9551f11ebf2fcb4 (a US prefix followed by the GUID's hex digits),
while UserProfile GUIDs look like this: UP0025f5804a30483b9b769c5707b02af6 (a UP prefix followed by the hex digits).
Now, suppose I want to convert this RDBMS data model to NoSQL MongoDB. For my application (which uses the C# driver), it is very important that all document properties in MongoDB keep the same names. This also goes for the ID fields: the names PK_ID_User and PK_ID_UserProfile, including the GUID values, have to stay the same.
Now, MongoDB stores ids in a standard, uniquely indexed _id property. The name of this _id field can of course not be changed, even though I really need my application to preserve the column / property names.
So I came up with the following document structures for my Users and User Profiles. Bear in mind that, for this case, I chose to use referenced data modeling over embeds for various reasons I won't explain here:
User document
{
    _id: ObjectId,              // indexed
    PK_ID_User: custom GUID,    // indexed, as it needs to be unique
    FirstName: string,
    ...
}
UserProfile document
{
    _id: ObjectId,                      // indexed
    PK_ID_UserProfile: custom GUID,     // indexed, as it needs to be unique (format as explained above)
    ...
}
And here's the C# class:
public class User
{
    [BsonConstructor]
    public User() { }

    [BsonId] // the _id field
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id { get; set; }

    [BsonElement("PK_ID_User")]
    public string PK_ID_User { get; set; }

    // Other mapped properties
}
The reason I chose this modelling strategy is the following: the current project consists of a whole web service, using an ORM and an RDBMS, plus a client side that more or less maps the database objects to client-side view objects. So it's really necessary to preserve the names of the Ids / PKs as much as possible.
I decided that it would be best to let MongoDB use the ObjectIds internally (for CRUD operations), as they don't cause performance overhead, and to use the custom GUIDs so they remain compatible with the rest of my code. This way, minimal changes have to be made, MongoDB is happy and I am happy: externally, I can keep querying results by my GUID PKs, which will always be unique. Since my PK GUIDs are stored in MongoDB as unique strings, I don't think I have to worry about GUID overhead on the server side: the GUIDs are created by my C# application.
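To enforce that uniqueness, the custom GUID field needs a unique index declared explicitly. A minimal sketch using the 2.x C# driver (the connection string, database and collection names are only placeholders):

using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("mydb");           // hypothetical database name
var users = database.GetCollection<User>("users");   // hypothetical collection name

// _id keeps its built-in unique index; this adds a second, unique index on the custom GUID.
var keys = Builders<User>.IndexKeys.Ascending(u => u.PK_ID_User);
var options = new CreateIndexOptions { Unique = true };
users.Indexes.CreateOne(new CreateIndexModel<User>(keys, options));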
However, I have my doubts about performance: I now always have a minimum of 2 indexes per collection, and have no idea how costly that is in terms of performance.
Is there a better approach for my problem, or should I stick to my current solution?
Kind regards.

I now always have a minimum of 2 indexes per collection, and have no idea how costly that is in terms of performance.
Indexes cost performance for inserts and updates, and you posted no info on the frequency of write operations or your setup. It'll be impossible to give a definite answer without a measurement.
Then again, given that you're building a web application, I'd say the sheer network delay to your clients will be several orders of magnitude higher than the difference between, say, 1, 2 or 3 indexes, since all these operations mostly hit RAM.
What's costly is writing to disk, not restructuring the B-tree in memory. Of course, having more and more indexes increases the likelihood of an insert triggering a costly restructuring of an index tree that has to hit the disk, but that also depends on the structure of the keys themselves.
If anything, I'd worry about poor cache coherence and the time-locality of GUIDs: if your data were very time-local (like logs), a GUID might hurt (high jitter at the beginning of the string), because updates would be more likely to rearrange entire sub-trees and a typical time-range query would grab items scattered throughout the tree. But since this appears to be about users and user profiles, such a query would probably not make much sense anyway.

Related

Does LINQ update whole object in db if only one column is changed?

If an object has more than one column and the program updates only one of them, does LINQ update all columns in the database regardless of whether they changed, or only the changed column(s)?
Example class:
class MyObject
{
    int ID { get; set; }
    string Field1 { get; set; }
    string Field2 { get; set; }
    string Field3 { get; set; }
    string Field4 { get; set; }
    string Field5 { get; set; }
}
Now I grab a record from the db and change only one field:
var myObject =
    (
        from x in db.TableName
        where x.ID == 12345
        select x
    )
    .Single();

myObject.Field1 = "something";
db.SubmitChanges();
Does the generated SQL perform an update on all columns or only the Field1 column?
It's not that granular. (Nor should it be, as column-level tracking would introduce a whole host of complexity into concurrency tracking, which is already a difficult and compromise-ridden subject.)
When you're using an ORM (such as Linq to SQL, Entity Framework, etc.) the focus is on the object. The framework maps that object (the entire graph of related objects, actually) into the relational database schema. But what you're updating when you commit your changes to an ORM for persistence is the object graph.
The ORM will track which objects changed, and will make its concurrency checks at the object level in accordance with the mapping logic and concurrency rules set forth. But it's going to compile SQL update statements for each record in its entirety as it corresponds to the object.
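If you want to see exactly what your ORM sends to the database for a given change, LINQ to SQL can echo the generated SQL to any TextWriter. A small sketch, reusing the DataContext and table from the question:

using System;
using System.Linq;

// db is the question's DataContext; attaching a log writer prints every statement it emits.
db.Log = Console.Out;

var myObject = db.TableName.Single(x => x.ID == 12345);
myObject.Field1 = "something";
db.SubmitChanges(); // the generated UPDATE (and its column list) appears on the console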
The state of the object is what changed, and so the object is persisted in its new state as a whole. While it's certainly possible to track changes at the column level, the return on investment for that just doesn't exist. The code would be vastly more complex, which means:
It would be far more difficult to support
It would be far more prone to errors
It would run much slower
It would be far more difficult to understand and predict its behavior
Not to mention, of course, lots of new confusion in concurrency tracking. (Suppose User A updates the phone number for Record X, and User B simultaneously updates the address for Record X. How would you propose those changes be merged automatically? I'm sure you can imagine much, much more complex examples from there.)
The trade-off just doesn't add up in this case. When using an ORM, you're updating an object. The persistence model is abstracted (and pretty well optimized as it stands anyway).
For transactional systems, this is ideal. When committing a unit of work for a transactional system, in the vast majority of cases you're starting with an aggregate root (or a small number of aggregate roots) and updating the graph of objects beneath them. The relational graph is the more important piece in this scenario, and that's what the ORM is meant to handle.
For making mass-updates to targeted columns, you're no longer talking about units of work in a transactional system. At this point you're talking about directly interacting with the table data for data manipulation, data migration perhaps, even some business intelligence tasks. This is a whole different toolset, outside the scope of what ORMs provide.
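For that kind of targeted, set-based change you would typically bypass the object graph entirely, for example with a hand-written statement. A minimal sketch, assuming the same DataContext (ExecuteCommand simply passes parameterized SQL through to the server):

// Updates a single column across many rows without materializing any objects.
db.ExecuteCommand(
    "UPDATE TableName SET Field1 = {0} WHERE Field2 = {1}",
    "something", "some filter value");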

Creating a Hierarchy of Nested Entities in Entity Framework

I am trying to create a hierarchical representation in Entity Framework, and I can't seem to find much on the subject after searching around.
Premise: I am working on a backlink monitoring tool where I can paste in a bunch of URLs to see if they point to a specific domain. If so, I want to remove them from the list and store them as top-level (Tier 1) backlinks. After locating and removing all of the backlinks that link directly to the URL, I want to run through the remaining backlinks in the list to see if they point to any of the URLs in the newly-created top-level backlink list, and for the ones that point to the top-level backlinks, store them as Tier 2 backlinks. Then search for Tier 3 backlinks, and so on until the entire list has been checked.
I have a Website entity that contains the Url that is to be used for the first run through the list of imported backlinks. Those that are found are moved to a list, and their URLs are used when looping through the 2nd time, and so on.
I originally created a separate property in the Website entity for each "Tier" of links, but that doesn't seem to be very efficient because when trying to render the hierarchy, the code has to loop through each Tier and re-match the urls from the tiers below to recreate the actual linking structure.
End goal sample:
So I instead believe I should create a single "Backlink" model and have each backlink entity store a list of the backlinks below it; then, when trying to view the backlink hierarchy, just loop through the top-level backlinks and then through each sub-backlink entity.
A sample of the backlink entity is as follows:
public class Backlink
{
    public int BacklinkID { get; set; }
    public string Url { get; set; }
    public string AnchorText { get; set; }
    public string LinksTo { get; set; }
    public int PageAuthority { get; set; }
    public int PageRank { get; set; }

    public virtual ICollection<Backlink> Backlinks { get; set; }
}
I have written the code that actually goes through and checks each backlink's HTML to find if the backlink points to each specific URL, so now I'm trying to figure out the best way to store the results.
Is creating an entity that stores a list of entities of its own type a smart approach, or am I going about this all wrong? Will doing it this way hurt performance when querying the database?
Ideally I would like to use lazy loading and show only the top-tier backlinks at first, then, when clicking on a specific backlink, have EF make another call to go and fetch the sub-backlinks, and so on. So would this storage approach with lazy loading be smart, or should I scrap that idea and figure out a totally different schema for this?
I'm not great with EF yet, so any insights on the best approach would be greatly appreciated.
What you are trying to implement is called an Adjacency List. It seems that just adding an ICollection<Backlink> Backlinks collection is OK (of course, a proper model configuration is required). However, an adjacency list by itself is not great for performance, particularly in the typical EF implementation (exactly like the one you suggested). There are two options:
Like you suggested, load links level-by-level on demand (see the sketch after this list). In this case the model you selected actually works fine (each level is a very simple SELECT, as @Danexxtone mentioned). However, you will have a lot of round trips to the app server / DB, and hence probably a poorer user experience.
You may want to load the whole tree in order to show nodes to the user without any delay. Doing this through EF means recursing over the navigation collections, which is really the worst idea: far too many requests to the DB.
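A minimal sketch of the first option using EF6 explicit loading (the context variable, DbSet name and Tier 1 filter are placeholders for illustration):

using System.Data.Entity;
using System.Linq;

// Load only the top-tier backlinks first.
var topTier = context.Backlinks
    .Where(b => b.LinksTo == website.Url)   // hypothetical Tier 1 filter
    .ToList();

// When the user expands a node, pull in just that node's children.
context.Entry(selectedBacklink)
    .Collection(b => b.Backlinks)
    .Load();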
It seems that EF doesn't have more options. But, you can use plain SQL (through EF data context, by the way)... And there are much more interesting approaches:
CTE (like @Jon mentioned). It works over the adjacency list without any additional changes to the DB structure. Not a bad option, but not the best.
Tree path column. Let's number the root of the hierarchy "1", the level 1 links "2", "3", "4", and a level 2 link "5". Each node in the tree (each link) then gets a unique string path like "1/2/5/". Just add one more "Path" column to the DB and you will be able to extract a whole sub-tree using a simple LIKE expression (or even .StartsWith in EF); see the sketch after this list.
I assume you're using an MS SQL Server DB. Then you have an even better option: the hierarchyid data type. It's not supported by EF, but it provides all the "tree path" functionality out of the box.
I wrote that CTE is not the best option. That's because of performance: queries using the string tree path are much more efficient (don't forget about indexes). hierarchyid performs a little better than a tree path column, but its real advantage is the built-in API for tree manipulations.
One more interesting approach is Nested Sets. However, I wouldn't recommend it: the overhead of inserting new nodes is huge and it's not easy to code.
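A rough sketch of the tree path idea, assuming an extra Path column is added to the Backlink entity (all names here are illustrative):

public class Backlink
{
    public int BacklinkID { get; set; }
    public string Url { get; set; }
    // Materialized path of ancestor IDs, e.g. "1/2/5/".
    public string Path { get; set; }
}

// Fetch an entire sub-tree in one query; EF translates StartsWith into LIKE 'prefix%'.
var subTree = context.Backlinks
    .Where(b => b.Path.StartsWith(parent.Path))
    .ToList();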
Conclusion
If you are comfortable with SQL itself and with running plain SQL through EF, the best option could be hierarchyid.
If you want to code using only EF, an adjacency list is the only option. Just do not retrieve deep sub-trees by recursively traversing navigation collections; that may really hurt.

put just the key type (id) or the class type for FK in entities, pros and cons

I've seen 2 types of entities, like this:
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
    public Country Country { get; set; }
}
and like this:
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int CountryId { get; set; }
}
I think the 2nd approach is more lightweight, and you get related data only if you need it;
which one do you think is better?
It depends what you want. If you only want to get the Country's ID then go for the second option. If you actually want to make use of navigation properties and/or lazy loading, then go for the first option.
Personally, I use Entity Framework and combine options one and two:
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int CountryId { get; set; }
    public Country Country { get; set; }
}
So I have a choice when it comes to returning data from my repositories. This also means that when I come to save, I can just populate the actual value type properties, instead of having to load the country object and assign it to the person.
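If you take this combined approach with EF Code First, the two properties can be tied together explicitly so EF knows CountryId is the foreign key behind Country. A rough sketch with the EF6 fluent API (the context class name is just an example):

using System.Data.Entity;

public class MyContext : DbContext   // hypothetical context
{
    public DbSet<Person> People { get; set; }
    public DbSet<Country> Countries { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Link the scalar FK property to the navigation property.
        modelBuilder.Entity<Person>()
            .HasRequired(p => p.Country)
            .WithMany()
            .HasForeignKey(p => p.CountryId);
    }
}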
Taken at face value, the first is an example of a rich domain model, and the second is a data driven approach. Allowing rich domain models is one of the main benefits of ORM.
The only reason I would include the CountryId (either in place of the Country, or in addition to it) would be to optimize for some very specific performance problem. Even then I would think twice, and optimization is something you shouldn't be thinking about too much at the initial design stage. What's wrong with Person.Country.Id? (Assuming you need the id at all, and it's not just infrastructure.)
If you are looking at this from any other angle than performance optimisation, then you are probably taking the wrong approach by including 'foreign keys' in your domain model. I had the same problem when first using NHibernate, coming from an ADO type background. I would almost certainly go with the first example.
There are two considerations, Platforms and Traffic, outlined below...
All in Microsoft Platform
In multi-tier solutions where the end client is Silverlight and you are going to share your generated code via RIA Services, or you have a WPF client with WCF RIA Services, the first solution gives you a better design.
Non-Microsoft end client
If your end client is a non-Microsoft client like Flex/Flash, Java or any AJAX-based smart client, then the first model will not be as useful, because it needs to track itself (self-tracking objects). The second model is preferred here.
Low Traffic applications
If network traffic is not much of an issue and the design of your software matters more, or you have highly scalable middle tiers for caching (AppFabric and the like), the first solution is a good one and will give you a better design.
High Traffic applications
The first model will serialize more data than necessary, and that can be a real performance issue in high-traffic applications. In that case the second model will work better, because the referenced data is loaded only when the user actually requests it.
This is a trade-off between "better design" and "better performance", and it needs to be decided based on the parameters above; there can be more, depending on the complexity of the project, team size, documentation and so on.
Good question! For me
public List<Person> GetPersonsLivingIn(int countryId) {
    return ObjectContext.Persons.Where(x => x.CountryId == countryId).ToList();
}
just looks like it works that way without knowing about all the magic (leaky) abstractions that may be present in the ORM that would make x => x.Country == country work. I came from Linq2Sql where I had some problems with the first one when passing around objects created in different object contexts.
But I would do as GenericTypeTea said and include both the id and the navigation property. After all, you'll want a navigable object graph at some point. And that way you can still make
public List<Person> GetPersonsLivingIn(Country country) {
    return ObjectContext.Persons.Where(x => x.CountryId == country.CountryId).ToList();
}
which has a more OO feeling interface, but still looks like it would work without magic.
Except in some weird edge cases, there are no good reasons for the second design.
They are both equally lightweight (references are lazily loaded by default), but the second one doesn't give you navigational capabilities, which restricts and complicates your queries.
STOP!
In NHibernate, there is NO need to specify the foreign key in your domain model, not even for performance reasons.
Assuming you have lazy loading enabled (it's enabled by default), calling:
int countryId = person.Country.Id;
...won't incur a database hit to retrieve the Country entity. NHibernate will return a dynamic proxy of your Country, not the actual Country. Because of the proxy, a database hit will only occur on first access to a property of the Country entity, but NHibernate is smart enough to realise that person.Country.Id is the same as accessing the Country ID foreign key in your Person table, which gets loaded anyway.
However, the following code:
string countryName = person.Country.Name;
...will hit the database; the call to the Name property will load the entire Country instance.
This behavior assumes you have set-up your mapping like so:
<many-to-one name="Country" class="Country" column="Country_ID" lazy="proxy" />
(note that lazy="proxy" is the default).
Simply put, there is no need to map foreign keys in your domain model with NHibernate.

DDD Value Object: How to persist entity objects without tons of SQL joins?

Obviously I am wrong in saying that DDD is similar to EAV/CR in usefulness, but the only difference I see so far is physical tables built for each entity with lots of joins, rather than three tables and lots of joins.
This must be due to my lack of DDD understanding. How do you physically store these objects to the database without significant joins and complication when importing data? I know you can simply create objects that feed into your repository, but it's difficult to train tools like Microsoft SQL Server Integration Services to use your custom C# objects and framework. Maybe that should be my question: how do you use your DDD ASP.NET C# framework with Microsoft SQL Server Integration Services and Reporting Services? LOL.
In an EAV/CR database, we can set up a single Person table with different classes depending on the type of person: Vendor, Customer, Buyer, Representative, Company, Janitor, etc. Three tables, a few joins, attributes are always strings with validation before we insert, just like ModelValidation in MVC where the object accepts any value but won't persist until it's valid.
In a standard relational model, we used to create a table for each type of entity, mixing in redundant data types like City.
Using Domain Driven Design, we use objects to represent each type of entity, nested objects for each type of ValueObject, and more nested objects still as necessary. In my understanding, this results in a table for each kind of entity and a table for each kind of information set (value object). With all these tables, I see a lot of joins. We also end up creating a physical table for each new contact type. Obviously there is a better way, so I must be incorrect in how I persist objects to a database.
My Vendor looks like this:
public class Vendor {
    public int vendorID { get; set; }
    public Address vAddress { get; set; }
    public Representative vRep { get; set; }
    public Buyer vBuyer { get; set; }
}
My Buyer:
public class Buyer {
    public int buyerID { get; set; }
    public Address bAddress { get; set; }
    public Email bEmail { get; set; }
    public Phone bPhone { get; set; }
    public Phone bFax { get; set; }
}
Do we really reference things like Vendor.vBuyer.bPhone.pAreaCode? I would think we would reference and store Vendor.BuyerPhoneNumber, and build the objects almost like aliases to these parts: Vendor.Address1, Vendor.Address2, Vendor.BuyerPhoneNumber ... etc.
The real answer is to match your SQL normalization strategy to your objects. If you have lots of duplicate addresses and you need to associate them together, then normalize the data to a separate table, thus creating the need for the value object.
You could serialize your objects to XML and save them to an xml column in your SQL Server database. After all, you are trying to represent a hierarchical data structure, and that's where XML excels.
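A rough sketch of that idea, assuming vendor is an instance of the Vendor class above and a Vendors table with an xml column named Data (the table, column and connection string are only placeholders):

using System.Data.SqlClient;
using System.IO;
using System.Xml.Serialization;

// Serialize the whole aggregate to an XML string.
var serializer = new XmlSerializer(typeof(Vendor));
string xml;
using (var writer = new StringWriter())
{
    serializer.Serialize(writer, vendor);
    xml = writer.ToString();
}

// Store it in the xml column; SQL Server can still query and index inside it.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "INSERT INTO Vendors (VendorID, Data) VALUES (@id, @data)", connection))
{
    command.Parameters.AddWithValue("@id", vendor.vendorID);
    command.Parameters.AddWithValue("@data", xml);
    connection.Open();
    command.ExecuteNonQuery();
}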
Domain-driven design proponents often recommend keeping the data model as close to the object model as possible, but it isn't an ironclad rule.
You can still use an EAV/CR database design if you create mappings in your object-relational mapping layer to transform (project) your data into your objects.
Deciding how to design your objects and provide access to child values is really a separate question that you have to address on a case-by-case basis. Vendor.BuyerPhoneNumber or Vendor.vBuyer.bPhone.pAreaCode? The answer always depends, because it's rooted in your specific requirements.
One of the best ways to store Domain Objects is actually a document database. It works beautifully because the transactional boundary of the document matches perfectly the consistency boundary of the Aggregate Root. You don't have to worry about JOINs, or eager/lazy loading issues. That's not strictly necessary, though, if you apply CQRS (which I've written about below).
The downside is often with querying. If you wish to query directly the persisted data behind your Domain Objects, you can get into knots. However that is a complexity that CQRS aims to solve for you, where you have different parts of your application doing the queries than the parts loading/validating/storing Domain Objects.
You might have a complex "Command" implementation that loads Domain Objects, invokes the behaviour on them (remembering that Domain Objects must have behaviour and encapsulate their data, or else risk becoming "anaemic"), before saving them and optionally even publishing events about what happened.
You might then use those events to update some other "read store", though you don't have to. The point is you have a totally different vertical slice implemented in your application that doesn't have to bother with that complex object model/ORM business, and instead goes straight to the data, loads exactly what it needs and returns it for displaying to the user.
CQRS is not hard, and is not complex. All it does is instruct you to have separate code for:
Handling commands (which mutate state and therefore need the business rules/invariants involved).
Executing queries (which do not mutate state and therefore don't need all the complex business rules/invariants involved, and so can "just" go get the data in an efficient way).
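A very small sketch of that split; every name below (repository, command handler, view) is invented purely for illustration:

using System.Collections.Generic;
using System.Data.SqlClient;

// Command side: loads the domain object, invokes its behaviour, persists it.
public class ChangeVendorAddressHandler
{
    private readonly IVendorRepository _vendors;   // hypothetical repository abstraction

    public ChangeVendorAddressHandler(IVendorRepository vendors) { _vendors = vendors; }

    public void Handle(int vendorId, Address newAddress)
    {
        var vendor = _vendors.Get(vendorId);
        vendor.ChangeAddress(newAddress);          // behaviour lives on the aggregate
        _vendors.Save(vendor);
    }
}

// Query side: skips the domain model and reads a flat view straight from the database.
public class VendorNameQuery
{
    private readonly string _connectionString;

    public VendorNameQuery(string connectionString) { _connectionString = connectionString; }

    public IEnumerable<string> Execute()
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("SELECT Name FROM Vendors", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    yield return reader.GetString(0);
            }
        }
    }
}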

Could you use Lucene as an OODB?

Given that Lucene is a robust document-based search engine, could it be used as an object database for simple applications (e.g., CMS-style applications), and if so, what do you see as the benefits and limitations?
I understand the role of the RDBMS (and use them on a daily basis) but wanted to explore other technologies/ideas.
For example say my domain entities are like:
[Serializable]
public class Employee
{
    public string FirstName { get; set; }
    public string Surname { get; set; }
}
Could I use reflection and store the property values of the Employee object as fields in a Lucene document, plus store a binary serialized version of the Employee object into another field in the same Lucene document?
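Something like the following is what I have in mind for the reflection part, a sketch only, assuming Lucene.NET 3.x (the index path is arbitrary and the binary-blob field is left out):

using System.IO;
using System.Reflection;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

var employee = new Employee { FirstName = "Jane", Surname = "Doe" };

// Reflect over the public properties and store each value as a field.
var doc = new Document();
foreach (PropertyInfo prop in typeof(Employee).GetProperties())
{
    var value = (string)prop.GetValue(employee, null) ?? string.Empty;
    doc.Add(new Field(prop.Name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
}

using (var directory = FSDirectory.Open(new DirectoryInfo("employee-index")))
using (var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30),
                                    IndexWriter.MaxFieldLength.UNLIMITED))
{
    writer.AddDocument(doc);
}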
No. Trying to use Lucene as an effective OODB (object-oriented database) is going to be like trying to fit a square peg into a round hole. They're really two completely different beasts.
Lucene is good at building a text index over a set of documents, not at storing objects (in the programming sense). Maybe you misunderstand what an object-oriented database is. You can check out the definition at Wikipedia:
Object Databases
Object-oriented databases have their place. If you truly have an application that would benefit from an OODB, I would suggest checking out something like InterSystems Caché.
