CQRS and primary key: guid or not?

For my project, which is a potentially big web site, I have chosen to separate the command interface from the query interface. As a result, submitting a command is a one-way operation that doesn't return a result. This means that the client has to provide the key, for example:
service.SubmitCommand(new AddUserCommand() { UserId = key, ... });
Obviously I can't use a database-generated int for the primary key, so a Guid is a logical choice - except that I read everywhere about the performance impact it has, which scares me :)
But then I also read about COMB Guids, and how they provide the advantages of Guids while still performing well. I also found an implementation here: Sequential GUID in Linq-to-Sql?.
So before I make this important decision: does someone have experience with this matter, or advice?
Thanks a lot!
Lud

First of all, I use sequential GUIDs as a primary key and I don't have any problems with performance.
Most benchmarks of sequential GUID vs. INT primary keys use batch inserts and select data from an idle database. But in real life, selects and updates happen at the SAME time.
As you are applying CQRS, you will not have batch inserts, and the overhead of opening and closing transactions will take much more time than a single write query. Because you have a separate read store, your select operations on a table with a GUID PK will be much faster than they would be on a table with an INT PK in a unified store.
Besides, the asynchrony that messaging gives you allows your application to scale much better than systems with blocking RPC calls.
In light of the above, agonizing over GUIDs vs. INTs seems penny-wise and pound-foolish to me.

You didn't specify which database engine you are using, but since you mentioned LINQ to SQL, I guess it's MS SQL Server.
If yes, then Kimberly Tripp has some advice about that:
Disk space is cheap...
GUIDs as PRIMARY KEYs and/or the clustering key
To summarize the two links in a few words:
sequential GUIDs perform better than random GUIDs, but still worse than numeric autoincrement keys
it's very important to choose the right clustered index for your table, especially when your primary key is a GUID

Instead of supplying a Guid to a command (which is probably meaningless to the domain), you probably already have a natural key like username which serves to uniquely identify the user. This natural key makes a lot more sense for the user commands:
When you create a user, you know the username because you submitted it as part of the command.
When you're logging in, you know the username because the user submitted it as part of the login command.
If you index the username column properly, you may not need the GUID. The best way to verify this is to run a test - insert a million user records and see how CreateUser and Login perform. If you really do see a serious performance hit that you have verified adversely affects the business and can't be solved by caching, then add a Guid.
If you're doing DDD, you'll want to focus hard on keeping the domain clean so the code is easy to understand and reflects the actual business processes. Introducing an artificial key is contrary to that goal, but if you're sure that it provides actual value to the business, then go ahead.

Related

Should I expose primary keys in ASP.NET MVC views

In my web application I use primary keys to generate hyperlinks in order to navigate to different pages:
<td>
@Html.ActionLink("Edit", "Edit", new { id = item.Id }) |
@Html.ActionLink("Details", "Details", new { id = item.Id }) |
@Html.ActionLink("Delete", "Delete", new { id = item.Id })
</td>
I was wondering if this code is a security concern. Is it advisable to expose primary keys in ASP.NET MVC views? If this is the case what are the alternatives? Should I encrypt the IDs in my viewmodels or should I create a mapping table between public and private keys?
I appreciate your advice

Gone are the days when someone who saw your primary or surrogate keys could use them to hack the database. SQL injection and backdoor attacks have subsided.
I disagree with the stance that exposing primary keys is a problem. It can be a problem if you make them visible to users because they are given meaning outside the system, which is usually what you're trying to avoid.
However to use IDs as the value for combo box list items? Go for it I say. What's the point in doing a translation to and from some intermediate value? You may not have a unique key to use. Such a translation introduces more potential for bugs.
Just don't neglect security.
If, say, you present the user with 6 items (ID 1 to 6), never assume you'll only get those values back from the user. Someone could try to breach security by sending back ID 7, so you still have to verify that what you get back is allowed.
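A minimal sketch of that server-side check. The `ItemAccess` class and the `allowedIds` set are hypothetical stand-ins for whatever authorization lookup your application really uses; the point is only that the submitted ID is checked against what this user may touch, not against what was rendered.

```csharp
using System;
using System.Collections.Generic;

public static class ItemAccess
{
    // Returns true only if the submitted ID is one the current user may act on.
    // allowedIds is a hypothetical stand-in for a real authorization lookup.
    public static bool IsAllowed(int submittedId, ISet<int> allowedIds)
    {
        if (allowedIds == null) throw new ArgumentNullException("allowedIds");
        return allowedIds.Contains(submittedId);
    }
}
```

With items 1 to 6 allowed, `ItemAccess.IsAllowed(7, allowed)` returns false, so the request can be rejected before any data is touched.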
But avoiding that entirely? No way. No need.
As a comment on another answer says, look at the URL here. That includes what no doubt is the primary key for the question in the SO database. It's entirely fine to expose keys for technical uses.
Also, if you do use some surrogate value instead, that's not necessarily more secure.

Generally, there is no point in encrypting an item ID, because in most business domains it is not considered confidential information. Unless your domain specifically requires keeping IDs private, don't do this. Keep it simple, stupid.
There is no security concern associated with this.

What exactly do you mean by "primary key" in this context? That term is a database term. In your browser, it's just an identifier. What difference does it make if that identifier is stored in a column with a primary key constraint on it, or a column with a unique index on it, with or without some reversible transformation on its value before storing?
There is obviously no direct risk in exposing an identifier.
But there are risks associated with identifiers, that may have to be mitigated.
For example, you must ensure that knowledge of the identifier does not imply full access to the identified resource. You do that by properly authenticating and authorizing all resource access. (Update: some other answers have suggested that you may do that by making identifiers hard to guess, e.g. through encryption or signing. That is nonsense of course. You protect a resource by protecting it, not by trying to hide it.)
In some cases, the value of an identifier may carry information that you do not want to expose. For example, if you number your "orders" sequentially, and a user sees they have order number 17, they know how many orders you have received in the past. That may be competitive information that you do not want to expose. Also, if identifiers are sequential, they contain information about when the identifier was created, relative to other identifiers. That may be confidential as well.
So the question is not really "can I expose identifiers", but rather "how should I generate identifiers in such a way that no confidential information is exposed through them".
Well, if the number of identified resources is not confidential, just use a sequence (e.g. as generated by an identity column). If you want the identifier to be meaningless, use a cryptographic random number generator to generate them.
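As a rough illustration of the second option, here is a sketch that draws 128 bits from a cryptographic random number generator and hex-encodes them. The class name `OpaqueId` and the 128-bit length are assumptions made for the example, not something the answer prescribes.

```csharp
using System;
using System.Security.Cryptography;

public static class OpaqueId
{
    // Generates a 128-bit identifier from a cryptographic RNG and
    // returns it as a 32-character lowercase hex string.
    public static string New()
    {
        byte[] bytes = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        // BitConverter yields "AB-CD-..."; strip the dashes and lowercase it.
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}
```

An identifier produced this way reveals neither how many resources exist nor when it was created.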

There is no issue with exposing an item's ID in web applications.
If your AJAX CRUD requests take an ID parameter and act on it, a user can easily issue many such requests from Firebug. As long as you don't grant a guest user too many permissions, this is not a big problem.
Exposing the ID is not in itself a security matter here. Just make sure all of your code is safe from XSS.
Exposing the primary ID makes it easier for people to remember a URL, or to edit it and go to the next item or page. The only thing you need to take care of is to always check security (XSS, for this question).

I believe there is no risk in exposing primary keys to the public; I think you should pay attention to where vulnerabilities start. As long as your generated URLs are tamper-proof and you can be certain that a given URL was generated by your application, with no man in the middle, all goes smoothly. To do that, I always use a hash-style mechanism and add an extra parameter to my URLs, computed from the primary key and something else, to check for tampering.
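One way such a tamper check could look, as a sketch only: the answer does not say which mechanism it uses, so HMAC-SHA256 over the ID with a server-side secret is an assumption here, and `UrlSigner`, `Sign` and `Verify` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class UrlSigner
{
    // Computes a hex-encoded HMAC-SHA256 tag over the ID using a server-side secret.
    public static string Sign(int id, byte[] secretKey)
    {
        using (var hmac = new HMACSHA256(secretKey))
        {
            byte[] data = Encoding.UTF8.GetBytes(id.ToString(CultureInfo.InvariantCulture));
            return BitConverter.ToString(hmac.ComputeHash(data)).Replace("-", "");
        }
    }

    // Recomputes the tag and compares it without exiting early on a mismatch.
    public static bool Verify(int id, string tag, byte[] secretKey)
    {
        string expected = Sign(id, secretKey);
        if (tag == null || tag.Length != expected.Length) return false;
        int diff = 0;
        for (int i = 0; i < expected.Length; i++)
        {
            diff |= expected[i] ^ tag[i];
        }
        return diff == 0;
    }
}
```

The tag would travel as the extra URL parameter next to the ID. A valid tag only shows that the link was issued by the application; it does not say whether the current user may access the item.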

Is there any better option for GUID creation than System.Guid.NewGuid() in .net

In my application code I am generating GUIDs using System.Guid.NewGuid() and saving them to a SQL Server DB.
I have a few questions regarding GUID generation:
When I ran the program I did not find any performance problem with this, but I still wanted to know whether there is a better way to generate GUIDs.
Is System.Guid.NewGuid() the only way to create a GUID in .NET code?

The GUIDs generated by Guid.NewGuid are not sequential according to SQL Server's sort order. This means you are inserting randomly into your indexes, which is a disaster for performance. It might not matter if the write volume is small enough.
You can use SQL Server's NEWSEQUENTIALID() function (as a column default) to create sequential ones, or just use an int.

One alternative way to generate guids (I presume as your PK) is to set the column in the table up like this:
create table MyTable(
MyTableID uniqueidentifier not null default (newid()),
...
Implementing like this means that you've the choice whether or not to set them in .Net or to let SQL do it.
I wouldn't say either is going to be "better" or "quicker" though.

To answer the question:
Is there any better option for GUID creation than
System.Guid.NewGuid() in .net
I would venture to say that System.Guid.NewGuid() is the preferred choice.
But for the follow up question:
...saving this to SQL server DB.
The answer is less clear. This has been discussed on the web for a long time. Just Google "guid as primary key" and you'll have hours of reading to do.
Usually when you use a Guid in SQL Server, it is as the primary key of a table. This has some nice advantages:
It's easy to generate new values without accessing the database
You can be reasonably sure that your locally generated Guid will NOT cause a primary key collision
But there are significant drawbacks as well:
If the primary key is also the clustered index, inserting large numbers of new rows will cause a lot of IO (disk operations) and index updates.
The Guid is quite large compared to the other popular alternative for a surrogate key, the int. Since all other indexes on the table contain the clustered index key, they will grow much faster if you have a Guid vs an int.
Which will also cause more IO since those indexes will require more memory
To mitigate the IO issue, SQL Server 2005 introduced the NEWSEQUENTIALID() function, which can be used to generate sequential Guids when inserting new rows. But if you are going to use that, you will have to be in contact with the database to generate one, so you lose the ability to generate one when offline. In that situation you could still generate a normal Guid and use that.
There are also many articles on the web about how to roll your own sequential Guids. One sample:
http://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-under-multiple-database
I have not tested any of them so I can't vouch for how good they are. I chose that specific sample because it contains some information that might be interesting. Specifically:
It gets even more complicated, because one eccentricity of Microsoft
SQL Server is that it orders GUID values according to the least
significant six bytes (i.e. the last six bytes of the Data4 block).
So, if we want to create a sequential GUID for use with SQL Server, we
have to put the sequential portion at the end. Most other database
systems will want it at the beginning.
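To make the quoted point concrete, here is a minimal sketch of a COMB-style generator for SQL Server. It is not the linked article's implementation: the `CombGuid` name and the 48-bit Unix-millisecond timestamp are assumptions chosen for the example. It keeps the first ten bytes random and writes the timestamp, most significant byte first, into the last six.

```csharp
using System;

public static class CombGuid
{
    // Builds a GUID whose last six bytes hold a big-endian millisecond
    // timestamp; the first ten bytes stay random.
    public static Guid Create(long unixMillis)
    {
        byte[] bytes = Guid.NewGuid().ToByteArray();
        for (int i = 0; i < 6; i++)
        {
            // Byte 10 receives the most significant of the 48 timestamp bits.
            bytes[10 + i] = (byte)(unixMillis >> (8 * (5 - i)));
        }
        return new Guid(bytes);
    }

    // Convenience overload that stamps the GUID with the current UTC time.
    public static Guid Create()
    {
        return Create(DateTimeOffset.UtcNow.ToUnixTimeMilliseconds());
    }
}
```

Values created in different milliseconds then sort in creation order under SQL Server's comparison; values created within the same millisecond fall back to the random bytes, so their relative order is arbitrary.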
EDIT: Since the issue seems to be about inserting large amounts of data using bulk copy, a sequential Guid will probably be needed. If it's not necessary to know the Guid value before inserting then the answer by Jon Egerton would be one good way to solve the issue. If you need to know the Guid value beforehand you will either have to generate sequential Guids to use when inserting or use a workaround.
One possible workaround could be to change the table to use a seeded INT as primary key (and clustered index), and have the Guid value as a separate column with a unique index. When inserting, you supply the Guid while the seeded int acts as the clustered index key. The rows will then be inserted sequentially, and your generated Guid can still be used as an alternate key for fetching records later. I have no idea if this is a feasible solution for you, but it's at least one possible workaround.

NewGuid would be the generally recommended way - unless you need sequential values, in which case you can P/Invoke the rpcrt4 function UuidCreateSequential:
Private Declare Function UuidCreateSequential Lib "rpcrt4.dll" (ByRef id As Guid) As Integer
(Sorry, nicked from VB, sure you can convert to C# or other .NET languages as required).

ASP.NET Custom Membership Provider for very large application

I have to come up with a membership solution for a very large website. The site will be built using ASP.NET MVC 2 and a MS SQL2008 database.
The current Membership provider seems like a BIG overkill, there's way too much functionality.
All I want to store is email/password and basic profile information such as First/LastName, Phone number. I will only ever need 2 roles, administrators & users.
What are your recommendations on this type of scenario, considering there might be millions of users registered? What does StackOverflow use?
I've used the existing Membership API a lot in the past and have extended it to store additional information etc. But there's tables such as
aspnet_Applications
aspnet_Paths
aspnet_SchemaVersions
aspnet_WebEvent_Events
aspnet_PersonalizationAllUsers
aspnet_PersonalizationPerUser
which are extremely redundant and I've never found use for.
Edit
Just to clarify a few other redundancies after @drachenstern's answer, there are also extra columns which I have no use for in the Membership/Users table, but which would add to the payload of each select/insert statement.
MobilePIN
PasswordQuestion/PasswordAnswer (I'll do email based password recovery)
IsApproved (user will always be approved)
Comment
MobileAlias
Username/LoweredUsername (or Email/LoweredEmail) [email IS the username so only need 1 of these]
Furthermore, I've heard that GUIDs aren't all that fast, and I would prefer to have integers instead (like Facebook does), which would also be publicly exposed.
How do I go about creating my own Membership Provider, re-using some of the Membership APIs (validation, password encryption, login cookie, etc) but only with tables that meet my requirements?
Links to articles and existing implementations are most welcome, my Google searches have returned some very basic examples.
Thanks in advance
Marko

@Marko I can certainly understand that the standard membership system may contain more functionality than you need, but the truth is that it really isn't going to matter. There are parts of the membership system that you aren't going to use just like there are parts of .Net that you aren't going to use. There are plenty of things that .Net can do that you are never, ever going to use, but you aren't going to go through .Net and strip out that functionality, are you? Of course not. You have to focus on the things that are important to what you are trying to accomplish and work from there. Don't get caught up in the paralysis of analysis. You will waste your time, spin your wheels and not end up with anything better than what has already been created for you. Now Microsoft does get it wrong sometimes, but they do get a lot of things right. You don't have to embrace everything they do to accomplish your goals - you just have to understand what is important for your needs.
As for the Guids and ints as primary keys, let me explain something. There is a crucial difference between a primary key and a clustered index. You can add a primary key AND a clustered index on columns that aren't a part of the primary key! That means that if it is more important to have your data arranged by a name (or whatever), you can customize your clustered index to reflect exactly what you need without it affecting your primary key. Let me say it another way - a primary key and a clustered index are NOT one and the same. I wrote a blog post about how to add a clustered index and then a primary key to your tables. The clustered index will physically order the table rows the way you need them to be and the primary key will enforce the integrity that you need. Have a look at my blog post to see exactly how you can do it.
Here is the link - http://iamdotnetcrazy.blogspot.com/2010/09/primary-keys-do-not-or-should-not-equal.html.
It is really simple, you just add the clustered index FIRST and then add the primary key SECOND. It must be done in that order or you won't be able to do it. This assumes, of course, that you are using Sql Server. Most people don't realize this because SQL Server will create a clustered index on your primary key by default, but all you have to do is add the clustered index first and then add the primary key and you will be good to go. Using ints as a primary key can become VERY problematic as your database and server system scales out. I would suggest using Guids and adding the clustered index to reflect the way you actually need your data stored.
Now, in summary, I just want to tell you to go create something great and don't get bogged down with superficial details that aren't going to give you enough of a performance gain to actually matter. Life is too short. Also, please remember that your system can only be as fast as its slowest piece of code. So make sure that you look at the things that ACTUALLY DO take up a lot of time and take care of those.
And one more additional thing. You can't take everything you see on the web at face value. Technology changes over time. Sometimes you may view an answer to a question that someone wrote a long time ago that is no longer relevant today. Also, people will answer questions and give you information without having actually tested what they are saying to see if it is true or not. The best thing you can do for your application is to stress test it really well. If you are using ASP.Net MVC you can do this in your tests. One thing you can do is to add a for loop that adds users to your app in your test and then test things out. That is one idea. There are other ways. You just have to give it a little effort to design your tests well or at least well enough for your purposes.
Good luck to you!

The current Membership provider seems like a BIG overkill, there's way too much functionality.
All I want to store is email/password and basic profile information such as First/LastName, Phone number. I will only ever need 2 roles, administrators & users.
Then just use that part. It's not going to use the parts that you don't use, and you may find that you have a need for those other parts down the road. The classes are already present in the .NET framework so you don't have to provide any licensing or anything.
The size of the database is quite small, and if you do like I do, and leave aspnetdb to itself, then you're not really taking anything from your other databases.
Do you have a compelling reason to use a third-party component OVER what's in the framework already?
EDIT:
there are also extra columns which I have no use for in the Membership/Users table, but which would add to the payload of each select/insert statements.
MobilePIN
PasswordQuestion/PasswordAnswer (I'll do email based password recovery)
IsApproved (user will always be approved)
Comment
MobileAlias
Username/LoweredUsername (or Email/LoweredEmail) [email IS the username so only need 1 of these]
This sounds like you're trying to microoptimize. Passing empty strings is virtually without cost (ok, it's there, but you have to profile to know just how much it's costing you. It won't be THAT much per user). We already routinely don't use all these fields in our apps either, but we use the membership system with no measurable detrimental impact.
Furthermore, I've heard that Guid's aren't all that fast, and would prefer to have integers instead (like Facebook does) which would also be publicly exposed.
I've heard that the cookiemonster likes cookies. Again, without profiling, you don't know if that's detrimental. Usually people use GUIDs because they want it to be absolutely (well to a degree of absoluteness) unique, no matter when it's created. The cost of generating it ONCE per user isn't all that heavy. Not when you're already creating them a new account.
Since you are absolutely set on creating a MembershipProvider from scratch, here are some references:
http://msdn.microsoft.com/en-us/library/system.web.security.membershipprovider.aspx
https://web.archive.org/web/20211020202857/http://www.4guysfromrolla.com/articles/120705-1.aspx
http://msdn.microsoft.com/en-us/library/f1kyba5e.aspx
http://www.amazon.com/ASP-NET-3-5-Unleashed-Stephen-Walther/dp/0672330113
Stephen Walther goes into detail on that in his book and it's a good reference for you to have as is.

My recommendation would be for you to benchmark it. Add as many records as you think you will have in production and submit a similar number of requests as you would get in production and see how it performs for your environment.
My guess is that it would be OK, the overhead that you are talking about would be insignificant.

Application Development: Should I check for a primary-key on a table or assume it should be there?

When building an application and you are using a table that has a primary key, should you check to see if the table has a primary key or does not have duplicate IDs?
I ran into some code I'm maintaining that checks to ensure no duplicate IDs are in the result set. But the ID being checked is a primary key. So to me this check is not needed, since you cannot have two primary keys with the same value.
But... should this be checked in case a DBA disabled the primary key on the table for any reason or assume the primary key should always be there?

Make sure that the query is actually returning data only from the table with the primary key. If this table is joined to another table in the query, and it isn't a one-to-one relationship, it could cause multiple rows to be returned which have the same ID in the primary table. In this case, the code checking for duplicates may actually be doing something valuable.
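A small in-memory sketch of that effect, using LINQ in place of a SQL join. The customer and order data are made up for illustration: customer IDs are unique on their own, but joining to a one-to-many table repeats them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinDuplicates
{
    // Joins customers (unique Id) to their orders (many per customer) and
    // returns the customer Id of every joined row.
    public static List<int> CustomerIdsFromJoin()
    {
        var customers = new[]
        {
            new { Id = 1, Name = "Ann" },
            new { Id = 2, Name = "Bob" }
        };
        var orders = new[]
        {
            new { OrderId = 10, CustomerId = 1 },
            new { OrderId = 11, CustomerId = 1 },
            new { OrderId = 12, CustomerId = 2 }
        };
        return customers
            .Join(orders, c => c.Id, o => o.CustomerId, (c, o) => c.Id)
            .ToList();
    }
}
```

The join yields three rows for two customers, so Id 1 appears twice even though it is a primary key in its own table.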
As long as this isn't the case, remove the code that checks for duplicates. It's a waste of CPU cycles and memory to verify that the database is doing its job.

I always leave it to the DB to manage this rule as it's best at doing that. But I have been bitten when people have dropped the primary key for various reasons - but it's always best to tackle that separately as it is usually an indication of another issue (such as a lack of training or care)

I think it would be a bad idea to have to confirm that the schema is correct in application code. That would be an ugly mixing of concerns. In fact, the application shouldn't care about the schema at all- it should be dependent on an abstracted data model.
Validation is another issue. You should check proactively for duplicates on primary and unique-keyed inserts rather than relying on a database exception to indicate a duplicate.

I believe it should be checked - but not by the application. You probably don't run a virus check, test if there's enough space left in the DB, get hard disk health status, ... from your application, either.
Even if you did check for PKs from your application - how do you know, that this doesn't change during runtime? The existence of PKs should be ensured by the database deployment process, and permissions be restrictive enough, that this can't be changed (too easily) outside of that process.

If it is a primary key, the constraint is enforced by SQL Server and you don't need to verify it. So normally you cannot insert records with duplicate primary keys. That said, you can temporarily deactivate this constraint and perform the insertion, but in normal circumstances this cannot happen.

I wouldn't check it, this is a pretty basic RDBMS principle. I don't think it's unreasonable for your code to give strange answers if someone violates it.

In my humble opinion, checking for duplicates on a primary key is redundant. Dumb, too, but that's impolite, so let that go.
No DBA should be disabling the primary key.

It seems to me like you've already answered your question. If you have full control of your database structure, checking that the PK works is pointless work.
If it's likely that someone is going to go around in your db breaking things, then it's relevant to check that your db structure is intact.

If the DBA has dropped the primary key on the table and the product team is not aware of it to make appropriate code changes, then this is not a shortcoming of the code. I am sure that no DBA would possibly do that. So, to check if the table has a PK or not is absolutely redundant.

It usually makes sense to check the schema version at install time rather than runtime. One way to do that is to store a schema version number in the database and at install time check it is the one the app is expecting. Schema upgrade is potentially part of the installation anyway.
For a production app it is also usual to implement a change management process so that any changes (whether app or database) are regression tested before being released.

Difference between creating Guid keys in C# vs. the DB

We use Guids as primary keys for entities in the database. Traditionally, we've followed a pattern of letting the database set the ID for an entity during the INSERT, I think mostly because this is typically how you'd handle things using an auto-increment field or whatever.
I'm finding more and more that it's a lot handier to do key assignment in code during object construction, for two main reasons:
you know that once an object's constructor has run, all of its fields have been initialized. You never have "half-baked" objects kicking around.
if you need to do a batch of operations, some of which depend on knowing an object's key, you can do them all at once without round-tripping to the database.
Are there any compelling reasons not to do things this way? That is, when using Guids as keys, is there a good reason to leave key assignment up to the database?
Edit:
A lot of people have strong opinions on whether or not Guids should be used for PKs (which I knew), but that wasn't really the point of my question.
Aside from the clustering issue (which doesn't seem to be a problem if you set your indexes up properly), I haven't seen a compelling reason to avoid creating keys in the application layer.

I think you are doing just fine by creating them on the client side. As you mentioned, if you let the db do it, you have to find some way (can't think of any really) to get that key. If you were using an identity, there are calls you can use to get the latest one created for a table, but I'm not sure if such exists for a guid.

By doing it in C# you might run the risk of reassigning the GUID and saving it back to the database. By having the database be responsible for it, you're guaranteed that this PK will not change, that is, if you set up the proper constraints. Having said that, you could set similar constraints in your C# code that prevent changing a unique id once it has been assigned, but you'd have to do the same in all of your applications... In my opinion, having it in C# sounds like more maintenance than the database, since databases already have built-in methods to prevent changing primary keys.

Interesting question.
Traditionally I too used the DB assigned guid but recently I was working on a Windows Mobile application and the SQL CE database doesn't allow for newguid so I had to do it in code.
I use SQL replication to get the data from the mobile devices to the server. Over the last 6 months I have had 40 SQL CE clients synchronise back over 100000 records to a SQL 2005 server without one missed or duplicated guid.
The additional coding required was negligible and the benefit of knowing the guid before inserting has in fact cut down on some of the complexity.
I haven't done any performance checking so performance aside I cannot see any reason not to implement guid handling as you suggest.

GUIDs are horrible for performance
I would leave it in the database, especially now that SQL Server has NEWSEQUENTIALID(), which doesn't cause page splits on inserts anymore because the values are no longer random: every value NEWSEQUENTIALID() creates will be greater than the previous one. The only caveat is that it can only be used as a default value.

If you ever have to do an insert outside of the GUI (think import from another vendor or data from a company you bought and have to merge with your data), then the GUID would not automatically be assigned. It's not an insurmountable issue, but it is something to consider nonetheless.

I let an empty Guid be an indicator that this object, although constructed, has not yet been inserted into (or retrieved from) the database.
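A minimal sketch of that convention. The `Customer` class, `IsPersisted` and `MarkPersisted` are hypothetical names; how the ID actually gets assigned depends on your persistence layer.

```csharp
using System;

public class Customer
{
    // Guid.Empty means "constructed, but not yet inserted or retrieved".
    public Guid Id { get; private set; }

    public bool IsPersisted
    {
        get { return Id != Guid.Empty; }
    }

    // Called by the (hypothetical) persistence layer once the row exists.
    public void MarkPersisted(Guid id)
    {
        if (id == Guid.Empty) throw new ArgumentException("id must not be empty", "id");
        Id = id;
    }
}
```

This trades away the "fully initialized after construction" property from the question in exchange for an in-object flag that says whether the row exists yet.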

As SQLMenace noted, standard GUIDs negatively affect indexing & paging. In C# you can generate sequential GUIDs like NEWSEQUENTIALID() using a little P/Invoke fun.
using System.Runtime.InteropServices;
// Declare inside a class; rpcrt4.dll is Windows-only.
[DllImport("rpcrt4.dll", SetLastError = true)]
static extern int UuidCreateSequential(out Guid guid);
This way you can at least keep using GUIDs, but get more flexibility with how and where they are generated.

Ok, time to chime in. I would say that generating GUIDs client-side for saving to the database is the best way to do things -- provided you happen to be using GUIDs as your PKs, which I only recommend in one scenario: a disconnected environment.
When you are using a disconnected model for your data propagation (i.e. PDA/cellphone apps, laptop apps intended for limited connectivity scenarios, etc), GUIDs as PKs generated client-side are the best way to do it.
For every other scenario, you're probably better off with auto-increment identity PKs.
Why? Well, a couple reasons. First, you really do get a big performance boost by using a row-spanning clustered PK index. A GUID PK and a clustered index do not play well together -- even with NEWSEQUENTIALID, which, by the way, I think totally misses the point of GUIDs. Second, unless your situation forces you not to (i.e. you have to use a disconnected model) you really want to keep everything transactional and insert as much interrelated data together at the same time.

Aside from the clustering issue (which doesn't seem to be a problem if you set your indexes up properly),
GUIDs as index keys will always be terribly fragmented - there's no "proper" setup to avoid that (unless you use the NEWSEQUENTIALID() function in the SQL Server engine).
The biggest drawback IMHO is size - a GUID is 16 bytes, an INT is 4. When the PK is the clustering key, it is not only stored in the tree of the primary key, but also IN EVERY non-clustered index entry.
With a few thousand entries, that might not make a big difference - but if you have a table with millions or billions of entries and several non-clustered indices, using a 16-byte GUID vs. a 4-byte INT as PK might make a HUGE difference in space needed - on disk and in RAM.
Marc
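A back-of-the-envelope version of that size argument, as a sketch. It counts only the raw key bytes and ignores page overhead, fill factor and compression, so treat the result as a rough lower bound. `KeySizeEstimate` is a made-up helper name.

```csharp
using System;

public static class KeySizeEstimate
{
    // Extra bytes spent on key storage when the clustering key is a
    // 16-byte GUID instead of a 4-byte INT. The key is stored once per row
    // in the clustered index and once per row in every non-clustered index.
    public static long ExtraBytes(long rows, int nonClusteredIndexes)
    {
        const int guidSize = 16;
        const int intSize = 4;
        return rows * (guidSize - intSize) * (1 + nonClusteredIndexes);
    }
}
```

For 100 million rows and 5 non-clustered indexes, `ExtraBytes(100000000L, 5)` comes to 7,200,000,000 bytes, roughly 7.2 GB of additional key data on disk and, for hot pages, in RAM.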
