Using a subset of GetHashCode() to increase AzureTable performance through partitioning - c#

Generally speaking, Azure Table IO performance improves as more partitions are used (with some tradeoffs in continuation tokens and batch updates I won't go into).
Since the partition key is always a string I am considering using a "natural" load balancing technique based on a subset of the GetHashCode() of the partition key, and appending this subset to the partition key itself. This will allow all direct PK/RK queries to be computed with little overhead and with ease. Batch updates may just need an intermediate to group similar PKs together prior to submission.
Question:
Should I use GetHashCode() to compute the partition key? Is a better function available?
If I use GetHashCode() does it matter which character I use for my PK?
Is there an abstraction for Azure Table and Blob storage that does this for me already?

No, don't use GetHashCode: its value is only guaranteed to be stable within the current AppDomain. Across processes, machines, or .NET versions it can change at any time.
Use a hash function which you control or which is standardized. Well-known non-cryptographic options include MurmurHash, and Google has published CityHash and FarmHash for this kind of use.
What should you partition (and hash) on? That depends on your query patterns. It absolutely cannot be answered without looking at your query patterns. In general, try to partition on something that is a predicate in almost all of your queries.
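As a minimal sketch of "a hash function which you control", the snippet below uses 32-bit FNV-1a (swapped in here for brevity, it is not the MurmurHash mentioned above) to derive a bucket number and append it to the key, as the question proposes. The class name StablePartitionKey, the method names, the "_" separator and the three-digit padding are all hypothetical choices for illustration, not part of any Azure API.

```csharp
using System;
using System.Text;

// Hypothetical helper: none of these names come from the Azure SDK.
public static class StablePartitionKey
{
    // 32-bit FNV-1a over the UTF-8 bytes of the text.
    // Unlike string.GetHashCode(), the result depends only on the input bytes,
    // so it is the same in every process, on every machine and runtime version.
    public static uint Fnv1a32(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            unchecked
            {
                hash ^= b;      // FNV-1a: xor first...
                hash *= prime;  // ...then multiply (wraps modulo 2^32)
            }
        }
        return hash;
    }

    // Appends a zero-padded bucket number to the natural key.
    // Assumption: bucketCount stays fixed once data has been written,
    // otherwise existing rows can no longer be located by direct PK lookup.
    public static string WithBucket(string naturalKey, int bucketCount)
    {
        if (bucketCount <= 0) throw new ArgumentOutOfRangeException(nameof(bucketCount));

        uint bucket = Fnv1a32(naturalKey) % (uint)bucketCount;
        return naturalKey + "_" + bucket.ToString("D3");
    }
}
```

A direct PK/RK lookup then recomputes StablePartitionKey.WithBucket(key, bucketCount) before querying. Whether the bucket belongs at the end or the start of the key, and what the natural key should be in the first place, still depends on the query patterns discussed above.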

Related

Should cache keys be hashed?

I am working on an existing system that uses NCache. It is a distributed system with large caching requirements, so there is no question that caching is the correct answer, but...
For some reason, in the existing code, all cache keys are hashed before being stored in the cache.
My argument is that we should NOT hash the keys, as the caching library may have some highly optimized way of storing its dictionary, and hashing everything means we may actually be slowing down lookups.
The guy who originally wrote the code has left, and the knowledge of why the keys are hashed has been lost.
Can anyone suggest whether hashing is the correct thing to do, or whether it should be removed?
Okay, so your questions are:
Should we hash the keys before storing?
If you do the hashing yourself, will it slow anything down?
Well, the cache API works on strings as keys. In the background NCache automatically generates hashes against these keys which help it to identify where the object should be stored. And by where I mean in which node.
When you say that your application hashes keys before handing them over to NCache, that is simply an unnecessary step. The NCache API was meant to take this headache away from you.
BUT if those hashes were generated because of some internal logic within your application, then that's another case. Please check carefully.
Needless to say, if you do the same work twice, it will cost you some performance. The hash strings that you provide will be used again to generate another hash value (int).
Whether you should or shouldn't hash keys depends on your system requirements.
NCache identifies an object by its key, and considers objects with equal keys to be equal. Below is a definition of a hash function from Wikipedia:
A hash function is any function that can be used to map data of
arbitrary size to data of fixed size.
If you stop hashing keys, the cache may behave differently. For example, objects whose keys currently hash to the same value are considered equal by NCache; without hashing, NCache may consider them distinct, and instead of one cache entry you will get two.
NCache doesn't require you to hash keys. An NCache key is just a string that is unique for each object. Relevant excerpt from the NCache 4.6 Programmer’s Guide:
NCache uses a “key” and “value” structure for objects. Every object
must have a unique string key associated with it. Every key has an
atomic occurrence in the cache whether it is local or clustered.
Cached keys are case sensitive in nature, and if you try to add
another key with same value, an OperationFailedException is thrown by
the cache.

Dictionary Vs Database (Mysql)

This question has been asked before but not under the context of a multithreaded application.
So given a single-threaded application, a dictionary will find the value for a given key in O(1) time. Given a table (from a database) which has a primary key (clustered index) on column 'X', a search on column 'X' will find the associated tuple in O(log n) time. Add the in-memory benefit of the dictionary, and the dictionary wins hands down.
Given a highly parallel (e.g. async socket server) application that relies on a common datastructure (i.e., Dictionary vs Database) to maintain application wide state information (e.g. connected users) where say 50% of the accesses are reads and about 50% of the accesses are updates/deletes/inserts, it does not seem so obvious to me that Dictionary is better. For the following reasons:
1) To make the dictionary thread-safe for concurrent access, a locking mechanism must be used. lock() will lock the entire dictionary, effectively allowing only one thread to access the data structure at a time. Even ReaderWriterLockSlim will lock the entire dictionary when elevated to a write lock.
2) Databases give the benefit of row-level locking when updating/deleting/inserting on a primary key.
3) Dictionaries are in-memory (faster) while database connections use sockets (slower).
So the question is, do the inherent row-level locking features of a relational database outweigh the benefits of in-memory Dictionary accesses?
This is a good question, Paul. In my view, the row-level locking of a relational database outweighs the benefits of in-memory dictionary access. The many additional capabilities of a relational database are why it is so much more widely used than an in-memory dictionary object. If you want an in-memory table for faster data access with MySQL, you can use the MEMORY storage engine.
Searching for, say, just date-based information from an indexed date-based column can be rather easy and fast. Adding to that, relational databases - depending on which you use - allow for security by roles and users, business intelligence, etc. With a dictionary object you would have to build all the features that are readily available in many popular databases.
A well tuned database can serve thousands of concurrent requests that multi-threaded and disparate applications can benefit from. So before rolling out your own, I'd recommend using a relational database engine. As always, your mileage may vary based on the complexity of the problem you hope to solve.

Is there any better option for GUID creation than System.Guid.NewGuid() in .net

In my application code I am generating GUIDs using System.Guid.NewGuid() and saving them to a SQL Server DB.
I have a few questions regarding the GUID generation:
When I ran the program I did not find any problem with this in terms of performance, but I still want to know whether there is a better way to generate GUIDs.
Is System.Guid.NewGuid() the only way to create a GUID in .NET code?
The GUIDs generated by Guid.NewGuid are not sequential according to SQL Server's sort order. This means you are inserting at random positions in your indexes, which causes page splits and fragmentation and can be a disaster for performance. It might not matter if the write volume is small enough.
You can use SQL Server's NEWSEQUENTIALID() function to create sequential ones, or just use an int.
One alternative way to generate guids (I presume as your PK) is to set the column in the table up like this:
create table MyTable(
MyTableID uniqueidentifier not null default (newid()),
...
Implementing like this means that you've the choice whether or not to set them in .Net or to let SQL do it.
I wouldn't say either is going to be "better" or "quicker" though.
To answer the question:
Is there any better option for GUID creation than
System.Guid.NewGuid() in .net
I would venture to say that System.Guid.NewGuid() is the preferred choice.
But for the follow up question:
...saving this to SQL server DB.
The answer is less clear. This has been discussed on the web for a long time. Just Google "guid as primary key" and you'll have hours of reading to do.
Usually when you use a Guid in SQL Server, it is as the primary key of a table. This has some nice advantages:
It's easy to generate new values without accessing the database
You can be reasonably sure that your locally generated Guid will NOT cause a primary key collision
But there are significant drawbacks as well:
If the primary key is also the clustered index, inserting large amounts of new rows will cause a lot of IO (disk operations) and index updates.
The Guid is quite large (16 bytes) compared to the other popular alternative for a surrogate key, the int (4 bytes). Since all other indexes on the table contain the clustered index key, they will grow much faster if you have a Guid vs an int. This will also cause more IO, since those indexes will require more memory.
To mitigate the IO issue, SQL Server 2005 introduced a new NEWSEQUENTIALID() function which can be used to generate sequential Guids when inserting new rows. But if you are going to use that, then you will have to be in contact with the database to generate one, so you lose the possibility of generating one when offline. In that situation you could still generate a normal Guid and use that.
There are also many articles on the web about how to roll your own sequential Guids. One sample:
http://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-under-multiple-database
I have not tested any of them so I can't vouch for how good they are. I chose that specific sample because it contains some information that might be interesting. Specifically:
It gets even more complicated, because one eccentricity of Microsoft
SQL Server is that it orders GUID values according to the least
significant six bytes (i.e. the last six bytes of the Data4 block).
So, if we want to create a sequential GUID for use with SQL Server, we
have to put the sequential portion at the end. Most other database
systems will want it at the beginning.
EDIT: Since the issue seems to be about inserting large amounts of data using bulk copy, a sequential Guid will probably be needed. If it's not necessary to know the Guid value before inserting, then the answer by Jon Egerton would be one good way to solve the issue. If you need to know the Guid value beforehand, you will either have to generate sequential Guids to use when inserting or use a workaround.
One possible workaround could be to change the table to use a seeded INT as the primary key (and clustered index), and keep the Guid value in a separate column with a unique index. When inserting, the Guid is supplied by you while the seeded int is the clustered index key. The rows will then be inserted sequentially, and your generated Guid can still be used as an alternative key for fetching records later. I have no idea if this is a feasible solution for you, but it's at least one possible workaround.
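For illustration only, a roll-your-own sequential Guid along the lines of the quoted passage might look like the sketch below: it takes a random Guid and overwrites its last six bytes with a millisecond timestamp, most significant byte first, which is where SQL Server looks first when ordering. The class name CombGuid and its methods are hypothetical, and like the linked samples this is untested against a real server.

```csharp
using System;

// Hypothetical helper, written for SQL Server's GUID ordering only.
public static class CombGuid
{
    private static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Overload taking the timestamp explicitly, which keeps the logic testable.
    public static Guid Create(DateTime utcNow)
    {
        byte[] bytes = Guid.NewGuid().ToByteArray();
        long millis = (long)(utcNow - UnixEpoch).TotalMilliseconds;

        // Write the low 48 bits of the timestamp into bytes 10..15
        // (the last six bytes of Data4), big-endian, so later
        // timestamps compare greater byte by byte.
        for (int i = 0; i < 6; i++)
        {
            bytes[15 - i] = (byte)(millis >> (8 * i));
        }
        return new Guid(bytes);
    }

    public static Guid Create()
    {
        return Create(DateTime.UtcNow);
    }
}
```

Values created within the same millisecond share the timestamp portion and are ordered only by their random bytes, so inserts are roughly rather than strictly sequential. As the quote notes, other database systems would want the sequential portion at the beginning instead.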
NewGuid would be the generally recommended way - unless you need sequential values, in which case you can P/Invoke to the rpcrt4.dll function UuidCreateSequential:
Private Declare Function UuidCreateSequential Lib "rpcrt4.dll" (ByRef id As Guid) As Integer
(Sorry, nicked from VB, sure you can convert to C# or other .NET languages as required).
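A C# conversion of that declaration could look roughly like this. The wrapper class SequentialGuid and its Create method are hypothetical names; only UuidCreateSequential in rpcrt4.dll is the real Windows function, so this runs on Windows only.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around the Win32 function named above.
public static class SequentialGuid
{
    // Same import as the VB Declare: fills in the Guid and returns an RPC status code.
    [DllImport("rpcrt4.dll")]
    private static extern int UuidCreateSequential(out Guid guid);

    public static Guid Create()
    {
        Guid guid;
        int status = UuidCreateSequential(out guid);

        // 0 is RPC_S_OK. Any other status is treated as a failure in this
        // sketch, which is a deliberately strict assumption.
        if (status != 0)
        {
            throw new InvalidOperationException(
                "UuidCreateSequential returned status " + status);
        }
        return guid;
    }
}
```

Whether the values produced this way also sort sequentially in SQL Server depends on the byte ordering discussed above, so that is worth checking before relying on it for a clustered index.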

CQRS and primary key: guid or not?

For my project, which is a potentially big web site, I have chosen to separate the command interface from the query interface. As a result, submitting a command is a one-way operation that doesn't return a result. This means that the client has to provide the key, for example:
service.SubmitCommand(new AddUserCommand() { UserId = key, ... });
Obviously I can't use a database-generated int for the primary key, so a Guid is a logical choice - except that I read everywhere about the performance impact it has, which scares me :)
But then I also read about COMB Guids, and how they provide the advantages of Guids while still performing well. I also found an implementation here: Sequential GUID in Linq-to-Sql?.
So before I make this important decision: does someone have experience with this matter, or advice?
Thanks a lot!
Lud
First of all, I use sequential GUIDs as a primary key and I don't have any problems with performance.
Most benchmarks comparing sequential GUIDs with INTs as primary keys use batch inserts and select data from an idle database. In real life, selects and updates happen at the SAME time.
As you are applying CQRS, you will not have batch inserts, and the overhead of opening and closing transactions will take much more time than a single write query. As you have a separate read store, your select operations on a table with a GUID PK will be much faster than they would be on a table with an INT PK in a unified store.
Besides, the asynchrony that messaging gives you allows your application to scale much better than systems with blocking RPC calls.
In light of the above, agonizing over GUIDs vs INTs seems penny-wise and pound-foolish to me.
You didn't specify which database engine you are using, but since you mentioned LINQ to SQL, I guess it's MS SQL Server.
If yes, then Kimberly Tripp has some advice about that:
Disk space is cheap...
GUIDs as PRIMARY KEYs and/or the clustering key
To summarize the two links in a few words:
sequential GUIDs perform better than random GUIDs, but still worse than numeric autoincrement keys
it's very important to choose the right clustered index for your table, especially when your primary key is a GUID
Instead of supplying a Guid to a command (which is probably meaningless to the domain), you probably already have a natural key like username which serves to uniquely identify the user. This natural key makes a lot more sense for the user commands:
When you create a user, you know the username because you submitted it as part of the command.
When you're logging in, you know the username because the user submitted it as part of the login command.
If you index the username column properly, you may not need the GUID. The best way to verify this is to run a test - insert a million user records and see how CreateUser and Login perform. If you really do see a serious performance hit that you have verified adversely affects the business and can't be solved by caching, then add a Guid.
If you're doing DDD, you'll want to focus hard on keeping the domain clean so the code is easy to understand and reflects the actual business processes. Introducing an artificial key is contrary to that goal, but if you're sure that it provides actual value to the business, then go ahead.

best practice to create a generic user id

What's the best way to implement a method that creates and assigns IDs to users in an ASP.NET application?
I was thinking about using DateTime ticks and the thread id.
I want to make sure that there are no collisions and that user ids are unique.
The ID can be a string or a long.
Should I use MD5 on some information that I collect from the user? What would that be?
I have seen that the MD5 collision rate is very low.
I would use GUIDs based off the limited information you've given.
The simplest solution is an autoincremented number. This requires a central server.
Date/time plus a one-way hash are for pseudo-random IDs. Do they have to be pseudo random for security? This should not be relied upon for uniqueness because by definition one-way hashes collide. You'd still need a central server to check for duplicates before issuing the ID.
GUIDs are best if the IDs are created in a distributed system (no central server to generate the ID). GUIDs can be generated on separate machines, and they shouldn't collide. Depends on the implementation, but some GUID algorithms are simply pseudo-random, and yes, there is still a possibility of collision.
Guid is by far the best choice for generating unique ids for something like a userid. For practical purposes they are unique globally (hence the name), although that rests on an extremely low collision probability rather than an absolute guarantee. In order to work best with a clustered index you should use NEWSEQUENTIALID(). This generates sequential ids that can be appended to the end of the index, and saves SQL Server from having to split pages and reorganise the index every time a value is added. There is a small security concern associated with using this function in that the next value in the sequence can be determined.
