How to generate unique integers based on GUIDs - C#

Is it possible to generate a (highly probably) unique integer from a GUID?
int i = Guid.NewGuid().GetHashCode();
int j = BitConverter.ToInt32(Guid.NewGuid().ToByteArray(), 0);
Which one is better?

Eric Lippert did a very interesting (as always) post about the probability of hash collisions.
You should read it all, but he concluded with this very illustrative graphic (image not reproduced here):
Related to your specific question, I would also go with GetHashCode since collisions will be unavoidable either way.

The GetHashCode function is specifically designed to create a well-distributed range of integers with a low probability of collision, so for this use case it is likely to be the best you can do.
But, as I'm sure you're aware, hashing 128 bits of information into 32 bits of information throws away a lot of data, so there will almost certainly be collisions if you have a sufficiently large number of GUIDs.

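A GUID is a 128-bit integer (it's just written in hex rather than base 10). With .NET 4, use BigInteger (see http://msdn.microsoft.com/en-us/library/dd268285%28v=VS.100%29.aspx) like so:
// Turn a GUID into a string and strip out the '-' characters (the "N" format does both).
// The leading "0" keeps the result non-negative; without it, a first hex digit of 8-F is read as a sign bit.
string modifiedGuidString = "0" + guid.ToString("N");
BigInteger huge = BigInteger.Parse(modifiedGuidString, NumberStyles.AllowHexSpecifier);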
If you don't have .NET 4 you can look at IntX or Solver Foundation.

Here is the simplest way:
Guid guid = Guid.NewGuid();
Random random = new Random();
int i = random.Next();
You'll notice that guid is not actually used here, mainly because there would be no point in using it. Microsoft's GUID algorithm does not use the computer's MAC address any more. GUIDs are actually generated using a pseudo-random generator (based on time values), so if you want a random integer it makes more sense to use the Random class for this.
Update: actually, using a GUID to generate an int would probably be worse than just using Random ("worse" in the sense that this would be more likely to generate collisions). This is because not all 128 bits in a GUID are random. Ideally, you would want to exclude the non-varying bits from a hashing function, although it would be a lot easier to just generate a random number, as I think I mentioned before. :)

If you are looking to break through the 2^32 barrier then try this method:
/// <summary>
/// Generate a BigInteger given a Guid. Returns a number from 0 to 2^128 - 1
/// (0 to 340,282,366,920,938,463,463,374,607,431,768,211,455).
/// </summary>
public BigInteger GuidToBigInteger(Guid guid)
{
    BigInteger l_retval = 0;
    byte[] ba = guid.ToByteArray();
    int i = ba.Length; // Length avoids the System.Linq dependency of Count()
    foreach (byte b in ba)
    {
        // Treat the array as a base-256 number, first byte most significant.
        l_retval += b * BigInteger.Pow(256, --i);
    }
    return l_retval;
}
The universe will decay to a cold and dark expanse before you experience a collision.

I had a requirement where multiple instances of a console application needed to get a unique integer ID. It is used to identify the instance and is assigned at startup. Because the .exe is started by hand, I settled on a solution using the ticks of the start time.
My reasoning was that it would be nearly impossible for the user to start two instances in the same millisecond. This behavior is deterministic: if you have a collision, you know that the problem was that two instances were started at the same time. Methods depending on a hash code, GUID or random numbers might fail in unpredictable ways.
I set the date to 0001-01-01, add the current time of day, and divide the ticks by 10000 (one tick is 100 ns, so this gives whole milliseconds) to get a number that is small enough to fit into an integer.
var now = DateTime.Now;
var zeroDate = DateTime.MinValue.AddHours(now.Hour).AddMinutes(now.Minute).AddSeconds(now.Second).AddMilliseconds(now.Millisecond);
int uniqueId = (int)(zeroDate.Ticks / 10000);
EDIT: There are some caveats. To make collisions unlikely, make sure that:
The instances are started manually (more than one millisecond apart)
The ID is generated once per instance, at startup
The ID must only be unique in regard to other instances that are currently running
Only a small number of IDs will ever be needed

Because the GUID space is larger than the number of 32-bit integers, you're guaranteed to have collisions if you have enough GUIDs. Given that you understand that and are prepared to deal with collisions, however rare, GetHashCode() is designed for exactly this purpose and should be preferred.

Maybe not integers, but small unique keys that are at least shorter than GUIDs:
http://www.codeproject.com/Articles/14403/Generating-Unique-Keys-in-Net

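In a static class, keep a static integer field (not const, since it has to change), then add 1 to it on every single access (using a public get property). This will ensure you cycle through the whole int range before you get a non-unique value.
/// <summary>
/// The command id to use. This is a thread-safe id, that is unique over the lifetime of the process. It changes
/// at each access.
/// </summary>
internal static int NextCommandId
{
    get
    {
        // A plain _nextCommandId++ is not atomic; Interlocked (System.Threading) makes it thread-safe.
        // Increment returns the new value, so subtract 1 to keep the sequence starting at 0.
        return Interlocked.Increment(ref _nextCommandId) - 1;
    }
}
private static int _nextCommandId = 0;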
This will produce a unique integer value within a running process. Since you do not explicitly define how unique your integer should be, this will probably fit.

Here is the simplest solution: just call GetHashCode() on the Guid. Note that a GUID is a 128-bit value while an int is 32 bits, so the result is not guaranteed to be unique. But it's probably statistically good enough for most implementations.
public override bool Equals(object obj)
{
if (obj is IBase)
return ((IBase)obj).Id == this.Id;
return base.Equals(obj);
}
public override int GetHashCode()
{
if (this.Id == Guid.Empty)
return base.GetHashCode();
return this.Id.GetHashCode();
}

Related

Comparing string hashes on different machines

I have a bunch of IDs, that are in the String form, like "enemy1", "enemy2".
I want to save progress that depends on how many of each enemy I have killed. For that I use a dictionary like { { "enemy1", 0 }, { "enemy2", 1 } }.
Then I want to share the player's save between the few machines he can play on (like a PC and a laptop) over the network (serializing it to a JSON file first). To reduce size and improve performance, I use hashes instead of the full strings, with this algorithm (because MSDN says the default hash algorithm can differ between machines):
int hash_ = 0;
public override int GetHashCode()
{
if(hash_ == 0)
{
hash_ = 5381;
foreach(var ch in id_)
hash_ = ((hash_ << 5) + hash_) ^ ch;
}
return hash_;
}
So, the question is: will this algorithm in C# return the same results on any machine the player uses?
UPD: the comments pointed out that the main part of the question was unclear.
So: if I can guarantee that all files will be in the same encoding before deserialization, will the char representation be the same on every machine the player can use, and will the operation ^ ch give the same result? I mean WinX64/WinX32/Mac/Linux/... machines.
Yes, that code will give the same result on every platform, for the same input. A char is a UTF-16 code unit, regardless of platform, and any given char will convert to the same int value on every platform. As normal with hash codes computed like this, you shouldn't assume that equal hash codes implies equal original values. (It's unclear how you're intending to use the hash, to be honest.)
I would point out that your code isn't thread-safe though - if two threads call GetHashCode at basically the same time, one may see a value of 0 (and therefore start hashing) whereas the second may see an interim result (as computed by the first thread) and assume that's the final hash. If you really believe caching is important here (and I'd test that first) you should compute the complete hash using a local variable, then copy it to the field only when you're done.
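A minimal sketch of the local-variable approach described above. The EnemyId wrapper class is hypothetical; only the id_ and hash_ fields and the hashing loop come from the question. It assumes that reads and writes of an int field are atomic, which holds in .NET.

```csharp
using System;

// Hypothetical wrapper type; only id_, hash_ and the loop come from the question.
public sealed class EnemyId
{
    private readonly string id_;
    private int hash_; // 0 means "not computed yet"

    public EnemyId(string id)
    {
        id_ = id ?? throw new ArgumentNullException(nameof(id));
    }

    public override bool Equals(object obj)
    {
        return obj is EnemyId other && other.id_ == id_;
    }

    public override int GetHashCode()
    {
        int cached = hash_;              // read the shared field exactly once
        if (cached == 0)
        {
            int h = 5381;                // all interim values stay in a local
            foreach (char ch in id_)
                h = unchecked(((h << 5) + h) ^ ch);
            hash_ = h;                   // publish only the finished value
            cached = h;
        }
        return cached;
    }
}
```

Two threads may still both compute the hash, but they compute the same value, so no caller can observe a half-finished result. An id whose hash happens to be 0 is simply recomputed on every call.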

Trying to understand the GetHashCode()

I found the following on Microsoft documentation:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash code
I made my own tests to understand the Method:
public static void HashMetod()
{
List<Cliente> listClientTest = new List<Cliente>
{
new Cliente { ID = 1, name = "Marcos", Phones = "2222"}
};
List<Empresa> CompanyList = new List<Empresa>
{
new Empresa { ID = 1, name = "NovaQuimica", Clients = listClientTest },
new Empresa { ID = 1, name = "NovaQuimica", Clients = listClientTest }
};
CompanyList.Add(CompanyList[0]);
foreach (var item in CompanyList)
{
Console.WriteLine("Hash code = {0}", item.GetHashCode());
}
Console.WriteLine("CompanyList[0].Equals(CompanyList[1]) = {0}", CompanyList[0].Equals(CompanyList[1]));
Console.WriteLine("CompanyList[0].Equals(CompanyList[2]) = {0}", CompanyList[0].Equals(CompanyList[2]));
}
My question is: how can two different objects return the same hash code? I believed that if two objects return the same hash code, they are equal (that's what my method shows). Run my method and check it out.
A simple observation based on the pigeonhole principle:
GetHashCode returns an int - a 32 bit integer.
There are 4.294.967.296 32-bit integers;
Considering only uppercase English letters, there are 141.167.095.653.376 ten letter words. If we include upper- and lowercase, then we have 144.555.105.949.057.024 combinations.
Since there are more objects than available hash-codes, some (different) objects must have the same hash code.
Another, more real-world example, is that if you wanted to give each person on Earth a hashcode, you would have collisions, since we have more persons than 32-bit integers.
"Fun" fact: because of the birthday paradox, in a city of 100.000 people, you have more than 50% chance of a hash collision.
Here is an Example;
String s1 = new String("AMY");
String s2 = new String("MAY");
Two different Objects, but if the hashCode is calculated with say, the ASCII Code of the characters, it will be the same for MAY and AMY.
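To make that concrete, here is a small sketch of such a deliberately weak hash. NaiveHash and SumOfChars are made-up names for illustration, and this is not how any real string hash works; it only shows how a hash that ignores character order maps anagrams to the same value.

```csharp
public static class NaiveHash
{
    // Deliberately weak hash: the sum of the character codes, so order is ignored.
    public static int SumOfChars(string s)
    {
        int sum = 0;
        foreach (char c in s)
            sum += c;
        return sum;
    }
}
```

With this function, NaiveHash.SumOfChars("AMY") and NaiveHash.SumOfChars("MAY") are both 65 + 77 + 89 = 231.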
You should basically understand the concept of hashing for this.
hashing an object means "finding a value (number) that can be reproduced by the very same instance again and again".
Because hash codes from Object.hashCode() are of type int, you can only have 2^32 different values.
That's why you will have so-called "collisions" depending on the hashing algorithm, when two distinct Objects produce the same hashCode.
To understand them better, you can go through a series of good examples;
PigeonHole, Sock Picking, Hair Counting
SoftBall Team
Birthday Problem.
Hope this helps.
You can read about hashing on the wiki page. But the whole point of hashing is to convert a value into an index, which is done with a hashing function. Hashing functions can vary, but pretty much all end with a mod to constrain the index value within a maximum so it can be put in an array. For each mod n there are infinitely many numbers that will yield the same index (e.g. 5 mod 2, 7 mod 2, etc.).
You probably just need to read up on Hash Functions in general to make sure you understand that. From Wikipedia:
Hash functions are primarily used to generate fixed-length output data
that acts as a shortened reference to the original data
So essentially you know that you are taking a large (potentially infinite) set of possibilities and trying to fit them into a smaller, more manageable set of possibilities. Because of the two different sizes of the sets, you're guaranteed to have collisions between two different source objects and their Hashes. That said, a good Hash function minimizes those collisions as much as possible.
A hash code is an int, which has 2^32 different values. Now take the String class: it can have infinitely many different values, so there must be different String values with the same hash code.
To find hash collisions you can exploit the birthday paradox. For instance, for Doubles it could be
// Requires: using System; using System.Collections.Generic; using System.Globalization;
Random gen = new Random();
Dictionary<int, Double> dict = new Dictionary<int, Double>();
// In general it'll take about
// 2 * sqrt(2^32) = 2 * 65536 = 131072 ~ 1e5 iterations
// to find a hash collision (two unequal values with the same hash)
while (true) {
    Double d = gen.NextDouble();
    int key = d.GetHashCode();
    if (dict.ContainsKey(key)) {
        Console.Write(d.ToString(CultureInfo.InvariantCulture));
        Console.Write(".GetHashCode() == ");
        Console.Write(dict[key].ToString(CultureInfo.InvariantCulture));
        Console.Write(".GetHashCode() == ");
        Console.Write(key.ToString(CultureInfo.InvariantCulture));
        break;
    }
    dict.Add(key, d);
}
In my case
0.540086061479564.GetHashCode() == 0.0337553788133689.GetHashCode() == -1350313817
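The iteration estimate in the code comment above comes from the birthday problem. As a rough sketch, assuming hash codes are uniformly distributed over all 2^32 values (real GetHashCode implementations only approximate this), the chance of at least one collision among n values can be estimated with the usual approximation 1 - exp(-n(n-1)/(2N)). The class and method names are made up for illustration.

```csharp
using System;

public static class Birthday
{
    // Approximate probability of at least one collision when n values are
    // drawn uniformly at random from `buckets` possible hash codes:
    // 1 - exp(-n(n-1) / (2 * buckets)).
    public static double CollisionProbability(double n, double buckets)
    {
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * buckets));
    }
}
```

Under that assumption, Birthday.CollisionProbability(100000, 4294967296.0) comes out at roughly 0.69, which matches the "more than 50%" figure for a city of 100.000 people, and the probability crosses 50% at around 77,000 values.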
The purpose of a hash code is to allow code which receives an object to quickly identify things that an object cannot possibly be equal to. If a collection class which has been asked to store many objects it knows nothing about other than how to test them for equality, were then given another object and were asked whether it matches any of the objects it has stored, the collection would have to call Equals on every object in the collection. On the other hand, if the collection can call GetHashCode on each item that's added to the collection, as well as the item it's looking for, and if 99% of the objects in the collection have reported a hashcode which doesn't match the hashcode of the item being sought, then only the 1% of objects whose hashcode does match need to be examined.
The fact that two items' hash codes match won't help compare the two items any faster than could have been done without checking their hash codes, but the fact that items' hash codes don't match will eliminate any need to examine them further. In scenarios where items are far more likely not to match than they are to match, hash codes make it possible to accelerate the non-match case, sometimes by many orders of magnitude.

Guid.NewGuid() VS a random string generator from Random.Next()

My colleague and I are debating which of these methods to use for auto generating user ID's and post ID's for identification in the database:
One option uses a single instance of Random, and takes some useful parameters so it can be reused for all sorts of string-gen cases (i.e. from 4 digit numeric pins to 20 digit alphanumeric ids). Here's the code:
// This is created once for the lifetime of the server instance
class RandomStringGenerator
{
public const string ALPHANUMERIC_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
public const string ALPHA_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public const string NUMERIC = "1234567890";
Random rand = new Random();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
s += chars[rand.Next() % chars.Length];
return s;
}
}
and the other option is simply to use:
Guid.NewGuid();
see Guid.NewGuid on MSDN
We're both aware that Guid.NewGuid() would work for our needs, but I would rather use the custom method. It does the same thing but with more control.
My colleague thinks that because the custom method has been cooked up ourselves, it's more likely to generate collisions. I'll admit I'm not fully aware of the implementation of Random, but I presume it is just as random as Guid.NewGuid(). A typical usage of the custom method might be:
RandomStringGenerator stringGen = new RandomStringGenerator();
string id = stringGen.GetRandomString(20, RandomStringGenerator.ALPHANUMERIC_CAPS.ToCharArray());
Edit 1:
We are using Azure Tables which doesn't have an auto increment (or similar) feature for generating keys.
Some answers here just tell me to use NewGuid() "because that's what it's made for". I'm looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
Edit 2:
We were also using the cooked up method to generate post ID's which, unlike session tokens, need to look pretty for display in the url of our website (like http://mywebsite.com/14983336), so guids are not an option here, however collisions are still to be avoided.
I am looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
First, as others have noted, Random is not thread-safe; using it from multiple threads can cause it to corrupt its internal data structures so that it always produces the same sequence.
Second, Random is seeded based on the current time. Two instances of Random created within the same millisecond (recall that a millisecond is several million processor cycles on modern hardware) will have the same seed, and therefore will produce the same sequence.
Third, I lied. Random is not seeded based on the current time; it is seeded based on the amount of time the machine has been active. The seed is a 32 bit number, and since the granularity is in milliseconds, that's only a few weeks until it wraps around. But that's not the problem; the problem is: the time period in which you create that instance of Random is highly likely to be within a few minutes of the machine booting up. Every time you power-cycle a machine, or bring a new machine online in a cluster, there is a small window in which instances of Random are created, and the more that happens, the greater the odds are that you'll get a seed that you had before.
(UPDATE: Newer versions of the .NET framework have mitigated some of these problems; in those versions you no longer have every Random created within the same millisecond have the same seed. However there are still many problems with Random; always remember that it is only pseudo-random, not crypto-strength random. Random is actually very predictable, so if you are relying on unpredictability, it is not suitable.)
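The seed problem is easy to reproduce by passing the seed explicitly. This is only a sketch; SeedDemo and Draw are made-up names, and the explicit seed stands in for two time-based seeds that happen to coincide.

```csharp
using System;

public static class SeedDemo
{
    // Draws `count` values from a Random created with the given seed.
    public static int[] Draw(int seed, int count)
    {
        var rng = new Random(seed);
        var values = new int[count];
        for (int i = 0; i < count; i++)
            values[i] = rng.Next();
        return values;
    }
}
```

SeedDemo.Draw(42, 5) returns the same five numbers every time it is called within a given runtime, which is exactly what happens to any IDs built from two Random instances that end up with the same seed, no matter how long the IDs are.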
As others have said: if you want a primary key for your database then have the database generate you a primary key; let the database do its job. If you want a globally unique identifier then use a guid; that's what they're for.
And finally, if you are interested in learning more about the uses and abuses of guids then you might want to read my "guid guide" series; part one is here:
https://ericlippert.com/2012/04/24/guid-guide-part-one/
As written in other answers, my implementation had a few severe problems:
Thread safety: Random is not thread safe.
Predictability: the method couldn't be used for security critical identifiers like session tokens due to the nature of the Random class.
Collisions: even though the method created 20 'random' characters, the probability of a collision is not 1 in (number of possible chars)^20, because the seed value is only 31 bits and comes from a bad source. Given the same seed, a sequence of any length will be the same.
Guid.NewGuid() would be fine, except we don't want to use ugly GUIDs in urls and .NETs NewGuid() algorithm is not known to be cryptographically secure for use in session tokens - it might give predictable results if a little information is known.
Here is the code we're using now, it is secure, flexible and as far as I know it's very unlikely to create collisions if given enough length and character choice:
class RandomStringGenerator
{
RNGCryptoServiceProvider rand = new RNGCryptoServiceProvider();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
{
byte[] intBytes = new byte[4];
rand.GetBytes(intBytes);
uint randomInt = BitConverter.ToUInt32(intBytes, 0);
s += chars[randomInt % chars.Length];
}
return s;
}
}
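One detail in the code above: randomInt % chars.Length has a very small modulo bias whenever chars.Length does not evenly divide 2^32. A possible variant, assuming .NET Core 3.0 or later where the static RandomNumberGenerator.GetInt32 method is available, avoids both the bias and the repeated string concatenation. SecureStringGenerator is a made-up name, and taking the alphabet as a string instead of params char[] is a choice made for this sketch only.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SecureStringGenerator
{
    public static string GetRandomString(int length, string chars)
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));
        if (string.IsNullOrEmpty(chars))
            throw new ArgumentException("Alphabet must not be empty.", nameof(chars));

        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            // GetInt32 returns a uniformly distributed value in [0, chars.Length).
            sb.Append(chars[RandomNumberGenerator.GetInt32(chars.Length)]);
        }
        return sb.ToString();
    }
}
```

A call such as SecureStringGenerator.GetRandomString(20, "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890") then plays the same role as the original GetRandomString.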
"Auto generating user ids and post ids for identification in the database"...why not use a database sequence or identity to generate keys?
To me your question is really, "What is the best way to generate a primary key in my database?" If that is the case, you should use the conventional tool of the database which will either be a sequence or identity. These have benefits over generated strings.
Sequences/identity index better. There are numerous articles and blog posts that explain why GUIDs and so forth make poor indexes.
They are guaranteed to be unique within the table
They can be safely generated by concurrent inserts without collision
They are simple to implement
I guess my next question is, what reasons are you considering GUID's or generated strings? Will you be integrating across distributed databases? If not, you should ask yourself if you are solving a problem that doesn't exist.
Your custom method has two problems:
It uses a global instance of Random, but doesn't use locking. => Multi threaded access can corrupt its state. After which the output will suck even more than it already does.
It uses a predictable 31 bit seed. This has two consequences:
You can't use it for anything security related where unguessability is important
The small seed (31 bits) can reduce the quality of your numbers. For example if you create multiple instances of Random at the same time(since system startup) they'll probably create the same sequence of random numbers.
This means you cannot rely on the output of Random being unique, no matter how long it is.
I recommend using a CSPRNG (RNGCryptoServiceProvider) even if you don't need security. Its performance is still acceptable for most uses, and I'd trust the quality of its random numbers over Random. If you want uniqueness, I recommend getting numbers with around 128 bits.
To generate random strings using RNGCryptoServiceProvider you can take a look at my answer to How can I generate random 8 character, alphanumeric strings in C#?.
Nowadays GUIDs returned by Guid.NewGuid() are version 4 GUIDs. They are generated from a PRNG, so they have pretty similar properties to generating a random 122 bit number (the remaining 6 bits are fixed). Its entropy source has much higher quality than what Random uses, but it's not guaranteed to be cryptographically secure.
But the generation algorithm can change at any time, so you can't rely on that. For example in the past the Windows GUID generation algorithm changed from v1 (based on MAC + timestamp) to v4 (random).
Use System.Guid as it:
...can be used across all computers and networks wherever a unique identifier is required.
Note that Random is a pseudo-random number generator. It is not truly random, nor unique. It has only 32-bits of value to work with, compared to the 128-bit GUID.
However, even GUIDs can have collisions (although the chances are really slim), so you should use the database's own features to give you a unique identifier (e.g. the autoincrement ID column). Also, you cannot easily turn a GUID into a 4 or 20 (alpha)numeric number.
Contrary to what some people have said in the comment, a GUID generated by Guid.NewGuid() is NOT dependent on any machine-specific identifier (only type 1 GUIDs are, Guid.NewGuid() returns a type 4 GUID, which is mostly random).
As long as you don't need cryptographic security, the Random class should be good enough, but if you want to be extra safe, use System.Security.Cryptography.RandomNumberGenerator. For the Guid approach, note that not all digits in a GUID are random. Quote from wikipedia:
In the canonical representation, xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx, the most significant bits of N indicates the variant (depending on the variant; one, two or three bits are used). The variant covered by the UUID specification is indicated by the two most significant bits of N being 1 0 (i.e. the hexadecimal N will always be 8, 9, A, or B).
In the variant covered by the UUID specification, there are five versions. For this variant, the four bits of M indicates the UUID version (i.e. the hexadecimal M will either be 1, 2, 3, 4, or 5).
Regarding your edit, here is one reason to prefer a GUID over a generated string:
The native storage for a GUID (uniqueidentifier) in SQL Server is 16 bytes. To store an equivalent-length varchar (string), where each "digit" in the id is stored as a character, would require somewhere between 32 and 38 bytes, depending on formatting.
Because of its storage, SQL Server is also able to index a uniqueidentifier column more efficiently than a varchar column.

How to seed GUID generation?

What would be the easiest way to code a function in .NET to generate a GUID based on a seed so that I can have greater confidence about its uniqueness?
string GenerateSeededGuid(int seed) { /* code here */ }
Ideally, the seed would come from CryptGenRandom which describes its random number generation as follows:
The data produced by this function is cryptographically random. It is
far more random than the data generated by the typical random number
generator such as the one shipped with your C compiler.
This function is often used to generate random initialization vectors
and salt values.
Software random number generators work in fundamentally the same way.
They start with a random number, known as the seed, and then use an
algorithm to generate a pseudo-random sequence of bits based on it.
The most difficult part of this process is to get a seed that is truly
random. This is usually based on user input latency, or the jitter
from one or more hardware components.
With Microsoft CSPs, CryptGenRandom uses the same random number
generator used by other security components. This allows numerous
processes to contribute to a system-wide seed. CryptoAPI stores an
intermediate random seed with every user. To form the seed for the
random number generator, a calling application supplies bits it might
have—for instance, mouse or keyboard timing input—that are then
combined with both the stored seed and various system data and user
data such as the process ID and thread ID, the system clock, the
system time, the system counter, memory status, free disk clusters,
the hashed user environment block. This result is used to seed the
pseudorandom number generator (PRNG). [...] If an application has access to a good random source, it can
fill the pbBuffer buffer with some random data before calling
CryptGenRandom. The CSP then uses this data to further randomize its
internal seed. It is acceptable to omit the step of initializing the
pbBuffer buffer before calling CryptGenRandom.
tl;dr: use Guid.NewGuid instead of trying to invent another "more random" approach. (The only reason I can think of to create a UUIDvX from a seed is when a predictable, resettable sequence is desired. However, a GUID might also not be the best approach [2].)
By very definition of being a finite range - 128 bits minus 6 versioning bits, so 122 bits of uniqueness for v4 - there are only so many (albeit supremely huge number! astronomically big!) "unique" identifiers.
Due to the Pigeonhole Principle there are only so many Pigeonholes. If Pigeons keep reproducing eventually there will not be enough Holes for each Pigeon. Due to the Birthday Paradox, assuming complete randomness, two Pigeons will try to fight for the same Pigeonholes before they are all filled up. Because there is no Master Pigeonhole List [1] this cannot be prevented. Also, not all animals are Pigeons [3].
While there are no guarantees as to which GUID generator will be used, .NET uses the underlying OS call, which is a GUIDv4 (aka Random UUID) generator since Windows 2k. As far as I know - or care, really - this is as good a random as it gets for such a purpose. It has been well vetted for over a decade and has not been replaced.
From Wikipedia:
.. only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
[1] While there are still a finite set of Pigeonholes, UUIDv1 (aka MAC UUID) - assuming unique time-space - is guaranteed to generate deterministically unique numbers (with some "relatively small" theoretical maximum number of UUIDs generated per second on a given machine). Different broods of Pigeons living in different parallel dimensions - awesome!
[2] Twitter uses Snowflakes in parallel dimensions in its own distributed Unique-ID scheme.
[3] Rabbits like to live in Burrows, not Pigeonholes. The use of a GUID also acts as an implicit parallel partition. It is only when a duplicate GUID is used for the same purpose that collision-related problems can arise. Just think of how many duplicate auto-increment database primary keys there are!
All you really need to do in your GenerateSeededGuid method is to create a 128-bit random number and then convert it to a Guid. Something like:
public Guid GenerateSeededGuid(int seed)
{
var r = new Random(seed);
var guid = new byte[16];
r.NextBytes(guid);
return new Guid(guid);
}
This is a bit old, but there is no need for a random generator. This is useful for testing purposes, but not for general use:
public static Guid GenerateSeededGuid<T>(T value)
{
byte[] bytes = new byte[16];
BitConverter.GetBytes(value.GetHashCode()).CopyTo(bytes, 0);
return new Guid(bytes);
}
public static Guid SeededGuid(int seed, Random random = null)
{
    random ??= new Random(seed);
    // Random.Next's upper bound is exclusive, so 0x10000 covers the full 0x0000-0xFFFF range.
    return Guid.Parse(string.Format("{0:X4}{1:X4}-{2:X4}-{3:X4}-{4:X4}-{5:X4}{6:X4}{7:X4}",
        random.Next(0, 0x10000), random.Next(0, 0x10000),
        random.Next(0, 0x10000),
        random.Next(0, 0x1000) | 0x4000,   // 12 random bits, version nibble fixed to 4
        random.Next(0, 0x4000) | 0x8000,   // 14 random bits, variant bits fixed to 10
        random.Next(0, 0x10000), random.Next(0, 0x10000), random.Next(0, 0x10000)));
}
// Example 1: a fresh Random per call, so both calls return the same GUID
SeededGuid("Test".GetHashCode());
SeededGuid("Test".GetHashCode());
// Example 2: a shared Random advances between calls, so the two GUIDs differ
var random = new Random("Test".GetHashCode());
SeededGuid("Test".GetHashCode(), random);
SeededGuid("Test".GetHashCode(), random);
This method is based on a PHP v4 UUID snippet: https://www.php.net/manual/en/function.uniqid.php#94959

Quickly creating 32 bit hash code uniquely identifying a struct composed of (mostly) primitive values

EDIT: 64 or 128 bit would also work. My brain just jumped to 32bit for some reason, thinking it would be sufficient.
I have a struct that is composed of mostly numeric values (int, decimal), and 3 strings that are never more than 12 alpha-characters each. I'm trying to create an integer value that will work as a hash code, and trying to create it quickly. Some of the numeric values are also nullable.
It seems like BitVector32 or BitArray would be useful entities for use in this endeavor, but I'm just not sure how to bend them to my will in this task. My struct contains 3 strings, 12 decimals (7 of which are nullable), and 4 ints.
To simplify my use case, lets say you have the following struct:
public struct Foo
{
public decimal MyDecimal;
public int? MyInt;
public string Text;
}
I know I can get numeric identifiers for each value. MyDecimal and MyInt are of course unique, from a numerical standpoint. And the string has a GetHashCode() function which will return a usually-unique value.
So, with a numeric identifier for each, is it possible to generate a hash code that uniquely identifies this structure? E.g. I can compare 2 different Foos containing the same values, and get the same hash code, every time (regardless of app domain, restarting the app, time of day, alignment of Jupiter's moons, etc).
The hash would be sparse, so I don't anticipate collisions from my use cases.
Any ideas? On my first run at it I converted everything to a string representation, concatenated it, and used the built-in GetHashCode(), but that seems terribly ... inefficient.
EDIT: A bit more background information. The structure data is being delivered to a webclient, and the client does a lot of computation of included values, string construction, etc to re-render the page. The aforementioned 19 field structure represent a single unit of information, each page could have many of units. I'd like to do some client-side caching of the rendered result, so I can quickly re-render a unit without recomputing on the client side if I see the same hash identifier from the server. JavaScript numeric values are all 64 bit, so I suppose my 32bit constraint is artificial and limiting. 64 bit would work, or I suppose even 128 bit if I can break it into two 64 bit values on the server.
Well, even in a sparse table one should better be prepared for collisions, depending on what "sparse" means.
You would need to be able to make very specific assumptions about the data you will be hashing at the same time to beat this graph with 32 bits.
Go with SHA256. Your hashes will not depend on the CLR version and you will have no collisions. Well, you will still have some, but less frequently than meteorite impacts, so you can afford not to anticipate any.
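A minimal sketch of what that could look like for the simplified Foo from the question. The FooFingerprint class and the field order are my own assumptions, not anything prescribed by the framework; the only requirement is that every field is written in a fixed, unambiguous order before hashing.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public struct Foo
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;
}

public static class FooFingerprint
{
    // Serializes the fields in a fixed order and hashes the bytes.
    // Presence flags keep null distinct from 0 and from "".
    public static byte[] Compute(Foo foo)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms, Encoding.UTF8))
        {
            writer.Write(foo.MyDecimal);
            writer.Write(foo.MyInt.HasValue);
            writer.Write(foo.MyInt.GetValueOrDefault());
            writer.Write(foo.Text != null);
            writer.Write(foo.Text ?? string.Empty); // length-prefixed
            writer.Flush();
            using (var sha = SHA256.Create())
            {
                return sha.ComputeHash(ms.ToArray());
            }
        }
    }

    // Hex string form, convenient as a cache key on the client.
    public static string ToHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
```

One caveat with this sketch: it hashes the binary representation of the decimal, so 1.0m and 1.00m compare equal but produce different fingerprints. If that matters, normalize the values before writing them.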
Two things I suggest you take a look at here and here. I don't think you'll be able to GUARANTEE no collisions with just 32 bits.
Hash codes, by the definition of a hash function, are not meant to be unique. They are only meant to be as evenly distributed across all result values as possible. Getting a hash code for an object is meant to be a quick way to check if two objects are different. If the hash codes for two objects are different, then those objects are different. But if the hash codes are the same, you have to deeply compare the objects to be sure. The main usage of hash codes is in hash-based collections, where they make nearly O(1) retrieval possible.
So in this light, your GetHashCode does not have to be complex, and in fact it shouldn't be. It must balance being very quick against producing evenly distributed values. If it takes too long to get a hash code, it becomes pointless because the advantage over a deep compare is gone. At the other extreme, if the hash code were always 1, for example (lightning fast), it would lead to a deep compare in every case, which makes the hash code pointless too.
So get the balance right and don't try to come up with a perfect hash code. Call GetHashCode on all (or most) of your members and combine the results using the xor operator, maybe together with a bitwise shift operator (<< or >>). The framework types have quite optimized GetHashCode implementations, although they are not guaranteed to return the same values in each application run. There is no guarantee, but the values also do not have to change, and for a lot of types they don't. Use a reflector to make sure, or create your own versions based on the reflected code.
In your particular case, deciding whether you have already processed a structure just by looking at its hash code is a bit risky. The better the hash, the smaller the risk, but it is still there. The ultimate and only unique hash code is... the data itself. When working with hash codes you must also override Object.Equals for your code to be truly reliable.
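As a rough sketch of that advice for the simplified Foo from the question: Equals does the deep compare, and GetHashCode mixes the member hash codes with a rotate and xor. The Mix helper and its rotate-by-5 are just one arbitrary way to do the "shift and xor"; other mixing schemes work too.

```csharp
using System;

public struct Foo : IEquatable<Foo>
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;

    // Deep compare: this is what actually decides equality.
    public bool Equals(Foo other)
    {
        return MyDecimal == other.MyDecimal
            && MyInt == other.MyInt
            && string.Equals(Text, other.Text, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return obj is Foo && Equals((Foo)obj);
    }

    // Quick pre-check: equal Foos always get equal hash codes,
    // but unequal Foos may still share one.
    public override int GetHashCode()
    {
        int hash = MyDecimal.GetHashCode();
        hash = Mix(hash, MyInt.GetHashCode()); // 0 when MyInt is null
        hash = Mix(hash, Text == null ? 0 : Text.GetHashCode());
        return hash;
    }

    // Rotate the running hash left by 5 bits, then xor in the next value,
    // so that the order of the fields influences the result.
    private static int Mix(int hash, int next)
    {
        unchecked
        {
            uint h = (uint)hash;
            return (int)((h << 5) | (h >> 27)) ^ next;
        }
    }
}
```

Note that this hash is only suitable for in-process use such as dictionary keys, since it builds on string.GetHashCode, which is not guaranteed to be stable across runs.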
I believe the usual method in .NET is to call GetHashCode on each member of the structure and xor the results.
However, I don't think GetHashCode claims to produce the same hash for the same value in different app domains.
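If the value has to be stable across processes, one option is to hash the bytes yourself with a fixed, documented algorithm. Below is a small sketch using 32-bit FNV-1a over the UTF-8 bytes of a string; the StableHash class name is made up, and FNV-1a is my choice for illustration, not something the framework uses for GetHashCode.

```csharp
using System.Text;

public static class StableHash
{
    // 32-bit FNV-1a: xor each byte into the hash, then multiply by the
    // FNV prime. Depends only on the input bytes, so the result is the
    // same in every process and on every runtime version.
    public static uint Fnv1a32(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text ?? string.Empty))
        {
            unchecked
            {
                hash ^= b;
                hash *= prime;
            }
        }
        return hash;
    }
}
```

This is still only 32 bits, so it fixes the stability problem but not the collision problem.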
Could you give a bit more information in your question about why you want this hash value and why it needs to be stable over time, different app domains etc.
What goal are you after? If it is performance then you should use a class since a struct will be copied by value whenever you pass it as a function parameter.
3 strings, 12 decimals (7 of which are nullable), and 4 ints.
On a 64 bit machine a pointer is 8 bytes in size, a decimal takes 16 bytes, and an int 4 bytes. Ignoring padding and the HasValue flags of the nullables, your struct will use 3*8 + 12*16 + 4*4 = 232 bytes per instance. This is much bigger than the recommended maximum of 16 bytes, which makes sense perf-wise (a class instance takes up at least 16 bytes due to its object header, ...)
If you need a fingerprint of the value you can use a cryptographic-grade hash algorithm like SHA256, which produces a 32 byte fingerprint. This is still not unique, but at least unique enough. It will cost quite some performance as well, though.
Edit1:
After you made clear that you need the hash code to identify the object in a JavaScript web client cache, I am confused. Why does the server send the same data again? Would it not be simpler to make the server smarter, so that it sends only data the client has not yet received?
A SHA hash algorithm could be OK in your case to create some kind of object instance tag.
Why do you need a hash code at all? If your goal is to store the values in a memory-efficient manner, you can create a FooList which uses dictionaries to store identical values only once and uses an int as lookup key.
using System;
using System.Collections.Generic;
namespace MemoryEfficientFoo
{
class Foo // This is our data structure
{
public int A;
public string B;
public Decimal C;
}
/// <summary>
/// List which stores Foos with much less memory if many values are equal. You can cut memory consumption by a factor of 3,
/// but if all values are different you consume 5 times as much memory as if you stored them in a plain list! So beware that
/// this trick might not help in your case. It only saves memory if many values are repeated.
/// </summary>
class FooList : IEnumerable<Foo>
{
Dictionary<int, string> Index2B = new Dictionary<int, string>();
Dictionary<string, int> B2Index = new Dictionary<string, int>();
Dictionary<int, Decimal> Index2C = new Dictionary<int, decimal>();
Dictionary<Decimal,int> C2Index = new Dictionary<decimal,int>();
struct FooIndex
{
public int A;
public int BIndex;
public int CIndex;
}
// List of foos which contains only the index values into the dictionaries, used to look up the data later.
List<FooIndex> FooValues = new List<FooIndex>();
public void Add(Foo foo)
{
int bIndex;
if(!B2Index.TryGetValue(foo.B, out bIndex))
{
bIndex = B2Index.Count;
B2Index[foo.B] = bIndex;
Index2B[bIndex] = foo.B;
}
int cIndex;
if (!C2Index.TryGetValue(foo.C, out cIndex))
{
cIndex = C2Index.Count;
C2Index[foo.C] = cIndex;
Index2C[cIndex] = foo.C; // store the value itself, not its index
}
FooIndex idx = new FooIndex
{
A = foo.A,
BIndex = bIndex,
CIndex = cIndex
};
FooValues.Add(idx);
}
public Foo GetAt(int pos)
{
var idx = FooValues[pos];
return new Foo
{
A = idx.A,
B = Index2B[idx.BIndex],
C = Index2C[idx.CIndex]
};
}
public IEnumerator<Foo> GetEnumerator()
{
for (int i = 0; i < FooValues.Count; i++)
{
yield return GetAt(i);
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
class Program
{
static void Main(string[] args)
{
FooList list = new FooList();
List<Foo> fooList = new List<Foo>();
long before = GC.GetTotalMemory(true);
for (int i = 0; i < 1000 * 1000; i++)
{
list
//fooList
.Add(new Foo
{
A = i,
B = "Hi",
C = i
});
}
long after = GC.GetTotalMemory(true);
Console.WriteLine("Did consume {0:N0} bytes", after - before);
}
}
}
A similar memory conserving list can be found here
