Generating a good hash code (GetHashCode) for a BitArray - c#

I need to generate a fast hash code in GetHashCode for a BitArray. I have a Dictionary where the keys are BitArrays, and all the BitArrays are of the same length.
Does anyone know of a fast way to generate a good hash from a variable number of bits, as in this scenario?
UPDATE:
The approach I originally took was to access the internal array of ints directly through reflection (speed is more important than encapsulation in this case), then XOR those values. The XOR approach seems to work well i.e. my 'Equals' method isn't called excessively when searching in the Dictionary:
public int GetHashCode(BitArray array)
{
int hash = 0;
foreach (int value in array.GetInternalValues())
{
hash ^= value;
}
return hash;
}
However, the approach suggested by Mark Byers and seen elsewhere on StackOverflow was slightly better (16570 Equals calls vs 16608 for the XOR for my test data). Note that this approach fixes a bug in the previous one where bits beyond the end of the bit array could affect the hash value. This could happen if the bit array was reduced in length.
public int GetHashCode(BitArray array)
{
UInt32 hash = 17;
int bitsRemaining = array.Length;
foreach (int value in array.GetInternalValues())
{
UInt32 cleanValue = (UInt32)value;
if (bitsRemaining < 32)
{
//clear any bits that are beyond the end of the array
int bitsToWipe = 32 - bitsRemaining;
cleanValue <<= bitsToWipe;
cleanValue >>= bitsToWipe;
}
hash = hash * 23 + cleanValue;
bitsRemaining -= 32;
}
return (int)hash;
}
The GetInternalValues extension method is implemented like this:
public static class BitArrayExtensions
{
static FieldInfo _internalArrayGetter = GetInternalArrayGetter();
static FieldInfo GetInternalArrayGetter()
{
return typeof(BitArray).GetField("m_array", BindingFlags.NonPublic | BindingFlags.Instance);
}
static int[] GetInternalArray(BitArray array)
{
return (int[])_internalArrayGetter.GetValue(array);
}
public static IEnumerable<int> GetInternalValues(this BitArray array)
{
return GetInternalArray(array);
}
// ... more extension methods
}
Any suggestions for improvement are welcome!

It is a terrible class to act as a key in a Dictionary. The only reasonable way to implement GetHashCode() is by using its CopyTo() method to copy the bits into a byte[]. That's not great, it creates a ton of garbage.
Beg, steal or borrow to use a BitVector32 instead. It has a good implementation for GetHashCode(). If you've got more than 32 bits then consider spinning your own class so you can get to the underlying array without having to copy.
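As a rough illustration of the "spin your own class" suggestion, here is a minimal sketch of a fixed-length bit set that owns its word array, so hashing and equality need neither copying nor reflection. The class name BitKey and all of its members are hypothetical, not part of the framework.

```csharp
using System;

// Hypothetical replacement for BitArray as a dictionary key.
public sealed class BitKey : IEquatable<BitKey>
{
    private readonly uint[] _words;   // 32 bits per element
    public int Length { get; }

    public BitKey(int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        Length = length;
        _words = new uint[(length + 31) / 32];
    }

    public bool this[int index]
    {
        get
        {
            CheckIndex(index);
            return (_words[index >> 5] & (1u << (index & 31))) != 0;
        }
        set
        {
            CheckIndex(index);
            if (value) _words[index >> 5] |= 1u << (index & 31);
            else _words[index >> 5] &= ~(1u << (index & 31));
        }
    }

    // Bits past Length can never be set, so no masking is needed when hashing.
    private void CheckIndex(int index)
    {
        if (index < 0 || index >= Length)
            throw new ArgumentOutOfRangeException(nameof(index));
    }

    public bool Equals(BitKey other)
    {
        if (other is null || other.Length != Length) return false;
        for (int i = 0; i < _words.Length; i++)
            if (_words[i] != other._words[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => Equals(obj as BitKey);

    public override int GetHashCode()
    {
        unchecked   // overflow simply wraps
        {
            uint hash = 17;
            foreach (uint word in _words) hash = hash * 23 + word;
            return (int)hash;
        }
    }
}
```

As with any mutable key, changing a bit after the object has been added to a Dictionary changes its hash code and makes the entry unreachable, so a real implementation would probably want to be immutable.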

If the bit arrays are 32 bits or shorter then you just need to convert them to 32 bit integers (padding with zero bits if necessary).
If they can be longer then you can either convert them to a series of 32-bit integers and XOR them, or better: use the algorithm described in Effective Java.
public int GetHashCode()
{
int hash = 17;
hash = hash * 23 + field1.GetHashCode();
hash = hash * 23 + field2.GetHashCode();
hash = hash * 23 + field3.GetHashCode();
return hash;
}
Taken from here. The field1, field2, etc. correspond to the first 32 bits, the second 32 bits, and so on.
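For a BitArray, the "fields" can be obtained without reflection through the public CopyTo method. The following is only a sketch under that assumption: the helper name GetWordHash is made up, and the last word is masked explicitly because bits past Length may or may not be cleared depending on the runtime.

```csharp
using System;
using System.Collections;

public static class BitArrayHashing
{
    // Copies the bits into 32-bit words, clears anything past Length,
    // then applies the 17/23 combination to each word.
    public static int GetWordHash(BitArray bits)
    {
        if (bits == null) throw new ArgumentNullException(nameof(bits));

        int[] words = new int[(bits.Length + 31) / 32];
        bits.CopyTo(words, 0);

        int tailBits = bits.Length % 32;
        if (tailBits != 0)
        {
            // Keep only the valid low bits of the final word.
            words[words.Length - 1] &= (int)((1u << tailBits) - 1);
        }

        unchecked   // overflow simply wraps
        {
            uint hash = 17;
            foreach (int word in words) hash = hash * 23 + (uint)word;
            return (int)hash;
        }
    }
}
```

This allocates one int[] per call, which is the garbage cost mentioned in the other answer; reusing a buffer would avoid it at the price of thread safety.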

Related

Fastest way to make a hashkey of multiple strings

The history why is long, but the problem is simple.
Having 3 strings I need to cache the matching value.
To have a fast cache I use the following code:
public int keygen(string a, string b, string c)
{
var x = a + "##" + b + "##" + c;
var hash = x.GetHashCode();
return hash;
}
(Note that string a,b,c does not contain the code "##")
The cache itself is just a Dictionary<int, object>
I know there is a risk that the hash key might not be unique, but setting that aside:
Does anyone know a faster way to make an int key? (in C#)
This operation takes ~15% of total CPU time and this is a long running app.
I have tried a couple of implementations but failed to find any faster.
You should use a Dictionary<Tuple<string,string,string>, object>. Then you don't have to worry about non-uniqueness, since the Dictionary will take care of it for you.
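A minimal sketch of what that could look like. The wrapper class TripleKeyCache and its method names are hypothetical; Tuple already implements structural Equals and GetHashCode, so the Dictionary resolves collisions itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache keyed on three strings via Tuple.
public sealed class TripleKeyCache
{
    private readonly Dictionary<Tuple<string, string, string>, object> _cache =
        new Dictionary<Tuple<string, string, string>, object>();

    public void Set(string a, string b, string c, object value)
    {
        _cache[Tuple.Create(a, b, c)] = value;
    }

    public bool TryGet(string a, string b, string c, out object value)
    {
        return _cache.TryGetValue(Tuple.Create(a, b, c), out value);
    }
}
```

Each lookup allocates a small Tuple object, so this trades a little garbage for correctness; whether it is faster than the string concatenation would need to be measured.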
Instead of concatenating strings (which creates new strings) you could use XOR or even better simple maths (credits to J.Skeet):
public int keygen(string a, string b, string c)
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
// Parentheses are required: without them the ?: applies to the whole sum.
hash = hash * 23 + (a == null ? 0 : a.GetHashCode());
hash = hash * 23 + (b == null ? 0 : b.GetHashCode());
hash = hash * 23 + (c == null ? 0 : c.GetHashCode());
return hash;
}
}
In general it's not necessary to produce unique hashes, but you should minimize collisions.
Another (less efficient) way is to use an anonymous type, which has built-in support for GetHashCode:
public int keygen(string a, string b, string c)
{
return new { a, b, c }.GetHashCode();
}
Note that the names, types and order of the properties all matter for the calculation of the hash code of an anonymous type.
A faster approach would be to compute the hash of each string separately, then combine them using a hash function. This will eliminate the string concatenation which could be taking time.
e.g.
public int KeyGen(string a, string b, string c)
{
var aHash = a.GetHashCode();
var bHash = b.GetHashCode();
var cHash = c.GetHashCode();
var hash = 36469;
unchecked
{
hash = hash * 17 + aHash;
hash = hash * 17 + bHash;
hash = hash * 17 + cHash;
}
return hash;
}
I know there is a risk that the hash key might be non unique
Hash keys don't have to be unique - they just work better if collisions are minimized.
That said, 15% of your time spent computing a string's hash code seems VERY high. Even switching to string.Concat() (which the compiler may do for you anyways) or StringBuilder shouldn't make that much difference. I'd suggest triple-checking your measurements.
I’d guess most of the time of this function is spent on building the concatenated string, only to call GetHashCode on it. I would try something like
public int keygen(string a, string b, string c)
{
return a.GetHashCode() ^ b.GetHashCode() ^ c.GetHashCode();
}
Or possibly use something more complicated than a simple XOR. However, be aware that GetHashCode is not a cryptographic hash function! It is a hash function used for hash tables, not for cryptography, and you should definitely not use it for anything security-related like keys (as your keygen name hints).
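One reason to consider something more complicated: XOR is symmetric and self-cancelling, so reordered arguments and repeated strings collide systematically. A small sketch of that behaviour follows; the class and method names are illustrative only.

```csharp
public static class XorKeyDemo
{
    // The plain XOR combination from the answer above.
    public static int XorKey(string a, string b, string c)
    {
        return a.GetHashCode() ^ b.GetHashCode() ^ c.GetHashCode();
    }

    // True for any inputs: swapping arguments never changes the XOR result.
    public static bool OrderIsIgnored(string a, string b, string c)
    {
        return XorKey(a, b, c) == XorKey(c, b, a);
    }

    // True for any inputs: two identical strings cancel out (x ^ x == 0),
    // leaving only the hash of the third string.
    public static bool DuplicatesCancel(string repeated, string other)
    {
        return XorKey(repeated, repeated, other) == other.GetHashCode();
    }
}
```

If ("x", "y", "z") and ("z", "y", "x") must map to different cache entries, an order-sensitive combination such as the multiply-and-add variants shown in the other answers avoids these guaranteed collisions.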

32 bit fast uniform hash function. Use MD5 / SHA1 and cut off 4 bytes? [duplicate]

Possible Duplicate:
What is the best 32bit hash function for short strings (tag names)?
I need to hash many strings to 32bit (uint).
Can I just use MD5 or SHA1 and take 4 bytes from it? Or are there better alternatives?
There is no need for security or to care if one is cracked and so on.
I just need to hash fast and uniform to 32 bit. MD5 and SHA1 should be uniform.
But are there better (faster) built-in alternatives I could use? If not, which of the two would you use?
Someone asked here which of the two is better, but that question did not ask about alternatives and was concerned with security (which I don't care about):
How to Use SHA1 or MD5 in C#?(Which One is Better in Performance and Security for Authentication)
Do you need a cryptographic-strength hash? If all you need is 32 bits I bet not.
Try the Fowler-Noll-Vo hash. It's fast, has good distribution and avalanche effect, and is generally acceptable for hashtables, checksums etc:
public static uint To32BitFnv1aHash(this string toHash,
bool separateUpperByte = false)
{
IEnumerable<byte> bytesToHash;
if (separateUpperByte)
bytesToHash = toHash.ToCharArray()
.Select(c => new[] { (byte)((c - (byte)c) >> 8), (byte)c })
.SelectMany(c => c);
else
bytesToHash = toHash.ToCharArray()
.Select(Convert.ToByte);
//this is the actual hash function; very simple
uint hash = FnvConstants.FnvOffset32;
foreach (var chunk in bytesToHash)
{
hash ^= chunk;
hash *= FnvConstants.FnvPrime32;
}
return hash;
}
public static class FnvConstants
{
public static readonly uint FnvPrime32 = 16777619;
public static readonly ulong FnvPrime64 = 1099511628211;
public static readonly uint FnvOffset32 = 2166136261;
public static readonly ulong FnvOffset64 = 14695981039346656037;
}
This is really useful for creating semantically equatable hashes for GetHashCode, based on a string digest of each object (a custom ToString() or otherwise). You can overload this to take any IEnumerable<byte>, making it suitable for checksumming stream data etc. If you ever need a 64-bit hash (ulong), just copy the function and replace the constants with the 64-bit ones. One more thing: like most hashes, this one relies on unchecked integer overflow; never run it in a "checked" block, or it is virtually guaranteed to throw an OverflowException.
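For comparison, here is a leaner sketch of the same FNV-1a loop without LINQ. It assumes the string should be hashed as its UTF-8 bytes, which is a different choice from the per-char conversion above, and it wraps the arithmetic in unchecked so it is safe even when compiled in a checked context. The class name Fnv1a is made up.

```csharp
using System.Text;

public static class Fnv1a
{
    private const uint OffsetBasis = 2166136261;   // 32-bit FNV offset basis
    private const uint Prime = 16777619;           // 32-bit FNV prime

    public static uint Hash32(byte[] data)
    {
        unchecked   // the multiplication is meant to wrap
        {
            uint hash = OffsetBasis;
            foreach (byte b in data)
            {
                hash ^= b;       // FNV-1a: XOR first...
                hash *= Prime;   // ...then multiply
            }
            return hash;
        }
    }

    // Assumption: hash the UTF-8 encoding of the string.
    public static uint Hash32(string text)
    {
        return Hash32(Encoding.UTF8.GetBytes(text));
    }
}
```

With this encoding choice, Fnv1a.Hash32("") returns the offset basis 0x811C9DC5 and Fnv1a.Hash32("a") returns 0xE40C292C.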
If security does not play a role, generating a hash with a cryptographic hash function (such as MD5 or SHA1) and taking 4 bytes from it works. But they are slower than various non-cryptographic hash functions, as these functions are primarily designed for security, not speed.
Have a look at non-cryptographic hash functions such as FNV or Murmur.
Non-Cryptographic Hash Function Zoo
Performance Graphs
MurMurHash3, an ultra fast hash algorithm for C# / .NET
Edit: The floodyberry.com domain is now registered by a domain parking service - removed dead links
The easiest algorithm for strings that is still reasonably good is as follows:
int Hash(string str)
{
int res = 0;
for(int i = 0; i < str.Length; i++)
{
res += (i * str[i]) % int.MaxValue;
}
return res;
}
Obviously, this is absolutely not a secure hash algorithm, but it is fast (really fast), returns 32 bits and, as far as I know, is uniform (I've tried it in many algorithmic challenges with good results).
Do not use it to hash passwords or any other sensitive data.

Creating 'good' hash codes for .NET ala Boost.Functional/Hash

For C++ I've always been using Boost.Functional/Hash to create good hash values without having to deal with bit shifts, XORs and prime numbers. Are there any libraries that produce good (I'm not asking for optimal) hash values for C#/.NET? I would use this utility to implement GetHashCode(), not cryptographic hashes.
To clarify why I think this is useful, here's the implementation of boost::hash_combine, which combines two hash values (of course a very common operation when implementing GetHashCode()):
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
Clearly, this sort of code doesn't belong in the implementation of GetHashCode() and should therefore be implemented elsewhere.
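For reference, a direct C# transcription of that one-liner might look like the sketch below. It assumes a 32-bit seed and uses uint so that the right shift is logical, as it is for Boost's unsigned seed; the class and method names are made up.

```csharp
public static class HashCombiner
{
    // 32-bit port of: seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    public static int Combine(int seed, int valueHash)
    {
        unchecked   // the additions are meant to wrap
        {
            uint s = (uint)seed;
            s ^= (uint)valueHash + 0x9e3779b9u + (s << 6) + (s >> 2);
            return (int)s;
        }
    }
}
```

A GetHashCode() override could then fold its fields in one at a time, for example HashCombiner.Combine(HashCombiner.Combine(0, field1.GetHashCode()), field2.GetHashCode()).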
I wouldn't use a separate library just for that. As mentioned before, it is essential for the GetHashCode method to be fast and stable. Usually I prefer to write an inline implementation, but it might actually be a good idea to use a helper class:
internal static class HashHelper
{
private static int InitialHash = 17; // Prime number
private static int Multiplier = 23; // Different prime number
public static Int32 GetHashCode(params object[] values)
{
unchecked // overflow is fine
{
int hash = InitialHash;
if (values != null)
for (int i = 0; i < values.Length; i++)
{
object currentValue = values[i];
hash = hash * Multiplier
+ (currentValue != null ? currentValue.GetHashCode() : 0);
}
return hash;
}
}
}
This way common hash-calculation logic can be used:
public override int GetHashCode()
{
return HashHelper.GetHashCode(field1, field2);
}
The answers to this question contain some examples of helper classes that resemble Boost.Functional/Hash. None looks quite as elegant, though.
I am not aware of any real .NET library that provides the equivalent.
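On newer runtimes the base class library does include something comparable: the System.HashCode struct. The sketch below assumes it is available (.NET Core 2.1 or later); the Measurement type is purely hypothetical.

```csharp
using System;

// Hypothetical type used only to illustrate System.HashCode.Combine.
public sealed class Measurement : IEquatable<Measurement>
{
    public string Name { get; }
    public int X { get; }
    public int Y { get; }

    public Measurement(string name, int x, int y)
    {
        Name = name;
        X = x;
        Y = y;
    }

    public bool Equals(Measurement other)
    {
        return other != null && Name == other.Name && X == other.X && Y == other.Y;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Measurement);
    }

    public override int GetHashCode()
    {
        // Combine is generic, so the int fields are not boxed; null is handled.
        return HashCode.Combine(Name, X, Y);
    }
}
```

The values produced by HashCode.Combine are only meaningful within a single process run, so they should not be persisted or compared across runs.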
Unless you have very specific requirements you don't need to calculate your type's hashcode from first principles. Rather combine the hash codes of the fields/properties you use for equality determination in one of the simple ways, something like:
int hash = field1.GetHashCode();
hash = (hash *37) + field2.GetHashCode();
(Combination function taken from §3.3.2 C# in Depth, 2nd Ed, Jon Skeet).
To avoid the boxing issue, chain your calls using a generic extension method on Int32:
public static class HashHelper
{
public static int InitialHash = 17; // Prime number
private static int Multiplier = 23; // Different prime number
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Int32 GetHashCode<T>( this Int32 source, T next )
{
// comparing null of value objects is ok. See
// http://stackoverflow.com/questions/1972262/c-sharp-okay-with-comparing-value-types-to-null
if ( next == null )
{
return source;
}
unchecked
{
// Multiply before adding so that the order of the fields affects the result.
return source * Multiplier + next.GetHashCode();
}
}
}
then you can do
HashHelper
.InitialHash
.GetHashCode(field0)
.GetHashCode(field1)
.GetHashCode(field2);
Have a look at this link, it describes MD5 hashing.
Otherwise use GetHashCode().

Get hash code for an int and a Nullable<int> pair

Currently I'm using the following class as my key for a Dictionary collection of objects that are unique by ColumnID and a nullable SubGroupID:
public class ColumnDataKey
{
public int ColumnID { get; private set; }
public int? SubGroupID { get; private set; }
// ...
public override int GetHashCode()
{
var hashKey = this.ColumnID + "_" +
(this.SubGroupID.HasValue ? this.SubGroupID.Value.ToString() : "NULL");
return hashKey.GetHashCode();
}
}
I was thinking of somehow combining this to a 64-bit integer but I'm not sure how to deal with null SubGroupIDs. This is as far as I got, but it isn't valid as SubGroupID can be zero:
// Parentheses are needed because + binds more tightly than <<.
var hashKey = ((long)this.ColumnID << 32) +
(this.SubGroupID.HasValue ? this.SubGroupID.Value : 0);
return hashKey.GetHashCode();
Any ideas?
Strictly speaking you won't be able to combine these perfectly because logically int? has 33 bits of information (32 bits for the integer and a further bit indicating whether the value is present or not). Your non-nullable int has a further 32 bits of information making 65 bits in total but a long only has 64 bits.
If you can safely restrict the value range of either of the ints to only 31 bits then you'd be able to pack them roughly as you're already doing. However, you won't get any advantage over doing it that way - you might as well just calculate the hash code directly like this (with thanks to Resharper's boilerplate code generation):
public override int GetHashCode()
{
unchecked
{
return (ColumnID*397) ^ (SubGroupID.HasValue ? SubGroupID.Value : 0);
}
}
You seem to be thinking of GetHashCode as a unique key. It isn't. HashCodes are 32-bit integers, and are not meant to be unique, only well-distributed across the 32 bit space to minimise the probability of collision. Try this for your GetHashCode method of ColumnDataKey:
ColumnID * 397 ^ (SubGroupID.HasValue ? SubGroupID.Value : -11111111)
The magic numbers here are 397, a prime number, which for reasons of voodoo magic is a good number to multiply by to mix up your bits (and is the number the ReSharper team picked), and -11111111, a SubGroup ID which I assume is unlikely to arise in practice.
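To make the "not meant to be unique" point concrete, here is a sketch of the key class with both Equals and GetHashCode. The constructor and the Equals members are assumptions, since the question elides them. A null SubGroupID and a SubGroupID of 0 produce the same hash code here, yet the Dictionary still keeps them apart because Equals distinguishes them.

```csharp
using System;

public sealed class ColumnDataKey : IEquatable<ColumnDataKey>
{
    public int ColumnID { get; }
    public int? SubGroupID { get; }

    // Assumed constructor; the original class body is not shown in full.
    public ColumnDataKey(int columnId, int? subGroupId)
    {
        ColumnID = columnId;
        SubGroupID = subGroupId;
    }

    public bool Equals(ColumnDataKey other)
    {
        // Lifted == treats null == null as true and null == 0 as false.
        return other != null
            && ColumnID == other.ColumnID
            && SubGroupID == other.SubGroupID;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ColumnDataKey);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            // null and 0 collide here on purpose; Equals resolves the tie.
            return (ColumnID * 397) ^ (SubGroupID.HasValue ? SubGroupID.Value : 0);
        }
    }
}
```

A collision like this only costs one extra Equals call in the affected bucket, which is why a sentinel such as -11111111 is an optimisation rather than a correctness requirement.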

Is it possible to combine hash codes for private members to generate a new hash code?

I have an object for which I want to generate a unique hash (override GetHashCode()) but I want to avoid overflows or something unpredictable.
The code should be the result of combining the hash codes of a small collection of strings.
The hash codes will be part of generating a cache key, so ideally they should be unique however the number of possible values that are being hashed is small so I THINK probability is in my favour here.
Would something like this be sufficient AND is there a better way of doing this?
int hash = 0;
foreach(string item in collection){
hash += (item.GetHashCode() / collection.Count);
}
return hash;
EDIT: Thanks for answers so far.
@Jon Skeet: No, order is not important
I guess this is almost another question, but since I am using the result to generate a cache key (string), would it make sense to use a cryptographic hash function like MD5, or just use the string representation of this int?
The fundamentals pointed out by Marc and Jon are not bad, but they are far from optimal in terms of the evenness of distribution of the results. Sadly, the 'multiply by primes' approach copied by so many people from Knuth is not the best choice: in many cases better distribution can be achieved by functions that are cheaper to calculate (though the difference is very slight on modern hardware). In fact, throwing primes into many aspects of hashing is no panacea.
If this data is used for significantly sized hash tables, I recommend reading Bret Mulvey's excellent study and explanation of various modern (and not so modern) hashing techniques, handily done with C#.
Note that the behaviour of various hash functions with strings depends heavily on whether the strings are short (roughly speaking, how many characters are hashed before the bits begin to overflow) or long.
One of the simplest and easiest to implement is also one of the best, the Jenkins One at a time hash.
private static unsafe void Hash(byte* d, int len, ref uint h)
{
for (int i = 0; i < len; i++)
{
h += d[i];
h += (h << 10);
h ^= (h >> 6);
}
}
public unsafe static void Hash(ref uint h, string s)
{
fixed (char* c = s)
{
byte* b = (byte*)(void*)c;
Hash(b, s.Length * 2, ref h);
}
}
public unsafe static int Avalanche(uint h)
{
h += (h<< 3);
h ^= (h>> 11);
h += (h<< 15);
return *((int*)(void*)&h);
}
you can then use this like so:
uint h = 0;
foreach(string item in collection)
{
Hash(ref h, item);
}
return Avalanche(h);
you can merge multiple different types like so:
public unsafe static void Hash(ref uint h, int data)
{
byte* d = (byte*)(void*)&data;
Hash(d, sizeof(int), ref h);
}
public unsafe static void Hash(ref uint h, long data)
{
byte* d= (byte*)(void*)&data;
Hash(d, sizeof(long), ref h);
}
If you only have access to the field as an object with no knowledge of the internals you can simply call GetHashCode() on each one and combine that value like so:
uint h = 0;
foreach(var item in collection)
{
Hash(ref h, item.GetHashCode());
}
return Avalanche(h);
Sadly you can't do sizeof(T) so you must do each struct individually.
If you wish to use reflection you can construct on a per type basis a function which does structural identity and hashing on all fields.
If you wish to avoid unsafe code then you can use bit masking techniques to pull out individual bits from ints (and chars if dealing with strings) with not too much extra hassle.
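A possible safe variant, sketched under the assumption that characters and ints should be fed in low byte first (which matches what the pointer-based version sees on little-endian hardware). The class and method names are hypothetical.

```csharp
public static class SafeJenkins
{
    // One round of the one-at-a-time mixing step for a single byte.
    public static void AddByte(ref uint h, byte b)
    {
        unchecked
        {
            h += b;
            h += h << 10;
            h ^= h >> 6;
        }
    }

    // Feeds each UTF-16 code unit as two bytes, low byte first.
    public static void AddString(ref uint h, string s)
    {
        foreach (char c in s)
        {
            AddByte(ref h, (byte)(c & 0xFF));
            AddByte(ref h, (byte)(c >> 8));
        }
    }

    // Feeds the four bytes of an int, low byte first, using shifts and masks.
    public static void AddInt32(ref uint h, int data)
    {
        for (int shift = 0; shift < 32; shift += 8)
        {
            AddByte(ref h, (byte)((data >> shift) & 0xFF));
        }
    }

    // Final avalanche step; reinterprets the result as a signed int.
    public static int Finish(uint h)
    {
        unchecked
        {
            h += h << 3;
            h ^= h >> 11;
            h += h << 15;
            return (int)h;
        }
    }
}
```

Usage mirrors the unsafe version: start with uint h = 0, call the Add methods for each item, then return SafeJenkins.Finish(h).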
Hashes aren't meant to be unique - they're just meant to be well distributed in most situations. They're just meant to be consistent. Note that overflows shouldn't be a problem.
Just adding isn't generally a good idea, and dividing certainly isn't. Here's the approach I usually use:
int result = 17;
foreach (string item in collection)
{
result = result * 31 + item.GetHashCode();
}
return result;
If you're otherwise in a checked context, you might want to deliberately make it unchecked.
Note that this assumes that order is important, i.e. that { "a", "b" } should be different from { "b", "a" }. Please let us know if that's not the case.
There is nothing wrong with this approach as long as the members whose hashcodes you are combining follow the rules of hash codes. In short ...
The hash code of the private members should not change for the lifetime of the object
The container must not change the object the private members point to lest it in turn change the hash code of the container
If the order of the items is not important (i.e. {"a","b"} is the same as {"b","a"}) then you can use exclusive or to combine the hash codes:
hash ^= item.GetHashCode();
[Edit: As Mark pointed out in a comment to a different answer, this has the drawback of also giving collections like {"a"} and {"a","b","b"} the same hash code.]
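The cancellation effect is easy to reproduce. The sketch below also shows wrapping addition as one possible order-independent alternative that does not cancel paired duplicates; this is an illustrative suggestion rather than part of the answer, and the names are made up.

```csharp
using System.Collections.Generic;

public static class UnorderedHash
{
    // XOR: order-independent, but any item appearing twice cancels out.
    public static int Xor(IEnumerable<string> items)
    {
        int hash = 0;
        foreach (string item in items) hash ^= item.GetHashCode();
        return hash;
    }

    // Wrapping sum: also order-independent; a duplicated item contributes
    // twice its hash instead of vanishing.
    public static int Sum(IEnumerable<string> items)
    {
        unchecked
        {
            int hash = 0;
            foreach (string item in items) hash += item.GetHashCode();
            return hash;
        }
    }
}
```

With these helpers, Xor(new[] { "a" }) and Xor(new[] { "a", "b", "b" }) are always equal, while Sum still returns the same value for { "a", "b" } and { "b", "a" }.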
If the order is important, you can instead multiply by a prime number and add:
hash *= 11;
hash += item.GetHashCode();
(When you multiply you will sometimes get an overflow that is ignored, but by multiplying with a prime number you lose a minimum of information. If you instead multiplied with a number like 16, you would lose four bits of information each time, so after eight items the hash code from the first item would be completely gone.)
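The information loss described above can be checked directly. In this sketch (names are illustrative) two sequences of nine item hashes differ only in their first element: with a multiplier of 16 the first element is shifted out of the 32-bit result entirely, while with 11 it still influences the outcome.

```csharp
public static class MultiplierDemo
{
    // hash = hash * multiplier + itemHash, with wrapping arithmetic.
    public static int Combine(int multiplier, params int[] itemHashes)
    {
        unchecked
        {
            int hash = 0;
            foreach (int itemHash in itemHashes)
            {
                hash = hash * multiplier + itemHash;
            }
            return hash;
        }
    }
}
```

For example, Combine(16, 1, 0, 0, 0, 0, 0, 0, 0, 0) and Combine(16, 2, 0, 0, 0, 0, 0, 0, 0, 0) are both 0, because the first item ends up multiplied by 16^8 = 2^32. With a multiplier of 11 the two calls return different values.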
