How do I produce a Kinesis Data Stream HashKey from a PartitionKey? - c#

I'm consuming data from a Kinesis Data Stream using a C# client. There are multiple instances of the C# client, one for each shard in the stream, and each one concurrently retrieves Kinesis records using the GetRecords method.
The PutRecords call is performed by an external system, which specifies an IMEI as the 'PartitionKey' value. The HashKey (and hence the shard selection) is computed by the KinesisClient using (from the docs) 'An MD5 hash function ... to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards'.
I have an external system update which is broadcasting IMEI related data to all of the running clients. My challenge is that I need to determine which client is processing the data for the IMEI, hence I need to apply the same 'MD5 hash function' as the KinesisClient is applying to the data. My intention is then to compare the hash to the HashKey range of the shard the client is processing, allowing the client to determine whether it is interested in the IMEI data, or not.
I've tried to work this out and have this code:
byte[] hash;
string imei = "123456789012345";
using (MD5 md5 = MD5.Create()) {
hash = md5.ComputeHash(Encoding.UTF8.GetBytes(imei));
}
Console.WriteLine(new BigInteger(hash));
However, this gives me a negative value and my shard HashKey ranges are positive.
I really need the C# code which the KinesisClient is using to turn a PartitionKey into a HashKey, but I can't find it. Can anyone help me to work this out please?
UPDATE: I'm making progress after finding the links I mentioned in the comments below. My code currently looks like this:
private static MD5 md5 = MD5.Create();
private static List<BigInteger> powersOfTen = null;
private static BigInteger MAXHASHKEY =BigInteger.Parse("340282366920938463463374607431768211455");
public String CreateExplicitHashKey(String partitionKey) {
byte[] pkDigest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
byte[] hashKeyBytes = new byte[16];
// Reverse the big-endian MD5 digest into the little-endian byte order BigInteger expects.
for (int i = 0; i < pkDigest.Length; i++) {
hashKeyBytes[16 - i - 1] = pkDigest[i];
}
BigInteger hashKey = new BigInteger(hashKeyBytes, true, false);
if (powersOfTen == null) {
powersOfTen = new List<BigInteger>();
powersOfTen.Add(1);
for (BigInteger i = 10; i < MAXHASHKEY; i *= i) {
powersOfTen.Add(i);
}
}
return BuildString(hashKey, powersOfTen.Count - 1).ToString().TrimStart('0');
}
private static string BuildString(BigInteger n, int m) {
if (m == 0) return n.ToString();
BigInteger remainder;
BigInteger quotient = BigInteger.DivRem(n, powersOfTen[m], out remainder);
return BuildString(quotient, m - 1) + BuildString(remainder, m - 1);
}
I've been able to verify the conversion using the examples offered by Will Haley over here. Now I'm seeking to verify the code to ensure it is doing the exact same conversion as is performed by the Kinesis PutRecord/PutRecords client methods, so I can be 100% confident.
Originally I'd expected to find a function for this in the Kinesis client library, but couldn't find one. I've made this suggestion over on GitHub
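A shorter variant, as a sketch only: the negative value in the first attempt comes from BigInteger(byte[]) reading the bytes as a little-endian signed number. Assuming Kinesis treats the MD5 digest as an unsigned big-endian 128-bit integer (which is what the manual byte reversal above also assumes), the BigInteger constructor overload with isUnsigned and isBigEndian (.NET Core 2.1 and later) does the conversion directly. The class and method names are made up for illustration, and IsInShard assumes the shard range is supplied as the decimal StartingHashKey and EndingHashKey strings.

```csharp
using System;
using System.Numerics;
using System.Security.Cryptography;
using System.Text;

public static class KinesisHashKey
{
    // Interprets the 16-byte MD5 digest of the partition key as an
    // unsigned, big-endian 128-bit integer (assumed to match Kinesis).
    public static BigInteger FromPartitionKey(string partitionKey)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
            return new BigInteger(digest, isUnsigned: true, isBigEndian: true);
        }
    }

    // True when the hash key falls inside the shard's inclusive range,
    // given as decimal strings (hypothetical helper for illustration).
    public static bool IsInShard(string partitionKey, string startingHashKey, string endingHashKey)
    {
        BigInteger hashKey = FromPartitionKey(partitionKey);
        return hashKey >= BigInteger.Parse(startingHashKey)
            && hashKey <= BigInteger.Parse(endingHashKey);
    }
}
```

Each client could then call KinesisHashKey.IsInShard(imei, start, end) with the range of the shard it is reading. Creating the MD5 instance per call also avoids sharing one instance across threads. Whether the result matches what PutRecords does still needs to be verified against real records.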

Related

Having trouble with AWS4 Signature tutorial, hash doesn't match example

I'm working through the tutorial on AWS, trying to calculate the Authorization header and I'm stuck. (Tutorial here: https://docs.aws.amazon.com/general/latest/gr/sigv4-create-canonical-request.html)
I've narrowed down my problem to a step at the end of task 3. I can create the signing key as they described and get the same result as they do,
c4afb1cc5771d871763a393e44b703571b55cc28424d1a5e86da6ed3c154a4b9
I can calculate the stringToSign as they describe and I get a matching result,
AWS4-HMAC-SHA256\n20150830T123600Z\n20150830/us-east-1/iam/aws4_request\nf536975d06c0309214f805bb90ccff089219ecd68b2577efef23edd43b7e1a59
But when I try to sign the string my result doesn't match their result.
var kha = KeyedHashAlgorithm.Create("HMACSHA256");
kha.Key = Encoding.UTF8.GetBytes("c4afb1cc5771d871763a393e44b703571b55cc28424d1a5e86da6ed3c154a4b9");
var sts = "AWS4-HMAC-SHA256\n20150830T123600Z\n20150830/us-east-1/iam/aws4_request\nf536975d06c0309214f805bb90ccff089219ecd68b2577efef23edd43b7e1a59";
var signature = HexEncode(kha.ComputeHash(Encoding.UTF8.GetBytes(sts)));
When I run this my signature comes out as
fe52b221b5173b501c9863cec59554224072ca34c1c827ec5fb8a257f97637b1
but they say it should be
5d672d79c15b13162d9279b0855cfba6789a8edb4c82c400e06b5924a6f2b5d7
In task 2 I run my HexEncode function as part of creating the HashedCanonicalRequest and that is coming out fine so I don't think it is that function but here it is just in case:
private static string HexEncode(byte[] data, bool lowercase = true)
{
var sb = new StringBuilder();
for (var i = 0; i < data.Length; i++)
{
sb.Append(data[i].ToString(lowercase ? "x2" : "X2"));
}
return sb.ToString();
}
I've tried various ways of writing the sts like using
@"AWS4-HMAC-SHA256
20150830T123600Z
20150830/us-east-1/iam/aws4_request
f536975d06c0309214f805bb90ccff089219ecd68b2577efef23edd43b7e1a59"
instead of using \n but nothing has worked. I also read through a few of the other postings here on SO but none of those seemed to help either.
Update:
I created this fiddle just to prove to myself that it isn't something environmental but it gets the same answer as my local code.
https://dotnetfiddle.net/A5mVp9
So it turns out that using
kha.Key = Encoding.UTF8.GetBytes("c4afb1cc5771d871763a393e44b703571b55cc28424d1a5e86da6ed3c154a4b9");
is incorrect. That string is hex encoded (because that's what the tutorial says to do), but you are supposed to use the raw byte array as the key, not its hex encoding. They showed the hex-encoded form just for display purposes but didn't do a good job of saying to use the regular byte array and DO NOT HEX ENCODE IT! Anyway, that's what solves this.
If you want to see it in action, write a hex decoder:
public static byte[] DecodeHex(string hex)
{
byte[] raw = new byte[hex.Length / 2];
for (int i = 0; i < raw.Length; i++)
{
raw[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
}
return raw;
}
Then hex-decode the string I listed and use the resulting byte array as the key for the hashing:
kha.Key = DecodeHex("c4afb1cc5771d871763a393e44b703571b55cc28424d1a5e86da6ed3c154a4b9");
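To avoid the hex round trip altogether, the signing key can stay a byte array from the moment it is derived. The following is a minimal sketch of the chained HMAC-SHA256 derivation as I understand it from the SigV4 documentation; the class and method names are mine, and the secret key is a parameter you have to supply yourself.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SigV4Sketch
{
    // HMAC-SHA256 of a UTF-8 string under a raw byte key.
    private static byte[] Hmac(byte[] key, string data)
    {
        using (var hmac = new HMACSHA256(key))
        {
            return hmac.ComputeHash(Encoding.UTF8.GetBytes(data));
        }
    }

    // Chains the four HMAC steps; every intermediate key stays a byte array.
    public static byte[] DeriveSigningKey(string secretKey, string date, string region, string service)
    {
        byte[] kDate = Hmac(Encoding.UTF8.GetBytes("AWS4" + secretKey), date);
        byte[] kRegion = Hmac(kDate, region);
        byte[] kService = Hmac(kRegion, service);
        return Hmac(kService, "aws4_request");
    }

    // Signs the string to sign with the raw signing key; only the result is hex encoded.
    public static string Sign(byte[] signingKey, string stringToSign)
    {
        return ToHex(Hmac(signingKey, stringToSign));
    }

    public static string ToHex(byte[] data)
    {
        var sb = new StringBuilder(data.Length * 2);
        foreach (byte b in data)
        {
            sb.Append(b.ToString("x2"));
        }
        return sb.ToString();
    }

    public static byte[] FromHex(string hex)
    {
        byte[] raw = new byte[hex.Length / 2];
        for (int i = 0; i < raw.Length; i++)
        {
            raw[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return raw;
    }
}
```

With this shape, SigV4Sketch.Sign(SigV4Sketch.DeriveSigningKey(secret, "20150830", "us-east-1", "iam"), sts) never sees a hex string as a key, so the mistake from the question cannot happen.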

Generate Running Hash (or Checksum) in C#?

Preface:
I am doing a data-import that has a verify-commit phase. The idea is that: the first phase allows taking data from various sources and then running various insert/update/validate operations on a database. The commit is rolled back but a "verification hash/checksum" is generated. The commit phase is the same, but, if the "verification hash/checksum" is the same then the operations will be committed. (The database will be running under the appropriate isolation levels.)
Restrictions:
Input reading and operations are forward-read-once only
Do not want to pre-create a stream (e.g. writing to MemoryStream not desirable) as there may be a lot of data. (It would work on our servers/load, but pretend memory is limited.)
Do not want to "create my own". (I am aware of available code like CRC-32 by Damien which I could use/modify but would prefer something "standard".)
And what I (think I am) looking for:
A way to generate a Hash (e.g. SHA1 or MD5?) or a Checksum (e.g. CRC32 but hopefully more) based on input + operations. (The input/operations could themselves be hashed to values more fitting to the checksum generation but it would be nice just to be able to "write to stream".)
So, the question is:
How to generate a Running Hash (or Checksum) in C#?
Also, while there are CRC32 implementations that can be modified for a Running operation, what about running SHAx or MD5 hashes?
Am I missing some sort of handy Stream approach that could be used as an adapter?
(Critiques are welcome, but please also answer the above as applicable. Also, I would prefer not to deal with threads. ;-)
You can call HashAlgorithm.TransformBlock multiple times; calling TransformFinalBlock then gives you the hash of all the blocks combined.
Chunk up your input (by reading x bytes at a time from a stream) and call TransformBlock with each chunk.
EDIT (from the msdn example):
public static void PrintHashMultiBlock(byte[] input, int size)
{
SHA256Managed sha = new SHA256Managed();
int offset = 0;
while (input.Length - offset >= size)
offset += sha.TransformBlock(input, offset, size, input, offset);
sha.TransformFinalBlock(input, offset, input.Length - offset);
Console.WriteLine("MultiBlock {0:00}: {1}", size, BytesToStr(sha.Hash));
}
I don't have a stream-based example readily available, but in your case you basically replace input with your own chunk, and size becomes the number of bytes in that chunk. You will have to keep track of the offset yourself.
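For a forward-only source, the same idea might look like the sketch below. The helper name and the chunk size are arbitrary choices of mine. Passing null as the output buffer to TransformBlock skips the copy that the MSDN sample performs.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class RunningHash
{
    // Reads the stream once, front to back, feeding each chunk into the hash.
    public static byte[] HashStream(Stream input, HashAlgorithm algorithm, int chunkSize = 4096)
    {
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only the bytes actually read are hashed; nothing is copied out.
            algorithm.TransformBlock(buffer, 0, read, null, 0);
        }
        // An empty final block completes the hash computation.
        algorithm.TransformFinalBlock(buffer, 0, 0);
        return algorithm.Hash;
    }
}
```

The buffer is reused for every chunk, so memory use stays at chunkSize bytes regardless of how much data flows through.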
Hashes have a build and a finalization phase. You can shove arbitrary amounts of data in during the build phase. The data can be split up as you like. Finally, you finish the hash operation and get your hash.
You can use a writable CryptoStream to write your data. This is the easiest way.
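A minimal sketch of the CryptoStream approach: a HashAlgorithm is itself an ICryptoTransform, so it can be plugged into a write-mode CryptoStream whose target is Stream.Null. The helper name HashChunks is made up for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class CryptoStreamHash
{
    // Writes every chunk through a CryptoStream that discards its output;
    // the hash algorithm accumulates state as a side effect of the writes.
    public static byte[] HashChunks(HashAlgorithm algorithm, params byte[][] chunks)
    {
        using (var sink = new CryptoStream(Stream.Null, algorithm, CryptoStreamMode.Write))
        {
            foreach (byte[] chunk in chunks)
            {
                sink.Write(chunk, 0, chunk.Length);
            }
            // Finalizes the hash; after this call algorithm.Hash is available.
            sink.FlushFinalBlock();
            return algorithm.Hash;
        }
    }
}
```

In the import scenario you would keep the CryptoStream open for the whole phase, write each input or operation to it as it happens, and call FlushFinalBlock once at the end.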
You can generate an MD5 hash using the MD5CryptoServiceProvider's ComputeHash method. It takes a stream as input.
Create a memory or file stream, write your hash inputs to that, and then call the ComputeHash method when you are done.
var myStream = new MemoryStream();
// Blah blah, write to the stream...
myStream.Position = 0;
using (var csp = new MD5CryptoServiceProvider()) {
var myHash = csp.ComputeHash(myStream);
}
EDIT: One possibility to avoid building up massive Streams is calling this over and over in a loop and XORing the results:
// Assuming we had this somewhere:
Byte[] myRunningHash = new Byte[16];
// Later on, from above:
for (var i = 0; i < 16; i++) // MD5 hashes are 16 bytes long. Edit accordingly for other algorithms.
myRunningHash[i] ^= myHash[i];
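EDIT #2: Finally, building on @usr's answer below, you can feed the data in piece by piece. HashCore and HashFinal do the work internally, but they are protected, so call the public wrappers TransformBlock and TransformFinalBlock instead:
using (var csp = new MD5CryptoServiceProvider()) {
// My example here uses a foreach loop, but an
// event-driven stream-like approach is
// probably more what you are doing here.
foreach (byte[] someData in myDataThings)
csp.TransformBlock(someData, 0, someData.Length, null, 0);
// An empty final block completes the hash.
csp.TransformFinalBlock(new byte[0], 0, 0);
var myHash = csp.Hash;
}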
This is the canonical way:
using System;
using System.Security.Cryptography;
using System.Text;
public string CreateHash(string sSourceData)
{
// Create a byte array from the source data
byte[] sourceBytes = Encoding.ASCII.GetBytes(sSourceData);
// Calculate the 16-byte MD5 hash
byte[] hashBytes;
using (var md5 = new MD5CryptoServiceProvider())
{
hashBytes = md5.ComputeHash(sourceBytes);
}
return ByteArrayToHexString(hashBytes);
}
static string ByteArrayToHexString(byte[] arrInput)
{
// Two hex characters per byte
StringBuilder sOutput = new StringBuilder(arrInput.Length * 2);
// Iterate over every byte, including the last one
for (int i = 0; i < arrInput.Length; i++)
{
sOutput.Append(arrInput[i].ToString("X2"));
}
return sOutput.ToString();
}

Encrypt a number to another number of the same length

I need a way to take a 12 digit number and encrypt it to a different 12 digit number (no characters other than 0123456789). Then at a later point I need to be able to decrypt the encrypted number back to the original number.
It is important that it isn't obvious if 2 encrypted numbers are in order. So for instance if I encrypt 000000000001 it should look totally different when encrypted than 000000000002. It doesn't have to be the most secure thing in the world, but the more secure the better.
I've been looking around a lot but haven't found anything that seems to be a perfect fit. From what I've seen some type of XOR might be the easiest way to go, but I'm not sure how to do this.
Thanks,
Jim
I ended up solving this thanks to you guys using "FPE from a prefix cipher" from the wikipedia page http://en.wikipedia.org/wiki/Format-preserving_encryption. I'll give the basic steps below to hopefully be helpful for someone in the future.
NOTE - I'm sure any expert will tell you this is a hack. The numbers seemed random and it was secure enough for what I needed, but if security is a big concern use something else. I'm sure experts can point to holes in what I did. My only goal for posting this is because I would have found it useful when doing my search for an answer to the problem. Also only use this in situations where it couldn't be decompiled.
I was going to post steps, but it's too much to explain. I'll just post my code. This is proof-of-concept code that I still need to clean up, but you'll get the idea. Note my code is specific to a 12-digit number, but adjusting for other lengths should be easy. The maximum is probably 16 digits with the way I did it.
public static string DoEncrypt(string unencryptedString)
{
string encryptedString = "";
unencryptedString = new string(unencryptedString.ToCharArray().Reverse().ToArray());
foreach (char character in unencryptedString.ToCharArray())
{
string randomizationSeed = (encryptedString.Length > 0) ? unencryptedString.Substring(0, encryptedString.Length) : "";
encryptedString += GetRandomSubstitutionArray(randomizationSeed)[int.Parse(character.ToString())];
}
return Shuffle(encryptedString);
}
public static string DoDecrypt(string encryptedString)
{
// Unshuffle the string first to make processing easier.
encryptedString = Unshuffle(encryptedString);
string unencryptedString = "";
foreach (char character in encryptedString.ToCharArray().ToArray())
unencryptedString += GetRandomSubstitutionArray(unencryptedString).IndexOf(int.Parse(character.ToString()));
// Reverse string since encrypted string was reversed while processing.
return new string(unencryptedString.ToCharArray().Reverse().ToArray());
}
private static string Shuffle(string unshuffled)
{
char[] unshuffledCharacters = unshuffled.ToCharArray();
char[] shuffledCharacters = new char[12];
shuffledCharacters[0] = unshuffledCharacters[2];
shuffledCharacters[1] = unshuffledCharacters[7];
shuffledCharacters[2] = unshuffledCharacters[10];
shuffledCharacters[3] = unshuffledCharacters[5];
shuffledCharacters[4] = unshuffledCharacters[3];
shuffledCharacters[5] = unshuffledCharacters[1];
shuffledCharacters[6] = unshuffledCharacters[0];
shuffledCharacters[7] = unshuffledCharacters[4];
shuffledCharacters[8] = unshuffledCharacters[8];
shuffledCharacters[9] = unshuffledCharacters[11];
shuffledCharacters[10] = unshuffledCharacters[6];
shuffledCharacters[11] = unshuffledCharacters[9];
return new string(shuffledCharacters);
}
private static string Unshuffle(string shuffled)
{
char[] shuffledCharacters = shuffled.ToCharArray();
char[] unshuffledCharacters = new char[12];
unshuffledCharacters[0] = shuffledCharacters[6];
unshuffledCharacters[1] = shuffledCharacters[5];
unshuffledCharacters[2] = shuffledCharacters[0];
unshuffledCharacters[3] = shuffledCharacters[4];
unshuffledCharacters[4] = shuffledCharacters[7];
unshuffledCharacters[5] = shuffledCharacters[3];
unshuffledCharacters[6] = shuffledCharacters[10];
unshuffledCharacters[7] = shuffledCharacters[1];
unshuffledCharacters[8] = shuffledCharacters[8];
unshuffledCharacters[9] = shuffledCharacters[11];
unshuffledCharacters[10] = shuffledCharacters[2];
unshuffledCharacters[11] = shuffledCharacters[9];
return new string(unshuffledCharacters);
}
public static string DoPrefixCipherEncrypt(string strIn, byte[] btKey)
{
if (strIn.Length < 1)
return strIn;
// Convert the input string to a byte array
byte[] btToEncrypt = System.Text.Encoding.Unicode.GetBytes(strIn);
RijndaelManaged cryptoRijndael = new RijndaelManaged();
cryptoRijndael.Mode =
CipherMode.ECB;//Doesn't require Initialization Vector
cryptoRijndael.Padding =
PaddingMode.PKCS7;
// Create a key (No IV needed because we are using ECB mode)
ASCIIEncoding textConverter = new ASCIIEncoding();
// Get an encryptor
ICryptoTransform ictEncryptor = cryptoRijndael.CreateEncryptor(btKey, null);
// Encrypt the data...
MemoryStream msEncrypt = new MemoryStream();
CryptoStream csEncrypt = new CryptoStream(msEncrypt, ictEncryptor, CryptoStreamMode.Write);
// Write all data to the crypto stream to encrypt it
csEncrypt.Write(btToEncrypt, 0, btToEncrypt.Length);
csEncrypt.Close();
//flush, close, dispose
// Get the encrypted array of bytes
byte[] btEncrypted = msEncrypt.ToArray();
// Convert the resulting encrypted byte array to string for return
return (Convert.ToBase64String(btEncrypted));
}
private static List<int> GetRandomSubstitutionArray(string number)
{
// Pad number as needed to achieve longer key length and seed more randomly.
// NOTE: I didn't want to make the code here available and it would take too long to clean up, so I'll tell you what I did. I basically took every number seed that was passed in and prefixed and postfixed it with some values to make it 16 characters long and to get a more unique result. For example:
// if (number.Length == 15)
// number = "Y" + number;
// if (number.Length == 14)
// number = "7" + number + "z";
// etc - hey I already said this is a hack ;)
// We pass in the current number as the password to an AES encryption of each of the
// digits 0 - 9. This returns us a set of values that we can then sort and get a
// random order for the digits based on the current state of the number.
Dictionary<string, int> prefixCipherResults = new Dictionary<string, int>();
for (int ndx = 0; ndx < 10; ndx++)
prefixCipherResults.Add(DoPrefixCipherEncrypt(ndx.ToString(), Encoding.UTF8.GetBytes(number)), ndx);
// Order the results and loop through to build your int array.
List<int> group = new List<int>();
foreach (string key in prefixCipherResults.Keys.OrderBy(k => k))
group.Add(prefixCipherResults[key]);
return group;
}
One more way to do simple encryption: subtract each digit from 10 (or from 9 if the number can contain a 0, so that every result stays a single digit).
For example
initial numbers: 123456
10-1 = 9
10-2 = 8
10-3 = 7
etc.
and you will get
987654
You can combine it with XOR for more secure encryption.
What you're talking about is kinda like a one-time pad. A key the same length as the plaintext and then doing some modulo math on each individual character.
A xor B = C
C xor B = A
or in other words
A xor B xor B = A
As long as you don't use the same key B on multiple different inputs (e.g. B has to be unique, every single time you encrypt), then in theory you can never recover the original A without knowing what B was. If you use the same B multiple times, then all bets are off.
comment followup:
You shouldn't end up with more bits afterwards than you started with. xor just flips bits; it doesn't have any carry functionality. Ending up with 6 digits is just odd... As for code:
$plaintext = array(digit1, digit2, digit3, digit4, digit5, digit6);
$key = array(key1, key2, key3, key4, key5, key6);
$ciphertext = array();
# encryption (^ is PHP's bitwise XOR; the xor keyword is a logical operator)
foreach($plaintext as $idx => $char) {
$ciphertext[$idx] = $char ^ $key[$idx];
}
# decryption
foreach($ciphertext as $idx => $char) {
$decrypted[$idx] = $char ^ $key[$idx];
}
Just doing this as an array for simplicity. For actual data you'd work on a per-byte or per-word basis, and just xor each chunk in sequence. You can use a key string shorter than the input, but that makes it easier to reverse engineer the key. In theory, you could use a single byte to do the xor'ing, but then you've just basically achieved the bit-level equivalent of rot-13.
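Since the question needs the result to stay within the digits 0 to 9, and a bitwise XOR of two decimal digits can produce values above 9, here is a sketch of the "modulo math on each individual character" variant in C#: each key digit is added modulo 10 and subtracted again to decrypt. The class name and the sample key are invented for illustration, and the one-time-pad caveat above still applies, so the key has to be as long as the number and must not be reused.

```csharp
using System;
using System.Text;

public static class DigitPad
{
    // Adds each key digit to the matching plaintext digit, modulo 10.
    public static string Encrypt(string digits, string key)
    {
        return Combine(digits, key, +1);
    }

    // Subtracts each key digit again, modulo 10.
    public static string Decrypt(string digits, string key)
    {
        return Combine(digits, key, -1);
    }

    private static string Combine(string digits, string key, int sign)
    {
        if (digits.Length != key.Length)
        {
            throw new ArgumentException("Key must be as long as the number.");
        }
        var result = new StringBuilder(digits.Length);
        for (int i = 0; i < digits.Length; i++)
        {
            int d = digits[i] - '0';
            int k = key[i] - '0';
            if (d < 0 || d > 9 || k < 0 || k > 9)
            {
                throw new ArgumentException("Only the digits 0-9 are allowed.");
            }
            // The extra +10 keeps the value non-negative before the modulo.
            result.Append((char)('0' + (d + sign * k + 10) % 10));
        }
        return result.ToString();
    }
}
```

For example, DigitPad.Encrypt("000000000001", "482915730264") gives "482915730265", and Decrypt with the same key restores the input. With one fixed key, consecutive inputs still produce visibly related outputs, which is exactly why the key must differ for every number.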
For example, you can add the digits of your number to the digits of some constant (214354178963...whatever) and apply the "~" operator (which inverts all bits). This is not secure, but it ensures you can always decrypt your number.
Anyone with Reflector or ildasm will be able to break such an encryption algorithm.
I don't know what your business requirement is, but you have to be aware of that.
If there's enough wriggle-room in the requirements that you can accept 16 hexadecimal digits as the encrypted side, just interpret the 12 digit decimal number as a 64bit plaintext and use a 64 bit block cipher like Blowfish, Triple-DES or IDEA.

C# Create an Auth Token from Guid combined with 40Char Hex string (UUID)

I am writing an ASP.NET MVC app that drives an iPhone application.
I want the iPhone to send me its UUID, which looks like this:
2b6f0cc904d137be2e1730235f5664094b831186
On the server I want to generate a Guid:
466853EB-157D-4795-B4D4-32658D85A0E0
On both the iPhone and the server I need a simple algorithm to combine these two values into an auth token that can be passed around. Both the iPhone and the ASP.NET MVC app need to be able to compute the value over and over based on the UUID and the GUID.
So this needs to be a simple algorithm with no libraries from the .NET Framework.
Full Solution Here
public void Test()
{
var DeviceId = Guid.NewGuid();
var newId = "2b6f0cc904d137be2e1730235f5664094b831186";
var guidBytes = DeviceId.ToByteArray();
var iphoneBytes = StringToByteArray(newId);
byte[] xor = new byte[guidBytes.Length];
for (int i=0;i<guidBytes.Length;i++)
{
xor[i] = (byte) (guidBytes[i] ^ iphoneBytes[i]);
}
var result = ByteArrayToString(xor);
}
public static byte[] StringToByteArray(String hex)
{
int NumberChars = hex.Length;
byte[] bytes = new byte[NumberChars / 2];
for (int i = 0; i < NumberChars; i += 2)
bytes[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
return bytes;
}
public static string ByteArrayToString(byte[] ba)
{
StringBuilder hex = new StringBuilder(ba.Length * 2);
foreach (byte b in ba)
hex.AppendFormat("{0:x2}", b);
return hex.ToString();
}
Well, the iPhone ID looks like a hex string, so converting both to binary and XORing the bytes ought to do it. You could store the result as an array, hex string, or base-64 encoded string as appropriate.
The way you refer to this as an "auth token" is a little concerning, however. Session ids must be unpredictable. You might consider generating an array of cryptographically random data on the server instead of a GUID.
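Generating cryptographically random data on the server could look roughly like this. It is a sketch: the class name, the 32-byte default and the Base64 output format are my own choices, not requirements.

```csharp
using System;
using System.Security.Cryptography;

public static class TokenGenerator
{
    // Fills a buffer from the cryptographic random number generator
    // and returns it as a Base64 string.
    public static string NewToken(int byteCount = 32)
    {
        byte[] bytes = new byte[byteCount];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return Convert.ToBase64String(bytes);
    }
}
```

Unlike Guid.NewGuid(), whose values are meant to be unique rather than unpredictable, the output here is intended to be unguessable.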
Edit
// Convert the server GUID to a byte array.
byte[] guidBytes = serverGuid.ToByteArray();
// Convert the iPhone device ID to an array
byte[] idBytes = StringToByteArray(iPhoneId);
Sadly, it seems .NET doesn't have a built-in method to convert to/from hex strings, but this subject has been covered before: Convert byte array to hex string and vice versa
// Once you've XORed the bytes, convert the result to a string.
string outputString = ByteArrayToString(outputBytes);
Just as a side note, all the "auth token" mechanisms I've worked with (at least) concatenated a constant value (a "secret") with the current time, then hashed them together then sent the hash and the date. The server then reconstructed the hash from the received date and known "secret" and then compared to the received hash (signature).
My point here was "concatenated with date" - this allows the resulting signature to be different every time which in theory should be more secure.
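A rough sketch of that scheme follows. Note one deliberate swap: instead of hashing the plain concatenation of secret and date, it uses HMAC-SHA256 with the secret as the key, which is the usual keyed-hash construction. The names, the timestamp format and the hex output are assumptions made for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class TimestampToken
{
    // Computes the signature the client sends along with the timestamp.
    public static string Compute(string secret, string timestamp)
    {
        using (var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret)))
        {
            byte[] mac = hmac.ComputeHash(Encoding.UTF8.GetBytes(timestamp));
            return BitConverter.ToString(mac).Replace("-", "").ToLowerInvariant();
        }
    }

    // The server recomputes the signature from the received timestamp and
    // its own copy of the secret, then compares without an early exit.
    public static bool Verify(string secret, string timestamp, string token)
    {
        string expected = Compute(secret, timestamp);
        string actual = token ?? string.Empty;
        int diff = expected.Length ^ actual.Length;
        for (int i = 0; i < expected.Length && i < actual.Length; i++)
        {
            diff |= expected[i] ^ actual[i];
        }
        return diff == 0;
    }
}
```

The server would normally also reject timestamps that are too old, otherwise a captured timestamp and signature pair could be replayed indefinitely.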
Rather than XORing, which loses information, you could just concatenate these hex digits.
Why do you even need the GUID? The phone ID is unique, the GUID seems to add no value.
I think you could use a two-way algorithm, meaning one whose output can be decoded again, such as Base64 or AES. (SHA256, by contrast, is a one-way hash and cannot be decoded.)

How do I generate a hashcode from a byte array in C#?

Say I have an object that stores a byte array and I want to be able to efficiently generate a hashcode for it. I've used the cryptographic hash functions for this in the past because they are easy to implement, but they are doing a lot more work than they should to be cryptographically oneway, and I don't care about that (I'm just using the hashcode as a key into a hashtable).
Here's what I have today:
struct SomeData : IEquatable<SomeData>
{
private readonly byte[] data;
public SomeData(byte[] data)
{
if (null == data || data.Length <= 0)
{
throw new ArgumentException("data");
}
this.data = new byte[data.Length];
Array.Copy(data, this.data, data.Length);
}
public override bool Equals(object obj)
{
return obj is SomeData && Equals((SomeData)obj);
}
public bool Equals(SomeData other)
{
if (other.data.Length != data.Length)
{
return false;
}
for (int i = 0; i < data.Length; ++i)
{
if (data[i] != other.data[i])
{
return false;
}
}
return true;
}
public override int GetHashCode()
{
return BitConverter.ToInt32(new MD5CryptoServiceProvider().ComputeHash(data), 0);
}
}
Any thoughts?
dp: You are right that I missed a check in Equals, I have updated it. Using the existing hashcode from the byte array will result in reference equality (or at least that same concept translated to hashcodes).
for example:
byte[] b1 = new byte[] { 1 };
byte[] b2 = new byte[] { 1 };
int h1 = b1.GetHashCode();
int h2 = b2.GetHashCode();
With that code, despite the two byte arrays having the same values within them, they are referring to different parts of memory and will result in (probably) different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.
The hash code of an object does not need to be unique.
The checking rule is:
Are the hash codes equal? Then call the full (slow) Equals method.
Are the hash codes not equal? Then the two items are definitely not equal.
All you want is a GetHashCode algorithm that splits up your collection into roughly even groups - it shouldn't form the key as the HashTable or Dictionary<> will need to use the hash to optimise retrieval.
How long do you expect the data to be? How random? If lengths vary greatly (say for files) then just return the length. If lengths are likely to be similar look at a subset of the bytes that varies.
GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.
Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are more permutations than possible 32 bit integers).
Don't use cryptographic hashes for a hashtable, that's ridiculous/overkill.
Here ya go... Modified FNV Hash in C#
http://bretm.home.comcast.net/hash/6.html
public static int ComputeHash(params byte[] data)
{
unchecked
{
const int p = 16777619;
int hash = (int)2166136261;
for (int i = 0; i < data.Length; i++)
hash = (hash ^ data[i]) * p;
hash += hash << 13;
hash ^= hash >> 7;
hash += hash << 3;
hash ^= hash >> 17;
hash += hash << 5;
return hash;
}
}
Borrowing from the code generated by JetBrains software, I have settled on this function:
public override int GetHashCode()
{
unchecked
{
var result = 0;
foreach (byte b in _key)
result = (result*31) ^ b;
return result;
}
}
The problem with just XOring the bytes is that 3/4 (3 bytes) of the returned value has only 2 possible values (all on or all off). This spreads the bits around a little more.
Setting a breakpoint in Equals was a good suggestion. Adding about 200,000 entries of my data to a Dictionary, sees about 10 Equals calls (or 1/20,000).
Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.
I found interesting results:
I have the class:
public class MyHash : IEquatable<MyHash>
{
public byte[] Val { get; private set; }
public MyHash(byte[] val)
{
Val = val;
}
/// <summary>
/// Test if this Class is equal to another class
/// </summary>
/// <param name="other"></param>
/// <returns></returns>
public bool Equals(MyHash other)
{
if (other.Val.Length == this.Val.Length)
{
for (var i = 0; i < this.Val.Length; i++)
{
if (other.Val[i] != this.Val[i])
{
return false;
}
}
return true;
}
else
{
return false;
}
}
public override int GetHashCode()
{
var str = Convert.ToBase64String(Val);
return str.GetHashCode();
}
}
Then I created a dictionary with keys of type MyHash in order to test how fast I can insert and I can also know how many collisions there are. I did the following
// dictionary we use to check for collisions
Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();
// used to generate random arrays
Random rand = new Random();
var now = DateTime.Now;
for (var j = 0; j < 100; j++)
{
for (var i = 0; i < 5000; i++)
{
// create new array and populate it with random bytes
byte[] randBytes = new byte[byte.MaxValue];
rand.NextBytes(randBytes);
MyHash h = new MyHash(randBytes);
if (checkForDuplicatesDic.ContainsKey(h))
{
Console.WriteLine("Duplicate");
}
else
{
checkForDuplicatesDic[h] = true;
}
}
Console.WriteLine(j);
checkForDuplicatesDic.Clear(); // clear dictionary every 5000 iterations
}
var elapsed = DateTime.Now - now;
Console.Read();
Every time I insert a new item into the dictionary, the dictionary calculates the hash of that object. So you can tell which method is most efficient by placing several of the answers found here into public override int GetHashCode(). The method that was by far the fastest and had the fewest collisions was:
public override int GetHashCode()
{
var str = Convert.ToBase64String(Val);
return str.GetHashCode();
}
that took 2 seconds to execute. The method
public override int GetHashCode()
{
// 7.1 seconds
unchecked
{
const int p = 16777619;
int hash = (int)2166136261;
for (int i = 0; i < Val.Length; i++)
hash = (hash ^ Val[i]) * p;
hash += hash << 13;
hash ^= hash >> 7;
hash += hash << 3;
hash ^= hash >> 17;
hash += hash << 5;
return hash;
}
}
also had no collisions, but it took 7 seconds to execute!
If you are looking for performance, I tested a few hash keys, and I recommend Bob Jenkins' hash function. It is both crazy fast to compute and will give as few collisions as the cryptographic hash you used until now.
I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.
Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.
Generating a good hash is easier said than done. Remember, you're basically representing n bytes of data with m bits of information. The larger your data set and the smaller m is, the more likely you'll get a collision ... two pieces of data resolving to the same hash.
The simplest hash I ever learned was simply XORing all the bytes together. It's easy, faster than most complicated hash algorithms and a halfway decent general-purpose hash algorithm for small data sets. It's the Bubble Sort of hash algorithms really. Since the simple implementation would leave you with 8 bits, that's only 256 hashes ... not so hot. You could XOR chunks instead of individual bytes, but then the algorithm gets much more complicated.
So certainly, the cryptographic algorithms are maybe doing some stuff you don't need ... but they're also a huge step up in general-purpose hash quality. The MD5 hash you're using has 128 bits, which gives 2^128 (about 3.4 × 10^38) possible hashes. The only way you're likely to get something better is to take some representative samples of the data you expect to be going through your application and try various algorithms on it to see how many collisions you get.
So until I see some reason to not use a canned hash algorithm (performance, perhaps?), I'm going to have to recommend you stick with what you've got.
Whether you want a perfect hash function (a different value for every object that is not equal to another) or just a pretty good one is always a performance tradeoff. It normally takes time to compute a good hash function, and if your dataset is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need is to return the Length of the array. Depending on your dataset that might even be OK. If it isn't (say all your arrays are equally long) you can go with something cheap like looking at the first and last value and XORing their values, and then add more complexity as you see fit for your data.
A quick way to see how your hashfunction performs on your data is to add all the data to a hashtable and count the number of times the Equals function gets called, if it is too often you have more work to do on the function. If you do this just keep in mind that the hashtable's size needs to be set bigger than your dataset when you start, otherwise you are going to rehash the data which will trigger reinserts and more Equals evaluations (though possibly more realistic?)
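Counting the Equals calls could be sketched like this. The key type, the counter and the way the test data is generated (the 4 bytes of a loop counter) are invented for illustration; substitute your own data and hash function.

```csharp
using System;
using System.Collections.Generic;

public sealed class CountingKey : IEquatable<CountingKey>
{
    // Incremented on every Equals call so collisions become visible.
    public static long EqualsCalls;

    private readonly byte[] data;
    private readonly Func<byte[], int> hash;

    public CountingKey(byte[] data, Func<byte[], int> hash)
    {
        this.data = data;
        this.hash = hash;
    }

    public bool Equals(CountingKey other)
    {
        EqualsCalls++;
        if (other == null || other.data.Length != data.Length) return false;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] != other.data[i]) return false;
        }
        return true;
    }

    public override bool Equals(object obj) { return Equals(obj as CountingKey); }

    public override int GetHashCode() { return hash(data); }
}

public static class HashQuality
{
    // Inserts `items` distinct keys and reports how often Equals was needed.
    public static long CountEqualsCalls(Func<byte[], int> hash, int items)
    {
        CountingKey.EqualsCalls = 0;
        // Pre-size the table so rehashing does not distort the count.
        var table = new Dictionary<CountingKey, bool>(items * 2);
        for (int i = 0; i < items; i++)
        {
            table[new CountingKey(BitConverter.GetBytes(i), hash)] = true;
        }
        return CountingKey.EqualsCalls;
    }
}
```

Comparing HashQuality.CountEqualsCalls(d => 0, 1000) with the count for a candidate hash function shows how far the candidate is from the constant-hash worst case.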
For some objects (not this one) a quick HashCode can be generated by ToString().GetHashCode(), certainly not optimal, but useful as people tend to return something close to the identity of the object from ToString() and that is exactly what GetHashcode is looking for
Trivia: The worst performance I have ever seen was when someone by mistake returned a constant from GetHashCode, easy to spot with a debugger though, especially if you do lots of lookups in your hashtable
RuntimeHelpers.GetHashCode might help:
From MSDN:
Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.
private int? hashCode;
public override int GetHashCode()
{
if (!hashCode.HasValue)
{
var hash = 0;
for (var i = 0; i < bytes.Length; i++)
{
hash = (hash << 4) + bytes[i];
}
hashCode = hash;
}
return hashCode.Value;
}
