I would like to generate a Guid from a list of other Guids. The generated Guid must have the property that for the same input list of guids the resulting Guid will be the same, no matter how many times I apply the transformation.
Also, it should have the lowest collision possible so different guids at the input generate a different guid at the output.
Can someone help me with this? What should be the best way to go here? It's basically a hash function but over Guids.
You could do some arithmetic on the individual bytes of a Guid - the code below basically adds them up (modulo 256 because of the overflow):
byte[] totalBytes = new byte[16];
foreach (var guid in guids) {
    var bytes = guid.ToByteArray();
    for (int i = 0; i < 16; i++) {
        totalBytes[i] += bytes[i];
    }
}
var totalGuid = new Guid(totalBytes);
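One property of byte-wise addition is that it ignores order: the lists {a, b} and {b, a} produce the same result. If order should matter, or if a lower collision rate is wanted, one alternative is to feed the GUID bytes through a general-purpose hash and keep 16 bytes of the digest. The following is only a sketch under that assumption; the class and method names (GuidCombiner, Combine) are made up for illustration, and the result is not a GUID with valid version bits.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class GuidCombiner
{
    // Hypothetical helper: hashes the GUIDs in list order and keeps
    // the first 16 bytes of the SHA-256 digest as the resulting Guid.
    public static Guid Combine(IEnumerable<Guid> guids)
    {
        using (var sha = SHA256.Create())
        using (var buffer = new MemoryStream())
        {
            foreach (var guid in guids)
            {
                // Each Guid contributes exactly 16 bytes
                buffer.Write(guid.ToByteArray(), 0, 16);
            }
            byte[] digest = sha.ComputeHash(buffer.ToArray());
            byte[] first16 = new byte[16];
            Array.Copy(digest, first16, 16);
            return new Guid(first16);
        }
    }
}
```

If the input should be treated as a set rather than a list, sorting the GUIDs before calling Combine would make the result independent of order again.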
I'm consuming data from a Kinesis Data Stream using a C# client. There are multiple instances of the C# client, one for each shard in the stream, each concurrently retrieves Kinesis Records using the GetRecords method.
The PutRecords call is being performed on an external system which is specifying an IMEI as the 'PartitionKey' value. The HashKey (and hence shard selection) is being performed by the KinesisClient using (from the docs) 'An MD5 hash function ... to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards'.
I have an external system update which is broadcasting IMEI related data to all of the running clients. My challenge is that I need to determine which client is processing the data for the IMEI, hence I need to apply the same 'MD5 hash function' as the KinesisClient is applying to the data. My intention is then to compare the hash to the HashKey range of the shard the client is processing, allowing the client to determine whether it is interested in the IMEI data, or not.
I've tried to work this out and have this code:
byte[] hash;
string imei = "123456789012345";
using (MD5 md5 = MD5.Create()) {
    hash = md5.ComputeHash(Encoding.UTF8.GetBytes(imei));
}
Console.WriteLine(new BigInteger(hash));
However, this gives me a negative value and my shard HashKey ranges are positive.
I really need the C# code which the KinesisClient is using to turn a PartitionKey into a HashKey, but I can't find it. Can anyone help me to work this out please?
UPDATE: I'm making progress after finding the links I mentioned in the comments below. My code currently looks like this:
private static MD5 md5 = MD5.Create();
private static List<BigInteger> powersOfTen = null;
private static BigInteger MAXHASHKEY = BigInteger.Parse("340282366920938463463374607431768211455");
public String CreateExplicitHashKey(String partitionKey) {
    byte[] pkDigest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
    byte[] hashKeyBytes = new byte[16];
    for (int i = 0; i < pkDigest.Length; i++) {
        hashKeyBytes[16 - i - 1] = (byte)((int)pkDigest[i] & 0xFF);
    }
    BigInteger hashKey = new BigInteger(hashKeyBytes, true, false);
    if (powersOfTen == null) {
        powersOfTen = new List<BigInteger>();
        powersOfTen.Add(1);
        for (BigInteger i = 10; i < MAXHASHKEY; i *= i) {
            powersOfTen.Add(i);
        }
    }
    return BuildString(hashKey, powersOfTen.Count - 1).ToString().TrimStart('0');
}
private static string BuildString(BigInteger n, int m) {
    if (m == 0) return n.ToString();
    BigInteger remainder;
    BigInteger quotient = BigInteger.DivRem(n, powersOfTen[m], out remainder);
    return BuildString(quotient, m - 1) + BuildString(remainder, m - 1);
}
I've been able to verify the conversion using the examples offered by Will Haley over here. Now I'm seeking to verify the code to ensure it is doing the exact same conversion as is performed by the Kinesis PutRecord/PutRecords client methods, so I can be 100% confident.
Originally I'd expected to find a function for this in the Kinesis client library, but couldn't find one. I've made this suggestion over on GitHub.
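For comparison, here is a shorter sketch of the same conversion. It assumes, as the byte reversal in the code above does, that the MD5 digest is read as an unsigned big-endian 128-bit integer, and it relies on the BigInteger constructor overload with isUnsigned and isBigEndian (available in .NET Core 2.1 and later). The names PartitionKeyHasher, ToHashKey and IsInRange are made up for illustration; the two range arguments stand for the decimal strings of a shard's hash key range. This is not code taken from the Kinesis client.

```csharp
using System;
using System.Numerics;
using System.Security.Cryptography;
using System.Text;

public static class PartitionKeyHasher
{
    // Hypothetical helper: MD5 of the UTF-8 partition key,
    // interpreted as an unsigned, big-endian 128-bit integer.
    public static BigInteger ToHashKey(string partitionKey)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
            return new BigInteger(digest, isUnsigned: true, isBigEndian: true);
        }
    }

    // Hypothetical helper: checks a hash key against a shard's range,
    // given as inclusive decimal strings.
    public static bool IsInRange(BigInteger hashKey, string startingHashKey, string endingHashKey)
    {
        BigInteger start = BigInteger.Parse(startingHashKey);
        BigInteger end = BigInteger.Parse(endingHashKey);
        return hashKey >= start && hashKey <= end;
    }
}
```

Because the value is unsigned, it is never negative, and ToHashKey(imei).ToString() gives the decimal form directly without building the string digit by digit.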
I have a GUID which I created with Guid.NewGuid(). Now I want to replace the first 32 bits of it with a specific 32-bit integer while keeping the rest as they are.
Is there a function to do this?
You can use ToByteArray() function and then the Guid constructor.
byte[] buffer = Guid.NewGuid().ToByteArray();
buffer[0] = 0;
buffer[1] = 0;
buffer[2] = 0;
buffer[3] = 0;
Guid guid = new Guid(buffer);
Since the Guid struct has a constructor that takes a byte array and can return its current bytes, it's actually quite easy:
//Create a random, new guid
Guid guid = Guid.NewGuid();
Console.WriteLine(guid);
//The original bytes
byte[] guidBytes = guid.ToByteArray();
//Your custom bytes
byte[] first4Bytes = BitConverter.GetBytes((UInt32) 0815);
//Overwrite the first 4 Bytes
Array.Copy(first4Bytes, guidBytes, 4);
//Create new guid based on current values
Guid guid2 = new Guid(guidBytes);
Console.WriteLine(guid2);
Keep in mind, however, that the order of bytes returned from BitConverter depends on your processor architecture (BitConverter.IsLittleEndian) and that the number of possible values of your Guid shrinks by a factor of 2^32 if you use the same number every time (which, depending on your application, might not be as bad as it sounds, since you have 2^128 to begin with).
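To avoid depending on BitConverter's byte order, the four bytes can be written in an explicit order instead. The sketch below assumes the documented layout of Guid.ToByteArray(), where the first four bytes hold the first group in little-endian order, and uses BinaryPrimitives from System.Buffers.Binary. The names GuidPrefix and WithFirst32Bits are made up for illustration.

```csharp
using System;
using System.Buffers.Binary;

public static class GuidPrefix
{
    // Hypothetical helper: returns a copy of source whose first
    // 32 bits (the first group of hex digits) are set to value.
    public static Guid WithFirst32Bits(Guid source, uint value)
    {
        byte[] bytes = source.ToByteArray();
        // Write the value little-endian into bytes 0..3, matching
        // the layout ToByteArray() uses for the first group
        BinaryPrimitives.WriteUInt32LittleEndian(bytes, value);
        return new Guid(bytes);
    }
}
```

With this, the first group of the printed Guid reads as the hex form of the value that was passed in, and the remaining groups are unchanged.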
The question is about replacing bits, but if someone wants to replace first characters of guid directly, this can be done by converting it to string, replacing characters in string and converting back. Note that replaced characters should be valid in hex, i.e. numbers 0 - 9 or letters a - f.
var uniqueGuid = Guid.NewGuid();
var uniqueGuidStr = "1234" + uniqueGuid.ToString().Substring(4);
var modifiedUniqueGuid = Guid.Parse(uniqueGuidStr);
I have a table of orders and I want to give users a unique code for an order whilst hiding the incrementing identity integer primary key because I don't want to give away how many orders have been made.
One easy way of making sure the codes are unique is to use the primary key to determine the code.
So how can I transform an integer into a friendly, say, eight alpha numeric code such that every code is unique?
The easiest way (if you want an alphanumeric code) is to convert the integer primary key to hex, as below, and use PadLeft() to make sure the string has 8 characters. For a 32-bit int key, 8 hex characters are always enough; a 64-bit key can need up to 16.
var uniqueCode = intPrimaryKey.ToString("X").PadLeft(8, '0');
Or, you can create an offset of your primary key, before converting it to HEX, like below:
var uniqueCode = (intPrimaryKey + 999).ToString("X").PadLeft(8, '0');
Assuming the total number of orders being created isn't going to get anywhere near the total number of identifiers in your pool, a reasonably effective technique is to simply generate a random identifier and see if it is used already; continue generating new identifiers until you find one not previously used.
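A minimal sketch of that generate-and-check loop, assuming the codes already in use are available as an in-memory set (in a real system the check would more likely be a unique constraint or a lookup in the orders table). OrderCodeGenerator, NewCode and the alphabet are made up for illustration; RandomNumberGenerator.GetInt32 requires .NET Core 3.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class OrderCodeGenerator
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Hypothetical helper: keeps drawing random codes until one is
    // found that is not yet in usedCodes, then records and returns it.
    public static string NewCode(ISet<string> usedCodes, int length = 8)
    {
        while (true)
        {
            var chars = new char[length];
            for (int i = 0; i < length; i++)
            {
                // Pick one alphabet character uniformly at random
                chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
            }
            string code = new string(chars);
            // ISet.Add returns false when the code already exists
            if (usedCodes.Add(code))
            {
                return code;
            }
        }
    }
}
```

With 36 characters and a length of 8 there are 36^8 (about 2.8 × 10^12) possible codes, so retries should be rare as long as the number of orders stays far below that.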
A quick and easy way to do this is to have a guid column that has a default value of
left(newid(),8)
This solution will usually give you a unique value for each row. But it keeps only 32 bits of the GUID, so with a large number of orders collisions become likely (roughly a 50% chance of at least one by about 77,000 rows). In that case you should use the full newid() value instead.
I would just use MD5 for this. MD5 offers enough "uniqueness" for a small subset of integers that represent your customer orders.
For an example see this answer. You will need to adjust input parameter from string to int (or alternatively just call ToString on your number and use the code as-is).
If you would like something that would be difficult to trace and you don't mind it being 16 characters, you could use something like this, which includes some random numbers and mixes the byte positions of the original input with them (EDITED to make it a bit more untraceable, by XOR-ing with the generated random numbers):
public static class OrderIdRandomizer
{
    private static readonly Random _rnd = new Random();
    public static string GenerateFor(int orderId)
    {
        var rndBytes = new byte[4];
        _rnd.NextBytes(rndBytes);
        var bytes = new byte[]
        {
            (byte)rndBytes[0],
            (byte)(((byte)(orderId >> 8)) ^ rndBytes[0]),
            (byte)(((byte)(orderId >> 24)) ^ rndBytes[1]),
            (byte)rndBytes[1],
            (byte)(((byte)(orderId >> 16)) ^ rndBytes[2]),
            (byte)rndBytes[2],
            (byte)(((byte)(orderId)) ^ rndBytes[3]),
            (byte)rndBytes[3],
        };
        return string.Concat(bytes.Select(b => b.ToString("X2")));
    }
    public static int ReconstructFrom(string generatedId)
    {
        if (generatedId == null || generatedId.Length != 16)
            throw new InvalidDataException("Invalid generated order id");
        var bytes = new byte[8];
        for (int i = 0; i < 8; i++)
            bytes[i] = byte.Parse(generatedId.Substring(i * 2, 2), System.Globalization.NumberStyles.HexNumber);
        return (int)(
            ((bytes[2] ^ bytes[3]) << 24) |
            ((bytes[4] ^ bytes[5]) << 16) |
            ((bytes[1] ^ bytes[0]) << 8) |
            ((bytes[6] ^ bytes[7])));
    }
}
Usage:
var obfuscatedId = OrderIdRandomizer.GenerateFor(123456);
Console.WriteLine(obfuscatedId);
Console.WriteLine(OrderIdRandomizer.ReconstructFrom(obfuscatedId));
Disadvantage: If the algorithm is known, it is obviously easy to break.
Advantage: It is completely custom, i.e. not an established algorithm like MD5, so it may be harder to guess or crack for someone who does not know which algorithm is being used.
I have to generate 16-character strings, about 100,000 a month. They should be such that they don't repeat across multiple runs (once a month, every month). What is the best method to achieve this? Is using hash functions a good idea?
The string can have A-Z and 0-9 only.
This is to be done using C#.
EDIT: The strings should be random. So, keeping a simple counter is not an option.
Since you're limited to 16 alphanumeric characters, a GUID is probably not an option: it needs all 128 bits to be unique, and although 128 bits is 16 bytes, those bytes will not necessarily fit the alphanumeric constraint (written as hex text, a GUID is 32 characters).
You could have a simple counter and return the last 64 bits of an MD5 hash and check for uniqueness each time.
// Parse out hex digits in calling code, e.g. with ToString("X16")
static long NextHash(HashSet<long> hashes, int count)
{
    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        byte[] digest = md5.ComputeHash(IntToArray(count));
        // Take the last 8 of the 16 digest bytes (the last 64 bits)
        long l = BitConverter.ToInt64(digest, 8);
        // HashSet.Add returns false if the value was already present
        return hashes.Add(l) ? l : -1; // check for -1 in calling code for failure
    }
}
static byte[] IntToArray(int i)
{
    byte[] bytes = new byte[4];
    for (int j = 0; j < 4; j++)
    {
        bytes[j] = (byte)i; // lowest byte first
        i >>= 8;
    }
    return bytes;
}
You could do something similar for GUIDs, but I don't know how likely collisions are when you're only looking at a substring.
MD5 hashes have the advantage of "appearing" more random, if that's at all relevant.
You have not specified the language.
PHP,
http://php.net/manual/en/function.uniqid.php
echo rand(0,999).uniqid();
rand(0,999) = up to 3 random digits
uniqid() = 13 characters (time-based)
I don't know if that satisfies you, but I came up with something like this:
static List<string> generate(int count)
{
    List<string> strings = new List<string>();
    while (strings.Count < count)
    {
        Guid g = Guid.NewGuid();
        string GuidString = g.ToString();
        GuidString = GuidString.Replace("-", "");
        GuidString = GuidString.Remove(16);
        if (!strings.Contains(GuidString))
            strings.Add(GuidString);
    }
    return strings;
}
Say I have an object that stores a byte array and I want to be able to efficiently generate a hashcode for it. I've used the cryptographic hash functions for this in the past because they are easy to implement, but they are doing a lot more work than they should to be cryptographically oneway, and I don't care about that (I'm just using the hashcode as a key into a hashtable).
Here's what I have today:
struct SomeData : IEquatable<SomeData>
{
    private readonly byte[] data;
    public SomeData(byte[] data)
    {
        if (null == data || data.Length <= 0)
        {
            throw new ArgumentException("data");
        }
        this.data = new byte[data.Length];
        Array.Copy(data, this.data, data.Length);
    }
    public override bool Equals(object obj)
    {
        return obj is SomeData && Equals((SomeData)obj);
    }
    public bool Equals(SomeData other)
    {
        if (other.data.Length != data.Length)
        {
            return false;
        }
        for (int i = 0; i < data.Length; ++i)
        {
            if (data[i] != other.data[i])
            {
                return false;
            }
        }
        return true;
    }
    public override int GetHashCode()
    {
        return BitConverter.ToInt32(new MD5CryptoServiceProvider().ComputeHash(data), 0);
    }
}
Any thoughts?
dp: You are right that I missed a check in Equals, I have updated it. Using the existing hashcode from the byte array will result in reference equality (or at least that same concept translated to hashcodes).
for example:
byte[] b1 = new byte[] { 1 };
byte[] b2 = new byte[] { 1 };
int h1 = b1.GetHashCode();
int h2 = b2.GetHashCode();
With that code, despite the two byte arrays having the same values within them, they are referring to different parts of memory and will result in (probably) different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.
The hash code of an object does not need to be unique.
The checking rule is:
Are the hash codes equal? Then call the full (slow) Equals method.
Are the hash codes not equal? Then the two items are definitely not equal.
All you want is a GetHashCode algorithm that splits up your collection into roughly even groups. The hash should not itself form the key; the Hashtable or Dictionary<> only uses it to optimise retrieval.
How long do you expect the data to be? How random? If lengths vary greatly (say for files) then just return the length. If lengths are likely to be similar look at a subset of the bytes that varies.
GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.
Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are more permutations than possible 32 bit integers).
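As a rough illustration of these rules for byte arrays, here is a sketch of an equality comparer that can be passed to a Dictionary. It assumes a plain 32-bit FNV-1a hash is good enough for the data; ByteArrayComparer is a made-up name, and the Span-based SequenceEqual call requires .NET Core 2.1 or later.

```csharp
using System;
using System.Collections.Generic;

public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    private const uint OffsetBasis = 2166136261u;
    private const uint Prime = 16777619u;

    // Slow, exact check: only called when hash codes match
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.AsSpan().SequenceEqual(y);
    }

    // Fast check: equal contents always give equal hash codes,
    // different contents usually (not always) give different ones
    public int GetHashCode(byte[] data)
    {
        unchecked
        {
            uint hash = OffsetBasis;
            foreach (byte b in data)
            {
                hash = (hash ^ b) * Prime; // FNV-1a: XOR, then multiply
            }
            return (int)hash;
        }
    }
}
```

Used as new Dictionary<byte[], string>(new ByteArrayComparer()), two separate arrays with the same contents then refer to the same entry.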
Don't use cryptographic hashes for a hashtable, that's ridiculous/overkill.
Here ya go... Modified FNV Hash in C#
http://bretm.home.comcast.net/hash/6.html
public static int ComputeHash(params byte[] data)
{
    unchecked
    {
        const int p = 16777619;
        int hash = (int)2166136261;
        for (int i = 0; i < data.Length; i++)
            hash = (hash ^ data[i]) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash;
    }
}
Borrowing from the code generated by JetBrains software, I have settled on this function:
public override int GetHashCode()
{
    unchecked
    {
        var result = 0;
        foreach (byte b in _key)
            result = (result*31) ^ b;
        return result;
    }
}
The problem with just XOring the bytes is that 3/4 (3 bytes) of the returned value has only 2 possible values (all on or all off). This spreads the bits around a little more.
Setting a breakpoint in Equals was a good suggestion. Adding about 200,000 entries of my data to a Dictionary, sees about 10 Equals calls (or 1/20,000).
Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.
I found interesting results:
I have the class:
public class MyHash : IEquatable<MyHash>
{
    public byte[] Val { get; private set; }
    public MyHash(byte[] val)
    {
        Val = val;
    }
    /// <summary>
    /// Test if this Class is equal to another class
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        if (other.Val.Length == this.Val.Length)
        {
            for (var i = 0; i < this.Val.Length; i++)
            {
                if (other.Val[i] != this.Val[i])
                {
                    return false;
                }
            }
            return true;
        }
        else
        {
            return false;
        }
    }
    public override int GetHashCode()
    {
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();
    }
}
Then I created a dictionary with keys of type MyHash in order to test how fast I can insert and I can also know how many collisions there are. I did the following
// dictionary we use to check for collisions
Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();
// used to generate random arrays
Random rand = new Random();
var now = DateTime.Now;
for (var j = 0; j < 100; j++)
{
    for (var i = 0; i < 5000; i++)
    {
        // create new array and populate it with random bytes
        byte[] randBytes = new byte[byte.MaxValue];
        rand.NextBytes(randBytes);
        MyHash h = new MyHash(randBytes);
        if (checkForDuplicatesDic.ContainsKey(h))
        {
            Console.WriteLine("Duplicate");
        }
        else
        {
            checkForDuplicatesDic[h] = true;
        }
    }
    Console.WriteLine(j);
    checkForDuplicatesDic.Clear(); // clear dictionary every 5000 iterations
}
var elapsed = DateTime.Now - now;
Console.Read();
Every time I insert a new item into the dictionary, the dictionary calculates the hash of that object. So you can tell which method is most efficient by placing the various answers found here in the method public override int GetHashCode(). The method that was by far the fastest and had the fewest collisions was:
public override int GetHashCode()
{
    var str = Convert.ToBase64String(Val);
    return str.GetHashCode();
}
that took 2 seconds to execute. The method
public override int GetHashCode()
{
    // 7.1 seconds
    unchecked
    {
        const int p = 16777619;
        int hash = (int)2166136261;
        for (int i = 0; i < Val.Length; i++)
            hash = (hash ^ Val[i]) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash;
    }
}
had no collisions also but it took 7 seconds to execute!
If you are looking for performance, I tested a few hash keys, and I recommend Bob Jenkins' hash function. It is both crazy fast to compute and will give as few collisions as the cryptographic hash you used until now.
I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.
Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.
Generating a good hash is easier said than done. Remember, you're basically representing n bytes of data with m bits of information. The larger your data set and the smaller m is, the more likely you'll get a collision ... two pieces of data resolving to the same hash.
The simplest hash I ever learned was simply XORing all the bytes together. It's easy, faster than most complicated hash algorithms and a halfway decent general-purpose hash algorithm for small data sets. It's the Bubble Sort of hash algorithms really. Since the simple implementation would leave you with 8 bits, that's only 256 hashes ... not so hot. You could XOR chunks instead of individual bytes, but then the algorithm gets much more complicated.
So certainly, the cryptographic algorithms are maybe doing some stuff you don't need ... but they're also a huge step up in general-purpose hash quality. The MD5 hash you're using has 128 bits, which gives 2^128 (about 3.4 × 10^38) possible hashes. The only way you're likely to get something better is to take some representative samples of the data you expect to be going through your application and try various algorithms on it to see how many collisions you get.
So until I see some reason to not use a canned hash algorithm (performance, perhaps?), I'm going to have to recommend you stick with what you've got.
Whether you want a perfect hash function (a different value for each object that is not equal) or just a pretty good one is always a performance trade-off: it normally takes time to compute a good hash function, and if your dataset is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need is to return the Length of the array. Depending on your dataset that might even be OK. If it isn't (say all your arrays are equally long) you can go with something cheap like looking at the first and last value and XORing their values, and then add more complexity as you see fit for your data.
A quick way to see how your hashfunction performs on your data is to add all the data to a hashtable and count the number of times the Equals function gets called, if it is too often you have more work to do on the function. If you do this just keep in mind that the hashtable's size needs to be set bigger than your dataset when you start, otherwise you are going to rehash the data which will trigger reinserts and more Equals evaluations (though possibly more realistic?)
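One way to do that counting without touching the type itself is to wrap whatever equality comparer the hashtable uses. This is only a sketch; CountingComparer and EqualsCalls are made-up names, and it assumes the collection is a Dictionary constructed with this comparer.

```csharp
using System;
using System.Collections.Generic;

public sealed class CountingComparer<T> : IEqualityComparer<T>
{
    private readonly IEqualityComparer<T> _inner;

    // Number of times the dictionary fell back to the full Equals check
    public int EqualsCalls { get; private set; }

    public CountingComparer(IEqualityComparer<T> inner)
    {
        _inner = inner;
    }

    public bool Equals(T x, T y)
    {
        EqualsCalls++;
        return _inner.Equals(x, y);
    }

    // Hash codes are passed through unchanged
    public int GetHashCode(T obj)
    {
        return _inner.GetHashCode(obj);
    }
}
```

For example, create the dictionary as new Dictionary<SomeData, bool>(expectedCount, new CountingComparer<SomeData>(EqualityComparer<SomeData>.Default)), add the data, and read EqualsCalls afterwards. Passing the expected count as the initial capacity follows the advice above about sizing the table up front.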
For some objects (not this one) a quick hash code can be generated by ToString().GetHashCode(). It is certainly not optimal, but it is useful, as people tend to return something close to the identity of the object from ToString(), and that is exactly what GetHashCode is looking for.
Trivia: The worst performance I have ever seen was when someone by mistake returned a constant from GetHashCode. It is easy to spot with a debugger though, especially if you do lots of lookups in your hashtable.
RuntimeHelpers.GetHashCode might help:
From MSDN:
Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.
private int? hashCode;
public override int GetHashCode()
{
    if (!hashCode.HasValue)
    {
        var hash = 0;
        for (var i = 0; i < bytes.Length; i++)
        {
            hash = (hash << 4) + bytes[i];
        }
        hashCode = hash;
    }
    return hashCode.Value;
}