C# Dictionary large performance hit when key is struct [duplicate] - c#

I created two structures of type TheKey, k1={17,1375984} and k2={17,1593144}.
Obviously the pointers in the second fields are different, but both get the same hash code, 346948941.
I expected to see different hash codes. See the code below.
struct TheKey
{
    public int id;
    public string Name;
    public TheKey(int id, string name)
    {
        this.id = id;
        Name = name;
    }
}
static void Main() {
    // assign two different strings to avoid interning
    var k1 = new TheKey(17, "abc");
    var k2 = new TheKey(17, new string(new[] { 'a', 'b', 'c' }));
    Dump(k1); // prints the layout of a structure
    Dump(k2);
    Console.WriteLine("hash1={0}", k1.GetHashCode());
    Console.WriteLine("hash2={0}", k2.GetHashCode());
}
unsafe static void Dump<T>(T s) where T : struct
{
    byte[] b = new byte[8];
    fixed (byte* pb = &b[0])
    {
        IntPtr ptr = new IntPtr(pb);
        Marshal.StructureToPtr(s, ptr, true);
        int* p1 = (int*)(&pb[0]); // first 32 bits
        int* p2 = (int*)(&pb[4]);
        Console.WriteLine("{0}", *p1);
        Console.WriteLine("{0}", *p2);
    }
}
Output:
17
1375984
17
1593144
hash1=346948941
hash2=346948941

It is a lot more complicated than meets the eye. For starters, give k2 a completely different string. Notice how the hash code is still the same:
var k1 = new TheKey(17, "abc");
var k2 = new TheKey(17, "def");
System.Diagnostics.Debug.Assert(k1.GetHashCode() == k2.GetHashCode());
Which is quite valid, the only requirement for a hash code is that the same value produces the same hash code. Different values don't have to produce different hash codes. That's not physically possible since a .NET hash code can only represent 4 billion distinct values.
Calculating the hash code for a struct is tricky business. The first thing the CLR does is check if the structure contains any reference type references or has gaps between the fields. A reference requires special treatment because the reference value is random. It is a pointer whose value changes when the garbage collector compacts the heap. Gaps in the structure layout are created because of alignment. A struct with a byte and an int has a 3 byte gap between the two fields.
If neither is the case then all the bits in the structure value are significant. The CLR quickly calculates the hash by xor-ing the bits, 32 at a time. This is a 'good' hash, all the fields in the struct participate in the hash code.
If the struct has fields of a reference type or has gaps then another approach is needed. The CLR iterates the fields of the struct and goes looking for one that is usable to generate a hash. A usable one is a field of a value type or an object reference that isn't null. As soon as it finds one, it takes the hash of that field, xors it with the method table pointer and quits.
In other words, only one field in the structure participates in the hash code calculation. Which is your case, only the id field is used. Which is why the string member value doesn't matter.
This is an obscure factoid that's obviously important to be aware of if you ever leave it up to the CLR to generate hash codes for a struct. By far the best thing to do is to just never do this. If you have to, then be sure to order the fields in the struct so that the first field gives you the best hash code. In your case, just swap the id and Name fields.
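One way to avoid leaving it up to the CLR is to write the equality members by hand. The following is a minimal sketch, not the original poster's code: the type name KeyWithExplicitHash is hypothetical, it assumes ordinal string comparison is what is wanted, and the multiply-and-add combiner (with 17 and 31) is just a conventional choice.

```csharp
using System;

// Hypothetical variant of TheKey that does not rely on ValueType.GetHashCode.
public readonly struct KeyWithExplicitHash : IEquatable<KeyWithExplicitHash>
{
    public int Id { get; }
    public string Name { get; }

    public KeyWithExplicitHash(int id, string name)
    {
        Id = id;
        Name = name;
    }

    // Both fields take part in equality; strings are compared by content.
    public bool Equals(KeyWithExplicitHash other) =>
        Id == other.Id && string.Equals(Name, other.Name, StringComparison.Ordinal);

    public override bool Equals(object obj) =>
        obj is KeyWithExplicitHash other && Equals(other);

    // Both fields take part in the hash code as well.
    public override int GetHashCode()
    {
        unchecked // overflow is fine here, the value just wraps around
        {
            int hash = 17;
            hash = hash * 31 + Id;
            hash = hash * 31 + (Name == null ? 0 : Name.GetHashCode());
            return hash;
        }
    }
}
```

With this in place, two keys with the same id and the same string content are equal and share a hash code, while the Name field now influences the result.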
Another interesting tidbit, the 'good' hash calculation code has a bug. It will use the fast algorithm when the structure contains a System.Decimal. Problem is, the bits of a Decimal are not representative of its numeric value. Try this:
struct Test { public decimal value; }
static void Main() {
    var t1 = new Test() { value = 1.0m };
    var t2 = new Test() { value = 1.00m };
    if (t1.GetHashCode() != t2.GetHashCode())
        Console.WriteLine("gack!");
}
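The reason the bits of a Decimal are not representative can be seen with decimal.GetBits. This is a small sketch (the class and method names are made up for illustration): 1.0m and 1.00m compare equal, yet they carry different scale factors and therefore different raw bits.

```csharp
using System;

public static class DecimalBitsDemo
{
    // True when the two decimals have identical 128-bit representations.
    public static bool SameBits(decimal a, decimal b)
    {
        int[] x = decimal.GetBits(a);
        int[] y = decimal.GetBits(b);
        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i]) return false;
        }
        return true;
    }

    public static void Run()
    {
        Console.WriteLine(1.0m == 1.00m);          // numeric comparison
        Console.WriteLine(SameBits(1.0m, 1.00m));  // raw representation comparison
    }
}
```

Any hash that is computed directly over the raw bits can therefore disagree for two values that Equals treats as the same.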

k1 and k2 contain the same values. Why are you surprised that they have the same hash code? It is contracted to return the same value for two objects that compare as equal.

Hash codes are created from the state (the values inside) of the structure / object, not from where it is stored. And according to this: Why is ValueType.GetHashCode() implemented like it is?, the default behaviour of GetHashCode for value types, which a struct is, is to return a hash based on the values. And I believe that is the correct behaviour, especially for structures, which are supposed to be immutable.

Related

Why does it appear that this string is stored inline by value in an explicit layout class or struct?

I have been doing some extremely unsafe and slightly useless messing with the System.Runtime.CompilerServices.Unsafe MSIL package that allows you to do a lot of things with pointers you can't in C#. I created an extension method that returns a ref byte, with that byte being the start of the Method Table pointer at the start of the object, which allows you to use any object in a fixed statement, taking a byte pointer to the start of the object:
public static unsafe ref byte GetPinnableReference(this object obj)
{
    return ref *(byte*)*(void**)Unsafe.AsPointer(ref obj);
}
I then decided to test it, using this code:
[StructLayout(LayoutKind.Explicit, Pack = 0)]
public class Foo
{
    [FieldOffset(0)]
    public string Name = "THIS IS A STRING";
}
[StructLayout(LayoutKind.Explicit, Pack = 0)]
public struct Bar
{
    [FieldOffset(0)]
    public string Name;
}
And then in the method
var foo = new Foo();
//var foo = new Bar { Name = "THIS IS A STRING" };
fixed (byte* objPtr = foo)
{
    char* stringPtr = (char*)(objPtr + (foo is Foo ? 36 : 12));
    for (var i = 0; i < foo.Name.Length; i++)
    {
        Console.Write(*(stringPtr + i /* Char offset */));
    }
    Console.WriteLine();
}
Console.ReadKey();
The really weird thing about this is that it successfully prints "THIS IS A STRING". The code works like this:
Get a byte pointer, objPtr, to the very start of the object
Add 16 to get to the actual data
Add another 16 to get past the string header to the string's actual data
Add 4 to skip the first 4 bytes of the string, which are the int _stringLength (exposed to us as Length property)
Interpret the result as a char pointer
EDIT: Important point - when switching foo to type Bar, I only add 12 rather than 36 bytes (36 = 16 + 16 + 4). Why does it only have 8 bytes of header in the struct rather than 32 in the class? It would make sense that the struct has a smaller header (no syncblk, I believe), but then why doesn't the string still have a 16 byte header? I would expect the offset to be 8 + 16 + 4 (28) rather than just 8 + 4 (12).
However, this reasoning has a big flaw. It assumes the string is stored inline inside the class/struct. However, strings are reference types, and to my knowledge only a reference to them is stored inside the object. In particular, I thought reference types could only be put on the heap, and as this struct is a local variable I thought it was on the stack. If the string weren't stored inline, the code to get the stringPtr would surely look something more like this
byte** stringRefPtr = (byte**)(objPtr + 16);
char* stringPtr = (char*)(*stringRefPtr + 20);
where you take the string reference as a byte** and then use it to get to the chars. And this still wouldn't make sense if the string internally was a char[] (I'm not sure if it is)
So why does this work, and print the string, even though it mistakenly assumes string is stored inline, when string is a reference type?
NOTE: Requires .NET Core 2.0+ with System.Runtime.CompilerServices.Unsafe nuGet package, and C# 7.3+.
Because strings are indeed stored inline. The problem with your assumption is that strings are not normal objects but handled as a special case by the CLR (probably for performance reasons).
And as for the objects, since the string is the only member this would naturally be the most efficient way to allocate the memory. Try adding more members after your string member and your code would break.
Here are a few references on how strings are stored in the CLR
https://mattwarren.org/2016/05/31/Strings-and-the-CLR-a-Special-Relationship/
https://codeblog.jonskeet.uk/2011/04/05/of-memory-and-strings/
Edit: I didn’t check, but I believe your reasoning behind the offsets is off. 36 = 24 (size of object) + 8 (string header?) + 4 (size of int), while for the struct the 24 bytes become 0 as it has no header.

Is the memory used by an array of ints equivalent to that of an array of structs having just one int?

Considering the next struct...
struct Cell
{
    int Value;
}
and the next matrix definitions
var MatrixOfInts = new int[1000,1000];
var MatrixOfCells = new Cell[1000,1000];
Which one of the matrices will use less memory space? Or are they equivalent (byte for byte)?
Both are the same size, because structs are treated like any other value type and are allocated in place on the heap as part of the array.
long startMemorySize2 = GC.GetTotalMemory(true);
var MatrixOfCells = new Cell[1000, 1000];
long matrixOfCellSize = GC.GetTotalMemory(true);
long startMemorySize = GC.GetTotalMemory(true);
var MatrixOfInts = new int[1000, 1000];
long matrixOfIntSize = GC.GetTotalMemory(true);
Console.WriteLine("Int Matrix Size:{0}. Cell Matrix Size:{1}",
    matrixOfIntSize - startMemorySize, matrixOfCellSize - startMemorySize2);
Here's some fun reading from Jeffrey Richter on how arrays are allocated http://msdn.microsoft.com/en-us/magazine/cc301755.aspx
By using the sizeof operator in C# and executing the following code (under Mono 3.10.0) I get the following results:
struct Cell
{
    int Value;
}
public static void Main(string[] args)
{
    unsafe
    {
        // result is: 4
        var intSize = sizeof(int);
        // result is: 4
        var structSize = sizeof(Cell);
    }
}
So it looks like an integer and a struct storing an integer consume the same amount of memory. I would therefore assume that the arrays also require an equal amount of memory.
In an array with value-type elements, all of the elements are required to be of the exact same type. The object holding the array needs to store information about the type of elements contained therein, but that information is only stored once per array, rather than once per element.
Note that because arrays receive special handling in the .NET Framework (compared to other collection types) arrays of a structure type will allow elements of the structures contained therein to be acted upon "in-place". As a consequence, if one can limit oneself to storing a structure within an array (rather than some other collection type) and can minimize unnecessary copying of struct instances, it is possible to operate efficiently with structures of almost any size. If one needs to hold a collection of things, each of which will have associated with it four Int64 values and four Int32 values (a total of 48 bytes), using an array of eight-element exposed-field structures may be more efficient and semantically cleaner than representing each thing using four elements from an Int64[] and four elements from an Int32[], or using an array of references to unshared mutable class objects.
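The "in-place" point can be sketched as follows. This is an illustration only: the Cell here exposes a public field (unlike the Cell in the question), and the InPlaceDemo class is a made-up name. An array element access yields the element itself, whereas the List<T> indexer returns a copy of the struct.

```csharp
using System.Collections.Generic;

public struct Cell
{
    public int Value; // exposed field, assumed for this illustration
}

public static class InPlaceDemo
{
    // Returns { value seen in the array, value seen in the list }
    // after one increment attempt on each.
    public static int[] Run()
    {
        var array = new Cell[1];
        array[0].Value++;        // the array element itself is modified

        var list = new List<Cell> { new Cell() };
        Cell copy = list[0];     // the indexer hands back a copy of the struct
        copy.Value++;            // only the local copy changes
        // list[0].Value++;      // does not compile: the indexer result is not a variable

        return new[] { array[0].Value, list[0].Value };
    }
}
```

Run returns 1 for the array and 0 for the list, which is why arrays are the natural home for large exposed-field structures.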

How can I generate a unique hashcode for a string

Is there any function, that gives me the same hashcode for the same string?
I'm having trouble when creating 2 different strings (but with the same content), their hashcode is different and therefore is not correctly used in a Dictionary.
I would like to know what GetHashCode() function the Dictionary uses when the key is a string.
I'm building mine like this:
public override int GetHashCode()
{
    String str = "Equip" + Equipment.ToString() + "Destiny" + Destiny.ToString();
    return str.GetHashCode();
}
But it's producing different results for every instance that uses this code, despite the content of the string being the same.
Your title asks for one thing (unique hash codes), but your body asks for something different (consistent hash codes).
You claim:
I'm having trouble when creating 2 different strings (but with the same content), their hashcode is different and therefore is not correctly used in a Dictionary.
If the strings genuinely have the same content, that simply won't occur. Your diagnostics are wrong somehow. Check for non-printable characters in your strings, e.g. trailing Unicode "null" characters:
string text1 = "Hello";
string text2 = "Hello\0";
Here text1 and text2 may print the same way in some contexts, but I'd hope they'd have different hash codes.
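A quick way to check both points is sketched below (StringHashDemo and Copy are names invented for this example). Two distinct string objects with the same characters give the same hash code within one process, while "Hello" and "Hello\0" are simply different strings.

```csharp
using System;

public static class StringHashDemo
{
    // Builds a new string instance with the same characters as the input.
    public static string Copy(string s) => new string(s.ToCharArray());

    public static void Run()
    {
        string a = "Hello";
        string b = Copy(a);      // different object, same content
        string c = "Hello\0";    // one character longer than a

        Console.WriteLine(ReferenceEquals(a, b));
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());
        Console.WriteLine(a == c);
        Console.WriteLine("{0} vs {1}", a.Length, c.Length);
    }
}
```

Comparing Length, or dumping the individual characters as numbers, is usually the fastest way to spot an invisible trailing character.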
Note that hash codes are not guaranteed to be unique and can't be... there are only 2^32 possible hash codes returned from GetHashCode, but more than 2^32 possible different strings.
Also note that the same content is not guaranteed to produce the same hash code on different runs, even of the same executable - you should not be persisting a hash code anywhere. For example, I believe the 32-bit .NET 4 and 64-bit .NET 4 CLRs produce different hash codes for strings. However, your claim that the values aren't being stored correctly in a Dictionary suggests that this is within a single process - where everything should be consistent.
As noted in comments, it's entirely possible that you're overriding Equals incorrectly. I'd also suggest that your approach to building a hash code isn't great. We don't know what the types of Equipment and Destiny are, but I'd suggest you should use something like:
public override int GetHashCode()
{
    int hash = 23;
    hash = hash * 31 + Equipment.GetHashCode();
    hash = hash * 31 + Destiny.GetHashCode();
    return hash;
}
That's the approach I usually use for hash codes. Equals would then look something like:
public override bool Equals(object other)
{
    // Reference equality check
    if (this == other)
    {
        return true;
    }
    if (other == null)
    {
        return false;
    }
    // Details of this might change depending on your situation; we'd
    // need more information
    if (other.GetType() != GetType())
    {
        return false;
    }
    // Adjust for your type...
    Foo otherFoo = (Foo) other;
    // You may want to change the equality used here based on the
    // types of Equipment and Destiny
    return this.Destiny == otherFoo.Destiny &&
           this.Equipment == otherFoo.Equipment;
}

How to improve hashing for short strings to avoid collisions?

I am having a problem with hash collisions using short strings in .NET4.
EDIT: I am using the built-in string hashing function in .NET.
I'm implementing a cache using objects that store the direction of a conversion like this
public class MyClass
{
    private string _from;
    private string _to;
    // More code here....
    public MyClass(string from, string to)
    {
        this._from = from;
        this._to = to;
    }
    public override int GetHashCode()
    {
        return string.Concat(this._from, this._to).GetHashCode();
    }
    public bool Equals(MyClass other)
    {
        return this.To == other.To && this.From == other.From;
    }
    public override bool Equals(object obj)
    {
        if (obj == null) return false;
        if (this.GetType() != obj.GetType()) return false;
        return Equals(obj as MyClass);
    }
}
This is direction dependent and the from and to are represented by short strings like "AAB" and "ABA".
I am getting sparse hash collisions with these small strings.
The problem is that the hashes of too many of my small strings collide: for example, "AABABA" collides with its reverse, "ABAAAB". (Note that these are not real examples; I have no idea if AAB and ABA actually cause collisions!)
I have tried something simple like adding a salt (did not work), and I have gone heavy duty like implementing MD5 (which works, but is MUCH slower).
I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code?
This works but I don't know how dependable it is with my various 3-character strings.
How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?
EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed on MyClass as stubbed out above. If I replace the GetHashCode in the code above with something simple like @JonSkeet's code from the link I posted:
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
Everything functions as expected.
It's also worth noting that in this particular use-case the cache is not used in a multi-threaded environment so there is no race condition.
EDIT: I should also note that this misbehavior is platform dependent. It works as intended on my fully updated Win7x64 machine but does not behave properly on a non-updated Win7x64 machine. I don't know the extent of the missing updates, but I know it doesn't have Win7 SP1... so I would assume there may also be a framework SP or update it's missing as well.
EDIT: As suggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others, and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.
Are you sure that collisions are what causes the problems? When you say
I finally discovered what was causing this bug
Do you mean some slowness in your code, or something else? If not, I'm curious what kind of problem that is, because any hash function (except "perfect" hash functions on limited domains) will cause collisions.
I put together a quick piece of code to check for collisions among 3-letter words, and it doesn't report a single collision for them. You see what I mean? It looks like the built-in hash algorithm is not so bad.
Dictionary<int, bool> set = new Dictionary<int, bool>();
char[] buffer = new char[3];
int count = 0;
for (int c1 = (int)'A'; c1 <= (int)'z'; c1++)
{
    buffer[0] = (char)c1;
    for (int c2 = (int)'A'; c2 <= (int)'z'; c2++)
    {
        buffer[1] = (char)c2;
        for (int c3 = (int)'A'; c3 <= (int)'z'; c3++)
        {
            buffer[2] = (char)c3;
            string str = new string(buffer);
            count++;
            int hash = str.GetHashCode();
            if (set.ContainsKey(hash))
            {
                Console.WriteLine("Collision for {0}", str);
            }
            set[hash] = false;
        }
    }
}
Console.WriteLine("Generated {0} of {1} hashes", set.Count, count);
While you could pick almost any of the well-known hash functions (as David mentioned) or even choose a "perfect" hash, since it looks like your domain is limited (like a minimal perfect hash)... it would be great to understand whether the source of the problems is really collisions.
Update
What I want to say is that the .NET built-in hash function for strings is not so bad. It doesn't give so many collisions that you would need to write your own algorithm in regular scenarios. And this doesn't depend on the length of the strings. Having a lot of 6-symbol strings doesn't imply that your chances of seeing a collision are higher than with 1000-symbol strings. This is one of the basic properties of hash functions.
And again, another question is what kind of problems you experience because of collisions. All built-in hashtables and dictionaries support collision resolution. So I would say all you can see is just... probably some slowness. Is this your problem?
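To illustrate that collisions only cost speed, not correctness, here is a small sketch with a deliberately terrible key type (CollidingKey and CollisionDemo are hypothetical names, not part of the question's code). Every key returns the same hash code, yet Dictionary still keeps the entries apart because it falls back on Equals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key whose hash code is deliberately terrible: every key collides.
public sealed class CollidingKey : IEquatable<CollidingKey>
{
    public string Text { get; }

    public CollidingKey(string text) { Text = text; }

    public bool Equals(CollidingKey other) => other != null && Text == other.Text;
    public override bool Equals(object obj) => Equals(obj as CollidingKey);
    public override int GetHashCode() => 0; // all keys land in the same bucket
}

public static class CollisionDemo
{
    // Builds a dictionary whose three keys all share one hash code.
    public static Dictionary<CollidingKey, int> Build()
    {
        var map = new Dictionary<CollidingKey, int>();
        map[new CollidingKey("AAB")] = 1;
        map[new CollidingKey("ABA")] = 2;
        map[new CollidingKey("BAA")] = 3;
        return map;
    }
}
```

Lookups such as Build()[new CollidingKey("ABA")] still find the right value; the only penalty is that every lookup has to walk the entries that share the bucket.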
As for your code
return string.Concat(this._from, this._to).GetHashCode();
This can cause problems. Because on every hash code calculation you create a new string. Maybe this is what causes your issues?
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
This would be a much better approach, simply because you don't create new objects on the heap. Actually it's one of the main points of this approach: get a good hash code for an object with a complex "key" without creating new objects. So if you don't have a single-value key then this should work for you. BTW, this is not a new hash function; it is just a way to combine existing hash values without compromising the main properties of hash functions.
Any common hash function should be suitable for this purpose. If you're getting collisions on short strings like that, I'd say you're using an unusually bad hash function. You can use Jenkins or Knuth's with no issues.
Here's a very simple hash function that should be adequate. (Implemented in C, but should easily port to any similar language.)
unsigned int hash(const char *it)
{
    unsigned hval=0;
    while(*it!=0)
    {
        hval+=*it++;
        hval+=(hval<<10);
        hval^=(hval>>6);
        hval+=(hval<<3);
        hval^=(hval>>11);
        hval+=(hval<<15);
    }
    return hval;
}
Note that if you want to trim the bits of the output of this function, you must use the least significant bits. You can also use mod to reduce the output range. The last character of the string tends to only affect the low-order bits. If you need a more even distribution, change return hval; to return hval * 2654435761U;.
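Since the question is about .NET, a possible C# port of the function above is sketched here. The SimpleHash class and its method names are invented for this example, and the port feeds UTF-16 code units into the mix rather than C chars, so its results will not match the C version for non-ASCII text.

```csharp
public static class SimpleHash
{
    // Port of the C function: mixes one UTF-16 code unit per iteration.
    public static uint Compute(string text)
    {
        unchecked // wrap-around on overflow is intended
        {
            uint hval = 0;
            foreach (char c in text)
            {
                hval += c;
                hval += hval << 10;
                hval ^= hval >> 6;
                hval += hval << 3;
                hval ^= hval >> 11;
                hval += hval << 15;
            }
            return hval;
        }
    }

    // Optional final multiply mentioned in the text for a more even spread.
    public static uint ComputeSpread(string text)
    {
        unchecked
        {
            return Compute(text) * 2654435761U;
        }
    }
}
```

To use it from GetHashCode, the uint result would need an unchecked cast to int.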
Update:
public override int GetHashCode()
{
    return string.Concat(this._from, this._to).GetHashCode();
}
This is broken. It treats from="foot",to="ar" as the same as from="foo",to="tar". Since your Equals function doesn't consider those equal, your hash function should not. Possible fixes include:
1) Form the string from,"XXX",to and hash that. (This assumes the string "XXX" almost never appears in your input strings.)
2) Combine the hash of 'from' with the hash of 'to'. You'll have to use a clever combining function. For example, XOR or sum will cause from="foo",to="bar" to hash the same as from="bar",to="foo". Unfortunately, choosing the right combining function is not easy without knowing the internals of the hashing function. You can try:
int hc1=from.GetHashCode();
int hc2=to.GetHashCode();
return (hc1<<7)^(hc2>>25)^(hc1>>21)^(hc2<<11);
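The ambiguity of plain concatenation, and the effect of the separator idea from fix 1, can be sketched like this. ConcatKeyDemo and its methods are hypothetical names, and the '\0' separator is an assumed choice that only helps if it never occurs in the inputs.

```csharp
using System;

public static class ConcatKeyDemo
{
    // Key text built the way the question does it: plain concatenation.
    public static string Plain(string from, string to) => string.Concat(from, to);

    // Key text with a separator between the two parts (assumed never to
    // appear inside either part).
    public static string Separated(string from, string to) => from + "\0" + to;

    public static void Run()
    {
        // Same text, so necessarily the same hash code.
        Console.WriteLine(Plain("foot", "ar") == Plain("foo", "tar"));
        // Different text, so the hash codes are no longer forced to match.
        Console.WriteLine(Separated("foot", "ar") == Separated("foo", "tar"));
    }
}
```

Both variants still allocate a new string on every call, so combining the two existing hash codes remains the cheaper option.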
