I'm looking for an IEqualityComparer<String> that supports a stable HashCode, i.e. the same HashCode across executions/processes. I also need it to ignore casing and non-spacing combining characters (such as diacritics).
Are there any "easy" ways of accomplishing this in .NET? I have started on my custom implementation with a stable HashCode that ignores casing but I'm beginning to wish I could use the already existing implementations in .NET somehow.
The built-in string comparer adds a random seed to HashCodes per process, so they are deliberately not stable (I think because they cannot guarantee the algorithm will remain the same between .NET runtimes?), but I think I can handle that by making sure the HashCodes I persist get wiped/rebuilt when moving to another runtime.
In any case, is there any way to access the inner checksum calculation (without the randomness)? Perhaps with reflection?
Update: I'm not an expert on the why, but it's evident that the HashCode is calculated differently between runtimes. I need it because I have a disk-based lookup index that uses the hashcodes of strings as keys, and since it is persistent I obviously need them to be the same between runtimes. I could calculate my own checksums in any way I like of course, but since .NET already does a very good job with this I wish I could take advantage of that, just without the "seed" or whatever you want to call it, the thing that makes the hashcodes different between runtimes.
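For illustration, a minimal sketch of that kind of comparer might look like the following, assuming FormD normalization to strip non-spacing combining marks and an FNV-1a hash to keep the value stable across processes (the class name and constants are illustrative, not an existing .NET type):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Illustrative sketch only: case- and diacritic-insensitive comparer with a
// process-stable hash code (FNV-1a over the normalized characters).
sealed class StableInvariantStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        unchecked
        {
            uint hash = 2166136261;            // FNV offset basis
            foreach (char c in Normalize(s))
            {
                hash ^= c;
                hash *= 16777619;              // FNV prime
            }
            return (int)hash;
        }
    }

    private static string Normalize(string s)
    {
        if (s == null) return string.Empty;
        var sb = new StringBuilder(s.Length);
        // FormD splits characters from their combining marks; drop the marks.
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(char.ToUpperInvariant(c));
        }
        return sb.ToString();
    }
}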
Related
I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big O notation
So I was looking at the implementation of the extension method Enumerable.Distinct and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing"
What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions.
However, this simplified Set<T> implementation has some obvious flaws; for example, the Resize method does not use prime numbers for the size of the slots like HashSet<T> does (see HashHelpers.ExpandPrime).
So, my questions are:
What is the reason for code duplication here, why not stick with the DRY principle? Especially given the fact that both of these classes are in the same assembly, System.Core
It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
which is almost a classical implementation of a hash table with "open addressing"
Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
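As a rough illustration (assumed field names, not the actual BCL source), the layout being described looks something like this, with collision chains walked by array index rather than by following separate heap nodes:

// Each slot lives in one big array; "next" is the index of the next slot in
// the same bucket's chain, or -1 when the chain ends.
struct Slot
{
    internal int hashCode;
    internal int next;
    internal string value;   // the element type is generic in the real code
}

// Lookup walks the chain by index rather than by dereferencing list nodes:
// for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next)
//     if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, item)) ...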
a lot of code in Set is just a copy-paste from HashSet, with some omissions
AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.
It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells.
In terms of performance, HashSet's using prime numbers means it's more likely to avoid collisions with poor hashes (but just how much an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out the free list and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.
Especially given the fact that both of these classes are in the same assembly System.Core
Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.
Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.
That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think #james-ko was considering testing the benefits of doing that.
It looks like HashSet<T> will perform better
For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).
so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.
Use Distinct(). If you've got a bottleneck then it might be that HashSet<T> will win with a given data-set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first.)
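For reference, the alternative the question describes would look roughly like this: a hypothetical HashSet<T>-backed version of Distinct with the same deferred-execution shape (the method name is made up):

using System.Collections.Generic;

static class DistinctExtensions
{
    // Hypothetical HashSet-backed alternative to Enumerable.Distinct.
    public static IEnumerable<T> DistinctViaHashSet<T>(
        this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        var seen = new HashSet<T>(comparer ?? EqualityComparer<T>.Default);
        foreach (T item in source)
        {
            if (seen.Add(item))     // Add returns false for an item already present
                yield return item;
        }
    }
}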
MSDN says:
"The default implementation of the GetHashCode method does not guarantee unique return values for different objects. "
But on the other hand, when I use the sn.exe tool it ensures a unique hash value to create a strongly-named assembly. If I haven't got this wrong, all the content of the assembly is converted to a hash value.
So, why doesn't GetHashCode()'s default implementation use the same algorithm sn.exe uses to create unique hash values for objects, instead of expecting the developer to implement it?
Not enough bits. GetHashCode() returns 32 of them, so there can never be more than 4 billion distinct values. The birthday paradox cuts that down considerably: roughly 77,000 random values already give about a 50% chance of at least one collision. The strong name generated by sn.exe uses a SHA1 hash, which returns 160 bits, allowing for 2^160 distinct values.
Which is a Really Big Number (1.4E48), ensuring uniqueness by the sheer quantity. Somewhat similar to a Guid which uses 128 bits. Not the same, a Guid generator ensures that no duplicates can occur, SHA1 has no such guarantee.
GetHashCode has a limited number of bits because the primary requirement for the method is that it is fast. Aside from providing the bucket index for hashed collections, its use is to make equality testing fast. GetHashCode needs to be an order of magnitude faster than Equals(), give or take, to be useful. That requires many corners to be cut; typically, the GetHashCode implementation for a struct that contains reference types only returns the GetHashCode value of the first member.
Those are two entirely different things.
The GetHashCode() function by definition returns (only) a 32 bits integer. It is supposed to use a fast algorithm and does not (can not) guarantee uniqueness. A PC can quickly generate enough strings to show a collision.
When you sign an application (document) you will end up with a lot larger hash (like 128 or 256 bits). While in theory you might still have a collision this has no practical implications.
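To make that concrete, here is an assumed brute-force demonstration (not from the answer): keep hashing distinct strings until two of them share a hash code; because of the birthday paradox this usually terminates after some tens or hundreds of thousands of strings.

using System;
using System.Collections.Generic;

class CollisionDemo
{
    static void Main()
    {
        // Map each hash code seen so far back to the string that produced it.
        var seen = new Dictionary<int, string>();
        for (int i = 0; ; i++)
        {
            string s = "value-" + i;
            int h = s.GetHashCode();
            if (seen.TryGetValue(h, out string earlier))
            {
                Console.WriteLine($"Collision after {i + 1} strings: \"{earlier}\" and \"{s}\" -> {h}");
                break;
            }
            seen.Add(h, s);
        }
    }
}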
There's no limit to the number of objects a program can create, call GetHashCode() upon, and abandon. There is, however, a limit of 4,294,967,296 different values GetHashCode() can return. If a program happens to call GetHashCode 4,294,967,297 times, at least one of those calls would have to return a value that had already been returned previously.
It would theoretically be possible for the system to keep a pool of hash-code values, and for objects which are abandoned to have their hash codes put back in the pool so that GetHashCode() could guarantee that it will never return the same value as any other live object (assuming there are no more than 4,294,967,296 live objects, at least). On the other hand, keeping such information would be expensive and not really offer much benefit. From a practical perspective, it's just as good to have the system generate an arbitrary number either when an object is constructed or the first time GetHashCode() is called upon it. There will be occasional collisions, but generally not enough to bother well-written code.
BTW, I've sometimes thought it would be useful for each object to have a 64-bit ID which would be guaranteed unique, and which would also rank objects in order of creation. A 64-bit ID would never overflow within the lifetime of any foreseeable program, and being able to assign objects a ranking could be helpful in some caching or interning scenarios. For example, if a program generates some large objects by reading data from files, and frequently scans them to find differences, it may often find objects that contain identical data but are distinct. If two distinct objects are found to be identical and interchangeable, replacing references to the newer one with the older one may considerably expedite future comparisons among them; if many matching objects are compared among each other, many of the references to newer objects will get replaced with references to the oldest ones, without having to explicitly cache anything. Absent some means of determining "age", however, such an approach wouldn't really work, since there would be no way to know which reference should be abandoned in favor of the other.
Unrelated. Wonder how you could relate these two!!
Still, to add more argument:
Hashcode for a value 'can not guarantee' uniqueness for different values. But it does 'guarantee' the same hash code for a given value/object! That means:
var hashOne = "SO".GetHashCode();
var hashTwo = "SO".GetHashCode();
Debug.Assert(hashOne == hashTwo); // the assertion would succeed (within one process)
SN just generates a random unique number, with no logic over an instance.
We have an application that
Generates a hash code on a string
Saves that hash code into a DB along with associated data
Later, it queries the DB using the string hash code for retrieving the data
This is obviously a bug because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64 bit). To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we'd like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.
The quick and dirty fix seems like creating a static GetInvariantHashCode(string s) helper method that is consistent across architectures.
Can anyone suggest an algorithm for generating a hashcode on a string that is the same on 32 bit and 64 bit architectures?
A few more notes:
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
Then just let the database index the strings for you!
Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any gain in speed you might think you're gaining by not just indexing the strings in the first place.
Anyway, you don't need us if you're stuck a few days away from release and you really need an invariant hash code across platforms. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:
string s = "Hello, world!";
int hash = 17;
foreach(char c in s) {
unchecked { hash = hash * 23 + c.GetHashCode(); }
}
Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
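For reference, a commonly-cited form of the Bernstein (djb2-style) hash is similarly tiny; again a sketch, deterministic across processes:

// Classic Bernstein/djb2-style string hash: seed 5381, multiply-by-33 step.
static int Djb2Hash(string s)
{
    unchecked
    {
        int hash = 5381;
        foreach (char c in s)
            hash = hash * 33 + c;   // often written as ((hash << 5) + hash) + c
        return hash;
    }
}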
Edit (the below was written before the question was edited with new salient information):
You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.
There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes, and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.
The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something that has a size less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.
Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.
To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead.
Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.
I note the documentation for String.GetHashCode:
The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.
And from Object.GetHashCode:
The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.
Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.
You should just use SHA512.
Note that hashes are not (and cannot be) unique.
If you need it to be unique, just use the identity function as your hash.
You can use one of the managed cryptography classes (such as SHA512Managed) to compute a platform-independent hash via ComputeHash. This will require converting the string to a byte array (i.e. using Encoding.GetBytes or some other method), and it will be slow, but it will be consistent.
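As a rough sketch of that approach (UTF-8 is chosen here arbitrarily; any fixed, documented encoding works as long as it never changes, and the SHA512.Create() factory is used here in place of SHA512Managed):

using System.Security.Cryptography;
using System.Text;

static class StableHash
{
    // Platform- and version-independent hash of a string via SHA-512.
    public static byte[] Compute(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);   // fixed encoding, chosen once
        using (var sha = SHA512.Create())
        {
            return sha.ComputeHash(bytes);
        }
    }
}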
That being said, a hash is not guaranteed unique, and is really not a proper mechanism for uniqueness in a database. Using a hash to store data is likely to cause data to get lost, as the first hash collision will overwrite old data (or throw away new data).
I was wondering whether the .Net HashSet<T> is based completely on hash codes or whether it uses equality as well?
I have a particular class that I may potentially instantiate millions of instances of and there is a reasonable chance that some hash codes will collide at that point.
I'm considering using HashSets to store some instances of this class and am wondering if it's actually worth doing - if the uniqueness of an element is only determined by its hash code then that's of no use to me for real applications.
MSDN documentation seems to be rather vague on this topic - any enlightenment would be appreciated.
No, it uses equality as well. By definition, hash codes don't need to be unique - anything which assumes they will be is broken. HashSet<T> is sensible. It uses an IEqualityComparer<T> (defaulting to EqualityComparer<T>.Default) to perform both hash code generation and equality tests.
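To illustrate: even a comparer whose GetHashCode collides for everything still gives a correct set, because Equals decides actual uniqueness; the collisions only degrade performance. A deliberately bad, made-up example:

using System;
using System.Collections.Generic;

// Deliberately terrible comparer: every item gets the same hash code.
sealed class AlwaysCollidingComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string s) => 0;   // worst-case distribution, still correct
}

class Demo
{
    static void Main()
    {
        var set = new HashSet<string>(new AlwaysCollidingComparer()) { "a", "b", "a" };
        Console.WriteLine(set.Count);   // prints 2: equality, not the hash, decides uniqueness
    }
}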
I am wondering about the hash quality and the hash stability produced by the String.GetHashCode() implementation in .NET.
Concerning the quality, I am focusing on algorithmic aspects (hence, the quality of the hash as it impacts large hash-tables, not for security concerns).
Then, concerning the stability, I am wondering about the potential versioning issues that might arise from one .NET version to the next.
Some light on those two aspects would be much appreciated.
I can't give you any details about the quality (though I would assume it is pretty good given that string is one of the framework's core classes that is likely to be used as a hash key).
However, regarding the stability, the hash code produced on different versions of the framework is not guaranteed to be the same, and it has changed in the past, so you absolutely must not rely on the hash code being stable between versions (see here for a reference that it changed between 1.1 and 2.0). In fact, it even differs between the 32-bit and 64-bit versions of the same framework version; from the docs:
The value returned by GetHashCode is platform-dependent. For a specific string value, it differs on the 32-bit and 64-bit versions of the .NET Framework.
This is an old question, but I'd like to contribute by mentioning this Microsoft bug about hash quality.
Summary: on 64-bit, hash quality is very low when your string contains '\0' bytes. Basically, only the start of the string will be hashed.
If like me, you have to use .Net strings to represent binary data as key for high-performance dictionaries, you need to be aware of this bug.
Too bad, it's a WONTFIX... As a side note, I don't understand how they can say that modifying the hashcode would be a breaking change, when the code includes
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
and the hashcode is already different on x86/x64 anyway.
I just came across a related problem to this. On one of my computers (a 64-bit one) I had a problem that I tracked down to 2 different objects being identical except for the (stored) hashcode. That hashcode was created from a string... the same string!
m_storedhash = astring.GetHashCode();
I don't know how these two objects ended up with different hash codes given they were from the same string; however, I suspect what happened is that within the same .NET exe, one of the class library projects I depend upon has been set to x86 and another to ANYCPU, and one of these objects was created in a method inside the x86 class lib while the other object (same input data, same everything) was created in a method inside the ANYCPU class library.
So, does this sound plausible: Within the same executable in memory (not between processes) some of the code could be running with the x86 Framework's string.GetHashCode() and other code x64 Framework's string.GetHashCode() ?
I know that this isn't really covered by the meanings of quality and stability that you specified, but it's worth being aware that hashing extremely large strings can produce an OutOfMemoryException.
https://connect.microsoft.com/VisualStudio/feedback/details/517457/stringcomparers-gethashcode-string-throws-outofmemoryexception-with-plenty-of-ram-available
The quality of the hash codes is good enough for their intended purpose, i.e. they don't cause too many collisions when you use strings as keys in a dictionary. I suspect that it will only use the entire string for calculating the hash code if the string length is reasonably short; for huge strings it will probably only use the first part.
There is no guarantee for stability across versions. The documentation clearly says that the hashing algorithm may change from one version to the next, so that the hash codes are for short term use.