Does the length of key affect Dictionary performance? - c#

I will use a Dictionary in a .NET project to store a large number of objects. Therefore I decided to use a GUID-string as a key, to ensure unique keys for each object.
Does a large key such as a GUID (or even larger ones) decrease the performance of a Dictionary, e.g. for retrieving an object via its key?
Thanks,
Andrej

I would recommend using an actual Guid rather than the string representation of the Guid. Yes, when comparing strings the length does affect the number of operations required, since the comparison has to go through the strings character by character (at a bare minimum; this is barring any special options like IgnoreCase). An actual Guid gives you only 16 bytes to compare, rather than the minimum of 32 characters in the string.
That being said, you are very likely not going to notice any difference...premature optimization and all that. I would simply go for the Guid key since that's what the data is.
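A minimal sketch of keying on the Guid itself, assuming the id sometimes arrives as text and is parsed once at the boundary. The class and method names here are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class GuidKeyExample
{
    // Stores a payload under a Guid key and reads it back.
    public static string StoreAndFetch(Guid id, string payload)
    {
        var byGuid = new Dictionary<Guid, string>();
        byGuid[id] = payload;

        // If the id arrives as text, parse it once instead of keying on the string.
        Guid parsed = Guid.Parse(id.ToString("D"));
        return byGuid[parsed];
    }
}
```

Calling GuidKeyExample.StoreAndFetch(Guid.NewGuid(), "x") returns "x". The dictionary then hashes and compares the 16-byte value rather than a 32+ character string.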

The actual size of an object is irrelevant to retrieving values. The speed of lookup depends much more on the speed of two methods on the IEqualityComparer<T> instance passed in:
GetHashCode()
Equals()
EDIT
A lot of people are using String as a justification for saying that larger object size decreases lookup performance. This must be taken with a grain of salt for several reasons.
For the default comparer, the performance of these two methods on String does decrease as the size of the string increases. Just because that is true for System.String does not mean it is true in general.
You could just as easily write a different IEqualityComparer<String> in such a way that string length was irrelevant.
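As a rough sketch of such a comparer: the hypothetical PrefixComparer below looks only at the first 8 characters, so its cost does not grow with string length. It assumes that a fixed-length prefix is enough to identify a key in your data. Strings that share a prefix are treated as the same key, which is a deliberate change in semantics, not a free speed-up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two strings count as equal when their first
// PrefixLength characters match, so cost is bounded regardless of length.
public sealed class PrefixComparer : IEqualityComparer<string>
{
    private const int PrefixLength = 8; // assumed value, chosen for illustration

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Compares at most PrefixLength characters from each string.
        return string.CompareOrdinal(x, 0, y, 0, PrefixLength) == 0;
    }

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        int n = Math.Min(s.Length, PrefixLength);
        int hash = 17;
        unchecked
        {
            // Hash only the prefix so that equal keys get equal hash codes.
            for (int i = 0; i < n; i++) hash = hash * 31 + s[i];
        }
        return hash;
    }
}
```

It would be passed to the constructor, for example new Dictionary<string, int>(new PrefixComparer()).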

Yes and no. Larger strings increase the memory size of a dictionary. And larger strings mean slightly longer times to calculate hash codes.
But worrying about those things is probably premature optimization. While it will be slower, it's not anything that you will probably actually notice.

Apparently it does. Here is a good test: Dictionary String Key Test

I did a quick Google search and found this article.
http://dotnetperls.com/dictionary-string-key
It confirms that generally shorter keys perform better than longer ones.

see Performance - using Guid object or Guid string as Key for a similar question. You could test it out with an alternative key.

Related

Which collection type should I use to store a bunch of hashes?

I have a bunch of long strings which I have to manipulate. They can occur again and again and I want to ignore them if they appear twice. I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
Requirements:
Be able to add items (hashes) to my collection
Be able to (quickly) check whether a particular hash is already in the collection.
Not too memory intensive. I might end up with ~100,000 of these hashes.
I don't need to go backwards (key -> value) if that makes any difference.
Any suggestions on which .NET data type would be most efficient?
"I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string."
No, don't do that. Two reasons:
Hashes only tell you if two values might be the same; they don't tell you if they are the same.
You'd be doing a lot of work which has already been done for you.
Basically, you should just keep a HashSet<String>. That should be fine, have a quick lookup, and you don't need to implement it yourself.
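For instance, a small sketch of the set-based approach. The DuplicateFilter name is invented for this example. It relies on HashSet<T>.Add returning false when the item is already present:

```csharp
using System.Collections.Generic;

public static class DuplicateFilter
{
    // Returns the inputs in their original order, skipping strings already seen.
    public static List<string> Distinct(IEnumerable<string> inputs)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string s in inputs)
        {
            // Add returns false when the set already contains s.
            if (seen.Add(s)) result.Add(s);
        }
        return result;
    }
}
```

The set compares the full strings after a hash match, so two different strings are never mistaken for each other.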
The downside is that you will end up keeping all the strings in memory. If that's a problem then you'll need to work out an alternative strategy... which may indeed end up keeping just the hashes in memory. The exact details will probably depend on where the strings come from, and what sort of problem it would cause if you got a false positive. For example, you could keep an MD5 hash of each string, as a "better than just hashCode" hash - but that would still allow an attacker to present you with another string with the same hash. Is that a problem? If so, a more secure hash algorithm (e.g. SHA-256) might help. It still won't guarantee that you end up with different hashes for different strings though.
If you really want to be sure, you'd need to keep the hashes in memory but persist the actual string data (to disk or a database) - then when you've got a possible match (because you've seen the same hash before) you'd need to compare the stored string with the fresh one.
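The hash-then-verify idea could look roughly like the sketch below. It is only an illustration: HashThenVerifyFilter is a made-up class, and the persisted dictionary merely stands in for whatever disk or database storage you would actually use.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed class HashThenVerifyFilter
{
    // Kept in memory: one SHA-256 digest (as text) per distinct hash seen.
    private readonly HashSet<string> hashes = new HashSet<string>();

    // Stand-in for persistent storage (disk or database): hash -> full strings.
    private readonly Dictionary<string, List<string>> persisted =
        new Dictionary<string, List<string>>();

    private static string Sha256Hex(string s)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToString(digest);
        }
    }

    // Returns true if s is new, false if an identical string was seen before.
    public bool TryAdd(string s)
    {
        string key = Sha256Hex(s);
        if (hashes.Add(key))
        {
            // Hash never seen: the string must be new.
            persisted[key] = new List<string> { s };
            return true;
        }
        // Possible match: compare against the stored strings to rule out a collision.
        List<string> candidates = persisted[key];
        if (candidates.Contains(s)) return false;
        candidates.Add(s);
        return true;
    }
}
```

The expensive comparison against stored data only happens when a hash has been seen before, which should be rare if most strings are distinct.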
If you're storing the hashes in memory, the best approach will depend on the size of hash you're using. For example, for just a 64-bit hash you could use a Long per hash and keep it in a HashSet<Long>. For longer hashes, you'd need an object which can easily be compared etc. At that point, I suggest you look at Guava and its HashCode class, along with the factory methods in HashCodes (Deprecated since Guava v16).
Use a set.
ISet<T> interface is implemented by e.g. HashSet<T>
Add and Contains are expected O(1), unless you have a really poor hashing function, then the worst case is O(n).

How much capacity should a static hash table have to minimize collisions?

My program retrieves a finite and complete list of elements I want to refer to by a string ID. I'm using a .Net Dictionary<string, MyClass> to store these elements. I personally have no idea how many elements there will be. It could be a few. It could be thousands.
Given that the program knows exactly how many elements it will be putting in the hash table, what should it specify as the table's capacity? Clearly it should be at least the number of elements it will contain, but using only that number will likely lead to numerous collisions.
Is there a guide to selecting the capacity of a hash table for a known number of elements to balance hash collisions and memory wastage?
EDIT: I'm aware a hash table's size can change. What I'm avoiding first and foremost is leaving it with the default allocation, then immediately adding thousands of elements causing countless resize operations. I won't be adding or removing elements once it's populated. If I know what's going in, I can ensure there's sufficient space upfront. My question relates to the balance of hash collisions versus memory wastage.
Your question seems to imply a false assumption, namely that the dictionary's capacity is fixed. It isn't.
If you know in any given case that a dictionary will hold at least some number of elements, then you can specify that number as the dictionary's initial capacity. The dictionary's capacity is always at least as large as its item count (this is true for .NET 2 through 4, at least; I believe this is an undocumented implementation detail that's subject to change).
Specifying the initial capacity reduces the number of memory allocations by eliminating those that would have occurred as the dictionary grew from its default initial capacity to the capacity you have chosen.
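A short sketch of presizing, assuming the full list of ids is available before the dictionary is built. PresizedDictionary is a hypothetical helper, and the int values are placeholders for MyClass:

```csharp
using System.Collections.Generic;

public static class PresizedDictionary
{
    // Builds a lookup from a known, complete list of ids, sizing the dictionary once.
    public static Dictionary<string, int> Build(IList<string> ids)
    {
        // The capacity argument avoids repeated resizing while the items are added.
        var map = new Dictionary<string, int>(ids.Count);
        for (int i = 0; i < ids.Count; i++)
        {
            map[ids[i]] = i;
        }
        return map;
    }
}
```

Passing the count only saves the intermediate reallocations. It is not meant to tune the collision rate.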
If the hash function in use is well chosen, the number of collisions should be relatively small and should have a minimal impact on performance. Specifying an over-large capacity might help in some contrived situations, but I would definitely not give this any thought unless profiling showed that the dictionary's lookups were having a significant impact on performance.
(As an example of a contrived situation, consider a dictionary with int keys with a capacity of 10007, all of whose keys are a multiple of 10007. With the current implementation, all of the items would be stored in a single bucket, because the bucket is chosen by dividing the hash code by the capacity and taking the remainder. In this case, the dictionary would function as a linked list, and forcing it to use a different capacity would fix that.)
This is a bit of a subjective question, but let me try my best to answer it (from the perspective of CLR 2.0 only, as I have not yet explored whether there have been any changes to Dictionary in CLR 4.0).
You are using a dictionary keyed on string. Since there are infinitely many possible strings, it is reasonable to assume that every possible hash code is 'equally likely'. In other words, each of the 2^32 hash codes (the range of int) is equally likely for the string class. The current version of Dictionary in the BCL drops the 32nd bit from any 32-bit hash code thus obtained, to essentially get a 31-bit hash code. Hence the range we are dealing with is 2^31 unique, equally likely hash codes.
Note that the range of the hash codes is not dependent on the number of elements dictionary contains or can contain.
Dictionary class will use this hash code to allocate a bucket to the 'Myclass' object. So essentially if two different strings return same 31 bits of hash code (assuming BCL designers have chosen the string hash function highly wisely, such instances should be fairly spread out) both will be allocated same bucket. In such a hash collision, nothing can be done.
Now, in current implementation of the Dictionary class, it may happen that even different hash codes (again 31 bit) still end up in the same bucket. The bucket index is identified as follows:
hash = <31 bit hash code>
pr = <least prime number greater than or equal to current dictionary capacity>
bucket_index = hash modulus pr
Hence every hash code of the form (pr*factor + bucket_index) will end up in same bucket irrespective of the factor part.
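The formula above can be written out as a tiny sketch. BucketMath is an invented name, and the code only mirrors the calculation as described in this answer. It is not taken from the Dictionary source, and other runtime versions may compute the index differently.

```csharp
public static class BucketMath
{
    // Mirrors the described formula: drop the sign bit, then take the remainder.
    public static int BucketIndex(int hashCode, int primeCapacity)
    {
        int hash31 = hashCode & 0x7FFFFFFF;
        return hash31 % primeCapacity;
    }
}
```

With a prime of 10007, the hash codes 5, 10012 (10007 * 1 + 5) and 20019 (10007 * 2 + 5) all give bucket index 5, even though the hash codes themselves differ.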
If you want to be absolutely sure that all different possible 31 bit hash codes end up in different buckets only way is to force the pr to be greater than or equal to the largest possible 31 bit hash code. Or in other words, ensure that every hash code is of the form (pr*0 + hash_code) i.e. pr should be greater than 2^31. This by extension means that the dictionary capacity should be at-least 2^31.
Note that the capacity required to minimize hash collisions is not at all dependent on the number of elements you want to store in the dictionary but on the range of the possible hash codes.
As you can imagine, 2^31 is a huge memory allocation. In fact, if you try to specify 2^31 as the capacity, there will be two arrays of length 2^31. Consider that on a 32-bit machine the highest possible RAM address is 2^32!
If, for some reason, default behavior of the dictionary is not acceptable to you and it is critical for you to minimize hash collisions (or rather I would say bucket collisions) only hope you have is to provide your own hash code (i.e. you can not use string as key). Such a hash code should keep the formula to obtain bucket index in mind and strive to minimize the range of possible hash codes. Simplest approach is to incrementally assign a number/index to your unique MyClass instances and use this number as your hash code. Then you can specify the total number of MyClass instances as dictionary capacity. Though, in such a case an array can easily be maintained instead of dictionary as you know the 'index' of the object and index is incremental.
In the end, I would like to re-iterate what others have said, 'there will not be countless resizes'. Dictionary doubles its capacity (rounded off to nearest prime number greater than or equal to the new capacity) each time it finds itself short of space. In order to save some processing, you can very well set capacity to number of MyClass instances you have as in any case dictionary will require this much capacity to store the instances but this will not minimize 'hash-collisions' and for normal circumstances will be fast enough.
Data structures like HashTable are meant for dynamic memory allocation. You can, however, specify the initial size in some structures. But when you add new elements, they will expand in size. There is no way you can restrict the size implicitly.
There are many data structures available, each with its own advantages and disadvantages. You need to select the best one. Limiting the size does not affect the performance. It is the cost of Add, Delete and Search that makes the difference in performance.

Datastructure choices for highspeed and memory efficient detection of duplicate of strings

I have an interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before, it needs to skip processing.
After a specified amount of time, the function should accept duplicate strings.
This function may be called thousands of times per second, and the string data may be very large.
This is a highly abstracted explanation of the real application, just trying to get down to the core concept for the purpose of the question.
The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.
It does NOT need to store the strings; a unique hash of the string would be fine, provided there are no false positives due to collisions (use a perfect hash?) and the hash function is performant enough.
The naive implementation would be simply (in C#):
Dictionary<String,DateTime>
though in the interest of lowering the memory footprint and potentially increasing performance, I'm evaluating a custom data structure to handle this instead of a basic hashtable.
So, given these constraints, what would you use?
EDIT, some additional information that might change proposed implementations:
99% of the strings will not be duplicates.
Almost all of the duplicates will arrive back to back, or nearly sequentially.
In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.
I don't believe it is possible to construct a "perfect hash" without knowing the complete set of values first (especially in the case of a C# int with its limited number of values). So any kind of hashing requires the ability to compare the original values too.
I think a dictionary is the best you can get with out-of-the-box data structures. Since you can store objects with custom comparisons defined, you can easily avoid keeping the strings in memory and simply save the location from which the whole string can be obtained, i.e. an object with the following values:
stringLocation.fileName="file13.txt";
stringLocation.fromOffset=100;
stringLocation.toOffset=345;
expiration= "2012-09-09T1100";
hashCode = 123456;
Here the custom comparer will return the saved hashCode, or retrieve the string from the file if needed and perform the comparison.
"a unique hash of the string would be fine, providing there is no false positives due to collisions"
That's not possible, if you want the hash code to be shorter than the strings.
Using hash codes implies that there are false positives, only that they are rare enough not to be a performance problem.
I would even consider to create the hash code from only part of the string, to make it faster. Even if that means that you get more false positives, it could increase the overall performance.
Provided the memory footprint is tolerable, I would suggest a HashSet<string> for the strings, and a queue to store a Tuple<DateTime, String>. Something like:
HashSet<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();
Now, when a string comes in:
if (Strings.Add(s))
{
    // string is new. process it.
    // and add it to the expiration queue
    Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}
And, somewhere you'll have to check for the expirations. Perhaps every time you get a new string, you do this:
while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
    var e = Expirations.Dequeue();
    Strings.Remove(e.Item2);
}
It'd be hard to beat the performance of HashSet here. Granted, you're storing the strings, but that's going to be the only way to guarantee no false positives.
You might also consider using a time stamp other than DateTime.Now. What I typically do is start a Stopwatch when the program starts, and then use the ElapsedMilliseconds value. That avoids potential problems that occur during Daylight Saving Time changes, when the system automatically updates the clock (using NTP), or when the user changes the date/time.
Whether the above solution works for you is going to depend on whether you can stand the memory hit of storing the strings.
Added after "Additional information" was posted:
If this will be accessed by multiple threads, I'd suggest using ConcurrentDictionary rather than HashSet, and BlockingCollection rather than Queue. Or, you could use lock to synchronize access to the non-concurrent data structures.
If it's true that 99% of the strings will not be duplicate, then you'll almost certainly need an expiration queue that can remove things from the dictionary.
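Putting the pieces of this answer together, here is one possible sketch of the lock-based variant with a Stopwatch as the time source. ExpiringStringFilter and its member names are invented for illustration. It assumes a single fixed lifetime for every entry, so the queue stays ordered by expiry time.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class ExpiringStringFilter
{
    private readonly object gate = new object();
    private readonly HashSet<string> seen = new HashSet<string>();
    // Key = expiry time in stopwatch milliseconds, Value = the string.
    private readonly Queue<KeyValuePair<long, string>> expirations =
        new Queue<KeyValuePair<long, string>>();
    private readonly Stopwatch clock = Stopwatch.StartNew();
    private readonly long lifetimeMs;

    public ExpiringStringFilter(TimeSpan lifetime)
    {
        lifetimeMs = (long)lifetime.TotalMilliseconds;
    }

    // Returns true if s should be processed (not seen within the lifetime window).
    public bool ShouldProcess(string s)
    {
        lock (gate)
        {
            long now = clock.ElapsedMilliseconds;
            // Drop entries whose expiry time has been reached.
            while (expirations.Count > 0 && expirations.Peek().Key <= now)
            {
                seen.Remove(expirations.Dequeue().Value);
            }
            if (!seen.Add(s)) return false;
            expirations.Enqueue(new KeyValuePair<long, string>(now + lifetimeMs, s));
            return true;
        }
    }
}
```

A single lock keeps the set and the queue consistent with each other, which is harder to guarantee when two separate concurrent collections are updated independently. Whether the lock becomes a bottleneck at thousands of calls per second is something only profiling can tell.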
If memory footprint of storing whole strings is not acceptable, you have only two choices:
1) Store only hashes of the strings, which implies the possibility of hash collisions (whenever the hash is shorter than the strings). A good hash function (MD5, SHA-1, etc.) makes an accidental collision extremely unlikely, so it only depends on whether it is fast enough for your purpose.
2) Use some kind of lossless compression. Strings usually have a good compression ratio (about 10%), and some algorithms such as ZIP let you choose between fast (less efficient) and slow (high compression ratio) compression. Another way to shrink strings is to convert them to UTF-8, which is fast and easy to do and roughly halves the size of strings that contain only ASCII characters.
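To make the UTF-8 point concrete, a small sketch comparing character data sizes. EncodedSize is a made-up helper, and the figures ignore the fixed per-object overhead of a string or byte array:

```csharp
using System.Text;

public static class EncodedSize
{
    // Bytes used by the characters of a .NET string (UTF-16, two bytes per char).
    public static int Utf16Bytes(string s)
    {
        return s.Length * 2;
    }

    // Bytes needed to hold the same text encoded as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

For "hello", Utf16Bytes gives 10 and Utf8Bytes gives 5. Text with many non-ASCII characters saves less, and can even grow.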
Whatever way you choose, it's always tradeoff between memory footprint and hashing/compression speed. You will probably need to make some benchmarking to choose best solution.

get data from generated hashcode()

I have a string (named str) and I generate a hash code (named H) from it.
I want to recover the original string (str) from the hash code (H) I received.
The short answer is you can't.
Creating a hashcode is one way operation - there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hashcodes is bounded by the range of an int). Each hashcode could have been generated from any one of the infinitely many strings that give that hash code and there's no way to know which.
You can try to do it through a Brute Force Attack or with the help of Rainbow tables
Anyway (even if you succeeded in finding something), with those methods you would only find a string having the same hash code as the original, but you can ABSOLUTELY not be sure that it would be the original string, because hash codes are not unique.
Mmh, maybe absolutely is even a bit restrictive, because probability says you 99.999999999999... % won't find the same string :D
Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but practically unlimited strings, so it follows from the pigeonhole principle that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
GetHashCode() is designed for use in hashtables and as such is just a performance trick. It allows quick sorting of the input value into buckets, and nothing more. Its value is implementation-defined, so another .NET version, or even another instance of the same application, might return a different value. return 0; is a valid (but not recommended) implementation of GetHashCode, and would not yield any information about the original string.
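To see why a hash cannot be reversed, consider the deliberately simple toy hash below. It is an illustration only and is NOT how string.GetHashCode works. It just sums the character codes:

```csharp
public static class ToyHash
{
    // Toy hash (not string.GetHashCode): adds up the character codes.
    public static int Sum(string s)
    {
        int total = 0;
        foreach (char c in s) total += c;
        return total;
    }
}
```

ToyHash.Sum("ab") and ToyHash.Sum("ba") both return 195 (97 + 98), so the value 195 alone cannot tell you which string was hashed. Real hash functions spread values far better, but any function that maps unlimited inputs to a fixed-size output has the same property.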
many of us would like to be able to do that :=)

'Proper' collection to use to obtain items in O(1) time in C# .NET?

Something I do often if I'm storing a bunch of string values and I want to be able to find them in O(1) time later is:
foreach (String value in someStringCollection)
{
    someDictionary.Add(value, String.Empty);
}
This way, I can comfortably perform constant-time lookups on these string values later on, such as:
if (someDictionary.ContainsKey(someKey))
{
    // etc
}
However, I feel like I'm cheating by making the value String.Empty. Is there a more appropriate .NET Collection I should be using?
If you're using .NET 3.5, try HashSet. If you're not using .NET 3.5, try C5. Otherwise your current method is OK (bool as @leppie suggests is better, or not as @JonSkeet suggests, dun dun dun!).
HashSet<string> stringSet = new HashSet<string>(someStringCollection);
if (stringSet.Contains(someString))
{
    ...
}
You can use HashSet<T> in .NET 3.5, else I would just stick to your current method (actually I would prefer Dictionary<string,bool>, but one does not always have that luxury).
Something you might want to add is an initial size for your hash. I'm not sure if C# is implemented differently than Java, but it usually has some default size, and if you add more than that, it extends the set. However, a properly sized hash is important for getting as close to O(1) as possible. The goal is to get exactly 1 entry in each bucket, without making the table really huge. If you do some searching, I know there is a suggested ratio for sizing the hash table, assuming you know beforehand how many elements you will be adding. For example, something like "the hash should be sized at 1.8x the number of elements to be added" (not the real ratio, just an example).
From Wikipedia:
With a good hash function, a hash table can typically contain about 70%–80% as many elements as it does table slots and still perform well. Depending on the collision resolution mechanism, performance can begin to suffer either gradually or dramatically as more elements are added. To deal with this, when the load factor exceeds some threshold, it is necessary to allocate a new, larger table, and add all the contents of the original table to this new table. In Java's HashMap class, for example, the default load factor threshold is 0.75.
I should probably make this a question, because I see the problem so often. What makes you think that dictionaries are O(1)? Technically, the only thing likely to be something like O(1) is access into a standard integer-indexed fixed-bound array using an integer index value (there being no look-up in arrays implemented that way).
The presumption that if it looks like an array reference it is O(1) when the "index" is a value that must be looked up somehow, however behind the scenes, means that it is not likely an O(1) scheme unless you are lucky to obtain a hash function with data that has no collisions (and probably a lot of wasted cells).
I see these questions and I even see answers that claim O(1) [not on this particular question, but I do see them around], with no justification or explanation of what is required to make sure O(1) is actually achieved.
Hmm, I guess this is a decent question. I will do that after I post this remark here.
