I am looking for a pre-existing .NET 'hash-set type' implementation suitable for atomizing a general type T. We have a large number of identical objects coming in from serialized sources that need to be atomized to conserve memory.
A Dictionary<T,T> with the value == key works perfectly; however, the objects in these collections can run into the millions across the app, so it seems very wasteful to store two references to every object.
HashSet cannot be used as it only has Contains; as far as I can tell there is no way to get to the actual member instance.
Obviously I could roll my own but wanted to check if there was anything pre-existing. A scan of C5 didn't turn up anything obvious, but then their 250+ page documentation does make me wonder if I've missed something.
EDIT The fundamental idea is that I need to be able to GET THE UNIQUE OBJECT BACK, i.e. HashSet has Contains(T obj) but not Get(T obj) /EDIT
The collection at worst only needs to implement:
T GetOrAdd(T candidate)
void Clear()
And take an arbitrary IComparer
GetOrAdd should be ~O(1) and would ideally be atomic, i.e. it doesn't waste time hashing twice.
EDIT Failing an existing implementation, any recommendations on sources for the basic hashing / bucketing mechanics would be appreciated. - The Mono HashSet source has been pointed out for this and thus this section is answered /EDIT
You can take the source code of HashSet<T> from the Reference Source and write your own GetOrAdd method.
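If copying the HashSet<T> source is more than you need, a thin wrapper may be enough. The following is a minimal sketch, assuming a runtime where HashSet<T>.TryGetValue is available (.NET Framework 4.7.2 or .NET Core 2.0 and later). The Atomizer<T> name is hypothetical, and it takes an IEqualityComparer<T> because that is what HashSet<T> accepts. Note that on a miss it still hashes the candidate twice (once in TryGetValue, once in Add), so it does not meet the "hash only once" wish from the question.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not a BCL type): hands back one canonical instance
// per group of objects that the comparer considers equal.
public sealed class Atomizer<T>
{
    private readonly HashSet<T> _set;

    public Atomizer(IEqualityComparer<T> comparer = null)
    {
        _set = new HashSet<T>(comparer ?? EqualityComparer<T>.Default);
    }

    public T GetOrAdd(T candidate)
    {
        // TryGetValue returns the instance that is already stored in the set.
        if (_set.TryGetValue(candidate, out T existing))
            return existing;

        // Not seen before: the candidate becomes the canonical instance.
        _set.Add(candidate);
        return candidate;
    }

    public void Clear() => _set.Clear();
}
```

Only one reference per object is stored, which was the memory concern with Dictionary<T,T>.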
Ok, before you get all mad because there are hundreds of similar sounding questions posted on the internet, I can assure you that I have just spent the last few hours reading all of them and have not found the answer to my question.
Background:
Basically, one of my large scale applications had been suffering from a situation where some Bindings on the ListBox.SelectedItem property would stop working or the program would crash after an edit had been made to the currently selected item. I initially asked the 'An item with the same key has already been added' Exception on selecting a ListBoxItem from code question here, but got no answers.
I hadn't had time to address that problem until this week, when I was given a number of days to sort it out. Now to cut a long story short, I found out the reason for the problem. It was because my data type classes had overridden the Equals method and therefore the GetHashCode method as well.
Now for those of you that are unaware of this issue, I discovered that you can only implement the GetHashCode method using immutable fields/properties. Using an excerpt from Harvey Kwok's answer to the Overriding GetHashCode() post to explain this:
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
So the actual problem was caused because I had used mutable properties in the GetHashCode methods. When users changed these property values in the UI, the associated hash code values of the objects changed and then items could no longer be found in their collections.
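The failure described above can be reproduced in a few lines. This is a minimal sketch with a hypothetical Item class whose hash code is derived from a mutable Id property; it is not the code from the application in question.

```csharp
using System.Collections.Generic;

// Hypothetical type whose hash code depends on a mutable property.
public sealed class Item
{
    public int Id { get; set; }

    public override bool Equals(object obj) => obj is Item other && other.Id == Id;

    public override int GetHashCode() => Id;
}

public static class MutableKeyDemo
{
    public static bool StillFoundAfterMutation()
    {
        var item = new Item { Id = 1 };
        var set = new HashSet<Item> { item };

        // The hash code changes while the item is inside the set.
        item.Id = 2;

        // The set recorded the item under its old hash code,
        // so the lookup with the new hash code misses it.
        return set.Contains(item);
    }
}
```

StillFoundAfterMutation() returns false even though the very same instance is still stored in the set.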
Question:
So, my question is what is the best way of handling the situation where I need to implement the GetHashCode method in classes with no immutable fields? Sorry, let me be more specific, as that question has been asked before.
The answers in the Overriding GetHashCode() post suggest that in these situations, it is better to simply return a constant value... some suggest returning the value 1, while others suggest returning a prime number. Personally, I can't see any difference between these suggestions because I would have thought that there would only be one bucket used for either of them.
Furthermore, the Guidelines and rules for GetHashCode article in Eric Lippert's Blog has a section titled Guideline: the distribution of hash codes must be "random" which highlights the pitfalls of using an algorithm that results in not enough buckets being used. He warns of algorithms that decrease the number of buckets used and cause a performance problem when the bucket gets really big. Surely, returning a constant falls into this category.
I had an idea of adding an extra Guid field to all of my data type classes (just in C#, not the database) specifically to be used in and only in the GetHashCode method. So I suppose at the end of this long intro, my actual question is which implementation is better? To summarise:
Summary:
When overriding Object.GetHashCode() in classes with no immutable fields, is it better to return a constant from the GetHashCode method, or to create an additional readonly field for each class, solely to be used in the GetHashCode method? If I should add a new field, what type should it be and shouldn't I then include it in the Equals method?
While I am happy to receive answers from anyone, I am really hoping to receive answers from advanced developers with a sound knowledge on this subject.
Go back to basics. You read my article; read it again. The two ironclad rules that are relevant to your situation are:
if x equals y then the hash code of x must equal the hash code of y. Equivalently: if the hash code of x does not equal the hash code of y then x and y must be unequal.
the hash code of x must remain stable while x is in a hash table.
Those are requirements for correctness. If you can't guarantee those two simple things then your program will not be correct.
You propose two solutions.
Your first solution is that you always return a constant. That meets the requirement of both rules, but you are then reduced to linear searches in your hash table. You might as well use a list.
The other solution you propose is to somehow produce a hash code for each object and store it in the object. That is perfectly legal provided that equal items have equal hash codes. If you do that then you are restricted such that x equals y must be false if the hash codes differ. This seems to make value equality basically impossible. Since you wouldn't be overriding Equals in the first place if you wanted reference equality, this seems like a really bad idea, but it is legal provided that equals is consistent.
I propose a third solution, which is: never put your object in a hash table, because a hash table is the wrong data structure in the first place. The point of a hash table is to quickly answer the question "is this given value in this set of immutable values?" and you don't have a set of immutable values, so don't use a hash table. Use the right tool for the job. Use a list, and live with the pain of doing linear searches.
A fourth solution is: hash on the mutable fields used for equality, remove the object from all hash tables it is in just before every time you mutate it, and put it back in afterwards. This meets both requirements: the hash code agrees with equality, and hashes of objects in hash tables are stable, and you still get fast lookups.
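The fourth solution can be sketched as a small helper. This is an illustration under assumptions: Person and the Mutate extension method are hypothetical names, and the helper only handles a single HashSet<T>, whereas the advice above requires removing the object from every hash table it is in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type that hashes on the same mutable field it uses for equality.
public sealed class Person
{
    public string Name { get; set; }

    public override bool Equals(object obj) => obj is Person p && p.Name == Name;

    public override int GetHashCode() => Name == null ? 0 : Name.GetHashCode();
}

public static class HashSetExtensions
{
    // Removes the item, applies the change, then re-inserts it,
    // so the set never holds the item while its hash code changes.
    public static void Mutate<T>(this HashSet<T> set, T item, Action<T> change)
    {
        bool wasPresent = set.Remove(item);
        change(item);
        if (wasPresent)
            set.Add(item); // returns false (and drops the item) if an equal item is already present
    }
}
```

Usage would look like `people.Mutate(person, p => p.Name = "Bea");`. Every code path that changes the object has to go through such a helper for the guarantee to hold.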
I would either create an additional readonly field or else throw NotSupportedException. In my view the other option is meaningless. Let's see why.
Distinct (fixed) hash codes
Providing distinct hash codes is easy, e.g.:
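using System.Threading;

class Sample
{
    private static int counter;
    private readonly int hashCode;

    // Interlocked keeps the counter consistent if instances are created on several threads.
    public Sample() { this.hashCode = Interlocked.Increment(ref counter); }

    // Fixed at construction, so it never changes while the object is in a hash table.
    public override int GetHashCode()
    {
        return this.hashCode;
    }

    // Reference equality: two distinct instances never compare equal.
    public override bool Equals(object other)
    {
        return object.ReferenceEquals(this, other);
    }
}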
Technically you have to look out for creating too many objects and overflowing the counter here, but in practice I think that's not going to be an issue for anyone.
The problem with this approach is that instances will never compare equal. However, that's perfectly fine if you only want to use instances of Sample as indexes into a collection of some other type.
Constant hash codes
If there is any scenario in which distinct instances should compare equal then at first glance you have no other choice than returning a constant. But where does that leave you?
Locating an instance inside a container will always degenerate to the equivalent of a linear search. So in effect by returning a constant you allow the user to make a keyed container for your class, but that container will exhibit the performance characteristics of a LinkedList<T>. This might be obvious to someone familiar with your class, but personally I see it as letting people shoot themselves in the foot. If you know beforehand that a Dictionary won't behave as one might expect, then why let the user create one? In my view, better to throw NotSupportedException.
But throwing is what you must not do!
Some people will disagree with the above, and when those people are smarter than oneself then one should pay attention. First of all, this code analysis warning states that GetHashCode should not throw. That's something to think about, but let's not be dogmatic. Sometimes you have to break the rules for a reason.
However, that is not all. In his blog post on the subject, Eric Lippert says that if you throw from inside GetHashCode then
your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.
Losing LINQ is certainly a bummer, but fortunately the road does not end here. Many (all?) LINQ methods that use hash tables have overloads that accept an IEqualityComparer<T> to be used when hashing. So you can in fact use LINQ, but it's going to be less convenient.
In the end you will have to weigh the options yourself. My opinion is that it's better to operate with a whitelist strategy (provide an IEqualityComparer<T> whenever needed) as long as it is technically feasible because that makes the code explicit: if someone tries to use the class naively they get an exception that helpfully tells them what's going on and the equality comparer is visible in the code wherever it is used, making the extraordinary behavior of the class immediately clear.
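To make the whitelist strategy concrete, here is a minimal sketch. Order, OrderNumberComparer and OrderQueries are hypothetical names introduced for illustration. The class throws from GetHashCode, yet Distinct still works because the overload that accepts an IEqualityComparer<T> asks the comparer for hash codes instead of the object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type that refuses to be hashed implicitly.
public sealed class Order
{
    public int Number { get; set; }

    public override bool Equals(object obj) => obj is Order o && o.Number == Number;

    public override int GetHashCode() =>
        throw new NotSupportedException("Pass an IEqualityComparer<Order> explicitly.");
}

// The explicit comparer makes the equality rules visible at every call site.
public sealed class OrderNumberComparer : IEqualityComparer<Order>
{
    public bool Equals(Order x, Order y) =>
        object.ReferenceEquals(x, y) || (x != null && y != null && x.Number == y.Number);

    public int GetHashCode(Order obj) => obj.Number;
}

public static class OrderQueries
{
    // Distinct uses the supplied comparer, so Order.GetHashCode is never called.
    public static int CountDistinct(IEnumerable<Order> orders) =>
        orders.Distinct(new OrderNumberComparer()).Count();
}
```

Calling `orders.Distinct()` without the comparer would hit the NotSupportedException, which is exactly the "helpful exception" behaviour argued for above.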
Where I want to override Equals, but there is no sensible immutable "key" for an object (and for whatever reason it doesn't make sense to make the whole object immutable), in my opinion there is only one "correct" choice:
Implement GetHashCode to hash the same fields as Equals uses. (This might be all the fields.)
Document that these fields must not be altered while in a dictionary.
Trust that users either don't put these objects in dictionaries, or obey the second rule.
(Returning a constant value compromises dictionary performance. Throwing an exception disallows too many useful cases where objects are cached but not modified. Any other implementation for GetHashCode would be wrong.)
Where this runs the user into trouble anyway, it's probably their fault. (Specifically: using a dictionary where they shouldn't, or using a model type in a context where they should be using a view-model type that uses reference equality instead.)
Or perhaps I shouldn't be overriding Equals in the first place.
If the classes truly contain nothing constant on which a hash value can be calculated then I would use something simpler than a GUID. Just use a random number persisted in the class (or in a wrapper class).
A simple approach is to store the hashCode in a private member and generate it on the first use. If your entity doesn't change often, and you're not going to be using two different objects that are Equal (where your Equals method returns true) as keys in your dictionary, then this should be fine:
private int? _hashCode;

public override int GetHashCode()
{
    // Computed on first use and cached; later property changes do not affect it.
    if (!_hashCode.HasValue)
    {
        // Combine the hash codes of whatever properties your Equals method compares.
        _hashCode = Property1.GetHashCode() ^ Property2.GetHashCode();
    }
    return _hashCode.Value;
}
However, suppose you have object a and object b, where a.Equals(b) == true, and you store an entry in your dictionary using a as the key (dictionary[a] = value).
If a does not change, then dictionary[b] will return value. If you change a after storing the entry in the dictionary, however, dictionary[b] will most likely fail.
The only workaround to this is to rehash the dictionary when any of the keys change.
I'm implementing a special case of an immutable dictionary, which for convenience implements IEnumerable<KeyValuePair<Foo, Bar>>. Operations that would ordinarily modify the dictionary should instead return a new instance.
So far so good. But when I try to write a fluent-style unit test for the class, I find that neither of the two fluent assertion libraries I've tried (Should and Fluent Assertions) supports the NotBeSameAs() operation on objects that implement IEnumerable -- not unless you first cast them to Object.
When I first ran into this, with Should, I assumed that it was just a hole in the framework, but when I saw that Fluent Assertions had the same hole, it made me think that (since I'm a relative newcomer to C#) I might be missing something conceptual about C# collections -- the author of Should implied as much when I filed an issue.
Obviously there are other ways to test this -- cast to Object and use NotBeSameAs(), just use Object.ReferenceEquals, whatever -- but if there's a good reason not to, I'd like to know what that is.
An IEnumerable<T> is not necessarily a real object. IEnumerable<T> guarantees that you can enumerate through its states. In simple cases you have a container class like a List<T> that is already materialized. Then you could compare both Lists' addresses. However, your IEnumerable<T> might also point to a sequence of commands that will be executed once you enumerate. Basically a state machine:
public IEnumerable<int> GetInts()
{
    yield return 10;
    yield return 20;
    yield return 30;
}
If you save this in a variable, you don't have a comparable object (everything is an object, so you do... but it's not meaningful):
var x = GetInts();
Your comparison only works for materialized (.ToList() or .ToArray()) IEnumerables, because those state machines have been evaluated and their results saved to a collection. So yes, the library's behaviour actually makes sense. If you know you have materialized IEnumerables, you need to make this knowledge explicit by casting them to Object and calling the desired function on that object "manually".
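A small sketch of why reference identity says little about a lazy sequence. SequenceIdentityDemo is a hypothetical wrapper around the GetInts iterator above: every call to an iterator method produces a new state machine object, so two calls are never the same reference even though they yield identical elements.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SequenceIdentityDemo
{
    public static IEnumerable<int> GetInts()
    {
        yield return 10;
        yield return 20;
        yield return 30;
    }

    // Each call creates a fresh iterator object, so the references differ.
    public static bool SameReference() => object.ReferenceEquals(GetInts(), GetInts());

    // Enumerating both sequences yields the same elements in the same order.
    public static bool SameContents() => GetInts().SequenceEqual(GetInts());
}
```

SameReference() returns false while SameContents() returns true, which is presumably why the assertion libraries steer enumerables towards content-based comparisons.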
In addition to what Jon Skeet suggested, take a look at this February 2013 MSDN article from Ted Neward:
.NET Collections, Part 2: Working with C5
Immutable (Guarded) Collections
With the rise of functional concepts and programming styles, a lot of emphasis has swung to immutable data and immutable objects, largely because immutable objects offer a lot of benefits vis-à-vis concurrency and parallel programming, but also because many developers find immutable objects easier to understand and reason about. Corollary to that concept, then, follows the concept of immutable collections: the idea that regardless of whether the objects inside the collection are immutable, the collection itself is fixed and unable to change (add or remove) the elements in the collection. (Note: You can see a preview of immutable collections released on NuGet in the MSDN Base Class Library (BCL) blog at bit.ly/12AXD78.)
It describes the use of an open source library of collection goodness called C5.
Look at http://itu.dk/research/c5/
I was attempting to use ObjectIDGenerator in C# to generate a unique ID during serialization, however, this class is not available in the XBox360 or Windows Phone 7 .NET frameworks (they use a compact version of .NET). I implemented a version using a dictionary of Object to Int64 and was able to get a fully working version up, however, the performance is unsatisfactory. I'm serializing on the order of tens of thousands of objects, and currently this is the greatest bottleneck in save/load performance. Using the actual .NET implementation on PC it takes about 0.3 seconds to serialize about 20,000 objects. Using my implementation, it takes about 6 seconds.
In profiling, I found that the heavy hitters were .TryGetValue and .Add on the dictionary (which makes sense since it's both indexing and adding to the hash map). More importantly, the virtual equality operator was being called instead of simply comparing references, so I implemented an IEqualityComparer that only used ReferenceEquals (this resulted in a speed increase).
Does anyone have an insight into a better implementation of ObjectIDGenerator? Thanks for your help!
My Implementation: http://pastebin.com/H1skZwNK
[Edit]
Another note: the profiling results say that the object comparison / ReferenceEquals is still the bottleneck, with a hit count of 43,000,000. I'm wondering if there's a way to store data alongside this object without having to look it up in a hash map...
Is it possible to use an Int32 Id property / handle for each object rather than Object? That may help things. It looks like you're assigning an Id type number to each object anyway, only you're then looking up based on the Object reference instead of the Id. Can you persist the object id (your Int64) within each object and make your dictionary into Dictionary<Int64, Object> instead?
You might also want to see if SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> perform better or worse. But if your main bottleneck is in your IEqualityComparer, these might not help very much.
UPDATE
After looking at the ObjectIDGenerator class API, I can see why you can't do what I advised at first; you're creating the ids!
ObjectIDGenerator seems to be manually implementing its own hash table (it allocates an object[] and a parallel long[] and resizes them as objects are added). It also uses RuntimeHelpers.GetHashCode(Object) to calculate its hash rather than an IEqualityComparer, which may be a big boost to your perf as it's always calling Object.GetHashCode() and not doing the virtual call on the derived type (or the interface call in your case with IEqualityComparer).
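If a dictionary-based version is kept, the comparer can at least use the same non-virtual hash. The following is a sketch under assumptions, not the framework's implementation: ReferenceComparer and SimpleIdGenerator are hypothetical names, and it assumes RuntimeHelpers.GetHashCode exists on the target platform, which I have not verified for the compact frameworks mentioned in the question.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares and hashes by object identity, ignoring any Equals/GetHashCode overrides.
public sealed class ReferenceComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y) => object.ReferenceEquals(x, y);

    int IEqualityComparer<object>.GetHashCode(object obj) => RuntimeHelpers.GetHashCode(obj);
}

// Hypothetical stand-in for ObjectIDGenerator: assigns increasing ids per instance.
public sealed class SimpleIdGenerator
{
    private readonly Dictionary<object, long> _ids =
        new Dictionary<object, long>(new ReferenceComparer());
    private long _next = 1;

    public long GetId(object obj, out bool firstTime)
    {
        if (_ids.TryGetValue(obj, out long id))
        {
            firstTime = false;
            return id;
        }

        firstTime = true;
        id = _next++;
        _ids.Add(obj, id);
        return id;
    }
}
```

This still goes through an interface call per lookup, so a hand-rolled table like the one described above may remain faster; measuring on the device is the only way to know.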
You can see the source for yourself via the Microsoft Shared Source Initiative:
I have implemented an evolutionary algorithm in C# which kind of works, but I suspect that the cloning does not. The algorithm maintains a population of object trees. Each object (tree) can be the result of cloning (e.g. after ‘natural selection’) and should be a unique object consisting of unique objects. Is there a simple way to determine whether the ‘object population’ contains unique/distinct objects – in other words whether objects are shared by more than one tree? I hope my question makes sense.
Thanks.
Best wishes,
Christian
PS: I implemented (I think) deep copy cloning via serialization see:
http://www.codeproject.com/KB/tips/SerializedObjectCloner.aspx
The way to verify whether two objects are the same objects in memory is by comparing them using Object.ReferenceEquals. This checks whether the "Pointers" are the same.
Cloning in C# is by default a shallow copy. The keyword you probably need to google for tutorials is "deep cloning", in order to create object graphs that don't share references.
What about the following: add a static counter to every class that your 'main tree class' references as a member. Increment it in every constructor. Determine how many of each 'subobject' should be contained in a tree object (or in all tree objects) and compare that to the counter.
OK, first, let me see if I understand you correctly:
RESULT is an object tree that has some data.
GENERATION is a collection of RESULT objects.
You have some 'evolution' method that moves each GENERATION to the next step.
If you want to check if one RESULT is equal to the other, you should implement IComparable on it, and for each of its members, do the same.
ADDITION:
Try to get rid of that kind of cloning, and make the clones manually - it WILL be faster. And speed is crucial to you here, because all heuristic comes down to muscle.
If what you're asking is "how do I check if two object variables refer to the same bit of memory?", you can call Object.ReferenceEquals.
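Building on Object.ReferenceEquals, the population can be scanned for shared nodes in one pass. This is a minimal sketch with a hypothetical Node type that exposes its children as a list; the real tree classes will need their own traversal.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical tree node; adapt the traversal to the real object model.
public sealed class Node
{
    public List<Node> Children { get; } = new List<Node>();
}

public static class SharedNodeChecker
{
    // Identity-based comparer, so overridden Equals methods cannot hide sharing.
    private sealed class ByReference : IEqualityComparer<Node>
    {
        public bool Equals(Node x, Node y) => object.ReferenceEquals(x, y);

        public int GetHashCode(Node obj) => RuntimeHelpers.GetHashCode(obj);
    }

    // Returns true if any node instance is reachable more than once.
    public static bool AnyShared(IEnumerable<Node> roots)
    {
        var seen = new HashSet<Node>(new ByReference());
        var pending = new Stack<Node>(roots);

        while (pending.Count > 0)
        {
            Node node = pending.Pop();
            if (!seen.Add(node))
                return true; // this instance was already visited

            foreach (Node child in node.Children)
                pending.Push(child);
        }

        return false;
    }
}
```

Passing all trees of a generation as roots reports sharing both within a tree and between trees.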
I have a class Customer (with typical customer properties) and I need to pass around, and databind, a "chunk" of Customer instances. Currently I'm using an array of Customer, but I've also used Collection of T (and List of T before I knew about Collection of T). I'd like the thinnest way to pass this chunk around using C# and .NET 3.5.
Currently, the array of Customer is working just fine for me. It data binds well and seems to be as lightweight as it gets. I don't need the stuff List of T offers and Collection of T still seems like overkill. The array does require that I know ahead of time how many Customers I'm adding to the chunk, but I always know that in advance (given rows in a page, for example).
Am I missing something fundamental or is the array of Customer OK? Is there a tradeoff I'm missing?
Also, I'm assuming that Collection of T makes the old loosely-typed ArrayList obsolete. Am I right there?
Yes, Collection<T> (or List<T> more commonly) makes ArrayList pretty much obsolete. In particular, I believe ArrayList isn't even supported in Silverlight 2.
Arrays are okay in some cases, but should be considered somewhat harmful - they have various disadvantages. (They're at the heart of the implementation of most collections, of course...) I'd go into more details, but Eric Lippert does it so much better than I ever could in the article referenced by the link. I would summarise it here, but that's quite hard to do. It really is worth just reading the whole post.
No one has mentioned the Framework Guidelines advice: Don't use List<T> in public APIs:
We don’t recommend using List<T> in public APIs for two reasons.
List<T> is not designed to be extended. i.e. you cannot override any members. This for example means that an object returning List<T> from a property won’t be able to get notified when the collection is modified. Collection<T> lets you overrides SetItem protected member to get “notified” when a new items is added or an existing item is changed.
List<T> has lots of members that are not relevant in many scenarios. We say that List<T> is too “busy” for public object models. Imagine ListView.Items property returning List<T> with all its richness. Now, look at the actual ListView.Items return type; it’s way simpler and similar to Collection<T> or ReadOnlyCollection<T>
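The extensibility point mentioned in the quote can be sketched as follows. NotifyingCollection<T> and its ItemSet event are hypothetical names; the overridden InsertItem and SetItem methods are the protected virtual members that Collection<T> provides.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical collection that reports additions and replacements.
public sealed class NotifyingCollection<T> : Collection<T>
{
    // Raised with the index and the new item after an insert or a replacement.
    public event Action<int, T> ItemSet;

    // Called by Add and Insert.
    protected override void InsertItem(int index, T item)
    {
        base.InsertItem(index, item);
        ItemSet?.Invoke(index, item);
    }

    // Called by the indexer setter.
    protected override void SetItem(int index, T item)
    {
        base.SetItem(index, item);
        ItemSet?.Invoke(index, item);
    }
}
```

List<T> offers no such hooks, which is the first of the two reasons given in the guideline.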
Also, if your goal is two-way Databinding, have a look at BindingList<T> (with the caveat that it is not sortable 'out of the box'!)
Generally, you should 'pass around' IEnumerable<T> or ICollection<T> (depending on whether it makes sense for your consumer to add items).
If you have an immutable list of customers, that is... your list of customers will not change, it's relatively small, and you will always iterate over it first to last and you don't need to add to the list or remove from it, then an array is probably just fine.
If you're unsure, however, then your best bet is a collection of some type. What collection you choose depends on the operations you wish to perform on it. Collections are all about inserts, manipulations, lookups, and deletes. If you do frequent searches for a given element, then a dictionary may be best. If you need to sort the data, then perhaps a SortedList will work better.
I wouldn't worry about "lightweight", unless you're talking a massive number of elements, and even then the advantages of O(1) lookups outweigh the costs of resources.
When you "pass around" a collection, you're only passing a reference, which is basically a pointer. So there is no performance difference between passing a collection and an array.
I'm going to put in a dissenting argument to both Jon and Eric Lippert (which means that you should be very wary of my answer, indeed!).
The heart of Eric Lippert's arguments against arrays is that the contents are mutable, while the size of the data structure itself is not. With regards to returning them from methods, the contents of a List are just as mutable. In fact, because you can add or remove elements from a List, I would argue that this makes the return value more mutable than an array.
The other reason I'm fond of arrays is that some time back I had a small section of performance-critical code, so I benchmarked the performance characteristics of the two, and arrays blew Lists out of the water. Now, let me caveat this by saying it was a narrow test for how I was going to use them in a specific situation, and it goes against what I understand of both, but the numbers were wildly different.
Anyway, listen to Jon and Eric =), and I agree that List almost always makes more sense.
I agree with Alun, with one addition. If you may want to address the return value by subscript myArray[n], then use an IList.
An Array inherently supports IList (as well as IEnumerable and ICollection, for that matter). So if you pass by interface, you can still use the array as your underlying data structure. In this way, the methods that you are passing the array into don't have to "know" that the underlying datastructure is an array:
public void Test()
{
    IList<Item> test = MyMethod();
}

public IList<Item> MyMethod()
{
    Item[] items = new Item[] { new Item() };
    return items;
}
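One caveat when handing out an array as IList<T>: the interface advertises Add and Remove, but an array is fixed-size, so those calls fail at run time. A minimal sketch (ArrayAsListDemo is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;

public static class ArrayAsListDemo
{
    // The caller only sees IList<int>; the backing store is an array.
    public static IList<int> MakeList() => new[] { 1, 2, 3 };

    public static bool AddIsRejected()
    {
        IList<int> list = MakeList();

        // Replacing an element through the indexer is allowed.
        list[0] = 42;

        try
        {
            // Arrays cannot grow, so Add throws NotSupportedException.
            list.Add(4);
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }
}
```

If callers are not supposed to add items, returning IEnumerable<T> or a read-only wrapper states that intent more clearly than IList<T>.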