Unlike SortedSet<T> or SortedDictionary<TKey, TValue> from the BCL, the Sorted Set in Redis is much more powerful:
It allows access by member, much like a dictionary.
Ordering is determined by a score, a field separate from the member, and duplicate scores are allowed.
It allows retrieving the min/max element with acceptable time complexity.
I don't want to combine a Dictionary + SortedSet to achieve my goal.
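For reference, the combination I'm trying to avoid would look something like the minimal sketch below (all names are illustrative, not from any library); keeping the two structures in sync by hand is exactly the bookkeeping I don't want to write:

using System;
using System.Collections.Generic;

class ScoredSet<TMember> where TMember : IComparable<TMember>
{
    // member -> score, for dictionary-style access by member
    private readonly Dictionary<TMember, double> byMember =
        new Dictionary<TMember, double>();

    // (score, member) pairs ordered by score first; the member breaks ties,
    // so duplicate scores are allowed
    private readonly SortedSet<(double Score, TMember Member)> byScore =
        new SortedSet<(double Score, TMember Member)>();

    public void Add(TMember member, double score)
    {
        if (byMember.TryGetValue(member, out double old))
            byScore.Remove((old, member)); // both structures must change together
        byMember[member] = score;
        byScore.Add((score, member));
    }

    public double this[TMember member] => byMember[member];   // access by member
    public (double Score, TMember Member) Min => byScore.Min; // min by score
    public (double Score, TMember Member) Max => byScore.Max; // max by score
}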
I am looking at the C5 library, but I'm not quite sure which class is the best fit for me.
UPDATE:
I finally found this, which matches my requirements very well.
Alright, so I'm taking everything I've learned and trying to implement it in C#. Given that I have a background in Java, my ride has been pretty smooth so far, but I'm running into issues using the Comparer object and functions, etc. I don't care about direct implementation/translation, but I want to know how C# compares two generic values. What does it use to sort them? Hashcode, or maybe some C#-specific methodology?
So just to clarify, I know how to sort, search, etc. using methods in C#. What I want to know is what's going on under the hood - what are the Comparer and other functions using to compare two values of generics?
I want to know how C# compares two generic values
It doesn't/can't; that is why there are the IComparable and IComparer interfaces.
What I want to know is what's going on under the hood
If you're talking about types provided by .NET, then:
If you have an array of types (such as string or integer) that already support IComparer, you can sort that array without providing any explicit reference to IComparer. In that case, the elements of the array are cast to the default implementation of IComparer (Comparer.Default) for you.
How to use the IComparable and IComparer interfaces in Visual C# is probably the best article I've seen specific to your question.
The role of IComparable is to provide a method of comparing two objects of a particular type
The role of IComparer is to provide additional comparison mechanisms. For example, you may want to provide ordering of your class on several fields or properties, ascending and descending order on the same field, or both.
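To make the distinction concrete, here is a small sketch (the Person type and its fields are made up for illustration):

using System;
using System.Collections.Generic;

// IComparable<T> supplies the type's single "natural" ordering.
class Person : IComparable<Person>
{
    public string Name { get; set; }
    public int Age { get; set; }

    public int CompareTo(Person other)
    {
        return Age.CompareTo(other.Age); // natural order: by Age
    }
}

// IComparer<T> supplies an additional ordering from outside the type.
class ByNameComparer : IComparer<Person>
{
    public int Compare(Person x, Person y)
    {
        return string.Compare(x.Name, y.Name, StringComparison.Ordinal);
    }
}

// people.Sort();                     // uses CompareTo (by Age)
// people.Sort(new ByNameComparer()); // uses the explicit comparer (by Name)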
Ok, before you get all mad because there are hundreds of similar sounding questions posted on the internet, I can assure you that I have just spent the last few hours reading all of them and have not found the answer to my question.
Background:
Basically, one of my large scale applications had been suffering from a situation where some Bindings on the ListBox.SelectedItem property would stop working or the program would crash after an edit had been made to the currently selected item. I initially asked the 'An item with the same key has already been added' Exception on selecting a ListBoxItem from code question here, but got no answers.
I hadn't had time to address that problem until this week, when I was given a number of days to sort it out. Now to cut a long story short, I found out the reason for the problem. It was because my data type classes had overridden the Equals method and therefore the GetHashCode method as well.
Now for those of you that are unaware of this issue, I discovered that you can only implement the GetHashCode method using immutable fields/properties. Using an excerpt from Harvey Kwok's answer to the Overriding GetHashCode() post to explain this:
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
So the actual problem was caused because I had used mutable properties in the GetHashCode methods. When users changed these property values in the UI, the associated hash code values of the objects changed and then items could no longer be found in their collections.
Question:
So, my question is what is the best way of handling the situation where I need to implement the GetHashCode method in classes with no immutable fields? Sorry, let me be more specific, as that question has been asked before.
The answers in the Overriding GetHashCode() post suggest that in these situations, it is better to simply return a constant value... some suggest returning the value 1, while others suggest returning a prime number. Personally, I can't see any difference between these suggestions, because I would have thought that either way only one bucket would be used.
Furthermore, the Guidelines and rules for GetHashCode article in Eric Lippert's Blog has a section titled Guideline: the distribution of hash codes must be "random" which highlights the pitfalls of using an algorithm that results in not enough buckets being used. He warns of algorithms that decrease the number of buckets used and cause a performance problem when the bucket gets really big. Surely, returning a constant falls into this category.
I had an idea of adding an extra Guid field to all of my data type classes (just in C#, not the database) specifically to be used in and only in the GetHashCode method. So I suppose at the end of this long intro, my actual question is which implementation is better? To summarise:
Summary:
When overriding Object.GetHashCode() in classes with no immutable fields, is it better to return a constant from the GetHashCode method, or to create an additional readonly field for each class, solely to be used in the GetHashCode method? If I should add a new field, what type should it be and shouldn't I then include it in the Equals method?
While I am happy to receive answers from anyone, I am really hoping to receive answers from advanced developers with a sound knowledge on this subject.
Go back to basics. You read my article; read it again. The two ironclad rules that are relevant to your situation are:
if x equals y then the hash code of x must equal the hash code of y. Equivalently: if the hash code of x does not equal the hash code of y then x and y must be unequal.
the hash code of x must remain stable while x is in a hash table.
Those are requirements for correctness. If you can't guarantee those two simple things then your program will not be correct.
You propose two solutions.
Your first solution is that you always return a constant. That meets the requirement of both rules, but you are then reduced to linear searches in your hash table. You might as well use a list.
The other solution you propose is to somehow produce a hash code for each object and store it in the object. That is perfectly legal provided that equal items have equal hash codes. If you do that then you are restricted such that x equals y must be false if the hash codes differ. This seems to make value equality basically impossible. Since you wouldn't be overriding Equals in the first place if you wanted reference equality, this seems like a really bad idea, but it is legal provided that equals is consistent.
I propose a third solution, which is: never put your object in a hash table, because a hash table is the wrong data structure in the first place. The point of a hash table is to quickly answer the question "is this given value in this set of immutable values?" and you don't have a set of immutable values, so don't use a hash table. Use the right tool for the job. Use a list, and live with the pain of doing linear searches.
A fourth solution is: hash on the mutable fields used for equality, remove the object from all hash tables it is in just before every time you mutate it, and put it back in afterwards. This meets both requirements: the hash code agrees with equality, and the hashes of objects in hash tables remain stable. And you still get fast lookups.
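A minimal sketch of that discipline (the Item type here is hypothetical; the point is only the remove/mutate/re-add pattern around the HashSet):

using System.Collections.Generic;

class Item
{
    public string Key;
    public override bool Equals(object o) { return o is Item i && i.Key == Key; }
    public override int GetHashCode() { return Key.GetHashCode(); }
}

class Demo
{
    static void Main()
    {
        var set = new HashSet<Item>();
        var item = new Item { Key = "a" };
        set.Add(item);

        // Mutating item.Key in place would strand the object in the wrong
        // bucket. Instead: remove first, mutate, then re-add.
        set.Remove(item);
        item.Key = "b";
        set.Add(item);
    }
}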
I would either create an additional readonly field or else throw NotSupportedException. In my view the other option is meaningless. Let's see why.
Distinct (fixed) hash codes
Providing distinct hash codes is easy, e.g.:
class Sample
{
    private static int counter;
    private readonly int hashCode;

    // Each instance captures a distinct value at construction time,
    // so its hash code is stable for the lifetime of the object.
    public Sample() { this.hashCode = counter++; }

    public override int GetHashCode()
    {
        return this.hashCode;
    }

    public override bool Equals(object other)
    {
        // Reference equality: two distinct instances never compare equal.
        return object.ReferenceEquals(this, other);
    }
}
Technically you have to look out for creating too many objects and overflowing the counter here, but in practice I think that's not going to be an issue for anyone.
The problem with this approach is that instances will never compare equal. However, that's perfectly fine if you only want to use instances of Sample as indexes into a collection of some other type.
Constant hash codes
If there is any scenario in which distinct instances should compare equal then at first glance you have no other choice than returning a constant. But where does that leave you?
Locating an instance inside a container will always degenerate to the equivalent of a linear search. So in effect by returning a constant you allow the user to make a keyed container for your class, but that container will exhibit the performance characteristics of a LinkedList<T>. This might be obvious to someone familiar with your class, but personally I see it as letting people shoot themselves in the foot. If you know from beforehand that a Dictionary won't behave as one might expect, then why let the user create one? In my view, better to throw NotSupportedException.
But throwing is what you must not do!
Some people will disagree with the above, and when those people are smarter than oneself then one should pay attention. First of all, this code analysis warning states that GetHashCode should not throw. That's something to think about, but let's not be dogmatic. Sometimes you have to break the rules for a reason.
However, that is not all. In his blog post on the subject, Eric Lippert says that if you throw from inside GetHashCode then
your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.
Losing LINQ is certainly a bummer, but fortunately the road does not end here. Many (all?) LINQ methods that use hash tables have overloads that accept an IEqualityComparer<T> to be used when hashing. So you can in fact use LINQ, but it's going to be less convenient.
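For example, a sketch under the assumption that the class throws from GetHashCode (the ThrowingSample type and its Name field are made up for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

class ThrowingSample
{
    public string Name { get; set; }
    public override int GetHashCode()
    {
        throw new NotSupportedException("Use an explicit IEqualityComparer.");
    }
}

class NameComparer : IEqualityComparer<ThrowingSample>
{
    public bool Equals(ThrowingSample x, ThrowingSample y) { return x.Name == y.Name; }
    public int GetHashCode(ThrowingSample obj) { return obj.Name.GetHashCode(); }
}

class Demo
{
    static void Main()
    {
        var items = new List<ThrowingSample>
        {
            new ThrowingSample { Name = "a" },
            new ThrowingSample { Name = "a" },
        };

        // items.Distinct() would throw, because it hashes with the type's own
        // GetHashCode; the comparer overload works and makes the choice explicit.
        int count = items.Distinct(new NameComparer()).Count();
        Console.WriteLine(count); // 1
    }
}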
In the end you will have to weigh the options yourself. My opinion is that it's better to operate with a whitelist strategy (provide an IEqualityComparer<T> wherever needed) as long as it is technically feasible, because it makes the code explicit: if someone tries to use the class naively, they get an exception that helpfully tells them what's going on, and the equality comparer is visible in the code wherever it is used, making the extraordinary behavior of the class immediately clear.
Where I want to override Equals, but there is no sensible immutable "key" for an object (and for whatever reason it doesn't make sense to make the whole object immutable), in my opinion there is only one "correct" choice:
Implement GetHashCode to hash the same fields as Equals uses. (This might be all the fields.)
Document that these fields must not be altered while in a dictionary.
Trust that users either don't put these objects in dictionaries, or obey the second rule.
(Returning a constant value compromises dictionary performance. Throwing an exception disallows too many useful cases where objects are cached but not modified. Any other implementation for GetHashCode would be wrong.)
Where this runs the user into trouble anyway, it's probably their fault. (Specifically: using a dictionary where they shouldn't, or using a model type in a context where they should be using a view-model type that uses reference equality instead.)
Or perhaps I shouldn't be overriding Equals in the first place.
If the classes truly contain nothing constant on which a hash value can be calculated then I would use something simpler than a GUID. Just use a random number persisted in the class (or in a wrapper class).
A simple approach is to store the hashCode in a private member and generate it on the first use. If your entity doesn't change often, and you're not going to be using two different objects that are Equal (where your Equals method returns true) as keys in your dictionary, then this should be fine:
private int? _hashCode;

public override int GetHashCode()
{
    if (!_hashCode.HasValue)
    {
        // Combine the hash codes of whatever properties your Equals method
        // uses; computed once and cached on first use.
        _hashCode = Property1.GetHashCode() ^ Property2.GetHashCode();
    }
    return _hashCode.Value;
}
However, suppose you have object a and object b, where a.Equals(b) == true, and you store an entry in your dictionary using a as the key (dictionary[a] = value).
If a does not change, then dictionary[b] will return value; however, if you change a after storing the entry in the dictionary, then dictionary[b] will most likely fail.
The only workaround to this is to rehash the dictionary when any of the keys change.
For a small set of key/value pairs (default 2, max 5), a Dictionary<TKey, TValue> seems like overkill. Is there a much simpler data structure that could be used in my case? I'm caching computed values for certain objects (i.e. <MyClass, double>), so retrieval speed is important.
Thanks
A List<KeyValuePair<TKey, TValue>> (created with an appropriate capacity) would probably work just as well in this case... but it wouldn't be terribly idiomatic. (Just to be clear, you'd simply call Equals on each key element, ignoring the hash code completely.) If List<T> feels a bit heavy to you, you could even go down to KeyValuePair<TKey, TValue>[] if you wanted. Ick, but hey... it's your code.
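A sketch of that approach (the helper name is mine, not from the BCL):

using System.Collections.Generic;

static class TinyMap
{
    // Linear lookup: call Equals on each key, ignore hash codes entirely.
    public static bool TryGetValue<TKey, TValue>(
        List<KeyValuePair<TKey, TValue>> pairs, TKey key, out TValue value)
    {
        foreach (var pair in pairs)
        {
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
            {
                value = pair.Value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}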
Have you actually tried Dictionary<TKey, TValue> and found it to be too slow? "Seems like overkill" doesn't seem nearly as good an argument as "I've tried it, profiled it, and found an unacceptable amount of my application's time is spent creating dictionaries and looking up entries in them. I need my application to have performance characteristic X and at the moment I only have Y."
If your key type has a particular ordering (and if you were going to perform more lookups on the data structure than you were going to create instances) you could sort the list, meaning you would have a maximum of 3 comparisons for any particular lookup. With only 5 entries you could even hard-code all the potential paths, if you were looking to optimize to the hilt. (You might even have different implementations for 2, 3, 4 and 5 elements. It's getting a little silly at that point though.) This is basically a SortedList<TKey, TValue> implementation, but you may be able to optimize it a little for your scenario of only having a few entries. Again, it's worth trying the built-in types first.
What's vital is that you know how important this part of your code really is to your overall performance - and when it will be "good enough" so you can stop appropriately.
If the set of keys is known at compile time, then you could simply create a class (or struct) with nullable properties that hold the values.
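For instance (a sketch; the property names are placeholders for whatever the computed values actually are):

// One nullable property per known key; null means "not computed yet".
struct CachedValues
{
    public double? Mean;
    public double? Variance;
}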
If you use an array like KeyValuePair<TKey, TValue>[], you could sort it and then search using a binary search. But this is only fast if you have to sort once and then retrieve many times.
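A sketch of that sort-once, search-many pattern, assuming int keys for illustration:

using System;
using System.Collections.Generic;

class PairSearchDemo
{
    static void Main()
    {
        var pairs = new[]
        {
            new KeyValuePair<int, double>(3, 0.3),
            new KeyValuePair<int, double>(1, 0.1),
            new KeyValuePair<int, double>(2, 0.2),
        };

        // Order by key, once.
        var byKey = Comparer<KeyValuePair<int, double>>.Create(
            (a, b) => a.Key.CompareTo(b.Key));
        Array.Sort(pairs, byKey);

        // Then each lookup is O(log n). The value in the probe pair is ignored.
        int index = Array.BinarySearch(pairs, new KeyValuePair<int, double>(2, 0), byKey);
        if (index >= 0)
            Console.WriteLine(pairs[index].Value); // 0.2
    }
}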
I often like to use the Hashtable class (http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx).
Another thing you can do though, is not worry about managing any cache yourself and just use the caching from ASP.NET. All you have to do is reference the System.Web assembly and then you can use it even in non-web applications. Here's an article on using the .NET cache code in a Windows Forms app. It's really simple.
http://www.codeproject.com/KB/cs/cacheinwinformapps.aspx
D.
I'm looking at List and I see a BinarySearch method with a few overloads, and I can't help wondering if it makes sense at all to have a method like that in List?
Why would I want to do a binary search unless the list was sorted? And if the list wasn't sorted, calling the method would just be a waste of CPU time. What's the point of having that method on List?
I note in addition to the other correct answers that binary search is surprisingly hard to write correctly. There are lots of corner cases and some tricky integer arithmetic. Since binary search is obviously a common operation on sorted lists, the BCL team did the world a service by writing the binary search algorithm correctly once rather than encouraging customers to all write their own binary search algorithm; a significant number of those customer-authored algorithms would be wrong.
Sorting and searching are two very common operations on lists. It would be unfriendly to limit a developer's options by not offering binary search on a regular list.
Library design requires compromises - the .NET designers chose to offer the binary search function on both arrays and lists in C# because they likely felt (as I do) that these are useful and common operations, and programmers who choose to use them understand their prerequisites (namely that the list is ordered) before calling them.
It's easy enough to sort a List<T> using one of the Sort() overloads. If you feel that you need an invariant that guarantees sorting, you can always use SortedList<TKey,TValue> or SortedSet<T> instead.
BinarySearch only makes sense on a List<T> that is sorted, just like IList<T>.Add only makes sense for an IList<T> with IsReadOnly = false. It's messy, but it's just something to deal with: sometimes functionality X depends on criterion Y. The fact that Y isn't always true doesn't make X useless.
Now, in my opinion, it's frustrating that .NET doesn't have general Sort and BinarySearch methods for any IList<T> implementation (e.g., as extension methods). If it did, we could easily sort and search for items within any non-read-only collection providing random access.
Then again, you can always write your own (or copy someone else's).
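For instance, a sketch of such an extension method (not part of the BCL):

using System.Collections.Generic;

static class ListExtensions
{
    // Assumes the list is already sorted according to the given comparer.
    public static int BinarySearch<T>(this IList<T> list, T item,
                                      IComparer<T> comparer = null)
    {
        if (comparer == null)
            comparer = Comparer<T>.Default;

        int low = 0, high = list.Count - 1;
        while (low <= high)
        {
            // low + (high - low) / 2 avoids the overflow lurking in
            // (low + high) / 2 -- one of the corner cases mentioned earlier.
            int mid = low + (high - low) / 2;
            int cmp = comparer.Compare(list[mid], item);
            if (cmp == 0)
                return mid;
            if (cmp < 0)
                low = mid + 1;
            else
                high = mid - 1;
        }
        return ~low; // complement of the insertion point, like List<T>.BinarySearch
    }
}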
Others have pointed out that BinarySearch is quite useful on a sorted List<T>. It doesn't really belong on List<T>, though, as anyone with C++ STL experience would immediately recognize.
With recent C# language developments, it makes more sense to define the notion of a sorted list (e.g., ISortedList<T> : IList<T>) and define BinarySearch (et al.) as extension methods of that interface. This is a cleaner, more orthogonal design.
I've started doing just that as part of the Nito.Linq library. I expect the first stable release to be in a couple of months.
Yes, but List<T> has a Sort() method as well, so you can call it before BinarySearch.
Searching and sorting are algorithmic primitives. It's helpful for the standard library to have fast reliable implementations. Otherwise, developers waste time reinventing the wheel.
However, in the case of the .NET Framework, it's unfortunate that the specific choice of algorithms happens to make them less useful than they might be. In some cases, their behaviour is not defined:
List<T>.BinarySearch: "If the List contains more than one element with the same value, the method returns only one of the occurrences, and it might return any one of the occurrences, not necessarily the first one."
List<T>.Sort: "This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal."
That's a shame, because there are deterministic algorithms that are just as fast, and these would be much more useful as building blocks. It's noteworthy that the binary search algorithms in Python, Ruby and Go all find the first matching element.
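To illustrate, a deterministic lower-bound variant in the spirit of Python's bisect_left (a sketch, not a BCL method):

using System.Collections.Generic;

static class SortedSearch
{
    // Returns the index of the first element >= item in a sorted list,
    // or list.Count if every element is smaller.
    public static int LowerBound<T>(IList<T> list, T item, IComparer<T> comparer)
    {
        int low = 0, high = list.Count;
        while (low < high)
        {
            int mid = low + (high - low) / 2;
            if (comparer.Compare(list[mid], item) < 0)
                low = mid + 1;
            else
                high = mid; // keep moving left to find the *first* match
        }
        return low;
    }
}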
I agree it's completely dumb to call BinarySearch on an unsorted list, but it's perfect if you know your large list is sorted.
I've used it when checking if items from a stream exist in a (more or less) static list of 100,000 items or more.
Binary searching the list is orders of magnitude faster than doing a list.Find, which is many orders of magnitude faster than a database lookup.
It makes sense, and I'm glad it's there (not that it would be rocket science to implement it if it wasn't).
Perhaps another point is that an array could equally well be unsorted. So in theory, having a BinarySearch on an array could be invalid too.
However, as with all features of a higher level language, they need to be applied by someone with reason and understanding of the data, or they will fail. Sure, some cross-checks could be applied, and we could have a flag that said "IsSorted" and it would fail on binary search otherwise, but .....
Some pseudo code:
if List is sorted
    use the BinarySearch method
else if List is not sorted and you think sorting it is a "waste of CPU time"
    use a different algorithm that is more suitable and efficient
I want something along the lines of Python's tuples (or, for sets, frozensets), which are hashable. I have a List<String> which is most certainly not hashing correctly (i.e. by value).
You will have to define your own container, possibly wrapping the List, to get useful semantics for equality-hash-equals (GetHashCode and Equals). You could even make the wrapper conform to IList if you like.
To avoid mutability issues and changing GetHashCode/Equals results (which would make use of your new object in a hashing Dictionary problematic!), you should also provide some kind of guard (perhaps make a copy of the input upon creation of your type) and/or document the constraints.
You can use SequenceEqual to implement Equals rather trivially, but you'll need to implement a GetHashCode in a relevant way -- a simple method is a shifting XOR of the GetHashCode of each element.
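A sketch of such a wrapper (assuming the contents are not mutated after wrapping, which the defensive copy enforces):

using System.Collections.Generic;
using System.Linq;

sealed class HashableList<T>
{
    private readonly List<T> items;

    public HashableList(IEnumerable<T> source)
    {
        items = source.ToList(); // defensive copy guards against later mutation
    }

    public override bool Equals(object obj)
    {
        var other = obj as HashableList<T>;
        return other != null && items.SequenceEqual(other.items);
    }

    public override int GetHashCode()
    {
        int hash = 17;
        foreach (var item in items)
        {
            // Shift-and-XOR combination, so that element order matters.
            hash = (hash << 5) ^ (hash >> 27) ^ (item == null ? 0 : item.GetHashCode());
        }
        return hash;
    }
}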
Alternatively, if this is just used in a single Dictionary you can supply a custom IEqualityComparer and avoid creating a wrapped type: Dictionary constructor overload.
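A sketch of that alternative, for a Dictionary keyed by List<string> (the comparer name is mine):

using System.Collections.Generic;
using System.Linq;

sealed class StringListComparer : IEqualityComparer<List<string>>
{
    public bool Equals(List<string> x, List<string> y)
    {
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<string> list)
    {
        int hash = 17;
        foreach (var s in list)
            hash = (hash << 5) ^ (hash >> 27) ^ (s == null ? 0 : s.GetHashCode());
        return hash;
    }
}

// var cache = new Dictionary<List<string>, int>(new StringListComparer());
// The usual caveat applies: don't mutate a list while it is in use as a key.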
It depends on what your final goals are, and there may very well already be such wrapping containers :-)
Note: In .NET4 there is a set of Tuple<...> classes which override GetHashCode and Equals. See cadenza as the 3rd party alternative for prior .NET versions.