For a small set of key/value pairs (default 2, max 5), a Dictionary<TKey, TValue> seems like overkill. Is there a much simpler data structure that could be used in my case? I'm caching computed values for certain objects (i.e. <MyClass, double>), so retrieval speed is important.
Thanks
A List<KeyValuePair<TKey, TValue>> (created with an appropriate capacity) would probably work just as well in this case... but it wouldn't be terribly idiomatic. (Just to be clear, you'd simply call Equals on each key element, ignoring the hash code completely.) If List<T> feels a bit heavy to you, you could even go down to KeyValuePair<TKey, TValue>[] if you wanted. Ick, but hey... it's your code.
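A minimal sketch of that idea (SmallCache and its members are illustrative names, not anything from the BCL):

```csharp
using System.Collections.Generic;

// A tiny cache backed by a List<KeyValuePair<TKey, TValue>>.
// With only a handful of entries, a linear scan with Equals is cheap
// and skips hashing entirely.
class SmallCache<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>> entries;

    public SmallCache(int capacity) =>
        entries = new List<KeyValuePair<TKey, TValue>>(capacity);

    public void Add(TKey key, TValue value) =>
        entries.Add(new KeyValuePair<TKey, TValue>(key, value));

    public bool TryGetValue(TKey key, out TValue value)
    {
        // Linear scan: just call Equals on each key, no hash codes involved.
        foreach (var pair in entries)
        {
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
            {
                value = pair.Value;
                return true;
            }
        }
        value = default;
        return false;
    }
}
```

With 2–5 entries the scan touches at most five keys per lookup, so the constant factors matter more than the big-O.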
Have you actually tried Dictionary<TKey, TValue> and found it to be too slow? "Seems like overkill" doesn't seem nearly as good an argument as "I've tried it, profiled it, and found an unacceptable amount of my application's time is spent creating dictionaries and looking up entries in them. I need my application to have performance characteristic X and at the moment I only have Y."
If your key type has a particular ordering (and if you were going to perform more lookups on the data structure than you were going to create instances) you could sort the list, meaning you would have a maximum of 3 comparisons for any particular lookup. With only 5 entries you could even hard-code all the potential paths, if you were looking to optimize to the hilt. (You might even have different implementations for 2, 3, 4 and 5 elements. It's getting a little silly at that point though.) This is basically a SortedList<TKey, TValue> implementation, but you may be able to optimize it a little for your scenario of only having a few entries. Again, it's worth trying the built-in types first.
What's vital is that you know how important this part of your code really is to your overall performance - and when it will be "good enough" so you can stop appropriately.
If the set of keys is known at compile time, then you could simply create a class (or struct) with nullable properties that hold the values.
If you use an array like KeyValuePair<TKey, TValue>[], you could sort it and then search using a binary search. But this is only fast if you have to sort once and then retrieve many times.
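For example (hedging on the exact key/value types; this only uses Array.Sort and Array.BinarySearch from the BCL):

```csharp
using System;
using System.Collections.Generic;

// Sort once up front, then probe many times.
var pairs = new KeyValuePair<int, double>[]
{
    new(3, 0.3), new(1, 0.1), new(2, 0.2),
};

// Order by key so the binary-search precondition holds.
var byKey = Comparer<KeyValuePair<int, double>>.Create(
    (a, b) => a.Key.CompareTo(b.Key));
Array.Sort(pairs, byKey);

// Probe: the value in the search pair is ignored by the comparer.
int index = Array.BinarySearch(pairs, new KeyValuePair<int, double>(2, 0), byKey);
Console.WriteLine(index >= 0 ? pairs[index].Value : double.NaN); // prints 0.2
```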
I often like to use the Hashtable class (http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx).
Another thing you can do though, is not worry about managing any cache yourself and just use the caching from ASP.NET. All you have to do is include the system.web assembly and then you can use it even in non-web applications. Here's an article on using the .Net cache code in a Windows Forms app. It's really simple.
http://www.codeproject.com/KB/cs/cacheinwinformapps.aspx
D.
Related
I am looking to see if there is a pre-existing .NET 'hash-set type' implementation suitable for atomizing a general type T. We have a large number of identical objects coming in from serialized sources that need to be atomized to conserve memory.
A Dictionary<T,T> with the value == key works perfectly, however the objects in these collections can run into the millions across the app, and so it seems very wasteful to store 2 references to every object.
HashSet<T> cannot be used as it only has Contains; there is (apparently) no way to get to the actual member instance.
Obviously I could roll my own but wanted to check if there was anything pre-existing. A scan of C5 didn't turn anything up, but then their 250+ page documentation does make me wonder if I've missed something.
EDIT The fundamental idea is I need to be able to GET THE UNIQUE OBJECT BACK ie HashSet has Contains(T obj) but not Get(T obj) /EDIT
The collection at worst only needs to implement:
T GetOrAdd(T candidate)
void Clear()
And take an arbitrary IComparer
And GetOrAdd is ~O(1) and would ideally be atomic, i.e. doesn't waste time Hashing twice.
EDIT Failing an existing implementation any recommendations on sources for the basic Hashing / Bucketing mechanics would be appreciated. - The Mono HashSet source has been pointed out for this and thus this section is answered /EDIT
You can take the source code of HashSet<T> from the Reference Source and write your own GetOrAdd method.
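For illustration, here is a minimal sketch of the GetOrAdd contract built on Dictionary<T, T> (the very two-references-per-entry layout the question wants to avoid, but it shows the behavior a hand-rolled bucket table would replicate; Atomizer is an invented name):

```csharp
using System.Collections.Generic;

// An atomizer implementing the GetOrAdd/Clear contract from the question.
// Backed by Dictionary<T, T> for simplicity; a hand-rolled bucket table
// (like the Mono HashSet<T> source) would store one reference per entry.
class Atomizer<T>
{
    private readonly Dictionary<T, T> pool;

    public Atomizer(IEqualityComparer<T> comparer = null) =>
        pool = new Dictionary<T, T>(comparer ?? EqualityComparer<T>.Default);

    // Returns the canonical instance, adding candidate if it is new.
    public T GetOrAdd(T candidate)
    {
        if (pool.TryGetValue(candidate, out T existing))
            return existing;
        pool.Add(candidate, candidate);
        return candidate;
    }

    public void Clear() => pool.Clear();
}
```

(On newer frameworks, HashSet<T>.TryGetValue(T, out T) provides exactly the missing Get(T) operation, with one reference per entry.)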
Unlike SortedSet<T> or SortedDictionary<TKey, TValue> from the BCL, the Sorted Sets from Redis are much more powerful. They:
- allow access by member, much like a dictionary
- determine sequence by score, a separate field from the member that allows duplicate values
- allow getting the min/max value with acceptable time complexity
I don't want to combine Dictionary + SortedSet to achieve my goal.
I am looking at C5 library, not quite sure which class is the best for me.
UPDATE:
Finally I found this, matching my requirement very well.
I was attempting to use ObjectIDGenerator in C# to generate a unique ID during serialization, however, this class is not available in the XBox360 or Windows Phone 7 .NET frameworks (they use a compact version of .NET). I implemented a version using a dictionary of Object to Int64 and was able to get a fully working version up, however, the performance is unsatisfactory. I'm serializing on the order of tens of thousands of objects, and currently this is the greatest bottleneck in save/load performance. Using the actual .NET implementation on PC it takes about 0.3 seconds to serialize about 20,000 objects. Using my implementation, it takes about 6 seconds.
In profiling, I found that the heavy hitters were .TryGetValue and .Add on the dictionary (which makes sense since it's both indexing and adding to the hash map). More importantly, the virtual equality operator was being called instead of simply comparing references, so I implemented an IEqualityComparer that only used ReferenceEquals (this resulted in a speed increase).
Does anyone have an insight into a better implementation of ObjectIDGenerator? Thanks for your help!
My Implementation: http://pastebin.com/H1skZwNK
[Edit]
Another note: the results of profiling say that the object comparison / ReferenceEquals is still the bottleneck, with a hit count of 43,000,000. I'm wondering if there's a way to store data alongside this object without having to look it up in a hash map...
Is it possible to use an Int32 Id property / handle for each object rather than Object? That may help things. It looks like you're assigning an Id type number to each object anyway, only you're then looking up based on the Object reference instead of the Id. Can you persist the object id (your Int64) within each object and make your dictionary into Dictionary<Int64, Object> instead?
You might also want to see if SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> perform better or worse. But if your main bottleneck is in your IEqualityComparer, these might not help very much.
UPDATE
After looking at the ObjectIDGenerator class API, I can see why you can't do what I advised at first; you're creating the ids!
ObjectIDGenerator seems to manually implement its own hash table (it allocates an object[] and a parallel long[] and resizes them as objects are added). It also uses RuntimeHelpers.GetHashCode(Object) to calculate its hash rather than an IEqualityComparer, which may be a big boost to your perf: it always uses the runtime's identity hash and never makes the virtual call on the derived type (or the interface call in your case with IEqualityComparer).
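If you want the same identity semantics in your own dictionary-based version, a comparer along these lines (an illustrative sketch, not BCL code) avoids the virtual GetHashCode call:

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// A comparer matching ObjectIDGenerator's identity semantics:
// reference equality plus the runtime's identity hash code.
sealed class ReferenceComparer : IEqualityComparer<object>
{
    public static readonly ReferenceComparer Instance = new ReferenceComparer();

    public new bool Equals(object x, object y) => ReferenceEquals(x, y);

    // RuntimeHelpers.GetHashCode ignores any GetHashCode override
    // and returns the object's identity hash directly.
    public int GetHashCode(object obj) => RuntimeHelpers.GetHashCode(obj);
}
```

Pass it at construction: new Dictionary<object, long>(ReferenceComparer.Instance).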
You can see the source for yourself via the Microsoft Shared Source Initiative:
I'm looking at List<T> and I see a BinarySearch method with a few overloads, and I can't help wondering if it makes sense at all to have a method like that in List<T>?
Why would I want to do a binary search unless the list was sorted? And if the list wasn't sorted, calling the method would just be a waste of CPU time. What's the point of having that method on List?
I note in addition to the other correct answers that binary search is surprisingly hard to write correctly. There are lots of corner cases and some tricky integer arithmetic. Since binary search is obviously a common operation on sorted lists, the BCL team did the world a service by writing the binary search algorithm correctly once rather than encouraging customers to all write their own binary search algorithm; a significant number of those customer-authored algorithms would be wrong.
Sorting and searching are two very common operations on lists. It would be unfriendly to limit a developer's options by not offering binary search on a regular list.
Library design requires compromises - the .NET designers chose to offer the binary search function on both arrays and lists in C# because they likely felt (as I do) that these are useful and common operations, and programmers who choose to use them understand their prerequisites (namely that the list is ordered) before calling them.
It's easy enough to sort a List<T> using one of the Sort() overloads. If you feel that you need an invariant that guarantees sorting, you can always use SortedList<TKey,TValue> or SortedSet<T> instead.
BinarySearch only makes sense on a List<T> that is sorted, just like IList<T>.Add only makes sense for an IList<T> with IsReadOnly = false. It's messy, but it's just something to deal with: sometimes functionality X depends on criterion Y. The fact that Y isn't always true doesn't make X useless.
Now, in my opinion, it's frustrating that .NET doesn't have general Sort and BinarySearch methods for any IList<T> implementation (e.g., as extension methods). If it did, we could easily sort and search for items within any non-read-only collection providing random access.
Then again, you can always write your own (or copy someone else's).
Others have pointed out that BinarySearch is quite useful on a sorted List<T>. It doesn't really belong on List<T>, though, as anyone with C++ STL experience would immediately recognize.
With recent C# language developments, it makes more sense to define the notion of a sorted list (e.g., ISortedList<T> : IList<T>) and define BinarySearch (et al.) as extension methods of that interface. This is a cleaner, more orthogonal design.
I've started doing just that as part of the Nito.Linq library. I expect the first stable release to be in a couple of months.
Yes, but List<T> has a Sort() method as well, so you can call it before BinarySearch.
Searching and sorting are algorithmic primitives. It's helpful for the standard library to have fast reliable implementations. Otherwise, developers waste time reinventing the wheel.
However, in the case of the .NET Framework, it's unfortunate that the specific choices of algorithms happens to make them less useful than they might be. In some cases, their behaviour is not defined:
List<T>.BinarySearch: "If the List contains more than one element with the same value, the method returns only one of the occurrences, and it might return any one of the occurrences, not necessarily the first one."
List<T>.Sort: "This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal."
That's a shame, because there are deterministic algorithms that are just as fast, and these would be much more useful as building blocks. It's noteworthy that the binary search algorithms in Python, Ruby and Go all find the first matching element.
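For comparison, a deterministic "lower bound" search — the index of the first element not less than the value — is only a few lines (a sketch, not BCL code):

```csharp
using System;
using System.Collections.Generic;

var list = new List<int> { 1, 2, 2, 2, 3 };
Console.WriteLine(LowerBound(list, 2)); // prints 1: the FIRST 2, deterministically

// Index of the first element >= value; returns sorted.Count if all are smaller.
static int LowerBound<T>(IList<T> sorted, T value, IComparer<T> cmp = null)
{
    cmp ??= Comparer<T>.Default;
    int lo = 0, hi = sorted.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;   // midpoint without overflow
        if (cmp.Compare(sorted[mid], value) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Unlike List<T>.BinarySearch, this always lands on the first of several equal elements, which makes it usable as a building block for range queries.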
I agree it's completely dumb to call BinarySearch on an unsorted list, but it's perfect if you know your large list is sorted.
I've used it when checking if items from a stream exist in a (more or less) static list of 100,000 items or more.
Binary searching the list is orders of magnitude faster than doing a list.Find, which is many orders of magnitude faster than a database lookup.
It makes sense, and I'm glad it's there (not that it would be rocket science to implement it if it wasn't).
Perhaps another point is that an array could be equally as unsorted. So in theory, having a BinarySearch on an array could be invalid too.
However, as with all features of a higher level language, they need to be applied by someone with reason and understanding of the data, or they will fail. Sure, some cross-checks could be applied, and we could have a flag that said "IsSorted" and it would fail on binary search otherwise, but .....
Some pseudo code:

if the list is sorted:
    use the BinarySearch method
else if the list is not sorted and you think sorting it is a "waste of CPU time":
    use a different algorithm that is more suitable and efficient
Are enum types faster/more efficient than string types when used as dictionary keys?
IDictionary<string,object> or IDictionary<enum,object>
As a matter of fact, which data type is most suitable as a dictionary key and why?
Consider the following: NOTE: Only 5 properties for simplicity
struct MyKeys
{
    public const string Incomplete = "IN";
    public const string Submitted = "SU";
    public const string Processing = "PR";
    public const string Completed = "CO";
    public const string Closed = "CL";
}
and
enum MyKeys
{
Incomplete,
Submitted,
Processing,
Completed,
Closed
}
Which of the above will be better if used as keys in a dictionary?
Certainly the enum version is better (when both are applicable and make sense, of course). Not just for performance (it can be better or worse, see Rashack's very good comment) as it's checked compile time and results in cleaner code.
You can circumvent the comparer issue by using Dictionary<int, object> and casting enum keys to ints or specifying a custom comparer.
I think you should start by focusing on correctness. This is far more important than the minor performance differences that may occur within your program. In this case I would focus on the proper representation of your types (enum appears to be best). Then later on, profile your application, and if there is an issue, then and only then should you fix it.
Making code faster later in the process is typically a straightforward process. Take the link that skolima provided. If you had chosen enum, it would have been a roughly 10-minute fix to remove a potential performance problem in your application. I want to stress the word potential here. This was definitely a problem for NHibernate, but whether or not it would be a problem for your program is determined solely by how it's used.
On the other hand, making code more correct later in the process tends to be more difficult. In a large enough problem you'll find that people start taking dependencies on the side effects of the previous bad behavior. This can make correcting code without breaking other components challenging.
Use enum to get cleaner and nicer code, but remember to provide a custom comparer if you are concerned with performance: http://ayende.com/Blog/archive/2009/02/21/dictionaryltenumtgt-puzzler.aspx .
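A comparer like the one Ayende describes is only a few lines; this is a sketch using the MyKeys enum from the question:

```csharp
using System.Collections.Generic;

enum MyKeys { Incomplete, Submitted, Processing, Completed, Closed }

// Dictionary's default comparer can box enum keys on older runtimes;
// a hand-written comparer keeps the comparison allocation-free.
sealed class MyKeysComparer : IEqualityComparer<MyKeys>
{
    public bool Equals(MyKeys x, MyKeys y) => x == y;
    public int GetHashCode(MyKeys key) => (int)key;
}
```

Pass it at construction: new Dictionary<MyKeys, object>(new MyKeysComparer()).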
I would guess that the enum version is faster. Under the hood the dictionary references everything by hashcode. My guess is that it is slower to generate the hashcode for a string. However, this is probably negligibly slower, and is most certainly faster than anything like a string compare. I agree with the other posters who said that an enum is cleaner.