Random Access by Position of Key/Value Pairs in .NET (C#)

I am currently developing a program that uses C#'s Dictionary container (specifically, SortedDictionary). This container works very well for my purposes except for one specific case because I want random access. Specifically, I am generating a random position using a pseudorandom number generator and I need to be able to access that value in the SortedDictionary. At the point that this happens, I do not have a key value.
I could potentially switch to a List which would solve this problem, but would create problems in the rest of the algorithm where SortedDictionary works quite well. Any suggestions/solutions would be much appreciated.
I am currently developing in Visual Studio 2005.
Thank you.

You can use a SortedList and it has a Values collection which you may access through an integer index.
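As a rough sketch of that suggestion (the class and method names here are made up for illustration, and the caller is assumed to supply the Random instance), indexed access through Values could look like this:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class SortedListSampling
{
    // Returns the value stored at a uniformly random position.
    // SortedList<TKey, TValue>.Values is an IList<TValue>, so Values[i] is a direct indexed lookup.
    public static TValue GetRandomValue<TKey, TValue>(SortedList<TKey, TValue> list, Random rng)
    {
        if (list.Count == 0)
            throw new InvalidOperationException("The list is empty.");

        int index = rng.Next(list.Count); // 0 <= index < Count
        return list.Values[index];
    }
}
```

The trade-off is that SortedList keeps its entries in arrays, so inserting or removing an item is O(n), whereas SortedDictionary does both in O(log n). Whether that matters depends on how often the collection changes.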

private static readonly Random randGen = new Random();

public TValue GetRandomElement<TKey, TValue>(SortedDictionary<TKey, TValue> dict)
{
    // randGen is shared: a new Random() per call can return the same value for calls made in quick succession
    int randIndex = randGen.Next(dict.Count);
    int i = 0;
    foreach (TValue value in dict.Values)
    {
        if (i++ == randIndex)
            return value;
    }
    // this shouldn't happen unless I have a bug above or you are accessing the dictionary from multiple threads
    return default(TValue);
}
Blindly enumerating the ValueCollection is not the most efficient thing in the world. But it gets the job done. If this is a frequent operation in your scenario, you should consider a hybrid data structure that has the performance characteristics needed for both dictionary lookup and random access.

Linq could do this for you:
int n = GetRandomIndex();
object item = dictionary.ElementAt(n).Value;

You don't provide enough information to come up with a solution. How many elements, how often are you going to do this, do you have memory/speed constraints? BTree, SortedList, inserting special nodes in the SortedDictionary could all be useful

Will pulling a random key work?
var randValue = myDictionary.Values.ToList()[myRandomInt];
Edit:
Seems the keys collection and values collection are both IEnumerables so you can't use [] operators. This is the best it gets it seems.
Edit:
Without Linq... Perhaps expensive, but you could use CopyTo to copy the pairs into an array and then pull a value at an index:
System.Collections.Generic.KeyValuePair<string, int>[] dictCopy = new System.Collections.Generic.KeyValuePair<string, int>[myDictionary.Count];
myDictionary.CopyTo(dictCopy, 0);
var randValue = dictCopy[myRandomInt].Value;

Related

Set allowing quick insert/deletion and random selection in C#

What data structure could I use in C# to allow quick insertion/deletion as well as uniform random selection? A List has slow deletion by element (since it needs to find the index of the element each time), while a HashSet does not seem to allow random selection of an element (without copying to a list.)
The data structure will be updated continuously, so insertion and deletion need to be online procedures. It seems as if there should be a way to make insertion, deletion, and random selection all O(log n).
A binary search tree with arbitrary integer keys assigned to the objects would solve all of these problems, but I can't find the appropriate class in the C# standard library. Is there a canonical way to solve this without writing a custom binary search tree?
There is already a BST in the C# BCL, it's called a SortedDictionary<TKey, TValue>, if you don't want Key Value Pairs, but instead want single items, you can use the SortedSet<T> (SortedSet is in .NET 4.0).
It sounds like from your example you'd want a SortedDictionary<int, WhateverValueType>. Though I'm not sure exactly what you are after when you say "uniform random selection".
Of course, the Dictionary<TKey, TValue> is O(1) which is much faster. So unless you have a need for sorted order of the keys, I'd use that.
UPDATE: From the sounds of your needs, you're going to have a catch-22 on efficiency. To be able to jump into a random contiguous index in the data structure, how often will you be inserting/deleting? If not often, you could use an array and just Sort() after (O(n log n)), or always insert/delete in order (O(n)).
Or, you could wrap a Dictionary<int, YourType> and keep a parallel List<int> and update it after every Add/Delete:
_dictionary.Add(newIndex, newValue);
_indexes.Add(newIndex);
And then just access a random index from the list on lookups. The nice thing about this method is that the Add() will be ~O(1) (unless the List resizes, but you can set an initial capacity to avoid some of that), but you would incur an O(n) cost on removes.
I'm afraid the problem is you'll either sacrifice times on the lookups, or on the deletes/inserts. The problem is all the best access-time containers are non-contiguous. With the dual List<int>/Dictionary<int, YourValue> combo, though, you'd have a pretty good mix.
UPDATE 2: It sounds like from our continued discussion that if absolute performance is your requirement, you may have better luck rolling your own. It was fun to think about though; I'll update if I think of anything else.
Binary search trees and derived structures, like SortedDictionary or SortedSet, operate by comparing keys.
Your objects are not comparable by themselves, but they offer object identity and a hash value. Therefore, a HashSet is the right data structure. Note: A Dictionary<int,YourType> is not appropriate because removal becomes a linear search (O(n)), and doesn't solve the random problem after removals.
Insert is O(1)
Remove is O(1)
RandomElement is O(n). It can easily be implemented, e.g.
set.ElementAt(random.Next(set.Count))
No copying to an intermediate list is necessary.
I realize that this question is over 3 years old, but just for people who come across this page:
If you don't need to keep the items in the data set sorted, you can just use a List<ItemType>.
Insertion and random selection are O(1). You can do deletion in O(1) by just moving the last item to the position of the item you want to delete and removing it from the end.
Code:
using System; // For the Random
using System.Collections.Generic; // The List
// List:
List<ItemType> list = new List<ItemType>();
// Add x:
ItemType x = ...; // The item to insert into the list
list.Add( x );
// Random selection
Random r = ...; // Probably get this from somewhere else
int index = r.Next( list.Count );
ItemType y = list[index];
// Remove item at index
list[index] = list[list.Count - 1]; // Copy last item to index
list.RemoveAt( list.Count - 1 ); // Remove from end of list
EDIT: Of course, to remove an element from the List<ItemType> you'll need to know its index. If you want to remove a random element, you can use a random index (as done in the example above). If you want to remove a given item, you can keep a Dictionary<ItemType,int> which maps the items to their indices. Adding, removing and updating these indices can all be done in O(1) (amortized).
Together this results in a complexity of O(1) (amortized) for all operations.
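Putting the List and the Dictionary<ItemType,int> together, a minimal sketch might look like the following. The class name RandomAccessSet and its members are hypothetical, and it assumes items are unique and have stable hash codes while stored:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a set with O(1) amortized Add, Remove, and random selection.
public class RandomAccessSet<T>
{
    private readonly List<T> _items = new List<T>();
    // Maps each stored item to its current position in _items.
    private readonly Dictionary<T, int> _indexOf = new Dictionary<T, int>();

    public int Count { get { return _items.Count; } }

    public bool Add(T item)
    {
        if (_indexOf.ContainsKey(item)) return false; // already present
        _indexOf[item] = _items.Count;
        _items.Add(item);
        return true;
    }

    public bool Remove(T item)
    {
        int index;
        if (!_indexOf.TryGetValue(item, out index)) return false;

        // Move the last item into the vacated slot, then drop the tail.
        int lastIndex = _items.Count - 1;
        T last = _items[lastIndex];
        _items[index] = last;
        _indexOf[last] = index;
        _items.RemoveAt(lastIndex);
        _indexOf.Remove(item);
        return true;
    }

    public T GetRandom(Random rng)
    {
        if (_items.Count == 0)
            throw new InvalidOperationException("The set is empty.");
        return _items[rng.Next(_items.Count)];
    }
}
```

Note that the swap on removal means the list order is not preserved, which is fine here because only uniform random selection is needed.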

What is the quickest way to compare a C# Dictionary to a 'gold standard' Dictionary for equality?

I have a known-good Dictionary, and at run time I need to create a new Dictionary and run a check to see if it has the same key-value pairs as the known-good Dictionary (potentially inserted in different orders), and take one path if it does and another if it doesn't. I don't necessarily need to serialize the entire known-good Dictionary (I could use a hash, for example), but I need some on-disk data that has enough information about the known-good Dictionary to allow for comparison, if not for recreation. What is the quickest way to do this? I can use a SortedDictionary, but the amount of time required to initialize and add values counts in the speed of this task.
Concrete example:
Consider a Dictionary<String,List<String>> that looks something like this (in no particular order, obviously):
{ {"key1", {"value1", "value2"} }, {"key2", {"value3", "value4"} } }
I create that Dictionary once and save some form of information about it on disk (a full serialization, a hash, whatever). Then, at runtime, I do the following:
Dictionary<String,List<String>> dict1 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict2 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict3 = new Dictionary<String,List<String>> ();
String key11 = "key1";
String key12 = "key1";
String key13 = "key1";
String key21 = "key2";
String key22 = "key2";
String key23 = "key2";
List<String> value11 = new List<String> {"value1", "value2"};
List<String> value12 = new List<String> {"value1", "value2"};
List<String> value13 = new List<String> {"value1", "value2"};
List<String> value21 = new List<String> {"value3", "value4"};
List<String> value22 = new List<String> {"value3", "value4"};
List<String> value23 = new List<String> {"value3", "value5"};
dict1.Add(key11, value11);
dict1.Add(key21, value21);
dict2.Add(key22, value22);
dict2.Add(key12, value12);
dict3.Add(key13, value13);
dict3.Add(key23, value23);
dict1.compare(fileName); //Should return true
dict2.compare(fileName); //Should return true
dict3.compare(fileName); //Should return false
Again, if the overall time from startup to the return from compare() is quicker, I can change this code to use a SortedDictionary (or anything else) instead, but I can't guarantee ordering and I need some consistent comparison. compare() could load a serialization and iterate through the dictionaries, it could serialize the in-memory dictionary and compare the serialization to the file name, or it could do any number of other things.
Solution one: use set equality.
If the dictionaries are of different sizes, you know they are unequal.
If they are of the same size then build a mutable hash set of keys from one dictionary. Remove from it all the keys from the other dictionary. If you attempted to remove a key that wasn't there, then the key sets are unequal and you know which key was the problem.
Alternatively, build two hash sets and take their intersection; the resulting intersection should be the size of the original sets.
This takes O(n) time and O(n) space.
Once you know that the key sets are equal then go through all the keys one at a time, fetch the values, and do comparison of the values. Since the values are sequences, use SequenceEqual. This takes O(n) time and O(1) space.
Solution two: sort the keys
Again, if the dictionaries are of different size, you know they are unequal.
If they are of the same size, sort both sets of keys and do a SequenceEqual on them; if the sequences of keys are unequal then the dictionaries are unequal.
This takes O(n lg n) time and O(n) space.
If that succeeds, then again, go through the keys one at a time and compare the values.
Solution three:
Again, check the dictionaries to see if they are the same size.
If they are, then iterate over the keys of one dictionary and check to see if the key exists in the other dictionary. If it does not, then they are not equal. If it does, then check the corresponding values for equality.
This is O(n) in time and O(1) in space.
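Solution three could be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes that the order of strings inside each value list matters (SequenceEqual is order-sensitive):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the size check followed by per-key lookups.
public static class DictionaryComparison
{
    public static bool HaveSameContents(
        Dictionary<string, List<string>> expected,
        Dictionary<string, List<string>> actual)
    {
        // Different sizes can never be equal.
        if (expected.Count != actual.Count) return false;

        foreach (KeyValuePair<string, List<string>> pair in expected)
        {
            List<string> otherValues;
            // A key missing from the other dictionary means they differ.
            if (!actual.TryGetValue(pair.Key, out otherValues)) return false;
            // Compare the value lists element by element, in order.
            if (!pair.Value.SequenceEqual(otherValues)) return false;
        }
        return true;
    }
}
```

Because the counts match and every key of the first dictionary is found in the second, the second cannot contain extra keys, so a single pass is enough.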
How to choose amongst these possible solutions? It depends on what the likely failure mode is, and whether you need to know what the missing or extra key is. If the likely failure mode is a bad key then it might be more performant to choose a solution that concentrates on finding the bad key first, and only checking for bad values if all the keys turn out to be OK. If the likely failure mode is a bad value, then the third solution is probably best, since it prioritizes checking values early.
Due to my comments on the accepted answer, here's a stricter check.
goodDictionary.Keys.All(k=>
{
List<string> otherVal;
if(!testDictionary.TryGetValue(k,out otherVal))
{
return false;
}
return goodDictionary[k].SequenceEqual(otherVal);
})
If you already have serialisation, then take the hash (I recommend SHA-1) of each serialised dictionary and then compare them.
I don't think there is a magic bullet here; you just need to do a lookup for each key pair:
public bool IsDictionaryAMatch(Dictionary<string, List<string>> dictionaryToCheck)
{
foreach(var kvp in dictionaryToCheck)
{
// Do the Keys Match
if(!goodDictionary.ContainsKey(kvp.Key))
return false;
foreach(var valueElement in kvp.Value)
{
// Do the Values in each list match
if(!goodDictionary[kvp.Key].Exists(x => x == valueElement))
return false;
}
}
return true;
}
Well, at some point you need to compare that each key has the same value, but before that you can do quick things, like checking to see how many keys each dictionary has, then checking that the list of keys match. Those should be fairly quick, and if either of those tests fail you can abort the more expensive testing.
After that, you might be able to build separate lists of keys and then fire off a Parallel query to compare the actual values.

Is C# Dictionary<string, List<string>>.GetObjectData() (serialization) Consistent?

I know Dictionaries don't store their key-value pairs in the order that they are added, but if I add the same key-value pairs (potentially in different orders) to two different Dictionaries and serialize the results, will the data on file be the same?
Edit: To clarify, I'm asking specifically about the output of GetObjectData(), not any particular serializer. Consider the following code:
Dictionary<string,List<string>> dict1 = new Dictionary<string,List<string>>();
Dictionary<string,List<string>> dict2 = new Dictionary<string,List<string>>();
string key11 = "key1";
string key12 = "key1";
string key21 = "key2";
string key22 = "key2";
List<string> values11 = new List<string>(1);
List<string> values12 = new List<string>(1);
List<string> values21 = new List<string>(1);
List<string> values22 = new List<string>(1);
values11.Add("value1");
values12.Add("value1");
values21.Add("value2");
values22.Add("value2");
dict1.Add(key11, values11);
dict2.Add(key22, values22);
dict1.Add(key21, values21);
dict2.Add(key12, values12);
Will dict1 and dict2 return the same thing for GetObjectData()? If not, why not?
Whether or not it is consistent is an implementation detail that is unlikely to be guaranteed across future versions or alternate implementations. At the very least, I would recommend writing a test that verifies it and runs as part of your standard test suite. But if you are implementing a solution that absolutely depends on it, then it may be worth writing your own serializer...
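If you do go the route of your own serializer, one possible approach (an assumption on my part, not something the framework provides) is to write the keys in a fixed, culture-independent order so that insertion order no longer matters. The class name and the length-prefixed text format below are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical canonical writer: equal contents produce equal strings,
// regardless of the order in which entries were added.
public static class CanonicalDictionaryWriter
{
    public static string Write(Dictionary<string, List<string>> dict)
    {
        // Ordinal sorting does not depend on the current culture.
        List<string> keys = new List<string>(dict.Keys);
        keys.Sort(StringComparer.Ordinal);

        StringBuilder sb = new StringBuilder();
        foreach (string key in keys)
        {
            AppendField(sb, key);
            List<string> values = dict[key];
            sb.Append(values.Count).Append(';');
            foreach (string value in values)
                AppendField(sb, value); // list order is kept as-is
        }
        return sb.ToString();
    }

    // Length-prefixing avoids ambiguity when a string contains the separator.
    private static void AppendField(StringBuilder sb, string text)
    {
        sb.Append(text.Length).Append(':').Append(text);
    }
}
```

The resulting string (or a hash of it) can then be stored on disk and compared later without depending on Dictionary's internal layout.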
Almost certainly not! The Dictionary works by hashing, and there has to be some method of hash collision resolution. So let's say that the first time you go through the dictionary you add key1 and key2 in that order. key1 ends up in the "normal" spot for keys that hash to that particular value. key2 is stored "somewhere else" (dependent on the implementation).
Now change the order that you add keys. key2 goes in the normal spot and key1 goes "somewhere else."
You cannot make any assumptions about the order of the items in your dictionary.
Even if you could guarantee the order, that guarantee could be invalidated with the next change to the .NET Framework because the implementation of string.GetHashCode might change (it has in the past). That would completely change the order in which keys are stored in the dictionary's underlying data structures, so any saved data created by a previous version of the Framework would likely not agree with data you create when running with the new version.
That would depend on internal implementation details of the particular (version of the) Dictionary.
So in general, No.
There have been topics here about using Serialization to determine Equality, it fails on several corner cases.

.NET Framework - Any way to get Dictionary<> to be a little faster?

I'm doing a Dictionary<> lookup in an O(n^2) loop and need it to be ridiculously fast. It's not. Does anyone have any insight into how Dictionary<> is implemented? I'm testing Dictionary performance with an isolated test case after running my code through a profiler and determining Dictionary lookups are the bulk of the CPU time. My test code is like this:
Int32[] keys = new Int32[10] { 38784, 19294, 109574, 2450985, 5, 398, 98405, 12093, 909802, 38294394 };
Dictionary<Int32, MyData> map = new Dictionary<Int32, MyData>();
//Add a bunch of things to map
timer.Start();
Object item;
for (int i = 0; i < 1000000; i++)
{
for (int j = 0; j < keys.Length; j++)
{
bool isFound = map.ContainsKey(keys[j]);
if (isFound)
{
item = map[keys[j]];
}
}
}
timer.Stop();
ContainsKey and map[] are the two slow parts (equally slow). If I add a TryGetValue, it's nearly identical in speed to ContainsKey. Here are some interesting facts:
A Dictionary<Guid, T> is about twice as slow as Dictionary<Int32, T>. Dictionary<String, T> is about twice as slow as a Guid dictionary. A Dictionary<Byte, T> is a good 50% faster than using Ints. This leads me to believe that a Dictionary is doing an O(log n) binary search to find the key, and the comparison operators on the keys are the bottleneck. For some reason, I don't believe it's implemented as a Hashtable, because .NET already has a Hashtable class, and in my experience it's even slower than Dictionary.
The dictionaries I'm building are only accessed by one thread at a time, so read locking is not an issue. RAM is also not an issue. The dictionary will most likely only have about 10 buckets, but each bucket can point to one of about 2,000 possible things. Does anyone have any feedback on how to make this faster? Thanks!
Mike
The dictionary is implemented using a hash table, I have looked at the code using Reflector a while back.
"The dictionary will most likely only
have about 10 buckets, but each bucket
can point to one of about 2,000
possibly things."
There is your problem. The dictionary uses the hash to locate the bucket, but the lookup in the bucket is linear.
You have to implement a hash algorithm with a better distribution to get better performance. The relation should be at least the opposite, i.e. 2000 buckets with 10 items each.
Adding to the comments about creating your own implementation based on knowing the data, here is an example that will have no clashes. This may throw OutOfMemoryExceptions based on the size of the objects. I tried using an int indexer but that would throw an OutOfMemoryException. If null is returned the item doesn't exist.
I haven't profiled this but I would expect minor speed improvements, but larger memory use.
public class QuickLookup<T> where T : class
{
    private T[] _positives = new T[short.MaxValue + 1];
    private T[] _negatives = new T[short.MaxValue + 1];

    public T this[short key]
    {
        get
        {
            // Negative keys -1..-32768 map to slots 0..32767
            return key < 0 ? _negatives[(key * -1) - 1] : _positives[key];
        }
        set
        {
            // Must use the same index mapping as the getter
            if (key < 0)
                _negatives[(key * -1) - 1] = value;
            else
                _positives[key] = value;
        }
    }
}
If you only have 10 buckets with 2000 things each could you just build a single list with all 20000 things which can be directly indexed by a key known to your loop? For example:
List<MyData> ItemList = new List<MyData>();
//add all items to list indexed by their key (RAM is not an issue right?)
item = ItemList[key];
This way you reference them directly with no dictionary or hash lookup.
It sounds like you're saying that your dictionary will only have 10 items in it. If so, a hash table may be unwarranted. You can just store your data in a list/array and either iterate over it or use a binary search to find your keys (try both to see what's faster).
If you use a binary search, your list will have to be sorted; if you just iterate over your list and there are some keys that are looked-up more frequently than others, you can put them at the beginning of the list to speed things up.
On the other hand, if your keys are known in advance, you can write your own implementation of a hash table with a fast and perfect hash function (i.e. no collisions), and that should be unbeatable.
The insights into the inner workings of the hash table are spot on. You should definitely be using TryGetValue as your entire inner loop:
map.TryGetValue(keys[j], out item);
Doing ContainsKey and Item[] does the hard part (the lookup) twice. An extra if and an extra keys[j] are minor, but will add up in a tight loop. Using a foreach over your keys will probably be slower, but depending on the actual contents of the loop, it might be worth profiling.
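For illustration, the inner loop from the question rewritten around a single TryGetValue call might look like this. The class and method names are hypothetical, and counting hits is only a stand-in for whatever the real loop does with the item:

```csharp
using System.Collections.Generic;

// Hypothetical helper showing one hash lookup per key instead of two.
public static class LookupLoop
{
    public static int CountHits<TValue>(Dictionary<int, TValue> map, int[] keys)
    {
        int hits = 0;
        for (int j = 0; j < keys.Length; j++)
        {
            TValue item;
            // TryGetValue both tests for the key and fetches the value.
            if (map.TryGetValue(keys[j], out item))
            {
                hits++; // item holds the stored value here
            }
        }
        return hits;
    }
}
```

How much this helps in practice is something only a profiler run on the real workload can tell.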

Performance when checking for duplicates

I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a
List<int>
and
Dictionary<int, bool>
With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is, is there a better solution to this problem? I do not need to access the entries again, I only need to track what "primary keys" I have seen and make sure I only perform additional work on entries that have a new primary key. I'm using C# and .NET 2.0. And I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). And so you can have a feel for scaling, overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.
They have added the HashSet class in .NET 3.5. But I guess it will be on par with the Dictionary. If you have less than say a 100 elements a List will probably perform better.
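If moving to .NET 3.5 or later is an option (the question targets .NET 2.0, so this is an assumption), a HashSet-based filter could be sketched like this. The class and method names are made up for the example:

```csharp
using System.Collections.Generic;

// Hypothetical helper: keeps only the first occurrence of each key.
public static class DuplicateFilter
{
    public static List<int> FirstOccurrences(IEnumerable<int> keys)
    {
        HashSet<int> seen = new HashSet<int>();
        List<int> unique = new List<int>();
        foreach (int key in keys)
        {
            // HashSet<T>.Add returns false when the key was already present,
            // so the membership test and the insert are a single operation.
            if (seen.Add(key))
                unique.Add(key);
        }
        return unique;
    }
}
```

This avoids storing the unused Boolean that the Dictionary<int, bool> approach carries along.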
Edit: Nevermind my comment. I thought you're talking about C++. I have no idea if my post is relevant in the C# world..
A hash-table could be a tad faster. Binary trees (that's what used in the dictionary) tend to be relative slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.
However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.
You may see a factor 10 speed-boost just plugging a simple pool allocator into the dictionary template. Afaik boost has a component that can be directly used.
Another option: If you know only 64.000 entries in your integers exist you can write those to a file and create a perfect hash function for it. That way you can just use the hash function to map your integers into the 0 to 64.000 range and index a bit-array.
Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.
I don't really get what you are asking.
First, it is just the opposite of what you say. The Dictionary has indexed access by key (it is a hash table), while the List does not.
If you already have the data in a dictionary then all keys are unique; there can be no duplicates.
I suspect you have the data stored in another data type and you're storing it into the dictionary. If that's the case, inserting the data will work with two dictionaries.
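// Assumes both dictionaries are Dictionary<int, bool>; the bool value is unused
foreach (int key in keys)
{
    if (MyDataDict.ContainsKey(key))
    {
        // Seen before: record it as a duplicate once
        if (!MyDuplicatesDict.ContainsKey(key))
            MyDuplicatesDict.Add(key, true);
    }
    else
        MyDataDict.Add(key, true);
}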
If you are checking for uniqueness of integers, and the range of integers is constrained enough then you could just use an array.
For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way, if your maximum number is 1,000,000, you only need 31,250 ints, or about 122 KB of memory, for the data structure.
Performance of a bitmap would be O(1) (per check), which is hard to beat.
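A minimal sketch of such a bitmap, assuming all keys fall in the range 0 to a known maximum (the class name and its members are hypothetical):

```csharp
using System;

// Hypothetical bitmap of seen keys: one bit per possible key value.
public class SeenKeyBitmap
{
    private readonly uint[] _bits;

    // maxKey is the largest key that will ever be checked (assumed non-negative).
    public SeenKeyBitmap(int maxKey)
    {
        if (maxKey < 0) throw new ArgumentOutOfRangeException("maxKey");
        _bits = new uint[maxKey / 32 + 1]; // 32 keys per array element
    }

    // Marks the key as seen. Returns true if this is the first time it was seen.
    // Keys outside 0..maxKey are not supported and will throw from the array access.
    public bool MarkSeen(int key)
    {
        int word = key >> 5;          // key / 32
        uint mask = 1u << (key & 31); // bit position within the word
        bool isNew = (_bits[word] & mask) == 0;
        _bits[word] |= mask;
        return isNew;
    }
}
```

The framework's System.Collections.BitArray class offers similar storage if you would rather not manage the bit arithmetic yourself.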
There was a question awhile back on removing duplicates from an array. For the purpose of the question performance wasn't much of a consideration, but you might want to take a look at the answers as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array then a LINQ command like Enumerable.Distinct might give you better performance than something that you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0 so this might be a route worth investigating.
If you're going to use a List, use the BinarySearch:
// initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> DuplicateKeys = new Dictionary<int,int>();
foreach ( int Key in MyKeys )
{
// this is an O(log N) operation
int index = FoundKeys.BinarySearch( Key );
if ( index < 0 )
{
// if the Key is not in our list,
// index is the bitwise complement of the index of the next larger value in the list
// i.e. ~index is the position it should occupy, and we maintain sorted-ness!
FoundKeys.Insert( ~index, Key );
}
else
{
if ( DuplicateKeys.ContainsKey( Key ) )
{
DuplicateKeys[Key]++;
}
else
{
DuplicateKeys.Add( Key, 1 );
}
}
}
You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer< T > );
