Why is HashTable Delete O(1)? - c#

I understand why a HashTable Add is O(1) (however please correct me if I'm wrong): The item being added is always allocated to the first available spot in the backing array.
I understand why a Lookup is O(n) (again, please correct me if I'm wrong): You need to walk through the backing array to find the value/key requested, and the running time of this operation will be directly proportional to the size of the collection.
However, why, then, is a Delete constant? It seems to me the same principles involved in an Add/Lookup are required.
EDIT
The MSDN article refers to a scenario where the item requested to be deleted isn't found. It mentions this as being an O(1) operation.

The worst cases for Insert and Delete are supposed to be O(n), see http://en.wikipedia.org/wiki/Hash_table.
When we Insert, we have to check whether the key is already in the table, hence O(n) in the worst case.
Just imagine a pathological case when all hash values are the same.
Maybe MSDN refers to average complexity.
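To make that pathological case concrete, here is a minimal sketch (the CollidingKey type is invented for illustration): a key whose GetHashCode always returns the same value forces every entry onto one collision chain, so lookups and deletes degrade to a linear scan.
using System;
using System.Collections;

// Hypothetical key type: every instance hashes to the same value.
class CollidingKey
{
    public int Id { get; }
    public CollidingKey(int id) { Id = id; }
    public override int GetHashCode() => 42; // worst-case hash: everything collides
    public override bool Equals(object obj) => obj is CollidingKey k && k.Id == Id;
}

class Program
{
    static void Main()
    {
        var table = new Hashtable();
        for (int i = 0; i < 10000; i++)
            table.Add(new CollidingKey(i), i);

        // Every lookup or delete now walks one long collision chain: O(n), not O(1).
        Console.WriteLine(table.ContainsKey(new CollidingKey(9999)));
        table.Remove(new CollidingKey(9999));
    }
}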

O(1) is the best case, and probably the average case if you appropriately size the table. Worst case deletion for a HashTable is O(n).

Related

C# - Binary search a (sorted) Dictionary

I have a file of records, sorted alphabetically:
Andrew d432
Ben x127
...
...
Zac b332
The first field is a person name, the second field is some id. Once I read the file, I do not need to make any changes to the data.
I want to treat each record as a Key-Value pair, where the person name is the Key. I don't know which class to use in order to access a record (as fast as possible). Dictionary does not have a binary search. On the other hand, as I understand, SortedList and SortedDictionary should be used only when I need to insert/remove data.
Edit: To clarify, I'm talking about simply accessing a record, like:
x = MyDic["Zac"]
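Not from the original thread, but a minimal sketch of the plain-Dictionary approach under the question's assumptions (a file called records.txt with one "name id" pair per line):
using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main()
    {
        // Assumed layout: "<name> <id>" per line, e.g. "Andrew d432".
        var people = new Dictionary<string, string>();
        foreach (string line in File.ReadLines("records.txt"))
        {
            string[] parts = line.Split(new[] { ' ' }, 2); // name, id
            people[parts[0]] = parts[1];
        }

        // Average-case O(1) lookup by name; no binary search needed.
        string id = people["Zac"];
        Console.WriteLine(id); // b332
    }
}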
What no one has stated is why dictionaries are O(1) and why it IS faster than a binary search. One side point is that dictionaries are not sorted by the key. The whole point of a dictionary is to go to the exact* (for all practical purposes) location of the item that is referenced by the key value. It does not "search" for the item - it knows the exact location of the item you want.
So a binary search would be pointless on a hash-based dictionary because there is no need to "search" for an item when the collection already knows exactly where it is.
*This isn't completely true in the case of hash collisions, but the principle of the dictionary is to get the item directly, and any additional lookups are an implementation detail and should be rare.
On the other hand, as I understand, SortedList and SortedDictionary should be used only when I need to insert/remove data.
They should be used when you want the data kept sorted as you add or remove items. Note that SortedDictionary loses the performance gain of a "normal" dictionary because it now has to search for the location using the key value. Its primary use is to allow you to iterate over the keys in order.
If you have a unique key value per item, don't need to iterate the items in any particular order, and want the fastest "get" performance, then Dictionary is the way to go.
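A short sketch of that trade-off (the sample data is made up): Dictionary for raw lookup speed, SortedDictionary only if you also need the keys back in order.
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var fast = new Dictionary<string, string> { ["Zac"] = "b332", ["Andrew"] = "d432" };
        Console.WriteLine(fast["Zac"]); // direct hash lookup, no searching

        var ordered = new SortedDictionary<string, string>(fast);
        foreach (var pair in ordered) // keys enumerate in sorted order, at the cost of O(log n) lookups
            Console.WriteLine(pair.Key + " " + pair.Value);
    }
}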
In general dictionary lookup will be faster than binary search of a collection. There are two specific cases when that's not true:
If the list is small (fewer than about 15 items, possibly as low as 10, in my tests), then the overhead of computing a hash code and going through the dictionary lookup will be slower than a binary search on an array. But beyond 15 items, dictionary lookup beats binary search, hands down.
If there are many hash collisions (due either to a bad hash function or a dictionary with a high load factor), then dictionary lookup slows down. If it's really bad, then binary search could potentially beat dictionary lookup.
In 15 years working with .NET dictionaries holding all kinds of data, I've never seen #2 be a problem when using the standard String.GetHashCode() method with real world data. The only time I've run into trouble is when I created a bad GetHashCode() method.
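If you want to sanity-check the small-collection claim on your own machine, a rough (non-rigorous) sketch along these lines will do; the 15-item threshold above came from the answerer's tests and will vary.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        string[] sorted = Enumerable.Range(0, 12)
            .Select(i => "key" + i)
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToArray();
        var dict = sorted.ToDictionary(s => s, s => 0);

        const int iterations = 1000000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Array.BinarySearch(sorted, "key7", StringComparer.Ordinal);
        Console.WriteLine("binary search: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            dict.ContainsKey("key7");
        Console.WriteLine("dictionary:    " + sw.ElapsedMilliseconds + " ms");
    }
}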

What does it mean when an operation "approaches O(1)" as opposed to "is O(1)"?

Consider for example the documentation for the .NET Framework 4.5 Dictionary<TKey, TValue> class:
In the remarks for the .ContainsKey method, they state that
This method approaches an O(1) operation.
And in the remarks for the .Count property, they state that
Retrieving the value of this property is an O(1) operation.
Note that I am not necessarily asking for the details of C#, .NET, Dictionary or what Big O notation is in general. I just found this distinction of "approaches" intriguing.
Is there any difference? If so, how significant can it potentially be? Should I pay attention to it?
If the hash function used by the keys in the hash table is "good", it means collisions will be very rare. Odds are good that there will be only one item at a given hash bucket, maybe two, and almost never more. If you could say, for certain, that there would never be more than c items (where c is a constant) in a bucket, then the operation would be O(c), which is O(1). But that assurance can't be made. It's possible that you just happen to have n different items that, unluckily enough, all collide and end up in the same bucket, in which case ContainsKey is O(n). It's also possible that the hash function isn't "good" and produces collisions frequently, which can make the actual contains check worse than O(1).
It's because Dictionary is an implementation of a hash table: a key lookup uses a hashing function that tells you which bucket, among the many buckets in the data structure, contains the value you're looking up. Usually, for a good hashing function and a large enough set of buckets, each bucket contains only a single element, in which case the complexity is indeed O(1). Unfortunately, this is not always true: the hashing function may produce clashes, in which case a bucket may contain more than one entry, and the algorithm has to iterate through the bucket until it finds the entry you're looking for. So it's no longer O(1) in these (hopefully rare) cases.
O(1) is constant running time; approaching O(1) is close to constant but not quite, and for most purposes the difference is negligible. You should not pay attention to it.
Something either is or isn't O(1). I think what they're trying to say is that running time is approximately O(1) per operation for a large number of operations.

What is time and space complexity of Dictionary?

Let's assume I have data of size N (i.e. N elements) and the dictionary was created with capacity N. What is the complexity of:
space -- of entire dictionary
time -- adding entry to dictionary
MS revealed only that entry retrieval is close to O(1). But what about the rest?
What is the complexity of space -- of entire dictionary
Dictionary uses an associative array data structure, which is O(N) in space.
MSDN says: "the Dictionary class is implemented as a hash table", and a hash table in turn uses associative arrays.
What is the complexity of time -- adding entry to dictionary
A single add takes amortized O(1) time. In most cases it is O(1); it becomes O(N) when the underlying data structure needs to grow or shrink. As the latter happens only infrequently, people use the word "amortized".
The time complexity of adding a new entry is documented under Dictionary<TKey, TValue>.Add():
If Count is less than the capacity, this method approaches an O(1) operation. If the capacity must be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.
It is not formally documented, but widely stated (and visible via disassembly) that the underlying storage is an array of name-value pairs. Thus space complexity is O(n).
As it is not part of the specification this could, in theory, change; but in practice it is highly unlikely to, because it would change the performance of various operations (e.g. enumeration) in ways that could be visible.
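A small sketch of the practical consequence (sizes and types are arbitrary): if you know N up front, passing it as the capacity avoids the intermediate O(n) grow-and-rehash steps, so every Add stays on the O(1) path.
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        const int n = 100000;

        // Pre-sized: the backing arrays are allocated once, so no rehashing while filling.
        var presized = new Dictionary<int, string>(n);
        for (int i = 0; i < n; i++)
            presized.Add(i, "value" + i);

        // Default capacity: the same adds trigger several O(n) grow-and-rehash steps along the way.
        var growing = new Dictionary<int, string>();
        for (int i = 0; i < n; i++)
            growing.Add(i, "value" + i);
    }
}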

Efficient insertion and search of strings

In an application I will have between about 3000 and 30000 strings.
After creation (read from files unordered) there will not be many strings that will be added often (but there WILL be sometimes!). Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure can I use best, a hashtable, a tree (Red-Black, Splay, ...) or just an ordered list (maybe a StringArray?)?
(Additional remark : a link to a good C# implementation would be appreciated as well)
It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're based partly on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit: Add assumes the array does not need to be resized. If it does, then as Noldorin has stated it is O(n).
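A minimal HashSet<T> sketch of the three operations listed above (the strings are placeholders):
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var strings = new HashSet<string> { "alpha", "beta", "gamma" };

        strings.Add("delta");                  // O(1), unless the backing array must grow
        bool found = strings.Contains("beta"); // O(1) with a decent hash
        strings.Remove("alpha");               // O(1)

        Console.WriteLine(found + ", count = " + strings.Count);
    }
}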
I used HashSet on a recent upgrade project from VB 6 (which I didn't write) to .NET 3.5, where I was iterating over a collection whose items had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of the items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well; I had a requirement to reuse as much of the existing code as possible, as it had been thoroughly tested. The HashSet certainly performed very fast.
If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of fixed bounds on worst-case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash-based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), being cheaper to construct initially and, if properly configured, having similar average-case performance. However, they do tend to use more memory and have non-deterministic lookup and insertion times (and, depending on the implementation, possibly deletion times).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
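A sketch of both ideas from this answer (sample data invented): a case-insensitive HashSet<string> for pure membership tests, and List<T>.BinarySearch when you also need the position a string would occupy in sorted order.
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Membership only: case-insensitive set.
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Andrew", "Ben", "Zac" };
        Console.WriteLine(set.Contains("ben")); // True

        // Positional queries: keep a sorted List<string> and binary search it.
        var sorted = new List<string> { "Andrew", "Ben", "Zac" };
        int index = sorted.BinarySearch("Carl", StringComparer.Ordinal);
        if (index < 0)
            index = ~index; // bitwise complement gives the insertion point when not found
        Console.WriteLine(index); // 2: "Carl" would slot in before "Zac"
    }
}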
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 items that is largely precomputed, however. You might even do better just allocating an array of k * N optional slots and filling it by skipping k spaces between each actual item (thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order). If you need a precise "where would this string appear in the set", though, this wouldn't work: you would need O(n) time to examine each space before the item to check whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could give you very fast imprecise indexes, though, and those indexes would be stable between insertions and deletions.
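To make the trie suggestion concrete, here is a bare-bones sketch; the 26-letter lowercase alphabet and the node layout are simplifying assumptions, not a production design.
using System;

class TrieNode
{
    public TrieNode[] Children = new TrieNode[26]; // assumes keys are lowercase 'a'..'z'
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            int i = c - 'a';
            node = node.Children[i] ?? (node.Children[i] = new TrieNode());
        }
        node.IsWord = true;
    }

    // Cost depends only on the length of the word, not on how many strings are stored.
    public bool Contains(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            node = node.Children[c - 'a'];
            if (node == null) return false;
        }
        return node.IsWord;
    }
}

class Program
{
    static void Main()
    {
        var trie = new Trie();
        trie.Add("zac");
        Console.WriteLine(trie.Contains("zac"));  // True
        Console.WriteLine(trie.Contains("zach")); // False
    }
}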

'Proper' collection to use to obtain items in O(1) time in C# .NET?

Something I do often if I'm storing a bunch of string values and I want to be able to find them in O(1) time later is:
foreach (String value in someStringCollection)
{
someDictionary.Add(value, String.Empty);
}
This way, I can comfortably perform constant-time lookups on these string values later on, such as:
if (someDictionary.ContainsKey(someKey))
{
// etc
}
However, I feel like I'm cheating by making the value String.Empty. Is there a more appropriate .NET Collection I should be using?
If you're using .Net 3.5, try HashSet. If you're not using .Net 3.5, try C5. Otherwise your current method is ok (bool as #leppie suggests is better, or not as #JonSkeet suggests, dun dun dun!).
HashSet<string> stringSet = new HashSet<string>(someStringCollection);
if (stringSet.Contains(someString))
{
...
}
You can use HashSet<T> in .NET 3.5; otherwise I would just stick to your current method (actually I would prefer Dictionary<string, bool>, but one does not always have that luxury).
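For completeness, a sketch of the two variants mentioned here, side by side (the sample values are throwaway):
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Pre-.NET 3.5 workaround: the bool value is never read, it only makes the key storable.
        var lookup = new Dictionary<string, bool>();
        lookup["apple"] = true;
        Console.WriteLine(lookup.ContainsKey("apple"));

        // .NET 3.5 and later: HashSet<string> says what you mean.
        var set = new HashSet<string> { "apple" };
        Console.WriteLine(set.Contains("apple"));
    }
}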
Something you might want to add is an initial size for your hash table. I'm not sure if C# is implemented differently from Java, but it usually has some default size, and if you add more than that, it extends the set. However, a properly sized hash table is important for getting as close to O(1) as possible. The goal is to get roughly one entry in each bucket, without making the table really huge. If you do some searching, you'll find a suggested ratio for sizing the hash table when you know beforehand how many elements you will be adding, for example something like "the hash table should be sized at 1.8x the number of elements to be added" (not the real ratio, just an example).
From Wikipedia:
With a good hash function, a hash table can typically contain about 70%–80% as many elements as it does table slots and still perform well. Depending on the collision resolution mechanism, performance can begin to suffer either gradually or dramatically as more elements are added. To deal with this, when the load factor exceeds some threshold, it is necessary to allocate a new, larger table, and add all the contents of the original table to this new table. In Java's HashMap class, for example, the default load factor threshold is 0.75.
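In C# terms, pre-sizing looks like the sketch below. Note that the generic Dictionary only exposes a capacity (its load factor is fixed internally), the non-generic Hashtable exposes both, and HashSet<T> only gained a capacity constructor in later framework versions, so that part is omitted here.
using System.Collections;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        const int expectedItems = 10000;

        // Generic Dictionary: pre-size the capacity; the load factor is managed internally.
        var dict = new Dictionary<string, string>(expectedItems);

        // Non-generic Hashtable: both capacity and load factor are configurable.
        var table = new Hashtable(expectedItems, 0.72f);
    }
}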
I should probably make this a question, because I see the problem so often. What makes you think that dictionaries are O(1)? Technically, the only thing likely to be something like O(1) is access into a standard integer-indexed fixed-bound array using an integer index value (there being no look-up in arrays implemented that way).
The presumption that something is O(1) just because it looks like an array reference is misleading when the "index" is a value that must itself be looked up somehow behind the scenes; it is unlikely to be a true O(1) scheme unless you are lucky enough to have a hash function, and data, with no collisions (and probably a lot of wasted cells).
I see these questions, and I even see answers that claim O(1) [not on this particular question, but I do see them around], with no justification or explanation of what is required to make sure O(1) is actually achieved.
Hmm, I guess this is a decent question. I will do that after I post this remark here.
