Efficient insertion and search of strings - c#

In an application I will have between about 3000 and 30000 strings.
After creation (the strings are read from files, unordered) new strings will not be added often, but it WILL happen sometimes! Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure can I use best, a hashtable, a tree (Red-Black, Splay,....) or just an ordered list (maybe a StringArray?) ?
(Additional remark : a link to a good C# implementation would be appreciated as well)

It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
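For illustration, a minimal sketch of that usage (someStringCollection is a placeholder for the strings read from your files):
HashSet<string> strings = new HashSet<string>(someStringCollection);

strings.Add("newly read string");                    // occasional insertion
strings.Remove("string we no longer need");          // occasional deletion
bool known = strings.Contains("string to compare");  // the frequent operation, O(1) on average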

HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit: Add is O(1) only if the backing array does not need to be resized; if it does, then, as Noldorin has stated, it is O(n).
I used HashSet on a recent VB6-to-.NET 3.5 upgrade project (I didn't write the original) where I was iterating round a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of the items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well; I had a requirement to reuse as much of the existing code as possible, as this had been thoroughly tested. The HashSet certainly performed very fast.
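A rough sketch of that "already sent" pattern (LoadSentIds, SaveSentIds, SendToApi and itemsToProcess are made-up placeholder names, not the real code):
HashSet<string> alreadySent = LoadSentIds();   // deserialized from the previous invocation

foreach (string id in itemsToProcess)
{
    // Add returns false if the id is already in the set, so we only pay for new items
    if (alreadySent.Add(id))
    {
        SendToApi(id);
    }
}

SaveSentIds(alreadySent);   // serialize for the next invocation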

If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of having fixed bounds on worst case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and if properly configured have similar average-case performance. However, they do tend to use more memory and have non-deterministic time lookups, insertions (and depending on the implementation possibly deletions).
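If you do go down the tree route, a minimal prefix-tree (trie) sketch in C# might look like the following. It's illustrative only (nothing like this ships in the framework), and both Add and Contains cost O(length of the string):
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        node.IsWord = true;   // mark the end of a complete string
    }

    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;   // ran out of matching branches
        }
        return node.IsWord;
    }
}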

The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
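A couple of minimal sketches of those two kinds of check (the values are just placeholders):
// 1) Membership only, case-insensitive:
var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
set.Add("Foo");
bool present = set.Contains("foo");   // true

// 2) "Where would it appear if the set were an ordered list?":
var list = new List<string> { "apple", "banana", "cherry" };   // kept sorted
int index = list.BinarySearch("blueberry", StringComparer.Ordinal);
if (index < 0)
    index = ~index;   // bitwise complement of the return value is the insertion point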

If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 things that is largely precomputed, however. You might even do better just allocating an array of k * N optional slots and filling it by skipping k spaces between each actual item (thus reducing the probability that your rare insertions will require reallocation), still leaving you with a variant of binary search and keeping your items in sorted order. If you need a precise "where would this string appear in the set" answer, though, this wouldn't work, because you would need O(n) time to examine each space before the item checking whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.

Related

Fastest way to check if a string is a substring C#?

I have a need to check if a list of items contains a string...so kind of like the list gets filtered as the user types in a search box. So, on the text changed event, I am checking if the entered text is contained in one of the listbox items and filtering out...so
something like:
value.Contains(enteredText)
I was wondering if this is the fastest and most efficient way to filter out listbox items?
Is Contains() method the best way to search for substrings in C#?
I'd say that in all but very exceptional circumstances, it's fast and efficient enough, and even in such exceptional circumstances it's likely to be a purely academic problem. If you use it and come across any bottlenecks in your logic related to this then I'd be surprised, but only then would it be worth looking at, and even then chances are you'll end up looking elsewhere.
Contains is one of the cheapest methods in my code completion filtering algorithm (Part 6 #6, where #7 and the fuzzy logic matching described in the footnote are vastly more expensive), which doesn't have problems keeping up with even a fast typing user and thousands of items in the dropdown.
I highly doubt it will cause you problems.
Although this is not the fastest option globally, it is the fastest one for which you do not need to code anything. It should be sufficient for filtering drop-down items.
For longer texts, you may want to go with the KMP algorithm, which has linear time complexity. Note, however, that it would not make any difference for very short search strings.
For searches that have lots of matches (e.g. ones that you get for the first one to two characters) you may want to precompute a table that maps single letters and letter pairs to the rows in your drop-down list for a much faster look-up at the expense of using more memory (a pretty standard tradeoff in programming in general).
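Putting the straightforward approach into code, a hedged sketch of filtering on the text changed event (listBox, searchBox and allItems are placeholder names for a WinForms ListBox, the search TextBox and the unfiltered item list):
string enteredText = searchBox.Text;

listBox.BeginUpdate();
listBox.Items.Clear();
foreach (string value in allItems)
{
    // IndexOf with a StringComparison gives a case-insensitive "Contains"
    if (value.IndexOf(enteredText, StringComparison.OrdinalIgnoreCase) >= 0)
        listBox.Items.Add(value);
}
listBox.EndUpdate();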

Sort or RemoveAll first on an IEnumerable that needs both?

When an IEnumerable needs both to be sorted and for elements to be removed, are there advantages/drawback of performing the stages in a particular order? My performance tests appear to indicate that it's irrelevant.
A simplified (and somewhat contrived) example of what I mean is shown below:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    List<DataItem> result = this.GetDataItems().ToList();
    result.Sort(sortOrder);
    result.RemoveAll(item => !item.Display);
    return result.Take(maximum);
}
If your tests indicate it's irrelevant, then why worry about it? Don't optimize before you need to, only when it becomes a problem. If you find a problem with performance, and have used a profiler, and have found that that method is the hotspot, then you can worry more about it.
On second thought, have you considered using LINQ? Those calls could be replaced with a call to Where and OrderBy, both of which are deferred, and then calling Take, like you have in your example. The LINQ libraries should find the best way of doing this for you, and if your data size expands to the point where it takes a noticeable amount of time to process, you can use PLINQ with a simple call to AsParallel.
You might as well RemoveAll before sorting so that you'll have fewer elements to sort.
I think that the Sort() method would usually have a complexity of O(n*log(n)), and RemoveAll() just O(n), so in general it is probably better to remove items first.
You'd want something like this:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    return this.GetDataItems()
        .Where(item => item.Display)
        .OrderBy(item => item, sortOrder)
        .Take(maximum);
}
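For comparison, an eager List<T> version that removes first and then sorts (a sketch, assuming GetDataItems() can be materialized into a list) might look like this:
public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    List<DataItem> result = this.GetDataItems().ToList();
    result.RemoveAll(item => !item.Display);   // O(n), shrinks the data first
    result.Sort(sortOrder);                    // O(n log n) over the smaller set
    return result.Take(maximum);
}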
There are two answers that are correct, but won't teach you anything:
It doesn't matter.
You should probably do RemoveAll first.
The first is correct because you said your performance tests showed it's irrelevant. The second is correct because it will have an effect on larger datasets.
There's a third answer that also isn't very useful: Sometimes it's faster to do removals afterwards.
Again, it doesn't actually tell you anything, but "sometimes" always means there is more to learn.
There's also only so much value in saying "profile first". What if profiling shows that 90% of the time is spent doing x.Foo(), which it does in a loop? Is the problem with Foo(), with the loop or with both? Obviously if we can make both more efficient we should, but how do we reason about that without knowledge outside of what a profiler tells us?
When something happens over multiple items (which is true of both RemoveAll and Sort) there are five things (I'm sure there are more I'm not thinking of now) that will affect the performance impact:
The per-set constant costs (both time and memory). How much it costs to do things like calling the function that we pass a collection to, etc. These are almost always negligible, but there could be some nasty high cost hidden there (often because of a mistake).
The per-item constant costs (both time and memory). How much it costs to do something that we do on some or all of the items. Because this happens multiple times, there can be an appreciable win in improving them.
The number of items. As a rule the more items, the more the performance impact. There are exceptions (next item), but unless those exceptions apply (and we need to consider the next item to know when this is the case), then this will be important.
The complexity of the operation. Again, this is a matter of both time-complexity and memory-complexity, but here there's the chance that we might choose to improve one at the cost of the other. I'll talk about this more below.
The number of simultaneous operations. This can be a big difference between "works on my machine" and "works on the live system". If a super time-efficient approach that uses 0.5GB of memory is tested on a machine with 2GB of memory available, it'll work wonderfully, but when you move it to a machine with 8GB of memory available and have multiple concurrent users, it'll hit a bottleneck at 16 simultaneous operations, and suddenly what was beating other approaches in your performance measurements becomes the application's hotspot.
To talk about complexity a bit more. Time complexity is a measure of how the time taken to do something relates to the number of items it is done with, while memory complexity is a measure of how the memory used relates to that same number of items. Obtaining an item from a dictionary is O(1), or constant, because it takes the same amount of time however large the dictionary is (not strictly true; strictly it "approaches" O(1), but it's close enough for most thinking). Finding something in an already sorted list can be O(log₂ n), or logarithmic. Filtering through a list will be linear, or O(n). Sorting something using a quicksort (which is what Sort uses) tends to be linearithmic, or O(n log₂ n), but in its worst case - against a list already sorted - will be quadratic, O(n²).
Considering these, with a set of 8 items, an O(1) operation will take 1k seconds to do something, where k is a constant amount of time, O(log₂ n) means 3k seconds, O(n) means 8k, O(n log₂ n) means 24k and O(n²) means 64k. These are the most commonly found, though there are plenty of others, like O(n·m), which is affected by two different sizes, or O(n!), which would be 40320k.
Obviously, we want as low a complexity as possible, though since k will be different in each case, sometimes the best solution for a small set has a high complexity (but a low k constant), whereas a lower-complexity approach will beat it with larger input.
So. Let's go back to the cases you are considering, viz filtering followed by sorting vs. sorting followed by filtering.
Per-set constants. Since we are moving two operations around but still doing both, this will be the same either way.
Per-item constants. Again, we're still doing the same things per item in either case, so no effect.
Number of items. Filtering reduces the number of items. Therefore the sooner we filter items, the more efficient the rest of the operation. Therefore doing RemoveAll first wins in this regard.
Complexity of the operation. It's either an O(n) filter followed by an average-case O(n log₂ n), worst-case O(n²) sort, or that same sort followed by the O(n) filter. Same either way.
Number of simultaneous cases. Total memory pressure will be relieved the sooner we remove some items, (slight win for RemoveAll first).
So, we've got two reasons to consider RemoveAll first as likely to be more efficient and none to consider it likely to be less efficient.
We would not assume that we were 100% guaranteed to be correct here. For a start we could simply have made a mistake in our reasoning. For another, there could be other factors we've dismissed as irrelevant that were actually pertinent. It is still true that we should profile before optimising, but reasoning about the sort of things I've mentioned above will both make us more likely to write performant code in the first place (not the same as optimising, but a matter of picking between options when readability, clarity and correctness are equal either way) and make it easier to find likely ways to improve those things that profiling has found to be troublesome.
For a slightly different but relevant case, consider if the criteria sorted on matched those removed on. E.g. if we were to sort by date and remove all items after a given date.
In this case, if the list deallocates on all removals, it'll still be O(n), but with a much smaller constant. Alternatively, if it just moved the "last-item" pointer*, it becomes O(1). Finding the pointer is O(log₂ n), so here there are both reasons to consider that filtering first will be faster (the reasons given above) and that sorting first will be faster (that removal can be made a much faster operation than it was before). With this sort of case it becomes only possible to tell by extending our profiling. It is also true that the performance will be affected by the type of data sent, so we need to profile with realistic data, rather than artificial test data, and we may even find that what was the more performant choice becomes the less performant choice months later when the dataset it is used on changes. Here the ability to reason becomes even more important, because we should note the possibility that changes in real-world use may make this change in this regard, and know that it is something we need to keep an eye on throughout the project's life.
(*Note, List<T> does not just move a last-item pointer for a RemoveRange that covers the last item, but another collection could.)
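To make that special case concrete, a sketch (the Item type, its Date property, items and cutOffDate are made-up placeholders):
// Sort by date, then remove the whole tail after the cut-off in a single call.
items.Sort((a, b) => a.Date.CompareTo(b.Date));

// Binary search for the first element strictly after the cut-off: O(log n).
int lo = 0, hi = items.Count;
while (lo < hi)
{
    int mid = (lo + hi) / 2;
    if (items[mid].Date <= cutOffDate)
        lo = mid + 1;
    else
        hi = mid;
}

items.RemoveRange(lo, items.Count - lo);   // one contiguous removal from the tail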
It would probably be better to do the RemoveAll first, although it would only make much of a difference if your sorting comparison was expensive to calculate.

Speed of C# lists

Are C# lists fast? What are the good and bad sides of using lists to handle objects?
Extensive use of lists will make software slower? What are the alternatives to lists in C#?
How many objects is "too many objects" for lists?
List<T> uses a backing array to hold items:
Indexer access (i.e. fetch/update) is O(1)
Remove from tail is O(1)
Remove from elsewhere requires existing items to be shifted up, so O(n) effectively
Add to end is O(1) unless it requires resizing, in which case it's O(n). (This doubles the size of the buffer, so the amortized cost is O(1).)
Add to elsewhere requires existing items to be shifted down, so O(n) effectively
Finding an item is O(n) unless it's sorted, in which case a binary search gives O(log n)
It's generally fine to use lists fairly extensively. If you know the final size when you start populating a list, it's a good idea to use the constructor which lets you specify the capacity, to avoid resizing. Beyond that: if you're concerned, break out the profiler...
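A small sketch of that advice, assuming sourceItems is a collection whose count is known up front:
var list = new List<string>(sourceItems.Count);   // capacity set once
foreach (string item in sourceItems)
{
    list.Add(item);   // no intermediate resizes of the backing array
}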
Compared to what?
If you mean List<T>, then that is essentially a wrapper around an array; so fast to read/write by index, relatively fast to append (since it allows extra space at the end, doubling in size when necessary) and remove from the end, but more expensive to do other operations (insert/delete other than the end)
An array is again fast by index, but fixed size (no append/delete)
Dictionary<,> etc offer better access by key
A list isn't intrinsically slow; especially if you know you always need to look at all the data, or can access it by index. But for large lists it may be better (and more convenient) to search via a key. There are various dictionary implementations in .NET, each with different costs re size / performance.

'Proper' collection to use to obtain items in O(1) time in C# .NET?

Something I do often if I'm storing a bunch of string values and I want to be able to find them in O(1) time later is:
foreach (String value in someStringCollection)
{
    someDictionary.Add(value, String.Empty);
}
This way, I can comfortably perform constant-time lookups on these string values later on, such as:
if (someDictionary.ContainsKey(someKey))
{
    // etc
}
However, I feel like I'm cheating by making the value String.Empty. Is there a more appropriate .NET Collection I should be using?
If you're using .Net 3.5, try HashSet. If you're not using .Net 3.5, try C5. Otherwise your current method is ok (bool as #leppie suggests is better, or not as #JonSkeet suggests, dun dun dun!).
HashSet<string> stringSet = new HashSet<string>(someStringCollection);
if (stringSet.Contains(someString))
{
...
}
You can use HashSet<T> in .NET 3.5, else I would just stick to your current method (actually I would prefer Dictionary<string,bool>, but one does not always have that luxury).
Something you might want to add is an initial size for your hash. I'm not sure if C#'s implementation differs from Java's, but it usually has some default size, and if you add more than that, it extends the set. However, a properly sized hash is important for achieving as close to O(1) as possible. The goal is to get exactly 1 entry in each bucket, without making it really huge. If you do some searching, I know there is a suggested ratio for sizing the hash table, assuming you know beforehand how many elements you will be adding. For example, something like "the hash should be sized at 1.8x the number of elements to be added" (not the real ratio, just an example).
From Wikipedia:
"With a good hash function, a hash table can typically contain about 70%–80% as many elements as it does table slots and still perform well. Depending on the collision resolution mechanism, performance can begin to suffer either gradually or dramatically as more elements are added. To deal with this, when the load factor exceeds some threshold, it is necessary to allocate a new, larger table, and add all the contents of the original table to this new table. In Java's HashMap class, for example, the default load factor threshold is 0.75."
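A sketch of pre-sizing along those lines, assuming the element count is known up front (Dictionary<,> has accepted an initial capacity since .NET 2.0; HashSet<T> only gained an equivalent constructor in later framework versions):
int expectedCount = 30000;
var lookup = new Dictionary<string, bool>(expectedCount);
foreach (string value in someStringCollection)
{
    lookup[value] = true;   // buckets allocated once, up front
}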
I should probably make this a question, because I see the problem so often. What makes you think that dictionaries are O(1)? Technically, the only thing likely to be something like O(1) is access into a standard integer-indexed fixed-bound array using an integer index value (there being no look-up in arrays implemented that way).
The presumption that something is O(1) because it looks like an array reference fails when the "index" is a value that must itself be looked up somehow behind the scenes; it is not likely to be an O(1) scheme unless you are lucky enough to have a hash function and data with no collisions (and probably a lot of wasted cells).
I see these questions, and I even see answers that claim O(1) [not on this particular question, but I do see them around], with no justification or explanation of what is required to make sure O(1) is actually achieved.
Hmm, I guess this is a decent question. I will do that after I post this remark here.

List.BinarySearch vs Dictionary.TryGetValue - which is faster

Which would be faster for, say, 500 elements?
Or what's the faster data structure/collection for retrieving elements?
List<MyObj> myObjs = new List<MyObj>();
int i = myObjs.BinarySearch(myObjsToFind);
MyObj obj = myObjs[i];
Or
Dictionary<MyObj, MyObj> myObjss = new Dictionary<MyObj, MyObj>();
MyObj value;
myObjss.TryGetValue(myObjsToFind, out value);
I assume in your real code you'd actually populate myObjs - and sort it.
Have you just tried it? It will depend on several factors:
Do you need to sort the list for any other reason?
How fast is MyObj.CompareTo(MyObj)?
How fast is MyObj.GetHashCode()?
How fast is MyObj.Equals()?
How likely are you to get hash collisions?
Does it actually make a significant difference to you?
It'll take around 8 or 9 comparisons in the binary search case, against a single call to GetHashCode and some number of calls to Equals (depending on hash collisions) in the dictionary case. Then there's the intrinsic calculations (accessing arrays etc) involved in both cases.
Is this really a bottleneck for you though?
I'd expect Dictionary to be a bit faster at 500 elements, but not very much faster. As the collection grows, the difference will obviously grow.
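A rough timing harness along those lines, reusing the names from the question and assuming myObjs is populated and sorted and myObjss holds the same items:
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    myObjs.BinarySearch(myObjsToFind);
}
sw.Stop();
Console.WriteLine("BinarySearch: {0} ms", sw.ElapsedMilliseconds);

sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    MyObj value;
    myObjss.TryGetValue(myObjsToFind, out value);
}
sw.Stop();
Console.WriteLine("TryGetValue: {0} ms", sw.ElapsedMilliseconds);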
I have been doing some real-world tests with an in-memory collection of about 500k items.
Binary search wins in every way.
A dictionary slows down the more hash collisions you have. Binary search technically slows down too, but nowhere near as fast as the dictionary's algorithm.
The neat thing about the binary search is that it will tell you exactly where to insert the item into the list if it is not found, so building the sorted list is pretty fast too (not quite as fast).
Dictionaries that large also consume a lot of memory compared to a list sorted with binary search. From my tests, the sorted list consumed about 27% of the memory a dictionary did (so a dictionary claimed 3.7× the memory).
For a smallish list a dictionary is just fine; once you get largish, it may not be the best choice.
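To illustrate the insertion point mentioned above: when BinarySearch does not find the item, it returns the bitwise complement of the index where the item belongs, so the list can be kept sorted without a full re-sort (newObj and comparer are placeholders):
int index = myObjs.BinarySearch(newObj, comparer);
if (index < 0)
    myObjs.Insert(~index, newObj);   // insert at the complemented index; list stays sorted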
The latter.
A binary search runs in O(log n), while a hashtable lookup is O(1).
Big 'O' notation, as used by some of the commenters, is a great guideline to use. In practice, though, the only way to be sure which way is faster in a particular situation is to time your own code before and after a change (as hinted at by Jon).
BinarySearch requires the list to already be sorted. [edit: Forgot that Dictionary is a hashtable, so lookup is O(1).] The two are not really the same operation, either: the first one checks whether the item exists in the list and where it is. If you just want to check existence in a dictionary, use the ContainsKey method.
