Huge Dictionary and substring lookup - C#

I have a dictionary with 500,000 keys and I have to compare using Key.Contains("Description"). This is making my performance really slow. Is there any alternative way to perform a faster search?
I had a List before, but that performed even worse. I tried using an index on the List, but it did not improve performance much.

Other than storing all possible substrings of all possible keys as the keys in the dictionary (which you almost certainly wouldn't have enough memory to do) there really isn't much to be done besides iterating through the entire collection and doing the check on each item. Given that you're iterating the entire collection, there's not really much benefit to using a Dictionary over a List, at least for this specific operation (perhaps other operations you perform on this data benefit from it being in a Dictionary). They're both going to be quite slow. You simply have an inherently expensive operation that you're trying to perform.
If you can alter your requirements somehow to search for a string exactly equal to your search string then you can use the dictionary's hash based lookup, which is super fast, and if you could use a StartsWith or EndsWith operation instead of a full Contains then you could sort the data and use a binary search, but with a Contains operation none of those optimizations can be made.
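A rough sketch of the trade-offs described above; `dictionary` stands in for the 500,000-key collection, and the prefix half assumes you can switch from Contains to StartsWith:

using System;
using System.Collections.Generic;
using System.Linq;

var dictionary = new Dictionary<string, string>();   // stands in for the 500k-key dictionary

// Substring match: nothing for it but scanning every key.
var matches = dictionary.Keys
    .Where(key => key.Contains("Description"))
    .ToList();

// Exact match: a single O(1) hash lookup instead of a scan.
if (dictionary.TryGetValue("Description", out var value))
{
    // use value
}

// StartsWith: sort the keys once, then binary-search for the prefix.
var sortedKeys = dictionary.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();
int index = sortedKeys.BinarySearch("Description", StringComparer.Ordinal);
if (index < 0) index = ~index;                       // first key >= the prefix
while (index < sortedKeys.Count &&
       sortedKeys[index].StartsWith("Description", StringComparison.Ordinal))
{
    // all keys with the prefix are contiguous from here
    index++;
}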

If the search is performed multiple times, you may want to consider using extra collections holding just the items that match a predefined condition.
These collections would be populated at the same time your original dictionary is populated.
This could be a viable solution if you have a limited number of fixed searches.
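Something like this hypothetical loader, which pays the Contains cost once at population time instead of on every search (all names are illustrative):

using System.Collections.Generic;

var dictionary = new Dictionary<string, object>();
var descriptionKeys = new List<string>();        // keys containing "Description"

void Add(string key, object value)
{
    dictionary[key] = value;
    if (key.Contains("Description"))             // check once, at load time
        descriptionKeys.Add(key);
}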

I've read that Regex adds some overhead, but why not benchmark it yourself?
Something like this:
using System.Text.RegularExpressions;

var test = "Telle Carraige Sawmill Rh-ccxxH440xxx38.5Hyv-Op-rL-2008";
var matchCollection = Regex.Matches(test, "(Carraige|Sawmill)", RegexOptions.IgnoreCase);
// matchCollection.Count should be == 2
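And if you want to compare that against plain string matching on your actual data, a crude Stopwatch harness like the one below would do; `keys` and `LoadKeys()` are placeholders for your 500,000 dictionary keys:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

List<string> keys = LoadKeys();   // hypothetical: however you obtain your 500k keys

var sw = Stopwatch.StartNew();
int plainHits = keys.Count(k =>
    k.IndexOf("Carraige", StringComparison.OrdinalIgnoreCase) >= 0 ||
    k.IndexOf("Sawmill", StringComparison.OrdinalIgnoreCase) >= 0);
Console.WriteLine($"String.IndexOf: {sw.ElapsedMilliseconds} ms, {plainHits} hits");

var pattern = new Regex("(Carraige|Sawmill)",
                        RegexOptions.IgnoreCase | RegexOptions.Compiled);
sw.Restart();
int regexHits = keys.Count(k => pattern.IsMatch(k));
Console.WriteLine($"Compiled Regex: {sw.ElapsedMilliseconds} ms, {regexHits} hits");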

Related

Fastest way to check if a string is a substring C#?

I have a need to check if a list of items contains a string... so kind of like the list gets filtered as the user types in a search box. So, on the text-changed event, I am checking if the entered text is contained in one of the listbox items and filtering out... so
something like:
value.Contains(enteredText)
I was wondering if this is the fastest and most efficient way to filter out listbox items?
Is the Contains() method the best way to search for substrings in C#?
I'd say that in all but very exceptional circumstances it's fast and efficient enough, and even in such exceptional circumstances it's likely to be a purely academic problem. If you use it and come across any bottlenecks in your logic related to this I'd be surprised, and only then would it be worth looking at; chances are you'll end up looking elsewhere anyway.
Contains is one of the cheapest methods in my code completion filtering algorithm (Part 6 #6, where #7 and the fuzzy logic matching described in the footnote are vastly more expensive), which doesn't have problems keeping up with even a fast typing user and thousands of items in the dropdown.
I highly doubt it will cause you problems.
Although this is not the fastest option globally, it is the fastest one for which you do not need to code anything. It should be sufficient for filtering drop-down items.
For longer texts, you may want to go with the KMP algorithm, which has linear time complexity. Note, however, that it would not make any difference for very short search strings.
For searches that have lots of matches (e.g. the ones you get for the first one to two characters) you may want to precompute a table that maps single letters and letter pairs to the rows in your drop-down list. That gives a much faster look-up at the expense of using more memory - a pretty standard tradeoff in programming in general.
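A rough sketch of that table, assuming the drop-down rows live in a list of strings (all names here are illustrative):

using System;
using System.Collections.Generic;

var items = new List<string>();   // the drop-down rows
var pairIndex = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

// Build once: map every two-character window to the rows containing it.
for (int row = 0; row < items.Count; row++)
{
    string text = items[row];
    for (int i = 0; i < text.Length - 1; i++)
    {
        string pair = text.Substring(i, 2);
        if (!pairIndex.TryGetValue(pair, out var rows))
            pairIndex[pair] = rows = new List<int>();
        if (rows.Count == 0 || rows[rows.Count - 1] != row)   // skip duplicate row entries
            rows.Add(row);
    }
}

// A two-character search is now a single dictionary lookup:
var candidates = pairIndex.TryGetValue("ca", out var hit) ? hit : new List<int>();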

What is faster: sorted collection or list and linq queries on it (with intensive insertion/deletion)?

I have a dilemma. I have to implement a prioritized queue (custom sort order), and I need to insert/process/delete a lot of messages per second with it (~100-1000).
Which design is faster at run-time?
1) custom sorted by priority collection (list)
2) list(non-sorted collection) + linq query all time when I need to process (dequeue) message
3) something else
ADDED:
SOLUTION:
A list (dictionary) of queues keyed by priority: SortedList<int, VPair<bool, Queue<MyMessage>>>,
where int is the priority and bool is true if the queue is not empty.
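A minimal sketch of that layout, leaving out the emptiness flag and using a plain Queue per priority level (MyMessage is the asker's type; the class name is illustrative):

using System.Collections.Generic;

class PriorityMessageQueue<TMessage>
{
    // One FIFO queue per priority; SortedList keeps priorities in ascending order.
    private readonly SortedList<int, Queue<TMessage>> _queues =
        new SortedList<int, Queue<TMessage>>();

    public void Enqueue(int priority, TMessage message)
    {
        if (!_queues.TryGetValue(priority, out var queue))
            _queues[priority] = queue = new Queue<TMessage>();
        queue.Enqueue(message);
    }

    public bool TryDequeue(out TMessage message)
    {
        foreach (var queue in _queues.Values)   // lowest priority value first
        {
            if (queue.Count > 0)
            {
                message = queue.Dequeue();
                return true;
            }
        }
        message = default;
        return false;
    }
}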
What's your read/write ratio? Are multiple threads involved? If so, how?
As always when asking about performance, benchmark both code-paths and see for yourself (this is especially true the more specific your problem domain is).
The only way to know for sure is to measure the performance for yourself.
Well, finding an element in an unsorted data structure takes O(n) on average (one pass over the data structure). Binary search trees have an average insertion complexity of O(log n) and an average lookup complexity of O(log n), so in theory using something like that would be faster. In reality, the overhead or the shape of the data might kill the theoretical advantage.
Also if your custom sort order can change at runtime you might have to rebuild the sorted data structure which is an additional performance hit.
In the end: If it is important for your application then try the different approaches and benchmark it yourself - it's the only way to be certain that it works.
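For what it's worth, .NET already ships a balanced binary search tree: SortedSet<T> is a red-black tree, so something like the following gives O(log n) insertion and retrieval under a custom sort order (the tuple layout is purely illustrative):

using System;
using System.Collections.Generic;

// (Priority, Sequence) tuples stand in for real messages; the sequence
// number breaks ties so equal-priority messages aren't treated as duplicates.
var queue = new SortedSet<(int Priority, long Sequence)>();
queue.Add((2, 1));
queue.Add((1, 2));

var next = queue.Min;               // lowest (i.e. highest-priority) element
queue.Remove(next);
Console.WriteLine(next.Priority);   // 1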
Introducing sorting will always incur an insertion performance overhead as far as I'm aware. If there's no need for the sorting then use a nice generic Dictionary which will provide a quick lookup based on your unique key.

Efficient insertion and search of strings

In an application I will have between about 3000 and 30000 strings.
After creation (read from files unordered) there will not be many strings that will be added often (but there WILL be sometimes!). Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure can I best use: a hashtable, a tree (red-black, splay, ...), or just an ordered list (maybe a string array)?
(Additional remark : a link to a good C# implementation would be appreciated as well)
It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit - Add assumes the array does not need to be resized. If it does, then as Noldorin has stated it is O(n).
I used HashSet on a recent VB 6 (I didn't write it) to .NET 3.5 upgrade project, where I was iterating over a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of the items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well - I had a requirement to reuse as much of the existing code as possible, as this had been thoroughly tested. The HashSet certainly performed very fast.
If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of having fixed bounds on worst case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and if properly configured have similar average-case performance. However, they do tend to use more memory and have non-deterministic time lookups, insertions (and depending on the implementation possibly deletions).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
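To illustrate the comparer point, a case-insensitive set is just a constructor argument away (a small, untested sketch):

using System;
using System.Collections.Generic;

var strings = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
strings.Add("Foo");

Console.WriteLine(strings.Contains("FOO"));   // True - same string, ignoring case
Console.WriteLine(strings.Add("foo"));        // False - already present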
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 items that is largely precomputed, however. You might even do better just allocating an array of k * N optional (nullable) slots and filling it by skipping k spaces between each actual item, thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order. If you need a precise "where would this string appear in the set", though, this wouldn't work: you would need O(n) time to examine each space before the item, checking whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.
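For reference, a bare-bones trie over lowercase ASCII might look like the following; it is purely illustrative, and a real radix tree would compress chains of single children:

// Minimal trie over 'a'..'z'; assumes all input is lowercase ASCII.
class TrieNode
{
    public TrieNode[] Children = new TrieNode[26];
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        var node = _root;
        foreach (char c in word)
            node = node.Children[c - 'a'] ??= new TrieNode();
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            node = node.Children[c - 'a'];
            if (node == null) return false;   // no path for this prefix
        }
        return node.IsWord;
    }
}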

What .NET collection provides the fastest search

I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List or Hashtable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
    if (LookupCollection.Contains(item.Key))
    {
        // Do something
    }
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
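Both options in one sketch, assuming the asker's Record type exposes a string Key:

using System.Collections.Generic;
using System.Linq;

static void CheckAll(List<Record> largeCollection, List<Record> lookupCollection)
{
    // Unordered route: hash the 20k keys once, then O(1) per Contains.
    var lookupSet = new HashSet<string>(lookupCollection.Select(r => r.Key));

    // Ordered route: sort the keys once, then O(log n) per BinarySearch.
    var sortedKeys = lookupCollection.Select(r => r.Key).OrderBy(k => k).ToList();

    foreach (Record item in largeCollection)
    {
        if (lookupSet.Contains(item.Key))   // or: sortedKeys.BinarySearch(item.Key) >= 0
        {
            // Do something
        }
    }
}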
Have you considered List.BinarySearch(item)?
You said that your lookup collection is already sorted, so this seems like the perfect opportunity. A hash would definitely be the fastest, but it brings its own problems and requires a lot more storage overhead.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, consistently running neck-and-neck when looking something up as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First - 3 chars with all of the possible combinations of A-Z0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations. This should give about a 1-in-4.6 hit rate for a random 3-char lookup, so it is a test where a result isn't guaranteed. I ran the test again.
IMHO HashTable, although the fastest, isn't always the most convenient when working with objects. But a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know, FUN) I ran it with 1.68M rows (4 characters).
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y; stop when either list is empty.
The run time of this intersection walk is linear in the combined sizes of x and y - each element is visited at most once.
Don't run a .Contains() loop; that is proportional to x * y, which is much worse.
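A small sketch of that walk over two sorted string lists (untested, names illustrative):

using System;
using System.Collections.Generic;

static void IntersectSorted(List<string> x, List<string> y, Action<string> onMatch)
{
    int i = 0, j = 0;
    while (i < x.Count && j < y.Count)
    {
        int cmp = string.CompareOrdinal(x[i], y[j]);
        if (cmp == 0) { onMatch(x[i]); i++; j++; }   // match: advance both
        else if (cmp < 0) i++;                       // x is behind: advance x
        else j++;                                    // y is behind: advance y
    }
}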
If it's possible to sort your items then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if sortable, sort both lists; then it's just a matter of walking the lookup list in order:

walk the lookup list
    while check-list item <= lookup-list item
        if check-list item == lookup-list item, do something
    move to the next lookup-list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection)... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough for this to be a problem 99% of the time.
But if this is just one of thousands of times you are going to do this, and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walks the sorted lists, doing comparisons as you go. Each list would be walked at most once, and in the pathological cases it wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing it would be the next step).

List.BinarySearch vs Dictionary.TryGetValue - which is faster

Which would be faster for, say, 500 elements?
Or what's the faster data structure/collection for retrieving elements?
List<MyObj> myObjs = new List<MyObj>();
int i = myObjs.BinarySearch(myObjsToFind);
MyObj obj = myObjs[i];
Or
Dictionary<MyObj, MyObj> myObjss = new Dictionary<MyObj, MyObj>();
MyObj value;
myObjss.TryGetValue(myObjsToFind, out value);
I assume in your real code you'd actually populate myObjs - and sort it.
Have you just tried it? It will depend on several factors:
Do you need to sort the list for any other reason?
How fast is MyObj.CompareTo(MyObj)?
How fast is MyObj.GetHashCode()?
How fast is MyObj.Equals()?
How likely are you to get hash collisions?
Does it actually make a significant difference to you?
It'll take around 8 or 9 comparisons in the binary search case, against a single call to GetHashCode and some number of calls to Equals (depending on hash collisions) in the dictionary case. Then there's the intrinsic calculations (accessing arrays etc) involved in both cases.
Is this really a bottleneck for you though?
I'd expect Dictionary to be a bit faster at 500 elements, but not very much faster. As the collection grows, the difference will obviously grow.
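If you do want to measure it, a crude Stopwatch loop is enough to get a feel for the difference; the harness below uses int keys in place of MyObj purely to stay self-contained:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

var list = Enumerable.Range(0, 500).ToList();   // already sorted
var dict = list.ToDictionary(x => x, x => x);

const int Iterations = 1_000_000;
var sw = Stopwatch.StartNew();
for (int n = 0; n < Iterations; n++)
    list.BinarySearch(n % 500);
Console.WriteLine($"BinarySearch: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int n = 0; n < Iterations; n++)
    dict.TryGetValue(n % 500, out _);
Console.WriteLine($"TryGetValue:  {sw.ElapsedMilliseconds} ms");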
Have been doing some real-world tests with an in-memory collection of about 500k items.
Binary search wins in every way.
Dictionary slows down the more hash collisions you have. Binary search technically slows down too, but nowhere near as fast as the dictionary's algorithm does.
The neat thing about binary search is that it tells you exactly where to insert the item into the list if it is not found, so building the sorted list is pretty fast too (though not as fast).
Dictionaries that large also consume a lot of memory compared to a list sorted for binary search. From my tests, the sorted list consumed about 27% of the memory the dictionary did (so the dictionary claimed about 3.7x the memory).
For smallish lists a dictionary is just fine - once you get largish, it may not be the best choice.
The latter.
A binary search runs in O(log n), while a hashtable lookup is O(1).
Big 'O' notation, as used by some of the commenters, is a great guideline to use. In practice, though, the only way to be sure which way is faster in a particular situation is to time your own code before and after a change (as hinted at by Jon).
BinarySearch requires the list to already be sorted. [Edit: forgot that Dictionary is a hashtable, so lookup is O(1).] The two are not really the same, either: the first one is really just checking whether the item exists in the list and where it is. If you just want to check existence in a Dictionary, use the ContainsKey method.
