Effective approach on fast look up of unique words in C# - c#

I have the following problem. I have to store a list of unique words in multiple languages in memory, and when I add a new word I have to check whether it already exists.
Of course this needs to be blazingly fast, primarily because of the huge number of words.
I was thinking about implementing a Suffix Tree, but I wondered whether there is an easier approach with some already implemented internal structures.
P.S. Number of words ≈ 10^7.

First, note that a suffix tree might be overkill here: it allows fast search for any suffix of any word, which is more than you are looking for. A trie is a very similar data structure that also allows fast search for a word, but since it does not support fast search for arbitrary suffixes, it is simpler to create (both to program and in construction cost).
Another, simpler alternative is a plain hash table, which C# provides as HashSet<T>. While a HashSet is in theory slower in the worst case, the average lookup takes constant time, and that might be enough for your application.
My suggestion is:
First try using a HashSet, which requires much less effort to implement; benchmark it and check whether it is enough.
Make sure your data structure is swappable, so you can replace it with very little effort if you later decide to. This is usually done by introducing an interface that is responsible for adding and looking up words; if you need to change it, just introduce a different implementation of the interface.
If you do decide to add a suffix tree or trie, use community resources. There is no need to reinvent the wheel: someone has already implemented most of these data structures, and they are available online.
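
As a minimal sketch of the interface idea above: the names IWordStore and HashSetWordStore are hypothetical, and ordinal string comparison is an assumption that may not suit every language.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction: callers depend only on this interface,
// so the backing structure can be swapped later (e.g. for a trie).
public interface IWordStore
{
    // Returns true if the word was added, false if it was already present.
    bool Add(string word);
    bool Contains(string word);
    int Count { get; }
}

// HashSet-backed implementation. Ordinal comparison is an assumption:
// it is fast and treats strings as raw sequences of UTF-16 code units.
public sealed class HashSetWordStore : IWordStore
{
    private readonly HashSet<string> _words =
        new HashSet<string>(StringComparer.Ordinal);

    public bool Add(string word) => _words.Add(word);
    public bool Contains(string word) => _words.Contains(word);
    public int Count => _words.Count;
}
```

Because HashSet<T>.Add already reports whether the item was new, the "check, then add" step from the question collapses into a single call.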

Related

Would a Patricia Trie algorithm be the fastest way to do a 'prefix' search (i.e. Starts With search) in .NET?

I have a file that contains a lot of names of people and some meta for each person.
I'm trying to do a search for the people, given some search query.
So I'm first going to do an exact search (easy, implemented as a Dictionary with a Key Lookup).
If that fails (or fails to reach the max number of records specified), then I was hoping to try and go back over all the keys again and do a StartsWith check.
Now this works fine. But I was curious if there's a better way to do this. Looking around the interwebs it looks like there's a data structure called a Patricia Trie alg.
so ..
Is the Patricia Trie the recommended way of doing a StartsWith check over a large set of string-keys?
If yes, is there a nuget package for this?
No - I do not want to do a Contains check. nope. nada. zilch.
Edit 1:
Also, my data will be loaded in once at app startup so I'm happy to incur a slower 'build' step while creating/populating the data structure. Especially if runtime search-perf is better.
Also, I understand that the word 'performance' depends on heaps of factors, like the number of items stored and the size of the search queries etc. So this is more an academic question, for which I wish to do some code comparison, than a micro-optimisation anti-pattern.
Edit 2:
Also, this previous SO question sorta talks about a similar problem I'm having but the answers seem to be talking about a substring search and not a prefix (i.e. StartsWith) search.
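
For reference, the two-stage search described in the question might look roughly like the sketch below. PersonSearch and Find are hypothetical names, and ordinal, case-sensitive comparison is an assumption. The fallback is a linear scan over all keys, which is the step a trie would replace.

```csharp
using System;
using System.Collections.Generic;

public static class PersonSearch
{
    // Stage 1: exact key lookup. Stage 2: if fewer than maxResults keys
    // were found, scan every key and collect those that start with the query.
    public static List<string> Find<TMeta>(
        Dictionary<string, TMeta> people, string query, int maxResults)
    {
        var matches = new List<string>();
        if (maxResults <= 0) return matches;

        if (people.ContainsKey(query))
            matches.Add(query);

        foreach (var key in people.Keys)
        {
            if (matches.Count >= maxResults) break;
            // A strictly longer key cannot be the exact match added above.
            if (key.Length > query.Length &&
                key.StartsWith(query, StringComparison.Ordinal))
            {
                matches.Add(key);
            }
        }
        return matches;
    }
}
```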

C# how to find the state someone was born in with their social security number

I am currently working on an assignment in which I am to validate various formats using regular expressions (phone numbers, birth date, email address, Social Security). One of the features our teacher has suggested would be to have a method that returns the state an individual was born in, using their Social Security Number.
xxx-xx-xxxx
The first 3 digits correspond to a state/area as outlined here:
http://socialsecuritynumerology.com/prefixes.php
If I've isolated the first 3 digits as an integer already, is there any way I could quickly match the number with its corresponding area?
Currently I'm only using if-else statements but it's getting pretty tedious.
Example:
// Area numbers 001-003 are New Hampshire and 004-007 are Maine.
if (x > 0 && x <= 3)
    return "New Hampshire";
else if (x <= 7)
    return "Maine";
...
You have a few options here:
50 if statements, one for each state, as you are doing.
A switch with 999 conditions, matching each option with a state. It probably looks cleaner and you can generate it with a script and interject the return statements wherever necessary. Maybe worse than option 1 in terms of tediousness.
Import the file as text, parse it into a Dictionary and do a simple lookup. The mapping is most likely not going to change in the near future, so the robustness argument is rather moot, but it is probably "simpler" in terms of amount of effort*. And it's another chance to practice regex to parse lines in the file you linked.
*Where "effort" is measured purely in the amount of tedious gruntwork prone to annoying human error and hand fatigue. Energy consumed within the brain due to engineering and implementing a solution where the computer does the ugly stuff for you is not included. :)
It's hard to tell your level of skill and what your course has taught you so far which is why it's difficult answering these kinds of questions, and also for the most part that's why you will get a negative response from people - they assume that you would have had the answer in your course materials already and will assume that you are being lazy.
I'm going to assume that you are at a basic level and that you already know how to solve the problem the brute force way (your if/else construct) and that you are genuinely interested in how to make your code better and not simply asking for a solution you can copy/paste.
Now, while your if/else idea will work, that is a procedural way of thinking. You are working with an object oriented language, so I would suggest to you to think about how you could use the principles of OO to make this work better. A good starting point would be to make a collection of state objects that contain all the parameters you need. You could then loop through your state collection and use their properties to find the matching one. You could create the state collection by reading from a file or database or even just hard coding it for the purposes of your assignment.
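
A minimal sketch of the "collection of state objects" idea, assuming hypothetical names (AreaRange, SsnAreaLookup). Only two ranges are filled in here (001-003 New Hampshire and 004-007 Maine, the two states in the question's example); the rest would have to come from the prefix table.

```csharp
using System.Collections.Generic;

// Hypothetical type: one entry per contiguous block of area numbers.
public sealed class AreaRange
{
    public AreaRange(int first, int last, string state)
    {
        First = first;
        Last = last;
        State = state;
    }

    public int First { get; }
    public int Last { get; }
    public string State { get; }

    public bool Contains(int area) => area >= First && area <= Last;
}

public static class SsnAreaLookup
{
    // Illustrative subset only; the remaining ranges are not listed here.
    private static readonly List<AreaRange> Ranges = new List<AreaRange>
    {
        new AreaRange(1, 3, "New Hampshire"),
        new AreaRange(4, 7, "Maine"),
    };

    // Returns the state name, or null when no range covers the area number.
    public static string FindState(int area)
    {
        foreach (var range in Ranges)
        {
            if (range.Contains(area))
                return range.State;
        }
        return null;
    }
}
```

With the data kept in one list, adding a state means adding one line rather than another else-if branch.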
Your intuition that a long chain of if and else if statements might be unwieldy is sound. Not only is it tedious, but the search to find the correct interval is inefficient. However, something needs to be in charge of this tedium. A simple solution would be to use a Dictionary to store key/value pairs that you could build up once, and reuse throughout the application. This has the downside of requiring more space than necessary, as every individual mapping becomes an element of the data structure. You could instead follow the advice from this question and use a data structure more suited to ranged values for look ups.
It's difficult to tell from your question as it's written what your level of expertise is, and going into any real detail here would essentially mean completing your assignment for you. It should be noted though, that there's nothing inherently wrong with your solution, and depending on where you are in your education it may be what's expected.

Queryable Language Dictionary and Word Searching Functionality

In a future project I will need to implement functionality for searching words (either by length or given a set of characters and their positions in the word) which will return all words that meet certain criteria.
In order to do so, I will need language dictionaries that can be easily queried with LINQ. The first thing I'd like to ask is whether anyone knows of good dictionaries to use in this kind of application and environment.
And I'd also like to ask about good ways to search the said dictionary for a word. Would a hash table help speeding up queries? The thing is that a language dictionary can be quite huge, and knowing I will have plenty of search criteria, what would be a good way to implement such functionality in order to avoid hindering the search speed?
Without knowing exactly what you are likely to need to optimize for, it's hard to say. The standard data structure for efficiently organizing a large corpus of words for fast retrieval is the trie or, if space efficiency is important (say, because you're writing a program for a phone or another memory-constrained environment), a DAWG: a Directed Acyclic Word Graph. (A DAWG is essentially a trie that merges common paths to leaves.)
Other interesting questions that I'd want to know an answer to before designing a data structure are things like: will the dictionary ever change? If it does change, are there performance constraints on how fast the new data needs to be integrated into the structure? Will the structure be used only as a fast lookup device, or would you like to store summary information about the words in it as well? (If the latter then a DAWG is unsuitable, since two words may share the same prefix and suffix nodes.) And so on.
I would search the literature for information on tries, DAWGs and ways to optimize Scrabble programs; clearly Scrabble requires all kinds of clever searching of a corpus of strings, and as a result there have been some very fast variants on DAWG data structures built by Scrabble enthusiasts.
I have recently written an immutable trie data structure in C# which I'm planning on blogging about at some point. I'll update this answer in the coming months if I do end up doing that.
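
To make the trie idea concrete, here is a minimal mutable sketch. It is not the immutable trie the answer refers to. The class and member names are hypothetical, and one dictionary per node is an assumption that favours simplicity over memory use.

```csharp
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    // Returns true if the word was newly added, false if already present.
    public bool Add(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        if (node.IsWord) return false;
        node.IsWord = true;
        return true;
    }

    public bool Contains(string word)
    {
        var node = Walk(word);
        return node != null && node.IsWord;
    }

    // Enumerates every stored word that starts with the prefix (unordered).
    public IEnumerable<string> WithPrefix(string prefix)
    {
        var start = Walk(prefix);
        if (start == null) yield break;

        var pending = new Stack<KeyValuePair<Node, string>>();
        pending.Push(new KeyValuePair<Node, string>(start, prefix));
        while (pending.Count > 0)
        {
            var current = pending.Pop();
            if (current.Key.IsWord) yield return current.Value;
            foreach (var child in current.Key.Children)
                pending.Push(new KeyValuePair<Node, string>(child.Value, current.Value + child.Key));
        }
    }

    // Follows the characters of s from the root; null if the path is missing.
    private Node Walk(string s)
    {
        var node = _root;
        foreach (char c in s)
        {
            if (!node.Children.TryGetValue(c, out node))
                return null;
        }
        return node;
    }
}
```

Because each word ends at its own node, per-word summary information could be stored there, which is the property the answer notes a DAWG gives up.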

Comparer Class Most Efficient?

Consider the following
A list of approximately 10,000 folders
Of these folders, a list of rules determine if they qualify to go to the next stage
The rules are a text based comparison such that
if the folder name contains (...any of the following from a list of exceptions), such that there is a one-to-many comparison for each folder, and the folder name string must contain (or must NOT contain) any of the strings it is compared to
I'm relatively new to C# so I'm not entirely sure what's under the hood of each class
Any advice in some general direction would be greatly appreciated.
Do you have a performance problem, or are you trying to optimize the code before it has been written?
The Comparer class is typically not the best-performing class in the .NET Framework, but it has to cater for quite a lot of scenarios.
If you know the source and target types, you're usually better off implementing your own specific comparer class.
However, unless you know that you have a performance problem, I wouldn't worry too much about it.
First of all, 10K folders is not a large number, so you might not need to worry about performance yet.
So don't optimize...
After that, you might want to consider the way you search your names. Instead of a separate search for every single element, you could create a regex that performs all the searches at once, but that is optimization without a real need...
First you need a reason to change the code.
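
A minimal sketch of the single-regex idea mentioned above, assuming hypothetical names (FolderFilter, BuildMatcher, Excluding), a non-empty term list, and case-insensitive matching. Whether it actually beats a plain loop over string.Contains would need measuring.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderFilter
{
    // Builds one alternation pattern, e.g. "temp|\.old", from the terms.
    // Regex.Escape keeps characters such as '.' or '(' literal.
    // Assumes at least one term: an empty pattern would match every name.
    public static Regex BuildMatcher(IEnumerable<string> terms)
    {
        string pattern = string.Join("|", terms.Select(t => Regex.Escape(t)));
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Keeps only the folder names that contain none of the terms.
    public static List<string> Excluding(IEnumerable<string> folders, Regex matcher)
    {
        return folders.Where(f => !matcher.IsMatch(f)).ToList();
    }
}
```

The matcher is built once and reused for all 10,000 folder names; flipping the condition in Excluding gives the "must contain" variant.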

Should I be concerned about .NET dictionary speed?

I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?
Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else? Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though will it?
Also, I don't think I'm micro-optimizing because this really will be a significant part of code on a production server, so if this takes an extra 100ms to complete, then we will be looking for new ways to handle this.
You are micro-optimizing. Do you even have working code yet? Remember, "If it doesn't work, it doesn't matter how fast it doesn't work." (Mich Ravera) http://www.codingninja.co.uk/best-programmers-quotes/.
You have no idea where the bottlenecks will be, and already you're focused on Dictionary. What if the problem is somewhere else?
How do you know how the Dictionary class is implemented? Maybe it already uses an array with hashed keys!
P.S. It's really ".NET Dictionaries", not "C# Dictionaries", because C# is just one of several programming languages that use the framework.
"Hello, I will be creating a project that will use dictionary lookups and inserts quite a bit. Is this something to be concerned about?"
Yes. It is always wise to consider performance factors up front.
The form that your concern should take is as follows: your concern should be encouraging you to write realistic, user-focused performance specifications. It should be encouraging you to start writing performance tests early, and running them often, so that you can see how every single change to the product affects performance. That way you will be informed immediately when a code change causes a user-affecting change in performance. And it should be encouraging you to run profiles often, so that you are reasoning about performance based on empirical measurements, rather than random guesses and hunches.
"Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?"
The best way to do this is to build a reasonable abstraction layer. If you have a class (or interface) which represents the "insert" and "lookup" abstract data type, then you can replace its internals without changing any of the callers.
Note that adding a layer of abstraction itself has a performance cost. If your profiling shows that the abstraction layer is too expensive, if the extra couple nanoseconds per call is too much, then you might have to get rid of the abstraction layer. Again, this decision will be driven by real-world performance data.
"Would using an array with "hashed" keys even be faster? That wouldn't help on insert time though will it?"
Neither you nor anyone reading this can possibly know which one is faster until you write it both ways and then benchmark it both ways under real-world conditions. Doing it under "lab" conditions will skew your results; you'll need to understand how things work when the GC is under realistic memory pressure, and so on. You might as well ask us which horse will run faster in next year's Kentucky Derby. If we knew the answer just by looking at the racing form, we'd all be rich already. You can't possibly expect anyone to know which of two entirely hypothetical, unwritten pieces of code will be faster under unspecified conditions!
The Dictionary<TKey, TValue> class is actually implemented as a hash table which makes lookups very fast (close to O(1)). See the API documentation for more information. I doubt you could make a better implementation yourself.
Wait and see if the performance of your application is below expectations.
If it is, then use a profiler to determine whether the Dictionary lookup is the source of the problem.
If it is, then do some tests with representative data to see whether another choice of collection would be quicker.
In short - no, in general you shouldn't worry about the performance of implementation details until after you have a problem.
I would benchmark Dictionary<TKey, TValue>, the non-generic Hashtable, and perhaps a home-grown class, and see which works out best under your typical usage conditions.
Normally I would say it's fine (insert Stack Overflow's favorite premature-optimization quote here), but if this is a core piece of the application: benchmark, benchmark, benchmark.
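
A rough sketch of such a comparison using Stopwatch. LookupBenchmark and Run are hypothetical names, and integer keys with a roughly 50% miss rate are assumptions chosen only for illustration. As noted elsewhere on this page, a micro-benchmark like this says little about behaviour under realistic load; a dedicated harness such as BenchmarkDotNet would give more trustworthy numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class LookupBenchmark
{
    // Fills the store with `size` integer keys, then times `lookups`
    // ContainsKey calls. Returns the elapsed time and the number of hits
    // (using the hit count keeps the loop from being optimized away).
    public static (TimeSpan Elapsed, int Hits) Run(
        IDictionary<int, int> store, int size, int lookups)
    {
        for (int i = 0; i < size; i++)
            store[i] = i;

        var random = new Random(12345);   // fixed seed: same key sequence every run
        int hits = 0;
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < lookups; i++)
        {
            // Keys are drawn from [0, 2 * size), so about half are misses.
            if (store.ContainsKey(random.Next(2 * size)))
                hits++;
        }
        watch.Stop();
        return (watch.Elapsed, hits);
    }
}
```

Any IDictionary<int, int> implementation can be passed in, for example `LookupBenchmark.Run(new Dictionary<int, int>(), 1000000, 1000000)` and the same call with a SortedDictionary<int, int>.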
The only concern that I can think of is that the speed of the dictionary relies on the key class having a reasonably fast GetHashCode method. Lookups and inserts are really fast, so you shouldn't have any problem there.
Regarding using an array, that's what the Dictionary class does already. Actually it uses two arrays: one of bucket indexes and one of entries, where each entry holds a key, its value, and its hash code.
If you did have performance problems with a Dictionary, it would be quite easy to write a wrapper around any kind of storage that has the same methods and behaviour as a Dictionary, so that you can replace it seamlessly.
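
To illustrate the GetHashCode point, here is a sketch of a custom key type. EndpointKey is a hypothetical class, and the multiplier 397 is just a commonly used odd prime, not a requirement.

```csharp
using System;

// Hypothetical key type: a (host, port) pair used as a dictionary key.
public sealed class EndpointKey : IEquatable<EndpointKey>
{
    public EndpointKey(string host, int port)
    {
        Host = host ?? throw new ArgumentNullException(nameof(host));
        Port = port;
    }

    public string Host { get; }
    public int Port { get; }

    public bool Equals(EndpointKey other)
    {
        return other != null
            && Port == other.Port
            && string.Equals(Host, other.Host, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) => Equals(obj as EndpointKey);

    // Cheap to compute and consistent with Equals:
    // keys that compare equal always produce the same hash code.
    public override int GetHashCode()
    {
        unchecked
        {
            return (StringComparer.Ordinal.GetHashCode(Host) * 397) ^ Port;
        }
    }
}
```

The type is immutable on purpose: if a key's hash code changed after insertion, the dictionary would no longer find it.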
I'm not sure that anyone has really answered this part yet:
"Also, if I do benchmarking and such and it is really bad, then what is the best way of replacing dictionary with something else?"
For this, wherever possible, declare your variables as IDictionary<TKey, TValue>. That's the main interface that Dictionary implements. (I'm assuming that if you care that much about performance, then you aren't considering non-generic collections.) Then, in the future, you can change the underlying implementation class without having to change any of the code that uses that dictionary. For example:
IDictionary<string, int> myDict = new Dictionary<string, int>();
If your application is multithreaded, then the key part of performance is going to be synchronizing this Dictionary correctly.
If it is single-threaded, then the bottleneck will almost certainly be elsewhere, such as reading these objects from wherever you are reading them.
I use a Dictionary in a UDP relay server. Each time a packet arrives it performs Dictionary.ContainsKey and Dictionary[Key], and it works great (massive number of clients). I had concerns when I was making the thing, but it turned out that was the last thing I should have worried about.
Have a look at C# HybridDictionary Usage
HybridDictionary Class
"This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary."
You may consider using the C5 library. I've found it to be very fast and thoughtfully designed. Others on Stack Overflow have found the same. With C5 you have the option of using general type interfaces (with a capital I), or directly the data structures underneath. Naturally the interfaces allow you to swap out different implementations, but I have found in performance testing that the interfaces will cost you.
You may want to look at the KeyedCollection class in System.Collections.ObjectModel. From the MSDN description, "provides the abstract base class for a collection whose keys are embedded in the values."
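
A short sketch of how KeyedCollection<TKey, TItem> is typically used. Customer and CustomerCollection are hypothetical types made up for illustration.

```csharp
using System.Collections.ObjectModel;

// Hypothetical item type whose key (Id) is part of the value itself.
public sealed class Customer
{
    public Customer(string id, string name)
    {
        Id = id;
        Name = name;
    }

    public string Id { get; }
    public string Name { get; }
}

public sealed class CustomerCollection : KeyedCollection<string, Customer>
{
    // Tells the base class how to extract the key from each item.
    protected override string GetKeyForItem(Customer item) => item.Id;
}
```

After `customers.Add(new Customer("c42", "Ada"))`, the item can be reached by key with `customers["c42"]` or by position with `customers[0]`.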
