Substring indexing many similar strings - c#

I have a large collection of data, in this case imagine an array of 80,000+ strings, all containing file paths.
Because they are file paths, large groups of them start with the same prefix; e.g. over 50,000 of the files start with "/dataset1/subsetAA/childX/".
I want to allow freetext searching of these paths. Right now I do that with a simple predicate that looks like this:
foreach (String term in terms)
{
    if (path.IndexOf(term, StringComparison.OrdinalIgnoreCase) == -1)
        return false;
}
return true;
I do save search results as they're typed in, so the more you type the quicker it gets; however, the initial few searches (e.g. for "f" > "fo" > "foo") can take 3 or 4 seconds even on a fast machine.
I'd like to build a substring index that eliminates my need to use IndexOf, preferably one that takes advantage of the common paths to reduce index size, since I don't want to consume too much memory.

Read about the data structure known as a trie: http://en.wikipedia.org/wiki/Trie
It does exactly what you want: it takes many strings and builds a tree out of their common prefixes, with each leaf representing a string that follows the series of prefixes in its parents (the full string can be rebuilt by concatenating everything on the path from the root down to the leaf, which is what saves space).
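For illustration, here is a minimal sketch of such a trie in C#; the class and member names (CharTrie, Insert, ContainsPrefix) are my own and not taken from any particular library. Paths that share a prefix share the same chain of nodes, which is where the memory saving comes from.

using System.Collections.Generic;

class CharTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEnd;
    }

    private readonly Node _root = new Node();

    public void Insert(string s)
    {
        var node = _root;
        foreach (char c in s)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();  // shared prefixes reuse existing nodes
            node = child;
        }
        node.IsEnd = true;
    }

    // True if any stored string starts with the given prefix.
    public bool ContainsPrefix(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return true;
    }
}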

however the initial few searches (e.g. for "f" > "fo" > "foo") can take up to 3 or 4 seconds on even a fast machine.
That's the only thing that you need to optimize then. Create a very simple structure consisting of three hash tables - one keyed by single characters, one by two-character substrings, and one by three-character substrings. Each entry of the one-character index would hold the list of items that contain the indexed character; each entry of the two-character index would hold the list of items that contain the indexed pair of characters; the three-character index would do likewise.
When the initial portion of the search is typed, look up using indexes. For example, when f is typed, you would grab the list of items containing f from the first hash table. As the user continues typing, you'd grab items from the second index for the "fo" key, and then from the third index for the "foo" key.
As soon as you get four characters or more, you go back to the searches based on IndexOf, using the last three characters of the search term to look up the initial list in the hash based on three-character substrings. The number of items that you get from the list would be relatively small, so the searches should go much faster.
Another optimization would be stopping your search as soon as you have enough items to display to your user. For example, if the user types "tas" (from "dataset"), your three-character index might give you 50,000 hits. Grab the first 20 (or as many as you need to display) and skip the rest: the user will refine the search shortly, so the additional items would likely be discarded anyway.
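To make the idea concrete, here is a rough sketch of such an index; NGramIndex and its members are hypothetical names, not an existing API, and the fallback for terms longer than three characters follows the description above.

using System;
using System.Collections.Generic;
using System.Linq;

class NGramIndex
{
    // n-gram (lower-cased) -> set of paths containing it, for n = 1, 2, 3
    private readonly Dictionary<string, HashSet<string>>[] _maps =
    {
        new Dictionary<string, HashSet<string>>(),
        new Dictionary<string, HashSet<string>>(),
        new Dictionary<string, HashSet<string>>()
    };

    public void Add(string path)
    {
        string lower = path.ToLowerInvariant();
        for (int n = 1; n <= 3; n++)
        {
            var map = _maps[n - 1];
            for (int i = 0; i + n <= lower.Length; i++)
            {
                string gram = lower.Substring(i, n);
                if (!map.TryGetValue(gram, out var set))
                    map[gram] = set = new HashSet<string>();
                set.Add(path);
            }
        }
    }

    // Terms of up to three characters are answered straight from the index;
    // longer terms start from the bucket of their last three characters and
    // fall back to IndexOf, as described above.
    public IEnumerable<string> Search(string term)
    {
        if (string.IsNullOrEmpty(term))
            return Enumerable.Empty<string>();
        string lower = term.ToLowerInvariant();
        int n = Math.Min(lower.Length, 3);
        string key = lower.Substring(lower.Length - n, n);
        if (!_maps[n - 1].TryGetValue(key, out var candidates))
            return Enumerable.Empty<string>();
        if (lower.Length <= 3)
            return candidates;
        return candidates.Where(p =>
            p.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}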

Related

Iterate over strings that ".StartsWith" without using LINQ

I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "#" and search a list of contacts for the string that follows the "#" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str)). The problem is that the number of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) but that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50000? If you can limit your dropdown, you can for example just display the first 10.
var filteredMembers = new List<MemberClass>();
foreach (var member in Members)
{
    if (member.Username.StartsWith(str)) filteredMembers.Add(member);
    if (filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a Trie in addition to your collection. That should give you better performance than looping through all 50,000 elements.
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in a database, you can just use the StartsWith approach in the ORM, which would translate to a LIKE query on the DB, which would do its job. If you enable full-text indexing on the column, you could improve the performance even more.
Now suppose all the names are already in memory. Remember that the CPU is extremely fast, so even looping through 50,000 entries takes just a few moments.
The StartsWith method is optimized and will return false as soon as it encounters a non-matching character, so finding the entries that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage - building the trie requires you to read all the names and create the whole data structure, which is complex. You would also be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can, however, group the names into "buckets". Start with the first character and create a dictionary with the character as the key and a list of names as the value. You have now narrowed every following search roughly 26-fold (assuming the English alphabet). And you don't have to stop there - you can repeat this at another level, for the second character within each group, then the third, and so on.
With each level you narrow each group significantly, and subsequent searches will be much faster. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off: more work up-front = faster search, less work up-front = slower search.
Finally, as the user types, each new letter narrows the target group. You can therefore maintain the set of names relevant to the current input and cut it down with each successive keystroke. This prevents you from having to start from the beginning each time and improves efficiency significantly.
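A rough sketch of the first-character bucketing, using a hypothetical Member class with a Username property (the names are mine, for illustration only); deeper levels would simply repeat the same grouping inside each bucket.

using System;
using System.Collections.Generic;
using System.Linq;

class Member { public string Username { get; set; } }

class MemberBuckets
{
    private readonly Dictionary<char, List<Member>> _buckets;

    public MemberBuckets(IEnumerable<Member> members)
    {
        // Up-front cost: group everyone by the first character once.
        _buckets = members
            .Where(m => !string.IsNullOrEmpty(m.Username))
            .GroupBy(m => char.ToLowerInvariant(m.Username[0]))
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    // Only the matching bucket is scanned on each keystroke.
    public IEnumerable<Member> StartingWith(string typed)
    {
        if (string.IsNullOrEmpty(typed) ||
            !_buckets.TryGetValue(char.ToLowerInvariant(typed[0]), out var bucket))
            return Enumerable.Empty<Member>();
        return bucket.Where(m =>
            m.Username.StartsWith(typed, StringComparison.OrdinalIgnoreCase));
    }
}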
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
    var startIndex = list.BinarySearch(firstFewLetters);
    //If negative, there was no exact match. Take the bitwise complement to get the index of the closest match.
    if (startIndex < 0)
    {
        startIndex = ~startIndex;
    }
    //Take maxHits items, or go till end of list
    var endIndex = Math.Min(
        startIndex + maxHits - 1,
        list.Count - 1
    );
    //Enumerate matching items
    for (int i = startIndex; i <= endIndex; i++)
    {
        var s = list[i];
        if (!s.StartsWith(firstFewLetters)) break; //This line is optional
        yield return s;
    }
}
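A quick usage sketch with made-up sample data:

// Sort once up-front, then prefix-search as the user types.
var names = new List<string> { "Carl", "Benny", "Anna", "Bennett", "Bella" };
names.Sort();
foreach (var hit in Find(names, "Ben", 10))
    Console.WriteLine(hit);   // prints Bennett, then Benny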

Search algorithm for partial words in C#

I'm currently looking for a way to realize a partial word pattern algorithm in C#. The situation I'm in looks like follows:
I got a textfield for the search pattern. Every time the user enters or deletes a char in this field, an event triggers which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", ""Faceless Face" (whatever that should be) or in generally ANY real life sentences as strings,
the algorithm would first start running when typing "f" in the field. It then show the most relevant String on top of a list the strings are in. The second time it runs when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the textfield and the list is sorted again.
However I don't know what algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein-Distance algorithm as well as an self-made algorithm, which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
count = Regex.Matches(title, Regex.Escape(pattern)).Count;
priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense since the strings are not the same length most of the time
The answer from Alain works fine, but only if at least one word matches completely (you only get a most relevant string/sentence when at least one word is equal to the pattern), so if you have typed "face" and there is a string containing the word "facebook", that string is almost never a top priority.
What other ideas could I try? The goal would be to sort the list of strings the best possible way in the earliest moment (with the fewest letters).
You can look at my implementations in the search-* branches of my repository at http://github.com/croemheld/sprung, in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.
First of all, you need to store a frequency with each string (the number of times that string has been searched for) in order to show the most relevant one when searching. If you need to show, say, the k most relevant entries, a min-heap of size k can be used.
Case 1 - A letter is pressed for the first time:
Step (a): Read all the strings from the database or dictionary and store them in some data structure (say DS1), each with a FLAG_VALID (set to 1 initially) indicating that the string is still a valid candidate for the current search characters (for the first letter, all strings are valid).
As you read the strings, fill the min-heap according to their frequency; an element is inserted only when its frequency is greater than the current minimum (i.e. the frequency at the root of the min-heap).
Step (b) (this step is the same in every case, and shows the results): Output the elements in reverse heap order; the first element of the min-heap has the lowest priority, so delete the elements one by one and display them from last to first.
NOTE: The min-heap holds references to the strings, so a string and its frequency can be accessed together.
Case 2 - Typing further letters into the search box:
Step (a): Walk through DS1, in which all strings are present, and check FLAG_VALID first. If the string is still valid, compare it with the text in the search box, set the flag accordingly (1 for a match, 0 otherwise), and refill the k-element min-heap (which is empty after the previous display step) as in Case 1.
Step (b) is as usual.
Case 3 - Deleting a letter from the search box:
This is similar to the cases above, but this time you also need to re-check the strings whose FLAG_VALID is 0 (i.e. the strings that had become invalid).
This is a crude searching method and can be improved with better data structures and by tweaking the algorithm.
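If you are on .NET 6 or later, the built-in PriorityQueue<TElement, TPriority> is a min-heap and can serve as the size-k heap described above. A rough sketch follows; the Entry record and its fields are made up for illustration.

using System.Collections.Generic;

record Entry(string Text, int Frequency);

static class TopK
{
    public static List<Entry> MostFrequent(IEnumerable<Entry> candidates, int k)
    {
        // PriorityQueue dequeues the smallest priority first, i.e. it is a min-heap.
        var heap = new PriorityQueue<Entry, int>();
        foreach (var e in candidates)
        {
            if (heap.Count < k)
                heap.Enqueue(e, e.Frequency);
            else if (e.Frequency > heap.Peek().Frequency)
                heap.EnqueueDequeue(e, e.Frequency);   // replace the current minimum
        }
        // Pop from the heap and reverse so the highest frequency comes first.
        var result = new List<Entry>();
        while (heap.Count > 0)
            result.Add(heap.Dequeue());
        result.Reverse();
        return result;
    }
}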

Data structure for huge amount of records with ability to do "GetAllWithPrefix" fast

I need an in-memory data structure to store lots of values (~10M records) of type string with length of up to 100 characters (just in case it helps).
I'm going to run the following operations extensively:
Add
Get
Remove
GetAllWithPrefix(string prefix) - returns a list of all values corresponding to the prefix.
Obviously, each of the above operations needs to run in O(1) or O(log n).
I'm a bit rusty. What would be the best data structure for that? preferably direct me to the right class.
Thank you
What you want is a trie: a tree where each node holds a map from character to child node, so that a walk down the trie is a prefix walk. Certain nodes also carry a flag marking the end of a stored string, since a stored string may be a prefix of another added string and still have branches below its final node. A typical implementation has complexity: insertion O(w), lookup O(w), and prefix search O(w + n), where w is the length of the string and n is the total length of the words in the tree that have w as a prefix.
You can read about one c# implementation here
https://visualstudiomagazine.com/articles/2015/10/20/text-pattern-search-trie-class-net.aspx?m=1
Update
I want to clarify that the time complexities above are effectively O(1) in your particular case, since your string length has an upper limit of 100.
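As a rough sketch of how GetAllWithPrefix could look on top of a simple trie (the class and member names here are illustrative, not taken from the linked article):

using System.Collections.Generic;

class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string value)
    {
        var node = _root;
        foreach (char c in value)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
        }
        node.IsWord = true;
    }

    // O(w) to reach the prefix node, then a walk over the matching subtree.
    public IEnumerable<string> GetAllWithPrefix(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break;

        var stack = new Stack<(Node Node, string Value)>();
        stack.Push((node, prefix));
        while (stack.Count > 0)
        {
            var (current, value) = stack.Pop();
            if (current.IsWord)
                yield return value;
            foreach (var kv in current.Children)
                stack.Push((kv.Value, value + kv.Key));
        }
    }
}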

String similar to a set of strings

I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching).
For example:
{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" }
and
{ "Park", "AB Mann Inc.", "E. Bellini" }
Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.
I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.
I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.
Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?
Not a direct answer to the O(N^2) but a comment on the N1 algorithm.
That is sample data, but it is all clean, and it is not data I would use Levenshtein on. "Incriminate" would have a closer distance to "Incorporated" than "Inc." does, and "E." would not match well to "Enrique".
Levenshtein-distance is good at catching key entry errors.
It is also good for matching OCR.
If you have clean data I would go with stemming and other custom rules.
Porter stemmer is available for C# and if you have clean data
E.G.
remove . and other punctuation
remove stop words (the)
stem
parse each list once and assign an int value for each unique stem
do the match on int
still N^2 but now N1 is faster
you might add a rule that a single capital matching a word starting with that capital gets a partial score
also need to account for number of words
two groups of 5 that match on 3 should score higher than two groups of 10 that match on 4
I would create Int hashsets for each phrase and then intersect and count.
Not sure you can get out of N^2.
But I am suggesting you look at N1.
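A minimal sketch of the int-hashset comparison, assuming the stemming and stem-to-int mapping have already been done (all names here are made up); the score also accounts for the number of words, as suggested above:

using System.Collections.Generic;
using System.Linq;

static class PhraseMatcher
{
    // Score two phrases by the overlap of their stem-ID sets (a Dice-style score),
    // so that matching 3 of 5 words outweighs matching 4 of 10.
    public static double MatchScore(HashSet<int> a, HashSet<int> b)
    {
        if (a.Count == 0 || b.Count == 0) return 0;
        int common = a.Count(b.Contains);        // size of the intersection
        return 2.0 * common / (a.Count + b.Count);
    }
}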
Lucene is a library with phrase matching but it is not really set up for batches.
Create the index with the intent that it is used many times, so index search speed is optimized over index creation time.
In the given examples at least one word always matches. A possible approach could use a multimap (a dictionary able to store multiple entries per key) or a Dictionary<TKey, List<TValue>>. Each string from the first set would be split into single words. These words would be used as keys in the multimap, and the whole string would be stored as the value.
Now you can split the strings from the second set into single words and do an O(1) lookup for each word, i.e. an O(N) lookup for all the words. This yields a first raw result in which each match contains at least one matching word. Finally, you would refine this raw result by applying other rules (like searching for initials or abbreviated words).
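A rough sketch of that word index (the class and method names are made up):

using System;
using System.Collections.Generic;

static class WordIndex
{
    // word -> all phrases from the first set containing that word
    public static Dictionary<string, List<string>> Build(IEnumerable<string> phrases)
    {
        var index = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (var phrase in phrases)
        {
            foreach (var word in phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!index.TryGetValue(word, out var list))
                    index[word] = list = new List<string>();
                list.Add(phrase);
            }
        }
        return index;
    }

    // Raw candidate pairs: every phrase from the second set paired with every
    // indexed phrase that shares at least one word. Refine these afterwards.
    public static IEnumerable<(string Query, string Candidate)> RawMatches(
        Dictionary<string, List<string>> index, IEnumerable<string> queries)
    {
        foreach (var query in queries)
            foreach (var word in query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                if (index.TryGetValue(word, out var candidates))
                    foreach (var candidate in candidates)
                        yield return (query, candidate);
    }
}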
This problem, called "string similarity join," has been studied a lot recently in the research community. We released a source code package in C++ called Flamingo that implements such an algorithm http://flamingo.ics.uci.edu/releases/4.1/src/partenum/. We also have a Hadoop-based implementation at http://asterix.ics.uci.edu/fuzzyjoin/ if your data set is too large for a single machine.

.NET: How to efficiently check for uniqueness in a List<string> of 50,000 items?

In some library code, I have a List that can contain 50,000 items or more.
Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?
Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.
I will benchmark this, but interested in insight.
if I replace the List<> with a Dictionary<> , will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an O(n^2) operation.
EDIT
Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.
The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
A very basic micro-benchmark.
Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180
Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431
So, for strings of that length, HashSet is roughly 25x faster than List when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.
The results are interesting but not strictly valid. To get valid results, I'd need to do warmup intervals, multiple trials, and random selection of the implementation. But I feel confident that would move the bar only slightly.
Thanks everyone.
EDIT2
After adding randomization and multiple trials, HashSet consistently outperforms List in this case, by about 20x.
These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.
You should use the HashSet<T> class, which is specifically designed for what you're doing.
Use HashSet<string> instead of List<string>, then it should scale very well.
From my tests, HashSet<string> takes no time compared to List<string> :)
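For example, a minimal sketch of the uniqueness check (the UniqueStrings wrapper is a made-up name): HashSet<string>.Add returns false when the string is already present, so the duplicate check and the insert are a single O(1) operation.

using System;
using System.Collections.Generic;

class UniqueStrings
{
    private readonly HashSet<string> _items = new HashSet<string>(StringComparer.Ordinal);

    // Returns false (and adds nothing) if the string was already present.
    public bool TryAdd(string item) => _items.Add(item);

    public int Count => _items.Count;
}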
Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
Does the Contains(T) function not work for you?
I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character in the node. Please see http://en.wikipedia.org/wiki/Associative_arrays.
A similar data structure was devised by Aho and Corasick in 1973 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing; it matters more how long the strings are. If they are about the same length, then you will likely never see a slow-down in lookups, because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for than upon the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incur the cost of hashing the string (going to be O(m), m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), n = number of elements in the index.
edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.
edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.
