Optimising Searching of a 2-dimensional Array with LINQ - c#

I have a 2-dimensional array of objects (predominantly, but not exclusively, strings) that I want to filter by a string (sSearch) using LINQ. The following query works, but isn't as fast as I would like.
I have changed Count to Any, which led to a significant increase in speed, and replaced Contains with a regular expression that ignores case, thereby eliminating the call to ToLower. Combined, this has more than halved the execution time.
What is now very noticeable is that increasing the length of the search term from 1 to 2 letters triples the execution time, and there is another jump from 3 to 4 letters (~50% increase in execution time). While this is obviously not surprising, I wonder whether there is anything else that could be done to optimise the matching of strings?
Regex rSearch = new Regex(sSearch, RegexOptions.IgnoreCase);
rawData.Where(row => row.Any(column => rSearch.IsMatch(column.ToString())));
In this case the dataset has about 10k rows and 50 columns, but the size could vary fairly significantly.
Any suggestions on how to optimise this would be greatly appreciated.

One optimisation is to use Any instead of Count - that way as soon as one matching column has been found, the row can be returned.
rawData.Where(row => row.Any(column =>
    column.ToString().ToLower().Contains(sSearch)));
You should also be aware that ToLower is culture-sensitive. It may not be a problem in your case, but it's worth being aware of; ToLowerInvariant may be a better option for you. It's a shame there isn't an overload of Contains which lets you specify that you want a case-insensitive match...
EDIT: You're using a regular expression now - have you tried RegexOptions.Compiled? It may or may not help...
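For instance, a minimal sketch of the compiled variant (Regex.Escape is my addition, on the assumption that sSearch is a literal search term rather than a pattern):
Regex rSearch = new Regex(Regex.Escape(sSearch),
                          RegexOptions.IgnoreCase | RegexOptions.Compiled);
var result = rawData.Where(row => row.Any(column => rSearch.IsMatch(column.ToString())));
Compiled regexes cost more to construct but can match faster when the same pattern is reused across many rows, so it's worth measuring on your data.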

Related

Iterate over strings that ".StartsWith" without using LINQ

I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "#" and search a list of contacts for the string that follows the "#" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str)). The problem is that the number of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50,000? If you can limit your dropdown, you can, for example, just display the first 10:
var filteredMembers = new List<MemberClass>();
foreach (var member in Members)
{
    if (member.Username.StartsWith(str)) filteredMembers.Add(member);
    if (filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a trie, in addition to your collection. That should give you better performance than looping through all 50,000 elements.
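A minimal sketch of such a trie (class and member names are my own; storing the member list at every node trades memory for instant prefix retrieval):
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<MemberClass> Members = new List<MemberClass>(); // all members whose username passes through this node
}

class UsernameTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(MemberClass member)
    {
        var node = _root;
        foreach (var c in member.Username)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
            node.Members.Add(member);
        }
    }

    // All members whose username starts with prefix, in O(prefix length).
    public IReadOnlyList<MemberClass> StartsWith(string prefix)
    {
        var node = _root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return Array.Empty<MemberClass>();
        return node.Members;
    }
}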
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in a database, you can just use the StartsWith approach in the ORM, which translates to a LIKE query on the DB, which will just do its job. If you enable full-text indexing on the column, you can improve the performance even more.
Now suppose all the names are already in memory. Remember the CPU is extremely fast, so even looping through 50,000 entries takes just a few moments.
The StartsWith method is optimized and returns false as soon as it encounters a non-matching character, so finding the ones that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage: building the trie requires you to read all the names and create the whole data structure, which is complex. Also, you would be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can, however, group the names into "buckets". First start with the first character and create a dictionary with the character as the key and a list of names as the value. Now you have effectively narrowed every following search approximately 26 times (assuming the English alphabet). But you don't have to stop there: you can do the same at another level, for the second character within each group, and then the third, and so on.
With each level you narrow each group significantly, and the search will be much faster afterwards. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off: more work up-front means faster searches, less work up-front means slower searches.
Finally, as the user types, each new letter narrows the target group, so you can maintain the set of relevant names for the current input and cut it down with each successive keystroke. This prevents you from having to start from the beginning each time and improves the efficiency significantly.
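A one-level sketch of the bucket idea (assuming non-empty usernames, the Members/str names from the question, and System.Linq):
// Build once: group members by the first letter of their username.
var buckets = Members
    .GroupBy(m => char.ToLowerInvariant(m.Username[0]))
    .ToDictionary(g => g.Key, g => g.ToList());

// Per keystroke: jump to the right bucket, then scan only that group.
char first = char.ToLowerInvariant(str[0]);
var hits = buckets.TryGetValue(first, out var group)
    ? group.Where(m => m.Username.StartsWith(str, StringComparison.OrdinalIgnoreCase))
    : Enumerable.Empty<MemberClass>();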
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
    var startIndex = list.BinarySearch(firstFewLetters);

    // If negative, there was no exact match; the bitwise complement gives
    // the index of the closest match (the next larger element).
    if (startIndex < 0)
    {
        startIndex = ~startIndex;
    }

    // Take maxHits items, or go to the end of the list
    var endIndex = Math.Min(startIndex + maxHits - 1, list.Count - 1);

    // Enumerate matching items
    for (int i = startIndex; i <= endIndex; i++)
    {
        var s = list[i];
        if (!s.StartsWith(firstFewLetters)) break; // this line is optional
        yield return s;
    }
}
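For example (a hypothetical setup, assuming the usernames are already in memory):
var names = Members.Select(m => m.Username).OrderBy(s => s).ToList(); // sort once
foreach (var hit in Find(names, "Jo", 10))
    Console.WriteLine(hit); // up to 10 usernames starting with "Jo"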

Why Trie DataStructure when Dictionary Class can be used for string count from large files

Suppose I need to count words in a very large file (words are split by " ").
I would do the following:
Not load the entire file into memory; read the stream line by line.
For each line, split it into words and add each distinct word to a "dictionary" (I mean, use the Dictionary class in .NET) with its count.
Now to retrieve the most frequent word, sort the dictionary and get it.
But most solutions favor a Trie data structure for this; please clarify why (also, it would be great if someone could clarify why a hash table is not preferred over a Dictionary).
Thanks.
I can't help mentioning that not only is this a map-reduce problem, it's the map-reduce problem.
That aside, the reason you would use a trie implementation is for efficiency in looking up each word to increment its count (or for adding a word that does not yet exist in the trie). In a basic trie, the lookup time per word is O(n), where n is the number of characters in the word. Over an entire document, then, with no parallel processing, you would be looking at O(n) time just for lookups, where n is the number of characters in the document. Then, it would be (probably) a depth-first search to retrieve all the words so that you could extract the information you need. Worst-case performance of the depth-first search would be the same O(n), but the expected case would be better due to common prefixes.
If you use a different structure, such as the standard System.Collections.Generic.Dictionary<TKey, TValue>, that involves a hash lookup, the cost is related to the hash lookup and implementation as well as the prevalence of hash collisions. However, even that may not be the major part of the cost. Assume arguendo that the hash lookup is constant-time and trivial. Because equal hash codes do not guarantee equal strings, as the MSDN docs warn repeatedly, it is still necessary to compare strings for equality, which is almost certainly implemented as O(n), where n is the number of characters (for simplicity). So, depending on the implementations of the trie and some hash-lookup-based dictionary, the hash-lookup-based dictionary is likely no better than the trie, and it may well be worse.
One valid criticism of my analysis might be that the lookup at each node in the trie may not be constant-time; it would depend on the collection used to determine the edges to the succeeding nodes. However, a hash-lookup-based dictionary may work well here if we don't care about sorting the keys later. Hash collisions are unlikely when the input is one character, and equality comparisons would be much less involved than with full strings. The insert performance is likely reasonable as well, again depending on the implementation.
However, if you know you are going to determine the top n words by word count, you likely need to keep track of the top n word counts as you go in addition to keeping track of them in the trie. That way, you do not need to recompute the top n after populating the trie.
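For concreteness, a counting trie along the lines discussed might look like this (a sketch; class and member names are my own, and each Add is O(n) in the word's length):
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int Count; // occurrences of the word ending at this node
}

class WordTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.Count++; // one more occurrence of this word
    }
}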
You can use File.ReadLines, which enumerates the file lazily, similar to a StreamReader.
var mostFrequent = File.ReadLines("Path")
    .SelectMany(l => l.Split())        // also splits by tabs
    .GroupBy(word => word)
    .OrderByDescending(g => g.Count())
    .First();                          // or Take(10) if you want the top 10
Console.Write("Word:{0} Count:{1}", mostFrequent.Key, mostFrequent.Count());

Can you dynamically search for sequences within a string in c#?

First time asking a question on here.
I am looking for a search algorithm, or a built-in method, to dynamically search for repeating sequences within a string (or other variable).
The reason I say dynamic is that I want it to search through the string and locate repeating sequences on its own. I won't be able to supply a specific sequence to look for.
I am unsure if this is even possible, but if it is, all help would be appreciated!
Here is a basic visual representation of what I am looking for (mind you, this is not code, just a for instance of a string)
This is going to be a long string that will have sequences throughout it. This may have matching characters side by side or it may not, but regardless, this is going to be a long string. If this is going to be a long string, I need it to find these sequences throughout it on its own!
As you can see from the above example, there are 2 sets of matching sequences within the single string. If there is any way to identify these programmatically, and to search through the string very fast for these different patterns, it would help me significantly!
The matches will most likely be stored in a List / array for later use as well.
Thank you for any help you are able to provide!
Edit:
As this question was asked, case sensitivity will not be an issue.
When I was mentioning there were 2 matches, I meant that 2 particular sequences had a duplicate; one of them had 2 duplicates.
@HenkHolterman You are correct that this is going to be a compression algorithm; however, I was not sure where to start looking for the sequences that I will be matching.
I had been doing multiple searches for something similar to this, but was coming up short of the answers I was looking for. That is why my question was posed here the way it was.
Thank you for all the responses I have gotten so far though!
Here's the basic brute-force idea:
First, you find all repeating sequences of size 1 (you can change the minimum size to whatever you want).
To do this, you essentially go down the line and use a regex to find all of the Ts, then all of the hs, etc...
Then you find all sequences of size 2, so you'd find all of the Ths and the his and the iss.
You repeat this until you have found all of the sequences.
The runtime would be:
the time complexity to find a particular sequence with a regex: O(n)
times the number of different sequences of a particular size: O(n)
times the number of sizes: O(n)
So the total time complexity would be O(n^3).
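A brute-force sketch along those lines (using a dictionary of substring counts rather than per-size regexes, which keeps the same O(n^3) bound; requires System.Linq; names are my own):
static Dictionary<string, int> FindRepeatedSequences(string text, int minLength)
{
    var counts = new Dictionary<string, int>();
    // Enumerate every substring of every size...
    for (int length = minLength; length <= text.Length / 2; length++)
        for (int start = 0; start + length <= text.Length; start++)
        {
            var candidate = text.Substring(start, length);
            counts.TryGetValue(candidate, out var n);
            counts[candidate] = n + 1;
        }
    // ...and keep only the ones that actually repeat.
    return counts.Where(kv => kv.Value > 1)
                 .ToDictionary(kv => kv.Key, kv => kv.Value);
}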
Use a suffix tree to do this in O(n) time.

String similar to a set of strings

I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching).
For example:
{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" }
and
{ "Park", "AB Mann Inc.", "E. Bellini" }
Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.
I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.
I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.
Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?
Not a direct answer to the O(N^2) issue, but a comment on the cost of each individual comparison (call it N1).
That is sample data, but it is all clean; that is not data I would use Levenshtein on. "Incriminate" would have a closer distance to "Incorporated" than "Inc." would, and "E." would not match well to "Enrique".
Levenshtein distance is good at catching key-entry errors.
It is also good for matching OCR output.
If you have clean data I would go with stemming and other custom rules. The Porter stemmer is available for C#.
E.g.:
remove "." and other punctuation
remove stop words ("the")
stem
parse each list once and assign an int value to each unique stem
do the match on ints
It is still N^2, but now N1 is faster.
You might add a rule that a single capital matching a word that starts with that capital gets a partial score.
You also need to account for the number of words: two groups of 5 that match on 3 should score higher than two groups of 10 that match on 4.
I would create int HashSets for each phrase and then intersect and count (a sketch follows below).
Not sure you can get out of N^2, but I am suggesting you look at reducing N1.
Lucene is a library with phrase matching, but it is not really set up for batches: it creates its index with the intent that it will be used many times, so index search speed is optimized over index creation time.
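A sketch of the intersect-and-count step (assuming each phrase has already been reduced to a HashSet<int> of stem ids; the normalization implements the word-count rule above; requires System.Linq):
static double MatchScore(HashSet<int> a, HashSet<int> b)
{
    int overlap = a.Count(b.Contains); // size of the intersection
    // Normalize by phrase size so 3-of-5 outranks 4-of-10.
    return (double)overlap / Math.Max(a.Count, b.Count);
}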
In the given examples at least one word always matches. A possible approach could use a multimap (a dictionary able to store multiple entries per key), i.e. a Dictionary<TKey, List<TValue>>. Each string from the first set would be split into single words; these words would be used as keys in the multimap, and the whole string would be stored as the value.
Now you can split the strings from the second set into single words and do an O(1) lookup for each word, i.e. an O(N) lookup for all the words. This yields a first raw result, where each match contains at least one matching word. Finally, you would have to refine this raw result by applying other rules (like searching for initials or abbreviated words).
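A minimal sketch (firstSet, secondSet and the other names are my own; the refinement step is left as a comment; requires System.Linq):
// Build the word -> phrases multimap from the first set.
var index = new Dictionary<string, List<string>>();
foreach (var phrase in firstSet)
    foreach (var word in phrase.Split(' '))
    {
        if (!index.TryGetValue(word, out var phrases))
            index[word] = phrases = new List<string>();
        phrases.Add(phrase);
    }

// For each phrase in the second set, one O(1) lookup per word.
foreach (var phrase in secondSet)
{
    var candidates = phrase.Split(' ')
        .Where(index.ContainsKey)
        .SelectMany(w => index[w])
        .Distinct();
    // ...refine candidates here (initials, abbreviations, ...).
}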
This problem, called "string similarity join," has been studied a lot recently in the research community. We released a source code package in C++ called Flamingo that implements such an algorithm http://flamingo.ics.uci.edu/releases/4.1/src/partenum/. We also have a Hadoop-based implementation at http://asterix.ics.uci.edu/fuzzyjoin/ if your data set is too large for a single machine.

Creating a "spell check" that checks against a database with a reasonable runtime

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user input against a certain column in a table for all these records and return any matches with a certain hamming distance (again, this question's not about determining hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches are returned.
I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?
The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.
For what it's worth, I'm using NHibernate for data access.
I would appreciate any feedback on how I can do this or what my options are.
Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx. 300 times faster on a 20,000-term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and also better locality of reference, so fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to easily return all results that start with the provided string.
The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
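A sketch of the trie walk (after the well-known trie/Levenshtein write-ups; it assumes a TrieNode with a Children dictionary and an IsWord flag, and requires System.Linq for row.Min()):
static void Search(TrieNode node, char edge, string word, int[] prevRow,
                   int maxDist, string prefix, List<string> results)
{
    int cols = word.Length + 1;
    var row = new int[cols];
    row[0] = prevRow[0] + 1; // one more deletion than the parent row
    for (int i = 1; i < cols; i++)
    {
        int insert = row[i - 1] + 1;
        int delete = prevRow[i] + 1;
        int replace = prevRow[i - 1] + (word[i - 1] == edge ? 0 : 1);
        row[i] = Math.Min(insert, Math.Min(delete, replace));
    }
    if (node.IsWord && row[cols - 1] <= maxDist)
        results.Add(prefix); // a dictionary word within the distance budget
    if (row.Min() <= maxDist) // prune: no cell in a deeper row can drop below this minimum
        foreach (var kv in node.Children)
            Search(kv.Value, kv.Key, word, row, maxDist, prefix + kv.Key, results);
}
The caller seeds prevRow with 0..word.Length at the root and invokes Search once per child edge, passing that edge's character as both edge and the initial prefix.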
As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.
I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:
1 insertion (208)
aexample
bexample
cexample
...
examplex
exampley
examplez
1 deletion (7)
xample
eample
exmple
...
exampl
1 substitution (182)
axample
bxample
cxample
...
examplz
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
Notice how most misspellings occur by doing the same operation with a different character:
1 insertion (8)
?example
e?xample
ex?ample
exa?mple
exam?ple
examp?le
exampl?e
example?
1 deletion (7)
xample
eample
exmple
exaple
examle
exampe
exampl
1 substitution (7)
?xample
e?ample
ex?mple
exa?ple
exam?le
examp?e
exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
Example: User enters exaple (notice missing m).
SELECT DISTINCT word
FROM dictionary
WHERE hint = '?exaple'
OR hint = 'e?xaple'
OR hint = 'ex?aple'
OR hint = 'exa?ple'
OR hint = 'exap?le'
OR hint = 'exapl?e'
OR hint = 'exaple?'
OR hint = 'xaple'
OR hint = 'eaple'
OR hint = 'exple'
OR hint = 'exale'
OR hint = 'exape'
OR hint = 'exapl'
OR hint = '?xaple'
OR hint = 'e?aple'
OR hint = 'ex?ple'
OR hint = 'exa?le'
OR hint = 'exap?e'
OR hint = 'exapl?'
exaple with 1 insertion == exa?ple == example with 1 substitution
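A sketch of generating the hints in C# (the method name is my own; it mirrors the three lists above):
static IEnumerable<string> Hints(string word)
{
    for (int i = 0; i <= word.Length; i++) // 1 insertion: a '?' in each gap
        yield return word.Substring(0, i) + "?" + word.Substring(i);
    for (int i = 0; i < word.Length; i++)  // 1 deletion: drop each character
        yield return word.Remove(i, 1);
    for (int i = 0; i < word.Length; i++)  // 1 substitution: '?' replaces each character
        yield return word.Substring(0, i) + "?" + word.Substring(i + 1);
}
Store Hints(word) for every dictionary word, and at query time run the same method over the user's input to build the WHERE clause.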
See also: How does the Google “Did you mean?” Algorithm work?
"it loads all records from a user-specified table (or tables) into memory and then performs the check"
Don't do that. Either:
do the matching on the back end and only return the results you need, or
cache the records into memory early on, take the working-set hit, and do the check when you need it.
You will need to structure your data differently than a database can. Build a custom search tree, with all the dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast: O(n log n), if I recall correctly.
Have a look at BK-Trees
Also, instead of using the Hamming distance, consider the Levenshtein distance
Regarding the answer you marked as correct:
Note: when I say "dictionary" in this post, I mean a hash map; basically, a Python dictionary.
Another way you can improve its performance is by creating an inverted index of words.
Rather than calculating the edit distance against the whole DB, you create 26 dictionaries, one per letter (the English alphabet has 26 letters, so the keys are "a", "b", ..., "z").
Assume you have the word "apple" in your DB:
in the "a" dictionary you add the word "apple"
in the "p" dictionary you add the word "apple"
in the "l" dictionary you add the word "apple"
in the "e" dictionary you add the word "apple"
Do this for all the words in the dictionary.
Now when a misspelled word is entered, let's say "aplse":
you start with "a" and retrieve all the words in "a"
then you take "p" and find the intersection of the words in "a" and "p"
then "l", and find the intersection of the words in "a", "p" and "l"
and you do this for all the letters.
In the end you will have just the bunch of words made up of the letters "a", "p", "l", "s", "e". In the next step, you calculate the edit distance between the input word and only that bunch of words, drastically reducing your run time.
There might be a case when nothing is returned; for something like "aklse" there is a good chance that no word is made of just these letters. In that case you start relaxing the above step until a finite number of words is left. For example, start with *klse (the intersection of the words under k, l, s, e), giving k1 results; then a*lse (the intersection of the words under a, l, s, e), giving k2; and so on. Choose the variant that returns the higher number of words. There is really no single answer here, as a lot of words might have the same edit distance; you can just say that if the edit distance is greater than some "k", there is no good match.
There are many sophisticated algorithms built on top of this, for example applying statistical inference after these steps (the probability that the word is "apple" when the input is "aplse", and so on). Then you go the machine-learning way :)
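A sketch of the letter-bucket index (names are my own; it assumes the word list is already in memory and requires System.Linq):
// Build once: one bucket per distinct letter.
var index = new Dictionary<char, HashSet<string>>();
foreach (var word in allWords)
    foreach (var letter in word.Distinct())
    {
        if (!index.TryGetValue(letter, out var set))
            index[letter] = set = new HashSet<string>();
        set.Add(word);
    }

// Query: intersect the buckets for the input's letters.
IEnumerable<string> candidates = null;
foreach (var letter in "aplse".Distinct())
{
    if (!index.TryGetValue(letter, out var bucket))
    {
        candidates = Enumerable.Empty<string>(); // a letter no word contains
        break;
    }
    candidates = candidates == null ? bucket : candidates.Intersect(bucket);
}
// ...then compute the edit distance only against candidates.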
