How to check a partial similarity of two strings in C#

How to check a partial similarity of two strings in C# - c#

Is there any function in C# that check the % of similarity of two strings?
For example i have:
var string1="Hello how are you doing";
var string2= " hi, how are you";
and the
function(string1, string2)
will return similarity ratio because the words "how", "are", "you" are present in the line.
Or even better, return me 60% of similarity because "how", "are", "you" is a 3/5 of string1.
Does any function exist in C# which do that?

A common measure for similarity of strings is the so-called Levenshtein distance or edit distance. In this approach, a certain defined set of edit operation is defined. The Levenshtein distance is the minimum number of edit steps which is necessary to obtain the second string from the first. Closely related is the Damerau-Levenshtein distance, which uses a different set of edit operations.
Algorithmically, the Levenshtein distance can be calculated using Dynamic programming, which can be considered efficient. However, note that this approach does not actually take single words into account and cannot directly express the similarity in percent.

Now i am going to risk a -1 here for my suggestions, but in situations where you are trying to get something which is close but not so complex, then there is a lot of simpler solutions then the Levenshtein distance, which is perfect if you need exakt results and have time to code it.
If you are a bit looser concerning the accuracy, then i would follow this simple rules:
compare literal first (strSearch == strReal) - if match exit
convert search string and real string to lowercase
remove vowels and other chars from strings [aeiou-"!]
now you have two converted strings. your search string:
mths dhlgrn mtbrn
and your real string to compare to
rstrnt mths dhlgrn
compare the converted strings, if they match exit
split only the search strings by its words either with simple split function or using Regular Expressions \W+
calculate the virtual value (weight) of one part by dividing 100 by the number of parts - in this case 33
compare each part of the search string with the
real string, if it is contained, and add the value for each match to your total weight. In this case we have three elements and two matches so the result is 66 - so 66% match
This method is simple and extendable to go more and more in detail, actually you could use steps 1-7 and if step 7 returns anything above 50% then you figure you have a match, and otherwise you use more complex calculations.
ok, now don't -1 me too fast, because other answers are perfect, this is just a solution for lazy developers and might be of value there, where the result fulfills the expectations.

You can create a function that splits both strings into arrays, and then iterate over one of them to check if the word exists in the other one.
If you want percentage of it you would have to count total amount of words and see how many are similar and create a number based on that.

Related

Search algorithm for partial words in C#

I'm currently looking for a way to realize a partial word pattern algorithm in C#. The situation I'm in looks like follows:
I got a textfield for the search pattern. Every time the user enters or deletes a char in this field, an event triggers which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", ""Faceless Face" (whatever that should be) or in generally ANY real life sentences as strings,
the algorithm would first start running when typing "f" in the field. It then show the most relevant String on top of a list the strings are in. The second time it runs when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the textfield and the list is sorted again.
However I don't know what algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein-Distance algorithm as well as an self-made algorithm, which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
count = Regex.Matches(Regex.Escape(title), pattern).Count;
priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense since the strings are not the same length most of the time
The answer from Alain works fine, but only if at least one word completely matches (you only find a most relevant string/sentence when at least one word is equal with the pattern, so if you have "face" typed and there's a string containing the word "facebook", the string containing "facebook" is almost never a top priority
What other ideas could I try? The goal would be to sort the list of strings the best possible way in the earliest moment (with the fewest letters).
You can look at my implementations in the search-* branches of my repository on http://github.com/croemheld/sprung) in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.

First of all you need to store frequency related to a string(number of times a particular string is searched) in some place to show most relevant one when searched. If you need to show say k most relevant entries so a Min Heap of size 'k' can be implemented.
Case 1- If a letter is pressed for the first time:-
Step (a) Read all the string starting from a Data base or dictionary and store in some data structure(Say DS1) with a FLAG_VALID(set to 1 initially) which shows that it is valid string for the present search characters(for first letter all the strings will be valid).
As you read strings fill the Min Heap according to their Frequency and an element with certain frequency is inserted only when its frequency is greater than minimum one(i.e. the first element of min Heap).
Step (b) (This step is same for all case to show result) To show results you need to show elements in reverse order than Min Heap i.e. first element in Min Heap will have least priority, so basically we need to delete all elements one by one and show it from last to first.
NOTE:- Min Heap will contain reference to a particular string and so the string and its frequency can be accessed at the same time.
Case 2- Inserting next letters in search box:
Step (a) Search through DS1 in which all strings are present and check FLAG_VALID first. If it is a valid string than compare the string from search box and the string from DS1. Set the flag accordingly(if it is a match-1 or not-0) and fill k-Min Heap as it is empty from last search as in Case 1.
Step (b) is as usual.
Case 3- Deleting a letter in search box:
It is similar to above cases but this time we will need to search for those strings also whose FALG_VALID is 0(i.e string which are invalid).
This is a crude searching method and can be improved using certain Data structure and tweaking the algorithm.

String similar to a set of strings

I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching).
For example:
{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" }
and
{ "Park", "AB Mann Inc.", "E. Bellini" }
Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.
I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.
I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.
Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?

Not a direct answer to the O(N^2) but a comment on the N1 algorithm.
That is sample data but it is all clean. That is not data that I would use Levenstien on. Incriminate would have closer distance to Incorporated than Inc. E. would not match well to Enrique.
Levenshtein-distance is good at catching key entry errors.
It is also good for matching OCR.
If you have clean data I would go with stemming and other custom rules.
Porter stemmer is available for C# and if you have clean data
E.G.
remove . and other punctuation
remove stop words (the)
stem
parse each list once and assign an int value for each unique stem
do the match on int
still N^2 but now N1 is faster
you might add in a single cap the matches a word that start with cap gets a partial score
also need to account for number of words
two groups of 5 that match of 3 should score higher then two groups of 10 that match on 4
I would create Int hashsets for each phrase and then intersect and count.
Not sure you can get out of N^2.
But I am suggesting you look at N1.
Lucene is a library with phrase matching but it is not really set up for batches.
Create the index with the intent it is used many time so index search speed is optimized over index creation time.

In the given examples at least one word is always matching. A possible approach could use a multimap (a dictionary being able to store multiple entries per key) or a Dictionary<TKey,List<TVlaue>>. Each string from the first set would be splitted into single words. These words would be used as key in the multimap and the whole string would be stored as value.
Now you can split strings from the second set into single words and do an O(1) lookup for each word, i.e. an O(N) lookup for all the words. This yields a first raw result, where each match contains at least one matching word. Finally you would have to refine this raw result by applying other rules (like searching for initials or abbreviated words).

This problem, called "string similarity join," has been studied a lot recently in the research community. We released a source code package in C++ called Flamingo that implements such an algorithm http://flamingo.ics.uci.edu/releases/4.1/src/partenum/. We also have a Hadoop-based implementation at http://asterix.ics.uci.edu/fuzzyjoin/ if your data set is too large for a single machine.

Creating a character variation algorithm for a synonym table

I have a need to create a variation/synonym table for a client who needs to make sure if someone enters an incorrect variable, we can return the correct part.
Example, if we have a part ID of GRX7-00C. When the client enters this into a part table, they would like to automatically create a variation table that will store variations that this product could be. Like GBX7-OOC (letter O instead of number 0). Or if they have the number 1, to be able to use L or I.
So if we have part GRL8-OOI we could have the following associated to it in the variation table:
GRI8-OOI
GRL8-0OI
GRL8-O0I
GRL8-OOI
etc....
I currently have a manual entry for this, but there could be a ton of variations of these parts. So, would anyone have a good idea at how I can create a automatic process for this?
How can I do this in C# and/or SQL?

I'm not a C# programmer, but for other .NET languages it would make more sense to me to create a list of CHARACTERS that are similar, and group those together, and use RegEx to evaluate if it matches.
i.e. for your example:
Original:
GRL8-001
Regex-ploded:
GR(l|L|1)(8|b|B)-(0|o|O)(0|o|O)(1|l|L)
You could accomplish this by having a table of interchangeable characters and running a replace function to sub the RegEx for the character automatically.

Lookex function psuedocode (works like soundex but for look alike instead of sound alike)
string input
for each char c
if c in "O0Q" c = 'O'
else if c in "IL1" c = 'I'
etc.
compute a single Lookex code and store that with each product id. If user's entry doesn't match a product id, compute the Lookex code on their entry and search for all products having that code (there could be more than 1). This would consume minimal space, and be quite fast with a single index, and inexpensive to compute as well.

Given your input above, what I would do is not store a table of synonyms, but instead, have a set of rules checked against a master dictionary. So for example, if the user types in a value that is not found in the dictionary, change O to 0, and check for that existing in the dictionary. Change GR to GB and check for that. Etc. All the variations they want to allow described above can be explained as rules that you can apply one at a time or in combination and check if the resulting entry exists. That way you do not have to have a massive dictionary of synonyms to maintain and update.

I wouldn't go the synonym route at all.
I would cleanse all values in the database using a standard rule set.
For every value that exists, replace all '0's with 'O's, strip out dashes etc, so that for each real value you have only one modified value and store that in a seperate field\table.
Then I would cleanse the input the same way, and do a two-part match. Check the actual input string against the actual database values(this will get you exact matches), and secondly check the cleansed input against the cleansed values. Then order the output against the actual database values using a distance calc such as Levenshtein Distance to get the most likely match.
Now for the input:
GRL8-OO1
With parts:
GRL8-00I & GRL8-OOI
These would all normalize to the same value GRL8OOI, though the distance match would be closer for GRL8-OOI, so that would be your closest bet.
Granted this dramatically reduces the "uniqueness" of your part numbers, but the combo of the two-part match and the Levenshtein should get you what you are looking for.
There are several T-SQL implementations of Levenshtein available

Creating a "spell check" that checks against a database with a reasonable runtime

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user input against a certain column in a table for all these records and return any matches with a certain hamming distance (again, this question's not about determining hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches are returned.
I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?
The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.
For what it's worth, I'm using NHibernate for data access.
I would appreciate any feedback on how I can do this or what my options are.

Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as psuedocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference so have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to return all results that start with the provided string easily.
The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.

As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.

I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:
1 insertion (208)
aexample
bexample
cexample
...
examplex
exampley
examplez
1 deletion (7)
xample
eample
exmple
...
exampl
1 substitution (182)
axample
bxample
cxample
...
examplz
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
Notice how most misspellings occur by doing the same operation with a different character:
1 insertion (8)
?example
e?xample
ex?ample
exa?mple
exam?ple
examp?le
exampl?e
example?
1 deletion (7)
xample
eample
exmple
exaple
examle
exampe
exampl
1 substitution (7)
?xample
e?ample
ex?mple
exa?ple
exam?le
examp?e
exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
Example: User enters exaple (notice missing m).
SELECT DISTINCT word
FROM dictionary
WHERE hint = '?exaple'
OR hint = 'e?xaple'
OR hint = 'ex?aple'
OR hint = 'exa?ple'
OR hint = 'exap?le'
OR hint = 'exapl?e'
OR hint = 'exaple?'
OR hint = 'xaple'
OR hint = 'eaple'
OR hint = 'exple'
OR hint = 'exale'
OR hint = 'exape'
OR hint = 'exapl'
OR hint = '?xaple'
OR hint = 'e?aple'
OR hint = 'ex?ple'
OR hint = 'exa?le'
OR hint = 'exap?e'
OR hint = 'exapl?'
exaple with 1 insertion == exa?ple == example with 1 substitution
See also: How does the Google “Did you mean?” Algorithm work?

it loads all records from a user-specified table (or tables) into memory and then performs the check
don't do that
Either
Do the match match on the back end
and only return the results you need.
or
Cache the records into memory early
on a take the working set hit and do
the check when you need it.

You will need to structure your data differently than a database can. Build a custom search tree, with all dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast. O(nlogn) if I recall correctly.
Have a look at BK-Trees
Also, instead of using the Hamming distance, consider the Levenshtein distance

The answer you marked as correct..
Note: when i say dictionary.. in this post, i mean hash map .. map..
basically i mean a python dictionary
Another way you can improve its performance by creating an inverted index of words.
So rather than calculating the edit distance against whole db, you create 26 dictionary.. each has a key an alphabet. so english language has 26 alphabets.. so keys are "a","b".. "z"
So assume you have word in your db "apple"
So in the "a" dictionary : you add the word "apple"
in the "p" dictionary: you add the word "apple"
in the "l" dictionary: you add the word "apple"
in the "e" dictionary : you add the word "apple"
So, do this for all the words in the dictionary..
Now when the misspelled word is entered..
lets say aplse
you start with "a" and retreive all the words in "a"
then you start with "p" and find the intersection of words between "a" and "p"
then you start with "l" and find the intersection of words between "a", "p" and "l"
and you do this for all the alphabetss.
in the end you will have just the bunch of words which are made of alphabets "a","p","l","s","e"
In the next step, you calculate the edit distance between the input word and the bunch of words returned by the above steps.. thus drastically reducing your run time..
now there might be a case when nothing might be returned..
so something like "aklse".. there is a good chance that there is no word which is made of just these alphabets..
In this case, you will have to start reversing the above step to a stage where you have finite numbers of word left.
So somethng like start with *klse (intersection between words k, l,s,e) num(wordsreturned) =k1
then a*lse( intersection between words a,l,s,e)... numwords = k2
and so on..
choose the one which have higher number of words returned.. in this case, there is really no one answer.. as a lot of words might have same edit distance.. you can just say that if editdistance is greater than "k" then there is no good match...
There are many sophisticated algorithms built on top of this..
like after these many steps, use statistical inferences (probability the word is "apple" when the input is "aplse".. and so on) Then you go machine learning way :)

How do I determine if two similar band names represent the same band?

I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something simlar in c#?
UPDATE: In case anyone is interested in a c# example there is a heap you can access by doing a google code search for Levenshtein distance

The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.

I did something like this a while ago, I used the the Discogs database (which is public domain), which also tracks artist aliases;
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance) solution is that you'll get a lot less false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower is better matches, Pig and Whistle and Pig & Whistle has a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler & less error-phone; after implementing this, I could completely remove the solution that was suggested in the other answer & had better matches.

soundex may also be useful

In bioinformatics we use this to compare DNA- or protein sequences all the time.
There are plenty of algorithms, you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.