Loose string search within an array with C#

Let's say we have a
string[] array = {"telekinesis", "laureate", "Allequalsfive", "Indulgence"};
and we need to find a word within this array.
Normally we'd do the following (or use a similar method to find a string):
bool result = array.Contains("laureate"); // returns true
In my case, the word I am searching for may have errors in it (as the title suggests).
For example, I can't distinguish between the letters "I" (capital "i"), "l" (lowercase "L") and "1" (the digit one).
Is there any way to find a word such as "Allequalsfive" when it is written as "A11equalsfive" or "AIIequalsfive" (a loose search)? Normally the result would be "false".
If only I could specify some letters to ignore... (the sequence is constant, and the other letters are constant).

With the help of extension methods and the Levenshtein distance algorithm:
var array = new string[]{ "telekinesis", "laureate",
"Allequalsfive", "Indulgence" };
bool b = array.LooseContains("A11equalsfive", 2); //returns true
-
public static class UsefulExtensions
{
public static bool LooseContains(this IEnumerable<string> list, string word,int distance)
{
foreach (var s in list)
if (s.LevenshteinDistance(word) <= distance) return true;
return false;
}
//
//http://www.merriampark.com/ldcsharp.htm
//
public static int LevenshteinDistance(this string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
return m;
if (m == 0)
return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++){}
for (int j = 0; j <= m; d[0, j] = j++){}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (char.ToUpperInvariant(t[j - 1]) == char.ToUpperInvariant(s[i - 1])) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}

You can use the Contains overload that takes an IEqualityComparer<TSource>.
Implement your own equality comparer that ignores the letters you want and off you go.
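A minimal sketch of such a comparer, assuming the only confusable characters are "I", "l" and "1" as in the question. The class name LookalikeComparer and its Normalize helper are made up for illustration; the comparison is otherwise case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: treats 'I', 'l' and '1' as the same character.
public sealed class LookalikeComparer : IEqualityComparer<string>
{
    // Map every look-alike character to one representative ('l') before comparing.
    private static string Normalize(string s) =>
        s.Replace('I', 'l').Replace('1', 'l');

    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;
        return string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);
    }

    // Equal strings must produce equal hash codes, so hash the normalized form.
    public int GetHashCode(string s) => s == null ? 0 : Normalize(s).GetHashCode();
}
```

With the array from the question, array.Contains("A11equalsfive", new LookalikeComparer()) should then return true, because both strings normalize to the same text.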

If you only need to know whether the word is loosely contained in your array, you can just "clean" the letters you want to ignore (e.g. replace "1" with "l") in both your search word and the array:
Func<string, string> clean = x => x.ToLower().Replace('1', 'l');
var array = (new string[] { "telekinesis", "laureate", "A11equalsfive", "Indulgence" }).Select(x => clean(x));
bool result = array.Contains(clean("allequalsfive"));
Otherwise you can look at the LINQ Where() method, which lets you filter an array with a predicate function that you specify.

Related

Non exact match in XML

I've got a problem. I have to compare a value from XML with a string typed into a textBox. I need to make the program more "intelligent": if I type "kraków" instead of "Kraków", the program should still find the location.
Sample of code:
public static IEnumerable<XElement> GetRowsWithColumn(IEnumerable<XElement> rows, String name, String value)
{
return rows
.Where(row => row.Elements("col")
.Any(col =>
col.Attributes("name").Any(attr => attr.Value.Equals(name))
&& col.Value.Equals(value)));
}
If I type "Kraków" I get a good response from the XML, but when I type "kraków" there's no match. What should I do?
One more question, if I may: how can I show suggestions like Google does? If you type "progr", Google suggests "programming", for example.
You could compare your values after calling
.ToUpper()
on both strings.
To get suggestions like Google's, you may need regular expressions.
For more, see:
Learning Regular Expressions
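For simple "starts with" suggestions, a regular expression may not be needed. The sketch below uses a plain case-insensitive prefix check instead (a different technique from the one suggested above); the Suggestions class and ForPrefix method are hypothetical names, and the candidate list is assumed to be already loaded in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Suggestions
{
    // Returns the candidates that start with the typed prefix, ignoring case.
    public static List<string> ForPrefix(IEnumerable<string> candidates, string prefix) =>
        candidates
            .Where(c => c.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .ToList();
}
```

The same StringComparison.OrdinalIgnoreCase option can be passed to string.Equals for the exact-match case, which avoids allocating upper-cased copies.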
Just write a function that compares the strings. You can use any criteria you want:
...
col.Attributes("name").Any(attr => AreEquivalent(attr.Value, name))
...
private static bool AreEquivalent(string s1, string s2)
{
// Compare the strings however you want; for example, ignoring case:
return string.Equals(s1, s2, StringComparison.OrdinalIgnoreCase);
}
What you need is a distance: a measure of the difference between two words. You can use Levenshtein for this.
From Wikipedia:
In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
A basic usecase :
static void Main(string[] args)
{
Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg"));
Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg", true));
Console.ReadLine();
}
Output
3
2
The lower the value, the better the match.
For your example, you could treat any distance below a threshold (such as 2) as a valid match.
I made one here:
Code:
public static int FindDistance(string s1, string s2, bool forceLowerCase = false)
{
// If one string is null or empty, the distance is the length of the other.
if (String.IsNullOrEmpty(s1))
return String.IsNullOrEmpty(s2) ? 0 : s2.Length;
if (String.IsNullOrEmpty(s2))
return s1.Length;
// not in Levenshtein but I need it.
if (forceLowerCase)
{
s1 = s1.ToLowerInvariant();
s2 = s2.ToLowerInvariant();
}
int s1Len = s1.Length;
int s2Len = s2.Length;
int[,] d = new int[s1Len + 1, s2Len + 1];
for (int i = 0; i <= s1Len; i++)
d[i, 0] = i;
for (int j = 0; j <= s2Len; j++)
d[0, j] = j;
for (int i = 1; i <= s1Len; i++)
{
for (int j = 1; j <= s2Len; j++)
{
int cost = Convert.ToInt32(s1[i - 1] != s2[j - 1]);
int min = Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1);
d[i, j] = Math.Min(min, d[i - 1, j - 1] + cost);
}
}
return d[s1Len, s2Len];
}

Multiple Lucene queries Wildcard Searches and Proximity matching

I'm using Lucene to autocomplete words in a search engine (for an RTL language). The autocomplete function is invoked after 3 letters have been entered.
I'd like to apply proximity matching to the 3-letter query before invoking the wildcard function.
For example, I'd like to run a substring search against my db on only the first 3 letters of every entry, with proximity matching applied to that comparison.
Say I'm looking for digger but I'd also like to have doggy in my results: if I've entered
dig (the first 3 letters in the search engine) with proximity matching equal to 1, both digger and doggy would surface.
Can I do that?
You can use IndexReader's Terms methods to enumerate terms in the index. You can then use a custom function to calculate the distance between these terms and the text you search. I'll use Levenshtein distance for demo.
var terms = indexReader.ClosestTerms(field, "dig")
.OrderBy(t => t.Item2)
.Take(10)
.ToArray();
public static class LuceneUtils
{
public static IEnumerable<Tuple<string, int>> ClosestTerms(this IndexReader reader, string field, string text)
{
return reader.TermsStartingWith(field, text[0].ToString())
.Select(x => new Tuple<string, int>(x, LevenshteinDistance(x, text)));
}
public static IEnumerable<string> TermsStartingWith(this IndexReader reader, string field, string text)
{
using (var tEnum = reader.Terms(new Term(field, text)))
{
do
{
var term = tEnum.Term;
if (term == null) yield break;
if (term.Field != field) yield break;
if (!term.Text.StartsWith(text)) yield break;
yield return term.Text;
} while (tEnum.Next());
}
}
//http://www.dotnetperls.com/levenshtein
public static int LevenshteinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++) { }
for (int j = 0; j <= m; d[0, j] = j++) { }
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
I suppose you can try doing it, you will just need to add search criteria of Wildcard after Proximity search.
Have you read this?
http://www.lucenetutorial.com/lucene-query-syntax.html
Also take a look at
Lucene proximity search with boundaries?

Find closest match to input string in a list of strings

I have problems finding an implementation of closest match strings for .net
I would like to match a list of strings, example:
input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"
List of strings:
Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu
Szkoła Podstawowa Specjalna
Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu
Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym
This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".
What algorithms are there available for .net?
Edit distance
Edit distance is a way of quantifying how dissimilar two strings
(e.g., words) are to one another by counting the minimum number of
operations required to transform one string into the other.
Levenshtein distance
Informally, the Levenshtein distance between two words is the minimum
number of single-character edits (i.e. insertions, deletions or
substitutions) required to change one word into the other.
Fast, memory efficient Levenshtein algorithm
C# Levenshtein
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}
.NET does not supply anything out of the box; you need to implement an edit distance algorithm yourself. For example, you can use Levenshtein distance, like this:
// This code is an implementation of the pseudocode from the Wikipedia,
// showing a naive implementation.
// You should research an algorithm with better space complexity.
public static int LevenshteinDistance(string s, string t) {
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
;
for (int j = 0; j <= m; d[0, j] = j++)
;
for (int i = 1; i <= n; i++) {
for (int j = 1; j <= m; j++) {
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
Call LevenshteinDistance(targetString, possible[i]) for each i, then pick the string possible[i] for which LevenshteinDistance returns the smallest value.
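The selection step described above could be sketched like this. The ClosestMatch class and its Find method are illustrative names, not part of any library, and the distance function here is a two-row variant of the same algorithm so that the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosestMatch
{
    // Levenshtein distance keeping only the previous and current rows.
    public static int Distance(string s, string t)
    {
        var prev = new int[t.Length + 1];
        var curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp; // reuse the two buffers
        }
        return prev[t.Length];
    }

    // Returns the candidate with the smallest distance to target,
    // or null when there are no candidates. Ties go to the earliest candidate.
    public static string Find(string target, IEnumerable<string> candidates) =>
        candidates.OrderBy(c => Distance(target, c)).FirstOrDefault();
}
```

OrderBy is a stable sort, so when two candidates are equally close the one that appears first in the list wins. For long lists it may be worth tracking the minimum in a single loop instead of sorting.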
Late to the party, but I had a similar requirement to @Ali123:
"ECM" is closer to "Open form for ECM" than "transcribe" phonetically
I found a simple solution that works for my use case, which is comparing sentences, and finding the sentence that has the most words in common:
public static string FindBestMatch(string stringToCompare, IEnumerable<string> strs) {
HashSet<string> strCompareHash = stringToCompare.Split(' ').ToHashSet();
int maxIntersectCount = 0;
string bestMatch = string.Empty;
foreach (string str in strs)
{
HashSet<string> strHash = str.Split(' ').ToHashSet();
int intersectCount = strCompareHash.Intersect(strHash).Count();
if (intersectCount > maxIntersectCount)
{
maxIntersectCount = intersectCount;
bestMatch = str;
}
}
return bestMatch;
}

Damerau–Levenshtein distance algorithm, disable counting of delete

How can I disable the counting of deletions in this implementation of the Damerau-Levenshtein distance algorithm? If another algorithm that already does this exists, please point me to it.
Example (with deletion counting disabled):
string1: how are you?
string2: how oyu?
distance: 1 (for the transposition; the 4 deletions don't count)
And here is the algorithm:
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer_transpositionRow
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;
var del = d[im1, j] + 1;
var ins = d[i, jm1] + 1;
var sub = d[im1, jm1] + cost;
//Math.Min is slower than native code
//d[i, j] = Math.Min(del, Math.Min(ins, sub));
d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);
if (d[i, j] < minDistance)
minDistance = d[i, j];
}
if (minDistance > threshold)
return int.MaxValue;
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
public static int DamerauLevenshteinDistance( string string1
, string string2
, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
// WRONG FOR YOUR NEEDS:
// if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
// return (string1 ?? "").Length + (string2 ?? "").Length;
//DO IT THIS WAY:
if (String.IsNullOrEmpty(string1))
// First string is empty, so every character of
// String2 has been inserted:
return (string2 ?? "").Length;
if (String.IsNullOrEmpty(string2))
// Second string is empty, so every character of string1
// has been deleted, but you dont count deletions:
return 0;
// DO NOT SWAP THE STRINGS IF YOU WANT TO DEAL WITH INSERTIONS
// IN A DIFFERENT MANNER THEN WITH DELETIONS:
// THE FOLLOWING IS WRONG FOR YOUR NEEDS:
// // Ensure string2 (inner cycle) is longer_transpositionRow
// if (string1.Length > string2.Length)
// {
// var tmp = string1;
// string1 = string2;
// string2 = tmp;
// }
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
//all changes are insertions
return string2.Length - string1.Length;
// REVERSE CASE: STRING2 IS CONTAINED WITHIN STRING1
if (string1.Contains(string2))
//all changes are deletions which you don't count:
return 0;
var length1 = string1.Length;
var length2 = string2.Length;
// PAY ATTENTION TO THIS CHANGE!
// length1+1 rows is way too much! You need only 3 rows (0, 1 and 2)
// read my explanation below the code!
// TOO MANY ROWS: var d = new int[length1 + 1, length2 + 1];
var d = new int[3, length2 + 1];
// THIS INITIALIZATION COUNTS DELETIONS. YOU DONT WANT IT
// for (var i = 0; i <= d.GetUpperBound(0); i++)
// d[i, 0] = i;
// But you must initiate the first element of each row with 0:
for (var i = 0; i <= 2; i++)
d[i, 0] = 0;
// This initialization counts insertions. You need it, but for
// better consistency of code I call the variable j (not i):
for (var j = 0; j <= d.GetUpperBound(1); j++)
d[0, j] = j;
// Now do the job:
// for (var i = 1; i <= d.GetUpperBound(0); i++)
for (var i = 1; i <= length1; i++)
{
// Here in this for-loop: add "%3" to every term
// that is used as first index of d!
var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;
// DON'T COUNT DELETIONS! var del = d[im1, j] + 1;
var ins = d[i % 3, jm1] + 1;
var sub = d[im1 % 3, jm1] + cost;
// Math.Min is slower than native code
// d[i, j] = Math.Min(del, Math.Min(ins, sub));
// DEL DOES NOT EXIST
// d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
d[i % 3, j] = ins <= sub ? ins : sub;
if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i % 3, j] = Math.Min(d[i % 3, j], d[im2 % 3, jm2] + cost);
if (d[i % 3, j] < minDistance)
minDistance = d[i % 3, j];
}
if (minDistance > threshold)
return int.MaxValue;
}
return d[length1 % 3, d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[length1 % 3, d.GetUpperBound(1)];
}
Here is my explanation of why you need only 3 rows:
Look at this line:
var d = new int[length1 + 1, length2 + 1];
If one string has length n and the other has length m, then your code needs space for (n+1)*(m+1) integers. Each integer needs 4 bytes. This is a waste of memory if your strings are long. If both strings are 35,000 characters long, you will need more than 4 GB of memory!
In this code you calculate and write a new value for d[i,j]. To do this, you read values from its left neighbor (d[i,jm1]), from its upper neighbor (d[im1,j]), from its upper-left neighbor (d[im1,jm1]) and finally from its double-upper-double-left neighbor (d[im2,jm2]). So you only need values from the current row and the 2 rows before it.
You never need values from any other row, so why store them? Three rows are enough, and my changes make sure that you can work with these 3 rows without reading a wrong value at any time.
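To show the three-row idea in isolation, here is a sketch of the ordinary (deletion-counting) distance with adjacent transpositions, sometimes called optimal string alignment, using row index i % 3. It is not the modified algorithm above: it has no threshold and counts deletions normally. The class and method names are made up for this example.

```csharp
using System;

public static class RollingRows
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance
    // that stores only three rows of the table at any time.
    public static int OsaDistance(string a, string b)
    {
        int n = a.Length, m = b.Length;
        var d = new int[3, m + 1];
        for (int j = 0; j <= m; j++) d[0, j] = j; // row 0: j insertions

        for (int i = 1; i <= n; i++)
        {
            int cur = i % 3;
            int prev = (i - 1) % 3;
            int prev2 = (i + 1) % 3; // same slot as (i - 2) % 3, never negative
            d[cur, 0] = i;           // column 0: i deletions

            for (int j = 1; j <= m; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(d[prev, j] + 1, d[cur, j - 1] + 1),
                    d[prev, j - 1] + cost);

                // Adjacent transposition reads the row two steps back.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    best = Math.Min(best, d[prev2, j - 2] + 1);

                d[cur, j] = best;
            }
        }
        return d[n % 3, m];
    }
}
```

Memory use is 3 * (m + 1) integers regardless of the length of the first string, which is the point of the explanation above.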
I would advise not rewriting this specific algorithm to handle specific cases of "free" edits. Many of them radically simplify the concept of the problem to the point where the metric will not convey any useful information.
For example, when substitution is free the distance between all strings is the difference between their lengths. Simply transmute the smaller string into the prefix of the larger string and add the needed letters. (You can guarantee that there is no smaller distance because one insertion is required for each character of edit distance.)
When transposition is free the question reduces to determining the sum of differences of letter counts. (Since the distance between all anagrams is 0, sorting the letters in each string and exchanging out or removing the non-common elements of the larger string is the best strategy. The mathematical argument is similar to that of the previous example.)
In the case when insertion and deletion are both free, the edit distance between any two strings is zero. If only insertion OR deletion is free, this breaks the symmetry of the distance metric: with free deletions, the distance from a to aa is 1, while the distance from aa to a is 0. Depending on the application this could be desirable, but I'm not sure if it's something you're interested in. You will need to greatly alter the presented algorithm because it assumes, as mentioned, that one string is always longer than the other.
Try to change var del = d[im1, j] + 1; to var del = d[im1, j];, I think that solves your problem.
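The asymmetry mentioned above, and the effect of a zero-cost deletion term, can be seen in a plain Levenshtein variant without transpositions or a threshold. This is only a sketch under that simplification; FreeDeleteDistance and Compute are hypothetical names.

```csharp
using System;

public static class FreeDeleteDistance
{
    // Edit distance from source to target where deleting a character
    // of source costs 0; insertions and substitutions cost 1.
    public static int Compute(string source, string target)
    {
        var prev = new int[target.Length + 1];
        var curr = new int[target.Length + 1];
        for (int j = 0; j <= target.Length; j++) prev[j] = j; // insertions still count

        for (int i = 1; i <= source.Length; i++)
        {
            curr[0] = 0; // dropping the whole source prefix is free
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                int del = prev[j];          // free deletion of source[i - 1]
                int ins = curr[j - 1] + 1;
                int sub = prev[j - 1] + cost;
                curr[j] = Math.Min(del, Math.Min(ins, sub));
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[target.Length];
    }
}
```

Compute("a", "aa") gives 1 while Compute("aa", "a") gives 0, so the argument order matters and the strings must not be swapped by length.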

Sort String Array using Levenshtein Algorithm results

I've been working on an Access file editor in C#, and I've been trying to add a search feature to my program. So far, I have the database file populate a 2D array, which I then use to populate a ListView box in another window. From this new window, I would like to be able to search each entry by model number. I've managed to incorporate the Levenshtein algorithm, which seems useful. I can get the algorithm to compute the distance between each entry and the search keyword and store that value in another integer array. I can also sort the results in increasing order.
However, my current problem is that I'd like the model numbers sorted in the same order as their Levenshtein distance values, so that the most relevant result becomes the first choice in the ListView box. Any ideas?
Here's what I've got so far:
private void OnSearch(object sender, System.EventArgs e)
{
string a;
string b;
int[] result = new int[1000];
int[] sorted = new int[1000];
for (int i = 0; i < rowC; i++)
{
a = PartNum[i]; // Array to search
b = SearchBox1.Text; // keyword to search with
if (GetDistance(a, b) == 0)
{
return;
}
result[i] = GetDistance(a, b); //add each distance result into array
}
int index;
int x;
for (int j = 1; j < rowC; j++) //quick insertion sort
{
index = result[j];
x = j;
while ((x > 0) && (result[x - 1] > index))
{
result[x] = result[x - 1];
x = x - 1;
}
result[x] = index;
}
}
public static int GetDistance(string s, string t)
{
if (String.IsNullOrEmpty(s) || String.IsNullOrEmpty(t))
{
MessageBox.Show("Please enter something to search!!");
return 0;
}
int n = s.Length;
int m = t.Length;
if (n == 0)
{
return m;
}
else if (m == 0)
{
return n;
}
int[] p = new int[n + 1];
int[] d = new int[n + 1];
int[] _d;
char t_j;
int cost;
for (int i = 0; i <= n; i++)
{
p[i] = i;
}
for (int j = 1; j <= m; j++)
{
t_j = t[j - 1];
d[0] = j;
for (int i = 1; i <= n; i++)
{
cost = (s[i - 1] == t_j) ? 0 : 1;
d[i] = Math.Min(Math.Min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
}
_d = p;
p = d;
d = _d;
}
return p[n];
}
Do you have LINQ available to you? If so:
var ordered = PartNum.OrderBy(x => GetDistance(x, SearchBox1.Text))
.ToList();
// Do whatever with the ordered list
Note that this has the disadvantage of not aborting early if you find an exact match, as well as not making the actual distances available - but it's not entirely clear how you're using the results anyway...
Another option would be:
var ordered = (from word in PartNum
let distance = GetDistance(word, SearchBox1.Text)
orderby distance
select new { word, distance }).ToList();
Then you've got the distance as well.
To sort your array by Levenshtein distance, you need to make the model numbers part of the array so that, when you sort by distance, the model numbers go along for the ride.
To do this, create a class representing each part:
public class Part
{
public string PartNumber;
public int LevensteinDistance;
}
and then create an array of Part:
Part[] parts;
You can then reference each element like so:
parts[n].LevensteinDistance
parts[n].PartNumber
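Once each Part carries its own distance, sorting the array could look like the sketch below. PartSorter and SortByDistance are illustrative names; the Part class is repeated only so the example compiles on its own, and it assumes the distances have already been filled in.

```csharp
using System;

public class Part
{
    public string PartNumber;
    public int LevensteinDistance;
}

public static class PartSorter
{
    // Sorts in place so the smallest distance (best match) comes first.
    public static void SortByDistance(Part[] parts) =>
        Array.Sort(parts, (a, b) => a.LevensteinDistance.CompareTo(b.LevensteinDistance));
}
```

Array.Sort is not a stable sort, so parts with equal distances may end up in any relative order; if that matters, LINQ's OrderBy keeps the original order for ties.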
