Multiple Lucene queries Wildcard Searches and Proximity matching

Multiple Lucene queries Wildcard Searches and Proximity matching - c#

I'm using Lucene to auto complete words in a search engine(RTL language) the auto complete function invoked after insertion of 3 letters.
I'd like to have a proximity matching to the 3 letters query before invoking the Wildcard function.
For example I'd like to make a sub-string search to my db only for the first 3 letters for every entry, with a proximity matching to this comparison.
presumably I'm looking for digger but I'd also like to have doggy in my results, so if I've entered
dig (the first 3 letters in the search engine) with a proximity matching equals to 1, digger and doggy would surface.
Can I do that?

You can use IndexReader's Terms methods to enumerate terms in the index. You can then use a custom function to calculate the distance between these terms and the text you search. I'll use Levenshtein distance for demo.
var terms = indexReader.ClosestTerms(field, "dig")
.OrderBy(t => t.Item2)
.Take(10)
.ToArray();
public static class LuceneUtils
{
public static IEnumerable<Tuple<string, int>> ClosestTerms(this IndexReader reader, string field, string text)
{
return reader.TermsStartingWith(field, text[0].ToString())
.Select(x => new Tuple<string, int>(x, LevenshteinDistance(x, text)));
}
public static IEnumerable<string> TermsStartingWith(this IndexReader reader, string field, string text)
{
using (var tEnum = reader.Terms(new Term(field, text)))
{
do
{
var term = tEnum.Term;
if (term == null) yield break;
if (term.Field != field) yield break;
if (!term.Text.StartsWith(text)) yield break;
yield return term.Text;
} while (tEnum.Next());
}
}
//http://www.dotnetperls.com/levenshtein
public static int LevenshteinDistance(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++) { }
for (int j = 0; j <= m; d[0, j] = j++) { }
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}

I suppose you can try doing it, you will just need to add search criteria of Wildcard after Proximity search.
Have you read this?
http://www.lucenetutorial.com/lucene-query-syntax.html
Also take a look at
Lucene proximity search with boundaries?

Related

Compare two different sized strings similarity

I have a string that is produced by code but it may not be correct.
So i have a user screen that user checks and changes it.
I have to let user to change maximum of 5 characters.
I need to check how many characters are changed by user
with comparing two strings.
length of strings may be different.
Thanx in advance. (language c#)

You could compute the Levenshtein Distance between the two strings, which returns the number of character edits (removals, inserts, replacements) that must occur to get from string A to string B.
public static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++);
for (int j = 0; j <= m; d[0, j] = j++);
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
Then handle it:
if (LevenshteinDistance.Compute(s1, s2) <= 5)
// Valid
else
// Invalid

Non exact match in XML

I've got problem. I have to equal value from XML with string, which is typed in textBox. What I have to do is make program more "inteligent" which means, if I type "kraków" instead of "Kraków", program should find the location anyway.
Sample of code:
public static IEnumerable<XElement> GetRowsWithColumn(IEnumerable<XElement> rows, String name, String value)
{
return rows
.Where(row => row.Elements("col")
.Any(col =>
col.Attributes("name").Any(attr => attr.Value.Equals(name))
&& col.Value.Equals(value)));
}
If I type "Kraków" then I get good response from XML, but when I type "kraków" there's no match. What should I do?
And if I can ask one more question, how can I prompts such as google have? If you type "progr" google shows you "programming" for example.

You could compare your values while you use
.ToUpper()
for your strings.
To get these prompts as google have, you could need regular expression.
For more see here:
Learning Regular Expressions

just make a function which compares the strings. you can use any criteria you want
...
col.Attributes("name").Any(attr => AreEquivelant(attr.Value, name))
...
private static bool AreEquivelant(string s1, string s2)
{
//compare the strings however you want
}

You are going to find a distance. A distance is the difference between two words. You can use Levenshtein for this one.
From Wikipedia :
In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
A basic usecase :
static void Main(string[] args)
{
Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg"));
Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg", true));
Console.ReadLine();
}
Output
3
2
Lower the value, better is the match.
For your example, you could use it and if the match is lower than something (like 2) you got a valid match.
I made one here :
Code :
public static int FindDistance(string s1, string s2, bool forceLowerCase = false)
{
if (String.IsNullOrEmpty(s1) || s1.Length == 0)
return String.IsNullOrEmpty(s2) ? s2.Length : 0;
if (String.IsNullOrEmpty(s2) || s2.Length == 0)
return String.IsNullOrEmpty(s1) ? s1.Length : 0;
// not in Levenshtein but I need it.
if (forceLowerCase)
{
s1 = s1.ToLowerInvariant();
s2 = s2.ToLowerInvariant();
}
int s1Len = s1.Length;
int s2Len = s2.Length;
int[,] d = new int[s1Len + 1, s2Len + 1];
for (int i = 0; i <= s1Len; i++)
d[i, 0] = i;
for (int j = 0; j <= s2Len; j++)
d[0, j] = j;
for (int i = 1; i <= s1Len; i++)
{
for (int j = 1; j <= s2Len; j++)
{
int cost = Convert.ToInt32(s1[i - 1] != s2[j - 1]);
int min = Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1);
d[i, j] = Math.Min(min, d[i - 1, j - 1] + cost);
}
}
return d[s1Len, s2Len];
}

Find closest match to input string in a list of strings

I have problems finding an implementation of closest match strings for .net
I would like to match a list of strings, example:
input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"
List of strings:
Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu
Szkoła Podstawowa Specjalna
Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu
Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym
This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".
What algorithms are there available for .net?

Edit distance
Edit distance is a way of quantifying how dissimilar two strings
(e.g., words) are to one another by counting the minimum number of
operations required to transform one string into the other.
Levenshtein distance
Informally, the Levenshtein distance between two words is the minimum
number of single-character edits (i.e. insertions, deletions or
substitutions) required to change one word into the other.
Fast, memory efficient Levenshtein algorithm
C# Levenshtein
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}

.NET does not supply anything out of the box - you need to implement a an Edit Distance algorithm yourself. For example, you can use Levenshtein Distance, like this:
// This code is an implementation of the pseudocode from the Wikipedia,
// showing a naive implementation.
// You should research an algorithm with better space complexity.
public static int LevenshteinDistance(string s, string t) {
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
;
for (int j = 0; j <= m; d[0, j] = j++)
;
for (int i = 1; i <= n; i++) {
for (int j = 1; j <= m; j++) {
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
Call LevenshteinDistance(targetString, possible[i]) for each i, then pick the string possible[i] for which LevenshteinDistance returns the smallest value.

Late to the party, but I had a similar requirement to #Ali123:
"ECM" is closer to "Open form for ECM" than "transcribe" phonetically
I found a simple solution that works for my use case, which is comparing sentences, and finding the sentence that has the most words in common:
public static string FindBestMatch(string stringToCompare, IEnumerable<string> strs) {
HashSet<string> strCompareHash = stringToCompare.Split(' ').ToHashSet();
int maxIntersectCount = 0;
string bestMatch = string.Empty;
foreach (string str in strs)
{
HashSet<string> strHash = str.Split(' ').ToHashSet();
int intersectCount = strCompareHash.Intersect(strCompareHash).Count();
if (intersectCount > maxIntersectCount)
{
maxIntersectCount = intersectCount;
bestMatch = str;
}
}
return bestMatch;
}

loose string search within an array with C#

lets say we have a
string[] array = {"telekinesis", "laureate", "Allequalsfive", "Indulgence"};
and we need to find a word within this array
normally we'd do following: (or use any similar method to find a string)
bool result = array.Contains("laureate"); // returns true
In my case, the word that I am searching for, may have errors in it (as the title suggests).
For example, I can't distinguish a difference between letters "I"(large "i") and "l"(small "L") and "1"(number one).
Is there any way how I can find a word such as "Allequalsfive" or "A11equalsfive" or "AIIequalsfive"? (loose search) Normally result will be "false".
If only I can specify to ignore some letters.. (the sequence is constant, other letters are constants).

With the help of extension methods & Levenshtein Distance algorithm
var array = new string[]{ "telekinesis", "laureate",
"Allequalsfive", "Indulgence" };
bool b = array.LooseContains("A11equalsfive", 2); //returns true
-
public static class UsefulExtensions
{
public static bool LooseContains(this IEnumerable<string> list, string word,int distance)
{
foreach (var s in list)
if (s.LevenshteinDistance(word) <= distance) return true;
return false;
}
//
//http://www.merriampark.com/ldcsharp.htm
//
public static int LevenshteinDistance(this string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
return m;
if (m == 0)
return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++){}
for (int j = 0; j <= m; d[0, j] = j++){}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (char.ToUpperInvariant(t[j - 1]) == char.ToUpperInvariant(s[i - 1])) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}

You can use the Contains overload that takes an IEqualityComparer<TSource>.
Implement your own equality comparer that ignores the letters you want and off you go.

if you only need to know if the word is loosely contained in your array, then you can just "clean" the letters you want to ignore (e.g. replace "1" by "l") in both your search word and array:
Func<string, string> clean = x => x.ToLower().Replace('1', 'l');
var array = (new string[] { "telekinesis", "laureate", "A11equalsfive", "Indulgence" }).Select(x => clean(x));
bool result = array.Contains(clean("allequalsfive"));
Otherwise you can look up the Where() LINQ keyword, which lets you filter an array based on a function that you specify.

Comparing strings with tolerance

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.
Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.

You could use the Levenshtein Distance algorithm.
"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com
This one is from dotnetperls.com:
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}
You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

There is nothing in the .NET framework that will help you with this out-of-the-box.
The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.
For example, it could be argued that the words sword and sord (yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).
That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.
The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.
Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.
It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).
Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).

Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudo code found in this wikipedia article under the "Iterative with two matrix rows" heading.
public static int LevenshteinDistance(string source, string target)
{
// degenerate cases
if (source == target) return 0;
if (source.Length == 0) return target.Length;
if (target.Length == 0) return source.Length;
// create two work vectors of integer distances
int[] v0 = new int[target.Length + 1];
int[] v1 = new int[target.Length + 1];
// initialize v0 (the previous row of distances)
// this row is A[0][i]: edit distance for an empty s
// the distance is just the number of characters to delete from t
for (int i = 0; i < v0.Length; i++)
v0[i] = i;
for (int i = 0; i < source.Length; i++)
{
// calculate v1 (current row distances) from the previous row v0
// first element of v1 is A[i+1][0]
// edit distance is delete (i+1) chars from s to match empty t
v1[0] = i + 1;
// use formula to fill in the rest of the row
for (int j = 0; j < target.Length; j++)
{
var cost = (source[i] == target[j]) ? 0 : 1;
v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
}
// copy v1 (current row) to v0 (previous row) for next iteration
for (int j = 0; j < v0.Length; j++)
v0[j] = v1[j];
}
return v1[target.Length];
}
Here is a function that will give you the percentage similarity.
/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
if ((source == null) || (target == null)) return 0.0;
if ((source.Length == 0) || (target.Length == 0)) return 0.0;
if (source == target) return 1.0;
int stepsToSame = LevenshteinDistance(source, target);
return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}

Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.

Here are two methods that calculate the Levenshtein Distance between strings.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.
/// <summary>
/// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
/// </summary>
/// <param name="first">The first string, used as a source.</param>
/// <param name="second">The second string, used as a target.</param>
/// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
/// <remarks>
/// From http://www.merriampark.com/ldcsharp.htm
/// </remarks>
public static int LevenshteinDistance(string first, string second)
{
if (first == null)
{
throw new ArgumentNullException("first");
}
if (second == null)
{
throw new ArgumentNullException("second");
}
int n = first.Length;
int m = second.Length;
var d = new int[n + 1, m + 1]; // matrix
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
d[i, j] = Math.Min(
Math.Min(
d[i - 1, j] + 1,
d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}

using System;
public class Example
{
public static int getEditDistance(string X, string Y)
{
int m = X.Length;
int n = Y.Length;
int[][] T = new int[m + 1][];
for (int i = 0; i < m + 1; ++i) {
T[i] = new int[n + 1];
}
for (int i = 1; i <= m; i++) {
T[i][0] = i;
}
for (int j = 1; j <= n; j++) {
T[0][j] = j;
}
int cost;
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
cost = X[i - 1] == Y[j - 1] ? 0: 1;
T[i][j] = Math.Min(Math.Min(T[i - 1][j] + 1, T[i][j - 1] + 1),
T[i - 1][j - 1] + cost);
}
}
return T[m][n];
}
public static double findSimilarity(string x, string y) {
if (x == null || y == null) {
throw new ArgumentException("Strings must not be null");
}
double maxLength = Math.Max(x.Length, y.Length);
if (maxLength > 0) {
// optionally ignore case if needed
return (maxLength - getEditDistance(x, y)) / maxLength;
}
return 1.0;
}
public static void Main()
{
string s1 = "Techie Delight";
string s2 = "Tech Delight";
double similarity = findSimilarity(s1, s2) * 100;
Console.WriteLine(similarity); // 85.71428571428571
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.