Non exact match in XML

I have a problem. I need to compare a value from XML with a string typed into a textBox. I want to make the program more "intelligent": if I type "kraków" instead of "Kraków", the program should still find the location.
Sample code:
public static IEnumerable<XElement> GetRowsWithColumn(IEnumerable<XElement> rows, String name, String value)
{
    return rows
        .Where(row => row.Elements("col")
            .Any(col =>
                col.Attributes("name").Any(attr => attr.Value.Equals(name))
                && col.Value.Equals(value)));
}
If I type "Kraków" I get the expected result from the XML, but when I type "kraków" there is no match. What should I do?
One more question: how can I offer suggestions like Google does? If you type "progr", Google suggests "programming", for example.

You could normalize both strings with
.ToUpper()
before comparing them.
To get suggestions like Google's, you may need regular expressions.
For more, see:
Learning Regular Expressions
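As a rough sketch of the same idea without changing the strings themselves: the query from the question can compare the value with StringComparison.OrdinalIgnoreCase, and a simple prefix check can produce "progr" -> "programming" style suggestions. The class name XmlLookup, the Suggest helper and its max parameter are made up for this illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlLookup
{
    // Same query as in the question, but the column value is compared ignoring case.
    public static IEnumerable<XElement> GetRowsWithColumn(
        IEnumerable<XElement> rows, string name, string value)
    {
        return rows.Where(row => row.Elements("col").Any(col =>
            col.Attributes("name").Any(attr => attr.Value == name)
            && string.Equals(col.Value, value, StringComparison.OrdinalIgnoreCase)));
    }

    // Hypothetical helper: returns the known values that start with
    // what the user has typed so far, ignoring case.
    public static IEnumerable<string> Suggest(
        IEnumerable<string> knownValues, string typed, int max = 5)
    {
        return knownValues
            .Where(v => v.StartsWith(typed, StringComparison.OrdinalIgnoreCase))
            .OrderBy(v => v, StringComparer.OrdinalIgnoreCase)
            .Take(max);
    }
}
```

With this, searching for "kraków" should find a column whose value is "Kraków". Typos such as "Krakow" still need a distance-based comparison.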

Just write a function that compares the strings. You can use any criteria you want.
...
col.Attributes("name").Any(attr => AreEquivalent(attr.Value, name))
...
private static bool AreEquivalent(string s1, string s2)
{
    // compare the strings however you want; ignoring case is one option
    return string.Equals(s1, s2, StringComparison.OrdinalIgnoreCase);
}

You need a distance, i.e. a number that says how different two words are. You can use the Levenshtein distance for this.
From Wikipedia:
In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
A basic use case:
static void Main(string[] args)
{
    Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg"));
    Console.WriteLine(Levenshtein.FindDistance("Alois", "aloisdg", true));
    Console.ReadLine();
}
Output:
3
2
The lower the value, the better the match.
For your example, you could compute the distance and treat anything below a threshold (such as 2) as a valid match.
Here is my implementation:
public static int FindDistance(string s1, string s2, bool forceLowerCase = false)
{
    // If one string is null or empty, the distance is the length of the other.
    if (String.IsNullOrEmpty(s1))
        return String.IsNullOrEmpty(s2) ? 0 : s2.Length;
    if (String.IsNullOrEmpty(s2))
        return s1.Length;
    // Not part of Levenshtein, but I need it.
    if (forceLowerCase)
    {
        s1 = s1.ToLowerInvariant();
        s2 = s2.ToLowerInvariant();
    }
    int s1Len = s1.Length;
    int s2Len = s2.Length;
    int[,] d = new int[s1Len + 1, s2Len + 1];
    for (int i = 0; i <= s1Len; i++)
        d[i, 0] = i;
    for (int j = 0; j <= s2Len; j++)
        d[0, j] = j;
    for (int i = 1; i <= s1Len; i++)
    {
        for (int j = 1; j <= s2Len; j++)
        {
            int cost = Convert.ToInt32(s1[i - 1] != s2[j - 1]);
            int min = Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1);
            d[i, j] = Math.Min(min, d[i - 1, j - 1] + cost);
        }
    }
    return d[s1Len, s2Len];
}
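To show how the threshold idea could be applied to a list of locations, here is a minimal sketch. It carries its own small case-insensitive distance function so that it stands alone; the ClosestMatch class, the Find method and the default maxDistance of 2 are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClosestMatch
{
    // Plain Levenshtein distance on lower-cased copies of the inputs.
    public static int Distance(string a, string b)
    {
        a = (a ?? "").ToLowerInvariant();
        b = (b ?? "").ToLowerInvariant();
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Returns the candidate closest to the input, or null when even the
    // best candidate is further away than maxDistance.
    public static string Find(IEnumerable<string> candidates, string input, int maxDistance = 2)
    {
        var best = candidates
            .Select(c => new { Value = c, Distance = Distance(c, input) })
            .OrderBy(x => x.Distance)
            .FirstOrDefault();
        return best != null && best.Distance <= maxDistance ? best.Value : null;
    }
}
```

For example, ClosestMatch.Find(new[] { "Kraków", "Warszawa", "Gdańsk" }, "krakow") should return "Kraków", because only the "ó" differs. The right threshold depends on how long your strings are, so try it on your own data.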

Related

fuzzy matching word on OCR page

I have a static phrase that I am searching for in an OCR'd image.
string KeywordToFind = "Account Number";
string OcrPageText = @"
GEORGIA
POWER
A SOUTHERN COMPANY
AecountNumber
122- 493
Pagel of2
Please Pay By
Jan 29,2014
Total Due
39.11
";
How can I find the word "AecountNumber" using my keyword "Account Number"?
I have tried variations of the Levenshtein distance algorithm (linked here) with varied success. I've also tried regexes, but the OCR often converts the text differently each time, which renders the regex useless.
Any suggestions? I can provide more code if the link doesn't give enough information. Thanks!

Why not try something mostly arbitrary, like this? While it would certainly match a lot more than just "Account Number", the chances of the start and end characters existing elsewhere in that order are pretty slim.
A.?c.?.?nt ?N.?[mn]b.?r
http://regex101.com/r/zV1yM2
It'll match things like:
Account Number
AccntNumbr
Aecnt Nunber
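A small sketch of how that pattern could be run against the OCR text in C#, using System.Text.RegularExpressions. The OcrKeywordFinder class and its Find method are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OcrKeywordFinder
{
    // The pattern from the answer above, unchanged (case-sensitive).
    private static readonly Regex AccountNumber =
        new Regex(@"A.?c.?.?nt ?N.?[mn]b.?r");

    // Returns the text that matched, or null if the page has no match.
    public static string Find(string ocrPageText)
    {
        Match m = AccountNumber.Match(ocrPageText);
        return m.Success ? m.Value : null;
    }
}
```

On the page text from the question, Find should return "AecountNumber". Each OCR variant you want to tolerate has to be anticipated in the pattern, which is the main limitation of this approach.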

I answered my own question by using substrings. Posting in case others run into the same type of problem. A little unorthodox, but it works great for me.
decimal PossibleStringLength = PossibleString.Length; // length of the string to search
decimal StaticTextLength = StaticText.Length; // length of the text to search for
// Number of errors allowed for the given ErrorAllowance percentage.
// Dividing by 100m avoids integer division if ErrorAllowance is an int.
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);
int TextLengthBuffer = (int)StaticTextLength - 1; // start with one character fewer than the text should have
int LowestLevenshteinNumber = 999999; // initialize to an insanely high maximum
// Look for the best match with one character fewer than expected, then with the expected number of characters,
// and last with one character more. (This is because one letter can be recognized as
// two (W -> VV) and vice versa.)
for (int i = 0; i < 3; i++)
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = PossibleString.Substring(e - TextLengthBuffer, TextLengthBuffer);
        int lAllowance = (int)Math.Round((possibleResult.Length - StaticTextLength) + NumberOfErrorsAllowed, MidpointRounding.AwayFromZero);
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = new StaticTextResult { text = possibleResult, errors = lNumber };
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}
public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }
    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}

Fuzzy matching multiple words in string

I'm trying to use the Levenshtein distance to find fuzzy keywords (static text) on an OCR page.
To do this, I want to specify the percentage of errors that are allowed (say, 15%).
string Keyword = "past due electric service";
Since the keyword is 25 characters long, I want to allow for 4 errors (25 * 0.15, rounded up).
I need to be able to compare it to...
string Entire_OCR_Page = @"previous bill amount payment received on 12/26/13 thank
you! current electric service total balances unpaid 7
days after the total due date are subject to a late
charge of 7.5% of the amount due or $2.00, whichever/5
greater. ";
This is how I am doing it now...
int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202
int NumberOfErrorsAllowed = 4;
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205
Clearly, Keyword does not appear in Entire_OCR_Page, so it should not be reported as found. But the Levenshtein distance (202) is below the allowance (205), so my logic says it was found.
Does anyone know of a better way to do this?

I think it's not working because a large chunk of your string is matched. What I'd do is split your Keyword into separate words.
Then find all the places where those words are matched in your OCR text.
Then look at those places and see whether 4 of them are consecutive and together match the original phrase.
I'm not sure whether my explanation is clear.
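One way to read that suggestion, as a rough sketch rather than a finished solution: split both the page and the keyword into words, slide a window of as many words as the keyword has across the page, and compare each window to the keyword. The PhraseFinder class, the FindPhrase method and its errorPercent parameter are hypothetical names. This simplified version also assumes that the OCR keeps the word boundaries intact, which is not always true.

```csharp
using System;

public static class PhraseFinder
{
    // Standard Levenshtein distance (case-sensitive).
    public static int Levenshtein(string s, string t)
    {
        var d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // Returns the first run of consecutive page words whose distance to the
    // keyword is within the allowed error percentage, or null if none is.
    public static string FindPhrase(string page, string keyword, double errorPercent)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] pageWords = page.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] keyWords = keyword.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (keyWords.Length == 0) return null;

        string normalizedKeyword = string.Join(" ", keyWords);
        // e.g. 25 characters at 15% gives 3.75, rounded up to 4 allowed errors
        int allowed = (int)Math.Ceiling(normalizedKeyword.Length * errorPercent / 100.0);

        for (int start = 0; start + keyWords.Length <= pageWords.Length; start++)
        {
            string window = string.Join(" ", pageWords, start, keyWords.Length);
            if (Levenshtein(normalizedKeyword, window) <= allowed)
                return window;
        }
        return null;
    }
}
```

Because every window is about as long as the keyword, the length difference no longer swamps the error allowance the way it does when the keyword is compared against the whole page.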

Multiple Lucene queries Wildcard Searches and Proximity matching

I'm using Lucene to autocomplete words in a search engine (for an RTL language). The autocomplete function is invoked after 3 letters have been entered.
I'd like to apply proximity matching to the 3-letter query before invoking the wildcard search.
For example, I'd like to run a substring search against my DB on only the first 3 letters of every entry, with proximity matching applied to that comparison.
Say I'm looking for digger, but I'd also like to have doggy in my results. If I've entered
dig (the first 3 letters in the search engine) with a proximity of 1, both digger and doggy should surface.
Can I do that?

You can use IndexReader's Terms methods to enumerate the terms in the index. You can then use a custom function to calculate the distance between these terms and the text you are searching for. I'll use the Levenshtein distance for this demo.
var terms = indexReader.ClosestTerms(field, "dig")
    .OrderBy(t => t.Item2)
    .Take(10)
    .ToArray();
public static class LuceneUtils
{
    public static IEnumerable<Tuple<string, int>> ClosestTerms(this IndexReader reader, string field, string text)
    {
        return reader.TermsStartingWith(field, text[0].ToString())
            .Select(x => new Tuple<string, int>(x, LevenshteinDistance(x, text)));
    }
    public static IEnumerable<string> TermsStartingWith(this IndexReader reader, string field, string text)
    {
        using (var tEnum = reader.Terms(new Term(field, text)))
        {
            do
            {
                var term = tEnum.Term;
                if (term == null) yield break;
                if (term.Field != field) yield break;
                if (!term.Text.StartsWith(text)) yield break;
                yield return term.Text;
            } while (tEnum.Next());
        }
    }
    //http://www.dotnetperls.com/levenshtein
    public static int LevenshteinDistance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];
        // Step 1
        if (n == 0) return m;
        if (m == 0) return n;
        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++) { }
        for (int j = 0; j <= m; d[0, j] = j++) { }
        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

I suppose you can try it; you would just need to add the wildcard criterion after the proximity search.
Have you read this?
http://www.lucenetutorial.com/lucene-query-syntax.html
Also take a look at
Lucene proximity search with boundaries?
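To make the "dig should also surface doggy" requirement concrete, here is a library-free sketch that does not use any Lucene API. It compares only the first few letters of each term with the query and keeps the terms within a given number of edits. PrefixFuzzyFilter, Match and maxEdits are made-up names, and terms is assumed to be any sequence of strings (for instance, terms enumerated from the index).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFuzzyFilter
{
    // Standard Levenshtein distance (case-sensitive).
    private static int Levenshtein(string s, string t)
    {
        var d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // First len characters of s, or all of s if it is shorter.
    private static string Prefix(string s, int len)
    {
        return s.Length <= len ? s : s.Substring(0, len);
    }

    // Keeps the terms whose prefix (as long as the query) is within
    // maxEdits edits of the query, closest prefixes first.
    public static IEnumerable<string> Match(IEnumerable<string> terms, string query, int maxEdits)
    {
        return terms
            .Select(t => new { Term = t, Distance = Levenshtein(Prefix(t, query.Length), query) })
            .Where(x => x.Distance <= maxEdits)
            .OrderBy(x => x.Distance)
            .Select(x => x.Term);
    }
}
```

With the terms digger, doggy and cat, the query "dig" and maxEdits of 1, this keeps digger (prefix "dig", distance 0) and doggy (prefix "dog", distance 1) and drops cat.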

loose string search within an array with C#

Let's say we have
string[] array = {"telekinesis", "laureate", "Allequalsfive", "Indulgence"};
and we need to find a word within this array.
Normally we'd do the following (or use any similar method to find a string):
bool result = array.Contains("laureate"); // returns true
In my case, the word that I am searching for may have errors in it (as the title suggests).
For example, I can't distinguish between the letters "I" (capital "i"), "l" (lowercase "L"), and "1" (the number one).
Is there any way I can find a word such as "Allequalsfive" when it is written as "A11equalsfive" or "AIIequalsfive" (a loose search)? Normally the result would be "false".
If only I could specify that some letters should be ignored... (the sequence is constant, and the other letters are constant).

With the help of extension methods and the Levenshtein distance algorithm:
var array = new string[]{ "telekinesis", "laureate",
    "Allequalsfive", "Indulgence" };
bool b = array.LooseContains("A11equalsfive", 2); //returns true
public static class UsefulExtensions
{
    public static bool LooseContains(this IEnumerable<string> list, string word, int distance)
    {
        foreach (var s in list)
            if (s.LevenshteinDistance(word) <= distance) return true;
        return false;
    }
    //
    //http://www.merriampark.com/ldcsharp.htm
    //
    public static int LevenshteinDistance(this string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];
        // Step 1
        if (n == 0)
            return m;
        if (m == 0)
            return n;
        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++) { }
        for (int j = 0; j <= m; d[0, j] = j++) { }
        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (char.ToUpperInvariant(t[j - 1]) == char.ToUpperInvariant(s[i - 1])) ? 0 : 1;
                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

You can use the Contains overload that takes an IEqualityComparer<TSource>.
Implement your own equality comparer that ignores the letters you want and off you go.
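A comparer along those lines might look like the sketch below. LookAlikeComparer is a name invented for this example, and the rule it applies (treat capital "I", lowercase "l" and the digit "1" as the same character, and ignore case for everything else) is just one assumption about which letters should be interchangeable.

```csharp
using System;
using System.Collections.Generic;

// Treats 'I', 'l' and '1' as the same character; other letters are compared ignoring case.
public sealed class LookAlikeComparer : IEqualityComparer<string>
{
    private static string Normalize(string s)
    {
        if (s == null) return null;
        char[] chars = s.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            char c = chars[i];
            // Map the look-alikes onto one representative, lower-case the rest.
            chars[i] = (c == 'I' || c == 'l' || c == '1') ? 'l' : char.ToLowerInvariant(c);
        }
        return new string(chars);
    }

    public bool Equals(string x, string y)
    {
        return string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);
    }

    // Equal strings must produce equal hash codes, so hash the normalized form.
    public int GetHashCode(string s)
    {
        return s == null ? 0 : Normalize(s).GetHashCode();
    }
}
```

It can then be passed to the overload mentioned above, e.g. array.Contains("A11equalsfive", new LookAlikeComparer()). Note that under this rule a capital "I" never matches a lowercase "i", which may or may not be what you want.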

If you only need to know whether the word is loosely contained in your array, you can just "clean" the letters you want to ignore (e.g. replace "1" with "l") in both your search word and the array:
Func<string, string> clean = x => x.ToLower().Replace('1', 'l');
var array = (new string[] { "telekinesis", "laureate", "A11equalsfive", "Indulgence" }).Select(x => clean(x));
bool result = array.Contains(clean("allequalsfive"));
Otherwise, you can look at the LINQ Where() method, which lets you filter an array based on a function that you specify.

Comparing strings with tolerance

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy, of course, but I want my program to tolerate spelling mistakes, missing parts of the string, and so on.
Is there a framework that can perform such a search? I have in mind a search algorithm that returns a few results ordered by match percentage or something similar.

You could use the Levenshtein Distance algorithm.
"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com
This one is from dotnetperls.com:
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];
        // Step 1
        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }
        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }
        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }
        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}
class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}
You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.
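For illustration, here is a minimal sketch of the restricted variant of Damerau-Levenshtein, often called optimal string alignment. It is not the implementation linked above, and the class name OsaDistance is made up. It is the Levenshtein recurrence with one extra case for swapping two adjacent characters.

```csharp
using System;

public static class OsaDistance
{
    // Optimal string alignment distance: insertions, deletions, substitutions,
    // plus transposition of two adjacent characters.
    public static int Compute(string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");
        int n = s.Length, m = t.Length;
        var d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;
        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
                // Adjacent transposition, e.g. "lo" <-> "ol"
                if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        }
        return d[n, m];
    }
}
```

With this variant, "flomax" and "folmax" are one edit apart (a single swap), whereas plain Levenshtein counts two substitutions. The unrestricted Damerau-Levenshtein distance needs a slightly more involved algorithm.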

Nothing in the .NET Framework will help you with this out of the box.
The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not its correct spelling.
For example, it could be argued that the words sword and sord (yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).
That being said, there are a number of algorithms that you can use to translate words (even misspelled ones) into phonetic variants.
The first is Soundex. It's fairly simple to implement, there are a fair number of .NET implementations of it, and it gives you real values you can compare to each other.
Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.
It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial product, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your needs, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert it or access it through the P/Invoke layer (C++ implementations abound).
Metaphone and Soundex differ in that Soundex produces fixed-length keys (a letter followed by three digits), whereas Metaphone produces keys of varying length, so the results will be different. In the end, both do the same kind of comparison for you; you just have to find out which suits your needs best, given your requirements and resources (and your tolerance for spelling mistakes, of course).
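To give an idea of how little code Soundex needs, here is a sketch of the common American Soundex rules. The Soundex class and Encode method are names chosen for this example. It only handles the letters A to Z and silently skips everything else, including accented letters, which is a simplification.

```csharp
using System;
using System.Text;

public static class Soundex
{
    // Digit for each consonant group; '0' means "not coded" (vowels, H, W, Y).
    private static char CodeFor(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    // Returns a four-character key such as "R163", or "" if the word has no letters.
    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return "";
        var sb = new StringBuilder(4);
        char lastCode = '0';
        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // skip anything that is not A-Z
            char code = CodeFor(raw);
            if (sb.Length == 0)
            {
                sb.Append(raw); // the first letter is kept as-is
                lastCode = code;
            }
            else if (raw == 'H' || raw == 'W')
            {
                // H and W are ignored and do not separate equal codes
            }
            else if (code == '0')
            {
                lastCode = '0'; // a vowel is not coded but separates equal codes
            }
            else
            {
                if (code != lastCode) sb.Append(code);
                lastCode = code;
                if (sb.Length == 4) break;
            }
        }
        return sb.Length == 0 ? "" : sb.ToString().PadRight(4, '0');
    }
}
```

Two words are then considered a phonetic match when their keys are equal: Soundex.Encode("sword") and Soundex.Encode("sord") both give "S630".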

Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudocode found in the Wikipedia article on Levenshtein distance under the "Iterative with two matrix rows" heading.
public static int LevenshteinDistance(string source, string target)
{
    // degenerate cases
    if (source == target) return 0;
    if (source.Length == 0) return target.Length;
    if (target.Length == 0) return source.Length;
    // create two work vectors of integer distances
    int[] v0 = new int[target.Length + 1];
    int[] v1 = new int[target.Length + 1];
    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty source
    // the distance is just the number of characters to delete from target
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;
    for (int i = 0; i < source.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0
        // first element of v1 is A[i+1][0]
        // edit distance is delete (i+1) chars from source to match empty target
        v1[0] = i + 1;
        // use formula to fill in the rest of the row
        for (int j = 0; j < target.Length; j++)
        {
            var cost = (source[i] == target[j]) ? 0 : 1;
            v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
        }
        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }
    return v1[target.Length];
}
Here is a function that gives you the similarity as a fraction between 0 and 1 (multiply by 100 for a percentage).
/// <summary>
/// Calculate the similarity of two strings.
/// </summary>
/// <param name="source">Source string to compare with</param>
/// <param name="target">Target string to compare</param>
/// <returns>Similarity between the two strings, from 0.0 to 1.0</returns>
public static double CalculateSimilarity(string source, string target)
{
    if ((source == null) || (target == null)) return 0.0;
    if (source == target) return 1.0; // also covers two empty strings
    if ((source.Length == 0) || (target.Length == 0)) return 0.0;
    int stepsToSame = LevenshteinDistance(source, target);
    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}
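To get what the question asks for, a few results ordered by how well they match, the similarity can be combined with LINQ. The following is a self-contained sketch: SimilarityRanking, Rank and minSimilarity are hypothetical names, and it uses its own small matrix-based distance function so that it compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimilarityRanking
{
    // Standard Levenshtein distance (case-sensitive).
    public static int Levenshtein(string s, string t)
    {
        var d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // 1.0 for identical strings, 0.0 when every character has to change.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }

    // Candidates at or above minSimilarity, best match first.
    public static List<KeyValuePair<string, double>> Rank(
        IEnumerable<string> candidates, string query, double minSimilarity)
    {
        return candidates
            .Select(c => new KeyValuePair<string, double>(c, Similarity(c, query)))
            .Where(p => p.Value >= minSimilarity)
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

For the candidates "Samantha", "Sam", "ant" and "aunt", the query "Sam" and a minimum of 0.3, this returns "Sam" (1.0) followed by "Samantha" (0.375). The other two share too little with the query.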

Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.

Here are two methods that calculate the Levenshtein Distance between strings.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.
/// <summary>
/// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
/// </summary>
/// <param name="first">The first string, used as a source.</param>
/// <param name="second">The second string, used as a target.</param>
/// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
/// <remarks>
/// From http://www.merriampark.com/ldcsharp.htm
/// </remarks>
public static int LevenshteinDistance(string first, string second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    int n = first.Length;
    int m = second.Length;
    var d = new int[n + 1, m + 1]; // matrix
    if (n == 0) return m;
    if (m == 0) return n;
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }
    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            // compare characters directly instead of allocating one-character substrings
            int cost = (second[j - 1] == first[i - 1]) ? 0 : 1; // cost
            d[i, j] = Math.Min(
                Math.Min(
                    d[i - 1, j] + 1,
                    d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}

using System;
public class Example
{
    public static int getEditDistance(string X, string Y)
    {
        int m = X.Length;
        int n = Y.Length;
        int[][] T = new int[m + 1][];
        for (int i = 0; i < m + 1; ++i) {
            T[i] = new int[n + 1];
        }
        for (int i = 1; i <= m; i++) {
            T[i][0] = i;
        }
        for (int j = 1; j <= n; j++) {
            T[0][j] = j;
        }
        int cost;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                cost = X[i - 1] == Y[j - 1] ? 0 : 1;
                T[i][j] = Math.Min(Math.Min(T[i - 1][j] + 1, T[i][j - 1] + 1),
                    T[i - 1][j - 1] + cost);
            }
        }
        return T[m][n];
    }
    public static double findSimilarity(string x, string y) {
        if (x == null || y == null) {
            throw new ArgumentException("Strings must not be null");
        }
        double maxLength = Math.Max(x.Length, y.Length);
        if (maxLength > 0) {
            // optionally ignore case if needed
            return (maxLength - getEditDistance(x, y)) / maxLength;
        }
        return 1.0;
    }
    public static void Main()
    {
        string s1 = "Techie Delight";
        string s2 = "Tech Delight";
        double similarity = findSimilarity(s1, s2) * 100;
        Console.WriteLine(similarity); // 85.71428571428571
    }
}
