Select enum value by string similarity - c#

I have an enum with six distinct values:
One
Two
Three
Four
Five
Six
which is filled from a config file (i.e. string)
Let's say someone writes into the config file any of the values
On
one
onbe
or other common misspellings/typos, I want to set the most similar value from the enum (in this case, "One") instead of throwing up.
Does C# have something like that builtin or will I have to adapt an existing Edit Distance algorithm for C# and hook that into the enum?

You can use the Levenshtein distance, which tells you the number of edits needed to turn one string into another.
Go through all the values in your enum, calculate the Levenshtein distance from each name to the input, and pick the value with the smallest distance:
private static int CalcLevenshteinDistance(string a, string b)
{
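// If one string is empty, the distance is the length of the other one.
if (String.IsNullOrEmpty(a)) return String.IsNullOrEmpty(b) ? 0 : b.Length;
if (String.IsNullOrEmpty(b)) return a.Length;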
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++) ;
for (int j = 0; j <= lengthB; distances[0, j] = j++) ;
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
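To hook the distance into the enum, one option is to loop over Enum.GetNames and keep the name with the smallest distance. The following is a minimal sketch, not a definitive implementation: the enum name Number, the class EnumMatcher and the method ClosestValue are hypothetical names chosen for illustration, and the comparison is assumed to be case-insensitive.

```csharp
using System;

// Hypothetical enum mirroring the six values from the question.
public enum Number { One, Two, Three, Four, Five, Six }

public static class EnumMatcher
{
    // Returns the enum member whose name has the smallest edit distance to input.
    // Ties are resolved in favour of the member declared first.
    public static TEnum ClosestValue<TEnum>(string input) where TEnum : struct
    {
        string normalized = (input ?? string.Empty).Trim().ToLowerInvariant();
        string bestName = null;
        int bestDistance = int.MaxValue;

        foreach (string name in Enum.GetNames(typeof(TEnum)))
        {
            int distance = Distance(normalized, name.ToLowerInvariant());
            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestName = name;
            }
        }

        // Throws if the enum has no members, because bestName is still null.
        return (TEnum)Enum.Parse(typeof(TEnum), bestName);
    }

    // Plain Levenshtein distance using the full (a.Length + 1) x (b.Length + 1) table.
    private static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }
}
```

With this sketch, EnumMatcher.ClosestValue<Number>("onbe") yields Number.One. Note that it always returns some value, however far off the input is; you may want to reject inputs whose best distance exceeds a threshold of your choosing.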

Related

How to store the results of an array matrix into a smaller array c#

I need to add a value, either p1 (payoff one) or p2 (payoff two), to the four surrounding neighbours of each value in a matrix, and the result should then be printed into a new matrix. If the value is 1, p1 needs to be added to its neighbours; if it is 0, p2 needs to be added to its neighbours. I've tried to do this with a nested for loop, but the 'if' statement in my for loop is giving me errors and I'm not sure where to go next with it.
class MainClass
{
static void Main(string[] args)
{
int m, n, i, j, p1, p2;
// rows and columns of the matrix
m = 3;
n = 3;
//Payoff matrix
p1 = 10; //cheat payoff matrix
p2 = 5; //co-op payoff matrix
int[,] arr = new int[3, 3];
Console.Write("To enter 1 it means to co-operate" );
Console.Write("To enter 0 it means to cheat");
Console.Write("Enter elements of the Matrix: ");
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
arr[i, j] = Convert.ToInt16(Console.ReadLine());
}
}
Console.WriteLine("Printing Matrix: ");
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
Console.Write(arr[i, j] + "\t");
}
Console.WriteLine();
}
// how to change the values of the matrix
int[] payoffMatrix = new int[4];
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
if(arr[i,j] == 1)
{
arr[i, j] = arr[i - 1, j] , arr[i + 1, j] , arr[i, j - 1] , arr[i, j + 1];
}
}
Console.WriteLine();
}
The results for the neighbouring values need to be printed into the payoff matrix as well.
If I understood correctly, you need to make a copy of your array first. Otherwise you would read, e.g., a "1" at position (i + 1) that was only written earlier in the current pass. That's probably not what you want.
Then you just set the desired values in your for loop. You need some bounds checking, because e.g. arrNew[i - 1, j] is only accessible if i > 0.
This gives you something like:
int[,] arrNew = arr.Clone() as int[,]; //creates a copy of arr
for (i = 0; i < m; i++)
{
for (j = 0; j < n; j++)
{
if (arr[i, j] == 1)
{
if (i > 0) //bounds checking
{
arrNew[i - 1, j] = 1;
}
if (i < m - 1) //bounds checking
{
arrNew[i + 1, j] = 1;
}
if (j > 0) //bounds checking
{
arrNew[i, j - 1] = 1;
}
if (j < n - 1) //bounds checking
{
arrNew[i, j + 1] = 1;
}
}
}
}
For a matrix:
0 0 0
0 1 0
0 0 0
the result would be
0 1 0
1 1 1
0 1 0
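The answer above marks the neighbours with 1 rather than adding a payoff. If the goal is literally to add p1 or p2 to the neighbours, a separate result matrix avoids overwriting the input. The following is a sketch under one reading of the question: a cell holding 1 adds p1 to each of its (up to four) neighbours, any other cell adds p2. The names PayoffGrid and ComputePayoffs are hypothetical.

```csharp
using System;

public static class PayoffGrid
{
    // Builds a new matrix in which every cell has received a payoff from each
    // of its existing up/down/left/right neighbours: p1 from a neighbour
    // holding 1, p2 from any other neighbour. The input matrix is not modified.
    public static int[,] ComputePayoffs(int[,] arr, int p1, int p2)
    {
        int m = arr.GetLength(0);
        int n = arr.GetLength(1);
        int[,] payoff = new int[m, n];

        for (int i = 0; i < m; i++)
        {
            for (int j = 0; j < n; j++)
            {
                int value = arr[i, j] == 1 ? p1 : p2;

                // Bounds checks skip neighbours that fall outside the matrix.
                if (i > 0) payoff[i - 1, j] += value;
                if (i < m - 1) payoff[i + 1, j] += value;
                if (j > 0) payoff[i, j - 1] += value;
                if (j < n - 1) payoff[i, j + 1] += value;
            }
        }
        return payoff;
    }
}
```

For the 3x3 matrix with a single 1 in the centre, p1 = 10 and p2 = 5, the centre receives 5 from each of its four neighbours (20), each edge cell receives 5 + 5 + 10 (20), and each corner receives 5 + 5 (10).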

Compare two different sized strings similarity

I have a string that is produced by code, but it may not be correct.
So I have a screen where the user checks the string and changes it.
I must let the user change a maximum of 5 characters.
I need to check how many characters the user changed
by comparing the two strings.
The lengths of the strings may be different.
Thanks in advance. (Language: C#)
You could compute the Levenshtein Distance between the two strings, which returns the number of character edits (removals, inserts, replacements) that must occur to get from string A to string B.
public static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++);
for (int j = 0; j <= m; d[0, j] = j++);
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
Then handle it:
if (LevenshteinDistance.Compute(s1, s2) <= 5)
// Valid
else
// Invalid

fuzzy matching word on OCR page

I have a static phrase that I am searching an OCR'd image for.
string KeywordToFind = "Account Number";
string OcrPageText = @"
GEORGIA
POWER
A SOUTHERN COMPANY
AecountNumber
122- 493
Pagel of2
Please Pay By
Jan 29,2014
Total Due
39.11
";
How can I find the word "AecountNumber" using my keyword "Account Number"?
I have tried using variations of the Levenshtein Distance Algorithm HERE with varied success. I've also tried regexes, but the OCR often converts the text differently, thus rendering the regex useless.
Suggestions? I can provide more code if the link doesn't give enough information. Also, Thanks!
Why not try something mostly arbitrary, like this -- while it would certainly match a lot more than just account number, the chances of the start and end characters existing elsewhere in that order is pretty slim.
A.?c.?.?nt ?N.?[mn]b.?r
http://regex101.com/r/zV1yM2
It'll match things like:
Account Number
AccntNumbr
Aecnt Nunber
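In C#, the pattern above can be applied with System.Text.RegularExpressions. This is only a sketch; the class and method names (OcrKeywordFinder, FindAccountNumberLabel) are made up for illustration, and the pattern is copied unchanged from the answer.

```csharp
using System.Text.RegularExpressions;

public static class OcrKeywordFinder
{
    // Tolerant pattern from the answer: optional characters between the
    // letters that OCR tends to keep intact.
    private static readonly Regex AccountNumberPattern =
        new Regex(@"A.?c.?.?nt ?N.?[mn]b.?r");

    // Returns the text that matched the pattern, or null when nothing matched.
    public static string FindAccountNumberLabel(string ocrPageText)
    {
        Match match = AccountNumberPattern.Match(ocrPageText);
        return match.Success ? match.Value : null;
    }
}
```

Because "." does not match a newline by default, a match cannot span two lines of the OCR text. The Index property of the Match can be used to locate the value that follows the label.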
I answered my own question using substrings. Posting it in case others run into the same type of problem. It's a little unorthodox, but it works great for me.
decimal PossibleStringLength = (PossibleString.Length); //Length of string to search
decimal StaticTextLength = (StaticText.Length); //Length of text to search for
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero); //Find number of errors allowed with given ErrorAllowance percentage
int TextLengthBuffer = (int)StaticTextLength - 1; //start looking for correct result with one less character than it should have (declared after StaticTextLength, which it depends on).
int LowestLevenshteinNumber = 999999; //initialize insanely high maximum
//Look for best match with 1 less character than it should have, then the correct amount of characters.
//And last, with 1 more character. (This is because one letter can be recognized as
//two (W -> VV) and vice versa)
for (int i = 0; i < 3; i++)
{
for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
{
string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
{
PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
LowestLevenshteinNumber = lNumber;
}
}
TextLengthBuffer++;
}
public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}

Regex or string compare with allowance of error

I'm trying to do a string compare in C# with some allowance for error. For example, if my search term is "Welcome", my comparison string (generated through OCR) is "We1come", and my error allowance is 20%, that should match. That part isn't so difficult using something like the Levenshtein algorithm. The hard part is making it work within a larger block of text, the way a regular expression search would. For example, if my OCR result is "Hello. My name is Ben. We1come to my StackOverflow question.", I would like to pick out "We1come" as a good result for my search term.
Took quite a while, but this works well. Fun problem :)
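// PossibleString, StaticText and ErrorAllowance are assumed to be existing variables defined elsewhere.
// ErrorAllowance is assumed to be a decimal percentage (e.g. 20m); with an int, ErrorAllowance / 100 would be integer division.
PossibleString = PossibleString.ToString().ToLower();
StaticText = StaticText.ToLower();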
decimal PossibleStringLength = (PossibleString.Length);
decimal StaticTextLength = (StaticText.Length);
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero);
int LevenshteinDistance = LevenshteinAlgorithm(StaticText, PossibleString);
string PossibleResult = string.Empty;
if (LevenshteinDistance == PossibleStringLength - StaticTextLength)
{
// Perfect match. no need to calculate.
PossibleResult = StaticText;
}
else
{
int TextLengthBuffer = (int)StaticTextLength - 1;
int LowestLevenshteinNumber = 999999;
for (int i = 0; i < 3; i++) // Check for best results with same amount of characters as expected, as well as +/- 1
{
for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
{
string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
{
PossibleResult = possibleResult;
LowestLevenshteinNumber = lNumber;
}
}
TextLengthBuffer++;
}
}
public static int LevenshteinAlgorithm(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
If the OCR's misreadings are somewhat predictable, I would replace each affected letter in the search term with a character class that also covers its likely misreadings.
If the search is Welcome, the regex would be (?i)We[l1]come.
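The sliding-window idea from the longer answer can be condensed as follows. This is a simplified sketch rather than that answer's exact logic: it uses one fixed error budget for all window lengths, compares case-insensitively, and returns the first best window. The names FuzzySearch and FindBestMatch are hypothetical, and errorPercent is assumed to be a decimal percentage such as 20m.

```csharp
using System;

public static class FuzzySearch
{
    // Slides windows of length pattern.Length - 1, pattern.Length and
    // pattern.Length + 1 over text and returns the window with the lowest
    // Levenshtein distance, or null if no window is within the error budget.
    public static string FindBestMatch(string text, string pattern, decimal errorPercent)
    {
        // ToLowerInvariant maps character by character, so indexes line up with text.
        string haystack = text.ToLowerInvariant();
        string needle = pattern.ToLowerInvariant();
        int maxErrors = (int)Math.Round(
            needle.Length * errorPercent / 100m, MidpointRounding.AwayFromZero);

        string best = null;
        int bestDistance = int.MaxValue;

        for (int len = Math.Max(1, needle.Length - 1); len <= needle.Length + 1; len++)
        {
            for (int start = 0; start + len <= haystack.Length; start++)
            {
                int distance = Levenshtein(needle, haystack.Substring(start, len));
                if (distance <= maxErrors && distance < bestDistance)
                {
                    bestDistance = distance;
                    best = text.Substring(start, len); // keep the original casing
                }
            }
        }
        return best;
    }

    // Plain Levenshtein distance using the full table.
    private static int Levenshtein(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }
}
```

With "Welcome" and a 20% allowance, the budget is round(7 * 0.2) = 1 edit, so "We1come" qualifies with a distance of 1. The cost is one distance computation per window, which is fine for a page of OCR text but grows quickly for long inputs.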

Find closest match to input string in a list of strings

I have problems finding an implementation of closest match strings for .net
I would like to match a list of strings, example:
input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"
List of strings:
Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu
Szkoła Podstawowa Specjalna
Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu
Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym
This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".
What algorithms are there available for .net?
Edit distance
Edit distance is a way of quantifying how dissimilar two strings
(e.g., words) are to one another by counting the minimum number of
operations required to transform one string into the other.
Levenshtein distance
Informally, the Levenshtein distance between two words is the minimum
number of single-character edits (i.e. insertions, deletions or
substitutions) required to change one word into the other.
Fast, memory efficient Levenshtein algorithm
C# Levenshtein
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}
.NET does not supply anything out of the box; you need to implement an edit distance algorithm yourself. For example, you can use the Levenshtein distance, like this:
// This code is an implementation of the pseudocode from the Wikipedia,
// showing a naive implementation.
// You should research an algorithm with better space complexity.
public static int LevenshteinDistance(string s, string t) {
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
;
for (int j = 0; j <= m; d[0, j] = j++)
;
for (int i = 1; i <= n; i++) {
for (int j = 1; j <= m; j++) {
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
Call LevenshteinDistance(targetString, possible[i]) for each i, then pick the string possible[i] for which LevenshteinDistance returns the smallest value.
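The code comment above suggests looking for better space complexity. One common variant keeps only two rows of the table instead of the full matrix; combined with the selection loop just described, it could look like the sketch below. The names ClosestMatch, Distance and Find are hypothetical, and ties go to the first candidate.

```csharp
using System;
using System.Collections.Generic;

public static class ClosestMatch
{
    // Levenshtein distance that keeps only the previous and the current row,
    // so memory use is proportional to t.Length instead of s.Length * t.Length.
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            // Swap rows: the row just filled becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[t.Length];
    }

    // Returns the candidate with the smallest distance to target,
    // or null when candidates is empty.
    public static string Find(string target, IEnumerable<string> candidates)
    {
        string best = null;
        int bestDistance = int.MaxValue;
        foreach (string candidate in candidates)
        {
            int distance = Distance(target, candidate);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

For the school names in the question, Find picks "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu", since it is the only candidate that differs from the input by just the abbreviated first name.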
Late to the party, but I had a similar requirement to @Ali123:
"ECM" is closer to "Open form for ECM" than "transcribe" phonetically
I found a simple solution that works for my use case, which is comparing sentences, and finding the sentence that has the most words in common:
public static string FindBestMatch(string stringToCompare, IEnumerable<string> strs) {
HashSet<string> strCompareHash = stringToCompare.Split(' ').ToHashSet();
int maxIntersectCount = 0;
string bestMatch = string.Empty;
foreach (string str in strs)
{
HashSet<string> strHash = str.Split(' ').ToHashSet();
int intersectCount = strCompareHash.Intersect(strHash).Count(); // compare against the candidate's words, not the input's own set
if (intersectCount > maxIntersectCount)
{
maxIntersectCount = intersectCount;
bestMatch = str;
}
}
return bestMatch;
}
