Implementing Levenstein distance for reversed string combination? - c#

I have an employees list in my application. Every employee has name and surname, so I have a list of elements like:
["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
I want my customers to have a feature to search employees by names with a fuzzy-searching algorithm. For example, if user enters "Yuma Turmon", the closest element - "Uma Turman" will return. I use a Levenshtein distance algorithm, I found here.
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
I iterate user's input (full name) over the list of employee names and compare distance. If it is below 3, for example, I return found employee.
Now I want allow users to search by reversed names - for example, if user inputs "Turmon Uma" it will return "Uma Turman", as actually real distance is 1, because First name and Last name is the same as Last name and First name. My algorithm now counts it as different strings, far away. How can I modify it so that names are found regardless of order?

You can create a reversed version of employee names with LINQ. For example, if you have a list of employees like
x = ["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
you can write the following code:
var reversedNames = x.Select(p=> $"{p.Split(' ')[1] p.Split(' ')[0]}");
It will return the reversed version, like:
xReversed = ["Carry Jim", "Turman Uma", "Gates Bill", "Skeet John"]
Then repeat you algorithm with this data too.

A few thoughts, as this is a potentially complicated problem to get right:
Split each employee name into a list of strings. Personally, I'd probably discard anything with 2 or fewer letters, unless that's all the name is composed of. This should help with surnames like "De La Cruz" which might get searched as "dela cruz". Store the list of names for each employee in a dictionary that points back to that employee.
Split the search terms in the same way you split the names in the list. For each search term find the names with the lowest Levenshtein distance, then for each one, starting at the lowest, repeat the search with the rest of the search terms against the other names for that employee. Repeat this step with each word in the query. For example, if the query is John Smith, find the best single word name matches for John, then match remaining names for those "best match" employees on Smith, and get a sum of the distances. Then find the best matches for Smith and match remaining names on John, and sum the distances. The best match is the one with the lowest total distance. You can provide a list of best matches by returning the top 10, say, sorted by total distance. And it won't matter which way around the names in the database or the search terms are. In fact they could be completely out of order and it wouldn't matter.
Consider how to handle hyphenated names. I'd probably split them as if they were not hyphenated.
Consider upper/lower case characters, if you haven't already. You should store lookups in one case and convert the search terms to the same case before comparison.
Be careful of accented letters, many people have them in their names, such as á. Your algorithm won't work correctly with them. Be even more careful if you expect to ever have non-alpha double byte characters, eg. Chinese, Japanese, Arabic, etc.
Two more benefits of splitting the names of each employee:
"Unused" names won't count against the total, so if I only search using the last name, it won't count against me in finding the shortest distance.
Along the same lines, you could apply some extra rules to help with finding non-standard names. For example, hyphenated names could be stored both as hyphenated (eg. Wells-Harvey), compound (WellsHarvey) and individual names (Wells and Harvey separate), all against the same employee. A low-distance match on any one name is a low-distance match on the employee, again extra names don't count against the total.
Here's some basic code that seems to work, however it only really takes into account points 1, 2 and 4:
using System;
using System.Collections.Generic;
using System.Linq;
namespace EmployeeSearch
{
static class Program
{
static List<string> EmployeesList = new List<string>() { "Jim Carrey", "Uma Thurman", "Bill Gates", "Jon Skeet" };
static Dictionary<int, List<string>> employeesById = new Dictionary<int, List<string>>();
static Dictionary<string, List<int>> employeeIdsByName = new Dictionary<string, List<int>>();
static void Main()
{
Init();
var results = FindEmployeeByNameFuzzy("Umaa Thurrmin");
// Returns:
// (1) Uma Thurman Distance: 3
// (0) Jim Carrey Distance: 10
// (3) Jon Skeet Distance: 11
// (2) Bill Gates Distance: 12
Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));
var results = FindEmployeeByNameFuzzy("Tormin Oma");
// Returns:
// (1) Uma Thurman Distance: 4
// (3) Jon Skeet Distance: 7
// (0) Jim Carrey Distance: 8
// (2) Bill Gates Distance: 9
Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));
Console.Read();
}
private static void Init() // prepare our lists
{
for (int i = 0; i < EmployeesList.Count; i++)
{
// Preparing the list of names for each employee - add special cases such as hyphenation here as well
var names = EmployeesList[i].ToLower().Split(new char[] { ' ' }).ToList();
employeesById.Add(i, names);
// This is not used here, but could come in handy if you want a unique index of names pointing to employee ids for optimisation:
foreach (var name in names)
{
if (employeeIdsByName.ContainsKey(name))
{
employeeIdsByName[name].Add(i);
}
else
{
employeeIdsByName.Add(name, new List<int>() { i });
}
}
}
}
private static List<SearchResult> FindEmployeeByNameFuzzy(string query)
{
var results = new List<SearchResult>();
// Notice we're splitting the search terms the same way as we split the employee names above (could be refactored out into a helper method)
var searchterms = query.ToLower().Split(new char[] { ' ' });
// Comparison with each employee
for (int i = 0; i < employeesById.Count; i++)
{
var r = new SearchResult() { Id = i, Name = EmployeesList[i] };
var employeenames = employeesById[i];
foreach (var searchterm in searchterms)
{
int min = searchterm.Length;
// for each search term get the min distance for all names for this employee
foreach (var name in employeenames)
{
var distance = LevenshteinDistance.Compute(searchterm, name);
min = Math.Min(min, distance);
}
// Sum the minimums for all search terms
r.Distance += min;
}
results.Add(r);
}
// Order by lowest distance first
return results.OrderBy(e => e.Distance).ToList();
}
}
public class SearchResult
{
public int Distance { get; set; }
public int Id { get; set; }
public string Name { get; set; }
}
public static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
}
Simply call Init() when you start, then call
var results = FindEmployeeByNameFuzzy(userquery);
to return an ordered list of the best matches.
Disclaimers: This code is not optimal and has only been briefly tested, doesn't check for nulls, could explode and kill a kitten, etc, etc. If you have a large number of employees then this could be very slow. There are several improvements that could be made, for example when looping over the Levenshtein algorithm you could drop out if the distance gets above the current minimum distance.

Related

Figure out max number of consecutive seats

I had an interviewer ask me to write a program in c# to figure out the max number of 4 members families that can sit consecutively in a venue, taking into account that the 4 members must be consecutively seated in one single row, with the following context:
N represents the number of rows availabe.
The Columns are labeled from the letter "A" to "K", purposely ommiting the letter "i" (in other words, {A,B,C,D,E,F,G,H,J,K})
M represents a list of reserved seats
Quick example:
N = 2
M = {"1A","2F","1C"}
Solution = 3
In the representation you can see that, with the reservations and the size given, only three families of 4 can be seated in a consecutive order.
How would you solve this? is it possible to not use for loops? (Linq solutions)
I got mixed up in the for loops when trying to deal with the reservations aray: My idea was to obtain all the reservations that a row has, but then I don't really know how to deal with the letters (Converting directly from letter to number is a no go because the missing "I") and you kinda need the letters to position the reserved sits anyway.
Any approach or insight on how to go about this problem would be nice.
Thanks in advance!
Here is another implementation.
I also tried to explain why certain things have been done.
Good luck.
private static int GetNumberOfAvailablePlacesForAFamilyOfFour(int numberOfRows, string[] reservedSeats)
{
// By just declaring the column names as a string of the characters
// we can query the column index by colulmnNames.IndexOf(char)
string columnNames = "ABCDEFGHJK";
// Here we transform the reserved seats to a matrix
// 1A 2F 1C becomes
// reservedSeatMatrix[0] = [0, 2] -> meaning row 1 and columns A and C, indexes 0 and 2
// reservedSeatMatrix[1] = [5] -> meaning row 2 and column F, index 5
List<List<int>> reservedSeatMatrix = new List<List<int>>();
for (int row = 0; row < numberOfRows; row++)
{
reservedSeatMatrix.Add(new List<int>());
}
foreach (string reservedSeat in reservedSeats)
{
int seatRow = Convert.ToInt32(reservedSeat.Substring(0, reservedSeat.Length - 1));
int seatColumn = columnNames.IndexOf(reservedSeat[reservedSeat.Length - 1]);
reservedSeatMatrix[seatRow - 1].Add(seatColumn);
}
// Then comes the evaluation.
// Which is simple enough to read.
int numberOfAvailablePlacesForAFamilyOfFour = 0;
for (int row = 0; row < numberOfRows; row++)
{
// Reset the number of consecutive seats at the beginning of a new row
int numberOfConsecutiveEmptySeats = 0;
for (int column = 0; column < columnNames.Length; column++)
{
if (reservedSeatMatrix[row].Contains(column))
{
// reset when a reserved seat is reached
numberOfConsecutiveEmptySeats = 0;
continue;
}
numberOfConsecutiveEmptySeats++;
if(numberOfConsecutiveEmptySeats == 4)
{
numberOfAvailablePlacesForAFamilyOfFour++;
numberOfConsecutiveEmptySeats = 0;
}
}
}
return numberOfAvailablePlacesForAFamilyOfFour;
}
static void Main(string[] args)
{
int familyPlans = GetNumberOfAvailablePlacesForAFamilyOfFour(2, new string[] { "1A", "2F", "1C" });
}
Good luck on your interview
As always, you will be asked how could you improve that? So you'd consider complexity stuff like O(N), O(wtf).
Underlying implementation would always need for or foreach. Just importantly, never do unnecessary in a loop. For example, if there's only 3 seats left in a row, you don't need to keep hunting on that row because it is not possible to find any.
This might help a bit:
var n = 2;
var m = new string[] { "1A", "2F", "1C" };
// We use 2 dimension bool array here. If it is memory constraint, we can use BitArray.
var seats = new bool[n, 10];
// If you just need the count, you don't need a list. This is for returning more information.
var results = new List<object>();
// Set reservations.
foreach (var r in m)
{
var row = r[0] - '1';
// If it's after 'H', then calculate index based on 'J'.
// 8 is index of J.
var col = r[1] > 'H' ? (8 + r[1] - 'J') : r[1] - 'A';
seats[row, col] = true;
}
// Now you should all reserved seats marked as true.
// This is O(N*M) where N is number of rows, M is number of columns.
for (int row = 0; row < n; row++)
{
int start = -1;
int length = 0;
for (int col = 0; col < 10; col++)
{
if (start < 0)
{
if (!seats[row, col])
{
// If there's no consecutive seats has started, and current seat is available, let's start!
start = col;
length = 1;
}
}
else
{
// If have started, check if we could have 4 seats.
if (!seats[row, col])
{
length++;
if (length == 4)
{
results.Add(new { row, start });
start = -1;
length = 0;
}
}
else
{
// // We won't be able to reach 4 seats, so reset
start = -1;
length = 0;
}
}
if (start < 0 && col > 6)
{
// We are on column H now (only have 3 seats left), and we do not have a consecutive sequence started yet,
// we won't be able to make it, so break and continue next row.
break;
}
}
}
var solution = results.Count;
LINQ, for and foreach are similar things. It is possible you could wrap the above into a custom iterator like:
class ConsecutiveEnumerator : IEnumerable
{
public IEnumerator GetEnumerator()
{
}
}
Then you could start using LINQ.
If you represent your matrix in simple for developers format, it will be easier. You can accomplish it either by dictionary or perform not so complex mapping by hand. In any case this will calculate count of free consecutive seats:
public static void Main(string[] args)
{
var count = 0;//total count
var N = 2; //rows
var M = 10; //columns
var familySize = 4;
var matrix = new []{Tuple.Create(0,0),Tuple.Create(1,5), Tuple.Create(0,2)}.OrderBy(x=> x.Item1).ThenBy(x=> x.Item2).GroupBy(x=> x.Item1, x=> x.Item2);
foreach(var row in matrix)
{
var prevColumn = -1;
var currColumn = 0;
var free = 0;
var div = 0;
//Instead of enumerating entire matrix, we just calculate intervals in between reserved seats.
//Then we divide them by family size to know how many families can be contained within
foreach(var column in row)
{
currColumn = column;
free = (currColumn - prevColumn - 1)/familySize;
count += free;
prevColumn = currColumn;
}
currColumn = M;
free = (currColumn - prevColumn - 1)/familySize;
count += free;
}
Console.WriteLine("Result: {0}", count);
}

C# Finding Maximum Value of Items that fit in certain slots

All,
I'm having an issue wrapping my head around how to find the ideal relationship between 2 values while also having to fit those items into a certain position. I honestly don't even know where to begin. I researched the knapsack problem, but there is no positional requirements.
Example:
I have $50.00 to spend on food. I need to eat 4 meals (breakfast, lunch, dinner, and a snack - which can be breakfast or lunch). Each meal has 4? properties; name, slot, calories, and cost. My goal is to eat the most calories while staying under my $50.00 allotment.
class Meal
{
public enum MealType
{
Breakfast = 1,
Lunch = 2,
Dinner = 3
}
public string Name { get; set; }
public MealType Type { get; set; }
public int Calories { get; set; }
public decimal Cost { get; set; }
public Meal(string _name, MealType _type, int _calories, decimal _cost)
{
Name = _name;
Type = _type;
Calories = _calories;
Cost = _cost;
}
}
I'm able to read in from a spreadsheet or some other source and create a Meal for each record and add it to a List. I have no idea how to read through my list (or any collection), though, and find the maximum caloric combination of 4 meals while adhering to the requirement that Meal #1 must be of type "breakfast", Meal #2 must be of type "lunch", Meal #3 must be of type "dinner", and Meal #4 can be of type "breakfast" or "lunch". Any ideas on where to begin? Thanks.
UPDATE
I ended up getting this to work - though I'm sure it is not very efficient, especially as the input list grows. Here is the hideous code (I obviously also defined a Menu class not included):
List<Meal> allMeals = Meal.GetMeals();
List<Meal> allBreakfast = new List<Meal>();
List<Meal> allLunch = new List<Meal>();
List<Meal> allDinner = new List<Meal>();
List<Meal> allSnack = new List<Meal>();
foreach (Meal meal in allMeals)
{
if (meal.MealType == Meal.MealType.Breakfast)
{
allBreakfast.Add(meal);
allSnack.Add(meal);
}
else if (meal.MealType == Meal.MealType.Lunch)
{
allLunch.Add(meal);
allSnack.Add(meal);
}
else if (meal.MealType == Meal.MealType.Dinner)
{
allDinner.Add(meal);
}
}
foreach (Meal breakfast in allBreakfast)
{
foreach (Meal lunch in allLunch)
{
foreach (Meal dinner in allDinner)
{
foreach (Meal snack in allSnack)
{
if (snack == breakfast || snack == lunch)
{
continue;
}
currMenu = new Menu(breakfast, lunch, dinner, snack);
if (currMenu.Cost < Menu.MaxCost && (maxMenu == null || currMenu.Calories > maxMenu.Calories))
{
maxMenu = currMenu;
}
}
}
}
}
In this situation, if you are not familiar with complex algorithm, you can try the brute force algo which is still feasible in the time and memory space.
We have 4 sets to store record: A, B, C, D (A: breakfast, B: lunch, C: dinner, D = A + B)
Our aim is to find Ai, Bj, Cu, Dv so as to:
1. Ai.cost + Bj.cost + Cu.cost + Dv.cost <=50
2. Ai.calo + Bj.calo + Cu.calo + Dv.calo is maximum
As in reality, we just count up to 2 decimal--> range of price is from 0.00 -> 50.00, i.e. 5000 case
1. Find all cost combination of Ai + Bj (<=50) store into array AB.
if (Ai1 + Bj1 == Ai2 + Bj2), just take combination have higher calories.
=> Complexity: O(|A| * |B|)
2. Find all cost combination of Cu + Dv (<=50) store into array CD with the same process as for AB
=> Complexity: O(|C| * |D|)
Note: size of AB and CD is 5000 elements
3. Find all combination AB[p] + CD[q] which cost is under 5000 and we can know which one got highest calories.
=> Complexity: O(|AB| * |CD| ) = O(5000 * 5000)
Totally, if using brute-force, the time complexity is O(5000 * 5000) + O(|spreadSheet size for one categories|^2).
If the spreadSheet size is <= 5000 and you are not familiar with algorithm, you can use this simple approach. Otherwise, just move ahead, algorithm is fun :)
Ok, so if you have read about Knapsack problem, and still don't know how to solve it, here is something for you to consider
The idea is simple, the array result contains all the value of money that can be reached by any Meal combination. So at first, when you doesn't buy anything, only result[50] can be reached.
Iterating through all the choices, starting from breakfast to snack, and only check for the value that can be reached, we can slowly build up the value table. For example, if breakfast has three prices(1,2,3) so the next state can only be (50-1), (50-2), (50-3) and so on. In array result, we store the maximum value of calories which can be created. So if one Breakfast set is cost 3, calo 2, and another set is cost 3, calo 4, so the value at index (50 -3) of the result array is 4.
Finally, after scanning through all of the state, the value storing in result array is the maximum value of calories that can be made.
int money = 50;// your initial money
int[] result = new int[money + 1];
for (int i = 0; i < result.length; i++) {
result[i] = -1;//-1 means cannot reach
}
result[50] = 0;//Only 50 can be reached at first
List<Meal>[] list = new List[4]; // Array for all 4 meals
// 0 is Breakfast
// 1 is lunch
// 2 for dinner
// 3 for snack
for (int i = 0; i < 4; i++) {//Iterating from breakfast to snack
int[] temp = new int[money + 1];//A temporary array to store value
for (int j = 0; j < result.length; j++) {
temp[j] = -1;
}
for (int j = 0; j < result.length; j++) {
if (result[j] != -1) {//For previous value, which can be reached
//Try all posibilites
for (int k = 0; k < list[i].size(); k++) {
int cost = list[i].get(k).cost;
int energy = list[i].get(k).energy;
if (cost <= result[j]) {
temp[j - cost] = max(temp[j - cost], result[j] + energy);
}
}
}
}
result = temp;//Need to switch these arrays.
}
int max = 0;
for(int i = 0; i < result.length; i++){
max = max(max, result[i]);
}
System.out.println(max);

How to measure similarity of 2 strings aside from computing distance

I am creating a program that checks if the word is simplified word(txt, msg, etc.) and if it is simplified it finds the correct spelling like txt=text, msg=message. Iam using the NHunspell Suggest Method in c# which suggest all possible results.
The problem is if I inputted "txt" the result is text,tat, tot, etc. I dont know how to select the correct word. I used Levenshtein Distance (C# - Compare String Similarity) but the results still results to 1.
Input: txt
Result: text = 1, ext = 1 tit = 1
Can you help me how to get the meaning or the correct spelling of the simplified words?
Example: msg
I have tested your input with your sample data and only text has a distance of 25 whereas the other have a distance of 33. Here's my code:
string input = "TXT";
string[] words = new[]{"text","tat","tot"};
var levenshtein = new Levenshtein();
const int maxDistance = 30;
var distanceGroups = words
.Select(w => new
{
Word = w,
Distance = levenshtein.iLD(w.ToUpperInvariant(), input)
})
.Where(x => x.Distance <= maxDistance)
.GroupBy(x => x.Distance)
.OrderBy(g => g.Key)
.ToList();
foreach (var topCandidate in distanceGroups.First())
Console.WriteLine("Word:{0} Distance:{1}", topCandidate.Word, topCandidate.Distance);
and here is the levenshtein class:
public class Levenshtein
{
///*****************************
/// Compute Levenshtein distance
/// Memory efficient version
///*****************************
public int iLD(String sRow, String sCol)
{
int RowLen = sRow.Length; // length of sRow
int ColLen = sCol.Length; // length of sCol
int RowIdx; // iterates through sRow
int ColIdx; // iterates through sCol
char Row_i; // ith character of sRow
char Col_j; // jth character of sCol
int cost; // cost
/// Test string length
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
// Step 1
if (RowLen == 0)
{
return ColLen;
}
if (ColLen == 0)
{
return RowLen;
}
/// Create the two vectors
int[] v0 = new int[RowLen + 1];
int[] v1 = new int[RowLen + 1];
int[] vTmp;
/// Step 2
/// Initialize the first vector
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
v0[RowIdx] = RowIdx;
}
// Step 3
/// Fore each column
for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
{
/// Set the 0'th element to the column number
v1[0] = ColIdx;
Col_j = sCol[ColIdx - 1];
// Step 4
/// Fore each row
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
Row_i = sRow[RowIdx - 1];
// Step 5
if (Row_i == Col_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
/// Find minimum
int m_min = v0[RowIdx] + 1;
int b = v1[RowIdx - 1] + 1;
int c = v0[RowIdx - 1] + cost;
if (b < m_min)
{
m_min = b;
}
if (c < m_min)
{
m_min = c;
}
v1[RowIdx] = m_min;
}
/// Swap the vectors
vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totaly different
///
/// The vectors where swaped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
//System.Console.WriteLine("iDist=" + v0[RowLen]);
int max = System.Math.Max(RowLen, ColLen);
return ((100 * v0[RowLen]) / max);
}
///*****************************
/// Compute the min
///*****************************
private int Minimum(int a, int b, int c)
{
int mi = a;
if (b < mi)
{
mi = b;
}
if (c < mi)
{
mi = c;
}
return mi;
}
///*****************************
/// Compute Levenshtein distance
///*****************************
public int LD(String sNew, String sOld)
{
int[,] matrix; // matrix
int sNewLen = sNew.Length; // length of sNew
int sOldLen = sOld.Length; // length of sOld
int sNewIdx; // iterates through sNew
int sOldIdx; // iterates through sOld
char sNew_i; // ith character of sNew
char sOld_j; // jth character of sOld
int cost; // cost
/// Test string length
if (Math.Max(sNew.Length, sOld.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.LD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sNew.Length, sOld.Length) + "."));
// Step 1
if (sNewLen == 0)
{
return sOldLen;
}
if (sOldLen == 0)
{
return sNewLen;
}
matrix = new int[sNewLen + 1, sOldLen + 1];
// Step 2
for (sNewIdx = 0; sNewIdx <= sNewLen; sNewIdx++)
{
matrix[sNewIdx, 0] = sNewIdx;
}
for (sOldIdx = 0; sOldIdx <= sOldLen; sOldIdx++)
{
matrix[0, sOldIdx] = sOldIdx;
}
// Step 3
for (sNewIdx = 1; sNewIdx <= sNewLen; sNewIdx++)
{
sNew_i = sNew[sNewIdx - 1];
// Step 4
for (sOldIdx = 1; sOldIdx <= sOldLen; sOldIdx++)
{
sOld_j = sOld[sOldIdx - 1];
// Step 5
if (sNew_i == sOld_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
matrix[sNewIdx, sOldIdx] = Minimum(matrix[sNewIdx - 1, sOldIdx] + 1, matrix[sNewIdx, sOldIdx - 1] + 1, matrix[sNewIdx - 1, sOldIdx - 1] + cost);
}
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totaly different
//System.Console.WriteLine("Dist=" + matrix[sNewLen, sOldLen]);
int max = System.Math.Max(sNewLen, sOldLen);
return (100 * matrix[sNewLen, sOldLen]) / max;
}
}
Not a complete solution, just a hopefully helpful suggestion...
It seems to me that people are unlikely to use simplifications that are as long as the correct word, so you could at least filter out all results whose length <= the input's length.
You really need to implement the SOUNDEX routine that exists in SQL. I've done that in the following code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Soundex
{
class Program
{
static char[] ignoreChars = new char[] { 'a', 'e', 'h', 'i', 'o', 'u', 'w', 'y' };
static Dictionary<char, int> charVals = new Dictionary<char, int>()
{
{'b',1},
{'f',1},
{'p',1},
{'v',1},
{'c',2},
{'g',2},
{'j',2},
{'k',2},
{'q',2},
{'s',2},
{'x',2},
{'z',2},
{'d',3},
{'t',3},
{'l',4},
{'m',5},
{'n',5},
{'r',6}
};
static void Main(string[] args)
{
Console.WriteLine(Soundex("txt"));
Console.WriteLine(Soundex("text"));
Console.WriteLine(Soundex("ext"));
Console.WriteLine(Soundex("tit"));
Console.WriteLine(Soundex("Cammmppppbbbeeelll"));
}
static string Soundex(string s)
{
s = s.ToLower();
StringBuilder sb = new StringBuilder();
sb.Append(s.First());
foreach (var c in s.Substring(1))
{
if (ignoreChars.Contains(c)) { continue; }
// if the previous character yields the same integer then skip it
if ((int)char.GetNumericValue(sb[sb.Length - 1]) == charVals[c]) { continue; }
sb.Append(charVals[c]);
}
return string.Join("", sb.ToString().Take(4)).PadRight(4, '0');
}
}
}
See, with this code, the only match out of the examples you gave would be text. Run the console application and you'll see the output (i.e. txt would match text).
One method I think programs like word uses to correct spellings, is to use NLP (Natural Language Processing) techniques to get the order of Nouns/Adjectives used in the context of the spelling mistakes.. then comparing that to known sentence structures they can estimate 70% chance the spelling mistake was a noun and use that information to filter the corrected spellings.
SharpNLP looks like a good library but I haven't had a chance to fiddle with it yet. To build a library of known sentence structures BTW, in uni we applied our algorithms to public domain books.
check out sams simMetrics library I found on SO (download here, docs here) for loads more options for algorithms to use besides Levenshtein distance.
Expanding on my comment, you could use regex to search for a result that is an 'expansion' of the input. Something like this:
private int stringSimilarity(string input, string result)
{
string regexPattern = ""
foreach (char c in input)
regexPattern += input + ".*"
Match match = Regex.Match(result, regexPattern,
RegexOptions.IgnoreCase);
if (match.Success)
return 1;
else
return 0;
}
Ignore the 1 and the 0 - I don't know how similarity valuing works.

English Dictionary word matching from a string

I'm trying to get my head around a problem of identifying the best match of English words from a dictionary file to a given string.
For example ("lines" being a List of dictionary words):
string testStr = "cakeday";
for (int x= 0; x<= testStr.Length; x++)
{
string test = testStr.Substring(x);
if (test.Length > 0)
{
string test2 = testStr.Remove(counter);
int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
}
}
Gives the output:
Test: cakeday / : 0
Test: akeday / c : 1
Test: keday / ca : 0
Test: eday / cak : 0
Test: day / cake : 2
Test: ay / caked : 1
Test: y / cakeda : 1
Obviously "day / cake" is the best fit for the string however if I were to introduce a 3rd word into the string e.g "cakedaynow" it doesnt work so well.
I know the example is primitive, its more a proof of concept and was wondering if anyone had any experience with this type of string analysis?
Thanks!
You'll want to research the class of algorithms appropriate to what you're trying to do. Start with Approximate string matching on Wikipedia.
Also, here's a Levenshtein Edit Distance implementation in C# to get you started:
using System;
namespace StringMatching
{
/// <summary>
/// A class to extend the string type with a method to get Levenshtein Edit Distance.
/// </summary>
public static class LevenshteinDistanceStringExtension
{
/// <summary>
/// Get the Levenshtein Edit Distance.
/// </summary>
/// <param name="strA">The current string.</param>
/// <param name="strB">The string to determine the distance from.</param>
/// <returns>The Levenshtein Edit Distance.</returns>
public static int GetLevenshteinDistance(this string strA, string strB)
{
if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
return 0;
if (string.IsNullOrEmpty(strA))
return strB.Length;
if (string.IsNullOrEmpty(strB))
return strA.Length;
int[,] deltas; // matrix
int lengthA;
int lengthB;
int indexA;
int indexB;
char charA;
char charB;
int cost; // cost
// Step 1
lengthA = strA.Length;
lengthB = strB.Length;
deltas = new int[lengthA + 1, lengthB + 1];
// Step 2
for (indexA = 0; indexA <= lengthA; indexA++)
{
deltas[indexA, 0] = indexA;
}
for (indexB = 0; indexB <= lengthB; indexB++)
{
deltas[0, indexB] = indexB;
}
// Step 3
for (indexA = 1; indexA <= lengthA; indexA++)
{
charA = strA[indexA - 1];
// Step 4
for (indexB = 1; indexB <= lengthB; indexB++)
{
charB = strB[indexB - 1];
// Step 5
if (charA == charB)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
}
}
// Step 7
return deltas[lengthA, lengthB];
}
}
}
Why not:
Check all the strings inside the search word extracting from current search position to all possible lengths of the string and extract all discovered words. E.g.:
var list = new List<string>{"the", "me", "cat", "at", "theme"};
const string testStr = "themecat";
var words = new List<string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
for(int i = (len - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test) && !words.Contains(test))
{
words.Add(test);
}
}
}
words.ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, the, me, cat, at
Edit
Live Scenario 1:
For instance let's say you want to always choose the longest word in a jumbled sentence, you could read from front forward, reducing the amount of text read till you are through. Using a dictionary makes it much easier, by storing the indexes of the discovered words, we can quickly check to see if we have stored a word containing another word we are evaluating before.
Example:
var list = new List<string>{"the", "me", "cat", "at", "theme", "crying", "them"};
const string testStr = "themecatcryingthem";
var words = new Dictionary<int, string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
int n = len > 28 ? 28 : len;//assuming 28 is the maximum length of an english word
for(int i = (n - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test))
{
if (!words.ContainsValue(test))
{
bool found = false;//to check if there's a shorter item starting from same index
var key = testStr.IndexOf(test, x, len - x);
foreach (var w in words)
{
if (w.Value.Contains(test) && w.Key != key && key == (w.Key + w.Value.Length - test.Length))
{
found = true;
}
}
if (!found && !words.ContainsKey(key)) words.Add(key, test);
}
}
}
}
words.Values.ToList().ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, cat, crying, them

How to calculate distance similarity measure of given 2 strings?

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
Make sure your arrays slice accross the shorter word width.
Code (it works the exact same if you replace int[] with String in the parameter declarations:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
Levenstein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementation to all of these is called SimMetrics
which has both java and c# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences betwen strings such as:
Spelling mistakes; or
ö instead of o in a persons name.
However when comparing something like article titles where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
compare these 2 titles (that are the same but worded differently from different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site that provides algorithm comparison of the strings shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of Damerau Levenshtein Distance, which returns not only similarity coefficient, but also returns error locations in corrected word (this feature can be used in text editors). Also my implementation supports different weights of errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it's seems to me that method is working.
But comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, and you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function

Categories