English Dictionary word matching from a string - c#

I'm trying to get my head around a problem of identifying the best match of English words from a dictionary file to a given string.
For example ("lines" being a List of dictionary words):
string testStr = "cakeday";
for (int x= 0; x<= testStr.Length; x++)
{
string test = testStr.Substring(x);
if (test.Length > 0)
{
string test2 = testStr.Remove(x);
int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
}
}
Gives the output:
Test: cakeday / : 0
Test: akeday / c : 1
Test: keday / ca : 0
Test: eday / cak : 0
Test: day / cake : 2
Test: ay / caked : 1
Test: y / cakeda : 1
Obviously "day / cake" is the best fit for the string. However, if I introduce a third word into the string, e.g. "cakedaynow", it doesn't work so well.
I know the example is primitive; it's more a proof of concept. Does anyone have experience with this type of string analysis?
Thanks!

You'll want to research the class of algorithms appropriate to what you're trying to do. Start with Approximate string matching on Wikipedia.
Also, here's a Levenshtein Edit Distance implementation in C# to get you started:
using System;
namespace StringMatching
{
/// <summary>
/// A class to extend the string type with a method to get Levenshtein Edit Distance.
/// </summary>
public static class LevenshteinDistanceStringExtension
{
/// <summary>
/// Get the Levenshtein Edit Distance.
/// </summary>
/// <param name="strA">The current string.</param>
/// <param name="strB">The string to determine the distance from.</param>
/// <returns>The Levenshtein Edit Distance.</returns>
public static int GetLevenshteinDistance(this string strA, string strB)
{
if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
return 0;
if (string.IsNullOrEmpty(strA))
return strB.Length;
if (string.IsNullOrEmpty(strB))
return strA.Length;
int[,] deltas; // matrix
int lengthA;
int lengthB;
int indexA;
int indexB;
char charA;
char charB;
int cost; // cost
// Step 1
lengthA = strA.Length;
lengthB = strB.Length;
deltas = new int[lengthA + 1, lengthB + 1];
// Step 2
for (indexA = 0; indexA <= lengthA; indexA++)
{
deltas[indexA, 0] = indexA;
}
for (indexB = 0; indexB <= lengthB; indexB++)
{
deltas[0, indexB] = indexB;
}
// Step 3
for (indexA = 1; indexA <= lengthA; indexA++)
{
charA = strA[indexA - 1];
// Step 4
for (indexB = 1; indexB <= lengthB; indexB++)
{
charB = strB[indexB - 1];
// Step 5
if (charA == charB)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
}
}
// Step 7
return deltas[lengthA, lengthB];
}
}
}

Why not:
For each start position in the search string, extract every substring of each possible length from that position and collect the ones that are dictionary words. E.g.:
var list = new List<string>{"the", "me", "cat", "at", "theme"};
const string testStr = "themecat";
var words = new List<string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
for(int i = (len - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test) && !words.Contains(test))
{
words.Add(test);
}
}
}
words.ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, the, me, cat, at
Edit
Live Scenario 1:
For instance, say you always want to choose the longest word in a run-together sentence. You can read from the front, trying the longest candidate at each position first. Storing the discovered words in a dictionary keyed by their start index makes it easy to check whether a word being evaluated is already covered by a longer word stored earlier.
Example:
var list = new List<string>{"the", "me", "cat", "at", "theme", "crying", "them"};
const string testStr = "themecatcryingthem";
var words = new Dictionary<int, string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
int end = Math.Min(len, x + 28);//assuming 28 is the maximum length of an English word, measured from x
for(int i = (end - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test))
{
if (!words.ContainsValue(test))
{
bool found = false;//to check if there's a shorter item starting from same index
var key = testStr.IndexOf(test, x, len - x);
foreach (var w in words)
{
if (w.Value.Contains(test) && w.Key != key && key == (w.Key + w.Value.Length - test.Length))
{
found = true;
}
}
if (!found && !words.ContainsKey(key)) words.Add(key, test);
}
}
}
}
words.Values.ToList().ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, cat, crying, them
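Neither scan above decides how to split the whole string when candidate words overlap (for example "cakedaynow" from the question). One common way to do that is a dynamic-programming word-break pass. The following is only a minimal sketch, not the answerer's method: the names `WordBreak` and `Segment` are hypothetical, the 28-character limit is carried over from the assumption above, and "fewest words" is just one possible definition of the best fit.

```csharp
using System;
using System.Collections.Generic;

public static class WordBreak
{
    // Returns the split of 'text' into dictionary words that uses the fewest
    // words, or null when no complete split exists.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength = 28)
    {
        int n = text.Length;
        // best[i] = fewest words needed to cover text[0..i), or -1 if unreachable.
        // best[0] stays 0: the empty prefix needs no words.
        var best = new int[n + 1];
        // prev[i] = start index of the last word in that best split.
        var prev = new int[n + 1];
        for (int i = 1; i <= n; i++) best[i] = -1;

        for (int end = 1; end <= n; end++)
        {
            int firstStart = Math.Max(0, end - maxWordLength);
            for (int start = firstStart; start < end; start++)
            {
                if (best[start] < 0) continue; // this prefix cannot be segmented
                if (!dictionary.Contains(text.Substring(start, end - start))) continue;
                int candidate = best[start] + 1;
                if (best[end] < 0 || candidate < best[end])
                {
                    best[end] = candidate;
                    prev[end] = start;
                }
            }
        }

        if (best[n] < 0) return null;

        // Walk the back-pointers from the end to recover the words in order.
        var words = new List<string>();
        for (int end = n; end > 0; end = prev[end])
        {
            words.Add(text.Substring(prev[end], end - prev[end]));
        }
        words.Reverse();
        return words;
    }
}
```

With a `HashSet<string>` containing "cake", "day", "now", "ay" and "no", calling `WordBreak.Segment("cakedaynow", dictionary)` yields cake, day, now. Using a `HashSet<string>` instead of `List<string>.Contains` also avoids a linear scan per lookup.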

Related

Implementing Levenshtein distance for reversed string combination?

I have an employees list in my application. Every employee has name and surname, so I have a list of elements like:
["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
I want my customers to be able to search employees by name with a fuzzy-searching algorithm. For example, if a user enters "Yuma Turmon", the closest element, "Uma Turman", should be returned. I use a Levenshtein distance algorithm I found here.
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
I iterate over the list of employee names, compare each with the user's input (a full name), and compute the distance. If it is below 3, for example, I return the employee found.
Now I want to allow users to search by reversed names. For example, if the user inputs "Turmon Uma", it should return "Uma Turman", since the real distance is 1 once first name and last name are swapped. My algorithm currently treats them as very different strings. How can I modify it so that names are found regardless of order?
You can create a reversed version of employee names with LINQ. For example, if you have a list of employees like
x = ["Jim Carry", "Uma Turman", "Bill Gates", "John Skeet"]
you can write the following code:
var reversedNames = x.Select(p => $"{p.Split(' ')[1]} {p.Split(' ')[0]}");
It will return the reversed version, like:
xReversed = ["Carry Jim", "Turman Uma", "Gates Bill", "Skeet John"]
Then repeat your algorithm with this data too.
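Indexing `[1]` and `[0]` assumes every name has exactly two parts. A slightly more defensive sketch reverses however many space-separated parts a name has and takes the smaller of the two distances. `NameVariants`, `ReverseParts` and `BestDistance` are hypothetical names, and the distance function is passed in so that any Levenshtein implementation (such as `LevenshteinDistance.Compute` from the question) can be plugged in.

```csharp
using System;

public static class NameVariants
{
    // "Uma Turman" -> "Turman Uma"; a single-word name is returned unchanged.
    public static string ReverseParts(string fullName)
    {
        var parts = fullName.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Array.Reverse(parts);
        return string.Join(" ", parts);
    }

    // Smaller of the distances to the name as stored and to its reversed form.
    public static int BestDistance(string query, string fullName, Func<string, string, int> distance)
    {
        return Math.Min(distance(query, fullName), distance(query, ReverseParts(fullName)));
    }
}
```

A search loop would then call `NameVariants.BestDistance(userInput, employeeName, LevenshteinDistance.Compute)` for each employee instead of keeping a second, reversed list.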
A few thoughts, as this is a potentially complicated problem to get right:
1. Split each employee name into a list of strings. Personally, I'd probably discard anything with 2 or fewer letters, unless that's all the name is composed of. This should help with surnames like "De La Cruz" which might get searched as "dela cruz". Store the list of names for each employee in a dictionary that points back to that employee.
2. Split the search terms in the same way you split the names in the list. For each search term find the names with the lowest Levenshtein distance, then for each one, starting at the lowest, repeat the search with the rest of the search terms against the other names for that employee. Repeat this step with each word in the query. For example, if the query is John Smith, find the best single word name matches for John, then match remaining names for those "best match" employees on Smith, and get a sum of the distances. Then find the best matches for Smith and match remaining names on John, and sum the distances. The best match is the one with the lowest total distance. You can provide a list of best matches by returning the top 10, say, sorted by total distance. And it won't matter which way around the names in the database or the search terms are. In fact they could be completely out of order and it wouldn't matter.
3. Consider how to handle hyphenated names. I'd probably split them as if they were not hyphenated.
4. Consider upper/lower case characters, if you haven't already. You should store lookups in one case and convert the search terms to the same case before comparison.
5. Be careful of accented letters, many people have them in their names, such as á. Your algorithm won't work correctly with them. Be even more careful if you expect to ever have non-alpha double byte characters, eg. Chinese, Japanese, Arabic, etc.
Two more benefits of splitting the names of each employee:
"Unused" names won't count against the total, so if I only search using the last name, it won't count against me in finding the shortest distance.
Along the same lines, you could apply some extra rules to help with finding non-standard names. For example, hyphenated names could be stored both as hyphenated (eg. Wells-Harvey), compound (WellsHarvey) and individual names (Wells and Harvey separate), all against the same employee. A low-distance match on any one name is a low-distance match on the employee, again extra names don't count against the total.
Here's some basic code that seems to work, however it only really takes into account points 1, 2 and 4:
using System;
using System.Collections.Generic;
using System.Linq;
namespace EmployeeSearch
{
static class Program
{
static List<string> EmployeesList = new List<string>() { "Jim Carrey", "Uma Thurman", "Bill Gates", "Jon Skeet" };
static Dictionary<int, List<string>> employeesById = new Dictionary<int, List<string>>();
static Dictionary<string, List<int>> employeeIdsByName = new Dictionary<string, List<int>>();
static void Main()
{
Init();
var results = FindEmployeeByNameFuzzy("Umaa Thurrmin");
// Returns:
// (1) Uma Thurman Distance: 3
// (0) Jim Carrey Distance: 10
// (3) Jon Skeet Distance: 11
// (2) Bill Gates Distance: 12
Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));
results = FindEmployeeByNameFuzzy("Tormin Oma"); // reuse the variable declared above
// Returns:
// (1) Uma Thurman Distance: 4
// (3) Jon Skeet Distance: 7
// (0) Jim Carrey Distance: 8
// (2) Bill Gates Distance: 9
Console.WriteLine(string.Join("\r\n", results.Select(r => $"({r.Id}) {r.Name} Distance: {r.Distance}")));
Console.Read();
}
private static void Init() // prepare our lists
{
for (int i = 0; i < EmployeesList.Count; i++)
{
// Preparing the list of names for each employee - add special cases such as hyphenation here as well
var names = EmployeesList[i].ToLower().Split(new char[] { ' ' }).ToList();
employeesById.Add(i, names);
// This is not used here, but could come in handy if you want a unique index of names pointing to employee ids for optimisation:
foreach (var name in names)
{
if (employeeIdsByName.ContainsKey(name))
{
employeeIdsByName[name].Add(i);
}
else
{
employeeIdsByName.Add(name, new List<int>() { i });
}
}
}
}
private static List<SearchResult> FindEmployeeByNameFuzzy(string query)
{
var results = new List<SearchResult>();
// Notice we're splitting the search terms the same way as we split the employee names above (could be refactored out into a helper method)
var searchterms = query.ToLower().Split(new char[] { ' ' });
// Comparison with each employee
for (int i = 0; i < employeesById.Count; i++)
{
var r = new SearchResult() { Id = i, Name = EmployeesList[i] };
var employeenames = employeesById[i];
foreach (var searchterm in searchterms)
{
int min = searchterm.Length;
// for each search term get the min distance for all names for this employee
foreach (var name in employeenames)
{
var distance = LevenshteinDistance.Compute(searchterm, name);
min = Math.Min(min, distance);
}
// Sum the minimums for all search terms
r.Distance += min;
}
results.Add(r);
}
// Order by lowest distance first
return results.OrderBy(e => e.Distance).ToList();
}
}
public class SearchResult
{
public int Distance { get; set; }
public int Id { get; set; }
public string Name { get; set; }
}
public static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
}
Simply call Init() when you start, then call
var results = FindEmployeeByNameFuzzy(userquery);
to return an ordered list of the best matches.
Disclaimers: This code is not optimal and has only been briefly tested, doesn't check for nulls, could explode and kill a kitten, etc, etc. If you have a large number of employees then this could be very slow. There are several improvements that could be made, for example when looping over the Levenshtein algorithm you could drop out if the distance gets above the current minimum distance.
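The early-exit idea mentioned in the disclaimer can be sketched as follows. `BoundedLevenshtein.Compute` is a hypothetical helper, not part of the code above: it keeps only two rows of the matrix and gives up as soon as every entry in the current row exceeds `maxDistance`, returning `maxDistance + 1` to mean "too far".

```csharp
using System;

public static class BoundedLevenshtein
{
    // Returns the edit distance between s and t, or maxDistance + 1 as soon as
    // the distance is known to exceed maxDistance.
    public static int Compute(string s, string t, int maxDistance)
    {
        // The length difference is a lower bound on the distance.
        if (Math.Abs(s.Length - t.Length) > maxDistance) return maxDistance + 1;

        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            int rowMin = current[0];
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
                rowMin = Math.Min(rowMin, current[j]);
            }
            // The final cell can never be smaller than the smallest value in
            // any row, so once a whole row is above the limit we can stop.
            if (rowMin > maxDistance) return maxDistance + 1;

            var swap = previous; previous = current; current = swap;
        }
        return Math.Min(previous[t.Length], maxDistance + 1);
    }
}
```

In the inner loop of `FindEmployeeByNameFuzzy`, the current `min` could be passed as `maxDistance`, so names that cannot beat the best match so far are abandoned early.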

How to build the heliocentric algorithm?

Given this problem:
Consider two of the planets in the orbital system: Earth and Mars.
Assume the Earth orbits the Sun in exactly 365 Earth days, and Mars
orbits the Sun in exactly 687 Earth days. Thus the Earth’s orbit
starts at day 0 and continues to day 364, and then starts over at day
0. Mars orbits similarly, but on a 687-day time scale.
We would like to find out how long it will take until both planets are
on day 0 of their orbits simultaneously. Write a program that can
determine this.
Input Format:
The first line of input contains an integer N indicating the number of
test cases. N lines follow. Each test case contains two integers E and
M. These indicate which days Earth and Mars are at their respective
orbits.
Output Format:
For each case, display the case number followed by the smallest number
of days until the two planets will both be on day 0 of their orbits.
Follow the format of the sample output.
Sample Input 1
0 0
364 686
360 682
0 1
1 0
Sample Output 1
Case 1: 0
Case 2: 1
Case 3: 5
Case 4: 239075
Case 5: 11679
I tried solving the problem using modular arithmetic, but the result doesn't seem correct:
static string readInput;
static string firstStr = "";
static string secondStr = "";
static int firstInput;
static int secondInput;
static int testCases = 10;
static int caseNumber = 1;
static int outPut;
static void Main(string[] args) {
//recall runProcess as long caseNumber is less or equal testCases
while (caseNumber <= testCases) {
runProcess();
Console.WriteLine("Case " + caseNumber + ": " + outPut);
caseNumber++;
}
}
Read input from console:
/// <summary>
/// This is the main process, is extracted to void so we can recall it.
/// </summary>
public static void runProcess() {
readInput = Console.ReadLine();
if (readInput != null) {
for (int i = 0; i < readInput.Length; i++) {
secondStr = secondStr + readInput[i];
if (readInput[i] == ' ') {
firstStr = secondStr;
secondStr = "";
continue;
}
}
}
firstInput = Convert.ToInt32(firstStr);
secondInput = Convert.ToInt32(secondStr);
outPut = atZero(firstInput, secondInput);
}
/// <summary>
/// This method takes the input data from the console to later determine the zero point
/// </summary>
/// <param name="earthDays"></param>
/// <param name="marsDays"></param>
/// <returns></returns>
public static int atZero(int earthDays, int marsDays) {
int earthOrbit = 365;
int marsOrbit = 687;
int modEarth = earthOrbit;
int modMars = marsOrbit;
int earthDistinction = earthOrbit - earthDays;
int marsDistinction = marsOrbit - marsDays;
if ((modInverse(earthDistinction, marsDistinction, modMars)) == 0) {
return (modInverse(marsDistinction, earthDistinction, modEarth)) * marsDistinction;
} else {
return (modInverse(earthDistinction, marsDistinction, modMars)) * earthDistinction;
}
}
mod invert
/// <summary>
/// The method below takes a denominator, numerator and a mod to later invert the mod.
/// </summary>
/// <param name="denominator"></param>
/// <param name="numerator"></param>
/// <param name="mod"></param>
/// <returns>modInverse</returns>
static int modInverse(int denominator, int numerator, int mod) {
int i = mod, outputAll = 0, d = numerator;
while (denominator > 0) {
int divided = i / denominator, x = denominator;
denominator = i % x;
i = x;
x = d;
d = outputAll - divided * x;
outputAll = x;
}
outputAll %= mod;
if (outputAll < 0) outputAll = (outputAll + mod) % mod;
return outputAll;
}
Is there any way to solve the problem without modular arithmetic?
Thanks.
A straightforward way to calculate a solution could be this method:
private static int DaysTillBothAt0(int currentEarthDay, int currentMarsDay)
{
int result = 0, earth = currentEarthDay, mars = currentMarsDay;
while (earth != 0 || mars != 0)
{
result += 1;
earth = (earth + 1) % 365;
mars = (mars + 1) % 687;
}
return result;
}
This is of course not the fastest algorithm, nor a mathematically elegant one, but for the required data range (fewer than 365 × 687 = 250,755 iterations per test case) performance doesn't matter at all. (I don't know what your teacher expects, though.)
It simply counts the orbits forward until they meet at 0.
You can use this for your test cases like this:
result = DaysTillBothAt0(0, 0); // 0
result = DaysTillBothAt0(364, 686); // 1
result = DaysTillBothAt0(360, 682); // 5
result = DaysTillBothAt0(0, 1); // 239075
result = DaysTillBothAt0(1, 0); // 11679
One more solution for this problem
private static int DaysTillBothAt0(int currentEarthday, int currentMarsday) {
// Days until Earth is next at day 0 (0 if it is already there)
int count = (365 - currentEarthday) % 365;
currentMarsday = (currentMarsday + count) % 687;
// Jump a whole Earth year at a time until Mars is at day 0 as well
while (currentMarsday != 0) {
currentMarsday = (currentMarsday + 365) % 687;
count += 365;
}
return count;
}
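For comparison with the two loops above, the same answer can also be computed directly with the modular inverse the question was reaching for. This is a sketch under the problem's stated periods (365 and 687, which share no common factor); `OrbitAlignment`, `ModInverse` and `DaysUntilBothAtZero` are hypothetical names.

```csharp
using System;

public static class OrbitAlignment
{
    const long EarthPeriod = 365;
    const long MarsPeriod = 687;

    // Modular inverse of a modulo m via the extended Euclidean algorithm.
    static long ModInverse(long a, long m)
    {
        long t = 0, newT = 1, r = m, newR = a % m;
        while (newR != 0)
        {
            long q = r / newR;
            long tmpT = t - q * newT; t = newT; newT = tmpT;
            long tmpR = r - q * newR; r = newR; newR = tmpR;
        }
        if (r != 1) throw new ArgumentException("a and m must be coprime");
        return ((t % m) + m) % m;
    }

    public static long DaysUntilBothAtZero(long earthDay, long marsDay)
    {
        // Days each planet needs on its own to reach day 0 (0 if already there).
        long earthWait = (EarthPeriod - earthDay) % EarthPeriod;
        long marsWait = (MarsPeriod - marsDay) % MarsPeriod;
        // Solve earthWait + EarthPeriod * k == marsWait (mod MarsPeriod) for k.
        long diff = ((marsWait - earthWait) % MarsPeriod + MarsPeriod) % MarsPeriod;
        long k = diff * ModInverse(EarthPeriod, MarsPeriod) % MarsPeriod;
        return earthWait + EarthPeriod * k;
    }
}
```

`OrbitAlignment.DaysUntilBothAtZero(0, 1)` gives 239075 and `DaysUntilBothAtZero(1, 0)` gives 11679, matching the sample output, without iterating day by day.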

How to measure similarity of 2 strings aside from computing distance

I am creating a program that checks whether a word is a simplified word (txt, msg, etc.) and, if it is, finds the correct spelling, e.g. txt = text, msg = message. I am using the NHunspell Suggest method in C#, which suggests all possible results.
The problem is that if I input "txt", the result is text, tat, tot, etc. I don't know how to select the correct word. I used Levenshtein distance (C# - Compare String Similarity), but every result still has a distance of 1.
Input: txt
Result: text = 1, ext = 1 tit = 1
Can you help me get the meaning or the correct spelling of the simplified words?
Example: msg
I have tested your input with your sample data, and only text has a distance of 25, whereas the others have a distance of 33. Here's my code:
string input = "TXT";
string[] words = new[]{"text","tat","tot"};
var levenshtein = new Levenshtein();
const int maxDistance = 30;
var distanceGroups = words
.Select(w => new
{
Word = w,
Distance = levenshtein.iLD(w.ToUpperInvariant(), input)
})
.Where(x => x.Distance <= maxDistance)
.GroupBy(x => x.Distance)
.OrderBy(g => g.Key)
.ToList();
foreach (var topCandidate in distanceGroups.First())
Console.WriteLine("Word:{0} Distance:{1}", topCandidate.Word, topCandidate.Distance);
and here is the levenshtein class:
public class Levenshtein
{
///*****************************
/// Compute Levenshtein distance
/// Memory efficient version
///*****************************
public int iLD(String sRow, String sCol)
{
int RowLen = sRow.Length; // length of sRow
int ColLen = sCol.Length; // length of sCol
int RowIdx; // iterates through sRow
int ColIdx; // iterates through sCol
char Row_i; // ith character of sRow
char Col_j; // jth character of sCol
int cost; // cost
/// Test string length
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
// Step 1
if (RowLen == 0)
{
return ColLen;
}
if (ColLen == 0)
{
return RowLen;
}
/// Create the two vectors
int[] v0 = new int[RowLen + 1];
int[] v1 = new int[RowLen + 1];
int[] vTmp;
/// Step 2
/// Initialize the first vector
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
v0[RowIdx] = RowIdx;
}
// Step 3
/// Fore each column
for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
{
/// Set the 0'th element to the column number
v1[0] = ColIdx;
Col_j = sCol[ColIdx - 1];
// Step 4
/// Fore each row
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
Row_i = sRow[RowIdx - 1];
// Step 5
if (Row_i == Col_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
/// Find minimum
int m_min = v0[RowIdx] + 1;
int b = v1[RowIdx - 1] + 1;
int c = v0[RowIdx - 1] + cost;
if (b < m_min)
{
m_min = b;
}
if (c < m_min)
{
m_min = c;
}
v1[RowIdx] = m_min;
}
/// Swap the vectors
vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totally different
///
/// The vectors were swapped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
//System.Console.WriteLine("iDist=" + v0[RowLen]);
int max = System.Math.Max(RowLen, ColLen);
return ((100 * v0[RowLen]) / max);
}
///*****************************
/// Compute the min
///*****************************
private int Minimum(int a, int b, int c)
{
int mi = a;
if (b < mi)
{
mi = b;
}
if (c < mi)
{
mi = c;
}
return mi;
}
///*****************************
/// Compute Levenshtein distance
///*****************************
public int LD(String sNew, String sOld)
{
int[,] matrix; // matrix
int sNewLen = sNew.Length; // length of sNew
int sOldLen = sOld.Length; // length of sOld
int sNewIdx; // iterates through sNew
int sOldIdx; // iterates through sOld
char sNew_i; // ith character of sNew
char sOld_j; // jth character of sOld
int cost; // cost
/// Test string length
if (Math.Max(sNew.Length, sOld.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.LD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sNew.Length, sOld.Length) + "."));
// Step 1
if (sNewLen == 0)
{
return sOldLen;
}
if (sOldLen == 0)
{
return sNewLen;
}
matrix = new int[sNewLen + 1, sOldLen + 1];
// Step 2
for (sNewIdx = 0; sNewIdx <= sNewLen; sNewIdx++)
{
matrix[sNewIdx, 0] = sNewIdx;
}
for (sOldIdx = 0; sOldIdx <= sOldLen; sOldIdx++)
{
matrix[0, sOldIdx] = sOldIdx;
}
// Step 3
for (sNewIdx = 1; sNewIdx <= sNewLen; sNewIdx++)
{
sNew_i = sNew[sNewIdx - 1];
// Step 4
for (sOldIdx = 1; sOldIdx <= sOldLen; sOldIdx++)
{
sOld_j = sOld[sOldIdx - 1];
// Step 5
if (sNew_i == sOld_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
matrix[sNewIdx, sOldIdx] = Minimum(matrix[sNewIdx - 1, sOldIdx] + 1, matrix[sNewIdx, sOldIdx - 1] + 1, matrix[sNewIdx - 1, sOldIdx - 1] + cost);
}
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totally different
//System.Console.WriteLine("Dist=" + matrix[sNewLen, sOldLen]);
int max = System.Math.Max(sNewLen, sOldLen);
return (100 * matrix[sNewLen, sOldLen]) / max;
}
}
Not a complete solution, just a hopefully helpful suggestion...
It seems to me that people are unlikely to use simplifications that are as long as the correct word, so you could at least filter out all results whose length <= the input's length.
You really need to implement the SOUNDEX routine that exists in SQL. I've done that in the following code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Soundex
{
class Program
{
static char[] ignoreChars = new char[] { 'a', 'e', 'h', 'i', 'o', 'u', 'w', 'y' };
static Dictionary<char, int> charVals = new Dictionary<char, int>()
{
{'b',1},
{'f',1},
{'p',1},
{'v',1},
{'c',2},
{'g',2},
{'j',2},
{'k',2},
{'q',2},
{'s',2},
{'x',2},
{'z',2},
{'d',3},
{'t',3},
{'l',4},
{'m',5},
{'n',5},
{'r',6}
};
static void Main(string[] args)
{
Console.WriteLine(Soundex("txt"));
Console.WriteLine(Soundex("text"));
Console.WriteLine(Soundex("ext"));
Console.WriteLine(Soundex("tit"));
Console.WriteLine(Soundex("Cammmppppbbbeeelll"));
}
static string Soundex(string s)
{
s = s.ToLower();
StringBuilder sb = new StringBuilder();
sb.Append(s.First());
foreach (var c in s.Substring(1))
{
if (ignoreChars.Contains(c)) { continue; }
// if the previous character yields the same integer then skip it
if ((int)char.GetNumericValue(sb[sb.Length - 1]) == charVals[c]) { continue; }
sb.Append(charVals[c]);
}
return string.Join("", sb.ToString().Take(4)).PadRight(4, '0');
}
}
}
See, with this code, the only match out of the examples you gave would be text. Run the console application and you'll see the output (i.e. txt would match text).
One method I think programs like Word use to correct spelling is to apply NLP (natural language processing) techniques to work out the order of nouns and adjectives in the context of the spelling mistake. Comparing that with known sentence structures, they can estimate, say, a 70% chance that the misspelled word was a noun and use that information to filter the suggested corrections.
SharpNLP looks like a good library but I haven't had a chance to fiddle with it yet. To build a library of known sentence structures BTW, in uni we applied our algorithms to public domain books.
Check out Sam's SimMetrics library, which I found on SO (download here, docs here), for many more algorithms to use besides Levenshtein distance.
Expanding on my comment, you could use regex to search for a result that is an 'expansion' of the input. Something like this:
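// requires: using System.Text.RegularExpressions;
private int stringSimilarity(string input, string result)
{
string regexPattern = "";
// Build a pattern such as "t.*x.*t.*" so that the input's characters
// must appear in the result in the same order
foreach (char c in input)
regexPattern += Regex.Escape(c.ToString()) + ".*";
Match match = Regex.Match(result, regexPattern,
RegexOptions.IgnoreCase);
if (match.Success)
return 1;
else
return 0;
}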
Ignore the 1 and the 0 - I don't know how similarity valuing works.
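The "expansion" idea can also be checked without a regular expression: an abbreviation such as "txt" is a candidate for a word when its letters appear in that word in the same order. A minimal sketch, with `AbbreviationFilter`, `IsSubsequence` and `Filter` as hypothetical names; it also applies the length filter suggested earlier (keep only results longer than the input).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AbbreviationFilter
{
    // True when every character of 'abbreviation' occurs in 'word' in the same
    // order (case-insensitive), e.g. "txt" inside "text".
    public static bool IsSubsequence(string abbreviation, string word)
    {
        int matched = 0;
        foreach (char c in word)
        {
            if (matched < abbreviation.Length &&
                char.ToUpperInvariant(c) == char.ToUpperInvariant(abbreviation[matched]))
            {
                matched++;
            }
        }
        return matched == abbreviation.Length;
    }

    // Keeps only suggestions that are longer than the input and contain it
    // as a subsequence.
    public static List<string> Filter(string abbreviation, IEnumerable<string> suggestions)
    {
        return suggestions
            .Where(s => s.Length > abbreviation.Length && IsSubsequence(abbreviation, s))
            .ToList();
    }
}
```

For the suggestions text, tat, tot, ext and tit, `AbbreviationFilter.Filter("txt", suggestions)` keeps only text. When several candidates survive, the Levenshtein distance could still be used to rank them.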

Generating permutations of a set (most efficiently)

I would like to generate all permutations of a set (a collection), like so:
Collection: 1, 2, 3
Permutations: {1, 2, 3}
{1, 3, 2}
{2, 1, 3}
{2, 3, 1}
{3, 1, 2}
{3, 2, 1}
This isn't a question of "how", in general, but more about how most efficiently.
Also, I wouldn't want to generate ALL permutations and return them, but only generating a single permutation, at a time, and continuing only if necessary (much like Iterators - which I've tried as well, but turned out to be less efficient).
I've tested many algorithms and approaches and came up with this code, which is most efficient of those I tried:
public static bool NextPermutation<T>(T[] elements) where T : IComparable<T>
{
// More efficient to have a variable instead of accessing a property
var count = elements.Length;
// Indicates whether this is the last lexicographic permutation
var done = true;
// Go through the array from last to first
for (var i = count - 1; i > 0; i--)
{
var curr = elements[i];
// Check if the current element is less than the one before it
if (curr.CompareTo(elements[i - 1]) < 0)
{
continue;
}
// An element bigger than the one before it has been found,
// so this isn't the last lexicographic permutation.
done = false;
// Save the previous (bigger) element in a variable for more efficiency.
var prev = elements[i - 1];
// Have a variable to hold the index of the element to swap
// with the previous element (the to-swap element would be
// the smallest element that comes after the previous element
// and is bigger than the previous element), initializing it
// as the current index of the current item (curr).
var currIndex = i;
// Go through the array from the element after the current one to last
for (var j = i + 1; j < count; j++)
{
// Save into variable for more efficiency
var tmp = elements[j];
// Check if tmp suits the "next swap" conditions:
// Smallest, but bigger than the "prev" element
if (tmp.CompareTo(curr) < 0 && tmp.CompareTo(prev) > 0)
{
curr = tmp;
currIndex = j;
}
}
// Swap the "prev" with the new "curr" (the swap-with element)
elements[currIndex] = prev;
elements[i - 1] = curr;
// Reverse the order of the tail, in order to reset its lexicographic order
for (var j = count - 1; j > i; j--, i++)
{
var tmp = elements[j];
elements[j] = elements[i];
elements[i] = tmp;
}
// Break since we have got the next permutation
// The reason to have all the logic inside the loop is
// to prevent the need of an extra variable indicating "i" when
// the next needed swap is found (moving "i" outside the loop is a
// bad practice, and isn't very readable, so I preferred not doing
// that as well).
break;
}
// Return whether this has been the last lexicographic permutation.
return done;
}
Its usage: pass in an array of elements and get back a boolean indicating whether this was the last lexicographic permutation; the array itself is altered to the next permutation.
Usage example:
var arr = new[] {1, 2, 3};
PrintArray(arr);
while (!NextPermutation(arr))
{
PrintArray(arr);
}
The thing is that I'm not happy with the speed of the code.
Iterating over all permutations of an array of size 11 takes about 4 seconds.
Although it could be considered impressive, since the amount of possible permutations of a set of size 11 is 11! which is nearly 40 million.
Logically, with an array of size 12 it will take about 12 times more time, since 12! is 11! * 12, and with an array of size 13 it will take about 13 times more time than the time it took with size 12, and so on.
So you can easily understand how with an array of size 12 and more, it really takes a very long time to go through all permutations.
And I have a strong hunch that I can somehow cut that time by a lot (without switching to a language other than C#, because compiler optimization really does optimize pretty nicely, and I doubt I could optimize as well manually in assembly).
Does anyone know any other way to get that done faster?
Do you have any idea as to how to make the current algorithm faster?
Note that I don't want to use an external library or service in order to do that - I want to have the code itself and I want it to be as efficient as humanly possible.
This might be what you're looking for.
private static bool NextPermutation(int[] numList)
{
/*
Knuth's algorithm:
1. Find the largest index j such that a[j] < a[j + 1]. If no such index exists, the permutation is the last permutation.
2. Find the largest index l such that a[j] < a[l]. Since j + 1 is such an index, l is well defined and satisfies j < l.
3. Swap a[j] with a[l].
4. Reverse the sequence from a[j + 1] up to and including the final element a[n].
*/
var largestIndex = -1;
for (var i = numList.Length - 2; i >= 0; i--)
{
if (numList[i] < numList[i + 1]) {
largestIndex = i;
break;
}
}
if (largestIndex < 0) return false;
var largestIndex2 = -1;
for (var i = numList.Length - 1 ; i >= 0; i--) {
if (numList[largestIndex] < numList[i]) {
largestIndex2 = i;
break;
}
}
var tmp = numList[largestIndex];
numList[largestIndex] = numList[largestIndex2];
numList[largestIndex2] = tmp;
for (int i = largestIndex + 1, j = numList.Length - 1; i < j; i++, j--) {
tmp = numList[i];
numList[i] = numList[j];
numList[j] = tmp;
}
return true;
}
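The same four steps work for any comparable element type. A generic sketch follows (hypothetical class name `LexicographicPermutations`), using `Array.Reverse` for step 4. Like the method above, it returns `true` while another permutation exists, which is the opposite of the convention used in the question. No claim is made here about how its speed compares with the `int[]` version.

```csharp
using System;

public static class LexicographicPermutations
{
    public static bool NextPermutation<T>(T[] items) where T : IComparable<T>
    {
        // Step 1: rightmost index j with items[j] < items[j + 1].
        int j = items.Length - 2;
        while (j >= 0 && items[j].CompareTo(items[j + 1]) >= 0) j--;
        if (j < 0) return false; // already the last permutation

        // Step 2: rightmost index l with items[j] < items[l]; j + 1 qualifies,
        // so the loop always stops at some l > j.
        int l = items.Length - 1;
        while (items[j].CompareTo(items[l]) >= 0) l--;

        // Step 3: swap the two elements.
        T tmp = items[j]; items[j] = items[l]; items[l] = tmp;

        // Step 4: reverse the tail after position j.
        Array.Reverse(items, j + 1, items.Length - j - 1);
        return true;
    }
}
```

Starting from a sorted array, a `do { ... } while (LexicographicPermutations.NextPermutation(arr));` loop visits every permutation once, in lexicographic order.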
Update 2018-05-28:
A new multithreaded version (a lot faster) is available below as another answer.
Also an article about permutation: Permutations: Fast implementations and a new indexing algorithm allowing multithreading
A little bit too late...
According to recent tests (updated 2018-05-22)
Fastest is mine BUT not in lexicographic order
For fastest lexicographic order, Sani Singh Huttunen solution seems to be the way to go.
Performance test results for 10 items (10!) in release on my machine (millisecs):
Ouellet : 29
SimpleVar: 95
Erez Robinson : 156
Sani Singh Huttunen : 37
Pengyang : 45047
Performance test results for 13 items (13!) in release on my machine (seconds):
Ouellet : 48.437
SimpleVar: 159.869
Erez Robinson : 327.781
Sani Singh Huttunen : 64.839
Advantages of my solution:
Heap's algorithm (Single swap per permutation)
No multiplication (like some implementations seen on the web)
Inlined swap
Generic
No unsafe code
In place (very low memory usage)
No modulo (only first bit compare)
My implementation of Heap's algorithm:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
namespace WpfPermutations
{
/// <summary>
/// EO: 2016-04-14
/// Generator of all permutations of an array of anything.
/// Based on Heap's Algorithm. See: https://en.wikipedia.org/wiki/Heap%27s_algorithm#cite_note-3
/// </summary>
public static class Permutations
{
/// <summary>
/// Heap's algorithm to find all permutations. Non-recursive, more efficient.
/// </summary>
/// <param name="items">Items to permute in every possible way</param>
/// <param name="funcExecuteAndTellIfShouldStop">Called once per permutation; return true to stop early</param>
/// <returns>Returns true if cancelled</returns>
public static bool ForAllPermutation<T>(T[] items, Func<T[], bool> funcExecuteAndTellIfShouldStop)
{
int countOfItem = items.Length;
if (countOfItem <= 1)
{
return funcExecuteAndTellIfShouldStop(items);
}
var indexes = new int[countOfItem];
// Unnecessary (C# arrays are zero-initialized). Thanks to NetManage for the advice
// for (int i = 0; i < countOfItem; i++)
// {
// indexes[i] = 0;
// }
if (funcExecuteAndTellIfShouldStop(items))
{
return true;
}
for (int i = 1; i < countOfItem;)
{
if (indexes[i] < i)
{ // On the web there is an implementation with a multiplication which should be less efficient.
if ((i & 1) == 1) // if (i % 2 == 1) ... more efficient ??? At least the same.
{
Swap(ref items[i], ref items[indexes[i]]);
}
else
{
Swap(ref items[i], ref items[0]);
}
if (funcExecuteAndTellIfShouldStop(items))
{
return true;
}
indexes[i]++;
i = 1;
}
else
{
indexes[i++] = 0;
}
}
return false;
}
/// <summary>
/// This function is to show a linq way but is far less efficient
/// From: StackOverflow user: Pengyang : http://stackoverflow.com/questions/756055/listing-all-permutations-of-a-string-integer
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="list"></param>
/// <param name="length"></param>
/// <returns></returns>
static IEnumerable<IEnumerable<T>> GetPermutations<T>(IEnumerable<T> list, int length)
{
if (length == 1) return list.Select(t => new T[] { t });
return GetPermutations(list, length - 1)
.SelectMany(t => list.Where(e => !t.Contains(e)),
(t1, t2) => t1.Concat(new T[] { t2 }));
}
/// <summary>
/// Swap 2 elements of same type
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="a"></param>
/// <param name="b"></param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static void Swap<T>(ref T a, ref T b)
{
T temp = a;
a = b;
b = temp;
}
/// <summary>
/// Func to show how to call. It does a little test for an array of 4 items.
/// </summary>
public static void Test()
{
ForAllPermutation("123".ToCharArray(), (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
int[] values = new int[] { 0, 1, 2, 4 };
Console.WriteLine("Ouellet heap's algorithm implementation");
ForAllPermutation(values, (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
Console.WriteLine("Linq algorithm");
foreach (var v in GetPermutations(values, values.Length))
{
Console.WriteLine(String.Join("", v));
}
// Performance Heap's against Linq version : huge differences
int count = 0;
values = new int[10];
for (int n = 0; n < values.Length; n++)
{
values[n] = n;
}
Stopwatch stopWatch = Stopwatch.StartNew(); // must be started, otherwise the elapsed time is always 0
ForAllPermutation(values, (vals) =>
{
foreach (var v in vals)
{
count++;
}
return false;
});
stopWatch.Stop();
Console.WriteLine($"Ouellet heap's algorithm implementation {count} items in {stopWatch.ElapsedMilliseconds} millisecs");
count = 0;
stopWatch.Reset();
stopWatch.Start();
foreach (var vals in GetPermutations(values, values.Length))
{
foreach (var v in vals)
{
count++;
}
}
stopWatch.Stop();
Console.WriteLine($"Linq {count} items in {stopWatch.ElapsedMilliseconds} millisecs");
}
}
}
And this is my test code:
Task.Run(() =>
{
int[] values = new int[12];
for (int n = 0; n < values.Length; n++)
{
values[n] = n;
}
// Eric Ouellet Algorithm
int count = 0;
var stopwatch = new Stopwatch();
stopwatch.Reset();
stopwatch.Start();
Permutations.ForAllPermutation(values, (vals) =>
{
foreach (var v in vals)
{
count++;
}
return false;
});
stopwatch.Stop();
Console.WriteLine($"This {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
// SimpleVar Algorithm
count = 0;
stopwatch.Reset();
stopwatch.Start();
PermutationsSimpleVar permutations2 = new PermutationsSimpleVar();
permutations2.Permutate(1, values.Length, (int[] vals) =>
{
foreach (var v in vals)
{
count++;
}
});
stopwatch.Stop();
Console.WriteLine($"SimpleVar {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
// ErezRobinson Algorithm
count = 0;
stopwatch.Reset();
stopwatch.Start();
foreach(var vals in PermutationsErezRobinson.QuickPerm(values))
{
foreach (var v in vals)
{
count++;
}
};
stopwatch.Stop();
Console.WriteLine($"Erez Robinson {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
});
Usage examples:
ForAllPermutation("123".ToCharArray(), (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
int[] values = new int[] { 0, 1, 2, 4 };
ForAllPermutation(values, (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
Well, if you can handle it in C and then translate to your language of choice, you can't really go much faster than this, because the time will be dominated by print:
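#include <stdio.h>

static void swap(char *a, char *b) { char t = *a; *a = *b; *b = t; }

/* Prints every permutation of s[i..n-1], keeping s[0..i-1] fixed. */
void perm(char *s, int n, int i) {
    if (i >= n - 1) puts(s);
    else {
        perm(s, n, i + 1);
        for (int j = i + 1; j < n; j++) {
            swap(&s[i], &s[j]);
            perm(s, n, i + 1);
            swap(&s[i], &s[j]); /* undo the swap to restore the original order */
        }
    }
}

int main(void) {
    char s[] = "ABC"; /* writable copy; modifying a string literal is undefined behavior */
    perm(s, 3, 0);
    return 0;
}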
Update 2018-05-28, a new version, the fastest ... (multi-threaded)
Time taken for fastest algorithms
Needed: Sani Singh Huttunen's (fastest lexico) solution and my new OuelletLexico3, which supports indexing.
Indexing has 2 main advantages:
allows getting any permutation directly
allows multi-threading (derived from the first advantage)
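To illustrate what "indexing" means here, the following is a minimal sketch that maps a 0-based lexicographic index straight to its permutation using the factorial number system. It is not the OuelletLexico3 code from this answer; the names PermutationIndexSketch and GetPermutationAt are assumptions made for the example, and the input is assumed to be sorted with distinct values.

```csharp
using System;
using System.Collections.Generic;

public static class PermutationIndexSketch
{
    // Returns the permutation of 'sorted' at the given 0-based lexicographic index.
    public static T[] GetPermutationAt<T>(T[] sorted, long index)
    {
        int n = sorted.Length;
        if (n > 20) throw new ArgumentException("n! does not fit in a long for n > 20");

        long total = 1;
        for (int i = 2; i <= n; i++) total *= i;          // n!
        if (index < 0 || index >= total) throw new ArgumentOutOfRangeException(nameof(index));

        var remaining = new List<T>(sorted);              // values not placed yet, still sorted
        var result = new T[n];
        long block = total;
        for (int pos = 0; pos < n; pos++)
        {
            block /= (n - pos);                           // (n - pos - 1)!: permutations per choice here
            int pick = (int)(index / block);              // which remaining value goes at 'pos'
            index %= block;                               // index within that block
            result[pos] = remaining[pick];
            remaining.RemoveAt(pick);
        }
        return result;
    }
}
```

With direct access like this, each thread can jump to the first permutation of its own index range and then advance with a cheaper "next permutation" step, which is the combination the answer describes.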
Article: Permutations: Fast implementations and a new indexing algorithm allowing multithreading
On my machine (6 hyperthreaded cores: 12 threads), a Xeon E5-1660 0 @ 3.30GHz, testing the algorithms with an empty action for 13! items (time in millisecs):
53071: Ouellet (implementation of Heap)
65366: Sani Singh Huttunen (Fastest lexico)
11377: Mix OuelletLexico3 - Sani Singh Huttunen
A side note: sharing properties/variables between threads in the permutation action will strongly impact performance if they are modified (read / write). Doing so generates "false sharing" between threads, and you will not get the expected performance. I observed this behavior while testing: the problems appeared when I tried to increment a global variable holding the total count of permutations.
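One common way around that problem is to let each worker count in a private local variable and merge into the shared total only once at the end. The sketch below shows the pattern in isolation; LocalCounterSketch and CountWithLocalTotals are hypothetical names, and the inner loop merely stands in for "one permutation processed".

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LocalCounterSketch
{
    public static long CountWithLocalTotals(int workers, long itemsPerWorker)
    {
        long total = 0;
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                long local = 0;                        // private to this task, no cross-thread traffic
                for (long i = 0; i < itemsPerWorker; i++)
                {
                    local++;                           // stand-in for the per-permutation work
                }
                Interlocked.Add(ref total, local);     // one synchronized write per task
            });
        }
        Task.WaitAll(tasks);
        return total;
    }
}
```

The shared variable is then touched once per task instead of once per permutation, so the threads no longer fight over the same cache line in the hot loop.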
Usage:
PermutationMixOuelletSaniSinghHuttunen.ExecuteForEachPermutationMT(
new int[] {1, 2, 3, 4},
p =>
{
Console.WriteLine($"Values: {p[0]}, {p[1]}, {p[2]}, {p[3]}");
});
Code:
using System;
using System.Runtime.CompilerServices;
namespace WpfPermutations
{
public class Factorial
{
// ************************************************************************
protected static long[] FactorialTable = new long[21];
// ************************************************************************
static Factorial()
{
FactorialTable[0] = 1; // To prevent divide by 0
long f = 1;
for (int i = 1; i <= 20; i++)
{
f = f * i;
FactorialTable[i] = f;
}
}
// ************************************************************************
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long GetFactorial(int val) // a long can only support up to 20!
{
if (val > 20)
{
throw new OverflowException($"{nameof(Factorial)} only supports a factorial value <= 20");
}
return FactorialTable[val];
}
// ************************************************************************
}
}
namespace WpfPermutations
{
public class PermutationSaniSinghHuttunen
{
public static bool NextPermutation(int[] numList)
{
/*
Knuth's algorithm:
1. Find the largest index j such that a[j] < a[j + 1]. If no such index exists, the permutation is the last permutation.
2. Find the largest index l such that a[j] < a[l]. Since j + 1 is such an index, l is well defined and satisfies j < l.
3. Swap a[j] with a[l].
4. Reverse the sequence from a[j + 1] up to and including the final element a[n].
*/
var largestIndex = -1;
for (var i = numList.Length - 2; i >= 0; i--)
{
if (numList[i] < numList[i + 1])
{
largestIndex = i;
break;
}
}
if (largestIndex < 0) return false;
var largestIndex2 = -1;
for (var i = numList.Length - 1; i >= 0; i--)
{
if (numList[largestIndex] < numList[i])
{
largestIndex2 = i;
break;
}
}
var tmp = numList[largestIndex];
numList[largestIndex] = numList[largestIndex2];
numList[largestIndex2] = tmp;
for (int i = largestIndex + 1, j = numList.Length - 1; i < j; i++, j--)
{
tmp = numList[i];
numList[i] = numList[j];
numList[j] = tmp;
}
return true;
}
}
}
using System;
namespace WpfPermutations
{
public class PermutationOuelletLexico3<T> // Enable indexing
{
// ************************************************************************
private T[] _sortedValues;
private bool[] _valueUsed;
public readonly long MaxIndex; // long to support 20! or less
// ************************************************************************
public PermutationOuelletLexico3(T[] sortedValues)
{
_sortedValues = sortedValues;
Result = new T[_sortedValues.Length];
_valueUsed = new bool[_sortedValues.Length];
MaxIndex = Factorial.GetFactorial(_sortedValues.Length);
}
// ************************************************************************
public T[] Result { get; private set; }
// ************************************************************************
/// <summary>
/// Computes the permutation at the given sort index and stores it in Result (the buffer is re-used in order to save memory).
/// Sort index is 0-based and should be less than MaxIndex. Otherwise you get an exception.
/// </summary>
/// <param name="sortIndex">0-based index of the permutation to generate</param>
public void GetSortedValuesFor(long sortIndex)
{
int size = _sortedValues.Length;
if (sortIndex < 0)
{
throw new ArgumentException("sortIndex should be greater than or equal to 0.");
}
if (sortIndex >= MaxIndex)
{
throw new ArgumentException("sortIndex should be less than factorial(the length of items)");
}
for (int n = 0; n < _valueUsed.Length; n++)
{
_valueUsed[n] = false;
}
long factorielLower = MaxIndex;
for (int index = 0; index < size; index++)
{
long factorielBigger = factorielLower;
factorielLower = Factorial.GetFactorial(size - index - 1); // factorielBigger / inverseIndex;
int resultItemIndex = (int)(sortIndex % factorielBigger / factorielLower);
int correctedResultItemIndex = 0;
for(;;)
{
if (! _valueUsed[correctedResultItemIndex])
{
resultItemIndex--;
if (resultItemIndex < 0)
{
break;
}
}
correctedResultItemIndex++;
}
Result[index] = _sortedValues[correctedResultItemIndex];
_valueUsed[correctedResultItemIndex] = true;
}
}
// ************************************************************************
}
}
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace WpfPermutations
{
public class PermutationMixOuelletSaniSinghHuttunen
{
// ************************************************************************
private long _indexFirst;
private long _indexLastExclusive;
private int[] _sortedValues;
// ************************************************************************
public PermutationMixOuelletSaniSinghHuttunen(int[] sortedValues, long indexFirst = -1, long indexLastExclusive = -1)
{
if (indexFirst == -1)
{
indexFirst = 0;
}
if (indexLastExclusive == -1)
{
indexLastExclusive = Factorial.GetFactorial(sortedValues.Length);
}
if (indexFirst >= indexLastExclusive)
{
throw new ArgumentException($"{nameof(indexFirst)} should be less than {nameof(indexLastExclusive)}");
}
_indexFirst = indexFirst;
_indexLastExclusive = indexLastExclusive;
_sortedValues = sortedValues;
}
// ************************************************************************
public void ExecuteForEachPermutation(Action<int[]> action)
{
// Console.WriteLine($"Thread {System.Threading.Thread.CurrentThread.ManagedThreadId} started: {_indexFirst} {_indexLastExclusive}");
long index = _indexFirst;
PermutationOuelletLexico3<int> permutationOuellet = new PermutationOuelletLexico3<int>(_sortedValues);
permutationOuellet.GetSortedValuesFor(index);
action(permutationOuellet.Result);
index++;
int[] values = permutationOuellet.Result;
while (index < _indexLastExclusive)
{
PermutationSaniSinghHuttunen.NextPermutation(values);
action(values);
index++;
}
// Console.WriteLine($"Thread {System.Threading.Thread.CurrentThread.ManagedThreadId} ended: {DateTime.Now.ToString("yyyyMMdd_HHmmss_ffffff")}");
}
// ************************************************************************
public static void ExecuteForEachPermutationMT(int[] sortedValues, Action<int[]> action)
{
int coreCount = Environment.ProcessorCount; // Hyper-threading is taken into account (ex: 4 hyperthreaded cores = 8)
long itemsFactorial = Factorial.GetFactorial(sortedValues.Length);
long partCount = (long)Math.Ceiling((double)itemsFactorial / (double)coreCount);
long startIndex = 0;
var tasks = new List<Task>();
for (int coreIndex = 0; coreIndex < coreCount; coreIndex++)
{
long stopIndex = Math.Min(startIndex + partCount, itemsFactorial);
PermutationMixOuelletSaniSinghHuttunen mix = new PermutationMixOuelletSaniSinghHuttunen(sortedValues, startIndex, stopIndex);
Task task = Task.Run(() => mix.ExecuteForEachPermutation(action));
tasks.Add(task);
if (stopIndex == itemsFactorial)
{
break;
}
startIndex = startIndex + partCount;
}
Task.WaitAll(tasks.ToArray());
}
// ************************************************************************
}
}
The fastest permutation algorithm that I know of is the QuickPerm algorithm.
Here is an implementation; it uses yield return so you can iterate one permutation at a time, as required.
Code:
public static IEnumerable<IEnumerable<T>> QuickPerm<T>(this IEnumerable<T> set)
{
int N = set.Count();
int[] a = new int[N];
int[] p = new int[N];
var yieldRet = new T[N];
List<T> list = new List<T>(set);
int i, j, tmp; // Upper Index i; Lower Index j
for (i = 0; i < N; i++)
{
// initialize arrays; a[N] can be any type
a[i] = i + 1; // a[i] value is not revealed and can be arbitrary
p[i] = 0; // p[i] == i controls iteration and index boundaries for i
}
yield return list;
//display(a, 0, 0); // remove comment to display array a[]
i = 1; // setup first swap points to be 1 and 0 respectively (i & j)
while (i < N)
{
if (p[i] < i)
{
j = i%2*p[i]; // IF i is odd then j = p[i] otherwise j = 0
tmp = a[j]; // swap(a[j], a[i])
a[j] = a[i];
a[i] = tmp;
//MAIN!
for (int x = 0; x < N; x++)
{
yieldRet[x] = list[a[x]-1];
}
yield return yieldRet; // note: the same buffer is re-used for every permutation; copy it if you need to keep it
//display(a, j, i); // remove comment to display target array a[]
// MAIN!
p[i]++; // increase index "weight" for i by one
i = 1; // reset index i to 1 (assumed)
}
else
{
// otherwise p[i] == i
p[i] = 0; // reset p[i] to zero
i++; // set new index value for i (increase by one)
} // if (p[i] < i)
} // while(i < N)
}
Here is the fastest implementation I ended up with:
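using System;
using System.Runtime.InteropServices; // Marshal
using System.Threading; // Mutex, Thread
public class Permutations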
{
private readonly Mutex _mutex = new Mutex();
private Action<int[]> _action;
private Action<IntPtr> _actionUnsafe;
private unsafe int* _arr;
private IntPtr _arrIntPtr;
private unsafe int* _last;
private unsafe int* _lastPrev;
private unsafe int* _lastPrevPrev;
public int Size { get; private set; }
public bool IsRunning()
{
return this._mutex.SafeWaitHandle.IsClosed;
}
public bool Permutate(int start, int count, Action<int[]> action, bool async = false)
{
return this.Permutate(start, count, action, null, async);
}
public bool Permutate(int start, int count, Action<IntPtr> actionUnsafe, bool async = false)
{
return this.Permutate(start, count, null, actionUnsafe, async);
}
private unsafe bool Permutate(int start, int count, Action<int[]> action, Action<IntPtr> actionUnsafe, bool async = false)
{
if (!this._mutex.WaitOne(0))
{
return false;
}
var x = (Action)(() =>
{
this._actionUnsafe = actionUnsafe;
this._action = action;
this.Size = count;
this._arr = (int*)Marshal.AllocHGlobal(count * sizeof(int));
this._arrIntPtr = new IntPtr(this._arr);
// Fill the whole array with start, start + 1, ..., start + count - 1.
// (Filling the last three slots with count - 3 .. count - 1 produced duplicates whenever start != 0.)
for (var i = 0; i < count; i++)
{
this._arr[i] = start + i;
}
this._last = this._arr + count - 1;
this._lastPrev = this._last - 1;
this._lastPrevPrev = this._lastPrev - 1;
this.Permutate(count, this._arr);
});
if (!async)
{
x();
}
else
{
new Thread(() => x()).Start();
}
return true;
}
private unsafe void Permutate(int size, int* start)
{
if (size == 3)
{
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
Swap(this._last, this._lastPrevPrev);
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
Swap(this._last, this._lastPrevPrev);
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
return;
}
var sizeDec = size - 1;
var startNext = start + 1;
var usedStarters = 0;
for (var i = 0; i < sizeDec; i++)
{
this.Permutate(sizeDec, startNext);
usedStarters |= 1 << *start;
for (var j = startNext; j <= this._last; j++)
{
var mask = 1 << *j;
if ((usedStarters & mask) != mask)
{
Swap(start, j);
break;
}
}
}
this.Permutate(sizeDec, startNext);
if (size == this.Size)
{
this._mutex.ReleaseMutex();
}
}
private unsafe void DoAction()
{
if (this._action == null)
{
if (this._actionUnsafe != null)
{
this._actionUnsafe(this._arrIntPtr);
}
return;
}
var result = new int[this.Size];
fixed (int* pt = result)
{
var limit = pt + this.Size;
var resultPtr = pt;
var arrayPtr = this._arr;
while (resultPtr < limit)
{
*resultPtr = *arrayPtr;
resultPtr++;
arrayPtr++;
}
}
this._action(result);
}
private static unsafe void Swap(int* a, int* b)
{
var tmp = *a;
*a = *b;
*b = tmp;
}
}
Usage and testing performance:
var perms = new Permutations();
var sw1 = Stopwatch.StartNew();
perms.Permutate(0,
11,
(Action<int[]>)null); // Comment this line and...
//PrintArr); // Uncomment this line, to print permutations
sw1.Stop();
Console.WriteLine(sw1.Elapsed);
Printing method:
private static void PrintArr(int[] arr)
{
Console.WriteLine(string.Join(",", arr));
}
Going deeper:
I did not even think about this for a very long time, so I can only explain my code so much, but here's the general idea:
Permutations aren't lexicographic - this allows me to practically perform fewer operations between permutations.
The implementation is recursive, and when the "view" size is 3, it skips the complex logic and just performs 5 swaps to get the 6 permutations (or sub-permutations, if you will).
Because the permutations aren't in a lexicographic order, how can I decide which element to bring to the start of the current "view" (sub permutation)? I keep record of elements that were already used as "starters" in the current sub-permutation recursive call and simply search linearly for one that wasn't used in the tail of my array.
The implementation is for integers only, so to permute over a generic collection of elements you simply use the Permutations class to permute indices instead of your actual collection.
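The "permute indices instead of the collection" idea can be sketched as follows. To keep the snippet self-contained and free of unsafe code, a plain Heap's algorithm stands in for the Permutations class above; that substitution, and the names IndexPermutationSketch, ForEachIndexPermutation and PermuteItems, are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class IndexPermutationSketch
{
    // Stand-in integer permutation generator (Heap's algorithm).
    // Calls 'action' with the live index array, which must not be modified by the callee.
    static void ForEachIndexPermutation(int count, Action<int[]> action)
    {
        var idx = new int[count];
        for (int i = 0; i < count; i++) idx[i] = i;
        var c = new int[count];                      // per-level counters, zero-initialized
        action(idx);
        for (int i = 1; i < count;)
        {
            if (c[i] < i)
            {
                int j = (i & 1) == 1 ? c[i] : 0;     // odd level: swap with c[i], even level: swap with 0
                (idx[i], idx[j]) = (idx[j], idx[i]);
                action(idx);
                c[i]++;
                i = 1;
            }
            else
            {
                c[i++] = 0;
            }
        }
    }

    // Maps every index permutation onto the actual items, copying each result.
    public static List<T[]> PermuteItems<T>(IReadOnlyList<T> items)
    {
        var result = new List<T[]>();
        ForEachIndexPermutation(items.Count, idx =>
        {
            var perm = new T[idx.Length];
            for (int k = 0; k < idx.Length; k++) perm[k] = items[idx[k]];
            result.Add(perm);
        });
        return result;
    }
}
```

The item type never has to be comparable or copyable in any special way, since only the integer indices are shuffled.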
The Mutex is there just to ensure things don't get screwed up when the execution is asynchronous. Notice that you can pass an unsafe action (Action<IntPtr>) that will in turn get a pointer to the permuted array. You must not change the order of elements in that array (pointer)! If you want to, you should copy the array to a tmp array, or just use the safe action parameter, which takes care of that for you: the passed array is already a copy.
Note:
I have no idea how good this implementation really is - I haven't touched it in so long.
Test and compare to other implementations on your own, and let me know if you have any feedback!
Enjoy.
Here is a generic permutation finder that will iterate through every permutation of a collection and call an evaluation function. If the evaluation function returns true (it found the answer it was looking for), the permutation finder stops processing.
public class PermutationFinder<T>
{
private T[] items;
private Predicate<T[]> SuccessFunc;
private bool success = false;
private int itemsCount;
public void Evaluate(T[] items, Predicate<T[]> SuccessFunc)
{
this.items = items;
this.SuccessFunc = SuccessFunc;
this.itemsCount = items.Length; // Length avoids needing System.Linq for Count()
Recurse(0);
}
private void Recurse(int index)
{
T tmp;
if (index == itemsCount)
success = SuccessFunc(items);
else
{
for (int i = index; i < itemsCount; i++)
{
tmp = items[index];
items[index] = items[i];
items[i] = tmp;
Recurse(index + 1);
if (success)
break;
tmp = items[index];
items[index] = items[i];
items[i] = tmp;
}
}
}
}
Here is a simple usage example:
class Program
{
static void Main(string[] args)
{
new Program().Start();
}
void Start()
{
string[] items = new string[5];
items[0] = "A";
items[1] = "B";
items[2] = "C";
items[3] = "D";
items[4] = "E";
new PermutationFinder<string>().Evaluate(items, Evaluate);
Console.ReadLine();
}
public bool Evaluate(string[] items)
{
Console.WriteLine(string.Format("{0},{1},{2},{3},{4}", items[0], items[1], items[2], items[3], items[4]));
bool someCondition = false;
if (someCondition)
return true; // Tell the permutation finder to stop.
return false;
}
}
Here is a recursive implementation with complexity O(n * n!) [1], based on swapping the elements of an array. The array is initialised with the values 1, 2, ..., n.
using System;
namespace Exercise
{
class Permutations
{
static void Main(string[] args)
{
int setSize = 3;
FindPermutations(setSize);
}
//-----------------------------------------------------------------------------
/* Method: FindPermutations(n) */
private static void FindPermutations(int n)
{
int[] arr = new int[n];
for (int i = 0; i < n; i++)
{
arr[i] = i + 1;
}
int iEnd = arr.Length - 1;
Permute(arr, iEnd);
}
//-----------------------------------------------------------------------------
/* Method: Permute(arr) */
private static void Permute(int[] arr, int iEnd)
{
if (iEnd == 0)
{
PrintArray(arr);
return;
}
Permute(arr, iEnd - 1);
for (int i = 0; i < iEnd; i++)
{
swap(ref arr[i], ref arr[iEnd]);
Permute(arr, iEnd - 1);
swap(ref arr[i], ref arr[iEnd]);
}
}
}
}
On each recursive step we swap the last element of the current range (at index iEnd) with the element pointed to by the local variable of the for loop, recurse on the range shortened by one element, and then swap back to restore the array. Each value of the loop variable therefore puts a different element in the last position, which guarantees that every swap leads to unique permutations. The termination condition iEnd is initially set to the index of the last element of the array and is decremented on each level; when it reaches zero we print the array and terminate the recursion.
Here are the helper functions:
//-----------------------------------------------------------------------------
/*
Method: PrintArray()
*/
private static void PrintArray(int[] arr, string label = "")
{
Console.WriteLine(label);
Console.Write("{");
for (int i = 0; i < arr.Length; i++)
{
Console.Write(arr[i]);
if (i < arr.Length - 1)
{
Console.Write(", ");
}
}
Console.WriteLine("}");
}
//-----------------------------------------------------------------------------
/*
Method: swap(ref int a, ref int b)
*/
private static void swap(ref int a, ref int b)
{
int temp = a;
a = b;
b = temp;
}
[1] There are n! permutations of n elements to be printed, and printing each one takes O(n).
I would be surprised if there are really order of magnitude improvements to be found. If there are, then C# needs fundamental improvement. Furthermore doing anything interesting with your permutation will generally take more work than generating it. So the cost of generating is going to be insignificant in the overall scheme of things.
That said, I would suggest trying the following things. You have already tried iterators. But have you tried having a function that takes a closure as input, then calls that closure for each permutation found? Depending on the internal mechanics of C#, this may be faster.
Similarly, have you tried having a function that returns a closure that will iterate over a specific permutation?
With either approach, there are a number of micro-optimizations you can experiment with. For instance you can sort your input array, and after that you always know what order it is in. For example you can have an array of bools indicating whether that element is less than the next one, and rather than do comparisons, you can just look at that array.
There's an accessible introduction to the algorithms and survey of implementations in Steven Skiena's Algorithm Design Manual (chapter 14.4 in the second edition)
Skiena references D. Knuth. The Art of Computer Programming, Volume 4 Fascicle 2: Generating All Tuples and Permutations. Addison Wesley, 2005.
I created an algorithm slightly faster than Knuth's one:
11 elements:
mine: 0.39 seconds
Knuth's: 0.624 seconds
13 elements:
mine: 56.615 seconds
Knuth's: 98.681 seconds
Here's my code in Java:
public static void main(String[] args)
{
int n=11;
int a,b,c,i,tmp;
int end=(int)Math.floor(n/2);
int[][] pos = new int[end+1][2];
int[] perm = new int[n];
for(i=0;i<n;i++) perm[i]=i;
while(true)
{
//this is where you can use the permutations (perm)
i=0;
c=n;
while(pos[i][1]==c-2 && pos[i][0]==c-1)
{
pos[i][0]=0;
pos[i][1]=0;
i++;
c-=2;
}
if(i==end) System.exit(0);
a=(pos[i][0]+1)%c+i;
b=pos[i][0]+i;
tmp=perm[b];
perm[b]=perm[a];
perm[a]=tmp;
if(pos[i][0]==c-1)
{
pos[i][0]=0;
pos[i][1]++;
}
else
{
pos[i][0]++;
}
}
}
The problem is my algorithm only works for odd numbers of elements. I wrote this code quickly so I'm pretty sure there's a better way to implement my idea to get better performance, but I don't really have the time to work on it right now to optimize it and solve the issue when the number of elements is even.
It's one swap for every permutation and it uses a really simple way to know which elements to swap.
I wrote an explanation of the method behind the code on my blog: http://antoinecomeau.blogspot.ca/2015/01/fast-generation-of-all-permutations.html
As the author of this question was asking about an algorithm:
[...] generating a single permutation, at a time, and continuing only if necessary
I would suggest considering Steinhaus–Johnson–Trotter algorithm.
Steinhaus–Johnson–Trotter algorithm on Wikipedia
Beautifully explained here
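For reference, here is a minimal sketch of the Steinhaus–Johnson–Trotter idea in its plain "largest mobile element" form: each call to Next performs exactly one adjacent swap, so permutations can be produced one at a time and only as long as they are needed. The class name SjtPermuter and its members are assumptions made for this example, and it permutes the values 0..n-1 (use them as indices into your own collection).

```csharp
using System;

public sealed class SjtPermuter
{
    private readonly int[] _perm;   // current permutation of 0..n-1
    private readonly int[] _dir;    // direction of each value: -1 = looks left, +1 = looks right

    public SjtPermuter(int n)
    {
        _perm = new int[n];
        _dir = new int[n];
        for (int i = 0; i < n; i++) { _perm[i] = i; _dir[i] = -1; }
    }

    // Live view of the current permutation; copy it if it must outlive the next call.
    public int[] Current => _perm;

    // Advances to the next permutation with one adjacent swap; returns false when none is left.
    public bool Next()
    {
        // Find the largest "mobile" value: one whose neighbour in its direction is smaller.
        int mobilePos = -1;
        for (int i = 0; i < _perm.Length; i++)
        {
            int target = i + _dir[_perm[i]];
            if (target < 0 || target >= _perm.Length) continue;
            if (_perm[target] < _perm[i] && (mobilePos < 0 || _perm[i] > _perm[mobilePos]))
                mobilePos = i;
        }
        if (mobilePos < 0) return false;

        int value = _perm[mobilePos];
        int swapPos = mobilePos + _dir[value];
        (_perm[mobilePos], _perm[swapPos]) = (_perm[swapPos], _perm[mobilePos]);

        // Reverse the direction of every value larger than the one just moved.
        for (int v = value + 1; v < _perm.Length; v++) _dir[v] = -_dir[v];
        return true;
    }
}
```

A typical loop reads Current first and then calls Next in a do/while, stopping early whenever the caller has found what it was looking for. This version scans for the mobile element in O(n) per step; faster variants exist but are not shown here.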
It's 1 am and I was watching TV and thought of this same question, but with string values.
Given a word find all permutations. You can easily modify this to handle an array, sets, etc.
It took me a bit to work it out, but the solution I came up with was this:
string word = "abcd";
List<string> combinations = new List<string>();
for(int i=0; i<word.Length; i++)
{
for (int j = 0; j < word.Length; j++)
{
if (i < j)
combinations.Add(word[i] + word.Substring(j) + word.Substring(0, i) + word.Substring(i + 1, j - (i + 1)));
else if (i > j)
{
if(i== word.Length -1)
combinations.Add(word[i] + word.Substring(0, i));
else
combinations.Add(word[i] + word.Substring(0, i) + word.Substring(i + 1));
}
}
}
Here's the same code as above, but with some comments
string word = "abcd";
List<string> combinations = new List<string>();
//i is the first letter of the new word combination
for(int i=0; i<word.Length; i++)
{
for (int j = 0; j < word.Length; j++)
{
//add the first letter of the word, j is past i so we can get all the letters from j to the end
//then add all the letters from the front to i, then skip over i (since we already added that as the beginning of the word)
//and get the remaining letters from i+1 to right before j.
if (i < j)
combinations.Add(word[i] + word.Substring(j) + word.Substring(0, i) + word.Substring(i + 1, j - (i + 1)));
else if (i > j)
{
//if i is the very last letter, there are no letters after i to get
if(i== word.Length -1)
combinations.Add(word[i] + word.Substring(0, i));
//add i as the first letter of the word, then get all the letters up to i, skip i, and then add all the letters after i
else
combinations.Add(word[i] + word.Substring(0, i) + word.Substring(i + 1));
}
}
}
//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
/**
* http://marknelson.us/2002/03/01/next-permutation/
* Rearranges the elements into the lexicographically next greater permutation and returns true.
* When there are no more greater permutations left, the function eventually returns false.
*/
// next lexicographical permutation
template <typename T>
bool next_permutation(T &arr[], int firstIndex, int lastIndex)
{
int i = lastIndex;
while (i > firstIndex)
{
int ii = i--;
T curr = arr[i];
if (curr < arr[ii])
{
int j = lastIndex;
while (arr[j] <= curr) j--;
Swap(arr[i], arr[j]);
while (ii < lastIndex)
Swap(arr[ii++], arr[lastIndex--]);
return true;
}
}
return false;
}
//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
/**
* Swaps two variables or two array elements.
* using references/pointers to speed up swapping.
*/
template<typename T>
void Swap(T &var1, T &var2)
{
T temp;
temp = var1;
var1 = var2;
var2 = temp;
}
//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
// driver program to test above function
#define N 3
void OnStart()
{
int i, x[N];
for (i = 0; i < N; i++) x[i] = i + 1;
printf("The %i! possible permutations with %i elements:", N, N);
do
{
printf("%s", ArrayToString(x));
} while (next_permutation(x, 0, N - 1));
}
// Output:
// The 3! possible permutations with 3 elements:
// "1,2,3"
// "1,3,2"
// "2,1,3"
// "2,3,1"
// "3,1,2"
// "3,2,1"
// Permutations are the different ordered arrangements of an n-element
// array. An n-element array has exactly n! full-length permutations.
// This iterator object allows to iterate all full length permutations
// one by one of an array of n distinct elements.
// The iterator changes the given array in-place.
// Permutations('ABCD') => ABCD DBAC ACDB DCBA
// BACD BDAC CADB CDBA
// CABD ADBC DACB BDCA
// ACBD DABC ADCB DBCA
// BCAD BADC CDAB CBDA
// CBAD ABDC DCAB BCDA
// count of permutations = n!
// Heap's algorithm (Single swap per permutation)
// http://www.quickperm.org/quickperm.php
// https://stackoverflow.com/a/36634935/4208440
// https://en.wikipedia.org/wiki/Heap%27s_algorithm
// My implementation of Heap's algorithm:
template<typename T>
class PermutationsIterator
{
int b, e, n;
int c[32]; /* control array: mixed radix number in rising factorial base.
the i-th digit has base i, which means that the digit must be
strictly less than i. The first digit is always 0, the second
can be 0 or 1, the third 0, 1 or 2, and so on.
ArrayResize isn't strictly necessary, int c[32] would suffice
for most practical purposes. Also, it is much faster */
public:
PermutationsIterator(T &arr[], int firstIndex, int lastIndex)
{
this.b = firstIndex; // v.begin()
this.e = lastIndex; // v.end()
this.n = e - b + 1;
ArrayInitialize(c, 0);
}
// Rearranges the input array into the next permutation and returns true.
// When there are no more permutations left, the function returns false.
bool next(T &arr[])
{
// find index to update
int i = 1;
// reset all the previous indices that reached the maximum possible values
while (c[i] == i)
{
c[i] = 0;
++i;
}
// no more permutations left
if (i == n)
return false;
// generate next permutation
int j = (i & 1) == 1 ? c[i] : 0; // IF i is odd then j = c[i] otherwise j = 0.
Swap(arr[b + j], arr[b + i]); // generate a new permutation from previous permutation using a single swap
// Increment that index
++c[i];
return true;
}
};
I found this algorithm on Rosetta Code, and it is really the fastest one I have tried: http://rosettacode.org/wiki/Permutations#C
/* Boothroyd method; exactly N! swaps, about as fast as it gets */
void boothroyd(int *x, int n, int nn, int callback(int *, int))
{
int c = 0, i, t;
while (1) {
if (n > 2) boothroyd(x, n - 1, nn, callback);
if (c >= n - 1) return;
i = (n & 1) ? 0 : c;
c++;
t = x[n - 1], x[n - 1] = x[i], x[i] = t;
if (callback) callback(x, nn);
}
}
/* entry for Boothroyd method */
void perm2(int *x, int n, int callback(int*, int))
{
if (callback) callback(x, n);
boothroyd(x, n, n, callback);
}
If you just want to count the possible permutations, you can skip all the hard work above and use something like this (a contrived example in C#):
public static class ContrivedUtils
{
public static Int64 Permutations(char[] array)
{
if (null == array || array.Length == 0) return 0;
Int64 permutations = array.Length;
for (var pos = permutations; pos > 1; pos--)
permutations *= pos - 1;
return permutations;
}
}
You call it like this:
var count4 = ContrivedUtils.Permutations("1234".ToCharArray());
// count4 is 24
var count9 = ContrivedUtils.Permutations("123456789".ToCharArray());
// count9 is 362880
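One caveat with counting this way: an Int64 can only hold factorials up to 20!, so longer inputs overflow silently. A minimal sketch that sidesteps this with `System.Numerics.BigInteger` is below; the names `PermutationCount` and `Factorial` are hypothetical, and unlike the method above it returns 1 (not 0) for an empty input.

```csharp
using System;
using System.Numerics;

public static class PermutationCount
{
    // Hypothetical helper: n! as a BigInteger, so it does not overflow for n > 20.
    // Counts permutations of n DISTINCT elements.
    public static BigInteger Factorial(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        BigInteger result = BigInteger.One;
        for (int i = 2; i <= n; i++)
            result *= i;
        return result;
    }
}
```

For example, `PermutationCount.Factorial("123456789".Length)` gives the same 362880 as above, while `Factorial(21)` no longer wraps around.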
A simple recursive C# solution that works by swapping. The initial call must pass index 0:
static public void Permute<T>(List<T> input, List<List<T>> permutations, int index)
{
if (index >= input.Count - 1) // >= also stops the recursion when the list is empty
{
permutations.Add(new List<T>(input));
return;
}
Permute(input, permutations, index + 1);
for (int i = index+1 ; i < input.Count; i++)
{
//swap
T temp = input[index];
input[index] = input[i];
input[i] = temp;
Permute(input, permutations, index + 1);
//swap back
temp = input[index];
input[index] = input[i];
input[i] = temp;
}
}

How to calculate distance similarity measure of given 2 strings?

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
My aim is to determine how many characters I need to modify in the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would the percentage be? I always take the length of the real word, so it becomes 2 / 8 = 25%, which makes the distance similarity measure (DSM) of these two strings 75%.
How can I achieve this with performance being a key consideration?
I addressed this exact issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia, even when no maximum distance is supplied. When a maximum distance is supplied, the performance gain increases to 30x to 100x or more. Note a couple of key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum.
Use a rotating set of three arrays rather than a massive matrix, as in all the implementations I've seen elsewhere.
Make sure your arrays slice across the shorter word width.
Code (it works exactly the same if you replace int[] with String in the parameter declarations):
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>int.MaxValue if threshold exceeded; otherwise the Damerau-Levenshtein distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
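To see two of the performance points in isolation (rolling rows instead of a full matrix, and quitting early once a threshold is exceeded), here is a stripped-down sketch. It is deliberately not the code above: it computes plain Levenshtein distance without transpositions, takes strings directly, and the names `BoundedLevenshtein` and `Distance` are invented for this example.

```csharp
using System;

public static class BoundedLevenshtein
{
    // Hypothetical helper: Levenshtein distance with two rolling rows.
    // Returns int.MaxValue as soon as the distance must exceed maxDistance.
    public static int Distance(string a, string b, int maxDistance)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // The distance is at least the difference in lengths.
        if (Math.Abs(a.Length - b.Length) > maxDistance) return int.MaxValue;

        // Make 'a' the shorter string so the rows are as short as possible.
        if (a.Length > b.Length) { string t = a; a = b; b = t; }

        int[] previous = new int[a.Length + 1];
        int[] current = new int[a.Length + 1];
        for (int i = 0; i <= a.Length; i++) previous[i] = i;

        for (int j = 1; j <= b.Length; j++)
        {
            current[0] = j;
            int rowMin = current[0];
            for (int i = 1; i <= a.Length; i++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(current[i - 1] + 1, previous[i] + 1),
                    previous[i - 1] + cost);
                current[i] = best;
                if (best < rowMin) rowMin = best;
            }

            // The final distance can never be smaller than the smallest cell in a row.
            if (rowMin > maxDistance) return int.MaxValue;

            // Rotate the rows instead of allocating new ones.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        int result = previous[a.Length];
        return result > maxDistance ? int.MaxValue : result;
    }
}
```

With the question's example, `BoundedLevenshtein.Distance("hospital", "haspita", 5)` is 2, while passing a maximum of 1 returns `int.MaxValue`.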
What you are looking for is called edit distance or Levenshtein distance. The Wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
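To get from the raw distance to the percentage the question asks for, one option is to divide by the length of the real word, as the question does. The sketch below assumes that definition; the names `SimilaritySketch`, `Levenshtein` and `Similarity` are made up for the example, and clamping to 0 is my own choice for strings whose distance exceeds the real word's length.

```csharp
using System;

public static class SimilaritySketch
{
    // Plain full-matrix Levenshtein distance, kept here only so the example is self-contained.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Similarity as described in the question: 1 - distance / length of the real word.
    // Clamped to 0 (an assumption) so very different strings do not give a negative score.
    public static double Similarity(string realWord, string mistakenWord)
    {
        if (realWord.Length == 0) return mistakenWord.Length == 0 ? 1.0 : 0.0;
        int distance = Levenshtein(realWord, mistakenWord);
        return Math.Max(0.0, 1.0 - (double)distance / realWord.Length);
    }
}
```

For the question's example, `SimilaritySketch.Similarity("hospital", "haspita")` gives 0.75, matching the 75% worked out by hand.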
There are many string similarity and distance algorithms that can be used. Some of them (this list is not exhaustive) are:
Levenshtein
Needleman-Wunsch
Smith-Waterman
Smith-Waterman-Gotoh
Jaro, Jaro-Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge-Elkan
A library that contains implementations of all of these is SimMetrics, which has both Java and C# implementations.
I have found that Levenshtein and Jaro-Winkler are great for small differences between strings, such as:
Spelling mistakes; or
ö instead of o in a person's name.
However, when comparing something like article titles, where significant chunks of the text are the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic.
Compare these 2 titles (they are the same article, but worded differently by different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site, which provides an algorithm comparison for the strings, shows:
Levenshtein: 81
Smith-Waterman-Gotoh: 94
Jaro-Winkler: 78
Jaro-Winkler and Levenshtein are not as good as Smith-Waterman-Gotoh at detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro-Winkler gives a false positive, but Smith-Waterman-Gotoh does not:
Levenshtein: 54
Smith-Waterman-Gotoh: 49
Jaro-Winkler: 89
As Anastasiosyal pointed out, SimMetrics has the Java code for these algorithms. I had success using the SmithWatermanGotoh Java code from SimMetrics.
Here is my implementation of the Damerau-Levenshtein distance (the optimal string alignment variant, as the method name says), which returns not only the similarity coefficient but also the error locations in the corrected word (this feature can be used in text editors). It also supports different weights for each kind of error (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is part of my project, Yandex-Linguistics.NET.
I wrote some tests, and the method seems to be working, but comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing against every string. You might be able to write a specialized version of the code that short-circuits the calculation if the distance is greater than some threshold, but you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
So 2 out of 8 distinct trigrams (or 4 out of 10, counting both lists) match. If this works for you, you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
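A rough sketch of the trigram idea is below, assuming case-sensitive character trigrams and an in-memory index. All of the names (`TrigramSketch`, `Trigrams`, `Jaccard`, `BuildIndex`, `CountShared`) are invented for this illustration.

```csharp
using System.Collections.Generic;

public static class TrigramSketch
{
    // The set of distinct 3-character substrings of s (empty if s is shorter than 3).
    public static HashSet<string> Trigrams(string s)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + 3 <= s.Length; i++)
            set.Add(s.Substring(i, 3));
        return set;
    }

    // Shared trigrams divided by distinct trigrams across both strings.
    public static double Jaccard(string a, string b)
    {
        HashSet<string> shared = Trigrams(a);
        HashSet<string> other = Trigrams(b);
        var union = new HashSet<string>(shared);
        union.UnionWith(other);
        if (union.Count == 0) return 0.0;
        shared.IntersectWith(other);
        return (double)shared.Count / union.Count;
    }

    // Maps each trigram to the words that contain it.
    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> words)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string word in words)
        {
            foreach (string gram in Trigrams(word))
            {
                if (!index.TryGetValue(gram, out List<string> bucket))
                {
                    bucket = new List<string>();
                    index[gram] = bucket;
                }
                bucket.Add(word);
            }
        }
        return index;
    }

    // For a query, counts how many trigrams each indexed word shares with it.
    // Words sharing no trigram never appear, so they are never compared at all.
    public static Dictionary<string, int> CountShared(
        Dictionary<string, List<string>> index, string query)
    {
        var counts = new Dictionary<string, int>();
        foreach (string gram in Trigrams(query))
        {
            if (!index.TryGetValue(gram, out List<string> bucket)) continue;
            foreach (string word in bucket)
            {
                counts.TryGetValue(word, out int seen);
                counts[word] = seen + 1;
            }
        }
        return counts;
    }
}
```

For "English" and "England" this gives 2 shared trigrams out of 8 distinct ones, a Jaccard score of 0.25. The words returned by `CountShared` with a high enough count are the candidates worth passing to a Levenshtein comparison.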
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (the cost-to-date, or the cost-to-date + 1 if a change is required),
'the second min term is the cost of deleting a character from string 1 (cost-to-date + 1), and
'the third min term is the cost of inserting the correct character into string 1 (cost-to-date + 1)
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function
