Rabin-Karp string matching algorithm by rolling hash

Rabin-Karp string matching algorithm by rolling hash - c#

Here is an implementation of Rabin-Karp String matching algorithm in C#...
static void Main(string[] args)
{
string A = "String that contains a pattern.";
string B = "pattern";
ulong siga = 0;
ulong sigb = 0;
ulong Q = 100007;
ulong D = 256;
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] %Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j +B.Length)));
return;
}
}
}
Console.WriteLine("Not copied!");
}
but it has one problem if i change the position of second string than it shows result not copied but
string A = "String that contains a pattern.";
string B = "pattern";
here it shows not copied
string A = "String that contains a pattern.";
string B = "Matches contains a pattern ";
i want to check whether it is copy from first string or not even i would add something in it it or change the position but it shouldn't make difference so how to change it that it would just compare the hashes of each word in string than implement it............

Change
string B = "Matches contains a pattern ";
to
string B = "contains a pattern ";
and it will work

Related

inaccurate results with function to add an array of digits together

so i have this function:
static int[] AddArrays(int[] a, int[] b)
{
int length1 = a.Length;
int length2 = b.Length;
int carry = 0;
int max_length = Math.Max(length1, length2) + 1;
int[] minimum_arr = new int[max_length - length1].Concat(a).ToArray();
int[] maximum_arr = new int[max_length - length2].Concat(b).ToArray();
int[] new_arr = new int[max_length];
for (int i = max_length - 1; i >= 0; i--)
{
int first_digit = maximum_arr[i];
int second_digit = i - (max_length - minimum_arr.Length) >= 0 ? minimum_arr[i - (max_length - minimum_arr.Length)] : 0;
if (second_digit + first_digit + carry > 9)
{
new_arr[i] = (second_digit + first_digit + carry) % 10;
carry = 1;
}
else
{
new_arr[i] = second_digit + first_digit + carry;
carry = 0;
}
}
if (carry == 1)
{
int[] result = new int[max_length + 1];
result[0] = 1;
Array.Copy(new_arr, 0, result, 1, max_length);
return result;
}
else
{
return new_arr;
}
}
it basically takes 2 lists of digits and adds them together. the point of this is that each array of digits represent a number that is bigger then the integer limits. now this function is close to working the results get innacurate at certein places and i honestly have no idea why. for example if the function is given these inputs:
"1481298410984109284109481491284901249018490849081048914820948019" and
"3475893498573573849739857349873498739487598" (both of these are being turned into a array of integers before being sent to the function)
the expected output is:
1,481,298,410,984,109,284,112,957,384,783,474,822,868,230,706,430,922,413,560,435,617
and what i get is:
1,481,298,410,984,109,284,457,070,841,142,258,634,158,894,233,092,241,356,043,561,7
i would very much appreciate some help with this ive been trying to figure it out for hours and i cant seem to get it to work perfectly.

I suggest Reverse arrays a and b and use good old school algorithm:
static int[] AddArrays(int[] a, int[] b) {
Array.Reverse(a);
Array.Reverse(b);
int[] result = new int[Math.Max(a.Length, b.Length) + 1];
int carry = 0;
int value = 0;
for (int i = 0; i < Math.Max(a.Length, b.Length); ++i) {
value = (i < a.Length ? a[i] : 0) + (i < b.Length ? b[i] : 0) + carry;
result[i] = value % 10;
carry = value / 10;
}
if (carry > 0)
result[result.Length - 1] = carry;
else
Array.Resize(ref result, result.Length - 1);
// Let's restore a and b
Array.Reverse(a);
Array.Reverse(b);
Array.Reverse(result);
return result;
}
Demo:
string a = "1481298410984109284109481491284901249018490849081048914820948019";
string b = "3475893498573573849739857349873498739487598";
string c = string.Concat(AddArrays(
a.Select(d => d - '0').ToArray(),
b.Select(d => d - '0').ToArray()));
Console.Write(c);
Output:
1481298410984109284112957384783474822868230706430922413560435617

How can i optimize this problem while using only 3 for loops?

So the problem that I'm trying to optimize is to find and print all four-digit numbers of the type ABCD for which: A + B = C + D.
For example:
1001
1010
1102
etc.
I have used four for loops to solve this (one for every digit of the number).
for (int a = 1; a <= 9; a++)
{
for (int b = 0; b <= 9; b++)
{
for (int c = 0; c <= 9; c++)
{
for (int d = 0; d <= 9; d++)
{
if ((a + b) == (c + d))
{
Console.WriteLine(" " + a + " " + b + " " + c + " " + d);
}
}
}
}
}
My question is: how can I solve this using only 3 for loops?

Here's an option with two loops (though still 10,000 iterations), separating the pairs of digits:
int sumDigits(int input)
{
int result = 0;
while (input != 0)
{
result += input % 10;
input /= 10;
}
return result;
}
//optimized for max of two digits
int sumDigitsAlt(int input)
{
return (input % 10) + ( (input / 10) % 10);
}
// a and b
for (int i = 0; i <= 99; i++)
{
int sum = sumDigits(i);
// c and d
for (int j = 0; j <= 99; j++)
{
if (sum == sumDigits(j))
{
Console.WriteLine( (100 * i) + j);
}
}
}
I suppose the while() loop inside of sumDigits() might count as a third loop, but since we know we have at most two digits we could remove it if needed.
And, of course, we can use a similar tactic to do this with one loop which counts from 0 to 9999, and even that we can hide:
var numbers = Enumerable.Range(0, 10000).
Where(n => {
// there is no a/b
if (n < 100 && n == 0) return true;
if (n < 100) return false;
int sumCD = n % 10;
n /= 10;
sumCD += n % 10;
n /= 10;
int sumAB = n % 10;
n /= 10;
sumAB += n % 10;
return (sumAB == sumCD);
});

One approach is to write a method that takes in an integer and returns true if the integer is four digits and the sum of the first two equal the sum of the second two:
public static bool FirstTwoEqualLastTwo(int input)
{
if (input < 1000 || input > 9999) return false;
var first = input / 1000;
var second = (input - first * 1000) / 100;
var third = (input - first * 1000 - second * 100) / 10;
var fourth = input - first * 1000 - second * 100 - third * 10;
return (first + second) == (third + fourth);
}
Then you can write a single loop from 1000-9999 and output the numbers for which this is true with a space between each digit (not sure why that's the output, but it appears that's what you were doing in your sample code):
static void Main(string[] args)
{
for (int i = 1000; i < 10000; i++)
{
if (FirstTwoEqualLastTwo(i))
{
Console.WriteLine(" " + string.Join(" ", i.ToString().ToArray()));
}
}
Console.Write("Done. Press any key to exit...");
Console.ReadKey();
}

We can compute the value of d from the values of a,b,c.
for (int a = 1; a <= 9; a++)
{
for (int b = 0; b <= 9; b++)
{
for (int c = 0; c <= 9; c++)
{
if (a + b >= c && a + b <= 9 + c)
{
int d = a + b - c;
Console.WriteLine(" " + a + " " + b + " " + c + " " + d);
}
}
}
}
We can further optimize by changing the condition of the third loop to for (int c = max(0, a + b - 9); c <= a + b; c++) and getting rid of the if statement.

Regex or string compare with allowance of error

I'm trying to do a string compare in C# with some allowance for error. For example, if my search term is "Welcome", but if my comparison string (generated through OCR) is "We1come" and my error allowance is 20%, that should match. That part isn't so difficult using something like the Levenshtein algorithm. The hard part is making it work within a larger block of text, like a regular expression. For example, maybe my OCR result is "Hello. My name is Ben. We1come to my StackOverflow question.", I would like to pick out that We1come as a good result compared to my search term.

Took quite a while, but this works well. Fun problem :)
string PossibleString = PossibleString.ToString().ToLower();
string StaticText = StaticText.ToLower();
decimal PossibleStringLength = (PossibleString.Length);
decimal StaticTextLength = (StaticText.Length);
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero);
int LevenshteinDistance = LevenshteinAlgorithm(StaticText, PossibleString);
string PossibleResult = string.Empty;
if (LevenshteinDistance == PossibleStringLength - StaticTextLength)
{
// Perfect match. no need to calculate.
PossibleResult = StaticText;
}
else
{
int TextLengthBuffer = (int)StaticTextLength - 1;
int LowestLevenshteinNumber = 999999;
for (int i = 0; i < 3; i++) // Check for best results with same amount of characters as expected, as well as +/- 1
{
for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
{
string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
{
PossibleResult = possibleResult;
LowestLevenshteinNumber = lNumber;
}
}
TextLengthBuffer++;
}
}
public static int LevenshteinAlgorithm(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}

If it is somehow predictable how the OCR can miss letters, I would replace the letters in the search with misses.
If the search is Welcome, the regex would be (?i)We[l1]come.

Rabin Karp string matching algorithm

I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S?
What significance do these values carry with them?
static void Main(string[] args)
{
string A = "String that contains a pattern.";
string B = "pattern";
ulong siga = 0;
ulong sigb = 0;
ulong Q = 100007;
ulong D = 256;
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j + B.Length)));
return;
}
}
}
Console.WriteLine("Not copied!");
}

About the magic numbers Paul's answer is pretty clear.
As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.
The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n).
Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.
So, let's comment your code:
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
^ This piece computes the hash of pattern B (sigb), and the hashcode of the initial substring of A of the same length of B.
Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B).
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
^ Here's computed pow that is necessary to perform the rolling hash.
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j + B.Length)));
return;
}
}
}
^ Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).
If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.
¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.

The algorithm uses hashing for fast string comparison. Q and D are magic numbers that the coder probably arrived at with a little bit of trial and error and give a good distribution of hash values for this particular algorithm.
You can see these types of magic numbers used for hashing many places. The example below is the decompiled definition of the GetHashCode function of a .NET 2.0 string type:
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override unsafe int GetHashCode()
{
char* chrPointer = null;
int num1;
int num2;
fixed (string str = (string)this)
{
num1 = 352654597;
num2 = num1;
int* numPointer = chrPointer;
for (int i = this.Length; i > 0; i = i - 4)
{
num1 = (num1 << 5) + num1 + (num1 >> 27) ^ numPointer;
if (i <= 2)
{
break;
}
num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPointer + (void*)4;
numPointer = numPointer + (void*)8;
}
}
return num1 + num2 * 1566083941;
}
Here is another example from a R# generated GetHashcode override function for a sample type:
public override int GetHashCode()
{
unchecked
{
int result = (SomeStrId != null ? SomeStrId.GetHashCode() : 0);
result = (result*397) ^ (Desc != null ? Desc.GetHashCode() : 0);
result = (result*397) ^ (AnotherId != null ? AnotherId.GetHashCode() : 0);
return result;
}
}

Compare string similarity

What is the best way to compare two strings to see how similar they are?
Examples:
My String
My String With Extra Words
Or
My String
My Slightly Different String
What I am looking for is to determine how similar the first and second string in each pair is. I would like to score the comparison and if the strings are similar enough, I would consider them a matching pair.
Is there a good way to do this in C#?

static class LevenshteinDistance
{
public static int Compute(string s, string t)
{
if (string.IsNullOrEmpty(s))
{
if (string.IsNullOrEmpty(t))
return 0;
return t.Length;
}
if (string.IsNullOrEmpty(t))
{
return s.Length;
}
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// initialize the top and right of the table to 0, 1, 2, ...
for (int i = 0; i <= n; d[i, 0] = i++);
for (int j = 1; j <= m; d[0, j] = j++);
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
int min1 = d[i - 1, j] + 1;
int min2 = d[i, j - 1] + 1;
int min3 = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(Math.Min(min1, min2), min3);
}
}
return d[n, m];
}
}

If anyone was wondering what the C# equivalent of what #FrankSchwieterman posted is:
public static int GetDamerauLevenshteinDistance(string s, string t)
{
if (string.IsNullOrEmpty(s))
{
throw new ArgumentNullException(s, "String Cannot Be Null Or Empty");
}
if (string.IsNullOrEmpty(t))
{
throw new ArgumentNullException(t, "String Cannot Be Null Or Empty");
}
int n = s.Length; // length of s
int m = t.Length; // length of t
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
int[] p = new int[n + 1]; //'previous' cost array, horizontally
int[] d = new int[n + 1]; // cost array, horizontally
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
for (i = 0; i <= n; i++)
{
p[i] = i;
}
for (j = 1; j <= m; j++)
{
char tJ = t[j - 1]; // jth character of t
d[0] = j;
for (i = 1; i <= n; i++)
{
int cost = s[i - 1] == tJ ? 0 : 1; // cost
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.Min(Math.Min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
}
// copy current distance counts to 'previous row' distance counts
int[] dPlaceholder = p; //placeholder to assist in swapping p and d
p = d;
d = dPlaceholder;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}

I am comparing two sentences like this
string[] vs = string1.Split(new char[] { ' ', '-', '/', '(', ')' },StringSplitOptions.RemoveEmptyEntries);
string[] vs1 = string2.Split(new char[] { ' ', '-', '/', '(', ')' }, StringSplitOptions.RemoveEmptyEntries);
vs.Intersect(vs1, StringComparer.OrdinalIgnoreCase).Count();
Intersect gives you a set of identical word lists , I continue by looking at the count and saying if it is more than 1, these two sentences contain similar words.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Rabin-Karp string matching algorithm by rolling hash - c#

Change string B = "Matches contains a pattern "; to string B = "contains a pattern "; and it will work

Related

inaccurate results with function to add an array of digits together

How can i optimize this problem while using only 3 for loops?

Regex or string compare with allowance of error

Rabin Karp string matching algorithm

Compare string similarity

Categories

Resources