Boyer-Moore Practical in C#? - c#

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.
I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.
I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether? Are some algorithms just not practical in C#/.NET?
Below is the key code.
// Base for search classes
abstract class SearchBase
{
public const int InvalidIndex = -1;
protected string _pattern;
public SearchBase(string pattern) { _pattern = pattern; }
public abstract int Search(string text, int startIndex);
public int Search(string text) { return Search(text, 0); }
}
/// <summary>
/// A simplified Boyer-Moore implementation.
///
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
private byte[] _skipArray;
public BoyerMoore2(string pattern)
: base(pattern)
{
// TODO: To be replaced with multi-stage table
_skipArray = new byte[0x10000];
for (int i = 0; i < _skipArray.Length; i++)
_skipArray[i] = (byte)_pattern.Length;
for (int i = 0; i < _pattern.Length - 1; i++)
_skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
}
public override int Search(string text, int startIndex)
{
int i = startIndex;
// Loop while there's still room for search term
while (i <= (text.Length - _pattern.Length))
{
// Look if we have a match at this position
int j = _pattern.Length - 1;
while (j >= 0 && _pattern[j] == text[i + j])
j--;
if (j < 0)
{
// Match found
return i;
}
// Advance to next comparison
i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
}
// No match found
return InvalidIndex;
}
}
EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparison.Ordinal is that the method calls into unmanaged code that likely employs hand-optimized assembly language.
I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.
If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself. And its implementation is better than yours.
Without that flag it has to be careful using Boyer-Moore (and probably doesn't) because of potential issues around Unicode. In particular the possibility of Unicode causes the transition tables that Boyer-Moore uses to blow up.
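To make the table-size concern concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore that keeps its skip table in a Dictionary, so it only stores entries for characters that actually occur in the pattern instead of one entry per possible UTF-16 code unit. The class name HorspoolSketch is made up for illustration, and this is not a claim about what String.IndexOf does internally; a dictionary lookup per shift may well be slower than the array in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post: a Boyer-Moore-Horspool
// search whose skip table only holds characters present in the pattern.
static class HorspoolSketch
{
    // Assumes startIndex >= 0. Comparison is ordinal (char by char).
    public static int IndexOf(string text, string pattern, int startIndex = 0)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (pattern == null) throw new ArgumentNullException(nameof(pattern));
        if (pattern.Length == 0) return startIndex <= text.Length ? startIndex : -1;

        int m = pattern.Length;

        // Skip distance for every pattern character except the last one.
        var skip = new Dictionary<char, int>();
        for (int k = 0; k < m - 1; k++)
            skip[pattern[k]] = m - 1 - k;

        int i = startIndex;
        while (i <= text.Length - m)
        {
            // Compare right to left at the current alignment.
            int j = m - 1;
            while (j >= 0 && text[i + j] == pattern[j])
                j--;
            if (j < 0)
                return i;

            // Shift by the distance recorded for the text character aligned
            // with the pattern's last position; characters that are not in
            // the table allow a shift of the full pattern length.
            i += skip.TryGetValue(text[i + m - 1], out int s) ? s : m;
        }
        return -1;
    }
}
```

Because the shift is stored as an int, this version also avoids the byte overflow that the question's skip array would hit for patterns longer than 255 characters.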

I made some small changes to your code, and made a different implementation to the Boyer-Moore algorithm and got better results.
I got the idea for this implementation from here
But to be honest, I would expect to reach a higher speed compared to IndexOf.
class SearchResults
{
public int Matches { get; set; }
public long Ticks { get; set; }
}
abstract class SearchBase
{
public const int InvalidIndex = -1;
protected string _pattern;
protected string _text;
public SearchBase(string text, string pattern) { _text = text; _pattern = pattern; }
public abstract int Search(int startIndex);
}
internal class BoyerMoore3 : SearchBase
{
readonly byte[] textBytes;
readonly byte[] patternBytes;
readonly int valueLength;
readonly int patternLength;
private readonly int[] badCharacters = new int[256];
private readonly int lastPatternByte;
public BoyerMoore3(string text, string pattern) : base(text, pattern)
{
textBytes = Encoding.UTF8.GetBytes(text);
patternBytes = Encoding.UTF8.GetBytes(pattern);
valueLength = textBytes.Length;
patternLength = patternBytes.Length;
for (int i = 0; i < 256; ++i)
badCharacters[i] = patternLength;
lastPatternByte = patternLength - 1;
for (int i = 0; i < lastPatternByte; ++i)
badCharacters[patternBytes[i]] = lastPatternByte - i;
}
public override int Search(int startIndex)
{
int index = startIndex;
while (index <= (valueLength - patternLength))
{
for (int i = lastPatternByte; textBytes[index + i] == patternBytes[i]; --i)
{
if (i == 0)
return index;
}
index += badCharacters[textBytes[index + lastPatternByte]];
}
// Text not found
return InvalidIndex;
}
}
Changed code from Form1:
private void RunSearch(string pattern, SearchBase search, SearchResults results)
{
var timer = new Stopwatch();
// Start timer
timer.Start();
// Find all matches
int pos = search.Search(0);
while (pos != -1)
{
results.Matches++;
pos = search.Search(pos + pattern.Length);
}
// Stop timer
timer.Stop();
// Add to total Ticks
results.Ticks += timer.ElapsedTicks;
}

Related

Using Dictionary to map byte to BitArray

I am developing an application that implements a simple substitution cypher. For speed reasons (and because it was one of the requirements) I need to use BitArray for encryption and decryption. The user will enter a "coded" alphabet and I need to map it in some way, so I chose Dictionary since it uses a hash table and has O(1) complexity when the user accesses data. But now I find myself wondering how I can do this when the "coded" alphabet is initialized like this:
BitArray codedAlphabet = new BitArray(bytes);
This would force me to use two for loops to achieve my goal. Does anyone have a different idea? Hopefully you understand what I am trying to achieve. Thank you in advance.
Code:
namespace Harpokrat.EncryptionAlgorithms
{
// Simple substitution cypher algorithm
public class SimpleSubstitutionStrategy : IEncryptionStrategy
{
private string alphabet; // message to be encrypted
private string coded; // this will be the key (input from file or from UI)
private ArrayList AlphabetBackUp = new ArrayList();
private ArrayList CodedBackUp = new ArrayList();
#region Properties
public string Alphabet
{
get
{
return this.alphabet;
}
set
{
this.alphabet = value;
foreach (char c in this.alphabet.ToCharArray())
{
this.AlphabetBackUp.Add(c);
}
}
}
public string Coded
{
get
{
return this.coded;
}
set
{
this.coded = "yqmnnsgwatkgetwtawuiqwemsg"; //for testing purposes
foreach (char c in this.coded.ToCharArray())
{
this.CodedBackUp.Add(c);
}
}
}
#endregion
public string Decrypt(string message)
{
message = message.ToLower();
string result = "";
for (int i = 0; i < message.Length; i++)
{
int indexOfSourceChar = CodedBackUp.IndexOf(message[i]);
if (indexOfSourceChar < 0 || (indexOfSourceChar > alphabet.Length - 1))
{
result += "#";
}
else
{
result += alphabet[indexOfSourceChar].ToString();
}
}
return result;
}
public string Encrypt(string message)
{
message = message.ToLower();
string result = "";
for(int i = 0; i < message.Length; i++)
{
int indexOfSourceChar = AlphabetBackUp.IndexOf(message[i]);
if (indexOfSourceChar < 0 || (indexOfSourceChar > coded.Length - 1))
{
result += "#";
}
else
{
result += coded[indexOfSourceChar].ToString();
}
}
return result;
}
}
}
I'd recommend a single method to set alphabet and coded at the same time, that internally builds the two dictionaries you'd need to do Encryption and Decryption, and a helper method to do a get-or-return-default ('#' in your case) on them.
That way you can implement a single function that does either Encryption or Decryption depending on the dictionary passed in (which could be implemented in a single line of code if you're comfortable using LINQ).
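A rough sketch of that suggestion might look like the following. The class name SubstitutionMaps and the method names are made up for illustration, and it assumes the coded alphabet is a true permutation (no repeated letters); with repeated letters the decrypt map keeps the last pairing rather than the first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the two-dictionary approach; names are illustrative.
class SubstitutionMaps
{
    private Dictionary<char, char> encryptMap = new Dictionary<char, char>();
    private Dictionary<char, char> decryptMap = new Dictionary<char, char>();

    // Sets both alphabets at once and rebuilds the two lookup tables.
    public void SetAlphabets(string alphabet, string coded)
    {
        if (alphabet.Length != coded.Length)
            throw new ArgumentException("Alphabets must have the same length.");

        encryptMap = new Dictionary<char, char>();
        decryptMap = new Dictionary<char, char>();
        for (int i = 0; i < alphabet.Length; i++)
        {
            encryptMap[alphabet[i]] = coded[i];
            decryptMap[coded[i]] = alphabet[i];
        }
    }

    // Get-or-default lookup for every character: unknown characters become '#'.
    private static string Translate(string message, Dictionary<char, char> map)
    {
        return new string(message.ToLower()
            .Select(c => map.TryGetValue(c, out char mapped) ? mapped : '#')
            .ToArray());
    }

    public string Encrypt(string message) => Translate(message, encryptMap);
    public string Decrypt(string message) => Translate(message, decryptMap);
}
```

Each character is then a single hash lookup, instead of the linear ArrayList.IndexOf scan in the question's code.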

How to auto-increment number and letter to generate a string sequence wise in c#

I have to generate strings like AAA0009. Once it reaches AAA0009, it should generate AAA0010 to AAA0019 and so on up to AAA9999; after AAA9999 it should give AAB0000 to AAB9999, and so on up to ZZZ9999.
I want to use static class and static variables so that it can auto increment by itself on every hit.
I have tried a few things but got nowhere close, so please help me out. Thanks.
Thanks for being instructive. As I said already, I was trying, but you want to hand out negatives without even knowing the details:
Code:
public class GenerateTicketNumber
{
private static int num1 = 0;
public static string ToBase36()
{
const string base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
var sb = new StringBuilder(9);
do
{
sb.Insert(0, base36[(byte)(num1 % 36)]);
num1 /= 36;
} while (num1 != 0);
var paddedString = "#T" + sb.ToString().PadLeft(8, '0');
num1 = num1 + 1;
return paddedString;
}
}
Above is the code. It generates a sequence, but not the way I want. I will use it anyway, and thanks for the help.
Though there's already an accepted answer, I would like to share this one.
P.S. I do not claim that this is the best approach, but in my previous work we made something similar using Azure Table Storage, which is a NoSQL database (FYI), and it works.
1.) Create a table to store your running ticket number.
public class TicketNumber
{
public string Type { get; set; } // Maybe you want to have different types of ticket?
public string AlphaPrefix { get; set; }
public string NumericPrefix { get; set; }
public TicketNumber()
{
this.AlphaPrefix = "AAA";
this.NumericPrefix = "0001";
}
public void Increment()
{
int num = int.Parse(this.NumericPrefix);
if (num >= 9999) // roll over only after 9999 has been issued
{
num = 1;
int i = 2; // We are assuming that there are only 3 characters
bool isMax = this.AlphaPrefix == "ZZZ";
if (isMax)
{
this.AlphaPrefix = "AAA"; // reset
}
else
{
StringBuilder sb = new StringBuilder(this.AlphaPrefix);
// Carry: reset each trailing 'Z' to 'A', then bump the next letter to the left
while (sb[i] == 'Z')
{
sb[i] = 'A';
i--;
}
sb[i] = (char)(sb[i] + 1);
this.AlphaPrefix = sb.ToString();
}
}
else
{
num++;
}
this.NumericPrefix = num.ToString().PadLeft(4, '0');
}
public override string ToString()
{
return this.AlphaPrefix + this.NumericPrefix;
}
}
2.) Make sure you perform row-level locking and issue an error when it fails.
Here's the Oracle syntax:
SELECT * FROM TICKETNUMBER WHERE TYPE = 'TYPE' FOR UPDATE NOWAIT;
This query locks the row and returns an error if the row is currently locked by another session.
We need this to make sure that even if you have millions of users generating a ticket number, it will not mess up the sequence.
Just make sure to save the new ticket number before you perform a COMMIT.
I forgot the MSSQL version of this but I recall using WITH (ROWLOCK) or something. Just google it.
3.) Working example:
static void Main()
{
TicketNumber ticketNumber = new TicketNumber();
ticketNumber.AlphaPrefix = "ZZZ";
ticketNumber.NumericPrefix = "9999";
for (int i = 0; i < 10; i++)
{
Console.WriteLine(ticketNumber);
ticketNumber.Increment();
}
Console.Read();
}
Output:
Looking at your code that you've provided, it seems that you're backing this with a number and just want to convert that to a more user-friendly text representation.
You could try something like this:
private static string ValueToId(int value)
{
var parts = new List<string>();
int numberPart = value % 10000;
parts.Add(numberPart.ToString("0000"));
value /= 10000;
for (int i = 0; i < 3 || value > 0; ++i)
{
parts.Add(((char)(65 + (value % 26))).ToString());
value /= 26;
}
return string.Join(string.Empty, parts.AsEnumerable().Reverse().ToArray());
}
It takes the last 4 decimal digits of the value and uses them as is, and then converts the remainder of the value into the letters A-Z.
So 9999 becomes AAA9999, 10000 becomes AAB0000, and 270000 becomes ABB0000.
If the number is big enough that it exceeds 3 characters, it will add more letters at the start.
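Since the question asks for a static class that advances on every call, the same arithmetic could be wrapped roughly like this. It is only a sketch: the name TicketSequence is made up, the counter lives in process memory (so it restarts with the application), and it rejects values past ZZZ9999 instead of adding more letters.

```csharp
using System;
using System.Threading;

// Hypothetical static generator; the name TicketSequence is illustrative.
static class TicketSequence
{
    private const int MaxValue = 26 * 26 * 26 * 10000 - 1; // corresponds to ZZZ9999
    private static int current = -1;                       // so the first ticket is AAA0000

    // Converts a counter value into three letters followed by four digits.
    public static string Format(int value)
    {
        if (value < 0 || value > MaxValue)
            throw new ArgumentOutOfRangeException(nameof(value));

        int number = value % 10000;   // last four digits
        int letters = value / 10000;  // remainder, written in base 26 as A-Z

        char c3 = (char)('A' + letters % 26);
        char c2 = (char)('A' + letters / 26 % 26);
        char c1 = (char)('A' + letters / 676);
        return new string(new[] { c1, c2, c3 }) + number.ToString("0000");
    }

    // Thread-safe increment; each call returns the next id in sequence.
    public static string Next()
    {
        return Format(Interlocked.Increment(ref current));
    }
}
```

If the sequence has to survive restarts or be shared between servers, the counter still needs to be persisted somewhere, as the row-locking answer above describes.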
Here's an example of how you could go about implementing it
void Main()
{
string template = @"AAAA00";
var templateChars = template.ToCharArray();
for (int i = 0; i < 100000; i++)
{
templateChars = IncrementCharArray(templateChars);
Console.WriteLine(string.Join("",templateChars ));
}
}
public static char Increment(char val)
{
if(val == '9') return 'A';
if(val == 'Z') return '0';
return ++val;
}
public static char[] IncrementCharArray(char[] val)
{
if (val.All(chr => chr == 'Z'))
{
var newArray = new char[val.Length + 1];
for (int i = 0; i < newArray.Length; i++)
{
newArray[i] = '0';
}
return newArray;
}
int length = val.Length;
while (length > -1)
{
char lastVal = val[--length];
val[length] = Increment(lastVal);
if ( val[length] != '0') break;
}
return val;
}

Improve string parse performance

Before we start, I am aware of the term "premature optimization". However the following snippets have proven to be an area where improvements can be made.
Alright. We currently have some network code that works with string based packets. I am aware that using strings for packets is stupid, crazy and slow. Sadly, we don't have any control over the client and so have to use strings.
Each packet is terminated by \0\r\n and we currently use a StreamReader/Writer to read individual packets from the stream. Our main bottleneck comes from two places.
Firstly: We need to trim that nasty little null-byte off the end of the string. We currently use code like the following:
line = await reader.ReadLineAsync();
line = line.Replace("\0", ""); // PERF this allocates a new string
if (string.IsNullOrWhiteSpace(line))
return null;
var packet = ClientPacket.Parse(line, cl.Client.RemoteEndPoint);
As you can see by that cute little comment, we have a GC performance issue when trimming the '\0'. There are numerous different ways you could trim a '\0' off the end of a string, but all will result in the same GC hammering we get. Because strings are immutable, every such operation creates a new string object. As our server handles 1000+ connections all communicating at around 25-40 packets per second (it's a game server), this GC matter is becoming an issue. So here comes my first question: What is a more efficient way of trimming that '\0' off the end of our string? By efficient I don't only mean speed, but also GC-wise (ultimately I'd like a way to get rid of it without creating a new string object!).
Our second issue also stems from GC land. Our code looks somewhat like the following:
private static string[] emptyStringArray = new string[] { }; // so we dont need to allocate this
public static ClientPacket Parse(string line, EndPoint from)
{
const char seperator = '|';
var first_seperator_pos = line.IndexOf(seperator);
if (first_seperator_pos < 1)
{
return new ClientPacket(NetworkStringToClientPacketType(line), emptyStringArray, from);
}
var name = line.Substring(0, first_seperator_pos);
var type = NetworkStringToClientPacketType(name);
if (line.IndexOf(seperator, first_seperator_pos + 1) < 1)
return new ClientPacket(type, new string[] { line.Substring(first_seperator_pos + 1) }, from);
return new ClientPacket(type, line.Substring(first_seperator_pos + 1).Split(seperator), from);
}
(Where NetworkStringToClientPacketType is simply a big switch-case block)
As you can see we already do a few things to handle GC. We reuse a static "empty" string and we check for packets with no parameters. My only issue here is that we are using Substring a lot, and even chain a Split on the end of a Substring. This leads to (for an average packet) almost 20 new string objects being created and 12 being disposed of EACH PACKET. This causes a lot of performance issues when load increases anything over 400 users (we gotz fast ram :3)
Has anyone had an experience with this sort of thing before or could give us some pointers into what to look into next? Maybe some magical classes or some nifty pointer magic?
(PS. StringBuilder doesn't help as we aren't building strings, we are generally splitting them.)
We currently have some ideas based on an index based system where we store the index and length of each parameter rather than splitting them. Thoughts?
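For what it's worth, the index-and-length idea could be sketched along these lines. FieldRange and PacketScanner are made-up names, and the sketch treats every '|' as a separator and simply ignores trailing '\0' characters, which is not exactly what the Parse method above does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types illustrating the index/length idea; not our production code.
struct FieldRange
{
    public int Start;
    public int Length;
    public FieldRange(int start, int length) { Start = start; Length = length; }
}

static class PacketScanner
{
    // Fills 'fields' with the position of each '|'-separated field and
    // ignores trailing '\0' characters, without creating any substrings.
    // The caller can reuse the same list for every packet.
    public static void Scan(string line, List<FieldRange> fields)
    {
        fields.Clear();

        int end = line.Length;
        while (end > 0 && line[end - 1] == '\0')
            end--;

        int start = 0;
        for (int i = 0; i < end; i++)
        {
            if (line[i] == '|')
            {
                fields.Add(new FieldRange(start, i - start));
                start = i + 1;
            }
        }
        fields.Add(new FieldRange(start, end - start));
    }

    // Compares a field with an expected value without allocating.
    public static bool FieldEquals(string line, FieldRange field, string expected)
    {
        return field.Length == expected.Length &&
               string.CompareOrdinal(line, field.Start, expected, 0, field.Length) == 0;
    }
}
```

A Substring call would then only be needed for the parameters that really have to exist as separate strings.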
A few other things. Decompiling mscorlib and browsing the string class code, it seems to me like IndexOf calls are done via P/Invoke, which would mean they have added overhead for each call, correct me if I'm wrong? Would it not be faster to implement an IndexOf manually using a char[] array?
public int IndexOf(string value, int startIndex, int count, StringComparison comparisonType)
{
...
return TextInfo.IndexOfStringOrdinalIgnoreCase(this, value, startIndex, count);
...
}
internal static int IndexOfStringOrdinalIgnoreCase(string source, string value, int startIndex, int count)
{
...
if (TextInfo.TryFastFindStringOrdinalIgnoreCase(4194304, source, startIndex, value, count, ref result))
{
return result;
}
...
}
...
[DllImport("QCall", CharSet = CharSet.Unicode)]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool InternalTryFindStringOrdinalIgnoreCase(int searchFlags, string source, int sourceCount, int startIndex, string target, int targetCount, ref int foundIndex);
Then we get to String.Split which ends up calling Substring itself (somewhere along the line):
// string
private string[] InternalSplitOmitEmptyEntries(int[] sepList, int[] lengthList, int numReplaces, int count)
{
int num = (numReplaces < count) ? (numReplaces + 1) : count;
string[] array = new string[num];
int num2 = 0;
int num3 = 0;
int i = 0;
while (i < numReplaces && num2 < this.Length)
{
if (sepList[i] - num2 > 0)
{
array[num3++] = this.Substring(num2, sepList[i] - num2);
}
num2 = sepList[i] + ((lengthList == null) ? 1 : lengthList[i]);
if (num3 == count - 1)
{
while (i < numReplaces - 1)
{
if (num2 != sepList[++i])
{
break;
}
num2 += ((lengthList == null) ? 1 : lengthList[i]);
}
break;
}
i++;
}
if (num2 < this.Length)
{
array[num3++] = this.Substring(num2);
}
string[] array2 = array;
if (num3 != num)
{
array2 = new string[num3];
for (int j = 0; j < num3; j++)
{
array2[j] = array[j];
}
}
return array2;
}
Thankfully Substring looks fast (and efficient!):
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
if (startIndex == 0 && length == this.Length && !fAlwaysCopy)
{
return this;
}
string text = string.FastAllocateString(length);
fixed (char* ptr = &text.m_firstChar)
{
fixed (char* ptr2 = &this.m_firstChar)
{
string.wstrcpy(ptr, ptr2 + (IntPtr)startIndex, length);
}
}
return text;
}
After reading this answer here, I'm thinking a pointer based solution could be found... Thoughts?
Thanks.
You could "cheat" and work at the Encoder level...
public class UTF8NoZero : UTF8Encoding
{
public override Decoder GetDecoder()
{
return new MyDecoder();
}
}
public class MyDecoder : Decoder
{
public Encoding UTF8 = new UTF8Encoding();
public override int GetCharCount(byte[] bytes, int index, int count)
{
return UTF8.GetCharCount(bytes, index, count);
}
public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
{
int count2 = UTF8.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
int i, j;
for (i = charIndex, j = charIndex; i < charIndex + count2; i++)
{
if (chars[i] != '\0')
{
chars[j] = chars[i];
j++;
}
}
for (int k = j; k < charIndex + count2; k++)
{
chars[k] = '\0';
}
// Report only the characters that remain after the '\0' chars were removed
return j - charIndex;
}
}
Note that this cheat relies on the fact that StreamReader.ReadLineAsync uses only GetChars(). We remove the '\0' in the temporary char[] buffer used by StreamReader.ReadLineAsync.

Get the index of the nth occurrence of a string?

Unless I am missing an obvious built-in method, what is the quickest way to get the nth occurrence of a string within a string?
I realize that I could loop the IndexOf method by updating its start index on each iteration of the loop. But doing it this way seems wasteful to me.
You really could use the regular expression /((s).*?){n}/ to search for n-th occurrence of substring s.
In C# it might look like this:
public static class StringExtender
{
public static int NthIndexOf(this string target, string value, int n)
{
Match m = Regex.Match(target, "((" + Regex.Escape(value) + ").*?){" + n + "}");
if (m.Success)
return m.Groups[2].Captures[n - 1].Index;
else
return -1;
}
}
Note: I have added Regex.Escape to original solution to allow searching characters which have special meaning to regex engine.
That's basically what you need to do - or at least, it's the easiest solution. All you'd be "wasting" is the cost of n method invocations - you won't actually be checking any case twice, if you think about it. (IndexOf will return as soon as it finds the match, and you'll keep going from where it left off.)
Here is the recursive implementation (of the above idea) as an extension method, mimicking the format of the framework method(s):
public static int IndexOfNth(this string input,
string value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return input.IndexOfNth(value, idx + 1, --nth);
}
Also, here are some (MBUnit) unit tests that might help you (to prove it is correct):
using System;
using MbUnit.Framework;
namespace IndexOfNthTest
{
[TestFixture]
public class Tests
{
// contains 2 instances of the token
private const string Input = "TestTest";
private const string Token = "Test";
/* Test for 0th index */
[Test]
public void TestZero()
{
Assert.Throws<NotSupportedException>(
() => Input.IndexOfNth(Token, 0, 0));
}
/* Test the two standard cases (1st and 2nd) */
[Test]
public void TestFirst()
{
Assert.AreEqual(0, Input.IndexOfNth("Test", 0, 1));
}
[Test]
public void TestSecond()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 0, 2));
}
/* Test the 'out of bounds' case */
[Test]
public void TestThird()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 0, 3));
}
/* Test the offset case (in and out of bounds) */
[Test]
public void TestFirstWithOneOffset()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 4, 1));
}
[Test]
public void TestFirstWithTwoOffsets()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 8, 1));
}
}
}
private int IndexOfOccurence(string s, string match, int occurence)
{
int i = 1;
int index = -1; // start before the string so a match at position 0 is not skipped
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}
or in C# with extension methods
public static int IndexOfOccurence(this string s, string match, int occurence)
{
int i = 1;
int index = -1; // start before the string so a match at position 0 is not skipped
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}
After some benchmarking, this seems to be the simplest and most efficient solution
public static int IndexOfNthSB(string input,
char value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
var nResult = 0;
for (int i = startIndex; i < input.Length; i++)
{
if (input[i] == value)
nResult++;
if (nResult == nth)
return i;
}
return -1;
}
Here I go again! Another benchmark answer from yours truly :-) Once again based on the fantastic BenchmarkDotNet package (if you're serious about benchmarking dotnet code, please, please use this package).
The motivation for this post is twofold: PeteT (who asked the question originally) thought it seemed wasteful to call String.IndexOf in a loop with a varying startIndex to find the nth occurrence of a character when, in fact, it's the fastest method; and some answers use regular expressions, which are an order of magnitude slower (and do not add any benefits, in my opinion not even readability, in this specific case).
Here is the code I've ended up using in my string extensions library (it's not a new answer to this question, since others have already posted semantically identical code here, I'm not taking credit for it). This is the fastest method (even, possibly, including unsafe variations - more on that later):
public static int IndexOfNth(this string str, char ch, int nth, int startIndex = 0) {
if (str == null)
throw new ArgumentNullException("str");
var idx = str.IndexOf(ch, startIndex);
while (idx >= 0 && --nth > 0)
idx = str.IndexOf(ch, idx + 1); // idx is already an absolute position
return idx;
}
I've benchmarked this code against two other methods and the results follow:
The benchmarked methods were:
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
The FindNthRegex method is the slowest of the bunch, taking an order (or two) of magnitude more time than the fastest. FindNthCharByChar loops over each char of the string and counts each match until it finds the nth occurrence. FindNthIndexOfStartIdx uses the method suggested by the opener of this question which, indeed, is the same one I've been using for ages to accomplish this, and it is the fastest of them all.
Why is it so much faster than FindNthCharByChar? It's because Microsoft went to great lengths to make string manipulation as fast as possible in the dotnet framework. And they've accomplished that! They did an amazing job! I've done a deeper investigation of string manipulation in dotnet in a CodeProject article which tries to find the fastest method to remove all whitespace from a string:
Fastest method to remove all whitespace from Strings in .NET
There you'll find why string manipulations in dotnet are so fast, and why it's next to useless trying to squeeze more speed by writing our own versions of the framework's string manipulation code (the likes of string.IndexOf, string.Split, string.Replace, etc.)
The full benchmark code I've used follows (it's a dotnet6 console program):
UPDATE: Added two methods FindNthCharByCharInSpan and FindNthCharRecursive (and now FindNthByLinq).
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;
using System.Text.RegularExpressions;
var summary = BenchmarkRunner.Run<BenchmarkFindNthChar>();
public class BenchmarkFindNthChar
{
const string BaseText = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
[Params(100, 1000)]
public int BaseTextRepeatCount { get; set; }
[Params(500)]
public int Nth { get; set; }
private string text;
[GlobalSetup]
public void BuildTestData() {
var sb = new StringBuilder();
for (int i = 0; i < BaseTextRepeatCount; i++)
sb.AppendLine(BaseText);
text = sb.ToString();
}
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
[Benchmark]
public int FindNthCharByCharInSpan() {
var span = text.AsSpan();
var occurrence = 0;
for (int i = 0; i < span.Length; i++) {
if (span[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthCharRecursive() => IndexOfNth(text, "z", 0, Nth);
public static int IndexOfNth(string input, string value, int startIndex, int nth) {
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return IndexOfNth(input, value, idx + 1, --nth);
}
[Benchmark]
public int FindNthByLinq() {
var items = text.Select((c, i) => (c, i)).Where(t => t.c == 'z');
return (items.Count() > Nth - 1)
? items.ElementAt(Nth - 1).i
: -1;
}
}
UPDATE 2: The new benchmark results (with the Linq-based benchmark) follow:
The Linq-based solution is only better than the recursive method, but it's good to have it here for completeness.
Maybe it would also be nice to work with the String.Split() Method and check if the requested occurrence is in the array, if you don't need the index, but the value at the index
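A tiny sketch of that Split idea, assuming you want the text that sits between the nth separator and the next one (the names SplitSketch and SegmentAfterNth are made up). Note that Split allocates an array plus one string per segment, so this is about convenience rather than speed.

```csharp
using System;

// Hypothetical helper illustrating the Split-based idea; names are illustrative.
static class SplitSketch
{
    // Returns the segment that follows the nth occurrence of 'separator'
    // (up to the next occurrence or the end), or null if there are fewer
    // than n occurrences.
    public static string SegmentAfterNth(string input, string separator, int nth)
    {
        string[] parts = input.Split(new[] { separator }, StringSplitOptions.None);
        return nth >= 1 && nth < parts.Length ? parts[nth] : null;
    }
}
```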
Or something like this with the do while loop
private static int OrdinalIndexOf(string str, string substr, int n)
{
int pos = -1;
do
{
pos = str.IndexOf(substr, pos + 1);
} while (n-- > 0 && pos != -1);
return pos;
}
System.ValueTuple ftw:
var index = line.Select((x, i) => (x, i)).Where(x => x.Item1 == '"').ElementAt(5).Item2;
writing a function from that is homework
Tod's answer can be simplified somewhat.
using System;
static class MainClass {
private static int IndexOfNth(this string target, string substring,
int seqNr, int startIdx = 0)
{
if (seqNr < 1)
{
throw new IndexOutOfRangeException("Parameter 'nth' must be greater than 0.");
}
var idx = target.IndexOf(substring, startIdx);
if (idx < 0 || seqNr == 1) { return idx; }
return target.IndexOfNth(substring, --seqNr, ++idx); // skip
}
static void Main () {
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 1));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 2));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 3));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 4));
}
}
Output
1
3
5
-1
This might do it:
Console.WriteLine(str.IndexOf((@"\")+2)+1);

Comparing names

Is there any simple algorithm to determine the likeliness of 2 names representing the same person?
I'm not asking for something at the level a customs department might be using. Just a simple algorithm that would tell me if 'James T. Clark' is most likely the same name as 'J. Thomas Clark' or 'James Clerk'.
If there is an algorithm in C# that would be great, but I can translate from any language.
Sounds like you're looking for a phonetic algorithm, such as Soundex, NYSIIS, or Double Metaphone. The first is actually what several government departments use, and is trivial to implement (with many implementations readily available). The second is a slightly more complicated and more precise version of the first. The last works with some non-English names and alphabets.
Levenshtein distance is a definition of distance between two arbitrary strings. It gives you a distance of 0 between identical strings and non-zero between different strings, which might also be useful if you decide to make a custom algorithm.
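For reference, a minimal sketch of the classic dynamic-programming computation of Levenshtein distance, keeping only two rows of the table. The class name LevenshteinSketch is made up, and the comparison is case-sensitive; for names you would probably normalise case and punctuation first.

```csharp
using System;

// Hypothetical helper; a textbook two-row Levenshtein distance.
static class LevenshteinSketch
{
    public static int Distance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // previous[j] holds the distance between the first i-1 chars of a
        // and the first j chars of b; current is the row being filled in.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution or match
            }
            // Swap rows for the next iteration.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.Length];
    }
}
```

With this, 'James Clark' and 'James Clerk' are a distance of 1 apart, while 'James T. Clark' and 'J. Thomas Clark' come out much further apart, which is one reason plain edit distance may not be enough for names on its own.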
Levenshtein is close, although maybe not exactly what you want.
I faced a similar problem and tried Levenshtein distance first, but it did not work well for me. I came up with an algorithm that gives you a "similarity" value between two strings (a higher value means more similar strings, "1" for identical strings). This value is not very meaningful by itself (if not "1", it is always 0.5 or less), but it works quite well when you throw in Hungarian Matrix to find matching pairs from two lists of strings.
Use like this:
PartialStringComparer cmp = new PartialStringComparer();
tbResult.Text = cmp.Compare(textBox1.Text, textBox2.Text).ToString();
The code behind:
using System;
using System.Collections.Generic;

public class SubstringRange {
    string masterString;
    public string MasterString {
        get { return masterString; }
        set { masterString = value; }
    }
    int start;
    public int Start {
        get { return start; }
        set { start = value; }
    }
    int end;
    public int End {
        get { return end; }
        set { end = value; }
    }
    public int Length {
        get { return End - Start; }
        set { End = Start + value; }
    }
    public bool IsValid {
        get { return MasterString.Length >= End && End >= Start && Start >= 0; }
    }
    public string Contents {
        get {
            if(IsValid) {
                return MasterString.Substring(Start, Length);
            } else {
                return "";
            }
        }
    }
    public bool OverlapsRange(SubstringRange range) {
        return !(End < range.Start || Start > range.End);
    }
    public bool ContainsRange(SubstringRange range) {
        return range.Start >= Start && range.End <= End;
    }
    public bool ExpandTo(string newContents) {
        if(MasterString.Substring(Start).StartsWith(newContents, StringComparison.InvariantCultureIgnoreCase) && newContents.Length > Length) {
            Length = newContents.Length;
            return true;
        } else {
            return false;
        }
    }
}
public class SubstringRangeList: List<SubstringRange> {
    string masterString;
    public string MasterString {
        get { return masterString; }
        set { masterString = value; }
    }
    public SubstringRangeList(string masterString) {
        this.MasterString = masterString;
    }
    public SubstringRange FindString(string s){
        foreach(SubstringRange r in this){
            if(r.Contents.Equals(s, StringComparison.InvariantCultureIgnoreCase))
                return r;
        }
        return null;
    }
    public SubstringRange FindSubstring(string s){
        foreach(SubstringRange r in this){
            if(r.Contents.StartsWith(s, StringComparison.InvariantCultureIgnoreCase))
                return r;
        }
        return null;
    }
    public bool ContainsRange(SubstringRange range) {
        foreach(SubstringRange r in this) {
            if(r.ContainsRange(range))
                return true;
        }
        return false;
    }
    public bool AddSubstring(string substring) {
        bool result = false;
        foreach(SubstringRange r in this) {
            if(r.ExpandTo(substring)) {
                result = true;
            }
        }
        if(FindSubstring(substring) == null) {
            bool patternfound = true;
            int start = 0;
            while(patternfound){
                patternfound = false;
                start = MasterString.IndexOf(substring, start, StringComparison.InvariantCultureIgnoreCase);
                patternfound = start != -1;
                if(patternfound) {
                    SubstringRange r = new SubstringRange();
                    r.MasterString = this.MasterString;
                    r.Start = start++;
                    r.Length = substring.Length;
                    if(!ContainsRange(r)) {
                        this.Add(r);
                        result = true;
                    }
                }
            }
        }
        return result;
    }
    private static bool SubstringRangeMoreThanOneChar(SubstringRange range) {
        return range.Length > 1;
    }
    public float Weight {
        get {
            if(MasterString.Length == 0 || Count == 0)
                return 0;
            float numerator = 0;
            int denominator = 0;
            foreach(SubstringRange r in this.FindAll(SubstringRangeMoreThanOneChar)) {
                numerator += r.Length;
                denominator++;
            }
            if(denominator == 0)
                return 0;
            return numerator / denominator / MasterString.Length;
        }
    }
    public void RemoveOverlappingRanges() {
        SubstringRangeList l = new SubstringRangeList(this.MasterString);
        l.AddRange(this);//create a copy of this list
        foreach(SubstringRange r in l) {
            if(this.Contains(r) && this.ContainsRange(r)) {
                Remove(r);//try to remove the range
                if(!ContainsRange(r)) {//see if the list still contains "superset" of this range
                    Add(r);//if not, add it back
                }
            }
        }
    }
    public void AddStringToCompare(string s) {
        for(int start = 0; start < s.Length; start++) {
            for(int len = 1; start + len <= s.Length; len++) {
                string part = s.Substring(start, len);
                if(!AddSubstring(part))
                    break;
            }
        }
        RemoveOverlappingRanges();
    }
}
public class PartialStringComparer {
    public float Compare(string s1, string s2) {
        SubstringRangeList srl1 = new SubstringRangeList(s1);
        srl1.AddStringToCompare(s2);
        SubstringRangeList srl2 = new SubstringRangeList(s2);
        srl2.AddStringToCompare(s1);
        return (srl1.Weight + srl2.Weight) / 2;
    }
}
The Levenshtein distance implementation is much simpler (adapted from http://www.merriampark.com/ld.htm):
public class Distance {
    /// <summary>
    /// Compute Levenshtein distance
    /// </summary>
    /// <param name="s">String 1</param>
    /// <param name="t">String 2</param>
    /// <returns>Distance between the two strings.
    /// The larger the number, the bigger the difference.
    /// </returns>
    public static int LD(string s, string t) {
        int n = s.Length; // length of s
        int m = t.Length; // length of t
        int[,] d = new int[n + 1, m + 1]; // matrix
        int cost; // cost
        // Step 1: if either string is empty, the distance is the other's length
        if(n == 0) return m;
        if(m == 0) return n;
        // Step 2: initialize the first column and the first row
        for(int i = 0; i <= n; i++) d[i, 0] = i;
        for(int j = 0; j <= m; j++) d[0, j] = j;
        // Step 3
        for(int i = 1; i <= n; i++) {
            // Step 4
            for(int j = 1; j <= m; j++) {
                // Step 5: compare characters directly instead of allocating substrings
                cost = (t[j - 1] == s[i - 1] ? 0 : 1);
                // Step 6: minimum of deletion, insertion and substitution
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}
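A raw edit distance is hard to compare across names of different lengths, so one option is to scale it into a 0..1 similarity score. The sketch below is only an illustration: the `NameSimilarity` class, the two-row variant of the distance, the case folding and the division by the longer length are my assumptions, not part of the answer above.

```csharp
using System;

public static class NameSimilarity
{
    // Two-row Levenshtein distance; case is ignored (an assumption that suits names).
    public static int Levenshtein(string s, string t)
    {
        s = s.ToUpperInvariant();
        t = t.ToUpperInvariant();
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;   // reuse the two rows
        }
        return prev[t.Length];
    }

    // 1.0 for identical strings, approaching 0.0 as the strings diverge.
    public static double Similarity(string s, string t)
    {
        int maxLen = Math.Max(s.Length, t.Length);
        if (maxLen == 0) return 1.0;
        return 1.0 - (double)Levenshtein(s, t) / maxLen;
    }
}
```

For example, "James Clark" and "James Clerk" differ by a single substitution, so their distance is 1 and the similarity is 1 - 1/11. Any cut-off for "same person" would have to be tuned on real data.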
I doubt there is, considering even the Customs Department doesn't seem to have a satisfactory answer...
If there is a solution to this problem I seriously doubt it's a part of core C#. Off the top of my head, it would require a database of first, middle and last name frequencies, as well as account for initials, as in your example. This is fairly complex logic that relies on a database of information.
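The "account for initials" part can at least be approximated without a database. The following is a rough sketch under my own assumptions: the last word is the surname and must match exactly (ignoring case), and given names are compared position by position, with a single letter matching any name that starts with it. `NameMatcher` and its methods are hypothetical names, and name frequencies are not considered at all.

```csharp
using System;

public static class NameMatcher
{
    // Split on spaces, periods and commas so "T." becomes "T".
    private static string[] Tokens(string name)
    {
        return name.Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Two given names agree if they are equal, or if either is an initial of the other.
    private static bool GivenNamesAgree(string a, string b)
    {
        if (a.Length == 1 || b.Length == 1)
            return char.ToUpperInvariant(a[0]) == char.ToUpperInvariant(b[0]);
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static bool CouldBeSamePerson(string name1, string name2)
    {
        string[] t1 = Tokens(name1);
        string[] t2 = Tokens(name2);
        if (t1.Length == 0 || t2.Length == 0) return false;

        // Assumption: the last token is the surname and must match exactly.
        if (!string.Equals(t1[t1.Length - 1], t2[t2.Length - 1], StringComparison.OrdinalIgnoreCase))
            return false;

        // Compare only the given-name positions both names have.
        int shared = Math.Min(t1.Length, t2.Length) - 1;
        for (int i = 0; i < shared; i++)
        {
            if (!GivenNamesAgree(t1[i], t2[i])) return false;
        }
        return true;
    }
}
```

Under these rules 'James T. Clark' and 'J. Thomas Clark' are compatible, while 'James Clerk' is rejected because the surname differs. Relaxing the surname check with a phonetic code or an edit distance would change that outcome.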
I second Levenshtein distance. What language do you want? I was able to find an implementation in C# on CodeProject pretty easily.
In an application I worked on, the last name field was considered reliable.
So we presented all the records with the same last name to the user.
The user could sort by the other fields to look for similar names.
This solution was good enough to greatly reduce the issue of users creating duplicate records.
Basically, it looks like the issue will require human judgement.
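A minimal sketch of that "same last name" lookup, assuming records are plain full-name strings and that the last word is the surname (both assumptions of mine; the real application presumably had a separate last name field). `DuplicateFinder` and `CandidatesFor` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateFinder
{
    // Returns the existing names that share a surname with newName, sorted for review.
    public static List<string> CandidatesFor(string newName, IEnumerable<string> existing)
    {
        string surname = LastName(newName);
        return existing
            .Where(n => string.Equals(LastName(n), surname, StringComparison.OrdinalIgnoreCase))
            .OrderBy(n => n, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Assumption: the surname is the last space-separated word.
    private static string LastName(string fullName)
    {
        string[] parts = fullName.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length == 0 ? "" : parts[parts.Length - 1];
    }
}
```

The returned list is what would be shown to the user before a new record is created; the final decision stays with a human.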
