Improve string parse performance - c#

Before we start, I am aware of the term "premature optimization". However the following snippets have proven to be an area where improvements can be made.
Alright. We currently have some network code that works with string based packets. I am aware that using strings for packets is stupid, crazy and slow. Sadly, we don't have any control over the client and so have to use strings.
Each packet is terminated by \0\r\n and we currently use a StreamReader/Writer to read individual packets from the stream. Our main bottleneck comes from two places.
Firstly: We need to trim that nasty little null-byte off the end of the string. We currently use code like the following:
line = await reader.ReadLineAsync();
line = line.Replace("\0", ""); // PERF this allocates a new string
if (string.IsNullOrWhiteSpace(line))
return null;
var packet = ClientPacket.Parse(line, cl.Client.RemoteEndPoint);
As the comment points out, trimming the '\0' causes a GC performance issue. There are numerous ways to trim a '\0' off the end of a string, but all of them result in the same GC hammering: strings are immutable, so every string operation creates a new string object. As our server handles 1000+ connections, each communicating at around 25-40 packets per second (it's a game server), this GC pressure is becoming an issue. So here comes my first question: what is a more efficient way of trimming that '\0' off the end of our string? By efficient I mean not only speed but also GC behavior (ultimately I'd like a way to get rid of it without creating a new string object!).
Our second issue also stems from GC land. Our code looks somewhat like the following:
private static string[] emptyStringArray = new string[] { }; // so we don't need to allocate this for every packet
public static ClientPacket Parse(string line, EndPoint from)
{
const char seperator = '|';
var first_seperator_pos = line.IndexOf(seperator);
if (first_seperator_pos < 1)
{
return new ClientPacket(NetworkStringToClientPacketType(line), emptyStringArray, from);
}
var name = line.Substring(0, first_seperator_pos);
var type = NetworkStringToClientPacketType(name);
if (line.IndexOf(seperator, first_seperator_pos + 1) < 1)
return new ClientPacket(type, new string[] { line.Substring(first_seperator_pos + 1) }, from);
return new ClientPacket(type, line.Substring(first_seperator_pos + 1).Split(seperator), from);
}
(Where NetworkStringToClientPacketType is simply a big switch-case block)
As you can see, we already do a few things to reduce GC pressure: we reuse a static "empty" string array and we check for packets with no parameters. My only issue here is that we use Substring a lot, and even chain a Split onto the end of a Substring. For an average packet, this leads to almost 20 new string objects being created and 12 being discarded for EACH PACKET. This causes a lot of performance issues when the load rises above 400 users (we gotz fast ram :3)
Has anyone had an experience with this sort of thing before or could give us some pointers into what to look into next? Maybe some magical classes or some nifty pointer magic?
(PS. StringBuilder doesn't help as we aren't building strings, we are generally splitting them.)
We currently have some ideas based on an index based system where we store the index and length of each parameter rather than splitting them. Thoughts?
A few other things. Decompiling mscorlib and browsing the string class code, it seems to me like IndexOf calls are done via P/Invoke, which would mean they have added overhead for each call, correct me if I'm wrong? Would it not be faster to implement an IndexOf manually using a char[] array?
public int IndexOf(string value, int startIndex, int count, StringComparison comparisonType)
{
...
return TextInfo.IndexOfStringOrdinalIgnoreCase(this, value, startIndex, count);
...
}
internal static int IndexOfStringOrdinalIgnoreCase(string source, string value, int startIndex, int count)
{
...
if (TextInfo.TryFastFindStringOrdinalIgnoreCase(4194304, source, startIndex, value, count, ref result))
{
return result;
}
...
}
...
[DllImport("QCall", CharSet = CharSet.Unicode)]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool InternalTryFindStringOrdinalIgnoreCase(int searchFlags, string source, int sourceCount, int startIndex, string target, int targetCount, ref int foundIndex);
Then we get to String.Split which ends up calling Substring itself (somewhere along the line):
// string
private string[] InternalSplitOmitEmptyEntries(int[] sepList, int[] lengthList, int numReplaces, int count)
{
int num = (numReplaces < count) ? (numReplaces + 1) : count;
string[] array = new string[num];
int num2 = 0;
int num3 = 0;
int i = 0;
while (i < numReplaces && num2 < this.Length)
{
if (sepList[i] - num2 > 0)
{
array[num3++] = this.Substring(num2, sepList[i] - num2);
}
num2 = sepList[i] + ((lengthList == null) ? 1 : lengthList[i]);
if (num3 == count - 1)
{
while (i < numReplaces - 1)
{
if (num2 != sepList[++i])
{
break;
}
num2 += ((lengthList == null) ? 1 : lengthList[i]);
}
break;
}
i++;
}
if (num2 < this.Length)
{
array[num3++] = this.Substring(num2);
}
string[] array2 = array;
if (num3 != num)
{
array2 = new string[num3];
for (int j = 0; j < num3; j++)
{
array2[j] = array[j];
}
}
return array2;
}
Thankfully Substring looks fast (and efficient!):
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
if (startIndex == 0 && length == this.Length && !fAlwaysCopy)
{
return this;
}
string text = string.FastAllocateString(length);
fixed (char* ptr = &text.m_firstChar)
{
fixed (char* ptr2 = &this.m_firstChar)
{
string.wstrcpy(ptr, ptr2 + (IntPtr)startIndex, length);
}
}
return text;
}
After reading this answer here, I'm thinking a pointer based solution could be found... Thoughts?
Thanks.

You could "cheat" and work at the Encoder level...
public class UTF8NoZero : UTF8Encoding
{
public override Decoder GetDecoder()
{
return new MyDecoder();
}
}
public class MyDecoder : Decoder
{
public Encoding UTF8 = new UTF8Encoding();
public override int GetCharCount(byte[] bytes, int index, int count)
{
return UTF8.GetCharCount(bytes, index, count);
}
public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
{
int count2 = UTF8.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
int i, j;
for (i = charIndex, j = charIndex; i < charIndex + count2; i++)
{
if (chars[i] != '\0')
{
chars[j] = chars[i];
j++;
}
}
for (int k = j; k < charIndex + count2; k++)
{
chars[k] = '\0';
}
return j - charIndex; // number of chars left after dropping the '\0's
}
}
Note that this cheat relies on the fact that StreamReader.ReadLineAsync only uses GetChars(). We remove the '\0' in the temporary char[] buffer used by StreamReader.ReadLineAsync.
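As for the index-based idea mentioned in the question, here is a minimal sketch rather than a drop-in replacement. It assumes fields are separated by '|' and that any '\0' sits at the very end of the line. FieldRange and PacketFields are hypothetical names, not part of the original code. The scan records where each field starts and how long it is, so no substring is allocated until a field is actually needed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type: position and length of one field inside the line.
public readonly struct FieldRange
{
    public readonly int Start;
    public readonly int Length;

    public FieldRange(int start, int length)
    {
        Start = start;
        Length = length;
    }
}

public static class PacketFields
{
    // Fills 'ranges' (a list the caller reuses between packets) and
    // returns the number of fields found. No strings are created here.
    public static int Scan(string line, char separator, List<FieldRange> ranges)
    {
        ranges.Clear();

        // Ignore trailing '\0' characters without building a new string.
        int end = line.Length;
        while (end > 0 && line[end - 1] == '\0')
        {
            end--;
        }

        int start = 0;
        for (int i = 0; i < end; i++)
        {
            if (line[i] == separator)
            {
                ranges.Add(new FieldRange(start, i - start));
                start = i + 1;
            }
        }

        // The last field runs up to the (trimmed) end of the line.
        ranges.Add(new FieldRange(start, end - start));
        return ranges.Count;
    }
}
```

A field can then be compared in place, for example with string.CompareOrdinal(line, range.Start, name, 0, range.Length) after checking that the lengths match, and Substring is only called for the fields that are actually kept.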

Related

Replacing chars with padded index

I have a string with special chars and I have to replace those chars with an index (left-padded with '0').
Quick example for better explanation:
I have the string "0980 0099 8383 $$$$" and an index (integer) 3.
The result should be "0980 0099 8383 0003".
The special characters are not necessarily in sequence.
The source string could be empty, or it may not contain any special characters.
I've already written functions that work.
public static class StringExtensions
{
public static string ReplaceCounter(this string source, int counter, string character)
{
string res = source;
try
{
if (!string.IsNullOrEmpty(character))
{
if (res.Contains(character))
{
// Get ALL Indexes position of character
var Indexes = GetIndexes(res, character);
int max = GetMaxValue(Indexes.Count);
while (counter >= max)
{
counter -= max;
}
var new_value = counter.ToString().PadLeft(Indexes.Count, '0');
for (int i = 0; i < Indexes.Count; i++)
{
res = res.Remove(Indexes[i], 1).Insert(Indexes[i], new_value[i].ToString());
}
}
}
}
catch (Exception)
{
res = source;
}
return res;
}
private static List<int> GetIndexes(string mainString, string toFind)
{
var Indexes = new List<int>();
for (int i = mainString.IndexOf(toFind); i > -1; i = mainString.IndexOf(toFind, i + 1))
{
// for loop end when i=-1 (line.counter not found)
Indexes.Add(i);
}
return Indexes;
}
private static int GetMaxValue(int numIndexes)
{
int max = 0;
for (int i = 0; i < numIndexes; i++)
{
if (i == 0)
max = 9;
else
max = max * 10 + 9;
}
return max;
}
}
But I don't really like it (first of all because I'm passing the char as a string and not as a char).
string source = "000081059671####=1811";
int index = 5;
string character = "#";
string result = source.ReplaceCounter(index, character);
Can it be made more optimized and compact?
Can some good soul help me?
Thanks in advance
EDIT
The index is variable so:
If the index is 15
string source = "000081059671####=1811";
int index = 15;
string character = "#";
string result = source.ReplaceCounter(index, character);
// result = "0000810596710015=1811"
There should be a check for the case where the index > max number.
In the code I posted above, if this happens I subtract the "max" value from the index until index < max number.
What is the max number? If the number of special chars is 4 (as in the example below), the max number will be 9999.
string source = "000081059671####=1811";
// max number 9999
Yet another edit
From a comment it seems that more than one digit can be used. In this case the counter can be converted to a string and treated as a char[] to pick the character to use in each iteration :
public static string ReplaceCounter(this string source,
int counter,
char character)
{
var sb=new StringBuilder(source);
var replacements=counter.ToString();
int r=replacements.Length-1;
for(int i=sb.Length-1;i>=0;i--)
{
if(sb[i]==character)
{
sb[i]=r>=0 ? replacements[r--] : '0';
}
}
return sb.ToString();
}
This can be used for any number of digits. "0980 0099 8383 $$$$".ReplaceCounter(15,'$') produces 0980 0099 8383 0015
An edit
After posting the original answer I remembered that one can modify a string without allocating intermediate strings by using a StringBuilder. In this case, the last match needs to be replaced with one character and all other matches with another. This can be a simple reverse iteration:
public static string ReplaceCounter(this string source,
int counter,
char character)
{
var sb=new StringBuilder(source);
bool useChar=true;
for(int i=sb.Length-1;i>=0;i--)
{
if(sb[i]==character)
{
sb[i]=useChar?(char)('0'+counter):'0';
useChar=false;
}
}
return sb.ToString();
}
Console.WriteLine("0000##81#059671####=1811".ReplaceCounter(5,'#'));
Console.WriteLine("0980 0099 8383 $$$$".ReplaceCounter(3,'$'));
------
0000008100596710005=1811
0980 0099 8383 0003
Original Answer
Any string modification operation produces a new temporary string that needs to be garbage collected. This adds up so quickly that avoiding temporary strings can result in >10x speed improvements when processing lots of text or lots of requests. That's better than using parallel processing.
You can use Regex.Replace to perform complex replacements without allocating temporary strings. You can use one of the Replace overloads that use a MatchEvaluator to produce dynamic output, not just a single value.
In this case :
var source = "0000##81#059671####=1811";
var result = Regex.Replace(source,"#", m=>m.NextMatch().Success?"0":"5");
Console.WriteLine(result);
--------
0000008100596710005=1811
Match.NextMatch() returns the next match in the source, so m.NextMatch().Success can be used to identify the last match and replace it with the index.
This would fail if the character was one of the Regex pattern characters. This can be avoided by escaping the character with Regex.Escape(string)
This can be packed in an extension method
public static string ReplaceCounter(this string source,
int counter,
string character)
{
return Regex.Replace(source,
Regex.Escape(character),
m=>m.NextMatch().Success?"0":counter.ToString());
}
public static string ReplaceCounter(this string source,
int counter,
char character)
=>ReplaceCounter(source,counter,character.ToString());
This code
var source= "0980 0099 8383 $$$$";
var result=source.ReplaceCounter(5,"$");
Returns
0980 0099 8383 0005
I would suggest this solution (which gets rid of the helper methods):
public static class StringExtensions
{
public static string ReplaceCounter(this string source, int counter, char character)
{
string res = source;
string strCounter = counter.ToString();
bool counterTooLong = false;
int idx;
// Going from the end backwards, we fill in the counter digits.
for(int i = strCounter.Length - 1; i >= 0; i--)
{
idx = res.LastIndexOf(character);
// if we run out of special characters, break the loop.
if (idx == -1)
{
counterTooLong = true;
break;
}
res = res.Remove(idx, 1).Insert(idx, strCounter[i].ToString());
}
// If we could not fit the counter, we simply throw exception
if (counterTooLong) throw new InvalidOperationException();
// If we did not fill all placeholders, we fill it with zeros.
while (-1 != (idx = res.IndexOf(character))) res = res.Remove(idx, 1).Insert(idx, "0");
return res;
}
}
Here's fiddle
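If allocations are the concern, a single backwards pass over a char[] is enough. This is a minimal sketch under two assumptions: the counter is non-negative, and digits that do not fit are simply dropped (so the counter wraps modulo 10^n for n placeholders, which differs slightly from the subtraction loop in the question). CounterFormatting and FillCounter are hypothetical names.

```csharp
using System;

public static class CounterFormatting
{
    // Replaces every 'placeholder' with a digit of 'counter', filling from
    // the right; placeholders left over on the left become '0'.
    public static string FillCounter(string source, int counter, char placeholder)
    {
        if (string.IsNullOrEmpty(source))
        {
            return source;
        }

        char[] chars = source.ToCharArray();
        int remaining = counter; // assumed to be >= 0
        bool replaced = false;

        for (int i = chars.Length - 1; i >= 0; i--)
        {
            if (chars[i] != placeholder)
            {
                continue;
            }
            chars[i] = (char)('0' + remaining % 10); // lowest digit first
            remaining /= 10;
            replaced = true;
        }

        // Return the original instance when there was nothing to replace.
        return replaced ? new string(chars) : source;
    }
}
```

The only allocations are the char[] copy and the final string, regardless of how many placeholders the source contains.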

Shuffle the characters

I need to shuffle the characters in the way that at each iteration, the odd characters of the string are combined and wrapped to its beginning, and the even characters are wrapped to the end.
public static string ShuffleChars(string source, int count)
{
if (string.IsNullOrEmpty(source))
{
throw new ArgumentException("source is null or empty");
}
if (string.IsNullOrWhiteSpace(source))
{
throw new ArgumentException("source is white space");
}
if (count < 0)
{
throw new ArgumentException("count < 0");
}
for (int j = 0; j < count; j++)
{
string tempOdd = string.Empty;
string tempEven = string.Empty;
for (int i = 0; i < source.Length; i++)
{
if (i % 2 == 0)
{
tempOdd += source[i];
}
else if (i % 2 != 0)
{
tempEven += source[i];
}
}
source = tempOdd + tempEven;
}
return source;
}
This works perfectly fine, BUT when count = int.MaxValue it seems to run endlessly.
The task given to me says that I have to optimize this, and people advised using StringBuilder, so I came up with something like this:
public static string ShuffleChars(string source, int count)
{
if (string.IsNullOrEmpty(source))
{
throw new ArgumentException("source is null or empty");
}
if (string.IsNullOrWhiteSpace(source))
{
throw new ArgumentException("source is white space");
}
if (count < 0)
{
throw new ArgumentException("count < 0");
}
StringBuilder sourceString = new StringBuilder(source);
StringBuilder tempOdd = new StringBuilder(string.Empty);
StringBuilder tempEven = new StringBuilder(string.Empty);
for (int j = 0; j < count; j++)
{
tempOdd.Clear();
tempEven.Clear();
for (int i = 0; i < sourceString.Length; i++)
{
if (i % 2 == 0)
{
tempOdd.Append(sourceString[i]);
}
else
{
tempEven.Append(sourceString[i]);
}
}
sourceString = tempOdd.Append(tempEven);
}
return sourceString.ToString();
}
As far as I understand when I clear tempOdd and tempEven, sourceString gets cleared as well, and that is why when I shuffle the string more than once it returns me empty string.
Maybe there are other ways to optimize this?
The problem is the assignment sourceString = tempOdd.Append(tempEven);. After it, sourceString is a reference to the same StringBuilder object as tempOdd! When you then clear tempOdd, you are in fact clearing sourceString as well. And by the way, your names for even and odd are inverted: i % 2 == 0 is even.
Instead, append both, the odd and even string to sourceString after having cleared it.
sourceString.Clear();
sourceString.Append(tempOdd).Append(tempEven);
Note that Append returns the StringBuilder itself. Therefore, this is equivalent to
sourceString.Clear();
sourceString.Append(tempOdd);
sourceString.Append(tempEven);
Strings are immutable. Therefore, when you are manipulating strings, you are always creating new strings. E.g., when you add a character to tempOdd, this creates a new string object having a length longer by one character. Then it copies the old string into the new one and appends the character. This generates a lot of new objects and involves a lot of copying.
StringBuilder works with an internal mutable buffer. Since the size of these buffers remains the same at each iteration, the characters can be appended to the already existing buffers, with no object creation (except for the initialization phase) and copying of strings involved.
Therefore StringBuilder is more efficient than string.
But there are more optimizations you can make, as @JL0PD already pointed out. The length of the even and odd parts is known in advance. Therefore, we can copy the characters to their final places and thus avoid having to concatenate the result at the end.
Also, this solution reuses the same character buffers at each iteration. To achieve this, we must swap the two buffers at each iteration to make the previous result the new source.
public static string ShuffleChars(string source, int count)
{
if (string.IsNullOrWhiteSpace(source)) {
throw new ArgumentException("source is null or empty or white space");
}
if (count < 0) {
throw new ArgumentException("count < 0");
}
// Initialize the wrong way, since we are swapping later.
var resultChars = source.ToCharArray();
var sourceChars = new char[source.Length];
for (int j = 0; j < count; j++) {
// Swap source and result. This enables us to reuse the same buffers.
var temp = sourceChars;
sourceChars = resultChars;
resultChars = temp;
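// We don't need to clear, since we fill every character position anyway.
// As in the question's code, characters at even indices (i % 2 == 0) go to the
// front; the characters at odd indices follow them.
int iEven = 0;
int iOdd = (source.Length + 1) / 2;
for (int i = 0; i < source.Length; i++) {
if (i % 2 == 0) {
resultChars[iEven++] = sourceChars[i];
} else {
resultChars[iOdd++] = sourceChars[i];
}
}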
}
return new String(resultChars);
}
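For count = int.MaxValue, even a version that reuses its buffers still performs about two billion passes. One further idea, sketched here under assumptions: each pass applies the same fixed permutation of positions, so the string must return to its original value after some number of passes p, and only count % p passes actually matter. ShuffleCycle is a hypothetical name, and the sketch follows the question's code in putting the characters at even indices first.

```csharp
using System;

public static class ShuffleCycle
{
    // One pass: characters at even indices go to the front, the rest follow.
    private static void ShuffleOnce(char[] src, char[] dst)
    {
        int front = 0;
        int back = (src.Length + 1) / 2;
        for (int i = 0; i < src.Length; i++)
        {
            if (i % 2 == 0) dst[front++] = src[i];
            else dst[back++] = src[i];
        }
    }

    private static bool SameChars(char[] chars, string text)
    {
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] != text[i]) return false;
        }
        return true;
    }

    public static string ShuffleChars(string source, int count)
    {
        if (string.IsNullOrWhiteSpace(source))
            throw new ArgumentException("source is null, empty or white space");
        if (count < 0)
            throw new ArgumentException("count < 0");

        char[] current = source.ToCharArray();
        char[] next = new char[current.Length];
        int period = 0;

        // Shuffle until 'count' passes are done or the original string reappears.
        for (int step = 0; step < count; step++)
        {
            ShuffleOnce(current, next);
            char[] temp = current; current = next; next = temp;
            if (SameChars(current, source))
            {
                period = step + 1;
                break;
            }
        }

        // 'count' was reached before the cycle closed: 'current' is the answer.
        if (period == 0) return new string(current);

        // 'current' equals the source again, so only count % period passes remain.
        int remaining = count % period;
        for (int step = 0; step < remaining; step++)
        {
            ShuffleOnce(current, next);
            char[] temp = current; current = next; next = temp;
        }
        return new string(current);
    }
}
```

For "abcde" the original string reappears after 4 passes, so int.MaxValue passes reduce to int.MaxValue % 4 = 3 passes.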

Boyer-Moore Practical in C#?

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.
I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.
I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether? Are some algorithms just not practical in C#.NET?
Below is the key code.
// Base for search classes
abstract class SearchBase
{
public const int InvalidIndex = -1;
protected string _pattern;
public SearchBase(string pattern) { _pattern = pattern; }
public abstract int Search(string text, int startIndex);
public int Search(string text) { return Search(text, 0); }
}
/// <summary>
/// A simplified Boyer-Moore implementation.
///
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
private byte[] _skipArray;
public BoyerMoore2(string pattern)
: base(pattern)
{
// TODO: To be replaced with multi-stage table
_skipArray = new byte[0x10000];
for (int i = 0; i < _skipArray.Length; i++)
_skipArray[i] = (byte)_pattern.Length;
for (int i = 0; i < _pattern.Length - 1; i++)
_skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
}
public override int Search(string text, int startIndex)
{
int i = startIndex;
// Loop while there's still room for search term
while (i <= (text.Length - _pattern.Length))
{
// Look if we have a match at this position
int j = _pattern.Length - 1;
while (j >= 0 && _pattern[j] == text[i + j])
j--;
if (j < 0)
{
// Match found
return i;
}
// Advance to next comparison
i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
}
// No match found
return InvalidIndex;
}
}
EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.
Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparison.Ordinal is that the method calls into unmanaged code that likely employs hand-optimized assembly language.
I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.
If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.
My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself. And its implementation is better than yours.
Without that flag it has to be careful using Boyer-Moore (and probably doesn't) because of potential issues around Unicode. In particular the possibility of Unicode causes the transition tables that Boyer-Moore uses to blow up.
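One part of the difference is not in doubt: the IndexOf(string) overload without a StringComparison argument is culture-sensitive and has to apply linguistic comparison rules, whereas an ordinal search only compares UTF-16 code units. A small sketch of the two modes follows. SearchModes is a hypothetical name, and the culture-sensitive result depends on the runtime and the current culture, so no output is claimed for it.

```csharp
using System;

public static class SearchModes
{
    // Ordinal: plain comparison of UTF-16 code units, no culture rules.
    public static int FindOrdinal(string text, string pattern)
    {
        return text.IndexOf(pattern, StringComparison.Ordinal);
    }

    // Culture-sensitive: linguistic rules of the current culture apply,
    // so the result can differ from the ordinal one for the same input.
    public static int FindCultureSensitive(string text, string pattern)
    {
        return text.IndexOf(pattern, StringComparison.CurrentCulture);
    }
}

// Usage, with a soft hyphen (U+00AD) between "co" and "operate":
//   string text = "co\u00ADoperate";
//   SearchModes.FindOrdinal(text, "coop");          // -1: the code units are not adjacent
//   SearchModes.FindOrdinal(text, "operate");       // 3
//   SearchModes.FindCultureSensitive(text, "coop"); // runtime- and culture-dependent
```

When benchmarking a hand-written search against IndexOf, the fair comparison is therefore the ordinal call, since a managed Boyer-Moore over chars is itself an ordinal search.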
I made some small changes to your code, wrote a different implementation of the Boyer-Moore algorithm, and got better results.
I got the idea for this implementation from here
But to be honest, I would expect to reach a higher speed compared to IndexOf.
class SearchResults
{
public int Matches { get; set; }
public long Ticks { get; set; }
}
abstract class SearchBase
{
public const int InvalidIndex = -1;
protected string _pattern;
protected string _text;
public SearchBase(string text, string pattern) { _text = text; _pattern = pattern; }
public abstract int Search(int startIndex);
}
internal class BoyerMoore3 : SearchBase
{
readonly byte[] textBytes;
readonly byte[] patternBytes;
readonly int valueLength;
readonly int patternLength;
private readonly int[] badCharacters = new int[256];
private readonly int lastPatternByte;
public BoyerMoore3(string text, string pattern) : base(text, pattern)
{
textBytes = Encoding.UTF8.GetBytes(text);
patternBytes = Encoding.UTF8.GetBytes(pattern);
valueLength = textBytes.Length;
patternLength = patternBytes.Length;
for (int i = 0; i < 256; ++i)
badCharacters[i] = patternLength;
lastPatternByte = patternLength - 1;
for (int i = 0; i < lastPatternByte; ++i)
badCharacters[patternBytes[i]] = lastPatternByte - i;
}
public override int Search(int startIndex)
{
int index = startIndex;
while (index <= (valueLength - patternLength))
{
for (int i = lastPatternByte; textBytes[index + i] == patternBytes[i]; --i)
{
if (i == 0)
return index;
}
index += badCharacters[textBytes[index + lastPatternByte]];
}
// Text not found
return InvalidIndex;
}
}
Changed code from Form1:
private void RunSearch(string pattern, SearchBase search, SearchResults results)
{
var timer = new Stopwatch();
// Start timer
timer.Start();
// Find all matches
int pos = search.Search(0);
while (pos != -1)
{
results.Matches++;
pos = search.Search(pos + pattern.Length);
}
// Stop timer
timer.Stop();
// Add to total Ticks
results.Ticks += timer.ElapsedTicks;
}

How do you perform string replacement on just a subsection of a string?

I'd like an efficient method that would work something like this
EDIT: Sorry I didn't put what I'd tried before. I updated the example now.
// Method signature, Only replaces first instance or how many are specified in max
public int MyReplace(ref string source,string org, string replace, int start, int max)
{
int ret = 0;
int len = replace.Length;
int olen = org.Length;
for(int i = 0; i < max; i++)
{
// Find the next instance of the search string
int x = source.IndexOf(org, ret + olen);
if(x > ret)
ret = x;
else
break;
// Insert the replacement
source = source.Insert(x, replace);
// And remove the original
source = source.Remove(x + len, olen); // removes original string
}
return ret;
}
string source = "The cat can fly but only if he is the cat in the hat";
int i = MyReplace(ref source,"cat", "giraffe", 8, 1);
// Results in the string "The cat can fly but only if he is the giraffe in the hat"
// i contains the index of the first letter of "giraffe" in the new string
The only reason I'm asking is that I imagine my implementation would get slow with thousands of replacements.
How about:
public static int MyReplace(ref string source,
string org, string replace, int start, int max)
{
if (start < 0) throw new System.ArgumentOutOfRangeException("start");
if (max <= 0) return 0;
start = source.IndexOf(org, start);
if (start < 0) return 0;
StringBuilder sb = new StringBuilder(source, 0, start, source.Length);
int found = 0;
while (max-- > 0) {
int index = source.IndexOf(org, start);
if (index < 0) break;
sb.Append(source, start, index - start).Append(replace);
start = index + org.Length;
found++;
}
sb.Append(source, start, source.Length - start);
source = sb.ToString();
return found;
}
It uses StringBuilder to avoid lots of intermediate strings; I haven't tested it rigorously, but it seems to work. It also tries to avoid an extra string when there are no matches.
To start, try something like this:
int count = 0;
Regex.Replace(source, Regex.Escape(literal), (match) =>
{
return (count++ > something) ? "new value" : match.Value;
});
To replace only the first match:
private string ReplaceFirst(string source, string oldString, string newString)
{
var index = source.IndexOf(oldString);
if (index < 0) return source; // no match: nothing to replace
var begin = source.Substring(0, index);
var end = source.Substring(index + oldString.Length);
return begin + newString + end;
}
You have a bug: you will miss the item to replace if it is at the beginning.
Change these lines:
int ret = start; // instead of zero, or you ignore the start parameter
// Find the next instance of the search string
// Do not skip olen for the first search!
int x = i == 0 ? source.IndexOf(org, ret) : source.IndexOf(org, ret + olen);
Also your routine does 300 thousand replaces a second on my machine. Are you sure this will be a bottleneck?
I also just found that your code has an issue if you replace larger texts with smaller texts.
The code below is 100% faster with four replacements and around 10% faster with one replacement (compared with the originally posted code). It uses the specified start parameter and works when replacing larger texts with smaller texts.
Mark Gravell's solution is (no offense ;-) 60% slower than the original code, and it also returns a different value.
// Method signature, Only replaces first instance or how many are specified in max
public static int MyReplace(ref string source, string org, string replace, int start, int max)
{
var ret = 0;
int x = start;
int reps = 0;
int l = source.Length;
int lastIdx = 0;
string repstring = "";
while (x < l)
{
// The length check keeps source[x + y] below from running past the end of the string.
if ((reps < max) && (x + org.Length <= l) && (source[x] == org[0]))
{
bool match = true;
for (int y = 1; y < org.Length; y++)
{
if (source[x + y] != org[y])
{
match = false;
break;
}
}
if (match)
{
repstring += source.Substring(lastIdx, x - lastIdx) + replace;
ret = x;
x += org.Length - 1;
reps++;
lastIdx = x + 1;
// Done?
if (reps == max)
{
source = repstring + source.Substring(lastIdx);
return ret;
}
}
}
x++;
}
if (reps > 0) // checking ret > 0 would miss a replacement made at index 0
{
source = repstring + source.Substring(lastIdx);
}
return ret;
}

Get the index of the nth occurrence of a string?

Unless I am missing an obvious built-in method, what is the quickest way to get the nth occurrence of a string within a string?
I realize that I could loop the IndexOf method by updating its start index on each iteration of the loop. But doing it this way seems wasteful to me.
You really could use the regular expression /((s).*?){n}/ to search for n-th occurrence of substring s.
In C# it might look like this:
public static class StringExtender
{
public static int NthIndexOf(this string target, string value, int n)
{
Match m = Regex.Match(target, "((" + Regex.Escape(value) + ").*?){" + n + "}");
if (m.Success)
return m.Groups[2].Captures[n - 1].Index;
else
return -1;
}
}
Note: I have added Regex.Escape to original solution to allow searching characters which have special meaning to regex engine.
That's basically what you need to do - or at least, it's the easiest solution. All you'd be "wasting" is the cost of n method invocations - you won't actually be checking any case twice, if you think about it. (IndexOf will return as soon as it finds the match, and you'll keep going from where it left off.)
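The loop described above can be sketched as follows. This is a minimal version, assuming an ordinal comparison and a non-empty search value; NthIndex and NthIndexOf are hypothetical names. Like the idea it illustrates, it resumes one character after the previous hit, so overlapping matches are counted.

```csharp
using System;

public static class NthIndex
{
    // Returns the index of the n-th (1-based) occurrence of 'value', or -1.
    public static int NthIndexOf(string text, string value, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (string.IsNullOrEmpty(value))
            throw new ArgumentException("value must be non-empty", nameof(value));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        int index = -1;
        for (int found = 0; found < n; found++)
        {
            // Continue right after the previous hit; nothing is scanned twice.
            index = text.IndexOf(value, index + 1, StringComparison.Ordinal);
            if (index < 0) return -1;
        }
        return index;
    }
}
```

Starting from index = -1 means the first search begins at position 0, so a match at the very start of the string is not skipped.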
Here is the recursive implementation (of the above idea) as an extension method, mimicking the format of the framework method(s):
public static int IndexOfNth(this string input,
string value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return input.IndexOfNth(value, idx + 1, --nth);
}
Also, here are some (MBUnit) unit tests that might help you (to prove it is correct):
using System;
using MbUnit.Framework;
namespace IndexOfNthTest
{
[TestFixture]
public class Tests
{
// has 2 instances of the token
private const string Input = "TestTest";
private const string Token = "Test";
/* Test for 0th index */
[Test]
public void TestZero()
{
Assert.Throws<NotSupportedException>(
() => Input.IndexOfNth(Token, 0, 0));
}
/* Test the two standard cases (1st and 2nd) */
[Test]
public void TestFirst()
{
Assert.AreEqual(0, Input.IndexOfNth("Test", 0, 1));
}
[Test]
public void TestSecond()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 0, 2));
}
/* Test the 'out of bounds' case */
[Test]
public void TestThird()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 0, 3));
}
/* Test the offset case (in and out of bounds) */
[Test]
public void TestFirstWithOneOffset()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 4, 1));
}
[Test]
public void TestFirstWithTwoOffsets()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 8, 1));
}
}
}
private int IndexOfOccurence(string s, string match, int occurence)
{
int i = 1;
int index = -1; // start at -1 so the first search begins at position 0
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}
or in C# with extension methods
public static int IndexOfOccurence(this string s, string match, int occurence)
{
int i = 1;
int index = -1; // start at -1 so the first search begins at position 0
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}
After some benchmarking, this seems to be the simplest and most efficient solution:
public static int IndexOfNthSB(string input,
char value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
var nResult = 0;
for (int i = startIndex; i < input.Length; i++)
{
if (input[i] == value)
nResult++;
if (nResult == nth)
return i;
}
return -1;
}
Here I go again! Another benchmark answer from yours truly :-) Once again based on the fantastic BenchmarkDotNet package (if you're serious about benchmarking dotnet code, please, please use this package).
The motivation for this post is twofold: PeteT (who asked the question originally) wondered whether it is wasteful to call String.IndexOf in a loop, varying the startIndex parameter, to find the nth occurrence of a character, when in fact it's the fastest method; and some answers use regular expressions, which are an order of magnitude slower (and do not add any benefit, in my opinion not even readability, in this specific case).
Here is the code I've ended up using in my string extensions library (it's not a new answer to this question, since others have already posted semantically identical code here, I'm not taking credit for it). This is the fastest method (even, possibly, including unsafe variations - more on that later):
public static int IndexOfNth(this string str, char ch, int nth, int startIndex = 0) {
    if (str == null)
        throw new ArgumentNullException(nameof(str));
    var idx = str.IndexOf(ch, startIndex);
    // idx is already an absolute position, so resume the search right after it
    while (idx >= 0 && --nth > 0)
        idx = str.IndexOf(ch, idx + 1);
    return idx;
}
I've benchmarked this code against two other methods and the results follow:
The benchmarked methods were:
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
The FindNthRegex method is the slowest of the bunch, taking an order (or two) of magnitude more time than the fastest. FindNthCharByChar loops over each char in the string and counts each match until it finds the nth occurrence. FindNthIndexOfStartIdx uses the method suggested by the opener of this question, which is indeed the same one I've been using for ages to accomplish this, and it is the fastest of them all.
Why is it so much faster than FindNthCharByChar? It's because Microsoft went to great lengths to make string manipulation as fast as possible in the dotnet framework. And they've accomplished that! They did an amazing job! I've done a deeper investigation of string manipulation in dotnet in a CodeProject article, which tries to find the fastest method to remove all whitespace from a string:
Fastest method to remove all whitespace from Strings in .NET
There you'll find why string manipulations in dotnet are so fast, and why it's next to useless to try to squeeze out more speed by writing our own versions of the framework's string manipulation code (the likes of string.IndexOf, string.Split, string.Replace, etc.).
The full benchmark code I've used follows (it's a dotnet6 console program):
UPDATE: Added two methods FindNthCharByCharInSpan and FindNthCharRecursive (and now FindNthByLinq).
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
var summary = BenchmarkRunner.Run<BenchmarkFindNthChar>();
public class BenchmarkFindNthChar
{
const string BaseText = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
[Params(100, 1000)]
public int BaseTextRepeatCount { get; set; }
[Params(500)]
public int Nth { get; set; }
private string text;
[GlobalSetup]
public void BuildTestData() {
var sb = new StringBuilder();
for (int i = 0; i < BaseTextRepeatCount; i++)
sb.AppendLine(BaseText);
text = sb.ToString();
}
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
[Benchmark]
public int FindNthCharByCharInSpan() {
var span = text.AsSpan();
var occurrence = 0;
for (int i = 0; i < span.Length; i++) {
if (span[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthCharRecursive() => IndexOfNth(text, "z", 0, Nth);
public static int IndexOfNth(string input, string value, int startIndex, int nth) {
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return IndexOfNth(input, value, idx + 1, --nth);
}
[Benchmark]
public int FindNthByLinq() {
var items = text.Select((c, i) => (c, i)).Where(t => t.c == 'z');
return (items.Count() > Nth - 1)
? items.ElementAt(Nth - 1).i
: -1;
}
}
UPDATE 2: The new benchmark results (with the LINQ-based benchmark) follow:
The LINQ-based solution is only better than the recursive method, but it's good to have it here for completeness.
If you need the value at the requested occurrence rather than its index, it may also be worth using the String.Split() method and checking whether the requested occurrence exists in the resulting array.
Or something like this, with a do-while loop:
// Note: n is zero-based here (n = 0 returns the first occurrence).
private static int OrdinalIndexOf(string str, string substr, int n)
{
    int pos = -1;
    do
    {
        pos = str.IndexOf(substr, pos + 1);
    } while (n-- > 0 && pos != -1);
    return pos;
}
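The String.Split() idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from the original answer: the class name SplitSketch and the helper TryGetFieldAfterNth are made up for the example, and nth is assumed to be 1-based. Keep in mind that Split allocates a new string for every field, so this is the convenient route rather than the fast one.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper (not from the original answers): gets the text that
    // follows the nth separator (1-based). Returns false when the input holds
    // fewer than nth separators.
    public static bool TryGetFieldAfterNth(string input, char separator, int nth, out string field)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (nth < 1) throw new ArgumentOutOfRangeException(nameof(nth));

        // parts[0] is the text before the first separator, parts[1] the text
        // after the first separator, and so on.
        string[] parts = input.Split(separator);
        if (nth < parts.Length)
        {
            field = parts[nth];
            return true;
        }

        field = string.Empty;
        return false;
    }
}
```

For example, calling it on "a|b|c" with '|' and nth = 2 yields "c", while nth = 3 reports failure because there are only two separators.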
System.ValueTuple ftw:
var index = line.Select((x, i) => (x, i)).Where(x => x.Item1 == '"').ElementAt(5).Item2;
Writing a function from that is left as homework.
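For what it's worth, one possible way to do that homework is sketched below. The names LinqIndexSketch and IndexOfNthLinq are invented for this example, and nth is assumed to be 1-based (so the ElementAt(5) in the one-liner above corresponds to nth = 6). Unlike ElementAt, this version returns -1 instead of throwing when there are too few matches.

```csharp
using System;
using System.Linq;

public static class LinqIndexSketch
{
    // Hypothetical helper: index of the nth (1-based) occurrence of ch in input,
    // or -1 if input contains fewer than nth occurrences.
    public static int IndexOfNthLinq(string input, char ch, int nth)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (nth < 1) throw new ArgumentOutOfRangeException(nameof(nth));

        return input
            .Select((c, i) => (c, i))   // pair each char with its position
            .Where(t => t.c == ch)      // keep only the matching chars
            .Select(t => t.i)           // project to the position
            .Skip(nth - 1)              // drop the first nth - 1 matches
            .DefaultIfEmpty(-1)         // fall back to -1 when nothing is left
            .First();
    }
}
```

As the benchmark answer in this thread suggests, a LINQ pipeline like this is likely to be much slower than an IndexOf loop, so it is mostly attractive for its brevity.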
Tod's answer can be simplified somewhat.
using System;
static class MainClass {
private static int IndexOfNth(this string target, string substring,
int seqNr, int startIdx = 0)
{
if (seqNr < 1)
{
throw new ArgumentOutOfRangeException(nameof(seqNr), "Parameter 'seqNr' must be greater than 0.");
}
var idx = target.IndexOf(substring, startIdx);
if (idx < 0 || seqNr == 1) { return idx; }
return target.IndexOfNth(substring, --seqNr, ++idx); // skip
}
static void Main () {
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 1));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 2));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 3));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 4));
}
}
Output
1
3
5
-1
This might do it:
Console.WriteLine(str.IndexOf((@"\")+2)+1);
