Get the index of the nth occurrence of a string? - c#

Unless I am missing an obvious built-in method, what is the quickest way to get the nth occurrence of a string within a string?
I realize that I could loop the IndexOf method by updating its start index on each iteration of the loop. But doing it this way seems wasteful to me.

You really could use the regular expression /((s).*?){n}/ to search for n-th occurrence of substring s.
In C# it might look like this:
public static class StringExtender
{
public static int NthIndexOf(this string target, string value, int n)
{
Match m = Regex.Match(target, "((" + Regex.Escape(value) + ").*?){" + n + "}");
if (m.Success)
return m.Groups[2].Captures[n - 1].Index;
else
return -1;
}
}
Note: I have added Regex.Escape to original solution to allow searching characters which have special meaning to regex engine.

That's basically what you need to do - or at least, it's the easiest solution. All you'd be "wasting" is the cost of n method invocations - you won't actually be checking any case twice, if you think about it. (IndexOf will return as soon as it finds the match, and you'll keep going from where it left off.)

That's basically what you need to do - or at least, it's the easiest solution. All you'd be "wasting" is the cost of n method invocations - you won't actually be checking any case twice, if you think about it. (IndexOf will return as soon as it finds the match, and you'll keep going from where it left off.)
Here is the recursive implementation (of the above idea) as an extension method, mimicing the format of the framework method(s):
public static int IndexOfNth(this string input,
string value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return input.IndexOfNth(value, idx + 1, --nth);
}
Also, here are some (MBUnit) unit tests that might help you (to prove it is correct):
using System;
using MbUnit.Framework;
namespace IndexOfNthTest
{
[TestFixture]
public class Tests
{
//has 4 instances of the
private const string Input = "TestTest";
private const string Token = "Test";
/* Test for 0th index */
[Test]
public void TestZero()
{
Assert.Throws<NotSupportedException>(
() => Input.IndexOfNth(Token, 0, 0));
}
/* Test the two standard cases (1st and 2nd) */
[Test]
public void TestFirst()
{
Assert.AreEqual(0, Input.IndexOfNth("Test", 0, 1));
}
[Test]
public void TestSecond()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 0, 2));
}
/* Test the 'out of bounds' case */
[Test]
public void TestThird()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 0, 3));
}
/* Test the offset case (in and out of bounds) */
[Test]
public void TestFirstWithOneOffset()
{
Assert.AreEqual(4, Input.IndexOfNth("Test", 4, 1));
}
[Test]
public void TestFirstWithTwoOffsets()
{
Assert.AreEqual(-1, Input.IndexOfNth("Test", 8, 1));
}
}
}

private int IndexOfOccurence(string s, string match, int occurence)
{
int i = 1;
int index = 0;
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}
or in C# with extension methods
public static int IndexOfOccurence(this string s, string match, int occurence)
{
int i = 1;
int index = 0;
while (i <= occurence && (index = s.IndexOf(match, index + 1)) != -1)
{
if (i == occurence)
return index;
i++;
}
return -1;
}

After some benchmarking, this seems to be the simplest and most effcient solution
public static int IndexOfNthSB(string input,
char value, int startIndex, int nth)
{
if (nth < 1)
throw new NotSupportedException("Param 'nth' must be greater than 0!");
var nResult = 0;
for (int i = startIndex; i < input.Length; i++)
{
if (input[i] == value)
nResult++;
if (nResult == nth)
return i;
}
return -1;
}

Here I go again! Another benchmark answer from yours truly :-) Once again based on the fantastic BenchmarkDotNet package (if you're serious about benchmarking dotnet code, please, please use this package).
The motivation for this post is two fold: PeteT (who asked it originally) wondered that it seems wasteful to use String.IndexOf varying the startIndex parameter in a loop to find the nth occurrence of a character while, in fact, it's the fastest method, and because some answers uses regular expressions which are an order of magnitude slower (and do not add any benefits, in my opinion not even readability, in this specific case).
Here is the code I've ended up using in my string extensions library (it's not a new answer to this question, since others have already posted semantically identical code here, I'm not taking credit for it). This is the fastest method (even, possibly, including unsafe variations - more on that later):
public static int IndexOfNth(this string str, char ch, int nth, int startIndex = 0) {
if (str == null)
throw new ArgumentNullException("str");
var idx = str.IndexOf(ch, startIndex);
while (idx >= 0 && --nth > 0)
idx = str.IndexOf(ch, startIndex + idx + 1);
return idx;
}
I've benchmarked this code against two other methods and the results follow:
The benchmarked methods were:
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
The FindNthRegex method is the slower of the bunch, taking an order (or two) of magnitude more time than the fastest. FindNthByChar loops over each char on the string and counts each match until it finds the nth occurrence. FindNthIndexOfStartIdx uses the method suggested by the opener of this question which, indeed, is the same I've been using for ages to accomplish this and it is the fastest of them all.
Why is it so much faster than FindNthByChar? It's because Microsoft went to great lengths to make string manipulation as fast as possible in the dotnet framework. And they've accomplished that! They did an amazing job! I've done a deeper investigation on string manipulations in dotnet in an CodeProject article which tries to find the fastest method to remove all whitespace from a string:
Fastest method to remove all whitespace from Strings in .NET
There you'll find why string manipulations in dotnet are so fast, and why it's next to useless trying to squeeze more speed by writing our own versions of the framework's string manipulation code (the likes of string.IndexOf, string.Split, string.Replace, etc.)
The full benchmark code I've used follows (it's a dotnet6 console program):
UPDATE: Added two methods FindNthCharByCharInSpan and FindNthCharRecursive (and now FindNthByLinq).
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;
using System.Text.RegularExpressions;
var summary = BenchmarkRunner.Run<BenchmarkFindNthChar>();
public class BenchmarkFindNthChar
{
const string BaseText = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
[Params(100, 1000)]
public int BaseTextRepeatCount { get; set; }
[Params(500)]
public int Nth { get; set; }
private string text;
[GlobalSetup]
public void BuildTestData() {
var sb = new StringBuilder();
for (int i = 0; i < BaseTextRepeatCount; i++)
sb.AppendLine(BaseText);
text = sb.ToString();
}
[Benchmark]
public int FindNthRegex() {
Match m = Regex.Match(text, "((" + Regex.Escape("z") + ").*?){" + Nth + "}");
return (m.Success)
? m.Groups[2].Captures[Nth - 1].Index
: -1;
}
[Benchmark]
public int FindNthCharByChar() {
var occurrence = 0;
for (int i = 0; i < text.Length; i++) {
if (text[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthIndexOfStartIdx() {
var idx = text.IndexOf('z', 0);
var nth = Nth;
while (idx >= 0 && --nth > 0)
idx = text.IndexOf('z', idx + 1);
return idx;
}
[Benchmark]
public int FindNthCharByCharInSpan() {
var span = text.AsSpan();
var occurrence = 0;
for (int i = 0; i < span.Length; i++) {
if (span[i] == 'z')
occurrence++;
if (Nth == occurrence)
return i;
}
return -1;
}
[Benchmark]
public int FindNthCharRecursive() => IndexOfNth(text, "z", 0, Nth);
public static int IndexOfNth(string input, string value, int startIndex, int nth) {
if (nth == 1)
return input.IndexOf(value, startIndex);
var idx = input.IndexOf(value, startIndex);
if (idx == -1)
return -1;
return IndexOfNth(input, value, idx + 1, --nth);
}
[Benchmark]
public int FindNthByLinq() {
var items = text.Select((c, i) => (c, i)).Where(t => t.c == 'z');
return (items.Count() > Nth - 1)
? items.ElementAt(Nth - 1).i
: -1;
}
}
UPDATE 2: The new benchmark results (with Linq-based benchmark) follows:
The Linq-based solution is only better than the recursive method, but it's good to have it here for completeness.

Maybe it would also be nice to work with the String.Split() Method and check if the requested occurrence is in the array, if you don't need the index, but the value at the index

Or something like this with the do while loop
private static int OrdinalIndexOf(string str, string substr, int n)
{
int pos = -1;
do
{
pos = str.IndexOf(substr, pos + 1);
} while (n-- > 0 && pos != -1);
return pos;
}

System.ValueTuple ftw:
var index = line.Select((x, i) => (x, i)).Where(x => x.Item1 == '"').ElementAt(5).Item2;
writing a function from that is homework

Tod's answer can be simplified somewhat.
using System;
static class MainClass {
private static int IndexOfNth(this string target, string substring,
int seqNr, int startIdx = 0)
{
if (seqNr < 1)
{
throw new IndexOutOfRangeException("Parameter 'nth' must be greater than 0.");
}
var idx = target.IndexOf(substring, startIdx);
if (idx < 0 || seqNr == 1) { return idx; }
return target.IndexOfNth(substring, --seqNr, ++idx); // skip
}
static void Main () {
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 1));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 2));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 3));
Console.WriteLine ("abcbcbcd".IndexOfNth("bc", 4));
}
}
Output
1
3
5
-1

This might do it:
Console.WriteLine(str.IndexOf((#"\")+2)+1);

Related

Index out of bounds of array but only sometimes [duplicate]

Suppose I had a string:
string str = "1111222233334444";
How can I break this string into chunks of some size?
e.g., breaking this into sizes of 4 would return strings:
"1111"
"2222"
"3333"
"4444"
static IEnumerable<string> Split(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
Please note that additional code might be required to gracefully handle edge cases (null or empty input string, chunkSize == 0, input string length not divisible by chunkSize, etc.). The original question doesn't specify any requirements for these edge cases and in real life the requirements might vary so they are out of scope of this answer.
In a combination of dove+Konstatin's answers...
static IEnumerable<string> WholeChunks(string str, int chunkSize) {
for (int i = 0; i < str.Length; i += chunkSize)
yield return str.Substring(i, chunkSize);
}
This will work for all strings that can be split into a whole number of chunks, and will throw an exception otherwise.
If you want to support strings of any length you could use the following code:
static IEnumerable<string> ChunksUpto(string str, int maxChunkSize) {
for (int i = 0; i < str.Length; i += maxChunkSize)
yield return str.Substring(i, Math.Min(maxChunkSize, str.Length-i));
}
However, the the OP explicitly stated he does not need this; it's somewhat longer and harder to read, slightly slower. In the spirit of KISS and YAGNI, I'd go with the first option: it's probably the most efficient implementation possible, and it's very short, readable, and, importantly, throws an exception for nonconforming input.
Why not loops? Here's something that would do it quite well:
string str = "111122223333444455";
int chunkSize = 4;
int stringLength = str.Length;
for (int i = 0; i < stringLength ; i += chunkSize)
{
if (i + chunkSize > stringLength) chunkSize = stringLength - i;
Console.WriteLine(str.Substring(i, chunkSize));
}
Console.ReadLine();
I don't know how you'd deal with case where the string is not factor of 4, but not saying you're idea is not possible, just wondering the motivation for it if a simple for loop does it very well? Obviously the above could be cleaned and even put in as an extension method.
Or as mentioned in comments, you know it's /4 then
str = "1111222233334444";
for (int i = 0; i < stringLength; i += chunkSize)
{Console.WriteLine(str.Substring(i, chunkSize));}
This is based on #dove solution but implemented as an extension method.
Benefits:
Extension method
Covers corner cases
Splits string with any chars: numbers, letters, other symbols
Code
public static class EnumerableEx
{
public static IEnumerable<string> SplitBy(this string str, int chunkLength)
{
if (String.IsNullOrEmpty(str)) throw new ArgumentException();
if (chunkLength < 1) throw new ArgumentException();
for (int i = 0; i < str.Length; i += chunkLength)
{
if (chunkLength + i > str.Length)
chunkLength = str.Length - i;
yield return str.Substring(i, chunkLength);
}
}
}
Usage
var result = "bobjoecat".SplitBy(3); // bob, joe, cat
Unit tests removed for brevity (see previous revision)
Using regular expressions and Linq:
List<string> groups = (from Match m in Regex.Matches(str, #"\d{4}")
select m.Value).ToList();
I find this to be more readable, but it's just a personal opinion. It can also be a one-liner : ).
How's this for a one-liner?
List<string> result = new List<string>(Regex.Split(target, #"(?<=\G.{4})", RegexOptions.Singleline));
With this regex it doesn't matter if the last chunk is less than four characters, because it only ever looks at the characters behind it.
I'm sure this isn't the most efficient solution, but I just had to toss it out there.
Starting with .NET 6, we can also use the Chunk method:
var result = str
.Chunk(4)
.Select(x => new string(x))
.ToList();
I recently had to write something that accomplishes this at work, so I thought I would post my solution to this problem. As an added bonus, the functionality of this solution provides a way to split the string in the opposite direction and it does correctly handle unicode characters as previously mentioned by Marvin Pinto above. So, here it is:
using System;
using Extensions;
namespace TestCSharp
{
class Program
{
static void Main(string[] args)
{
string asciiStr = "This is a string.";
string unicodeStr = "これは文字列です。";
string[] array1 = asciiStr.Split(4);
string[] array2 = asciiStr.Split(-4);
string[] array3 = asciiStr.Split(7);
string[] array4 = asciiStr.Split(-7);
string[] array5 = unicodeStr.Split(5);
string[] array6 = unicodeStr.Split(-5);
}
}
}
namespace Extensions
{
public static class StringExtensions
{
/// <summary>Returns a string array that contains the substrings in this string that are seperated a given fixed length.</summary>
/// <param name="s">This string object.</param>
/// <param name="length">Size of each substring.
/// <para>CASE: length > 0 , RESULT: String is split from left to right.</para>
/// <para>CASE: length == 0 , RESULT: String is returned as the only entry in the array.</para>
/// <para>CASE: length < 0 , RESULT: String is split from right to left.</para>
/// </param>
/// <returns>String array that has been split into substrings of equal length.</returns>
/// <example>
/// <code>
/// string s = "1234567890";
/// string[] a = s.Split(4); // a == { "1234", "5678", "90" }
/// </code>
/// </example>
public static string[] Split(this string s, int length)
{
System.Globalization.StringInfo str = new System.Globalization.StringInfo(s);
int lengthAbs = Math.Abs(length);
if (str == null || str.LengthInTextElements == 0 || lengthAbs == 0 || str.LengthInTextElements <= lengthAbs)
return new string[] { str.ToString() };
string[] array = new string[(str.LengthInTextElements % lengthAbs == 0 ? str.LengthInTextElements / lengthAbs: (str.LengthInTextElements / lengthAbs) + 1)];
if (length > 0)
for (int iStr = 0, iArray = 0; iStr < str.LengthInTextElements && iArray < array.Length; iStr += lengthAbs, iArray++)
array[iArray] = str.SubstringByTextElements(iStr, (str.LengthInTextElements - iStr < lengthAbs ? str.LengthInTextElements - iStr : lengthAbs));
else // if (length < 0)
for (int iStr = str.LengthInTextElements - 1, iArray = array.Length - 1; iStr >= 0 && iArray >= 0; iStr -= lengthAbs, iArray--)
array[iArray] = str.SubstringByTextElements((iStr - lengthAbs < 0 ? 0 : iStr - lengthAbs + 1), (iStr - lengthAbs < 0 ? iStr + 1 : lengthAbs));
return array;
}
}
}
Also, here is an image link to the results of running this code: http://i.imgur.com/16Iih.png
It's not pretty and it's not fast, but it works, it's a one-liner and it's LINQy:
List<string> a = text.Select((c, i) => new { Char = c, Index = i }).GroupBy(o => o.Index / 4).Select(g => new String(g.Select(o => o.Char).ToArray())).ToList();
This should be much faster and more efficient than using LINQ or other approaches used here.
public static IEnumerable<string> Splice(this string s, int spliceLength)
{
if (s == null)
throw new ArgumentNullException("s");
if (spliceLength < 1)
throw new ArgumentOutOfRangeException("spliceLength");
if (s.Length == 0)
yield break;
var start = 0;
for (var end = spliceLength; end < s.Length; end += spliceLength)
{
yield return s.Substring(start, spliceLength);
start = end;
}
yield return s.Substring(start);
}
You can use morelinq by Jon Skeet. Use Batch like:
string str = "1111222233334444";
int chunkSize = 4;
var chunks = str.Batch(chunkSize).Select(r => new String(r.ToArray()));
This will return 4 chunks for the string "1111222233334444". If the string length is less than or equal to the chunk size Batch will return the string as the only element of IEnumerable<string>
For output:
foreach (var chunk in chunks)
{
Console.WriteLine(chunk);
}
and it will give:
1111
2222
3333
4444
Personally I prefer my solution :-)
It handles:
String lengths that are a multiple of the chunk size.
String lengths that are NOT a multiple of the chunk size.
String lengths that are smaller than the chunk size.
NULL and empty strings (throws an exception).
Chunk sizes smaller than 1 (throws an exception).
It is implemented as a extension method, and it calculates the number of chunks is going to generate beforehand. It checks the last chunk because in case the text length is not a multiple it needs to be shorter. Clean, short, easy to understand... and works!
public static string[] Split(this string value, int chunkSize)
{
if (string.IsNullOrEmpty(value)) throw new ArgumentException("The string cannot be null.");
if (chunkSize < 1) throw new ArgumentException("The chunk size should be equal or greater than one.");
int remainder;
int divResult = Math.DivRem(value.Length, chunkSize, out remainder);
int numberOfChunks = remainder > 0 ? divResult + 1 : divResult;
var result = new string[numberOfChunks];
int i = 0;
while (i < numberOfChunks - 1)
{
result[i] = value.Substring(i * chunkSize, chunkSize);
i++;
}
int lastChunkSize = remainder > 0 ? remainder : chunkSize;
result[i] = value.Substring(i * chunkSize, lastChunkSize);
return result;
}
Simple and short:
// this means match a space or not a space (anything) up to 4 characters
var lines = Regex.Matches(str, #"[\s\S]{0,4}").Cast<Match>().Select(x => x.Value);
I know question is years old, but here is a Rx implementation. It handles the length % chunkSize != 0 problem out of the box:
public static IEnumerable<string> Chunkify(this string input, int size)
{
if(size < 1)
throw new ArgumentException("size must be greater than 0");
return input.ToCharArray()
.ToObservable()
.Buffer(size)
.Select(x => new string(x.ToArray()))
.ToEnumerable();
}
public static IEnumerable<IEnumerable<T>> SplitEvery<T>(this IEnumerable<T> values, int n)
{
var ls = values.Take(n);
var rs = values.Skip(n);
return ls.Any() ?
Cons(ls, SplitEvery(rs, n)) :
Enumerable.Empty<IEnumerable<T>>();
}
public static IEnumerable<T> Cons<T>(T x, IEnumerable<T> xs)
{
yield return x;
foreach (var xi in xs)
yield return xi;
}
Best , Easiest and Generic Answer :).
string originalString = "1111222233334444";
List<string> test = new List<string>();
int chunkSize = 4; // change 4 with the size of strings you want.
for (int i = 0; i < originalString.Length; i = i + chunkSize)
{
if (originalString.Length - i >= chunkSize)
test.Add(originalString.Substring(i, chunkSize));
else
test.Add(originalString.Substring(i,((originalString.Length - i))));
}
static IEnumerable<string> Split(string str, int chunkSize)
{
IEnumerable<string> retVal = Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize))
if (str.Length % chunkSize > 0)
retVal = retVal.Append(str.Substring(str.Length / chunkSize * chunkSize, str.Length % chunkSize));
return retVal;
}
It correctly handles input string length not divisible by chunkSize.
Please note that additional code might be required to gracefully handle edge cases (null or empty input string, chunkSize == 0).
static IEnumerable<string> Split(string str, double chunkSize)
{
return Enumerable.Range(0, (int) Math.Ceiling(str.Length/chunkSize))
.Select(i => new string(str
.Skip(i * (int)chunkSize)
.Take((int)chunkSize)
.ToArray()));
}
and another approach:
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main()
{
var x = "Hello World";
foreach(var i in x.ChunkString(2)) Console.WriteLine(i);
}
}
public static class Ext{
public static IEnumerable<string> ChunkString(this string val, int chunkSize){
return val.Select((x,i) => new {Index = i, Value = x})
.GroupBy(x => x.Index/chunkSize, x => x.Value)
.Select(x => string.Join("",x));
}
}
Six years later o_O
Just because
public static IEnumerable<string> Split(this string str, int chunkSize, bool remainingInFront)
{
var count = (int) Math.Ceiling(str.Length/(double) chunkSize);
Func<int, int> start = index => remainingInFront ? str.Length - (count - index)*chunkSize : index*chunkSize;
Func<int, int> end = index => Math.Min(str.Length - Math.Max(start(index), 0), Math.Min(start(index) + chunkSize - Math.Max(start(index), 0), chunkSize));
return Enumerable.Range(0, count).Select(i => str.Substring(Math.Max(start(i), 0),end(i)));
}
or
private static Func<bool, int, int, int, int, int> start = (remainingInFront, length, count, index, size) =>
remainingInFront ? length - (count - index) * size : index * size;
private static Func<bool, int, int, int, int, int, int> end = (remainingInFront, length, count, index, size, start) =>
Math.Min(length - Math.Max(start, 0), Math.Min(start + size - Math.Max(start, 0), size));
public static IEnumerable<string> Split(this string str, int chunkSize, bool remainingInFront)
{
var count = (int)Math.Ceiling(str.Length / (double)chunkSize);
return Enumerable.Range(0, count).Select(i => str.Substring(
Math.Max(start(remainingInFront, str.Length, count, i, chunkSize), 0),
end(remainingInFront, str.Length, count, i, chunkSize, start(remainingInFront, str.Length, count, i, chunkSize))
));
}
AFAIK all edge cases are handled.
Console.WriteLine(string.Join(" ", "abc".Split(2, false))); // ab c
Console.WriteLine(string.Join(" ", "abc".Split(2, true))); // a bc
Console.WriteLine(string.Join(" ", "a".Split(2, true))); // a
Console.WriteLine(string.Join(" ", "a".Split(2, false))); // a
List<string> SplitString(int chunk, string input)
{
List<string> list = new List<string>();
int cycles = input.Length / chunk;
if (input.Length % chunk != 0)
cycles++;
for (int i = 0; i < cycles; i++)
{
try
{
list.Add(input.Substring(i * chunk, chunk));
}
catch
{
list.Add(input.Substring(i * chunk));
}
}
return list;
}
I took this to another level. Chucking is an easy one liner, but in my case I needed whole words as well. Figured I would post it, just in case someone else needs something similar.
static IEnumerable<string> Split(string orgString, int chunkSize, bool wholeWords = true)
{
if (wholeWords)
{
List<string> result = new List<string>();
StringBuilder sb = new StringBuilder();
if (orgString.Length > chunkSize)
{
string[] newSplit = orgString.Split(' ');
foreach (string str in newSplit)
{
if (sb.Length != 0)
sb.Append(" ");
if (sb.Length + str.Length > chunkSize)
{
result.Add(sb.ToString());
sb.Clear();
}
sb.Append(str);
}
result.Add(sb.ToString());
}
else
result.Add(orgString);
return result;
}
else
return new List<string>(Regex.Split(orgString, #"(?<=\G.{" + chunkSize + "})", RegexOptions.Singleline));
}
Results based on below comment:
string msg = "336699AABBCCDDEEFF";
foreach (string newMsg in Split(msg, 2, false))
{
Console.WriteLine($">>{newMsg}<<");
}
Console.ReadKey();
Results:
>>33<<
>>66<<
>>99<<
>>AA<<
>>BB<<
>>CC<<
>>DD<<
>>EE<<
>>FF<<
>><<
Another way to pull it:
List<string> splitData = (List<string>)Split(msg, 2, false);
for (int i = 0; i < splitData.Count - 1; i++)
{
Console.WriteLine($">>{splitData[i]}<<");
}
Console.ReadKey();
New Results:
>>33<<
>>66<<
>>99<<
>>AA<<
>>BB<<
>>CC<<
>>DD<<
>>EE<<
>>FF<<
An important tip if the string that is being chunked needs to support all Unicode characters.
If the string is to support international characters like 𠀋, then split up the string using the System.Globalization.StringInfo class. Using StringInfo, you can split up the string based on number of text elements.
string internationalString = '𠀋';
The above string has a Length of 2, because the String.Length property returns the number of Char objects in this instance, not the number of Unicode characters.
Changed slightly to return parts whose size not equal to chunkSize
public static IEnumerable<string> Split(this string str, int chunkSize)
{
var splits = new List<string>();
if (str.Length < chunkSize) { chunkSize = str.Length; }
splits.AddRange(Enumerable.Range(0, str.Length / chunkSize).Select(i => str.Substring(i * chunkSize, chunkSize)));
splits.Add(str.Length % chunkSize > 0 ? str.Substring((str.Length / chunkSize) * chunkSize, str.Length - ((str.Length / chunkSize) * chunkSize)) : string.Empty);
return (IEnumerable<string>)splits;
}
I think this is an straight forward answer:
public static IEnumerable<string> Split(this string str, int chunkSize)
{
if(string.IsNullOrEmpty(str) || chunkSize<1)
throw new ArgumentException("String can not be null or empty and chunk size should be greater than zero.");
var chunkCount = str.Length / chunkSize + (str.Length % chunkSize != 0 ? 1 : 0);
for (var i = 0; i < chunkCount; i++)
{
var startIndex = i * chunkSize;
if (startIndex + chunkSize >= str.Length)
yield return str.Substring(startIndex);
else
yield return str.Substring(startIndex, chunkSize);
}
}
And it covers edge cases.
static List<string> GetChunks(string value, int chunkLength)
{
var res = new List<string>();
int count = (value.Length / chunkLength) + (value.Length % chunkLength > 0 ? 1 : 0);
Enumerable.Range(0, count).ToList().ForEach(f => res.Add(value.Skip(f * chunkLength).Take(chunkLength).Select(z => z.ToString()).Aggregate((a,b) => a+b)));
return res;
}
demo
Here's my 2 cents:
IEnumerable<string> Split(string str, int chunkSize)
{
while (!string.IsNullOrWhiteSpace(str))
{
var chunk = str.Take(chunkSize).ToArray();
str = str.Substring(chunk.Length);
yield return new string(chunk);
}
}//Split
I've slightly build up on João's solution.
What I've done differently is in my method you can actually specify whether you want to return the array with remaining characters or whether you want to truncate them if the end characters do not match your required chunk length, I think it's pretty flexible and the code is fairly straight forward:
using System;
using System.Linq;
using System.Text.RegularExpressions;
namespace SplitFunction
{
class Program
{
static void Main(string[] args)
{
string text = "hello, how are you doing today?";
string[] chunks = SplitIntoChunks(text, 3,false);
if (chunks != null)
{
chunks.ToList().ForEach(e => Console.WriteLine(e));
}
Console.ReadKey();
}
private static string[] SplitIntoChunks(string text, int chunkSize, bool truncateRemaining)
{
string chunk = chunkSize.ToString();
string pattern = truncateRemaining ? ".{" + chunk + "}" : ".{1," + chunk + "}";
string[] chunks = null;
if (chunkSize > 0 && !String.IsNullOrEmpty(text))
chunks = (from Match m in Regex.Matches(text,pattern)select m.Value).ToArray();
return chunks;
}
}
}
public static List<string> SplitByMaxLength(this string str)
{
List<string> splitString = new List<string>();
for (int index = 0; index < str.Length; index += MaxLength)
{
splitString.Add(str.Substring(index, Math.Min(MaxLength, str.Length - index)));
}
return splitString;
}
I can't remember who gave me this, but it works great. I speed tested a number of ways to break Enumerable types into groups. The usage would just be like this...
List<string> Divided = Source3.Chunk(24).Select(Piece => string.Concat<char>(Piece)).ToList();
The extention code would look like this...
#region Chunk Logic
private class ChunkedEnumerable<T> : IEnumerable<T>
{
class ChildEnumerator : IEnumerator<T>
{
ChunkedEnumerable<T> parent;
int position;
bool done = false;
T current;
public ChildEnumerator(ChunkedEnumerable<T> parent)
{
this.parent = parent;
position = -1;
parent.wrapper.AddRef();
}
public T Current
{
get
{
if (position == -1 || done)
{
throw new InvalidOperationException();
}
return current;
}
}
public void Dispose()
{
if (!done)
{
done = true;
parent.wrapper.RemoveRef();
}
}
object System.Collections.IEnumerator.Current
{
get { return Current; }
}
public bool MoveNext()
{
position++;
if (position + 1 > parent.chunkSize)
{
done = true;
}
if (!done)
{
done = !parent.wrapper.Get(position + parent.start, out current);
}
return !done;
}
public void Reset()
{
// per http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.reset.aspx
throw new NotSupportedException();
}
}
EnumeratorWrapper<T> wrapper;
int chunkSize;
int start;
public ChunkedEnumerable(EnumeratorWrapper<T> wrapper, int chunkSize, int start)
{
this.wrapper = wrapper;
this.chunkSize = chunkSize;
this.start = start;
}
public IEnumerator<T> GetEnumerator()
{
return new ChildEnumerator(this);
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
private class EnumeratorWrapper<T>
{
public EnumeratorWrapper(IEnumerable<T> source)
{
SourceEumerable = source;
}
IEnumerable<T> SourceEumerable { get; set; }
Enumeration currentEnumeration;
class Enumeration
{
public IEnumerator<T> Source { get; set; }
public int Position { get; set; }
public bool AtEnd { get; set; }
}
public bool Get(int pos, out T item)
{
if (currentEnumeration != null && currentEnumeration.Position > pos)
{
currentEnumeration.Source.Dispose();
currentEnumeration = null;
}
if (currentEnumeration == null)
{
currentEnumeration = new Enumeration { Position = -1, Source = SourceEumerable.GetEnumerator(), AtEnd = false };
}
item = default(T);
if (currentEnumeration.AtEnd)
{
return false;
}
while (currentEnumeration.Position < pos)
{
currentEnumeration.AtEnd = !currentEnumeration.Source.MoveNext();
currentEnumeration.Position++;
if (currentEnumeration.AtEnd)
{
return false;
}
}
item = currentEnumeration.Source.Current;
return true;
}
int refs = 0;
// needed for dispose semantics
public void AddRef()
{
refs++;
}
public void RemoveRef()
{
refs--;
if (refs == 0 && currentEnumeration != null)
{
var copy = currentEnumeration;
currentEnumeration = null;
copy.Source.Dispose();
}
}
}
/// <summary>Speed Checked. Works Great!</summary>
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunksize)
{
if (chunksize < 1) throw new InvalidOperationException();
var wrapper = new EnumeratorWrapper<T>(source);
int currentPos = 0;
T ignore;
try
{
wrapper.AddRef();
while (wrapper.Get(currentPos, out ignore))
{
yield return new ChunkedEnumerable<T>(wrapper, chunksize, currentPos);
currentPos += chunksize;
}
}
finally
{
wrapper.RemoveRef();
}
}
#endregion
class StringHelper
{
static void Main(string[] args)
{
string str = "Hi my name is vikas bansal and my email id is bansal.vks#gmail.com";
int offSet = 10;
List<string> chunks = chunkMyStr(str, offSet);
Console.Read();
}
static List<string> chunkMyStr(string str, int offSet)
{
List<string> resultChunks = new List<string>();
for (int i = 0; i < str.Length; i += offSet)
{
string temp = str.Substring(i, (str.Length - i) > offSet ? offSet : (str.Length - i));
Console.WriteLine(temp);
resultChunks.Add(temp);
}
return resultChunks;
}
}

How can I make my trie more efficient?

I'm working on a hacker rank problem and I believe my solution is correct. However, most of my test cases are being timed out. Could some one suggest how to improve the efficiency of my code?
The important part of the code starts at the "Trie Class".
The exact question can be found here: https://www.hackerrank.com/challenges/contacts
using System;
using System.Collections.Generic;
using System.IO;
class Solution
{
static void Main(String[] args)
{
int N = Int32.Parse(Console.ReadLine());
string[,] argList = new string[N, 2];
for (int i = 0; i < N; i++)
{
string[] s = Console.ReadLine().Split();
argList[i, 0] = s[0];
argList[i, 1] = s[1];
}
Trie trie = new Trie();
for (int i = 0; i < N; i++)
{
switch (argList[i, 0])
{
case "add":
trie.add(argList[i, 1]);
break;
case "find":
Console.WriteLine(trie.find(argList[i, 1]));
break;
default:
break;
}
}
}
}
class Trie
{
Trie[] trieArray = new Trie[26];
private int findCount = 0;
private bool data = false;
private char name;
public void add(string s)
{
s = s.ToLower();
add(s, this);
}
private void add(string s, Trie t)
{
char first = Char.Parse(s.Substring(0, 1));
int index = first - 'a';
if(t.trieArray[index] == null)
{
t.trieArray[index] = new Trie();
t.trieArray[index].name = first;
}
if (s.Length > 1)
{
add(s.Substring(1), t.trieArray[index]);
}
else
{
t.trieArray[index].data = true;
}
}
public int find(string s)
{
int ans;
s = s.ToLower();
find(s, this);
ans = findCount;
findCount = 0;
return ans;
}
private void find(string s, Trie t)
{
if (t == null)
{
return;
}
if (s.Length > 0)
{
char first = Char.Parse(s.Substring(0, 1));
int index = first - 'a';
find(s.Substring(1), t.trieArray[index]);
}
else
{
for(int i = 0; i < 26; i++)
{
if (t.trieArray[i] != null)
{
find("", t.trieArray[i]);
}
}
if (t.data == true)
{
findCount++;
}
}
}
}
EDIT: I did some suggestions in the comments but realized that I can't replace s.Substring(1) with s[0]... because I actually need s[1..n]. AND s[0] returns a char so I'm going to need to do .ToString on it anyways.
Also, to add a little more information. The idea is that it needs to count ALL names after a prefix for example.
Input: "He"
Trie Contains:
"Hello"
"Help"
"Heart"
"Ha"
"No"
Output: 3
I could just post a solution here that gives you 40 points but i guess this wont be any fun.
Make use of Enumerator<char> instead of any string operations
Count while adding values
any charachter minus 'a' makes a good arrayindex (gives you values between 0 and 25)
I think, this link must be very helpfull
There, you can find the good implementation of a trie data structure, but it's a java implementation. I implemented trie on c# with that great article one year ago, and i tried to find solution to an another hackerrank task too ) It was successful.
In my task, i had to a simply trie (i mean my alphabet equals 2) and only one method for adding new nodes:
public class Trie
{
//for integer representation in binary system 2^32
public static readonly int MaxLengthOfBits = 32;
//alphabet trie size
public static readonly int N = 2;
class Node
{
public Node[] next = new Node[Trie.N];
}
private Node _root;
}
public void AddValue(bool[] binaryNumber)
{
_root = AddValue(_root, binaryNumber, 0);
}
private Node AddValue(Node node, bool[] val, int d)
{
if (node == null) node = new Node();
//if least sagnificient bit has been added
//need return
if (d == val.Length)
{
return node;
}
// get 0 or 1 index of next array(length 2)
int index = Convert.ToInt32(val[d]);
node.next[index] = AddValue(node.next[index], val, ++d);
return node;
}

Is there a higher performance method for removing rare unwanted chars from a string?

EDIT
Apologies if the original unedited question is misleading.
This question is not asking how to remove Invalid XML Chars from a string, answers to that question would be better directed here.
I'm not asking you to review my code.
What I'm looking for in answers is, a function with the signature
string <YourName>(string input, Func<char, bool> check);
that will have performance similar or better than RemoveCharsBufferCopyBlackList. Ideally this function would be more generic and if possible simpler to read, but these requirements are secondary.
I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This excerise got me thinking. What ways can this be done in safe managed c# and, which would offer the best performance for my scenario.
Here is my test program, I've subtituted the "valid XML predicate" for one the omits the char 'X'.
class Program
{
static void Main()
{
var attempts = new List<Func<string, Func<char, bool>, string>>
{
RemoveCharsLinqWhiteList,
RemoveCharsFindAllWhiteList,
RemoveCharsBufferCopyBlackList
}
const string GoodString = "1234567890abcdefgabcedefg";
const string BadString = "1234567890abcdefgXabcedefg";
const int Iterations = 100000;
var timer = new StopWatch();
var testSet = new List<string>(Iterations);
for (var i = 0; i < Iterations; i++)
{
if (i % 1000 == 0)
{
testSet.Add(BadString);
}
else
{
testSet.Add(GoodString);
}
}
foreach (var attempt in attempts)
{
//Check function works and JIT
if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
{
throw new ApplicationException("Broken Function");
}
if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
{
throw new ApplicationException("Broken Function");
}
timer.Reset();
timer.Start();
foreach (var t in testSet)
{
attempt.Invoke(t, IsNotUpperX);
}
timer.Stop();
Console.WriteLine(
"{0} iterations of function \"{1}\" performed in {2}ms",
Iterations,
attempt.Method,
timer.ElapsedMilliseconds);
Console.WriteLine();
}
Console.Readkey();
}
private static bool IsNotUpperX(char value)
{
return value != 'X';
}
private static string RemoveCharsLinqWhiteList(string input,
Func<char, bool> check);
{
return new string(input.Where(check).ToArray());
}
private static string RemoveCharsFindAllWhiteList(string input,
Func<char, bool> check);
{
return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
}
private static string RemoveCharsBufferCopyBlackList(string input,
Func<char, bool> check);
{
char[] inputArray = null;
char[] outputBuffer = null;
var blackCount = 0;
var lastb = -1;
var whitePos = 0;
for (var b = 0; b , input.Length; b++)
{
if (!check.invoke(input[b]))
{
var whites = b - lastb - 1;
if (whites > 0)
{
if (outputBuffer == null)
{
outputBuffer = new char[input.Length - blackCount];
}
if (inputArray == null)
{
inputArray = input.ToCharArray();
}
Buffer.BlockCopy(
inputArray,
(lastb + 1) * 2,
outputBuffer,
whitePos * 2,
whites * 2);
whitePos += whites;
}
lastb = b;
blackCount++;
}
}
if (blackCount == 0)
{
return input;
}
var remaining = inputArray.Length - 1 - lastb;
if (remaining > 0)
{
Buffer.BlockCopy(
inputArray,
(lastb + 1) * 2,
outputBuffer,
whitePos * 2,
remaining * 2);
}
return new string(outputBuffer, 0, inputArray.Length - blackCount);
}
}
If you run the attempts you'll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or if there is no generic option is there a way that is just faster?
Please note that I am not actually interested in removing 'X' and in practice the predicate is more complicated.
You certainly don't want to use LINQ to Objects aka enumerators to do this if you require high performance. Also, don't invoke a delegate per char. Delegate invocations are costly compared to the actual operation you are doing.
RemoveCharsBufferCopyBlackList looks good (except for the delegate call per character).
I recommend that you inline the contents of the delegate hard-coded. Play around with different ways to write the condition. You may get better performance by first checking the current char against a range of known good chars (e.g. 0x20-0xFF) and if it matches let it through. This test will pass almost always so you can save the expensive checks against individual characters which are invalid in XML.
Edit: I just remembered I solved this problem a while ago:
static readonly string invalidXmlChars =
Enumerable.Range(0, 0x20)
.Where(i => !(i == '\u000A' || i == '\u000D' || i == '\u0009'))
.Select(i => (char)i)
.ConcatToString()
+ "\uFFFE\uFFFF";
public static string RemoveInvalidXmlChars(string str)
{
return RemoveInvalidXmlChars(str, false);
}
internal static string RemoveInvalidXmlChars(string str, bool forceRemoveSurrogates)
{
if (str == null) throw new ArgumentNullException("str");
if (!ContainsInvalidXmlChars(str, forceRemoveSurrogates))
return str;
str = str.RemoveCharset(invalidXmlChars);
if (forceRemoveSurrogates)
{
for (int i = 0; i < str.Length; i++)
{
if (IsSurrogate(str[i]))
{
str = str.Where(c => !IsSurrogate(c)).ConcatToString();
break;
}
}
}
return str;
}
static bool IsSurrogate(char c)
{
return c >= 0xD800 && c < 0xE000;
}
internal static bool ContainsInvalidXmlChars(string str)
{
return ContainsInvalidXmlChars(str, false);
}
public static bool ContainsInvalidXmlChars(string str, bool forceRemoveSurrogates)
{
if (str == null) throw new ArgumentNullException("str");
for (int i = 0; i < str.Length; i++)
{
if (str[i] < 0x20 && !(str[i] == '\u000A' || str[i] == '\u000D' || str[i] == '\u0009'))
return true;
if (str[i] >= 0xD800)
{
if (forceRemoveSurrogates && str[i] < 0xE000)
return true;
if ((str[i] == '\uFFFE' || str[i] == '\uFFFF'))
return true;
}
}
return false;
}
Notice, that RemoveInvalidXmlChars first invokes ContainsInvalidXmlChars to save the string allocation. Most strings do not contain invalid XML chars so we can be optimistic.

How do you perform string replacement on just a subsection of a string?

I'd like an efficient method that would work something like this
EDIT: Sorry I didn't put what I'd tried before. I updated the example now.
// Method signature, Only replaces first instance or how many are specified in max
public int MyReplace(ref string source,string org, string replace, int start, int max)
{
int ret = 0;
int len = replace.Length;
int olen = org.Length;
for(int i = 0; i < max; i++)
{
// Find the next instance of the search string
int x = source.IndexOf(org, ret + olen);
if(x > ret)
ret = x;
else
break;
// Insert the replacement
source = source.Insert(x, replace);
// And remove the original
source = source.Remove(x + len, olen); // removes original string
}
return ret;
}
string source = "The cat can fly but only if he is the cat in the hat";
int i = MyReplace(ref source,"cat", "giraffe", 8, 1);
// Results in the string "The cat can fly but only if he is the giraffe in the hat"
// i contains the index of the first letter of "giraffe" in the new string
The only reason I'm asking is because my implementation I'd imagine getting slow with 1,000s of replaces.
How about:
public static int MyReplace(ref string source,
string org, string replace, int start, int max)
{
if (start < 0) throw new System.ArgumentOutOfRangeException("start");
if (max <= 0) return 0;
start = source.IndexOf(org, start);
if (start < 0) return 0;
StringBuilder sb = new StringBuilder(source, 0, start, source.Length);
int found = 0;
while (max-- > 0) {
int index = source.IndexOf(org, start);
if (index < 0) break;
sb.Append(source, start, index - start).Append(replace);
start = index + org.Length;
found++;
}
sb.Append(source, start, source.Length - start);
source = sb.ToString();
return found;
}
it uses StringBuilder to avoid lots of intermediate strings; I haven't tested it rigorously, but it seems to work. It also tries to avoid an extra string when there are no matches.
To start, try something like this:
int count = 0;
Regex.Replace(source, Regex.Escape(literal), (match) =>
{
return (count++ > something) ? "new value" : match.Value;
});
To replace only the first match:
private string ReplaceFirst(string source, string oldString, string newString)
{
var index = source.IndexOf(oldString);
var begin = source.Substring(0, index);
var end = source.Substring(index + oldString.Length);
return begin + newString + end;
}
You have a bug in that you will miss the item to replace if it is in the beginning.
change these lines;
int ret = start; // instead of zero, or you ignore the start parameter
// Find the next instance of the search string
// Do not skip olen for the first search!
int x = i == 0 ? source.IndexOf(org, ret) : source.IndexOf(org, ret + olen);
Also your routine does 300 thousand replaces a second on my machine. Are you sure this will be a bottleneck?
And just found that your code also has an issue if you replace larger texts by smaller texts.
This code is 100% faster if you have four replaces and around 10% faster with one replacement (faster when compared with the posted original code). It uses the specified start parameter and works when replacing larger texts by smaller texts.
Mark Gravells solution is (no offense ;-) 60% slower as the original code and it also returns another value.
// Method signature, Only replaces first instance or how many are specified in max
public static int MyReplace(ref string source, string org, string replace, int start, int max)
{
var ret = 0;
int x = start;
int reps = 0;
int l = source.Length;
int lastIdx = 0;
string repstring = "";
while (x < l)
{
if ((source[x] == org[0]) && (reps < max) && (x >= start))
{
bool match = true;
for (int y = 1; y < org.Length; y++)
{
if (source[x + y] != org[y])
{
match = false;
break;
}
}
if (match)
{
repstring += source.Substring(lastIdx, x - lastIdx) + replace;
ret = x;
x += org.Length - 1;
reps++;
lastIdx = x + 1;
// Done?
if (reps == max)
{
source = repstring + source.Substring(lastIdx);
return ret;
}
}
}
x++;
}
if (ret > 0)
{
source = repstring + source.Substring(lastIdx);
}
return ret;
}

Finding multiple indexes from source string

Basically I need to do String.IndexOf() and I need to get array of indexes from the source string.
Is there easy way to get array of indexes?
Before asking this question I have Googled a lot, but have not found easy solution to solve this simple problem.
How about this extension method:
public static IEnumerable<int> IndexesOf(this string haystack, string needle)
{
int lastIndex = 0;
while (true)
{
int index = haystack.IndexOf(needle, lastIndex);
if (index == -1)
{
yield break;
}
yield return index;
lastIndex = index + needle.Length;
}
}
Note that when looking for "AA" in "XAAAY" this code will now only yield 1.
If you really need an array, call ToArray() on the result. (This is assuming .NET 3.5 and hence LINQ support.)
var indexs = "Prashant".MultipleIndex('a');
//Extension Method's Class
public static class Extensions
{
static int i = 0;
public static int[] MultipleIndex(this string StringValue, char chChar)
{
var indexs = from rgChar in StringValue
where rgChar == chChar && i != StringValue.IndexOf(rgChar, i + 1)
select new { Index = StringValue.IndexOf(rgChar, i + 1), Increament = (i = i + StringValue.IndexOf(rgChar)) };
i = 0;
return indexs.Select(p => p.Index).ToArray<int>();
}
}
You would have to loop, I suspect:
int start = 0;
string s = "abcdeafghaji";
int index;
while ((index = s.IndexOf('a', start)) >= 0)
{
Console.WriteLine(index);
start = index + 1;
}
Using a solution that utilizes regex can be more reliable, using the indexOf function can be unreliable. It will find all matches and indexes, not matching an exact phrase which can lead to unexpected results. This function resolves that by making use of the Regex library.
public static IEnumerable<int> IndexesOf(string haystack, string needle)
{
Regex r = new Regex("\\b(" + needle + ")\\b");
MatchCollection m = r.Matches(haystack);
return from Match o in m select o.Index;
}

Categories