C# - Code optimization to get all substrings from a string

I was working on a code snippet to get all substrings from a given string.
Here is the code that I use
var stringList = new List<string>();
for (int length = 1; length < mainString.Length; length++)
{
for (int start = 0; start <= mainString.Length - length; start++)
{
var substring = mainString.Substring(start, length);
stringList.Add(substring);
}
}
It doesn't look great to me, with two for loops. Is there any other way to achieve this with better time complexity?
I am stuck on the point that, to get a substring, I will surely need two loops. Is there any other approach I can look into?

The number of substrings in a string is O(n^2), so one loop inside another is the best you can do. You are correct in your code structure.
Here's how I would've phrased your code:
void Main()
{
var stringList = new List<string>();
string s = "1234";
for (int i=0; i <s.Length; i++)
for (int j=i; j < s.Length; j++)
stringList.Add(s.Substring(i,j-i+1));
}
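As a quick sanity check of the O(n^2) claim: a string of length n has n - i substrings starting at index i, which sums to n(n+1)/2. The sketch below counts what the nested loops produce and compares it with that closed form. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringCount
{
    // Hypothetical helper: collects every substring with the same nested loops as above.
    public static List<string> AllSubstrings(string s)
    {
        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
            for (int j = i; j < s.Length; j++)
                result.Add(s.Substring(i, j - i + 1));
        return result;
    }

    // Closed form for the number of (start, end) pairs: n * (n + 1) / 2.
    public static int ExpectedCount(int n) => n * (n + 1) / 2;
}
```

For "1234" both the loop and the formula give 10 substrings, so any approach that materializes all of them has to do quadratic work.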

You do need 2 for loops
var input = "asd sdf dfg";
var stringList = new List<string>();
for (int i = 0; i < input.Length; i++)
{
for (int j = i; j < input.Length; j++)
{
var substring = input.Substring(i, j-i+1);
stringList.Add(substring);
}
}
foreach(var item in stringList)
{
Console.WriteLine(item);
}
Update
You cannot improve on the iterations.
However you can improve performance, by using fixed arrays and pointers

In some cases you can significantly increase execution speed by reducing object allocations, in this case by using a single char[] and ArraySegment<char> values to represent the substrings. This also uses less address space and reduces the load on the garbage collector.
Relevant excerpt from Using the StringBuilder Class in .NET page on Microsoft Docs:
The String object is immutable. Every time you use one of the methods in the System.String class, you create a new string object in memory, which requires a new allocation of space for that new object. In situations where you need to perform repeated modifications to a string, the overhead associated with creating a new String object can be costly.
Example implementation:
static List<ArraySegment<char>> SubstringsOf(char[] value)
{
var substrings = new List<ArraySegment<char>>(capacity: value.Length * (value.Length + 1) / 2 - 1);
for (int length = 1; length < value.Length; length++)
for (int start = 0; start <= value.Length - length; start++)
substrings.Add(new ArraySegment<char>(value, start, length));
return substrings;
}
For more information check Fundamentals of Garbage Collection page on Microsoft Docs, what is the use of ArraySegment class? discussion on StackOverflow, ArraySegment<T> Structure page on MSDN and List<T>.Capacity page on MSDN.
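A minimal sketch of how such segments might be consumed, assuming you only want to build a real string at the last moment. SegmentDemo, AsString and StartsWith are hypothetical names; Array, Offset and Count are the actual ArraySegment<T> properties.

```csharp
using System;

public static class SegmentDemo
{
    // Materializes a segment as a string only when a string is really needed.
    public static string AsString(ArraySegment<char> segment) =>
        segment.Array == null
            ? string.Empty
            : new string(segment.Array, segment.Offset, segment.Count);

    // Inspects the segment in place, without allocating anything.
    public static bool StartsWith(ArraySegment<char> segment, char c) =>
        segment.Array != null && segment.Count > 0 && segment.Array[segment.Offset] == c;
}
```

Filtering with StartsWith first and calling AsString only on the survivors keeps the number of string allocations proportional to the matches rather than to all O(n^2) substrings.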

Well, O(n^2) time complexity is inevitable; however, you can try to improve space consumption. In many cases you don't want all the substrings to be materialized as, say, a List<string>:
public static IEnumerable<string> AllSubstrings(string value) {
if (value == null)
yield break; // Or throw ArgumentNullException
for (int length = 1; length < value.Length; ++length)
for (int start = 0; start <= value.Length - length; ++start)
yield return value.Substring(start, length);
}
For instance, let's count all substrings of "abracadabra" which start with a and are longer than 3 characters. Please notice that all we have to do is loop over the substrings without saving them into a list:
int count = AllSubstrings("abracadabra")
.Count(item => item.StartsWith("a") && item.Length > 3);
If for any reason you want a List<string>, just add .ToList():
var stringList = AllSubstrings(mainString).ToList();
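Another benefit of the iterator is early exit: a consumer that stops enumerating also stops the substring generation. The sketch below repeats the iterator so that it compiles on its own; FirstContaining is a hypothetical helper, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazySubstrings
{
    // Same iterator as above, repeated so the sketch is self-contained.
    public static IEnumerable<string> AllSubstrings(string value)
    {
        if (value == null)
            yield break;
        for (int length = 1; length < value.Length; ++length)
            for (int start = 0; start <= value.Length - length; ++start)
                yield return value.Substring(start, length);
    }

    // Hypothetical helper: enumeration stops at the first substring containing
    // the fragment, so longer substrings are never created.
    public static string FirstContaining(string value, string fragment) =>
        AllSubstrings(value).FirstOrDefault(s => s.Contains(fragment));
}
```

With a List<string> the whole collection would have been built before the search could even start.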

Related

How to shuffle string characters to right and left until int.MaxValue?

My task is to do an organized shuffle: the characters at odd positions of the source (1st, 3rd, ...) go to the left and the characters at even positions go to the right.
I have done it like this, and it works well for the normal scenario:
public static string ShuffleChars(string source, int count)
{
if (string.IsNullOrWhiteSpace(source) || source.Length == 0)
{
throw new ArgumentException(null);
}
if (count < 0)
{
throw new ArgumentException(null);
}
for (int i = 0; i < count; i++)
{
source = string.Concat(source.Where((item, index) => index % 2 == 0)) +
string.Concat(source.Where((item, index) => index % 2 != 0));
}
return source;
}
Now the problem is: what if the count is int.MaxValue or another huge number in the millions? It will loop through a lot of iterations. How can I optimize the code in terms of speed and resource consumption?
You should be able to determine from the string's length how many iterations it takes before it's back to its original sort order. Then take the input count modulo that number of iterations, and only iterate that many times.
For example, a string that is three characters long will be back to its original sort order after 2 iterations. If the input count was 11 iterations, we know that 11 % 2 == 1, so we only need to iterate one time.
Once you determine a formula for how many iterations it takes to reach the original sort order for any length of string, you can always reduce the number of iterations to that number or less.
Coming up with a formula will be tricky, however. A string with 14 characters takes 12 iterations until it matches itself, but a string with 15 characters only takes 4 iterations.
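The cycle lengths quoted above can be checked by brute force. This is only a sketch: ShuffleCycle and CycleLength are hypothetical names, and the test string is built from distinct characters so that equal strings really mean the same order.

```csharp
using System;
using System.Linq;

public static class ShuffleCycle
{
    // One shuffle step, as in the question: even-index characters first, then odd-index ones.
    public static string Step(string s) =>
        string.Concat(s.Where((c, index) => index % 2 == 0)) +
        string.Concat(s.Where((c, index) => index % 2 != 0));

    // Counts the steps until a string of the given length returns to its original order.
    public static int CycleLength(int length)
    {
        // Distinct characters: 'A', 'B', 'C', ...
        string original = new string(
            Enumerable.Range(0, length).Select(i => (char)('A' + i)).ToArray());
        string current = Step(original);
        int steps = 1;
        while (current != original)
        {
            current = Step(current);
            steps++;
        }
        return steps;
    }
}
```

CycleLength(3), CycleLength(14) and CycleLength(15) reproduce the values 2, 12 and 4 mentioned above.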
Therefore, a shortcut might be to simply start iterating until we reach the original sort order (or the specified count, whichever comes first). If we reach the count first, then we return that answer. Otherwise, we can determine the answer from the idea in the first paragraph - take the modulus of the input count and the iteration count, and return that answer.
This would require that we store the values from our iterations (in a dictionary, for example) so we can retrieve a specific previous value.
For example:
public static string ShuffleChars(string source, int count)
{
string s = source;
var results = new Dictionary<int, string>();
for (int i = 0; i < count; i++)
{
s = string.Concat(s.Where((item, index) => index % 2 == 0)) +
string.Concat(s.Where((item, index) => index % 2 != 0));
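// If we've repeated our original string, the cycle length is i + 1.
// Return the saved value for (count modulo cycle length) iterations;
// a remainder of 0 means the answer is the original string itself.
if (s == source)
{
int remainder = count % (i + 1);
return remainder == 0 ? source : results[remainder - 1];
}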
// Otherwise, save the value for later
else
{
results[i] = s;
}
}
// If we get here it means we hit the requested count before
// ever returning to the original sort order of the input
return s;
}
Instead of creating new immutable strings on each loop, you could work with a mutable array of characters (char[]), and swap characters between places. This would be the most efficient in terms of memory consumption, but doing the swaps on a single array could be quite tricky. Using two arrays is much easier, because you can just copy characters from one array to the other, and at the end of each loop swap the two arrays.
One more optimization you could do is to work with the indices of the char array instead of its values, and map them back to characters only once at the end. I am not sure if this will make any difference in practice: in .NET a char always occupies 2 bytes and an int 4 bytes, regardless of whether the machine is 32 or 64 bit, so the indices arrays are actually larger than char arrays would be. Here is an implementation, with all these ideas put together:
public static string ShuffleChars(string source, int count)
{
if (source == null) throw new ArgumentNullException(nameof(source));
if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
// Instantiate the two arrays
int[] indices = new int[source.Length];
int[] temp = new int[source.Length];
// Initialize the indices array with incremented numbers
for (int i = 0; i < indices.Length; i++)
indices[i] = i;
for (int k = 0; k < count; k++)
{
// Copy the odds (1st, 3rd, ... items, i.e. even indices) to the front of the temp array
for (int i = 0, j = 0; j < indices.Length; i += 1, j += 2)
temp[i] = indices[j];
// Copy the evens (2nd, 4th, ... items, i.e. odd indices) to the back of the temp array
int lastEven = (indices.Length >> 1 << 1) - 1;
for (int i = indices.Length - 1, j = lastEven; j >= 0; i -= 1, j -= 2)
temp[i] = indices[j];
// Swap the two arrays, using value tuples
(indices, temp) = (temp, indices);
}
// Map the indices to characters from the source string
return String.Concat(indices.Select(i => source[i]));
}
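A different technique, offered only as a sketch and not taken from the answers above: one shuffle step is a fixed permutation of positions, so it can be split into cycles, and each cycle can be rotated by count modulo its length. That makes the cost independent of count. ShuffleCharsByCycles and SourceIndex are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CycleShuffle
{
    // After one step, position i holds the character that was at SourceIndex(i):
    // the first half takes the even indices, the second half the odd ones.
    private static int SourceIndex(int i, int half) =>
        i < half ? 2 * i : 2 * (i - half) + 1;

    public static string ShuffleCharsByCycles(string source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        int n = source.Length;
        int half = (n + 1) / 2; // number of even indices
        var result = new char[n];
        var visited = new bool[n];
        var cycle = new List<int>();
        for (int start = 0; start < n; start++)
        {
            if (visited[start]) continue;
            // Walk the cycle of positions that contains start.
            cycle.Clear();
            for (int i = start; !visited[i]; i = SourceIndex(i, half))
            {
                visited[i] = true;
                cycle.Add(i);
            }
            // Applying the step count times moves count % length places along the cycle.
            int shift = count % cycle.Count;
            for (int m = 0; m < cycle.Count; m++)
                result[cycle[m]] = source[cycle[(m + shift) % cycle.Count]];
        }
        return new string(result);
    }
}
```

Every position is visited once, so the running time is proportional to the string length even for count == int.MaxValue.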

How do you store the index position of a repeated keyword and store it in an array?

I want to make a program that finds how many times a keyword (e.g. "the") is repeated and then stores the index positions in an array. At the moment, my code only stores the first position where it reads "the" in the sentence. How do I make it store the index positions of both the first and the second occurrence?
It outputs on the console:
11
0
My current code:
string sentence = "John likes the snow and the winter.";
string keyWord = "the";
var test = sentence.Split(new char[] { ' ', '.' });
var count = Array.FindAll(test, s => s.Equals(keyWord.Trim())).Length;
int[] arr = new int[count];
for (int i = 0; i < arr.Length; i++)
{
arr[i] = sentence.IndexOf("the", i);
i++;
}
foreach (int num in arr)
{
Console.WriteLine(num);
}
Console.ReadLine();
The second result (0) is there because of the unnecessary i++ inside the for loop: i is incremented twice per pass, so the loop body runs only once. To achieve what you want, you could try the code below (please take a closer look at the body of the for loop):
string sentence = "John likes the snow and the winter.";
string keyWord = "the";
var test = sentence.Split(new char[] { ' ', '.' });
var count = Array.FindAll(test, s => s.Equals(keyWord.Trim())).Length;
int[] arr = new int[count];
int searchFrom = 0; // Start at 0 so that a keyword at the very beginning is not missed.
for (int i = 0; i < arr.Length; i++)
{
arr[i] = sentence.IndexOf(keyWord, searchFrom);
searchFrom = arr[i] + keyWord.Length; // Skip the occurrence we have just found.
}
foreach (int num in arr)
{
Console.WriteLine(num);
}
Console.ReadLine();
I hope it makes sense.
There are two problems that I see with your code. First, you're incrementing i twice, so it will only ever get half the items. Second, you're passing i as the second parameter to IndexOf (which represents the starting index for the search). Instead, you should be starting the search after the previous found instance by passing in the index of the last instance found plus its length.
Here's a fixed example of the for loop:
for (int i = 0; i < arr.Length; i++)
{
arr[i] = sentence.IndexOf(keyWord, i == 0 ? 0 : arr[i - 1] + keyWord.Length);
}
Also, your code could be simplified a little if you use a List<int> instead of an int[] to store the indexes, because with List you don't need to know the count ahead of time:
string sentence = "John likes the snow and the winter.";
string keyWord = "the";
var indexes = new List<int>();
var index = 0;
while (true)
{
index = sentence.IndexOf(keyWord, index); // Find the next index of the keyword
if (index < 0) break; // If we didn't find it, exit the loop
indexes.Add(index++); // Otherwise, add the index to our list
}
foreach (int num in indexes)
{
Console.WriteLine(num);
}
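The same idea can be wrapped in a reusable method. This is a sketch under two assumptions of mine: the comparison is ordinal (case-sensitive), and matches do not overlap because the search resumes after the whole keyword. KeywordSearch and AllIndexesOf are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordSearch
{
    // Returns the start index of every non-overlapping occurrence of keyword in text.
    public static List<int> AllIndexesOf(string text, string keyword)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (string.IsNullOrEmpty(keyword))
            throw new ArgumentException("Keyword must not be empty.", nameof(keyword));

        var indexes = new List<int>();
        int index = text.IndexOf(keyword, StringComparison.Ordinal);
        while (index >= 0)
        {
            indexes.Add(index);
            // Resume the search right after the occurrence just found.
            index = text.IndexOf(keyword, index + keyword.Length, StringComparison.Ordinal);
        }
        return indexes;
    }
}
```

Note that this finds "the" inside longer words such as "then" as well; matching whole words only would need an extra check on the neighbouring characters.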

How to keep the latest X elements of a list

I need to use a data structure that would keep the latest X elements of a list. A colleague gave me this solution:
int start = 0;
const int latestElementsToKeep = 20;
int[] numbers = new int[latestElementsToKeep];
for (int i = 0; i < 30; i++)
{
numbers[start] = i;
if (start < numbers.Length - 1)
{
start++;
}
else
{
start = 0;
}
}
So after this is run, the numbers array holds the numbers 10-29 (the latest 20 numbers).
That's nice, but difficult to use this in the real world. Is there an easier way to do this?
This seems like a pretty standard Circular Buffer. My only suggestion would be to create a class for it or download one of the libraries available. There seem to be a few promising looking ones near the top of the Google results.
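If you go the route of writing a class yourself, a minimal version might look like the sketch below. It is an illustration rather than a reference to any particular library; CircularBuffer<T> and its members are names I made up.

```csharp
using System;
using System.Collections.Generic;

public sealed class CircularBuffer<T>
{
    private readonly T[] _items;
    private int _next;  // slot that the next Add will write to
    private int _count; // number of valid items, never more than the capacity

    public CircularBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count => _count;

    // Stores the item, overwriting the oldest one once the buffer is full.
    public void Add(T item)
    {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;
        if (_count < _items.Length) _count++;
    }

    // Returns the kept items ordered from oldest to newest.
    public List<T> ToList()
    {
        var result = new List<T>(_count);
        int start = (_next - _count + _items.Length) % _items.Length;
        for (int i = 0; i < _count; i++)
            result.Add(_items[(start + i) % _items.Length]);
        return result;
    }
}
```

Adding the numbers 0 to 29 to a buffer of capacity 20 and calling ToList() returns 10 to 29 in order, which hides the wrap-around bookkeeping from the calling code.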
Easier way to do this:
int[] numbers = new int[latestElementsToKeep];
for (int i = 0; i < 30; i++)
numbers[i % latestElementsToKeep] = i;
The modulus operator returns the remainder of dividing i by latestElementsToKeep. When i reaches latestElementsToKeep, you start writing from the beginning again.
For a range of numbers, you can use:
int keep = 20;
int lastItem = 29;
int[] numbers = Enumerable.Range(lastItem - keep + 1, keep).ToArray(); // 10..29
To get the last items from any collection (where you can get the size), you can use:
int keep = 20;
someType[] items = someCollection.Skip(someCollection.Count() - keep).ToArray();

using queue to get some characters from input string. return value is weird

I can't find any mistakes in my code.
Here I'm trying to pick all numbers from the string
(this is just a simplified example; I actually want to pick the numbers that satisfy some condition).
I use a Queue because I don't want to deal with array indexes.
Console.Write("enter string: ");
string s = Console.ReadLine();
char[] array = s.ToCharArray();
Queue<char> q = new Queue<char>();
for (int i = 0; i < array.Length; i++)
{
q.Enqueue(array[i]);
}
char[] new_array = new char[q.Count];
for (int i = 0; i < q.Count; i++)
{
new_array[i] = q.Dequeue();
}
Console.WriteLine(new String(new_array));
Input string: 123456
And the output is a little weird:
123
another input: 123
output: 12
Of course I made some mistake, but everything seems to be OK.
Thank you in advance.
The problem is the second loop:
for (int i = 0; i < q.Count; i++)
{
new_array[i] = q.Dequeue();
}
As q.Count decreases on every loop iteration and i increases on every iteration, you get only half of the elements.
Try something like:
for (int i = 0; q.Count > 0; i++)
{
new_array[i] = q.Dequeue();
}
Also consider Queue<T>.ToArray.
I would suggest using List<char> instead of Queue<char> and char[]. There's nothing here that particularly needs a queue, and it would avoid the problem that Rudolf pointed out, and a List is much easier to work with than an array. You can also use foreach instead of a for loop, and avoid the intermediate step.
Console.Write("enter string: ");
string s = Console.ReadLine();
List<char> new_array = new List<char>();
foreach(char c in s.ToCharArray())
{
new_array.Add(c);
}
Console.WriteLine(new String(new_array.ToArray()));
As the reason for your error has already been stated, you can replace your two loops with just two statements:
//A version of Queue constructor accepts IEnumerable object.
//you can directly pass the string to the queue constructor.
Queue<char> Que = new Queue<char>("123456");
//Copies the array and the position is preserved
var new_arr= Que.ToArray();
According to MSDN:
Removes and returns the object at the beginning of the Queue.
As you use Dequeue(), the q.Count value changes in each iteration.
So rather than using q.Count in this loop;
for (int i = 0; i < q.Count; i++)
use
int queueSize = q.Count;
for (int i = 0; i < queueSize; i++)
This will keep your looping limit as a constant number rather than calculating it in each iteration to find a different value because of using Dequeue().
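Since the question mentions picking only the characters that satisfy some condition, here is a sketch that drains the queue with a loop on the queue itself, so the shrinking Count cannot cut the loop short. QueueFilter and Filter are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class QueueFilter
{
    // Keeps only the characters accepted by the keep predicate.
    public static string Filter(string input, Func<char, bool> keep)
    {
        var queue = new Queue<char>(input); // a string is an IEnumerable<char>
        var kept = new List<char>(queue.Count);
        // Run until the queue is empty instead of comparing a counter with Count.
        while (queue.Count > 0)
        {
            char c = queue.Dequeue();
            if (keep(c)) kept.Add(c);
        }
        return new string(kept.ToArray());
    }
}
```

For example, Filter("a1b2c3", c => char.IsDigit(c)) keeps only the digits.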

Fast intersection of two sorted integer arrays

I need to find the intersection of two sorted integer arrays and do it very fast.
Right now, I am using the following code:
int i = 0, j = 0;
while (i < arr1.Count && j < arr2.Count)
{
if (arr1[i] < arr2[j])
{
i++;
}
else
{
if (arr2[j] < arr1[i])
{
j++;
}
else
{
intersect.Add(arr2[j]);
j++;
i++;
}
}
}
Unfortunately it might take hours to do all the work.
How can I do it faster? I found this article where SIMD instructions are used. Is it possible to use SIMD in .NET?
What do you think about:
http://docs.go-mono.com/index.aspx?link=N:Mono.Simd Mono.SIMD
http://netasm.codeplex.com/ NetASM(inject asm code to managed)
and something like http://www.atrevido.net/blog/PermaLink.aspx?guid=ac03f447-d487-45a6-8119-dc4fa1e932e1
EDIT:
When I say thousands I mean the following (in code):
for(var i=0;i<arrCollection1.Count-1;i++)
{
for(var j=i+1;j<arrCollection2.Count;j++)
{
Intersect(arrCollection1[i], arrCollection2[j]);
}
}
UPDATE
The fastest I got was 200 ms with arrays of 10 million elements, using the unsafe version (the last piece of code).
The test I did:
var arr1 = new int[10000000];
var arr2 = new int[10000000];
for (var i = 0; i < 10000000; i++)
{
arr1[i] = i;
arr2[i] = i * 2;
}
var sw = Stopwatch.StartNew();
var result = arr1.IntersectSorted(arr2);
sw.Stop();
Console.WriteLine(sw.Elapsed); // 00:00:00.1926156
Full Post:
I've tested various ways to do it and found this to be very good:
public static List<int> IntersectSorted(this int[] source, int[] target)
{
// Set initial capacity to a "full-intersection" size
// This prevents multiple re-allocations
var ints = new List<int>(Math.Min(source.Length, target.Length));
var i = 0;
var j = 0;
while (i < source.Length && j < target.Length)
{
// Compare only once and let compiler optimize the switch-case
switch (source[i].CompareTo(target[j]))
{
case -1:
i++;
// Saves us a JMP instruction
continue;
case 1:
j++;
// Saves us a JMP instruction
continue;
default:
ints.Add(source[i++]);
j++;
// Saves us a JMP instruction
continue;
}
}
// Free unused memory (sets capacity to actual count)
ints.TrimExcess();
return ints;
}
For further improvement you can remove the ints.TrimExcess();, which will also make a nice difference, but you should think if you're going to need that memory.
Also, if you know that you might break loops that use the intersections, and you don't have to have the results as an array/list, you should change the implementation to an iterator:
public static IEnumerable<int> IntersectSorted(this int[] source, int[] target)
{
var i = 0;
var j = 0;
while (i < source.Length && j < target.Length)
{
// Compare only once and let compiler optimize the switch-case
switch (source[i].CompareTo(target[j]))
{
case -1:
i++;
// Saves us a JMP instruction
continue;
case 1:
j++;
// Saves us a JMP instruction
continue;
default:
yield return source[i++];
j++;
// Saves us a JMP instruction
continue;
}
}
}
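To illustrate why the iterator helps when you break out early, here is a sketch in which the caller only needs to know whether the arrays share any element. The iterator is repeated with plain comparisons so the block compiles on its own; Overlaps is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedIntersection
{
    // Lazy intersection of two sorted arrays, yielding one match at a time.
    public static IEnumerable<int> IntersectSorted(int[] source, int[] target)
    {
        int i = 0, j = 0;
        while (i < source.Length && j < target.Length)
        {
            if (source[i] < target[j]) i++;
            else if (source[i] > target[j]) j++;
            else
            {
                yield return source[i];
                i++;
                j++;
            }
        }
    }

    // Any() stops the iterator at the first match, so the rest of the arrays is never scanned.
    public static bool Overlaps(int[] a, int[] b) => IntersectSorted(a, b).Any();
}
```

The same applies to Take(n) or a foreach with break: only as much of the merge runs as the consumer asks for.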
Another improvement is to use unsafe code:
public static unsafe List<int> IntersectSorted(this int[] source, int[] target)
{
var ints = new List<int>(Math.Min(source.Length, target.Length));
fixed (int* ptSrc = source)
{
var maxSrcAdr = ptSrc + source.Length;
fixed (int* ptTar = target)
{
var maxTarAdr = ptTar + target.Length;
var currSrc = ptSrc;
var currTar = ptTar;
while (currSrc < maxSrcAdr && currTar < maxTarAdr)
{
switch ((*currSrc).CompareTo(*currTar))
{
case -1:
currSrc++;
continue;
case 1:
currTar++;
continue;
default:
ints.Add(*currSrc);
currSrc++;
currTar++;
continue;
}
}
}
}
ints.TrimExcess();
return ints;
}
In summary, the most major performance hit was in the if-else's.
Turning it into a switch-case made a huge difference (about 2 times faster).
Have you tried something simple like this:
var a = Enumerable.Range(1, int.MaxValue/100).ToList();
var b = Enumerable.Range(50, int.MaxValue/100 - 50).ToList();
//var c = a.Intersect(b).ToList();
List<int> c = new List<int>();
var t1 = DateTime.Now;
foreach (var item in a)
{
if (b.BinarySearch(item) >= 0)
c.Add(item);
}
var t2 = DateTime.Now;
var tres = t2 - t1;
This piece of code takes 1 array of 21,474,836 elements and the other one with 21,474,786
If I use var c = a.Intersect(b).ToList(); I get an OutOfMemoryException
The result product would be 461,167,507,485,096 iterations using nested foreach
But with this simple code, the intersection occurred in TotalSeconds = 7.3960529 (using one core)
Now I am still not happy, so I am trying to increase the performance by running this in parallel; as soon as I finish I will post it.
Yorye Nathan gave me the fastest intersection of two arrays with the last "unsafe code" method. Unfortunately it was still too slow for me: I needed to intersect combinations of arrays, up to 2^32 combinations, which is quite a lot. With the following modifications and adjustments it became about 2.6 times faster. You need to do some preprocessing first, which you can surely arrange one way or another: I use only indexes instead of the actual objects, ids or some other abstract comparison. For example, if you have to intersect big numbers like these
Arr1: 103344, 234566, 789900, 1947890,
Arr2: 150034, 234566, 845465, 23849854
put everything into one array
Combined: 103344, 234566, 789900, 1947890, 150034, 845465, 23849854
and use, for the intersection, the indexes of each value in that combined array
Arr1Index: 0, 1, 2, 3
Arr2Index: 1, 4, 5, 6
Now we have smaller numbers with which we can build some other useful arrays. After taking the method from Yorye, I expanded Arr2Index into what is theoretically a boolean array, but in practice a byte array because of the memory size implications:
Arr2IndexCheck: 0, 1, 0, 0, 1, 1, 1
This is more or less a dictionary which tells me, for any index, whether the second array contains it.
As a next step I avoided memory allocation, which also took time: I pre-create the result array before calling the method, so while finding my combinations I never instantiate anything. Of course you then have to track the used length of this array separately, so you may need to store it somewhere.
Finally the code looks like this:
public static unsafe int IntersectSorted2(int[] arr1, byte[] arr2Check, int[] result)
{
int length;
fixed (int* pArr1 = arr1, pResult = result)
fixed (byte* pArr2Check = arr2Check)
{
int* maxArr1Adr = pArr1 + arr1.Length;
int* arr1Value = pArr1;
int* resultValue = pResult;
while (arr1Value < maxArr1Adr)
{
if (*(pArr2Check + *arr1Value) == 1)
{
*resultValue = *arr1Value;
resultValue++;
}
arr1Value++;
}
length = (int)(resultValue - pResult);
}
return length;
}
As you can see, the function returns the number of results written to the result array; you can then do what you wish with it (resize it, keep it). Obviously the result array has to be at least as large as the smaller of arr1 and arr2.
The big improvement is that I only iterate through the first array, which should ideally be the smaller of the two, so you have fewer iterations. Fewer iterations mean fewer CPU cycles.
So here is a really fast intersection of two ordered arrays, in case you need really high performance ;).
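For readers who want to try the idea without unsafe code, here is a bounds-checked sketch of the same lookup-table approach, including the preprocessing step that builds the check array. LookupIntersection, BuildCheck and universeSize are names I introduced; universeSize is the length of the combined array.

```csharp
using System;

public static class LookupIntersection
{
    // Builds the byte lookup described above: check[v] == 1 when index v occurs in indexes.
    public static byte[] BuildCheck(int[] indexes, int universeSize)
    {
        var check = new byte[universeSize];
        foreach (int v in indexes)
            check[v] = 1;
        return check;
    }

    // Safe counterpart of IntersectSorted2: writes the matches into result
    // and returns how many entries were written.
    public static int Intersect(int[] arr1, byte[] arr2Check, int[] result)
    {
        int length = 0;
        foreach (int v in arr1)
        {
            if (arr2Check[v] == 1)
                result[length++] = v;
        }
        return length;
    }
}
```

With Arr1Index = { 0, 1, 2, 3 } and Arr2Index = { 1, 4, 5, 6 } from the example, BuildCheck produces 0, 1, 0, 0, 1, 1, 1 and Intersect writes the single common index 1. The unsafe version does the same work but skips the array bounds checks.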
Are arrCollection1 and arrCollection2 collections of arrays of integers? In this case you should get a notable improvement by starting the second loop from i+1 as opposed to 0.
C# itself has no SIMD syntax, and when this was written the .NET Framework offered no SIMD API either (later versions of .NET added System.Numerics.Vector<T> and hardware intrinsics). Additionally, and I haven't yet figured out why, DLLs that use SSE aren't any faster when called from C# than the equivalent non-SSE functions. Also, none of the SIMD extensions that I know of work well with branching, i.e. your "if" statements.
If you're using .net 4.0, you can use Parallel For to gain speed if you have multiple cores. Otherwise you can write a multithreaded version if you have .net 3.5 or less.
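A sketch of what the Parallel.For suggestion could look like for the pairwise loop from the question. It only counts common elements to keep the example short; ParallelIntersections, CountCommon and CountAllPairs are hypothetical names, and the thread-local overload of Parallel.For is used so that threads do not contend on a shared counter.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelIntersections
{
    // Counts the common elements of two sorted arrays with a two-pointer merge.
    public static int CountCommon(int[] a, int[] b)
    {
        int i = 0, j = 0, count = 0;
        while (i < a.Length && j < b.Length)
        {
            if (a[i] < b[j]) i++;
            else if (b[j] < a[i]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }

    // Intersects every pair (i, j) with j > i; the outer index is spread across cores.
    public static long CountAllPairs(IList<int[]> arrays)
    {
        long total = 0;
        Parallel.For(0, arrays.Count,
            () => 0L, // per-thread running sum
            (i, loopState, local) =>
            {
                for (int j = i + 1; j < arrays.Count; j++)
                    local += CountCommon(arrays[i], arrays[j]);
                return local;
            },
            local => Interlocked.Add(ref total, local)); // merge each thread's sum once
        return total;
    }
}
```

Each pair is independent of the others, which is what makes this loop safe to parallelize; collecting the actual intersections would need a thread-safe or per-thread result container.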
Here is a method similar to yours:
IList<int> Intersect(int[] arr1, int[] arr2)
{
IList<int> result = new List<int>();
int i = 0, j = 0;
while (i < arr1.Length && j < arr2.Length)
{
// Skip ahead in each array until neither current element is smaller than the other
while (i < arr1.Length && arr1[i] < arr2[j]) i++;
if (i == arr1.Length) break;
while (j < arr2.Length && arr2[j] < arr1[i]) j++;
if (j == arr2.Length) break;
if (arr1[i] != arr2[j]) continue;
int value = arr1[i];
result.Add(value);
while (i < arr1.Length && arr1[i] == value) i++; //prevent redundant entries
while (j < arr2.Length && arr2[j] == value) j++; //prevent redundant entries
}
return result;
}
This one also prevents any entry from appearing twice. With 2 sorted arrays both of size 10 million, it completed in about a second. The compiler is supposed to remove array bounds checks if you use array.Length in a for statement; I don't know if that works in a while statement though.
