Different benchmarks for string manipulation in C# and C++

Different benchmarks for string manipulation in C# and C++ - c#

I have two very simple code in C++ and C#
in c#
for (int counter = 0; counter < 100000; counter ++)
{
String a = "";
a = "xyz";
a = a + 'd';
a = a + 'c';
a = a + 'h';
}
in c++
for (int counter = 0; counter < 100000; counter ++)
{
string a = "";
a.append("xyz");
a = a + 'd';
a = a + 'c';
a = a + 'h';
}
the strange thing is the c# code took 1/20 time for execution than the c++ code.
could you please help me to find why this happened? and how can I change my c++ code to become faster.

It's probably a quirk of the implementations. For example, one optimizer might have figured out that the result of the operations isn't used. Or one might happen to allocate a string large enough to add three extra characters without reallocating while the other didn't. Or it could be a million other things.
Benchmarking with "toy" code really isn't helpful. I wouldn't assume the results apply to any realistic situation.
There are so many obvious optimizations to this code, for example:
string a;
for (int counter = 0; counter < 100000; counter ++)
{
a = "xyz";
a.append(1, 'd');
a.append(1, 'c');
a.append(1, 'h');
}
That may make a huge difference by reusing the buffer and avoiding extra allocate/copy/free cycles.

For large string manipulation use StringBuilder

Related

In C#, are for loops linear?

Here I'm using the mathematical term for linear. If we look at, for example, the definition of averaging, we know that:
That's what I mean by linear. In C#, suppose I want to do the following:
for (int i = 0; i<N; i++)
{
someVar[i] += i;
someOtherVar[i] += i;
}
Does this cost the same overhead as:
for (int i = 0; i<N; i++)
{
someVar[i] += i;
}
for (int i = 0; i<N; i++)
{
someOtherVar[i] += i;
}
Would the difference change if the operation in the for loop is more complicated, like:
for (int i = 0; i<N; i++)
{
someOtherVar[i] *= Math.Cos(2.0 * Math.Pi * i / N);
}
Assume N is large, 16384 entries.

As far as complexity goes, both a one and two loop algorithm are linear O(n) operations. But as far as actual performance (overhead), the two loop version is most likely slower. The compiler or runtime may opt to do optimizations, but two loops generally translate to:
set i to 0;
a:
if i < N goto done;
set someVar[i] = someVar[i] + i;
set i = i + 1
goto a;
done:
set i to 0;
b:
if i < N goto done;
set someOtherVar[i] = someOtherVar[i] + i;
set i = i + 1;
goto b;
...which is clearly more steps than one loop.

I did an IL dump, and using one loop is faster than using two loops as there's the overhead incurred by incrementing "i".
That said, the difference will be minimal.
If you're dealing with a much larger data set and need to optimize this further, look into SIMD (single instruction, multiple data) operations.
https://habr.com/en/post/467689/

Version of C# StringBuilder to allow for strings larger than 2 billion characters

In C#, 64bit Windows + .NET 4.5 (or later) + enabling gcAllowVeryLargeObjects in the App.config file allows for objects larger than two gigabyte. That's cool, but unfortunately, the maximum number of elements that C# allows in a character array is still limited to about 2^31 = 2.15 billion chars. Testing confirmed this.
To overcome this, Microsoft recommends in Option B creating the arrays natively (their 'Option C' doesn't even compile). That suits me, as speed is also a concern. Is there some tried and trusted unsafe / native / interop / PInvoke code for .NET out there that can replace and act as an enhanced StringBuilder to get around the 2 billion element limit?
Unsafe/pinvoke code is preferred, but not a deal breaker. Alternatively, is there a .NET (safe) version available?
Ideally, the StringBuilder replacement will start off small (preferably user-defined), and then repeatedly double in size each time the capacity has been exceeded. I'm mostly looking for append() functionality here. Saving the string to a file would be useful too, though I'm sure I could program that bit if substring() functionality is also incorporated. If the code uses pinvoke, then obviously some degree of memory management must be taken into account to avoid memory loss.
I don't want to recreate the wheel if some simple code already exists, but on the other hand, I don't want to download and incorporate a DLL just for this simple functionality.
I'm also using .NET 3.5 to cater for users who don't have the latest version of Windows.

The size of strings in C++ is unlimited according to this answer.
You could write your string processing code in C++ and use a DLL import to communicate between your C# code and C++ code. This makes it simple to call your C++ functions from the C# code.
The parts of your code which do the processing on the large strings will dictate where the border between the C++ and C# code will need to be. Obviously any references to the large strings will need to be kept on the C++ side, but aggregate processing result information can then be communicated back to the C# code.
Here is a link to a code project page that gives some guidance on C# to C++ DLL imports.

So I ended up creating my own BigStringBuilder function in the end. It's a list where each list element (or page) is a char array (type List<char[]>).
Providing you're using 64 bit Windows, you can now easily surpass the 2 billion character element limit. I managed to test creating a giant string around 32 gigabytes large (needed to increase virtual memory in the OS first, otherwise I could only get around 7GB on my 8GB RAM PC). I'm sure it handles more than 32GB easily. In theory, it should be able to handle around 1,000,000,000 * 1,000,000,000 chars or one quintillion characters, which should be enough for anyone.
Speed-wise, some quick tests show that it's only around 33% slower than a StringBuilder when appending. I got very similar performance if I went for a 2D jagged char array (char[][]) instead of List<char[]>, but Lists are simpler to work with, so I stuck with that.
Hope somebody else finds it useful! There may be bugs, so use with caution. I tested it fairly well though.
// A simplified version specially for StackOverflow
public class BigStringBuilder
{
List<char[]> c = new List<char[]>();
private int pagedepth;
private long pagesize;
private long mpagesize; // https://stackoverflow.com/questions/11040646/faster-modulus-in-c-c
private int currentPage = 0;
private int currentPosInPage = 0;
public BigStringBuilder(int pagedepth = 12) { // pagesize is 2^pagedepth (since must be a power of 2 for a fast indexer)
this.pagedepth = pagedepth;
pagesize = (long)Math.Pow(2, pagedepth);
mpagesize = pagesize - 1;
c.Add(new char[pagesize]);
}
// Indexer for this class, so you can use convenient square bracket indexing to address char elements within the array!!
public char this[long n] {
get { return c[(int)(n >> pagedepth)][n & mpagesize]; }
set { c[(int)(n >> pagedepth)][n & mpagesize] = value; }
}
public string[] returnPagesForTestingPurposes() {
string[] s = new string[currentPage + 1];
for (int i = 0; i < currentPage + 1; i++) s[i] = new string(c[i]);
return s;
}
public void clear() {
c = new List<char[]>();
c.Add(new char[pagesize]);
currentPage = 0;
currentPosInPage = 0;
}
public void fileOpen(string path)
{
clear();
StreamReader sw = new StreamReader(path);
int len = 0;
while ((len = sw.ReadBlock(c[currentPage], 0, (int)pagesize)) != 0) {
if (!sw.EndOfStream) {
currentPage++;
if (currentPage > (c.Count - 1)) c.Add(new char[pagesize]);
}
else {
currentPosInPage = len;
break;
}
}
sw.Close();
}
// See: https://stackoverflow.com/questions/373365/how-do-i-write-out-a-text-file-in-c-sharp-with-a-code-page-other-than-utf-8/373372
public void fileSave(string path) {
StreamWriter sw = File.CreateText(path);
for (int i = 0; i < currentPage; i++) sw.Write(new string(c[i]));
sw.Write(new string(c[currentPage], 0, currentPosInPage));
sw.Close();
}
public long length() {
return (long)currentPage * (long)pagesize + (long)currentPosInPage;
}
public string ToString(long max = 2000000000) {
if (length() < max) return substring(0, length());
else return substring(0, max);
}
public string substring(long x, long y) {
StringBuilder sb = new StringBuilder();
for (long n = x; n < y; n++) sb.Append(c[(int)(n >> pagedepth)][n & mpagesize]); //8s
return sb.ToString();
}
public bool match(string find, long start = 0) {
//if (s.Length > length()) return false;
for (int i = 0; i < find.Length; i++) if (i + start == find.Length || this[start + i] != find[i]) return false;
return true;
}
public void replace(string s, long pos) {
for (int i = 0; i < s.Length; i++) {
c[(int)(pos >> pagedepth)][pos & mpagesize] = s[i];
pos++;
}
}
public void Append(string s)
{
for (int i = 0; i < s.Length; i++)
{
c[currentPage][currentPosInPage] = s[i];
currentPosInPage++;
if (currentPosInPage == pagesize)
{
currentPosInPage = 0;
currentPage++;
if (currentPage == c.Count) c.Add(new char[pagesize]);
}
}
}
}

C# - Code optimization to get all substrings from a string

I was working on a code snippet to get all substrings from a given string.
Here is the code that I use
var stringList = new List<string>();
for (int length = 1; length < mainString.Length; length++)
{
for (int start = 0; start <= mainString.Length - length; start++)
{
var substring = mainString.Substring(start, length);
stringList.Add(substring);
}
}
It looks not so great to me, with two for loops. Is there any other way that I can achieve this with better time complexity.
I am stuck on the point that, for getting a substring, I will surely need two loops. Is there any other way I can look into ?

The number of substrings in a string is O(n^2), so one loop inside another is the best you can do. You are correct in your code structure.
Here's how I would've phrased your code:
void Main()
{
var stringList = new List<string>();
string s = "1234";
for (int i=0; i <s.Length; i++)
for (int j=i; j < s.Length; j++)
stringList.Add(s.Substring(i,j-i+1));
}

You do need 2 for loops
Demo here
var input = "asd sdf dfg";
var stringList = new List<string>();
for (int i = 0; i < input.Length; i++)
{
for (int j = i; j < input.Length; j++)
{
var substring = input.Substring(i, j-i+1);
stringList.Add(substring);
}
}
foreach(var item in stringList)
{
Console.WriteLine(item);
}
Update
You cannot improve on the iterations.
However you can improve performance, by using fixed arrays and pointers

In some cases you can significantly increase execution speed by reducing object allocations. In this case by using a single char[] and ArraySegment<of char> to process substrings. This will also lead to use of less address space and decrease in garbage collector load.
Relevant excerpt from Using the StringBuilder Class in .NET page on Microsoft Docs:
The String object is immutable. Every time you use one of the methods in the System.String class, you create a new string object in memory, which requires a new allocation of space for that new object. In situations where you need to perform repeated modifications to a string, the overhead associated with creating a new String object can be costly.
Example implementation:
static List<ArraySegment<char>> SubstringsOf(char[] value)
{
var substrings = new List<ArraySegment<char>>(capacity: value.Length * (value.Length + 1) / 2 - 1);
for (int length = 1; length < value.Length; length++)
for (int start = 0; start <= value.Length - length; start++)
substrings.Add(new ArraySegment<char>(value, start, length));
return substrings;
}
For more information check Fundamentals of Garbage Collection page on Microsoft Docs, what is the use of ArraySegment class? discussion on StackOverflow, ArraySegment<T> Structure page on MSDN and List<T>.Capacity page on MSDN.

Well, O(n**2) time complexity is inevitable, however you can try impove space consumption. In many cases, you don't want all the substrings being materialized, say, as a List<string>:
public static IEnumerable<string> AllSubstrings(string value) {
if (value == null)
yield break; // Or throw ArgumentNullException
for (int length = 1; length < value.Length; ++length)
for (int start = 0; start <= value.Length - length; ++start)
yield return value.Substring(start, length);
}
For instance, let's count all substrings in "abracadabra" which start from a and longer than 3 characters. Please, notice that all we have to do is to loop over susbstrings without saving them into a list:
int count = AllSubstrings("abracadabra")
.Count(item => item.StartsWith("a") && item.Length > 3);
If for any reason you want a List<string>, just add .ToList():
var stringList = AllSubstrings(mainString).ToList();

Does C# have a way of creating an alias of a local variable?

I know this seems trivial and the compiler may make the optimization anyways, but let's say I have a piece of code like
for (i = 0; i < str.Length; ++i)
{
Console.WriteLine(str[str.Length - i]);
}
and I want to write it like
for (int i = 0, n = str.Length; i < n; ++i)
{
Console.WriteLine(str[n - i]);
}
but I don't want the extra memory from the copy n = str.Length. Is there some way I can simply say that n points to str.Length without creating any extra memory?

You can do the following, and it allows you to point multiple pointers to the reference of a single primitive value type. This accordingly prevents the object from the being cloned (it's value being copied and assigned).
unsafe{
int targetValue = 200;
int* ptr1 = &targetValue;
int* ptr2 = &targetValue;
targetValue = 400;
Console.WriteLine("{0} {1}", *ptr1, *ptr2);
}
The output will be 400 400.

C# vs C++ for loop performance measurment

For kicks, I wanted to see how the speed of a C# for-loop compares with that of a C++ for-loop. My test is to simply iterate over a for-loop 100000 times, 100000 times, and average the result.
Here is my C# implementation:
static void Main(string[] args) {
var numberOfMeasurements = 100000;
var numberOfLoops = 100000;
var measurements = new List < long > ();
var stopwatch = new Stopwatch();
for (var i = 0; i < numberOfMeasurements; i++) {
stopwatch.Start();
for (int j = 0; j < numberOfLoops; j++) {}
measurements.Add(stopwatch.ElapsedMilliseconds);
}
Console.WriteLine("Average runtime = " + measurements.Average() + " ms.");
Console.Read();
}
Result: Average runtime = 10301.92929 ms.
Here is my C++ implementation:
void TestA()
{
auto numberOfMeasurements = 100000;
auto numberOfLoops = 100000;
std::vector<long> measurements;
for (size_t i = 0; i < numberOfMeasurements; i++)
{
auto start = clock();
for (size_t j = 0; j < numberOfLoops; j++){}
auto duration = start - clock();
measurements.push_back(duration);
}
long avg = std::accumulate(measurements.begin(), measurements.end(), 0.0) / measurements.size();
std::cout << "TestB: Time taken in milliseconds: " << avg << std::endl;
}
int main()
{
TestA();
return 0;
}
Result: TestA: Time taken in milliseconds: 0
When I had a look at what was in measurements, I noticed that it was filled with zeros... So, what is it, what is the problem here? Is it clock? Is there a better/correct way to measure the for-loop?

There is no "problem". Being able to optimize away useless code is one of the key features of C++. As the inner loop does nothing, it should be removed by every sane compiler.
Tip of the day: Only profile meaningful code that does something.
If you want to learn something about micro-benchmarks, you might be interested in this.

As "Baum mit Augen" already said the compiler will remove code that doesn't do anything. That is a common mistake when "benchmarking" C++ code. The same thing will happen if you create some kind of benchmark function which just calculates some things that will never used (won't be returned or used otherwise in code) - the compiler will just remove it.
You can avoid this behavior by not using optimize flags like O2, Ofast and so on. Since nobody would do that with real code it won't display the real performance of C++.
TL;DR Just benchmark real production code.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.