C# Improve performance of SIMD Sum [closed]


Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm writing a SIMD library and trying to squeeze out every bit of performance.
I'm already casting the array in place to a Span<Vector<int>> instead of creating new objects.
The target arrays are large (more than 1000 elements).
Is there a more efficient way of summing an array?
Ideas are welcome.
public static int Sum(int[] array)
{
    Vector<int> vSum = Vector<int>.Zero;
    int sum;
    int i;
    Span<Vector<int>> vsArray = MemoryMarshal.Cast<int, Vector<int>>(array);
    for (i = 0; i < vsArray.Length; i++)
    {
        vSum += vsArray[i];
    }
    sum = Vector.Dot(vSum, Vector<int>.One);
    i *= Vector<int>.Count;
    for (; i < array.Length; i++)
    {
        sum += array[i];
    }
    return sum;
}

Your code is good. I could only improve it by about 4%; here's how:
// Test result: only 4% win on my PC.
[MethodImpl( MethodImplOptions.AggressiveInlining )]
static int sumUnsafeAvx2( int[] array )
{
    unsafe
    {
        fixed( int* sourcePointer = array )
        {
            int* pointerEnd = sourcePointer + array.Length;
            int* pointerEndAligned = sourcePointer + ( array.Length - array.Length % 16 );
            Vector256<int> sumLow = Vector256<int>.Zero;
            Vector256<int> sumHigh = sumLow;
            int* pointer;
            for( pointer = sourcePointer; pointer < pointerEndAligned; pointer += 16 )
            {
                var a = Avx.LoadVector256( pointer );
                var b = Avx.LoadVector256( pointer + 8 );
                sumLow = Avx2.Add( sumLow, a );
                sumHigh = Avx2.Add( sumHigh, b );
            }
            sumLow = Avx2.Add( sumLow, sumHigh );
            Vector128<int> res4 = Sse2.Add( sumLow.GetLower(), sumLow.GetUpper() );
            res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 0x4E ) );
            res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 1 ) );
            int scalar = res4.ToScalar();
            for( ; pointer < pointerEnd; pointer++ )
                scalar += *pointer;
            return scalar;
        }
    }
}
Here's a complete test.
To be clear, I don't recommend doing what I wrote above, not for a 4% improvement. Unsafe code is, well, unsafe. Your version works without AVX2 and can benefit from wider vectors such as AVX-512 if the runtime uses them; my optimized version will crash without AVX2 and won't use AVX-512 even if the hardware supports it.
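The crash on hardware without AVX2 can be avoided with a feature check. The following is a minimal sketch rather than the benchmarked code: it assumes .NET Core 3.0 or later, replaces the pointer arithmetic with MemoryMarshal.Cast so no unsafe block is needed, and the names SumPortable, SumAvx2 and SumDispatch are made up for illustration. Whether it keeps the 4% win has not been measured.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Portable path: Vector<T> uses whatever width the runtime selects.
static int SumPortable(ReadOnlySpan<int> data)
{
    ReadOnlySpan<Vector<int>> blocks = MemoryMarshal.Cast<int, Vector<int>>(data);
    Vector<int> acc = Vector<int>.Zero;
    foreach (Vector<int> block in blocks)
        acc += block;
    int sum = Vector.Dot(acc, Vector<int>.One);
    // Scalar tail for the elements that did not fill a whole vector.
    for (int i = blocks.Length * Vector<int>.Count; i < data.Length; i++)
        sum += data[i];
    return sum;
}

// AVX2 path with two independent accumulators, as in the unsafe version.
// Must only be called when Avx2.IsSupported is true.
static int SumAvx2(ReadOnlySpan<int> data)
{
    ReadOnlySpan<Vector256<int>> blocks = MemoryMarshal.Cast<int, Vector256<int>>(data);
    Vector256<int> sumLow = Vector256<int>.Zero;
    Vector256<int> sumHigh = Vector256<int>.Zero;
    int b = 0;
    for (; b + 1 < blocks.Length; b += 2)
    {
        sumLow = Avx2.Add(sumLow, blocks[b]);
        sumHigh = Avx2.Add(sumHigh, blocks[b + 1]);
    }
    if (b < blocks.Length)
        sumLow = Avx2.Add(sumLow, blocks[b]);   // odd block left over
    sumLow = Avx2.Add(sumLow, sumHigh);

    // Horizontal add: 8 lanes -> 4 lanes -> 1 lane.
    Vector128<int> res4 = Sse2.Add(sumLow.GetLower(), sumLow.GetUpper());
    res4 = Sse2.Add(res4, Sse2.Shuffle(res4, 0x4E));
    res4 = Sse2.Add(res4, Sse2.Shuffle(res4, 0xB1));
    int sum = res4.ToScalar();
    for (int i = blocks.Length * Vector256<int>.Count; i < data.Length; i++)
        sum += data[i];
    return sum;
}

// Hypothetical dispatcher: take the AVX2 path only when the CPU supports it.
static int SumDispatch(int[] array)
    => Avx2.IsSupported ? SumAvx2(array) : SumPortable(array);
```

Calling SumDispatch(array) gives the same result on either path; integer overflow wraps around in both, as it does in the original code.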

Related

Intrinsics SIMD instruction to replace values

I wonder how it would be possible to replace byte values in a Vector128<byte>.
Assume the code below, which produces a resultvector with these values:
<0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0>
I would like to create a new vector where every "0" is replaced with "2"
and every "1" is replaced with "0", like this:
<2,2,2,2,0,0,0,0,2,2,2,2,2,2,2,2>
I am not sure whether there is an intrinsic for this, or how to achieve it.
Thank you!
//Create array
byte[] array = new byte[16];
for (int i = 0; i < 4; i++) { array[i] = 0; }
for (int i = 4; i < 8; i++) { array[i] = 1; }
for (int i = 8; i < 16; i++) { array[i] = 0; }
fixed (byte* ptr = array)
{
    byte* pointarray = &*((byte*)(ptr + 0));
    System.Runtime.Intrinsics.Vector128<byte> resultvector = System.Runtime.Intrinsics.X86.Avx.LoadVector128(&pointarray[0]);
    //<0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0>
    //resultvector
}

The instruction for that is pshufb, available in modern .NET as Avx2.Shuffle, and as Ssse3.Shuffle for the 16-byte version. Both are really fast, with 1 cycle of latency on modern CPUs.
Pass your source data as the shuffle control mask argument, and pass a special value as the first argument (the bytes being shuffled), something like this:
// Create AVX vector with all zeros except the first byte in each 16-byte lane which is 2
static Vector256<byte> makeShufflingVector()
{
    Vector128<byte> res = Vector128<byte>.Zero;
    res = Sse2.Insert( res.AsInt16(), 2, 0 ).AsByte();
    return Vector256.Create( res, res );
}
See _mm_shuffle_epi8 section on page 18 of this article for details.
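For the 16-byte vector in the question, the lookup can be sketched as follows. This is an illustration under assumptions, not tested production code: it assumes .NET Core 3.0 or later, assumes the input holds only the values 0 and 1 as in the question, and the method name ReplaceViaShuffle is made up. The scalar branch is there only so the sketch also runs on CPUs without SSSE3.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Maps 0 -> 2 and 1 -> 0 for exactly 16 bytes.
static byte[] ReplaceViaShuffle(byte[] input)
{
    if (input.Length != 16)
        throw new ArgumentException("Expected exactly 16 bytes.", nameof(input));

    byte[] output = new byte[16];
    if (Ssse3.IsSupported)
    {
        // Lookup table: entry 0 is 2, every other entry is 0.
        Vector128<byte> table = Vector128.Create((byte)2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        Vector128<byte> src = MemoryMarshal.Read<Vector128<byte>>(input);
        // pshufb: each source byte selects an entry of the table.
        Vector128<byte> result = Ssse3.Shuffle(table, src);
        MemoryMarshal.Cast<byte, Vector128<byte>>(output.AsSpan())[0] = result;
    }
    else
    {
        // Scalar fallback, equivalent only for inputs made of 0 and 1.
        for (int i = 0; i < input.Length; i++)
            output[i] = input[i] == 0 ? (byte)2 : (byte)0;
    }
    return output;
}
```

Note that pshufb only looks at the low 4 bits of each control byte (and the high bit, which forces a zero), so source values of 16 or more would need different handling.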
Update: if you don’t have SSSE3, you can do the same in SSE2, in 2 instructions instead of 1:
static Vector128<byte> replaceZeros( Vector128<byte> src )
{
    src = Sse2.CompareEqual( src, Vector128<byte>.Zero );
    return Sse2.And( src, Vector128.Create( (byte)2 ) );
}
By the way, there's a performance problem in .NET that prevents the compiler from hoisting constant loads out of loops. If you are going to call that method in a loop and want to maximize performance, consider passing both constant vectors, the zero vector and the vector of 2s, as method parameters.
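Passing the constants in could look roughly like this. It is a sketch under assumptions: .NET Core 3.0 or later, the names ReplaceZeros and ReplaceZerosInPlace are made up for illustration, and the scalar loop covers both the tail and CPUs without SSE2. Whether it is actually faster depends on the runtime version, so measure before relying on it.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Same two instructions as above, but the constants arrive as parameters.
static Vector128<byte> ReplaceZeros(Vector128<byte> src, Vector128<byte> zero, Vector128<byte> two)
    => Sse2.And(Sse2.CompareEqual(src, zero), two);

// Replaces 0 with 2 and everything else with 0, in place.
static void ReplaceZerosInPlace(Span<byte> data)
{
    int done = 0;
    if (Sse2.IsSupported)
    {
        // Both constants are created once, before the loop.
        Vector128<byte> zero = Vector128<byte>.Zero;
        Vector128<byte> two = Vector128.Create((byte)2);
        Span<Vector128<byte>> blocks = MemoryMarshal.Cast<byte, Vector128<byte>>(data);
        for (int b = 0; b < blocks.Length; b++)
            blocks[b] = ReplaceZeros(blocks[b], zero, two);
        done = blocks.Length * 16;
    }
    // Scalar loop for the remaining bytes (or for everything without SSE2).
    for (int i = done; i < data.Length; i++)
        data[i] = data[i] == 0 ? (byte)2 : (byte)0;
}
```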

Why is this code searching for a substring so much slower in C++ than in C#? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have two different .txt files that contain 1,000,000 digits of pi and the first 200 Fibonacci numbers.
Here pi1000000 ---> https://dotnetfiddle.net/DbcWBQ
Here fibonacci200 ---> https://dotnetfiddle.net/8o9hnB
My goal is to search for each Fibonacci number, one by one, in the digits of pi.
I wrote the program in two languages: C++ and C#.
There is a huge difference in execution time between them, and I don't know the reason.
For the same process, the C# version completes in 4 seconds and the C++ version in 80 seconds.
Why is there such a huge difference in execution time?
This is my algorithm to search for a small string in a bigger one.
C# code
public static void search(string text, string pattern)
{
    for (int i = 0; i <= text.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length)
        {
            if (text[i + j] != pattern[j]) break;
            j++;
        }
        if (j == pattern.Length)
        {
            //Console.WriteLine("Pattern is found at index: " + i.ToString() + " and the value is: " + pattern.ToString());
        }
    }
}
public static void Main()
{
    string pi = File.ReadAllText("pi1000000.txt", Encoding.ASCII);
    string[] fibo = File.ReadAllLines("fibonacci200.txt", Encoding.ASCII);
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int i = 0; i < 200; i++)
    {
        search(pi, fibo[i]);
    }
    stopwatch.Stop();
    Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);
}
C++ code
#include <iostream>
#include <string>
#include <fstream>
#include <ctime>
using namespace std;
void search(string text, string pattern)
{
    int t_l = text.length();
    int p_l = pattern.length();
    int difference = t_l - p_l;
    for (int i = 0; i <= difference; i++)
    {
        int j = 0;
        for (j; j < p_l; j++)
        {
            if (text[i + j] != pattern[j]) break;
        }
        if (j == p_l)
        {
            //cout << i << endl;
        }
    }
}
int main()
{
    ifstream infile1;
    string pi;
    infile1.open("pi1000000.txt");
    infile1 >> pi;
    infile1.close();
    short int i = 0;
    string fibo[200];
    string a;
    ifstream infile2;
    infile2.open("fibonacci200.txt");
    while (getline(infile2, a))
    {
        fibo[i] += a;
        i++;
    }
    infile2.close();
    clock_t begin = clock();
    for (int i = 0; i < 2; i++)
    {
        search(pi, fibo[i]);
    }
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;
    string x;
    cin >> x;
    return 0;
}

In your C++ code you read all 1,000,000 digits of pi into the string pi, and then pass that huge string by value to search 200 times. That's at least 200 copies of the same huge string, each with a large memory allocation and deallocation.
Instead, pass it by reference:
void search(const string& text, const string& pattern)
And then check how the code snippets fare.
You're doing the same in C#, but it isn't an issue there: string is a reference type, so you're already passing a reference to the actual string.
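The C# side of this is easy to check. In the small sketch below (the helper name PassThrough and the stand-in string are made up for illustration), the method hands back exactly the object it received, so no characters were copied on the way in:

```csharp
using System;

// Hypothetical helper: returns whatever it receives, so the caller can compare identities.
static string PassThrough(string text) => text;

string pi = new string('1', 1_000_000);   // stand-in for the million digits of pi
bool sameObject = object.ReferenceEquals(pi, PassThrough(pi));
Console.WriteLine(sameObject);            // True: only a reference was passed
```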
I tested the code myself with the strings passed by reference, in both release and debug x64 builds on MSVC. Release optimizes out the whole loop because it has no observable effect (so it isn't even testable); debug finishes in 1 second.

C++ code is compiled ahead of time directly to machine instructions, whereas C# code is compiled to Intermediate Language (IL), which the Common Language Runtime (CLR) then compiles just in time (JIT). The CLR, part of Microsoft's .NET, also provides memory management and thread management.

Change from iterative to recursive method [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
int arraySum (int [] a, int n)
{
    int sum = 0;
    n = a.size();
    for (int i = 1; i < n; i++)
        sum += a[i];
    return sum;
}
I want to convert this code from iterative to recursive.

C# Version:
// Note: i defaults to 0, so technically this code is logically different from yours
// (your loop starts at 1); it counts every element, and 0 is just a default :)
int arraySum( int[] a, int sum = 0, int i = 0 )
{
    if( i < a.Length )
        return arraySum( a, sum + a[i], i + 1 );
    return sum;
}

You need two things:
1- A recursive definition, such as: sum(a, n) = a[n-1] + sum(a, n-1)
2- A stopping condition, so the recursion does not go on forever,
for example: if (n == 0) return 0;
Based on these you can write the code in any language.
C++ Example:
int arraySum (int a[], int n)
{
    if (n == 0)
        return 0; // stopping condition: nothing left to add
    else
        return a[n-1] + arraySum (a, n-1);
}
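Since the question's code looks closer to C#, here is the same two-step recipe as a C# sketch. The name ArraySumRecursive is only illustrative, and unlike the loop in the question it includes the element at index 0:

```csharp
using System;

// sum(a, n) = a[n-1] + sum(a, n-1), with sum(a, 0) = 0
static int ArraySumRecursive(int[] a, int n)
{
    if (n == 0)
        return 0;                                   // stopping condition
    return a[n - 1] + ArraySumRecursive(a, n - 1);  // recursive step
}
```

Call it as ArraySumRecursive(values, values.Length). Each element adds one stack frame, so for very large arrays the iterative version remains the safer choice.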

How to express this function mathematically [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
How would you express this loop in C# as a mathematical expression?
private string FormatBytes(long bytes)
{
    string[] Suffix = { "B", "KB", "MB", "GB", "TB" };
    int i;
    double dblSByte = bytes;
    for (i = 0; i < Suffix.Length && bytes >= 1024; i++, bytes /= 1024)
    {
        dblSByte = bytes / 1024.0;
    }
    return String.Format("{0:0.##} {1}", dblSByte, Suffix[i]);
}

You can calculate this mathematically by first working out the nearest smaller power of 1024 to the number:
int power = (int) Math.Log(bytes, 1024);
Then you can limit that number to the number of suffixes so you don't go past the end of the array:
int power = Math.Min(Suffix.Length-1, (int) Math.Log(bytes, 1024));
Then you work out what you should divide the original number by based on that power:
double div = Math.Pow(1024, power);
Then you can format the string using the suffix for the specified power of 1024:
return string.Format("{0:f1}{1}", bytes / div, Suffix[power]);
Putting this all together (and throwing in "PB" for petabytes):
private string FormatBytes(long bytes)
{
    string[] Suffix = { "B", "KB", "MB", "GB", "TB", "PB" };
    if (bytes <= 0) return "0.0B"; // Math.Log is not finite for 0 or negative values
    int power = Math.Min(Suffix.Length-1, (int) Math.Log(bytes, 1024));
    double div = Math.Pow(1024, power);
    return string.Format("{0:f1}{1}", bytes / div, Suffix[power]);
}
Et voila! Calculated mathematically without using a loop.
(I bet this isn't measurably faster than the loop though...)
If you wanted to, you could extend the suffix array to include "EB" (exabytes), and then it would work nicely all the way up to long.MaxValue, which is 8.0EB.
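One caveat with Math.Log(bytes, 1024) is floating-point rounding: for values that are exact powers of 1024 the result can land just below the integer and pick the smaller suffix. A variation that stays in integer arithmetic is sketched below. It swaps in BitOperations.Log2, which is not what the code above uses, assumes .NET Core 3.0 or later, and the name FormatBytesExact is made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Numerics;

static string FormatBytesExact(long bytes)
{
    string[] suffix = { "B", "KB", "MB", "GB", "TB", "PB", "EB" };
    if (bytes <= 0) return "0.0B";   // assumption: treat non-positive input as zero

    // floor(log2(bytes)) / 10 is floor(log1024(bytes)), computed without floating point.
    int power = Math.Min(suffix.Length - 1, BitOperations.Log2((ulong)bytes) / 10);
    double div = 1L << (10 * power);  // 1024^power
    return string.Format(CultureInfo.InvariantCulture, "{0:f1}{1}", bytes / div, suffix[power]);
}
```

With this version FormatBytesExact(1536) gives "1.5KB" and FormatBytesExact(long.MaxValue) gives "8.0EB". The invariant culture is used so the decimal separator does not depend on the machine's locale.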

C# using loops to generate a table with correct arithmetic relationship [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
The following is the ShowTab() method. How can I put dynamic numbers and results into the table?
using System;
const int MAX = 4;
int cage = 500/total;
int month = 1;
int adults = 1;
int babies = 0;
int total = 1;
Console.WriteLine("Month\tAdults\tBabies\tTotal");
Console.WriteLine("{0, -10}{1, -10}{2, -10}{3, -10}", month, adults, babies, total);
for(int i = 0; i < MAX; i++) {
Console.writeLine(
}

Maybe I missed something, but if it's only about formatting, something like this should do the job:
int month = 1;
int adults = 1;
int babies = 0;
int total = 1;
Console.WriteLine ("header row"); // optional (if needed)
while (/* there are still cages to hold them */)
{
    // print current state (-10 width chosen as an example; negative means left-aligned)
    Console.WriteLine ($"{month, -10}{adults, -10}{babies, -10}{total, -10}");
    // do the maths to update values
    month = /* ... */;
    adults = /* ... */;
    babies = /* ... */;
    total = /* ... */;
}
Here is a dummy example that illustrates why I chose a width format specifier rather than tabs (as hinted in one comment link).
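To make the skeleton above concrete, here is a small runnable sketch. The growth rule is purely an assumption, since the question does not state one: every adult produces one baby per month and babies become adults after one month. The names FormatRow and BuildTable are likewise made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// One table row; each column is left-aligned in a 10-character field.
static string FormatRow(int month, int adults, int babies, int total)
    => $"{month,-10}{adults,-10}{babies,-10}{total,-10}";

static List<string> BuildTable(int months)
{
    var rows = new List<string>();
    int adults = 1;
    int babies = 0;
    for (int month = 1; month <= months; month++)
    {
        rows.Add(FormatRow(month, adults, babies, adults + babies));
        // ASSUMED rule (not from the question): each adult has one baby,
        // and last month's babies grow up.
        int newBabies = adults;
        adults += babies;
        babies = newBabies;
    }
    return rows;
}
```

Printing is then a matter of writing a header with the same widths, for example Console.WriteLine($"{"Month",-10}{"Adults",-10}{"Babies",-10}{"Total",-10}"), followed by each row from BuildTable(4). Swap the three update lines for whatever relationship the exercise actually requires.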
