What are the fastest possible iteration techniques in C# for the following scenario?
Since I'm working on a small archetype-based ECS in C#, I want to make use of cache-efficient iteration for maximum performance. What could I do to make the iteration faster and get the maximum cache hits?
var chunks = archetype.Chunks; // Property that returns a Chunk[] array
for (var chunkIndex = 0; chunkIndex < archetype.Size; chunkIndex++) {
    ref var chunk = ref chunks[chunkIndex];
    var transforms = chunk.GetArray<Transform>(); // Returns a Transform[] array
    var rotations = chunk.GetArray<Rotation>();   // Returns a Rotation[] array
    for (var index = 0; index < chunk.Capacity; index++) {
        ref var transform = ref transforms[index];
        ref var rotation = ref rotations[index];
        transform.x++;
        rotation.w++;
    }
}
Details...
public struct Transform { float x; float y; }
public struct Rotation { float x; float y; float z; float w; }
T[] (chunk).GetArray<T>() {
    return fittingTightlyPackedManagedArrayForT as T[]; // Pseudocode
}
int (chunk).Capacity { get; set; } // Just a property of how big each array in the chunk is; all arrays have the same size
I already tested an unsafe variant to reduce the bounds checks, however this increased the cache misses according to my benchmark and was only slightly faster (not noticeably, not even for large amounts).
What else could I do to increase the iteration speed? Glad for any feedback, techniques and tricks! :)
A plain loop over an array or list is as fast as iteration gets in C#, at least unless you have some special knowledge not available to the compiler. The compiler should recognize that you are looping over an array and skip the bounds check, and a linear iteration lets the CPU prefetch data before it is actually needed.
In your example I would not be certain the compiler can remove the bounds checks, since the loop condition is not tested against the array length. So I would at least try changing it to two separate loops, each bounded by its array's length, as in the sketch below.
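For illustration, a minimal sketch of that two-loop shape, assuming the Transform/Rotation structs from the question expose their fields:
static void Tick(Transform[] transforms, Rotation[] rotations)
{
    // Bounding each loop by the array's own Length is the pattern the JIT
    // recognises for eliding the bounds check on transforms[i] / rotations[i].
    for (var i = 0; i < transforms.Length; i++)
        transforms[i].x++;

    for (var i = 0; i < rotations.Length; i++)
        rotations[i].w++;
}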
I'm not sure why the unsafe version had a lower cache hit rate; the cache is controlled by the CPU, not the compiler, and I would expect an unsafe version to produce code very similar to the safe one, at least with regard to memory access.
In some special cases it might be useful to manually unroll loops, but the compiler should be able to do this automatically, and this question suggests it is of little use. Compiler optimizations can be fickle, though: the compiler might not always apply the optimizations you expect, and which optimizations it applies can differ between versions, depending on how long the code has been running, whether you use profile-guided optimization, and so on.
To get any real gains I would look at SIMD techniques; if you can process larger chunks of data per instruction you may see very significant gains. But the gains depend in large part on how the data is stored and accessed.
In some cases there can be major gains from a structure-of-arrays (SoA) approach rather than the more common array-of-structures (AoS). In your example, if all the x and w values were stored in separate arrays you could just process the entire array in 128/256/512-bit SIMD blocks, and that would be fairly difficult to beat. It is also very cache-efficient, since you are not loading any unnecessary bytes. But the SoA approach might have performance implications for other parts of the code.
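To make the SoA idea concrete, here is a rough sketch (the separate xs/ws arrays are an assumption about how the component data could be laid out, not your code) using System.Numerics.Vector<float>, which maps to 128/256-bit SIMD registers where the hardware supports it:
using System.Numerics;

static class SoAExample
{
    // Increments every x and w by one, a vector-width at a time.
    // Assumes xs and ws have the same length (one entry per entity).
    public static void IncrementAll(float[] xs, float[] ws)
    {
        var one = new Vector<float>(1f);
        int width = Vector<float>.Count;
        int i = 0;

        for (; i <= xs.Length - width; i += width)
        {
            (new Vector<float>(xs, i) + one).CopyTo(xs, i);
            (new Vector<float>(ws, i) + one).CopyTo(ws, i);
        }

        // Scalar tail for the last few elements that don't fill a full vector.
        for (; i < xs.Length; i++)
        {
            xs[i] += 1f;
            ws[i] += 1f;
        }
    }
}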
Related
Essentially I'm not sure how to store a 3D data structure for the fastest access possible as I'm not sure what is going on under the hood for multi-dimensional arrays.
NOTE: The arrays will be a constant and known size each and every time, and each element will be exactly 16 bits.
Option one is to have a multi-dimensional array data[16, 16, 16] and simply access it via data[x, y, z]; option two is to have a single-dimensional array data[16 * 16 * 16] and access it via data[x + (y * 16) + (z * 16 * 16)].
Each element should only be 16 bits long, and I suspect that a multi-dimensional array would internally store a lot of references to other arrays, at a minimum of 32 bits per reference, which is a lot of wasted memory. However, I fear it may be faster than running the equation from option two each time, and speed is key to this project.
So, can anyone enlighten me as to how much difference in speed there is likely to be, compared to how much difference in memory consumption?
C# stores multidimensional arrays as a single block of memory, so they compile to almost the same thing. (One difference is that there are three sets of bounds to check).
I.e. arr[x, y, z] is just about equivalent to arr[x + y*nx + z*nx*ny] (with the terms ordered to match your option two) and will generally have similar performance characteristics.
The exact performance, however, will be dominated by the pattern of memory access and how well it uses the cache (at least for large amounts of data). You may find that nested loops over x, then y, then z are faster or slower than doing the loops in a different order, if one order does a better job of keeping the currently used data in the processor cache.
This is highly dependent on the exact algorithm, so it isn't possible to give an answer which is correct for all algorithms.
The other cause of any speed reduction versus C or C++ is the bounds-checking, which will still be needed in the one-dimensional array case. However these will often, but not always, be removed automatically.
https://blogs.msdn.microsoft.com/clrcodegeneration/2009/08/13/array-bounds-check-elimination-in-the-clr/
Again, the exact algorithm will affect whether the optimiser is able to remove the bounds checks.
Your course of action should be as follows:
Write a naïve version of the algorithm with arr[x,y,z].
If it's fast enough you can stop.
Otherwise profile the algorithm to check it is actually array accesses which are the issue, analyse the memory access patterns and so on.
I think it's worth pointing out that if your array dimensions are really all 16, then you can calculate the index for the array from (x, y, z) much more efficiently:
int index = x | y << 4 | z << 8;
And the inverse:
int x = index & 0xf;
int y = (index >> 4) & 0xf;
int z = (index >> 8) & 0xf;
If this is the case, then I recommend using the single-dimensional array since it will almost certainly be faster.
Note that it's entirely possible that the JIT compiler would perform this optimisation anyway (assuming that the multiplication is hard-coded as per your OP), but it's worth doing explicitly.
The reason that I say the single-dimensional array would be faster is because the latest compiler is lacking some of the optimisations for multi-dimensional array access, as discussed in this thread.
That said, you should perform careful timings to see what really is the fastest.
As Eric Lippert says: "If you want to know which horse is faster, race your horses".
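Putting the pieces together, a minimal sketch of the single-dimensional layout with the shift-based indexing (the class and member names are illustrative; ushort simply matches the "exactly 16 bits per element" requirement):
public sealed class Grid16
{
    // 16 * 16 * 16 elements in one contiguous block.
    private readonly ushort[] data = new ushort[16 * 16 * 16];

    public ushort this[int x, int y, int z]
    {
        get => data[x | (y << 4) | (z << 8)];
        set => data[x | (y << 4) | (z << 8)] = value;
    }
}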
I would vote for the single-dimension array; it should work much faster. Basically, you can write some tests performing your most common tasks and measuring the time spent.
Also, if your array sizes are powers of two, it is much faster to compute an element's position using left-shift operations instead of multiplication.
I have a very large two dimensional array and I need to compute vector operations on this array. NTerms and NDocs are both very large integers.
var myMat = new double[NTerms, NDocs];
I need to extract vector columns from this matrix. Currently, I'm using for loops.
int col = 100;
for (int i = 0; i < NTerms; i++)
{
    myVec[i] = myMat[i, col];
}
This operation is very slow. In Matlab I can extract the vector without the need for iteration, like so:
myVec = myMat(:, col);
Is there any way to do this in C#?
There are no constructs in C# that will let you work with arrays the way you do in Matlab. With the code you already have, you can speed up the vector creation using the Task Parallel Library that was introduced in .NET Framework 4.0.
Parallel.For(0, NTerms, i => myVec[i] = myMat[i, col]);
If your CPU has more than one core you will get some improvement in performance; otherwise there will be no effect.
For more examples of how the Task Parallel Library can be used with matrices and arrays, you can refer to the MSDN article Matrix Decomposition.
But I doubt that C# is a good choice when it comes to some serious math calculations.
Some possible problems:
Could it be the way that elements are accessed for multi-dimensional arrays in C#? See this earlier article.
Another problem may be that you are accessing non-contiguous memory - so not much help from cache, and maybe you're even having to fetch from virtual memory (disk) if the array is very large.
What happens to your speed when you access a whole row at a time, instead of a column? If that's significantly faster, you can be 90% sure it's a contiguous-memory issue...
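As a rough sketch of that experiment (reusing NTerms, NDocs, myMat and col from the question; rowIndex is just an illustrative variable): rows of a double[,] are contiguous in memory, so a whole row can be copied in one call, while a column has to be gathered element by element.
// Row extraction: one contiguous block copy (offsets and length are in bytes).
double[] row = new double[NDocs];
Buffer.BlockCopy(myMat, rowIndex * NDocs * sizeof(double), row, 0, NDocs * sizeof(double));

// Column extraction: one strided read per element.
double[] column = new double[NTerms];
for (int i = 0; i < NTerms; i++)
    column[i] = myMat[i, col];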
I have a 123 MB int array, and it is basically used like this:
private static int[] data = new int[32487834];
static int eval(int[] c)
{
    int p = data[c[0]];
    p = data[p + c[1]];
    p = data[p + c[2]];
    p = data[p + c[3]];
    p = data[p + c[4]];
    p = data[p + c[5]];
    return data[p + c[6]];
}
eval() is called a lot (~50 billion times) with different c, and I would like to know if (and how) I could speed it up.
I already use an unsafe function with a fixed array that makes use of all the CPU cores. It's a C# port of the TwoPlusTwo 7-card evaluator by RayW. The C++ version is insignificantly faster.
Can the GPU be used to speed this up?
Cache the array reference in a local variable. Static field accesses are generally slower than locals for multiple reasons (one of them is that the field can change, so it has to be reloaded all the time; the JIT can optimize locals much more freely).
Don't use an array as the argument to the method. Hard-code seven integer indices (see the sketch at the end of this answer). That removes an array allocation, an indirection penalty and bounds checking.
Use unsafe code to index into the array. This will eliminate bounds checking. Use a GCHandle to pin the array and cache the pointer in a static field (don't just use a fixed block - I believe it has a certain (small) overhead associated with entering it; not sure).
As an alternative to pinning the array, allocate the 123 MB array using VirtualAlloc and use huge pages. That cuts down on TLB misses.
All of these are hardcore low-level optimizations. They only apply if you need maximum performance.
I think we are pretty much at the limit here when it comes to optimizing this function. We can probably only do better if you show the callers of the function, so that caller and callee can be optimized as a single unit.
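For reference, a sketch of the first two suggestions applied to the question's eval (the unsafe/pinned variant would additionally replace the local with a cached int* obtained from a pinned GCHandle); this is an illustration, not a drop-in replacement:
static int Eval(int c0, int c1, int c2, int c3, int c4, int c5, int c6)
{
    int[] d = data;        // hoist the static field into a local once
    int p = d[c0];
    p = d[p + c1];
    p = d[p + c2];
    p = d[p + c3];
    p = d[p + c4];
    p = d[p + c5];
    return d[p + c6];
}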
I'm writing an app that will create thousands of small objects and store them recursively in arrays. By "recursively" I mean that each instance of K will have an array of K instances, which will have an array of K instances, and so on; this array plus one int field are the only properties, plus some methods. I found that memory usage grows very fast even for a small amount of data (about 1 MB), and when the data I'm processing is about 10 MB I get an "OutOfMemoryException", not to mention when it's bigger (I have 4 GB of RAM) :). So what do you suggest I do? I figured that if I created a separate class V to process those objects, so that instances of K would have only an array of K's plus one integer field, and made K a struct rather than a class, it should optimize things a bit - no garbage collection and stuff... But it's a bit of a challenge, so I'd rather ask you whether it's a good idea before I start a total rewrite :).
EDIT:
Ok, some abstract code
public void Add(string word) {
    int i;
    string shorterWord;
    if (word.Length > 0) {
        i = //something, it's really irrelevant
        if (t[i] == null) {
            t[i] = new MyClass();
        }
        shorterWord = word.Substring(1);
        // end of word
        if (shorterWord.Length == 0) {
            t[i].WordEnd = END;
        }
        // saving the word letter by letter
        t[i].Add(shorterWord);
    }
}
When researching deeper into this I had the following assumptions (they may be inexact; I'm getting old for a programmer). A class has extra memory consumption because a reference is required to address it: store the reference and an Int32-sized pointer is needed on a 32-bit build. Class instances are also always allocated on the heap (I can't remember if C++ has other possibilities; I would venture yes).
The short answer, found in the article linked below, is that an object has a 12-byte basic footprint plus 4 possibly unused bytes, depending on your class (which no doubt has something to do with padding).
http://www.codeproject.com/Articles/231120/Reducing-memory-footprint-and-object-instance-size
Another issue you'll run into is that arrays also have overhead. One possibility would be to manage your own offsets into one larger array (or a few of them), which in turn is getting closer to something a more memory-efficient language would be better suited for.
I'm not sure if there are libraries that provide storage for small objects in an efficient manner. There probably are.
My take on it: use structs, manage your own offsets in a large array, and use explicit packing instructions if it serves you (although I suspect this comes at a cost of a few extra instructions at runtime each time you address unevenly packed data), e.g.:
[StructLayout(LayoutKind.Sequential, Pack = 1)]
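A hedged sketch of that idea: structs kept in one big array, linked by int indices rather than object references (all names here are illustrative, not from the question):
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct Node
{
    public int WordEnd;      // the single int field from the question
    public int FirstChild;   // index into the pool, -1 if none
    public int NextSibling;  // index into the pool, -1 if none
}

public sealed class NodePool
{
    private Node[] nodes = new Node[1024];
    private int count;

    // Returns the index of a freshly initialised node.
    public int Allocate()
    {
        if (count == nodes.Length)
            Array.Resize(ref nodes, nodes.Length * 2);
        nodes[count] = new Node { FirstChild = -1, NextSibling = -1 };
        return count++;
    }

    // ref return avoids copying the struct on every access.
    public ref Node Get(int index) => ref nodes[index];
}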
Your stack is blowing up.
Do it iteratively instead of recursively.
You're not blowing the system stack up, you're blowing the code stack up; 10K function calls will blow it out of the water.
You need proper tail recursion, which is just an iterative hack.
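For example, the Add method from the question can be rewritten as a loop that walks the word with a "current node" cursor instead of recursing on every shorter suffix (Index() below stands in for the elided index computation and is not a real method from the question):
public void Add(string word)
{
    if (word.Length == 0) return;
    var node = this;
    foreach (char ch in word)
    {
        int i = Index(ch);            // placeholder for the "//something" computation
        if (node.t[i] == null)
            node.t[i] = new MyClass();
        node = node.t[i];
    }
    node.WordEnd = END;               // same end-of-word marking as before
}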
Make sure you have enough memory in your system - over 100 MB or more; it really depends on your system. Linked lists of recursive objects are what you are looking at. If you keep recursing, it is going to hit the memory limit and an OutOfMemoryException will be thrown. Make sure you keep track of the memory usage in any program; nothing is unlimited, especially memory. If memory is limited, save the data to disk.
It looks like there is infinite recursion in your code, and that is why out-of-memory is thrown. Check the code: there should be a start and an end (a base case) in recursive code. Otherwise it will go past 10 terabytes of memory at some point.
You can use a better data structure, i.e. each letter can be a byte (a = 0, b = 1, ...). Each word fragment can be indexed as well, especially substrings - you should get away with significantly less memory (though with a performance penalty).
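As a small sketch of the byte-per-letter idea (assumes plain lowercase a-z input), which halves the storage compared to keeping .NET strings at two bytes per char:
static byte[] Encode(string word)
{
    var bytes = new byte[word.Length];
    for (int i = 0; i < word.Length; i++)
        bytes[i] = (byte)(word[i] - 'a');   // 'a' -> 0, 'b' -> 1, ...
    return bytes;
}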
Just post your recursive algorithm with sanitized variable names. If you are doing a BFS-type traversal and keep all objects in memory, you will run out of memory. In that case, replace it with DFS.
Edit 1:
You can speed up the algorithm by estimating how many items you will generate and then allocating that much memory at once. As the algorithm progresses, fill up the allocated memory. This reduces fragmentation, reallocation, and copy-on-full-array operations.
Nonetheless, after you are done operating on the generated words, you should delete them from your data structure so they can be garbage-collected and you don't run out of memory.
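A tiny sketch of the pre-allocation idea (estimatedCount is a hypothetical upper bound you would compute up front):
// Capacity reserved once, so the backing array never has to grow and copy.
var words = new List<string>(estimatedCount);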
I have two methods to reverse a string and need to compare them:
// ReverseD relies on the System.Linq Enumerable.Reverse() extension.
public string ReverseD(string text)
{
    return new string(text.ToCharArray().Reverse().ToArray());
}

public string ReverseB(string text)
{
    char[] charArray = text.ToCharArray();
    Array.Reverse(charArray);
    return new string(charArray);
}
How can I determine the running time of these two algorithms in O() notation? I need to compare them.
Both are O(n) - they both go through the full array, which is O(n).
The first algorithm has a worse constant factor, though, since effectively you are "buffering" the full array and then emitting it in reverse order (unless Enumerable.Reverse() is optimized for arrays under the hood; I don't have Reflector handy right now). Since you buffer the full array and then emit a new array in reverse order, you could say the effort is 2n, so the constant factor is c = 2.
The second algorithm uses array indexes, so you are performing n/2 element swaps within the same array - still O(n), but with a constant factor of c = 0.5.
While knowing the asymptotic performance of these two approaches to reversing a string is useful, it's probably not what you really need. You're probably not going to be applying the algorithm to strings whose length becomes larger and larger.
In cases like this, it's actually more helpful to just run both algorithms a bunch of times and see which one takes less time.
Most likely it will be the straight Array.Reverse() version, since it will intelligently swap items within the array, whereas the Enumerable.Reverse() method will yield return each element in reverse order. Either will be O(n), since they both manipulate each of the n items in the array a constant number of times.
But again, the best way to see which one will perform better is to actually run them and see.
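A rough Stopwatch harness for doing exactly that (assuming the two methods above are reachable from here, e.g. made static; for serious measurements a tool such as BenchmarkDotNet is more trustworthy):
using System;
using System.Diagnostics;

var text = new string('x', 10_000);
const int iterations = 100_000;

var sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) ReverseB(text);
sw.Stop();
Console.WriteLine($"Array.Reverse version:      {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < iterations; i++) ReverseD(text);
sw.Stop();
Console.WriteLine($"Enumerable.Reverse version: {sw.ElapsedMilliseconds} ms");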