I have a 123 MB int array, and it is basically used like this:
private static int[] data = new int[32487834];
static int eval(int[] c)
{
    int p = data[c[0]];
    p = data[p + c[1]];
    p = data[p + c[2]];
    p = data[p + c[3]];
    p = data[p + c[4]];
    p = data[p + c[5]];
    return data[p + c[6]];
}
eval() is called a lot (~50 billion times) with different c, and I would like to know if (and how) I could speed it up.
I already use an unsafe function with a fixed array, and it makes use of all the CPUs. It's a C# port of the TwoPlusTwo 7-card evaluator by RayW. The C++ version is only insignificantly faster.
Can the GPU be used to speed this up?
Cache the array reference into a local variable. Static field accesses are generally slower than locals for multiple reasons (one of them is that the field can change, so it has to be reloaded all the time; the JIT can optimize locals much more freely).
Don't use an array as the argument to the method. Hard-code 7 integer indices. That avoids the array allocation, the indirection penalty and the bounds checking.
Use unsafe code to index into the array. This will eliminate bounds checking. Use a GCHandle to pin the array and cache the pointer in a static field (don't just use a fixed block - I believe it has a certain (small) overhead associated with entering it; not sure). There is a sketch combining these points at the end of this answer.
As an alternative to pinning the array, allocate the 123MB array using VirtualAlloc and use huge pages. That cuts down on TLB misses.
All of these are hardcore low-level optimizations. They only apply if you need maximum performance.
I think we are pretty much at the limit here when it comes to optimizing this function. We can probably only do better if you show the caller of the function, so that caller and function can be optimized as a single unit.
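In the meantime, here is a rough, untested sketch of what points 1-3 above could look like combined (the names Evaluator, Eval7 and dataPtr are just for illustration, not from the original code; compile with /unsafe):

using System.Runtime.InteropServices;

static class Evaluator
{
    private static readonly int[] data = new int[32487834];
    private static unsafe int* dataPtr;   // cached base pointer, so no fixed block per call
    private static GCHandle handle;       // keeps the array pinned for the lifetime of the process

    static Evaluator()
    {
        handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        unsafe { dataPtr = (int*)handle.AddrOfPinnedObject(); }
    }

    static unsafe int Eval7(int c0, int c1, int c2, int c3, int c4, int c5, int c6)
    {
        int* d = dataPtr;                 // local copy so the JIT can keep it in a register
        int p = d[c0];
        p = d[p + c1];
        p = d[p + c2];
        p = d[p + c3];
        p = d[p + c4];
        p = d[p + c5];
        return d[p + c6];
    }
}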
Related
What are the fastest possible iteration techniques in C# for the following scenario?
I'm working on a small archetype-based ECS in C#, and I want to make use of cache-efficient iteration for maximum performance. What could I do to make the iteration faster and get the maximum number of cache hits?
var chunks = archetype.Chunks; // Property that returns a Chunk[] array
for (var chunkIndex = 0; chunkIndex < archetype.Size; chunkIndex++) {
    ref var chunk = ref chunks[chunkIndex];
    var transforms = chunk.GetArray<Transform>(); // Returns a Transform[] array
    var rotations = chunk.GetArray<Rotation>();   // Returns a Rotation[] array
    for (var index = 0; index < chunk.Capacity; index++) {
        ref var transform = ref transforms[index];
        ref var rotation = ref rotations[index];
        transform.x++;
        rotation.w++;
    }
}
Details...
public struct Transform{ float x; float y; }
public struct Rotation{ float x; float y; float z; float w; }
T[] (chunk).GetArray<T>(){
return fittingTightlyPackedManagedArrayForT as T[]; // Pseudocode
}
int (chunk).Capacity{ get; set; } // Just a property of how big each array in the chunk is, all having the same size
I already tested an unsafe variant to reduce the bounds checks; however, according to my benchmark this increased the cache misses and was only slightly faster (not noticeably, not even for large amounts).
What else could I do to increase the iteration speed? Glad for any feedback, techniques and tricks! :)
A plain loop over an array or list is as fast as iteration gets in C#, at least unless you have some special knowledge that is not available to the compiler. The compiler should recognize that you are looping over an array and skip the bounds check, and a linear iteration lets the CPU prefetch data before it is actually needed.
In your example I would not be certain the compiler can remove the bounds checks, since the loop condition is not checked against the array length. So I would at least try changing it to two separate loops over the arrays instead.
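Something like this (a sketch reusing your names; it assumes the struct fields are accessible from the loop):

var transforms = chunk.GetArray<Transform>();
var rotations = chunk.GetArray<Rotation>();

for (var i = 0; i < transforms.Length; i++)
    transforms[i].x++;

for (var i = 0; i < rotations.Length; i++)
    rotations[i].w++;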
I'm not sure why the unsafe version had a lower cache hit rate; the cache is controlled by the CPU, not the compiler, and I would expect an unsafe version to produce code very similar to the safe version, at least with regard to memory access.
In some special cases it might be useful to manually unroll loops, but the compiler should be able to do this automatically, and this question suggests it is of little use. Compiler optimizations can be fickle, though: the compiler might not always apply the optimizations you expect, and which ones it applies can differ between versions, depending on how long the code has been running, whether you apply profile-guided optimization, and so on.
To get any real gains I would look at SIMD techniques; if you can process larger chunks of data at once you might see very significant gains. But the gains depend in large part on how the data is stored and accessed.
In some cases there can be major gains from using a structure-of-arrays (SoA) layout rather than the more common array-of-structures (AoS). In your example, if all the x and w values were stored in separate arrays, you could just process the entire array in 128/256/512-bit SIMD blocks, and that would be fairly hard to beat. It is also very cache-efficient, since you are not loading any unnecessary bytes. But the SoA layout might have performance implications for other parts of the code.
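As a rough illustration of the SoA idea (not your code; xs and ws are made-up arrays holding all the x and w components, assumed to have the same length):

using System.Numerics;

static void IncrementAll(float[] xs, float[] ws)
{
    int width = Vector<float>.Count;          // 4, 8 or 16 floats depending on the hardware
    var one = new Vector<float>(1f);
    int i = 0;
    for (; i <= xs.Length - width; i += width)
    {
        (new Vector<float>(xs, i) + one).CopyTo(xs, i);
        (new Vector<float>(ws, i) + one).CopyTo(ws, i);
    }
    for (; i < xs.Length; i++)                // scalar tail for the leftover elements
    {
        xs[i] += 1f;
        ws[i] += 1f;
    }
}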
I have a string array of about 20,000,000 values, and I need to convert it to a single string.
I've tried:
string data = "";
foreach (var i in tm)
{
data = data + i;
}
But that takes too long. Does someone know a faster way?
Try StringBuilder:
StringBuilder sb = new StringBuilder();
foreach (var i in tm)
{
sb.Append(i);
}
To get the resulting String use ToString():
string result = sb.ToString();
The answer is going to depend on the size of the output string and the amount of memory you have available and usable. The hard limit on string length appears to be 2^31-1 (int.MaxValue) characters, occupying just over 4 GB of memory. Whether you can actually allocate that depends on your framework version, etc. If you're going to produce a larger output, then you can't put it into a single string anyway.
You've already discovered that naive concatenation is going to be tragically slow. The problem is that every pass through the loop creates a new string and then immediately discards it on the next iteration. This fills up memory pretty quickly, forcing the garbage collector to work overtime finding old strings to clear out of memory, not to mention the memory fragmentation and all that stuff modern programmers don't pay much attention to.
A StringBuilder is a reasonable solution. Internally it allocates blocks of characters that it then stitches together at the end using pointers and memory copies. That saves a lot of hassle and is quite speedy.
As for String.Join: it uses a StringBuilder internally. So does String.Concat, although Concat is certainly quicker when you're not inserting separator characters.
For simplicity I would use String.Concat and be done with it.
But then I'm not much for simplicity.
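For reference, that suggestion as a one-liner (tm being the string[] from the question):

string result = string.Concat(tm);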
Here's an untested and possibly horribly slow answer using LINQ. When I get time I'll test it and see how it performs, but for now:
string result = new String(lines.SelectMany(l => (IEnumerable<char>)l).ToArray());
Obviously there is a potential overflow here since the ToArray call can potentially create an array larger than the String constructor can handle. Try it out and see if it's as quick as String.Concat.
You can also do it in LINQ, like so:
string data = tm.Aggregate("", (current, i) => current + i);
Or you can use the string.Join function
string data = string.Join("", tm);
Can't check it right now, but I'm curious how this option would perform:
var data = String.Join(string.Empty, tm);
Is Join optimized to skip the concatenation when the separator is String.Empty?
For data this big, memory-based methods will unfortunately fail, and this will be a real headache for the GC. For this operation, create a file and write every string to it, like this:
using (StreamWriter sw = new StreamWriter("some_file_to_write.txt")){
    for (int i = 0; i < tm.Length; i++)
        sw.Write(tm[i]);
}
Try to avoid using "var" in this performance-demanding approach. Correction: "var" does not affect performance; "dynamic" does.
Should primitive array content be accessed by int for best performance?
Here's an example
int[] arr = new int[]{1,2,3,4,5};
The array is only 5 elements in length, so the index doesn't have to be an int; a short or byte would do, and that would save a useless 3 bytes of memory per index if a byte is used instead of an int. Of course, that's only if I know the array won't grow beyond a size of 255.
byte index = 1;
int value = arr[index];
But does this work as well as it sounds?
I'm worried about how this is executed at a lower level: does the index get cast to int, or are there other operations that would actually slow down the whole process instead of optimizing it?
In C and C++, arr[index] is formally equivalent to *(arr + index). Your concern about casting comes down to the simpler question of what the machine does when it needs to add an integer offset to a pointer.
I think it's safe to say that on most modern machines, adding a "byte" to a pointer is going to use the same instruction as adding a 32-bit integer to a pointer. And indeed it's still going to represent that byte using the machine word size, padded with some unused space. So this isn't going to make using the array faster.
Your optimization might make a difference if you need to store millions of these indices in a table; then using byte instead of int would use four times less memory and take less time to move around. If the array you are indexing is huge, and the index needs to be larger than the machine word size, then that's a different consideration. But I think it's safe to say that in most normal situations this optimization doesn't really make sense, and size_t is probably the most appropriate generic type for array indices, all things being equal (since it corresponds exactly to the machine word size on the majority of architectures).
does the index get cast to int, or are there other operations that would actually slow down the whole process instead of optimizing it
No, but
that would save a useless 3 bytes of memory
You don't gain anything by saving 3 bytes.
Only if you are storing a huge array of those indices might the amount of space you save make it a worthwhile investment.
Otherwise, stick with a plain int: it's the processor's native word size and thus the fastest.
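To make the "huge array of indices" case concrete, here is a made-up illustration (the numbers are only there to show the scale of the saving):

byte[] indices = new byte[20_000_000];   // ~20 MB; the same table as int[] would be ~80 MB
int[] table = new int[256];              // whatever the indices point into

long sum = 0;
foreach (byte b in indices)
    sum += table[b];                     // the byte is zero-extended to a native-size index for free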
So we are told that StringBuilder should be used when you are doing more than a few operations on a string (I've heard as few as three). Therefore we should replace this:
string s = "";
foreach (var item in items) // where items is IEnumerable<string>
s += item;
With this:
string s = new StringBuilder(items).ToString();
I assume that internally StringBuilder holds references to each appended string, combining them on request. Let's compare this to HybridDictionary, which uses a LinkedList for the first 10 elements, then swaps to a Hashtable when the list grows beyond 10. We can see the same kind of pattern here: for a small number of references use a linked list, otherwise make ever-increasing blocks of arrays.
Let's look at how a List works. Start off with a list size (the internal default is 4). Add elements to the internal array; if the array is full, make a new array of double the size of the current array, copy the current array's elements across, then add the new element and make the new array the current array.
Can you see my confusion about the performance benefits? For all elements besides strings, we make new arrays, copy the old values and add the new value. But for strings that's bad, because we know that "a" + "b" makes a new string reference from the two old references, "a" and "b".
I hope my question isn't too confusing. Why does there seem to be a double standard between string concatenation and array concatenation (I know strings are arrays of chars)?
String: Making new references is bad!
T : where T != String: Making new references is good!
Edit: Maybe what I'm really asking here is: when does making new, bigger arrays and copying the old values across start being faster than having references to randomly placed objects all over the heap?
Double edit: By faster I mean reading, writing and finding variables, not inserting or removing (i.e. a LinkedList would kick ass at inserting, for example, but I don't care about that).
Final edit: I don't care about StringBuilder; I'm interested in the trade-off in time taken to copy data from one part of the heap to another for cache alignment, versus just taking the cache misses from the CPU and having references all over the heap. When does one become faster than the other?
Therefore we should replace this:
No, you shouldn't. The first case you showed is string concatenation that can take place at compile time, and you have replaced it with string concatenation that takes place at runtime. The former is much more desirable and will execute faster than the latter.
It's important to use a StringBuilder when the number of strings being concatenated is not known at compile time. Often (but not always) this means concatenating strings in a loop.
Earlier versions of StringBuilder (before 4.0, if memory serves) did internally look more or less like a List<char>, and it's correct that post-4.0 it looks more like a LinkedList<char[]>. However, the key difference between using a StringBuilder and using regular string concatenation in a loop is not the difference between a linked-list style, in which objects contain references to the next object in the "chain", and an array-based style, in which an internal buffer over-allocates space and is reallocated occasionally as needed. It is the difference between a mutable object and an immutable object. The problem with traditional string concatenation is that, since strings are immutable, each concatenation must copy all of the memory from both strings into a new string. When using a StringBuilder, the new string only needs to be copied onto the end of some type of data structure, leaving all of the existing memory as it is. What type of data structure that is isn't terribly important here; we can rely on Microsoft to use a structure/algorithm that has been proven to have good performance characteristics for the most common situations.
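A stripped-down illustration of that difference (method names are mine):

using System.Collections.Generic;
using System.Text;

static string ConcatSlow(IEnumerable<string> items)
{
    string s = "";
    foreach (var item in items)
        s += item;               // copies all of s plus item into a brand-new string every time
    return s;
}

static string ConcatFast(IEnumerable<string> items)
{
    var sb = new StringBuilder();
    foreach (var item in items)
        sb.Append(item);         // copies only item into the builder's existing buffer
    return sb.ToString();        // one final copy into the result string
}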
It seems to me that you are conflating the resizing of a list with the evaluation of a string expression, and assuming that the two should behave the same way.
Consider your example: string s = "a" + "b" + "c" + "d"
Assuming no optimisations of the constant expression (which the compiler would handle automatically), what this will do is evaluate each operation in turn:
string s = (("a" + "b") + "c") + "d"
This results in the strings "ab" and "abc" being created as part of that single expression. This has to happen because strings in .NET are immutable, which means their values cannot be changed once created. This is because, if strings were mutable, you could end up with code like this:
string a = "hello";
string b = a; // would assign b the same reference as a
string b += "world"; // would update the string it references
// now a == "helloworld"
If this were a List, the code would make more sense, and doesn't even need explanation:
var a = new List<int> { 1, 2, 3 };
var b = a;
b.Add(4);
// now a == { 1, 2, 3, 4 }
So the reason that non-string "list" types allocate extra memory early is efficiency: it reduces allocations when the list is extended. The reason that a string does not do that is that a string's value is never updated once created.
Your assumption about the internal operation of the StringBuilder is irrelevant; the purpose of a StringBuilder is essentially to provide a mutable object that reduces the overhead of multiple string operations.
The backing store of a StringBuilder is a char[] that gets resized as needed. Nothing is turned into a string until you invoke StringBuilder.ToString() on it.
The backing store of List<T> is a T[] that gets resized as needed.
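You can watch both buffers over-allocate ahead of what is actually stored with a small demo like this (the exact growth factors are implementation details and may differ between framework versions):

var list = new List<int>();
var sb = new StringBuilder();
for (int i = 0; i < 40; i++)
{
    list.Add(i);
    sb.Append('x');
    Console.WriteLine($"List {list.Count}/{list.Capacity}   StringBuilder {sb.Length}/{sb.Capacity}");
}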
The problem with something like
string s = a + b + c + d ;
is that the compiler parses it as
+
/ \
a +
/ \
b +
/ \
c d
and, unless it can see opportunities for optimization, does something like
string t1 = c + d ;
string t2 = b + t1 ;
string s = a + t2 ;
thus creating two temporaries and the final string. With a StringBuilder, though, it's going to build out the character array it needs and at the end create one string.
This is a win because strings, once created, are immutable (can't be changed), and string literals are generally interned in the string pool (meaning that there is only ever one instance of a given literal: no matter how many times "abc" appears, every occurrence will be a reference to the same object in the string pool).
Interning adds cost to string creation as well: having determined the candidate string, the runtime has to check the string pool to see whether it already exists. If it does, that reference is used; if it does not, the candidate string is added to the string pool.
Your example, though:
string s = "a" + "b" + "c" + "d" ;
is a non-sequitur: the compiler sees the constant expression and does an optimization called constant folding, so it becomes (even in debug mode):
string s = "abcd" ;
Similar optimizations happen with arithmetic expressions:
int x = 12 / 3 ;
is going to be optimized away to
int x = 4 ;
I'm writing an app that will create thousands of small objects and store them recursively in arrays. By "recursively" I mean that each instance of K has an array of K instances, each of which has an array of K instances, and so on; this array plus one int field are the only properties (plus some methods). I found that memory usage grows very fast even for a small amount of data (about 1 MB), and when the data I'm processing is about 10 MB I get an OutOfMemoryException, not to mention when it's bigger (I have 4 GB of RAM) :). So what do you suggest I do? I figured that if I created a separate class V to process those objects, so that instances of K would have only the array of K's plus one integer field, and made K a struct instead of a class, it should optimize things a bit - no garbage collection and stuff... But it's a bit of a challenge, so I'd rather ask you whether it's a good idea before I start a total rewrite :).
EDIT:
Ok, some abstract code
public void Add(string word) {
    int i;
    string shorterWord;
    if (word.Length > 0) {
        i = //something, it's really irrelevant
        if (t[i] == null) {
            t[i] = new MyClass();
        }
        shorterWord = word.Substring(1);
        //end of word
        if (shorterWord.Length == 0) {
            t[i].WordEnd = END;
        }
        //saving the word letter by letter
        t[i].Add(shorterWord);
    }
}
When researching this more deeply I worked from the following assumptions (they may be inexact; I'm getting old for a programmer). A class has extra memory consumption because a reference is required to address it: storing the reference needs an Int32-sized pointer on a 32-bit build, and the object is always allocated on the heap (I can't remember whether C++ has other possibilities; I would venture yes).
The short answer, found in this article, is that an object has a 12-byte basic footprint plus 4 possibly-unused bytes, depending on your class (which no doubt has something to do with padding):
http://www.codeproject.com/Articles/231120/Reducing-memory-footprint-and-object-instance-size
Another issue you'll run into is that arrays also have overhead. One possibility would be to manage your own offsets into a larger array (or arrays), which in turn gets closer to something a more memory-efficient language would be better suited for.
I'm not sure whether there are libraries that provide storage for small objects in an efficient manner. There probably are.
My take on it: use structs, manage your own offsets in a large array, and use proper packing instructions if it serves you (although I suspect this comes with a runtime cost of a few extra instructions each time you address unevenly packed data):
[StructLayout(LayoutKind.Sequential, Pack = 1)]
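For the "structs plus your own offsets" idea, a very rough sketch (the node layout is my guess at the structure in the question, not the actual code):

public struct Node
{
    public int WordEnd;        // the single int field from the question
    public int FirstChild;     // index of the first child in the nodes array, -1 if none
    public int NextSibling;    // index of the next sibling, -1 if none
    public char Letter;
}

public class FlatTrie
{
    private Node[] nodes = new Node[1024];
    private int count = 1;     // slot 0 is the root

    private int NewNode(char letter)
    {
        if (count == nodes.Length)
            Array.Resize(ref nodes, nodes.Length * 2);   // amortised doubling, like List<T>
        nodes[count] = new Node { Letter = letter, FirstChild = -1, NextSibling = -1 };
        return count++;
    }
}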
Your stack is blowing up.
Do it iteratively instead of recursively.
You're not blowing the system stack up, you're blowing the call stack up; 10K function calls will blow it out of the water.
You need proper tail recursion, which is just an iterative hack.
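For example, the recursive Add from the question can be flattened into a loop over the same objects (a sketch; the index computation stays the placeholder it was in the question):

public void Add(string word)
{
    MyClass node = this;
    while (word.Length > 0)
    {
        int i = 0; // the same "something, it's really irrelevant" computation as before
        if (node.t[i] == null)
            node.t[i] = new MyClass();
        string shorterWord = word.Substring(1);
        if (shorterWord.Length == 0)
            node.t[i].WordEnd = END;    // end of word
        node = node.t[i];
        word = shorterWord;
    }
}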
Make sure you have enough memory in your system - over 100 MB, etc.; it really depends on your system. Linked lists of recursive objects are what you are looking at. If you keep recursing, it is going to hit the memory limit and an OutOfMemoryException will be thrown. Make sure you keep track of the memory usage in any program; nothing is unlimited, especially memory. If memory is limited, save the data to disk.
It looks like there is infinite recursion in your code, and that is why out of memory is thrown. Check the code: there should be a start and an end in recursive code, otherwise it will go past 10 terabytes of memory at some point.
You can use a better data structure; i.e. each letter can be a byte (a-0, b-1, ...), and each word fragment can be indexed as well, especially substrings - you should get away with significantly less memory (though with a performance penalty).
Just list your recursive algorithm and sanitize the variable names. If you are doing a BFS-type traversal and keep all objects in memory, you will run out of memory. In that case, replace it with DFS.
Edit 1:
You can speed up the algorithm by estimating how many items you will generate and then allocating that much memory at once. As the algorithm progresses, fill up the allocated memory. This reduces fragmentation, reallocation and copy-on-full-array operations.
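For example (estimatedWordCount stands in for whatever estimate you can compute; it is not from the question):

int estimatedWordCount = 1_000_000;                  // your own estimate of the final size
var words = new List<string>(estimatedWordCount);    // one up-front allocation, no doubling and copying while filling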
Nonetheless, after you are done operating on these generated words, you should delete them from your data structure so they can be GC'd and you don't run out of memory.