Since RAM seems to be the new disk, meaning that access to memory is now considered slow in much the same way that disk access has always been, I want to maximize locality of reference in memory for high-performance applications. For example, in a sorted index, I want adjacent values to be close (unlike, say, in a hashtable), and I want the data the index is pointing to close by, too.
In C, I can whip up a data structure with a specialized memory manager, like the developers of the (immensely complex) Judy array did. With direct control over the pointers, they even went so far as to encode additional information in the pointer value itself. When working in Python, Java or C#, I am deliberately one (or more) level(s) of abstraction away from this type of solution and I'm entrusting the JIT compilers and optimizing runtimes with doing clever tricks on the low levels for me.
Still, I guess, even at this high level of abstraction, there are things that can be semantically considered "closer" and therefore are likely to be actually closer at the low levels. For example, I was wondering about the following (my guess in parentheses):
Can I expect an array to be an adjacent block of memory (yes)?
Are two integers in the same instance closer than two in different instances of the same class (probably)?
Does an object occupy a contiguous region in memory (no)?
What's the difference between an array of objects with only two int fields and a single object with two int[] fields? (this example is probably Java specific)
I started wondering about these in a Java context, but my wondering has become more general, so I'd suggest to not treat this as a Java question.
In .NET, elements of an array are certainly contiguous. In Java I'd expect them to be in most implementations, but it appears not to be guaranteed.
I think it's reasonable to assume that the memory used by an instance for fields is in a single block... but don't forget that some of those fields may be references to other objects.
For the Java array part, Sun's JNI documentation includes this comment, tucked away in a discussion about strings:
For example, the Java virtual machine may not store arrays contiguously.
For your last question, if you have two int[] then each of those arrays will be a contiguous block of memory, but they could be very "far apart" in memory. If you have an array of objects with two int fields, then each object could be a long way from each other, but the two integers within each object will be close together. Potentially more importantly, you'll end up taking a lot more memory with the "lots of objects" solution due to the per-object overhead. In .NET you could use a custom struct with two integers instead, and have an array of those - that would keep all the data in one big block.
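To make that .NET option concrete, here is a minimal C# sketch (the type names are mine; that struct arrays store their elements inline is an implementation detail, but a reliable one in practice):

struct PairStruct { public int A, B; }   // value type: fields stored inline in arrays
class PairClass { public int A, B; }     // reference type: arrays store references

class LocalityDemo
{
    static void Main()
    {
        // 1000 references; each PairClass instance is a separate heap object
        // with its own header, living wherever the allocator/GC put it.
        PairClass[] scattered = new PairClass[1000];
        for (int i = 0; i < scattered.Length; i++) scattered[i] = new PairClass();

        // One contiguous block holding 1000 * 2 ints inline: good locality
        // and no per-object overhead.
        PairStruct[] contiguous = new PairStruct[1000];
    }
}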
I believe that in both Java and .NET, if you allocate a lot of smallish objects in quick succession within a single thread then those objects are likely to have good locality of reference. When the GC compacts a heap, this may improve - or it may potentially become worse, if a heap with
A B C D E
is compacted to
A D E B
(where C is collected) - suddenly A and B, which may have been "close" before, are far apart. I don't know whether this actually happens in any garbage collector (there are loads around!) but it's possible.
Basically in a managed environment you don't usually have as much control over locality of reference as you do in an unmanaged environment - you have to trust that the managed environment is sufficiently good at managing it, and that you'll have saved enough time by coding to a higher level platform to let you spend time optimising elsewhere.
First, your title implies C#. "Managed code" is a term coined by Microsoft, if I'm not mistaken.
Java primitive arrays are guaranteed to be a contiguous block of memory. If you have an
int[] array = new int[4];
you can from JNI (native C) get an int *p that points to the actual array. I think this goes for the Array* class of containers as well (ArrayList, ArrayBlockingQueue, etc.).
Early implementations of the JVM had objects as contiguous structs, I think, but this cannot be assumed with newer JVMs. (JNI abstracts this away.)
Two integers in the same object will, as you say, probably be "closer", but they may not be. This will probably vary even when using the same JVM.
An object with two int fields is an object, and I don't think any JVM makes any guarantee that the members will be "close". An int array with two elements will very likely be backed by 8 bytes of contiguous storage.
With regards to arrays here is an excerpt from CLI (Common Language Infrastructure) specification:
Array elements shall be laid out within the array object in row-major order (i.e., the elements associated with the rightmost array dimension shall be laid out contiguously from lowest to highest index). The actual storage allocated for each array element can include platform-specific padding. (The size of this storage, in bytes, is returned by the sizeof instruction when it is applied to the type of that array's elements.)
Good question! I think I would resort to writing extensions in C++ that handle memory in a more carefully managed way, exposing just enough of an interface to let the rest of the application manipulate the objects. If I were that concerned about performance I would probably resort to a C++ extension anyway.
I don't think anyone has talked about Python, so I'll have a go.
Can I expect an array to be an adjacent block of memory (yes)?
In Python, arrays are more like arrays of pointers in C. So the pointers will be adjacent, but the actual objects are unlikely to be.
Are two integers in the same instance closer than two in different instances of the same class (probably)?
Probably not, for the same reason as above. The instance will only hold pointers to the objects which are the actual integers. Python doesn't have native ints (like Java does), only boxed Ints (in Java-speak).
Does an object occupy a contiguous region in memory (no)?
Probably not. However if you use the __slots__ optimisation then some parts of it will be contiguous!
What's the difference between an array of objects with only two int fields and a single object with two int[] fields?
(this example is probably Java specific)
In Python, in terms of memory locality, they are both pretty much the same! One will make an array of pointers to objects which in turn contain two pointers to ints; the other will make two arrays of pointers to integers.
If you need to optimise to that level then I suspect a VM based language is not for you ;)
Let's start small: say I need to store a const value of 200. Should I always be using an unsigned byte for this?
This is just a minimal example, I guess. But what about structs? Is it wise to build up my structs so that their size is divisible by 32 on a 32-bit system? Say I need to iterate over a very large array of structs: does it matter much whether a struct consists of 34 bits or 64? I would think it gains a lot if I could squeeze 2 bits off the 34-bit struct.
Or does all this just create unnecessary overhead, and am I better off replacing all the bits and shorts inside the struct with ints, so the CPU does not have to "go looking" for the right memory block?
This is very much a processor implementation detail; the CLR and the jitter already do a lot of work to ensure that your data types are laid out to get the best perf out of the program. There is, for example, never a case where a struct occupies 34 bits: the CLR's design choices already ensure that you get a running start on using types that work well on modern processors.
Structs are laid out to be optimal, and that involves alignment choices that depend on the data type. An int, for example, will always be aligned to an offset that's a multiple of 4. That gives the processor an easy time reading the int: it doesn't have to multiplex the misaligned bytes back into an int, and it avoids a scenario where the value straddles a CPU cache line and needs to be glued back together from multiple memory bus reads. Some processors even treat misaligned reads and writes as fatal errors, one of the reasons you don't have an Itanium in your machine.
So if you have a struct that has a byte and an int then you'll end up with a data type that takes 8 bytes which doesn't use 3 of the bytes, the ones between the byte and the int. These unused bytes are called padding. There can also be padding at the end of a struct to ensure that alignment is still optimal when you put them in an array.
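You can observe this padding yourself; a minimal sketch (with the caveat, noted elsewhere in this thread, that Marshal.SizeOf reports the marshalled layout, which for a simple sequential struct like this matches the description above):

using System;
using System.Runtime.InteropServices;

struct ByteThenInt
{
    public byte B;   // offset 0
    public int I;    // aligned to offset 4, so bytes 1-3 are padding
}

class PaddingDemo
{
    static void Main()
    {
        // Prints 8: 1 byte + 3 padding bytes + a 4-byte int.
        Console.WriteLine(Marshal.SizeOf(typeof(ByteThenInt)));
    }
}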
Declaring a single variable as a byte is okay; Intel/AMD processors take the same amount of time to read/write one as a 32-bit int. But using short is not okay: that requires an extra byte in the CPU instruction (a size-override prefix) and can cost an extra CPU cycle. In practice you don't often save any memory anyway, because of the alignment rule. Using byte only buys you something if it can be combined with another byte. An array of bytes is fine; a struct with multiple byte members is fine. Your example is not; it works just as well when you declare it int.
Using types smaller than an int can be awkward in C# code, the MSIL code model is int-based. Basic operators like + and - are only defined for int and larger, there is no operator for smaller types. So you end up having to use a cast to truncate the result back to a smaller size. The sweet spot is int.
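A small illustration of that int-based model (the values are arbitrary):

byte a = 200, b = 100;
// byte c = a + b;        // does not compile: '+' on two bytes yields an int
byte c = (byte)(a + b);   // explicit truncating cast required; 300 wraps to 44
int sum = a + b;          // the natural result type is int: 300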
Wow, it really depends on a bunch of stuff. Are you concerned about performance or memory? If it's performance you are generally better off staying with the "natural" word size alignment. So for example if you are using a 64-bit processor using 64-bit ints, aligned on 64-bit boundaries provides the best performance. I don't think C# makes any guarantees about this type of thing (it's meant to remain abstract from the hardware).
That said, there is an informal rule that says "avoid the sin of premature optimization". This is particularly true in C#. If you aren't having a performance or memory issue, don't worry about it.
If you find you are having a performance problem, use a profiler to determine where the problem actually is (it might not be where you think). If it's a memory problem determine the objects consuming the most memory and determine where you can optimize (as per your example using a byte rather than an int or short, if possible).
If you really have to worry about such details you might want to consider using C++, where you can better control memory usage (for example you can allocate large blocks of memory without it being initialized), access bitfields, etc.
I'm converting a C# project to C++ and have a question about deleting objects after use. In C# the GC of course takes care of deleting objects, but in C++ it has to be done explicitly using the delete keyword.
My question is: is it OK to just follow each object's usage throughout a method and then delete it as soon as it goes out of scope (i.e. method end / re-assignment)?
I know though that the GC waits for a certain size of garbage (~1MB) before deleting; does it do this because there is an overhead when using delete?
As this is a game I am creating, there will potentially be lots of objects being created and deleted every second, so would it be better to keep track of pointers that go out of scope, and once that size reaches 1MB, delete the pointers?
(as a side note: later when the game is optimised, objects will be loaded once at startup so there is not much to delete during gameplay)
Your problem is that you are using pointers in C++.
This is a fundamental problem that you must fix, then all your problems go away. As chance would have it, I got so fed up with this general trend that I created a set of presentation slides on this issue (CC BY, so feel free to use them).
Have a look at the slides. While they are certainly not entirely serious, the fundamental message is still true: Don’t use pointers. But more accurately, the message should read: Don’t use delete.
In your particular situation you might find yourself with a lot of long-lived small objects. This is indeed a situation which a modern GC handles quite well, and which reference-counting smart pointers (shared_ptr) handle less efficiently. If (and only if!) this becomes a performance problem, consider switching to a small object allocator library.
You should be using RAII as much as possible in C++, so you do not have to explicitly delete anything, anytime.
Once you use RAII through smart pointers and your own resource-managing classes, every dynamic allocation you make will live only as long as there are possible references to it; you do not have to manage any resources explicitly.
Memory management in C# and C++ is completely different. You shouldn't try to mimic the behavior of .NET's GC in C++. In .NET allocating memory is super fast (basically moving a pointer) whereas freeing it is the heavy task. In C++ allocating memory isn't that lightweight for several reasons, mainly because a large enough chunk of memory has to be found. When memory chunks of different sizes are allocated and freed many times during the execution of the program the heap can get fragmented, containing many small "holes" of free memory. In .NET this won't happen because the GC will compact the heap. Freeing memory in C++ is quite fast, though.
Best practices in .NET don't necessarily work in C++. For example, pooling and reusing objects in .NET isn't recommended most of the time, because the objects get promoted to higher generations by the GC. The GC works best for short lived objects. On the other hand, pooling objects in C++ can be very useful to avoid heap fragmentation. Also, allocating a larger chunk of memory and using placement new can work great for many smaller objects that need to be allocated and freed frequently, as it can occur in games. Read up on general memory management techniques in C++ such as RAII or placement new.
Also, I'd recommend getting the books "Effective C++" and "More effective C++".
Well, the simplest solution might be to just use garbage collection in C++. The Boehm collector works well, for example. Still, there are pros and cons (but porting code originally written in C# would be a likely candidate for a case where the pros largely outweigh the cons).
Otherwise, if you convert the code to idiomatic C++, there shouldn't be that many dynamically allocated objects to worry about. Unlike C#, C++ has value semantics by default, and most of your short-lived objects should simply be local variables, possibly copied if they are returned, but not allocated dynamically. In C++, dynamic allocation is normally only used for entity objects whose lifetime depends on external events; e.g. a Monster is created at some random time, with a probability depending on the game state, and is deleted at some later time, in reaction to events which change the game state. In this case, you delete the object when the monster ceases to be part of the game. In C#, you probably have a dispose function, or something similar, for such objects, since they typically have concrete actions which must be carried out when they cease to exist: things like deregistering as an Observer, if that's one of the patterns you're using. In C++, this sort of thing is typically handled by the destructor, and instead of calling dispose, you delete the object.
Substituting a shared_ptr in every instance that you use a reference in C# would get you the closest approximation at probably the lowest effort input when converting the code.
However, you specifically mention following an object's use through a method and deleting it at the end. A better approach is not to new up the object at all but simply to instantiate it inline/on the stack. In fact, with the new copy/move semantics being introduced, this even becomes an efficient way to deal with returned objects, so there is no need to use pointers in almost any scenario.
There are a lot more things to take into considerations when deallocating objects than just calling delete whenever it goes out of scope. You have to make sure that you only call delete once and only call it once all pointers to that object have gone out of scope. The garbage collector in .NET handles all of that for you.
The construct that most closely corresponds to that in C++ is tr1::shared_ptr<>, which keeps a reference count on the object and deallocates it when the count drops to zero. A first approach to get things running would be to turn all C# references into C++ tr1::shared_ptr<>. Then you can go into those places where it is a performance bottleneck (only after you've verified with a profiler that it is an actual bottleneck) and change to more efficient memory handling.
The GC feature of C++ has been discussed a lot on SO. Try reading through this:
Garbage Collection in C++
In Java, a byte or short is stored in the JVM's 'natural' word length, i.e. for the most part, 32 bits. An exception would be an array of bytes, where each byte occupies a byte of memory.
Does the CLR do the same thing?
If it does do this, in what situations are there exceptions to this? E.g. How much memory does this occupy?
struct MyStruct
{
    short s1;
    short s2;
}
Although it's not really intended for this purpose, and may at times give slightly different answers (because it's thinking about things from a marshalling point of view, not a CLR-internal-structures point of view), Marshal.SizeOf can give an answer:
System.Runtime.InteropServices.Marshal.SizeOf(typeof(MyStruct))
In this case, it answers 4. (i.e. the shorts are being stored as shorts). Please note that this is an implementation detail, so the answer today should not be relied upon for any purpose.
It is actually the job of the JIT compiler to assign the memory layout of classes and structures. The actual layout is not discoverable in any way (beyond looking at the generated machine code); a [StructLayout] attribute is required to marshal the object to a known layout. The JIT takes advantage of this by re-ordering fields to keep them aligned and minimize the allocation size.
There will be no surprises in the struct you quoted, the fields are already aligned on any current CPU architecture that can execute managed code. The size of value types is guaranteed by the CLI, a short always takes 16 bits. Your structure will take 32 bits.
The CLR does to some extent pack members of the same size. It does pack arrays, and I would expect your example structure to take up four bytes on any platform.
Exactly which types are packed and how depends on the CLR implementation and the current platform. The rules are not strictly defined, so that the CLR has some freedom to rearrange the members to store them in the most efficient manner.
I've read in a few places now that the maximum instance size for a struct should be 16 bytes.
But I cannot see where that number (16) comes from.
Browsing around the net, I've found some who suggest it's an approximate number for good performance, but Microsoft talks as though it is a hard upper limit (e.g. MSDN).
Does anyone have a definitive answer about why it is 16 bytes?
It is just a performance rule of thumb.
The point is that value types are passed by value, so the entire struct has to be copied when it is passed to a function, whereas for a reference type only the reference (4 bytes) has to be copied. A struct might still save a bit of time because you remove a layer of indirection, so even if it is larger than those 4 bytes it can be more efficient than passing a reference around. But at some point the cost of copying becomes noticeable, and a common rule of thumb is that this typically happens around 16 bytes. 16 is presumably chosen because it's a nice round power of two, and the alternatives are either 8 (too small, which would make structs almost useless) or 32 (at which point the cost of copying the struct is already problematic if you're using structs for performance reasons).
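A sketch of the trade-off being described (the types are made up, and the 4-byte reference assumes a 32-bit process):

struct RectStruct { public int X, Y, W, H; }   // 16 bytes of fields: copied wholesale
class RectClass { public int X, Y, W, H; }     // only the reference is copied

static class CopyCost
{
    // Every call copies all 16 bytes of the struct onto the stack.
    public static int Area(RectStruct r) { return r.W * r.H; }

    // Every call copies just the reference; the fields are read indirectly.
    public static int Area(RectClass r) { return r.W * r.H; }
}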
But ultimately, this is performance advice. It answers the question of "which would be most efficient to use? A struct or a class?". But it doesn't answer the question of "which best maps to my problem domain".
Structs and classes behave differently. If you need a struct's behavior, then I would say to make it a struct, no matter the size. At least until you run into performance problems, profile your code, and find that your struct is a problem.
your link even says that it is just a matter of performance:
If one or more of these conditions are
not met, create a reference type
instead of a structure. Failure to
adhere to this guideline can
negatively impact performance.
If a structure is not larger than 16 bytes, it can be copied with a few simple processor instructions. If it's larger, a loop is used to copy the structure.
As long as the structure is not larger than 16 bytes, the processor has to do about the same work when copying the structure as when copying a reference. If the structure is larger, you lose the performance benefit of having a structure, and you should generally make it a class instead.
The size figure comes largely from the amount of time it takes to copy the struct on the stack, for example to pass to a method. Anything much larger than this and you are consuming a lot of stack space and CPU cycles just copying data - when a reference to an immutable class (even with dereferencing) could be a lot more efficient.
As other answers have noted, the per-byte cost of copying a structure which is larger than a certain threshold (which was 16 bytes in earlier versions of .NET, but has since grown to 20-24) is significantly greater than the per-byte cost of a smaller structure. It's important to note, however, that copying a structure of any particular size once will be a fraction of the cost of creating a new class object instance of that same size. If a struct would be copied many times in its lifetime, and the value-type semantics are not particularly required, a class object may be preferable. If, however, a struct would end up being copied only once or twice, such copying would likely be cheaper than the creation of a new class object. The break-even number of copies where a class object becomes cheaper varies with the size of the struct/object in question, but is much higher for things below the "cheap copying" threshold than for things above it.
BTW, another point worth mentioning is that the cost of passing a struct as a ref parameter is independent of the size of the struct. In many cases, optimal performance may be achieved by using value types and passing them by ref. One must be careful to avoid using properties or readonly fields of structure types, however, since accessing either of those will create an implicit temporary copy of the struct in question.
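For example, something along these lines (the type is made up):

struct Big { public long A, B, C, D; }   // 32 bytes: well above the cheap-copy threshold

static class RefDemo
{
    // Passing by ref copies a single address, whatever the struct's size,
    // and mutations act on the caller's storage directly.
    public static void Accumulate(ref Big b, long delta)
    {
        b.A += delta;
    }
}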
Here is a scenario where structs can exhibit superior performance:
When you need to create 1000s of instances. In this case if you were to use a class, you would first need to allocate the array to hold the 1000s of instances and then in a loop allocate each instance. But instead if you were to use structs, then the 1000s of instances become available immediately after you allocate the array that is going to hold them.
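Sketched in C# (Point3D is a made-up type):

struct Point3D { public double X, Y, Z; }
class Point3DClass { public double X, Y, Z; }

class AllocationDemo
{
    static void Main()
    {
        // One allocation: a million zero-initialized points, usable immediately.
        Point3D[] points = new Point3D[1000000];

        // With a class, the array holds a million null references; each
        // element still needs its own allocation in a loop.
        Point3DClass[] refs = new Point3DClass[1000000];
        for (int i = 0; i < refs.Length; i++)
            refs[i] = new Point3DClass();
    }
}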
In addition, structs are extremely useful when you need to do interop or want to dip into unsafe code for performance reasons.
As always there is a trade-off and one needs to analyze what they are doing to determine the best way to implement something.
PS: This scenario came into play when I was working with LIDAR data, where there could be millions of points representing x, y, z and other attributes of ground data. This data needed to be loaded into memory for some intensive computation to output all kinds of stuff.
I think the 16 bytes is just a rule of thumb from a performance point of view. An object in .NET uses at least 24 bytes of memory (IIRC), so if you made your structure much larger than that, a reference type would be preferable.
I can't think of any reason why they chose 16 bytes specifically.
I'm currently working on a ray-tracer in C# as a hobby project. I'm trying to achieve a decent rendering speed by implementing some tricks from a c++ implementation and have run into a spot of trouble.
The objects in the scenes which the ray-tracer renders are stored in a KdTree structure and the tree's nodes are, in turn, stored in an array. The optimization I'm having problems with is while trying to fit as many tree nodes as possible into a cache line. One means of doing this is for nodes to contain a pointer to the left child node only. It is then implicit that the right child follows directly after the left one in the array.
The nodes are structs, and during tree construction they are successfully put into the array by a static memory manager class. When I begin to traverse the tree it seems, at first, to work just fine. Then at a point early in the rendering (about the same place each time), the left child pointer of the root node is suddenly a null pointer. I have come to the conclusion that the garbage collector has moved the structs, as the array lies on the heap.
I've tried several things to pin the addresses in memory, but none of them seems to last for the entire application lifetime as I need. The 'fixed' keyword only seems to help during single method calls, and declaring 'fixed' arrays can only be done on simple types, which a node isn't. Is there a good way to do this, or am I just too far down the path of stuff C# wasn't meant for?
Btw, changing to C++, while perhaps the better choice for a high-performance program, is not an option.
Firstly, if you're using C# normally, you can't suddenly get a null reference due to the garbage collector moving stuff, because the garbage collector also updates all references, so you don't need to worry about it moving stuff around.
You can pin things in memory but this may cause more problems than it solves. For one thing, it prevents the garbage collector from compacting memory properly, and may impact performance in that way.
One thing I would say from your post is that using structs may not help performance as you hope. C# fails to inline any method calls involving structs, and even though they've fixed this in their latest runtime beta, structs frequently don't perform that well.
Personally, I would say C++ tricks like this don't generally tend to carry over too well into C#. You may have to learn to let go a bit; there can be other more subtle ways to improve performance ;)
What is your static memory manager actually doing? Unless it is doing something unsafe (P/Invoke, unsafe code), the behaviour you are seeing is a bug in your program, and not due to the behaviour of the CLR.
Secondly, what do you mean by 'pointer', with respect to links between structures? Do you literally mean an unsafe KdTree* pointer? Don't do that. Instead, use an index into the array. Since I expect that all nodes for a single tree are stored in the same array, you won't need a separate reference to the array. Just a single index will do.
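For instance, something like this (the field names are my invention):

// No pointers: the left child is identified by its index into the shared
// node array, and the right child is implicitly at LeftChildIndex + 1,
// which preserves the adjacent-pair layout described in the question.
struct KdNode
{
    public float SplitPosition;
    public int SplitAxis;
    public int LeftChildIndex;   // -1 marks a leaf
}

// During traversal:
//   KdNode left  = nodes[node.LeftChildIndex];
//   KdNode right = nodes[node.LeftChildIndex + 1];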
Finally, if you really really must use KdTree* pointers, then your static memory manager should allocate a large block using e.g. Marshal.AllocHGlobal or another unmanaged memory source; it should treat this large block as a KdTree array (i.e. index a KdTree* C-style) and suballocate nodes from it by bumping a "free" pointer.
If you ever have to resize this array, then you'll need to update all the pointers, of course.
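A rough sketch of that unmanaged approach, as an outline under my own naming rather than a drop-in manager (compile with /unsafe):

using System;
using System.Runtime.InteropServices;

unsafe struct KdTree
{
    public float SplitPosition;
    public int SplitAxis;
    public KdTree* Left;   // right child is Left + 1 by convention
}

unsafe class NodeBlock
{
    private readonly KdTree* _nodes;
    private readonly int _capacity;
    private int _next;

    public NodeBlock(int capacity)
    {
        _capacity = capacity;
        // Unmanaged memory: the GC neither moves nor collects it, so raw
        // pointers into the block stay valid until it is freed.
        _nodes = (KdTree*)Marshal.AllocHGlobal(capacity * sizeof(KdTree));
    }

    // Suballocate children in pairs by bumping a "free" index.
    public KdTree* AllocatePair()
    {
        if (_next + 2 > _capacity)
            throw new InvalidOperationException("block exhausted");
        KdTree* pair = _nodes + _next;
        _next += 2;
        return pair;
    }

    public void Release() { Marshal.FreeHGlobal((IntPtr)_nodes); }
}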
The basic lesson here is that unsafe pointers and managed memory do not mix outside of 'fixed' blocks, which of course have stack frame affinity (i.e. when the function returns, the pinned behaviour goes away). There is a way to pin arbitrary objects, like your array, using GCHandle.Alloc(yourArray, GCHandleType.Pinned), but you almost certainly don't want to go down that route.
You will get more sensible answers if you describe in more detail what you are doing.
If you really want to do this, you can use the GCHandle.Alloc method to specify that a pointer should be pinned without being automatically released at the end of the scope like the fixed statement.
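In minimal form, assuming nodes is the array from the question (and a using for System.Runtime.InteropServices):

// Pins 'nodes' for the lifetime of the handle rather than a 'fixed' scope.
GCHandle handle = GCHandle.Alloc(nodes, GCHandleType.Pinned);
try
{
    IntPtr baseAddress = handle.AddrOfPinnedObject();
    // ... pointers derived from baseAddress stay valid while pinned ...
}
finally
{
    // Unpin as soon as possible; a long-lived pin hinders heap compaction.
    handle.Free();
}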
But, as other people have been saying, doing this is putting undue pressure on the garbage collector. What about just creating a struct that holds onto a pair of your nodes and then managing an array of NodePairs rather than an array of nodes?
If you really do want to have completely unmanaged access to a chunk of memory, you would probably be better off allocating the memory directly from the unmanaged heap rather than permanently pinning a part of the managed heap (this prevents the heap from being able to properly compact itself). One quick and simple way to do this would be to use Marshal.AllocHGlobal method.
Is it really prohibitive to store the pair of array reference and index?
What is your static memory manager actually doing? Unless it is doing something unsafe (P/Invoke, unsafe code), the behaviour you are seeing is a bug in your program, and not due to the behaviour of the CLR.
I was in fact speaking about unsafe pointers. What I wanted was something like Marshal.AllocHGlobal, though with a lifetime exceeding a single method call. On reflection, it seems that just using an index is the right solution, as I might have gotten too caught up in mimicking the C++ code.
One thing I would say from your post is that using structs may not help performance as you hope. C# fails to inline any method calls involving structs, and even though they've fixed this in their latest run-time beta, structs frequently don't perform that well.
I looked into this a bit and see it has been fixed in .NET 3.5 SP1; I assume that's what you were referring to as the run-time beta. In fact, I now understand that this change accounted for a doubling of my rendering speed. Structs are now aggressively inlined, improving their performance greatly on x86 systems (x64 already had better struct performance).