.NET - Reducing GC contention on array creation - c#

I have some code that deals with a lot of copying of arrays. Basically my class is a collection that uses arrays as backing fields, and since I don't want to run the risk of anyone modifying an existing collection, most operations involve creating copies of the collection before modifying it, hence also copying the backing arrays.
I have noticed that the copying can be slow sometimes. It is within acceptable limits for now, but I am worried that it might become a problem when the application is scaled up and starts using more data.
Some performance analysis suggests that while it barely consumes any CPU, my array copy code spends a lot of time blocked. There are few contentions, but a lot of time blocked. Since the test application is single-threaded, I assume there is some GC contention magic going on. I'm not confident enough in how the GC works in these scenarios, so I'm asking here.
My question: is there a way to create new arrays that reduces the strain on the GC? Or is there some other way I can speed this up? (The code below is simplified for testing and readability purposes.)
private KeyValuePair<T, double>[] _items;
private int _numItems;

public MyCollection(MyCollection<T> copyFrom)
{
    // This line is reported to have a lot of contention time.
    _items = new KeyValuePair<T, double>[copyFrom._items.Length];
    Array.Copy(copyFrom._items, _items, copyFrom._items.Length);
    _numItems = copyFrom._numItems;
}

Not so sure what's going on here, but contention is a threading problem, not an array copying problem. And yes, a concurrency analyzer is liable to point at a new statement since memory allocation requires acquiring a lock that protects the heap.
That lock is held for a very short time when allocations come from the gen #0 heap, so threads fighting over the lock and losing a great deal of time being locked out is a very unlikely mishap. It is not so fast when the allocation comes from the Large Object Heap, which happens when the allocation is 85,000 bytes or more. But then a thread would of course be pretty busy with copying the array elements as well.
Do watch out for what the tool tells you: a very large number of total contentions does not automatically mean you have a problem. It only gets ugly when threads end up being blocked for a substantial amount of time. If that is a real problem, then you next need to look at how much time is spent on garbage collection. There is a basic perf counter for that; you can see it in Perfmon.exe: category ".NET CLR Memory", counter "% Time in GC", instance = yourapp. It is liable to be high, considering the amount of copying you do. A knob you can tweak if that is the real problem is to enable server GC, as sketched below.
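For reference, server GC is enabled with a config entry; a minimal app.config sketch for a desktop .NET Framework application:

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>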

There's a concept called a persistent immutable data structure. This is one of the possible solutions; it basically lets you create immutable objects, while still "modifying" them, in a memory-efficient way.
For example,
Roslyn has a SyntaxTree object that is immutable. You can "modify" the immutable object and get a modified immutable object back. Note that the "modified immutable object" possibly requires no new allocations, because it can build on the first immutable object.
The same concept is also used in the Visual Studio text editor itself. The TextBuffer is an immutable object, but each time you press a key, a new immutable TextBuffer is created. However, it shares almost all of its state with the previous buffer rather than copying it, as copying everything would be slow.
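To make the idea concrete, here is a minimal sketch of a persistent immutable structure, an immutable stack (a hypothetical illustration, not Roslyn's actual implementation). "Modifying" it allocates at most one node; everything else is shared with the old version:

public sealed class ImmutableStack<T>
{
    public static readonly ImmutableStack<T> Empty =
        new ImmutableStack<T>(default(T), null);

    private readonly T _head;
    private readonly ImmutableStack<T> _tail;

    private ImmutableStack(T head, ImmutableStack<T> tail)
    {
        _head = head;
        _tail = tail;
    }

    // Returns a new stack; the old one is untouched and fully shared.
    public ImmutableStack<T> Push(T value)
    {
        return new ImmutableStack<T>(value, this);
    }

    // "Removing" the top allocates nothing at all.
    public ImmutableStack<T> Pop(out T value)
    {
        value = _head;
        return _tail;
    }
}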
Also, if it's true that you're facing issues with the LOH, it can sometimes help to allocate a big block of memory yourself and use it as a reusable memory pool, avoiding the GC completely. It's worth considering.
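On newer runtimes, System.Buffers.ArrayPool<T> packages exactly this idea; a hedged sketch of renting a large buffer instead of allocating it each time:

using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(100000); // 100,000 bytes would otherwise land on the LOH
try
{
    // ... use the buffer; note it may be longer than requested ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}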

No. You can wait for the new runtime in 2015, though, which will use SIMD instructions for the Array.Copy operation. That will be quite a lot faster. The current implementation is very sub-optimal.
In the end, the trick is to avoid memory operations, which sometimes just is not possible.

Related

What are approaches to optimize the mark phase of a non-generational GC?

I am running on Unity's Boehm–Demers–Weiser garbage collector, which is a non-generational GC.
I have a large tree of managed objects in memory (~100k objects, ~200MiB allocation).
These objects are essentially a cache and never go out of scope, so they never actually get swept by the GC.
However, because Boehm is non-generational, this stale cache never gets promoted to a higher generation. This causes the mark phase to take a large amount of processing time, as it has to traverse the whole cache on every collection, causing noticeable lag spikes.
This is "by-design", as the Unity documentation puts it:
Crucially, Unity’s garbage collection – which uses the Boehm GC algorithm – is non-generational and non-compacting. “Non-generational” means that the GC must sweep through the entire heap when performing a collection pass, and its performance therefore degrades as the heap expands.
I am well aware of approaches to reduce recurring garbage allocation, however I cannot find any information on how to optimize a large, stale, baseline allocation in a non-generational GC.
More specifically:
Is there any way to mark a root pointer (e.g. static field) as ignored from GC entirely?
Are there some data structure patterns that are faster to traverse in the mark phase?
Conversely, are there known data structure patterns that hinder the mark phase speed?
These questions are just some of my hypotheses to solve this, but I'm open to all suggestions.
One could approximate generational behavior by separating program startup, which initializes the static data structures, from steady-state operation. All pointers into the startup memory region could be ignored, while pointers out of it should not exist, since at the moment of the switch nothing in the later, GC-controlled region has been allocated yet.
One could even GC the startup region once before switching to a new region. Essentially you would end up with a limited form of region-based, non-moving collector where references between regions only happen in one direction.
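On the other sub-question, data-structure patterns that are cheap to mark: a conservative collector only has to scan memory that may contain pointers, so flattening the cache into arrays of pointer-free structs linked by integer indices gives the marker one object per array instead of ~100k nodes to traverse. A hedged sketch (FlatCache, CacheNode, and the field layout are hypothetical):

static class FlatCache
{
    // Indices replace object references, so the structs contain no pointers
    // and the mark phase does not have to chase anything inside the arrays.
    public struct CacheNode
    {
        public int FirstChildIndex;  // -1 when absent
        public int NextSiblingIndex; // -1 when absent
        public int PayloadOffset;    // offset into the payload blob below
        public int PayloadLength;
    }

    public static CacheNode[] Nodes = new CacheNode[100000];     // one GC object, not ~100k
    public static byte[] Payloads = new byte[200 * 1024 * 1024]; // pointer-free blob
}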

Producer/Consumer with C# structs?

I have a singleton object that processes requests. Each request takes around one millisecond to complete, usually less. This object is not thread-safe; it expects requests in a particular format, encapsulated in the Request class, and returns the result as a Response. The processor has another producer/consumer inside it that sends/receives through a socket.
I implemented the producer/consumer approach to work fast (a sketch follows the steps):
The client prepares a RequestCommand command object, which contains a TaskCompletionSource<Response> and the intended Request.
The client adds the command to the "request queue" (a Queue<>) and awaits command.Completion.Task.
A different thread (an actual background Thread) pulls the command from the "request queue", processes command.Request, generates a Response, and signals the command as done using command.Completion.SetResult(response).
The client continues working.
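A minimal sketch of that pattern, assuming Request and Response are the classes described above and using a BlockingCollection<> in place of the hand-rolled queue (RequestProcessorClient and the process delegate are hypothetical names):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

sealed class RequestCommand
{
    public Request Request { get; }
    public TaskCompletionSource<Response> Completion { get; } =
        new TaskCompletionSource<Response>();

    public RequestCommand(Request request) { Request = request; }
}

sealed class RequestProcessorClient
{
    private readonly BlockingCollection<RequestCommand> _queue =
        new BlockingCollection<RequestCommand>();

    // Client side: enqueue the command and await its completion.
    public Task<Response> SendAsync(Request request)
    {
        var command = new RequestCommand(request);
        _queue.Add(command);
        return command.Completion.Task;
    }

    // Background thread: drain the queue and complete each command.
    public void ConsumeLoop(Func<Request, Response> process)
    {
        foreach (var cmd in _queue.GetConsumingEnumerable())
            cmd.Completion.SetResult(process(cmd.Request));
    }
}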
But when doing a small memory benchmark I see LOTS of these objects being created, topping the list of the most common objects in memory. Note that there is no memory leak; the GC can clean everything up nicely each time it triggers. But obviously, so many objects being created so fast makes Gen 0 very big, and I wonder if better memory usage might yield better performance.
I was considering converting some of these objects to structs to avoid allocations, especially now that there are some new features for working with them in C# 7.1. But I do not see a way of doing it:
Value types can be instantiated on the stack, but if they pass from thread to thread, they must be copied stackA -> heap and heap -> stackB, I guess. Also, when enqueued in the queue, they go from the stack to the heap.
The singleton object is truly asynchronous. There is some in-memory processing, but 90% of the time it needs to call outside and go through the internal producer/consumer.
ValueTask<> does not seem to fit here, because things are asynchronous.
TaskCompletionSource<> has a state, but it is typed as object, so a struct stored there would be boxed.
The command also jumps from thread to thread.
Recycling objects only works for the command itself; its contents cannot be recycled (a TaskCompletionSource<> and a string).
Is there any way I could leverage structs to reduce the memory usage or/and improve the performance? Any other option?
Value types can be instantiated on the stack, but if they pass from thread to thread, they must be copied stackA -> heap and heap -> stackB, I guess.
No, that's not at all true. But you have a deeper problem in your thinking here:
Immediately stop thinking of structs as living on the stack. When you make an int array with a million ints, you think those four million bytes of ints live on your one-million-byte stack? Of course not.
The truth is that stack vs heap has nothing whatsoever to do with value types. Instead of "stack and heap", start saying "short term allocation pool" and "long term allocation pool". Variables that have short lifetimes are allocated from the short term allocation pool, regardless of whether that variable contains an int or a reference to an object. Once you start thinking about variable lifetime correctly then your reasoning becomes entirely straightforward. Short-lived things live in the short term pool, obviously.
So: when you pass a struct from one thread to another, does it ever live "on the heap"? The question is nonsensical because values are not things that live on the heap. Variables are storage; variables store values.
So: Is it the case that turning classes into structs will improve performance because "those structs can live on the stack"? No, of course not. The relevant difference between reference types and value types is not where they live but how they are copied. Value types are copied by value, reference types are copied by reference, and reference copies are the fastest copies.
I see LOTS of these objects being created, topping the list of the most common objects in memory. Note that there is no memory leak; the GC can clean everything up nicely each time it triggers. But obviously, so many objects being created so fast makes Gen 0 very big, and I wonder if better memory usage might yield better performance.
OK, now we come to the sensible part of your question. This is an excellent observation and it is one which is testable with science. The first thing you should do is to use a profiler to determine what is the actual burden of gen 0 collections on the performance of your application.
It may be that this burden is not the slowest thing in your program and in fact it is irrelevant. In that case, you will now know to concentrate your efforts on the real problem, rather than chasing down memory allocation problems that aren't real problems.
Suppose you discover that gen 0 collections really are killing your performance; what can you do? Is the answer to make more things structs? That can work, but you have to be very careful:
If the structs themselves contain references, you've just pushed the problem off one level, you haven't solved it.
If the structs are larger than reference size -- and of course they almost always are -- then now you are copying them by copying the entire struct rather than copying a reference, and you've traded a GC time problem for a copy time problem. That might be a win, or a loss; use science to find out which it is.
When we were faced with this problem in Roslyn, we thought about it very carefully and did a lot of experiments. The strategy we went with was in general not to move things onto the stack. Rather, we identified how many small, short-lived objects there were active in memory at any one time, of each type -- using a profiler -- and then implemented a pooling strategy on those objects. You need a small object, you take it out of the pool. When you're done, you put it back in the pool. What happens is, you end up with O(number of objects active at any one time) in the pool, which quickly gets moved into the gen 2 heap; you then greatly lower your collection pressure on the gen 0 heap while increasing the cost of comparatively rare gen 2 collections.
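For illustration, a minimal sketch of such a pool, assuming single ownership and that callers reset an object's state when returning it (on modern .NET, Microsoft.Extensions.ObjectPool offers a hardened implementation):

using System.Collections.Concurrent;

sealed class ObjectPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    // Take an object out of the pool, or create one if the pool is empty.
    public T Rent()
    {
        T item;
        return _items.TryTake(out item) ? item : new T();
    }

    // Put it back when done. The pool and its contents quickly move to
    // the gen 2 heap, lowering collection pressure on gen 0.
    public void Return(T item)
    {
        _items.Add(item);
    }
}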
I'm not saying that's the best choice for you. I'm saying that we had this same problem in Roslyn, and we solved it with science. You can do the same.

Weird out of memory exceptions [duplicate]

If your application has to do a lot of allocation/deallocation of large objects (>85,000 bytes), it will eventually cause memory fragmentation and your application will throw an OutOfMemoryException.
Is there any solution to this problem, or is it a limitation of CLR memory management?
Unfortunately, all the info I've ever seen only suggests managing the risk factors yourself: reuse large objects, allocate them at the beginning, make sure they're of sizes that are multiples of each other, use alternative data structures (lists, trees) instead of arrays. That just gave me another idea: create a non-fragmenting List that, instead of one large array, splits itself into smaller ones. Arrays/Lists seem to be the most frequent culprits IME.
Here's an MSDN magazine article about it:
http://msdn.microsoft.com/en-us/magazine/cc534993.aspx, but there isn't that much useful in it.
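A hedged sketch of that non-fragmenting List idea: keep the elements in fixed-size chunks that each stay comfortably below the 85,000-byte LOH threshold, so no single backing array ever lands on the LOH (ChunkedList is a made-up name, not a framework type):

using System.Collections.Generic;

sealed class ChunkedList<T>
{
    private const int ChunkSize = 4096; // small enough to stay off the LOH for typical T
    private readonly List<T[]> _chunks = new List<T[]>();
    private int _count;

    public int Count { get { return _count; } }

    public void Add(T item)
    {
        if (_count % ChunkSize == 0)
            _chunks.Add(new T[ChunkSize]); // grow by one small chunk; nothing is copied
        _chunks[_count / ChunkSize][_count % ChunkSize] = item;
        _count++;
    }

    public T this[int index]
    {
        get { return _chunks[index / ChunkSize][index % ChunkSize]; }
        set { _chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}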
The thing about large objects in the CLR's Garbage Collector is that they are managed in a different heap.
The garbage collector uses a mechanism called compacting, which is essentially defragmentation: live objects in the regular heap are moved together and the references to them are re-linked.
The thing is, since compacting large objects (copying and re-linking them) is an expensive procedure, the GC provides a different heap for them, which is never compacted.
Note also that memory allocation is contiguous, meaning that if you allocate Object #1 and then Object #2, Object #2 will always be placed after Object #1.
This is probably what's causing you to get OutOfMemoryExceptions.
I would suggest having a look at design patterns like Flyweight, Lazy Initialization and Object Pool.
You could also force a GC collection if you suspect that some of those large objects are already dead but have not been collected due to flaws in your flow of control, causing them to reach higher generations just before being ready for collection.
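For diagnosis, a forced full collection will tell you whether dead large objects were the culprit; on .NET 4.5.1 and later you can additionally request a one-time LOH compaction (a sketch for diagnostic use, not something to leave in production code):

using System;
using System.Runtime;

// Ask the next full blocking collection to also compact the LOH,
// then force that collection.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();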
A program always bombs on OOM because it is asking for a chunk of memory that's too large, never because it has completely exhausted all virtual memory address space. You could argue that's a problem of the LOH getting fragmented; it is just as easy to argue that the program is using too much virtual memory.
Once a program goes beyond allocating half the addressable virtual memory (a gigabyte), it is really time to either make its code smarter so it doesn't gobble so much memory, or to make a 64-bit operating system a prerequisite. The latter is always cheaper; it doesn't come out of your pocket either.
Is there any solution to this problem or is it a limitation of CLR memory management?
There is no solution besides reconsidering your design. And it is not a problem of the CLR. Note that the problem is the same for unmanaged applications. It is caused by the fact that too much memory is being used by the application at the same time, in segments laid out 'disadvantageously' in memory. If some external culprit has to be pointed at nevertheless, I would rather point at the OS memory manager, which (of course) does not compact its VM address space.
The CLR manages free regions of the LOH in a free list. In most cases this is the best that can be done against fragmentation. But for really large objects, the number of objects per LOH segment decreases, and we eventually end up having only one object per segment. Where those objects are positioned in the VM space is completely up to the memory manager of the OS. This means the fragmentation mostly happens at the OS level, not in the CLR. This is an often overlooked aspect of heap fragmentation, and it is not .NET that is to blame for it. (But it is also true that fragmentation can occur on the managed side, as nicely demonstrated in that article.)
Common solutions have been named already: reuse your large objects. Up to now I have not been confronted with any situation where this could not be done with proper design. However, it can be tricky sometimes and therefore may be expensive.
We were processing images in multiple threads. With images being large enough, this also caused OutOfMemory exceptions due to memory fragmentation. We tried to solve the problem by using unsafe memory and pre-allocating a heap for every thread. Unfortunately, this didn't help completely, since we relied on several libraries: we were able to solve the problem in our own code, but not in third-party code.
Eventually we replaced threads with processes and let the operating system do the hard work. Operating systems built a solution for memory fragmentation long ago, so it's unwise to ignore it.
I have seen in a different answer that the LOH can shrink in size:
Large Arrays, and LOH Fragmentation. What is the accepted convention?
"
...
Now, having said that, the LOH can shrink in size if the area at its end is completely free of live objects, so the only problem is if you leave objects in there for a long time (e.g. the duration of the application).
...
"
Other than that, you can make your program run with extended memory: up to 3 GB on a 32-bit system, and up to 4 GB for a 32-bit process on a 64-bit system.
Just add the /LARGEADDRESSAWARE flag in your linker, or use this post-build event:
call "$(DevEnvDir)..\tools\vsvars32.bat"
editbin /LARGEADDRESSAWARE "$(TargetPath)"
In the end, if you are planning to run the program for a long time with lots of large objects, you will have to optimize the memory usage, and you might even have to reuse allocated objects to avoid the garbage collector; this is similar in concept to working with real-time systems.

c# how can I sidestep the memory allocation bottleneck to improve multithreading performance

I use C# as a research tool, and frequently need to run CPU-intensive tasks such as optimisations. In theory I should be able to get big performance improvements by multi-threading my code, but in practice when I use the same number of threads as the number of cores available on my workstation, I usually find that the CPU is still only running at 25%-50% of max. Interrupting the code to see what all the threads are doing strongly suggests that memory allocation is the bottleneck, because most threads will be waiting on new statements to execute.
One solution would be to try to re-engineer all my code to be much more memory efficient, but that would be a big and time-consuming task. However, since I have an abundance of memory on my workstation, I'm wondering if I can sidestep this problem by setting up the different threads so that they each have their own private pool of memory to work from. Of course, some objects will still need to be shared between threads, otherwise it won't be possible to specify the tasks for each thread or to harvest the results.
Does anyone know if this kind of approach is possible in C#, and if so, how should I go about it?
If you have a memory allocation bottleneck, you should:
Use an object pool (as @MartinJames said). Initialize the pool when the application starts; an object pool should improve on the performance of repeated heap allocation.
Use structs (or any value type) as local variables, because stack allocation is much faster than heap allocation.
Avoid implicit memory allocation. For example, when you add an item to a List<> (a presizing sketch follows the quote):
If Count already equals Capacity, the capacity of the List is increased by automatically reallocating the internal array, and the existing elements are copied to the new array before the new element is added. (Source: MSDN)
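A small sketch of sidestepping that hidden reallocation by presizing the list (the element count is arbitrary):

using System.Collections.Generic;

// One up-front allocation instead of repeated grow-and-copy steps.
var items = new List<int>(1000000);
for (int i = 0; i < 1000000; i++)
    items.Add(i); // never triggers an internal reallocation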
Avoid boxing. It's very expensive (an illustration follows the quote):
In relation to simple assignments, boxing and unboxing are computationally expensive processes. When a value type is boxed, a new object must be allocated and constructed. To a lesser degree, the cast required for unboxing is also expensive computationally. (Source: MSDN)
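A tiny illustration of where the boxing allocations come from:

using System.Collections;
using System.Collections.Generic;

int n = 42;
object boxed = n;      // boxing: allocates a new object on the heap
int back = (int)boxed; // unboxing: runtime type check plus a copy

var noBoxing = new List<int>();
noBoxing.Add(n);       // generic collection: the value is stored unboxed

var boxing = new ArrayList();
boxing.Add(n);         // non-generic collection: boxes n on every Add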
Avoid lambda expressions that capture a variable (because a new object will be created for the captured variable), as in the sketch below.
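A short example of that hidden allocation; the compiler generates a closure class to hold the captured variable:

using System;

int threshold = 10;
Func<int, bool> capturing = x => x > threshold; // allocates a closure object for 'threshold'
Func<int, bool> nonCapturing = x => x > 10;     // captures nothing; the delegate can be cached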
That is similar to what I do in servers: use object pools for frequently-used classes (though not in C#).
I guess that, in C#, you could use a BlockingCollection. Prefill it with a load of T's, Take() objects from it, use them, and then return them with Add().
This works well with objects that are numerous and large (e.g. server data buffers), or have complex and lengthy ctors/dtors (e.g. an HTTP receiver/parser component). Popping/pushing such objects (since, in .NET, they are essentially just pointers) off/on queues is much quicker than continually creating them and later having the GC destroy them.
NOTE: an object popped from such a pool queue has probably been used before and may need some explicit initialization!
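A sketch of that BlockingCollection-based pool, with byte buffers standing in for the pooled objects (sizes and counts are arbitrary):

using System;
using System.Collections.Concurrent;

var pool = new BlockingCollection<byte[]>();
for (int i = 0; i < 64; i++)
    pool.Add(new byte[64 * 1024]); // prefill with a load of buffers

byte[] buffer = pool.Take();       // blocks if every buffer is in use
try
{
    Array.Clear(buffer, 0, buffer.Length); // reused object: explicit re-initialization
    // ... use the buffer ...
}
finally
{
    pool.Add(buffer);              // return it for the next user
}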
It's not particularly a C# or .NET problem. For a CPU core to run optimally, it needs all its data to be in the CPU cache. If a particular piece of data is not in the CPU cache, a cache miss happens and the CPU sits idle until the data is fetched from memory into the cache.
If your in-memory data is too fragmented, the chance of a cache miss increases.
The way the CLR does heap allocation is already quite optimal for the CPU cache. It's unlikely that you can achieve the same performance by handling the memory allocation yourself.

Does allocating objects of the same size improve GC or "new" performance?

Suppose we have to create many small objects of byte-array type. The size varies, but it is always below 1024 bytes, say 780, 256, 953, ...
Will it improve operator new or GC efficiency over time if we always allocate byte[1024] and use only the space needed?
UPD: These are short-lived objects, created for parsing binary protocol messages.
UPD: The number of objects is the same in both cases; it is just the size of each allocation that changes (random vs. always 1024).
In C++ it would matter because of fragmentation and C++ new performance. But in C#....
Will it improve operator new or GC efficiency over time if we always allocate byte[1024] and use only the space needed?
Maybe. You're going to have to profile it and see.
The way we allocate syntax tree nodes inside the Roslyn compiler is quite interesting, and I'm eventually going to do a blog post about it. Until then, the relevant bit to your question is this interesting bit of trivia. Our allocation pattern typically involves allocating an "underlying" immutable node (which we call the "green" node) and a "facade" mutable node that wraps it (which we call the "red" node). As you might imagine, it is frequently the case that we end up allocating these in pairs: green, red, green, red, green, red.
The green nodes are persistent and therefore long-lived; the facades are short-lived, because they are discarded on every edit. Therefore it is frequently the case that the garbage collector has green / hole / green / hole / green / hole, and then the green nodes move up a generation.
Our assumption had always been that making data structures smaller will always improve GC performance. Smaller structures equals less memory allocated, equals less collection pressure, equals fewer collections, equals more performance, right? But we discovered through profiling that making the red nodes smaller in this scenario actually decreases GC performance. Something about the particular size of the holes affects the GC in some odd way; not being an expert on the internals of the garbage collector, it is opaque to me why that should be.
So is it possible that changing the size of your allocations can affect the GC in some unforeseen way? Yes, it is possible. But, first off, it is unlikely, and second, it is impossible to know whether you are in that situation until you actually try it in real-world scenarios and carefully measure GC performance.
And of course, you might not be gated on GC performance. Roslyn does so many small allocations that it is crucial that we tune our GC-impacting behaviour, but we do an insane number of small allocations. The vast majority of .NET programs do not stress the GC the way we do. If you are in the minority of programs that stress the GC in interesting ways then there is no way around it; you're going to have to profile and gather empirical data, just like we do on the Roslyn team.
If you are not in that minority, then don't worry about GC performance; you probably have a bigger problem somewhere else that you should be dealing with first.
new is fast; it is the GC that causes problems. So it depends on how long your arrays live.
If they only live a short time, I don't think there will be any improvement from allocating 1024 byte arrays. In fact this will put more pressure on the GC because of the wasted space and will probably degrade performance.
If they live for the life of your application, I would consider allocating one large array and using chunks of it for each small array; a sketch of this follows. You would need to profile it to see if it helps.
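A hedged sketch of that "one large array, hand out chunks" idea, a simple bump allocator with no per-array free (ChunkAllocator is a made-up name):

using System;

sealed class ChunkAllocator
{
    private readonly byte[] _buffer;
    private int _offset;

    public ChunkAllocator(int totalBytes)
    {
        _buffer = new byte[totalBytes]; // one long-lived allocation
    }

    // Hands out a view over the next 'size' bytes; ArraySegment avoids
    // allocating a fresh array per request.
    public ArraySegment<byte> Allocate(int size)
    {
        if (_offset + size > _buffer.Length)
            throw new InvalidOperationException("Allocator exhausted.");
        var segment = new ArraySegment<byte>(_buffer, _offset, size);
        _offset += size;
        return segment;
    }
}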
Not really. Allocating or clearing a byte array requires only one instruction, regardless of its size. (I am speaking about your case; there are exceptions.)
You shouldn't worry about the performance aspect of garbage collection unless you are sure that it's a bottleneck for your application (i.e. you create a lot of references with complex relationships and throw them away shortly afterwards, and the garbage collection is noticeable).
To read an excellent story about a well-known (and quite useful) site having performance issues with the .NET GC (in an impressive use case), see this blog post: http://samsaffron.com/archive/2011/10/28/in-managed-code-we-trust-our-recent-battles-with-the-net-garbage-collector ;)
But the most important thing about the GC is: never, ever do optimisations before being sure that you have a problem, because if you do, you will probably create one. Applications are complex, and the GC interacts with every part of them at runtime. Apart from simple cases, predicting its behavior and bottlenecks beforehand seems (in my opinion) difficult.
I also don't think that allocating only 1024-byte arrays will improve the GC.
Since the GC is not deterministic, I also think it will not be your problem.
You can influence the GC by using using {} statements around disposable objects to free memory (maybe) sooner.
I do not think the GC will be the problem and/or bottleneck.
Allocating objects of different sizes and freeing them in a different sequence can lead to fragmented heap memory, so this is where allocating objects of the same size might come in handy.
If you really do allocate/free a lot and really do see this as your bottleneck, try to re-use the objects with a local object cache. This can lead to a performance increase, especially if they are data-only objects that do not implement a lot of logic. If the objects do implement a lot of logic and require more complex initialisation (RAII pattern), I would forego the performance increase for the robustness of the program...
hth
Mario
