c# multithreading mem usage - App slowing down and CPU usage drops

I've got a multithreaded app that manipulates in-memory data (no database or network access). I tried this on 2 machines: one has a dual quad-core Xeon, the other has twin dual-cores. 5 threads are spawned.
When this multithreaded process starts, it runs very quickly: CPU usage is at 60% across 5 cores and physical memory is at 50% of RAM capacity (info from Task Manager). After it's about 1/3 of the way through, it starts to slow down and CPU utilisation drops to just below 20%. By the time it gets to 2/3 of the way, it's so slow that the last third takes a day, while the first third took half an hour.
The process creates many SortedLists and Lists, so I am starting to suspect that the garbage collector can't cope, although the Task Manager memory usage is not so bad... I want to try to force the GC to free the unused collections immediately. Is this reasonable, or even doable? And why does CPU utilisation drop?

Forcing the garbage collector to run is almost always a bad idea. (In some instances, forcing it to run early can actually promote objects to older generations and so extend their lifetimes.)
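For reference, "forcing" the GC usually means the pattern below; it's shown only to illustrate what's being discouraged:

    // Forces a full collection, runs pending finalizers, then collects
    // whatever the finalizers released. Calling this mid-run can promote
    // short-lived objects into older generations, making things worse.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();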
Download a tool like MemProfiler, ANTS, or dotTrace (they all have trial versions) to identify whether you are leaking memory. Are you allocating objects larger than 85 KB (such objects go on the large object heap)?
Also, what version of the OS and .NET Framework are you using? (There were differences in how the server and workstation versions of the GC worked.)
Also, be aware that insertion into a SortedList is O(n), whereas a SortedDictionary insertion is O(log n):
The SortedList generic class is an array of key/value pairs with O(log n) retrieval, where n is the number of elements in the dictionary. In this, it is similar to the SortedDictionary generic class. The two classes have similar object models, and both have O(log n) retrieval. Where the two classes differ is in memory use and speed of insertion and removal: SortedList uses less memory than SortedDictionary. SortedDictionary has faster insertion and removal operations for unsorted data, O(log n) as opposed to O(n) for SortedList. If the list is populated all at once from sorted data, SortedList is faster than SortedDictionary. (Ref.)
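A minimal sketch of that difference (key count and values are arbitrary): inserting unsorted keys into a SortedList shifts the backing array on every insert, while SortedDictionary does a tree insert.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class InsertTiming
    {
        static void Main()
        {
            const int n = 100000;
            Random rng = new Random(42);
            int[] keys = new int[n];
            for (int i = 0; i < n; i++) keys[i] = rng.Next();

            Stopwatch sw = Stopwatch.StartNew();
            SortedList<int, int> list = new SortedList<int, int>();
            foreach (int k in keys) list[k] = k;      // O(n) array shift per unsorted insert
            Console.WriteLine("SortedList:       " + sw.ElapsedMilliseconds + " ms");

            sw = Stopwatch.StartNew();
            SortedDictionary<int, int> dict = new SortedDictionary<int, int>();
            foreach (int k in keys) dict[k] = k;      // O(log n) tree insert
            Console.WriteLine("SortedDictionary: " + sw.ElapsedMilliseconds + " ms");
        }
    }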
How are you managing multithreaded access to these lists? Can you post some cut-down code?

I guess adding lots of items to a heavily loaded collection isn't as performant as it could be. I noticed something similar with an old SQL query: 100 records in the recordset was quick, but half a million records slowed things down dramatically.
To check the GC, run up perfmon and view (or log) the performance counters for the garbage collector and memory allocations.

Sounds like a data structure locking issue. It's difficult to say without knowing exactly what you're doing.
Try using one of the lock-free, non-contiguous collections such as ConcurrentDictionary or ConcurrentBag, and/or a proper producer/consumer queue like BlockingCollection.
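A minimal sketch of that shape, assuming a producer/consumer split (the names and sizes are illustrative, not the poster's code):

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class ConcurrentSketch
    {
        static void Main()
        {
            ConcurrentDictionary<int, string> results = new ConcurrentDictionary<int, string>();
            BlockingCollection<int> work = new BlockingCollection<int>(1024);  // bounded queue

            Task[] consumers = new Task[4];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks until items arrive; exits after CompleteAdding().
                    foreach (int item in work.GetConsumingEnumerable())
                        results[item] = item.ToString();   // no explicit locks needed
                });
            }

            for (int i = 0; i < 10000; i++) work.Add(i);
            work.CompleteAdding();
            Task.WaitAll(consumers);
            Console.WriteLine(results.Count);
        }
    }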

You are more than likely using up all your physical memory with your data, and Windows starts to use virtual memory after that, which is much slower. You should try a memory profiler to see which objects are taking up all your memory, and consider disposing of some of those objects periodically to keep from using up all your RAM.

60% CPU on 5 cores from 5 threads: I assume that is 60% on each core. This is actually very bad. If you cannot drive the CPU to 100% doing memory operations alone (no database, no network, no file IO), your lock contention is huge. As the program progresses, your structures presumably grow in size (more elements in some lists/dictionaries), so you hold locks for longer, and the result is less CPU use and even slower progress.
It's hard to tell without any real performance data, but this does not look GC-related. It looks more like high contention on the data structures. You should trace your app under a profiler and see where the CPU and wait times are spent. See Pinpoint a performance issue using hotpath in Visual Studio 2008 for a quick introduction to the sampling profiler.
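A hedged sketch of one way to shrink that contention (not the poster's actual code): give each thread a private collection and merge once per thread, instead of locking per item:

    using System.Collections.Generic;
    using System.Threading.Tasks;

    class MergeAtEnd
    {
        static readonly object mergeLock = new object();
        static readonly SortedList<int, int> shared = new SortedList<int, int>();

        static void Process(IEnumerable<int[]> partitions)
        {
            Parallel.ForEach(partitions, partition =>
            {
                // Private list: no locking while the real work happens.
                SortedList<int, int> local = new SortedList<int, int>();
                foreach (int item in partition)
                    local[item] = item;

                // One short lock per thread instead of one per item.
                lock (mergeLock)
                {
                    foreach (KeyValuePair<int, int> kv in local)
                        shared[kv.Key] = kv.Value;
                }
            });
        }
    }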

Related

C# Excessive Garbage Collection - Large Strings, G2 pressure?

I'm writing a high-ish volume web service in C#, running in 64-bit IIS on Win 2k8 (.NET 4.5), that works with XML payloads and does a variety of operations on small and large objects (the large objects being mainly strings, some over 85 KB, so they go onto the LOH). Requests are stateless, and memory usage remains steady over time. Lots of memory is allocated and released per request; no memory appears to be leaked.
Operating at a maximum of 25 transactions per second, with an average call lasting 5 s, it's spending 40-60% of its time in GC according to two profiling tools, and perfmon shows a steady 20 gen 0 and gen 1 collections over 5 seconds, and 15 gen 2 collections over 5 seconds - meaning lots of (we think) premature promotion into gen 2 of data that we'd expect to stay in gen 0. Everything I read indicates this is very excessive. We expect the system to be able to perform at a higher throughput than 25 tps and assume the GC activity is preventing this.
The machines serving the requests have lots of memory (16 GB), and the application consumes at most 1 GB after an hour under load. I understand that a bigger heap won't necessarily make things better, but there is spare memory.
I appreciate this is light on specifics (I'll try to recreate the conditions with a trivial application if time permits), but can anyone explain why we see so much gen 2 GC activity? Should I be focusing on the LOH? People keep telling me that the CLR's GC "adapts" to your load, but it's not changing its behaviour in this case and, unlike with other runtimes, there seems to be little I can do to tune it (I have tried workstation GC, but there is very little observable difference).
Microsoft decided to design the String class so that all strings are stored in memory as a monolithic sequence of characters. While this works well for some usage patterns, it works dreadfully for others.
One thing I've found very helpful is to avoid creating instances of String whenever possible. If a method will often be used to operate on part of a supplied string, and will in turn ask other methods to operate on parts of it, the methods should accept arguments specifying the range of the String upon which they should operate. This avoids the need for callers of the first method to use Substring to construct a new String for the method to act upon, and avoids the need for the method to call Substring to feed portions of the string to its callees. In some cases where I have used this technique, the creation of thousands of String instances - some quite large - could be replaced with zero.
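A small sketch of that technique (the method name is hypothetical): accept an offset and length instead of forcing callers to call Substring:

    // Instead of: CountVowels(text.Substring(start, length));
    // accept the range and walk the original string, allocating nothing.
    static int CountVowels(string s, int start, int length)
    {
        int count = 0;
        for (int i = start; i < start + length; i++)
        {
            char c = char.ToLowerInvariant(s[i]);
            if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
                count++;
        }
        return count;
    }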
CLR's GC "adapts" to your load
It can't know how much memory you are willing to tolerate as overhead. Here, you probably want to give the app something like 5 GB of heap so that collections are much rarer. The GC has no built-in tuning knob for that (subjective note: that's a pity).
You can force bigger heap sizes by using one of the low-latency modes for short durations. That should cause the GC to try hard to avoid gen 2 collections. Monitor RAM usage and disable low-latency mode when consumption reaches 5 GB.
This is a risky strategy, but it's the best I think you can do.
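On .NET 4.5 that switch looks roughly like this (the 5 GB threshold is the suggestion above, not a magic number):

    using System;
    using System.Runtime;

    static class LowLatencyWindow
    {
        const long ThresholdBytes = 5L * 1024 * 1024 * 1024;   // ~5 GB, per the advice above

        public static void Enter()
        {
            // Added in .NET 4.5: asks the GC to avoid blocking gen 2
            // collections for an extended period.
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
        }

        public static void ExitIfOverThreshold()
        {
            if (GC.GetTotalMemory(false) > ThresholdBytes)
                GCSettings.LatencyMode = GCLatencyMode.Interactive;   // back to the default
        }
    }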
As for workstation GC: I would not use it. You can at most gain 2x throughput from GC tuning anyway (your CPU is maxed out, right?), and workstation GC does not scale to multiple cores - it leaves CPUs unused.

c# how can I sidestep the memory allocation bottleneck to improve multithreading performance

I use C# as a research tool, and frequently need to run CPU-intensive tasks such as optimisations. In theory I should be able to get big performance improvements by multi-threading my code, but in practice, when I use the same number of threads as the number of cores available on my workstation, I usually find that the CPU is still only running at 25%-50% of max. Interrupting the code to see what all the threads are doing strongly suggests that memory allocation is the bottleneck, because most threads are waiting on 'new' statements.
One solution would be to try and re-engineer all my code to be much more memory-efficient, but that would be a big and time-consuming task. However, since I have an abundance of memory on my workstation, I'm wondering if I can sidestep this problem by setting up the different threads so that they each have their own private pool of memory to work from. Of course, some objects will still need to be shared between all threads, otherwise it won't be possible to specify the tasks for each thread or to harvest the results.
Does anyone know if this kind of approach is possible in C#, and if so, how should I go about it?
If you have a memory allocation bottleneck, you should:
Use an object pool (as @MartinJames said). Initialise the pool when the application starts; it cuts the cost of repeated heap allocation. (See the BlockingCollection sketch further down.)
Use structs (or any value type) as local variables, because stack allocation is much faster than heap allocation.
Avoid implicit memory allocation. For example, when you add an item to a List<>:
If Count already equals Capacity, the capacity of the List is increased by automatically reallocating the internal array, and the existing elements are copied to the new array before the new element is added. (Source: MSDN)
Avoid boxing. It's very expensive:
In relation to simple assignments, boxing and unboxing are computationally expensive processes. When a value type is boxed, a new object must be allocated and constructed. To a lesser degree, the cast required for unboxing is also expensive computationally. (Source: MSDN)
Avoid lambda expressions that capture a variable (a new object is created to hold each captured variable). A compact sketch of these points follows below.
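A minimal sketch of those points together (values are illustrative):

    using System.Collections.Generic;

    class AllocationHints
    {
        static long SumOfSquares(int[] input)
        {
            // Pre-sizing avoids the implicit reallocate-and-copy that happens
            // whenever Count would exceed Capacity.
            List<long> squares = new List<long>(input.Length);

            long total = 0;                   // value-type local: no heap allocation
            foreach (int x in input)          // no boxing: int is used as int throughout
            {
                long sq = (long)x * x;
                squares.Add(sq);
                total += sq;
            }
            // Capturing x or total in a lambda here would make the compiler
            // allocate a closure object; keeping the loop lambda-free avoids that.
            return total;
        }
    }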
That is similar to what I do in servers: use object pools for frequently-used classes (though not in C#).
I guess that, in C#, you could use a BlockingCollection. Pre-fill it with a load of T's, Take() objects from it, use them, and then return them with Add().
This works well with objects that are numerous and large (e.g. server data buffers) or have complex and lengthy ctors/dtors (e.g. an HTTP receiver/parser component). Popping/pushing such objects (essentially pointers in .NET) off/on queues is much quicker than continually creating them and later having the GC destroy them.
NOTE: an object popped from such a pool queue has probably been used before and may need some explicit re-initialisation!
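A sketch of that pool (the DataBuffer class is hypothetical):

    using System.Collections.Concurrent;

    // Hypothetical reusable object: a large server data buffer.
    class DataBuffer
    {
        public byte[] Bytes = new byte[64 * 1024];
        public int Length;
        public void Reset() { Length = 0; }    // the explicit re-init the note warns about
    }

    class BufferPool
    {
        private readonly BlockingCollection<DataBuffer> pool =
            new BlockingCollection<DataBuffer>();

        public BufferPool(int size)
        {
            for (int i = 0; i < size; i++)
                pool.Add(new DataBuffer());    // pre-fill with a load of T's
        }

        public DataBuffer Take()
        {
            DataBuffer b = pool.Take();        // blocks if the pool is momentarily empty
            b.Reset();
            return b;
        }

        public void Return(DataBuffer b)
        {
            pool.Add(b);
        }
    }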
It's not particularly a C# or .NET problem. For a CPU core to run optimally, it needs all its data to be in the CPU cache. If a particular piece of data is not in the cache, a cache miss happens and the CPU sits idle until the data is fetched from memory into the cache.
If your in-memory data is too fragmented, the chance of a cache miss increases.
The way the CLR does heap allocation is already quite cache-friendly (consecutive allocations tend to be adjacent in memory). It's unlikely that you could achieve the same performance by handling the memory allocation yourself.

Parallel code bad scalability

Recently I've been analysing how my parallel computations actually speed up on a 16-core processor. And the general pattern I've concluded - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed [images not reproduced]:
So, you can see that processor load increases, but speed increases much more slowly. I want to know why this effect takes place and how to find the reason for the unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelising appropriate code, since the code does nothing more than:
Loads data from RAM (the server has 96 GB of RAM, so the swap file shouldn't be hit)
Performs fairly simple calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks; it looks like each operation becomes slower as the number of threads grows.
I'm stuck: what's wrong with my scenario?
I use the .NET 4 Task Parallel Library.
You will always get this kind of curve; it's called Amdahl's law.
The question is how soon it will level off.
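For reference, Amdahl's law gives the maximum speedup S on N cores when a fraction p of the work can be parallelised:

    S(N) = 1 / ((1 - p) + p/N)

Even with p = 0.95, sixteen cores give at most 1 / (0.05 + 0.95/16) ≈ 9.1x, so a sub-linear curve is expected even before contention or memory bandwidth enter the picture.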
You say you checked your code for bottlenecks; let's assume that's correct. Then there is still memory bandwidth and other hardware factors.
The key to linear scalability - in the sense that going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two logical threads share the same core's resources)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores) - see the sketch after this list
don't use more threads than there are cores (otherwise the OS will swap them in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimise/economise on critical section/synchronisation usage
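Pinning a managed thread to a core isn't exposed directly by the Framework; a common sketch goes through Win32 via P/Invoke (Windows-only, and the core index is an assumption about your machine's topology):

    using System;
    using System.Runtime.InteropServices;
    using System.Threading;

    static class Affinity
    {
        [DllImport("kernel32.dll")]
        static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr mask);

        [DllImport("kernel32.dll")]
        static extern IntPtr GetCurrentThread();

        public static void PinCurrentThreadToCore(int core)
        {
            // Stop the CLR migrating this managed thread across OS threads,
            // then restrict the underlying OS thread to a single core.
            Thread.BeginThreadAffinity();
            SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
        }
    }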
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly do local L1 & L2 memory accesses, it's a better performance bet to tie each thread to its own core. Sure, the OS will come in, but compared to the work being performed by your threads, the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
EDIT:
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).
A general decrease in returns as you add threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or queue, or are you using external functions that might depend on some limited resource?
The sharp break at 8 threads is interesting, and in my comment I asked whether the CPU is a true 16-core or an 8-core with hyperthreading, where each core appears as 2 cores to the OS.
If it is hyperthreading, then either you have so much work that the hyperthreading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads even, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that the threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads fetch more data each time, like reading 10 items instead of one?
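With the .NET 4 TPL mentioned in the question, one way to batch (a sketch, with an arbitrary chunk size) is a range partitioner, so each worker grabs a chunk of indices instead of a single item:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class BatchedWork
    {
        static void Process(double[] data)
        {
            // Hand out ranges of ~4096 indices per grab, cutting per-item
            // synchronisation on the shared work source.
            Parallel.ForEach(Partitioner.Create(0, data.Length, 4096), range =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    data[i] = Math.Sqrt(data[i]);   // placeholder computation
            });
        }
    }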
If you are doing memory-intensive work, you could be hitting cache capacity.
You could test this with a mock algorithm that just processes the same small bit of data over and over, so it all fits in cache.
If it is indeed the cache, possible solutions could be making the threads work on the same data somehow (like different parts of a small data window), or just tweaking the algorithm to be more local (as in sorting: merge sort is generally slower than quicksort, but it is more cache-friendly, which still makes it better in some cases).
Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread 1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache as dirty and update it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
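A hedged sketch of the usual fix: pad per-thread slots to a full cache line (64 bytes is typical; the sizes here are assumptions):

    using System.Runtime.InteropServices;
    using System.Threading.Tasks;

    // Each slot occupies its own 64-byte cache line, so one thread's writes
    // no longer invalidate the line that holds a neighbouring thread's slot.
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    struct PaddedCounter
    {
        [FieldOffset(0)] public long Value;
    }

    class NoFalseSharing
    {
        static void Main()
        {
            PaddedCounter[] slots = new PaddedCounter[4];
            Parallel.For(0, 4, t =>
            {
                for (int i = 0; i < 10000000; i++)
                    slots[t].Value++;     // hot write, on a private cache line
            });
        }
    }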
That could give you a performance boost, but it likely won't get you to 16x; there are lots of things going on under the hood and you'll just have to knock them out one by one. And really, it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that a multithreaded run has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really quickly when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have an opportunity for speedup.

Should I be concerned about a memory leak?

I'm new to writing Windows services. I decided to write one that makes outbound calls through Twilio. I am utilizing using statements whenever I use a resource that implements IDisposable. I've run the service for a total of four hours so far, and here is a look at my memory usage:
Start: 9k
15 min: 10k
30 min: 13k
1 hr: 13k
2 hr: 13k
3 hr: 13k
After 30 minutes it seems to be consistent (between 13,100 and 13,200), but I am not sure why resources were still being allocated throughout the first 30 minutes. The OnStart() method initiates 4 timers and a few small objects. The construction of my objects certainly does not take 30 minutes. The timers just wait for a specific time, execute a query, queue the results with Twilio, and wait for the next event.
Should I be concerned about a memory leak at this point? Is this normal for such an application?
No, it doesn't look like you need to be concerned about a memory leak.
On a machine with several gigabytes of memory available, consumption of 13k of memory is trivially small. If this grows steadily and never decreases, then you have a leak; otherwise, you're fine.
It's worth remembering that strings in the CLR are immutable, so every time you "change" a string, a new copy is created and the memory allocated to the old version is marked as unused. So most programs churn through a bit of memory just in their usual day-to-day use; this is normal, and only something to be concerned about in edge conditions such as very tight loops, huge collections, or both.
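A tiny illustration of that churn (counts are arbitrary):

    using System.Text;

    class StringChurn
    {
        static string Slow(int n)
        {
            // Each += allocates a brand-new string and abandons the old one:
            // roughly n temporaries for the GC to sweep up later.
            string s = "";
            for (int i = 0; i < n; i++) s += ",";
            return s;
        }

        static string Fast(int n)
        {
            // A StringBuilder grows one internal buffer instead.
            StringBuilder sb = new StringBuilder(n);
            for (int i = 0; i < n; i++) sb.Append(',');
            return sb.ToString();
        }
    }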
Even then, the .NET garbage collector (GC) does a great job of sweeping up and consolidating this old memory from time to time.
There are some situations where strings (and other objects) can be allocated memory (and other resources such as file handles) that are not freed after use, and that's where you need to use Dispose().
An educated guess might be that the framework still allocates some things when you do HTTP requests and such.
I wouldn't be worried at this point, but if you really want to, you can always use CLR Profiler or another .NET memory profiler to see what's going on and if it's something to worry about.

C# memory leak?? Gen 0 and 1 increasing constantly in perfmon - What does it mean?

I am using C# 2.0 for a multi-threaded application that receives at least a thousand callbacks per second from an unmanaged DLL and periodically sends messages out of a socket. The GUI remains on the main thread.
My application mostly creates objects at startup, and periodically creates short-lived objects during execution.
The problem I am experiencing is a periodic latency spike (measured by timestamping a function at start and end), which I suppose happens when the GC runs.
I ran perfmon and here are my observations...
Gen 0 heap size is flat, with a spike every few seconds.
Gen 1 heap size is always on the move, up and down.
Gen 2 heap size follows a cycle: it keeps increasing until it flattens out for a while, and then drops.
Gen 0 and gen 1 collection counts are always increasing, in a range of 1 to 5 units.
The gen 2 collection count is constant.
I recommend using a memory profiler in order to know if you have a real memory leak or not. There are many available and they will allow you to quickly isolate any issue.
The garbage collector is adaptive and will modify how often it runs in response to the way your application is using memory. Just looking at the generation heap sizes is going to tell you very little in terms of isolating the source of any problem. Second-guessing how it works is a bad idea.
RedGate Ants Memory Profiler
SciTech .NET Memory Profiler
EQATEC .NET Profiler
CLR Profiler (Free)
So, as @Jalf says, there's no evidence of a memory "leak" as such: what you describe is closer to latency caused by garbage collection.
Others may disagree, but I'd suggest anything above a few hundred callbacks per second is going to stretch a general-purpose language like C#, especially one that manages memory on your behalf. So you're going to have to get smart with your memory allocation and give the runtime some help.
First, get a real profiler. Perfmon has its uses but even the profiler in later versions of Visual Studio can give you much more information. I like the SciTech profiler best (http://memprofiler.com/); there are others including a well respected one from RedGate reviewed here: http://devlicio.us/blogs/scott_seely/archive/2009/08/23/review-of-ants-memory-profiler.aspx
Once you know your baseline, aim to eliminate gen2 collections. They'll be the really slow ones. Work hard in any tight loops to eliminate as much memory allocation as you can -- strings are the usual offenders.
Some useful tips are in an old but still relevant MSDN article here: http://msdn.microsoft.com/en-us/library/ms973837.aspx.
It's also good to read Tess Ferrandez's (outstanding) blog series on debugging ASP.NET applications - book a day out of the office and start here: http://blogs.msdn.com/b/tess/archive/2008/02/04/net-debugging-demos-information-and-setup-instructions.aspx.
I remember reading a blog post about memory performance in .NET (specifically, XNA on the Xbox 360) a while ago (unfortunately I can't find the link anymore).
The nutshell of achieving low-latency memory performance is to make sure you never run gen 2 GCs at a performance-critical time (although it is OK to run them when latency is not important; there are a bunch of notification callback functions on the GC class in more recent versions of the framework which may help with this - see the sketch after the list below). There are two ways to make sure this happens:
Don't allocate anything that escapes to gen 2. It's alarmingly easy for objects to escape to gen 2 when you don't realise it, so this often translates into: don't allocate anything in performance-critical code. Because no objects escape to gen 2, the GC doesn't need to collect.
Allocate everything you need upfront and use object pooling. Your gen 2 heap will be big, but because nothing is being added to it, the GC doesn't need to collect it.
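The notification hooks referred to above are GC.RegisterForFullGCNotification and friends (available since .NET 3.5 SP1; note that historically they required concurrent GC to be disabled in config). A hedged sketch:

    using System;
    using System.Threading;

    class FullGcWatcher
    {
        public static void Start()
        {
            // Ask the runtime to signal when a full (gen 2) collection is
            // approaching; the 1-99 thresholds tune how early that is.
            GC.RegisterForFullGCNotification(10, 10);

            Thread watcher = new Thread(() =>
            {
                while (true)
                {
                    if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                    {
                        // Full GC imminent: pause latency-critical work here.
                    }
                    if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
                    {
                        // Full GC finished: resume latency-critical work.
                    }
                }
            });
            watcher.IsBackground = true;
            watcher.Start();
        }
    }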
It may be helpful to look into some XNA or Silverlight-related performance articles, because games and resource-constrained devices are often very latency-sensitive. (Note that you have it easy, because the Xbox 360 and, until Mango, Windows Phone only had a single-generation mark-and-sweep GC.)
