I'm writing a high-ish volume web service in C# running in 64-bit IIS on Win 2k8 (.NET 4.5) that works with XML payloads and does a variety of operations on small and large objects (where the large objects are mainly strings, some over 85k (so going onto the LOH)). Requests are stateless, and memory usage remains steady over time. Lots of memory is being allocated and released per request, no memory appears to be being leaked.
Operating at a maximum of 25 transactions per second, with an average call lasting 5s, it's spending 40-60% of it's time in GC according to two profiling tools, and perfmon shows a steady 20 G0 and G1 collections over 5 seconds, and 15 G2 collections over 5 seconds - meaning lots of (we think) premature promtion into G2 for data that we'd expect to stay in G0. Everything I read indicates this is very excessive. We expect that the system should be able to perform at a higher throughput than 25 tps and assume the GC activity is preventing this.
The machines serving the requests have lots of memory - 16GB - and the application, under load, consumes at most 1GB when under load for an hour. I understand that a bigger heap won't necessarily make things better, but there is spare memory.
I appreciate this is light on specifics (will try to recreate the conditions with a trivial application if time permits) - but can anyone explain why we see so much G2 GC activity? Should I be focusing on the LOH? People keep telling me that the CLR's GC "adapts" to your load, but it's not changing it's behavior in this case and, unlike other runtimes, there seems to be little I can do to tune it (have tried workstation GC, but there is very little observable difference).
Microsoft decided to design the String class so that all strings are stored in memory as a monolithic sequence of characters. While this works well for some usage patterns, it works dreadfully for others.
One thing I've found very helpful is to avoid creating instances of String whenever possible. If a method will often be used to operate on part of a supplied string, and will in turn ask other methods to operate on parts of it, the methods should accept arguments specifying the range of the String upon which they should operate. This will avoid the need for callers of the first method to use Subst to construct a new String for the method to act upon, and will avoid the need to have the method call Subst to feed portions of the string to its callers. In some cases where I have used this technique, the creation of thousands of String instances--some quite large--could be replaced with zero.
CLR's GC "adapts" to your load
It can't know how much memory you are willing to tolerate as overhead. Here, you probably want to give the app like 5GB of heap so that collections are much rarer. The GC has no built-in tuning knobs for that (subjective note: that's a pitty).
You can force bigger heap sizes by using one of the low latency modes for short durations. That should cause the GC to try hard to avoid G2 collections. Monitor the RAM usage and disable low latency mode when consumption reaches 5GB.
This is a risky strategy but it's the best I think you can do.
I would not do it. You can maximally gain 2x throughput. Your CPU is maxed out, right? Workstation GC does not scale to multiple cores and leaves CPUs unused.
Related
I have a server app which receives lots of UDP packets from several sensors 50 times per second (new packet every 20ms), does some analysis, stores them, does some debug logging and other "server stuff".
The problem is that the GC performs a full blocking collection every once in a while, suspending all threads for up to 200ms (perhaps even more in some rare percentiles). I don't have a problem with lagging behind each packet for couple of milliseconds (even sustained latency of say 10ms for each packet wouldn't be an issue), but long suspends are really annoying.
According to MSDN, there is the SustainedLowLatency mode for the GC, which, according to MSDN:
Enables garbage collection that tries to minimize latency over an extended period. The collector tries to perform only generation 0, generation 1, and concurrent generation 2 collections.
Full blocking collections may still occur if the system is under memory pressure.
Does "extended period" still mean I cannot simply set the mode to SustainedLowLatency and forget it?
Is there any way to prevent full blocking collections, at least for a single core?
Basically, this mode will try to use only background gen 2 collections, and do a full GC only when the system starts lacking memory. The delay between blocking collections depends entirely on your code: the less you use the gen 2 the less the memory will be fragmented, and the longer you'll last. Unfortunately, sockets use pinned buffers, which are a typical cause of memory fragmentation. You can try to reuse your buffers, but it involves writing tricky low-level code with the socket API.
The only way to prevent full blocking collections is to use TryStartNoGCRegion. There again, how much you can last depends on how much memory you allocate. Use small objects. Use structs instead of classes whenever possible. Use pooling whenever you need large arrays.
Profiling (for instance, with Jetbrains dotTrace) will help you spot and optimize codepaths that allocate a lot of memory.
Recently I've been analyzing how my parallel computations actually speed up on 16-core processor. And the general formula that I concluded - the more threads you have the less speed per core you get - is embarassing me. Here are the diagrams of my cpu load and processing speed:
So, you can see that processor load increases, but speed increases much slower. I want to know why such an effect takes place and how to get the reason of unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code as soon as code does nothing more than
Loads data from RAM (server has 96 GB of RAM, swap file shouldn't be hit)
Performs not complex calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks - looks like each operation becomes slower as thread number grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.
You will always get this kind of curve, it's called Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
The key to a linear scalability - in the context of where going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core resource)
tie every thread to a specific core (otherwise the OS will juggle the
threads between cores)
don't use more threads than there are cores (the OS will swap in and
out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely
necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly does local L1 & L2 memory accesses it's a better performance bet to tie each thread to its own core. Sure the OS will come in but compared to the work being performed by your threads the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).
A general decrease in return with more threads could indicate some kind of bottle neck.
Are there ANY shared resources, like a collection or queue or something or are you using some external functions that might be dependent on some limited resource?
The sharp break at 8 threads is interesting and in my comment I asked if the CPU is a true 16 core or an 8 core with hyper threading, where each core appears as 2 cores to the OS.
If it is hyper threading, you either have so much work that the hyper threading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data through put.
Are the work performed by the threads even or are some threads doing more than others, that could also indicate resource starvation.
Since your added that threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?
If you are doing memory intensive stuff, you could be hitting cache capacity.
You could maybe test this with mock algorithm which just processes same small bit if data over and over so it all should fit in cache.
If it indeed is cache, possible solutions could be making the threads work on same data somehow (like different parts of small data window), or just tweaking the algorithm to be more local (like in sorting, merge sort is generally slower than quick sort, but it is more cache friendly which still makes it better in some cases).
Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread 1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache as dirty and update it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
That could give you a performance boost, but likely won't get you to 16x—there are lots of things going on under the hood and you'll just have to knock them out one-by-one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.
I'm new to writing Windows Services. I decided to write one that makes outbound calls through Twilio. I am utilizing using statements when I use a resource which implements IDisposable. I ran the service for a total of four hours so far and here is a look at my memory usage:
Start: 9k
15 Min: 10k
30 Min: 13k
1 hr: 13k
2 hr: 13k
3 hr: 13k
After an 30 minutes it seems to be consistent (between 13,100 and 13,200) but I am not sure why resources are still being allocated after the first 30 minutes. The OnStart() method initiates 4 timers and a few small objects. The construction of my objects certainly does not take 30 minutes. The timers just wait for a specific time, execute a query, then queue the results with Twilio and wait for the next event.
Should I be concerned about a memory leak at this point? Is this normal for such an application?
No, it doesn't look like you need to be concerned about a memory leak.
On a machine with several gigabytes of memory available, consumption of 13k of memory is ... trivially small. If this grows steadily and never decreases then you have a leak: otherwise, you're fine.
It's worth remembering that strings in the CLR are invariant, so every time you "change" a string a new copy is created and the memory allocated to the old version is marked as unused. So most programs churn through a bit of memory just in their usual day-to-day use: this is normal and only something to be concerned about in edge conditions such as very tight loops or huge collections or both.
Even then, the .NET garbage collector (GC) does a great job of sweeping up and consolidating this old memory from time to time.
There are some situations where strings (and other objects) can be allocated memory (and other resources such as file handles) that are not freed after use, and that's where you need to use Dispose().
An educated guess might be that the framework still allocates some things when you do HTTP requests and such.
I wouldn't be worried at this point, but if you really want to, you can always use CLR Profiler or another .NET memory profiler to see what's going on and if it's something to worry about.
I am using C# 2.0 for a multi-threaded application that receives atleast thousand callbacks per second from an unmanaged dll and periodically send messages out of socket. GUI remains on main thread.
My application mostly creates object at the startup and periodically during the execution for a short lived period.
The problem I am experiencing is periodic latency spike (measured by time stamping a function at start and end) which I suppose happen when GC run.
I ran perfmon and here are my observations...
Gen0 Heap Size is flat with a spike every few seconds with periodic spike.
Gen1 Heap Size is always on the roll. Up and down
Gen2 Heap Size follows a cycle. It keep increasing till it becomes flat for a while and then drops.
Gen 0 and 1 Collections are always increasing in a range of 1 to 5 units.
Gen 2 collections is constant.
I recommend using a memory profiler in order to know if you have a real memory leak or not. There are many available and they will allow you to quickly isolate any issue.
The garbage collector is adaptive and will modify how often it runs in response to the way your application is using memory. Just looking at the generation heap sizes is going to tell you very little in terms of isolating the source of any problem. Second quessing how it works is a bad idea.
RedGate Ants Memory Profiler
SciTech .NET Memory Profiler
EQATEC .NET Profiler
CLR Profiler (Free)
So as #Jalf says, there's no evidence of a memory "leak" as such: what you discuss is closer to latency caused by garbage collection.
Others may disagree but I'd suggest anything above a few hundred callbacks per second is going to stretch a general purpose language like C#, especially one that manages memory on your behalf. So you're going to have to get smart with your memory allocation and give the runtime some help.
First, get a real profiler. Perfmon has its uses but even the profiler in later versions of Visual Studio can give you much more information. I like the SciTech profiler best (http://memprofiler.com/); there are others including a well respected one from RedGate reviewed here: http://devlicio.us/blogs/scott_seely/archive/2009/08/23/review-of-ants-memory-profiler.aspx
Once you know your baseline, aim to eliminate gen2 collections. They'll be the really slow ones. Work hard in any tight loops to eliminate as much memory allocation as you can -- strings are the usual offenders.
Some useful tips are in an old but still relevant MSDN article here: http://msdn.microsoft.com/en-us/library/ms973837.aspx.
It's also good to read Tess Ferrandez's (outstanding) blog series on debugging ASP.NET applications - book a day out of the office and start here: http://blogs.msdn.com/b/tess/archive/2008/02/04/net-debugging-demos-information-and-setup-instructions.aspx.
I remember reading a blog post about memory performance in .NET (specifically, XNA on the XBox 360) a while ago (unfortunately I can't find said link anymore).
The nutshell of achieving low latency memory performance was to make sure you never run gen 2 GC's at a performance critical time (although it is OK to run them when latency is not important; there are a bunch of notification callback functions on the GC class in more recent versions of the framework which may help with this). There are two ways to make sure this happens:
Don't allocate anything that escapes to gen 2. It's alarmingly easy for objects to escape to gen 2 when you don't realise it, so this often translates into: don't allocate anything in performance critical code. Because no objects escape to gen 2, the GC doesn't need to collect.
Allocate everything you need upfront and use object pooling. Your gen 2 heap will be big but because nothing is being added to it, the GC doesn't need to collect it.
It may be helpful to look into some XNA or Silverlight related performance articles because
games and resource constrained devices are often very latency sensitive. (Note that you have it easy because the XBox 360 and, until Mango, Windows Phone only had a single generation GC (mark-and-sweep collector)).
I was hoping someone could explain why my application when loaded uses varying amounts of RAM. I'm speaking about a compiled version that uses the exe directly. It's a pretty basic applications and there are no conditional branches in the startup of the application. Yet every time I start it up the RAM amount varies from 6MB-16MB.
I know it's on the small end of usage anyways but I'm curious of why this happens.
Edit: to give a bit more clarification on what the app actually does.
It is a WinForm project.
It connects to a database using sqlclient to retrieve a list of servers.
Based on that list a series of buttons are created to start and stop a service on those servers.
Using the System.Timers class to audit the status of the services on those servers every 20 seconds.
The applications at this point sits there and waits for user input via one of the button clicks to start/stop the service.
The trick here is that the amount of RAM reported by the task schedule is not the amount of RAM used by your application. Rather, it is the amount of RAM reserved for use by your application.
Remember that with managed frameworks like .Net, you don't request or release memory directly. Rather, a garbage collector manages the memory for you. The amount of memory reserved for your application at a given time can vary and depends on a lot of different factors, including memory pressure created at the time by other programs.
Think of it this way: if you need 10 MBs of RAM for your app, is it faster to request and return it to the operating system 1 MB at a time over 10 requests/releases or reserve the block at once with one request/release? Now extend that to a scenario where you don't know exactly how much RAM you'll need, only that it's somewhere in the neighborhood of 10 MB. Additionally, your computer has 1 GB sitting there unused. Of course the best thing to do is take a good-sized chunk of that available RAM. Even 20 or 30 MB wouldn't be unreasonable relative to the ram that's sitting there unused, because unused RAM is wasted performance.
If your system later starts to feel some memory pressure then .Net can easily return some RAM to the system. This is one of the ways managed languages can sometimes give better performance than languages like C++ with traditional memory management: a garbage collector that can more easily take the entire system health into account when allocating memory.
What are you using to determine how much memory is being "used". Even with regular applications Windows will aggressively allocate unused memory in advance, with .NET applications it's even more complicated as to how much memory is actually being used, and how much Windows is just tacking on so that it will be available instantly when needed. If another application actually asks for memory this reserved memory will be repurposed.
One way to check is to minimize the application (at least on XP). If you are looking at the memory use in something like task manager you'll notice it drops off right away, eliminating the seemly "random" amount allocated.
It may be related to the jitter, after the first load the jitter already created a compiled version and it doesn't need to run. Other than that you would have to give us some more details about the app and which kind of memory you are referring to.