Troubleshooting creeping CPU utilization on Azure websites

Troubleshooting creeping CPU utilization on Azure websites - c#

As the load on our Azure website has increased (along with the complexity of the work that it's doing), we've noticed that we're running into CPU utilization issues. CPU utilization gradually creeps upward over the course of several hours, even as traffic levels remain fairly steady. Over time, if the Azure stats are correct, we'll somehow manage to get > 60 seconds of CPU per instance (not quite clear how that works), and response times will start increasing dramatically.
If I restart the web server, CPU drops immediately, and then begins a slow creep back up. For instance, in the image below, you can see CPU creeping up, followed by the restart (with the red circle) and then a recovery of the CPU.
I'm strongly inclined to suspect that this is a problem somewhere in my own code, but I'm scratching my head as to how to figure it out. So far, any attempts to reproduce this on my dev or testing environments have proven ineffectual. Nearly all the suggestions for profiling IIS/C# performance seem to presume either direct access to the machine in question or at least a "Cloud Service" instance rather than an Azure Website.
I know this is a bit of a long shot, but... any suggestions, either for what it might be, or how to troubleshoot it?
(We're using C# 5.0, .NET 4.5.1, ASP.NET MVC 5.2.0, WebAPI 2.2, EF 6.1.1, Azure System Bus, Azure SQL Database, Azure redis cache, and async for every significant code path.)
Edit 8/5/14 - I've tried some of the suggestions below. But when the website is truly busy, i.e., ~100% CPU utilization, any attempt to download a mini-dump or GC dump results in a 500 error, with the message, "Not enough storage." The various times that I have been able to download a mini-dump or GC dump, they haven't shown anything particularly interesting, at least, so far as I could figure out. (For instance, the most interesting thing in the GC dump was a half dozen or so >100KB string instances - those seem to be associated with the bundling subsystem in some fashion, so I suspect that they're just cached ScriptBundle or StyleBundle instances.)

Try remote debugging to your site from visual studio.
Try https://{sitename}.scm.azurewebsites.net/ProcessExplorer/ there you can take memory dumps an GC dumps of your w3wp process.
Then you can compare 2 GC dumps to find memory leaks and open the memory dump with windbg/VS for further "offline" debugging.

Related

Webjob getting 'System.OutOfMemoryException'

I have a webjob running on Azure on their S3 standard platform meaning it has 7 gb of ram available to run my app.
On the machine 3 jobs are running of which one is the one doing all the processing and the other two handles small tasks. My problem lies in the fact that I on certain memoryintensive large tasks gets a memory exception meaning that results in the given job crashing.
The job I try to run is a very memory intensive job, and requires around 1,5 gb of ram, but based on the graph below I do not understand how this should be a problem since I never am above 2.2gb of used ram for the app service. I do have to add, that I run 3 instances, so it might be that one instance is using way more memory, but I can not find anywhere to view that information.
Memory consumption on server
When I look in the process explorer in the Kudo I see I use around 1.3gb of ram currently, which is still way below the needed memory for the job.
Kudo screenshot
The job has run without any problems no more than 2 days ago on the same server setup, so I am completely lost as to where to look.
Update: Code works fine in visual studio with same data running the same exact task.
Do anyone have ideas as to how to approach this problem

Per my understanding, you could capture the memory dump in your azure app service and analyze the dump to narrow this issue. You could refer to this tutorial about how to get a full memory dump in Azure App Services. Also, you could leverage the Crash Diagnoser extension to monitor the CPU and Memory, for more details you could refer to this blog.

Well well well first of all how do you deal with Garbage collector ?! i mean do you dispose disposable objects after they will do their tasks. You said app is being run for 2 days , seems app has clashed with more intensive memory loading stuff. As you said your app is "very memory intensive" , i guess you should fetch it out (source code) and make sure you are managing objects correctly, because garbage collector couldn't care all of your "source code mess". Good luck.

.NET Website Huge Peak Working Set Memory

I have an asp.net/c# 4.0 website on a shared server.
The os Windows Server 2012 is running IIS 7, has 32GB of ram and 3.29GHz of processor.
The server is running into difficulty now and again, such as problems RDP'ing and other PHP websites running slow.
Our sys admin has suggested my website as being a possible memory hog and cause of these issues.
At any given time the sites "Memory (private working set)" is 2GB and can peak as high as 15GB.
I ran a trial version of JetBrains dotMemory on the server, attached to the website's w3wp.exe process. My first time using this program, I am a complete novice here.
I took two memory snapshots using dotMemory.
And the basic snapshot comparison can be seen here: http://i.imgur.com/0Jk8yYE.jpg
From the above comparison we can see that System.String and BusinessObjects.Item are the two items with the most survived bytes.
Drilling down on system.string I could see that the main dominating survived object was System.Web.Caching.CacheEntry with 135MB. See the screengrab: http://i.imgur.com/t1hs8nd.jpg
Which leads me to suspect maybe I cache too much?
I cache all data that comes out of my database: HTML Page Content, Nav-Menu Items, Relationships between pages & children, articles etc. Using HttpContext.Current.Cache.Insert.
With the cache timeout set to 10080 minutes.
My Questions:
1 - is 2GB Memory (private working set) and a peak as high as 15GB to much for a server with 32 GB Ram?
2 - Have I used dotMemory correctly to identify an issue?
3 - Is my caching an issue?
4 - Possible other causes?
5 - Possible solutions?
Thanks

Usage of a large amount of memory itself can not slow down a program. It can be slowed down
if windows actively using swap file. Ask an admin to check this case
if .NET program causes garbage collecting too much (produces too much memory traffic)
You can see a memory traffic information in dotMemory. Unfortunately it can't gather such information if it attaches to the program, the only way to collect object creation stack traces and memory traffic information is to launch your program under dotMemory profiler, enabling corresponding settings.
BUT!
The problem is you can't evaluate is memory traffic amount "N" high or no, the only way to find a root of a performance problem is using performance profiler.
For example JetBrains dotTrace can show you, how much time you program spends in garbage collecting, and only in case this is a bottle neck, you should use memory profiler to find a root of traffic.
Conclusion: try to find a bottle neck using performance profiler first.
Then, if you still have a questions about dotMemory, ask me, I'll try to help you :)

Performance issues when running site in visual studio vs profiling

This is a very strange issue that I am experiencing, and almost goes against anything logical that I can think of. I am currently profiling a website which we are building, which sometimes takes 5 seconds for a page to load. This happens both on IIS, and Visual Studio Development Server. However, when I profile it using ANTS Performance Profiler, it performs up to 5x faster, and loads in less than a second.
I am quite baffled as to why this can happen, because as far as I know, profiling should increase the time, not decrease it. Anyone could maybe shed some light on this?
Site is developed in Visual Studio 2010, ASP.Net v4.0, C#.

This is interesting as its very rare (I work on ANTS support). The main difference ANTS imparts on a process is permissions (since (usually) the process is fully initiated by ANTS and inherits the permissions). We have some routines that optimise the start-up procedure but I've never heard of a speed-up like this. Using Taskmanager, take a look at the login account that the process runs under ANTS and normally- then try to run your process under the account that ANTS uses. You may find this helps to explain the speed-up.

Performance testing needs to be done in carefully controlled setting. Things like system file cache, network, machine load, NGEN status, virus scanner could affect perf result.
Use Perfview to understand how 5s is spent (could be waiting for disk IO):
http://www.microsoft.com/en-us/download/details.aspx?id=28567

C# service with Oracle .NET provider gets slower and slower

I have a C# service written for .NET 2.0 that uses the Oracle data access provider for .NET 2.102.2.20. The service runs multiple threads and runs lots of queries to an Oracle 9.2 database. I am using NHibernate.
What I am seeing is that when it starts up it runs fast, then gets slower and slower. The CPU usage starts low then goes up and up. After a few minutes it is crawling and the CPU is at 100%. I have looked in my code and find nothing that could be doing this. Percent time in GC is < 5%. I have tried varying the ODP.NET parameters to no avail.
Anyone have an idea what could be doing this?
More detail:
The threads are worker threads and need to keep running all the time. I have no runaway threads, they are doing real work. I have profiled the program and it looks like it is spending a lot of its time inside the Oracle provider, which you would expect, but why would it use so much CPU? It's as if it's spinning waiting for resultsets or something, but it doesn't happen right away, only after a while.
Update:
It may be something to do with the .NET CLR on the server.
A multi-threaded test program that does a lot of string manipulation also shows the same behavior on this machine, starting out fast, then slowing down over the course of 15 minutes to about 1/3 of the speed. The test program does not show this slowdown behavior on a another identical server with the same OS version and the same (we think) .NET CLR version.
Update:
It is now looking like this is a problem with the server overheating and thermal protection kicking in and slowing down the CPUs' frequency.
Update:
A firmware update for the server fixed this. I still think it was an overheating problem because if I stopped running the process and let the server "rest" for a little while it would start running fast when I restarted the process, but if I killed the process and restarted it right away the process would start already running slow. My guess is that the firmware control for the fans was malfunctioning so the CPUs would heat up and slow down. If I stopped running for a little while they would cool down and run fast again, until they heated up again.

Sounds like your threads aren't terminated and keep running all the time or you've some serious memory leaks. Without more information, this could be anything.

If CPU is at 100% you're probably facing a runaway thread. You can diagnose this using WinDbg + SoS. Attach to the process and use the !runaway command to get an overview of how much CPU each thread is using. Then use !clrstack to find out what the runaway thread is doing. Let me know if you need additional details.

I'm getting the exact same issue. The OracleConnection gets progressively slower and slower. What's interesting is that if I call:
cn.Close();
OracleConnection.ClearPool(cn);
every time, it never slows down.
It must have something to do with the oracle connection (caching??)

Windows Service Increasing CPU Consumption

At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contain a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (ie, they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's a example of behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen
immediately preceding the timer
callback, within the timer callback,
or immediately following the timer
callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage is seeming to go up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service, which would let me rule out the timer as a problem (which I absolutely intend to try), but can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in
the question -- that your problems
would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shutdown and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar
as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.

You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration?)
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.

Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)

It's obviously pretty difficult to remotely debug you're unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that your using up most of the RAM and causing the disk the trash, or some similar problem.
Best of luck!

I suggest to hack the problem into pieces.
First, find a way to reproduce the problem 100% of the times and quickly. Lower the timer so that the services fire up more frequently (for example, 10 times quicker than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to real time or to real work done by the services). And you will be able to do the next steps quicker than once a day.
Second, comment out all the real work code, and let only the services, the timers and the synchronization mechanism. If the problem still shows up, than it will be in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.

'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue it turned out a 3rd party OSS libray we were using stored handles to unmanaged objects in static memory. The longer the service ran the more handles the service picked up which caused the process' CPU performance to nose-dive very quickly. The way to try and resolve this sort of issue to ensure your services store nothing in memory inbetween the timer invocations, although if your 3rd party libraries use static memory you might have to do something clever like create an app domain for the timer invocation and ditch the app doamin (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
Hope this helps, but apologies up front if both my thoughts are red herrings.

Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do seperate things, but they're working in unison; you're just using the filesystem, database, etc as a substitution for method calls. Also, 2003? I feel bad for you.

Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section start and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.