C# service with Oracle .NET provider gets slower and slower - c#

I have a C# service written for .NET 2.0 that uses the Oracle data access provider for .NET 2.102.2.20. The service runs multiple threads and runs lots of queries to an Oracle 9.2 database. I am using NHibernate.
What I am seeing is that when it starts up it runs fast, then gets slower and slower. The CPU usage starts low then goes up and up. After a few minutes it is crawling and the CPU is at 100%. I have looked in my code and find nothing that could be doing this. Percent time in GC is < 5%. I have tried varying the ODP.NET parameters to no avail.
Anyone have an idea what could be doing this?
More detail:
The threads are worker threads and need to keep running all the time. I have no runaway threads, they are doing real work. I have profiled the program and it looks like it is spending a lot of its time inside the Oracle provider, which you would expect, but why would it use so much CPU? It's as if it's spinning waiting for resultsets or something, but it doesn't happen right away, only after a while.
Update:
It may be something to do with the .NET CLR on the server.
A multi-threaded test program that does a lot of string manipulation also shows the same behavior on this machine, starting out fast, then slowing down over the course of 15 minutes to about 1/3 of the speed. The test program does not show this slowdown behavior on a another identical server with the same OS version and the same (we think) .NET CLR version.
Update:
It is now looking like this is a problem with the server overheating and thermal protection kicking in and slowing down the CPUs' frequency.
Update:
A firmware update for the server fixed this. I still think it was an overheating problem because if I stopped running the process and let the server "rest" for a little while it would start running fast when I restarted the process, but if I killed the process and restarted it right away the process would start already running slow. My guess is that the firmware control for the fans was malfunctioning so the CPUs would heat up and slow down. If I stopped running for a little while they would cool down and run fast again, until they heated up again.

Sounds like your threads aren't terminated and keep running all the time or you've some serious memory leaks. Without more information, this could be anything.

If CPU is at 100% you're probably facing a runaway thread. You can diagnose this using WinDbg + SoS. Attach to the process and use the !runaway command to get an overview of how much CPU each thread is using. Then use !clrstack to find out what the runaway thread is doing. Let me know if you need additional details.

I'm getting the exact same issue. The OracleConnection gets progressively slower and slower. What's interesting is that if I call:
cn.Close();
OracleConnection.ClearPool(cn);
every time, it never slows down.
It must have something to do with the oracle connection (caching??)

Related

Debugging a memory leak in a long-running mono/C# application

I have a C# program that runs on an Ubuntu VM as a server using mono 5.10. Every now and then, the program starts using lots of memory until it eventually crashes. It can often take days before such a memory leak occurs, and I haven't been able to reproduce the issue locally.
For the past few days, I've used the mono log profiler to be able to collect heapshots on demand. This has given me a 15Gb+ file. I managed to get a heapshot from when the memory leak occurs, which displays as follows in the Xamarin profiler:
Running mprof-report gives a similarly unhelpful report.
Neither really help debug the problem. Obviously there's something going on with the threadpool, but it seems like there is no way of figuring out what.
Being able to see where the objects are allocated might help, but that requires enabling alloc in the profiler, which is not possible because of the file size it will produce.
What is causing the massive amount of IThreadPoolWorkItems? The app is using 30-40% CPU, so it doesn't seem like tasks are being scheduled more quickly than the CPU can handle.
Is there a way to list the objects referenced by the ThreadPool jobs? That would at least allow me to identify what piece of code is being run so much. mprof-report only shows inverse references as far as I can see, which isn't very helpful as the IThreadPoolWorkItems are obviously owned by the ThreadPool itself.
Update: The tasks aren't (disk) IO bound. While leaking, the reads are in the 10s of kbs per second, which shouldn't be saturating the disk.
On second thought: if there were a lot of tasks scheduled, shouldn't I be seeing lots of individual IThreadPoolWorkItem objects too? Since IThreadPoolWorkItem isn't a struct, there should be a separate entry for the actual object and IThreadPoolWorkItem[] would just be an array of pointers. So this would suggest that those IThreadPoolWorkItem[]s are just arrays of (mostly) null pointers.
Update: ThreadPool.GetAvailableThreads shows 2 worker threads and 0 completion port threads are active, which seems to indicate that the ThreadPool is doing just fine. Custom logging also indicates that once again, I'm not scheduling too many tasks and the tasks that are scheduled are finished quickly.

RabbitMQ: erl.exe taking high CPU usages

I have implemented rabbitmq in my application and it's running on windows server 2008 server, the problem is that erl.exe taking high CPU usages like sometime it reaches 40-45% CPU usages, even in the ideal case (when not processing any queue) it takes at least 4-15% CPU usages.
What could be the reason for taking high CPU usages? Is there any setting or any other thing that I need to do.
You say that even when not processing a queue it is still at 4-15%, but is your application running? If you weren't before, try to monitor erl while no application is using Rabbit.
One thing that comes to mind is that you might be using the QueingBasicConsumer in a loop and that could be contributing to the CPU usage. If you are using QueingBasicConsumer and it is what is causing the hit, try substituting it with EventingBasicConsumer (such that you don't do busy waiting) and see if you have improvement.
Also, how is your application using Rabbit? According to the documentation every IConnection is backed up by a background thread and if you're creating a bunch of connections in your application it could be another reason for the slow down.

Troubleshooting creeping CPU utilization on Azure websites

As the load on our Azure website has increased (along with the complexity of the work that it's doing), we've noticed that we're running into CPU utilization issues. CPU utilization gradually creeps upward over the course of several hours, even as traffic levels remain fairly steady. Over time, if the Azure stats are correct, we'll somehow manage to get > 60 seconds of CPU per instance (not quite clear how that works), and response times will start increasing dramatically.
If I restart the web server, CPU drops immediately, and then begins a slow creep back up. For instance, in the image below, you can see CPU creeping up, followed by the restart (with the red circle) and then a recovery of the CPU.
I'm strongly inclined to suspect that this is a problem somewhere in my own code, but I'm scratching my head as to how to figure it out. So far, any attempts to reproduce this on my dev or testing environments have proven ineffectual. Nearly all the suggestions for profiling IIS/C# performance seem to presume either direct access to the machine in question or at least a "Cloud Service" instance rather than an Azure Website.
I know this is a bit of a long shot, but... any suggestions, either for what it might be, or how to troubleshoot it?
(We're using C# 5.0, .NET 4.5.1, ASP.NET MVC 5.2.0, WebAPI 2.2, EF 6.1.1, Azure System Bus, Azure SQL Database, Azure redis cache, and async for every significant code path.)
Edit 8/5/14 - I've tried some of the suggestions below. But when the website is truly busy, i.e., ~100% CPU utilization, any attempt to download a mini-dump or GC dump results in a 500 error, with the message, "Not enough storage." The various times that I have been able to download a mini-dump or GC dump, they haven't shown anything particularly interesting, at least, so far as I could figure out. (For instance, the most interesting thing in the GC dump was a half dozen or so >100KB string instances - those seem to be associated with the bundling subsystem in some fashion, so I suspect that they're just cached ScriptBundle or StyleBundle instances.)
Try remote debugging to your site from visual studio.
Try https://{sitename}.scm.azurewebsites.net/ProcessExplorer/ there you can take memory dumps an GC dumps of your w3wp process.
Then you can compare 2 GC dumps to find memory leaks and open the memory dump with windbg/VS for further "offline" debugging.

Callbacks are slow from com server

I am using a native COM server from a .Net C# application
Everything works fine except that callbacks from the COM server into the .Net application gradually becomes slower. The server and .Net application is always running on the same machine.
Calling from .Net to the COM server is always fast.
The strange thing is that it doesn't happen on all computers even if they run the same binaries.
I have spent a lot of time on this issue. Compared environments where the callbacks are fast with ones where it is slow without finding anything special.
The callbacks start fast but get exponentially slower over time.
Doesn't matter if the callbacks are assigned to .Net methods or not.
(There is a server switch to turn all callbacks off. That is how I know the problem is with the callbacks)
The slow computers use Window 7 64-bit but the same configuration is fast on other computers.
There are slow and fast computers on the same domain and network
Doesn't matter if the user is local admin or standard user
I have monitored Disk/Network activity but there is no difference between slow and fast
There is no noticeable difference in memory consumption
Looked at CLR memory from WinDbg but found nothing strange
Some things I have noted:
The server process use 100% CPU when the callbacks are slow.
Looked at the call stack with process explorer and the server is most of the time in one of the RPC Ndr* functions, i.e. NdrClientCall2.
I am now out of ideas and need some help to solve this.

Windows Service Increasing CPU Consumption

At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contain a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (ie, they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's a example of behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen
immediately preceding the timer
callback, within the timer callback,
or immediately following the timer
callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage is seeming to go up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service, which would let me rule out the timer as a problem (which I absolutely intend to try), but can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in
the question -- that your problems
would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shutdown and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar
as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration?)
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)
It's obviously pretty difficult to remotely debug you're unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that your using up most of the RAM and causing the disk the trash, or some similar problem.
Best of luck!
I suggest to hack the problem into pieces.
First, find a way to reproduce the problem 100% of the times and quickly. Lower the timer so that the services fire up more frequently (for example, 10 times quicker than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to real time or to real work done by the services). And you will be able to do the next steps quicker than once a day.
Second, comment out all the real work code, and let only the services, the timers and the synchronization mechanism. If the problem still shows up, than it will be in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue it turned out a 3rd party OSS libray we were using stored handles to unmanaged objects in static memory. The longer the service ran the more handles the service picked up which caused the process' CPU performance to nose-dive very quickly. The way to try and resolve this sort of issue to ensure your services store nothing in memory inbetween the timer invocations, although if your 3rd party libraries use static memory you might have to do something clever like create an app domain for the timer invocation and ditch the app doamin (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
Hope this helps, but apologies up front if both my thoughts are red herrings.
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do seperate things, but they're working in unison; you're just using the filesystem, database, etc as a substitution for method calls. Also, 2003? I feel bad for you.
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section start and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.

Categories