(Read)File I/O jitter - C#

I have an application (C# .NET 3.5 and .NET 2.0) that performs multiple ReadFile operations. However, the system shows hiccups (jitter) every now and then. I attached the VTune profiler and performed a Locks & Waits analysis; see the first image below.
The locks and waits analysis showed that a "Sync Object: Stream filepath" causes the application to be locked (waiting) on all threads. CPU utilization drops to 0% during this period.
Next, I used Sysinternals Process Monitor to log which operation was being performed when the hiccups occurred. It shows a file read operation that takes approx. 1 second, but only occasionally (jitter). See the second image.
I am puzzled. What could cause this jitter in file I/O? It is a synchronous read. I tried reducing the read buffer from 32,768 bytes to 4,096 bytes, but this did not change anything. It may be important to note that the machine used to collect these numbers has an SSD; however, we see similar hiccups on machines without SSDs.
Any leads on where to look would be welcome.
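For reference, a minimal sketch of the kind of synchronous read loop in question, assuming a plain FileStream (the path and the 4,096-byte buffer size are illustrative values from the question):

```csharp
using System.IO;

class ReadLoop
{
    // Reads the file sequentially with a 4,096-byte buffer; any stall in
    // FileStream.Read (filter drivers, AV scans, cache misses) shows up
    // here as a blocked thread.
    static void ReadAll(string path)
    {
        byte[] buffer = new byte[4096];
        using (FileStream fs = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
        {
            int bytesRead;
            while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // process bytesRead bytes here
            }
        }
    }
}
```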

This question needs an update. I will post it in the form of an answer, as I have solved the issue, though not in a way that lets me say for sure what the original problem was.
I have tried a lot of things to find out what caused the occasional spike in file read duration. First of all, virus scanners matter; McAfee especially caused some trouble. The comments on the question hinted at this already, and #remus rusanu's tip to use the WPA/WPR combo showed this as well. The WPA/WPR combo pleasantly surprised me and is a valuable tool next to VTune and ProcMon. The first image shows a spike in McAfee activity just before some long-duration flushes and reads (>1 s) start. The second shows that all information in WPA is nicely linked across all graphs. A nice and strong tool when searching for that needle in the haystack.
Yet, when I uninstalled the virus scanning software, spikes still occurred. Less frequently, and shorter in duration, but still visible in the application. I tried numerous things to find out what caused them. I used VMware setups so I could completely strip the system and see whether other processes might be the issue. In the end, I gave up. I implemented a system to work around the issue, and this is sufficient for now. Knowing all the actions I took, I would say there was another conflicting process. Another possibility is the linked unmanaged program, which used mutexes, perhaps doing something problematic. I changed the mutexes to critical sections, but saw no directly visible results, so I gave up on that route.
To conclude, unfortunately I have no direct answer. Due to time constraints I was forced to work around it, and I will probably never know what the root cause of the issue was. I guess that is real life as well.
Thanks for all the tips, I learned some things I will certainly use in the future.

Related

Program not releasing handles

I have a C# program that is doing some heavy lifting. It's creating many objects and running genetic algorithms with them.
Right now I have the issue where the program, when running at full clip (i.e., no attached debuggers or profilers), creates ~300 handles/second. What's more, it doesn't seem to release any of them. After about 15-16 hours, there are no more handles available and the program dies.
The problem I've had with debugging is that my memory management seems to be pretty good based on my profiler (JustTrace). The objects seem to be getting cleaned up (though there are a lot of Gen2 objects lying around after many hours of running), and the program seems to only take up 35 MB of memory at the max. It still holds on to its handles though!
If I close the program, all the handles get released just fine.
Where should I be looking? Is there something that could be holding on to an excess of handles but not be related to the objects in memory?
Edit: Note, when I say "Handles", I mean I open up Task Manager in Windows and look at "Handles" in the "System" box on the Performance tab.
Edit 2: So, a couple of weeks later, I found out it's my antivirus preventing the thread handles from being released. What's strange is that the antivirus we're using shouldn't be doing this kind of process scanning, according to what I've read. What's more, I don't have the ability to exclude my process from the AV scanning!
Is there something I could be doing to give the AV a chance to release the threads? I've tried adding some extra sleeps and that doesn't seem to be working. AV is BitDefender Endpoint Security 2015.
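For anyone chasing something similar, here is a small hedged sketch that logs the process's handle count over time, so growth like the ~300 handles/second described above stands out (the one-second interval is arbitrary):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class HandleWatcher
{
    static void Main()
    {
        Process self = Process.GetCurrentProcess();
        while (true)
        {
            self.Refresh(); // re-read the process counters
            Console.WriteLine("{0:T}  handles: {1}", DateTime.Now, self.HandleCount);
            Thread.Sleep(1000);
        }
    }
}
```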
I usually use JustTrace as well, but I had a similar problem recently where the profiler wasn't telling me the whole story.
The Redgate ANTS profiler can also show you unmanaged memory, which was the problem in my case. Looking at the unmanaged memory may give you a clue as to what's going on.
It turns out my antivirus was preventing the thread handles from being released. I haven't found a way to prevent this from happening except disabling/switching antiviruses. If I find a better way, I'll update here.

Why do .NET threads have inferior performance to separate .NET processes?

Lately I've been observing an interesting phenomenon, and before I reengineer my whole software architecture based on it, I'd like to know why this happens, and if it's perhaps possible to make thread performance on par with process performance.
Generally, the task is to download certain data. If we make one process with 6 threads, based on the Parallel library, the downloads take around 10s.
If we, however, make 6 processes, each being single threaded, and download the same data, the whole thing will only take around 6s.
The numbers are thoroughly verified and statistically significant, so you can take them as given.
The observation holds over a large (100s of trials) dataset and I've observed no deviation from this behavior.
Basically, the question is: why is a non-synchronizing multithreaded process slower than several separate processes running the exact same code, and how can it be fixed?
Thanks in advance!
Note: I've read similar questions but the answers haven't been satisfactory and practical.
My guess is the same as svick's: you probably have some kind of bottleneck inserted by the runtime.
In general, you can use a tool like Fiddler or Wireshark to see how the downloads are interleaving. In your case, I would expect that only two are active at any one time, and that once one finishes, another starts immediately.
Before you go and change the setting, you should understand why it's there. It is written into the HTTP spec as suggested client behavior so as to not overwhelm the server. If your code is going to be distributed out to hundreds/thousands/millions of machines, you should consider the effects of many simultaneous downloads per client.
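If the bottleneck is indeed the per-host connection limit, the setting in question is presumably ServicePointManager.DefaultConnectionLimit, which defaults to 2 for client apps on .NET 2.0/3.5. A one-line sketch of raising it (the value 6 matches the process count from the question):

```csharp
using System.Net;

class DownloadConfig
{
    static void Configure()
    {
        // Allows up to 6 concurrent HTTP connections per host inside one
        // process, instead of the default of 2.
        ServicePointManager.DefaultConnectionLimit = 6;
    }
}
```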

System.OutOfMemory being thrown. How to find the culprit?

I am using Visual C# Express 2008 and I have an application that starts up on a form, but uses a thread with a delegated display function to take care of essentially all the processing. That way my form doesn't lock up while tasks are being processed.
Semi-recently, after going through a repeated process a number of times (the program processes incoming data, so when data comes in, the process repeats) my app will crash with a System.OutOfMemory error.
The stack trace in the error message is useless because it only directs me to the line where I call the delegated form control function.
I've heard people say they use ProcMon from Sysinternals to see why errors like this happen. But I, for the life of me, can't figure it out. The amount of memory I am using doesn't change as the program runs; if it goes up, it comes back down. And even if it were going up, how would I figure out which part of my program is the problem?
How can I go about investigating this problem?
EDIT:
So, after delving further into this issue, I looked through anything that I was ever re-declaring. There were a few instances where I had hugematrix = new uint[gigantic], so I got rid of about 3 of those.
Instead of getting rid of the error, it is now far more obscured and confusing.
My application takes the incoming data, and renders it using OpenGL. Now, instead of throwing "System.OutOfMemory" it simply does not render anything with OpenGL.
The only difference in my code is that I do not make new matrices for holding the data I plot. That way, I hope, my array stays in the same place in memory and doesn't do anything suicidal to my LOH.
Unfortunately, this twists the beast far beyond my meager means. With zero errors popping up, and all my data structures apparently still properly filled, how can I find my problem? Does OpenGL use memory in an obscure way so as to not throw exceptions when it fails? Is memory still a problem? How do I find out? All the memory profilers in the world seem to tell me very little.
EDIT:
With the boatloads of support from this community (with extra kudos to Amissico) the error has finally been rooted out. Apparently I was adding items to an OpenGL list, and never taking them off the list.
The app that finally clued me in was .NET Memory Profiler. At the time of the crash it showed 1.5 GB of data in the <unknown> category. Through a process of elimination (everything else in the list was named), the last thing to be checked off the list was the OpenGL rendering pipeline. The rest is history.
Based on the description in your comments, I would suspect that you are either not disposing of your images correctly or that you have severe Large Object Heap fragmentation and, when trying to allocate for a new image, don't have enough contiguous space available. See this question for more info - Large Object Heap Fragmentation
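If it is the disposal issue, the fix is deterministic disposal instead of waiting for finalizers; a minimal sketch, assuming System.Drawing images (the path is illustrative):

```csharp
using System.Drawing;

class ImageLoader
{
    static void LoadAndUse(string path)
    {
        // Dispose (via using) releases the GDI+ handle and the unmanaged
        // pixel memory immediately, rather than leaving it for the finalizer.
        using (Bitmap bmp = new Bitmap(path))
        {
            // ... read pixels / upload to the renderer here ...
        }
    }
}
```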
You need to use a memory profiler, such as the ANTS Memory Profiler, to find out what causes this error.
Are you re-registering an event handler on every loop and not un-registering it?
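That leak pattern looks roughly like this (hypothetical names; the fix is to subscribe once, or pair every += with a -=):

```csharp
using System;

class Source
{
    public event EventHandler DataArrived;
    public void Raise()
    {
        if (DataArrived != null) DataArrived(this, EventArgs.Empty);
    }
}

class LeakDemo
{
    static void OnDataArrived(object sender, EventArgs e) { /* handle data */ }

    static void Main()
    {
        Source source = new Source();
        for (int i = 0; i < 1000; i++)
        {
            // Leak: a new delegate is appended on every pass and never
            // removed, so the invocation list only ever grows.
            source.DataArrived += OnDataArrived;
        }
        // Fix: unsubscribe when done (or subscribe once, outside the loop).
        source.DataArrived -= OnDataArrived;
    }
}
```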
CLR Profiler for the .NET Framework 2.0 at https://github.com/MicrosoftArchive/clrprofiler
The most common cause of memory fragmentation is excessive string creation.
A few considerations:
Make sure that the threads you spawn are destroyed (aborted, or their function returns). Too many threads can bring down an application, even though the memory shown in Task Manager is not too high.
Memory leaks. Yes, yes, you can cause them in .NET quite well without setting references to null. These can be tracked down using memory profilers like dotTrace or ANTS Memory Profiler.
I had an OutOfMemoryException problem as well:
Microsoft Visual C# 2008 Reducing number of loaded dlls
The reason was fragmentation of the 2 GB virtual address space, and poster nobugz suggested Sysinternals' VMMap utility, which has been very helpful for diagnostics. You can use it to check whether your free memory areas become more fragmented over time. (First sort by size, then by type, refresh, repeat the sorting, and you can see whether contiguous free memory blocks become smaller.)

Forcing the .NET JIT compiler to generate the most optimized code during application start-up

I'm writing a DSP application in C# (basically a multitrack editor). I've been profiling it for quite some time on different machines and I've noticed some 'curious' things.
On my home machine, the first run of the playback loop takes up about 50%-60% of the available time (I assume due to the JIT doing its job); for subsequent loops it drops to a steady 5% consumption. The problem is, if I run the application on a slower computer, the first run takes more than the available time, causing the playback to be interrupted and messing up the output audio, which is unacceptable. After that, it drops to 8%-10% consumption.
Even after the first run, the application keeps calling some time-consuming routines from time to time (every 2 seconds more or less), which causes the steady 5% consumption to show very short peaks of 20%-25%. I've noticed that if I let the application run for a while, these peaks also go down to 7%-10%. (I'm not sure if it's due to the JIT recompiling these portions of code.)
So, I have a serious problem with the JIT. While the application behaves nicely even on very slow machines, these 'compiling storms' are going to be a big problem. I'm trying to figure out how to resolve this issue, and I've come up with an idea: mark all the 'sensitive' routines with an attribute that tells the application to pre-compile them during start-up, so they'll be fully optimized when they're really needed. But this is only an idea (and I don't like it too much either), and I wonder if there's a better solution to the whole problem.
I'd like to hear what you guys think.
(NGENing the application is not an option; I like and want all the JIT optimizations I can get.)
EDIT:
Memory consumption and garbage collection kicks are not an issue; I'm using object pools, and the maximum memory peak during playback is 304 KB.
You can trigger the JIT compiler to compile your entire set of assemblies during your application's initialization routine using the RuntimeHelpers.PrepareMethod method (without having to use NGen).
This solution is described in more detail here: Forcing JIT Compilation During Runtime.
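A minimal sketch of that approach, assuming you want to pre-JIT everything in the executing assembly (generic and abstract methods are skipped because PrepareMethod cannot compile them without concrete type arguments):

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class PreJit
{
    // Force-compiles every concrete method in this assembly at start-up,
    // so the first playback pass does not pay the JIT cost.
    static void WarmUp()
    {
        foreach (Type type in Assembly.GetExecutingAssembly().GetTypes())
        {
            if (type.ContainsGenericParameters)
                continue; // open generic types cannot be prepared directly
            foreach (MethodInfo method in type.GetMethods(
                BindingFlags.DeclaredOnly | BindingFlags.Public |
                BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.Static))
            {
                if (method.IsAbstract || method.ContainsGenericParameters)
                    continue;
                RuntimeHelpers.PrepareMethod(method.MethodHandle);
            }
        }
    }
}
```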
The initial speed indeed sounds like Fusion+JIT, which would be helped by ILMerge (for Fusion) and NGEN (for JIT); you could always play a silent track through the system at startup so that this does all the hard work without the user noticing any distortion?
NGEN is a good option; is there a reason you can't use it?
The issues you mention after the initial load do not sound like they are related to JIT. Perhaps garbage collection.
Have you tried profiling? Both CPU and memory (collections)?
As Marc mentioned, the ongoing spikes do not sound like JIT issues. Other things to look for:
Garbage collection - are you allocating memory during your audio processing? If you're creating a lot of garbage, or even objects that survive a Gen 0 collection, this might cause noticeable spikes. It sounds like you are doing some kind of pre-allocation, but watch out for hidden allocations in library code (even a foreach loop can allocate!).
Denormals. There is an issue with certain types of processors when dealing with very small floating-point numbers, which can cause CPU spikes. See http://www.musicdsp.org/files/denormal.pdf for details, and the sketch below for the usual workaround.
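The common workaround from that paper, sketched here against an illustrative one-pole filter: mix in a tiny constant so the recursive state never decays into the denormal range.

```csharp
class DenormalGuard
{
    // Small enough to be inaudible, large enough to keep filter feedback
    // out of the denormal range, where some CPUs slow down drastically.
    const float AntiDenormal = 1e-18f;

    static void Process(float[] buffer, ref float state)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            // Illustrative one-pole low-pass; the added constant prevents
            // 'state' from decaying toward a denormal value during silence.
            state = 0.99f * state + 0.01f * buffer[i] + AntiDenormal;
            buffer[i] = state;
        }
    }
}
```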
Edit:
Even if you don't want to use NGen, at least compare an NGen'd version so you can see what difference JITing makes
If you believe you are being impacted by JIT, then precompile your app with NGEN and run the tests again. There is no JIT overhead in code that has been compiled by NGEN. If you still see spikes in the NGEN'd app, then you know they are not caused by JIT.

Windows Service Increasing CPU Consumption

At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contain a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (i.e., they're just idling, looking through the database for something to do). When they start up, each service uses an average of about 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's an example of the behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen immediately preceding the timer callback, within the timer callback, or immediately following the timer callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage is seeming to go up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service, which would let me rule out the timer as a problem (which I absolutely intend to try), but can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in the question -- that your problems would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shutdown and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration)?
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
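A sketch of the session-per-iteration pattern being suggested, assuming sessionFactory is your ISessionFactory built once at start-up:

```csharp
using NHibernate;

class TimerWork
{
    static ISessionFactory sessionFactory; // assumed: configured at start-up

    static void OnTimerTick()
    {
        // One short-lived session per iteration: disposing it clears the
        // first-level cache, so the tracked object map cannot grow between runs.
        using (ISession session = sessionFactory.OpenSession())
        using (ITransaction tx = session.BeginTransaction())
        {
            // ... this iteration's queries and updates ...
            tx.Commit();
        }
    }
}
```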
Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)
It's obviously pretty difficult to remotely debug your unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that you're using up most of the RAM and causing the disk to thrash, or some similar problem.
Best of luck!
I suggest hacking the problem into pieces.
First, find a way to reproduce the problem 100% of the time, and quickly. Lower the timer interval so that the services fire more frequently (for example, 10 times faster than normal). If the problem arises 10 times faster, then it's related to the number of iterations and not to real time or to the real work done by the services. And you will be able to do the next steps more quickly than once a day.
Second, comment out all the real work code, leaving only the services, the timers, and the synchronization mechanism. If the problem still shows up, then it is in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is that your services might have bugs in the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue, it turned out a third-party OSS library we were using stored handles to unmanaged objects in static memory. The longer the service ran, the more handles it picked up, which caused the process's CPU performance to nose-dive very quickly. The way to try to resolve this sort of issue is to ensure your services store nothing in memory between the timer invocations, although if your third-party libraries use static memory you might have to do something clever, like creating an app domain for the timer invocation and ditching the app domain (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
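A sketch of that kind of guard, assuming a System.Timers.Timer: let a tick run only if the previous one has finished, instead of letting blocked threads stack up behind it:

```csharp
using System.Threading;
using System.Timers;

class GuardedService
{
    static int busy; // 0 = idle, 1 = a tick is still running

    static void OnElapsed(object sender, ElapsedEventArgs e)
    {
        // Skip this tick if the previous one hasn't finished, rather than
        // queuing another thread behind it.
        if (Interlocked.CompareExchange(ref busy, 1, 0) != 0)
            return;
        try
        {
            DoWork(); // the iteration's real work
        }
        finally
        {
            Interlocked.Exchange(ref busy, 0);
        }
    }

    static void DoWork() { /* ... */ }
}
```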
Hope this helps, but apologies up front if both my thoughts are red herrings.
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
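A minimal sketch of that shape (FindWork and Process are hypothetical stand-ins for your queue lookup and work code; the sleep interval replaces the old timer period):

```csharp
using System.Threading;

class PollingWorker
{
    static volatile bool running = true;

    static void Start()
    {
        // The only multithreaded spot in the service: one polling thread
        // that hands each unit of work to the thread pool.
        Thread worker = new Thread(new ThreadStart(PollLoop));
        worker.IsBackground = true;
        worker.Start();
    }

    static void PollLoop()
    {
        while (running)
        {
            object item;
            while ((item = FindWork()) != null)
                ThreadPool.QueueUserWorkItem(new WaitCallback(Process), item);
            Thread.Sleep(60000); // poll interval, instead of a timer
        }
    }

    static object FindWork() { return null; }  // hypothetical: next unit of work, or null
    static void Process(object state) { }      // runs on a pool thread; keep it stateless
}
```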
Another area where the design appears to be lacking is that you have multiple services polling resources to do something. I'd suggest unifying them under a single service. They might do separate things, but they're working in unison; you're just using the filesystem, database, etc. as a substitute for method calls. Also, 2003? I feel bad for you.
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section start and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.
