I've got an application that uses performance counters and has worked for months. Now, on my dev machine and another developer's machine, it has started hanging when I call PerformanceCounterCategory.Exists. As far as I can tell, it hangs indefinitely. It does not matter which category I use as input, and other applications using the API exhibit the same behaviour.
Debugging (using MS Symbol Servers) has shown that it is a call to Microsoft.Win32.RegistryKey that hangs. Further investigation shows that it is this line that hangs:
while (Win32Native.ERROR_MORE_DATA == (r = Win32Native.RegQueryValueEx(hkey, name, null, ref type, blob, ref sizeInput))) {
This is basically a loop that tries to allocate enough memory for the performance counter data. It starts at size = 65000 and does a few iterations. In the 4th call, when size = 520000, Win32Native.RegQueryValueEx hangs.
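For context, the pattern inside PerformanceCounterLib.GetData is essentially a grow-the-buffer retry loop against HKEY_PERFORMANCE_DATA. A rough, simplified sketch of that pattern follows; the P/Invoke signature, the "Global" value name, and the growth step are illustrative guesses, not the exact reference-source code:

using System;
using System.Runtime.InteropServices;

static class PerfDataSketch
{
    // Illustrative P/Invoke; the framework uses its internal Win32Native wrapper instead.
    [DllImport("advapi32.dll", CharSet = CharSet.Auto)]
    static extern int RegQueryValueEx(IntPtr hKey, string lpValueName, IntPtr lpReserved,
                                      ref int lpType, byte[] lpData, ref int lpcbData);

    const int ERROR_MORE_DATA = 234;
    static readonly IntPtr HKEY_PERFORMANCE_DATA = new IntPtr(unchecked((int)0x80000004));

    // Keep retrying with a bigger buffer until the perf data fits. It is the
    // RegQueryValueEx call inside this loop that hangs on the affected machines.
    static byte[] ReadPerfData(string item) // e.g. "Global", or a category index string
    {
        int bufferSize = 65000;                 // same starting size as the reference source
        while (true)
        {
            byte[] data = new byte[bufferSize];
            int size = bufferSize;
            int type = 0;
            int r = RegQueryValueEx(HKEY_PERFORMANCE_DATA, item, IntPtr.Zero,
                                    ref type, data, ref size);
            if (r == ERROR_MORE_DATA) { bufferSize += 65000; continue; }  // growth step is a guess
            if (r != 0) throw new InvalidOperationException("RegQueryValueEx failed: " + r);
            return data;
        }
    }
}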
Furthermore, rather worryingly, I found this comment in the reference source for PerformanceCounterLib.GetData:
// Win32 RegQueryValueEx for perf data could deadlock (for a Mutex) up to 2mins in some
// scenarios before they detect it and exit gracefully. In the mean time, ERROR_BUSY,
// ERROR_NOT_READY etc can be seen by other concurrent calls (which is the reason for the
// wait loop and switch case below). We want to wait most certainly more than a 2min window.
// The curent wait time of up to 10mins takes care of the known stress deadlock issues. In most
// cases we wouldn't wait for more than 2mins anyways but in worst cases how much ever time
// we wait may not be sufficient if the Win32 code keeps running into this deadlock again
// and again. A condition very rare but possible in theory. We would get back to the user
// in this case with InvalidOperationException after the wait time expires.
Has anyone seen this behaviour before? What can I do to resolve this?
This issue is now fixed, and since there have been no answers here, I will add one myself in case the question is found in future searches.
I ultimately fixed this error by stopping the print spooler service (as a temporary measure).
It looks like reading performance counters actually needs to enumerate the printers on the system (confirmed by a WinDbg dump of a hanging process, where I can see in the stack trace that winspool is enumerating printers and is stuck in a network call). That was what was actually failing on the system (and sure enough, opening the "Devices and Printers" window also hung). It baffles me that a printer/network issue can actually take the performance counters down. One would think there was some sort of fail-safe built in for such a case.
What I am guessing is that this is caused by a bad printer/driver on the network. I haven't re-enabled printing on the affected systems yet, since we are still hunting for the bad printer.
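For completeness, the temporary workaround above can be done from an elevated prompt with "net stop spooler", or programmatically with ServiceController. This sketch assumes a reference to System.ServiceProcess and administrative rights:

using System;
using System.ServiceProcess;

class StopSpooler
{
    static void Main()
    {
        // Temporary workaround only: stop the Print Spooler so winspool's
        // printer enumeration can no longer block the perf-counter read.
        using (var spooler = new ServiceController("Spooler"))
        {
            if (spooler.Status != ServiceControllerStatus.Stopped)
            {
                spooler.Stop();
                spooler.WaitForStatus(ServiceControllerStatus.Stopped, TimeSpan.FromSeconds(30));
            }
        }
    }
}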
This really didn't help in my case; any operation that uses the performance counter category still hangs there forever.
I think it is more a problem of memory allocation for the call, or something related to the resources of the machine. I don't have a way to prove it, but the exact same sample call to, for example, the PerformanceCounterCategory.Exists method runs just fine on a computer with 32 GB of RAM, while it hangs on another with only 16 GB. If I get a chance to install more memory to test and verify this assumption, I will update this ticket.
I built an add-on to Microsoft Word. When the user clicks a button, it runs a number of processes that export a list of Microsoft Word documents to Filtered HTML. This works fine.
Where the code falls down is in processing large numbers of files. After the file conversions are done and I call the next function, the app crashes and I get this information from Visual Studio:
Managed Debugging Assistant 'DisconnectedContext' has detected a problem in 'C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE'.
Additional information: Transition into COM context 0x56255b88 for this RuntimeCallableWrapper failed with the following error: System call failed. (Exception from HRESULT: 0x80010100 (RPC_E_SYS_CALL_FAILED)). This is typically because the COM context 0x56255b88 where this RuntimeCallableWrapper was created has been disconnected or it is busy doing something else. Releasing the interfaces from the current COM context (COM context 0x56255cb0). This may cause corruption or data loss. To avoid this problem, please ensure that all COM contexts/apartments/threads stay alive and are available for context transition, until the application is completely done with the RuntimeCallableWrappers that represents COM components that live inside them.
After some testing, I realized that if I simply remove all the code after the file conversions, there are no problems. To work around this, I placed the remainder of my code behind yet another button.
The problem is I don't want to give the user two buttons. After reading various other threads, it sounds like my code has a memory or threading issue. The answers I am reading do not help me truly understand what to do next.
I feel like this is what I want to do:
1- Run conversion.
2- Close thread/cleanup memory issue from conversion.
3- Continue running code.
Unfortunately, I really don't know how to do #2 or if it is even possible. Your help is very much appreciated.
or it is busy doing something else
The managed debugging assistant diagnostic you got is pretty gobbledygooky but that's the part of the message that accurately describes the real problem. You have a firehose problem, the 3rd most common issue associated with threading. The mishap is hard to diagnose because this goes wrong inside the Word plumbing and not your code.
Trying not to commit the same gobbledygook sin myself, what goes wrong is that the interop calls you make into the Office program are queued, waiting for their turn to get executed. The underlying "system call" that the error code hints at is PostMessage(). Wherever there is a queue, there is a risk that the queue gets too large. That happens when the producer (your program) is adding items to the queue far faster than the consumer (the Office program) removes them. The firehose problem. Unless the producer slows down, the queue will grow without bounds, and something is going to fail if it is allowed to grow endlessly; at a minimum the process runs out of memory.
It is not allowed to get anywhere close to that, though. The underlying queue that PostMessage() uses is protected by the OS: Windows fails the call when the queue already contains 10,000 messages. That's a fatal error that RPC does not know how to recover from, or rather should not try to recover from. Something is amiss and it isn't pretty. It returns an error code to your program to tell you about it. That's RPC_E_SYS_CALL_FAILED. Nothing much better happens in your program: the CLR doesn't know how to recover from it either, nor does your code. So the show is over; the interop call you made got lost and was never executed by Word.
Finding a completely reliable workaround for this awkward problem is not that straightforward. Beware that this can happen on any interop call, so catching the exception and trying again is drastically impractical. But do keep in mind that the Q+D fix is very simple: the plain problem is that your program is running too fast, and slowing it down with a Thread.Sleep() or Task.Delay() call is quite crude but will always fix the issue. Well, assuming you delay enough.
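A crude illustration of that Q+D fix: pause the export loop every so often so Word's queue can drain. The filesToConvert list, the ExportToFilteredHtml call, the batch size and the delay below are all placeholders to tune for your own workload:

// Hypothetical export loop: pause periodically so Word's message queue can drain.
int sent = 0;
foreach (var file in filesToConvert)          // filesToConvert is your own list
{
    ExportToFilteredHtml(file);               // your existing interop call(s)
    if (++sent % 25 == 0)                     // every 25 documents...
        System.Threading.Thread.Sleep(500);   // ...give Word half a second to catch up
}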
I think, but don't know for a fact because nobody ever posts repro code, that this issue is also associated with using a console mode app or a worker thread in your program. If it is a console mode app, try applying the [STAThread] attribute to your Main() method. If it is a worker thread, call Thread.SetApartmentState() before starting the thread, but beware that it is very important to also create the Application interface on that worker thread. Not otherwise a workaround for an add-in.
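For the worker-thread case (your own exe, not the add-in scenario), a hedged sketch of that STA setup looks like this; the conversion body is left as a placeholder:

using System.Threading;
using Word = Microsoft.Office.Interop.Word;

var worker = new Thread(() =>
{
    // Create the Application object on the STA worker thread itself.
    var word = new Word.Application();
    try
    {
        // ... run the document conversions here ...
    }
    finally
    {
        word.Quit();
    }
});
worker.SetApartmentState(ApartmentState.STA);   // must be called before Start()
worker.Start();
worker.Join();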
If neither of those workarounds is effective, or they are too impractical, then consider that you can automagically slow your program down, and ensure the queue is emptied, by occasionally reading something back from the Office program. Something silly; any property getter call will do. Necessarily, you can't get the property value until the Office program catches up. That can still fail: there is also a 60 second time-out on the interop call. But that's something you can fix; you can call CoRegisterMessageFilter() in your program to install a callback that runs when the timeout trips. Very gobbledygooky as well, but the cut-and-paste code is readily available.
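The read-back trick can be as small as touching any property on the Application object; wordApp and the chosen property below are just examples:

// Force a round trip: this property getter cannot return until Word has
// processed everything queued ahead of it, which throttles the producer.
int openDocuments = wordApp.Documents.Count;   // wordApp: your Word.Application reference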
Is there a way to at least postpone termination of a managed app (by a few dozen milliseconds) and set some shared flag to give the other threads a chance to terminate gracefully (the overflowing thread itself obviously wouldn't execute anything further)? I'm contemplating using a JIT debugger or CLR hosting for this - I'm curious whether anybody has tried this before.
Why would I want to do something so wrong?
Without too much detail, imagine this analogy: you are in a casino betting at a roulette table and suddenly find out that the roulette is an unreliable fake. So you want to leave the casino immediately, BUT you likely want to collect your bets from the table first.
Unfortunately I cannot leverage separate process for this as there are very tight performance requirements.
Tried and didn't work:
.NET behavior for StackOverflowException (and contradicting info on MSDN) has been discussed several times on SO - to quickly sum up:
HandleProcessCorruptedStateExceptionsAttribute (e.g. on appdomain unhandled exception handler) doesn't work
ExecuteCodeWithGuaranteedCleanup doesn't work
legacyUnhandledExceptionPolicy doesn't work
There may be a few other ways to try to handle StackOverflowExceptions, but it seems apparent that the CLR terminates the whole process, as mentioned in this great answer by Hans Passant.
Things I'm considering trying:
JIT debugger - leave the thread with the exception frozen, set some shared flag (likely in a pinned location) and thaw the other threads for a short time.
CLR hosting and setting unhandled exception policy
Do you have any other idea? Or any experience (successful/unsuccessful) with those two ways?
The word "fake" isn't quite the correct one for your casino analogy. There was a magnitude 9 earth quake and the casino building along with the roulette table, the remaining chips and the player disappeared in a giant cloud of smoke and dust.
The only shot you have at running code after an SOE is to stay far away from that casino; it has to run in another process. A "guard" process that starts your misbehaving program can use Process.ExitCode to detect the crash. It will be -1073741571 (0xC00000FD). The process state is gone, so you'll have to use one of the .NET out-of-process interop methods (like WCF, named pipes, sockets, or a memory-mapped file) to make the guard process aware of things that need to be done to clean up. This needs to be transactional; you cannot reason about the exact point in time that the crash occurred, since the program might have died while updating the guard.
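A minimal sketch of such a guard process, assuming the misbehaving program is a hypothetical Worker.exe next to the guard:

using System;
using System.Diagnostics;

class Guard
{
    const int STATUS_STACK_OVERFLOW = unchecked((int)0xC00000FD);   // -1073741571

    static void Main()
    {
        var worker = Process.Start("Worker.exe");   // hypothetical program name
        worker.WaitForExit();

        if (worker.ExitCode == STATUS_STACK_OVERFLOW)
        {
            // The worker died from a stack overflow: run the transactional
            // cleanup here, using whatever state it persisted before crashing.
            Console.WriteLine("Worker crashed with a stack overflow; cleaning up.");
        }
    }
}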
Do beware that this is rarely worth the effort, because an SOE is pretty indistinguishable from an everyday process abort. Like getting killed by Task Manager. Or the machine losing power. Or being subjected to the effects of an earthquake :)
A StackOverflowException is an immediate and critical exception from which the runtime cannot recover - that's why you can't catch it, or recover from it, or anything else. In order to run another method (whether that's a cleanup method or anything else), you have to be able to create a stack frame for that method, and the stack is already full (that's what a StackOverflowException means!). You can't run another method because running a method is what causes the exception in the first place!
Fortunately, though, this kind of exception is always caused by program structure. You should be able to diagnose and fix the error in your code: when you get the exception, you will see in your call stack that there's a loop of one or more methods recursing indefinitely. You need to identify what the faulty logic is and fix it, and that'll be a lot easier than trying to fix the unfixable exception.
I've written a small C# console app that is used by many users on a shared storage server. Its runtime should always be under 3 seconds or so, and it is run automatically in the background to assist another GUI app the user is really trying to use. Because of this, I want to make sure the program ALWAYS exits completely, no matter whether it throws an error or anything else.
In the Application_Startup, I have the basic structure of:
try
{
// Calls real code here
}
catch
{
// Log any errors (the logging itself has a try with an empty catch around it,
// so there's no way it can cause problems)
}
finally
{
Application.Shutdown();
}
I figured that with this structure, it was impossible for my app to become a zombie process. However, when trying to push new versions of this app, I repeatedly find that I cannot delete and replace the executable because the "file is in use", meaning that it's hanging on someone's computer out there, even though it should only run for a few seconds and always shut down.
So, how is it that my app is seemingly becoming a hanging process on peoples' computers with the code structure I have? What am I missing?
Edit: Added "Application." to resolve ShutDown() for clarity.
There are two options here:
Your console application doesn't really finish in 3 seconds, but rather takes a lot longer. You need to debug it and see what is taking so long.
Your console application takes 3 seconds to exit, but it is run every minute by the GUI, and you have more than 40 users, so the probability of finding the executable unused is slim.
If it's the first one, and you don't want to debug it, you can always start a second thread, wait for 3 seconds and then kill the entire process.
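A rough sketch of that second-thread watchdog; the 10-second limit here is just an example value, pick whatever margin suits the expected runtime:

using System;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        // Watchdog: if the real work hasn't finished after 10 seconds,
        // terminate the whole process so it can never hang on a user's machine.
        var watchdog = new Thread(() =>
        {
            Thread.Sleep(TimeSpan.FromSeconds(10));
            Environment.Exit(1);
        });
        watchdog.IsBackground = true;   // won't keep the process alive on its own
        watchdog.Start();

        // ... the real work goes here ...
    }
}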
Maybe the code inside the try block is still executing for at least one of the clients and is not really limited to 3 seconds or so. To prevent that case, you would need a multithreaded application: one thread for processing and one in the background killing the working thread after a timeout. Before doing that, you should ask yourself whether such infrastructure is really needed.
Another thing that comes to mind is that one of the users simply had the application running at that very moment; the probability depends on the number of your users.
Maybe designing your support app as an always-running multithreaded service would be a much better idea than instantiating one running application for each client request.
I'm writing an application whose purpose involves a lot of logging of different events. Among those I would also like to have an event that the application was shut down - even if unexpectedly like because of a power loss.
Naturally, when the power goes out I don't get a chance to write anything anywhere. So my idea was to continuously write a timestamp in some known location (say, once per minute), and when the application was next run, it could determine the approximate time of the unexpected shutdown. A precision of 1 minute would be acceptable for me.
However I'm worried that caching at the OS and disk level might interfere with this approach. Is there a better way or if not - how to make sure that the data I just wrote is REALLY written out to the physical medium?
Added: Oh, almost forgot the buzzword line: Windows XP and above; .NET 3.5; C#.
Unexpected shutdowns are logged in the system event log.
When your application shuts down cleanly, write that to your logfile. The next time your application starts, check whether it was shut down the way it was supposed to be; otherwise, check the system event log.
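For the event-log part, a minimal sketch could look like this; event ID 6008 is the "previous system shutdown was unexpected" record Windows writes after a dirty shutdown:

using System;
using System.Diagnostics;

// Scan the System log for event ID 6008, written after a dirty shutdown.
var systemLog = new EventLog("System");
foreach (EventLogEntry entry in systemLog.Entries)
{
    if (entry.InstanceId == 6008)
        Console.WriteLine("Unexpected shutdown recorded at {0}", entry.TimeGenerated);
}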
Edit: removed the incorrect answer (due to not reading the question properly)
You could use the FlushFileBuffers method, but that will still only write it to the device; you still can't force the actual drive to commit it (as far as I know).
http://www.pinvoke.net/default.aspx/kernel32/FlushFileBuffers.html?diff=y
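A minimal heartbeat writer using that P/Invoke might look like the sketch below (on .NET 4+ you could use FileStream.Flush(true) instead, but the question targets .NET 3.5); the path and timestamp format are just examples:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class Heartbeat
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(SafeFileHandle hFile);

    public static void WriteTimestamp(string path)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read))
        using (var writer = new StreamWriter(fs))
        {
            writer.WriteLine(DateTime.UtcNow.ToString("o"));
            writer.Flush();                        // flush the StreamWriter into the FileStream
            fs.Flush();                            // flush the FileStream into the OS cache
            FlushFileBuffers(fs.SafeFileHandle);   // ask the OS to push it to the device
        }
    }
}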
I have a C# application that adds some performance counters when it starts up. But if the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib is corrupted (missing or invalid data), the operation of checking whether the performance counters exist (PerformanceCounterCategory.Exists(category)) takes a really long time (around 30 seconds) before finally throwing an exception (InvalidOperation: Category does not exist).
My question is how I can verify the validity of the registry before trying to add the performance counters (and what validity means here), or whether there is a way I can time out the perf counter operations so that it doesn't take 30 seconds to get an exception.
I suspect that "validity" is an internal implementation detail that you don't get to know. However, you could at least try to open the same registry keys and just see if they exist. You can use Process Explorer to figure out which keys it is reading.
However, I question why you would even care. A corrupted registry shouldn't be a very common thing, and if it is, what are you going to do about it? All you can do is quit. So you might as well just catch the exception. I would treat this like any other blocking operation and do it on a worker thread (not the UI thread) and show progress to your users so they know your app isn't hung.
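A hedged sketch of doing the check off the startup path with your own timeout (Task.Run needs .NET 4.5+; the category name is a placeholder):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Run the check on a worker thread; if it is still hanging after 5 seconds,
// carry on without counters (the hung worker simply stays in the background).
var existsTask = Task.Run(() =>
{
    try { return PerformanceCounterCategory.Exists("MyCategory"); }   // placeholder category
    catch (InvalidOperationException) { return false; }               // corrupt Perflib -> treat as missing
});

bool counterAvailable = existsTask.Wait(TimeSpan.FromSeconds(5)) && existsTask.Result;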
Are you sure the registry is corrupt?
I found that my program did not have permission to create performance counters at run time. Instead, I added the creation of the counters to my installer program, and the installer must be run as an administrator. At run time my program had no problems accessing or updating the already created counters.
I don't know how to answer your question as asked, but if I understand correctly, the problem is that it looks to the user as if the application hangs for up to 30 seconds on startup.
If that is the case, I'd suggest that you might be able to avoid that by just creating a worker thread, telling it to create the performance monitors and then carrying on with starting the application.
I've not worked with performance counters, so I can't say whether there's anything about them that would stop this from working.