How to identify internal failures in a Windows service

How to identify internal failures in a Windows service - c#

We use a lot of custom Windows services in our applications. However, the one I'm currently working on has an infuriating problem: while the service keeps running, it simply stops functioning.
The Main method of the service is wrapped in a try/catch block, like this:
static void Main()
{
IRepository rep = new Repository();
ILogger log = LogManager.GetLogger(GetType().Name);
TimeSpan loadWindowStart = new TimeSpan(9, 0, 0);
TimeSpan loadWindowEnd = new TimeSpan(18, 0, 0);
foreach (SuppressionLoad sl in rep.GetSuppressionLoads().ToList())
{
try
{
// do stuff
}
catch(Exception ex)
{
// log error
}
}
}
The service also logs as it does stuff, and we can watch the logs fill up while it's busy.
Sometimes, however, the logs just stop. And activity elsewhere in the database suggests the entire service has stopped working. Checking in Services on the server, the service still shows a Status of "Started". It takes up almost zero system resources while it's in this state, although it's normally quite processor intensive. If you try and stop it, it just times out trying and, as far as we can tell, it never stops of its own accord. The process has to be killed in Task Manager.
There is nothing untoward in the log in the run up to these stalls. There is also nothing we can find in Event Viewer.
Since it doesn't log an error, I'm at a loss as to what's going on here, or what we can do to try and diagnose the fault from here. It's highly intermittent - it will often run for several days without problem before entering the state. What can we do to investigate what's going on?

Matt; Obscure problems such as these are difficult to find in the best of conditions - if your service happens to use threads (which I assume it does), it becomes tremendously more difficult and you can't rely on global try/catch.
A simple thing to try would be NBug (no association). It will catch un-handled exceptions and give you some info about them. I don't think it will get you enough though.
The general way to find these sorts of things is log, log, log. You have to be able to come as close to recreating the problem as possible - you need logs that tell your entry points into each method, the variable values, exception stack traces if hit, how long you spent in each method, etc. There are some really good tools out there for logging some logging tools so I won't bother with recommending any. You can wrap your logging in a conditional compile switch so once you find your issue you won't suffer a performance hit when you turn it off.
Probably not the answer you wanted, but the only thing that has really worked for me over the years.
SteveJ

It sounds like the issue could be anywhere and doesn't necessarily have much to do with code provided.
Suggestions on how to go about it
When service hangs, attach a debugger and take a look at threads and see where each one is. You may need to rebuild and run a debug version of your solution so that debugger has necessary contextual symbol data. Questions to ask:
Are all the threads that I'm expecting to be there are there, or are some gone or unaccounted for?
Are threads stuck in a deadlock (I'm suspecting that's what's happening), and if so, on what resources.
Turn on detailed logging and sprinkle in more debug log statements to isolate where in code flow it last was and where it didn't make it to, and then keep narrowing down the location. Consider logging contextual data so that when you isolate problematic line or code block, you have context to try to understand why odd behavior takes place. Just be mindful of logging sensitive information (i.e. passwords, PII, etc.)
With full credit to IInspectable's comment, you can try to take a full dump of the process (SysInternal's Process Explorer or ProcDump let's you do that, or Task Manager). It tends to be quite an involved experience using the tool, but used right can give a lot of insight, and possibly find the issue on first occurrence.
Considering that it happens infrequently, and the field of what and where is wide open, it'll likely take a few iterations of having the problem trigger in order to narrow down the scope.

Related

The Mystery of the Vanishing EF Call

Today I got an emergency call from the users on our ASP.NET production system. Some users (not all) were unable to enter certain data. The user posted the data, and the system then froze; the call never returned.
We tried to repro the problem on the QA system (which has a fresh restore of production data), and could not. I then ran from my dev environment and connected directly to the production DB, masquerading as one of the affected users. Again, no problem. Conclusion: there must be some kind of issue in the production environment, probably somewhere in the IIS process that's hosting the website.
So I fired up Visual Studio on the production server, and attached to the IIS process (Kids, don't do this at home!), set a breakpoint in the offending code, logged in as the user, and attempted to save the data. Hit the breakpoint and stepped line by line, until I hit a line of code like this:
try
{
...
using (var db = new MyDataContext())
{
...
var fooToUpdate = db.Foos.Single(f => f.ID == fooId); // <-- THIS LINE
...
}
}
catch (Exception ex)
{
// some error logging
}
After hitting "step" on that line, the thread simply vanished. Disappeared without a trace. I put a sniffer on the database, and no query was fired; needless to say there was no DB locking involved. No exception was thrown. The code entered Entity Framework and never left.
The way the data is is that every user has a different and unique fooId for every day, so no other user will have the same fooId. Most users were able to load their Foo, but a select handful of users fail consistently to load their personal Foo. I tried running the query to load the Foo in a SSMS window; no trouble at all. The only time it fails is in this particular IIS process, on the production server.
Now, I could just recycle the app pool or restart IIS, and that would probably paper over the problem. But something similar happened a week ago, and we couldn't trace it then, either. So we reset IIS then, hoping the problem would go away. And it did, for a week. And now it's back.
Does anyone have any ideas how it is possible for a thread to simply vaporize like this? Is Norman Bates hiding behind the EF door?

Given the fact that the thread did not magically vaporize, we could speculate some of the more likely options:
The debugger had a hard time following the production code compiled in Release mode. Just because debugging Release code works 90% of the time, don't fall under the illusion that it is dependable. Optimized code can very quickly throw the debugger off the track of actual execution. When this happens, it will look like the thread just vanished.
Assuming the thread does legitimately enter the call and not return (which seems to be supported by the original complaint of the application "freezing"), then the most likely scenario is a deadlock of some type. EntityFramework deadlocks are not common, but not unheard of either. The most common issues I'm aware of usually involve TransactionScope or CommitableTransaction. Are you using any transactions in the omitted code sections?

Turns out that the EF part was a red herring after all. I went and downloaded Telerik's JustDecompile and JustCode, in the hope of stepping into the EF code, but when I stepped in to that line, I found myself not in the Single() extension method, but inside one of my own method calls - that I thought I had executed on the previous line. Evidently the code was not perfectly in sync with the version in production.
LESSON 1: If you attach to a process, your execution point may not be where you think it is, if your code is not identical to the code that was
compiled into that process.
So anyway, now that I could step into the code without decompiling anything, the first thing I noticed was:
lock (_lockObj)
{
...
}
And when I tried to step into it, it froze there. Smoking gun.
So somewhere, some other thread is locking this object. Looked at other places where the lock is invoked, leading to a spaghetti of dependencies, along with another code-locked segment, with several DB calls and even a transaction boundary. It could be a code lock / db transaction deadlock, though a brief scan of the code in the DB transaction failed to pick up any contenders within the life of the transaction for blocking anything else. Plus, there's the evidence of the DB not showing any blocking or open transactions. Rather, it may just be the the fact of a few hundred queued up long-running processes, all inside code locks inside code locks, and in the end it all looks something like the West Side Highway at 17:05 on a Friday, with a jackknifed trailer truck lying across 3 lanes approaching the GW bridge.
LESSON 2: Code locks are dangerous, not only - but especially - when used in conjunction with DB transactions. Try to find ways to make your code thread safe without using code locks. And if you really must use code locks, make sure you get in and out as quickly as possible. Don't give your thread a magazine to read while it's occupying the only stall, so to speak.

StackOverflowException in .NET >= 4.0 - give other threads chance to gracefully exit

Is there a way how to at least postpone termination of managed app (by few dozens of milliseconds) and set some shared flag to give other threads chance to gracefully terminate (the SO thread itself wouldn't obviously execute anything further)? I'm contemplating to use JIT debugger or CLR hosting for this - I'm curios if anybody tried this before.
Why would I want to do something so wrong?:
Without too much detail - imagine this analogy - you are in a casino betting on a roulette and suddenly find out that the roulette is unreliable fake. So you want to immediately leave the casino, BUT likely want to collect your bets from the table first.
Unfortunately I cannot leverage separate process for this as there are very tight performance requirements.
Tried and didn't work:
.NET behavior for StackOverflowException (and contradicting info on MSDN) has been discussed several times on SO - to quickly sum up:
HandleProcessCorruptedStateExceptionsAttribute (e.g. on appdomain unhandled exception handler) doesn't work
ExecuteCodeWithGuaranteedCleanup doesn't work
legacyUnhandledExceptionPolicy doesn't work
There may be few other attempts how to handle StackOverflowExceptions - but it seems to be apparent that CLR terminates the whole process as is mentioned in this great answer by Hans Passant.
Considering to try:
JIT debugger - leave the thread with exception frozen, set some
shared flag (likely in pinned location) and thaw other threads for a
short time.
CLR hosting and setting unhandled exception policy
Do you have any other idea? Or any experience (successful/unsuccessful) with those two ways?

The word "fake" isn't quite the correct one for your casino analogy. There was a magnitude 9 earth quake and the casino building along with the roulette table, the remaining chips and the player disappeared in a giant cloud of smoke and dust.
The only shot you have at running code after an SOE is to stay far away from that casino, it has to run in another process. A "guard" process that starts your misbehaving program, it can use the Process.ExitCode to detect the crash. It will be -1073741571 (0xc00000fd). The process state is gone, you'll have to use one of the .NET out-of-process interop methods (like WCF, named pipes, sockets, memory-mapped file) to make the guard process aware of things that need to be done to clean up. This needs to be transactional, you cannot reason about the exact point in time that the crash occurred since it might have died while updating the guard.
Do beware that this is rarely worth the effort. Because an SOE is pretty indistinguishable from an everyday process abort. Like getting killed by Task Manager. Or the machine losing power. Or being subjected to the effects of an earth quake :)

A StackOverflowException is an immediate and critical exception from which the runtime cannot recover - that's why you can't catch it, or recover from it, or anything else. In order to run another method (whether that's a cleanup method or anything else), you have to be able to create a stack frame for that method, and the stack is already full (that's what a StackOverflowException means!). You can't run another method because running a method is what causes the exception in the first place!
Fortunately, though, this kind of exception is always caused by program structure. You should be able to diagnose and fix the error in your code: when you get the exception, you will see in your call stack that there's a loop of one or more methods recursing indefinitely. You need to identify what the faulty logic is and fix it, and that'll be a lot easier than trying to fix the unfixable exception.

How to debug random crashes?

we have a dotnet 2.0 desktop winforms app and it seems to randomly crash. no stack trace, vent log, or anything. it just dissapears.
There are a few theories:
machine simply runs out of resources. some people have said you will always get a window handle exception or a gdi exception but others say it might simply cause crashes.
we are using wrappers around non managed code for 2 modules. exceptions inside either of these modules could cause this behavior.
again, this is not reproducible so i wanted to see if there were any suggestions on how to debug better or anything i can put on the machine to "catch" the crash before it happens to help us understand whats going on.

Your best bet is to purchase John Robbins' book "Debugging Microsoft .NET 2.0 Applications". Your question can go WAY deeper than we have room to type here.

Sounds for me like you need to log at first - maybe you can attach with PostSharp a logger to your methods (see Log4PostSharp) . This will certainly slow you down a lot and produce tons of messages. But you should be able to narrow the problematic code down ... Attach there more logs - remove others. Maybe you can stress-test this parts, later. If the suspect parts are small enough you might even do a code review there.
I know, your question was about debugging - but this could be an approach, too.

You could use Process Monitor from SysInternals (now a part of Microsoft) filtered down to just what you want to monitor. This would give you a good place to start. Just start with a tight focus or make sure you have plenty of space for the log file.

I agree with Boydski. But I also offer this suggestion. Take a close look at the threads if you're doing multi-threading. I had an error like this once that took a long long time to figure out and actually ended up with John Robbins on the phone helping out with it. It turned out to be improper exception handling with threads.

Run your app, with pdb files, and attach WinDbg, make run.
Whem crash occur WinDbg stop app.
Execute this command to generate dump file :
.dump /ma c:\myapp.dmp
An auxiliary tool of analysis is ADPlus
Or try this :
Capturing user dumps using Performance alert
How to use ADPlus to troubleshoot "hangs" and "crashes"
Debugging on the Windows Platform

Do you have a try/catch block around the line 'Application.Run' call that starts the GUI thread? If not do add one and put some logging in for any exceptions thrown there.
You can also wrie up the Application.ThreadException event and log that in case that gives you some more hints.

you should use a global exception handler:
http://msdn.microsoft.com/en-us/library/system.windows.forms.application.threadexception.aspx

I my past I have this kind of behaviour primarily in relation to COM objects not being released and/or threading issues. Dive into the suggestions that you have gotten here already, but I would also suggest to look into whether you properly release the non-managed objects, so that they do not leak memory.

How can I schedule Debugs?

there is this check-in-out program here at my workplace, it only takes the data from check-in-out machine and store it in our database, but suddenly out of nowhere started to report an error on Thursdays but only once at a random time during the day, so when I detect the error, I run the program but nothing happens, so I want to debug it every 5-10 mins to see if I catch the error to see what is happening, how can I do this?

Logging is your friend. Add lots of logging (either use the built-in Trace logging or use some framework such as log4net). Use the different log levels to control how much logging you get out. At verbose levels you can for instance log when you enter and exit important methods, log the input arguments and return values. Log in catch blocks and so on. Then analyse the log files after the next error is reported.

What kind of error logging are you currently implementing in this application? If none, would you consider adding in comprehensive application logging, such as the Log4Net tool? Or if this is a web application the ELMAH tool?
This way you can log every error that happens along with its details, like stack trace to track down the problem.

Some thoughts:
Check the server event log to see if there are any crash minidumps you can pull out. These can tell you a lot about what happened when the program crashed (call stack, etc).
Or write a wrapper program that can run your program and detect when it fails, then take a snapshot of the server's state at that moment so you can (hopefully) re-execute the task with the necessary data to get a repeatable crash in your debugger.
Or just add loads of logging. You could use PostSharp to add trace that tells you every method that you enter and exit, so you can easily determine which method was running when it failed.
And you can add robustification code. Check religiously for nulls, etc. and you may well find you've corrected the problem without necessarily knowing which fix fixed it.
And if the program's not too big, just being old fashioned and desk checking (reading a print-out of) the code may well turn up some bugs.
Another approach (getting a bit more experimental) might be to modify the program to run continuously so you can stick it in a debugger and leave it till you hit an exception. (run a loop, wait for a trigger file to be refreshed or something, and then kick off the normal process - about 5 lines of code would probably suffice)

What is Environment.FailFast?

What is Environment.FailFast?
How is it useful?

It is used to kill an application. It's a static method that will instantly kill an application without being caught by any exception blocks.
Environment.FastFail(String) can
actually be a great debugging tool.
For example, say you have an
application that is just downright
giving you some weird output. You have
no idea why. You know it's wrong, but
there are just no exceptions bubbling
to the surface to help you out. Well,
if you have access to Visual Studio
2005's Debug->Exceptions... menu item,
you can actually tell Visual Studio to
allow you to see those first chance
exceptions. If you don't have that,
however you can put
Environment.FastFail(String) in an
exception, and use deductive reasoning
and process of elimination to find out
where your problem in.
Reference

It also creates a dump and event viewer entry, which might be useful.

It's a way to immediately exit your application without throwing an exception.
Documentation is here.
Might be useful in some security or data-critical contexts.

Failfast can be used in situations where you might be endangering the user's data. Say in a database engine, when you detect a corruption of your internal data structures, the only sane course of action is to halt the process as quickly as possible, to avoid writing garbage to the database and risk corrupting it and lose the user's data. This is one possible scenario where failfast is useful.
Another use is to catch programmer errors. Say you are writing a library and some function accepts a pointer that cannot be null in any circumstance, that is, if it's null, you are clearly in presence of a programmer error. You can return an error like E_POINTER or throw some InvalidArgument exception and hope someone notices, but you'll get their attention better by failing fast :-)
Note that I'm not restricting the example to pointers, you can generalize to any parameter or condition that should never happen. Failing fast ultimately results in better quality apps, as many bugs no longer go unnoticed.
Finally, failing fast helps with capturing the state of the process as faithfully as possible (as a memory dump gets created), in particular when failing fast immediately upon detecting an unrecoverable error or a really unexpected condition.
If the process was allowed to continue, say the 'finally' clauses would run or the stack would be unwound, and things would get destroyed or disposed-of, before a memory dump is taken, then the state of the process might be altered in such as way that makes it much more difficult to diagnose the root cause of the problem.

It kills the application and even skips try/finally blocks.

From .NET Framework Design Guidelines on Exception Throwing:
✓ CONSIDER terminating the process by calling System.Environment.FailFast (.NET Framework 2.0 feature) instead of throwing an exception if your code encounters a situation where it is unsafe for further execution.

Joe Duffy discusses failing fast and the discipline to make it useful, here.
http://joeduffyblog.com/2014/10/13/if-youre-going-to-fail-do-it-fast/
Essentially, he's saying that for programming bugs - i.e. unexpected errors that are the fault of the programmer and not the programme user or other inputs or situations that can be reasonable expected to be bad - then deciding to always fail fast for unexpected errors has been seen to improve code quality.
I think since its an optional team decision and discipline, use of this API in C# is rare since in reality we're all mostly writing LoB apps for 12 people in HR or an online shop at best.
So for us, we'd maybe use this when we want deny the consumer of our API the opportunity of making any further moves.

An unhandled exception that is thrown (or rethrown) within a Task won't take effect until the Task is garbage-collected, at some perhaps-random time later.
This method lets you crash the process now -- see this answer.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.