The Mystery of the Vanishing EF Call - c#

Today I got an emergency call from the users on our ASP.NET production system. Some users (not all) were unable to enter certain data. The user posted the data, and the system then froze; the call never returned.
We tried to repro the problem on the QA system (which has a fresh restore of production data), and could not. I then ran from my dev environment and connected directly to the production DB, masquerading as one of the affected users. Again, no problem. Conclusion: there must be some kind of issue in the production environment, probably somewhere in the IIS process that's hosting the website.
So I fired up Visual Studio on the production server, and attached to the IIS process (Kids, don't do this at home!), set a breakpoint in the offending code, logged in as the user, and attempted to save the data. Hit the breakpoint and stepped line by line, until I hit a line of code like this:
try
{
    ...
    using (var db = new MyDataContext())
    {
        ...
        var fooToUpdate = db.Foos.Single(f => f.ID == fooId); // <-- THIS LINE
        ...
    }
}
catch (Exception ex)
{
    // some error logging
}
After hitting "step" on that line, the thread simply vanished. Disappeared without a trace. I put a sniffer on the database, and no query was fired; needless to say there was no DB locking involved. No exception was thrown. The code entered Entity Framework and never left.
The data is structured so that every user has a different, unique fooId for each day, so no other user will have the same fooId. Most users were able to load their Foo, but a select handful of users consistently fail to load their personal Foo. I tried running the query to load the Foo in an SSMS window; no trouble at all. The only time it fails is in this particular IIS process, on the production server.
Now, I could just recycle the app pool or restart IIS, and that would probably paper over the problem. But something similar happened a week ago, and we couldn't trace it then, either. So we reset IIS then, hoping the problem would go away. And it did, for a week. And now it's back.
Does anyone have any ideas how it is possible for a thread to simply vaporize like this? Is Norman Bates hiding behind the EF door?

Given that the thread did not magically vaporize, here are some of the more likely explanations:
The debugger had a hard time following the production code compiled in Release mode. Just because debugging Release code works 90% of the time, don't fall under the illusion that it is dependable. Optimized code can very quickly throw the debugger off the track of actual execution. When this happens, it will look like the thread just vanished.
Assuming the thread does legitimately enter the call and not return (which seems to be supported by the original complaint of the application "freezing"), then the most likely scenario is a deadlock of some type. Entity Framework deadlocks are not common, but not unheard of either. The most common issues I'm aware of usually involve TransactionScope or CommittableTransaction. Are you using any transactions in the omitted code sections?

Turns out that the EF part was a red herring after all. I went and downloaded Telerik's JustDecompile and JustCode, in the hope of stepping into the EF code, but when I stepped in to that line, I found myself not in the Single() extension method, but inside one of my own method calls - that I thought I had executed on the previous line. Evidently the code was not perfectly in sync with the version in production.
LESSON 1: If you attach to a process, your execution point may not be where you think it is, if your code is not identical to the code that was compiled into that process.
So anyway, now that I could step into the code without decompiling anything, the first thing I noticed was:
lock (_lockObj)
{
    ...
}
And when I tried to step into it, it froze there. Smoking gun.
So somewhere, some other thread is locking this object. I looked at the other places where the lock is taken, which led to a spaghetti of dependencies, along with another code-locked segment containing several DB calls and even a transaction boundary. It could be a code lock / DB transaction deadlock, though a brief scan of the code in the DB transaction failed to pick up any contenders within the life of the transaction for blocking anything else. Plus, there's the evidence of the DB not showing any blocking or open transactions. Rather, it may just be the fact of a few hundred queued-up long-running processes, all inside code locks inside code locks, and in the end it all looks something like the West Side Highway at 17:05 on a Friday, with a jackknifed trailer truck lying across 3 lanes approaching the GW bridge.
LESSON 2: Code locks are dangerous, not only - but especially - when used in conjunction with DB transactions. Try to find ways to make your code thread safe without using code locks. And if you really must use code locks, make sure you get in and out as quickly as possible. Don't give your thread a magazine to read while it's occupying the only stall, so to speak.
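To make LESSON 2 concrete, here is a minimal sketch of getting in and out of a lock quickly. It is not the code from the production system; the class, the list of pending items, and the method names are all hypothetical. The idea is to hold the lock only for the in-memory work, run anything slow (such as a DB call) on a private copy after releasing it, and, where a hang is suspected, use a bounded wait so that a stuck lock owner shows up as a failure you can log.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration; none of these names come from the original code.
public static class ShortLockSketch
{
    private static readonly object _lockObj = new object();
    private static readonly List<int> _pending = new List<int>();

    // Adds an item; the lock is held only for the in-memory mutation.
    public static void Enqueue(int item)
    {
        lock (_lockObj)
        {
            _pending.Add(item);
        }
    }

    // Copies and clears the shared list under the lock, then runs the slow
    // work (for example a DB call) on the private copy after releasing it.
    public static int DrainAndProcess(Action<int> slowWork)
    {
        int[] snapshot;
        lock (_lockObj)
        {
            snapshot = _pending.ToArray();
            _pending.Clear();
        }
        foreach (int item in snapshot)
        {
            slowWork(item); // runs outside the lock
        }
        return snapshot.Length;
    }

    // Waits for the lock for a bounded time, so a stuck owner produces a
    // false return value (which can be logged) rather than a silent hang.
    public static bool TryWithLock(TimeSpan timeout, Action body)
    {
        bool taken = false;
        try
        {
            Monitor.TryEnter(_lockObj, timeout, ref taken);
            if (!taken)
            {
                return false;
            }
            body();
            return true;
        }
        finally
        {
            if (taken)
            {
                Monitor.Exit(_lockObj);
            }
        }
    }
}
```

Whether a snapshot is acceptable depends on what the lock was protecting in the first place; if the slow work must be atomic with the shared state, this pattern does not apply as written.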

Related

How to identify internal failures in a Windows service

We use a lot of custom Windows services in our applications. However, the one I'm currently working on has an infuriating problem: while the service keeps running, it simply stops functioning.
The Main method of the service is wrapped in a try/catch block, like this:
static void Main()
{
    IRepository rep = new Repository();
    ILogger log = LogManager.GetLogger(GetType().Name);
    TimeSpan loadWindowStart = new TimeSpan(9, 0, 0);
    TimeSpan loadWindowEnd = new TimeSpan(18, 0, 0);
    foreach (SuppressionLoad sl in rep.GetSuppressionLoads().ToList())
    {
        try
        {
            // do stuff
        }
        catch(Exception ex)
        {
            // log error
        }
    }
}
The service also logs as it does stuff, and we can watch the logs fill up while it's busy.
Sometimes, however, the logs just stop. And activity elsewhere in the database suggests the entire service has stopped working. Checking in Services on the server, the service still shows a Status of "Started". It takes up almost zero system resources while it's in this state, although it's normally quite processor intensive. If you try and stop it, it just times out trying and, as far as we can tell, it never stops of its own accord. The process has to be killed in Task Manager.
There is nothing untoward in the log in the run up to these stalls. There is also nothing we can find in Event Viewer.
Since it doesn't log an error, I'm at a loss as to what's going on here, or what we can do to try and diagnose the fault from here. It's highly intermittent - it will often run for several days without problem before entering the state. What can we do to investigate what's going on?
Matt, obscure problems such as these are difficult to find in the best of conditions. If your service happens to use threads (which I assume it does), it becomes tremendously more difficult, and you can't rely on a global try/catch.
A simple thing to try would be NBug (no association). It will catch un-handled exceptions and give you some info about them. I don't think it will get you enough though.
The general way to find these sorts of things is log, log, log. You have to be able to come as close to recreating the problem as possible - you need logs that tell you the entry points into each method, the variable values, exception stack traces if hit, how long you spent in each method, etc. There are some really good logging tools out there, so I won't bother recommending any. You can wrap your logging in a conditional compile switch so that once you find your issue you won't suffer a performance hit when you turn it off.
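A minimal sketch of the conditional compile switch, assuming a home-grown logger instead of a real logging library: the TraceLog class and the VERBOSE_TRACE symbol are made-up names. Calls to a method marked with System.Diagnostics.ConditionalAttribute are dropped by the compiler at the call site unless the symbol is defined, so the verbose logging costs nothing in a build without it.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical logger; a real service would write to a file or a logging framework.
public static class TraceLog
{
    public static readonly List<string> Lines = new List<string>();

    // Calls to this method are removed by the compiler unless the
    // (assumed) VERBOSE_TRACE symbol is defined where the call is compiled.
    [Conditional("VERBOSE_TRACE")]
    public static void Verbose(string message)
    {
        Write("VERBOSE", message);
    }

    // Always compiled in: errors are logged regardless of the switch.
    public static void Error(string message, Exception ex)
    {
        Write("ERROR", message + " | " + ex);
    }

    private static void Write(string level, string message)
    {
        lock (Lines)
        {
            Lines.Add(DateTime.UtcNow.ToString("o") + " [" + level + "] " + message);
        }
    }
}
```

A call such as TraceLog.Verbose("entering LoadSuppression, id=" + id) then only exists in builds compiled with the symbol defined (for example via a #define at the top of the file or the project's conditional compilation symbols).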
Probably not the answer you wanted, but the only thing that has really worked for me over the years.
SteveJ
It sounds like the issue could be anywhere and doesn't necessarily have much to do with code provided.
Suggestions on how to go about it
When service hangs, attach a debugger and take a look at threads and see where each one is. You may need to rebuild and run a debug version of your solution so that debugger has necessary contextual symbol data. Questions to ask:
Are all the threads that I'm expecting actually there, or are some gone or unaccounted for?
Are threads stuck in a deadlock (I suspect that's what's happening), and if so, on what resources?
Turn on detailed logging and sprinkle in more debug log statements to isolate where in the code flow it last was and where it didn't make it to, and then keep narrowing down the location. Consider logging contextual data so that when you isolate the problematic line or code block, you have context to try to understand why the odd behavior takes place. Just be mindful of logging sensitive information (e.g. passwords, PII, etc.)
With full credit to IInspectable's comment, you can try to take a full dump of the process (Sysinternals' Process Explorer or ProcDump lets you do that, as does Task Manager). Analyzing a dump tends to be quite an involved experience, but done right it can give a lot of insight, and possibly find the issue on first occurrence.
Considering that it happens infrequently, and the field of what and where is wide open, it'll likely take a few iterations of having the problem trigger in order to narrow down the scope.

COM Add-in: Resolve the error DisconnectedContext in WinWord.exe

I built an add-on to Microsoft Word. When the user clicks a button, it runs a number of processes that export a list of Microsoft Word documents to Filtered HTML. This works fine.
Where the code falls down is in processing large numbers of files. After the file conversions are done and I call the next function, the app crashes and I get this information from Visual Studio:
Managed Debugging Assistant 'DisconnectedContext' has detected a problem in 'C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE'.
Additional information: Transition into COM context 0x56255b88 for
this RuntimeCallableWrapper failed with the following error: System
call failed. (Exception from HRESULT: 0x80010100
(RPC_E_SYS_CALL_FAILED)). This is typically because the COM context
0x56255b88 where this RuntimeCallableWrapper was created has been
disconnected or it is busy doing something else. Releasing the
interfaces from the current COM context (COM context 0x56255cb0). This
may cause corruption or data loss. To avoid this problem, please
ensure that all COM contexts/apartments/threads stay alive and are
available for context transition, until the application is completely
done with the RuntimeCallableWrappers that represents COM components
that live inside them.
After some testing, I realized that if I simply remove all the code after the file conversions, there are no problems. To resolve this, I place the remainder of my code in yet another button.
The problem is I don't want to give the user two buttons. After reading various other threads, it sounds like my code has a memory or threading issue. The answers I am reading do not help me truly understand what to do next.
I feel like this is what I want to do:
1- Run conversion.
2- Close thread/cleanup memory issue from conversion.
3- Continue running code.
Unfortunately, I really don't know how to do #2 or if it is even possible. Your help is very much appreciated.
or it is busy doing something else
The managed debugging assistant diagnostic you got is pretty gobbledygooky but that's the part of the message that accurately describes the real problem. You have a firehose problem, the 3rd most common issue associated with threading. The mishap is hard to diagnose because this goes wrong inside the Word plumbing and not your code.
Trying not to commit the same gobbledygook sin myself, what goes wrong is that the interop calls you make into the Office program are queued, waiting for their turn to get executed. The underlying "system call" that the error code hints at is PostMessage(). Wherever there is a queue, there is a risk that the queue gets too large. That happens when the producer (your program) is adding items to the queue far faster than the consumer (the Office program) removes them. The firehose problem. Unless the producer slows down, the queue will grow without bounds, and something is going to fail if it is allowed to grow endlessly; at a minimum the process runs out of memory.
It is not allowed to get close to that problem. The underlying queue that PostMessage() uses is protected by the OS. Windows fails the call when the queue already contains 10,000 messages. That's a fatal error that RPC does not know how to recover from, or rather should not try to recover from. Something is amiss and it isn't pretty. It returns an error code to your program to tell you about it. That's RPC_E_SYS_CALL_FAILED. Nothing much better happens in your program, the CLR doesn't know how to recover from it either, nor does your code. So the show is over, the interop call you made got lost and was not executed by Word.
Finding a completely reliable workaround for this awkward problem is not that straightforward. Beware that this can happen on any interop call, so catching the exception and trying again is drastically impractical. But do keep in mind that the quick-and-dirty fix is very simple. The plain problem is that your program is running too fast; slowing it down with a Thread.Sleep() or Task.Delay() call is quite crude but will always fix the issue. Well, assuming you delay enough.
I think, but don't know for a fact because nobody ever posts repro code, that this issue is also associated with using a console mode app or a worker thread in your program. If it is a console mode app then try applying the [STAThread] attribute to your Main() method. If it is a worker thread then call Thread.SetApartmentState() before starting the thread, but beware it is very important to also create the Application interface on that worker thread. Not otherwise a workaround for an add-in.
If neither of those workarounds is effective, or they are too impractical, then consider that you can automagically slow your program down, and ensure the queue is emptied, by occasionally reading something back from the Office program. Something silly, any property getter call will do. Necessarily you can't get the property value until the Office program catches up. That can still fail; there is also a 60 second time-out on the interop call. But that's something you can fix: you can call CoRegisterMessageFilter() in your program to install a callback that runs when the timeout trips. Very gobbledygooky as well, but the cut-and-paste code is readily available.
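The read-back idea can be sketched roughly as follows. This is an illustration under assumptions, not tested against Word: InteropThrottle is a made-up helper, and the interop work is passed in as delegates so the sketch stays self-contained. After every N calls it invokes a "sync" delegate, which in an add-in would be any cheap property read on the Office object model.

```csharp
using System;

// Hypothetical helper: forces a read-back from the Office program after
// every 'callsPerSync' interop calls so the producer cannot outrun it.
public sealed class InteropThrottle
{
    private readonly int _callsPerSync;
    private readonly Action _syncRead;
    private int _callsSinceSync;

    // Number of read-backs performed so far (useful for diagnostics).
    public int SyncCount { get; private set; }

    public InteropThrottle(int callsPerSync, Action syncRead)
    {
        if (callsPerSync <= 0)
        {
            throw new ArgumentOutOfRangeException("callsPerSync");
        }
        if (syncRead == null)
        {
            throw new ArgumentNullException("syncRead");
        }
        _callsPerSync = callsPerSync;
        _syncRead = syncRead;
    }

    // Runs one interop call, then reads something back once enough calls
    // have been made since the last read-back.
    public void Run(Action interopCall)
    {
        interopCall();
        _callsSinceSync++;
        if (_callsSinceSync >= _callsPerSync)
        {
            _syncRead();
            _callsSinceSync = 0;
            SyncCount++;
        }
    }
}
```

In the add-in this might be created as new InteropThrottle(50, () => { int n = wordApp.Documents.Count; }), where wordApp stands for the add-in's Word Application object and 50 is an arbitrary starting value to tune.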

Service call works in main thread, but crashes when multithreaded

My company has an application that keeps track of information related to web sites that are hosted on various machines. A central server runs a windows service that gets a list of sites to check, and then queries a service running on those target sites to get a response that can be used to update the local data.
My task has been to apply multithreading to this process to reduce the time it takes to run through all the sites (almost 3000 sites that take about 8 hours to run sequentially). The service runs through successfully when it's not multithreaded, but the moment I spread out the work to multiple threads (testing with 3 right now, plus a watcher thread) there's a bizarre crash that seems to originate from the call to the remote services that are supposed to provide the data. It's a SOAP/XML call.
When run on the test server, the service just gives up and doesn't complete its task, but doesn't stop running. When run through the debugger (Dev Studio 2010) the whole thing just stops. I'll run it, and seconds later it'll stop debugging, but not because it completed. It does not throw an exception or give me any kind of message. With breakpoints I can walk through to the point where it just stops. Event logging leads me to the same spot. It stops on the line of code that tries to get a response from the web service on the other sites. And again: it only does that when multithreaded.
I found some information that suggested there's a limit to the number of connections that defaults to 2. The proposed solution is to add some tags to the app.config, but that hasn't solved the problem...
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="20"/>
  </connectionManagement>
</system.net>
I still think it might be related to the number of allowed connections, but I have been unable to find much information about it online. Is there something straightforward I'm missing? Any help would be much appreciated.
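For reference, the same limit can also be set in code, which is one way to rule out the config file not being picked up. This is only a sketch: it assumes the SOAP proxy sits on the .NET Framework HttpWebRequest stack, and ConnectionLimitSketch is a made-up name. ServicePointManager.DefaultConnectionLimit is the property behind the default of 2, and it needs to be set before the first request is made.

```csharp
using System.Net;

// Hypothetical helper: raises the per-host connection limit in code and
// returns the value now in effect so it can be logged at startup.
public static class ConnectionLimitSketch
{
    public static int Raise(int limit)
    {
        // Applies to connections created after this point.
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }
}
```

Logging the returned value when the service starts shows whether the limit really is 20 at run time; if it is and the hang persists, the connection limit is probably not the cause.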
No crash, however bizarre, will escape the stack dump. Try going through that dump and see if it points to some obvious function.
Are you using some third-party tool or some other component for the actual service call? If yes, then please check the documentation, or contact the person who wrote it, to confirm that their components are thread safe. If they are not, you have a large task ahead. :) (I have worked on DBs which are not thread safe, so trust me, it is not very uncommon to find a few global static variables thrown around.)
Lastly, if you are 100% sure that this is due to multiple threads, then put a lock in your worker thread. Initially, say, it covers the entire main while-loop. Theoretically it should not crash now: even though it is multithreaded, you have serialized the execution.
Next step is to reduce the scope of the lock. Say there are three functions in the main while-loop, f1(), f2(), and f3(); then start locking f2() and f3() while leaving f1() unlocked. If things work out, then the problem is somewhere in f2() or f3().
I hope you get the idea of what I am suggesting.
I know this is like a blind man guessing at an elephant, but that is the best you can do if your code uses a lot of external components that are not adequately documented.
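A rough sketch of the second step of that narrowing, with stand-in functions: F1, F2, and F3 are placeholders, and the shared counter merely plays the role of the non-thread-safe component. The lock covers only F2 and F3; if the multithreaded run is stable in this configuration, the unsafe code is in one of those two.

```csharp
using System;
using System.Threading;

// Hypothetical stand-ins for the worker loop described above.
public static class LockBisectSketch
{
    private static readonly object _gate = new object();
    private static int _sharedCounter; // pretend non-thread-safe state

    private static void F1() { /* presumed thread safe, left unlocked */ }
    private static void F2() { _sharedCounter++; } // unsafe without the lock
    private static void F3() { /* under suspicion, kept inside the lock */ }

    // Runs the loop on several threads and returns the final counter, which
    // equals threads * iterations only if F2 was effectively serialized.
    public static int Run(int threads, int iterations)
    {
        _sharedCounter = 0;
        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    F1();
                    lock (_gate)
                    {
                        F2();
                        F3();
                    }
                }
            });
            workers[t].Start();
        }
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
        return _sharedCounter;
    }
}
```

From here the lock would be shrunk again, for example to F3 alone, repeating until the failure returns and the culprit is isolated.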

IIS Session hangs - How to resolve "website leaking resources to the finalizers"?

I am experiencing the exact same issue as a user reports on eggheadcafe, but don't know what steps to take after reading the following answer:
Two problems you should chase down:
1. Why is the website leaking resources to the finalizers. That is bad
2. What is Oracle code waiting on -- work with Oracle's support on it
This is the issue:
I have an intermittent problem with a web site hosted on IIS6 (w2k3 sp2). It appears to occur randomly to users when they click on a hyperlink within a page. The request is sent to the web server but a response is never returned. If the user tries to navigate to another hyperlink they are not able to (i.e. the web site appears to hang for that user). Other users of the website at the time are not affected by this hang and if the user with the problem opens a new http session (closing IE and opening the web site again) they no longer experience the hang.
I've placed a debugger (IISState) on the w3wp process with the following output. Entries with "Thread is waiting for a lock to be released. Looking for lock owner." look like they might be causing the issue. Can anyone tell what lock the process is waiting on?
Thanks
http://www.eggheadcafe.com/software/aspnet/33799697/session-hangs.aspx
In my case my .Net C# MVC application runs against a MySQL database for data and a MS SQL database for .Net membership.
I hope someone with more knowledge of IIS can help resolve this problem.
It sounds like you have a race condition in your database calls resulting in a deadlock at the database level. You may want to look at the settings you have in your application pool for database connections. Likely you will need to put some checks in somewhere or redefine procedures in order to reduce the likelihood of the race:
http://msdn.microsoft.com/en-us/library/ms178104.aspx
I would attribute the hang to session serialization. Not the part about saving/loading it from some source, but the fact that ASP.NET does not allow two requests for the same session to execute in parallel, unless they run with a read-only session. The latter is configured either in the page directive, with EnableSessionState="ReadOnly", or in web.config.
Your underlying problem still exists, though: this won't change the fact that the first thread hangs. I would verify that your database connections are disposed correctly. However, you never mention any Oracle database in your question (only MySQL and SQL Server). Why are you using the Oracle drivers at all? (This seems like a valid place to start debugging.)
However, as stated by David Wang in his answer in your linked question, part two of your problem is a lock that's never released. You'll need support from Oracle (or their source code) to debug this further.
An IIS hang is not surprising. IISState is out of date; you can use DebugDiag instead:
http://support.microsoft.com/kb/919791 (if CPU usage is high)
http://support.microsoft.com/kb/919792 (otherwise)
The hang dumps should tell you what the root cause is.
Microsoft support can help analyze the dumps, if you are not familiar with the tricks. http://support.microsoft.com

What would make PerformanceCounterCategory.Exists hang indefinitely?

I've got an application that uses performance counters and has worked for months. Now, on my dev machine and another developer's machine, it has started hanging when I call PerformanceCounterCategory.Exists. As far as I can tell, it hangs indefinitely. It does not matter which category I use as input, and other applications using the API exhibit the same behaviour.
Debugging (using MS Symbol Servers) has shown that it is a call to Microsoft.Win32.RegistryKey that hangs. Further investigation shows that it is this line that hangs:
while (Win32Native.ERROR_MORE_DATA == (r = Win32Native.RegQueryValueEx(hkey, name, null, ref type, blob, ref sizeInput))) {
This is basically a loop that tries to allocate enough memory for the performance counter data. It starts at size = 65000 and does a few iterations. In the 4th call, when size = 520000, Win32Native.RegQueryValueEx hangs.
Furthermore, rather worryingly, I found this comment in the reference source for PerformanceCounterLib.GetData:
// Win32 RegQueryValueEx for perf data could deadlock (for a Mutex) up to 2mins in some
// scenarios before they detect it and exit gracefully. In the mean time, ERROR_BUSY,
// ERROR_NOT_READY etc can be seen by other concurrent calls (which is the reason for the
// wait loop and switch case below). We want to wait most certainly more than a 2min window.
// The curent wait time of up to 10mins takes care of the known stress deadlock issues. In most
// cases we wouldn't wait for more than 2mins anyways but in worst cases how much ever time
// we wait may not be sufficient if the Win32 code keeps running into this deadlock again
// and again. A condition very rare but possible in theory. We would get back to the user
// in this case with InvalidOperationException after the wait time expires.
Has anyone seen this behaviour before ? What can I do to resolve this ?
This issue is now fixed, and since there have been no answers here, I will add one in case the question is found in future searches.
I ultimately fixed this error by stopping the print spooler service (as a temporary measure).
It looks like the reading of Performance counters actually needs to enumerate the printers on the system (confirmed by a WinDbg dump of a hanging process, where I can see in the stack trace that winspool is enumerating printers, and is stuck in a network call). This was what was actually failing on the system (and sure enough, opening the "Devices and printers" window also hung). It baffles me that a printer/network issue can actually make the performance counters go down. One would think that there was some sort of fail-safe built in for such a case.
My guess is that this is caused by a bad printer/driver on the network. I haven't re-enabled printing on the affected systems yet, since we are hunting for the bad printer.
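Since the API has no fail-safe of its own here, a caller can at least stop its own thread from blocking forever. The following is a sketch under assumptions, not a fix for the underlying hang: TimeoutProbe is a made-up helper that runs any call on a thread-pool thread and gives up waiting after a timeout. Note that the abandoned call keeps occupying its pool thread until it returns, if it ever does.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: bounds how long the caller waits for a call that may hang.
public static class TimeoutProbe
{
    // Returns true and the result if 'call' finishes within 'timeout'.
    // Returns false if it does not; the call itself is NOT cancelled.
    // An exception thrown by 'call' surfaces here as an AggregateException.
    public static bool TryRun<T>(Func<T> call, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(call);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default(T);
        return false;
    }
}
```

For the case in the question, the delegate would be something like () => PerformanceCounterCategory.Exists(categoryName), with categoryName being whatever category the application checks; a false return is then a signal to log the hang and skip the counters.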
This really didn't help in my case; any operation that uses the performance counter category hangs forever.
I think it is more a problem of memory allocation for the call, or something related to the machine's resources. I don't have a way to prove it, but exactly the same sample call to, for example, the PerformanceCounterCategory.Exists method runs just fine on a computer with 32GB of RAM, while it hangs on another with only 16GB. If I get a chance to install more memory and verify this assumption, I will update this ticket.
