App terminating due to 'Watchdog' error with Hot Restart

Not long ago, I mistakenly deleted most of the 'Local' folder on my user profile, causing numerous issues with programs that were installed there or had some assets there. In the same way, my Hot Restart provisioning was mucked up, with an error saying the keychain didn't match with whatever was set somewhere else. So, I reset all the provisioning, and I am able to build now, but upon attempting to test on my iPad, it will try to load for a while, and then crash in a very timed fashion.
The debug output simply says that the 'app has terminated' or something along the lines of Hot Restart closing due to my explicit ending of the app (I didn't touch it at any point).
I looked into the device log, and this pops up at the end:
Provision violated for watchdog scene-create: <FBSProcessResourceProvision: 0x280cfb600; allowance: <; FBSProcessResourceAllowance; type: realTime; timeValue: 16.60s>; violated: YES>
Executing termination for reason (none) with request: <FBSProcessTerminationRequest: 0x283853500; label: "watchdog provision violated"; exceptionCode: "Watchdog Violation (0x8BADF00D)"; reportType: CrashLog; explanation: "scene-create watchdog transgression: com.companyname.TraceIt exhausted real (wall clock) time allowance of 16.60 seconds">
I did my own counting, and it does seem the app attempts loading for 20+ seconds, which, of course, 'Watchdog' then terminates due to the time constraint violation.
But this didn't happen before I deleted most of the Local folder; loading typically took less than 10 seconds.
Could this be related to provisioning issues at all? I know that when the certificate is displayed in the debug output, the team profile code is wrong, but the app still seems to attempt to load.

I had a static DataRepository object created in the App class, but its constructor contained Task.Run([method]).Wait();, so the App class never reached its own constructor. The wait never completed, so nothing progressed past that point. I removed the .Wait(), and now it works.
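The shape of that fix can be sketched as follows. This is a minimal illustration, not the original code: the DataRepository members, the LoadAsync method, and the 200 ms delay are all hypothetical stand-ins. The point is only that the constructor starts the work without blocking on it, and callers wait at the place where the data is first needed.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the DataRepository described above.
public sealed class DataRepository
{
    private readonly Task _initialization;

    public DataRepository()
    {
        // Start the slow work but do NOT call .Wait() here: the constructor
        // returns immediately, so app startup is not held up.
        _initialization = Task.Run(() => LoadAsync());
    }

    // Callers await this before they first need the data.
    public Task WhenReady => _initialization;

    public bool IsLoaded { get; private set; }

    private async Task LoadAsync()
    {
        await Task.Delay(200); // placeholder for the real loading work
        IsLoaded = true;
    }
}

public static class StartupDemo
{
    public static async Task Main()
    {
        var repository = new DataRepository(); // returns right away
        Console.WriteLine($"Constructed, loaded = {repository.IsLoaded}");

        await repository.WhenReady;            // wait only where the data is needed
        Console.WriteLine($"Ready, loaded = {repository.IsLoaded}");
    }
}
```

With this arrangement the first screen can appear well inside the watchdog's launch allowance, and any page that depends on the repository awaits WhenReady before using it.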

Related

Why would MessageQueue.Send(x) fail silently on a local private queue?

My application uses a local private MessageQueue(@".\private$\queuename") to sequence messages from its multiple threads, and has been doing so successfully for a long time. Recently, an error occurred that caused several of those messages to essentially disappear without a trace. From the app's internal logging and the eyewitness account, it seems that the Send(msg) method failed to place the message into the queue and raised no error. In a debugger, I simulated that scenario by having execution skip the Send() call, and the resulting log info matches what was logged in the actual error occurrence.
Most disturbing is that the error condition existed for 45 minutes, persisting through a computer reboot and application restart, and required a second restart after the reboot to finally clear it.
This unresolved post hints that the message might end up some place other than the intended local queue. Unfortunately, all evidence of it that might have existed was gone by the time I was able to inspect the system where the error occurred. I considered using the TimeToReachQueue acknowledgement to detect failure of the Send() call, but MSDN asserts that "If the queue is local, the message always reaches the queue," although this event challenges that claim.
Recurrence of this error will be a serious problem, so if I can't prevent it, I need to be able to detect and report it. Not knowing what actually happened makes both options extremely difficult.
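One way to detect such a loss without relying on MSMQ itself is to stamp every message with a sequence number and let the receiver check for gaps. The sketch below is an assumption, not something from the question: SequencedSender, GapDetector, and the delegate that stands in for the real Send() call are all hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Sender side: stamps and sends under one lock so that sequence order
// matches the order in which messages enter the queue.
public sealed class SequencedSender
{
    private readonly object _gate = new object();
    private readonly Action<long, string> _send; // in the real app, this would wrap the queue's Send call
    private long _last;

    public SequencedSender(Action<long, string> send) { _send = send; }

    public void Send(string body)
    {
        lock (_gate) { _send(++_last, body); }
    }
}

// Receiver side: reports any sequence numbers that were skipped.
public sealed class GapDetector
{
    private long _expected = 1;

    // Returns the missing sequence numbers preceding 'received' (empty if none).
    public IReadOnlyList<long> Observe(long received)
    {
        var missing = new List<long>();
        for (long n = _expected; n < received; n++) missing.Add(n);
        if (received >= _expected) _expected = received + 1;
        return missing;
    }
}

public static class LostMessageDemo
{
    public static void Main()
    {
        var detector = new GapDetector();

        // Simulated transport that silently drops message 3 (illustration only).
        var sender = new SequencedSender((seq, body) =>
        {
            if (seq == 3) return;
            foreach (long lost in detector.Observe(seq))
                Console.WriteLine($"Message {lost} never arrived");
        });

        for (int i = 0; i < 5; i++) sender.Send("payload " + i);
    }
}
```

A gap is only noticed when a later message arrives, so a loss at the very end of a burst would need a periodic heartbeat message to surface.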

Attempted to read or write protected memory error. Suggestions please

After two months of trying to track down this random problem, I’m going to ask for some help in the hope that someone may see what I am missing or at least point me in the right direction. I have gone over many different posts from here and other websites that deal with the same error and tried a number of suggested fixes but none seem to be working.
My program is a windows service (because it needs to be able to run on machines without a user logged in) that runs out of the console when running in dev environment. It sits in-between a third-party program and a piece of hardware that is connected to the PC via a serial port. My program uses an Eltima virtual null modem com port pair to connect to the existing program and intercept messages passing between it and the hardware, which is connected to a real com port. Certain messages between the two, when detected, will trigger a JSON message to be sent through PubNub to a mobile device on a specific PubNub channel.
When the third-party program is not running, the service will continue to check the hardware for messages every second. Any important messages are stored in a database until the next time the other program connects at which point they will be sent on to it. The error has occurred when the third party program is not running (has not connected at all) and I am not getting any data on the virtual com port.
I also have a WCF servicehost duplex channel set up to communicate to another app that I wrote, which resides in the tray. This app is used for changing configurations and setups for the service, and to display pop up messages from the tray when certain conditions are present. Note that when the error happens that the tray app has not ever been connected to the service. The host is just waiting for a connection.
The error happens anywhere from within a few seconds of the program starting up and only having issued one or two requests for data from the hardware, to several days after starting and thousands of requests. When it happens, my code is not in the middle of executing any methods. At least, it doesn’t appear to be. I’ve placed debug statements at the beginning of each code block, and at all exit points and the output from that seems to show that the error happens outside of my code blocks.
I am using version 4.5.2 of the framework.
Not using any third party dlls or other unmanaged code. No OCX's or COM objects.
All my timers that I use are System.Timers.Timer type. Autoreset is set to false and I manually restart them when I’m sure that everything has been taken care of. There’s only one timer running when the error happens.
PubNub has been initialized and subscribed successfully to a channel (to listen for mobile device connection) when I am getting the error.
Some items that I have tried are:
1: Setting the build target to x86.
2: Setting legacyCorruptedStateExceptionsPolicy to true in the app.config file.
3: Setting legacyUnhandledExceptionPolicy to true in the app.config file.
4: Adding a trace to my WCF service to see if it has any errors that show up before the Access Violation and I don’t see any.
5: All my collections and dictionaries that I use are of the Concurrent type.
6: Tried putting [HandleProcessCorruptedStateExceptions] on each code block.
7: Added an AppDomain.CurrentDomain.UnhandledException UnhandledExceptionEventHandler.
8: Unchecked “Suppress JIT optimization”.
9: Set UseSynchronizationContext = false in ServiceBehavior attributes.
And probably some others that I have forgotten about in my efforts.
From WinDbg, and one of my minidumps. (I assume that since I’m building for x86, that I have to use the 32 bit version of WinDbg, which I am. I also have the 64 bit and tried that, but its results pointed me at a line in a file that does not have any code on it.)
The Managed Call Stack:
062ff6a8 7235a5e4 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\BCL\system\threading\overlapped.cs # 117]
062ff7a4 7341eb16 [GCFrame: 062ff7a4]
062ff8a8 7341eb16 [DebuggerU2MCatchHandlerFrame: 062ff8a8]
The unmanaged Call Stack on the faulting thread:
062feb48 768815ce ntdll!ZwWaitForSingleObject+0x15
062febb4 770d1194 KERNELBASE!WaitForSingleObjectEx+0x98
062febcc 7354008b kernel32!WaitForSingleObjectExImplementation+0x75
062febfc 735400d2 clr!CLREventWaitHelper2+0x33
062fec4c 73540057 clr!CLREventWaitHelper+0x2a
062fec84 73540152 clr!CLREventBase::WaitEx+0x152
062fec94 71f5cb48 clr!CLREventBase::Wait+0x1a
WARNING: Stack unwind information not available. Following frames may be wrong.
00000000 00000000 WRusr!SynProc+0xee68
Thread processor time shows faulting thread 13 is not in a runaway:
0:013> !runaway
User Mode Time
Thread Time
6:1670 0 days 0:00:11.965
18:44e0 0 days 0:00:00.280
13:16dc 0 days 0:00:00.140
2:2500 0 days 0:00:00.093
0:43c0 0 days 0:00:00.046
I’ve put a screenshot of the code from overlapped.cs below. I got the file from Microsoft’s repository.
I also have a screenshot of my parallel stacks. The one thread in bold with the yellow pointer is the error. Also shown is the thread for the PubNub subscription, and my 2 serial ports waiting for com events.
So, any suggestions would be greatly appreciated.

COM Add-in: Resolve the error DisconnectedContext in WinWord.exe

I built an add-on to Microsoft Word. When the user clicks a button, it runs a number of processes that export a list of Microsoft Word documents to Filtered HTML. This works fine.
Where the code falls down is in processing large numbers of files. After the file conversions are done and I call the next function, the app crashes and I get this information from Visual Studio:
Managed Debugging Assistant 'DisconnectedContext' has detected a problem in 'C:\Program Files\Microsoft Office\root\Office16\WINWORD.EXE'.
Additional information: Transition into COM context 0x56255b88 for
this RuntimeCallableWrapper failed with the following error: System
call failed. (Exception from HRESULT: 0x80010100
(RPC_E_SYS_CALL_FAILED)). This is typically because the COM context
0x56255b88 where this RuntimeCallableWrapper was created has been
disconnected or it is busy doing something else. Releasing the
interfaces from the current COM context (COM context 0x56255cb0). This
may cause corruption or data loss. To avoid this problem, please
ensure that all COM contexts/apartments/threads stay alive and are
available for context transition, until the application is completely
done with the RuntimeCallableWrappers that represents COM components
that live inside them.
After some testing, I realized that if I simply remove all the code after the file conversions, there are no problems. To resolve this, I place the remainder of my code in yet another button.
The problem is I don't want to give the user two buttons. After reading various other threads, it sounds like my code has a memory or threading issue. The answers I am reading do not help me truly understand what to do next.
I feel like this is what I want to do:
1- Run conversion.
2- Close thread/cleanup memory issue from conversion.
3- Continue running code.
Unfortunately, I really don't know how to do #2 or if it is even possible. Your help is very much appreciated.

"or it is busy doing something else"
The managed debugging assistant diagnostic you got is pretty gobbledygooky but that's the part of the message that accurately describes the real problem. You have a firehose problem, the 3rd most common issue associated with threading. The mishap is hard to diagnose because this goes wrong inside the Word plumbing and not your code.
Trying not to commit the same gobbledygook sin myself, what goes wrong is that the interop calls you make into the Office program are queued, waiting for their turn to get executed. The underlying "system call" that the error code hints at is PostMessage(). Wherever there is a queue, there is a risk that the queue gets too large. That happens when the producer (your program) is adding items to the queue far faster than the consumer (the Office program) removes them. The firehose problem. Unless the producer slows down, the queue will grow without bounds and something is going to fail if it is allowed to grow endlessly; at a minimum, the process runs out of memory.
It is not allowed to get close to that problem. The underlying queue that PostMessage() uses is protected by the OS. Windows fails the call when the queue already contains 10,000 messages. That's a fatal error that RPC does not know how to recover from, or rather should not try to recover from. Something is amiss and it isn't pretty. It returns an error code to your program to tell you about it. That's RPC_E_SYS_CALL_FAILED. Nothing much better happens in your program, the CLR doesn't know how to recover from it either, nor does your code. So the show is over, the interop call you made got lost and was not executed by Word.
Finding a completely reliable workaround for this awkward problem is not that straightforward. Beware that this can happen on any interop call, so catching the exception and trying again is drastically impractical. But do keep in mind that the Q+D fix is very simple. The plain problem is that your program is running too fast; slowing it down with a Thread.Sleep() or Task.Delay() call is quite crude but will always fix the issue. Well, assuming you delay enough.
I think, but don't know for a fact because nobody ever posts repro code, that this issue is also associated with using a console mode app or a worker thread in your program. If it is a console mode app then try applying the [STAThread] attribute to your Main() method. If it is a worker thread then call Thread.SetApartmentState() before starting the thread, but beware it is very important to also create the Application interface on that worker thread. Not otherwise a workaround for an add-in.
If neither of those workarounds is effective, or they are too impractical, then consider that you can automagically slow your program down, and ensure the queue is emptied, by occasionally reading something back from the Office program. Something silly, any property getter call will do. Necessarily you can't get the property value until the Office program catches up. That can still fail; there is also a 60 second time-out on the interop call. But that's something you can fix: you can call CoRegisterMessageFilter() in your program to install a callback that runs when the timeout trips. Very gobbledygooky as well, but the cut-and-paste code is readily available.
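The read-back idea can be sketched in a few lines. This is only an illustration under assumptions: InteropThrottle and its parameters are hypothetical names, and the stub delegates stand in for real interop calls. In an actual add-in, the readBack delegate might read a cheap property such as the Word application's Documents.Count.

```csharp
using System;
using System.Collections.Generic;

public static class InteropThrottle
{
    // Runs each call in order. After every 'syncEvery' calls it invokes
    // 'readBack', which should be a cheap property read that cannot return
    // until the callee has worked through its queue.
    // Returns how many read-backs were performed.
    public static int RunWithReadBack(IEnumerable<Action> calls, Action readBack, int syncEvery)
    {
        if (syncEvery <= 0) throw new ArgumentOutOfRangeException(nameof(syncEvery));

        int issued = 0, syncs = 0;
        foreach (var call in calls)
        {
            call();
            if (++issued % syncEvery == 0)
            {
                readBack();
                syncs++;
            }
        }
        return syncs;
    }
}

public static class ThrottleDemo
{
    public static void Main()
    {
        // Stubs standing in for interop calls (illustration only).
        var calls = new List<Action>();
        for (int i = 0; i < 10; i++)
        {
            int n = i;
            calls.Add(() => Console.WriteLine($"interop call {n}"));
        }

        int syncs = InteropThrottle.RunWithReadBack(
            calls,
            () => Console.WriteLine("read-back: waiting for the callee to catch up"),
            syncEvery: 4);

        Console.WriteLine($"read-backs performed: {syncs}");
    }
}
```

The value of syncEvery is a tuning knob: smaller values keep the queue shorter at the cost of more round trips.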

The Mystery of the Vanishing EF Call

Today I got an emergency call from the users on our ASP.NET production system. Some users (not all) were unable to enter certain data. The user posted the data, and the system then froze; the call never returned.
We tried to repro the problem on the QA system (which has a fresh restore of production data), and could not. I then ran from my dev environment and connected directly to the production DB, masquerading as one of the affected users. Again, no problem. Conclusion: there must be some kind of issue in the production environment, probably somewhere in the IIS process that's hosting the website.
So I fired up Visual Studio on the production server, and attached to the IIS process (Kids, don't do this at home!), set a breakpoint in the offending code, logged in as the user, and attempted to save the data. Hit the breakpoint and stepped line by line, until I hit a line of code like this:
try
{
    ...
    using (var db = new MyDataContext())
    {
        ...
        var fooToUpdate = db.Foos.Single(f => f.ID == fooId); // <-- THIS LINE
        ...
    }
}
catch (Exception ex)
{
    // some error logging
}
After hitting "step" on that line, the thread simply vanished. Disappeared without a trace. I put a sniffer on the database, and no query was fired; needless to say there was no DB locking involved. No exception was thrown. The code entered Entity Framework and never left.
The data is structured so that every user has a different and unique fooId for every day, so no other user will have the same fooId. Most users were able to load their Foo, but a select handful of users consistently fail to load their personal Foo. I tried running the query to load the Foo in an SSMS window; no trouble at all. The only time it fails is in this particular IIS process, on the production server.
Now, I could just recycle the app pool or restart IIS, and that would probably paper over the problem. But something similar happened a week ago, and we couldn't trace it then, either. So we reset IIS then, hoping the problem would go away. And it did, for a week. And now it's back.
Does anyone have any ideas how it is possible for a thread to simply vaporize like this? Is Norman Bates hiding behind the EF door?

Given the fact that the thread did not magically vaporize, we could speculate some of the more likely options:
The debugger had a hard time following the production code compiled in Release mode. Just because debugging Release code works 90% of the time, don't fall under the illusion that it is dependable. Optimized code can very quickly throw the debugger off the track of actual execution. When this happens, it will look like the thread just vanished.
Assuming the thread does legitimately enter the call and not return (which seems to be supported by the original complaint of the application "freezing"), then the most likely scenario is a deadlock of some type. Entity Framework deadlocks are not common, but not unheard of either. The most common issues I'm aware of usually involve TransactionScope or CommittableTransaction. Are you using any transactions in the omitted code sections?

Turns out that the EF part was a red herring after all. I went and downloaded Telerik's JustDecompile and JustCode, in the hope of stepping into the EF code, but when I stepped in to that line, I found myself not in the Single() extension method, but inside one of my own method calls - that I thought I had executed on the previous line. Evidently the code was not perfectly in sync with the version in production.
LESSON 1: If you attach to a process, your execution point may not be where you think it is, if your code is not identical to the code that was compiled into that process.
So anyway, now that I could step into the code without decompiling anything, the first thing I noticed was:
lock (_lockObj)
{
    ...
}
And when I tried to step into it, it froze there. Smoking gun.
So somewhere, some other thread is locking this object. I looked at other places where the lock is invoked, which led to a spaghetti of dependencies, along with another code-locked segment, with several DB calls and even a transaction boundary. It could be a code lock / DB transaction deadlock, though a brief scan of the code in the DB transaction failed to pick up any contenders within the life of the transaction for blocking anything else. Plus, there's the evidence of the DB not showing any blocking or open transactions. Rather, it may just be the fact of a few hundred queued-up long-running processes, all inside code locks inside code locks, and in the end it all looks something like the West Side Highway at 17:05 on a Friday, with a jackknifed trailer truck lying across 3 lanes approaching the GW bridge.
LESSON 2: Code locks are dangerous, not only - but especially - when used in conjunction with DB transactions. Try to find ways to make your code thread safe without using code locks. And if you really must use code locks, make sure you get in and out as quickly as possible. Don't give your thread a magazine to read while it's occupying the only stall, so to speak.
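As a rough illustration of "get in and out as quickly as possible", the sketch below holds the lock only for dictionary reads and writes and performs the slow work (a DB call in the scenario above) with no lock held. ResultCache and slowLoad are hypothetical names, not code from the production system.

```csharp
using System;
using System.Collections.Generic;

public sealed class ResultCache
{
    private readonly object _lockObj = new object();
    private readonly Dictionary<int, string> _results = new Dictionary<int, string>();

    public string GetOrCompute(int id, Func<int, string> slowLoad)
    {
        // Short critical section: just a dictionary lookup.
        lock (_lockObj)
        {
            if (_results.TryGetValue(id, out var cached)) return cached;
        }

        // The slow work runs with no lock held, so other threads are not queued behind it.
        string loaded = slowLoad(id);

        // Short critical section: just a dictionary write.
        lock (_lockObj)
        {
            // Another thread may have stored a value in the meantime; keep the first one.
            if (!_results.TryGetValue(id, out var existing))
            {
                _results[id] = loaded;
                existing = loaded;
            }
            return existing;
        }
    }
}

public static class LockDemo
{
    public static void Main()
    {
        var cache = new ResultCache();
        Console.WriteLine(cache.GetOrCompute(1, id => "foo-" + id));
        Console.WriteLine(cache.GetOrCompute(1, id => "never used"));
    }
}
```

The trade-off is that two threads asking for the same id at the same moment may both run slowLoad. That is usually cheaper than making every caller wait behind one long-running call.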

How can I schedule Debugs?

There is a check-in/check-out program here at my workplace. It only takes the data from the check-in/check-out machine and stores it in our database. Suddenly, out of nowhere, it started to report an error on Thursdays, but only once, at a random time during the day. When I detect the error, I run the program but nothing happens. So I want to debug it every 5-10 minutes to see if I can catch the error and see what is happening. How can I do this?

Logging is your friend. Add lots of logging (either use the built-in Trace logging or use some framework such as log4net). Use the different log levels to control how much logging you get out. At verbose levels you can for instance log when you enter and exit important methods, log the input arguments and return values. Log in catch blocks and so on. Then analyse the log files after the next error is reported.
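A minimal sketch of entry/exit logging with the built-in Trace class might look like this. The MethodLog helper, the log file name, and the ReadPunches example are hypothetical; Trace calls are only compiled in when the TRACE symbol is defined, which is assumed here.

```csharp
using System;
using System.Diagnostics;

public static class MethodLog
{
    // Default sink is the built-in Trace class; swap it out to log elsewhere.
    public static Action<string> Sink = message => Trace.WriteLine(message);

    // Logs entry, exit (with the return value), and any exception, then rethrows.
    public static T Run<T>(string name, Func<T> body)
    {
        Sink($"{DateTime.UtcNow:O} ENTER {name}");
        try
        {
            T result = body();
            Sink($"{DateTime.UtcNow:O} EXIT  {name} -> {result}");
            return result;
        }
        catch (Exception ex)
        {
            Sink($"{DateTime.UtcNow:O} ERROR {name}: {ex}");
            throw;
        }
    }
}

public static class LoggingDemo
{
    public static void Main()
    {
        // Send Trace output to a file (file name chosen for illustration).
        Trace.Listeners.Add(new TextWriterTraceListener("checkin.log"));
        Trace.AutoFlush = true;

        int rows = MethodLog.Run("ReadPunches", () => 42); // placeholder for real work
        Console.WriteLine($"rows read: {rows}");
    }
}
```

Because the exception is logged with its full stack trace before being rethrown, the log file from the next Thursday failure should show which method was running and why it failed.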

What kind of error logging are you currently implementing in this application? If none, would you consider adding comprehensive application logging with a tool such as log4net, or, if this is a web application, ELMAH?
This way you can log every error that happens along with its details, like stack trace to track down the problem.

Some thoughts:
Check the server event log to see if there are any crash minidumps you can pull out. These can tell you a lot about what happened when the program crashed (call stack, etc).
Or write a wrapper program that can run your program and detect when it fails, then take a snapshot of the server's state at that moment so you can (hopefully) re-execute the task with the necessary data to get a repeatable crash in your debugger.
Or just add loads of logging. You could use PostSharp to add trace that tells you every method that you enter and exit, so you can easily determine which method was running when it failed.
And you can add robustification code. Check religiously for nulls, etc. and you may well find you've corrected the problem without necessarily knowing which fix fixed it.
And if the program's not too big, just being old fashioned and desk checking (reading a print-out of) the code may well turn up some bugs.
Another approach (getting a bit more experimental) might be to modify the program to run continuously so you can stick it in a debugger and leave it till you hit an exception. (run a loop, wait for a trigger file to be refreshed or something, and then kick off the normal process - about 5 lines of code would probably suffice)
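The trigger-file loop from the last suggestion could look roughly like this. It is a sketch under assumptions: the TriggerLoop class, the file name, the one-second poll interval, and the five-second demo cutoff are all made up for illustration.

```csharp
using System;
using System.IO;
using System.Threading;

public static class TriggerLoop
{
    // Returns true when the trigger file has been written since 'lastSeen',
    // and advances 'lastSeen' to the new timestamp.
    public static bool TriggerRefreshed(string path, ref DateTime lastSeen)
    {
        if (!File.Exists(path)) return false;

        DateTime stamp = File.GetLastWriteTimeUtc(path);
        if (stamp <= lastSeen) return false;

        lastSeen = stamp;
        return true;
    }

    // Polls once per second and runs 'job' each time the trigger file is refreshed.
    public static void Run(string path, Action job, CancellationToken stop)
    {
        DateTime lastSeen = DateTime.MinValue;
        while (!stop.IsCancellationRequested)
        {
            if (TriggerRefreshed(path, ref lastSeen)) job();
            Thread.Sleep(1000);
        }
    }
}

public static class TriggerDemo
{
    public static void Main()
    {
        // Stop after five seconds so the demo terminates; a real run under the
        // debugger would use a token that is never cancelled.
        using (var stop = new CancellationTokenSource(TimeSpan.FromSeconds(5)))
        {
            TriggerLoop.Run("trigger.txt", () => Console.WriteLine("running the normal process"), stop.Token);
        }
    }
}
```

Started under a debugger with break-on-exception enabled, the process stays alive between runs, so the Thursday failure would stop execution at the point where it is thrown.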
