My application uses a local private MessageQueue(@".\private$\queuename") to sequence messages from its multiple threads, and has been doing so successfully for a long time. Recently, an error occurred that caused several of those messages to essentially disappear without a trace. From the app's internal logging and the eyewitness account, it seems that the Send(msg) method failed to place the message into the queue and raised no error. In a debugger, I simulated that scenario by having execution skip the Send() call, and the resulting log info matches what was logged in the actual error occurrence.
Most disturbing is that the error condition existed for 45 minutes, persisting through a computer reboot and application restart, and required a second restart after the reboot to finally clear it.
This unresolved post hints that the message might end up some place other than the intended local queue. Unfortunately, all evidence of it that might have existed was gone by the time I was able to inspect the system where the error occurred. I considered using the TimeToReachQueue acknowledgement to detect failure of the Send() call, but MSDN asserts that "If the queue is local, the message always reaches the queue," although this event challenges that claim.
Recurrence of this error will be a serious problem, so if I can't prevent it, I need to be able to detect and report it. Not knowing what actually happened makes both options extremely difficult.
Not long ago, I mistakenly deleted most of the 'Local' folder on my user profile, causing numerous issues with programs that were installed there or had some assets there. In the same way, my Hot Restart provisioning was mucked up, with an error saying the keychain didn't match with whatever was set somewhere else. So, I reset all the provisioning, and I am able to build now, but upon attempting to test on my iPad, it will try to load for a while, and then crash in a very timed fashion.
The debug output simply says that the 'app has terminated' or something along the lines of Hot Restart closing due to my explicit ending of the app (I didn't touch it at any point).
I looked into the device log, and this pops up at the end:
Provision violated for watchdog scene-create: <FBSProcessResourceProvision: 0x280cfb600; allowance: <; FBSProcessResourceAllowance; type: realTime; timeValue: 16.60s>; violated: YES>
Executing termination for reason (none) with request: <FBSProcessTerminationRequest: 0x283853500; label: "watchdog provision violated"; exceptionCode: "Watchdog Violation (0x8BADF00D)"; reportType: CrashLog; explanation: "scene-create watchdog transgression: com.companyname.TraceIt exhausted real (wall clock) time allowance of 16.60 seconds">
I did my own counting, and the app does seem to spend 20+ seconds trying to load, at which point the watchdog terminates it for violating the time limit.
But this didn't happen before I deleted most of the Local folder; loading typically took less than 10 seconds.
Could this be related to the provisioning issues at all? I know that when the certificate is displayed in the debug output, the team profile code is wrong, but the app still seems to attempt to load.
I had a static DataRepository object created in the App class, but in its constructor was Task.Run([method]).Wait();, so the App class never reached the class' actual constructor. Obviously, it was an infinite wait, hence nothing progressed until that finished. I removed the .Wait, now it works.
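One way to avoid blocking during startup at all is to move the slow work out of the constructor and into an async factory. The following is only a sketch: the members of DataRepository, the AppServices class, and the Task.Delay placeholder are all assumptions made for illustration, not the code from the post.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the DataRepository mentioned above.
public sealed class DataRepository
{
    public bool IsLoaded { get; private set; }

    // The constructor stays cheap and never blocks on a Task.
    private DataRepository() { }

    // Async factory: callers await this instead of calling .Wait() in a constructor.
    public static async Task<DataRepository> CreateAsync()
    {
        var repo = new DataRepository();
        await repo.LoadAsync().ConfigureAwait(false);
        return repo;
    }

    // Placeholder for the real loading work (file or database access).
    private async Task LoadAsync()
    {
        await Task.Delay(50).ConfigureAwait(false);
        IsLoaded = true;
    }
}

// Hypothetical holder class; in the post this role is played by the App class.
public static class AppServices
{
    // Lazy<Task<T>> starts the load once, on first access, without blocking startup.
    private static readonly Lazy<Task<DataRepository>> repository =
        new Lazy<Task<DataRepository>>(() => DataRepository.CreateAsync());

    public static Task<DataRepository> GetRepositoryAsync() => repository.Value;
}
```

Code that needs the repository would then await AppServices.GetRepositoryAsync(), so the first screen can appear before the data has finished loading.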
We use RabbitMQ to send messages to a server for processing.
We require the server to ack a message. That way if the server happens to die whilst processing the message, we will retry the message when it restarts, or with a different server.
The problem is, on a very rare occasion, we will get a message that deterministically crashes the server. This is because we call into some open source native dlls, those dlls have bugs, and sometimes these dlls just cause the process to crash with no exception. Of course it would be ideal to fix those bugs, but we don't expect to fix all such issues in pdfium or opencv any time soon. We have to reckon with the fact that whatever we do, we will eventually get such a message.
The result of this is that the message is then retried, the server restarts, picks up the message, crashes, and so on ad infinitum. Nothing gets processed till we manually stop the server and purge the message. Not ideal.
What can we do to solve this problem?
What we don't want to do is create another service that monitors the rabbitmq service, looks for such messages and purges them, since that just leads to spiralling complexity. Instead we want to deal with this at the rabbitmq client level. We would be perfectly happy to say that if a message is not processed 3 times, we should just fail the message. We could do this by maintaining a database entry of which messages we've processed, but ideally I wouldn't want to involve anything external, and just contain the solution to this problem in our rabbitmq client library. I'm not sure how to do this though.
One method I have used in my event-driven architecture is dead letter exchanges (DLXs), or poison queues. If we see the same message multiple times due to service failure, it is pushed into the DLX instead of being re-queued into the original exchange. These messages then trigger a different process within our system that alerts us that messages are stuck and failing to process, so we can diagnose and fix the consumer. After a fix has been made, we trigger another process to move the poison messages back into the original exchange to be processed as normal.
In your scenario, because your process crashes, there are two possible options to deal with these messages:
If the message is marked as redelivered, clone the message and add an attempt count to the body or as a header (x-attempt-count). The copy is then added to the back of the queue with the attempt count. When the copy is consumed, you can check whether it has hit the threshold and, if so, move the message into a DLX or store it in a database. The major drawback here is that it breaks the order in which the messages are processed.
Use an external service to keep track of the number of delivery attempts. I would recommend something like Redis or Memcached, where you can increment a counter based on a unique message id. At the start of your process, if the message has been marked as redelivered, look up the counter. If the message has reached the threshold, trigger a different process, such as moving it into a DLX.
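The decision logic of the first option can be sketched independently of any client library. In the sketch below only the header name x-attempt-count comes from the answer; the DeliveryAction enum, the PoisonMessagePolicy class, and its method names are hypothetical, and the caller is assumed to supply the redelivered flag and headers it received from its RabbitMQ client.

```csharp
using System;
using System.Collections.Generic;

public enum DeliveryAction { Process, RequeueCopyWithCount, DeadLetter }

public static class PoisonMessagePolicy
{
    // Header name taken from the answer above; everything else here is illustrative.
    public const string AttemptHeader = "x-attempt-count";

    // Reads the attempt count; a missing or unreadable header counts as 0.
    // Note: this sketch only understands numeric or string header values.
    public static int ReadAttempts(IDictionary<string, object> headers)
    {
        if (headers != null
            && headers.TryGetValue(AttemptHeader, out object raw)
            && raw != null
            && int.TryParse(Convert.ToString(raw), out int attempts))
        {
            return attempts;
        }
        return 0;
    }

    // Decides what the consumer should do with a delivery before touching its body.
    public static DeliveryAction Decide(bool redelivered, IDictionary<string, object> headers, int maxAttempts)
    {
        if (ReadAttempts(headers) >= maxAttempts)
        {
            return DeliveryAction.DeadLetter;            // threshold reached: stop retrying
        }
        return redelivered
            ? DeliveryAction.RequeueCopyWithCount        // publish a counted copy, ack the original
            : DeliveryAction.Process;                    // fresh delivery: process normally
    }

    // Builds the headers for the copy, with the attempt count incremented by one.
    public static IDictionary<string, object> WithNextAttempt(IDictionary<string, object> headers)
    {
        var copy = headers == null
            ? new Dictionary<string, object>()
            : new Dictionary<string, object>(headers);
        copy[AttemptHeader] = ReadAttempts(headers) + 1;
        return copy;
    }
}
```

With maxAttempts set to 3, a message that crashes the server every time is processed three times and then dead-lettered. The redelivered flag can also be set for reasons other than a crash (a dropped connection, for example), so the count is an upper bound on crashes rather than an exact figure.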
A process runs that parses many xml files, and within each file there are many nodes that represent work to be done.
Currently, errors that occur are logged to log4net one at a time, logging to a file once per error, and also generating an email once per error. This is not ideal, so I'm working to roll up all the errors that occur during one parsing session so they can be sent as a single email. For example, let's say the creator of the files accidentally left a crucial field out of each work node. Instead of 4,000 emails, I'd like to send a single email with 4,000 results, plus a summary at the top saying "4,000 field missing errors" and so on.
The trick comes when dealing with exceptions, because I don't know in advance what exception will occur. So, the previous state of the system worked well for tolerating exceptions because it flushed its knowledge as soon as it acquired it (sent an email, for example). But now I'm seeing that waiting until the end to send the email risks not sending an email at all. Let's say the nature of a repeating error is such that eventually it causes a system-level exception. By losing any error reporting at all, I never get the information on the repeating error that (almost certainly) led up to the big crash.
How can I get the best of both worlds, where logging is saved until the end, but whatever work is done and whatever information is collected has a decent chance of being logged even if an error occurs?
One thought I had was sending log information to a custom log4net appender that does caching or batching, and thus allows modification of the log message over time until it is finally triggered to be sent/logged (perhaps accepts individual items and a lambda that it executes at the end to assemble those items into a loggable result). Then, in my application:
var loggingContext = CreateNewLoggingContext();
try {
    // Note: all reasonable and known or handleable exceptions will be
    // caught from within this method. We're talking unforeseen or
    // difficult to handle errors, here.
    ParseAndProcessABunchOfFiles(loggingContext);
    // within this method repeatedly use loggingContext.Push("error id", "error message");
}
catch (Exception e) {
    loggingContext.Push("System-level exception", e);
    loggingContext.Flush();
    throw;
}
loggingContext.Flush();
But I am not particularly excited about catching every exception. An OutOfMemoryException is going to make it pretty hard to do an operation such as logging that might require even more memory. Or a disk full exception is going to make it hard to add to the log file on disk (though it would be grand if it still tried to run its email result and perhaps also log to the database).
How can I achieve these goals in a reasonable way?
UPDATE
Part of what's troubling me is distinguishing at each level in the call stack what is a recoverable error and what's one that should terminate the whole process. I guess what I have to do is simply catch the ones I can think of and reasonably expect, then as other exceptions show up, add specific handling for them.
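One possible shape for such a logging context is sketched below. It is not a log4net appender; BatchingLogContext and its two delegates are hypothetical names. The assumption is that every entry is also written through immediately to a cheap sink (for example a local file), so that a crash before Flush() loses only the summary email and not the individual entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed class BatchingLogContext
{
    private readonly List<KeyValuePair<string, string>> items =
        new List<KeyValuePair<string, string>>();
    private readonly Action<string> send;   // assumed: sends one combined report, e.g. an email
    private readonly Action<string> spill;  // assumed: appends one line to a durable local log

    public BatchingLogContext(Action<string> send, Action<string> spill = null)
    {
        this.send = send ?? throw new ArgumentNullException(nameof(send));
        this.spill = spill;
    }

    public void Push(string errorId, string message)
    {
        items.Add(new KeyValuePair<string, string>(errorId, message));
        // Write through immediately so a later crash does not lose this entry.
        spill?.Invoke(errorId + ": " + message);
    }

    public void Push(string errorId, Exception e) => Push(errorId, e.ToString());

    // Builds one report: per-error counts first, then every entry in order.
    public void Flush()
    {
        if (items.Count == 0) return;

        var report = new StringBuilder();
        foreach (var group in items.GroupBy(i => i.Key))
        {
            report.AppendLine(group.Count() + " x " + group.Key);
        }
        report.AppendLine();
        foreach (var item in items)
        {
            report.AppendLine(item.Key + ": " + item.Value);
        }

        send(report.ToString());
        items.Clear();   // only reached if send did not throw
    }
}
```

The two Push overloads match the calls in the snippet above. Flushing every N entries as well as at the end would further limit what an OutOfMemoryException can take with it, at the cost of more than one email per session.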
I have a Windows service which receives messages via RabbitMQ. Each message triggers an event handler which does some work and then attempts to persist the result to the database. It's threaded using:
ThreadPool.QueueUserWorkItem(ProcessMessageOnThread, messageReceived);
where ProcessMessageOnThread is a method which does the work on the messageReceived which is a representation of the message dequeued from RabbitMQ.
Under normal circumstances the Windows service operates as expected: it dequeues, processes, and persists.
I want to ensure that all of my messages are given a fair chance to be processed, so if I can't open a connection to SQL Server I simply requeue the message to be processed again (hopefully SQL Server will be back by then; otherwise this continues, and I'm fine with that).
The problem comes when the process has been running as expected for a period of time, the SQL Server connection pool has filled up, and SQL Server is then disconnected. This is when things get a bit unstable.
One of two things can happen:
An exception is thrown on connection.Open() - however I'm catching this and so not worried about it
An exception is thrown on cmd.ExecuteNonQuery() - which is where I'm executing a stored procedure
It is the second option that I need to figure out how to handle. Previously I assumed that any exception here meant that there was a problem with the data I was passing into the stored procedure and therefore should just move it out of the queue and have something else analyse it.
However, now I think I need a new approach to handle the cases where the exception is to do with the connection not actually being established.
I've had a look at the SqlException class and noticed a property called Class, which is described as "Gets the severity level of the error returned from SQL Server". The documentation goes on to say:
Messages with a severity level of 10 or less are informational and indicate problems caused by mistakes in information that a user has entered. Severity levels from 11 through 16 are generated by the user, and can be corrected by the user. Severity levels from 17 through 25 indicate software or hardware errors. When a level 17, 18, or 19 error occurs, you can continue working, although you might not be able to execute a particular statement.
Does this mean that, to fix my exception handling, I can just check if (ex.Class > 16) and requeue the message because the problem is with the connection, and otherwise throw it away because it is most likely due to malformed data being sent to the stored procedure?
So the question is: how should I handle exceptions here, and how can I detect, when calling cmd.ExecuteNonQuery(), whether the exception was thrown because of a disconnected connection?
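The rule proposed in the question can be written as a small, separately testable function. This is only a sketch of that proposal, not a verified classification: FailureAction and SqlFailurePolicy are hypothetical names, the threshold of 16 is the one from the question, and the extra check on the connection state is an added assumption.

```csharp
using System;

public enum FailureAction { Requeue, MoveToErrorQueue }

public static class SqlFailurePolicy
{
    // severityClass: the value of SqlException.Class.
    // connectionOpen: whether the connection still reports itself as open after the failure.
    public static FailureAction Classify(byte severityClass, bool connectionOpen)
    {
        // Assumption: a connection that is no longer open points at an
        // infrastructure problem rather than bad data.
        if (!connectionOpen)
        {
            return FailureAction.Requeue;
        }

        // Rule from the question: severity above 16 is a software or hardware
        // error, so the message itself is probably fine and can be retried.
        if (severityClass > 16)
        {
            return FailureAction.Requeue;
        }

        // Severity 16 or lower: most likely caused by the data that was sent.
        return FailureAction.MoveToErrorQueue;
    }
}
```

Inside a catch (SqlException ex) block this would be called as SqlFailurePolicy.Classify(ex.Class, connection.State == ConnectionState.Open). Some connection-related failures may be reported with a lower severity, so the threshold is best treated as a starting point and checked against the errors actually observed in production.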
Update:
I've previously experienced problems with connections not being returned to the pool (this was due to threading issues) and have fixed them, so I'm confident the issue isn't to do with connections not going back into the pool. The logic around what the connections are used for is also very simple, and I'm ensuring they are closed consistently. So I'm more interested in answers to do with SQL Server being disconnected and how to capture the resulting behaviour of cmd.ExecuteNonQuery().
Connections in the connection pool can get into a weird state for various reasons, all of which have to do with poor application design:
Closing the connection before its associated data reader
Changing a setting (like transaction isolation level) that the pool does not reset
Starting an asynchronous query (BeginExecuteReader) and then returning the connection to the pool before the asynchronous handler fires
You should investigate your application and make sure connections are properly returned to the pool. One thing that can help debugging is reducing the size of the connection pool in a development setting. You change the size of the pool in the connection string:
...;Integrated Security=SSPI;Max Pool Size=2;Pooling=True;
This makes pooling issues much easier to reproduce.
If you can't find the cause, but still need to deploy a fix, you could use one of ClearPool or ClearAllPools. A good place to do that is when you detect one of the suspicious exceptions after Open() or ExecuteNonQuery(). Both are static methods on the SqlConnection class:
SqlConnection.ClearPool(yourConnection);
Or for an even rougher approach:
SqlConnection.ClearAllPools();
Note that this is basically Pokémon Exception Handling. If it works, you'll have no idea why. :)
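If you do go down that road, the reset-and-retry step could be isolated in one helper so it is at least easy to find and remove later. A minimal sketch follows; PoolRecovery and its parameter names are hypothetical, and the caller decides which exceptions count as suspicious.

```csharp
using System;

public static class PoolRecovery
{
    // work:         the database operation; it must open a fresh connection on every call.
    // isSuspicious: decides whether an exception looks like a stale pooled connection.
    // resetPool:    for example () => SqlConnection.ClearAllPools().
    public static T RunWithPoolReset<T>(
        Func<T> work,
        Func<Exception, bool> isSuspicious,
        Action resetPool)
    {
        try
        {
            return work();
        }
        catch (Exception ex) when (isSuspicious(ex))
        {
            // Drop the pooled connections, then retry exactly once.
            // A second failure propagates to the caller unchanged.
            resetPool();
            return work();
        }
    }
}
```

Limiting the retry to a single attempt keeps a genuinely broken server from turning into a tight loop of pool resets.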
There is a check-in/check-out program here at my workplace. All it does is take the data from the check-in/check-out machine and store it in our database. Out of nowhere it has started to report an error on Thursdays, but only once, at a random time during the day. When I notice the error I run the program again, but nothing happens. I would like to debug it every 5-10 minutes to see if I can catch the error and find out what is happening. How can I do this?
Logging is your friend. Add lots of logging (either use the built-in Trace logging or use some framework such as log4net). Use the different log levels to control how much logging you get out. At verbose levels you can for instance log when you enter and exit important methods, log the input arguments and return values. Log in catch blocks and so on. Then analyse the log files after the next error is reported.
What kind of error logging are you currently implementing in this application? If none, would you consider adding comprehensive application logging, for example with log4net? Or, if this is a web application, with ELMAH?
This way you can log every error that happens along with its details, like stack trace to track down the problem.
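As an illustration of the entry/exit logging suggested above, here is a small helper that writes to any TextWriter. It is a sketch rather than a replacement for log4net or Trace; MethodLog is a hypothetical name, and in a real application the writer would typically be a log file opened once at startup.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public sealed class MethodLog : IDisposable
{
    private readonly TextWriter output;
    private readonly string method;
    private readonly Stopwatch timer = Stopwatch.StartNew();

    // Logs the method name and its input arguments on entry.
    public MethodLog(TextWriter output, string method, params object[] args)
    {
        this.output = output;
        this.method = method;
        Write("ENTER " + method + "(" + string.Join(", ", args) + ")");
    }

    public void Result(object value) => Write("RESULT " + method + " = " + value);

    public void Error(Exception e) => Write("ERROR " + method + ": " + e);

    // Logs the exit together with the elapsed time.
    public void Dispose() => Write("EXIT " + method + " after " + timer.ElapsedMilliseconds + " ms");

    private void Write(string line)
    {
        output.WriteLine(DateTime.Now.ToString("o") + " " + line);
        output.Flush();   // flush every line so a crash does not lose the tail of the log
    }
}
```

A method is then wrapped as using (var log = new MethodLog(writer, "ImportRecords", fileName)) { ... }, with log.Error(e) called from its catch blocks.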
Some thoughts:
Check the server event log to see if there are any crash minidumps you can pull out. These can tell you a lot about what happened when the program crashed (call stack, etc).
Or write a wrapper program that can run your program and detect when it fails, then take a snapshot of the server's state at that moment so you can (hopefully) re-execute the task with the necessary data to get a repeatable crash in your debugger.
Or just add loads of logging. You could use PostSharp to add trace that tells you every method that you enter and exit, so you can easily determine which method was running when it failed.
And you can add robustification code. Check religiously for nulls, etc. and you may well find you've corrected the problem without necessarily knowing which fix fixed it.
And if the program's not too big, just being old fashioned and desk checking (reading a print-out of) the code may well turn up some bugs.
Another approach (getting a bit more experimental) might be to modify the program to run continuously so you can stick it in a debugger and leave it till you hit an exception. (run a loop, wait for a trigger file to be refreshed or something, and then kick off the normal process - about 5 lines of code would probably suffice)
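The last idea, a loop that waits for a trigger file and then runs the normal job, might look roughly like this. TriggerLoop and its members are hypothetical names, and the one-second polling interval is an arbitrary choice.

```csharp
using System;
using System.IO;
using System.Threading;

public static class TriggerLoop
{
    // Returns true when the trigger file exists and was written after lastSeen,
    // and remembers the new timestamp so the same trigger fires only once.
    public static bool HasNewTrigger(string triggerPath, ref DateTime lastSeen)
    {
        if (!File.Exists(triggerPath)) return false;

        DateTime written = File.GetLastWriteTimeUtc(triggerPath);
        if (written <= lastSeen) return false;

        lastSeen = written;
        return true;
    }

    // Polls for the trigger and runs the job each time the file is refreshed.
    // Exceptions from job() are deliberately not caught, so an attached
    // debugger stops at the point of failure.
    public static void Run(string triggerPath, Action job, CancellationToken stop)
    {
        DateTime lastSeen = DateTime.MinValue;
        while (!stop.IsCancellationRequested)
        {
            if (HasNewTrigger(triggerPath, ref lastSeen))
            {
                job();
            }
            stop.WaitHandle.WaitOne(TimeSpan.FromSeconds(1));
        }
    }
}
```

Whatever currently launches the program would refresh the trigger file instead, and the process under the debugger stays alive between runs.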