Is it possible to send a heartbeat to Hangfire (Redis storage) to tell the system that the process is still alive? At the moment I set the InvisibilityTimeout to TimeSpan.MaxValue to prevent Hangfire from restarting the job. But if the process fails or the server restarts, the job will never be removed from the list of running jobs. So my idea was to remove the large timeout and send a kind of heartbeat instead. Is this possible?
I found https://discuss.hangfire.io/t/hangfire-long-job-stop-and-restart-several-time/4282/2 which deals with how to keep a long-running job alive in Hangfire.
The user zLanger says that jobs are considered dead and restarted once you ...
[...] are hitting hangfire’s invisibilityTimeout. You have two options.
increase the timeout to more than the job will ever take to run
have the job send a heartbeat to let Hangfire know it's still alive.
That's not new to you. But interestingly, the follow-up question there is:
How do you implement heartbeat on job?
This remains unanswered there, a hint that your problem is really not trivial.
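For what it's worth, here is a minimal sketch of how such a heartbeat could be built on the side, since Hangfire itself does not expose one for jobs as far as I know. The assumptions are mine: the job refreshes a Redis key with a short TTL via StackExchange.Redis, and a separate watchdog would treat an expired key as a dead job. The key name, TTL, and interval are made up for illustration.

    using System;
    using System.Threading;
    using StackExchange.Redis;

    // Hypothetical heartbeat: the long-running job refreshes a Redis key with a
    // short TTL; a separate watchdog treats an expired key as a dead job.
    // Key name, TTL, and interval are illustrative, not a Hangfire API.
    class HeartbeatExample
    {
        static void Main()
        {
            var redis = ConnectionMultiplexer.Connect("localhost");
            IDatabase db = redis.GetDatabase();
            string key = "job:42:heartbeat";          // made-up key convention

            using var timer = new Timer(_ =>
            {
                // Refresh the key well inside its TTL so a healthy job never expires.
                db.StringSet(key, DateTime.UtcNow.ToString("O"),
                             expiry: TimeSpan.FromSeconds(90));
            }, null, TimeSpan.Zero, TimeSpan.FromSeconds(30));

            RunLongJob();                             // the actual work, possibly hours

            db.KeyDelete(key);                        // clean exit: remove the heartbeat
        }

        static void RunLongJob() { /* hours of work */ }
    }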
I have never handled long-running jobs in Hangfire, but I know the problem from other queuing systems, like the former Sun Grid Engine, which is how I got interested in your question.
Back in the day, I had exactly your problem with Sun Grid Engine, and the department's computer guru told me that, according to some mathematical queuing theory, one should avoid long-running jobs at any cost (I will try to contact him and find the reference to the book he quoted). His idea may be worth sharing with you:
If you have some job which takes longer than the tolerated maximal running time of the queuing system, do not submit the job itself, but rather multiple calls of a wrapper script which is able to (1) start, (2) freeze-stop, (3) unfreeze-continue the actual task.
This stop-and-continue can indeed be a suspend at operating-system level (Ctrl+Z to suspend and fg to continue on Linux); see e.g. unix.stackexchange.com on that issue.
In practice, I had the binary myMonteCarloExperiment.x and the wrapper script myMCjobStarter.sh. The maximum compute time I had was a day. I would fill the queue with hundreds of calls of the wrapper script, with the boundary condition that only one of them should be running at a time. The script would check whether there was already a myMonteCarloExperiment.x process started anywhere on the compute cluster; if not, it would start an instance. If there was a suspended process, the wrapper script would resume it, let it run for 23 hours and 55 minutes, and then suspend the process again. In any other case, the wrapper script would report an error.
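The original wrapper was a shell script, but translated to C# on Windows the idea might look roughly like the sketch below. Everything here is an assumption for illustration: the binary name comes from my old setup, and whole-process suspend/resume is done via the undocumented (but long-standing) ntdll exports NtSuspendProcess and NtResumeProcess.

    using System;
    using System.Diagnostics;
    using System.Runtime.InteropServices;

    // Hypothetical C# translation of myMCjobStarter.sh: ensure a single instance
    // of the real binary, run it for a bounded time slice, then freeze it again.
    class JobWrapper
    {
        // Undocumented but long-standing ntdll exports for whole-process suspend/resume.
        [DllImport("ntdll.dll")] static extern int NtSuspendProcess(IntPtr processHandle);
        [DllImport("ntdll.dll")] static extern int NtResumeProcess(IntPtr processHandle);

        static void Main()
        {
            var existing = Process.GetProcessesByName("myMonteCarloExperiment");
            Process worker = existing.Length > 0
                ? existing[0]                                  // resume a frozen instance
                : Process.Start("myMonteCarloExperiment.x");   // or start a fresh one

            NtResumeProcess(worker.Handle);                    // harmless if not suspended

            // Stay just under the queue's one-day limit, then freeze for the next call.
            var slice = TimeSpan.FromHours(23) + TimeSpan.FromMinutes(55);
            if (!worker.WaitForExit((int)slice.TotalMilliseconds))
                NtSuspendProcess(worker.Handle);
        }
    }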
This approach does not implement a job heartbeat, but it does indeed run a lengthy job. It also keeps the queue administrator happy by avoiding the need to clean up Hangfire's job logs.
Further references
How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution seems to be a good read
Related
I have multiple Windows services which run 24/7 on a server. For logging events etc., I already use log4net, but I want to be able to see whether all my services are still running. So I stumbled upon this question and learned about the ServiceController class. Now I've had the idea to make another service in which I create a ServiceController object per service and use the WaitForStatus method to be notified when any of the services is stopped. I'd be able to check any status externally through a WCF service hosted in the service-controller service.
But I've also seen the answer to this question, which states that a ServiceController should be closed and disposed. Would it be bad to let my ServiceController wait 24/7 until any of my services stopped? Or should I use Quartz or a simple Timer to run a check every X amount of time?
Thanks in advance
You shouldn't. There is no mechanism in Windows that lets a service status change generate an event, so ServiceController.WaitForStatus() must poll. It is hard-coded to query the service status four times per second; a Thread.Sleep(250) hard-codes the poll interval. Use a decompiler to see this for yourself.
So you basically have many threads in your program doing nothing but sleeping for hours. That's pretty ugly; a thread is an expensive OS object. These threads don't burn any core, but the OS thread scheduler is still involved, constantly re-activating the threads when their sleep period expires.
If you need this kind of responsiveness to status changes, then it is okayish, but keep in mind that it cannot be more responsive than 250 msec. And keep in mind that increasing the interval by using a Timer sounds attractive, but consider the inherent problem with polling: if you poll, say, once a minute, and an admin stops and restarts the service within, say, 30 seconds between two polls, then you'll never see the status change. Oops.
Consider using only one thread that queries many ServiceControllers through their Status property: your own polling code, minus the cost of the threads.
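A minimal sketch of that idea follows; the service names and the five-second poll interval are placeholders. Note that Status is cached, so you must call Refresh() before each read.

    using System;
    using System.Linq;
    using System.ServiceProcess;
    using System.Threading;

    // One timer polls many services instead of one sleeping thread per service.
    class ServiceWatchdog
    {
        static readonly string[] Names = { "MyServiceA", "MyServiceB" }; // placeholders

        static void Main()
        {
            var controllers = Names.Select(n => new ServiceController(n)).ToArray();
            using var timer = new Timer(_ =>
            {
                foreach (var sc in controllers)
                {
                    sc.Refresh();   // Status is cached; refresh before reading it
                    if (sc.Status != ServiceControllerStatus.Running)
                        Console.WriteLine($"{sc.ServiceName} is {sc.Status}");
                }
            }, null, dueTime: 0, period: 5000);
            Console.ReadLine();     // keep the demo process alive
        }
    }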
I'm using an Azure Cloud Worker Role for processing incoming tasks from queues. Processing each task can take up to several hours, and each worker role can handle up to N tasks simultaneously. Basically, it's working.
Now, you can read in the documentation that from time to time the worker role can be shut down (for a software update, OS upgrade, ...). Basically, that's fine. But this planned shutdown must not forcibly stop the tasks the worker role is already running.
Expected:
When calling the OnStop() method by the environment:
The worker role stops taking new tasks for processing.
It waits for the running tasks to complete.
It continues with the planned shutdown.
Actual:
The OnStop() method can block for up to 5 minutes. I cannot guarantee that I'll finish processing the task within 5 minutes, so this is a problem... My task gets killed in the middle of processing, and this creates an unstable situation for my software.
How can I avoid this 5-minute limit? Any tip will be welcome.
How can I avoid this 5-minute limit?
Unfortunately, you can't. This is a hard limit imposed on Azure's side. You will need to work around it.
There are two possible solutions I can think of, and both of them would require you to rethink your current architecture:
Break your one big task into many smaller tasks and create some kind of workflow.
Make your task idempotent, so that even if it gets terminated partway through (because of a worker role shutdown or an error in the task itself) and later gets picked up by another instance, it can start again without corrupting the task's output. A sketch of this idea follows below.
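This is only an outline under my own assumptions: the ICheckpointStore abstraction is hypothetical and would in practice be backed by something durable, such as Azure Table or Blob storage.

    using System;

    // Hypothetical checkpoint store; back it with durable storage in practice.
    public interface ICheckpointStore
    {
        int LoadLastCompletedStep(string taskId);      // -1 if the task never ran
        void SaveCompletedStep(string taskId, int step);
    }

    public class IdempotentTask
    {
        private readonly ICheckpointStore _store;
        public IdempotentTask(ICheckpointStore store) => _store = store;

        public void Run(string taskId, Action[] steps)
        {
            int last = _store.LoadLastCompletedStep(taskId);
            for (int i = last + 1; i < steps.Length; i++)
            {
                steps[i]();                            // each step must be safe to redo
                _store.SaveCompletedStep(taskId, i);   // persist progress before moving on
            }
        }
    }

If the instance is killed between a step and its checkpoint, that one step is simply redone on the next run, which is exactly why each individual step must itself be idempotent.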
No, you cannot bypass this limit. In general, you should not rely on any of your instances running continuously for any long period of time. Instances may be suddenly stopped, or they may suddenly disappear (because of an underlying server failure). Your software should be designed so that when an instance is restarted (possibly redeployed), or some other instance finds capacity to take a previously released work item, that work item is reprocessed without any adverse effects.
I have a Windows service that is calling a stored proc over and over (in an infinite loop).
The code looks like this:
    while (true)
    {
        callStoredProc();
        doSomethingWithResults();
    }
However, there might be cases where the loop gets stuck with no response while the service is still technically running.
I imagine there are tools to monitor the health of a service, to let operations teams know to restart it.
But for my scenario this won't help, since the service will still be technically running; it's just stuck and can't continue.
What's the best way to ensure this process restarts if this scenario happens?
Would the solution be to use a task scheduler that checks for the heartbeat of this process and restarts the service if there's no heartbeat for a period of time? Or to have another, separate thread that monitors the progress of the first process?
Windows services have various recovery options, which takes care of question 1. For question 2, the best bet would be to use a timeout approach, whereby if the service takes more than X amount of time to complete, it restarts or stops what it's doing (I don't know the nature of your service, so I can't provide implementation detail).
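In general terms, though, that timeout approach could be sketched like this: run each loop iteration on a worker task and exit the process when it overruns the deadline, so the service recovery options restart it. The 5-minute deadline is an arbitrary placeholder, and the stuck iteration is simply abandoned here.

    using System;
    using System.Threading.Tasks;

    class Worker
    {
        static readonly TimeSpan Deadline = TimeSpan.FromMinutes(5);  // tune per workload

        static void Main()
        {
            while (true)
            {
                var iteration = Task.Run(() =>
                {
                    callStoredProc();
                    doSomethingWithResults();
                });
                // Wait returns false on timeout; exceptions from the task surface here.
                if (!iteration.Wait(Deadline))
                    Environment.Exit(1);  // non-zero exit; SCM recovery restarts the service
            }
        }

        static void callStoredProc() { /* the stored proc call */ }
        static void doSomethingWithResults() { /* process the results */ }
    }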
The heartbeat idea would work as well; however, that just becomes another thing to manage, maintain, and install.
I'm wondering whether this would work. I have a simple C# command-line application. It sends out emails at a set time (through the Windows scheduler).
If the SMTP server were to fail, would this be a good idea?
In the SmtpException handler I put a thread sleep for, say, 15 minutes. When it wakes up, it just calls the method again. This time, hopefully, the SMTP server would be back up. If not, it would keep doing this until the SMTP server is back online.
Is there some downside to this that I am missing? I would, of course, log that this is happening.
This is not a bad idea, in fact what you are effectively implementing is a simple variation of the Circuit-Breaker pattern.
The idea behind the pattern is that if an external resource is down, it will probably not come back up a few milliseconds later; it might need some time to recover. Typically the circuit-breaker pattern is used as a means to fail fast, so that the user gets an error sooner, or so as not to consume more resources on the failing system. When you have work that can be queued and does not require instant delivery, as you do, it is perfectly reasonable to wait for the resource to become available again.
Some things to note, though: you might want to have a maximum retry count before failing completely, and you might want to start off with a delay of less than 15 minutes.
Exponential back-off is the common choice here I think. Like the strategy that TCP uses to try to make a connection: double the timeout on each failed attempt. Prevents your program from flooding the event log with repeated failure notifications before somebody notices that something is wrong. Which can take a while.
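A sketch of that back-off around the send might look like this; the starting delay, retry cap, and method name are arbitrary placeholders:

    using System;
    using System.Net.Mail;
    using System.Threading;

    static class Mailer
    {
        // Retries the send with a doubling delay; the numbers are illustrative.
        public static void SendWithBackoff(SmtpClient client, MailMessage message)
        {
            var delay = TimeSpan.FromMinutes(1);   // start well under 15 minutes
            const int maxAttempts = 8;             // then give up and surface the failure
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    client.Send(message);
                    return;
                }
                catch (SmtpException ex) when (attempt < maxAttempts)
                {
                    Console.WriteLine(
                        $"Attempt {attempt} failed: {ex.Message}; retrying in {delay}");
                    Thread.Sleep(delay);
                    delay += delay;                // double the timeout, TCP-style
                }
            }
        }
    }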
However, using the task scheduler certainly doesn't help. You really ought to reprogram it so your program isn't consuming machine resources needlessly. But using the ITaskService interface from .NET isn't that easy. Check out this project.
I would strongly recommend using a Windows Service. Long-running processes that run in the background, wait for long periods of time, and need a controlled, logged, 'monitorable' lifetime: it's what Windows Services do.
Thread.Sleep would do the job, but if you want it to be interruptible from another thread or by something else going on, I would recommend Monitor.Wait (MSDN ref). Then you can run your process in a thread created and managed by the Service, and if you need to stop or interrupt it, you call Monitor.Pulse on the same sync object and the thread will come back to life.
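Roughly like this sketch (the 15-minute interval mirrors the question; the class and method names are mine):

    using System;
    using System.Threading;

    class IntervalWorker
    {
        private readonly object _sync = new object();
        private bool _stopping;

        public void Run()   // runs on the service-managed thread
        {
            lock (_sync)
            {
                while (!_stopping)
                {
                    DoWork();
                    // Releases the lock while waiting; returns early on Pulse.
                    Monitor.Wait(_sync, TimeSpan.FromMinutes(15));
                }
            }
        }

        public void Stop()  // called from the service's OnStop
        {
            lock (_sync)
            {
                _stopping = true;
                Monitor.Pulse(_sync);   // wake the worker immediately
            }
        }

        private void DoWork() { /* send the emails */ }
    }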
Also ref:
Best architecture for a 30 + hour query
Hope that helps!
Infinite loops are always a worry. You should set it up so that it will fail after N attempts, and you definitely should have some way to shut it down from the user console.
Failure is not such a bad thing when the failure isn't yours. Let it fail and report why it failed.
Your choices are limited, assuming that it is just a temporary condition and that it has worked at some point. The only thing you can do is report the problem, get somebody to fix it, and then retry the operation later. The one thing you must do is safeguard the messages so that you do not lose any.
If you stick with what you've got, watch out for concurrency; perhaps use a named mutex to ensure only a single process is running at a time, as sketched below.
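A minimal version of that guard (the mutex name is a made-up example):

    using System;
    using System.Threading;

    class Program
    {
        static void Main()
        {
            // createdNew is false when another process already owns this name.
            using var mutex = new Mutex(true, @"Global\MyMailerSingleton", out bool createdNew);
            if (!createdNew)
                return;                 // a previous run is still going; bail out

            try
            {
                SendPendingEmails();    // the actual work
            }
            finally
            {
                mutex.ReleaseMutex();   // let the next scheduled run proceed
            }
        }

        static void SendPendingEmails() { /* ... */ }
    }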
I send out notifications to all our developers in a similar fashion. Only, I store the message body and subject in the database. After a message has been successfully processed, I set a success flag in the database. This way it's easy to track and report errors, and retries are a cakewalk.
I have a task that needs to run every 30 seconds. I can do one of two things:
Write a command-line app that runs the task once, waits 30 seconds, runs it again, and then exits. I can schedule this task with Scheduled Tasks in Windows to run every minute.
Write a Service that runs a task repeatedly while waiting 30 seconds in between each run.
Number 1 is simpler, in my opinion, and I would opt to do it this way by default. Am I wimping out? Is there a reason why I should make this a service and not a scheduled task? What are the pros and cons of both, and which would you pick in the end?
I read a nice blog post about this question recently. It goes into a lot of good reasons why you should not write a service to run a recurring job. Additionally, this question has been asked before:
https://stackoverflow.com/questions/390307/windows-service-vs-scheduled-task
Windows Service or Scheduled Task, which one do we prefer?
One advantage of using the scheduled task is that if there is some potential risk involved in running the service, such as a memory leak or a hanging network connection, then the Windows service can potentially hang around for a long time, adversely affecting other users. The scheduled task, by contrast, is written to be short-running, so even if it does leak, the effect is minimised.
On the other hand, someone in one of the above questions commented that the scheduler's accuracy is limited to somewhere in the range of one minute, so you may find that it is unable to run your task every 30 seconds with any accuracy.
Obviously there are a number of tradeoffs to consider, but hopefully this will help you make a good decision.
If you're trying to run every 30 seconds, I'd go for option 2. In that case it's pretty much a continually running job, and the overhead of starting and stopping the process is probably higher than the work itself, especially if you use an appropriate timer.
If you had a job that ran once a day (or a few times a day), then I'd go for option 1, using a scheduled task.
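For option 2, a minimal sketch of a service-hosted timer might look like this (the class and method names are placeholders):

    using System;
    using System.ServiceProcess;
    using System.Threading;

    public class ThirtySecondService : ServiceBase
    {
        private Timer _timer;

        protected override void OnStart(string[] args)
        {
            // Expensive setup (config, file reads) happens once here, not every tick.
            _timer = new Timer(_ => RunTask(), null,
                               TimeSpan.Zero, TimeSpan.FromSeconds(30));
        }

        protected override void OnStop()
        {
            _timer?.Dispose();
        }

        private void RunTask() { /* the 30-second job */ }
    }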
The Task Scheduler in Windows seems a bit flaky, in my opinion. I think you would get a more reliable result running as a service.
Also, a service can keep resources in memory, such as input read from a file, and only has to load them at service start-up, not every 30 seconds.
30 seconds is a pretty short interval (relatively speaking) between processing cycles. Like the others, I have my concerns about the Task Scheduler, and I am afraid such a short interval would only compound the issues you might encounter with that approach. If this were my project, I would almost certainly go with the service.