Currently the Hangfire dashboard offers an option to requeue jobs (whether succeeded or failed), and in my case running a job twice can cause problems.
I have tried adding the AutomaticRetry attribute...
[AutomaticRetry(Attempts = 0)]
This solves the problem when a job fails: jobs are not requeued automatically, but the button is still on the dashboard and they can be manually requeued.
Also, currently there is no way to make Hangfire stop running jobs altogether, for example when you have an issue with one of the services the jobs depend on and you want to stop Hangfire until the issue is resolved.
The idea is to have your job raise an exception. It will then go into the failed state and, depending on your AutomaticRetry setting, either be retried automatically (up to the defined number of retry attempts) or stay there, so that once the problem is solved you can manually requeue the job from the dashboard.
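A trivial sketch of that idea (ServiceIsHealthy is a hypothetical placeholder for whatever dependency check applies in your case):

    public void MyJob()
    {
        // Hypothetical health check; throwing here puts the job in the Failed state,
        // where it waits until you manually requeue it from the dashboard.
        if (!ServiceIsHealthy())
            throw new InvalidOperationException("Dependent service is down; failing the job.");

        // ... actual work ...
    }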
Having the job sit there waiting for a service to come back online does not sound advisable (speaking in general, I obviously don’t know your specific scenario that well).
On the whole I find I am extremely careful even with automatic retries. I only consider doing those if I have a guarantee that whatever the job does is idempotent (i.e. running the same actions multiple times does not cause issues).
Imagine a job that adds $100 to the salary of every employee in a company (i.e. set salary = salary + 100). You run the job updating the DB, but halfway through the DB server connection drops. Half the employees have had the salary increase; the other half did not get it yet. Running the same job again should not apply the $100 increase a second time to those employees handled in the first run.
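One way to make such a job idempotent (a sketch; LastRaiseRunId is an assumed marker column, not something Hangfire provides) is to tag each run with an ID that is generated once when the job is enqueued and reused on every retry:

    using System;
    using System.Data.SqlClient;

    public class SalaryRaiseJob
    {
        // A retry with the same runId skips employees already updated by a failed
        // attempt; a genuinely new run (new runId) applies the raise again as intended.
        public void ApplyRaise(Guid runId, string connectionString)
        {
            const string sql = @"
                UPDATE Employee
                SET    Salary = Salary + 100,
                       LastRaiseRunId = @runId
                WHERE  LastRaiseRunId IS NULL OR LastRaiseRunId <> @runId";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@runId", runId);
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }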
Stopping the whole server also seems a bit drastic. I believe the advised mechanism is to just delete the job (if it is a recurring one and not a fire-and-forget) if you don't want it to enqueue new runs for a while. Then when the issue is solved you just reschedule it. I do agree that a pause feature would be a nice-to-have. You could extend Hangfire yourself to do this using job filters and IElectStateFilter: just have a boolean (e.g. IsHangfirePaused = true) somewhere that you can check in the OnStateElection event, and prevent the job from transitioning to the EnqueuedState when it is set to true, as in the sketch below.
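A minimal sketch of such a filter (the static IsHangfirePaused flag is just for illustration; in practice you would read it from configuration or a database):

    using System;
    using Hangfire.Common;
    using Hangfire.States;

    public class PauseFilter : JobFilterAttribute, IElectStateFilter
    {
        // Illustrative flag; flip it to true to hold jobs back.
        public static volatile bool IsHangfirePaused = false;

        public void OnStateElection(ElectStateContext context)
        {
            if (IsHangfirePaused && context.CandidateState is EnqueuedState)
            {
                // Push the job back to the scheduled state instead of letting it run.
                context.CandidateState = new ScheduledState(TimeSpan.FromMinutes(5));
            }
        }
    }

Register it globally with GlobalJobFilters.Filters.Add(new PauseFilter()); jobs will then keep being rescheduled until the flag is cleared.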
This is according to https://discuss.hangfire.io/t/ability-to-stop-running-jobs/4215/2
How about deleting the job?
RecurringJob.RemoveIfExists("myJobID");
You can also set the next execution date for that task directly in the Hangfire database:
UPDATE [HangFire].[Hash]
SET [Value] = '2050-01-01T00:00:00.0000000Z'
where [Key]='Job_Name' and [Field]='NextExecution'
As odinserj suggests on Hangfire Discussion.
You can simply use a condition inside the recurring job, backed by a flag stored in your database, to prevent the job from running:
public void MyMethod()
{
    if (someCondition) { return; } // e.g. a "disabled" flag read from your database
    /* ... */
}
I'm working with Durable Functions. I already understand how they work: there is an orchestration that controls the flow (the order in which the activities run), and that orchestration takes care of the sequence of the activities.
But currently I have a question that I'm not finding the correct answer to, and maybe you can help me with that:
Imagine that I have one orchestration with 5 activities.
One of the activities does a call to an API that will get a document as an array of bytes.
If one of the activities fails, the orchestration can throw an exception and I can detect that through the code.
I also have some retry options that retry the activities with an interval of 2 minutes.
But... what if those retries don't succeed?
From what I was able to read, I can use the ContinueAsNew method in order to restart the orchestration, but I think there is a problem.
If I use this method to restart the orchestration 1 hour later, will it resume at the activity where it was?
I mean, if the first activity is done and I restart the orchestration due to a failure of one of the later activities, will it resume on the 2nd activity as it was before?
Thank you for your time guys.
If you restart the orchestration, it doesn't have any state of the previous one.
So the first activity will run again.
If you don't want that to happen, you'll need to retry the second one until it succeeds.
I would not recommend making that infinite though, an orchestration should always finish at some point.
I'd just increase the retry count to a sufficiently high number so I can be confident that the processing will succeed in at least 99% of cases.
(How likely is your activity to fail?)
Then if it still fails, you could send a message to a queue and have it trigger some alert. You could then start that one from the beginning.
If something fails so many times that the retry amount is breached, there could be something wrong with the data itself and typically a manual intervention may be needed at that point.
Another option could be to send the alert from within the orchestration if the retries fail, and then wait for an external event from an admin who approves or denies the retry, as sketched below.
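A rough sketch of that combination in Durable Functions 2.x (FetchDocument, SendAlert and AdminApproval are placeholder names, not real functions):

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class DocumentOrchestration
    {
        [FunctionName("DocumentOrchestration")]
        public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            var input = context.GetInput<string>();
            var retryOptions = new RetryOptions(
                firstRetryInterval: TimeSpan.FromMinutes(2),
                maxNumberOfAttempts: 10);

            try
            {
                // Retried automatically by the framework; earlier activities are not re-run.
                var document = await context.CallActivityWithRetryAsync<byte[]>(
                    "FetchDocument", retryOptions, input);
                // ... pass the document to the remaining activities ...
            }
            catch (FunctionFailedException)
            {
                // All retries exhausted: alert an admin and wait for their decision.
                await context.CallActivityAsync("SendAlert", input);
                var approved = await context.WaitForExternalEvent<bool>("AdminApproval");
                if (approved)
                {
                    // Restarts the orchestration from scratch: activity 1 runs again.
                    context.ContinueAsNew(input);
                }
            }
        }
    }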
Is it possible to send a heartbeat to Hangfire (Redis storage) to tell the system that the process is still alive? At the moment I set the InvisibilityTimeout to TimeSpan.MaxValue to prevent Hangfire from restarting the job. But if the process fails or the server restarts, the job will never be removed from the list of running jobs. So my idea was to remove the large timeout and send a kind of heartbeat instead. Is this possible?
I found https://discuss.hangfire.io/t/hangfire-long-job-stop-and-restart-several-time/4282/2 which deals with how to keep a long-running job alive in Hangfire.
The User zLanger says that jobs are considered dead and restarted once you ...
[...] are hitting hangfire’s invisibilityTimeout. You have two options.
increase the timeout to more than the job will ever take to run
have the job send a heartbeat to let hangfire’s know it’s still alive.
That's not new to you. But interestingly, the follow-up question there is:
How do you implement heartbeat on job?
This remains unanswered there, a hint that your problem is really not trivial.
I have never handled long-running jobs in Hangfire, but I know the problem from other queuing systems like the former SunGrid Engine which is how I got interested in your question.
Back in the day, I had exactly your problem with SunGrid, and the department's computer guru told me that one should at any cost avoid long-running jobs, according to some mathematical queuing theory (I will try to contact him and find the reference to the book he quoted). His idea may be worth sharing with you:
If you have some job which takes longer than the tolerated maximal running time of the queuing system, do not submit the job itself, but rather multiple calls of a wrapper script which is able to (1) start, (2) freeze-stop, (3) unfreeze-continue the actual task.
This stop-continue can indeed be a suspend (Ctrl+Z to stop, fg to continue in Linux) at the operating-system level; see e.g. unix.stackexchange.com on that issue.
In practice, I had the binary myMonteCarloExperiment.x and the wrapper script myMCjobStarter.sh. The maximum compute time I had was a day. I would fill the queue with hundreds of calls of the wrapper script, with the boundary condition that only one of them should be running at a time. The script would check whether there was already a process myMonteCarloExperiment.x started anywhere on the compute cluster; if not, it would start an instance. In case there was a suspended process, the wrapper script would resume it, let it run for 23 hours and 55 minutes, and then suspend it again. In any other case, the wrapper script would report an error.
This approach does not implement a job heartbeat, but it does indeed run a lengthy job. It also keeps the queue administrator happy by avoiding Hangfire job logs that would otherwise have to be cleaned up.
Further references
How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution seems to be a good read
I have the following WebAPI 2 method:
public HttpResponseMessage ProcessData([FromBody] ProcessDataRequestModel model)
{
    var response = new JsonResponse();
    if (model != null)
    {
        // checks if there are old records to process
        var records = _utilityRepo.GetOldProcesses(model.ProcessUid);
        if (records.Count > 0)
        {
            // there is an active process
            // insert the new process
            _utilityRepo.InsertNewProcess(records[0].ProcessUid);
            response.message = "Process added to ProcessUid: " + records[0].ProcessUid.ToString();
        }
        else
        {
            // if this is a new process then apply the adjustment rules
            var settings = _utilityRepo.GetSettings(model.Uid);

            // create a new process
            var newUid = Guid.NewGuid();

            // if it's a new adjustment
            if (settings.AdjustmentUid == null)
            {
                settings.AdjustmentUid = Guid.NewGuid();

                // create new Adjustment information
                _utilityRepo.CreateNewAdjustment(settings.AdjustmentUid.Value);
            }

            // if the process was created
            if (_utilityRepo.CreateNewProcess(newUid))
            {
                // insert the new body
                _utilityRepo.InsertNewBody(newUid, model.Body, true);
            }

            // start AWS lambda function timer
            _utilityRepo.AWSStartTimer();

            response.message = "Process created";
        }
        response.success = true;
        response.data = null;
    }
    return Request.CreateResponse(response);
}
The above method can sometimes take 3-4 seconds to process (some DB calls and other calculations), and I don't want the user to wait until all the executions are done.
I would like the user to hit the Web API method and almost immediately get a success response, while the server finishes all the executions in the background.
Any clue on how to implement Async / Await to achieve this?
If you don't need to return a meaningful response it's a piece of cake. Wrap your method body in a lambda you pass to Task.Run (which returns a Task). No need to use await or async. You just don't await the Task and the endpoint will return immediately.
However if you need to return a response that depends on the outcome of the operation, you'll need some kind of reporting mechanism in place, SignalR for example.
Edit: Based on the comments to the original post, my recommendation would be to wrap the code in await Task.Run(()=>...), i.e., indeed await it before returning. That will allow the long-ish process to run on a different thread asynchronously, but the response will still await the outcome rather than leaving the user in the dark about whether it finished (since you have no control over the UI). You'd have to test it though to see if there's really any performance benefit from doing this. I'm skeptical it'll make much difference.
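A minimal sketch of both variants (ProcessInternal is a hypothetical method holding the original body):

    // Fire-and-forget: returns immediately; a failure after the response, or an app
    // pool recycle, is invisible to the caller.
    public HttpResponseMessage ProcessData([FromBody] ProcessDataRequestModel model)
    {
        Task.Run(() => ProcessInternal(model)); // deliberately not awaited
        return Request.CreateResponse(new JsonResponse { success = true, message = "Process started" });
    }

    // Awaited variant: the work runs on a thread-pool thread, but the response
    // still reflects the outcome.
    public async Task<HttpResponseMessage> ProcessDataAwaited([FromBody] ProcessDataRequestModel model)
    {
        await Task.Run(() => ProcessInternal(model));
        return Request.CreateResponse(new JsonResponse { success = true, message = "Process finished" });
    }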
2020-02-14 Edit:
Hooray, my answer's votes are no longer in the negative! I figured having had the benefit of two more years of experience I would share some new observations on this topic.
There's no question that asynchronous background operations running in a web server is a complex topic. But as with most things, there's a naive way of doing it, a "good enough for 99% of cases" way of doing it, and a "people will die (or worse, get sued) if we do it wrong" way of doing it. Things need to be put in perspective.
My original answer may have been a little naive, but to be fair the OP was talking about an API that was only taking a few seconds to finish, and all he wanted to do was save the user from having to wait for it to return. I also noted that the user would not get any report of progress or completion if it is done this way. If it were me, I'd say the user should suck it up for that short of a time. Alternatively, there's nothing that says the client has to wait for the API response before returning control to the user.
But regardless, if you really want to get that 200 right away JUST to acknowledge that the task was initiated successfully, then I still maintain that a simple Task.Run(()=>...) without the await is probably fine in this case. Unless there are truly severe consequences to the user not knowing the API failed, on the off chance that the app pool was recycled or the server restarted during those exact 4 seconds between the API return and its true completion, the user will just be ignorant of the failure and will presumably find out next time they go into the application. Just make sure that your DB operations are transactional so you don't end up in a partial success situation.
Then there's the "good enough for 99% of cases" way, which is what I do in my application. I have a "Job" system which is asynchronous, but not reentrant. When a job is initiated, we do a Task.Run and begin to execute it. The code in the task always holds onto a Job data structure whose ID is returned immediately by the API. The code in the task periodically updates the Job data with status, which is also saved to a database, and checks to see if the Job was cancelled by the user, in which case it wraps up immediately and the DB transaction is rolled back. The user cancels by calling another API which updates said Job object in the database to indicate it should be cancelled. A separate infinite loop periodically polls the job database server side and updates the in-memory Job objects used by the actual running code with any cancellation requests. Fundamentally it's just like any CancellationToken in .NET but it just works via a database and API calls. The front end can periodically poll the server for job status using the ID, or better yet, if they have WebSockets the server pushes job updates using SignalR.
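An illustrative skeleton of that pattern (all names are made up; a ConcurrentDictionary stands in for the database table here, and in the real system every update would be persisted):

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    public class JobRecord
    {
        public Guid Id { get; } = Guid.NewGuid();
        public volatile bool CancelRequested;   // set via the cancel API
        public volatile string Status = "Running";
    }

    public static class JobRunner
    {
        public static readonly ConcurrentDictionary<Guid, JobRecord> Jobs =
            new ConcurrentDictionary<Guid, JobRecord>();

        public static Guid Start(Action<JobRecord> work)
        {
            var job = new JobRecord();
            Jobs[job.Id] = job;

            Task.Run(() =>
            {
                try
                {
                    work(job); // the work delegate checks job.CancelRequested periodically
                    job.Status = job.CancelRequested ? "Cancelled" : "Done";
                }
                catch
                {
                    job.Status = "Failed";
                }
            });

            return job.Id; // the API returns this immediately; clients poll status by ID
        }

        public static void RequestCancel(Guid id)
        {
            if (Jobs.TryGetValue(id, out var job))
                job.CancelRequested = true;
        }
    }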
So, what happens if the app domain is lost during the job? Well, first off, every job runs in a single DB transaction, so if it doesn't complete the DB rolls back. Second, when the ASP.NET app restarts, one of the first things it does is check for any jobs that are still marked as running in the DB. These are the zombies that died upon app pool restart but the DB still thinks they're alive. So we mark them as KIA, and send the user an email indicating their job failed and needs to be rerun. Sometimes it causes inconvenience and a puzzled user from time to time, but it works fine 99% of the time. Theoretically, we could even automatically restart the job on server startup if we wanted to, but we feel it's better to make that a manual process for a number of case-specific reasons.
Finally, there's the "people will die (or worse, get sued) if we get it wrong" way. This is what some of the other comments are more directed to. This is where you have to break down all jobs into small atomic transactions that are tracked in a database at every step, and which can be picked up by any server (the same or maybe another server in a farm) at any time. If it's really top notch, multiple servers can even work on the same job concurrently, depending on what it is. It requires carefully coding every background operation with this in mind, constantly updating a database with your progress, dealing with concurrent changes to the database (because now the entire operation is no longer a single atomic transaction), etc. Needless to say, it's a LOT of effort. Yeah, it would be great if it worked this way. It would be great if every app did everything to this level of perfection. I also want a toilet made out of solid gold, but it's just not in the cards now is it?
So my $0.02 is, again, let's have some perspective. Do the cost benefit analysis and unless you're doing something where lives or lots of money is at stake, aim for what works perfectly well 99%+ of the time and only causes minor inconvenience when it doesn't work perfectly.
I've read through a few posts on SO on whether to use a Windows Service or Scheduled Task and from my understanding I should be using a Scheduled Task.
I have a simple program, basically do a little logic and send an email. The only hard requirement that I have is the email must be sent on the :40 minute mark of each hour. So 8:40, 9:40, 10:40, etc. When I initially setup the schedule for the task I can set it to start at 8:40, recur every hour, every day.
That seems to fulfill the requirement, but should I be worried about anything in regards to ensuring the task is run on that schedule?
It all seems so simple that I'm sure I'm missing something?
Well, a few points to mention:
If you are using a batch file, make sure other programs can't replace or change it.
Use an access log to record whether each execution was successful or not, so you can monitor it later (for debugging your application).
Use scheduled task errors to check whether your application ran correctly (http://support.microsoft.com/kb/308558).
Use as few privileges as possible (which is the default).
I created a job that implements IStatefulJob, and according to the Quartz docs:
"if a job is stateful, and a trigger attempts to 'fire' the job while it is already
executing, the trigger will block (wait) until the previous execution completes"
Is there any way to remove the block and kill the newly fired instance of the job?
The job I am running can have wildly different run times based on the amount of data behind it and I am concerned that if we have a number of jobs waiting to run that it could have a negative effect...
Thanks
Unfortunately no. As a job implementor you are responsible for making sure that the job keeps track of whether it has reached its time limit of 'good behavior'. Normally there's no need, as jobs take a somewhat expected time to complete.
The same goes when you want to interrupt all jobs in the scheduler: you need to implement IInterruptableJob and set a flag that your main job loop watches, as in the sketch below.
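A bare-bones sketch of that flag pattern (Quartz.NET 2.x signatures; in 1.x, where IStatefulJob lives, Execute takes a JobExecutionContext instead):

    using Quartz;

    public class LongRunningJob : IInterruptableJob
    {
        private volatile bool _interrupted;

        public void Execute(IJobExecutionContext context)
        {
            while (!_interrupted)
            {
                // ... do one small unit of work, then re-check the flag ...
                if (WorkIsDone()) return;
            }
            // Interrupted: clean up and exit quickly.
        }

        // Called when scheduler.Interrupt(jobKey) is invoked.
        public void Interrupt()
        {
            _interrupted = true;
        }

        private bool WorkIsDone() { return true; /* placeholder */ }
    }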
You can always rethink the design. It shouldn't be a problem to queue the same job, as it has the same duty to do. With misfire instructions you can configure misfired (queued too long) instances to be discarded and wait for the next fire time; see the example below.
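For example, with the fluent API (Quartz.NET 2.x) a cron trigger can be told to skip misfired runs entirely (the identity and cron expression are placeholders):

    using Quartz;

    public static class TriggerSetup
    {
        public static ITrigger BuildSkipMisfiresTrigger()
        {
            // "Do nothing" on misfire: discard the missed run and wait for the next fire time.
            return TriggerBuilder.Create()
                .WithIdentity("myTrigger", "myGroup")
                .WithCronSchedule("0 0/15 * * * ?",
                    x => x.WithMisfireHandlingInstructionDoNothing())
                .Build();
        }
    }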