I'm working with Durable Functions. I have already understood how Durable Functions work: there is an orchestration that controls the flow (the order in which the activities run), and that orchestration takes care of the sequence of the activities.
But currently I have a question that I can't find the correct answer to, and maybe you can help me with it:
Imagine that I have one orchestration with 5 activities.
One of the activities makes a call to an API that returns a document as an array of bytes.
If one of the activities fails, the orchestration can throw an exception, and I can detect that through the code.
I also have some retry options that retry the activities at an interval of 2 minutes.
But... what if those retries don't succeed?
From what I've read, I can use the ContinueAsNew method to restart the orchestration, but I think there is a problem.
If I use this method to restart the orchestration 1 hour later, will it resume at the activity where it left off?
I mean, if the first activity is done and I restart the orchestration due to the failure of one of the later activities, will it resume at the 2nd activity, as it was before?
Thank you for your time, guys.
If you restart the orchestration, it doesn't have any state from the previous one, so the first activity will run again.
If you don't want that to happen, you'll need to retry the second activity until it succeeds.
I would not recommend making that infinite, though; an orchestration should always finish at some point.
I'd just increase the retry count to a sufficiently high number that I can be confident the processing will succeed in at least 99% of cases.
(How likely is your activity to fail?)
Then, if it still fails, you could send a message to a queue and have it trigger some alert; you could then start the orchestration again from the beginning.
If something fails so many times that the retry limit is breached, there could be something wrong with the data itself, and a manual intervention is typically needed at that point.
Another option could be to send the alert from within the orchestration if the retries fail, and then wait for an external event from an admin who approves or denies the retry.
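The retry-then-alert-then-wait-for-approval flow described above can be sketched roughly like this in C#; the activity names, event name, and retry numbers are placeholders for illustration, not part of the original question:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class DocumentOrchestration
{
    [FunctionName("DocumentOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Retry every 2 minutes, up to 10 attempts (numbers are illustrative).
        var retryOptions = new RetryOptions(
            firstRetryInterval: TimeSpan.FromMinutes(2),
            maxNumberOfAttempts: 10);

        try
        {
            byte[] document = await context.CallActivityWithRetryAsync<byte[]>(
                "DownloadDocument", retryOptions, input: null);
            // ... call the remaining activities here ...
        }
        catch (FunctionFailedException)
        {
            // All retries exhausted: alert an admin, then wait for an
            // external event before deciding whether to start over.
            await context.CallActivityAsync("SendAlert", "DownloadDocument failed");
            bool approved = await context.WaitForExternalEvent<bool>("ApprovalEvent");
            if (approved)
            {
                // Restarts the orchestration with fresh history, so the
                // first activity runs again.
                context.ContinueAsNew(null);
            }
        }
    }
}
```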
Related
Problem: I need to reschedule/defer a message, as the receiver, so that it is processed after a user-defined elapsed time.
Goal: After an HttpResponseException of ServiceUnavailable, I would like to retry processing of the message after 30 minutes. It must also follow the existing rule set: after 10 delivery attempts the message is sent to the dead-letter queue (this happens automatically based on the topic rules).
So I have a function app that processes an Azure Service Bus topic, which means that sleeping the thread is not an option.
What I have tried:
I understand that messageSender.ScheduleMessageAsync(message, dateTime) is used on the sender's side to schedule the message for later processing, and it works when sending a new message; however, as the receiver, I would like to do this on my side after getting an exception.
I tried using messageReceiver.DeferAsync(message.SystemProperties.LockToken, properties), with properties containing the new "ScheduledEnqueueTimeUtc". This does defer the message, but the sequence IDs seem to go out of sync, making it impossible to receive my deferred message.
If I clone the message, I cannot set SystemProperties.DeliveryCount because it is read-only, so the dead-letter queue rule will not function as intended. I can create UserProperties and manually track the retry count and a scheduled date in my function app, but I am wondering if there is a better way to do this?
Any suggestions will be appreciated.
What do you think about creating a retry policy? Instead of Thread.Sleep, you can schedule the same message onto the queue again, but with a specific time (+30 minutes), and return a positive response to complete the current message.
You need to keep the DeliveryCount rule, so you may need to add a property to the message that holds a count.
I think this idea is sound. Here is an article that may help you; you just need to replace Thread.Sleep with ScheduleMessageAsync.
I managed to resolve retrying a message using custom UserProperties.
This is in line with what Houssem Dbira suggested, and with my 3rd point, but instead of using a custom retry-policy object I created a helper function that manages the retry count and schedules the message to the Service Bus again.
The link below will take you to the helper function I created, if you are interested in doing this yourself.
RetryHelper.cs
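A helper along those lines might look roughly like this, using the Microsoft.Azure.ServiceBus SDK. This is a sketch only; the property name "RetryCount", the parameter names, and the limit of 10 are assumptions, and the actual RetryHelper.cs may differ:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.ServiceBus.Core;

public static class RetryHelper
{
    public static async Task RetryLaterAsync(
        MessageSender sender, Message original, TimeSpan delay, int maxRetries = 10)
    {
        var clone = original.Clone();

        // SystemProperties.DeliveryCount is read-only on a cloned message,
        // so the retry limit is enforced manually via a custom UserProperty.
        int retries = clone.UserProperties.ContainsKey("RetryCount")
            ? (int)clone.UserProperties["RetryCount"]
            : 0;

        if (retries >= maxRetries)
        {
            // At this point the caller would dead-letter the message instead.
            throw new InvalidOperationException("Retry limit reached.");
        }

        clone.UserProperties["RetryCount"] = retries + 1;
        await sender.ScheduleMessageAsync(clone, DateTimeOffset.UtcNow.Add(delay));
        // The caller then completes the original message so it is not redelivered.
    }
}
```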
Is it possible to send a heartbeat to hangfire (Redis Storage) to tell the system that the process is still alive? At the moment I set the InvisibilityTimeout to TimeSpan.MaxValue to prevent hangfire from restarting the job. But, if the process fails or the server restarts, the job will never be removed from the list of running jobs. So my idea was, to remove the large time out and send a kind of heartbeat instead. Is this possible?
I found https://discuss.hangfire.io/t/hangfire-long-job-stop-and-restart-several-time/4282/2 which deals with how to keep a long-running job alive in Hangfire.
The User zLanger says that jobs are considered dead and restarted once you ...
[...] are hitting hangfire’s invisibilityTimeout. You have two options.
increase the timeout to more than the job will ever take to run
have the job send a heartbeat to let hangfire’s know it’s still alive.
That's not new to you. But interestingly, the follow-up question there is:
How do you implement heartbeat on job?
This remains unanswered there, a hint that your problem is really not trivial.
I have never handled long-running jobs in Hangfire, but I know the problem from other queuing systems like the former SunGrid Engine which is how I got interested in your question.
Back in the day, I had exactly your problem with SunGrid, and the department's computer guru told me that one should avoid long-running jobs at any cost, according to some mathematical queuing theory (I will try to contact him and find the reference to the book he quoted). His idea is perhaps worth sharing with you:
If you have some job which takes longer than the tolerated maximal running time of the queuing system, do not submit the job itself, but rather multiple calls of a wrapper script which is able to (1) start, (2) freeze-stop, (3) unfreeze-continue the actual task.
This stop-continue can indeed be a suspend/resume (CTRL+Z and fg, respectively, in Linux) at the operating-system level; see e.g. unix.stackexchange.com on that issue.
In practice, I had the binary myMonteCarloExperiment.x and the wrapper script myMCjobStarter.sh. The maximum compute time I had was a day. I would fill the queue with hundreds of calls of the wrapper script, with the boundary condition that only one of them should be running at a time. The script would check whether a myMonteCarloExperiment.x process was already started anywhere on the compute cluster; if not, it would start an instance. If there was a suspended process, the wrapper script would resume it, let it run for 23 hours and 55 minutes, and then suspend it again. In any other case, the wrapper script would report an error.
This approach does not implement a job heartbeat, but it does indeed run a lengthy job. It also keeps the queue administrator happy by avoiding that job logs of Hangfire have to be cleaned up.
Further references
How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution seems to be a good read
I have an Azure Function (based on the new C# functions instead of the old .csx functions) that is triggered whenever a message comes into an Azure Service Bus queue. Once the function is triggered, it starts processing the Service Bus message: it decodes the message, reads a bunch of databases, updates a bunch of others, etc. This can take upwards of 30 minutes at times.
Since this is not a time-sensitive process, 30 or even 60 minutes is not an issue. The problem is that in the meantime, the Azure Function seems to kick in again, pick up the same message, and reprocess it. This causes problems in our business logic.
So the question is: can we force the Azure Function to run in a singleton mode? Or, if that's not possible, how do we change the polling interval?
The issue is related to a Service Bus setting.
What is happening is that the message is added to the queue and then given to the function, and a lock is placed on that message so that no other consumer can see or process it while you hold the lock.
If within that lock period you do not tell Service Bus that you've processed the message, or ask it to extend the lock, the lock is removed from the message and it becomes visible to other consumers, which will then process it; that is what you are seeing.
Fortunately, Azure Functions can automatically renew the lock for you. In the host.json file there is an autoRenewTimeout setting that specifies for how long you want Azure Functions to keep renewing the lock.
https://github.com/Azure/azure-webjobs-sdk-script/wiki/host.json
"serviceBus": {
    // The maximum duration within which the message lock will be renewed automatically.
    "autoRenewTimeout": "00:05:00"
}
AutoRenewTimeout is not as great as suggested; it has a downside that you need to be aware of: it is not a guaranteed operation. Being a client-side initiated operation, it can, and sometimes will, fail, leaving you in the same state as you are in today.
What you could do to address this is review your design. If you have a long-running process, then you receive the message and hand off the processing to something that can run longer than MaxLockDuration. The fact that your function takes so long indicates you have a long-running process; messaging is not designed for that.
One potential solution would be to receive a message and register the processing intent in a storage table. Have another function, triggered off that storage table, kick off the processing, which could take X minutes; mark it as a singleton. By doing so you'll be processing your messages in parallel, writing a "request for long-running processing" into the storage table and completing the Service Bus message, and therefore not triggering its re-processing. Within the long-running processing you can decide how to handle failure cases.
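A rough sketch of this hand-off pattern follows. Note that Azure Functions has no built-in storage-table trigger, so this sketch swaps in a storage queue to trigger the second function; all queue names and the work method are placeholders:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class HandOffPattern
{
    // First function: completes quickly, so the Service Bus lock is only
    // held for seconds. It just records the work to be done.
    [FunctionName("ReceiveMessage")]
    public static void Receive(
        [ServiceBusTrigger("my-queue")] string message,
        [Queue("work-items")] out string workItem)
    {
        workItem = message; // register processing intent and return
    }

    // Second function: does the long-running work, one item at a time.
    [FunctionName("ProcessWorkItem")]
    [Singleton]
    public static async Task Process(
        [QueueTrigger("work-items")] string workItem, ILogger log)
    {
        log.LogInformation("Starting long-running processing for {item}", workItem);
        await DoLongRunningWorkAsync(workItem); // hypothetical helper; 30-60 min
    }

    private static Task DoLongRunningWorkAsync(string workItem) =>
        Task.CompletedTask; // placeholder for the actual processing
}
```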
Hope that helps.
So your question is how to make the Azure Function Service Bus trigger process one message at a time, with each message processed only once.
I have been able to achieve the same functionality using the following host.json configuration.
{
"serviceBus": {
"maxConcurrentCalls": 1,
"autoRenewTimeout": "23:59:59",
"autoComplete": true
}
}
Note that I set the autoRenewTimeout to just under 24 hours, which is long enough for a really long-running process; otherwise, you can change it to a duration that fits your needs.
Many will argue about the suitability of using an Azure Function for a long-running operation, but that is not the question that needs an answer here.
I also experienced the same issue; what I did was remove the default rule and add a custom rule on the subscriber.
(OriginalTopic='name-of-topic' AND OriginalSubscription='name-of-subcription-sub') OR (NOT Exists([OriginalTopic]))
I'll try and briefly explain what I'm looking to achieve. My initial idea of doing this is not going to work well, in my opinion, so I'm trying to decide how best to plan this.
The first thought was:
I have a list of messages that need to be sent out at scheduled times, each one is stored in a central SQL database.
The intention is to use a Windows service with a timer that ticks every 30 minutes. So:
30 minutes pass > call ScheduleMessages()
ScheduleMessages will check the database for any unsent messages that need to go out in the next 30 minutes, and will then mark them in the database as:
ScheduleActivated = 1
For each one it marks as ScheduleActivated = 1, it will fire off a custom timer object, which inherits from the normal timer and also includes the properties of the message it needs to send.
It will be set to tick at the time the message is due to go out, will send the message, and will mark it as successful in the database.
The main problem with this is that I am going to have timers all over the place, and if a few hundred messages were scheduled at once, it would probably either perform badly or fall over completely.
After re-evaluating, I thought of solution 2:
My other idea was to have 1 timer running in the service, which ticks once every 10 minutes. Each time it ticks it would fire off a method that gathers every single message due to be sent at any point up until that time into a list, and then processes them one at a time.
This seems much less resource-intensive, but I'm worried that when the timer ticks after 10 minutes, any messages that haven't finished sending will be caught in the next tick and be sent again.
Would it work to stop the timer once it has been going for 10 minutes, then reset it to zero and start it again once the messages have been sent?
Is there a 3rd solution to the problem which is better than the above?
We implemented this on one project, what worked for us was:
All messages written to a table with a send time
Service that checks every x mins if there is something to send
When the service sends a message, it also marks the message as sent (updates the sent time from null to the actual sent time).
Marking the message avoids resends, and if you want to resend, you just set the date back to null.
The only problem we had is that the service ran as a single thread, so the number of messages sent was limited. But you would need very many messages and a very small window before this becomes a problem.
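The steps above might look roughly like this; this is a sketch only, and the table schema, column names, and SendMessage helper are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class MessageService
{
    // Called by the service every x minutes.
    public static void SendPendingMessages(SqlConnection conn)
    {
        // Find everything due that has not been marked as sent.
        var select = new SqlCommand(
            "SELECT Id, Recipient, Body FROM Messages " +
            "WHERE SentTime IS NULL AND SendTime <= @now", conn);
        select.Parameters.AddWithValue("@now", DateTime.UtcNow);

        var due = new List<(int Id, string Recipient, string Body)>();
        using (var reader = select.ExecuteReader())
        {
            while (reader.Read())
                due.Add((reader.GetInt32(0), reader.GetString(1), reader.GetString(2)));
        }

        foreach (var msg in due)
        {
            SendMessage(msg.Recipient, msg.Body); // hypothetical send helper

            // Marking the row as sent is what prevents a resend on the next
            // tick; setting SentTime back to NULL re-queues the message.
            var update = new SqlCommand(
                "UPDATE Messages SET SentTime = @now WHERE Id = @id", conn);
            update.Parameters.AddWithValue("@now", DateTime.UtcNow);
            update.Parameters.AddWithValue("@id", msg.Id);
            update.ExecuteNonQuery();
        }
    }

    private static void SendMessage(string recipient, string body) { /* ... */ }
}
```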
Ditch the fixed interval. Windows has plenty of ways to sleep for a specific amount of time, including the Sleep function, waitable timers, etc.
Some of these are available in .NET; for example, WaitHandle.WaitOne accepts a timeout, so your thread can wait until the next scheduled item but also be woken by a request to modify the schedule.
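A minimal sketch of that idea, where the schedule-lookup and send helpers are hypothetical:

```csharp
using System;
using System.Threading;

public static class SchedulerLoop
{
    // Any thread that changes the schedule calls scheduleChanged.Set()
    // to wake the worker early.
    private static readonly AutoResetEvent scheduleChanged = new AutoResetEvent(false);

    public static void Run()
    {
        while (true)
        {
            TimeSpan untilNext = GetTimeUntilNextItem(); // hypothetical lookup

            // WaitOne returns true if signalled (schedule was modified),
            // false on timeout (the next item is now due).
            bool wokenEarly = scheduleChanged.WaitOne(untilNext);
            if (!wokenEarly)
                SendDueMessages(); // hypothetical send helper
        }
    }

    private static TimeSpan GetTimeUntilNextItem() => TimeSpan.FromMinutes(1);
    private static void SendDueMessages() { /* ... */ }
}
```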
In my opinion, the scheduling service should only be responsible for checking schedules, and any work should be passed off to a separate service. The scheduling service shouldn't care about the work being scheduled. Try implementing a work-item interface that contains an Execute method; that way the executing object can handle the internals itself and needn't be aware of the scheduling service. For scheduling, have you checked out Quartz.NET?
I'm wondering whether this would work. I have a simple C# command-line application that sends out emails at a set time (through the Windows scheduler).
I am wondering: if the SMTP server were to fail, would this be a good idea?
In the SmtpException handler I put a thread sleep of, say, 15 minutes. When it wakes up, it just calls the method again; this time, hopefully, the SMTP server is back up. If not, it keeps doing this until the SMTP server is back online.
Is there some downside I'm missing with this? I would, of course, do some logging that this is happening.
This is not a bad idea, in fact what you are effectively implementing is a simple variation of the Circuit-Breaker pattern.
The idea behind the pattern is that if an external resource is down, it will probably not come back up a few milliseconds later; it might need some time to recover. Typically the circuit-breaker pattern is used as a means to fail fast, so that the user gets an error sooner, or so that no more resources are consumed on the failing system. When you have work that can be put in a queue and does not require instant delivery, as you do, it is perfectly reasonable to wait around for the resource to become available again.
Some things to note, though: you might want to have a maximum retry count before failing completely, and you might want to start off with a delay of less than 15 minutes.
Exponential back-off is the common choice here I think. Like the strategy that TCP uses to try to make a connection: double the timeout on each failed attempt. Prevents your program from flooding the event log with repeated failure notifications before somebody notices that something is wrong. Which can take a while.
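A sketch of that doubling strategy; the initial delay, the attempt limit, and the Log helper are assumptions:

```csharp
using System;
using System.Net.Mail;
using System.Threading;

public static class BackoffSender
{
    public static void SendWithBackoff(Action send, int maxAttempts = 6)
    {
        TimeSpan delay = TimeSpan.FromMinutes(1); // assumed initial delay

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                send();
                return; // success
            }
            catch (SmtpException ex)
            {
                if (attempt >= maxAttempts)
                    throw; // give up and let the failure be reported

                Log(ex, delay); // hypothetical logging helper
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double each time
            }
        }
    }

    private static void Log(Exception ex, TimeSpan nextDelay) { /* ... */ }
}
```

Because the delay doubles on each failure, the log fills with a handful of entries rather than one every 15 minutes while the outage lasts.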
However, using the task scheduler certainly doesn't help. You really ought to reprogram it so your program isn't consuming machine resource needlessly. But using the ITaskService interface from .NET isn't that easy. Check out this project.
I would strongly recommend using a Windows Service. Long-running processes that run in the background, wait for long periods of time, and need a controlled, logged, 'monitorable' lifetime: that's what Windows Services do.
Thread.Sleep would do the job, but if you want it to be interruptible from another thread or by something else going on, I would recommend Monitor.Wait (MSDN ref). You can then run your process in a thread created and managed by the Service, and if you need to stop or interrupt it, you call Monitor.Pulse on the same sync object and the thread will come back to life.
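That Monitor.Wait/Monitor.Pulse pattern might be sketched like this inside a service class; the work method and the 15-minute interval are placeholders:

```csharp
using System;
using System.Threading;

public class RetryWorker
{
    private readonly object sync = new object();
    private bool stopRequested;

    // Runs on a thread created and managed by the Windows Service.
    public void WorkerLoop()
    {
        lock (sync)
        {
            while (!stopRequested)
            {
                DoOneUnitOfWork(); // hypothetical: attempt the send, etc.

                // Monitor.Wait releases the lock while waiting, so OnStop
                // can get in; it wakes on timeout or on Pulse.
                Monitor.Wait(sync, TimeSpan.FromMinutes(15));
            }
        }
    }

    // Called from the service's OnStop.
    public void RequestStop()
    {
        lock (sync)
        {
            stopRequested = true;
            Monitor.Pulse(sync); // wake the worker so it exits promptly
        }
    }

    private void DoOneUnitOfWork() { /* ... */ }
}
```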
Also ref:
Best architecture for a 30 + hour query
Hope that helps!
Infinite loops are always a worry. You should set it up so that it will fail after N attempts, and you definitely should have some way to shut it down from the user console.
Failure is not such a bad thing when the failure isn't yours. Let it fail and report why it failed.
Your choices are limited. Assuming it is just a temporary condition and that it has worked at some point, all you can do is notify someone of the problem, get somebody to fix it, and then retry the operation later. The one thing you must do is safeguard the messages so that you do not lose any.
If you stick with what you've got, watch out for concurrency; perhaps use a named mutex to ensure only a single process is running at a time.
I send out notifications to all our developers in a similar fashion. Only, I store the message body and subject in the database. After a message has been successfully processed, I set a success flag in the database. This way it's easy to track and report errors, and retries are a cakewalk.