I'm using an Azure Cloud Service Worker Role to process incoming tasks from queues. Processing a single task can take up to several hours, and each worker role can handle up to N tasks simultaneously. Basically, it's working.
Now, the documentation says that from time to time the worker role can be shut down (for a software update, OS upgrade, etc.). That in itself is fine. But such a planned shutdown must not forcibly kill tasks the worker role is already running.
Expected:
When the environment calls the OnStop() method:
the worker role will stop taking new tasks for processing,
wait for the running tasks to complete,
and then continue with the planned shutdown.
Actual:
The OnStop() method can block for at most 5 minutes. I cannot guarantee that I'll finish processing a task within 5 minutes, so this is a problem... My task gets killed in the middle of processing, which leaves my software in an unstable state.
How can I avoid this 5-minute limit? Any tip is welcome.
How can I avoid this 5-minute limit?
Unfortunately, you can't. This is a hard limit imposed by Azure, so you will need to work around it.
There are two possible solutions I can think of, and both would require you to rethink your current architecture:
Break your one big task into many smaller tasks and create some kind of workflow.
Make your task idempotent, so that even if it gets terminated partway through (because of a worker role shutdown or an error in the task itself) and is later picked up by another instance, it can start again without corrupting the task's output (a minimal sketch follows below).
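For that second option, here is a minimal sketch of a resumable/idempotent task, assuming you persist progress somewhere durable. LoadCheckpoint, SaveCheckpoint and ProcessChunk are hypothetical placeholders for your own storage and processing code:
public class TaskProcessor
{
    public void RunTask(string taskId)
    {
        // If another instance already processed part of this task, continue
        // from where it left off instead of starting over.
        int nextChunk = LoadCheckpoint(taskId);

        const int totalChunks = 100;
        for (int chunk = nextChunk; chunk < totalChunks; chunk++)
        {
            ProcessChunk(taskId, chunk);        // must be safe to re-run if interrupted
            SaveCheckpoint(taskId, chunk + 1);  // durable progress marker
        }
    }

    // Placeholders: replace with your own durable storage (e.g. table storage)
    // and your real per-chunk processing.
    int LoadCheckpoint(string taskId) { return 0; }
    void SaveCheckpoint(string taskId, int nextChunk) { }
    void ProcessChunk(string taskId, int chunk) { }
}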
No, you cannot bypass this limit. In general you should not rely on any of your instances running continuously for any long period of time. Instances may be suddenly stopped, or they may suddenly disappear (because of an underlying server failure). Your software should be designed such that when an instance is restarted (possibly redeployed), or some other instance finds capacity to take a previously released work item, that work item is reprocessed without any adverse effects.
Is it possible to send a heartbeat to Hangfire (Redis storage) to tell the system that the process is still alive? At the moment I set the InvisibilityTimeout to TimeSpan.MaxValue to prevent Hangfire from restarting the job. But if the process fails or the server restarts, the job will never be removed from the list of running jobs. So my idea was to remove the large timeout and send a kind of heartbeat instead. Is this possible?
I found https://discuss.hangfire.io/t/hangfire-long-job-stop-and-restart-several-time/4282/2 which deals with how to keep a long-running job alive in Hangfire.
The user zLanger says that jobs are considered dead and restarted once you ...
[...] are hitting hangfire’s invisibilityTimeout. You have two options.
increase the timeout to more than the job will ever take to run
have the job send a heartbeat to let Hangfire know it's still alive.
That's not new to you. But interestingly, the follow-up question there is:
How do you implement heartbeat on job?
This remains unanswered there, a hint that your problem is really not trivial.
I have never handled long-running jobs in Hangfire, but I know the problem from other queuing systems like the former SunGrid Engine which is how I got interested in your question.
Back in the day, I had exactly your problem with SunGrid, and the department's computer guru told me that, according to some mathematical queuing theory, one should avoid long-running jobs at all costs (I will try to contact him and find the reference to the book he quoted). His idea may be worth sharing with you:
If you have some job which takes longer than the tolerated maximal running time of the queuing system, do not submit the job itself, but rather multiple calls of a wrapper script which is able to (1) start, (2) freeze-stop, (3) unfreeze-continue the actual task.
This stop/continue can indeed be a suspend at the operating-system level (Ctrl+Z and fg in Linux); see e.g. unix.stackexchange.com on that issue.
In practice, I had the binary myMonteCarloExperiment.x and the wrapper script myMCjobStarter.sh. The maximum compute time I had was a day. I would fill the queue with hundreds of calls of the wrapper script, with the boundary condition that only one of them should be running at a time. The script would check whether a myMonteCarloExperiment.x process was already started anywhere on the compute cluster; if not, it would start an instance. If there was a suspended process, the wrapper script would resume it, let it run for 23 hours and 55 minutes, and then suspend it again. In any other case, the wrapper script would report an error.
This approach does not implement a job heartbeat, but it does indeed run a lengthy job. It also keeps the queue administrator happy by avoiding the need to clean up Hangfire's job logs.
Further references
How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution seems to be a good read
I have the following scenario:
A C# application (.NET 4.0/4.5) with 5-6 different threads. Every thread has a different task, which is launched every x seconds (ranging from 5 to 300).
Each task has the following steps:
Fetch items from SQL Server
Convert the items to JSON
Send the data to the web server
Wait for the server's reply.
Since these tasks can fail at some point (internet problems, timeouts, etc.), what is the best solution in the .NET world?
I thought about the following solutions:
Spawn a new thread every x seconds (if no other thread of this type is already running)
Spawn one thread per task type and loop through the steps every x seconds (and work out how to handle exceptions)
Which would be safer and more robust? The application will run on unattended systems, so it should keep running regardless of any exception.
Threads are pretty expensive to create. The first option isn't a great one. If the cycle of "do stuff" is pretty brief (between the pauses), you might consider using the ThreadPool or the TPL. If the threads are mostly busy, or the work takes any appreciable time, then dedicated workers are more appropriate.
As for exceptions: don't let exceptions escape workers. You must catch them. If all that means is that you give up and retry in a few seconds, that is probably fine.
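For illustration, here is a minimal sketch of a dedicated worker loop that never lets an exception escape; the delegate and the interval are placeholders for your own work and schedule:
using System;
using System.Threading;

class Worker
{
    static void WorkerLoop(TimeSpan interval, Action fetchConvertAndSend)
    {
        while (true)
        {
            try
            {
                fetchConvertAndSend();   // fetch from SQL, convert to JSON, post, read the reply
            }
            catch (Exception ex)
            {
                // Never let an exception escape the worker: log it and retry next cycle.
                Console.Error.WriteLine(ex);
            }
            Thread.Sleep(interval);      // wait x seconds before the next cycle
        }
    }

    // Usage: one background thread per task type, e.g.
    // var t = new Thread(() => WorkerLoop(TimeSpan.FromSeconds(30), MyTask)) { IsBackground = true };
    // t.Start();
}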
You could model the whole thing using a producer/consumer approach. You have a producer that puts new task descriptions into a queue, and multiple consumers (4 or 5 threads) that process items from the queue. The number of consumer threads could vary depending on the load and the length of the queue.
Each task involves reading from the DB, converting the format, sending to the web server and then processing the response. I assume each task would perform all of these steps.
In case of exceptions for an item in the queue, you could potentially mark the queue item as failed and schedule it for a retry later.
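A minimal sketch of that shape using BlockingCollection; TaskDescription and ProcessItem are placeholders for your own item type and per-item work:
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class TaskDescription { public int Id; }

class Pipeline
{
    static void ProcessItem(TaskDescription item)
    {
        // fetch from the DB, convert, send to the web server, handle the response
    }

    static void Run()
    {
        var queue = new BlockingCollection<TaskDescription>();

        // Consumers: a few long-running workers draining the queue.
        var consumers = new Task[4];
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = Task.Factory.StartNew(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                {
                    try { ProcessItem(item); }
                    catch (Exception)
                    {
                        // mark the queue item as failed and schedule a retry later
                    }
                }
            }, TaskCreationOptions.LongRunning);
        }

        // Producer: enqueue new task descriptions as they arrive.
        for (int id = 0; id < 100; id++)
            queue.Add(new TaskDescription { Id = id });

        queue.CompleteAdding();   // signal that no more work is coming
        Task.WaitAll(consumers);  // wait for the consumers to drain the queue
    }
}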
I have a scheduler that runs as a background thread, started when an ASP.NET site starts up. Users can initiate various tasks (alert emails, file generation, etc.), which are inserted into a DB table. The scheduler picks the tasks up from the database and pushes the items onto a stack. The scheduler also has a thread pool of 10 background threads, which pop task items off the stack and execute them.
This runs fine on one web server, but behaves strangely on another. The threads go idle for 6-12 seconds for no apparent reason and do nothing even though there are items on the stack.
Using lock() on the stack object to make Push and Pop thread safe
Tried Thread.Yield() to yield the CPU to other threads, but it slows down execution and the idling still persists
Tried Thread.Sleep(0) to yield the CPU to other threads, but it slows down execution and the idling still persists
Logged entry and exit of all methods to check whether something was going wrong during execution, but no luck
My questions:
Is the execution of threads in .NET non-deterministic?
Is it necessary to call Thread.Yield() or Thread.Sleep(0) to give the CPU breathing room?
Why does it behave differently on boxes with the same configuration? Are there any machine/environment-specific factors that affect thread execution?
UPDATE on May.08.2013
There are two boxes in the farm, identical in hardware configuration and set up with the same software configuration: Windows 2008 64-bit / IIS 7. Each web server hosts only one site, with the same build. The application pools of both sites run on Framework v4.0 in integrated mode. This is legacy code and has not changed in the last two years.
We tried several iterations; in every case webserver1 executes without any issues and completes the background work as quickly as before, BUT webserver2 shows a significant delay and performs very poorly.
We tried extensive logging, capturing entry/exit of all methods. The pattern is this: all threads work fine for 2 seconds, then go idle for 6-12 seconds, come alive again and execute for the next 2 seconds, and then go idle again. This behavior is consistent until the task completes. There are no exceptions, no application terminations, and no errors in the application pool/IIS logs.
Any ideas?
Your threads are repeatedly trying to grab a lock, which may be causing contention. But that should not take 6-12 seconds; only a debugger can answer why.
You can use an AutoResetEvent and wait on it in the worker threads, then Set the event when you push an item onto the stack.
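A minimal sketch of that idea, assuming a stack of Action work items; the names here are illustrative only:
using System;
using System.Collections.Generic;
using System.Threading;

class WorkQueue
{
    private readonly Stack<Action> _stack = new Stack<Action>();
    private readonly object _sync = new object();
    private readonly AutoResetEvent _itemPushed = new AutoResetEvent(false);

    public void Push(Action work)
    {
        lock (_sync)
        {
            _stack.Push(work);
        }
        _itemPushed.Set();             // wake one waiting worker
    }

    public void WorkerLoop()
    {
        while (true)
        {
            Action work = null;
            lock (_sync)
            {
                if (_stack.Count > 0)
                    work = _stack.Pop();
            }

            if (work != null)
                work();                // execute the popped task
            else
                _itemPushed.WaitOne(); // sleep until the next Push signals
        }
    }
}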
Okay guys, we finally pinned down the issue.
One of the web server's CPU cores was stuck at 100% and never came back down, whereas the other cores were at 0-5%.
We did load testing for normal, moderate and heavy loads. Under normal to moderate load the server serves requests decently, properly sharing execution across all CPU cores. But under heavy load things change: the server struggles to distribute the load among the cores, and the threads go idle for 6-7 seconds. We assume that, because of the failure of one CPU core, it falls back to some fuzzy logic to distribute the work among the cores.
After further investigation we found that the Windows NT kernel was causing this problem, possibly due to corruption or a driver-related issue.
I need to optimize a WCF service... it's quite a complex thing. My problem this time has to do with tasks (Task Parallel Library, .NET 4.0). What happens is that I launch several tasks when the service is invoked (using Task.Factory.StartNew) and then wait for them to finish:
Task.WaitAll(task1, task2, task3, task4, task5, task6);
Ok... what I see, and don't like, is that on the first call (sometimes the first 2-3 calls, if made quickly one after another), the final task starts much later than the others (I am looking at a case where it started 0.5 seconds after the others). I tried calling
ThreadPool.SetMinThreads(12*Environment.ProcessorCount, 20);
at the beginning of my service, but it doesn't seem to help.
The tasks are all database-related: I'm reading from multiple databases and it has to take as little time as possible.
Any idea why the last task is taking so long? Is there something I can do about it?
Alternatively, should I use the thread pool directly? As it happens, in one case I'm looking at, one task had already ended before the last one started - I would have saved 0.2 seconds if I had reused that thread instead of waiting for a new one to be created. However, I cannot be sure that that task will always end so quickly, so I can't put both requests in the same task.
[Edit] The OS is Windows Server 2003, so there should be no connection limit. Also, it is hosted in IIS - I don't know whether I should create regular threads or use the thread pool - which is preferred?
[Edit] I've also tried using Task.Factory.StartNew(action, TaskCreationOptions.LongRunning); - it doesn't help, the last task still starts much later (around half a second later) than the rest.
[Edit] MSDN says:
The thread pool has a built-in delay (half a second in the .NET Framework version 2.0) before starting new idle threads. If your application periodically starts many tasks in a short time, a small increase in the number of idle threads can produce a significant increase in throughput. Setting the number of idle threads too high consumes system resources needlessly.
However, as I said, I'm already calling SetMinThreads and it doesn't help.
I have had problems myself with delays in thread startup when using the (.Net 4.0) Task-object. So for time-critical stuff I now use dedicated threads (... again, as that is what I was doing before .Net 4.0.)
The purpose of a thread pool is to avoid the operating-system cost of starting and stopping threads; the threads are simply reused. This is a common model found in, for example, internet servers. The advantage is that they can respond more quickly.
I've written many applications where I implement my own thread pool by having dedicated threads pick up tasks from a task queue. Note, however, that this most often requires locking, which can cause delays/bottlenecks. This depends on your design: if the tasks are small, there will be a lot of locking, and it might be faster to trade some CPU for less locking: http://www.boyet.com/Articles/LockfreeStack.html
SmartThreadPool is a replacement/extension of the .Net thread pool. As you can see in this link it has a nice GUI to do some testing: http://www.codeproject.com/KB/threads/smartthreadpool.aspx
In the end it depends on what you need, but for high performance I recommend implementing your own thread pool. If you experience a lot of thread idling, it could be beneficial to increase the number of threads (beyond the recommended CPU count * 2). This is actually how Hyper-Threading works inside the CPU - using "idle" time during some operations to do other operations.
Note that .Net has a built-in limit of 25 threads per process (ie. for all WCF-calls you receive simultaneously). This limit is independent and overrides the ThreadPool setting. It can be increased, but it requires some magic: http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=201
Following from my prior question (yep, should have been a Q against original message - apologies):
Why do you feel that creating 12 threads for each processor core in your machine will in some way speed-up your server's ability to create worker threads? All you're doing is slowing your server down!
As per the MSDN docs: "You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorithm for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance."
Issues like this are usually caused by bumping into limits or contention on a shared resource.
In your case, I am guessing that your last task(s) is/are blocking while they wait for a connection to the DB server to become available or for the DB to respond. Remember - if your invocation kicks off 5-6 other tasks, then your machine is going to have to create and open numerous DB connections and is going to hit the DB with, potentially, a lot of work. If your WCF server and/or your DB server are cold, then your first few invocations are going to be slower until the machines' caches, etc., are populated.
Have you tried adding a little tracing/logging using the stopwatch to time how long it takes for your tasks to connect to the DB server and then execute their operations?
You may find that reducing the number of concurrent tasks you kick off actually speeds things up. Try spawning 3 tasks at a time, waiting for them to complete and then spawning the next 3.
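A minimal sketch of that batching idea; Query1..Query6 stand in for your per-database operations:
using System;
using System.Linq;
using System.Threading.Tasks;

class BatchedCalls
{
    static void Query1() { } static void Query2() { } static void Query3() { }
    static void Query4() { } static void Query5() { } static void Query6() { }

    static void RunInBatches()
    {
        var queries = new Action[] { Query1, Query2, Query3, Query4, Query5, Query6 };

        // Start three tasks, wait for them, then start the next three,
        // instead of launching all six at once.
        for (int i = 0; i < queries.Length; i += 3)
        {
            var batch = queries.Skip(i).Take(3)
                               .Select(q => Task.Factory.StartNew(q))
                               .ToArray();
            Task.WaitAll(batch);
        }
    }
}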
When you call Task.Factory.StartNew, it uses a TaskScheduler to map those tasks into actual work items.
In your case, it sounds like one of your Tasks is delaying occasionally while the OS spins up a new Thread for the work item. You could, potentially, build a custom TaskScheduler which already contained six threads in a wait state, and explicitly used them for these six tasks. This would allow you to have complete control over how those initial tasks were created and started.
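For reference, here is a minimal sketch of such a scheduler, assuming a fixed set of pre-started threads waiting on a BlockingCollection; this is an illustration, not a production implementation, and DoDbWork1 in the usage comment is a placeholder for one of your DB tasks:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class DedicatedThreadTaskScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _tasks = new BlockingCollection<Task>();

    public DedicatedThreadTaskScheduler(int threadCount)
    {
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(() =>
            {
                // Each thread is already running and blocks here until work arrives.
                foreach (var task in _tasks.GetConsumingEnumerable())
                    TryExecuteTask(task);
            });
            thread.IsBackground = true;
            thread.Start();
        }
    }

    protected override void QueueTask(Task task)
    {
        _tasks.Add(task);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return false;  // keep it simple: never execute inline
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return _tasks.ToArray();
    }

    public void Dispose()
    {
        _tasks.CompleteAdding();
    }
}

// Usage: pass the scheduler so these tasks never wait on the ThreadPool.
// var scheduler = new DedicatedThreadTaskScheduler(6);
// var task1 = Task.Factory.StartNew(DoDbWork1, CancellationToken.None,
//                                   TaskCreationOptions.None, scheduler);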
That being said, I suspect there is something else at play here... You mentioned that using TaskCreationOptions.LongRunning demonstrates the same behavior. This suggests that there is some other factor at play causing this half second delay. The reason I suspect this is due to the nature of TaskCreationOptions.LongRunning - when using the default TaskScheduler (LongRunning is a hint used by the TaskScheduler class), starting a task with TaskCreationOptions.LongRunning actually creates an entirely new (non-ThreadPool) thread for that Task. If creating 6 tasks, all with TaskCreationOptions.LongRunning, demonstrates the same behavior, you've pretty much guaranteed that the problem is NOT the default TaskScheduler, since this is going to always spin up 6 threads manually.
I'd recommend running your code through a performance profiler, and potentially the Concurrency Visualizer in VS 2010. This should help you determine exactly what is causing the half second delay.
What is the OS? If you are not running the server versions of windows, there is a connection limit. Your many threads are probably being serialized because of the connection limit.
Also, I have not used the task parallel library yet, but my limited experience is that new threads are cheap to make in the context of networking.
These articles might explain the problem you're having:
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-are-wcf-responses-slow-and-setminthreads-does-not-work.aspx
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-does-wcf-become-slow-after-being-idle-for-15-seconds.aspx
Seeing as you're using .NET 4, the first article probably doesn't apply, but as the second article points out, the ThreadPool terminates idle threads after 15 seconds, which might explain the problem you're having; it also offers a simple (though a little hacky) solution to get around it.
Whether or not you should be using the ThreadPool directly wouldn't make any difference as I suspect the task library is using it for you underneath anyway.
One third-party library we have been using for a while might help you here - Smart Thread Pool. You still get the same benefits of using the task libraries, in that you can have the return values from the threads and get any exception information from them too.
Also, you can instantiate multiple thread pools, which is useful when different parts of your application each need their own pool (so that a low-priority process doesn't start eating into the quota of some high-priority process). And you can set the priority of the threads in the pool too, which you can't do with the standard ThreadPool, where all the threads are background threads.
You can find plenty of info on the codeplex page, I've also got a post which highlights some of the key differences:
http://theburningmonk.com/2010/03/threading-introducing-smartthreadpool/
Just as a side note, for tasks like the one you've mentioned, which might take some time to return, you probably shouldn't be using the thread pool anyway. It's recommended to avoid using the thread pool for blocking tasks like that, because they hog threads that the framework classes use for all sorts of things, like handling timer events, etc. (not to mention handling incoming WCF requests!). I feel like I'm spamming here, but here's some of the info I've gathered around the use of the thread pool, with some useful links at the bottom:
http://theburningmonk.com/2010/03/threading-using-the-threadpool-vs-creating-your-own-threads/
well, hope this helps!
I have a task that needs to run every 30 seconds. I can do one of two things:
Write a command line app that runs the task once, waits 30 seconds, runs it again and then exits. I can schedule this task with Scheduled Tasks in Windows to run every minute
Write a Service that runs a task repeatedly while waiting 30 seconds in between each run.
Number 1 is simpler, in my opinion, and I would opt to do it this way by default. Am I wimping out? Is there a reason why I should make this a service and not a scheduled task? What are the pros and cons of each, and which would you pick in the end?
I read a nice blog post about this question recently. It goes into a lot of good reasons why you should not write a service to run a recurring job. Additionally, this question has been asked before:
https://stackoverflow.com/questions/390307/windows-service-vs-scheduled-task
Windows Service or Scheduled Task, which one do we prefer?
One advantage of using the scheduled task is that if there is some potential risk involved in running the service, such as a memory leak or a hanging network connection, then the Windows service can potentially hang around for a long time, adversely affecting other users. The scheduled task, on the other hand, is written to be short-running, so even if it does leak, the effect is minimised.
On the other hand, someone in one of the above questions commented that the scheduler's accuracy is limited to somewhere in the range of 1 minute, so you may find that the scheduler is unable to run your task every 30 seconds accurately.
Obviously there are a number of tradeoffs to consider, but hopefully this will help you make a good decision.
If you're trying to run every 30 seconds, I'd go for option 2. This is pretty much a continually running job, in that case. The overhead of starting and stopping the process is probably higher than the process itself, especially if you use an appropriate timer.
If the job runs only once a day (or a few times a day), then I'd go for option 1 - a scheduled task.
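If you do go with option 2, a minimal sketch of the service's recurring core could look like this; DoWork stands in for the real task, and the service plumbing (OnStart/OnStop) is omitted:
using System;
using System.Threading;

class RecurringJob
{
    private Timer _timer;

    public void Start()
    {
        // Fire immediately, then every 30 seconds.
        _timer = new Timer(state => DoWork(), null,
                           TimeSpan.Zero, TimeSpan.FromSeconds(30));
    }

    public void Stop()
    {
        if (_timer != null)
            _timer.Dispose();
    }

    private void DoWork()
    {
        try
        {
            // the 30-second task goes here
        }
        catch (Exception ex)
        {
            // catch everything so a single failed run doesn't take down the service
            Console.Error.WriteLine(ex);
        }
    }
}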
The Task Scheduler in Windows seems a bit flaky, in my opinion. I think you would get a more reliable result running as a service.
Also, a service can keep resources in memory - for example, input read from a file - and only has to load them at service start-up, not every 30 seconds.
30 seconds is a pretty short interval (relatively speaking) between processing cycles. Like the others I have my concerns about the task scheduler and I am afraid such a short interval will only compound the issues you might encounter if you took that approach. If this were my project I would almost certainly go with the service.