I have heard a few developers recently say that they are simply polling stuff (databases, files, etc.) to determine when something has changed and then run a task, such as an import.
I'm really against this idea and feel that utilising available technology such as Remoting, WCF, etc. would be far better than polling.
However, I'd like to identify the reasons why other people prefer one approach over the other. More importantly, how can I convince others that polling is wrong in this day and age?
Polling is not "wrong" as such.
A lot depends on how it is implemented and for what purpose. If you really care about immediate notification of a change, it is very efficient. Your code sits in a tight loop, constantly polling (asking) a resource whether it has changed or updated. This means you are notified as soon as you can be that something is different. But your code is not doing anything else, and there is overhead in terms of many, many calls to the object in question.
If you are less concerned with immediate notification you can increase the interval between polls, and this can also work well, but picking the correct interval can be difficult. Too long and you might miss critical changes, too short and you are back to the problems of the first method.
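To make the trade-off concrete, here is a minimal sketch of an interval-based polling loop in C#; HasResourceChanged and ProcessChange are hypothetical placeholders for the real check and the real work:

    using System;
    using System.Threading;

    class PollingLoop
    {
        static void Main()
        {
            // Picking this interval is the hard part: too long misses changes,
            // too short burns CPU and bandwidth on calls that find nothing.
            TimeSpan interval = TimeSpan.FromSeconds(5);

            while (true)
            {
                if (HasResourceChanged())   // hypothetical check
                    ProcessChange();        // hypothetical handler

                Thread.Sleep(interval);     // idle between polls
            }
        }

        static bool HasResourceChanged() { return false; }
        static void ProcessChange() { }
    }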
Alternatives, such as interrupts or messages, can provide a better compromise in these situations. You are notified of a change as soon as is practically possible, but this delay is not something you control; it depends on the component itself being timely about passing on changes in state.
What is "wrong" with polling?
It can be resource hogging.
It can be limiting (especially if you have many things you want to know about / poll).
It can be overkill.
But...
It is not inherently wrong.
It can be very effective.
It is very simple.
Examples of things that use polling in this day and age:
Email clients poll for new messages (even with IMAP).
RSS readers poll for changes to feeds.
Search engines poll for changes to the pages they index.
StackOverflow users poll for new questions, by hitting 'refresh' ;-)
BitTorrent clients poll the tracker (and each other, I think, with DHT) for changes in the swarm.
Spinlocks on multi-core systems can be the most efficient synchronisation between cores, in cases where the delay is too short for there to be time to schedule another thread on this core, before the other core does whatever we're waiting for.
Sometimes there simply isn't any way to get asynchronous notifications: for example to replace RSS with a push system, the server would have to know about everyone who reads the feed and have a way of contacting them. This is a mailing list - precisely one of the things RSS was designed to avoid. Hence the fact that most of my examples are network apps, where this is most likely to be an issue.
Other times, polling is cheap enough to work even where there is async notification.
For a local file, notification of changes is likely to be the better option in principle. For example, you might (might) prevent the disk spinning down if you're forever poking it, although then again the OS might cache. And if you're polling every second on a file which only changes once an hour, you might be needlessly occupying 0.001% (or whatever) of your machine's processing power. This sounds tiny, but what happens when there are 100,000 files you need to poll?
In practice, though, the overhead is likely to be negligible whichever you do, making it hard to get excited about changing code that currently works. Best thing is to watch out for specific problems that polling causes on the system you want to change - if you find any then raise those rather than trying to make a general argument against all polling. If you don't find any, then you can't fix what isn't broken...
There are two reasons why polling could be considered bad in principle.
It is a waste of resources. It is very likely that you will check for a change while no change has occurred. The CPU cycles/bandwidth spent on this action do not result in a change and thus could have been better spent on something else.
Polling is done on a certain interval. This means that you won’t know that a change has occurred until the next poll comes around.
It would be better to be notified of changes. This way you’re not polling for changes that haven’t occurred and you’ll know of a change as soon as you receive the notification.
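For contrast, a minimal sketch of the notification style using a plain C# event; the Watcher type and its ResourceChanged event are made up for illustration:

    using System;

    class Watcher
    {
        public event EventHandler ResourceChanged;

        // Called by whatever actually detects the change.
        public void RaiseResourceChanged()
        {
            EventHandler handler = ResourceChanged;
            if (handler != null)
                handler(this, EventArgs.Empty);
        }
    }

    class Program
    {
        static void Main()
        {
            Watcher watcher = new Watcher();
            // Subscribe once; no cycles are spent until a change actually occurs.
            watcher.ResourceChanged += delegate { Console.WriteLine("Change detected"); };
        }
    }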
Polling is easy to do, very easy; it's as easy as any procedural code. Not polling means you enter the world of asynchronous programming, which isn't as brain-dead easy, and might even become challenging at times.
And as with everything in any system, the path of least resistance is normally the more commonly taken, so there will always be programmers using polling, even great programmers, because sometimes there is no need to complicate things with asynchronous patterns.
I for one always strive to avoid polling, but sometimes I do it anyway, especially when the actual gains of asynchronous handling aren't that great, such as when acting against some small local data (of course you get a bit faster, but users won't notice the difference in a case like this). So there is room for both methodologies IMHO.
Client polling doesn't scale as well as server notifications. Imagine thousands of clients asking the server "any new data?" every 5 seconds. Now imagine the server keeping a list of clients to notify of new data. Server notification scales better.
I think people should realize that in most cases, at some level there is polling being done, even in event or interrupt driven situations, but you're isolated from the actual code doing the polling. Really, this is the most desirable situation ... isolate yourself from the implementation, and just deal with the event. Even if you must implement the polling yourself, write the code so that it's isolated, and the results are dealt with independently of the implementation.
The thing about polling is that it works! It's reliable and simple to implement.
The costs of polling can be high: if you are scanning a database for changes every minute when there are only two changes a day, you are consuming a lot of resources for a very small result.
However, the problem with any notification technology is that it is much more complex to implement, and not only can it be unreliable but (and this is a big BUT) you cannot easily tell when it is not working.
So if you do drop polling for some other technology, make sure it is usable by average programmers and is ultra reliable.
It's simple - polling is bad - inefficient, wasteful of resources, etc. There is always some form of connectivity in place that is monitoring for an event of some sort anyway, even if 'polling' is not chosen.
So why go the extra mile and put additional polling in place?
Callbacks are the best option: you just need to worry about tying the callback in with your current process. Under the hood, there is polling going on to see that the connection is still in place anyhow.
If you keep phoning/ringing your girlfriend and she never answers, then why keep calling? Just leave a message, and wait until she 'calls back' ;)
I use polling occasionally for certain situations (for example, in a game, I would poll the keyboard state every frame), but never in a loop that ONLY does polling; rather, I would do polling as a check (has resource X changed? If yes, do something; otherwise process something else and check again later). Generally speaking though, I avoid polling in favor of asynchronous notifications.
The reason being that I do not spend resources (CPU time, whatever) waiting for something to happen (especially if those resources could speed up that thing happening in the first place). In the cases where I use polling, I don't sit idle waiting; I use the resources elsewhere, so it's a non-issue (for me, at least).
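A quick sketch of that poll-as-check pattern, with HasChanged, HandleChange and DoFrameWork as hypothetical stand-ins:

    while (gameRunning)
    {
        if (HasChanged(resourceX))    // cheap check, not a blocking wait
            HandleChange(resourceX);

        DoFrameWork();                // the loop always has other work to do
    }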
If you are polling for changes to a file, then I agree that you should use the filesystem notifications that are available for when this happens, which are available in most operating systems now.
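In .NET, for example, that means FileSystemWatcher rather than re-reading the file on a timer; the watched path and filter below are just placeholders:

    using System;
    using System.IO;

    class FileWatchExample
    {
        static void Main()
        {
            FileSystemWatcher watcher = new FileSystemWatcher(@"C:\imports", "*.csv");
            watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName;
            watcher.Created += (s, e) => Console.WriteLine("Created: " + e.FullPath);
            watcher.Changed += (s, e) => Console.WriteLine("Changed: " + e.FullPath);
            watcher.EnableRaisingEvents = true;

            Console.ReadLine();  // keep the process alive while watching
        }
    }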
In a database you could trigger on update/insert and then call your external code to do something. However it might just be that you don't have a requirement for instant actions. For instance you might only need to get data from Database A to Database B on a different network within 15 minutes. Database B might not be accessible from Database A, so you end up doing the polling from, or as a standalone program running near, Database B.
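A hedged sketch of that kind of polling job, assuming a hypothetical LastModified column on the source table and the 15-minute requirement mentioned above:

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    class DbPoller
    {
        static void Main()
        {
            DateTime lastSeen = DateTime.MinValue;

            while (true)
            {
                using (SqlConnection conn = new SqlConnection("...connection string..."))
                using (SqlCommand cmd = new SqlCommand(
                    "SELECT MAX(LastModified) FROM SourceTable", conn))
                {
                    conn.Open();
                    object result = cmd.ExecuteScalar();
                    if (result != null && result != DBNull.Value && (DateTime)result > lastSeen)
                    {
                        lastSeen = (DateTime)result;
                        // transfer the new/changed rows to Database B here
                    }
                }

                Thread.Sleep(TimeSpan.FromMinutes(15));
            }
        }
    }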
Also, polling is a very simple thing to program. It is often a first-step implementation done when time constraints are short, and because it works well enough, it remains.
I see many answers here, but I think the simplest answer is the answer itself:
Because it is (usually) much simpler to code a polling loop than to build the infrastructure for callbacks.
Then, you get simpler code which if it turns out to be a bottleneck later can be easily understood and redesigned/refactored into something else.
This is not answering your question. But realistically, especially in this "day and age" where processor cycles are cheap, and bandwidth is large, polling is actually a pretty good solution for some tasks.
The benefits are:
Cheap
Reliable
Testable
Flexible
I agree that avoiding polling is a good policy. However, in reference to Robert's post, I would say that the simplicity of polling can make it a better approach in instances where the issues mentioned here are not such a big problem, as the asynchronous approach is often considerably less readable and harder to maintain, not to mention the bugs that can creep into its implementation.
As with everything, it depends. A large high-transaction system I work on currently uses notification with SQL (a DLL loaded within SQL Server that is called by an extended SP from triggers on certain tables; the DLL then notifies other apps that there is work to do).
However we're moving away from this because we can practically guarantee that there will be work to do continuously. Therefore in order to reduce the complexity and actually speed things up a bit, the apps will process their work and immediately poll the DB again for new work. Should there be none it'll try again after a small interval.
This seems to work quicker and is much simpler. However, another part of the application which is much lower volume does not benefit from a speed increase using this method - unless the polling interval is very small, which leads to performance problems. So we're leaving it as is for this part. Therefore it's a good thing when it's appropriate, but everybody's needs are different.
Here is a good summary of relative merits of push and pull:
https://stpeter.im/index.php/2007/12/14/push-and-pull-in-application-architectures/
I wish I could summarize it further into this answer but some things are best left unabridged.
When thinking about SQL polling, back in the day of VB6 you used to be able to create recordsets using the WithEvents keyword which was an early incarnation of async "listening".
I personally would always look for a way of using an event-driven implementation before polling. Failing that, a manual implementation of any of the following might help:
SQL Service Broker / the SqlDependency class
Some kind of queue technology (RabbitMQ or similar)
UDP broadcast - an interesting technique that can be built with multiple node listeners. Not always possible on some networks though.
Some of these may require a slight redesign of your project, but in an enterprise world might be the better route to go rather than a polling service.
Agree with most responses that async/messaging is usually better. I absolutely agree with Robert Gould's answer. But I'd like to add one more point.
One addition is that polling can kill two birds with one stone. In one particular use case, a project I was involved with used a message queue between databases but polling from an application server to one of the databases. Because the network from app server to DB was occasionally down, polling was additionally used to notify the app of network issues.
In the end, use what makes most sense for the use case with scalability in mind.
I'm using polling to check for updates on a file because I'm getting information about that file across a heterogeneous system with different OS types, one of which is very old. The notifications for Linux won't work if the file is on a remote system with a different OS, because that information is not transmitted, but polling works. It's a low bandwidth check, so it doesn't hurt anything.
Related
Looking at the official .NET client code, in several places I saw lock statements. This raised an internal question about how much that impacts performance.
My current solution is a web app that is using Graylog for logging, and its sink is a RabbitMQ queue. A single critical-path request can result in several dozen logs alone, and ideally it should run in 500ms. At peak moments we're expecting to handle 3-5 of those requests and one to two hundred others each second.
Right now, the connection and the model are basically singletons, and my question is: how worried should I be about those locks when we hit heavy load? Are there known deadlock spots?
In general, locks themselves are relatively cheap, as can be read here: How expensive is the lock statement?
Short answer: 50ns
So the actual question is: what part is actually locked and does it matter?
My assumption is that it is the part where a message is being published to the queue (although it would help if you elaborate on that).
So, I didn't dive into the code, but since it's purely on the client side, you should be able to horizontally scale those clients with no difficulties.
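If the contention does turn out to be around a shared channel, one option (sketched against the pre-7.x RabbitMQ.Client API, and only an assumption about where your locks are) is to keep the connection as a singleton but give each thread its own channel, since the client's guidance is that connections are safe to share across threads while channels are not:

    using System.Threading;
    using RabbitMQ.Client;

    class LogPublisher
    {
        private static readonly IConnection Connection =
            new ConnectionFactory { HostName = "localhost" }.CreateConnection();

        // One channel per thread: channels are not meant to be shared
        // across threads, while the connection is safe to share.
        private static readonly ThreadLocal<IModel> Channel =
            new ThreadLocal<IModel>(() => Connection.CreateModel());

        public static void Publish(byte[] body)
        {
            // Exchange and routing key are hypothetical.
            Channel.Value.BasicPublish("logs", "", null, body);
        }
    }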
I'm making a cool (imo) T4 template which will make caching a lot easier. One of the options I have in making this template is to allow for a "load once" type functionality, though I'm not sure how safe it is.
Basically, I want to make it so you can do something like this:
var post = MyCache.PostsCache.GetOrLockLoad(id, () => LoadPost(id));
and basically make it so that when the cache must be loaded, it will place a blocking lock across PostsCache. This way, other threads would block until the LoadPost() function is done, so LoadPost will only be executed once per cache miss. The traditional way of doing this is that LoadPost will be executed any time the cache is empty, possibly multiple times if multiple requests for it come in before the cache is loaded the first time.
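As a rough sketch (not the actual T4 output), the core of such a GetOrLockLoad could be a lock plus a re-check so the loader runs at most once per miss:

    using System;
    using System.Collections.Generic;

    class SimpleCache<TKey, TValue>
    {
        private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
        private readonly object _sync = new object();

        public TValue GetOrLockLoad(TKey key, Func<TValue> loader)
        {
            // Blocks all readers of this cache while a load is in flight,
            // which is exactly the trade-off being asked about.
            lock (_sync)
            {
                TValue value;
                if (!_items.TryGetValue(key, out value))
                {
                    value = loader();     // executed at most once per miss
                    _items[key] = value;
                }
                return value;
            }
        }
    }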
Is this a reasonable thing to do, or is blocking other threads for something like this dangerous or wasteful? I'm thinking something along the lines that the thread locking overheads are greater than most operations, but maybe not?
Has anyone seen this kind of thing done and is it a good idea or just dangerous?
Also, although it's designed to run on any cache and application type, it's initially being targeted at ASP.NET's built-in caching mechanism.
This seems ok, since in theory the requests after the first will only wait about as long as it would have taken for them to load the data themselves anyway.
But it still feels a bit iffy: what if the first loader thread gets held up due to some intermittent issue that may not affect other threads? It feels like it would be safer to let each thread try the load independently.
It's also adding the complexity and overhead of the locking mechanisms. Keep in mind the more locking you do, the more risk you introduce of getting a deadlock condition (in general). Although in your case, as long as there's no funky locking going on in the LoadPost method it shouldn't be an issue.
Given the risks, I think you would be better off going with a non-locking option.
After all, for any given thread the wait time is pretty much the same - either the time taken to load, or the time spent waiting for the first thread to load.
I'm always a little uncomfortable when a non-concurrent option is used over a concurrent one, especially if the gain seems marginal.
I'm working on a web application framework which uses MSSQL for data storage, mostly just does CRUD operations (but on arbitrarily complex structures), provides a WCF interface for rich Silverlight admin and has an MVC3 display (and some basic forms like user settings, etc).
It's getting quite good at being able to load, display, edit and save any (reasonably) complex data structure, in a user-friendly way.
But, I'm looking towards the future, and want to expand my capabilities (and it would be fun to learn new things along the way as well...) - so I've decided (in the light of what's coming for C# 5...) to try to get some parallel/async optimization... Now, I haven't even learned TPL and PLinq yet, so I'm happy for any advice there as well.
So my question is, what are possible areas where parallel processing maybe of help, and where does TPL and PLinq help me on that?
My gut tells me I could try saving branches of a data structure in parallel to the database (this is where I'd expect the biggest performance optimization), and I could perform some complex operations (file upload, mail sending maybe?) in a multithreaded environment, etc. Can I build complex SL UI views in parallel on the client? (Creating 60 data-bound fields on a view can cause "blinking"...) Can I create partial views (menus, category trees, search forms, etc) in MVC at once?
ps: If this turns into a "Tell me everything about parallel stuff" thread, I'm happy to make it community wiki...
Remember that an asp.net web application is intrinsically a parallel application in any case. Requests can be serviced in parallel and this will all be managed by the asp.net framework. So there are two cases:
You have lots of users all hitting the site at once. In which case the parallel processing capability of the server is probably being used to capacity in any case.
You don't have lots of users all hitting the site at once. In which case the server is probably quite capable of dealing with the responses without parallel processing in a suitably fast response time.
Any time you start thinking about optimising something just because it might be fun, or because you just think you should make stuff faster, then you are almost certainly guilty of premature optimization. Your efforts could almost certainly be better spent enriching the functionality of the framework, rather than making what is probably a plenty fast enough solution a little bit faster (at the cost of significantly increased complexity).
In answer to the question of where can TPL and PLINQ really help. In my opinion the main advantage of these technologies is in places in the application where you really do have a lot of long running blocking processes. For example if you have a situation where you call out several times to an external web service - it can be a significant advantage to make these calls in parallel. I would strongly question whether writing to a local database - or even a database on a different box on a local network would count as being a long running blocking process to the extent that this kind of parallelisation is of any significant value.
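For instance, a sketch of that external-call case with the TPL, where FetchRates, FetchWeather and FetchNews are hypothetical blocking web service calls:

    using System.Threading.Tasks;

    class ParallelCalls
    {
        static void Main()
        {
            string rates = null, weather = null, news = null;

            // In sequence the total time is the sum of the three calls;
            // in parallel it approaches the slowest single call.
            Parallel.Invoke(
                () => rates = FetchRates(),
                () => weather = FetchWeather(),
                () => news = FetchNews());
        }

        static string FetchRates() { return "..."; }
        static string FetchWeather() { return "..."; }
        static string FetchNews() { return "..."; }
    }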
Pretty much all the examples you list fall into the category of getting the PC to do something in parallel that it was previously doing in sequence. How many CPUs are on your server? How many are really free when the website is under load? Making something parallel does not necessarily equate to making it faster unless the process involved has some measure of time when your PC is sitting around doing nothing, waiting for an external event.
First question is to ask the users / testers which bits seem slow. The only way to know for sure what's slowing you down is to use a profiler like dotTrace. The results are sometimes surprising.
If you do find something, parallel processing may not be the answer. You need to remember that there is an overhead in splitting tasks up, so if the task is fairly quick in the first place, it could end up being slower. You also have to consider the added complexity, e.g. what happens if half a task succeeds, and half fails? (Although TPL and PLINQ hide you from this to an extent.)
Have fun, but I wonder whether this is a case of 1) a solution chasing a problem, and 2) premature optimization.
I have an app that moves a project and its files from preview to production using a Flex front-end and a .NET web service. Currently, the process takes about 5-10 minutes per project. Aside from latency concerns, it really shouldn't take that long. I'm wondering whether or not this is a good use case for multi-threading. Also, considering the user may want to push multiple projects or one right after another, is there a way to queue the jobs?
Any suggestions and examples are greatly appreciated.
Thanks!
Something that does heavy disk IO typically isn't a good candidate for multithreading since the disks can really only do one thing at a time. However, if you're pushing to multiple servers or the servers have particularly good disk subsystems some light threading may be beneficial.
As a note - regardless of whether or not you decide to queue the jobs, you will use multi-threading. Queueing is just one way of handling what is ultimately solved using multi-threading.
And yes, I'd recommend you build a queue to push out each project.
You should compare the speed of your code to just copying in Windows (i.e., Explorer or the command line) vs copying with something advanced like TeraCopy. If your code is significantly slower than Windows, then look at parts of your code to optimize using a profiler. If your code is about as fast as Windows but slower than TeraCopy, then multithreading could help.
Multithreading is not generally helpful when the operation is I/O bound, but copying files involves reading from the disk AND writing over the network. These are two I/O operations, so if you separate them onto different threads, it could increase performance. For something like this you need a producer/consumer setup where you have a circular queue with one thread reading from disk and writing to the queue, and another thread reading from the queue and writing to the network. It'll be important to keep in mind that the two threads will not run at the same speed, so if the queue gets full, wait before writing more data; if it's empty, wait before reading more. Also, the locking strategy could have a big impact on performance here and could cause performance to degrade to slower than a single-threaded implementation.
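A sketch of that producer/consumer split using a bounded BlockingCollection; WriteToNetwork is a hypothetical stand-in for the upload side, and the bound is what makes whichever side is faster wait for the other:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class PipelinedCopy
    {
        static void Copy(string path)
        {
            var queue = new BlockingCollection<byte[]>(boundedCapacity: 16);

            Task reader = Task.Factory.StartNew(() =>
            {
                using (FileStream stream = File.OpenRead(path))
                {
                    byte[] buffer = new byte[64 * 1024];
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        byte[] chunk = new byte[read];
                        Array.Copy(buffer, chunk, read);
                        queue.Add(chunk);   // blocks if the consumer has fallen behind
                    }
                }
                queue.CompleteAdding();
            });

            foreach (byte[] chunk in queue.GetConsumingEnumerable())
                WriteToNetwork(chunk);      // blocks if the producer has fallen behind

            reader.Wait();
        }

        static void WriteToNetwork(byte[] chunk) { /* hypothetical upload */ }
    }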
If you're moving things between just two computers, the network is going to be the bottleneck, so you may want to queue these operations.
Likewise, on the same machine, the I/O is going to be the bottleneck, so you'd want to queue there, too.
You should try using the ThreadPool.
ThreadPool.QueueUserWorkItem(MoveProject, project);
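For context, QueueUserWorkItem expects a WaitCallback, so the MoveProject method above would need a shape something like this (the Project type is hypothetical):

    static void MoveProject(object state)
    {
        Project project = (Project)state;
        // copy the project's files from preview to production here
    }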
Agreed with everyone over the limited performance of running the tasks in parallel.
If you have full control over your deployment environment, you could use Rhino Queues:
http://ayende.com/Blog/archive/2008/08/01/Rhino-Queues.aspx
This will allow you to produce a queue of jobs asynchronously (say from a WCF service being called from your Silverlight/Flex app) and consume them synchronously.
Alternatively you could use WCF and MSMQ, but the learning curve is greater.
When dealing with multiple files, using multiple threads usually IS a good idea in terms of performance. The main reason is that most disks nowadays support Native Command Queuing.
I wrote an article recently about reading/writing files with multiple threads on ddj.com.
See http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=220300055.
Also see related question
Will using multiple threads with a RandomAccessFile help performance?
In particular, my experience has been that when dealing with very many files it IS a good idea to use a number of threads. In fact, using many threads in many cases does not slow down applications as much as commonly expected.
Having said that, I'd say there is no other way to find out than trying all the different approaches. It depends on very many conditions: hardware, OS, drivers, etc.
The very first thing you should do is point any kind of profiling tool towards your software. If you can't do that (like, if you haven't got such a tool), insert logging code.
The very first thing you need to do is figure out what is taking a long time to complete, and then why it is taking a long time to complete. That your "copy" operation as a whole takes a long time to complete isn't good enough; you need to pinpoint the reason for this down to a method or a set of methods.
Until you do that, all the other things you can do to your code will likely be guesswork. My experience has taught me that when it comes to performance, 9 out of 10 reasons for things running slow comes as surprises to the guy(s) that wrote the code.
So measure first, then change.
For instance, you might discover that you're in fact reporting progress of copying the file on a byte-by-byte basis, to a GUI, using a synchronous call to the UI, in which case it wouldn't matter how fast the actual copying can run; you'll still be bound by message handling speed.
But that's just conjecture until you know, so measure first, then change.
At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contain a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (ie, they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's an example of the behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen immediately preceding the timer callback, within the timer callback, or immediately following the timer callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse; average usage seems to be going up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service. That would let me rule out the timer as a problem (which I absolutely intend to try), but it can't help me rule out NHibernate as the issue, because if NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in the question -- that your problems would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shutdown and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy Nant script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration?)
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
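A sketch of what that scoping might look like, assuming the service holds an ISessionFactory in a sessionFactory field and drives its work from a System.Timers.Timer:

    private void OnTimerElapsed(object sender, System.Timers.ElapsedEventArgs e)
    {
        using (NHibernate.ISession session = sessionFactory.OpenSession())
        {
            DoWork(session);   // hypothetical: all data access for this iteration
            session.Flush();   // push changes; Dispose() then releases the object map
        }
    }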
Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last ditch scenario - but it is possible to construct a rootkit that hides from TaskManager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)
It's obviously pretty difficult to remotely debug your unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that you're using up most of the RAM and causing the disk to thrash, or some similar problem.
Best of luck!
I suggest hacking the problem into pieces.
First, find a way to reproduce the problem 100% of the time and quickly. Lower the timer so that the services fire up more frequently (for example, 10 times quicker than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to real time or to the real work done by the services. And you will be able to do the next steps quicker than once a day.
Second, comment out all the real work code, leaving only the services, the timers and the synchronization mechanism. If the problem still shows up, then it will be in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue, it turned out a 3rd-party OSS library we were using stored handles to unmanaged objects in static memory. The longer the service ran, the more handles the service picked up, which caused the process' CPU performance to nose-dive very quickly. The way to try and resolve this sort of issue is to ensure your services store nothing in memory between the timer invocations, although if your 3rd-party libraries use static memory you might have to do something clever, like create an app domain for the timer invocation and ditch the app domain (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
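A sketch of that kind of guard, so a timer tick is skipped rather than stacking up behind one that is still running; DoProcessing stands in for the real work:

    private readonly object _sync = new object();

    private void OnTimer(object state)
    {
        if (!System.Threading.Monitor.TryEnter(_sync))
            return;   // previous tick still running; skip instead of queuing up

        try
        {
            DoProcessing();
        }
        finally
        {
            System.Threading.Monitor.Exit(_sync);
        }
    }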
Hope this helps, but apologies up front if both my thoughts are red herrings.
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
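A sketch of that refactoring, kept .NET 1.1-friendly since that's what these services run on; FetchWork and Process are hypothetical stand-ins for the service's real logic:

    private Thread worker;
    private volatile bool running = true;

    public void StartWorker()
    {
        worker = new Thread(new ThreadStart(WorkLoop));
        worker.IsBackground = true;
        worker.Start();
    }

    private void WorkLoop()
    {
        while (running)
        {
            foreach (object item in FetchWork())          // hypothetical work source
                ThreadPool.QueueUserWorkItem(new WaitCallback(Process), item);

            Thread.Sleep(60000);   // controls how often we look for work
        }
    }

    private void Process(object item) { /* work for one item, then discard state */ }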
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do separate things, but they're working in unison; you're just using the filesystem, database, etc. as a substitute for method calls. Also, 2003? I feel bad for you.
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section start and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.