SQL Performance, .Net Optimizations vs Best Practices - c#

I need confirmation/explanation from you pros/gurus on the following because my team is telling me "it doesn't matter" and it's frustrating me :)
Background: We have a SQL Server 2008 instance that is used by our main MVC3 / .Net4 web app. We have 200+ concurrent users at any given point. The server is being hit EXTREMELY hard (locks, timeouts, overall slowness) and I'm trying to apply things I learned throughout my career and at my last MS certification class. They are things we've all been drilled on ("close SQL connections STAT") and I'm trying to explain to my team that these "little things", though no single one makes a difference alone, add up in the end.
I need to know if the following do have a performance impact or if it's just 'best practice'
1. Using "USING" keyword.
Most of their code is like this:
public string SomeMethod(string x, string y) {
    SomethingDataContext dc = new SomethingDataContext();
    // renamed from "x", which clashed with the parameter of the same name
    var result = dc.StoredProcedure(x, y);
}
While I'm trying to tell them that USING closes/frees up resources faster:
using (SomethingDataContext dc = new SomethingDataContext()) {
    // renamed from "x", which clashed with the parameter of the same name
    var result = dc.StoredProcedure(x, y);
}
Their argument is that the GC does a good enough job cleaning up after the code is done executing, so USING doesn't have a huge impact. True or false and why?
2. Connection Pools
I always heard setting up connection pools can significantly speed up any website (at least .Net w/ MSSQL).
I recommended we add the following to our connectionstrings in the web.config:
..."Pooling=True;Min Pool Size=3;Max Pool Size=100;Connection Timeout=10;"...
Their argument is that .Net/MSSQL already sets up the connection pools behind the scenes and is not necessary to put in our web.config. True or false? Why does every other site say pooling should be added for optimal performance if it's already setup?
3. Minimize # of calls to DB
The Role/Membership provider that comes with the default .Net MVC project is nice - it's handy and does most of the legwork for you. But these guys are making serious use of UsersInRoles() and use it freely like a global variable (it hits the DB every time this method is called).
I created a "user object" that loads all the roles upfront on every pageload (along with some other user stuff, such as GUIDs, etc) and then I query this object to check whether the user has the role.
Other parts of the website have FOR statements that loop over 200 times and do 20-30 sql queries on every pass = over 4,000 database calls. It somehow does this in a matter of seconds, but what I want to do is consolidate the 20-30 DB calls into one, so that it makes ONE call 200 times (each loop).
But because SQL profiler says the query took "0 seconds", their argument is it's so fast and small that the servers can handle this high number of DB queries.
My thinking is "yeah, these queries are running fast, but they're killing the overall SQL server's performance."
Could this be a contributing factor? Am I worrying about nothing, or is this a (significant) contributing factor to the server's overall performance issues?
4. Other code optimizations
The first one that comes to mind is using StringBuilder vs a simple string variable. I understand why I should use StringBuilder (especially in loops), but they say it doesn't matter - even if they need to write 10k+ lines, their argument is that the performance gain doesn't matter.
So all-in-all, are all the things we learn and have drilled into us ("minimize scope!") just 'best practice' with no real performance gain or do they all contribute to a REAL/measurable performance loss?
EDIT***
Thanks guys for all your answers! I have a new (5th) question based on your answers:
They in fact do not use "USING", so what does that mean is happening? If there is connection pooling happening automatically, is it tying up connections from the pool until the GC comes around? Is it possible each open connection to the SQL server is adding a little more burden to the server and slowing it down?
Based on your suggestions, I plan on doing some serious benchmarking/logging of connection times because I suspect that a) the server is slow, b) they aren't closing connections and c) since Profiler says the queries ran in 0 seconds, the slowness might be coming from the connection.
I really appreciate your help guys. Thanks again

Branch the code, make your changes & benchmark+profile it against the current codebase. Then you'll have some proof to back up your claims.
As for your questions, here goes:
1. You should always explicitly dispose of objects that implement IDisposable. The GC never calls Dispose itself. If the class also implements a finalizer, the GC will eventually run that, but in most implementations finalizers only clean up unmanaged resources, and you cannot predict when they run.
2. It's true that the .NET Framework already does connection pooling. I'm not sure what the defaults are, but the connection string values are only there to let you alter them.
3. The execution time of the SQL statement is only part of the story. In SQL profiler all you see is how long the database engine took to execute the query. What you're missing is the time it takes the web server to connect to the database server and receive the results from it. So while each query may be quick, you can save a lot of IO and network latency by batching queries.
4. This is a good one to profile, to prove the extra memory used by concatenation compared with string builders.
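
To make point 3 concrete, here is a minimal sketch of the "load the roles once" idea from the question. CachedUser, FakeLoadRoles and DbCalls are hypothetical names made up for this illustration; the delegate stands in for whatever single query or stored procedure would return all of a user's roles, and the fake data source only counts how often it is hit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-request user object; loadRoles stands in for one
// query that returns every role of the user.
sealed class CachedUser
{
    private readonly HashSet<string> _roles;

    public CachedUser(string userName, Func<string, IEnumerable<string>> loadRoles)
    {
        // One round trip up front instead of one per role check.
        _roles = new HashSet<string>(loadRoles(userName), StringComparer.OrdinalIgnoreCase);
    }

    public bool IsInRole(string role)
    {
        return _roles.Contains(role);
    }
}

static class RoleDemo
{
    // Counts simulated database round trips.
    public static int DbCalls;

    // Fake data source (an assumption for the demo, not a real provider).
    public static IEnumerable<string> FakeLoadRoles(string userName)
    {
        DbCalls++;
        return new[] { "Admin", "Editor" };
    }

    static void Main()
    {
        var user = new CachedUser("alice", FakeLoadRoles);
        int matches = 0;
        for (int i = 0; i < 200; i++)
        {
            if (user.IsInRole("admin")) matches++;
        }
        Console.WriteLine("checks: 200, matches: " + matches + ", data-source calls: " + DbCalls);
    }
}
```

The 200 role checks are answered from memory after a single simulated round trip. Whether that saving matters on your server is something only a measurement against the real database can show.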

Oye. For sure, you can't let GC close your database connections for you. GC might not happen for a LONG time...sometimes hours later. It doesn't happen right away as soon as a variable goes out of scope. Most people use the IDisposable using() { } syntax, which is great, but at the very least something, somewhere needs to be calling connection.Close()
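
A small sketch of the difference, using a made-up FakeConnection class (not a real ADO.NET type) that simply counts how many instances are still open. It has no finalizer, so nothing ever releases an instance unless Dispose is called.

```csharp
using System;

// Hypothetical stand-in for a connection-like resource.
sealed class FakeConnection : IDisposable
{
    // Number of instances that have been opened and not yet disposed.
    public static int OpenCount;

    public FakeConnection() { OpenCount++; }

    // Releases the resource immediately.
    public void Dispose() { OpenCount--; }
}

static class DisposalDemo
{
    // Without using: nothing releases the resource when the method returns.
    public static void Leak()
    {
        var c = new FakeConnection();   // never disposed
    }

    // With using: Dispose runs when the block exits, even on an exception.
    public static void UseAndRelease()
    {
        using (var c = new FakeConnection())
        {
            // work with c here
        }
    }

    static void Main()
    {
        for (int i = 0; i < 5; i++) UseAndRelease();
        Console.WriteLine("open after using:   " + FakeConnection.OpenCount);
        for (int i = 0; i < 5; i++) Leak();
        Console.WriteLine("open without using: " + FakeConnection.OpenCount);
    }
}
```

The first count is 0 and the second is 5. With a real connection a finalizer might eventually clean up the leaked ones, but only whenever the GC gets around to it.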

1. Objects that implement IDisposable and hold on to unmanaged resources typically also implement a finalizer that ensures cleanup happens during GC. The problem is when it is called: the GC can take a long time to get to it, and you might need those resources before then. using calls Dispose as soon as you are done with the object.
2. You can modify the pooling parameters in the web.config, but pooling is on by default now, so if you keep the default parameters you are not gaining anything.
3. You have to think not only about how long the query takes to execute but also about the connection time between the application server and the database. Even if they are on the same computer, it adds overhead.
4. StringBuilder won't affect performance in most web applications. It only becomes important if you are concatenating to the same string too many times, but I think it's a good idea to use it since it's easier to read.
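
If you want numbers for point 4 rather than opinions, a rough timing sketch along these lines is enough to start the conversation. The 10,000 line count is taken from the question; everything else (names, line format) is made up, and Stopwatch timings like this are only indicative, not a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text;

static class ConcatDemo
{
    // Plain concatenation: every pass copies the whole string built so far.
    public static string WithConcat(int lines)
    {
        string s = "";
        for (int i = 0; i < lines; i++)
        {
            s += "line " + i + "\n";
        }
        return s;
    }

    // StringBuilder appends into a growing buffer instead.
    public static string WithBuilder(int lines)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < lines; i++)
        {
            sb.Append("line ").Append(i).Append('\n');
        }
        return sb.ToString();
    }

    static void Main()
    {
        const int lines = 10000;

        var sw = Stopwatch.StartNew();
        string a = WithConcat(lines);
        long concatMs = sw.ElapsedMilliseconds;

        sw.Restart();
        string b = WithBuilder(lines);
        long builderMs = sw.ElapsedMilliseconds;

        Console.WriteLine("same output: " + (a == b));
        Console.WriteLine("concat: " + concatMs + " ms, builder: " + builderMs + " ms");
    }
}
```

Both methods produce the same text, so any difference you see is purely the cost of how it was built. Run it on the actual server and with realistic sizes before drawing conclusions.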

I think that you have two separate issues here:
1. Performance of your code
2. Performance of the SQL Server database
SQL Server
Do you have any monitoring in place for SQL Server? Do you know specifically what queries are being run that cause the deadlocks?
I would read this article on deadlocks and consider installing the brilliant Who is active to find out what is really going on in your SQL Server. You might also consider installing sp_Blitz by Brent Ozar. This should give you an excellent idea of what is going on in your database and give you the tools to fix that problem first.
Other code issues
I can't really comment on the other code issues off the top of my head. So I would look at SQL server first.
Remember:
1. Monitor
2. Identify Problems
3. Profile
4. Fix
5. Go to 1

Well, I'm not a guru, but I do have a suggestion: if they say you're wrong, tell them, "Prove it! Write me a test! Show me that 4000 calls are just as fast as 200 calls and have the same impact on the server!"
Ditto the other things. If you're not in a position to make them prove you right, prove them wrong, with clear, well-documented tests that show that what you're saying is right.
If they're not open even to hard evidence, gathered from their own server, with code they can look at and inspect, then you may be wasting your time on that team.

At the risk of just repeating what others here have said, here's my 2c on the matter
Firstly, you should pick your battles carefully...I wouldn't go to war with your colleagues on all 4 points because as soon as you fail to prove one of them, it's over, and from their perspective they're right and you're wrong.
Also bear in mind that no-one likes to be told their beautiful code is an ugly baby, so I assume you'll be diplomatic - don't say "this is slow", say "I found a way to make this even faster"... (of course your team could be perfectly reasonable, so I'm basing that on my own experience as well :) So you need to pick one of the 4 areas above to tackle first.
My money is on #3.
1, 2 and 4 can make a difference, but in my own experience, not that much - but what you described in #3 sounds like death by a thousand papercuts for the poor old server! The queries probably execute fast because they're parameterised so they're cached, but you need to bear in mind that "0 seconds" in the profiler could be 900 milliseconds, if you see what I mean...add that up for many and things start getting slow; this could also be a primary source of the locks because if each of these nested queries is hitting the same table over and over, no matter how fast it runs, with the number of users you mentioned, it's certain you will have contention.
Grab the SQL and run it in SSMS but include Client Statistics so you can see not only the execution time but also the amount of data being sent back to the client; that will give you a clearer picture of what sort of overhead is involved.
Really the only way you can prove any of this is to setup a test and measure as others have mentioned, but also be certain to also run some profiling on the server as well - locks, IO queues, etc, so that you can show that not only is your way faster, but that it places less load on the server.
To touch on your 5th question - I'm not sure, but I would guess that any SqlConnection that's not auto-disposed (via using) is counted as still "active" and is not available from the pool any more. That being said - the connection overhead is pretty low on the server unless the connection is actually doing anything - but you can again prove this by using the SQL Performance counters.
Best of luck with it, can't wait to find out how you get on.

I recently was dealing with a bug in the interaction between our web application and our email provider. When an email was sent, a protocol error occurred. But not right away.
I was able to determine that the error only occurred when the SmtpClient instance was closed, which was occurring when the SmtpClient was disposed, which was only happening during garbage collection.
And I noticed that this often took two minutes after the "Send" button was clicked...
Needless to say, the code now properly implements using blocks for both the SmtpClient and MailMessage instances.
Just a word to the wise...

1 has been addressed well above (I agree with it disposing nicely, however, and have found it to be a good practice).
2 is a bit of a hold-over from previous versions of ODBC wherein SQL Server connections were configured independently with regards to pooling. It used to be non-default; now it's default.
As to 3 and 4, 4 isn't going to affect your SQL Server's performance - StringBuilder might help speed the process within the UI, certainly, which may have the effect of closing off your SQL resources faster, but they won't reduce the load on the SQL Server.
3 sounds like the most logical place to concentrate, to me. I try to close off my database connections as quickly as possible, and to make the fewest calls possible. If you're using LINQ, materialize the results into a list, array, whatever (an IQueryable on its own is still lazy and will hit the database when you enumerate it) so that you can manipulate them & build whatever UI structures you need, while releasing the connection prior to any of that hokum.
All of that said, it sounds like you need to spend some more quality time with the profiler. Rather than looking at the amount of time each execution took, look at the processor & memory usage. Just because they're fast doesn't mean they're not "hungry" executions.

The using clause is just syntactic sugar, you are essentially doing
try
{
    resource.DoStuff();
}
finally
{
    resource.Dispose();
}
Dispose is probably going to get called anyway when the object is garbage collected, but only if the framework programmers did a good job of implementing the disposable pattern. So the arguments against your colleagues here are:
i) if we get into the habit of utilizing using, we make sure to free unmanaged resources, because not all framework programmers are smart enough to implement the disposable pattern.
ii) yes, the GC will eventually clean up that object, but it may take a while, depending on which generation the object has reached. Gen 2 collections run comparatively rarely.
So in short:
1. see above
2. yes, pooling is set to true by default and max pool size defaults to 100
3. you are correct; this is definitely the best area to push on for improvements.
4. premature optimization is the root of all evil. Get #1 and #3 in first. Use SQL profiler and db-specific methods (add indexes, defragment them, monitor deadlocks etc.).
5. yes, could be. The best way is to measure it - look at the perf counter SQLServer: General Statistics – User Connections; here is an article describing how to do it.
Always measure your improvements, don't change the code without evidence!

Related

Desktop C# SQL Server (LocalDB) database access patterns

I'm coming from a Native C++ / PHP / MySQL / SQLite background.
Spending this weekend learning C# / WinForms / SQL Server / ASP.NET. And it all seems to work differently. Especially considering I no longer know exactly what happens under the hood, where I can optimize things and so on.
Needing to work with SQL Server (LocalDB) I think I noticed a weird database access pattern in most of the online examples I read + video tutorials (I got 2 books from Amazon but they arrive next week so currently, to my shame, learning basics online).
Every time they access the Database in those examples, they open and close a SqlConnection for each query.
using(var sql = new SqlConnection())
{
sql.Open();
// do Sql stuff here
}
For a C++ guy, this is making me very nervous:
What's the overhead of open/close connections all the time when I need to do a query?
Why not open an object and reuse it when required?
Can anyone tell me if this a performance-friendly DB access pattern in Desktop C# or go with Plan B? The end-result will be a C# Windows Service featuring an IOCP Server (which I figured out already) that should deal with up to 1,000 connections. It won't be very data intensive. But even with 100 clients, Sql Open/Close operations overhead, if any, can add up quickly.
I also noticed MultipleActiveResultSets=True; that should make this especially friendly for multiple-reads. So, I would imagine a single connection for the entire application's read-access & short-write with MARS should do the trick?! And dedicated connections for larger INSERT/UPDATE.
Plan B: I've initially thought about creating a connection pool for short reading / writing operations. And another one for longer read/write operations. And looping through it myself... Or maybe one connection per client but I'm not sure that won't be quite abusive.
Actually, there is very little performance issue here, and the small amount of overhead is made up for by a huge increase in maintainability.
First, SqlConnection uses ADO.NET connection pooling by default. So connections are not actually opened and closed to the server. Instead, internally ADO.NET has a pool of connections that it reuses when appropriate, grouping them by ConnectionString. It's very good at managing these resources, so long as you are good about cleaning up your objects when you are done with them.
This is part of what makes this work. By closing the connection, you are telling the connection pool that the connection can be reused by a different SqlConnection, so in effect, what you view as a performance problem is actually a performance optimization.
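
To see why closing promptly helps rather than hurts, here is a deliberately simplified toy model of a pool. It is an illustration only and makes no claim about how ADO.NET implements pooling internally; ToyPool, Lease and PhysicalCreated are invented names, and a plain object stands in for an expensive physical connection.

```csharp
using System;
using System.Collections.Generic;

// Toy model for illustration only; the real ADO.NET pool is far more involved.
sealed class ToyPool
{
    private readonly Stack<object> _idle = new Stack<object>();

    // Counts how many "physical" connections had to be created.
    public int PhysicalCreated;

    public Lease Open()
    {
        object physical;
        if (_idle.Count > 0)
        {
            physical = _idle.Pop();      // reuse an idle connection
        }
        else
        {
            physical = new object();     // simulate an expensive real connect
            PhysicalCreated++;
        }
        return new Lease(this, physical);
    }

    public sealed class Lease : IDisposable
    {
        private readonly ToyPool _pool;
        private object _physical;

        internal Lease(ToyPool pool, object physical)
        {
            _pool = pool;
            _physical = physical;
        }

        // "Closing" hands the physical connection back to the pool.
        public void Dispose()
        {
            if (_physical == null) return;
            _pool._idle.Push(_physical);
            _physical = null;
        }
    }
}

static class PoolDemo
{
    // Opens a lease per query and optionally disposes it right away.
    public static int Run(int queries, bool dispose)
    {
        var pool = new ToyPool();
        for (int i = 0; i < queries; i++)
        {
            ToyPool.Lease lease = pool.Open();
            if (dispose) lease.Dispose();
        }
        return pool.PhysicalCreated;
    }

    static void Main()
    {
        Console.WriteLine("closed each time: " + Run(1000, true));
        Console.WriteLine("never closed:     " + Run(1000, false));
    }
}
```

In this model, 1000 open/close cycles need one physical connection, while 1000 opens that are never closed need 1000. That is the sense in which the open-late, close-early pattern is the optimization.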
Coming from native programming, the first thing you have to learn about writing code in a managed world is that you HAVE to release your resources, because otherwise the garbage collector won't be able to efficiently clean them up. I know your first impulse is to try and manage the lifetimes yourself, but you should only do this when it is absolutely necessary.
The second thing to learn is to stop "getting nervous" about things you view as potential performance issues... Only optimize when you KNOW them to be a problem (i.e., you have used a profiler and found that the normal way isn't as efficient as you would like it to be).
As always, read the documentation:
http://msdn.microsoft.com/en-us/library/8xx3tyca(v=vs.110).aspx

MsgSetRequest - Can I throw too much data at it?

I'm working on a C# library project that will process transactions between SQL and QuickBooks Enterprise, keeping both data stores in sync. This is great and all, but the initial sync is going to be a fairly large set of transactions. Once the initial sync is complete, transactions will sync as needed for the remainder of the life of the product.
At this point, I'm fairly familiar with the SDK using QBFC, as well as all of the various resources and sample code available via the OSR, the ZOMBIE project by Paul Keister (thanks, Paul!) and others. All of these resources have been a huge help. But one thing I haven't come across yet is whether there is a limit or substantial or deadly performance cost associated with large amounts of data via a single Message Set Request. As I understand it, the database on QuickBooks' end is just a SQL database as well, but I don't want to make any assumptions.
Again, I just need to hit this hard once, so I don't want to engineer a separate solution to do the import. This also affords me an opportunity to test a copy of live data against my library, logs and all.
For what it's worth, this is my first ever post on Stack, so feel free to educate me on posting here if I've steered off course in any way. Thanks.
For what it's worth, I found that in a network environment (as opposed to everything happening on 1 box) it's better to have a larger MsgSetRequest as opposed to a smaller one. Of course everything has its limits, and maybe I just never hit it. I don't remember exactly how big the request set was, but it was big. The performance improvement was easily 10 to 1 or better.
If I was you, I'd build some kind of iteration into my design from the beginning (to iterate through your SQL data set). Start with a big number that will do it all at once, and if that breaks just scale it back until you find something that works.
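
The "start big and scale back" idea could be sketched roughly like this. BatchSender and the send delegate are hypothetical; in a real integration the delegate would build and submit one message set request. The sketch assumes that a batch which throws had no partial effect, which you would need to verify for your own setup before retrying anything.

```csharp
using System;
using System.Collections.Generic;

static class BatchSender
{
    // Sends items in batches. When a batch throws, the batch size is halved
    // and the same items are retried. Returns the batch size in use at the end.
    // Assumption: a failed send applied nothing, so retrying is safe.
    public static int SendAll<T>(IList<T> items, int initialBatchSize, Action<IList<T>> send)
    {
        if (initialBatchSize < 1) throw new ArgumentOutOfRangeException("initialBatchSize");

        int size = initialBatchSize;
        int index = 0;
        while (index < items.Count)
        {
            int count = Math.Min(size, items.Count - index);
            var batch = new List<T>(count);
            for (int i = 0; i < count; i++) batch.Add(items[index + i]);

            try
            {
                send(batch);
                index += count;          // batch accepted, move on
            }
            catch (Exception)
            {
                if (size == 1) throw;    // cannot shrink further; surface the error
                size = Math.Max(1, size / 2);
            }
        }
        return size;
    }
}

static class BatchDemo
{
    static void Main()
    {
        var items = new List<int>();
        for (int i = 0; i < 100; i++) items.Add(i);

        int sent = 0;
        // Pretend the receiver rejects anything larger than 30 items.
        int finalSize = BatchSender.SendAll(items, 100, batch =>
        {
            if (batch.Count > 30) throw new InvalidOperationException("request too large");
            sent += batch.Count;
        });

        Console.WriteLine("final batch size: " + finalSize + ", items sent: " + sent);
    }
}
```

With the made-up limit of 30, the size drops from 100 to 50 to 25 and all 100 items go out in four batches. Halving is just one simple choice; any back-off rule would do.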
I know this answer doesn't have the detail you're looking for, but hopefully it will help.

Trying to optimize I/O for MongoDB

I have an updater script that runs every few hours for various regions on a gaming server. I am looking to run this script more frequently and add more regions. Ideally I would love to spread the load of the CPU and I/O as evenly as possible. I used to run this script using mysql, but now the website uses mongodb for everything, so it kinda made sense to move the updater scripts to mongodb too. I am having really high I/O spikes when mongodb flushes all of the updates to the database.
The script is written in C#, although I don't think that's too relevant. More important is that we are doing about 500k to 1.2 million updates each time one of these scripts runs. We have done some small optimizations in the code and with indexes, but at this point we are stuck on how to optimize the actual mongodb settings.
Some other important information is that we do something like this
update({'someIdentifier':1}, $newDocument)
instead of this:
$set : { internalName : 'newName' }
Not sure if this is a lot worse in performance than doing $set or not.
What can we do to try and spread the load out? I can assign more memory to the VM if that will help as well.
I am happy to provide more information.
Here are my thoughts:
1) Properly explain your performance concerns.
So far I can't really figure out what the issue is or if you have one at all. As far as I can tell you're doing around a GB of updates and are writing about a GB of data to the disk... not much of a shock.
Oh and do some damn testing. "Not sure if this is a lot worse in performance than doing $set or not." Why don't you know? What do your tests say?
2) Check to see if there is any hardware mismatch.
Is your disk just slow? Is your working set bigger than RAM?
3) Ask on mongo-user and other MongoDB specific communities...
...simply because you might get a better answer there than the lack of answers here.
4) Consider trying TokuMX.
Wait what? Didn't I just accuse the last guy of basically spamming his own product?
Sure, it's a new product that has only very recently been introduced for Mongo (it appears to have had a mysql version for a bit longer), but the fundamentals seem sound. In particular it's very good at making not only insertions fast, but also updates/deletions. It does this by not needing to actually go and make the changes to the document in question; instead it stores the insertion/update/deletion message in a buffered queue as part of the index structure. As the buffer fills up it applies these changes in bulk, which is massively more efficient in terms of I/O. On top of that, it uses compression in storing data, which should additionally reduce I/O - there's physically less to write.
The biggest disadvantage I can see so far is that its best performance is seen with big data - if your data fits into RAM then it loses to BTrees in a bunch of tests. Still fast, but not as fast.
So yeah, it's very new and I would not trust it for anything without testing, and even then only for non-mission-critical stuff, but it might be what you're looking for. And TBH, as it's just a new index/store sub-system... it fits the bill of being an optimisation for mongodb rather than a separate product. Especially since index/storage systems in mongodb have always been a bit simple - 'let's use memory-mapped files for caching' etc.

How many DataTable objects should I use in my C# app?

I'm an experienced programmer in a legacy (yet object oriented) development tool and making the switch to C#/.Net. I'm writing a small single user app using SQL server CE 3.5. I've read the conceptual DataSet and related doc and my code works.
Now I want to make sure that I'm doing it "right", get some feedback from experienced .Net/SQL Server coders, the kind you don't get from reading the doc.
I've noticed that I have code like this in a few places:
var myTableDataTable = new MyDataSet.MyTableDataTable();
myTableTableAdapter.Fill(myTableDataTable);
... // other code
In a single user app, would you typically just do this once when the app starts, instantiate a DataTable object for each table and then store a ref to it so you ever just use that single object which is already filled with data? This way you would ever only read the data from the db once instead of potentially multiple times. Or is the overhead of this so small that it just doesn't matter (plus could be counterproductive with large tables)?
For CE, it's probably a non issue. If you were pushing this app to thousands of users and they were all hitting a centralized DB, you might want to spend some time on optimization. In a single-user instance DB like CE, unless you've got data that says you need to optimize, I wouldn't spend any time worrying about it. Premature optimization, etc.
The way to decide depends on 2 main things:
1. Is the data going to be accessed constantly?
2. Is there a lot of data?
If you are constantly using the data in the tables, then load them on first use.
If you only occasionally use the data, fill the table when you need it and then discard it.
For example, if you have 10 gui screens and only use myTableDataTable on 1 of them, read it in only on that screen.
The choice really doesn't depend on C# itself. It comes down to a balance between:
How often do you use the data in your code?
Does the data ever change (and do you care if it does)?
What's the relative (time) cost of getting the data again, compared to everything else your code does?
How much value do you put on performance, versus developer effort/time (for this particular application)?
As a general rule: for production applications, where the data doesn't change often, I would probably create the DataTable once and then hold onto the reference as you mention. I would also consider putting the data in a typed collection/list/dictionary, instead of the generic DataTable class, if nothing else because it's easier to let the compiler catch my typing mistakes.
For a simple utility you run for yourself that "starts, does its thing and ends", it's probably not worth the effort.
You are asking about Windows CE. In that particular case, I would most likely do the query only once and hold onto the results. Mobile OSs have extra constraints in batteries and space that desktop software doesn't have. Basically, a mobile OS makes bullet #4 much more important.
Every time you add another retrieval call from SQL, you make calls to external libraries more often, which means you are probably running longer, allocating and releasing more memory more often (which adds fragmentation), and possibly causing the database to be re-read from Flash memory. It's most likely a lot better to hold onto the data once you have it, assuming that you can (see bullet #2).
It's easier to figure out the answer to this question when you think about datasets as being a "session" of data. You fill the datasets; you work with them; and then you put the data back or discard it when you're done. So you need to ask questions like this:
How current does the data need to be? Do you always need to have the very very latest, or will the database not change that frequently?
What are you using the data for? If you're just using it for reports, then you can easily fill a dataset, run your report, then throw the dataset away, and next time just make a new one. That'll give you more current data anyway.
Just how much data are we talking about? You've said you're working with a relatively small dataset, so there's not a major memory impact if you load it all in memory and hold it there forever.
Since you say it's a single-user app without a lot of data, I think you're safe loading everything in at the beginning, using it in your datasets, and then updating on close.
The main thing you need to be concerned with in this scenario is: What if the app exits abnormally, due to a crash, power outage, etc.? Will the user lose all his work? But as it happens, datasets are extremely easy to serialize, so you can fairly easily implement a "save every so often" procedure to serialize the dataset contents to disk so the user won't lose a lot of work.
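
A minimal sketch of such a "save every so often" snapshot, using DataSet.WriteXml and DataSet.ReadXml. The helper names, the table layout and the file name are made up for the example; how often to call Save (a timer, after each edit, and so on) is left to the application. Note that this mode writes the current values only, not pending row states.

```csharp
using System;
using System.Data;
using System.IO;

static class DataSetSnapshot
{
    // Writes schema plus current data so the file can be reloaded on its own.
    public static void Save(DataSet ds, string path)
    {
        ds.WriteXml(path, XmlWriteMode.WriteSchema);
    }

    // Rebuilds a DataSet from a snapshot written by Save.
    public static DataSet Load(string path)
    {
        var ds = new DataSet();
        ds.ReadXml(path, XmlReadMode.ReadSchema);
        return ds;
    }
}

static class SnapshotDemo
{
    // Builds a tiny DataSet, snapshots it, reloads it, and returns one value.
    public static string RoundTrip()
    {
        var ds = new DataSet("AppData");
        DataTable notes = ds.Tables.Add("Notes");
        notes.Columns.Add("Id", typeof(int));
        notes.Columns.Add("Text", typeof(string));
        notes.Rows.Add(1, "unsaved work");

        // Hypothetical snapshot location for the demo.
        string path = Path.Combine(Path.GetTempPath(), "appdata-snapshot.xml");
        DataSetSnapshot.Save(ds, path);

        DataSet restored = DataSetSnapshot.Load(path);
        return (string)restored.Tables["Notes"].Rows[0]["Text"];
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip());
    }
}
```

On startup the app could check for a snapshot file and offer to restore it. A crash in the middle of a write could still leave a broken file, so writing to a temporary name and then replacing the old snapshot would be a sensible refinement.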

Windows Service Increasing CPU Consumption

At my job, I have a clutch of six Windows services that I am responsible for, written in C# 2003. Each of these services contains a timer that fires every minute or so, where the majority of their work happens.
My problem is that, as these services run, they start to consume more and more CPU time through each iteration of the loop, even if there is no meaningful work for them to do (ie, they're just idling, looking through the database for something to do). When they start up, each service uses an average of (about) 2-3% of 4 CPUs, which is fine. After 24 hours, each service will be consuming an entire processor for the duration of its loop's run.
Can anyone help? I'm at a loss as to what could be causing this. Our current solution is to restart the services once a day (they shut themselves down, then a script sees that they're offline and restarts them at about 3AM). But this is not a long term solution; my concern is that as the services get busier, restarting them once a day may not be sufficient... but as there's a significant startup penalty (they all use NHibernate for data access), as they get busier, exactly what we don't want to be doing is restarting them more frequently.
#akmad: True, it is very difficult.
Yes, a service run in isolation will show the same symptom over time.
No, it doesn't. We've looked at that. This can happen at 10AM or 6PM or in the middle of the night. There's no consistency.
We do; and they are. The services are doing exactly what they should be, and nothing else.
Unfortunately, that requires foreknowledge of exactly when the services are going to be maxing out CPUs, which happens on an unpredictable schedule, and never very quickly... which makes things doubly difficult, because my boss will run and restart them when they start having problems without thinking of debug issues.
No, they're using a fairly consistent amount of RAM (approx. 60-80MB each, out of 4GB on the machine).
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving. My boss' solution (which I emphatically don't want to implement) is to put a field in the database which holds multiple times for the services to restart during the day, so that he can make the problem go away and not think about it. I'm desperately seeking the cause of the real problem so that I can fix it, because that solution will become a disaster in about six months.
#Yaakov Ellis: They each have a different function. One reads records out of an Oracle database somewhere offsite; another one processes those records and transfers files belonging to those records over to our system; a third checks those files to make sure they're what we expect them to be; another is a maintenance service that constantly checks things like disk space (that we have enough) and polls other servers to make sure they're alive; one is running only to make sure all of these other ones are running and doing their jobs, monitors and reports errors, and restarts anything that's failed to keep the whole system going 24 hours a day.
So, if you're asking what I think you're asking, no, there isn't one common thing that all these services do (other than database access via NHibernate) that I can point to as a potential problem. Unfortunately, if that turns out to be the actual issue (which wouldn't surprise me greatly), the whole thing might be screwed -- and I'll end up rewriting all of them in simple SQL. I'm hoping it's a garbage collector problem or something easier to deal with than NHibernate.
#Joshdan: No secret. As I said, we've tried all the usual troubleshooting. Profiling was unhelpful: the profiler we use was unable to point to any code that was actually executing when the CPU usage was high. These services were torn apart about a month ago looking for this problem. Every section of code was analyzed to attempt to figure out if our code was the issue; I'm not here asking because I haven't done my homework. Were this a simple case of the services doing more work than anticipated, that's something that would have been caught.
The problem here is that, most of the time, the services are not doing anything at all, yet still manage to consume 25% or more of four CPU cores: they're finding no work to do, and exiting their loop and waiting for the next iteration. This should, quite literally, take almost no CPU time at all.
Here's an example of the behaviour we're seeing, on a service with no work to do for two days (in an unchanging environment). This was captured last week:
Day 1, 8AM: Avg. CPU usage approx 3%
Day 1, 6PM: Avg. CPU usage approx 8%
Day 2, 7AM: Avg. CPU usage approx 20%
Day 2, 11AM: Avg. CPU usage approx 30%
Having looked at all of the possible mundane reasons for this, I've asked this question here because I figured (rightly, as it turns out) that I'd get more innovative answers (like Ubiguchi's), or pointers to things I hadn't thought of (like Ian's suggestion).
So does the CPU spike happen immediately preceding the timer callback, within the timer callback, or immediately following the timer callback?
You misunderstand. This is not a spike. If it were, there would be no problem; I can deal with spikes. But it's not... the CPU usage is going up generally. Even when the service is doing nothing, waiting for the next timer hit. When the service starts up, things are nice and calm, and the graph looks like what you'd expect... generally, 0% usage, with spikes to 10% as NHibernate hits the database or the service does some trivial amount of work. But this increases to an across-the-board 25% (more if I let it go too far) usage at all times while the process is running.
That made Ian's suggestion the logical silver bullet (NHibernate does a lot of stuff when you're not looking). Alas, I've implemented his solution, but it hasn't had an effect (I have no proof of this, but I actually think it's made things worse... average usage seems to be going up much faster now). Note that stripping out the NHibernate "sections" (as you recommend) is not feasible, since that would strip out about 90% of the code in the service. Doing so would let me rule out the timer as a problem (which I absolutely intend to try), but it can't help me rule out NHibernate as the issue. If NHibernate is causing this, then the dodgy fix that's implemented (see below) is just going to have to become The Way The System Works; we are so dependent on NHibernate for this project that the PM simply won't accept that it's causing an unresolvable structural problem.
I just noted a sense of desperation in the question -- that your problems would continue barring a small miracle
Don't mean for it to come off that way. At the moment, the services are being restarted daily (with an option to input any number of hours of the day for them to shut down and restart), which patches the problem but cannot be a long-term solution once they go onto the production machine and start to become busy. The problems will not continue, whether I fix them or the PM maintains this constraint on them. Obviously, I would prefer to implement a real fix, but since the initial testing revealed no reason for this, and the services have already been extensively reviewed, the PM would rather just have them restart multiple times than spend any more time trying to fix them. That's entirely out of my control and makes the miracle you were talking about more important than it would otherwise be.
That is extremely intriguing (insofar as you trust your profiler).
I don't. But then, these are Windows services written in .NET 1.1 running on a Windows 2000 machine, deployed by a dodgy NAnt script, using an old version of NHibernate for database access. There's little on that machine I would actually say I trust.
You mentioned that you're using NHibernate - are you closing your NHibernate sessions at appropriate points (such as the end of each iteration?)
If not, then the size of the object map loaded into memory will be gradually increasing over time, and each session flush will take increasingly more CPU time.
Here's where I'd start:
Get Process Explorer and show %Time in JIT, %Time in GC, CPU Cycles Delta, CPU Time, CPU %, and Threads.
You'll also want kernel and user time, and a couple of representative stack traces but I think you have to hit Properties to get snapshots.
Compare before and after shots.
A couple of thoughts on possibilities:
excessive GC (% Time in GC going up. Also, Perfmon GC and CPU counters would correspond)
excessive threads and associated context switches (# of threads going up)
polling (stack traces are consistently caught in a single function)
excessive kernel time (kernel times are high - Task Manager shows large kernel time numbers when CPU is high)
exceptions (PE .NET tab Exceptions thrown is high and getting higher. There's also a Perfmon counter)
virus/rootkit (OK, this is a last-ditch scenario - but it is possible to construct a rootkit that hides from Task Manager. I'd suspect that you could then allocate your inevitable CPU usage to another process if you were cunning enough. Besides, if you've ruled out all of the above, I'm out of ideas right now)
It's obviously pretty difficult to remotely debug your unknown application... but here are some things I'd look at:
What happens when you only run one of the services at a time? Do you still see the slow-down? This may indicate that there is some contention between the services.
Does the problem always occur around the same time, regardless of how long the service has been running? This may indicate that something else (a backup, virus scan, etc) is causing the machine (or db) as a whole to slow down.
Do you have logging or some other mechanism to be sure that the service is only doing work as often as you think it should?
If you can see the performance degradation over a short time period, try running the service for a while and then attach a profiler to see exactly what is pegging the CPU.
You don't mention anything about memory usage. Do you have any of this information for the services? It's possible that you're using up most of the RAM and causing the disk to thrash, or some similar problem.
Best of luck!
I suggest hacking the problem into pieces.
First, find a way to reproduce the problem 100% of the time and quickly. Lower the timer interval so that the services fire more frequently (for example, 10 times more often than normal). If the problem arises 10 times quicker, then it's related to the number of iterations and not to real time or to real work done by the services. And you will be able to do the next steps quicker than once a day.
Second, comment out all the real work code, and leave only the services, the timers and the synchronization mechanism. If the problem still shows up, then it is in that part of the code.
If it doesn't, then start adding back the code you commented out, one piece at a time. Eventually, you should find out what part of the code is causing the problem.
'Fraid this answer is only going to suggest some directions for you to look in, but having seen similar problems in .NET Windows Services I have a couple of thoughts you might find helpful.
My first suggestion is that your services might have some bugs in either the way they handle memory, or perhaps in the way they handle unmanaged memory. The last time I tracked down a similar issue it turned out a 3rd party OSS library we were using stored handles to unmanaged objects in static memory. The longer the service ran the more handles the service picked up, which caused the process' CPU performance to nose-dive very quickly. The way to try and resolve this sort of issue is to ensure your services store nothing in memory in between the timer invocations, although if your 3rd party libraries use static memory you might have to do something clever like create an app domain for the timer invocation and ditch the app domain (and its static memory) once processing is complete.
The other issue I've seen in similar circumstances was with the timer synchronization code being suspect, which in effect allowed more than one thread to be running the processing code at once. When we debugged the code we found the 1st thread was blocking the 2nd, and by the time the 2nd kicked off there was a 3rd being blocked. Over time the blocking was lasting longer and longer and the CPU usage was therefore heading to the top. The solution we used to fix the issue was to implement proper synchronization code so the timer only kicked off another thread if it wouldn't be blocked.
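For illustration, here is a minimal sketch of the kind of guard described above: a timer callback that skips its run if the previous one is still in progress, instead of blocking behind it. The names GuardedPoller, PollWork and OnTimer are made up for this example and are not from the services in the question; the only framework call relied on is Interlocked.CompareExchange.

```csharp
using System;
using System.Threading;

// Hypothetical delegate for one iteration of a service's work.
public delegate void PollWork();

// Hypothetical wrapper; call OnTimer from the real timer callback.
public class GuardedPoller
{
    private int _running;    // 0 = idle, 1 = an iteration is in progress
    private int _skipped;    // iterations dropped because one was already running
    private int _completed;  // iterations that ran to completion

    public int Skipped { get { return _skipped; } }
    public int Completed { get { return _completed; } }

    public void OnTimer(PollWork work)
    {
        // Atomically claim the slot. If it was already taken, return at once
        // rather than queueing up behind the running iteration.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
        {
            Interlocked.Increment(ref _skipped);
            return;
        }

        try
        {
            work();
            Interlocked.Increment(ref _completed);
        }
        finally
        {
            // Release the slot even if the work throws.
            Interlocked.Exchange(ref _running, 0);
        }
    }
}
```

If the Skipped counter climbs steadily in a service that supposedly has nothing to do, that alone would suggest iterations are overlapping.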
Hope this helps, but apologies up front if both my thoughts are red herrings.
Sounds like a threading issue with the timer. You might have one unit of work blocking another running on different worker threads, causing them to stack up every time the timer fires. Or you might have instances living and working longer than you expect.
I'd suggest refactoring out the timer. Replace it with a single thread that queues up work on the ThreadPool. You can Sleep() the thread to control how often it looks for new work. Make sure this is the only place where your code is multithreaded. All other objects should be instantiated as work is readied for processing and destroyed after that work is completed. STATE IS THE ENEMY in multithreaded code.
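A rough sketch of that shape, assuming nothing about the real services: one dedicated thread looks for work, hands each item to the ThreadPool, and sleeps between checks. PollingLoop and FindWork are hypothetical names for this example; ThreadPool.QueueUserWorkItem, WaitCallback and Thread.Sleep are the standard framework calls.

```csharp
using System;
using System.Threading;

// Hypothetical delegate: returns the next item to process, or null if there is none.
public delegate object FindWork();

public class PollingLoop
{
    private readonly FindWork _find;
    private readonly WaitCallback _process;
    private readonly int _sleepMs;
    private volatile bool _stopRequested;
    private Thread _thread;

    public PollingLoop(FindWork find, WaitCallback process, int sleepMs)
    {
        _find = find;
        _process = process;
        _sleepMs = sleepMs;
    }

    public void Start()
    {
        _thread = new Thread(new ThreadStart(Run));
        _thread.IsBackground = true;
        _thread.Start();
    }

    public void Stop()
    {
        // The loop notices the flag after its current sleep ends.
        _stopRequested = true;
        _thread.Join();
    }

    private void Run()
    {
        while (!_stopRequested)
        {
            object item = _find();
            if (item != null)
            {
                // The item carries everything the worker needs; no shared state.
                ThreadPool.QueueUserWorkItem(_process, item);
            }
            Thread.Sleep(_sleepMs);
        }
    }
}
```

Note that this sketch does not stop two pool threads from processing different items at the same time; if the items touch shared resources, that still needs to be handled in the processing code.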
Another area where the design is lacking appears to be that you have multiple services that are polling resources to do something. I'd suggest unifying them under a single service. They might do separate things, but they're working in unison; you're just using the filesystem, database, etc as a substitution for method calls. Also, 2003? I feel bad for you.
Good suggestions, but rest assured, we have tried all of the usual troubleshooting. What I'm hoping is that this is a .NET issue that someone might know about, that we can work on solving.
My feeling is that no matter how bizarre the underlying cause, the usual troubleshooting steps are your best bet for locating the issue.
Since this is a performance issue, good measurements are invaluable. The overall process CPU usage is far too broad a measurement. Where is your service spending its time? You could use a profiler to measure this, or just log various section starts and stops. If you aren't able to do even that, then use Andrea Bertani's suggestion -- isolate sections by removing others.
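As a sketch of what logging section starts and stops could look like, here is a small helper that accumulates elapsed time per named section. SectionTimer and SectionBody are hypothetical names for this example. It assumes a framework version that has Stopwatch and generic collections, which the .NET 1.1 mentioned in the question does not; there the same idea would need DateTime stamps and a Hashtable.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Hypothetical delegate for the code being measured.
public delegate void SectionBody();

// Not thread-safe: intended for use from a single service loop.
public class SectionTimer
{
    private readonly Dictionary<string, TimeSpan> _totals = new Dictionary<string, TimeSpan>();
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    public void Measure(string section, SectionBody body)
    {
        Stopwatch sw = Stopwatch.StartNew();
        try
        {
            body();
        }
        finally
        {
            sw.Stop();
            // Record the time even if the body threw.
            _totals[section] = TotalFor(section) + sw.Elapsed;
            _counts[section] = CountFor(section) + 1;
        }
    }

    public TimeSpan TotalFor(string section)
    {
        TimeSpan total;
        return _totals.TryGetValue(section, out total) ? total : TimeSpan.Zero;
    }

    public int CountFor(string section)
    {
        int count;
        return _counts.TryGetValue(section, out count) ? count : 0;
    }

    // Writes one line per section: name, call count, total milliseconds.
    public void Report(TextWriter output)
    {
        foreach (KeyValuePair<string, TimeSpan> entry in _totals)
        {
            output.WriteLine("{0}: {1} calls, {2:F1} ms total",
                entry.Key, CountFor(entry.Key), entry.Value.TotalMilliseconds);
        }
    }
}
```

This measures wall-clock time, not CPU time, so a section that mostly waits on the database will look expensive here even if it burns little CPU. Comparing reports taken shortly after startup and a day later would show which sections, if any, are growing.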
Once you've located the general area, then you can make even finer-grained measurements, until you sort out the source of the CPU usage. If it's not obvious how to fix it at that point, you at least have ammunition for a much more specific question.
If you have in fact already done all this usual troubleshooting, please do let us in on the secret.
