How to scale effectively in Windows Azure? - C#

I have been having some difficulty identifying the right configuration for effectively scaling my cloud service. I am assuming we just have to use the Scale section of the management portal and nothing programmatically?
My current configuration for the Web Role is:
- Medium sized VM (4 GB RAM)
- Autoscale: CPU
- Instance range: 1 to 10
- Target CPU: 50 to 80
- Scale up and down by 1 instance at a time
- Scale up and down wait time: 5 mins
I used http://loader.io/ to do load testing by sending concurrent requests to an API, and it could support only 50-100 users; after that I was getting timeout (10 sec) errors.
My app will be targeting millions of users on a huge scale, so I am not really sure how I can efficiently scale to cater to that much load on the server.
I think the problem could be the scale-up wait time of 5 minutes (which seems very high), but 5 minutes is the lowest option in the management portal, so how can I reduce it?
Any suggestions?

Azure's auto-scaling engine examines 60-minute CPU-utilization averages every 5 minutes. This means that every 5 minutes it has a chance to decide whether your CPU utilization is too high and scale you up.
If you need something more robust, I'd recommend thinking about the following:
- CPU usage is rarely a good indicator for scaling websites. Look into Requests/sec or Requests/current instead of CPU utilization.
- Consider whether you need to scale more frequently (every 1 min?). The Azure portal cannot do this; you'll need either WASABi or AzureWatch for it.
- Depending on your usage patterns, consider looking at shorter time averages to make a decision (i.e. an average over 20 minutes, not 60 minutes). Once again, your choices here are WASABi or AzureWatch.
- Consider looking at the rate of increase in the metrics, not just the latest averages themselves (e.g. requests/sec rose by 20% in the last 20 minutes; see the sketch after this list). Once again, the Azure autoscaling engine cannot do this; consider either WASABi (which may do this) or AzureWatch (which definitely can).
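For illustration, here is a minimal C# sketch of that last kind of rule. It is not how WASABi or AzureWatch are implemented; the window, threshold, and sampling source are all assumptions:

```csharp
using System;
using System.Collections.Generic;

// Illustrative rule: scale up when requests/sec rose by more than 20%
// over the last 20 minutes. Feed it one sample per minute (or however
// often you poll your performance counters).
class RateOfIncreaseRule
{
    private readonly Queue<KeyValuePair<DateTime, double>> samples =
        new Queue<KeyValuePair<DateTime, double>>();

    public bool ShouldScaleUp(DateTime now, double requestsPerSec)
    {
        samples.Enqueue(new KeyValuePair<DateTime, double>(now, requestsPerSec));

        // Drop samples that have fallen out of the 20-minute window.
        while (samples.Peek().Key < now.AddMinutes(-20))
            samples.Dequeue();

        double oldest = samples.Peek().Value;
        return oldest > 0 && (requestsPerSec - oldest) / oldest > 0.20;
    }
}
```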
WASABi is an application block from Microsoft (i.e. a DLL) that you'll need to configure, host, and monitor somewhere yourself. It is pretty flexible, and you can override whatever functionality you need since it is open source.
AzureWatch is a third-party managed service that monitors/autoscales/heals your Azure roles, Virtual Machines, Websites, SQL Azure, etc. It costs money, but you let someone else do all the dirty work.
I recently wrote a blog post comparing the three products.
Disclosure: I'm affiliated with AzureWatch
HTH

Another reason why the minimum time is 5 minutes is that it takes Azure some time to assign additional machines to your Cloud Service and replicate your software onto them. (Web Apps don't have that 'problem'.)
In my work as a SaaS admin I have found that for Cloud Services this ramp-up time after scaling can be around 3-5 minutes for our software package.
If you want to configure scaling within the Azure portal, then my suggestion would be to significantly lower your CPU ranges. As Igorek mentioned Azure scaling looks at the Average over the last 60 minutes.
If a Cloud Service is running at 5% CPU most of the time and then suddenly peaks and runs at 99%, it will take some time for the average to go up and trigger your scale settings. Leaving it at 80% will cause scaling to happen far too late.
Real-life example:
I manage a portal that runs some CPU intensive calculations. At normal usage our Cloud Services tend to run at 2-5% CPU but on rare occasion we've seen it go up to 99% and stay there for a while.
My first scaling attempt was 2 instances, scaling up by 2 at 80% average CPU, but it took around 40 minutes for the event to trigger because the average CPU did not go up that fast. Right now I have everything set to scale when average CPU goes over 25%, and what I see is that our services scale up after 10-12 minutes.
I'm not saying 25% is the magic number; I'm saying keep in mind that you're working with an "average over 60 minutes" (see the sketch below).
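To see why the threshold matters so much, here is the arithmetic as a small C# sketch, using the 5% baseline and 99% spike from the example above:

```csharp
using System;

class RollingAverageMath
{
    // Minutes until a 60-minute rolling CPU average crosses a threshold,
    // given a flat baseline that jumps to a sustained spike at t = 0.
    // Solves ((60 - t) * baseline + t * spike) / 60 = threshold for t.
    static double MinutesUntilTrigger(double baseline, double spike, double threshold)
    {
        return 60.0 * (threshold - baseline) / (spike - baseline);
    }

    static void Main()
    {
        Console.WriteLine(MinutesUntilTrigger(5, 99, 25)); // ~12.8 min
        Console.WriteLine(MinutesUntilTrigger(5, 99, 80)); // ~47.9 min
    }
}
```

Those two numbers line up with the observed ~10-12 minutes at the 25% setting and the ~40 minutes of the first attempt at 80%.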
The second thing is that the Azure portal only shows a limited set of scaling options; scaling can be configured in greater detail via PowerShell / REST. The 60-minute interval over which the average is calculated, for example, can be lowered.

Related

Azure SQL - DTU CPU usage disparity

I have been analyzing a DB running in Azure SQL that is performing very badly. It is on the Premium tier with 1750 DTUs available, and at times it can still max out DTUs.
I've identified a variety of queries and terrible data access patterns through stored procs, which has reduced load. But there is still a massive disparity between DTU and CPU usage in the image below; every other image I see of "Query Performance Insight" in Azure SQL shows the DTU line aligning with the CPU usage for the most part.
DTU (in red) to CPU usage per query
Looking at the C# app sitting on top of this: for each user that uses the app, it creates a SQL user and uses that user in the connection string to access the DB. This means that connection pooling is not being used, resulting in a massively larger number of active users/sessions on the Azure SQL DB. Could this be the sole reason why there is such high DTU usage?
Or could I possibly be missing something regarding IO that isn't visible in the Azure portal?
Thanks
Neil
EDIT: Adding sessions and workers image:
Based on that I'm not convinced now... what is the session percentage of? It's 10%, but 10% of what? The max allowed?
Edit 2: Adding more metrics.
One week:
2-3 hours when load is high:
The purple spike, I believe, is the reindex, so you can ignore that!
Trying to understand DTUs versus resources was a stumbling block for me too. Click on your resource utilization chart and click Edit.
Then you get a slider with a lot of resources you can monitor. Select Sessions and Workers percent; more than likely one of these is your problem. If not, you can add CPU, Data IO, Log IO, and/or In-Memory OLTP percentage. Hit OK.
Now what you should find is the real cost of your query or queries. Learning how your queries consume the different resources can help you fix performance problems like these. I learned this when doing large inserts: I was maxing out my Log IO, and everything else was at <5% utilization.
Try that, and if you are right about connection pooling, unfortunately that will mean some refactoring in the application (a sketch of the idea follows below). At the very least, using this will give you more insight than just looking at that DTU percentage!
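To make the pooling point concrete, here is a minimal sketch of what the refactor might look like. ADO.NET keys its connection pool on the exact connection string, so one SQL login per user means one pool per user; a single shared login with the acting user passed as session context keeps one warm pool. The server, database, and login names below are hypothetical:

```csharp
using System.Data.SqlClient;

class PooledAccess
{
    // One shared login => one shared connection pool.
    const string SharedConnectionString =
        "Server=tcp:myserver.database.windows.net;Database=MyDb;" +
        "User ID=AppLogin;Password=...;";

    public static void RunAs(string userName)
    {
        using (var conn = new SqlConnection(SharedConnectionString))
        {
            conn.Open();

            // Pass the acting user as session context (available in
            // Azure SQL / SQL Server 2016+) instead of baking them
            // into the connection string.
            using (var cmd = new SqlCommand(
                "EXEC sp_set_session_context @key = N'user_name', @value = @u;", conn))
            {
                cmd.Parameters.AddWithValue("@u", userName);
                cmd.ExecuteNonQuery();
            }

            // ...run the real queries on the pooled connection...
        }
    }
}
```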

.NET Service - Analysing Development vs Test performance

I have a .NET REST API written using C# MVC5.
The API uses a repository that fire-hoses the necessary data from the database, then analyses it and transforms it into a usable model. The transformation uses a lot of LINQ to shape the data.
On dev (Windows 10; i7, 8 cores @ 3.7 GHz; 32 GB RAM) it takes 10 seconds for a large test range.
Running on a VM (Windows 2008 R2; virtual Xeon with 8 virtual cores @ 2.99 GHz; 8 GB RAM) it takes 300 seconds (5 mins).
Neither exhausts memory, and neither is CPU-bound (CPU touches 50% on the VM, and barely registers on the dev box).
Same database, code, etc.
The API makes use of async APIs to load some peripheral data while it's doing its primary job, so I could put some logging in to record timings, I guess (a sketch follows below).
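A minimal sketch of that logging, with stand-in methods for the repository calls (all the names here are hypothetical, not from the real API):

```csharp
using System.Diagnostics;
using System.Threading.Tasks;

class StageTimer
{
    // Stand-ins for the real repository calls.
    static Task<int[]> LoadPeripheralAsync() { return Task.FromResult(new int[0]); }
    static Task<int[]> LoadPrimaryAsync() { return Task.FromResult(new int[0]); }
    static object Transform(int[] primary, int[] peripheral) { return new object(); }

    static async Task<object> LoadAndTimeAsync()
    {
        var sw = Stopwatch.StartNew();
        Task<int[]> peripheral = LoadPeripheralAsync(); // kicked off in parallel
        int[] primary = await LoadPrimaryAsync();
        Trace.WriteLine("fire-hose: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        object model = Transform(primary, await peripheral); // the LINQ-heavy step
        Trace.WriteLine("transform: " + sw.ElapsedMilliseconds + " ms");
        return model;
    }
}
```

If the transform stage dominates on the VM but not on the dev box, per-core speed and memory bandwidth become the prime suspects rather than IO.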
What are the common techniques for tackling this problem? Can the CPU speed really be making that much difference?
thanks
EDIT:
Following the comment by Pieter, I've increased the VM's memory to 12 GB and monitored the performance of the VM while executing the operation. It's not the best visual aid (a screenshot of Task Manager at the end of the operation), but what it did show was that the vCPUs never really went above ~60%, and memory, apart from a few MB at the beginning of the request, never went above 2.7 GB.
If IIS / .NET / my operation is not maxing out the resources, what is taking so long?

Finding source of web performance issue

We are trying to track down a performance issue in an ASP.NET solution on Windows 2008.
The error page, which has no database access and very little logic, takes 10 seconds; other pages take over 70 seconds.
- Performance drops noticeably only at high load
- Total memory usage is low: 5 GB of 16 available
- w3wp.exe is using 2.5 GB
- Several Connection_Dropped DefaultAppPool entries in the httperr log
- ca. 1500 connections; the ASP.NET queue length is 10000
- CPU usage is low
Anyone have an idea what I could check next?
Edit
I have now used VS 2010 to run a performance test against it on a test virtual server.
I ran 200 users with a stepped build-up and no wait time.
The interesting thing was that the page time continued to increase even after the max number of users was reached. There did not appear to be any memory leaks; memory usage was flat. Time taken per page goes from 0.1 to 30.0 seconds.
All pages slow down; the one that slows down most is a GET of the login page, which has no database access, just a Forms Authentication check to see whether the user is logged in.
Upon reading your numbers (always answering too fast, am I?), I agree that you should probably profile the server side first. See "What Are Some Good .NET Profilers?".
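Before reaching for a full profiler, a cheap first pass is to log server-side time per request with an HTTP module; a minimal sketch (the module is mine, not from the question, and needs to be registered under <modules> in web.config):

```csharp
using System.Diagnostics;
using System.Web;

// Logs how long each request spends server-side, so you can see whether
// the 70 seconds is spent inside ASP.NET at all or in queueing before it.
public class RequestTimingModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (s, e) =>
        {
            app.Context.Items["timer"] = Stopwatch.StartNew();
        };

        app.EndRequest += (s, e) =>
        {
            var sw = app.Context.Items["timer"] as Stopwatch;
            if (sw != null)
                Trace.WriteLine(app.Context.Request.Path + " took " +
                                sw.ElapsedMilliseconds + " ms");
        };
    }

    public void Dispose() { }
}
```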
I suggest you use Google Chrome. It has excellent profiling tools (under Developer Tools, Ctrl-Shift-I on my installation). I peruse the Network, Profiles, and Timeline charts for the information.
Also, there is the more high-level YSlow extension for Firefox. It is developed/used by Yahoo and gives some rather to-the-point advice out of the box.
If you prefer Firefox, the Firebug extension comes pretty close to the Google developer tools
Ah. What about you just look it up?
Attach a profiler, make a profiling run, and find out where the CPU spends its time.
There are plenty of profilers around that offer 14-day free trials.
I would say you need more CPU - find out why ;)

Multiple app instances, windows GDI limit

I'm trying to run hundreds of instances of the same app simultaneously (using C#), and after about 200 instances the GUI starts to slow down dramatically, to the point that the load time of the next instance climbs to 20 s (from 1 s).
The test machine is:
- Xeon 5520
- 12 GB RAM
- Windows 2008 Web 64-bit
At max load (200 instances) the CPU is at about 20% and RAM at 45%, so I'm sure it's not a hardware issue.
I have already tried configuring the session size and SharedSection in the Windows registry, but it doesn't seem to help.
I also tried running the app in the background, and also across multiple (different) sessions, and it's still the same (I thought maybe it was a per-session limitation).
When the slowdown occurs on one session, for example, I can log in to another session and that desktop works without a problem (the first desktop becomes unusable).
My question is: is there a way to strip the GDI objects or maybe eliminate the use of the GUI? Or is it a Windows limitation?
P.S. - I can't change the app since it's third-party.
Thanks in advance.
With 200 instances running, the constant context switching is probably hurting performance. Context switching isn't counted in CPU load.
Edit: whoops, wrong link.
Try monitoring context switching on your system
http://technet.microsoft.com/en-us/library/cc938606.aspx
I doubt it's GDI - if you run out of GDI handles/resources you'll notice vast chunks of your windows failing to redraw, rather than everything slowing down. (A quick way to check is sketched below.)
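One way to check rather than guess: user32's GetGuiResources reports the GDI and USER handle counts per process, so you can watch whether any instance approaches the per-process limit (10,000 GDI objects by default). A quick sketch; the process name is a placeholder:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class GuiHandleCheck
{
    // Returns the count of GDI (flag 0) or USER (flag 1) objects
    // currently in use by the given process.
    [DllImport("user32.dll")]
    static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    const uint GR_GDIOBJECTS = 0;
    const uint GR_USEROBJECTS = 1;

    static void Main()
    {
        foreach (var p in Process.GetProcessesByName("TheApp")) // placeholder name
        {
            Console.WriteLine("PID {0}: GDI={1}, USER={2}",
                p.Id,
                GetGuiResources(p.Handle, GR_GDIOBJECTS),
                GetGuiResources(p.Handle, GR_USEROBJECTS));
        }
    }
}
```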
The most likely reason for a sudden drop in performance is that you are maxing out your RAM and thrashing your Virtual Memory as all your processes fight for CPU time. Check memory usage, and if it's high, see if you can reduce the footprint of your application. Or apply a "hardware fix" by installing more RAM. Or add Sleeps into your Apps where possible so that they aren't demanding constant timeslices from your CPU (and thus needing to be constantly paged in from VM).

Clock Speed Formula

Is there a simple way to determine how many milliseconds I need to "Sleep" for in order to "emulate" a 2 MHz clock? In other words, I want to execute an instruction, then call the System.Threading.Thread.Sleep() function for X milliseconds in order to emulate 2 MHz. This doesn't need to be exact to the millisecond, but is there a ballpark I can get? Some formula that divides the PC clock speed by the 2 MHz or something?
Thanks
A 2 MHz clock has a 500 ns period. Sleep's argument is in milliseconds, so even if you used Sleep(1), you would miss 2,000 cycles.
Worse, Sleep does not promise that it will return after X milliseconds, only that it will return after at least X milliseconds.
Your best bet would be to use some kind of Timer with an event that keeps the program from consuming or producing data too quickly.
For the user, a pause of less than 100 ms or so will generally be imperceptible. Based on that, instead of attempting to sleep after each instruction, you'd be much better off executing for something like 50 ms, then sleeping for an appropriate length of time, then executing for another 50 ms.
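A minimal sketch of that approach, with an executeCycles callback standing in for the emulator core (an assumption, since we haven't seen your code):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class EmulatorThrottle
{
    const long CyclesPerSecond = 2000000; // 2 MHz
    const int SliceMs = 50;
    const long CyclesPerSlice = CyclesPerSecond * SliceMs / 1000; // 100,000

    // executeCycles runs the given number of emulated clock cycles.
    static void Run(Action<long> executeCycles)
    {
        var wallClock = Stopwatch.StartNew();
        long slices = 0;
        while (true)
        {
            executeCycles(CyclesPerSlice); // one slice of emulated time
            slices++;

            // Sleep off whatever real time is left in this slice. Tracking
            // the absolute target, rather than sleeping a fixed amount,
            // stops Sleep's coarse granularity from accumulating drift.
            long targetMs = slices * SliceMs;
            int surplus = (int)(targetMs - wallClock.ElapsedMilliseconds);
            if (surplus > 0)
                Thread.Sleep(surplus);
        }
    }
}
```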
Also note, however, that most processors with a 2 MHz clock (e.g. a Z80) did not actually execute 2 million instructions per second. A 2 MHz Z80 took a minimum of four processor clocks to fetch one instruction, giving a maximum instruction rate of 500 kHz.
Note that sleeping is not at all a good proxy for running code on a less capable CPU. There are many things that affect computational performance other than clock rate. In many cases, clock rate is a second- or third- (or 10th-) order determinant of computational performance.
Also note that QueryPerformanceCounter(), while high resolution, is expensive on most systems (3,000 to 5,000 CPU clocks in many cases). The reason is that it requires a system call and several reads from the HPET in the system's south bridge (note: this varies by system).
Could you help us better understand what you are trying to do?
As I mentioned in my comment on James Black's answer: do not poll a timer call (like QPC or the DirectX stuff). Your thread will simply consume massive amounts of CPU cycles and not let ANY thread at a lower priority run, and it will eat up most of the time at its priority. Note that the NT scheduler does adjust thread priorities; this is called 'boosting'. If your thread is boosted and hits one of your polling loops, it will almost assuredly cause perf problems. This is very bad behavior from a system perspective. Avoid it if at all possible.
Said another way: Windows is a multi-tasking OS and users run lots of things. Be aware that your app is running in a larger context and its behavior can have system-wide implications.
The problem you will have is that the minimum sleep on Windows seems to be about 20-50 ms, so even if you ask to sleep for 1 ms, it will wake up later, because other processes will be running and the time slice is quite large.
If you must have a small interval such as 500 ns (1/2e06 * 1000 ms), then you will want to use DirectX, as it has a high-resolution timer, so that you can just loop until the pause is done; but you will need to take over the computer and not allow other processes to interrupt what is going on (a sketch of such a busy-wait follows below).
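For completeness, here is a busy-wait sketch using Stopwatch, which wraps QueryPerformanceCounter on Windows, subject to the heavy caveat above that this pegs a core:

```csharp
using System.Diagnostics;

class SpinDelay
{
    // Busy-waits for the given duration. This burns a full core; read the
    // scheduling warnings above before using anything like this.
    public static void Wait(double seconds)
    {
        long ticks = (long)(seconds * Stopwatch.Frequency);
        long start = Stopwatch.GetTimestamp();
        while (Stopwatch.GetTimestamp() - start < ticks)
        {
            // spin
        }
    }
}

// Usage: SpinDelay.Wait(500e-9); // a 500 ns pause, as far as the
// hardware timer resolution allows.
```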
