Azure Worker Role - Threading, memory consumption, fragmentation and compaction

Our running Cloud Service role makes intensive use of threading. It spins up roughly 100 Tasks. Most of them do some work, sleep for a few minutes, then start again. Some of them spin up a Service Bus Queue Client and handle up to 5 concurrent messages. Those services do a lot of things, but mainly data transformation: they take something from the DB and send it to external web services, or they take something from those web services and put it in the DB.
From my application's service inventory, I'd say that there are no more than 300 "concurrent" Tasks.
Because of some error logs, I had to dig deeper and I found something strange. I'll try to sum up what I discovered and what I think about it. I'd be glad to get a sanity check from you.
This application is a .NET 4.5.2 worker role, compiled for a 64-bit architecture, and runs on 2 Standard_A2 (Medium) instances (i.e. 3.5 GB memory and a 2-core CPU each). Because we use Redis caching, we have tweaked the ThreadPool minimum thread count like this:
ThreadPool.SetMinThreads(300, 300);
Now, here is what I discovered during my analysis:
From time to time I log this error: "There were not enough free threads in the ThreadPool to complete the operation."
I think this is related to the async/await operations used throughout our application. If I understood correctly, async/await should use the I/O completion port (IOCP) part of the thread pool. Since there seem to be no free IOCP threads, I concluded that I need to raise the limit accordingly via ThreadPool.SetMaxThreads. Does that make sense?
int CurrentMaxWorkerThreads = 0;
int CurrentMaxIOPorts = 0;
ThreadPool.GetMaxThreads(out CurrentMaxWorkerThreads, out CurrentMaxIOPorts);
ThreadPool.SetMaxThreads(CurrentMaxWorkerThreads, 4000);
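Before raising any limit, it may help to log how busy the pool actually is when the error shows up. The following is only a minimal sketch: the ThreadPoolSnapshot class and its Describe method are hypothetical names, not part of the application above. It relies solely on ThreadPool.GetMinThreads, GetMaxThreads and GetAvailableThreads.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not from the original post): reports how much
// headroom the ThreadPool has at the moment it is called.
public static class ThreadPoolSnapshot
{
    public static string Describe()
    {
        int minWorker, minIo, maxWorker, maxIo, freeWorker, freeIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

        // Busy threads = configured maximum minus currently available.
        return string.Format(
            "worker: min={0} max={1} busy={2} | iocp: min={3} max={4} busy={5}",
            minWorker, maxWorker, maxWorker - freeWorker,
            minIo, maxIo, maxIo - freeIo);
    }
}
```

Calling ThreadPoolSnapshot.Describe() from the error handler and from a periodic timer would show whether it is the worker side or the IOCP side that is exhausted, which is what the SetMaxThreads call above assumes.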
Memory usage: I've seen that after 2-3 days of running, the process takes roughly 2 GB. Using ANTS Memory Profiler, I discovered that a lot of this memory is actually free but unusable because of fragmentation (I'll ask about this in the next point). The question is: shouldn't my application be able to grow beyond 2 GB? Isn't the 2 GB limit only for 32-bit applications?
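One way to rule out the 32-bit explanation is to check the bitness at runtime instead of relying on build settings. This is a small sketch with a hypothetical ProcessBitness helper; Environment.Is64BitProcess, Environment.Is64BitOperatingSystem and IntPtr.Size are standard .NET members.

```csharp
using System;

// Hypothetical helper (not from the original post): reports whether the
// current process really runs with a 64-bit address space.
public static class ProcessBitness
{
    public static bool Is64Bit()
    {
        // In a 64-bit process, pointers are 8 bytes wide.
        return Environment.Is64BitProcess && IntPtr.Size == 8;
    }

    public static string Describe()
    {
        return string.Format(
            "64-bit process: {0}, 64-bit OS: {1}, pointer size: {2} bytes",
            Environment.Is64BitProcess,
            Environment.Is64BitOperatingSystem,
            IntPtr.Size);
    }
}
```

Logging ProcessBitness.Describe() once at role startup would confirm whether the 2 GB figure can be an address-space limit at all.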
Memory fragmentation: I've read a lot about this lately, and I really think that figuring out why it's happening will be very time-consuming, also because - using ANTS Memory Profiler - I see that a lot of the memory consumption is due to byte arrays (probably generated to send data out of the application). Right now, just to keep things under control, I'd compact the LOH using the .NET 4.5.1 feature:
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
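To see which byte arrays are candidates for LOH fragmentation, and what a forced compaction costs, something like the sketch below could be used. The LohSketch class and its method names are hypothetical. It assumes the usual threshold of roughly 85,000 bytes for LOH allocation, and that GC.GetGeneration reports LOH objects as generation 2.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

// Hypothetical helper (not from the original post).
public static class LohSketch
{
    // Only meaningful for a freshly allocated array: large arrays start
    // life on the LOH (reported as GC.MaxGeneration), small ones in gen 0.
    public static bool IsOnLargeObjectHeap(byte[] freshBuffer)
    {
        return GC.GetGeneration(freshBuffer) == GC.MaxGeneration;
    }

    // Requests a one-off LOH compaction and measures how long the
    // blocking full collection takes.
    public static TimeSpan CompactOnceAndTime()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(); // blocking collection of all generations
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Since the collection blocks all managed threads, logging the elapsed time from CompactOnceAndTime() would show whether running it periodically is acceptable for a role that handles queue messages.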

Related

Debugging a memory leak in a long-running mono/C# application

I have a C# program that runs on an Ubuntu VM as a server using mono 5.10. Every now and then, the program starts using lots of memory until it eventually crashes. It can often take days before such a memory leak occurs, and I haven't been able to reproduce the issue locally.
For the past few days, I've used the mono log profiler to collect heapshots on demand. This has given me a 15 GB+ file. I managed to get a heapshot from when the memory leak occurs, which displays as follows in the Xamarin profiler:
Running mprof-report gives a similarly unhelpful report.
Neither really helps debug the problem. Obviously there's something going on with the thread pool, but there seems to be no way of figuring out what.
Being able to see where the objects are allocated might help, but that requires enabling alloc in the profiler, which is not possible because of the file size it will produce.
What is causing the massive amount of IThreadPoolWorkItems? The app is using 30-40% CPU, so it doesn't seem like tasks are being scheduled more quickly than the CPU can handle.
Is there a way to list the objects referenced by the ThreadPool jobs? That would at least allow me to identify what piece of code is being run so much. mprof-report only shows inverse references as far as I can see, which isn't very helpful as the IThreadPoolWorkItems are obviously owned by the ThreadPool itself.
Update: The tasks aren't (disk) I/O bound. While leaking, the reads are in the tens of KB per second, which shouldn't be saturating the disk.
On second thought: if there were a lot of tasks scheduled, shouldn't I be seeing lots of individual IThreadPoolWorkItem objects too? Since IThreadPoolWorkItem isn't a struct, there should be a separate entry for the actual object and IThreadPoolWorkItem[] would just be an array of pointers. So this would suggest that those IThreadPoolWorkItem[]s are just arrays of (mostly) null pointers.
Update: ThreadPool.GetAvailableThreads shows 2 worker threads and 0 completion port threads are active, which seems to indicate that the ThreadPool is doing just fine. Custom logging also indicates that once again, I'm not scheduling too many tasks and the tasks that are scheduled are finished quickly.

Webjob getting 'System.OutOfMemoryException'

I have a WebJob running on Azure on the S3 Standard plan, meaning it has 7 GB of RAM available to run my app.
Three jobs are running on the machine: one does all the processing and the other two handle small tasks. My problem is that on certain large, memory-intensive tasks I get an out-of-memory exception, which crashes the job in question.
The job I am trying to run is very memory intensive and requires around 1.5 GB of RAM, but based on the graph below I do not understand why this should be a problem, since the App Service never uses more than 2.2 GB of RAM. I should add that I run 3 instances, so it might be that one instance is using far more memory than the others, but I cannot find anywhere to view that information.
[Figure: Memory consumption on server]
When I look at the process explorer in Kudu, I see that I currently use around 1.3 GB of RAM, which still leaves far more free memory than the job needs.
[Figure: Kudu screenshot]
The job ran without any problems no more than 2 days ago on the same server setup, so I am completely lost as to where to look.
Update: The code works fine in Visual Studio with the same data, running the exact same task.
Does anyone have ideas on how to approach this problem?

Per my understanding, you could capture a memory dump in your Azure App Service and analyze it to narrow down the issue. You could refer to this tutorial about how to get a full memory dump in Azure App Services. You could also use the Crash Diagnoser extension to monitor CPU and memory; for more details, refer to this blog.

First of all, how do you deal with the garbage collector? I mean, do you dispose of disposable objects once they have done their work? You said the app ran fine 2 days ago, so it seems it has since run into a more memory-intensive workload. Since you say your app is "very memory intensive", I guess you should go through the source code and make sure you are managing objects correctly, because the garbage collector cannot clean up after every mess in the source code. Good luck.

CPU underutilized. Due to blocking I/O?

I am trying to find the bottleneck in a C# server application that underutilizes the CPU. I think this may be due to poor disk I/O performance and have nothing to do with the application itself, but I am having trouble turning this supposition into a fact.
The application reads messages from a local MSMQ queue, does some processing on each message and, after processing, sends response messages to another local MSMQ queue.
I am using an async loop to read messages from the queue, dequeuing them as fast as possible and dispatching each one for processing with Task.Run (I do not await this Task.Run; I only attach a faulted-only continuation to log errors). Messages are processed concurrently, i.e. there is no need to wait for a message to be fully processed before processing the next one.
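For readers unfamiliar with the dispatch pattern described above, here is a minimal sketch of it. It is not the asker's code: FireAndForgetDispatch, Dispatch and the Errors queue are hypothetical names, and the message is a plain string instead of an MSMQ message so the sketch stays self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical illustration of "Task.Run plus a faulted-only continuation".
public static class FireAndForgetDispatch
{
    // Stands in for the error log: collects exceptions from failed tasks.
    public static readonly ConcurrentQueue<Exception> Errors =
        new ConcurrentQueue<Exception>();

    // Starts processing on the thread pool and returns immediately.
    // The returned task is only exposed so callers may observe it if they wish.
    public static Task Dispatch(string message, Action<string> process)
    {
        Task work = Task.Run(() => process(message));

        // Runs only if 'work' faults; t.Exception is an AggregateException.
        work.ContinueWith(
            t => Errors.Enqueue(t.Exception.GetBaseException()),
            TaskContinuationOptions.OnlyOnFaulted);

        return work;
    }
}
```

Note that every dispatched task occupies a thread pool thread for as long as process runs, including any time it spends blocked inside a synchronous call.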
At the end of the processing of a message, I use the Send method of MessageQueue (somewhat asynchronous, but not really, because it has to wait for the disk write before returning; see "System.Messaging - why MessageQueue does not offer an asynchronous version of Send").
For the benchmarks I put 100K messages in the queue (approx. 100 MB in total) and then launch the program. On two of my personal computers (an SSD on one and a SATA2 HDD on the other, both with quad-core i7 CPUs, 8 logical processors) I reach ~95% CPU usage for the duration of the program's lifecycle (dequeuing the 100K messages, processing them and sending responses). Messages are dequeued as fast as possible, processed as fast as possible (this is the CPU-bound part) and then the response for each message is sent to a different local queue.
Now, on a virtual machine with a non-HT dual-core CPU (I have no idea what the underlying disk is, but it seems far less performant than mine: during the benchmark, Perfmon shows Avg. Disk sec/Write around 10-15 ms on this VM, whereas it is around 2 ms on my personal machines), the same benchmark only reaches ~55% CPU (when I run it on that machine without sending response messages to the queue, I reach ~90% CPU).
I don't really understand what the problem is here. It seems clear that sending messages to the queue is the problem and slows down the overall processing (and the dequeuing of messages to be processed), but why would that be? Since I use Task.Run to launch the processing of each dequeued message, and ultimately the sending of the response, I would not expect the CPU to be underutilized. Unless, when one thread is sending a message, it blocks other threads from running on the same core while it waits for the disk write to return, in which case it might make sense given that latency is much higher than on my personal computers. But a thread waiting for I/O should not prevent other threads from running.
I am really trying to understand why I am not reaching at least 95% CPU usage on this machine. I am blindly saying this is due to poorer disk I/O performance, but I still don't see why that would lead to CPU underutilization, considering that I run the processing concurrently using Task.Run. It could also be some system problem completely unrelated to disk, but considering that MessageQueue.Send seems to be the problem and that this method ultimately writes messages to a memory-mapped file and to disk, I don't see where the performance issue could come from other than the disk.
It is certainly a system performance issue, as the program maximizes CPU usage on my own computers, but I need to find out exactly what the bottleneck is on the VM, and exactly why it affects the concurrency and speed of my application.
Any ideas?

To examine poor disk and/or CPU utilization there is only one tool: the Windows Performance Toolkit. For an example of how to use it, see here.
You should get the latest one from the Windows 8.1 SDK (requires .NET 4.5.1), which gives you the most capabilities, but the one from the Windows 8 SDK is also fine.
There you get graphs for % CPU utilization and % disk utilization. If one of them is at 100% and the other is low, you have found the bottleneck. Since it is a system-wide profiler, you can check whether the MSMQ service is using the disk badly, or whether you or something else is (a virus scanner is a common culprit).
You can go directly to your call stacks and check which process and thread woke up your worker thread, which is supposed to run at full speed. Then you can jump to the readying thread and process and check what it did before it could ready your thread. That way you can directly verify what was holding it up for so long.
No more guessing. You can really see what the system is doing.
To analyze further enable in the CPU Usage Precise view the following columns:
NewProcess
NewThreadId
NewThreadStack(Frame Tags)
ReadyingProcess
ReadyingThreadId
Ready(us) Sum
Wait(us) Sum
Wait(us)
%CPU Usage
Then drill down into a call stack in your process to see where high Wait(us) times occur in a thread that is supposed to run at full speed. You can drill down to a single event until you can go no further. Then you will see values in ReadyingProcess and ReadyingThreadId. Go to that process/thread (it can be your own) and repeat the process until you end up in some blocking operation that involves disk I/O, a sleep, or a long-running device driver call (e.g. a virus scanner or the VM driver).

If the Disk I/O performance counters don't look abnormally high, I'd look next at the hypervisor level. Assuming you're running the exact same code, using a VM adds latency to the entire stack (CPU, RAM, Disk). You can perhaps tweak CPU Scheduling at the hypervisor level and see if this will increase CPU utilization.
I'd also consider using a RAMDisk temporarily for performance testing. This would eliminate the Disk/SAN latency and you can see if that fixes your problem.

Parallel Programing with Threads

Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We load these log files into a database. Reading these files usually takes a couple of hours, because we do it serially: we iterate through the files, open a SQL transaction for each file and insert the log into the database, then read the next file and do the same.
Now I am thinking of using parallel programming so that I can use all cores of the CPU. However, I am still not clear on whether using a thread for each file will make any difference to the system. I mean, if I create, say, 30 threads, will they run on a single core or in parallel across cores? How can I make use of all the cores, if the threads are not already doing that?
EDIT: I am using a single server with a 10K RPM HDD, a 4-core CPU and 4 GB RAM, no network operations; SQL Server is on the same machine, with Windows 2008 as the OS. [I can change the OS if that helps too :)].
EDIT 2: I ran some tests based on your feedback; here is what I found on my i3 quad-core CPU with 4 GB RAM:
CPU1 remains at 24-50% usage, CPU2 remains under 50%, CPU3 remains at 75% and CPU4 remains around 0%. Yes, I have Visual Studio, an email client and a lot of other applications open, but this tells me that the application is not using all cores, as CPU4 remains at 0%.
RAM remains constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about.
HDD read/write usage remains below 25%, spiking up to 25% in a sine-wave pattern, because our SQL transaction is first stored in memory and then written to disk when memory reaches a threshold.
So all resources are underutilized here, and hence I think I can distribute the work to make it more efficient. Your thoughts again. Thanks.

First of all, you need to understand your code and why it is slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU and last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .NET 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .NET 4.5, you could use blocks from TPL Dataflow instead of the threads.
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.
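The three-stage pipeline described above could be sketched roughly as follows. This is only an illustration under stated assumptions: LogPipeline, Run, parse and save are hypothetical names, strings stand in for log records, and the bounded capacity of 1000 is an arbitrary choice. Error handling and cancellation are deliberately left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of a three-stage producer/consumer pipeline.
public static class LogPipeline
{
    // rawLines: stage 1 input (e.g. lines read from disk).
    // parse:    stage 2, the CPU-bound conversion.
    // save:     stage 3, e.g. the database insert.
    public static void Run(
        IEnumerable<string> rawLines,
        Func<string, string> parse,
        Action<string> save)
    {
        // Bounded queues keep a fast stage from running far ahead of a slow one.
        var rawQueue = new BlockingCollection<string>(boundedCapacity: 1000);
        var parsedQueue = new BlockingCollection<string>(boundedCapacity: 1000);

        Task stage1 = Task.Run(() =>
        {
            try { foreach (string line in rawLines) rawQueue.Add(line); }
            finally { rawQueue.CompleteAdding(); } // lets stage 2 finish
        });

        Task stage2 = Task.Run(() =>
        {
            try
            {
                foreach (string line in rawQueue.GetConsumingEnumerable())
                    parsedQueue.Add(parse(line));
            }
            finally { parsedQueue.CompleteAdding(); } // lets stage 3 finish
        });

        Task stage3 = Task.Run(() =>
        {
            foreach (string item in parsedQueue.GetConsumingEnumerable())
                save(item);
        });

        Task.WaitAll(stage1, stage2, stage3);
    }
}
```

If stage 2 turns out to be the slowest part, several stage 2 tasks can consume from the same rawQueue, as long as CompleteAdding is called on parsedQueue only after all of them have finished.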

There is no definite answer to this question and you'll have to test because, as mentioned in my comments:
if the bottleneck is disk I/O, then you won't gain much by adding more threads and you might even worsen performance, because more threads will be fighting for access to the disk
if you think disk I/O is OK but CPU load is the issue, then you can add some threads, but no more than the number of cores, because beyond that things will get worse due to context switching
if you can do more disk and network I/O and CPU load is not high (very likely), then you can oversubscribe with (far) more threads than cores: typically when your threads spend much of their time waiting for the database
So you should profile first, and then (or directly if you're in a hurry) test different configurations, but chances are you'll be in the third case. :)
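Testing different configurations can be as simple as timing the same import with different degrees of parallelism. The sketch below assumes a hypothetical ParallelismProbe helper and an importFile delegate that stands in for the real per-file work; Parallel.ForEach and ParallelOptions.MaxDegreeOfParallelism are standard TPL members.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper: times one run of the import at a given
// degree of parallelism.
public static class ParallelismProbe
{
    public static TimeSpan Measure(
        IEnumerable<string> files,
        Action<string> importFile,
        int degree)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        Stopwatch stopwatch = Stopwatch.StartNew();
        Parallel.ForEach(files, options, importFile);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Running Measure with degrees such as 1, 2, 4 and 16 on the same set of files, and comparing the elapsed times, should show which of the three cases above applies.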

First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network and a faster network connection will help. Maybe buying a faster disk will help.
Find the problem before thinking about a solution.

Your problem is not about using all the CPU; your work is mainly I/O (reading files, sending data to the DB).
Using threads or Parallel will make your code run faster, since you will be processing many files at the same time.
To answer your question, the framework/OS will take care of distributing your threads over the different cores.

It varies from machine to machine, but generally speaking, if you have a dual-core processor and 2 threads, the operating system will schedule one thread on one core and the other thread on the other. It doesn't matter how many cores you use; what matters is whether your approach is the fastest. If you want to make use of parallel programming, you need a way of splitting the workload that logically makes sense. You also need to consider where your bottleneck actually occurs. Depending on the size of the files, it may simply be the maximum read/write speed of the storage medium that is taking so long. As a test, I suggest you log where most of the time in your code is being spent.
A simple way to test whether a non-serial approach will help is to sort your files in some order, divide the workload between 2 threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help, then I guarantee 30 threads will only make it take longer, because the OS has to switch threads back and forth.

Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of getting started with parallel programming
(pretty much the same as what has happened more recently with async versions of functions to use if you want it async)
e.g.
for (int i = 2; i < 20; i++)
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
});
EDIT: That said, it would perhaps also be productive / faster to put intensive tasks into separate threads... but manually making your application 'multi-core', with things like certain threads running on particular cores, isn't something you normally do; that's all managed under the hood...
have a look at plinq for example
and .Net Parallel Extensions
and look into
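// ProcessorAffinity is an IntPtr bitmask: 4 (binary 100) restricts the process to the third logical CPU.
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)4;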
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. And the general rule I arrived at - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed:
So, you can see that processor load increases, but speed increases much more slowly. I want to know why this happens and how to find the reason for the unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code, as the code does nothing more than:
Loads data from RAM (the server has 96 GB of RAM, so the swap file shouldn't be hit)
Performs simple calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks - it looks like each operation becomes slower as the number of threads grows.
I'm stuck; what's wrong with my scenario?
I use the .NET 4 Task Parallel Library.

You will always get this kind of curve, it's called Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
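To get a feeling for how soon the curve levels off, Amdahl's law can be evaluated directly: with a parallelizable fraction p and n cores, the speedup is 1 / ((1 - p) + p / n). The helper below is a hypothetical sketch; the value of p for the asker's program is unknown and would have to be estimated from measurements.

```csharp
using System;

// Hypothetical helper: theoretical speedup according to Amdahl's law.
public static class Amdahl
{
    // parallelFraction: share of the single-threaded runtime that can be
    // parallelized, between 0 and 1. cores: number of cores used.
    public static double Speedup(double parallelFraction, int cores)
    {
        if (parallelFraction < 0.0 || parallelFraction > 1.0)
            throw new ArgumentOutOfRangeException("parallelFraction");
        if (cores < 1)
            throw new ArgumentOutOfRangeException("cores");

        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
    }
}
```

For example, if 95% of the work were parallelizable, Amdahl.Speedup(0.95, 16) gives about 9.1 rather than 16, and no number of cores could push it beyond 20.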

The key to linear scalability - in the sense that going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core resource)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
don't use more threads than there are cores (the OS will swap in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly does local L1 & L2 memory accesses it's a better performance bet to tie each thread to its own core. Sure the OS will come in but compared to the work being performed by your threads the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).

A general decrease in return with more threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or a queue, or are you using some external functions that might depend on some limited resource?
The sharp break at 8 threads is interesting, and in my comment I asked whether the CPU is a true 16-core or an 8-core with hyper-threading, where each core appears as 2 cores to the OS.
If it is hyper-threading, either you have so much work that hyper-threading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads evenly distributed, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that the threads query for data very often, there is a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?
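One way to let each thread take more data per request is range partitioning, where every work item handed out is a whole chunk of indices rather than a single element. The following is a hedged sketch, not the asker's code: ChunkedWork and SumSquares are hypothetical names, and summing squares merely stands in for the real calculation. Partitioner.Create(from, to, rangeSize) and the Parallel.ForEach overload with thread-local state are standard TPL APIs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: each thread fetches a chunk of indices at a time
// and touches shared state only once per worker task.
public static class ChunkedWork
{
    public static long SumSquares(int[] data, int chunkSize)
    {
        if (data.Length == 0) return 0;

        long total = 0;
        Parallel.ForEach(
            Partitioner.Create(0, data.Length, chunkSize),
            () => 0L,                                  // per-task local sum
            (range, loopState, local) =>
            {
                // range.Item1 is inclusive, range.Item2 is exclusive.
                for (int i = range.Item1; i < range.Item2; i++)
                    local += (long)data[i] * data[i];
                return local;
            },
            local => Interlocked.Add(ref total, local)); // one shared write per task
        return total;
    }
}
```

Comparing run times for chunk sizes such as 1, 10 and 1000 would show how much of the slowdown comes from threads asking for work too often.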

If you are doing memory-intensive work, you could be hitting cache capacity.
You could test this with a mock algorithm that just processes the same small bit of data over and over, so that it all fits in cache.
If it is indeed the cache, possible solutions are to make the threads work on the same data somehow (like different parts of a small data window), or to tweak the algorithm to be more local (as in sorting: merge sort is generally slower than quick sort, but it is more cache friendly, which still makes it better in some cases).

Are your threads reading and writing items that are close together in memory? Then you're probably running into false sharing. Say thread1 works with data[1] and thread2 works with data[2]. In an ideal world, two consecutive reads of data[2] by thread2 always produce the same result and can be served from cache. In the actual world, if thread1 updates data[1] sometime between those two reads and both items sit on the same cache line, the CPU will mark thread2's copy of that cache line as invalid and reload it. See http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
That could give you a performance boost, but likely won't get you to 16x—there are lots of things going on under the hood and you'll just have to knock them out one-by-one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.
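The "adequately far away in memory" advice for false sharing can be sketched with a padded struct. This is only an illustration: PaddedCounter and FalseSharingSketch are hypothetical names, and the layout assumes 64-byte cache lines, which is common but not guaranteed on every CPU.

```csharp
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Each counter occupies 128 bytes with its value at offset 64, so the
// values of two adjacent array elements are always 128 bytes apart and
// cannot share a 64-byte cache line (assumed line size).
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

public static class FalseSharingSketch
{
    // Every worker increments only its own padded slot.
    public static long[] CountPadded(int workers, int iterations)
    {
        var counters = new PaddedCounter[workers];
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < iterations; i++)
                counters[w].Value++;
        });

        var result = new long[workers];
        for (int w = 0; w < workers; w++)
            result[w] = counters[w].Value;
        return result;
    }
}
```

Timing this against the same loop over a plain long[] would indicate whether false sharing plays a role on a given machine; the sketch itself makes no claim about how large the difference is.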
