I have a C# program that runs on an Ubuntu VM as a server using mono 5.10. Every now and then, the program starts using lots of memory until it eventually crashes. It can often take days before such a memory leak occurs, and I haven't been able to reproduce the issue locally.
For the past few days, I've used the Mono log profiler to collect heapshots on demand. This has produced a 15 GB+ file. I managed to get a heapshot from when the memory leak occurs, which displays as follows in the Xamarin profiler:
Running mprof-report gives a similarly unhelpful report.
Neither really helps debug the problem. Obviously there's something going on with the thread pool, but there seems to be no way of figuring out what.
Being able to see where the objects are allocated might help, but that requires enabling alloc in the profiler, which is not possible because of the file size it will produce.
What is causing the massive amount of IThreadPoolWorkItems? The app is using 30-40% CPU, so it doesn't seem like tasks are being scheduled more quickly than the CPU can handle.
Is there a way to list the objects referenced by the ThreadPool jobs? That would at least allow me to identify what piece of code is being run so much. mprof-report only shows inverse references as far as I can see, which isn't very helpful as the IThreadPoolWorkItems are obviously owned by the ThreadPool itself.
Update: The tasks aren't (disk) IO bound. While leaking, reads are in the tens of KB per second, which shouldn't saturate the disk.
On second thought: if there were a lot of tasks scheduled, shouldn't I be seeing lots of individual IThreadPoolWorkItem objects too? Since IThreadPoolWorkItem isn't a struct, there should be a separate entry for the actual object and IThreadPoolWorkItem[] would just be an array of pointers. So this would suggest that those IThreadPoolWorkItem[]s are just arrays of (mostly) null pointers.
Update: ThreadPool.GetAvailableThreads shows 2 worker threads and 0 completion port threads are active, which seems to indicate that the ThreadPool is doing just fine. Custom logging also indicates that once again, I'm not scheduling too many tasks and the tasks that are scheduled are finished quickly.
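For reference, a minimal sketch of the kind of thread pool logging described in the update above. ThreadPoolProbe and its method names are made up for illustration; only ThreadPool.GetMaxThreads and ThreadPool.GetAvailableThreads are real API. GetAvailableThreads reports free threads, so the number of active threads has to be derived as maximum minus available.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the original program.
static class ThreadPoolProbe
{
    // Active threads = configured maximum minus currently available.
    public static void GetBusyThreads(out int busyWorkers, out int busyIo)
    {
        int maxWorkers, maxIo, freeWorkers, freeIo;
        ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorkers, out freeIo);
        busyWorkers = maxWorkers - freeWorkers;
        busyIo = maxIo - freeIo;
    }

    // One log line with a UTC timestamp, suitable for periodic logging.
    public static string Describe()
    {
        int workers, io;
        GetBusyThreads(out workers, out io);
        return string.Format("{0:o} busy workers={1} busy IOCP={2}",
                             DateTime.UtcNow, workers, io);
    }
}
```

Calling ThreadPoolProbe.Describe() from a timer every few seconds would show whether the active counts stay low while memory grows, which is the observation reported above.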
Our running Cloud Service role makes intensive use of threading. It spins up more or less 100 Tasks. Most of them do something, sleep for a few minutes, then start again. Some of them spin up a Service Bus Queue Client and handle up to 5 concurrent messages. Those services do a lot of stuff, but mainly data transformation: they take something from the DB and send it to external web services, or they take something from those web services and put it in the DB.
From my application's service inventory, I'd say that there are no more than 300 "concurrent" Tasks.
Because of some error logs, I had to dig deeper and I found something strange. I'll try to sum up what I discovered and what I think about it. I'd be glad to get a sanity check from you.
This application is a .NET 4.5.2 worker role, compiled for a 64-bit architecture, and runs on 2 Standard_A2 (Medium) instances (i.e. 3.5 GB memory, 2-core CPU). Because we use Redis caching, we have tweaked the ThreadPool minimum threads with SetMinThreads like this:
ThreadPool.SetMinThreads(300, 300);
Now, here is what I discovered going through my analysis:
Occasionally I log this error: "There were not enough free threads in the ThreadPool to complete the operation."
I think this is related to async/await operations across our application. If I understood correctly, async/await should use the IO completion port (IOCP) part of the ThreadPool. The fact that there were no free IOCP threads led me to raise the ThreadPool maximum accordingly with SetMaxThreads. Does that make sense?
int CurrentMaxWorkerThreads = 0;
int CurrentMaxIOPorts = 0;
ThreadPool.GetMaxThreads(out CurrentMaxWorkerThreads, out CurrentMaxIOPorts);
ThreadPool.SetMaxThreads(CurrentMaxWorkerThreads, 4000);
Memory usage: I've seen that after a few days of running (2-3 days), the process takes more or less 2 GB. Using ANTS Memory Profiler, I discovered that a lot of this memory is actually free but fragmented (I'll ask about this in point 3). The question is: shouldn't my application be able to grow beyond 2 GB? Isn't the 2 GB limit only for 32-bit applications?
Memory fragmentation: I've read a lot about it these days, and I really think that figuring out why it's happening would be very time-consuming, especially because ANTS Memory Profiler shows that a lot of the memory consumption is due to byte arrays (probably generated to send data out of the application). Right now, just to keep things under control, I'd compact the LOH using the .NET 4.5.1 feature:
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
I have implemented RabbitMQ in my application and it's running on Windows Server 2008. The problem is that erl.exe uses a lot of CPU: it sometimes reaches 40-45%, and even when idle (not processing any queue) it takes at least 4-15%.
What could be the reason for the high CPU usage? Is there any setting or anything else that I need to change?
You say that even when not processing a queue it is still at 4-15%, but is your application running? If you weren't before, try to monitor erl while no application is using Rabbit.
One thing that comes to mind is that you might be using the QueueingBasicConsumer in a loop, and that could be contributing to the CPU usage. If you are using QueueingBasicConsumer and it is what is causing the hit, try substituting it with EventingBasicConsumer (so that you don't busy-wait) and see if that improves things.
Also, how is your application using Rabbit? According to the documentation, every IConnection is backed by a background thread, so if you're creating a bunch of connections in your application, that could be another reason for the slowdown.
When we run CPU-intensive tasks, we run them in parallel to reduce total execution time, and the optimal number of threads is basically equal to Environment.ProcessorCount. That is not always optimal, but it is in most cases.
OK, but what if I have an intensive IO-bound task with little load on the CPU? Basically, if the task doesn't use the CPU intensively, it should be faster to use 1 thread and avoid switching overhead.
But now I've realized that many customers (I'm talking about server software) have RAID arrays, striped disks and so on; in some system configurations IO operations can be done in parallel. How can I tell when parallel IO is better, and how many threads should I use? Is there a value like Environment.ProcessorCount for IO? As far as I know, no. Do you know a good way to find the optimal number of IO threads for different system configurations?
I think there should be some kind of custom task scheduler that is optimized for IO, but I can't find one... IOTaskScheduler is not optimized for performance.
For IO-bound work there is no easy guideline. You don't know what the point of optimal throughput is. It depends on the hardware. For example, SSDs have independent banks of storage. The network has high latency and can benefit from pipelining. Who knows what a remote web-service is like.
Test different values and measure which one is the fastest.
You could even implement a runtime benchmark where you run different degrees of parallelism and pick the fastest. Or you do an adaptive algorithm like the TPL uses. It speculatively increases the number of threads and if throughput increased it keeps the new thread. If it dropped, it retires the thread.
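A rough sketch of the runtime benchmark idea, under some assumptions: DopBenchmark and PickFastest are names invented here, and the work delegate stands in for whatever IO operation you actually run. It simply times the same workload at each candidate degree of parallelism and keeps the fastest.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
static class DopBenchmark
{
    // Runs `work` over `items` once per candidate degree of parallelism
    // and returns the candidate with the shortest wall-clock time.
    public static int PickFastest<T>(IList<T> items, Action<T> work, params int[] candidates)
    {
        int best = candidates[0];
        TimeSpan bestTime = TimeSpan.MaxValue;
        foreach (int dop in candidates)
        {
            var options = new ParallelOptions { MaxDegreeOfParallelism = dop };
            var sw = Stopwatch.StartNew();
            Parallel.ForEach(items, options, work);
            sw.Stop();
            if (sw.Elapsed < bestTime)
            {
                bestTime = sw.Elapsed;
                best = dop;
            }
        }
        return best;
    }
}
```

Something like DopBenchmark.PickFastest(files, ProcessFile, 1, 2, 4, 8) would do, where files and ProcessFile are your own. Keep in mind that a single pass is noisy: OS and controller caches warm up during the first run, so later candidates can look faster than they really are. Repeating the runs or using a different sample per candidate helps.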
You cannot. The main problem is that even without a RAID controller it hugely depends on the IO load (type). The moment you add RAID or SAS, things are out of control. There may be guidelines, but there is no way to measure a single best value. I have a RAID array here that sometimes spikes to tens of thousands of outstanding IO requests, and between a GB-sized RAID controller cache, an SSD cache and half a dozen SAS discs this gets handled in a second or two at times.
Measure. If you want to look at one item - measure latency.
The moment it takes longer to finish a request, you are waiting in line. Then optimize for that. Queue size etc. are useless; latency is the only real measurement of how busy an IO subsystem is.
Once you have that, you can build a feedback loop to adjust the parallelism to the optimal size, but then... you may get totally SNAFU'd when some other software kicks in (disc scans and antivirus are famous for that).
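To make the feedback loop a bit more concrete, here is one possible shape for it. This is only a sketch: LatencyController, the 1.5x tolerance and the "add one / halve" policy are assumptions chosen for illustration, not a tuned or recommended algorithm.

```csharp
using System;

// Hypothetical controller: raises the concurrency limit while latency stays
// close to the best latency seen so far, and cuts it when latency climbs.
sealed class LatencyController
{
    private readonly int _min;
    private readonly int _max;
    private double _baselineMs = double.MaxValue;

    public int Limit { get; private set; }

    public LatencyController(int min, int max)
    {
        _min = min;
        _max = max;
        Limit = min;
    }

    // Call with the average latency of the most recent batch of requests.
    public void Report(double latencyMs, double tolerance = 1.5)
    {
        _baselineMs = Math.Min(_baselineMs, latencyMs);
        if (latencyMs <= _baselineMs * tolerance)
            Limit = Math.Min(_max, Limit + 1);   // not queuing yet: probe one higher
        else
            Limit = Math.Max(_min, Limit / 2);   // requests are waiting in line: back off
    }
}
```

The caller would cap its outstanding requests at Limit and call Report after each batch. Because the baseline is just the lowest latency ever observed, an interfering process (the antivirus case above) will push the limit down until it goes away again, which is about the best such a loop can do.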
Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We read these log files into a database. Reading them usually takes a couple of hours, because we do it serially: we iterate through the files, open a SQL transaction for each file and insert the log into the database, then read the next file and do the same.
Now I am thinking of using parallel programming so I can use all CPU cores. However, I am still not clear whether using a thread for each file will make any difference. I mean, if I create, say, 30 threads, will they run on a single core or in parallel across cores? How can I use all the cores, if they are not already being used?
EDIT: I am using a single server with a 10K RPM HDD, a 4-core CPU and 4 GB RAM, no network operations; SQL Server is on the same machine, with Windows 2008 as the OS [I can change the OS if that helps too :)].
EDIT 2: I ran some tests based on your feedback; here is what I found on my i3 quad-core CPU with 4 GB RAM:
CPU: CPU1 stays at 24-50%, CPU2 stays under 50%, CPU3 stays at 75% and CPU4 stays around 0%. Yes, I have Visual Studio, an email client and a lot of other applications open, but this tells me that the application is not using all cores, as CPU4 stays at 0%.
RAM stays constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about.
HDD read/write usage stays below 25%, spiking up to 25% in a sine-wave pattern, because our SQL transactions are first stored in memory and written to disk when memory reaches a threshold. So again, nothing to worry about.
So all resources are underutilized here, and hence I think I can distribute the work to make it more efficient. Your thoughts again? Thanks.
First of all, you need to understand your code and why it is slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU and last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .Net 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .Net 4.5, you could use blocks from TPL Dataflow instead of the threads.
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.
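A minimal sketch of that three-stage pipeline using BlockingCollection<T>. The names LogPipeline, read, parse and save are placeholders for your actual disk, CPU and database steps, and the queue capacity of 16 is an arbitrary choice.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical pipeline; read/parse/save are supplied by the caller.
static class LogPipeline
{
    public static void Run(IEnumerable<string> files,
                           Func<string, string> read,
                           Func<string, string> parse,
                           Action<string> save)
    {
        // Bounded queues stop a fast stage from buffering unlimited data in memory.
        var raw = new BlockingCollection<string>(16);
        var parsed = new BlockingCollection<string>(16);

        // Stage 1: read each file and hand the content to stage 2.
        var stage1 = Task.Factory.StartNew(() =>
        {
            try { foreach (var f in files) raw.Add(read(f)); }
            finally { raw.CompleteAdding(); }
        }, TaskCreationOptions.LongRunning);

        // Stage 2: CPU-bound processing.
        var stage2 = Task.Factory.StartNew(() =>
        {
            try { foreach (var r in raw.GetConsumingEnumerable()) parsed.Add(parse(r)); }
            finally { parsed.CompleteAdding(); }
        }, TaskCreationOptions.LongRunning);

        // Stage 3: write the processed data to the database.
        var stage3 = Task.Factory.StartNew(() =>
        {
            foreach (var p in parsed.GetConsumingEnumerable()) save(p);
        }, TaskCreationOptions.LongRunning);

        Task.WaitAll(stage1, stage2, stage3);
    }
}
```

With a single task per stage, items come out in the order they went in. Error handling is deliberately thin here: if a later stage throws, an earlier stage can block on a full queue, so real code would add a CancellationToken. To scale Stage 2, start several tasks that all consume from raw and call parsed.CompleteAdding() only after the last of them finishes.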
There is no definite answer to this question and you'll have to test because, as mentioned in my comments:
if the bottleneck is the disk I/O then you won't gain a lot by adding more threads and you might even worsen performance because more threads will be fighting to get access to the disk
if you think disk I/O is OK but CPU load is the issue then you can add some threads, but no more than the number of cores because here again things will worsen due to context switching
if you can do more disk and network I/Os and CPU load is not high (very likely) then you can oversubscribe with (far) more threads than cores: typically if your threads are spending much of their time waiting for the database
So you should profile first, and then (or directly if you're in a hurry) test different configurations, but chances are you'll be in the third case. :)
First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network and a faster network connection will help. Maybe buying a faster disc will help.
Find the problem before thinking about a solution.
Your problem is not about using all the CPUs; your work is mainly I/O (reading files, sending data to the DB).
Using threads/Parallel will make your code run faster, since you are processing many files at the same time.
To answer your question, the framework/OS will take care of distributing your code over the different cores.
It varies from machine to machine, but generally speaking, if you have a dual-core processor and 2 threads, the operating system will schedule one thread on one core and the other thread on the other. It doesn't matter how many cores you use; what matters is whether your approach is the fastest. If you want to make use of parallel programming, you need a way of sharing the workload that logically makes sense. You also need to consider where your bottleneck is actually occurring. Depending on the size of the files, it may simply be the maximum read/write speed of the storage medium that is taking so long. As a test, I suggest you log where most of the time in your code is being spent.
A simple way to test whether a non-serial approach will help is to sort your files in some order, divide the workload between 2 threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help, then I guarantee 30 threads will only make it take longer, due to the OS having to switch threads back and forth.
Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of getting started with parallel programming
(pretty much the same as what has happened more recently with async versions of functions to use if you want it async)
e.g.
for (int i = 2; i < 20; i++)
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
});
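SumRootN isn't defined in the snippet above, so here is a stand-in to make it runnable. The body, the limit parameter and the Roots/ComputeAll names are assumptions for illustration; any CPU-heavy function would do. ComputeAll also stores the results in an array instead of printing them, because Parallel.For makes no promise about the order in which iterations run, so the Console.WriteLine lines in the parallel version can appear in any order.

```csharp
using System;
using System.Threading.Tasks;

static class Roots
{
    // Stand-in for the undefined SumRootN: sums the n-th roots of 1..limit.
    public static double SumRootN(int root, int limit = 1000000)
    {
        double result = 0;
        for (int i = 1; i <= limit; i++)
            result += Math.Exp(Math.Log(i) / root);
        return result;
    }

    // Computes SumRootN for every root in [from, to) in parallel.
    public static double[] ComputeAll(int from, int to)
    {
        var results = new double[to - from];
        // Each iteration writes only its own slot, so no locking is needed.
        Parallel.For(from, to, i => results[i - from] = SumRootN(i));
        return results;
    }
}
```

Roots.ComputeAll(2, 20) then gives the same values as the sequential loop, indexed by root, and you can print them in order afterwards.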
EDIT: That said, it would perhaps be productive / faster to also put intensive tasks into separate threads... but to manually make your application 'Multi-Core' and have things like certain threads running on particular cores, that isn't currently possible, that's all managed under the hood...
have a look at plinq for example
and .Net Parallel Extensions
and look into
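// ProcessorAffinity is an IntPtr bitmask; 4 (binary 100) pins the process to the third logical core.
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)4;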
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.