.NET GC Stalling Desktop Application - Performance Issue - c#

I am working on a large Windows desktop application that stores a large amount of data in the form of a project file. We have a custom ORM and serialization layer to load object data efficiently from CSV format. This task is performed by multiple threads processing multiple files in parallel. A large project can contain a million objects, likely more, with many relationships between them.
Recently I was tasked with improving project-open performance, which has deteriorated for very large projects. Profiling showed that most of the time spent can be attributed to garbage collection (GC).
My theory is that due to the large number of very fast allocations, the GC is starved: it is postponed for a very long time, and when it finally kicks in it takes a very long time to do its job. That idea was further supported by two seemingly contradictory observations:
Optimizing the deserialization code to run faster only made things worse.
Inserting Thread.Sleep calls at strategic places made the load go faster.
An example of a slow load, with 7 generation 2 collections and a huge percentage of time spent in GC, is below.
An example of a fast load, with sleep periods in the code to give the GC some time, is below. In this case we have 19 generation 2 collections and also more than double the number of generation 0 and generation 1 collections.
So, my question is: how do I prevent this GC starvation? Adding Thread.Sleep looks silly, and it is very difficult to guess the right number of milliseconds for the right place. My other idea would be to use GC.Collect, but that poses the same difficulty of how many calls to add and where to put them. Any other ideas?

Based on the comments, I'd guess that you are doing a ton of String.Substring() operations as part of CSV parsing. Each of these creates a new string instance, which I'd bet you then throw away after further parsing it into an integer or date or whatever you need. You almost certainly need to start thinking about a different persistence mechanism (CSV has a lot of shortcomings that you are undoubtedly aware of), but in the meantime you are going to want to look into versions of the parsers that do not allocate substrings. If you dig into the code for Int32.TryParse, you'll find that it iterates over the characters to avoid allocating more strings. I'd bet you could spend an hour writing a version that takes start and end parameters, then pass it the whole line with offsets and avoid the substring call for each individual field value. Doing that will save you millions of allocations.
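To make that concrete, here is a minimal sketch of such a parser; the FieldParser name and signature are my own invention, and it deliberately skips the whitespace, overflow and culture handling a production version would need:

using System;

static class FieldParser
{
    // Parses the digits in line[start..end) without allocating a substring.
    public static int ParseInt32(string line, int start, int end)
    {
        int result = 0;
        bool negative = false;
        int i = start;

        if (i < end && line[i] == '-')
        {
            negative = true;
            i++;
        }

        for (; i < end; i++)
        {
            char c = line[i];
            if (c < '0' || c > '9')
                throw new FormatException("Unexpected character in numeric field.");
            result = result * 10 + (c - '0');   // no intermediate string is created
        }

        return negative ? -result : result;
    }
}

// Usage: instead of int.Parse(line.Substring(fieldStart, fieldLength)),
// call FieldParser.ParseInt32(line, fieldStart, fieldStart + fieldLength).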

So, it appears that this is a .NET bug rather than GC starvation. The workarounds and answers described in this question, Garbage Collection and Parallel.ForEach Issue After VS2015 Upgrade, apply perfectly. I got the best results by switching to server GC mode.
Note, however, that I am experiencing this issue on .NET 4.5.2. I will add a hotfix link if one becomes available.
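For reference, server GC for a .NET Framework desktop application is enabled through the gcServer element under runtime in app.config rather than from code; a minimal sketch for verifying at startup which mode the process actually got:

using System;
using System.Runtime;

static class GcModeCheck
{
    static void Main()
    {
        // True when <gcServer enabled="true"/> in app.config took effect for this process.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
    }
}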

Related

How to find cause of high memory usage in a complex C# based Windows service?

I am having problems figuring out the source of a memory problem in a complex C#-based Windows service. Unfortunately, the problem does not occur all the time, and I still don't know exactly which conditions cause it. Sometimes when I check the system resources used by the service, it takes up multiple gigabytes of memory, up to the point where it throws OutOfMemory exceptions everywhere because there isn't any memory left.
I have a paid version of .NET Memory Profiler available, but so far it has been useless because the whole system becomes slow and unstable when the service uses too much memory, so I cannot attach the memory profiler to the application.
The solution consists of more than 30 individual projects and hundreds of thousands of lines of code, so there is no way for me to find the source of the problem by simply looking through the source code.
So far the only thing I have been able to do is create a memory dump (.dmp file) of the process while it was using a lot of memory. Is there a way to analyze this dump, or anything else that would help me narrow down the source of this problem?
If you can identify some central methods in the main classes of your legacy projects, and you have some kind of logging already in place, you could log the total memory (managed and unmanaged, if your application opens such resources) by calling
Process.GetCurrentProcess().PrivateMemorySize64
At least that would give you a feeling for whether the memory problem is "diffuse", e.g. caused by objects not being released for garbage collection, or whether it occurs only in certain use cases (a jump in memory consumption when a certain action happens). Then you can nail it down with more logging and by investigating the corresponding code sections. It's tedious, but since you cannot attach a profiler, as you have said, I find it effective. If you want to analyze a specific situation with a memory dump, you can use WinDbg, but that takes some effort to learn the first time and would be a separate topic (see https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools).
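A minimal sketch of what that logging could look like; MemoryLogger and the checkpoint labels are hypothetical, and the Console call stands in for whatever logging you already have:

using System;
using System.Diagnostics;

static class MemoryLogger
{
    public static void LogMemory(string checkpoint)
    {
        // Private bytes cover managed plus unmanaged memory for the process;
        // GC.GetTotalMemory estimates the managed heap only.
        long privateBytes = Process.GetCurrentProcess().PrivateMemorySize64;
        long managedBytes = GC.GetTotalMemory(false);

        Console.WriteLine("{0}: private = {1:N0} MB, managed = {2:N0} MB",
            checkpoint,
            privateBytes / (1024 * 1024),
            managedBytes / (1024 * 1024));
    }
}

// Usage: call MemoryLogger.LogMemory("BeforeImport") / LogMemory("AfterImport")
// around the central methods you identified and compare the values over time.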

C# - Is the number of objects in memory affecting the performance of local processing?

I am very confused by what I am seeing in my program.
Let's say we have a list of two large objects (loaded from 2 external files).
Then I iterate over each object and for each one I call a method, which performs a bunch of processing.
Just to illustrate:
foreach (var obj in objects)
{
    obj.DoSomething();
}
In the first case, objects contains 2 items. It completes very fast; I track the progress of each object individually, and the processing of each one is very fast.
Then I run the program again, this time adding some more input files, so instead of 2 I have, let's say, 6 objects.
So the code runs again, and the 2 objects from before are still there, along with some more, but for some odd reason each call to obj.DoSomething() now takes much longer than before.
Let's say in scenario 1, with 2 objects, objectA.DoSomething() takes 1 minute to complete.
Let's say in scenario 2, with 6 objects, the same objectA.DoSomething() as in scenario 1 now takes 5 minutes to complete.
The more objects I have in my list, the longer the processing of each individual object takes.
How is that possible? How can the performance of processing a specific, independent object be affected so much by other objects in memory? How can, in scenarios 1 and 2 above, the exact same processing on the exact same data take significantly different amounts of time to complete?
Also, please note that processing is slower from the start; it does not start fast on the first object and then slow down progressively. It is consistently slowed down in proportion to the number of objects to process. I have some multi-threading in there, and I can see the rate at which threads complete drop dramatically when I start adding more objects. The multi-threading happens inside DoSomething(), which does not return until all its threads have completed. However, I don't think this issue is related to multi-threading; in fact, I added the multi-threading because of the slowness.
Also, please note that initially I was merging all input files into one huge object with one single call to DoSomething(), and I broke it down thinking it would help performance.
Is this "normal" behavior, and if so, what are the ways around it? I can think of other ways to process the data, but I still don't understand this behavior, and there must be something I can do to get the intended result here.
Edit 1:
Each object in the "objects" list above also contains a list (a queue) of smaller objects, around 5,000 of them each. I am starting to believe my issue might be that, and that I should use structs or something similar instead of having so many nested objects. Would that explain the kind of behavior I am describing above?
As stated in the comments, my question was too abstract for any precise answer to be given. I mostly wanted some pointers and to know whether I might have hit some internal limit.
It turned out I was overlooking a separate mechanism I have for logging results internally and producing reports. I had built that part of the system very quickly, and it was ridiculously inefficient and grew far too fast. Limiting the size of the internal structures, limiting the number of retrievals from large collections, and breaking the processing down into smaller chunks did the trick.
Just to illustrate, something that was taking over 6 hours now takes 1 minute. Shame on me. A cleaner solution would be to use a database, but at least it seems I will get away with this one for now.

Inefficient Parallel.For?

I'm using a parallel for loop in my code to run a long-running process on a large number of entities (12,000).
The process parses a string and goes through a number of input files (I've read that, given the amount of IO involved, the benefits of threading can be questionable, but it seems to have sped things up elsewhere), then outputs a matched result.
Initially the process goes quite quickly, but it ends up slowing to a crawl. It's possible that it has just hit some particularly tricky input data, but this seems unlikely on closer inspection.
Within the loop I added some debug code that prints "Started Processing:" and "Finished Processing:" when an iteration begins/ends, and then wrote a program that pairs each start with a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched IDs, it looks like the program is processing in excess of 400 different entities at once. Given the large amount of IO, this could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched IDs correctly, or is there some clever stuff going on behind the scenes that I'm missing, or even something obvious?
If you agree that what I've spotted is correct, how can I limit the number of entities it processes at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.
Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the threads processing that data are being de-prioritized. This could be due to IO or to the thread pool trying to optimize; however, if you are strongly IO-bound then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of concurrent tasks to a reasonable number. I would suggest trial and error to determine the optimal degree of parallelism, starting around the number of processor cores you have.
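A minimal sketch of what that could look like; the entity list and ProcessEntity are hypothetical stand-ins for your own parsing and matching code:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Stand-in for the 12,000 entities; replace with your real data.
        var entities = new List<string>();
        for (int i = 0; i < 12000; i++) entities.Add("entity " + i);

        var options = new ParallelOptions
        {
            // Start around the core count and tune by trial and error.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, entities.Count, options, i =>
        {
            ProcessEntity(entities[i]);   // your long-running per-entity work
        });
    }

    // Hypothetical placeholder for the string parsing / file matching step.
    static void ProcessEntity(string entity)
    {
        Console.WriteLine("Processed " + entity);
    }
}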
Good luck!
Let me start by confirming that it is indeed a very bad idea to read two files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone however many your whole thing is using.
Parallelism serves to optimize processing of a resource that can actually be parallelized, namely CPU power. If your parallelized process reads from a hard drive, you lose most of the benefit.
And even then, CPU power does not allow unlimited parallelization. A normal desktop CPU can run up to about 10 threads at the same time (it depends on the model, obviously, but that's the order of magnitude).
So, two things:
First, I am going to assume that your entities use all of your files, but that your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e. into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache to avoid rereading your files every time you need them, and your hard drive's cache is far smaller than your memory (roughly a thousandfold).
Second, you shouldn't be running Parallel.For over 12,000 items directly. Parallel.For will (try to) schedule all 12,000 items as parallel work, and that is actually worse than using about 10 threads, because of the big overhead that parallelizing creates, and because your CPU will not benefit from it at all since it cannot run more than about 10 threads at a time.
You should probably use a more efficient method: the IEnumerable<T>.AsParallel() extension (which comes with .NET 4.0). At runtime it determines the optimal number of threads to use and divides your enumerable into that many batches. Basically, it does the job for you, but it also creates significant overhead, so it's only useful if the processing of one element is actually costly for the CPU.
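A minimal sketch of that approach, assuming the per-entity work is CPU-bound; MatchEntity and the sample data are hypothetical stand-ins:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var entities = new List<string>();
        for (int i = 0; i < 12000; i++) entities.Add("entity " + i);

        var results = entities
            .AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount) // optional cap
            .Select(MatchEntity)
            .ToList();

        Console.WriteLine("Matched {0} entities", results.Count);
    }

    // Placeholder for the real per-entity parsing/matching work.
    static string MatchEntity(string entity)
    {
        return entity + " -> matched";
    }
}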
In my experience, any use of parallelism should always be evaluated against not using it under real-life conditions, i.e. by actually profiling your application. Don't assume it will work better.

What is the correct strategy for keeping lots of text data in memory? System.Runtime.Caching or custom classes?

Before refactoring my code to start experimenting, I'm hoping the wisdom of the community can advise me on the correct path.
Problem: I have a WPF program that diffs hundreds of INI files. For diff performance, I'd like to keep several hundred of the base files that the other files are diffed against in memory. I've found that using custom classes to store this data starts to bring my GUI to a halt once I've loaded 10-15 files with approximately 4,000 lines of data each.
I'm considering several strategies to improve performance:
Don't store more than a few files in memory at a time and forget about what I had hoped would be a performance improvement in parsing by keeping them in memory.
Experiment with handling all the base file data on a BackgroundWorker thread. I'm not doing any work on these files on the GUI thread, but maybe all that stored data is affecting it somehow. I'm guessing here.
Experiment with the System.Runtime.Caching classes.
The question asked here on SO didn't, in my mind, answer the question of what's the best strategy for this type of work. Thanks in advance for any help you can provide!
Assuming 100-character lines of text, 15 * 4000 * 100 is only about 6 MB, which is a trivial amount of memory on a modern PC. If your GUI is coming to a halt, to me that is an indication of virtual memory being swapped in and out to disk. That doesn't make sense for only 6 MB, so I'd figure out how much memory it's really taking up and why. It may well be some trivial mistake that is easier to fix than rethinking your whole strategy. The other possibility is that it has nothing to do with memory consumption and is instead an algorithmic issue.
You should use MemoryCache for this.
It works much like the ASP.NET Cache class, and allows you to set when it should clean up and what should be cleaned up first.
It also allows you to reload items based on dependencies, or after a certain time, and has callbacks on removal.
Very complete.
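A minimal sketch of what that could look like for the base files, assuming a reference to the System.Runtime.Caching assembly; LoadBaseFile and the sliding expiration value are my own placeholders:

using System;
using System.IO;
using System.Runtime.Caching;

static class BaseFileCache
{
    static readonly MemoryCache Cache = MemoryCache.Default;

    public static string[] GetBaseFile(string path)
    {
        var cached = Cache.Get(path) as string[];
        if (cached != null)
            return cached;

        string[] lines = LoadBaseFile(path);

        var policy = new CacheItemPolicy
        {
            // Evict entries that have not been touched for 10 minutes (tune as needed).
            SlidingExpiration = TimeSpan.FromMinutes(10)
        };
        Cache.Add(path, lines, policy);
        return lines;
    }

    // Placeholder for the real parsing of a base INI file.
    static string[] LoadBaseFile(string path)
    {
        return File.ReadAllLines(path);
    }
}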
If your application starts to hang, it is more likely that you are doing intensive processing on your GUI thread, consuming too much CPU or memory on that thread, so the GUI thread cannot repaint your UI in time.
The best way to resolve it is to spawn a separate thread for the diff operation. As you mentioned in your post, you can use a BackgroundWorker, or you can use the thread pool to spawn as many threads as you need for the diffing.
I don't think you need to cache the files in memory. It would be more appropriate to save the results to a file and load that file on demand; it shouldn't become a bottleneck for your application.
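A minimal sketch of moving the diff onto a BackgroundWorker as suggested; DiffRunner, RunDiff and ShowDiffResult are hypothetical stand-ins for your own code:

using System.ComponentModel;

public class DiffRunner
{
    public void StartDiff(string basePath, string otherPath)
    {
        var worker = new BackgroundWorker();

        worker.DoWork += (s, e) =>
        {
            // Runs on a thread-pool thread, so the WPF UI stays responsive.
            e.Result = RunDiff(basePath, otherPath);
        };

        worker.RunWorkerCompleted += (s, e) =>
        {
            // Raised back on the UI thread when started from it,
            // so it is safe to update controls or view models here.
            ShowDiffResult((string)e.Result);
        };

        worker.RunWorkerAsync();
    }

    // Placeholders for the real diff and for displaying its result.
    string RunDiff(string basePath, string otherPath) { return basePath + " vs " + otherPath; }
    void ShowDiffResult(string result) { /* update the UI / view model */ }
}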

Variation in execution time

I've been profiling a method using the stopwatch class, which is sub-millisecond accurate. The method runs thousands of times, on multiple threads.
I've discovered that most calls (90%+) take 0.1ms, which is acceptable. Occasionally, however, I find that the method takes several orders of magnitude longer, so that the average time for the call is actually more like 3-4ms.
What could be causing this?
The method itself is run from a delegate, and is essentially an event handler.
There are not many possible execution paths, and I've not yet discovered a path that would be conspicuously complicated.
I'm suspecting garbage collection, but I don't know how to detect whether it has occurred.
Finally, I am also considering whether the logging method itself is causing the problem. (The logger is basically a call to a static class + event listener that writes to the console.)
Just because Stopwatch has a high accuracy doesn't mean that other things can't get in the way - like the OS interrupting that thread to do something else. Garbage collection is another possibility. Writing to the console could easily cause delays like that.
Are you actually interested in individual call times, or is it overall performance which is important? It's generally more useful to run a method thousands of times and look at the total time - that's much more indicative of overall performance than individual calls which can be affected by any number of things on the computer.
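A minimal sketch of measuring the aggregate time, where MethodUnderTest is a hypothetical stand-in for the real event handler:

using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        const int iterations = 100000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            MethodUnderTest();
        }
        sw.Stop();

        // The total and the derived average smooth out one-off spikes from
        // GC pauses, context switches and other transient events.
        Console.WriteLine("Total: {0} ms, average: {1:F4} ms",
            sw.ElapsedMilliseconds,
            (double)sw.ElapsedMilliseconds / iterations);
    }

    static void MethodUnderTest()
    {
        // Placeholder for the real work.
    }
}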
As I commented, you really should at least describe what your method does, if you're not willing to post some code (which would be best).
That said, one way you can tell if garbage collection has occurred (from Windows):
Run perfmon (Start->Run->perfmon)
Right-click on the graph; select "Add Counters..."
Under "Performance object", select ".NET CLR Memory"
From there you can select # Gen 0, 1, and 2 collections and click "Add"
Now on the graph you will see a graph of all .NET CLR garbage collections
Just keep this graph open while you run your application
EDIT: If you want to know if a collection occurred during a specific execution, why not do this?
int initialGen0Collections = GC.CollectionCount(0);
int initialGen1Collections = GC.CollectionCount(1);
int initialGen2Collections = GC.CollectionCount(2);
// run your method
if (GC.CollectionCount(0) > initialGen0Collections)
// gen 0 collection occurred
if (GC.CollectionCount(1) > initialGen1Collections)
// gen 1 collection occurred
if (GC.CollectionCount(2) > initialGen2Collections)
// gen 2 collection occurred
SECOND EDIT: A couple of points on how to reduce garbage collections within your method:
You mentioned in a comment that your method adds the object passed in to "a big collection." Depending on the type you use for said big collection, it may be possible to reduce garbage collections. For instance, if you use a List<T>, then there are two possibilities:
a. If you know in advance how many objects you'll be processing, you should set the list's capacity upon construction:
List<T> bigCollection = new List<T>(numObjects);
b. If you don't know how many objects you'll be processing, consider using something like a LinkedList<T> instead of a List<T>. The reason for this is that a List<T> automatically resizes itself whenever a new item is added beyond its current capacity; this results in a leftover array that (eventually) will need to be garbage collected. A LinkedList<T> does not use an array internally (it uses LinkedListNode<T> objects), so it will not result in this garbage collection.
If you are creating objects within your method (i.e., somewhere in your method you have one or more lines like Thing myThing = new Thing();), consider using a resource pool to eliminate the need for constantly constructing objects and thereby allocating more heap memory. If you need to know more about resource pooling, check out the Wikipedia article on Object Pools and the MSDN documentation on the ConcurrentBag<T> class, which includes a sample implementation of an ObjectPool<T>.
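A minimal sketch of an object pool along the lines of the ObjectPool<T> sample mentioned above, built on ConcurrentBag<T>; adapt the factory delegate and the Thing type to your own code:

using System;
using System.Collections.Concurrent;

public class ObjectPool<T>
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public ObjectPool(Func<T> factory)
    {
        _factory = factory;
    }

    public T Get()
    {
        // Reuse a pooled instance if one is available; otherwise create a new one.
        T item;
        return _items.TryTake(out item) ? item : _factory();
    }

    public void Return(T item)
    {
        _items.Add(item);
    }
}

// Usage: var pool = new ObjectPool<Thing>(() => new Thing());
// var thing = pool.Get(); /* ... use it ... */ pool.Return(thing);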
That can depend on many things, and you really have to figure out which one you are dealing with.
I'm not terribly familiar with what triggers garbage collection and what thread it runs on, but that sounds like a possibility.
My first thought around this is with paging. If this is the first time the method runs and the application needs to page in some code to run the method, it would be waiting on that. Or, it could be the data that you're using within the method that triggered a cache miss and now you have to wait for that.
Maybe you're doing an allocation and the allocator did some extra reshuffling in order to get you the allocation you requested.
Not sure how thread time is calculated with Stopwatch, but a context switch might be what you're seeing.
Or...it could be something completely different...
Basically, it could be one of several things and you really have to look at the code itself to see what is causing your occasional slow-down.
It could well be GC. If you use a profiler application such as Redgate's ANTS profiler, you can profile the % time in GC alongside your application's performance to see what's going on.
In addition, you can use the CLRProfiler...
https://github.com/MicrosoftArchive/clrprofiler
Finally, Windows Performance Monitor will also show the % time in GC for a given running application.
These tools will help you get a holistic view of what's going on in your app as well as the OS in general.
I'm sure you know this already, but micro-benchmarking such as this is sometimes useful for determining how fast one line of code is compared to another that you might write, but you generally want to profile your application under a typical load too.
Knowing that a given line of code is 10 times faster than another is useful, but if that line of code is easier to read and not part of a tight loop then the 10x performance hit might not be a problem.
What you need is a performance profiler to tell you exactly what causes a slowdown. Here is a quick list, and of course here is the ANTS profiler.
Without knowing what your operation is doing, it sounds like it could be garbage collection. However, that might not be the only reason. If you are reading from or writing to the disk, it is possible your application has to wait while something else is using the disk.
Timing issues can also occur in a multi-threaded application when another thread takes some of the processor time, so the thread you are measuring only gets to run part of the time. This is why a profiler would help.
If you're only running the code "thousands" of times on a pretty quick function, the occasional longer time could easily be due to transient events on the system (maybe Windows decided it was time to cache something).
That being said, I would suggest the following:
Run the function many many more times, and take an average.
In the code that uses the function, determine if the function in question actually is a bottleneck. Use a profiler for this.
It can be dependent on your OS, environment, page reads, CPU ticks per second and so on.
The most realistic way is to run an execution path several thousand times and take the average.
However, if that logging class is only called occasionally and it logs to disk, that is quite likely to be a slow-down factor if it has to seek on the drive first.
A read of http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29 may give you an insight into more techniques for determining slowdowns in your applications, while http://en.wikipedia.org/wiki/Visual_Studio_Team_System_Profiler describes a profiling tool you may find useful, specifically if you're doing C# work.
Hope that helps!
