I'm building a Windows Forms application in C# that reads from hundreds of files and creates an object hierarchy. In particular:
DEBUG[14]: Imported 129 system/s, 6450 query/s, 6284293 document/s.
The sum is the total number of objects I've created. The objects are really simple, by the way: just some int/string properties and strongly typed lists inside.
Question: is it normal that my application is consuming about 700 MB of memory (in debug mode)? What can I do to reduce memory usage?
EDIT: here is why I have 6284293 objects, if you're curious. Imagine a search engine, called a "system". A system has multiple queries inside it.
public class System
{
public List<Query> Queries;
}
Each query object refers to a "topic", that is, the main argument (e.g. a search for "italy weekend"). It has a list of retrieved documents inside:
public class Query
{
public Topic Topic; // Maintain only a reference to the topic
public List<RetrievedDocument> RetrievedDocuments;
public System System; // Maintain only a reference to the system
}
Each retrieved document has a score and a rank and has a reference to the topic document:
public class RetrievedDocument
{
public string Id;
public int Rank;
public double Score;
public Document Document;
}
Each topic has a collection of documents inside, that can be relevant or not relevant, and a reference to its parent topic:
public class Topic
{
public int Id;
public List<Document> Documents;
public IEnumerable<Document> RelevantDocuments
{
get { return Documents.Where(d => d.IsRelevant); }
}
}
public class Document
{
public string Id;
public bool IsRelevant;
public Topic Topic; // Maintain only a reference to the topic
}
There are 129 systems and 50 main topics (129 * 50 = 6450 query objects), and each query has a different number of retrieved documents, 6284293 in total. I need this hierarchy to do some calculations (average precision, topic ease, system mean average precision, relevancy). This is how TREC works...
If you're reading 6284293 documents and are holding on to them in an object hierarchy, then obviously your application is going to use a fair amount of memory. It is hard to say if you're using more than could be expected, given that we don't know the size of these objects.
Also, remember that the CLR allocates and frees memory on behalf of your application. So even though your application has released memory, this may not be immediately reflected in the process's memory usage. If the application is not leaking, this memory will be reclaimed at some point, but you shouldn't expect to see managed memory usage immediately reflected in process memory usage, as the CLR may hold on to memory to reduce the number of allocations/frees.
Go get the SciTech memory profiler (it has a two-week free trial) and find out.
Watch out for empty lists; they take 40 bytes each.
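If many of your objects end up holding lists that stay empty, one option is to create the list lazily, only when something is actually added. A minimal sketch of that idea, using the Query class from the question (the lazy property is my suggestion, not part of the original code):

public class Query
{
    private List<RetrievedDocument> retrievedDocuments; // stays null until first use

    public List<RetrievedDocument> RetrievedDocuments
    {
        get
        {
            // Allocate only on first access, so queries with no documents pay nothing extra.
            if (retrievedDocuments == null)
                retrievedDocuments = new List<RetrievedDocument>();
            return retrievedDocuments;
        }
    }
}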
It's hard to say what's going on without knowing more about your code, but here's some ideas and suggestions:
Make sure you close files after you finish reading from them.
Make sure you're not maintaining references to objects that are no longer being used.
Look at what data structures you're using. Sometimes there's a more memory-efficient way to arrange your data.
Look at your data types: are you using a long or double in places where a byte would suffice? (See the sketch after this list.)
Every program will use more memory in Debug mode than in Release mode, but the difference should be on the order of single or tens of megabytes, not hundreds. Can you use Task Manager to check how much memory you're using outside of Debug mode?
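To make that data-types point concrete, here is a hedged sketch of how the RetrievedDocument class from the question might be slimmed down. Whether a float score and a short rank are acceptable depends entirely on your data, so treat this as an illustration, not a recommendation:

public class RetrievedDocument
{
    public string Id;        // if ids repeat a lot, consider sharing one string instance per distinct id
    public short Rank;       // 2 bytes instead of 4, if ranks always fit in a short
    public float Score;      // 4 bytes instead of 8, if single precision is enough
    public Document Document;
}

With roughly 6.3 million instances, trimming 6 bytes per object saves on the order of 35-40 MB, before counting object headers and the lists that hold them.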
Related
I have millions of instances of class Data, and I am seeking optimization advice.
Is there a way to optimize it, for example to save memory by serializing it somehow, although that would hurt retrieval speed, which is important too? Maybe turning the class into a struct, though it seems the class is pretty large for a struct.
Queries over these objects can touch hundreds of millions of them at a time. They sit in a list and are queried by DateTime. The results are aggregated in different ways, and many calculations can be applied.
[Serializable]
[DataContract]
public abstract class BaseData {}
[Serializable]
public class Data : BaseData {
public byte member1;
public int member2;
public long member3;
public double member4;
public DateTime member5;
}
Unfortunately, while you did specify that you want to "optimize", you did not specify what exact problem you mean to tackle, so I cannot really give you more than general advice.
Serialization will not help you. Your Data objects are already stored as bytes in memory. Nor will turning it into a struct help. The difference between a struct and a class lies in their addressing and referencing behaviour, not in their memory footprint.
The only way I can think of to reduce the memory footprint of a collection with "hundreds-millions" of these objects would be to serialize and compress the entire thing. But that is not feasible. You would always have to decompress the entire thing before accessing it, which would shoot your performance to hell AND actually almost double the memory consumption on access (compressed and decompressed data both lying in memory at that point).
The best general advice I can give you is not to try to optimize this scenario yourself, but to use specialized software for it. By specialized software, I mean an (in-memory) database. One example of a database you can use in-memory, with mature .NET support available, is SQLite.
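For completeness, here is a minimal sketch of spinning up an in-memory SQLite database from C#. It assumes the Microsoft.Data.Sqlite NuGet package (one of several SQLite providers for .NET), and the table layout is only illustrative:

using Microsoft.Data.Sqlite;

using (var connection = new SqliteConnection("Data Source=:memory:"))
{
    connection.Open();

    // A table shaped roughly like the Data class from the question.
    var create = connection.CreateCommand();
    create.CommandText = "CREATE TABLE Data (Member1 INTEGER, Member2 INTEGER, Member3 INTEGER, Member4 REAL, Member5 TEXT)";
    create.ExecuteNonQuery();

    // Insert and query as usual; the data lives only for the lifetime of the connection.
    var insert = connection.CreateCommand();
    insert.CommandText = "INSERT INTO Data VALUES (1, 2, 3, 4.5, '2013-01-01T00:00:00')";
    insert.ExecuteNonQuery();
}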
I assume, as you seem to imply, that you have a class with many members, have a large number of instances, and need to keep them all in memory at the same time to perform calculations.
I ran a few tests to see if you could actually get different sizes for the classes you described.
I used this simple method for finding the in-memory size of an object:
private static void MeasureMemory()
{
int size = 10000000;
object[] array = new object[size];
long before = GC.GetTotalMemory(true);
for (int i = 0; i < size; i++)
{
array[i] = new Data();
}
long after = GC.GetTotalMemory(true);
double diff = after - before;
Console.WriteLine("Total bytes: " + diff);
Console.WriteLine("Bytes per object: " + diff / size);
}
It may be primitive, but I find that it works fine for situations like this.
As expected, almost nothing you can do to that class (turning it to a struct, removing the inheritance, or the method attributes) influences the memory being used by a single instance. As far as memory usage goes, they are all equivalent. However, do try to fiddle with your actual classes and run them through the given code.
The only way you could actually reduce the memory footprint of an instance would be to use smaller structures for keeping your data (int instead of long, for example). If you have a large number of booleans, you could group them into a byte or integer and have simple property wrappers to work with them (a boolean field takes a byte of memory). These may be insignificant things in most situations, but for a hundred million objects, removing a boolean could make a difference of a hundred MB of memory. Also, be aware that the platform target you choose for your application can have an impact on the memory footprint of an object (x64 builds take up more memory than x86 ones).
Serializing the data is very unlikely to help. An in-memory database has its upsides, especially if you are doing complex queries. However, it is unlikely to actually reduce the memory usage of your data. Unfortunately, there just aren't many ways to reduce the footprint of basic data types. At some point, you will just have to move to a file-based database.
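As an illustration of grouping booleans, here is a hedged sketch of packing several flags into one byte behind property wrappers (the flag names are invented for the example):

public class PackedFlags
{
    private byte flags; // eight booleans in one byte instead of one byte each

    public bool IsActive
    {
        get { return (flags & 0x01) != 0; }
        set { if (value) flags |= 0x01; else flags &= 0xFE; }
    }

    public bool IsDirty
    {
        get { return (flags & 0x02) != 0; }
        set { if (value) flags |= 0x02; else flags &= 0xFD; }
    }
}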
However, here are some ideas. Please be aware that they are hacky, highly conditional, decrease the computation performance and will make the code harder to maintain.
It is often the case in large data structures that objects in different states will have only some properties filled, and the others will be set to null or a default value. If you can identify such groups of properties, perhaps you could move them to a sub-class and have one reference that can be null, instead of several properties taking up space. Then you only instantiate the sub-class once it is needed. You could write property wrappers that hide this from the rest of the code. Keep in mind that the worst-case scenario here would have you keeping all the properties in memory, plus several object headers and pointers.
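A hedged sketch of that idea, borrowing the member names from the question and guessing (purely for illustration) that member3 and member4 are the rarely-filled ones:

public class Data
{
    // Frequently-used members stay inline.
    public byte member1;
    public int member2;
    public DateTime member5;

    // Rarely-filled members live in a detail object; null costs only one reference.
    private RareDetails details;

    public long member3
    {
        get { return details == null ? 0L : details.member3; }
        set { EnsureDetails().member3 = value; }
    }

    public double member4
    {
        get { return details == null ? 0.0 : details.member4; }
        set { EnsureDetails().member4 = value; }
    }

    private RareDetails EnsureDetails()
    {
        if (details == null) details = new RareDetails();
        return details;
    }

    private class RareDetails
    {
        public long member3;
        public double member4;
    }
}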
You could perhaps turn members that are likely to take a default value into binary representations, and then pack them into a byte array. You would know which byte positions represent which data member, and could write properties that could read them. Position the properties that are most likely to have a default value at the end of the byte array (a few longs that are often 0 for example). Then, when creating the object, adjust the byte array size to exclude the properties that have the default value, starting from the end of the list, until you hit the first member that has a non-default value. When the outside code requests a property, you can check if the byte array is large enough to hold that property, and if not, return the default value. This way, you could potentially save some space. Best case, you will have a null pointer to a byte array instead of several data members. Worst case, you will have full byte arrays taking as much space as the original data, plus some overhead for the array. The usefulness depends on the actual data, and assumes that you do relatively few writes, as the re-computation of the array will be expensive.
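And a compact sketch of the byte-array idea, again with an invented layout: the backing array is trimmed past trailing default values, and each property falls back to its default when the array is too short to contain it:

public class PackedData
{
    // Assumed layout for the example: bytes 0-3 = member2 (int), bytes 4-11 = member3 (long).
    private byte[] payload; // may be shorter than 12 bytes, or null if everything is default

    public int member2
    {
        get { return payload != null && payload.Length >= 4 ? BitConverter.ToInt32(payload, 0) : 0; }
    }

    public long member3
    {
        get { return payload != null && payload.Length >= 12 ? BitConverter.ToInt64(payload, 4) : 0L; }
    }
}

Writes would need to grow or rebuild the array, which is where the re-computation cost mentioned above comes in.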
Hope any of this helps :)
I am currently working on a very large legacy application which handles a large amount of string data gathered from various sources (i.e. names, identifiers, common codes relating to the business, etc.). This data alone can take up to 200 MB of RAM in the application process.
A colleague of mine mentioned that one possible strategy to reduce the memory footprint (as a lot of the individual strings are duplicated across the data sets) would be to "cache" the recurring strings in a dictionary and re-use them when required. So for example…
public class StringCacher
{
public readonly Dictionary<string, string> _stringCache;
public StringCacher()
{
_stringCache = new Dictionary<string, string>();
}
public string AddOrReuse(string stringToCache)
{
if (!_stringCache.ContainsKey(stringToCache))
_stringCache[stringToCache] = stringToCache;
return _stringCache[stringToCache];
}
}
Then to use this caching...
public IEnumerable<string> IncomingData()
{
var stringCache = new StringCacher();
var dataList = new List<string>();
// Add the data, a fair amount of the strings will be the same.
dataList.Add(stringCache.AddOrReuse("AAAA"));
dataList.Add(stringCache.AddOrReuse("BBBB"));
dataList.Add(stringCache.AddOrReuse("AAAA"));
dataList.Add(stringCache.AddOrReuse("CCCC"));
dataList.Add(stringCache.AddOrReuse("AAAA"));
return dataList;
}
As strings are immutable, and a lot of internal work is done by the framework to make them behave in a similar way to value types, I'm half thinking that this will just copy each string into the dictionary and double the amount of memory used, rather than just pass around a reference to the string stored in the dictionary (which is what my colleague is assuming).
So taking into account that this will be run on a massive set of string data...
Is this going to save any memory, assuming that 30% of the string values will be used twice or more?
Is the assumption that this will even work correct?
This is essentially what string interning is, except you don't have to worry about how it works. In your example you are still creating a string, then comparing it, then leaving the copy to be disposed of. .NET will do this for you at runtime.
See also String.Intern and Optimizing C# String Performance (C Calvert)
If a new string is created with code like (String goober1 = "foo"; String goober2 = "foo";) shown in lines 18 and 19, then the intern table is checked. If your string is already in there, then both variables will point at the same block of memory maintained by the intern table.
So you don't have to roll your own - it won't really provide any advantage. EDIT: UNLESS your strings don't usually live for as long as your AppDomain - interned strings live for the lifetime of the AppDomain, which is not necessarily great for GC. If you want short-lived strings, then you want a pool. From String.Intern:
If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. ...
EDIT 2: Also see Jon Skeet's SO answer here.
This is already built into .NET; it's called String.Intern, no need to reinvent it.
You can achieve this using built-in .NET functionality.
When you initialise your string, make a call to string.Intern() with your string.
For example:
dataList.Add(string.Intern("AAAA"));
Every subsequent call with the same string will use the same reference in memory. So if you have 1000 AAAAs, only 1 copy of AAAA is stored in memory.
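A small sketch to see the behaviour for yourself; the strings are built at runtime so they don't get the automatic interning that literals do:

string a = new string(new[] { 'A', 'A', 'A', 'A' }); // built at runtime, so not auto-interned
string b = new string(new[] { 'A', 'A', 'A', 'A' });

Console.WriteLine(object.ReferenceEquals(a, b));                               // False: two separate copies
Console.WriteLine(object.ReferenceEquals(string.Intern(a), string.Intern(b))); // True: one shared, interned copy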
I have a class in C# that is storing information on a stack to be used by other pieces of the application later. The information is currently stored in a class (without any methods) and consists of some ints, shorts, floats and a couple of boolean values.
I can be processing 10-40 of these packets every second - potentially more - and the information on the stack will be removed when it's needed by another part of the program; however this isn't guaranteed to occur at any definite interval. The information also has a chance of being incomplete (I'm processing packets from a network connection).
Currently I have this represented as such:
public class PackInfo
{
public bool active;
public float f1;
public float f2;
public int i1;
public int i2;
public int i3;
public int i4;
public short s1;
public short s2;
}
Is there a better way to represent this information? There's no chance of the stack getting too large (most of the information will be cleared if it starts getting too big), but I'm worried that there will be a needless amount of memory overhead involved in creating so many instances of the class to act as little more than a container for this information. Even though this is neither a computationally complex nor a memory-consuming task, I don't see it scaling well should it become either.
This sounds like it would be a good idea to use a generic Queue for storing these. I have the assumption that you're handling these "messages" in order.
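A minimal sketch of that, reusing the PackInfo class from the question:

var pending = new Queue<PackInfo>();

// Producer side: as packets arrive from the network, enqueue them.
pending.Enqueue(new PackInfo { active = true, i1 = 42 });

// Consumer side: drain whatever has accumulated, whenever the rest of the program is ready.
while (pending.Count > 0)
{
    PackInfo next = pending.Dequeue();
    // ... process next ...
}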
As for the overhead of instantiating these classes, I don't think that instantiating 10-40 per second would have any visible impact on performance, the actual processing you do afterwards would likely be a much better candidate for optimization than the cost of instantiation.
Also, I would recommend only optimizing when you can actually measure performance in your application, otherwise you might be wasting your time doing premature optimization.
Currently I am working on a project where I need to bring GBs of data onto a client machine to do some work; the task needs the whole data set, as it does some analysis on the data and helps in the decision-making process.
So the question is: what are the best practices and suitable approaches for managing that much data in memory without hampering the performance of the client machine and the application?
Note: at application load time we can spend time bringing data from the database to the client machine; that's totally acceptable in our case. But once the data is loaded into the application at start-up, performance is very important.
This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following are just some thoughts based on some recent experience we had in a similar scenario. It is, however, a lot of work to change to this type of model, so it also depends on how much you can invest in trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!
Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...
When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggest problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC is going to know that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail here; the graph in that post shows the impact - in particular, those "spikes" are all GC killing performance.
For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:
struct Foo {
private readonly int id;
private readonly double value;
public Foo(int id, double value) {
this.id = id;
this.value = value;
}
public int Id {get{return id;}}
public double Value {get{return value;}}
}
and stored those directly in arrays (not lists):
Foo[] foos = ...
The significance of that is that, because some of these structs are quite big, we didn't want them being copied lots of times on the stack; but with an array you can do:
private void SomeMethod(ref Foo foo) {
if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);
Note that we've passed the struct directly by ref - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference to a Foo here, because Foo is a struct and there is no heap object to reference. What you can do, though, is store the index (into the array). For example:
struct Customer {
... more not shown
public int FooIndex { get { return fooIndex; } }
}
Not quite as convenient as customer.Foo, but the following works nicely:
Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
Key points:
we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
we don't have several-million object headers in memory
we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
but it is a little less convenient to work with, and needs some initial processing when loading
additional notes:
strings are a killer; if you have millions of strings, then that is problematic; at a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern, that would be bad) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents
if you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another myriad of objects and references
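For the fixed-array note, here is a hedged sketch of a fixed-size buffer inside an unsafe struct (the field name and length are invented). The values are stored inline in the struct, so an array of these structs is one contiguous block with no per-instance array objects:

// Requires compiling with unsafe code enabled (/unsafe).
public unsafe struct Readings
{
    public fixed double Values[8]; // 8 doubles stored inline, no separate array allocation

    public double Sum()
    {
        double total = 0;
        fixed (double* p = Values)
        {
            for (int i = 0; i < 8; i++) total += p[i];
        }
        return total;
    }
}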
As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.
Basically I have a program which, when it starts, loads a list of files (as FileInfo) and, for each file in the list, loads an XML document (as an XDocument).
The program then reads data out of it into a container class (storing as IEnumerables), at which point the XDocument goes out of scope.
The program then exports the data from the container class to a database. After the export the container class goes out of scope; however, the garbage collector isn't clearing it up, and because it stores the data as IEnumerables, this seems to lead to the XDocument staying in memory (I'm not sure if this is the reason, but Task Manager shows that the memory from the XDocument isn't being freed).
As the program loops through multiple files, it eventually throws an out-of-memory exception. To mitigate this I've ended up using
System.GC.Collect();
to force the garbage collector to run after the container goes out of scope. This is working, but my questions are:
Is this the right thing to do? (Forcing the garbage collector to run seems a bit odd)
Is there a better way to make sure the XDocument memory is being disposed?
Could there be a different reason, other than the IEnumerable, that the document memory isn't being freed?
Thanks.
Edit: Code Samples:
Container Class:
public IEnumerable<CustomClassOne> CustomClassOne { get; set; }
public IEnumerable<CustomClassTwo> CustomClassTwo { get; set; }
public IEnumerable<CustomClassThree> CustomClassThree { get; set; }
...
public IEnumerable<CustomClassNine> CustomClassNine { get; set; }
Custom Class:
public long VariableOne { get; set; }
public int VariableTwo { get; set; }
public DateTime VariableThree { get; set; }
...
Anyway that's the basic structures really. The Custom Classes are populated through the container class from the XML document. The filled structures themselves use very little memory.
A container class is filled from one XML document, goes out of scope, the next document is then loaded e.g.
public static void ExportAll(IEnumerable<FileInfo> files)
{
foreach (FileInfo file in files)
{
ExportFile(file);
//Temporary to clear memory
System.GC.Collect();
}
}
private static void ExportFile(FileInfo file)
{
ContainerClass containerClass = Reader.ReadXMLDocument(file);
ExportContainerClass(containerClass);
//Export simply dumps the data from the container class into a database
//Container Class (and any passed container classes) goes out of scope at end of export
}
public static ContainerClass ReadXMLDocument(FileInfo fileToRead)
{
XDocument document = GetXDocument(fileToRead);
var containerClass = new ContainerClass();
//ForEach customClass in containerClass
//Read all data for customClass from XDocument
return containerClass;
}
I forgot to mention this bit (not sure if it's relevant): the files can be compressed as .gz, so I have the GetXDocument() method to load them:
private static XDocument GetXDocument(FileInfo fileToRead)
{
XDocument document;
using (FileStream fileStream = new FileStream(fileToRead.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
{
if (String.Equals(fileToRead.Extension, ".gz", StringComparison.OrdinalIgnoreCase))
{
using (GZipStream zipStream = new GZipStream(fileStream, CompressionMode.Decompress))
{
document = XDocument.Load(zipStream);
}
}
else
{
document = XDocument.Load(fileStream);
}
return document;
}
}
Hope this is enough information.
Thanks
Edit: The System.GC.Collect() is not working 100% of the time; sometimes the program seems to retain the XDocument. Does anyone have any idea why this might be?
public static ContainerClass ReadXMLDocument(FileInfo fileToRead)
{
XDocument document = GetXDocument(fileToRead);
var containerClass = new ContainerClass();
//ForEach customClass in containerClass
//Read all data for customClass from XDocument
containerClass.CustomClassOne = document.Descendants(ElementName)
.DescendantsAndSelf(ElementChildName)
.Select(a => ExtractDetails(a));
return containerClass;
}
private static CustomClassOne ExtractDetails(XElement itemElement)
{
var customClassOne = new CustomClassOne();
customClassOne.VariableOne = Int64.Parse(itemElement.Attribute("id").Value.Substring(4));
customClassOne.VariableTwo = int.Parse(itemElement.Element(osgb + "version").Value);
customClassOne.VariableThree = DateTime.ParseExact(itemElement.Element(osgb + "versionDate").Value,
"yyyy-MM-dd", CultureInfo.InvariantCulture);
return customClassOne;
}
Forcing a manual garbage collection might appear to have solved your problem in some cases, but it's a pretty sure bet that this is nothing better than coincidence.
What you need to do is to stop guessing about what is causing your memory pressure problems, and to instead find out for sure.
I've used JetBrains dotTrace to very good effect in similar situations - set a breakpoint, trigger the profiler and browse through a view of all the "live" objects and their relationships. Makes it easy to find which objects are still retained, and by which references they're kept live.
While I haven't used it myself, the RedGate Ants Memory Profiler is also recommended by many.
Both of these tools have free trials, which should be enough to solve your current problem. Though, I'd strongly suggest that it's worth buying one or the other - dotTrace has saved me dozens of hours of troubleshooting memory issues, a very worthwhile ROI.
Your code doesn't look bad to me and I don't see any single reason for forcing collection. If your custom class holds a reference to XElements from the XDocument, then GC will collect neither them nor the document itself. If something else is holding references to your enumerables, then they won't be collected either. So I'd really like to see your custom class definition and how it's populated.
Your inclination about calling GC.Collect is correct. Needing to call this method is an indication that something else is wrong with your code.
However, there are a few things about your statements that make me think your understanding of memory is a little off. Task manager is a very poor indicator of how much memory your program is actually using; a profiler is much better for this task. As far as memory goes, if it can be collected, the GC will collect the memory when it is needed.
Although it's a bit of a wording detail, you ask how "to make sure the XDocument memory is being disposed." Disposed is typically used to refer to manually releasing unmanaged resources, such as database connections or file handles; the GC collects memory.
Now to try to answer the actual question. It is very easy to have references to objects that you do not release, especially when using lambdas and LINQ. Things typed as IEnumerable are especially prone to this as the lazily-evaluated LINQ functions will almost always introduce references to objects you think are otherwise unused. The ReadXMLDocument code that you omitted may be a good place to start looking.
Another possibility is something along the lines of what TomTom suggested: the database classes you are using may be storing objects that you did not expect, for their own reasons.
If the processed XML files are too big (around 500-800 MB) then you cannot use XDocument (or XmlDocument either), because it will try to load the whole document into memory. See this discussion: Does LINQ handle large XML files? Getting OutOfMemoryException
In this case, you should rather use the XStreamingElement class and build your ContainerClass from it.
Maybe going to a 64-bit process would help, but the best practice is always to use streaming from end to end.
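XStreamingElement covers the writing side; for reading, the usual companion is an XmlReader that materialises one element at a time via XNode.ReadFrom, so only the current subtree is ever in memory. A hedged sketch (the element name "Item" is a placeholder; needs System.Xml and System.Xml.Linq):

private static IEnumerable<XElement> StreamElements(string path, string elementName)
{
    using (XmlReader reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // ReadFrom consumes the element and leaves the reader on the following node,
                // so we only call Read() when nothing was consumed.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// Usage: foreach (XElement item in StreamElements(file.FullName, "Item")) { ... }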
This is not really an answer, more an investigation suggestion: if GC.Collect does not help, that pretty much means you are still keeping references to the objects somewhere.
Look for singletons and caches that may keep references.
If you reliably get the exception, or can collect a memory dump, you can use WinDbg + SOS to find out who holds references to the objects: search for "memory leaks sos windbg" for details.
Anyway, use
String.Equals(fileToRead.Extension, ".gz", StringComparison.OrdinalIgnoreCase)
instead of String.Compare().
You can try to force evaluation with ToList():
public static ContainerClass ReadXMLDocument(FileInfo fileToRead)
{
XDocument document = GetXDocument(fileToRead);
var containerClass = new ContainerClass();
//ForEach customClass in containerClass
//Read all data for customClass from XDocument
containerClass.CustomClassOne = document.Descendants(ElementName)
.DescendantsAndSelf(ElementChildName)
.Select(a => ExtractDetails(a)).ToList();
return containerClass;
}
The program then exports the data from the container class to a database. After the export the container class goes out of scope; however, the garbage collector isn't clearing it up, and because it stores the data as IEnumerables, this seems to lead to the XDocument staying in memory (I'm not sure if this is the reason, but Task Manager shows that the memory from the XDocument isn't being freed).
The reason is that LINQ stores EVERY ITEM READ in its own reference pool for the transaction. Basically it does so so that on a re-read it can return the same, unique instance of the item.
Suggestion:
Load only the primary keys into an array. Commit.
Walk over the list and process the items one at a time, committing after each.