I am trying to build some objects and insert them into a database. The number of records that have to be inserted is big ~ millions.
The insert is done in batches.
The problem I am having is that i need to initialize new objects to add them to a list and at the end, i do a bulk insert into the database of the list. Because i am initializing a huge number of objects, my computer memory(RAM) gets filled up and it kinda freezes everything.
The question is :
From a memory point of view, should I initialize objects of set them to null ?
Also, I am trying to work with the same object reference. Am i doing it right ?
Code:
QACompleted completed = new QACompleted();
QAUncompleted uncompleted = new QAUncompleted();
QAText replaced = new QAText();
foreach (QAText question in questions)
{
MatchCollection matchesQ = rgx.Matches(question.Question);
MatchCollection matchesA = rgx.Matches(question.Answer);
foreach (GetKeyValues_Result item in values)
{
hasNull = false;
replaced = new QAText(); <- this object
if (matchesQ.Count > 0)
{
SetQuestion(matchesQ, replaced, question, item);
}
else
{
replaced.Question = question.Question;
}
if (matchesA.Count > 0)
{
SetAnswer(matchesA,replaced,question,item);
}
else
{
replaced.Answer = question.Answer;
}
if (!hasNull)
{
if (matchesA.Count == 0 && matchesQ.Count == 0)
{
completed = new QACompleted(); <- this object
MapEmpty(replaced,completed, question.Id);
}
else
{
completed = new QACompleted(); <- this object
MapCompleted(replaced, completed, question.Id, item);
}
goodResults.Add(completed);
}
else
{
uncompleted = new QAUncompleted(); <- this object
MapUncompleted(replaced,uncompleted,item, question.Id);
badResults.Add(uncompleted);
}
}
var success = InsertIntoDataBase(goodResults, "QACompleted");
var success1 = InsertIntoDataBase(badResults, "QAUncompleted");
}
I have marked the objects. Should I just call them like replaced = NULL, or should i use the constructor ?
What would be the difference between new QAText() and = null ?
The memory cost of creating objects
Creating objects in C# will always have a memory cost. This relates to the memory layout of object. Assuming you are using 64 bit OS, the runtime has to allocate an extra 8 bytes for sync block, and 8 bytes for method table pointer. After the sync block and method table pointer are your customized data fields. Besides the inevitable 16 bytes header, objects are always aligned to the boundary of 8 bytes and therefore can incur extra overhead.
You can roughly estimate the memory overhead if you know exactly what is the number of objects you create. However I would suggest you be careful when assuming that your memory pressure is coming from object layout overhead. This is also the reason I suggest you estimate the overhead as the first step. You might end up realizing that even if the layout overhead can magically be completely removed, you are not going to make a huge difference in terms of memory performance. After all, for a million objects, the overhead of object header is only 16 MB.
The difference between replaced = new QAText() and replaced = null
I suppose after you set replaced to null you still have to create another QAText()? If so, memory-wise there is no real difference to the garbage collector. The old QAText instance will be collected either way if you are not making any other reference to it. When to collect the instance, however, is the call of garbage collector. Doing replaced = null will not make the GC happen earlier.
You can try to reuse the same QAText instance instead of creating a new one every time. But creating a new one every time will not result in high memory pressure. It will make the GC a little busier therefore result in a higher CPU usage.
Identify the real cause for high memory usage
If your application is really using a lot of memory, you have to look at the design of your QACompleted and QAUncompleted objects. Those are the objects added to the list and occupy memory until you submit them to the database. If those objects are designed well(they are only taking the memory they have to take), as Peter pointed out you should use a smaller batch size so you don't have to keep too many of them in memory.
There are other factors in your program that can possible cause unexpected memory usage. What is the data structure for goodResults and badResults? Are they List or LinkedList? List internally is nothing but a dynamic array. It uses a grow policy which will always double its size when it is full. The always-double policy can eat up memory quickly especially when you have a lot of entries.
LinkedList, on the other side, does not suffer from the above-mentioned problem. But every single node requires roughly 40 extra bytes.
It also worth-checking what MapCompleted and MapUnCompleted methods are doing. Are they making long-lived reference to replaced object? If so it will cause a memory leak.
As a summary, when dealing with memory problems, you should focus on macro-scope issues such as the choice of data structures, or memory leaks. Or optimize your algorithms so that you don't have to keep all the data in memory all the time.
Instantiating new (albeit empty) object always takes some memory, as it has to allocate space for the object's fields. If you aren't going to access or set any data in the instance, I see no point in creating it.
It's unfortunate that the code example is not written better. There seem to be lots of declarations left out, and undocumented side-effects in the code. This makes it very hard to offer specific advice.
That said…
Your replaced object does not appear to be retained beyond one iteration of the loop, so it's not part of the problem. The completed and uncompleted objects are added to lists, so they do add to your memory consumption. Likewise the goodResults and badResults lists themselves (where are the declarations for those?).
If you are using a computer with too little RAM, then yes...you'll run into performance issues as Windows uses the disk to make up for the lack of RAM. And even with enough RAM, at some point you could run into .NET's limitations with respect to object size (i.e. you can only put so many elements into a list). So one way or the other, you seem to need to reduce your peak memory usage.
You stated that when the data in the lists is inserted into the database, the lists are cleared. So presumably that means that there are so many elements in the values list (one of the undeclared, undocumented variables in your code example) that the lists and their objects get too large before getting to the end of the inner loop and inserting the data into the database.
In that case, then it seems likely the simplest way to address the issue is to submit the updates in batches inside the inner foreach loop. E.g. at the end of that loop, add something like this:
if (goodResults.Count >= 100000)
{
var success = InsertIntoDataBase(goodResults, "QACompleted");
}
if (badResults.Count >= 100000)
{
var success = InsertIntoDataBase(badResults, "QACompleted");
}
(Declaring the actual cut-off as a named constant of course, and handling the database insert result return value as appropriate).
Of course, you would still do the insert at the end of the outer loop too.
Related
Context: I'm using Unity3D's IMGUI where OnGUI{} method is being called/update VERY often (few times per frame) to keep GUI content relevant. I need to iterate through Dictionary with data and display said data, but because I also will be making changes to Dictionary content (Add/Remove) I have to iterate through separate List/Array/whatever.
So in other words right now I have:
foreach (string line in new List<string>(this.myDic.Keys))
{
//fill GUI
//edit Dictionary content if needed
}
The problem here is that it allocates short-lived List multiple time per frame, thousands and thousands times per second and insane amount in general, producing GC. What I want is to avoid this allocation by reusing the same List I initialize at the start. However, another issue came up:
tempList.Clear();
foreach (KeyValuePair<string,string> pair in myDic)
{
tempList.Add(pair.key)
}
var j = tempList.Count;
for (int i = 0; i < j; i++)
{
//fill GUI
//edit Dictionary content if needed
}
As you can see now I basically have two loops, both processing same amount of data. Which leads me to the question: is it moot point here trying to optimize the allocation issue here with reusable List? Or may be even if it looks scary the double-loop variant still better solution?
P.S. Yes, I know, best option would be switch from IMGUI but right now I'm kinda limited to it.
First of all, the only object you allocate by calling new List<string>(this.myDic.Keys) (or this.myDic.Keys.ToArray() alternatively) is an array containing references to already existing objects (strings in your case). So, the GC collects only one object when the scope ends.
The size of that object is about equal to objectCount*referenceSize. The reference size depends on the selected platform: 32bit or 64bit.
Speaking formally, you can save some memory traffic by reusing an existing list, but I guess it's not worth it.
Anyway, if you're up to do that, please note that your approach to refilling the list isn't optimal. I suggest you using .AddRange (it internally uses Array.Copy).
tempList.Clear();
tempList.AddRange(myDic.Keys)
foreach (key in tempList) { ... } //replace with 'for' if you also want to avoid allocation of an iterator
Most likely it's a premature optimization, please do some performance benchmark to test if it makes any sense for your application.
I've got a few global arrays I use in a simple WinForms game. The arrays are initialized when a new game starts. When a player is in the middle of the game (the arrays are filled with data) he clicks on the StartNewGame() button (restarts the game). What to do next?
Is it ok to reinitialize the whole array for the new game or should I just set every array item to null and use the already initialized array (which would be slower)?
I mean is it okay to do something like this?
MyClass[,] gameObjects;
public Form1()
{
StartNewGame();
// game flow .. simplified here .. normally devided in functions and events..
StartNewGame();
// other game flow
}
public StartNewGame()
{
gameObjects = new MyClass[10,10];
// some work with gameObjects
}
This almost entirely depends upon MyClass, specifically how many data members it contains, how much processing does its constructor (and members' constructors) require and whether it is a relatively simply operation to (re)set an object of this class to "initialized" state. A more objective answer can be obtained through benchmarking.
From you question, I understand that there are not so many array's - in that case I would say, reinitialize the whole array
In cases you have a lot of work that can take 30 sec to set up maybe you do clean up instead of reinitializing everything.
If you choose to place null, you can jet some ugly exception , so I think you rather clean the object inside the array rather then set them to null
If there are only 100 elements as in your example, then there shouldn't really be a noticeable performance hit.
If you reinitialize the array, you will perform n constructions for n objects. The garbage collector will come clean up the old array and de-allocate those old n objects at some later time. (So you have n allocations upfront, and n deallocations by the GC).
If you set each pointer in the array to null, the garbage collector will still do the same amount of work and come clean up those n objects at some later time. The only difference is you're not deallocating the array here, but that single deallocation is negligible.
From my point of view, the best way to achieve performance in this case is to not reallocate the objects at all, but to use the same ones. Add a valid bit to mark whether or not an object is valid (in use), and to reinitialize you simply set all the valid bits to false. In a similar fashion, programs don't go through and write 0's to all your memory when it's not in use. They just leave it as garbage and overwrite data as necessary.
But again, if your number of objects isn't going into the thousands, I'd say you really won't notice the performance hit.
gameObjects = new MyClass[10,10];
... is the way to go. This is definitely faster than looping through the array and setting the items to null. It is also simpler to code and to understand. But both variants are very fast in anyway, unless you have tens of millions of entries! '[10, 10]' is very small, so forget about performance and do what seems more appropriate and more understandable to you. A clean coding is more important than performance in most cases.
i am putting 2 very large datasets into memory, performing a join to filter out a subset from the first collection and then attempting to destroy the second collection as it uses approximately 600MB of my system's RAM. The problem is that the code below is not working. After the code below runs, a foreach loop runs and takes about 15 mins. During this time the memory does NOT reduce from 600MB+. Am i doing something wrong?
List<APPLES> tmpApples = dataContext.Apples.ToList(); // 100MB
List<ORANGES> tmpOranges = dataContext.Oranges.ToList(); // 600MB
List<APPLES> filteredApples = tmpApples
.Join(tmpOranges, apples => apples.Id, oranges => oranges.Id, (apples, oranges) => apples).ToList();
tmpOranges.Clear();
tmpOranges = null;
GC.Collect();
Note i re-use tmpApples later so i am not clearing it just now..
A few things to note:
Unless your dataContext can be cleared / garbage collected, that may well be retaining references to a lot of objects
Calling Clear() and then setting the variable to null is pointless, if you're really not doing anything else with the list. The GC can tell when you're not using a variable any more, in almost all cases.
Presumably you're judging how much memory the process has reserved; I don't think the CLR will actually return memory to the operating system, but the memory which has been freed by garbage collection will be available to further uses within the CLR. (EDIT: As per comments below, it's possible that the CLR frees areas of the Large Object Heap, but I don't know for sure.)
Clearing, nullifying and collecting hardly ever has any (positive) effect. The GC will automatically detect when objects are not referenced anymore. Further more, As long as the Join operation runs, both the tmpApples and tmpOranges collections are referenced and with it all their objects. They can therefore not be collected.
A better solution would be to do the filter in the database:
// NOTE That I removed the ToList operations
IQueryable<APPLE> tmpApples = dataContext.Apples;
IQueryable<ORANGE> tmpOranges = dataContext.Oranges;
List<APPLES> filteredApples = tmpApples
.Join(tmpOranges, apples => apples.Id,
oranges => oranges.Id, (apples, oranges) => apples)
.ToList();
The reason this data is not collected back is because although you are clearing the collection (hence collection does not have a reference to items anymore),DataContext keeps a reference and this causes it to stay in memory.
You have to dispose your DataContext as soon as you are done.
UPDATE
OK, you probably have fallen victim to large object issue.
Assuming this as Large Object Heap issue you could try to not retrieve all apples at once but instead get them in "packets". So instead of calling
List<APPLE> apples = dataContext.Apples.ToList()
instead try to store the apples in separate lists
int packetSize = 100;
List<APPLE> applePacket1 = dataContext.Apples.Take(packetSize);
List<APPLE> applePacket2 = dataContext.Applies.Skip(packetSize).Take(packetSize);
Does that help?
Use some profiler tools or SOS.dll to find out, where your memory belongs to. If some operations take TOO much time, this sounds like you are swapping out to page file.
EDIT: Also keep in mind, the Debug version will delay the collection of local variables which are not referenced anymore for easier investigation.
The only thing you're doing wrong is explicitly calling the Garbage collector. You don't need to do this (in fact you shouldn't) and as Steven says you don't need to do anything to the collections anyway they'll just go away - eventually.
If you're concern is the performance of the 15 minute foreach loop perhaps it is that loop which you should post. It is probably not related to the memory usage.
I'm dealing with an SDK that keeps references to every object it creates, as long as the main connection object is in scope. Creating a new connection object periodically results in other resource issues, and is not an option.
To do what I need to do, I must iterate through thousands of these objects (almost 100,000), and while I certainly don't keep references to these objects, the object model in the SDK I'm using does. This chews through memory and is dangerously close to causing OutOfMemoryExceptions.
These objects are stored in nested ReadOnlyCollections, so what I'm trying now, is to use reflection to set some of these collections to null when I'm done with them, so the garbage collector can harvest the used memory.
foreach (Build build in builds)
{
BinaryFileCollection numBinaries = build.GetBinaries();
foreach (BinaryFile binary in numBinaries)
{
this.CoveredBlocks += binary.HitBlockCount;
this.TotalBlocks += binary.BlockCount;
this.CoveredArcs += binary.HitArcCount;
this.TotalArcs += binary.ArcCount;
if (binary.HitBlockCount > 0)
{
this.CoveredSourceFiles++;
}
this.TotalSourceFiles++;
foreach (Class coverageClass in binary.GetClasses())
{
if (coverageClass.HitBlockCount > 0)
{
this.CoveredClasses++;
}
this.TotalClasses++;
foreach (Function function in coverageClass.GetFunctions())
{
if (function.HitBlockCount > 0)
{
this.CoveredFunctions++;
}
this.TotalFunctions++;
}
}
FieldInfo fi = typeof(BinaryFile).GetField("classes", BindingFlags.NonPublic | BindingFlags.Instance);
fi.SetValue(binary, null);
}
When I check the values of the classes member in numBinaries[0], it returns null, which seems like mission accomplished, but when I run this code the memory consumption just keeps going up and up, just as fast as when I don't set classes to null at all.
What I'm trying to figure out is whether there's something intrinsically flawed in this approach, or if there's another object keeping references to the classes ReadOnlyCollection that I'm missing.
I can think of a few alternatives...
Logically split it out. You mentioned it keeps all the references "for the duration of a connection". Can you do 10%, close it, open a new one, skip that 10%, take another 10% (total 20%), etc?
How much memory are we talking about here, and is this tool going to be something that is long-lived? So what if it uses a lot of RAM for a few minutes? Are you actually getting OOMs? If your system has that much available RAM for the program to use, why not use it? You paid for the RAM. This reminds me of one of Raymond Chen's blog posts about 100% CPU consumption.
If you really want to see what is keeping something from getting garbage collected, firing up SOS and using !gcroot is a place to start.
But despite all of that, if this really is a problem, I would spend more time with the 3rd party API provider - at some point they may release an update you want that breaks this - and you'll be back to square one, or worse you can introduce subtle bugs in the product.
EDIT: Problem wasn't related to the question. It was indeed something wrong with my code, and actually, it was so simple that I don't want to put it on the internet. Thanks anyway.
I read in roughly 550k Active directory records and store them in a List, the class being a simple wrapper for an AD user. I then split the list of ADRecords into four lists, each containing a quarter of the total. After I do this, I read in about 400k records from a database, known as EDR records, into a DataTable. I take the four quarters of my list and spawn four threads, passing each one of the four quarters. I have to match the AD records to the EDR records using email right now, but we plan to add more things to match on later.
I have a foreach on the list of AD records, and inside of that, I have to run a for loop on the EDR records to check each one, because if an AD record matches more than one EDR record, then that isn't a direct match, and should not be treated as a direct match.
My problem, by the time I get to this foreach on the list, my ADRecords list only has about 130 records in it, but right after I pull them all in, I Console.WriteLine the count, and it's 544k.
I am starting to think that even though I haven't set the list to null to be collected later, C# or Windows or something is actually taking my list away to make room for the EDR records because I haven't used the list in a while. The database that I have to use to read EDR records is a linked server, so it takes about 10 minutes to read them all in, so my list is actually idle for 10 minutes, but it's never set to null.
Any ideas?
//splitting list and passing in values to threads.
List<ADRecord> adRecords = GetAllADRecords();
for (int i = 0; i < adRecords.Count/4; i++)
{
firstQuarter.Add(adRecords[i]);
}
for (int i = adRecords.Count/4; i < adRecords.Count/2; i++)
{
secondQuarter.Add(adRecords[i]);
}
for (int i = adRecords.Count/2; i < (adRecords.Count/4)*3; i++)
{
thirdQuarter.Add(adRecords[i]);
}
for (int i = (adRecords.Count/4)*3; i < adRecords.Count; i++)
{
fourthQuarter.Add(adRecords[i]);
}
DataTable edrRecordsTable = GetAllEDRRecords();
DataRow[] edrRecords = edrRecordsTable.Select("Email_Address is not null and Email_Address <> ''", "Email_Address");
Dictionary<string, int> letterPlaces = FindLetterPlaces(edrRecords);
Thread one = new Thread(delegate() { ProcessMatches(firstQuarter, edrRecords, letterPlaces); });
Thread two = new Thread(delegate() { ProcessMatches(secondQuarter, edrRecords, letterPlaces); });
Thread three = new Thread(delegate() { ProcessMatches(thirdQuarter, edrRecords, letterPlaces); });
Thread four = new Thread(delegate() { ProcessMatches(fourthQuarter, edrRecords, letterPlaces); });
one.Start();
two.Start();
three.Start();
four.Start();
In ProcessMatches, there is a foreach on the List of ADRecords passed in. The first line in the foreach is AdRecordsProcessed++; which is a global static int, and the program finishes with it at 130 instead of the 544k.
The variable is never set to null and is still in scope? If so, it shouldn't be collected and idle time isn't your problem.
First issue I see is:
AdRecordsProcessed++;
Are you locking that global variable before updating it? If not, and depending on how fast the records are processed, it's going to be lower than you expect.
Try running it from a single thread (i.e. pass in adRecords instead of firstQuarter and don't start the other threads.) Does it work as expected with 1 thread?
Firstly, you don't set a list to null. What you might do is set every reference to a list to null (or to another list), or all such references might simply fall out of scope. This may seem like a nitpick point, but if you are having to examine what is happening to your data it's time to be nitpicky on such things.
Secondly, getting the GC to deallocate something that has a live reference is pretty hard to do. You can fake it with a WeakReference<> or think you've found it when you hit a bug in a finaliser (because the reference isn't actually live, and even then its a matter of the finaliser trying to deal with a finalised rather than deallocated object). Bugs can happen everywhere, but that you've found a way to make the GC deallocate something that is live is highly unlikely.
The GC will be likely do two things with your list:
It is quite likely to compact the memory used by it, which will move its component items around.
It is quite likely to promote it to a higher generation.
Neither of these are going to have any changes you will detect unless you actually look for them (obviously you'll notice a change in generation if you keep calling GetGeneration(), but aside from that you aren't really going to).
The memory used could also be paged out, but it will be paged back in when you go to use the objects. Again, no effect you will notice.
Finally, if the GC did deallocate something, you wouldn't have a reduced number of items, you'd have a crash, because if objects just got deallocated the system will still try to use the supposedly live references to them.
So, while the GC or the OS may do something to make room for your other object, it isn't something observable in code, and it does not stop the object from being available and in the same programmatic state.
Something else is the problem.
Is there a reason you have to get all the data all at once? If you break the data up into chunks it should be more manageable. All I know is having to get into GC stuff is a little smelly. Best to look at refactoring your code.
The garbage collector will not collect:
A global variable
Objects managed by static objects
A local variable
A variable referencable by any method on the call stack
So if you can reference it from your code, there is no possibility that the garbage collector collected it. No way, no how.
In order for the collector to collect it, all references to it must have gone away. And if you can see it, that's most definitely not the case.