i am putting 2 very large datasets into memory, performing a join to filter out a subset from the first collection and then attempting to destroy the second collection as it uses approximately 600MB of my system's RAM. The problem is that the code below is not working. After the code below runs, a foreach loop runs and takes about 15 mins. During this time the memory does NOT reduce from 600MB+. Am i doing something wrong?
List<APPLES> tmpApples = dataContext.Apples.ToList(); // 100MB
List<ORANGES> tmpOranges = dataContext.Oranges.ToList(); // 600MB
List<APPLES> filteredApples = tmpApples
.Join(tmpOranges, apples => apples.Id, oranges => oranges.Id, (apples, oranges) => apples).ToList();
tmpOranges.Clear();
tmpOranges = null;
GC.Collect();
Note i re-use tmpApples later so i am not clearing it just now..
A few things to note:
Unless your dataContext can be cleared / garbage collected, that may well be retaining references to a lot of objects
Calling Clear() and then setting the variable to null is pointless, if you're really not doing anything else with the list. The GC can tell when you're not using a variable any more, in almost all cases.
Presumably you're judging how much memory the process has reserved; I don't think the CLR will actually return memory to the operating system, but the memory which has been freed by garbage collection will be available to further uses within the CLR. (EDIT: As per comments below, it's possible that the CLR frees areas of the Large Object Heap, but I don't know for sure.)
Clearing, nullifying and collecting hardly ever has any (positive) effect. The GC will automatically detect when objects are not referenced anymore. Further more, As long as the Join operation runs, both the tmpApples and tmpOranges collections are referenced and with it all their objects. They can therefore not be collected.
A better solution would be to do the filter in the database:
// NOTE That I removed the ToList operations
IQueryable<APPLE> tmpApples = dataContext.Apples;
IQueryable<ORANGE> tmpOranges = dataContext.Oranges;
List<APPLES> filteredApples = tmpApples
.Join(tmpOranges, apples => apples.Id,
oranges => oranges.Id, (apples, oranges) => apples)
.ToList();
The reason this data is not collected back is because although you are clearing the collection (hence collection does not have a reference to items anymore),DataContext keeps a reference and this causes it to stay in memory.
You have to dispose your DataContext as soon as you are done.
UPDATE
OK, you probably have fallen victim to large object issue.
Assuming this as Large Object Heap issue you could try to not retrieve all apples at once but instead get them in "packets". So instead of calling
List<APPLE> apples = dataContext.Apples.ToList()
instead try to store the apples in separate lists
int packetSize = 100;
List<APPLE> applePacket1 = dataContext.Apples.Take(packetSize);
List<APPLE> applePacket2 = dataContext.Applies.Skip(packetSize).Take(packetSize);
Does that help?
Use some profiler tools or SOS.dll to find out, where your memory belongs to. If some operations take TOO much time, this sounds like you are swapping out to page file.
EDIT: Also keep in mind, the Debug version will delay the collection of local variables which are not referenced anymore for easier investigation.
The only thing you're doing wrong is explicitly calling the Garbage collector. You don't need to do this (in fact you shouldn't) and as Steven says you don't need to do anything to the collections anyway they'll just go away - eventually.
If you're concern is the performance of the 15 minute foreach loop perhaps it is that loop which you should post. It is probably not related to the memory usage.
Related
In my game I can use a list of game objects or tags to iterate but i prefer knows what is the most efficient way.
Save more memory using tags or unity requires many resources to do a search by tag?
public List<City> _Citys = new List<City>();
or
foreach(GameObject go in GameObject.FindGameObjectsWithTag("City"))
You're better of using a List of City objects and doing a standard for loop to iterate over the 'City' objects. The List just simply holds references to the 'City' objects, so impact on memory should be minimal - you could use an array of GameObjects[] instead of a List (which is what FindGameObjectsWithTag returns).
It's better for performance to use a populated List/Array rather than searching by Tags and of course you're explicitly pointing to an object rather than using 'magic' strings -- if you change the tag name later on then the FindGameObjectsWithTag method will silently break, as it will no longer find any objects.
Also, avoid using a foreach loop in Unity as this unfortunately creates a lot of garbage (the garbage collector in Unity isn't great so it's best to create as little garbage as possbile), instead just use a standard for loop:
Replace the “foreach” loops with simple “for” loops. For some reason, every iteration of every “foreach” loop generated 24 Bytes of garbage memory. A simple loop iterating 10 times left 240 Bytes of memory ready to be collected which was just unacceptable
EDIT: As mentioned in pid's answer - measure. You can use the built-in Unity profiler to inspect memory usage: http://docs.unity3d.com/Manual/ProfilerMemory.html
Per Microsoft's C# API rules, verbs such as Find* or Count* denote active code while terms such as Length stand for actual values that require no code execution.
Now, if the Unity3D folks respected those guidelines is a matter of debate, but from the name of the method I can already tell that it has a cost and should not be taken too lightly.
On the other side, your question is about performance, not correctness. Both ways are correct per se, but one is supposed to have better performance.
So, the main rule of refactoring for performance is: MEASURE.
It depends on memory allocation and garbage collection, it is impossible to tell which really is faster without measuring.
So the best advice I could give you is pretty general. Whenever you feel the need to enhance performance of code you have to actually measure what you are about to improve, before and after.
Your code examples are 2 distinctly different things. One is instantiating a list, and one is enumerating over an IEnumerable returned from a function call.
I assume you mean the difference between iterating over your declared list vs iterating over the return value from GameObject.FindObjectsWithTag() in which case;
Storing a List as a member variable in your class, populating it once and then iterating over it several times is more efficient than iterating over GameObject.FindObjectsWithTag several times.
This is because you keep your List and your references to the objects in your list at all times without having to repopulate it.
GameObject.FindObjectsWithTag will search your entire object hierarchy and compile a list of all the objects that it finds that matches your search criteria. This is done every time you call the function, so there is additional overhead even if the amount of objects it finds is the same as it still searches your hierarchy.
To be honest, you could just cache your results with a List object using GameObject.FindObjectWithTag providing the amount of objects returned will not change. (As in to say you are not instantiating or destroying any of those objects)
I am trying to build some objects and insert them into a database. The number of records that have to be inserted is big ~ millions.
The insert is done in batches.
The problem I am having is that i need to initialize new objects to add them to a list and at the end, i do a bulk insert into the database of the list. Because i am initializing a huge number of objects, my computer memory(RAM) gets filled up and it kinda freezes everything.
The question is :
From a memory point of view, should I initialize objects of set them to null ?
Also, I am trying to work with the same object reference. Am i doing it right ?
Code:
QACompleted completed = new QACompleted();
QAUncompleted uncompleted = new QAUncompleted();
QAText replaced = new QAText();
foreach (QAText question in questions)
{
MatchCollection matchesQ = rgx.Matches(question.Question);
MatchCollection matchesA = rgx.Matches(question.Answer);
foreach (GetKeyValues_Result item in values)
{
hasNull = false;
replaced = new QAText(); <- this object
if (matchesQ.Count > 0)
{
SetQuestion(matchesQ, replaced, question, item);
}
else
{
replaced.Question = question.Question;
}
if (matchesA.Count > 0)
{
SetAnswer(matchesA,replaced,question,item);
}
else
{
replaced.Answer = question.Answer;
}
if (!hasNull)
{
if (matchesA.Count == 0 && matchesQ.Count == 0)
{
completed = new QACompleted(); <- this object
MapEmpty(replaced,completed, question.Id);
}
else
{
completed = new QACompleted(); <- this object
MapCompleted(replaced, completed, question.Id, item);
}
goodResults.Add(completed);
}
else
{
uncompleted = new QAUncompleted(); <- this object
MapUncompleted(replaced,uncompleted,item, question.Id);
badResults.Add(uncompleted);
}
}
var success = InsertIntoDataBase(goodResults, "QACompleted");
var success1 = InsertIntoDataBase(badResults, "QAUncompleted");
}
I have marked the objects. Should I just call them like replaced = NULL, or should i use the constructor ?
What would be the difference between new QAText() and = null ?
The memory cost of creating objects
Creating objects in C# will always have a memory cost. This relates to the memory layout of object. Assuming you are using 64 bit OS, the runtime has to allocate an extra 8 bytes for sync block, and 8 bytes for method table pointer. After the sync block and method table pointer are your customized data fields. Besides the inevitable 16 bytes header, objects are always aligned to the boundary of 8 bytes and therefore can incur extra overhead.
You can roughly estimate the memory overhead if you know exactly what is the number of objects you create. However I would suggest you be careful when assuming that your memory pressure is coming from object layout overhead. This is also the reason I suggest you estimate the overhead as the first step. You might end up realizing that even if the layout overhead can magically be completely removed, you are not going to make a huge difference in terms of memory performance. After all, for a million objects, the overhead of object header is only 16 MB.
The difference between replaced = new QAText() and replaced = null
I suppose after you set replaced to null you still have to create another QAText()? If so, memory-wise there is no real difference to the garbage collector. The old QAText instance will be collected either way if you are not making any other reference to it. When to collect the instance, however, is the call of garbage collector. Doing replaced = null will not make the GC happen earlier.
You can try to reuse the same QAText instance instead of creating a new one every time. But creating a new one every time will not result in high memory pressure. It will make the GC a little busier therefore result in a higher CPU usage.
Identify the real cause for high memory usage
If your application is really using a lot of memory, you have to look at the design of your QACompleted and QAUncompleted objects. Those are the objects added to the list and occupy memory until you submit them to the database. If those objects are designed well(they are only taking the memory they have to take), as Peter pointed out you should use a smaller batch size so you don't have to keep too many of them in memory.
There are other factors in your program that can possible cause unexpected memory usage. What is the data structure for goodResults and badResults? Are they List or LinkedList? List internally is nothing but a dynamic array. It uses a grow policy which will always double its size when it is full. The always-double policy can eat up memory quickly especially when you have a lot of entries.
LinkedList, on the other side, does not suffer from the above-mentioned problem. But every single node requires roughly 40 extra bytes.
It also worth-checking what MapCompleted and MapUnCompleted methods are doing. Are they making long-lived reference to replaced object? If so it will cause a memory leak.
As a summary, when dealing with memory problems, you should focus on macro-scope issues such as the choice of data structures, or memory leaks. Or optimize your algorithms so that you don't have to keep all the data in memory all the time.
Instantiating new (albeit empty) object always takes some memory, as it has to allocate space for the object's fields. If you aren't going to access or set any data in the instance, I see no point in creating it.
It's unfortunate that the code example is not written better. There seem to be lots of declarations left out, and undocumented side-effects in the code. This makes it very hard to offer specific advice.
That said…
Your replaced object does not appear to be retained beyond one iteration of the loop, so it's not part of the problem. The completed and uncompleted objects are added to lists, so they do add to your memory consumption. Likewise the goodResults and badResults lists themselves (where are the declarations for those?).
If you are using a computer with too little RAM, then yes...you'll run into performance issues as Windows uses the disk to make up for the lack of RAM. And even with enough RAM, at some point you could run into .NET's limitations with respect to object size (i.e. you can only put so many elements into a list). So one way or the other, you seem to need to reduce your peak memory usage.
You stated that when the data in the lists is inserted into the database, the lists are cleared. So presumably that means that there are so many elements in the values list (one of the undeclared, undocumented variables in your code example) that the lists and their objects get too large before getting to the end of the inner loop and inserting the data into the database.
In that case, then it seems likely the simplest way to address the issue is to submit the updates in batches inside the inner foreach loop. E.g. at the end of that loop, add something like this:
if (goodResults.Count >= 100000)
{
var success = InsertIntoDataBase(goodResults, "QACompleted");
}
if (badResults.Count >= 100000)
{
var success = InsertIntoDataBase(badResults, "QACompleted");
}
(Declaring the actual cut-off as a named constant of course, and handling the database insert result return value as appropriate).
Of course, you would still do the insert at the end of the outer loop too.
I've got a few global arrays I use in a simple WinForms game. The arrays are initialized when a new game starts. When a player is in the middle of the game (the arrays are filled with data) he clicks on the StartNewGame() button (restarts the game). What to do next?
Is it ok to reinitialize the whole array for the new game or should I just set every array item to null and use the already initialized array (which would be slower)?
I mean is it okay to do something like this?
MyClass[,] gameObjects;
public Form1()
{
StartNewGame();
// game flow .. simplified here .. normally devided in functions and events..
StartNewGame();
// other game flow
}
public StartNewGame()
{
gameObjects = new MyClass[10,10];
// some work with gameObjects
}
This almost entirely depends upon MyClass, specifically how many data members it contains, how much processing does its constructor (and members' constructors) require and whether it is a relatively simply operation to (re)set an object of this class to "initialized" state. A more objective answer can be obtained through benchmarking.
From you question, I understand that there are not so many array's - in that case I would say, reinitialize the whole array
In cases you have a lot of work that can take 30 sec to set up maybe you do clean up instead of reinitializing everything.
If you choose to place null, you can jet some ugly exception , so I think you rather clean the object inside the array rather then set them to null
If there are only 100 elements as in your example, then there shouldn't really be a noticeable performance hit.
If you reinitialize the array, you will perform n constructions for n objects. The garbage collector will come clean up the old array and de-allocate those old n objects at some later time. (So you have n allocations upfront, and n deallocations by the GC).
If you set each pointer in the array to null, the garbage collector will still do the same amount of work and come clean up those n objects at some later time. The only difference is you're not deallocating the array here, but that single deallocation is negligible.
From my point of view, the best way to achieve performance in this case is to not reallocate the objects at all, but to use the same ones. Add a valid bit to mark whether or not an object is valid (in use), and to reinitialize you simply set all the valid bits to false. In a similar fashion, programs don't go through and write 0's to all your memory when it's not in use. They just leave it as garbage and overwrite data as necessary.
But again, if your number of objects isn't going into the thousands, I'd say you really won't notice the performance hit.
gameObjects = new MyClass[10,10];
... is the way to go. This is definitely faster than looping through the array and setting the items to null. It is also simpler to code and to understand. But both variants are very fast in anyway, unless you have tens of millions of entries! '[10, 10]' is very small, so forget about performance and do what seems more appropriate and more understandable to you. A clean coding is more important than performance in most cases.
My code has to generate millions object to perform some algorithm (millions objects will be created and at the same time 2/3 of them should be destroyed).
I know that object creation causes performance problems.
Could someone recommend how to manage so huge amount of objects, garbage collection and so on?
Thank you.
Elaborating a bit on my "make them a value type" comment above.
If you have a struct Foo, then preparing for the algorithm with e.g. var storage = new Foo[1000000] will only allocate one big block of memory (I 'm assuming the required amount of contiguous memory will be available).
You can then manually manage the memory inside that block to avoid performing more memory allocations:
Keep a count of how many slots in the array are actually used
To "create" a new Foo, put it at the first unused slot and increment the counter
To "delete" a Foo, swap it with the one in last used slot and decrement the counter
Of course making an algorithm work with value types vs reference types is not as simple as changing class to struct. But if workable it will allow you to side-step all of this overhead for an one-time startup cost.
If it is possible in your algorithm then try to reuse objects - if 2/3 are destroyed immedietly then you can try to use them again.
You can implement IDisposable interface on the type whose object is been created. Then you can implment using keyword and write whatever logic involving the object within the using scope. The following links will give you a fair idea of what i am trying to say. Hope they are of some help.
http://www.codeguru.com/csharp/csharp/cs_syntax/interfaces/article.php/c8679
Am I implementing IDisposable correctly?
Regards,
Samar
EDIT: Problem wasn't related to the question. It was indeed something wrong with my code, and actually, it was so simple that I don't want to put it on the internet. Thanks anyway.
I read in roughly 550k Active directory records and store them in a List, the class being a simple wrapper for an AD user. I then split the list of ADRecords into four lists, each containing a quarter of the total. After I do this, I read in about 400k records from a database, known as EDR records, into a DataTable. I take the four quarters of my list and spawn four threads, passing each one of the four quarters. I have to match the AD records to the EDR records using email right now, but we plan to add more things to match on later.
I have a foreach on the list of AD records, and inside of that, I have to run a for loop on the EDR records to check each one, because if an AD record matches more than one EDR record, then that isn't a direct match, and should not be treated as a direct match.
My problem, by the time I get to this foreach on the list, my ADRecords list only has about 130 records in it, but right after I pull them all in, I Console.WriteLine the count, and it's 544k.
I am starting to think that even though I haven't set the list to null to be collected later, C# or Windows or something is actually taking my list away to make room for the EDR records because I haven't used the list in a while. The database that I have to use to read EDR records is a linked server, so it takes about 10 minutes to read them all in, so my list is actually idle for 10 minutes, but it's never set to null.
Any ideas?
//splitting list and passing in values to threads.
List<ADRecord> adRecords = GetAllADRecords();
for (int i = 0; i < adRecords.Count/4; i++)
{
firstQuarter.Add(adRecords[i]);
}
for (int i = adRecords.Count/4; i < adRecords.Count/2; i++)
{
secondQuarter.Add(adRecords[i]);
}
for (int i = adRecords.Count/2; i < (adRecords.Count/4)*3; i++)
{
thirdQuarter.Add(adRecords[i]);
}
for (int i = (adRecords.Count/4)*3; i < adRecords.Count; i++)
{
fourthQuarter.Add(adRecords[i]);
}
DataTable edrRecordsTable = GetAllEDRRecords();
DataRow[] edrRecords = edrRecordsTable.Select("Email_Address is not null and Email_Address <> ''", "Email_Address");
Dictionary<string, int> letterPlaces = FindLetterPlaces(edrRecords);
Thread one = new Thread(delegate() { ProcessMatches(firstQuarter, edrRecords, letterPlaces); });
Thread two = new Thread(delegate() { ProcessMatches(secondQuarter, edrRecords, letterPlaces); });
Thread three = new Thread(delegate() { ProcessMatches(thirdQuarter, edrRecords, letterPlaces); });
Thread four = new Thread(delegate() { ProcessMatches(fourthQuarter, edrRecords, letterPlaces); });
one.Start();
two.Start();
three.Start();
four.Start();
In ProcessMatches, there is a foreach on the List of ADRecords passed in. The first line in the foreach is AdRecordsProcessed++; which is a global static int, and the program finishes with it at 130 instead of the 544k.
The variable is never set to null and is still in scope? If so, it shouldn't be collected and idle time isn't your problem.
First issue I see is:
AdRecordsProcessed++;
Are you locking that global variable before updating it? If not, and depending on how fast the records are processed, it's going to be lower than you expect.
Try running it from a single thread (i.e. pass in adRecords instead of firstQuarter and don't start the other threads.) Does it work as expected with 1 thread?
Firstly, you don't set a list to null. What you might do is set every reference to a list to null (or to another list), or all such references might simply fall out of scope. This may seem like a nitpick point, but if you are having to examine what is happening to your data it's time to be nitpicky on such things.
Secondly, getting the GC to deallocate something that has a live reference is pretty hard to do. You can fake it with a WeakReference<> or think you've found it when you hit a bug in a finaliser (because the reference isn't actually live, and even then its a matter of the finaliser trying to deal with a finalised rather than deallocated object). Bugs can happen everywhere, but that you've found a way to make the GC deallocate something that is live is highly unlikely.
The GC will be likely do two things with your list:
It is quite likely to compact the memory used by it, which will move its component items around.
It is quite likely to promote it to a higher generation.
Neither of these are going to have any changes you will detect unless you actually look for them (obviously you'll notice a change in generation if you keep calling GetGeneration(), but aside from that you aren't really going to).
The memory used could also be paged out, but it will be paged back in when you go to use the objects. Again, no effect you will notice.
Finally, if the GC did deallocate something, you wouldn't have a reduced number of items, you'd have a crash, because if objects just got deallocated the system will still try to use the supposedly live references to them.
So, while the GC or the OS may do something to make room for your other object, it isn't something observable in code, and it does not stop the object from being available and in the same programmatic state.
Something else is the problem.
Is there a reason you have to get all the data all at once? If you break the data up into chunks it should be more manageable. All I know is having to get into GC stuff is a little smelly. Best to look at refactoring your code.
The garbage collector will not collect:
A global variable
Objects managed by static objects
A local variable
A variable referencable by any method on the call stack
So if you can reference it from your code, there is no possibility that the garbage collector collected it. No way, no how.
In order for the collector to collect it, all references to it must have gone away. And if you can see it, that's most definitely not the case.