I have a large string (e.g. 20MB).
I am now parsing this string. The problem is that strings in C# are immutable; this means that once I've created a substring and looked at it, the memory is wasted.
Because of all the processing, memory is getting clogged up with String objects that I no longer use, need or reference; but it takes the garbage collector too long to free them.
So the application runs out of memory.
I could use the poorly performing club approach, and sprinkle a few thousand calls to:
GC.Collect();
everywhere, but that's not really solving the issue.
I know StringBuilder exists for building a large string.
I know TextReader exists to read a String into a char array.
I need to somehow "reuse" a string, making it no longer immutable, so that I don't needlessly allocate gigabytes of memory when 1k will do.
If your application is dying, that's likely to be because you still have references to strings - not because the garbage collector is just failing to clean them up. I have seen it fail like that, but it's pretty unlikely. Have you used a profiler to check that you really do have a lot of strings in memory at a time?
The long and the short of it is that you can't reuse a string to store different data - it just can't be done. You can write your own equivalent if you like - but the chances of doing that efficiently and correctly are pretty slim. Now if you could give more information about what you're doing, we may be able to suggest alternative approaches that don't use so much memory.
This question is almost 10 years old. These days, please look at ReadOnlySpan<char>: create one from the string using the AsSpan() method. You can then take slices of it (with Slice() or the range operator) as spans without allocating any new strings.
Given that you cannot reuse strings in C#, I would suggest using memory-mapped files. You save the string to disk and process it through the mapped file like a stream, which gives an excellent trade-off between performance and memory consumption. You reuse the same file and the same stream, operate only on the small portion of the data that you need at that precise moment, and throw it away immediately afterwards.
Whether this fits depends strictly on your project requirements, but I think it is one of the solutions you should seriously consider: memory consumption will go down dramatically, though you will "pay" something in terms of performance.
Do you have some sample code to test whether possible solutions would work well?
In general though, any object of 85,000 bytes or more is allocated on the Large Object Heap, which will probably be garbage collected less often.
Also, if you're really pushing the CPU hard, the garbage collector will likely perform its work less often, trying to stay out of your way.
I have some code that deals with a lot of array copying. Basically my class is a collection that uses arrays as backing fields, and since I don't want to run the risk of anyone modifying an existing collection, most operations involve creating a copy of the collection before modifying it, hence also copying the backing arrays.
I have noticed that the copying can be slow sometimes, within acceptable limits but I am worried that it might be a problem when the application is scaled up and starts using more data.
Some performance analysis testing suggests that while barely consuming CPU resources at all, my array copy code spends a lot of time blocked. There are few contentions, but a lot of time blocked. Since the testing application is single threaded, I assume there is some GC contention magic going on. I'm not confident enough in how the GC works in these scenarios, so I'm asking here.
My question - is there a way to create new arrays that reduces the strain on the GC? Or is there some other way I can speed this up (simplified for testing and readability purposes):
public MyCollection(MyCollection copyFrom)
{
_items = new KeyValuePair<T, double>[copyFrom._items.Length]; //this line is reported to have a lot of contention time
Array.Copy(copyFrom._items, _items, copyFrom._items.Length);
_numItems = copyFrom._numItems;
}
Not so sure what's going on here, but contention is a threading problem, not an array copying problem. And yes, a concurrency analyzer is liable to point at a new statement since memory allocation requires acquiring a lock that protects the heap.
That lock is held for a very short time when allocations come from the gen #0 heap. So having threads fighting over the lock and losing a great deal of time being locked out is a very unlikely mishap. It is not so fast when the allocation comes from the Large Object Heap. Happens when the allocation is 85,000 bytes or more. But then a thread would of course be pretty busy with copying the array elements as well.
Do watch out for what the tool tells you, a very large number of total contentions does not automatically mean you have a problem. It only gets ugly when threads end up getting blocked for a substantial amount of time. If that is a real problem then you next need to look at how much time is spent on garbage collection. There is a basic perf counter for that, you can see it in Perfmon.exe. Category ".NET CLR Memory", counter "% Time in GC", instance = yourapp. Which is liable to be high, considering the amount of copying you do. A knob you can tweak if that is the real problem is to enable server GC.
There's a concept called a persistent immutable data structure. It is one possible solution: it basically lets you "modify" immutable objects by creating new ones, in a memory-efficient way.
For example,
Roslyn has a SyntaxTree object that is immutable. You can modify the immutable object and get back a new, modified immutable object. Note that the "modified immutable object" may need very little new memory, because it can build on (share structure with) the "first immutable object".
The same concept is also used in the Visual Studio text editor itself. The TextBuffer is an immutable object, and each time you press a key a new immutable TextBuffer is created; however, the new one does not copy the whole buffer (as that would be slow).
Also, if it's true that you're facing the issues with LOH, it can help sometimes when you allocate the big memory block yourself, and use that as "reusable" memory pool, thus avoiding GC completely. It's worth considering.
No. You can wait for the new runtime in 2015, though, which will use SIMD instructions for the Array.Copy operation. That will be quite a lot faster; the current implementation is very sub-optimal.
In the end, the trick is to avoid memory operations, which sometimes just is not possible.
I'm converting a C# project to C++ and have a question about deleting objects after use. In C# the GC of course takes care of deleting objects, but in C++ it has to be done explicitly using the delete keyword.
My question is, is it ok to just follow each object's usage throughout a method and then delete it as soon as it goes out of scope (ie method end/re-assignment)?
I know though that the GC waits for a certain size of garbage (~1MB) before deleting; does it do this because there is an overhead when using delete?
As this is a game I am creating, there will potentially be lots of objects being created and deleted every second, so would it be better to keep track of pointers that go out of scope, and once that size reaches 1MB to then delete the pointers?
(as a side note: later when the game is optimised, objects will be loaded once at startup so there is not much to delete during gameplay)
Your problem is that you are using pointers in C++.
This is a fundamental problem that you must fix, then all your problems go away. As chance would have it, I got so fed up with this general trend that I created a set of presentation slides on this issue (CC BY, so feel free to use them).
Have a look at the slides. While they are certainly not entirely serious, the fundamental message is still true: Don’t use pointers. But more accurately, the message should read: Don’t use delete.
In your particular situation you might find yourself with a lot of long-lived small objects. This is indeed a situation which a modern GC handles quite well, and which reference-counting smart pointers (shared_ptr) handle less efficiently. If (and only if!) this becomes a performance problem, consider switching to a small object allocator library.
You should be using RAII as much as possible in C++ so that you never have to explicitly delete anything.
Once you use RAII through smart pointers and your own resource-managing classes, every dynamic allocation you make will exist only as long as there are references to it. You do not have to manage any resources explicitly.
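To make the RAII point concrete, here is a minimal sketch. It assumes a C++11 compiler and uses std::unique_ptr; the Bullet type, its live counter and fire_once are hypothetical names invented for illustration, not something from the question.

```cpp
#include <cassert>
#include <memory>

// Hypothetical game object; counts live instances so the cleanup is observable.
struct Bullet {
    static int live;
    Bullet() { ++live; }
    ~Bullet() { --live; }
};
int Bullet::live = 0;

// Returns the number of live Bullets seen while the smart pointer is in scope.
int fire_once() {
    std::unique_ptr<Bullet> b(new Bullet());  // the smart pointer owns the allocation
    return Bullet::live;                        // the Bullet is still alive here
}   // b goes out of scope: its destructor deletes the Bullet, no explicit delete
```

After fire_once() returns, Bullet::live is back to zero even though delete never appears in the code. The same holds if the function exits early or throws.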
Memory management in C# and C++ is completely different. You shouldn't try to mimic the behavior of .NET's GC in C++. In .NET allocating memory is super fast (basically moving a pointer) whereas freeing it is the heavy task. In C++ allocating memory isn't that lightweight for several reasons, mainly because a large enough chunk of memory has to be found. When memory chunks of different sizes are allocated and freed many times during the execution of the program the heap can get fragmented, containing many small "holes" of free memory. In .NET this won't happen because the GC will compact the heap. Freeing memory in C++ is quite fast, though.
Best practices in .NET don't necessarily work in C++. For example, pooling and reusing objects in .NET isn't recommended most of the time, because the objects get promoted to higher generations by the GC. The GC works best for short lived objects. On the other hand, pooling objects in C++ can be very useful to avoid heap fragmentation. Also, allocating a larger chunk of memory and using placement new can work great for many smaller objects that need to be allocated and freed frequently, as it can occur in games. Read up on general memory management techniques in C++ such as RAII or placement new.
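The pooling idea with placement new could look roughly like the sketch below. It is only an illustration under stated assumptions: Particle and ParticlePool are hypothetical names, the pool has a fixed capacity, it is not thread safe, and the caller is expected to destroy every object before the pool itself goes away.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical small game object that is created and destroyed frequently.
struct Particle {
    float x, y;
    Particle(float x_, float y_) : x(x_), y(y_) {}
};

// Fixed-capacity pool: one up-front allocation, slots recycled through a free list.
class ParticlePool {
public:
    explicit ParticlePool(std::size_t capacity)
        : storage_(static_cast<unsigned char*>(
              ::operator new(capacity * sizeof(Particle)))) {
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(storage_ + i * sizeof(Particle));
    }
    ~ParticlePool() { ::operator delete(storage_); }

    ParticlePool(const ParticlePool&) = delete;
    ParticlePool& operator=(const ParticlePool&) = delete;

    // Constructs a Particle in a free slot; returns nullptr when the pool is full.
    Particle* create(float x, float y) {
        if (free_.empty()) return nullptr;
        void* slot = free_.back();
        free_.pop_back();
        return new (slot) Particle(x, y);  // placement new: no heap allocation
    }

    // Runs the destructor and hands the slot back to the free list.
    void destroy(Particle* p) {
        p->~Particle();
        free_.push_back(reinterpret_cast<unsigned char*>(p));
    }

    std::size_t available() const { return free_.size(); }

private:
    unsigned char* storage_;            // raw block obtained once in the constructor
    std::vector<unsigned char*> free_;  // addresses of currently unused slots
};
```

Because every slot has the same size and comes from one block, creating and destroying particles during gameplay never touches the general-purpose heap, so it cannot fragment it.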
Also, I'd recommend getting the books "Effective C++" and "More effective C++".
Well, the simplest solution might be to just use garbage collection in C++. The Boehm collector works well, for example. Still, there are pros and cons (but porting code originally written in C# would be a likely candidate for a case where the pros largely outweigh the cons).
Otherwise, if you convert the code to idiomatic C++, there shouldn't be that many dynamically allocated objects to worry about. Unlike C#, C++ has value semantics by default, and most of your short-lived objects should simply be local variables, possibly copied if they are returned, but not allocated dynamically. In C++, dynamic allocation is normally only used for entity objects, whose lifetime depends on external events; e.g. a Monster is created at some random time, with a probability depending on the game state, and is deleted at some later time, in reaction to events which change the game state. In this case, you delete the object when the monster ceases to be part of the game. In C#, you probably have a dispose function, or something similar, for such objects, since they typically have concrete actions which must be carried out when they cease to exist, such as deregistering as an Observer, if that's one of the patterns you're using. In C++, this sort of thing is typically handled by the destructor, and instead of calling dispose, you delete the object.
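A small sketch of both halves of that advice, with all names (Vec2, midpoint, World, Monster) made up for illustration: a short-lived value that is just a local variable, and an entity object whose destructor does the work a C# Dispose method would do. The World struct is a stand-in for whatever Observer registry the game actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Short-lived value type: lives on the stack and is copied on return.
struct Vec2 { float x, y; };

Vec2 midpoint(Vec2 a, Vec2 b) {
    Vec2 m = { (a.x + b.x) / 2, (a.y + b.y) / 2 };  // local variable, no new/delete
    return m;
}

class Monster;

// Hypothetical registry standing in for the game's Observer list.
struct World {
    std::vector<Monster*> monsters;
};

// Entity object: its lifetime depends on game events, so it is allocated dynamically.
class Monster {
public:
    explicit Monster(World& w) : world_(w) { world_.monsters.push_back(this); }

    // The destructor plays the role of Dispose: it deregisters the monster.
    ~Monster() {
        std::vector<Monster*>& v = world_.monsters;
        v.erase(std::remove(v.begin(), v.end(), this), v.end());
    }

    Monster(const Monster&) = delete;
    Monster& operator=(const Monster&) = delete;

private:
    World& world_;
};
```

Deleting a Monster when it leaves the game is then enough to remove it from the World; nothing else has to remember to call a separate cleanup function.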
Substituting a shared_ptr for every reference you use in C# would get you the closest approximation, probably with the lowest effort, when converting the code.
However, you specifically mention following an object's use through a method and deleting it at the end. A better approach is not to new up the object at all, but simply to instantiate it inline, on the stack. In fact, with the move semantics introduced in C++11, this is also an efficient way to deal with returned objects, so there is no need to use pointers in almost every scenario.
There are a lot more things to take into considerations when deallocating objects than just calling delete whenever it goes out of scope. You have to make sure that you only call delete once and only call it once all pointers to that object have gone out of scope. The garbage collector in .NET handles all of that for you.
The construct that corresponds most closely to that in C++ is tr1::shared_ptr<> (std::shared_ptr<> since C++11), which keeps a reference counter for the object and deallocates it when the counter drops to zero. A first approach to get things running would be to turn all C# references into C++ tr1::shared_ptr<>. Then you can go into those places where it is a performance bottleneck (only after you've verified with a profiler that it is an actual bottleneck) and change to more efficient memory handling.
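As a rough illustration of the reference counting described above, here is a sketch that uses std::shared_ptr, the C++11 standard version of tr1::shared_ptr. Texture, Sprite and shared_count are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <memory>

// Hypothetical shared resource; counts live instances so deallocation is observable.
struct Texture {
    static int live;
    Texture() { ++live; }
    ~Texture() { --live; }
};
int Texture::live = 0;

// Plays the role of a C# class that holds a reference to another object.
struct Sprite {
    std::shared_ptr<Texture> texture;
};

// Returns the reference count observed while two sprites share one texture.
long shared_count() {
    Sprite a;
    a.texture = std::make_shared<Texture>();
    Sprite b = a;                  // copies the pointer and increments the count
    return a.texture.use_count();
}   // both sprites are destroyed here; the count drops to zero and the Texture is freed
```

Unlike a tracing GC, this cannot reclaim objects that point at each other in a cycle; std::weak_ptr is the usual way to break such cycles.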
Garbage collection in C++ has been discussed a lot on SO.
Try reading through this:
Garbage Collection in C++
I have a program that processes high volumes of data, and can cache much of it for reuse with subsequent records in memory. The more I cache, the faster it works. But if I cache too much, boom, start over, and that takes a lot longer!
I haven't been too successful trying to do anything after the exception occurs - I can't get enough memory to do anything.
Also I've tried allocating a huge object, then de-allocating it right away, with inconsistent results. Maybe I'm doing something wrong?
Anyway, what I'm stuck with is just setting a hardcoded limit on the # of cached objects that, from experience, seems to be low enough. Any better Ideas? thanks.
edit after answer
The following code seems to be doing exactly what I want:
Do
    Dim memFailPoint As MemoryFailPoint = Nothing
    Try
        ' mysize is the size in MB of the several objects I'm about to add to the cache
        memFailPoint = New MemoryFailPoint(mysize)
        memFailPoint.Dispose()
    Catch ex As InsufficientMemoryException
        ' dump the oldest items here
    End Try
    ' do work
Loop
I need to test if it is slowing things down in this arrangement or not, but I can see the yellow line in Task Manager looking like a very healthy sawtooth pattern with a consistent top - yay!!
You can use MemoryFailPoint to check for available memory before allocating.
You may need to think about your release strategy for the cached objects. There is no possible way you can hold all of them forever so you need to come up with an expiration timeframe and have older cached objects removed from memory. It should be possible to find out how much memory is left and use that as part of your strategy but one thing is certain, old objects must go.
If you implement your cache with WeakReferences (http://msdn.microsoft.com/en-us/library/system.weakreference.aspx), the cached objects remain eligible for garbage collection in situations where you might otherwise throw an OutOfMemory exception.
This is an alternative to a fixed-size cache, but it can be overly aggressive in clearing out the cache when a GC does occur.
You might consider taking a hybrid approach, where there is a (tunable) fixed number of non-weak references in the cache but you let it grow further with weak references. Or this may be overkill.
There are a number of metrics you can use to keep track of how much memory your process is using:
GC.GetTotalMemory
Environment.WorkingSet (This one isn't useful, my bad)
The native GlobalMemoryStatusEx function
There are also various properties on the Process class
The trouble is that there isn't really a reliable way of telling from these values alone whether or not a given memory allocation will fail. Although there may be sufficient space in the address space for the allocation, memory fragmentation means that the space may not be contiguous, and so the allocation may still fail.
You can however use these values as an indication of how much memory the process is using and therefore whether or not you should think about removing objects from your cache.
Update: It's also important to make sure that you understand the distinction between virtual memory and physical memory. Unless your page file is disabled (very unlikely), the OutOfMemoryException will be caused by a lack or fragmentation of virtual address space.
If you're only using managed resources you can use the GC.GetTotalMemory method and compare the results with the maximum allowed memory for a process on your architecture.
A more advanced solution (I think this is how SQL Server manages to actually adapt to the available memory) is to use the CLR Hosting APIs:
"the interface allows the CLR to inform the host of the consequences of failing a particular allocation"
which will mean actually removing some objects from the cache and trying again.
Anyway, I think this is probably overkill for almost all applications unless you really need amazing performance.
The simple answer... By knowing what your memory limit is.
The closer you are to reaching that limit, the more you ARE ABOUT to get an OutOfMemoryException.
The more elaborate answer... Unless you write a mechanism yourself to do that kind of thing, programming languages/systems do not work that way; as far as I know they cannot inform you in advance that you are about to exceed limits, BUT they gladly inform you when the problem has occurred, and that usually happens through exceptions which you are supposed to write code to handle.
Memory is a resource that you can use; it has limits and it also has some conventions and rules for you to follow to make good use of that resource.
I believe what you are doing of setting a good limit, hard coded or configurable, seems to be your best bet.
I have an out of memory exception using C# when reading in a massive file
I need to change the code, but for the time being can I increase the heap size (like I would in Java) as a short-term fix?
.Net does that automatically.
Looks like you have reached the limit of the memory one .Net process can use for its objects (on a 32-bit machine this is 2GB by default, or 3GB with the /3GB boot switch. Credits to Leppie & Eric Lippert for the info).
Rethink your algorithm, or perhaps a change to a 64 bit machine might help.
No, this is not possible. This problem might occur because you're running on a 32-bit OS and memory is too fragmented. Try not to load the whole file into memory (for instance, by processing line by line) or, when you really need to load it completely, by loading it in multiple, smaller parts.
No, you can't. See my answer here: Is there any way to pre-allocate the heap in the .NET runtime, like -Xmx/-Xms in Java?
For reading large files it is usually preferable to stream them from disk, reading them in chunks and dealing with them a piece at a time instead of loading the whole thing up front.
As others have already pointed out, this is not possible. The .NET runtime handles heap allocations on behalf of the application.
In my experience .NET applications commonly suffer from OOM when there should be plenty of memory available (or at least, so it appears). The reason for this is usually the use of huge collections such as arrays, List (which uses an array to store its data) or similar.
The problem is that these types will sometimes create peaks in memory use. If these peak requests cannot be honored, an OOM exception is thrown. E.g. when List needs to increase its capacity, it does so by allocating a new array of double the current size and then copying all the references/values from one array to the other. Similarly, operations such as ToArray make a new copy of the array. I've also seen similar problems with big LINQ operations.
Each array is stored as contiguous memory, so to avoid OOM the runtime must be able to obtain one big chunk of memory. As the address space of the process may be fragmented due to both DLL loading and general use for the heap, this is not always possible in which case an OOM exception is thrown.
What sort of file are you dealing with ?
You might be better off using a StreamReader and yield returning the ReadLine result, if it's textual.
Sure, you'll be keeping a file-pointer around, but the worst case scenario is massively reduced.
There are similar methods for binary files. If you're uploading a file to SQL, for example, you can read a byte[] and use the Sql Pointer mechanics to write the buffer to the end of a blob.
I have 10 threads writing thousands of small buffers (16-30 bytes each) to a huge file in random positions. Some of the threads throw OutOfMemoryException on the FileStream.Write() operation.
What is causing the OutOfMemoryException ? What to look for?
I'm using the FileStream like this (for every written item - this code runs from 10 different threads):
using (FileStream fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite, BigBufferSizeInBytes, FileOptions.SequentialScan))
{
...
fs.Write();
}
I suspect that all the buffers allocated inside the FileStream don't get released in time by the GC. What I don't understand is why the CLR, instead of throwing, doesn't just run a GC cycle and free up all the unused buffers?
If ten threads are opening files as your code shows, then you have a maximum of ten undisposed FileStream objects at any one time. Yes, FileStream does have an internal buffer, the size of which you specify with "BigBufferSizeInBytes" in your code. Could you please disclose the exact value? If this is big enough (e.g. ~100MB) then it could well be the source of the problem.
By default (i.e. when you don't specify a number upon construction), this buffer is 4kB, which is fine for most applications. In general, if you really care about disk write performance, you might increase it to a few hundred kB, but not more.
However, for your specific application doing so wouldn't make much sense, as said buffer will never contain more than the 16-30 bytes you write into it before you Dispose() the FileStream object.
To answer your question, an OutOfMemoryException is thrown only when the requested memory can't be allocated after a GC has run. Again, if the buffer is really big then the system could have plenty of memory left, just not in one contiguous chunk. This is because the large object heap is not compacted by default.
I've reminded people about this one a few times, but the Large Object Heap can throw that exception fairly subtly, when you seemingly have plenty of available memory or the application is running OK.
I've run into this issue fairly frequently when doing almost exactly what you're describing here.
You need to post more of your code to answer this question properly. However, I'm guessing it could also be related to a potential Halloween problem (Spooky Dooky).
The buffer you are reading from may also be the problem (again Large Object Heap related). Again, you need to put up more details about what's going on in the loop. I've just nailed the last bug I had, which is virtually identical (I am performing many parallel hash updates which all require independent state to be maintained across reads of the input file)...
Oops! Just scrolled over and noticed "BigBufferSizeInBytes", I'm leaning towards the Large Object Heap again...
If I were you (and this is exceedingly difficult due to the lack of context), I would provide a small dispatch "mbuf" that you copy in and out of, instead of allowing all of your disparate threads to individually read across your large backing array (i.e. it's hard not to cause incidental allocations with very subtle code syntax).
Buffers aren't generally allocated inside the FileStream. Perhaps the problem is the line "writing thousands of small buffers" - do you really mean that? Normally you re-use a buffer many, many, many times (i.e. on different calls to Read/Write).
Also - is this a single file? A single FileStream is not guaranteed to be thread safe... so if you aren't doing synchronization, expect chaos.
It's possible that these limitations arise from the underlying OS, and that the .NET Framework is powerless to overcome these kinds of limitations.
What I cannot deduce from your code sample is whether you open up a lot of these FileStream objects at the same time, or open them really fast in sequence. Your use of the 'using' keyword will make sure that the files are closed after the fs.Write() call. There's no GC cycle required to close the file.
The FileStream class is really geared towards sequential read/write access to files. If you need to quickly write to random locations in a big file, you might want to take a look at using virtual file mapping.
Update: It seems that virtual file mapping will not be officially supported in .NET until 4.0. You may want to take a look at third party implementations for this functionality.
Dave
I'm experiencing something similar and wondered if you ever pinned down the root of your problem?
My code does quite a lot of copying between files, passing quite a few megs between different byte files. I've noticed that whilst the process memory usage stays within a reasonable range, the system memory allocation shoots up way too high during the copying - much more than is being used by my process.
I've tracked the issue down to the FileStream.Write() call - when this line is taken out, memory usage seems to go as expected. My BigBufferSizeInBytes is the default (4k), and I can't see anywhere where these could be collecting...
Anything you discovered whilst looking at your problem would be gratefully received!